It’s getting harder and harder to come up with relevant subtitles for this particular topic. I should probably just start using numbers.

Totally nailed this one though.

To the topic at hand though: after my last post, you might be forgiven for thinking that we had our AWS Lambda ELB Logs Processor well under control. It was processing files happily all the way through and everything seemed to be arriving in our log stack in an acceptable timeframe.

To give a complete picture, you can see the source for the Lambda function here.

Inside the source, the #{something} notation is an Octopus Deploy variable substitution. Basically, when this code is deployed into AWS using Octopus Deploy, those strings are replaced with real values appropriate for the environment being deployed to.

It’s long enough that I didn’t just want to paste it into this post, but still short enough to fit comfortably in a single file. It streams a file from S3 line by line, parses each line into its constituent parts and then posts each line to a logstash TCP endpoint. It has some basic (homegrown) connection pooling and features some basic logging.
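As a rough sketch of that overall shape (this is not the actual source, which is linked above, and parseElbLogLine/postToLogstash are placeholder names rather than the real helpers):

// Sketch only: stream the S3 object, read it line by line, parse and forward each line.
const AWS = require('aws-sdk');
const readline = require('readline');

const s3 = new AWS.S3();

exports.handler = function(event, context, callback) {
    const record = event.Records[0].s3;
    const stream = s3.getObject({
        Bucket: record.bucket.name,
        Key: record.object.key
    }).createReadStream();

    const reader = readline.createInterface({ input: stream });

    reader.on('line', function(line) {
        const entry = parseElbLogLine(line); // split the line into its constituent parts
        postToLogstash(entry);               // send the parsed entry to the logstash TCP endpoint
    });

    reader.on('close', function() {
        callback(null, 'All lines read');
    });
};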

Unfortunately, the logging turned out to be pretty painful to use in production. This was primarily because there were multiple places inside the function that write log messages for each line encountered. In our sample data this is fine, because there are tens of lines. In production, we get over 60000 lines every run of the function, which means the CloudWatch logs are basically unusable. Not only that, but the information that comes out in the logs is pretty useless at scale, stating things like “starting posting to logstash”, “processing file”, “posted to logstash” and so on.

The other issue was the connection pooling. It seemed to be working (in that it reduced the pace of the function such that it no longer ran out of TCP connections), but I wasn’t entirely sure it was doing what we thought it was. To be honest, the code was pretty hard to understand, mostly as a result of the use of asynchronous callbacks. To me, it seemed like a connection pool should be something that other people have needed, so surely there was a package available that would meet our needs.

Taking these two points into consideration, I set about improving the function by:

  1. Finding and using a library for the connection pool, hopefully making it easier to read and reason about.
  2. Making the logging output actually useful when the function was executed in production.

Car Pool

Looking around NPM, there are a few packages available that enable pooling of various resources. As seems to be the pattern though, they are of varying quality and completeness. In the end I settled on a package called “generic-pool” (Github here), which seemed pretty good. A major point in its favour was that it had a version greater than 0, which seems rare for an NPM package.

Unfortunately for me, it was promise based, and I had no idea what promises were or how to use them. I am, however, familiar with the Task Parallel Library in C#, which seems to be similar conceptually, so I didn’t have to learn a whole bunch of concepts entirely from scratch.

Using the library to set up a pool of TCP connections was relatively straightforward, as the snippet below shows.

// net is a Node builtin; gPool is the "generic-pool" package mentioned above.
// config and summary are defined elsewhere in the function source.
const net = require('net');
const gPool = require('generic-pool');

const poolFactory = {
    create: function() {
        return new Promise(function(resolve, reject) {
            // Open a new TCP connection to the logstash endpoint and record its creation.
            const socket = net.createConnection(config.logstashPort, config.logstashHost);
            summary.connections.created += 1;
            resolve(socket);
        });
    },
    destroy: function(socket) {
        return new Promise(function(resolve, reject) {
            // Close the connection cleanly and record its destruction.
            socket.end();
            socket.destroy();
            summary.connections.destroyed += 1;
            resolve();
        });
    },
    validate: function(socket) {
        return new Promise(function(resolve, reject) {
            // Called before a pooled connection is lent out (testOnBorrow below).
            resolve(!socket.destroyed);
        });
    }
};

const poolOptions = {
    max: config.connectionCountLimit,                        // upper bound on open connections
    min: 0,
    acquireTimeoutMillis: config.connectionWaitMilliseconds,
    testOnBorrow: true                                       // run validate before handing out a connection
};

const pool = gPool.createPool(poolFactory, poolOptions);

With a pool in place, all that was left to do was to incorporate it into the actual line parsing/processing engine, which meant a shift away from asynchronous callbacks to promises.

summary.lines.parsed += 1;

// post is assumed to resolve with the socket it was given, so that the connection
// can be released back to the pool once the line has been sent successfully.
var promise = pool.acquire()
    .then((socket) => { return post(socket, entry); })
    .then((socket) => { pool.release(socket); })
    .catch((error) => {
        // Keep a running failure count, plus a rolling window of the last few errors for the summary.
        summary.failures.sending.total += 1;
        if (summary.failures.sending.lastFew.length >= 5) {
            summary.failures.sending.lastFew.shift();
        }
        summary.failures.sending.lastFew.push(error);
    });

promises.push(promise);

The promises collection/array is used to determine when the function is complete, which is the opportunity to do anything that needs to happen right at the end.

Like logging.

Log Ride

Speaking of logging, the first improvement was to remove all instances where the function wrote a message for each line. Easy. Now at least the production logs wouldn’t be an unreadable mess.

The next step was to add some metrics to the logging, so we could track what the function was doing. This needed to include some statistics like lines read, lines parsed and lines sent to logstash.

This wasn’t so bad (simply introduce a function scoped variable for the metrics, and then increment the various counters whenever the specified event occurred), but it did start to cause issues when I tried to refactor the function to break it down into smaller, more easily reasoned about components. I couldn’t easily move any of the inner functions around because then they wouldn’t have the metrics object in scope. I think I might have been able to solve this problem by adding the metrics object in as a parameter to each of the functions that needed to edit it, but this would have made using those functions in the various asynchronous callbacks much harder.
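For reference, the function scoped metrics object ended up shaped like the summary that gets logged at the end of a run (a fully populated example appears in the output towards the end of this post):

// Function-scoped metrics; the various counters are incremented as events occur
// and the whole object is logged once at the end of the run.
var summary = {
    lines: { encountered: 0, parsed: 0, sent: 0 },
    connections: { created: 0, destroyed: 0 },
    failures: {
        parsing: { total: 0, lastFew: [] },
        sending: { total: 0, lastFew: [] }
    }
};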

With the line based metrics sorted, it was a relatively easy matter to add metrics for connection creation and destruction via the pool construct outlined above, so I don’t need to go into too much detail about that.

Finally, I also wanted to include some information about errors that occur in the process, making sure that if every single line failed for some reason, the CloudWatch logs would still remain readable. The easiest way to do this was to accumulate a small queue of the most recent errors, pushing older ones out in favour of newer ones, with the understanding that the last 5 errors are probably enough information to diagnose issues. You can see the code for doing this in the promise catch handler above.

The only thing left to do was to actually log the metrics object that contained all of this juicy information, and the only place where this was possible was in the handler for when the readline module had finished reading the entire file.

function handleReaderClose() {
    console.log('File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve');
    Promise
        .all(promises)
        .then(() => { console.log("Cleaning up the connection pool, which has [%s/%s] (current/max) connections", pool.size, pool.max); return pool.drain(); })
        .then(() => pool.clear())
        .then(() => { 
            console.log("All processing complete. Summary follows"); 
            console.log("%s", JSON.stringify(summary, fixStringifyError, 4)); 
        });
}

There are a few other things in there that I’m not going into too much detail about, like cleaning up the pool and a custom function to stringify errors, but the important part for this discussion is the logging.
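The error stringification deserves a quick aside: plain JSON.stringify drops the message and stack of Error objects because those properties are not enumerable, so the custom function is a replacer that expands them. The actual implementation isn’t shown here, but it presumably looks something like this:

// Sketch of an error-friendly replacer for JSON.stringify; the real fixStringifyError
// in the function source may differ in detail.
function fixStringifyError(key, value) {
    if (value instanceof Error) {
        return { name: value.name, message: value.message, stack: value.stack };
    }
    return value;
}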

Conclusion

After doing the work above, I’m a bit happier with the current state of the AWS Lambda ELB Logs Processor. I feel like it’s better structured and easier to reason about, mostly due to the introduction of promises instead of callbacks. In addition to the quality of the code itself, the improved log output makes it easier to see how the function is going on a day-to-day basis as it chews through thousands and thousands of lines every run.

In particular, the summary that gets output right at the end is extremely useful for getting a high level picture of the output of a single run without overloading the reader.

An example of the sort of output we see from a normal (successful) run is below:

START RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769 Version: $LATEST
2017-03-09T23:25:59.785Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Retrieving ELK log file from S3 bucket/key specified in the initiating event. Bucket: [REDACTED], Key: [REDACTED]
2017-03-09T23:27:31.344Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    File reader for ELB log file is closing because all lines have been read. Waiting for all promises (for sending parsed lines to logstash) to resolve
2017-03-09T23:27:34.924Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    Cleaning up the connection pool, which has [48/100] (current/max) connections
2017-03-09T23:27:34.967Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769    All processing complete. Summary follows
2017-03-09T23:27:35.003Z    c0e94f4d-051f-11e7-8fbb-2b8356f39769
{
    "lines": {
        "encountered": 76464,
        "parsed": 76464,
        "sent": 76464
    },
    "connections": {
        "created": 48,
        "destroyed": 48
    },
    "failures": {
        "parsing": {
            "total": 0,
            "lastFew": []
        },
        "sending": {
            "total": 0,
            "lastFew": []
        }
    }
}
END RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769
REPORT RequestId: c0e94f4d-051f-11e7-8fbb-2b8356f39769    Duration: 95227.56 ms    Billed Duration: 95300 ms Memory Size: 192 MB    Max Memory Used: 165 MB

There are still many improvements to be made (especially regarding the structure of the repo and its complete lack of automated tests), but it’s definitely more betterer than it was.

And that marks the end of the series on our synchronization process. Surprising absolutely no-one, it turns out that building such a process is quite challenging, with lots of different edge cases that need to be taken into account to ensure quality at all of the various levels.

To summarise, at the end we’re left with the following algorithm:

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

Algorithms are great, and form the basis of pretty much everything we do as software developers, but you can’t really deploy one directly. At least not one written mostly as plain text, like the one above.

So how do we actually put it into practice?

Software, Hardware. What About Just-Right-Ware?

Apart from a little bit of information in the first post of the series, I’ve mostly been writing about the synchronization process conceptually. To finish everything up, I’m going to explain a little bit about how we actually put it into practice and the pieces of software in play, just to make it a little bit more concrete.

Implementation-wise, there are three relevant components.

  • A piece of desktop software, installed on the client side (C#)
  • An API (C# using Nancy)
  • A database (multi tenant, AWS RDS w. PostgreSQL)

Using the desktop software, clients register a database for “cloud functionality”, agreeing for their information to be synchronized for use in other applications. This registration process gives the database a unique identifier, and if we combine this with their client id, we can safely segregate databases from one another. Once one or more databases are registered, the desktop software executes approximately every minute, performing a single run of the synchronization algorithm as specified above, using the API to discover information about the remote side of the equation.

The API itself is relatively simple, and is dedicated to facilitating the synchronization process, mainly acting as an application layer on top of the database. It is primarily a REST API (where the entities are customers, databases and tables) and also features some basic authentication and authorization (using the client id, database id and some other user specific information).

At a high level, it features endpoints like this:

GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
POST /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
DELETE /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}/manifest

It’s got some other endpoints as well, but the endpoints above are the most relevant to the synchronization process (the other endpoints are for things like health checks, uptime checks and administrative functionality).

To elaborate:

  • The first GET endpoint returns the versioning information for the table (i.e. the count, max version, max modified date), which is the primary input for the synchronization process (when compared to the same information locally); a hypothetical sketch of these response shapes follows the list.
  • The POST endpoint on table name allows for inserting/uploading data, supplied as a set of rows appropriate for the table in question.
  • The DELETE endpoint on table name unsurprisingly allows for the deletion of data by supplying a set of keys to delete, but also allows for delete operations based on range (i.e. everything > version X) or deleting everything all at once.
  • Finally, the GET endpoint on table/manifest allows for the retrieval of a manifest describing a section of the table, which is used for the differencing check.
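To make that a little more concrete, the snippet below shows hypothetical response shapes only; the field names are assumptions based on the descriptions above, not the real API contract.

// Hypothetical response shapes, for illustration only.

// GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
var versionResponse = {
    rowCount: 125000,
    maxVersion: 184467,
    maxModifiedDate: "2017-03-09T23:25:59Z"
};

// GET /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}/manifest
var manifestResponse = {
    rows: [
        { id: 4, version: 245 },
        { id: 3, version: 234 }
    ]
};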

The database is a replica of the client side database, with additional columns for data segregation based on the combination of client id and database id (as mentioned above).

Working together, these three components make up the concrete implementation of the synchronization process, replicating local, on-premises data successfully to a remote location, for use in other applications and processes as necessary.

Conclusion

It’s taken me 5 weeks to describe the synchronization process, but it took us months to build it incrementally, adapting and improving as we found different situations where it didn’t quite work the way we wanted. Obviously I summarised most of that in this series of posts, but we were constantly deploying and monitoring the results of each deployment as we went through the process. The process has been running in production without major incident for a few months now, and we’ve started to expand it to include more and more tables as necessary.

Unfortunately, as we expand the range of tables included in the process, we’ve discovered some that don’t follow the most common pattern (mostly around primary keys), which means we’re going to have to adapt the process again to take that sort of situation into account. That’s software though.

To finish everything up, I have to say that all this talk about syncing has really brought a…Titanic sense of gravitas to the whole situation, don’t you think?

I’ll see myself out.

Well, now it’s next week, and it’s time to deal with the problem.

What problem?

The problem where we delete literally all of the data for a table and then synchronize it all over again whenever the maximum local version gets lower than the maximum remote version.

That problem.

If that sort of thing only happened very rarely, I probably wouldn’t care all that much. The problem is, it can occur whenever a database restore/rollback is performed (which is annoyingly frequent for our users) or when a user removes an entity they just added and that entity is subject to hard deletion rules (which happens quite a lot).

With the table annihilation occurring far more frequently than we would like, we’re looking at a few very real problems.

  • During the period where the data is re-syncing, anything using that data (third party APIs, mobile applications, websites, etc) will be out of date. Syncing a table from scratch can take a few hours (depending on table size), so it’s not a great situation to be in.
  • Repeatedly pushing the same information consumes valuable bandwidth. In Australia, where the internet is mostly run over taut pieces of string and unlimited data plans are not the norm, consuming bandwidth for no real gain is foolish.
  • Some tables with hard delete have such high churn that a constant cycle of delete and re-sync can be maintained almost indefinitely, which exacerbates the two points above.

Also, such a glaring and gross inefficiency makes me sad as an engineer, which, ultimately, is the most important reason. A happy engineer is an effective engineer after all.

Controlled Fission

Rather than going with the nuclear approach of “delete all the things”, it makes a lot more sense to act in a more controlled, surgical fashion and only remove the things that need to be removed when the version goes backwards. Identifying what needs to be removed is relatively easy: it’s just everything with a version greater than the maximum local version.

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

This is a decent solution, but without the differencing check, it has a major flaw that can lead to missing data. This was the primary reason we went with the “delete it all, let God sort it out” approach originally, as we’d rather have everything eventually be correct than risk getting ourselves into a state where the remote data is not identical to the local data, and the synchronization process thinks that it’s done.

The sequence that can lead to missing data with the above algorithm in play is not straightforward, so I’ll try to explain it using an example.

Imagine we have two sets of data, one representing the local store (left side) and the other the remote (right side). In each, the first column is the row ID and the second is the version.

Local          Remote
ID  Version    ID  Version
4   245        4   245
3   234        3   234
2   221        2   221
1   210        1   210

According to the representation above, both local and remote are in sync. Assume at this point a snapshot was made locally.

Now the user does something to update the row with ID 3, which gives it a new version.

Local          Remote
ID  Version    ID  Version
3   257        4   245
4   245        3   234
2   221        2   221
1   210        1   210

The sync process kicks in, detects the new row (because its version is higher than the remote) and sends it out, where it is in turn updated in the database by its primary key.

Local          Remote
ID  Version    ID  Version
3   257        3   257
4   245        4   245
2   221        2   221
1   210        1   210

If the user now performs a rollback/restore using the snapshot they took earlier, the data now looks like this.

Local          Remote
ID  Version    ID  Version
4   245        3   257
3   234        4   245
2   221        2   221
1   210        1   210

Finally, the algorithm above will react to this situation by removing all data from the remote where the version is greater than the local.

Local          Remote
ID  Version    ID  Version
4   245        4   245
3   234        2   221
2   221        1   210
1   210

Unless something happens to the row with ID 3 (i.e. it’s updated in some way), it will never be synced remotely, ruining everything.

The good news is that with the differencing check this situation is (eventually) fixed, because when scanning through the database it will eventually discover the missing row and upload it, allowing us to implement partial rollbacks in the interest of efficiency.

And that makes the engineer in me happy.

The Worst of the Worst

Everything is not quite puppies and roses though.

The synchronization process as described above is robust enough to handle just about any situation you can throw at it.

Except one.

What happens if the high watermark process can never upload its batch of changes, even at the minimum size? Keeping in mind, the minimum size is one.

There are a number of reasons why a batch size of one might be reached, but the most common are:

  • API unavailable for an extended period of time. This could be as a result of a local or remote issue, but once it’s available again, the batch size will increase, so we don’t really have anything to worry about here.
  • Some critical problem with uploading that particular row.

The second one is an issue that we actually ran into, where a single row contained something ridiculous like 50MB of crazy embedded data, which was more than the maximum request size we were allowing on the API, so it kept failing miserably.

As a last resort, we improved the algorithm to skip single rows if the minimum batch size has failed more than a few times. If the row is broken that badly, then we reasoned that missing data (the bad row) is preferable to not being able to sync at all. On the off chance the row is fixed, the process will pick it up and upload it back in its rightful place.

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote
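As a rough sketch of the skip-ahead bookkeeping (expressed in JavaScript for consistency with the other snippets on this blog; the actual implementation is C#, and the names and threshold here are mine, not the real code):

// Sketch only: after repeated failures at the minimum batch size, skip the
// offending row rather than stalling the whole table forever.
function decideNextAction(state, lastUploadFailed) {
    if (!lastUploadFailed) {
        state.failuresAtMinimum = 0;
        return "upload";
    }

    if (state.batchSize === state.minimumBatchSize) {
        state.failuresAtMinimum += 1;
        if (state.failuresAtMinimum >= 3) {
            state.failuresAtMinimum = 0;
            return "skip"; // move past the bad row; the differencing check will pick it up if it is ever fixed
        }
    }

    return "upload";
}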

To Be Continued

With the final flaws in the process rectified, that’s pretty much it for data synchronization. Until we discover the next flaw that is, which I’m sure will happen eventually.

I’ll make a summary post next week to draw everything together and bring the entire saga to a close.

0 Comments

Since I left last week’s post on a terrible cliffhanger, I thought that this week I’d take a detour and write some more about our AWS Lambda ELB logs processor, just to keep the suspense going for as long as possible.

I’m sure my reader will be irate.

But seriously, let’s talk about hard deletes and why they make everything more difficult.

Funnily enough, we have actually had some more issues with our AWS Lambda ELB logs processor, which I will probably have to talk about at some point. Specifically, while the connection pooling worked and stopped the execution from erroring out, the process can’t get through an entire ELB log file within its time limit (300 seconds, which is the maximum time available for the execution of a Lambda function). The files are pretty juicy, clocking in at 20ish megabytes, so either we need to optimise the Lambda function itself or we need to create ourselves a divide and conquer strategy.

At some unknown point in the past, the application at the core of this entire syncing adventure was changed to make the deletion of major entities reversible. Specifically, instead of simply removing the offending information from the database (along with any appropriate relationships), the primary entity entry was just marked as “deleted”, and then ignored for basically every interaction other than “undelete”. Being able to undo accidental (or malicious) deletes without resorting to a full database restore is pretty useful, so the soft delete pattern was applied to most of the entities in the system.

Alas, “most” is a far cry from “all”.

Some entities are just deleted the old fashioned way, gone forever unless someone does a database restore. Thinking about it, it makes sense to not retain full history for tables with high churn (like appointments), so we can’t just mandate soft deletion for everything just to make our lives easier. With that constraint in place, we need to adapt our process to take hard deletes into account.

The question then becomes: how do you find something if it’s just not there anymore?

Suspicious Holes

Our versioning based delta algorithm will not detect deletions unless they occur at the very top of a table. When something is deleted from that specific location, the version will appear to be lower locally than remotely, so the entire table will be removed and re-uploaded from scratch.

I’ll come back to this particular situation in another post, but let’s just ignore it for now, because the solution will work in those cases, even if it’s terrible.

The more interesting case is when the deletion occurs literally anywhere else in the table.

The RowVersion construct we’re relying on won’t allow us to detect if something has been removed, at least not in the same way that we’ve been using it to detect changes/additions.

What we can do, however, is use it to generate a manifest of sorts, describing everything that is in the table in a relatively efficient fashion and then use that manifest to compare the local to the remote. RowVersion is unique (barring exceptional circumstances), so we can use it as a pseudo key of sorts, allowing us to easily compare local and remote data for any table that already features versioning.

Since we’re dealing primarily with two lists of numbers, the comparison is relatively easy. Look for everything that is in the remote but not in the local and you’ll find everything that’s been deleted locally. You can then use that information to construct some commands to remove the offending data and bring everything back to harmonious balance.
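In JavaScript (for illustration only; the real implementation is C# and the function name here is mine), the comparison amounts to a set difference over the two manifests:

// Illustration only: given local and remote manifests (arrays of row versions acting
// as pseudo keys), find the rows that have been hard deleted locally.
function findLocallyDeletedRows(localVersions, remoteVersions) {
    var local = new Set(localVersions);

    // In the remote but not the local: deleted locally, so it needs to be deleted from the remote too.
    return remoteVersions.filter(function (v) { return !local.has(v); });
}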

But where does this secondary differencing process fit into our top level algorithm?

At first we tried running the two checks in parallel, but it quickly became obvious that the differencing process needed to be aware of the boundaries of the other process (which I’m going to call high watermark from this point forward in order to maintain clarity). Without the context supplied by the high watermark (i.e. the remote maximum version), the differencing process would get confused and everything became much harder to reason about. The easy solution was to only run the differencing process when the high watermark thought it was finished, so we slotted it into the otherwise dead area of the algorithm for when the remote and local version maximums were the same.

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload
If Local < Remote
    Delete Everything Remotely

If you look closely, you can see that I’ve added in batching logic for the differencing process similarly to how it works for the high watermark. The reality of the situation is that you can’t easily compare two entire tables (remote vs local) all at once, even if you’re only dealing with relatively simple numerical sequences. There might be millions and millions of rows at play, and that’s way too much to deal with in a single HTTP request. Batching just breaks the problem down into manageable chunks, and all you have to do is remember where you were last up to and make sure you roll back to the start once you reach the end.
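A sketch of that scanning behaviour (JavaScript for illustration; the ordering assumption, names and rollover rule here are mine, not the real code):

// Sketch only: take the next batch of local rows after the last scanned position,
// rolling back to the start of the table once the end has been reached.
function nextDifferencingBatch(rows, lastScannedVersion, batchSize) {
    // rows are assumed to be ordered by version, ascending
    if (rows.length === 0) {
        return { batch: [], newPosition: lastScannedVersion };
    }

    var remaining = rows.filter(function (row) { return row.version > lastScannedVersion; });
    if (remaining.length === 0) {
        remaining = rows; // reached the end, start scanning from the beginning again
    }

    var batch = remaining.slice(0, batchSize);
    return { batch: batch, newPosition: batch[batch.length - 1].version };
}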

Completely Trustworthy Holes

After editing the algorithm to take into account hard deletes, the synchronization process is in a pretty good place.

Additions and updates are identified and uploaded quickly (within minutes) and hard deletes are identified and resolved eventually (still quickly, but slower due to the scanning nature of the differencing process).

What about the other side of the differencing check though?

What if we find something that is in the local but not in the remote?

If we look at the process as it currently stands, there is no obvious way for it to get into a situation where there is missing data on the remote end to be picked up during the differencing process. In the interests of completeness though, we should really take care of that situation, because even if we don’t think it can happen, many years of software development have proven to me that it probably will, maybe as a result of a bug, maybe as a result of some future change. Might as well cover it now and save time later.

Taking this into account, the algorithm becomes:

Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
If Local < Remote
    Delete Everything Remotely

The only addition is the reverse of the hard delete handler, finding everything in the difference batch that isn’t stored remotely and uploading it. With the differencing process never running before the high watermark is finished (and using that high watermark to limit itself), we’re in a pretty good place.

To Be Continued

The obvious weakness in the process is that crazy nuclear delete that happens whenever the local version is less than the remote. It’s not the end of the world and it definitely does the job it’s supposed to, but it’s pretty nasty. As I mentioned earlier in this post though, it doesn’t only occur when a database has been restored, it also occurs with tables that do hard deletes and experience a lot of churn. It’s pretty crazy to delete an entire table just because the user added a new thing then removed it again.

That’s a problem for next week though.

Continuing on from last week, if we’re syncing entire rows at a time, versioning is taken care of by the SQL Server RowVersion construct and we only care about syncing in one direction, all that’s left to do is devise an algorithm taking all of those disparate pieces into account.

So without further ado, let’s get straight into it.

Row, Row, Row Your Boat

The simplest syncing algorithm is one that obtains the versions of both the local and the remote data sources, compares them and then uses that information to decide what to do.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Local Rows > Remote Version
    Upload

It basically boils down to just retrieving and uploading all rows with a version greater than the remote, assuming the upload is done as an upsert and the tables contain appropriate primary keys separate from the version.
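A minimal sketch of that top level comparison (JavaScript for illustration; the actual implementation is C#, and the helper names here are placeholders):

// Sketch only: compare versions and upload anything the remote has not seen yet.
function runSync(table) {
    var localVersion = getLocalMaxVersion(table);   // max RowVersion in the local table
    var remoteVersion = getRemoteMaxVersion(table); // max version the remote end knows about

    if (localVersion === remoteVersion) {
        return; // nothing to do
    }

    if (localVersion > remoteVersion) {
        var rows = getLocalRowsAboveVersion(table, remoteVersion);
        upload(table, rows); // upsert by primary key on the remote side
    }
}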

See? Simple, just like I said.

Job done, post finished.

Admitting You Have A Problem Is Hard

Alas, not quite.

The first time the algorithm runs, there is a high chance that the difference between the local and remote will be an extremely large number of rows. Some of the tables in the database (the uninteresting ones of course) only have tens of rows, but many tables have millions of rows. Even ignoring the potential impact querying for that much data might have on the usability of the application, uploading millions of rows all at once is likely to break all sorts of HTTP communication limits (request size being the obvious one).

If we’re aiming to have the process run silently in the background, without affecting anything, we need a way to break our updates into smaller batches.

Amending the algorithm to take batching into account gives us this.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload

We don’t need to keep any information about what position we were at last time because we can always find that out by asking the remote end, which is useful. Unfortunately, this does have the side effect of making the process a little more chatty than it strictly needs to be, but optimization can always be done later. Of course, the tradeoff here is timeliness. Small batches are good for performance, but bad for the overall sync time (due to the process running around every minute), while large batches are the opposite. There is a sweet spot somewhere in the middle, but that sweet spot moves over time based on changing environmental factors, both local and remote.

A good example of these sorts of factors is that because the algorithm is being run on-premises, in a location we have zero control over, someone might try to run it with an internet connection that is nothing more than a few carrier pigeons with USB sticks strapped to their legs. Even worse than that, they might be trying to run it on ADSL more than a few kilometres from an exchange.

*shudder*

However, we can help these poor unfortunate souls by making the batch size adaptive. Basically, we set a default, min and max batch size, and then if errors occur while uploading (like timeouts), we make the batch size smaller for next time. If everything is going well, we make the batch size larger. With some logic in there to avoid an indecisive situation where the batch size never stops flipping around, we’ve got a system that will intelligently adjust itself to the capabilities of the environment that it is running in.

I’ve left the adaptive batch size stuff out of the algorithm definition above in order to simplify it, but imagine that it goes into the part that gets the batch size and that it takes into account the last 10 or so runs.
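As a sketch of what that adjustment might look like (JavaScript for illustration; the real logic is C# and includes extra damping to stop the batch size from flip-flopping):

// Sketch only: shrink the batch size when recent runs have failed, grow it when
// they have all succeeded, and always stay inside the configured bounds.
function nextBatchSize(current, recentResults, options) {
    // recentResults holds a boolean per run for the last 10 or so runs, true = succeeded
    var failures = recentResults.filter(function (ok) { return !ok; }).length;

    if (failures > 0) {
        return Math.max(options.minimum, Math.floor(current / 2));
    }

    return Math.min(options.maximum, current * 2);
}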

Back To The…Past?

Now that we’ve got batching in place, you might notice that the algorithm is missing a branch in the local vs remote version comparison.

What happens when the Local version is less than the Remote? Actually, how can that even happen? If the internal versioning process always results in the version increasing for every single change, then how could the system ever get into a situation where the number goes down?

The answer to that is database restores.

Unfortunately, database restores are not only possible for the application in question, they are downright common.

There is a long and storied history about why database restores are an integral part of the application, but the short version is that errors and corruption occur all too frequently, or have occurred frequently in the past, and someone decided that the capability to arbitrarily snapshot and then restore the entire database was an amazing solution to the problem.

In truth, it is an amazing solution to the problem, but it really does make integrations of all sorts incredibly difficult, especially data synchronization.

Amending the algorithm to take database restores into account leaves us with this.

Get Local Version
Get Remote Version
If Local == Remote
    Do Nothing
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload
If Local < Remote
    Delete Everything Remotely

As you can see, we went with the nuclear approach.

Removing all of the data from the remote location is obviously a sub-optimal approach, but it’s also the easiest and most reliable. Until we get into a situation where we need to optimise, it’s good enough. I’ll be coming back to this particular decision later in this series of posts, but for now, the algorithm just removes everything when the local version goes below the remote.

Soft, Like Pudding

Looking at the algorithm I’ve defined above, it takes into account quite a few scenarios:

  • If a new entity is added in the application, one or more rows are added, they get appropriate RowVersions, are picked up by the algorithm and are uploaded to the remote location.
  • If an entity is changed in the application, one or more rows are changed, they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.
  • If an entity is deleted, one or more rows are changed (marked as deleted), they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.

Actually, that last one is a little tricky.

A good percentage of the entities in the application are only ever deleted softly. This means that they are not physically removed from the database, but are instead updated to indicate deletion and then ignored for subsequent interactions. Soft deletes make a synchronization process easy because if something is there, it will always be there, and if you have good change tracking via versioning or something similar, you’re fine.

Some things are hard deleted though, and that makes for complications.

To Be Continued

Handling hard deletes is a topic unto itself, so I’m going to leave that for the next post.

Pretty terrible cliffhanger though.