Syncing the Ship, Part 2
- Posted in: data synchronization, algorithms
Continuing on from last week: if we’re syncing entire rows at a time, versioning is taken care of by the SQL Server RowVersion construct, and we only care about syncing in one direction, then all that’s left to do is to devise an algorithm that takes all of those disparate pieces into account.
So without further ado, let’s get straight into it.
Row, Row, Row Your Boat
The simplest syncing algorithm is one that obtains the versions of both the local and the remote data sources, compares them, and then uses that information to decide what to do.
    Get Local Version
    Get Remote Version
    If Local == Remote
        Do Nothing
    If Local > Remote
        Get Local Rows > Remote Version
        Upload
It basically boils down to retrieving and uploading all rows with a version greater than the remote version, assuming the upload is done as an upsert and the tables contain appropriate primary keys separate from the version.
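To make that concrete, here’s a minimal, runnable sketch in Python. The local and remote stores are simulated as dictionaries mapping a primary key to a (version, data) pair; in the real system the version is SQL Server’s RowVersion and the upload is an HTTP upsert.

```python
# A toy model of the one-way sync: each store maps primary key -> (version, data).
# In the real system the version is SQL Server's RowVersion and the "upload"
# is an HTTP upsert keyed on the primary key.

def highest_version(store):
    # The high-water mark: the largest version the store has seen (0 if empty).
    return max((version for version, _ in store.values()), default=0)

def sync(local, remote):
    local_version = highest_version(local)
    remote_version = highest_version(remote)

    if local_version == remote_version:
        return  # both sides agree; nothing to do

    if local_version > remote_version:
        # Every row the remote hasn't seen yet: version > remote high-water mark.
        changed = {pk: row for pk, row in local.items() if row[0] > remote_version}
        remote.update(changed)  # the upsert: insert or overwrite by primary key

local = {1: (3, "alpha"), 2: (5, "bravo"), 3: (1, "charlie")}
remote = {1: (3, "alpha"), 3: (1, "charlie")}
sync(local, remote)
assert remote == local
```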
See? Simple, just like I said.
Job done, post finished.
Admitting You Have A Problem Is Hard
Alas, not quite.
The first time the algorithm runs, there is a high chance that the difference between the local and remote will be an extremely large number of rows. Some of the tables in the database (the uninteresting ones of course) only have tens of rows, but many tables have millions of rows. Even ignoring the potential impact querying for that much data might have on the usability of the application, uploading millions of rows all at once is likely to break all sorts of HTTP communication limits (request size being the obvious one).
If we’re aiming to have the process run silently in the background, without affecting anything, we need a way to break our updates into smaller batches.
Amending the algorithm to take batching into account gives us this.
    Get Local Version
    Get Remote Version
    If Local == Remote
        Do Nothing
    If Local > Remote
        Get Next [BATCH SIZE] Local Rows > Remote Version
        Upload
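In the same toy model as before (reusing highest_version and the local/remote dictionaries from the earlier sketch), the batched version might look something like this.

```python
# Batched variant of the earlier sketch (highest_version as defined above).
# Each run moves at most batch_size rows, lowest versions first.

def sync_batch(local, remote, batch_size):
    remote_version = highest_version(remote)
    if highest_version(local) <= remote_version:
        return 0  # caught up

    # The next batch: the lowest-versioned rows the remote hasn't seen yet.
    pending = sorted(
        (item for item in local.items() if item[1][0] > remote_version),
        key=lambda item: item[1][0],
    )
    batch = dict(pending[:batch_size])
    remote.update(batch)  # same upsert as before
    return len(batch)

# Each scheduled run uploads one batch; looping here just drains the backlog.
while sync_batch(local, remote, batch_size=100):
    pass
```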
We don’t need to keep any information about what position we were at last time because we can always find that out by asking the remote end, which is useful. Unfortunately, this does have the side effect of making the process a little more chatty than it strictly needs to be, but optimization can always be done later. Of course, the tradeoff here is timeliness. Small batches are good for performance, but bad for the overall sync time (due to the process running around every minute), while large batches are the opposite. There is a sweet spot somewhere in the middle, but that sweet spot moves over time based on changing environmental factors, both local and remote.
A good example of these sorts of factors is that because the algorithm is being run on-premises, in a location we have zero control over, someone might try to run it with an internet connection that is nothing more than a few carrier pigeons with USB sticks strapped to their legs. Even worse than that, they might be trying to run it on ADSL more than a few kilometres from an exchange.
*shudder*
However, we can help these poor unfortunate souls by making the batch size adaptive. Basically, we set a default, min and max batch size, and then if errors occur while uploading (like timeouts), we make the batch size smaller for next time. If everything is going well, we make the batch size larger. With some logic in there to avoid an indecisive situation where the batch size never stops flipping around, we’ve got a system that will intelligently adjust itself to the capabilities of the environment that it is running in.
I’ve left the adaptive batch size stuff out of the algorithm definition above in order to simplify it, but imagine that it goes into the part that gets the batch size and that it takes into account the last 10 or so runs.
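As a rough sketch of what that adaptive logic might look like (the bounds and thresholds here are invented for illustration, not the application’s real values):

```python
# A sketch of adaptive batch sizing. The bounds and thresholds are invented.
# Halving on failure backs off quickly; requiring a run of clean executions
# before growing approximates "look at the last 10 or so runs" and stops the
# size from flip-flopping between two values.

MIN_BATCH, MAX_BATCH = 10, 50_000
GROW_AFTER = 10  # consecutive successful runs needed before growing again

def next_batch_size(current, consecutive_successes, last_run_failed):
    if last_run_failed:  # e.g. an HTTP timeout while uploading
        return max(MIN_BATCH, current // 2)
    if consecutive_successes >= GROW_AFTER:
        return min(MAX_BATCH, current * 2)
    return current  # hold steady until the connection proves itself
```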
Back To The…Past?
Now that we’ve got batching in place, you might notice that the algorithm is missing a branch in the local vs remote version comparison.
What happens when the Local version is less than the Remote? Actually, how can that even happen? If the internal versioning process always results in the version increasing for every single change, then how could the system ever get into a situation where the number goes down?
The answer to that is database restores.
Unfortunately, database restores are not only possible for the application in question, they are downright common.
There is a long and storied history about why database restores are an integral part of the application, but the short version is that errors and corruption occur all too frequently, or have occurred frequently in the past, and someone decided that the capability to arbitrarily snapshot and then restore the entire database was an amazing solution to the problem.
In truth, it is an amazing solution to the problem, but it really does make integrations of all sorts incredibly difficult, especially data synchronization.
Amending the algorithm to take database restores into account leaves us with this.
    Get Local Version
    Get Remote Version
    If Local == Remote
        Do Nothing
    If Local > Remote
        Get Next [BATCH SIZE] Local Rows > Remote Version
        Upload
    If Local < Remote
        Delete Everything Remotely
As you can see, we went with the nuclear approach.
Removing all of the data from the remote location is obviously a sub-optimal approach, but it’s also the easiest and most reliable. Until we get into a situation where we need to optimise, it’s good enough. I’ll be coming back to this particular decision later in this series of posts, but for now, the algorithm just removes everything when the local version goes below the remote.
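Folded into the toy model from earlier (reusing highest_version and sync_batch from the previous sketches), the full comparison looks something like this.

```python
# The restore branch added to the toy model (highest_version and sync_batch
# as defined in the earlier sketches). A restore can wind the local versions
# backwards; clearing the remote lets the normal batched upload rebuild it.

def sync_run(local, remote, batch_size):
    local_version = highest_version(local)
    remote_version = highest_version(remote)

    if local_version == remote_version:
        return  # nothing to do
    if local_version < remote_version:
        remote.clear()  # the nuclear option: subsequent runs re-upload everything
        return
    sync_batch(local, remote, batch_size)
```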
Soft, Like Pudding
Looking at the algorithm I’ve defined above, it takes into account quite a few scenarios:
- If a new entity is added in the application, one or more rows are added, they get appropriate RowVersions, are picked up by the algorithm and are uploaded to the remote location.
- If an entity is changed in the application, one or more rows are changed, they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.
- If an entity is deleted, one or more rows are changed (marked as deleted), they get new RowVersions, are picked up by the algorithm and are uploaded to the remote location.
Actually, that last one is a little tricky.
A good percentage of the entities in the application are only ever deleted softly. This means that they are not physically removed from the database, but are instead updated to indicate deletion and then ignored for subsequent interactions. Soft deletes make a synchronization process easy because if something is there, it will always be there, and if you have good change tracking via versioning or something similar, you’re fine.
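In other words, from the sync’s point of view a soft delete is just another update. As a hypothetical example against SQL Server (the table and column names are invented, and conn stands in for a pyodbc-style connection), the UPDATE below automatically bumps the row’s RowVersion, so the normal algorithm picks it up:

```python
# A soft delete is a plain UPDATE, so SQL Server bumps the row's RowVersion
# automatically and the sync treats it like any other change. The table and
# column names here are invented for illustration.

def soft_delete_customer(conn, customer_id):
    cursor = conn.cursor()
    cursor.execute(
        "UPDATE dbo.Customer SET IsDeleted = 1 WHERE Id = ?",
        (customer_id,),
    )
    conn.commit()
```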
Some things are hard deleted though, and that makes for complications.
To Be Continued
Handling hard deletes is a topic unto itself, so I’m going to leave that for the next post.
Pretty terrible cliffhanger though.