Syncing the Ship, Part 3
Posted in: data synchronization, algorithms
Since I left last week's post on a terrible cliffhanger, I thought that this week I'd take a detour and write some more about our AWS Lambda ELB logs processor, just to keep the suspense going for as long as possible.
I’m sure my reader will be irate.
But seriously, let's talk about hard deletes and why they make everything more difficult.
Funnily enough, we have actually had some more issues with our AWS Lambda ELB logs processor, which I will probably have to talk about at some point. Specifically, while the connection pooling worked and stopped the execution from erroring out, the process can’t get through an entire ELB log file within its time limit (300 seconds, which is the maximum time available for the execution of a Lambda function). The files are pretty juicy, clocking in at 20ish megabytes, so either we need to optimise the Lambda function itself or we need to create ourselves a divide and conquer strategy.
At some unknown point in the past, the application at the core of this entire syncing adventure was changed to make the deletion of major entities reversible. Specifically, instead of simply removing the offending information from the database (along with any appropriate relationships), the primary entity entry was just marked as “deleted” and then ignored for basically every interaction other than “undelete”. Being able to undo accidental (or malicious) deletes without resorting to a full database restore is pretty useful, so the soft delete pattern was applied to most of the entities in the system.
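For illustration, the soft delete pattern looks something like the following minimal Python sketch. The entity and field names are made up for the example, not the actual application's schema:

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical entity; the real schema obviously differs.
    id: int
    name: str
    is_deleted: bool = False  # soft delete flag, instead of removing the row


def soft_delete(customer: Customer) -> None:
    # The row stays put; it's just flagged and then ignored everywhere else.
    customer.is_deleted = True


def undelete(customer: Customer) -> None:
    # Reversing an accidental delete is a flag flip, not a database restore.
    customer.is_deleted = False


def visible(customers: list[Customer]) -> list[Customer]:
    # Every normal read filters the soft deleted rows out.
    return [c for c in customers if not c.is_deleted]
```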
Alas, “most” is a far cry from “all”.
Some entities are just deleted the old fashioned way, gone forever unless someone does a database restore. Thinking about it, it makes sense to not retain full history for tables with high churn (like appointments), so we can’t just mandate soft deletion for everything just to make our lives easier. With that constraint in place, we need to adapt our process to take hard deletes into account.
The question then becomes: how do you find something if it's just not there anymore?
Suspicious Holes
Our versioning-based delta algorithm will not detect deletions unless they occur at the very top of a table (i.e. among the rows with the highest versions). When something is deleted from that specific location, the maximum version will appear to be lower locally than remotely, so the entire table will be removed and re-uploaded from scratch.
I'll come back to this particular situation in another post, but let's just ignore it for now, because the solution will work in those cases, even if it's terrible.
The more interesting case is when the deletion occurs literally anywhere else in the table.
The RowVersion construct we’re relying on won’t allow us to detect if something has been removed, at least not in the same way that we’ve been using it to detect changes/additions.
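To put some made-up numbers on that, here's a tiny Python illustration of why the maximum-version comparison only ever notices deletes at the very top of the table:

```python
remote_versions = [1001, 1002, 1003]  # what the remote currently holds
local_versions = [1001, 1002]         # 1003 (the top row) was hard deleted locally

# The high watermark only compares maximums, so a delete at the very top
# makes the local table look like it is behind the remote...
print(max(local_versions) < max(remote_versions))  # True -> nuke and re-upload

# ...but a delete anywhere else leaves the maximums identical, so nothing is detected.
local_versions = [1001, 1003]         # 1002 was hard deleted locally instead
print(max(local_versions) < max(remote_versions))  # False -> the delete goes unnoticed
```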
What we can do, however, is use it to generate a manifest of sorts, describing everything that is in the table in a relatively efficient fashion and then use that manifest to compare the local to the remote. RowVersion is unique (barring exceptional circumstances), so we can use it as a pseudo key of sorts, allowing us to easily compare local and remote data for any table that already features versioning.
Since we’re dealing primarily with two lists of numbers, the comparison is relatively easy. Look for everything that is in the remote but not in the local and you’ll find everything that’s been deleted locally. You can then use that information to construct some commands to remove the offending data and bring everything back to harmonious balance.
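In rough terms (a Python sketch with hypothetical names; the actual system isn't written in Python), the whole comparison boils down to a set difference over the two manifests:

```python
def find_hard_deletes(local_manifest: set[int], remote_manifest: set[int]) -> set[int]:
    """Versions the remote has that the local no longer does, i.e. rows that
    were hard deleted locally and should now be removed from the remote."""
    return remote_manifest - local_manifest


# 1005 was hard deleted locally, so it needs to be deleted remotely too.
print(find_hard_deletes({1001, 1002, 1004, 1006}, {1001, 1002, 1004, 1005, 1006}))  # {1005}
```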
But where does this secondary differencing process fit into our top-level algorithm?
At first we tried running the two checks in parallel, but it quickly became obvious that the differencing process needed to be aware of the boundaries of the other process (which I'm going to call the high watermark from this point forward, in order to maintain clarity). Without the context supplied by the high watermark (i.e. the remote maximum version), the differencing process would get confused and everything would become much harder to reason about. The easy solution was to only run the differencing process when the high watermark thought it was finished, so we slotted it into the otherwise dead area of the algorithm, for when the remote and local version maximums were the same.
Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
        Delete from Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload
If Local < Remote
    Delete Everything Remotely
If you look closely, you can see that I've added in batching logic for the differencing process, similar to how it works for the high watermark. The reality of the situation is that you can't easily compare two entire tables (remote vs local) all at once, even if you're only dealing with relatively simple numerical sequences. There might be millions and millions of rows at play, and that's way too much to deal with in a single HTTP request. Batching just breaks the problem down into manageable chunks, and all you have to do is remember where you were last up to and make sure you roll back to the start once you reach the end.
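To make the batching a bit more concrete, here's a rough sketch of what a single pass of the differencing process might look like. The helpers for reading local rows and talking to the remote are hypothetical stand-ins; none of these names come from the real implementation:

```python
from typing import Callable, Iterable, NamedTuple


class Row(NamedTuple):
    version: int  # the RowVersion value, treated as an increasing integer


BATCH_SIZE = 1000


def difference_batch(
    local_rows_after: Callable[[int, int], list[Row]],             # hypothetical: next N local rows past a version
    remote_manifest_between: Callable[[int, int], Iterable[int]],  # hypothetical: remote versions within a window
    delete_from_remote: Callable[[set[int]], None],                # hypothetical: remove the given versions remotely
    last_position: int,
) -> int:
    """Runs one batch of the differencing process and returns the position to
    resume from next time (0 means we hit the end of the table and should wrap
    back around to the start)."""
    batch = local_rows_after(last_position, BATCH_SIZE)
    if not batch:
        return 0  # end of the table; roll back to the start for the next run

    low, high = batch[0].version, batch[-1].version
    local_versions = {row.version for row in batch}
    remote_versions = set(remote_manifest_between(low, high))

    # Anything the remote has inside this window that we don't was hard deleted locally.
    delete_from_remote(remote_versions - local_versions)

    # Remember where we got up to so the next batch carries on from here.
    return high
```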
Completely Trustworthy Holes
After editing the algorithm to take into account hard deletes, the synchronization process is in a pretty good place.
Additions and updates are identified and uploaded quickly (within minutes) and hard deletes are identified and resolved eventually (still quickly, but slower due to the scanning nature of the differencing process).
What about the other side of the differencing check though?
What if we find something that is in the local but not in the remote?
If we look at the process as it currently stands, there is no obvious way for it to get into a situation where there is missing data on the remote end to be picked up during the differencing process. In the interests of completeness though, we should really take care of that situation, because even if we don't think it can happen, many years of software development have proven to me that it probably will, maybe as a result of a bug, maybe as a result of some future change. Might as well cover it now and save time later.
Taking this into account, the algorithm becomes:
Get Local Version
Get Remote Version
If Local == Remote
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
        Delete from Remote
        Find Local Not in Remote
        Upload to Remote
If Local > Remote
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
If Local < Remote
    Delete Everything Remotely
The only addition is the reverse of the hard delete handler: finding everything in the difference batch that isn't stored remotely and uploading it. With the differencing process never running before the high watermark is finished (and using that high watermark to limit itself), we're in a pretty good place.
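In manifest terms, the comparison step just grows a second set difference. Again, this is purely an illustrative Python sketch, not the real code:

```python
def compare_manifests(local_manifest: set[int], remote_manifest: set[int]) -> tuple[set[int], set[int]]:
    """For the same min/max version window, returns (versions to delete from
    the remote, versions to upload to the remote)."""
    to_delete = remote_manifest - local_manifest  # hard deleted locally
    to_upload = local_manifest - remote_manifest  # missing remotely, however that happened
    return to_delete, to_upload


# 1005 was hard deleted locally; 1007 somehow never made it to the remote.
to_delete, to_upload = compare_manifests({1001, 1002, 1007}, {1001, 1002, 1005})
print(to_delete, to_upload)  # {1005} {1007}
```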
To Be Continued
The obvious weakness in the process is that crazy nuclear delete that happens whenever the local version is less than the remote. It's not the end of the world, and it definitely does the job it's supposed to, but it's pretty nasty. As I mentioned earlier in this post though, it doesn't only occur when a database has been restored; it also occurs with tables that do hard deletes and experience a lot of churn. It's pretty crazy to delete an entire table just because the user added a new thing and then removed it again.