Syncing the Ship, Part 4
Posted in: data synchronization, algorithms
Well, it's now next week, and it's time to deal with the problem.
What problem?
The problem where we delete literally all of the data for a table and then synchronize it all over again whenever the maximum local version drops below the maximum remote version.
That problem.
If that sort of thing only happened very rarely, I probably wouldn’t care all that much. The problem is, it can occur whenever a database restore/rollback is performed (which is annoyingly frequent for our users) or when a user removes an entity they just added and that entity is subject to hard deletion rules (which happens quite a lot).
With the table annihilation occurring far more frequently than we would like, we’re looking at a few very real problems.
- During the period where the data is re-syncing, anything using that data (third-party APIs, mobile applications, websites, etc.) will be out of date. Syncing a table from scratch can take a few hours (depending on table size), so it’s not a great situation to be in.
- Repeatedly pushing the same information consumes valuable bandwidth. In Australia, where the internet is mostly run over taut pieces of string and unlimited data plans are not the norm, consuming bandwidth for no real gain is foolish.
- Some tables with hard delete have such high churn that a constant cycle of delete and re-sync can be maintained almost indefinitely, which exacerbates the two points above.
Also, such a glaring and gross inefficiency makes me sad as an engineer, which, ultimately, is the most important reason. A happy engineer is an effective engineer after all.
Controlled Fission
Rather than going with the nuclear approach of “delete all the things”, it makes a lot more sense to act in a more controlled, surgical fashion and only remove the things that need to be removed when the version goes backwards. Identifying what needs to be removed is relatively easy: it’s just everything with a version greater than the maximum local version.
    Get Local Version
    Get Remote Version
    If Local == Remote
        Get Last Local Position
        Get Next [BATCH SIZE] Local Rows from Last Position
        Get Min & Max Version in Batch
        Query Remote for Manifest Between Min/Max Local Version
        Create Manifest from Local Batch
        Compare
            Find Remote Not in Local
                Delete from Remote
            Find Local Not in Remote
                Upload to Remote
    If Local > Remote
        Get Next [BATCH SIZE] Local Rows > Remote Version
        Upload to Remote
    If Local < Remote
        Find Remote > Local Version
        Delete from Remote
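For the sake of illustration, here’s roughly what that branching might look like in Python. It’s a minimal sketch rather than our actual implementation: every helper on the `local` and `remote` objects (`max_version`, `rows_with_version_above`, `delete_with_version_above`, and the `reconcile_next_batch` differencing pass sketched further down) is an assumed name.

```python
def synchronise_table(local, remote, batch_size=1000):
    local_version = local.max_version()
    remote_version = remote.max_version()

    if local_version == remote_version:
        # Versions match: run the differencing check over the next batch of
        # rows (sketched further down) to catch anything that slipped through.
        reconcile_next_batch(local, remote, batch_size)
    elif local_version > remote_version:
        # Normal case: push the rows the remote has not seen yet.
        remote.upsert(local.rows_with_version_above(remote_version, limit=batch_size))
    else:
        # Version went backwards (restore/rollback): surgically remove only the
        # remote rows newer than anything held locally, rather than nuking the table.
        remote.delete_with_version_above(local_version)
```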
This is a decent solution, but without the differencing check, it has a major flaw that can lead to missing data. This was the primary reason we went with the “delete it all, let God sort it out” approach originally: we’d rather have everything eventually be correct than risk getting ourselves into a state where the remote data is not identical to the local data, and the synchronization process thinks that it’s done.
The sequence that can lead to missing data with the above algorithm in play is not straightforward, so I’ll try to explain it using an example.
Imagine we have two sets of data, one representing the local store (left side) and the other the remote (right side). On each side, the first column is the row ID and the second is the version.
| Local ID | Local Version | Remote ID | Remote Version |
| --- | --- | --- | --- |
| 4 | 245 | 4 | 245 |
| 3 | 234 | 3 | 234 |
| 2 | 221 | 2 | 221 |
| 1 | 210 | 1 | 210 |
According to the representation above, both local and remote are in sync. Assume at this point a snapshot was made locally.
Now the user does something to update the row with ID 3, which gives it a new version.
| Local ID | Local Version | Remote ID | Remote Version |
| --- | --- | --- | --- |
| 3 | 257 | 4 | 245 |
| 4 | 245 | 3 | 234 |
| 2 | 221 | 2 | 221 |
| 1 | 210 | 1 | 210 |
The sync process kicks in, detects the changed row (because its version is now higher than the maximum remote version) and sends it out, where it is updated in the remote database by its primary key.
| Local ID | Local Version | Remote ID | Remote Version |
| --- | --- | --- | --- |
| 3 | 257 | 3 | 257 |
| 4 | 245 | 4 | 245 |
| 2 | 221 | 2 | 221 |
| 1 | 210 | 1 | 210 |
If the user then performs a rollback/restore using the snapshot taken earlier, the data now looks like this.
| Local ID | Local Version | Remote ID | Remote Version |
| --- | --- | --- | --- |
| 4 | 245 | 3 | 257 |
| 3 | 234 | 4 | 245 |
| 2 | 221 | 2 | 221 |
| 1 | 210 | 1 | 210 |
Finally, the algorithm above will react to this situation by removing all data from the remote where the version is greater than the maximum local version (245).
| Local ID | Local Version | Remote ID | Remote Version |
| --- | --- | --- | --- |
| 4 | 245 | | |
| 3 | 234 | 4 | 245 |
| 2 | 221 | 2 | 221 |
| 1 | 210 | 1 | 210 |
Unless something happens to the row with ID 3 (i.e. it’s updated in some way), it will never be synced remotely, ruining everything.
The good news is that the differencing check (eventually) fixes this situation: as it scans through the database it will discover the missing row and upload it, which lets us implement partial rollbacks in the interest of efficiency.
And that makes the engineer in me happy.
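For the curious, a rough sketch of that differencing pass might look like the following. As before, the storage helpers (`rows_after`, `rows_between`, `upsert`, `delete`, `set_position`) are assumed names standing in for whatever the data access layer actually provides.

```python
def reconcile_next_batch(local, remote, batch_size):
    # Walk the table from wherever we got to last time.
    batch = local.rows_after(local.last_position(), limit=batch_size)
    if not batch:
        local.reset_position()  # reached the end; wrap around and start again
        return

    low = min(row.version for row in batch)
    high = max(row.version for row in batch)

    local_manifest = {row.id: row.version for row in batch}
    remote_manifest = {row.id: row.version for row in remote.rows_between(low, high)}

    # Remote rows that no longer exist locally: delete them.
    remote.delete([rid for rid in remote_manifest if rid not in local_manifest])

    # Local rows that are missing (or stale) remotely: upload them. This is how
    # the rolled-back row with ID 3 eventually finds its way back out.
    remote.upsert([row for row in batch
                   if remote_manifest.get(row.id) != row.version])

    local.set_position(batch[-1].id)
```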
The Worst of the Worst
Everything is not quite puppies and roses though.
The synchronization process as described above is robust enough to handle just about any situation you can throw at it.
Except one.
What happens if the high watermark process can never upload its batch of changes, even at the minimum size? Keep in mind that the minimum batch size is one.
There are a number of reasons why a batch size of one might be reached, but the most common are:
- API unavailable for an extended period of time. This could be the result of a local or remote issue, but once it’s available again, the batch size will increase, so we don’t really have anything to worry about here.
- Some critical problem with uploading that particular row.
The second one is an issue that we actually ran into, where a single row contained something ridiculous like 50MB of crazy embedded data, which was more than the maximum request size we were allowing on the API, so it kept failing miserably.
As a last resort, we improved the algorithm to skip a single row if the minimum batch size has failed more than a few times. We reasoned that if a row is that badly broken, missing data (the bad row) is preferable to not being able to sync at all. On the off chance the row is fixed later, the process will pick it up and upload it back to its rightful place.
    Get Local Version
    Get Remote Version
    If Local == Remote
        Calculate [BATCH SIZE] Using Historical Data
        Get Last Local Position
        Get Next [BATCH SIZE] Local Rows from Last Position
        Get Min & Max Version in Batch
        Query Remote for Manifest Between Min/Max Local Version
        Create Manifest from Local Batch
        Compare
            Find Remote Not in Local
                Delete from Remote
            Find Local Not in Remote
                Upload to Remote
    If Local > Remote
        Calculate [BATCH SIZE] Using Historical Data
        Get Next [BATCH SIZE] Local Rows > Remote Version
        Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
    If Local < Remote
        Find Remote > Local Version
        Delete from Remote
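As a rough illustration of the skip-ahead behaviour, here’s one way the upload branch could be written. The `history` object, the failure threshold, and the `skip_past` helper are all assumptions made for the sake of the sketch, not details of the real system.

```python
class UploadError(Exception):
    """Raised by the (hypothetical) remote client when a push fails."""


MIN_BATCH_SIZE = 1
MAX_FAILURES_AT_MINIMUM = 3  # assumed threshold, not necessarily the real value


def push_pending_rows(local, remote, remote_version, history):
    batch_size = history.suggest_batch_size()  # tuned from historical results
    rows = local.rows_with_version_above(remote_version, limit=batch_size)

    try:
        remote.upsert(rows)
        history.record_success(batch_size)
    except UploadError:
        history.record_failure(batch_size)
        if (batch_size == MIN_BATCH_SIZE
                and history.failures_at_minimum() >= MAX_FAILURES_AT_MINIMUM):
            # The one remaining row simply cannot be sent (e.g. it blows the
            # maximum request size). Skip past it rather than stalling the whole
            # table; if it is ever fixed, the differencing check picks it up later.
            local.skip_past(rows[0].id)
```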
To Be Continued
With the final flaws in the process rectified, that’s pretty much it for data synchronization. Until we discover the next flaw that is, which I’m sure will happen eventually.
I’ll make a summary post next week to draw everything together and bring the entire saga to a close.