Syncing the Ship, Well And Truly Sunk
And that marks the end of the series on our synchronization process. Surprising absolutely no-one, it turns out that building such a process is quite challenging, with plenty of edge cases that have to be handled to keep the synchronized data trustworthy at every level.
To summarise:
- I started by writing about the situation we needed to deal with, along with the constraints we had to work within.
- Then I wrote about the very basic, version-based algorithm for identifying and synchronizing changes.
- Then it was all about hard deletes and how they mess everything up, leading to the introduction of a scanning process for out-of-band changes.
- And the last post was about optimizing some inefficiencies and dealing with some annoyingly bad data.
At the end, we’re left with the following algorithm:
```
Get Local Version
Get Remote Version

If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote

If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
    Record Result for [BATCH SIZE] Tuning
    If Failure & Minimum [BATCH SIZE], Skip Ahead

If Local < Remote
    Find Remote > Local Version
        Delete from Remote
```
Algorithms are great, and form the basis of pretty much everything we do as software developers, but you can’t really deploy one directly. At least not one written mostly as plain text, like the one above.
So how do we actually put it into practice?
Software, Hardware. What About Just-Right-Ware?
Apart from a little bit of information in the first post of the series, I’ve mostly been writing about the synchronization process conceptually. To finish everything up, I’m going to explain how we actually put it into practice and the pieces of software in play, just to make it all a little more concrete.
Implementation-wise, there are three relevant components.
- A piece of desktop software, installed on the client side (C#)
- An API (C# using Nancy)
- A database (multi-tenant, PostgreSQL on AWS RDS)
Using the desktop software, clients register a database for “cloud functionality”, agreeing for their information to be synchronized for use in other applications. This registration process gives the database a unique identifier, and if we combine this with their client id, we can safely segregate databases from one another. Once one or more databases are registered, the desktop software executes approximately every minute, performing a single run of the synchronization algorithm as specified above, using the API to discover information about the remote side of the equation.
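To make a single run a little more concrete, here’s roughly what it looks like when you translate the plain-text algorithm into code. This is a minimal C# sketch, not the actual implementation; every name in it (ILocalStore, ISyncApi, IBatchSizer and so on) is invented purely for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical types mirroring the plain-text algorithm above;
// none of these names come from the real system.
public record Row(string Key, long Version);
public record RowStamp(string Key, long Version);
public record UploadResult(bool Success);

public interface ILocalStore
{
    long GetMaxVersion(string table);
    long GetLastScanPosition(string table);
    IReadOnlyList<Row> GetRowsFrom(string table, long position, int count);
    IReadOnlyList<Row> GetRowsAboveVersion(string table, long version, int count);
    IReadOnlyList<Row> GetRowsByKey(string table, IEnumerable<string> keys);
    void SkipAhead(string table);
}

public interface ISyncApi
{
    long GetMaxVersion(string table);
    IReadOnlyList<RowStamp> GetManifest(string table, long minVersion, long maxVersion);
    UploadResult UploadRows(string table, IReadOnlyList<Row> rows);
    void DeleteRows(string table, IEnumerable<string> keys);
    void DeleteRowsAboveVersion(string table, long version);
}

public interface IBatchSizer
{
    int Minimum { get; }
    int Calculate(string table);                              // uses historical results
    void Record(string table, int size, UploadResult result); // feeds future calculations
}

public class SyncRunner
{
    private readonly ILocalStore _local;
    private readonly ISyncApi _remote;
    private readonly IBatchSizer _batchSizer;

    public SyncRunner(ILocalStore local, ISyncApi remote, IBatchSizer batchSizer) =>
        (_local, _remote, _batchSizer) = (local, remote, batchSizer);

    public void RunOnce(string table)
    {
        var localVersion = _local.GetMaxVersion(table);
        var remoteVersion = _remote.GetMaxVersion(table);

        if (localVersion == remoteVersion)
        {
            // In sync by version, so scan a batch for out-of-band changes (hard deletes).
            var batch = _local.GetRowsFrom(
                table, _local.GetLastScanPosition(table), _batchSizer.Calculate(table));
            if (batch.Count == 0) return;

            var localManifest = batch.Select(r => new RowStamp(r.Key, r.Version)).ToList();
            var remoteManifest = _remote.GetManifest(
                table, localManifest.Min(s => s.Version), localManifest.Max(s => s.Version));

            var localKeys = new HashSet<string>(localManifest.Select(s => s.Key));
            var remoteKeys = new HashSet<string>(remoteManifest.Select(s => s.Key));

            // Rows the remote has that we don't: delete them remotely.
            _remote.DeleteRows(table, remoteKeys.Where(k => !localKeys.Contains(k)));

            // Rows we have that the remote doesn't: upload them.
            _remote.UploadRows(table,
                _local.GetRowsByKey(table, localKeys.Where(k => !remoteKeys.Contains(k))));
        }
        else if (localVersion > remoteVersion)
        {
            // Local is ahead: push the next batch of changed rows.
            var size = _batchSizer.Calculate(table);
            var result = _remote.UploadRows(
                table, _local.GetRowsAboveVersion(table, remoteVersion, size));

            _batchSizer.Record(table, size, result);
            if (!result.Success && size == _batchSizer.Minimum)
                _local.SkipAhead(table); // step over rows that will never upload cleanly
        }
        else
        {
            // Remote is ahead of local (e.g. the local database was restored
            // from a backup): remove everything past the local version.
            _remote.DeleteRowsAboveVersion(table, localVersion);
        }
    }
}
```

The ISyncApi collaborator in that sketch is where the API comes in.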
The API itself is relatively simple, and is dedicated to facilitating the synchronization process, mainly acting as an application layer on top of the database. It is primarily a REST API (where the entities are customers, databases and tables) and also features some basic authentication and authorization (using the client id, database id and some other user-specific information).
At a high level, it features endpoints like this:
```
GET    /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
POST   /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
DELETE /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}
GET    /v1/customers/{client-id}/databases/{database-id}/tables/{tablename}/manifest
```
It’s got some other endpoints as well, but the ones above are the most relevant to the synchronization process (the others are for things like health checks, uptime checks and administrative functionality).
To elaborate (there’s a rough sketch of these routes in code after this list):
- The first GET endpoint returns the versioning information for the table (i.e. the count, max version and max modified date), which is the primary input for the synchronization process (when compared to the same information locally).
- The POST endpoint on table name allows for inserting/uploading data, supplied as a set of rows appropriate for the table in question.
- The DELETE endpoint on table name unsurprisingly allows for the deletion of data by supplying a set of keys to delete, but also allows for delete operations based on range (i.e. everything > version X) or deleting everything all at once.
- Finally, the GET endpoint on table/manifest allows for the retrieval of a manifest describing a section of the table, which is used for the differencing check.
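To give those routes some shape, this is roughly how they might be declared as a Nancy module (Nancy 1.x routing style). The module path and routes come straight from the list above; the handler bodies and response shapes are placeholders, not the real contracts.

```csharp
using System;
using Nancy;

// A sketch of the four routes as a Nancy module. The module path and routes
// match the endpoints above; everything inside the handlers is a placeholder.
public class TablesModule : NancyModule
{
    public TablesModule()
        : base("/v1/customers/{clientId}/databases/{databaseId}/tables")
    {
        // Versioning information: count, max version, max modified date.
        Get["/{tableName}"] = p =>
            Response.AsJson(new { Count = 0L, Version = 0L, Modified = DateTime.UtcNow });

        // Insert/upload a set of rows appropriate for the table.
        Post["/{tableName}"] = p => HttpStatusCode.Created;

        // Delete by a set of keys, by version range, or everything at once.
        Delete["/{tableName}"] = p => HttpStatusCode.NoContent;

        // Manifest (keys and versions) for a section of the table,
        // used by the differencing check.
        Get["/{tableName}/manifest"] = p => Response.AsJson(new object[0]);
    }
}
```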
The database is a replica of the client-side database, with additional columns for data segregation based on the combination of client id and database id (as mentioned above).
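In practice, that segregation just means every query against the replica carries the client id/database id pair as predicates. Something like the following, purely illustrative sketch (the table name, column names and query shape are all assumptions, not the actual schema):

```csharp
using Npgsql;

// Illustrative only: reads against the replica are always scoped by the
// (client id, database id) pair. Table and column names are made up.
public static NpgsqlCommand NextBatchQuery(
    NpgsqlConnection connection,
    string clientId,
    string databaseId,
    long afterVersion,
    int batchSize)
{
    var command = new NpgsqlCommand(
        @"SELECT *
          FROM   replicated_table
          WHERE  client_id = @clientId
          AND    database_id = @databaseId
          AND    version > @afterVersion
          ORDER  BY version
          LIMIT  @batchSize",
        connection);

    command.Parameters.AddWithValue("clientId", clientId);
    command.Parameters.AddWithValue("databaseId", databaseId);
    command.Parameters.AddWithValue("afterVersion", afterVersion);
    command.Parameters.AddWithValue("batchSize", batchSize);
    return command;
}
```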
Working together, these three components make up the concrete implementation of the synchronization process, replicating local, on-premises data successfully to a remote location, for use in other applications and processes as necessary.
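Stitched together, the desktop side of the process amounts to a loop along these lines (again, a sketch assuming the hypothetical SyncRunner from earlier, not the actual scheduling code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: run the algorithm roughly once a minute for each table
// included in the process. Real scheduling and error handling will differ.
public class SyncLoop
{
    private readonly SyncRunner _runner;            // the sketch from earlier
    private readonly IReadOnlyList<string> _tables; // tables included in the process

    public SyncLoop(SyncRunner runner, IReadOnlyList<string> tables) =>
        (_runner, _tables) = (runner, tables);

    public async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            foreach (var table in _tables)
                _runner.RunOnce(table);

            await Task.Delay(TimeSpan.FromMinutes(1), token);
        }
    }
}
```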
Conclusion
It’s taken me five weeks to describe the synchronization process, but it took us months to build it incrementally, adapting and improving as we found situations where it didn’t quite work the way we wanted. Obviously I summarised most of that in this series of posts, but we were constantly deploying and monitoring the results of each deployment as we went. The process has now been running in production without major incident for a few months, and we’ve started to expand it to include more and more tables as necessary.
Unfortunately, as we expand the range of tables included in the process, we’ve discovered some that don’t follow the most common pattern (mostly around primary keys), which means we’re going to have to adapt the process again to take that sort of situation into account. That’s software, though.
To finish everything up, I have to say that all this talk about syncing has really brought a…Titanic sense of gravitas to the whole situation, don’t you think?
…
I’ll see myself out.