
In previous posts, I’ve briefly mentioned how one of the most valuable outcomes of the data synchronization algorithm was the capability to gather meaningful business intelligence about our users. This includes information like how often they use the software, which parts they are using, and how many of its entities they are creating (and deleting) during their normal daily processes.

When used responsibly, this sort of intelligence is invaluable in helping to create a better, more valuable experience for everyone involved. Literally every part of your standard organization (support, sales, development, marketing, finance) can make use of the data in some way.

Of course, when it’s all sequestered inside a PostgreSQL database, the barrier to entry can be quite high. Generally this just means that the people who are best positioned to make use of the data either don’t get access to it (a waste) or end up constantly bugging the small group of people with the capability to do the analysis.

When I’m a member of that small group, I suddenly get motivated to find an alternative way to give them what they need.

Re-education Campaign

There are really two kinds of analysis that tend to happen for business intelligence purposes.

The first is answering ad-hoc questions. Typically these sorts of questions have not been asked before, or they could be a slight variation on the same theme as a previous question.

Regardless, these generally require someone who actually knows the data and has the skills to wrangle it in order to get to the meat of the issue and provide the answer. This can be taught (with some effort), but not everyone has the time to dedicate to that sort of thing, so new analysis will probably always have to be performed by trained professionals.

The second type of analysis is re-answering a previously answered question, now that time has passed.

This tends to occur quite frequently, especially when it comes to metrics (like “how many customers have X”, “how many Y does each customer have”, and so on), and it is here that there is an opportunity to optimise.

In the long term, there should be an easy platform for people to get at the information they want, without having to understand how to run SQL scripts (or PowerShell). This is likely going to come in the form of Metabase, but unfortunately that is probably still a little way off in the future, and people are asking for data now.

So in the short term, some way to easily run pre-canned queries and get the results as a CSV is desirable.

Copy That

The good news is that PostgreSQL makes it pretty easy to run a query and get CSV output using the COPY command:

COPY ({query}) TO STDOUT WITH DELIMITER ',' CSV HEADER

Executing that command will result in a stream containing the data specified in the query (along with headers), in CSV format.

Of course, it’s kind of useless without a mechanism to easily run it, so if you’re using .NET, you can leverage Npgsql to actually execute the command using the BeginTextExport method on an NpgsqlConnection, like in the following class:

using System.IO;

namespace Solavirum.Exporting
{
    public class PostgresTableStreamer
    {
        private readonly INpgsqlConnectionFactory _npgsqlConnectionFactory;

        public PostgresTableStreamer(INpgsqlConnectionFactory npgsqlConnectionFactory)
        {
            _npgsqlConnectionFactory = npgsqlConnectionFactory;
        }

        // Executes a COPY ... TO STDOUT statement and writes the resulting CSV to the
        // supplied StreamWriter, returning the number of lines written.
        public long Stream(string copySql, StreamWriter output)
        {
            var numberOfLinesInFile = 0;
            using (var conn = _npgsqlConnectionFactory.Create())
            {
                conn.Open();
                using (var reader = conn.BeginTextExport(copySql))
                {
                    var i = reader.Read();

                    // Track whether we are currently inside a quoted field, so that
                    // newlines embedded in data can be escaped rather than written raw.
                    var insideQuotedField = false;

                    while (i != -1)
                    {
                        var c = (char) i;
                        if (c == '"') insideQuotedField = !insideQuotedField;
                        switch (c)
                        {
                            case '\r':
                                // Discard carriage returns entirely.
                                i = reader.Read();
                                continue;
                            case '\n':
                                // Escape newlines inside quoted data; only count real line breaks.
                                if (insideQuotedField) output.Write("\\n");
                                else
                                {
                                    output.Write(c);
                                    numberOfLinesInFile++;
                                }
                                break;
                            default:
                                output.Write(c);
                                break;
                        }
                        i = reader.Read();
                    }
                    output.Flush();
                }
                conn.Close();
            }
            return numberOfLinesInFile;
        }
    }
}

The only tricksy thing is escaping new line characters so that they don’t explode the formatting of the resulting file.
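For reference, wiring it up looks roughly like the following. This is just a sketch to show the shape of a call; the query, the output file name, and the INpgsqlConnectionFactory implementation are all placeholders, not the real pre-canned queries.

using System.IO;

namespace Solavirum.Exporting
{
    public static class ExportExample
    {
        // Runs a (placeholder) pre-canned query and writes the CSV output to disk.
        public static long ExportCustomerEntityCounts(INpgsqlConnectionFactory factory)
        {
            var copySql = "COPY (SELECT customer_id, COUNT(*) FROM entities GROUP BY customer_id) " +
                          "TO STDOUT WITH DELIMITER ',' CSV HEADER";

            var streamer = new PostgresTableStreamer(factory);
            using (var output = new StreamWriter(File.OpenWrite("customer-entity-counts.csv")))
            {
                return streamer.Stream(copySql, output);
            }
        }
    }
}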

A Model View

Of course, a C# class is just as useless to a non-technical person as an SQL query, so now we have to think about delivery.

The easiest way to give people access to some piece of functionality is typically a website, so that seems like a good approach.

My skills are somewhat limited when it comes to putting together a website in a hurry. Our previous websites have used React.js, but those need a bit of a pipeline to get up and running, and I didn’t want to invest that much effort in this prototype.

Eventually, I settled on an ASP.NET Core MVC website, deployed manually to a single Linux instance in AWS.

Nothing too complicated, just a single controller that lists the queries that are available (which are deployed with the code, no custom queries here), and then a way to run a single query and get the result as a file.
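As a rough sketch of what that looks like (the IQueryRepository abstraction and the query type are inventions for this example, not the actual code), the controller barely needs anything:

using System.Collections.Generic;
using System.IO;
using Microsoft.AspNetCore.Mvc;
using Solavirum.Exporting;

namespace Solavirum.Exporting.Web
{
    // Pre-canned queries are deployed with the code; this is a minimal abstraction over them.
    public class CannedQuery
    {
        public string Name { get; set; }
        public string CopySql { get; set; }
    }

    public interface IQueryRepository
    {
        IEnumerable<CannedQuery> List();
        CannedQuery Get(string name);
    }

    public class QueriesController : Controller
    {
        private readonly IQueryRepository _queries;
        private readonly PostgresTableStreamer _streamer;

        public QueriesController(IQueryRepository queries, PostgresTableStreamer streamer)
        {
            _queries = queries;
            _streamer = streamer;
        }

        // Lists the queries that are available to run.
        public IActionResult Index()
        {
            return View(_queries.List());
        }

        // Runs a single query and returns the result as a downloadable CSV file.
        public IActionResult Run(string name)
        {
            var query = _queries.Get(name);
            if (query == null) return NotFound();

            var buffer = new MemoryStream();
            var writer = new StreamWriter(buffer);
            _streamer.Stream(query.CopySql, writer);
            buffer.Position = 0;

            return File(buffer, "text/csv", name + ".csv");
        }
    }
}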

Getting ASP.NET Core MVC up and running on a Linux box is pretty straightforward, assuming you use a local Nginx instance as a reverse proxy (which is easy).

As always, because we’re not stupid, a little touch of security is also necessary, so:

  • The box is behind a load balancer, to allow for easy HTTPS.
  • The load balancer is only accessible from the static IP at our main office. If you’re out of the office you can still use the service, but you need to be on the VPN, which is required for a bunch of our other things anyway.

If the website lives for long enough, I’ll probably augment it with Active Directory logins, so people can just use their domain credentials, but short term, it’s secure enough.

Conclusion

That’s basically it.

Start to finish, it all took about 4 hours, and while I wouldn’t say it was my best work, it’s good enough to solve the immediate problem. At least until we get Metabase up and running and we throw the whole thing out anyway.

I’ll probably clean it up a bit (the manual deployment rankled me more than I wanted to admit) and then wash my hands of the whole thing.

More importantly, the number of people asking me to “please run query X and email me the results” has dropped dramatically, so I’ll consider this a success.

Now I have more time for the really important things.

Like participating in the Mario Kart Tournament.


The pieces of software that are easiest to manage are often the most isolated, mostly because of a combination of two things:

  1. You control the entire world that the software cares about (maybe it’s all in a database, maybe the file system, maybe written into the fabric of the universe, but you control it all)
  2. You don’t really have to care about any other parties in the relationship except for your users, and you control their world, so that’s easy

Unfortunately, the most isolated software is also often the least valuable, because it does not participate in a wider community.

How is this relevant to me?

Well, my team is responsible for a piece of software that was originally very well isolated. It used to be that everything the user cared about (all of their data) was encapsulated nice and neat in an SQL Server database. Connect to a different database, and the user can easily change their entire context.

This database-centric view naturally led to an excellent backup and restore process, and over time the users became familiar with using restores for a number of purposes, from resolving actual software bugs that caused data corruption to simply undoing actions that they didn’t mean to do.

All was well and the world was happy, but everything changed when the third party integrations attacked.

Back To The….Past?

From a software point of view, when you no longer control the entire universe, things get substantially more difficult. This is not new information, but from our point of view, the users’ much-loved (and frequently executed) backup and restore operations became something of a liability.

Integrating with an entirely separate system is challenging at the best of times, but when you add into the mix the capability for one of the participants in the relationship to arbitrarily regress to an earlier point in its timeline, it gets positively frustrating.

We’ve done a lot of integrations in the time that I’ve been working on the software in question, but there are a few that stick out in my mind as good examples of the variety of approaches we’ve had to use to deal with the issue:

  • A third party service that made use of the client data, to in turn create things that went back into the client data. This one wasn’t too bad, as the worst thing that could happen would be an entity coming back to the client based on data that was no longer relevant. In this case, we can discard the irrelevant information, usually confirming with the user first.
  • A data synchronization process that replicates client data to a remote location. The nice thing about this one is that the data is only flowing one way, so all we really needed to do was detect the database restore and react appropriately. The impact of the database restore on the overall complexity of the synchronization algorithm was relatively minimal.
  • A third party service that we integrated with directly (as opposed to integrating with us like in the first point). Third party entities are linked to local entities, but have separate lifetimes, so the biggest issue as a result of database restores was orphans. Unfortunately, this third party service was a payment provider, so an orphan entity still capable of taking real money is a pretty big problem. I’ll get into it in more detail later on in this post, but another thing we had to consider here was detecting and avoiding duplicate operations, as a result of the user repeating actions that had been undone by a database restore.

Of those three integrations, the third one is the most recent, and has proven to be the most challenging for a variety of reasons.

Localised Area Of Effect

As a result of the restore capabilities inherent in the product, we’ve had to repeatedly take certain measures in order to ensure that our users don’t get themselves into difficult situations.

Where possible, we try to store as much state as we can in the local database, so if a database restore occurs, at least everything goes back to an earlier time in unison. This works remarkably well as long as the source of truth is the local data source, so if a third party integration differs from the local data, trust the local and remove the offending information. Essentially, we are extending the scope of the database restore to the third party, as much as we can anyway.
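In sketch form (the types here are made up for illustration, not our actual code), that reconciliation is just “anything remote that the restored local database doesn’t know about gets removed”:

using System;
using System.Collections.Generic;
using System.Linq;

public static class ThirdPartyReconciler
{
    // Treats the local database as the source of truth: any remote record that the
    // (possibly restored) local database no longer knows about is flagged for removal.
    public static IEnumerable<Guid> FindRemoteRecordsToRemove(
        ISet<Guid> localEntityIds,
        IEnumerable<Guid> remoteEntityIds)
    {
        return remoteEntityIds.Where(id => !localEntityIds.Contains(id));
    }
}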

Of course, just annihilating user data is not always possible (or desirable), even if that’s technically what they asked us to do by performing a restore.

Our second approach is to allow the data to live elsewhere, but force it to synchronize automatically to the local database as necessary. This technically splits the source of truth responsibility, so it can have unfortunate side effects, like synchronizing information that is no longer relevant to the restored state of the database, which isn’t great. A technical mechanism that can make this synchronization process easier is for each entity to have a unique, monotonically increasing numeric identifier, and to be able to query the entities by that identifier and its range (i.e. give me the next 100 with an identifier greater than X). We’ve used this approach a few times now, and assuming the remote entities don’t change over time, it’s fairly effective.
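A sketch of what that looks like from the consumer’s side (entity and repository names invented for the example): the consumer just remembers the highest identifier it has seen and repeatedly asks for the next page.

using System.Collections.Generic;
using System.Linq;

public class RemoteEntity
{
    // Unique, monotonically increasing identifier assigned when the entity is stored remotely.
    public long SequenceId { get; set; }
}

public class RemoteEntityReader
{
    private readonly IQueryable<RemoteEntity> _entities;

    public RemoteEntityReader(IQueryable<RemoteEntity> entities)
    {
        _entities = entities;
    }

    // "Give me the next 100 entities with an identifier greater than X".
    public IReadOnlyList<RemoteEntity> GetNextBatch(long lastSeenId, int batchSize = 100)
    {
        return _entities
            .Where(e => e.SequenceId > lastSeenId)
            .OrderBy(e => e.SequenceId)
            .Take(batchSize)
            .ToList();
    }
}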

As you can imagine, both of these things involve a significant amount of communication overhead, both technically and cognitively, and can result in some seriously confusing bugs and behaviour if it all goes pear-shaped.

A Potent Concept

The approaches that I was just writing about are a bit more relevant when we are acting as our own third party (i.e. we’re writing both sides of the integration, which happens a fair bit).

Sometimes we have to integrate with services and systems provided by actual third parties.

In those cases, we have to get creative, or, unfortunately, put some of the responsibility on the users themselves. If we take orphan data as an example, when detected, we let the user know, and they are then required to deal with the problems and bring the system back into an acceptable state before they can move on. Obviously you have to be careful with how you identify the orphans, especially if you’re not the only integrator, but that is a solvable problem.

If you’re lucky, the third party service will allow you to do things in an idempotent manner, such that even if you accidentally do the same operation twice (or more), the ramifications are controlled. This is particularly relevant when integrating with services that deal with actual money, because a double (or triple) transaction is pretty catastrophic from a reputation point of view.

Of course, in order for the command or operation to be idempotent, you need to have some sort of consistent way to identify and classify it. Usually this means an ID of some sort that doesn’t change.
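For example, with a hypothetical payment gateway abstraction (not the actual provider we integrate with), the stable local identifier doubles as the idempotency key, so repeating the operation cannot result in a second real transaction:

using System;
using System.Threading.Tasks;

// Hypothetical gateway abstraction, purely for illustration.
public interface IPaymentGateway
{
    Task ChargeAsync(string idempotencyKey, decimal amount);
}

public class PaymentService
{
    private readonly IPaymentGateway _gateway;

    public PaymentService(IPaymentGateway gateway)
    {
        _gateway = gateway;
    }

    public Task ChargeAsync(Guid localPaymentId, decimal amount)
    {
        // The local identifier never changes for this conceptual payment, so submitting
        // the same charge twice collapses into a single real transaction on the other side.
        var idempotencyKey = $"payment-{localPaymentId}";
        return _gateway.ChargeAsync(idempotencyKey, amount);
    }
}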

This gets challenging for us, because a database restore is a brutally efficient operation. Everything that was, ceases to exist, as if it never was to begin with. If the user then chooses to recreate some entity that was interacting with the third party service, there is no guarantee that its identity will be the same, even if it represents the same conceptual thing. For example, IDs generated at the database level might differ based on the order of operations, or might just be different altogether, as in the case when automatically generating GUIDs.

When this happens, we typically have to rely on the user to know what to do, which is a bit of a cop-out, but compromises have to be made sometimes.

Conclusion

Easily one of the most important lessons from supporting arbitrary database backups and restores is just how costly and complicated they can make integrations.

The older and more well established a piece of software gets, the more likely it will be to have revenue growth opportunities involving other parties, especially if it forms a community of sorts. I mean, it’s probably not cost-effective to try and do everything yourself, no matter how amazing your organisation is.

Keep in mind, to the user, all of this should just work, because they don’t see the complexity inherent in the situation.

That means if you screw it up, the dissatisfaction level will be high, but if you get it right, at best they will be neutral.

To quote a great episode of Futurama:

When you do things right, people won’t be sure you’ve done anything at all


I’ve written a lot of words on this blog about the data synchronization algorithm. Probably too many to be honest, but it’s an interesting technical topic for me, so the words come easily.

Not much has changed since the optimization to the differencing check to stop it from wastefully scanning the entire table; it’s just been quietly chugging along, happily grabbing data whenever clients opt in, and just generally being useful.

As we accumulate more data though, a flaw in the system is becoming more obvious.

Duplicates.

Once Uploaded, Data Lives Forever

I like to think that the data synchronization algorithm is really good at what it does.

Given a connection to a legacy database and some identifying information (i.e. the identity of the client), it will make sure that a copy of the data in that database exists remotely, and then, as the underlying database changes, ensure those changes are also present.

The actual sync algorithm is provided to clients in the form of a plugin for a component that enables services for their entire office, like server-side automated backups and integrations with (other) cloud services. All of this hangs on a centralised store of registered databases, which the client is responsible for maintaining. The whole underlying system was built before my time, and while it’s a little rough around the edges, it’s pretty good.

Unfortunately, it does have one major flaw.

When a client registers a database (generally done by supplying a connection string), that database is given a unique identifier.

If the client registers the same physical database again (maybe they moved servers, maybe they lost their settings due to a bug, maybe support personnel think that re-registering databases is like doing a computer reboot), they get a new database identifier.

For the sync process, this means that all of the data gets uploaded again, appearing as another (separate) database belonging to the same client. In most cases the old registration continues to exist, but it probably stops being updated. Occasionally the client ends up actively uploading data from the same database under more than one registration, but that kind of thing is pretty rare.

For most use cases this sort of thing is mostly just annoying, as the client will select the right database whenever they interact with whatever system is pulling data from the cloud (and we generally try to hide databases that look like they would have no value to the customer, like ones that haven’t been actively updated in the last 24 hours).

For business intelligence though, it means a huge amount of useless duplicate data, which has all sorts of negative effects on the generated metrics.

Well, Until We Delete It And Take Its Power

From an engineering point of view, we should fix the root flaw, ensuring that the same physical database is identified correctly whenever it participates in the system.

As always, reality tends to get in the way, and unpicking that particular beast is not a simple task. It’s not off the table completely, it’s just less palatable than it could be.

Even if the flaw is fixed though, the duplicate data that already exists is not going to magically up and disappear out of a respect for our engineering prowess. We’re going to have to deal with it anyway, so we might as well start there.

Algorithmically, a data set (customer_id-database_id pair) can be considered a duplicate of another data set if and only if:

  • The customer_id matches (we ignore duplicates across clients, for now anyway)
  • The data set contains at least 25 GUID identifiers that also appear in the other data set (each entity in the database generally has both a numerical and GUID identifier, so we just use the most common entity)

Nothing particularly complicated or fancy.
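As a rough illustration (the types and property names here are invented for the example, not the actual schema), the check boils down to something like this:

using System;
using System.Collections.Generic;
using System.Linq;

public class DataSet
{
    public string CustomerId { get; set; }
    public string DatabaseId { get; set; }

    // GUID identifiers for the most common entity in the data set.
    public HashSet<Guid> EntityGuids { get; set; }
}

public static class DuplicateDetection
{
    public static bool IsDuplicateOf(DataSet candidate, DataSet other)
    {
        // Duplicates are only considered within a single customer.
        if (candidate.CustomerId != other.CustomerId) return false;

        // At least 25 shared GUID identifiers marks the pair as a duplicate.
        return candidate.EntityGuids.Intersect(other.EntityGuids).Count() >= 25;
    }
}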

For the automated process itself, there are a few things worth considering:

  • It needs to communicate clearly what it did and why
  • There is value in separating the analysis of the data sets from the actions that were performed (and their results)
  • We’ll be using TeamCity for actually scheduling and running the process, so we can store a full history of what the process has done over time
  • To minimise risk, it’s useful to be able to tell the process “identify all the duplicates, but only delete the first X”, just in case it tries to delete hundreds and causes terrible performance problems

Taking all of the above into account, we created a simple C# command line application that could be run like this:

Cleanup.exe -c {connection-string} -a {path-to-analysis-file} -d {path-to-deletion-results-file} -limitDeletionCount {number-to-delete}

Like everything we do, it gets built, tested, packaged (versioned), and uploaded to our private Nuget feed. For execution, there is a daily task in TeamCity to download the latest package and run it against our production database.

It’s Like A Less Impressive Quickening

The last thing to do is make sure that we don’t ever delete any data that might still have value, to either us or the client.

As I mentioned above, the main reason that duplicates happen is when a client re-registers the same database for some reason. Upon re-registration, the “new” database will begin its data synchronization from scratch.

During the period of time where data is still uploading for the “new” database, but all the “old” data is still hanging around, how can we reasonably say which data set is the duplicate and should be deleted?

If we go off the number of records, we’d almost certainly delete the “new” database mistakenly, which would just start it uploading again from scratch, and we’d get into an infinite stupidity loop.

We need some sort of indication that the data is “recent”, but we can’t use the timestamps on the data itself, because they are just copies from the local database, and the oldest data uploads first.

Instead we need to use timing information from when the data set last participated in the sync process, i.e. a recency indicator.
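Sketching that out (again with invented types, just to show the shape of the rule), anything that has synced within the recency limit is excluded from deletion:

using System;
using System.Collections.Generic;
using System.Linq;

public class DuplicateCandidate
{
    public string DatabaseId { get; set; }

    // When this data set last participated in the sync process.
    public DateTimeOffset LastSyncedUtc { get; set; }
}

public static class RecencyFilter
{
    // A freshly re-registered database is still actively syncing, so anything
    // touched within the recency limit is never deleted.
    public static IEnumerable<DuplicateCandidate> SafeToDelete(
        IEnumerable<DuplicateCandidate> duplicates,
        TimeSpan recencyLimit,
        DateTimeOffset now)
    {
        return duplicates.Where(d => now - d.LastSyncedUtc > recencyLimit);
    }
}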

A small modification of the tool later, and its execution looks like this:

Cleanup.exe -c {connection-string} -a {path-to-analysis-file} -d {path-to-deletion-results-file} -limitDeletionCount {number-to-delete} -recencyLimitDays {dont-delete-if-touched-this-recently}

We currently use 7 days as our recency limit, but once we’re more comfortable with the process, we’ll probably tune it down to 1 or 2 days (just to get rid of the data as soon as we can).

Conclusion

To be honest, we’ve known about the duplicate data flaw for a while now, but as I mentioned earlier, it didn’t really affect customers all that much. We’d already put some systems in place to allow customers to only select recently synced databases, so from their point of view, there might be a period where they could see multiple, but that would usually go away relatively quickly.

It wasn’t until we noticed the duplicates seeping into our metrics (which we use to make business decisions!) that we realised we really needed to do something about them, thus the automated cleanup.

A nice side effect of this is that when we did the duplicate analysis, we realised that something like 30% of the entire database was worthless duplicate data, so there might actually be significant performance gains once we get rid of it all, which is always nice.

To be honest, we probably should have just fixed the flaw as soon as we noticed it, but it’s in a component that is not well tested or understood, so there was a significant amount of risk in doing so.

Of course, as is always the case when you make that sort of decision, now we’re paying a different price altogether.

And who can really say which one is more expensive in the end?


When we first started putting together our data synchronization algorithm, it was intended to provide the users of our legacy software with access to their data outside of the confines of their offices, starting with a rudimentary web portal. The idea was that this would help improve the longevity of the software that they were using, while also giving them some exposure to what a cloud based offering would be like, easing the eventual transition to our new product.

Unfortunately the web portal part of the system ended up dead-ending pretty hard as the business moved in a different direction.

The data synchronization that powered it though?

That ended up becoming something incredibly valuable in its own right, as it very quickly became clear that if our goal was to ease the transition from our classic desktop software to a new and shiny cloud platform, it would help to have a sexy migration experience.

We Named Our Squad Humpback

In the scheme of things, data migrations from our legacy product to the cloud platform are old news. In fact, we’ve been doing that sort of thing since not long after the cloud platform was ready for the very first customer.

My team hasn’t owned the process that entire time though (even though we owned the data synchronization), so it wasn’t until relatively recently that I really started digging into how the whole shebang works.

At a high level, we have a team focused on actually stewarding new customers through the onboarding and migrations process. This team makes extensive use of a tool that automates parts of that process, including the creation of the overarching customer entity and the migration of data from their old system. The tool is…okay at doing what it needs to do, but it definitely has a bunch of areas that could be improved.

For example, when a migration fails, it’s not immediately apparent to the user why. You have to do a little bit of digging to get to the root of the problem.

Speaking of failures, they can generally be broken down into two very broad categories:

  • Complete and total failure to execute the migration
  • Failure to do parts of the migration, even though it technically completed

The second point is the most common, and the users take that information (“these things did not migrate”) and supply it to the client, so that they can either fix their underlying problems or sign off that they are okay with those limitations.

The first point is rare, but brutal, and usually encapsulates some sort of fundamental error, like the database going missing, or some critical communications problem.

Recently we hit a brand new complete and total failure which presented with the following error:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x92\x9B\xF0\x9F...' for column 'data' at row 1

Those escaped characters in the error message?

That’s the 💛 emoji (and then some other stuff).

Because Whales Migrate

Of course, we didn’t know that the escaped characters in the error message were the 💛 emoji at first. We just knew that it failed miserably.

At a high level, the entire migration process is a pretty simple ETL:

  1. Grabs the data for the customer (extract)
  2. Transforms that data into an appropriate structure (transform)
  3. Makes a series of API requests to create the appropriate entities in the cloud platform (load)

In between steps one and two, the data is actually inserted into a staging database, which is MySQL (Aurora specifically when it’s deployed in AWS).

As near as we could tell (the error message and logging weren’t great), when some of the data from the legacy system was inserted into the staging database, the whole thing exploded.

A quick internet search on the error message later, and:

  • The first character was totally the 💛 emoji
  • We had almost certainly set the charset wrong on the database, which is why it was rejecting the data

The 💛 emoji specifically (but also other important emojis, like [NOTE: For some weird reason I could not get the pile of poop emoji to work here, so you’ll just have to imagine it]) requires four bytes to be represented in UTF-8, and unfortunately the utf8 charset in MySQL doesn’t really support that sort of thing. In fact, strictly speaking, the utf8 charset in MySQL only partially implements UTF-8 (it caps characters at three bytes), which can lead to all sorts of weird problems much more important than simple emoji loss.

Luckily, the fix for us is simple enough. Even if the final destination (the cloud platform) does not support emojis, the migration process shouldn’t just explode. This means we need to change the charset of the staging database to utf8mb4, and then do a bit of a song and dance to get everything working as expected for all of the relevant columns.

Once we have the data, the transform can safely trim out the emojis, and we can give appropriate messages to the client being migrated explaining what we did to their data and why. It’s not even a fatal error, just a warning that we had to munge their data on the way through in order for it to work properly.
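For illustration (in C#, to match the rest of this blog; the migration tooling itself is a different stack), trimming the emojis amounts to stripping anything outside the Basic Multilingual Plane, since those are the characters that need four bytes in UTF-8:

using System.Text;

public static class EmojiStripper
{
    // Removes characters outside the Basic Multilingual Plane (encoded as surrogate pairs),
    // which are the ones that require four bytes in UTF-8 and upset MySQL's utf8 charset.
    public static string StripNonBmp(string input)
    {
        var builder = new StringBuilder(input.Length);
        for (var i = 0; i < input.Length; i++)
        {
            if (char.IsHighSurrogate(input[i]) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
            {
                i++; // skip both halves of the surrogate pair
                continue;
            }
            if (!char.IsSurrogate(input[i]))
            {
                builder.Append(input[i]);
            }
        }
        return builder.ToString();
    }
}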

Conclusion

I’m honestly surprised that the emojis made it all the way to the migration before the first failure.

I fully expected the legacy software to explode as soon as it encountered an emoji, or maybe the data synchronization process at the very least. Being that the sync process goes from an MSSQL database (on the client side), through a .NET API (involving JSON serializations) and then into a PostgreSQL database (server side), I figured at least one of those steps would have had some sort of issue, but the only thing we encountered was PostgreSQL’s dislike of the null character (and that was ages ago).

In the end, the problem was surmountable, but it was very unexpected all the same.

The saddest thing is that this isn’t even the craziest user data that we’ve seen.

One time we found 50 MB of meaningless RTF documents embedded in a varchar(MAX) column.