I’m Not Sure If I ❤️ Emojis
When we first started putting together our data synchronization algorithm, it was intended to provide the users of our legacy software with access to their data outside of the confines of their offices, starting with a rudimentary web portal. The idea was that this would help improve the longevity of the software that they were using, while also giving them some exposure to what a cloud based offering would be like, easing the eventual transition to our new product.
Unfortunately the web portal part of the system ended up dead-ending pretty hard as the business moved in a different direction.
The data synchronization that powered it though?
That ended up becoming something incredibly valuable in its own right, as it very quickly became clear that if our goal was to ease the transition from our classic desktop software to a new and shiny cloud platform, it would help to have a sexy migration experience.
We Named Our Squad Humpback
In the scheme of things, data migrations from our legacy product to the cloud platform are old news. In fact, we’ve been doing that sort of thing since not long after the cloud platform was ready for the very first customer.
My team hasn’t owned the process that entire time though (even though we owned the data synchronization), so it wasn’t until relatively recently that I really started digging into how the whole shebang works.
At a high level, we have a team focused on actually stewarding new customers through the onboarding and migrations process. This team makes extensive use of a tool that automates parts of that process, including the creation of the overarching customer entity and the migration of data from their old system. The tool is…okay at doing what it needs to do, but it definitely has a bunch of areas that could be improved.
For example, when a migration fails, its not immediately apparent to the user why. You have to do a little bit of digging to get to the root of the problem.
Speaking of failures, they can generally be broken down into two very broad categories:
- Complete and total failure to execute the migration
- Failure to do parts of the migration, even though it technically completed
The second point is the most common, and the users take that information (“these things did not migrate”) and supply it to the client, so that they can either fix their underlying problems or sign off that they are okay with those limitations.
The first point is rare, but brutal, and usually encapsulates some sort of fundamental error, like the database going missing, or some critical communications problem.
Recently we hit a brand new complete and total failure which presented with the following error:
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x92\x9B\xF0\x9F...' for column 'data' at row 1
Those escaped characters in the error message?
That’s the ❤️ emoji (and then some other stuff).
Because Whales Migrate
Of course, we didn’t know that the escaped characters in the error message were the ❤️ emoji at first. We just knew that it failed miserably.
At a high level, the entire migration process is a pretty simple ETL:
- Grabs the data for the customer (extract)
- Transforms that data into an appropriate structure (transform)
- Makes a series of API requests to create the appropriate entities in the cloud platform (load)
In between steps one and two, the data is actually inserted into a staging database, which is MySQL (Aurora specifically when its deployed in AWS).
As near as we could tell (the error message and logging weren’t great), when some of the data from the legacy system was inserted into the staging database, the whole thing exploded.
A quick internet search on the error message later, and:
- The first character was totally the ❤️ emoji
- We had almost certainly set the charset wrong on the database, which is why it was rejecting the data
The ❤️ emoji specifically (but also other important emojis, like [NOTE: For some weird reason I could not get the pile of poop emoji to work here, so you’ll just have to imagine it]) require four bytes to be represented, and unfortunately the utf8 charset in MySQL doesn’t really support that sort of thing. In fact, strictly speaking, the utf8 charset in MySQL only partially implements UTF-8, which can lead to all sorts of weird problems much more important than simple emoji loss.
Luckily, the fix for us is simple enough. Even if the final destination does not support emojis (the cloud platform), the migration process shouldn’t just explode. This means we need to change the charset of the staging database to utf8mb4, and then do a bit of a song and dance to get everything working as expected for all of the relevant columns.
Once we have the data, the transform can safely trim out the emojis, and we can give appropriate messages to the client being migrated explaining what we did to their data and why. Its not even a fatal error, just a warning that we had to munge their data on the way through in order for it to work properly.
Conclusion
I’m honestly surprised that the emojis made it all the way to the migration before the first failure.
I fully expected the legacy software to explode as soon as it encountered an emoji, or maybe the data synchronization process at the very least. Being that the sync process goes from an MSSQL database (on the client side), through a .NET API (involving JSON serializations) and then into a PostgreSQL database (server side), I figured at least one of those steps would have had some sort of issue, but the only thing we encountered was PostgreSQL’s dislike of the null character (and that was ages ago).
In the end, the problem was surmountable, but it was very unexpected all the same.
The saddest thing is that this isn’t even the craziest user data that we’ve seen.
One time we found 50 MB of meaningless RTF documented embedded in a varchar(MAX) column.