
Over two months ago I commented that I was going to rewrite our ELB Logs Processor in C#. You know, a language that I actually like and respect.

No, it didn’t take me two months to actually do the rewrite (what a poor investment that would have been); I finished it up a while ago. I just kept forgetting to write about it.

Thus, here we are.

HTTP Is The Way For Me

To set the stage, let’s go back a step.

The whole motivation for even touching the ELB Logs Processor again was to stop it pushing all of its log events to our Broker via TCP. The reasoning here was that TCP traffic is much harder to reason about, and Logstash has much better support around tracking, retrying and otherwise handling HTTP traffic than it does for TCP, so the switch makes sense as a way to improve the overall health of the system.

I managed to flip the Javascript ELB Logs Processor to HTTP using the Axios library (which is pretty sweet), but it all fell apart when I pushed it into production, and it failed miserably with what looked like memory issues.

Now, this is the point where a saner person might have just thrown enough memory at the Lambda function to get it to work and called it a day. Me? I’d had enough of Javascript at that point, so I started looking elsewhere.

At this point I was well aware that AWS Lambda offers execution of C# via .NET Core, as an alternative to Node.js.

Even taking into account my inexperience with doing anything at all using .NET Core, C# in Lambda was still infinitely preferable to Javascript. The assumption was that the ELB Logs Processor doesn’t actually do much (handle incoming S3 event, read file, send log events to Logstash), so rewriting it shouldn’t take too much effort.

Thus, here we are.

You Got .NET Core In My Lambda!

The easiest way to get started with C# via .NET Core in AWS Lambda is to install the latest version of the AWS Toolkit and then just use Visual Studio. The toolkit provides a wealth of functionality for working with AWS in Visual Studio, including templates for creating Lambda functions as well as functionality for publishing your code to the place where it will actually be executed.

Actually setting up the Lambda function to run .NET Core is also relatively trivial. Assuming you’ve already set up a Lambda function with the appropriate listeners, it’s literally a combo box in the AWS Dashboard where you select the language/runtime version that your code requires.

Using the integrated toolkit and Visual Studio is a great way to get a handle on the general concepts in play and the components that you need to manage, but it’s not how you should do it in a serious development environment, especially when it comes to publishing.

A better way to do it is to go for something a bit more portable, like the actual .NET Core SDK, which is available for download as a simple compressed archive, with no installation required. Once you’ve got a handle on the command line options, it’s the best way to put something together that will also work on a build server of some description.

I’ll talk about the build and deployment process in more detail in the second post in this series, but for now, it’s enough to know that you can use the combination of the command line tools from the .NET Core SDK along with Visual Studio to create, develop, build, package and publish your .NET Core AWS Lambda function without too much trouble.

At that point you can choose to either create your Lambda function manually, or automate its creation through something like CloudFormation. It doesn’t really matter from the point of view of the function itself.

You Got Lambda In My .NET Core!

From the perspective of C#, all you really need to write is a class with a handler function.

namespace Lambda
{
    public class Handler
    {
        public void Handle(System.IO.Stream e)
        {
            // do some things here
        }
    }
}

A stream is the most basic input type, leaving the handling of the incoming data entirely up to you, using whatever knowledge you have about its format and structure.
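
If you do go with the raw stream, everything is manual. A minimal sketch of what that might look like (assuming the Newtonsoft.Json package is referenced, and with a placeholder property name, because at this level the structure of the event is entirely your problem):

namespace Lambda
{
    public class RawHandler
    {
        public void Handle(System.IO.Stream e)
        {
            // The stream is just the raw event payload (typically JSON), so reading
            // and parsing it is entirely your responsibility.
            using (var reader = new System.IO.StreamReader(e))
            {
                var json = reader.ReadToEnd();

                // Parse it however you like; Newtonsoft.Json is the usual suspect.
                var parsed = Newtonsoft.Json.Linq.JObject.Parse(json);

                // "Records" is what an S3 notification happens to contain, but you
                // need to know the structure of the incoming event yourself.
                var records = parsed["Records"];
            }
        }
    }
}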

Once you have the code, you can compile and publish it to a directory of your choosing (to get artefacts that you can move around), and then push it to the AWS Lambda function. As long as you’ve targeted the appropriate runtime as part of the publish (i.e. not a standalone, platform-independent application; you actually need to target the same runtime that you specified in the AWS Lambda function configuration), the only other thing to be aware of is how to tell the function where to start execution.

Specifying the entry point for the function is done via the Handler property, which you can set via the command line (using the AWS CLI), via Powershell Cmdlets, or using the AWS dashboard. Regardless of how you set it, it’s in the format “{assembly}::{namespace}.{class}::{method}” (for the example above, something like “Lambda::Lambda.Handler::Handle”, assuming the assembly is also called Lambda), so if you decide to arbitrarily change your assembly, namespace or class names, you’ll have to keep your Lambda function configuration in sync or it will stop working the next time you publish your code.

At this point you should have enough pieces in place that you can trigger whatever event you’re listening for (like a new S3 file appearing in a bucket) and track the execution of the function via CloudWatch or something similar.

Helpful Abstractions

If you look back at the simple function handler above, you can probably conclude that working with raw streams isn’t the most efficient way of handling incoming data. A far more common approach is to get AWS Lambda to deserialize your incoming event for you. This allows you to use an actual structured object of your choosing, ranging from the pre-supported AWS events (like an S3 event) through to a POCO of your own devising.

In order to get this working all you have to do is add a reference to the Amazon.Lambda.Serialization.Json nuget package, and then annotate your handler with the appropriate attribute, like so:

using Amazon.Lambda.Core;

namespace Lambda
{
    public class Handler
    {
        [LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]
        public void Handle(Amazon.Lambda.S3Events.S3Event e)
        {
            // Do stuff here
        }
    }
}

In the function above, I’ve also made use of the Amazon.Lambda.S3Events nuget package in order to strongly type my input object, because I know that the only listener configured will be for S3 events.
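
To give a rough idea of what the strongly typed event buys you, the handler can pull the bucket and key straight out of each record instead of spelunking through raw JSON. This is just an illustrative sketch (not the actual ELB Logs Processor code):

using Amazon.Lambda.Core;

namespace Lambda
{
    public class Handler
    {
        [LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]
        public void Handle(Amazon.Lambda.S3Events.S3Event e)
        {
            // Each record describes a single S3 event, e.g. a new ELB log file landing in the bucket.
            foreach (var record in e.Records)
            {
                var bucket = record.S3.Bucket.Name;
                var key = record.S3.Object.Key;

                // From here it would be a matter of reading the file from S3 and pushing
                // the resulting log events to Logstash over HTTP.
            }
        }
    }
}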

I’m actually not sure what you would have to do to create a function that handles multiple input types, so I assume the model is to just fall back to a more generic input type.

Or maybe to have multiple functions or something.

To Be Continued

I think that’s as good a place as any to stop and take a breath.

Next week I’ll continue on the same theme and talk about the build, package and publish process that we put together in order to actually get the code deployed.


Choo choo goes the Elasticsearch train.

After the last few blog posts about rolling updates to our Elasticsearch environment, I thought I might as well continue with the Elasticsearch theme and do a quick post about reindexing.

Cartography

An index in Elasticsearch is kind of similar to a table in a relational database, but not really. In the same vein, index templates are kind of like schemas, and field mappings are kind of like columns.

But not really.

If you were using Elasticsearch purely for searching through some set of data, you might create an index and then add some mappings to it manually. For example, if you wanted to make all of the addresses in your system searchable, you might create fields for street, number, state, postcode and other common address elements, and maybe another field for the full address combined (like 111 None St, Brisbane, QLD, 4000 or something), to give you good coverage over the various sort of searches that might be requested.

Then you jam a bunch of documents into that index, each one representing a different address that needs to be searchable.

Over time, you might discover that you could really use a field to represent the unit or apartment number, to help narrow down those annoying queries that involve a unit complex or something.

Well, with Elasticsearch you can add a new field to the index, in a similar way to how you add a new column to a table in a relational database.

Except again, not really.

You can definitely add a new field mapping, but it will only work for documents added to the index after you’ve applied the change. You can’t make that new mapping retroactive. That is to say, you can’t magically make it apply to every document that was already in the index when you created the new mapping.

When it comes to your stock standard ELK stack, your data indexes are generally time based and generated from an index template, which adds another layer of complexity. If you want to change the mappings, you typically just change the template and then wait for the current time period to roll over.

This leaves you in an unfortunate place for historical data, especially if you’ve been conservative with your field mappings.

Or does it?

Dexterous Storage

In both of the cases above (the manually created and maintained index, the swarm of indexes created automatically via a template) it’s easy enough to add new field mappings and have them take effect moving forward.

The hard part is always the data that already exists.

That’s where reindexing comes in.

Conceptually, reindexing is taking all of the documents that are already in an index and moving them to another index, where the new index has all the field mappings you want in it. In moving the raw documents like that, Elasticsearch will redo everything that it needs to do in order to analyse and break down the data into the appropriate fields, exactly like the first time the document was seen.

For older versions of Elasticsearch, the actual document migration had to be done with an external tool or script, but the latest versions (we use 5.5.1) have a reindex endpoint on the API, which is a lot simpler to use.

curl -XPUT "{elasticsearch_url}/{new_index}?pretty" -H "Accept: application/json"
curl -XPOST "{elasticsearch_url}/_reindex?pretty" -H "Content-Type: application/json" -H "Accept: application/json" -d '{ "source": { "index": "{old_index}" }, "dest": { "index": "{new_index}", "version_type": "external" } }'

It doesn’t have to be a brand new index (there are options for how to handle documents that conflict if you’re reindexing into an index that already has data in it), but I imagine that a new index is the most common usage.

The useful side effect of this is that, in requiring a different index, the old one is left intact and unchanged. It’s then completely up to you how to use both the new and old indexes, the most common operation being to delete the old one when you’re happy with how and where the new one is being used.

Seamless Replacement

We’ve changed our field mappings in our ELK stack over time, so while the most recent indexes do what we want them to, the old indexes have valuable historical data sitting around that we can’t really query or aggregate on.

The naive implementation is just to iterate through all the indexes we want to reindex (maybe using a regex or something to identify them), create a brand new index with a suffix (like logstash-2017.08.21-r) and then run the reindex operation via the Elasticsearch API, similar to the example above.

That leaves us with two indexes with the same data in them, which is less than ideal, especially considering that Kibana will quite happily query both indexes when you ask for data for a time period, so we can’t really leave the old one around or we’ll run into issues with duplicate data.

So we probably want to delete the old index once we’re finished reindexing into the new one.

But how do we know that we’re finished?

The default mode for the reindex operation is to wait for completion before returning a response from the API, which is handy, because that is exactly what we want.

The only other thing we needed to consider is that after a reindex, all of the indexes will have a suffix of -r, and our Curator configuration wouldn’t pick them up without some changes. In the interest of minimising the amount of things we had to touch just to reindex, we decided to do the reindex again from the temporary index back into an index named the same as the one we started with, deleting the temporary index once that second operation was done.

When you do things right, people won’t be sure you’ve done anything at all.

Danger Will Robinson

Of course, the first time I ran the script (iterate through indexes, reindex to temporary index, delete source, reindex back, delete temp) on a real Elasticsearch cluster I lost a bunch of documents.

Good thing we have a staging environment specifically for this sort of thing.

I’m still not entirely sure what happened, but I think it had something to do with the eventually consistent nature of Elasticsearch, the fact that we connect to the data nodes via an AWS ELB, and the reindex being “complete” according to the API but not necessarily synced across all nodes, so the deletion of the source index threw a massive spanner in the works.

Long story short, I switched the script to start the reindex asynchronously and then poll the destination index until it returned the same number of documents as the source. As a bonus, this fixed another problem I had with the HTTP request for the reindex timing out on large indexes, which was nice.
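
The full script is linked in the conclusion, but a rough sketch of the asynchronous reindex plus document count polling (written here in C# with HttpClient purely for illustration; the delays and names are made up) looks something like this:

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class ReindexSketch
{
    // Kick off a reindex without waiting for completion, then poll the destination
    // document count until it matches the source.
    public static async Task ReindexAndWait(HttpClient client, string baseUrl, string source, string dest)
    {
        var body = new JObject
        {
            ["source"] = new JObject { ["index"] = source },
            ["dest"] = new JObject { ["index"] = dest, ["version_type"] = "external" }
        };

        // wait_for_completion=false makes Elasticsearch return immediately (with a task id),
        // which avoids the HTTP timeouts you get waiting on large indexes.
        var response = await client.PostAsync(
            $"{baseUrl}/_reindex?wait_for_completion=false",
            new StringContent(body.ToString(), Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        var expected = await CountDocuments(client, baseUrl, source);
        while (await CountDocuments(client, baseUrl, dest) < expected)
        {
            // The source must not be receiving writes, or the counts will never line up.
            await Task.Delay(TimeSpan.FromSeconds(30));
        }
    }

    private static async Task<long> CountDocuments(HttpClient client, string baseUrl, string index)
    {
        var raw = await client.GetStringAsync($"{baseUrl}/{index}/_count");
        return (long)JObject.Parse(raw)["count"];
    }
}

The second hop (temporary index back into the original name) is just another call to the same helper, with the relevant deletes in between.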

The only downside of this is that we can’t reindex an index that is currently being written to (because the document counts will definitely change over the period of time the reindex occurs), but I didn’t want to do that anyway.

Conclusion

I’ve uploaded the full script to Github. Looking at it now, it’s a bit more complicated than you would expect, even taking into account the content of this post, but as far as I can tell, it’s pretty robust.

All told, I probably spent a bit longer on this than I should have, especially taking into account that it’s not something we do every day.

The flip side of that is that it’s extremely useful to know that old data is not just useless when we update our field mappings, which is nice.


In last week’s post I explained how we would have liked to do updates to our Elasticsearch environment using CloudFormation. Reality disagreed with that approach and we encountered timing problems as a result of the ES cluster and CloudFormation not talking with one another during the update.

Of course, that means that we need to come up with something ourselves to accomplish the same result.

Move In, Now Move Out

Obviously the first thing we have to do is turn off the Update Policy for the Auto Scaling Groups containing the master and data nodes. With that out of the way, we can safely rely on CloudFormation to update the rest of the environment (including the Launch Configuration describing the EC2 instances that make up the cluster), safe in the knowledge that CloudFormation is ready to create new nodes, but will not until we take some sort of action.

At that point it’s just a matter of controlled node replacement using the auto healing capabilities of the cluster.

If you terminate one of the nodes directly, the AWS Auto Scaling Group will react by creating a replacement EC2 instance, and it will use the latest Launch Configuration for this purpose. When that instance starts up it will get some configuration deployed to it by Octopus Deploy, and shortly afterwards will join the cluster. With a new node in play, the cluster will react accordingly and rebalance, moving shards and replicas to the new node as necessary until everything is balanced and green.

This sort of approach can be written in just about any scripting language; our poison of choice is Powershell, which was then embedded inside the environment nuget package to be executed whenever an update occurs.

I’d copy the script here, but it’s pretty long and verbose, so here is the high level algorithm instead:

  • Iterate through the master nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current master node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
  • Iterate through the data nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current data node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
      • Wait for the cluster to go yellow (indicating rebalancing is occurring)
      • Wait for the cluster to go green (indicating rebalancing is complete). This can take a while, depending on the amount of data in the cluster

As you can see, there isn’t really all that much to the algorithm, and the hardest part of the whole thing is knowing that you should wait for the node to leave/join the cluster and for the cluster to rebalance before moving on to the next replacement.

If you don’t do that, you risk destroying the cluster by taking away too many of its parts before it’s ready (which was exactly the problem with leaving the update to CloudFormation).
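
The waiting itself is nothing fancy, just polling the cluster APIs until they say what you want them to say. The actual implementation is the Powershell mentioned above; the following is a rough C# sketch of the “wait for the new node, then wait for green” pieces, with invented names and timeouts:

using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class ClusterWaitSketch
{
    // Wait until a node with the given name shows up in the cluster.
    public static async Task WaitForNode(HttpClient client, string baseUrl, string nodeName, TimeSpan timeout)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            var raw = await client.GetStringAsync($"{baseUrl}/_cat/nodes?format=json&h=name");
            if (JArray.Parse(raw).Any(n => (string)n["name"] == nodeName)) return;
            await Task.Delay(TimeSpan.FromSeconds(15));
        }
        throw new TimeoutException($"Node {nodeName} did not join the cluster in time");
    }

    // Wait until the cluster reports green, i.e. all shards and replicas are assigned.
    public static async Task WaitForGreen(HttpClient client, string baseUrl, TimeSpan timeout)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            var raw = await client.GetStringAsync($"{baseUrl}/_cluster/health");
            if ((string)JObject.Parse(raw)["status"] == "green") return;
            await Task.Delay(TimeSpan.FromSeconds(30));
        }
        throw new TimeoutException("Cluster did not go green in time");
    }
}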

Hands Up, Now Hands Down

For us, the most common reason to run an update on the ELK environment is when there is a new version of Elasticsearch available. Sure we run updates to fix bugs and tweak things, but those are generally pretty rare (and will get rarer as time goes on and the stack gets more stable).

As a general rule of thumb, assuming you don’t try to jump too far all at once, new versions of Elasticsearch are pretty easily integrated.

In fact, you can usually have nodes in your cluster at the new version while there are still active nodes on the old version, which is nice.

There are at least two caveats that I’m aware of though:

  • The latest version of Kibana generally doesn’t work when you point it towards a mixed cluster. It requires that all nodes are running the same version.
  • If new indexes are created in a mixed cluster, and the primary shards for that index live on a node with the latest version, nodes with the old version cannot be assigned replicas.

The first one isn’t too problematic. As long as we do the upgrade overnight (unattended), no-one will notice that Kibana is down for a little while.

The second one is a problem though, especially for our production cluster.

We use hourly indexes for Logstash, so a new index is generally created every hour or so. Unfortunately it takes longer than an hour for the cluster to rebalance after a node is replaced.

This means that the cluster is almost guaranteed to be stuck in the yellow status (indicating unassigned shards, in this case the replicas from the new index that cannot be assigned to the old node), which means that our whole process of “wait for green before continuing” is not going to work properly when we do a version upgrade on the environment that actually matters: production.

Lucky for us, the API for Elasticsearch is pretty amazing, and allows you to get all of the unassigned shards, along with the reason why they were unassigned.

What this means is that we can keep our process the same, and when the “wait for green” part of the algorithm times out, we can check to see whether or not the remaining unassigned shards are just version conflicts, and if they are, just move on.
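
As a sketch of what that check might look like (again in C# purely for illustration; the “does the explanation mention a version difference” test is a placeholder for whatever text the allocation explain API actually returns in your cluster):

using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class UnassignedShardSketch
{
    // Returns true if every unassigned shard looks like it is only unassigned because
    // of a version mismatch between nodes, i.e. safe to move on to the next node.
    public static async Task<bool> OnlyVersionConflicts(HttpClient client, string baseUrl)
    {
        // _cat/shards lists every shard along with its state.
        var raw = await client.GetStringAsync($"{baseUrl}/_cat/shards?format=json&h=index,shard,prirep,state");
        var unassigned = JArray.Parse(raw).Where(s => (string)s["state"] == "UNASSIGNED");

        foreach (var shard in unassigned)
        {
            // The allocation explain API gives the detailed reason a specific shard
            // cannot be allocated, including the per-node decisions.
            var request = new JObject
            {
                ["index"] = (string)shard["index"],
                ["shard"] = int.Parse((string)shard["shard"]),
                ["primary"] = (string)shard["prirep"] == "p"
            };
            var response = await client.PostAsync(
                $"{baseUrl}/_cluster/allocation/explain",
                new StringContent(request.ToString(), Encoding.UTF8, "application/json"));
            var explanation = await response.Content.ReadAsStringAsync();

            // Placeholder check: treat anything that does not mention a version
            // difference as a real problem.
            if (!explanation.Contains("version")) return false;
        }

        return true;
    }
}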

Works like a charm.

Tell Me What You’re Gonna Do Now

The last thing that we need to take into account during an upgrade is related to Octopus Tentacles.

Each Elasticsearch node that is created by the Auto Scaling Group registers itself as a Tentacle so that it can have the Elasticsearch configuration deployed to it after coming online.

With us terminating nodes constantly during the upgrade, we generate a decent number of dead Tentacles in Octopus Deploy, which is not a situation you want to be in.

The latest versions (3+ I think) of Octopus Deploy allow you to automatically remove dead tentacles whenever a deployment occurs, but I’m still not sure how comfortable I am with that piece of functionality. It seems like if your Tentacle is dead for a bad reason (i.e. it’s still there, but broken) then you probably don’t want to just clean it up and keep on chugging along.

At this point I would rather clean up the Tentacles that I know to be dead because of my actions.

As a result of this, one of the outputs from the upgrade process is a list of the EC2 instances that were terminated. We can easily use the instance name to lookup the Tentacle in Octopus Deploy, and remove it.

Conclusion

What we’re left with at the end of this whole adventure is a fully automated process that allows us to easily deploy changes to our ELK environment and be confident that not only have all of the AWS components updated as we expect them to, but that Elasticsearch has been upgraded as well.

Essentially exactly what we would have had if the CloudFormation update policy had worked the way that I initially expected it to.

Speaking of which, it would be nice if AWS gave a little bit more control over that update policy (like timing, or waiting for a signal from a system component before moving on), but you can’t win them all.

Honestly, I wouldn’t be surprised if there was a way to override the whole thing with a custom behaviour, or maybe a custom CloudFormation resource or something, but I wouldn’t even know where to start with that.

I’ve probably run the update process around 10 times at this point, and while I usually discover something each time I run it, each tweak makes it more and more stable.

The real test will be what happens when Elastic.co releases version 6 of Elasticsearch and I try to upgrade.

I foresee explosions.


It’s been a little while since I made a post on Elasticsearch. Time to remedy that.

Our log stack has been humming along relatively well ever since we took control of it. It’s not perfect, but it’s much better than it was.

One of the nicest side effects of the restructure has been the capability to test our changes in the CI/Staging environments before pushing them into Production. It’s saved us from a few boneheaded mistakes already (mostly just ES configuration blunders), which has been great to see. It does make pushing things into the environment we actually care about a little bit slower than it otherwise would be, but I’m willing to make that tradeoff for a bit more safety.

When I was putting together the process for deploying our log stack (via Nuget, Powershell and Octopus Deploy), I tried to keep in mind what it would be like when I needed to deploy an Elasticsearch version upgrade. To be honest, I thought I had a pretty good handle on it:

  • Make an AMI with the new version of Elasticsearch on it
  • Change the environment definition to reference this new AMI instead of the old one
  • Deploy the updated package, leveraging the Auto Scaling Group instance replacement functionality
  • Dance like no-one is watching

The dancing part worked perfectly. I am a graceful swan.

The rest? Not so much.

Rollin’, Rollin’

I think the core issue was that I had a little bit too much faith in Elasticsearch to react quickly and robustly in the face of random nodes dying and being replaced.

Don’t get me wrong, it’s pretty amazing at what it does, but there are definitely situations where it is understandably incapable of adjusting and balancing itself.

Case in point, the process that occurs when an AWS Auto Scaling Group starts doing a rolling update because the definition of its EC2 instance launch configuration has changed.

When you use CloudFormation to initialize an Auto Scaling Group, you define the instances inside that group through a configuration structure called a Launch Configuration. This structure contains the definition of your EC2 instances, including the base AMI, security groups, tags and other meta information, along with any initialization that needs to be performed on startup (user data, CFN init, etc).

Inside the Auto Scaling Group definition in the template, you decide what should be the appropriate reaction upon detecting changes to the launch configuration, which mostly amounts to a choice between “do nothing” or “start replacing the instances in a sane way”. That second option is referred to as a “rolling update”, and you can specify a policy in the template for how you would like it to occur.

For our environment, a new ES version means a new AMI, so theoretically, it should be a simple matter to update the Launch Configuration with the new AMI and push out an update, relying on the Auto Scaling Group to replace the old nodes with the new ones, and relying on Elasticsearch to rebalance and adjust as appropriate.

Not that simple unfortunately, as I learned when I tried to apply it to the ES Master and Data ASGs in our ELK template.

Whenever changes were detected, CloudFormation would spin up a new node, wait for it to complete its initialization (which was just machine up + octopus tentacle registration), then it would terminate an old node and rinse and repeat until everything was replaced. This happened for both the master nodes and data nodes at the same time (two different Auto Scaling Groups).

Elasticsearch didn’t stand a chance.

With no feedback loop between ES and CloudFormation, there was no way for ES to tell CloudFormation to wait until it had rebalanced the cluster, replicated the shards and generally recovered from the traumatic event of having a piece of itself ripped out and thrown away.

The end result? Pretty much every scrap of data in the environment disappeared.

Good thing it was a scratch environment.

Rollin’, Rollin’

Sticking with the whole “we should probably leverage CloudFormation” approach, I implemented a script to wait for the node to join the cluster and for the cluster to be green (bash scripting is fun!). The intent was that this script would be present in the baseline ES AMI, would be executed as part of the user data during EC2 instance initialization, and would essentially force the auto scaling process to wait for Elasticsearch to actually be functional before moving on.

This wrought havoc with the initial environment creation though, as the cluster isn’t even valid until enough master nodes exist to elect a primary (which is 3), so while it kind of worked for the updates, initial creation was broken.

Not only that, but in a cluster with a decent amount of data, the whole “wait for green” thing takes longer than the maximum time allowed for CloudFormation Auto Scaling Group EC2 instance replacements, which would cause the auto scaling to time out and the entire stack to fail.

So we couldn’t use CloudFormation directly.

The downside of that is that CloudFormation is really good at detecting changes and determining if it actually has to do anything, so not only did we need to find another way to update our nodes, we needed to find a mechanism that would safely know when that node update should be applied.

To Be Continued

That’s enough Elasticsearch for now I think, so next time I’ll continue with the approach we actually settled on.


Last week I described the recurrence of an issue from late last year with our data synchronization process.

Basically, read IOPS on the underlying database were much higher than expected, causing issues with performance in AWS when the database volume ran out of burst balance (or IO credits, as they are sometimes called).

After identifying that the most recent problem had been caused by the addition of tables, we disabled those tables, restoring performance and stability.

Obviously we couldn’t just leave them turned off though, and we couldn’t throw money at the problem this time, like we did last time. It was time to dig in and fix the problem properly.

But first we’d need to reproduce it.

But The Read Performance Issue In My Store

The good thing about ensuring that your environment is codified is that you can always spin up another one that looks just like the existing one.

Well, theoretically anyway.

In the case of the sync API environment everything was fine. One script execution later and we had a new environment called “sync performance” with the same number of identically sized API instances running the exact same code as production.

The database was a little trickier unfortunately.

You see, the database environment was from the time before I improved our environment deployment process. This meant that it was easy to make one, but hard to update an existing one.

Unfortunately, it was hard enough to update an existing one that the most efficient course of action had been to simply update the live one each time we had to tweak it, so we had diverged from the source code.

First step? Get those two back in sync.

Second step, spin up a new one that looks just like prod, which needed to include a copy of the prod data. Luckily, RDS makes that easy with snapshots.

With a fully functional, prod-like data synchronization environment up and running, all we needed was traffic.

Good thing Gor exists. We still had a deployable Gor component from last time I wanted to replicate traffic, so all we needed to do was make a new Octopus project, configure it appropriately and deploy it to our production API.

Now we had two (mostly) identical environments processing the same traffic, behaving pretty much the same. Because we’d turned off multiple tables in order to stop the high read IOPS, it was a simple matter to turn one back on, causing a less severe version of the issue to reoccur in both environments (higher than normal read IOPS, but not enough to eat burst balance).

With that in place we were free to investigate and experiment.

Is Still Abhorred

I’m going to cut ahead in the timeline here, but we analysed the behaviour of the test environment for a few days, trying to get a handle on what was going on.

Leveraging some of the inbuilt query statistics in PostgreSQL, it looked like the most frequent and costly type of query was related to getting a “version” of the remote table for synchronization orchestration. The second most costly type of query was related to getting a “manifest” of a subset of the table being synchronized for the differencing engine.

Disabling those parts of the API (but leaving the uploads alone) dropped the IOPS significantly, surprising exactly zero people. This did disagree with our hypothesis from last time though, so that was interesting.

Of course, the API is pretty useless without the ability to inform the synchronization process, so it was optimization time.

  • We could try to reduce the total number of calls, reducing the frequency that those queries are executed. We’d already done some work recently to dramatically reduce the total number of calls to the API from each synchronization process though, so it was unlikely we would be able to get any wins here
  • We could implement a cache in front of the API, but this just complicates things and all it will really result in is doing work repeatedly for no benefit (if the process syncs data then asks the API for a status, and gets the cached response, it will just sync the data again)
  • We could reduce the frequency of syncing, doing it less often. Since we already did the work I mentioned above to reduce overall calls, the potential gains here were small
  • We could try to make the queries more efficient. The problem here was that the queries were already using the primary keys of the tables in question, so I’m not entirely sure that any database level optimizations on those tables would have helped
  • We could make getting an answer to the question “give me the remote version of this table” more efficient by using a dedicated data structure to service those requests, basically a fancy database level cache

We prototyped the last option (basically a special cache within the database that contained table versions in a much easier to query format) and it had a positive effect on the overall read IOPS.

But it didn’t get rid of it entirely.

Within The Sound Of Synchronization

Looking into our traffic, we discovered that our baseline traffic had crept up since we’d implemented the skipping strategy in the sync process. Most of that baseline traffic appeared to be requests relating to the differencing engine (i.e. scanning the table to get primary key manifests for comparison purposes), which was one of the expensive types of query that we identified above.

We’d made some changes to the algorithm to incorporate the ability to put a cap on the number of skips we did (for safety, to avoid de-sync edge cases) and to introduce forced skips for tables whose changes we were happy to only sync a few times a day.

A side effect of these changes was that whenever we decided NOT to skip using the local comparison, the most common result of the subsequent local vs remote comparison was choosing to execute the differencing engine. There was a piece of the algorithm missing where it should have been choosing to do nothing if the local and remote were identical, but that did not seem to be working due to the way the skip resolution had been implemented.

Fixing the bug and deploying it caused the read IOPS on our normal production server to drop a little bit, which was good.

The different pattern of traffic plus our prototype table version cache caused a much more dramatic drop in read IOPS in our test environment though. It looked like the two things acting together reduced the demands on the database enough to prevent it from having to read so much all the time.

Conclusion

We’re still working on a production quality cached table version implementation, but I am cautiously optimistic. There are some tricky bits regarding the cache (like invalidation vs updates, and where that sort of decision is made), so we’ve got a bit of work ahead of us.

At this point I’m pretty thankful that we were easily able to both spin up an entirely separate and self contained environment for testing purposes, and that we were able to replicate traffic from one environment to the other without a lot of fuss. Without the capability to reproduce the problem disconnected from our clients and experiment, I don’t think we would have been able to tackle the problem as efficiently as we did.

I’m a little disappointed that a bug in our sync process managed to slip through our quality assurance processes, but I can understand how it happened. It wasn’t strictly a bug with the process itself, as the actions it was performing were still strictly valid, just not optimal. Software with many interconnected dependent components can be a very difficult thing to reason about, and this problem was relatively subtle unless you were looking for it specifically. We might have been able to prevent it from occurring with additional tests, but it’s always possible that we actually had those tests anyway, and during one of the subsequent changes a test failed and was fixed in an incorrect way. I mean, if that was the case, then we need to be more careful about “fixing” broken tests.

Regardless, we’re starting to head into challenging territory with our sync process now, as it is a very complicated beast. So complicated in fact that it’s getting difficult to keep the entire thing in your head at the same time.

Which is scary.