The way we structure our services for deployment should be familiar to anyone who’s worked with AWS before. An ELB (Elastic Load Balancer) containing one or more EC2 (Elastic Compute Cloud) instances, each of which has our service code deployed on it by Octopus Deploy. The EC2 instances are maintained by an ASG (Auto Scaling Group) which uses a Launch Configuration that describes how to create and initialize an instance. Nothing too fancy.

An ELB has logic in it to direct incoming traffic to its instance members, in order to balance the load across multiple resources. In order to do this, it needs to know whether or not a member instance is healthy, and it accomplishes this by using a health check, configured when the ELB is created.

Health checks can be very simple (does a member instance respond to a TCP packet on a certain port) or a little bit more complex (hit some API endpoint on a member instance via HTTP and look for non-200 response codes), but conceptually they are meant to provide an indication that the ELB is okay to continue to direct traffic to this member instance. If a certain number of health checks fail over a period of time (configurable, but something like “more than 5 within 5 minutes”), the ELB will mark a member instance as unhealthy and stop directing traffic to it. It will continue to execute the configured health check in the background (in case the instance comes back online because the problem was transitory), but can also be configured (when used in conjunction with an ASG) to terminate the instance and replace it with one that isn’t terrible.
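For reference, wiring up a classic ELB health check with the AWS SDK for .NET looks something like the sketch below; the load balancer name and the thresholds are illustrative values, not the ones we actually use.

// AmazonElasticLoadBalancingClient and the request types live in the Amazon.ElasticLoadBalancing namespaces.
var client = new AmazonElasticLoadBalancingClient();

// Point the health check at the service's status endpoint and tune how quickly an instance is marked unhealthy.
client.ConfigureHealthCheck(new ConfigureHealthCheckRequest
{
    LoadBalancerName = "my-service-elb",
    HealthCheck = new HealthCheck
    {
        Target = "HTTP:80/status",   // any non-200 response counts as a failed check
        Interval = 30,               // seconds between checks
        Timeout = 5,                 // seconds before a single check is considered a failure
        UnhealthyThreshold = 5,      // consecutive failures before the instance stops receiving traffic
        HealthyThreshold = 2         // consecutive successes before traffic resumes
    }
});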

Up until recently, our services have consistently used a /status endpoint for the purposes of the ELB health check.

It’s a pretty simple endpoint, checking some basic things like “can I connect to the database” and “can I connect to S3”, returning a 503 if anything is wrong.

When we started using an external website monitoring service called Pingdom, it made sense to just use the /status endpoint as well.

Then we complicated things.

Only I May Do Things

Earlier this year I posted about how we do environment migrations. The contents of that post are still accurate (even though I am becoming less and less satisfied with the approach as time goes on), but we’ve improved the part about “making the environment unavailable” before starting the migration.

Originally, we did this by just putting all of the instances in the auto scaling group into standby mode (and shutting down any manually created databases, like ones hosted on EC2 instances), with the intent that this would be sufficient to stop customer traffic while we did the migration. When we introduced the concept of the statistics endpoint, and started comparing data before and after the migration, this was no longer acceptable (because there was no way for us to hit the service ourselves when it was down to everybody).
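For reference, flipping the instances into standby is a single call with the AWS SDK for .NET; the group name and instance list below are illustrative, and this is a sketch rather than our actual migration script.

// AmazonAutoScalingClient and the request types live in the Amazon.AutoScaling namespaces.
var autoScaling = new AmazonAutoScalingClient();

// Standby instances stay alive (and keep their deployed software) but stop receiving traffic from the ELB.
autoScaling.EnterStandby(new EnterStandbyRequest
{
    AutoScalingGroupName = "my-service-asg",
    InstanceIds = instanceIds,                // the instances currently in the group (assumed to be gathered earlier)
    ShouldDecrementDesiredCapacity = true     // stop the ASG from immediately replacing them
});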

What we really wanted to do was put the service into some sort of mode that let us interact with it from an administrative point of view, but no-one else. That way we could execute the statistics endpoint (and anything else we needed access to), but all external traffic would be turned away with a 503 Service Unavailable (and a message saying something about maintenance).
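A rough sketch of how that kind of gate might hang together in a Nancy bootstrapper follows; IMaintenanceMode and the /admin prefix check are assumptions made for the example, not our actual implementation.

public interface IMaintenanceMode
{
    bool IsEnabled { get; }
}

public class Bootstrapper : DefaultNancyBootstrapper
{
    protected override void ApplicationStartup(TinyIoCContainer container, IPipelines pipelines)
    {
        base.ApplicationStartup(container, pipelines);

        var maintenance = container.Resolve<IMaintenanceMode>();

        pipelines.BeforeRequest += ctx =>
        {
            // Administrative endpoints stay reachable so we can run the migration and hit the statistics endpoint.
            var isAdmin = ctx.Request.Path.StartsWith("/admin", StringComparison.OrdinalIgnoreCase);
            if (!maintenance.IsEnabled || isAdmin) return null; // returning null lets the request continue as normal

            // Everyone else is turned away with a 503 and a maintenance message.
            return new Response { StatusCode = HttpStatusCode.ServiceUnavailable, ReasonPhrase = "Down for maintenance" };
        };
    }
}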

This complicated the /status endpoint though. If we made it return a 503 when maintenance mode was on, the instances would eventually be removed from the ELB, which would make the service inaccessible to us. If we didn’t make it return a 503, the external monitoring in Pingdom would falsely report that everything was working just fine, when in reality normal users would be completely unable to use the service.

The conclusion that we came to was that we made a mistake using /status for two very different purposes (ELB membership and external service availability), even though they looked very similar at a glance.

The solution?

Split the /status endpoint into two new endpoints, /health for the ELB and /test for the external monitoring service.

Healthy Services

The /health endpoint is basically just the /status endpoint with a new name. Its response payload looks like this:

{
    "BuildVersion" : "1.0.16209.8792",
    "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
    "DatabaseConnectivity" : true,
    "MaintenanceMode" : false
}

There’s not really much more to say about it, other than its purpose now being clearer and more focused. It’s all about making sure the particular member instance is healthy and should continue to receive traffic. We can probably extend it to do other health related things (like available disk space, available memory, and so on), but it’s good enough for now.
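As a sketch, the Nancy module behind the endpoint doesn’t need to be much more than the following; IDatabaseConnectivityChecker (and the IMaintenanceMode interface from the earlier sketch) are illustrative stand-ins for the real dependencies.

public interface IDatabaseConnectivityChecker
{
    bool CanConnect();
}

public class HealthModule : NancyModule
{
    public HealthModule(IDatabaseConnectivityChecker database, IMaintenanceMode maintenance)
    {
        Get["/health"] = _ =>
        {
            var databaseOk = database.CanConnect();

            var payload = new
            {
                BuildVersion = Assembly.GetExecutingAssembly().GetName().Version.ToString(),
                DatabaseConnectivity = databaseOk,
                MaintenanceMode = maintenance.IsEnabled
            };

            // Anything other than a 200 tells the ELB to stop sending this instance traffic.
            return Response.AsJson(payload, databaseOk ? HttpStatusCode.OK : HttpStatusCode.ServiceUnavailable);
        };
    }
}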

Testing Is My Bag Baby

The /test endpoint though, is a different beast.

At the very least, the /test endpoint needs to return the information present in the /health endpoint. If /health is returning a 503, /test should too. But it needs to do more.

The initial idea was to simply check whether or not we were in maintenance mode, and have the /test endpoint return a 503 when maintenance mode was on, as the only differentiator between /health and /test. That’s not quite good enough though: conceptually we want the /test endpoint to act more like a smoke test of common user interactions, and just checking the maintenance mode flag doesn’t give us enough faith that the service itself is working from a user’s point of view.

So we added some tests that get executed when the endpoint is called, checking commonly executed features.

The first service we implemented the /test endpoint for was our auth service, so the tests included things like “can I generate a token using a known user” and “can I query the statistics endpoint”, but eventually we’ll probably extend it to include other common user operations like “add a customer” and “list customer databases”.

Anyway, the payload for the /test endpoint response ended up looking like this:

{
    "Health" : {
        "BuildVersion" : "1.0.16209.8792",
        "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
        "DatabaseConnectivity" : true,
        "MaintenanceMode" : false
    },
    "Tests" : [{
            "Name" : "QueryApiStatistics",
            "Description" : "A test which returns the api statistics.",
            "Endpoint" : "/admin/statistics",
            "Success" : true,
            "Error" : null
        }, {
            "Name" : "GenerateClientToken",
            "Description" : "A test to check that a client token can be generated.",
            "Endpoint" : "/auth/token",
            "Success" : true,
            "Error" : null
        }
    ]
}

I think it turned out quite informative in the end, and it definitely meets the need of detecting when the service is actually available for normal usage, as opposed to the very specific case of detecting maintenance mode.

The only trick here was that the tests needed to hit the API via HTTP (i.e. localhost), rather than just shortcutting through the code directly. The reason for this is that otherwise each test would not be an accurate representation of actual usage, and would give false results if we added something in the Nancy pipeline that modified the response before it got to the code that the test was executing.

Subtle, but important.
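To make that concrete, a single smoke test behind /test might look something like the sketch below; the port, the credentials and the TestResult shape are assumptions for illustration, not the real implementation.

public class GenerateClientTokenTest
{
    private const string Endpoint = "/auth/token";

    // The port and the credentials would come from configuration; they are hardcoded here for brevity.
    public async Task<TestResult> ExecuteAsync()
    {
        using (var client = new HttpClient { BaseAddress = new Uri("http://localhost:8080") })
        {
            var body = new StringContent("{ \"Username\": \"smoke-test-user\", \"Password\": \"from-configuration\" }", Encoding.UTF8, "application/json");

            // Going through HTTP on localhost means the whole Nancy pipeline gets exercised, not just the handler.
            var response = await client.PostAsync(Endpoint, body);

            return new TestResult
            {
                Name = "GenerateClientToken",
                Description = "A test to check that a client token can be generated.",
                Endpoint = Endpoint,
                Success = response.IsSuccessStatusCode,
                Error = response.IsSuccessStatusCode ? null : string.Format("Unexpected status code [{0}]", (int)response.StatusCode)
            };
        }
    }
}

public class TestResult
{
    public string Name { get; set; }
    public string Description { get; set; }
    public string Endpoint { get; set; }
    public bool Success { get; set; }
    public string Error { get; set; }
}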

Conclusion

After going to the effort of clearly identifying the two purposes that the /status endpoint was fulfilling, it was pretty clear that we would be better served by having two endpoints rather than just one. It was obvious in retrospect, but it wasn’t until we actually encountered a situation that required them to be different that we really noticed.

A great side effect of this was that we realised we needed to have some sort of smoke tests for our external monitoring to access on an ongoing basis, to give us a good indication of whether or not everything was going well from a user’s point of view.

Now our external monitoring actually stands a chance of tracking whether or not users are having issues.

Which makes me happy inside, because that’s the sort of thing I want to know about.

Accidentally breaking backwards compatibility is an annoying class of problem that can be particularly painful to deal with once it occurs, especially if it gets out into production before you realise.

From my experience, the main issue with this class of problem is that it’s unlikely to be tested for, increasing the risk that when it does occur, you don’t notice until it’s dangerously late.

Most of the time, you’ll write tests as you add features to a piece of software, like an API. Up until the point where you release, you don’t really need to think too hard about breaking changes; you’ll just do what is necessary to make sure that your feature is working right now and then move on. You’ll probably implement tests for it, but they will be aimed at making sure it’s functioning.

The moment you release though, whether it be internally or externally, you’ve agreed to maintain whatever contract you’ve set in place, and changing this contract can have disastrous consequences.

Even if you didn’t mean to.

Looking Backwards To Go Forwards

In my case, I was working with that service that uses RavenDB, upgrading it to have a few new endpoints to cleanup orphaned and abandoned data.

During development I noticed that someone had duplicated model classes from our models library inside the API itself, apparently so they could decorate them with attributes describing how to customise their serialisation into RavenDB.

As is normally the case with this sort of thing, those classes had already started to diverge, and I wanted to nip that problem in the bud before it got any worse.

Using attributes on classes as the means to define behaviour is one of those things that sounds like a good idea at the time, but quickly gets troublesome as soon as you want to apply them to classes that are in another library or that you don’t control. EF is particularly bad in this case, featuring attributes that define information about how the resulting data schema will be constructed, sometimes only relevant to one particular database provider. I much prefer an approach where classes stand on their own and there is something else that provides the meta information normally provided by attributes. The downside of this of course is that you can no longer see everything to do with the class in one place, which is a nicety that the attribute approach provides. I feel like the approach that allows for a solid separation of concerns is much more effective though, even taking into account the increased cognitive load.

I consolidated some of the duplicate classes (can’t fix everything all at once) and ensured that the custom serialisation logic was maintained without attributes, using this snippet of code that initializes the RavenDB document store.

public class RavenDbInitializer
{
    public void Initialize(IDocumentStore store)
    {
        var typesWithCustomSerialization = new[]
        {
            typeof(CustomStruct),
            typeof(CustomStruct?),
            typeof(CustomStringEnumerable), 
            typeof(OtherCustomStringEnumerable)
        };

        store.Conventions.CustomizeJsonSerializer = a => a.Converters.Add(new RavenObjectAsJsonPropertyValue(typesWithCustomSerialization));
        store.ExecuteIndex(new Custom_Index());
    }
}

The RavenObjectAsJsonPropertyValue class allows for the custom serialization of a type as a pure string, assuming that the type supplies a Parse method and a ToString that accurately represents the class. This reduces the complexity of the resulting JSON document and avoids mindlessly writing out all of the type’s fields and properties (which we may not need if there is a more efficient representation).
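For context, a converter along those lines only needs a few dozen lines. The sketch below shows the general shape, written against Json.NET directly (RavenDB actually exposes its own internal fork of the serializer, so the namespaces differ, but the converter looks the same); it is not the real RavenObjectAsJsonPropertyValue.

using System;
using System.Collections.Generic;
using Newtonsoft.Json;

public class ObjectAsStringJsonConverter : JsonConverter
{
    private readonly HashSet<Type> _types;

    public ObjectAsStringJsonConverter(IEnumerable<Type> types)
    {
        _types = new HashSet<Type>(types);
    }

    public override bool CanConvert(Type objectType)
    {
        return _types.Contains(objectType);
    }

    public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer)
    {
        // Store the object as a single string property value instead of a nested object full of fields.
        writer.WriteValue(value == null ? null : value.ToString());
    }

    public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer)
    {
        if (reader.Value == null) return null;

        // Unwrap Nullable<T> so that something like CustomStruct? resolves to the struct's static Parse method.
        var target = Nullable.GetUnderlyingType(objectType) ?? objectType;
        var parse = target.GetMethod("Parse", new[] { typeof(string) });
        return parse.Invoke(null, new object[] { reader.Value.ToString() });
    }
}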

After making all of the appropriate changes I ran our suite of unit, integration and functional tests. They all passed with flying colours, so I thought that everything had gone well.

I finished up the changes relating to the new cleanup endpoint (adding appropriate tests), had the code reviewed and then pushed it into master, which pushed it through our CI process and deployed it to our Staging environment, ready for a final test in a production-like environment.

Then the failures began.

Failure Is Not An Option

Both the new endpoint and pre-existing endpoints were failing to execute correctly.

Closer inspection revealed that the failures were occurring whenever a particular object was being read from the underlying RavenDB database and that the errors occurring were related to being unable to deserialize the JSON representation of the object.

As I still remembered making the changes to the custom serialization, it seemed pretty likely that that was the root cause.

It turns out that when I consolidated the duplicate classes, I had forgotten to actually delete the now useless duplicates. As a result, a few references to the deprecated classes had sneakily remained, and because I had manually specified which classes should have custom serialization (the non-duplicate model classes), the leftover classes were just being serialized in the default way (which dumps all of the properties of the object).

But all the tests passed?

The tests passed because I had not actually broken anything from a self contained round trip point of view. Any data generated by the API was able to be read by the same API.

Data generated by previous versions of the API was no longer valid, which was unintentional.

Once I realised my mistake, the fix was easy enough to conceive. Delete the deprecated classes, fix the resulting compile errors and everything would be fine.

Whenever I come across a problem like this though I prefer to put measures in place such that the exact same thing never happens again.

In this case, I needed a test to verify that the currently accepted JSON representation of the entity continued to be able to be deserialized from a RavenDB document store.

With a failing test I could be safe in the knowledge that when it passed I had successfully fixed the problem, and I could have some assurance that the problem will likely not reoccur in the future if someone breaks something in a similar way (and honestly, it’s probably going to be me).

Testing Is An Option Though

Being that the failure was occurring at the RavenDB level, when previously stored data was being retrieved, I had to find a way to make sure that the old data was accessible during the test. I didn’t want to have to setup and maintain a complete RavenDB instance that just had old data in it, but I couldn’t just use the current models to add data to the database in the normal way, because that would not actually detect the problem that I experienced.

RavenDB is pretty good as far as testing goes though. It’s trivially easy to create an embedded data store and while the default way of communicating with the store is to use classes directly (i.e. Store<T>), it gives access to all of the raw database commands as well.
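As a sketch, spinning up a throwaway in-memory store and poking a raw document into it (assuming the RavenDB embedded client package) is only a few lines; the document content here is purely illustrative.

// EmbeddableDocumentStore lives in Raven.Client.Embedded; RavenJObject lives in Raven.Json.Linq.
using (var store = new EmbeddableDocumentStore { RunInMemory = true }.Initialize())
{
    // Raw database commands let us insert a document in whatever shape we like,
    // completely bypassing the typed session and the current model classes.
    store.DatabaseCommands.Put(
        "Entity/1",
        null,
        RavenJObject.Parse("{ \"SomeProperty\": \"a previously valid representation\" }"),
        new RavenJObject());
}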

The solution? Create a temporary data store, use the raw database commands to insert a JSON document of the old format into a specific key then use the normal RavenDB session to query the entity that was just inserted via the generic methods.

[Test]
[Category("integration")]
public void WhenAnEntityFromAPreviousVersionOfTheApplicationIsInsertedIntoTheRavenDatabaseDirectly_ThatEntityCanBeReadSuccessfully()
{
    using (var resolverFactory = new IntegrationTestResolverFactory())
    {
        var assemblyLocation = Assembly.GetAssembly(this.GetType()).Location;
        var assemblyFolder = Path.GetDirectoryName(assemblyLocation);

        var entitySamplePath = Path.Combine(assemblyFolder, @"Data\Entity.json");
        var entityMetadataPath = Path.Combine(assemblyFolder, @"Data\Entity-Metadata.json");

        // The old format JSON lives in files in the test project, captured before the model changes.
        var entityJson = File.ReadAllText(entitySamplePath);
        var entityMetadataJson = File.ReadAllText(entityMetadataPath);

        var key = "Entity/1";

        // Insert the document via the raw database commands so the current model classes are not involved at all.
        var store = resolverFactory.Application().Get<IDocumentStore>();
        store.DatabaseCommands.Put(key, null, RavenJObject.Parse(entityJson), RavenJObject.Parse(entityMetadataJson));

        using (var session = store.OpenSession())
        {
            var entities = session.Query<Entity>().ToArray();
            entities.Length.Should().Be(1);
        }
    }
}

The test failed, I fixed the code, the test passed and everyone was happy. Mostly me.

Conclusion

I try to make sure that whenever I’m making changes to something I take backwards compatibility into account. This is somewhat challenging when you’re working at the API level and thinking about models, but is downright difficult when breaking changes can occur during things like serialization at the database level. We had a full suite of tests, but they all worked on the current state of the code and didn’t take into account the chronological nature of persisted data.

In this particular case, we caught the problem before it became a real issue thanks to our staging environment accumulating data over time and our excellent tester, but that’s pretty late in the game to be catching these sorts of things.

In the future it may be useful to create a class of tests specifically for backwards compatibility shortly after we do a release, where we take a copy of any persisted data and ensure that the application continues to work with that structure moving into the future.

Honestly though, because of the disconnected nature of tests like that, it will be very hard to make sure that they are done consistently.

Which worries me.

Like any normal, relatively competent group of software developers, we write libraries to enable reuse, mostly so we don’t have to write the same code a thousand times. Typically each library is small and relatively self contained, offering only a few pieces of highly cohesive functionality that can be easily imported into something with little to no effort. Of course, we don’t always get this right, and there are definitely a few “Utility” or “Common” packages hanging around, but we do our best.

Once you start building and working with libraries in .NET, the obvious choice for package management is Nuget. Our packages don’t tend to be released to the greater public, so we don’t publish to Nuget.org directly, we just use the built-in Nuget feed supplied by TeamCity.

Well we did, up until recently, when we discovered that the version of TeamCity we were using did not support the latest Nuget API and Visual Studio 2015 didn’t like the old version of the Nuget feed API. This breaking change was opportune, because while it was functional, the TeamCity Nuget feed had been bugging us for a while. It’s difficult to add packages that weren’t created by a Build Configuration, and packages are intrinsically tied to the Artefacts of a Build Configuration, so they can be accidentally deleted when cleaning up. So we decided to move to an actual Nuget server, and we chose Myget because we didn’t feel like rolling our own.

That preface should set the stage for the main content of this post, which is about dealing with the difficulties in debugging when a good chunk of your code is in a library that was prepared earlier.

Step In

To me, debugging is a core part of programming.

Sure, you can write tests to verify functionality and you can view your resulting artefact (webpage, application, whatever), but at some point it’s incredibly helpful to be able to step through your code, line by line, in order to get to the bottom of something (usually a bug).

I’m highly suspicious of languages and development environments that don’t offer an easy way to debug, and of developers that rely on interpreting log statements to understand their code. I imagine this is probably because of a history of using Visual Studio, which has had all of these features forever, but trying to do anything but the most basic actions without a good IDE is hellish. It’s one of the many reasons why I hate trying to debug anything in a purely command line driven environment (like some flavours of Linux or headless Windows).

I spend most of my time dealing with two languages, C# and Powershell, both of which make it relatively easy to get a good debugging experience. Powershell ISE is functional (though not perfect) and Visual Studio is ridiculously feature rich when it comes to debugging.

To tie all of this back in with the introduction though, debugging into C# libraries can be somewhat challenging.

In order to debug C# code you need debugging information in the form of PDB files. These files either need to be generated at compile time, or they need to be generated from the decompiled DLL before debugging. PDB files give the IDE all of the information it needs in order to identify the constituent components of the code, including some information about the source files the code was compiled from. What these files don’t give, however, is the actual source code itself. So while you might have the knowledge that you are in X function, you have no idea what the code inside that function is, nor what any of the variables in play are.

Packaging the PDB files along with the compiled binaries is easy enough using Nuget, assuming you have the option enabled to generate them for your selected Build Configuration.

Packaging source files is also pretty easy.

The Nuget executable provides a -symbols parameter that will automatically create two packages, the normal one and another one containing all of the necessary debug information + all of your source files as of the time of compile.

What do you do with it though?

Source Of All Evil

Typically, if you were uploading your package to Nuget.org, the Nuget executable would automatically upload the symbols package to SymbolSource. SymbolSource is a service that can be configured as a symbol server inside Visual Studio, allowing the application to automatically download debugging information as necessary (including source files).

In our case, we were just including the package as an artefact inside TeamCity, which automatically includes it in the TeamCity Nuget feed. No symbol server was involved.

In fact, I have a vague memory of trying to include both symbol and non-symbol packages and having it not work at all.

At the time, the hacky solution we put in place was to only include the symbols package in the artefacts for the Build Configuration in TeamCity, but to rename it so the filename didn’t include the word symbols. This kept TeamCity happy enough, and it allowed us to debug into our libraries when we needed to (the PDB files would be automatically loaded and it was a simple matter to select the source file inside the appropriate packages directory whenever the IDE asked for them).

This worked up until the point where we switched to Myget.

Packages restored from Myget no longer had the source files inside them, leaving us in a position where our previous approach no longer worked. Honestly, our previous approach was a little hacky, so it’s probably for the best that it stopped working.

Sourcery

It turns out that Myget is a bit smarter than TeamCity when it comes to being a Nuget feed.

It expects you to upload both the normal package and the symbols package, and then it deals with all the fun stuff on the server. For example, it will automatically interpret and edit the PDB files to say that the source can be obtained from the Myget server, rather than from wherever the files were when it was compiled and packaged.

Our hacky solution to get Symbols and Source working when using TeamCity was getting in the way of a real product.

Luckily, this was an easy enough problem to solve while also still maintaining backwards compatibility.

Undo the hack that enforced the generation of a single output from Nuget (the renamed symbols package), upload both files correctly to Myget and then upload just a renamed symbols package to TeamCity separately.

Now anything that is still using TeamCity will keep working in the same way as it always was (manual location of source files as necessary) and all of the projects that have been upgraded to use our new Myget feed will automatically download both symbols and source files whenever it’s appropriate (assuming you have the correct symbol server set up in Visual Studio).

Conclusion

Being able to debug into the source files of any of your packages automatically is extremely useful, but more importantly, it provides a seamless experience to the developer, preventing the waste of time and focus that occurs when flow gets interrupted in order to deal with something somewhat orthogonal to the original task.

The other lesson learned (which I should honestly have engraved somewhere highly visible by now) is that hacked solutions always have side effects that you can’t foresee. It’s one of those unfortunate facts of being a software developer.

From a best practices point of view, it’s almost always better to fully understand the situation before applying any sort of fix, but you have to know how to trade that off against speed. We could have chosen to implement a proper Nuget server when we first ran across the issue with debugging our libraries, rather than hacking together a solution to make TeamCity work approximately like we wanted, but we would have paid some sort of opportunity cost in order to do that.

At the least, what we should have done was treat the symbols package rename for TeamCity as an isolated step, so the build still output the correct packages.

This would have made it obvious that something special was happening for TeamCity and for TeamCity only, making it easier to work with a completely compliant system at a later date.

I suppose the lesson is, if you’re going to put a hack into place, make sure it’s a well constructed hack.

Which seems like a bit of an oxymoron.

We use TeamCity as our Continuous Integration tool.

Unfortunately, our setup hasn’t been given as much love as it should have. It’s not bad (or broken or any of those things), it’s just not quite as well set up as it could be, which increases the risk that it will break and makes it harder to manage (and change and keep up to date) than it could be.

As with everything that has got a bit crusty around the edges over time, the only real way to attack it while still delivering value to the business is by doing it slowly, piece by piece, over what seems like an inordinate amount of time. The key is to minimise disruption, while still making progress on the bigger picture.

Our setup is fairly simple. A centralised TeamCity server and at least 3 Build Agents capable of building all of our software components. We host all of this in AWS, but unfortunately, it was built before we started consistently using CloudFormation and infrastructure as code, so it was all manually setup.

Recently, we started using a few EC2 spot instances to provide extra build capabilities without dramatically increasing our costs. This worked fairly well, up until the spot price spiked and we lost the build agents. We used persistent requests, so they came back, but they needed to be configured again before they would hook up to TeamCity because of the manual way in which they were provisioned.

There’s been a lot of instability in the spot price recently, so we were dealing with this manual setup on a daily basis (sometimes multiple times per day), which got old really quickly.

You know what they say.

“If you want something painful automated, make a developer responsible for doing it manually and then just wait.”

Its Automatic

The goal was simple.

We needed to configure the spot Build Agents to automatically bootstrap themselves on startup.

On the upside, the entire process wasn’t completely manual. We were at least spinning up the instances from a pre-built AMI that already had all of the dependencies for our older, crappier components as well as an unconfigured TeamCity Build Agent on it, so we didn’t have to automate absolutely everything.

The bootstrapping would need to tag the instance appropriately (because for some reason spot instances don’t inherit the tags of the spot request), configure the Build Agent and then start it up so it would connect to TeamCity. Ideally, it would also register and authorize the Build Agent, but if we used controlled authorization tokens we could avoid this step by just authorizing the agents once. Then they would automatically reappear each time the spot instance came back.

So tagging, configuring, service start, using Powershell, with the script baked into the AMI. During provisioning we would supply some UserData that would execute the script.

Not too complicated.

Like Graffiti, Except Useful

Tagging an EC2 instance is pretty easy thanks to the multitude of toolsets that Amazon provides. Our tool of choice is the Powershell cmdlets, so the actual tagging was a simple task.

Getting permission to the do the tagging was another story.

We’re pretty careful with our credentials these days, for some reason, so we wanted to make sure that we weren’t supplying and persisting any credentials in the bootstrapping script. That means IAM.

One of the key features of the Powershell cmdlets (and most of the Amazon supplied tools) is that they are supposed to automatically grab credentials if they are being run on an EC2 instance that currently has an instance profile associated with it.

For some reason, this would just not work. We tried a number of different things to get this to work (including updating the version of the Powershell cmdlets we were using), but in the end we had to resort to calling the instance metadata service directly to grab some credentials.

Obviously the instance profile that we applied to the instance represented a role with a policy that only had permissions to alter tags. Minimal permission set and all that.

Service With a Smile

Starting/stopping services with Powershell is trivial, and for once, something weird didn’t happen causing us to burn days while we tracked down some obscure bug that only manifests in our particular use case.

I was as surprised as you are.

Configuration Is Key

The final step should have been relatively simple.

Take a file with some replacement tokens, read it, replace the tokens with appropriate values, write it back.

Except it just wouldn’t work.

After editing the file with Powershell (a relatively simple Get-Content | ForEach-Object { $_ -replace {token}, {value} } | Out-File) the TeamCity Build Agent would refuse to load.

Checking the log file, its biggest (and only) complaint was that the serverUrl (which is the location of the TeamCity server) was not set.

This was incredibly confusing, because the file clearly had a serverUrl value in it.

I tried a number of different things to determine the root cause of the issue, including:

  • Permissions? Was the file accidentally locked by TeamCity such that the Build Agent service couldn’t access it?
  • Did the rewrite of the tokens somehow change the format of the file (extra spaces, CR LF when it was just expecting LF)?
  • Was the serverUrl actually configured, but inaccessible for some reason (machine proxy settings for example) and the problem was actually occurring not when the file was rewritten but when the script was setting up the AWS Powershell cmdlets proxy settings?

Long story short, it turns out that Powershell doesn’t preserve file encoding when using the Out-File functionality in the way we were using it. It was rewriting the file, which was originally ASCII, as Unicode Little Endian (complete with a Byte Order Mark), and the Build Agent did not like that (it didn’t throw an encoding error either, which is super annoying, but whatever). Forcing an explicit encoding via Out-File’s -Encoding parameter is enough to avoid this.

The error message was both a red herring (yes, it was configured) and also truthful (the Build Agent really was incapable of reading the serverUrl).

Putting It All Together

With all the pieces in place, it was a relatively simple matter to create a new AMI with those scripts baked into it and put it to work straightaway.

Of course, I’d been doing this the whole time in order to test the process, so I certainly had a lot of failures building up to the final deployment.

Conclusion

Even simple automation can prove to be time consuming, especially when you run into weird unforeseen problems, like components not performing as advertised or not even reporting correct errors for you to use for debugging purposes.

Still, it was worth it.

Now I never have to manually configure those damn spot instances when they come back.

And satisfaction is worth its weight in gold.

As I mentioned briefly last week, I’m currently working on some maintenance endpoints that allow us to clean up data that has been identified as abandoned in one of our services (specifically, the service backed by a RavenDB database).

The concept at play is not particularly difficult. We have meta information attached to each set of data belonging to a customer, and this meta information allows us to decide whether or not the data (which is supposed to be transitory anyway) should continue to live in the service, or whether it should be annihilated to reduce load.

In building the endpoints that will allow this functionality to be reliably executed, I needed to construct a set of automated tests to verify that the functionality worked as expected with no side-effects. In other words, the normal approach.

Unfortunately, that’s where the challenge mode kicked in.

Like Oil And Water

One of the key concepts in RavenDB is the index.

Indexes are used for every query you do, except for loading a document by ID, finding the set of documents by ID prefix or finding documents by updated timestamp.

Helpfully, RavenDB will create and maintain one or more indexes for you based on your querying profile (automatic indexes), which means that if you don’t mind relinquishing control over the intricacies of your persistence, it will handle all the things for you. Honestly, it does a pretty good job of this, until you do a query that is somewhat different from all the previous queries, and RavenDB creates a new index to handle the query, which results in a massive CPU, memory and disk spike, which craters your performance.

Our inexperience with RavenDB when this service was first built meant that the difference between automatic and static indexes was somewhat lost on us, and the entire service uses nothing but automatic indexes. Being that we don’t allow for arbitrary, user defined queries, this is fine. In fact, unless automatic indexes are somehow at the root of all of our performance problems, we haven’t had any issues with them at all.

A key takeaway about indexes, regardless of whether they are automatic or static, is that they are an eventually consistent representation of the underlying data.

All writes in RavenDB take place in a similar way to writes in most relational databases like SQL Server. Once a write is accepted and completes, the data is available straight away.

The difference is that in RavenDB, that data is only available straightaway if you query for it based off the ID, the prefix (which is part of the ID) or the order in which it was updated.

Any other query must go through an index, and indexes are eventually consistent, with an unknown time delay between when the write is completed and the data is available via the index.

The eventually consistent model has benefits. It means that writes can be faster (because writes don’t have to touch every index before they are “finished”), and it also means that queries are quicker, assuming you don’t mind that the results might not be completely up to date.

That’s the major cost of course: you trade off speed and responsiveness against the ability to get a guaranteed, definitive answer at any particular point in time.
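To make the distinction concrete, here is a small sketch using the RavenDB client session; the store variable and the Abandoned property are illustrative, not taken from the real model.

using (var session = store.OpenSession())
{
    // Loading by ID goes straight to the document store, so a completed write is visible immediately.
    var entity = session.Load<Entity>("Entity/1");

    // Querying goes through an index, so a write that completed a moment ago may not appear in the results yet.
    var abandoned = session.Query<Entity>()
        .Where(e => e.Abandoned)
        .ToList();
}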

Or Maybe Fire And Gasoline?

Eventual consistency is one of those concepts that infects everything it touches. If you have an API that uses a database that is eventually consistent, that API is also eventually consistent. If your API is eventually consistent, your clients are probably eventually consistent. So on and so forth all the way through the entire system.

If you’re used to relying on something being true after doing it, you have to completely rearrange your thinking, because that is no longer guaranteed to be true.

This is especially painful when it comes to tests, which brings us full circle back to where I started this post.

A standard test consists of some sort of setup (arrange), followed by the execution of some operation or set of operations (act), followed by some validation (assert).

In my case, I wanted to write a Functional Test validating that if I added a bunch of data and then executed the maintenance endpoint, some of that data would be removed (the stuff identified as abandoned) and the rest would still be present. A nice high level test to validate that everything was working together as it should be, to be executed against the API itself (via HTTP calls).

Our functional tests are typically executed against a component deployed via our continuous integration/deployment system. We run Unit and Integration tests after the build, deploy via Octopus to our CI environment, then run Functional tests to validate that all is well. The success of the Functional tests determines whether or not that particular build/package should be allowed to propagate out into our Staging environment and beyond.

The endpoints in play were:

GET /entities
PUT /entities/{id}
GET /admin/maintenance/entities/abandoned
DELETE /admin/maintenance/entities/abandoned

Conceptually, the test would:

  • Seed a bunch of entities to the API (some meeting the abandoned conditions, some not)
  • Run a GET on the maintenance endpoint to confirm that the appropriate number of entities were being identified as abandoned
  • Run a DELETE on the maintenance endpoint, removing a subset of the abandoned entities, returning a response detailing what was removed and how much is left
  • Run a GET on the maintenance endpoint to validate that its response concurred with the DELETE response
  • Repeatedly do DELETE/GET calls to the maintenance endpoint to delete all abandoned entities
  • Run a GET on the entities endpoint to validate that all of the entities that should still be present are, and all of the entities that should have been deleted are gone

Like I said, a nice, high level test that covers a lot of functionality and validates that the feature is working as it would be used in reality. Combined with a set of Unit and Integration tests, this would provide some assurance that the feature was working correctly.

Unfortunately, it didn’t work.

The Right Answer Is Just A While Loop Away

The reason it didn’t work was that it was written with the wrong mindset. I was expecting to be able to add a bunch of data and have that data be immediately visible.

That’s not possible with an eventually consistent data store like RavenDB.

At first I thought I could just deal with the issue by wrapping everything in a fancy retry loop, checking that it eventually became the value that I wanted it to be. I would be able to use the same test structure that I already had; all I would have to do was make sure each part of it (GET, DELETE, GET) executed repeatedly until it met the conditions, up to some maximum number of attempts.

To that end I wrote a simple extension method in the flavour of FluentAssertions that would allow me to check a set of assertions against the return value from a Func in such a way that it would repeatedly execute the Func until the assertions were true or it reached some sort of limit.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using FluentAssertions;
using NSubstitute;
using Serilog;

namespace Api.Tests.Shared.Extensions
{
    public static class FluentAssertionsExtensions
    {
        public static void ShouldEventuallyMeetAssertions<TResult>(this Func<TResult> func, Action<TResult>[] assertions, IRetryStrategy strategy = null)
        {
            strategy = strategy ?? new LinearBackoffWithCappedAttempts(Substitute.For<ILogger>(), 5, TimeSpan.FromSeconds(2));
            var result = strategy.Execute
                (
                    func, 
                    r => 
                    {
                        foreach (var assertion in assertions)
                        {
                            assertion(r);
                        }

                        return true;
                    }
                );

            if (!result.Success)
            {
                throw result.EvaluationErrors.Last();
            }
        }
    }
    
    public interface IRetryStrategy
    {
        RetryResult<TResult> Execute<TResult>(Func<TResult> f, Func<TResult, bool> isValueAcceptable = null);
    }
    
    public class LinearBackoffWithCappedAttempts : IRetryStrategy
    {
        public LinearBackoffWithCappedAttempts(ILogger logger, int maximumNumberOfAttempts, TimeSpan delayBetweenAttempts)
        {
            _logger = logger;
            _maximumNumberOfAttempts = maximumNumberOfAttempts;
            _delayBetweenAttempts = delayBetweenAttempts;
        }

        private readonly ILogger _logger;
        private readonly int _maximumNumberOfAttempts;
        private readonly TimeSpan _delayBetweenAttempts;

        public RetryResult<TResult> Execute<TResult>(Func<TResult> f, Func<TResult, bool> isValueAcceptable = null)
        {
            isValueAcceptable = isValueAcceptable ?? (r => true);
            var result = default(TResult);
            var executionErrors = new List<Exception>();

            var attempts = 0;
            var success = false;
            while (attempts < _maximumNumberOfAttempts && !success)
            {
                try
                {
                    _logger.Debug("Attempting to execute supplied retryable function to get result");
                    result = f();

                    var valueOkay = isValueAcceptable(result);
                    if (!valueOkay) throw new Exception(string.Format("The evaluation of the condition [{0}] over the value [{1}] returned false", isValueAcceptable, result));

                    success = true;
                }
                catch (Exception ex)
                {
                    _logger.Debug(ex, "An error occurred while attempting to execute the retryable function");
                    executionErrors.Add(ex);
                }
                finally
                {
                    attempts++;
                    if (!success) Thread.Sleep(_delayBetweenAttempts);
                }
            }

            var builder = new StringBuilder();
            if (success) builder.Append("Evaluation of function completed successfully");
            else builder.Append("Evaluation of function did NOT complete successfully");

            builder.AppendFormat(" after [{0}/{1}] attempts w. a delay of [{2}] seconds between attempts.", attempts, _maximumNumberOfAttempts, _delayBetweenAttempts.TotalSeconds);
            builder.AppendFormat(" [{0}] total failures.", executionErrors.Count);
            if (executionErrors.Any())
            {
                builder.AppendFormat(" Exception type summary [{0}].", string.Join(", ", executionErrors.Select(a => a.Message)));
            }

            _logger.Information("The retryable function [{function}] was executed as specified. Summary was [{summary}]", f, builder.ToString());

            return new RetryResult<TResult>(success, result, builder.ToString(), executionErrors);
        }
    }

    public class RetryResult<TResult>
    {
        public RetryResult(bool success, TResult result, string summary, List<Exception> evaluationErrors = null)
        {
            Success = success;
            Result = result;
            Summary = summary;
            EvaluationErrors = evaluationErrors;
        }

        public readonly bool Success;
        public readonly TResult Result;
        public readonly List<Exception> EvaluationErrors;
        public readonly string Summary;
    }
}

Usage of the extension method, and some rearranging, let me write the test in a way that was now aware of the eventually consistent nature of the API.
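For illustration, a call to the extension ends up looking something like this; GetAbandonedEntityCount, client, logger and the expected count are hypothetical stand-ins for the helpers in the real test.

Func<int> abandonedCount = () => client.GetAbandonedEntityCount();

// Repeatedly hit the maintenance endpoint until the indexes catch up and the assertion passes (or we give up).
abandonedCount.ShouldEventuallyMeetAssertions(
    new Action<int>[]
    {
        count => count.Should().Be(expectedAbandonedEntities.Count)
    },
    new LinearBackoffWithCappedAttempts(logger, 10, TimeSpan.FromSeconds(5)));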

When I started thinking about it though, I realised that the approach would only work for idempotent requests, which the DELETE verb on the new endpoint was definitely not.

Repeatedly executing the DELETE was changing the output each time, and depending on whether the stars aligned with regards to the RavenDB indexes, it was unlikely to ever meet the set of assertions that I had defined.

Happiness Comes From Compromise

In the end I had to adapt to the concepts at play, rather than adapting the concepts to fit my desired method of testing.

I could still leverage the eventually consistent assertion extension that I wrote, but I had to slightly rethink what I was trying to test.

  • I would still have to seed a bunch of entities to the API (some meeting the abandoned conditions, some not)
  • Instead of doing a single GET, I would have to assert that at least all of my seeded data was eventually present.
  • Instead of running a single DELETE and checking the result, I would have to loop and delete everything until the service told me nothing was left, and then check that it deleted at least all of the things I seeded
  • Instead of doing a final GET to verify, I would have to again assert that at least all of my seeded data that was not supposed to be removed was still present

Subtly different, but still providing a lot of assurance that the feature was working as expected. I won’t copy the full test here because, including the helper methods added to improve readability, it ended up being about 150 lines long.

Conclusion

Sometimes it feels like no matter where I look, RavenDB is there causing me some sort of grief. It has definitely given me an appreciation of the simple nature of traditional databases.

This isn’t necessarily because RavenDB is bad (it’s not, even though we’ve had issues with it), but because the eventual consistency model in RavenDB (and in many other similar databases) makes testing more difficult than it has to be (and testing is already difficult and time consuming, even if it does provide a world of benefits).

In this particular case, while it did make testing a relatively simple feature more challenging than I expected, it was my own approach to testing that hurt me. I needed to work with the concepts at play rather than to try and bend them to my preferences.

Like I said earlier, working with an eventually consistent model changes how you interact with a service at every level. It is not a decision that should be taken lightly.

Unfortunately I think the selection of RavenDB in our case was more “shiny new technology” than “this is the right fit for our service”, and that decision was given far less gravitas than it deserved.