Health Check On Aisle One
Posted in: monitoring, API, aws
The way we structure our services for deployment should be familiar to anyone who’s worked with AWS before: an ELB (Elastic Load Balancer) containing one or more EC2 (Elastic Compute Cloud) instances, each of which has our service code deployed onto it by Octopus Deploy. The EC2 instances are maintained by an ASG (Auto Scaling Group), which uses a Launch Configuration that describes how to create and initialize an instance. Nothing too fancy.
An ELB has logic in it to direct incoming traffic to its member instances, in order to balance the load across multiple resources. To do this, it needs to know whether or not a member instance is healthy, which it accomplishes by using a health check, configured when the ELB is created.
Health checks can be very simple (does a member instance respond to a TCP packet on a certain port?) or a little bit more complex (hit some API endpoint on a member instance via HTTP and look for non-200 response codes), but conceptually they are meant to indicate whether the ELB is okay to continue directing traffic to a member instance. If a certain number of health checks fail over a period of time (configurable, but something like “more than 5 within 5 minutes”), the ELB will mark a member instance as unhealthy and stop directing traffic to it. It will continue to execute the configured health check in the background (in case the instance comes back online because the problem was transitory), but can also be configured (when used in conjunction with an ASG) to terminate the instance and replace it with one that isn’t terrible.
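For context, the health check itself is just a bit of ELB configuration. As a rough sketch (using the classic ELB API via the AWS SDK for .NET; the load balancer name, region and thresholds here are illustrative, not our production settings), it looks something like this:

```csharp
using Amazon;
using Amazon.ElasticLoadBalancing;
using Amazon.ElasticLoadBalancing.Model;

class ConfigureElbHealthCheck
{
    static void Main()
    {
        // Illustrative values only; names, region and thresholds are placeholders.
        using (var client = new AmazonElasticLoadBalancingClient(RegionEndpoint.APSoutheast2))
        {
            client.ConfigureHealthCheck(new ConfigureHealthCheckRequest
            {
                LoadBalancerName = "my-service-elb",
                HealthCheck = new HealthCheck
                {
                    Target = "HTTP:80/status",  // hit an HTTP endpoint on the instance and expect a 200
                    Interval = 30,              // seconds between checks
                    Timeout = 5,                // seconds before a single check counts as failed
                    UnhealthyThreshold = 5,     // consecutive failures before the instance is marked unhealthy
                    HealthyThreshold = 2        // consecutive successes before it is considered healthy again
                }
            });
        }
    }
}
```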
Up until recently, our services have consistently used a /status endpoint for the purposes of the ELB health check.
It’s a pretty simple endpoint, checking some basic things like “can I connect to the database” and “can I connect to S3”, returning a 503 if anything is wrong.
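To give a feel for it, the checks behind that endpoint amount to little more than the following sketch (the connection string and region are placeholders, not our actual code):

```csharp
using System;
using System.Data.SqlClient;
using Amazon;
using Amazon.S3;

// Hypothetical implementations of the two checks behind /status; the connection
// string and region would come from configuration in the real service.
public static class StatusChecks
{
    public static bool CanConnectToDatabase(string connectionString)
    {
        try
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                return true;
            }
        }
        catch (Exception)
        {
            return false;
        }
    }

    public static bool CanConnectToS3()
    {
        try
        {
            using (var client = new AmazonS3Client(RegionEndpoint.APSoutheast2))
            {
                client.ListBuckets();
                return true;
            }
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```

The /status handler just runs both and returns a 503 if either comes back false.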
When we started using an external website monitoring service called Pingdom, it made sense to just use the /status endpoint as well.
Then we complicated things.
Only I May Do Things
Earlier this year I posted about how we do environment migrations. The contents of that post are still accurate (even though I am becoming less and less satisfied with the approach as time goes on), but we’ve improved the part about “making the environment unavailable” before starting the migration.
Originally, we did this by just putting all of the instances in the auto scaling group into standby mode (and shutting down any manually created databases, like ones hosted on EC2 instances), with the intent that this would be sufficient to stop customer traffic while we did the migration. When we introduced the concept of the statistics endpoint, and started comparing data before and after the migration, this was no longer acceptable (because there was no way for us to hit the service ourselves when it was down to everybody).
What we really wanted to do was put the service into some sort of mode that let us interact with it from an administrative point of view, but no-one else. That way we could execute the statistics endpoint (and anything else we needed access to), but all external traffic would be turned away with a 503 Service Unavailable (and a message saying something about maintenance).
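A rough sketch of how that sort of gate might be wired into the Nancy request pipeline (the maintenance mode flag and the admin check below are hypothetical placeholders, not our actual implementation):

```csharp
using System.Linq;
using Nancy;
using Nancy.Bootstrapper;
using Nancy.TinyIoc;

public class Bootstrapper : DefaultNancyBootstrapper
{
    protected override void ApplicationStartup(TinyIoCContainer container, IPipelines pipelines)
    {
        base.ApplicationStartup(container, pipelines);

        pipelines.BeforeRequest += ctx =>
        {
            // When maintenance mode is on, turn away everyone except administrators
            // with a 503 before the request ever reaches a module.
            // (A body explaining the maintenance window could be attached here as well.)
            if (MaintenanceMode.IsOn && !IsAdministrator(ctx))
            {
                return new Response { StatusCode = HttpStatusCode.ServiceUnavailable };
            }

            // Returning null lets the request continue through the pipeline as normal.
            return null;
        };
    }

    private static bool IsAdministrator(NancyContext ctx)
    {
        // Placeholder: the real check would inspect the caller's credentials/claims.
        return ctx.Request.Headers["X-Admin-Token"].Any();
    }
}

public static class MaintenanceMode
{
    // Placeholder flag; in reality this would come from configuration or a data store.
    public static bool IsOn { get; set; }
}
```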
This complicated the /status endpoint though. If we made it return a 503 when maintenance mode was on, the instances would eventually be removed from the ELB, which would make the service inaccessible to us. If we didn’t make it return a 503, the external monitoring in Pingdom would falsely report that everything was working just fine, when in reality normal users would be completely unable to use the service.
The conclusion that we came to was that we made a mistake using /status for two very different purposes (ELB membership and external service availability), even though they looked very similar at a glance.
The solution?
Split the /status endpoint into two new endpoints, /health for the ELB and /test for the external monitoring service.
Healthy Services
The /health endpoint is basically just the /status endpoint with a new name. Its response payload looks like this:
{ "BuildVersion" : "1.0.16209.8792", "BuildTimestamp" : "2016-07-27T04:53:04+00:00", "DatabaseConnectivity" : true, "MaintenanceMode" : false }
There’s not really much more to say about it, other than its purpose now being clearer and more focused. It’s all about making sure the particular member instance is healthy and should continue to receive traffic. We can probably extend it to do other health-related things (like available disk space, available memory, and so on), but it’s good enough for now.
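As a sketch (assuming Nancy 1.x route syntax, with placeholder checks standing in for the real dependencies), the important bit is that maintenance mode does not fail the health check, so the ELB keeps the instance in rotation:

```csharp
using Nancy;

public class HealthModule : NancyModule
{
    public HealthModule()
    {
        Get["/health"] = _ =>
        {
            var databaseOk = CanConnectToDatabase();

            var health = new
            {
                BuildVersion = "1.0.16209.8792",              // stamped in at build time
                BuildTimestamp = "2016-07-27T04:53:04+00:00", // stamped in at build time
                DatabaseConnectivity = databaseOk,
                MaintenanceMode = IsMaintenanceModeOn()
            };

            // Maintenance mode deliberately does NOT fail the health check,
            // so the ELB keeps the instance in rotation during a migration.
            return Response.AsJson(health, databaseOk ? HttpStatusCode.OK : HttpStatusCode.ServiceUnavailable);
        };
    }

    // Placeholders; the real checks live behind injected dependencies.
    private static bool CanConnectToDatabase() { return true; }
    private static bool IsMaintenanceModeOn() { return false; }
}
```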
Testing Is My Bag Baby
The /test endpoint though, is a different beast.
At the very least, the /test endpoint needs to return the information present in the /health endpoint. If /health is returning a 503, /test should too. But it needs to do more.
The initial idea was to simply check whether or not we were in maintenance mode, and make the /test endpoint return a 503 when maintenance mode was on, as the only differentiator between /health and /test. That’s not quite good enough though: conceptually we want the /test endpoint to act more like a smoke test of common user interactions, and just checking the maintenance mode flag doesn’t give us enough faith that the service itself is working from a user’s point of view.
So we added some tests that get executed when the endpoint is called, checking commonly executed features.
The first service we implemented the /test endpoint for was our auth service, so the tests included things like “can I generate a token using a known user” and “can I query the statistics endpoint”, but eventually we’ll probably extend it to include other common user operations like “add a customer” and “list customer databases”.
Anyway, the payload for the /test endpoint response ended up looking like this:
{ "Health" : { "BuildVersion" : "1.0.16209.8792", "BuildTimestamp" : "2016-07-27T04:53:04+00:00", "DatabaseConnectivity" : true, "MaintenanceMode" : false }, "Tests" : [{ "Name" : "QueryApiStatistics", "Description" : "A test which returns the api statistics.", "Endpoint" : "/admin/statistics", "Success" : true, "Error" : null }, { "Name" : "GenerateClientToken", "Description" : "A test to check that a client token can be generated.", "Endpoint" : "/auth/token", "Success" : true, "Error" : null } ] }
I think it turned out quite informative in the end, and it definitely meets the need of detecting when the service is actually available for normal usage, as opposed to the very specific case of detecting maintenance mode.
The only trick here was that the tests needed to hit the API via HTTP (i.e. localhost), rather than just shortcutting through the code directly. The reason for this is that otherwise each test would not be an accurate representation of actual usage, and would give false results if we added something in the Nancy pipeline that modified the response before it got to the code that the test was executing.
Subtle, but important.
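To make that concrete, here’s a rough sketch of what one of those tests might look like (the name, description and endpoint come from the payload above; everything else is illustrative). The key part is that the request goes back out over HTTP to localhost, so it travels through the full Nancy pipeline:

```csharp
using System;
using System.Net.Http;

// A hypothetical smoke test run by the /test endpoint; its result maps onto the
// Name/Description/Endpoint/Success/Error fields in the payload above.
public class QueryApiStatisticsTest
{
    public string Name { get { return "QueryApiStatistics"; } }
    public string Description { get { return "A test which returns the api statistics."; } }
    public string Endpoint { get { return "/admin/statistics"; } }

    public TestResult Execute(int port, string adminToken)
    {
        try
        {
            using (var client = new HttpClient { BaseAddress = new Uri("http://localhost:" + port) })
            {
                // Go back out over HTTP rather than calling the handler directly, so the
                // request travels through the full Nancy pipeline like a real user's would.
                client.DefaultRequestHeaders.Add("Authorization", "Bearer " + adminToken);
                var response = client.GetAsync(Endpoint).Result;

                return new TestResult
                {
                    Success = response.IsSuccessStatusCode,
                    Error = response.IsSuccessStatusCode ? null : "Unexpected status code: " + (int)response.StatusCode
                };
            }
        }
        catch (Exception ex)
        {
            return new TestResult { Success = false, Error = ex.Message };
        }
    }
}

public class TestResult
{
    public bool Success { get; set; }
    public string Error { get; set; }
}
```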
Conclusion
After going to the effort of clearly identifying the two purposes that the /status endpoint was fulfilling, it was pretty clear that we would be better served by having two endpoints rather than just one. It was obvious in retrospect, but it wasn’t until we actually encountered a situation that required them to be different that we really noticed.
A great side effect of this was that we realised we needed some sort of smoke tests for our external monitoring to run on an ongoing basis, to give us a good indication of whether or not everything was going well from a user’s point of view.
Now our external monitoring actually stands a chance of tracking whether or not users are having issues.
Which makes me happy inside, because that’s the sort of thing I want to know about.