
In late 2016 we deployed the first version of our data synchronization API. It was a momentous occasion, and it all worked perfectly…at least until we turned on the software that actually used it. Actual traffic has a habit of doing that.

Long story short, we experienced a bunch of growing pains as we scaled our traffic up during those first few weeks, and this post is a follow-up to one of those issues in particular.

Specifically, the issue of particularly high read IOPS on the RDS instance that backed the sync API.

At the time, we investigated, didn’t find the root cause of the problem, threw money at it, and ran away to do things that the business considered more important.

Why does it matter now?

Well, it happened again, and this time we can’t just throw money at it.

Sound familiar?

I’ve Come to Read From You Again

As I mentioned above, the main symptom was overly high read IOPS on our RDS server. Much higher than we expected for what we were doing.

This was an issue because of the way that data volumes work in AWS. You get some guaranteed IOPS (read and write combined) based on the size of the volume (1GB = 3 IOPS), while at the same time you get the capability to burst up to 3000 IOPS for brief periods of time.

The brief period of time is the important part, because it is actually modelled as a “balance” of IOPS credits that you consume if you go above your baseline, and which is replenished whenever you’re below your baseline. Very similar to the way that CPU credits work on EC2 instances. Back in late 2016, what exactly the balance was at any particular point in time was a bit of a mystery though, because you couldn’t graph or alarm on it through CloudWatch. You only knew it had run out when the performance of your database tanked, because it was dependent on using more IOPS than its baseline.
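
To put some rough numbers on that (a back-of-the-envelope sketch using the gp2 figures as I understand them; the volume size and the IOPS usage are made up):

using System;

// Back-of-the-envelope maths for a gp2 data volume, assuming the documented
// model: 3 IOPS per GB of baseline, a bucket of 5.4 million I/O credits and
// the ability to burst to 3000 IOPS while credits remain.
const int volumeSizeInGigabytes = 100;              // hypothetical volume size
const int baselineIops = volumeSizeInGigabytes * 3; // 300 IOPS guaranteed
const double burstBucketCredits = 5400000;          // credits available when the bucket is full

// If actual usage sits at a constant 1000 IOPS, the bucket drains at (usage - baseline).
const int actualIops = 1000;
const int drainPerSecond = actualIops - baselineIops; // 700 credits consumed per second
var secondsUntilEmpty = burstBucketCredits / drainPerSecond;

Console.WriteLine($"Burst balance exhausted after ~{secondsUntilEmpty / 3600:F1} hours");
// ~2.1 hours, after which the volume is clamped to its 300 IOPS baseline and
// everything that depends on it slows to a crawl.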

Anyway, you can chart and alarm on the balance now, which is nice.

We tried a few things at the time, including scaling the underlying data volume (to give us more baseline), but it wasn’t enough.

We then increased the size of the RDS instance, and the high read IOPS vanished. I mean, that was obviously the best outcome, clearly: a problem we didn’t understand, reacting in a completely unexpected way when we threw more power at it.

At the time, we thought that it might be related to EF querying the database on writes (to choose between an insert or update), but we never proved that hypothesis conclusively.
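
For illustration, this is the kind of upsert pattern we suspected; the context and entity here are made-up stand-ins, not our actual code:

using System.Data.Entity;

// Hypothetical entities and context, purely for illustration.
public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class SyncDbContext : DbContext
{
    public DbSet<Customer> Customers { get; set; }
}

public class CustomerSyncer
{
    // The suspected pattern: every write is preceded by a read to decide
    // between an insert and an update, which shows up as extra read IOPS.
    public void Upsert(SyncDbContext context, Customer incoming)
    {
        var existing = context.Customers.Find(incoming.Id);

        if (existing == null)
        {
            context.Customers.Add(incoming);
        }
        else
        {
            context.Entry(existing).CurrentValues.SetValues(incoming);
        }

        context.SaveChanges();
    }
}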

No-one in the business complained too loudly about the cost and everything was working swimmingly, so after an initial period of discomfort (mostly the engineers inside us sobbing quietly), we promptly forgot about it altogether.

Because Some Usage Softly Creeping

A few weeks back, our data synchronization API started behaving strangely.

The first we knew about the problem was when our external monitoring service (Pingdom) reported that the API was down. Then up. Then down again. Then up again. So on and so forth.

Looking at the RDS metrics there were thousands of connections active to the database, causing us to hit connection limits. My first thought was that someone internal to the organisation had somehow got the credentials to the database and was running insane queries for business intelligence. I mean, thinking about that conclusion now, it’s completely bogus, because:

  1. No-one has the credentials to the master database but a few members of my team
  2. No-one even knows what the URL to access the master database is
  3. We created a read replica specifically for the organisation to use for business intelligence, so why would someone use the master?
  4. You can’t even connect to the master database without being on our internal AWS VPN, which most people in the organisation don’t have access to

It didn’t help that when PostgreSQL runs out of connections, you can’t even connect to it with pgAdmin or something similar to see what the connections are doing.

Anyway, after wrongly bombarding some innocent people with polite requests to “please stop doing things to our master database, it’s killing our API”, I shut down our data synchronization API and all the connections disappeared.

Because it was our API.

The calls were coming from inside the house.

Wrecked Everything While I Was Sleeping

Luckily, the data synchronization API is probably the one API we can take down for a few hours and not have anybody really notice. As long as the database is available, the other API that actually allows access to the data can continue to do what it normally does, just with increasingly stale information. It’s an eventually consistent model anyway, so some staleness is inherent.

We still only realistically had a few hours to dig into the issue before it became a real problem for actual users though.

Our only lead was the massive number of connections that were being established to the RDS instance. I managed to get a sample of what the connections were doing just as I shut the API down, and it wasn’t anything interesting. Just normal queries that I would expect from the API. Traffic hadn’t changed either, so we weren’t dealing with some weird DDOS attack or anything like that.

Well, to skip ahead, it turns out that the rise in the number of connections was a symptom, not the cause. The root issue was that we had burnt through our IO burst balance over the last few hours due to an increase in read IOPS. We’d almost done the same thing the day before, but it bottomed out at like 5% remaining because the day’s traffic started dropping off, so we didn’t notice. The lack of IOPS was causing each request to execute slower than normal, which just meant that there were more requests being executed in parallel.

Each API instance had a maximum number of connections that it could use, but the sum total across all of the instances was higher than the total number available to PostgreSQL (whoops).
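
The arithmetic of the mistake looks something like this (the numbers and the Npgsql-style pool setting are illustrative, not our real configuration):

using System;

// Pools are sized per API instance, but max_connections is a single
// server-wide limit on the PostgreSQL side.
const int apiInstances = 8;
const int maxPoolSizePerInstance = 100; // e.g. "Maximum Pool Size=100" in an Npgsql connection string
const int postgresMaxConnections = 500; // the server's max_connections setting

const int worstCase = apiInstances * maxPoolSizePerInstance; // 800 potential connections

Console.WriteLine(worstCase > postgresMaxConnections
    ? $"Whoops: up to {worstCase} connections against a limit of {postgresMaxConnections}"
    : "Connection pools fit within the server limit");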

But why were we eating all of our delicious burst balance all of a sudden?

Well, a few days prior we started syncing a new table, and even though it was a small one, it increased the drain on the infrastructure.

Interestingly enough, it wasn’t a case of read IOPS going from close to nothing to a massive number with that new table. Our read IOPS had actually increased to a ridiculous baseline a few weeks before that when we shipped a completely different table, and we just hadn’t noticed. The table we shipped recently had just been the straw that broke the camel’s back.

We turned both of the new tables off, restored the API and all was well.

Conclusion

Honestly, we really should have had some alarms. Any alarms. Literally any alarm on the things we were already suspicious of at the database level (connections, read IOPS, etc.) would have given us some sort of warning that bad things were brewing, weeks before they actually happened.

We have those alarms now, but it was still a pretty big oversight.
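
We create this sort of alarm through our CloudFormation templates these days, but as an illustration of the kind of thing that would have saved us (the metric and namespace are the real RDS ones; the instance identifier and topic ARN are placeholders), creating one through the AWS SDK for .NET looks roughly like this:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

public static class BurstBalanceAlarm
{
    public static async Task CreateAsync()
    {
        var client = new AmazonCloudWatchClient();

        // Fire when the RDS burst balance averages below 50% for 15 minutes,
        // which is hours before it actually runs out.
        await client.PutMetricAlarmAsync(new PutMetricAlarmRequest
        {
            AlarmName = "sync-db-burst-balance-low", // made-up name
            Namespace = "AWS/RDS",
            MetricName = "BurstBalance",
            Dimensions = new List<Dimension>
            {
                new Dimension { Name = "DBInstanceIdentifier", Value = "our-sync-db" } // placeholder
            },
            Statistic = Statistic.Average,
            Period = 300,
            EvaluationPeriods = 3,
            Threshold = 50,
            ComparisonOperator = ComparisonOperator.LessThanThreshold,
            AlarmActions = new List<string> { "arn:aws:sns:region:account:alarms-topic" } // placeholder
        });
    }
}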

With the new tables turned off and stability and functionality restored, we looked to fix the problem.

Throwing money at it wasn’t really feasible, as we would have basically doubled the infrastructure costs, and it was already costing us a few thousand US dollars each month. No, this time we would have to understand the problem and fix it properly.

But that’s a topic for next time, because this post is already large.


That thing that I complained about with Logstash and HTTP outputs a few weeks ago? Turns out it was actually a bug after all!

Not only that, but it was totally a race condition relating to the way the success/failure handlers of the Manticore HTTP library were being used, which was what I guessed, so go me. I guess even a broken clock is right twice a day.

Long story short, don’t fire off an asynchronous request until you’ve got your asynchronous handlers in place to handle the result.
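
In C# terms, the moral looks something like this (a contrived illustration of the ordering problem, not the Manticore API):

using System;
using System.Threading;
using System.Threading.Tasks;

public class AsyncRequest
{
    // Completion is only observable via this event, and there is no replay
    // for late subscribers, which is why the ordering matters.
    public event Action<string> Completed;

    public void Start()
    {
        Task.Run(() => Completed?.Invoke("200 OK"));
    }
}

public static class Program
{
    public static void Main()
    {
        // Racy: Start() can finish before the handler is attached, in which
        // case the completion is lost and anything waiting on it hangs.
        var racy = new AsyncRequest();
        racy.Start();
        racy.Completed += response => Console.WriteLine("racy: " + response);

        // Safe: wire up the handlers first, then fire the request.
        var safe = new AsyncRequest();
        safe.Completed += response => Console.WriteLine("safe: " + response);
        safe.Start();

        Thread.Sleep(500); // give the background work a moment to finish
    }
}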

In their typical awesome style, the maintainers of the HTTP output plugin have already fixed the issue and released a new version (4.3.2) for the masses to consume.

Unfortunately, I don’t think this new version has been rolled into an official Logstash release yet, so if I want to use it I’ll have to make a custom redistributable package to use in Octopus deploy. Not a huge issue, just a small pain. I’m sure it will be rolled into the official package soon.

Other than that piece of news, I don’t really have anything else, so I’m going to take this opportunity to muse on some of my experiences with asynchronous code.

Hellish Code

Depending on how it’s written, asynchronous code can be hard to reason about, and even harder to test.

Vanilla Node is a good example of bad asynchronous code. It’s incredibly easy to get into callback hell, where this function calls into that function, which has a handler that calls another function, which does another thing, and so on. It’s even worse when you realise that the easiest way to accumulate state between so many asynchronous functions is to close over variables that are in scope when each function is created, which leaves you in a position where it’s hard to break things apart into nice, self-contained, easily testable components.

*shudder*

Encapsulating asynchronous operations into some sort of structure is a good counter to callback hell, and once we started using Promises in the ELB logs processor Javascript, I found the quality of the code increased rapidly. It was still Javascript, but at least you could reason about it.

Continuing on with a language that I actually like, C#, you’ve got a similar concept at play with tasks and the TPL (Task Parallel Library).

Before the TPL, you would probably just use raw threads to offload work (or maybe dedicated asynchronous functions or classes like BackgroundWorker), and you could easily get into the same sort of problems that you get into in vanilla Node. I mean, you had the benefit of working with a nice, structured language, but it was still difficult and prone to error.

The TPL introduced encapsulations around asynchronous operations called Tasks, which helped write asynchronous code that actually made sense.

It did something else that was incredibly helpful as well.

It exposed the Task schedulers.

All The Control

Encapsulating operations as Tasks is all well and good, and in most cases you’re probably looking to run those operations asynchronously on another thread (or a pool of threads perhaps). I mean, that’s literally exactly what the default TaskScheduler in C# (the ThreadPoolTaskScheduler) does.

Of course, that makes testing hard, because now your main thread (which the test is running on) has to have some logic in it to wait for outstanding operations to complete before you can assert things. This gets even more difficult when working with ViewModels, which require that certain updates be performed on the main/UI thread in order to work properly.

This is particularly important for WPF and PropertyChanged events, which only work properly if you raise them on the UI thread.

The good news is that you can write your own TaskScheduler and then inject it for testing purposes, assuming you’ve written your code in such a way that your Tasks are created with a specified scheduler.

I’ve written a TaskScheduler that is not threaded at all (i.e. it just executes the task inline), one that accumulates operations to be executed at will (good for testing transitions) and another that allows you to run tasks on the WPF Dispatcher construct (i.e. the main/UI thread in WPF) and I use all of them frequently while developing applications that rely on Tasks to do things.

They aren’t even all that hard to write really, as the following NonThreadedTaskScheduler shows:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed class NonThreadedTaskScheduler : TaskScheduler
{
    // "Queuing" a task just executes it immediately on the calling thread.
    protected override void QueueTask(Task task)
    {
        TryExecuteTask(task);
    }

    // Inline execution is always allowed, for the same reason.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return TryExecuteTask(task);
    }

    // Nothing is ever actually queued, so there is nothing to report.
    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return Enumerable.Empty<Task>();
    }

    // Everything runs synchronously on the calling thread, one task at a time.
    public override int MaximumConcurrencyLevel
    {
        get { return 1; }
    }
}
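
To actually take advantage of schedulers like these, the code under test needs to create its Tasks against an injected scheduler, along the lines of the following (the view model is hypothetical):

using System.Threading;
using System.Threading.Tasks;

public class ItemsViewModel
{
    private readonly TaskScheduler _scheduler;

    public ItemsViewModel(TaskScheduler scheduler)
    {
        _scheduler = scheduler;
    }

    public Task Refresh()
    {
        // The injected scheduler decides where this runs: the thread pool in
        // production, inline via NonThreadedTaskScheduler in tests.
        return Task.Factory.StartNew(
            LoadItems,
            CancellationToken.None,
            TaskCreationOptions.None,
            _scheduler);
    }

    private void LoadItems()
    {
        // Load things from somewhere slow.
    }
}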

Without the ability to change how the tasks are executed, none of that would have been possible, and while they would still be useful, they would be much harder to actually use and test.

Coincidentally, this is one of the reasons why I’m a little bit leery of the async/await keywords in C#, because I don’t think they expose any of the scheduling functionality, instead assuming that you are happy with the magical way that they work. Honestly though, my knowledge of async/await is fairly limited, so that might not be entirely correct.

Conclusion

To conclude, asynchronous code is hard and its bugs are subtle and hard to detect.

It’s also one of the easiest and highest impact ways to wring more performance out of a piece of software. If you’re doing a lot of things that naturally involve waiting (disk writes, network calls, etc), you stand to gain an incredible amount of time by just doing other things during those periods when you’re waiting on something external and out of your control.

Double edged sword really.

Like most software.


Alerting on potential problems before they become real problems is something that we have been historically bad at. We’ve got a wealth of information available both through AWS and in our ELK Stack, but we’ve never really put together anything to use that information to notify us when something interesting happens.

In an effort to alleviate that, we’ve recently started using CloudWatch alarms to do some basic alerting. I’m not exactly sure why I shied away from them to begin with to be honest, as they are easy to set up and maintain. It might have been that the old way that we managed our environments didn’t lend itself to easy, non-destructive updates, making tweaking and creating alarms a relatively onerous process.

Regardless of my initial hesitation, when I was polishing the ELK stack environment recently I made sure to include some CloudWatch alarms to let us know if certain pieces fell over.

The alerting setup isn’t anything special:

  • Dedicated SNS topics for each environment (i.e. CI, Staging, Production, Scratch)
  • Each environment has two topics, one for alarms and one for notifications
  • The entire team is signed up to emails from both Staging and Production, but CI/Scratch are optional

Email topic subscriptions are enough for us for now, but we have plans to use the topics to trigger messages in HipChat as well.

I started with some basic alarms, like unhealthy instances > 0 for Elasticsearch/Logstash. Both of those components expose HTTP endpoints that can easily be pinged, so if they stop responding to traffic for any decent amount of time, we want to be notified. Other than some tweaks to the Logstash health check (I tuned it too tightly initially, so it was going off whenever an HTTP request took longer than 5 seconds), these alarms have served us well in picking up things like Logstash/Elasticsearch crashing or not starting properly.

As far as Elasticsearch is concerned, this is pretty much enough.

With Logstash being Logstash though, more work needed to be done.

Transport Failure

As a result of the issues I had with HTTP outputs in Logstash, we’re still using the Logstash TCP input/output combo.

The upside of this is that it works.

The downside is that sometimes the TCP input on the Broker side seems to crash and get into an unrecoverable state.

That wouldn’t be so terrible if it took Logstash with it, but it doesn’t. Instead, Logstash continues to accept HTTP requests, and just fails all of the TCP stuff. I’m not actually sure if the log events received via HTTP during this time are still processed through the pipeline correctly, but all of the incoming TCP traffic is definitely black holed.

As a result of the HTTP endpoint continuing to return 200 OK, the alarms I set up for unhealthy instances completely fail to pick up this particular issue.

In fact, because of the nature of TCP traffic through an ELB, and the relatively poor quality of TCP metrics, it can be very difficult to tell at a glance whether or not it’s working. We can’t use requests or latency, because they have no bearing on TCP traffic, and we certainly can’t use status codes (obviously). Maybe network traffic, but that doesn’t seem like the best idea due to natural traffic fluctuations.

The only metric that I could find was “Backend Connection Errors”. This metric appears to be a measure of how many low level connection errors occurred between the ELB and the underlying EC2 instances, and seems like a good fit. Even better, when the Logstash TCP input falls over, it is this metric that actually changes, as all of the TCP traffic being forwarded through the ELB fails miserably.

One simple alarm initialization through CloudFormation later, and we were safe and secure in the knowledge that the next time the TCP input fell over, we wouldn’t find out about it 2 days later.

"BrokerLoadBalancerBackendConnectionErrorsAlarm": {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmName" : { "Fn::Join": [ "", [ "elk-broker-", { "Ref": "OctopusEnvironment" }, "-backend-errors" ] ] },
    "AlarmDescription" : "Alarm for when there is an increase in the backend connection errors on the Broker load balancer, typically indicating a problem with the Broker EC2 instances. Suggest restarting them",
    "AlarmActions" : [ { "Ref" : "AlarmsTopicARN" } ],
    "OKActions": [ { "Ref" : "AlarmsTopicARN" } ],
    "TreatMissingData": "notBreaching",
    "MetricName" : "BackendConnectionErrors",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Maximum",
    "Period" : "60",
    "EvaluationPeriods" : "5",
    "Threshold" : "100",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "BrokerLoadBalancer" }
    } ]
  }
}

Mistakes Were Made

Of course, a few weeks later the TCP input crashed again and we didn’t find out for 2 days.

But why?

It turns out that the only statistic worth a damn for Backend Connection Errors when alerting is SUM, and I created the alarm on the MAXIMUM, assuming that it would act like requests and other metrics (which give you the maximum number of X that occurred during the time period). Graphing the maximum backend connection errors during a time period where the TCP input was broken gives a flat line at y = 1, which is definitely not greater than the threshold of 100 that I entered.

I switched the alarm to SUM and as far as I can see the next time the TCP input goes down, we should get a notification.

But I’ve been fooled before.

Conclusion

I’ll be honest, even though I did make a mistake with this particular CloudWatch alarm, I’m glad that we started using them.

Easy to implement (via our CloudFormation templates) and relatively easy to use, they provide an element of alerting on our infrastructure that was sorely missing. I doubt we will go about making thousands of alarms for all of the various things we want to be alerted on (like increases in 500s and latency problems), but we’ll definitely include a few alarms in each stack we create to yell at us when something simple goes wrong (like unhealthy instances).

To bring it back to the start, I think one of the reasons we hesitated to use CloudWatch alarms was that we were waiting for the all-singing, all-dancing alerting solution based on the wealth of information in our log stack, but I don’t think that is going to happen in a hurry.

It’s been years already.


Almost a month ago, I made a bold claim that switching our Logstash instances to HTTP outputs for all log event traffic wouldn’t be all that difficult.

Oh god, how wrong I was.

In some weird display of prescience however, these were the exact words I used:

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

What Is Logstash?

Theoretically, I was entirely correct. It was as simple as changing the TCP output to HTTP. Same host, no fancy settings, no real complexity in the configuration.

For good measure, I upgraded the version of Logstash to 5.4.1 as well (from 5.2.0), which was a whole thing that I’m not going to get into right now. Suffice to say, I had a very recent version of Logstash installed when it all went to tatters.

We’ve got a pretty good deployment pipeline for distributing Logstash to our various API instances and whatnot, so after I made the changes, I checked in, builds kicked off, tests were run, deployments were made, and it was time to check whether or not CI was still working.

It was fine. For the first time, I could see the traffic caused by Logstash on the Broker using CloudWatch, which was nice.

Promote to Staging, same thing. Worked fine, traffic visible, all seemed well.

Promote to Prod, same thing. Worked fine, traffic visible, all is well.

And that’s the end of the story.

Baby Don’t Hurt Me

Ha ha ha ha ha ha no.

What I really should have done is left Staging run for a few days before promoting anything. I guess I was just a little too eager.

It wasn’t long after promoting to production that I noticed data disappearing from our dashboards and visualizations. At first glance, it all seemed to be isolated to one of our services in particular, so I started digging.

  • Log files were being appended to locally, so not the issue we encountered recently with our Nancy APIs. This problem had different symptoms anyway (no logs at all, instead of just no application logs).
  • Restarting the local instance of Logstash seemed to get a few events through into the ELK stack, so the communication channels were intact (i.e. traffic wasn’t being blocked at the network level), but only a few hundred.
  • Broker seemed to be healthy and handling traffic well, so it didn’t seem like a resource issue on the ELK stack side.

Based on the evidence, it looked like the problem was entirely the fault of the local Logstash instances, but if that was the case, why not more failures? At this point in the investigation, other Logstash instances appeared to be pushing their log events into the ELK stack without issue.

There didn’t even seem to be anything special about the logs going through the failing Logstash instances either. It was all pretty vanilla IIS, Performance Counter and API application logs. Nothing fancy.

Time to dig into Logstash itself.

The good news was that Logstash 5.4.1 features a really nice internal API, granting access to a bunch of information about the internal workings of the application.

Using a little bit of PowerShell, you can get statistics about the Logstash pipeline.

(Invoke-WebRequest http://localhost:9600/_node/stats/pipeline).Content

On one of the broken machines, our output looked something like this:

{ "host" : "a-hostname", "version" : "5.4.1", "http_address" : "127.0.0.1:9600", "id" : "a82b92de-d402-4b99-a569-a859c8e00d2b", "name" : "a-name", "pipeline" : { "events" : { "duration_in_millis" : 11293, "in" : 377, "filtered" : 375, "out" : 250, "queue_push_duration_in_millis" : 28027 }, "plugins" : { "inputs" : [ { "id" : "get-events-from-api-iis-logs", "events" : { "out" : 129, "queue_push_duration_in_millis" : 14075 }, "name" : "file" }, { "id" : "get-events-from-system-stats", "events" : { "out" : 248, "queue_push_duration_in_millis" : 13952 }, "name" : "file" }, { "id" : "get-events-from-api-application-logs", "events" : { "out" : 0, "queue_push_duration_in_millis" : 0 }, "name" : "file" } ], "filters" : [ { "id" : "break-iis-fields-down-via-csv", "events" : { "duration_in_millis" : 1135, "in" : 128, "out" : 128 }, "name" : "csv" }

# More filters here, cut for brevity

], "outputs" : [ { "id" : "send-events-to-elk-broker-over-http", "events" : { "duration_in_millis" : 5441, "in" : 375, "out" : 250 }, "name" : "http" } ] }, "reloads" : { "last_error" : null, "successes" : 0, "last_success_timestamp" : null, "last_failure_timestamp" : null, "failures" : 0 }, "queue" : { "type" : "memory" }, "id" : "main" } }

Super dense, but the interesting part is the HTTP output at the bottom.

See how it has 375 events in, but only 250 out?

It hangs like that forever.

Restarting Logstash causes it to start again and send a similar number of events, but eventually hang in exactly the same manner.

Don’t Hurt Me, No More

I’ll be honest, I tried a lot of things to get this to work, and I just couldn’t.

In fact, the problem actually got worse, because other Logstash instances started failing in the same way, and more and more data dropped off our dashboards and visualizations.

In the end, I had no choice but to roll back to the TCP output (which was flaky, but in a known way) and to log an issue on the Logstash Github describing my problem in the hope that someone would be able to look into it and help out.

Why not look myself?

Well, I started, but Logstash is Ruby code running through Java (which is then running on Windows via PowerShell), and getting into that stack from a development point of view looked pretty complicated and time-consuming.

I had a bit of a look at the Ruby code of the HTTP output and the only thing I could think of was a potential race condition in the construction of the HTTP requests.

# Create an async request
request = client.background.send(@http_method, url, :body => body, :headers => headers)
request.call # Actually invoke the request in the background

request.on_success do |response|
  begin
    # bunch of code dealing with success
  end
end

request.on_failure do |exception|
  begin 
    # bunch of code dealing with failures
  end
end

I don’t know how the HTTP library works (I think it’s Manticore?), but if it doesn’t support the registering (and correct execution) of event handlers AFTER a call has completed, then it’s possible for the request.call to finish and the success and failure handlers to never run.

Part of the success handler’s responsibilities is to inform the rest of the Logstash pipeline that the HTTP output is done with its events, so that sort of thing could easily lead to a deadlock.

Again, I don’t know Ruby at all, so that is complete conjecture on my part.

Conclusion

This was another one of those issues that just came out of nowhere and stole more of my time than I would like to admit to. I would still like to switch all of our log event traffic to HTTP (for the same reasons I went through last time), but I have other things to do for now. At least it’s still working, even though the TCP input in Logstash sometimes just stops working. It doesn’t happen very often, and I have some monitoring around it, so it’s not the end of the world.

To be honest, a smarter man would have seen that it was not working with no clear solution, rolled back to the TCP output and moved on. Probably within a few hours.

I am not a smart man.

I am, however, ridiculously stubborn, so I wailed away at this problem until it stole my will to live, then I gave up.


Nancy is a pretty great, lightweight, low-ceremony API framework for C#/.NET. We’ve used it for basically every API that we’ve created since I started working here, with the exception of our Auth API, which uses Web API because of a dependency on the ASP.NET Identity package.

In all the time that I’ve been using Nancy, I’ve encountered very few issues that weren’t just my own stupid fault. Sure, we’ve customised pieces of its pipeline (like the bootstrapper, because we use Ninject for dependency injection) but all in all its default behaviour is pretty reliable.

In fact, I can only think of one piece of default behaviour that wasn’t great, and that was the way that errors were handled during initialization when running inside IIS.

The Setup

When you use Nancy through IIS/ASP.NET, you add a reference to a handler class in your web.config file, which tells IIS how it should forward incoming requests to your code.

After installing the Nancy.Hosting.AspNet package, your web.config will contain some stuff that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <system.web>
        <compilation debug="true" targetFramework="4.5" />
        <httpRuntime targetFramework="4.5" requestPathInvalidCharacters="&lt;,&gt;,%,&amp;,?"/>
        <httpHandlers>
            <add verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </httpHandlers>
    </system.web>

    <system.webServer>
        <validation validateIntegratedModeConfiguration="false" />
        <httpErrors existingResponse="PassThrough" />
        <handlers>
            <add name="Nancy" verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </handlers>
    </system.webServer>
</configuration>

There are two sections here (system.web and system.webServer) for different versions of IIS (if I remember correctly, the top one is for the IIS Express web server in Visual Studio and the other one is for IIS7+? I’m not 100% certain).

What the configuration means is that for every incoming request (verb="*", path="*" in the XML), IIS will simply forward the request to the NancyHttpRequestHandler, where it will do Nancy things.

This worked fine for us until we had an issue with our bootstrapper initialization. Specifically, our bootstrapper was throwing exceptions during creation (because it was trying to connect to a database which wasn’t available yet or something) and when that happened, it would stop the webservice from ever starting. In fact, it would be non-functional until we restarted the IIS application pool.

The root cause here was in the NancyHttpRequestHandler and the way it interacted with IIS. Basically, IIS would create one of these classes, which would trigger its static constructor. If that encountered an error, then the whole thing would be broken, never to recover.

The fix was relatively simple. Create a custom request handler (based off the default one) that had some logic in it to lazy load the Nancy bootstrapper/engine with appropriate error handling. The end result was that each request that failed to initialize would fail as expected (with a 500 or something), but the first one to succeed would cache the result for any subsequent requests.
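
A rough sketch of the idea (not our actual handler, and with the Nancy request plumbing elided):

using System;
using System.Web;
using Nancy;
using Nancy.Bootstrapper;

public class ResilientNancyHandler : IHttpHandler
{
    private static readonly object Lock = new object();
    private static INancyEngine _engine;

    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        INancyEngine engine;
        try
        {
            engine = GetOrCreateEngine();
        }
        catch (Exception)
        {
            // Initialization failed (e.g. the database isn't reachable yet).
            // Fail this request, but leave the door open for the next one.
            context.Response.StatusCode = 500;
            return;
        }

        // Hand the request off to Nancy via the engine.
        HandleRequest(engine, context);
    }

    private static void HandleRequest(INancyEngine engine, HttpContext context)
    {
        // The translation between HttpContext and Nancy's request/response
        // types lives here in the real thing; elided for the sketch.
    }

    private static INancyEngine GetOrCreateEngine()
    {
        if (_engine != null)
        {
            return _engine;
        }

        lock (Lock)
        {
            if (_engine == null)
            {
                var bootstrapper = NancyBootstrapperLocator.Bootstrapper;
                bootstrapper.Initialise();
                _engine = bootstrapper.GetEngine();
            }

            return _engine;
        }
    }
}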

With that solution in place our webservices became a little bit more reliable and tolerant of transitory errors during startup.

The Symptom

So, in summary, Nancy good, we made a small customisation to make startup more reliable.

Obviously the story doesn’t end there.

Not too long ago we encountered a bug in one of our APIs where it was returning errors, but there were no associated error logs from the application in our ELK stack.

We could see the errors in our ELB logs (purely as HTTP response codes), but we didn’t get any application level logs showing what the root issue was (i.e. the exception). There were other error logs (some transient problems occurring during service startup), so we assumed that there was some problem with our application logging where it wasn’t correctly reporting on errors that occurred as a result of web requests.

Reproducing the bug locally, the log output showed the requests being logged correctly, and at the correct level.

It was all very mysterious.

The Source

Our first instinct was that the data ingress into the ELK stack was failing. I’d recently been working on the stack, so I naturally assumed that it was my fault, but when we investigated, we discovered that the ELK stack was working exactly as expected. There were no errors indicating that an event had been received and then rejected for some reason (Elasticsearch field mapping conflicts are the most common).

Digging deeper, we checked the local log files on the staging webservice machines and discovered that the log events were missing altogether, having never been logged at all. Even when we caused an error on purpose, nothing was being logged into the local log file.

The answer lay in the custom request handler we implemented. It had allowed the service to remain functional (where previously it would have crashed and burned), but had an unfortunate side effect.

The sequence of actions looked like this:

  1. IIS receives a request
  2. IIS forwards request to custom request handler
  3. Custom request handler initialises itself, creating a bootstrapper
  4. Bootstrapper creates logger configuration, including a custom Sink which opens a write stream to a file (named based on date)
  5. An error occurs (can’t initialize DB because the DNS record does not exist yet, or something equally transient)
  6. Bootstrapper initialization fails. Sink is disposable with a finalizer, so it will be cleaned up eventually, just not straightaway
  7. Request handler initialization fails, failing the request
  8. Another request is received
  9. IIS does the same thing
  10. New bootstrapper is created
  11. New Sink created, pointing at same file (which is already locked)
  12. Initialization succeeds, request returns successfully
  13. Sink now throws an error every time it is logged to, because of the file lock
  14. Serilog discovers errors with the sink, so it stops logging to it to preserve application health
  15. Original Sink disposes of itself and releases file lock
  16. Second Sink starts functioning again, but Serilog has stopped logging to it, so nothing happens

The interesting thing here is that the Sink does not lock the file during its constructor because its core behaviour is to roll files based on both date and size, so every time you call it, it dynamically determines where it should write to. This meant it was created successfully, but could not write any events.

Serilog, being the good logging framework that it is, was catching those errors and stopping them from breaking the application. Unfortunately, because that Sink was the only place where we had visible output, Serilog itself could not say that it was experiencing errors. During the investigation we actually enabled the Serilog self log, and it was reporting all sorts of useful things, and was critical in actually diagnosing the problem.
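
If you ever find yourself in a similar situation, turning the self log on is a one-liner (this is the Serilog 2.x style); just make sure it writes somewhere that isn’t the sink you’re trying to debug:

using System.Diagnostics;
using Serilog.Debugging;

// Route Serilog's own diagnostics somewhere independent of the misbehaving sink.
SelfLog.Enable(message => Trace.WriteLine(message));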

Basically, we had a misbehaving sink and Serilog was protecting us from ourselves.

The Solution

We fixed the problem by moving the initialization of the logging earlier, enforcing that it only happens once per process by using a lazily evaluated static property on the request handler, which was how some of the Nancy properties were already being handled.
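
The shape of the fix looks something like this (the names are illustrative, and the stock rolling file sink stands in for our custom one):

using System;
using Serilog;

public static class LoggingBootstrap
{
    // Lazy<T> guarantees the factory runs at most once per process, no matter
    // how many bootstrappers IIS creates and throws away.
    private static readonly Lazy<ILogger> LazyLogger = new Lazy<ILogger>(CreateLogger);

    public static ILogger Logger
    {
        get { return LazyLogger.Value; }
    }

    private static ILogger CreateLogger()
    {
        return new LoggerConfiguration()
            .WriteTo.RollingFile(@"C:\logs\api-{Date}.log") // stand-in for the custom sink
            .CreateLogger();
    }
}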

This fixed the problem, but looking back I think there were probably a few other ways in which we could have tackled it that would have been better:

  • We could have associated some sort of unique ID with the logger (via the bootstrapper) guaranteeing no file conflicts
  • We could have changed the Sink to handle errors that occur while accessing its desired log file, by generating and logging to a different file
  • We could have handled the fact that the Sink was disposable, and its lifetime should be tied to the bootstrapper as expected

I’ll probably implement at least the second option at some point in the future, just to make the Sink more robust in the face of unexpected circumstances.
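
A sketch of what that second option might look like (not something we’ve actually built yet):

using System.IO;
using Serilog.Core;
using Serilog.Events;

public class FallbackSink : ILogEventSink
{
    private readonly ILogEventSink _primary;
    private readonly ILogEventSink _fallback;

    public FallbackSink(ILogEventSink primary, ILogEventSink fallback)
    {
        _primary = primary;
        _fallback = fallback;
    }

    public void Emit(LogEvent logEvent)
    {
        try
        {
            _primary.Emit(logEvent);
        }
        catch (IOException)
        {
            // The primary file is locked or otherwise unavailable; write the
            // event somewhere else rather than losing it entirely.
            _fallback.Emit(logEvent);
        }
    }
}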

Conclusion

The interesting thing about this whole problem (with the custom Sink and file locking) was that we had actually anticipated that it would be a problem when we initially implemented the Sink. IIS has a tendency to run two applications in parallel whenever it does a recycle, so we knew that there would be periods when two processes might be trying to write to the same location, so we implemented a process ID based prefix to every file. Unfortunately, that approach is remarkably ineffective when everything is happening within the same process.

The hardest part in this whole adventure was trying to determine the cause of a problem when logging was the problem.

Once you get out of the development environment, and into deployed software, logging is pretty much all you have. When that’s gone, diagnosing problems becomes exponentially more difficult.

Like picking up broken glass in the dark.

Sure, its possible, but you’ll probably get cut.