
So, we have all these logs now, which is awesome. Centralised and easily searchable, containing lots of information relating to application health, feature usage, infrastructure utilization along with many other things I’m not going to bother to list here.

But who is actually looking at them?

The answer to that question is something we’ve been struggling with for a little while. Sure, we go into Kibana and do arbitrary searches, we use dashboards, we do our best to keep an eye on things like CPU usage, free memory, number of errors and so on, but we often have other things to do. Nobody has a full-time job of just watching this mountain of data for interesting patterns and problems.

We’ve missed things:

  • We had an outage recently that was the result of a disk filling up with log files because an old version of the log management script had a bug in it. The free disk space was clearly trending downwards when you looked at the Kibana dashboard, but it was happening so gradually that it was never really brought up as a top priority.
  • We had a different outage recently where we had a gradual memory leak in the non-paged memory pool on some of our API instances. Similar to the above, we were recording free memory and it was clearly dropping over time, but no-one noticed.

There have been other instances (like an increase in the total number of 500s being returned from an API, indicating a bug), but I won’t go into any more detail about the fact that we miss things. We’re human, we have other things to do, it happens.

Instead, let’s attack the root of the issue: the human element.

We can’t reasonably expect anyone to keep an eye on all of the data hurtling towards us. It’s too much. On the other hand, all of the above problems could have easily been detected by a computer; all we need is something that can do the analysis for us and then let us know when there is something to action. It doesn’t have to be incredibly fancy (no learning algorithms…yet), all it has to do is compare a few points in time and alert on a trend in the wrong direction.

One of my colleagues was investigating solutions to this problem, and they settled on Sensu.

Latin: In The Sense Of

I won’t go into too much detail about Sensu here, because I think the documentation will do a much better job than I will.

My understanding of it, however, is that it is a very generic, messaging based check/handle system, where a check can be almost anything (run an Elasticsearch query, go get some current system stats, respond to some incoming event) and a handler is an arbitrary reaction (send an email, restart a server, launch the missiles).

Sensu has a number of components, including servers (the wiring logic, check -> handler), clients (the things that checks get executed on) and an API (separate from the server). All communication happens through RabbitMQ, and there is some usage of Redis for storage (which I’m not fully across yet).

I am by no means any sort of expert in Sensu, as I did not implement our current proof of concept. I am, however, hopefully going to use it to deal with some of the alerting problems that I outlined above.

The first check/handler to implement?

Alert us via email/SMS when the available memory on an API instance is below a certain threshold.

Alas, I have not actually done this yet. This post is going to outline the conceptual approach, and I will return later with more information about how it actually worked (or didn’t).

Throw Out Broken Things

One of the things that you need to come to terms with early when using AWS is that everything will break. It might be your fault, it might not be, but you should accept right from the beginning that at some point, your things will break. This is good in a way, because it forces you to not have any single points of failure (unless you are willing to accept the risk that they might go down and you will have an outage, which is a business decision).

I mention this because the problem with the memory in our API instances that I mentioned above is pretty mysterious. It’s not being taken by any active process (regardless of user), so it looks like a driver problem. It could be one of those weird AWS things (there are a lot), and it goes away if you reboot, so the easiest solution is to just recycle the bad API instance and move on. It’s already in an auto-scaling group for redundancy, and there is always more than one, so it’s better to just murder it, relax, and let the ASG do its work.

Until we’re comfortable automating that sort of recycling, we’ll settle for an alert that someone can use to make a decision and execute the recycle themselves.

By installing the Sensu client on the machines in question (incorporating it into the environment setup itself), we can create a check that allows us to remotely query the available free memory and compare it against some configured value that we deem too low (let’s say 100MB). We can then configure two handlers for the check result: one that emails a set of known addresses and another that does the same via SMS.
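For reference, my understanding from the Sensu documentation is that the server-side definition for a check like this is just a small piece of JSON. The sketch below is entirely hypothetical (the check script, subscription name and threshold are placeholders), but it shows the shape of the thing: a command that exits with 0/1/2 for OK/Warning/Critical, a schedule, and the handlers to fire when something is wrong.

    {
      "checks": {
        "check-free-memory": {
          "command": "powershell.exe -NoProfile -File check-free-memory.ps1 -ThresholdMB 100",
          "subscribers": ["api"],
          "interval": 60,
          "handlers": ["email", "sms"]
        }
      }
    }

The email and sms handlers would be defined separately on the Sensu server, and would simply react to any non-OK result from this check.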

Seems simple enough. I wonder if it will actually be that simple in practice.

Summary

Alerting on your aggregated information (logs, stats, etc.) is a pretty fundamental capability that you need to have.

AWS does provide some alerting in the form of CloudWatch alarms, but we decided to investigate a different (more generic) route instead, mostly because of the wealth of information that we already had available inside our ELK stack (and our burning desire to use it for something other than pretty graphs).

As I said earlier, this post is more of an outline of how we plan to attack the situation using Sensu, so it’s a bit light on details, I’m afraid.

I’m sure the followup will be amazing though.

Right?


As I’ve already stated, I’ve spent the last few weeks working on putting together log aggregation so that we know what our new service looks like in real time.

I’ve incorporated IIS logs, the application logs, machine statistics (memory, CPU, etc) and Windows Event logs into the log aggregator, and successfully used those events for analysis during load testing.

There was one piece missing though, which meant there was a hole in our ability to monitor how our service was actually operating in the face of actual usage.

The Elastic Load Balancer, or ELB, that sits in front of the publicly accessible web service.

During load testing, I noticed that sometimes JMeter would record an error (specifically a 504, Gateway Timeout) but our dashboard in Kibana would show nothing. No errors, everything seemed fine.

It turned out that there was a default idle timeout on the ELB of 60 seconds, and at that point in the load testing some requests were taking longer than that without causing any traffic over the connection. The ELB would terminate the connection and return a 504 to the client, but the request would still complete successfully (eventually) in the backend.

I needed to get eyes on the ELB.

It’s Log!

Turning logging on for an ELB is fairly easy.

Just give it the S3 bucket you want it to log to, a prefix to use for entries made into the bucket and a time interval, and off it goes. All of this can be done through the CloudFormation template, which fits well into our strategy for environment setup (no manual tasks, automate all the things).
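For what it’s worth, the relevant piece of the CloudFormation template is just an AccessLoggingPolicy on the load balancer resource. The fragment below is a trimmed-down sketch (the bucket reference and prefix are placeholders, and the rest of the load balancer’s required properties are omitted):

    "LoadBalancer": {
      "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties": {
        "AccessLoggingPolicy": {
          "Enabled": true,
          "S3BucketName": { "Ref": "LogsBucket" },
          "S3BucketPrefix": "environment-name/elb",
          "EmitInterval": 5
        }
      }
    }

The EmitInterval is in minutes, and as far as I know only 5 and 60 are valid values.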

The only complex bit is setting up a bucket policy with the correct permissions to allow the ELB to write to the bucket, which is all pretty well documented. There is simply a well-known ARN for what I assume is all load balancers in a region, and you set up a simple Put/Get/List policy to allow it to do its thing.
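The policy itself ends up looking something like the sketch below. The principal is the region-specific ELB account (I’ve left a placeholder rather than a real account id; the actual values are in the AWS documentation), and PutObject is the action that actually matters for log delivery:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::<elb-account-id-for-region>:root" },
          "Action": [ "s3:PutObject" ],
          "Resource": "arn:aws:s3:::<logs-bucket>/<prefix>/AWSLogs/<your-account-id>/*"
        }
      ]
    }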

The only gotcha I ran into was when I included an underscore (_) in the prefix configuration setting for the ELB. The prefix setting is intended to make sure that the keys for files written into the bucket start with a common value. When I included an underscore, I got nothing but Access Denied errors. This was at the same time as I was setting up the bucket policy, so I assumed I had done that incorrectly. Turns out my bucket policy was flawless, and it was a completely unrelated (and unexpected) issue causing the Access Denied errors.

Very frustrating.

With that fixed though, the logs started flowing.

Content Rich

The ELB logs contain things like the ELB IP and port, where the request was forwarded to (IP and port again), the time taken to forward, process and respond to the request (three separate fields; the processing time is how long your server takes to do its thing), response codes, bytes transferred and other things. Very similar to IIS really, which is not unexpected.
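To give you an idea of what these logs look like, a single (completely made up) entry goes something like this; space separated, with the three timing fields in seconds:

    2015-05-07T04:13:27.123456Z my-elb 203.0.113.10:54321 10.0.1.15:80 0.000043 0.145732 0.000021 200 200 0 1270 "GET https://my-elb.example.com:443/entity/images/1 HTTP/1.1"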

Now all I had to do was get the information into our Log Aggregator.

Stashing Those Logs

I had been using Nxlog as my log processor. It was responsible for picking up files, processing them as necessary, enriching them with various pieces of information (hostname, component, application) and then shipping the results off via TCP to our log aggregator where Logstash was listening.

Nxlog is a fine product, but its scripting language is hard to get a handle on, and the documentation is a bit sparse. It also has no concept of decimal numbers, which meant that I had to convert some values to integers (like decimal seconds to milliseconds) via regular expressions. Altogether it got the job done, but I wasn’t particularly happy with it.

I thought that since I needed to do something a little bit more complicated (get files from S3 and process them), I would use Logstash this time. Logstash as a log processor is a lot easier to distribute, configure and debug, which is nice. Its configuration is a simple JSON-like format that is very easy to wrap your head around, and it has lots of components to accomplish various tasks, like getting files from S3, parsing CSV lines, mutating fields to the correct type, etc. It even has a mutator (Logstash calls them filters) that allows you to execute arbitrary Ruby code for those times when you have to do something unusual.
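To make that concrete, here is a cut-down sketch of the sort of configuration I mean. The bucket, field names and the grok pattern are illustrative rather than the exact config I ended up with, and the S3 input shown is the newer version that can pick up credentials from the machine’s IAM role (more on that below):

    input {
      s3 {
        bucket => "my-elb-logs-bucket"
        prefix => "environment-name/elb"
        region => "ap-southeast-2"
      }
    }

    filter {
      grok {
        match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:elb} %{NOTSPACE:client} %{NOTSPACE:backend} %{NUMBER:request_processing_time} %{NUMBER:backend_processing_time} %{NUMBER:response_processing_time} %{INT:elb_status_code} %{INT:backend_status_code} %{INT:received_bytes} %{INT:sent_bytes} %{QS:request}" }
      }
      mutate {
        convert => { "backend_processing_time" => "float" "sent_bytes" => "integer" }
      }
    }

    output {
      tcp {
        host => "our-log-aggregator"
        port => 5000
        codec => "json_lines"
      }
    }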

Even better, Logstash is what's listening on the other end of the pipeline, so the two play well together, and you only need to know one piece of software instead of two.

I built a distributable project similar to the one I built for Nxlog: it creates a NuGet package that Octopus can deploy to get a copy of Logstash up and running on the target machine as a Windows Service. I won’t go into it in too much detail, but it was essentially the same thing that I did for Nxlog, except with different dependencies (JRE, Logstash, NSSM for service installation/configuration).
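The guts of the service installation is just a handful of NSSM commands run at deployment time. A rough sketch (the paths, service name and bundled JRE location are specific to my package layout, so treat them as placeholders):

    # Install Logstash as a Windows service via NSSM (paths and names illustrative).
    $nssm = "C:\Tools\nssm\nssm.exe"

    & $nssm install "logstash" "C:\Tools\logstash\bin\logstash.bat"
    & $nssm set "logstash" AppParameters "agent -f C:\Tools\logstash\conf\logstash.conf"
    & $nssm set "logstash" AppDirectory "C:\Tools\logstash"
    & $nssm set "logstash" AppEnvironmentExtra "JAVA_HOME=C:\Tools\jre"
    & $nssm start "logstash"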

I added a small EC2 instance to our environment setup to act as a Logs Processor, with the intent that it would immediately be used to process the ELB logs, but might also be used in the future to process other logs that don’t necessarily fit onto a specific machine (S3 access logs are the only ones that come to mind, but I’m sure there are more). The Logs Processor had an IAM role allowing it full control over the logs bucket that the ELB was using (which was also created as part of the environment). Nice and clean, and no credentials stored anywhere.

I created a Logstash configuration to grab files from S3 and process them, and then deployed it to the Logs Processor.

Access Denied.

Permission to Rage Requested

The current release version of Logstash (1.4.2) does not support the use of IAM roles for the S3 input. If I wanted to use that input, I would have to enter the credentials manually into the config file. I could do this easily enough at deployment time (storing the credentials in Octopus, which is much better than in source control), but I would need to have a user set up that could access the bucket. As the bucket is created during environment creation, the credentials would change every time the environment was recreated. We create temporary environments all the time, so this would mean a manual step editing Octopus every time you wanted to get something to work.

That's unacceptable.

I contemplated using a small script during deployment time to grab some credentials from the IAM role on the machine and enter them into the config file, but those credentials expire and Logstash was running as a service, so at some stage it would just stop working and someone would have to do something to make it work again.

Again, unacceptable.

Luckily for me, the wonderful people behind Logstash (and specifically the S3 input plugin) have developed a new version that allows the use of IAM roles, and it was already in beta. It’s still a little unstable (Release Candidate 2), but it was good enough for my purposes.

While doing some reading about Logstash and the new version, I discovered that the file input was basically completely broken on Windows. The component it was leveraging to get a unique identifier for each file (in order to record the position it was up to) does not work in 1.4.2 and below, so you end up missing huge chunks of data when processing multiple files. This actually explained why I was having so much difficulty using the earlier version to process a large amount of IIS logs from a disconnected machine, and why there were holes in my data. Long story short, if you’re using the file input in Logstash and you’re on Windows, get the latest release candidate.

I incorporated the 1.5 RC2 release into my deployment, but I still couldn’t get the S3 input to work.

Why Is It Always A Proxy

I hate proxies.

Not because of what they are. I think they actually do some useful things, like caching, obfuscating where requests are coming from when accessing the internet from within a network and preventing access to bad websites.

No, I hate proxies because the support for them is always a pain in the ass. Every application seems to support proxies differently, if it supports them at all. Some automatically read the Internet Explorer proxy settings from the registry, some use the HTTP_PROXY environment variable and some have their own personal settings. This means that every time you want to use a piece of software in an environment that uses a proxy, you have to fight with it to get it to work.

Such was the case with the S3 input. The underlying Ruby aws-sdk has support for proxies (as does the .NET one, which exposes it via the Set-AwsProxy cmdlet in the AWS PowerShell tools).

I could not, for the life of me, figure out how to configure Logstash with a proxy for the AWS component though.

So, I was stuck. I had all the configuration in place to process the ELB logs, but I didn't have the logs themselves.

In the end I created a small PowerShell script that uses the AWS PowerShell module to move all files from the S3 bucket to a local directory on a timer. I then installed that script as a Windows Service using NSSM. Finally, I edited my Logstash configuration to process the local files instead. After tweaking my config to process the files correctly, everything started coming through into the Log Aggregator as expected, and I had added the missing piece to our intelligence about the service.
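Conceptually the script is about as simple as the sketch below (the cmdlet names come from the AWS Tools for PowerShell; the bucket, prefix and destination are placeholders, and you would obviously want some error handling around it):

    # Periodically pull ELB log files out of S3 onto local disk for Logstash to pick up.
    # Runs as a Windows service via NSSM; credentials come from the instance's IAM role.
    Import-Module AWSPowerShell

    $bucket = "my-elb-logs-bucket"
    $prefix = "environment-name/elb"
    $destination = "C:\logs\elb"

    while ($true)
    {
        foreach ($object in (Get-S3Object -BucketName $bucket -KeyPrefix $prefix))
        {
            $fileName = ($object.Key -split "/")[-1]

            # Download the file, then delete it from the bucket so it isn't processed twice.
            Read-S3Object -BucketName $bucket -Key $object.Key -File (Join-Path $destination $fileName) | Out-Null
            Remove-S3Object -BucketName $bucket -Key $object.Key -Force | Out-Null
        }

        Start-Sleep -Seconds 60
    }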

I don’t like this solution, because it adds more moving parts than I think is strictly necessary, but sometimes you have to make compromises.

Summary

I’ve uploaded a repository with my deployable build for Logstash here, so hopefully someone else can benefit from the effort that I put into making it re-usable.

Setting up a deployment pipeline for this component saved me a lot of time throughout the development process, making redeploying my changes when I made a mistake or needed to change a dependency (like upgrading to Logstash 1.5 RC2) a breeze. I highly recommend spending that initial bit of effort in setting things up at the start so you can move quickly later.

In regards to the actual ELB logs, they don’t provide any groundbreaking information that IIS didn't already give us, except for the case where connections are terminated at the ELB due to inactivity (and, I suppose, the case where the underlying instances go offline entirely, which will be good to have). The ELB entries come in a bit slower than the IIS ones (due to the delay before the log files are published from the ELB, plus the delay added by my own S3 downloader and Logstash file processor pair), but there’s not really much I can do about that.

I still hate proxies. Only because it’s easier to hate one thing than every application that doesn’t support them.


As I’ve already mentioned, I’ve been doing load testing.

Obviously, the follow-up to executing load tests is to analyse the data that comes out of them.

With our new service, we now log many disparate sources of information into a single location, using Elasticsearch, Logstash and Kibana. As the load tests are executed, streams of data come into our log aggregator, ready to be sliced in various ways to see exactly how the service is operating and to find out where the bottlenecks are.

Requesting Assistance

The service leverages Nxlog to ship all IIS and application logs to the log aggregator. This gave me the ability to track average/max request times, error responses and errors occurring inside the application, in real time, as the load test was being run. A side effect of this is that it will also do the same once the service is made publicly available.

IIS logs gave me access to fields like sc-status (the returned response code, e.g. 200, 401 or 500), time-taken (the total elapsed time for the request to be completed), the request path (e.g. entity/images/1) and bytes uploaded and downloaded (cs-bytes and sc-bytes respectively), along with many others.

Our application logs mostly just gave me access to whether or not unhandled exceptions had occurred (we use Nancy, and our OnError pipeline simply logs the error out while returning a 500) as well as some information about what the service is doing internally with authentication/authorisation.

For the first run of the load test I chose 50 users with no ramp-up, i.e. they would all start working through the test plan all at once. Each user would execute a series of actions that approximated what usage on the service would be like over a day, compressed into a much smaller time period.

I used Kibana to construct a chart that showed the average and maximum request latency (using the time-taken field from IIS) across all requests over time, and another chart that showed the top 5 status codes, also over time.

I had some initial difficulties using time-taken for numerical aggregations in Kibana. When I originally set up the Nxlog config that shipped the log entries to our log aggregator, I had neglected to type the time-taken field as an integer (it’s measured in whole milliseconds), so Elasticsearch had inferred the type of the field as string. As you can imagine, strings can’t participate in numerical aggregations. I still don’t quite understand ES mappings, but I had to change the type of time-taken in Nxlog (by casting/parsing to integer in the Exec block of my input), delete the index in ES and let it correctly infer the type as a number. When I looked at the JSON being output by Nxlog, the TimeTaken field values were enclosed in quotes, so it’s likely that’s why ES inferred them as strings.
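For posterity, the fix in the Nxlog config amounted to a one-liner in the input’s Exec block, along the lines of the snippet below (the field name depends on how your W3C fields are parsed, so treat it as illustrative):

    # Force the IIS time-taken value to a number so Elasticsearch infers it as an integer
    # rather than a string when the JSON is indexed.
    Exec $TimeTaken = integer($TimeTaken);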

My first set of results were informative, but they didn’t exactly point me towards the area that needed improvement.

As you can see from the chart above, latency was fairly terrible, even with only 50 users. Looking at the timeline of the test, latency was okay during the periods that didn’t involve images, rose when images were being uploaded and then rose to unacceptable levels when images were being downloaded.

I had standardised all of the images so that they were approximately 1.2MB, so a service that scales well with traffic should show an increase in latency (to account for the time to physically upload and download the images) but it should be consistent with the size of the images.

Looking at the response codes, everything was mostly fine (if slow) until images started to be downloaded, and then 500’s would start to appear. Investigating the 500’s showed that they were primarily caused by timeouts, indicating that part of the system was under so much load that it could no longer service requests within the timeout period. This lined up with the increasing latency of the service.

The timeouts were coming from our underlying database, RavenDB.

Cogs in the Machine

From the IIS and application logs I knew how the service was acting from the point of view of an external party (mostly). What I didn’t know was how the underlying machines were acting, and I didn’t have any eyes on the database underlying the service at all. My first implementation had only aggregated logs from the API instances.

It was easy enough to customize the Nxlog deployment that I built to ship different configurations to different machines, so it didn’t take very long to start getting logs from the database machine (IIS).

I also needed machine statistics, to identify why the underlying database was unable to service requests in a reasonable amount of time. Ideally these statistics would be available in the same place as the other events, so that I could make correlations between the data sets.

As far as I can tell, on Windows machines if you want to log statistics about the system itself, you use performance counters. You can track just about anything, but for now I limited myself to CPU Utilization, Free Memory and Free Disk Space. I assumed that the database machine was sitting at 100% CPU, or running out of memory and resorting to virtual memory, or something along those lines.

I created a project similar to the Nxlog one, except focused on installing a performance counter data collector based on a configuration file, again deployed through Octopus. It leveraged the command line tool logman to configure and start the performance counters, and simply focused on outputting system statistics to a known log file. I then altered the Nxlog configuration file for both the API and Database instances to read those statistics in and push them out to ES via Logstash, just like the IIS and application logs.
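The logman part is pleasantly simple. Something like the following creates a collector that samples a handful of counters every five seconds and writes them to CSV for Nxlog to tail (the counter paths, collector name and output location are illustrative):

    # Create and start a performance counter collector (run from an elevated prompt).
    logman create counter "SystemStats" `
        -c "\Processor(_Total)\% Processor Time" "\Memory\Available MBytes" "\LogicalDisk(C:)\% Free Space" `
        -si 00:00:05 `
        -f csv `
        -o "C:\logs\stats\system-stats"

    logman start "SystemStats"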

With system statistics now available, I improved the Kibana dashboard to show more information. I added charts of CPU utilization, free memory and free disk, and slightly altered my request latency charts. Most of the time the requests with the highest latency were those which involved images. The main reason for this is that it takes time to move data around the internet, and all of my test images were approximately 1.2MB, so the time to transfer the raw data was not trivial.

In order to better understand the latency of various requests, I needed to differentiate image requests from non-image requests. At the same time I wanted to distinguish different resources from one another, so I could see which resources were slow. I added some custom code to the Nxlog config to parse resource information out of the request path and store the resource identifier as a field along with all of the other data, and I used the same approach to identify the requests that dealt with images and tag them as such.
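The Nxlog additions were conceptually just a couple of regular expression checks in the Exec block of the IIS input, something like the snippet below (the field and tag names here are illustrative, not the exact ones I used):

    # Pull the first path segment out of the request path as a resource identifier,
    # and flag anything that looks like an image request.
    Exec if $Path =~ /^\/(\w+)/ $Resource = $1;
    Exec if $Path =~ /\/images\// $IsImageRequest = TRUE;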

I then used this information to mutate my latency chart into two charts. One shows the latency (average, max) of image requests, broken down by Verb (PUT, POST, GET, DELETE) and the other shows latency for each resource in the system (one chart per resource).

At this point it was helpful to define and save some queries in Kibana that immediately narrowed the total available data down to the appropriate subset for charting. We tag our log messages with an application id (something uniquely identifying the system), a component id (like API or DB) and a source (like IIS or application). I defined some queries for Application + Component:API + Source:IIS + Images/Not Images, which I then used as the inputs for the latency chart visualizations I mentioned above. It’s hard to filter inside the visualizations themselves, so I recommend creating and saving queries for that purpose.
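As an example, the saved queries were simple Lucene query strings along these lines (the field and value names are placeholders for our actual tags):

    Application:MyService AND Component:API AND Source:IIS AND Tags:image
    Application:MyService AND Component:API AND Source:IIS AND NOT Tags:image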

My new monitoring dashboard looked like this:

After running another load test, I could clearly see that the images were the slowest part (as expected), and that when images were being downloaded, the performance of every other resource suffered. Alas, all of the machine statistics looked fine at the time of the high latency. None of the machines were showing a bottleneck in any of the areas that I was monitoring. CPU and Free Memory were fine, and yet the requests were taking upwards of 100 seconds (on average) to complete when image downloads were occurring.

Moar Stats

As I could see the timeouts coming from the underlying database in the API logs, I knew that machine was the issue, I just didn’t know how.

I suspected that the issue was probably disk based (because it was the one thing I wasn’t monitoring) so I added a greater selection of statistics (average disk latency, disk queue length) and re-ran the load tests.

The result was clear. During the period where the load test was downloading images, the latency of the data disk was reaching 800ms per transfer. That's insane! Normal disk latency is sub-30ms.

Finally I had found the obvious bottleneck, but now I had to fix it.

Birds are Weird

Long story short, I tried to improve the performance by making the data disk more awesome (thanks to AWS, this was fairly easy, as you can configure the desired amount of IOPS) and making the database machine itself more awesome (again, thanks AWS and its big chunky instance types), but nothing seemed to make much of a difference.

Finally we had to make the hard call and stop storing images in our underlying database. It just didn’t do it well.

We were still using an old version of RavenDB, and were using its attachment feature to store the images. I have to assume we were doing something wrong, because RavenDB is a pretty good product as far as I can see, and I don’t see why it wouldn’t work. I’m honestly not that familiar with RavenDB and the decision to use it was made before I started, so I couldn’t really determine why it wasn’t performing as we expected.

Moving our images to a different storage mechanism was fairly easy; our external API didn’t need to change, and we just moved the image storage into S3.

The next run of the load tests showed average response times under load going down to about 7 seconds, which is about the time it takes to move 1.2MB on our internet connection + some overhead.

The charts still look basically the same, but the numbers are a hell of a lot lower now, so we’ve got that going for us, which is nice.

Summary

Always load test. Always.

We had already guessed before we started load testing that the underlying database would probably be a bottleneck, as there was only one of them, and we had issues adding more in parallel during development. We had assumed that any performance issues with the database could be solved by simply making it more awesome in whatever way it was bottlenecked. Unfortunately, that turned out to be a false assumption, and we had to change the internals of the system in order to scale our performance when it came to image storage.

Without load testing and the analysis of its results via Kibana and its lovely charts, we wouldn’t have been able to identify the scaling issue and resolve it, with real numbers to back up our improvements.

I cannot recommend using a log aggregator enough. You should literally aggregate all the things. Once you have all of that information at your fingertips, you can do all sorts of amazing analysis, which leads to greater insights into how your software is actually running (as opposed to how you think it’s running).