
Another short one this time, as I still haven’t been up to anything particularly interesting recently. This should change in the coming weeks as we start work on a brand new service, which will be nice. We learnt a lot from the first one, but we still have some unanswered questions like database schema versioning and deploying/automating functional tests as part of a continuous integration environment, which should make for some good topics.

Anyway, this week’s post is a simple one about a component I ran across the other day while refactoring an application.

It’s called Fluent Command Line Parser, and it fills one of those niche roles that you come across all the time but never really spend any time thinking about: supplying command line arguments to an application.

Yes Commander?

Most of the time when you see an application that takes input from the command line, you’ll probably see assumptions about the order of elements supplied (like args[0] is the database, args[1] is the username, etc). Typically there is little, if any, validation around the arguments, and you would be lucky if there was help or any other documentation to assist the people who want to use the application from the command line. There are many reasons for this, but the most common is usually effort. It takes effort to put that sort of thing in, and unless the command line tool is going to be publicly consumed, that effort probably isn’t necessary.

Fluent Command Line Parser (to be referred to as FCLP from here on) makes this sort of thing much much easier, reducing the effort to the level of “Why not?”.

Essentially you configure a parser object to recognise short and long tokens to indicate a particular command line argument, and then use the presence of those tokens to fill properties in a strongly typed object. The parser can be configured to provide descriptions of the tokens, whether or not they are required, callbacks to be run when the token is encountered and a few other things that I haven’t really looked into yet.

FCLP also offers the ability to set a help callback, executed whenever /? or /help is present, which is nice. The help callback automatically ties in with the definitions and descriptions for the tokens that have been configured, so you don’t need to document anything twice.

Enough rambling about the code, let’s see an example:

public static void Main(string[] args)
{
    var parser = new FluentCommandLineParser<Arguments>();
    parser
        .Setup(a => a.Address)
        .As('a', "address")
        .WithDescription("The URL to connect to the service.");
        .Required();

    parser
        .Setup(a => a.Username)
        .As('u', "username")
        .WithDescription("The username to use for authentication.");
    
    parser
        .Setup(a => a.Password)
        .As('p', "password")
        .WithDescription("The password to use for authentication.");

    var parsedArgs = parser.Parse(args);

    // Check parsedArgs for errors, then use the populated Arguments instance (parser.Object).
}

private class Arguments
{
    public string Address { get; set; }
    public string Username { get; set; }
    public string Password { get; set; }
}

As you can see, it’s pretty clear what’s going on here. It’s simple to add new arguments as well, which is nice.

The Commander is Down!

It’s not all puppies and roses though. The FCLP library is very simple and is missing the ability to do at least one important thing that I needed, which was a shame.

It only supports simple/primitive types (like strings, integers, booleans, enums and collections of those primitives) for the properties being set in your parsed arguments object. This means that if you want to supply a Uri on the command line, you have to add an extra step or an extra property on your parsed arguments class that converts the string value passed in into a Uri. It’s not a total deal breaker, just unfortunate, because it seems like something you should be able to do with a callback or something similar, but the context (the object containing the parsed arguments) is just not available at that point in the workflow.

I might take a stab at adding that ability to the library (it’s open source) if I get a chance. I feel like you could probably accomplish it by adding another method that overrides the default parsing of the property from the entered string (supplying a Func&lt;string, T&gt; to transform the string with custom logic).

Conclusion

If you’re looking for a nice simple library to make your command line parsing code a bit clearer, this is a good one to grab. It really does help to improve the quality of a typically messy area of an application, which makes it a lot easier for someone coming in later to understand. Just add a reference to the FluentCommandLineParser package via NuGet (you are using NuGet, aren’t you? No reason not to) and off you go.

Remember, every little bit like this helps, because we don’t write code for ourselves. We write it for the people who come after us. Even if that person is you 6 months from now, when you’ve forgotten everything.


This will hopefully be a short post, as I haven’t been up to anything particularly interesting recently. We deployed the service we’d been working on, and everything appears to be going swimmingly. The log aggregation that we set up is doing its job (giving us high visibility into the operations of the service) and we’re slowly introducing new users, making sure they have a great experience.

As a result of the recent deployment, we’re in a bit of a lull, so I took this opportunity to work on our development infrastructure a little bit.

Well, I didn’t start out with the intent to work on our infrastructure. I just wanted to update this component we have that seeds demonstration data into our new service. When I got into the code though, I found that we had a bunch of shared code that was duplicated. Mostly code for dealing with the models inherent in the service and with authentication to the service. The approach that had been taken was to simply copy the code around, which is not generally a good idea in the long run. The demonstration data component didn’t have the copied code (yay!) but it was referencing the libraries of another solution directly from their output directory, which meant you had to remember to compile that other solution before the demo data component, which is a bit crap.

All in all, it was in need of a clean up.

Fairly straightforward: split the common code into a different project and repository, create a new TeamCity build configuration to output a NuGet package, and reference it like that.

It’s never straightforward.

Free As In Beer

We currently use TeamCity as our NuGet server, with the outputs of various build configurations being added to the NuGet server when they are compiled. The downside of this is that you can’t actually publish packages to the TeamCity NuGet server from outside of TeamCity. Another downside of the TeamCity NuGet server is that the packages are tied to the builds that made them. We have had issues in the past with packages disappearing from the NuGet server because the build that created the package was cleaned up. It makes sense when you think about it, and we should definitely have pinned those builds to keep them forever, but it’s still another thing that we need to remember to do, which is vulnerable to human failure.

The free version of TeamCity is pretty great. Three build agents and 20 build configurations are usually enough for small teams, and the product itself is pretty amazing. When you hit that build configuration limit though, you need to pay a few hundred dollars. That’s fine, it’s a solid product and I have no problem giving JetBrains (the company that makes TeamCity) money for their efforts.

Getting money out of some organizations though, that can be a giant pain. Which it was.

Time for an alternate approach. Time to set up our own NuGet server.

NuGet makes this pretty easy. Just create a new, blank ASP.NET web application and add a reference to the NuGet.Server package. Easy.
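For reference, the whole thing is basically one command from the Package Manager Console (a minimal sketch, assuming the standard NuGet.Server package; its defaults can be tweaked in web.config afterwards):

# Run from the Package Manager Console of the new, blank ASP.NET web application.
# This pulls in the NuGet.Server package and its dependencies, turning the site into a package feed.
Install-Package NuGet.Server

# Once deployed, pushed packages land in ~/Packages by default and the feed itself
# is exposed at http://<your-server>/nuget.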

When you’re building a piece of infrastructure though, you want to go to a bit of trouble to make it nicely encapsulated. So I leveraged some work we’d done as part of the new service and created a deployable package for our NuGet server, an Octopus project to deploy it and an environment for it to be deployed to via AWS CloudFormation. While I was there I added the logging components to make sure we have records of what is being uploaded to and downloaded from the server.

Yak Shaved, Moving On

Now that I had a NuGet server separate from TeamCity, I could split out the common components into their own repository and construct a build/publish script to publish to the NuGet server from my local machine (with the intent to turn it into a TeamCity build configuration later, when we finally pay money).

I’ve done this before, so I just grabbed my .NET Library Repository template and added the new solution with the project and its tests. The build script in the template automatically picks up the solution file, restores its packages, generates a version, compiles the solution with MSBuild and then publishes a package for the single non-test project. It’s pretty simple; I just use it to get reusable libraries up and running ASAP.
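Stripped of all the ceremony, the guts of that build script boil down to something like this (a simplified sketch rather than the actual template; the project name, feed URL, version and API key are placeholders, and it assumes nuget.exe and msbuild.exe are available on the path):

# Simplified sketch of the build/publish flow used by the library repository template.
$version = "0.0.1"          # the real template generates this automatically
$apiKey = "API KEY HERE"    # placeholder; supplied from outside in the real script
$solution = (Get-ChildItem -Path . -Filter *.sln -Recurse | Select-Object -First 1).FullName

& nuget restore $solution
& msbuild $solution /p:Configuration=Release /verbosity:minimal

# Pack the single non-test project and push the result to the internal NuGet server.
$project = ".\src\MyCompany.Common\MyCompany.Common.csproj"
& nuget pack $project -Properties Configuration=Release -Version $version -OutputDirectory .\output
& nuget push ".\output\MyCompany.Common.$version.nupkg" -Source "http://our-nuget-server/" -ApiKey $apiKey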

I did have a few difficulties with untangling the common code from the rest, which is probably why it was copied in the first place. The easiest way to go about this sort of thing is to create a brand new project and namespace and just migrate the minimum amount of code that is needed by the other components. If you try and take everything at once, you’re going to have a bad time.

Now, with the magic of NuGet, I just need to add a reference to the new package in the various places that need it and delete the duplicated code, running the tests (where they exist) to ensure that I haven’t broken anything.

Summary

It’s important for the health of your code base to execute this sort of clean up when you see an issue like duplicated code. Some people would say that some duplicated code isn't too bad, and I would generally agree. However, you need to be able to tell a very good story for why the code is duplicated, and not just accept that it is. It’s usually not too hard to extract a library and create a NuGet package, assuming you have the infrastructure set up to support it. Which you should. If you don’t, you should set it up.

If you’ve never heard the term “Shaving a Yak” before, you can blame Scott Hanselman for that one. He did a blog post on the concept and I have never seen anything that better describes an average day in the life of a software developer. I use the term a lot whenever I’m doing something I never expected to be doing due to some crazy causality chain, which happens more than I would like.


As I’ve already stated, I’ve spent the last few weeks working on putting together log aggregation so that we know what our new service looks like in real time.

I’ve incorporated IIS logs, the application logs, machine statistics (memory, CPU, etc) and Windows Event logs into the log aggregator, and successfully used those events for analysis during load testing.

There was one piece missing though, which meant there was a hole in our ability to monitor how our service was actually operating in the face of actual usage.

The Elastic Load Balancer, or ELB, that sits in front of the publicly accessible web service.

During load testing, I noticed that sometimes JMeter would record an error (specifically a 504, Gateway Timeout) but our dashboard in Kibana would show nothing. No errors, everything seemed fine.

It turned out that there was a default timeout on the ELB of 60 seconds, and at that point in the load testing, some requests were taking longer than that without causing any traffic over the connection. The ELB would terminate the connection, return a 504 to the client, but the request would still complete successfully (eventually) in the backend.

I needed to get eyes on the ELB.

It’s Log!

Turning logging on for an ELB is fairly easy.

Just give it the S3 bucket you want it to log to, a prefix to use for entries made into the bucket and a time interval, and off it goes. All of this can be done through the CloudFormation template, which fits well into our strategy for environment setup (no manual tasks, automate all the things).

The only complex bit is setting up a bucket policy with the correct permissions to allow the ELB to write to the bucket, which is all pretty well documented. There is simply a well known ARN for what I assume is all load balancers in a region, and you set up a simple Put/Get/List policy to allow it to do its thing.
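For illustration, if you were doing it outside of CloudFormation it would look something like the following sketch, using the AWS PowerShell cmdlets. The account ID is a placeholder for the well known, region specific ELB account, and I’ve only shown the Put permission here, which is the part the ELB actually needs in order to write its log files:

# Sketch: allow the region's ELB log delivery account to write into our logs bucket.
$bucket = "my-elb-logs-bucket"      # placeholder bucket name
$prefix = "my-service-elb"          # matches the prefix configured on the ELB
$elbAccountId = "XXXXXXXXXXXX"      # placeholder; look up the real ID for your region

$policy = @"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::$($elbAccountId):root" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::$bucket/$prefix/*"
    }
  ]
}
"@

Write-S3BucketPolicy -BucketName $bucket -Policy $policy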

The only gotcha I ran into was when I included an underscore (_) in the prefix configuration setting for the ELB. The prefix setting is intended to make sure that the keys for files written into the bucket start with a common value. When I included an underscore, I got nothing but Access Denied errors. This was at the same time as I was setting up the bucket policy, so I assumed I had done that incorrectly. Turns out my bucket policy was flawless, and it was a completely unrelated (and unexpected) issue causing the Access Denied errors.

Very frustrating.

With that fixed though, the logs started flowing.

Content Rich

The ELB logs contain things like the ELB IP and port, where the request was forwarded to (IP and port again), the time to forward, process and respond to requests (3 separate entries, process is the time it takes for your server to do its thing), response codes, bytes transferred and other things. Very similar to IIS really, which is not unexpected.

Now all I had to do was get the information into our Log Aggregator.

Stashing Those Logs

I had been using Nxlog as my log processor. It was responsible for picking up files, processing them as necessary, enriching them with various pieces of information (hostname, component, application) and then shipping the results off via TCP to our log aggregator where Logstash was listening.

Nxlog is a fine product, but its scripting language is hard to get a handle on, and the documentation is a bit sparse. Also, it has no concept of decimal numbers, which meant that I had to convert some numbers to integers (like decimal seconds to milliseconds) via regular expressions. Altogether it got the job done, but I wasn’t particularly happy with it.

I thought that, since I needed to do something a little bit more complicated (get files from S3 and process them), I would use Logstash this time. Logstash as a log processor is a lot easier to distribute, configure and debug, which is nice. Its configuration format is JSON-like and very easy to wrap your head around, and it has lots of components to accomplish various tasks like getting files from S3, parsing CSV lines, mutating fields to the correct type, etc. It even has a mutator (Logstash calls them filters) that allows you to execute arbitrary Ruby code for those times when you have to do something unusual.

Even better, Logstash is what’s listening on the other end of the pipeline, so they play well together, and you only need to know one piece of software instead of two.

I built a similar distributable project to what I built for Nxlog, that creates a NuGet package that Octopus can deploy to get a copy of Logstash up and running on the target machine as a Windows Service. I won’t go into this in too much detail, but it was essentially the same thing that I did for Nxlog, except with different dependencies (JRE, Logstash, NSSM for service installation/configuration).

I added a small EC2 instance to our environment setup to act as a Logs Processor, with the intent that it would immediately be used to process the ELB logs, but may also be used in the future to process other logs that don’t necessarily fit onto a specific machine (S3 access logs are the only ones that come to mind, but I’m sure there are more). The Logs Processor had an IAM role allowing it full control over the logs bucket that the ELB was using (which was also created as part of the environment). Nice and clean, and no credentials stored anywhere.

I created a Logstash configuration to grab files from S3 and process them, and then deployed it to the Logs Processor.

Access Denied.

Permission to Rage Requested

The current release version of Logstash (1.4.2) does not support the usage of IAM roles for the S3 input. If I wanted to use that input, I would have to enter the credentials manually into the config file. I could do this easily enough at deployment time (storing the credentials in Octopus, which is much better than in source control), but I would need to actually have a user setup that could access the bucket. As the bucket is created during environment creation, this would mean that the credentials would change every time the environment was recreated. We create temporary environments all the time, so this would mean a manual step editing Octopus every time you wanted to get something to work.

That's unacceptable.

I contemplated using a small script during deployment time to grab some credentials from the IAM role on the machine and enter them into the config file, but those credentials expire and Logstash was running as a service, so at some stage it would just stop working and someone would have to do something to make it work again.

Again, unacceptable.

Luckily for me, the wonderful people behind Logstash (and specifically the S3 plugin) have developed a new version that allows the usage of IAM roles, and it was already in beta. It’s a little unstable still (Release Candidate 2), but it was good enough for my purposes.

While doing some reading about Logstash and the new version I discovered that the file input was basically completely broken on Windows. The component that it leverages to get a unique identifier for each file (in order to record the position it is up to in that file) does not work in 1.4.2 and below, so you end up missing huge chunks of data when processing multiple files. This actually explained why I was having so much difficulty using the earlier version to process a large amount of IIS logs from a disconnected machine, and why there were holes in my data. Long story short, if you’re using the file input in Logstash and you’re on Windows, get the latest release candidate.

I incorporated the 1.5 RC2 release into my deployment, but I still couldn’t get the S3 input to work.

Why Is It Always A Proxy

I hate proxies.

Not because of what they are. I think they actually do some useful things, like caching, obfuscating where requests are coming from when accessing the internet from within a network and preventing access to bad websites.

No I hate proxies because the support for them is always a pain in the ass. Every application seems to support proxies differently, if they support them at all. Some automatically read from the Internet Explorer registry setting for the proxy, some use the HTTP_PROXY environment variable, some have their own personal settings. This means that every time you want to use a piece of software in an environment that uses a proxy, you have to fight with it to get it to work.

Such was the case with the S3 input. The underlying Ruby based aws-sdk has support for proxies (as does the .NET one), with a Set-AwsProxy method.

I could not, for the life of me, figure out how to configure Logstash with a proxy for the AWS component though.

So, I was stuck. I had all the configuration in place to process the ELB logs, but I didn’t have the logs themselves.

In the end I created a small Powershell script that uses the AWS Powershell Component to move all files from an S3 bucket to a local directory on a timer. I then installed that script as a Windows Service using NSSM. Finally I edited my Logstash configuration to process the local files instead. After tweaking my config to process the files correctly, everything started coming through into the Log Aggregator as expected, and I added the missing piece to our intelligence about the service.
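The core of the downloader script is not much more complicated than the following sketch (the bucket, prefix, destination and interval are placeholders; the real script reads its settings from configuration and does some logging of its own). NSSM then wraps the script up as a Windows Service, the same way the Logstash service itself is installed.

# Sketch of the S3 downloader that feeds the local Logstash file input.
# Credentials come from the instance's IAM role, so none are specified here.
Import-Module AWSPowerShell

$bucket = "my-elb-logs-bucket"      # placeholder bucket name
$prefix = "my-service-elb"          # placeholder key prefix
$destination = "C:\logs\elb"        # local directory watched by the Logstash file input
$intervalSeconds = 60

while ($true)
{
    # Download everything under the prefix into the destination folder, preserving key structure.
    Read-S3Object -BucketName $bucket -KeyPrefix $prefix -Folder $destination | Out-Null

    Start-Sleep -Seconds $intervalSeconds
}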

I don’t like this solution, because it adds more moving parts than I think is strictly necessary, but sometimes you have to make compromises.

Summary

I’ve uploaded a repository with my deployable build for Logstash here, so hopefully someone else can benefit from the effort that I put into making it re-usable.

Setting up a deployment pipeline for this component saved me a lot of time throughout the development process, making redeploying my changes when I made a mistake or needed to change a dependency (like upgrading to Logstash 1.5 RC2) a breeze. I highly recommend spending that initial bit of effort in setting things up at the start so you can move quickly later.

In regards to the actual ELB logs, they don’t provide any groundbreaking information that IIS didn't already give us, except for the case where connections are terminated at the ELB due to inactivity. At least, that’s my understanding; I suppose they will also track when the underlying instances go offline, which will be good. The ELB entries come in a bit slower than the IIS ones (due to the delay before the log files are published from the ELB, plus the delay added by my own S3 downloader and Logstash file processor pair), but there’s not really much I can do about that.

I still hate proxies. Only because its easier to hate one thing than every application that doesn’t support them.


As I’ve already mentioned, I’ve been doing load testing.

Obviously, the follow-up to executing load tests is to analyse the data that comes out of them.

With our new service, we now log many disparate sources of information into a single location, using Elasticsearch, Logstash and Kibana. As the load tests are executed, streams of data come into our log aggregator, ready to be sliced in various ways to see exactly how the service is operating and to find out where the bottlenecks are.

Requesting Assistance

The service leverages Nxlog to ship all IIS and application logs to the log aggregator. This gave me the ability to track average/max request times, error responses and errors occurring inside the application, in real time, as the load test was being run. A side effect of this is that it will also do the same once the service is made publicly available.

IIS logs gave me access to fields like sc-status (the returned response code, e.g. 200, 401 or 500), time-taken (the total elapsed time for the request to be completed), the request path (e.g. entity/images/1) and bytes uploaded and downloaded (sc-bytes and cs-bytes), along with many others.

Our application logs mostly just gave me access to whether or not unhandled exceptions had occurred (we use Nancy, and our OnError pipeline simply logs the error out while returning a 500) as well as some information about what the service is doing internally with authentication/authorisation.

For the first run of the load test I chose 50 users with no ramp-up, i.e. they would all start working through the test plan all at once. Each user would execute a series of actions that approximated what usage on the service would be like over a day, compressed into a much smaller time period.

I used Kibana to construct a chart that showed the average and maximum request latency (using the time-taken field from IIS) across all requests over time, and another chart that showed the top 5 status codes, also over time.

I had some initial difficulties with using time-taken for numerical aggregations in Kibana. When I originally set up the Nxlog config that shipped the log entries to our log aggregator, I had neglected to type the time-taken field as an integer (it’s measured in whole milliseconds), so Elasticsearch had inferred the type of the field as string. As you can imagine, strings can’t participate in numerical aggregations. I still don’t quite understand ES mappings, but I had to change the type of time-taken in Nxlog (by casting/parsing to integer in the Exec block of my input), delete the index in ES and then it correctly inferred the type as a number. When I looked at the JSON being output by Nxlog, the TimeTaken field values were enclosed in quotes, so it’s likely that’s why ES inferred them as strings.

My first set of results were informative, but they didn’t exactly point me towards the area that needed improvement.

As you can see from the chart above, latency was fairly terrible, even with only 50 users. Looking at the timeline of the test, latency was okay during the periods that didn’t involve images, rose when images were being uploaded and then rose to unacceptable levels when images were being downloaded.

I had standardised all of the images so that they were approximately 1.2MB, so a service that scales well with traffic should show an increase in latency (to account for the time to physically upload and download the images) but it should be consistent with the size of the images.

Looking at the response codes, everything was mostly fine (if slow) until images started to be downloaded, and then 500’s would start to appear. Investigating the 500’s showed that they were primarily caused by timeouts, indicating that part of the system was under so much load that it could no longer service requests within the timeout period. This lined up with the increasing latency of the service.

The timeouts were coming from our underlying database, RavenDB.

Cogs in the Machine

From the IIS and application logs I knew how the service was acting from the point of view of an external party (mostly). What I didn’t know was how the underlying machines were acting, and I didn’t have any eyes on the database underlying the service at all. My first implementation had only aggregated logs from the API instances.

It was easy enough to customize the Nxlog deployment that I built to ship different configurations to different machines, so it didn’t take very long to start getting logs from the database machine (IIS).

I also needed machine statistics, to identify why the underlying database was unable to service requests in a reasonable amount of time. Ideally these statistics would be available in the same place as the other events, so that I could make correlations between the data sets.

As far as I can tell, on Windows machines if you want to log statistics about the system itself, you use performance counters. You can track just about anything, but for now I limited myself to CPU utilization, free memory and free disk space. I assumed that the database machine was sitting at 100% CPU or running out of memory and resorting to virtual memory or something.

I created a project similar to the Nxlog one, except focused on installing a performance counter based on a configuration file, again deployed through Octopus. It leveraged the command line tool logman to configure and start the performance counter, and simply focused on outputting system statistics to a known log file. I then altered the Nxlog configuration file for both the API and Database instances to read those statistics in and push them out to ES via Logstash, just like the IIS and application logs.
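The logman part of it amounts to little more than this (a sketch only; the counters, sample interval and output path here are illustrative rather than what our configuration file actually specifies):

# Create and start a counter collector that samples some basic system statistics
# every 5 seconds and appends them to a CSV file for Nxlog to pick up.
$counters = @(
    "\Processor(_Total)\% Processor Time",
    "\Memory\Available MBytes",
    "\LogicalDisk(C:)\% Free Space"
)

& logman create counter MachineStats -c $counters -f csv -si 00:00:05 -o "C:\logs\stats\machine-stats"
& logman start MachineStats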

With system statistics now being available, I improved the Kibana dashboard to show more information. I added charts of CPU utilization, free memory and free disk, and slightly altered my request latency charts. Most of the time the requests with the highest latency were those which involved images. The main reason for this is that it takes time to move data around the internet, and all of my test images were 1.2MB, so the time to transfer the raw data was not trivial. In order to better understand the latency of various requests, I needed to differentiate image requests from non-image requests. At the same time I wanted to distinguish different resources from one another, so I could see which resources were slow. I added some custom code to the Nxlog config to parse out resource information from the path and stored the resource identifier as a field along with all of the other data. I used the same approach to identify the requests that dealt with images, and tagged them as such.

I then used this information to mutate my latency chart into two charts. One shows the latency (average, max) of image requests, broken down by Verb (PUT, POST, GET, DELETE) and the other shows latency for each resource in the system (one chart per resource).

At this point it was helpful to define and save some queries in Kibana that immediately narrowed down the total available data to the appropriate subset for charting. We tag our log messages with an application id (something uniquely identifying the system), component id (like API or DB) and source (like IIS or application). I defined some queries for Application + Component:API + Source:IIS + Images/Not Images, which I then used as the inputs for the latency chart visualizations I mentioned above. It’s hard to filter inside the visualizations themselves, so I recommend creating and saving queries for that purpose.

My new monitoring dashboard looked like this:

After running another load test, I could clearly see that the images were the slowest part (as expected), and that when images were being downloaded, the performance of every other resource suffered. Alas, all of the machine statistics looked fine at the time of the high latency. None of the machines were showing a bottleneck in any of the areas that I was monitoring. CPU and Free Memory were fine, and yet the requests were taking upwards of 100 seconds (on average) to complete when image downloads were occurring.

Moar Stats

As I could see the timeouts coming from the underlying database in the API logs, I knew that that machine was the issue, I just didn’t know how.

I suspected that the issue was probably disk based (because it was the one thing I wasn’t monitoring) so I added a greater selection of statistics (average disk latency, disk queue length) and re-ran the load tests.

The result was clear. During the period where the load test was downloading images, the latency of the data disk was reaching 800ms per transfer. That's insane! Normal disk latency is sub 30ms.

Finally I had found the obvious bottleneck, but now I had to fix it.

Birds are Weird

Long story short, I tried to improve the performance by making the data disk more awesome (thanks to AWS, this was fairly easy as you can configure the desired amount of IOPS), and making the database machine itself more awesome (again, thanks AWS and its big chunky instance types), but nothing seemed to make much of a difference.

Finally we had to make the hard call and stop storing images in our underlying database. It just didn’t do it well.

We were still using an old version of RavenDB, and were using its attachment feature to store the images. I have to assume we were doing something wrong, because RavenDB is a pretty good product as far as I can see, and I don’t see why it wouldn’t work. I’m honestly not that familiar with RavenDB and the decision to use it was made before I started, so I couldn’t really determine why it wasn’t performing as we expected.

Moving our images to a different storage mechanism was fairly easy: our external API didn’t need to change, and we just moved the image storage into S3.

The next run of the load tests showed average response times under load going down to about 7 seconds, which is about the time it takes to move 1.2MB on our internet connection + some overhead.

The charts still look basically the same, but the numbers are a hell of a lot lower now, so we’ve got that going for us, which is nice.

Summary

Always load test. Always.

We had already guessed before we started load testing that the underlying database would probably be a bottleneck, as there was only one of them, and we had issues adding more in parallel during development. We had assumed that any performance issues with the database could be solved by simply making it more awesome in whatever way it was bottlenecked. Unfortunately, that turned out to be a false assumption, and we had to change the internals of the system in order to scale our performance when it came to image storage.

Without load testing and the analysis of its results via Kibana and its lovely charts, we wouldn’t have been able to identify the scaling issue and resolve it, with real numbers to back up our improvements.

I cannot recommend using a log aggregator enough. Literally, you should aggregate all the things. Once you have all of that information at your fingertips you can do all sorts of amazing analysis, which leads to greater insights into how your software is actually running (as opposed to how you think it’s running).


Such a stupid thing.

Even though I was really careful, I still did the thing.

I knew about the thing, I planned to avoid it, but it still happened.

I accidentally uploaded some AWS credentials (Key + Secret) into GitHub…

I’m going to use this blog post to share exactly what happened, how we responded, how Amazon responded and some thoughts about how to avoid it.

Anatomy of Stupidity

The repository that accompanied my JMeter post last week had a CloudFormation script in it, to create a variable sized army of AWS instances ready to run load tests over a service. You’re probably thinking that’s where the credentials were, that I’d hardcoded them into the environment creation script and forgotten to remove them before uploading to GitHub.

You would be wrong. My script was parameterised well, requiring you to supply credentials (amongst other things) in order to create the appropriate environment using CloudFormation.

My test though…

I recently started writing tests for my Powershell scripts using Pester. In this particular case, I had a test that created an environment and then verified that parts of it were working (i.e. URL in output resolved, returned 200 OK for a status query, stuff like that), then tore it down.

The test had hardcoded credentials in it. The credentials were intended to be used with CloudFormation, so they were capable of creating various resources, most importantly EC2 instances.

Normally when I migrate some scripts/code that I’ve written at work into GitHub for public consumption I do two things.

One, I copy all of the files into a fresh directory and pore over the resulting code for references to anything specific to us: company name, credentials, those sorts of things. I excise all of the non-generic functionality, and anything that I don’t want to share (mostly stuff not related to the blog post in question).

Two, I create a fresh git repository from those edited files. The main reason I do this instead of just copying the repository is that the history of the repository would otherwise contain all of those changes, which is a far more subtle leak.

There is no excuse for me exposing the credentials except for stupidity. I’ll wear this one for a while.

Timeline of Horror

Somewhere between 1000 and 1100 on Wednesday April 8, I uploaded my JMeter blog post, along with its associated repository.

By 1130 an automated bot had already retrieved the credentials and had started to use them.

As far as we can tell, the bot picks up all of the regions that you do not have active resources in (so for us, that’s basically anything outside ap-southeast) and creates the maximum possible number of c4.8xlarge instances in every single region, via spot requests. All told, 500+ instances spread across the world. Alas, we didn’t keep an example of one of the instances (too concerned with terminating them ASAP), but we assume from reading other resources that they were being used to mine Bitcoins and then transfer them anonymously to some destination.

At 1330, Amazon notified us that our credentials were compromised, via an email to an inbox that was not being actively monitored for reasons that I won’t go into (but which I’m sure will be actively monitored from now on). They also prevented our account from doing various things, including the creation of new credentials and particularly expensive resources. Thanks Amazon! Seriously, your (probably automated) actions saved us a lot more grief.

The notification from Amazon was actually pretty awesome. They pinpointed exactly where the credentials were exposed in GitHub. They must have the same sort of bots running as the thieves, except used for good, rather than evil.

At approximately 0600 Thursday April 9, we received a billing alert that our AWS account had exceeded a limit we had set in place. Luckily this did go to an inbox that was being actively monitored, and our response was swift and merciless.

Within 15 minutes we had terminated the exposed credentials, terminated all of the unlawful instances and removed all of the spot requests. We created a script to select all resources within our primary region that had been modified or created in the last 24 hours and reviewed the results. Luckily, nothing had been modified within our primary region. All unlawful activity had occurred outside, probably in the hopes that we wouldn't notice.
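The review script itself was nothing special; conceptually it was just a handful of queries like the following (a simplified sketch covering EC2 instances only, using the AWS PowerShell cmdlets; the region is an assumption and the real script covered more resource types):

# List any EC2 instances in the primary region launched in the last 24 hours,
# so a human can eyeball anything that shouldn't be there.
$region = "ap-southeast-2"      # assumed primary region for the purposes of the sketch
$since = (Get-Date).AddHours(-24)

$recentInstances = (Get-EC2Instance -Region $region).Instances |
    Where-Object { $_.LaunchTime -gt $since }

$recentInstances | Select-Object InstanceId, InstanceType, LaunchTime | Format-Table -AutoSize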

Ramifications

In that almost 24 hour period, the compromise resulted in just over $8000 AUD of charges to our AWS account.

Don’t underestimate the impact that exposed credentials can have. It happens incredibly quickly.

I have offered to pay for the damages out of my own pocket (as is only appropriate), but AWS also has a concession strategy for this sort of thing, so we’ll see how much I actually have to pay in the end.

Protecting Ourselves

Obviously the first point is don’t store credentials in a file. Ever. Especially not one that goes into source control.

The bad part is, I knew this, but I stupidly assumed it wouldn't happen to me because our code is not publicly visible. That would have held true if I hadn’t used some of the scripts that I’d written as the base for a public repository to help explain a blog post.

Never assume your credentials are safe if they are specified in a file, unless that file is protected by a mechanism that isn’t available locally (so encrypting them and having the encryption key in the same codebase is not enough).

I have since removed all credential references from our code (there were a few copies of that fated environment creation test in various repositories; luckily no others) and replaced them with a mechanism to supply credentials via a global hashtable entered at the time the tests are run. It’s fairly straightforward, and is focused on telling you which credentials are missing when they cannot be found. No real thought has been given to making sure the credentials are secure on the machine itself; it’s focused entirely on keeping secrets off the disk.

function Get-CredentialByKey
{
    [CmdletBinding()]
    param
    (
        [string]$keyName
    )

    if ($globalCredentialsLookup -eq $null)
    {
        throw "Global hashtable variable called [globalCredentialsLookup] was not found. Credentials are specified at the entry point of your script. Specify hashtable content with @{KEY=VALUE}."
    }

    if (-not ($globalCredentialsLookup.ContainsKey($keyName)))
    {
        throw "The credential with key [$keyName] could not be found in the global hashtable variable called [globalCredentialsLookup]. Specify hashtable content with @{KEY=VALUE}."
    }

    return $globalCredentialsLookup.Get_Item($keyName)
}
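Usage ends up looking something like this at the entry point of a test run (the key names here are just examples):

# Supplied by a human (or a build server) at the entry point of the script/test run.
$globalCredentialsLookup = @{
    "AWS_KEY" = "KEY GOES HERE";
    "AWS_SECRET" = "SECRET GOES HERE"
}

# Later, inside the tests, credentials are pulled out by key.
$awsKey = Get-CredentialByKey -keyName "AWS_KEY"
$awsSecret = Get-CredentialByKey -keyName "AWS_SECRET"

Nothing ever gets written to a file, so there is nothing to accidentally commit.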

The second point is specific to AWS credentials. You should always limit the credentials to only exactly what they need to do.

In our case, there was no reason for the credentials to be able to create instances outside of our primary region. Other than that, they were pretty good (they weren’t administrative credentials for example, but they certainly did have permission to create various resources used in the environment).

The third point is obvious. Make sure you have a reliable communication channel for messages like compromise notifications, one that is guaranteed to be monitored by at least one person at all times. This would have saved us a tonne of grief. The earlier you know about this sort of thing, the better.

Summary

AWS is an amazing service. It lets me treat hardware resources in a programmable way, and stops me from having to wait on other people (who probably have more important things to do anyway). It lets me create temporary things to deal with temporary issues, and is just generally a much better way to work.

With great power comes great responsibility.

Guard your credentials closely. Don’t be stupid like me, or someone will get a nasty bill, and it will definitely come back to you. Also you lose engineer points. Luckily I’ve only done two stupid things so far this year. If you were curious, the other one was that I forgot my KeePass password for the file that contains all of my work related credentials. Luckily I had set it to match my domain password, and managed to recover that from Chrome (because we use our domain credentials to access Atlassian tools).

This AWS thing was a lot more embarrassing.