
Welp, holidays finished, so back into software delivery I go.

We just finished a 3 day Hackathon, which was pretty fun. The team I was in worked on a proof of concept for providing business intelligence on our customers to internal users by connecting and correlating our many disparate data sources. It was a fairly simple React.js website backed by a C# API, which connected to a number of different systems including Microsoft Dynamics CRM, Elasticsearch (specifically, our log stack), some crazy financial database called Helpy (which I don’t even seem to be able to Google) and finally a PostgreSQL database containing customer information.

It was cool to see it all come together over just a few days, but I doubt I’ll give it its own blog post like I did for the faster S3 clone from the last Hackathon. Funnily enough, the hardest part was getting the damn thing deployed to an instance of IIS on a manually provisioned server inside our office. I’ve spent so much time using automated scripts that I wrote months ago in AWS to set up websites and APIs that I’ve completely forgotten how to just hack a webserver together.

Why didn’t I use the scripts, you might ask? Well, I tried, and they really only work inside AWS on top of the AMIs we’ve built up specifically for that purpose. A bit disappointing, but I got over it.

Hackathon fun aside, I had to deal with something else when I got back from my holidays, identifying and resolving a massive unexpected increase in Read IOPS usage for one of our production databases.

And for once, it had nothing to do with RavenDB.

Zero Hour

While I was on holidays, one of our production APIs started exhibiting some pretty degraded performance. It happened suddenly too; it was running fine one minute and then its latency spiked hard. Unfortunately I was not there for the initial investigation, so the first half of this post is going to be a second-hand account, with all the inherent inaccuracy that entails.

Looking into the issue, all queries to the underlying database (a PostgreSQL database hosted in RDS) were taking much longer than expected, but it wasn’t immediately obvious as to why. There had been no new deployments (so nothing had changed), and it didn’t look like the amount of traffic being received had drastically increased.

The RDS monitoring statistics showed the root cause of the performance problems. There had been a massive increase in the amount of Read IOPS being consumed by the database instance, from 300ish to 700ish. This had been happening for at least a few days before the performance went downhill, so the operative question was why it had become an issue all of a sudden.

Apparently, IO credits are a thing, like CPU credits.

With the increase in Read IOPS being sustained over many hours, the IO credits available to the instance were slowly being consumed, until they were all gone and performance tanked.

The way IO credits work, to my current understanding, is that when you provision an EBS volume, the size of the volume determines the minimum guaranteed IO throughput, at a rate of 3 IOPS per GB provisioned (unless you’ve specified provisioned IOPS, which is a whole different ballgame). The volume is capable of bursting to 3000 IOPS, but operating above the minimum will consume IO credits. Running out of IO credits means no more bursting, so if the performance of your service is reliant on pulling more than the minimum IO available, it will tank. Like ours did.
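
To put some rough numbers on that, here is a back-of-envelope sketch in Javascript. The bucket size and accrual rules are my reading of the AWS documentation for general purpose (gp2) volumes at the time of writing, so treat it as an illustration rather than a billing-grade calculator.

// Back-of-envelope gp2 burst maths. The figures are my reading of the AWS docs
// (a starting bucket of roughly 5.4 million IO credits, refilled at the baseline rate),
// and the 100 IOPS floor that applies to small volumes is ignored for simplicity.
var BURST_BUCKET_CREDITS = 5400000;
var BASELINE_IOPS_PER_GB = 3;

function baselineIops(volumeSizeGb) {
    return volumeSizeGb * BASELINE_IOPS_PER_GB;
}

function hoursUntilCreditsExhausted(volumeSizeGb, sustainedIops) {
    var overBaseline = sustainedIops - baselineIops(volumeSizeGb);
    if (overBaseline <= 0) return Infinity; // at or below baseline, the bucket refills instead of draining
    return (BURST_BUCKET_CREDITS / overBaseline) / 3600;
}

// A hypothetical 100 GB volume (300 baseline IOPS) sustaining 700 IOPS would burn through
// its entire bucket in under 4 hours of continuous load.
console.log(hoursUntilCreditsExhausted(100, 700));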

Apart from the unfortunate fact that unlike CPU credits, there is no way to monitor IO credits, the problem was relatively easy to solve. Either increase the volume size of the RDS instance (to increase the minimum IO performance) or leave the size the same, but switch to using provisioned IOPS. We opted for the cheaper option (bigger disk = more performance) and the performance stabilised.

This left us in a precarious position though, as we didn’t know what had caused the dramatic increase in Read IOPS.

Detective Agency

PostgreSQL has some pretty good logs, which is a relief, considering no equivalent tool to SQL Server Profiler exists to my knowledge.

Getting the logs for a PostgreSQL instance hosted in RDS isn’t particularly complicated. Either edit the existing parameter group associated with the instance, or create a new one with logging enabled and then associate it with the instance. The logging options we were interested in were log_statement and log_min_duration_statement, which represent what activities to log and how long a query needs to be running before it should be logged, respectively. More information about working with PostgreSQL logs in RDS can be found at the appropriate AWS webpage. Getting to the logs manually is pretty easy. They are available for download when looking at the RDS instance in the AWS console/dashboard.
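
For reference, the same parameter changes can be made programmatically. Here is a minimal sketch using the Javascript aws-sdk; the parameter group name and the duration threshold are placeholders, not our actual values.

// Turn on statement logging for an RDS PostgreSQL parameter group.
var aws = require('aws-sdk');
var rds = new aws.RDS({ region: 'ap-southeast-2' });

rds.modifyDBParameterGroup({
    DBParameterGroupName: 'my-postgres-parameter-group',
    Parameters: [
        // What activities to log (none, ddl, mod or all)
        { ParameterName: 'log_statement', ParameterValue: 'all', ApplyMethod: 'immediate' },
        // Only log statements that run longer than this many milliseconds
        { ParameterName: 'log_min_duration_statement', ParameterValue: '500', ApplyMethod: 'immediate' }
    ]
}, function(err, data) {
    if (err) console.log('Failed to modify parameter group ', err);
    else console.log('Modified parameter group ', data.DBParameterGroupName);
});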

The logs showed that there was a class of query that was constantly being executed to determine the most recent piece of information in the database for a subset of the data (i.e. a select top x from y where a, b). The only piece of software that can access the database is a C# API, and it uses Entity Framework 6 for all database communication, so the queries were monstrous, but once our attention had been drawn to the appropriate place, it was pretty easy to see some issues:

  • The data type on the first segmentation column was a string instead of an integer (only integer values were ever present in the column, and conceptually only integer values would ever be present)
  • The data type on the second segmentation column was a string instead of a guid (similar to the integer comment above)
  • The primary key on the table was a simple auto ID, whereas the actual primary key should have been a compound key with column a and b above, along with a third column
  • There did not appear to be any appropriate indexes on the segmentation columns

What all of this meant was that the most frequent query being executed on the database was incredibly sub-optimal. It was likely that because the most frequent query needed to do a table scan in order to get an answer, and because the RDS instance was only specified as a t2.micro, PostgreSQL was constantly churning memory, which would account for the high Read IOPS. The issue likely only occurred recently because we finally hit some threshold in traffic or database size where the churning manifested.

Fixing the problem was simple. Edit the EF models to change the types and specify appropriate indexes, then add a new EF migration to be executed on the next deployment. The migration was a little more complex than usual because it had to transform the data already in the database (which was stored in the wrong column types), but all in all, the fix didn’t take a huge amount of effort.

Like good software developers though, we had to make sure that it actually fixed the problem, without deploying it to production to find out.

Validate Me!

Thanks to previous efforts in traffic replication, I made the assumption that it would be relatively easy to replicate our production traffic into an updated version of the service.

I was wrong.

Our previous traffic replication had been accomplished using Gor replicating traffic from a single EC2 instance (a machine hosting RavenDB) to another similar instance. Pure HTTP traffic, no URLs, just IP addresses, and Gor had been manually installed and configured. I had started work on replicating traffic from our API instances to a URL of my choosing, but I hadn’t quite finished it before I went on holidays.

I needed to replicate traffic from a variable number of instances to a publicly accessible URL, and I really didn’t want to have to manually install it on each machine (and then have to do it again next time I wanted to replicate traffic).

Creating a deployable Nuget package containing Gor and some helper Powershell scripts wasn’t overly difficult; I just used a similar pattern to the one we have for our Logstash deployment package. After deploying it though, it wouldn’t work. When I say it wouldn’t work, I mean I didn’t get any feedback at all when running it, no matter how I ran it. The issue was that Gor running on Windows is dependent on WinPCap being present in order to function at all. This is easy enough to accomplish when you’re doing a manual install (just go grab WinPCap and install it), but unfortunately much harder when you want to be able to deploy it to any old machine in a completely unattended fashion. WinPCap doesn’t supply a silent installer (why god why), so I was forced to install NMap silently on the machines as part of the Gor deployment. Why NMap? It also installs WinPCap silently, because they’ve customised it.

Not my greatest victory, but I’ll take it.

With Gor working, I still had an issue where even though Gor seemed to be aware of the traffic occurring on the machine (I could log it to stdout, for example), no matter what I did I could not get the traffic to replicate to another service via a standard URL available to the greater internet. Long story short, the issue was that Gor (as of version 0.15.1) does not and cannot use a proxy. Since these instances did not have public IP addresses (no need, they were hidden behind a load balancer), they had to go through a proxy to get anything meaningful done on the internet.

We’re switching to using NAT Gateways for all internet traffic from private instances, so in the long term, this won’t be an issue. For now, I have to live with the fact that proxies are a thing, so I was forced to create an internal load balancer pointing at the same EC2 instances as the public URL I was going to use.

With that in place, I finally managed to replicate the production traffic to a test service with the changes applied.

And it worked amazingly. With the changes in place, our Read IOPS dropped from 700ish to like 10. Much more efficient. The only cost was that the service had to be made unavailable to apply the database updates (because they involved the transformation of the column contents). That only took like 5 minutes though, so I just did it at night when nobody noticed or cared.

Conclusion

I suppose the conclusion of this entire journey is to never underestimate the impact that early design decisions can have on the long term health of a service, and to never undervalue detailed monitoring.

Honestly, we should have picked up this sort of thing much earlier than we did (probably through load testing), but it was good that we had the systems in place to react quickly when it did happen unexpectedly.

Another benefit of the whole thing is that it proved the value of our ability to spin up an entirely self contained environment as a mechanism for testing changes under relatively real circumstances. With the ability to replicate traffic at will now (thanks to the working Gor deployment), we’re in a much better position to try things out before we go live with them.


The way we structure our services for deployment should be familiar to anyone who’s worked with AWS before. An ELB (Elastic Load Balancer) containing one or more EC2 (Elastic Compute Cloud) instances, each of which has our service code deployed on it by Octopus Deploy. The EC2 instances are maintained by an ASG (Auto Scaling Group) which uses a Launch Configuration that describes how to create and initialize an instance. Nothing too fancy.
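
To make that concrete, here is a rough sketch of the same structure using the Javascript aws-sdk. All of the names are placeholders, and our real environments are actually stood up from CloudFormation templates rather than direct API calls like these.

// Create a Launch Configuration (how to build an instance) and an ASG (how many to keep alive),
// with the ASG registering its instances with an existing ELB.
var aws = require('aws-sdk');
var autoScaling = new aws.AutoScaling({ region: 'ap-southeast-2' });

autoScaling.createLaunchConfiguration({
    LaunchConfigurationName: 'my-service-launch-config',
    ImageId: 'ami-12345678',
    InstanceType: 't2.micro'
}, function(err) {
    if (err) return console.log('Failed to create launch configuration ', err);

    autoScaling.createAutoScalingGroup({
        AutoScalingGroupName: 'my-service-asg',
        LaunchConfigurationName: 'my-service-launch-config',
        MinSize: 2,
        MaxSize: 4,
        LoadBalancerNames: ['my-service-elb'],
        AvailabilityZones: ['ap-southeast-2a', 'ap-southeast-2b']
    }, function(err) {
        if (err) console.log('Failed to create auto scaling group ', err);
    });
});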

An ELB has logic in it to direct incoming traffic to its instance members, in order to balance the load across multiple resources. In order to do this, it needs to know whether or not a member instance is healthy, and it accomplishes this by using a health check, configured when the ELB is created.

Health checks can be very simple (does a member instance respond to a TCP packet over a certain port), or a little bit more complex (hit some API endpoint on a member instance via HTTP and look for non 200 response codes), but conceptually they are meant to provide an indication that the ELB is okay to continue to direct traffic to this member instance. If a certain number of health checks fail over a period of time (configurable, but something like “more than 5 within 5 minutes”), the ELB will mark a member instance as unhealthy and stop directing traffic to it. It will continue to execute the configured health check in the background (in case the instance comes back online because the problem was transitory), but can also be configured (when used in conjunction with an ASG) to terminate the instance and replace it with one that isn’t terrible.

Up until recently, our services have consistently used a /status endpoint for the purposes of the ELB health check.

It’s a pretty simple endpoint, checking some basic things like “can I connect to the database” and “can I connect to S3”, returning a 503 if anything is wrong.
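
The health check itself is just configuration on the ELB. A minimal sketch using the Javascript aws-sdk (the load balancer name, port and thresholds here are illustrative, not our actual settings) looks something like this:

// Point a classic ELB health check at the /status endpoint; any non 200 response is a failure.
var aws = require('aws-sdk');
var elb = new aws.ELB({ region: 'ap-southeast-2' });

elb.configureHealthCheck({
    LoadBalancerName: 'my-service-elb',
    HealthCheck: {
        Target: 'HTTP:80/status',
        Interval: 30,            // seconds between checks
        Timeout: 5,              // seconds before a single check is considered failed
        UnhealthyThreshold: 5,   // consecutive failures before the instance is marked unhealthy
        HealthyThreshold: 2      // consecutive successes before it is marked healthy again
    }
}, function(err, data) {
    if (err) console.log('Failed to configure health check ', err);
    else console.log('Health check configured ', JSON.stringify(data.HealthCheck));
});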

When we started using an external website monitoring service called Pingdom, it made sense to just use the /status endpoint as well.

Then we complicated things.

Only I May Do Things

Earlier this year I posted about how we do environment migrations. The contents of that post are still accurate (even though I am becoming less and less satisfied with the approach as time goes on), but we’ve improved the part about “making the environment unavailable” before starting the migration.

Originally, we did this by just putting all of the instances in the auto scaling group into standby mode (and shutting down any manually created databases, like ones hosted on EC2 instances), with the intent that this would be sufficient to stop customer traffic while we did the migration. When we introduced the concept of the statistics endpoint, and started comparing data before and after the migration, this was no longer acceptable (because there was no way for us to hit the service ourselves when it was down to everybody).

What we really wanted to do was put the service into some sort of mode that let us interact with it from an administrative point of view, but no-one else. That way we could execute the statistics endpoint (and anything else we needed access to), but all external traffic would be turned away with a 503 Service Unavailable (and a message saying something about maintenance).

This complicated the /status endpoint though. If we made it return a 503 when maintenance mode was on, the instances would eventually be removed from the ELB, which would make the service inaccessible to us. If we didn’t make it return a 503, the external monitoring in Pingdom would falsely report that everything was working just fine, when in reality normal users would be completely unable to use the service.

The conclusion that we came to was that we made a mistake using /status for two very different purposes (ELB membership and external service availability), even though they looked very similar at a glance.

The solution?

Split the /status endpoint into two new endpoints, /health for the ELB and /test for the external monitoring service.

Healthy Services

The /health endpoint is basically just the /status endpoint with a new name. Its response payload looks like this:

{
    "BuildVersion" : "1.0.16209.8792",
    "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
    "DatabaseConnectivity" : true,
    "MaintenanceMode" : false
}

There’s not really much more to say about it, other than its purpose now being clearer and more focused. It’s all about making sure the particular member instance is healthy and should continue to receive traffic. We can probably extend it to do other health related things (like available disk space, available memory, and so on), but it’s good enough for now.

Testing Is My Bag Baby

The /test endpoint though, is a different beast.

At the very least, the /test endpoint needs to return the information present in the /health endpoint. If /health is returning a 503, /test should too. But it needs to do more.

The initial idea was to simply check whether or not we were in maintenance mode, with the /test endpoint returning a 503 when maintenance mode was on being the only differentiator between /health and /test. That’s not quite good enough though. Conceptually, we want the /test endpoint to act more like a smoke test of common user interactions, and just checking the maintenance mode flag doesn’t give us enough faith that the service itself is working from a user’s point of view.

So we added some tests that get executed when the endpoint is called, checking commonly executed features.

The first service we implemented the /test endpoint for was our auth service, so the tests included things like “can I generate a token using a known user” and “can I query the statistics endpoint”, but eventually we’ll probably extend it to include other common user operations like “add a customer” and “list customer databases”.

Anyway, the payload for the /test endpoint response ended up looking like this:

{
    "Health" : {
        "BuildVersion" : "1.0.16209.8792",
        "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
        "DatabaseConnectivity" : true,
        "MaintenanceMode" : false
    },
    "Tests" : [{
            "Name" : "QueryApiStatistics",
            "Description" : "A test which returns the api statistics.",
            "Endpoint" : "/admin/statistics",
            "Success" : true,
            "Error" : null
        }, {
            "Name" : "GenerateClientToken",
            "Description" : "A test to check that a client token can be generated.",
            "Endpoint" : "/auth/token",
            "Success" : true,
            "Error" : null
        }
    ]
}

I think it turned out quite informative in the end, and it definitely meets the need of detecting when the service is actually available for normal usage, as opposed to the very specific case of detecting maintenance mode.

The only trick here was that the tests needed to hit the API via HTTP (i.e. localhost), rather than just shortcutting through the code directly. The reason for this is that otherwise each test would not be an accurate representation of actual usage, and would give false results if we added something in the Nancy pipeline that modified the response before it got to the code that the test was executing.

Subtle, but important.
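
To illustrate the shape of one of those tests: our API is C# and Nancy, so the following Javascript is purely illustrative (and the port and endpoint are placeholders), but it shows the important part, which is that the request goes through the real HTTP pipeline on localhost rather than calling application code directly.

// Run a single smoke test against the local API and produce an entry for the /test payload.
var http = require('http');

function runSmokeTest(name, description, endpoint, callback) {
    http.get({ host: 'localhost', port: 80, path: endpoint }, function(response) {
        callback({
            Name: name,
            Description: description,
            Endpoint: endpoint,
            Success: response.statusCode >= 200 && response.statusCode < 300,
            Error: null
        });
    }).on('error', function(err) {
        callback({ Name: name, Description: description, Endpoint: endpoint, Success: false, Error: err.message });
    });
}

runSmokeTest('QueryApiStatistics', 'A test which returns the api statistics.', '/admin/statistics', function(result) {
    console.log(JSON.stringify(result));
});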

Conclusion

After going to the effort of clearly identifying the two purposes that the /status endpoint was fulfilling, it was pretty clear that we would be better served by having two endpoints rather than just one. It was obvious in retrospect, but it wasn’t until we actually encountered a situation that required them to be different that we really noticed.

A great side effect of this was that we realised we needed to have some sort of smoke tests for our external monitoring to access on an ongoing basis, to give us a good indication of whether or not everything was going well from a user’s point of view.

Now our external monitoring actually stands a chance of tracking whether or not users are having issues.

Which makes me happy inside, because that’s the sort of thing I want to know about.


You may have heard of the Netflix simian army, which is an interesting concept that ensures that your infrastructure is always prepared to lose chunks of itself at any moment. The simian army is responsible for randomly killing various components in their live environments, ranging from a single machine/component (chaos monkey), to an entire availability zone (chaos gorilla) all the way through to an entire region (chaos kong).

The simian army is all about making sure that you are prepared to weather any sort of storm and is a very important part of the reliability engineering that goes on at Netflix.

On Sunday, 5 June, we found a chaos gorilla in the wild.

Monkey See, Monkey Do

AWS is a weird service when it comes to reliability. Their services are highly available (i.e. you can almost always access the EC2 service itself to spin up virtual machines), but the constructs created by those services seem to be far less permanent. For EC2 in particular, I don’t think there is any guarantee that a particular instance will remain alive, although they usually do. I know I’ve seen instances die of their own accord, or simply become unresponsive for long periods of time (for no apparent reason).

The good thing is that AWS is very transparent in this regard. They mostly just provide you with the tools, and then it’s up to you to build whatever high-availability/high-reliability setup you need.

When it comes to ensuring availability of the service, there are many AWS regions across the world (a few in North America, some in Europe, some in Asia Pacific), each with at least two availability zones in them (sometimes more), which are generally geographically separated. It’s completely up to you where and how you create your resources, and whether or not you are willing to accept the risk that a single availability zone or even a region might disappear for whatever reason.

For our purposes, we tend towards redundancy (multiple copies of a resource), with those copies spread across multiple availability zones, but only within a single region (Asia Pacific Sydney). Everything is typically hidden behind a load balancer (an AWS supplied component that ensures traffic is distributed evenly to all registered resources) and it’s rare that we only have a single one of anything.

Yesterday (Sunday, 05 June 2016), AWS EC2 (the Elastic Compute Cloud, the heart of the virtualization services offered by AWS) experienced some major issues in ap-southeast-2.

This was a real-life chaos gorilla at work.

Extinction Is A Real Problem

The first I knew of this outage was when one of our external monitoring services (Pingdom), reported that one of our production services had dropped offline.

Unfortunately, the persistence framework used by this service has a history of being flakey, so this sort of behaviour is not entirely unusual, although the service had been behaving itself for the last couple of weeks. When I get a message like that, I usually access our log aggregation stack (ELK), and then use the information therein to identify what’s going on (it contains information on traffic, resource utilization, instance statistics and so on).

This time though, I couldn’t access the log stack at all.

This was unusual, because we engineered that stack to be highly available. It features multiple instances across all availability zones at all levels of the architecture, from the Logstash brokers dealing with incoming log events, to the RabbitMQ event queue that the brokers write to, through to the indexers that pull off the queue, all the way down to the Elasticsearch instances that back the whole thing.

The next step was to log into the AWS console/dashboard directly and see what it said. This is the second place where we can view information about our resources. While not as easily searchable/visualizable as the data inside the ELK stack, the CloudWatch statistics inside AWS can still give us some stats on instances and traffic.

Once I logged in, our entire EC2 dashboard was unavailable. The only thing that would load inside the dashboard in the EC2 section was the load balancers, and they were all reporting various concerning things. The service that I had received the notification for was showing 0/6 instances available, but others were showing some availability, so not everything was dead.

I was blind though.

All By Myself

Luckily, the service that was completely down tends to not get used very much on the weekend, and this outage occurred on Sunday night AEST, so chances of it causing a poor user experience were low.

The question burning in my mind though? Was it just us? Or was the entire internet on fire? Maybe it’s just a blip and things will come back online in the next few minutes.

A visit to the AWS status page, a few quick Google searches and a visit to the AWS Reddit showed that there was very little traffic relating to this problem, so I was worried that it was somehow just us. My thoughts started to turn towards some sort of account compromise, or that maybe I’d been locked out of the system somehow, and this was just the way EC2 presented that I was not allowed to view the information.

Eventually information started to trickle through about this being an issue specifically with the ap-southeast-2 region, and AWS themselves even updated their status page and put up a warning on the support request page saying it was a known issue.

Now I just had to play the waiting game, because there was literally nothing I could do.

Guess Who’s Back

Some amount of time later (I forget exactly how much), the EC2 dashboard came back. The entirety of ap-southeast-2a appeared to have had some sort of massive outage/failure, meaning that most of the machines we were hosting in that availability zone were unavailable.

Even in the face of a massive interruption to service from ap-southeast-2a, we kind of lucked out. While most of the instances in that availability zone appeared to have died a horrible death, one of our most important instances, which happened to not feature any redundancy across availability zones, was just fine. It’s a pretty massive instance now (r3.4xlarge), which is a result of performance problems we were having with RavenDB, so I wonder if it was given a higher priority or something? All of our small instances (t2/m3 mostly) were definitely dead.

Now that the zone seemed to be back online, restoring everything should have been as simple as killing the broken instances and then waiting for the Auto Scaling Groups to recreate them.

The problem was, this wasn’t working at all. The instances would start up just fine, but they would never be added into the Load Balancers. The reason? Their status checks never succeeded.

This happened consistently across multiple services, even across multiple AWS accounts.

Further investigation showed that our Octopus Deploy server had unfortunately fallen victim to the same problem as the others. We attempted to restart it, but it got stuck shutting down.

No Octopus meant no new instances.

No new instances meant we only had the ones that were already active, and they were showing some strain in keeping up with the traffic, particularly when it came to CPU credits.

Hack Away

Emergency solution time.

We went through a number of different ideas, but we settled on the following, sketched in code just after the list:

  1. Take a functioning instance
  2. Snapshot its primary drive
  3. Create an AMI from that snapshot
  4. Manually create some instances to fill in the gaps
  5. Manually add those instances to the load balancer
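
Here is a rough aws-sdk sketch of those steps. All of the identifiers are placeholders, and in reality this was done by hand under time pressure rather than scripted, but it captures the sequence.

// Snapshot a healthy instance's volume, build an AMI from it, launch replacements and register them.
var aws = require('aws-sdk');
var ec2 = new aws.EC2({ region: 'ap-southeast-2' });
var elb = new aws.ELB({ region: 'ap-southeast-2' });

// 1 & 2: snapshot the primary drive of a known good instance
ec2.createSnapshot({ VolumeId: 'vol-0123456789abcdef0', Description: 'Emergency clone of healthy instance' }, function(err, snapshot) {
    if (err) return console.log('Snapshot failed ', err);

    // 3: register an AMI backed by that snapshot (in real life, wait for the snapshot to complete first)
    ec2.registerImage({
        Name: 'emergency-service-ami',
        RootDeviceName: '/dev/sda1',
        VirtualizationType: 'hvm',
        BlockDeviceMappings: [{ DeviceName: '/dev/sda1', Ebs: { SnapshotId: snapshot.SnapshotId } }]
    }, function(err, image) {
        if (err) return console.log('Image registration failed ', err);

        // 4: manually create some instances to fill in the gaps
        ec2.runInstances({ ImageId: image.ImageId, InstanceType: 'm3.medium', MinCount: 2, MaxCount: 2 }, function(err, reservation) {
            if (err) return console.log('Instance creation failed ', err);

            // 5: manually add those instances to the load balancer
            var instances = reservation.Instances.map(function(i) { return { InstanceId: i.InstanceId }; });
            elb.registerInstancesWithLoadBalancer({ LoadBalancerName: 'my-service-elb', Instances: instances }, function(err) {
                if (err) console.log('Failed to register instances with the load balancer ', err);
            });
        });
    });
});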

Not the greatest of solutions, but we really only needed to buy some time while we fixed the problem of not having Octopus. Once that was back up and running, we could clean everything up and rely on our standard self healing strategies to balance everything out.

A few hours later and all of the emergency instances were in place (3 services all told + some supporting services like proxies). Everything was functioning within normal boundaries, and we could get some sleep while the zombie Octopus instance hopefully sorted itself out (because it was around 0100 on Monday at this point).

The following morning the Octopus Server had successfully shut down (finally) and all we had to do was restart it.

We cleaned up the emergency instances, scaled back to where we were supposed to be and cleaned up anything that wasn’t operating like we expected it to (like the instances that had been created through auto scaling while Octopus was unavailable).

Crisis managed.

Conclusion

In the end, the application of a real life chaos gorilla showed us a number of different areas where we lacked redundancy (and highlighted some areas where we handled failure well).

The low incidence of AWS issues like this, combined with the availability expected by our users, probably means that we don’t need to make any major changes to most of our infrastructure. We definitely should add redundancy to the single point of failure database behind the service that went down during this outage though (which is easier said than done).

Our reliance on a single, non-redundant Octopus Server, however, especially when it’s required to react to some sort of event (whether it be a spike in usage or some sort of critical failure), is a different problem altogether. We’re definitely going to have to do something about that.

Everything else went about as well as I expected though, with the majority of our services continuing to function even when we lost an entire availability zone.

Just as planned.


A little over 4 months ago, I wrote a post about trying to improve the speed of cloning a large S3 bucket. At the time, I tried to simply parallelise the execution of the AWS CLI sync command, which actually proved to be much slower than simply leaving the CLI alone to do its job. It was an unsurprising result in retrospect, but you never know unless you try.

Unwilling to let the idea die, I decided to make it my focus during our recent hack days.

If you are unfamiliar with the concept of a hack day (or Hackathon as they are sometimes known), have a look at this Wikipedia article. At my current company, we’re only just starting to include hack days on a regular basis, but it’s a good sign of a healthy development environment.

Continuing on with the original train of thought (parallelise via prefixes), I needed to find a way to farm out the work to something (whether it was a pool of our own workers or some other mechanism). I chose to use AWS Lambda.

Enter Node.js on Lambda.

At A High Level

AWS Lambda is a relatively new offering, allowing you to configure some code to automatically execute following a trigger from one of a number of different events, including an SNS Topic Notification, changes to an S3 bucket or an HTTP call. You can use Python, Java or Javascript (through Node.js) natively, but you can technically use anything you can compile into a Linux compatible executable and make accessible to the function via S3 or something similar.

Since Javascript seems to be everywhere now (even though it’s hard to call it a real language), it was a solid choice. No point being afraid of new things.

Realistically, I should have been at least a little afraid of new things.

Conceptually the idea can be explained as a simple divide and conquer strategy, managed by files in an S3 bucket (because S3 was the triggering mechanism I was most familiar with).

If something wants to trigger a clone, it writes a file into a known S3 bucket detailing the desired operation (source, destination, some sort of id) with a key of {id}-{source}-{destination}/clone-request.

In response, the Lambda function will trigger, segment the work and write a file for each segment with a key of {id}-{source}-{destination}/{prefix}-segment-request. When it has finished breaking down the work, it will write another file with the key {id}-{source}-{destination}/clone-response, containing a manifest of the breakdown, indicating that it is done with the division of work.

As each segment file is being written, another Lambda function will be triggered, doing the actual copy work and finally writing a file with the key {id}-{source}-{destination}/{prefix}-segment-response to indicate that it’s done.

File Formats Are Interesting

Each clone-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

It’s a relatively simple file that would be easy to extend as necessary (for example, if you needed to specify the region, credentials to access the bucket, etc).

The clone-response file (the manifest), looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name}
    },
    destination: {
        name: {destination-bucket-name}
    },
    segments: {
        count: {number-of-segments},
        values: [
            {segment-key},
            {segment-key}
            ...
        ]
    }
}

Again, another relatively simple file. The only additional information is the segments that the task was broken into. These segments are used for tracking purposes, as the code that requests a clone needs some way to know when the clone is done.

Each segment-request file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    }
}

And finally, each segment-response file looks like this:

{
    id: {id},
    source: {
        name: {source-bucket-name},
        prefix: {prefix}
    },
    destination: {
        name: {destination-bucket-name}
    },    
    files: [        
        {key},
        {key},
        ...
    ]
}

Nothing fancy or special, just straight JSON files with all the information needed.

Breaking It All Down

First up, the segmentation function.

Each Javascript Lambda function already comes with access to the aws-sdk, which is super useful, because honestly if you’re using Lambda, you’re probably doing it because you need to talk to other AWS offerings.

The segmentation function has to read in the triggering file from S3, parse it (it’s Javascript and JSON, so that’s trivial at least), iterate through the available prefixes (using a delimiter, and sticking with the default “/”), write out a file for each unique prefix and finally write out a file containing the manifest.

As I very quickly learned, using Node.js to accomplish the apparently simple task outlined above was made not simple at all thanks to its fundamentally asynchronous nature, and the fact that async calls don’t seem to return a traceable component (unlike in C#, where if you were using async tasks you would get a task object that could be used to track whether or not the task succeeded/failed).

To complicate this even further, the aws-sdk will only return a maximum of 1000 results when listing the prefixes in a bucket (or doing anything with a bucket really), which means you have to loop using the callbacks. This makes accumulating some sort of result set annoyingly difficult, especially if you want to know when you are done.

Anyway, the segmentation function is as follows:

console.log('Loading function');

var aws = require('aws-sdk');
var s3 = new aws.S3({ apiVersion: '2006-03-01' });

function putCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Upload Clone Segment ', err);
    }
}

function generateCloneSegments(s3Source, command, commandBucket, marker, context, segments)
{
    var params = { Bucket: command.source.name, Marker: marker, Delimiter: '/' };
    console.log("Listing Prefixes: ", JSON.stringify(params));
    s3Source.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.CommonPrefixes.length; i++)
            {
                var item = data.CommonPrefixes[i];
                var segmentRequest = {
                    id: command.id,
                    source : {
                        name: command.source.name,
                        prefix: item.Prefix
                    },
                    destination : {
                        name: command.destination.name
                    }
                };
                
                var segmentKey = command.id + '/' + item.Prefix.replace('/', '') + '-segment-request';
                segments.push(segmentKey);
                console.log("Uploading: ", segmentKey);
                var segmentUploadParams = { Bucket: commandBucket, Key: segmentKey, Body: JSON.stringify(segmentRequest), ContentType: 'application/json'};
                s3.putObject(segmentUploadParams, putCallback);
            }
            
            if(data.IsTruncated)
            {
                generateCloneSegments(s3Source, command, commandBucket, data.NextMarker, context, segments);
            }
            else
            {
                // Write a clone-response file to the commandBucket, stating the segments generated
                console.log('Total Segments: ', segments.length);
                
                var cloneResponse = {
                    segments: {
                        count: segments.length,
                        values: segments
                    }
                };
                
                var responseKey = command.id + '/' + 'clone-response';
                var cloneResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(cloneResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                s3.putObject(cloneResponseUploadParams, putCallback);
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    s3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3Source = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var segments = [];
            generateCloneSegments(s3Source, command, commandBucket, '', context, segments);
        }
    });
};

I’m sure some improvements could be made to the Javascript (I’d love to find a way to automate tests on it), but it’s not bad for being written directly into the AWS console.

Hi Ho, Hi Ho, Its Off To Work We Go

The actual cloning function is remarkably similar to the segmenting function.

It still has to loop through items in the bucket, except it limits itself to items that match a certain prefix. It still has to do something for each item (execute a copy and add the key to its own result set) and it still has to write a file right at the end when everything is done.

console.log('Loading function');

var aws = require('aws-sdk');
var commandS3 = new aws.S3({ apiVersion: '2006-03-01' });

function copyCallback(err, data)
{
    if (err)
    {
        console.log('Failed to Copy ', err);
    }
}

function copyFiles(s3, command, commandBucket, marker, context, files)
{
    var params = { Bucket: command.source.name, Marker: marker, Prefix: command.source.prefix };
    s3.listObjects(params, function(err, data) {
        if (err)
        {
            context.fail(err);
        }
        else
        {
            for (var i = 0; i < data.Contents.length; i++)
            {
                var key = data.Contents[i].Key;
                files.push(key);
                console.log("Copying [", key, "] from [", command.source.name, "] to [", command.destination.name, "]");
                
                var copyParams = {
                    Bucket: command.destination.name,
                    CopySource: command.source.name + '/' + key,
                    Key: key
                };
                s3.copyObject(copyParams, copyCallback);
            }
            
            if(data.IsTruncated)
            {
                // NextMarker is only returned when a Delimiter is specified, so fall back to the last key seen
                var nextMarker = data.NextMarker || data.Contents[data.Contents.length - 1].Key;
                copyFiles(s3, command, commandBucket, nextMarker, context, files);
            }
            else
            {
                // Write a segment-response file
                console.log('Total Files: ', files.length);
                
                var segmentResponse = {
                    id: command.id,
                    source: command.source,
                    destination : {
                        name: command.destination.name,
                        files: {
                            count: files.length,
                            files: files
                        }
                    }
                };
                
                var responseKey = command.id + '/' + command.source.prefix.replace('/', '') + '-segment-response';
                var segmentResponseUploadParams = { Bucket: commandBucket, Key: responseKey, Body: JSON.stringify(segmentResponse), ContentType: 'application/json'};
                
                console.log("Uploading: ", responseKey);
                commandS3.putObject(segmentResponseUploadParams, function(err, data) { });
            }
        }
    });
}

exports.handler = function(event, context) {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    
    var commandBucket = event.Records[0].s3.bucket.name;
    var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    var params = {
        Bucket: commandBucket,
        Key: key
    };
    
    commandS3.getObject(params, function(err, data) 
    {
        if (err) 
        {
            context.fail(err);
        }
        else 
        {
            var command = JSON.parse(data.Body);
            var s3 = new aws.S3({ apiVersion: '2006-03-01', region: 'ap-southeast-2' });
            
            var files = [];
            copyFiles(s3, command, commandBucket, '', context, files);
        }
    });
};

Tricksy Trickses

You may notice that there is no mention of credentials in the code above. That’s because the Lambda functions run under a role with a policy that gives them the ability to list, read and put into any bucket in our account. Roles are handy for accomplishing things in AWS, avoiding the need to supply credentials. When a role is applied to a resource and no credentials are supplied, the aws-sdk will automatically generate a short term token using the role, reducing the likelihood of leaked credentials.

As I mentioned above, the asynchronous nature of Node.js made everything a little bit more difficult than expected. It was hard to determine when anything was done (somewhat important for writing manifest files). Annoyingly enough, it was even hard to determine when the function itself was finished. I kept running into issues where the function execution had finished, and it looked like it had done all of the work I expected it to do, but AWS Lambda was reporting that it did not complete successfully.

In the initial version of Node.js I was using (v0.10.42), the AWS supplied context object had a number of methods on it to indicate completion (whether success or failure). If I called the Succeed method after I set up my callbacks, the function would terminate without doing anything, because it didn’t automatically wait for the callbacks to complete. If I didn’t call it, the function would be marked as “did not complete successfully”. Extremely annoying.

As is often the case with AWS though, on literally the second hack day, AWS released support for Node.js v4.3, which automatically waits for all pending callbacks to complete before completing the function, completely changing the interaction model for the better. I did upgrade to the latest version during the second hack day (after I had accepted that my function was going to error out in the control panel but actually do all the work it needed to), but it wasn’t until later that I realised that the upgrade had fixed my problem.

The last tripwire I ran into was related to AWS Lambda not being available in all regions yet. Specifically, it’s not in ap-southeast-2 (Sydney), which is where all of our infrastructure lives. S3 is weird in relation to regions, as buckets are globally unique and accessible, but they do actually have a home region. What does this have to do with Lambda? Well, the S3 bucket triggers I used as the impetus for the function execution only work if the S3 bucket is in the same region as the Lambda function (so us-west-1), even though once you get inside the Lambda function you can read/write to any bucket you like. Weird.

Conclusion

I’ve omitted the Powershell code responsible for executing the clone for brevity. It writes the request to the bucket, reads the response and then polls waiting for all of the segments to be completed, so it’s not particularly interesting, although the polling for segment completion was my first successful application of the Invoke-Parallel function from Script Center.

Profiling the AWS Lambda approach versus the original AWS CLI sync command approach over a test bucket (7500 objects, 195 distinct prefixes, 8000 MB of data) showed a decent improvement in performance. The sync approach took 142 seconds and the Lambda approach took 55 seconds, a bit over a third of the time, which was good to see considering the last time I tried to parallelise the clone it actually decreased the performance. I think with some tweaking the Lambda approach could be improved further, with tighter polling tolerances and an increased number of parallel Lambda executions allowed.

Unfortunately, I have not had the chance to execute the AWS Lambda implementation on the huge bucket that is the entire reason it exists, but I suspect that it won’t work.

Lambda allows at maximum 5 minutes of execution time per function, and I suspect that the initial segmentation for a big enough bucket will probably take longer than that. It might be possible to chain Lambda functions together (i.e. trigger one from the next, perhaps per 1000 results returned from S3), but I’m not entirely sure how to do that yet (maybe using SNS notifications instead of S3?). Additionally, with a big enough bucket, the manifest file itself (detailing the segments) might become unwieldy. I think the problem bucket has something like 200K unique prefixes, so the size of the manifest file can add up quickly.

Regardless, the whole experience was definitely useful from a technical growth point of view. It’s always a good idea to remove yourself from your comfort zone and try some new things, and AWS Lambda + Node.js are definitely well outside my comfort zone.

A whole different continent in fact.


I’ve talked at length previously about the usefulness of ensuring that your environments are able to be easily spun up and down. Typically this means that they need to be represented as code and that code should be stored in some sort of Source Control (Git is my personal preference). Obviously this is much easier with AWS (or other cloud providers) than it is with traditionally provisioned infrastructure, but you can at least control configurations and other things when you are close to the iron.

We’ve come a long way on our journey to represent our environments as code, but there has been one hole that’s been nagging me for some time.

Versioning.

Our current environment pattern looks something like this:

  • A repository called X.Environment, where X describes the component the environment is for.
  • A series of Powershell scripts and CloudFormation templates that describe how to construct the environment.
  • A series of TeamCity Build Configurations that allow anyone to Create and Delete named versions of the environment (sometimes there are also Clone and Migrate scripts to allow for copying and updating).

When an environment is created via a TeamCity Build Configuration, the appropriate commit in the repository is tagged with something to give some traceability as to where the environment configuration came from. Unfortunately, the environment itself (typically represented as a CloudFormation stack) is not tagged for the reverse. There is currently no easy way for us to look at an environment and determine exactly the code that created it and, more importantly, how many changes have been made to the underlying description since it was created.

Granted, this information is technically available using timestamps and other pieces of data, but that is a difficult, time-consuming, manual task, so it’s unlikely to be done with any regularity.

All of the TeamCity Build Configurations that I mentioned simply use the HEAD of the repository when they run. There is no concept of using an old Delete script or being able to (easily) spin up an old version of an environment for testing purposes.

The Best Version

The key to solving some of the problems above is to really immerse ourselves in the concept of treating the environment blueprint as code.

When dealing with code, you would never publish raw from a repository, so why would we do that for the environment?

Instead, you compile (if you need to), you test and then you package, creating a compact artefact that represents a validated copy of the code that can be used for whatever purpose you need to use it for (typically deployment). This artefact has some version associated with it (whatever versioning strategy you might use) which is traceable both ways (look at the repo, see the version, find artefact, look at the artefact, see the version, go to repository).

Obviously, for a set of Powershell scripts and CloudFormation templates, there is no real compilation step. There is a testing step though (Powershell tests written using Pester) and there can easily be a packaging step, so we have all of the bits and pieces that we need in order to provide a versioned package, and then use that package whenever we need to perform environment operations.

Versioning Details

As a general rule, I prefer to not encapsulate complicated build and test logic into TeamCity itself. Instead, I much prefer to have a self contained script within the repository, that is then used both within TeamCity and whenever you need to build locally. This typically takes the form of a build.ps1 script file with a number of common inputs, and leverages a number of common tools that I’m not going to go into any depth about. The output of the script is a versioned Nupkg file and some test results (so that TeamCity knows whether or not the build failed).

Adapting our environment repository pattern to build a nuget package is fairly straightforward (similar to the way in which we handle Logstash, just package up all the files necessary to execute the scripts using a nuspec file). Voila, a self contained package that can be used at a later date to spin up that particular version of the environment.

The only difficult part here was the actual versioning of the environment itself.

Prior to this, when an environment was created it did not have any versioning information attached to it.

The easiest way to attach that information? Introduce a new common CloudFormation template parameter called EnvironmentVersion and make sure that it is populated when an environment is created. The CloudFormation stack is also tagged with the version, for easy lookup.
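
As a sketch of the idea (using the Javascript aws-sdk purely for illustration; in reality the parameter and tag are supplied by our New-Environment Powershell wrapper, and the values below are placeholders):

// Pass the environment version in as a template parameter and also attach it as a stack tag,
// so the stack can be traced back to the package that created it.
var aws = require('aws-sdk');
var cloudFormation = new aws.CloudFormation({ region: 'ap-southeast-2' });

cloudFormation.createStack({
    StackName: 'my-service-ci',
    TemplateURL: 'https://s3.amazonaws.com/my-environment-templates/my-service.template',
    Parameters: [
        { ParameterKey: 'EnvironmentVersion', ParameterValue: '1.2.16209.12345' }
    ],
    Tags: [
        { Key: 'EnvironmentVersion', Value: '1.2.16209.12345' }
    ]
}, function(err, data) {
    if (err) console.log('Failed to create environment stack ', err);
    else console.log('Created stack ', data.StackId);
});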

For backwards compatibility, I made the environment version optional when you execute the New-Environment Powershell cmdlet (which is our wrapper around the AWS CFN tools). If not specified it will default to something that looks like 0.0.YYDDDD.SSSSS, making it very obvious that the version was not specified correctly.

For the proper versioning inside an environment’s source code, I simply reused some code we already had for dealing with AssemblyInfo files. It might not be the best approach, but including an AssemblyInfo file (along with the appropriate Assembly attributes) inside the repository and then reading from that file during environment creation is easy enough and consistency often beats optimal.

Improving Versioning

What I’ve described above is really a step in part of a larger plan.

I would vastly prefer if the mechanism for controlling what versions of an environment are present and where was delegated to Octopus Deploy, just like with the rest of our deployable components.

With a little bit of extra effort, we should be able to create a release for an appropriately named Octopus project and then push to that project whenever a new version of the environment is available.

This would give excellent visibility into what versions of the environment are where, and also allow us to leverage something I have planned for helping us see just how different the version in environment X is from the version in environment Y.

Ad-hoc environments will still need to be managed via TeamCity, but known environments (like CI, Staging and Production) should be able to be handled within Octopus.

Summary

I much prefer the versioned and packaged approach to environment management that I’ve outlined above. It seems much neater and allows for a lot of traceability and repeatability, something that was lacking when environments were being managed directly from HEAD.

It helps that it looks very similar to the way that we manage our code (both libraries and deployable components) so the pattern is already familiar and understandable.

You can see an example of what a versioned, packageable environment repository would look like here. Keep in mind that the common scripts inside that repository are not usually included directly like that. They are typically downloaded and installed via a bootstrapping process (using a Nuget package), but for this example I had to include them directly so that I didn’t have to bring along the rest of our build pipeline.

Speaking of the common scripts, unfortunately they are a constant reminder of a lack of knowledge about how to create reusable Powershell components. I’m hoping to restructure them into a number of separate modules with greater cohesion, but until then they are a bit unwieldy (just a massive chunk of scripts that are dot-included wherever they are needed).

That would probably make a good blog post actually.

How to unpick a mess of Powershell that past you made.

Sometimes I hate past me.