
Such a stupid thing.

Even though I was really careful, I still did the thing.

I knew about the thing, I planned to avoid it, but it still happened.

I accidentally uploaded some AWS credentials (Key + Secret) into GitHub…

I’m going to use this blog post to share exactly what happened, how we responded, how Amazon responded and some thoughts about how to avoid it.

Anatomy of Stupidity

The repository that accompanied my JMeter post last week had a CloudFormation script in it, to create a variable-sized army of AWS instances ready to run load tests over a service. You’re probably thinking that’s where the credentials were, that I’d hardcoded them into the environment creation script and forgotten to remove them before uploading to GitHub.

You would be wrong. My script was parameterised well, requiring you to supply credentials (amongst other things) in order to create the appropriate environment using CloudFormation.

My test though…

I recently started writing tests for my Powershell scripts using Pester. In this particular case, I had a test that created an environment and then verified that parts of it were working (i.e. URL in output resolved, returned 200 OK for a status query, stuff like that), then tore it down.

The test had hardcoded credentials in it. The credentials were intended to be used with CloudFormation, so they were capable of creating various resources, most importantly EC2 instances.

Normally when I migrate some scripts/code that I’ve written at work into GitHub for public consumption I do two things.

One, I copy all of the files into a fresh directory and pore over the resulting code for references to anything specific or sensitive. Company name, credentials, those sorts of things. I excise all of the non-generic functionality, and anything that I don’t want to share (mostly stuff not related to the blog post in question).

Two, I create a fresh git repository from those edited files. The main reason I do this instead of just copying the repository is that otherwise the history would still contain everything I just removed, and that’s a far more subtle leak.

There is no excuse for me exposing the credentials except for stupidity. I’ll wear this one for a while.

Timeline of Horror

Somewhere between 1000 and 1100 on Wednesday April 8, I uploaded my JMeter blog post, along with its associated repository.

By 1130 an automated bot had already retrieved the credentials and had started to use them.

As far as we can tell, the bot picks up all of the regions that you do not have active resources in (so for us, that’s basically anything outside ap-southeast) and creates the maximum possible number of c4.8xlarge instances in every single region, via spot requests. All told, 500+ instances spread across the world. Alas, we didn’t keep an example of one of the instances (too concerned with terminating them ASAP), but we assume from reading other resources that they were being used to mine Bitcoins and then transfer them anonymously to some destination.

At 1330 Amazon notified us that our credentials were compromised, via an email to an inbox that was not being actively monitored for reasons that I won’t go into (but I’m sure will be actively monitored from now on). They also prevented our account from doing various things, including the creation of new credentials and particularly expensive resources. Thanks, Amazon! Seriously, your (probably automated) actions saved us a lot more grief.

The notification from Amazon was actually pretty awesome. They pinpointed exactly where the credentials were exposed in GitHub. They must have the same sort of bots running as the thieves, except used for good, rather than evil.

At approximately 0600 Thursday April 9, we received a billing alert that our AWS account had exceeded a limit we had set in place. Luckily this did go to an inbox that was being actively monitored, and our response was swift and merciless.

Within 15 minutes we had terminated the exposed credentials, terminated all of the unlawful instances and removed all of the spot requests. We created a script to select all resources within our primary region that had been modified or created in the last 24 hours and reviewed the results. Luckily, nothing had been modified within our primary region. All of the unlawful activity had occurred outside it, probably in the hope that we wouldn’t notice.
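For the curious, the audit boiled down to something like the sketch below. It assumes the AWSPowerShell module is loaded with credentials configured, and the region is just a placeholder.

# Sketch: find EC2 instances and spot requests created in the last 24 hours.
# Region is a placeholder; assumes the AWSPowerShell module and configured credentials.
$cutoff = (Get-Date).AddHours(-24)

(Get-EC2Instance -Region "ap-southeast-2").Instances |
    Where-Object { $_.LaunchTime -gt $cutoff } |
    Select-Object InstanceId, InstanceType, LaunchTime

Get-EC2SpotInstanceRequest -Region "ap-southeast-2" |
    Where-Object { $_.CreateTime -gt $cutoff } |
    Select-Object SpotInstanceRequestId, State, CreateTime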

Ramifications

In that almost 24 hour period, the compromise resulted in just over $8000 AUD of charges to our AWS account.

Don’t underestimate the impact that exposed credentials can have. It happens incredibly quickly.

I have offered to pay for the damages out of my own pocket (as is only appropriate), but AWS also has a concession strategy for this sort of thing, so we’ll see how much I actually have to pay in the end.

Protecting Ourselves

Obviously the first point is don’t store credentials in a file. Ever. Especially not one that goes into source control.

The bad part is, I knew this, but I stupidly assumed it wouldn't happen to me because our code is not publicly visible. That would have held true if I hadn’t used some of the scripts that I’d written as the base for a public repository to help explain a blog post.

Never assume your credentials are safe just because they are in a file that isn’t public, unless that file is protected by a mechanism that doesn’t live alongside it (so encrypting them and keeping the encryption key in the same codebase is not enough).

I have since removed all credential references from our code (there were a few copies of that fated environment creation test in various repositories, luckily no others) and replaced them with a mechanism to supply credentials via a global hashtable entered at the time the tests are run. It’s fairly straightforward, and is focused on telling you which credentials are missing when they cannot be found. No real thought has been given to making sure the credentials are secure on the machine itself; it’s focused entirely on keeping secrets off the disk.

function Get-CredentialByKey
{
    [CmdletBinding()]
    param
    (
        [string]$keyName
    )

    if ($globalCredentialsLookup -eq $null)
    {
        throw "Global hashtable variable called [globalCredentialsLookup] was not found. Credentials are specified at the entry point of your script. Specify hashtable content with @{KEY=VALUE}."
    }

    if (-not ($globalCredentialsLookup.ContainsKey($keyName)))
    {
        throw "The credential with key [$keyName] could not be found in the global hashtable variable called [globalCredentialsLookup]. Specify hashtable content with @{KEY=VALUE}."
    }

    return $globalCredentialsLookup.Get_Item($keyName)
}
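For context, using it looks something like this (the key names are made up for illustration):

# The person running the tests supplies the hashtable at the entry point of their session,
# and the tests pull out what they need by key. Key names here are hypothetical.
$globalCredentialsLookup = @{
    "CloudFormationAwsKey"    = "KEY SUPPLIED AT RUNTIME";
    "CloudFormationAwsSecret" = "SECRET SUPPLIED AT RUNTIME";
}

$awsKey = Get-CredentialByKey "CloudFormationAwsKey"
$awsSecret = Get-CredentialByKey "CloudFormationAwsSecret"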

The second point is specific to AWS credentials. You should always limit credentials to exactly what they need to be able to do.

In our case, there was no reason for the credentials to be able to create instances outside of our primary region. Other than that, they were pretty good (they weren’t administrative credentials for example, but they certainly did have permission to create various resources used in the environment).
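As an illustration of what that region restriction could look like (an assumption on my part, not what we actually did), you can attach a policy that denies EC2 actions outside your primary region. The policy name and region below are placeholders, and New-IAMPolicy comes from the AWSPowerShell module.

# Hypothetical example: deny EC2 actions outside the primary region.
# Policy name and region are placeholders.
$policyDocument = @"
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "ec2:*",
            "Resource": "*",
            "Condition": { "StringNotEquals": { "ec2:Region": "ap-southeast-2" } }
        }
    ]
}
"@

New-IAMPolicy -PolicyName "DenyEc2OutsidePrimaryRegion" -PolicyDocument $policyDocument
# The resulting policy then needs to be attached to the relevant user or group.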

The third point is obvious. Make sure you have a reliable communication channel for messages like compromise notifications, one that is guaranteed to be monitored by at least one person at all times. This would have saved us a tonne of grief. The earlier you know about this sort of thing, the better.

Summary

AWS is an amazing service. It lets me treat hardware resources in a programmable way, and stops me from having to wait on other people (who probably have more important things to do anyway). It lets me create temporary things to deal with temporary issues, and is just generally a much better way to work.

With great power comes great responsibility.

Guard your credentials closely. Don’t be stupid like me, or someone will get a nasty bill, and it will definitely come back to you. Also you lose engineer points. Luckily I’ve only done two stupid things so far this year. If you were curious, the other one was that I forgot my KeePass password for the file that contains all of my work related credentials. Luckily I had set it to match my domain password, and managed to recover that from Chrome (because we use our domain credentials to access Atlassian tools).

This AWS thing was a lot more embarrassing.


Last time I outlined the start of setting up an ELK based Log Aggregator via CloudFormation. I went through some of the problems I had with executing someone else's template (permissions! proxies!), and then called it there, because the post was already a mile long.

Now for the thrilling conclusion where I talk about errors that mean nothing to me, accepting defeat, rallying and finally shipping some real data.

You Have Failed Me Again Java

Once I managed to get all of the proxy issues sorted, and everything was being downloaded and installed properly, the instance was still not responding to HTTP requests over the appropriate ports. Well, it seemed like it wasn’t responding anyway.

Looking into the syslog, I saw repeated attempts to start Elasticsearch and Logstash, with an equal number of failures, where the process had terminated immediately and unexpectedly.

The main issue appeared to be an error about “Bad Page Map”, which of course makes no sense to me.

Looking it up, it appears as though there was an issue with the version of Ubuntu that I was using (kernel 3.13? It’s really hard to tell which version means what), and it was not actually specific to Java. I’m going to blame Java anyway. Apparently the issue is fixed in 3.15.

After swapping the AMI to the latest distro of Ubuntu, the exceptions no longer appeared inside syslog, but the damn thing still wasn’t working.

I could get to the appropriate pages through the load balancer, which would redirect me to Google to supply OAuth credentials, but after supplying appropriate credentials, nothing else ever loaded. No Kibana, no anything. This meant of course that ES was (somewhat) working, as the load balancer was passing its health checks, and Kibana was (somewhat) working because it was at least executing the OAuth bit.

Victory?

Do It Yourself

It was at this point that I decided to just take it back to basics and start from scratch. I realised that I didn’t understand some of the components being installed (LogCabin for example, something to do with the OAuth implementation?), and that they were getting in the way of me accomplishing a minimum viable product. I stripped out all of the components from the UserData script, looked up the latest compatible versions of ES, Logstash and Kibana, installed them, and started them as services. I had to make some changes to the CloudFormation template as well (ES defaults to 9200, Kibana 4 to 5601, had to expose the appropriate ports and make some minor changes to the health check. Logstash was fine).

The latest version of Kibana is more self contained than previous ones, which is nice. It comes with its own web server, so all you have to do is start it up and it will listen for and respond to requests on port 5601 (which can be changed). This is different to the version that I was originally working with (3?), which seemed to be hosted directly inside Elasticsearch? I’m still not sure what the original template was doing to be honest, all I know is that it didn’t work.
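A quick way to sanity check both services once they are running is to hit them directly (the host names below are placeholders):

# Elasticsearch cluster health (9200) and the Kibana web server (5601).
Invoke-RestMethod -Uri "http://logs-internal.example.com:9200/_cluster/health"
Invoke-WebRequest -Uri "http://logs.example.com:5601/" -UseBasicParsing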

Success!

A Kibana dashboard, load balancers working, ES responding. Finally everything was up and running. I still didn’t fully understand it, but it was a hell of a lot more promising than it was before.

Now all I had to do was get some useful information into it.

Nxlog Sounds Like a Noise You Make When You Get Stabbed

There are a number of ways to get logs into ES via Logstash. Logstash itself can be installed on other machines and forward local logs to a remote Logstash, but it’s kind of heavyweight for that sort of thing. Someone has written a smaller component called Logstash-Forwarder which does a similar thing. You can also write directly to ES using Logstash-compatible index names (Serilog offers a sink that does just that).

The Logstash solutions above seem to assume that you are gathering logs on a Unix based system though, and don’t really offer much in the way of documentation or help if you have a Windows based system.

After a small amount of investigation, a piece of software called Nxlog appears to be the most commonly used log shipper as far as Windows is concerned.

As with everything in automation, I couldn’t just go onto our API hosting instances and install and configure Nxlog by hand. I had to script it, and then add those scripts to the CloudFormation template for our environment setup.

Installing Nxlog from the command line is relatively simple using msiexec and the appropriate “don't show the damn UI” flags, and configuring it is simple as well. All you need to do is have an nxlog.conf file configured with what you need (in my case, IIS and application logs being forwarded to the Logstash endpoint) and then copy it to the appropriate conf folder in the installation directory.
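The install and configure step ends up looking something like the sketch below. The paths, file names and service name are assumptions, not our exact script.

# Silent install with a verbose log so there is something to look at if it fails.
$installer = "C:\temp\nxlog-ce.msi"
$process = Start-Process "msiexec.exe" -ArgumentList "/i `"$installer`" /quiet /norestart /l*v C:\temp\nxlog-install.log" -Wait -PassThru
if ($process.ExitCode -ne 0) { throw "Nxlog installation failed with exit code [$($process.ExitCode)]." }

# Drop in the prepared configuration and restart the service so it takes effect.
Copy-Item "C:\temp\nxlog.conf" "C:\Program Files (x86)\nxlog\conf\nxlog.conf" -Force
Restart-Service "nxlog"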

The nxlog configuration file takes some getting used to, but their documentation is pretty good, so it’s just a matter of working through it. The best tip I can give is to output to a file until you are sure that Nxlog is doing what you think it’s doing, and then flip everything over to output to Logstash. You’ll save a lot of frustration if you know exactly where the failures are (and believe me, there will be failures).

After setting up Nxlog, it all started to happen! Stuff was appearing in Kibana! It was one of those magical moments where you actually get a result, and it felt good.

Types? We need Types?

I got everything working nicely in my test environment, so I saved my configuration, tore down the environments and created them again (to verify they could be recreated). Imagine my surprise when I was getting Nxlog internal messages into ES, but nothing from IIS. I assumed that I had messed up Nxlog somehow, so I spent a few hours trying to debug what was going wrong. My Nxlog config appeared to be fine, so I assumed that there was something wrong with the way I had configured Logstash. Again, it seemed to be fine.

It wasn’t until I looked into the Elasticsearch logs that I found out why all of my IIS logs were not making it. The first document sent to Elasticsearch had a field called EventReceivedTime (from the Nxlog internal source) which was a date, represented as ticks since X, i.e. a gigantic number. ES had inferred the type of this field as a long. The IIS source also had a field called EventReceivedTime, which was an actual date (i.e. YYYY-MM-DD HH:mm). When any IIS entry arrived in ES from Logstash, ES errored out trying to parse the datetime into a long, and discarded it. Because of the asynchronous nature of the system, there was no way for Logstash to communicate the failure back to anywhere that I could see it.
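If you hit something similar, the Elasticsearch field mapping API is a quick way to confirm what type it has inferred (the host and index pattern below are guesses):

# Shows the mapping Elasticsearch has inferred for the conflicting field.
Invoke-RestMethod -Uri "http://logs-internal.example.com:9200/logstash-*/_mapping/field/EventReceivedTime"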

After making sure that both EventReceivedTime fields were dates, everything worked swimmingly.

I suppose this might reoccur in the future, with a different field name conflicting. I’m not sure exactly what the best way to deal with this would be. Maybe the Elasticsearch logs should be put into ES as well? At least then I could track it. You could set up a template to strongly type the fields as well, but due to the fluidity of ES, there are always going to be new fields, and ES will always try to infer an appropriate type, so having a template won’t stop it from occurring.

Mmmmmmm Kibana

Look at this dashboard.

Look at it.

So nice.

I haven’t even begun to plumb the depths of the information now at my fingertips. Most of those charts are simple (average latency, total requests, response codes, requested paths), but it still provides a fantastic picture of the web applications in question.

Conclusion

Last time I wrote a CloudFormation template, I didn’t manage to get it into a publicly available repository, which kind of made the blog posts around it significantly less useful.

This time I thought ahead. You can find all of the scripts and templates for the Log Aggregator in this repository. This is a copy of our actual repository (private, in Bitbucket), so I’m not sure if I will be able to keep it up to date as we make changes, but at least there is some context to what I’ve been speaking about.

I’ve included the scripts that setup and configure Nxlog as well. These are actually located in the repository that contains our environment setup, but I think they are useful inside this repository as a point of reference for setting up log shipping on a Windows system. Some high level instructions are available in the readme of the repository.

Having a Log Aggregator, even though it only contains IIS and application logs for a single application, has already proved useful. It adds a huge amount of transparency to what is going on, and Kibana’s visualisations are extremely helpful in making sense of the data.

Now to do some load testing on the web application and see what Kibana looks like when everything breaks.


Logging is one of the most important components of a good piece of software.

That was a statement, not a question, and is not up for debate.

Logging enables you to do useful things, like identify usage patterns (helpful when deciding what sections of the application need the most attention), investigate failures (because there are always failures, and you need to be able to get to their root causes) and keep an eye on performance (which is a feature, no matter what anyone else tells you). Good logging enables a piece of software to be supported long after the software developers who wrote it have moved on, extending the life expectancy of the application and thus improving its return on investment.

It is a shame really, that logging is not always treated like a first class citizen. Often it is an afterthought, added in later after some issue or failure proves that it would have been useful, and then barely maintained from that point forward.

Making sure your application has excellent logging is only the first part though; you also need somewhere to put the logs, so that the people who need them can access them.

The most common approach is to have logs be output to a file, somewhere on the local file system relative to the location where the software is installed and running. Obviously this is better than not having logs, but only just barely. When you have log files locally, you are stuck in a reactive mindset, using the logs as a diagnostic tool when a problem is either observed or reported through some other channel (like the user complaining).

The better approach is to send the logs somewhere. Somewhere they can be watched and analysed and alerted on. You can be proactive when you have logs in a central location, finding issues before the users even notice and fixing them even faster.

I’m going to refer to that centralised place where the logs go as a Log Aggregator, although I’m not sure if that is the common term. It will do for now though.

Bucket O’ Logs

At my current job, we recently did some work to codify the setup of our environments. Specifically, we used AWS CloudFormation and Powershell to setup an auto scaling group + load balancer (and supporting constructs) to be the home for a new API that we were working on.

When you have a single machine, you can usually make do with local logs, even if it’s not the greatest of ideas (as mentioned above). When you have a variable number of machines, whose endpoints are constantly shifting and changing, you really need a central location where you can keep an eye on the log output.

Thus I’ve spent the last week and a bit working on exactly that. Implementing a log aggregator.

After some initial investigation, we decided to go with an ELK stack. ELK stands for Elasticsearch, Logstash and Kibana, three components that each serve a different purpose. Elasticsearch is a document database with strong analysis and search capabilities. Logstash is an ETL (Extract, Transform, Load) system, used for moving logs around as well as transforming and mutating them into appropriate structures to be stored in Elasticsearch. Kibana is a front end visualisation and analysis tool that sits on top of Elasticsearch.

We went with ELK because a few other teams in the organization had already experimented with it, so there was at least a little organizational knowledge floating around to exploit. Alas, the other teams had not treated their ELK stacks as anything reusable, so we still had to start from scratch in order to get anything up and running.

We did look at a few other options (Splunk, Loggly, Seq) but it seemed like ELK was the best fit for our organisation and needs, so that was what we went with.

Unix?

As is my pattern, I didn’t just want to jam something together and call that our log aggregator, hacking away at a single instance or machine until it worked “enough”. I wanted to make sure that the entire process was codified and reproducible. I particularly liked the way in which we had done the environment setup using CloudFormation, so I decided that would be a good thing to aim for.

Luckily someone else had already had the same idea, so in the true spirit of software development, I stole their work to bootstrap my own.

Stole in the sense that they had published a public repository on GitHub with a CloudFormation template to setup an ELK stack inside it.

I cloned the repository, wrote a few scripts around executing the CloudFormation template and that was that. ELK stack up and running.

Right?

Right?

Ha! It’s never that easy.

Throughout the rest of this post, keep in mind that I haven't used a Unix based operating system in anger in a long time. The ELK stack used an Ubuntu distro, so I was at a disadvantage from the word go. On the upside, having been using cmder a lot recently, I was far more comfortable inside a command line environment than I ever have been before. Certainly more comfortable than I was back when I last used Unix.

Structural Integrity

The structure of the CloudFormation template was fairly straightforward. There were two load balancers, backed by an auto scaling group. One of the load balancers was public, intended to expose Kibana. The other was internal (i.e. only accessible from within the specified VPC) intended to expose Logstash. There were some Route53 entries to give everything nice names, and an Auto Scaling Group with a LaunchConfig to define the configuration of the instances themselves.

The auto scaling group defaulted to a single instance, which is what I went with. I’ll look into scaling later, when it actually becomes necessary and we have many applications using the aggregator.

As I said earlier, the template didn’t just work straight out of the repository, which was disappointing.

The first issue I ran into was that the template called for the creation of an IAM role. The credentials I was using to execute the template did not have permissions to do that, so I simply removed it until I could get the appropriate permissions from our AWS account managers. It turns out I didn’t really need it anyway, as the only thing I needed to access using AWS credentials was an S3 bucket (for dependency distribution) which I could configure credentials for inside the template, supplied as parameters.

Removing the IAM role allowed me to successfully execute the template, and it eventually reached that glorious state of “Create Complete”. Yay!

It still didn’t work though. Booooo!

It’s Always a Proxy

The initial template assumed that the instance would be accessible over port 8080. The public load balancer relied on that fact and its health check queried the __es path. The first sign that something was wrong was that the load balancer thought that the instance inside it was unhealthy, so it was failing its health check.

Unfortunately, the instance was not configured to signal failure back to CloudFormation if its setup failed, so although CloudFormation had successfully created all of its resources, when I looked into the cloud-init-output.log file in /var/log, it turned out that large swathes of the init script (configured in the UserData section of the LaunchConfig) had simply failed to execute.

The issue here was that we require all internet access from within our VPC to the outside world to go through a proxy. Obviously the instance was not configured to use the proxy (how could it be, it was from a public git repo), so all communications to the internet were being blocked, including calls to apt-get and the download of various configuration files directly from git.

Simple enough to fix: set the http_proxy and https_proxy environment variables to the appropriate value.

It was at this point that I also added a call to install the AWS CloudFormation components on the instance during initialisation, so that I could use cfn-signal to indicate failures. This at least gave me an indication of whether or not the instance had actually succeeded its initialization, without having to remote into the machine to look at the logs.

When working on CloudFormation templates, it’s always useful to have some sort of repeatable test that you can run in order to execute the template, ideally from the command line. You don’t want to have to go into the AWS Dashboard to do that sort of thing, and it’s good to have some tests outside the template itself to check its external API. As I was already executing the template through Powershell, it was a simple matter to include a Pester test that executed the template, checked that the outputs worked (the outputs being the Kibana and Logstash URLs) and then tore the whole thing down if everything passed.
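Cut down to its bones, that test looked something like the sketch below. The stack name, template path, parameters and output key are all placeholders, and it assumes Pester and the AWSPowerShell module are available (Wait-CFNStack in particular may need a reasonably recent version of the module).

Describe "Log Aggregator environment" {
    It "creates the stack and exposes a working Kibana URL" {
        $stackName = "log-aggregator-test"
        $template = Get-Content ".\elk.template" -Raw
        New-CFNStack -StackName $stackName -TemplateBody $template -Parameter @( @{ ParameterKey = "S3BucketName"; ParameterValue = "some-bucket" } )
        try
        {
            # Wait for CREATE_COMPLETE, then check the Kibana output actually responds.
            $stack = Wait-CFNStack -StackName $stackName -Timeout 3600
            $kibanaUrl = ($stack.Outputs | Where-Object { $_.OutputKey -eq "KibanaUrl" }).OutputValue
            (Invoke-WebRequest -Uri $kibanaUrl -UseBasicParsing).StatusCode | Should Be 200
        }
        finally
        {
            Remove-CFNStack -StackName $stackName -Force
        }
    }
}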

At this point I also tried to setup some CloudWatch logs that would automatically extract the contents of the various initialization log files to a common location, so that I could view them from the AWS Dashboard when things were not going as expected. I did not, in the end, manage to get this working. The irony of needing a log aggregator to successfully setup a log aggregator was not lost on me.

Setting the environment variables fixed the majority of the proxy issues, but there was one other proxy related problem left that I didn’t pick up until much later. All of the Elasticsearch plugins were failing to install, for exactly the same reason. No proxy settings. Apparently Java does not read the system proxy settings (bad Java!) so I had to manually supply the proxy address and port to the call to the Elasticsearch plugin installation script.

The initialisation log now showed no errors, and everything appeared to be installed correctly.

But it still wasn’t working.

To Be Continued

Tune in next week for the thrilling conclusion, where I discover a bug caused by the specific combination of Ubuntu version and Java version, get fed up with the components being installed and start from scratch and then struggle with Nxlog in order to get some useful information into the stack.


It’s been a while since I posted the first part of this blog post, but now it’s time for the thrilling conclusion! Note: Conclusion may not actually be thrilling.

Last time I gave a general outline of the problem we were trying to solve (automatic deployment of an API, into a controlled environment), went through our build process (TeamCity) and quickly ran through how we were using Amazon CloudFormation to setup environments.

This time I will be going over some additional pieces of the environment setup (including distributing dependencies and using Powershell Desired State Configuration for machine setup) and how we are using Octopus Deploy for deployment.

Like the last blog post, this one will be mostly explanation and high level descriptions, as opposed to a copy of the template itself and its dependencies (Powershell scripts, config files, etc). In other words, there won’t be a lot of code, just words.

Distribution Network

The environment setup has a number of dependencies, as you would expect, like scripts, applications (for example, something to zip and unzip archives), configuration files, etc. These dependencies are needed in a number of places. One of those places is wherever the script to create the environment is executed (a development machine or maybe even a CI machine) and another place is the actual AWS instances themselves, way up in the cloud, as they are being spun up and configured.

The most robust way to deal with this, to ensure that the correct versions of the dependencies are being used, is to deploy the dependencies as part of the execution of the script. This way you ensure that you are actually executing what you think you’re executing, and can make local changes and have those changes be used immediately, without having to upload or update scripts stored in some web accessible location (like S3 or an FTP site or something). I’ve seen approaches where dependencies are uploaded to some location once and then manually updated, but I think that approach is risky from a dependency management point of view, so I wanted to improve on it.

I did this in a similar way to what I did when automating the execution of our functional tests. In summary, all of the needed dependencies are collected and compressed during the environment creation script, and uploaded to a temporary location in S3, ready to be downloaded during the execution of the CloudFormation template within Amazon’s infrastructure.
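In outline, that collect/compress/upload step is something like the sketch below. The real scripts used a bundled zip tool and different paths; Compress-Archive, the bucket and the key here are stand-ins.

# Collect and compress the dependencies, then push them to a temporary S3 location.
$dependencies = "$repositoryRoot\scripts", "$repositoryRoot\tools"
$archive = "$env:TEMP\dependencies.zip"
$key = "environment-dependencies/$([Guid]::NewGuid())/dependencies.zip"

Compress-Archive -Path $dependencies -DestinationPath $archive -Force
Write-S3Object -BucketName "some-deployment-bucket" -Key $key -File $archive

# The S3 path is then supplied to the CloudFormation template as a parameter, so the
# instances download exactly the versions that were just uploaded.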

The biggest issue I had with distributing the dependencies via S3 was getting the newly created EC2 instances specified in the CloudFormation template access to the S3 bucket where the dependencies were. I first tried to use IAM roles (which I don’t really understand), but I didn’t have much luck, probably as a result of inexperience. In the end I went with supplying 3 pieces of information as parameters in the CloudFormation template. The S3 path to the dependencies archive, and a key and secret for a pre-configured AWS user that had guaranteed access to the bucket (via a Bucket Policy).

Within the template, inside the LaunchConfiguration (which defines the machines to be spun up inside an Auto Scaling Group) there is a section for supplying credentials to be used when accessing files stored in an S3 bucket, and the parameters are used there.

"LaunchConfig" : {
    "Type" : "AWS::AutoScaling::LaunchConfiguration",
    "Metadata" : {
        "AWS::CloudFormation::Authentication" : {
            "S3AccessCreds" : {
                "type" : "S3",
                "accessKeyId" : { "Ref" : "S3AccessKey" },
                "secretKey" : { "Ref": "S3SecretKey" },
                "buckets" : [ { "Ref":"S3BucketName" } ]
            }
        }
    }
}

I’m not a huge fan of the approach I had to take in the end, as I feel the IAM roles are a better way to go about it, I’m just not experienced enough to know how to implement them. Maybe next time.

My Wanton Desires

I’ll be honest, I don’t have a lot of experience with Powershell Desired State Configuration (DSC from here on). What I have observed so far is that it is very useful, as it allows you to take a barebones Windows machine and specify (using a script, so it’s repeatable) what components you would like to be installed and how you want them to be configured. Things like the .NET Framework, IIS and even third party components like Octopus Tentacles.

When working with virtualisation in AWS, this allows you to skip the part where you have to configure an AMI of your very own, and to instead use one of the pre-built and maintained Amazon AMIs. This allows you to easily update to the latest, patched version of the OS whenever you want, because you can always just run the same DSC script on the new machine to get it into the state you want. You can even switch up the OS without too much trouble, flipping to a newer, greater version or maybe even dropping back to something older.

Even though I don’t have anything to add about DSC, I thought I’d mention it here as a record of the method we are using to configure the Windows instances in AWS during environment setup. Essentially what happens is that AWS gives you the ability to execute arbitrary scripts during the creation of a new EC2 instance. We use a barebones Windows Server 2012 AMI, so during setup it executes a series of scripts (via CloudFormation) one of which is the Powershell DSC script that installs IIS, .NET and an Octopus Tentacle.
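To give a flavour of what that kind of DSC script looks like, here is a minimal sketch (deliberately not our actual script, which also handles .NET and the Octopus Tentacle):

Configuration ApiWebServer
{
    Node "localhost"
    {
        # Installs IIS.
        WindowsFeature WebServer
        {
            Name   = "Web-Server"
            Ensure = "Present"
        }

        # Installs ASP.NET 4.5 support.
        WindowsFeature AspNet45
        {
            Name   = "Web-Asp-Net45"
            Ensure = "Present"
        }
    }
}

# Compiling the configuration produces a MOF file, which is then applied to the machine.
ApiWebServer -OutputPath "C:\temp\dsc"
Start-DscConfiguration -Path "C:\temp\dsc" -Wait -Verbose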

I didn’t write the DSC script that we are using, I just copied it from someone else in the organisation. It does what I need it to do though, so I haven’t spent any time trying to really understand it. I’m sure I’ll have to dig into it in more depth at some future point, and when I do, I’ll make sure to write about it.

Many Arms Make Light Work

With the environment itself sorted (CloudFormation + dependencies dynamically uploaded to S3 + Powershell DSC for configuration + some custom scripts), now it’s time to talk about deployment/release management.

Octopus Deploy is all about deployment/release management, and is pretty amazing.

I hadn’t actually used Octopus before this, but I had heard of it. I saw Paul speak at DDD Brisbane 2014, and had a quick chat with him after, so I knew what the product was for and approximately how it worked, but I didn’t realise how good it actually was until I started using it myself.

I think the thing that pleases me most about Octopus is that it treats automation and programmability as a first class citizen, rather than an afterthought. You never have to look too far or dig too deep to figure out how you are going to incorporate your deployment in an automated fashion. Octopus supplies a lot of different programmability options, from an executable to a set of .NET classes to a REST API, all of which let you do anything that you could do through the Octopus website.

It’s great, and I wish more products were implemented in the same way.

Our usage of Octopus is very straightforward.

During environment setup Octopus Tentacles are installed on the machines with the appropriate configuration (i.e. in the appropriate role, belonging to the appropriate environment). These tentacles provide the means by which Octopus deploys releases.

One of the outputs from our build process is a NuGet Package containing everything necessary to run the API as a web service. We use Octopack for this (another awesome little tool supplied by Octopus Deploy), which is a simple NuGet Package that you add to your project, and then execute using an extra MSBuild flag at build time. Octopack takes care of expanding dependencies, putting everything in the right directory structure and including any pre and post deploy scripts that you have specified in the appropriate place for execution during deployment.
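For reference, the OctoPack part boils down to a couple of extra MSBuild properties (the project name and version here are placeholders):

# RunOctoPack turns on package creation; OctoPackPackageVersion controls the package version.
& msbuild .\Api.csproj /t:Build /p:Configuration=Release /p:RunOctoPack=true /p:OctoPackPackageVersion=1.2.3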

The package containing our application (versioned appropriately) is uploaded to a NuGet feed, and then we have a project in Octopus Deploy that contains some logic for how to deploy it (stock standard stuff relating to IIS setup). At the end of a CI build (via TeamCity) we create a new release in Octopus Deploy for the version that was just built and then automatically publish it to our CI environment.

This automatic deployment during build is done via a Powershell script.

if ($isCI)
{
    Get-ChildItem -Path ($buildDirectory.FullName) | NuGet-Publish -ApiKey $octopusServerApiKey -FeedUrl "$octopusServerUrl/nuget/packages"

    . "$repositoryRootPath\scripts\common\Functions-OctopusDeploy.ps1"
    $octopusProject = "[OCTOPUS PROJECT NAME]"
    New-OctopusRelease -ProjectName $octopusProject -OctopusServerUrl $octopusServerUrl -OctopusApiKey $octopusServerApiKey -Version $versionChangeResult.New -ReleaseNotes "[SCRIPT] Automatic Release created as part of Build."
    New-OctopusDeployment -ProjectName $octopusProject -Environment "CI" -OctopusServerUrl $octopusServerUrl -OctopusApiKey $octopusServerApiKey
}

Inside the snippet above, the $versionChangeResult is a local variable that defines the version information for the build, with the New property being the new version that was just generated. The NuGet-Publish function is a simple wrapper around NuGet.exe and we’ve abstracted the usage of Octo.exe to a set of functions (New-OctopusRelease, New-OctopusDeployment).

function New-OctopusRelease
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusServerUrl,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusApiKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$projectName,
        [string]$releaseNotes,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$version
    )

    if ($repositoryRoot -eq $null) { throw "repositoryRoot script scoped variable not set. Thats bad, its used to find dependencies." }

    $octoExecutable = Get-OctopusToolsExecutable
    $octoExecutablePath = $octoExecutable.FullName

    $command = "create-release"
    $arguments = @()
    $arguments += $command
    $arguments += "--project"
    $arguments += $projectName
    $arguments += "--server"
    $arguments += $octopusServerUrl
    $arguments += "--apiKey" 
    $arguments += $octopusApiKey
    if (![String]::IsNullOrEmpty($releaseNotes))
    {
        $arguments += "--releasenotes"
        $arguments += "`"$releaseNotes`""
    }
    if (![String]::IsNullOrEmpty($version))
    {
        $arguments += "--version"
        $arguments += $version
        $arguments += "--packageversion"
        $arguments += $version
    }

    (& "$octoExecutablePath" $arguments) | Write-Verbose
    $octoReturn = $LASTEXITCODE
    if ($octoReturn -ne 0)
    {
        throw "$command failed. Exit code [$octoReturn]."
    }
}

The only tricksy thing in the script above is the Get-OctopusToolsExecutable function. All this function does is ensure that the executable exists. It looks inside a known location (relative to the global $repositoryRoot variable) and if it can’t find the executable it will download the appropriate NuGet package.
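For completeness, here is roughly the shape of Get-OctopusToolsExecutable (a sketch under assumptions, not the actual implementation; the paths are placeholders):

function Get-OctopusToolsExecutable
{
    # Look in the known tools location first, and only download if Octo.exe is missing.
    $toolsDirectory = "$repositoryRoot\tools\packages"
    $existing = Get-ChildItem -Path $toolsDirectory -Filter "Octo.exe" -Recurse -ErrorAction SilentlyContinue | Select-Object -First 1
    if ($existing -ne $null) { return $existing }

    # Pull down the OctopusTools NuGet package, which contains Octo.exe.
    & "$repositoryRoot\tools\nuget.exe" install OctopusTools -OutputDirectory $toolsDirectory | Write-Verbose

    return Get-ChildItem -Path $toolsDirectory -Filter "Octo.exe" -Recurse | Select-Object -First 1
}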

The rest of the Octopus functions look very similar.

The End

I would love to be able to point you towards a Github repository containing the sum total of all that I have written about with regards to environment setup, deployment, publishing, etc, but I can’t. The first reason is that it’s very much tied to our specific requirements, and stripping out the sensitive pieces would take me too long. The second reason is that everything is not quite encapsulated in a single repository, which disappoints me greatly. There are some pieces in TeamCity, some in Octopus Deploy, with the rest in the repository. Additionally, the process is dependent on a few external components, like Octopus Deploy. This makes me uncomfortable actually, as I like for everything related to an application to be self contained, ideally within the one repository, including build, deployment, environment setup, etc.

Ignoring the fact that I can’t share the detail of the environment setup (sorry!), I consider this whole adventure to be a massive success.

This is the first time that I’ve felt comfortable with the amount of control that has gone into the setup of an environment. It’s completely repeatable, and even has some allowance for creating special, one-off environments for specific purposes. Obviously there would be a number of environments that are long lived (CI, Staging and Production at least) but the ability to create a temporary environment in a completely automated fashion, maybe to do load testing or something similar, is huge for me.

Mostly I just like the fact that the entire environment setup is codified, written into scripts and configuration files that can then be versioned and controlled via some sort of source control (we use Git, you should too). This gives a lot of traceability to the whole process, especially when something breaks, and you never have to struggle with trying to understand exactly what has gone into the environment. It’s written right there.

For me personally, a lot of the environment setup was completely new, including CloudFormation, Powershell DSC and Octopus Deploy, so this was a fantastic learning experience.

I assume that I will continue to gravitate towards this sort of DevOps work in the future, and I’m not sad about this at all. It’s great fun, if occasionally frustrating.


You might have noticed a pattern in my recent posts. They’re all about build scripts, automation, AWS and other related things. It seems that I have fallen into a dev-ops role. Not officially, but it’s basically all I’ve been doing for the past few months.

I’m not entirely sure how it happened. A little automation here, a little scripting there. I see an unreliable manual process and I want to automate it to make it reproducible.

The weird thing is, I don’t really mind. I’m still solving problems, just different ones. It feels a little strange, but it’s nice to have your client/end-user be a technical person (i.e. a fellow programmer) instead of the usual business person with only a modicum of technical ability.

I’m not sure how my employer feels about it, but they must be okay with it, or surely someone would have pulled me aside and asked some tough questions. I’m very vocal about what I’m working on and why, so it’s not like I’m just quietly doing the wrong work in the background without making a peep.

Taking into account the above comments, it’s unsurprising then that this blog post will continue on in the same vein as the last ones.

Walking Skeletons are Scary

As I mentioned at the end of my previous post, we’ve started to develop a web API to replace a database that was being directly accessed from a mobile application. We’re hoping this will tie us less to the specific database used, and allow us some more control over performance, monitoring, logging and other similar things.

Replacing the database is something that we want to do incrementally though, as we can’t afford to develop the API all at once and then just drop it in. That’s not smart; it just leads to issues with the integration at the end.

No, we want to replace the direct database access bit by bit, giving us time to adapt to any issues that we encounter.

In Growing Object Oriented Software Guided By Tests, the authors refer to the concept of a walking skeleton. A walking skeleton is when you develop the smallest piece of functionality possible, and focus on sorting out the entire delivery chain in order to allow that piece of functionality to be repeatably built and deployed, end-to-end, without human interaction. This differs from the approach I’ve commonly witnessed, where teams focus on getting the functionality together and then deal with the delivery closer to the “end”, often leading to integration issues and other unforeseen problems, like certificates!

Its always certificates.

The name comes from the fact that you focus on getting the framework up and running (the bones) and then flesh it out incrementally (more features and functionality).

Our goal was to be able to reliably and automatically publish the latest build of the API to an environment dedicated to continuous integration. A developer would push some commits to a specified branch (master) in BitBucket and it would be automatically built, packaged and published to the appropriate environment, ready for someone to demo or test, all without human interaction.

A Pack of Tools

Breaking the problem down, we identified four main chunks of work: automatically building, packaging the application for deployment, actually deploying it (and tracking the versions deployed, so some form of release management) and setting up the actual environment that would receive the deployment.

The build problem is already solved, as we use TeamCity. The only difference from some of our other TeamCity builds would be that the entire build process would be encapsulated in a Powershell script, so that we can keep it in version control and run it separately from TeamCity if necessary. I love what TeamCity is capable of, but I’m always uncomfortable when there is so much logic about the build process separate from the actual source. I much prefer to put it all in the one place, aiming towards the ideal of “git clone, build” and it just works.

We can use the same tool for both packaging and deployment, Octopus Deploy. Originally we were going to use NuGet packages to contain our application (created via NuGet.exe), but we’ve since found that it’s much better to use Octopack to create the package, as it structures the internals in a way that makes it easy for Octopus Deploy to deal with.

Lastly we needed an environment that we could deploy to using Octopus, and this is where the meat of my work over the last week and a bit actually occurs.

I’ve setup environments before, but I’ve always been uncomfortable with the manual process by which the setup usually occurs. You might provision a machine (virtual if you are lucky) and then spend a few hours manually installing and tweaking the various dependencies on it so your application works as you expect. Nobody ever documents all the things that they did to make it work, so you have this machine (or set of machines) that lives in this limbo state, where no-one is really sure how it works, just that it does. Mostly. God help you if you want to create another environment for testing or if the machine that was so carefully configured burns down.

This time I wanted to do it properly. I wanted to be able to, with the execution of a single script, create an entire environment for the API, from scratch. The environment would be regularly torn down and rebuilt, to ensure that we can always create it from scratch and we know exactly how it has been configured (as described in the script). A big ask, but more than possible with some of the tools available today.

Enter Amazon CloudFormation.

Cloud Pun

Amazon is a no brainer at this point for us. It’s where our experience as an organisation lies and it’s what I’ve been doing a lot of recently. There are obviously other cloud offerings out there (hi Azure!), but it’s better to stick with what you know unless you have a pressing reason to try something different.

CloudFormation is another service offered by Amazon (like EC2 and S3), allowing you to leverage template files written in JSON that describe in detail the components of your environment and how its disparate pieces are connected. It’s amazing and I wish I had known about it earlier.

In retrospect, I’m kind of glad I didn’t know about it earlier; by using the EC2 and S3 services directly (and all the bits and pieces that they interact with) I have gained enough understanding of the basic components to know how to fit them together in a template effectively. If I had started with CloudFormation I probably would have been overwhelmed. It was overwhelming enough with the knowledge that I did have; I can’t imagine what it would be like to hit CloudFormation from nothing.

Each CloudFormation template consists of some set of parameters (names, credentials, whatever), a set of resources and some outputs. Each resource can refer to other resources as necessary (like an EC2 instance referring to a Security Group) and you can setup dependencies between resources as well (like A must complete provisioning before B can start). The outputs are typically something that you want at the end of the environment setup, like a URL for a service or something similar.

I won’t go into detail about the template that I created (it’s somewhat large), but I will highlight some of the important pieces that I needed to get working in order for the environment to fit into our walking skeleton. I imagine that the template will need to be tweaked and improved as we progress through developing the API, but that’s how incremental development works. For now it’s simple enough: a Load Balancer, Auto Scaling Group and a machine definition for the instances in the Auto Scaling Group (along with some supporting resources, like security groups and wait handles).

This Cloud Comes in Multiple Parts

This blog post is already 1300+ words, so it’s probably a good idea to cut it in two. I have a habit of writing posts that are too long, so this is my attempt to get that under control.

Next time I’ll talk about Powershell Desired State Configuration, deploying dependencies to be accessed by instances during startup, automating deployments with Octopus and many other wondrous things that I still don’t quite fully understand.