
With all of the general context and solution outlining done for now, it's time to delve into some of the details. Specifically, the build/test/deploy pipeline for the log stack environments.

Unfortunately, we use the term environment to describe two things. The first is an Octopus environment, which is basically a grouping construct inside Octopus Deploy, like CI or prod-green. The second is a set of infrastructure intended for a specific purpose, like an Auto Scaling Group and Load Balancer intended to host an API. In the case of the log stack, we have distinct environments for the infrastructure of each layer, like the Broker and the Indexer.

Our environments are all conceptually similar: there is a Git repository that contains everything necessary to create or update the infrastructure (CloudFormation templates, Powershell scripts, etc), along with the logic for what it means to build and validate a Nuget package that can be used to manage the environment. The repository is hooked up to a Build Configuration in TeamCity which runs the build script, and the resulting versioned package is uploaded to our Nuget server. The package is then used in TeamCity via other Build Configurations to allow us to Create, Delete, Migrate and otherwise interact with the environment in question.

The creation of this process has happened in bits and pieces over the last few years, most of which I’ve written about on this blog.

It's a decent system, and I'm proud of how far we've come and how much automation is now in place, but it's certainly not without its flaws.

Bestial Rage

The biggest problem with the current process is that while the environment is fully encapsulated as code inside a validated and versioned Nuget package, actually using that package to create or delete an environment is not as simple as it could be. As I mentioned above, we have a set of TeamCity Build Configurations for each environment that allow for the major operations like Create and Delete. If you’ve made changes to an environment and want to deploy them, you have to decide what sort of action is necessary (i.e. “it’s the first time, Create” or “it already exists, Migrate”) and “run” the build, which will download the package and run the appropriate script.

This is where it gets a bit onerous, especially for production. If you want to change any of the environment parameters from the default values the package was built with, you need to provide a set of parameter overrides when you run the build. For production, this means you often end up overriding everything (because production is a separate AWS account), which can be upwards of 10 different parameters, all of which are only visible if you go and look at the source CloudFormation template. You have to do this every time you want to execute that operation (although you can copy the parameters from previous runs, which acts as a small shortcut).

The issue with this is that it means production deployments become vulnerable to human error, which is one of the things we’re trying to avoid by automating in the first place!

Another issue is that we lack a true “Update” operation. We only have Create, Delete, Clone and Migrate.

This is entirely my fault, because when I initially put the system together I had a bad experience with the CloudFormation Update command where I accidentally wiped out an S3 bucket containing customer data. As is often the case, that fear then led to an alternate (worse) solution involving cloning, checking, deleting, cloning, checking and deleting (in that order). This was safer, but incredibly slow and prone to failure.

The existence of these two problems (hard to deploy, slow failure-prone deployments) is reason enough for me to consider exploring alternative approaches for the log stack infrastructure.

Fantastic Beasts And Where To Find Them

The existing process does do a number of things well though, and has:

  • A Nuget package that contains everything necessary to interact with the environment.
  • Environment versioning, because that’s always important for traceability.
  • Environment validation via tests executed as part of the build (when possible).

Keeping those three things in mind, and combining them with the desire to ease the actual environment deployment, an improved approach looks a lot like our typical software development/deployment flow.

  1. Changes to environment are checked in
  2. Changes are picked up by TeamCity, and a build is started
  3. Build is tested (i.e. a test environment is created, validated and destroyed)
  4. Versioned Nuget package is created
  5. Package is uploaded to Octopus
  6. Octopus Release is created
  7. Octopus Release is deployed to CI
  8. Secondary validation (i.e. test CI environment to make sure it does what it’s supposed to do after deployment)
  9. [Optional] Propagation of release to Staging

In comparison to our current process, the main difference is the deployment. Prior to this, we were treating our environments as libraries (i.e. they were built, tested, packaged and uploaded to MyGet to be used by something else). Now we’re treating them as self-contained deployable components, responsible for knowing how to deploy themselves.
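
To make that a bit more concrete, here is a rough sketch of what the packaging and publishing tail of such a build script might look like. It's illustrative only, not lifted from our actual build scripts: $repositoryRoot, $octopusServerUrl, $octopusApiKey, the project/package names and the exact command line switches are all placeholders and may differ depending on your versions of nuget.exe and octo.exe.

$version = "1.0.$env:BUILD_NUMBER";

# Package up the templates, scripts and deployment logic for the environment
& nuget.exe pack "$repositoryRoot\logstash-broker-environment.nuspec" -Version $version -OutputDirectory "$repositoryRoot\build-output"

# Push the package to Octopus, then create a release and deploy it to the CI environment
& octo.exe push --package "$repositoryRoot\build-output\logstash-broker-environment.$version.nupkg" --server $octopusServerUrl --apiKey $octopusApiKey
& octo.exe create-release --project "Logstash Broker Environment" --version $version --packageversion $version --deployto "CI" --server $octopusServerUrl --apiKey $octopusApiKey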

With the approach settled, all that’s left is to come up with an actual deployment process for an environment.

Beast Mastery

There are two main cases we need to take care of when deploying a CloudFormation stack to an Octopus environment.

The first case is what to do when the CloudFormation stack doesn’t exist.

This is the easy case: all we need to do is execute New-CFNStack with the appropriate parameters and then wait for the stack to finish.

The second case is what we should do when the CloudFormation stack already exists, which is the case that is not particularly well covered by our current environment management process.

Luckily, CloudFormation makes this relatively easy with the Update-CFNStack command. Updates are dangerous (as I mentioned above), but if you’re careful with resources that contain state, they are pretty efficient. The implementation of the update is quite smart as well, and will only update the things that have changed in the template (i.e. if you’ve only changed the Load Balancer, it won’t recreate all of your EC2 instances).

The completed deployment script is shown in full below.

[CmdletBinding()]
param
(

)

$here = Split-Path $script:MyInvocation.MyCommand.Path;
$rootDirectory = Get-Item ($here);
$rootDirectoryPath = $rootDirectory.FullName;

$ErrorActionPreference = "Stop";

$component = "unique-stack-name";

if ($OctopusParameters -ne $null)
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.octopus"));

    $awsKey = $OctopusParameters["AWS.Deployment.Key"];
    $awsSecret = $OctopusParameters["AWS.Deployment.Secret"];
    $awsRegion = $OctopusParameters["AWS.Deployment.Region"];
}
else 
{
    $parameters = ConvertFrom-StringData ([System.IO.File]::ReadAllText("$here/cloudformation.parameters.local"));
    
    $path = "C:\creds\credentials.json";
    Write-Verbose "Attempting to load credentials (AWS Key, Secret, Region, Octopus Url, Key) from local, non-repository stored file at [$path]. This is done this way to allow for a nice development experience in vscode"
    $creds = ConvertFrom-Json ([System.IO.File]::ReadAllText($path));
    $awsKey = $creds.aws."aws-account".key;
    $awsSecret = $creds.aws."aws-account".secret;
    $awsRegion = $creds.aws."aws-account".region;

    $parameters["OctopusAPIKey"] = $creds.octopus.key;
    $parameters["OctopusServerURL"] = $creds.octopus.url;
}

$parameters["Component"] = $component;

$environment = $parameters["OctopusEnvironment"];

. "$here/scripts/common/Functions-Aws.ps1";
. "$here/scripts/common/Functions-Aws-CloudFormation.ps1";

Ensure-AwsPowershellFunctionsAvailable

$tags = @{
    "environment"=$environment;
    "environment:version"=$parameters["EnvironmentVersion"];
    "application"=$component;
    "function"="logging";
    "team"=$parameters["team"];
}

$stackName = "$environment-$component";

$exists = Test-CloudFormationStack -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -StackName $stackName

$cfParams = ConvertTo-CloudFormationParameters $parameters;
$cfTags = ConvertTo-CloudFormationTags $tags;

# Common arguments for New-CFNStack/Update-CFNStack, supplied via splatting.
# Named $cfnArgs rather than $args to avoid shadowing the automatic $args variable.
$cfnArgs = @{
    StackName=$stackName;
    TemplateBody=[System.IO.File]::ReadAllText("$here\cloudformation.template");
    Parameters=$cfParams;
    Tags=$cfTags;
    AccessKey=$awsKey;
    SecretKey=$awsSecret;
    Region=$awsRegion;
    Capabilities="CAPABILITY_IAM";
};

if ($exists)
{
    Write-Verbose "The stack [$stackName] exists, so I'm going to update it. Its better this way"
    $stackId = Update-CFNStack @args;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::UPDATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::UPDATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Updated";
}
else 
{
    Write-Verbose "The stack [$stackName] does not exist, so I'm going to create it. Just watch me"
    $cfnArgs.Add("DisableRollback", $true);
    $stackId = New-CFNStack @cfnArgs;

    $desiredStatus = [Amazon.CloudFormation.StackStatus]::CREATE_COMPLETE;
    $failingStatuses = @(
        [Amazon.CloudFormation.StackStatus]::CREATE_FAILED,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_IN_PROGRESS,
        [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE
    );
    Wait-CloudFormationStack -StackName $stackName -DesiredStatus $desiredStatus -FailingStates $failingStatuses -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    Write-Verbose "Stack [$stackName] Created";
}

Other than the Create/Update logic that I’ve already talked about, the only other interesting thing in the deployment script is the way that it deals with parameters.

Basically, if the script detects that it's being run from inside Octopus Deploy (via the presence of an $OctopusParameters variable), it will load all of its parameters (as a hashtable) from a particular local file. This file leverages the Octopus variable substitution feature, so that when we deploy the infrastructure to the various environments, it gets the appropriate values (like a different VPC, because prod is a separate AWS account to CI). When it's not running in Octopus, it just uses a different file, structured very similarly, with test/scratch values in it.
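
To illustrate, the Octopus flavoured parameters file is just a set of key=value pairs (which is why ConvertFrom-StringData is enough to parse it), with Octopus variable substitution markers for anything that changes per environment. The keys below are the ones the script actually uses, but the substituted variable names (other than the built-in Octopus.Environment.Name) are hypothetical:

OctopusEnvironment=#{Octopus.Environment.Name}
EnvironmentVersion=#{EnvironmentVersion}
team=#{Team}
OctopusAPIKey=#{LoggingOctopusApiKey}
OctopusServerURL=#{OctopusServerUrl}

The .local file is structured the same way, just with literal scratch values instead of substitution markers (and without the Octopus specific keys, which the script fills in from the local credentials file).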

With the deployment script in place, we plug the whole thing into our existing “deployable” component structure and we have automatic deployment of tested, versioned infrastructure via Octopus Deploy.

Conclusion

Of course, being a first version, the deployment logic that I’ve described above is not perfect. For example, there is no support for deploying to an environment where the stack is in error (failed stacks can’t be updated, but they already exist, so you have to delete them and start again), and there is little to no feedback available if a stack creation/update fails for some reason.
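
For what it's worth, the failed stack case doesn't look like it would be particularly difficult to handle. Here is a minimal sketch (an assumption about how we might do it, not something that exists yet) that could slot in just before the create/update branch of the script above: if the stack exists but is in a state that CloudFormation will never let us update, delete it and fall through to creation.

$unrecoverable = @(
    [Amazon.CloudFormation.StackStatus]::CREATE_FAILED,
    [Amazon.CloudFormation.StackStatus]::ROLLBACK_COMPLETE,
    [Amazon.CloudFormation.StackStatus]::UPDATE_ROLLBACK_FAILED
);

if ($exists)
{
    $stack = Get-CFNStack -StackName $stackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion;
    if ($unrecoverable -contains $stack.StackStatus)
    {
        Write-Verbose "The stack [$stackName] is in state [$($stack.StackStatus)], which can't be updated, so I'm going to delete it and recreate it";
        Remove-CFNStack -StackName $stackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion -Force;

        # Poll until the stack is actually gone (Get-CFNStack throws once it no longer exists)
        while ($true)
        {
            try { Get-CFNStack -StackName $stackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion | Out-Null; Start-Sleep -Seconds 30; }
            catch { break; }
        }

        $exists = $false;
    }
}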

Additionally, the code could benefit from being extracted to a library for reuse.

All in all, the deployment process I just described is a lot simpler than the one I described at the start of this post, and it's managed by Octopus, which makes it consistent with the way that we do everything else, which is nice.

With a little bit more polish, and some pretty strict usage of the CloudFormation features that stop you accidentally deleting databases full of valuable data, I think it will be a good replacement for what we do now.


Way back in March 2015, I wrote a few posts explaining how we set up our log aggregation. I’ve done a lot of posts since then about logging in general and about specific problems we’ve encountered in various areas, but I’ve never really revisited the infrastructure underneath the stack itself.

The main reason for the lack of posts is that the infrastructure I described back then is not the infrastructure we’re using now. As we built more things and started pushing more and more data into the stack we had, we began to experience some issues, mostly related to the reliability of the Elasticsearch process. At the time, the organization decided that it would be better if our internal operations team were responsible for dealing with these issues, and they built a new stack as a result.

This was good and bad. The good part was the argument that if we didn’t have to spend our time building and maintaining the system, we would theoretically have more time and brainspace to focus on actual software development. It was bad for almost exactly the same reason: problems with the service would need to be resolved by a different team, one with their own set of priorities and their own schedule.

The arrangement worked okay for a while, until the operations team were engaged on a relatively complex set of projects and no longer had the time to extend and maintain the log stack as necessary. They did their best, but with no resources dedicated to dealing with the maintenance on the existing stack, it started to degrade surprisingly quickly.

This came to a head when we had a failure in the new stack that required us to replace some EC2 instances via the Auto Scaling Group, and the operations team was unavailable to help. When we executed the scaling operation, we discovered that it was creating instances that didn’t actually have all of the required software set up in order to fulfil their intended role. At some point in the past someone had manually made changes to the instances already in service and these changes had never been captured in the infrastructure as code.

After struggling with this for a while, we decided to reclaim the stack and make it our responsibility again.

Beast Mode

Architecturally, the new log stack was a lot better than the old one, even taking into account the teething issues that we did have.

The old stack was basically an Auto Scaling Group capable of creating EC2 instances with Elasticsearch, Logstash and Kibana, along with a few load balancers for access purposes. While the old stack could theoretically scale out to better handle load, we never really tested that capability in production, and I’m pretty sure it wouldn’t have worked (looking back, I doubt the Elasticsearch clustering was set up correctly, and there were some other issues with the way the Logstash indexes were being configured).

The new stack looked a lot like the reference architecture described on the Logstash website, which was good, because those guys know their stuff.

At a high level, log events would be shipped from many different places to a Broker layer (auto scaling Logstash instances behind a Load Balancer) which would then cache those events in a queue of some description (initially RabbitMQ, later Redis). An Indexer layer (auto scaling Logstash instances) would pull events off the queue at a sustainable pace, process them and place them into Elasticsearch. Users would then use Kibana (hosted on the Elasticsearch instances for ease of access) to interact with the data.

There are a number of benefits to the architecture described above, but a few of the biggest ones are:

  • It's not possible for an influx of log events to shut down Elasticsearch, because the Indexer layer is pulling events out of the cache at a sustainable rate. The cache might start to fill up if the number of events rises, but we’ll still be able to use Elasticsearch.
  • The cache provides a buffer if something goes wrong with either the Indexer layer or Elasticsearch. We had some issues with Elasticsearch crashing in our old log stack, so having some protection against losing log events in the event of downtime was beneficial.

There were downsides as well, the most pertinent of which was that the new architecture was a lot more complicated than the old one, with a lot of moving parts. This made it harder to manage and understand, and increased the number of different ways in which it could break.

Taking all of the above into account, when we reclaimed the stack we decided to keep the architecture intact, and just improve it.

But how?

Beast-Like Vigour

The way in which the stack was described was not bad. It just wasn’t quite as controlled as the way we’d been handling our other environments, mostly as a result of being created/maintained by a different team.

Configuration for the major components (Logstash Broker, Logstash Indexer, Elasticsearch, Kibana) was source controlled, with builds in TeamCity, and deployment was handled by pushing Nuget packages through Octopus. This was good, and wouldn’t require much work to bring into line with the rest of our stuff. All we would have to do was ensure all of the pertinent deployment logic was encapsulated in the Git repositories and maybe add some tests.

The infrastructure needed some more effort. It was all defined using CloudFormation templates, which was excellent, but there was no build/deployment pipeline for the templates and they were not versioned. In order to put such a pipeline in place, we would need to have CI and Staging deployments of the infrastructure as well, which did not yet exist. The infrastructure definition for each layer also shared a repository with the relevant configuration (i.e. Broker Environment with Broker Config), which was against our existing patterns. Finally, the cache/queue layer did not have an environment definition at all, because the current one (ElastiCache with Redis) had been manually created to replace the original one (RabbitMQ) as a result of some issues where the cache/queue filled up and then became unrecoverable.

In addition to the above, once we’ve improved all of the processes and got everything under control, we need to work on fixing the actual bugs in the stack (like Logstash logs filling up disks, Elasticsearch mapping templates not being set up correctly, no alerting/monitoring on the various layers, etc). Some of these things will probably be fixed as we make the process improvements, but others will require dedicated effort.

To Be Continued

With the total scope of the work laid out (infrastructure build/deploy, clean up configuration deployment, re-create infrastructure in appropriate AWS accounts, fix bugs), it's time to get cracking.

The first cab off the rank is the work required to create a process that will allow us to fully automate the build and deployment of the infrastructure. Without that sort of system in place, we would have to do all of the other things manually.

The Broker layer is the obvious starting point, so next week I’ll outline how we went about using a combination of TeamCity, Nuget, Octopus and Powershell to accomplish a build and deployment pipeline for the infrastructure.


Like with all software, it's rare to ever actually be done with something. A few weeks back I wrote at length about the data synchronization algorithm we use to free valuable and useful data from its on-premises prison, to the benefit of both our clients (for new and exciting applications) and us (for statistical analysis).

Conceptually, the process leveraged an on-premises application, a relatively simple API and a powerful backend data store to accomplish its goals, along with the following abstracted algorithm (which I went into in depth in the series of blog posts that I linked above).

Get Local Version
Get Remote Version
If Local == Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local > Remote
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local < Remote
    Find Remote > Local Version
    Delete from Remote

One of the issues with the algorithm above is that it's pretty chatty when it comes to talking to the remote API. It is polling based, so that’s somewhat to be expected, but there are a lot of requests and responses being thrown around, which seems like a prime opportunity for improvement.

To give some context:

  • We have approximately 3500 unique clients (each one representing a potential data synchronization)
  • Of that 3500, approximately 2200 clients are actively using the synchronization
  • In order to service these clients, the API deals with approximately 450 requests a second

Not a ground-shaking amount of traffic, but if we needed to service the remainder of our clients in the same way, we’d probably have to scale out to deal with it. Scaling out when you use AWS is pretty trivial, but because the amount of traffic in play is also overloading our log aggregation (our ELK stack), there are other factors to consider.

Digging into the traffic a bit (using our awesome logging), it looks like the majority of the requests are GET requests.

The following Kibana visualization shows a single day's traffic, aggregated over time/HTTP verb. You can clearly see the increase in the amount of non-GET requests during the day as clients make changes to their local database, but the GET traffic dwarfs it.

If we want to reduce the total amount of traffic, attacking the GET requests seems like a sane place to start. But maybe we could just reduce the traffic altogether?

Frequency Overload

The plugin architecture that schedules and executes the application responsible for performing the data synchronization has a cadence of around 60 seconds (with support for backoff in the case of errors). It is smart enough to be non-re-entrant (meaning it won’t start another process if one is already running), but it has some weaknesses that push the overall cadence closer to 5 minutes, with even longer gaps when multiple databases are registered (because each database runs its own application, but the number running in parallel is limited).

One easy way to reduce the total amount of traffic is to simply slow the cadence down, drawing it out to at least 10 minutes between runs.

The downside of this is that it increases the amount of latency between local changes being made and them being represented on the remote database.

Being that one of the goals for the sync process was to minimise this latency, simply reducing the cadence in order to decrease traffic is not a good enough solution.

Just Stop Talking, Please

If we look at the GET traffic in more detail we can see that it is mostly GET requests to two endpoints.

/v1/customers/{customer-number}/databases/{database-id}/tables/{table-name}/

/v1/customers/{customer-number}/databases/{database-id}/tables/{table-name}/manifest

These two endpoints form the basis for two different parts of the sync algorithm.

The first endpoint is used to get an indication of the status for the entire table for that customer/database combination. It returns a summary of row count, maximum row version and maximum timestamp. This information is used in the algorithm above in the part where it executes the “Get remote version” statement. It is then compared to the same information locally, and the result of that comparison is used to determine what to do next.

The second endpoint is used to get a summarised chunk of information from the table, using a from and to row version. This is used in the sync algorithm to perform the diff check whenever the local and remote versions (the other endpoint) are the same.

What this means is that every single run of every application is guaranteed to hit the first endpoint for each table (to get the remote version) and pretty likely to hit the second endpoint (because the default action is to engage on the diff check whenever local and remote versions are the same).

The version comparison is flawed though. It only takes into account the maximum row version and maximum timestamp for its decision making, ignoring the row count altogether (which was there historically for informational purposes). The assumption here was that we wanted to always fall back to scanning through the table for other changes using the differencing check, so if our maximum version/timestamps are identical that’s what we should do.

If we use the row count though, we can determine if the local and remote tables are completely in sync, allowing us to dodge a large amount of work. Being that all updates will be covered by the row version construct and all deletions will be covered by the row count changing, we should be in a pretty good place to maintain the reliability of the sync.

1 Row, 2 Rows, 3 Rows, Ah Ah Ah!

The naive thing to do would be to get the current local count/version/timestamp and the local count/version/timestamp from last time and compare them (including the row count). If they are the same, we don’t need to do anything! Yay!

This fails to take into account the state of the remote though, and the nature of the batching process. While there might not be any changes locally since last time, we might not have actually pushed all of the changes from last time to the remote.

Instead, what we can do is compare the local count/version/timestamp with the last remote count/version/timestamp. If they are the same, we can just do nothing because both are completely in sync.

Editing the algorithm definition from the start of this post, we get this:

Get Local Count/Version/Timestamp
Get Last Remote Count/Version/Timestamp
If Local Count/Version/Timestamp == Last Remote Count/Version/Timestamp
    Do Nothing and Exit
Get Remote Count/Version/Timestamp
Store Remote Count/Version/Timestamp For Lookup Next Time
If Local Count/Version/Timestamp == Remote Count/Version/Timestamp
    Do Nothing and Exit
If Local Version/Timestamp == Remote Version/Timestamp BUT Local Count != Remote Count
    Calculate [BATCH SIZE] Using Historical Data
    Get Last Local Position
    Get Next [BATCH SIZE] Local Rows from last position
    Get Min & Max Version in Batch
    Query Remote for Manifest Between Min/Max Local Version
    Create Manifest from Local Batch
    Compare
        Find Remote Not in Local
            Delete from Remote
        Find Local Not in Remote
            Upload to Remote
If Local Version/Timestamp > Remote Version/Timestamp
    Calculate [BATCH SIZE] Using Historical Data
    Get Next [BATCH SIZE] Local Rows > Remote Version
    Upload to Remote
        Record Result for [BATCH SIZE] Tuning
        If Failure & Minimum [BATCH SIZE], Skip Ahead
If Local Version/Timestamp < Remote Version/Timestamp
    Find Remote > Local Version
    Delete from Remote

The other minor change in there is comparing the full local count/version/timestamp against the remote count/version/timestamp. If they are identical, it's just another case where we need to do nothing, so we can exit safely until next time.
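
To make the short-circuiting logic a little more concrete, here is a minimal sketch of the decision making. The real synchronization application is not Powershell and none of the names below come from it; this is purely an illustration of the comparisons described above.

function Get-SyncAction
{
    param
    (
        $local,       # current local count/version/timestamp
        $lastRemote,  # remote count/version/timestamp recorded at the end of the last run
        $getRemote    # script block that queries the remote table summary endpoint
    )

    # Cheapest check first: if nothing has changed locally since the remote was last known
    # to be identical, there is no need to even talk to the API.
    if (($local.Count -eq $lastRemote.Count) -and ($local.Version -eq $lastRemote.Version) -and ($local.Timestamp -eq $lastRemote.Timestamp))
    {
        return "DoNothing";
    }

    # Otherwise get the current remote summary and store it for next time (storage not shown).
    $remote = & $getRemote;

    if (($local.Count -eq $remote.Count) -and ($local.Version -eq $remote.Version) -and ($local.Timestamp -eq $remote.Timestamp)) { return "DoNothing"; }
    if (($local.Version -eq $remote.Version) -and ($local.Timestamp -eq $remote.Timestamp)) { return "RunDifferenceCheck"; }
    if ($local.Version -gt $remote.Version) { return "UploadNextBatch"; }
    return "DeleteNewerRemoteRows";
}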

Conclusion

Just how much of a difference does this make though? I'll let a picture of a partial day's traffic answer that for me.

In the image below I’ve specifically set the scale of the graph to be the same as the one above for comparison purposes.

As you can see, the traffic rises from nothing (because nothing is changing overnight) to a very acceptable amount representing real work that needs to be done during the day, and it will probably continue with the same pattern, flattening back into nothing as the day winds down and people stop making changes.

It's a ridiculous decrease in the amount of pointless noise traffic, which is a massive victory.

Thinking about this sort of thing in more general terms, optimization is an important step in the production and maintenance of any piece of software, but it's important not to start down this path too early. You should only do it after you gather enough information to justify it, and to pinpoint exactly where the most impact will be made for the least effort. The last thing you want to do is spend a week or two chasing something you think is a terrible inefficiency, only to discover that it makes less than a one percent difference to the system as a whole.

The most efficient way to do this sort of analysis is with good metrics and logging.

You’d be crazy not to do it.


I use Packer on and off. Mostly I use it to make Amazon Machine Images (AMIs) for our environment management packages, specifically by creating Packer templates that operate on top of the Amazon supplied Windows Server images.

You should never use an Amazon supplied Windows Server AMI in your Auto Scaling Group Launch Configurations. These images are regularly retired, so if you’ve taken a dependency on one, there is a good chance it will disappear just when you need it most. Like when you need to auto-scale your API cluster because you’ve unknowingly burnt through all of the CPU credits you had on the machines slowly over the course of the last few months. What you should do is create an AMI of your own from the ones supplied by AWS so you can control its lifetime. Packer is a great tool for this.

A Packer template is basically a set of steps to execute on a virtual machine of some sort, where the core goal is to take some sort of baseline thing, apply a set of steps to it programmatically and end up with some sort of reusable thing out the other end. Like I mentioned earlier, we mostly deal in AWS AMIs, but it can do a bunch of other things as well (VMware, Docker, etc).

The main benefit of using a Packer template for this sort of thing (instead of just doing it all manually) is reproducibility. Specifically, if you built your custom image using the AWS AMI for Windows Server 2012 6 months ago, you can go and grab the latest one from yesterday (with all of the patches and security upgrades), execute your template on it and you’ll be in a great position to upgrade all of the existing usages of your old custom AMI with minimal effort.

When using Packer templates though, you need to be cognizant of how errors are dealt with. Specifically:

Step failures appear to be indicated entirely by the exit code of the tool used in the step.

I’ve been bitten by this on two separate occasions.

A Powerful Cry For Help

Packer has much better support for Windows than it once did, but even taking that into account, Powershell steps can still be a troublesome beast.

The main issue with the Powershell executable is that if an error or exception occurs and terminates the process (i.e. it's a terminating error or you have ErrorActionPreference set to Stop), the Powershell process itself still exits with zero.

In a sane world, an exit code of zero indicates success, which is what Packer expects (and most other automation tools like TeamCity/Octopus Deploy).

If you don’t take this into account, your Powershell steps may fail but the Packer execution will still succeed, giving you an artefact that hasn’t been configured the way it should have been.

Packer is pretty configurable though, and is very clear about the command that it uses to execute your Powershell steps. The great thing is, it also enables you to override that command, so you can customise your Powershell steps to exit with a non-zero code if an error occurs without actually having to change every line in your step to take that sort of thing into account.

Take this template excerpt below, which uses Powershell to set the timezone of the machine and turn off negative DNS result caching.

{
    "type": "powershell",
    "inline": [
        "tzutil.exe /s \"AUS Eastern Standard Time_dstoff\"",
        "[Microsoft.Win32.Registry]::SetValue('HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\Dnscache\\Parameters','NegativeCacheTime',0,[Microsoft.Win32.RegistryValueKind]::DWord)"
    ],
    "execute_command": "powershell -Command \"$ErrorActionPreference = 'Stop'; try { & '{{.Path}}' } catch { Write-Warning $_; exit 1; } \""
}

The “execute_command” is the customisation, providing error handling for exceptions that occur during the execution of the Powershell snippet. Packer will take the lines in that inline array, combine them into a script file on the machine being set up (using WinRM) and then execute that file using the command you specify. The {{.Path}} syntax is the Packer variable substitution and specifically refers to the path on the virtual machine that Packer has copied the script to. With this custom command in place, you have a much better chance of catching errors in your Powershell commands before they come back to bite you later on.

So Tasty

In a similar vein to the failures with Powershell above, be careful when doing package installs via yum on Linux.

The standard “yum install” command will not necessarily exit with a non-zero code when a package fails to install. Sometimes it will, but if a package couldn’t be found (maybe you misconfigured the repository or something) it still exits with zero.

This can throw a pretty big spanner in the works when you’re expecting your AMI to have Elasticsearch on it (for example) and it just doesn’t because the package installation failed but Packer thought everything was fine.

Unfortunately, there is no easy way to get around this like there is for the Powershell example above, but you can mitigate it by just adding an extra step after your package install that validates the package was actually installed.

{
    "type" : "shell",
    "inline" : [
        "sudo yum remove java-1.7.0-openjdk -y",
        "sudo yum install java-1.8.0 -y",
        "sudo yum update -y",
        "sudo sh -c 'echo \"[logstash-5.x]\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"name=Elastic repsitory for 5.x packages\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"baseurl=https://artifacts.elastic.co/packages/5.x/yum\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgcheck=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo sh -c 'echo \"enabled=1\" >> /etc/yum.repos.d/logstash.repo'",
        "sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch",
        "sudo yum install logstash-5.2.2 -y",
        "sudo rpm --query logstash-5.2.2"
    ]
}

In the example above, the validation is the rpm --query command after the yum install. It will return a non-zero exit code (and thus fail the Packer execution) if the package with that version is not installed.

Conclusion

Packer is an incredibly powerful automation tool for dealing with a variety of virtual machine platforms and I highly recommend using it.

If you’re going to use it though, you need to understand what failure means in your specific case, and you need to take that into account when you decide how to signal to the Packer engine that something isn’t right.

For me, I prefer to treat every error as critical, because I prefer to deal with them at the time the AMI is created, rather than 6 months later when I try to use the AMI and can’t figure out why the Windows Firewall on an internal API instance is blocking requests from its ELB. Not that that has ever happened of course.

In order to accomplish this lofty goal of dealing with errors ASAP, you need to understand how each one of your steps (and the applications and tools they use) communicates failure, and then make sure they all communicate that appropriately in a way Packer can understand.

Understanding how to deal with failure is useful outside Packer too.