In my quest to automate all the things, I’ve failed a lot.

A lot.

Typically the failures are configuration related (forgot to do X, didn’t line up the names of Y and Z properly, accidentally used an uninitialized variable (Powershell is pretty “flexible”)), but sometimes they are just transient. The cloud can be like that, and you need to make sure you build with fault tolerance in mind. For example, sometimes an AWS EC2 instance will just take ages to complete its initialization, and your environment will fail to create with a timeout.

Using CloudFormation, you can use WaitConditions and WaitHandles to keep track of the setup of the more complicated components of your environment (like EC2 instances). Generally you take the exit code from your initialization on the EC2 instance (for us that is typically cfn-init) and push it through cfn-signal, pointed at a particular WaitHandle. If the exit code is non-zero, CloudFormation considers it a failure, which bubbles up and fails the stack (or triggers whatever your error handling strategy is). If you need to, you can supply a reason as well (a short string), which can help in identifying the root cause of the failure.
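
To make the pattern concrete, here’s a minimal sketch of the sort of thing that runs on the instance at startup (the stack name, resource name, region and WaitHandle URL are all illustrative; the real values get baked in via the template’s UserData):

# A minimal sketch of the cfn-init/cfn-signal pattern (all identifiers illustrative).
& cfn-init.exe -v -s "my-environment-stack" -r "ApiInstance" --region "ap-southeast-2"
$exitCode = $LASTEXITCODE

# A non-zero exit code fails the WaitCondition attached to the handle, which bubbles
# up and fails the stack. The reason string is what shows up in the stack events.
& cfn-signal.exe -e $exitCode -r "cfn-init exited with code [$exitCode]" "<presigned WaitHandle URL from the template>"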

Up until recently, when my environment setups failed, I would report the failure (i.e. this stack failed to initialize) and then automatically clean it up. My scripts would wait to see if the stack status changed to CREATE_COMPLETE within some timeout, and in the case of a failure, simply reported that the stack failed. If I was executing from a test, I would typically break on the line that does the cleanup, and investigate the root cause of the failure before I let the stack be torn down.

Now that I’m automatically executing environment creation/destruction through TeamCity, that kind of approach doesn’t really cut it.

If I arrive at work in the morning and see that the CI environment for service A failed to automatically re-create I need to know why. The stack will have already been cleaned up, so unless the scripts report as much information as possible, my only real recourse is to run it again locally and see what happened. This only works if the failure is consistent. It might run perfectly the second time, which is incredibly annoying when you’re trying to track down a failure.

It was time to think about failing a little bit harder.

Investigating Failures

When a stack fails I typically do 3 things.

The first is to check to see if the stack failed on creation (i.e. on the actual call to New-CfnStack). These failures are usually related to template issues, so the environment setup scripts already handle this. When New-CfnStack throws an exception it bubbles up to the top and is logged. All you have to do is read the message and it will (hopefully) point you at the root cause.

An interesting fact: if your template file is long-ish (it happened to me when mine was about 1000 lines long, but I’m pretty sure it’s dependent on characters, not lines), then the call to New-CfnStack can and will fail with an incredibly useless error about incorrectly formatted XML. This only happens if you upload the content as part of the call, instead of uploading to S3 first. I’m not entirely sure why this happens, but I have a sneaking suspicion that the Powershell/.NET AWS components have some hard limit on the length of the request that isn’t being checked properly. Just upload your template to S3 (I upload to the same place I put my dependencies) and it will work just fine, regardless of the size of the template.
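
The workaround looks something like this (bucket and key names are illustrative):

# Upload the template to S3 first, then reference it by URL, instead of
# supplying the body directly via -TemplateBody (bucket and key illustrative).
Write-S3Object -BucketName "my-deployment-bucket" -Key "templates/environment.template" -File ".\templates\environment.template" -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
New-CFNStack -StackName "my-environment" -TemplateURL "https://s3-ap-southeast-2.amazonaws.com/my-deployment-bucket/templates/environment.template" -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion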

The second is to check the Events attached to the stack for failures, specifically the first one chronologically, as failures typically have a knock-on effect. These events are available in the AWS dashboard, under the CloudFormation section. Sometimes this is all it takes, as most of the failure events are pretty descriptive. If you don’t have permission to do something, this is where it normally comes up, along with some errors that the template verification system won’t find (like an RDS instance using a snapshot but also supplying a DBName).

The third is to use the events to determine which component failed (usually an EC2 instance), remote onto the machine and look at some log files. For Windows EC2 instances there are two log files that are useful, cfn-init and EC2Config.

You can typically find the cfn-init log file at C:\cfn\log\cfn-init.log. It contains the full log trace from everything that happened as a result of the call to cfn-init on that machine. As far as log files go, it’s pretty useful, and I’ve plumbed its depths many times trying to figure out why my initialization was failing.

Second interesting fact, also related to length. If the output from one of your commands in cfn-init is too long, it will fail with an incredibly unhelpful error.

You can typically find the EC2Config log at C:\Program Files\Amazon\EC2Config\logs\EC2Config.log. This one is useful if your cfn-init log file doesn’t have anything in it, which can happen if you’re failing so hard that your EC2 instance can’t even start executing cfn-init. It happens.

The 3-step process I outlined above usually identifies the root cause of the failure, which can then be rectified. Following the process isn’t difficult, but it is annoying and manual. The last step is especially egregious, requiring me to find the machine/s in question, get their private IPs, be connected to our AWS VPN, remote on, locate the log files, read them, etc.

If I have to leave the confines of my environment script in order to go check why it failed, then I’m losing valuable time and context. It would be much better if the environment setup script detected failures and went and did those things for me.

The Right Kind of Lazy

Essentially, a good lazy programmer is one who doesn’t have the stomach for constantly doing a manual task, so instead automates it. This frees them up to do more interesting things, and the automated task can then be run with minimal human involvement, hopefully acting as a multiplier for the effectiveness of the team.

In my case, I needed to automate the last two steps above. The first step is already taken care of, because the exceptions being thrown from the various calls to the AWS API are already being logged in TeamCity.

For the second step (interrogating the Events of the Stack), AWS offers APIs for pretty much everything, which is awesome (Octopus is similar, which is equally awesome, maybe more so because they are Windows native). All you need is a call to Get-CfnStackEvent, and then you can filter the resulting events down to those that failed, which helps to cut down on the amount of output. The following piece of Powershell demonstrates just that.

function _TryGetFailingStackEvents
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$stackId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion
    )

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"
        
        . "$commonScriptsDirectoryPath\Functions-Enumerables.ps1"

        $events = Get-CFNStackEvent -StackName $stackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $failingEvents = @($events | Where { $_.ResourceStatus.Value -match "FAILED" })
        if ($failingEvents | Any -Predicate { $true })
        {
            return $failingEvents
        }
        else
        {
            return @()
        }
    }
    catch
    {
        Write-Warning "Could not get events for stack [$stackId]."
        Write-Warning $_
    }
}

Automating the third step is a little more difficult.

As a general rule, all of our EC2 instances can have Powershell remotely executed on them from our secure AWS VPN CIDR. This is already part of our general EC2 setup (security group configuration to allow the traffic, Windows firewall exception to unblock the Powershell Remote Execution port, etc).

Whenever a failure event occurs that contains a valid EC2 instance ID (identified by a regular expression), the script creates a remote session to the machine using its private IP address and reads the last couple of hundred lines from the log files I mentioned above. You can see the full script below. It actually feeds off the failure events identified in the previous step in order to find the ID of the instance that it needs to connect to.

function _TryExtractLogFilesFromInstanceFailingViaCfnInit
{
    param
    (
        [Parameter(Mandatory=$true)]
        $failingStackEvents,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$adminPassword
    )

    if ($failingStackEvents -eq $null) { return "No events were supplied, could not determine if anything failed as a result of CFN-INIT failure" }

    try
    {
        if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
        $rootDirectoryDirectoryPath = $rootDirectory.FullName
        $commonScriptsDirectoryPath = "$rootDirectoryDirectoryPath\scripts\common"

        $cfnInitFailedIndicator = "CFN-INIT-FAILED"
        Write-Verbose "Attempting to identify and extract information from failure events containing the string [$cfnInitFailedIndicator]"
        $instanceIdRegexExtractor = "(i\-[0-9a-zA-Z]+)"
        $cfnFailureEvent = $failingStackEvents | 
            Where {$_.ResourceStatusReason -match $cfnInitFailedIndicator} | 
            Select -First 1

        if ($cfnFailureEvent.ResourceStatusReason -match $instanceIdRegexExtractor)
        {
            $instanceId = $matches[0];
            Write-Verbose "Found a failure event for instance [$instanceId]"
            Write-Verbose "Attempting to extract some information from the logs from that machine"

            . "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

            $instance = Get-AwsEc2Instance -InstanceId $instanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

            $ipAddress = $instance.PrivateIpAddress

            $remoteUser = "Administrator"
            $remotePassword = $adminPassword
            $securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
            $cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)
            $session = New-PSSession -ComputerName $ipAddress -Credential $cred
    
            $remoteScript = {
                $lines = 200
                $cfnInitLogPath = "C:\cfn\log\cfn-init.log"
                Write-Output "------------------------------------------------------"
                Write-Output "Last [$lines] lines from [$cfnInitLogPath]"
                Get-Content $cfnInitLogPath -Tail $lines
                Write-Output "------------------------------------------------------"
            }
            $remotelyExtractedData = Invoke-Command -Session $session -ScriptBlock $remoteScript
            Remove-PSSession $session
            # If you don't do this, the JSON convert later spits out a whole bunch of useless
            # information about the machine each line was extracted from, files, etc.
            $remotelyExtractedData = $remotelyExtractedData | foreach { $_.ToString() }
            
            return $remotelyExtractedData
        }
        else
        {
            Write-Verbose "Could not find a failure event about CFN-INIT failing"
            return "No events failed with a reason containing the string [$cfnInitFailedIndicator]"
        }
    }
    catch
    {
        Write-Warning "An error occurred while attempting to gather more information about an environment setup failure"
        Write-Warning $_       
    }
}

I execute both of these functions in the catch block of the main environment setup function. Basically, if anything goes wrong during the environment creation, it tries to gather some additional information to print out to the Warning channel. Sometimes failures occur before the stack was actually created, so it needs to be robust in the face of not having a stack ID. As it is inside the catch block, it also needs to be wrapped in a try…catch itself, otherwise you’re likely to lose the root cause of the error if an unexpected error happens during your information gathering.

if ($stackId -ne $null)
{
    Write-Warning "Environment creation failed. Attempting to gather additional information."
    $failingStackEvents = _TryGetFailingStackEvents -StackId $stackId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $cfnInitFailureLogs = _TryExtractLogFilesFromInstanceFailingViaCfnInit -failingStackEvents $failingStackEvents -adminPassword $adminPassword -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    $customError = @{
        "StackId"=$stackId;
        "Stack"=$stack;
        "FailingStackEvents"=$failingStackEvents;
        "CfnInitFailingInstanceLogs"=$cfnInitFailureLogs;
    }

    $failingDetails = new-object PSObject $customError
    Write-Warning (ConvertTo-Json $failingDetails)
}

With everything above working, you get a nice dump of information whenever an environment fails to create, without having to do anything else, which is nice.

The full source that the snippets above were extracted from will be available at Solavirum.Scripts.Common at some point. The content available there now is out of date, but I need to check through the scripts carefully before I upload them to Github. We all remember what happened last time.

Room for Improvement

At some point in the future I would prefer it if the output from CloudFormation (its events) and the various log files from the EC2 instances being provisioned were automatically pumped into our Log Aggregation server (ELK) so that we could analyse them in more detail.

The hard part is that log aggregation would require the installation of something on the machine, and if we’re in a failure state during environment setup, those components are probably not installed. The whole point of the environment setup is to create an environment and deploy software components to it; having to have other components already in place just to report failures is probably not going to work.

I’ll keep thinking about it and make a post if I find a good solution.

Conclusion

I hesitated to do this automated investigation for a long time, because I knew it would be finicky and complex.

I shouldn’t have.

Every time an environment setup fails I now get detailed information right in the same window that I was watching for completion messages. It is beyond useful. It also means that whenever an environment refresh fails in the morning before I get to work, I can still investigate to see what happened, even though the stack was already cleaned up.

I should have followed my instincts and automated this kind of pointless busywork earlier.

If you’ve ever read any of my previous posts, you would realise that I’m pretty big on testing. From my point of view, any code without automated tests is probably a liability in the long run. Some code doesn’t need tests of course (like prototypes and spikes), but if you intend for your code to be used or maintained (i.e. most of it) you should be writing tests for it.

I’ve blogged before about the 3 different classes of automated tests, Unit, Integration and Functional. I fully admit that the lines between them can get blurred from time to time, but I feel these 3 classifications help to think about tests at different levels, which in turn assists in creating a quality piece of software.

A lot of the work I do in my current job has dependencies on a legacy SQL Server database. Not one gigantic database in a central location, but many databases with the same schema, distributed at client locations. It’s a bit of a monster, but workable, and we control the schema, so we’ve got that going for us, which is nice.

Quite a few of the older components take a hard dependency on this database. Some of the newer ones at least use EF, which is nice, but others simply use direct SQL execution. Tests have typically been written using an approach I see a lot: use a local database in a known state and write some tests on top of it. Typically this database is specific to a developer’s machine, and as such the tests will not work inside our CI environment, so they just get marked as [Ignored] (so the build won’t fail) until someone wants to use them again.

Unacceptable.

Why go to all that effort writing a test if you aren’t going to execute it all the time? Tests that aren’t run are worthless after all.

Making it Work

Testing on top of a real SQL server is a fantastic idea and I would classify this sort of test as an Integration test, especially if you’re doing it directly from the classes themselves (instead of the normal entry point for the application). It verifies that the components that you have written (or at least a subset of them) work as expected when you sit them on top of a real database.

The problem comes in automating those sort of tests.

In an ideal world, your test would be responsible for setting up and tearing down all of its dependencies. This would mean it creates a database, fills it with some data, runs the code that it needs to run, verifies the results and then cleans up after itself.

That’s not simple though and looks like a lot of unnecessary work to some developers. They think, why don’t I just create a local database and put some data in it. I can run my test while I write my code, and I can reset my database whenever I want using scripts/restores/etc. These developers don’t realise that tests live just as long as the code that they verify, and if you’re developing for the long run, you need to put that effort in or suffer the consequences later on.

I’ll take a moment to mention that not every piece of code needs to be written for the long run (most do), and that it is possible to have so many tests that it becomes hard to change your code, due to the barrier of requiring that you change your tests as well (which can be a lot of work). As with most things in software, it’s a balancing act. Just like your code, your tests should be easy to change and maintain, or they will end up causing the same pain that you were trying to avoid in the first place, just in a different place.

In order to facilitate the approach where each test is responsible for its own test data you need to put some infrastructure in place.

  1. You need a common location where the tests can create a database. If your CI environment is in the same network as your developers, this can simply be a machine at a known location with the required software installed. It’s a little more complicated if they are in two different networks (our CI is in AWS, for example).
  2. You need to have a reusable set of tools for creating, initialising and destroying database resources, to make sure your tests are self contained. These tools must be able to be run in an automated fashion.

Where the Wild Databases Are

The first step in allowing us to create executable database integration tests is to have a database server (specifically MSSQL server) available to both our development environment and our CI environment.

Since our CI is in AWS, the best location for the database is there as well. The main reason for this is that it will be easy enough to expose the database securely to the office, whereas it would be hard to expose resources in the office to AWS.

We can use the Amazon supplied SQL Server 2014 Express AMI as a baseline, and create a CloudFormation template that puts all the bits in place (instance, security groups, host record, etc).

The template is fairly trivial, so I won’t go into detail about it. It’s very much the same as any other environment setup I’ve done before (I think the only publicly available example is the JMeter Workers, but that’s a pretty good example).

In order to expose the SQL instance on the canned AMI I had to make some changes to the server itself. It’s easy enough to execute a Powershell script during initialisation (via cfn-init), so I just ran this script to enable TCP/IP, switch to mixed mode authentication and enable the sa account with a new password. The instance is only accessible via a secure channel (internally in our VPC and via our AWS VPN) so I’m not too concerned about exposing the sa username directly, especially with a unique and secure password. Its only purpose is to hold temporary test data anyway, and it doesn’t have the ability to connect to any other resources, so I’m not particularly worried.

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [string]$SaPassword
)

$ErrorActionPreference = "Stop"

Import-Module sqlps -DisableNameChecking

$localhostname = hostname
$instanceName = "mssqlserver"

$smo = 'Microsoft.SqlServer.Management.Smo.'
$wmi = new-object ($smo + 'Wmi.ManagedComputer')

Write-Verbose "Enabling TCP/IP."
$uri = "ManagedComputer[@Name='$localhostname']/ ServerInstance[@Name='$instanceName']/ServerProtocol[@Name='Tcp']"
$Tcp = $wmi.GetSmoObject($uri)
$Tcp.IsEnabled = $true
$Tcp.Alter()

Write-Verbose "Enabling Mixed Mode Authentication."
$s = new-object Microsoft.SqlServer.Management.Smo.Server($localhostname)

$s.Settings.LoginMode = [Microsoft.SqlServer.Management.SMO.ServerLoginMode]::Mixed
$s.Alter()

Write-Verbose "Restarting [$instanceName] so TCP/IP and Auth changes are applied."
$service = get-service $instanceName
$service.Stop()
$service.WaitForStatus("Stopped")
$service.Start()
$service.WaitForStatus("Running")

Write-Verbose "Editing sa user for remote access."
$SQLUser = $s.Logins | ? {$_.Name -eq "sa"}
$SQLUser.PasswordPolicyEnforced = 0
$SQLUser.Alter()
$SQLUser.Refresh()
$SQLUser.ChangePassword($SaPassword)
$SqlUser.Enable()
$SQLUser.Alter()
$SQLUser.Refresh()

I know the script above isn’t necessarily the neatest way to do the configuration I needed, but it’s enough for now.
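
Once the instance is up, a quick sanity check from a machine on the other side of the VPN looks something like this (the host name is illustrative, and it assumes the sqlps module is available on the machine running the check):

# Verify that TCP/IP, mixed mode authentication and the sa account all work
# from a remote machine (host name is illustrative).
Invoke-Sqlcmd -ServerInstance "test-database.internal.example.com" -Username "sa" -Password $SaPassword -Query "SELECT @@VERSION"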

How Are Baby Databases Made?

The next step is having a library that we can re-use to easily make test databases.

We already had some Entity Framework classes and a database context buried deep inside another solution, so I extracted those, put them into their own repository and built a standalone Nuget package from that. It’s not the same code that actually creates a brand new database for a client (that’s a series of SQL scripts embedded in a VB6 application), but it’s close enough for testing purposes. I hope that eventually we will use EF instead of the hardcoded scripts (leveraging the migrations functionality for database version management), but that’s probably a long way away.

I’d previously done a small refactor on the EF project, so it already had the concept of a DbContextFactory; all I had to do was implement a new one that connected to a known SQL server and created a randomly named database. I made it disposable, so that it would destroy the database once it was done.

EF took care of actually creating the database to match the schema defined by the DTO classes (which were already there), so I didn’t have to worry about that too much.

In the code below, the ITestDatabaseConnectionStringFactory implementation is responsible for knowing where the server is and how to connect to it (there’s a few implementations; one takes values from the app.config, one is hardcoded, etc). The INamedDbFactory has a single Create method that returns a derived DbContext, nothing fancy.

using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Linq;
using Serilog;

namespace Solavirum.Database.EF.Tests.Common
{
    public class TestDatabaseDbFactory : INamedDbFactory, IDisposable
    {
        public TestDatabaseDbFactory(ILogger logger, ITestDatabaseConnectionStringFactory connFactory)
        {
            _connFactory = connFactory;
            var databaseName = _random.GenerateString(20);
            ConnectionString = _connFactory.Create(databaseName);

            _logger = logger
                .ForContext("databaseName", ConnectionString.DatabaseName)
                .ForContext("databaseHost", ConnectionString.Host);

            _logger.Information("New factory for database {databaseName} on host {databaseHost} created");
        }

        private readonly IRandomTestData _random = new DefaultRandomTestData();
        private readonly ILogger _logger;
        private ITestDatabaseConnectionStringFactory _connFactory;

        public readonly TestDatabaseConnectionString ConnectionString;

        public NamedDb Create()
        {
            _logger.Information("Creating a new NamedDb for database {databaseName}");
            var context = new NamedDb(new SqlConnection(ConnectionString.GetFullConnectionString()));

            if (!context.Database.Exists())
            {
                _logger.Information("This is the first time {databaseName} has been created, creating backing database on host {databaseHost}");
                context.Database.CreateIfNotExists();
            }

            return context;
        }

        public void Cleanup()
        {
            using (var context = new NamedDb(ConnectionString.GetFullConnectionString()))
            {
                _logger.Information("Deleting backing database {databaseName} on host {databaseHost}");
                context.Database.Delete();
            }
        }

        bool _disposed;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        ~TestDatabaseDbFactory()
        {
            Dispose(false);
        }

        protected virtual void Dispose(bool disposing)
        {
            if (_disposed)
                return;

            if (disposing)
            {
                try
                {
                    Cleanup();
                }
                catch (Exception ex)
                {
                    Trace.WriteLine(string.Format("An unexpected error occurred while attempting to clean up the database named [{0}] spawned from this database factory. The database may not have been cleaned up.", ConnectionString.DatabaseName));
                    Trace.WriteLine(ex.ToString());
                }
                
            }

            _disposed = true;
        }
    }

    public interface IRandomTestData
    {
        string GenerateString(int length);
    }

    public class DefaultRandomTestData : IRandomTestData
    {
        private readonly Random _random = new Random();

        public string GenerateString(int length)
        {
            var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            var result = new string(Enumerable.Repeat(chars, length)
                .Select(s => s[_random.Next(s.Length)])
                .ToArray());

            return result;
        }
    }
}
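
For illustration, here’s roughly what a test consuming the factory looks like (a sketch only; it assumes NUnit, a globally configured Serilog logger, and a hypothetical ITestDatabaseConnectionStringFactory implementation that reads the server location from the app.config):

[Test]
public void WhenAnEntityIsSaved_ItCanBeReadBack()
{
    // The factory creates a randomly named database on the shared test server,
    // and Dispose deletes it again, so the test is entirely self contained.
    var connFactory = new AppConfigTestDatabaseConnectionStringFactory(); // hypothetical implementation
    using (var factory = new TestDatabaseDbFactory(Log.Logger, connFactory))
    using (var context = factory.Create())
    {
        // Arrange some data, execute the component under test, assert on the results.
    }
}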

The Cleanup

Of course, by extracting the database classes above into their own repository/package, I had to replace all references to the old classes. It turns out the database component was quite heavily referenced in its original location, so it was a non-trivial amount of work to incorporate it properly.

As is almost always the way when you start doing things correctly after a long period of not doing that, I had to shave a few yaks to make the change happen.

On the upside, as a result of the extraction and the creation of the TestDatabaseFactory, I managed to create reliably executable tests for the actual database component itself, proving that it works when connected to a real SQL server database.

Summary

For me, the takeaway from this activity is that it takes effort to set up good testing infrastructure so that you can easily create tests that can be repeatedly executed. It’s not something that just happens, and you need to be willing to accept the fact that you have to go slower to start off with in order to go faster later. Like delayed gratification.

I could have just set up a SQL server in a remote location without automating any of it, but that’s just another case of the same principle. I now have the ability to set up any version of SQL server that I want, or to change it to deploy a custom version (assuming we had an Octopus package to install SQL server, which is more than possible), and I’ve also automated its destruction every night and recreation every morning, allowing us to easily incorporate changes to the base AMI or to alter it in other ways (like a different version of SQL server).

It can be very hard to get some people to understand this. I find the best approach is to just do it (taking into account hard deadlines of course) and to let it prove itself in the field as time goes on, when you are able to make large changes without fear that you’ve broken everything.

I like knowing that I haven’t broken everything personally.

Or even better, knowing that I have.

God that title is terrible. Sorry.

If you’ve ever read any of my blog before, you would know that I’ve put a significant amount of effort into making sure that our environments can be spun up and down easily. A good example of this is the environment setup inside the Solavirum.Testing.JMeter repository, which allows you to easily setup a group of JMeter worker machines for the purposes of load testing.

These environment setup scripts are all well and good, but I really don’t like the fact that I end up executing them from my local machine. This can breed a host of problems, in the same way that compiling and distributing code from your own machine can. It’s fine for rarely provisioned environments (like the JMeter workers I mentioned above), but it’s definitely not suitable for provisioning a CI environment for a web service or anything similar.

Additionally, when you provision environments from a developer machine you risk accidentally provisioning an environment with changes that have not been committed into source control. This can be useful for testing purposes, but ultimately leads to issues with configuration management. It’s nice to tag the commit that the environment was provisioned from as well, on both sides (in source control and on the environment itself, just like versioning a library or executable).

Luckily we already have a platform in place for centralizing our compiling and packaging, and I should be able to use it to do environment management as well.

TeamCity.

I’m Not Sure Why Its Called TeamCity

If you’re unfamiliar with TeamCity, it’s similar to Jenkins. If you’re unfamiliar with Jenkins, it’s similar to TeamCity.

Ha.

Anyway, TeamCity is a CI (Continuous Integration) service. It allows you to set up build definitions and then run them on various triggers to produce artifacts (like installers, or Nuget packages). It does a lot more to be honest, but its core idea is about automating the build process in a controlled environment, free from the tyranny of developer machines.

The team here was using TeamCity before I started, so they already had some components being built, including a 20 step monstrosity that I will one day smooth out.

As a general rule, I love a good CI server, but I much prefer to automate in something like Powershell, so that it can be run locally if need be (for debugging/investigation purposes), so I’m wary of putting too much logic inside the actual CI server configuration itself. I definitely like to use the CI server for scheduling, history and other ancillary services (like tagging on successful builds and so on) though, the things that you don’t necessarily need when running a build locally.

Anyway, the meat of this blog post is about automating environment management using TeamCity, so I should probably talk about that now.

Most of my environment provisioning scripts have the same structure (with a few notable exceptions), so it was easy enough to create a Template in TeamCity to automate the destruction and recreation of an environment via scripts that already exist. The template was simple: a link to a git repository (configurable by repo name), a simple build step that just runs some Powershell, a trigger and some build parameters.

The only thing I can really copy here is the Powershell build step, so here it is:

try
{
    if ("%teamcity.build.branch.is_default%" -eq "false") 
    {
        Write-Error "Cannot create environment from non-default branch (i.e. not master)."
        exit 1
    }

    . ".\scripts\build\_Find-RootDirectory.ps1"

    $rootDirectory = Find-RootDirectory "."
    $rootDirectoryPath = $rootDirectory.FullName

    # Consider adding Migration check and run that instead if it exists.
    $environment = "%environment%"
    
    $restrictedEnvironmentRegex = "prod"
    if ($environment -match $restrictedEnvironmentRegex)
    {
        write-error "No. You've selected the environment named [$environment] to create, and it matches the regex [$restrictedEnvironmentRegex]. Think harder before you do this."
        Write-Host "##teamcity[buildProblem description='Restricted Environment Selected']"
        exit 1
    }
    
    $invokeNewEnvironmentPath = ".\scripts\environment\Invoke-NewEnvironment.ps1"
    $invokeDeleteEnvironmentPath = ".\scripts\environment\Invoke-DeleteEnvironment.ps1"

    if(-not (test-path $invokeNewEnvironmentPath) -or -not (test-path $invokeDeleteEnvironmentPath))
    {
        write-error "One of the expected environment management scripts (New: [$invokeNewEnvironmentPath], Delete: [$invokeDeleteEnvironmentPath]) could not be found."
        Write-Host "##teamcity[buildProblem description='Missing Environment Management Scripts']"
        exit 1
    }
    
    $bootstrapPath = ".\scripts\Functions-Bootstrap.ps1"
    if (-not (test-path $bootstrapPath))
    {
        Write-Warning "The bootstrap functions were not available at [$bootstrapPath]. This might not be important if everything is already present in the repository."
    }
    else
    {
        . $bootstrapPath
        Ensure-CommonScriptsAvailable
    }
    
    $octopusUrl = "%octopusdeploy-server-url%"
    $octopusApiKey = "%octopusdeploy-apikey%"
    $awsKey = "%environment-deployment-aws-key%"
    $awsSecret = "%environment-deployment-aws-secret%"
    $awsRegion =  "%environment-deployment-aws-region%"
    
    $arguments = @{}
    $arguments.Add("-Verbose", $true)
    $arguments.Add("-AwsKey", $awsKey)
    $arguments.Add("-AwsSecret", $awsSecret)
    $arguments.Add("-AwsRegion", $awsRegion)
    $arguments.Add("-OctopusApiKey", $octopusApiKey)
    $arguments.Add("-OctopusServerUrl", $octopusUrl)
    $arguments.Add("-EnvironmentName", $environment)
    
    try
    {
        Write-Host "##teamcity[blockOpened name='Delete Environment']"
        Write-Host "##teamcity[buildStatus text='Deleting $environment']"
        & $invokeDeleteEnvironmentPath @arguments
        Write-Host "##teamcity[buildStatus text='$environment Deleted']"
        Write-Host "##teamcity[blockClosed name='Delete Environment']"
    }
    catch
    {
        write-error $_
        Write-Host "##teamcity[buildProblem description='$environment Deletion Failed']"
        exit 1
    }

    try
    {
        $recreate = "%environment-recreate%"
        if ($recreate -eq "true")
        {
            Write-Host "##teamcity[blockOpened name='Create Environment']"
            Write-Host "##teamcity[buildStatus text='Creating $environment']"
            & $invokeNewEnvironmentPath @arguments
            Write-Host "##teamcity[buildStatus text='$environment Created']"
            Write-Host "##teamcity[blockClosed name='Create Environment']"
        }
    }
    catch
    {
        write-error $_
        Write-Host "##teamcity[buildProblem description='$environment Created Failed']"
        exit 1
    }
}
catch 
{
    write-error $_
    Write-Host "##teamcity[buildProblem description='$environment Created Failed']"
    exit 1
}

Once I had the template, I created new build configurations for each environment I was interested in, and filled them out appropriately.

Now I could recreate an entire environment just by clicking a button in TeamCity, and every successful recreation was tagged appropriately in Source Control, which was awesome. Now I had some traceability.

The final step was to schedule an automatic recreation of each CI environment every morning, to constantly validate our scripts and make sure they work appropriately.

Future Improvements

Alas, I ran into one of the most annoying parts of TeamCity. After the initial 20, licensing is partially based on build configurations. We already had a significant amount of configs, so I ran out before I could implement a build configuration to do a nightly tear down of environments that don’t need to exist overnight (for example all our CI environments). I had to settle for merely recreating them each morning (tear down followed by spin up), which at least verifies that the scripts continue to work.

If I could change build parameters based on a Trigger in TeamCity that would also work, but that’s a missing feature for now. I could simply set up two triggers, one for the morning to recreate and the other for the evening to tear down (where they both execute the same script, just with different inputs). This has been a requested feature of TeamCity for a while now, so I hope they get to it at some stage.

I’ll rectify this as soon as we get more build configurations. Which actually leads nicely into my next point.

So, What’s It Cost

It’s free! Kinda.

It’s free for 3 build agents and 20 build configurations. You can choose to buy another agent + 10 build configs for a token amount (currently $300 US), or you can choose to buy unlimited build configurations (the Enterprise edition) for another token amount (currently $2000 US).

If you’re anything like me, and you love to automate all the things, you will run out of build configurations far before you need more build agents.

I made the mistake of getting two build agent + configs packs through our organization before I realized that I should have just bought the Enterprise edition, and now I’m having a hard time justifying its purchase to the people what control the money. Unfortunate, but I’m sure I’ll convince them in time, and we’ve got an extra 2 build agents as well, so that’s always nice.

Jetbrains (the creators of TeamCity) were kind of annoying in this situation actually. We wanted to switch to Enterprise, and realized we didn’t need the build agents (yet), but they wouldn’t do us a deal. I can understand that it’s probably just their policy, but it’s still annoying.

Summary

I’m never happy unless what I’ve done can be executed on a machine completely different from my own, without a lot of futzing about. I’m a big proponent for “it should just work”, and having the environments triggered from TeamCity enforces that sort of thinking. Our build agents are pretty vanilla as far as our newer code is concerned (our legacy code has some nasty installed dependencies that I won’t go into detail about), so being able to execute the environment provisioning through TeamCity constantly validates that the code works straight out of source control.

It also lets other people create environments too, and essentially documents the usage of the environment provisioning scripts.

I get some nice side effects from doing environment management in TeamCity as well, the most important of which is the ability to easily tag when environments were provisioned (and from what commit) in source control.

Now I just need more build configurations…

I mentioned last week that I was struggling with an issue with my environment setup scripts via CloudFormation.

Well, I finally got to the bottom of it, and I thought it would be a good thing to share, because I don’t want anyone else to have to struggle with finding the root cause behind the incredibly unhelpful error that I was getting. Hopefully, my pain becomes your knowledge.

To give a little bit of background, I use a combination of CloudFormation templates, Powershell and Octopus Deploy to setup environments for our products and services in AWS. This approach has proven to be extremely useful for a number of reasons, including codification of what exactly goes into the environment, reducing the amount of time to setup an environment (in comparison to using typical infrastructure requests) and reducing the reliance on one particular person or team who knows how to put it all together.

A little while ago I noticed that the environment setup that I use for my JMeter Load Test Workers was failing to run to completion. Every time I tried to set up the environment, it would fail with a “Create Failed” status, stating that one or more of the instances failed their configuration.

Looking into the issue, I found the following error in the cfn-init.log file (available at C:\cfn\log\cfn-init.log).

In case you are unfamiliar with CloudFormation, cfn-init is a program supplied by Amazon that works with CloudFormation, and allows you to configure a series of steps to occur during the startup of an EC2 instance. Typically you use the UserData for an instance to execute cfn-init with the appropriate parameters, and it connects to the meta data service, downloads the CloudFormation (cfn) template and executes it on the local machine. It’s awesome, and I use it a lot.

The relevant excerpt from the log file:

2015-06-12 03:06:02,674 [DEBUG] Running command a-deploy-loadtest
2015-06-12 03:06:02,674 [DEBUG] No test for command a-deploy-loadtest
2015-06-12 03:06:59,846 [INFO] Command a-deploy-loadtest succeeded
2015-06-12 03:06:59,846 [ERROR] -----------------------BUILD FAILED!------------------------
2015-06-12 03:06:59,846 [ERROR] Unhandled exception during build: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte
Traceback (most recent call last):
    File "cfn-init", line 123, in <module>
    File "cfnbootstrap\construction.pyc", line 117, in build
    File "cfnbootstrap\construction.pyc", line 502, in build
    File "cfnbootstrap\construction.pyc", line 513, in run_config
    File "cfnbootstrap\construction.pyc", line 125, in run_commands
    File "cfnbootstrap\command_tool.pyc", line 116, in apply
    File "encodings\utf_8.pyc", line 16, in decode UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 644: invalid continuation byte

Of course, that doesn’t make any sense, as I hadn’t changed anything to do with the encoding of the output. Even more confusingly, when I ran cfn-init using the exact same mechanism (the execution of the UserData batch file), on the exact same machine, it would work just fine. It only failed during instance startup.

I looked into it for a little while, but couldn’t find an obvious root cause. I assumed that I was doing something slightly different from the norm or that something had changed upstream to cause the error. This made sense, because all of the other places where I was using the exact same code were working just fine, and it seemed to be localised to just my load test worker environment.

I altered the environment setup to not require instances to set up successfully (by removing the call to cfn-signal after cfn-init, and removing the WaitCondition from the template) and manually initialized each instance after it was created, before I ran my load tests.

Not great, but an acceptable work around until it mattered more. I had bigger fish to fry.

The Virus Spreads

Then it happened somewhere else.

The new cloud based service we are working on needed an environment refresh to take advantage of scalable persistence in the form of RDS. We tore down the old CI environment, and then attempted to create a new one.

The mysterious “invalid UTF8 character” error reared its ugly head again.

I couldn’t brush it away this time, as this environment was much more important than my load test workers, so it was time to dig in and see what was really going on.

Taking the error at face value, I assumed that something had changed in the pipeline (Octopus, AWS, etc) and that there really was an invalid character. It was fairly obvious that cfn-init didn’t fail until after the first deployment had actually completed, so something seemed to be off in the output of that particular step.

"a-deploy-system-stats-logger": { "command": { "Fn::Join": [ "", [ "powershell.exe -ExecutionPolicy Bypass -Command c:\\cfn\\dependencies\\scripts\\environment\\Invoke-DeployLatestProjectForEnvironment.ps1 ",

" –Verbose ", " -OctopusServerApiKey ", { "Ref": "OctopusAPIKey" }, " -OctopusServerUrl ", { "Ref": "OctopusServerURL" }, " -OctopusEnvironment ", { "Ref": "OctopusEnvironment" }, " -ProjectName 'System Stats Logger'" ] ] }, "waitAfterCompletion": "0" }

I ran the script manually through C# and captured the raw output stream. I parsed the output as UTF8, and it came out just fine, so clearly the output itself wasn’t the problem.
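
The check itself was simple enough; something along these lines (a rough sketch, with an illustrative script path):

using System.Diagnostics;
using System.IO;
using System.Text;

// Run the deployment script, capture the raw bytes of its output stream and
// attempt a strict UTF8 decode, which throws on any invalid byte sequence.
var process = new Process
{
    StartInfo = new ProcessStartInfo
    {
        FileName = "powershell.exe",
        Arguments = @"-ExecutionPolicy Bypass -File C:\temp\Invoke-DeployLatestProjectForEnvironment.ps1", // illustrative path
        RedirectStandardOutput = true,
        UseShellExecute = false
    }
};
process.Start();
byte[] rawOutput;
using (var buffer = new MemoryStream())
{
    process.StandardOutput.BaseStream.CopyTo(buffer);
    rawOutput = buffer.ToArray();
}
process.WaitForExit();
var decoded = new UTF8Encoding(false, true).GetString(rawOutput); // throwOnInvalidBytes = true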

At this point I took a slightly different approach. I knew that an old version of the environment setup for our new cloud service had worked just fine a few days ago, so I branched from that point in Git and executed the setup script.

It worked.

Now I at least had two copies of similar codebases that I could compare for differences to find what was breaking the deployments, and I knew that it wasn’t an environmental issue related to either Octopus Deploy or AWS components.

I used an application called WinMerge to diff the two repositories (after extracting them to different directories on disk), and found a number of differences, but nothing that was obviously the root cause. I thought that perhaps the problem was caused by upgrading to the latest version of the AWS Powershell package, but I rolled that back and still had the same issue.

Heads up, WinMerge people, if you are on the internet at all: the statuses for “File Identical” and “File Different” in your application look incredibly similar at a glance. I thought all my files were identical initially, at least until I looked a little closer. Colours would be nice.

The Root Cause

While looking at the diff of the two directories, I noticed that I had added additional Write-Verbose statements to some of my generic Powershell enumerable functions (like Any, Single, All, etc). In fact, in all of the environment setup scripts that were working, this additional logging was missing (because they were using an older version of the common scripts). The logging was only present in those environments that were broken.

But how could additional logging statements cause a UTF8 error in cfn-init, and only during startup?

It took me a fair bit of time to realise that it was an issue with the length of the output. My deployment scripts all had -Verbose turned on by default. I did this on purpose, to make sure that I received the maximum amount of information possible from the environment setup.

Apparently cfn-init has a limit on the amount of output it can process from a command, and I was pushing that limit.

As to why it was failing during startup and not when manually executed? The code that was run at startup did some archive extraction which didn’t happen in future calls (because it was already on disk) and the items being extracted were logged. This was evidently enough to push the output over the invisible, undocumented limit.

I removed the -Verbose flag from the calls to the deployment scripts and everything was fine.

Problem solved.

Why?

Of course, the question remains. Why?

Why does cfn-init give a completely unrelated error when the output from a single command is too long? Why not a “output has been truncated” or “output too long” error of some description?

I have no idea.

I suspect the reason for the UTF8 error is that the output was truncated partway through a multi-byte character, so when the parser underneath cfn-init tried to parse it, it encountered what appeared to be an invalid UTF8 character. I still have no idea why it would do arbitrary truncation though, especially without telling the user.

Ah well.

Conclusion

This is the sort of thing that you just have to deal with sometimes. I do wonder how many other people have run into this issue though. Considering that when I did a Google search for the error, I found almost nothing, I’m not sure if anyone else has. If they have, they certainly didn’t post online about it. Luckily, for everyone else in the future, I did, both here and on the AWS forums, so maybe that will help.

On the upside, running into these sorts of things makes you think more about your own code, and how it reports problems to the user when assumptions that you’ve made break down in reality. It takes a lot of effort to write robust code that is communicative when it fails, but I think it’s definitely worth the effort.

Remember, your code will be used and read a hell of a lot more than it’s written.

So, as per my last post, I built a scalable, deployable, codified proxy environment in AWS, leveraging Cloud Formation, Octopus and Squid.

Since then I have attempted to use this proxy for things. Specifically load tests, which was the entire reason I built the proxy in the first place.

In my attempts to use the proxy, I have learned a few lessons that I thought would be worthwhile to share, so that others might benefit from my pain.

This will be a relatively short post, because:

  1. I’ve been struggling with these issues over the last week and haven’t done much else,
  2. For all the trouble I had, I really only ran into 2 issues,
  3. I’m still struggling with a mysterious, probably unrelated issue, and it’s taking up most of my mental space (unexpected UTF8 characters in a Powershell/Octopus deployment output stream, which might be a good future blog post if I ever figure it out).

Weird Differences

Initially I assumed that my proxy would slot into the same hole as our current proxy. They were both Squid, I had replicated the config from the old one on the new one and both were referenced simply by a URL and port number.

I was mistaken.

I’m still not sure how it did it, but the old proxy was somehow allowing connections to the special EC2 meta information address (169.254.169.254) to pass through correctly. The moment I swapped in my new proxy, cfn-init and cfn-signal no longer worked.

For cfn-init, the error was incredibly unhelpful. It insisted that my instance was not a member of the CloudFormation group/template/configuration that I was trying to initialise from.

For cfn-signal, it just didn’t do anything. It said it signalled, but it was lying.

In hindsight, this makes perfect sense. The request would have gone through the proxy, which was a CloudFormation resource itself, and it would have tried to use the proxy’s CloudFormation template as the container for the meta data, which would fail, giving a technically correct error message in the first case, and signalling something non-existent in the second.

From my point of view, it looked insane.

I assumed I had put some sort of incorrect details into the cfn-init call, or that I had failed to meet the arcane requirements for cfn-signal (you must base64 encode the wait handle URL, but only on Windows, for example), but I hadn’t changed anything. The only thing I had changed was the proxy configuration.

Long story short, for my proxy, I had to add a bypass entry (on each EC2 instance, configured in the same place as the proxy itself, the UserData script) which stops cfn-init (and other tools) from trying to go through the proxy to hit the meta information address. I still have no idea how the old proxy did not require the same sort of configuration. I have a hunch that it might be because it was Linux and the original creators of the box did something special to make it work. Maybe they ran into the same issue, but just fixed it a different way? Maybe Linux automatically handles the situation better? Who knows.

Very frustrating.
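
For the record, the bypass ended up looking something like this in the UserData script (the proxy address is illustrative, and exactly which of these settings each tool respects is an assumption on my part):

# Set the machine-wide proxy with a bypass for the EC2 meta information address
# (proxy address is illustrative).
netsh winhttp set proxy proxy-server="proxy.internal:3128" bypass-list="169.254.169.254"

# Mirror the settings in environment variables for tools that honour them.
[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://proxy.internal:3128", "Machine")
[Environment]::SetEnvironmentVariable("NO_PROXY", "169.254.169.254", "Machine")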

Location, Location, Location

The second pain point I ran into was more insane and just as frustrating.

After reviewing the results of an initial load test, I hypothesised that maybe the proxy was a bottleneck. All of the traffic for the load test had to pass through the proxy (including image uploads) and I couldn’t see anything obvious in the service logs to account for the level of failure I was seeing, except high load. In the interests of getting a better subsequent load test, I wanted to make sure that the proxy boxes could not possibly be a bottleneck, so I planned to beef up their instance type.

I was originally using t2.medium instances, which have some limitations, mostly around network performance and CPU credits. I wanted to switch to something a bit beefier, just for the proxy specific to the load tests.

When I switched to an m3.large, the proxy stopped working.

Looking into it, the expected installation directory (C:\Squid) was empty of anything that even vaguely looked like a proxy.

Following the installation log, I found out that Squid had decided to install itself to Z drive. Z drive was an ephemeral drive. You know, the ones whose content is transitory, and which tend to get annihilated if the instance goes down for any reason?

I tried so very hard to get Squid to just install to the C drive, including checking the registry settings for program installation locations (which were all correctly C based) and manually overriding TARGETFOLDER, ROOTDRIVE and INSTALLDIR in the msi execution parameters.
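
The invocation I was fighting with looked roughly like this (package path and log path are illustrative):

# Try to force the Squid installer onto the C drive by overriding the usual
# directory-related MSI properties (package and log paths are illustrative).
msiexec /i "C:\temp\squid.msi" /qn /l*v "C:\temp\squid-install.log" TARGETFOLDER="C:\" ROOTDRIVE="C:\" INSTALLDIR="C:\Squid\"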

Alas, it was not to be. No matter what I did, Squid insisted on installing to Z drive.

I still have no idea why, I just turned the instance type back to one that didn’t have ephemeral drives available.

Like any good software user, I logged a bug. Well, I assume it’s a bug, because that’s a really weird feature.

Conclusion

There is no conclusion. Just a glimpse into some of the traps that sap your time, motivation and will to live when doing this sort of thing.

I only hope that someone runs across this blog post one day and it helps them. Or at least lets them know someone else out there understands their pain.