It's time to fix that whole shared infrastructure issue.

I need to farm my load tests out to AWS, and I need to do it in a way that won't accidentally murder our current production servers. In order to do that, I need to forge the replacement for our manually created and configured proxy box. A nice, codified, auto-scaling proxy environment.

Back into CloudFormation I go.

I contemplated simply using a NAT box instead of a proxy, but decided against it because:

  • We already use a proxy, so assuming my template works as expected it should be easy enough to slot in,
  • I don’t have any experience with NAT boxes (I’m pretty weak on networking in general actually),
  • Proxies scale better in the long run, so I might as well sort that out now.

Our current lone proxy machine is a Linux instance with Squid manually installed on it. It was set up some time before I started, by someone who no longer works at the company. An excellent combination: I'm already a bit crap at Linux, and now I can't even ask anyone how it was put together or what sort of tweaks were done to it over time as failures were encountered. Time to start from scratch. The proxy itself is sound enough, and I have some experience with Squid, so I'll stick with it. As for the OS, while I know that Linux would likely be faster with less overhead, I'm far more comfortable with Windows, so to hell with Linux for now.

Here's the plan. Create a CloudFormation template for the actual environment (Load Balancer, Auto Scaling Group, Instance Configuration, DNS Record), and also create a NuGet package, deployed via Octopus, that installs and configures the proxy.

I've always liked the idea of never installing software manually, but it's only been recently that I've had access to the tools to accomplish that. Octopus, NuGet and Powershell form a very powerful combination for managing deployments on Windows. I have no idea what the equivalent is for Linux, but I'm sure there is something. At some point in the future Octopus is going to offer the ability to do SSH deploys, which will allow me to include more Linux infrastructure (or manage existing Linux infrastructure even better; I'm looking at you, ELK stack).

Save the Environment

The environment is pretty simple. A Load Balancer hooked up to an Auto Scaling Group, whose instances are configured to do some simple setup (including using Octopus to deploy some software), and a DNS record so that I can refer to the load balancer in a nice way.

I’ve done enough of these simple sorts of environments now that I didn’t really run into any interesting issues. Don’t get me wrong, they aren’t trivial, but I wasn’t stuck smashing my head against a desk for a few days while I sorted out some arcane problem that ended up being related to case sensitivity or something ridiculous like that.

One thing that I have learned is to set up the Octopus project that will be deployed during environment setup ahead of time. Give it some trivial content, like running a Powershell script, and then make sure it deploys correctly during the startup of the instances in the Auto Scaling Group. If you try to sort out the package and its deployment at the same time as the environment, you'll probably run into situations where the environment setup technically succeeded, but because the deployment of the package failed, the whole thing failed and you have to wait another 20 minutes to try again. It really saves a lot of time to create the environment in such a way that you can extend it with deployments later.

Technically you could also make it so failing deployments don’t fail an environment setup, but I like my environments to work when they are “finished”, so I’m not really comfortable with that in the long run.

The only tricky things about the proxy environment are making sure that you set up your security groups appropriately so that the proxy port can be accessed, and making sure that you use the correct health check for the load balancer (for Squid at least, a simple TCP check on port 3128, its default port, works well).

That's a Nice Package

With the environment out of the way, it's time to set up the package that will be used to deploy Squid.

Squid is available on Windows via Diladele. Since 99% of our systems are 64 bit, I just downloaded the 64 bit MSI. Using the same structure that I used for the Nxlog package, I packaged up the MSI and some supporting scripts, making sure to version the package appropriately. Consistent versioning is important, so I use the same versioning strategy that I use for our software components: include a SharedAssemblyInfo file and then mutate that file via some common versioning Powershell functions.
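To make that concrete, here is a rough sketch of the sort of versioning function I mean. Its name and exact handling of the attributes are illustrative assumptions rather than the actual contents of our common scripts; the idea is simply to rewrite the version attributes in SharedAssemblyInfo.cs so that every assembly (and the resulting package) carries the same version.

function Update-SharedAssemblyInfoVersion
{
    [CmdletBinding()]
    param
    (
        [string]$sharedAssemblyInfoPath,
        [System.Version]$newVersion
    )

    # Rewrite the standard version attributes in place so the build output picks up the new version.
    $content = Get-Content $sharedAssemblyInfoPath
    $content = $content -replace 'AssemblyVersion\("[^"]*"\)', "AssemblyVersion(""$newVersion"")"
    $content = $content -replace 'AssemblyFileVersion\("[^"]*"\)', "AssemblyFileVersion(""$newVersion"")"
    Set-Content -Path $sharedAssemblyInfoPath -Value $content

    return $newVersion
}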

Apart from the installation of Squid itself, I also included the ability to deploy a custom configuration file. The main reason I did this was so that I could replicate our current Squid proxy config exactly, because I'm sure it does things that have been built up over the last few years that I don't understand. I did this in a similar way to how I did config deployment for Nxlog and Logstash. Essentially, a set of configuration files is included in the NuGet package and the correct one is chosen at deployment time based on some configuration within Octopus.
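As a rough sketch of what that selection looks like at deploy time (the variable name, folder layout and install path below are illustrative assumptions, not the real package contents), the deployment script just asks Octopus which configuration it should be using and copies that file over the top of the default one.

# Octopus exposes its variables to deployment scripts via the $OctopusParameters hashtable.
$configName = $OctopusParameters["SquidProxy.ConfigName"]
$selectedConfig = Join-Path $PSScriptRoot "configs\$configName\squid.conf"

if (-not (Test-Path $selectedConfig))
{
    throw "No Squid configuration named [$configName] exists in this package."
}

# Assumes the default Diladele install location for Squid on Windows.
Copy-Item $selectedConfig "C:\Squid\etc\squid.conf" -Force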

I honestly don't remember if I had any issues with creating the Squid proxy package, but I'm sure that if I had, they would be fresh in my mind. MSIs are easy to install silently with msiexec once you know the arguments, and the Squid installer for Windows is pretty reliable. I really do think it was straightforward, especially considering that I was following the same pattern that I'd used to install an MSI via Octopus previously.
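For reference, the silent install itself boils down to something like the following (the file name and log path are illustrative).

$msi = Join-Path $PSScriptRoot "squid.msi"
# /i installs, /qn suppresses the UI and /norestart stops msiexec from rebooting the machine mid-deployment.
$arguments = "/i `"$msi`" /qn /norestart /l*v `"C:\logs\squid-install.log`""

$process = Start-Process -FilePath "msiexec.exe" -ArgumentList $arguments -Wait -PassThru
# 3010 means the install succeeded but wants a reboot, which we deal with separately.
if ($process.ExitCode -ne 0 -and $process.ExitCode -ne 3010)
{
    throw "Squid MSI install failed with exit code [$($process.ExitCode)]."
}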

Delivering the Package

This is standard Octopus territory. Create a project to represent the deployable component, target it appropriately at machines in roles and then deploy to environments. Part of the build script that is responsible for putting together the package above can also automatically deploy it via Octopus to an environment of your choice.
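The deployment step at the end of the build script is just a call to the Octopus command line tool, along the lines of the sketch below. The project name and variables are illustrative, and the exact argument names are from memory, so treat it as a sketch rather than something to copy verbatim.

$octoExecutable = "octo.exe" # resolved from the Octopus command line tools package by our common scripts

& $octoExecutable create-release `
    --project "Squid Proxy" `
    --version $version `
    --deployto $environment `
    --server $octopusServerUrl `
    --apiKey $octopusApiKey `
    --waitfordeployment

if ($LASTEXITCODE -ne 0)
{
    throw "Octopus release creation/deployment failed with exit code [$LASTEXITCODE]."
}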

In TeamCity, we typically do an automatic deploy to CI on every checkin (gated by passing tests), but for this project I had to hold off. We’re actually running low on Build Configurations right now (I’ve already put in a request for more, but the wheels of bureaucracy move pretty slow), so I skipped out on setting one up for the Squid proxy. Once we get some more available configurations I’ll rectify that situation.

Who Could Forget Logs?

The final step in deploying and maintaining anything is to make sure that the logs from the component are being aggregated correctly, so that you don't have to go to the machine(s) in question to see what's going on, and so you have a nice pile of data to do analysis on later. Space is cheap after all, so you might as well store everything, all the time (except media).

Squid features a nice Access log with a well known format, which is perfect for this sort of log processing and aggregation.

Again, using the same sort of approach that I've used for other components, I quickly knocked up a Logstash config for parsing the log file and deployed it (and Logstash) to the same machines as the Squid proxy installations. I'll include that config here, because it lives in a different repository to the rest of the Squid stuff (it would live in the Solavirum.Logging.Logstash repo, if I updated it).

input {
    file {
        path => "@@SQUID_LOGS_DIRECTORY/access.log"
        type => "squid"
        start_position => "beginning"
        sincedb_path => "@@SQUID_LOGS_DIRECTORY/.sincedb"
    }
}

filter {
    if [type] == "squid" {
        grok {
            match => [ "message", "%{NUMBER:timestamp}\s+%{NUMBER:TimeTaken:int} %{IPORHOST:source_ip} %{WORD:squid_code}/%{NUMBER:Status} %{NUMBER:response_bytes:int} %{WORD:Verb} %{GREEDYDATA:url} %{USERNAME:user} %{WORD:squid_peerstatus}/(%{IPORHOST:destination_ip}|-) %{GREEDYDATA:content_type}" ]
        }
        date {
            match => [ "timestamp", "UNIX" ]
            remove_field => [ "timestamp" ]
        }
    }
     
    mutate {
        add_field => { "SourceModuleName" => "%{type}" }
        add_field => { "Environment" => "@@ENVIRONMENT" }
        add_field => { "Application" => "SquidProxy" }
        convert => [ "Status", "string" ]
    }
    
    # This last common mutate deals with the situation where Logstash was creating a custom type (and thus different mappings) in Elasticsearch
    # for every type that came through. The default "type" is logs, so we mutate to that, and the actual type is stored in SourceModuleName.
    # This is a separate step because if you try to do it with the SourceModuleName add_field it will contain the value of "logs" which is wrong.
    mutate {
        update => [ "type", "logs" ]
    }
}

output {
    tcp {
        codec => json_lines
        host => "@@LOG_SERVER_ADDRESS"
        port => 6379
    }
    
    #stdout {
    #    codec => rubydebug
    #}
}

I Can Never Think of a Good Title for the Summary

For reference purposes I’ve included the entire Squid package/environment setup code in this repository. Use as you see fit.

As far as environment setups go, this one was pretty much by the numbers. No major blockers or time wasters. It wasn’t trivial, and it still took me a few days of concentrated effort, but the issues I did have were pretty much just me making mistakes (like setting up security group rules wrong, or failing to tag instances correctly in Octopus, or failing to test the Squid install script locally before I deployed it). The slowest part is definitely waiting for the environment creation to either succeed or fail, because it can take 20+ minutes for the thing to run start to finish. I should look into making that faster somehow, as I get distracted during that 20 minutes.

Really the only reason for the lack of issues was that I’d done all of this sort of stuff before, and I tend to make my stuff reusable. It was a simple matter to plug everything together in the configuration that I needed, no need to reinvent the wheel.

Though sometimes you do need to smooth the wheel a bit when you go to use it again.


Managing subnets in AWS makes me sad. Don't get me wrong, AWS (as per normal) gives you full control over that kind of thing; I'm mostly complaining from an automation point of view.

Ideally, when you design a self contained environment, you want to ensure that it is isolated in as many ways as possible from other environments. Yes you can re-use shared infrastructure from a cost optimization point of view, but conceptually you really do want to make sure that Environment A can’t possibly affect anything in Environment B and vice versa.

As is fairly standard, all of our AWS CloudFormation templates use subnets.

In AWS, a subnet defines a set of available IP addresses (i.e. using CIDR notation, 1.198.143.0/28 represents the 16 addresses 1.198.143.0 – 1.198.143.15). Subnets also define an availability zone (for redundancy, i.e. ap-southeast-2a vs ap-southeast-2b) and whether or not resources using the subnet automatically get a public IP address, and they can be used to define routing rules to restrict access. Route tables and security groups are the main mechanisms by which you can lock down access to your machines, outside of the OS level, so it's important to use them as much as you can. You should always assume that any one of your machines might be compromised and minimise possible communication channels accordingly.
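If the CIDR arithmetic isn't second nature (it certainly isn't for me), the size of a block falls straight out of the prefix length. A quick illustrative snippet:

# A /28 leaves 32 - 28 = 4 host bits, so the block contains 2^4 = 16 addresses.
$prefixLength = 28
$blockSize = [math]::Pow(2, 32 - $prefixLength)

# So 1.198.143.0/28 spans 1.198.143.0 through 1.198.143.15.
# AWS also reserves the first four addresses and the last one in every subnet,
# leaving 11 addresses actually available for your resources in a /28.
Write-Host "/$prefixLength = $blockSize addresses"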

Typically, in a CloudFormation template each resource will have a dependency on one or more subnets (more subnets for highly available resources, like auto scaling groups and RDS instances). The problem is, while it is possible to set up one or many subnets inside a CloudFormation template, there are no real tools available to select an appropriate IP range for your new subnet/s from the available range in the VPC.

What we've had to do as a result of this is set up a couple of known subnets with high capacity (mostly just blocks of 255 addresses) and then use those subnets statically in the templates. We've got a few subnets for publicly accessible resources (usually just load balancers), a few for private web servers (typically only accessible from the load balancers), and so on.

This is less than ideal for various reasons (hard dependency on resources created outside of the template, can’t leverage route tables as cleanly, etc). What I would prefer, is the ability to query the AWS infrastructure for a block of IP addresses at the time the template is executed, and dynamically create subnets like that (setting up route tables as appropriate). To me this feels like a much better way of managing the network infrastructure in the cloud, keeping in line with my general philosophy of self contained environment setup.

Technically the template would probably have a dependency on a VPC, but you could search for that dynamically if you wanted to. Our accounts only have one VPC in them anyway.

The Dream

I can see the set of tools that I want to access in my head, they just don’t seem to exist.

The first thing needed would be a library of some sort, that allows you to supply a VPC (and its meta information) and a set of subnets (also with their meta information) and then can produce for you a new subnet of the desired capacity. For example, if I know that I only need a few IP addresses for the public facing load balancer in my environment, I would get the tool to generate 2 subnets, one in each availability zone in ap-southeast-2, of size 16 or something similarly tiny.
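In Powershell terms, I imagine the core of that library looking something like the sketch below. It is entirely hypothetical (nothing like it exists in our scripts, and the names are made up): give it the VPC CIDR, the CIDRs already allocated, and the prefix length you want, and get back the first free block. It glosses over alignment subtleties and the addresses AWS reserves, but it is the shape of the thing.

function Get-AvailableSubnetCidr
{
    [CmdletBinding()]
    param
    (
        [string]$vpcCidr,            # e.g. "10.0.0.0/16"
        [string[]]$existingCidrs,    # CIDRs of the subnets already allocated in the VPC
        [int]$desiredPrefixLength    # e.g. 28 for a block of 16 addresses
    )

    # Convert a dotted-quad address into a 32 bit integer so we can do range arithmetic on it.
    function ConvertTo-UInt32([string]$address)
    {
        $bytes = ([System.Net.IPAddress]::Parse($address)).GetAddressBytes()
        [array]::Reverse($bytes)
        return [BitConverter]::ToUInt32($bytes, 0)
    }

    $vpcAddress, $vpcPrefix = $vpcCidr -split "/"
    $vpcStart = ConvertTo-UInt32 $vpcAddress
    $vpcEnd = $vpcStart + [math]::Pow(2, 32 - [int]$vpcPrefix) - 1
    $blockSize = [uint32][math]::Pow(2, 32 - $desiredPrefixLength)

    # Work out the ranges that are already taken.
    $taken = @($existingCidrs | ForEach-Object {
        $address, $prefix = $_ -split "/"
        $start = ConvertTo-UInt32 $address
        @{ Start = $start; End = $start + [math]::Pow(2, 32 - [int]$prefix) - 1 }
    })

    # Walk the VPC range in increments of the desired block size and return the first gap.
    for ($candidate = $vpcStart; $candidate + $blockSize - 1 -le $vpcEnd; $candidate += $blockSize)
    {
        $candidateEnd = $candidate + $blockSize - 1
        $overlaps = $taken | Where-Object { $candidate -le $_.End -and $candidateEnd -ge $_.Start }
        if (-not $overlaps)
        {
            $bytes = [BitConverter]::GetBytes([uint32]$candidate)
            [array]::Reverse($bytes)
            $address = (New-Object System.Net.IPAddress -ArgumentList @(,$bytes)).ToString()
            return "$address/$desiredPrefixLength"
        }
    }

    throw "No free /$desiredPrefixLength block left in [$vpcCidr]."
}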

The second thing would be a visualization tool built on top of the library, that let you view your address space as a giant grid, zoomable, with important elements noted, like coloured subnets, resources currently using IP addresses and if you wanted to get really fancy, route tables and their effects on communication.

Now you may be thinking, you’re a programmer, why don’t you do it? The answer is, I’m considering it pretty hard, but while the situation does annoy me, it hasn’t annoyed me enough to spur me into action yet. I’m posting up the idea on the off chance someone who is more motivated than me grabs it and runs with it.

Downfall

There is at least one downside that I can think of with using a library to create subnets of the appropriate size.

It's a similar issue to memory allocation and management. As the size and number of the IP address ranges you need is likely to change from template to template, the addressable space will eventually suffer from fragmentation. In memory management, this is solved by doing some sort of compacting or other de-fragmentation activity. For IP address ranges, I'm not sure how you could solve that issue. You could probably update the environment to use new subnets, re-allocated to minimise fragmentation, but I think it's likely to be more trouble than it's worth.

Summary

To summarise, I really would like a tool to help me visualize the VPC (and its subnets, security groups, etc) in my AWS account. I’d settle for something that just lets me visualize my subnets in the context of the total addressable space.

I might write it.

You might write it.

Someone should.


We've spent a significant amount of effort recently ensuring that our software components are automatically built and deployed. It's not something new, and it's certainly something that some of our components already had in place, but nothing was ever made generic enough to reuse. The weak spot in our build/deploy pipeline is definitely tests though. We've had a few attempts in the past to get test automation happening as part of the build, and while it has worked on an individual component basis, we've never really taken a holistic look at the process and made it easy to apply to a range of components.

I've mentioned this before, but to me tests fall into 3 categories: Unit, Integration and Functional. Unit tests cover the smallest piece of functionality, usually algorithms or classes with all dependencies stubbed or mocked out. Integration tests cover whether all of the bits are configured to work together properly, and can be used to verify features in a controlled environment. Functional tests cover the application from a feature point of view. For example, functional tests for a web service would be run against it after it is deployed, verifying users can interact with it as expected.

From my point of view, the ideal flow is as follows:

Checkin – Build – Unit and Integration Tests – Deploy (CI) – Functional Tests – Deploy (Staging)

Obviously I’m talking about web components here (sites, services, etc), but you could definitely apply it to any component if you tried hard enough.

The nice part of this flow is that you can do any manual testing/exploration/early integration on the Staging environment, with the guarantee that it will probably not be broken by a bad deploy (because the functional tests will protect against that and prevent the promotion to staging).

Aren’t All Cities Full of Teams

We use Team City as our build platform and Octopus as our deployment platform, and thanks to these components we have the checkin, build and deployment parts of the pipeline pretty much taken care of.

My only issue with these products is that they are so configurable and powerful that people often use them to store complex build/deployment logic. This makes me sad, because that logic belongs as close to the code as possible, ideally in the same repository. I think you should be able to grab a repository and build it, without having to use an external tool to put all the pieces together. It's also an issue if you need to change your build logic but still allow for older builds (maybe a hotfix branch or something). If you store your build logic in source control, this situation just works, because the logic is right there with the code.

So I mostly use Team City to trigger builds and collect history about previous builds (and their output), which it does a fine job at. Extending that thought, I use Octopus to manage environments and machines, but all the logic for how to install a component lives in the deployable package (which can be built with minimal fuss from the repository).

I do have to mention that these tools do have elements of change control, and do allow you to version your Build Configurations (TeamCity)/Projects (Octopus). I just prefer that this logic lives with the source, because then the same version is applied to everything.

All of our build and deployment logic lives in source control, right next to the code. There is a single powershell script (unsurprisingly called build.ps1) per repository, acting as the entry point. The build script in each repository is fairly lightweight, leveraging a set of common scripts downloaded from our Nuget server, to avoid duplicating logic.

Team City calls this build script with some appropriate parameters, and it takes care of the rest.
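As a concrete (if simplified) example, the TeamCity build step for a deployable component is not much more than the invocation below. The parameter names are illustrative rather than gospel; everything interesting, including pulling down the common scripts from our Nuget server, happens inside build.ps1.

# %...% placeholders are TeamCity parameters, substituted before the step runs.
.\build.ps1 -Deploy `
    -Environment "CI" `
    -OctopusServerUrl "%octopus.server.url%" `
    -OctopusServerApiKey "%octopus.api.key%"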

Testy Testy Test Test

Until recently, our generic build script didn’t automatically execute tests, which was an obvious weakness. Being that we are in the process of setting up a brand new service, I thought this would be the ideal time to fix that.

To tie in with the types of tests I mentioned above, we generally have 2 projects that live in the same solution as the main body of code (X.Tests.Unit and X.Tests.Integration, where X is the component name), and then another project that lives in parallel called X.Tests.Functional. The Functional tests project is kind of a new thing that we’re trying out, so is still very much in flux. The other two projects are well accepted at this point, and consistently applied.

Both Unit and Integration tests are written using NUnit. We went with NUnit over MSTest for reasons that seemed valid at the time, but which I can no longer recall with any level of clarity. I think it might have been something about the support for data driven tests, or the ability to easily execute the tests from the command line? MSTest offers both of those things though, so I'm honestly not sure. I'm sure we had valid reasons though.

The good thing about NUnit, is that the NUnit Runner is a NuGet package of its own, which fits nicely into our dependency management strategy. We’ve written powershell scripts to manage external components (like Nuget, 7Zip, Octopus Command Line Tools, etc) and the general pattern I’ve been using is to introduce a Functions-Y.ps1 file into our CommonDeploymentScripts package, where Y is the name of the external component. This powershell file contains functions that we need from the external component (for example for Nuget it would be Restore, Install, etc) and also manages downloading the dependent package and getting a reference to the appropriate executable.

This approach has worked fairly well up to this point, so my plan was to use the same pattern for test execution. I’d need to implement functions to download and get a reference to the NUnit runner, as well as expose something to run the tests as appropriate. I didn’t only require a reference to NUnit though, as we also use OpenCover (and ReportGenerator) to get code coverage results when running the NUnit tests. Slightly more complicated, but really just another dependency to manage just like NUnit.
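Following that pattern, the NUnit side of things ends up as a Functions-NUnit.ps1 along the lines of the sketch below. It is simplified; the Nuget-Install helper and the exact package version are assumptions standing in for the real functions in our common scripts.

function Get-NUnitConsoleExecutable
{
    [CmdletBinding()]
    param
    (
        [System.IO.DirectoryInfo]$toolsDirectory
    )

    # Pull down the NUnit.Runners package (which carries nunit-console.exe in its tools directory)
    # via the existing Nuget helper functions, then hand back a reference to the executable.
    $version = "2.6.4"
    Nuget-Install -PackageId "NUnit.Runners" -Version $version -OutputDirectory $toolsDirectory.FullName

    return Get-ChildItem -Path $toolsDirectory.FullName -Recurse -Filter "nunit-console.exe" |
        Select-Object -First 1
}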

Weirdly Smooth

In a rare twist of fate, I didn’t actually encounter any major issues implementing the functions for running tests. I was surprised, as I always run into some crazy thing that saps my time and will to live. It was nice to have something work as intended, but it was probably primarily because this was a refactor of existing functionality. We already had the script that ran the tests and got the coverage metrics, I was just restructuring it and moving it into a place where it could be easily reused.

I wrote some very rudimentary tests to verify that the automatic downloading of the dependencies was working, and then set to work incorporating the execution of the tests into our build scripts.

function FindAndExecuteNUnitTests
{
    [CmdletBinding()]
    param
    (
        [System.IO.DirectoryInfo]$searchRoot,
        [System.IO.DirectoryInfo]$buildOutput
    )

    Write-Host "##teamcity[blockOpened name='Unit and Integration Tests']"

    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. That's bad, it's used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"
    . "$rootDirectoryPath\scripts\common\Functions-OpenCover.ps1"

    $testAssemblySearchPredicate = { 
            $_.FullName -like "*release*" -and 
            $_.FullName -notlike "*obj*" -and
            (
                $_.Name -like "*integration*" -or 
                $_.Name -like "*unit*"
            )
        }
    Write-Verbose "Locating test assemblies using predicate [$testAssemblySearchPredicate]."
    $testLibraries = Get-ChildItem -File -Path $searchRoot.FullName -Recurse -Filter "*.Test*.dll" |
        Where $testAssemblySearchPredicate
            
    $failingTestCount = 0
    foreach ($testLibrary in $testLibraries)
    {
        $testSuiteName = $testLibrary.Name
        Write-Host "##teamcity[testSuiteStarted name='$testSuiteName']"
        $result = OpenCover-ExecuteTests $testLibrary
        $failingTestCount += $result.NumberOfFailingTests
        $newResultsPath = "$($buildOutput.FullName)\$($result.LibraryName).TestResults.xml"
        Copy-Item $result.TestResultsFile "$newResultsPath"
        Write-Host "##teamcity[importData type='nunit' path='$newResultsPath']"

        Copy-Item $result.CoverageResultsDirectory "$($buildOutput.FullName)\$($result.LibraryName).CodeCoverageReport" -Recurse

        Write-Host "##teamcity[testSuiteFinished name='$testSuiteName']"
    }

    Write-Host "##teamcity[publishArtifacts '$($buildOutput.FullName)']"
    Write-Host "##teamcity[blockClosed name='Unit and Integration Tests']"

    if ($failingTestCount -gt 0)
    {
        throw "[$failingTestCount] Failing Tests. Aborting Build."
    }
}

As you can see, it's fairly straightforward. After a successful build, the source directory is searched for all DLLs with Tests in their name that also appear in the release directory and are also named with either Unit or Integration. These DLLs are then looped through, and the tests executed on each one (using the OpenCover-ExecuteTests function from the Functions-OpenCover.ps1 file), with the results being added to the build output directory. A record of the number of failing tests is kept, and if we get to the end with any failing tests an exception is thrown, which is intended to prevent the deployment of faulty code.

The build script that I extracted the excerpt above from lives inside our CommonDeploymentScripts package, which I have replicated into this Github repository.

I also took this opportunity to write some tests to verify that the build script was working as expected. In order to do that, I had to create a few dummy Visual Studio projects (one for a deployable component via Octopack and another for a simple library component). At the start of each test, these dummy projects are copied to a working directory, and then mutated as necessary in order to provide the appropriate situation that the test needs to verify.

The best example of this is the following test:

Describe "Build-DeployableComponent" {
    Context "When deployable component with failing tests supplied and valid deploy" {
        It "An exception is thrown indicating build failure" {
            $creds = Get-OctopusCredentials

            $testDirectoryPath = Get-UniqueTestWorkingDirectory
            $newSourceDirectoryPath = "$testDirectoryPath\src"
            $newBuildOutputDirectoryPath = "$testDirectoryPath\build-output"

            $referenceDirectoryPath = "$rootDirectoryPath\src\TestDeployableComponent"
            Copy-Item $referenceDirectoryPath $testDirectoryPath -Recurse

            MakeTestsFail $testDirectoryPath
            
            $project = "TEST_DeployableComponent"
            $environment = "CI"
            try
            {
                $result = Build-DeployableComponent -deploy -environment $environment -OctopusServerUrl $creds.Url -OctopusServerApiKey $creds.ApiKey -projects @($project) -DI_sourceDirectory { return $testDirectoryPath } -DI_buildOutputDirectory { return $newBuildOutputDirectoryPath }
            }
            catch 
            {
                $exception = $_
            }

            $exception | Should Not Be $null

            . "$rootDirectoryPath\scripts\common\Functions-OctopusDeploy.ps1"

            $projectRelease = Get-LastReleaseToEnvironment -ProjectName $project -EnvironmentName $environment -OctopusServerUrl $creds.Url -OctopusApiKey $creds.ApiKey
            $projectRelease | Should Not Be $result.VersionInformation.New
        }
    }
}

As you can see, there is a step in this test to make the dummy tests fail. All this does is rewrite one of the classes to return a different value than is expected, but it's enough to fail the tests in the solution. By doing this, we can verify that a failing test does in fact lead to no deployment.
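For completeness, MakeTestsFail is nothing fancy. The sketch below is a hypothetical reconstruction (the file name and the string being replaced are made up), but it captures the idea: mutate the copied dummy project so its unit tests can no longer pass.

function MakeTestsFail
{
    [CmdletBinding()]
    param
    (
        [string]$testDirectoryPath
    )

    # Hypothetical example: change the value returned by a class that the dummy tests assert on.
    $classFile = "$testDirectoryPath\TEST_DeployableComponent\ValueProvider.cs"
    (Get-Content $classFile) -replace 'return "expected";', 'return "broken";' |
        Set-Content $classFile
}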

Summary

Nothing that I've said or done above is particularly ground-breaking. It's all very familiar to anyone who is doing continuous integration/deployment. Having tests is fantastic, but unless they take part in your build/deploy pipeline they are almost useless. That's probably a bit harsh, but if you can deploy code without running the tests on it, you will (with the best of intentions no doubt), and that doesn't lead anywhere good.

Our approach doesn’t leverage the power of TeamCity directly, due to my reluctance to store complex logic there. There are upsides and downsides to this, mostly that you trade off owning the implementation of the test execution against keeping all your logic in one place.

Obviously I prefer the second approach, but your mileage may vary.


The service that I’ve mentioned previously (and the iOS app it supports) has been in beta now for a few weeks. People seem relatively happy with it, both from a performance standpoint and due to the fact that it doesn’t just arbitrarily lose their information, unlike the previous version, so we’ve got that going for us, which is nice.

We did a fair amount of load testing on it before it went out to beta, but only for small numbers of concurrent users (< 100), to make sure that our beta experience would be acceptable. That load testing picked up a few issues, including one where the service would happily (accidentally of course) delete other peoples data. It wasn’t a permissions issue, it was due to the way in which we were keying our image storage. More importantly, the load testing found issues with the way in which we were storing images (we were using Raven 2.5 attachments) and how it just wasn’t working from a performance point of view. We switched to storing the files in S3, and it was much better.

I believe the newer version of Raven has a new file storage mechanism that is much better. I don’t even think Ayende recommends that you use the attachments built into Raven 2.5 for any decent amount of file storage.

Before we went live, we knew that we needed to find the breaking point of the service: the number of concurrent users at which its performance degraded to the point where it was unusable (at least for the configuration that we were planning on going live with). If that number was too low, we knew we would need to make some additional changes, either in terms of infrastructure (beefier AWS instances, more instances in the Auto Scaling Group) or in terms of code.

We tried to simply run a huge number of users through our load tests locally (which is how we did the first batch of load testing, locally using JMeter) but we capped out our available upload bandwidth pretty quickly, well below the level of traffic that the service could handle.

It was time to farm the work out to somewhere else, somewhere with a huge amount of easily accessibly computing resources.

Where else but Amazon Web Services?

I’ve Always Wanted to be a Farmer

The concept was fairly straightforward. We had a JMeter configuration file that contained all of our load tests. It was parameterised by the number of users, so conceptually the path would be to spin up some worker instances in EC2, push JMeter, its dependencies and our config to them, then execute the tests. This way we could tune the number users per instance along with the total number of worker instances, and we would be able to easily put enough pressure on the service to find its breaking point.

JMeter gives you the ability to set the value of variables via the command line. Be careful though, as the variable names are case sensitive. That one screwed me over for a while, as I couldn't figure out why the value of my variables was still the default on every machine I started the tests on. For the variable that defined the maximum number of users it wasn't so bad, if a bit confusing. The other variable, which defined the seed for the user identity, was more of an issue when it wasn't working, because it meant the same user was doing similar things from multiple machines. Still a valid test, but not the one I was aiming to do, as the service isn't designed for concurrent access like that.
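For reference, the non-GUI invocation on each worker looks something like the sketch below (paths and property names are illustrative). The -J arguments set JMeter properties, which the test plan reads via ${__P(totalNumberOfUsers)} and friends, and those property names are the case sensitive part that caught me out.

& "C:\tools\apache-jmeter\bin\jmeter.bat" -n `
    -t "C:\cfn\dependencies\load-tests.jmx" `
    "-JtotalNumberOfUsers=100" `
    "-JstartingCustomerNumber=1" `
    -l "C:\logs\load-test-results.jtl"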

We wouldn’t want to put all of that load on the service all at once though, so we needed to stagger when each instance started its tests.

Leveraging the work I’d done previously for setting up environments, I created a Cloud Formation template containing an Auto Scaling Group with a variable number of worker instances. Each instance would have the JMeter config file and all of its dependencies (Java, JMeter, any supporting scripts) installed during setup, and then be available for remote execution via Powershell.

The plan was to hook into that environment (or setup a new one if one could not be found), find the worker instances and then iterate through them, starting the load tests on each one, making sure to stagger the time between starts to some reasonable amount. The Powershell script for doing exactly that is below:

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$environmentName,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2"
)

$ErrorActionPreference = "Stop"

$currentDirectoryPath = Split-Path $script:MyInvocation.MyCommand.Path
write-verbose "Script is located at [$currentDirectoryPath]."

. "$currentDirectoryPath\_Find-RepositoryRoot.ps1"

$repositoryRoot = Find-RepositoryRoot $currentDirectoryPath

$repositoryRootDirectoryPath = $repositoryRoot.FullName
$commonScriptsDirectoryPath = "$repositoryRootDirectoryPath\scripts\common"

. "$repositoryRootDirectoryPath\scripts\environment\Functions-Environment.ps1"

. "$commonScriptsDirectoryPath\Functions-Aws.ps1"

Ensure-AwsPowershellFunctionsAvailable

$stack = $null
try
{
    $stack = Get-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion
}
catch 
{
    Write-Warning $_
}

if ($stack -eq $null)
{
    # Get-Environment didn't find anything, so create the environment from scratch.
    $stack = New-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -UpdateExisting:$false -Wait -disableCleanupOnFailure
}

$autoScalingGroupName = $stack.AutoScalingGroupName

$asg = Get-ASAutoScalingGroup -AutoScalingGroupNames $autoScalingGroupName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$instances = $asg.Instances

. "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

$remoteUser = "Administrator"
$remotePassword = "ObviouslyInsecurePasswordsAreTricksyMonkeys"
$securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)

$usersPerMachine = 100
$nextAvailableCustomerNumber = 1
$jobs = @()
foreach ($instance in $instances)
{
    # Get the instance
    $instance = Get-AwsEc2Instance -InstanceId $instance.InstanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    $ipAddress = $instance.PrivateIpAddress
    
    $session = New-PSSession -ComputerName $ipAddress -Credential $cred

    $remoteScript = {
        param
        (
            [int]$totalNumberOfUsers,
            [int]$startingCustomerNumber
        )
        Set-ExecutionPolicy -ExecutionPolicy Bypass
        & "C:\cfn\dependencies\scripts\jmeter\execute-load-test-no-gui.ps1" -totalNumberOfUsers $totalNumberOfUsers -startingCustomerNumber $startingCustomerNumber -AllocatedMemory 512
    }
    $job = Invoke-Command -Session $session -ScriptBlock $remoteScript -ArgumentList $usersPerMachine,$nextAvailableCustomerNumber -AsJob
    $jobs += $job
    $nextAvailableCustomerNumber += $usersPerMachine

    #Sleep -Seconds ([TimeSpan]::FromHours(2).TotalSeconds)
    Sleep -Seconds 300

    # Can use Get-Job or record list of jobs and then terminate them. I suppose we could also wait on all of them to be complete. Might be good to get some feedback from
    # the remote process somehow, to indicate whether or not it is still running/what it is doing.
}

Additionally, I’ve recreated and reuploaded the repository from my first JMeter post, containing the environment template and scripts for executing the template, as well as the script above. You can find it here.

The last time I uploaded this repository I accidentally compromised our AWS deployment credentials, so I tore it down again very quickly. Not my brightest moment, but you can rest assured I’m not making the same mistake twice. If you look at the repository, you’ll notice that I implemented the mechanism for asking for credentials for tests so I never feel tempted to put credentials in a file ever again.

We could watch the load tests kick into gear via Kibana, and keep an eye on when errors start to occur and why.

Obviously we didn’t want to run the load tests on any of the existing environments (which are in use for various reasons), so we spun up a brand new environment for the service, fired up the script to farm out the load tests (with a 2 hour delay between instance starts) and went home for the night.

15 minutes later, Production (the environment actively being used for the external beta) went down hard, and so did all of the others, including the new load test environment.

Separately Dependent

We had gone to great lengths to make sure that our environments were independent. That was the entire point behind codifying the environment setup, so that we could spin up all resources necessary for the environment, and keep it isolated from all of the other ones.

It turns out they weren’t quite as isolated as we would have liked.

Like a lot of AWS setups, we have an internet gateway, allowing resources internal to our VPC (like EC2 instances) access to the internet. By default, only resources with an external IP can access the internet through the gateway. Other resources have to use some other mechanism for accessing the internet. In our case, that other mechanism is a Squid proxy.

It was this proxy that was the bottleneck. Both the service under test and the load test workers themselves were slamming it, the service in order to talk to S3 and the load test workers in order to hit the service (through its external URL).

We recently increased the specs on the proxy machine (because of a similar problem discovered during load testing with fewer users) and we thought that maybe it would be powerful enough to handle the incoming requests. It probably would have been if it wasn't for the double load (i.e. if the load test requests had been coming from an external party and the only traffic going through the proxy was to S3 from the service).

In the end the load tests did exactly what they were supposed to do, even if they did it in an unexpected way. They pushed the system to its breaking point, allowing us to identify where it broke and schedule improvements to prevent the situation from occurring again.

Actions Speak Louder Than Words

What are we going to do about it? There are a number of things I have in mind.

The first is to not have a single proxy instance and instead have an auto scaling group that scales as necessary based on load. I like this idea and I will probably be implementing it at some stage in the future. To be honest, as a shared piece of infrastructure, this is how it should have been implemented in the first place. I understand that the single instance (configured lovingly by hand) was probably quicker and easier initially, but for such a critical piece of infrastructure, you really do need to spend the time to do it properly.

The second is to have environment specific proxies, probably as auto scaling groups anyway. This would give me more confidence that we won’t accidentally murder production services when doing internal things, just from an isolation point of view. Essentially, we should treat the proxy just like we treat any other service, and be able to spin them up and down as necessary for whatever purposes.

The third is to isolate our production services entirely, either with another VPC just for production, or even another AWS account just for production. I like this one a lot, because as long as we have shared environments, I’m always terrified I’ll screw up a script and accidentally delete everything. If production wasn’t located in the same account, that would literally be impossible. I’ll be trying to make this happen over the coming months, but I’ll need to move quickly, as the more stuff we have in production, the harder it will be to move.

The last optimisation is to use the new VPC endpoint feature in AWS to avoid having to go to the internet in order to access S3, which I have already done. This really just delays the root issue (shared single point of failure), but it certainly solves the immediate problem and should also provide a nice performance boost, as it removes the proxy from the picture entirely for interactions with S3, which is nice.
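For the curious, creating the endpoint comes down to a single call with the AWS Powershell cmdlets, something along these lines (the IDs are placeholders and the parameter names are from memory, so double check them). You attach the endpoint to the route tables used by the private subnets, and traffic to S3 stops going anywhere near the proxy or the internet gateway.

New-EC2VpcEndpoint -VpcId "vpc-12345678" `
    -ServiceName "com.amazonaws.ap-southeast-2.s3" `
    -RouteTableId @("rtb-11111111", "rtb-22222222") `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion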

Conclusion

To me, this entire event proved just how valuable load testing is. As I stated previously, it did exactly what I expected it to do: find where the service breaks. It broke in an entirely unexpected way (and broke other things as well), but honestly this is probably the best outcome, because that would have happened at some point in the future anyway (whenever we hit the saturation point for the proxy) and I'd prefer it to happen now, when we're in beta and managing communications with every user closely, than later, when everybody and their dog are using the service.

Of course, now we have a whole lot more infrastructure work to complete before we can go live, but honestly, the work is never really done anyway.

I still hate proxies.


I’ve been doing a lot of work with AWS recently.

For the last service component that we developed, we put together a CloudFormation template and a series of Powershell scripts to set up, tear down and migrate environments (like CI, Staging, Production, etc). It was extremely effective, barring some issues that we still haven't quite solved with data migration between environment versions and updating machine configuration settings.

In the first case, an environment is obviously not stateless once you start using it, and you need a good story about maintaining user data between environment versions, at the very least for Production.

In the second case tearing down an entire environment just to update a configuration setting is obviously sub-optimal. We try to make sure that most of our settings are encapsulated within components that we deploy, but not everything can be done this way. CloudFormation does have update mechanisms, I just haven’t had a chance to investigate them yet.

But I digress; let's switch to an entirely different topic for this post: how to give secure access to objects in an S3 bucket during the initialization of EC2 instances while executing a CloudFormation template.

That was a mouthful.

Don’t Do What Donny Don’t Does

My first CloudFormation template/environment setup system had a fairly simple rule. Minimise dependencies.

There were so many example templates on the internet that just downloaded arbitrary scripts or files from GitHub or S3, and to me that’s the last thing you want. When I run my environment setup (ideally from within a controlled environment, like TeamCity) I want it to use the versions of the resources that are present in the location I’m running the script from. It should be self contained.

Based on that rule, I put together a fairly simple process where the Git Repository housing my environment setup contained all the necessary components required by the resources in the CloudFormation template, and the script was responsible for collecting and then uploading those components to some location that the resources could access.

At the time, I was not very experienced with S3, so I struggled a lot with getting the right permissions.

Eventually I solved the issue by handing off the AWS Key/Secret to the CloudFormation template, and then using those credentials in the AWS::CloudFormation::Authentication block inside the resource (LaunchConfig/Instance). The URL of the dependencies archive was then supplied to the source element of the first initialization step in the AWS::CloudFormation::Init block, which used the supplied credentials to download the file and extract its contents (via cfn-init) to a location on disk, ready to be executed by subsequent components.

This worked, but it left a bad taste in my mouth once I learnt about IAM roles.

IAM roles give you the ability to essentially organise sets of permissions that can be applied to resources, like EC2 instances. For example, we have a logs bucket per environment that is used to capture ELB logs. Those logs are then processed by Logstash (indirectly, because I can’t get the goddamn S3 input to work with a proxy, but I digress) on a dedicated logs processing instance. I could have gone about this in two ways. The first would have been to supply the credentials to the instance, like I had in the past. This exposes those credentials on the instance though, which can be dangerous. The second option is to apply a role to the instance that says “you are allowed to access this S3 bucket, and you can do these things to it”.

I went with the second option, and it worked swimmingly (once I got it all configured).

Looking back at the way I had done the dependency distribution, I realised that using IAM roles would be a more secure option, closer to best practice. Now I just needed a justifiable opportunity to implement it.

New Environment, Time to Improve

We’ve started work on a new service, which means new environment setup. This is a good opportunity to take what you’ve done previously and reuse it, improving it along the way. For me, this was the perfect chance to try and use IAM roles for the dependency distribution, removing all of those nasty “credentials in the clear” situations.

I followed the same process that I had for the logs processing. Set up a role describing the required policy (readonly access to the S3 bucket that contains the dependencies) and then link that role to a profile. Finally, apply the profile to the instances in question.

"ReadOnlyAccessToDependenciesBucketRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/",
        "Policies" : [
            {
                "Version" : "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                    },
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:ListBucket" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                    }
                ]
            }
        ]
    }
},
"ReadOnlyDependenciesBucketInstanceProfile": {    
    "Type": "AWS::IAM::InstanceProfile",    
    "Properties": { 
        "Path": "/", 
        "Roles": [ { "Ref": "ReadOnlyDependenciesBucketRole" }, { "Ref": "FullControlLogsBucketRole" } ] 
    }
},
"InstanceLaunchConfig": {    
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {        
        * snip *    
    },    
    "Properties": {        
        "KeyName": { "Ref": "KeyName" },        
        "ImageId": { "Ref": "AmiId" },        
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],        
        "InstanceType": { "Ref": "InstanceType" },        
        "IamInstanceProfile": { "Ref": "ReadOnlyDependenciesBucketInstanceProfile" },        
        "UserData": {            
            * snip *        
        }    
    }
}

It worked before, so it should work again, right? I’m sure you can probably guess that that was not the case.

The first mistake I made was attempting to specify multiple roles in a single profile. I wanted to do this because the logs processor needed to maintain its permissions to the logs bucket, but it needed the new permissions to the dependencies bucket as well. Even though the roles element is defined as an array, it can only accept a single element. I now hate whoever designed that, even though I’m sure they probably had a good reason.

At least that was an easy fix, flip the relationship between roles and policies. I split the inline policies out of the roles, then linked the roles to the policies instead. Each profile only had 1 role, so everything should have been fine.

"ReadOnlyDependenciesBucketPolicy": {
    "Type":"AWS::IAM::Policy",
    "Properties": {
        "PolicyName": "ReadOnlyDependenciesBucketPolicy",
        "PolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                },
                {
                    "Effect": "Allow",
                    "Action": [ "s3:ListBucket" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                }
            ]
        },
        "Roles": [
            { "Ref" : "InstanceRole" },
            { "Ref" : "OtherInstanceRole" }
        ]
    }
},
"InstanceRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/"
    }
},
"InstanceProfile": {
    "Type": "AWS::IAM::InstanceProfile",
    "Properties": { "Path": "/", "Roles": [ { "Ref": "InstanceRole" } ] }
}

Ha ha ha ha ha, no.

The cfn-init logs showed that the process was getting 403s when trying to access the S3 object URL. I had incorrectly assumed that because the instance was running with the appropriate role (and it was; if I remoted onto the instance and attempted to download the object from S3 via the AWS Powershell Cmdlets, it worked just fine), cfn-init would use that role.

It does not.

You still need to specify the AWS::CloudFormation::Authentication element, naming the role and the bucket that it will be used for. This feels a little crap to be honest. Surely the cfn-init application is using the same AWS components, so why doesn't it just pick up the credentials from the instance profile like everything else does?

Anyway, I added the Authentication element with appropriate values, like so.

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
          "S3AccessCreds": {
            "type": "S3",
            "roleName": { "Ref" : "InstanceRole" },
            "buckets" : [ { "Ref" : "DependenciesS3Bucket" } ]
          }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Then I started getting different errors. You may think this is a bad thing, but I disagree; different errors mean progress. I'd switched from getting 403 responses (access denied) to getting 404s (not found).

Like I said, progress!

The Dependencies Archive is a Lie

It was at this point that I gave up trying to use the IAM roles. I could not for the life of me figure out why it was returning a 404 for a file that clearly existed. I checked and double checked the path, and even used the same path to download the file via the AWS Powershell Cmdlets on the machines that were having the issues. It all worked fine.

Assuming the issue was with my IAM role implementation, I rolled back to the solution that I knew worked: specifying the Access Key and Secret in the AWS::CloudFormation::Authentication element of the LaunchConfig. I also removed the new IAM role resources (for readonly access to the dependencies archive).

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
            "S3AccessCreds": {
                "type": "S3",
                "accessKeyId" : { "Ref" : "DependenciesS3BucketAccessKey" },
                "secretKey" : { "Ref": "DependenciesS3BucketSecretKey" },
                "buckets" : [ { "Ref":"DependenciesS3Bucket" } ]
            }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Imagine my surprise when it also didn’t work, throwing back the same response, 404 not found.

I tried quite a few things over the next few hours, and there was much gnashing and wailing of teeth. I’ve seen some weird crap with S3 and bucket names (too long and you get errors, weird characters in your key and you get errors, etc) but as far as I could tell, everything was kosher. Yet it just wouldn’t work.

After doing a line by line diff against the template/scripts that were working (the other environment setup) and my new template/scripts, I realised my error.

While working on the IAM role stuff, trying to get it to work, I had attempted to remove case sensitivity from the picture by calling ToLowerInvariant on the dependencies archive URL that I was passing to my template. The old script/template combo didn’t do that.

When I took that out, it worked fine.

The issue was that the key of the file being uploaded was not being turned into lower case, only the URL of the resulting file was, and AWS keys are case sensitive.
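In other words, the upload and the URL generation disagreed about casing, something like the sketch below (the key, bucket and variable names are illustrative).

$key = "MyEnvironment/1.0.123/dependencies.zip"

# The archive goes up to S3 with its original, mixed case key...
Write-S3Object -BucketName $dependenciesBucket -Key $key -File $dependenciesArchive -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

# ...but the URL handed to the CloudFormation template was lowercased, so cfn-init asked S3
# for a key that didn't exist, and S3 quite correctly returned a 404.
$dependenciesArchiveUrl = "https://s3-ap-southeast-2.amazonaws.com/$dependenciesBucket/$key".ToLowerInvariant()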

Goddamn it.

Summary

I lost basically an entire day to case sensitivity. It's not even the first time this has happened to me (well, it's the first time it's happened in S3 I think). I come from a heavy Windows background. I don't even consider case sensitivity to be a thing. I can understand why it's a thing (technically different characters and all), but it's just not a thing on Windows, so it's not even on my radar most of the time. I assume the case sensitivity in S3 is a result of the AWS backend being Unix/Linux based, but it's still a shock to find a case sensitive URL.

It turns out that my IAM stuff had started working just fine and I was getting 404s for an entirely different reason. I had assumed that I was still doing something wrong with my permissions and the API was just giving a crappy response (i.e. not really a 404, some sort of permission based can't-find-file error masquerading as a 404).

At the very least I didn’t make the silliest mistake you can make in software (assuming the platform is broken), I just assumed I had configured it wrong somehow. That’s generally a fairly safe assumption when you’re using a widely distributed system. Sometimes you do find a feature that is broken, but it is far more likely that you are just doing it wrong. In my case, the error message was completely accurate, and was telling me exactly the right thing, I just didn’t realise why.

Somewhat ironically, the root cause of my 404 issue was my attempt to remove case sensitivity from the picture when I was working on getting the IAM stuff up and running. I just didn’t apply the case insensitivity consistently.

Ah well.