
It's time to fix that whole shared infrastructure issue.

I need to farm my load tests out to AWS, and I need to do it in a way that won't accidentally murder our current production servers. In order to do that, I need to forge the replacement for our manually created and configured proxy box: a nice, codified, auto-scaling proxy environment.

Back into CloudFormation I go.

I contemplated simply using a NAT box instead of a proxy, but decided against it because:

  • We already use a proxy, so assuming my template works as expected it should be easy enough to slot in,
  • I don’t have any experience with NAT boxes (I’m pretty weak on networking in general actually),
  • Proxies scale better in the long run, so I might as well sort that out now.

Our current lone proxy machine is a Linux instance with Squid manually installed on it. It was set up some time before I started, by someone who no longer works at the company. An excellent combination: I'm already a bit crap at Linux, and now I can't even ask anyone how it was put together or what sort of tweaks were made to it over time as failures were encountered. Time to start from scratch. The proxy itself is sound enough, and I have some experience with Squid, so I'll stick with it. As for the OS, while I know that Linux would likely be faster with less overhead, I'm far more comfortable with Windows, so to hell with Linux for now.

Here's the plan: create a CloudFormation template for the actual environment (Load Balancer, Auto Scaling Group, Instance Configuration, DNS Record), and create a NuGet package, to be deployed via Octopus, that installs and configures the proxy.

I've always liked the idea of never installing software manually, but it's only recently that I've had access to the tools to accomplish that. Octopus, NuGet and Powershell form a very powerful combination for managing deployments on Windows. I have no idea what the equivalent is for Linux, but I'm sure there is something. At some point in the future Octopus is going to offer the ability to do SSH deploys, which will allow me to include more Linux infrastructure (or manage existing Linux infrastructure even better; I'm looking at you, ELK stack).

Save the Environment

The environment is pretty simple: a Load Balancer hooked up to an Auto Scaling Group, whose instances are configured to do some simple setup (including using Octopus to deploy some software), and a DNS record so that I can refer to the load balancer in a nice way.

I’ve done enough of these simple sorts of environments now that I didn’t really run into any interesting issues. Don’t get me wrong, they aren’t trivial, but I wasn’t stuck smashing my head against a desk for a few days while I sorted out some arcane problem that ended up being related to case sensitivity or something ridiculous like that.

One thing I have learned is to set up the Octopus project that will be deployed during environment setup ahead of time. Give it some trivial content, like running a Powershell script, and then make sure it deploys correctly during the startup of the instances in the Auto Scaling Group. If you try to sort out the package and its deployment at the same time as the environment, you'll probably run into situations where the environment infrastructure came up fine, but because the deployment of the package failed, the whole setup is marked as failed and you have to wait another 20 minutes for the next attempt. It really saves a lot of time to create the environment in such a way that you can extend it with deployments later.

Technically you could also make it so failing deployments don’t fail an environment setup, but I like my environments to work when they are “finished”, so I’m not really comfortable with that in the long run.

The only tricky things about the proxy environment are making sure that you set up your security groups appropriately so that the proxy port can be accessed, and making sure that you use the correct health check for the load balancer (for Squid at least, TCP on port 3128, the default Squid port, is a good health check).
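
For illustration, the interesting bits of the load balancer definition end up looking something like this (resource and parameter names are made up for the example, not lifted from my actual template):

"ProxyLoadBalancer": {
    "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties": {
        "Subnets": [ { "Ref": "PrivateSubnetA" }, { "Ref": "PrivateSubnetB" } ],
        "SecurityGroups": [ { "Ref": "LoadBalancerSecurityGroup" } ],
        "Listeners": [ { "LoadBalancerPort": "3128", "InstancePort": "3128", "Protocol": "TCP" } ],
        "HealthCheck": {
            "Target": "TCP:3128",
            "HealthyThreshold": "3",
            "UnhealthyThreshold": "5",
            "Interval": "30",
            "Timeout": "5"
        }
    }
}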

That's a Nice Package

With the environment out of the way, it's time to set up the package that will be used to deploy Squid.

Squid is available on Windows via Diladele. Since 99% of our systems are 64 bit, I just downloaded the 64 bit MSI. Using the same structure that I used for the Nxlog package, I packaged up the MSI and some supporting scripts, making sure to version the package appropriately. Consistent versioning is important, so I use the same versioning strategy that I use for our software components: include a SharedAssemblyInfo file and then mutate that file via some common versioning Powershell functions.

Apart from the installation of Squid itself, I also included the ability to deploy a custom configuration file. The main reason I did this was so that I could replicate our current Squid proxy config exactly, because I'm sure it does things that have been built up over the last few years that I don't understand. I did this in a similar way to how I did config deployment for Nxlog and Logstash: a set of configuration files are included in the NuGet package and the correct one is chosen at deployment time based on some configuration within Octopus.
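
The deployment side of that is nothing fancy. Conceptually it's just a few lines of Powershell in the Octopus deployment script, something like the following (the Octopus variable name and install paths are illustrative, not necessarily what we actually use):

# The Octopus variable name and the Squid install location are assumptions for this example.
$currentDirectoryPath = Split-Path $script:MyInvocation.MyCommand.Path

# Pick the config file for this environment based on an Octopus variable, then tell Squid to reload.
$configName = $OctopusParameters["SquidProxy.ConfigurationName"]
$source = Join-Path $currentDirectoryPath "configurations\$configName.conf"
$destination = "C:\Squid\etc\squid\squid.conf"
Copy-Item -Path $source -Destination $destination -Force
& "C:\Squid\bin\squid.exe" -k reconfigure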

I honestly don't remember whether I had any issues with creating the Squid proxy package, but I'm sure that if I had, they would be fresh in my mind. MSIs are easy to install silently with MSIEXEC once you know the arguments, and the Squid installer for Windows is pretty reliable. I really do think it was straightforward, especially considering that I was following the same pattern that I'd used to install an MSI via Octopus previously.
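
For anyone who hasn't had the pleasure, a silent MSI install from Powershell boils down to something like this (the MSI file name and log path are just for illustration):

# The MSI name and log location here are assumptions for the example.
$currentDirectoryPath = Split-Path $script:MyInvocation.MyCommand.Path
$msiPath = Join-Path $currentDirectoryPath "squid.msi"
$logPath = Join-Path $currentDirectoryPath "squid-install.log"

$arguments = "/i `"$msiPath`" /qn /norestart /l*v `"$logPath`""
$process = Start-Process -FilePath "msiexec.exe" -ArgumentList $arguments -Wait -PassThru
if ($process.ExitCode -ne 0)
{
    throw "Silent install of [$msiPath] failed with exit code [$($process.ExitCode)]. See [$logPath] for details."
}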

Delivering the Package

This is standard Octopus territory. Create a project to represent the deployable component, target it appropriately at machines in roles and then deploy to environments. Part of the build script that is responsible for putting together the package above can also automatically deploy it via Octopus to an environment of your choice.

In TeamCity, we typically do an automatic deploy to CI on every checkin (gated by passing tests), but for this project I had to hold off. We're actually running low on Build Configurations right now (I've already put in a request for more, but the wheels of bureaucracy move pretty slowly), so I skipped setting one up for the Squid proxy. Once we get some more available configurations I'll rectify that situation.

Who Could Forget Logs?

The final step in deploying and maintaining anything is to make sure that the logs from the component are being aggregated correctly, so that you don't have to go to the machine(s) in question to see what's going on, and so that you have a nice pile of data to do analysis on later. Space is cheap after all, so you might as well store everything, all the time (except media).

Squid features a nice Access log with a well known format, which is perfect for this sort of log processing and aggregation.

Again, using the same sort of approach that I've used for other components, I quickly knocked up a Logstash config for parsing the log file and deployed it (and Logstash) to the same machines as the Squid proxy installations. I'll include that config here, because it lives in a different repository to the rest of the Squid stuff (it would live in the Solavirum.Logging.Logstash repo, if I updated it).

input {
    file {
        path => "@@SQUID_LOGS_DIRECTORY/access.log"
        type => "squid"
        start_position => "beginning"
        sincedb_path => "@@SQUID_LOGS_DIRECTORY/.sincedb"
    }
}

filter {
    if [type] == "squid" {
        grok {
            match => [ "message", "%{NUMBER:timestamp}\s+%{NUMBER:TimeTaken:int} %{IPORHOST:source_ip} %{WORD:squid_code}/%{NUMBER:Status} %{NUMBER:response_bytes:int} %{WORD:Verb} %{GREEDYDATA:url} %{USERNAME:user} %{WORD:squid_peerstatus}/(%{IPORHOST:destination_ip}|-) %{GREEDYDATA:content_type}" ]
        }
        date {
            match => [ "timestamp", "UNIX" ]
            remove_field => [ "timestamp" ]
        }
    }
     
    mutate {
        add_field => { "SourceModuleName" => "%{type}" }
        add_field => { "Environment" => "@@ENVIRONMENT" }
        add_field => { "Application" => "SquidProxy" }
        convert => [ "Status", "string" ]
    }
    
    # This last common mutate deals with the situation where Logstash was creating a custom type (and thus different mappings) in Elasticsearch
    # for every type that came through. The default "type" is logs, so we mutate to that, and the actual type is stored in SourceModuleName.
    # This is a separate step because if you try to do it with the SourceModuleName add_field it will contain the value of "logs" which is wrong.
    mutate {
        update => [ "type", "logs" ]
    }
}

output {
    tcp {
        codec => json_lines
        host => "@@LOG_SERVER_ADDRESS"
        port => 6379
    }
    
    #stdout {
    #    codec => rubydebug
    #}
}

I Can Never Think of a Good Title for the Summary

For reference purposes I’ve included the entire Squid package/environment setup code in this repository. Use as you see fit.

As far as environment setups go, this one was pretty much by the numbers. No major blockers or time wasters. It wasn't trivial, and it still took me a few days of concentrated effort, but the issues I did have were pretty much just me making mistakes (like setting up security group rules wrong, failing to tag instances correctly in Octopus, or failing to test the Squid install script locally before I deployed it). The slowest part is definitely waiting for the environment creation to either succeed or fail, because it can take 20+ minutes for the thing to run from start to finish. I should look into making that faster somehow, as I get distracted during those 20 minutes.

Really the only reason for the lack of issues was that I’d done all of this sort of stuff before, and I tend to make my stuff reusable. It was a simple matter to plug everything together in the configuration that I needed, no need to reinvent the wheel.

Though sometimes you do need to smooth the wheel a bit when you go to use it again.


Managing subnets in AWS makes me sad. Don’t get me wrong, AWS (as per normal) gives you full control over that kind of thing, I’m mostly complaining from an automation point of view.

Ideally, when you design a self contained environment, you want to ensure that it is isolated in as many ways as possible from other environments. Yes you can re-use shared infrastructure from a cost optimization point of view, but conceptually you really do want to make sure that Environment A can’t possibly affect anything in Environment B and vice versa.

As is fairly standard, all of our AWS CloudFormation templates use subnets.

In AWS, a subnet defines a set of available IP addresses (i.e. using CIDR notation, 1.198.143.0/28 represents the 16 addresses 1.198.143.0 – 1.198.143.15). Subnets also define an availability zone (for redundancy, i.e. ap-southeast-2a vs ap-southeast-2b) and whether or not resources using the subnet automatically get an IP address, and they can be used to define routing rules to restrict access. Route tables and security groups are the main mechanisms by which you can lock down access to your machines outside of the OS level, so it's important to use them as much as you can. You should always assume that any one of your machines might be compromised and minimise possible communication channels accordingly.
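
In CloudFormation terms a subnet is a tiny resource, something like the following (names and the CIDR are illustrative):

"PublicSubnetA": {
    "Type": "AWS::EC2::Subnet",
    "Properties": {
        "VpcId": { "Ref": "VpcId" },
        "CidrBlock": "1.198.143.0/28",
        "AvailabilityZone": "ap-southeast-2a",
        "MapPublicIpOnLaunch": "true",
        "Tags": [ { "Key": "Name", "Value": "public-subnet-a" } ]
    }
}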

Typically, in a CloudFormation template each resource will have a dependency on one or more subnets (more subnets for highly available resources, like auto scaling groups and RDS instances). The problem is, while it is possible to set up one or many subnets inside a CloudFormation template, there are no real tools available to select an appropriate IP range for your new subnet/s from the available range in the VPC.

What we've had to do as a result of this is set up a couple of known subnets with high capacity (mostly just blocks of 256 addresses) and then use those subnets statically in the templates. We've got a few subnets for publicly accessible resources (usually just load balancers), a few for private web servers (typically only accessible from the load balancers) and a few for everything else.

This is less than ideal for various reasons (hard dependency on resources created outside of the template, can’t leverage route tables as cleanly, etc). What I would prefer, is the ability to query the AWS infrastructure for a block of IP addresses at the time the template is executed, and dynamically create subnets like that (setting up route tables as appropriate). To me this feels like a much better way of managing the network infrastructure in the cloud, keeping in line with my general philosophy of self contained environment setup.

Technically the template would probably have a dependency on a VPC, but you could search for that dynamically if you wanted to. Our accounts only have one VPC in them anyway.

The Dream

I can see the set of tools that I want to access in my head, they just don’t seem to exist.

The first thing needed would be a library of some sort that allows you to supply a VPC (and its meta information) and a set of subnets (also with their meta information), and can then produce for you a new subnet of the desired capacity. For example, if I know that I only need a few IP addresses for the public facing load balancer in my environment, I would get the tool to generate 2 subnets, one in each availability zone in ap-southeast-2, of 16 addresses (a /28) or something similarly tiny.
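
To make that a bit more concrete, the heart of such a library would be a function along these lines. This is just a rough Powershell sketch of the allocation logic with made up names; a real implementation would pull the VPC and subnet information from AWS itself rather than taking raw CIDR strings.

function ConvertTo-IpValue([string]$ipAddress)
{
    # Convert a dotted quad into a number so we can do range arithmetic on it.
    $bytes = ([System.Net.IPAddress]::Parse($ipAddress)).GetAddressBytes()
    [Array]::Reverse($bytes)
    return [long][BitConverter]::ToUInt32($bytes, 0)
}

function ConvertTo-IpAddress([long]$value)
{
    $bytes = [BitConverter]::GetBytes([uint32]$value)
    [Array]::Reverse($bytes)
    return (New-Object System.Net.IPAddress -ArgumentList (,$bytes)).ToString()
}

function Get-FreeSubnetCidr
{
    # Rough sketch only: assumes the VPC CIDR is aligned to its own boundary and ignores AWS reserved addresses.
    param
    (
        [string]$vpcCidr,               # e.g. "10.0.0.0/16"
        [string[]]$existingSubnetCidrs, # e.g. @("10.0.0.0/24", "10.0.1.0/24")
        [int]$desiredPrefixLength       # e.g. 28 for a block of 16 addresses
    )

    $vpcParts = $vpcCidr.Split("/")
    $vpcStart = ConvertTo-IpValue $vpcParts[0]
    $vpcSize = [long][Math]::Pow(2, 32 - [int]$vpcParts[1])
    $blockSize = [long][Math]::Pow(2, 32 - $desiredPrefixLength)

    $used = @()
    foreach ($cidr in $existingSubnetCidrs)
    {
        $parts = $cidr.Split("/")
        $start = ConvertTo-IpValue $parts[0]
        $used += @{ Start = $start; End = $start + [long][Math]::Pow(2, 32 - [int]$parts[1]) - 1 }
    }

    # Walk the VPC range in block sized steps (aligned to the start of the VPC) and return the first gap.
    for ($candidate = $vpcStart; ($candidate + $blockSize - 1) -le ($vpcStart + $vpcSize - 1); $candidate += $blockSize)
    {
        $candidateEnd = $candidate + $blockSize - 1
        $overlapping = @($used | Where-Object { $_.Start -le $candidateEnd -and $_.End -ge $candidate })
        if ($overlapping.Length -eq 0)
        {
            return "$(ConvertTo-IpAddress $candidate)/$desiredPrefixLength"
        }
    }

    throw "No free /$desiredPrefixLength block available in [$vpcCidr]."
}

Feed it the VPC CIDR and the CIDRs of the existing subnets, and it hands back the first free block of the requested size, which could then be passed straight into a template as a parameter.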

The second thing would be a visualization tool built on top of the library, that let you view your address space as a giant grid, zoomable, with important elements noted, like coloured subnets, resources currently using IP addresses and if you wanted to get really fancy, route tables and their effects on communication.

Now you may be thinking, you’re a programmer, why don’t you do it? The answer is, I’m considering it pretty hard, but while the situation does annoy me, it hasn’t annoyed me enough to spur me into action yet. I’m posting up the idea on the off chance someone who is more motivated than me grabs it and runs with it.

Downfall

There is at least one downside that I can think of with using a library to create subnets of the appropriate size.

It's a similar issue to memory allocation and management. Because the way in which you need IP address ranges is likely to change from template to template, the addressable space will eventually suffer from fragmentation. In memory management, this is solved by doing some sort of compaction or other de-fragmentation activity. For IP address ranges, I'm not sure how you could solve that issue. You could probably update the environment to use new subnets, re-allocated to minimise fragmentation, but I think it's likely to be more trouble than it's worth.

Summary

To summarise, I really would like a tool to help me visualize the VPC (and its subnets, security groups, etc) in my AWS account. I’d settle for something that just lets me visualize my subnets in the context of the total addressable space.

I might write it.

You might write it.

Someone should.


The service that I’ve mentioned previously (and the iOS app it supports) has been in beta now for a few weeks. People seem relatively happy with it, both from a performance standpoint and due to the fact that it doesn’t just arbitrarily lose their information, unlike the previous version, so we’ve got that going for us, which is nice.

We did a fair amount of load testing on it before it went out to beta, but only for small numbers of concurrent users (< 100), to make sure that our beta experience would be acceptable. That load testing picked up a few issues, including one where the service would happily (accidentally of course) delete other people's data. It wasn't a permissions issue; it was due to the way in which we were keying our image storage. More importantly, the load testing found issues with the way in which we were storing images (we were using Raven 2.5 attachments), which just wasn't working from a performance point of view. We switched to storing the files in S3, and it was much better.

I believe the newer version of Raven has a new file storage mechanism that is much better. I don’t even think Ayende recommends that you use the attachments built into Raven 2.5 for any decent amount of file storage.

Before going live, we knew that we needed to find the breaking point of the service: the number of concurrent users at which its performance degrades to the point where it is unusable (at least for the configuration that we were planning on going live with). If that number was too low, we knew we would need to make some additional changes, either in terms of infrastructure (beefier AWS instances, more instances in the Auto Scaling Group) or in terms of code.

We tried to simply run a huge number of users through our load tests locally (which is how we did the first batch of load testing, locally using JMeter), but we capped out our available upload bandwidth pretty quickly, well below the level of traffic that the service could handle.

It was time to farm the work out to somewhere else, somewhere with a huge amount of easily accessible computing resources.

Where else but Amazon Web Services?

I’ve Always Wanted to be a Farmer

The concept was fairly straightforward. We had a JMeter configuration file that contained all of our load tests, parameterised by the number of users, so conceptually the path was to spin up some worker instances in EC2, push JMeter, its dependencies and our config to them, then execute the tests. This way we could tune the number of users per instance along with the total number of worker instances, and we would be able to easily put enough pressure on the service to find its breaking point.

JMeter gives you the ability to set the value of variables via the command line. Be careful though, as the variable names are case sensitive. That one screwed me over for a while, as I couldn't figure out why the value of my variables was still the default on every machine I started the tests on. For the variable that defined the maximum number of users it wasn't so bad, just a bit confusing. The other variable, which defined the seed for the user identity, was more of an issue when it wasn't working, because it meant the same user was doing similar things from multiple machines. Still a valid test, but not the one I was aiming to do, as the service isn't designed for concurrent access like that.
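
For completeness, kicking off the test plan on a worker looks something like this (paths and property names are illustrative; the properties are read inside the test plan via __P(), and as I said, the names are case sensitive):

# Paths and property names are illustrative. -n is non-GUI mode, -t is the test plan, -l is the results file.
$jmeter = "C:\cfn\dependencies\jmeter\bin\jmeter.bat"
& $jmeter -n `
    -t "C:\cfn\dependencies\scripts\jmeter\load-test.jmx" `
    -l "C:\cfn\dependencies\scripts\jmeter\results.jtl" `
    "-JtotalNumberOfUsers=100" `
    "-JstartingCustomerNumber=1"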

We wouldn’t want to put all of that load on the service all at once though, so we needed to stagger when each instance started its tests.

Leveraging the work I'd done previously for setting up environments, I created a CloudFormation template containing an Auto Scaling Group with a variable number of worker instances. Each instance would have the JMeter config file and all of its dependencies (Java, JMeter, any supporting scripts) installed during setup, and then be available for remote execution via Powershell.

The plan was to hook into that environment (or setup a new one if one could not be found), find the worker instances and then iterate through them, starting the load tests on each one, making sure to stagger the time between starts to some reasonable amount. The Powershell script for doing exactly that is below:

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$environmentName,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2"
)

$ErrorActionPreference = "Stop"

$currentDirectoryPath = Split-Path $script:MyInvocation.MyCommand.Path
write-verbose "Script is located at [$currentDirectoryPath]."

. "$currentDirectoryPath\_Find-RepositoryRoot.ps1"

$repositoryRoot = Find-RepositoryRoot $currentDirectoryPath

$repositoryRootDirectoryPath = $repositoryRoot.FullName
$commonScriptsDirectoryPath = "$repositoryRootDirectoryPath\scripts\common"

. "$repositoryRootDirectoryPath\scripts\environment\Functions-Environment.ps1"

. "$commonScriptsDirectoryPath\Functions-Aws.ps1"

Ensure-AwsPowershellFunctionsAvailable

$stack = $null
try
{
    $stack = Get-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion
}
catch 
{
    Write-Warning $_
}

if ($stack -eq $null)
{
    # The environment doesn't exist (or we couldn't find it), so create it from scratch.
    # Note that $stack is always $null at this point, so $update is always $false.
    $update = ($stack -ne $null)

    $stack = New-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -UpdateExisting:$update -Wait -disableCleanupOnFailure
}

$autoScalingGroupName = $stack.AutoScalingGroupName

$asg = Get-ASAutoScalingGroup -AutoScalingGroupNames $autoScalingGroupName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$instances = $asg.Instances

. "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

$remoteUser = "Administrator"
$remotePassword = "ObviouslyInsecurePasswordsAreTricksyMonkeys"
$securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)

$usersPerMachine = 100
$nextAvailableCustomerNumber = 1
$jobs = @()
foreach ($instance in $instances)
{
    # Get the instance
    $instance = Get-AwsEc2Instance -InstanceId $instance.InstanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    $ipAddress = $instance.PrivateIpAddress
    
    $session = New-PSSession -ComputerName $ipAddress -Credential $cred

    $remoteScript = {
        param
        (
            [int]$totalNumberOfUsers,
            [int]$startingCustomerNumber
        )
        Set-ExecutionPolicy -ExecutionPolicy Bypass
        & "C:\cfn\dependencies\scripts\jmeter\execute-load-test-no-gui.ps1" -totalNumberOfUsers $totalNumberOfUsers -startingCustomerNumber $startingCustomerNumber -AllocatedMemory 512
    }
    $job = Invoke-Command -Session $session -ScriptBlock $remoteScript -ArgumentList $usersPerMachine,$nextAvailableCustomerNumber -AsJob
    $jobs += $job
    $nextAvailableCustomerNumber += $usersPerMachine

    #Sleep -Seconds ([TimeSpan]::FromHours(2).TotalSeconds)
    Sleep -Seconds 300

    # Can use Get-Job or record list of jobs and then terminate them. I suppose we could also wait on all of them to be complete. Might be good to get some feedback from
    # the remote process somehow, to indicate whether or not it is still running/what it is doing.
}

Additionally, I’ve recreated and reuploaded the repository from my first JMeter post, containing the environment template and scripts for executing the template, as well as the script above. You can find it here.

The last time I uploaded this repository I accidentally compromised our AWS deployment credentials, so I tore it down again very quickly. Not my brightest moment, but you can rest assured I’m not making the same mistake twice. If you look at the repository, you’ll notice that I implemented the mechanism for asking for credentials for tests so I never feel tempted to put credentials in a file ever again.

We could watch the load tests kick into gear via Kibana, and keep an eye on when errors start to occur and why.

Obviously we didn’t want to run the load tests on any of the existing environments (which are in use for various reasons), so we spun up a brand new environment for the service, fired up the script to farm out the load tests (with a 2 hour delay between instance starts) and went home for the night.

15 minutes later, Production (the environment actively being used for the external beta) went down hard, and so did all of the others, including the new load test environment.

Separately Dependent

We had gone to great lengths to make sure that our environments were independent. That was the entire point behind codifying the environment setup, so that we could spin up all resources necessary for the environment, and keep it isolated from all of the other ones.

It turns out they weren’t quite as isolated as we would have liked.

Like a lot of AWS setups, we have an internet gateway, allowing resources internal to our VPC (like EC2 instances) access to the internet. By default, only resources with an external IP can access the internet through the gateway. Other resources have to use some other mechanism for accessing the internet. In our case, that other mechanism is a Squid proxy.

It was this proxy that was the bottleneck. Both the service under test and the load test workers themselves were slamming it, the service in order to talk to S3 and the load test workers in order to hit the service (through its external URL).

We had recently increased the specs on the proxy machine (because of a similar problem discovered during load testing with fewer users) and we thought that maybe it would be powerful enough to handle the incoming requests. It probably would have been if it wasn't for the double load (i.e. if the load test requests had been coming from an external party and the only traffic going through the proxy was the traffic to S3 from the service).

In the end the load tests did exactly what they were supposed to do, even if they did it in an unexpected way. They pushed the system to breaking point, allowing us to identify where it broke and schedule improvements to prevent the situation from occurring again.

Actions Speak Louder Than Words

What are we going to do about it? There are a number of things I have in mind.

The first is to not have a single proxy instance and instead have an auto scaling group that scales as necessary based on load. I like this idea and I will probably be implementing it at some stage in the future. To be honest, as a shared piece of infrastructure, this is how it should have been implemented in the first place. I understand that the single instance (configured lovingly by hand) was probably quicker and easier initially, but for such a critical piece of infrastructure, you really do need to spend the time to do it properly.

The second is to have environment specific proxies, probably as auto scaling groups anyway. This would give me more confidence that we won’t accidentally murder production services when doing internal things, just from an isolation point of view. Essentially, we should treat the proxy just like we treat any other service, and be able to spin them up and down as necessary for whatever purposes.

The third is to isolate our production services entirely, either with another VPC just for production, or even another AWS account just for production. I like this one a lot, because as long as we have shared environments, I’m always terrified I’ll screw up a script and accidentally delete everything. If production wasn’t located in the same account, that would literally be impossible. I’ll be trying to make this happen over the coming months, but I’ll need to move quickly, as the more stuff we have in production, the harder it will be to move.

The last optimisation, which I have already done, is to use the new VPC endpoint feature in AWS to avoid having to go to the internet in order to access S3. This really just delays the root issue (a shared single point of failure), but it certainly solves the immediate problem and should also provide a nice performance boost, as it removes the proxy from the picture entirely for interactions with S3.
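
The endpoint itself is only a tiny bit of CloudFormation, something like this (resource and parameter names are illustrative):

"S3VpcEndpoint": {
    "Type": "AWS::EC2::VPCEndpoint",
    "Properties": {
        "ServiceName": { "Fn::Join": [ "", [ "com.amazonaws.", { "Ref": "AWS::Region" }, ".s3" ] ] },
        "VpcId": { "Ref": "VpcId" },
        "RouteTableIds": [ { "Ref": "PrivateRouteTable" } ]
    }
}

Once the route tables for the private subnets are attached to the endpoint, traffic to S3 goes straight there instead of out through the proxy.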

Conclusion

To me, this entire event proved just how valuable load testing is. As I stated previously, it did exactly what I expected it to do: find where the service breaks. It broke in an entirely unexpected way (and broke other things as well), but honestly this is probably the best outcome, because it would have happened at some point in the future anyway (whenever we hit the saturation point for the proxy), and I'd prefer it to happen now, when we're in beta and managing communications with every user closely, than later, when everybody and their dog are using the service.

Of course, now we have a whole lot more infrastructure work to complete before we can go live, but honestly, the work is never really done anyway.

I still hate proxies.


I’ve been doing a lot of work with AWS recently.

For the last service component that we developed, we put together a CloudFormation template and a series of Powershell scripts to set up, tear down and migrate environments (like CI, Staging, Production, etc). It was extremely effective, barring some issues that we still haven't quite solved with data migration between environment versions and updating machine configuration settings.

In the first case, an environment is obviously not stateless once you start using it, and you need a good story about maintaining user data between environment versions, at the very least for Production.

In the second case, tearing down an entire environment just to update a configuration setting is obviously sub-optimal. We try to make sure that most of our settings are encapsulated within the components that we deploy, but not everything can be done this way. CloudFormation does have update mechanisms, I just haven't had a chance to investigate them yet.

But I digress; let's switch to an entirely different topic for this post: how to give secure access to objects in an S3 bucket during initialization of EC2 instances while executing a CloudFormation template.

That was a mouthful.

Don’t Do What Donny Don’t Does

My first CloudFormation template/environment setup system had a fairly simple rule. Minimise dependencies.

There were so many example templates on the internet that just downloaded arbitrary scripts or files from GitHub or S3, and to me that’s the last thing you want. When I run my environment setup (ideally from within a controlled environment, like TeamCity) I want it to use the versions of the resources that are present in the location I’m running the script from. It should be self contained.

Based on that rule, I put together a fairly simple process where the Git Repository housing my environment setup contained all the necessary components required by the resources in the CloudFormation template, and the script was responsible for collecting and then uploading those components to some location that the resources could access.

At the time, I was not very experienced with S3, so I struggled a lot with getting the right permissions.

Eventually I solved the issue by handing off the AWS Key/Secret to the CloudFormation template, and then using those credentials in the AWS::CloudFormation::Authentication block inside the resource (LaunchConfig/Instance). The URL of the dependencies archive was then supplied to the source element of the first initialization step in the AWS::CloudFormation::Init block, which used the supplied credentials to download the file and extract its contents (via cfn-init) to a location on disk, ready to be executed by subsequent components.

This worked, but it left a bad taste in my mouth once I learnt about IAM roles.

IAM roles give you the ability to essentially organise sets of permissions that can be applied to resources, like EC2 instances. For example, we have a logs bucket per environment that is used to capture ELB logs. Those logs are then processed by Logstash (indirectly, because I can’t get the goddamn S3 input to work with a proxy, but I digress) on a dedicated logs processing instance. I could have gone about this in two ways. The first would have been to supply the credentials to the instance, like I had in the past. This exposes those credentials on the instance though, which can be dangerous. The second option is to apply a role to the instance that says “you are allowed to access this S3 bucket, and you can do these things to it”.

I went with the second option, and it worked swimmingly (once I got it all configured).

Looking back at the way I had done the dependency distribution, I realised that using IAM roles would be a more secure option, closer to best practice. Now I just needed a justifiable opportunity to implement it.

New Environment, Time to Improve

We’ve started work on a new service, which means new environment setup. This is a good opportunity to take what you’ve done previously and reuse it, improving it along the way. For me, this was the perfect chance to try and use IAM roles for the dependency distribution, removing all of those nasty “credentials in the clear” situations.

I followed the same process that I had for the logs processing: set up a role describing the required policy (readonly access to the S3 bucket that contains the dependencies), link that role to an instance profile, and finally apply the profile to the instances in question.

"ReadOnlyAccessToDependenciesBucketRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/",
        "Policies" : [
            {
                "Version" : "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                    },
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:ListBucket" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                    }
                ]
            }
        ]
    }
},
"ReadOnlyDependenciesBucketInstanceProfile": {    
    "Type": "AWS::IAM::InstanceProfile",    
    "Properties": { 
        "Path": "/", 
        "Roles": [ { "Ref": "ReadOnlyDependenciesBucketRole" }, { "Ref": "FullControlLogsBucketRole" } ] 
    }
},
"InstanceLaunchConfig": {    
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {        
        * snip *    
    },    
    "Properties": {        
        "KeyName": { "Ref": "KeyName" },        
        "ImageId": { "Ref": "AmiId" },        
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],        
        "InstanceType": { "Ref": "InstanceType" },        
        "IamInstanceProfile": { "Ref": "ReadOnlyDependenciesBucketInstanceProfile" },        
        "UserData": {            
            * snip *        
        }    
    }
}

It worked before, so it should work again, right? I’m sure you can probably guess that that was not the case.

The first mistake I made was attempting to specify multiple roles in a single profile. I wanted to do this because the logs processor needed to maintain its permissions to the logs bucket, but it needed the new permissions to the dependencies bucket as well. Even though the roles element is defined as an array, it can only accept a single element. I now hate whoever designed that, even though I’m sure they probably had a good reason.

At least that was an easy fix, flip the relationship between roles and policies. I split the inline policies out of the roles, then linked the roles to the policies instead. Each profile only had 1 role, so everything should have been fine.

"ReadOnlyDependenciesBucketPolicy": {
    "Type":"AWS::IAM::Policy",
    "Properties": {
        "PolicyName": "ReadOnlyDependenciesBucketPolicy",
        "PolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                },
                {
                    "Effect": "Allow",
                    "Action": [ "s3:ListBucket" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                }
            ]
        },
        "Roles": [
            { "Ref" : "InstanceRole" },
            { "Ref" : "OtherInstanceRole" }
        ]
    }
},
"InstanceRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/"
    }
},
"InstanceProfile": {
    "Type": "AWS::IAM::InstanceProfile",
    "Properties": { "Path": "/", "Roles": [ { "Ref": "InstanceRole" } ] }
}

Ha ha ha ha ha, no.

The cfn-init logs showed that the process was getting 403s when trying to access the S3 object URL. I had incorrectly assumed that because the instance was running with the appropriate role (and it was; if I remoted onto the instance and attempted to download the object from S3 via the AWS Powershell Cmdlets, it worked just fine) cfn-init would use that role.

It does not.

You still need to specify the AWS::CloudFormation::Authentication element, naming the role and the bucket that it will be used for. This feels a little crap to be honest. Surely the cfn-init application is using the same AWS components, so why doesn't it just pick up the credentials from the instance profile like everything else does?

Anyway, I added the Authentication element with appropriate values, like so.

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
          "S3AccessCreds": {
            "type": "S3",
            "roleName": { "Ref" : "InstanceRole" },
            "buckets" : [ { "Ref" : "DependenciesS3Bucket" } ]
          }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Then I started getting different errors. You may think this is a bad thing, but I disagree. Different errors mean progress. I'd switched from getting 403 responses (access denied) to getting 404s (not found).

Like I said, progress!

The Dependencies Archive is a Lie

It was at this point that I gave up trying to use the IAM roles. I could not for the life of me figure out why it was returning a 404 for a file that clearly existed. I checked and double checked the path, and even used the same path to download the file via the AWS Powershell Cmdlets on the machines that were having the issues. It all worked fine.

Assuming the issue was with my IAM role implementation, I rolled back to the solution that I knew worked: specifying the Access Key and Secret in the AWS::CloudFormation::Authentication element of the LaunchConfig. I also removed the new IAM role resources (for readonly access to the dependencies archive).

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
            "S3AccessCreds": {
                "type": "S3",
                "accessKeyId" : { "Ref" : "DependenciesS3BucketAccessKey" },
                "secretKey" : { "Ref": "DependenciesS3BucketSecretKey" },
                "buckets" : [ { "Ref":"DependenciesS3Bucket" } ]
            }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Imagine my surprise when it also didn’t work, throwing back the same response, 404 not found.

I tried quite a few things over the next few hours, and there was much wailing and gnashing of teeth. I've seen some weird crap with S3 and bucket names (too long and you get errors, weird characters in your key and you get errors, etc) but as far as I could tell, everything was kosher. Yet it just wouldn't work.

After doing a line by line diff against the template/scripts that were working (the other environment setup) and my new template/scripts I realised my error.

While working on the IAM role stuff, trying to get it to work, I had attempted to remove case sensitivity from the picture by calling ToLowerInvariant on the dependencies archive URL that I was passing to my template. The old script/template combo didn’t do that.

When I took that out, it worked fine.

The issue was that the key of the file being uploaded was not being turned into lower case, only the URL of the resulting file was, and AWS keys are case sensitive.
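
In script form, the mistake looked something like this (the names are illustrative, but the shape of the bug is accurate):

# The object is uploaded with the original, mixed case key...
Write-S3Object -BucketName $dependenciesS3Bucket -Key $dependenciesArchiveKey -File $dependenciesArchivePath -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

# ...but the URL handed to the CloudFormation template is lowercased, so cfn-init asks for a key
# that doesn't exist and S3 quite correctly responds with a 404.
$dependenciesArchiveUrl = ("https://{0}.s3-{1}.amazonaws.com/{2}" -f $dependenciesS3Bucket, $awsRegion, $dependenciesArchiveKey).ToLowerInvariant()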

Goddamn it.

Summary

I lost basically an entire day to case sensitivity. It's not even the first time this has happened to me (well, it's the first time it's happened in S3 I think). I come from a heavy Windows background. I don't even consider case sensitivity to be a thing. I can understand why it's a thing (technically different characters and all), but it's just not on Windows, so it's not even on my radar most of the time. I assume the case sensitivity in S3 is a result of the AWS backend being Unix/Linux based, but it's still a shock to find a case sensitive URL.

It turns out that my IAM stuff had started working just fine, and I was getting 404s for an entirely different reason. I had assumed that I was still doing something wrong with my permissions and the API was just giving a crappy response (i.e. not really a 404, some sort of permission based can't-find-file error masquerading as a 404).

At the very least I didn’t make the silliest mistake you can make in software (assuming the platform is broken), I just assumed I had configured it wrong somehow. That’s generally a fairly safe assumption when you’re using a widely distributed system. Sometimes you do find a feature that is broken, but it is far more likely that you are just doing it wrong. In my case, the error message was completely accurate, and was telling me exactly the right thing, I just didn’t realise why.

Somewhat ironically, the root cause of my 404 issue was my attempt to remove case sensitivity from the picture when I was working on getting the IAM stuff up and running. I just didn’t apply the case insensitivity consistently.

Ah well.


As I’ve already stated, I’ve spent the last few weeks working on putting together log aggregation so that we know what our new service looks like in real time.

I’ve incorporated IIS logs, the application logs, machine statistics (memory, CPU, etc) and Windows Event logs into the log aggregator, and successfully used those events for analysis during load testing.

There was one piece missing though, which meant there was a hole in our ability to monitor how our service was actually operating in the face of actual usage.

The Elastic Load Balancer, or ELB, that sits in front of the publically accessible web service.

During load testing, I noticed that sometimes JMeter would record an error (specifically a 504, Gateway Timeout) but our dashboard in Kibana would show nothing. No errors, everything seemed fine.

It turned out that there was a default timeout on the ELB of 60 seconds, and at that point in the load testing, some requests were taking longer than that without causing any traffic over the connection. The ELB would terminate the connection, return a 504 to the client, but the request would still complete successfully (eventually) in the backend.

I needed to get eyes on the ELB.

It's Log!

Turning logging on for an ELB is fairly easy.

Just give it the S3 bucket you want it to log to, a prefix to use for entries made into the bucket and a time interval, and off it goes. All of this can be done through the CloudFormation template, which fits well into our strategy for environment setup (no manual tasks, automate all the things).

The only complex bit is setting up a bucket policy that grants the correct permissions to allow the ELB to write to the bucket, which is all pretty well documented. There is simply a well known ARN for what I assume is all Load Balancers in a region, and you set up a simple Put/Get/List policy to allow it to do its thing.
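
In the template it ends up looking something like this (resource names are illustrative, and I've cut the policy down to just the put permission; the magic account in the Principal is, as far as I know, the well known ELB account for ap-southeast-2, so check the documentation for the right one for your region):

"LogsBucketPolicy": {
    "Type": "AWS::S3::BucketPolicy",
    "Properties": {
        "Bucket": { "Ref": "LogsBucket" },
        "PolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "AWS": "arn:aws:iam::783225319266:root" },
                    "Action": [ "s3:PutObject" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "LogsBucket" }, "/*" ] ] }
                }
            ]
        }
    }
},
"LoadBalancer": {
    "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties": {
        * snip *
        "AccessLoggingPolicy": {
            "Enabled": "true",
            "S3BucketName": { "Ref": "LogsBucket" },
            "S3BucketPrefix": "elb",
            "EmitInterval": "5"
        }
    }
}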

The only gotcha I ran into was when I included an underscore (_) in the prefix configuration setting for the ELB. The prefix setting is intended to make sure that the keys for files written into the bucket start with a common value. When I included an underscore, I got nothing but Access Denied errors. This was at the same time as I was setting up the bucket policy, so I assumed I had done that incorrectly. Turns out my bucket policy was flawless, and it was a completely unrelated (and unexpected) issue causing the Access Denied errors.

Very frustrating.

With that fixed though, the logs started flowing.

Content Rich

The ELB logs contain things like the ELB IP and port, where the request was forwarded to (IP and port again), the time taken to forward, process and respond to requests (three separate entries; the process time is the time it takes for your server to do its thing), response codes, bytes transferred and other things. Very similar to IIS really, which is not unexpected.

Now all I had to do was get the information into our Log Aggregator.

Stashing Those Logs

I had been using Nxlog as my log processor. It was responsible for picking up files, processing them as necessary, enriching them with various pieces of information (hostname, component, application) and then shipping the results off via TCP to our log aggregator where Logstash was listening.

Nxlog is a fine product, but its scripting language is hard to get a handle on, and the documentation is a bit sparse. Also, it has no concept of decimal numbers, which meant that I had to convert some numbers to integers (like decimal seconds to milliseconds) via regular expressions. Altogether it got the job done, but I wasn't particularly happy with it.

I thought that since I needed to do something a little bit more complicated (get files from S3 and process them), I would use Logstash this time. Logstash as a log processor is a lot easier to distribute, configure and debug, which is nice. Its configuration is a simple JSON-like format that is very easy to wrap your head around, and it has lots of components to accomplish various tasks, like getting files from S3, parsing CSV lines, mutating fields to the correct type, etc. It even has a mutator (Logstash calls them filters) that allows you to execute arbitrary Ruby code for those times when you have to do something unusual.

Even better, Logstash is what's listening on the other end of the pipeline, so they play well together, and you only need to know 1 piece of software, instead of 2.

I built a similar distributable project to what I built for Nxlog, that creates a NuGet package that Octopus can deploy to get a copy of Logstash up and running on the target machine as a Windows Service. I won’t go into this in too much detail, but it was essentially the same thing that I did for Nxlog, except with different dependencies (JRE, Logstash, NSSM for service installation/configuration).

I added a small EC2 instance to our environment setup to act as a Log Processor, with the intent that it would immediately be used to process the ELB logs, but may also be used in the future to process other logs that don't necessarily fit onto a specific machine (S3 access logs is the only one that comes to mind, but I'm sure there are more). The Logs Processor had an IAM role allowing it full control over the logs bucket that the ELB was using (which was also created as part of the environment). Nice and clean, and no credentials stored anywhere.

I created a Logstash configuration to grab files from S3 and process them, and then deployed it to the Logs Processor.

Access Denied.

Permission to Rage Requested

The current release version of Logstash (1.4.2) does not support the usage of IAM roles for the S3 input. If I wanted to use that input, I would have to enter the credentials manually into the config file. I could do this easily enough at deployment time (storing the credentials in Octopus, which is much better than in source control), but I would need to actually have a user setup that could access the bucket. As the bucket is created during environment creation, this would mean that the credentials would change every time the environment was recreated. We create temporary environments all the time, so this would mean a manual step editing Octopus every time you wanted to get something to work.

That's unacceptable.

I contemplated using a small script during deployment time to grab some credentials from the IAM role on the machine and enter them into the config file, but those credentials expire and Logstash was running as a service, so at some stage it would just stop working and someone would have to do something to make it work again.

Again, unacceptable.

Luckily for me, the wonderful people behind Logstash (and specifically the S3 plugin) have developed a new version that allows the usage of IAM roles, and it was already in beta. It's still a little unstable (Release Candidate 2), but it was good enough for my purposes.

While doing some reading about Logstash and the new version, I discovered that the file input was basically completely broken on Windows. The component that it was leveraging to get the unique identifier for files (in order to record the position in the file that it was up to) does not work in 1.4.2 and below, so you end up missing huge chunks of data when processing multiple files. This actually explained why I was having so much difficulty using the earlier version to process a large amount of IIS logs from a disconnected machine, and why there were holes in my data. Long story short, if you're using the file input in Logstash and you're on Windows, get the latest release candidate.

I incorporated the 1.5 RC2 release into my deployment, but I still couldn’t get the S3 input to work.

Why Is It Always A Proxy

I hate proxies.

Not because of what they are. I think they actually do some useful things, like caching, obfuscating where requests are coming from when accessing the internet from within a network and preventing access to bad websites.

No I hate proxies because the support for them is always a pain in the ass. Every application seems to support proxies differently, if they support them at all. Some automatically read from the Internet Explorer registry setting for the proxy, some use the HTTP_PROXY environment variable, some have their own personal settings. This means that every time you want to use a piece of software in an environment that uses a proxy, you have to fight with it to get it to work.

Such was the case with the S3 input. The underlying Ruby based aws-sdk has support for proxies, as does the .NET one (the AWS Powershell cmdlets even expose a Set-AwsProxy command).

I could not, for the life of me, figure out how to configure Logstash with a proxy for the AWS component though.

So, I was stuck. I had all the configuration in place to process the ELB logs, but I didn't have the logs themselves.

In the end I created a small Powershell script that uses the AWS Powershell Component to move all files from an S3 bucket to a local directory on a timer. I then installed that script as a Windows Service using NSSM. Finally I edited my Logstash configuration to process the local files instead. After tweaking my config to process the files correctly, everything started coming through into the Log Aggregator as expected, and I added the missing piece to our intelligence about the service.
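
Cut down to its essentials, the downloader looks something like this (bucket name, prefix and paths are illustrative, and the real script does a bit more housekeeping):

# Bucket, prefix and paths are illustrative. Credentials come from the IAM role on the instance,
# so there's nothing sensitive in here, which is the whole point.
$bucketName = "my-environment-logs"
$keyPrefix = "elb"
$destination = "C:\logs\elb"

while ($true)
{
    Read-S3Object -BucketName $bucketName -KeyPrefix $keyPrefix -Folder $destination -Region "ap-southeast-2" | Out-Null
    Sleep -Seconds 60
}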

I don’t like this solution, because it adds more moving parts than I think is strictly necessary, but sometimes you have to make compromises.

Summary

I’ve uploaded a repository with my deployable build for Logstash here, so hopefully someone else can benefit from the effort that I put into making it re-usable.

Setting up a deployment pipeline for this component saved me a lot of time throughout the development process, making redeploying my changes when I made a mistake or needed to change a dependency (like upgrading to Logstash 1.5 RC2) a breeze. I highly recommend spending that initial bit of effort in setting things up at the start so you can move quickly later.

In regards to the actual ELB logs, they don't provide any groundbreaking information that IIS didn't already give us, except for the case where connections are terminated at the ELB due to inactivity. At least to my knowledge anyway; I suppose they will also track whether the underlying instances go offline, which will be good. The ELB entries come in a bit slower than the IIS ones (due to the delay before the log files are published from the ELB, plus the delay added by my own S3 downloader and Logstash file processor pair), but there's not really much I can do about that.

I still hate proxies. Only because its easier to hate one thing than every application that doesn’t support them.