
I’ve been doing a lot of work with AWS recently.

For the last service component that we developed, we put together a CloudFormation template and a series of Powershell scripts to set up, tear down and migrate environments (like CI, Staging, Production, etc.). It was extremely effective, barring some issues that we still haven’t quite solved with data migration between environment versions and updating machine configuration settings.

In the first case, an environment is obviously not stateless once you start using it, and you need a good story about maintaining user data between environment versions, at the very least for Production.

In the second case, tearing down an entire environment just to update a configuration setting is obviously sub-optimal. We try to make sure that most of our settings are encapsulated within components that we deploy, but not everything can be done this way. CloudFormation does have update mechanisms; I just haven’t had a chance to investigate them yet.

But I digress; let’s switch to an entirely different topic for this post: How to give secure access to objects in an S3 bucket during initialization of EC2 instances while executing a CloudFormation template.

That was a mouthful.

Don’t Do What Donny Don’t Does

My first CloudFormation template/environment setup system had a fairly simple rule. Minimise dependencies.

There were so many example templates on the internet that just downloaded arbitrary scripts or files from GitHub or S3, and to me that’s the last thing you want. When I run my environment setup (ideally from within a controlled environment, like TeamCity), I want it to use the versions of the resources that are present in the location I’m running the script from. It should be self-contained.

Based on that rule, I put together a fairly simple process where the Git Repository housing my environment setup contained all the necessary components required by the resources in the CloudFormation template, and the script was responsible for collecting and then uploading those components to some location that the resources could access.
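Concretely, the "collect and upload" step looked something like the sketch below, using the AWS Tools for PowerShell. The bucket name, key structure and paths here are placeholders rather than the real values.

$ErrorActionPreference = "Stop"

# Gather the components the CloudFormation resources will need and zip them up.
$archive = ".\script-working\dependencies.zip"
Compress-Archive -Path ".\src\environment\*" -DestinationPath $archive -Force

# Upload the archive to a location the instances can reach. The bucket and key
# are hypothetical; the real script derived them from the environment name and
# a timestamp so that each run was self contained.
$bucket = "my-environment-dependencies"
$key = "dependencies/$(Get-Date -Format yyyyMMddHHmmss)/dependencies.zip"
Write-S3Object -BucketName $bucket -Key $key -File $archive

# The resulting URL for the uploaded object is what gets passed into the
# CloudFormation template as a parameter, for cfn-init to download later.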

At the time, I was not very experienced with S3, so I struggled a lot with getting the right permissions.

Eventually I solved the issue by handing off the AWS Key/Secret to the CloudFormation template, and then using those credentials in the AWS::CloudFormation::Authentication block inside the resource (LaunchConfig/Instance). The URL of the dependencies archive was then supplied to the source element of the first initialization step in the AWS::CloudFormation::Init block, which used the supplied credentials to download the file and extract its contents (via cfn-init) to a location on disk, ready to be executed by subsequent components.

This worked, but it left a bad taste in my mouth once I learnt about IAM roles.

IAM roles give you the ability to essentially organise sets of permissions that can be applied to resources, like EC2 instances. For example, we have a logs bucket per environment that is used to capture ELB logs. Those logs are then processed by Logstash (indirectly, because I can’t get the goddamn S3 input to work with a proxy, but I digress) on a dedicated logs processing instance. I could have gone about this in two ways. The first would have been to supply the credentials to the instance, like I had in the past. This exposes those credentials on the instance though, which can be dangerous. The second option is to apply a role to the instance that says “you are allowed to access this S3 bucket, and you can do these things to it”.

I went with the second option, and it worked swimmingly (once I got it all configured).

Looking back at the way I had done the dependency distribution, I realised that using IAM roles would be a more secure option, closer to best practice. Now I just needed a justifiable opportunity to implement it.

New Environment, Time to Improve

We’ve started work on a new service, which means new environment setup. This is a good opportunity to take what you’ve done previously and reuse it, improving it along the way. For me, this was the perfect chance to try and use IAM roles for the dependency distribution, removing all of those nasty “credentials in the clear” situations.

I followed the same process that I had for the logs processing. Set up a role describing the required policy (read-only access to the S3 bucket that contains the dependencies), then link that role to an instance profile. Finally, apply the profile to the instances in question.

"ReadOnlyAccessToDependenciesBucketRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/",
        "Policies" : [
            {
                "Version" : "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                    },
                    {
                        "Effect": "Allow",
                        "Action": [ "s3:ListBucket" ],
                        "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                    }
                ]
            }
        ]
    }
},
"ReadOnlyDependenciesBucketInstanceProfile": {    
    "Type": "AWS::IAM::InstanceProfile",    
    "Properties": { 
        "Path": "/", 
        "Roles": [ { "Ref": "ReadOnlyDependenciesBucketRole" }, { "Ref": "FullControlLogsBucketRole" } ] 
    }
},
"InstanceLaunchConfig": {    
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {        
        * snip *    
    },    
    "Properties": {        
        "KeyName": { "Ref": "KeyName" },        
        "ImageId": { "Ref": "AmiId" },        
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],        
        "InstanceType": { "Ref": "InstanceType" },        
        "IamInstanceProfile": { "Ref": "ReadOnlyDependenciesBucketInstanceProfile" },        
        "UserData": {            
            * snip *        
        }    
    }
}

It worked before, so it should work again, right? I’m sure you can probably guess that that was not the case.

The first mistake I made was attempting to specify multiple roles in a single profile. I wanted to do this because the logs processor needed to maintain its permissions to the logs bucket, but it needed the new permissions to the dependencies bucket as well. Even though the roles element is defined as an array, it can only accept a single element. I now hate whoever designed that, even though I’m sure they probably had a good reason.

At least that was an easy fix: flip the relationship between roles and policies. I split the inline policies out of the roles, then linked the policies back to the roles instead. Each profile only had one role, so everything should have been fine.

"ReadOnlyDependenciesBucketPolicy": {
    "Type":"AWS::IAM::Policy",
    "Properties": {
        "PolicyName": "ReadOnlyDependenciesBucketPolicy",
        "PolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" }, "/*" ] ] }
                },
                {
                    "Effect": "Allow",
                    "Action": [ "s3:ListBucket" ],
                    "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "DependenciesS3Bucket" } ] ] }
                }
            ]
        },
        "Roles": [
            { "Ref" : "InstanceRole" },
            { "Ref" : "OtherInstanceRole" }
        ]
    }
},
"InstanceRole": {
    "Type": "AWS::IAM::Role",
    "Properties": {
        "AssumeRolePolicyDocument": {
            "Version" : "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": { "Service": [ "ec2.amazonaws.com" ] },
                    "Action": [ "sts:AssumeRole" ]
                }
            ]
        },
        "Path": "/"
    }
},
"InstanceProfile": {
    "Type": "AWS::IAM::InstanceProfile",
    "Properties": { "Path": "/", "Roles": [ { "Ref": "InstanceRole" } ] }
}

Ha ha ha ha ha, no.

The cfn-init logs showed that the process was getting 403s when trying to access the S3 object URL. I had incorrectly assumed that because the instance was running with the appropriate role (and it was; if I remoted onto the instance and attempted to download the object from S3 via the AWS Powershell Cmdlets, it worked just fine), cfn-init would use that role.

It does not.

You still need to specify the AWS::CloudFormation::Authentication element, naming the role and the bucket that it will be used for. This feels a little crap, to be honest. Surely the cfn-init application is using the same AWS components, so why doesn’t it just pick up the credentials from the instance profile like everything else does?

Anyway, I added the Authentication element with appropriate values, like so.

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
          "S3AccessCreds": {
            "type": "S3",
            "roleName": { "Ref" : "InstanceRole" },
            "buckets" : [ { "Ref" : "DependenciesS3Bucket" } ]
          }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Then I started getting different errors. You may think this is a bad thing, but I disagree. Different errors mean progress. I’d switched from getting 403 responses (access denied) to getting 404s (not found).

Like I said, progress!

The Dependencies Archive is a Lie

It was at this point that I gave up trying to use the IAM roles. I could not for the life of me figure out why it was returning a 404 for a file that clearly existed. I checked and double checked the path, and even used the same path to download the file via the AWS Powershell Cmdlets on the machines that were having the issues. It all worked fine.
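The manual check was nothing fancy; something along these lines, using the AWS Tools for PowerShell on the instance itself (the bucket and key are placeholders):

# Same bucket, same key that cfn-init was being given, downloaded directly.
# This worked every time, which made the 404 from cfn-init even more confusing.
Read-S3Object -BucketName "my-environment-dependencies" `
    -Key "dependencies/20150408/dependencies.zip" `
    -File "C:\temp\dependencies.zip"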

Assuming the issue was with my IAM role implementation, I rolled back to the solution that I knew worked: specifying the Access Key and Secret in the AWS::CloudFormation::Authentication element of the LaunchConfig. I also removed the new IAM role resources (for read-only access to the dependencies archive).

"InstanceLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Metadata": {
        "Comment": "Set up instance",
        "AWS::CloudFormation::Init": {
            * snip *
        },
        "AWS::CloudFormation::Authentication": {
            "S3AccessCreds": {
                "type": "S3",
                "accessKeyId" : { "Ref" : "DependenciesS3BucketAccessKey" },
                "secretKey" : { "Ref": "DependenciesS3BucketSecretKey" },
                "buckets" : [ { "Ref":"DependenciesS3Bucket" } ]
            }
        }
    },
    "Properties": {
        "KeyName": { "Ref": "KeyName" },
        "ImageId": { "Ref": "AmiId" },
        "SecurityGroups": [ { "Ref": "InstanceSecurityGroup" } ],
        "InstanceType": { "Ref": "ApiInstanceType" },
        "IamInstanceProfile": { "Ref": "InstanceProfile" },
        "UserData": {
            * snip *
        }
    }
}

Imagine my surprise when it also didn’t work, throwing back the same response, 404 not found.

I tried quite a few things over the next few hours, and there was much gnashing and wailing of teeth. I’ve seen some weird crap with S3 and bucket names (too long and you get errors, weird characters in your key and you get errors, etc) but as far as I could tell, everything was kosher. Yet it just wouldn’t work.

After doing a line-by-line diff of the working template/scripts (the other environment setup) against my new template/scripts, I realised my error.

While working on the IAM role stuff, trying to get it to work, I had attempted to remove case sensitivity from the picture by calling ToLowerInvariant on the dependencies archive URL that I was passing to my template. The old script/template combo didn’t do that.

When I took that out, it worked fine.

The issue was that the key of the file being uploaded was not being lower-cased; only the URL of the resulting file was, and S3 object keys are case sensitive.
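In other words, the mismatch was something like this (a sketch with placeholder names, not the actual script):

$bucket = "my-environment-dependencies"
$archive = ".\script-working\dependencies.zip"

# The object was uploaded with its original casing intact...
$key = "Dependencies/Environment-Setup/dependencies.zip"
Write-S3Object -BucketName $bucket -Key $key -File $archive

# ...but the URL handed to the CloudFormation template was lower-cased, so
# cfn-init was asking for an object that genuinely did not exist.
$dependenciesUrl = "https://s3-ap-southeast-2.amazonaws.com/$bucket/$key".ToLowerInvariant()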

Goddamn it.

Summary

I lost basically an entire day to case sensitivity. It’s not even the first time this has happened to me (well, it’s the first time it’s happened in S3, I think). I come from a heavy Windows background. I don’t even consider case sensitivity to be a thing. I can understand why it’s a thing (technically different characters and all), but it’s just not on Windows, so it’s not even on my radar most of the time. I assume the case sensitivity in S3 is a result of the AWS backend being Unix/Linux based, but it’s still a shock to find a case sensitive URL.

It turns out that my IAM stuff had started working just fine, and I was getting 404s for an entirely different reason. I had assumed that I was still doing something wrong with my permissions and the API was just giving a crappy response (i.e. not really a 404, some sort of permission-based can’t-find-file error masquerading as a 404).

At the very least I didn’t make the silliest mistake you can make in software (assuming the platform is broken); I just assumed I had configured it wrong somehow. That’s generally a fairly safe assumption when you’re using a widely distributed system. Sometimes you do find a feature that is broken, but it is far more likely that you are just doing it wrong. In my case, the error message was completely accurate and was telling me exactly the right thing; I just didn’t realise why.

Somewhat ironically, the root cause of my 404 issue was my attempt to remove case sensitivity from the picture when I was working on getting the IAM stuff up and running. I just didn’t apply the case insensitivity consistently.

Ah well.


Such a stupid thing.

Even though I was really careful, I still did the thing.

I knew about the thing, I planned to avoid it, but it still happened.

I accidentally uploaded some AWS credentials (Key + Secret) into GitHub…

I’m going to use this blog post to share exactly what happened, how we responded, how Amazon responded and some thoughts about how to avoid it.

Anatomy of Stupidity

The repository that accompanied my JMeter post last week had a CloudFormation script in it, to create a variable-sized army of AWS instances ready to run load tests over a service. You’re probably thinking that’s where the credentials were, that I’d hardcoded them into the environment creation script and forgotten to remove them before uploading to GitHub.

You would be wrong. My script was parameterised well, requiring you to supply credentials (amongst other things) in order to create the appropriate environment using CloudFormation.

My test though…

I recently started writing tests for my Powershell scripts using Pester. In this particular case, I had a test that created an environment and then verified that parts of it were working (i.e. URL in output resolved, returned 200 OK for a status query, stuff like that), then tore it down.

The test had hardcoded credentials in it. The credentials were intended to be used with CloudFormation, so they were capable of creating various resources, most importantly EC2 instances.
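Stripped right down, the offending test looked something like this. The environment creation and teardown functions are made up for the sake of the example, and the credentials shown are the standard AWS documentation placeholders, but the shape of the mistake is accurate: the credentials were sitting in the test file itself.

Describe "Environment creation" {
    It "creates an environment and verifies that it responds" {
        # The mistake: real credentials hardcoded in the test (these values are
        # the well-known AWS documentation examples, not real ones).
        $awsKey = "AKIAIOSFODNN7EXAMPLE"
        $awsSecret = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

        # New-Environment and Remove-Environment stand in for the real
        # CloudFormation wrapper functions, which took credentials as parameters.
        $environment = New-Environment -AwsKey $awsKey -AwsSecret $awsSecret -Name "load-test"
        try {
            (Invoke-WebRequest -Uri $environment.ApiUrl -UseBasicParsing).StatusCode | Should Be 200
        }
        finally {
            Remove-Environment -AwsKey $awsKey -AwsSecret $awsSecret -Name "load-test"
        }
    }
}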

Normally when I migrate some scripts/code that I’ve written at work into GitHub for public consumption I do two things.

One, I copy all of the files into a fresh directory and pore over the resulting code for references to anything that might be specific to us. Company names, credentials, those sorts of things. I excise all of the non-generic functionality, and anything that I don’t want to share (mostly stuff not related to the blog post in question).

Two, I create a fresh git repository from those edited files. The main reason I do this instead of just copying the repository is that the history of the repository would otherwise contain all of those changes, and that’s a far more subtle leak.

There is no excuse for me exposing the credentials except for stupidity. I’ll wear this one for a while.

Timeline of Horror

Somewhere between 1000 and 1100 on Wednesday April 8, I uploaded my JMeter blog post, along with its associated repository.

By 1130 an automated bot had already retrieved the credentials and had started to use them.

As far as we can tell, the bot picks up all of the regions that you do not have active resources in (so for us, that’s basically anything outside ap-southeast) and creates the maximum possible number of c4.8xlarge instances in every single region, via spot requests. All told, 500+ instances spread across the world. Alas, we didn’t keep an example of one of the instances (too concerned with terminating them ASAP), but we assume from reading other resources that they were being used to mine Bitcoins and then transfer them anonymously to some destination.

At 1330 Amazon notified us that our credentials were compromised, via an email to an inbox that was not being actively monitored for reasons that I won’t go into (but I’m sure it will be actively monitored from now on). They also prevented our account from doing various things, including the creation of new credentials and particularly expensive resources. Thanks, Amazon! Seriously, your (probably automated) actions saved us a lot more grief.

The notification from Amazon was actually pretty awesome. They pinpointed exactly where the credentials were exposed in GitHub. They must have the same sort of bots running as the thieves, except used for good, rather than evil.

At approximately 0600 Thursday April 9, we received a billing alert that our AWS account had exceeded a limit we had set in place. Luckily this did go to an inbox that was being actively monitored, and our response was swift and merciless.

Within 15 minutes we had terminated the exposed credentials, terminated all of the unlawful instances and removed all of the spot requests. We created a script to select all resources within our primary region that had been modified or created in the last 24 hours and reviewed the results. Luckily, nothing had been modified within our primary region. All unlawful activity had occurred outside, probably in the hopes that we wouldn't notice.
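The audit script wasn’t anything sophisticated. Roughly, the EC2 part of it looked like the sketch below, assuming the AWS Tools for PowerShell are installed and ap-southeast-2 is the primary region; the real script covered more resource types than just instances and spot requests.

$region = "ap-southeast-2"
$cutoff = (Get-Date).AddHours(-24)

# Any instance launched in the last 24 hours is worth a second look.
$recentInstances = (Get-EC2Instance -Region $region).Instances |
    Where-Object { $_.LaunchTime -gt $cutoff }
$recentInstances | Select-Object InstanceId, InstanceType, LaunchTime | Format-Table

# We don't normally use spot requests at all, so any that exist are suspicious.
$spotRequests = Get-EC2SpotInstanceRequest -Region $region |
    Where-Object { $_.State.Value -in @("open", "active") }
$spotRequests | Select-Object SpotInstanceRequestId, State, CreateTime | Format-Table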

Ramifications

In that almost 24 hour period, the compromise resulted in just over $8000 AUD of charges to our AWS account.

Don’t underestimate the impact that exposed credentials can have. It happens incredibly quickly.

I have offered to pay for the damages out of my own pocket (as is only appropriate), but AWS also has a concession strategy for this sort of thing, so we’ll see how much I actually have to pay in the end.

Protecting Ourselves

Obviously the first point is don’t store credentials in a file. Ever. Especially not one that goes into source control.

The bad part is, I knew this, but I stupidly assumed it wouldn't happen to me because our code is not publicly visible. That would have held true if I hadn’t used some of the scripts that I’d written as the base for a public repository to help explain a blog post.

Never assume your credentials are safe if they are specified in a file, unless that file is protected by a mechanism that doesn’t live alongside it (so encrypting them and having the encryption key in the same codebase is not enough).

I have since removed all credential references from our code (there were a few copies of that ill-fated environment creation test in various repositories, but luckily no others) and replaced them with a mechanism to supply credentials via a global hashtable entered at the time the tests are run. It’s fairly straightforward, and is focused on telling you which credentials are missing when they cannot be found. No real thought has been given to making sure the credentials are secure on the machine itself; it’s focused entirely on keeping secrets off the disk.

function Get-CredentialByKey
{
    [CmdletBinding()]
    param
    (
        [string]$keyName
    )

    if ($globalCredentialsLookup -eq $null)
    {
        throw "Global hashtable variable called [globalCredentialsLookup] was not found. Credentials are specified at the entry point of your script. Specify hashtable content with @{KEY=VALUE}."
    }

    if (-not ($globalCredentialsLookup.ContainsKey($keyName)))
    {
        throw "The credential with key [$keyName] could not be found in the global hashtable variable called [globalCredentialsLookup]. Specify hashtable content with @{KEY=VALUE}."
    }

    return $globalCredentialsLookup.Get_Item($keyName)
}
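Usage looks like this: the hashtable is populated once, at the entry point of the test run (interactively here, but it could just as easily come from TeamCity parameters), and the keys are whatever names the tests ask for; the ones below are just illustrative.

# Supplied by a person or the build server at run time; never written to disk.
$global:globalCredentialsLookup = @{
    "aws.deployment.key" = Read-Host "AWS access key"
    "aws.deployment.secret" = Read-Host "AWS secret key"
}

$awsKey = Get-CredentialByKey "aws.deployment.key"
$awsSecret = Get-CredentialByKey "aws.deployment.secret"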

The second point is specific to AWS credentials. You should always limit the credentials to only exactly what they need to do.

In our case, there was no reason for the credentials to be able to create instances outside of our primary region. Other than that, they were pretty good (they weren’t administrative credentials for example, but they certainly did have permission to create various resources used in the environment).
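As a sketch of what that looks like in practice, a deployment user’s policy can be pinned to the primary region with a condition. The user name, policy name and action list below are purely illustrative, and the aws:RequestedRegion condition key is just one way of expressing the restriction.

$policyDocument = @"
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [ "ec2:*", "cloudformation:*", "elasticloadbalancing:*" ],
      "Resource": "*",
      "Condition": { "StringEquals": { "aws:RequestedRegion": "ap-southeast-2" } }
    }
  ]
}
"@

# Attach the region-restricted inline policy to the user whose credentials the
# environment scripts use (names are hypothetical).
Write-IAMUserPolicy -UserName "environment-deployer" `
    -PolicyName "deploy-primary-region-only" `
    -PolicyDocument $policyDocument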

The third point is obvious: make sure you have a reliable communication channel for messages like compromise notifications, one that is guaranteed to be monitored by at least one person at all times. This would have saved us a tonne of grief. The earlier you know about this sort of thing, the better.

Summary

AWS is an amazing service. It lets me treat hardware resources in a programmable way, and stops me from having to wait on other people (who probably have more important things to do anyway). It lets me create temporary things to deal with temporary issues, and is just generally a much better way to work.

With great power comes great responsibility.

Guard your credentials closely. Don’t be stupid like me, or someone will get a nasty bill, and it will definitely come back to you. Also you lose engineer points. Luckily I’ve only done two stupid things so far this year. If you were curious, the other one was that I forgot my KeePass password for the file that contains all of my work related credentials. Luckily I had set it to match my domain password, and managed to recover that from Chrome (because we use our domain credentials to access Atlassian tools).

This AWS thing was a lot more embarrassing.