The Case Of The Phantom Bucket
A very short post this week, as I’m still struggling with my connection leak and a number of other things (RavenDB production server performance issues are the biggest one, but also automating a Node/NPM-built website into our current CI architecture, which is mostly based around PowerShell/MSBuild). It’s been a pretty discombobulated week.
The topic of this incredibly short post?
Phantom buckets in S3.
There Is A Hole In The Bucket
Our environments often include S3 buckets, and those buckets are typically created via the same CloudFormation template as the other components (EC2 instances, ELBs, Auto Scaling Groups, etc.).
Until now, the names of these buckets have been relatively straightforward: a combination of company name + environment (e.g. ci, staging) + component (like the auth service) + the purpose of the bucket (logs, images, documents, whatever).
This works great. Your buckets have sane names, so you know where to look for things, and it’s easy to apply different lifecycle management depending on the bucket’s purpose.
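As an illustration (the resource and bucket names here are placeholders; the lifecycle rule mirrors the one visible in the error event further down), the relevant part of the template looks something like this:

"LogsBucket" : {
    "Type" : "AWS::S3::Bucket",
    "Properties" : {
        "BucketName" : "mycompany-ci-authservice-logs",
        "LifecycleConfiguration" : {
            "Rules" : [
                { "Id" : "1", "Status" : "Enabled", "ExpirationInDays" : "7" }
            ]
        }
    }
}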
Unfortunately, it’s not all wonderful happy time.
The first issue is that CloudFormation will not delete a bucket with contents. I can understand this from a safety point of view, but when the AWS tooling will happily empty and delete a bucket for you in one operation, the disconnect is frustrating.
What this means is that you now need to delete the bucket contents outside of the actual stack deletion. It’s especially annoying for buckets being used to contain ELB logs, as there is an extremely good chance of files being written after you’ve cleared the bucket ready for CloudFormation to delete it. I’ve solved this issue by just deleting the bucket outside of the stack teardown (we already do some other things there, like Octopus management, so it’s not entirely unprecedented).
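A minimal sketch of that deletion, assuming the AWS Tools for PowerShell module (the _RemoveBucket helper referenced in the teardown code later is along these lines, though the real one may differ):

function _RemoveBucket
{
    param($bucketName, $awsKey, $awsSecret, $awsRegion)

    # -DeleteBucketContent empties the bucket first (including any ELB log files
    # that arrived late), sidestepping the refusal to delete a non-empty bucket.
    Remove-S3Bucket -BucketName $bucketName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion -DeleteBucketContent -Force
}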
The second issue is phantom buckets.
OooOOooOoo
I’ve encountered this issue twice now. Once for our proxy environment, and now once for one of our APIs.
What happens is that when the environment attempts to spin up (our CI environments are recreated every morning to verify that our environment creation scripts work as expected), it will fail because it cannot create the bucket. The actual error is incredibly unhelpful:
{ "EventId" : "LogsBucket-CREATE_FAILED-2015-11-02T21:49:55.907Z", "LogicalResourceId" : "LogsBucket", "PhysicalResourceId" : "OBFUSCATED_BUCKET_NAME", "ResourceProperties" : "{\"BucketName\":\"OBFUSCATED_BUCKET_NAME\",\"LifecycleConfiguration\":{\"Rules\":[{\"Status\":\"Enabled\",\"Id\":\"1\",\"ExpirationInDays\":\"7\"}]}}\n", "ResourceStatus" : "CREATE_FAILED", "ResourceStatusReason" : "The specified bucket does not exist", "ResourceType" : "AWS::S3::Bucket", "StackId" : "OBFUSCATED_STACK_ID", "StackName" : "OBFUSCATED_STACK_NAME", "Timestamp" : "\/Date(1446500995907)\/" }
If I go into the AWS dashboard and look at my buckets, it’s clearly not there.
If I try to create a bucket with the expected name, it fails, saying the bucket already exists.
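You can reproduce the contradiction from the shell (a sketch, assuming the AWS Tools for PowerShell module and an obfuscated bucket name):

# Listing buckets shows no sign of it...
Get-S3Bucket -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion | Where { $_.BucketName -eq "obfuscated-bucket-name" }

# ...but trying to create it fails with a "bucket already exists" style error.
New-S3Bucket -BucketName "obfuscated-bucket-name" -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion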
It’s a unique enough name that it seems incredibly unlikely that someone else has stolen it (bucket names being globally unique), so I can only assume that something has gone wrong inside AWS and the bucket still technically exists somehow, but we’ve lost control over it.
Somehow.
Of course, because the bucket is an intrinsic part of the environment, I now can’t create my CI environment for that particular service. Which means we can’t successfully build/deploy anything involving that service, because CI is typically used for functional test validation.
Who Ya Gonna Call? Ghostbusters!
The only solution I could come up with was to make sure that every time an environment is created, the buckets have completely unique names. With only 63 characters to work with, this is somewhat challenging, especially if we want to maintain nice sane bucket names that a human could read.
What I ended up doing was shortening the human-readable part (just environment + component + purpose) and appending a GUID to the end.
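As a sketch (the variable names here are illustrative, not lifted from the real scripts), the name generation looks something like this:

# S3 bucket names must be lowercase and at most 63 characters.
# A GUID formatted without dashes is 32 characters, leaving ~30 for the readable prefix.
$bucketName = ("$environment-$component-$purpose-" + [Guid]::NewGuid().ToString("N")).ToLowerInvariant()

if ($bucketName.Length -gt 63)
{
    throw "Generated bucket name [$bucketName] is longer than the 63 character S3 limit."
}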
Now that I couldn’t predict the name of the bucket though, I had to fix up a couple of other loose ends.
The first was that the bucket deletion (during environment teardown) now had to query the stack itself to find out the bucket resources. Not overly difficult.
try
{
    if ($environment -ne $null)
    {
        # Ask the stack for its resources, rather than assuming a predictable bucket name.
        $resources = Get-CFNStackResources -StackName $environment.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $s3Buckets = $resources | Where { $_.ResourceType -eq "AWS::S3::Bucket" }
        foreach ($s3Bucket in $s3Buckets)
        {
            try
            {
                # For an AWS::S3::Bucket resource, the physical resource id is the bucket name.
                $bucketName = $s3Bucket.PhysicalResourceId
                _RemoveBucket -bucketName $bucketName -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
            }
            catch
            {
                Write-Warning "Error occurred while trying to delete bucket [$bucketName] prior to stack destruction."
                Write-Warning $_
            }
        }
    }
}
catch
{
    Write-Warning "Error occurred while attempting to get S3 buckets to delete from the CloudFormation stack."
    Write-Warning $_
}
The second was that our Octopus projects used the predictable bucket name during deployments, so I had to change the environment setup code to update the project variables with the correct value. This was a little more difficult, but because Octopus is awesome from an automation point of view, it eventually worked.
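A rough sketch of that variable update, using the Octopus REST API (the server URL, API key, project slug and variable name below are all placeholders, not the real values):

$headers = @{ "X-Octopus-ApiKey" = $octopusApiKey }

# Find the project and its associated variable set.
$project = Invoke-RestMethod -Uri "$octopusUrl/api/projects/auth-service" -Headers $headers
$variables = Invoke-RestMethod -Uri "$octopusUrl/api/variables/$($project.VariableSetId)" -Headers $headers

# Point the bucket name variable at the freshly generated bucket and save the set back.
($variables.Variables | Where { $_.Name -eq "LogsBucketName" }).Value = $bucketName
Invoke-RestMethod -Uri "$octopusUrl/api/variables/$($variables.Id)" -Headers $headers -Method Put -Body ($variables | ConvertTo-Json -Depth 10)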
Summary
I can see how this sort of situation can arise in a disconnected, eventually consistent architecture, but that doesn’t make it any less frustrating.
It could be my fault for constantly creating/deleting buckets as part of the environment management scripts, but given that it doesn’t happen every time, it really does feel like a bug of some sort.
Plus, ghost buckets are scary. Does that mean there is some of my data up there in AWS that I no longer have control over? I mean, I can’t even see it, let alone manage it.
A sobering thought.