503: Health Not Available

August 25. 2015 0 Comments

Posted in:
sick

No post this week. Throat plague is mostly gone, but all energy is being diverted into maintaining the facade of doing normal work.

Next time!

404: Health Not Found

August 18. 2015 0 Comments

Posted in:
sick

No post this week, I have some sort of throat plague.

Normal service will resume next week.

Pack It Up

August 12. 2015 0 Comments

So it turns out that Amazon delete their pre-packaged Windows AMI’s muchquicker than they delete their Linux ones.

I’ve known about this for a little while, but it wasn’t until recently that it bit me.

One of the first environments that we created using the new strategy of “fully codified and automated environment setup”, recently hit a little snag. It was a production environment, primarily intended for beta testing, and the understanding was that it would be refreshed (using the latest version of the environment setup scripts) before it became available to the general public.

Late last week, one of our scripts that shuts down AWS EC2 instances based on time of day (to limit costs) accidentally terminated both of the EC2 instances that make up the API layer in that particular environment. Normally, this wouldn’t be an issue. The Auto Scaling Group would realise that it no longer had as many instances as it should and it would recreate them. The API instances are mostly stateless, so after a small amount of time, everything would be fine again.

As I’m sure you can imagine, this did not happen.

Amazon has since removed the AMI that the API instances were based off, so the ASG couldn’t spin up any new instances to replace the ones that were terminated.

The service was down, and it was going to stay that way for a while until we managed to fix the root issue.

The Immediate Response

The first thing we did was update the existing CloudFormation stack to use the latest version of the Windows AMI that we were previously using. This at least allowed the API instances to be created. However, they never finished their initialization.

In the time between when the environment was initially provisioned and the time when it was accidentally destroyed, we had made quite a few changes to the common scripts that back our environment provisioning process. One of those was to specify the version of the Octopus Tentacle that was going to be installed on the machine. We had previously run into an issue when Octopus 3 was released where the latest tentacle no longer worked the same way, and with little time to investigate it, simply hardcoded the version that was installed to the one we had previously been using.

In order to fix this issue in the old environment we had to fix the script. Whenever an environment is provisioned, the scripts that it depends on are uploaded to S3, ready to be downloaded by EC2 instances and other resources that need access to them. Rather than manually dig in to the dependencies, it was just easier to do the planned environment refresh.

This went…okay. Not great, not terrible, but the problem was fixed and everything came back online before anyone tried to use the service the next day.

Fixing It Properly

I had actually been aware of the AMI missing issue for some time and was already working on a longer term fix. In fact, I had scheduled an environment refresh of the production/beta environment for this service for a bit later in the same week the incident happened. It was just unfortunate that the incident forced my hand.

The root cause of the issue is that we did not control all of the elements in the environment, specifically the AMI used. Having external dependencies isn’t always an issue (for example we use Nuget extensively, but old Nuget packages are generally left alone), but Amazon makes no guarantees as to the availability of the AMI’s they supply as time progresses. The solution is to create your own AMI’s, so that Amazon can’t just delete them out from underneath you.

There are upsides and downsides to managing your own AMI’s.

The primary upside is that you avoid issues like I’ve described above. Nobody external to your company is going to go and delete an AMI when you still have a dependency on it. Obviously if someone internal to your organization deletes the AMI you still have the same issue, but you at least have much more control over that situation.

Another upside is that you can include commonly installed software for your environments in your custom AMI’s. For us, that would be things like an unregistered Octopus Tentacle or the .NET Framework and ASP.NET (not for all machines, but at least for all API instances).

The primary downside is that you can no longer easily take advantage of the fact that new Amazon AMI’s are released on a regular basis, containing Windows Updates and other fixes (which are critically important to apply to machines that are exposed to the greater internet). You can still take advantage of those new AMI’s, its just a little bit more difficult.

Another downside is that you now have to manage your own AMI’s. This isn’t particularly difficult to be honest, but it is one more thing that you need to take care of, and I much prefer to simplify things rather than add more complexity.

The Mechanism

In an effort to avoid much of the manual work that can go into creating an AMI, I looked for a solution that was automated. I wanted to be able to run a process that simply spat out a customised AMI at the end, so that we could easily take advantage of new AMI’s as Amazon released them, and then refresh our environments as required.

Initially I looked into automating the process myself, using the various API’s available for AWS. I’d already done some work previously in creating an EC2 instance for the purposes of updating an AMI, so I started with that.

Shortly after, someone informed me of the existence of Packer.

Packer is a wonderful little application that allows you create AMI’s and virtual machines for a number of virtualisation platforms. It even works on Windows, without having to install some arcane dependency chain through the command line. Its just a collection of executables, which is nice.

Using Packer, I could put together the following configuration file that describes how I want my AMI to be structured.

{
    "variables" : {
        "aws_access_key" : "",
        "aws_secret_key" : "",
        "aws_region" : "",
        "source_ami" : "",
        "ami_name" : "",
        "user_data_file_path" : "",
        "octopus_api_key" : "",
        "octopus_server_url" : ""
    },
    "builders" : [{
            "type" : "amazon-ebs",
            "access_key" : "{{user `aws_access_key`}}",
            "secret_key" : "{{user `aws_secret_key`}}",
            "region" : "{{user `aws_region`}}",
            "source_ami" : "{{user `source_ami`}}",
            "instance_type" : "m3.large",
            "ami_name" : "{{user `ami_name`}}-{{timestamp}}",
            "user_data_file" : "{{user `user_data_file_path`}}",
            "vpc_id" : "vpc-a0a6aec9",
            "subnet_id" : "subnet-5908182d",
            "security_group_ids" : ["sg-0b65076e", "sg-4d188f28", "sg-faaf429f"],
            "ssh_keypair_name" : "YourKeyPair",
            "ssh_private_key_file":"C:\\temp\\YourKeyPair.pem",
            "communicator" : "winrm",
            "winrm_username" : "Administrator",
            "winrm_port" : 5985
        }
    ],
    "provisioners" : [
        {
            "type" : "powershell",
            "inline" : [
                "try",
                "{",
                    "$signalFilePath = \"C:\\signal\"",
                    "$content = Get-Content $signalFilePath",
                    "$maxWaitSeconds = 3000",
                    "$currentWaitSeconds = 0",
                    "$waitSeconds = 30",
                    "while ($content -eq \"1\" -and $currentWaitSeconds -lt $maxWaitSeconds) { Sleep -Seconds $waitSeconds; Write-Output \"Checking signal\"; $currentWaitSeconds += $waitSeconds; $content = Get-Content $signalFilePath; if ($content -eq \"-1\") { Write-Output \"User data signalled -1, indicating failure.\"; exit 1 } }",
                "}",
                "catch",
                "{",
                    "Write-Ouput \"An unexpected error occurred.\"",
                    "Write-Output $_",
                    "exit 1",
                "}"
            ]
        },
        {
            "type":"powershell",
            "scripts": [
                "@@ROOT_DIRECTORY_PATH\\scripts\\packer\\Ec2Config.ps1"
            ]
        }
    ]
}

The first part of the template describes various things about the EC2 instance that will be used to create the AMI, and the second part describes operations to perform on the instance in order to configure it the way you want it.

Note that the security groups used in the template above simply describe (in order) ability to connect via RDP and Windows Remote Management, Unfiltered Access Out and Octopus Tentacle Port In.

The configuration I’ve shared above is from our baseline Octopus Tentacle capable image. It comes with an Octopus Tentacle installed, but not configured (because its much more reliable to configure it at initialization time in CloudFormation).

The instance configuration is broken into two parts:

Use UserData to run some scripts that configure the proxy (so the machine can get to the internet) and download some dependencies, plus some other miscellaneous configuration (WinRM, Firewall, etc).
Use the Powershell script execution from Packer to run some scripts from the dependencies downloaded in 1.) to download and install an Octopus Tentacle.
Some other miscellaneous configuration.

Nothing too fancy.

The Windows support for Packer is still a bit on the rough side, mostly due to the fact that doing this sort of thing with Windows machines is inherently more complicated than it is for Linux boxes. Luckily for me, I started using Packer after the Windows plugins were incorporated into the primary build of the application, so I didn’t have to do anything special to get Windows support.

Gotchas

It definitely wasn’t all smooth sailing though.

The documentation for the creation of Windows AMI’s from Packer is a little sparse, so I had to do some trial and error in order to figure out how everything fit together.

The main mechanism for executing scripts remotely on Windows is WinRM (Windows Remote Managment), which is basically Powershell remote execution. As such, you need to make sure that you allow access to the machine over port 5985 or nothing will work. It won’t fail straightaway either, it will timeout, which can take upwards of 10 minutes.

You also need to make sure that you specify WinRM as the communication method. Most of the template examples on the web use SSH (because Linux), so its not immediately obvious that you can actually switch to a different communication method.

Finally, you need to include the EC2 and Bundle config files, to tell the instance that it needs to run sysprep, otherwise it won’t regenerate a new Administrator password when you use the AMI to create a new EC2 instance (and thus you wouldn’t be able to retrieve the password from the AWS API). It will also have saved state on it from last time, so its definitely better to run sysprep for an AMI that will be used generically.

Conclusion

I’ve uploaded a sanitised copy of the repository containing my Packer templates and scripts to Github. If you look, you can see that I haven’t done anything particularly fancy. All I’ve done is wrap the execution of Packer in some Powershell scripts to make it easier to run. I have two different scripts to create the two AMI’s that we need right now (Octopus capable + pre-installed IIS/.NET Framework), and when you run either of them with the appropriate parameters a brand new, timestamped AMI will be created in the appropriate AWS account.

Creating our own AMI’s fixes the scaling issue that started this whole blog post. Since we control them, we can be sure that they won’t be deleted and our ability to scale via Auto Scaling Groups is maintained for the life of the environment. Another benefit of this approach is that the provisioning of an environment is now quicker, as some of the components (especially IIS/.NET Framework) are now pre-installed for the components that require them. Considering our environments can take upwards of 20 minutes to provision, every minute counts.

The whole process of creating these AMI’s via Packer took me about a day or two, so it definitely wasn’t the most time consuming task I’ve ever completed.

Incorporating the AMI’s into our environment provisioning scripts was trivial, as they already searched for the appropriate AMI to use dynamically, I just had to change the search parameters.

In the end I’m fairly pleased with Packer and how easy it made the AMI creation process. If I had to use the AWS Powershell cmdlets (or the CLI app) directly for all of this, I probably would have wasted a lot of time.

And sanity.

Actually Not a Length Issue After All

August 4. 2015 0 Comments

A while back I made a post about how cfn-init was failing with an error that didn’t seem to make any sense (utf8 codec can’t decode, invalid continuation byte). At the time, I came to the conclusion that the reason why it was failing was the length of the output.

That was actually incorrect.

The reason I came to that conclusion was because the action that seemed to fix the issue was whether or not the –Verbose flag was present on the Powershell script being called. If I had the flag on, I would get the error. Off, and everything was fine.

At first I trusted the error. I assumed that somewhere in my output there was actually an invalid set of bytes, at least as far as utf8 is concerned. It seemed entirely plausible, considering it was Ruby, parsing the output from a Powershell script, which was running an executable directly. So many layers, so many different ways in which it could break, the least of which would be output stream encoding incompatibilities.

My initial investigation into the output stream didn’t seem to show any invalid bytes, so I assumed that cfn-init was doing something stupid, and truncating the output stream because it was too long. If it truncated the stream, it was feasible that it was shearing a UTF8 byte pair in half, hence the invalid continuation byte error. It made sense to me, because:

I had only recently added a large amount of extra verbose logging to the scripts, and that seemed to be the primary difference between the version that worked and the version that failed.
I’ve had issues with AWS components giving arcane errors when you exceed some arbitrary length before. The Powershell cmdlet New-CfnStack will fail if you try to upload a template directly and the template is too long. It also gives incredibly unhelpful errors, about malformed XML, when in reality, its a length issue.

I accepted that I could not use verbose logging for the deployment and moved on, but it always bugged me.

I don’t like losing Verbose logging. Its extremely useful for when things go bad, which they do. All the time.

Round Two (or Three?)

I got a chance to go back to the deployment scripts recently, because some deployments were failing and I wasn’t sure why. I needed the verbose logging back, so I re-added the –Verbose flag and tried to get to the bottom of why it was failing.

My first attempt simply commented out the lines that I knew could be particularly spammy (search functions testing a predicate against a collection of objects).

The error still occurred.

I ran the script by itself (in a clean environment) and the output wasn’t even that long. I realised that I had made an incorrect conclusion from my initial investigation. It definitely wasn’t a length issue.

It was time to do a manual binary search.

I knew which script was failing, so I suppressed verbose output in half the script and tested it again to see if it failed. Its easy enough to temporarily suppress verbose output in Powershell, if a bit messy.

$pref = $VerbosePreference
$VerbosePreference = "SilentlyContinue"

... operations here ...

$VerbosePreference = $pref

What followed was a few hours of manually searching for the piece of script that broke cfn-init. I couldn’t just run it locally, because everything worked just fine, I had to instantiate an environment and get it to initialise itself as it usually would to see if the script would fail or not. It had to be a fresh environment too, because if I ran cfn-init again on an environment that had already failed, it would work just fine.

An environment takes at least 10 minutes to reach the point of failure.

It was the slowest binary search ever.

Something About Rooting Issues?

Eventually I got to the root of the problem. I was piping the output from nuget.exe (used for package installation) to the Verbose stream. Somewhere in the output from nuget, there was actually (maybe?) an invalid UTF8 character, according to the code used by cfn-init anyway. The reason it didn’t fail if you ran it a second time, was because that component was already successfully installed, so it didn’t try to install it again.

I could leave every Write-Verbose statement in place except for that one, and it would all work fine. This meant that I could finally get the verbose output from the Octopus deployment back, which was the main thing I wanted. Sure I could see it through Octopus, but I like all of my information to be in one place, because it just makes everything easier.

To complicate matters further, it wasn’t all nuget output that triggered the issue. For example, the 7Zip command line tools are one of the first components to be installed, in order to make sure the AWS cmdlets are available (they are distributed to each machine as a 7zip file). That particular component would install fine. It seemed to only be the Octopus Client package (the one that contains the .NET libraries) that caused the issue. I’m still not 100% sure of that to be honest, I was mostly just happy to get my verbose deployment errors back in the cfn-init logs, and I had to move on to something else.

To be honest, the fact that I don’t know for sure why the nuget installation output breaks the script is another application of the same mistake I made the first time, and I’m almost certainly going to have to revisit it at some point in the future, where I will no doubt discover something else entirely is actually the root cause. This solution is still better than the original one though, which is enough for now.

Conclusion

Sometimes in software you think you know the answer, but it turns out you’ve made some poor assumptions or conclusions, and just completely missed the mark. In my case, the ramifications were not particularly bad, I was just missing out on informational output, but that’s not always the case. Sometimes the results of a poor assumption/conclusions can be much worse.

The important thing to note is that I stopped investigating once I found something that made the bug go away, instead of investigating in more depth as to whythe action made the bug go away. Sometimes this is a compromise (usually as a result of time contraints/pressure), but I’ve found its almost always worthwhile to spend the time to understand the why of an issue before accepting a solution. Obviously at some stage you do need to just accept that you don’t understand the problem fully, so like everything in software development, its a balancing act.

At the very least if you investigate for a little bit longer, you’ll understand the problem better, which can only be a good thing.

Moving House Is Stressful

July 28. 2015 0 Comments

Now that we’ve (somewhat) successfully codified our environment setup and are executing it automatically every day with TeamCity, we have a new challenge. Our setup scripts create an environment that has some set of features/bugs associated with it. Not officially (we’re not really tracking environment versions like that), but definitely in spirit. As a result, we need to update environments to the latest version of the “code” whenever we fix a bug or add a feature. Just like deploying a piece of software.

To be honest, I haven’t fully nailed the whole codified environment thing just yet, but I am getting closer. Giving it some thought, I think I will probably move towards a model where the environment is built and tested (just like a piece of software) and then packaged and versioned, ready to be executed. Each environment package should consist of installation and uninstallation logic, along with any other supporting actions, in order to make them as self contained as possible.

That might be the future. For now, we simply have a repository with scripts in it for each of our environments, supported by a set of common scripts.

The way I see it, environments fall into two categories.

Environments created for a particular task, like load testing or some sort of experimental development.
Environments that take part in your deployment pipeline.

The fact that we have entirely codified our environment setup gives us the power to create an environment for either of the above. The first point is not particularly interesting, but the second one is.

We have 3 standard environments, which are probably familiar to just about anyone (though maybe under different names). They are, CI, Staging and Production.

CI is the environment that is recreated every morning through TeamCity. It is used for continuous integration/deployment, and is typically not used directly for manual testing/demonstration/anything else. It forms an important part of the pipeline, as after deployment, automated functional tests are run on it, and if successful that component is (usually) automatically propagated to Staging.

Staging is, for all intents and purposes, a Production level environment. It is stable (only components that fully pass all of their tests are deployed here) and is used primarily for manual testing and feature validation, with a secondary focus on early integration within a trusted group of people (which may include external parties and exceptional customers).

Production is of course production. Its the environment that the greater world uses for any and all executions of the software component (or components) in question. It is strictly controlled and protected, to make sure that we don’t accidentally break it, inconveniencing our customers and making them unhappy.

The problem is, how do you get changes to the underlying environment (i.e. a new version of it) into Staging/Production, without losing any state held within the system? You can’t just recreate the environment (like we do each morning for CI), because the environment contains the state, and that destroys it.

You need another process.

Migration.

Birds Fly South for the Winter

Migration, for being such a short word, is actually incredibly difficult.

Most approaches that I’ve seen in the past, involved some sort of manual migration strategy (usually written down and approved by 2+ people), which is executed by some long suffering operations person at midnight when hopefully no-one is trying o use the environment for its intended purpose.

A key component to any migration strategy: What happens if it goes wrong? Otherwise known as a rollback procedure.

This is, incidentally, where everything gets pretty hard.

With our environments being entirely codified in a mixture of Powershell and CloudFormation, I wanted to create something that would automatically update an environment to the latest version, without losing any of the data currently stored in the environment, and in a safe way.

CloudFormation offers the ability to update a stack after it has been created. This way you can change the template to include a new resource (or to change existing resources) and then execute the update and have AWS handle all of the updating. This probably works fine for most people, but I was uneasy at the prospect. Our environments are already completely self contained and I didn’t understand how CloudFormation updates would handle rollbacks, or how updates would work for all components involved. I will go back and investigate it in more depth at some point in the future, but for now I wanted a more generic solution that targeted the environment itself.

My idea was fairly simple.

What if I could clone an environment? I could make a clone of the environment I wanted to migrate, test the clone to make sure all the data came through okay and its behaviour was still the same, delete the old environment and then clone the temporary environment again, into the original environments name. At any point up to the delete of the old environment I could just stop, and everything would be the same as it was before. No need for messy rollbacks that might might only do a partial job.

Of course, the idea is not actually all that simple in practice.

A Perfect Clone

In order to clone an environment, you need to identify the parts of the environment that contain persistent data (and would not automatically be created by the environment setup). Databases and file storage (S3, disk, etc) are examples of persistent data. Log files are another example of persistent data, except they don’t really matter from a migration point of view, mostly because all of our log entries are aggregated into an ELK stack. Even if they weren’t aggregated, they probably still wouldn’t be worth spending time on.

In the case of the specific environment I’m working on for the migration this time, there is an RDS instance (the database) and at least one S3 bucket containing user data. Everything else about the environment is transient, and I won’t need to worry about it.

Luckily for me, cloning an RDS instance and an S3 bucket is relatively easy.

With RDS you can simply take a snapshot and then use that snapshot as an input into the RDS instance creation on the new environment. Fairly straightforward.

function _WaitRdsSnapshotAvailable
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$snapshotId,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [int]$timeoutSeconds=3000
    )

    write-verbose "Waiting for the RDS Snapshot with Id [$snapshotId] to be [available]."
    $incrementSeconds = 15
    $totalWaitTime = 0
    while ($true)
    {
        $a = Get-RDSDBSnapshot -DBSnapshotIdentifier $snapshotId -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret
        $status = $a.Status

        if ($status -eq "available")
        {
            write-verbose "The RDS Snapshot with Id [$snapshotId] has exited [$testStatus] into [$status] taking [$totalWaitTime] seconds."
            return $a
        }

        write-verbose "Current status of RDS Snapshot with Id [$snapshotId] is [$status]. Waiting [$incrementSeconds] seconds and checking again for change."

        Sleep -Seconds $incrementSeconds
        $totalWaitTime = $totalWaitTime + $incrementSeconds
        if ($totalWaitTime -gt $timeoutSeconds)
        {
            throw "The RDS Snapshot with Id [$snapshotId] was not [available] within [$timeoutSeconds] seconds."
        }
    }
}

... snip some scripts getting CFN stacks ...

$resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

$rds = $resources |
    Single -Predicate { $_.ResourceType -eq "AWS::RDS::DBInstance" }

$timestamp = [DateTime]::UtcNow.ToString("yyyyddMMHHmmss")
$snapshotId = "$sourceEnvironment-for-clone-to-$destinationEnvironment-$timestamp"
$snapshot = New-RDSDBSnapshot -DBInstanceIdentifier $rds.PhysicalResourceId -DBSnapshotIdentifier $snapshotId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

_WaitRdsSnapshotAvailable -SnapshotId $snapshotId -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion

With S3, you can just copy the bucket contents. I say just, but in reality there is no support for a “sync” command in the AWS Powershell cmdlets. There is a sync command on the AWS CLI though, so I wrote a wrapper around the CLI and execute the sync command there. It works pretty nicely. Essentially its broken into two parts, the part that deals with actually locating and extracting the AWS CLI to a known location, and then the part that actually does the clone. The only difficult bit was that you don’t seem to be able to just supply credentials to the AWS CLI executable, at least in a way that I would expect (i.e. as parameters). Instead you have to use a profile, or use environment variables.

function Get-AwsCliExecutablePath
{
    if ($rootDirectory -eq $null) { throw "rootDirectory script scoped variable not set. Thats bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    $commonScriptsDirectoryPath = "$rootDirectoryPath\scripts\common"

    . "$commonScriptsDirectoryPath\Functions-Compression.ps1"

    $toolsDirectoryPath = "$rootDirectoryPath\tools"
    $nugetPackagesDirectoryPath = "$toolsDirectoryPath\packages"

    $packageId = "AWSCLI64"
    $packageVersion = "1.7.41"

    $expectedDirectory = "$nugetPackagesDirectoryPath\$packageId.$packageVersion"
    if (-not (Test-Path $expectedDirectory))
    {
        $extractedDir = 7Zip-Unzip "$toolsDirectoryPath\dist\$packageId.$packageVersion.7z" "$toolsDirectoryPath\packages"
    }

    $executable = "$expectedDirectory\aws.exe"

    return $executable
}

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    $previousAWSKey = $env:AWS_ACCESS_KEY_ID
    $previousAWSSecret = $env:AWS_SECRET_ACCESS_KEY
    $previousAWSRegion = $env:AWS_DEFAULT_REGION

    $env:AWS_ACCESS_KEY_ID = $awsKey
    $env:AWS_SECRET_ACCESS_KEY = $awsSecret
    $env:AWS_DEFAULT_REGION = $awsRegion

    & $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName

    $env:AWS_ACCESS_KEY_ID = $previousAWSKey
    $env:AWS_SECRET_ACCESS_KEY = $previousAWSSecret
    $env:AWS_DEFAULT_REGION = $previousAWSRegion
}

I do have some concerns that as the bucket gets bigger, the clone will take longer and longer. I’ll cross that bridge when I come to it.

Using the identified areas of persistence above, the only change I need to make is to alter the new environment script to take them as optional inputs (specifically the RDS snapshot). If they are supplied, it will use them, if they are not, it will default to normal creation.

Job done, right?

A Clean Snapshot

The clone approach works well enough, but in order to perform a migration on a system that is actively being used, you need to make sure that the content does not change while you are doing it. If you don’t do this, you can potentially lose data during a migration. The most common example would be if you clone the environment, but after the clone some requests occur and the data changes. If you then delete the original and migrate back, you’ve lost that data. There are other variations as well.

This means that you need the ability to put an environment into standby mode, where it is still running, and everything is working, but it is no longer accepting user requests.

Most of our environments are fairly simple and are based around web services. They have a number of instances behind a load balancer, managed by an auto scaling group. Behind those instances are backend services, like databases and other forms of persistence/scheduled task management.

AWS Auto Scaling Groups allow you to set instances into Standby mode, which removes them from the load balancer (meaning they will no longer have requests forwarded to them) but does not delete or otherwise terminate them. More importantly, instances in Standby can count towards the desired number of instances in the Auto Scaling Group, meaning it won’t go and create X more instances to service user requests, which obviously would muck the whole plan up.

This is exactly what we need to set our environment into a Standby mode (at least until we have scheduled tasks that deal with underlying data anyway). I took the ability to shift instances into Standby mode and wrapped it into a function for setting the availability of the environment (because that’s the concept that I’m interested in, the Standby mode instances are just a mechanism to accomplish that).

function _ChangeEnvironmentAvailability
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environment,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateSet("Available", "Unavailable")]
        [string]$availability,
        [switch]$wait
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Enumerables.ps1"

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"
    Ensure-AwsPowershellFunctionsAvailable

    $sourceStackName = Get-StackName -Environment $environment -UniqueComponentIdentifier (_GetUniqueComponentIdentifier)
    $sourceStack = Get-CFNStack -StackName $sourceStackName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $resources = Get-CFNStackResources -StackName $sourceStack.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Locating auto scaling group in [$environment]"
    $asg = $resources |
        Single -Predicate { $_.ResourceType -eq "AWS::AutoScaling::AutoScalingGroup" }

    $asg = Get-ASAutoScalingGroup -AutoScalingGroupName $asg.PhysicalResourceId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion

    $instanceIds = @()
    $standbyActivities = @()
    if ($availability -eq "Unavailable")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Standby"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::InService } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Enter-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -ShouldDecrementDesiredCapacity $true -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }
    elseif ($availability -eq "Available")
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Switching all instances in [$($asg.AutoScalingGroupName)] to Inservice"
        $instanceIds = $asg.Instances | Where { $_.LifecycleState -eq [Amazon.AutoScaling.LifecycleState]::Standby } | Select -ExpandProperty InstanceId
        if ($instanceIds | Any)
        {
            $standbyActivities = Exit-ASStandby -AutoScalingGroupName $asg.AutoScalingGroupName -InstanceId $instanceIds -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        }
    }

    $anyStandbyActivities = $standbyActivities | Any
    if ($wait -and $anyStandbyActivities)
    {
        Write-Verbose "_ChangeEnvironmentAvailability([$availability]): Waiting for all scaling activities to complete"
        _WaitAutoScalingGroupActivitiesComplete -Activities $standbyActivities -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
    }
}

With the only mechanism to affect the state of persisted data disabled, we can have some more confidence that the clone is a robust and clean copy.

To the Future!

I don’t think that the double clone solution is the best one, it’s just the best that I could come up with without having to make a lot of changes to way we manage our environments.

Another approach would be to maintain 2 environments during migration (A and B), but only have one of those environments be active during normal operations. So to do a migration, you would spin up Prod A if Prod B already existed. At the entry point, you have a single (or multiple) DNS record that points to either A or B based on your needs. This one still involves cloning and downtime though, so for a high availability service, it won’t really work (our services can have some amount of downtime, as long as it is organized and communicated ahead of time).

Speaking of downtime, there is another approach that you can follow in order to do zero downtime migrations. I haven’t actually done it, but if you had a mechanism to replicate incoming requests to both environments, you could conceivably bring up the new environment, let it deal with the same requests as the old environment for long enough to synchronize and to validate that it works (without responding to the user, just processing the requests) and then perform the top level switch so that the new environment becomes the active one. At some point in the future you can destroy the old environment, when you are confident that it works as expected.

This is a lot more complicated, and involves some external component managing the requests (and storing a record of all requests ever made, at least back to the last backup point) as well as each environment knowing what request they last processed. Its certainly not impossible, but if you can tolerate downtime, its probably not worth the effort.

Summary

Managing your environments is not a simple task, especially when you have actual users (and if you don’t have users, why are you bothering at all?). It’s very difficult to make sure that your production (or other live environment) does not stagnate in the face of changes, and is always kept up to date.

What I’ve outlined in this blog post is a single approach that I’ve been working on over the last week or so, to deal with out specific environments. Its not something that will work for everyone, but I thought it was worthwhile to write it down anyway, to show some of the thought processes that I needed to go through in order to accomplish the migration in a safe, robust fashion.

There is at least one nice side effect from my approach, in that we will now be able to clone any environment I want (without doing a migration, just the clone) and then use it for experiments or analysis.

I’m sure that I will run into issues eventually, but I’m pretty happy with the way the migration process is happening. It was weighing on my mind for a while, so its good to get it done.