The Dangers of Shared Infrastructure
The service that I’ve mentioned previously (and the iOS app it supports) has been in beta now for a few weeks. People seem relatively happy with it, both from a performance standpoint and because it doesn’t arbitrarily lose their information like the previous version did, so we’ve got that going for us, which is nice.
We did a fair amount of load testing on it before it went out to beta, but only for small numbers of concurrent users (< 100), to make sure that our beta experience would be acceptable. That load testing picked up a few issues, including one where the service would happily (accidentally, of course) delete other people’s data. It wasn’t a permissions issue; it was caused by the way we were keying our image storage. More importantly, the load testing found issues with the way we were storing images (we were using Raven 2.5 attachments), which just wasn’t working from a performance point of view. We switched to storing the files in S3, and it was much better.
I believe the newer version of Raven has a new file storage mechanism that is much better. I don’t even think Ayende recommends that you use the attachments built into Raven 2.5 for any decent amount of file storage.
Before going live, we knew we needed to find the breaking point of the service: the number of concurrent users at which its performance degrades to the point of being unusable (at least for the configuration we were planning on going live with). If that number was too low, we knew we would need to make some additional changes, either in terms of infrastructure (beefier AWS instances, more instances in the Auto Scaling Group) or in terms of code.
We tried to simply run a huge number of users through our load tests locally (which is how we did the first batch of load testing, locally using JMeter), but we capped out our available upload bandwidth pretty quickly, well below the level of traffic that the service could handle.
It was time to farm the work out to somewhere else, somewhere with a huge amount of easily accessible computing resources.
Where else but Amazon Web Services?
I’ve Always Wanted to be a Farmer
The concept was fairly straightforward. We had a JMeter configuration file that contained all of our load tests, parameterised by the number of users. Conceptually, the path would be to spin up some worker instances in EC2, push JMeter, its dependencies and our config to them, then execute the tests. This way we could tune the number of users per instance along with the total number of worker instances, and we would easily be able to put enough pressure on the service to find its breaking point.
JMeter gives you the ability to set the value of variables via the command line. Be careful though, as the variable names are case sensitive. That one screwed me over for a while, as I couldn’t figure out why the value of my variables was still the default on every machine I started the tests on. For the variable that defined the maximum number of users it wasn’t so bad, if a bit confusing. The other variable, which defined the seed for the user identity, was more of an issue when it wasn’t working, because it meant the same user was doing similar things from multiple machines. Still a valid test, but not the one I was aiming for, as the service isn’t designed for concurrent access like that.
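For reference, values supplied on the command line with -J become JMeter properties, which the test plan reads via the __P function, and the names have to match case exactly. A minimal sketch of the sort of invocation involved (the paths and property names here are just for illustration, not the actual ones from our test plan):

# Non-GUI JMeter run, passing two properties from PowerShell.
# Property names are case sensitive: -JtotalUsers must match ${__P(totalUsers,10)} exactly.
#   totalUsers         - maximum number of concurrent users for this worker
#   customerNumberSeed - seed used to generate unique user identities
& "C:\jmeter\bin\jmeter.bat" -n -t "C:\load-tests\load-test.jmx" -JtotalUsers=100 -JcustomerNumberSeed=1

# Inside the test plan the values are referenced as ${__P(totalUsers,10)} and
# ${__P(customerNumberSeed,1)}, where the second argument is the default.

If the name doesn’t match (case included), __P silently falls back to the default value in the test plan, which is exactly the behaviour I was seeing.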
We wouldn’t want to put all of that load on the service all at once though, so we needed to stagger when each instance started its tests.
Leveraging the work I’d done previously for setting up environments, I created a CloudFormation template containing an Auto Scaling Group with a variable number of worker instances. Each instance would have the JMeter config file and all of its dependencies (Java, JMeter, any supporting scripts) installed during setup, and then be available for remote execution via PowerShell.
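I won’t reproduce the template here (it lives in the repository linked further down), but to give a feel for how a variable number of workers might be supplied, creating such a stack with the AWS PowerShell cmdlets looks something like the sketch below. The stack, template file and parameter names are placeholders, not the ones from our actual scripts; the key/secret/region variables are the same ones used in the script further down.

# Rough sketch: create a worker stack from a CloudFormation template, passing the
# desired number of workers as a template parameter. Names are placeholders.
New-CFNStack -StackName "load-test-workers" `
    -TemplateBody (Get-Content ".\load-test-workers.cloudformation.template" -Raw) `
    -Parameter @(
        @{ ParameterKey = "DesiredNumberOfWorkers"; ParameterValue = "10" }
        @{ ParameterKey = "InstanceType"; ParameterValue = "t2.medium" }
    ) `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion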
The plan was to hook into that environment (or set up a new one if it could not be found), find the worker instances and then iterate through them, starting the load tests on each one, making sure to stagger the starts by some reasonable amount of time. The PowerShell script for doing exactly that is below:
[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$environmentName,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2"
)

$ErrorActionPreference = "Stop"

$currentDirectoryPath = Split-Path $script:MyInvocation.MyCommand.Path
write-verbose "Script is located at [$currentDirectoryPath]."

. "$currentDirectoryPath\_Find-RepositoryRoot.ps1"

$repositoryRoot = Find-RepositoryRoot $currentDirectoryPath
$repositoryRootDirectoryPath = $repositoryRoot.FullName
$commonScriptsDirectoryPath = "$repositoryRootDirectoryPath\scripts\common"

. "$repositoryRootDirectoryPath\scripts\environment\Functions-Environment.ps1"

. "$commonScriptsDirectoryPath\Functions-Aws.ps1"
Ensure-AwsPowershellFunctionsAvailable

$stack = $null
try
{
    $stack = Get-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion
}
catch
{
    Write-Warning $_
}

if ($stack -eq $null)
{
    $update = ($stack -ne $null)
    $stack = New-Environment -EnvironmentName $environmentName -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion -UpdateExisting:$update -Wait -disableCleanupOnFailure
}

$autoScalingGroupName = $stack.AutoScalingGroupName
$asg = Get-ASAutoScalingGroup -AutoScalingGroupNames $autoScalingGroupName -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
$instances = $asg.Instances

. "$commonScriptsDirectoryPath\Functions-Aws-Ec2.ps1"

$remoteUser = "Administrator"
$remotePassword = "ObviouslyInsecurePasswordsAreTricksyMonkeys"
$securePassword = ConvertTo-SecureString $remotePassword -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential($remoteUser, $securePassword)

$usersPerMachine = 100
$nextAvailableCustomerNumber = 1
$jobs = @()
foreach ($instance in $instances)
{
    # Get the instance
    $instance = Get-AwsEc2Instance -InstanceId $instance.InstanceId -AwsKey $awsKey -AwsSecret $awsSecret -AwsRegion $awsRegion

    $ipAddress = $instance.PrivateIpAddress

    $session = New-PSSession -ComputerName $ipAddress -Credential $cred

    $remoteScript = {
        param
        (
            [int]$totalNumberOfUsers,
            [int]$startingCustomerNumber
        )
        Set-ExecutionPolicy -ExecutionPolicy Bypass
        & "C:\cfn\dependencies\scripts\jmeter\execute-load-test-no-gui.ps1" -totalNumberOfUsers $totalNumberOfUsers -startingCustomerNumber $startingCustomerNumber -AllocatedMemory 512
    }
    $job = Invoke-Command -Session $session -ScriptBlock $remoteScript -ArgumentList $usersPerMachine,$nextAvailableCustomerNumber -AsJob
    $jobs += $job

    $nextAvailableCustomerNumber += $usersPerMachine

    #Sleep -Seconds ([TimeSpan]::FromHours(2).TotalSeconds)
    Sleep -Seconds 300

    # Can use Get-Job or record list of jobs and then terminate them. I suppose we could also wait on all of them to be complete. Might be good to get some feedback from
    # the remote process somehow, to indicate whether or not it is still running/what it is doing.
}
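As the comment at the end of the loop suggests, the script currently just fires off the jobs and leaves them running. A rough sketch of what waiting on them and pulling back their output might look like (not something that’s in the script above, just one way it could be done):

# Wait for all of the remote load test jobs to finish, then pull back whatever
# each one wrote to the pipeline so there is at least some feedback per worker.
Wait-Job -Job $jobs | Out-Null
foreach ($job in $jobs)
{
    Write-Verbose "Job [$($job.Id)] finished with state [$($job.State)]."
    Receive-Job -Job $job
}

# Clean up the jobs and the remote sessions once everything is done.
Remove-Job -Job $jobs
Get-PSSession | Remove-PSSession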
Additionally, I’ve recreated and reuploaded the repository from my first JMeter post, containing the environment template and scripts for executing the template, as well as the script above. You can find it here.
The last time I uploaded this repository I accidentally compromised our AWS deployment credentials, so I tore it down again very quickly. Not my brightest moment, but you can rest assured I’m not making the same mistake twice. If you look at the repository, you’ll notice that I implemented a mechanism that prompts for credentials when running the tests, so I never feel tempted to put credentials in a file ever again.
We could watch the load tests kick into gear via Kibana, and keep an eye on when errors started to occur and why.
Obviously we didn’t want to run the load tests on any of the existing environments (which are in use for various reasons), so we spun up a brand new environment for the service, fired up the script to farm out the load tests (with a 2 hour delay between instance starts) and went home for the night.
15 minutes later, Production (the environment actively being used for the external beta) went down hard, and so did all of the others, including the new load test environment.
Separately Dependent
We had gone to great lengths to make sure that our environments were independent. That was the entire point behind codifying the environment setup, so that we could spin up all resources necessary for the environment, and keep it isolated from all of the other ones.
It turns out they weren’t quite as isolated as we would have liked.
Like a lot of AWS setups, ours has an internet gateway, allowing resources internal to our VPC (like EC2 instances) access to the internet. By default, only resources with an external IP can access the internet through the gateway; everything else has to use some other mechanism. In our case, that other mechanism is a Squid proxy.
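To give a feel for what that means for the instances involved, anything without an external IP has to be explicitly pointed at the proxy during setup. The snippet below is just an illustration of that sort of configuration, not our exact setup script; the proxy address is a placeholder.

# One common convention: set machine-level proxy environment variables, which many
# command line tools respect. The proxy hostname and port here are placeholders.
$proxy = "http://internal-squid-proxy.example.internal:3128"
[Environment]::SetEnvironmentVariable("HTTP_PROXY", $proxy, "Machine")
[Environment]::SetEnvironmentVariable("HTTPS_PROXY", $proxy, "Machine")

# The AWS PowerShell cmdlets need to be told about the proxy separately (per session).
Set-AWSProxy -Hostname "internal-squid-proxy.example.internal" -Port 3128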
It was this proxy that was the bottleneck. Both the service under test and the load test workers were slamming it: the service in order to talk to S3, and the load test workers in order to hit the service (through its external URL).
We had recently increased the specs on the proxy machine (because of a similar problem discovered during load testing with fewer users), and we thought that maybe it would be powerful enough to handle the incoming requests. It probably would have been, if it wasn’t for the double load (i.e. if the load test requests had been coming from an external party and the only traffic going through the proxy was to S3 from the service).
In the end the load tests did exactly what they were supposed to do, even if they did it in an unexpected way. They pushed the system to its breaking point, allowing us to identify where it broke and schedule improvements to prevent the situation from occurring again.
Actions Speak Louder Than Words
What are we going to do about it? There are a number of things I have in mind.
The first is to not have a single proxy instance, and instead have an Auto Scaling Group that scales as necessary based on load. I like this idea and I will probably be implementing it at some stage in the future. To be honest, as a shared piece of infrastructure, this is how it should have been implemented in the first place. I understand that the single instance (configured lovingly by hand) was probably quicker and easier initially, but for such a critical piece of infrastructure, you really do need to spend the time to do it properly.
The second is to have environment-specific proxies, probably as Auto Scaling Groups anyway. This would give me more confidence that we won’t accidentally murder production services when doing internal things, just from an isolation point of view. Essentially, we should treat the proxy just like we treat any other service, and be able to spin proxies up and down as necessary for whatever purpose.
The third is to isolate our production services entirely, either with another VPC just for production, or even another AWS account just for production. I like this one a lot, because as long as we have shared environments, I’m always terrified I’ll screw up a script and accidentally delete everything. If production wasn’t located in the same account, that would literally be impossible. I’ll be trying to make this happen over the coming months, but I’ll need to move quickly, as the more stuff we have in production, the harder it will be to move.
The last optimisation, which I have already made, is to use the new VPC endpoint feature in AWS to avoid having to go out to the internet in order to access S3. This really just delays the root issue (a shared single point of failure), but it solves the immediate problem and should also provide a nice performance boost, as it removes the proxy from the picture entirely for interactions with S3, which is nice.
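For anyone curious, creating the endpoint is essentially a one-liner with the AWS PowerShell cmdlets. The sketch below is along the lines of what is involved; the VPC and route table ids are placeholders.

# Minimal sketch: create an S3 VPC endpoint so traffic to S3 stays inside the VPC
# instead of going out through the proxy. Ids below are placeholders.
New-EC2VpcEndpoint -VpcId "vpc-12345678" `
    -ServiceName "com.amazonaws.ap-southeast-2.s3" `
    -RouteTableId "rtb-12345678" `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion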
Conclusion
To me, this entire event proved just how valuable load testing is. As I stated previously, it did exactly what I expected it to do: find where the service breaks. It broke in an entirely unexpected way (and broke other things as well), but honestly this is probably the best outcome, because that breakage would have happened at some point in the future anyway (whenever we hit the saturation point for the proxy), and I’d prefer it to happen now, when we’re in beta and managing communications with every user closely, than later, when everybody and their dog are using the service.
Of course, now we have a whole lot more infrastructure work to complete before we can go live, but honestly, the work is never really done anyway.
I still hate proxies.