A Deftly Encoded Mystery
We use TeamCity as our Continuous Integration tool.
Unfortunately, our setup hasn’t been given as much love as it should have. It’s not bad (or broken, or any of those things), it’s just not quite as well set up as it could be, which increases the risk that it will break and makes it harder to manage (and change, and keep up to date) than it could be.
As with everything that has got a bit crusty around the edges over time, the only real way to attack it while still delivering value to the business is by doing it slowly, piece by piece, over what seems like an inordinate amount of time. The key is to minimise disruption, while still making progress on the bigger picture.
Our setup is fairly simple. A centralised TeamCity server and at least 3 Build Agents capable of building all of our software components. We host all of this in AWS, but unfortunately, it was built before we started consistently using CloudFormation and infrastructure as code, so it was all set up manually.
Recently, we started using a few EC2 spot instances to provide extra build capabilities without dramatically increasing our costs. This worked fairly well, up until the spot price spiked and we lost the build agents. We used persistent requests, so they came back, but they needed to be configured again before they would hook up to TeamCity because of the manual way in which they were provisioned.
There’s been a lot of instability in the spot price recently, so we were dealing with this manual setup on a daily basis (sometimes multiple times per day), which got old really quickly.
You know what they say.
“If you want something painful automated, make a developer responsible for doing it manually and then just wait.”
It’s Automatic
The goal was simple.
We needed to configure the spot Build Agents to automatically bootstrap themselves on startup.
On the upside, the entire process wasn’t completely manual. We were at least spinning up the instances from a pre-built AMI that already had all of the dependencies for our older, crappier components as well as an unconfigured TeamCity Build Agent on it, so we didn’t have to automate absolutely everything.
The bootstrapping would need to tag the instance appropriately (because for some reason spot instances don’t inherit the tags of the spot request), configure the Build Agent and then start it up so it would connect to TeamCity. Ideally, it would also register and authorize the Build Agent, but if we used controlled authorization tokens we could avoid this step by just authorizing the agents once. Then they would automatically reappear each time the spot instance came back.
So: tag the instance, configure the Build Agent and start the service, all in Powershell, with the script baked into the AMI. During provisioning we would supply some UserData that would execute the script.
Not too complicated.
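For the curious, the UserData doesn’t need to do much more than kick off the script that’s already baked into the AMI. A minimal sketch, assuming a hypothetical script path and server URL (not our actual values):

```powershell
<powershell>
# Run the bootstrap script that was baked into the AMI.
# The path and the -ServerUrl parameter are illustrative placeholders.
& "C:\Scripts\Bootstrap-BuildAgent.ps1" -ServerUrl "http://teamcity.example.internal"
</powershell>
```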
Like Graffiti, Except Useful
Tagging an EC2 instance is pretty easy thanks to the multitude of toolsets that Amazon provides. Our tool of choice is the Powershell cmdlets, so the actual tagging was a simple task.
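To give you an idea, the tagging boils down to a handful of lines with the AWS Powershell cmdlets. This is just a sketch (the tag keys and values are made up), assuming the cmdlets can find credentials (more on that in a second):

```powershell
# Ask the instance metadata service who we are.
$instanceId = Invoke-RestMethod -Uri "http://169.254.169.254/latest/meta-data/instance-id"

# Tag keys and values here are illustrative, not our real tags.
$tags = @(
    New-Object Amazon.EC2.Model.Tag -Property @{ Key = "Name"; Value = "teamcity-spot-build-agent" }
    New-Object Amazon.EC2.Model.Tag -Property @{ Key = "Environment"; Value = "ci" }
)

# New-EC2Tag comes from the AWS Tools for PowerShell.
New-EC2Tag -Resource $instanceId -Tag $tags
```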
Getting permission to do the tagging was another story.
We’re pretty careful with our credentials these days, for some reason, so we wanted to make sure that we weren’t supplying or persisting any credentials in the bootstrapping script. That means IAM.
One of the key features of the Powershell cmdlets (and most of the Amazon supplied tools) is that they are supposed to automatically grab credentials if they are being run on an EC2 instance that currently has an instance profile associated with it.
For some reason, this would just not work. We tried a number of different things to get this to work (including updating the version of the Powershell cmdlets we were using), but in the end we had to resort to calling the instance metadata service directly to grab some credentials.
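In case you ever find yourself in the same position, the workaround looked something like the sketch below. It assumes the instance has exactly one role attached via its instance profile; the Set-AWSCredentials call simply hands the temporary credentials to the cmdlets explicitly:

```powershell
# The metadata service lists the role attached to the instance profile...
$credsRoot = "http://169.254.169.254/latest/meta-data/iam/security-credentials"
$roleName = Invoke-RestMethod -Uri "$credsRoot/"

# ...and then hands out temporary credentials for that role.
$creds = Invoke-RestMethod -Uri "$credsRoot/$roleName"

# Feed the temporary credentials to the AWS Powershell cmdlets explicitly.
Set-AWSCredentials -AccessKey $creds.AccessKeyId -SecretKey $creds.SecretAccessKey -SessionToken $creds.Token
```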
Obviously the instance profile that we applied to the instance represented a role with a policy that only had permissions to alter tags. Minimal permission set and all that.
Service With a Smile
Starting/stopping services with Powershell is trivial, and for once, something weird didn’t happen causing us to burn days while we tracked down some obscure bug that only manifests in our particular use case.
I was as surprised as you are.
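For completeness, the service part of the script was about as exciting as it sounds. A sketch, assuming the default service name used by the TeamCity Windows agent installer:

```powershell
# "TCBuildAgent" is the default TeamCity agent service name;
# adjust if the agent was installed under a different name.
Set-Service -Name "TCBuildAgent" -StartupType Automatic
Start-Service -Name "TCBuildAgent"
```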
Configuration Is Key
The final step should have been relatively simple.
Take a file with some replacement tokens, read it, replace the tokens with appropriate values, write it back.
Except it just wouldn’t work.
After editing the file with Powershell (a relatively simple Get-Content | ForEach-Object { $_ -replace {token}, {value} } | Out-File) the TeamCity Build Agent would refuse to load.
Checking the log file, its biggest (and only) complaint was that the serverUrl (which is the location of the TeamCity server) was not set.
This was incredibly confusing, because the file clearly had a serverUrl value in it.
I tried a number of different things to determine the root cause of the issue, including:
- Permissions? Was the file accidentally locked by TeamCity such that the Build Agent service couldn’t access it?
- Did the rewrite of the tokens somehow change the format of the file (extra spaces, CR LF when it was just expecting LF)?
- Was the serverUrl actually configured, but inaccessible for some reason (machine proxy settings, for example), meaning the problem was occurring not when the file was rewritten, but when the script set up the proxy settings for the AWS Powershell cmdlets?
Long story short, it turns out that Powershell doesn’t preserve a file’s encoding when using Out-File the way we were using it. By default it writes Unicode (UTF-16 Little Endian, complete with a Byte Order Mark), whereas the original file was plain ASCII, and the Build Agent did not like that (it didn’t throw an encoding error either, which is super annoying, but whatever).
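A sketch of one way to avoid the problem: be explicit about the encoding when writing the file back out. The token, server URL and file path below are illustrative placeholders, not our real values:

```powershell
# buildAgent.properties is the standard agent configuration file.
$configPath = "C:\BuildAgent\conf\buildAgent.properties"

# The parentheses force the whole file to be read before the pipeline
# writes back to the same path; -Encoding ASCII keeps the original encoding.
(Get-Content $configPath) |
    ForEach-Object { $_ -replace "@@SERVER_URL@@", "http://teamcity.example.internal" } |
    Out-File $configPath -Encoding ASCII
```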
The error message was both a red herring (yes, the serverUrl was configured) and also truthful (the Build Agent was incapable of reading it).
Putting It All Together
With all the pieces in place, it was a relatively simple matter to create a new AMI with those scripts baked into it and put it to work straightaway.
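If you bake the image via Powershell as well, it’s a one-liner. A sketch, with illustrative names:

```powershell
# Create a new AMI from the configured instance; the name and description are placeholders.
New-EC2Image -InstanceId $instanceId -Name "teamcity-spot-build-agent-bootstrap" -Description "TeamCity spot build agent with bootstrap scripts baked in"
```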
Of course, I’d been doing this the whole time in order to test the process, so I certainly had a lot of failures building up to the final deployment.
Conclusion
Even simple automation can prove to be time consuming, especially when you run into weird, unforeseen problems, like components not performing as advertised, or not reporting errors accurately enough to be useful for debugging.
Still, it was worth it.
Now I never have to manually configure those damn spot instances when they come back.
And satisfaction is worth its weight in gold.