
So, as per my last post, I built a scalable, deployable, codified proxy environment in AWS, leveraging CloudFormation, Octopus and Squid.

Since then I have attempted to use this proxy for things. Specifically, load tests, which were the entire reason I built the proxy in the first place.

In my attempts to use the proxy, I have learned a few lessons that I thought would be worthwhile to share, so that others might benefit from my pain.

This will be a relatively short post, because:

  1. I’ve been struggling with these issues over the last week and haven’t done much else,
  2. For all the trouble I had, I really only ran into 2 issues,
  3. I’m still struggling with a mysterious, probably unrelated issue, and it’s taking up most of my mental space (unexpected UTF-8 characters in a PowerShell/Octopus deployment output stream, which might be a good future blog post if I ever figure it out).

Weird Differences

Initially I assumed that my proxy would slot into the same hole as our current proxy. They were both Squid, I had replicated the config from the old one onto the new one, and both were referenced simply by an address and a port number.

I was mistaken.

I’m still not sure how it managed it, but the old proxy was somehow allowing connections to the special EC2 instance metadata address (169.254.169.254) to pass through correctly. The moment I swapped in my new proxy, cfn-init and cfn-signal stopped working.
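
If you want to see the weirdness for yourself, a quick check from one of the affected instances makes it obvious. The following is purely a diagnostic sketch (the proxy address is a placeholder): the first call hits the metadata endpoint directly and returns the ID of the instance you are on, while the second forces the request through the proxy, which is essentially what cfn-init and cfn-signal were doing.

# Diagnostic sketch only (the proxy address is a placeholder).
# Direct request: returns the ID of the instance you are on.
Invoke-RestMethod -Uri "http://169.254.169.254/latest/meta-data/instance-id"

# Same request forced through the proxy: the response comes from whatever the
# proxy can reach at that address (i.e. the proxy's own metadata), which is
# why cfn-init gets so confused.
Invoke-RestMethod -Uri "http://169.254.169.254/latest/meta-data/instance-id" -Proxy "http://proxy.internal:3128"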

For cfn-init, the error was incredibly unhelpful. It insisted that my instance was not a member of the CloudFormation group/template/configuration that I was trying to initialise from.

For cfn-signal, it just didn’t do anything. It said it signalled, but it was lying.

In hindsight, this makes perfect sense. The metadata requests would have gone through the proxy, which was itself a CloudFormation resource, so the metadata that came back belonged to the proxy instead of the instance asking for it. cfn-init then tried to use the proxy’s CloudFormation template as the container for the metadata, which failed, giving a technically correct error message in the first case, and signalling something non-existent in the second.

From my point of view, it looked insane.

I assumed I had put some sort of incorrect details into the cfn-init call, or that I had failed to meet the arcane requirements of cfn-signal (the wait handle must be base64 encoded, but only on Windows, for example), but I hadn’t changed anything there. The only thing I had changed was the proxy configuration.

Long story short, for my proxy I had to add a bypass entry on each EC2 instance (configured in the same place as the proxy itself, the UserData script) to stop cfn-init (and other tools) from trying to go through the proxy to reach the instance metadata address; the sketch below shows the sort of thing I mean. I still have no idea why the old proxy did not require the same sort of configuration. I have a hunch that it might be because it was Linux, and the original creators of the box did something special to make it work. Maybe they ran into the same issue, but just fixed it a different way? Maybe Linux handles the situation better by default? Who knows.
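
For reference, the bypass itself is only a couple of lines in the UserData script. This is a hedged sketch rather than the exact script, and the proxy address is a placeholder:

# Hypothetical fragment of the instance UserData script. Point WinHTTP at the
# proxy, but bypass it for the instance metadata address, and set the
# equivalent environment variables for tools that honour HTTP_PROXY/NO_PROXY.
netsh winhttp set proxy proxy-server="http://proxy.internal:3128" bypass-list="169.254.169.254"
[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://proxy.internal:3128", "Machine")
[Environment]::SetEnvironmentVariable("NO_PROXY", "169.254.169.254", "Machine")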

Very frustrating.

Location, Location, Location

The second pain point I ran into was more insane and just as frustrating.

After reviewing the results of an initial load test, I hypothesised that maybe the proxy was a bottleneck. All of the traffic for the load test had to pass through the proxy (including image uploads) and I couldn’t see anything obvious in the service logs to account for the level of failure I was seeing, except high load. In the interests of getting a better subsequent load test, I wanted to make sure that the proxy boxes could not possibly be a bottleneck, so I planned to beef up their instance type.

I was originally using t2.medium instances, which have some limitations, mostly around network performance and CPU credits. I wanted to switch to something a bit beefier, just for the proxy specific to the load tests.

When I switched to an m3.large, the proxy stopped working.

Looking into it, the expected installation directory (C:\Squid) was empty of anything that even vaguely looked like a proxy.

Following the installation log, I found that Squid had decided to install itself to the Z drive. The Z drive was an ephemeral (instance store) drive. You know, the ones whose contents are transitory, and which tend to get annihilated if the instance goes down for any reason?

I tried so very hard to get Squid to just install to the C drive, including checking the registry settings for program installation locations (which were all correctly C based) and manually overriding TARGETFOLDER, ROOTDRIVE and INSTALLDIR in the msiexec parameters, along the lines of the sketch below.
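
For the record, the attempt looked something like this (the MSI and log file locations are illustrative):

# Attempt (unsuccessful) to force the Squid MSI onto the C drive by overriding
# the install location properties on the msiexec command line.
$msi = "C:\Temp\squid.msi"
$arguments = "/i `"$msi`" /qn /norestart INSTALLDIR=`"C:\Squid`" TARGETFOLDER=`"C:\Squid`" ROOTDRIVE=C:\ /l*v `"C:\Temp\squid-install.log`""
Start-Process -FilePath "msiexec.exe" -ArgumentList $arguments -Wait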

Alas, it was not to be. No matter what I did, Squid insisted on installing to Z drive.

I still have no idea why. I just switched the instance type back to one that didn’t have ephemeral drives available.

Like any good software user, I logged a bug. Well, I assume it’s a bug, because that’s a really weird feature.

Conclusion

There is no conclusion. Just a glimpse into some of the traps that sap your time, motivation and will to live when doing this sort of thing.

I only hope that someone runs across this blog post one day and it helps them. Or at least lets them know someone else out there understands their pain.


It’s time to fix that whole shared infrastructure issue.

I need to farm my load tests out to AWS, and I need to do it in a way that won’t accidentally murder our current production servers. In order to do that, I need to forge the replacement for our manually created and configured proxy box. A nice, codified, auto-scaling proxy environment.

Back into CloudFormation I go.

I contemplated simply using a NAT box instead of a proxy, but decided against it because:

  • We already use a proxy, so assuming my template works as expected it should be easy enough to slot in,
  • I don’t have any experience with NAT boxes (I’m pretty weak on networking in general actually),
  • Proxies scale better in the long run, so I might as well sort that out now.

Our current lone proxy machine is a Linux instance with Squid manually installed on it. It was set up some time before I started, by someone who no longer works at the company. An excellent combination: I’m already a bit crap at Linux, and now I can’t even ask anyone how it was put together or what sort of tweaks were made to it over time as failures were encountered. Time to start from scratch.

The proxy software itself is sound enough, and I have some experience with Squid, so I’ll stick with it. As for the OS, while I know that Linux would likely be faster with less overhead, I’m far more comfortable with Windows, so to hell with Linux for now.

Here's the plan. Create a CloudFormation template for the actual environment (Load Balancer, Auto Scaling Group, Instance Configuration, DNS Record) and also create a NuGet package that installs and configures the proxy to be deployed via Octopus.

I’ve always liked the idea of never installing software manually, but it’s only recently that I’ve had access to the tools to accomplish that. Octopus, NuGet and PowerShell form a very powerful combination for managing deployments on Windows. I have no idea what the equivalent is for Linux, but I’m sure there is something. At some point in the future Octopus is going to offer the ability to do SSH deploys, which will allow me to include more Linux infrastructure (or manage existing Linux infrastructure even better, I’m looking at you ELK stack).

Save the Environment

The environment is pretty simple. A Load Balancer hooked up to an Auto Scaling Group, whose instances are configured to do some simple setup (including using Octopus to deploy some software), and a DNS record so that I can refer to the load balancer in a nice way.

I’ve done enough of these simple sorts of environments now that I didn’t really run into any interesting issues. Don’t get me wrong, they aren’t trivial, but I wasn’t stuck smashing my head against a desk for a few days while I sorted out some arcane problem that ended up being related to case sensitivity or something ridiculous like that.

One thing that I have learned is to set up the Octopus project that will be deployed during environment setup ahead of time. Give it some trivial content, like running a PowerShell script, and then make sure it deploys correctly during the startup of the instances in the Auto Scaling Group. If you try to sort out the package and its deployment at the same time as the environment, you’ll probably run into situations where the environment setup technically succeeded, but because the deployment of the package failed, the whole thing failed, and you have to wait another 20 minutes to fix it. It really saves a lot of time to create the environment in such a way that you can extend it with deployments later.
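
For completeness, the actual trigger for that deployment during instance startup is just a call to the Octopus command line tool from the UserData script, along these lines. This is a sketch, not the real script; the server address, API key, project name, version and environment are all placeholders:

# Hypothetical fragment of the UserData script: once the Tentacle is registered,
# ask the Octopus server to deploy the (initially trivial) proxy project to this
# environment and wait for the result, so failures surface during instance setup.
& "C:\Tools\Octo.exe" deploy-release `
    --server "https://octopus.internal" `
    --apiKey "API-XXXXXXXXXXXXXXXX" `
    --project "Squid Proxy" `
    --version "1.0.0" `
    --deployto "Load Test" `
    --waitfordeployment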

Technically you could also make it so failing deployments don’t fail an environment setup, but I like my environments to work when they are “finished”, so I’m not really comfortable with that in the long run.

The only tricky things about the proxy environment are making sure that you set up your security groups appropriately so that the proxy port can be accessed, and making sure that you use the correct health check for the load balancer. For Squid at least, a TCP health check against port 3128 (the Squid default) works well.

That's a Nice Package

With the environment out of the way, it’s time to set up the package that will be used to deploy Squid.

Squid is available on Windows via Diladele. Since 99% of our systems are 64 bit, I just downloaded the 64 bit MSI. Using the same structure that I used for the Nxlog package, I packaged up the MSI and some supporting scripts, making sure to version the package appropriately. Consistent versioning is important, so I use the same versioning strategy that I use for our software components: include a SharedAssemblyInfo file and then mutate that file via some common versioning PowerShell functions, roughly as sketched below.
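
This is only a sketch of that packaging step; the file names, paths and the use of AssemblyInformationalVersion are assumptions, not the actual build script:

# Rough sketch of the build-side packaging: pull the current version out of the
# shared assembly info file and use it to version the NuGet package.
$assemblyInfo = Get-Content -Path ".\src\SharedAssemblyInfo.cs" -Raw
$version = [regex]::Match($assemblyInfo, 'AssemblyInformationalVersion\("([^"]+)"\)').Groups[1].Value
& ".\tools\nuget.exe" pack ".\Solavirum.Proxy.Squid.nuspec" -Version $version -OutputDirectory ".\build-output" -NoPackageAnalysis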

Apart from the installation of Squid itself, I also included the ability to deploy a custom configuration file. The main reason I did this was so that I could replicate our current Squid proxy config exactly, because I’m sure it does things that have been built up over the last few years that I don’t understand. I did this in a similar way to how I did config deployment for Nxlog and Logstash: a set of configuration files is included in the NuGet package and then the correct one is chosen at deployment time based on some configuration within Octopus, along the lines of the sketch below.
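
The selection happens inside the package’s Deploy.ps1, something like the following. Variable names and paths (including where the Squid config actually lives) are assumptions for illustration:

# Rough sketch of the config deployment step inside the package's Deploy.ps1.
# Pick the config file matching the current Octopus environment and drop it
# over the top of the default Squid config (destination path is assumed).
$environment = $OctopusParameters["Octopus.Environment.Name"]
$source = Join-Path $PSScriptRoot "config\squid.$environment.conf"
$destination = "C:\Squid\etc\squid.conf"

if (-not (Test-Path $source))
{
    throw "No Squid config found for environment [$environment] at [$source]."
}

Copy-Item -Path $source -Destination $destination -Force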

I honestly don’t remember whether I had any issues with creating the Squid proxy package, but I’m sure that if I had, they would still be fresh in my mind. MSIs are easy to install silently with msiexec once you know the arguments, and the Squid installer for Windows is pretty reliable. I really do think it was straightforward, especially considering that I was following the same pattern that I’d used to install an MSI via Octopus previously.

Delivering the Package

This is standard Octopus territory. Create a project to represent the deployable component, target it appropriately at machines in roles, and then deploy to environments. The part of the build script that is responsible for putting together the package above can also automatically deploy it via Octopus to an environment of your choice, roughly as sketched below.
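
In practice that part of the build script boils down to pushing the package to the Octopus built-in feed and then creating and deploying a release. Again, this is a hedged sketch; the server address, API key and project name are placeholders, and $version comes from the packaging step earlier:

# Rough sketch of the deploy portion of the build script: push the freshly
# packed NuGet package to the Octopus built-in feed, then create a release
# and deploy it to the chosen environment.
& ".\tools\nuget.exe" push ".\build-output\Solavirum.Proxy.Squid.$version.nupkg" -Source "https://octopus.internal/nuget/packages" -ApiKey "API-XXXXXXXXXXXXXXXX"

& ".\tools\Octo.exe" create-release `
    --server "https://octopus.internal" `
    --apiKey "API-XXXXXXXXXXXXXXXX" `
    --project "Squid Proxy" `
    --version $version `
    --deployto "CI" `
    --waitfordeployment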

In TeamCity, we typically do an automatic deploy to CI on every check-in (gated by passing tests), but for this project I had to hold off. We’re actually running low on Build Configurations right now (I’ve already put in a request for more, but the wheels of bureaucracy move pretty slowly), so I skipped out on setting one up for the Squid proxy. Once we get some more available configurations I’ll rectify that situation.

Who Could Forget Logs?

The final step in deploying and maintaining anything is to make sure that the logs from the component are being aggregated correctly, so that you don’t have to go to the machine(s) in question to see what’s going on, and so that you have a nice pile of data to do analysis on later. Space is cheap after all, so you might as well store everything, all the time (except media).

Squid features a nice access log with a well known format, which is perfect for this sort of log processing and aggregation.

Again, using the same sort of approach that I’ve used for other components, I quickly knocked up a logstash config for parsing the log file and deployed it (and Logstash) to the same machines as the Squid proxy installations. I’ll include that config here, because it lives in a different repository to the rest of the Squid stuff (it would live in the Solavirum.Logging.Logstash repo, if I updated it).

input {
    file {
        path => "@@SQUID_LOGS_DIRECTORY/access.log"
        type => "squid"
        start_position => "beginning"
        sincedb_path => "@@SQUID_LOGS_DIRECTORY/.sincedb"
    }
}

filter {
    if [type] == "squid" {
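        # The match below follows the default Squid "native" access log format:
        # timestamp elapsed client-address code/status bytes method url user peer-status/peer-host content-type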
        grok {
            match => [ "message", "%{NUMBER:timestamp}\s+%{NUMBER:TimeTaken:int} %{IPORHOST:source_ip} %{WORD:squid_code}/%{NUMBER:Status} %{NUMBER:response_bytes:int} %{WORD:Verb} %{GREEDYDATA:url} %{USERNAME:user} %{WORD:squid_peerstatus}/(%{IPORHOST:destination_ip}|-) %{GREEDYDATA:content_type}" ]
        }
        date {
            match => [ "timestamp", "UNIX" ]
            remove_field => [ "timestamp" ]
        }
    }
     
    mutate {
        add_field => { "SourceModuleName" => "%{type}" }
        add_field => { "Environment" => "@@ENVIRONMENT" }
        add_field => { "Application" => "SquidProxy" }
        convert => [ "Status", "string" ]
    }
    
    # This last common mutate deals with the situation where Logstash was creating a custom type (and thus different mappings) in Elasticsearch
    # for every type that came through. The default "type" is logs, so we mutate to that, and the actual type is stored in SourceModuleName.
    # This is a separate step because if you try to do it with the SourceModuleName add_field it will contain the value of "logs" which is wrong.
    mutate {
        update => [ "type", "logs" ]
    }
}

output {
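    # Ship each event as a single line of JSON over raw TCP to the log
    # aggregator; the address and the other @@ tokens are substituted by
    # Octopus at deployment time.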
    tcp {
        codec => json_lines
        host => "@@LOG_SERVER_ADDRESS"
        port => 6379
    }
    
    #stdout {
    #    codec => rubydebug
    #}
}

I Can Never Think of a Good Title for the Summary

For reference purposes I’ve included the entire Squid package/environment setup code in this repository. Use as you see fit.

As far as environment setups go, this one was pretty much by the numbers. No major blockers or time wasters. It wasn’t trivial, and it still took me a few days of concentrated effort, but the issues I did have were pretty much just me making mistakes (like setting up security group rules wrong, or failing to tag instances correctly in Octopus, or failing to test the Squid install script locally before I deployed it). The slowest part is definitely waiting for the environment creation to either succeed or fail, because it can take 20+ minutes for the thing to run from start to finish. I should look into making that faster somehow, as I get distracted during those 20 minutes.

Really the only reason for the lack of issues was that I’d done all of this sort of stuff before, and I tend to make my stuff reusable. It was a simple matter to plug everything together in the configuration that I needed, no need to reinvent the wheel.

Though sometimes you do need to smooth the wheel a bit when you go to use it again.