
We’ve been doing some work recently to regain control over our Log Stack. Nothing groundbreaking, just cleaning up, creating deployment pipelines and moving infrastructure around.

To go into it in a little more detail, the nature of the work could be broadly categorised as follows:

  • Amazon Machine Images (AMIs) created through Packer (via TeamCity + Powershell)
  • CloudFormation scripts executed through Octopus Deploy
  • Application specific configuration files deployed to Linux machines through Octopus Deploy

The third topic is relevant to this post, as it concerns an annoying problem I encountered while working on the configuration for the Elasticsearch layer.

Operating Tolerances

I do all of my development on my Surface Book running whatever flavour of Windows 10 is the most current. When doing C# application/service development, I mostly use Visual Studio, and for just about everything else (Powershell, CloudFormation, Bash scripts, etc) I use Visual Studio Code (its plugins are really good). This holds true even when I’m building things to deploy to Linux operating systems.

When I was creating the repository/Nuget package/deployment logic for the Elasticsearch configuration, it was all Visual Studio Code, all the time.

The first pass at the configuration was pretty straightforward. An elasticsearch.yml file with some variable substitution via Octopus Deploy dealing with things like master vs data nodes, cluster names and cluster discovery. Once completed and deployed, this worked pretty much as expected, with the Elasticsearch cluster coming online in the arrangement I specified.
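
To give a sense of the sort of thing being substituted, the templated settings look something like this (the variable names and discovery settings here are illustrative placeholders, not our actual values):

# Illustrative sketch only
cluster.name: "#{ClusterName}"

# Whether a node acts as a master or a data node is decided per deployment
node.master: #{IsMasterNode}
node.data: #{IsDataNode}

# Cluster discovery, e.g. a list of known master hosts
discovery.zen.ping.unicast.hosts: ["#{MasterHostname}"]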

Further tweaking was required though, because the default memory allocation settings for Elasticsearch do not take available memory into account.

You might be wondering when you would need to take control over something as low level as the amount of memory allocated to the process. Elasticsearch is a pretty memory intensive process, so it does a lot of upfront memory allocation in order to make sure that performance doesn’t suffer later if the JVM decides it needs to swap memory or something similarly painful. The default allocation for ES is only 1GB though, so if you want to make use of your beefy EC2 instance with 32GB of RAM, you’ll have to do some tweaking.

Back when I first built an ELK Stack (waaaaaay back in March 2015) you would usually set an environment variable called ES_HEAP_SIZE in order to change the default memory allocation. You were free to set the Xms and Xmx JVM options directly when running the Elasticsearch process as well, but the recommended way was to use the environment variable.

Elasticsearch has evolved since then, and the current recommendation is to set memory allocation settings via a jvm.options file living in an appropriate place (typically /etc/elasticsearch). It is in this file that you can set the Xms (minimum) and Xmx (maximum) values for heap allocation, along with any other settings on the JVM that you might want to change.
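
For reference, the heap settings in that file are just a pair of JVM flags, one per line, something like the following (values illustrative):

-Xms2g
-Xmx2g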

The only complication for me was that we were deploying Elasticsearch to a variety of different EC2 instance types, with highly variable total amounts of memory, so the memory settings couldn’t be set in source, but had to be determined dynamically. Easy enough to accomplish with a small Bash script that runs during deployment:

# Echo commands after expansion
set -x

# Calculate half of the total memory (in MB) and substitute it into the JVM options file
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" /tmp/elk-elasticsearch/jvm.options || exit 1

With the new jvm.options file in place, a new version of the configuration package was deployed to CI, and Elasticsearch broke.

You’re Lying To Me

The deployment logic is simple. Get Octopus to deploy some files to a temporary location, mutate those files if necessary (this is where the Bash script above comes in), copy the files to the appropriate locations and then restart Elasticsearch.
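
The copy/restart part boils down to a few lines of Bash. A rough sketch (the paths are the ones mentioned elsewhere in this post, but the exact restart mechanism here is an assumption on my part):

# Echo commands after expansion
set -x

# Copy the (now mutated) configuration files from the temporary deployment location
cp /tmp/elk-elasticsearch/elasticsearch.yml /etc/elasticsearch/elasticsearch.yml || exit 1
cp /tmp/elk-elasticsearch/jvm.options /etc/elasticsearch/jvm.options || exit 1

# Restart Elasticsearch so the new configuration takes effect
sudo service elasticsearch restart || exit 1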

Simply restarting the Elasticsearch node is not necessarily the best way to go about things, given that Elasticsearch will try to shuffle shards/replicas around if a node disappears from the cluster. I’m going to ignore this particular fact for now, and hope that it doesn’t come back to bite us in the ass at a later date.

When the Elasticsearch service was restarted after the new deployment, it failed with the following error:

Invalid initial heap size: -Xms468M
Could not create the Java virtual machine.

A cursory internet search finds a decent number of these errors, but they mostly seem to be related to trying to allocate more memory than is available, i.e. a valid problem with the configuration. The problem was, Elasticsearch was running on a t2.micro EC2 instance, which has 1GB of memory, so 468MB is valid. The instance in question only runs Elasticsearch too, so it’s not like anything else had already stolen the memory.

My first instinct when I get error messages that seem to make no sense, especially with Java components, is to look at the encoding. In this particular case though, the jvm.options file was UTF-8, so that was a dead end.

The next step was to grab the jvm.options file that ships with Elasticsearch and see if I had made some sort of critical error when I was editing it. Eyeballing the two files showed no major differences though, aside from the change to the memory allocation settings.

Next idea?

Line breaks.

This was where I made a critical error. I looked at the file in Linux using vi and its :set list command. Both my file and the default jvm.options file that ships with Elasticsearch looked like they treated line endings the same.

Long story short, I tried a few more things, tore my (non-existent) hair out a bit and then finally copied the two files to my computer and compared them in Notepad++.

My file had CRLF line endings, the default had LF. The extra carriage return was breaking the interpretation of the options file, which meant it was trying to use a setting of -Xms468M[CR], which the JVM rejects as invalid.
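
In hindsight, a couple of standard Linux tools would have shown the problem immediately (output shown as comments, and is indicative only):

# file explicitly calls out DOS/Windows line endings
file /tmp/elk-elasticsearch/jvm.options
# jvm.options: ASCII text, with CRLF line terminators

# cat -A prints carriage returns as ^M before the $ end-of-line marker
cat -A /tmp/elk-elasticsearch/jvm.options | head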

A quick edit to the Bash script used during deployment to clean this up and to stop it from happening again, and everything was fine.

# Echo commands after expansion
set -x

# Calculate half of the total memory (in MB) and substitute it into the JVM options file
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" /tmp/elk-elasticsearch/jvm.options || exit 1

# Strip carriage returns so the file always ends up with Unix (LF) line endings
sed -i 's/\r//' /tmp/elk-elasticsearch/jvm.options || exit 1

Conclusion

This was one of those weird cases where I checked all of the usual suspects and yet still failed miserably to resolve the problem in a reasonable timeframe. A complete lack of experience using Linux native tools to debug and diagnose common issues was the main contributing factor, but I really wish software would give better errors in these cases.

I mean, anything would be better than “completely valid looking setting is invalid, please try again”.


Regularly writing blog posts is hard.

Recently, I’ve been focusing on monstrously long, multi-part posts explaining complex or involved processes.

I’m honestly not sure whether this pattern is good or bad in the greater scheme of things. Either I just happen to be tackling (and thus writing about) bigger problems in my day to day life, or I’ve simply run out of small things to write about, leaving only the big ones.

But anyway, enough musing, because this post fits squarely into the “bunch of connected posts about one topic” theme.

I was pretty happy with the state of our data synchronization algorithm the last time I wrote about it. We’d just finished putting together some optimizations that dramatically reduced the traffic while still maintaining the quality of the syncing process, and it felt pretty good. It’s been a little over a month since those changes were deployed, and everything has been going swimmingly. We’ve added a few new tables to the process, but the core algorithm hasn’t changed.

Normally this is the point where I would explain how it was all secretly going terribly wrong, but in a weird twist of fate, the algorithm is actually pretty solid.

We did find a bug that can cause the on-premises and remote locations to be out of sync though, which was unfortunate. It happens infrequently, and only to a small subset of the data, but it still makes for an interesting topic to write about.

Well, interesting to me at least.

Optimizations Are Always Dangerous

The core of the bug lies in our recent optimizations.

In order to reduce the amount of busywork traffic occurring (i.e. the traffic resulting from the polling nature of the process), we implemented some changes that leverage local and remote table manifests to short-circuit the sync process if there was nothing to do. To further minimize the traffic to the API, we only queried the remote table manifest at the start of the run and then used that for the comparison against the local on the next run. Essentially we exchanged a guaranteed call on every non-skipped sync for one call each time the local and remote became identical.

The bug arises in the rare case where the captured remote from the last run is the same as the current local, even though the current remote is different.

The main way that this seems to happen is:

  • Table with low rate of change gets a new row.
  • Algorithm kicks in and syncs the new data.
  • In the time between the run that pushed the data and the next one, user removes the data somehow.
  • Current local now looks exactly like the remote did before the data was pushed.
  • Sync algorithm thinks that it has nothing to do.

In this case the algorithm is doing exactly what it was designed to do. It’s detected that there are no changes to deal with, and will continue to skip executions until something does change (new rows, updates, deletions, anything), at which point it will run again. If the table changes infrequently, we’re left with an annoying desync for much longer than we would like.

Like I said earlier, it’s a pretty specific situation, with relatively tight timings, and it only occurs for tables that are infrequently changed, but a bug is a bug, so we should fix it all the same.

Safety Dance

The obvious solution is to requery the remote after the sync operation has finished executing, and store that value for the comparison against the local next time, rather than relying on the value from before the operation started.

The downside of this is that it adds another request to every single non-skipped sync, which adds up to a significant amount of additional traffic. We’re still way ahead of the situation before the optimizations, but maybe we can do better?

Another idea is to limit the maximum number of skips that can happen in a row, taking into account how long we might want the situation described above to persist.

This approach also raises the number of requests occurring, but has the happy side effect of picking up changes at the remote end as well (i.e. nothing has changed locally, but we deleted all the data remotely in order to force a resync or something).

To compare the two possible fixes, I actually did some math to see which one would result in more requests. With the maximum number of skips set to a value that forces a run every 30 minutes or so, they are pretty much a wash in terms of additional requests.

I’ve flip-flopped a fair bit on which solution I think we should apply, initially thinking the “limit maximum skips” approach was the best (because it essentially offers a sanity check to the concept of skipping runs), but from an engineering point of view, it just feels messy, like the sort of solution you come up with when you can’t come up with something better. Almost brute force in its approach.

I’m currently favouring amending the algorithm to query the remote after the operation executes because it feels the cleanest, but I’m not ecstatic about it either, as it feels like it’s doing more work than is strictly necessary.

Decisions decisions.

Conclusion

As much as it saddens me to find bugs, it pleases me to know that with each bug fixed, the algorithm is becoming stronger, like tempering steel, or building muscle. Applying stress to something causes it to break down, and then it gets repaired and improved.

It can be tempting to just throw a fix in whenever you find a bug like that, but I believe that hack fixes should never be tolerated without a truly exceptional reason. You should always aim to make the code better as you touch it, not worse. The hardest part of fixing bugs is to perform the repairs in such a way that it doesn’t compromise the design of the code.

Of course, if the design is actually the cause of the problem, then you’re in for a world of hurt.


With all that talk about logging environments and their improvements out of the way, it’s time to return to familiar territory yet again.

Our ELB logs processor has been chugging along for a while now, dutifully aggregating all of the ELB logs for one of our APIs. We have quite a few APIs though, and almost all of them have ELB logs to process, so there was no sense keeping the awesome sequestered in one place.

Time to spread the love.

Honestly, we always planned on this sort of reuse anyway, so the logs processor was built in such a way that it could be relatively easily used in another location.

The actual Javascript code is completely generic, not specific to a particular Lambda function/API combo. There are things that need to change between instances obviously, but they are covered by variables, which are easily changed via Octopus Deploy. In order to set up the processor for a new API, all you need to do is add the Lambda function/permissions to the environment definition (i.e. the CloudFormation template), create a new Octopus Project and add a small piece of Powershell to the environment scripts to deploy that project.
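
As a rough sketch of what the CloudFormation additions look like (shown in YAML form, with the resource names, handler, role and buckets all being placeholders rather than our actual values), it amounts to a Lambda function plus a permission that lets S3 invoke it:

ElbLogsProcessorFunction:
  Type: "AWS::Lambda::Function"
  Properties:
    Runtime: "nodejs4.3"
    Handler: "index.handler"
    Role: { "Fn::GetAtt": ["ElbLogsProcessorRole", "Arn"] }
    Code:
      S3Bucket: "deployment-artifacts-bucket"
      S3Key: "elb-logs-processor.zip"

ElbLogsProcessorPermission:
  Type: "AWS::Lambda::Permission"
  Properties:
    Action: "lambda:InvokeFunction"
    FunctionName: { "Ref": "ElbLogsProcessorFunction" }
    Principal: "s3.amazonaws.com"
    SourceArn: "arn:aws:s3:::elb-logs-bucket"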

What could possibly go wrong?

Surface Tension

Surprisingly, very little actually went wrong.

I mean, it didn’t work, and we had no log events inside the ELK stack for the new API, but at a glance, there were no catastrophic failures.

The Lambda function existed, it was receiving events from S3, downloading files, processing them and pumping them out via TCP. There were no errors (at the Lambda function level OR within the function itself) and all of the function logging showed everything working as expected.

After digging into it a little bit, it turned out that not only were the ELB logs for the new API missing, the CI and Staging logs for the older API were missing as well. Weirdly enough, the Production logs for the older API were working exactly as expected.

Mysteriously, the code was exactly the same in all 4 Lambda functions, with the only differences being in configuration (which was relatively simple).

Forget About Freeman!

To cut a relatively long story short, the core of the problem was related to the asynchronous nature of TCP connections (via sockets) in Node.

This was the problem code:

create: function() {
    return new Promise(function(resolve, reject) {
        // The socket is resolved (and pooled) immediately, even though it may not have finished connecting yet
        const socket = net.createConnection(config.logstashPort, config.logstashHost);
        summary.connections.created += 1;
        resolve(socket);
    });
}

This snippet is the mechanism by which TCP connections are created and added to the pool in the Lambda function, prior to being used to actually send data to our log stack.

It turns out that a socket can be returned by net.createConnection before it has actually connected. Even worse, sockets that are still connecting will not throw errors when you attempt to write to them. I’m not sure what actually happens to the writes in this case, but giving Javascript/Node the benefit of the doubt, I assume they are probably queued in some way.

Unfortunately, the Lambda function was exiting before any sort of queued writes could actually be applied, but only when there was a small total number of ELB log lines being processed. This is why it was working fine in production (thousands of messages every execution) and failing miserably in CI/Staging and the new environment (tens of messages). The function was just too quick, and didn’t know that it needed to wait for the socket to be connected and for all pending writes to execute before it was allowed to finish. As far as the promise chain was concerned, everything was done and dusted.

The solution is actually pretty simple:

create: function() {
    return new Promise(function(resolve, reject) {
        const socket = new net.Socket();

        // Only resolve (and therefore pool) the socket once it has actually connected
        socket.on("connect", () => {
            summary.connections.created += 1;
            resolve(socket);
        });

        // Record the failure (keeping only the last few errors for the summary), then clean up and reject
        socket.on("error", error => {
            summary.failures.sockets.total += 1;
            if (summary.failures.sockets.lastFew.length >= 5) {
                summary.failures.sockets.lastFew.shift();
            }
            summary.failures.sockets.lastFew.push(error);
            socket.destroy(error);
            reject(error);
        });

        socket.connect(config.logstashPort, config.logstashHost);
    });
}

Instead of immediately returning the socket, hook a listener up to its connect event and resolve the promise there. I added in some extra error handling/reporting as well, but it’s all pretty straightforward.

What this means is that a socket is never added to the pool unless it’s already connected, which makes the rest of the code work exactly as expected.

Conclusion

It was nice to return briefly to the ELB logs processor to witness its reuse, especially because the return actually led to the discovery and eventual fix of a bug that I missed the first time.

I’m still wrapping my head around the intrinsically asynchronous/callback nature of Node though, which was the root cause of the issue. To be honest, while I really like the way that promises work, I really really hate the callback hell that vanilla Node seems to encourage. It was so easy to create a socket and write to it, without the socket actually being valid, and the only way to ensure the socket was valid was via an asynchronous event handler.

To complicate matters, I’m sure it doesn’t help that AWS Lambda can only run a relatively old version of Node (4.3.2). The latest version of Node is miles ahead of that, and it probably has hundreds of bugfixes that might be relevant.

Anyway, I’m just happy I got to steal more Half-Life chapter titles as headings.

Seriously, they fit so well.


With environments and configuration out of the way, it’s time to put all of the pieces together.

Obviously this isn’t the first time that both of those things have been put together. In order to validate that everything was working as expected, I was constantly creating/updating environments and deploying new versions of the configuration to them. Not only that, but with the way our deployment pipeline works (commit, push, build, test, deploy [test], [deploy]), the CI environment had been up and running for some time.

What’s left to do then?

Well, we still need to create the Staging and Production environments, which should be easy because those are just deployments inside Octopus now.

The bigger chunk of work is to use those new environments and to redirect all of our existing log traffic as appropriate.

Hostile Deployment

This is a perfect example of why I spend time on automating things.

With the environments setup to act just like everything else in Octopus, all I had to do to create a Staging environment was push a button. Once the deployment finished and the infrastructure was created, it was just another button push to deploy the configuration for that environment to make it operational. Rinse and repeat for all of the layers (Broker, Indexer, Cache, Elasticsearch) and Staging is up and running.

Production was almost the same, with one caveat. We use an entirely different AWS account for all our production resources, so we had to override all of the appropriate Octopus variables for each environment project (like AWS credentials, VPC ID, Subnet IDs, etc). With those overrides in place, all that’s left is to make new releases (to capture the variables) and deploy to the appropriate environments.

It’s nice when everything works.

Redirecting Firepower

Of course, the new logging environments are worthless without log events. Luckily, we have plenty of those:

  • IIS logs from all of our APIs
  • Application logs from all of our APIs
  • ELB logs from a subset of our load balancers, most of which are APIs, but at least one is an Nginx router
  • Simple performance counter statistics (like CPU, Memory, Disk, etc) from basically every EC2 instance
  • Logs from on-premises desktop applications

We generally have CI, Staging and prod-X (green/blue) environments for all of our services/APIs (because that’s how our build/test/deployment pipeline works), so now that we have similarly partitioned logging environments, all we have to do is line them up (CI to CI, Staging to Staging and so on).

For the on-premises desktop applications, there is no CI, but they do generally have the capability to run in Staging mode, so we can use that setting to direct log traffic.

There are a few ways in which the log events hit the Broker layer:

  • Internal Logstash instance running on an EC2 instance with a TCP output pointing at the Broker hostname
  • Internal Lambda function writing directly to the Broker hostname via TCP (this is the ELB logs processor)
  • External application writing to an authenticated Logging API, which in turn writes to the Broker via TCP (this is for the on-premises desktop applications)

We can change the hostname used by all of these mechanisms simply by changing some variables in Octopus Deploy, making a new release and deploying it through the environments.
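
For the Logstash instances, for example, that just means the TCP output references an Octopus variable rather than a hardcoded hostname. Something along these lines (the variable name and port are illustrative):

output {
  tcp {
    # Octopus substitutes the Broker hostname appropriate for the environment
    host => "#{BrokerHostname}"
    port => 5000
  }
}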

And that’s exactly what we did, making sure to monitor the log event traffic for each one to make sure we didn’t lose anything.

With all the logs going to their new homes, all that was left to do was destroy the old log stack, easily done manually through CloudFormation.

You might be wondering about the log events that were stored in the old stack. Well, we generally only keep around 14 days’ worth of log events in the stack itself (because there are so many), so we pretty much just left the old stack up for searching purposes until it was no longer relevant, and then destroyed it.

Conclusion

And that basically brings us to the end of this series of posts about our logging environment and the reclamation thereof.

We’ve now got our standard deployment pipeline in place for both infrastructure and configuration and have separated our log traffic accordingly.

This puts us in a much better position moving forward. Not only is the entire thing fully under our control, but we now have the capability to test changes to infrastructure/configuration before just deploying them into production, something we couldn’t do before when we only had a single stack for everything.

In all fairness though, all we really did was polish an existing system so that it was a better fit for our specific usage.

Evolutionary, not revolutionary.


Continuing on from last week, it’s time to talk software and the configuration thereof.

With the environments themselves being managed by Octopus (i.e. the underlying infrastructure), we need to deal with the software side of things.

Four of the five components in the new ELK stack require configuration of some sort in order to work properly:

  • Logstash requires configuration to tell it how to get events, how to filter/process them and where to put them, so the Broker and Indexer layers require different, but similar, configuration.
  • Elasticsearch requires configuration for a variety of reasons including cluster naming, port setup, memory limits and so on.
  • Kibana requires configuration to know where Elasticsearch is, and a few other things.

For me, configuration is a hell of a lot more likely to change than the software itself, although with the pace that some organisations release software, that might not always be strictly true. Also, coming from a primarily Windows background, I’m used to software being a lot more difficult to install and set up, and by that logic it’s not something you want to do all the time.

Taking those things into account, I’ve found it helpful to separate the installation of the software from the configuration of the software. What this means in practice is that a particular version of the software itself will be baked into an AMI, and then the configuration of that software will be handled via Octopus Deploy whenever a machine is created from the AMI.
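
To make that split concrete, the Packer side is little more than a builder plus a provisioner that installs a pinned version of the software, with no configuration baked in. A heavily simplified sketch (region, AMI ID, version and names are all placeholders, and credentials are assumed to come from the environment):

{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "t2.micro",
    "ssh_username": "ec2-user",
    "ami_name": "elk-elasticsearch-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo yum install -y https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.rpm"
    ]
  }]
}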

Using AMIs or Docker Images to create immutable software + configuration artefacts is also a valid approach, and is superior in a lot of respects. It makes dynamic scaling easier (by facilitating a quick startup of fully functional nodes), helps with testing and generally simplifies the entire process. Docker Images in particular are something that I would love to explore in the future, just not right at this moment.

The good news is that this is pretty much exactly what the new stack was already doing, so we only need to make a few minor improvements and we’re good to go.

The Lament Configuration

As I mentioned in the first post in this series, software configuration was already being handled by TeamCity/Nuget/Octopus Deploy; it just needed to be cleaned up a bit. The first thing was to move the configuration for each layer out into its own repository and update the TeamCity builds as necessary. The Single Responsibility Principle doesn’t just apply to classes after all.

The next part is something of a personal preference, and it relates to the logic around deployment. All of the existing configuration deployments in the new stack had their logic (i.e. where to copy the files on the target machine, how to start/restart services and so on) encapsulated entirely inside Octopus Deploy. I’m not a fan of that. I much prefer to have all of this logic inside scripts in source control alongside the artefacts that will be deployed. This leaves projects in Octopus Deploy relatively simple, only responsible for deploying Nuget packages, managing variables (which is hard to encapsulate in source control because of sensitive values) and generally overseeing the whole process. This is the same sort of approach that I use for building software, with TeamCity acting as a relatively stupid orchestration tool, executing scripts that live inside source control with the code.

Octopus actually makes using source controlled scripts pretty easy, as it will automatically execute scripts named a certain way at particular times during the deployment of a Nuget package (for example, any script called deploy.ps1 at the root of the package will be executed after the package has been copied to the appropriate location on the target machine). The nice thing is that this also works with bash scripts for Linux targets (i.e. deploy.sh), which is particularly relevant here, because all of the ELK stuff happens on Linux.

Actually deploying most of the configuration is pretty simple. For example, this is the deploy.sh script for the ELK Broker configuration.

# The deploy script is automatically run by Octopus during a deployment, after Octopus does its thing.
# Octopus deploys the contents of the package to /tmp/elk-broker/
# At this point, the logstash configuration directory has been cleared by the pre-deploy script

# Echo commands after expansion
set -x

# Copy the settings file
cp /tmp/elk-broker/logstash.yml /etc/logstash/logstash.yml || exit 1

# Copy the config files
cp /tmp/elk-broker/*.conf /etc/logstash/conf.d/ || exit 1

# Remove the UTF-8 BOM from the config files
sed -i '1 s/^\xef\xbb\xbf//' /etc/logstash/conf.d/*.conf || exit 1

# Test the configuration
sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/ -t --path.settings /etc/logstash || exit 1

# Set the ownership of the config files to the logstash user, which is what the service runs as
sudo chown -R logstash:logstash /etc/logstash/conf.d || exit 1

# Restart logstash - don't use the restart command, it throws errors when you try to restart a stopped service
sudo initctl stop logstash || true
sudo initctl start logstash

Prior to the script-based approach, this logic was spread across five or six different steps inside an Octopus Project, which I found much harder to read and reason about.

Or Lemarchand’s Box If You Prefer

The only other difference worth talking about is the way in which we actually trigger configuration deployments.

Traditionally, we have asked Octopus to deploy the most appropriate version of the necessary projects during the initialization of a machine. For example, the ELK Broker EC2 instances had logic inside their LaunchConfiguration:UserData that said “register as Tentacle”, “deploy X”, “deploy Y” etc.

This time I tried something a little different, but which feels a hell of a lot smarter.

Instead of the machine being responsible for asking for projects to be deployed to it, we can just let Octopus react to the registration of a new Tentacle and deploy whatever Projects are appropriate. This is relatively easy to setup as well. All you need to do is add a trigger to your Project that says “deploy whenever a new machine comes online”. Octopus takes care of the rest, including picking what version is best (which is just the last successful deployment to the environment).

This is a lot cleaner than hardcoding project deployment logic inside the environment definition, and allows what software gets deployed where to change without having to edit or update the infrastructure definition. This sort of automatic deployment approach is probably more useful for our old way of handling environments (i.e. that whole terrible migration process with no update logic) than it is for the newer, easier to update environment deployments, but it’s still nice all the same.

Conclusion

There really wasn’t much effort required to clean up the configuration for each of the layers in the new ELK stack, but it was a great opportunity to try out the new trigger based deployments in Octopus Deploy, which was pretty cool.

With the configuration out of the way, and the environment creation also sorted, all that’s left is to actually create some new environments and start using them instead of the old one.

That’s a topic for next time though.