Queue Queuing
A very quick post this week, because I’ve been busy rebuilding our ELB Logs Processor in .NET Core. I had some issues converting it to use HTTP instead of TCP for connecting to Logstash, and I just got tired of dealing with Javascript.
I’m sure Javascript is a perfectly fine language capable of accomplishing many wondrous things. It’s not a language I will voluntarily choose to work with though, not when I can pick C# to accomplish the same thing.
On to the meat of this post though, which is a quick heads up for people who want to upgrade from Logstash 5.2.X to 5.4.0 (something I did recently for the Broker/Indexer layer inside our Log Aggregation Stack).
Make sure you configure a data directory, and that the directory both exists and has appropriate permissions.
Queueing Is Very British
Logstash 5.4.0 marked the official release of the Persistent Queues feature (which had been in beta for a few versions). This is a pretty neat feature that allows you to skip the traditional queue/cache layer in your log aggregation stack. Basically, when enabled, it inserts a disk queue into your Logstash instance in between inputs and filters/outputs. It only works for inputs that have request/response models (so HTTP good, TCP bad), but it’s a pretty cool feature all round.
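Turning it on is just a couple of settings in logstash.yml; a minimal sketch (the size cap below is an illustrative value, not a recommendation):

```yaml
# logstash.yml
# Swap the default in-memory queue for the disk-backed persistent queue.
queue.type: persisted
# Cap how much disk the queue can consume before applying back-pressure (illustrative value).
queue.max_bytes: 1gb
```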
I have plans to eventually use it to completely replace our Cache and Indexer layers in the log aggregation stack (a win for general complexity and number of moving parts), but when I upgraded to 5.4.0 I left it disabled, because we already have ElastiCache (Redis) for that.
That didn’t stop it from causing problems though.
I Guess They Just Like Taking Orderly Turns
Upgrading the version of Logstash we use is relatively straightforward. We bake a known version of Logstash into a new AMI via Packer, update an environment parameter for the stack, kick off a build and let TeamCity/Octopus take care of the rest.
To actually bake the AMI, we just update the Packer template with the new information (in this case, the Logstash version that should be installed via yum) and then run it through TeamCity.
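The relevant part of such a template is just a shell provisioner that pins the version. A minimal sketch, not our exact template, assuming the Elastic yum repository has already been configured on the base image:

```json
{
  "provisioners": [
    {
      "type": "shell",
      "inline": [
        "sudo yum install -y logstash-5.4.0"
      ]
    }
  ]
}
```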
On the other side, in the environment itself, when we update the AMI in use, CloudFormation will slowly replace all of the EC2 instances inside the Auto Scaling Group with new ones, waiting for each one to come online before continuing. We use Octopus Deploy Triggers to automate the deployment of software to those machines when they come online.
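The slow replacement is standard Auto Scaling rolling update behaviour, driven by an UpdatePolicy on the group. A minimal sketch of the sort of thing involved (resource name, counts and timings are illustrative, not taken from our stack):

```json
"LogstashAutoScalingGroup": {
  "Type": "AWS::AutoScaling::AutoScalingGroup",
  "UpdatePolicy": {
    "AutoScalingRollingUpdate": {
      "MinInstancesInService": "1",
      "MaxBatchSize": "1",
      "WaitOnResourceSignals": "true",
      "PauseTime": "PT15M"
    }
  }
}
```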
This is where things started to fall down with Logstash 5.4.0.
The Octopus deployment of the Logstash Configuration was failing. Specifically, Logstash would simply never come online with the AMI that used 5.4.0 and the configuration that we were using successfully for 5.2.0.
The Logstash log files were full of errors like this:
```
[2017-05-24T04:42:02,021][FATAL][logstash.runner          ] An unexpected error occurred!
{:error=>#<ArgumentError: Path "/usr/share/logstash/data/queue" must be a writable directory. It is not writable.>,
 :backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/settings.rb:433:in `validate'",
  "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:216:in `validate_value'",
  "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:132:in `validate_all'",
  "org/jruby/RubyHash.java:1342:in `each'",
  "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:131:in `validate_all'",
  "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:217:in `execute'",
  "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'",
  "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'",
  "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'",
  "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"]}
```
A bit weird considering I hadn’t changed anything in our config, but it makes sense that maybe Logstash itself can’t write to the directory it was installed into by yum, and the new version now needs to do just that.
Moving the data directory was simple enough: add path.data to the logstash.yml inside our configuration package, and make sure that the data directory exists and that the Logstash user/group has ownership and full control.
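Something like this, where the actual path is an illustrative assumption rather than our real location:

```yaml
# logstash.yml
# Move the data directory (and therefore the queue directory) away from the yum install location.
path.data: /opt/logstash/data
```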
I still got the same error though, except the directory was different (it was the one I specified).
I Mean Who Doesn’t
To be honest, I fought with this problem for a few hours, trying various permutations of permissions, ACLs, ownership, groups and users.
In the end, I just created the queue directory ahead of time (as part of the config deployment) and set the ownership of the data directory recursively to the Logstash user/group.
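In script form it boils down to something like this, where the path matches the illustrative logstash.yml above and the logstash user/group is the one created by the yum package:

```bash
# Pre-create the queue directory that Logstash 5.4.0 expects at startup,
# then hand the entire data directory to the Logstash user.
sudo mkdir -p /opt/logstash/data/queue
sudo chown -R logstash:logstash /opt/logstash/data
```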
This was enough to make Logstash stop complaining about the feature I didn’t want to use and get on with its life.
I still don’t understand what happened though, so I logged an issue in the Logstash repo on GitHub. Maybe someone will explain it one day. Weirdly, it looks like Logstash created a directory that it was then not allowed to access (the /queue directory under the specified data directory), which points towards something being wrong with my configuration (ownership or something similar), but I couldn’t find anything to confirm that.
Conclusion
This one really came out of left field. I didn’t expect the upgrade to 5.4.0 to be completely painless (rarely is a software upgrade painless), but I didn’t expect to struggle with an issue caused by a feature that I didn’t even want to use.
What’s even weirder about the whole thing is that Persistent Queues were available (at least in beta form) in the version of Logstash I was upgrading from (5.2.0), and I had no issues whatsoever.
Don’t get me wrong, Logstash is an amazing product, but it can also be incredibly frustrating.