
A long long time ago, a valuable lesson was learned about segregating production infrastructure from development/staging infrastructure. Feel free to go and read that post if you want, but to summarise: I ran some load tests on an environment (one specifically built for load testing, separate from our normal CI/Staging/Production), and the tests flooded a shared component (a proxy), bringing down production services and impacting customer experience. The outage didn’t last that long once we realised what was happening, but it was still pretty embarrassing.

Shortly after that adventure, we created a brand new AWS account to isolate our production infrastructure and slowly moved all of that important stuff into it, protecting it from developers doing what they do best (breaking stuff in new and interesting ways).

This arrangement complicated a few things, but the most relevant to this discussion was the creation and management of AMIs.

We were already using Packer to create and maintain said AMIs, so it wasn’t a manual process, but by default an AMI in AWS is owned by and accessible from only one AWS account.

With two completely different AWS accounts, it was easy to imagine a situation where each account has slightly different AMIs available, which have slightly different behaviour, leading to weird things happening on production environments that don’t happen during development or in staging.

That sounds terrible, and it would be neat if we could ensure it doesn’t happen.

A Packaged Deal

The easiest thing to do is share the AMIs in question.

AWS makes it relatively easy to make an AMI accessible to a different AWS account, likely for this exact purpose. I think it also enables companies to sell pre-packaged AMIs, but that’s a space I know little to nothing about, so I’m not sure.

As long as you know the account number of the AWS account that you want to grant access to, it’s a simple matter to use the dashboard or the API to share the AMI, which can then be freely used from the other account to create EC2 instances.

One thing to be careful of is to make sure you grant the other account access to the AMI’s backing snapshot as well, or you’ll run into permission problems when you try to actually use the AMI to make an EC2 instance.
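
For reference, both grants are one-liners with the AWS CLI. The image ID, snapshot ID and account number below are placeholders, so treat this as a sketch of the mechanism rather than a record of exactly what we did.

# Allow another account (hypothetical account number) to launch instances from the AMI
aws ec2 modify-image-attribute \
    --image-id ami-11111111 \
    --launch-permission "Add=[{UserId=123456789012}]"

# Grant access to the backing snapshot as well, or launches from the shared AMI will fail
aws ec2 modify-snapshot-attribute \
    --snapshot-id snap-22222222 \
    --attribute createVolumePermission \
    --operation-type add \
    --user-ids 123456789012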

Sharing AMIs is alright, but it has risks.

If you create AMIs in your development account and then share them with production, then you’ve technically got production infrastructure inside your development account, which was one of the things we desperately wanted to avoid. The main problem here is that people will not assume that a resource living inside the relatively free-for-all development environment could have any impact on production, and they might delete it or something equally dangerous. Without the AMI, auto scaling won’t work, and the most likely time to figure that sort of thing out is right when you need it the most.

A slightly better approach is to copy the AMI to the other account. In order to do this you share the AMI from the owner account (i.e. dev) and then make a permanent copy on the other account (i.e. prod). Once the copy is complete, you unshare (to prevent accidental usage).
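
Sketched out with the AWS CLI (again, the IDs, account number and region are placeholders), the whole dance looks something like the following, with the copy being run using production credentials:

# From the prod account: make a permanent, locally owned copy of the shared AMI
aws ec2 copy-image \
    --source-region ap-southeast-2 \
    --region ap-southeast-2 \
    --source-image-id ami-11111111 \
    --name "copy-of-ami-11111111"

# Back in the dev account: revoke the share once the copy has finished
aws ec2 modify-image-attribute \
    --image-id ami-11111111 \
    --launch-permission "Remove=[{UserId=123456789012}]"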

This breaks the linkage between the two accounts while ensuring that the AMIs are identical, so it’s a step up from simple sharing, but there are limitations.

For example, everything works swimmingly until you try to copy a Windows AMI, then it fails miserably as a result of the way in which AWS licences Windows. On the upside, the copy operation itself fails fast, rather than making a copy that then fails when you try to use it, so that’s nice.

So, two solutions, neither of which is ideal.

Surely we can do better?

Pack Of Wolves

For us, the answer is yes. We just run our Packer templates twice, once for each account.

This has actually been our solution for a while. We execute our Packer templates through TeamCity Build Configurations, so it is a relatively simple matter to just run the build twice, once for each account.

Well, “relatively simple” is probably understating it actually.

Running in dev is easy. Just click the button, wait and a wild AMI appears.

Prod is a different question.

When creating an AMI, Packer needs to know some things that are AWS account specific, like subnets, VPC, security groups and so on (mostly networking concerns). The source code contained parameters relevant for dev (hence the easy AMI creation for dev), but didn’t contain anything relevant for prod. Instead, whenever you ran a prod build in TeamCity, you had to supply a hashtable of parameter overrides, which would be used to alter the defaults and make it work in the prod AWS account.

As you can imagine, this is error prone.

Additionally, you actually have to remember to click the build button a second time and supply the overrides in order to make a prod image, or you’ll end up in a situation where you deployed your environment changes successfully through CI and Staging, but it all explodes (or even worse, subtly doesn’t do what it’s supposed to) when you deploy them into Production because there is no equivalent AMI. Then you have to go and make one using TeamCity, which is error prone, and if the source has diverged since you made the dev one…well, it’s just bad times all around.

Leader Of The Pack

With some minor improvements, we can avoid that whole problem though.

Basically, whenever we do a build in TeamCity, it creates the dev AMI first, and then automatically creates the prod one as well. If the dev build fails, no prod AMI is created. If the prod build fails, the dev AMI is deleted.

To keep things in sync, both AMIs are tagged with a version attribute created during the build (just like software), so that we have a way to trace the AMI back to the git commit it was created from (just like software).
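
The tags themselves are nothing fancy. Conceptually it’s just the following (the AMI ID, tag keys and values are illustrative; in reality the tagging happens automatically as part of the build):

# Tag the freshly baked AMI so it can be traced back to the build and commit that produced it
aws ec2 create-tags \
    --resources ami-11111111 \
    --tags Key=Version,Value=1.2.3 Key=Commit,Value=abc1234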

To accomplish this approach, we now have a relatively simple configuration hierarchy, with default parameters, dev specific parameters and prod specific parameters. When you start the AMI execution, you tell the function what environment you’re targeting (dev/prod) and it loads the defaults, then merges in the appropriate overrides.

This was a relatively easy way to deal with things that are different and non-sensitive (like VPC, subnets, security groups, etc).
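
If you were leaning purely on Packer’s own tooling, layered var-files get you most of the way there, because values in later files override values in earlier ones. The file names below are hypothetical (our actual merging happens in the scripts that wrap the Packer execution), but the idea is the same:

# Defaults first, then the environment specific overrides for the target account
packer build \
    -var-file=variables.defaults.json \
    -var-file=variables.prod.json \
    ami.packer.json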

What about credentials though?

Since an…incident…waaaay back in 2015, I’m pretty wary of credentials, particularly ones that give access to AWS.

So they can’t go in source control with the rest of the parameters.

That leaves TeamCity as the only sane place to put them, which it can easily do, assuming we don’t mind writing some logic to pick the appropriate credentials depending on our targeted destination.

We could technically have used some combination of IAM roles and AWS profiles as well, but we already have mechanisms and experience dealing with raw credential usage, so this was not the time to re-invent that particular wheel. That’s a fight for another day.

With account specific parameters and credentials taken care of, everything is good, and every build results in 2 AMIs, one for each account.

I’ve uploaded a copy of our Packer repository containing all of this logic (and a copy of the script we embed into TeamCity) to Github for reference purposes.

Conclusion

I’m much happier with the process I described above for creating our AMIs. If a build succeeds, it creates resources in both of our active AWS accounts, keeping them in sync and reducing the risk of subtle problems come deployment time. Not only that, but it also tags those resources with a version that can be traced back to a git commit, which is always more useful than you think.

There are still some rough edges around actually using the AMIs though. Most of our newer environments specify their AMIs directly via parameter files, so you have to remember to change the values for each environment target when you want to use a new AMI. This is dangerous, because if someone forgets it could lead to a disconnect between CI/Staging and Production, which was pretty much the entire problem we were trying to avoid in the first place.

Honestly, it’s going to be me that forgets.

Ah well, all in all, it’s a lot more consistent than it was before, which is pretty much the best I could hope for.


If anybody actually reads this blog, it would be easy to see that my world has been all about log processing and aggregation for a while now. I mean, I do other things as well, but the questions and problems that end up here on this blog have been pretty focused around logging for as far back as I care to remember. The main reason is that I just like logging, and appreciate the powerful insights that it can give about your software. Another reason is that we don’t actually have anyone who sits in the traditional “operations” role (i.e. infrastructure/environment improvement and maintenance), so in order to help keep the rest of the team focused on our higher level deliverables, I end up doing most of that stuff.

Anyway, I don’t see that pattern changing any time soon, so on with the show.

While I was going about regaining control over our log stack, I noticed that it was extremely difficult to reason about TCP traffic when using an AWS ELB.

Why does this matter?

Well, the Broker layer in our ELK stack (i.e. the primary ingress point) uses a Logstash configuration with a TCP input, and as a result, all of the things that write to the Broker (our externally accessible Ingress API, other instances of Logstash, the ELB Logs Processor) use TCP. That’s a significant amount of traffic, and something that I’d really like to be able to monitor and understand.

Stealthy Traffic

As far as I currently understand it, when you make a TCP connection through an ELB, the ELB records the initial creation of the connection as a “request” (one of the metrics you can track with CloudWatch) and then pretty much nothing else after that. I mean, this makes sense, as it’s the ELB’s job to essentially pick an underlying machine to route traffic to, and most TCP connections created and used specifically as TCP connections tend to be long lived (as opposed to TCP connections created as part of HTTP requests and responses).

As far as our three primary contributors are concerned:

  • The Logging Ingress API is pretty oblivious. It just makes a new TCP connection for each incoming log event, so unless the .NET Framework is silently caching TCP connections for optimization purposes, it’s going to cause one ELB request per log event.
  • The ELB Logs Processor definitely caches TCP connections. We went through a whole ordeal with connection pooling and reuse before it would function in production, so it’s definitely pushing multiple log events through a single socket.
  • The Logstash instances that we have distributed across our various EC2 instances (local machine log aggregation, like IIS and application logs) are using the Logstash TCP output. I assume it uses one (or many) long-lived connections to do its business, but I don’t really know. Logstash is very mysterious.

This sort of usage makes it very hard to tell just how many log events are coming through the system via CloudWatch, which is a useful metric, especially when things start to go wrong and you need to debug which part of the stack is actually causing the failure.

Unfortunately, the monitoring issue isn’t the only problem with using the Logstash TCP input/output. Both input and output have, at separate times, been…flakey. I’ve experienced both sides of the pipeline going down for no obvious reason, or simply not working after running uninterrupted for a long time.

The final nail in the coffin for TCP came recently, when Elastic.co released the Logstash Persistent Queue feature for Logstash 5.4.0, which does not work with TCP at all (it only supports inputs that use the request-response model). I want to use persistent queues to remove both the Cache and Indexer layers from our log stack, so it was time for TCP to die.

Socket Surgery

Adding a HTTP input to our Broker layer was easy enough. In fact, such an input was already present because the ELB uses a HTTP request to check whether or not the Broker EC2 instances are healthy.
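
As a quick illustration of why the request/response model is so much easier to reason about: every log event becomes a plain HTTP POST with a status code you can check. Something like the following (the host, port and fields are assumptions, not our exact configuration):

# One log event, one request, one response
curl -X POST "http://broker.example.com:8080/" \
    -H "Content-Type: application/json" \
    -d '{"message":"something happened","Severity":"Info","Application":"SomeService"}'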

Conceptually, changing our Logstash instances to use a HTTP output instead of TCP should also be easy. Just need to change some configuration and deploy through Octopus. Keep in mind I haven’t actually done it yet, but it feels simple enough.

In a similar vein, changing our Logging Ingress API to output through HTTP instead of TCP should also be easy. A small code change to use HttpClient or RestSharp or something, a new deployment and everything is puppies and roses. Again, I haven’t actually done it yet, so who knows what dragons lurk there.

Then we have the ELB Logs Processor, which is a whole different kettle of fish.

It took a significant amount of effort to get it working with TCP in the first place (connection pooling was the biggest problem), and due to the poor quality of the Javascript (entirely my fault), it’s pretty tightly coupled to that particular mechanism.

Regardless of the difficulty, TCP has to go, for the good of the log stack.

The first issue I ran into was “how do you even do HTTP requests in Node 4.3.2 anyway?”. There are many answers to this question, but the most obvious one is to use the HTTP API that comes with Node. Poking around this for a while showed that it wasn’t too bad, as long as I didn’t want to deal with a response payload, which I didn’t.

The biggest issue with the native Node HTTP API was that it was all callbacks, all the time. In my misadventures with the ELB Logs Processor I’d become very attached to promises and the effect they have on the readability of the resulting code, and didn’t really want to give that up so easily. I dutifully implemented a simple promise wrapper around our specific usage of the native Node HTTP API (which was just a POST of a JSON payload), and incorporated it into the Lambda function.

Unfortunately, this is where my memory gets a little bit fuzzy (it was a few weeks ago), and I don’t really remember how well it went. I don’t think it went well, because I decided to switch to a package called Axios which offered promise based HTTP requests out of the box.

Axios of Evil

Axios was pretty amazing. Well, I mean, Axios IS pretty amazing, but I suppose that sentence gave it away that the relationship didn’t end well.

The library did exactly what it said it did and with its native support for promises, was relatively easy to incorporate into the existing code, as you can see from the following excerpt:

// snip, whole bunch of setup, including summary object initialization

let axios = require('axios').create({
    baseURL: 'http://' + config.logstashHost,
    headers: {'Content-Type': 'application/json'}
});

// snip, more setup, other functions

function handleLine(line) {
    summary.lines.encountered += 1;

    var entry = {
        Component: config.component,
        SourceModuleName: config.sourceModuleName,
        Environment: config.environment,
        Application: config.application,
        message: line,
        type: config.type,
        Source: {
            S3: s3FileDetails
        }
    };
        
    var promise = axios
        .post("/", entry)
        .then((response) => {
            summary.lines.sent += 1;
        })
        .catch((error) => { 
            summary.failures.sending.total += 1;
            if (summary.failures.sending.lastFew.length >= 5) {
                summary.failures.sending.lastFew.shift();
            }
            summary.failures.sending.lastFew.push(error);
        });

    promises.push(promise);
}

// snip, main body of function (read S3, stream to line processor, wait for promises to finish)

Even though it took a lot of effort to write, it was nice to remove all of the code relating to TCP sockets and connection pooling, as it simplified the whole thing.

The (single, manual) test proved that it still did its core job (contents of file written into the ELK stack), it worked in CI and it worked in Staging, so I was pretty happy.

For about 15 minutes that is, until I deployed it into Production.

Crushing Disappointment

Just like last time, the implementation simply could not deal with the amount of traffic that was being thrown at it. Even worse, it wasn’t actually logging any errors or giving me any indication as to why it was failing. After a brief and frustrating investigation, it looked like it was simply running out of memory (the Lambda function was only configured to use 192 MB, which had been enough for the TCP approach) and falling over once it hit that limit. That was only a hypothesis, and I was never able to conclusively prove it, but the function was definitely using all of the memory available to it each time it ran.

I could have just increased the available memory, but I wanted to understand where all the memory was going first.

Then I realised I would have to learn how to do memory analysis in Javascript, and I just gave up.

On Javascript that is.

Instead, I decided to rewrite the ELB Logs Processor in .NET Core, using a language that I actually like (C#).

Conclusion

This is one of those cases where looking back, with the benefits of hindsight, I probably should have just increased the memory until it worked and then walked away.

But I was just so tired of struggling with Javascript and Node that it was incredibly cathartic to just abandon it all in favour of something that actually made sense to me.

Of course, implementing the thing in C# via .NET Core wasn’t exactly painless, but that’s a topic for another time.

Probably next week.


A very quick post this week, because I’ve been busy rebuilding our ELB Logs Processor in .NET Core. I had some issues converting it to use HTTP instead of TCP for connecting to Logstash and I just got tired of dealing with Javascript.

I’m sure Javascript is a perfectly fine language capable of accomplishing many wondrous things. It’s not a language I will voluntarily choose to work with though, not when I can pick C# to accomplish the same thing.

On to the meat of this post though, which is a quick heads up for people who want to upgrade from Logstash 5.2.X to 5.4.0 (something I did recently for the Broker/Indexer layer inside our Log Aggregation Stack).

Make sure you configure a data directory and that that directory both exists and has appropriate permissions.

Queueing Is Very British

Logstash 5.4.0 marked the official release of the Persistent Queues feature (which had been in beta for a few versions). This is a pretty neat feature that allows you to skip the traditional queue/cache layer in your log aggregation stack. Basically, when enabled, it inserts a disk queue into your Logstash instance in between inputs and filters/outputs. It only works for inputs that have request/response models (so HTTP good, TCP bad), but it’s a pretty cool feature all round.
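
Enabling it amounts to a few settings in logstash.yml, something like the following (the file location, queue path and size are placeholders, and as you’re about to read, we haven’t actually turned it on yet):

# Hypothetical logstash.yml additions to switch from the in-memory queue to the disk backed one
cat <<'EOF' | sudo tee -a /etc/logstash/logstash.yml
queue.type: persisted
path.queue: /usr/share/logstash/data/queue
queue.max_bytes: 2gb
EOF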

I have plans to eventually use it to completely replace our Cache and Indexer layers in the log aggregation stack (a win for general complexity and number of moving parts), but when I upgraded to 5.4.0 I left it disabled because we already have Elasticache:Redis for that.

That didn’t stop it from causing problems though.

I Guess They Just Like Taking Orderly Turns

Upgrading the version of Logstash we use is relatively straightforward. We bake a known version of Logstash into a new AMI via Packer, update an environment parameter for the stack, kick off a build and let TeamCity/Octopus take care of the rest.

To actually bake the AMI, we just update the Packer template with the new information (in this case, the Logstash version that should be installed via yum) and then run it through TeamCity.
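
The interesting part of the template is really just the provisioning step that pins the version, which is not much more than the following (the repository setup is elided, and the exact yum version spec is from memory, so double check it):

# Inside the Packer shell provisioner: install a specific Logstash version from the Elastic yum repo
sudo yum install -y logstash-5.4.0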

On the other side, in the environment itself, when we update the AMI in use, CloudFormation will slowly replace all of the EC2 instances inside the Auto Scaling Group with new ones, waiting for each one to come online before continuing. We use Octopus Deploy Triggers to automate the deployment of software to those machines when they come online.

This is where things started to fall down with Logstash 5.4.0.

The Octopus deployment of the Logstash Configuration was failing. Specifically, Logstash would simply never come online with the AMI that used 5.4.0 and the configuration that we were using successfully for 5.2.0.

The Logstash log files were full of errors like this:

[2017-05-24T04:42:02,021][FATAL][logstash.runner          ] An unexpected error occurred! 
{
    :error => #<ArgumentError: Path "/usr/share/logstash/data/queue" must be a writable directory. It is not writable.>,
    :backtrace => [
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:433:in `validate'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:216:in `validate_value'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:132:in `validate_all'", 
        "org/jruby/RubyHash.java:1342:in `each'", 
        "/usr/share/logstash/logstash-core/lib/logstash/settings.rb:131:in `validate_all'", 
        "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:217:in `execute'", 
        "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:67:in `run'", 
        "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:185:in `run'", 
        "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/clamp-0.6.5/lib/clamp/command.rb:132:in `run'", 
        "/usr/share/logstash/lib/bootstrap/environment.rb:71:in `(root)'"
    ]
}

A bit weird considering I hadn’t changed anything in our config, but it makes sense that maybe Logstash itself can’t write to the directory it was installed into by yum, and the new version now needs to do just that.

Moving the data directory was simple enough. Add path.data to the logstash.yml inside our configuration package, making sure that the data directory exists and that the Logstash user/group has ownership and full control.

I still got the same error though, except the directory was different (it was the one I specified).

I Mean Who Doesn’t

I fought with this problem for a few hours to be honest. Trying various permutations of permissions, ACLs, ownership, groups, users, etc.

In the end, I just created the queue directory ahead of time (as part of the config deployment) and set the ownership of the data directory recursively to the Logstash user/group.
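
In script form, the fix was something along these lines (the data directory below is a placeholder; ours comes from the configuration package):

# Create the queue directory that Logstash 5.4.0 expects and make sure the Logstash user owns it all
DATA_DIR=/usr/share/logstash/data
sudo mkdir -p "$DATA_DIR/queue"
sudo chown -R logstash:logstash "$DATA_DIR"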

This was enough to make Logstash stop complaining about the feature I didn’t want to use and get on with its life.

I still don’t understand what happened though, so I logged an issue in the Logstash repo in Github. Maybe someone will explain it one day. Weirdly it looks like Logstash created a directory that it was not allowed to access (the /queue directory under the specified data directory), which leads me towards something being wrong with my configuration (like ownership or something like that), but I couldn’t find anything that would point to that.

Conclusion

This one really came out of left field. I didn’t expect the upgrade to 5.4.0 to be completely painless (rarely is a software upgrade painless), but I didn’t expect to struggle with an issue caused by a feature that I didn’t even want to use.

What’s even weirder about the whole thing is that Persistent Queues were available in the version of Logstash that I was upgrading from (5.2.0), at least in beta form, and I had no issues whatsoever.

Don’t get me wrong, Logstash is an amazing product, but it can also be incredibly frustrating.


The log stack is mostly under control now:

  • We have appropriate build and deployment pipelines in place
  • Everything is hosted in our AWS accounts
  • We can freely deploy both environments and configuration for each layer in the stack (Broker, Cache, Indexer, Storage)
  • We’ve got Cerebro in place to help us visualize Elasticsearch

We’re in a pretty good place to be honest.

Still, there are a few areas I would like to improve before I run away screaming, especially while we still have the old stack up and running in parallel (while we smooth out any kinks in the new stack).

One area in particular that needs some love is the way in which we store data in Elasticsearch. Up until now, we’ve mostly just left Elasticsearch to its own devices. Sure, it had a little help from Logstash, but it was mostly up to Elasticsearch what it did with incoming data. We did have a simple index template in place, but it was pretty vanilla. Basically we just left dynamic field mapping on and didn’t really think about it a whole lot.

The problem with this approach is that Elasticsearch is technically doing a whole bunch of work (and storing a whole bunch of data) for no real benefit to us, as we have thousands of fields, but only really search/aggregate on a few hundred at best. Our daily logstash indexes contain quite a lot of documents (around 40 million) and generally tally up at around 35GB each. Depending on how much of that 35GB of storage belongs to indexed fields that have no value, there might be considerable savings in reining the whole process in.

That doesn’t even take into account the cognitive load in having to deal with a large number of fields whenever you’re doing analysis, or the problems we’ve had with Kibana and refreshing our index mappings when there are many fields.

It was time to shackle the beast.

Anatomy Of A Template

Index templates are relatively simple constructs, assuming you understand some of the basic concepts behind indexes, types and field mappings in Elasticsearch. You could almost consider them to be schemas, but that is not strictly accurate, because you can change a schema, but you can’t really change an index once it’s been created. They really are templates in that sense, because they only apply when a new index is created.

Basically, a template is a combination of index settings (like replicas, shards, field limits, etc), types (which are collections of fields), and field mappings (i.e. Event.Name should be treated as text, and analysed up to the first 256 characters). They are applied to new indexes based on a pattern that matches against the new index’s name. For example, if I had a template that I wanted to apply to all logstash indexes (which are named logstash-YY.MM.DD), I would give it a pattern of logstash-*.

For a more concrete example, here is an excerpt from our current logstash index template:

{
  "order": 0,
  "template": "logstash-*",
  "settings": {
    "index": {
      "refresh_interval": "5s",
      "number_of_shards": "3",
      "number_of_replicas": "2",
      "mapper.dynamic": false
    }
  },
  "mappings": {
    "logs": {
      "dynamic" : false,
      "_all": {
        "omit_norms": true,
        "enabled": false
      },
      "properties": {
        "@timestamp": {
          "type": "date",
          "doc_values": true,
          "index": true
        },
        "@version": {
          "type": "keyword",
          "index": false,
          "doc_values": true
        },
        "message" : {
          "type" : "text",
          "index": false,
          "fielddata": false
        },
        "Severity" : {
          "type": "keyword",
          "index": true,
          "doc_values": true
        },
        "TimeTaken" : {
          "type": "integer",
          "index": true,
          "doc_values": true
        }
      }
    }
  },
  "aliases": {}
}

Templates can be managed directly from the Elasticsearch HTTP API via the /_template/{template-name} endpoint.
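
Assuming you can hit the cluster on localhost:9200 (like we can from the nodes themselves), inspecting, replacing or deleting a template is just a curl call away (the template file name below is illustrative):

# View the current logstash template
curl -XGET "http://localhost:9200/_template/logstash?pretty"

# Replace it with the version held in source control
curl -XPUT "http://localhost:9200/_template/logstash" --data "@logstash-template.json"

# Remove it entirely (existing indexes are unaffected; templates only apply at index creation)
curl -XDELETE "http://localhost:9200/_template/logstash"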

By default, the mappings.{type}.dynamic field is set to true when creating an index. This means that based on the raw data encountered, Elasticsearch will attempt to infer an appropriate type for the field (i.e. if it sees numbers, it’s probably going to make it a long or something). To be honest, Elasticsearch is pretty good at this, assuming your raw data doesn’t contain fields that sometimes look like numbers and sometimes look like text.

Unfortunately, ours does, so we can sometimes get stuck in a bad situation where Elasticsearch will infer a field as a number, and all documents with text in that field will fail. This is a mapping conflict, and is a massive pain, because you can’t change a field mapping. You have to delete the index, or make a new index and migrate the data across. In the case of logstash, because you have time based indexes, you can also just wait it out.
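
The field mapping API is handy for spotting when this has happened, because you can see how the same field ended up being mapped in each daily index (the field name here is just an example):

# Show how TimeTaken was mapped in every logstash index; a mix of types means a conflict
curl -XGET "http://localhost:9200/logstash-*/_mapping/field/TimeTaken?pretty"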

This sort of thing can be solved by leaving dynamic mapping on, but specifying the type of the troublesome fields in the template.

The other downside of dynamic mapping is the indexing of fields that you really don’t need to be indexed, which takes up space for no benefit. This is actually pretty tricky though, because if you don’t index a field in some way, it’s still stored, but you can’t search or aggregate on it without creating a new index and adding an appropriate field mapping. I don’t know about you, but I don’t always know exactly what I want to search/aggregate on before the situation arises, so it’s a dangerous optimization to make.

This is especially true for log events, which are basically invisible up to the point where you have to debug some arcane thing that happened to some poor bastard.

I’m currently experimenting with leaving dynamic mapping off until I get a handle on some of the data coming into our stack, but I imagine that it will probably be turned back on before I’m done, sitting alongside a bunch of pre-defined field mappings for consistency.

Template Unleashed

With a template defined (like the example above), all that was left was to create a deployment pipeline.

There were two paths I could have gone down.

The first was to have a package specifically for the index template, with its own Octopus project and a small amount of logic that used the Elasticsearch HTTP API to push the template into the stack.

The second was to incorporate templates into the Logging.ELK.Elasticsearch.Config package/deployment, which was the package that dealt with the Elasticsearch configuration (i.e. master vs data nodes, EC2 discovery, ES logging, etc).

In the end I went with the second option, because I could not find an appropriate trigger to bind the first deployment to. Do you deploy when a node comes online? The URL might not be valid then, so you’d probably have to use the raw IP. That would mean exposing those instances outside of their ELB, which wasn’t something I wanted to do.

It was just easier to add some logic to the existing configuration deployment to deploy templates after the basic configuration completes.

# Wait for a few moments for Elasticsearch to become available
attempts=0
maxAttempts=20
waitSeconds=15
until curl --output /dev/null --silent --head --fail http://localhost:9200; do
    if [[ $attempts -ge $maxAttempts ]]; then 
        echo "Elasticsearch was not available after waiting ($attempts) times, sleeping for ($waitSeconds) seconds between each connection attempt"
        exit 1 
    fi
    attempts=$(($attempts + 1))
    echo "Waiting ($waitSeconds) to see if Elasticsearch will become available"
    sleep $waitSeconds
done

# Push the index template
template_upload_status=$(curl -XPUT --data "@/tmp/elk-elasticsearch/templates/logstash.json" -o /tmp/elk-elasticsearch/logstash-template-upload-result.json -w '%{http_code}' http://localhost:9200/_template/logstash;)
if [[ $template_upload_status -ne 200 ]]; then
    echo "Template upload failed"
    cat /tmp/elk-elasticsearch/logstash-template-upload-result.json
    exit 1
fi

A little bit more complicated than I would have liked, but it needs to wait for Elasticsearch to come online (and ideally for the cluster to go green) before it can do anything, and the previous steps in this script actually restart the node (to apply configuration changes), so it’s necessary.
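
The loop above only checks that the node is responding; if you also need to block until the cluster actually reports green, the cluster health API can do the waiting for you:

# Block until the cluster reports green, or give up after five minutes
curl -XGET "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=300s"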

Conclusion

I’m hopeful that a little bit of template/field management will give us some big benefits in terms of the amount of fields we need to deal with and how much storage our indexes consume. Sure, we could always manage the template manually (usually via Kopf/Cerebro), but it feels a lot better to have it controlled and tracked in source control and embedded into our pipeline.

As I mentioned earlier, I still haven’t quite decided how to handle things in the long run, i.e. the decision between all manual mappings or some manual and the rest dynamic. It gets a bit complicated with index templates only applying once for each index (at creation), so if you want to put some data in you need to either anticipate what it looks like ahead of time, or you need to wait until the next index rolls around. I’ve got our logstash indexes running hourly (instead of daily), which helps, but I think it causes performance problems of its own, so it’s a challenging situation.

The other thing to consider is that managing thousands of fields in that template file sounds like it’s going to be a maintenance nightmare. Even a few hundred would be pretty brutal, so I’m wary of trying to control absolutely all of the things.

Taking a step back, it might actually be more useful to just remove those fields from the log events inside the Indexer layer, so Elasticsearch never even knows they exist.

Of course, you have to know what they are before you can apply this sort of approach anyway, so we’re back to where we started.


This post is not as technical as some of my others. I really just want to bring attention to a tool for Elasticsearch that I honestly don’t think I could do without.

Cerebro.

From my experience, one of the hardest things to wrap my head around when working with Elasticsearch was visualizing how everything fit together. My background is primarily C# and .NET in a very Microsoft world, so I’m used to things like SQL Server, which comes with an excellent exploration and interrogation tool in the form of SQL Server Management Studio. When it comes to Elasticsearch though, there seems to be no equivalent, so I felt particularly blind.

Since starting to use Elasticsearch, I’ve become more and more fond of using the command line, so I’ve started to appreciate its amazing HTTP API more and more, but that initial learning curve was pretty vicious.

Anyway, to bring it back around, my first port of call when I started using Elasticsearch was to find a tool conceptually similar to SQL Server Management Studio. Something I could use to both visualize the storage system (however it worked) and possibly even query it as necessary.

I found Kopf.

Kopf did exactly what I wanted it to do. It provided a nice interface on top of Elasticsearch that helped me visualize how everything was structured and what sort of things I could do. To this day, if I attempt to visualize an Elasticsearch cluster in my head, the pictures that come to mind are of the Kopf interface. I can thank it for my understanding of the cluster, the nodes that make it up and the indexes stored therein, along with the excellent Elasticsearch documentation of course.

Later on I learnt that Kopf didn’t have to be used from the creator’s demonstration website (which is how I had been using it, connecting from my local machine to our ES ELK cluster), but could in fact be installed as a plugin inside Elasticsearch itself, which was even better, because you could access it from {es-url}/_plugin/kopf, which was a hell of a lot easier.

Unfortunately, everything changed when the fire nation attacked…

No wait, that’s not right.

Everything changed when Elasticsearch 5 was released.

I’m The Juggernaut

Elasticsearch 5 deprecated site plugins. No more site plugins meant no more Kopf, or at least no more Kopf hosted within Elasticsearch. This made me sad, obviously, but I could still use the standalone site, so it wasn’t the end of the world.

My memory of the next bit is a little bit fuzzy, but I think even the standalone site stopped working properly when connecting to Elasticsearch 5. The creator of Kopf was no longer maintaining the project either, so it was unlikely that the problems would be solved.

I was basically blind.

Enter Cerebro.

No bones about it, Cerebro IS Kopf. It’s made by the same guy and is still being actively developed. It’s pretty much a standalone Kopf (i.e. built in web server), and any differences between the two (other than some cosmetic stuff and the capability to easily save multiple Elasticsearch addresses) are lost on me.

As of this post, it’s up to 0.6.5, but as far as I can tell, it’s fully functional.

For my usage, I’ve incorporated Cerebro into our ELK stack, with a simple setup (ELB + single instance ASG), pre-configured with the appropriate Elasticsearch address in each environment that we spin up. As is the normal pattern, I’ve set it up on an AMI via Packer, and I deploy its configuration via Octopus deploy, but there is nothing particularly complicated there.
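
There’s not a lot to the setup itself either; once the package is on the machine, it’s essentially just a matter of starting the bundled launcher with the address/port you want it to listen on. The flags below are from the Cerebro documentation as I remember it, so double check them before relying on this:

# Start Cerebro listening on all interfaces on port 9000 (values are illustrative)
bin/cerebro -Dhttp.port=9000 -Dhttp.address=0.0.0.0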

Kitty, It’s Just A Phase

This post is pretty boring so far, so let’s talk about Cerebro a little with the help of a screenshot.

This is the main screen of Cerebro, and it contains a wealth of information to help you get a handle on your Elasticsearch cluster.

It shows an overview of the cluster status, data nodes, indexes and their shards and replicas.

  • The cluster status is shown at the top of the screen, mostly via colour. Green good, yellow troublesome, red bad. Helpfully enough, the icon in the browser also changes colour according to the cluster status.
  • Data nodes are shown on the left, and display information like memory, cpu and disk, as well as IP address and name.
  • Indexes pretty much fill the rest of the screen, displaying important statistics like the number of documents and size, while allowing you to access things like mappings and index operations (like delete)
  • The intersection of index and data node gives information about shard/replica allocation. In the example above, we have 3 shards, 2 replicas and 3 nodes, so each node has a full copy of the data. Solid squares indicate the primary shard.
  • If you have unassigned or relocating shards, this information appears directly above the nodes, and shards currently being moved are shown in the same place as normal shards, except blue.

Honestly, I don’t really use the other screens in Cerebro very much, or at least nowhere near as much as I use the overview screen. The dedicated nodes screen can be useful to view your master nodes (which aren’t shown on the overview), and to get a more performance focused display. I’ve also used the index templates screen for managing/viewing our logstash index template, but that’s mostly done through an Octopus deployment now.

There are others (including an ad-hoc query screen), but again, I haven’t really dug into them in depth. At least not enough to talk about them anyway.

That first screen though, the overview, is worth its weight in gold as far as I’m concerned.

Conclusion

I doubt I would understand Elasticsearch anywhere near as much as I do without Kopf/Cerebro. Realistically, I don’t really understand it much at all, but that little understanding I do have would be non-existent without these awesome tools.

It’s not just a one-horse town though. Elastic.co provides some equivalent tools as well (like Monitoring (formerly Marvel)) which offer similar capabilities, but they are mostly paid services as far as I can tell, so I’ve been hesitant to explore them in more depth.

I’m already spending way too much on the hardware for our log stack, so adding software costs on top of that is a challenging battle that I’m not quite ready to fight.

It doesn’t help that the last time I tried to price it, their answer for “How much for the things?” was basically “How much you got?”.