The Bigger Picture
We have a lot of logs.
It's mostly my fault, to be honest. It was only a few years ago that I learned about log aggregation, and once you have an ELK stack, everything looks like a structured log event formatted as JSON.
We aggregate a wealth of information into our log stack these days, including, but not limited to:
- Business intelligence events from our legacy software (e.g. customer did X)
- Application logs from a variety of places
- IIS logs from our APIs
- ELB logs from AWS
- System statistics for infrastructure (e.g. CPU, memory, etc.)
Now, if I had my way, we would keep everything forever. My dream would be to be able to ask the question “What did our aggregate API traffic look like over the last 12 months?”
Unfortunately, I can’t keep the raw data forever.
But I might be able to keep a part of it.
Storage Space
Storage space is pretty cheap these days, especially in AWS. In the Asia Pacific region, we pay US$0.12 per GB per month for a stock standard, non-provisioned IOPS EBS volume.
Our ELK stack accumulates gigabytes of data every day though, and trying to store everything for all eternity adds up pretty quickly. Elasticsearch complicates this further, because it likes to keep replicas of things just in case a node explodes, so you actually need more storage space than you think in order to account for redundancy.
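To put some hypothetical numbers on it: at, say, 5 GB of new log data a day, a 40 day window with a single replica comes to 5 × 40 × 2 = 400 GB, or around US$48 a month at the rate above. Keeping two years of raw data at the same ingest rate would be more like 7,300 GB and US$876 a month, and growing every day.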
In the end we somewhat arbitrarily decided to keep a bit more than a month's worth of data (40 days), which gives us the capability to reliably support our products, and a decent window for viewing business intelligence and usage. We have a scheduled task in TeamCity that leverages Curator to remove data as appropriate.
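For the curious, the cleanup itself doesn't have to be anything fancy. This is a minimal sketch of what such a Curator action file might look like, assuming the stock daily logstash-%Y.%m.%d index naming; it's an illustration, not our actual production configuration.

# Illustrative Curator action file: delete any logstash-* index older than 40 days.
actions:
  1:
    action: delete_indices
    description: Delete logstash indices older than 40 days
    options:
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 40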
Now, a little more than a month is a pretty long time.
But I want more.
In For The Long Haul
In any data set, you are likely to find patterns that emerge over a much longer period than a month.
A good example would be something like daily active users. This is the sort of trend that tends to reveal itself over months or years, especially for a relatively stable product. Unless you've done something extreme of course, in which case you might get a meaningful trend over a much shorter period.
Ignoring the extremes, we have all the raw data required to calculate the metric, we're just not keeping it. If we had some way of summarising it into a smaller data set, though, we could keep it for a much longer period. Maybe some sort of mechanism to do some calculations and store the resulting derivation somewhere safe?
The simplest approach is some sort of script or application that runs on a schedule and uses the existing data in the ELK stack to create and store new documents, preferably back into the ELK stack. If we want to ensure those new documents don't get deleted by Curator, all we have to do is put them into different indexes (as Curator only cleans up indexes prefixed with logstash).
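To make the storage half concrete, here is a hedged sketch of what indexing a summary document into a Curator-proof index might look like using NEST; the endpoint, index name and document shape are all illustrative assumptions, not our real ones.

using System;
using Nest;

public static class MetricWriter
{
    public static void Store(DateTimeOffset day, int distinctUsers)
    {
        // Illustrative endpoint; in reality this would come from configuration.
        var client = new ElasticClient(new ConnectionSettings(new Uri("http://our-elk-stack:9200")));

        // A made-up document shape for the summarised metric.
        var metric = new { date = day, count = distinctUsers };

        // The only thing that actually matters here is that the index is not
        // prefixed with "logstash", so the Curator cleanup leaves it alone.
        client.Index(metric, i => i
            .Index("metrics-daily-distinct-users")
            .Type("daily_distinct_users"));
    }
}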
Seems simple enough.
Generator X
For once it actually was simple enough.
At some point in the past we actually implemented a variation of this idea, where we calculated some metrics from a database (yup, that database) and stored them in an Elasticsearch instance for later use.
Architecturally, the metric generator was a small C# command line application scheduled for daily execution through TeamCity, so nothing particularly complicated.
We ended up decommissioning those particular metrics (because it turned out they were useless) and disabling the scheduled task, but the framework already existed to do at least half of what I wanted: the part relating to generating documents and storing them in Elasticsearch. All I had to do was extend it to query a different data source (Elasticsearch itself) and generate a different set of metrics documents for indexing.
So that’s exactly what I did.
The only complicated part was figuring out how to query Elasticsearch from .NET which, as you can see from the following metrics generation class, can be quite a journey.
public class ElasticsearchDailyDistinctUsersDbQuery : IDailyDistinctUsersDbQuery
{
    public ElasticsearchDailyDistinctUsersDbQuery
    (
        SourceElasticsearchUrlSetting sourceElasticsearch,
        IElasticClientFactory factory,
        IClock clock,
        IMetricEngineVersionResolver version
    )
    {
        _sourceElasticsearch = sourceElasticsearch;
        _clock = clock;
        _version = version;
        _client = factory.Create(sourceElasticsearch.Value);
    }

    private const string _indexPattern = "logstash-*";

    private readonly SourceElasticsearchUrlSetting _sourceElasticsearch;
    private readonly IClock _clock;
    private readonly IMetricEngineVersionResolver _version;
    private readonly IElasticClient _client;

    public IEnumerable<DailyDistinctUsersMetric> Run(DateTimeOffset parameters)
    {
        // Calculate the [start, end) range covering the day being summarised.
        var start = parameters - parameters.TimeOfDay;
        var end = start.AddDays(1);

        var result = _client.Search<object>
        (
            s => s
                .Index(_indexPattern)
                .AllTypes()
                .Query
                (
                    q => q
                        .Bool
                        (
                            b => b
                                // Only consider login events from the application in question...
                                .Must(m => m.QueryString(a => a.Query("Application:GenericSoftwareName AND Event.Name:SomeLoginEvent").AnalyzeWildcard(true)))
                                // ...that occurred within the day being summarised.
                                .Must(m => m
                                    .DateRange
                                    (
                                        d => d
                                            .Field("@timestamp")
                                            .GreaterThanOrEquals(DateMath.Anchored(start.ToUniversalTime().DateTime))
                                            .LessThan(DateMath.Anchored(end.ToUniversalTime().DateTime))
                                    )
                                )
                        )
                )
                // Count distinct users via a cardinality aggregation.
                .Aggregations(a => a
                    .Cardinality
                    (
                        "DistinctUsers",
                        c => c.Field("SomeUniqueUserIdentifier")
                    )
                )
        );

        var agg = result.Aggs.Cardinality("DistinctUsers");

        // Wrap the result up in a metric document, ready for indexing.
        return new[]
        {
            new DailyDistinctUsersMetric(start)
            {
                count = Convert.ToInt32(agg.Value),
                generated_at = _clock.UtcNow,
                source = $"{_sourceElasticsearch.Value}/{_indexPattern}",
                generator_version = _version.ResolveVersion().ToString()
            }
        };
    }
}
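One caveat worth knowing about this approach: the cardinality aggregation is an approximation (Elasticsearch implements it with the HyperLogLog++ algorithm), so the distinct user counts can drift slightly from the true value at high cardinalities. For the purpose of spotting long term trends, that's a perfectly acceptable trade-off.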
Conclusion
The concept of calculating some aggregated values from our logging data and keeping them separately has been in my own personal backlog for a while now, so it was nice to have a chance to dig into it in earnest.
It was even nicer to be able to build on top of an existing component, as it would have taken me far longer if I had to put everything together from scratch. I think it's a testament to the quality of our development process that even this relatively unimportant component was originally built following solid software engineering practices, and has plenty of automated tests, dependency injection and so on. It made refactoring it and turning it towards a slightly different purpose much easier.
Now all I have to do is wait months while the longer term data slowly accumulates.