
As should be somewhat obvious by now, I’m a big fan of logging. I try to justify that interest in a logical fashion, stating the intelligence benefits, but really I just personally find it interesting, and as a result it tends to be something that I focus on when developing software.

Log aggregation as a concept is still pretty new to me though. It wasn’t until a few years ago that I even became aware that it existed, even though something always lingered in the corners of my mind about the usefulness of logs that were trapped on hundreds of different machines and environments. The idea to take the logs from a whole bunch of different locations (servers, client machines, everything), destructure them and put them all into a centralised, searchable location kind of blew my mind to be honest, and it’s been something of a passion ever since.

In an interesting twist of fate, my first exposure to log aggregation involved RavenDB. One of the previous developers had set up a RavenDB database to act as the destination for log events coming out of a piece of desktop software, and when I realized what it was, I was intent on using it to understand how the software was being used.

The idea was amazing.

The implementation?

Not so much.

A Missed Opportunity

As I mentioned above, at some point in the past the development team identified a need to raise and aggregate structured log events for a variety of reasons, like business intelligence, error reporting, general logging and so on. They picked RavenDB as the storage location, probably because it was so easy to use from a development point of view, and trivial to integrate into a C# application using the RavenDB client to write arbitrary plain old C# objects with no real messing around.

A simple custom logger interface was created, which would expose the capability to write log events that derived from some common base, and the implementation of that interface used the RavenDB client. From a storage point of view, staging and production instances of RavenDB were provisioned inside AWS, and exposed to the greater internet via an Elastic Load Balancer. The very next release of the software was writing all sorts of interesting events into RavenDB with very little difficulty, like notifications that the application was being started (including things like who started it and what version it was on) and errors that were occurring during normal usage.
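To give a sense of the shape of it, the sketch below is roughly what that kind of abstraction looks like. The names (LogEvent, ILogEventWriter and so on) are purely illustrative, and the RavenDB usage is just the standard open a session, store, save changes pattern, rather than the actual code from the application.

using System;
using Raven.Client;

// Common base that all log events derive from (illustrative).
public abstract class LogEvent
{
    public DateTime Timestamp { get; set; } = DateTime.UtcNow;
    public string Application { get; set; }
    public string Version { get; set; }
}

public class ApplicationStartedEvent : LogEvent
{
    public string User { get; set; }
}

// The simple custom logger interface described above (illustrative name).
public interface ILogEventWriter
{
    void Write(LogEvent logEvent);
}

// The RavenDB backed implementation; plain old C# objects go straight in as documents.
public class RavenDbLogEventWriter : ILogEventWriter
{
    private readonly IDocumentStore _store;

    public RavenDbLogEventWriter(IDocumentStore store)
    {
        _store = store;
    }

    public void Write(LogEvent logEvent)
    {
        using (var session = _store.OpenSession())
        {
            session.Store(logEvent);
            session.SaveChanges();
        }
    }
}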

It was all very useful.

That is, until you tried to actually search the resulting log events and glean some information from them.

This was back in RavenDB 2.X, so the default UI was the Silverlight implementation of the RavenDB Studio, which wasn’t exactly the most amazing piece of software.

Even ignoring the usability problems, RavenDB 2.X did not have good support for ad-hoc queries, and no one had thought ahead and set up fields for indexing. This meant that the wealth of information available was damn near unusable.

How Could You Not Call It Muninn?

Shortly after I started and found out about all of this awesome information going to waste, I put together an ELK stack and hacked together a bridge between the two.

Whenever a log event was written into RavenDB, a small memory resident application would automatically write it out to a JSON file, and Logstash would pick it up and put it into the ELK stack.
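The bridge was never anything sophisticated. Conceptually, the output side looked something like the sketch below (illustrative only, and glossing over how new events were actually detected in RavenDB); each event just became a line of JSON in a file that the local Logstash instance was configured to watch.

using System;
using System.IO;
using Newtonsoft.Json;

public class LogstashFileBridge
{
    private readonly string _outputDirectory;

    public LogstashFileBridge(string outputDirectory)
    {
        _outputDirectory = outputDirectory;
    }

    // Called whenever the memory resident application sees a new log event in RavenDB.
    public void OnLogEventReplicated(object logEvent)
    {
        // One JSON object per line, which a Logstash file input (with a json codec)
        // can pick up and forward into Elasticsearch.
        var line = JsonConvert.SerializeObject(logEvent) + Environment.NewLine;
        var path = Path.Combine(_outputDirectory, $"{DateTime.UtcNow:yyyy-MM-dd}.json");
        File.AppendAllText(path, line);
    }
}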

With the new ELK stack in place, whenever we wrote new code we would ensure that it could aggregate its logs as necessary, augmenting the system with new capabilities each time. In the end it was mostly just log files being picked up and processed by local Logstash instances, but we also processed ELB logs in Lambda by posting directly to Logstash over HTTP, and we exposed a public facing, authenticated Logging API for software to aggregate log events from outside of our little walled garden in AWS.

The rest is history, and we’ve been using ELK ever since.

But we never changed the initial piece of desktop software, and it’s been happily writing into RavenDB this whole time, with the events being replicated into the ELK stack via that “temporary” hack.

It’s The Perfect Name

Don’t get me wrong, the system worked, and log events from the desktop software were available in our ELK stack; it just felt so inefficient to have this whole other system in place providing no real additional value.

Not only that, while it did technically work, there were definitely times when that particular log event channel failed spectacularly, from full disks (no-one had ever implemented maintenance or cleanup of any sort on the RavenDB database) through to performance problems similar to the ones that we saw in one of our APIs.

The last straw was when I did an audit on our AWS costs and discovered that we were actually paying a non-trivial amount of cash for the RavenDB logging infrastructure and that just seemed wrong (plus, if I recovered that cash I could put it into other things, like the ELK stack).

All told, there were more reasons to get rid of it than there were to keep it.

But how?

Well, as I mentioned earlier, when the development team initially implemented the log aggregation, they put a simple logging interface in place, with the RavenDB specific implementation behind that interface.

Switching the logging to use our Serilog based HTTP sink (which connected to our authenticated Logging API) was as simple as providing a new implementation of the old logging interface, so even though its usage infested the entire system, we really only had to change it once at the top level.
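Conceptually, the replacement looked something like the sketch below, reusing the illustrative ILogEventWriter/LogEvent names from earlier. The Serilog logger itself (and our HTTP sink) gets configured elsewhere and is simply handed in.

using Serilog;

public class SerilogLogEventWriter : ILogEventWriter
{
    private readonly ILogger _logger;

    public SerilogLogEventWriter(ILogger logger)
    {
        _logger = logger;
    }

    public void Write(LogEvent logEvent)
    {
        // Destructure the event so its properties become first class fields
        // in the resulting structured log event.
        _logger
            .ForContext("EventType", logEvent.GetType().Name)
            .Information("{@LogEvent}", logEvent);
    }
}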

Of course, it wasn’t quite that simple (references to RavenDB were strewn through at least 3 different libraries that dealt with logging, so they had to be removed), but all told it was a hell of a lot easier than it would have been without that abstraction.

Obviously we want to switch to using Serilog directly (rather than a custom logging abstraction that provides no real value), but that can happen slowly, bit by bit, and the flow of log events will be maintained the entire time.

Summary

Unfortunately, we can’t actually decommission the RavenDB logging infrastructure just yet, as we have to wait until everyone in the wild upgrades to the latest version or we risk missing out on important intelligence from the slackers. I’m keeping a close eye on the total number of logs received through the old system, and the moment it drops below a certain threshold, I’m going to nuke the entire thing.

To be honest, the removal of the RavenDB logging channel that I described above makes me happier than it really has cause to. It’s been bugging me for a while (mostly from an engineering point of view), so it felt really good to get rid of it in the latest release of our software. That sort of maintenance and consolidation doesn’t always have a direct and measurable impact on the bottom line, but it’s still important for the system as a whole and important for engineering morale.

It’s like ripping out a mostly dead plant from a garden.

Better to make room for something that might actually be useful.


And thus we come to the end of this particular saga, of rewriting our AWS Lambda ELB Logs Processor in C# using .NET Core.

To summarise the posts in this series:

  1. I went through some of the basics around how to use C# and .NET Core to implement an AWS Lambda function
  2. I explained how we constructed the automated build, test and deployment pipeline for the .NET Core ELB Logs Processor
  3. I outlined how we tested the new .NET Core Lambda function, with a specific focus around how we handled the AWS Credentials

In conjunction with those posts, I’ve also uploaded a copy of the entire source code (build and deployment scripts included) to Solavirum.Logging.ELB.Processor.Lambda, mostly as a master reference point. It’s a copy of our internal repository, quickly sanitized to remove anything specific to our organization, so I can’t guarantee that it works perfectly, but there should be enough stuff there to demonstrate the entire process working end to end.

As with all the rest of our stuff, the build script is meant to run from inside TeamCity, and the deployment script is meant to run from inside Octopus Deploy, so if you don’t have either of those things, you will have to use your imagination.

Of course, everything I’ve gone through is purely academic unless it’s running successfully in Production…

Seize The Means Of Production

If you’ve read any of my previous posts about the Javascript/Node.js ELB Logs Processor, you’re probably aware of the fact that it tended to look like it was working fine right up to the point where I published it into production and the sheer weight of traffic caused it to have a meltdown.

Unsurprisingly, the difference between a sample file with a few hundred lines in it and a real file with hundreds of thousands of lines is significant, and that’s not even really taking into account the difference between my local development environment and AWS Lambda.

In an unprecedented result though, the .NET Core implementation worked just fine when I shipped it to our production environment.

Unfortunately it was around twenty times slower.

The old Javascript/Node.js implementation could process a file in around 9 seconds, which is pretty decent.

The .NET Core implementation processes a file in around 200 seconds, which is a hell of a lot slower.

Now, I can explain some of that away with the fact that the .NET Core implementation uses HTTP instead of TCP to communicate with Logstash (which was the primary reason I rewrote it in the first place), but it was still pretty jarring to see.
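For context, the sketch below is roughly what the HTTP output boils down to (illustrative, not the actual implementation); each processed event gets serialised and posted to a Logstash HTTP input, which involves a lot more per-event overhead than pushing lines down a long lived TCP connection.

using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

public class LogstashHttpOutput
{
    private static readonly HttpClient Client = new HttpClient();
    private readonly string _logstashUrl;

    public LogstashHttpOutput(string logstashUrl)
    {
        _logstashUrl = logstashUrl;
    }

    public async Task SendAsync(IEnumerable<object> events)
    {
        foreach (var @event in events)
        {
            // One request per event is simple, but it is also a big part of why
            // HTTP ends up slower than a long lived TCP connection.
            var content = new StringContent(JsonConvert.SerializeObject(@event), Encoding.UTF8, "application/json");
            var response = await Client.PostAsync(_logstashUrl, content);
            response.EnsureSuccessStatusCode();
        }
    }
}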

Interestingly enough, because executing Lambda functions is so cheap, the total cost increase was something like $US8 a month, even though it was twenty times slower, so who really cares.

In the end, while the increase in processing time was undesirable, I’d rather have slower C# and HTTP over faster Javascript and TCP.

Summary

Overall, I’m pretty happy with how this whole .NET Core rewrite went, for a number of reasons:

  • I got to explore and experiment with the newest .NET and C# framework, while delivering actual business value, which was great.
  • In the process of developing this particular solution, I managed to put together some more reusable scripts and components for building other .NET Core components
  • I finally switched the ELB Logs Processor to HTTP, which means that the overall feed of events into our ELK stack is more reliable (TCP was prone to failure)
  • I managed to get a handle on fully automated testing of components that require AWS Credentials, without accidentally checking the credentials into source control

In the end though, I’m mostly just happy because it’s not Javascript.


Last week I wrote a high level overview of the build, package, test and deploy process that we created in order to get .NET Core C# into AWS Lambda.

As part of that post, I briefly touched on the validation part of the process, which basically amounted to “we wrote some tests in C# using XUnit, then used the dotnet command line interface to run them as part of the build”.

Considering that the testing took up a decent chunk of the total development time, that amount of coverage isn’t really appropriate, so this post will elaborate further, and explain some of the challenges we encountered and how we overcame them.

I’m A Little Testy

At a high level, testing a .NET Core AWS Lambda function written in C# is not really any more difficult than testing other C# code. Like the stuff I’ve been writing for years.

Unit test classes in isolation, integration test classes together, have a few end to end tests (if possible) and you’re probably in a good place.

When you write a lambda function, all you really need is a single class with a method on it (Handler::Handler or equivalent). You could, if you wanted, put literally all of your logic into this class/method combination and have a fully functional lambda function.

That would probably make it harder to test though.

A better idea is to break it down into its constituent parts, ideally taking the form of small classes with a single responsibility, working together to accomplish a greater goal.

For the ELB Logs Processor, there are three clear responsibilities:

  1. Read a file from S3, and expose the information therein as lines of text
  2. Take a line of text and transform it into some sort of structure, potentially breaking it down and adding meta information 
  3. Take a structure and output it to a Logstash endpoint

Technically there is also a fourth responsibility which is to tie the three things above together (i.e. orchestration), but we’ll get to that in a minute.

Anyway, if the responsibilities are clearly separated, we’re in unit testing territory, which is familiar enough that I don’t really want to talk about it too much. Grab your testing framework of choice (XUnit is a good pick for .NET Core, NUnit didn’t seem to work too well in Visual Studio + Resharper), mock out your dependencies and test away.

You can apply the same sort of testing to the orchestrator (the Handler) by giving it its dependencies through its constructor. Make sure you give it a parameterless constructor with default values though, or you’ll have issues running it in AWS.

This is basically just a poor man’s dependency injection, but my favourite dependency injection library, Ninject, does not exist in .NET Core and I didn’t want to invest the time to learn one of the other ones (like Autofac) for such a small project.
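To make that concrete, the sketch below shows the general shape of the thing (all of the names and the stand-in implementations are illustrative); the three responsibilities hide behind interfaces, the parameterless constructor wires up the production defaults for AWS Lambda, and the other constructor exists purely so that tests can supply whatever they want.

using System.Collections.Generic;
using System.Linq;

public interface IS3LineReader { IEnumerable<string> ReadLines(string bucket, string key); }
public interface ILineTransformer { object Transform(string line); }
public interface ILogstashWriter { void Write(IEnumerable<object> events); }

// Stand-ins for the real implementations, which actually talk to S3 and Logstash.
public class S3LineReader : IS3LineReader { public IEnumerable<string> ReadLines(string bucket, string key) { yield break; } }
public class ElbLineTransformer : ILineTransformer { public object Transform(string line) { return new { Raw = line }; } }
public class LogstashHttpWriter : ILogstashWriter { public void Write(IEnumerable<object> events) { } }

public class Handler
{
    private readonly IS3LineReader _reader;
    private readonly ILineTransformer _transformer;
    private readonly ILogstashWriter _writer;

    // AWS Lambda creates the Handler via this parameterless constructor,
    // so the production defaults get wired up here.
    public Handler() : this(new S3LineReader(), new ElbLineTransformer(), new LogstashHttpWriter()) { }

    // Tests construct the Handler through this constructor, supplying mocks,
    // fakes or the real thing as appropriate.
    public Handler(IS3LineReader reader, ILineTransformer transformer, ILogstashWriter writer)
    {
        _reader = reader;
        _transformer = transformer;
        _writer = writer;
    }

    public void Handle(string bucket, string key)
    {
        var events = _reader.ReadLines(bucket, key).Select(_transformer.Transform).ToList();
        _writer.Write(events);
    }
}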

Of course, while unit tests are valuable, they don’t really tell you whether or not everything will work when you put it all together.

That’s where integration tests come in.

Integration Functions

In my experience, I’ve consistently found there to be two distinct flavours of integration test.

The first is the test that validates whether or not everything works when all of your things are put together. This tends to end up being a test of your dependency injection configuration (if you’re using it), where you create some top level object using your container of choice, then execute a series of methods on it (and its children) echoing what would normally happen in your program execution. You might mock out the very edges of the system (like databases, API dependencies, disk-based configuration, etc), but in general you want to validate that your tens (hundreds? thousands?) of classes play nice together and actually accomplish their overarching goal.

The second type of test is the one that validates whether or not a class that relies on an external system works as expected. These are very close to unit tests in appearance (single class, some mocked out dependencies), but are much slower and more fragile (due to the reliance on something outside of the current code base). They do have a lot of value though, because beyond application failures, they inform you when your integration with the external dependency breaks for whatever reason.

The ELB Logs Processor features both types of integration test, which is nice, because it makes for a comprehensive blog post.

The simpler integration test is the one for the component that integrates with S3 to download a file and expose it as a series of lines. All you need to do is put a file in a real S3 bucket, then run the class with an appropriate set of parameters to find that file, download it and expose it in the format that is desired. You’re free to simulate missing files (i.e. go get this file that doesn’t exist) and other error cases as desired as well, which is good.

The other integration test involves the Handler, using its default configuration and a realistic input. The idea would be to run the Handler in a way that is as close to how it will be run in reality as possible, and validate that it does what it advertises. This means we need to put some sort of sample file in S3, fire an event at the Handler describing that file (i.e. the same event that it would receive during a normal trigger in AWS), validate that the Handler ran as expected (perhaps measuring log output or something) and then validate that the events from the sample file were available in the destination ELK stack. A complicated test for sure, but one with lots of value.

As is always the case, while both of those tests are conceptually fairly simple and easy to spell out, they both have to surmount the same challenge.

AWS Authentication and Authorization.

401, Who Are You?

The various libraries supplied by AWS (Javascript, Powershell, .NET, Python, etc) all handle credential management in a similar way.

They use a sort of priority system for determining the source of their credentials, and it looks something like this:

  1. If credentials have been supplied directly in the code (i.e. key/secret), use those
  2. Else, if a profile has been supplied, look for that and use it
  3. Else, if no profile has been supplied, use the default profile
  4. Else, if environment variables have been set, use those
  5. Else, use information extracted from the local machine (i.e. IAM profiles or EC2 meta information)

The specific flow differs from language to language (.NET specific rules can be found here for example), but the general flow is pretty consistent.
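To make that a little more concrete, the fragment below shows the two ends of that spectrum using the .NET library (the key/secret placeholders are obviously not real); supplying credentials directly in code versus constructing the client with nothing and letting the fallback chain do its thing.

using Amazon;
using Amazon.Runtime;
using Amazon.S3;

public static class CredentialExamples
{
    public static AmazonS3Client ExplicitCredentials()
    {
        // Option 1: key/secret supplied directly in code (fine for a quick experiment,
        // terrible for anything that ends up in source control).
        return new AmazonS3Client(
            new BasicAWSCredentials("{access-key}", "{secret-key}"),
            RegionEndpoint.APSoutheast2);
    }

    public static AmazonS3Client FallbackCredentials()
    {
        // Options 2-5: let the SDK work its way down the chain (profiles, environment
        // variables, then IAM roles / EC2 metadata). Inside AWS Lambda this resolves
        // to the IAM role attached to the function.
        return new AmazonS3Client(RegionEndpoint.APSoutheast2);
    }
}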

What does that mean for our tests?

Well, the actual Lambda function code doesn’t need credentials at all, because the easiest way to run it is with an IAM profile applied to the Lambda function itself in AWS. Assuming no other credentials are available, this kicks in at option 5 in the list above, and is by far the least painful (and most secure) option to use.

When running the tests during development, or running them on the build server, there is no ambient IAM profile available, so we need to find some other way to get credentials to the code.

Luckily, the .NET AWS library allows you to override the logic that I described above, which is pretty great (even though it does it with static state, which makes me sad).

With a little work we can nominate an AWS profile with a specific name to be our current “context”, and then use those credentials to both setup a temporary S3 bucket with a file in it, and to run the code being tested, like in the following Handler test.

/// <summary>
/// This test is reliant on an AWS profile being setup with the name "elb-processor-s3-test" that has permissions
/// to create and delete S3 buckets with prefixes oth.console.elb.processor.tests. and read items in those buckets
/// </summary>
[Fact]
public void WhenHandlingAValidS3Event_ConnectsToS3ToDownloadTheFile_AndOutputsEventsToLogstash()
{
    var inMemoryLogger = new InMemoryLogger();
    var xunitLogger = new XUnitLogger(_output);
    var logger = new CompositeLogger(inMemoryLogger, xunitLogger);

    var application = Guid.NewGuid().ToString();
    var config = new Dictionary<string, string>
    {
        { Configuration.Keys.Event_Environment, Guid.NewGuid().ToString() },
        { Configuration.Keys.Event_Application, application},
        { Configuration.Keys.Event_Component, Guid.NewGuid().ToString() },
        { Configuration.Keys.Logstash_Url, "{logstash-url}" }
    };
    using (AWSProfileContext.New("elb-processor-s3-test"))
    {
        logger.Log("Creating test bucket and uploading test file");
        var s3 = new AmazonS3Client(new AmazonS3Config { RegionEndpoint = RegionEndpoint.APSoutheast2 });
        var testBucketManager = new TestS3BucketManager(s3, "oth.console.elb.processor.tests.");
        using (var bucket = testBucketManager.Make())
        {
            var templateFile = Path.Combine(ApplicationEnvironment.ApplicationBasePath, @"Helpers\Data\large-sample.log");
            var altered = File.ReadAllText(templateFile).Replace("@@TIMESTAMP", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.ffffffZ"));
            var stream = new MemoryStream(Encoding.UTF8.GetBytes(altered));
            var putResult = s3.PutObjectAsync(new PutObjectRequest { BucketName = bucket.Name, InputStream = stream, Key = "test-file" }).Result;

            var message = new S3EventBuilder().ForS3Object(bucket.Name, "test-file").Build();

            logger.Log("Executing ELB logs processor");
            var handler = new Handler(logger, config);
            handler.Handle(message);
        }
    }

    // Check that there are some events in Elasticsearch with our Application
    Func<long> query = () =>
    {
        var client = new HttpClient {BaseAddress = new Uri("{elasticsearch-url}")};
        var raw = client.GetStringAsync($"/logstash-*/_search?q=Application:{application}").Result;
        dynamic result = JsonConvert.DeserializeObject(raw);
        return (long) result.hits.total;
    };

    query.ResultShouldEventuallyBe(hits => hits.Should().BeGreaterThan(0, $"because there should be some documents in Elasticsearch with Application:{application}"));
}

The code for the AWSProfileContext class is relatively simple, in that while it is in effect it inserts the profile into the first position of the credential fallback process, so that it takes priority over everything else. When disposed, it removes that override, returning everything to the way it was. A similar dispose-delete pattern is used for the temporary S3 bucket.
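For reference, the sketch below is roughly the shape of a helper like TestS3BucketManager; treat it as an illustration of the create-on-make, delete-on-dispose pattern rather than the actual code.

using System;
using Amazon.S3;

public class TestS3BucketManager
{
    private readonly IAmazonS3 _client;
    private readonly string _prefix;

    public TestS3BucketManager(IAmazonS3 client, string prefix)
    {
        _client = client;
        _prefix = prefix;
    }

    public TemporaryBucket Make()
    {
        // Keep the random suffix short enough to stay under the S3 bucket name limit.
        var name = _prefix + Guid.NewGuid().ToString("N").Substring(0, 20);
        _client.PutBucketAsync(name).Wait();
        return new TemporaryBucket(_client, name);
    }
}

public class TemporaryBucket : IDisposable
{
    private readonly IAmazonS3 _client;

    public TemporaryBucket(IAmazonS3 client, string name)
    {
        _client = client;
        Name = name;
    }

    public string Name { get; }

    public void Dispose()
    {
        // Buckets have to be empty before they can be deleted, so clean out any
        // objects the test put in there first.
        var objects = _client.ListObjectsAsync(Name).Result;
        foreach (var s3Object in objects.S3Objects)
        {
            _client.DeleteObjectAsync(Name, s3Object.Key).Wait();
        }
        _client.DeleteBucketAsync(Name).Wait();
    }
}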

With a profile override in place, you can run the tests in local development just by ensuring you have a profile available with the correct name, which allows for a nice smooth development experience once a small amount of manual setup has been completed.

To run the tests on the build servers, we put together a simple helper in Powershell that creates an AWS profile, runs a script block and then removes the profile. The sensitive parts of this process (i.e. the credentials themselves) are stored in TeamCity, and supplied when the script is run, so nothing dangerous needs to go into source control at all.

To Be Continued

The most complicated part of testing the ELB Logs Processor was getting the AWS credentials working consistently across a few different execution strategies, while not compromising the safety of the credentials themselves. The rest of the testing is pretty much just vanilla C# testing.

The only other thing worth mentioning is that the AWS libraries available for .NET Core aren’t quite the same as the normal .NET libraries, so some bits and pieces didn’t work as expected. The biggest issue I ran into was how a lot of the functionality for managing credentials is typically configured through the app.config, and that sort of thing doesn’t even seem to exist in .NET Core. Regardless, we still got it to work in the end, even though we did have to use the obsolete StoredProfileAWSCredentials class.

Next week will be the last post in this series, drawing everything together and focusing on actually running the new .NET Core ELB Logs Processor in a production environment.


So, last week was mostly an introduction to .NET Core and AWS Lambda. Not a lot of meat.

This week I’ll be writing about the build and deployment pipeline that we set up, and how it ties in with our normal process.

Hopefully it will be a bit more nutritious.

The Story So Far

Over the last few years we’ve put together a fairly generic build and deployment process using Powershell that focuses on ensuring the process can be run painlessly on a fresh checkout of the repository. The goal is to have everything either embedded or bootstrapped, and to have any and all logic encapsulated within the scripts.

What this means is that we have a relatively simple setup in TeamCity, which is basically just parameter management and a small bit of logic for kicking off the scripts that live inside the repositories.

The same approach is extended to deployment as well; we have Powershell (or an equivalent scripting language) that does the actual deployment, and then only use Octopus Deploy for orchestration and variable management.

Because we need to build, test and deploy a variety of things (like C# desktop applications, Javascript websites, the old Javascript ELB Logs Processor, CloudFormation environments), the common scripts feature pluggable processes in the form of Powershell script blocks, allowing the user to choose what sort of build is needed (simple Nuget package over some files? A full MSBuild execution that generates an .exe in a Nuget package?), how and when it is tested (NUnit over .dlls containing the phrase “Tests” before deployment? Selenium using Webdriver IO through npm after deployment?) and how it is deployed (pretty much just Octopus Deploy).

The API for the pluggable processes (which I’m going to call engines from this point forward) is relatively lax, in that the Powershell just calls the supplied script block with known parameters, and fails miserably if they don’t exist.

In retrospect, we probably should have just used something already available, like psake, but it just didn’t seem necessary at the time. We’re probably going to get hurt if we try to do anything that only works on Linux systems (no Powershell there), but it hasn’t happened yet, so I’m sure it will never happen.

To summarise, we have a framework that allows for configurable approaches to build, test and deployment, and a .NET Core AWS Lambda function should be able to slot right in.

Vrmm Vrmm Goes The Engine

In order to reliably get the lambda function code from the git repository in which it lives, to the actual AWS Lambda function where it will be executed, we needed to do three things:

  1. Creation of an artefact containing a versioned and validated copy of the code to be executed
  2. The validation logic itself (i.e. testing)
  3. Deploying said artefact to the AWS Lambda function

Coincidentally, these three things line up nicely with the pluggable components in our generic build and deployment process, so all we have to do is provide an engine for each, then run the overall process, and everything should be sweet.

Bob The Builder

The first thing to deal with is the actual creation of something that can be deployed, which for us means a Nuget package.

Referring back to the Javascript version of the ELB Logs Processor, there really wasn’t a lot to the creation of the artefact. Use npm to make sure the appropriate dependencies are present, a quick consolidation step with webpack, then wrap everything up in a Nuget package.

For the C# version, it’s a similar story, except instead of using node/npm, we can use the dotnet executable available in the SDK, followed by the Nuget command line tool to wrap it all up neatly.

Basically, it looks like this:

dotnet restore {solution-path}
dotnet build -c Release {solution-path}
dotnet publish -o {output-path} -c Release {solution-path}
nuget pack {nuspec-path} {output-directory} {version}

The actual script block for the engine is a little bit more complicated than that (TeamCity integration, bootstrapping the .NET Core SDK, error handling/checking, etc), but it’s still relatively straightforward.

As I mentioned last week, the only thing to watch out for here is to make sure that you’re targeting the appropriate runtime with your solution configuration, and not to try and build a standalone executable. The code will actually be executed on a Linux machine behind the scenes, and AWS Lambda will make sure the .NET Core runtime is available, so all you have to worry about is your code. I’m pretty sure if you try and supply a standalone, OS specific executable it will fail miserably.

Emotional Validation

The second thing to deal with is the validation of the artefact.

Similar to the build process, if we refer back to the Javascript ELB Logs Processor, we had no testing at all. Literally nothing. Our validation was “deploy it and see if the logs stop”.

Don’t get me wrong, there are tonnes of ways to write tests and automate them when using node/npm, we just never went to the effort of actually doing it.

I’m much more familiar with C# (even taking into account the differences of .NET Core), and we don’t write code without tests, so this was an ideal opportunity to put some stuff together for validation purposes.

I’m going to dedicate an entire post to just how we went about testing the rewrite (some interesting stuff there about managing AWS credentials in a secure way), but at a high level we followed our standard pattern of having dedicated test projects that sit in parallel to the code (i.e. Lambda.Tests.Unit, Lambda.Tests.Integration) with automatic execution through the generic build process via an engine.

This particular engine is applied pre-deployment (i.e. before the Nuget package is pushed to Octopus and a project executed), and is pretty straightforward:

dotnet test {project-path} -c Release --no-build --logger {output-file-path}

The only caveat here was the general immaturity of .NET Core, as I had some problems using NUnit (which is the testing framework we typically use for C#), but XUnit worked perfectly fine. They aren’t that dissimilar, so it wasn’t a huge problem.

Tentacle Joke

The last thing to deal with is the actual deployment of the artefact to AWS.

Assuming step 1 and 2 have gone well (i.e. we have a versioned Nuget package that has been validated), all that’s left is to leverage Octopus Deploy.

For us this means a relatively simple project inside Octopus that just deploys the Nuget package to a worker tentacle, and a Powershell script embedded in the package that does the bulk of the work.

$awsKey = Get-OctopusParameter "AWS.Deployment.Key";
$awsSecret = Get-OctopusParameter "AWS.Deployment.Secret";
$awsRegion = Get-OctopusParameter "AWS.Deployment.Region";

$version = Get-OctopusParameter "Octopus.Release.Number";

$functionName = Get-OctopusParameter "AWS.Lambda.Function.Name";

$aws = Get-AwsCliExecutablePath

$env:AWS_ACCESS_KEY_ID = $awsKey
$env:AWS_SECRET_ACCESS_KEY = $awsSecret
$env:AWS_DEFAULT_REGION = $awsRegion

$functionPath = "$here\function"

Write-Verbose "Compressing lambda code file"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[system.io.compression.zipfile]::CreateFromDirectory($functionPath, "index.zip")

Write-Verbose "Updating Log Processor lambda function [$functionName] to version [$version]"
(& $aws lambda update-function-configuration --function-name $functionName --runtime "dotnetcore1.0" --handler "Lambda::Handler::Handle") | Write-Verbose
(& $aws lambda update-function-code --function-name $functionName --zip-file fileb://index.zip) | Write-Verbose
(& $aws lambda update-function-configuration --function-name $functionName --description $version) | Write-Verbose

In summary, the script takes the contents of the Nuget package that was just deployed, zips up the relevant parts of it (in this case, everything under the “function” directory) and then uses the AWS CLI and some Octopus parameters to push the zip file to a known Lambda function.

Really the only interesting thing here is that we take the version from the Nuget package and apply it directly to the function in AWS for backwards traceability.

To Be Continued

The steps and processes that I’ve outlined above only describe the key parts of the build and deployment process. A lot of the detail is missing, and that is on purpose, because the entire thing is quite complicated and describing it comprehensively would make this blog post into more of a novel.

It’s already pretty long.

The key things to take away are the .NET Core build and publish process (using the dotnet cmdline application), the .NET Core testing process (again, through the cmdline application) and the simplest way that I’ve found to actually push code to a Lambda function in an automated fashion.

Next week I’ll elaborate on the testing of the lambda function, specifically around managing and using AWS credentials to provide an end to end test of function logic.