
As I’ve said repeatedly before, logging is easily one of the most important things that come out of a piece of software. Good logging can be used for all sorts of purposes, but most importantly, it makes maintaining and supporting a piece of software much, much easier. I won’t go into any more detail about how important I think logging is right now, as I’m going to use this post to talk about dealing with the bits and pieces that logging often leaves behind.

Mostly that means log files.

In our environment setup, we have drifted towards a pattern where all log files are written into a common logs directory, with subdirectories for each component.

If we want to find anything out about our components, we can just go to the logs directory. Nice and easy, no need to waste time hunting all over the file system for the logs that matter. Of course, it’s only used as a backup, as all of our logs are aggregated into ELK, but it’s still useful.

At this point in time, most of our components simply create log files on disk, and write everything there. It’s then up to an external component (Logstash in most cases, Nxlog in very rare cases) to take those files, do something to them and write the events to some other location for aggregation. We have log files being created by Squid proxies, performance counters, web services, IIS, Windows services and Powershell scripts, and probably many other things that aren’t occurring to me right now.

Of course, the downside of this extensive logging is that you are constantly consuming more and more disk space. Some logging frameworks and mechanisms manage that for you (either with a maximum directory size or some maximum number of days of log files), but not all.

A great side effect of having all logs go to the same directory is that it’s easy to write something to clean up that directory based on some retention policy (not written to in the last 14 days, for example).

So we did just that.

Scheduled Annihilation

Every environment that we create has two Powershell scripts executed on all machines by default.

The first sets up the logs directory at the common location, ensuring that it is writable by all appropriate users, and that all subdirectories and files are also appropriately writable.

We implemented this because when we set up a Performance Counter to write a log file containing system stats, it created the directory with an owner of System, and everything else just failed miserably from that point forward, completely breaking the common log location right off the bat. It was, for some reason, stupidly difficult to claim the folder back and fix its permissions as well, so we just implemented a script to configure the folder ahead of time.

function InitialiseCommonLogsDirectory
{
    $directoryPath = "C:\logs"
    if (-not (Test-Path $directoryPath))
    {
        Write-Verbose "Creating common logs directory at [$directoryPath]"
        New-Item $directoryPath -ItemType Directory | Out-Null
    }

    Write-Verbose "Making common logs directory [$directoryPath] writeable by everyone."
    $Acl = Get-Acl $directoryPath

    # Grant Everyone full control, inherited by all subdirectories and files created underneath.
    $Ar = New-Object System.Security.AccessControl.FileSystemAccessRule("Everyone", "FullControl", "ContainerInherit,ObjectInherit", "None", "Allow")

    $Acl.SetAccessRule($Ar)
    Set-Acl $directoryPath $Acl
}

The second script installs a Scheduled Task to run a fairly simple piece of Powershell that just deletes all files that haven’t been written to in the last 14 days. The assumption was that the physical files on disk were relatively unimportant assuming the log aggregation was working correctly, but we keep a small window just in case.

Annoyingly, the commands for setting up a scheduled task via Powershell are a bit arcane, mostly because you have to resort to using COM services to get the job done. The script to setup the scheduled task looked like this:

function CreateLogsClearingTask
{
    $directoryPath = "C:\logs"

    $taskName = "Clear Old Logs"

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")
    try
    {
        $rootFolder.DeleteTask($taskName, 0)
    }
    catch
    {
        if (-not ($_.Exception.Message -like "*The system cannot find the file specified*"))
        {
            throw $_.Exception
        }
    }

    $daysOld = 14
    $taskDescription = "Clears Logs that are older than [$daysOld] days"
    $taskCommand = "powershell"
    # -ExecutionPolicy has to come before -Command, because powershell.exe treats everything
    # after -Command as part of the command itself.
    $taskArguments = "-ExecutionPolicy Bypass -Command `"&{ Get-ChildItem $directoryPath\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-$daysOld) | Remove-Item }`""
 
    write-verbose "Creating Scheduled Task to automatically execute [$taskCommand] with arguments [$taskArguments] on a regular basis."

    $TaskStartTime = [DateTime]::Now.Date.AddHours(1)
 
    $TaskDefinition = $service.NewTask(0) 
    $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
    $TaskDefinition.Settings.Enabled = $true
    $TaskDefinition.Settings.AllowDemandStart = $true
 
    $triggers = $TaskDefinition.Triggers
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
    $trigger = $triggers.Create(2)
    $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
    $trigger.Enabled = $true
    $trigger.DaysInterval = 1
 
    # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
    $Action = $TaskDefinition.Actions.Create(0)
    $action.Path = "$taskCommand"
    $action.Arguments = "$taskArguments"
 
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
    # 6 = TASK_CREATE_OR_UPDATE, 5 = TASK_LOGON_SERVICE_ACCOUNT (run as the System account)
    $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5)

    $task = $rootFolder.GetTask($taskName)

    Write-Verbose "Executing Scheduled Task."
    $task.Run($null)
}

Not that friendly.

You may be thinking: why not just use the awesome scheduled task cmdlets available in later versions of Windows, like New-ScheduledTaskTrigger, Get-ScheduledTask and Register-ScheduledTask? These commands are extremely useful and easy to understand, but unfortunately, in our particular case, we needed to target operating systems where those features are not available. Specifically, our build server (which runs the automated tests for the Powershell scripts) is running an older version of Windows Server, where the cmdlets are unavailable. Sometimes you just have to target the lowest common denominator.
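
For comparison, this is roughly what the same daily task looks like with those newer cmdlets (a sketch only, since these are exactly the cmdlets we couldn’t rely on, and the script path is illustrative):

# Roughly equivalent setup using the newer cmdlets; not usable on the older servers mentioned above.
$action = New-ScheduledTaskAction -Execute "powershell" -Argument "-ExecutionPolicy Bypass -File C:\scripts\Clear-OldLogs.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At 1am
Register-ScheduledTask -TaskName "Clear Old Logs" -Description "Clears Logs that are older than 14 days" -Action $action -Trigger $trigger -User "System"

Anyway, back to the COM-based version we actually had to use.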

Straightforward, right? I mean, it looks like it does the job.

It did not.

Logs! Logs Everywhere!

We didn’t notice for a pretty long time, because the growth was tiny on our non-production environments, but once we started to have some real traffic, the log files for IIS and the Web Service started to add up quickly. Luckily, we have dashboards that track the disk usage of our various server components, and someone happened to see the pattern of the % free space on machines dropping steadily over time.

A closer look at the script showed that it was failing because it was trying to clean up C:\logs\logs, which of course doesn’t exist. The directory variable already had C:\logs in it, and the Powershell script was appending another \logs to it before it was run.

It turns out that the task had been failing continuously, pretty much since we implemented it, and because the leading indicator of failure (a failing task) never showed up, no one noticed. The task reported success because Powershell returns an exit code of 0 when an exception is thrown.

I cleaned up the script, wrote a test to verify that it was actually deleting old files (without error), incorporated the fixed scripts into our environment setup and deployed the fix to the live machines.

Once deployed, the task said it succeeded, but looking at the log files, they hadn’t actually been cleaned up at all. It had failed silently again.

Powershell and Failure: Close Friends

The root cause for the task not indicating failure correctly was that the Powershell executable, for some ungodly reason, returns an exit code of 0 when an exception is thrown. If you want to make it act like a normal application would (non-zero exit code = bad, zero = good) you have to wrap your script in a try…catch at the root and exit with a non-zero code yourself.
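
The pattern looks something like this (a minimal sketch, not the deployed script, and the inner script path is purely illustrative):

# Any exception inside the wrapped body becomes a non-zero exit code, so the scheduled task
# reports failure instead of quietly claiming success.
powershell.exe -ExecutionPolicy Bypass -Command "& { try { & 'C:\scripts\Clear-OldLogs.ps1' } catch { exit 1 } }"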

A little bit more refactoring around the Powershell scripts that create the scheduled task, and I had it automatically wrapping the supplied script with a try..catch and exiting with a failure if an exception occurred. I’ve included the script that creates the task below, but it’s only part of a bigger system (which you can see in the Solavirum.Scripts.Common repository, when I finish updating it).

function New-DailyPowershellScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$taskName,
        [string]$taskDescription,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [TimeSpan]$TimeOfDay="0:00",
        [switch]$ExecuteImmediately=$true
    )

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")

    try
    {
        $command = "powershell.exe"
        $arguments = _ConvertPowershellScriptIntoArgumentsForScheduledTask -script $script -taskName $taskName

        $TaskStartTime = [DateTime]::Now.Date.Add($TimeOfDay)
 
        $TaskDefinition = $service.NewTask(0) 
        $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
        $TaskDefinition.Settings.Enabled = $true
        $TaskDefinition.Settings.AllowDemandStart = $true
 
        $triggers = $TaskDefinition.Triggers
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
        $trigger = $triggers.Create(2)
        $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
        $trigger.Enabled = $true
        $trigger.DaysInterval = 1
 
        # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
        $Action = $TaskDefinition.Actions.Create(0)
        $action.Path = "$command"
        $action.Arguments = "$arguments"
 
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
        $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5)

        $task = $rootFolder.GetTask($taskName)

        if ($ExecuteImmediately)
        {
            Write-Verbose "An immediate execution of the just created scheduled task [$taskName] was requested. Executing now."
            $task.Run($null)
        }
    }
    finally
    {
        [System.Runtime.Interopservices.Marshal]::ReleaseComObject($service)
    }
}

function _ConvertPowershellScriptIntoArgumentsForScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [string]$taskName
    )

    $taskLogsDirectoryName = "$($taskName.Replace(' ', '_'))"
    $rootLogsDirectoryPath = "C:\logs\scheduled-tasks"
    $logsDirectoryPath = "$rootLogsDirectoryPath\$taskLogsDirectoryName"
    $logDirectory = New-Item -ItemType Directory -Path $logsDirectoryPath -Force
    # -ExecutionPolicy has to come before -command for the same reason as in the earlier script;
    # the try..catch turns any exception inside the supplied script into a non-zero exit code.
    return "-ExecutionPolicy Bypass -command `"& { try { `$timestamp = [DateTime]::Now.ToString('yyyyMMdd'); Start-Transcript -Path $logsDirectoryPath\`$timestamp.log -Append; $script } catch { Stop-Transcript; exit 1; } }`""
}
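
Usage is straightforward. For example, the log cleanup task from earlier can be recreated like this (illustrative only):

New-DailyPowershellScheduledTask -taskName "Clear Old Logs" `
    -taskDescription "Clears Logs that are older than [14] days" `
    -script "Get-ChildItem C:\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-14) | Remove-Item" `
    -TimeOfDay "1:00"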

Now that I had fixed the issue with it failing to fail in an obvious way, it was time to add some logging, so I could see why it wasn’t deleting the logs.

Using the code that wraps the incoming script in a try…catch, I added a call to Start-Transcript at the start of the script, and a call to Stop-Transcript at the end. Start-Transcript is a useful feature built into Powershell that writes everything interesting that happens during the transcript block to a file. It’s not perfect (it doesn’t seem to record Verbose statements), but it does record the output stream and any errors and warnings, which is enough.

Be careful. If you supply Start-Transcript with a file path inside a directory that does not exist, it appears to just fail silently. I’m pretty sure I tried the -Force flag (which usually deals with these things), but it didn’t seem to help. Make sure your destination directory exists before you try to start the transcript.
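
In other words, something like this (a sketch mirroring what the wrapper above generates; the directory name is illustrative):

# Create the transcript directory first; Start-Transcript appears to fail silently otherwise.
$transcriptDirectory = "C:\logs\scheduled-tasks\Clear_Old_Logs"
New-Item -ItemType Directory -Path $transcriptDirectory -Force | Out-Null

$timestamp = [DateTime]::Now.ToString("yyyyMMdd")
Start-Transcript -Path "$transcriptDirectory\$timestamp.log" -Append

# ... do the actual work here ...

Stop-Transcript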

I made sure that the log files coming out of the scheduled task were appropriately date stamped (so they too, could in turn be cleaned up) and deployed the updated scheduled task again.

The Root of My Problems

The transcript file showed that I was getting an Access Denied exception when the script was running as a task. Obviously this did NOT happen in my tests, so it was something specific to the deployed environments.

Fair enough, that can happen with scheduled tasks, because they default to running as the System account, which doesn’t necessarily have normal permissions.

The interesting thing was that the root logs directory was specifically marked as full control to all appropriate users, and that included the System account. Checking the permissions on the directory with Get-Acl proved that to be true. System clearly had the ability to do whatever it liked to the logs directory and its content.

So where was the Access Denied error coming from?

Time to go back to debugging basics: Process Monitor (procmon). Process Monitor is one of those awesome Sysinternals tools supplied by Microsoft/someone at Microsoft. It lets you monitor every single event happening within Windows, including interactions with the registry, file system and all sorts of other things. Also, more importantly for the maintenance of sanity, it lets you filter those events.

Firing up procmon and running the scheduled task showed that it was clearly deleting the very first file it found that was old enough to be deleted, and then stopping. There were no obvious errors coming from the file system; the process just seemed to stop trying. If I ran the task again, it had actually deleted that first file, and it would delete the next one (which was now the first) and stop again.

I still have no idea why it was doing this. I suspect it might have had something to do with deleting files during a streaming operation in Powershell (i.e. Get|Where|Delete), but if that were the case, I would think it would happen during my test as well.

I fixed the issue by wrapping the Get|Where in brackets, to make sure it was fully evaluated before anything was piped to Remove-Item, kind of like using a ToArray or ToList in C# when enumerating through a collection you intend to remove values from.
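
The change itself was tiny (shown here with the same 14 day window as the task above):

# Before: the streaming version, which stopped after deleting the first old file on the live machines.
Get-ChildItem C:\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-14) | Remove-Item

# After: the parentheses force the Get|Where part to be fully evaluated before any deletion happens.
(Get-ChildItem C:\logs\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-14)) | Remove-Item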

Once I made that change everything worked. The files were deleted, the task indicated success and all was well in the world.

Conclusion

There are a number of lessons to come out of this, but the most important is to make sure that your failure indicators actually indicate failure when a failure occurs, which is really just a specific application of “test all the things”.

It’s not enough to just test it yourself either (in a unit test or similar); you must test in the environment that it will actually be deployed to, because the differences will catch you out. In addition to testing, you need to have indicators or events that you monitor in order to make sure everything is acting as you expected. In our case, the monitoring that we had on the machines allowed us to notice the issue, but if no one had been paying attention, we would probably have had issues when the disk finally filled up.

It would have been better to have alerts that warned us when something was trending the wrong way. We probably glossed over that sort of thing because we thought it was a solved problem (after all, we had a script), so I suppose the moral is to question and check everything, even if you think it’s fine.

Because it’s probably not.


As I’ve already stated, I’ve spent the last few weeks working on putting together log aggregation so that we know what our new service looks like in real time.

I’ve incorporated IIS logs, the application logs, machine statistics (memory, CPU, etc) and Windows Event logs into the log aggregator, and successfully used those events for analysis during load testing.

There was one piece missing though, which meant there was a hole in our ability to monitor how our service was actually operating in the face of actual usage.

The Elastic Load Balancer, or ELB, that sits in front of the publicly accessible web service.

During load testing, I noticed that sometimes JMeter would record an error (specifically a 504, Gateway Timeout) but our dashboard in Kibana would show nothing. No errors, everything seemed fine.

It turned out that there was a default timeout on the ELB of 60 seconds, and at that point in the load testing, some requests were taking longer than that without causing any traffic over the connection. The ELB would terminate the connection, return a 504 to the client, but the request would still complete successfully (eventually) in the backend.

I needed to get eyes on the ELB.

It’s Log!

Turning logging on for an ELB is fairly easy.

Just give it the S3 bucket you want it to log to, a prefix to use for entries made into the bucket and a time interval, and off it goes. All of this can be done through the CloudFormation template, which fits well into our strategy for environment setup (no manual tasks, automate all the things).

The only complex bit is setting up a bucket policy that sets the correct permissions to allow the ELB to write to the bucket, which is all pretty well documented. There is simply a well-known ARN for what I assume is all Load Balancers in a region, and you set up a simple Put/Get/List policy to allow it to do its thing.
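
The policy looks roughly like this, shown here using the AWS Powershell tools rather than CloudFormation (the bucket name, prefix and ELB account ARN are all placeholders, so look up the documented value for your region):

# A hedged sketch only; in our case the real policy is created by the CloudFormation template.
$policy = @"
{
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<elb-account-arn-for-your-region>" },
      "Action": [ "s3:PutObject", "s3:GetObject", "s3:ListBucket" ],
      "Resource": [ "arn:aws:s3:::my-elb-logs", "arn:aws:s3:::my-elb-logs/my-prefix/*" ]
    }
  ]
}
"@

Write-S3BucketPolicy -BucketName "my-elb-logs" -Policy $policy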

The only gotcha I ran into was when I included an underscore (_) in the prefix configuration setting for the ELB. The prefix setting is intended to make sure that the keys for files written into the bucket start with a common value. When I included an underscore, I got nothing but Access Denied errors. This was at the same time as I was setting up the bucket policy, so I assumed I had done that incorrectly. Turns out my bucket policy was flawless, and it was a completely unrelated (and unexpected) issue causing the Access Denied errors.

Very frustrating.

With that fixed though, the logs started flowing.

Content Rich

The ELB logs contain things like the ELB IP and port, where the request was forwarded to (IP and port again), the time to forward, process and respond to requests (3 separate entries, process is the time it takes for your server to do its thing), response codes, bytes transferred and other things. Very similar to IIS really, which is not unexpected.

Now all I had to do was get the information into our Log Aggregator.

Stashing Those Logs

I had been using Nxlog as my log processor. It was responsible for picking up files, processing them as necessary, enriching them with various pieces of information (hostname, component, application) and then shipping the results off via TCP to our log aggregator where Logstash was listening.

Nxlog is a fine product, but its scripting language is hard to get a handle on, and the documentation is a bit sparse. Also, it has no concept of decimal numbers, which meant that I had to convert some numbers to integers (like decimal seconds to milliseconds) via regular expressions. Altogether it got the job done, but I wasn’t particularly happy with it.

I thought that, since I needed to do something a little bit more complicated (get files from S3 and process them), I would use Logstash this time. Logstash as a log processor is a lot easier to distribute, configure and debug, which is nice. Its configuration is a simple, JSON-like format that is very easy to wrap your head around, and it has lots of components to accomplish various tasks, like getting files from S3, parsing CSV lines, mutating fields to the correct type, etc. It even has a mutator (Logstash calls them filters) that allows you to execute arbitrary Ruby code for those times when you have to do something unusual.

Even better, Logstash is what's listening on the other end of the pipeline, so they play well together, and you only need to know 1 piece of software, instead of 2.

I built a similar distributable project to what I built for Nxlog, that creates a NuGet package that Octopus can deploy to get a copy of Logstash up and running on the target machine as a Windows Service. I won’t go into this in too much detail, but it was essentially the same thing that I did for Nxlog, except with different dependencies (JRE, Logstash, NSSM for service installation/configuration).
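
The service installation piece boils down to a few NSSM calls, something like this (a hedged sketch; the paths, service name and Logstash 1.x command line are assumptions, not the actual contents of the package):

# Install Logstash as a Windows Service via NSSM, then start it.
$nssm = "C:\tools\nssm\nssm.exe"
& $nssm install "logstash" "C:\logstash\bin\logstash.bat"
& $nssm set "logstash" AppParameters "agent -f C:\logstash\conf\logstash.conf"
& $nssm set "logstash" AppDirectory "C:\logstash"
Start-Service "logstash"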

I added a small EC2 instance to our environment setup to act as a Log Processor, with the intent that it would immediately be used to process the ELB logs, but may also be used in the future to process other logs that don’t necessarily fit onto a specific machine (S3 access logs are the only ones that come to mind, but I’m sure there are more). The Logs Processor had an IAM role allowing it full control over the logs bucket that ELB was using (which was also created as part of the environment). Nice and clean, and no credentials stored anywhere.

I created a Logstash configuration to grab files from S3 and process them, and then deployed it to the Logs Processor.

Access Denied.

Permission to Rage Requested

The current release version of Logstash (1.4.2) does not support the usage of IAM roles for the S3 input. If I wanted to use that input, I would have to enter the credentials manually into the config file. I could do this easily enough at deployment time (storing the credentials in Octopus, which is much better than in source control), but I would need to actually have a user set up that could access the bucket. As the bucket is created during environment creation, this would mean that the credentials would change every time the environment was recreated. We create temporary environments all the time, so this would mean a manual step editing Octopus every time you wanted to get something to work.

That's unacceptable.

I contemplated using a small script during deployment time to grab some credentials from the IAM role on the machine and enter them into the config file, but those credentials expire and Logstash was running as a service, so at some stage it would just stop working and someone would have to do something to make it work again.

Again, unacceptable.

Luckily for me, the wonderful people behind Logstash (and specifically the S3 plugin) have developed a new version that allows the usage of IAM roles, and it was already in beta. It’s still a little unstable (Release Candidate 2), but it was good enough for my purposes.

While doing some reading about Logstash and the new version I discovered that the file input was basically completely broken on Windows. The component that it was leveraging to get the unique identifier for files (in order to record the position in the file that it was up to) does not work in 1.4.2 and below, so you end up missing huge chunks of data when processing multiple files. This actually explained why I was having so much difficulty using the earlier version to process a large amount of IIS logs from a disconnected machine, and why there were holes in my data. Long story short, if you’re using the file input in Logstash and you’re on Windows, get the latest release candidate.

I incorporated the 1.5 RC2 release into my deployment, but I still couldn’t get the S3 input to work.

Why Is It Always A Proxy

I hate proxies.

Not because of what they are. I think they actually do some useful things, like caching, obfuscating where requests are coming from when accessing the internet from within a network and preventing access to bad websites.

No I hate proxies because the support for them is always a pain in the ass. Every application seems to support proxies differently, if they support them at all. Some automatically read from the Internet Explorer registry setting for the proxy, some use the HTTP_PROXY environment variable, some have their own personal settings. This means that every time you want to use a piece of software in an environment that uses a proxy, you have to fight with it to get it to work.

Such was the case with the S3 input. The underlying Ruby based aws-sdk has support for proxies (as does the .NET one, which exposes it through the Set-AwsProxy cmdlet in the Powershell tools).

I could not, for the life of me, figure out how to configure Logstash with a proxy for the AWS component though.

So, I was stuck. I had all the configuration in place to process the ELB logs, but I didn’t have the logs themselves.

In the end I created a small Powershell script that uses the AWS Powershell Component to move all files from an S3 bucket to a local directory on a timer. I then installed that script as a Windows Service using NSSM. Finally I edited my Logstash configuration to process the local files instead. After tweaking my config to process the files correctly, everything started coming through into the Log Aggregator as expected, and I added the missing piece to our intelligence about the service.
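
The downloader itself is nothing special. A rough sketch of the idea (the bucket name, destination and interval are illustrative, and it assumes the AWS Powershell module plus an instance IAM role that grants access to the bucket):

Import-Module AWSPowerShell

$bucket = "my-elb-logs"
$destination = "C:\logs\elb"
New-Item -ItemType Directory -Path $destination -Force | Out-Null

while ($true)
{
    foreach ($object in (Get-S3Object -BucketName $bucket))
    {
        # Download each log file locally, then remove the original so it only gets processed once.
        $localFile = Join-Path $destination ($object.Key -replace "[/\\]", "_")
        Read-S3Object -BucketName $bucket -Key $object.Key -File $localFile | Out-Null
        Remove-S3Object -BucketName $bucket -Key $object.Key -Force | Out-Null
    }

    Start-Sleep -Seconds 60
}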

I don’t like this solution, because it adds more moving parts than I think is strictly necessary, but sometimes you have to make compromises.

Summary

I’ve uploaded a repository with my deployable build for Logstash here, so hopefully someone else can benefit from the effort that I put into making it re-usable.

Setting up a deployment pipeline for this component saved me a lot of time throughout the development process, making redeploying my changes when I made a mistake or needed to change a dependency (like upgrading to Logstash 1.5 RC2) a breeze. I highly recommend spending that initial bit of effort in setting things up at the start so you can move quickly later.

In regards to the actual ELB logs, they don’t provide any groundbreaking information that IIS didn’t already give us, except for the case where connections are terminated at the ELB due to inactivity. At least to my knowledge anyway; I suppose they will also show when the underlying instances go offline, which will be good. The ELB entries come in a bit slower than the IIS ones (due to the delay before the log files are published from the ELB, plus the delay added by my own S3 downloader and Logstash file processor pair), but there’s not really much I can do about that.

I still hate proxies. Only because it’s easier to hate one thing than every application that doesn’t support them.


It’s that time of the release cycle where you need to see if your system will actually stand up to people using it. Ideally this is the sort of thing you do as you are developing the solution (so the performance and load tests grow and evolve in parallel with the solution), but I didn’t get that luxury this time.

Now that I’m doing something new, that means another series of blog posts about things that I didn’t understand before, and now partially understand! Mmmmmm, partial understanding, the goal of true champions.

At a high level, I want to break our new API. I want to find the point at which it just gives up, because there are too many users trying to do things, and I want to find out why it gives up. Then I want to fix that and do the whole process again, hopefully working my way up to how many users we are expecting + some wiggle room.

Over the next couple of weeks I’m going to be talking about JMeter (our load testing application of choice), Nxlog (a log shipping component for Windows), ELK (our log aggregator), AWS (not a huge amount) and some other things that I’ve done while working towards good, reproducible load tests.

Today’s topic is something that has bugged me for a few weeks, ever since I implemented shipping logs from our API instances into our log aggregator.

Configurific

Our log shipper of choice is Nxlog, being that we primarily use Windows servers. The configuration for Nxlog, the part that says what logs to process and how and where to send them, is/was installed at initialization time for the AWS instances inside our auto scaling group. This seemed like a good idea at the time, and it got us a working solution fairly quickly, which is always helpful.

Once I started actually using the data in the log aggregator though, I realised very quickly that I needed to make fairly frequent changes to the configuration, to either add more things being logged (windows event log for example) or to change fields or fix bugs. I needed to be able to deploy changes, and I didn’t want to have to do that by manually copying configuration files directly onto the X number of machines in the group. I certainly didn’t want to recreate the whole environment every time I wanted to change the log configuration either.

In essence, the configuration quickly became very similar to a piece of software. Something that is deployed, rather than just incorporated into the environment once when it is created. I needed to be able to version the configuration, easily deploy new versions and to track what versions were installed where.

Sounds like a job for Octopus Deploy.

Many Gentle Yet Strong Arms

I’ve spoken about Octopus Deploy a few times before on this blog, but I’ll just reiterate one thing.

It’s goddamn amazing.

Honestly, I’m not entirely sure how I lived without it. I assume with manual deployment of software and poor configuration management.

We use Octopus in a fairly standard way. When the instances are created in our environment they automatically install an Octopus Tentacle, and are tagged with appropriate roles. We then have a series of projects in Octopus that are configured to deploy various things to machines in certain roles, segregated by environment. Nothing special, very standard Octopus stuff.

If I wanted to leverage Octopus to manage my log configuration, I would need to package up my configuration files (and any dependencies, like Nxlog itself) into NuGet packages. I would then have to configure at least one project to deploy the package to the appropriate machines.

Tasty Tasty Nougat

I’ve used NuGet a lot before, but I’ve never really had to manually create a package. Mostly I just publish .NET libraries, and NuGet packaging is well integrated into Visual Studio through the usage of .csproj files. You don’t generally have to think about the structure of the resulting package much, as it’s taken care of by whatever the output of the project is. Even when producing packages for Octopus, intended to be deployed instead of just referenced from other projects, Octopack takes care of all the heavy lifting, producing a self contained package with all of the dependencies inside.

This time, however, there was no project.

I needed to package up the installer for Nxlog, along with any configuration files needed. I would also need to include some instructions for how the files should be deployed.

Luckily I had already written some scripts for an automated deployment of Nxlog, which were executed as part of environment setup, so all I needed to do was package all of the elements together, and give myself a way to easily create and version the resulting package (so a build script).

I created a custom nuspec file to define the contents of the resulting package, and relied on the fact that Octopus runs appropriately named scripts at various times during the deployment lifecycle. For example, the Deploy.ps1 script is run during deployment. The Deploy script acts as an entry point and executes other scripts to silently install Nxlog, modify and copy the configuration file to the appropriate location and start the Nxlog service.

Alas Octopus does not appear to have any concept of uninstallation, so there is no way to include a cleanup component in the package. This would be particularly useful when deploying new versions of the package that now run a new version of Nxlog for example. Instead of having to include the uninstallation script inside the new package, you could deploy the original package with clean up code already encapsulated inside, ready to be run when uninstallation is called for (for example, just before installing the new package).

There was only one issue left to solve.

How to select the appropriate configuration file based on where the package is being deployed? I assumed that there would be multiple configuration files, because different machines require different logging. This is definitely not a one-size-fits-all kind of thing.

Selection Bias

In the end, the way in which I solved the configuration selection problem was a bit hacky. It worked (hacky solutions usually do), but it doesn’t feel nice and clean. I generally shy away from solutions like that, but as this continuous log configuration deployment was done while I was working on load/performance testing, I had to get something in place that worked, and I didn’t have time to sand off all the rough edges.

Essentially each unique configuration file gets its own project in Octopus, even though they all reference the same NuGet package. This project has a single step, which just installs the package onto the machines with the specified roles. Since machines could fill multiple roles (even if they don’t in our environments) I couldn’t use the current role to select the appropriate configuration file.

What I ended up doing was using the name of the project inside the Deploy script. The project name starts with NXLOG_ and the last part of the name is the name of the configuration file to select. Some simple string operations very quickly get the appropriate configuration file, and the project can target any number of roles and environments as necessary.
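
The selection logic is only a few lines (a minimal sketch, assuming the standard Octopus.Project.Name variable and a config folder inside the package; the names are illustrative):

# Pick the configuration file based on the last segment of the Octopus project name,
# e.g. a project called "NXLOG_API" selects the "API" configuration.
$projectName = $OctopusParameters["Octopus.Project.Name"]
$configName = $projectName.Split("_")[-1]

$here = Split-Path $script:MyInvocation.MyCommand.Path
Copy-Item "$here\config\$configName.conf" "C:\Program Files (x86)\nxlog\conf\nxlog.conf" -Force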

It definitely isn't the greatest solution, but it gets the job done. There are some other issues that I can foresee as well, like the package being installed multiple times on the same machine through different projects, which definitely won’t give the intended result. That's partially a factor of the way Nxlog works though,  so I’m not entirely to blame.

Results

In the end I managed to get something together that accomplished my goal. With the help of the build script (which will probably be incorporated into a TeamCity Build Configuration at some stage), I can now build, version and optionally deploy a package that contains Nxlog along with automated instructions on how and when to install it, as well as configuration describing what it should do. I can now make log changes locally on my machine, commit them, and have them appear on any number of servers in our AWS infrastructure within moments.

I couldn’t really ask for anything more.

I’ve already deployed about 50 tweaks to the log configuration since I implemented this continuous deployment, as a result of needing more information during load and performance testing, so I definitely think it has been worthwhile.

I like that the log configuration isn’t something that is just hacked together on each machine as well. It’s controlled in source, we have a history of all configurations that were previously deployed, and we can keep an eye on what package is active in what environments.

Treating the log configuration like I would treat an application was definitely the right choice.

I’ve knocked together a repository containing the source for the Nxlog package and all of the supporting elements (i.e. build scripts and tests). As always, I can’t promise that I will maintain it moving forward, as the actual repository I’m using is in our private BitBucket, but this should be enough for anyone interested in what I’ve written about above.

Next time: Load testing and JMeter. I write Java for the first time in like 10 years!


Last time I outlined the start of setting up an ELK based Log Aggregator via CloudFormation. I went through some of the problems I had with executing someone else's template (permissions! proxies!), and then called it there, because the post was already a mile long.

Now for the thrilling conclusion where I talk about errors that mean nothing to me, accepting defeat, rallying and finally shipping some real data.

You Have Failed Me Again Java

Once I managed to get all of the proxy issues sorted, and everything was being downloaded and installed properly, the instance was still not responding to HTTP requests over the appropriate ports. Well, it seemed like it wasn’t responding anyway.

Looking into the syslog, I saw repeated attempts to start Elasticsearch and Logstash, with an equal number of failures, where the process had terminated immediately and unexpectedly.

The main issue appeared to be an error about “Bad Page Map”, which of course makes no sense to me.

Looking it up, it appears as though there was an issue with the version of Ubuntu that I was using (3.13? It’s really hard to tell what version means what), and it was not actually specific to Java. I’m going to blame Java anyway. Apparently the issue is fixed in 3.15.

After swapping the AMI to the latest distro of Ubuntu, the exceptions no longer appeared inside syslog, but the damn thing still wasn’t working.

I could get to the appropriate pages through the load balancer, which would redirect me to Google to supply OAuth credentials, but after supplying appropriate credentials, nothing else ever loaded. No Kibana, no anything. This meant of course that ES was (somewhat) working, as the load balancer was passing its health checks, and Kibana was (somewhat) working because it was at least executing the OAuth bit.

Victory?

Do It Yourself

It was at this point that I decided to just take it back to basics and start from scratch. I realised that I didn’t understand some of the components being installed (LogCabin for example, something to do with the OAuth implementation?), and that they were getting in the way of me accomplishing a minimum viable product. I stripped out all of the components from the UserData script, looked up the latest compatible versions of ES, Logstash and Kibana, installed them, and started them as services. I had to make some changes to the CloudFormation template as well (ES defaults to 9200 and Kibana 4 to 5601, so I had to expose the appropriate ports and make some minor changes to the health check; Logstash was fine).

The latest version of Kibana is more self contained than previous ones, which is nice. It comes with its own web server, so all you have to do is start it up and it will listen to and respond to requests via port 5601 (which can be changed). This is different to the version that I was originally working with (3?), which seemed to be hosted directly inside Elasticsearch? I’m still not sure what the original template was doing to be honest; all I know is that it didn’t work.

Success!

A Kibana dashboard, load balancers working, ES responding. Finally everything was up and running. I still didn’t fully understand it, but it was a hell of a lot more promising than it was before.

Now all I had to do was get some useful information into it.

Nxlog Sounds Like a Noise You Make When You Get Stabbed

There are a number of ways to get logs into ES via Logstash. Logstash itself can be installed on other machines and forward local logs to a remote Logstash, but it’s kind of heavyweight for that sort of thing. Someone has written a smaller component called Logstash-Forwarder which does a similar thing. You can also write directly to ES using Logstash compatible index names if you want to as well (Serilog offers a sink that does just that).

The Logstash solutions above seem to assume that you are gathering logs on a Unix based system though, and don’t really offer much in the way of documentation or help if you have a Windows based system.

After a small amount of investigation, a piece of software called Nxlog appears to be the most commonly used log shipper as far as Windows is concerned.

As with everything in automation, I couldn’t just go onto our API hosting instances and just install and configure Nxlog. I had to script it, and then add those scripts to the CloudFormation template for our environment setup.

Installing Nxlog from the command line is relatively simple using msiexec and the appropriate “don't show the damn UI” flags, and configuring it is simple as well. All you need to do is have an nxlog.conf file configured with what you need (in my case, IIS and application logs being forwarded to the Logstash endpoint) and then copy it to the appropriate conf folder in the installation directory.
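
Something along these lines (a sketch of the approach only; the installer filename is illustrative, and the destination path assumes the default Nxlog community edition install location):

# Silently install Nxlog, drop in our configuration, then (re)start the service.
$installer = ".\nxlog-ce.msi"
Start-Process "msiexec.exe" -ArgumentList "/i `"$installer`" /qn" -Wait -NoNewWindow

Copy-Item ".\nxlog.conf" "C:\Program Files (x86)\nxlog\conf\nxlog.conf" -Force
Restart-Service "nxlog"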

The nxlog configuration file takes some getting used to, but their documentation is pretty good, so it’s just a matter of working through it. The best tip I can give is to use a file output until you are sure that nxlog is doing what you think it’s doing, and then flip everything over to output to Logstash. You’ll save a lot of frustration if you know exactly where the failures are (and believe me, there will be failures).

After setting up Nxlog, it all started to happen! Stuff was appearing in Kibana! It was one of those magical moments where you actually get a result, and it felt good.

Types? We need Types?

I got everything working nicely in my test environment, so I saved my configuration, tore down the environments and created them again (to verify they could be recreated). Imagine my surprise when I was getting Nxlog internal messages into ES, but nothing from IIS. I assumed that I had messed up Nxlog somehow, so I spent a few hours trying to debug what was going wrong. My Nxlog config appeared to be fine, so I assumed that there was something wrong with the way I had configured Logstash. Again, it seemed to be fine.

It wasn't until I looked into the Elasticsearch logs that I found out why all of my IIS logs were not making it. The first document sent to Elasticsearch had a field called EventRecievedTime (from the Nxlog internal source) which was a date, represented as ticks since X, i.e. a gigantic number. ES had inferred the type of this field as a long. The IIS source also had a field called EventRecievedTime, which was an actual date (i.e. YYYY-MM-DD HH:mm). When any IIS entry arrived into ES from Logstash, ES errored out trying to parse the datetime into a long, and discarded it. Because of the asynchronous nature of the system, there was no way for Logstash to communicate the failure back to anywhere that I could see it.

After making sure that both EventRecievedTimes were dates, everything worked swimmingly.

I suppose this might reoccur in the future, with a different field name conflicting. I’m not sure exactly what the best way to deal with this would be. Maybe the Elasticsearch logs should be put into ES as well? At least then I could track it. You could set up a template to strongly type the fields as well, but due to the fluidity of ES, there are always going to be new fields, and ES will always try to infer an appropriate type, so having a template won’t stop it from occurring.

Mmmmmmm Kibana

Look at this dashboard.

Look at it.

So nice.

I haven’t even begun to plumb the depths of the information now at my fingertips. Most of those charts are simple (average latency, total requests, response codes, requested paths), but it still provides a fantastic picture of the web applications in question.

Conclusion

Last time I wrote a CloudFormation template, I didn’t manage to get it into a publicly available repository, which kind of made the blog posts around it significantly less useful.

This time I thought ahead. You can find all of the scripts and templates for the Log Aggregator in this repository. This is a copy of our actual repository (private, in Bitbucket), so I’m not sure if I will be able to keep it up to date as we make changes, but at least there is some context to what I’ve been speaking about.

I’ve included the scripts that setup and configure Nxlog as well. These are actually located in the repository that contains our environment setup, but I think they are useful inside this repository as a point of reference for setting up log shipping on a Windows system. Some high level instructions are available in the readme of the repository.

Having a Log Aggregator, even though it only contains IIS and application logs for a single application, has already proved useful. It adds a huge amount of transparency to what is going on, and Kibana’s visualisations are extremely helpful in making sense of the data.

Now to do some load testing on the web application and see what Kibana looks like when everything breaks.


Logging is one of the most important components of a good piece of software.

That was a statement, not a question, and is not up for debate.

Logging enables you to do useful things, like identify usage patterns (helpful when deciding what sections of the application need the most attention), investigate failures (because there are always failures, and you need to be able to get to their root causes) and keep an eye on performance (which is a feature, no matter what anyone else tells you). Good logging enables a piece of software to be supported long after the software developers who wrote it have moved on, extending the life expectancy of the application and thus improving its return on investment.

It is a shame really, that logging is not always treated like a first class citizen. Often it is an afterthought, added in later after some issue or failure proves that it would have been useful, and then barely maintained from that point forward.

Making sure your application has excellent logging is only the first part though; you also need somewhere to put the logs so that the people who need them can access them.

The most common approach is to have logs be output to a file, somewhere on the local file system relative to the location where the software is installed and running. Obviously this is better than not having logs, but only just barely. When you have log files locally, you are stuck in a reactive mindset, using the logs as a diagnostic tool when a problem is either observed or reported through some other channel (like the user complaining).

The better approach is to send the logs somewhere. Somewhere they can be watched and analysed and alerted on. You can be proactive when you have logs in a central location, finding issues before the users even notice and fixing them even faster.

I’m going to refer to that centralised place where the logs go as a Log Aggregator, although I’m not sure if that is the common term. It will do for now though.

Bucket O’ Logs

At my current job, we recently did some work to codify the setup of our environments. Specifically, we used AWS CloudFormation and Powershell to setup an auto scaling group + load balancer (and supporting constructs) to be the home for a new API that we were working on.

When you have a single machine, you can usually make do with local logs, even if it’s not the greatest of ideas (as mentioned above). When you have a variable number of machines, whose endpoints are constantly shifting and changing, you really need a central location where you can keep an eye on the log output.

Thus I’ve spent the last week and a bit working on exactly that. Implementing a log aggregator.

After some initial investigation, we decided to go with an ELK stack. ELK stands for Elasticsearch, Logstash and Kibana, three components that each serve a different purpose. Elasticsearch is a document database with strong analysis and search capabilities. Logstash is an ETL (Extract, Transform, Load) system, used for moving logs around as well as transforming and mutating them into appropriate structures to be stored in Elasticsearch, and Kibana is a front end visualisation and analysis tool that sits on top of Elasticsearch.

We went with ELK because a few other teams in the organization had already experimented with it, so there was at least a little organizational knowledge floating around to exploit. Alas, the other teams had not treated their ELK stacks as anything reusable, so we still had to start from scratch in order to get anything up and running.

We did look at a few other options (Splunk, Loggly, Seq) but it seemed like ELK was the best fit for our organisation and needs, so that was what we went with.

Unix?

As is my pattern, I didn’t just want to jam something together and call that our log aggregator, hacking away at a single instance or machine until it worked “enough”. I wanted to make sure that the entire process was codified and reproducible. I particularly liked the way in which we had done the environment setup using CloudFormation, so I decided that would be a good thing to aim for.

Luckily someone else had already had the same idea, so in the true spirit of software development, I stole their work to bootstrap my own.

Stole in the sense that they had published a public repository on GitHub with a CloudFormation template to setup an ELK stack inside it.

I cloned the repository, wrote a few scripts around executing the CloudFormation template and that was that. ELK stack up and running.

Right?

Right?

Ha! It’s never that easy.

Throughout the rest of this post, keep in mind that I haven't used a Unix based operating system in anger in a long time. The ELK stack used an Ubuntu distro, so I was at a disadvantage from the word go. On the upside, having been using cmder a lot recently, I was far more comfortable inside a command line environment than I ever have been before. Certainly more comfortable than I was when I last used Unix.

Structural Integrity

The structure of the CloudFormation template was fairly straightforward. There were two load balancers, backed by an auto scaling group. One of the load balancers was public, intended to expose Kibana. The other was internal (i.e. only accessible from within the specified VPC) intended to expose Logstash. There were some Route53 entries to give everything nice names, and an Auto Scaling Group with a LaunchConfig to define the configuration of the instances themselves.

The auto scaling group defaulted to a single instance, which is what I went with. I’ll look into scaling later, when it actually becomes necessary and we have many applications using the aggregator.

As I said earlier, the template didn’t just work straight out of the repository, which was disappointing.

The first issue I ran into was that the template called for the creation of an IAM role. The credentials I was using to execute the template did not have permissions to do that, so I simply removed it until I could get the appropriate permissions from our AWS account managers. It turns out I didn’t really need it anyway, as the only thing I needed to access using AWS credentials was an S3 bucket (for dependency distribution) which I could configure credentials for inside the template, supplied as parameters.

Removing the IAM role allowed me to successfully execute the template, and it eventually reached that glorious state of “Create Complete”. Yay!

It still didn’t work though. Booooo!

Its Always a Proxy

The initial template assumed that the instance would be accessible over port 8080. The public load balancer relied on that fact and its health check queried the __es path. The first sign that something was wrong was that the load balancer thought that the instance inside it was unhealthy, so it was failing its health check.

Unfortunately, the instance was not configured to signal failure back to CloudFormation if its setup failed, so although CloudFormation had successfully created all of its resources, when I looked into the cloud-init-output.log file in /var/log, it turned out that large swathes of the init script (configured in the UserData section of the LaunchConfig) had simply failed to execute.

The issue here, was that we require all internet access from within our VPC to outside to go through a proxy. Obviously the instance was not configured to use the proxy (how could it be, it was from a public git repo), so all communications to the internet were being blocked, including calls to apt-get and the download of various configuration files directly from git.

Simple enough to fix, set the http_proxy and https_proxy environment variables with the appropriate value.

It was at this point that I also added a call to install the AWS CloudFormation components on the instance during initialisation, so that I could use cfn-signal to indicate failures. This at least gave me an indication of whether or not the instance had actually succeeded its initialization, without having to remote into the machine to look at the logs.

When working on CloudFormation templates, it’s always useful to have some sort of repeatable test that you can run in order to execute the template, ideally from the command line. You don’t want to have to go into the AWS Dashboard to do that sort of thing, and it’s good to have some tests outside the template itself to check its external API. As I was already executing the template through Powershell, it was a simple matter to include a Pester test that executed the template, checked that the outputs worked (the outputs being the Kibana and Logstash URLs) and then tore the whole thing down if everything passed.
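
The test I ended up with looks roughly like this (a Pester-style sketch only; the functions that execute the template and tear it down are stand-ins for our actual scripts, not real cmdlets):

Describe "Log Aggregator CloudFormation template" {
    It "creates an environment whose Kibana endpoint responds" {
        # New-LogAggregatorEnvironment / Remove-LogAggregatorEnvironment are hypothetical
        # wrappers around the CloudFormation execution scripts mentioned above.
        $environment = New-LogAggregatorEnvironment -Name "test"
        try
        {
            $response = Invoke-WebRequest $environment.KibanaUrl -UseBasicParsing
            $response.StatusCode | Should Be 200
        }
        finally
        {
            Remove-LogAggregatorEnvironment -Name "test"
        }
    }
}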

At this point I also tried to set up some CloudWatch logs that would automatically extract the contents of the various initialization log files to a common location, so that I could view them from the AWS Dashboard when things were not going as expected. I did not, in the end, manage to get this working. The irony of needing a log aggregator to successfully set up a log aggregator was not lost on me.

Setting the environment variables fixed the majority of the proxy issues, but there was one other proxy related problem left that I didn’t pick up until much later. All of the Elasticsearch plugins were failing to install, for exactly the same reason. No proxy settings. Apparently Java does not read the system proxy settings (bad Java!) so I had to manually supply the proxy address and port to the call to the Elasticsearch plugin installation script.

The initialisation log now showed no errors, and everything appeared to be installed correctly.

But it still wasn’t working.

To Be Continued

Tune in next week for the thrilling conclusion, where I discover a bug caused by the specific combination of Ubuntu version and Java version, get fed up with the components being installed and start from scratch and then struggle with Nxlog in order to get some useful information into the stack.