
Something is eating all of the memory on some of our production API instances. I say something because it’s non-trivial to diagnose exactly what is eating the memory.

How is that even possible? Well, it’s eating the memory in such a way that the monitoring tools available (i.e. Task Manager, performance counters, etc.) are completely unable to say which process is the culprit. The processes don’t have the memory, at least not in any of their working sets, and the only way to notice that it is missing is that the amount of total available memory is dropping over time. It seems that the memory is accumulating in the non-paged memory pool.
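
For reference, you can watch the relevant numbers directly with the standard Windows memory counters; a minimal Powershell sketch (the sampling interval and sample count are illustrative) looks like this:

# Sample total available memory and the non-paged pool every 5 minutes for an hour.
Get-Counter -Counter "\Memory\Available MBytes", "\Memory\Pool Nonpaged Bytes" -SampleInterval 300 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }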

Ugh, non-paged pool memory leak. Not fun. Probably a device driver or something else equally low level.

As is usually the case with this sort of thing, I blame Logstash, hence the tag on this post, but I can’t really back that up.

Unfortunately, we have not yet identified the root cause. Instead, this post will talk about some things we did to run away screaming from the problem until we have time to investigate in depth. Sometimes you just have to make it work so that everyone can stop panicking long enough to form coherent thoughts.

Immediate Mitigation

First step, scheduled reboot for the affected boxes before they die. That maintains the level of service while we attempt to get to the bottom of the issue.

Easiest way to accomplish this? Calendar reminder for a few specific people in the organisation. Odds are at least one of those people will action the item, and everything will continue to work as expected from an external point of view.

The risks here are many and varied. What if everyone on the list expects that someone else on the list will do the thing? What if everyone is on holidays (Christmas is a particularly bad time for this), or sick? If the scheduled task lasts long enough, you also have to consider what will happen as people leave the organisation.

It’s a pretty bad sign if your immediate, manual mitigation step lasts long enough for the people involved to leave the organisation. Either you are bad at prioritising or you have some serious churn problems.

Engineers and Manual Tasks

The easiest way to get something automated is to assign a regular, manual task to an engineer, or group of engineers. There is nothing an engineer hates more than repeatedly doing the same thing on some schedule. The response? Automation.

In our case, we originally thought that the best way to automate this particular restart was a tag based system like the one we use for managing start and stop times for EC2 instances. The problem was, we didn’t want to restart all of the API instances inside the auto scaling group, just the oldest one (because it was the most likely to be closest to experiencing the problem). We didn’t want to get into a situation where we brought down the service because everything restarted at once.

Our next thought was to target the auto scaling group instead of the API instances themselves. On some regular interval, we could scale up to N + 1, then, after everything was good, scale back down to N. This would automatically terminate the oldest instance (because our termination policy was set to oldest first). Seems simple enough.
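
For reference, that oldest-first behaviour is just the termination policy on the auto scaling group; setting it from the CLI looks something like the following (the group name is a placeholder):

aws autoscaling update-auto-scaling-group --auto-scaling-group-name <ASG Name> --termination-policies "OldestInstance"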

Luckily, before we went too far down the “let’s write our own script” path, one of our operations team remembered that this functionality (scheduled scaling policies) was actually already a feature in AWS. Alas, it’s not exposed via the AWS management console (i.e. the website), but you can definitely create and manage the policies from the command line using the AWS CLI.

I’m not sure if you can use the equivalent AWS client libraries (like the Powershell cmdlets), but it’s definitely available in the CLI.

We created two policies: scale up to N + 1 at midnight, and then scale down to N at 0100. This acts as a recycle for the instances we are having problems with, and leverages no custom code or scripting. It’s just AWS functionality.

To create a schedule, assuming you have already configured the CLI, you can use the following snippet:

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleUp --auto-scaling-group-name <ASG Name> --recurrence "0 0 * * 1" --desired-capacity 3

This will create a scheduled action that sets the desired capacity to 3 for the specified Auto Scaling Group at midnight on every Monday (standard cron format; the only thing to remember is that it executes based on UTC time).
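
The matching scale-down action is the same command with a later recurrence and the normal capacity; assuming N is 2 (to match the capacity of 3 in the example above), it would look something like this:

aws autoscaling put-scheduled-update-group-action --scheduled-action-name ScaleDown --auto-scaling-group-name <ASG Name> --recurrence "0 1 * * 1" --desired-capacity 2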

Summary

I’m pretty disappointed that we still haven’t actually had a chance to really dig into what the root cause of the issue is. In all seriousness, I do actually blame Logstash, specifically its TCP output that we use to write to another Logstash endpoint as part of our log aggregation. We’ve had some issues with that plugin before, and it wouldn’t surprise me if there was some issue where it was not properly disposing of sockets or some other sort of low level object as part of its normal operation.

I worry that the automated solution that we put into place (to work around the issue by recycling) will probably remain in place for far longer than anyone wants it to. From a business point of view, what is the motivation to identify and solve the root cause when everything is working well enough, at least from an outside perspective?

Still, it’s better than having to manually recycle the instances ourselves.


The last post I made about our adventures with RavenDB outlined the plan: upgrade to RavenDB 3. First step? Take two copies of our production environment, leave one at RavenDB 2.5 and upgrade the other to RavenDB 3. Slam both with our load tests in parallel and then see which one has better performance by comparing the Kibana dashboard for each environment (it shows things like CPU usage, request latency, disk latency, etc.).

The hope was that RavenDB 3 would show lower resource usage and better performance all round using approximately the same set of data and for the same requests. This would give me enough confidence to upgrade our production instance and hopefully mitigate some of the issues we’ve been having.

Unfortunately, that’s not what happened.

Upgrade, Upgrade, Upgrade, Upgrade!

Actually upgrading to RavenDB 3 was painless. For RavenDB 2.5 we built a NuGet package that contains all of the necessary binaries and configuration, along with Powershell scripts that set up an IIS website and application pool automatically on deployment. RavenDB 3 works in a very similar way, so all I had to do was reconstruct the package so that it worked in the same way, except with the newer binaries. It was a little bit fiddly (primarily because of how we constructed the package the first time), but it was relatively easy.

Even better, the number of binaries and dependencies for RavenDB 3 is lower than for RavenDB 2.5, which is always nice to see. Overall I think the actual combined size may have increased, but it’s still nice to have a smaller number of files to manage.

Once I had the package built, all I had to do was deploy it to the appropriate environment using Octopus Deploy.

I did a simple document count check before and after and everything was fine; exactly the same number of documents was present (all ~100K of them).
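
For what it’s worth, the check itself was nothing fancy; something along these lines against the database statistics endpoint is enough (the server URL and database name are placeholders, and the exact property name may vary slightly between versions):

# Quick document count sanity check via the RavenDB database statistics endpoint.
# The server URL and database name are placeholders.
$stats = Invoke-RestMethod -Uri "http://localhost:8080/databases/<database>/stats"
$stats.CountOfDocuments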

Resource usage was nominal during this upgrade and basically non-existent afterwards.

Time to simulate some load.

What a Load

I’ve written previously about our usage of JMeter for load tests, so all I had to do was reuse the structure I already had in place. I recently did some refactoring in the area as well, so it was pretty fresh in my mind (I needed to extract some generic components from the load tests repository so that we could reuse them for other load tests). I set up a couple of JMeter worker environments in AWS and started the load tests.

Knowing what I do now I can see that the load tests that I originally put together don’t actually simulate the real load on the service. This was one of the reasons why our initial, intensive load testing did not find any of the issues with the backend that we found in production. I’d love to revisit the load profiles at some stage, but for now all I really needed was some traffic so that I could compare the different versions of the persistence store.

RavenDB 2.5 continued to do what it always did when the load tests were run. It worked just fine. Minimal memory and CPU usage, disk latency was low, all pretty standard.

RavenDB 3 ate all of the memory on the machine (16GB) over the first 10-15 minutes of the load tests. This caused disk thrashing on the system drive, which in turn annihilated performance and eventually the process crashed and restarted.

Not a good sign.

I’ve done this test a few times now (upgrade to 3, run load tests) and each time it does the same thing. Sometimes after the crash it starts working well (minimal resource usage, good performance), but sometimes even when it comes back from the crash it does the exact same thing again.

Time to call in the experts, i.e. the people that wrote the software.

Help! I Need Somebody

We don’t currently have a support contract with Hibernating Rhinos (the makers of RavenDB). The plan was to upgrade to RavenDB 3 (based on the assumption that it’s probably a better product), and if our problems persisted, to enter into a support contract for dedicated support.

Luckily, the guys at Hibernating Rhinos are pretty awesome and interact regularly with the community at the RavenDB Google Group.

I put together a massive post describing my current issue (mentioning the history of issues we’ve had to try and give some context), which you can find here.

The RavenDB guys responded pretty quickly (the same day in fact) and asked for some more information (understandably). I re-cloned the environment (to get a clean start) and did it again, except this time I was regularly extracting statistics from RavenDB (using the /stats and /admin/stats endpoints), as well as dumping the memory when it got high (using procdump) and using the export debug information functionality built into the new Raven Studio (which is so much better than the old studio that it’s not funny). I packaged all of this information together with the RavenDB log files and posted a response.
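
The collection itself was just a loop running alongside the load tests; a rough sketch of the sort of thing involved (the URL, paths, interval, memory threshold and process id are all illustrative) is below:

# Periodically capture RavenDB statistics and take a full memory dump when the process gets too big.
# The URL, file paths and the 12GB threshold are illustrative; under IIS the RavenDB process is the
# relevant w3wp worker, so the process id needs to be looked up accordingly.
$ravenProcessId = 1234
while ($true)
{
    Invoke-RestMethod -Uri "http://localhost:8080/stats" | ConvertTo-Json -Depth 10 | Add-Content "C:\temp\raven-stats.log"
    Invoke-RestMethod -Uri "http://localhost:8080/admin/stats" | ConvertTo-Json -Depth 10 | Add-Content "C:\temp\raven-admin-stats.log"

    $process = Get-Process -Id $ravenProcessId
    if ($process.WorkingSet64 -gt 12GB)
    {
        & "C:\tools\procdump.exe" -ma $ravenProcessId "C:\temp\raven_$((Get-Date).ToString("yyyyMMdd_HHmmss")).dmp"
    }

    Start-Sleep -Seconds 300
}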

While looking through that information, Oren Eini (the CEO/Founder of Hibernating Rhinos) noticed that there were a number of errors reported around not being able to find a Lucene.Net.dll file on the drive where I had placed the database files (we separated the database files from the libraries; the data lives on a large, high throughput volume while the libraries are just on the system drive). I don’t know why that file should be there, or how it should get there, but at least it was progress!

The Battle Continues

Alas, I haven’t managed to return to this particular problem just yet. The urgency has diminished somewhat (the service is generally running a lot better after the latest round of hardware upgrades), and I have been distracted by other things (our Octopus Deploy server slowing down our environment provisioning because it is underpowered), so it has fallen by the wayside.

However, I have plans to continue the investigation soon. Once I get to the root of the issue, I will likely make yet another post about RavenDB, hopefully summarising the entire situation and how it was fixed.

Software developers, perpetually hopeful…


The L in ELK stands for Logstash. This is not news.

When we put together the system we use for log aggregation, we needed to find a way to deal with Logstash such that it would fit into our relatively new (and still maturing) model for automated build and deployment. We put together a system that stores a deployable version of Logstash in Bitbucket, a build/package script written in Powershell that puts everything together into a Nuget package, TeamCity to listen for changes and then Octopus Deploy for managing the deployment.

The difficult part in this was that each place that required Logstash required a different configuration (because it was processing different files or transforming the data in different ways). There was a lot of commonality, but it was mostly in the actual installation and management of Logstash, rather than the configuration Logstash happened to be running at any particular point in time.

We have one git repository that contains everything necessary to run a copy of Logstash on a machine. This includes Logstash itself, the Java Runtime Environment (JRE) and a set of Powershell scripts that allow us to install Logstash as a Windows Service and choose which configuration it should be running. This meant that all of the configuration files for each of our Logstash deployments all lived in the same repository.

This worked okay for a little while, until we started adding more configuration files.

One Change Makes Many Work?

Suddenly, we had a bunch of Build Configurations in TeamCity triggering off changes to the one repository. They only triggered off changes to their own Logstash configurations at least, but they all triggered whenever we made changes to Logstash itself or to the scripts surrounding it. Technically, they were all building (and deploying) the same package (X.Logging.Logstash.nupkg), but each would generate a different version, and deploy to a different Octopus Deploy project. Luckily our versions are based off the current time, so it wasn’t like the version kept going up and down (because of the difference in the number of builds), but there was the occasional conflict when two TeamCity tasks just happened to build on two different Build Agents within a few seconds of each other (which would generate identical packages).

The bigger issue was that each package was over 100MB! Logstash is a good 80MB by itself and the JRE is another 40MB, so once you add in a MB or two for everything else, your package is huge.

Yes, technically we could deal with this issue by making sure Java and Logstash are installed on the target machines ahead of time, but I don’t like this from an isolation/self-reliance point of view. I want to be able to push a package with the minimal dependencies already in existence on the target machine, ideally just an Octopus tentacle and Powershell (for Windows boxes anyway). Anything else that is required should be contained within the package itself (or, in extreme cases, bootstrapped from somewhere after deployment (but that just moves the problem around slightly)).

Suddenly a checkin to the repository would gum up our CI build agents with a set of relatively unimportant tasks, stopping other builds from progressing and interrupting people’s work.

The easiest (but somewhat fiddly) solution was to split the concept of an installed version of Logstash from the configuration it happened to be running. With this approach we could deploy Logstash to a target machine once and then not have to pay the price of shifting that much data over the network every single time we wanted to alter the config. When we did want to upgrade Logstash, we could simply build a new version and have it deployed in the same way.

The plan was relatively simple. Create one repository for a deployable version of Logstash by itself (making sure to generalise it enough such that you could easily point it at any configuration you wanted) and then split out each configuration into a repository of its own. Whenever Logstash changes, it would be built and published to Octopus, but nothing would be deployed. Each configuration repository would be able to choose to upgrade to the new version (by changing the dependency in source) and then TeamCity would pick up the changes and run the normal build/package/deploy cycle for that configuration.

Executions Are Tiresome

As is almost always the case, coming up with the idea and its execution plan was a lot more fun than actually doing it.

The Logstash repository that we had already was pretty tightly coupled to the way it handled configuration. It actually used the current Octopus Project name during the installation to determine the configuration that it should be running, and each configuration really only consisted of a single Logstash conf file.

The first task was to generalise the Logstash installation, so that we could deploy it separately and then have the configuration projects use it via a known interface. Nothing particularly interesting here from a design standpoint, just a few Powershell functions: Execute-Logstash, Install-LogstashService (and its counterpart, Remove-LogstashService) and some helpers for generating configuration files based on templates (because sometimes during deployment you need to be able to substitute some deployment specific values into your configuration, like the AWS Instance Id).
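
As an illustration of the templating side, a minimal sketch of that kind of helper (the function name and the @@TOKEN@@ format are illustrative, not our exact implementation) might look like this:

function New-ConfigurationFromTemplate
{
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$templatePath,
        [Parameter(Mandatory=$true)]
        [string]$outputPath,
        [Parameter(Mandatory=$true)]
        [hashtable]$substitutions
    )

    # Read the template and replace each @@TOKEN@@ with its deployment specific value.
    $content = Get-Content $templatePath -Raw
    foreach ($key in $substitutions.Keys)
    {
        $content = $content.Replace("@@$key@@", $substitutions[$key])
    }
    Set-Content -Path $outputPath -Value $content
}

# Usage during deployment, e.g. substituting the AWS Instance Id into the Logstash configuration.
# New-ConfigurationFromTemplate -templatePath ".\logstash.conf.template" -outputPath ".\logstash.conf" -substitutions @{ "INSTANCE_ID" = $instanceId }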

The next task was taking one of the current configuration files and converting it into the new pattern, a repository of its own. This repository would need to contain everything necessary for the configuration of a Logstash instance, plus some tests to verify that the config file works as expected when given a set of known inputs.

It’s not overly easy to test a Logstash configuration, especially when it has never been tested before. Like all code, you need to be able to substitute certain values (like the location of the log files to read) and then figure out a way to measure the outcome, without changing the actual configuration too much. The approach I settled on was to parameterise the log locations like I mentioned above and to add an additional output during testing that wrote everything to a file. That way I could read the file in and check to make sure that it was outputting as many lines as I expected.
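
The verification side of that is straightforward; after a test run over a known input, something like the following (the paths and expected count are illustrative) does the check:

# After running Logstash over a known input with the extra file output enabled for testing,
# check that it emitted the expected number of events. Paths and counts are illustrative.
$expectedEventCount = 100
$actualEventCount = (Get-Content "C:\temp\logstash-test-output.log" | Measure-Object).Count
if ($actualEventCount -ne $expectedEventCount)
{
    throw "Expected [$expectedEventCount] events in the test output but found [$actualEventCount]"
}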

The last task was to rebuild the Octopus Deploy project for the configuration to deploy both the Logstash deployable component and the configuration and verify that it installed and ran correctly on deployment. The most difficult part here was that different versions of each component were required, so we had to extend our Octopus scripts to handle this properly (i.e. step 1 which deploys Logstash needs to know that it should deploy version 1.5.2, but step 2 needs to deploy version 1.0.15235.12789 of the actual configuration).

Conclusion

I really should have structured our Logstash deployments in this way from the start. It’s almost always better to separate configuration from application code, especially when you don’t control the application. Often you will find that configuration changes a lot more than the code does, and when the code itself is quite large (as is the case with Logstash and its dependencies) it can get quite painful shifting all of those bytes around for no real reason.

But, alas, you can never see into the future with any sort of clarity, and you need to be willing to change things that you’ve done in the past when problems are identified or a better solution comes along.

Can’t get sentimental about these things.


As part of my efforts in evaluating Raven 3 (as a replacement for Raven 2.5), I had to clone our production environment. The intent was that if I’m going to test whether the upgrade will work, I should definitely do it on the same data that is actually in production. Hopefully it will be better, but you should verify your assumptions regardless.

What I really wanted to do was clone the environment, somehow shunt a copy of all of the current traffic through to the clone (ideally with no impact to the real environment) and then contrast and compare the results. Of course, that’s not simple to accomplish (unless you plan for it from the start), so I had to compromise and just take a copy of the existing data, which acts as the baseline for our load tests. I really do want to get that replicated traffic concept going, but it’s going to take a while.

On the upside, cloning one of our environments is a completely hands-free affair. Everything is automated, from the shutting down of the existing environment (you can’t get a consistent snapshot of a volume while the machine using it is still writing to it) through to the creation of the new environment, all the way to the clone of the S3 bucket that we use to store binary data.

4 hours later, I had my clone.

That’s a hell of a long time. For that 4 hours, the actual production service was down (because it needs to be non-changing for the clone to be accurate). I mean, it was a scheduled downtime, so it happened at like midnight, and our service is really only used during business hours, but it’s still pretty bad.

Where did all the time go?

The S3 clone.

Cloning Myself is a Terrible Idea

Well, it wasn’t all S3 to be honest. At least 30 minutes of the clone was taken up by snapshotting the existing data volume and bringing up the new environment. AWS is great, but it still takes time for everything to initialize.

The remaining 3.5 hours was all S3 though.

Our binary data bucket is approximately 100GB with a little over a million files (mostly images). I know this now thanks to the new CloudWatch metrics that AWS provides for S3 buckets (which I’m pretty sure didn’t exist a few months ago).
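
For the curious, those numbers come from the daily storage metrics that S3 publishes to CloudWatch; querying them from the CLI looks something like the following (the bucket name and dates are placeholders):

aws cloudwatch get-metric-statistics --namespace AWS/S3 --metric-name NumberOfObjects --dimensions Name=BucketName,Value=<bucket> Name=StorageType,Value=AllStorageTypes --start-time 2016-01-01T00:00:00Z --end-time 2016-01-02T00:00:00Z --period 86400 --statistics Average
aws cloudwatch get-metric-statistics --namespace AWS/S3 --metric-name BucketSizeBytes --dimensions Name=BucketName,Value=<bucket> Name=StorageType,Value=StandardStorage --start-time 2016-01-01T00:00:00Z --end-time 2016-01-02T00:00:00Z --period 86400 --statistics Average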

I’m not doing anything fancy for the bucket clone, just using the AWS CLI and the s3 sync command, doing a bucket to bucket copy. I’m definitely not downloading and then reuploading the files or anything crazy like that, so maybe it just takes that long to copy that much data through S3?

There’s got to be a better way!

They Would Fight

When you have what looks like a task that is slow because it’s just one thing doing it, the typical approach is to try to make multiple things do it, all at the same time, i.e. parallelise it.

So that’s where I started. All of our environment setup/clone is written in Powershell (using either the AWS Powershell Cmdlets or the AWS CLI), so my first thought was “How can I parallelize in Powershell?”

Unsurprisingly, I’m not the only one who thought that, so in the tradition of good software developers everywhere, I used someone else's code.

At that Github link you can find a function called Invoke-Parallel, which pretty much does exactly what I wanted. It creates a worker pool that pulls from a list of work up to some maximum number of concurrent operations. What was the pool of work though? Bucket prefixes.

Our binary data bucket works a lot like most S3 buckets; it uses keys that look a lot like file paths (even though that’s very much not how S3 works), with “/” as the path delimiter. It’s simple enough to get a list of prefixes in a bucket up to the first delimiter, so our body of work becomes that set. All you need to do then is write a script to copy over the bucket contents based on a given prefix, then supply that script to the Invoke-Parallel function.

function Clone-S3Bucket
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$sourceBucketName,
        [Parameter(Mandatory=$true)]
        [string]$destinationBucketName,
        [Parameter(Mandatory=$true)]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [string]$awsRegion,
        [switch]$parallelised=$false
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

    $awsCliExecutablePath = Get-AwsCliExecutablePath

    try
    {
        $old = Set-AwsCliCredentials $awsKey $awsSecret $awsRegion

        Write-Verbose "Cloning bucket [$sourceBucketName] to bucket [$destinationBucketName]"

        if ($parallelised)
        {
            # This is the only delimiter that will work properly with s3 cp due to the way it does recursion
            $delimiter = "/"
            $parallelisationThrottle = 10

            Write-Verbose "Querying bucket [$sourceBucketName] for prefixes to allow for parallelisation"
            $listResponseRaw = [string]::Join("", (& $awsCliExecutablePath s3api list-objects --bucket $sourceBucketName --output json --delimiter $delimiter))
            $listResponseObject = ConvertFrom-Json $listResponseRaw
            $prefixes = @($listResponseObject.CommonPrefixes | Select -ExpandProperty Prefix)

            . "$rootDirectoryPath\scripts\common\Functions-Parallelisation.ps1"

            if ($prefixes -ne $null)
            {
                Write-Verbose "Parallelising clone over [$($prefixes.Length)] prefixes"
                $copyRecursivelyScript = { 
                    Write-Verbose "S3 Copy by prefix [$_]";
                    $source = "s3://$sourceBucketName/$_"
                    $destination = "s3://$destinationBucketName/$_"
                    & $awsCliExecutablePath s3 cp $source $destination --recursive | Write-Debug 
                }

                $parallelOutput = Invoke-Parallel -InputObject $prefixes -ImportVariables -ScriptBlock $copyRecursivelyScript -Throttle $parallelisationThrottle -Quiet
            }
            else
            {
                Write-Verbose "No prefixes were found using delimiter [$delimiter]"
            }

            $keys = $listResponseObject.Contents | Select -ExpandProperty Key

            if ($keys -ne $null)
            {
                Write-Verbose "Parallelising clone over [$($keys.Length)] prefixes"
                $singleCopyScript = { 
                    Write-Verbose "S3 Copy by key [$_]";

                    $copyArguments = @()
                    $copyArguments += "s3"
                    $copyArguments += "cp"
                    $copyArguments += "s3://$sourceBucketName/$_"
                    $copyArguments += "s3://$destinationBucketName/$_"
                    & $awsCliExecutablePath @copyArguments | Write-Debug
                }

                $parallelOutput = Invoke-Parallel -InputObject $keys -ImportVariables -ScriptBlock $singleCopyScript -Throttle $parallelisationThrottle -Quiet
            }
        }
        else
        {
            (& $awsCliExecutablePath s3 sync s3://$sourceBucketName s3://$destinationBucketName) | Write-Debug
        }
    }
    finally
    {
        $old = Set-AwsCliCredentials $old.Key $old.Secret $old.Region
    }
}

There Can Be Only One

Now, as any developer knows, obviously my own implementation was going to be better than the one supplied by a team of unknown size who had worked on it for some unspecified length of time; the only question was just how much better it was going to be.

I already had a Powershell test for my bucket clone (from when I first wrote it to use the AWS CLI directly), so I tuned it up a little bit to seed a few hundred files (400 to be exact), evenly distributed into prefixed and non-prefixed keys. These files were then uploaded into a randomly generated bucket, and both my old code and the newer parallelised code were executed to clone that bucket into a new bucket.

Describe "Functions-AWS-S3.Clone-S3Bucket" -Tags @("RequiresCredentials") {
    Context "When supplied with two buckets that already exist, with some content in the source bucket" {
        It "Ensures that the content of the source bucket is available in the destination bucket" {
            $workingDirectoryPath = Get-UniqueTestWorkingDirectory
            $creds = Get-AwsCredentials
            $numberOfGeneratedFiles = 400
            $delimiter = "/"

            $sourceBucketName = "$bucketPrefix$([DateTime]::Now.ToString("yyyyMMdd.HHmmss"))"
            (New-S3Bucket -BucketName $sourceBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion) | Write-Verbose
            
            . "$rootDirectoryPath\scripts\common\Functions-Parallelisation.ps1"

            $aws = Get-AwsCliExecutablePath

            $old = Set-AwsCliCredentials $creds.AwsKey $creds.AwsSecret $creds.AwsRegion

            $fileCreation = {
                $i = $_
                $testFile = New-Item "$workingDirectoryPath\TestFile_$i.txt" -ItemType File -Force
                Set-Content $testFile "Some content with a value dependent on the loop iterator [$i]"
                $key = $testFile.Name
                if ($i % 2 -eq 0)
                {
                    $key = "sub" + $delimiter + $key
                }

                if ($i % 4 -eq 0)
                {
                    $key = (Get-Random -Maximum 5).ToString() + $delimiter + $key
                }

                & $aws s3 cp $testFile.FullName s3://$sourceBucketName/$key
            }

            Set-AwsCliCredentials $old.Key $old.Secret $old.Region

            1..$numberOfGeneratedFiles | Invoke-Parallel -ScriptBlock $fileCreation -ImportVariables -Throttle 10 -Quiet

            $destinationBucketName = "$bucketPrefix$([DateTime]::Now.ToString("yyyyMMdd.HHmmss"))"
            $destinationBucket = (New-S3Bucket -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion) | Write-Verbose

            try
            {
                $time = Measure-Command { Clone-S3Bucket -SourceBucketName $sourceBucketName -DestinationBucketName $destinationBucketName -AwsKey $creds.AwsKey -AwsSecret $creds.AwsSecret -AwsRegion $creds.AwsRegion -Parallelised }

                $contents = @(Get-S3Object -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion)

                $contents.Length | Should Be $numberOfGeneratedFiles
            }
            finally
            {
                try
                {
                    (Remove-S3Bucket -BucketName $sourceBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion -DeleteObjects -Force) | Write-Verbose
                }
                catch 
                {
                    Write-Warning "An error occurred while attempting to delete the bucket [$sourceBucketName]."
                    Write-Warning $_
                }

                try
                {
                    (Remove-S3Bucket -BucketName $destinationBucketName -AccessKey $creds.AwsKey -SecretKey $creds.AwsSecret -Region $creds.AwsRegion -DeleteObjects -Force) | Write-Verbose
                }
                catch
                {
                    Write-Warning "An error occurred while attempting to delete the bucket [$destinationBucketName]."
                    Write-Warning $_
                }
            }
        }
    }
}

The old code took 5 seconds. That’s forever!

The new code took 50 seconds!

Yup, 10 times slower.

A disheartening result, but not all that unexpected when I think about it.

The key point here, which I was unaware of, is that the AWS CLI sync is already multithreaded, running a number of requests in parallel to deal with exactly this issue. Trying to multitask within the same process gains very little, and in reality is actually worse, because the CLI is almost certainly much more highly optimised than my own Powershell based parallelisation code.
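
If more parallelism is needed, the right lever is probably the CLI’s own transfer configuration rather than wrapping it in Powershell; something like the following (the value is illustrative) increases the number of concurrent S3 requests it uses:

aws configure set default.s3.max_concurrent_requests 20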

Conclusion

Unfortunately I don’t yet have an amazing solution for cloning large S3 buckets. I’ll get back to it in the future, but for now I just have to accept that a clone of our production environment takes hours.

I think that if I were to use a series of workers (probably in AWS) that I could feed work to via a message queue (RabbitMQ, SQS, whatever) I could probably improve the clone speed, but that’s a hell of a lot of effort, so I’ll need to give it some more thought.

Another important takeaway from this experiment is that you should always measure the solutions you’ve implemented. There is no guarantee that your apparently awesome code is any better than something else, no matter how attached to it you might be.

Prove its awesomeness with numbers, and then, if it’s bad, let it die.


A few months ago we released a new service allowing our users to complete some of their work through a mobile application. For an application that is primarily locked to a computer within an office, it was a pretty big improvement. It’s not the first time we’ve tried to add this functionality, but it was one of the better attempts.

That is, until people really started hammering it. Then it went downhill kind of quickly.

Before I started working here, an architectural decision was made to use a document database to store the data for this mobile application. The idea was that the database would be a temporary store, almost like a waystation, that would allow two-way synchronization of the data between two applications (the mobile application and the client’s server). Conceptually, it’s not a bad design. It’s not necessarily the design I would have suggested, but it has merit.

The document database selected was RavenDB, specifically version 2.5.

The people who made that particular architectural decision are no longer with the company for various reasons, so it was up to my team and me to complete the work and actually release something. We did our best, and after a fairly lengthy development period followed by an equally lengthy beta period, we released into the wild. As I mentioned above, it started well, but didn’t seem to scale to the number of users we started seeing. I’m not talking hundreds of thousands of users either, just a few hundred, so it definitely wasn’t one of those problems where you are cursed by your own success.

The root cause for the performance problems? It appeared to be RavenDB.

An Unkindness

I always make the assumption that if a commercial component looks like it’s not holding up its end of the bargain, it’s probably not the component’s fault. It’s almost certainly the developers’ fault, because they either configured it wrong or generally did not understand it enough to know that they were using it in entirely the wrong way.

I think this is true for our problems with RavenDB, but I still don’t know exactly where we went wrong.

I’ll start at the beginning.

The first architectural design had two RavenDB instances in EC2 hidden behind a load balancer. They were configured to replicate to each other. This pair was reduced down to a single instance when we discovered that that particular structure was causing issues in the system (using my future knowledge, I now know that’s not how RavenDB does redundancy). The intention was that if load testing showed that we had issues with only one instance, we would revisit.

Our load testing picked up a bunch of problems with various things, but at no point was the RavenDB instance the bottleneck, so we assumed we would be okay.

Unfortunately, the load tests were flawed somehow, because once the system started to be used in anger, the RavenDB instance was definitely the bottleneck.

When we released, the database was initially hosted on a t2.medium. These instances are burstable (meaning their CPU can spike), but are limited by CPU credits. It became obvious very quickly that the database was consuming far more CPU credits than we expected (its CPU usage was averaging something like 80%), so we quickly shifted it to an m3.medium (which does not use CPU credits). This worked for a little while, but eventually we started experiencing performance issues again as usage increased. Another shift of the underlying hardware to an m4.large improved the performance somewhat, but not as much as we expected.

When we looked into the issue, we discovered a direct correlation between the latency of requests to the service and the disk latency of the data disk that RavenDB was using for storage. What followed was a series of adjustments to the data disk, mostly revolving around switching to provisioned IOPS and then slowly scaling it up until the latency of the disk seemed to no longer be the issue.
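
One easy way to eyeball that disk latency directly on the instance is via the standard logical disk counters; a minimal sketch (the drive letter and sampling parameters are placeholders) looks like this:

# Sample average read/write latency (seconds per operation) on the RavenDB data disk.
Get-Counter -Counter "\LogicalDisk(D:)\Avg. Disk sec/Read", "\LogicalDisk(D:)\Avg. Disk sec/Write" -SampleInterval 5 -MaxSamples 60 |
    ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }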

But we still had performance problems, and at this point the business was (rightly) getting a bit antsy, because users were starting to notice.

After investigation, the new cause of the performance problems seemed to be paging. Specifically the RavenDB process was consuming more memory than was available and was paging to the very slow system drive. Scaling the underlying instance up to an m4.xlarge (for more memory and compute) alleviated this particular problem.

We had a number of other issues as well:

  • Because we host RavenDB in IIS, the default application pool recycle that occurs every 29 hours eventually started happening during our peak times, which didn’t end well. We now schedule the restart for early in the morning (a sketch of the relevant settings follows this list). This was made somewhat more difficult by the fact that RavenDB can’t handle overlapping processes (which IIS uses to avoid downtime during restarts).
  • We’ve had the RavenDB process crash from time to time. IIS handles this (by automatically restarting the process), but there is still a period of downtime while the whole thing heats up again.
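
For reference, both of those IIS tweaks (a fixed early-morning recycle and no overlapped recycle) are just application pool settings; a sketch using the WebAdministration module (the pool name is a placeholder) looks like this:

# Replace the default 29 hour periodic recycle with a fixed early-morning schedule and
# disable overlapped recycling (RavenDB can't handle two processes running at once).
Import-Module WebAdministration

$poolPath = "IIS:\AppPools\RavenDB"
Set-ItemProperty $poolPath -Name recycling.periodicRestart.time -Value "00:00:00"
New-ItemProperty $poolPath -Name recycling.periodicRestart.schedule -Value @{value="03:00:00"}
Set-ItemProperty $poolPath -Name recycling.disallowOverlappingRotation -Value $true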

That brings us to the present. The service is running well enough, and is pretty performant, but it really does feel like we’ve thrown way too much power at it for what it accomplishes.

Where to Now?

Raven 2.5 is old. Very old.

Our next step is to upgrade to Raven 3, and then directly contrast and compare the performance of the two versions under similar load to see exactly what we’re getting ourselves into.

The logic behind the upgrade is that the newer version is far more likely to have better performance, and we’re far more likely to be able to easily get support for it.

Initial investigations show that the upgrade itself is relatively painless. The older Raven 2.5 client is completely compatible with the new server, so we don’t even need to upgrade the API components yet. All of the data appears to migrate perfectly fine (and seamlessly), so all we need to do is put some effort into comparing performance and then we should be sweet.

Secondarily, we’re going to be setting up at least one other Raven instance, primarily as a backup, but maybe also as a load balancing measure. I’ll have to look into it more before we figure out exactly what it’s capable of, but at the very least we need a replicated backup.

Summary

This post was more of an introduction into some of the issues that we’ve been having with the persistence of our service, but it does bring to light some interesting points.

Needing to support something in production is very different from just deciding to use it during development. There is a whole other set of tools and services that are required before you can successfully use something in production, and a completely different set of expertise and understanding required. Because the development process was so smooth (from the persistence point of view), we never really had to dig into the guts of Raven and really figure out what it was doing, so we were completely unprepared when everything went to hell during actual usage.

Trial by fire indeed.