
Now that I am feeling less like dying horribly from a terrible plague, it's time to continue talking about processing ELB logs into an ELK stack via AWS Lambda.

Last time I talked about our current situation, our motivations for wanting to switch to something better and the general plan moving forward.

To summarise the plan, the new log processor system needs 3 main parts:

  • The Lambda function code, which is written in Javascript, and would need to parse an ELB log file one line at a time, break it up, convert it into a JSON structure and then push it to a Logstash TCP endpoint to be written into Elasticsearch
  • The configuration/creation of the Lambda function, which would need to be done via CloudFormation as part of our normal environment setup (i.e. spin up an environment, it comes with the Lambda function that will process any ELB logs encountered)
  • The deployment of the code to the Lambda function via Octopus (because that’s how we roll)

I’m basically going to do a short blog post on each of those pieces of work, and maybe one at the end to tie it all together.

With that in mind, let's talk Javascript.

Unforeseen Consequences

When you’re writing Lambda functions, Javascript (via Node.js) is probably your best bet. Sure, you can run Java or Python (and maybe one day C# using .NET Core), but Javascript is almost certainly going to be the easiest. It's what we chose when we put together the faster S3 clone prototype, and while the fundamentally asynchronous nature of Node.js took some getting used to, it worked well enough.

When it comes to Javascript, I don’t personally write a lot of it. If I’m writing a server side component, I’m probably going to pick C# as my language of choice (with all its fancy features like “a compiler” and “type safety”), and I don’t find myself writing things like websites or small Javascript applications very often, if at all. My team definitely writes websites though, and we use React.js to do it, so it's not like Javascript is an entirely foreign concept.

For the purposes of reading in and parsing an ELB log file via a Lambda function, we didn’t need a particularly complex piece of Javascript. Something that reads the specified file from S3 after the Lambda function triggers, something to process the contents of that file line by line, something to parse and format those lines in a way that a Logstash input will accept, and something to push that JSON payload to the Logstash listener over raw TCP.

Without further ado, I give you the completed script:

'use strict';

let aws = require('aws-sdk');
let s3 = new aws.S3({ apiVersion: '2006-03-01' });
let readline = require('readline');
let net = require('net');

const _type = 'logs';
const _sourceModuleName = 'ELB';
const _logHost = '#{LogHost}';
const _logPort = #{LogPort};
const _environment = '#{Octopus.Environment.Name}';
const _component = '#{Component}';
const _application = '#{Application}';

function postToLogstash(entry){
    console.log("INFO: Posting event to logstash...");

    var socket = net.createConnection(_logPort, _logHost);
    var message = JSON.stringify(entry) + "\n";
    socket.write(message);
    socket.end();

    console.log("INFO: Posting to logstash...done");
}

exports.handler = (event, context, callback) => {
    console.log('INFO: Retrieving log from S3 bucket...');

    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key
    };

    const reader = readline.createInterface({
        input: s3.getObject(params).createReadStream()
    });

    const expectedColumns = 12;

    reader.on('line', function(line) {
        console.log("INFO: Parsing S3 line entry...");

        // Split on spaces that aren't inside a double-quoted section
        const columns = line.split(/ (?=(?:(?:[^"]*"){2})*[^"]*$)/);

        if(columns.length >= expectedColumns){
            var entry = {
                EventReceivedTime: columns[0],
                LoadBalancerName: columns[1],
                PublicIpAndPort: columns[2],
                InternalIpAndPort: columns[3],
                TimeToForwardRequest: parseFloat(columns[4]) * 1000,
                TimeTaken: parseFloat(columns[5]) * 1000,
                TimeToForwardResponse: parseFloat(columns[6]) * 1000,
                Status: columns[7],
                BackendStatus: columns[8],
                BytesUploadedFromClient: parseInt(columns[9]),
                BytesDownloadedByClient: parseInt(columns[10]),
                FullRequest: columns[11],
                Component: _component,
                SourceModuleName: _sourceModuleName,
                Environment: _environment,
                Application: _application,
                Type: _type
            };
            postToLogstash(entry);
        } else {
            console.log("ERROR: Invalid record length; was expecting " + expectedColumns + " but found " + columns.length);
            console.log('ERROR: -------');
            console.log(line);
            console.log('ERROR: -------');
        }
    });
};

Nothing too fancy.

In the interest of full disclosure, I did not write the script above. It was written by a few guys from my team initially as a proof of concept, then improved/hardened as a more controlled piece of work.

Office Complex

You might notice some strange variable names at the top of the script (i.e. #{Octopus.Environment.Name}).

We use Octopus Deploy for all of our deployments, and this includes the deployment of Lambda functions (which we package via Nuget and then deploy via the AWS Powershell Cmdlets/CLI inside Octopus). The #{NAME} notation is a way for Octopus to substitute variable values into files during deployment. This substitution is very useful, and can be scoped via a variety of things (like Environment, Machine, Role, etc), so by the time the script actually gets into AWS those variables are filled in with actual values.
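The substitution itself is conceptually simple. As a rough illustration (this is not Octopus code, and the variable names and values are just examples), the token replacement behaves something like this:

```javascript
// A minimal sketch of #{Name}-style variable substitution, for illustration only.
function substituteVariables(content, variables) {
    // Replace every #{Name} token with its value from the variable set,
    // leaving unknown tokens untouched.
    return content.replace(/#\{([^}]+)\}/g, function(match, name) {
        return variables.hasOwnProperty(name) ? variables[name] : match;
    });
}

// Example values; the real ones come from Octopus, scoped per environment.
const variables = {
    'LogHost': '10.0.0.10',
    'LogPort': '6379',
    'Octopus.Environment.Name': 'CI'
};

const line = "const _logHost = '#{LogHost}'";
console.log(substituteVariables(line, variables));
// → const _logHost = '10.0.0.10'
```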

Other than our use of Octopus variables, other things to note in this piece of Javascript are:

  • At no point does the function specify which credentials are being used to access the S3 bucket containing the ELB log files. This is because the Lambda function has been configured with an IAM role allowing it to access the required resources. The AWS Javascript library has built in support for running inside supported AWS contexts (like EC2 instances and Lambda functions), such that it can use the role applied to the context to get appropriate temporary credentials. This is a much better approach than using specific credentials in the script, which can result in unintended consequences if you aren’t eternally vigilant.
  • You need to make sure that your Lambda function is configured with an appropriate security group that allows it to use the expected outbound channel (i.e. make sure it can get to the Logstash host you’re trying to connect to). This was somewhat of an issue for us as our ELK stack is hosted inside another AWS account (our Ops account), so we had to make sure that all of the appropriate VPC peering was configured before it would work correctly. It can be a bit of a pain to set up the smallest possible surface area, but don’t be tempted to just configure the Lambda function to be able to do everything and go anywhere. Smaller surface area = more security, and the policy of least privilege is one you should always follow.
  • It's important, if you’re doing TCP communication in a Lambda function, to make sure you close your socket when you’re done with it, or the Lambda function might not exit. It might, but it also might not, and it can really throw you for a loop if you don’t know why.

To Be Continued

That’s it for this week. The Javascript I’ve included above is pretty generic (apart from the specific set of fields that we like to have in our log events) and will successfully process an ELB log file from S3 to a Logstash instance listening on a port of your choosing (probably 6379) when used in a Lambda function. Feel free to reuse it for your own purposes.
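For completeness, the Logstash side of this is just a TCP input. Something like the following should work — the port is whatever you chose above, and the json_lines codec is my assumption based on the newline-terminated JSON the script sends, so adjust to suit your pipeline:

```
input {
  tcp {
    port => 6379
    codec => json_lines
  }
}
```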

Next week I’ll continue this series of posts with information about how we use CloudFormation to setup our Lambda function as part of one of our environment definitions.

CloudFormation and Lambda aren’t the best of friends yet, so there is some interesting stuff that you have to be aware of.


We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.

A good example of getting better is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack with way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3) hosted on an EC2 instance to aggregate ELB logs from S3 into our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.

As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’ve successfully applied the same approach to at least 3 other environments we’ve set up since.

It wasn’t without its share of problems though:

  • The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept expending all of their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). Never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
  • For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
  • These were the machines most vulnerable to the memory leak in Logstash that causes its TCP driver to accumulate non-paged memory on Windows AWS instances due to some driver problem.

Good old Logstash.

To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.

Anomalous Materials

One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself, for each environment that we spun up. We did this in order to keep each log processor isolated from the other and to allow us to be able to spin up an entirely self-contained environment without having to worry about some common machine that processed all of the logs in a bunch of different buckets.

Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.

AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.

After some discussion, we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).

For those of you unfamiliar with Lambda, it's a service offered by AWS that lets you configure code to run in response to a variety of events. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.

In order to accomplish our goal, we would need to deal with 3 things:

  • The Lambda function code itself, which would need to parse an ELB log file and push its contents to our Logstash endpoint
  • The configuration/creation of the Lambda function via CloudFormation, as part of our normal environment setup
  • The deployment of the code to the Lambda function via Octopus

Nothing particularly insane there, but definitely a few things that we’d never done before.

To Be Continued

In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.

Next week I’ll continue, explaining the Javascript code that we put together to process the log files (its not particularly complicated) and how we configured the Lambda function by incorporating its setup into our environment setup.

Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.


Unfortunately for us, we had to move our TeamCity build server into another AWS account recently, which is never a pleasant experience, though it is often eye opening.

We had to do this for a number of reasons, but the top two were:

  • We’re consolidating some of our many AWS accounts, to more easily manage them across the organisation.
  • We recently sold part of our company, and one of the assets included in that sale was a domain name that we were using to host our TeamCity server on.

Not the end of the world, but annoying all the same.

Originally our goal was to do a complete refresh of TeamCity, copying it into a new AWS account, with a new URL and upgrading it to the latest version. We were already going to be disrupted for a few days, so we thought we might as well make it count. We’re using TeamCity 8, and the latest is 10, which is a pretty big jump, but we were hopeful that it would upgrade without a major incident.

I should have remembered that in software development, hope is for fools. After the upgrade to TeamCity 10, the server hung on initializing for long enough that we got tired of waiting (and I’m pretty patient).

So we abandoned the upgrade, settling for moving TeamCity and rehosting it at a different URL that we still had legal ownership of.

That went relatively well. We needed to adapt some of our existing security groups in order to correctly grant access to various resources from the new TeamCity Server/Build Agents, but nothing we hadn’t dealt with a hundred times before.

Our builds seemed to be working fine, compiling, running tests and uploading nuget packages to either MyGet or Octopus Deploy as necessary.

As we executed more and more builds though, some of them started to fail.

Failures Are Opportunities To Get Angry

All failing builds were stopping in the same place, when uploading the nuget package at the end of the process. Builds uploading to Octopus Deploy were fine (it's a server within the same AWS VPC, so that’s not surprising), but a random sampling of builds uploading packages to MyGet had issues.

Investigating, I found that the common theme of the failing builds was largish packages. Not huge, but at least 10 MB. The nuget push call would time out after 100 seconds, retrying a few times, but always experiencing the same issue.

With 26 MB of data required to be uploaded for one of our packages (13 MB package, 13 MB symbols, probably should optimize that), this meant that the total upload speed we were getting was < 300 KBps, which is pretty ridiculously low for something literally inside a data centre.
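The arithmetic behind that figure is straightforward. A quick sanity check (in Node.js, purely illustrative):

```javascript
// Sanity check on the observed upload throughput:
// 26 MB of package data hitting the 100 second nuget push timeout.
const totalBytes = 26 * 1024 * 1024; // 13 MB package + 13 MB symbols
const timeoutSeconds = 100;          // the hard timeout nuget was enforcing

const kbps = totalBytes / timeoutSeconds / 1024;
console.log(kbps.toFixed(0)); // prints 266, i.e. < 300 KBps
```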

The strange thing was, we’d never had an issue with uploading large packages before. It wasn’t until we moved TeamCity and the Build Agents into a new AWS account that we started having problems.

Looking into the network configuration, the main differences I could determine were:

  • The old configuration used a proxy to get to the greater internet. Proxies are the devil, and I hate them, so when we moved into the new AWS account, we put NAT gateways in place instead. Invisible to applications, a NAT gateway is a far easier way to give internet access to machines that do not need to be exposed on the internet directly. 
  • Being a completely different AWS account means that there is a good chance those resources would be spun up on entirely different hardware. Our previous components were pretty long lived, so they had consistently been running on the same stuff for months.

At first I thought maybe the NAT gateway had some sort of upload limit, but uploading large payloads to other websites was incredibly fast. With no special rules in place for accessing the greater internet, the slow uploads to MyGet were an intensely annoying mystery.

There was another thing as well. We wrap our usages of Nuget.exe in Powershell functions, specifically to ensure we’re using the various settings consistently. One of the settings we set by default with each usage of push was the timeout. It wasn’t set to 100 seconds though, it was set to 600.

Bugs, Bugs, Bugs

A while back I had to upgrade to the latest Nuget 3.5 release candidate in order to get a fix for a bug that was stopping us from deploying empty files from a package. It's a long story, but it wasn’t something we could easily change. Unfortunately, the latest release candidate also has a regression in it where the timeout for the push is locked at 100 seconds, no matter what you do.

It's been fixed since, but there isn’t another release candidate yet.

Rolling back to a version that allows the timeout to work correctly stops the other thing from working.

That whole song and dance is how software feels sometimes.

With no obvious way to simply increase the timeout, and because all other traffic seemed to be perfectly fine, it was time to contact MyGet support.

They responded that it's something they’ve seen before, but they do not know the root cause. It appears to be an issue with the way that AWS is routing traffic to their Azure hosts. It doesn’t happen all the time, but when it does, it tanks performance. They suggested recycling the NAT gateway to potentially get it on new hardware (and thus give it a chance at getting access to better routes), but we tried that and it didn’t make a difference. We’ve since sent them some detailed Fiddler and network logs to help them diagnose the issue, but I wouldn’t be surprised if it was something completely out of their control.

On the upside, we did actually have a solution that was already working.

Our old proxy.

It hadn’t been shut down yet, so we configured the brand new shiny build agents to use the old proxy and lo and behold, packages uploaded in a reasonable time.

This at least unblocked our build pipeline so that other things could happen while we continue to investigate.

Conclusion

Disappointingly, that’s where this blog post ends. The solution that we put into place temporarily with the old proxy (and I really hate proxies) is a terrible hack, and we’re going to have to spend some significant effort fixing it properly, because if that proxy instance dies, we could be returned to exactly the same place without warning (if the underlying issue is something to do with routing that is out of our control).

Networking issues like the one I’ve described above are some of the most frustrating, especially because they can happen when you least expect it.

Not only are they basically unfathomable, there is very little you can do to actively fix the issue other than moving your traffic around trying to find a better spot.

Of course, we can also optimise the content of our packages to be as small as possible, hopefully making all of our artifacts small enough to be uploaded with the paltry amount of bandwidth available.

Having a fixed version of Nuget would be super nice as well.


A few months back I made a quick post about some automation that we put into place when running TeamCity Build Agents on spot-price instances in AWS. Long story short, we used EC2 userdata to automatically configure and register the Build Agent whenever a new spot instance was spun up, primarily as a way to deal with the instability in the spot price which was constantly nuking our machines.

The kicker in that particular post was that when we were editing the TeamCity Build Agent configuration, Powershell was changing the encoding of the file such that it looked perfectly normal at a glance, but the build agent was completely unable to read it. This led to some really confusing errors about things not being set when they were clearly set, and so on.

All in all, it was one of those problems that just make you hate software.

What does all of this have to do with this week's post?

Well, history has a weird habit of repeating itself in slightly different ways.

More Robots

As I said above, we’ve put in some effort to make sure that our TeamCity Build Agent AMIs can mostly take care of themselves on boot, if you’ve configured the appropriate userdata in EC2.

Unfortunately, each time we wanted a brand new instance (i.e. to add one to the pool or to recreate existing ones because we’d updated the underlying AMI) we still had to go into the AWS Management Dashboard and set it all up manually, which meant that we needed to remember to set the userdata from a template, making sure to replace the appropriate tokens.

Prone to failure.

Being that I had recently made some changes to the underlying AMI (to add Powershell 5, MSBuild 2015 and .NET Framework 4.5.1) I was going to have to do the manual work.

That’s no fun. Time to automate.

A little while later I had a relatively simple Powershell script scraped together that would spin up an EC2 instance (spot or on-demand) using our AMI, with all of our requirements in place (tagging, names, etc).

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2",
    [switch]$spot=$false,
    [int]$number
)

$here = Split-Path $script:MyInvocation.MyCommand.Path
. "$here\_Find-RootDirectory.ps1"

$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

. "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

Ensure-AwsPowershellFunctionsAvailable

$name = "[{team}]-[dev]-[teamcity]-[buildagent]-[$number]"

$token = [Guid]::NewGuid().ToString();

$userData = [System.IO.File]::ReadAllText("$rootDirectoryPath\scripts\buildagent\ec2-userdata-template.txt");
$userData = $userData.Replace("@@AUTH_TOKEN@@", $token);
$userData = $userData.Replace("@@NAME@@", $name);

$amiId = "{ami}";
$instanceProfile = "{instance-profile}"
$instanceType = "c3.large";
$subnetId = "{subnet}";
$securityGroupId = "{security-group}";
$keyPair = "{key-pair}";

if ($spot)
{
    $groupIdentifier = new-object Amazon.EC2.Model.GroupIdentifier;
    $groupIdentifier.GroupId = $securityGroupId;
    $name = "$name-[spot]"
    $params = @{
        "InstanceCount"=1;
        "AccessKey"=$awsKey;
        "SecretKey"=$awsSecret;
        "Region"=$awsRegion;
        "IamInstanceProfile_Arn"=$instanceProfile;
        "LaunchSpecification_InstanceType"=$instanceType;
        "LaunchSpecification_ImageId"=$amiId;
        "LaunchSpecification_KeyName"=$keyPair;
        "LaunchSpecification_AllSecurityGroup"=@($groupIdentifier);
        "LaunchSpecification_SubnetId"=$subnetId;
        "LaunchSpecification_UserData"=[System.Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes($userData));
        "SpotPrice"="0.238";
        "Type"="persistent";
    }
    $request = Request-EC2SpotInstance @params;
}
else
{
    . "$rootDirectoryPath\scripts\common\Functions-Aws-Ec2.ps1"

    $params = @{
        ImageId = $amiId;
        MinCount = "1";
        MaxCount = "1";
        KeyName = $keyPair;
        SecurityGroupId = $securityGroupId;
        InstanceType = $instanceType;
        SubnetId = $subnetId;
        InstanceProfile_Arn=$instanceProfile;
        UserData=$userData;
        EncodeUserData=$true;
    }

    $instance = New-AwsEc2Instance -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion -InstanceParameters $params -IsTemporary:$false -InstancePurpose "DEV"
    Tag-Ec2Instance -InstanceId $instance.InstanceId -Tags @{"Name"=$name;"auto:start"="0 8 ALL ALL 1-5";"auto:stop"="0 20 ALL ALL 1-5";} -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
}

Nothing special here. The script leverages some of our common scripts (partially available here) to do some of the work, like creating the EC2 instance itself and tagging it, but it's pretty much just a conditional and a bunch of parameter configuration.

On-Demand instances worked fine, spinning up the new Build Agent and registering it with TeamCity as expected, but for some reason instances created with the -Spot switch didn’t.

The spot request would be created, and the instance would be spawned as expected, but it would never configure itself as a Build Agent.

Thank God For Remote Desktop

As far as I could tell, instances created via either path were identical. Same AMI, same Security Groups, same VPC/Subnet, and so on.

Remoting onto the bad spot instances I could see that the Powershell script supplied as part of the instance userdata was not executing. In fact, it wasn’t even there at all. Typically, any Powershell script specified in userdata with the <powershell></powershell> tags is automatically downloaded by the EC2 Config Service on startup and placed inside C:/Program Files (x86)/Amazon/EC2ConfigService/Scripts/UserData.ps1, so it was really unusual for there to be nothing there even though I had clearly specified it.

I have run into this sort of thing before though, and the most common root cause is that someone (probably me) forgot to enable the re-execution of userdata when updating the AMI, but that couldn’t be the case this time, because the on-demand instances were working perfectly and they were all using the same image.

Checking the userdata from the instance itself (both via the AWS Management Dashboard and the local meta data service at http://169.254.169.254/latest/user-data) I could clearly see my Powershell script.

So why wasn’t it running?

It turns out that the primary difference between a spot request and an on-demand request is that you have to base64 encode the data yourself for the spot request (whereas the library takes care of it for the on-demand request). I knew this (as you can see in the code above), but what I didn’t know was that the EC2 Config Service is very particular about the character encoding of the underlying userdata. For the base64 conversion, I had elected to interpret the string as Unicode bytes, which meant that while everything looked fine after the round trip, the EC2 Config Service had no idea what was going on. Interpreting the string as UTF8 bytes before encoding it made everything work just fine.
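The same trap is easy to reproduce outside of Powershell. Here's a Node.js illustration (not the original deployment code) of why both encodings round-trip fine individually but produce completely different bytes — "Unicode" in .NET terms is UTF-16, which pads every ASCII character with a zero byte:

```javascript
// Demonstrating why the choice of encoding matters for base64-encoded userdata.
const userData = '<powershell>Write-Host "hello"</powershell>';

// "Unicode" in .NET terms is UTF-16 (little endian), i.e. 'utf16le' in Node.js.
const asUtf16 = Buffer.from(userData, 'utf16le').toString('base64');
const asUtf8 = Buffer.from(userData, 'utf8').toString('base64');

// Both round-trip fine if you decode with the same encoding you encoded with,
// which is why everything "looked fine" after the round trip...
console.log(Buffer.from(asUtf16, 'base64').toString('utf16le') === userData); // true
console.log(Buffer.from(asUtf8, 'base64').toString('utf8') === userData);     // true

// ...but the raw bytes handed to the EC2 Config Service are very different.
console.log(asUtf16 === asUtf8); // false
```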

Summary

This is another one of those cases that you run into in software development where it looks like something has made an assumption about its inputs, but hasn’t put the effort in to test that assumption before failing miserably. Just like with the TeamCity configuration file, the software required that the content be encoded as UTF8, but didn’t tell me when it wasn’t.

Or maybe it did? I couldn’t find anything in the normal places (the EC2 Config Service log files), but those files can get pretty big, so I might have overlooked it. AWS is a pretty well put together set of functionality, so it's unlikely that something as common as an encoding issue is completely unknown to them.

Regardless, this whole thing cost me a few hours that I could have spent doing something else.

Like shaving a different yak.


For anyone following the saga of my adventures with RavenDB, good news! It turns out it's much easier to run a RavenDB server on reasonable hardware when you just put less data in it. I cleaned out the massive chunk of abandoned data over the last few weeks and everything is running much better now.

That’s not what this blog post is about though, which I’m sure is disappointing at a fundamental level.

This post is a quick one about some of the fun times that we had setting up access to customer data for the business to use for analysis purposes. What data? Well, let's go back a few steps to set the stage.

Our long term strategy has been to free customer data stuck in on-premises databases so that the customer can easily use it remotely, in mobile applications and webpages, without having to have physical access to their database server (i.e. be in the office, or connected to their office network). This sort of strategy benefits both parties, because the customer gains access to new, useful services for when they are on the move (very important for real estate agents) and we get to develop new and exciting tools and services that leverage their data. Win-win.

Of course, everything we do involving this particular product is a stopgap until we migrate those customers to a completely cloud based offering, but that’s a while away yet, and we need to stay competitive in the meantime.

As part of the creation of one of our most recent services, we consolidated the customer data location in the Cloud, and so now we have a multi tenant database in AWS that contains a subset of all of the data produced by our customers. We built the database to act as the backend for a system that allows the customer to easily view their data remotely (read only access), but the wealth of information available in that repository piqued the interest of the rest of the business, mostly around using it to calculate statistics and comparison points across our entire customer base.

Now, as a rule of thumb, I’m not going to give anyone access to a production database in order to perform arbitrary, ad-hoc queries, no matter how hard they yell at me. There are a number of concerns that lead towards this mindset, but the most important one is that the database has been optimized to work best for the applications that run on it. It is not written with generic, ad-hoc intelligence queries in mind, and any such queries could potentially have an impact on the operation of the database for its primary purpose. The last thing I want is for someone to decide they want to calculate some heavy statistics over all of the data present, tying up resources that are necessary to answer queries that customers are actually asking. Maintaining quality of service is critically important.

However, the business desire is reasonable and real value could be delivered to the customer with any intelligence gathered.

So what were we to do?

Stop Copying Me

The good thing about working with AWS is that someone, somewhere has probably already tried to do what you’re trying to do, and if you’re really lucky, Amazon has already built in features to make doing the thing easy.

Such was the case with us.

An RDS read-replica neatly resolves all of my concerns. The data will be asynchronously copied from the master to the replica, allowing business intelligence queries to be performed with wild abandon without having to be concerned with affecting the customer experience. You do have to be aware of the eventually consistent nature of the replica, but that’s not as important when the queries being done aren’t likely to be time critical. Read-replicas can even be made publicly accessible (without affecting the master), allowing you to provision access to them without requiring a VPN connection or something similarly complicated.

Of course, if it was that easy, I wouldn’t have written a blog post about it.

Actually creating a read-replica is easy. We use CloudFormation to initialise our AWS resources, so it's a fairly simple matter to extend our existing template with another resource describing the replica. You can easily specify different security groups for the replica, so we can lock it down to be publicly resolvable but only accessible from approved IP addresses without too much trouble (you’ll have to provision a security group with the appropriate rules to allow traffic from your authorised IP addresses, either as part of the template, or as a parameter injected into the template).
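As a rough sketch, the extra resource looks something like the following. The resource names, instance class and references are illustrative, not our actual template:

```
"ReadReplica": {
    "Type": "AWS::RDS::DBInstance",
    "Properties": {
        "SourceDBInstanceIdentifier": { "Ref": "MasterDatabase" },
        "DBInstanceClass": "db.t2.medium",
        "PubliclyAccessible": true,
        "VPCSecurityGroups": [ { "Ref": "ReplicaSecurityGroup" } ]
    }
}
```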

There are some tricks and traps though.

If you want to mark a replica as publicly accessible (i.e. it gets a public IP address) you need to make sure you have DNS Resolution and DNS Hostnames enabled on the host VPC. Not a big deal to be honest, but I think DNS Hostnames default to Off, so something to watch out for. CloudFormation gives a nice error message in this case, so it’s easy to tell what to do.
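In CloudFormation terms, assuming your VPC is also defined in the template, that means setting a couple of properties on the VPC resource (CIDR block here is just an example):

```yaml
MyVPC:
  Type: "AWS::EC2::VPC"
  Properties:
    CidrBlock: 10.0.0.0/16
    EnableDnsSupport: true      # DNS Resolution
    EnableDnsHostnames: true    # DNS Hostnames (off by default for new VPCs)
```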

What’s not so easy is that if you have the standard public/private split of subnets (where a public subnet specifies the internet gateway for routing of all traffic and a private subnet either specifies nothing or a NAT) you must make sure to put your replica in the public subnets. I think this applies to any instance that is going to be given a public IP address. If you don’t do this, no traffic will be able to escape from the replica because the route table will try to push it through the NAT on the way out. This complicates things with the master RDS instance as well, because both replica and master must share the same subnet group, so the master must be placed in the public subnets as well.

With all the CloudFormation/AWS/RDS chicanery out of the way, you still need to manage access to the replica using the standard PostgreSQL mechanisms though.

The Devil Is In The Details

The good thing about PostgreSQL read replicas is that they don’t allow any changes at all, even when using the master account. They are fundamentally readonly, which is fantastic.
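You can verify this from the replica itself if you’re curious; PostgreSQL exposes a function that reports whether the server is in recovery (i.e. replaying changes from a master):

```sql
-- Run against the replica: returns true, confirming it is a read-only standby.
SELECT pg_is_in_recovery();

-- Any attempted write will fail with an error along the lines of
-- "cannot execute INSERT in a read-only transaction".
```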

There was no way that I was going to publicise the master password for the production RDS instance though, so I wanted to create a special user just for the rest of the business to access the replica at will, with as few permissions as possible.

Because of the aforementioned readonly-ness of the replica, you have to create the user inside the master instance, which will then propagate it across to the replica in time. When it comes to actually managing permissions for users in the PostgreSQL database though, it’s a little bit different to the RDBMS that I’m most familiar with, SQL Server. I don’t think it’s better or worse, it’s just different.

A PostgreSQL server hosts many databases, and each database hosts many schemas. Users, however, exist at the server level, so in order to manage access, you need to grant the user access to the databases, schemas and then tables (and sequences) inside each schema that you want them to be able to use.

At the time when our RDS instance is initialised, there are no databases, so we had to do this after the fact. We could provision the user and give it login/list database rights, but it couldn’t select anything from tables until we gave it access to those tables using the master user.
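For reference, provisioning the user itself (run against the master, using the same sort of placeholders as below) looks something like this:

```sql
-- Run against the master; the replica picks the new role up via replication.
CREATE ROLE {username} WITH LOGIN PASSWORD '{password}';

-- Allow the user to connect to the database of interest.
GRANT CONNECT ON DATABASE {database} TO {username};
```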

GRANT USAGE ON SCHEMA {schema} TO {username};

GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {username};

GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {username};

Granting access once is not enough though, because any additional tables created after the statement is executed will not be accessible. To fix that you have to alter the default privileges of the schema, granting the appropriate permissions for the user you are interested in.

ALTER DEFAULT PRIVILEGES IN SCHEMA {schema}
    GRANT SELECT ON TABLES TO {username};
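One caveat worth knowing about: ALTER DEFAULT PRIVILEGES only applies to objects created by the role that executes the statement. If your applications create their tables under a different role, you need the FOR ROLE form, something like this ({app_user} being whatever role your application connects as):

```sql
-- Default privileges are tracked per creating role, so target the role
-- that actually creates the tables, not the one running this statement.
ALTER DEFAULT PRIVILEGES FOR ROLE {app_user} IN SCHEMA {schema}
    GRANT SELECT ON TABLES TO {username};
```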

With all of that out of the way, we had our replica.

Conclusion

Thanks to AWS, creating and managing a read-replica is a relatively painless procedure. There are some tricks and traps along the way, but they are very much surmountable. It’s nice to be able to separate our concerns cleanly, and to have support for doing that at the infrastructure level.

I shudder to think how complicated something like this would have been to setup manually.

I really do hope AWS never goes full evil and decides to just triple or quadruple their prices though, because it would take months to years to replicate some of the things we’re doing in AWS now.

We’d probably just be screwed.