
A few months back I made a quick post about some automation that we put into place when running TeamCity Build Agents on spot-price instances in AWS. Long story short, we used EC2 userdata to automatically configure and register the Build Agent whenever a new spot instance was spun up, primarily as a way to deal with the instability in the spot price which was constantly nuking our machines.

The kicker in that particular post was that when we were editing the TeamCity Build Agent configuration, Powershell was changing the encoding of the file such that it looked perfectly normal at a glance, but the build agent was completely unable to read it. This led to some really confusing errors about things not being set when they were clearly set and so on.

All in all, it was one of those problems that just make you hate software.

What does all of this have to do with this week's post?

Well, history has a weird habit of repeating itself in slightly different ways.

More Robots

As I said above, we’ve put in some effort to make sure that our TeamCity Build Agent AMI’s can mostly take care of themselves on boot if you’ve configured the appropriate userdata in EC2.

Unfortunately, each time we wanted a brand new instance (i.e. to add one to the pool or to recreate existing ones because we’d updated the underlying AMI) we still had to go into the AWS Management Dashboard and set it all up manually, which meant that we needed to remember to set the userdata from a template, making sure to replace the appropriate tokens.

Prone to failure.

Being that I had recently made some changes to the underlying AMI (to add Powershell 5, MSBuild 2015 and .NET Framework 4.5.1) I was going to have to do the manual work.

That’s no fun. Time to automate.

A little while later I had a relatively simple Powershell script scraped together that would spin up an EC2 instance (spot or on-demand) using our AMI, with all of our requirements in place (tagging, names, etc).

[CmdletBinding()]
param
(
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsKey,
    [Parameter(Mandatory=$true)]
    [ValidateNotNullOrEmpty()]
    [string]$awsSecret,
    [string]$awsRegion="ap-southeast-2",
    [switch]$spot=$false,
    [int]$number
)

$here = Split-Path $script:MyInvocation.MyCommand.Path
. "$here\_Find-RootDirectory.ps1"

$rootDirectory = Find-RootDirectory $here
$rootDirectoryPath = $rootDirectory.FullName

. "$rootDirectoryPath\scripts\common\Functions-Aws.ps1"

Ensure-AwsPowershellFunctionsAvailable

$name = "[{team}]-[dev]-[teamcity]-[buildagent]-[$number]"

$token = [Guid]::NewGuid().ToString();

$userData = [System.IO.File]::ReadAllText("$rootDirectoryPath\scripts\buildagent\ec2-userdata-template.txt");
$userData = $userData.Replace("@@AUTH_TOKEN@@", $token);
$userData = $userData.Replace("@@NAME@@", $name);

$amiId = "{ami}";
$instanceProfile = "{instance-profile}"
$instanceType = "c3.large";
$subnetId = "{subnet}";
$securityGroupId = "{security-group}";
$keyPair = "{key-pair}";

if ($spot)
{
    $groupIdentifier = new-object Amazon.EC2.Model.GroupIdentifier;
    $groupIdentifier.GroupId = $securityGroupId;
    $name = "$name-[spot]"
    $params = @{
        "InstanceCount"=1;
        "AccessKey"=$awsKey;
        "SecretKey"=$awsSecret;
        "Region"=$awsRegion;
        "IamInstanceProfile_Arn"=$instanceProfile;
        "LaunchSpecification_InstanceType"=$instanceType;
        "LaunchSpecification_ImageId"=$amiId;
        "LaunchSpecification_KeyName"=$keyPair;
        "LaunchSpecification_AllSecurityGroup"=@($groupIdentifier);
        "LaunchSpecification_SubnetId"=$subnetId;
        "LaunchSpecification_UserData"=[System.Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes($userData));
        "SpotPrice"="0.238";
        "Type"="persistent";
    }
    $request = Request-EC2SpotInstance @params;
}
else
{
    . "$rootDirectoryPath\scripts\common\Functions-Aws-Ec2.ps1"

    $params = @{
        ImageId = $amiId;
        MinCount = "1";
        MaxCount = "1";
        KeyName = $keyPair;
        SecurityGroupId = $securityGroupId;
        InstanceType = $instanceType;
        SubnetId = $subnetId;
        InstanceProfile_Arn=$instanceProfile;
        UserData=$userData;
        EncodeUserData=$true;
    }

    $instance = New-AwsEc2Instance -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion -InstanceParameters $params -IsTemporary:$false -InstancePurpose "DEV"
    Tag-Ec2Instance -InstanceId $instance.InstanceId -Tags @{"Name"=$name;"auto:start"="0 8 ALL ALL 1-5";"auto:stop"="0 20 ALL ALL 1-5";} -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
}

Nothing special here. The script leverages some of our common scripts (partially available here) to do some of the work, like creating the EC2 instance itself and tagging it, but it's pretty much just an if statement and a bunch of parameter configuration.

On-Demand instances worked fine, spinning up the new Build Agent and registering it with TeamCity as expected, but for some reason instances created with the -Spot switch didn’t.

The spot request would be created, and the instance would be spawned as expected, but it would never configure itself as a Build Agent.

Thank God For Remote Desktop

As far as I could tell, instances created via either path were identical. Same AMI, same Security Groups, same VPC/Subnet, and so on.

Remoting onto the bad spot instances I could see that the Powershell script supplied as part of the instance userdata was not executing. In fact, it wasn’t even there at all. Typically, any Powershell script specified in userdata with the <powershell></powershell> tags is automatically downloaded by the EC2 Config Service on startup and placed inside C:/Program Files (x86)/Amazon/EC2ConfigService/Scripts/UserData.ps1, so it was really unusual for there to be nothing there even though I had clearly specified it.

I have run into this sort of thing before though, and the most common root cause is that someone (probably me) forgot to enable the re-execution of userdata when updating the AMI, but that couldn’t be the case this time, because the on-demand instances were working perfectly and they were all using the same image.

Checking the userdata from the instance itself (both via the AWS Management Dashboard and the local metadata service at http://169.254.169.254/latest/user-data) I could clearly see my Powershell script.

So why wasn’t it running?

It turns out that the primary difference between a spot request and an on-demand request is that you have to base64 encode the data yourself for the spot request (whereas the library takes care of it for the on-demand request). I knew this (as you can see in the code above), but what I didn’t know was that the EC2 Config Service is very particular about the character encoding of the underlying userdata. For the base64 conversion, I had elected to interpret the string as Unicode bytes, which meant that while everything looked fine after the round trip, the EC2 Config Service had no idea what was going on. Interpreting the string as UTF8 bytes before encoding it made everything work just fine.
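
Expressed in C# (the PowerShell script above calls these same .NET APIs), the difference between the broken and working versions looks something like this; the sample userdata string is just a placeholder:

using System;
using System.Text;

class UserDataEncodingExample
{
    static void Main()
    {
        // Placeholder userdata; the real thing comes from the template file in the script above.
        var userData = "<powershell>Write-Output 'configure the build agent here'</powershell>";

        // What the spot path originally did: treat the string as UTF-16 ("Unicode") bytes.
        // The round trip looks fine to a human, but the EC2 Config Service can't make sense of it.
        var broken = Convert.ToBase64String(Encoding.Unicode.GetBytes(userData));

        // What actually works: treat the string as UTF8 bytes before base64 encoding.
        var working = Convert.ToBase64String(Encoding.UTF8.GetBytes(userData));

        Console.WriteLine(broken);
        Console.WriteLine(working);
    }
}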

Summary

This is another one of those cases that you run into in software development where it looks like something has made an assumption about its inputs, but hasn’t put the effort in to test that assumption before failing miserably. Just like with the TeamCity configuration file, the software required that the content be encoded as UTF8, but didn’t tell me when it wasn’t.

Or maybe it did? I couldn’t find anything in the normal places (the EC2 Config Service log files), but those files can get pretty big, so I might have overlooked it. AWS is a pretty well put together set of functionality, so it's unlikely that something as common as an encoding issue is completely unknown to them.

Regardless, this whole thing cost me a few hours that I could have spent doing something else.

Like shaving a different yak.


It's been at least a few years since I’ve actually worked with a dedicated DBA. I’m sure the role still exists, but it seems like a lot of organisations mostly just expect their developers to be able to do anything at least passably well, especially with the seemingly inexorable merge of the development and operations responsibilities.

As a developer, I personally like the trend, because it moves me closer to the end-user and it gives me a greater sense of responsibility about the software I produce. If I’m going to be the one who has to support it and the one that gets woken up at 1 in the morning because production servers are melting down due to a poorly optimised data update algorithm, then you better believe I will be writing the most maintainable, supportable and reliable software that I can.

I mean, I would have done that anyway, but now I have even more of a personal stake in the situation.

Where I work now, we still have a dedicated operations department, but they are mostly responsible for managing the IT operations of the business as a whole, rather than the deployment specifics of any software that my team writes. They do still participate in our process in an advisory role (something for which I am perpetually grateful), but other than that, we’re pretty much on our own.

Anyway, the point that this rambling preface is slowly trying to arrive at is that we’ve had some teething issues with one of our more recent releases that I thought might be worth talking about.

Specifically, massive, unexpected read load on our relational databases during periods of high write activity.

Everything In a Box To The Write

The service at the core of the data freeing functionality that I’ve mentioned previously is our synchronization service.

Its only purpose is to facilitate the continual synchronization of data from on-premises databases to a central cloud repository, with the help of a locally installed application. I won’t go into too much detail about the actual algorithm in play, but it's mostly just batched updates to the cloud database at regular intervals through an HTTP API, where the batched data is obtained by leveraging the SQL Server row version concept.
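
As a rough illustration of the row version idea (the table, column and helper names here are invented, not our actual schema), each batch boils down to “give me everything changed since the last row version the cloud has seen”:

using System.Data;
using System.Data.SqlClient;

// Hypothetical sketch only: rowversion values increase monotonically on every change, so the next
// batch to push through the HTTP API is simply everything past the last version already uploaded.
public IDataReader ReadNextBatch(SqlConnection connection, byte[] lastUploadedVersion, int batchSize)
{
    var command = new SqlCommand(
        @"SELECT TOP (@batchSize) *
          FROM dbo.Customers
          WHERE RowVersion > @lastVersion
          ORDER BY RowVersion ASC", connection);
    command.Parameters.AddWithValue("@batchSize", batchSize);
    command.Parameters.Add("@lastVersion", SqlDbType.Timestamp).Value = lastUploadedVersion;
    return command.ExecuteReader();
}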

In the last few weeks, we’ve been deploying the on-premises application to customers in order to kick off the synchronization process, a chunk of customers at a time.

It went pretty well, releasing to each group and watching their data upload, until we hit some sort of breaking point and the Read IOPS of our backend RDS service rose to an unsustainable level.

We were seeing numbers in excess of 1500 for read IOPS (and 200ish for write), which is somewhat problematic: the database is relatively small (150GB), so it only has around 450 baseline IOPS split between reads and writes. Considering the way that AWS volumes work (guaranteed baseline, spike up to 3000 by consuming IO credits), the consumption rate we were seeing would leave us high and dry within an hour or two. We scaled a number of things to deal with the traffic (which eventually subsided as the initial flurry of data gave way to more incremental updates).

But where did the massive amount of read load come from?

To understand that, I’m going to have to outline a little bit more about the service itself.

The API for the service exposes an endpoint for each table we’re synchronizing, differentiated by customer identity. Something like the following:

https://{url}/{customer-identity}/tables/{table-name}

A GET to this endpoint returns a manifest of sorts, stating information about what data is currently available in the service for that customer-table combination, which is then used to decide what to upload. A POST allows for inserts and updates and a DELETE allows for either the removal of all data or the removal of a subset of data defined by the body.

Inside the service, the POST body is handled by iterating through the rows contained therein, and executing an add or update for each one using Entity Framework, saving as we go.

Profiling the resulting queries from the execution of a POST, we discovered that Entity Framework will always do a SELECT first on each row in question, in order to determine whether to run an INSERT or an UPDATE. Given that the number of these SELECT operations dwarfed the number of reads resulting from GET requests, we assumed that that was where the majority of the read load was coming from.
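
In rough strokes (entity and property names invented, not our actual code), the per-row handling looks something like the following, and it's the lookup at the top of the loop that generates all of those SELECTs:

// Illustrative EF6-style upsert loop: Find() issues a SELECT for every posted row before
// EF decides whether that row becomes an INSERT or an UPDATE.
foreach (var row in postedRows)
{
    var existing = context.CustomerData.Find(row.Id);           // SELECT ... WHERE Id = @id
    if (existing == null)
    {
        context.CustomerData.Add(row);                          // tracked as an INSERT
    }
    else
    {
        context.Entry(existing).CurrentValues.SetValues(row);   // tracked as an UPDATE
    }

    context.SaveChanges();                                      // saving as we go, one row at a time
}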

Now we just had to find out how to optimise it.

The EF Tradeoff

One of the really nice things when working with Entity Framework is that you don’t need to give too much thought to the resulting queries that it executes on the underlying data store. I mean, the whole point of the library is to reduce the visibility of those sorts of things, adding a nice abstraction layer over the top so you don’t have to deal with it.

When it comes to performance though, that is also one of the places where EF can hurt you the most.

For our purposes, EF increased the speed at which we could deliver this service (by removing some of the nastiness around talking to databases), but by being so far away from the proverbial metal, when it came time to try and deal with performance issues, there didn’t seem to be much that we could do.

We tried a few things with EF, including:

  • Just adding the entities, instead of add/update, which forced EF to insert and fail on primary key violations. This was faster, but a lot of our load is actually updates as opposed to pure inserts, so applied to our production environment it would have simply caused other performance issues (as a result of the constant exceptions).
  • Turning off the various entity tracking features in EF, as we didn’t need them for the relatively simple inserts we were doing. This helped a little bit, but the same pattern of SELECT, INSERT/UPDATE was still present afterwards, so it wouldn’t have resolved our read problem.
  • Updating to the very latest version of EF6 and Npgsql (just in case there had been some optimisations that we’d missed), but it was pretty much the same. We were not that far behind the curve anyway, so it wasn’t too surprising.
  • Batching our data updates under a single SaveChanges. This helped a fair amount, but would require us to change the way we handle update errors to remove the bad rows and try the update again rather than simply moving on. We don’t have any real statistics as to how many errors we actually get during updates, so this one was plausible, but would need more investigation.

Then we tried something completely different.

No EF.

Just pure ADO.NET insert calls, catching primary key violations and doing updates.

This approach was so much faster it wasn’t funny, but it left a bad taste in our mouths at multiple levels. For starters, we would have to write and maintain the SQL ourselves, which is something we’d rather not do (for various reasons, maintainability being one and database specific code being another). Even getting past that, we were still handling exceptions as part of normal application flow, which didn’t feel right, and would almost certainly lead to performance problems of its own, even if it was dodging the reads.

It was here that PostgreSQL came to our rescue.

Version 9.5 introduced upsert support in the form of INSERT ... ON CONFLICT ... DO UPDATE. Now we could leave the whole thing up to the database and move on with our lives. We’d still have to maintain the SQL ourselves (or write a generator of some sort), but that would be a small price to pay for a massive performance increase.
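
As a sketch of what that looks like from the C# side (the table, column and variable names are invented, and in practice we'd generate the SQL rather than hand-write it per table):

// Hypothetical upsert via plain ADO.NET/Npgsql against PostgreSQL 9.5+: the database resolves
// insert-vs-update itself, so there's no preliminary SELECT and no exception-based control flow.
// Assumes a unique constraint exists over (customer_id, row_id).
using (var connection = new NpgsqlConnection(connectionString))
{
    connection.Open();

    var command = new NpgsqlCommand(
        @"INSERT INTO customer_data (customer_id, row_id, payload)
          VALUES (@customerId, @rowId, @payload)
          ON CONFLICT (customer_id, row_id)
          DO UPDATE SET payload = EXCLUDED.payload;", connection);

    command.Parameters.AddWithValue("customerId", row.CustomerId);
    command.Parameters.AddWithValue("rowId", row.RowId);
    command.Parameters.AddWithValue("payload", row.Payload);

    command.ExecuteNonQuery();
}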

Unfortunately this sort of ends with an anti-climax, because we’re still in the process of implementing the raw ADO.NET, PostgreSQL specific UPSERT as an optimisation. Once I have more information I’ll follow up with another post filling in the end of the story.

In the meantime, we threw money at it by scaling up the RDS instance so it could keep more things in memory at once, which alleviated the issue, at least until we hit the next breaking point.

Conclusion

It's true that no plan survives contact with the enemy, even when the enemy isn’t particularly hostile. All of our data synchronization was a silent precursor to shipping some new cloud features, so no-one noticed when it exploded a few times, but it was still pretty depressing to see it be completely unable to handle the load we were throwing at it. I’m not surprised that Entity Framework doesn’t lead to optimal performance, even though it's not pleasant having that particular fact slap you in the face. We can still use EF for a lot of what we need to do, and benefit from all of its goodies, like database migrations and easy in-memory testing, but we definitely need to put some things in place to bypass EF when the situation calls for it.

When it comes to this particular issue, even though throwing money at a problem doesn’t really work for us in the long term (for obvious reasons), it was nice to have that option available until we had something better in place. We’re still not sure if what we’re planning on doing is going to make a massive difference, but I remain hopeful.

Of course, software development and hope is never a good combination.

Look where it gets us when it comes to estimates.


A number of people much smarter than me have stated that there are only two hard problems in Computer Science: naming things, cache invalidation and off-by-one errors. Based on my own experience, I agree completely with this sentiment, and it's one of the reasons why I always hesitate to incorporate caching into a system until it's absolutely necessary.

Unfortunately, absolutely necessary always comes much sooner than I would like.

Over the last few weeks we’ve been slowly deploying our data freeing functionality to our customers. The deployment occurs automatically without user involvement (we have a neat deployment pipeline) so that has been pretty painless, but we’ve had some teething issues on the server side as our traffic ramped up. We’re obviously hosting everything in AWS, so it's not like we can’t scale as necessary (and we have), but just throwing more power at something is a road we’ve been down before, and it never ends well. Plus, we’re software engineers, and when it looks like our code is slow or inefficient, it makes us sad.

The first bottleneck we ran into was our authentication service, and mitigating that particular problem is the focus of this post.

For background, we use a relatively simple token based approach to authentication. Consumers of our services are required to supply some credentials directly to our auth service to get one of these tokens, which can then be redeemed to authenticate against other services and to gain access to the resources the consumer is interested in. Each service knows that the auth service is the ultimate arbitrator for resolving those tokens, so they connect to the service as necessary for token validation and make authorization decisions based on the information returned.

A relatively naive approach to authentication, but it works for us.

It's pretty spammy though, and therein lies the root cause of the bottleneck.

I Already Know The Answer

One of the biggest factors at play here is that every single authenticated request (which is most of them) coming into any of our services needs to hit another service to validate the token before it can do anything else. It can’t even keep doing things in the background while the token is being validated, because if the auth fails, that would be a waste of effort (and a potential security risk if an error occurs and something leaks out).

On the upside, once a service has resolved a token, it's unlikely that the answer will change in the near future. We don’t even do token invalidation at this stage (no need), so the answer is actually going to be valid for the entire lifetime of the token.

You can probably already see where I’m going with this, but an easy optimization is to simply remember the resolution for each token in order to bypass calls to the auth service when that token is seen again. Why ask a question when you already know the answer? This works particularly well for us in the data synchronization case because a single token will be used for a flurry of calls.

Of course, now we’re in the caching space, and historically, implementing caching can increase the overall complexity of a system by a large amount. To alleviate some of the load on the auth service, we just want a simple in-memory cache. If a particular instance of the service has seen the token, use the cache, else validate the token and then cache the result. To keep it simple, and to deal with our specific usage pattern (the aforementioned flurry), we’re not even going to cache the token resolution for the entire lifetime of the token, just for the next hour.

We could write the cache ourselves, but that would be stupid. Caching IS a hard problem, especially once you get into invalidation. Being that caching has been a known problem for a while, there are a lot of different components available for you to use in your language of choice, so it's just a matter of picking one.

But why pick just one?

There Has To Be A Better Way

As we are using C#, CacheManager seems like a pretty good bet for avoiding the whole “pick a cache and stick with it” problem.

It's a nice abstraction over the top of many different caching providers (like the in-memory System.Runtime.Caching and the distributed Redis), so we can easily implement our cache using CacheManager and then change our caching to something more complicated later, without having to worry about breaking everything in between. It also simplifies the interfaces to those caches, and does a bunch of really smart work behind the scenes.

The entire cache is implemented in the following class:

public class InMemoryNancyAuthenticationContextCache : INancyAuthenticationContextCache
{
    private readonly ICacheManager<AuthenticationContext> _manager;

    public InMemoryNancyAuthenticationContextCache()
        : this(TimeSpan.FromHours(1))
    {

    }

    public InMemoryNancyAuthenticationContextCache(TimeSpan cachedItemsValidFor)
    {
        _manager = CacheFactory.Build<AuthenticationContext>(a => a.WithSystemRuntimeCacheHandle().WithExpiration(ExpirationMode.Absolute, cachedItemsValidFor));
    }

    public AuthenticationContext Get(string token)
    {
        return _manager.Get(token);
    }

    public void Insert(string token, AuthenticationContext authContext)
    {
        _manager.Put(token, authContext);
    }
}

Our common authentication library now checks whatever cache was injected into it before going off to the real auth service to validate tokens. I made sure to write a few tests to validate the caching behaviour (like expiration and eviction, validating a decrease in the number of calls to the auth provider when flooded with the same token and so on), and everything seems to be good.
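
The consuming side is deliberately simple; something along these lines (a simplified sketch, with the auth provider name standing in for the real dependency):

// Simplified sketch of the lookup path: check the injected cache first, only call the
// auth service on a miss, then remember the answer for subsequent requests with the same token.
public AuthenticationContext Resolve(string token)
{
    var cached = _cache.Get(token);
    if (cached != null)
    {
        return cached;
    }

    var resolved = _authProvider.Validate(token);   // the real call to the auth service (name assumed)
    if (resolved != null)
    {
        _cache.Insert(token, resolved);
    }

    return resolved;
}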

The one downside of using CacheManager (and the System.Runtime.Caching libraries) is that I’m putting a lot of trust in everything to “just work”. Performance testing will prove out whether or not there are any issues, but I can guarantee that if there are, they will be a massive pain to diagnose, just because our code is so many levels removed from the actual caching.

I sure do hope it does the right thing under stress.

Summary

Like I said at the start, I always hold off on implementing a cache for as long as I possibly can. Caching is a powerful tool to improve performance, but it can lead to some frustratingly hard to debug situations because of the transient nature of the information. No longer can you assume that you will be getting the most accurate information, which can make it harder to reason about execution paths and return values after something has actually happened. Of course, any decent system (databases, browsers, even disk access) usually has caching implemented to some degree, so I suppose you’re always dealing with the problem regardless of what your code actually does.

Good caching is completely invisible, feeding consumers appropriate information and mitigating performance problems without ever really showing its ugly side.

When it comes to implementing caching, it doesn’t make sense to write it yourself (which is true for a lot of things). Accept the fact that many people a lot smarter than you have probably already solved the problem and just use their work instead. The existence of CacheManager, which keeps our code loosely coupled to any specific cache implementation, was an unexpected bonus.

Now if only someone could solve the whole naming things problem.


For anyone following the saga of my adventures with RavenDB, good news! It turns out it's much easier to run a RavenDB server on reasonable hardware when you just put less data in it. I cleaned out the massive chunk of abandoned data over the last few weeks and everything is running much better now.

That’s not what this blog post is about though, which I’m sure is disappointing at a fundamental level.

This post is a quick one about some of the fun times that we had setting up access to customer data for the business to use for analysis purposes. What data? Well, let's go back a few steps to set the stage.

Our long term strategy has been to free customer data stuck in on-premises databases so that the customer can easily use it remotely, in mobile applications and webpages, without having to have physical access to their database server (i.e. be in the office, or connected to their office network). This sort of strategy benefits both parties, because the customer gains access to new, useful services for when they are on the move (very important for real estate agents) and we get to develop new and exciting tools and services that leverage their data. Win-win.

Of course, everything we do involving this particular product is a stopgap until we migrate those customers to a completely cloud based offering, but that’s a while away yet, and we need to stay competitive in the meantime.

As part of the creation of one of our most recent services, we consolidated the customer data location in the Cloud, and so now we have a multi-tenant database in AWS that contains a subset of all of the data produced by our customers. We built the database to act as the backend for a system that allows the customer to easily view their data remotely (read only access), but the wealth of information available in that repository piqued the interest of the rest of the business, mostly around using it to calculate statistics and comparison points across our entire customer base.

Now, as a rule of thumb, I’m not going to give anyone access to a production database in order to perform arbitrary, ad-hoc queries, no matter how hard they yell at me. There are a number of concerns that lead towards this mindset, but the most important one is that the database has been optimized to work best for the applications that run on it. It is not written with generic, ad-hoc intelligence queries in mind, and any such queries could potentially have an impact on the operation of the database for its primary purpose. The last thing I want is for someone to decide they want to calculate some heavy statistics over all of the data present, tying up resources that are necessary to answer queries that customers are actually asking. Maintaining quality of service is critically important.

However, the business desire is reasonable and real value could be delivered to the customer with any intelligence gathered.

So what were we to do?

Stop Copying Me

The good thing about working with AWS is that someone, somewhere has probably already tried to do what you’re trying to do, and if you’re really lucky, Amazon has already built in features to make doing the thing easy.

Such was the case with us.

An RDS read-replica neatly resolves all of my concerns. The data will be asynchronously copied from the master to the replica, allowing business intelligence queries to be performed with wild abandon without having to be concerned with affecting the customer experience. You do have to be aware of the eventually consistent nature of the replica, but that’s not as important when the queries being done aren’t likely to be time critical. Read-replicas can even be made publicly accessible (without affecting the master), allowing you to provision access to them without requiring a VPN connection or something similarly complicated.

Of course, if it was that easy, I wouldn’t have written a blog post about it.

Actually creating a read-replica is easy. We use CloudFormation to initialise our AWS resources, so it's a fairly simple matter to extend our existing template with another resource describing the replica. You can easily specify different security groups for the replica, so we can lock it down to be publicly resolvable but only accessible from approved IP addresses without too much trouble (you’ll have to provision a security group with the appropriate rules to allow traffic from your authorised IP addresses, either as part of the template, or as a parameter injected into the template).

There are some tricks and traps though.

If you want to mark a replica as publicly accessible (i.e. it gets a public IP address) you need to make sure you have DNS Resolution and DNS Hostnames enabled on the host VPC. Not a big deal to be honest, but I think DNS Hostnames default to Off, so something to watch out for. CloudFormation gives a nice error message in this case, so it's easy to tell what to do.

What’s not so easy is that if you have the standard public/private split of subnets (where a public subnet specifies the internet gateway for routing of all traffic and a private subnet either specifies nothing or a NAT) you must make sure to put your replica in the public subnets. I think this applies for any instance that is going to be given a public IP address. If you don’t do this, no traffic will be able to escape from the replica because the route table will try to push it through the NAT on the way out. This complicates things with the master RDS instance as well, because both replica and master must share the same subnet group, so the master must be placed in the public subnets as well.

With all the CloudFormation/AWS/RDS chicanery out of the way, you still need to manage access to the replica using the standard PostgreSQL mechanisms though.

The Devil Is In The Details

The good thing about PostgreSQL read replicas is that they don’t allow any changes at all, even if using the root account. They are fundamentally readonly, which is fantastic.

There was no way that I was going to publicise the master password for the production RDS instance though, so I wanted to create a special user just for the rest of the business to access the replica at will, with as few permissions as possible.

Because of the aforementioned readonly-ness of the replica, you have to create the user inside the master instance, which will then propagate it across to the replica in time. When it comes to actually managing permissions for users in the PostgreSQL database though, it's a little bit different to the RDBMS that I’m most familiar with, SQL Server. I don’t think it's better or worse, it's just different.

PostgreSQL servers host many databases, and each database hosts many schemas. Users, however, exist at the server level, so in order to manage access, you need to grant the user access to the databases, schemas and then tables (and sequences) inside that schema that you want them to be able to use.

At the time when our RDS instance is initialised, there are no databases, so we had to do this after the fact. We could provision the user and give it login/list database rights, but it couldn’t select anything from tables until we gave it access to those tables using the master user.

GRANT USAGE ON SCHEMA {schema} TO {username}

GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {username}

GRANT SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {username}

Granting access once is not enough though, because any additional tables created after the statement is executed will not be accessible. To fix that you have to alter the default privileges of the schema, granting the appropriate permissions for the user you are interested in.

ALTER DEFAULT PRIVILEGES IN SCHEMA {schema}
    GRANT SELECT ON TABLES TO {username}

With all of that out of the way, we had our replica.

Conclusion

Thanks to AWS, creating and managing a read-replica is a relatively painless procedure. There are some tricks and traps along the way, but they are very much surmountable. It's nice to be able to separate our concerns cleanly, and to have support for doing that at the infrastructure level.

I shudder to think how complicated something like this would have been to setup manually.

I really do hope AWS never goes full evil and decides to just triple or quadruple their prices though, because it would take months to years to replicate some of the things we’re doing in AWS now.

We’d probably just be screwed.


Welp, holidays finished, so back into software delivery I go.

We just finished a 3 day Hackathon, which was pretty fun. The team I was in worked on a proof of concept for providing business intelligence on our customers to internal users by connecting and correlating our many disparate data sources. It was a fairly simple React.js website backed by a C# API, which connected to a number of different systems including Microsoft Dynamics CRM, Elasticsearch (specifically, our log stack), some crazy financial database called Helpy (which I don’t even seem to be able to Google) and finally a PostgreSQL database containing customer information.

It was cool to see it all come together over just a few days, but I doubt I’ll give it its own blog post like I did for the faster S3 clone from the last Hackathon. Funnily enough, the hardest part was getting the damn thing deployed to an instance of IIS on a manually provisioned server inside our office. I’ve spent so much time using automated scripts that I wrote months ago in AWS to setup websites and APIs that I’ve completely forgotten how to just hack a webserver together.

Why didn’t I use the scripts you might ask? Well, I tried, and they really only work inside AWS on top of the AMI’s we’ve built up specifically for that purpose. A bit disappointing, but I got over it.

Hackathon fun aside, I had to deal with something else when I got back from my holidays, identifying and resolving a massive unexpected increase in Read IOPS usage for one of our production databases.

And for once, it had nothing to do with RavenDB.

Zero Hour

While I was on holidays, one of our production APIs started exhibiting some pretty degraded performance. It happened suddenly too; it was running fine one minute and then all of a sudden its latency spiked hard. Unfortunately I was not there for the initial investigation, so the first half of this post is going to be a second-hand account, with all the inherent inaccuracy that entails.

Looking into the issue, all queries to the underlying database (a PostgreSQL database hosted in RDS) were taking much longer than expected, but it wasn’t immediately obvious as to why. There had been no new deployments (so nothing had changed), and it didn’t look like the amount of traffic being received had drastically increased.

The RDS monitoring statistics showed the root cause of the performance problems. There had been a massive increase in the amount of Read IOPS being consumed by the database instance, from 300ish to 700ish. This had been happening for at least a few days before the performance went downhill, so the operative question was why it had become an issue all of a sudden.

Apparently, IO credits are a thing, like CPU credits.

With the increase in Read IOPS being sustained over many hours, the IO credits available to the instance were slowly being consumed, until they were all gone and performance tanked.

The way IO credits work, to my current understanding, is that when you provision an EBS volume, the size of the volume determines the minimum guaranteed IO throughput, at a rate of 3 IOPS per GB provisioned (unless you’ve specified provisioned IOPS, which is a whole different ballgame). The volume is capable of bursting to 3000 IOPS, but operating above the minimum will consume IO credits. Running out of IO credits means no more bursting, so if the performance of your service was reliant on pulling more than the minimum IO available, it will tank. Like ours did.
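
To put some rough numbers on it (using the published gp2 figures of 3 IOPS per GB baseline and a 5.4 million credit bucket; the volume size and consumption below are illustrative rather than our exact values):

// Back-of-the-envelope credit burn for a gp2 volume. All of the numbers are illustrative.
var volumeSizeGb = 100;
var baselineIops = 3 * volumeSizeGb;                        // 300 IOPS guaranteed
var sustainedIops = 700;                                    // what the instance was actually pulling
var creditBucket = 5400000.0;                               // the gp2 I/O credit bucket

var burnPerSecond = sustainedIops - baselineIops;           // credits consumed every second above baseline
var hoursUntilEmpty = creditBucket / burnPerSecond / 3600;  // 3.75 hours at these numbers

Console.WriteLine($"Roughly {hoursUntilEmpty:F2} hours of bursting before the credits run out");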

Apart from the unfortunate fact that unlike CPU credits, there is no way to monitor IO credits, the problem was relatively easy to solve. Either increase the volume size of the RDS instance (to increase the minimum IO performance) or leave the size the same, but switch to using provisioned IOPS. We opted for the cheaper option (bigger disk = more performance) and the performance stabilised.

This left us in a precarious position though, as we didn’t know what had caused the dramatic increase in Read IOPS.

Detective Agency

PostgreSQL has some pretty good logs, which is a relief, considering no equivalent tool to SQL Server Profiler exists to my knowledge.

Getting the logs for a PostgreSQL instance hosted in RDS isn’t particularly complicated. Either edit the existing parameter group associated with the instance, or create a new one with logging enabled and then associate it with the instance. The logging options we were interested in were log_statement and log_min_duration_statement, which represent what activities to log and how long a query needs to be running before it should be logged, respectively. More information about working with PostgreSQL logs in RDS can be found at the appropriate AWS webpage. Getting to the logs manually is pretty easy. They are available for download when looking at the RDS instance in the AWS console/dashboard.

The logs showed that there was a class of query that was constantly being executed to determine the most recent piece of information in the database for a subset of the data (i.e. a select top x from y where a, b). The only piece of software that can access the database is a C# API, and it uses Entity Framework 6 for all database communication, so the queries were monstrous, but once our attention had been drawn to the appropriate place, it was pretty easy to see some issues:

  • The data type on the first segmentation column was a string instead of an integer (only integer values were ever present in the column, and conceptually only integer values would ever be present)
  • The data type on the second segmentation column was a string instead of a guid (similar to the integer comment above)
  • The primary key on the table was a simple auto ID, whereas the actual primary key should have been a compound key with column a and b above, along with a third column
  • There did not appear to be any appropriate indexes on the segmentation columns

What all of this meant was that the most frequent query being executed on the database was incredibly sub-optimal. It was likely that because the most frequent query needed to do a table scan in order to get an answer, and because the RDS instance was only specified as a t2.micro, PostgreSQL was constantly churning memory, which would account for the high Read IOPS. The issue likely only occurred recently because we finally hit some threshold in traffic or database size where the churning manifested.

Fixing the problem was simple. Edit the EF models to change the types and specify appropriate indexes and add a new EF migration to be executed on the next deployment. The migration was a little more complex than usual because it had to upgrade data already in the database (for the wrong column types), but all in all, the fix didn’t take a huge amount of effort.
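
The shape of that migration was roughly as follows (EF6 syntax, with invented table and column names; as mentioned, the real migration also had to convert the existing string data as part of the type change):

using System.Data.Entity.Migrations;

// Rough shape of the corrective EF6 migration: fix the segmentation column types and add an
// index over them so the hot "most recent row for this segment" query stops table scanning.
public partial class FixSegmentationColumns : DbMigration
{
    public override void Up()
    {
        AlterColumn("public.SyncedRows", "SegmentA", c => c.Int(nullable: false));
        AlterColumn("public.SyncedRows", "SegmentB", c => c.Guid(nullable: false));
        CreateIndex("public.SyncedRows", new[] { "SegmentA", "SegmentB" }, name: "IX_SyncedRows_Segments");
    }

    public override void Down()
    {
        DropIndex("public.SyncedRows", "IX_SyncedRows_Segments");
        AlterColumn("public.SyncedRows", "SegmentB", c => c.String());
        AlterColumn("public.SyncedRows", "SegmentA", c => c.String());
    }
}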

Like good software developers though, we had to make sure that it actually did fix the problem without actually deploying to production.

Validate Me!

Thanks to previous efforts in traffic replication, I made the assumption that it would be relatively easy to replicate our production traffic into an updated version of the service.

I was wrong.

Our previous traffic replication had been accomplished using Gor replicating traffic from a single EC2 instance (a machine hosting RavenDB) to another similar instance. Pure HTTP traffic, no URLs, just IP addresses, and Gor had been manually installed and configured. I had started work on replicating traffic from our API instances to a URL of my choosing, but I hadn’t quite finished it before I went on holidays.

I needed to replicate traffic from a variable number of instances to a publicly accessible URL, and I really didn’t want to have to manually install it on each machine (and then have to do it again next time I wanted to replicate traffic).

Creating a deployable Nuget package containing Gor and some helper Powershell scripts wasn’t overly difficult, I just used a similar pattern to the one we have for our Logstash deployment package. After deploying it though, it wouldn’t work. When I say it wouldn’t work, I mean I didn’t get any feedback at all when running it, no matter how I ran it. The issue was that Gor running on windows is dependent on WinPCap being present in order to function at all. This is easy enough to accomplish when you’re doing a manual install (just go grab WinPCap and install it), but unfortunately much harder when you want to be able to deploy it to any old machine in a completely unattended fashion. WinPCap doesn’t supply a silent installer (why god why), so I was forced to install NMap silently on the machines as part of the Gor deployment. Why NMap? It also installs WinPCap silently, because they’ve customised it.

Not my greatest victory, but I’ll take it.

With Gor working, I still had an issue where even though Gor seemed to be aware of the traffic occurring on the machine (I could log it to stdout for example), no matter what I did I could not get the traffic to replicate to another service via a standard URL available to the greater internet. Long story short, the issue was that Gor (as of version 0.15.1) does not and cannot use a proxy. Being that these instances did not have public IP addresses (no need, they were hidden behind a load balancer), they had to go through a proxy to get anything meaningful done on the internet.

We’re switching to using NAT Gateways for all internet traffic from private instances, so in the long term, this won’t be an issue. For now, I have to live with the fact that proxies are a thing, so I was forced to create an internal load balancer pointing at the same EC2 instances as the public URL I was going to use.

With that in place, I finally managed to replicate the production traffic to a test service with the changes applied.

And it worked amazingly. With the changes in place, our Read IOPS dropped from 700ish to like 10. Much more efficient. The only cost was that the service had to be made unavailable to apply the database updates (because they involved the transformation of the column contents). That only took like 5 minutes though, so I just did it at night when nobody noticed or cared.

Conclusion

I suppose the conclusion of this entire journey is to never underestimate the impact that early design decisions can have on the long term health of a service, and to never undervalue detailed monitoring.

Honestly, we should have picked up this sort of thing much earlier than we did (probably through load testing), but it was good that we had the systems in place to react quickly when it did happen unexpectedly.

Another benefit of the whole thing is that it proved the value of our ability to spin up an entirely self contained environment as a mechanism for testing changes under relatively real circumstances. With the ability to replicate traffic at will now (thanks to the working Gor deployment), we’re in a much better position to try things out before we go live with them.