
On the back of the marathon series of posts around our AWS Lambda ELB Logs Processor, I’m just going to do another marathon. This time, I’m going to write about the data synchronization process that we use to facilitate Cloud integrations with our legacy application.

I’ve mentioned our long term strategy a few times in the past (the most recent being in a post about AWS RDS Database replicas in the latter half of 2016), but I’ll note it down here again. We want to free the customer data currently imprisoned in their on-premises databases, in order to allow for the production of helpful websites and mobile applications, while also providing easier integrations for third parties. This sort of approach is win-win: we get to develop and sell interesting and useful applications and services, and the customer gets the ability to do things on the run (which is important for real estate agents) without having to buy completely new software and migrate all of their data.

A pretty great idea, but the logistics of actually getting at the customer data are somewhat challenging.

The application responsible for generating all of the data is a VB6 application that is over 10 years old now. While it does have some modern .NET components, it’s not exactly the place you want to be building a synchronization process. Many copies of this application run in a single office, all connecting to a central SQL Server, so orchestrating them to act together towards a common goal is challenging. In addition to that, you run the risk of degrading the application’s performance for the average user if it is busy doing things in the background that they might not necessarily care about right now.

What you really need is a centralised piece of software installed in a server environment in each office, running all the time. It can be responsible for the things the individual application should not.

Luckily, we have exactly that.

Central Intelligence

Each office that has access to our legacy application typically also has a companion application installed in parallel to their SQL Server. This component features a really nicely put together plugin framework that allows us to remotely deploy and control software that facilitates Cloud integrations and other typical server tasks, like automated backups.

Historically, the motivation for the customer to install and configure this server component has been to gain access to Cloud integrations they want to use, or to the other server supported features I briefly mentioned. Unfortunately, there have been at least 3 plugins so far that accomplish the syncing of data in one form or another, usually written in such a way that they deal with only one specific Cloud integration. A good example of this is the plugin that is responsible for pushing property inspection information to a remote location to be used in a mobile application, and then pulling down updated information as it is changed.

These one-off plugins were useful, but each of them was specific enough that it was difficult to reuse the information they pushed up for other purposes.

The next time we had to create a Cloud integration, we decided to do it far more generically than we had before. We would implement a process that would sync the contents of the local database up to a centralised remote store, in a structure as close to the original as possible. This would leave us in a solid place to build whatever integrations we might want to create in the future. No need to write any more plugins for specific situations, just use the data that is already there.

As is always the case, while the goal might sound simple, the reality is vastly different.

Synchronize Your Watches Boys

When talking about synchronizing data, there are generally two parts to consider. The first is for all changes in the primary location to be represented in a secondary location, and the second is for all changes in the secondary location to be represented in the primary. For us, the primary is local/on-premises and the secondary is remote/cloud.

Being relatively sane people, and with business requirements that only required remote access to the data (i.e. no need to deal with remote changes), we could ignore the second part and just focus on one-way sync, from local to remote. Of course, we knew that eventually people would want to change the data remotely as well, so we never let that disquieting notion leave our minds, but for now we just had to get a process together that would make sure all local changes were represented remotely.

There were non-functional requirements too:

  • We had to make sure that the data was available in a timely fashion. Minutes is fine, hours is slow, days is ridiculous. We wanted a near real-time synchronization process, or as close to that as we could get.
  • We needed to make sure that the synchronization process did not adversely affect the usability of the application in any way. It should be as invisible to the end-user as possible.

Nothing too complicated.

The Best Version Of Me

Looking at the data, and keeping in mind that we wanted to keep the remote structure as close to the local one as possible, we decided to use a table row as the fundamental unit of the synchronization process. Working with these small units, the main thing we would need was an efficient way to detect differences between the local and remote stores, so that we could decide how to react.

A basic differencing engine can use something as simple as a constantly increasing version in order to determine whether location A is the same as location B, and luckily for us, this construct already existed in the database that we were working with. SQL Server tables can optionally have a rowversion column (ours is called RowVersion), which contains unique values that constantly increase with each change made in the database. What this means is that if I make a change to a row in Table A, that row will have a new RowVersion of, say, 6. If I then make a change to Table B, that row will be versioned 7, and so on and so forth. I don’t believe the number is guaranteed to increase a single element at a time, but it’s always higher.

RowVersion is not a timestamp in the traditional sense, but it does represent the abstract progression of time for a database that is being changed (i.e. each change is a “tick” in time).
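To make that a little more concrete, here is a minimal sketch of the sort of high-watermark check that a RowVersion column enables. It is illustrative only (the real process lives inside the server component described above, and the watermark would need to be persisted somewhere rather than held in memory); queryDatabase is a hypothetical helper.

var lastSyncedVersion = {}; // highest RowVersion already seen, per table

function getChangesSinceLastSync(tableName) {
    // Anything changed since the last check has a RowVersion above our watermark.
    var watermark = lastSyncedVersion[tableName] || 0;
    var changedRows = queryDatabase(
        "SELECT * FROM " + tableName + " WHERE RowVersion > @watermark ORDER BY RowVersion",
        { watermark: watermark }
    );

    if (changedRows.length > 0) {
        // Remember the highest version we have seen so the next check only picks up new changes.
        lastSyncedVersion[tableName] = changedRows[changedRows.length - 1].RowVersion;
    }

    return changedRows;
}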

With a mechanism in place to measure change on the local side of the equation, all that was left was to find a way to use that measurement to act accordingly whenever changes occurred.

Simple.

To Be Continued

This post is long enough already though, so it’s time to take a break until next week, when I will outline the basic synchronization process and then poke a bunch of holes in it, showing why it wasn’t quite good enough.


As I mentioned briefly last week, our AWS Lambda ELB Logs Processor did not quite work when we pushed it to production. This was a shame, because we were pretty sure we got it right this time. Ah well, every failure is a learning experience.

In this post I’m going to briefly talk about how we deployed the new logs processor, how we identified the issue (and what it was) and finally how we fixed it.

Colour Blind

In order to maximise our ability to adapt quickly, we try to deploy as frequently as possible, especially when it comes to services that we fully control (like websites and APIs). The most important thing to get right when you’re deploying frequently is the ability to do so without disturbing the end-user. Historically, deployments dodged this problem by simply happening during periods of low use, which for us means Sundays or late at night. I certainly don’t want to deal with that though, and I don’t want to delegate it to some poor sap, so instead we just make sure our change sets are small and easy to reason about, and our deployments happen in such a way that the service is never fully down while the deployment is occurring (easily accomplished in AWS with multiple deployment locations combined with rolling deployments).

For bigger changes though, we’ve started using blue/green deployments. Essentially this means having two completely separate environments active at the one time, with some top level mechanism for directing traffic to the appropriate one as necessary. For us this is a top level URL (like service.thing.com) which acts as a DNS alias for an environment specific URL (like prod-green-service.thing.com). We then use Route53 and Weighted DNS to direct traffic as we so desire.
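As a rough sketch of what that looks like (the domain names, zone id and weights below are all made up, and this is just an illustration of the mechanism rather than how we actually apply the change), the AWS SDK for Node.js can upsert a pair of weighted records like so:

var AWS = require('aws-sdk');
var route53 = new AWS.Route53();

function weightedRecord(identifier, target, weight) {
    return {
        Action: 'UPSERT',
        ResourceRecordSet: {
            Name: 'service.thing.com',
            Type: 'CNAME',
            SetIdentifier: identifier,   // distinguishes the records within the weighted set
            Weight: weight,              // relative share of DNS responses
            TTL: 60,
            ResourceRecords: [{ Value: target }]
        }
    };
}

route53.changeResourceRecordSets({
    HostedZoneId: 'Z123EXAMPLE',         // made up hosted zone id
    ChangeBatch: {
        Comment: 'Shift a small slice of traffic to the blue environment',
        Changes: [
            weightedRecord('blue', 'prod-blue-service.thing.com', 10),
            weightedRecord('green', 'prod-green-service.thing.com', 90)
        ]
    }
}, function (err, data) {
    if (err) { console.log('Failed to update weights:', err); }
    else { console.log('Change submitted:', data.ChangeInfo.Id); }
});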

For websites, blue/green deployments are trivial, assuming the backend services still work the same (or have already been deployed to be compatible with both versions). For the aforementioned backend services though, blue/green deployments can be challenging, especially when you have to take into account data persistence and continuity of service.

When it comes to whatever persistence layer is necessary, our newer services usually feature data storage in a separately maintained environment (like RDS instances, S3 buckets and so on), specifically to help us do blue/green deployments on parts of the architecture that can freely change without causing any issues with data. Some of our earlier services did not do this, and as such are problematic when it comes to blue/green deployments. In those cases we usually test the new system by using traffic replication and then resort to traditional downtime when it comes time to do the actual deployment.

Blue/green deployments have proved to be particularly useful for us because of the way we handle our environments as code.

To tie this all back in with the ELB logs processor, we used our environment package to create a blue environment for the service with the ELB logs processor in it (because our currently active one was green). Once the environment was up and running, we used weighted DNS inside Route53 to shunt a small amount of traffic to the blue environment to see if the ELB logs were being processed into our ELK stack correctly.

And they were, and the victory dance was done and everyone lived happily ever after.

Oh Yeah, Scale

Well, not quite.

The small amount of test traffic worked perfectly. All of the traffic was being represented inside our ELK stack as expected. Continuing on with the traditional blue/green deployment, we increased the amount of traffic hitting the new blue environment by increasing its weight in Route53.

Once we got to around 10% of the total traffic, things started to fall down. The new blue environment was handling the traffic perfectly well, but we were not seeing the expected number of log events from the ELB inside the ELK stack.

Looking at the CloudWatch logs for the Lambda function, it appeared that the function was simply taking longer to run than the default timeout allowed (which is 3 seconds), which was a silly oversight on our part. Considering it was trying to process a few megabytes of raw log data, it’s not surprising that it wasn’t getting through the entire thing.

AWS Lambda is billed based on the actual execution time combined with the resources allocated to the function, so the timeout basically represents the maximum amount of money you will be charged for each function execution. If you have an unbounded number of function executions (i.e. they occur on some unreliable trigger), then this can be very useful for limiting your potential costs. For our usage, we know that the ELB logs are generated approximately once every 5 minutes, so we’re pretty safe to set the timeout to the maximum (300 seconds) to give the function as much time as possible to process the log file.
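For reference, changing the timeout is a one liner with the AWS CLI (the function name here is made up):

aws lambda update-function-configuration --function-name elb-logs-processor --timeout 300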

With that small change in place more of the log file was being processed, but it still wasn’t processing the entire thing. The good news was that it was no longer timing out and terminating itself, but the bad news was that it was now just plain old crashing after processing some of the file.

No It’s Fine, I Don’t Need Those Connections Again

JavaScript Lambda functions automatically write their output to CloudWatch, which is really useful from a debugging and tracing point of view. I mean, I don’t know how it would work otherwise (because you literally cannot access the underlying operating system they run on), but it’s still nice that it just works out of the box.

In our case, the error was as follows:

Error: connect EMFILE {ip}:{port} - Local (undefined:undefined)
at Object.exports._errnoException (util.js:870:11)
at exports._exceptionWithHostPort (util.js:893:20)
at connect (net.js:843:14)
at net.js:985:7
at GetAddrInfoReqWrap.asyncCallback [as callback] (dns.js:63:16)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:82:10)

A small amount of digging showed that this error occurs when a TCP connection cannot be established to the specified IP over the specified port.

If you look back at the code that actually processes the ELB log file, the only place where a TCP connection is being made is after a line in the file has been processed and transformed into a JSON payload, ready to be pushed to our ELK stack via Logstash. Given that the error only manifests after part of the file has already been processed successfully, it looked like the issue was one of resource exhaustion.

The obvious candidate was that the Lambda function was simply trying to open too many TCP connections at once. This made sense based on my understanding of Node.js at the time, so we implemented a connection pooling mechanism to prevent it from occurring (i.e. instead of simply trying to establish a connection, it would try to draw one from a global pool with a limit and if one was not available, wait for a few moments until it was).

Because each connection was only required for a few moments, the solution would essentially throttle the processing to whatever limit we impose, hopefully dodging the perceived problem with too many parallel connections.

// Assumed module-level state; the original snippet did not show these declarations.
var net = require('net');

var _logHost = process.env.LOGSTASH_HOST;     // hypothetical configuration source
var _logPort = process.env.LOGSTASH_PORT;

var connectionPool = [];
var connectionCountLimit = 50;                // maximum number of concurrent connections
var waitForConnectionDuration = 100;          // milliseconds to wait before retrying

function getConnection(callback) {
    // If we are under the limit, open a new connection immediately.
    if (connectionPool.length < connectionCountLimit) {
        console.log('creating new connection');
        const newSocket = net.createConnection(_logPort, _logHost);
        connectionPool.push(newSocket);
        return callback(newSocket);
    }

    // Otherwise prune any connections that have already been closed...
    const activeConnections = connectionPool.filter(function (socket) {
        return !socket.destroyed;
    });
    if (activeConnections.length != connectionCountLimit) {
        connectionPool = activeConnections;
    }

    // ...and check again after a short wait.
    setTimeout(function () {
        getConnection(callback);
    }, waitForConnectionDuration);
}

function postToLogstash(connection) {
    return function (entry) {
        console.log("INFO: Posting event to logstash... ");
        // Rename the Timestamp field so it is picked up as the event time in the ELK stack.
        var message = JSON.stringify(entry) + "\n";
        message = message.replace("Timestamp", "@timestamp");
        // Write the event and close the connection once the write has been flushed.
        connection.write(message, function () {
            console.log("INFO: Posting to logstash...done");
            connection.end();
        });
    };
}
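For context, this is roughly how those two functions hang together for each log entry. The surrounding handler (covered in the earlier posts in this series) is responsible for parsing the ELB log file from S3 into individual entries, so parsedEntries below is just a placeholder.

// Each entry waits for a slot in the pool, then gets written to Logstash.
parsedEntries.forEach(function (entry) {
    getConnection(function (connection) {
        postToLogstash(connection)(entry);
    });
});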

Conclusion

I’d love to say that after we implemented the simple connection pooling, everything worked just fine and dandy and the victory dance was enjoyed by all.

And for once, I can!

With the simple connection pooling we implemented (which had a maximum connection count of like 50 in the first iteration), we managed to process an entire ELB log file from S3 without getting the error we were getting before. We still need to do some more investigation around whether or not we’re actually getting all of the messages we expect to get, but it’s looking good.

I’m still not entirely sure how the issue occurred though. Originally, because of my limited knowledge of Node.js, I thought that it was creating connections in parallel. It turns out that Node.js is not actually multi-threaded at all (at least as far as your own code is concerned), unlike some other, real programming languages (cough, C#, cough), so it couldn’t possibly have been opening a bunch of connections all at the same time as I pictured.

What it might have been doing, however, is opening up a lot of connections very quickly. Node is really good at making use of the typical dead time in an application, like when executing HTTP requests and waiting for results, or when opening a file and waiting for IO to return. It’s possible that each connection being opened was ceding control of the main execution pipeline for just long enough for another connection to be opened, and so on and so forth until the process ran out of file descriptors (which is exactly what EMFILE indicates).
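To illustrate, something like the following happily queues up thousands of sockets before a single one has finished connecting, because createConnection returns immediately and the actual connection happens asynchronously (the endpoint here is obviously made up):

// Illustration only: none of these connections have completed (or failed) by the
// time the loop finishes, so the process can exhaust its file descriptor limit,
// at which point further connection attempts start failing with EMFILE.
var net = require('net');

for (var i = 0; i < 10000; i++) {
    var socket = net.createConnection(12345, 'logstash.example.com');
    socket.on('error', function (err) {
        console.log('connection error:', err.code);
    });
}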

Another possibility is that there was actually a connection leak, and the simple connection pool alleviated it by reusing the connection objects instead of always creating new ones.

Regardless, it processes through large ELB log files without breaking, which is good enough for this post.


Where I work now, we have a position known as “Delivery Manager”. From an outside perspective, the role is a middle-management position, responsible for ensuring the smooth delivery of software to agreed upon budgets and timelines, as well as reducing the complexity and amount of information that the CTO has to deal with on a daily basis. Fairly standard stuff.

For most of my time here, I had a delivery manager. I liked him and I feel like his overall effect on the team was positive. He knew when to involve himself and when to get out of our way, though this was helped somewhat by the fact that he was located in a completely different city, so contact was sometimes sporadic, and he had to make sure that those times where we could get together were as useful as possible.

Unfortunately, a few months ago his position was made redundant for a variety of reasons, but I suspect the primary one was the location (with the organisation choosing to focus more effort in their Brisbane location).

Somewhat ironically, change is actually the only constant in life, so being able to adapt quickly is an important skill to have. If you were to choose just one thing to improve, I would recommend choosing the ability to adapt and function in whatever situation you find yourself. Hone that skill to a razor edge, because it will always be applicable.

For us, the initial assumption was that we would simply get another delivery manager, except this time actually situated in the same office as us. It was getting close to the end of the year though, so we were going to have to do without for a little while, at least until the new year.

Well, it’s the new year now, but the question we’re all asking is: do we really need one?

The Romans Tried It

Beyond developers, there are a few other roles present in my team:

  • The product owner, who is focused around what to deliver, when, and how that fits in with the greater business strategy.
  • The iteration manager, who is focused around the process of delivery and how to ensure that is as smooth and efficient as possible.
  • The technical lead, who is focused around how to deliver and ensuring that what is delivered can be maintained and improved/extended as necessary.

It feels like between those 3 roles, a delivery manager should be redundant. Working together, this triumvirate should be able to make decisions, provide whatever is necessary to any interested parties, and deal with the management overhead that automatically comes with people, like recruitment, evaluations, salary adjustments and so on (assuming they (as a group) are given the power to do so).

Each person in the triumvirate has their own area of speciality and should be able to bring different things to the table when it comes to steering the team towards the best possible outcome.

Decisions made by the triumvirate would be owned by the group, not the individual. Allowing the responsibility to be shared in this way means that there are less likely to be issues resulting from a singular person trying to protect themselves and the likelihood of poor decisions is lessened by the nature of the group being able to take into account more factors than a single person.

Ego and personal ambition are reduced, because all parties should have approximately equal power, so an individual should never be able to become powerful enough to control or subsume the group as a whole.

Finally, with an odd number of people, any and all disagreements should be able to be resolved via voting, assuming no-one abstains and the options have already been reduced to 2.

It Didn’t End Well

Of course, it’s not all puppies and roses. All 3 participants in a triumvirate need to be mature enough to operate towards a common goal in a highly co-operative way. Like most roles in an organisation, powerful egos are detrimental to getting anything done. The upside of a triumvirate is that the impact of a powerful, out of control ego is lessened, as the other two members should be able to act and mitigate the situation.

The other downside of a triumvirate is the speed at which decisions are able to be made. A singular person in a position of power is capable of making decisions very quickly, for both good and bad. 3 people working together are unlikely to be able to decide as quickly as 1, due to a number of factors including communication overhead and differing opinions. Again this is another reason why all participants in a triumvirate must be highly mature people without ego, willing to listen to information provided and quickly construct informed opinions about a situation without colouring them with their own internal biases. A pretty tall order, if we’re being honest.

Finally, to an outside observer, having 3 people performing a single role might look inefficient, especially if they ignore that those 3 people have a wealth of responsibilities beyond acting within the triumvirate.

Google Did Much Better

A leadership structure consisting of 3 equals working together is not a new concept.

The ancient Romans had at least 2 triumvirates, but to be honest, they did not end well. From the small amount of reading that I’ve done on the subject, it seems like one of the main reasons for their failure was the motivation behind their formation. In both cases, a triumvirate was established in order to avert a war over succession disputes, so every party in the triumvirate was still acting in their own best interests rather than in the best interests of the people they should have been serving. A valuable lesson that can easily be applied when forming a triumvirate in an organisation.

A successful triumvirate can be seen in the management structure of Google back in 2010. Larry Page, Sergey Brin and Eric Schmidt worked together as a team at the very highest level of the organisation in order to help turn it into the juggernaut it is today. Each one brought different skills to the table and provided different experiences and opinions that no doubt helped them to make the best decisions that they could make. Of course, the Google triumvirate is no more as of 2011, with Larry Page stepping up to head the organisation on his own, but whatever factors went into that particular change are not easy to discern from an external viewpoint. Still, much has been written about the triumvirate structure that helped to make Google what they are today.

Conclusion

At best, all of the above is an interesting thought exercise that occurred to me as a result of my current situation. After discovering that the idea was not new, and doing a little reading, it looks like the construction of a triumvirate as a leadership structure is relatively uncommon, at least as far as I could see when searching on the internet. Of the examples that I did find, only 1 seemed to have an overall positive effect on the parties in play (though Roman politics is probably significantly different to organisational politics).

Being that there are a lot of people out there who have probably given this a lot more thought than I have, combined with the lack of successful case studies, I have to imagine that I have overlooked or understated some of the downsides inherent in the structure.

I imagine that it’s the human side of the equation that I’ve understated, which is a common mistake of mine. For all my bluster and jokes about the nature of people, I mostly assume they are good at heart, willing to subsume their own ego in favour of a better outcome for all.

The concept is also very similar to a committee, and anyone who has ever interacted with a committee (especially in government) probably knows how ineffective they can be. Hopefully limiting the number of participants to 3 is enough to prevent that same level of inefficiency. The focus of any sort of structure like this needs to be on getting results in acceptable timeframes, something that I’ve never experienced from a traditional committee.

In summary, the idea seems solid on the surface, removing a layer from an organisational structure that honestly does not look like it needs to be there, but I might not be taking all the factors into account.

In contrast, writing software looks positively trivial.