
Working on a legacy system definitely has its challenges.

For example, it's very common for large amounts of important business logic to be encapsulated in the structures and features of the database. Usually this takes the form of things like stored procedures and functions, default values and triggers, which, when you put them all together, can provide a surprising amount of functionality for older applications.

While not ideal by today's standards, this sort of approach is not necessarily terrible. If the pattern was followed consistently, at least all of the logic is in the same place, and legacy apps tend to have these magical global database connections anyway, so you can always get to the DB whenever you need to.

That is, until you start adding additional functionality in a more recent programming language, and you want to follow good development practices.

Like automated tests.

What The EF

If you’re using Entity Framework on top of an already existing database and you have stored procedures that you want (need?) to leverage, you have a few options.

The first is to simply include the stored procedure or function when you use the Entity Data Model Wizard in Visual Studio. This will create a function on the DbContext to call the stored procedure, and map the result set into some set of entities. If you need to change the entity return type, you can do that too, all you have to do is make sure the property names line up. This approach is useful when your stored procedures represent business logic, like calculations or projections.
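
If you go down that road, the wizard generates a method on your derived DbContext for each imported procedure, so the calling side ends up looking something like this (the context, procedure and result type names here are hypothetical):

    using (var context = new LegacyDbContext())
    {
        // The generated function import executes the stored procedure and maps each
        // row of the result set onto the chosen entity or complex type.
        var arrears = context.CalculateArrears(customerId).ToList();
    }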

If the stored procedures in the database represent custom insert/update/delete functionality, then you can simply map the entity in question to its stored procedures. The default mapping statement will attempt to line everything up using a few naming conventions, but you also have the ability to override that behaviour and specify procedures and functions as necessary.
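
In EF6 this is done via the MapToStoredProcedures fluent API; a minimal sketch (entity and procedure names are hypothetical) might look like this:

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Convention based: expects procedures named Customer_Insert, Customer_Update and Customer_Delete.
        modelBuilder.Entity<Customer>().MapToStoredProcedures();

        // Or override the conventions and specify everything explicitly.
        modelBuilder.Entity<Order>().MapToStoredProcedures(s => s
            .Insert(i => i.HasName("usp_InsertOrder"))
            .Update(u => u.HasName("usp_UpdateOrder"))
            .Delete(d => d.HasName("usp_DeleteOrder")));
    }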

If you don’t want to encapsulate the usage of the stored procedures, you can also just use the SqlQuery and ExecuteSqlCommandAsync functions available on the DbContext.Database property, but that requires you to repeat the usage of magic strings (the stored procedure and function names) whenever you want to execute the functionality, so I don’t recommend it.
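
For completeness, that approach looks roughly like this (procedure and parameter names are made up), magic strings and all:

    // The procedure name has to be repeated at every call site, which is exactly the problem.
    var results = context.Database
        .SqlQuery<ArrearsResult>(
            "EXEC dbo.CalculateArrears @CustomerId",
            new SqlParameter("@CustomerId", customerId))
        .ToList();

    await context.Database.ExecuteSqlCommandAsync(
        "EXEC dbo.ArchiveCustomer @CustomerId",
        new SqlParameter("@CustomerId", customerId));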

So, in summary, it's all very possible, and it will all work, up until you want to test your code using an in-memory database.

Which is something we do all the time. 

In Loving Memory

To prevent us from having to take a direct dependency on the DbContext, we lean towards using factories.

There are a few reasons for this, but the main one is that it makes it far easier to reason about DbContext scope (you make a context, you destroy a context) and to limit potential concurrency issues within the DbContext itself. Our general approach is to have one factory for connecting to a real database (i.e. ExistingLegacyDatabaseDbContextFactory) and then another for testing (like an InMemoryDbContextFactory, using Effort). They both share an interface (usually just the IDbContextFactory<TContext> interface), which is taken as a dependency as necessary, and the correct factory is injected whenever the object graph is resolved using our IoC container of choice.
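
As a rough sketch (the context name is hypothetical, and the exact constructors depend on how your DbContext is defined), the two factories end up looking something like this:

    public class ExistingLegacyDatabaseDbContextFactory : IDbContextFactory<LegacyDbContext>
    {
        public LegacyDbContext Create()
        {
            // Connects to the real database via a normal connection string.
            return new LegacyDbContext("name=LegacyDatabase");
        }
    }

    public class InMemoryDbContextFactory : IDbContextFactory<LegacyDbContext>
    {
        public LegacyDbContext Create()
        {
            // Effort supplies a transient in-memory connection; the context just needs
            // a constructor that accepts an existing DbConnection.
            var connection = Effort.DbConnectionFactory.CreateTransient();
            return new LegacyDbContext(connection, contextOwnsConnection: true);
        }
    }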

Long story short, we’re still using the same DbContext, we just have different ways of creating it, giving us full control over the underlying provider at the dependency injection level.

When we want to use an in-memory database, Effort will create the appropriate structures for us using the entity mappings provided, but it can’t create the stored procedures because it doesn’t know anything about them (except maybe their names). Therefore, if we use any of the approaches I’ve outlined above, the in-memory database will be fundamentally broken depending on which bits you want to use.

This is one of the ways that Entity Framework and its database providers are something of a leaky abstraction, but that is a topic for another day.

This is pretty terrible for testing purposes, because sometimes the code will work, and sometimes it won’t.

But what else can we do?

Abstract Art

This is one of those nice cases where an abstraction actually comes to the rescue, instead of just making everything one level removed from what you care about and ten times harder to understand.

Each stored procedure and function can easily have an interface created for it, as they all take some set of parameters and return either nothing or some set of results.

We can then have two implementations, one which uses a database connection to execute the stored procedure/function directly, and another which replicates the same functionality through Linq or something similar (i.e. using the DbContext). We bind the interface to the first implementation when we’re running on top of a real database, and to the DbContext specific implementation when we’re not. If a function calls another function in the database, you can replicate the same approach by specifying the function as a dependency on the Linq implementation, which works rather nicely.
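
A stripped down sketch of the pattern, using a hypothetical arrears calculation (the interface, procedure, context and entity names are all illustrative):

    public interface ICalculateArrears
    {
        decimal Calculate(int customerId, DateTime asAt);
    }

    // Bound when running on top of a real database; executes the legacy procedure directly.
    public class StoredProcedureCalculateArrears : ICalculateArrears
    {
        private readonly IDbContextFactory<LegacyDbContext> _factory;

        public StoredProcedureCalculateArrears(IDbContextFactory<LegacyDbContext> factory)
        {
            _factory = factory;
        }

        public decimal Calculate(int customerId, DateTime asAt)
        {
            using (var context = _factory.Create())
            {
                return context.Database
                    .SqlQuery<decimal>(
                        "EXEC dbo.CalculateArrears @CustomerId, @AsAt",
                        new SqlParameter("@CustomerId", customerId),
                        new SqlParameter("@AsAt", asAt))
                    .Single();
            }
        }
    }

    // Bound when running on top of the in-memory database; replicates the same logic in Linq.
    public class LinqCalculateArrears : ICalculateArrears
    {
        private readonly IDbContextFactory<LegacyDbContext> _factory;

        public LinqCalculateArrears(IDbContextFactory<LegacyDbContext> factory)
        {
            _factory = factory;
        }

        public decimal Calculate(int customerId, DateTime asAt)
        {
            using (var context = _factory.Create())
            {
                return context.Invoices
                    .Where(i => i.CustomerId == customerId && !i.Paid && i.DueDate <= asAt)
                    .Select(i => (decimal?)i.Amount)
                    .Sum() ?? 0m;
            }
        }
    }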

Of course, this whole song and dance still leaves us in a situation where the tests might do different things because there is no guarantee that the Linq based stored procedure implementation is the same as the one programmed into SQL Server.

So we write tests that compare the results returned from both for identical inputs, trusting the legacy implementation when differences are discovered.
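
Roughly speaking (the test framework and names are illustrative), those comparison tests look like this:

    [Test]
    public void LinqImplementation_MatchesStoredProcedure_ForIdenticalInputs()
    {
        // Both implementations run against the same seeded real database;
        // the legacy stored procedure is treated as the source of truth.
        var legacy = new StoredProcedureCalculateArrears(realDatabaseFactory);
        var replacement = new LinqCalculateArrears(realDatabaseFactory);

        var expected = legacy.Calculate(customerId: 42, asAt: new DateTime(2017, 1, 1));
        var actual = replacement.Calculate(customerId: 42, asAt: new DateTime(2017, 1, 1));

        Assert.AreEqual(expected, actual);
    }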

Why bother at all though? I mean after everything is said and done, you now have two implementations to maintain instead of one, and more complexity to boot.

Other than the obvious case of “now we can write tests on an in-memory database that leverage stored procedures”, there are a few other factors in favour of this approach:

  • With a good abstraction in place, it's more obvious what is taking a dependency on the stored procedures in the database
  • With a solid Linq based implementation of the stored procedure, we can think about retiring them altogether, putting the logic where it belongs (in the domain)
  • We gain large amounts of knowledge around the legacy stored procedures while building and testing the replacement, which makes them less mysterious and dangerous
  • We have established a strong pattern for how to get at some of the older functionality from our new and shiny code, leaving less room for sloppy implementations

So from my point of view, the benefits outweigh the costs.

Conclusion

When trying to leverage stored procedures and functions programmed into a database, I recommend creating interfaces to abstract their usage. You are then free to provide implementations of said interfaces based on the underlying database provider, which feels a lot more flexible than just lumping the function execution into whatever structures EF provides for that purpose. The approach does end up adding some additional complexity and effort, but the ability to ensure that tests can run without requiring a real database (which is slow and painful) is valuable enough on its own, even if you ignore the other benefits.

Caveat: the approach probably wouldn't work as well if there isn't a good dependency injection system in place, but the general concept is sound regardless.

To echo my opening statement, working with legacy code definitely has its own unique set of challenges. It's nice in that way though, because solving those challenges can really make you think about how to provide a good solution within the boundaries and limitations that have already been established.

Like playing a game with a challenge mode enabled, except you get paid at the end.


Spurred on by this video, we’ve been discussing the concept of technical debt, so this post is going to be something of an opinion piece.

I mean, what does everyone need more than anything? Another opinion on the internet, that’s what.

To be harsh for a moment, at a high level I think that a lot of the time people use the concept of technical debt as a way to justify crappy engineering.

Of course, that is a ridiculous simplification, so to extrapolate further, let's look at three potential ways to classify a piece of code:

  • Technical debt
  • Terrible code
  • Everything else

The kicker is that a single piece of code might be classified into all three of those categories at different points in its life.

Blood Debt

Before we get into that though, let's dig into the classifications a bit.

For me, classifying something as technical debt implies a conscious, well founded software related decision was made, and that all of the parties involved understood the risks and ramifications of that decision, and agreed upon a plan to rectify it. Something like “we will implement this emailing mechanism in a synchronous way which will present a longer term scaling problem, if/when our total usage approaches a pre-defined number we will return to the area and change it to be asynchronous in order to better deal with the traffic”. Generally, these sorts of decisions trade delivery speed for a limitation of some sort (like pretty much all code), but the key difference is the communication and acceptance of that limitation.

Honestly, the definition of technical debt outlined above is so restrictive that it is almost academic.

Terrible code, on the other hand, is just that. You usually know it when you see it, but it's code that hurts you when you try to change or understand it.

For me, any single element from the following list is a solid indication of terrible code:

  • No automated tests
  • Poorly written tests
  • Constant and invasive duplication
  • Having to update all seven hundred places where a constructor is called when you change it
  • Massive classes with many responsibilities
  • Repeated or duplicated functionality
  • Poor versioning

I could go on and on, but the common thread is definitely “things that make change hard”. Keeping these sorts of things at bay requires consistent effort and discipline, especially in the face of delivery pressure, but they are all important for the long term health of a piece of software.

The last category is literally everything else. Not necessarily debt, but not necessarily terrible code either. In reality, this is probably the majority of code produced in the average organization, assuming they have software engineers that care even a little bit. This code is not optimal or perfectly architected, but it delivers real value right now, and does only what it needs to do, in a well engineered way. It almost certainly has limitations, but they are likely not widely communicated and understood.

Now, about that whole fluidity thing.

Fluid Dynamics

Over its lifetime (from creation to deletion), a single piece of software might meet the criteria to be classified into all of the definitions above.

Maybe it started its life as an acceptable implementation, then coding standards changed (i.e. got better) and it was considered terrible code. As time wore on, perhaps someone did some analysis, communicated its limitations and failings and put a plan into place to fix it up, thus mutating it into debt. Maybe a few short months later the organization went through a destructive change, that knowledge was lost, and the new developers just saw it as crappy code again. Hell, sometime after that maybe we all agreed that automated tests aren’t good practice and it was acceptable again.

It's all very fluid and dynamic, and altogether quite subjective, so why bother with classifications at all?

I suppose it's the same reason why you would bother trying to put together a ubiquitous language when interacting with a domain: so you can communicate more effectively with stakeholders.

There is value in identifying and communicating the technical limitations of a piece of software to a wider audience, allowing that information to be used to inform business decisions. Of course, this can be challenging, because the ramifications have to be in terms the audience will understand (and empathize with), or the message might be lost.

But this comes with the problem of safely accumulating and storing the knowledge so that it doesn't get lost, which is one of the reasons debt can mutate into terrible code over time. This requires a consistent application of discipline that I have yet to bear witness to. It's also very easy for this sort of debt register to just turn into “everything we don't like about the codebase”, which is not its intent at all.

Moving away from debt, drawing a solid line between what is acceptable and what is not (i.e. the terrible code definition) obviously has beneficial effects on the quality of the software, but what happens when that line moves, which it is almost certain to do as people learn and standards change? Does everything that was already written just become terrible, implying that it should be fixed immediately (because why draw a line if you tolerate things that cross it) or does it just automatically become debt? It seems unlikely that the appropriate amount of due diligence would occur to ensure the issues are understood, and even less likely that they would be actioned.

In the end, a classification system has to provide real value as an input to decision making, and shouldn't exist “just because”.

Conclusion

Being able to constantly and consistently deliver valuable software is hard, even without having to take into account what might happen in the future. I don’t know about you, but I certainly don’t have a solid history of anticipating future requirements.

Classifying code is a useful tool for discussion (and for planning), but in the end, if the code is causing pain for developers, or issues for clients, then it should be improved in whatever way necessary, as the need arises. There is no point in going back and fixing some technical debt in an arcane part of the system that is not actively being developed (even if it is actively being used); that's just wasted effort. Assuming it does what it's supposed to, of course.

The key is really to do as little as possible, but to do it while following good engineering practices. If you do that at least, you should find that you never reach a place where your development grinds to a halt because the system can’t adapt to changing circumstances.

Evolve or die isn’t a suggestion after all.


Date and Time data structures are always so fun and easy to use, aren’t they?

Nope.

They are, in fact, the devil. No matter how hard I try (and believe me, I’ve tried pretty hard) or what decisions I make, I always find myself facing subtle issues relating to time that go unnoticed until the problem is way harder to fix than it should be. Like say, after your customers have created millions and millions of entities with important timestamps attached to them that aren’t timezone aware.

More recently, we were bitten when we tried to sync fields representing a whole day (i.e. 1 January 2017, no time) through our data synchronization algorithm.

To our surprise, a different day came out of the other end, which was less than ideal.

Dates Are Delicious

During the initial analysis of the tables earmarked to be synced from client databases, we located quite a few fields containing dates with time information. That is, usage of the actual DateTime data structure in SQL Server. As most of us had been bitten in the past by subtle conversion bugs when playing with raw DateTimes in .NET, we made the decision to convert all DateTimes to DateTimeOffsets (i.e. 2017-01-01 00:00+10) at the earliest possible stage in the sync process, using the timezone information on the client server that the sync process was running on.

What we didn’t know was that some of the DateTime fields actually represented whole dates, and they were just represented as that day at midnight because there was no better structure available when they were initially created.

Unfortunately, converting whole days stored as DateTimes into DateTimeOffsets isn't actually the best idea, because an atomic day's representation should not change when you move into different timezones. For example, 1 January 2017 in Brisbane does not magically turn into 31 December 2016 22:00 just because you're in Western Australia. It's still 1 January 2017.

This is one of the weird and frustrating things about the difference between whole Dates and DateTimes. Technically speaking, a Date as explained above probably should be location aware, especially as the timezone differences get more extreme. The difference between WA and QLD is pretty academic, but there's a whole day between the US and Australia. If two users were to calculate something like rental arrears in two different places using whole dates, they would probably get two different numbers, which could lead to poor decisions. Of course, from a user's point of view, the last thing they would expect is to have one day turn into another, or to add a time representation to something they entered as a day using a calendar selector, so all told, it's confusing and terrible and I hate it.

If you want to get technical, the converted DateTimeOffset still represents the same instant in time, so as long as you know what the original offset was, you can use that to revert back to the original value (which is some day at midnight) without too much trouble, and then interpret it as necessary.
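
In .NET terms, the recovery looks something like this (values are illustrative):

    // The value that was synced: a whole day, converted using the local (+10) offset.
    var stored = new DateTimeOffset(2017, 1, 1, 0, 0, 0, TimeSpan.FromHours(10));

    // With the original offset in hand, the original day at midnight comes straight back.
    var originalDay = stored.ToOffset(TimeSpan.FromHours(10)).Date; // 1 Jan 2017

    // Interpret the same instant in another offset though, and the day shifts.
    var perth = stored.ToOffset(TimeSpan.FromHours(8)); // 31 Dec 2016 22:00 +08:00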

Of course, that’s when PostgreSQL decided to get involved.

Lost Time

A long time ago when we started the data synchronization project, we decided to use PostgreSQL as the remote store. We did this mostly because PostgreSQL was cheaper to run in AWS via RDS (the licensing costs for SQL Server in RDS were kind of nuts in comparison).

In hindsight, this was a terrible decision.

We might have saved raw money on a month to month basis, but we exposed ourselves to all sorts of problems inherent to the differences between the two database engines, not to mention the generally poor quality of the PostgreSQL tools, at least in comparison to SQL Server Management Studio.

Returning to the date and time discussion: we chose to use Entity Framework (via NPGSQL) as our interface to PostgreSQL, and to be honest we pretty much just trusted it to get the database schema right. All of our DateTimeOffsets got mapped to the PostgreSQL data structure timestamp_with_timezone, which looks like it's pretty much the same thing.

Except it's not. It's not the same at all. It actually loses data when storing a DateTimeOffset, and it does this by design.

In PostgreSQL terms, using a timestamp_with_timezone structure actually means “please automatically adjust the data I insert into this field using the given offset, and store it as UTC”. This makes sense, in a way, because strictly speaking the data still represents the same instant in time, and can be freely adjusted to the user's current offset as necessary (i.e. show me what the data looks like in +10).

Unfortunately, this approach means that the actual offset the data was inserted with is lost completely.

PostgreSQL has another data type called timestamp_without_timezone, but all it does is ignore the offset completely, while still stripping it out. Less than useful.

To summarise, here is the chain of events:

  • The user enters some data, representing a whole day: 1 Jan 2017
  • The system stores this data in SQL Server as a DateTime: 1 Jan 2017 00:00
  • Our sync process reads the data as a DateTimeOffset, using the local timezone: 1 Jan 2017 00:00 +10
  • The data is pushed to PostgreSQL and stored: 31 Dec 2016 14:00
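
The same chain, expressed as a quick .NET sketch:

    // What the legacy database holds: a whole day stored as a DateTime at midnight.
    var fromSqlServer = new DateTime(2017, 1, 1, 0, 0, 0);

    // What the sync process does: attach the local (+10) offset.
    var synced = new DateTimeOffset(fromSqlServer, TimeSpan.FromHours(10));

    // What timestamp_with_timezone effectively does on insert: normalise to UTC,
    // discarding the original offset in the process.
    var storedRemotely = synced.ToUniversalTime(); // 31 Dec 2016 14:00 +00:00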

Technically the data still represents the exact same point in time, but its actual meaning is now compromised. If anyone reads the remote data and assumes it's still just a date, they are now a full day off, which is terrible.

Daylight savings doesn’t help either, because now the offset is inconsistent, so in NSW sometimes you will see the date as the previous day at 1400 and sometimes at 1300.

I mean honestly, daylight savings doesn’t really help anyone anyway, but that’s a different story.

That Approach Is Sooooo Dated

For whole dates, the only realistic solution is to treat them exactly as they should be treated, as dates with no time component.

Completely obvious in retrospect.

Both SQL Server and PostgreSQL have a Date data type which does exactly what it says on the box and leaves no room for misinterpretation or weird conversion errors.

Of course, .NET still only has the good old DateTime and DateTimeOffset data types, so there is room for shenanigans there, but at least the storage on either end would be correct.
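
In EF6 terms, using the real Date type is just a column type mapping; a minimal sketch with hypothetical entity and property names:

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Store the whole-day field as a database date rather than a datetime,
        // so no time component (or offset) can sneak in on either end.
        modelBuilder.Entity<Tenancy>()
            .Property(t => t.LeaseStartDate)
            .HasColumnType("date");
    }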

For dates that actually do come with a time component (like an appointment), you have to really understand whether or not the time should be able to be interpreted in another timezone. Taking the appointment example, it might be reasonable to think that a company using your software product to track their appointments might exist in both NSW and QLD. The person executing the appointment would want to schedule it in their local time, but a remote administration person might want to know what time the appointment was in their own local time, so that they know not to transfer calls during it.

SQL Server is fine in this respect, because a DateTimeOffset is perfect, freely able to be converted between one timezone and another with no room for misinterpretation.
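
For example (values illustrative):

    // An appointment scheduled in Brisbane (+10).
    var appointment = new DateTimeOffset(2017, 1, 16, 9, 0, 0, TimeSpan.FromHours(10));

    // The same instant as seen by an administrator in Sydney during daylight savings (+11).
    var sydneyView = appointment.ToOffset(TimeSpan.FromHours(11)); // 16 Jan 2017 10:00 +11:00

    // Different representations, exact same instant.
    Console.WriteLine(appointment == sydneyView); // True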

In PostgreSQL, the timestamp_with_timezone data type might be good enough, assuming you don’t need to know with absolute certainty what the original offset was (and thus the time in the original creators context). If you do need to know that (maybe for business intelligence or analytics) you either need to know some information about the creator, or you should probably just use a timestamp data type, convert it to UTC yourself and store the original offset separately.

Conclusion

Representing time is hard in software. I’ve heard some people say this is because the fundamentally explicit structures that we have to have in software are simply unable to represent such a fluid and mutable construct, and I’m not sure if I disagree.

For the case I outlined above, we made some bad decisions with good goals (let's use DateTimeOffset, it's impossible to misinterpret!) that built on top of other people's bad decisions, but at least we understand the problem well enough now to potentially fix it.

Unfortunately, it's unlikely that the concept of dates and times in software is just going to go away, as it seems pretty important.

The best we can hope for is probably just for existence to end.

Someone would probably want to know when existence is going to end, taking daylight savings into account.


An ultra quick post this week, because I don’t have time to write a longer one.

It’s a bit of a blast from the past, so hold on to your socks, because I’m about to talk about Visual Basic 6.

A Necessary Evil

VB6 is the king (queen?) of legacy apps, at least in the desktop space.

It really does seem like no matter which company you're working for, they probably have a chunk of VB6 somewhere, and it's likely to be doing something critical to the business. It was originally written years and years ago, it's been passed through many different hands and teams over time, and for whatever reason it was never successfully replaced with something more modern and sustainable. Maybe the replacement projects failed miserably, maybe there was just no motivation to touch it; who knows.

For us, that frequently encountered chunk of VB6 has taken the form of our most successful and most profitable piece of software, so we kind of have to care. Sure, we’re literally in the middle of replacing said software with a SaaS offering, but until every single client has moved to the new hotness, the old application has to keep on trucking.

As a result, sometimes my team has to write VB6. We don’t like it, but we’re pragmatists, and we don’t do it all the time, so no-one has tried to burn the office down. Yet.

It's mostly bug fixes at this point (because all new code in this particular app is written in .NET, executed via structured events over COM), but sometimes we do augment existing features as well.

With the prelude out of the way, it's time to get to the meat. We hit a nasty issue recently where every time we tried to compile the source it would fail with an out of memory error.

This wasn’t something as simple as “oh, just give the machine more memory” either, this was “the machine has plenty of memory, but the VB6 compiler has no more addressable space because it's a 32-bit app”.

Please Sir, No More

Our most recent change (which was still on a branch, because while we might be writing VB6, we're not savages) was to fold some reusable component libraries into the main project. We weren't using them anywhere else (and had no plans to ever use them anywhere else), and the nature of the libraries was making it difficult to debug some of the many crashes inflicted on our users each day, so it seemed like a no brainer.

Of course, we didn’t know that folding those components in would tip us over into the land of “no compilation for you”.

All told we have something like 300K lines of VB6 code, spread across many different forms, modules and classes. That really doesn't seem like enough to cause memory issues, until we realised that that number only describes how many lines of code are present in source control.

Something tricksy was afoot.

Hot Code Injection

It turns out that because error handling and reporting in VB6 ranges from “runtime error 13” to “hard crash with no explanation”, the application made use of a third party component to dynamically augment the code before the real compilation.

Specifically, unless you tell it not to, it goes through every single function and injects a variety of things intended to give better error output, like stack traces (well function pointers at least) and high level error handling for unexpected crashes (which we used to send error reports to our ELK stack). Incredibly useful stuff, but it results in a ridiculous increase to the total lines of code in play.

This is why the compiler was running out of memory. That solid 300K lines in source control was quickly ballooning into some number that the compiler just could not handle.

The solution? Go find some pointless, unused code and cut it out like a cancerous tumour until the compilation worked again. It's win-win: the codebase gets smaller, you get to compile again, everyone is happy.

Of course, you have to be really careful that the code is actually unused and not just misunderstood, but the static analysis in VB6 is passable at finding unused module level functions, so we located a few of them, nuked them from orbit and moved on with our lives.

Conclusion

I’ll be honest, the situation (and solution) above doesn’t exactly leave me with a warm fuzzy feeling in my heart. I’m sure we’ll run into the exact same problem at some point in the future, especially if we fold in any other components, but I have to contrast that unsettling feeling with the fact that our path is more likely to result in less VB6 over time (and more .NET), until eventually the application dies a good death.

More generally, it's a shame that the VB6 code you tend to find in the wild is a bit… special. It's not a terrible language (for its time), and it's certainly not impossible to write good VB6 (well factored, following at least some modern software development practices); it's just so easy to do it badly. With its low barrier to entry and how easy it made creating a desktop application, it was the perfect hotbed for all sorts of crazy, long lasting insanity.

Reminds me of Javascript.


Another week, another post.

Stepping away from the data synchronization algorithm for a bit, it's time to explore some AWS weirdness.

Well, strictly speaking, AWS is not really all that weird. It's just that sometimes things happen, and they don't immediately make any sense, and it takes a few hours and some new information before you realise “oh, that makes perfect sense”.

That intervening period can be pretty confusing though.

Today's post is about a disagreement between a Load Balancer and a CloudWatch alarm over just how healthy some EC2 instances were.

Schrödinger's Symptoms

At the start of this whole adventure, we got an alert email from CloudWatch saying that one of our Load Balancers contained some unhealthy instances.

This happens from time to time, but it's one of those things that you need to get to the bottom of quickly, just in case it's the first sign of something more serious.

In this particular case, the alert was for one of the very small number of services whose infrastructure we manually manage. That is, the service is hosted on hand crafted EC2 instances, manually placed into a load balancer. That means no auto scaling group, so no capability to self heal or scale if necessary, at least not without additional effort. Not the greatest situation to be in, but sometimes compromises must be made.

Our first step for this sort of thing is typically to log into the AWS dashboard and have a look around, starting with the entities involved in the alert.

After logging in and checking the Load Balancer though, everything seemed fine. No unhealthy instances were being reported. Fair enough, maybe the alarm had already reset and we just hadn't gotten the followup email (it's easy enough to forget to configure an alert to send emails when the alarm returns to the OK state).

But no, checking on the CloudWatch alarm showed that it was still triggered.

Two views on the same information: one says “BAD THING HAPPENING”, the other says “nah, it's cool, I'm fine”.

But which one was right?

Diagnosis: Aggregations

When you’re working with instance health, one of the most painful things is that AWS does not offer detailed logs showing the results of Load Balancer health checks. Sure, you can see whether or not there were any healthy/unhealthy instances, but if you want to know how the machines are actually responding, you pretty much just have to execute the health check yourself and see what happens.

In our case, that meant hitting the EC2 instances directly over HTTP with a request for /index.html.

Unfortunately (fortunately?) everything appeared to be working just fine, and each request returned 200 OK (which is what the health check expects).

Our suspicion then fell to the CloudWatch alarm itself. Perhaps we had misconfigured it and it wasn't doing what we expected? Maybe it was detecting missing data as an error and then alarming on that. That would still indicate a problem of some sort, but it would at least confirm that the instances were functional, which is what everything else appeared to be saying.

The alarm was correctly configured though, saying the equivalent of “alert when there are more than 0 unhealthy instances in the last 5 minutes”.

We poked around a bit in the other metrics available on the Load Balancer (request count, latency, errors, etc.) and discovered that latency had spiked a bit, request count had dropped a bit, and there were a tonne of backend errors, so something was definitely wrong.

Returning to the alarm, we noticed that the aggregation on the data point was “average”, which meant that it was actually saying “alert when there are more than 0 unhealthy instances on average over the last 5 minutes”. It's not obvious what the average value of a health check is over time, but changing the aggregation to minimum showed that there were zero unhealthy instances over the same time period, and changing it to maximum showed that all four of the instances were unhealthy.
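
As a toy illustration of why the three aggregations disagree so completely, using made up samples of the UnHealthyHostCount metric:

    // Instances flapping between healthy and unhealthy over the alarm period.
    var unhealthyHostCountSamples = new[] { 0, 4, 0, 4, 0 };

    Console.WriteLine(unhealthyHostCountSamples.Average()); // 1.6 -> unhealthy "on average", the alarm fires
    Console.WriteLine(unhealthyHostCountSamples.Min());     // 0   -> looks perfectly healthy
    Console.WriteLine(unhealthyHostCountSamples.Max());     // 4   -> looks like everything is down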

Of course, this meant that the instances were flapping between up and down, which meant the health checks were sometimes failing and sometimes succeeding.

It was pure chance that the Load Balancer had never shown any unhealthy instances whenever we looked at it directly, and similarly that the health endpoint had always responded appropriately whenever we hit it manually. The CloudWatch alarm remembered though, and the average of [healthy, healthy, unhealthy] was unhealthy as far as it was concerned, so it was alerting correctly.

Long story short, both views of the data were strictly correct, and were just showing slightly different interpretations.

Cancerous Growth

The root cause of the flapping was exceedingly high CPU usage on the EC2 instances in question, which was leading to timeouts.

We’d done a deployment earlier that day, and it had increased the overall CPU usage of two of the services hosted on those instances by enough that the machines were starting to strain with the load.

More concerning though was the fact that the deployment had only really spiked the CPU from 85% to 95-100%. We'd actually been running those particular instances hot for quite a while.

In fact, looking back at the CPU usage over time, there was a clear series of step changes leading up to the latest increase, and we just hadn’t been paying attention.

Conclusion

It can be quite surprising when two different views on the same data suddenly start disagreeing, especially when you’ve never seen that sort of thing happen before.

Luckily for us, both views were actually correct, and it just took a little digging to understand what was going on. There was a moment there where we started thinking that maybe we’d found a bug in CloudWatch (or the Load Balancer), but realistically, at this point in the lifecycle of AWS, that sort of thing is pretty unlikely, especially considering that neither of those systems are exactly new or experimental.

More importantly, now we’re aware of the weirdly high CPU usage on the underlying EC2 instances, so we’re going to have to deal with that.

It's not like we can auto scale them to deal with the traffic.