
Building new features and functionality on top of legacy software is a special sort of challenge, one that I’ve written about from time to time.

To be honest though, the current legacy application that I’m working with is not actually that bad. The prior technical lead had the great idea to implement a relatively generic way to execute modern .NET functionality from the legacy VB6 code thanks to the magic of COM, so you can still work with a language that doesn’t make you sad on a day to day basis. It’s a pretty basic eventing system (i.e. both sides can raise events that are handled by the other side), but it’s effective enough.

Everything gets a little bit tricksy when windowing and modal dialogs are involved though.

One Thing At A Time

The legacy application is basically a multiple document interface (MDI) experience, where the user is free to open a bunch of different entities and screens at the same time. Following this approach for new functionality adds a bunch of inherent complexity though, in that the user might edit an entity that is currently being displayed elsewhere (maybe in a list), requiring some sort of common, global channel for saying “hey, I’ve updated entity X, please react accordingly”.

This kind of works in the legacy code (VB6), because it just runs global refresh functions and changes form controls whenever it feels like it.

When the .NET code gets involved though, it gets very difficult to maintain both worlds in the same way, so we’ve taken to isolating all the new features from the legacy stuff, primarily through modal dialogs. That is, the user is unable to access the rest of the application when the .NET feature is running.

To be honest, I was pretty surprised that we could open up a modal form in WPF from an event handler started in VB6, but I think it worked because both VB6 and .NET shared a common UI thread, and the modality of a form is handled at a low level common to both technologies (i.e. win32 or something).

We paid a price from a user experience point of view of course, but we mostly worked around it by making sure that the user had all of the information they needed to make a decision on any screen in the .NET functionality, so they never needed to refer back to the legacy stuff.

Then we did a new thing and it all came crashing down.

Unforgettable Legacy

Up until fairly recently, the communication channel between VB6 and .NET was mostly one way. VB6 would raise X event, .NET would handle it by opening up a modal window or by executing some code that then returned a response. If there was any communication that needed to go back to VB6 from the .NET side, it always happened after the modal window was already closed.

This approach worked fine until we needed to execute some legacy functionality as part of a workflow in .NET, while still having the .NET window be displayed in a modal fashion.

The idea was simple enough.

  • Use the .NET functionality to identify the series of actions that needed to be completed
  • In the background, iterate through that list of actions and raise an event to be handled by the VB6 to do the actual work
  • This event would be synchronous, in that the .NET would wait for the VB6 to finish its work and respond before moving on to the next item
  • Once all actions are completed, present a summary to the user in .NET

We’d actually used a similar approach for a different feature in the past, and while we had to refactor some of the VB6 to make the functionality available in a way that made sense, it worked okay.

This time the legacy functionality we were interested in was already available as a function on a static class, so it was easily callable. I mean, it was a poorly written function dependent on some static state, so it wasn’t a complete walk in the park, but we didn’t need to do any high-risk refactoring or anything.

Once we wrote the functionality though, two problems became immediately obvious:

  1. The legacy functionality could pop up dialogs, asking the user questions relevant to the operation. This was actually kind of good: one of the main reasons we didn’t want to reimplement was that we couldn’t be sure we would capture all the edge cases, whereas the existing functionality already handled them (and had been doing so for years). These cases were rare, so while they were a little disconcerting, they were acceptable.
  2. Sometimes executing the legacy functionality would murder the modal-ness of the .NET window, which led to all sorts of crazy side effects. This seemed to happen mostly when the underlying VB6 context was changed in such a way by the operation that it would historically have required a refresh. When it happened, the .NET window would drop behind the main application window, and the main window would be fully interactable, including opening additional windows (which would explode if they too were modal). There did not seem to be a way to get the original .NET window back into focus either. I suspect that there were a number of Application.DoEvents calls secreted throughout the byzantine labyrinth of code that were forcing screen redraws, but we couldn’t easily prove it. It was definitely broken though.

The first problem wasn’t all that bad, even if it wasn’t great for a semi-automated process.

The second one was a deal-breaker.

Freedom! Horrible Terrifying Freedom!

We tried a few things to “fix” the whole modal window problem, including:

  • Trying to restore the modal-ness of the window once it had broken. This didn’t work at all, because the window was still modal somehow, and technically we’d lost the thread context from the initial .ShowDialog call (which may or may not have still been blocked, it was hard to tell). In fact, other windows in the application that required modality would explode if you tried to use them, with an error similar to “cannot display a modal dialog when there is already one in effect”.
  • Trying to locate and fix the root reason why the modal-ness was being removed. This was something of a fool’s errand, as I mentioned above: the code was ridiculously labyrinthine and it was impossible to tell what was actually causing the behaviour. Also, it would involve simultaneously debugging both VB6 and .NET, which is somewhat challenging.
  • Forcing the .NET window to be “always on top” while the operation was happening, to at least prevent it from disappearing. This somewhat worked, but required us to use raw Win32 windowing calls, and the window was still completely broken after the operation was finished. Also, it would be confusing to make the window always on top all the time, while leaving the ability to click on the parts of the parent window that were visible.

In the end, we went with just making the .NET window non-modal and dealing with the ramifications. With the structures we’d put into place, we were able to refresh the content of the .NET window whenever it gained focus (to prevent it displaying incorrect data due to changes in the underlying application), and our refreshes were quick due to performance optimization, so that wasn’t a major problem anymore.

It was still challenging though, as sitting a WPF Dispatcher on top of the main VB6 UI thread (well, the only VB6 thread) and expecting them both to work at the same time was just asking too much. We had to create a brand new thread just for the WPF functionality, and inject a TaskScheduler initialized on the VB6 thread for scheduling the events that get pushed back into VB6.

Conclusion

It’s challenging edge cases like this whole adventure that make working with legacy code time consuming in weird and unexpected ways. If we had just stuck to pure .NET functionality, we wouldn’t have run into any of these problems, but we would have paid a different price in reimplementing functionality that already exists, both in terms of development time and in terms of risk (in that we don’t fully understand all of the things the current functionality does).

I think we made the right decision, in that the actual program functionality is the same as it’s always been (doing whatever it does), and we instead paid a technical price in order to get it to work well, as opposed to forcing the user to accept a sub-par feature.

It’s still not immediately clear to me how the VB6 and .NET functionality actually works together at all (with the application windowing, threading and various message pumps and loops), but it does work, so at least we have that.

I do look forward to the day when we can lay this application to rest though, giving it the peace it deserves after many years of hard service.

Yes, I’ve personified the application in order to empathise with it.


It’s been almost a month since I posted about my terrible ASP.NET Core MVC website for executing pre-canned business intelligence queries, and thanks to the hard work of other people, it’s time to throw it all away.

Just as planned.

Honestly, there is nothing more satisfying than throwing away a prototype. I mean, it’s always hard to let go of something you’ve built, but every time you knock one together there is always that nagging feeling in the corner of your mind about whether it will magically turn into production code without you noticing. It feels good to let it go before that happens.

Anyway, the database with all the juicy customer data in it is available inside Metabase now, and it’s pretty much what I built, but infinitely better.

That’s So Meta

Assuming you didn’t immediately go and read the link, Metabase is a tool for writing, visualizing and sharing business intelligence.

It literally solves exactly the sort of problems I described in my original post:

  1. People who are good at the analysis part can easily get into the system, create new questions, visualize the results (with an assortment of charts) and save the configuration for reuse
  2. People who really want the analysis but are bad at it can use those questions for their own nefarious purposes, without having to bug the original writer

Interestingly enough, the tool doesn’t even require you to write actual SQL, though it does offer that capability. You can create simple questions by using a fairly straightforward GUI, so there’s room for just about anyone to get in and do some analysis, if they want.

To help with this sort of thing, the tool extracts meta information from the connected databases (like table names, field names, types, etc), and supplies it whenever you’re drafting a question, even in the pure SQL mode. It’s not perfect (it pretty much just has a list of keywords that it tries to match as you type, so it doesn’t look like it’s contextual), but it’s still pretty useful.

Obviously it offers all the normal things that you would expect from an intelligence platform as well, including user management (with a solid invite system), permissions, categorization and so on.

And it’s all open source, so the only cost is setting it up (which, as with all things, is not to be underestimated).

Snapchat

I had very little to do with the initial setup of Metabase at our organization. It was put in place to offer intelligence on the raw data of our cloud product, mostly because people kept querying the production database replica directly.

When I saw it though, I realized how useful it could be sitting on top of the massive wealth of information that we were syncing up from our customers from the legacy product.

Thanks to the excellent efforts of one of my colleagues, it all became a beautiful reality, and I can now start on the long long process of educating people about how to get their own data without bugging me.

To expand on the actual Metabase setup:

  • Every day we take an RDS snapshot of the legacy cloud database
  • These snapshots are shared automatically with an operational AWS account (keeping a good production/other stuff separation)
  • In the mornings, we restore the latest snapshot to a static location, which Metabase has been configured to connect to
  • During this restore process, we mutate the database a little bit in order to obfuscate user data and do some other tricksy things

The whole thing is executed using Ansible Playbooks, via Jenkins in a fully automated fashion.

The only real limitation is that the data is guaranteed to be up to a day old, but that’s generally not a problem for business intelligence, which is more interested in long term patterns rather than real-time information.

Plus we have our ELK stack for real time stuff anyway.

Optimus Prime

I’ve been analysing our customers’ data for a while now, and sometimes it can be pretty brutal. Mostly because of the scale of information, but also because the database schema was never intended for intelligence; it was built to serve an application, and it’s arranged to optimise that path first and foremost.

The good thing is that incorporating the database into Metabase is a great opportunity to pre-calculate some derivations of the data, allowing them to be used to augment other data sets without having to wait for a two minute aggregation to complete.

A good example is entity counts by customer. One particular entity in the database has somewhere in the realm of three million rows. Not a huge number for a database, but if you want to aggregate those entities by the customer (perhaps a sum, maybe sum of sub-classifications as well) then you’re still looking at a decent amount of computation to get it done.

This sort of pre-calculation is a great opportunity to use a materialized view, which is basically just a fancy way to create a table from a query, along with the capability to “refresh” the content of the table at a later date by re-executing the same query. I suppose you could just use a table if you want, and insert rows into it yourself, but honestly, this is exactly the sort of thing materialized views are intended for.

It’s easy enough to slot the creation of the view into the process that restores the database snapshot, so there’s really no reason not to do it, especially because it lets you do awesome things like include summary statistics by customer on your dashboard, with a query that loads in milliseconds rather than minutes.
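Purely as an illustration, the sort of pre-calculated view I’m talking about might look something like the following, sketched as a small Kotlin helper using plain JDBC against a PostgreSQL database (the connection details, view, table and column names are all placeholders, not our actual schema):

```kotlin
import java.sql.DriverManager

// Illustrative sketch only: pre-calculate per-customer entity counts as a
// materialized view, as the kind of thing that could be slotted into the
// snapshot restore process. Assumes PostgreSQL; all names are placeholders.
fun refreshEntityCountsView(jdbcUrl: String, user: String, password: String) {
    DriverManager.getConnection(jdbcUrl, user, password).use { connection ->
        connection.createStatement().use { statement ->
            // First run only; harmless if the view already exists.
            statement.execute(
                """
                CREATE MATERIALIZED VIEW IF NOT EXISTS entity_counts_by_customer AS
                SELECT customer_id,
                       classification,
                       COUNT(*) AS entity_count
                FROM entities
                GROUP BY customer_id, classification
                """.trimIndent()
            )
            // On subsequent restores, just re-execute the underlying query.
            statement.execute("REFRESH MATERIALIZED VIEW entity_counts_by_customer")
        }
    }
}
```

Dashboard questions can then read from entity_counts_by_customer directly, which is a lookup over a small pre-computed table rather than a multi-million row aggregation.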

Conclusion

I’m extremely happy to have access to Metabase, both for my own usage (it’s a pretty great tool, even if I am more comfortable in a raw query editor) and to give everyone else in the organization access to some really useful data.

Sure, we could always do the analysis ad-hoc, using whatever visualization tools we wanted (let’s be frank, it’s probably Excel), but there is something very nice about having a centralized location for this stuff, as I don’t have to worry about someone still making business decisions using a query I wrote months ago, which has all these bugs and bad assumptions in it. I can just update the root question and everyone is using the best information available.

At least until I find more bad assumptions of course.

More importantly, because it’s simplified the whole sharing and re-use process, it’s motivated me to actually work on delivering some analysis that I’ve been meaning to do for ages, which can only result in a better informed organization in general.

In a weird twist, this is the one place where I’m an eternal optimist, as I don’t want to see any downsides to everyone having access to some subset of our customers’ data.


There are many challenges inherent in creating an API. From the organization of the resources, to interaction models (CQRS and eventual consistency? Purely synchronous?) all the way through to the specific mechanisms for dealing with large collections of data (paging? filtering?) there is a lot of complexity in putting a service together.

One such area of complexity is binary data, and the handling thereof. Maybe it’s images associated with some entity, or arbitrary files, or maybe something else entirely, but the common thread is that the content is delivered differently to a normal payload (like JSON or XML).

I mean, you’re unlikely to serve an image to a user as a base64 encoded chunk in the middle of a JSON document.

Or you might, I don’t know your business. Maybe it seemed like a really good idea at the time.

Regardless, if you need to deal with binary data (upload or download, doesn’t really matter), it’s worth thinking about the ramifications with regard to API throughput at the very least.

Use All The Resources

It really all comes down to resource utilization and limits.

Ideally, in an API, you want to minimise the time that every request takes, mostly so that you can maintain throughput. Every single request consumes resources of some sort, and if a particular type of request is consuming more than its fair share, the whole system can suffer.

Serving or receiving binary data is definitely one of those needy types of requests, as the payloads tend to be larger (files, images and so on), so whatever resources are involved in dealing with the request are consumed for longer.

You can alleviate this sort of thing by making use of good asynchronous practices, ceding control over the appropriate resources so that they can be used to deal with other incoming requests. Ideally you never want to waste time waiting for slow things, like HTTP requests and IO, but there is always going to be an upper limit on the amount you can optimize by using this approach.

A different issue can arise when you consider the details of how you’re dealing with the binary content itself. Ideally, you want to deal with it as a stream, regardless of whether it’s coming in or going out. This sort of thing can be quite difficult to actually accomplish though, unless you really know what you’re doing and completely understand the entire pipeline of whatever platform you’re using.

It’s far more common to accidentally read the entire body of the binary data into memory all at once, consuming far more resources than are strictly necessary compared to pushing the bytes through as a continuous stream. I’ve had this happen silently and unexpectedly when hosting web applications in IIS, and it can be quite a challenging situation to engineer around.
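For what it’s worth, here is a hedged sketch of the streaming approach, expressed in Kotlin with Spring rather than the .NET/IIS stack I was actually fighting with, and with a made-up route and storage path:

```kotlin
import org.springframework.core.io.InputStreamResource
import org.springframework.http.MediaType
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.PathVariable
import org.springframework.web.bind.annotation.RestController
import java.io.File
import java.io.FileInputStream

// Illustrative sketch: return the binary content as a stream so the framework
// can pump bytes through in chunks, rather than reading the whole payload into
// memory first. The route, storage path and content type are all placeholders.
@RestController
class ImageController {

    @GetMapping("/things/{id}/image", produces = [MediaType.IMAGE_JPEG_VALUE])
    fun downloadImage(@PathVariable id: String): ResponseEntity<InputStreamResource> {
        val file = File("/var/data/images/$id.jpg")
        return ResponseEntity.ok()
            .contentLength(file.length())
            .body(InputStreamResource(FileInputStream(file)))
    }
}
```

The point is simply that the framework is handed a stream to push through a chunk at a time, instead of a fully materialised byte array sitting in memory for the life of the request.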

Of course if you’re only dealing with small amounts of binary data, or small numbers of requests, you’re probably going to be fine.

Assuming every relevant stakeholder understands the limitations that is, and no one sells the capability to easily upload and download thousands and thousands of photos taken from a modern camera phone.

But it’s not like that’s happened to me before at all.

Separation Of Concerns

If you can, a better model to invoke is one in which you don’t deal with the binary data directly at all, and instead make use of services that are specifically built for that purpose.

Why re-invent the wheel, only to do it worse and with more limitations?

A good example of this is to still have the API handle the entities representing the binary data, but when it comes to dealing with the hard bit (the payload), provide instructions on how to do that.

That is, instead of hitting GET /thing/{id}/images/{id}, and getting the binary content in the response payload, have that endpoint return a normal payload (JSON or similar) that contains a link for where to download the binary content from.

In more concrete terms, this sort of thing is relatively easily accomplished using Amazon Simple Storage Service (S3). You can have all of your binary content sitting securely in an S3 bucket somewhere, and create pre-signed links for both uploading and downloading with very little effort. No binary data will ever have to pass directly through your API, so assuming you can deal with the relatively simple requests to provide directions, you’re free and clear.
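As a rough sketch of just how little effort that is, generating a pre-signed download link with the AWS SDK for Java (v1) from Kotlin looks something like the following (the bucket, key and expiry are placeholders, and uploads work the same way with a PUT method):

```kotlin
import com.amazonaws.HttpMethod
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest
import java.net.URL
import java.util.Date

// Illustrative sketch: hand out a short-lived, pre-signed link so the client
// talks to S3 directly and the API never touches the bytes themselves.
fun presignedDownloadLink(bucket: String, key: String, expiryMinutes: Long = 15): URL {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    val expiry = Date(System.currentTimeMillis() + expiryMinutes * 60 * 1000)
    val request = GeneratePresignedUrlRequest(bucket, key)
        .withMethod(HttpMethod.GET)   // HttpMethod.PUT for uploads
        .withExpiration(expiry)
    return s3.generatePresignedUrl(request)
}
```

The API endpoint just returns that URL in its normal JSON payload, and the heavy lifting happens between the client and S3.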

It’s not like S3 is going to have difficulties dealing with your data.

It’s somewhat beyond my area of expertise, but I assume you can accomplish the same sort of thing with equivalent services, like Azure Blob Storage.

Hell, you can probably even get it working with a simple Nginx box serving and storing content using its local disk, but honestly, that seems like a pretty terrible idea from a reliability point of view.

Conclusion

As with anything in software, it all comes down to the specific situation that you find yourself in.

Maybe you know for certain that the binary data you deal with is small and will remain that way. Fine, just serve it from the API. Why complicate a system if you don’t have to?

But if the system grows beyond your initial assumptions (and if it’s popular, it will), then you’ll almost certainly have problems scaling if you’re constantly handling binary data yourself, and you’ll have to pay a price to resolve the situation in the long run.

Easily the hardest part of being a software engineer is trying to predict which direction the wind will blow, and adjusting accordingly.

Really, it’s something of a fool’s errand, and the best idea is to be able to react quickly in the face of changing circumstances, but that’s even harder.


No post this week.

I had a flu shot last week, and I assume it disrupted the delicate balance of terrifying diseases already present in my body, Mr Burns style.

Regularly scheduled programming returns next week.


Continuous delivery of your software is something you should aspire to, even if you may never quite reach that lofty goal of “code gets committed, changes available in production”.

There is just so much business value to be had by making sure that your improvements, features and bug fixes are always in the hands of your users, and never spend dead time waiting for the stars to align and that mysterious release window to open up.

Of course, it can all take an incredible amount of engineering effort, so as with everything in software, you probably need to think about what exactly you are trying to accomplish and how much you’re willing to pay for it.

Along the way, you’ll run across a wide variety of situations that make constant deployments challenging, and that’s where this post gets relevant. One of my teams has recently been made responsible for an API whose core function is to execute tasks that might take up to an hour to run (spoiler alert, it’s data migration), and that means being able to arbitrarily deploy our changes whenever we want is just not a capability we have right now.

In fact, the entire deployment process for this particular API is a bit of a special flower, differing in a number of ways from the rest of the organization.

And special flowers are not to be tolerated.

It’s A Bit Of A Marathon

I’ve written about this a few times already, but our organization (like most) has an old, highly profitable piece of legacy desktop software. Its future is…limited, to say the least, so we’re engaged in a long term project to build a replacement that offers similar (but better) features using a SaaS (Software as a Service) model.

Ideally, we want to make the transition from old and busted to new hotness as easy as possible for every single one of our users, so there is a huge amount of value to be gained by investing in a reliable and painless migration process.

We’re definitely not there yet, but it’s getting closer with every deployment that we make.

Architecturally, we have a specialist migration tool, backed by an API, all of which is completely separate from the main user interface to the system. At this point in time, migrations are executed by an internal team, but the dream is that the user will be able to do it themselves.

The API is basically a fancy ETL, in that it gets data from some source (extract), transforms it into a format that works for the cloud product (transform) and then injects everything as appropriate via the cloud APIs (load). It’s written in Java (specifically Kotlin) and leverages Spring for its API-ness, and Spring Batch for job scheduling and management. Deployment-wise, the API is encapsulated in a Docker image (output from our build process) and when it’s time to ship a new version, the existing Docker containers are simply replaced with new ones in a controlled fashion.

More importantly to the blog post at hand, each migration is a relatively long running task that executes a series of steps in sequence in order to get customer data from legacy system A into new shiny cloud system B.
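To give a feel for the shape of such a job (this is an illustrative sketch using Spring Batch 4.x style builders, not our actual pipeline), a sequential extract/transform/load job might be wired up something like this:

```kotlin
import org.springframework.batch.core.Job
import org.springframework.batch.core.Step
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory
import org.springframework.batch.repeat.RepeatStatus
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

// Illustrative sketch only (assumes @EnableBatchProcessing is configured
// elsewhere): one migration job made of sequential extract/transform/load
// steps. Step names and tasklet bodies are placeholders.
@Configuration
class MigrationJobConfig(
    private val jobs: JobBuilderFactory,
    private val steps: StepBuilderFactory
) {
    @Bean
    fun migrationJob(): Job =
        jobs.get("migrationJob")
            .start(step("extract"))   // pull data out of the legacy system
            .next(step("transform"))  // reshape it for the cloud product
            .next(step("load"))       // push it through the cloud APIs
            .build()

    private fun step(name: String): Step =
        steps.get(name)
            .tasklet { _, _ -> RepeatStatus.FINISHED }  // the real work goes here
            .build()
}
```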

Combine “long running uninterruptable task” with “container replacement” and you get in-flight migrations being terminated every time a deployment occurs, which in turn leads to the fear, and we all know where fear leads.

A manual deployment process, gated by another manual “hey, are you guys running any migrations” process.

Waiting At The Finish Line

To allow for arbitrary deployments, one of the simplest solutions is to have the deployment process simply wait for any in-flight migrations to complete before it does anything destructive.

Automation-wise, the approach is pretty straightforward:

  • Implement a new endpoint on the API that returns whether the API instance can be “shut down”, using in-memory information about migrations in flight
  • Change the deployment process to use this endpoint to decide whether or not the container is ready to be replaced

With the way we’re using Spring Batch, each “migration” job runs from start to finish on a single API instance, so it’s simple enough to increment an in-memory count whenever a job starts, and decrement it whenever a job finishes (or fails).

The deployment process then just waits for each container to state whether or not it’s allowed to shut down before tearing anything down. Specifically, each individual container needs to be asked, not the API through the load balancer, because they all have their own state.
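A minimal sketch of that arrangement might look something like the following (the names, route and response shape are illustrative only, and the listener still needs to be registered against each migration job definition):

```kotlin
import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.JobExecutionListener
import org.springframework.stereotype.Component
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import java.util.concurrent.atomic.AtomicInteger

// Illustrative sketch: track in-flight migrations in memory via a Spring Batch
// listener, and expose a readiness endpoint that the deployment process can
// poll on each container. Register the listener on every migration job
// definition (e.g. via jobBuilder.listener(tracker)).
@Component
class InFlightMigrationTracker : JobExecutionListener {
    private val inFlight = AtomicInteger(0)

    override fun beforeJob(jobExecution: JobExecution) {
        inFlight.incrementAndGet()
    }

    override fun afterJob(jobExecution: JobExecution) {
        // Fires for both successful and failed jobs.
        inFlight.decrementAndGet()
    }

    fun canShutdown(): Boolean = inFlight.get() == 0
}

@RestController
class ShutdownController(private val tracker: InFlightMigrationTracker) {

    // The deployment process asks each container directly (not via the load
    // balancer), because the count is local to the instance.
    @GetMapping("/internal/can-shutdown")
    fun canShutdown(): Map<String, Boolean> = mapOf("canShutdown" to tracker.canShutdown())
}
```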

This approach has the unfortunate side effect of making it hard to reason about how long a deployment might actually take though, as a migration in flight could have anywhere from a few seconds to a few hours of runtime left, and the deployment cannot continue until all migrations have finished. Additionally, if another migration is started while the process is waiting for any in-flight migrations to complete, it might never get the chance to continue, which is troublesome.

That is, unless you put the API into “maintenance mode” or something, where it’s not allowed to start new migrations. That’s downtime though, which isn’t continuous delivery.

Running More Than One Race

A slight tweak to the first solution is to allow for some parallel execution between the old and new containers:

  • Spin up as many new version API containers as necessary
  • Take all of the old containers out of the load balancer (or equivalent) so no new migrations can be started on them
  • Leave the old ones around, but only until they finish up their migrations, and then terminate them

This allows for continuous operation of the service (which is in line with the original goal, of the user not knowing that anything is going on behind the scenes), but can lead to complications.

The main one is what happens if the new API version contains any database updates. Those might make the database incompatible with the old version, which would cause everything to explode. Obviously, there is value in making sure that changes are at least one version backwards compatible, but that can be hard to enforce automatically, and it can be dangerous to just leave it up to people to remember.

The other complication is that this approach assumes that the new containers can answer requests about the jobs running on the old containers (i.e. status), which is probably true if everything is behind a load balancer anyway, but it’s still something to be aware of.

So again, not an ideal solution, but at least it maintains availability while doing its thing.

Or We Could Just Do Sprints

If you really want to offer continuous delivery with something that does long running background tasks, the answer is to not have long running background tasks.

Well, to be fair, the entire operation can be long running, but it needs to be broken down into smaller atomic elements.

With smaller constituent elements, you can use a very similar process to the solutions above:

  • Spin up a bunch of new containers, have them automatically pick up new tasks as they become available
  • Stop traffic going directly to the old containers
  • Mark old containers as “to be shutdown” so they don’t grab new tasks
  • Wait for each old container to be “finished” and murder it

You get a much tighter deployment cycle, and you also get the nice side effect of potentially allowing for parallelisation of tasks across multiple containers (if there are tasks that will allow it obviously).
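To make the “mark as draining, finish the current task, then die” behaviour concrete, a hypothetical worker loop might look something like this (the task queue contract is made up for the sketch):

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical worker loop for the “smaller atomic tasks” model: a container
// flagged for shutdown stops claiming new work, finishes the task it is on,
// and then reports itself safe to replace. The task queue abstraction is assumed.
class MigrationWorker(private val tasks: TaskQueue) {
    private val draining = AtomicBoolean(false)
    @Volatile private var busy = false

    fun drain() = draining.set(true)                  // called when the container is marked for replacement
    fun canShutdown(): Boolean = draining.get() && !busy

    fun run() {
        while (!draining.get()) {
            val task = tasks.claimNext()
            if (task == null) {
                Thread.sleep(1_000)                   // nothing to do right now; poll again shortly
                continue
            }
            busy = true
            try {
                task.execute()                        // one short, resumable unit of work
            } finally {
                busy = false
            }
        }
    }
}

// Minimal contracts assumed by the sketch above.
interface TaskQueue {
    fun claimNext(): MigrationTask?
}

interface MigrationTask {
    fun execute()
}
```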

Conclusion

For our API, we went with the first option (wait for long running tasks to finish, then shut down), mostly because it was the simplest, and with the assumption that we’ll probably revisit it in time. It’s still a vast improvement over the manual “only do deployments when the system is not in use” approach, so I consider it a win.

More generally, and to echo my opening statement, the idea of “continuous delivery” is something that should be aimed for at the very least, even if you might not make it all the way. When you’re free to deploy any time that you want, you gain a lot of flexibility in the way that you are able to react to things, especially bug fixes and the like.

Also, each deployment is likely to be smaller, as you don’t have to wait for an “acceptable window” and bundle up a bunch of stuff together when that window arrives. This means that if you do get a failure (and you probably will) it’s much easier to reason about what might have gone wrong.

Mostly I’m just really tired of only being able to deploy on Sundays though, so anything that stops that practice is amazing from my point of view.