
A wild technical post appears!

This week’s post returns to a topic very close to my heart: the Elasticsearch, Logstash and Kibana (ELK) Stack that we use for log aggregation. As you might be able to tell from my post history, logging, metrics and business intelligence rank pretty high on my list of priorities, regardless of any other focuses I might have. To me, if you don’t have good intelligence, you might as well be fighting in the dark, flailing about in the hopes that you hit something important.

This post specifically is about the process by which we deploy new versions of Elasticsearch, and an issue that can occur when you do rolling deployments and the Elasticsearch cluster is hosted in AWS.

Version Control

Way back in August 2017 I wrote about automating the deployment of new Elasticsearch versions to our ELK stack.

Long story short, the part of that post that is relevant to this one is the bit about unassigned shards in Elasticsearch when rebalancing after a version upgrade. Specifically, if some nodes are on a later version of Elasticsearch than others (which is normal partway through a rolling deployment), and a later version node is elected to hold the primary shard, replicas of that shard cannot be assigned to any of the nodes on the lower version.

This is troublesome if you’re waiting for the cluster to go green before progressing to the next node replacement, because unassigned shards equal a yellow cluster. You’ll be waiting forever (or you’ll hit your timeout, because you were smart enough to put a timeout in, right?).

Without some additional change, the system will never reach a state of equilibrium.
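
For reference, that “wait for green with a timeout” step boils down to something like this minimal sketch (the function name and default timeout are illustrative, not lifted from our actual deployment):

function Wait-ForGreenCluster
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl,

        [int]$timeoutSeconds = 600
    )

    # Elasticsearch holds the request open until the cluster reaches the
    # requested status or the supplied timeout elapses, whichever comes first.
    $health = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cluster/health?wait_for_status=green&timeout=$($timeoutSeconds)s" -Verbose:$false;

    # timed_out is true if the requested status was not reached in time.
    return -not $health.timed_out;
}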

La La La I Can’t Hear You

To extrapolate on the content of the initial post, the solution was to check whether all remaining unassigned shards were version conflicts whenever an appropriate end state was reached. An end state would be something like a timeout waiting for the cluster to go green, or maybe something fancier like “the number of unassigned shards has not changed over a period of time”.

If the only unassigned shards left are version conflicts, it’s relatively safe to just continue on with the process and let Elasticsearch sort it out (which it will, once all of the nodes are replaced). There is minimal risk of data loss (the primary shards must all exist for this problem to occur in the first place), and each time a new node comes online, the cluster will rebalance into a better state anyway.

The script for checking for version conflicts is below:

function Get-UnassignedShards
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    # The accept header asks for JSON instead of the default tabular output,
    # so that Invoke-RestMethod can deserialize the response into objects.
    $shards = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cat/shards" -Headers @{"accept"="application/json"} -Verbose:$false;
    $unassigned = $shards | Where-Object { $_.state -eq "UNASSIGNED" };

    return $unassigned;
}

function Test-AllUnassignedShardsAreVersionConflicts
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    Write-Verbose "Getting all UNASSIGNED shards, to see if all of them are UNASSIGNED because of version conflicts";

    $unassigned = Get-UnassignedShards -elasticsearchUrl $elasticsearchUrl;

    foreach ($unassignedShard in $unassigned)
    {
        # prirep is "p" for a primary shard and "r" for a replica.
        $primary = "true";
        if ($unassignedShard.prirep -eq "r")
        {
            $primary = "false";
        }
        $explainBody = "{ `"index`": `"$($unassignedShard.index)`", `"shard`": $($unassignedShard.shard), `"primary`": $primary }";
        Write-Verbose "Getting allocation explanation using query [$explainBody]";
        # Later Elasticsearch versions enforce strict content-type checking on
        # request bodies, so the explicit ContentType matters here.
        $explain = Invoke-RestMethod -Method POST -Uri "$elasticsearchUrl/_cluster/allocation/explain" -Headers @{"accept"="application/json"} -ContentType "application/json" -Body $explainBody -Verbose:$false;

        $versionConflictRegex = "target node version.*is older than the source node version.*";
        $sameNodeConflictRegex = "the shard cannot be allocated to the same node on which a copy of the shard already exists";
        $explanations = @();
        foreach ($node in $explain.node_allocation_decisions)
        {
            foreach ($decider in $node.deciders)
            {
                $explanations += @{Node=$node.node_name;Explanation=$decider.explanation};
            }
        }

        foreach ($explanation in $explanations)
        {
            if ($explanation.explanation -notmatch $versionConflictRegex -and $explanation.explanation -notmatch $sameNodeConflictRegex)
            {
                Write-Verbose "The node [$($explanation.Node)] in the explanation for shard [$($unassignedShard.index):$($unassignedShard.shard)] contained an allocation decider explanation that was unacceptable (i.e. not version conflict and not same node conflict). It was [$($explanation.Explanation)]";
                return $false;
            }
        }
    }

    return $true;
}
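
Slotting the check into the overall flow looks something like this (reusing the illustrative Wait-ForGreenCluster sketch from earlier):

# If the cluster doesn't go green in time, only continue when the remaining
# unassigned shards are all explained by version conflicts.
if (-not (Wait-ForGreenCluster -elasticsearchUrl $elasticsearchUrl))
{
    if (-not (Test-AllUnassignedShardsAreVersionConflicts -elasticsearchUrl $elasticsearchUrl))
    {
        throw "Cluster did not go green, and not all unassigned shards are version conflicts";
    }
    Write-Verbose "Cluster is not green, but only because of version conflicts. Continuing with the node replacement.";
}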

In The Zone…Or Out Of It?

This solution works really well for the specific issue that it was meant to detect, but to absolutely no-one’s surprise, it doesn’t work so well for other problems.

Case in point, if your Elasticsearch cluster is AWS Availability Zone aware, then you can encounter a very similar problem to what I’ve just described, except with availability zone conflicts instead of version conflicts.

An availability zone aware Elasticsearch cluster will avoid putting shard replicas in the same availability zone as the primary (within reason), which is just another way to protect itself against losing data in the event of a catastrophic failure. I’m sure you can disable the functionality, but that seems like a relatively sane safety measure, so I’m not sure why you would.
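
For what it’s worth, the awareness configuration lives in the cluster settings, so you can inspect it (or, if you really must, clear it) through the API. A quick sketch:

# Inspect the awareness configuration; include_defaults also surfaces
# values that weren't set explicitly through the API.
$settings = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cluster/settings?include_defaults=true" -Verbose:$false;
$settings | ConvertTo-Json -Depth 10;

# Disabling awareness amounts to clearing the attribute list. You probably
# shouldn't though; it's protecting you from losing an entire zone's worth of data.
$body = '{ "persistent": { "cluster.routing.allocation.awareness.attributes": "" } }';
Invoke-RestMethod -Method PUT -Uri "$elasticsearchUrl/_cluster/settings" -ContentType "application/json" -Body $body -Verbose:$false;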

Unfortunately, when combined with version conflicts also preventing shard allocation, you can be left in a situation where there is no appropriate place to dump a shard, so our deployment process can’t move on because the cluster never goes green.

Interestingly enough, there are two possible solutions for this:

  • The first is to be more careful about the order that you annihilate nodes in. Alternating availability zones is the way to go here, but this approach can get complicated if you’re also dealing with version conflicts at the same time. Also, it doesn’t really work all that well if you don’t have a full complement of nodes (with redundancy) spread about both availability zones.
  • The second is to just replicate the version conflict solution above, except for unassigned shards that are the result of availability zone conflicts. This is by far the easier and less fiddly approach, assuming that the entire deployment finishes (so the cluster can rebalance as appropriate).

I haven’t actually updated our deployment since I discovered the issue, but my current plan is to go with the second option and see how far I get.
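
When I do, the change will probably look something like the following sketch, which replaces the innermost check in Test-AllUnassignedShardsAreVersionConflicts. I haven’t confirmed the exact wording of the availability zone decider message, so treat the new regex as an assumption to be verified against a real allocation explanation:

# Acceptable reasons for a shard to be unassigned mid-deployment.
$versionConflictRegex = "target node version.*is older than the source node version.*";
$sameNodeConflictRegex = "the shard cannot be allocated to the same node on which a copy of the shard already exists";
# Assumed wording for the awareness (availability zone) decider; verify it
# against the output of /_cluster/allocation/explain on a real cluster.
$availabilityZoneConflictRegex = "too many copies of the shard allocated to nodes with attribute.*";

$acceptable = (
    $explanation.explanation -match $versionConflictRegex -or
    $explanation.explanation -match $sameNodeConflictRegex -or
    $explanation.explanation -match $availabilityZoneConflictRegex
);

if (-not $acceptable)
{
    return $false;
}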

Conclusion

This is one of those cases where I knew that the initial solution that I put into place would probably not be enough over the long term, but there was just no value in trying to engineer for every single eventuality right at the start.

Also, truth be told, I didn’t know that Elasticsearch took AWS Availability Zones into account when allocating shards, so that was a bit of a surprise anyway.

Thinking about the actual deployment process some more, it might be easier to scale up, wait for a rebalance and then scale down again, terminating the oldest (and thus earlier version) nodes after all the new ones have already come online. The downside to this approach is mostly just time (you have to wait for 2N rebalances instead of just N, where N is the number of nodes), but it feels like it might be more robust in the face of unexpected weirdness.

Which, from my experience, I should probably just start expecting from now on, as it (ironically) seems like the one constant in software.


If you fulfil some sort of team leadership role, where you maintain and possibly direct one or more teams of people to accomplish things of value to the business, then you probably have some sort of responsibility to give feedback to (and gather feedback from) all parties involved.

Obviously, collecting, collating and sharing feedback should be something that everyone does, but if you’re in that team leadership role, then you kind of have to do it. It’s your job.

For me personally, I’ve spent enough time doing it recently that I’ve managed to form some opinions about it, as is often the case when I do a thing. The natural extension of that is to share them, because honestly, what good is having an opinion if you aren’t telling everyone aaaaaallllll about it?

As a side note, I’m sure that posts with a technical slant will return as soon as I actually do something technical, but until then, enjoy my musings on the various “managementy” things that I do.

Loyalty Card

In a team leadership position, you are responsible for the well being of your colleagues and their professional development.

As such, their happiness and fulfilment should be your primary concern.

Such a strategy is not necessarily mutually exclusive with the best interests of the business though, and in fact should be complementary (happy fulfilled people generally being more productive than others), but if push comes to shove, and the best interests of the business do not line up with the best interests of your people, you should stand with your people.

Moving on from that sobering point, the first step is to understand the mechanism by which you give and gather feedback. From pain points and frustrations all the way through to career and personal development opportunities and direction, you need to have an effective and consistent way to learn all about your people and to understand what they need in order to be the best they can possibly be.

Going Undercover

The most effective way to really understand the people you’re responsible for is to engage with them on a regular basis.

Not in the form of “hey, let’s have a daily catchup” though, because that’s easily going to turn into a status report, and that’s not what you want. You need to share the trials and tribulations of their day to day, and not as a manager or boss, but as a colleague. I’m fairly resolutely against what I see as the traditional management approach and instead think that if you are contributing in the same way (and to the same things) as your people, then you’re going to understand them a hell of a lot better than if you’re looking down from your ivory tower.

Realistically this means that there is probably a hard cap on the number of people that you should be responsible for.

If they are all in a single team/squad, working towards the same goal or working within a shared space, you’re probably good for ten or so. If they are split across multiple areas/goals, then your limit is probably less than that.

The natural extension of this is that a pure people management role feels somewhat pointless. You should deliver the same sort of things as everyone else (maybe less of them), just with additional responsibilities to the people you’re working with.

How else could you possibly understand?

Face/Off

Even if you are completely embedded within the group of people you’re responsible for, there is still value in specifically making time to talk openly and honestly with every person individually about how they are going.

To be clear, the primary focus should be on the person, how they are feeling, where they would like to go (and how you can help) and any issues or concerns they might have, with a secondary focus on how you feel about the whole situation. You want to encourage them to have enough emotional maturity to objectively evaluate themselves, and to then be able to communicate that evaluation.

If you’ve been doing your job correctly, then you shouldn’t be surprised by anything that comes out of these sessions, but they are still useful as a more focused (and private) way to discuss anything that might be relevant.

I like to keep the discussion relatively freeform, but it does help to have some talking points, like:

  • Are you happy?
  • Do you feel productive in your day to day?
  • Do you feel like you are delivering value to the business as a whole?
  • How do you think we could do better as a team?
  • How do you think we could do better as an organization?
  • How do you think you could do better?
  • Where do you want to go from here?
  • Do you have any concerns or unanswered questions?
  • Do you feel appropriately valued with regards to remuneration?

Don’t bombard the poor person with question after question though. It’s not an interrogation.

The conversation should flow naturally with the questions above forming an underlying structure to help keep everything on track, and to provide some consistency from person to person.

That’s a Sick 360 Bro

Another mechanism is the classic “360 degree review”, where you encourage your people to send out surveys to their peers (and to complete surveys in turn) containing questions about how they are doing in various areas.

This particular mechanism has come up recently at my current workplace, so it’s topical for me.

I’m sure you could manage the entire process manually (paper!), but these sorts of things are typically digital (for ease of use) and are focused on getting people to anonymously comment on the people they work with, with regard to the various responsibilities and expectations of the role they fill.

For example:

Bob is a Senior Software Engineer.

He is expected to:

  • Solve problems, probably through software solutions (but maybe not)
  • Mentor other software engineers, with a particular focus on those who are still somewhat green
  • Participate in high level technical discussions

Each one of those responsibilities would have a set of questions carefully crafted and made available for Bob’s peers to fill out, usually with some sort of numerical rating. That information would then be aggregated and returned to Bob, so that he could get a sense of how he is doing.

The anonymity of this approach is easily one of its greatest strengths. Even if you’re the most friendly, least intimidating person on earth, you’re still probably going to get more honest feedback if they don’t have to look at you directly.

Even more so if you have something of a dominant personality.

It’s All About The Money, Money, Money

As a final point, everything that I’ve written about in this post should be clearly separated from discussions about salary, titles and all of the accoutrements that come with them.

At best, salary is a slightly positive factor in overall happiness and fulfilment. Once a creative person is being paid enough to meet their own personal goals, more money is unlikely to make them happier.

Of course, the flip side is devastating. Not enough money, or a salary that is perceived as unfair (usually when compared against others or the market average), can be a massive demotivational factor, and in the worst case, can ruin a professional relationship.

Keeping the two things separated is difficult (and honestly, a complete separation is probably impossible), but you should still aim for it all the same. The last thing you want to happen is for people to withhold information about their weaker areas (prime targets for improvement and growth) because they know that you’re going to use it against them later when you start talking about money.

Conclusion

Being even slightly responsible for the well being of another person with respect to their professional life is a big responsibility and should be treated with an immense amount of care, empathy and respect.

That is not to say that you should be soft or impotent in your approach.

You need to be strong and decisive (when necessary) and give people pokes if they need them. Be aware though, not everyone responds to the same feedback mechanisms in the same way, and you will need to be mature enough to understand that and adapt accordingly.

To end on a fairly trite note, at the very least you should aim to be the sort of person you would look towards for professional guidance.

If you’re not at least doing that, then it’s worth reconsidering your approach.


This blog typically focuses pretty heavily on technical problems and their appropriately technical solutions. This is pretty normal for me, because I’m a software engineer and I like technical things.

Also, they are a lot easier to write about.

At some point you have to be mature enough to realise that software is expensive to both write and maintain, and that you can provide significant value to your organisation by acting as the devil’s advocate whenever it looks like the solution to a problem is to write some software. Maybe the best solution doesn’t actually involve writing code at all.

If you picture yourself as just a developer (or code monkey), then doing this can feel a bit weird.

If you focus on solving problems though, it’s a natural extension of what you do.

Sometimes The Planets Align

Now for a more concrete example, because otherwise it’s all rather academic.

We’ve just embarked on a project to consume the users of another piece of software, giving them access to our new SaaS offering and moving them away from the legacy desktop software that they are currently using.

All told, there are a few hundred distinct clients to be eaten. Not exactly mind boggling numbers, but they represent a decent chunk of a particular market, so there is a lot of value in snapping them up.

Which is exactly what our competitors have been doing for longer than we would like to admit. That few hundred users was significantly higher late last year, and continues to drop at a relatively alarming rate.

Now, in our industry, it’s not just a matter of getting the client a copy of (or login to) the software and then calling it a day. Users want all of their historical data to come along for the ride, so there is a significant push to support robust migrations from a variety of competitor software into our new platform.

We already have a migration tool that automates a migration from our own legacy software into the new one, so it could reasonably be extended to support another, similar migration.

But should we?

What we’re looking at here is a good case for not writing software at all:

  • The shelf life of this particular migration is limited. There are only a few hundred customers, and that number is not going to get any bigger as no new users are being added to the system
  • We have a limited amount of time available. The burn rate on the customer pool is ridiculously high, so the longer we wait, the less the total value of acquisition

It will take a certain amount of time to deliver an acceptable automated migration, and that amount of time could very well be long enough that the total pool of available customers diminishes to the point where it’s no longer financially feasible to chase them. Not only that, but once that pool of customers is empty, this particular migration has no value at all.

Now, there is a point in favour of developing migration software incrementally (i.e. migrating part of the system in an automated fashion and other parts manually), but there is also merit in simply hiring a bunch of dedicated data entry contractors and using the software development capability to build something that has more value over the long term.

Like a migration for an active competitor.

We still haven’t made a decision about this particular fork in the road yet, but it could easily go either way.

What Do You Do For Money Honey

Sometimes the situation is not quite as simple though, as you can find yourself in a position where someone else has seemingly already made the decision that software should be written, and you need to convince them why it’s not a good idea.

We had a situation like this relatively recently, where the business had signed a contract with a third party provider of a communication service in order to complete a feature of some sort.

There was capacity remaining in the terms of the contract though (i.e. unutilised capability) and it was costing the company money for no benefit, so the idea was to augment or replace similar functionality already available in a piece of legacy software in order to make use of it.

This involved a significant amount of work, as the new provider’s functionality worked very differently to the old one, and had additional limitations.

In the end, we railed against the project and proved (using math!) that it was not financially feasible to spend a whole bunch of money on development in order to save less money elsewhere, especially when taking into account the different feature set available through the new provider (which would have required a parallel implementation as opposed to an out and out replacement).

Sometimes the best option is to not do anything at all.

Keep That Analysis Arm Strong

Both of the examples above relied heavily on our ability to perform solid analysis of a situation and then communicate the results of that analysis in a language that the business understood.

Usually this means money, but sometimes it’s also a matter of time. Interestingly enough, money often comes back to time anyway (we can spend $X on this, which equals Y months), and the time one is sometimes more a question of money (how long can we sustain development at this burn rate, how much money will we gain/lose based on these timeframes, etc), so the two are intrinsically related.

The hardest part of the analysis for me is dealing with the time component, which usually means estimates. You can bypass this a bit if you’ve got an established team, because you can probably project from work that was done previously. If the nature of the new work is different, you can apply a fudge factor of some sort if you think it will help, but make sure those factors are clearly communicated in the analysis.

As a final point, don’t make the analysis complicated just to appear more comprehensive or smart. Say exactly what you need to say and no more. Present the facts and let them speak for themselves, even if you don’t like what those facts say.

Conclusion

Honestly, it’s really hard not to write software when it’s what you actually want to do.

In both of the examples I outlined, I could have happily directed a team of people to develop a software solution that would have met the requirements at hand, but it would have been irresponsible for me to do so if I honestly believed that there were better options available.

The hardest part in this whole thing can be convincing people why you should not be doing the thing that they ostensibly hired you to do.

This is where it’s important to be able to speak the language of the business, and to be able to deliver well analysed and supported information to the people who ultimately need to make the decision.

Of course, sometimes you don’t win that particular battle and you have to write software after all.

Which is a pretty good consolation prize, because writing software is awesome.


With my altogether too short break out of the way, it’s time for another post about something software related.

This week’s topic? Embedding websites into Windows desktop applications for fun and profit.

Not exactly the sexiest of topics, but over the years I’ve run into a few situations where the most effective solution to a problem involved embedding a website into another application. Usually an external party has some functionality that they want your users to access, and they’ve put a bunch of effort into building a website to do just that. If you’re unlucky, they haven’t really thought ahead and the website is the only option for integration (what’s an API?), but all is not lost.

Real value can still be delivered to the users who want access to this amazing thing you have no control over.

So Many Options!

I’ve been pretty clear in the posts on this blog that one of the things my team is responsible for is a legacy VB6 desktop application. Now, it’s still being actively maintained and extended, so it’s not dead, but we try not to write VB6 when we can avoid it. Pretty much any new functionality we implement is written in C#, and if we need to present something visual to the user we default to WPF.

Hence, I’m going to narrow the scope of this post down to those technologies, with some extra information from a specific situation we ran into recently.

Right at the start of the whole “hey, let’s jam a website up in here” thought process, the first thing you need to do is decide whether or not you can “integrate” by just shelling out to the website using the current system default browser.

If you can, for the love of all that is good and holy, do it. You will save yourself a whole bunch of pain.

Of course, if you need to provide a deeper integration than that, then you’re going to have to delve into the wonderful world of WPF web browser controls.

Examples include:

  • CEFSharp
  • The built-in WPF WebBrowser control

There are definitely other offerings, but I don’t know what they are. I can extrapolate on these two, because I’ve used them both in anger.

Chrome Dome

CEFSharp is a .NET (both WPF and WinForms) wrapper around the Chromium Embedded Framework, and to be honest, it’s pretty great.

In fact, I’ve already written a post about using CEFSharp in a different project. The goal there was to host a website within a desktop application, but there were some tricksy bits around supplying data to the website directly (via shared Javascript context), and we actually owned the website being embedded, so we had a lot of flexibility around making it do exactly what we needed it to do.

The CEFSharp library is usually my first port of call when it comes to embedding a website in a desktop application.

Unfortunately, when we tried to leverage CEFSharp.WPF in our VB6/C# franken-application, we ran into some seriously weird issues.

Our legacy application is, at its core, VB6. All of the .NET code is triggered from the VB6 via COM interop, which essentially amounts to a message bus with handlers on the .NET side: VB6 raises an event, .NET handles it. Due to the magic of COM, this means that you can pretty much do all the .NET things, including using the various UI frameworks like WinForms and WPF. There is some weirdness with windows and who owns them, but all in all it works pretty well.

To get to the point, we put a CEFSharp.WPF browser into a WPF screen, triggered it from VB6 and from that point forward the application would crash randomly with Access Violations any time after the screen was closed.

We tried the obvious things, like controlling the lifetime of the browser control ourselves (and disposing of it whenever the window closed), but in the end we didn’t get to the bottom of the problem and gave up on CEFSharp. Disappointing but not super surprising, given that that sort of stuff is damn near impossible to diagnose, especially when you’re working in a system built by stitching together a bunch of technological corpses.

Aiiiieeeeeee!

Then there is the built-in WPF WebBrowser control, which accomplishes basically the same thing.

Why not go with this one first? Surely built-in components are going to be superior and better supported compared to third party components?

Well, for one, it’s somewhat dependent on Internet Explorer, which can lead to all sorts of weirdness.
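
As an aside, one well known piece of WebBrowser lore (general knowledge, rather than something specific to our situation) is that the control renders in an ancient IE compatibility mode by default, and the hosting executable has to opt in to something newer via the registry:

# Opt the hosting executable into IE11 document mode (11001) instead of the
# IE7 compatibility mode the WebBrowser control defaults to.
# "MyApp.exe" is a placeholder for the actual executable name.
$key = "HKCU:\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION";
if (-not (Test-Path $key))
{
    New-Item -Path $key -Force | Out-Null;
}
Set-ItemProperty -Path $key -Name "MyApp.exe" -Value 11001 -Type DWord;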

A good example of said weirdness is the following issue we encountered:

  • You try to load an HTTPS website using TLS 1.2 through the WebBrowser control
  • It doesn’t work, giving a “page cannot be loaded” error, but doesn’t tell you why
  • You load the page in Chrome and it works fine
  • You load the page in Internet Explorer and it tells you TLS 1.2 is not enabled
  • You go into the Internet Explorer settings and enable support for TLS 1.2
  • Internet Explorer works
  • Your application also magically works
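
The good news is that the Internet Explorer setting in question maps to a registry value, so you can at least detect (and fix) this particular problem programmatically. A sketch, assuming the standard SecureProtocols bitmask:

# SecureProtocols is a bitmask: 0x80 = TLS 1.0, 0x200 = TLS 1.1, 0x800 = TLS 1.2.
$key = "HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings";
$current = (Get-ItemProperty -Path $key -Name "SecureProtocols" -ErrorAction SilentlyContinue).SecureProtocols;
if (($current -band 0x800) -eq 0)
{
    # TLS 1.2 is not enabled; turn it on for the current user.
    Set-ItemProperty -Path $key -Name "SecureProtocols" -Value ($current -bor 0x800) -Type DWord;
}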

The second pertinent piece of weirdness relates specifically to the control’s participation in the WPF layout and rendering engine.

The WebBrowser control does not follow the same rendering logic as a normal WPF control, likely because it’s just a wrapper around something else. It works very similarly to a WinForms control hosted in WPF, which is a nice way of saying it works pretty terribly.

For example, it renders on top of everything regardless of how you think you organised it, which can lead to all sorts of strange visual artefacts if you tend to use the Z-axis to create layered interfaces.

Conclusion

With CEFSharp causing mysterious Access Violations that we could not diagnose, the default WPF WebBrowser was our only choice. We just had to be careful with when and how we rendered it.

Luckily, the website we needed to use was relatively small and simple (it was a way for us to handle sensitive data in a secure way), so while it was weird and ugly, the default WebBrowser did the job. It didn’t exactly make it easy to craft a nice user experience, but a lot of the pain we experienced there was more the fault of the website itself than the integration.

That’s a whole other story though.

In the end, if you don’t have a horrifying abomination of technologies like we do, you can probably just use CEFSharp. It’s solid, well supported and has heaps of useful features, assuming you handle it with a bit of care.