Unfortunately Papa Nurgle, the Plaguefather, has gifted me with one of his many blessings this week, so I am in no state to put together a decent blog post.

Instead I have to muse on what other pithy titles I can use when I get sick next time, because I’m mostly out of HTTP error codes.

We’ve come a long way in our log aggregation journey. Don’t get me wrong, we still have a long way to go, but bit by bit, we’re getting better at it.

A good example of getting better is the way in which we process our Elastic Load Balancer (ELB) logs. Over a year ago I put together a system for processing these logs into our ELK stack that had way too many moving parts. It used Logstash (for processing) and Powershell (for downloading files from S3), hosted on an EC2 instance, to aggregate ELB logs from S3 into our ELK stack. Somewhat complicated in practice, but it worked, even if I was never particularly happy with it.

As is the way with these things though, because it did work, we’ve had no reason to revisit it, and we’ve successfully applied the same approach to at least 3 other environments we’ve set up since.

It wasn’t without its share of problems though:

  • The EC2 instances hosting the solution had a tendency to cap themselves at 100% CPU for long periods. They were initially t2.mediums, but they kept exhausting their CPU credits, so we had to upgrade them to m3.mediums, which was a 50% increase in cost ($US 115/month). We never did figure out exactly what needed all that CPU, but the leading theory was Logstash.
  • For a while, the logs simply stopped processing after a period of time (days/weeks). This turned out to be an issue with accumulating memory dumps from Java as a result of Logstash crashing and NSSM automatically restarting it.
  • These were the machines most vulnerable to the memory leak in Logstash that causes its TCP driver to accumulate non-paged memory on Windows AWS instances due to some driver problem.

Good old Logstash.

To turn the discussion back to getting better, we had the opportunity to revisit this process when building some new environments, using all of the knowledge and experience that we’d gained in the intervening period. I think we came up with a much more efficient and understandable solution, but it wasn’t without its share of difficulties, which makes for a good post.

Anomalous Materials

One of the primary weaknesses in the previous approach for processing ELB logs was that it required an entire EC2 instance all to itself for each environment that we spun up. We did this in order to keep each log processor isolated from the others and to allow us to spin up an entirely self-contained environment without having to worry about some common machine that processed all of the logs in a bunch of different buckets.

Another weakness in the process that bothered me was that it had way too many moving parts. Sometimes you have to have a lot of moving parts working together in order to accomplish a goal, but you should always strive for simplicity, both from an operational point of view and from a maintenance point of view. Less is almost always better in software development.

AWS has come a long way since we jammed the initial solution together, so we decided to use this opportunity to simplify the process and experiment with some AWS tools that we don’t frequently use.

After some discussion, we formed an idea of what we would like the new log processor to look like. We wanted to use Lambda to process the ELB logs as they were created, pushing them to the same Logstash ingress endpoint that we’ve been using consistently for the last year or so. The benefits we were expecting were a reduction in complexity (no need to have 4 different things working together), a reduction in running cost (mostly due to the removal of the EC2 instance) and a reduction in latency (the Lambda function would trigger whenever a file was written to the S3 bucket by the ELB, which meant no more polling for changes).

For those of you unfamiliar with Lambda, it’s a service offered by AWS that lets you configure code to run in response to a variety of events. I’ve used it before to create a quicker S3 bucket clone, so if you want some more information, feel free to read up on that adventure.

In order to accomplish our goal, we would need to deal with 3 things:

  • The Javascript code to process the ELB log files and push their contents to our Logstash ingress endpoint,
  • The configuration of the Lambda function itself, including the trigger that fires whenever the ELB writes a new log file to the S3 bucket,
  • Incorporating the setup of all of the above into our environment setup.

Nothing particularly insane there, but definitely a few things that we’d never done before.

To Be Continued

In order to avoid creating a single monstrous post with more words than a small novel, I’m going to break it here.

Next week I’ll continue, explaining the Javascript code that we put together to process the log files (it’s not particularly complicated) and how we configured the Lambda function by incorporating its setup into our environment setup.

Until then, may all your Lambda functions execute quickly and your S3 buckets not turn into ghosts.

I did a strange thing a few weeks ago. Something I honestly believed that I would never have to do.

I populated a Single Page Javascript Application with data extracted from a local database via a C# application installed using ClickOnce.

But why?

I maintain a small desktop application that acts as a companion to a service primarily offered through a website. Why do they need a companion application? Well, they participate in the Australian medical industry, and all the big Practice Management Software (PMS) Systems available are desktop applications, so if they want to integrate with anything, they need to have a presence on the local machine.

So that’s where I come in.

The pitch was that they needed to offer a new feature (a questionnaire) that interacted with the installed PMS, but they wanted to implement the feature using web technologies, rather than putting it together using any of the desktop frameworks available (like WPF). The reasoning behind this desire was:

  1. The majority of the existing team had mostly web development experience (I’m pretty much the only one that works on the companion application), so building the whole thing using WPF or something similar wouldn’t be making the best use of available resources,
  2. They wanted to be able to reuse the implementation in other places, and not just have it available in the companion application,
  3. They wanted to be able to reskin/improve the feature without having to deploy a new companion application.

Completely fair and understandable points in favour of the web technology approach.

Turtles All The Way Down

When it comes to running a website inside a C# WPF desktop application, there are a few options available. You can try to use the native WebBrowser control that comes with .NET 3.5 SP1, but that seems to rely on the version of IE installed on the computer and is fraught with all sorts of peril.

You’re much better off going with a dedicated browser control library, of which the Chromium Embedded Framework is probably your best bet. For .NET that means CEFSharp (and for us, that means its WPF browser control, CEFSharp.WPF).

Installing CEFSharp and getting it to work in a development environment is pretty straightforward: just install the Nuget packages that you need and it’s all probably going to start working.

While functional, it’s not the greatest experience though.

  • If you’re using the CEFSharp WPF control, don’t try to use the designer. It will crash, slowly and painfully. It’s best to just do the entire thing in code behind and leave the browser control out of the XAML entirely (there’s a small sketch of this just after the list).
  • The underlying C++ library (libcef.dll) is monstrous, weighing in at around 48MB. For us, this was about 4 times the size of the entire application, so it’s a bit of a change.
  • If you’re using ClickOnce to deploy your application, you’re in for a bit of an adventure.
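
Speaking of keeping the browser control out of the XAML, here’s a minimal sketch of what “entirely in code behind” looks like with the WPF control. The window and container names here (MainWindow, BrowserHost) are hypothetical, not from the actual application.

using System.Windows;
using CefSharp.Wpf;

public partial class MainWindow : Window
{
    private readonly ChromiumWebBrowser _browser;

    public MainWindow()
    {
        InitializeComponent();

        // No browser control in the XAML at all; create it here and add it to an
        // empty container defined in the XAML (assumed to be a Grid named BrowserHost).
        _browser = new ChromiumWebBrowser();
        BrowserHost.Children.Add(_browser);
    }
}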

That last point, the ClickOnce adventure, is the most relevant to this post.

Because of the way ClickOnce works, a good chunk of the dependencies that are required by CEFSharp will not be recognised as required files, and thus will not be included in the deployment package when it’s built. If you weren’t using ClickOnce, everything would be copied correctly into the output folder by build events and MSBuild targets, and you could then package up your application in a sane way, but once you pick a deployment technology, you’re pretty much stuck with it.

You’ll find a lot of guidance on the internet about how to make ClickOnce work with CEFSharp, but I had the most luck with the following:

  1. Add Nuget references to the packages that you require. For me this was just CEFSharp.WPF, specifically version 49.0.0 because my application is still tied to .NET 4.0. This will add references to CEFSharp.Common and cef.redist.x86 (and x64).
  2. Edit the csproj file to remove any imports of msbuild targets added by CEFSharp. This is how CEFSharp would normally copy itself to the output folder, but it doesn’t work properly with ClickOnce. Also remove any props imports, because this is how CEFSharp does library references.
  3. Settle on a processor architecture (i.e. x86 or x64). CEFSharp only works with one or the other, so you might as well remove the one you don’t need.
  4. Manually add references to the .NET DLLs (CefSharp.WPF.dll, CefSharp.Core.dll and CefSharp.dll). You can find these DLLs inside the appropriate folder in your packages directory.
  5. This last step deals entirely with the dependencies required to run CEFSharp and making sure they are present in the ClickOnce deployment. Manually edit the csproj file to include the following snippet:

    <ItemGroup>    
        <Content Include="..\packages\cef.redist.x86.3.2623.1396\CEF\*.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="..\packages\cef.redist.x86.3.2623.1396\CEF\x86\*.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="..\packages\CefSharp.Common.49.0.0\CefSharp\x86\CefSharp.BrowserSubprocess.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="lib\*.*">
          <Link>%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
        <Content Include="$(SolutionDir)packages\cef.redist.x86.3.2623.1396\CEF\locales\*">
          <Link>locales\%(Filename)%(Extension)</Link>
          <CopyToOutputDirectory>Always</CopyToOutputDirectory>
          <Visible>false</Visible>
        </Content>
    </ItemGroup>
    This particularly brutal chunk of XML adds all of the dependencies required for CEFSharp to work properly as content files in the project, and then hides them, so you don’t have to see their ugly faces whenever you open your solution.

After taking those steps, I was able to successfully deploy the application with a working browser control, free from the delights of constant crashes.

Injected With A Poison

Of course, opening a browser to a particular webpage is nothing without the second part: giving that webpage access to some arbitrary data from the local context.

CEFSharp is pretty good in this respect to be honest. Before you navigate to the address you want to go to, you just have to call RegisterJsObject, supplying a name for the Javascript variable and a reference to your .NET object. The variable is then made available to the Javascript running on the page.

The proof of concept that I put together used a very simple object (a class with a few string properties), so I haven’t tested the limits of this approach, but I’m pretty sure you can do most of the basic things like arrays and properties that are objects themselves (i.e. Foo.Bar.Thing).

The interesting part is that the variable is made available to every page that is navigated to in the Browser from that point forward, so if there is a link on your page that goes somewhere else, it will get the same context as the previous page did.
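
To make that a little more concrete, here’s a rough sketch of the C# side. The class, the values and the URL are all made up for illustration; the real proof of concept differed in the details.

using CefSharp.Wpf;

// Simple data holder exposed to the page. Lower case property names are used
// deliberately so they line up with the injectedData.name / injectedData.company
// accesses in the page below, regardless of how your version of CEFSharp handles
// camel casing of member names.
public class InjectedData
{
    public string name { get; set; }
    public string company { get; set; }
}

// ... wherever the browser control gets created ...
var browser = new ChromiumWebBrowser();

// Register the object BEFORE navigating; it becomes a global Javascript variable
// called injectedData on every page loaded from this point forward.
browser.RegisterJsObject("injectedData", new InjectedData
{
    name = "Jane Citizen",
    company = "Example Pty Ltd"
});

browser.Address = "https://example.com/questionnaire";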

In my case, my page was completely trivial, echoing back the injected variable if it existed.

<html>
<body>
<p id="content">
Script not run yet
</p>
<script>
if (typeof(injectedData) == 'undefined')
{
    document.getElementById("content").innerHTML = "no data specified";
}
else 
{
    document.getElementById("content").innerHTML = "name: " + injectedData.name + ", company: " + injectedData.company;
}
</script>
</body>
</html>

Summary

It is more than possible to have good integration between a website (well, one featuring Javascript at least) and an installed C# application, even if you happen to have the misfortune of using ClickOnce to deploy it. Really, all of the congratulations should go to the CEFSharp guys for creating an incredibly useful and flexible browser component using Chromium, even if it does feel monstrous and unholy.

Now, whether or not any of the above is a good idea is a different question. I had to make some annoying compromises to get the whole thing working, especially with ClickOnce, so it’s not exactly in the best place moving forward (for example, upgrading the version of CEFSharp is going to be pretty painful due to those hard references that were manually added into the csproj file). I made sure to document everything I did in the repository readme, so it’s not the end of the world, but it’s definitely going to cause pain in the future for some poor bastard.

This is probably how Dr Frankenstein felt, if he were a real person.

Disgusted, but a little bit proud as well.

I’m a lucky developer.

Not because I tend to be able to escape from dangerous situations mostly intact (though that does seem to be a running theme) but because I have the good fortune to work for a company that actually supports the professional development of its employees.

I don’t mean the typical line of “yeah we support you, we totally have a training budget, but good luck getting even a token amount of money to buy Pluralsight to watch courses on your own time” that you usually get. I mean the good sort of support.

The only sort of support that matters.

Time.

Time Is Decidedly Not On My Side

When it comes to extending my technical capabilities, the last thing that I’m worried about is spending money. There are endless ways to improve yourself in your chosen field without spending a cent thanks to the wonders of the internet, and even if I did have to part with some cash in order to get better at what I do, I would be crazy not to. Every skill I gain, every piece of experience I gather, improves my ability to solve problems and makes me a more valuable asset to whichever company I choose to associate with. More often than not, that means more money, so every cent expended in the pursuit of improvement is essentially an investment in my (and my family’s) future.

So money is not really an issue.

Time on the other hand is definitely an issue.

My day job takes up around a third of the average day, including some travelling. Combine that with some additional work that I do for a different company plus the requirement to sleep, eat and exercise (to stay healthy) and my weekdays are pretty much spoken for. Weekends are taken up by a mixture of gardening (which I tell myself is a form of relaxation, but is really just another way to indulge my desire to solve problems), spending time with my family and actual downtime to prevent me from going insane and murdering everyone in a whirlwind of viscera.

That doesn’t leave a lot of time to dedicate to self-improvement, at least not without sacrificing something else (which, let’s be honest, is probably going to be sleep).

I try to do what I can when I can, like reading blog posts and books, watching courses and writing posts like this (especially while travelling on the train to and from work), but it’s becoming less and less likely that I can grab a dedicated chunk of time to really dig into something new, with the aim of understanding how and why it works and how best to apply it to any problems that it might fit.

Having a dedicated piece of time gifted to me is therefore a massive boon.

Keeping Employees Through Mutually Beneficial Arrangements

We have a pretty simple agreement where I work.

Friday afternoon is for professional development. Not quite the legendary 20% time allegedly offered by Google, but it’s a massive improvement over most other places that I know of.

Our professional development is structured, in the sense that we talk amongst ourselves and set goals and deadlines, but it doesn’t generally involve anyone outside the team. No-one approves what we work on, though they do appreciate being informed, which we do in a number of different ways (one-on-one management catchups, internally visible wiki, etc). There is financial support as well (reimbursement for courses or subscriptions and whatnot), which is nice, but not super critical for reasons I’ve already outlined above.

So far we’ve aimed for a variety of things, from formal certifications (AWS Certifications are pretty popular right now) through to prototypes of potential new products and tools all the way to just using that time to fix something that’s been bugging you for ages. Not all “projects” end in something useful, but it’s really about the journey rather than the destination, because even an incomplete tool built using a new technology has educational value.

As a general rule, as long as you’re extending yourself, and you’ve set a measurable goal of some sort, you’re basically good to go.

Conclusion

Making time for professional development can be hard, even if you recognise that it is one of the most value-rich activities that you could engage in.

It can be very useful to find a company (or cajole your current organisation) into supporting you in this respect, especially by providing some dedicated time to engage in educational activities. I recommend against using any time provided in this way to just watch videos or read blogs/books, and to instead use it to construct something outside of your usual purview. That’s probably just the way I learn though, so take that advice with a grain of salt.

Giving up time that could be used to accomplish business priorities can be a hard sell for some companies, but as far as I can see, it’s a win-win situation. Anecdotally, Friday afternoons are one of the least productive times anyway, so why not trade that time in for an activity that both makes your employees better at what they do and generates large amounts of goodwill?

Given some time, they might even come up with something amazing.

Unfortunately for us, we had to move our TeamCity build server into another AWS account recently, which is never a pleasant experience, though it is often eye opening.

We had to do this for a number of reasons, but the top two were:

  • We’re consolidating some of our many AWS accounts, to more easily manage them across the organisation.
  • We recently sold part of our company, and one of the assets included in that sale was a domain name that we were using to host our TeamCity server on.

Not the end of the world, but annoying all the same.

Originally our goal was to do a complete refresh of TeamCity, copying it into a new AWS account, with a new URL and upgrading it to the latest version. We were already going to be disrupted for a few days, so we thought we might as well make it count. We’re using TeamCity 8, and the latest is 10, which is a pretty big jump, but we were hopeful that it would upgrade without a major incident.

I should have remembered that in software development, hope is for fools. After the upgrade to TeamCity 10, the server hung on initializing for long enough that we got tired of waiting (and I’m pretty patient).

So we abandoned the upgrade, settling for moving TeamCity and rehosting it at a different URL that we still had legal ownership of.

That went relatively well. We needed to adapt some of our existing security groups in order to correctly grant access to various resources from the new TeamCity Server/Build Agents, but nothing we hadn’t dealt with a hundred times before.

Our builds seemed to be working fine, compiling, running tests and uploading nuget packages to either MyGet or Octopus Deploy as necessary.

As we executed more and more builds though, some of them started to fail.

Failures Are Opportunities To Get Angry

All failing builds were stopping in the same place, when uploading the nuget package at the end of the process. Builds uploading to Octopus Deploy were fine (it’s a server within the same AWS VPC, so that’s not surprising), but a random sampling of builds uploading packages to MyGet had issues.

Investigating, I found that the common theme among the failing builds was largish packages. Not huge, but at least 10 MB. The nuget push call would time out after 100 seconds, trying a few times, but always experiencing the same issue.

With 26 MB of data required to be uploaded for one of our packages (13 MB package, 13 MB symbols, which we should probably optimize), this meant that the total upload speed we were getting was less than 300 KBps (26 MB over the 100 second timeout works out to roughly 260 KBps), which is pretty ridiculously low for something literally inside a data centre.

The strange thing was, we’d never had an issue with uploading large packages before. It wasn’t until we moved TeamCity and the Build Agents into a new AWS account that we started having problems.

Looking into the network configuration, the main differences I could determine were:

  • The old configuration used a proxy to get to the greater internet. Proxies are the devil, and I hate them, so when we moved into the new AWS account, we put NAT gateways in place instead. Invisible to applications, a NAT gateway is a far easier way to give internet access to machines that do not need to be exposed on the internet directly. 
  • Being a completely different AWS account means that there is a good chance those resources would be spun up on entirely different hardware. Our previous components were pretty long lived, so they had consistently been running on the same stuff for months.

At first I thought maybe the NAT gateway had some sort of upload limit, but uploading large payloads to other websites was incredibly fast. With no special rules in place for accessing the greater internet, the slow uploads to MyGet were an intensely annoying mystery.

There was another thing as well. We wrap our usages of Nuget.exe in Powershell functions, specifically to ensure we’re using the various settings consistently. One of the settings we set by default with each usage of push is the timeout. It wasn’t set to 100 seconds though, it was set to 600.

Bugs, Bugs, Bugs

A while back I had to upgrade to the latest Nuget 3.5 release candidate in order to get a fix for a bug that was stopping us from deploying empty files from a package. It’s a long story, but it wasn’t something we could easily change. Unfortunately, the latest release candidate also has a regression in it where the timeout for the push is locked at 100 seconds, no matter what you do.

It’s been fixed since, but there isn’t another release candidate yet.

Rolling back to a version that allows the timeout to work correctly stops the other thing (the empty file fix) from working.

That whole song and dance is how software feels sometimes.

With no obvious way to simply increase the timeout, and because all other traffic seemed to be perfectly fine, it was time to contact MyGet support.

They responded that it’s something they’ve seen before, but they do not know the root cause. It appears to be an issue with the way that AWS is routing traffic to their Azure hosts. It doesn’t happen all the time, but when it does, it tanks performance. They suggested recycling the NAT gateway to potentially get it on new hardware (and thus give it a chance at getting access to better routes), but we tried that and it didn’t make a difference. We’ve since sent them some detailed fiddler and network logs to help them diagnose the issue, but I wouldn’t be surprised if it was something completely out of their control.

On the upside, we did actually have a solution that was already working.

Our old proxy.

It hadn’t been shut down yet, so we configured the brand new shiny build agents to use the old proxy and lo and behold, packages uploaded in a reasonable time.

This at least unblocked our build pipeline so that other things could happen while we continue to investigate.

Conclusion

Disappointingly, that’s where this blog post ends. The solution that we put into place temporarily with the old proxy (and I really hate proxies) is a terrible hack, and we’re going to have to spend some significant effort fixing it properly, because if that proxy instance dies, we could be returned to exactly the same place without warning (if the underlying issue is something to do with routing that is out of our control).

Networking issues like the one I’ve described above are some of the most frustrating, especially because they can happen when you least expect it.

Not only are they basically unfathomable, there is very little you can do to actively fix the issue other than moving your traffic around trying to find a better spot.

Of course, we can also optimise the content of our packages to be as small as possible, hopefully making all of our artifacts small enough to be uploaded with the paltry amount of bandwidth available.

Having a fixed version of Nuget would be super nice as well.