
Generally, I’ve been pretty happy with the way that we do log aggregation, using Elasticsearch, Logstash and Kibana (ELK). That’s not to say it couldn’t use improvements in almost every area, but for now, it does what it needs to do. It’s definitely better than what we had before, and people are actually using it to answer questions, which I would consider the mark of a successful implementation.

Of all the components in the triumvirate though, Logstash has easily been the worst. Kibana has been basically problem free (somewhat to be expected for a visualisation framework that delegates the majority of its work to the data store) and Elasticsearch has been complicated and somewhat temperamental (mostly because we haven’t had the time nor experience to tune it properly), but Logstash has been a pain.

Version 1.4 flat out did not work properly on Windows with file based inputs, as it didn’t properly remember its position in files it was reading. Version 1.5.0 fixed that problem, but we had to suffer through a number of beta issues because we were forced to use the beta version. Even after the 1.5 release we’ve had some other issues relating to TCP outputs consuming mass amounts of CPU (kind of annoying on CPU Credit limited instances), using the S3 input from behind a proxy and recovering reliably after its TCP output endpoint was down temporarily.

It’s become a bit of a meme between myself and one of our operations people that whenever something weird happens with logs (even if it’s unrelated to the actual process of log aggregation), it’s probably Logstash’s fault.

When one of our AWS instances responsible for filling API requests became unresponsive a few weeks ago, we joked that it was probably Logstash’s fault.

It turned out that it actually was.

What’s That Thing You Use to Dig Food?

Our APIs all have redundancy built into them via the magic of auto scaling groups and multiple instances. It’s the kind of thing you have to have when you’re running anything in AWS, as there is no guarantee that your instances will be long lived. It’s likely that they will be fine and everything will go swimmingly, but there is no guarantee. Honestly, that’s no different from normal hosting; the information is just available upfront, instead of after a few months when something explodes. Plan for failure, because it will happen.

Anyway, a few weeks ago one of our API instances stopped doing the thing it was supposed to be doing, i.e. filling API requests, so it was quietly removed from the load balancer. The alarms we have set up for the number of healthy instances kicked in, so we put the failing instance into standby mode (which spun up a new one to take its place) and started investigating.

Looking at the logs in Kibana, the instance started returning 500s sporadically early in the morning and then got progressively worse until it was removed from the load balancer a little after 0800. The root cause of the 500s? OutOfMemoryExceptions; the API process literally could not allocate a piece of memory large enough to do the thing it wanted to do.

My first response was to make that deep guttural noise that only comes from software developers when they know what’s about to happen. A mixture between a groan and a resigned sigh, tinged with frustration.

I’ve dealt with memory leaks before.

They are terrible, and I hate them.

I assumed that we had a memory leak in our deployed .NET application. Although .NET is a managed language, it is surprisingly easy to leak memory, with various ways to root vast amounts of information, stopping the garbage collector from recovering it.

The best tool that I’ve found for visualising memory leaks in .NET is the SciTech .NET Memory Profiler. Honestly though, other than Visual Studio itself, it’s basically the only one I’ve really used, so I can’t strictly say that it’s better than anything else. I can say that it is good at what it does, because it’s helped me locate and fix quite a few memory leaks.

Given that the API had been running uninterrupted for a few weeks, the memory leak would be a slow one. It might even be related to specific requests that only happen rarely or, even worse, it might be time based rather than event based. I knew I would need to either reproduce the problem locally (which might be quite hard) or find some way to get the memory profiler tool into the environment where the problem occurred and then wait until it happened again.

Like I said, I hate memory leaks.

Surprise! It Wasn’t Us

Remoting onto the AWS instance with the memory issue showed an entirely unexpected situation.

Our .NET process wasn’t using much memory at all.

The Java process, however, was using 1.5GB of physical memory and a huge chunk of virtual memory. The only piece of Java code running? Logstash.

Now I had a different problem. At least with a memory leak within our software I could trace it, find it and fix it. I know .NET, I’ve done memory leak analysis before, and while it’s terrible, it’s a known problem. It might take me a while to get to the bottom of it, but I would get there.

Logstash runs Ruby via JRuby, which runs inside the Java Virtual Machine. It’s open source, but I wouldn’t even know where to begin with that mess. How could I find out why it was using so much memory for the apparently simple task of processing some text files? None of our Logstash configs are fancy; they just process text files and shunt the results to another instance of Logstash, which puts them into Elasticsearch.

Clearly, The Problem Is Solved

At this particular point in time, I do not have the luxury of getting to the root cause of problems with supplementary components. This makes me sad inside, but that’s how reality rolls sometimes.

The easiest solution right now? Restart the Logstash service every day.

Easy enough to implement with a Scheduled Task that executes Powershell.
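
The task body itself is tiny. Here’s a minimal sketch of what it might look like; the service name “logstash” is an assumption, since it depends entirely on how the service was registered:

# Restart the Logstash service and fail loudly if it doesn't come back.
# Assumption: Logstash is installed as a Windows service named "logstash".
$serviceName = "logstash"

Write-Output "Restarting service [$serviceName] to reclaim its memory."
Restart-Service -Name $serviceName -Force

$service = Get-Service -Name $serviceName
if ($service.Status -ne "Running")
{
    Write-Error "Service [$serviceName] is in state [$($service.Status)] after the restart."
    exit 1
}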

I did have one incredibly weird issue with the task. The task name that I generated for the service restart had square braces in it. As a result, so did the folder where the Powershell task puts its transcribed output. The Start-Transcript command does not like square braces in its output file path for some reason, and it fails with “Start-Transcript: The host is not currently transcribing”, which is incredibly helpful. This particular quirk took far too long to get to the bottom of.
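
If you hit the same quirk, the simplest workaround I can suggest is to strip the braces (and any other wildcard characters) out of the name before it becomes part of a path. A sketch of that idea, with a hypothetical task name:

# Sanitise a task name before using it in a transcript path, because square
# braces are treated as wildcard characters by Start-Transcript.
# The task name below is hypothetical.
$taskName = "Restart Logstash [Daily]"
$safeName = ($taskName -replace '[\[\]\*\?]', '').Replace(' ', '_')

$transcriptDirectory = "C:\logs\scheduled-tasks\$safeName"
New-Item -ItemType Directory -Path $transcriptDirectory -Force | Out-Null
Start-Transcript -Path "$transcriptDirectory\transcript.log" -Append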

Long term we need to look into the issue in depth. Maybe it’s a problem in the particular version of Logstash we are using (1.5.2), maybe it’s our configuration or maybe it’s the environment. I don’t know, and that particular investigation could be a gigantic time sink, so we’ll have to dedicate some time to it later on.

Another solution would be to not use Logstash directly on the machines generating the logs, and to instead use the Logstash Forwarder, which is a lightweight replacement for Logstash that only forwards events to a remote location. The remote location is typically an actual Logstash instance which has all the logic to parse the event into a meaningful format. To me this just moves the issue around, but it still might be an option.

Conclusion

This is actually the first case that I can remember where a memory leak wasn’t the fault of the software I was actively working on. I call it a memory leak because I can’t conceive how an apparently simple process like Logstash could use so much memory over time, but it might just be what it does. I mean, I wouldn’t consider SQL Server using massive amounts of memory a leak per se (because I understand that what it does is hard, and it consumes more memory based on the size of the data). It’s just really weird that a process which should be entirely transient (read, transform, write, discard) could keep using more and more memory over time without there being some sort of bug.

To be honest, I’m extremely unhappy with installing a scheduled task to restart the Logstash service. It’s one of those band-aid fixes that you just know will end up biting you hard at some point in the future. It will probably cause an entirely worse issue, like massive, unexpected data loss or something.

For now, at least I know that the Logstash process will not stop our APIs from working, which I guess is a victory.

A pyrrhic victory.


As I’ve said repeatedly before, logging is easily one of the most important things that come out of a piece of software. Good logging can be used for all sorts of purposes, but most importantly, it makes maintaining and supporting a piece of software much, much easier. I won’t go into any more detail about how important I think logging is right now, as I’m going to use this post to talk about dealing with the bits and pieces that logging often leaves behind.

Mostly that means log files.

In our environment setup, we have drifted towards a pattern where all log files are written into a common logs directory, with subdirectories for each component.

If we want to find anything out about our components, we can just go to the logs directory. Nice and easy, no need to waste time hunting over the file system for logs that matter. Of course, it’s only used as a backup, as all of our logs are aggregated into ELK, but it’s still useful.

At this point in time, most of our components simply create log files on disk, and write everything there. It’s then up to an external component (Logstash in most cases, Nxlog in very rare cases) to take those files, do something to them and write the events to some other location for aggregation. We have log files being created by Squid proxies, performance counters, web services, IIS, Windows services and Powershell scripts, and probably many other things that aren't occurring to me right now.

Of course, the downside of this extensive logging is that you are constantly consuming more and more disk space. Some logging frameworks and mechanisms manage that for you (either with a maximum directory size or some maximum number of days of log files), but not all.

A great side effect of having all logs go to the same directory is that it’s easy to write something to clean up that directory based on some retention policy (for example, not written to in the last 14 days).

So we did just that.

Scheduled Annihilation

Every environment that we create has two Powershell scripts executed on all machines by default.

The first sets up the logs directory at the common location, ensuring that it is writable by all appropriate users, and that all subdirectories and files are also appropriately writable.

We implemented this because when we set up a Performance Counter to write a log file containing system stats, it created the directory with an owner of System, and everything else just failed miserably from that point forward, completely breaking the common log location right off the bat. It was, for some reason, stupidly difficult to claim the folder back and fix its permissions as well, so we just implemented a script to configure the folder ahead of time.

function InitialiseCommonLogsDirectory
{
    $directoryPath = "C:\logs"
    if (-not (Test-Path $directoryPath))
    {
        Write-Verbose "Creating common logs directory at [$directoryPath]"
        New-Item $directoryPath -Type Directory 
    }

    Write-Verbose "Making common logs directory [$directoryPath] writeable by everyone."
    $Acl = Get-Acl $directoryPath

    $Ar = New-Object System.Security.AccessControl.FileSystemAccessRule("Everyone", "FullControl", "ContainerInherit,ObjectInherit", "None", "Allow")

    $Acl.SetAccessRule($Ar)
    Set-Acl $directoryPath $Acl
}

The second script installs a Scheduled Task to run a fairly simple piece of Powershell that just deletes all files that haven’t been written to in the last 14 days. The assumption was that the physical files on disk were relatively unimportant as long as the log aggregation was working correctly, but we keep a small window just in case.

Annoyingly, the commands for setting up a scheduled task via Powershell are a bit arcane, mostly because you have to resort to using COM services to get the job done. The script to set up the scheduled task looked like this:

function CreateLogsClearingTask
{
    $directoryPath = "C:\logs"

    $taskName = "Clear Old Logs"

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")
    try
    {
        $rootFolder.DeleteTask($taskName, 0)
    }
    catch
    {
        if (-not ($_.Exception.Message -like "*The system cannot find the file specified*"))
        {
            throw $_.Exception
        }
    }

    $daysOld = 14
    $taskDescription = "Clears Logs that are older than [$daysOld] days"
    $taskCommand = "powershell"
    $taskArguments = "-Command `"&{ Get-ChildItem $directoryPath\* -Include *.log -Recurse | Where LastWriteTime -lt (Get-Date).AddDays(-$daysOld) | Remove-Item }`" -ExecutionPolicy Bypass"
 
    write-verbose "Creating Scheduled Task to automatically execute [$taskCommand] with arguments [$taskArguments] on a regular basis."

    $TaskStartTime = [DateTime]::Now.Date.AddHours(1)
 
    $TaskDefinition = $service.NewTask(0) 
    $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
    $TaskDefinition.Settings.Enabled = $true
    $TaskDefinition.Settings.AllowDemandStart = $true
 
    $triggers = $TaskDefinition.Triggers
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
    $trigger = $triggers.Create(2)
    $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
    $trigger.Enabled = $true
    $trigger.DaysInterval = 1
 
    # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
    $Action = $TaskDefinition.Actions.Create(0)
    $action.Path = "$taskCommand"
    $action.Arguments = "$taskArguments"
 
    #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
    $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5)

    $task = $rootFolder.GetTask($taskName)

    Write-Verbose "Executing Scheduled Task."
    $task.Run($null)
}

Not that friendly.

You may be thinking: why not just use the awesome scheduled task cmdlets available in later versions of Windows (New-ScheduledTaskTrigger, Get-ScheduledTask, Register-ScheduledTask and friends)? These commands are extremely useful and easy to understand, but unfortunately, in our particular case, we needed to target operating systems where those features are not available. Specifically, our build server (which runs the automated tests for the Powershell scripts) is running an older version of Windows Server, where the cmdlets are unavailable. Sometimes you just have to target the lowest common denominator.
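
For reference, on an operating system where that module is available, the same daily cleanup task can be registered in a handful of lines. This is an untested sketch rather than something we actually deployed:

# Register the log cleanup task using the ScheduledTasks module
# (available on Windows 8 / Windows Server 2012 and later only).
$argument = '-ExecutionPolicy Bypass -Command "& { Get-ChildItem C:\logs\* -Include *.log -Recurse | Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-14) } | Remove-Item }"'
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument $argument
$trigger = New-ScheduledTaskTrigger -Daily -At 1am
Register-ScheduledTask -TaskName "Clear Old Logs" -Action $action -Trigger $trigger -User "System"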

Anyway, the COM-based script above looks straightforward enough, right? I mean, it looks like it does the job.

It did not.

Logs! Logs Everywhere!

We didn’t notice for a pretty long time, because the growth was tiny on our non-production environments, but once we started to have some real traffic, the log files for IIS and the web service started to add up quickly. Luckily, we have dashboards that track the disk usage of our various server components, and someone happened to notice the % free space on those machines dropping steadily over time.

A closer look at the script above will show you that it was failing because it was trying to clean up C:\logs\logs, which of course doesn’t exist. The directory variable had C:\logs already in it, and the Powershell script was appending another \logs to it before it was being run.

It turns out that the task had been failing continuously, pretty much since we implemented it, and because the leading indicator of failure (a failed task run) never appeared, no-one noticed. The task reported success because Powershell returns an exit code of 0 even when an exception is thrown.

I cleaned up the script, wrote a test to verify that it was actually deleting old files (without error), incorporated the fixed scripts into our environment setup and deployed the fix to the live machines.

Once deployed, the task said it succeeded, but looking at the log files, they hadn’t actually been cleaned up at all. It had failed silently again.

Powershell and Failure: Close Friends

The root cause of the task not indicating failure correctly was that the Powershell executable, for some ungodly reason, returns an exit code of 0 when an exception is thrown. If you want to make it act like a normal application (non-zero exit code = bad, zero = good), you have to wrap your script in a try…catch at the root and exit with a non-zero code yourself.
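
Stripped of everything else, the pattern is just this (a minimal sketch; the real work would go where the placeholder is):

# Translate a thrown exception into a non-zero exit code so the Task
# Scheduler (or any other caller) can tell that the script failed.
try
{
    # ... the actual work of the script goes here ...
    Get-ChildItem "C:\logs" -Recurse | Out-Null
}
catch
{
    Write-Error $_
    exit 1
}
exit 0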

A little bit more refactoring around the Powershell scripts that create the scheduled task, and I had it automatically wrapping the supplied script in a try…catch and exiting with a failure code if an exception occurred. I’ve included the script that creates the task below, but it’s only part of a bigger system (which you can see in the Solavirum.Scripts.Common repository, when I finish updating it).

function New-DailyPowershellScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$taskName,
        [string]$taskDescription,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [TimeSpan]$TimeOfDay="0:00",
        [switch]$ExecuteImmediately=$true
    )

    $service = new-object -ComObject("Schedule.Service")
    $service.Connect()
    $rootFolder = $service.GetFolder("\")

    try
    {
        $command = "powershell.exe"
        $arguments = _ConvertPowershellScriptIntoArgumentsForScheduledTask -script $script -taskName $taskName

        $TaskStartTime = [DateTime]::Now.Date.Add($TimeOfDay)
 
        $TaskDefinition = $service.NewTask(0) 
        $TaskDefinition.RegistrationInfo.Description = "$taskDescription"
        $TaskDefinition.Settings.Enabled = $true
        $TaskDefinition.Settings.AllowDemandStart = $true
 
        $triggers = $TaskDefinition.Triggers
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa383915(v=vs.85).aspx
        $trigger = $triggers.Create(2)
        $trigger.StartBoundary = $TaskStartTime.ToString("yyyy-MM-dd'T'HH:mm:ss")
        $trigger.Enabled = $true
        $trigger.DaysInterval = 1
 
        # http://msdn.microsoft.com/en-us/library/windows/desktop/aa381841(v=vs.85).aspx
        $Action = $TaskDefinition.Actions.Create(0)
        $action.Path = "$command"
        $action.Arguments = "$arguments"
 
        #http://msdn.microsoft.com/en-us/library/windows/desktop/aa381365(v=vs.85).aspx
        $rootFolder.RegisterTaskDefinition("$TaskName",$TaskDefinition,6,"System",$null,5)

        $task = $rootFolder.GetTask($taskName)

        if ($ExecuteImmediately)
        {
            Write-Verbose "An immediate execution of the just created scheduled task [$taskName] was requested. Executing now."
            $task.Run($null)
        }
    }
    finally
    {
        [System.Runtime.Interopservices.Marshal]::ReleaseComObject($service)
    }
}

function _ConvertPowershellScriptIntoArgumentsForScheduledTask
{
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$script,
        [string]$taskName
    )

    $taskLogsDirectoryName = "$($taskName.Replace(' ', '_'))"
    $rootLogsDirectoryPath = "C:\logs\scheduled-tasks"
    $logsDirectoryPath = "$rootLogsDirectoryPath\$taskLogsDirectoryName"
    $logDirectory = New-Item -ItemType Directory -Path $logsDirectoryPath -Force
    return "-command `"& { try { `$timestamp = [DateTime]::Now.ToString('yyyyMMdd'); Start-Transcript -Path $logsDirectoryPath\`$timestamp.log -Append; $script } catch { Stop-Transcript; exit 1; } }`" -ExecutionPolicy Bypass"
}

Now that I had fixed the issue with it failing to fail in an obvious way, it was time to add some logging, so I could see why it wasn’t deleting the logs.

Using the code that wraps the incoming script in a try…catch, I added a call to Start-Transcript at the start of the script, and a call to Stop-Transcript at the end. Start-Transcript is a useful feature built into Powershell that lets you write everything interesting that happened during the transcript block to a file. It’s not perfect (it doesn’t seem to record Verbose statements), but it does record the output stream and any errors and warnings, which is enough.

Be careful. If you supply Start-Transcript with a file path inside a directory that does not exist, it appears to just fail silently. I’m pretty sure I tried the -Force flag (which usually deals with these things), but it didn’t seem to help. Make sure your destination directory exists before you try to start the transcript.

I made sure that the log files coming out of the scheduled task were appropriately date stamped (so they too, could in turn be cleaned up) and deployed the updated scheduled task again.

The Root of My Problems

The transcript file showed that I was getting an Access Denied exception when the script was running as a task. Obviously this did NOT happen in my tests, so it was something specific to the deployed environments.

Fair enough, that can happen with scheduled tasks, because they default to running as the System account, which doesn’t necessarily have normal permissions.

The interesting thing was that the root logs directory was specifically marked as full control for all appropriate users, and that included the System account. Checking the permissions on the directory with Get-Acl proved that to be true. System clearly had the ability to do whatever it liked to the logs directory and its contents.
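
For reference, checking that sort of thing from Powershell only takes a few lines (shown here against the common logs directory):

# List the access rules on the common logs directory.
(Get-Acl "C:\logs").Access |
    Select-Object IdentityReference, FileSystemRights, AccessControlType |
    Format-Table -AutoSize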

So where was the Access Denied error coming from?

Time to go back to debugging basics: it was time for Process Monitor (procmon). Process Monitor is one of those awesome Sysinternals tools supplied by Microsoft (well, someone at Microsoft). It lets you monitor every single event happening within Windows, including interactions with the registry, the file system and all sorts of other things. Also, more importantly for the maintenance of sanity, it lets you filter those events.

Firing up procmon and running the scheduled task showed that it was clearly deleting the very first file it found that was old enough to be deleted, and then stopping. There were no obvious errors coming from the file system; the process just seemed to stop trying. If I ran the task again, the first file had actually been deleted, and it would delete the next one (which was now the first) and then stop again.

I still have no idea why it was doing this. I suspect it might have had something to do with deleting files during a streaming operation in Powershell (i.e. Get|Where|Delete), but if that were the case, I would think it would happen during my test as well.

I fixed the issue by wrapping the Get|Where in brackets, to make sure it was evaluated before it was streamed to the Delete functions, kind of like using a ToArray or ToList in C# when enumerating through a collection you intend to remove values from.
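
In other words, the change was roughly this (the before version is commented out, and mirrors the cleanup pipeline shown earlier):

$directoryPath = "C:\logs"
$daysOld = 14

# Before: files were being deleted while the pipeline was still enumerating
# them, and the task quietly stopped after the first deletion.
# Get-ChildItem $directoryPath\* -Include *.log -Recurse |
#     Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-$daysOld) } |
#     Remove-Item

# After: evaluate the Get|Where up front, then delete the resulting list.
(Get-ChildItem $directoryPath\* -Include *.log -Recurse |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-$daysOld) }) |
    Remove-Item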

Once I made that change everything worked. The files were deleted, the task indicated success and all was well in the world.

Conclusion

There are a number of lessons to come out of this, but the most important is to make sure that your failure indicators actually indicate failure when a failure occurs, which is really just a specific application of “test all the things”.

It’s not enough to just test it yourself either (in a unit test or similar); you must test in the environment that it will actually be deployed to, because the differences will catch you out. In addition to testing, you need to have indicators or events that you monitor in order to make sure everything is acting as you expect. In our case, the monitoring that we had on the machines allowed us to notice the issue, but if no-one had been paying attention, we would probably have had issues when the disk finally filled up.

It would have been better to have alerts that warned us when something was trending the wrong way. We probably glossed over that sort of thing because we thought it was a solved problem (after all, we had a script), so I suppose the moral is to question and check everything, even if you think it’s fine.

Because it’s probably not.


Almost a year ago I wrote a blog post detailing an adventure I had with ClickOnce, certificates and trusted software. It was exhausting and like most things certificate related, frustrating and confusing. I got it working in the end, and everything was good.

For about a year.

I wrote the blog post a few months after I dealt with the issue, which was a few months after the certificate had been issued.

Thus, a few months ago, the certificate expired.

The renewal was remarkably painless, all things considered. I still had access to the same Windows Server virtual machine that I used for the last certificate, which was nice, and more importantly, I wrote down everything I did last time so I wouldn’t have to discover it all again.

I got the new certificate, incorporated it into the build/publish process and everything was good again.

Or so I thought.

Testing Is Important

This is one of those cases where you realise that testing really is important, and that you shouldn’t slack on it.

It’s a little bit hard in this project, because I don’t technically actively work on it anymore, and it was pretty barebones anyway. No build server or anything nice like that, no automated execution of functional tests, just a git repository, Visual Studio solution and some build scripts.

When I incorporated the new certificate, I did a build/publish to development to make sure that everything was okay. I validated the development build on a Windows 7 virtual machine that I had set up for just that task. It seemed fine. It installed correctly over the top of the old version, kept settings, no errors or scary warning screens about untrusted software, so I didn’t think any more of it.

I didn’t need to re-publish to production, because the old build was timestamped, so it was still valid with the old certificate. Other than the certificate changes, there was nothing else, so no need to push a new version.

What I didn’t test was Windows 7 with only .NET Framework 4.0 (or Windows XP, which can never have anything newer than .NET Framework 4.0).

Last week I did some reskinning work for the same project, giving the WPF screens a minor facelift to match the new style guide that had been implemented on the website. Pretty straightforward stuff in the scheme of things, thanks to WPF. I switched up some button styles, tweaked a few layouts and that was that.

I built and published to all 3 environments, validated that it worked (again on my normal Windows 7 test instance) and moved on to other things.

Then the bug reports started coming in.

Software That Doesn’t Work Is Pretty Bad

Quite a few people were unable to install the new version. As a result, they were unable to run the software at all, because that’s how ClickOnce rolls. The only way they could start up was to disconnect from the internet (so it would skip the step to check for new versions) and then reconnect after the software had started.

The reports all featured the same error screen.


It was basically the default ClickOnce failure screen, so nothing particularly interesting there. It complains about the format of the application being wrong. Generic error, nothing special. The log though, that was interesting:

Below is a summary of the errors, details of these errors are listed later in the log.

* Activation of [REDACTED – Installation Location] resulted in exception. Following failure messages were detected:
+ Exception reading manifest from [REDACTED – Manifest Location]: the manifest may not be valid or the file could not be opened.
+ Manifest XML signature is not valid.
+ SignatureDescription could not be created for the signature algorithm supplied.

So apparently I have an invalid signature, which, of course, is related to certificates. Considering it all worked fine on newer/updated machines, my mind immediately went to the predictable thorn in my side, the differences between .NET 4 and .NET 4.5. Windows XP can never have .NET 4.5, and Windows 7 boxes that haven’t been updated to the latest version won’t have .NET 4.5 either.

A quick search showed that .NET 4.0 only supports SHA-1 for ClickOnce manifest signatures, and my manifests were signed with SHA-256.

But what did I do that caused it to stop working? It worked perfectly well before I published my reskinning changes, and they were pure XAML, nothing special.

If you’ve been paying attention, what I did was obvious. I changed the certificate.

To me, that wasn’t obvious, because I changed the certificate a few months ago, and I’d already forgotten about it.

After a few hours of investigation, I came across some reports of Visual Studio 2013 not signing ClickOnce applications targeting .NET 4 with the proper algorithm. My new certificate was SHA-256, so this sounded plausible. The problem was, this particular bug was fixed in Visual Studio 2013 Update 3, and I was running Update 4.

This StackOverflow post in particular was helpful, because it reminded me that I was using Mage to re-sign the manifests after I used SignTool to sign the assembly.

Apparently the fix they made in Visual Studio 2013 Update 3 was never incorporated into Mage, so you quite literally cannot choose to sign the manifests with SHA-1 using an SHA-256 certificate, even if you need to.

Fixing the Match

The fix was actually pretty elegant, and I’m not sure why I didn’t do the whole signing thing this way in the first place.

Instead of writing custom code to sign the manifests again when signing the executable after building/publishing via MSBuild, I can just call MSBuild twice, once for the build and once for the publish. In between those two calls, I use SignTool to sign the executable, and everything just works.

$msbuild = "C:\Program Files (x86)\MSBuild\12.0\bin\msbuild.exe"
$solutionFile = $PSScriptRoot + "\[SOLUTION_FILE_NAME]"

.\tools\nuget.exe restore $solutionFile

$msbuildArgs = '"' + $solutionFile + '" ' + '/t:clean,rebuild /v:minimal /p:Configuration="' + $configuration + '"'
& $msbuild $msbuildArgs

if($LASTEXITCODE -ne 0)
{
    write-host "Build FAILURE" -ForegroundColor Red
    exit 1
}

$appPath = "$msbuildOutput\$applicationName.application"
$timestampServerUrl = "http://timestamp.comodoca.com/authenticode"

$certificateFileName = "[CERTIFICATE_NAME].pfx"
$intermediateCertificateFileName = "[INTERMEDIATE_CERTIFICATE_NAME].cer"
$executableToSign = "[PATH_TO_EXECUTABLE]"

$certificatesDirectory = $PSScriptRoot + "[CERTIFICATES_DIRECTORY]"
$certificateFilePath = "$certificatesDirectory\$certificateFileName"
$intermediateCertificateFilePath = "$certificatesDirectory\$intermediateCertificateFileName"

& tools\signtool.exe sign /f "$certificateFilePath" /p "$certificatePassword" /ac "$intermediateCertificateFilePath" -t $timestampServerUrl "$executableToSign"
if($LASTEXITCODE -ne 0)
{
    write-error "Signing Failure"
    exit 1
}

$msbuildArgs = '"' + $solutionFile + '" ' + '/t:publish /v:minimal /p:Configuration="' + $configuration + '";PublishDir=' + $msbuildOutput + ';InstallUrl="' + $installUrl + '";IsWebBootstrapper=true;InstallFrom=Web'
& $msbuild $msbuildArgs

The snippet above doesn’t contain definitions for all of the variables, as it is an excerpt from a larger script. It demonstrates the concept though, even if it wouldn’t execute by itself.

Using this approach, Visual Studio takes care of using the correct algorithm for the target framework, the executable is signed appropriately and everyone is happy.

This meant that I could delete a bunch of scripts that dealt with signing and re-signing, which is my favourite part of software development. Nothing feels better than deleting code that you don’t need anymore because you’ve found a better way.

Conclusion

Certificates are definitely one of those things in software that I never quite get a good handle on, just because I don’t deal with them every day. I have a feeling that if I was dealing with certificates more often, I probably would see these sorts of things coming, and be able to prepare accordingly.

More importantly though, this issue would have been discovered immediately if we had a set of functional tests that validate that the software can be installed correctly on a variety of operating systems, specifically those that we officially support. Obviously it would be better if the tests were automated, but even having a manual set of tests would have made the issue obvious as soon as I made the change, instead of months later when I made an unrelated change.

Of course, I should have run such tests at the time, but that’s why automation is important, to prevent that exact situation.

That or hiring a dedicated tester who likes to do that sort of thing manually.

Because I don’t.


It’s time to step away from web services, log aggregation and AWS for a little while.

It’s time to do some UI work! Not HTML though, unfortunately; Windows desktop software.

The flagship application that my team maintains is an old (15+ years) VB6 application. It brings in a LOT of money, but to be honest, it’s not the best maintained piece of software I’ve ever seen. It’s not the worst either, which is kind of sad. Like most things it falls somewhere in the middle. More on the side of bad than good though, for sure.

In an attempt to keep the application somewhat current, it has a .NET component to it. I’m still not entirely sure how it works, but the VB6 code uses COM to call into some .NET functionality, primarily through a message bus (and some strongly typed JSON messages). It’s pretty clever actually, even if it does lead to some really strange threading issues from time to time.

Over the years, good sized chunks of new functionality have been developed in .NET, with the UI in Windows Forms.

Windows Forms is a perfectly fine UI framework, and don’t let anyone tell you different. Sure it has its flaws and trouble spots, but on the whole, it works mostly as you would expect it to, and it gets the job done. The downside is that most Windows Forms UIs look the same, and you often end up with business logic being tightly coupled to the UI. When your choice is Windows Forms or VB6 though, well, it’s not really a choice.

For our most recent project, we wanted to try something new. Well, new for the software component anyway, certainly not new to me, some of the other members of the team, or reality in general.

WPF.

Present that Framework

I’m not going to go into too much detail about WPF, but it’s the replacement for Windows Forms. It focuses on separating the actual presentation of the UI from the logic that drives it, and is extremely extensible. It’s definitely not perfect, and the first couple of versions of it were a bit rough (Avalon!), but it’s in a good spot now.

Personally, I particularly like how it makes using the MVVM (Model-View-ViewModel) pattern easy, with its support for bindings and commands (among other things). MVVM is a good pattern to strive for when it comes to developing applications with complicated UIs, because you can test your logic independently of the presentation. You do still need to test it all together obviously, because bindings are code, and you will make mistakes.

I was extremely surprised when we tried to incorporate a WPF form into the VB6 application via the .NET channels it already had available.

Mostly because it worked.

It worked without some insane workaround or other crazy thing. It just worked.

Even showing a dialog worked, which surprised me, because I remember having a huge amount of trouble with that when I was writing a mixed Windows Form/WPF application.

I quickly put together a fairly primitive framework to support MVVM (no need to invest in one of the big frameworks just yet, we’re only just starting out) and we built the new form up to support the feature we needed to expose.

That was a particularly long introduction, but what I wanted to talk about in this post was using design time variables in WPF to make the designer actually useful.

Intelligent Design

The WPF designer is…okay.

Actually, it’s pretty good, if a little bit flaky from time to time.

The problem I’ve encountered every time I’ve used MVVM with WPF is that a lot of the behaviour of my UI component is dependent on its current data context. Sure, it has default behaviour when it’s not bound to anything, but if you have error overlays, or a message channel used to communicate with the user, or expandable bits within an ItemsControl whose items have their own data templates, it can be difficult to visualise how everything looks just from the XAML.

It takes time to compile and run the application as well (to see it running in reality), even if you have a test harness that opens the component you are working on directly.

Tight feedback loops are easily one of the most important parts of developing software quickly, and the designer is definitely the quickest feedback loop for WPF by far.

Luckily, there are a number of properties that you can set on your UI component which will only apply at design time, and the most useful one is the design time DataContext.

This property, when applied, will set the DataContext of your component when it is displayed in the designer, and assuming you can write appropriate classes to interface with it, gives you a lot of power when it comes to viewing your component in its various states.

Contextual Data

I tend towards a common pattern when I create view models for the designer. I will create an interface for the view model (based on INotifyPropertyChanged), a default implementation (the real view model) and a view model creator or factory, specifically for the purposes of the designer. They tend to look like this (I’ve included the view model interface for completeness):

namespace Solavirum.Unspecified.ViewModel
{
    public interface IFooViewModel : INotifyPropertyChanged
    {
        string Message { get; }
    }
}

namespace Solavirum.Unspecified.View.Designer
{
    public class FooViewModelCreator
    {
        private static readonly FooViewModelCreator _Instance = new FooViewModelCreator();
        
        public static IFooViewModel ViewModel
        {
            get
            {
                return _Instance.Create();
            }
        }
        
        public IFooViewModel Create()
        {
            return new DummyFooViewModel
            {
                Message = "This is a message to show in the designer."
            };
        }
        
        private class DummyFooViewModel : BaseViewModel, IFooViewModel
        {
            public string Message { get; set; }
        }
    }
}

As you can see, it makes use of a private static instance of the creator, but it doesn’t have to (it’s just for caching purposes, so the creator doesn’t have to be recreated all the time; it’s probably not necessary). It exposes a read-only view model property which just executes the Create function, and can be bound to in XAML like so:

<UserControl 
    x:Class="Solavirum.Unspecified.View.FooView"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008" 
    xmlns:Designer="clr-namespace:Solavirum.Unspecified.View.Designer"
    mc:Ignorable="d"
    d:DataContext="{x:Static Designer:FooViewModelCreator.ViewModel}">
    <Grid>
        <TextBlock Text="{Binding Message}" />
    </Grid>
</UserControl>

With this hook into the designer, you can do all sorts of crazy things. Take this creator for example:

namespace Solavirum.Unspecified.View.Designer
{
    public class FooViewModelCreator
    {
        private static readonly FooViewModelCreator _Instance = new FooViewModelCreator();

        public static IFooViewModel ViewModel
        {
            get { return _Instance.Create(); }
        }

        public IFooViewModel Create()
        {
            // Capture the designer's UI scheduler so that property changes are marshalled back to the foreground thread.
            var foreground = TaskScheduler.FromCurrentSynchronizationContext();
            var messages = new List<string>()
            {
                "This is the first message. Its short.",
                "This is the second message. Its quite a bit longer than the first message, and is useful for determining whether or not wrapping is working correctly."
            };

            var vm = new DummyFooViewModel();

            var messageIndex = 0;
            Task.Factory.StartNew
                (
                    () =>
                    {
                        while (true)
                        {
                            Thread.Sleep(TimeSpan.FromSeconds(5));
                            var newMessage = messages[messageIndex];
                            messageIndex = (messageIndex + 1)%messages.Count;
                            Task.Factory.StartNew
                                (
                                    () => vm.Message = newMessage,
                                    CancellationToken.None,
                                    TaskCreationOptions.None,
                                    foreground
                                );

                        }
                    },
                    CancellationToken.None,
                    TaskCreationOptions.LongRunning,
                    TaskScheduler.Default
                );

            return vm;
        }

        private class DummyFooViewModel : IFooViewModel
        {
            private string _Message;

            public string Message
            {
                get { return _Message; }
                set
                {
                    _Message = value;
                    RaisePropertyChanged("Message");
                }
            }

            public event PropertyChangedEventHandler PropertyChanged;

            private void RaisePropertyChanged(string propertyName)
            {
                var handlers = PropertyChanged;
                if (handlers != null)
                {
                    handlers(this, new PropertyChangedEventArgs(propertyName));
                }
            }
        }
    }
}

It uses a long running task to change the state of the view model over time, so you can see it in its various states in the designer. This is handy if your view model exposes an error property (i.e. if it does a refresh of some sort, and you want to notify the user in an unobtrusive way when something bad happens) or if you just want to see what it looks like with varying amounts of text or something similar. Notice that it has to marshal the change onto the foreground thread (which in the designer is the UI thread), or the property changed event won’t be picked up by the binding engine.

Once you’ve set up the creator and bound it appropriately, you can do almost all of your UI work in the designer, which saves a hell of a lot of time. Of course you can’t verify button actions, tooltips or anything else that requires interaction (at least to my knowledge), but it’s still a hell of a lot better than starting the application every time you want to see if your screen looks okay.

Getting Inside

Now, because you are running code, and you wrote that code, it will have bugs. There’s no shame in that, all code has bugs.

It’s hard to justify writing tests for designer-specific support code (even though I have done it), because they don’t necessarily add a lot of value, and they definitely increase the time required to make a change.

Instead I mostly just focus on debugging when the designer view models aren’t working the way that I think they should.

In order to debug, you will need to open a second instance of Visual Studio, with the same project, and attach it to the XAML designer process (XDesProc). Now, since you have two instances of Visual Studio, you might also have two instances of the XAML designer process, but it’s not hard to figure out the right one (trial and error!). Once you’ve attached to the process you can put breakpoints in your designer-specific code and figure out where it’s going wrong.
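
If trial and error gets old, a quick way to narrow it down (sketch only) is to list the candidate processes and look at their start times:

# List the running XAML designer processes, newest first.
Get-Process XDesProc -ErrorAction SilentlyContinue |
    Sort-Object StartTime -Descending |
    Select-Object Id, StartTime, ProcessName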

I’ll mention it again below, but the app domain for the designer is a little bit weird, so sometimes it might not work at all (no symbols, breakpoints not being hit, etc). Honestly, I’m not entirely sure why, but a restart of both instances of Visual Studio, combined with a fresh re-compilation usually fixes that.

Gotchas

There are a few gotchas with the whole designer thing I’ve outlined above that are worth mentioning.

The first is that if you do not version your DLL appropriately (within Visual Studio, not just within your build server), you will run into issues where old versions of your view model are bound into the designer. This is especially annoying when you have bugs, as it will quite happily continue to use the old, broken version. I think the designer only reloads your libraries when it detects a version change, but I can’t back that up with proof.

The solution is to make sure that your version changes every time you compile, which, honestly, you should be doing anyway. I’ve had success with just using the reliable 0.0.* version attribute in the assembly when the project is compiled as debug (using an #if DEBUG). You just have to make sure that whatever approach you use for versioning in your build server doesn’t clash with that.

The second gotcha is that the app domain for the designer is a bit…weird. For example, Ninject won’t automatically load its extension modules in the designer; you have to load them manually. For Ninject specifically, this is a fairly straightforward process (just create a DesignerKernel), but there are other issues as well.

Sometimes the designer just won’t run the code you want it to. Typically this happens after you’ve been working on it for a while, constantly making new builds of the view model creator. The only solution I’ve found to this is just to restart Visual Studio. I’m using Visual Studio 2013 Update 5, so it might be fixed or better in 2015, but I don’t know. It’s not a deal breaker anyway; basically, just be on the lookout for failures that look like they are definitely not the fault of your code, and restart Visual Studio before you start pulling your hair out.

Conclusion

I highly recommend going to the extra effort of creating view models that can be bound in your component at design time. It’s a great help when you’re building the component, but it also helps you to validate (manually) whether or not your component acts as you would expect when the view model is in various states.

It can be a little bit difficult to maintain if your code is changing rapidly (breaking view models apart can have knock-on effects on the creator for example, increasing the amount of work required in order to accomplish a change), but the increase in development speed for UI components (which are notoriously fiddly anyway) is well worth it.

It’s also really nice to see realistic looking data in your designer. It makes the component feel more substantial, like it actually might accomplish something, instead of being an empty shell that only fills out when you run the full application.


Update:I wrote the code for the seeder below outside of a development environment. It doesn’t work. I’ll try to revisit this post at a later date if I get a chance to implement a seeder class, but I’m currently using the Seed extension methods to great effect. My apologies to anyone who finds this post and is surprised when it doesn’t work.

Entity Framework and ORMs in general have come a long way. Entity Framework in particular is pretty amazing now, compared to where it was 5 years ago. It was around then that my team made the decision to use NHibernate as our ORM, instead of EF. EF has obviously matured a lot since then, and seems to be the default choice now when working in the .NET world.

I’ve made a couple of posts on this blog involving Entity Framework and some of my adventures with it, one on creating test databases leveraging a scratch MSSQL instance in AWS and another on using a different EF provider to allow for in-memory databases.

One of the great things about working with ORMs is that your persisted data is just objects, which means you have far more control over it than you ever did before. No need to use SQL statements (or similar) to build up some test data; just create some objects, insert them and off you go.

This post is going to talk about a mechanism for creating those objects, specifically about the concept of seeding.

Also, obviously, all of the sub-titles will be puns based on seeds and seeding.

Just Tossing My Seed Around

Most of the time I see seeding functions built into the DbContext class. They are typically executed whenever the context is created, making sure that certain data is available.

To me this is a violation of the Single Responsibility Principle, because now you have a class that is responsible for both managing data access and for putting some subset of the data there in the first place. While this alone is definitely a good reason to have a dedicated seeder class, there are others as well:

  • If you have a hardcoded seed method inside your DbContext, it’s much harder to customise it to seed different data based on your current needs.
  • Commonly, seed implementations inside the DbContext are wasteful, always trying to seed into the database whenever you create a new DbContext. I’m in favour of using a DbContextFactory and creating a DbContext per operation (or at least per request), which can make the time spent dealing with seeding significant.

I find that the best way to think about seeding is to use the specification pattern (or at least the concept). You want to be able to create an object that describes how you want your data to look (or what capabilities you want your data to have), and then execute it. Let the object sort out the seeding as its dedicated function.

This works fairly well. You define a Seeder or DataSpecification class, and expose appropriate properties and methods on it to describe the data (how many DbFoo entries you want, how many DbBar entries, what they look like, etc). You implement a method that takes a DbContext of the appropriate type, and in that method you use the information supplied to create and save the appropriate entities.

If you follow this approach, you find that your Seeder can become very complicated very quickly, especially because its entire purpose is to be highly configurable. It’s also responsible for knowing how to construct many different varieties of objects, which is another violation of SRP.

I find SRP to be a pretty good guideline for handling class complexity. If you think about the responsibilities that your class has, and it has more than a few, then those responsibilities either need to be very tightly coupled, such that you couldn’t reasonably pull them apart, or you should really consider having more than one class. The downside of SRP is that you tend to have quite a lot of small classes, which is another form of complexity. The upside is that you have a lot of small, composable, modular classes, which are extremely useful once you get over that initial complexity bump of having many, many classes.

Ready For Round Two

I didn’t like that my Seeder class had detailed knowledge about how to construct the various entities available from the DbContext. Plus it was huge and hard to understand at a glance.

The next step was to split the logic for how to create an entity into classes dedicated to that. I tend to use the naming convention of XBuilder for this purpose, and they all look very similar:

using System;

public interface IEntityBuilder<TEntity>
{
    TEntity Build();
}

public class DbFoo
{
    public int Id { get; set; }
    public string Do { get; set; }
    public string Re { get; set; }
}

public class DbFooBuilder : IEntityBuilder<DbFoo>
{
    private string _Do = "bananas";
    private string _Re = "purple";

    public DbFooBuilder WithDo(string v)
    {
        _Do = v;
        return this;
    }

    public DbFoo Build()
    {
        return new DbFoo()
        {
            Do = _Do,
            Re = _Re
        };
    }
}

As you can see, the builder features a somewhat fluent syntax (the WithX methods), allowing you to chain calls to customise the constructed entity, but has sane defaults for all of the various properties that matter.

The Faker.Net package is handy here for generating company names, street addresses, etc. You can also simply generate random strings for whatever properties require it, but it’s generally much better to generate real-looking data than completely nonsensical data.

With the addition of dedicated builders for entities, the Seeder looks a lot better, being mostly dedicated to the concept of “how many” of the various entities. It could be improved though, because it’s difficult to use the Seeder to specify a subset of entities that meet certain criteria (like generate 10 DbFoos with their Do property set to “bananas”, and 200 where it’s set to “apples”).

We can fix that by providing some additional methods on the Seeder that allow you to customise the builders being used, instead of just letting the Seeder create X number of them to fulfil its “number of entities” requirement.

public class Seeder
{
    private readonly List<IEntityBuilder<DbFoo>> FooBuilders = new List<IEntityBuilder<DbFoo>>();

    public Seeder WithDbFoos<TEntityBuilder>(IEnumerable<TEntityBuilder> builders)
        where TEntityBuilder : IEntityBuilder<DbFoo>
    {
        FooBuilders.AddRange(builders.Cast<IEntityBuilder<DbFoo>>());

        return this;
    }

    public Seeder WithDbFoos<TEntityBuilder>(int number, Action<TEntityBuilder> customise = null)
        where TEntityBuilder : IEntityBuilder<DbFoo>, new()
    {
        // Create the requested number of builders, applying any supplied customisation to each one.
        var builders = Enumerable.Range(0, number).Select(i =>
        {
            var builder = new TEntityBuilder();
            if (customise != null) customise(builder);
            return builder;
        });
        return WithDbFoos(builders);
    }

    public void Seed(DbContext db)
    {
        foreach (var builder in FooBuilders)
        {
            db.Set<DbFoo>().Add(builder.Build());
        }
        db.SaveChanges();
    }
}

Much better and extremely flexible.

Bad Seed

I actually didn’t quite implement the Seeder as specified above, though I think it’s definitely a better model, and I will be implementing it in the near future.

Instead I implemented a series of builders for each of the entities I was interested in (just like above), and then wrote a generic Seed extension method for IDbSet:

using System;
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

namespace Solavirum.Database.EF
{
    public interface IEntityBuilder<TEntity>
    {
        TEntity Build();
    }

    public static class SeedExtensions
    {
        private static Random _random = new Random();

        public static void Seed<TEntity, TBuilder>(this IDbSet<TEntity> set, Func<TBuilder, TBuilder> modifications = null, int number = 10)
            where TEntity : class
            where TBuilder : IEntityBuilder<TEntity>, new()
        {
            modifications = modifications ?? (a => a);

            for (int i = 0; i < number; i++)
            {
                var builder = new TBuilder();
                builder = modifications(builder);
                set.Add(builder.Build());
            }
        }

        public static T Random<T>(this IEnumerable<T> enumerable)
        {
            int index = _random.Next(0, enumerable.Count());
            return enumerable.ElementAt(index);
        }
    }
}

This is nice from a usability point of view, because I can seed any entity that has an appropriate builder, just by using the one method. The Random<T> method exists so I can get a random element out of a DbSet for linking purposes, if I need to (it was used in a method that I removed, dealing specifically with an entity with links to other entities).

What I don’t like about it:

  • It’s difficult to supply dependencies to the Seed method (unless you expose them in the method signature itself) because it’s inside a static class. This means supplying a logger of some description is hard.
  • The builders have to have parameterless constructors, again because it’s hard to supply dependencies. This isn’t so bad, because the builders are meant to be simple and easy to use, with sane default values.
  • Builders with dependencies on other entities (like a hypothetical DbFooBar class that has references to both a DbFoo and a DbBar) have to have their own Seed method in order to use entities that exist in the current DbContext. This isn’t a dealbreaker, but it does complicate things.

I think a well-constructed Seeder class better encapsulates the concept, even though it’s nice to be able to just hit up a Seed method right off the IDbSet and have it all just work.

Conclusion

Being able to easily create data that meets certain criteria is an amazing tool when it comes to development and testing. Doing it in a provider agnostic way is even better, because you can push data into an in-memory database, an SQL database or a Postgres database, using the same code. In my opinion, the ability to sub out providers is one of the best parts of using an ORM (that and not having to deal with query/data manipulation languages directly).

Nothing that I’ve written above is particularly ground breaking, but it’s still very useful, and I highly recommend following the general strategy when working with a persistence layer via Entity Framework.

I hope you enjoyed all the terrible seed puns.

I regret nothing.