
A very short post this week, as I’m still struggling with my connection leak and a number of other things (RavenDB production server performance issues being the biggest, but also automating a Node/NPM built website into our current CI architecture, which is mostly based around Powershell/MSBuild). It’s been a pretty discombobulated week.

So what is this incredibly short post about?

Phantom buckets in S3.

There Is A Hole In The Bucket

Our environments often include S3 buckets, and those buckets are typically created via the same CloudFormation template as the other components (like EC2 instances, ELB, Auto Scaling Groups, etc).

Until now, the names of these buckets have been relatively straightforward. A combination of company name + environment (e.g. ci, staging) + the component (like the auth service) + the purpose of the bucket (logs, images, documents, whatever).

This works great. Your buckets have sane names, so you know where to look for things and it’s easy to apply different lifecycle management depending on the bucket purpose.

Unfortunately, it’s not all wonderful happy time.

The first issue is that CloudFormation will not delete a bucket with contents. I can understand this from a safety point of view, but when the actual AWS API allows you to just delete buckets with contents, the disconnect is frustrating.

What this means is that now you need to delete the bucket contents outside of the actual stack deletion. It’s especially annoying for buckets being used to contain ELB logs, as there is an extremely good chance of files being written after you’ve cleared the bucket ready for CloudFormation to delete it. I’ve solved this issue by just deleting the bucket outside of the stack teardown (we already do some other things here, like Octopus management, so it’s not entirely unprecedented).
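Deleting the bucket (contents and all) before kicking off the stack deletion is simple enough with the AWS Tools for Powershell. Something like the sketch below; our actual implementation wraps a bit more logic around it, and the bucket name here is made up.

# Empty and delete the bucket before the CloudFormation stack teardown runs.
# -DeleteBucketContent removes all objects first; the bucket name is illustrative only.
$bucketName = "company-ci-authservice-logs"

Remove-S3Bucket -BucketName $bucketName `
    -DeleteBucketContent `
    -Force `
    -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion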

The second issue is phantom buckets.

OooOOooOoo

I’ve encountered this issue twice now. Once for our proxy environment and now once for one of our APIs.

What happens is that when the environment attempts to spin up (our CI environments are recreated every morning to verify that our environment creation scripts work as expected), it will fail because it cannot create the bucket. The actual error is incredibly unhelpful:

{
    "EventId" : "LogsBucket-CREATE_FAILED-2015-11-02T21:49:55.907Z",
    "LogicalResourceId" : "LogsBucket",
    "PhysicalResourceId" : "OBFUSCATED_BUCKET_NAME",
    "ResourceProperties" : "{\"BucketName\":\"OBFUSCATED_BUCKET_NAME\",\"LifecycleConfiguration\":{\"Rules\":[{\"Status\":\"Enabled\",\"Id\":\"1\",\"ExpirationInDays\":\"7\"}]}}\n",
    "ResourceStatus" : "CREATE_FAILED",
    "ResourceStatusReason" : "The specified bucket does not exist",
    "ResourceType" : "AWS::S3::Bucket",
    "StackId" : "OBFUSCATED_STACK_ID",
    "StackName" : "OBFUSCATED_STACK_NAME",
    "Timestamp" : "\/Date(1446500995907)\/"
}

If I go into the AWS dashboard and look at my buckets, it’s clearly not there.

If I try to create a bucket with the expected name, it fails, saying the bucket already exists.

It’s a unique enough name that it seems incredibly unlikely that someone else has stolen the name (bucket names being globally unique), so I can only assume that something has gone wrong in AWS and the bucket still technically exists somehow, but we’ve lost control over it.

Somehow.
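For the record, the conflict looks something like this when poking at it with the AWS Tools for Powershell (the bucket name is invented):

$bucketName = "obfuscated-ci-api-logs"

# The bucket does not appear in the list of buckets owned by the account...
Get-S3Bucket -AccessKey $awsKey -SecretKey $awsSecret | Where-Object { $_.BucketName -eq $bucketName }

# ...but attempting to create it again fails, claiming that it already exists.
New-S3Bucket -BucketName $bucketName -Region $awsRegion -AccessKey $awsKey -SecretKey $awsSecret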

Of course, because the bucket is an intrinsic part of the environment, now I can’t create my CI environment for that particular service. Which means we can’t successfully build/deploy anything involving that service, because CI is typically used for functional test validation.

Who Ya Gunna Call? Ghostbusters!

The only solution I could come up with was to make sure that every time an environment is created, the buckets have completely unique names. With only 63 characters to work with, this is somewhat challenging, especially if we want to maintain nice, sane bucket names that a human could read.

What I ended up doing was shortening the human readable part (just environment + component + purpose) and appending a GUID onto the end.
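The construction itself is trivial; something like the following (the prefix is made up):

# Shortened human readable prefix + GUID = unique, still (mostly) readable bucket name.
$prefix = "ci-authservice-logs"
$unique = [Guid]::NewGuid().ToString("N")   # 32 characters, no dashes
$bucketName = "$prefix-$unique".ToLowerInvariant()

# S3 bucket names are capped at 63 characters, so the human readable part has to stay short.
if ($bucketName.Length -gt 63) { throw "Bucket name [$bucketName] is longer than the 63 character S3 limit." }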

Now that I couldn’t predict the name of the bucket though, I had to fix up a couple of other loose ends.

The first was that the bucket deletion (during environment tear down) now had to query the stack itself to find out the bucket resources. Not overly difficult.

try
{
    if ($environment -ne $null)
    {
        # Find all of the S3 buckets created by the stack and delete them before the stack itself is deleted.
        $resources = Get-CFNStackResources -StackName $environment.StackId -AccessKey $awsKey -SecretKey $awsSecret -Region $awsRegion
        $s3Buckets = $resources | Where { $_.ResourceType -eq "AWS::S3::Bucket" }
        foreach ($s3Bucket in $s3Buckets)
        {
            try
            {
                # The physical resource id of an AWS::S3::Bucket resource is the bucket name.
                $bucketName = $s3Bucket.PhysicalResourceId
                _RemoveBucket -bucketName $bucketName -awsKey $awsKey -awsSecret $awsSecret -awsRegion $awsRegion
            }
            catch
            {
                Write-Warning "Error occurred while trying to delete bucket [$bucketName] prior to stack destruction."
                Write-Warning $_
            }
        }
    }
}
catch
{
    Write-Warning "Error occurred while attempting to get S3 buckets to delete from the CloudFormation stack."
    Write-Warning $_
}

The second was that our Octopus projects used the predictable bucket name during deployments, so I had to change the environment setup code to update the project variables to have the correct value. This was a little more difficult, but due to Octopus being awesome from an automation point of view, it eventually worked.
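The gist of the variable update is below. This is a sketch only; the project slug and variable name are invented, and the real code has lookups and error handling that I’ve left out.

# Assumes $bucketName holds the GUID suffixed bucket name gathered from the stack earlier.
$headers = @{ "X-Octopus-ApiKey" = $octopusApiKey }

$project = Invoke-RestMethod -Uri "$octopusServerUrl/api/projects/auth-service" -Headers $headers
$variableSet = Invoke-RestMethod -Uri "$octopusServerUrl$($project.Links.Variables)" -Headers $headers

# Point the relevant project variable at the newly created bucket.
$variableSet.Variables |
    Where-Object { $_.Name -eq "LogsBucketName" } |
    ForEach-Object { $_.Value = $bucketName }

$body = ConvertTo-Json $variableSet -Depth 10
Invoke-RestMethod -Uri "$octopusServerUrl$($variableSet.Links.Self)" -Headers $headers -Method Put -Body $body | Out-Null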

Summary

I can see how this sort of situation can arise in a disconnected, eventually consistent architecture, but that doesn’t make it any less frustrating.

It could be my fault for constantly creating/deleting buckets as part of the environment management scripts, but given that it doesn’t happen all the time, it really does feel like a bug of some sort.

Plus, ghost buckets are scary. Does that mean there is some of my data up there in AWS that I no longer have control over? I mean, I can’t even see it, let alone manage it.

A sobering thought.


The core of the newest suite of components that we are building is an authentication and authorization service. Its goal is simple: provide a central location for identifying clients and the data sets and users that they own, for the purposes of using remote, cloud based resources. It forms part of a greater strategy of helping to get previously client side only data into a location where it can be used by mobile applications, websites and third party integrations.

This service also includes the management of application keys, allowing us to control who is able to access the cloud components, and uniquely identifying where requests are coming from. Typically we generate a unique application key for each piece of software that wants to interact with the system, including third party software and our own.

Most of our services up to now have been built using the Nancy framework. The main reason for this is that it’s simple and easy to understand, and it gives a lot of control over the entire pipeline, which is important for a number of reasons (one of which is logging).

Unfortunately, when it came time to build the authentication and authorization service, we were unable to use Nancy. Instead we had to use Web API. Now, don’t get me wrong, Web API is a fine framework, but the majority of our experience lay with Nancy, so we might not have put together the best Web API implementation. We were forced to use Web API because the component we were going to use for the core of the authentication (the ASP.NET Identity Framework) integrated easily with Web API, but didn’t seem to have any obvious way to integrate with Nancy (when hosted in ASP.NET).

Really, all of the above is just background to the problem that I’m currently struggling with.

Leaking connections to the persistence store.

A Nice Deep Pool

The backend of our authentication and authorization service is a PostgreSQL database hosted in AWS using RDS. We chose PostgreSQL because SQL Server is generally too expensive (even hosted through AWS) and because there was a PostgreSQL provider for Entity Framework that seemed solid (Npgsql). PostgreSQL is also an extremely mature database engine, and is just far enough out of our comfort zone to provide some new learning opportunities.

Because of the way the Identity Framework uses Entity Framework, all we had to do was create the DbContext using the appropriate connection/provider and EF would take care of the rest, seamlessly persisting our data into the appropriate place.

To be honest, it was actually pretty seamless. We didn’t have any issues (that I can remember anyway) and everything seemed to go pretty smoothly. The service wasn't being actively used in a production environment (because we were still developing the solution it was built for), but it was getting regular traffic and handling it admirably.

Then we did some load testing and it crashed and burnt.

Investigation showed that the service stopped working because it could no longer open new connections to the database, as the database already had too many connections currently in use. We were only using a db.t2.micro RDS instance, and it has a cap on the number of connections (approximately 26), so that made sense. As soon as there was a greater number of requests trying to access the database than the available number of connections, requests would fail.

A quick investigation showed that the database connections being used for EF did not have pooling enabled. The pooling supported in Npgsql is an entirely client side construct, allowing for efficient reuse of connections by avoiding the work that goes into establishing the pipeline. It also offers the ability to queue the creation of additional connections once the configured pool limit has been reached. Obviously there is a timeout (to prevent deadlock), but this allowed us to set the connection pool limit close to the RDS limit, safe in the knowledge that subsequent requests would simply be slower, rather than failing altogether.
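For the curious, the relevant settings live in the connection string. The keyword names below are from memory (and the Npgsql documentation of the time), so treat them as illustrative rather than gospel:

# Illustrative only; server, database and credentials are placeholders.
$connectionString = "Server=obfuscated.rds.amazonaws.com;Port=5432;Database=auth;" +
                    "User Id=placeholder;Password=placeholder;" +
                    "Pooling=true;MinPoolSize=1;MaxPoolSize=20;Timeout=30"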

With pooling enabled, the load tests passed with flying colours. I knew that we would have to scale up the RDS instance based on real traffic, but that could happen later. The pooling allowed the service to make better use of limited resources.

For now, problem solved.

Right?

These Are My Connections! Mine!

After running happily for many weeks, the staging service went down hard. No load tests involved this time, just normal traffic. Shortly afterwards (within days), a newly provisioned production server did the same thing.

Looking at the logs, the symptoms seemed to be similar: errors related to being unable to obtain a connection because the server did not have any available. There was no real sign that the service was doing anything particularly difficult that would require it to hold the connections for an extended amount of time, so we were a bit confused.

Given that the service had been running happily for a number of weeks, it must have been something we changed recently. Unfortunately, the only recent change made was a fix to a threading issue with Entity Framework, Owin and the Identity Framework (2 threads attempting to lazily initialize the same DbContext at the same time). It was a fairly trivial fix: instead of returning the DbContext from the factory straightaway, make sure that the model has been initialized first.

It was time to dig into it in more depth.

My hypothesis was that if the application was leaking connections to the database (such that they were hanging around but no longer in a usable state), then I should be able to see them when analysing a memory dump. I started up the Auth API locally using the SciTech .NET Memory Profiler, took an initial snapshot before doing anything meaningful, and then started the stress tests.

Every 15 minutes I would take another snapshot of the memory space, specifically looking for any DbContext or NpgsqlConnection objects that were hanging around for longer than they should have been.

What I found was interesting.

Slowly but surely a collection of NpgsqlConnection objects were accumulating. They were disposed, but could not be garbage collected because they were being rooted by an event handler inside the Npgsql connection pooling implementation. I’m not actually sure if they were holding on to server side resources (being disposed, you would think that they wouldn’t be), but it didn’t look promising.

I turned off connection pooling and those objects stopped accumulating. Obviously the service also stopped working properly once it reached a certain amount of traffic (the connection cap of the RDS instance), but it did at least confirm that the issue seemed to be within Npgsql itself and not in our code.

Luckily, a new version of Npgsql had been released (we were using 2.2.5, but version 3 was now available). I removed all of the references to the old package and installed the new one. After dealing with some differences (some very strange errors relating to attempting to use disposed DbContexts during Owin cleanup due to the unnecessary usage of SaveChangesAsync), everything seemed to be working just fine. All of the functional tests passed, and the stress tests didn’t show anything interesting unless I tuned them up to “crash the computer” level.

Pushed, built, deployed.

Job done.

Where Have All The Connections Gone

Nope.

Still leaking connections. Still fails when some critical mass of leaked connections is reached.

Come back next week for part 2, when I hopefully have an actual answer as to where the connections are going.


I’m pretty happy with the way our environment setup scripts work.

Within TeamCity, you generally only have to push a single button to get an environment provisioned (with perhaps a few parameters filled in, like environment name and whatnot) and even outside TeamCity, it’s a single script that only requires some credentials and a few other things to start.

Failures are detected (primarily by CloudFormation) and the scripts have the ability to remote onto AWS instances for you and extract errors from logs to give you an idea as to the root cause of the failure, so you have to do as little manual work as possible. If a failure is detected, everything is cleaned up automatically (CloudFormation stack deleted, Octopus environment and machines deleted, etc), unless you turn off automatic cleanup for investigation purposes.

Like I said, overall I’m pretty happy with how everything works, but one of the areas that I’m not entirely happy with is the last part of environment provisioning. When an environment creation is completed, you know that all components installed correctly (including Octopus deploys) and that no errors were encountered with any of the provisioning itself (EC2 instances, Auto Scaling Groups, RDS, S3, etc). What you don’t know is whether or not the environment is actually doing what it should be doing.

You don’t know whether or not it’s working.

That seems like a fixable problem.

Smoke On The Water

As part of developing environments, we’ve implemented automated tests using the Powershell testing framework called Pester.

Each environment has at least one test that verifies the environment is created as expected and works from the point of view of the service it offers. For example, in our proxy environment (which uses Squid) one of the outputs is the proxy URL. The test takes that URL and does a simple Invoke-WebRequest through it to a known address, validating that the proxy works as a proxy actually should.

The issue with these tests is that they are not executed at creation time. They are usually only used during development, to validate that whatever changes you are making haven’t broken the environment and that everything is still working.

Unfortunately, beyond git tagging, our environment creation scripts/templates are not versioned. I would vastly prefer for our build scripts to take some set of source code that represents an environment setup, test it, replace some parameters (like version) and then package it up, perhaps into a NuGet package. It’s something that’s been on my mind for a while, but I haven’t had time to put it together yet. If I do, I’ll be sure to post about it here.

The simplest solution is to extract the parts of the tests that perform validation into dedicated functions and then to execute them as part of the environment creation. If the validation fails, the environment should be considered a failure and should notify the appropriate parties and clean itself up.

Where There Is Smoke There Is Fire

The easiest way to implement the validation (hereafter referred to as smoke tests) in a reusable fashion is to incorporate the concept into the common environment provisioning scripts.

We’ve created a library that contains scripts that we commonly use for deployment, environment provisioning and other things. I made a copy of the source for that library and posted it to Solavirum.Scripts.Common a while ago, but it’s a bit out of date now (I really should update it).

Within the library is a Functions-Environment file.

This file contains a set of Powershell cmdlets for provisioning and deleting environments. The assumption is that it will be used within libraries for specific environments (like the Proxy environment mentioned above) and will allow us to take care of all of the common concerns (like uploading dependencies, setting parameters in CloudFormation, waiting on the CloudFormation initialization, etc).

Inside this file is a function called New-Environment, whose signature looks like this:

function New-Environment
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$environmentName,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsSecret,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$awsRegion,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusServerUrl,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusApiKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$uniqueComponentIdentifier,
        [System.IO.FileInfo]$templateFile,
        [hashtable]$additionalTemplateParameters,
        [scriptblock]$customiseEnvironmentDetailsHashtable={param([hashtable]$environmentDetailsHashtableToMutate,$stack) },
        [switch]$wait,
        [switch]$disableCleanupOnFailure,
        [string[]]$s3Buckets
    )

    function body would be here, but its super long
}

As you can see, it has a lot of parameters. It’s responsible for all of the bits and pieces that go into setting up an environment, like Octopus initialization, CloudFormation execution, gathering information in the case of a failure, etc. It’s also responsible for triggering a cleanup when an environment is deemed a failure, so it is the ideal place to put some smoke testing functionality.

Each specific environment repository typically contains a file called Invoke-NewEnvironment. This file is what is executed to actually create an environment of the specific type. It puts together all of the environment specific stuff (output customisation, template location, customised parameters) and uses that to execute the New-Environment function, which takes care of all of the common things.

In order to add a configurable smoke test, all we need to do is add an optional script block parameter to the New-Environment function. Specific environment implementations can supply a value for it if they like, but they don’t have to. If we assume that the contract for the script block is that it will throw an exception if it fails, then all we need to do is wrap it in a try..catch and fail the environment provisioning if an error occurs. Pretty straightforward.
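Conceptually, the inside of New-Environment gains something like the following. This is heavily simplified; the real version also takes care of notification, cleanup and including the result in the returned environment details.

if ($smokeTest -ne $null)
{
    try
    {
        # The stack is piped in so the smoke test script block can use $_ (see the tests below).
        $smokeTestResult = $stack | ForEach-Object $smokeTest
    }
    catch
    {
        Write-Warning "Smoke test for environment [$environmentName] failed."
        Write-Warning $_

        # Trigger the normal failure handling (stack deletion, Octopus cleanup, etc) unless
        # cleanup on failure has been disabled, then surface the failure.
        throw "Environment [$environmentName] failed its smoke test."
    }
}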

To support the smoke test functionality, I wrote two new Pester tests. One verifies that a failing smoke test correctly fails the environment creation and the other verifies that the result of a successful smoke test is included in the environment creation result. You can see them below:

Describe -Tags @("Ignore") "Functions-Environment.New-Environment.SmokeTest" {
    Context "When supplied with a smoke test script that throws an exception (indicating smoke test failure)" {
        It "The stack creation is aborted and deleted" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                try
                {
                    $createArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-TemplateFile"=$templatePath;
                        "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-S3Buckets"=@($testBucket);
                        "-SmokeTest"={ throw "FORCED FAILURE" };
                        "-Wait"=$true;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;
                        "-OctopusApiKey"=$octoCreds.ApiKey;
                        "-OctopusServerUrl"=$octoCreds.Url;
                    }
                    $environmentCreationResult = New-Environment @createArguments
                }
                catch
                {
                    $error = $_
                }

                $error | Should Not Be $null
                $error | Should Match "smoke"

                try
                {
                    $getArguments = @{
                        "-EnvironmentName"=$environmentName;
                        "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                        "-AwsKey"=$creds.AwsKey;
                        "-AwsSecret"=$creds.AwsSecret;
                        "-AwsRegion"=$creds.AwsRegion;                
                    }
                    $environment = Get-Environment @getArguments
                }
                catch
                {
                    Write-Warning $_
                }

                $environment | Should Be $null
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }

    Context "When supplied with a valid smoke test script" {
        It "The stack creation is successful" {
            $creds = Get-AwsCredentials
            $octoCreds = Get-OctopusCredentials
            $environmentName = Create-UniqueEnvironmentName
            $uniqueComponentIdentifier = "Test"
            $templatePath = "$rootDirectoryPath\src\TestEnvironment\Test.CloudFormation.template"
            $testBucket = [Guid]::NewGuid().ToString("N")
            $customTemplateParameters = @{
                "LogsS3BucketName"=$testBucket;
            }

            try
            {
                $createArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-TemplateFile"=$templatePath;
                    "-AdditionalTemplateParameters"=$CustomTemplateParameters;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-SmokeTest"={ return $_.StackId + " SMOKE TESTED"}; 
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                $environmentCreationResult = New-Environment @createArguments

                Write-Verbose (ConvertTo-Json $environmentCreationResult)

                $environmentCreationResult.SmokeTestResult | Should Match "SMOKE TESTED"
            }
            finally
            {
                $deleteArguments = @{
                    "-EnvironmentName"=$environmentName;
                    "-UniqueComponentIdentifier"=$uniqueComponentIdentifier;
                    "-S3Buckets"=@($testBucket);
                    "-Wait"=$true;
                    "-AwsKey"=$creds.AwsKey;
                    "-AwsSecret"=$creds.AwsSecret;
                    "-AwsRegion"=$creds.AwsRegion;
                    "-OctopusApiKey"=$octoCreds.ApiKey;
                    "-OctopusServerUrl"=$octoCreds.Url;
                }
                Delete-Environment @deleteArguments
            }
        }
    }
}

Smoke And Mirrors

On the specific environment side (the Proxy in this example), all we need to do is supply a script block that will execute the smoke test.

The smoke test itself needs to be somewhat robust, so we use a generic wait function to repeatedly execute a HTTP request through the proxy until it succeeds or it runs out of time.

function Wait
{
    [CmdletBinding()]
    param
    (
        [scriptblock]$ScriptToFillActualValue,
        [scriptblock]$Condition,
        [int]$TimeoutSeconds=30,
        [int]$IncrementSeconds=2
    )

    write-verbose "Waiting for the output of the script block [$ScriptToFillActualValue] to meet the condition [$Condition]"

    $totalWaitTimeSeconds = 0
    while ($true)
    {
        try
        {
            $actual = & $ScriptToFillActualValue
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the script to get the actual value (which is evaluated by the condition for waiting purposes). As a result, the actual value is undefined (NULL)"
            Write-Warning $_
        }

        try
        {
            $result = & $condition
        }
        catch
        {
            Write-Warning "An error occurred while evaluating the condition to determine if the wait is over"
            Write-Warning $_

            $result = $false
        }

        
        if ($result)
        {
            write-verbose "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) met the condition [$condition]"
            return $actual
        }

        write-verbose "The current output of the condition [$condition] (Variable:actual = [$actual]) is [$result]. Waiting [$IncrementSeconds] and trying again."

        Sleep -Seconds $IncrementSeconds
        $totalWaitTimeSeconds = $totalWaitTimeSeconds + $IncrementSeconds

        if ($totalWaitTimeSeconds -ge $TimeoutSeconds)
        {
            throw "The output of the script block [$ScriptToFillActualValue] (Variable:actual = [$actual]) did not meet the condition [$Condition] after [$totalWaitTimeSeconds] seconds."
        }
    }
}

function Test-Proxy
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [string]$proxyUrl
    )

    if ($rootDirectory -eq $null) { throw "RootDirectory script scoped variable not set. That's bad, its used to find dependencies." }
    $rootDirectoryPath = $rootDirectory.FullName

    . "$rootDirectoryPath\scripts\common\Functions-Waiting.ps1"
    
    $result = Wait -ScriptToFillActualValue { return (Invoke-WebRequest -Uri "www.google.com" -Proxy $proxyUrl -Method GET).StatusCode }  -Condition { $actual -eq 200 } -TimeoutSeconds 600 -IncrementSeconds 60
}

The main reason for this repeated try..wait loop is because sometimes a CloudFormation stack will complete successfully, but the service may be unavailable from an external point of view until the Load Balancer or similar component manages to settle properly.

Conclusion

I feel much more comfortable with our environment provisioning after moving the smoke tests into their own functions and executing them during the actual environment creation, rather than just in the tests.

Now whenever an environment completes its creation, I know that it actually works from an external observation point. The smoke tests aren’t particularly complex, but they definitely add a lot to our ability to reliably provision environments containing services.

Alas, I don’t have any more smoke puns or references to finish off this blog post…

Oh wait, yes I do!

*disappears in a puff of smoke*


It’s time to start growing my team. Yes, I have a team. We maintain a large, old, highly profitable piece of software aimed primarily at Real Estate Agents. It’s got some rough edges, but it provides real value to its users, and helps them to do their jobs more effectively (which in turn makes them more money).

As I stated above, the core of the application is quite old (Visual Basic 6), and was not necessarily constructed using the best of engineering principles, even for the period when it was built. It has been edited, maintained, mutated and otherwise altered over the course of more than a decade, and while it’s not the best piece of code I’ve ever seen, it’s certainly not the worst.

A lot of what we do is provide new ways in which modern systems (either ours, or third parties) can interact with that piece of software, so that the users aren’t disadvantaged purely as a result of its age. While doing this, we’re learning about the domain and the users and planning for the eventual replacement (because it can’t live forever).

To do all of the above, I need good people.

The Search Begins

I’ve done some recruitment before. Not a huge amount, but enough to know that it is far harder than it looks. It’s time consuming, and can be somewhat disheartening when you have to filter through a mountain of average applicants to find the good ones. Of course, that’s mostly an issue when you use the normal channels (Seek, recruitment agents, etc), which generally should only happen as a last resort.

Like most people, I follow a 3 step filtering process in order to try and help me identify whether or not I think someone will be a good addition to the team.

The first step is the resume. Honestly, I prefer a good, well structured LinkedIn profile to a static document, as I find that resumes are often tailored for recruitment agents (who love buzzwords and lists of technologies), rather than to the person who actually needs to read it. As an introduction though, I much prefer a custom letter plus a cut down resume highlighting only the skills and experience that would likely be required for the job at hand (with a link back to the full profile for completeness).

The letter should give me an idea that you know something about the company I work for and shows that you’re actually willing to put some effort into this application, while the targeted resume allows for a focus on only those areas of your history that are immediately useful. Believe me, a shorter, more focused resume is far better to read through than one in which you detail your entire storied history, down to that time you did night fill at Big W in your home town (where you learned complex managerial skills and how to use your time effectively, of course).

Anyway, that first, filtering step can be pretty gruelling, especially if you get flooded with a bunch of really standard resumes. When evaluating, I usually err on the side of “sure, why not” during the first step, as you can’t really get a good handle on someone until you speak with them.

Thus, step 2, talk to the person.

Talking Skills Not Good

A resume is all well and good, but you can’t really judge someone until you actually speak with them. A simple phone interview is the quickest and least disruptive way of accomplishing that. I usually have some really standard (and pretty tame) technical questions that I ask during this interview, to gauge the average level of knowledge. Stuff like:

  • What is an interface?
  • Have you heard of SOLID?
  • What do you know about inversion of control?
  • What are some of the difficulties encountered when writing web services?
  • Tell me in depth how the .NET garbage collector works with regards to finalizers and how they interact with the 3 tier garbage collection algorithm.

Okay, that last one was a joke, but the rest are good questions to see what sort of level of knowledge someone has.

It’s not an immediate fail if the applicant can’t answer any of the above (well, except for maybe the interfaces one, I mean, come on), it’s more about how the applicant reacts and communicates, even when they don’t know the answer. Do they try to dodge the question to hide their lack of knowledge? Do they admit they don’t know and then ask what it is? You can tell a lot about a person by listening to both their actual answers and how they answer.

Assuming a candidate shows promise, after the phone interview is typically a good time to see if they have any GitHub repositories or other examples of code they’ve written. Have they contributed to the community in any way? Things like StackOverflow or other forums can show a lot about how a person communicates and solves problems.

Just like resumes, I err on the side of “why not” here. If I’m not sure about someone, I’ll get them to come in and meet me in person.

Thus step 3, the in-person interview.

Ideally No Actual Physical Contact

You all know how this one goes. I get the applicant to meet me somewhere, ideally the office that they may work in, and get a sense of what they are like in person. Standard interview tactics apply, give company history, ask them various questions (tell me about your last job, what was your most recent success/failure, why do you think you would like to work here), give them time to ask questions of their own, etc.

In addition to all the standard stuff above, I like to give real problems to the applicant (maybe something that I’ve encountered recently) and see what their approach is. Do they ask questions? Do they make assumptions and go off half-cocked? Do they ignore the problem altogether and focus on something irrelevant? Do they consider the user at all, or are they focused purely on the technical? All useful pieces of information.

The in person interview is the stage where you get other people involved. If you have a HR department and they need to sign off, get them to come in and do their thing. Managers, directors, whoever wants to offer their feedback.

Ideally you get this whole process out of the way in one session, hopefully not longer than an hour or two. Interview processes that go on for weeks at a time (mostly as a result of scheduling issues) are annoying for all parties involved. Personally, I prefer something pretty casual (I’m not a formal man) because the normal interview atmosphere can really give you the wrong impression of someone. Show the applicant around, let them see a standard development environment, talk to the rest of the team, etc. Lots to do in a mere hour or two, but it’s more than possible.

Making a decision at the end of this process is hard. It’s extremely difficult to get a good handle on someone from a few hours of contact (if that). Good people will be missed and bad people will be mistakenly regarded as good. It happens.

You make a decision using the best of your knowledge, and move on.

The Follow Up

Recruitment doesn’t end with hiring.

You really don’t know what someone is going to be like until you actually work with them. Honestly, some people can construct very convincing facades, and then be completely terrible. Give honest feedback and regularly make time to evaluate whether or not the applicant was the right choice. If they weren’t, then deal with it appropriately (whatever that might mean). The knife cuts both ways however, as they are also free to leave (at minimal notice) if they realise the job is not for them.

This is the standard evaluation/trial period. Where I am now this is 6 months, with 1 week notice for either party.

What I’m really trying to get at here though, is that there should be enough respect on both sides of the table to understand when something is not working out, and to react accordingly.

A Shorter Path

Better than all of the above? Get to know lots of good people and just look them up every time you need someone.

There is no better option than to call on someone you’ve actively worked with before. You already know them, you know what they are like and what they are capable of. Ideally you don’t just offer a job to anyone you’ve ever worked with (we’ve all worked with some duds before), but you remember the good ones, and you reach out to them when it’s appropriate. Ideally it should be a mutually beneficial thing, as they should do the same sort of thing for you.

Another Way?

I always thought it would be interesting to simply get promising applicants to just come in and do some work for a day. Get them to sign an NDA (just in case) and literally get them to come work for the company for a day. Obviously you would pay them an appropriate amount of money (nothing is free after all) to make it worth their while.

What better way to evaluate whether or not someone is going to be a good fit than to just work with them? Sure, this might not be feasible for someone who is already employed full time, but if they really are interested, I’m sure they could get a day off.

At the end of the day, either party can say “no thanks” with no hard feelings.

I mean, it makes sense to me, but I have a pretty simple view of things. I’m sure there is all sorts of bureaucratic reasons why it wouldn’t work. There usually is.

Conclusion

Honestly, this entire blog post has really been a lead up to the following statement.

My name is Todd Bowles and I am the Technical/Team Lead for one of the development teams at Onthehouse, located in the Brisbane CBD, Australia.

I’m looking for some good developers to join my team and help build integrations into our current Real Estate management software. Over the next 5 years or so, we’ll be creating a brand new offering using a SaaS model that will eventually replace our installed application.

I need people who are comfortable both with older installed technologies (VB6, .NET, WinForms, WPF) and new web based technologies (ASP.NET, Nancy, ReactJS + others). You’ll need to be adaptable (it’s not all pure .NET) and to not be afraid of either new things or particularly old, scary things.

You’ll have to have a focus on creating well engineered, well tested code while constantly communicating clearly with the rest of the team and with stakeholders. You must understand and be willing to maintain and extend the automated deployment of both software and infrastructure using Octopus and AWS respectively.

Ideally you will be able to challenge me and teach me new things and I promise to do the same in turn. You will have to be the sort of person who wants to learn and improve, and is never content with simply resting on their laurels.

In return for the qualities above, I can offer a competitive salary (dependent on your awesomeness), a fantastic place to work, some interesting problems to solve and most importantly, Friday afternoon drinks.

If you’re interested, send some sort of introductory email to hr@onthehouse.com.au and put my name on it and I’ll get back to you.

I look forward to hearing from you.


In my recent adventures with trying to create and maintain Windows Scheduled Tasks that execute Powershell, I’ve discovered a need for logging exactly what happens in the scripts. Mostly this is due to the fact that getting the output from a scheduled task (beyond the exit code) is a massive pain, but I’ve wanted a good Powershell logging system/framework for a while now. I mean, its basically code, and I would never write code without good logging, so this shouldn’t be an exception.

Luckily, Powershell offers a mechanism for logging, usable via the *-Transcript cmdlets.

Unluckily, it’s a bit crap.

Transcribing

Powershell offers two cmdlets for dealing with transcription. Start-Transcript and Stop-Transcript.

When you start a transcript, you can optionally supply a filepath, and the transcription will take place there. If not, it will create a file for you and tell you where it is.

As for what is actually transcribed, you get all of the Output of the script plus errors and other exceptions. You also get some meta information, including times and the script being executed.
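In its simplest form, usage looks something like this (the path is made up):

Start-Transcript -Path "C:\logs\scheduled-tasks\my-task.log" -Append

Write-Output "Doing the actual work here"

Stop-Transcript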

All of this sounds pretty perfect for logging on the surface. Just start a transcript at the beginning of your script and stop it at the end. So I tried it.

It was…okay. Once I figured out that it doesn’t automatically create missing directories in the path supplied (it doesn’t fail, it just does everything really really slowly), that permissions can really ruin its day (same sort of thing, no failures, just super slow while it fails all of its writes) and that if you have square braces in your path it fails miserably with the least helpful error message ever, it at least created a record of what was happening. Sort of.

Okay, It was terrible and it made me sad.

Apart from the weird behaviour with directories/permissions/braces, transcribing does not record the Verbose or Debug streams. This is kind of annoying, because that’s where I put the majority of my log statements (Output is bad because it messes up stream processing and return values and Host is just generally not advised). I believe later versions of transcribing can record additional streams, but I didn’t look into it any further, because I was already annoyed with the feature.

I also hated the output it created. If I’m looking at a log file, at minimum I want timestamps, severity and messages. For a language/framework with a mature logging tool (like .NET with log4net or Serilog) I expect a lot more, but for a simple log file, those three things I listed are enough.

Time to try an alternate approach. I hesitate to say better, but honestly, I was pretty unimpressed with transcription, so it seemed very possible.

Redirection

Powershell offers the ability to redirect the various available streams, using the standard Windows command line redirection, which seems like it would be the easiest way to create a log file. Simply redirect the streams you’re interested in to a file (pretty much all of them) and Bob’s your uncle.

In my case, I had a lot of issues getting this to work. Because I was running Powershell from a Windows Scheduled Task, using redirection didn’t seem to work. It turns out that one of my mistakes was assuming you could redirect the output from Powershell.exe, which you don’t seem to be able to do. Instead the redirection appears to be primarily for commands within Powershell.

I tried wrapping my Powershell in more Powershell (which actually did the redirection, as well as setting up the file name via the current timestamp and ensuring the appropriate directories existed), and while it seemed to work, it still wasn’t good enough. No timestamps on the lines in the file left me with no real information as to the time that elapsed during the script. Additionally, the syntax for getting the redirection to work properly was a bit strange, with the all-streams redirection seemingly not working as expected (9>&1 didn’t seem to work, and neither did *>&1).
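For context, the wrapper ended up being something along these lines (paths invented, syntax from memory):

# Make sure the log directory exists and build a timestamped file name.
$logDirectoryPath = "C:\logs\scheduled-tasks"
if (-not (Test-Path $logDirectoryPath)) { New-Item -Type Directory -Path $logDirectoryPath | Out-Null }
$timestamp = [DateTime]::Now.ToString("yyyyMMddHHmmss")

# Redirect all of the streams from the real script into a single log file.
& "C:\scripts\Do-ActualWork.ps1" *> "$logDirectoryPath\$timestamp.log"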

It was still better than transcribing though.

It was time to go deeper.

Interception

Powershell is an interesting language. Coming from a heavy .NET/C# background, I was initially confused by Powershell and its untyped variables and extremely loose, interpreted syntax. I still definitely prefer C#, but Powershell has its place in my environment, though only when it’s supported by tests using Pester.

One of the wonderful things Powershell lets you do, is function overriding.

You can create an implementation of a function, and have that implementation be used within a limited scope, instead of the default implementation.

You might be able to see where I’m going with this.

All of my logging was already in place via a combination of Write-Verbose, Write-Debug and Write-Warning (and the other write functions as well, just more rarely). All I needed to do was intercept those calls and write the messages to some file instead of whatever the default did.

It was easy enough to implement a file containing implementations of all of the Write-X cmdlets, and redirect the output to a file. I could easily add timestamps and any other information I wanted, and could control the output format, allowing it to be easily parsed by something else (like Logstash). To use the feature you simply set up a global variable containing the directory to put the log files into and then dot source the file.
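Using it looks something like this (the name of the file containing the overrides is hypothetical):

# Set the variable the overrides rely on, then dot source the file containing them.
$logsDirectoryPath = "C:\logs\scheduled-tasks"
. "$rootDirectoryPath\scripts\common\Functions-Logging.ps1"

# From here on, Write-Verbose/Write-Debug/Write-Warning also append to a daily log file.
Write-Verbose "Starting scheduled task work" -Verbose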

The only downside to this was that any function I overwrote in this way no longer worked normally (which is kind of obvious), so during my testing I broke my IDE a number of times, which made development a bit harder than expected. Luckily, I ran across this StackOverflow answer from someone much smarter than me, which showed how to wrap the existing implementations instead of simply replacing them.

function Get-TimestampForLogFileName
{
    return [DateTime]::Now.ToString('yyyyMMdd');
}

function Get-CurrentLogFile
{
    if ([string]::IsNullOrEmpty($logsDirectoryPath)) { throw "LogsDirectoryPath script scoped variable not set. Thats bad, its used to create log files for Powershell output." }

    $timestamp = Get-TimestampForLogFileName

    $logFilePath = "$logsDirectoryPath\$timestamp.log";
    if (-not(Test-Path $logFilePath))
    {
        $logFile = New-Item -Type File -Path $logFilePath -Force;
        $max = 5
        $current = 0
        while (-not ($logFile.Exists))
        {
            Sleep -Seconds 1
            $current++
            if ($current -gt $max) { break }
        }
    }

    return $logFilePath
}

function Get-TimestampForLogContent
{
    return [DateTime]::Now.ToString('yyyyMMddHHmmss');
}

function Write-Debug 
{
    [CmdletBinding()]
    param
    (
       [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
       [AllowEmptyString()]
       [System.String]${Message}
    )

    begin 
    {
       try 
       {
           $outBuffer = $null
           if ($PSBoundParameters.TryGetValue('OutBuffer', [ref]$outBuffer))
           {
               $PSBoundParameters['OutBuffer'] = 1
           }
           $wrappedCmd = $ExecutionContext.InvokeCommand.GetCommand('Write-Debug', [System.Management.Automation.CommandTypes]::Cmdlet)
           $scriptCmd = {& $wrappedCmd @PSBoundParameters }
           $steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
           $steppablePipeline.Begin($PSCmdlet)
       } 
       catch 
       {
           throw
       }
    }

    process 
    {
       try 
       {
            # Write our own timestamped line to the log file, then let the wrapped (original) Write-Debug do whatever it normally does.
            $logFilePath = Get-CurrentLogFile;
            Add-Content -Path $logFilePath "$(Get-TimestampForLogContent):DEBUG:$message" -Force;
            $steppablePipeline.Process($_)
       } 
       catch 
       {
           throw
       }
    }

    end 
    {
       try 
       {
           $steppablePipeline.End()
       } catch 
       {
           throw
       }
    }
}

I’ve only included the Write-Debug implementation, but all of the other ones look the same.

Finally I had a logging implementation that didn’t make me sad. It wrote reliably to a file, and I could easily use it from within a Windows Scheduled Task (after I started creating an actual .ps1 file for the scheduled task to execute that is).

Conclusion

The logging solution that I ended up putting into place isn’t perfect, but it seems to be a hell of a lot better than the other two approaches that I tried.

It outputs in a format that people are used to seeing, and is easily parsable by both machines and people. If I wanted to, I could import the log information into ELK using Logstash, which would let me keep track of the execution of the scheduled tasks remotely. Honestly that’s probably something I’ll put into place eventually, just not right now.

It was nice to be able to wrap functionality provided by something else without having to have written the code explicitly for that in the first place. Coming from C#, I typically code to interfaces, so I’m used to being able to replace things as necessary, but that requires effort up front. Here I simply replaced some system provided cmdlets and made them do something slightly different based on my requirements. I can’t count the number of times I’ve wanted the ability to do that sort of thing with System.Diagnostics.Trace or Console or DateTime (usually for testing purposes).

Having that sort of power is both amazing and dangerous (amazingly dangerous?). It’s very easy to completely break other, unrelated things when you are essentially replacing the implementation of a global static function.

Ah well, with great power, comes great responsibility I suppose.

Wouldn’t want Uncle Bob to be killed by some rogue code in a tragically ironic accident.