
Conferences never fail to leave me somewhat humbled. Normally I suppress the fact that I don’t know enough (because it’s depressing to acknowledge that all the time), but listening to people much smarter and more successful than me speak at conferences really does hammer the point home. I wouldn’t necessarily consider this a bad thing though, as being aware of the limitations of your knowledge is almost always better than believing you are amazing and know all the things.

If you caught my blog post last week, you’ll know that I was planning on heading to DDD Brisbane and Yow! over the last couple of days. Well, I went and did that, and now I’m tired. Richer in knowledge yes, but also tired.

I’m going to use this blog post to give a quick summary of the sessions across both of those events that I thought were the most interesting, both to summarise and solidify my own thoughts on the subjects and to share my experiences with others.

DDD Brisbane

Panel: What is Quality Code (Various)

It’s interesting to see what a small group of people consider to be quality code. It’s one of those really common questions that is hard to answer, by virtue of the fact that everyone really does think about quality in a different way. The panel here was no exception, but there was a common thread of changeability being one of the major measures of quality. Code that is changeable and can adapt in the face of a changing world is of a much higher quality than code that can’t change. A lot goes into making code changeable (like making it understandable first), but in my opinion that’s what it comes down to. This of course means that you should be rightly wary of decisions that make future change difficult (architecture is definitely one of those things).

Parsing Text: The Programming Superpower You Need at Your Fingertips (Nicholas Blumhardt)

Nick gave an interesting talk on parsing data structures from text representations. To be honest, text parsing is one of those things that I tend to shy away from unless absolutely necessary, because it can be very difficult to get right. The actual parsing itself (assuming you have a well defined grammar of some sort) isn’t necessarily the hard part, but the functionality around it (error detection and notification mostly) can be exceedingly difficult. Nick demonstrated solving the problem from two different directions, the first using a composable object-oriented approach (an Integer class, a Character class, a Double class built from two Integers and a Period) and then improved on that design by using a more fluent syntax (via his Sprache library).
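
To give a flavour of the combinator approach, here’s a minimal sketch using Sprache to compose a double parser out of smaller pieces (the grammar and names are my own simplification, not Nick’s actual demo code):

using System;
using System.Globalization;
using Sprache;

public static class DoubleGrammar
{
    // An integer is one or more digits, captured as a string.
    static readonly Parser<string> Integer = Parse.Digit.AtLeastOnce().Text();

    // A double is an integer, a period, then another integer, composed
    // using LINQ comprehension syntax over the smaller parsers.
    public static readonly Parser<double> Double =
        from whole in Integer
        from dot in Parse.Char('.')
        from fraction in Integer
        select double.Parse(whole + "." + fraction, CultureInfo.InvariantCulture);

    public static void Main()
    {
        Console.WriteLine(Double.Parse("12.34")); // prints 12.34
    }
}

The nice part is that each little parser is independently testable and reusable, which is exactly the composability being sold.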

Back to Basics: Simple, Elegant and Beautiful Code (Andrew Harcourt)

Andrew was as entertaining as always. One of the most important things he mentioned was to make sure you include personal and team happiness in the mix when you’re optimising your outcomes. It’s often left behind in the rush to deliver features or to do things perfectly, but it can definitely have one of the biggest impacts when it comes to your own (and your team’s) ability to deliver anything of import. He said a bunch of other stuff as well, but the happiness point really resonated with me.

Sweet You’re Agile! Now What? (Chris Gilbert)

Chris talked about Agile. Other than the normal agile themes, processes and practices, he mentioned one particular approach for standups that I thought was interesting (and that I will be trying out in the next couple of days). Instead of going around the group getting everyone to talk about what they did/are doing/are blocked on, he suggested going through the board and asking what the team can do to deliver each of the stories/tickets that are in progress. He called this process “walking the board” and I think it would definitely be more useful than the standard standup (which can very easily become a progress report). The focus is shifted to actually delivering (or asking the questions that lead to delivery) instead of just activity.

What is a Compiler: We Thought We Knew… (Mads Torgersen)

This talk was amazing. Mads exudes a real enthusiasm for the subject (the new C# compiler and its domain model) and it’s hard not to get excited when he does. He went through the challenges in merging the two codebases that understand C# (the compiler and the IDE) and the structure that now lives underneath the hood (immutable but shared syntax trees). He also demonstrated some particularly cool things like being able to write custom analysers (so you can highlight when someone is doing something that you discourage, like creating public static methods) and fixers (that allow you to regenerate the code such that it no longer has the problem) and the ability to run C# code from a command line in a REPL.
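
To give an idea of what writing one of those analysers involves, here’s a minimal sketch of the standard Roslyn analyser shape, flagging public static methods as per the example above (the diagnostic id and messages are made up):

using System.Collections.Immutable;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;

[DiagnosticAnalyzer(LanguageNames.CSharp)]
public class PublicStaticMethodAnalyzer : DiagnosticAnalyzer
{
    static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
        "DEMO001",                          // hypothetical diagnostic id
        "Avoid public static methods",
        "Method '{0}' is public and static",
        "Design",
        DiagnosticSeverity.Warning,
        isEnabledByDefault: true);

    public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
    {
        get { return ImmutableArray.Create(Rule); }
    }

    public override void Initialize(AnalysisContext context)
    {
        // Run our check on every method declaration in the syntax tree.
        context.RegisterSyntaxNodeAction(AnalyzeMethod, SyntaxKind.MethodDeclaration);
    }

    static void AnalyzeMethod(SyntaxNodeAnalysisContext context)
    {
        var method = (MethodDeclarationSyntax)context.Node;
        if (method.Modifiers.Any(SyntaxKind.PublicKeyword) &&
            method.Modifiers.Any(SyntaxKind.StaticKeyword))
        {
            context.ReportDiagnostic(
                Diagnostic.Create(Rule, method.Identifier.GetLocation(), method.Identifier.Text));
        }
    }
}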

Yow!

Delivery Mapping: Turning the Lights On (Dan North)

Dan is another one of those people that I’ve heard of, but never heard speak before. He’s a really good speaker, and he was presenting some very interesting things, so this session stuck in my mind quite persistently (unlike some of the other talks that are already getting a bit fuzzy). Dan spoke about a different method of organising agile software development at scale. His idea was to facilitate your people organising themselves into short-lived teams (contrary to accepted practice) so that they rally around problems rather than around the team itself. He then talked at length about helping to identify those teams by determining what skills you have available (and what skills you lack) and then cross-referencing that with the skills that you need to solve a particular problem.

The Future of Software Engineering (Glenn Vanderburg)

Glenn talked (almost lectured really) about the history of software engineering (as compared to other, more mature forms of engineering) and how relatively recently there has been a lot of sentiment that software engineering is not really engineering, because it’s different at some fundamental level that makes it incompatible. Glenn disagrees, saying that it’s more likely we only ever tried a set of processes very closely modelled on another engineering discipline, and when it didn’t work for us we tried to throw out the entire concept rather than trying something different. Glenn reiterated a sentiment that I’ve heard before (but can’t really remember from where) about the profession misunderstanding exactly what constitutes design when it comes to software development. The code (and everything leading up to it) is the design, rather than the code being the artefact. This means of course that an architectural blueprint in other engineering disciplines (being a set of instructions to follow to construct something) is equivalent to the source code in software development (being a set of instructions to accomplish something). This flips the typical cost model: materials and construction are usually the most expensive part of building something, but for software, building the thing from the blueprint is essentially free. A concept I agree with.

Engineering and Exploring the Red Planet (Anita Sengupta & Kamal Oudrhiri)

I mention this session (which was one of the day 2 keynotes) mostly because it was incredibly interesting to see the engineering challenges that went into landing the most recent rover on Mars. Specifically, it was pretty hilarious that at day 200 of the expedition, some sort of bug prevented the primary computer on the rover (there are two) from receiving commands, so they had to force it to shut down and hope the backup computer turned itself on. They expected the backup to be available within 60 seconds, but it didn’t come back for 3+ minutes. The hilarious part: the delay between the primary going down and the backup becoming available was actually a feature that was added late in development, and not everyone was aware of it, so they were panicking for every second of that 3 minute delay.

Agile is Dead – Long Live Agility (Dave Thomas)

While I’ve heard of Dave Thomas (one of the original writers of The Manifesto for Agile Software Development), I’ve never actually heard him speak. He espoused a lot of things that I’ve been struggling with regarding agile for a while now, including its commercialisation, the fact that one does not “do Agile” and that most people seem to have lost sight of what it was originally meant to accomplish. It’s really about solving problems incrementally, which for software usually means delivering small amounts of functionality on a regular basis and adapting based on the feedback loop that this enables. Dave gave his incredibly polished and detailed plan for increasing agility, which I will paraphrase here because really everyone should see it:

  1. Make a small step forward.
  2. Get feedback.
  3. Adjust.

As you can see, an incredibly complex and fully realised process model. He should sell it through a consultancy.

Designing for Failure: Scaling Uber’s Backend by Breaking Everything (Matt Ranney)

It was refreshing to hear Matt talk about how successful organisations make mistakes just as much as everyone else. He spoke at length about the various failures that had occurred in Uber’s software systems over the last year or so and how they were eventually dealt with. Usually speakers talk at conferences about how their organisation does X and Y amazingly and so should you, so it was great to hear about how Uber has made a bunch of mistakes during their incredible growth. As a bonus, the background images for some of his slides showed a visualisation of the Uber traffic through various major US cities over a 24 hour period, which in itself was incredibly interesting.

Making Hacking Child’s Play (Troy Hunt)

Last year at DDD I had the pleasure of watching OJ Reeves do some hacking. Troy’s talk this year was in a similar vein, but more presentational than demonstrative (although there were definitely demonstrations, like making any old website do the Harlem Shake; well, almost any, except Troy’s personal project Have I Been Pwned). Troy went through a lot of different security material, mostly themed around how hacking can be easy enough for even children to do (and bored kids do a lot of things). The best part of the session? Troy had set up a WiFi Pineapple as a public WiFi hotspot in the room. Not super interesting by itself, but he also set it up so that it looked like any unsecured public WiFi network that a device had previously connected to. This meant that any mobile device in the room that had previously connected to an unsecured public WiFi network at any point in the past was quite happy to connect to this one and just jam all its normal traffic through. Result: Troy could see and record any HTTP traffic from a number of different phones in the room. Hilarious.

DevOps @ Wotif – Making Easy = Right (Alexandra Spillane and Matt Callanan)

Nothing revolutionary here, but a very solid talk about the process that Wotif went through in order to solve the problems they were having with an incredibly over-constrained operations team and generally low morale (because no-one was delivering). It was enlightening to hear someone talk about the application of DevOps principles within an organisation that still needed to have a fair amount of control. They did it using standards and checkpoints, which honestly is probably the only way you can migrate an organisation from the old school way of thinking to the newer one. I imagine that eventually, with mature enough teams and ingrained practices, they will probably start to drop some of the formality, but during the transformative early stages that sort of thing is critical.

The Mother of all Programming Languages Demos (Sean McDirmid)

Sean’s talk was interesting, if not immediately applicable to my day-to-day problems. He spoke of different/new ways of programming that involve letting the computer help you find a solution, rather than you just writing one. I think it was mostly about decreasing the time in feedback loops, because he demoed a pseudo programming language (it was Python-esque) that allowed you to run and change code in real-time. It could do things like go backwards and forwards in time for animations (including visualising the next 200 positions of your object) and scrubbing through possible values for your variables and functions. Like I said, interesting, but probably not useful for another few years at least (which is what I would expect out of a talk from a member of the Microsoft research department).

Conclusion

For me and for the cost involved (a paltry $30), DDD is a no-brainer. There is really no reason not to go to that beautiful little conference and listen to your peers (and betters) talk about interesting things.

Yow on the other hand…

I’m not 100% convinced of the value for money that actually attending the Yow! conference provides. Don’t get me wrong, it was certainly interesting, and going to these sorts of things is more about promoting a culture of learning and improvement than extracting tangible value from the conference itself. It’s just that if you wait a little bit, you can see absolutely everything that was presented, in the comfort of your office or home, for nothing at all.

I suppose for a more outgoing person the main benefit is being in the same place as so many smart people, so you can pick their brains and have interesting discussions about all sorts of things. What I mostly see though is people sticking together in the groups defined by the organisations that they came from (myself included), because it’s just easier that way. Granted, not everyone does this, but it definitely seemed to be more common than not.

If promoting discussions between smart people really was the goal though, there would be more dedicated time for sitting down in smaller groups and discussing things. Maybe instead of focusing the conference around speakers and sessions, you could organise lots of small rooms with whiteboards and materials available. Give each room a topic of some sort (where you still have one or two people facilitating) and then let it all sort itself out.

Much much harder to organise and sell I imagine, but I’d be more comfortable paying for that sort of thing.

That sounds like real value.


Well, for me anyway, because the conferences I tend to go to happen around this time of year.

I’m off to two opportunities for self-improvement this year, DDD Brisbane and Yow!. Unfortunately (fortunately?) they are scheduled extremely close together (Saturday, Monday, Tuesday), so I’ve got a busy few days full of learning coming up. Still, a little busyness is a small price to pay for knowledge. “Scientia potentia est” after all.

I don’t know any Latin beyond what the internet tells me.

DDD Brisbane I had to pay for out of my own pocket (a mere AUD $30, incredibly affordable for anyone looking to take advantage of the experience of others), but my employer (Onthehouse Group) was nice enough to cover the costs of Yow!, so I didn’t have to take that one on the chin.

Last year I only managed to attend DDD, so it will be nice to head off to Yow! this year to absorb all that delicious knowledge. I’ll probably do a follow up post after the conferences are over (assuming I yet live), but for now this post will be a quick one highlighting some of the things that I’m looking forward to.

Developers, Developers, Developers

There is some good stuff happening at DDD this year. Personally, I’m looking forward to the following sessions.

Panel: What is Quality Code

This is one of those all-star panels that will probably dump massive amounts of experience in a very short amount of time. Featuring some of the more well known players in the Software Development space (in Brisbane at least), I’m sure it will be both engaging and educational.

Designing an API

Indu Alagarsamy & John Simons talk is topical for me, because we are currently investing in the design and implementation of an API that will expose parts of our flagship software that were previously unavailable to external consumers. I’m interested to see what these two speakers will offer in terms of new information, or lessons learnt.

Back to Basics: Simple, Elegant, Beautiful Code

Listening to Andrew Harcourt give a talk is always both entertaining and enlightening, and I’m sure this talk will be no different. It’s always a goal of mine to write simple code, and I consider myself a craftsman when it comes to developing software, so I’m interested in what sort of things Andrew has to say on the matter.

Sweet You’re Agile! Now What?

It’s always interesting to see what people have to say about Agile, and I wonder what Chris has to add to all the other stuff floating about out there.

Yow

I’ve never actually been to a Yow! conference before, though I have watched quite a lot of the videos that they put up after the fact (you can view all of the previous Yow! videos for free on the Eventer website). It should be a good opportunity for me, even though it’s a much larger, less personal conference. This year I’m looking forward to the following sessions.

Delivery Mapping: Turning the Lights On

Writing software is all well and good, but if you can’t deliver it in a timely, reliable fashion, what does it really matter? I’m hoping this session will provide some new information on that front.

Pragmatic Microservices: Whether, When and How to Migrate

Whether or not to use a microservice-based architecture is something we’re struggling with right now, so I’m very interested in seeing real examples from big companies (on both sides of the argument), accompanied by some sort of analysis of the desirability of the approach.

Property Based Testing: Shrinking Risk in Your Code

I’m a big proponent of testing, and while I have heard some things about property based testing, my level of knowledge about it would barely fill a thimble. I’m hoping the content of this session will enlighten me somewhat.

Agile is Dead (Long Live Agility)

In the same sort of vein as the Agile talk Chris is giving at DDD, I’m curious to hear what one of the original founders of the Agile Manifesto has to say on the subject. Personally, I think not enough people pay attention to Agile values and instead focus on the process, thinking that makes them “agile”.

Designing for Failure: Scaling Uber’s Backend By Breaking Everything

Now this is the kind of talk I can get behind. Using AWS as our primary hosting mechanism has only solidified in my mind the fact that anything can break at any moment, so you might as well be prepared for it. I wonder if this session will feature similar concepts to Netflix and their legendary Chaos Monkey, Gorilla and Kong.

Making Hacking Child’s Play

Troy Hunt is pretty legendary in the security field, so I’m sure what he has to say will have excellent ramifications for the way I think about security.

DevOps @ Wotif: Making Easy = Right

Looking back I’ve always been interested in making sure that software developers are involved at every step of the software pipeline, and deployment/support is no exception. The relatively new culture around DevOps only reaffirms that sort of attitude, taking operational concerns into the programming space. Hopefully this talk will add more fuel to the fire.

Conclusion

The downside of these sorts of conferences is that you simply can’t see everything. At least with Yow! I will be able to watch the videos later, which I certainly will be doing. There were some particularly hard decisions made in my list above, and I don’t want to miss out on anything amazing.

As an aside, it really does feel like there is never an end to learning new things in this field and that can be a combination of exhilarating and exhausting. The problem is, if you stop to take a breather, everything moves so fast that you’re already behind the curve. One day it might settle down, but I can’t imagine that’s going to happen in my lifetime.

I’m still not entirely sure if anyone actually reads this blog, but if you do, and you somehow recognise me at either of the conferences above, come and say hi.


So, we have all these logs now, which is awesome. Centralised and easily searchable, containing lots of information relating to application health, feature usage and infrastructure utilisation, along with many other things I’m not going to bother to list here.

But who is actually looking at them?

The answer to that question is something we’ve been struggling with for a little while. Sure, we go into Kibana and do arbitrary searches, we use dashboards, we do our best to keep an eye on things like CPU usage, free memory, number of errors and so on, but we often have other things to do. Nobody has a full time job of just watching this mountain of data for interesting patterns and problems.

We’ve missed things:

  • We had an outage recently that was the result of a disk filling up with log files, because an old version of the log management script had a bug in it. The free disk space was clearly trending down when you looked at the Kibana dashboard, but it was happening so gradually that it was never really brought up as a top priority.
  • We had a different outage recently where we had a gradual memory leak in the non-paged memory pool on some of our API instances. Similar to above, we were recording free memory and it was clearly dropping over time, but no-one noticed.

There have been other instances (like an increase in the total number of 500s being returned from an API, indicating a bug), but I won’t go into too much more detail about the fact that we miss things. We’re human, we have other things to do, it happens.

Instead, let’s attack the root of the issue: the human element.

We can’t reasonably expect anyone to keep an eye on all of the data hurtling towards us. It’s too much. On the other hand, all of the above problems could easily have been detected by a computer; all we need is something that can do the analysis for us, and then let us know when there is something to action. It doesn’t have to be incredibly fancy (no learning algorithms… yet), all it has to do is compare a few points in time and alert on a trend in the wrong direction.

One of my colleagues was investigating solutions to this problem, and they settled on Sensu.

Latin: In The Sense Of

I won’t go into too much detail about Sensu here, because I think the documentation will do a much better job than I will.

My understanding of it, however, is that it is a very generic, messaging based check/handle system, where a check can be almost anything (run an Elasticsearch query, go get some current system stats, respond to some incoming event) and a handler is an arbitrary reaction (send an email, restart a server, launch the missiles).

Sensu has a number of components, including servers (wiring logic, check -> handler), clients (things that get checks executed on them) and an API (separate from the server). All communication happens through RabbitMQ and there is some usage of Redis for storage (which I’m not fully across yet).

I am by no means any sort of expert in Sensu, as I did not implement our current proof of concept. I am, however, hopefully going to use it to deal with some of the alerting problems that I outlined above.

The first check/handler to implement?

Alert us via email/SMS when the available memory on an API instance is below a certain threshold.

Alas I have not actually done this yet. This post is more going to outline the conceptual approach, and I will return later with more information about how it actually worked (or didn’t work).

Throw Out Broken Things

One of the things that you need to come to terms with early when using AWS is that everything will break. It might be your fault, it might not be, but you should accept right from the beginning that at some point, your things will break. This is good in a way, because it forces you to not have any single points of failure (unless you are willing to accept the risk that they might go down and you will have an outage, which is a business decision).

I mention this because the problem with the memory in our API instances that I mentioned above is pretty mysterious. It’s not being taken by any active process (regardless of user), so it looks like a driver problem. It could be one of those weird AWS things (there are a lot), and it goes away if you reboot, so the easiest solution is to just recycle the bad API instance and move on. It’s already in an auto-scaling group for redundancy, and there is always more than one, so it’s better to just murder it, relax, and let the ASG do its work.

Until we’re comfortable automating that sort of recycling, we’ll settle for an alert that someone can use to make a decision and execute the recycle themselves.

By installing the Sensu client on the machines in question (incorporating it into the environment setup itself), we can create a check that allows us to remotely query the available free memory and compare it against some configured value that we deem too low (let’s say 100MB). We can then configure two handlers for the check result: one that emails a set of known addresses, and another that does the same via SMS.
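
As a sketch of what that check might look like in Sensu’s JSON configuration (the plugin name, subscription and thresholds here are illustrative guesses, not our actual setup):

{
  "checks": {
    "memory-free": {
      "command": "check-memory.rb -w 250 -c 100",
      "subscribers": ["api-instances"],
      "interval": 60,
      "handlers": ["email", "sms"]
    }
  }
}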

Seems simple enough. I wonder if it will actually be that simple in practice.

Summary

Alerting on your aggregate of information (logs, stats, etc) is a pretty fundamental ability that you need to have.

AWS does provide some alerting in the form of CloudWatch alarms, but we decided to investigate a different (more generic) route instead, mostly because of the wealth of information that we already had available inside our ELK stack (and our burning desire to use it for something other than pretty graphs).

As I said earlier, this post is more of an outline of how we plan to attack the situation using Sensu, so it’s a bit light on details I’m afraid.

I’m sure the followup will be amazing though.

Right?


With our relatively new desire to create quality APIs facilitating external integration with our products, we’ve decided to start using Apiary.

In actuality, we’ve been using Apiary for a while now, but not very well. Sure we used it to document what the API looked like, but we didn’t take advantage of any of its more interesting features. That changed recently when I did a deep dive into the tool in order to clean up our existing documentation before showing it to a bunch of prospective third party integrators.

Apiary is an excellent tool for assisting in API development. It not only allows for the clear documentation of your API using markdown and multi-markdown, but it also acts as a mock API instance. This of course allows you to design the API the way you want it to be, and then evaluate that design in a very lightweight fashion, before committing to fully implement it. Cheap prototypes make for lots of validation during the design phase, which in turn makes for a better design, which generally leads to a better end-result.

Even once you’ve settled on a design, and fully implemented it, you can use the blueprint to call into your real API, using the documented parameters and attributes. This allows for all sorts of useful testing and validation from an entirely different point of view (and is especially useful for those users unfamiliar with other mechanisms for calling an API).

It’s not all amazing though (even though it is pretty good). There are a few features that are lacking, which make me sad as a software developer, and I’ll go into those later on. In addition to a few areas that are lacking, it has a bit of a learning curve, and it can be difficult to know exactly what will work and what won’t. There are quite a few good examples and tutorials available, but the Apiary feature set also seems to change quite rapidly, so it can be hard to determine exactly what you should be using.

This post is mostly going to contain an explanation of how we currently structure our API documentation, along with a summary of the useful things I’ve learned over the past week or so. Really though, it’s for me to fortify this information in my own head, and to document it in an easily shareable location.

The Breakdown

I like to make sure that API documentation speaks for itself. Other than access credentials, you should be able to point someone at your documentation and they should be able to use it without needing anything else from you. It’s not that I don’t like talking to people, and there are definitely always going to be questions, but I find that “doesn’t need to come back to you” is a pretty good measure of documentation quality.

Of course, this implies that your documentation should do more than just outline your endpoints and how to talk to them. Don’t get me wrong, it needs to do that as well (in as much detail as you can muster), but it also needs to give high level information about the API (where the instances are, a summary of the endpoints available) as well as information about the concepts and context in play.

As a result, I lean towards the following headers in Apiary: Overview, Concepts and Reference.

Overview

The overview section gives a brief outline of the purpose of the API, and lists known instances (actual URLs) along with their purpose. This is a good place to differentiate your environments (CI, Staging, Production) and anything useful to know about them (CI is destroyed every night for example, so data will not be persisted).

It also lists a summary of the endpoints available, along with verbs that can be used on them. This is really just to help users understand the full breadth of the API. Ideally each one of these endpoints links to its actual endpoint specification in the reference section.
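
To illustrate, a hypothetical Overview fragment (the instance URLs and endpoints are invented for the example) might look like:

# Overview
This API manages customers and their related data.

## Instances
- CI: https://ci.api.example.com (destroyed nightly, data is not persisted)
- Staging: https://staging.api.example.com
- Production: https://api.example.com

## Endpoints
- /customers (GET, POST)
- /customers/{id} (GET, PUT, DELETE)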

Concepts

The concepts section is where you go into more detail about things that are useful for the user to know. This is where you can outline the domain model in play, or describe anything that you think the user would be unfamiliar with because it requires specialised knowledge. You can easily put Terms of Reference here, but you should focus on not only explaining unfamiliar words, but also on communicating the purpose and reasoning behind the API.

This is also a good place to talk about authentication, assuming you have some.

Reference

This is your normal API reference. Detail the endpoints/resources and what verbs are available on them. Explain required parameters (and the ways in which they must be supplied) along with expected outputs (both successful and failure). I include in this section a common sub-section, detailing any commonality between the API endpoints (like all successful results look like this, or any request can potentially return a 400 if you mess up the inputs, etc).

For Apiary in particular, make sure that your examples actually execute against a real server, as well as act appropriately as a mock. This is one of the main benefits of doing this sort of API documentation in Apiary (rather than just a Github or equivalent repository using markdown), so make the most of it.

Helpful Tips

I found that using Data Structures (using the +Attribute syntax) was by far the best way of describing request/response structures, both in terms of presentation in the resulting document and in terms of reusability between multiple elements.

Specifying a section for data structures is easy enough. Before you do your first API endpoint reference in the document (with the ## Customer [/customers] syntax), you can simply specify a data structures heading as follows:

# Data Structures
## Fancy Object (object)
+ Something: Sample Value (string, required) - Comment describing the purpose of the field
+ AnotherThing: `different sample escaped` (string, required)
+ IsAwesome: true (boolean, optional)

## Simple Thing (object)
+ Special: yes (string, required)

## Derivation (Fancy Object)
+ Magical: (Simple Thing, required)

You can then reference those data structures using the +Attributes (Data Structure Name) syntax in the API reference later on.
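
For example, a hypothetical endpoint that reuses the structures defined above might look like this (the Customer resource is made up for illustration):

## Customer [/customers]

### Create a Customer [POST]

+ Request (application/json)
    + Attributes (Fancy Object)

+ Response 201 (application/json)
    + Attributes (Derivation)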

Notice that you can model inheritance and composition in a very nice way, which can make for some very reusable components, and that there is room for comments and descriptions to do away with any ambiguity that isn’t already dealt with via the name and example value.

It Could Be Better

It’s not all great (even though it is pretty good).

I had a lot of issues with common headers in my API Blueprint, like a common header for identifying the source of a request, or even the requirement of an Authorization header. This was mostly just me not wanting to repeat myself: I didn’t want to have to specify the exact same header for every single endpoint in the documentation while still keeping the examples executable. Even the ability to specify variables to be substituted into the blueprint in a later place would have been better. I really don’t like having the same information stated in multiple locations. It makes maintenance a gigantic pain.

Another rough patch is that there seem to be a number of different ways to accomplish the same sort of thing (reduced duplication when dealing with request/response templates). You can use models (which can define headers, bodies and schemas) or you can use data structures (which are less JSON specific and more general, but can’t do headers) or you can manually specify the information yourself in every section. It can be difficult to determine the best tool to use.

I assume that this is a result of the service changing quite rapidly and that it will likely settle down as it gets more mature.

Finally, because of the nature of the markdown, it can be quite hard to determine whether you have actually done what you wanted to do, in terms of the Apiary specific data structures and specifications. If you mistype or leave out some required syntax, it’s probably still valid markdown, so your documentation still works, but it also doesn’t quite do the right thing. There is some syntax and semantic checking available in the online editor, which is awesome, but it can be a bit flaky at times itself. Still, I imagine this will improve greatly as time goes on.

Conclusion

As a step up from normal API documentation (either in a Wiki or using some sort of Markdown), Apiary offers a number of very useful features. By far the most useful is the ability for the specification to actually create an executable service, as well as pass requests through to your real service. This can really improve the turnaround time for validating your designs and is a major boon for API development.

At a conceptual level, I’m still disappointed that we have to document our APIs, rather than somehow exposing that information from the API itself. My ideal situation would be if that sort of ability was included as part of the source of the API. I still hate the fact that the two can very easily get out of sync, and no matter how hard you try, the only source of truth is the actual executable code. It’s the same problem with comments in your code. If you can’t make the code express itself well enough, you have to resort to writing around it.

If I have to document though, I want the documentation to be verifiable, and Apiary definitely provides that.


And now, for the thrilling conclusion! The thrilling conclusion to what you might ask? Check back here for the first part in this saga.

I mentioned very briefly in the first part of this epic that we attempted to fix a strange threading bug with token generation via the ASP.NET Identity Framework by making sure that the Entity Framework DbContext was fully initialized (i.e. model created, connection established, etc) before it left the factory method. Initial tests were promising, but it turns out this did not fix the issue.

I mention this because I had absolutely no luck reproducing the connection leak when I was running locally (with or without a profiler attached). I could easily force timeouts when getting a connection from the pool (because it was too busy), but I couldn’t reproduce the apparent situation where there were connections established that could not be actively used.

When going through the combination of CloudWatch logs for RDS (to track connection usage) and our own ELK stack I found a correlation between the errors that sometimes occurred when generating tokens and the increase in the usage of connections. This pattern was pretty consistent. Whenever there was a cluster of errors related to token generation, there was an increase in the total number of connections used by the service, which never went down again until the application pool was recycled at the default time of 29 hours from the last recycle.

Token Failure

We’ve been struggling with the root cause of the token generation failures for a while now. The most annoying part is that it doesn’t fail all the time. In fact, my initial load tests showed only around a 1% failure rate, which is pretty low in the scheme of things. The problem manifests itself in exceptions occurring when a part of the Identity Framework attempts to use the Entity Framework DbContext that it was given. It looks as though there is some sort of threading issue with Entity Framework, which makes sense conceptually. Generally EF DbContext objects are not thread safe, so you shouldn’t attempt to use them on two different threads at the same time.

The errors were many and varied, but all consistently came from our implementation of the OAuthAuthorizationServerProvider. A few examples are below:

System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   --- End of inner exception stack trace ---
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42
   
System.NullReferenceException: Object reference not set to an instance of an object.
   at Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Entity.Infrastructure.Interception.InternalDispatcher`1.Dispatch[TTarget,TInterceptionContext,TResult](TTarget target, Func`3 operation, TInterceptionContext interceptionContext, Action`3 executing, Action`3 executed)
   at System.Data.Entity.Infrastructure.Interception.DbCommandDispatcher.Reader(DbCommand command, DbCommandInterceptionContext interceptionContext)
   at System.Data.Entity.Core.EntityClient.Internal.EntityCommandDefinition.ExecuteStoreCommands(EntityCommand entityCommand, CommandBehavior behavior)
   at System.Data.Entity.Core.Objects.Internal.ObjectQueryExecutionPlan.Execute[TResultType](ObjectContext context, ObjectParameterCollection parameterValues)
   at System.Data.Entity.Core.Objects.ObjectContext.ExecuteInTransaction[T](Func`1 func, IDbExecutionStrategy executionStrategy, Boolean startLocalTransaction, Boolean releaseConnectionOnSuccess)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<>c__DisplayClass7.<GetResults>b__5()
   at System.Data.Entity.Core.Objects.ObjectQuery`1.GetResults(Nullable`1 forMergeOption)
   at System.Data.Entity.Core.Objects.ObjectQuery`1.<System.Collections.Generic.IEnumerable<T>.GetEnumerator>b__0()
   at System.Data.Entity.Internal.LazyEnumerator`1.MoveNext()
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
   at System.Linq.Queryable.FirstOrDefault[TSource](IQueryable`1 source, Expression`1 predicate)
   at [OBFUSCATION!].Infrastructure.Repositories.AuthorizationServiceRepository.GetApplicationByKey(String appKey, String appSecret) in c:\[OBFUSCATION!]\Infrastructure\Repositories\AuthorizationServiceRepository.cs:line 412
   at [OBFUSCATION!].Infrastructure.Providers.AuthorizationServiceProvider.ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context) in c:\[OBFUSCATION!]\Infrastructure\Providers\AuthorizationServiceProvider.cs:line 42

In the service, this doesn’t make a huge amount of sense. There is one DbContext created per request (via Owin), and while the Owin middleware is asynchronous by nature (meaning that execution can jump around between threads) there is no parallelism. The DbContext should not be being used on multiple threads at one time, but apparently it was.

It was either that, or something was going seriously wrong in the connection pooling code for Npgsql.
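
For context, the per-request wiring typically looks something like the following sketch (the standard Identity/Owin pattern; ApplicationDbContext is a placeholder name, not our actual class):

using System.Data.Entity;
using Microsoft.AspNet.Identity.Owin;
using Owin;

public class ApplicationDbContext : DbContext
{
    // Factory handed to the Owin middleware below.
    public static ApplicationDbContext Create()
    {
        return new ApplicationDbContext();
    }
}

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        // One DbContext per request, disposed when the request ends.
        // Consumers retrieve it via context.OwinContext.Get<ApplicationDbContext>().
        app.CreatePerOwinContext(ApplicationDbContext.Create);
    }
}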

Scope Increase

As I didn’t quite understand how the dependency injection/object lifetime management worked via the OwinContext, I had my suspicions that something was going awry there. Either the DbContext was not in fact generated once per request, or there was some strange race condition that allowed a DbContext to be reused on more than one thread.

I decided to rewrite the way in which dependencies are obtained in the service. Instead of generating a DbContext per request, I would supply a DbContextFactory to everything, and let each component generate its own, temporarily scoped DbContext that it is responsible for disposing.

In order to accomplish this I switched to an IoC container that I was more familiar with, Ninject. Not a small amount of work, and not without added complexity, but I felt that it made the code more consistent with the rest of our code bases and generally better.

In retrospect, I should have verified that I could reproduce the token generation errors at will first, but I didn’t. I wrote the test after I’d spent the better part of a day switching out the dependency injection mechanisms. This was a mistake.

Since the errors always occurred during the execution of a single endpoint, I wrote a test that uses 10 tasks to spam that particular endpoint. If none of the tasks fault within a time limit (i.e. no exceptions are thrown), then the test is considered a success. Basically a very small, focused stress test to be run automatically as part of our functional test suite.

[Test]
[Category("functional")]
public void WhenAttemptingToGenerateMultipleTokensAtTheSameTime_NoRequestsFail()
{
    var authClientFactory = _resolver.Get<IAuthClientFactory>();
    var app = new ApplicationBuilder(authClientFactory.CreateSeeder())
        .WithRole("auth_token_generate")
        .WithRole("auth_customer_register")
        .WithRole("auth_database_register")
        .WithRole("auth_user_register")
        .WithRole("auth_query")
        .Build();

    var userBuilder = new UserBuilder(authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret));
    userBuilder.Build();

    List<Task> tokenGenerationTasks = new List<Task>();
    var cancellation = new CancellationTokenSource();
    for (int i = 0; i < 10; i++)
    {
        var task = Task.Factory.StartNew
        (
            () =>
            {
                var client = authClientFactory.CreateFromApplication(app.ApplicationDetails.Key, app.ApplicationSecret);
                while (true)
                {
                    if (cancellation.Token.IsCancellationRequested) break;
                    var token = client.TokenGenerate(userBuilder.CustomerId + "/" + userBuilder.DatabaseId + "/" + userBuilder.UserCode, userBuilder.Password);
                }
            },
            cancellation.Token,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default
        );

        tokenGenerationTasks.Add(task);
    }

    // The idea here is that if any of the parallel token generation tasks throws an exception,
    // its task will fault, so the wait below will return early and the exception is rethrown afterwards.
    Task.WaitAny(tokenGenerationTasks.ToArray(), TimeSpan.FromSeconds(15));
    cancellation.Cancel();

    var firstFaulted = tokenGenerationTasks.FirstOrDefault(a => a.IsFaulted);
    if (firstFaulted != null) throw firstFaulted.Exception;
}

The first time I ran the test against a local service it passed successfully…

Now, I don’t know about anyone else, but when a test works the first time I am immediately suspicious.

I rolled my changes back and ran the test again, and it failed.

So my restructuring successfully fixed the issue, but why?

The Root Of The Problem

I hadn’t actually understood the issue; all I did was make enough changes such that it seemed to no longer occur. Without that understanding, if it recurred, I would have to start all over again, possibly misdirecting myself with the changes that I made last time.

Using the test that guaranteed a reproduction, I investigated in more depth. Keeping all of my changes reverted, I was still getting a weird sampling of lots of different errors, but they were all consistently coming from one of our repositories (classes which wrap a DbContext and add extra functionality) whenever it was used within our OAuthAuthorizationServerProvider implementation.

Staring at the code for a while, the obviousness of the issue hit me.

At startup, a single OAuthAuthorizationServerProvider implementation is created and assigned to generate tokens for requests to the /auth/token endpoint.

This of course meant that all of the functions in that provider needed to be thread safe.

They were not.

Of the two functions in the class, both set and then used a class-level variable, which in turn had a dependency on a DbContext.

This was the smoking gun. If two requests came in quickly enough, one would set the variable (using the DbContext for its request), the other would do the same (using a different DbContext), and then the first would attempt to use the other thread’s DbContext (indirectly through the variable). This would rightly cause an error (as multiple threads tried to use the same DbContext) and throw an exception, failing the token generation.
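
In code, the unsafe shape looked something like the following reconstructed sketch (the names are hypothetical, not our actual source):

using System.Threading.Tasks;
using Microsoft.Owin.Security.OAuth;

public interface IAuthRepository
{
    // Wraps a per-request Entity Framework DbContext internally.
    object GetApplicationByKey(string appKey, string appSecret);
}

public class UnsafeAuthProvider : OAuthAuthorizationServerProvider
{
    // BUG: a single instance of this provider services every request,
    // so this field is shared across all request threads.
    private IAuthRepository _repository;

    public override Task ValidateClientAuthentication(OAuthValidateClientAuthenticationContext context)
    {
        string appKey;
        string appSecret;
        context.TryGetBasicCredentials(out appKey, out appSecret);

        // Request A stores its per-request repository (and DbContext) here...
        _repository = context.OwinContext.Get<IAuthRepository>("repository");

        // ...but if request B overwrites the field before A reads it again,
        // A ends up driving B's DbContext from a different thread, which is
        // exactly what Entity Framework forbids.
        if (_repository.GetApplicationByKey(appKey, appSecret) != null)
        {
            context.Validated();
        }

        return Task.FromResult(0);
    }
}

The fix is simply to keep that repository in a local variable for each call, rather than on the shared provider instance.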

I abandoned my changes (though I will probably make them again over time), removed the class variable and re-ran the test.

It was all good. No errors at all, even after running for a few hours.

But why did the error cause a resource leak at the database connection level?

Leaky Logic

In the end I didn’t find out exactly why threading errors with Entity Framework (using Npgsql) were causing connection leaks. I plan to investigate in more depth in the future, and I’ll probably blog about my findings, but for now I was just happy to have the problem solved.

With the bug fixed, profiling over a period of at least 24 hours showed no obvious connection leaks as a result of normal traffic. Previously this would have guaranteed at least 10 connections leaking, possibly more. So for now the problem is solved, and I need to move on.

Summary

Chasing down resource leaks can be a real pain, especially when you don’t have a reliable reproduction.

If I had realised earlier that the token generation failures and connection leaks were related, I would have put more effort into reproducing the first in order to reproduce the second. It wasn’t immediately obvious that they were linked though, so I spent a lot of time analysing the code trying to figure out what could possibly be leaking valuable resources. This was a time consuming and frustrating process, ultimately leading nowhere.

Once I finally connected the dots between the token failures and the connection leak, everything came together, even if I didn’t completely understand why the connections were leaking in error situations.

Ah well, can’t win ’em all.