Stop! In The Name Of Instance Retirement
A few weeks ago I learned something new about AWS: specifically, how instance retirement interacts with auto scaling groups.
I mean, let’s be honest, the amount of things I don’t know dwarfs the amount of things I do know, but this one in particular was surprising. At the time the entire event was incredibly confusing, and no-one at my work really knew what was going on, but later on, with some new knowledge, it all made perfect sense.
I’m Getting Too Old For This
Let’s go back a step though.
Sometimes AWS needs to murder one of your EC2 instances. As far as I know, this tends to happen when AWS detects failures in the underlying hardware, and they schedule the instance to be “retired”, notifying the account holder as appropriate. Of course, this assumes that AWS notices the failure before it becomes an issue, so if you’re really unlucky, sometimes you don’t get a warning and an EC2 instance just disappears into the ether.
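If you want to see these retirement notices without waiting for the email, the scheduled events are visible through the EC2 API. Here’s a minimal sketch using boto3 (assuming credentials and region are already configured in the environment):

```python
import boto3

# Sketch: list any EC2 instances that have scheduled events (e.g. retirement).
# Assumes boto3 credentials and region are configured in the environment.
ec2 = boto3.client("ec2")

statuses = ec2.describe_instance_status(IncludeAllInstances=True)
for status in statuses["InstanceStatuses"]:
    for event in status.get("Events", []):
        # Event codes include things like "instance-retirement" and "system-maintenance".
        print(
            status["InstanceId"],
            event["Code"],
            event.get("NotBefore"),
            event.get("Description"),
        )
```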
The takeaway here is that you should never rely on just one EC2 instance for any critical function. This is one of the reasons why auto scaling groups are so useful: you specify a template for the instances instead of just making one by itself. Of course, if your instances are accumulating important state, then you’ve still got a problem if one goes poof.
Interestingly enough, when you stop (not terminate, just stop) an EC2 instance that is owned by an auto scaling group, the auto scaling group tends to murder it and spin up another one, because a stopped instance fails its health check and looks like it has gone bad and needs to be replaced.
Anyway, I was pretty surprised when AWS scheduled two of the Elasticsearch data nodes in our ELK stack for retirement and:
- The nodes hung around in a stopped state, i.e. they didn’t get replaced, even though they were owned by an auto scaling group
- AWS didn’t trigger the CloudWatch alarm on the Elasticsearch load balancer that is supposed to detect unhealthy instances, even though the stopped instances were clearly marked unhealthy
After doing some soul searching, I can explain the first point somewhat.
The second point still confuses me though.
You’re Suspended! Hand In Your Launch Configuration And Get Out Of Here
It turns out that when AWS schedules an instance for retirement, it doesn’t necessarily mean the instance is actually going to disappear forever. Well, it won’t if the instance is EBS-backed at least. If you’re just using an instance store volume you’re pretty boned, but those are ephemeral anyway, so you should really know better.
Once the instance is “retired” (i.e. stopped), you can just start it up again. It will migrate to some new (healthy) hardware, and off it goes, doing whatever the hell it was doing in the first place.
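Starting the instance back up again is about as simple as it sounds. A minimal sketch with boto3 (the instance ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical instance ID, for illustration only.
instance_id = "i-0123456789abcdef0"

# Starting a stopped, EBS-backed instance lands it on new (healthy) hardware.
ec2.start_instances(InstanceIds=[instance_id])

# Optionally wait until it is actually running again.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```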
However, as I mentioned earlier, if you stop an EC2 instance owned by an auto scaling group, the auto scaling group will detect it as a failure, terminate the instance and spin up a brand new replacement.
Now, this sort of reaction can be pretty dangerous, especially when AWS is the one doing the shutdown (as opposed to the account holder), so AWS does the nice thing and suspends the Terminate and Launch processes of the auto scaling group, just to be safe.
Of course, the assumption here is that the account holder knows that the processes have been suspended and that some instances are being retired, and they go and restart the stopped instances, resume the auto scaling processes and continue on with their lives, singing merrily to themselves.
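For what it’s worth, once you know what to look for, checking for suspended processes and resuming them is straightforward. A rough sketch with boto3 (the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical auto scaling group name, for illustration only.
group_name = "elk-elasticsearch-data-nodes"

# See which processes (if any) have been suspended on the group, and why.
groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
for process in groups["AutoScalingGroups"][0]["SuspendedProcesses"]:
    print(process["ProcessName"], "-", process["SuspensionReason"])

# Once the retired instances have been started again, resume normal behaviour.
autoscaling.resume_processes(
    AutoScalingGroupName=group_name,
    ScalingProcesses=["Launch", "Terminate"],
)
```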
Until this happened to us, I did not even know that suspending auto scaling group processes was an option, let alone that AWS would do it for me. When we happened to notice through Octopus Deploy that two of our Elasticsearch data nodes had become unavailable, I was definitely not the “informed account holder” in the equation, and instead went on an adventure trying to figure out what the hell was going on.
I tried terminating the stopped nodes, in the hopes that they would be replaced, but because the processes were suspended, I got nothing. I tried raising the number of desired instances, but again, the processes were suspended, so nothing happened.
In the end, I created a secondary auto scaling group using the same Launch Configuration and got it to spin up a few instances, which then joined the cluster and helped to settle everything down.
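In case it helps anyone else flailing around at midnight, spinning up a secondary group from an existing Launch Configuration is easy enough to script. A rough sketch with boto3 (all names and the subnet ID are placeholders, not our actual setup):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# All names and IDs below are hypothetical placeholders.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="elk-elasticsearch-data-nodes-emergency",
    LaunchConfigurationName="elk-elasticsearch-data-node",  # reuse the existing launch configuration
    MinSize=2,
    MaxSize=2,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0123456789abcdef0",  # same subnets as the original group
)
```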
It wasn’t until the next morning, when cooler heads prevailed and I got a handle on what was actually happening, that we cleaned everything up properly.
Conclusion
This was one of those cases where AWS was definitely doing the right thing (helping people to avoid data loss because of an underlying failure out of their control), but a simple lack of knowledge on our part caused a bit of a kerfuffle.
The ironic thing is that if AWS had simply terminated the EC2 instances (which I’ve seen happen before) the cluster would have self-healed and rebalanced perfectly well (as long as only a few nodes were terminated of course).
Like I said earlier, I still don’t know why we didn’t get a CloudWatch alarm when the instances were stopped, as they were definitely marked as “unhealthy” in the Load Balancer. We didn’t even realise that something had gone wrong until someone noticed that the data nodes were reporting as unavailable in Octopus Deploy, and that happened purely by chance.
Granted, we still had four out of six data nodes in service, and we run our shards with one primary and two replicas, so we weren’t exactly in the danger zone, but we were definitely approaching it.
Maybe it’s time to try and configure an alarm on the cluster health.
That’s always nice and colourful.
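Something like the following would probably do the trick: poll the cluster health endpoint, push the status into CloudWatch as a custom metric, and alarm when it stops being green. This is just a sketch of the idea rather than something we’ve actually built; the namespace, metric name and Elasticsearch address are all made up:

```python
import boto3
import requests

cloudwatch = boto3.client("cloudwatch")

# Hypothetical Elasticsearch endpoint; substitute the real load balancer address.
health = requests.get("http://elasticsearch.internal:9200/_cluster/health").json()

# Map green/yellow/red onto a number so CloudWatch can alarm on it.
status_value = {"green": 0, "yellow": 1, "red": 2}[health["status"]]

cloudwatch.put_metric_data(
    Namespace="ELK/Elasticsearch",  # made-up namespace
    MetricData=[{"MetricName": "ClusterStatus", "Value": status_value}],
)

# One-off alarm definition: fire if the cluster is anything other than green
# for three consecutive minutes (assumes the metric is pushed every minute).
cloudwatch.put_metric_alarm(
    AlarmName="elasticsearch-cluster-not-green",
    Namespace="ELK/Elasticsearch",
    MetricName="ClusterStatus",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
)
```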