Sometimes Its Just Your Fault

August 23. 2016 0 Comments

I’ve mentioned a number of times about the issues we’ve had with RavenDB and performance.

My most recent post on the subject talked about optimizing some of our most common queries by storing data in the index, to avoid having to retrieve full documents only to extract a few fields. We implement this solution as a result of guidance from Hibernating Rhinos, who believed that it would decrease the amount of memory churn on the server, because it would no longer need to constantly switch the set of documents currently in memory in order to answer incoming queries. Made sense to me, and they did some internal tests that showed that indexes that stored the indexed fields dramatically reduced the memory footprint of their test.

Alas, it didn’t really help.

Well, that might not be fair. It didn’t seem to help, but that may have been because it only removed one bottleneck in favour of another.

I deployed the changes to our production environment a few days ago, but the server is still using a huge amount of memory, still regularly dumping large chunks of memory and still causing periods of high latency while recovering from the memory dump.

Unfortunately, we’re on our own as far as additional optimizations go, as Hibernating Rhinos have withdrawn from the investigation. Considering the amount of time and money they’ve put into this particular issue, I completely understand their decision to politely request that we engage their consulting services in order to move forward with additional improvements. Their entire support team was extremely helpful and engaged whenever we talked to them, and they did manage to fix a few bugs on the way through, even if it was eventually decided that our issues were not the result of a bug, but simply because of our document and load profile.

What Now?

Its not all doom and gloom though, as the work on removing orphan/abandoned data is going well.

I ran the first round of cleanup a few weeks ago, but the results were disappointing. The removal of the entities that I was targeting (accounts) and their related data only managed to remove around 8000 useless documents from the total of approximately 400 000.

The second round is looking much more promising, with current tests indicating that we might be able to remove something like 300 000 of those 400 000 documents, which is a pretty huge reduction. The reason why this service, whose entire purpose is to be a temporary data store, is accumulating documents, is currently unknown. I’ll have to get to the bottom of that once I’ve dealt with the immediate performance issues.

The testing of the second round of abandoned document removal is time consuming. I’ve just finished the development of the tool and the first round of testing that validated that the system behaved a lot better with a much smaller set of documents in it (using a partial clone of our production environment), even when hit with real traffic replicated from production, which was something of a relief.

Now I have to test that the cleanup scripts and endpoints work as expected when they have to remove data from the server that might have large amounts of binary data linked to it.

This requires a full clone of the production environment, which is a lot more time consuming than the partial clone, because it also has to copy the huge amount of binary data in S3. On the upside, we have scripts that we can use to do the clone (as a result of the migration work), and with some tweaks I should be able to limit the resulting downtime to less than an hour.

Once I’ve verified that the cleanup works on as real of an environment as I can replicate, I’ll deploy it to production and everything will be puppies and roses.

Obviously.

Conclusion

This whole RavenDB journey has been a massive drain on my mental capacity for quite a while now, and that doesn’t even take into account the drain on the business. From the initial failure to load test the service properly (which I now believe was a result of having an insufficient number of documents in the database combined with unrealistic test data), to the performance issues that occurred during the first few months of release (dealt with by scaling the underlying hardware) all the way through to the recent time consuming investigative efforts, the usage of RavenDB for this particular service was a massive mistake.

The disappointing part is that it all could have easily gone the other way.

RavenDB might very well have been an amazing choice of technology for a service focused around being a temporary data store, which may have lead to it being used in other software solutions in my company. The developer experience is fantastic, being amazingly easy to use once you’ve got the concepts firmly in hand and its very well supported and polished. Its complicated and very different from the standard way of thinking (especially regarding eventual consistency), so you really do need to know what you’re getting into, and that level of experience and understanding was just not present in our team at the time the choice was made.

Because of the all issues we’ve had, any additional usage of RavenDB would be met with a lot of resistance. Its pretty much poison now, at least as far as the organisation is concerned.

Software development is often about choosing the right tool for the job and the far reaching impact of those choices cannot be underestimated.