
I’ve been sick this week, so this post will be incredibly short (and late).

I’ve spoken on and off about some of the issues that we’ve had keeping the RavenDB database behind one of our services running in a maintainable fashion.

I don’t think RavenDB is a bad product though (in fact, I think it’s quite good at what it does). The main problem is that it is not a good fit for what we’re trying to do with it, and we don’t really have the expertise required to use it well.

Regardless, we had to scale the underlying infrastructure again recently (from an r3.4xlarge to an r3.8xlarge), and it got me thinking about how well we actually understand the data that is being stored inside the database. The last time we had to scale (before the most recent one), we had something like 120K documents in the system, spread across 700-800 unique clients.

Now? Almost triple that, at 340K, but we’re only up to like 1000 unique clients.

Something didn’t add up.

I mean, the entire concept behind the service is that it is a temporary staging area. It contains no long term storage. The number of documents present should be directly related to the activity around the feature that the service supports, and it was highly unlikely that the feature had become 3 times as popular in the intervening period.

No Young Waynes Here

The system uses a manifest-like concept to aggregate all of the data belonging to a customer’s data set in one easily correlated place (specifically using prefixes on the IDs of the documents). Each manifest (or account) contains some meta information, like the last time any data at all was touched for that account.

It was a relatively simple matter to identify all accounts that had not been touched in the last 30 days. For a system that relies on constant automatic synchronization, if an entire account has not been touched in the last 30 days, it’s a pretty good candidate for having been abandoned for some reason, the most likely of which is that the user has switched to using a different account (they are quite fluid).
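The query itself was nothing special. Here is a rough sketch of the sort of thing involved (not the actual audit code; the AccountManifest shape and its LastTouched property are hypothetical stand-ins, but the DocumentStore/OpenSession/Query calls are the standard RavenDB client API):

using System;
using System.Linq;
using Raven.Client;
using Raven.Client.Document;

class AbandonedAccountAudit
{
    // Hypothetical manifest shape; the real document carries much more information.
    private class AccountManifest
    {
        public string Id { get; set; }
        public DateTimeOffset LastTouched { get; set; }
    }

    static void Main()
    {
        using (IDocumentStore store = new DocumentStore { Url = "http://localhost:8080", DefaultDatabase = "Staging" }.Initialize())
        using (var session = store.OpenSession())
        {
            var cutoff = DateTimeOffset.UtcNow.AddDays(-30);

            // Manifests that have not been touched for 30+ days are candidates for abandonment.
            var abandoned = session.Query<AccountManifest>()
                .Where(m => m.LastTouched < cutoff)
                .ToList();

            Console.WriteLine("Found {0} untouched accounts", abandoned.Count);
        }
    }
}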

I found 410 accounts that were untouched.

There are only 1800 accounts in the system.

The second point of investigation was to look directly at the document type with the highest count.

This document describes something that is scheduled, and abandoned data can be easily found by looking for things that were scheduled more than 30 days ago. Because of the transient nature of the design, if something is still in the system even though it was scheduled for a month in the past, it’s a pretty safe bet that it’s no longer valid.

I said before that there were around 340K documents in the system?

220K were scheduled for so far in the past that they were basically irrelevant.

Conclusion

The findings above made me sad inside, because it means that there is definitely something wrong with the way the service (and the software that uses the service) is managing its data.

I suspect (but can’t back this up) that the amount of useless data present is not helping our performance problems either, so it’s like a double gut-punch.

I suppose the important thing to take away from this is to never become complacent about the contents of your persistence layer. Regular audits should be executed to make sure you understand exactly what is being stored and why.

Now that I know about the problem, all that’s left is to put together some sort of repeatable mechanism to clean up, and then find and fix the bugs that led to the data accumulating in the first place.

But if I hadn’t looked, we probably would have just accepted that this was the shape of the data.


We have the following simple entities, representing a fairly basic property management system.

public class Contact
{
    public int Id { get; set; }
}

public class ContactLink
{
    public int Id { get; set; }
    public int ContactId { get; set; }
    public string Type { get; set; }
    public int LinkedId { get; set; }
}

public class Owner
{
    public int Id { get; set; }
}

public class Tenant
{
    public int Id { get; set; }
    public int? PropertyId { get; set; }
}

public class Property
{
    public int Id { get; set; }
    public int? OwnerId { get; set; }
}

Each Property has zero or one Owners, but an Owner can be linked to multiple Contacts.

A Property can be Tenanted, and each Tenant can be linked to multiple Contacts.

The ContactLink class represents multiple arbitrary links to Contacts, and is how both the Owner and Tenant to Contact relationships are specified.

This model was not created with Entity Framework in mind, which unfortunately is something that you have to deal with when working on a legacy system. The EF model was put in place after many years of the database being the closest thing the software had to a domain model. The models that I’ve listed above are obviously not complete. They contain many more fields than those displayed, and have fields for segmentation because the database is multi-tenant. Anything not specified here is not useful for the purposes of illustrating the point of this post.

We need to answer a relatively simple question.

“Show me the Properties that are related to a Contact”

The goal is to answer that question while leaving as much of the work as possible up to the database. So ideally one call and minimal post processing.

Give Me All The Relationships

First things first, a test to ensure that our query works the way we expect it to. Make a database, seed it, run the query, validate the right number of results. Very straightforward.

public void WhenAContactIsRelatedToPropertiesThroughBothOwnerAndTenantRelationships_TheQueryReturnsTheCorrectNumberOfRows()
{
    using (var db = GetDbContext())
    {
        var contactId = RandomId();

        // Seed some data into the database. We use Builder objects and extension methods on DataSets to do
        // seeding, allowing us to fill a previously empty database with all the information we need for a test.
        // Specifically for this test, seed:
        // 1.) 1 Contact with the specified Id.
        // 2.) 1 Owner.
        // 3.) 2 Properties. One owned by the Owner that was just created.
        // 4.) 1 Tenant, for the property with no Owner.
        // 5.) 2 ContactLinks, linking the Contact to the Owner and Tenant.

        var target = new QueryEncapsulation();

        var results = target.Execute(db, contactId);

        results.Count().Should().Be(2);
    }
}

With the test written, all that’s left is to implement the query itself.

public class QueryEncapsulation
{
    public IEnumerable<LinkedProperty> Execute(DbContext db, int contactId)
    {
        return (from c in db.Contacts
            .Where(c => c.Id == contactId)
        from cl in db.ContactLinks
            .Where(cl => cl.ContactId == c.Id)
        from o in db.Owners
            .Where(o => o.Id == cl.LinkedId && cl.Type == "owner")
            .DefaultIfEmpty()
        from t in db.Tenants
            .Where(t => t.Id == cl.LinkedId && cl.Type == "tenant")
            .DefaultIfEmpty()
        from p in db.Properties
            .Where(p => p.OwnerId == o.Id || t.PropertyId == p.Id)
        select new LinkedProperty(o, t, p))
        .ToArray();
    }
}

Not super complicated, in fact it looks fairly similar to the query that you might put together in SQL. Join a bunch of tables together and then filter them as appropriate so only the valid links are included.

The usage of the DefaultIfEmpty method is the mechanism by which you accomplish Left Outer Joins in Linq. This is important because even though each Owner and Tenant maps to a single Property, as a result of the way the relationships are structured and the direction we are starting from (the Contact), each “row” returned by this query will only contain either an Owner or a Tenant, never both. Without the DefaultIfEmpty, the query wouldn’t return anything.
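To make that a little more concrete, here is a minimal LINQ-to-objects sketch (completely separate from the EF query above, using made-up data) showing how DefaultIfEmpty turns the second from clause into a left outer join; an owner with no matching link still produces a row, with the link coming back as null:

using System;
using System.Linq;

class DefaultIfEmptyDemo
{
    static void Main()
    {
        var owners = new[] { new { Id = 1 }, new { Id = 2 } };
        var links = new[] { new { OwnerId = 1, ContactId = 10 } };

        var rows =
            from o in owners
            from l in links.Where(l => l.OwnerId == o.Id).DefaultIfEmpty()
            select new { Owner = o.Id, Contact = l == null ? (int?)null : l.ContactId };

        foreach (var row in rows)
        {
            // Prints "Owner 1 -> 10" and then "Owner 2 -> (none)".
            Console.WriteLine("Owner {0} -> {1}", row.Owner, row.Contact.HasValue ? row.Contact.ToString() : "(none)");
        }
    }
}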

Of course, the test fails.

Failure Is Just A Stepping Stone To Greatness

Instead of getting 2 results (one Property linked via the Owner and one Property linked via the Tenant) the query returns 3 results.

The Property linked to the Tenant is coming back twice. Once for the correct link (to the Tenant) and then again for the Owner, which is nonsense.

As far as I understand, this combination is still included because a completely unspecified Owner can be linked to a null Owner Id on the Property.

Intuitively, p.OwnerId == o.Id feels like it should throw a NullReferenceException when o is null. I’m still not exactly sure I understand why this is a valid result, but I know if I make a similar query in SQL using Left Outer Joins, that it will work the same way. Essentially I’m not being specific enough in my exclusion criteria and null is indeed equal to null.

The fix?

Add a check on the Property condition that checks for Owner being null.

The final query we’re left with is:

public class QueryEncapsulation
{
    public IEnumerable<LinkedProperty> Execute(DbContext db, int contactId)
    {
        return (from c in db.Contacts
            .Where(c => c.Id == contactId)
        from cl in db.ContactLinks
            .Where(cl => cl.ContactId == c.Id)
        from o in db.Owners
            .Where(o => o.Id == cl.LinkedId && cl.Type == "owner")
            .DefaultIfEmpty()
        from t in db.Tenants
            .Where(t => t.Id == cl.LinkedId && cl.Type == "tenant")
            .DefaultIfEmpty()
        from p in db.Properties
            .Where(p => (o != null && p.OwnerId == o.Id) || t.PropertyId == p.Id)
        select new LinkedProperty(o, t, p))
        .ToArray();
    }
}

Summary

Honestly, I didn’t expect this sort of behaviour from EF, but in hindsight that’s not surprising. EF is an incredibly complicated beast and I’m barely familiar with the shallow end.

To try and understand the results, I had to build up the set by hand and account for all of the potential combinations, along with whether or not it made sense for them to be included in the final results. Doing this helped to identify the fact that fields on what should be a null object (created via DefaultIfEmpty) seem to be treated as nullable, even though the underlying type on the model is not.

An interesting finding at least, and a good example of how sometimes it’s important to understand your queries conceptually.

Especially if they are giving weird results.


You may have heard of the Netflix simian army, which is an interesting concept that ensures that your infrastructure is always prepared to lose chunks of itself at any moment. The simian army is responsible for randomly killing various components in their live environments, ranging from a single machine/component (chaos monkey), to an entire availability zone (chaos gorilla) all the way through to an entire region (chaos kong).

The simian army is all about making sure that you are prepared to weather any sort of storm and is a very important part of the reliability engineering that goes on at Netflix.

On Sunday, 5 June, we found a chaos gorilla in the wild.

Monkey See, Monkey Do

AWS is a weird service when it comes to reliability. Their services are highly available (i.e. you can almost always access the EC2 service itself to spin up virtual machines), but the constructs created by those services seem to be far less permanent. For EC2 in particular, I don’t think there is any guarantee that a particular instance will remain alive, although they usually do. I know I’ve seen instances die of their own accord, or simply become unresponsive for long periods of time (for no apparent reason).

The good thing is that AWS is very transparent in this regard. They mostly just provide you with the tools, and then it’s up to you to create whatever high-availability/high-reliability setup you need.

When it comes to ensuring availability of the service, there are many AWS regions across the world (a few in North America, some in Europe, some in Asia Pacific), each with at least two availability zones in them (sometimes more), which are generally located geographically separately. It’s completely up to you where and how you create your resources, and whether or not you are willing to accept the risk that a single availability zone or even a region might disappear for whatever reason.

For our purposes, we tend towards redundancy (multiple copies of a resource), with those copies spread across multiple availability zones, but only within a single region (Asia Pacific Sydney). Everything is typically hidden behind a load balancer (an AWS supplied component that ensures traffic is distributed evenly to all registered resources) and it’s rare that we only have a single one of anything.

Yesterday (Sunday, 05 June 2016), AWS EC2 (the Elastic Compute Cloud, the heart of the virtualization services offered by AWS) experienced some major issues in ap-southeast-2.

This was a real-life chaos gorilla at work.

Extinction Is A Real Problem

The first I knew of this outage was when one of our external monitoring services (Pingdom), reported that one of our production services had dropped offline.

Unfortunately, the persistence framework used by this service has a history of being flakey, so this sort of behaviour is not entirely unusual, although the service had been behaving itself for the last couple of weeks. When I get a message like that, I usually access our log aggregation stack (ELK), and then use the information therein to identify what’s going on (it contains information on traffic, resource utilization, instance statistics and so on).

This time though, I couldn’t access the log stack at all.

This was unusual, because we engineered that stack to be highly available. It features multiple instances across all availability zones at all levels of the architecture, from the Logstash broker dealing with incoming log events, to our RabbitMQ event queue that the brokers write to, through to the indexers that pull off the queue all the way down to the elasticsearch instances that back the whole thing.

The next step was to log into the AWS console/dashboard directly and see what it said. This is the second place where we can view information about our resources. While not as easily searchable/visualizable as the data inside the ELK stack, the CloudWatch statistics inside AWS can still give us some stats on instances and traffic.

Once I logged in, our entire EC2 dashboard was unavailable. The only thing that would load inside the dashboard in the EC2 section was the load balancers, and they were all reporting various concerning things. The service that I had received the notification for was showing 0/6 instances available, but others were showing some availability, so not everything was dead.

I was blind though.

All By Myself

Luckily, the service that was completely down tends to not get used very much on the weekend, and this outage occurred on Sunday night AEST, so chances of it causing a poor user experience were low.

The question burning in my mind though? Was it just us? Or was the entire internet on fire? Maybe it’s just a blip and things will come back online in the next few minutes.

A visit to the AWS status page, a few quick Google searches and a visit to the AWS Reddit showed that there was very little traffic relating to this problem, so I was worried that it was somehow just us. My thoughts started to turn towards some sort of account compromise, or that maybe I’d been locked out of the system somehow, and this was just the way EC2 presented the fact that I was not allowed to view the information.

Eventually information started to trickle through about this being an issue specifically with the ap-southeast-2 region, and AWS themselves even updated their status page and put up a warning on the support request page saying it was a known issue.

Now I just had to play the waiting game, because there was literally nothing I could do.

Guess Who’s Back

Some amount of time later (I forget exactly how much), the EC2 dashboard came back. The entirety of ap-southeast-2a appeared to have had some sort of massive outage/failure, meaning that most of the machines we were hosting in that availability zone were unavailable.

Even in the face of a massive interruption to service from ap-southeast-2a, we kind of lucked out. While most of the instances in that availability zone appeared to have died a horrible death, one of our most important instances, which happened to not feature any redundancy across availability zones, was just fine. It’s a pretty massive instance now (r3.4xlarge), which is a result of performance problems we were having with RavenDB, so I wonder if it was given a higher priority or something? All of our small instances (t2/m3 mostly) were definitely dead.

Now that the zone seemed to be back online, restoring everything should have been as simple as killing the broken instances and then waiting for the Auto Scaling Groups to recreate them.

The problem was, this wasn’t working at all. The instances would start up just fine, but they would never be added into the Load Balancers. The reason? Their status checks never succeeded.

This happened consistently across multiple services, even across multiple AWS accounts.

Further investigation showed that our Octopus Deploy server had unfortunately fallen victim to the same problem as the others. We attempted to restart it, but it got stuck shutting down.

No Octopus meant no new instances.

No new instances meant we only had the ones that were already active, and they were showing some strain in keeping up with the traffic, particularly when it came to CPU credits.

Hack Away

Emergency solution time.

We went through a number of different ideas, but we settled on the following:

  1. Take a functioning instance
  2. Snapshot its primary drive
  3. Create an AMI from that snapshot
  4. Manually create some instances to fill in the gaps
  5. Manually add those instances to the load balancer

Not the greatest of solutions, but we really only needed to buy some time while we fixed the problem of not having Octopus. Once that was back up and running, we could clean everything up and rely on our standard self healing strategies to balance everything out.

A few hours later and all of the emergency instances were in place (3 services all told + some supporting services like proxies). Everything was functioning within normal boundaries, and we could get some sleep while the zombie Octopus instance hopefully sorted itself out (because it was around 0100 on Monday at this point).

The following morning the Octopus Server had successfully shut down (finally) and all we had to do was restart it.

We cleaned up the emergency instances, scaled back to where we were supposed to be and cleaned up anything that wasn’t operating like we expected it to (like the instances that had been created through auto scaling while Octopus was unavailable).

Crisis managed.

Conclusion

In the end, the application of a real life chaos gorilla showed us a number of different areas where we lacked redundancy (and highlighted some areas where we handled failure well).

The low incidence of AWS issues like this combined with the availability expected by our users, probably means that we don’t need to make any major changes to most of our infrastructure. We definitely should add redundancy to the single point of failure database behind the service that went down during this outage though (which is easier said than done).

Our reliance on a single, non-redundant Octopus Server, however, especially when it’s required to react to some sort of event (whether it be a spike in usage or some sort of critical failure), is a different problem altogether. We’re definitely going to have to do something about that.

Everything else went about as well as I expected though, with the majority of our services continuing to function even when we lost an entire availability zone.

Just as planned.


A long time ago, I wrote a post about fixing up our log management scripts to actually delete logs properly. Logs were being aggregated into our ELK stack and were then being deleted after reaching a certain age. It fixed everything perfectly and it was all fine, forever. The End

Or not.

The log management script itself was very simple (even though it was wrapped in a bunch of Powershell to schedule it via a Windows Scheduled Task). It looked at a known directory (C:\logs), found all the files matching *.log, narrowed it down to only those files that had not been written to in the last 7 days and then deleted them. Not exactly rocket science.
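For illustration, the core of that logic sketched in C# (the real thing is a Powershell script installed as a Windows Scheduled Task; the path, filter and retention period below simply mirror the description above):

using System;
using System.IO;

class LogCleaner
{
    static void Main()
    {
        var directory = @"C:\logs";
        var cutoff = DateTime.UtcNow - TimeSpan.FromDays(7);

        // Delete any *.log file that has not been written to within the retention period.
        foreach (var file in Directory.EnumerateFiles(directory, "*.log"))
        {
            if (File.GetLastWriteTimeUtc(file) < cutoff)
            {
                File.Delete(file);
            }
        }
    }
}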

As we got more and more usage on the services in question though, the log file generation started to outstrip the ability of the script to clean up.

They Can Fill You Up Inside

The problem was twofold: the log management script was hardcoded to only delete things that hadn’t been touched in the last 7 days, and the installation of the scheduled task that ran the script was done during machine initialisation (as part of the cfn-init configuration). Changing the script would require refreshing the environment, which requires a migration (which is time consuming and still not perfect). Not only that, but changing the script for one service (i.e. to delete all logs older than a day) might not be the best thing for other services.

The solution was to not be stupid and to deploy log management in the same way we deploy everything else: with Octopus Deploy.

With Octopus, we could build a package containing all of the logic for how to deploy a log management solution, and use that package in any number of projects with customised parameters, like retention period, directory, filter, whatever.

More importantly, it would give us the ability to update existing log management deployments without having to alter their infrastructure configuration, which is important for minimising downtime and just generally staying sane.

They Are Stronger Than Your Drive

It was easy enough to create a Nuget package containing the logic for installing the log management scheduled task. All I had to do was create a new repository, add a reference to our common scripts and then create a deploy.ps1 file describing how the existing scheduled task setup script should be called during deployment.

In fact, the only difference from calling the script directly like we were previously, was that I grabbed variable values from the available Octopus parameters, and then slotted them into the script.

With a nice self contained Nuget package that would install a scheduled task for log management, the only thing left to do was create a new Octopus project to deploy the package.

I mostly use a 1-1 Nuget Package to Project strategy when using Octopus. I’ve found that it’s much easier to get a handle on the versions of components being deployed when each project only has 1 thing in it. In this case, the Nuget package was generic, and was then referenced from at least two projects, one to manage the logs on some API instances and the other to manage logs on a log shipper component (which gets and processes ELB logs).

In order to support a fully automated build, test and deployment cycle, I also created a TeamCity Build Configuration. Like all of our Build Configurations, it watches a repository for changes, runs tests, packages and then deploys to the appropriate environments. This one was slightly different though, in that it deployed multiple Octopus projects, rather than a single one like almost all of our other Build Configurations. I’m not sure how I feel about this yet (because it’s different from what we usually do), but it’s definitely easier to manage, especially since all the projects just use the same package anyway.

Conclusion

This post might not look like much, and to be honest it isn’t. Last weeks post on Expression Trees was a bit exhausting so I went for something much smaller this week.

That is not to say that the concepts covered herein are not important.

One thing that I’ve taken away from this is that if you ever think that installing something just once on machine initialization is enough, you’re probably wrong. Baking the installation of log management into our cfn-init steps caused us a whole bunch of problems once we realised we needed to update them because they didn’t quite do the job. It was much harder to change them on the fly without paying a ridiculous penalty in terms of downtime. It also felt really strange to have to migrate an entire environment just because the log files weren’t being cleaned up correctly.

When it comes down to it, the real lesson is to always pick the path that lets you react to change and deploy with the least amount of effort.


As part of the work we did recently to add search to one of our newer APIs, we leveraged the support that Npgsql has for constructing PostgreSQL Full Text Search queries via Linq. This was a vast improvement over our previous attempts to use the PostgreSQL Full Text Search, because we could combine it with sorting and filtering, and have everything apply right at the database level for maximum efficiency.

Internally, we use a Retriever class for each entity that we offer access to via the API. This class in turn leverages a number of different dependencies to provide for filtering, sorting and searching by modifying an IQueryable. For search, there is an IFullTextSearcher<TEntity> interface, which contains a single method Search, taking an IQueryable and the query and returning a modified IQueryable.
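Reconstructed from that description, the interface looks something like the following (this is an assumption on my part; in the implementation below the IQueryable is actually over a list projection model rather than the entity itself, so the real signature presumably differs slightly):

public interface IFullTextSearcher<TEntity>
{
    // Takes the incoming IQueryable and the search query, and hands back a modified
    // IQueryable with the Full Text Search conditions applied.
    IQueryable<TEntity> Search(IQueryable<TEntity> queryable, SearchQuery searchQuery);
}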

A searcher implementation for an entity called Entity looks like this:

public class EntityFullTextSearcher : IFullTextSearcher<Entity>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;

    public EntityFullTextSearcher(IPostgresQueryBuilder postgresQueryBuilder)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
    }

    public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        queryable = queryable.Where(b => NpgsqlTextFunctions.Match
        (
            NpgsqlTextFunctions.QueryOr(
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A),
                NpgsqlTextFunctions.QueryOr(
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A),
                    NpgsqlTextFunctions.QueryOr(
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A),
                        NpgsqlTextFunctions.QueryOr(
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B),
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C),
                                NpgsqlTextFunctions.QueryOr(
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C),
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C)
                                    )
                                )
                            )
                        )
                    )
                )
            ,
            NpgsqlTextFunctions.ToTsQuery(query))
        );

        return queryable;
    }
}

Pretty straightforward, but there is a massive inline lambda function that is somewhat hard to read.

Its A Frowny Face

The essence of that massive chunk of lambda is the definition of the fields that will be indexed during the search, along with the weight at which they will be indexed and how they are combined, which in our case, is just using OR. This Full Text Search definition is then combined with the Match operator. This is the way to get Full Text Search working via Linq, as I wrote about in my previous post on the subject.

If you look at the definition of these functions inside the Npgsql source code, you’ll see that their implementations all throw an InvalidOperationException. How do they even work?

The answer lies in the magic of expression trees and Entity Framework. The Where function on an IQueryable actually takes an Expression<Func<TEntity, bool>>, which this lambda function is supplying.

At no point is this code intended to be executed directly. You could try (i.e. by running this on top of a List or In Memory Database via Effort), but it would fail miserably, throwing an InvalidOperationException as expected.

However, when the IQueryable is passed through the Npgsql EF Provider, it is transformed into actual SQL which is then executed and the results interpreted accordingly. To be honest, I’ve never had to really dig into the whole EF provider concepts too deeply, and I don’t really want to. I’m happy enough that it works, and allows us to combine Full Text Search with arbitrary selection, filtering and sorting, leaving all of the hard work up to the database.
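A quick aside to make that distinction concrete: the same lambda text produces two very different things depending on whether the compiler is asked for a delegate or for an expression tree. A minimal sketch (plain C#, nothing to do with Npgsql):

using System;
using System.Linq.Expressions;

class FuncVersusExpression
{
    static void Main()
    {
        // Compiled to IL: a delegate that can only be invoked.
        Func<int, bool> asDelegate = x => x > 5;

        // Compiled to a data structure describing the lambda. A provider (like the
        // Npgsql EF provider) can walk this tree and translate it into SQL instead
        // of ever executing the C# bodies of the functions it references.
        Expression<Func<int, bool>> asTree = x => x > 5;

        Console.WriteLine(asDelegate(10));       // True
        Console.WriteLine(asTree.Body.NodeType); // GreaterThan
        Console.WriteLine(asTree.Compile()(10)); // True, but only because we explicitly compiled it
    }
}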

Back to the massive chunk of lambda though; at least it’s encapsulated inside this tiny class. Right?

Well, it was up until we needed to add sorting by relevance.

Maybe More Of A Grimace?

Have a look at this class and see if anything looks familiar.

public class EntitySearchRelevanceSorter : ISearchRelevanceSorter<EntityListProjectionModel>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;

    public EntitySearchRelevanceSorter(IPostgresQueryBuilder postgresQueryBuilder)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
    }

    public IQueryable<EntityListProjectionModel> Sort(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        if(searchQuery == null || !searchQuery.Terms.Any()) return queryable;
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        return queryable.OrderByDescending(b => NpgsqlTextFunctions.TsRank(
               NpgsqlTextFunctions.QueryOr(
                   NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A),
                   NpgsqlTextFunctions.QueryOr(
                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A),
                       NpgsqlTextFunctions.QueryOr(
                           NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A),
                           NpgsqlTextFunctions.QueryOr(
                               NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B),
                               NpgsqlTextFunctions.QueryOr(
                                   NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C),
                                   NpgsqlTextFunctions.QueryOr(
                                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C),
                                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C)
                                       ))))))
               ,
               NpgsqlTextFunctions.ToTsQuery(query))
               );
    }
}

The lambda expression being used as the input for the TsRank function is exactly the same as the function used inside the searcher class.

It should be the same as well, because the relevance sorting should return the results in an order appropriate for how they were found via the search.

Now we have a problem though, because if the search definition changes, the relevance sorting will no longer make sense. The same is somewhat true for the reverse option (changing the sort order).

My first attempt at fixing this so that there was only a single place where the Full Text Search definition existed, was to simply extract the lambda function to a class of its own, and take a dependency on that class in both the searcher and sorter.

The definition class then looked like this:

public class EntitySearchDefinition
{
    public Expression<Func<Entity, string>> Definition
    {
        get
        {
            return b => NpgsqlTextFunctions.QueryOr( 
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A), 
                NpgsqlTextFunctions.QueryOr( 
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A), 
                    NpgsqlTextFunctions.QueryOr( 
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A), 
                        NpgsqlTextFunctions.QueryOr( 
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B), 
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C), 
                                NpgsqlTextFunctions.QueryOr( 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C), 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C) 
                                )
                            )
                        )
                    )
                )
            );
        }
    }
}

And the usage within the searcher looked like this:

public class EntityFullTextSearcher : IFullTextSearcher<Entity>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;
    private readonly EntitySearchDefinition _def;

    public EntityFullTextSearcher(IPostgresQueryBuilder postgresQueryBuilder, EntitySearchDefinition def)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
        _def = def;
    }

    public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        queryable = queryable.Where(b => NpgsqlTextFunctions.Match
        (
            _def.Definition,
            NpgsqlTextFunctions.ToTsQuery(query))
        );

        return queryable;
    }
}

This won’t compile at all though.

Nope, Definitely Rage

The problem here is that the definition of NpgsqlTextFunctions.Match is bool Match(string, string), so from the compiler’s point of view it doesn’t take an expression like the one I extracted out to the other class. We can’t just call it here either, because that would defeat the entire point of constructing the Where so that it can be turned into SQL and executed against the database. We have to put together an expression tree containing the logic we want to be pushed to the database.

How can we do that and maintain some common definition of the Full Text Search that will be used by both search and sorting components?

Long story short, my definition class ended up looking like this:

public class EntitySearchDefinition
{
    private Expression<Func<Entity, string>> Definition
    {
        get
        {
            return b => NpgsqlTextFunctions.QueryOr( 
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A), 
                NpgsqlTextFunctions.QueryOr( 
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A), 
                    NpgsqlTextFunctions.QueryOr( 
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A), 
                        NpgsqlTextFunctions.QueryOr( 
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B), 
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C), 
                                NpgsqlTextFunctions.QueryOr( 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C), 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C) 
                                )
                            )
                        )
                    )
                )
            );
        }
    }

    private static MethodInfo TsRank()
    {
        return MethodInfoFromExpression<Func<string, string, float>>((a, b) => NpgsqlTextFunctions.TsRank(a, b));
    }

    private static MethodInfo ToTsQuery()
    {
        return MethodInfoFromExpression<Func<string, string>>(a => NpgsqlTextFunctions.ToTsQuery(a));
    }

    private MethodInfo Match()
    {
        return MethodInfoFromExpression<Func<string, string, bool>>((a, b) => NpgsqlTextFunctions.Match(a, b));
    }
    
    public Expression<Func<Entity, float>> CreateOrderBy(string query)
    {
        return MakeMethodCallExpression<float>(query, TsRank());
    }

    public Expression<Func<Entity, bool>> CreateWhere(string query)
    {
        return MakeMethodCallExpression<bool>(query, Match());
    }

    private Expression<Func<Entity, TOut>> MakeMethodCallExpression<TOut>(string query, MethodInfo methodInfo)
    {
        var p = Expression.Parameter(typeof(Entity));
        var queryExpression = CreateConstantExpressionForQuery(query);
        var definitionBody = Definition.Body;
        var replacer = new ExpressionReplacer(a => a.Type == typeof(Entity), a => p);
        var visited = replacer.Visit(definitionBody);
        var expression = Expression.Lambda<Func<Entity, TOut>>(Expression.Call(methodInfo, visited, queryExpression), new List<ParameterExpression> { p });
        return expression;
    }

    private static MethodCallExpression CreateConstantExpressionForQuery(string query)
    {
        var queryConstant = Expression.Constant(query);
        var queryExpression = Expression.Call(ToTsQuery(), queryConstant);
        return queryExpression;
    }
    
    private class ExpressionReplacer : ExpressionVisitor
    {
        private readonly Func<Expression, bool> _selector;
        private readonly Func<Expression, Expression> _replace;

        public ExpressionReplacer(Func<Expression, bool> selector, Func<Expression, Expression> replace)
        {
            _selector = selector;
            _replace = replace;
        }

        public override Expression Visit(Expression exp)
        {
            if (exp == null) return exp;
            if (_selector(exp)) exp = _replace(exp);
            return base.Visit(exp);
        }
    }
}

Yup, that’s a lot of code, but a lot of it is just utility methods for interacting with expression trees (specifically the parts that get method names without resorting to Reflection or magic strings and the expression visitor that mutates the expression tree for a purpose I’ll get to in a minute).

Using this definition class the body of the search method in EntityFullTextSearcher now looks like this:

public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
{
    var query = _postgresQueryBuilder.Build(searchQuery);
    if (string.IsNullOrEmpty(query)) return queryable;

    queryable = queryable.Where(_def.CreateWhere(query));

    return queryable;
}

With the body of the sort method looking very very similar (calling CreateOrderBy instead).
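For completeness, a sketch of what that sort method ends up looking like (assuming the sorter takes the same definition class as a dependency, stored in a _def field just like the searcher):

public IQueryable<EntityListProjectionModel> Sort(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
{
    if (searchQuery == null || !searchQuery.Terms.Any()) return queryable;
    var query = _postgresQueryBuilder.Build(searchQuery);
    if (string.IsNullOrEmpty(query)) return queryable;

    // CreateOrderBy returns an expression wrapping TsRank around the shared Full Text Search definition.
    return queryable.OrderByDescending(_def.CreateOrderBy(query));
}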

But what does it all mean?

Expressing Myself

There are two important things to notice about the search definition class above.

The first is that it dynamically constructs the expression tree that would have previously been supplied by the inline lambda expression. You can see this construction inside the MakeMethodCallExpression method. It creates a ParameterExpression (for the incoming Entity from the IQueryable), and then constructs a Call node that has the definition as one parameter, and the query as the other, just like how it was originally specified inline. This method is then used in the construction of the expression trees for Where and OrderBy.

The second important thing is the usage of the ExpressionReplacer class. The lambda expression that contains the definition for the Full Text Search (the one returned by the Definition property) contains a parameter of type Entity. The lambda expression constructed by the CreateWhere and CreateOrderBy methods also contains a parameter of type Entity.

They are not the same thing.

In order for the inner expression parameter to be correctly bound to the same value as the outer expression parameter, it was necessary to walk the expression tree, find the inner parameter expression and replace it with the outer one. Assuming that it is unlikely there will be multiple parameter expressions throughout the tree of the same type that shouldn’t be bound to the same value, this approach will work just fine to consolidate all of the parameter expressions to the same eventual value.

The end result of the replacement is a new expression tree where all parameter expressions of type Entity are the same reference (there are only two, one inner and one outer).
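The reason the replacement matters is that Expression.Lambda binds body nodes to the exact ParameterExpression instance it is given, not to any parameter that merely shares the same name and type. Here is a standalone sketch (plain ints, nothing to do with the search code) of what goes wrong without the rebinding, and how a visitor fixes it:

using System;
using System.Linq.Expressions;

class ParameterIdentityDemo
{
    static void Main()
    {
        Expression<Func<int, int>> inner = x => x + 1;

        // Build a new lambda around the inner body, but with a *different* ParameterExpression
        // that merely shares the same name and type.
        var otherParam = Expression.Parameter(typeof(int), "x");
        var broken = Expression.Lambda<Func<int, int>>(inner.Body, otherParam);

        try
        {
            broken.Compile(); // Throws: the body still references the inner lambda's parameter.
        }
        catch (InvalidOperationException e)
        {
            Console.WriteLine(e.Message);
        }

        // Rebinding the body to the new parameter (the job the ExpressionReplacer does for the
        // Entity parameter in the definition class) produces a tree that compiles and runs.
        var rebound = new Rebind(inner.Parameters[0], otherParam).Visit(inner.Body);
        var fixedLambda = Expression.Lambda<Func<int, int>>(rebound, otherParam);
        Console.WriteLine(fixedLambda.Compile()(41)); // 42
    }

    private class Rebind : ExpressionVisitor
    {
        private readonly ParameterExpression _from;
        private readonly ParameterExpression _to;

        public Rebind(ParameterExpression from, ParameterExpression to)
        {
            _from = from;
            _to = to;
        }

        protected override Expression VisitParameter(ParameterExpression node)
        {
            return node == _from ? _to : base.VisitParameter(node);
        }
    }
}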

Conclusion

Looking back at the effort that went into this, I’m not entirely sure it was worth it. I did accomplish my original goal (only one place where the Full Text Search definition was defined), but I’m not sure if the code is more understandable.

It felt good to separate the concerns, which makes it clear that the searcher and sorter classes are dependent on the same thing, and all of the complicated and hard to understand expression tree logic is encapsulated inside a single class, but I worry that it will be almost impenetrable to anyone who comes after me who doesn’t understand expression trees. Keep in mind, at this point, even after this entire adventure, I barely understand them. Me 3 months from now will probably look at the code, get confused, google part of it, find this blog post and then relive the entire affair.

On the upside, this did generate a significant increase in understanding of the way that the magic of Entity Framework works, and it sparked quite a few discussions within the team that led to some interesting insights for all involved. It even spawned some other changes, where we can now use our own custom methods inside the expressions being passed into Entity Framework and then have them transformed appropriately into things that EF knows how to turn into SQL.

That’s a topic for another time though.