A Postgres Query Performance Pitfall


Backstory

The application I’m primarily working on at the moment is a fairly typical web app. The primary page that users land on is a large table of data related to things they need to work on. We started getting actual users towards the end of last year, and while the user set was initially small, the data set we were working with was quite large. This is because we were migrating users over from a legacy web app onto our new one, and we needed relevant data for all of them, so that as they onboarded they could immediately use the new app.

Generally this worked pretty well. However, we started noticing that occasionally, instead of getting a response in less than 5 seconds (typically less than one), there were wait times of up to a minute. This obviously led to a bad user experience and was something we wanted to solve, so my team started looking into what the root cause might be. Interestingly, it did not seem to be predictable based on the amount of data a user was working with - our first customer had a lot of data and was consistently getting fast responses, while others with much smaller data sets were sometimes waiting 45 seconds instead of <1 second.

The Investigation Begins

Here is what the data looked like at a high level; there were four layers of data specificity:

A note here: I am currently waiting on a pull request to be merged in the astro-diagram package in order for this mermaid chart to be rendered correctly. But if you’d like, you can paste the code into the Mermaid playground to see what it looks like.

---
title: Entity relationships
---
erDiagram
    GREAT-GRANDPARENT ||--o{ GRANDPARENT : has
    GRANDPARENT ||--o{ PARENT : has
    PARENT ||--o{ CHILD : has
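
To make the rest of this concrete, here is a rough sketch of what tables like these could look like. The names and types are placeholders for illustration, not our actual schema:

CREATE TABLE great_grandparent (
    id bigint PRIMARY KEY
);

CREATE TABLE grandparent (
    id bigint PRIMARY KEY,
    great_grandparent_id bigint NOT NULL REFERENCES great_grandparent (id)
);

CREATE TABLE parent (
    id bigint PRIMARY KEY,
    grandparent_id bigint NOT NULL REFERENCES grandparent (id)
);

-- Note the denormalization: the child table carries a foreign key to its
-- grandparent as well as to its parent. This detail matters later.
CREATE TABLE child (
    id bigint PRIMARY KEY,
    parent_id bigint NOT NULL REFERENCES parent (id),
    grandparent_id bigint NOT NULL REFERENCES grandparent (id)
);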

When investigating performance issues in queries, my mind always jumps to indexes. As such, I began by ensuring that all of the columns used in the joins and the WHERE clause had appropriate indexes on them - which they did. Complicating my investigation was the fact that the table we were querying was actually a partitioned table - we’ll call those partitions “new” and “legacy.” Despite that, all of the data we were querying only existed in the “legacy” tables - and we were specifying that in how we filtered the results. The Postgres EXPLAIN tool helped to confirm that as well - it’s very helpful that it shows the actual entities being queried!
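
If you haven’t used it before, EXPLAIN simply wraps the query you want to inspect. Here’s a rough example against the sketch schema above, not our real query (ANALYZE actually executes the query so you get real row counts and timings, and BUFFERS adds I/O statistics):

-- Show the plan the optimizer chose, along with actual row counts and timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM child c
JOIN parent p ON p.id = c.parent_id
JOIN grandparent gp ON gp.id = c.grandparent_id;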

Just to be doubly sure, I switched everything over to the underlying partition that I believed I was querying. The results were exactly the same: the query was generally pretty fast, but would occasionally hit a pretty big slowdown. At this point I didn’t have much to go on, but I did add an alarm in Datadog so that I would know when we were having issues.
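
Postgres lets you query a partition directly by name, so - assuming the legacy partition were named something like child_legacy, a hypothetical name here - the experiment looked roughly like:

-- Query the underlying partition directly, bypassing partition routing.
SELECT count(*)
FROM child_legacy c  -- hypothetical name for the "legacy" partition
JOIN parent p ON p.id = c.parent_id;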

And It Continues…

I would occasionally look at what was going on when I saw the alert trigger, but there didn’t seem to be much of a pattern to it. I asked for help from folks experienced in Postgres’ ways and had some great pairing sessions with colleagues on other teams (huge shoutout to them!). During one of these sessions, we could clearly see what I had seen before: on a seemingly random table in the hierarchy, the indexes would stop being used and the planner would fall back to a sequential scan, despite indexes being present on all the necessary columns - and despite the foreign key (FK) relationships being in place too.
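
In the EXPLAIN output, the tell was a node like the first one below, where we expected something like the second (the numbers are illustrative, not from our real plan):

-- What we kept seeing on the problem table:
->  Seq Scan on parent p  (cost=0.00..250000.00 rows=2000000 width=16)

-- What we expected, given the indexes:
->  Index Scan using parent_pkey on parent p  (cost=0.43..8.45 rows=1 width=16)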

As an experiment, we set enable_seqscan = off in our session, which is a Postgres configuration option. From my understanding, it essentially sets the cost of a sequential scan to an extremely high number, which discourages the Postgres query planner from using them. With this configuration change, the query performed well in all cases - which went to show that the correct indexes were in place, and quite honestly only confused me more. We discussed setting this option every time before running the problematic query and then turning it back on - but because the setting affects every query in the session, not just the problematic one, the potential blast radius was pretty large, and we decided against it.
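
For reference, the experiment looked roughly like this in psql. Using SET LOCAL inside a transaction at least scopes the change, since it reverts automatically when the transaction ends:

BEGIN;

-- Make sequential scans look prohibitively expensive to the planner, so it
-- only picks one when no index-based plan is possible at all.
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM child c
JOIN parent p ON p.id = c.parent_id
JOIN grandparent gp ON gp.id = c.grandparent_id;

COMMIT;  -- enable_seqscan reverts to its previous value here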

Finally, a Resolution

Eventually, as more and more customers came onto our system, the alarm was going off more often and more customers were having a bad experience. Out of desperation, I decided to rebuild the query table by table and try to pinpoint exactly what might be going wrong, testing heavily at each step. After doing this for a while, I happened to notice an oddity in the ON clauses of the joins. The data wasn’t totally normalized: the grandparent table was being joined using its id as stored (and foreign keyed) on the child table, skipping over the parent table, so it looked something like this:

FROM child c
JOIN parent p ON p.id = c.parent_id
JOIN grandparent gp ON gp.id = c.grandparent_id

This seemed innocuous, but adding AND p.grandparent_id = gp.id to that join fixed the performance in all test cases! It seems that not joining the hierarchy together in the normal way - even though the result set was exactly the same, and everything was still joined on a foreign key - was leading the query planner to estimate that a sequential scan would be more efficient. That would also explain why I was seeing varying results depending on how many child records were being selected: the planner’s row estimates, and therefore its plan choices, depended on that.
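
Concretely, the fixed version of the join was just:

FROM child c
JOIN parent p ON p.id = c.parent_id
JOIN grandparent gp ON gp.id = c.grandparent_id
                   AND p.grandparent_id = gp.id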

The Takeaways

So, what’s the takeaway here? I think my case was pretty specific, and probably not directly applicable to most people. But despite it being a relatively frustrating journey - all fixed, in the end, with a single line in an ON clause - there was some good that came of it:

Thanks for reading, I hope it was helpful!