Postgresql joins12/28/2022 Note that this type of plan is certainly not always bad: if the inner plan is small and fast it may win out over the new alternative by skipping overheads (see below).įinally, let's see the execution timeline with Parallel Hash enabled: This calls to mind Amdahl's law about the maximum speed-up possible if you parallelise one part of an algorithm but not another, but in practice there may be something worse than that going on here: every process is doing the same work in parallel, creating various forms of contention on the system that get worse as you add more workers. Here we see that the outer plan has been divided up over our 3 CPU cores. Each one is scanning an arbitrary fraction of the tuples in lineitem. Unfortunately each one is also running the complete inner plan and building its own copy of the hash table. So now we have approximately inner plan time + (outer plan time ÷ processes). Now let's visualise the execution timeline for a parallel query version with three processes in PostgreSQL 10 (or in 11dev with enable_parallel_hash set to off): It's not quite scaling perfectly and it's plateauing at 8 processes for reasons that need further exploration, but it is scaling significantly better than before. So where does this improved scalability come from? Let's start by looking at a stylised execution timeline for the join without parallelism:įor illustration purposes I'm ignoring other costs relating to hash table access, and showing a first order approximation of the execution time. The blue boxes represent time spent executing the inner plan and then the outer plan, which in this case happen to be sequential scans (they could just as easily be further joins). First the hash table is built by executing the inner plan and loading the resulting tuples into the hash table, and then the hash table is probed by executing the outer plan and using each tuple to probe the hash table. So, let's say that the total execution time is approximately inner plan time + outer plan time. Here's a graph showing the effect of that plan change as you add more workers. The Y axis shows the speed-up compared to a non-parallel query. Using the development master branch, it can now also produce a query plan like this: Hash Cond: (lineitem.l_orderkey = orders.o_orderkey) PostgreSQL 9.6 or 10 can produce a query plan like this: Using the "orders" and "lineitem" tables from TPC-H scale 30GB, here is a very simple join query answering the (somewhat contrived) question "how many lineitems have there ever been, considering only orders over $5.00?". So what does this feature really do? A simple example There will certainly be some adjustments before it's released, but it seems like a good time to write a blog article to present Parallel Hash. This is the biggest feature I've worked on in PostgreSQL so far, and I'm grateful to the reviewers, testers, committers and mentors of the PostgreSQL hacker community and EnterpriseDB for making this work possible. PostgreSQL 9.6 and 10 can use all three join strategies in parallel query plans, but they can only use a partial plan on the outer side of the join. As of commit 18042840, assuming nothing irreparably busted is discovered in the next few months, PostgreSQL 11 will ship with Parallel Hash. Partial plans will be possible on both sides of a join for the first time.
0 Comments
Leave a Reply.AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |