SQL JOIN Strategies: Nested Loop, Hash Join, Merge Join, and When the Planner Picks Wrong
Postgres has three join algorithms. The planner usually picks the right one. When it picks wrong, knowing the mechanics is the difference between a query that runs in 50ms and one that runs in 5 minutes.
Most production SQL performance problems are join problems. The query that runs fast in development against ten thousand rows turns into a multi-minute crawl in production against ten million, and the diagnostic question is almost always which join algorithm the planner chose and whether it chose the right one. There are three algorithms in the standard SQL planner toolkit, and understanding when each is correct turns the puzzle from opaque into mechanical.
We hit this often enough across DocuMint, CronPing, FlagBit, and WebhookVault that we keep notes. The notes below are what we wish we had read before the first time EXPLAIN ANALYZE told us something we did not know how to interpret.
Nested loop joins
The nested loop is the simplest algorithm: for each row in the outer table, look up matching rows in the inner table. The cost is O(N*M) in the naive case, but with an index on the inner table's join column, the lookup becomes O(log M) per outer row, dropping the total to O(N*log M). This is the algorithm that wins when one side is small or filterable to a small set, and the other side has a usable index on the join column.
Nested loop is the planner's default when the outer side is small (a few hundred rows or fewer after filtering) and the inner side has an index on the join column. It is also the algorithm that suffers most when the planner's estimate of the outer side is wrong: if the planner thinks the outer will produce ten rows and it actually produces ten thousand, the index lookups multiply by a factor of a thousand and the query that should take 5ms takes 5 seconds. The estimate-vs-actual gap on the outer side is the highest-leverage diagnostic signal in EXPLAIN ANALYZE.
Hash joins
The hash join builds an in-memory hash table from one side (the build side, usually the smaller side) and then scans the other side (the probe side) checking each row against the hash. The cost is O(N+M) where the constant factor includes hashing each build row once and probing each probe row once. This is the algorithm that wins when both sides are large enough that the per-row index lookup of the nested loop becomes more expensive than building and probing the hash.
The performance characteristic of hash joins is that they have a fixed setup cost (building the hash table) and then linear probing cost. The break-even with nested loop happens at small to medium outer-side sizes; once the outer side has more than a few thousand rows, hash joins are almost always faster than nested loops. The failure mode is when the build side does not fit in work_mem and spills to disk in batches, which multiplies the cost dramatically. The work_mem setting and the size of the build side are the two parameters that determine whether the hash join stays in memory.
Merge joins
The merge join requires both sides to be sorted on the join column. Once sorted, it walks both sides in parallel, matching rows as they go. The cost is O(N+M) for the merge itself, plus O(N log N + M log M) for the sort if the sides are not already sorted. This is the algorithm that wins when both sides are already sorted (often from being read in order by an index on the join column) or when the result needs to be sorted anyway and merging-while-sorting is cheaper than separate sort steps.
Merge joins are less common in practice than nested loop and hash joins because the prerequisite of sorted input is restrictive, and when sorting is required as part of the join, hash joins are usually faster for unsorted inputs. The case where merge joins shine is when both sides are reasonably large, both have indexes on the join column, and the query can take advantage of the natural order of those indexes without an explicit sort step.
When the planner picks wrong
The planner picks the join algorithm based on cost estimates derived from table statistics. The estimates are usually close enough that the right algorithm wins. The cases where the planner picks wrong fall into a few patterns worth recognizing.
The most common failure is when statistics are stale and the planner's row count estimate is off by orders of magnitude. A query that runs against a table where autovacuum has not analyzed recently can produce a plan based on a row count that is wildly wrong, leading to nested loops where hash joins would be correct or vice versa. The fix is to run ANALYZE on the affected tables and to monitor pg_stat_user_tables.last_autoanalyze to catch this class of problem early.
The second failure pattern is when correlated columns confuse the planner. The default statistics assume column independence, so a query like WHERE country = 'US' AND state = 'CA' gets an estimate that is the product of the two individual selectivities, which dramatically underestimates the result count for correlated columns. CREATE STATISTICS lets you tell the planner about column correlations, and the resulting estimates are much more accurate.
The third failure pattern is when the join column has highly skewed values. A LEFT JOIN with a default value of NULL produces a long tail of NULLs that the planner sees as one statistical group, leading to estimates that fit the typical case badly. The fix is usually either to use COALESCE to break up the NULL group or to add a partial index that excludes NULLs.
Reading EXPLAIN ANALYZE
The actual-vs-estimated row counts at each plan node are the primary diagnostic. The planner's estimate appears as the "rows=N" in EXPLAIN; the actual count appears as "actual rows=N" in EXPLAIN ANALYZE. The ratio is what matters: estimates within 10x of actual are usually fine, ratios above 100x are almost always a planning problem, and the highest node in the tree where the ratio gets large is where the planning went wrong.
The other useful signal is the time spent per node. A nested loop that does 100,000 inner-side index lookups will show up as a high cumulative time on the inner-side scan, not on the join itself. A hash join that spills to disk will show "Buckets: N Batches: M" with batches greater than 1, indicating that work_mem was insufficient. A merge join with an explicit sort step shows the sort as a separate child node.
Encouraging the right plan
The planner can be coaxed but not directly controlled. The mechanisms that influence join algorithm selection are work_mem (higher values allow larger hash joins to stay in memory), enable_nestloop / enable_hashjoin / enable_mergejoin (can be turned off to force the planner away from a bad choice, but using these in production is a code smell), and statistics improvements via CREATE STATISTICS and ANALYZE.
The pattern that works better than fighting the planner is to give it accurate statistics and let it choose. The cases where the planner reliably picks wrong are usually statistical problems, and fixing the statistics fixes the plan. Persistent bad plans on the same query usually indicate a deeper issue: the schema or query is forcing the planner into a corner where no algorithm is good, and the right fix is to rewrite the query or restructure the schema.
The deeper observation
Join algorithm selection is one of the database internals that production engineers eventually have to learn. The conceptual model is simple, the practical diagnosis requires reading EXPLAIN, and the fix is almost always to give the planner better information rather than to override its choice. Teams that internalize the three algorithms and the conditions under which each is correct can debug 80 percent of production join performance problems from EXPLAIN output alone.
The wider observation is that the SQL abstraction is genuinely powerful: a declarative query gets executed via a cost-based planner that has been refined over decades. The abstraction occasionally leaks, and when it does, the leak is mechanical and well-documented. The discipline that pays off is reading the plan when something is slow, and treating EXPLAIN ANALYZE as the primary diagnostic tool rather than guessing.