The N+1 Query Problem: A Field Guide to a Bug That Hides in Plain Sight

The N+1 query problem is the most common database performance bug in production code. It is also one of the easiest to miss in code review, because the code that causes it usually looks innocuous. A loop iterates over a list of objects. Inside the loop, a property is accessed. The property turns out to lazy-load from the database. The page that took 50ms in development with 10 records takes 12 seconds in production with 1000.

This piece is a field guide to recognizing N+1 in its many disguises and a discussion of the patterns that prevent it from appearing in the first place.

The canonical case

The textbook example involves an ORM. You fetch a list of orders, then iterate over them and access the customer for each:

orders = Order.objects.all()  # 1 query
for order in orders:
    print(order.customer.name)  # 1 query per order

If there are 100 orders, this is 101 queries — one to fetch the orders, one per order to fetch each customer. Hence "N+1." The fix is to eagerly load customers in the same query as orders, typically with a JOIN:

orders = Order.objects.select_related('customer').all()  # 1 query
for order in orders:
    print(order.customer.name)  # 0 queries

This is the case every database tutorial mentions. It is also the easiest case to spot: the code is in one file, the loop is visible, and any decent ORM has a tool to detect it (Django's django-debug-toolbar, Rails's bullet, Sequelize's logging, and so on).

The disguises

The harder cases are the ones where the loop is not obvious. The pattern still causes one query per item, but the loop is hidden behind layers of abstraction.

Disguise 1: serializers and templates. A serializer for a list endpoint runs once per item, and inside the serializer is code that accesses related fields. Each serializer call triggers a query. The loop is the framework's loop over items, and the queries are inside the per-item rendering code. This is invisible in the view code because the view just calls serializer.data.

Disguise 2: methods that hit the database. A model method called order.total() looks like a property access but actually runs a SUM query against line items. Anywhere that method is called in a loop, you have N+1. The method body might be in a different file, written by a different developer, who did not anticipate it being called in a hot path.

Disguise 3: service-layer indirection. A service function called get_user_summary(user) hits the database for each call. A list view that calls it once per user has N+1, but the service function is the thing being called, not a database query. From the caller's perspective, it is just function calls.

Disguise 4: caches that do not cover the path. A function caches its result in Redis, but the cache key includes a parameter that varies per item. The first time the loop runs, every iteration is a cache miss and a database query. The second time, every iteration is a cache hit and zero queries. The bug is invisible in staging because staging runs the loop twice and sees the second pass; in production it runs once for each unique parameter and sees N+1.

Disguise 5: microservice fan-out. A service makes a single call to fetch a list, then makes one HTTP call per item to a related service to enrich each one. This is N+1 over the network rather than over the database, and the cost is much higher because each network call is 1-50ms instead of 0.1-1ms. The pattern is the same; the impact is worse.

How to find it

The reliable way to find N+1 is not to read code. Reading code can miss every disguise above. The reliable way is to look at query logs. Run the suspect endpoint with full query logging on, count the queries, and verify the count is constant rather than proportional to the result set size.

The diagnostic that catches everything is to run the endpoint twice with different sized result sets — say, 1 item and 100 items — and compare the query counts. If the count goes from 5 to 105 (or 5 to 305, or worse), you have N+1 somewhere. If the count goes from 5 to 5, you do not. This test is fast, mechanical, and catches every disguise above because it does not depend on understanding the code path.

Most production databases have query logging or slow query logs you can enable. PostgreSQL has auto_explain and pg_stat_statements. MySQL has the slow query log. SQLite has EXPLAIN QUERY PLAN and pragma tracing. The cost of enabling these in staging is low; the cost of not enabling them is the slow endpoint your customers will eventually find before you do.

The fixes

The fix for an ORM N+1 is usually eager loading: select_related for joins, prefetch_related for separate queries with IN clauses, JOIN in raw SQL, or whatever the equivalent is in your stack. The principle is to fetch all the related data in O(1) queries rather than O(N).

The fix for service-layer N+1 is harder: the service function needs a batched version (get_user_summary_bulk(users)) that fetches everything in one query. Adding the batched version is straightforward; getting every caller to use it is the hard part. Some teams add a lint rule that flags single-item calls in loops; this catches future regressions even when the code path is not in active development.

The fix for microservice fan-out is to design the related service's API to accept lists. A POST /users/lookup endpoint that accepts an array of IDs and returns an array of users is one network call instead of N. Designing for batched access is a top-of-mind concern when you split a service; retrofitting it after the fan-out has shipped is harder.

The deeper pattern

N+1 is a symptom of a deeper pattern: code that was written when the iteration count was small enough that nobody noticed the per-iteration cost. The 10-order test database in development hid the bug. The 100-item staging dataset hid it less effectively. The 10,000-item production case made it impossible to ignore.

The discipline that prevents N+1 from shipping is to test endpoints with realistic-sized data. The discipline that prevents N+1 from appearing in the first place is to make per-item costs visible at the code review level — flag database access inside loops, prefer batched APIs over single-item APIs, and treat service-layer functions as costing what their database queries cost rather than what their function signatures suggest.

Where this fits in our stack

Across our four products — DocuMint, CronPing, FlagBit, and WebhookVault — we have hit N+1 a handful of times, mostly in list endpoints during early development. The fix has always been a small one: an eager load, a batched lookup, a JOIN. The detection has always been the same: enable query logging, run the endpoint, count the queries, find the loop. CronPing's monitor list view was an early example; WebhookVault's endpoint detail with its captured-request count was another.

The deeper lesson is that N+1 is not really a database problem. It is an architecture problem disguised as a database problem. The architecture is "code that hides per-item cost behind innocuous-looking accesses." The database is just where the cost shows up. Fixing N+1 in any specific endpoint is easy. Building a culture where N+1 stops shipping in the first place is the harder and more valuable thing.