Database Deadlocks: Detection, Prevention, and Recovery in Production

Deadlocks are the concurrency failure mode that most teams discover the hard way: under load, in production, with the database log filling up with deadlock-victim notices. The good news is that deadlocks are well-understood and largely preventable through a small set of disciplines.

A deadlock is a specific kind of failure: two or more transactions hold locks the others need, and none can proceed without one releasing. Databases detect this state because the alternative is the transactions waiting forever. The detection mechanism kills one of the transactions, returns a deadlock error to the application, and lets the rest proceed. The killed transaction is the deadlock victim. Most teams discover deadlocks the hard way: under load, in production, with the database log filling up with victim notices and a customer-visible error rate that wasn't there yesterday.

The good news is that deadlocks are well-understood. The mechanisms by which they arise are a small set; the disciplines that prevent them are simpler than the disciplines that prevent most concurrency bugs; and the recovery path is straightforward when the application is built for it. The bad news is that the disciplines are easy to forget and the prevention requires foresight that production code rarely gets to apply retroactively. This post covers detection, prevention, and recovery patterns that hold up across PostgreSQL, MySQL, and SQLite, with specific notes where the implementations differ. The patterns apply across the four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — and to any system that handles concurrent writes.

How deadlocks arise

The textbook deadlock involves two transactions and two resources. Transaction A locks resource 1, then tries to lock resource 2. Transaction B locks resource 2, then tries to lock resource 1. Neither can proceed. The database detects the cycle in the wait-for graph and kills one transaction. The pattern generalizes to N transactions and N resources, with cycles of any length, but the two-and-two case accounts for the majority of production deadlocks because longer cycles require more coincidence.

The resources in question are usually rows. The classic case is two transactions that both update the same pair of rows in opposite orders. Transaction A: UPDATE accounts SET balance = balance - 100 WHERE id = 1; UPDATE accounts SET balance = balance + 100 WHERE id = 2;. Transaction B at the same time: UPDATE accounts SET balance = balance - 50 WHERE id = 2; UPDATE accounts SET balance = balance + 50 WHERE id = 1;. If the two transactions interleave in the wrong order, A holds the lock on row 1 and waits for row 2; B holds the lock on row 2 and waits for row 1. Deadlock.

The non-textbook cases are more common in practice. Foreign-key constraints can produce surprising lock acquisitions: when transaction A inserts into a child table that references a parent row, the database takes a shared lock on the parent row to prevent it from being deleted concurrently, and that shared lock can conflict with another transaction trying to update the parent. Index updates can take locks in unexpected orders depending on how the planner chooses to apply the update. Trigger-based mutations of additional rows can extend the lock graph in ways the application code did not consciously design.

Detection in the database

PostgreSQL detects deadlocks by periodically scanning the wait-for graph for cycles. The default scan interval is one second, set by deadlock_timeout. When a cycle is found, the database picks a victim — usually the transaction that has done the least work — and aborts it with error code 40P01 (the SQLSTATE for deadlock detection). The non-victim transactions continue. The application sees the deadlock as a query error returned to the connection that ran the victim transaction.

MySQL with InnoDB detects deadlocks immediately rather than on a timer, by maintaining the wait-for graph as locks are acquired and checking for cycles when a new edge is added. The mechanism is more aggressive but generates the same errors. SQLite does not have a deadlock detector in the same sense because it serializes writes, but it has a related failure mode: the SQLITE_BUSY error returned when one connection holds a lock another connection wants and the busy timeout expires. The patterns for handling SQLITE_BUSY are the same as for handling deadlock errors — retry with backoff — but the detection is different.

Prevention pattern 1: lock in a consistent order

The single highest-leverage prevention pattern is to ensure that all transactions acquire locks on the same set of rows in the same order. If transaction A always locks row 1 before row 2, and transaction B always locks row 1 before row 2, the cycle cannot form. The implementation usually means sorting the rows to be locked by primary key before acquiring any of them.

The classic transfer example becomes safe with a simple sort: before updating any row, sort the IDs and update in sorted order. Transaction A and transaction B will both touch row 1 first, regardless of which direction the transfer is going. One of them will block on the lock; the other will proceed; the blocked transaction will proceed once the first commits. No deadlock.

The discipline applies broadly. Any code path that updates multiple rows in a single transaction should sort the rows by primary key before updating. Any code path that holds row-level locks via SELECT FOR UPDATE should sort the locks. Any code path that updates rows in multiple tables should establish a consistent table order across all transactions that touch those tables. The discipline costs almost nothing at write time and prevents a whole class of deadlocks at runtime.

Prevention pattern 2: shorter transactions

The longer a transaction holds locks, the more likely it is to be involved in a deadlock. A transaction that holds locks for ten milliseconds rarely deadlocks; a transaction that holds locks for ten seconds frequently does. The mechanical reason is that the probability of overlap between two transactions grows roughly with the product of their durations.

The pattern that holds up is to keep transactions short and to do as little work inside them as possible. Acquire locks at the latest possible moment, release them at the earliest possible moment, and avoid network calls or expensive computation between the acquisition and the commit. A transaction that issues an HTTP request while holding row locks is asking for trouble; a transaction that pulls data first, computes outside the transaction, and then re-enters the transaction only for the write is asking for less.

The trade-off is that some operations genuinely need consistency across multiple steps that take time. The right answer in those cases is usually optimistic concurrency control: read the rows without locking, do the work, and on commit verify the rows have not changed. This trades the deadlock risk for a write-conflict risk, which is easier to retry from because it does not involve killing arbitrary work mid-flight.

Prevention pattern 3: avoid hot rows

If many transactions update the same row, the database will serialize them, and any small inconsistency in lock ordering will produce deadlocks. Hot rows show up in counter columns (a global request count, a per-user balance with high traffic), in lookup tables that get updated as a side effect of common operations, and in semaphore-like rows that other rows reference.

The patterns that avoid hot rows fall into two categories. Replace the hot row with multiple rows: instead of one counter, use ten counters and increment a randomly-selected one, then sum them at read time. Replace the synchronous update with an asynchronous one: write the events to an append-only log and aggregate them into the counter on a background job. Both patterns trade write latency for write throughput and remove the hot row from the deadlock graph.

Prevention pattern 4: SELECT FOR UPDATE SKIP LOCKED for queues

Job queues built on top of relational databases have a specific deadlock pattern: many workers compete for the same set of pending jobs, and the lock acquisitions can produce cycles in the wait-for graph. The PostgreSQL SELECT FOR UPDATE SKIP LOCKED primitive solves this by atomically claiming a row that is not currently locked by another transaction, falling through to the next available row if the candidate is locked.

The pattern is the right primitive for any queue-like workload built on a relational database. We use it in CronPing for the dispatcher that processes pending alerts, in FlagBit for the worker pool that processes flag-evaluation analytics, and in WebhookVault for the delivery worker that processes outbound webhook attempts. Each worker claims a row, processes it, and updates its status, with no possibility of two workers claiming the same row and no possibility of deadlock among workers.

Recovery: retry with backoff

Even with all the prevention patterns applied, deadlocks will occasionally happen, particularly under load. The recovery path is to detect the deadlock error code, wait a small randomized interval, and retry the transaction. The randomization is important: if all victims retry on the same fixed interval, they may collide again immediately and produce another deadlock.

The implementation in practice is a wrapper around the transaction that catches the deadlock error, sleeps for a random duration in the 10-100ms range, and retries up to some small bound — typically three to five attempts. After the bound is exhausted, the application surfaces the error to the caller. Most production deadlocks resolve on the first or second retry; the small minority that don't usually indicate a more fundamental contention problem that the prevention patterns above need to address.

The observability discipline

Deadlocks are easy to ignore because they often resolve via retry without the customer ever noticing. The discipline that holds up is to track them as a first-class operational metric: count of deadlock errors per minute, by table pair, with alerts when the rate crosses a threshold. The pattern often reveals deadlocks the team did not know about, because the application's retry logic was masking them.

PostgreSQL logs deadlocks to the standard log when log_lock_waits is enabled, with the SQL statements involved and the wait-for graph at the time of detection. The information is sufficient to diagnose the cause, which is usually one of the prevention patterns above being violated by some specific code path. The pattern of investigation is to find the violating path, apply the prevention pattern, and verify the deadlock rate drops to zero.

The deeper observation

Deadlocks are not bugs in the database; they are the database doing exactly what it should do under conditions where the application has asked for inconsistent things from concurrent connections. The prevention patterns are not workarounds for database limitations; they are the discipline of writing concurrent code carefully. The fact that the database catches the cycle and resolves it via retry — rather than letting the application hang forever or corrupt data — is one of the foundational guarantees that makes relational databases trustworthy under concurrent load. The team that takes the disciplines seriously gets a system that handles concurrent writes correctly without much further thought; the team that doesn't gets a system whose behavior under load is unpredictable in ways that are hard to diagnose after the fact.

Read more