Database Read-Modify-Write: The Concurrency Bug Most Developers Write Without Noticing

The read-modify-write pattern is one of the most common concurrency bugs in production code, and it almost never shows up under low load. Three correct patterns exist that prevent the bug, and each fits a different operational context.

Almost every developer has written code that looks like this:

row = db.query("SELECT balance FROM accounts WHERE id = ?", id)
new_balance = row.balance - amount
db.execute("UPDATE accounts SET balance = ? WHERE id = ?", new_balance, id)

Under low load, with one user at a time, this code works perfectly. Under any concurrency, it loses updates. The classic banking textbook example is two debits of $50 each against a balance of $100: both transactions read $100, both compute $50, both write $50, and the bank has lost $50 to a race condition that the developer never noticed. The pattern is so common that databases evolved several distinct mechanisms to prevent it, and choosing among them is a non-trivial design decision.

We hit this pattern across DocuMint (invoice numbering), CronPing (monitor state machines), FlagBit (flag version bumps), and WebhookVault (capture counters). Each instance forces the same conversation: which mechanism is right here, and what is the failure mode if we get it wrong?

The three correct patterns

The three patterns that prevent the lost-update bug are pessimistic locking, optimistic concurrency control, and atomic updates. They differ in where the coordination happens and what the failure mode looks like under contention.

Pessimistic locking uses SELECT FOR UPDATE to acquire a row-level lock during the read, hold it through the modify, and release it on commit. Concurrent transactions wait at the SELECT until the holder commits or rolls back, then proceed in serial. The code becomes:

BEGIN;
SELECT balance FROM accounts WHERE id = ? FOR UPDATE;
-- compute new balance in application
UPDATE accounts SET balance = ? WHERE id = ?;
COMMIT;

The advantage is correctness under all concurrency patterns. The disadvantage is that contention turns into queueing: with N concurrent updates, the throughput is gated by transaction duration, not by the database's natural concurrency. If the modify step makes external calls or has variable latency, the queue grows pathologically.

Optimistic concurrency control adds a version column to the row and includes it in the UPDATE's WHERE clause. The transaction reads the version along with the data, computes the change, and writes back conditioned on the version still matching. If a concurrent transaction has bumped the version, the UPDATE affects zero rows and the application retries:

row = db.query("SELECT balance, version FROM accounts WHERE id = ?", id)
new_balance = row.balance - amount
affected = db.execute(
  "UPDATE accounts SET balance = ?, version = version + 1 WHERE id = ? AND version = ?",
  new_balance, id, row.version)
if affected == 0: retry()

The advantage is no holding of locks: contention manifests as retries, not queueing, and the database does no extra work for transactions that do not conflict. The disadvantage is that under high contention, the retry rate becomes pathological. We have seen optimistic patterns degenerate to 90 percent retry rates under hot-row workloads, where the work the database does to compute and roll back the conflicting transactions exceeds the actual work being committed.

Atomic updates push the computation into the database itself, eliminating the round trip to the application:

UPDATE accounts SET balance = balance - ? WHERE id = ?;

This works because the database handles the read-and-write as a single operation under row-level locking that lasts microseconds. The advantage is that it eliminates the lost-update bug entirely with no application-level coordination. The disadvantage is that not every operation can be expressed as a single SQL statement: anything requiring computation outside the database (calling an external API, running business logic with conditionals, or transforming complex data) cannot use this pattern.

Choosing among them

The decision tree is roughly:

  • If the modification can be expressed as a single SQL statement (increment, decrement, conditional update, JSON field merge), use atomic updates. They have the lowest cost and the simplest failure model.
  • If the modification requires application-level computation but contention is rare (different users typically touch different rows), use optimistic concurrency. Retries are cheap when conflicts are rare.
  • If contention on specific rows is expected (hot keys, popular resources, queue heads), use pessimistic locking with SELECT FOR UPDATE. Queueing is more predictable than retry storms.

The wrong answer for any case is the bare read-modify-write pattern. It is correct under sequential access only, and most production code does not run sequentially.

The middle case: SELECT FOR UPDATE SKIP LOCKED

There is a useful variant of pessimistic locking that combines the correctness of locks with the throughput of optimistic patterns: SELECT FOR UPDATE SKIP LOCKED. Concurrent transactions trying to lock the same row do not wait; they skip it and look at the next row. This is the standard primitive for queue workers claiming jobs:

SELECT id, payload FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

The pattern preserves correctness (no two workers claim the same job) while avoiding contention (workers do not queue waiting for the same row). It is the right answer for any worker pool draining a shared queue.

Database isolation levels and what they buy

The default isolation level in most databases is Read Committed, which does not prevent the lost-update problem. Repeatable Read in PostgreSQL detects the conflict and aborts the second transaction with a serialization error, which the application must catch and retry. Serializable goes further, detecting all forms of write-skew and other anomalies at higher CPU cost.

Relying on isolation level alone is operationally fragile: the retry logic has to live somewhere, and pushing it into a generic exception handler often hides the failure mode from anyone reading the code. Explicit optimistic or pessimistic patterns are easier to reason about because the coordination is visible at the call site.

The honest test: simulate it

The best way to find a lost-update bug in your code is to run it concurrently in a test. A trivial harness with 20 goroutines or threads each running the operation 100 times against the same row will surface the bug instantly if it exists. Production load eventually finds the bug too, but production load finds it in expensive ways: a customer is overcharged, a counter is wrong, an invoice number is duplicated. The test cost is minutes and the production cost is hours of incident response plus reputation damage.

We run this test pattern in our CI for any code that updates a row based on its current value. It has caught at least three bugs that were not visible in code review, including one where an idempotency check looked correct but had a race window of ~5 ms during which duplicate webhook deliveries could be processed twice.

The deeper observation

Most production concurrency bugs are not subtle. They are the same read-modify-write bug that appears in every textbook, hidden behind a variable name and a method call. The reason they reach production is that the language and the database do not warn about the pattern. SQL syntax allows the bug, the ORM passes it through, and the test environment runs sequentially. The bug only appears when load arrives, by which point the customer is already affected.

The discipline that prevents this is treating any code that reads a value and writes a derived value back to the same row as a concurrency primitive that needs one of the three explicit patterns. The mental shift is small but persistent: the moment you write a SELECT followed by an UPDATE based on what the SELECT returned, you have written a critical section. Concurrent access will exercise it. The choice is which of the three correct mechanisms you use, not whether to use one at all.

Read more