Distributed Locks: When You Need Them, When You Don't, and How to Build Them Without Lying to Yourself

Distributed locks have a peculiar status in backend engineering: they are widely used, widely misused, and almost universally implemented incorrectly the first time. The combination is unfortunate because the cases that look like they need a distributed lock often have simpler, more correct solutions, and the cases that genuinely need one are subtle enough that the typical Redis-based recipe gets the easy parts right and the hard parts wrong.

This piece walks through when you actually need a distributed lock, what the alternatives are, what guarantees a correct lock implementation must provide, and the failure modes of the lock patterns most teams reach for. The goal is to be honest about a primitive that is easier to use than to use correctly.

Most lock requirements are not lock requirements

The most common reason teams reach for distributed locks is to prevent two workers from doing the same work. A scheduled job runs every minute; you want to make sure only one instance executes it. A webhook arrives twice; you want to make sure the side effects happen once. A user clicks a button rapidly; you want to make sure the database is not corrupted by concurrent updates.

None of these are actually lock problems. They are idempotency problems disguised as lock problems. The right primitive is an idempotency key recorded in the database, with the side-effect operation predicated on the key not having been seen before. The database transaction provides the atomicity. The unique constraint on the key column provides the deduplication. No lock is required, and no lock-related failure mode is introduced.
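A minimal sketch of that pattern, assuming the side effect is itself a database write. In-memory SQLite stands in for the application database, and the table and function names (idempotency_keys, charges, charge_once) are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for the application database (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE charges (key TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    return conn


def charge_once(conn: sqlite3.Connection, key: str, amount_cents: int) -> bool:
    """Apply the side effect only if the key is new. Returns True if applied."""
    try:
        # One transaction: the key insert and the side effect commit together
        # or roll back together.
        with conn:
            conn.execute("INSERT INTO idempotency_keys (key) VALUES (?)", (key,))
            conn.execute(
                "INSERT INTO charges (key, amount_cents) VALUES (?, ?)",
                (key, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint fired: this key was already processed.
        return False
```

Calling charge_once twice with the same key applies the charge once; the second call returns False without touching the charges table.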

This same pattern handles the scheduled job case: a row in a jobs table with a unique constraint on (job_name, scheduled_for) ensures that only one scheduler can claim a given execution slot. It handles the webhook case: a unique constraint on event_id deduplicates retries. It handles the rapid-click case: an idempotency token attached to the request makes the second click a no-op.
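The scheduled-job variant might look like the following sketch. The job_runs table, its claimed_by column, and the claim_slot helper are illustrative names, and SQLite's INSERT OR IGNORE stands in for whatever conflict-skipping insert your database offers.

```python
import sqlite3


def make_jobs_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE job_runs (
               job_name TEXT NOT NULL,
               scheduled_for TEXT NOT NULL,
               claimed_by TEXT NOT NULL,
               UNIQUE (job_name, scheduled_for)
           )"""
    )
    return conn


def claim_slot(
    conn: sqlite3.Connection, job_name: str, scheduled_for: str, worker_id: str
) -> bool:
    """Try to claim one execution slot. Returns True only for the winner."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO job_runs (job_name, scheduled_for, claimed_by) "
            "VALUES (?, ?, ?)",
            (job_name, scheduled_for, worker_id),
        )
    # rowcount is 1 if this worker inserted the row, 0 if the slot was taken.
    return cur.rowcount == 1
```

Every scheduler instance can call claim_slot for the same minute; only the one whose insert lands runs the job, and the rest skip it.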

The cases that genuinely need a distributed lock

The cases that actually need a distributed lock have a specific shape: a single resource that lives outside any database, where multiple processes might modify it concurrently, and where the modification cannot be made atomic via the database or the resource's own API. Common examples: writing to an external file system or object store where there is no compare-and-swap; coordinating access to a stateful external service that does not support optimistic concurrency; ensuring single-leader behavior in a cluster of stateless workers.

Notice the common thread: the resource being protected is not under the database's transactional control, and the resource itself does not provide a primitive that can serve as the synchronization point. If your resource is a row in your database, you do not need a distributed lock; you need SELECT ... FOR UPDATE. If your resource is a key in Redis, you do not need a distributed lock; you need a Redis transaction or a Lua script. A distributed lock is the right tool only when the resource has no native concurrency primitive.
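To make "native concurrency primitive" concrete, here is a sketch of one such primitive. It is not SELECT ... FOR UPDATE (which needs a server database) but a related technique, optimistic concurrency via a conditional UPDATE on a version column, shown on in-memory SQLite. The accounts table and the withdraw function are assumptions for illustration.

```python
import sqlite3


def make_accounts_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, balance INTEGER NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO accounts (id, balance, version) VALUES (1, 100, 0)")
    conn.commit()
    return conn


def withdraw(
    conn: sqlite3.Connection, account_id: int, amount: int, expected_version: int
) -> bool:
    """Apply the update only if nobody changed the row since we read it."""
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ?, version = version + 1 "
            "WHERE id = ? AND version = ? AND balance >= ?",
            (amount, account_id, expected_version, amount),
        )
    # Zero rows updated means a concurrent writer won (or funds were short);
    # the caller re-reads the row and decides whether to retry.
    return cur.rowcount == 1
```

The row itself is the synchronization point: two writers that both read version 0 cannot both succeed, and no separate lock service is involved.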

What a correct lock must guarantee

A distributed lock has two safety properties and one liveness property. Safety: at most one client holds the lock at any time. Safety: a client that holds the lock is the same client that acquired it (no token confusion). Liveness: a lock held by a crashed client is eventually released.

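The naive Redis recipe, SET key value NX PX 30000 followed by DEL on release, provides the liveness property but loses both safety properties once a holder outlives its TTL. Suppose client A acquires the lock and then takes longer than the TTL to do its work (because of a GC pause, network slowness, or simply slow code). The TTL expires and client B acquires the same lock. Both clients now believe they hold the lock, so the first safety property is already broken. Client A then finishes its work and calls DEL on the lock key. The DEL succeeds, but it has just released client B's lock, not its own, which breaks the second property as well. The ownership half can be repaired on the lock side: store a random per-acquisition value and delete the key only if the value still matches, as one atomic step (in Redis, typically a short Lua script). The overlap between A and B cannot be repaired on the lock side at all.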
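The release problem can be reproduced deterministically with a small simulation. FakeLockStore below is a hypothetical in-memory stand-in for a store with SET NX PX semantics and a manually advanced clock; it is not a Redis client.

```python
import secrets


class FakeLockStore:
    """In-memory stand-in for a key-value store with SET NX PX semantics."""

    def __init__(self):
        self.now_ms = 0      # manually advanced clock, for determinism
        self._locks = {}     # key -> (owner_value, expires_at_ms)

    def _expire(self, key):
        entry = self._locks.get(key)
        if entry is not None and entry[1] <= self.now_ms:
            del self._locks[key]

    def set_nx_px(self, key, value, ttl_ms) -> bool:
        self._expire(key)
        if key in self._locks:
            return False
        self._locks[key] = (value, self.now_ms + ttl_ms)
        return True

    def delete(self, key):
        # Naive release: removes whatever is there, regardless of owner.
        self._locks.pop(key, None)

    def delete_if_value(self, key, value) -> bool:
        # Owner-checked release. Against a real store the compare and the
        # delete must happen as one atomic step.
        self._expire(key)
        entry = self._locks.get(key)
        if entry is not None and entry[0] == value:
            del self._locks[key]
            return True
        return False

    def holder(self, key):
        self._expire(key)
        entry = self._locks.get(key)
        return entry[0] if entry else None


def b_survives_a_release(owner_checked: bool) -> bool:
    store = FakeLockStore()
    a, b = secrets.token_hex(8), secrets.token_hex(8)
    store.set_nx_px("lock:report", a, 30_000)   # A acquires
    store.now_ms += 31_000                      # A stalls past the TTL
    store.set_nx_px("lock:report", b, 30_000)   # B acquires the expired lock
    if owner_checked:
        store.delete_if_value("lock:report", a)  # A's late release is a no-op
    else:
        store.delete("lock:report")              # A's late release evicts B
    return store.holder("lock:report") == b
```

With the naive delete, B loses its lock to A's late release; with the owner-checked delete, B keeps it. Neither variant stops A and B from both working at the same time.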

The fix for two clients both believing they hold the lock is the fencing token: every lock acquisition returns a monotonically increasing token, every operation guarded by the lock must include that token, and the protected resource rejects any operation whose token is older than the newest one it has seen. This is correct, but it requires the resource to support fencing token validation, which most external services do not. In practice, fencing-token-correct distributed locks are deployed against a small set of resources that explicitly support them.
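A sketch of the two halves of that contract. LockService and FencedStorage are hypothetical classes; the lock service here models only token issuance and leaves out expiry and mutual exclusion.

```python
import itertools


class LockService:
    """Hands out a strictly increasing token with every acquisition."""

    def __init__(self):
        self._counter = itertools.count(1)

    def acquire(self) -> int:
        # Expiry and mutual exclusion are omitted; only token issue is modeled.
        return next(self._counter)


class FencedStorage:
    """A protected resource that validates fencing tokens on every write."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # Reject writes from any holder older than the newest one seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


locks = LockService()
storage = FencedStorage()

token_a = locks.acquire()   # client A acquires, then stalls past its TTL
token_b = locks.acquire()   # client B acquires after A's lease expired

b_ok = storage.write(token_b, "report", "written by B")
a_ok = storage.write(token_a, "report", "written by A")   # stale token
```

B's write is accepted and A's late write is rejected, even though A still believes it holds the lock. The check lives in the resource, which is why the resource has to cooperate.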

The Redlock controversy

The Redlock algorithm, proposed by Salvatore Sanfilippo, the creator of Redis, adds a multi-node twist: acquire the lock on a majority of independent Redis instances within a bounded time, on the theory that even if a minority of instances fail, the lock remains held on the others. Martin Kleppmann's 2016 critique pointed out that Redlock does not solve the GC pause / network delay problem (which is what fencing tokens address), that it depends on timing assumptions such as bounded clock drift and bounded process pauses, and that its complexity is not justified relative to a simpler single-node lock plus a fencing token.
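The acquisition step can be sketched as follows. This is a simplified illustration against in-memory nodes, not the reference implementation: Node, redlock_acquire, the injected clock, and the fixed drift_ms allowance are all assumptions made for the example.

```python
import secrets


class Node:
    """In-memory stand-in for one independent Redis instance."""

    def __init__(self, up: bool = True):
        self.up = up
        self.locks = {}   # key -> (value, expires_at_ms)

    def set_nx_px(self, key, value, ttl_ms, now_ms) -> bool:
        if not self.up:
            return False
        entry = self.locks.get(key)
        if entry is not None and entry[1] > now_ms:
            return False
        self.locks[key] = (value, now_ms + ttl_ms)
        return True

    def delete_if_value(self, key, value):
        if self.up and self.locks.get(key, (None, 0))[0] == value:
            del self.locks[key]


def redlock_acquire(nodes, key, ttl_ms, clock, drift_ms=2):
    """Return (value, validity_ms) if a majority was locked in time, else None."""
    value = secrets.token_hex(16)      # unique per acquisition attempt
    start = clock()
    acquired = sum(n.set_nx_px(key, value, ttl_ms, clock()) for n in nodes)
    elapsed = clock() - start
    # Time left on the lock once acquisition time and drift are subtracted.
    validity = ttl_ms - elapsed - drift_ms
    quorum = len(nodes) // 2 + 1
    if acquired >= quorum and validity > 0:
        return value, validity
    # Failed: undo any partial acquisition so other clients are not blocked.
    for n in nodes:
        n.delete_if_value(key, value)
    return None
```

Nothing in this step protects against a holder that pauses after redlock_acquire returns, which is the gap the critique focuses on.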

The pragmatic reading is that Redlock is overkill for cases where you just want a coordination hint that will be correct most of the time, and insufficient for cases where you actually need correctness under all conditions. The cases where it is exactly right are rare. For most teams, the choice is between a simple Redis lock, used with the understanding that it can fail when a holder pauses or the network stalls, and a more carefully designed system that uses idempotency or fencing tokens to be correct.

Lock-via-database

One pattern that often goes unmentioned is the database-as-lock-service. PostgreSQL advisory locks (pg_advisory_lock, pg_try_advisory_lock) are a built-in lock primitive that inherits the database's correctness guarantees. Session-level advisory locks are bound to the session, so the lock is released automatically when a crashed client's connection closes. They are fast: acquisition is an in-memory operation on the server, so an uncontended acquire costs roughly one round trip to the database. They scale to thousands of locks without ceremony.

The trade-off is that they require a database connection, which is fine if your workers already have one but adds a dependency if they do not. They cannot be acquired across multiple databases, so they do not solve the cross-region or cross-cluster cases. But for workers in a single database cluster, advisory locks are usually the right answer over Redis-based locks: simpler to operate, more correct under failure, less infrastructure.
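A sketch of how a worker might wrap an advisory lock. The helper names (advisory_key, try_advisory_lock) are hypothetical, conn is assumed to be a DB-API connection to PostgreSQL from a driver that uses %s placeholders (psycopg, for example), and hashing the lock name into a 64-bit key is one possible convention rather than anything PostgreSQL prescribes.

```python
import hashlib
from contextlib import contextmanager


def advisory_key(name: str) -> int:
    """Map a lock name to a signed 64-bit integer (advisory locks take bigint)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


@contextmanager
def try_advisory_lock(conn, name: str):
    """Yield True if the session-level lock was acquired, False otherwise."""
    key = advisory_key(name)
    cur = conn.cursor()
    # Non-blocking attempt: returns immediately with true or false.
    cur.execute("SELECT pg_try_advisory_lock(%s::bigint)", (key,))
    acquired = bool(cur.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            # Explicit release; a dropped connection would also release it.
            cur.execute("SELECT pg_advisory_unlock(%s::bigint)", (key,))
```

A worker would use it as `with try_advisory_lock(conn, "nightly-report") as got:` and skip the work when got is False. The lock must be released on the same connection that acquired it, because session-level advisory locks belong to the session.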

The patterns that go wrong

The single most common lock bug is the time-of-check / time-of-use gap: a client checks whether some condition holds, takes the lock, then performs an action assuming the condition still holds. The condition could have changed between the check and the lock. The fix is to check the condition after taking the lock, not before. This sounds obvious but is violated regularly because the check is often implicit (e.g., "we only run this for users in state X").
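The gap is easy to show side by side. In this sketch a local threading.Lock stands in for the distributed lock, the "trial" state is a made-up example of "state X", and the interleave callback is a test hook that simulates another client acting at the worst possible moment.

```python
import threading


def run_racy(user, lock, action, interleave=lambda: None) -> bool:
    if user["state"] != "trial":      # check happens BEFORE the lock
        return False
    interleave()                      # another client may change the state here
    with lock:
        action(user)                  # acts on a possibly stale check
        return True


def run_safe(user, lock, action, interleave=lambda: None) -> bool:
    interleave()                      # the same interference, same timing
    with lock:
        if user["state"] != "trial":  # check happens UNDER the lock
            return False
        action(user)
        return True
```

If the interleaved step upgrades the user to "paid", run_racy still performs the action on a user who is no longer in the trial state, while run_safe notices the change and does nothing.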

The second most common bug is forgetting that lock acquisition can fail. lock.acquire() can time out, can fail because the lock service is unreachable, can fail because some other client crashed and left a lock that has not yet timed out. Code that treats lock acquisition as infallible breaks in production the first time the lock service has a hiccup.
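One way to keep the failure path visible is to make it a distinct outcome that the caller cannot ignore. The sketch below uses a local threading.Lock as a stand-in for a lock client; LockUnavailable and run_exclusive are hypothetical names.

```python
import threading


class LockUnavailable(Exception):
    """Raised when the lock could not be acquired within the timeout."""


def run_exclusive(lock, work, timeout_s: float = 0.05):
    # acquire() returns False on timeout instead of raising, so the result
    # must be checked explicitly.
    if not lock.acquire(timeout=timeout_s):
        raise LockUnavailable(f"lock not acquired within {timeout_s}s")
    try:
        return work()
    finally:
        lock.release()
```

The caller then has to decide what a failed acquisition means for this code path: skip the run, retry later, or surface an error. Any of those is better than proceeding as if the lock were held.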

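The third is the renewal race. A long-running operation holds a lock past its TTL, another client acquires the lock in the meantime, and the original holder's renewal logic then tries to extend a lease it no longer owns. A blind extension at that point lengthens the other client's lock while the original holder carries on as if nothing happened. Renewal must be defensive: verify that the lock is still held by the current client (via a token comparison) and extend it in the same atomic step, and treat a failed renewal as a signal to stop work.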
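A defensive renewal can be sketched against an in-memory lease table. LeaseStore is a hypothetical stand-in for a lock service, with a manually advanced clock so the scenario is deterministic; in a real store the comparison and the extension inside extend would need to be a single atomic operation.

```python
class LeaseStore:
    """In-memory stand-in for a lock service with TTL leases."""

    def __init__(self):
        self.now_ms = 0       # manually advanced clock, for determinism
        self._leases = {}     # key -> (owner_token, expires_at_ms)

    def _live(self, key):
        entry = self._leases.get(key)
        if entry is not None and entry[1] <= self.now_ms:
            del self._leases[key]   # lease has lapsed
            return None
        return entry

    def acquire(self, key, token, ttl_ms) -> bool:
        if self._live(key) is not None:
            return False
        self._leases[key] = (token, self.now_ms + ttl_ms)
        return True

    def extend(self, key, token, ttl_ms) -> bool:
        # Extend only if the lease is live AND still owned by this token.
        entry = self._live(key)
        if entry is None or entry[0] != token:
            return False
        self._leases[key] = (token, self.now_ms + ttl_ms)
        return True

    def expires_at(self, key):
        entry = self._live(key)
        return entry[1] if entry else None
```

A renewal loop built on this would stop the guarded work as soon as extend returns False, since that means the lease lapsed or another client now owns it.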

The mental model that helps

The mental model that helps is to treat the distributed lock as a coordination hint, not as a correctness guarantee. The lock makes contention rare. Code that runs under the lock should still be correct if two instances of it ran simultaneously, because eventually two instances will. Idempotency is the safety net. The lock is the optimization.

This framing changes how the system is built. Instead of "this code path requires the lock," the framing becomes "this code path is idempotent, and we use the lock to prevent unnecessary contention." The lock can fail, can be released early, can produce duplicate execution — all of these are tolerable because the side effects are idempotent. The system is correct in the absence of the lock and merely efficient in its presence.

This is the discipline that separates production-grade lock usage from the recipes-on-blogs version. DocuMint, CronPing, FlagBit, and WebhookVault all use idempotency keys and database constraints as the primary correctness mechanisms; the few places where coordination is required use PostgreSQL advisory locks rather than Redis. Both choices reflect the same principle: the simplest tool that is provably correct is almost always the right one.

The summary

If you are reaching for a distributed lock, ask first whether your problem is actually an idempotency problem. If it is, solve it with an idempotency key in your database. If it is genuinely a coordination problem, ask whether the resource you are coordinating against has a native concurrency primitive (database row locks, Redis transactions, optimistic concurrency at the API layer). If yes, use that. Only if the answer to both questions is no should you reach for a distributed lock — and at that point, you should be reaching for a fencing-token-aware design or PostgreSQL advisory locks rather than the Redis-NX recipe. Locks are the most over-prescribed primitive in backend engineering for a reason: they look simple, they are not, and the systems that use them sparingly are almost always more correct than the systems that use them everywhere.