engineering

Postgres SKIP LOCKED: Building a Reliable Job Queue Without Extra Infrastructure

The reflexive answer to 'we need a job queue' is Redis, RabbitMQ, or SQS. The unreflective version of that answer skips a Postgres feature that turns a single table into a job queue with semantics that match what most teams actually need: SKIP LOCKED.

Anethoth

19 May 2026 — 6 min read

The reflexive answer to "we need a job queue" is Redis, RabbitMQ, or SQS. The unreflective version of that answer skips a Postgres feature that turns a single table into a job queue with semantics that match what most teams actually need. SELECT FOR UPDATE SKIP LOCKED is the primitive that does this, and it has been in Postgres since version 9.5 in 2016. Most teams who would benefit from it do not know it exists.

The problem it solves

The naive job-queue-in-Postgres pattern fails the moment you have more than one worker. Worker A reads the next pending job, takes seconds to lock it for processing, and during that window worker B reads the same row and also tries to process it. The result is duplicate work, and the standard mitigations (advisory locks, version columns, application-layer coordination) all add complexity that pushes the team toward "let's just use a real queue."

The Postgres mechanism is SELECT ... FOR UPDATE SKIP LOCKED. The semantics are: take a row-level lock on each row that matches the query, but instead of waiting for rows that are already locked, skip them entirely. The result is that worker A and worker B running the same query simultaneously each get a different row, with no coordination beyond what the database is already doing for transaction isolation.

The minimum schema

A queue table needs four columns at minimum: a primary key, a status column, a payload, and a timestamp for ordering. A partial index on the rows that need attention keeps the queries fast as the table grows.

CREATE TABLE jobs (
  id BIGSERIAL PRIMARY KEY,
  status TEXT NOT NULL DEFAULT 'pending',
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  started_at TIMESTAMPTZ,
  attempts INTEGER NOT NULL DEFAULT 0,
  last_error TEXT
);

CREATE INDEX jobs_pending_idx ON jobs (created_at)
  WHERE status = 'pending';

The partial index is load-bearing: as completed jobs accumulate, a full index over the status column would mostly point at irrelevant rows. The partial index only contains rows where the predicate is true, so it stays small and fast regardless of the size of the completed-job archive.

The claim query

The query that workers run to claim the next job is one statement:

UPDATE jobs
SET status = 'running',
    started_at = NOW(),
    attempts = attempts + 1
WHERE id = (
  SELECT id FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;

Three details to read carefully. The inner SELECT with FOR UPDATE SKIP LOCKED picks the oldest pending row that no other transaction is currently locking. The outer UPDATE ... RETURNING atomically transitions that row to running state and hands the payload back to the worker, all in one statement. The worker does not need to do anything else to claim the job; the database has handled the concurrency.

Multiple workers running this query simultaneously each get a different row. If there are no pending rows or all pending rows are already locked, the query returns no result and the worker waits a moment before trying again. There is no thundering herd because each worker walks past the locked rows rather than waiting in line.

What happens at completion and failure

The worker processes the job, then issues one of three terminal-state transitions: success, retriable failure, permanent failure.

-- Success
UPDATE jobs SET status = 'completed', completed_at = NOW()
WHERE id = $1;

-- Retriable failure with exponential backoff
UPDATE jobs SET status = 'pending',
       last_error = $2,
       created_at = NOW() + INTERVAL '1 minute' * (2 ^ attempts)
WHERE id = $1 AND attempts < 5;

-- Permanent failure
UPDATE jobs SET status = 'failed', last_error = $2
WHERE id = $1 AND attempts >= 5;

The retry pattern uses the created_at column as the ordering key, so pushing it into the future delays the job naturally without a separate scheduled_for column. Five attempts with exponential backoff covers most transient failures and stops short of infinite retry on permanent ones.

The failure modes the pattern handles

Worker crashes are the failure mode that catches teams using simpler patterns. If a worker takes a job, transitions it to running, and then dies before completing or failing it, the job sits in running state forever and never gets retried. The fix is a watchdog query that resets jobs that have been running too long:

UPDATE jobs SET status = 'pending',
       last_error = 'reset by watchdog'
WHERE status = 'running'
  AND started_at < NOW() - INTERVAL '15 minutes';

The timeout depends on the expected work duration. For most B2B SaaS workloads, 15-30 minutes is a safe default. The watchdog runs every few minutes from a cron job or a separate worker.

Stuck workers (running but not making progress) require application-layer heartbeats, which are a separate problem worth solving but not strictly required for the job-queue pattern to work.

The scale envelope

The pattern scales further than most teams expect. A single Postgres instance handles thousands of jobs per second comfortably with a few workers and tens of thousands per second with careful tuning (claim batches of 10-100 jobs per query rather than one at a time, partition the table by date for older completed jobs, vacuum aggressively to keep the partial index healthy). The signal that the pattern is hitting its scale ceiling is contention on the jobs_pending_idx index, visible as elevated lock_waits in pg_stat_activity.

Above ten thousand jobs per second sustained, the right move is sharding by hash of the payload (different worker pools claim different partitions of the queue) or graduating to a dedicated queue infrastructure. Below that ceiling, the Postgres pattern wins on operational simplicity: no second system to deploy, monitor, back up, secure, and upgrade.

The advantages over Redis or RabbitMQ

The Postgres queue inherits the durability guarantees of the database itself. Redis with persistence is durable in the happy path and lossy in some failure modes; the Postgres queue is durable in the same sense as any other Postgres write. The transactional integration matters too: a job can be inserted as part of the same transaction that produces the work it represents, so there is no possibility of the job being created without the trigger or the trigger occurring without the job.

The query power matters for operational visibility. Counting pending jobs, looking at the oldest pending job, finding jobs that have been retried repeatedly, identifying jobs from a specific tenant or with a specific payload shape are all SELECT queries against a familiar database, not opaque queue-system commands. The minor cost of the SQL is repaid every time someone needs to debug a queue problem.

The disadvantages to acknowledge

The pattern shares a database with the application. A misbehaving job query can degrade application performance, and an application incident can lock up job processing. Separating into a dedicated Postgres instance handles this when the scale justifies it, but most teams do not need to.

The pub-sub patterns that some queues support (broadcast to many consumers, topic subscriptions) are awkward in Postgres. LISTEN/NOTIFY can substitute for simple cases, but if the workload is genuinely pub-sub with many consumers per message, a dedicated system is the right answer.

Priority queues with complex ordering can be expressed in SQL but the queries get awkward. A simple priority column with a few values works fine; an open-ended priority space with constant reordering is harder to fit.

The operational signals

Four metrics tell you whether the queue is healthy. Count of pending jobs as a leading indicator of capacity problems. Oldest-pending-job age as the latency-budget metric. Count of running jobs as a worker-saturation indicator. Count of failed jobs in the last hour as the error-budget signal.

The first two are queried directly against the table. The third indicates worker pool sizing. The fourth indicates whether the work itself is going wrong or whether infrastructure problems are causing retries to exhaust.

Across our products

Our four products use SQLite, which has equivalent atomic-claim semantics via UPDATE ... RETURNING against a partial index. The SQLite version of the pattern is structurally identical to the Postgres version with different syntax for the partial index. CronPing uses this pattern for monitor-check jobs, WebhookVault uses it for webhook delivery and replay queues, DocuMint uses it for batch PDF generation, and FlagBit uses it for cache propagation jobs. The migration to Postgres is planned and the pattern translates with the addition of SKIP LOCKED, which SQLite does not need because its single-writer concurrency model makes the contention case impossible.

The deeper observation

The pattern is a small example of a wider phenomenon: databases have accumulated decades of features that handle problems teams reach for separate infrastructure to solve. Job queues, full-text search, pub-sub, document storage, time-series compression, vector similarity, geographic queries are all available inside Postgres at small-to-medium scale with operational characteristics that match what most teams actually need.

The argument is not that dedicated systems are wrong; at sufficient scale they are right. The argument is that reaching for them reflexively at small scale pays an operational complexity tax that most teams do not need to pay. The boring stack of "Postgres for everything until proven otherwise" wins on operational simplicity through the years when operational simplicity matters most: the early product years when the team is small and the work is mostly building features rather than running infrastructure. SKIP LOCKED is one of the features that lets the boring stack stay boring longer than people expect.