Vol. IV · No. 04 Monday · 29 June 2026
engineering Dispatch 5 min read · 12 Jun 2026

Why Your Background Jobs Need Idempotency Keys: The At-Least-Once Delivery Problem

Every job queue delivers at least once, not exactly once. Without an idempotency key checked inside the same transaction as your side effect, every job is a duplicate waiting to happen.

engineering · Curiosity

Affected systems: SQS, Redis/Sidekiq, Celery, RabbitMQ, any job queue
Failure mode: Double charges, duplicate emails, duplicate writes
Fix: Idempotency key checked inside the same transaction as the side effect
Common mistake: Using job ID as the idempotency key (the job ID is tied to one enqueue, not to the logical operation)

Every queue system you have ever used delivers messages at least once. Not exactly once. At least once. This is not a bug. It is the documented behavior of SQS, Sidekiq, Celery, RabbitMQ, Kafka in most configurations, and every other job queue you're likely to encounter. Exactly-once delivery is extremely difficult to implement and almost no system offers it.

What this means in practice: your job will execute more than once. Not every job and not always. But often enough that if you have not accounted for it, you will eventually double-charge a customer, send an email twice, or create a duplicate database record. The question is not whether this will happen to you. It's whether you notice it before your customers do.

Why Jobs Execute More Than Once

The failure sequence is straightforward. A job is dequeued. The worker begins processing. Halfway through, the worker crashes — out of memory, deployment restart, network timeout from the queue's perspective. The queue doesn't know whether the job completed. From its perspective, the visibility timeout expired and the job was not acknowledged. So it redelivers the job. A new worker picks it up. The job runs again.

This can happen even without worker crashes. In SQS, every message has a visibility timeout: the window in which your consumer must finish the work and delete the message. If your job takes longer than the visibility timeout, SQS will redeliver the message to another consumer even if the first consumer is still processing it. You can extend the visibility timeout during processing, but if your process stalls and misses the extension, the message becomes visible again.

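Sidekiq has the same pattern. If a Sidekiq process shuts down while a job is in flight, the job goes back to the queue, and a job that raises partway through is retried from the start. The same holds for Celery with late acknowledgment (acks_late) and for any queue that requires explicit acknowledgment.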
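
The redelivery sequence can be sketched with a toy in-memory queue. Everything here (ToyQueue, receive, ack, the hand-advanced clock) is hypothetical and only models the visibility-timeout behavior described above. It is not the API of SQS, Sidekiq, or any real queue.

```ruby
# A toy queue with a visibility timeout. Time is a counter advanced by hand
# so the example is deterministic. All names are illustrative.
class ToyQueue
  Message = Struct.new(:id, :body, :invisible_until)

  def initialize(visibility_timeout:)
    @visibility_timeout = visibility_timeout
    @messages = []
    @now = 0
  end

  def advance(ticks)
    @now += ticks
  end

  def send_message(id, body)
    @messages << Message.new(id, body, 0)
  end

  # Hand out the first visible message and hide it for the timeout window.
  def receive
    msg = @messages.find { |m| m.invisible_until <= @now }
    return nil unless msg

    msg.invisible_until = @now + @visibility_timeout
    msg
  end

  # Acknowledging deletes the message so it is never delivered again.
  def ack(id)
    @messages.reject! { |m| m.id == id }
  end
end

queue = ToyQueue.new(visibility_timeout: 30)
queue.send_message("m1", "charge order 456")

deliveries = []

first = queue.receive     # worker A takes the job...
deliveries << first.id
queue.advance(31)         # ...and crashes or stalls without acknowledging

second = queue.receive    # the timeout has passed, so worker B gets the same job
deliveries << second.id
queue.ack(second.id)      # worker B finishes and acknowledges

third = queue.receive     # nothing left to deliver
```

The queue never learns whether worker A finished. It only sees a missing acknowledgment, so the same message is handed out twice.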

The Idempotency Key Pattern

An idempotency key is a unique identifier for each logical operation. The rule is: if you've already processed this operation, don't process it again. If you haven't, process it and record that you did.

The pattern has three parts:

  1. Check — before executing, check whether this key has already been processed
  2. Execute — execute the business logic
  3. Record — record that this key has been processed

The critical constraint: check and record must be inside the same transaction as the business logic. Not before. Not after. Inside.
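
As a rough illustration of why all three steps share one transaction, here is a minimal sketch against a toy in-memory store. ToyStore, its transaction method, and process are hypothetical names invented for this example. A real implementation would use database transactions rather than copying state.

```ruby
require "set"

# A toy store whose transaction block is all-or-nothing: if the block raises,
# both the keys and the ledger are restored to their earlier state.
class ToyStore
  attr_reader :keys, :ledger

  def initialize
    @keys = Set.new
    @ledger = []
  end

  def transaction
    saved_keys = @keys.dup
    saved_ledger = @ledger.dup
    yield
  rescue StandardError
    @keys = saved_keys
    @ledger = saved_ledger
    raise
  end
end

# Check, execute, record: all three inside one transaction.
def process(store, key, amount, fail_midway: false)
  store.transaction do
    next :skipped if store.keys.include?(key)   # 1. check
    store.ledger << amount                      # 2. execute
    raise "worker crashed" if fail_midway
    store.keys << key                           # 3. record
    :processed
  end
end

store = ToyStore.new
key = "charge_123_456_9900"

first  = (process(store, key, 9900, fail_midway: true) rescue :crashed)
second = process(store, key, 9900)   # retry after the crash
third  = process(store, key, 9900)   # duplicate delivery
```

The crashed attempt leaves no ledger entry and no key behind, so the retry does the work once and the duplicate is skipped. If the key were recorded outside the transaction, the crash would leave either a key without the work or the work without a key.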

Where the Key Comes From

The idempotency key must be generated by the caller at enqueue time and passed as a job argument. It is not the job ID. This is the mistake most people make first.

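A job ID identifies one enqueue, not one operation. Many queues keep the same ID across their own retries (Sidekiq reuses the jid, for example), so the job ID will catch a redelivery. It will not catch the same operation being enqueued twice: a retried web request, a double-submitted form, or a re-run script each produce a job with a new ID. If your idempotency key is the job ID, that second job has a new key and is treated as a new operation. You have solved nothing.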

The idempotency key must represent the logical operation, not the job execution. For a "charge customer" job, the key might be charge_{customer_id}_{order_id}_{amount_cents}. For a "send welcome email" job, it might be welcome_email_{user_id}. The key is stable across retries because it describes what to do, not the job that's doing it.
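
A small sketch of the difference, using the key formats mentioned above. The helper names charge_key and welcome_email_key are hypothetical and exist only for this illustration.

```ruby
require "securerandom"

# Keys derived from the operation: the same inputs always give the same key.
def charge_key(customer_id:, order_id:, amount_cents:)
  "charge_#{customer_id}_#{order_id}_#{amount_cents}"
end

def welcome_email_key(user_id:)
  "welcome_email_#{user_id}"
end

# Two enqueues of the same logical charge produce the same key...
key_a = charge_key(customer_id: 123, order_id: 456, amount_cents: 9900)
key_b = charge_key(customer_id: 123, order_id: 456, amount_cents: 9900)

# ...while two random identifiers for the same charge never match.
random_a = SecureRandom.uuid
random_b = SecureRandom.uuid

same_operation_detected = (key_a == key_b)
random_ids_match = (random_a == random_b)
```

Including amount_cents in the key is a design choice: a second charge for the same order with a different amount counts as a different operation. Whether that is what you want depends on the business rule.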

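In practice: build the key from the operation's own identifiers at enqueue time, or generate a UUID once when there is no natural key. A UUID still protects against redelivery of that job, but it will not deduplicate two separate enqueues of the same operation. Either way, pass the key as an argument named idempotency_key. The job reads it from its arguments, not from any queue metadata.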

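# Enqueue side (Ruby/Sidekiq example).
# Sidekiq serializes arguments as JSON, so pass plain positional values.
MyJob.perform_async(user.id, SecureRandom.uuid)

# Job side:
def perform(user_id, idempotency_key)
  ApplicationRecord.transaction do
    # Insert the key first. The unique constraint on key makes a duplicate
    # raise, which rolls back the transaction.
    IdempotencyKey.create!(key: idempotency_key, created_at: Time.current)

    # Business logic here
    User.find(user_id).do_something_important!
  end
rescue ActiveRecord::RecordNotUnique
  # Key already recorded: an earlier run committed, so skip this one.
  # (Assumes the business logic itself cannot raise RecordNotUnique.)
  nil
end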

The Transaction Boundary Is Not Optional

The check and the record must be inside the same transaction as the business logic. This is the part that gets skipped when someone reads a summary of this pattern and misses the footnote.

Without the transaction boundary: two workers both process the job simultaneously. Worker A checks — key not found. Worker B checks — key not found. Worker A executes and records. Worker B executes and records. You've run the job twice.

With the transaction boundary and a unique constraint on the key: Worker A inserts the key. Worker B tries to insert the key, gets a unique constraint violation, rolls back, and exits. The business logic only ran once. The constraint does the coordination work. This is the correct implementation.

CREATE TABLE idempotency_keys (
  key TEXT PRIMARY KEY,
  result JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- In your job:
BEGIN;

INSERT INTO idempotency_keys (key)
VALUES ('charge_123_456_9900')
ON CONFLICT (key) DO NOTHING
RETURNING key;

-- The application checks how many rows came back:
--   0 rows: the key already exists, so ROLLBACK and stop.
--   1 row:  the key is new, so continue.

-- Execute business logic here...

COMMIT;
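
To see the constraint doing the coordination, here is a minimal sketch with threads standing in for workers. UniqueKeys is a hypothetical in-memory stand-in for a table with a unique constraint on key. It models only the "one insert wins" behavior, not transactions or rollback.

```ruby
# Stand-in for a table with a unique constraint on key: insert returns true
# for exactly one caller per key, however the threads interleave.
class UniqueKeys
  def initialize
    @keys = {}
    @lock = Mutex.new
  end

  def insert(key)
    @lock.synchronize do
      return false if @keys.key?(key)

      @keys[key] = true
    end
  end
end

keys = UniqueKeys.new
runs = Queue.new   # thread-safe; collects one entry per execution

workers = 8.times.map do |n|
  Thread.new do
    # Only the worker whose insert succeeds runs the business logic.
    runs << "worker-#{n}" if keys.insert("charge_123_456_9900")
  end
end
workers.each(&:join)

executions = runs.size
```

Which worker wins varies from run to run, but the number of executions does not. A separate "check, then insert" without the atomic insert would let several workers pass the check before any of them recorded the key.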

What Idempotency Keys Do Not Solve

The pattern works cleanly for operations that are part of a database transaction. It does not solve the problem for external side effects that happen outside a transaction boundary.

Sending an email is not transactional. Calling the Stripe API to charge a card is not transactional. Making an HTTP request to an external service is not transactional. These operations cannot be rolled back if the transaction fails, and they cannot be included in the same database transaction as your idempotency key insert.

For these cases, you need a different approach for each side effect. Stripe accepts an Idempotency-Key header on charge creation — if you send the same key twice, Stripe returns the result of the first call rather than charging again. Your email provider may offer similar idempotency on the send endpoint. If it doesn't, you need to record the fact that you've sent the email inside your own transaction before actually sending, then accept the rare case where you record-but-fail-to-send rather than send-twice.

The order matters: for external calls that cannot be undone, record your intent first and commit it, then execute the external call, then record the result. If the external call fails, you have a recorded intent with no result, which you can retry. If the process crashes after the call but before the result is recorded, you'll retry with the same key, and the idempotency key at the external service will prevent a double charge.
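
The ordering can be sketched as follows. FakeGateway is a hypothetical stand-in for an external API that honors idempotency keys, and the intents hash stands in for a committed database table. Neither reflects a real client library.

```ruby
# Hypothetical payment API that honors idempotency keys: a repeated key
# returns the first result instead of charging again.
class FakeGateway
  attr_reader :charges_made

  def initialize
    @results = {}
    @charges_made = 0
  end

  def charge(amount_cents, idempotency_key:)
    @results.fetch(idempotency_key) do
      @charges_made += 1
      @results[idempotency_key] = { id: "ch_#{@charges_made}", amount: amount_cents }
    end
  end
end

# intents maps key => nil (intent recorded, no result yet) or the stored result.
def charge_once(intents, gateway, key, amount_cents, crash_before_result: false)
  return intents[key] if intents[key]            # already finished earlier

  intents[key] = nil unless intents.key?(key)    # 1. record the intent first
  result = gateway.charge(amount_cents, idempotency_key: key)  # 2. external call
  raise "worker crashed" if crash_before_result
  intents[key] = result                          # 3. record the result
end

intents = {}
gateway = FakeGateway.new
key = "charge_123_456_9900"

begin
  charge_once(intents, gateway, key, 9900, crash_before_result: true)
rescue RuntimeError
  # The worker died after the call; the queue will redeliver the job.
end

result = charge_once(intents, gateway, key, 9900)   # the retry
```

The retry calls the gateway again with the same key and gets the original charge back, so the customer is charged once even though the job ran twice.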

Redis SETNX for Jobs Without a Database

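For jobs that don't have access to a transactional database, Redis provides an atomic set-if-not-exists with a TTL: the SET command with the NX and EX options. (The older SETNX command performs the same check but cannot set a TTL in the same call.)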

key = "idem:#{idempotency_key}"
acquired = redis.set(key, "1", nx: true, ex: 86400)  # 24h TTL
return unless acquired

# Do work here

SET with NX is atomic. Two workers racing on the same key will have one succeed and one fail. There is no transaction boundary with the business logic, which means there is a window where the key is set but the work hasn't completed. If the worker dies between setting the key and finishing the work, the key remains set and every retry is skipped until the TTL expires. This is acceptable when a lost run is harmless (sending a metric, updating a cache), but not when the side effect has real consequences and must happen exactly once.
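
The stuck-key window can be reproduced without a server. FakeRedis below is a hypothetical in-memory stand-in that models only a set call with nx and ex options and a hand-advanced clock. It is an assumption for illustration, not the redis gem.

```ruby
# In-memory stand-in for the single Redis call used above. Only SET with NX
# and EX is modeled, and the clock is advanced by hand.
class FakeRedis
  attr_accessor :now

  def initialize
    @data = {}
    @now = 0
  end

  def set(key, value, nx: false, ex: nil)
    expires_at = @data.dig(key, :expires_at)
    @data.delete(key) if expires_at && expires_at <= @now
    return false if nx && @data.key?(key)

    @data[key] = { value: value, expires_at: ex && @now + ex }
    true
  end
end

def run_job(redis, idempotency_key, log, crash: false)
  acquired = redis.set("idem:#{idempotency_key}", "1", nx: true, ex: 86_400)
  return :skipped unless acquired

  raise "worker died" if crash   # the key is set, but the work is not done
  log << idempotency_key
  :done
end

redis = FakeRedis.new
log = []

first  = (run_job(redis, "abc", log, crash: true) rescue :crashed)
second = run_job(redis, "abc", log)   # redelivery within the TTL
redis.now += 86_400                   # the TTL elapses
third  = run_job(redis, "abc", log)   # a later delivery finally does the work
```

The second run is skipped even though the work never happened. Only a delivery that arrives after the TTL can complete it, which is why this approach suits work you can afford to lose.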

TTL and Table Growth

Idempotency keys must expire. A key for a charge that happened two years ago is not useful — the job won't retry after two years. Keep keys for 24 to 72 hours for most queue systems. Longer for jobs that might legitimately retry after days.

In Postgres, clean up old keys with a cron job:

DELETE FROM idempotency_keys
WHERE created_at < now() - INTERVAL '72 hours';

Without cleanup, the table grows forever. With it, the number of live rows stays roughly constant at steady state because you insert and delete at about the same rate. Each DELETE still leaves dead rows behind until vacuum reclaims them, so schedule the cleanup and check that autovacuum keeps up with it.

At-least-once delivery is the default. Exactly-once behavior requires your application to provide the idempotency that the queue doesn't. The pattern is not complicated. The failure to apply it is.


Written by

Vera

Engineering researcher. APIs, databases, infrastructure, systems design.
