Distributed systems people love to talk about consensus, sharding, and consistency models. They talk about idempotency far less often, and that is a mistake. Idempotency is the unglamorous property that makes most of the glamorous stuff possible.
An idempotent operation produces the same effect whether it runs once or a hundred times. Reading a file is idempotent. Setting a flag to true is idempotent. Charging a credit card is, by default, very much not.
Why idempotency keeps everything else honest
Networks fail. Connections time out before the response comes back. Servers crash mid-write. Retry loops fire. Background workers wake up and replay queued events. None of this is exotic — it is the daily texture of running software at any scale above one machine.
If your operations are idempotent, you can retry freely. If they are not, every failure is a bookkeeping problem: did the operation succeed before the failure? Will retrying double-charge the customer? Should the engineer wake up at 3 a.m. to reconcile?
Idempotency turns ambiguous failures into safe ones. That is its real value. It is not a clever optimization; it is the floor on which the rest of your reliability engineering stands.
Three patterns that work
Idempotency keys. The client generates a UUID for each logical operation and sends it with the mutating request, reusing the same key on every retry of that operation. The server stores the (key, response) pair for some retention window. If the same key arrives again, the server returns the cached response without re-executing. Stripe formalized this. DocuMint uses the same pattern for invoice generation.
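To make the key pattern concrete, here is a minimal in-memory sketch. The names IdempotentHandler and charge_card are hypothetical and exist only for illustration; this is not Stripe's or DocuMint's implementation, and it ignores expiry and concurrent requests.

```python
import uuid
from typing import Callable, Dict


class IdempotentHandler:
    """Caches one response per idempotency key (in memory, no expiry)."""

    def __init__(self, operation: Callable[[dict], dict]) -> None:
        self._operation = operation
        self._responses: Dict[str, dict] = {}

    def handle(self, key: str, payload: dict) -> dict:
        # A key we have seen before means this is a retry:
        # return the stored response and skip the side effect.
        if key in self._responses:
            return self._responses[key]
        response = self._operation(payload)
        self._responses[key] = response
        return response


charges = []  # stands in for the payment provider's ledger


def charge_card(payload: dict) -> dict:
    # Hypothetical non-idempotent side effect.
    charges.append(payload["amount"])
    return {"charge_id": len(charges), "amount": payload["amount"]}


handler = IdempotentHandler(charge_card)
key = str(uuid.uuid4())                          # generated once by the client
first = handler.handle(key, {"amount": 500})
retry = handler.handle(key, {"amount": 500})     # same key, e.g. after a timeout
```

Both calls return the same response and the card is charged once. A production version would also need to reserve the key atomically before executing, so that two concurrent requests carrying the same key cannot both run the operation.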
Natural idempotency via state. Some operations are naturally idempotent because they target a specific desired state. PUT /users/42 {name: "Alice"} can run any number of times and the result is the same. POST /users {name: "Alice"} creates a new user every time. Prefer PUT-with-known-id over POST-with-server-generated-id when you can.
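The difference between the two verbs can be sketched with a dictionary standing in for the user table. The functions put_user and post_user are hypothetical stand-ins for the two endpoints, not a real framework API.

```python
import itertools

users = {}                       # user_id -> record
_next_id = itertools.count(1)    # server-side ID generator used by POST


def put_user(user_id: int, body: dict) -> dict:
    # PUT names the target: repeating it overwrites the same record.
    users[user_id] = dict(body)
    return users[user_id]


def post_user(body: dict) -> int:
    # POST lets the server pick the ID: every call creates a new record.
    user_id = next(_next_id)
    users[user_id] = dict(body)
    return user_id


for _ in range(3):
    put_user(42, {"name": "Alice"})
count_after_puts = len(users)    # still one record

post_user({"name": "Alice"})
post_user({"name": "Alice"})     # a "retry" that creates a duplicate
count_after_posts = len(users)
```

Three PUTs leave one record; two POSTs add two more. When the client can choose the ID itself (for example a UUID it generates), a create can be expressed as a PUT and retried safely.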
Conditional writes. Use ETags, version numbers, or "if-match" semantics so that a retry against a state that has already moved on fails loudly rather than silently corrupting data. This is the optimistic-concurrency pattern, and it pairs well with idempotency keys for the cases where the same client retries.
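A version-number variant of the conditional write might look like the sketch below. VersionedStore and VersionConflict are hypothetical names; a real system would push the comparison into the database or an HTTP If-Match check rather than a Python dictionary.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    def __init__(self) -> None:
        self._rows = {}  # key -> (version, value)

    def read(self, key: str):
        # Missing keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def write_if_match(self, key: str, expected_version: int, value) -> int:
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._rows[key] = (new_version, value)
        return new_version


store = VersionedStore()
version, _ = store.read("sku-1")            # version 0, nothing stored yet
store.write_if_match("sku-1", version, 10)  # succeeds, now at version 1

conflict_detected = False
try:
    # A blind retry still carries the old version and is rejected.
    store.write_if_match("sku-1", version, 10)
except VersionConflict:
    conflict_detected = True
```

The retry fails loudly instead of overwriting whatever happened after the first write. The client can then re-read the current state and decide whether its change is still needed.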
Where it usually breaks
The hard cases are operations with multiple side effects: charge the card, send the email, update the inventory, kick off the fulfillment workflow. Each of these can fail independently. A naive idempotency key wrapped around the whole thing helps, but only once a response has been stored. If the request dies partway through, nothing is cached, and the retry re-runs the steps that already completed. So you also need each side effect to be individually idempotent, or you will end up sending the email twice on a retry.
The fix is not glamorous: every side effect gets its own idempotency mechanism, usually keyed off the same parent request ID. The email sender checks "have I sent message-id-X to this address?" before sending. The inventory updater uses a versioned compare-and-swap. The fulfillment workflow uses a workflow ID that uniquely identifies this run.
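One way to sketch per-effect deduplication is a small log keyed by the parent request ID plus a step name. SideEffectLog, fulfil, and the simulated timeout are all hypothetical and only illustrate the idea.

```python
class SideEffectLog:
    """Remembers which (request, step) pairs have already completed."""

    def __init__(self) -> None:
        self._done = set()

    def run_once(self, request_id: str, step: str, action) -> bool:
        effect_id = f"{request_id}:{step}"   # derived from the parent request ID
        if effect_id in self._done:
            return False                     # already done, skip on retry
        action()                             # if this raises, the step is not marked done
        self._done.add(effect_id)
        return True


log = SideEffectLog()
emails = []
inventory = {"widget": 5}
attempts = {"count": 0}


def decrement_inventory() -> None:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("simulated timeout")   # first attempt fails
    inventory["widget"] -= 1


def fulfil(request_id: str) -> None:
    log.run_once(request_id, "email", lambda: emails.append("order confirmed"))
    log.run_once(request_id, "inventory", decrement_inventory)


try:
    fulfil("req-123")
except ConnectionError:
    fulfil("req-123")    # the retry skips the email and redoes only the failed step
```

After the retry there is one email and one inventory decrement. Note the gap in this sketch: a crash between running the action and recording it would still repeat the effect, which is why the real mechanisms named above (message IDs checked by the sender, compare-and-swap, workflow IDs) live inside each downstream system.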
The tradeoff
Idempotency requires storage. You need to remember which keys you have already processed, for some retention window. For high-volume APIs this can be substantial. The usual answer is a TTL of 24-48 hours and a fast key-value store (Redis, DynamoDB, or just an indexed table).
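The retention window can be sketched as a key store with a time-to-live. TTLKeyStore is a hypothetical in-memory stand-in; with Redis or DynamoDB the expiry would normally be delegated to the store's own TTL feature. The clock is injectable so the expiry can be demonstrated without waiting.

```python
import time


class TTLKeyStore:
    """Keeps (key, response) pairs until their retention window ends."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, response)

    def put(self, key: str, response) -> None:
        self._entries[key] = (self._clock() + self._ttl, response)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # lazily evict expired keys on read
            return None
        return response


now = [0.0]                                   # fake clock for the demo
store = TTLKeyStore(ttl_seconds=24 * 3600, clock=lambda: now[0])
store.put("key-1", {"status": "ok"})

within_window = store.get("key-1")            # still cached
now[0] = 24 * 3600 + 1                        # just past the 24-hour TTL
after_window = store.get("key-1")             # expired, treated as a new key
```

Once the window passes, a request with the old key is executed again as if it were new, so the TTL should comfortably exceed the longest period over which clients might retry.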
You also need to think carefully about what counts as the "same" request. If two requests arrive with the same idempotency key but different bodies, is that a client bug, a malicious replay, or a retry that the client altered along the way? The conservative answer is to reject the second request as a conflict. Stripe does this, and it has saved many integrations from subtle bugs.
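Rejecting a reused key with a different body can be sketched by storing a fingerprint of the request next to the response. KeyedRequests, IdempotencyConflict, and the SHA-256-over-canonical-JSON fingerprint are illustrative assumptions, not a description of how Stripe implements the check.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Same key, different request body."""


def fingerprint(body: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class KeyedRequests:
    def __init__(self) -> None:
        self._seen = {}  # key -> (body fingerprint, response)

    def execute(self, key: str, body: dict, operation):
        digest = fingerprint(body)
        if key in self._seen:
            stored_digest, response = self._seen[key]
            if stored_digest != digest:
                raise IdempotencyConflict(f"key {key!r} reused with a different body")
            return response          # genuine retry: same key, same body
        response = operation(body)
        self._seen[key] = (digest, response)
        return response


requests = KeyedRequests()
create = lambda body: {"created": body["name"]}

first = requests.execute("k1", {"name": "Alice", "plan": "pro"}, create)
retry = requests.execute("k1", {"plan": "pro", "name": "Alice"}, create)  # same body, reordered

rejected = False
try:
    requests.execute("k1", {"name": "Bob", "plan": "pro"}, create)
except IdempotencyConflict:
    rejected = True
```

The reordered body is treated as the same request, while the changed body is rejected. Over HTTP, a conflict status code is a natural way to surface the rejection to the client.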
The principle
Most distributed systems failures are not solved by clever algorithms. They are solved by making operations safe to retry. Idempotency is the boring foundation. Build on it before you build anything else.
Whether you are using CronPing to detect missed jobs, WebhookVault to replay webhooks, or FlagBit to evaluate feature flags — every operation in your system is one network failure away from a retry. Make sure the retry is safe.