Why Redis KEYS Is a Production Incident Waiting to Happen

KEYS pattern scans the entire keyspace in a single-threaded O(N) sweep, blocking every other command for its duration. It works fine in development with a thousand keys. It brings down production with ten million.

The incident shape is consistent: someone adds a KEYS "user:*" call in a debug endpoint or a monitoring script. Tests pass. Staging has 2,000 keys. The call returns in under a millisecond. It ships.

Production has 10 million keys accumulated over two years. The call takes 28 seconds. During those 28 seconds, every other command queues behind it. Every connected client times out simultaneously. The application falls over.

Why KEYS blocks

Redis is single-threaded for command execution (the I/O threads are separate, but the command processing is not). A running command holds exclusive access to the data structures. KEYS iterates over every entry in the global hash table, the one that stores all keys, checking each against the pattern. The work is proportional to the total number of keys, not the number of matches. KEYS "user:*" on a keyspace with 10 million keys, zero of which match, performs the same sweep as one where all 10 million match; the only extra cost in the second case is building and sending the larger reply.
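To make the "total keys, not matches" point concrete, here is a toy model in plain Python. It is not how Redis is implemented: keys_like is a hypothetical helper, the keyspace is an ordinary list, and fnmatchcase only approximates Redis glob matching. It simply counts how many keys a full sweep has to examine.

```python
from fnmatch import fnmatchcase


def keys_like(keyspace, pattern):
    """Toy model of a KEYS-style sweep: test every key, count the comparisons."""
    comparisons = 0
    matches = []
    for key in keyspace:  # every key is visited, whether it matches or not
        comparisons += 1
        if fnmatchcase(key, pattern):
            matches.append(key)
    return matches, comparisons


keyspace = [f"session:{i}" for i in range(100_000)]

none_match, work_a = keys_like(keyspace, "user:*")     # nothing matches
all_match, work_b = keys_like(keyspace, "session:*")   # everything matches

print(len(none_match), work_a)
print(len(all_match), work_b)
```

Both sweeps perform the same number of comparisons; only the size of the result differs.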

The documentation says this directly: "Warning: consider KEYS as a command that should only be used in production environments with extreme care." Most people read this as "be thoughtful." It means "don't."

The alternatives

SCAN with a COUNT hint is the correct replacement for most KEYS use cases. SCAN is cursor-based: it returns a cursor and a batch of keys, and you call it repeatedly until the cursor returns to 0. Each individual SCAN call still runs on the same single thread, but it does only a small, bounded amount of work before returning, so other commands interleave between iterations.

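import redis

r = redis.Redis()  # assumes a local server on the default port
cursor, keys = 0, []
while True:
    # Each SCAN call is one short command; other clients run in between.
    cursor, batch = r.scan(cursor=cursor, match="user:*", count=100)
    keys.extend(batch)
    if cursor == 0:  # a returned cursor of 0 means the iteration is complete
        break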

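COUNT is a hint, not a guarantee: Redis may return more or fewer keys per call, and with MATCH a call can return no keys at all while the cursor is still nonzero. SCAN may also return the same key more than once, so deduplicate if that matters. The total work is still O(N), but it is distributed across many small commands rather than one large one. Other clients see latency spikes measured in microseconds instead of seconds.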
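A reusable wrapper around the loop above might look like the sketch below. It assumes a client object whose scan(cursor=..., match=..., count=...) method returns a (cursor, keys) pair, which is the shape redis-py's Redis.scan uses; scan_keys itself is a hypothetical helper name. It tolerates empty batches and filters out keys that SCAN reports more than once.

```python
def scan_keys(client, pattern, count=100):
    """Yield each key matching `pattern` once, using cursor-based SCAN."""
    seen = set()  # grows with the number of matches; drop it if duplicates are harmless
    cursor = 0
    while True:
        cursor, batch = client.scan(cursor=cursor, match=pattern, count=count)
        for key in batch:  # batch may be empty even though the scan is not finished
            if key not in seen:
                seen.add(key)
                yield key
        if cursor == 0:  # only a zero cursor signals the end of the iteration
            break
```

With redis-py installed, usage would be along the lines of `for key in scan_keys(redis.Redis(), "user:*"): ...`. The library also ships its own scan_iter helper, which wraps the same cursor loop but does not deduplicate.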

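Keyspace notifications, configured with notify-keyspace-events (disabled by default), allow subscribing to key creation, deletion, and expiry events via pub/sub. If your use case is tracking how the set of keys changes over time, push-based notification is more efficient than iterative scanning and needs no repeated KEYS or SCAN sweeps. Pub/sub delivery is fire-and-forget, so a subscriber that disconnects misses the events published while it was away.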
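As a rough sketch of the subscriber side, the code below assumes a redis-py style client (config_set, pubsub, psubscribe, listen) and subscribes to the key-event channels for one database. parse_keyevent and watch_keys are hypothetical helper names. The "EA" setting enables key-event channels for the event classes covered by the "A" alias; a real deployment would more likely set this in redis.conf and pick a narrower set of classes.

```python
import re

# Key-event channels look like "__keyevent@<db>__:<event>"; the payload is the key name.
_KEYEVENT = re.compile(r"^__keyevent@(\d+)__:(.+)$")


def _text(value):
    """Decode bytes to str; pass other values through unchanged."""
    return value.decode() if isinstance(value, bytes) else value


def parse_keyevent(message):
    """Turn a pub/sub message dict into (db, event, key), or None if it is not a key event."""
    if message.get("type") != "pmessage":
        return None  # e.g. the subscribe confirmation
    match = _KEYEVENT.match(_text(message["channel"]))
    if match is None:
        return None
    return int(match.group(1)), match.group(2), _text(message["data"])


def watch_keys(client, db=0):
    """Yield (db, event, key) tuples as keys change. Blocks while waiting for events."""
    client.config_set("notify-keyspace-events", "EA")
    pubsub = client.pubsub()
    pubsub.psubscribe(f"__keyevent@{db}__:*")
    for message in pubsub.listen():
        event = parse_keyevent(message)
        if event is not None:
            yield event
```

A consumer would iterate over watch_keys(client) and update its own view of the keyspace on each event, for example removing a key when it sees "del" or "expired".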

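A separate metadata index is the highest-throughput option when you need to enumerate keys by prefix or pattern frequently. Maintain a Redis set or sorted set as an explicit index: add keys on write, remove them on delete. Membership checks against a set are O(1), and enumeration costs O(M) in the size of the index (O(log N + M) for a sorted-set range returning M of N members) instead of an O(total keyspace) scan against everything. For a large index, page through it with SSCAN or bounded ZRANGE calls rather than a single SMEMBERS, which is itself a blocking O(M) command.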
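One way this could look, assuming a redis-py style client with pipeline(transaction=True) and sscan_iter: every write and delete goes through a helper that updates the key and the index set in the same MULTI/EXEC transaction. The index name "index:user" and the helper names are hypothetical. Note the sketch does not handle keys that disappear through TTL expiry; those would linger in the index until something cleans them up.

```python
INDEX_KEY = "index:user"  # hypothetical name for the set that acts as the index


def put_indexed(client, key, value):
    """Write a key and record it in the index in one transaction."""
    pipe = client.pipeline(transaction=True)
    pipe.set(key, value)
    pipe.sadd(INDEX_KEY, key)
    pipe.execute()


def delete_indexed(client, key):
    """Delete a key and remove it from the index in one transaction."""
    pipe = client.pipeline(transaction=True)
    pipe.delete(key)
    pipe.srem(INDEX_KEY, key)
    pipe.execute()


def indexed_keys(client, count=100):
    """Enumerate the index incrementally with SSCAN instead of one large SMEMBERS."""
    return client.sscan_iter(INDEX_KEY, count=count)
```

The cost of enumeration now depends on the size of the index, not on how many unrelated keys share the instance.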

The rename-command mitigation and why it is insufficient

A common hardening recommendation is to disable KEYS via rename-command KEYS "" in redis.conf. This prevents KEYS from being called at all. It is a blunt instrument — it also breaks any monitoring tool, client library, or operations script that depends on KEYS — but it does eliminate the risk for that instance.

The problem is that the engineers most likely to reach for KEYS are the ones least likely to have configured rename-command, and the instances most at risk (large production deployments) are the ones where adding a rename-command requires a change review, a deployment window, and a restart. The mitigation requires organizational coordination that the original mistake did not.

SCAN in the application code, from the start, is cheaper than retrofitting the guardrail later.
