Cache Key Design: Patterns for Reliable Invalidation

Cache invalidation is famous as one of the two hard problems in computer science. Most of the difficulty is downstream of an earlier choice the team did not realize it was making: how the cache keys are structured. Get key design right, and invalidation becomes a one-line operation.

Phil Karlton's joke about the two hard problems in computer science being cache invalidation and naming things is famous enough to have lost some of its meaning. Both halves are about the same underlying problem: agreeing on what to call something so that two parts of a system can refer to it without ambiguity. Cache invalidation is hard not because deleting an entry is hard, but because knowing which entries to delete requires knowing what is in the cache and which entries are now stale, which is exactly the question the cache was supposed to make irrelevant.

In practice, that difficulty is decided by how the cache keys are structured. A team that gets key design right finds that invalidation collapses to a small set of clean operations. A team that gets it wrong finds that invalidation requires an exhaustive list of every affected key, a global purge, or a wait for natural expiration. We have iterated through several rounds of cache key design across DocuMint, CronPing, FlagBit, and WebhookVault, and the patterns that survived are worth a longer treatment than they usually get.

The key as a contract

A cache key is a contract between the code that writes a cache entry and the code that reads it. They have to agree on the exact string. They also have to agree on three things: what that string represents, what its scope is, and when it becomes invalid. Most cache bugs are violations of one of these three terms, and most of those violations come from keys that did not encode enough information to make the contract clear.

A useful test: pick a random cache key from a running production cache and ask three questions. Who created this entry? What does the key represent (a specific user's view of a specific resource, the rendered output of a specific function, the result of a specific query, etc.)? When does it become invalid? If the key alone does not answer all three, the cache key design is fragile, and invalidation will be unpredictable.

Pattern one: the structured prefix

The first useful discipline is hierarchical prefixes. A key like user:42:profile:v3 is much easier to reason about than profile_42. The prefix scheme tells you the type of resource, the identifier, the sub-resource, and the schema version. Each component answers one of the questions above.

The win from this pattern shows up at invalidation time. If the schema changes, bumping the version (v3 to v4) invalidates every entry of that type without touching the cache: the new code reads from v4 keys, while the old v3 entries are never accessed again and eventually expire or are evicted by the cache's eviction policy (LRU, for example). This is the cheapest and most reliable form of invalidation: change the key prefix, and the old cache becomes inaccessible without an explicit delete.
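A minimal sketch of a key builder for this layout. The names PROFILE_SCHEMA_VERSION and profile_key are hypothetical; they only mirror the user:42:profile:v3 shape used above.

```python
# Hypothetical schema version for the profile cache. Bump it when the cached
# representation changes so that old entries are simply never read again.
PROFILE_SCHEMA_VERSION = 3


def profile_key(user_id: int, schema_version: int = PROFILE_SCHEMA_VERSION) -> str:
    """Build a key shaped like resource:id:sub-resource:version."""
    return f"user:{user_id}:profile:v{schema_version}"


print(profile_key(42))                    # user:42:profile:v3
print(profile_key(42, schema_version=4))  # user:42:profile:v4
```

Keeping the version in one constant means a schema change is a one-line edit that ships with the code that needs it.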

The structured prefix also supports targeted scan-based invalidation when a single user's data changes: a Redis SCAN with MATCH user:42:* finds all entries scoped to that user. Scans are not free, because SCAN walks the whole keyspace in cursor-sized batches and applies the pattern as a filter, but for occasional bulk invalidation of a manageable number of entries, they are far more reliable than maintaining an inventory of every key that ever cached data for that user.
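A hedged sketch of that scan-and-delete step. It assumes a client that behaves like redis-py's redis.Redis, using its scan_iter(match=..., count=...) and delete(*keys) methods; the function name invalidate_user and the batch size are illustrative choices, not part of any library.

```python
from typing import Any


def invalidate_user(client: Any, user_id: int, batch_size: int = 500) -> int:
    """Delete every key under user:<id>: and return how many were removed.

    `client` is assumed to expose `scan_iter(match=..., count=...)` and
    `delete(*keys)` the way redis-py's `redis.Redis` does.
    """
    # The trailing colon keeps user 42 from also matching user 421.
    pattern = f"user:{user_id}:*"
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)  # delete in batches, not one by one
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

With a real connection this would be called as invalidate_user(redis.Redis(), 42). The scan is incremental, so keys written while it runs may or may not be seen; that is acceptable for occasional bulk invalidation but is one reason not to rely on it for every write.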

Pattern two: the cache key as a function input

The second useful discipline is to derive the cache key deterministically from every input that affects the cached value. A function that takes (user_id, locale, feature_flags) and returns a rendered profile should produce a cache key that includes all three inputs, typically by hashing them. If two callers with different inputs ever share a cache entry, the bug is in the key design.

A version component in the key plays the same role at the code level: when the function's behavior changes, the version bumps, and old cache entries become inaccessible without an explicit purge. This is one of the cleanest forms of invalidation because it is automatically synchronized with the code change: the deploy that changes the function also changes the cache key, and there is no window during which the new code reads old cache entries.
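As an illustration, here is one way to derive such a key for the (user_id, locale, feature_flags) example. RENDER_VERSION and rendered_profile_key are hypothetical names, and the choice of SHA-256 over canonical JSON is an assumption; any stable serialization and hash would do.

```python
import hashlib
import json
from typing import Iterable

# Hypothetical code version; bump it in the same change that alters rendering.
RENDER_VERSION = 3


def rendered_profile_key(user_id: int, locale: str, feature_flags: Iterable[str]) -> str:
    """Derive the key from every input that affects the rendered output."""
    payload = json.dumps(
        {
            "user_id": user_id,
            "locale": locale,
            # Sort the flags so the same set always serializes identically.
            "feature_flags": sorted(feature_flags),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    # Keep the user id readable in the prefix; the digest covers all inputs.
    return f"user:{user_id}:rendered_profile:v{RENDER_VERSION}:{digest}"
```

Two callers share an entry only if all three inputs match, and the order in which flags are passed does not matter.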

The failure mode is forgetting to include an input. The classic case: a function caches a value based on user_id but actually depends on (user_id, organization_id). The cache works correctly until a user switches organizations and reads stale data from the previous organization. Static analysis can catch many of these (every function input should appear in the cache key), but the right discipline is to centralize cache key construction in a single helper function per cache namespace, which makes the inputs visible and reviewable.

Pattern three: invalidation as cache key bumping

The third pattern combines the first two. Rather than tracking which cache entries depend on which data and invalidating them on data change, the application bumps a version stored alongside the data and includes that version in every cache key derived from it.

For a user profile cache, the user row has a cache_version column (or a separate small key in the cache itself) that increments whenever any field that affects cached output changes. The cache key becomes user:42:profile:v3:cv:18. When the user's name changes, the application bumps cv to 19; the next read computes the new key, misses, and re-renders. The old entries at cv:18 never receive another read and are eventually evicted.

The win is that the application no longer needs to know which cache entries exist for that user. It just bumps the version. The cache layer takes care of the rest. The cost is the extra version lookup before computing the key, which can sometimes be combined with the data read itself.
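The flow can be sketched with an in-memory stand-in. VersionedProfileCache is a hypothetical class: its versions dict plays the role of the cache_version column, and its entries dict plays the role of the cache.

```python
class VersionedProfileCache:
    """In-memory sketch: `versions` stands in for the cache_version column."""

    SCHEMA_VERSION = 3

    def __init__(self):
        self.versions = {}  # user_id -> cache version (cv)
        self.entries = {}   # cache key -> rendered value

    def key_for(self, user_id):
        # The extra lookup mentioned above: read cv before computing the key.
        cv = self.versions.get(user_id, 0)
        return f"user:{user_id}:profile:v{self.SCHEMA_VERSION}:cv:{cv}"

    def get_or_render(self, user_id, render):
        key = self.key_for(user_id)
        if key not in self.entries:
            self.entries[key] = render(user_id)
        return self.entries[key]

    def bump(self, user_id):
        """Call on any write that affects cached output; nothing is deleted."""
        self.versions[user_id] = self.versions.get(user_id, 0) + 1
```

After bump(42) the next get_or_render(42, ...) computes a new key, misses, and re-renders, while the entry under the old cv stays in place until it expires or is evicted.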

This pattern scales naturally: a tenant-scoped cache uses a tenant version, an organization-scoped cache uses an organization version, a feature-flag-scoped cache uses a flag version. Each scope bumps its own version when its data changes, and the cache keys for that scope become inaccessible without further coordination.

What about explicit purge?

Explicit purge (deleting specific keys on data change) is the pattern most teams reach for first and discover the limitations of later. It requires the writer to know every cache key that depends on the data being changed, which means the writer's code has to know the cache key structure of every reader. This couples readers and writers in a way that becomes painful as the cache grows.

Explicit purge does have its place: for keys with very high creation cost (rendered PDFs, expensive aggregations) where the version-bump pattern would leave warm cache entries inaccessible and produce a stampede of misses, an explicit purge with a careful warming strategy may be the right answer. But it should be the exception, not the default.
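One possible warming strategy, shown here only as an assumption about what "careful" might mean, is to recompute the expensive value first and overwrite the entry in place instead of deleting it. The helper name refresh_entry and the dict-backed cache are hypothetical.

```python
def refresh_entry(cache: dict, key: str, compute):
    """Recompute an expensive value and overwrite it in place.

    Readers keep getting the old value until the new one is ready, so the
    data change does not trigger a burst of simultaneous misses.
    """
    fresh = compute()   # do the expensive work before touching the cache
    cache[key] = fresh  # then replace the entry in a single step
    return fresh
```

The trade-off is that readers see the old value for as long as the recomputation takes, so this fits cases where a brief stale read is cheaper than a stampede.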

The TTL question

Every cache entry needs a TTL, but the TTL is not the primary invalidation mechanism: it is the backstop. If invalidation logic is correct, the TTL exists only to handle the edge cases that logic missed (race conditions during data updates, or a write path that changed the data without bumping the version). The right TTL is short enough to keep the system honest and long enough to provide meaningful hit rates.

A useful exercise: pick the longest stale-data window the application can tolerate, and set the TTL to that. If the application can tolerate 5 minutes of staleness, the TTL is 5 minutes; if it can tolerate hours, the TTL is hours. Designing the TTL based on hit-rate optimization without reference to staleness tolerance produces caches that occasionally serve dangerously stale data.
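A small sketch of a cache whose TTL is set directly from the staleness tolerance. TTLCache is a hypothetical in-memory class, and the injectable clock exists only so that expiry can be exercised without waiting.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose TTL equals the tolerated staleness window."""

    def __init__(self, max_staleness_seconds: float, clock=time.monotonic):
        self.ttl = max_staleness_seconds
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self.entries[key] = (self.clock() + self.ttl, value)

    def get(self, key, default=None):
        item = self.entries.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self.clock() >= expires_at:
            del self.entries[key]  # expired: behave like a miss
            return default
        return value


# Five minutes of tolerated staleness becomes a five-minute TTL.
profile_cache = TTLCache(max_staleness_seconds=5 * 60)
```

The constructor argument is named after the requirement rather than the mechanism, which keeps the staleness decision visible wherever the cache is created.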

Operational signals

Three metrics to track per cache namespace: hit rate (basic indicator of cache effectiveness), entry count and total memory (capacity planning), and stale-entry detection through occasional spot-checks comparing cache values against canonical data sources. The third is the one teams skip and regret: a cache that has been silently serving stale data for weeks is much worse than a cache that misses more often than expected.
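The hit-rate and spot-check parts can be sketched as follows. NamespaceStats, spot_check, and the load_canonical callback are hypothetical names; entry count and memory usage are left out because they normally come from the cache server itself.

```python
import random
from dataclasses import dataclass


@dataclass
class NamespaceStats:
    hits: int = 0
    misses: int = 0
    checked: int = 0
    stale: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def spot_check(cache: dict, load_canonical, stats: NamespaceStats,
               sample_size: int = 10, rng=random) -> list:
    """Compare a random sample of cached values against the source of truth."""
    keys = rng.sample(sorted(cache), min(sample_size, len(cache)))
    stale_keys = []
    for key in keys:
        stats.checked += 1
        if cache[key] != load_canonical(key):
            stats.stale += 1
            stale_keys.append(key)
    return stale_keys
```

Run on a schedule, the returned stale keys give a concrete starting point for finding the write path that skipped invalidation.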

The deeper observation is that cache key design is the part of the cache that determines how much of the rest of the system depends on the cache layer's correctness. Good key design produces caches that are essentially write-once-read-many helpers that the rest of the system does not need to think about. Bad key design produces caches that everyone has to think about all the time. The investment is up front, and like most schema decisions, it gets harder to change as the system grows.
