Bloom Filters and Probabilistic Data Structures: When Approximate Answers Save Money

Most data structure decisions in application code default to the exact answer: hash sets for membership, hash maps for lookups, sorted arrays for range queries. The exact answer is always defensible, and the cost of getting it is rarely visible until the dataset grows large enough that the cost crosses into the dollars-per-month range. By that point the structures have been wired into enough code paths that switching them out becomes a project rather than a change.

Probabilistic data structures are the alternative that almost no application developer reaches for and that most distributed systems quietly use under the hood. They trade exact answers for controlled, bounded approximations, and the memory and speed wins are typically two to four orders of magnitude. The trick is recognizing the queries where an approximation is acceptable and getting the parameters right so the approximation does not surprise you in production.

Bloom filters

The Bloom filter is the canonical probabilistic data structure and the one most likely to be useful in a SaaS context. It answers set-membership queries — has this element been seen before — with a constant-time check that uses about 10 bits per element to achieve a 1% false-positive rate. False negatives are impossible: if the filter says no, the element has definitely not been seen. False positives are possible at the configured rate: if the filter says yes, the element has probably been seen but might not have been.

The implementation is a bit array and a small set of independent hash functions. To insert, hash the element with each function, take each hash modulo the array size, and set those bits. To query, hash the element the same way and check whether all the bits are set. The math gives the false-positive rate as a function of array size, hash count, and element count, and the rate is tunable to whatever the application needs.

The use cases that come up in API infrastructure are: deduplication of high-volume event streams where the cost of a duplicate is small but not zero; rate-limit pre-checks where the rare false positive triggers a more expensive precise check; cache lookups that avoid the network round trip to a cold cache when the key has never been seen; and signup-name pre-validation where the database lookup is expensive and the false positive just means the user sees the same form again.

Across the four Anethoth products the most natural application is in WebhookVault for event-id deduplication during replay storms. A Bloom filter sized for the expected event volume catches the common case of repeat IDs in microseconds without touching the database; the rare false positive is paid for by an exact lookup, and the typical-case savings are large enough to be worth the architectural complexity.

Counting Bloom filters and deletion

The standard Bloom filter does not support deletion — clearing bits would corrupt the filter for elements that share those bits. The counting Bloom filter replaces each bit with a small counter that is incremented on insertion and decremented on deletion, allowing safe removal at the cost of roughly 4x memory. The counter overflow problem is real and has to be designed around: most implementations cap the counter at 15 (4 bits) and refuse to decrement saturated counters, which preserves correctness at the cost of permanent retention of those positions.

Deletion support changes the use case substantially. The standard Bloom filter is a one-way data structure that grows over time and is periodically rebuilt; the counting variant supports steady-state operations on an evolving set. The counting variant is the right choice for cache directories and routing tables; the standard variant is the right choice for deduplication and pre-filtering.

HyperLogLog

The HyperLogLog estimates the cardinality of a set — how many distinct elements are in it — in fixed memory regardless of the set size. The implementation uses one small register per hash bucket, recording the maximum number of leading zeros seen in any hash value mapped to that bucket, and then combines the register values with a harmonic mean to estimate cardinality. A 16KB sketch can estimate cardinalities up to billions with about 1% relative error.

The use cases are unique-visitor counts, distinct-IP rate limits, content-popularity metrics, and any analytics query that asks "how many distinct" rather than "how many total". A precise count requires keeping the full set in memory or running a database query that aggregates over the full data; an HLL sketch reduces the memory cost to a constant and the query cost to a small fixed scan.

HyperLogLog has the additional property that sketches are mergeable: two HLL sketches built from disjoint inputs can be combined into a sketch of the union without loss of accuracy. This is the property that makes HLL useful for distributed analytics — each shard maintains its own sketch, and the global cardinality is computed by merging the per-shard sketches.

Count-Min Sketch

The Count-Min Sketch is the frequency analog of HyperLogLog: it estimates how many times each element has been seen, with bounded error in fixed memory. The structure is a 2D array of counters indexed by hash function and bucket, with each insertion incrementing the counter at each hash function's bucket. Queries take the minimum of the counter values across hash functions, which gives an estimate that is always at least the true count and is wrong only by overcounting.

The applications are heavy-hitter detection, frequency-based rate limiting where the popular keys deserve different treatment from the rare keys, and any kind of streaming analytics where the data volume exceeds memory capacity. The accuracy is tunable through the depth and width of the sketch, with typical configurations using 5-10 hash functions and a few thousand buckets per function.

The selection question

The decision criterion for using a probabilistic data structure is whether the application can tolerate the controlled approximation in exchange for the memory or performance win. The check has three parts: what is the actual error rate at the chosen parameters; what is the consequence of an error event; and what is the cost of the exact alternative.

For Bloom filters, the false-positive rate is the parameter and the application has to handle the false positive correctly. For HyperLogLog, the cardinality estimate has known relative error bounds and the application has to be tolerant of estimates rather than exact counts. For Count-Min Sketch, the frequency estimate is biased upward and the application has to be tolerant of overcounting.

The common mistake is using these structures without verifying the error rate at production data volumes. The math gives the error bounds for random hash functions and uniform inputs; real-world data often has structure that makes the error bounds optimistic. Sampling the actual error rate against ground truth on production-shaped data is the discipline that prevents the surprise.

Where they show up in production systems

The applications most developers encounter are: PostgreSQL Bloom filter index extensions for high-cardinality multi-column indexes; Redis HyperLogLog for fast cardinality estimation across keys; Cassandra Bloom filters for SSTable lookups; Apache Druid Count-Min Sketches for top-K analytics; Google Bigtable Bloom filters for row lookups; CDN cache invalidation lists using counting Bloom filters; antivirus signature databases using Bloom filters for fast prescreening.

The structural pattern is that probabilistic data structures show up in systems that operate at scale where the exact-answer cost is prohibitive. They are not yet a default tool in application code, partly because the application code rarely operates at the scale where the win materializes and partly because the trade-off is genuinely subtle and the wrong parameters can produce surprises. But the threshold for "scale where they help" is much lower than most teams assume — a Bloom filter saves real money on a single VPS handling thousands of requests per second, and the engineering effort to add one is small.

The discipline that makes probabilistic data structures worth reaching for is recognizing that exact answers are not always the right product. A 1% false-positive rate that saves 99% of database lookups is a substantively better engineering trade than 0% false-positive rate at 100x the memory cost. The skill is in knowing which queries can tolerate the trade and which cannot, and that judgment compounds across years of API design across DocuMint, CronPing, FlagBit, and WebhookVault the same way it does in any well-engineered system.