Designing API Read-Replica Routing: When to Send Queries Where

Read replicas help scale read-heavy APIs, but they only help if the application knows which queries can tolerate lag and which cannot. The routing logic determines whether replicas are an asset or a liability.

Adding a read replica looks like a free performance win until you realize the application has to decide, per query, whether the replica is acceptable. The decision is not a property of the query; it is a property of the customer's expectation. If a customer creates a resource via POST and immediately GETs it, the GET cannot go to a replica that has not yet received the write. This is the read-your-writes problem, and how you handle it determines whether the replica is genuinely useful.

Three routing strategies

Almost every production system that uses replicas converges on one of three routing strategies.

Primary-only by default, replica opt-in by endpoint. Most endpoints route to the primary. A small list of explicitly-opted-in endpoints route reads to the replica. This is the safest pattern because it requires conscious thought to add an endpoint to the replica list, and it makes the latency profile predictable per endpoint. The cost is that most queries do not benefit from the replica.

Replica by default for reads, primary for writes and short-window-after-writes. Reads go to the replica by default. After a write, the customer's session sticks to the primary for a brief window (typically 1-10 seconds). This catches the read-your-writes case for the common pattern of immediate-read-after-write. The cost is more complex routing logic and the risk of sticky-routing bugs.

Replica by default, with explicit "freshness=strict" opt-in for endpoints that need primary. A query parameter or header lets the caller request primary-routing for endpoints where they need it. The cost is pushing the freshness question onto callers, which is fine for API products where callers can be expected to think about it but problematic for end-user dashboards.

The sticky-session implementation

The middle strategy (sticky after writes) is the most common in practice. The minimum implementation is a per-customer flag set on write and checked on read. The flag has a TTL that should be longer than typical replication lag plus a safety margin: if your p99 replication lag is 1 second, set the TTL to 5-10 seconds.

The flag should be keyed on whatever the customer's identity is for read consistency purposes. For B2B APIs this is typically the API key or account. For end-user dashboards this is typically the session ID or user. The wrong key produces the wrong behavior: keying on IP address breaks if multiple customers share a NAT, keying on account when reads come from a different sub-user breaks read-your-writes for the sub-user.

The storage for the flag has to be shared across all API workers. Redis is the typical answer. The flag check adds a Redis round-trip to every read, which is usually 0.5-2ms in cluster topology. For very latency-sensitive APIs this is non-trivial, and some systems instead use a local cache with eventual consistency, accepting that a worker that just restarted might miss the flag for the first few requests after startup.

What does not work

Three patterns that fail in production:

Routing based on HTTP method (GET to replica, POST to primary) ignores read-your-writes. The customer's next GET after a POST has to see the just-written data, and this pattern sends it to a replica that does not.

Routing based on table or query type (looking at SQL strings or ORM call patterns) is brittle. The same logical operation can use different tables in different code paths, and the routing logic ends up scattered through the codebase as a hidden dependency.

Routing based on transaction (read transactions to replica) requires that the application know which transactions are pure reads, which is often not stable as code evolves. A bug that adds a write to a previously-read-only path produces a write-to-replica error that may be hard to attribute.

The replica-lag monitoring requirement

Whatever routing strategy you pick, the replica's lag has to be monitored aggressively. The metric that matters is the time between a primary commit and the replica's apply of that commit. Postgres exposes this via pg_stat_replication on the primary and via the receive-replay lag on the standby.

The alert threshold should be slightly below your sticky-session TTL. If your TTL is 5 seconds, alert at 3 seconds. The alert is informational at first (the replica is slow but the application is handling it) and escalates if the lag exceeds the TTL (the application is now serving stale data to customers).

The lag is not constant. It spikes during checkpoints, during long-running queries on the standby that interfere with apply, and during network blips. The 99th percentile lag is more informative than the median, because the 99th percentile is what hits customers.

The fallback question

What happens when the replica is unreachable? Two reasonable answers:

Fail-over to primary, accepting the load. This is safe but can cause cascade failures if the replica was carrying significant load. The primary needs enough headroom to handle the full read traffic, which means the replica's load-shedding contribution was illusory.

Fail-closed, returning a degraded response. This protects the primary but exposes the replica's unavailability to customers. Appropriate for non-critical reads where stale-or-missing is preferable to taking down the primary.

Most systems use fail-over by default with circuit-breaker thresholds: if too many replica failures happen in a short window, route everything to primary until manual intervention. This balances availability against cascade risk.

Our use across the four products

DocuMint, CronPing, FlagBit, and WebhookVault are SQLite-based with single-instance Postgres planned for the eventual migration. The replica question is for the second migration step, not the first. The current scale does not justify replicas, but the Postgres migration is being designed with the eventual replica topology in mind: writes go through a function-call layer that can later be replicated, reads use a separate connection that can later be routed.

Three patterns that fail across our products would be: assuming Stripe webhooks can read from replicas (they cannot, because the webhook handler immediately processes data the webhook itself just created), routing FlagBit flag evaluation reads to a replica without read-your-writes for the admin who just updated the flag (would cause obviously-buggy behavior), and routing WebhookVault delivery records to a replica without considering that the dashboard customer expects to see deliveries they just initiated.

The deeper observation is that read-replica routing is not a database problem; it is a customer-expectation problem. The database can serve queries from either primary or replica. The question is which queries can tolerate the replica's lag, and that question can only be answered by understanding what the customer expects from the API. The routing logic is the embodiment of that understanding, and getting it wrong shows up as inconsistent customer experience that is hard to diagnose from server-side logs.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) keep the lights on.

Read more