Designing API Pagination for Mutable Resources: Cursor Stability Under Concurrent Updates

Pagination examples in API documentation always show static lists. Real production data is mutable: rows get inserted, deleted, updated, and reordered while customers paginate through them. The patterns that handle this gracefully are not the patterns that look obvious in the static case.

Pagination examples in API documentation always show static lists. Real production data is mutable: rows get inserted, deleted, updated, and reordered while customers paginate through them. The patterns that handle this gracefully are not the patterns that look obvious in the static case, and the differences matter because the wrong choice produces silently-skipped rows and silently-duplicated rows that customers debug for hours before discovering the API was lying.

The failure modes

Offset pagination has the canonical failure mode. The client reads page 1 with offset 0, then page 2 with offset 50. Between the two requests, ten rows that sorted before the page-1 results were deleted. Now the rows that were originally at positions 50-99 are at positions 40-89, and page 2 with offset 50 skips ten rows. The customer reads pages 1 and 2 and never sees the rows that drifted past the offset boundary.

The mirror failure is duplication. If new rows are inserted at sort positions before the cursor, the rows that were at positions 50-99 are now at 60-109, and page 2 with offset 50 returns some rows that were also in page 1. The customer reads pages 1 and 2 and sees duplicates.

Cursor pagination eliminates these specific failures by sorting on a column with stable values rather than by position. A cursor of "id > 12345" advances by ID rather than by row count, so insertions and deletions of other rows do not affect the boundary. But cursor pagination has its own failure modes that documentation usually ignores.

The mutable-sort-column problem

Cursors over updated_at or created_at are the canonical case. The cursor encodes the timestamp of the last row in the previous page. The next page query is "WHERE updated_at > cursor_timestamp". If a row's updated_at changes between pages, the row may appear on both pages or on neither.

Concretely: page 1 returns rows with updated_at between T0 and T50, cursor is T50. Between requests, the row at T30 gets updated to T55. Page 2 query is "WHERE updated_at > T50", and the row appears in page 2 even though it was also in page 1 (it shifted into the second page's window). The customer sees the same row twice.

The opposite failure: page 1 returns rows up to T50. A row at T60 (which was going to be in page 2) gets updated to T40 between requests. Page 2 query is "WHERE updated_at > T50", which skips the row that is now at T40.

The immutable-cursor pattern

The first response to the mutable-sort-column problem is to use a sort column that does not change. created_at is usually stable, but it produces the "see new rows at the end" behavior, which is sometimes desirable (a customer paginating through a list expects to eventually see rows that were created during the paginating) and sometimes confusing (the page count grows during pagination).

The primary key is the most stable sort column. Cursor pagination by ID with monotonic IDs (auto-increment or time-ordered UUIDs) is the safest pattern for the "paginate through all rows" use case. The customer sees a stable view of the rows that existed up to some point, plus any newly-created rows that have IDs after the cursor.

The deleted-row case

Cursor pagination handles row deletion gracefully in most cases. If the row at the cursor position is deleted between requests, the next page query "WHERE id > cursor_id" still works because the cursor value does not need to correspond to an existing row. The customer's next page starts at the next existing row after the deleted one.

The edge case is when the cursor encodes additional state (a join key, a tenant filter, a sort secondary). If the additional state is invalidated by the deletion (the joined parent row no longer exists, the tenant has been deactivated), the cursor becomes invalid and the API should return a clear error rather than silently returning empty pages or skipping forward.

The snapshot pattern

For cases where the customer needs a coherent view of the data at a specific point in time (data exports, periodic syncs, audit-trail reads), the right pattern is snapshot pagination. The first request creates a snapshot (a saved query result, a database point-in-time view, a captured ID list) and subsequent requests paginate through the snapshot rather than the live data.

The implementation varies. A heavyweight version uses Postgres REPEATABLE READ or SERIALIZABLE transactions kept open across multiple requests, which works but ties up a database connection per active export. A lighter version captures the IDs at the start (SELECT id FROM table WHERE filter ORDER BY sort LIMIT 100000), stores them in a per-export table or Redis list, and paginates through that captured list. The snapshot has a TTL (24 hours is a typical default for B2B SaaS) and gets cleaned up after.

The snapshot pattern is correct for use cases that require coherence and the wrong default for browsing UIs where customers expect to see recent changes.

The cursor TTL question

Cursors should have a TTL even when the underlying data does not change. The reason is that the cursor encodes assumptions about the schema (sort columns, filter parameters, version) that may change between when the cursor was issued and when it is used.

A cursor issued before a schema migration may reference columns that no longer exist or have different semantics. The right behavior is to detect this and return 410 Gone with a clear error rather than silently returning incorrect results. The cursor includes a schema version, and the API rejects cursors with version mismatches.

One hour is a reasonable active-cursor TTL with twenty-four hours as a hard maximum. Beyond a day, the assumption that the customer is mid-pagination is unlikely; they are probably resuming an old session or running a stale script, and a clean error helps them diagnose.

The unique-sort-key discipline

Cursor pagination requires that the sort order is deterministic. If two rows have the same value for the sort column, the cursor at that boundary is ambiguous: does "after the row with value X" mean after the first one, after the second one, or after both? The fix is to always include the primary key as a tiebreaker.

ORDER BY created_at, id

The cursor encodes both the created_at and the id, and the next-page query is "WHERE (created_at, id) > (cursor_created_at, cursor_id)". The tuple comparison handles the boundary correctly even when multiple rows share a created_at.

This is one of the patterns that documentation tends to skip because it looks like a corner case. In practice, batch-inserted rows often share created_at values down to the millisecond, and the corner case is the common case.

The opaque-cursor discipline

Cursors should be opaque to clients. The structure of the cursor is an implementation detail that may change as the API evolves; if clients parse the cursor to extract IDs or timestamps, they have built a dependency on the implementation that becomes a versioning problem for the API.

The right shape is a base64-encoded signed token. The encoding makes it visually opaque (clients are less likely to parse it casually) and the signing prevents clients from forging cursors that point at arbitrary parts of the data. The signature includes a server-side secret and a schema version, so cursors are tied to the server that issued them and to the schema as of the issue time.

The cursor body itself can be a simple JSON object with the sort-key values, schema version, and filter parameters. The point is that this structure is an internal implementation detail, not a published contract.

The total-count question

API documentation often shows pagination responses with a total_count field. Computing this field requires running a COUNT query in parallel with the page query, which on large tables is often slower than the page query itself. For mutable data, the count is also a lie: by the time the response reaches the client, the count is stale.

The right default is to omit total_count from list responses. If the customer genuinely needs a count (for capacity planning, billing usage display, UI page-number widgets), provide it as a separate endpoint with explicit cost and caching characteristics. Most customers do not need it and the few who do can ask for it explicitly.

The bulk-export alternative

Pagination is the wrong primitive for some use cases. A customer who needs to read all rows in a table to sync with their own system is poorly served by paginating through ten thousand pages of fifty rows. The right interface is bulk export: a single request that returns all matching rows, possibly with a streaming response or async job status.

The bulk-export pattern has its own design considerations (rate limiting, async job tracking, format choice, partial-failure handling) that are different from pagination. The signal that customers need bulk export is repeated pagination through entire result sets without ever stopping. If usage analytics show this pattern, the answer is a dedicated export endpoint, not a higher pagination limit.

The tests that catch most bugs

Five test cases catch most pagination bugs. First: paginate through a static dataset and verify that every row appears exactly once. Second: paginate through a dataset while inserting rows and verify that the seen-rows set matches expectations. Third: paginate through a dataset while deleting rows and verify that the seen-rows set is a subset of the original. Fourth: paginate through a dataset while updating the sort column and verify the documented behavior (depending on cursor type). Fifth: paginate with multiple workers using the same cursor concurrently and verify that each row appears in exactly one worker's result set.

The fifth test catches subtle bugs in cursor encoding (cursors that depend on global state, cursors that are not deterministic in their results) that single-worker tests miss.

Across our four products

Our four products use cursor pagination by primary key with optional secondary sort. DocuMint paginates invoices by created_at with id tiebreaker, CronPing paginates monitors and pings by id, FlagBit paginates flags and rules by id, and WebhookVault paginates captured requests by created_at with id tiebreaker. The cursor format is base64-encoded JSON with HMAC signature and one-hour active TTL. total_count is omitted from all list responses, with separate count endpoints planned for cases where customers have asked for them.

The deeper observation

The deeper observation is that pagination is one of the API design surfaces where the documentation defaults are most misleading. The static-list example is the easy case; production data is mutable, and the patterns that handle mutation gracefully are the ones that matter. Designing pagination starts with asking what the customer is actually doing (browsing, syncing, exporting, auditing), choosing the pattern that fits, and being honest about the consistency model in the documentation. Cursor pagination is not magic; it is a different set of trade-offs from offset pagination, and the trade-offs matter to customers in ways that the documentation should make explicit.

Read more