Designing APIs for Append-Only Event Logs: Patterns That Survive Customer Replay

Customer-facing event logs are how customers debug their integrations and recover from missed webhooks. Designing them well means treating them as a primary product surface, not a secondary debugging convenience.

All four of our products expose customer-facing event logs as a primary feature. Every webhook delivery in WebhookVault is an event in a customer-visible log, as is every ping in CronPing and every flag evaluation in FlagBit. The pattern keeps appearing because customers consistently need it for the same set of reasons: debugging integrations, replaying missed work, auditing what was sent and when, and computing usage. Designing event log APIs well is mostly the discipline of treating them as a product surface rather than a debugging convenience.

The shape of an event-log API

The minimum useful shape is a paginated list endpoint that returns events in reverse chronological order with stable cursors, filtered by a small set of high-cardinality attributes. The typical surface looks like GET /v1/events?type=invoice.created&cursor=&limit=100. The events have a stable schema with a small set of required fields: id, type, created_at, and a data field whose shape is determined by the event type. Stripe's events.list is the reference design that most B2B SaaS event logs converge on.

The fields beyond the minimum are where the design decisions matter. The customer-facing event ID should be a stable opaque string that the customer can use as a deduplication key and to look up a specific event later. The created_at should be the server timestamp at the moment the event was generated, not the moment it was queued for delivery or the moment the customer-facing record was written. The distinction is important when a system has internal queueing or batching, because customer integrations will treat the timestamp as canonical.

The data field is where customer-facing event-log design diverges from internal-only patterns. The data should contain a complete snapshot of the relevant entity at the time of the event, not a reference to the current state. This is what makes the log useful for replay: when a customer's webhook handler misses an event and queries the log to recover, the log should contain enough information to process the event without consulting any other API. The cost is storage (events are larger than references), but the customer experience pays back in proportion to the number of customer integration bugs that get debugged with one API call instead of many.
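A minimal sketch of the snapshot idea, assuming a hypothetical make_event helper and an in-memory invoice dict (neither comes from any of the products named here). The point it illustrates is that data is copied at emit time, so later changes to the entity do not alter the logged event.

```python
import copy
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    id: str                 # stable opaque ID, usable as a dedup key
    type: str               # e.g. "invoice.created"
    created_at: str         # server time when the event was generated (ISO 8601)
    data: Dict[str, Any]    # full snapshot of the entity, not a reference


def make_event(event_id: str, event_type: str, entity: Dict[str, Any],
               now: Optional[datetime] = None) -> Event:
    """Build an event whose data is a deep copy of the entity at emit time."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return Event(id=event_id, type=event_type, created_at=ts,
                 data=copy.deepcopy(entity))


# Hypothetical usage: the invoice changes after the event is emitted.
invoice = {"id": "in_1", "status": "open", "lines": [{"amount": 500}]}
evt = make_event("evt_1", "invoice.created", invoice)
invoice["status"] = "paid"
invoice["lines"][0]["amount"] = 0
# evt.data still describes the invoice as it was when the event was generated.
```

A shallow copy would not be enough here, because nested structures such as the line items would still be shared with the live entity.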

Pagination and cursor stability

Event logs are append-only, which makes them one of the few cases where cursor-based pagination is genuinely simple. The natural cursor is the event ID, provided IDs sort in creation order. ULID and UUID v7 are both time-ordered, though IDs minted within the same millisecond or on different hosts are strictly ordered only if the generator enforces monotonicity. The pagination contract is: given a cursor, return events with IDs less than the cursor in descending order. New events that appear after the customer started paginating do not show up in the current traversal but are available in a fresh one.
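The contract can be sketched over an in-memory list, assuming IDs that sort lexicographically in creation order. The list_events function and the evt_NNN IDs are illustrative only; a real implementation would push the comparison and ordering into the database query.

```python
def list_events(events, cursor=None, limit=100):
    """Return (page, next_cursor) for a newest-first traversal.

    events: list of dicts with an "id" key whose string order matches
    creation order. cursor: the ID of the last event on the previous page.
    """
    older = [e for e in events if cursor is None or e["id"] < cursor]
    page = sorted(older, key=lambda e: e["id"], reverse=True)[:limit]
    # Only hand out a cursor when there is something left to fetch.
    next_cursor = page[-1]["id"] if len(older) > limit else None
    return page, next_cursor


log = [{"id": f"evt_{i:03d}"} for i in range(1, 6)]
page1, cur1 = list_events(log, limit=2)

# An event written mid-traversal has a larger ID than any cursor already
# issued, so it cannot appear on later pages of this traversal.
log.append({"id": "evt_006"})
page2, cur2 = list_events(log, cursor=cur1, limit=2)
page3, cur3 = list_events(log, cursor=cur2, limit=2)
```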

The subtle case is when a customer wants to paginate forward in time from a known position (the recovery-from-missed-webhooks case). The right surface is an optional starting_after parameter that takes an event ID and returns events strictly after that ID in ascending order. This is structurally different from the default reverse-chronological listing and benefits from being explicit. Stripe's starting_after pages deeper into its default newest-first listing, which moves backward in time. We have found that making the ascending order explicit, as in GET /v1/events?starting_after=evt_X&order=asc, is clearer and produces fewer integration bugs than overloading the default listing with implicit ordering semantics.
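A sketch of the recovery loop this enables, again over an in-memory list with lexicographically ordered IDs. Both list_events_after and recover are hypothetical names; in a real client the first would be an HTTP call to the ascending listing.

```python
def list_events_after(events, starting_after, limit=100):
    """Return (page, has_more): events strictly after the given ID, oldest first."""
    newer = sorted((e for e in events if e["id"] > starting_after),
                   key=lambda e: e["id"])
    return newer[:limit], len(newer) > limit


def recover(events, last_seen_id, handle, limit=100):
    """Replay everything after last_seen_id and return the new checkpoint."""
    while True:
        page, has_more = list_events_after(events, last_seen_id, limit)
        for event in page:
            handle(event)
            last_seen_id = event["id"]  # advance the checkpoint per event
        if not has_more:
            return last_seen_id


log = [{"id": f"evt_{i:03d}"} for i in range(1, 8)]
replayed = []
checkpoint = recover(log, "evt_003", replayed.append, limit=2)
```

Advancing the checkpoint after each handled event means a crash mid-page resumes from the last processed event rather than from the start of the page.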

Cursor stability across system changes is the harder problem. If you change the underlying storage (sharding, archival, restructuring), the customer-visible cursor must continue to work. The mitigation is to use cursors that are derived from the public event ID rather than internal database state. A customer cursor that is just a signed, base64-encoded wrapper around an event ID survives almost any internal change because it depends only on event IDs, which are themselves stable by definition.
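One way such a wrapper could look, using an HMAC from the Python standard library. The cursor layout (ID, a dot, a hex signature), the function names, and the demo secret are all assumptions for illustration rather than a description of any product's actual cursor format.

```python
import base64
import hashlib
import hmac

SECRET = b"demo-secret"  # assumption: a real deployment would load and rotate this


def _sign(event_id: str, secret: bytes) -> str:
    return hmac.new(secret, event_id.encode(), hashlib.sha256).hexdigest()


def encode_cursor(event_id: str, secret: bytes = SECRET) -> str:
    """Wrap a public event ID in an opaque, tamper-evident cursor."""
    raw = f"{event_id}.{_sign(event_id, secret)}".encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str, secret: bytes = SECRET) -> str:
    """Return the event ID inside a cursor, or raise ValueError if it is invalid."""
    try:
        raw = base64.urlsafe_b64decode(cursor.encode()).decode()
        event_id, sig = raw.rsplit(".", 1)
    except ValueError as exc:  # covers bad base64, bad UTF-8, missing separator
        raise ValueError("malformed cursor") from exc
    if not hmac.compare_digest(sig.encode(), _sign(event_id, secret).encode()):
        raise ValueError("cursor signature mismatch")
    return event_id


cursor = encode_cursor("evt_01H")
```

Nothing in the cursor refers to a shard, an offset, or a row position, so a storage migration only has to preserve the ability to seek by event ID.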

Retention and archival

The retention question becomes load-bearing as the log volume scales. Most B2B SaaS event logs retain 30-90 days of events in the hot tier (the storage that the API serves directly) and offer longer retention via separate cold-tier APIs. The customer-visible semantics need to be honest: if the API returns 30 days of events, the documentation should say so, and the cursor system should produce a clear error (410 Gone) when a customer presents a cursor pointing to an evicted event.
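A sketch of the eviction check, assuming time-ordered IDs and a store that tracks the oldest event still in the hot tier. The check_cursor function, the cursor_expired error code, and the 30-day window are illustrative choices, not a documented API.

```python
RETENTION_DAYS = 30  # assumed hot-tier window for this sketch


def check_cursor(cursor_id: str, oldest_retained_id: str):
    """Return (http_status, error_body) for a cursor presented to the hot tier.

    With time-ordered IDs, a cursor older than the oldest retained event
    points at data that has been evicted.
    """
    if cursor_id < oldest_retained_id:
        return 410, {
            "error": {
                "code": "cursor_expired",
                "message": (
                    f"Cursor refers to an event older than the "
                    f"{RETENTION_DAYS}-day retention window."
                ),
            }
        }
    return 200, None


status, body = check_cursor("evt_0005", oldest_retained_id="evt_0010")
```

Returning an explicit 410 rather than an empty page lets a client tell "nothing new" apart from "your position no longer exists".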

The cold-tier surface is usually a separate operation: an async export job that produces a downloadable archive over hours rather than serving paginated requests in milliseconds. The API surface looks like POST /v1/events/exports returning a job ID, GET /v1/events/exports/{id} returning status and eventual download URL. The export format is usually JSON Lines because it composes with the standard Unix toolkit. CSV is rarely the right choice for event data because nested structures do not survive the round-trip.
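To illustrate why JSON Lines suits event data, here is a small round-trip sketch with a nested payload. The write_jsonl and read_jsonl helpers are hypothetical, and the sample event is invented.

```python
import io
import json


def write_jsonl(events, fp):
    """Write one compact JSON object per line."""
    for event in events:
        fp.write(json.dumps(event, separators=(",", ":"), sort_keys=True) + "\n")


def read_jsonl(fp):
    """Parse a JSON Lines stream back into a list of dicts."""
    return [json.loads(line) for line in fp if line.strip()]


events = [
    {
        "id": "evt_1",
        "type": "invoice.created",
        "created_at": "2024-01-01T00:00:00+00:00",
        "data": {"lines": [{"sku": "A", "qty": 2}], "customer": {"id": "cus_1"}},
    }
]

buffer = io.StringIO()
write_jsonl(events, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
```

Each line is an independent document, so the archive can be streamed, split, or filtered line by line without parsing the whole file.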

Filtering: what to support and what to skip

The filters customers actually use, in our experience across the four products: event type (almost always), created_at range (usually), resource ID (for events tied to a specific entity), and a small set of business-specific filters per product (subscription ID, monitor ID, flag key, endpoint ID). The filters customers ask for but rarely use: full-text search on event data, regex filters on field values, custom expression languages over event payloads. The pattern is reliable: customers want simple equality filters on small numbers of high-cardinality fields, and the complex filtering they ask for is almost always better served by exporting events and processing them client-side.

The filter implementation question is whether to back filtered queries with dedicated indexes. For event type and created_at the answer is almost always yes: a compound index on (type, created_at). For resource ID it is usually yes: a separate index per resource type, keyed by resource ID. For business-specific filters it depends on volume. Below 100K events per customer the secondary indexes are essentially free; above 10M events per customer the index cost compounds, and the right answer might be a separate query path through a different storage tier.
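The compound index can be tried out with SQLite from the Python standard library. The table layout, index name, and query below are assumptions made for the sketch; a production store would likely differ, but the shape of the index is the same.

```python
import sqlite3

INDEX_NAME = "idx_events_type_created_at"

QUERY = (
    "SELECT id FROM events "
    "WHERE type = ? AND created_at >= ? "
    "ORDER BY created_at DESC LIMIT ?"
)


def build_db():
    """Create an in-memory events table with a (type, created_at) index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id TEXT PRIMARY KEY, type TEXT NOT NULL, "
        "created_at TEXT NOT NULL, data TEXT NOT NULL)"
    )
    conn.execute(f"CREATE INDEX {INDEX_NAME} ON events (type, created_at)")
    return conn


def query_plan(conn, params):
    """Return SQLite's plan description for the filtered listing query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


conn = build_db()
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("evt_1", "invoice.created", "2024-01-01T00:00:00Z", "{}"),
        ("evt_2", "invoice.paid", "2024-01-02T00:00:00Z", "{}"),
        ("evt_3", "invoice.created", "2024-01-03T00:00:00Z", "{}"),
    ],
)
params = ("invoice.created", "2024-01-01", 10)
ids = [row[0] for row in conn.execute(QUERY, params)]
plan = query_plan(conn, params)
```

Putting type first lets the equality filter narrow the index range before the time bound and the ordering are applied.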

Webhooks and the log together

Customer-facing event logs work best when they are paired with webhooks rather than offered as a substitute. Webhooks deliver events in near-real-time but are not reliable; the log is the reliable source of truth that customers can consult when webhooks are missed. The two surfaces should expose identical events with identical schemas: the webhook payload is the same as the event-log entry, the event ID is stable across both. This invariant makes the customer integration story straightforward: implement the webhook handler, fall back to the log for recovery, deduplicate by event ID.
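The customer-side half of that story can be sketched as a consumer that accepts the same event from either surface and applies it once. The Consumer class and its in-memory set are illustrative; a real integration would persist processed IDs durably.

```python
class Consumer:
    """Applies each event at most once, whichever surface delivered it."""

    def __init__(self):
        self.processed = set()   # event IDs already applied
        self.applied = []        # order in which events took effect

    def handle(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        if event["id"] in self.processed:
            return False
        self.applied.append(event["id"])
        self.processed.add(event["id"])
        return True


log = [{"id": "evt_001"}, {"id": "evt_002"}, {"id": "evt_003"}]
consumer = Consumer()

# Webhook path: evt_002 is never delivered.
consumer.handle(log[0])
consumer.handle(log[2])

# Recovery path: replay the log; only the missed event takes effect.
results = [consumer.handle(event) for event in log]
```

Because the webhook payload and the log entry share an ID and a schema, the handler needs no knowledge of which surface an event arrived on.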

The deeper observation is that the log is the primary surface and the webhook is the optimization. Many customer integrations could be built entirely on log polling and would be more reliable than webhook-driven equivalents; the webhook is the latency optimization on top of the log. This framing helps with design decisions: when in doubt about which surface should be authoritative for a given concern, the log usually wins.

Why this matters for the studio

Our four products converge on the same event-log pattern because the same customer needs keep appearing. WebhookVault is essentially a productized event log: every captured webhook is an event, every replay reads from the log. CronPing exposes a per-monitor ping log that customers consult when investigating missed schedules. FlagBit exposes a flag-evaluation log that customers consult during rollout incidents. DocuMint exposes an invoice-generation log that customers consult when generated PDFs need to be regenerated.

The four logs share the same minimum schema (id, type, created_at, data), the same pagination semantics (cursor-based, reverse-chronological default, optional forward-from-cursor), the same retention defaults (90 days hot, archived to cold storage), and the same filter set (type, time range, resource ID). The convergence is not because we copied each design across products. The convergence is because customer-facing event logs are one of the cases where the design space is genuinely narrow and the right answers are knowable from first principles. Knowing what the right answers are before you write the first version is the kind of architectural fluency that compounds across products.
