Event Log Retention: How Long to Keep Webhook Events, Audit Trails, and Operational History

Every system that stores events accumulates them faster than the team expects. The retention question is not just how long to keep events but which events, what to keep about them, and how to age the storage so the recent events stay fast to query while the old events stay cheap to keep.

The problem compounds over time. Events arrive at a steady rate, storage grows linearly, and the team that designed the schema usually did not think hard about retention because the volumes seemed manageable at the start. By the time the volumes are no longer manageable, the schema has accumulated dependencies that make a retroactive retention policy hard to introduce. The events are referenced by application code, by support tooling, by analytics dashboards, and by customer-facing features, so shortening retention now means either coordinating across all those dependencies or quietly accepting that the storage cost will keep growing.

This post covers the retention question across three classes of events that show up in most systems — webhook deliveries, audit trails, and operational history — and the patterns that hold up over time. The patterns apply across the four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — and are general enough to apply to any system that accumulates events over time.

The three event classes

Different event classes have different retention requirements, and conflating them is a common mistake. The three classes that matter most:

Webhook deliveries. When the system sends a webhook to a customer's endpoint, the delivery (request, response, status, retry attempts) is typically logged for the customer to inspect. The retention here is driven by customer use cases: how far back do customers need to look to debug a missed delivery? In practice, most debugging happens within hours and almost all happens within days. WebhookVault retains delivery records for 30 days on the lower tiers and longer on higher tiers, which matches the actual usage pattern.

Audit trails. When a user or system action changes important state, the change is logged for compliance, security, and debugging purposes. The retention here is driven by legal and compliance requirements, which vary by jurisdiction and by data type. Financial records typically need seven years; security events typically need at least one year; general application audit logs are often retained for a year for support purposes regardless of regulatory requirements.

Operational history. When the system processes a job, runs a check, or completes a background task, the operation is logged for observability and post-mortem debugging. The retention here is driven by the team's debugging needs: how far back does the team typically look when investigating an incident? In practice, most investigation happens within days and almost all happens within weeks. Operational history beyond 30 days is rarely consulted directly and is often better served by aggregated metrics than raw events.

The access-pattern divergence

The three classes share the shape of their access pattern: very recent events are queried frequently, somewhat older events are queried occasionally, and old events are queried almost never. Where the classes diverge is in how quickly access decays. The consistent shape suggests a tiered storage strategy, with tier boundaries set per class.

The standard tiered storage strategy uses three tiers. The hot tier holds the most recent events in the primary database, where they are fast to query and inexpensive to store as long as the volume is bounded. The warm tier holds older events in compressed form on cheaper storage, accessible via a slower query path that uses streaming reads and decompresses on the fly. The cold tier holds the oldest events in object storage, accessible only through batch-restore operations, primarily for compliance retrieval rather than interactive query.

The tier boundaries depend on the event class. For webhook deliveries: hot tier for the first week, warm tier for weeks 2-4, cold tier or deletion after 30 days. For audit trails: hot tier for the first month, warm tier for months 2-12, cold tier for years 2-7, deletion after that depending on the regulatory regime. For operational history: hot tier for the first week, warm tier for weeks 2-4, deletion or aggregation-only after 30 days.
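The boundaries above can be written down as data rather than left in prose, which also makes them easy to check. The following is a minimal sketch; the class names, the tier names, and the tier_for helper are illustrative assumptions, and a year is approximated as 365 days.

```python
from datetime import timedelta

# Tier boundaries per event class, taken from the paragraph above.
# Each entry is (tier name, age at which the tier ends).
# Names are illustrative; a year is approximated as 365 days.
TIER_POLICY = {
    "webhook_delivery": [
        ("hot", timedelta(days=7)),
        ("warm", timedelta(days=30)),
    ],
    "audit_trail": [
        ("hot", timedelta(days=30)),
        ("warm", timedelta(days=365)),
        ("cold", timedelta(days=7 * 365)),
    ],
    "operational_history": [
        ("hot", timedelta(days=7)),
        ("warm", timedelta(days=30)),
    ],
}


def tier_for(event_class: str, age: timedelta) -> str:
    """Return the storage tier for an event of the given age.

    "expired" means the event is past its last tier and is eligible for
    deletion, archival, or aggregation, depending on the class.
    """
    for tier, max_age in TIER_POLICY[event_class]:
        if age < max_age:
            return tier
    return "expired"
```

A mover job could call tier_for on each batch of events and relocate those whose tier no longer matches where they are stored.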

The aggregation alternative

For operational history specifically, the right answer is often not to retain the events at all but to aggregate them into summary metrics that are kept indefinitely. A single row of aggregate data covering one hour's activity is far smaller than the events that contributed to the aggregate, and the aggregate is sufficient for most retrospective uses (trends, capacity planning, regression detection).

The pattern is to write events to a hot operational log, run a periodic aggregation job that produces hourly or daily summary rows in a separate metrics table, and delete events from the operational log after some short retention window. The summary rows survive indefinitely; the raw events do not. The trade-off is that lost detail cannot be recovered, but the lost detail is rarely consulted in practice.
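A minimal sketch of the aggregate-then-delete job follows. It uses SQLite from the Python standard library so that it runs anywhere; the table names (job_events, job_metrics_hourly), the columns, and the choice of hourly buckets are assumptions made for illustration, not a description of any particular product's schema.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical raw-event and hourly-summary tables."""
    conn.executescript(
        """
        CREATE TABLE job_events (
            id INTEGER PRIMARY KEY,
            job_name TEXT NOT NULL,
            status TEXT NOT NULL,
            duration_ms INTEGER NOT NULL,
            started_at INTEGER NOT NULL  -- epoch seconds
        );
        CREATE TABLE job_metrics_hourly (
            hour_start INTEGER NOT NULL,
            job_name TEXT NOT NULL,
            runs INTEGER NOT NULL,
            failures INTEGER NOT NULL,
            total_ms INTEGER NOT NULL,
            PRIMARY KEY (hour_start, job_name)
        );
        """
    )


def rollup_and_prune(conn: sqlite3.Connection, cutoff: int) -> None:
    """Summarize events older than `cutoff` into hourly rows, then delete them."""
    # Align the cutoff to an hour boundary so no hour is summarized twice.
    # This assumes events do not arrive late for an hour already rolled up.
    cutoff -= cutoff % 3600
    with conn:  # one transaction: summaries are written before raw rows go
        conn.execute(
            """
            INSERT INTO job_metrics_hourly
                (hour_start, job_name, runs, failures, total_ms)
            SELECT (started_at / 3600) * 3600, job_name,
                   COUNT(*), SUM(status = 'failed'), SUM(duration_ms)
            FROM job_events
            WHERE started_at < ?
            GROUP BY started_at / 3600, job_name
            """,
            (cutoff,),
        )
        conn.execute("DELETE FROM job_events WHERE started_at < ?", (cutoff,))
```

Running both statements in one transaction means a crash between them cannot delete raw events whose summary was never written.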

The compliance overlay

Audit trails specifically intersect with compliance regimes that mandate minimum retention. SOX and similar financial regulations mandate seven years for financial transactions; HIPAA's six-year documentation requirement is commonly applied to health-information access logs; PCI-DSS mandates at least one year of audit log history for cardholder-data activity. GDPR is the odd one out: it sets no minimum and instead limits retention of personal data to what the processing purpose requires, so it acts as a ceiling rather than a floor. The minimum retention for any audit trail is the longest applicable regulatory minimum, which means the team building the audit trail needs to know which regimes apply.
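The "longest applicable minimum" rule is a one-line calculation once the applicable regimes are known. The sketch below uses the year figures quoted above as placeholders to confirm with counsel; the one-year default for logs with no regulatory minimum and the function name are assumptions.

```python
# Minimum retention in years per regime, using the figures quoted above.
# Placeholders for illustration, not legal advice.
REGIME_MIN_YEARS = {"SOX": 7, "HIPAA": 6, "PCI-DSS": 1}


def minimum_retention_years(applicable_regimes, default_years=1):
    """Return the longest minimum among the applicable regimes.

    `default_years` is the floor used when no regime applies
    (assumed here to be one year, for support purposes).
    """
    minimums = [REGIME_MIN_YEARS[regime] for regime in applicable_regimes]
    return max(minimums + [default_years])
```

An unknown regime name raises KeyError on purpose: a policy that silently ignores a regime it does not recognize is worse than one that fails loudly.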

The complication is the GDPR right to erasure (the "right to be forgotten"), which can conflict with other retention regimes. The reconciliation in practice is to pseudonymize personal data in audit trails and to keep the mapping from pseudonym to person in a separate store, so that the audit record can be retained for the regulatory minimum while the link to the individual can be deleted on request. Pseudonymized data still counts as personal data under GDPR for as long as re-identification remains possible, so this works for most cases but has edge cases that need legal review.
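One way to sketch the pseudonymization idea is a keyed hash with a per-user secret: the audit trail stores only the pseudonym, and an erasure request deletes the key. The Pseudonymizer class and its in-memory key store are hypothetical; a real system would keep the keys in a separate, access-controlled store, and whether this satisfies an erasure request is a legal question, not a technical one.

```python
import hashlib
import hmac
import secrets


class Pseudonymizer:
    """Maps user ids to stable pseudonyms using a per-user secret key."""

    def __init__(self):
        # user_id -> secret key. Illustrative only: in practice this lives
        # in a store separate from the audit trail.
        self._keys = {}

    def pseudonym(self, user_id: str) -> str:
        """Return the pseudonym to write into audit records for this user."""
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        key = self._keys[user_id]
        return hmac.new(key, user_id.encode(), hashlib.sha256).hexdigest()

    def forget(self, user_id: str) -> None:
        """Erasure request: drop the key so existing pseudonyms in the
        audit trail can no longer be recomputed from the user id."""
        self._keys.pop(user_id, None)
```

The audit rows themselves are never touched by forget, which is what lets them survive for the full regulatory retention period.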

The deletion mechanics

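Deleting old events is operationally non-trivial. A naive DELETE on a table with billions of rows runs as one long transaction: it holds row locks for an unacceptable duration, generates a burst of write-ahead log, and leaves behind dead rows that vacuum has to clean up. The pattern that holds up is to delete in chunks: a script that deletes 1000 rows at a time in a loop, each chunk in its own transaction with a small sleep between chunks, holds locks only briefly and can run continuously without noticeably affecting application performance. PostgreSQL's DELETE has no LIMIT clause, so the chunk is selected in a subquery: DELETE FROM events WHERE id IN (SELECT id FROM events WHERE created_at < $threshold ORDER BY id LIMIT 1000) RETURNING id, repeated until it returns no rows.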
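The chunked loop can be sketched as follows. The example runs against SQLite so that it is self-contained, and it selects each chunk's ids in a subquery, a form that PostgreSQL also accepts. The events table with id and created_at columns, the function name, and the default pause are assumptions.

```python
import sqlite3
import time


def delete_in_chunks(conn, threshold, chunk_size=1000, pause_seconds=0.1):
    """Delete events older than `threshold` in small transactions.

    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        with conn:  # commit after every chunk so locks are held briefly
            cur = conn.execute(
                """
                DELETE FROM events
                WHERE id IN (
                    SELECT id FROM events
                    WHERE created_at < ?
                    ORDER BY id
                    LIMIT ?
                )
                """,
                (threshold, chunk_size),
            )
        deleted = cur.rowcount
        total += deleted
        if deleted < chunk_size:
            return total  # a short chunk means nothing older remains
        time.sleep(pause_seconds)  # give other work room between chunks
```

An index on created_at keeps the subquery cheap; without one, every chunk scans the table to find its rows.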

For very large tables, partitioning by time becomes the right answer. PostgreSQL declarative partitioning by created_at, with one partition per month or per week, makes deletion of old events a partition-drop rather than a row-by-row delete. The partition drop is essentially instantaneous regardless of how many rows it contains. The trade-off is that partitioning adds schema complexity and requires ongoing maintenance to create new partitions before they are needed.
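The ongoing maintenance is mostly a scheduled job that creates next month's partition and drops the expired one. The sketch below only generates the PostgreSQL statements. It assumes a parent table named events declared with PARTITION BY RANGE (created_at) and a naming scheme of events_YYYY_MM; both are illustrative choices.

```python
from datetime import date


def month_start(d: date, offset: int = 0) -> date:
    """First day of the month `offset` months after the month containing `d`."""
    index = d.year * 12 + (d.month - 1) + offset
    return date(index // 12, index % 12 + 1, 1)


def create_partition_sql(month: date, parent: str = "events") -> str:
    """DDL for the monthly partition covering `month` (upper bound exclusive)."""
    start, end = month_start(month), month_start(month, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )


def drop_partition_sql(month: date, parent: str = "events") -> str:
    """DDL that removes the monthly partition covering `month`."""
    return f"DROP TABLE IF EXISTS {parent}_{month_start(month):%Y_%m};"
```

Creating partitions a month or two ahead of need leaves slack for a failed maintenance run; an insert into a range with no partition fails unless a default partition exists.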

The verification discipline

Retention policies are easy to declare and easy to forget. The discipline that holds up is to monitor retention as an operational metric: a daily query that reports the oldest record in each event class, alerting if the record is older than the policy allows. This catches retention failures (the cleanup job stopped running) and retention drift (the policy was changed but the cleanup did not catch up).

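The verification also catches the opposite failure, provided the check has a lower bound as well as an upper bound: events that should be retained but are being deleted prematurely. Once the system has been running longer than the retention window, the oldest record should be close to the policy age, and a much younger oldest record means something was deleted early. This happens when a retention policy is implemented incorrectly, when cleanup logic has a bug, or when an event class is added to the cleanup script without the corresponding retention rule. A daily check catches the failure within a day, rather than at the moment a customer needs an event that no longer exists.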
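A minimal sketch of the check, covering both directions, follows. The function name, the one-day slack, and the collecting_since parameter (when the system started recording this event class) are assumptions. The premature-deletion branch also assumes events arrive at least daily, so that a gap in the oldest records means deletion rather than a quiet period.

```python
from datetime import datetime, timedelta


def retention_status(oldest, now, retention, collecting_since,
                     slack=timedelta(days=1)):
    """Classify the oldest record of one event class against its policy.

    Returns "overdue" if records older than the policy still exist,
    "premature" if the oldest record is younger than it should be,
    and "ok" otherwise.
    """
    age = now - oldest
    if age > retention + slack:
        return "overdue"  # cleanup is behind or has stopped running
    # A young system cannot have records as old as the policy allows.
    expected_age = min(retention, now - collecting_since)
    if age < expected_age - slack:
        return "premature"  # something deleted more than the policy allows
    return "ok"
```

The oldest timestamp would come from a query such as SELECT MIN(created_at) per event class, run once a day and fed to the alerting system.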

The deeper observation

The retention question is mostly a question about the cost of optionality. Keeping every event forever preserves maximum optionality at maximum cost. Aggregating aggressively minimizes cost at the expense of optionality. Most teams pick a point on this trade-off without explicitly thinking about it, usually too far toward keeping everything because deletion feels destructive. The teams that age well are the ones that explicitly reason about which optionality they actually need, write down the retention rules, and verify that the rules are enforced.

In practice, aggressive pruning pays off. Events retained beyond the active access window are almost never useful, and the storage and operational cost of keeping them is real. Aggregation captures the trends that retrospective analysis actually looks at; raw events capture the specific occurrences that almost no one ever looks at. The right answer for most event classes is shorter retention than feels comfortable, plus aggregation that captures what the shorter retention sacrifices.
