Designing API Audit Logs: What to Record, What to Skip, and the Patterns That Survive

Audit logs are the kind of feature that is trivial to add badly and surprisingly hard to add well. The wrong shape produces logs that are unsearchable, untrustworthy, expensive to store, and useless when someone actually asks what happened.

Audit logs occupy an awkward position in API design. They are not really a feature in the customer-visible sense, but customers ask for them all the time. They are not really observability, though they overlap with it. They are not really compliance, though compliance often requires them. The result is that audit logs frequently get implemented as an afterthought by whoever is closest to the problem, with predictable consequences for the team that has to use them six months later when an incident actually requires reading them.

We have written and rewritten audit log systems several times across DocuMint, CronPing, FlagBit, and WebhookVault. The patterns that hold up are about treating audit as its own concern with its own data model, distinct from application logs and from observability metrics, and about being deliberate about what the log is for before deciding what to record.

The three audiences for audit logs

The first question to settle is who reads the audit log. The audiences are different enough that trying to serve all of them with the same system produces something that serves none of them well.

The first audience is operations, during incidents. They want to know what changed, when, and by whom, with enough context to reconstruct the state of the system at the time of an outage. The latency requirement is loose (post-hoc analysis, not real-time alerting), but the schema requirement is tight: the operator who is reading the log at 3am should not need to consult the source code to understand what each field means.

The second audience is security, after a compromise. They want a trustworthy record of what an attacker did, with enough detail to scope the breach. The trustworthiness requirement is strict: an audit log that the attacker could have tampered with after the compromise is worse than no audit log at all, because it invites false confidence in the conclusions drawn from it.

The third audience is compliance and customer support, for routine inquiries. Customers want to know who on their team did what (especially for shared accounts), regulators want demonstrable controls, and support wants to answer "why was my flag set to off" without paging an engineer. The latency and trustworthiness requirements are looser; the discoverability and query requirements are tighter.

Each audience has different priorities, so the schema and infrastructure decisions should be made with the priority order explicit. We have weighted operations and security highest, and we handle compliance through careful field selection rather than separate infrastructure.

What to record

The minimum useful audit record has seven fields: timestamp, actor, action, target, before-state, after-state, and request ID. The timestamp is server-side and high-precision; the actor is the authenticated principal who performed the action (a user ID, an API key ID, or a system process); the action is a verb-object string like "flag.create" or "endpoint.delete"; the target is the resource that was acted on; the before-state and after-state are structured representations of the relevant fields; the request ID ties the audit entry to corresponding application logs.

Two of those fields are easy to underspecify. Actor is not just "user 42" but should include the authentication method (password login, OAuth token, API key) and the IP address. This matters because the same user can legitimately have multiple sessions with different privilege scopes, and a compromised API key looks different in the audit log than a compromised password.
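As a rough illustration of the seven fields and the richer actor, the record might be modeled as below. This is a minimal sketch, not a schema we are prescribing: the class names Actor and AuditRecord, the field names, and the split of the target into a type and an ID are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Actor:
    # Hypothetical shape: who acted, how they authenticated, and from where.
    actor_type: str                    # "user", "api_key", or "system"
    actor_id: str
    auth_method: str                   # e.g. "password", "oauth", "api_key"
    ip_address: Optional[str] = None   # None for internal system processes


@dataclass(frozen=True)
class AuditRecord:
    timestamp: datetime       # server-side, timezone-aware, microsecond precision
    actor: Actor
    action: str               # verb-object string such as "flag.create"
    target_type: str          # the target is split into type + ID (assumption)
    target_id: str
    before: Optional[dict]    # None for a create
    after: Optional[dict]     # None for a delete
    request_id: str           # ties the entry to application logs

    def to_json(self) -> str:
        payload = asdict(self)
        payload["timestamp"] = self.timestamp.isoformat()
        return json.dumps(payload, sort_keys=True)


record = AuditRecord(
    timestamp=datetime(2024, 1, 1, 3, 0, 0, 123456, tzinfo=timezone.utc),
    actor=Actor("api_key", "key_17", "api_key", "203.0.113.7"),
    action="flag.update",
    target_type="flag",
    target_id="99",
    before={"enabled": True},
    after={"enabled": False},
    request_id="req_abc123",
)
```

Keeping the authentication method and IP address on the actor, rather than in a free-form metadata blob, is what lets a reader tell a compromised API key from a compromised password without parsing anything.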

Before-state and after-state are easy to either over-record or under-record. The right amount is the minimum needed to reconstruct what changed: for an update, the previous values of the changed fields and the new values; for a delete, the full row before deletion; for a create, the full row as created. Recording the entire row on every update wastes storage and obscures what changed; recording only the field names without values makes the log useless for post-hoc investigation.
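The rule for updates, deletes, and creates can be sketched as a small helper. The function name state_delta and the convention of passing None for a missing side are assumptions for illustration; rows are assumed to be flat dictionaries with string keys.

```python
from typing import Optional, Tuple

_MISSING = object()  # distinguishes "key absent" from "value is None"


def state_delta(
    before: Optional[dict], after: Optional[dict]
) -> Tuple[Optional[dict], Optional[dict]]:
    """Return the (before, after) pair to store in the audit record."""
    if before is None and after is None:
        raise ValueError("at least one of before/after is required")
    if before is None:   # create: full row as created
        return None, dict(after)
    if after is None:    # delete: full row before deletion
        return dict(before), None
    # update: only the fields whose values differ
    changed = sorted(
        key
        for key in set(before) | set(after)
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    )
    old = {key: before[key] for key in changed if key in before}
    new = {key: after[key] for key in changed if key in after}
    return old, new
```

For an update that flips one field on a three-field row, this stores one key on each side rather than the whole row twice.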

What to skip

Logging everything is the canonical mistake. The audit log becomes too noisy to be useful, storage costs balloon, and the signal-to-noise ratio drops below the threshold where anyone wants to read it.

The first thing to skip is read operations on most resources. A log of every GET request is observability data, not audit data. The exceptions are reads of sensitive resources (PII, financial data, security configuration) where the act of reading is itself the audit-relevant event.

The second thing to skip is internal system operations that are not customer-relevant. Background job completions, cache warming, scheduled rollups, internal reconciliations: these belong in application logs, not in the audit log. They clutter the audit feed without contributing to the audit purpose.

The third thing to skip is intermediate states. If a single user action triggers ten internal database writes (through cascades, denormalization updates, etc.), the audit log should record the one user action, not the ten writes. The user did one thing; the audit should reflect that.

The discipline that produces a useful audit log is to ask, for each potential record, whether someone reading the log six months from now will care that this happened. If the answer is no, the record does not belong.
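The three skip rules can be expressed as a single predicate that runs before anything is written. This is a sketch under assumptions: the Operation type, its field names, and the contents of SENSITIVE_RESOURCES are hypothetical, and a real system would decide these per resource.

```python
from dataclasses import dataclass

# Hypothetical set of resource types whose reads are audit-relevant.
SENSITIVE_RESOURCES = frozenset({"pii", "payment_method", "security_config"})


@dataclass(frozen=True)
class Operation:
    kind: str              # "read" or "write"
    resource_type: str
    internal_only: bool    # background jobs, cache warming, rollups, reconciliations
    is_top_level: bool     # False for cascaded or denormalization writes


def should_audit(op: Operation) -> bool:
    if op.internal_only:
        return False       # belongs in application logs
    if not op.is_top_level:
        return False       # record the one user action, not its ten writes
    if op.kind == "read":
        return op.resource_type in SENSITIVE_RESOURCES
    return True
```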

Storage and immutability

Audit logs need to be append-only from the application's perspective. This is enforced more by convention than by technology in most setups: there is no application code path that updates or deletes audit log entries, and the database role used by the application has only INSERT privileges on the audit log table.
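An INSERT-only role is a feature of server databases and does not fit in a self-contained example, so the sketch below swaps in a different mechanism with a similar effect: triggers that reject UPDATE and DELETE, shown here on an in-memory SQLite database. The table layout, trigger names, and the try_statement helper are assumptions for illustration, and this is not a substitute for restricting the role's privileges.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit_log (
        id     INTEGER PRIMARY KEY,
        action TEXT NOT NULL
    );

    -- Reject any attempt to rewrite or remove history.
    CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
    BEGIN
        SELECT RAISE(ABORT, 'audit_log is append-only');
    END;

    CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
    BEGIN
        SELECT RAISE(ABORT, 'audit_log is append-only');
    END;
    """
)


def try_statement(connection: sqlite3.Connection, sql: str) -> bool:
    """Run a statement and report whether the database accepted it."""
    try:
        connection.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False


inserted = try_statement(conn, "INSERT INTO audit_log (action) VALUES ('flag.create')")
updated = try_statement(conn, "UPDATE audit_log SET action = 'tampered'")
deleted = try_statement(conn, "DELETE FROM audit_log")
```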

For higher-assurance use cases, technological enforcement is available. Hash chaining (each entry includes the hash of the previous entry, like a blockchain at low scale) makes tampering detectable. S3 Object Lock in compliance mode, or an equivalent, provides write-once-read-many storage that no user, including the account's root user, can overwrite or delete until the retention period expires. The cost is operational overhead; the benefit is that the audit log remains trustworthy even after a credentials compromise.
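Hash chaining is simple enough to sketch in a few lines. The example below assumes SHA-256 over a canonical JSON encoding of each entry; the function names, the genesis value, and the in-memory list are illustrative choices, not a description of any particular product.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical encoding so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {
            "payload": payload,
            "prev_hash": prev_hash,
            "hash": entry_hash(prev_hash, payload),
        }
    )


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or removed interior entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

One caveat worth stating: an attacker who can rewrite the entire table can also recompute every hash, so the chain only proves anything if the latest hash is periodically copied somewhere the attacker cannot reach.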

The practical default for most SaaS is a normal database table with append-only application discipline, plus replication and backups. For products that have explicit security audit requirements (SOC 2 Type II, ISO 27001, banking customers), the higher-assurance options earn their cost. The two-tier pattern—application audit log for everyone, write-once archive for regulated tenants—lets you scale the assurance level to the customer.

Retention is the other storage question. A reasonable default for SaaS is 90 days hot and 7 years cold. The 90 days covers the operational and most security use cases; the 7 years covers the compliance use case. The cold tier can be cheap (S3 Glacier-class) because access is rare. The hot/cold split keeps the active table small enough to query efficiently.
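The tiering decision reduces to an age check that an archival job could run over each entry. In this sketch the function name, the tier labels, and the treatment of 7 years as 7 × 365 days are assumptions; a real retention policy should use whatever definition its compliance regime specifies.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=90)
COLD_WINDOW = timedelta(days=7 * 365)  # approximation: ignores leap days


def storage_tier(entry_time: datetime, now: datetime) -> str:
    """Classify an audit entry as "hot", "cold", or "expired" by its age."""
    age = now - entry_time
    if age <= HOT_WINDOW:
        return "hot"       # stays in the queryable table
    if age <= COLD_WINDOW:
        return "cold"      # moved to cheap archival storage
    return "expired"       # past retention; eligible for deletion
```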

Query patterns and indexing

The query patterns for audit logs are narrow and predictable. There are four: by actor (what did this user do?), by target (who modified this resource?), by time range (what happened during this incident?), and by action type (who created flags last week?). Designing for these queries from the start avoids the trap of building an audit log that you cannot actually query.

The schema that supports these queries efficiently is a single denormalized table with indexes on (actor_id, timestamp), (target_type, target_id, timestamp), and (timestamp) alone. Action-type queries are usually bounded by a time range, so they can use the timestamp index and filter on the action; if they become frequent, an index on (action, timestamp) is the natural addition. The denormalization is deliberate: audit logs are usually queried in isolation rather than joined with current state, and the historical accuracy of the audit log requires that it not reflect later changes to user names, resource names, or other mutable fields. The audit entry should record "user 42 named Alice modified flag 99 named billing-enabled" with all the names captured at the time, not "user 42 modified flag 99" requiring joins to the current users and flags tables.
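A minimal sketch of that table and its three indexes, using an in-memory SQLite database so it runs anywhere. The column names, index names, and JSON text columns for state are assumptions for the example; the point is the shape, with names captured at write time and one index per query pattern.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audit_log (
    id          INTEGER PRIMARY KEY,
    timestamp   TEXT NOT NULL,   -- ISO 8601, UTC
    actor_id    TEXT NOT NULL,
    actor_name  TEXT NOT NULL,   -- captured at write time, never joined
    action      TEXT NOT NULL,
    target_type TEXT NOT NULL,
    target_id   TEXT NOT NULL,
    target_name TEXT NOT NULL,   -- captured at write time, never joined
    before_json TEXT,
    after_json  TEXT,
    request_id  TEXT NOT NULL
);
CREATE INDEX audit_by_actor  ON audit_log (actor_id, timestamp);
CREATE INDEX audit_by_target ON audit_log (target_type, target_id, timestamp);
CREATE INDEX audit_by_time   ON audit_log (timestamp);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO audit_log (timestamp, actor_id, actor_name, action, target_type,"
    " target_id, target_name, before_json, after_json, request_id)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (
        "2024-01-01T03:00:00.123456+00:00", "42", "Alice", "flag.update",
        "flag", "99", "billing-enabled",
        '{"enabled": true}', '{"enabled": false}', "req_abc123",
    ),
)

# "Who modified this resource?" answered without touching users or flags tables.
history = conn.execute(
    "SELECT actor_name, action, target_name FROM audit_log"
    " WHERE target_type = ? AND target_id = ? ORDER BY timestamp",
    ("flag", "99"),
).fetchall()
```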

The transactional outbox question

Audit logs are tightly coupled to the actions they record. The audit entry for an action should exist if and only if the action succeeded. The way to enforce this is to insert the audit entry in the same transaction as the action: either both commit or neither does.

This is easy when the action is a single database transaction. It is harder when the action involves external side effects (a payment, a webhook send, an email). The pattern that handles the external-side-effect case is the transactional outbox: the action and the audit entry both go in the database transaction, plus a row in an outbox table that triggers the external side effect after the commit. If the commit fails, neither the audit entry nor the side effect happens.
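The outbox pattern can be sketched with three tables and one transaction. Everything named here (the endpoints, audit_log, and outbox tables, the create_endpoint and drain_outbox functions, the fail_before_commit switch) is hypothetical, and SQLite stands in for whatever database the application actually uses.

```python
import json
import sqlite3

# isolation_level=None puts the connection in autocommit mode so the
# transaction boundaries below are explicit BEGIN / COMMIT / ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(
    """
    CREATE TABLE endpoints (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
    CREATE TABLE audit_log (id INTEGER PRIMARY KEY, action TEXT NOT NULL,
                            target_id INTEGER NOT NULL);
    CREATE TABLE outbox    (id INTEGER PRIMARY KEY, kind TEXT NOT NULL,
                            payload TEXT NOT NULL, sent INTEGER NOT NULL DEFAULT 0);
    """
)


def create_endpoint(conn, url, fail_before_commit=False):
    """Write the action, its audit entry, and the pending side effect atomically."""
    conn.execute("BEGIN")
    try:
        endpoint_id = conn.execute(
            "INSERT INTO endpoints (url) VALUES (?)", (url,)
        ).lastrowid
        conn.execute(
            "INSERT INTO audit_log (action, target_id) VALUES (?, ?)",
            ("endpoint.create", endpoint_id),
        )
        conn.execute(
            "INSERT INTO outbox (kind, payload) VALUES (?, ?)",
            ("email.endpoint_created", json.dumps({"endpoint_id": endpoint_id})),
        )
        if fail_before_commit:  # test hook to simulate a failed commit
            raise RuntimeError("simulated failure")
        conn.execute("COMMIT")
        return endpoint_id
    except Exception:
        conn.execute("ROLLBACK")
        raise


def drain_outbox(conn, send):
    """Relay step, run after commit: perform each pending side effect, then mark it sent."""
    rows = conn.execute(
        "SELECT id, kind, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, kind, payload in rows:
        send(kind, json.loads(payload))
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A relay built this way delivers at least once: if it crashes between sending and marking the row, the side effect is repeated on the next run, so the receiving side should tolerate duplicates.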

Inconsistency between application state and audit state is where audit logs fail most painfully. An audit entry that says an action happened when the application state shows it did not is a worse problem than no audit entry at all. Transactional discipline avoids this.

What audit logs do not do

Audit logs do not replace real-time monitoring. The audit log is for post-hoc analysis; alerting on audit entries is a sign that the audit log is being used for the wrong job.

Audit logs do not prevent bad actions. They record what happened; they do not enforce policy. The enforcement layer is authentication, authorization, and rate limiting. Audit logs let you reconstruct the failure after the enforcement layer is bypassed; they do not prevent the bypass.

Audit logs do not fully substitute for application logs. The audit log captures intentional state changes by authenticated actors; the application log captures everything else (errors, request timing, internal events). The two are complementary, not redundant.

The deeper observation

The audit log is one of those features whose absence is unremarkable and whose presence-done-poorly is worse than absence. The work that makes an audit log useful—the schema design, the indexing, the storage tiering, the transactional integration—does not feel like work that customers see. The customers who notice are the ones for whom the audit log was load-bearing during an incident, and they remember.

The discipline that produces a good audit log is treating it as a first-class subsystem from the beginning, with its own model, its own ownership, and its own quality bar. The teams that retrofit audit logs onto existing systems usually end up rebuilding them once they discover what the original design missed.
