Audit Logs That Hold Up in Production
An audit log is one of those features that looks easy from the outside and reveals its difficulty only when you need it. The first time someone asks "who changed this and when?" you discover whether your audit log was designed for query or for ceremony. Here is what separates the two.
An audit log is a record of what happened. The phrase suggests something simple: append a row whenever an action occurs, query the rows when an investigation calls for it. In practice, audit logs are one of the most consistently underdesigned features in production software. They get added late, they capture too little or too much, they grow unboundedly, they leak information they should not contain, and when an incident requires them they often turn out to be unusable.
The reason for the consistent failure is that audit logs span three different concerns that pull in different directions. They are operational telemetry (what is the system doing?), security artifacts (who did this and when?), and compliance evidence (can we prove we followed our own rules?). A log designed for one of these is usually wrong for the other two. The audit log that works in production is designed for one primary purpose, with the other two treated as explicit, accepted trade-offs.
What to capture
The schema that holds up in production has, at minimum, six fields. The actor, identifying who performed the action: usually a user ID, together with the source of authentication (session, API key, service account), because those have different audit implications. The action, a verb-object pair from a closed vocabulary: user.created, invoice.deleted, flag.toggled. The target, the object the action was performed on, identified by a stable reference. The timestamp, in UTC, with sub-second precision. The request context, including the IP address, the user agent, and (critically) a request ID that links to your operational logs. And the change set, a structured record of what changed, ideally as before/after pairs for affected fields.
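One way to pin the six fields down is a small typed record. The following is a minimal sketch rather than a definitive schema: the class name AuditEntry, the field names, and the ALLOWED_ACTIONS set are assumptions made for illustration, and the request context is flattened into three attributes.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any, Dict
import json

# Closed vocabulary of actions. The three names are the examples from the
# text; a real application would list its own (assumption).
ALLOWED_ACTIONS = frozenset({"user.created", "invoice.deleted", "flag.toggled"})


@dataclass(frozen=True)
class AuditEntry:
    actor_id: str
    auth_source: str      # "session", "api_key", or "service_account"
    action: str           # verb-object pair from ALLOWED_ACTIONS
    target_type: str
    target_id: str        # stable reference to the affected object
    request_id: str       # links the entry to operational logs
    ip_address: str
    user_agent: str
    # Change set: field name -> {"before": ..., "after": ...}
    changes: Dict[str, Dict[str, Any]]
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        # Reject free-form actions so the log stays queryable.
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if self.timestamp.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")

    def to_json(self) -> str:
        record = asdict(self)
        # Always serialize in UTC with microsecond precision.
        record["timestamp"] = self.timestamp.astimezone(timezone.utc).isoformat(
            timespec="microseconds"
        )
        return json.dumps(record, sort_keys=True)
```

Validating the action at construction time is what keeps the vocabulary closed: a typo or an ad hoc description fails loudly instead of landing in the log.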
The fields people often add and then regret are free-form descriptions ("user X did something to Y") that resist queries, and full request bodies that contain sensitive data and inflate storage. The fields people often skip and then need are the source of authentication and the request ID. Without the source of authentication, you cannot distinguish a session-based user action from a programmatic one performed via API key. Without the request ID, you cannot correlate the audit entry with the operational logs that contain the technical details of how the action was processed.
Storage decisions
The next decision is where to store the audit log, and the answer depends on access patterns. If queries are rare and historical, a separate database (or even a separate schema in the same database) optimized for append and rare scan is appropriate. If queries are frequent and operational ("show me everything user X did in the last hour"), the log should be indexed for those queries and probably co-located with the application database for transactional consistency.
The trap to avoid is logging to the same table as your operational logs. The two have different retention requirements, different sensitivity levels, and different consumers. Compliance teams need audit logs for years; operational logs are often retained for days or weeks. Audit logs frequently contain personally identifiable information that should be access-controlled; operational logs typically should not contain such information at all.
For applications running on SQLite or PostgreSQL, a dedicated audit_log table with an index on (actor_id, timestamp) and another on (target_type, target_id, timestamp) covers most query patterns. For applications with high write volume, an append-only log shipper that writes to object storage with a naming convention like audit/yyyy/mm/dd/hh.jsonl is harder to query but scales to volumes that would strain a single database table. The decision comes down to how often you query the log and how fast the answer needs to come back.
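As a rough illustration of both options, the sketch below creates the audit_log table with the two indexes in SQLite and builds an object-storage key following the hourly naming convention. The column list, the index names, and the helper names open_audit_db and object_key are assumptions for the example.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE audit_log (
    id          INTEGER PRIMARY KEY,
    actor_id    TEXT NOT NULL,
    auth_source TEXT NOT NULL,
    action      TEXT NOT NULL,
    target_type TEXT NOT NULL,
    target_id   TEXT NOT NULL,
    timestamp   TEXT NOT NULL,  -- ISO 8601 in UTC, so strings sort by time
    request_id  TEXT NOT NULL,
    changes     TEXT NOT NULL   -- JSON before/after pairs
);
-- "What did this actor do recently?"
CREATE INDEX idx_audit_actor ON audit_log (actor_id, timestamp);
-- "What happened to this object?"
CREATE INDEX idx_audit_target ON audit_log (target_type, target_id, timestamp);
"""


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the audit table and its indexes."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def object_key(ts: datetime) -> str:
    """Build an hourly key such as audit/2024/03/05/14.jsonl.

    Expects a timezone-aware datetime; it is converted to UTC first.
    """
    return ts.astimezone(timezone.utc).strftime("audit/%Y/%m/%d/%H.jsonl")
```

Running EXPLAIN QUERY PLAN against the queries you expect to issue is a cheap way to confirm that each one actually hits an index.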
Transactional consistency
The single most common bug in audit log implementations is logging outside the transaction that performs the audited action. The application performs the action, commits the transaction, then writes the audit log entry. If the audit write fails, the action has already happened but the log is missing. The reverse ordering fails too: if the audit entry is written first and the action then fails or is rolled back, the log shows an action that never actually completed.
The fix is to write the audit log entry inside the same transaction as the action. In SQL, this is straightforward: the INSERT into the audit_log table goes in the same transaction as the action, and if either fails the entire transaction rolls back. In distributed systems where the action and the audit live in different services, the consistency story is harder; the standard pattern is to write a transactional outbox that gets processed asynchronously, with the consumer being idempotent so that double-delivery does not produce duplicate audit entries.
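For the single-database case, a minimal sketch with Python's sqlite3 module looks like this. The invoices table, the column names, and the delete_invoice function are hypothetical; the point is only that the DELETE and the audit INSERT share one transaction.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE invoices (id TEXT PRIMARY KEY, amount INTEGER NOT NULL);
        CREATE TABLE audit_log (
            id        INTEGER PRIMARY KEY,
            actor_id  TEXT NOT NULL,
            action    TEXT NOT NULL,
            target_id TEXT NOT NULL,
            changes   TEXT NOT NULL
        );
    """)
    return conn


def delete_invoice(conn: sqlite3.Connection, actor_id: str, invoice_id: str) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the DELETE and the audit INSERT succeed or fail together.
    with conn:
        row = conn.execute(
            "SELECT amount FROM invoices WHERE id = ?", (invoice_id,)
        ).fetchone()
        if row is None:
            raise LookupError(invoice_id)
        conn.execute("DELETE FROM invoices WHERE id = ?", (invoice_id,))
        conn.execute(
            "INSERT INTO audit_log (actor_id, action, target_id, changes) "
            "VALUES (?, ?, ?, ?)",
            (
                actor_id,
                "invoice.deleted",
                invoice_id,
                json.dumps({"amount": {"before": row[0], "after": None}}),
            ),
        )
```

If the audit INSERT fails (for example, because actor_id is missing and violates the NOT NULL constraint), the DELETE is rolled back with it and the invoice is still there.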
Tamper resistance
The deeper question for an audit log is whether it can be trusted. The first line of defense is that the audit log is append-only at the application layer: the code that writes to the audit log never updates or deletes existing rows. The second is that the storage is configured with the same restriction at the database layer, so that even a SQL-injection bug or compromised credential cannot quietly modify history.
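What "the same restriction at the database layer" looks like depends on the database. In SQLite, one option is a pair of triggers that abort any UPDATE or DELETE; the sketch below assumes a simplified two-column table and hypothetical trigger names. In PostgreSQL the more usual route would be to revoke UPDATE and DELETE on the table from the application's role.

```python
import sqlite3

APPEND_ONLY_SCHEMA = """
CREATE TABLE audit_log (
    id     INTEGER PRIMARY KEY,
    action TEXT NOT NULL
);
-- Abort any attempt to change an existing row.
CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
-- Abort any attempt to remove a row.
CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
"""


def open_append_only_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(APPEND_ONLY_SCHEMA)
    return conn
```

Triggers only stop statements that go through the normal SQL path. Anyone who can drop the trigger or edit the database file directly can still bypass them, which is why the further layers matter.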
For higher-stakes applications, the next layer is hash-chaining: each audit entry includes a hash of the previous entry's content, so any modification to historical rows breaks the chain and is detectable on verification. The cost is that you need to verify the chain periodically (a daily cron that walks the chain and alerts on inconsistencies is sufficient for most cases). The benefit is that an attacker who gains write access to the audit log cannot quietly rewrite history without leaving evidence, provided the latest hash is also recorded somewhere they cannot reach; otherwise they could simply recompute the chain from the modified row onward.
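A hash chain fits in a few lines. The sketch below is one common variant, not the only one: each entry stores the previous entry's hash, and its own hash covers both that value and its payload. The in-memory list, the GENESIS constant, and the function names are assumptions for illustration; a real log would keep prev_hash and hash as columns.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{prev_hash}\n{body}".encode("utf-8")).hexdigest()


def append(chain: List[dict], payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "hash": entry_hash(prev, payload)}
    )


def first_broken_index(chain: List[dict]) -> Optional[int]:
    """Walk the chain; return the index of the first bad entry, or None."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        recomputed = entry_hash(prev, entry["payload"])
        if entry["prev_hash"] != prev or entry["hash"] != recomputed:
            return i
        prev = entry["hash"]
    return None
```

The periodic verification job is then a call to first_broken_index over the stored rows, alerting whenever the result is not None.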
The strongest level is to ship audit entries to an external write-once store: AWS S3 with Object Lock, Google Cloud Storage with retention policies, or a dedicated WORM (write once read many) appliance. The trade-off is operational complexity: now your audit log lives in two places and the consistency between them is its own concern.
Retention and privacy
Audit logs accumulate. A year of audit data on a modestly active SaaS can run to tens of millions of rows. Without a retention policy the table grows until it dominates database backups, query plans, and storage costs. The retention policy should be explicit, written down, and enforced by automation: rows older than N days are either deleted or archived to cold storage.
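The automation can be as small as one scheduled function. The following sketch handles the delete case only and assumes timestamps are stored as UTC ISO 8601 strings in a uniform format, so that string comparison matches time order. The table layout and the names utc_iso and enforce_retention are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_log ("
        "id INTEGER PRIMARY KEY, action TEXT NOT NULL, timestamp TEXT NOT NULL)"
    )
    return conn


def utc_iso(ts: datetime) -> str:
    # One fixed format everywhere, so lexicographic order equals time order.
    return ts.astimezone(timezone.utc).isoformat(timespec="microseconds")


def enforce_retention(
    conn: sqlite3.Connection, retention_days: int, now: Optional[datetime] = None
) -> int:
    """Delete rows older than retention_days; return how many were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = utc_iso(now - timedelta(days=retention_days))
    with conn:
        cur = conn.execute("DELETE FROM audit_log WHERE timestamp < ?", (cutoff,))
    return cur.rowcount
```

An archiving variant would copy the matching rows to cold storage inside the same function before deleting them.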
The complication is privacy. Audit logs frequently contain personal data (email addresses, IP addresses, content of operations). Privacy regulations like GDPR require that personal data be retained only as long as it has a legitimate purpose. The legitimate-purpose argument for audit logs is strong but not infinite; most legal teams will accept retention of one to seven years for security and compliance purposes, but resist anything beyond that without specific justification.
The pattern that works is to separate the structured action log (which can be retained long-term as it is small) from the contextual data (IP addresses, user agents, request bodies) which is retained for a shorter window. After the privacy retention window, the contextual data is purged but the action history remains, allowing historical analysis ("what types of actions happened in 2023?") without retaining personally identifiable details.
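One way to implement the split is two tables with different lifetimes. This is a sketch under assumed names (audit_action, audit_context, purge_context); the cutoff is passed in as a UTC ISO 8601 string that matches the stored timestamp format.

```python
import sqlite3

SCHEMA = """
-- Long-lived: the structured record of what happened.
CREATE TABLE audit_action (
    id        INTEGER PRIMARY KEY,
    actor_id  TEXT NOT NULL,
    action    TEXT NOT NULL,
    target_id TEXT NOT NULL,
    timestamp TEXT NOT NULL
);
-- Short-lived: contextual data kept only for the privacy window.
CREATE TABLE audit_context (
    action_id  INTEGER PRIMARY KEY REFERENCES audit_action(id),
    ip_address TEXT,
    user_agent TEXT
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def purge_context(conn: sqlite3.Connection, cutoff_iso: str) -> int:
    """Drop context rows for actions older than the cutoff; keep the actions."""
    with conn:
        cur = conn.execute(
            "DELETE FROM audit_context WHERE action_id IN "
            "(SELECT id FROM audit_action WHERE timestamp < ?)",
            (cutoff_iso,),
        )
    return cur.rowcount
```

After the purge, a query such as counting actions per type for a past year still works, because it only touches audit_action.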
Testing the audit log
The audit log only works if it captures what you expect. The discipline that makes this real is to write integration tests that perform actions and assert on the audit entries that result. The test creates a user, calls the relevant API, queries the audit log, and asserts that the expected entry is present with the expected fields. This catches the case where someone refactors the application code and forgets to call the audit logger from the new code path.
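Such a test can be short. The sketch below uses unittest against an in-memory SQLite database; create_api_key stands in for the real API call, and the api_key.created action name and table layout are assumptions for the example.

```python
import sqlite3
import unittest


def make_app_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE api_keys (id TEXT PRIMARY KEY, owner_id TEXT NOT NULL);
        CREATE TABLE audit_log (
            id         INTEGER PRIMARY KEY,
            actor_id   TEXT NOT NULL,
            action     TEXT NOT NULL,
            target_id  TEXT NOT NULL,
            request_id TEXT NOT NULL
        );
    """)
    return conn


def create_api_key(conn, actor_id: str, key_id: str, request_id: str) -> None:
    """Stand-in for the application code path under test."""
    with conn:
        conn.execute("INSERT INTO api_keys VALUES (?, ?)", (key_id, actor_id))
        conn.execute(
            "INSERT INTO audit_log (actor_id, action, target_id, request_id) "
            "VALUES (?, ?, ?, ?)",
            (actor_id, "api_key.created", key_id, request_id),
        )


class ApiKeyAuditTest(unittest.TestCase):
    def setUp(self) -> None:
        self.conn = make_app_db()

    def test_creation_is_audited(self) -> None:
        create_api_key(self.conn, "user-1", "key-1", "req-abc")
        rows = self.conn.execute(
            "SELECT actor_id, action, target_id, request_id FROM audit_log"
        ).fetchall()
        # Exactly one entry, with every expected field populated.
        self.assertEqual(rows, [("user-1", "api_key.created", "key-1", "req-abc")])
```

Saved in a module, this runs with python -m unittest. If a refactor moves key creation to a code path that skips the audit INSERT, the assertion fails on an empty list.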
The second discipline is the periodic restore drill: pick a recent audit entry at random and walk through what it captured. Can you tell who did what, when, and why? Can you correlate it with your operational logs? Can you reconstruct the state of the system before and after the action? If any of those questions are unanswerable, the audit log has a gap that needs closing.
The four APIs we run (DocuMint, CronPing, FlagBit, and WebhookVault) each maintain their own audit log for sensitive actions: API key creation, billing changes, project deletions. The schema is consistent across products because the discipline is consistent: an audit log is a feature in its own right, not a side effect of doing other features properly.