What Your Database Is Doing While You Sleep: Background Maintenance Demystified

Most application developers experience their database as a black box that holds rows and accepts queries. The database is in fact a small operating system in its own right, with background processes doing perpetual maintenance: defragmenting storage, reclaiming space, recomputing statistics, propagating writes, expiring caches. Most of the time these processes are invisible, which is what success looks like. When they are not invisible, the symptoms range from mysterious to catastrophic.

It is worth knowing what your database does while you sleep, because eventually it will not keep up, and you will need to know which knob to turn.

The vacuum, in PostgreSQL

PostgreSQL uses a multi-version concurrency control (MVCC) model. When a row is updated, the old version is not overwritten in place; a new version is written, and the old version is marked dead. When a row is deleted, the row is marked dead but not actually removed. This produces excellent concurrency (readers and writers do not block each other) but means the table accumulates dead tuples that have to be cleaned up later.

The cleanup is called vacuuming. There are two kinds. Autovacuum runs continuously in the background, scanning tables that have accumulated enough dead tuples since the last vacuum, and reclaiming the space for reuse. VACUUM FULL is a heavyweight rewrite that returns the disk space to the operating system. The first is invisible; the second locks the table and is something you only run when something has gone wrong.

The thing to know about autovacuum is that it can fall behind. If a table is being updated faster than autovacuum can clean it, dead tuples accumulate, the table grows, queries slow down, and indexes become bloated. The symptom is "this query used to run in 5 ms and now takes 200 ms" and the cause is often that the underlying table is 60% dead rows. The fix is to tune autovacuum_vacuum_scale_factor downward (so vacuum runs more aggressively) and autovacuum_vacuum_cost_limit upward (so each vacuum run does more work).

The other thing to know is that vacuum updates the statistics that the query planner uses. If you have not vacuumed a table recently, the planner's row estimates can be wildly wrong, and it will pick bad plans. ANALYZE is the lightweight version that just refreshes statistics without touching dead tuples; autovacuum runs it as part of its work but you can run it manually after a large bulk import.

The purge thread, in MySQL/InnoDB

InnoDB has an analogous mechanism but uses different terminology. When a row is updated, the old version is written to the undo log (so transactions can roll back and other readers can see the old version under their snapshot). The undo log is held in the system tablespace until the row versions are no longer needed by any open transaction.

The purge thread walks the undo log and removes old versions that no transaction needs. If the purge thread falls behind (typically because of a long-running transaction that holds an old snapshot open), the undo log grows. The system tablespace grows with it, often to terabytes, and shrinks only with a full export-import cycle. The most famous symptom is "our database disk filled up" while the actual table data is small.

The lesson is that long-running transactions in InnoDB are an operational problem far beyond their immediate cost. A read-only reporting query that holds a transaction open for an hour is keeping every old row version produced in that hour from being purged. SELECT * FROM information_schema.innodb_trx WHERE trx_started < NOW() - INTERVAL 5 MINUTE is the query to monitor.

WAL checkpointing, in PostgreSQL and SQLite

Both PostgreSQL and SQLite use write-ahead logging. When a row is changed, the change is written first to a sequential log file (the WAL), then later flushed to the actual table data files. The advantage is that writes are sequential and fast; the disadvantage is that the WAL would grow without limit unless something occasionally takes the changes out of the WAL and applies them to the data files.

That something is the checkpoint. PostgreSQL runs checkpoints automatically every few minutes (controlled by checkpoint_timeout and max_wal_size). SQLite in WAL mode runs checkpoints whenever the WAL exceeds 1000 frames by default (controlled by wal_autocheckpoint).

The visible symptom of bad checkpoint configuration is bursty latency. A checkpoint flushes a large amount of dirty data to disk, which competes with foreground query traffic. If the checkpoint interval is too long, each checkpoint does a lot of work; queries that hit the disk during a checkpoint run slowly. If too short, the checkpoint overhead dominates write latency. The middle ground depends on the disk: SSDs make this less painful than spinning disks did, but the trade-off still exists.

Replication lag, in MySQL/PostgreSQL replicas

If you have a primary-replica setup, the replica is continuously consuming the WAL or binlog from the primary and replaying it. Replication lag is the difference between the primary's current position in the log and the replica's playback position. Lag is normal and small (milliseconds) under healthy conditions; it becomes a problem when reads from the replica return data older than the writer expects.

Lag grows when the replica cannot keep up. The usual causes are: a long transaction on the replica blocking replay, a big DDL on the primary, or the replica's hardware being weaker than the primary. The classic gotcha is reading from a read replica immediately after writing to the primary. If your application reads its own writes without a session-aware setup, the read can return the pre-write state.

The fix patterns are: route reads-after-writes to the primary; use synchronous replication for the writes that matter; or design around it (mark the replica's freshness target and only use it for queries that tolerate the lag).

Compaction, in MongoDB and document stores

Document stores like MongoDB do not generally have a vacuum analog because they do not use MVCC the same way. They do, however, have a compaction problem. As documents are inserted and deleted, the storage layer accumulates fragmentation: free space scattered through files, with active documents interspersed among holes.

The compact operation rewrites the storage to reclaim the holes. It is a heavyweight operation that locks the database and is generally run during a maintenance window. Modern MongoDB runs background compaction more incrementally with the WiredTiger engine, but the underlying problem is the same: storage that does not get compacted slowly degrades.

Cache eviction, everywhere

Every database has a cache (PostgreSQL's shared_buffers, MySQL's innodb_buffer_pool, MongoDB's WiredTiger cache, SQLite's page cache). The cache holds frequently-accessed pages in memory. When the cache fills, the database evicts pages using an LRU or clock algorithm. The eviction itself is mostly invisible until you run a query whose working set does not fit in the cache, at which point the database starts paging from disk for every row.

Cache hit rate is the metric that catches this. PostgreSQL's pg_stat_database.blks_hit / (blks_hit + blks_read) should be in the high 90 percent range. If it drops, either your cache is too small or your queries got worse. The fix is one of the easiest in operations: add more RAM and increase shared_buffers.

Index maintenance and bloat

Indexes also accumulate dead entries (deleted or moved rows that the index has not yet caught up with). PostgreSQL handles index bloat through autovacuum and a separate REINDEX command. MySQL's InnoDB rebuilds indexes via OPTIMIZE TABLE. Both are heavyweight; bloated indexes are usually only worth fixing if a specific query has slowed noticeably and the cause has been traced back to index size.

The lighter-weight prevention is partial indexes for soft-deleted data, regular reindexing of tables that see lots of bulk updates, and vigilance about the pg_stat_user_indexes view (or its MySQL analog) to spot indexes whose size is much larger than the underlying data warrants.

Why this matters

The reason to know this material is that the failure modes of background work are not random. They are predictable, monitorable, and tunable. Most production database problems eventually trace back to one of: vacuum lag, replication lag, checkpoint configuration, or cache pressure. Once you know which one, the fix is usually a configuration change rather than an architectural rewrite.

The discipline is to instrument these subsystems before they become problems. Watch the autovacuum log; alert on replication lag; track checkpoint frequency; monitor the cache hit rate. The database does enough work for you that it is reasonable to give it a few moments of attention back.

We make four developer APIs at DocuMint, CronPing, FlagBit, and WebhookVault. All four use SQLite, which gives us the simplest possible version of these problems: WAL checkpointing is the only background work we have to think about, and it is configured once and forgotten. The day we outgrow SQLite is the day this article becomes a checklist for us, not a reference.