Postgres pg_prewarm: Loading the Buffer Cache Before the First Query Hits
After a Postgres restart, the buffer cache is empty and every query that touches a previously-hot table pays cold-cache latency. pg_prewarm lets you push specific tables and indexes into shared_buffers before traffic arrives, with autoprewarm preserving the cache across restarts.
One of the less-discussed costs of a Postgres restart is what happens to the buffer cache. Whatever was in shared_buffers when the process exited is gone. The OS page cache may retain copies of the underlying files for a while, but the entire warmup curve has to restart. For a database with a multi-gigabyte working set, that means the first few minutes of post-restart traffic run at cold-cache latency: every query that touches a previously-hot table reads from disk, fills the buffer, and pays the IO cost the running database had amortized away long ago.
For most workloads, this is annoying but tolerable. For workloads with strict latency SLOs, a database failover, or a planned maintenance window where post-restart latency drives customer experience, it is a real problem. The pg_prewarm extension is the standard Postgres answer.
What pg_prewarm actually does
pg_prewarm is a contrib extension that loads relation data into either the operating system page cache or directly into Postgres shared_buffers. The simplest invocation is a SQL function call: SELECT pg_prewarm('table_name') reads every page of the table into the buffer cache, in physical order, returning the page count loaded.
The function takes optional arguments for the prewarm mode and the page range. Mode can be 'prefetch' (asynchronous OS-level prefetch via posix_fadvise), 'read' (synchronous OS-level read into the page cache without involving shared_buffers), or 'buffer' (synchronous read into shared_buffers). The 'buffer' mode is what most operators want: it guarantees that the data is in the cache layer Postgres queries actually use.
For indexes, the same function works with the index relation name. For a full table-and-indexes prewarm, the standard pattern is to query pg_class for every relation associated with the table and prewarm each one.
The autoprewarm worker
The more interesting feature is autoprewarm, added in Postgres 11. When pg_prewarm is loaded via shared_preload_libraries, a background worker process periodically writes the list of buffers currently in shared_buffers to a file in the data directory (autoprewarm.blocks). On startup, a different background worker reads that file and reloads the listed buffers in parallel.
The effect is that a planned restart preserves the buffer cache contents across the restart. The first queries that arrive after the database accepts connections find the same set of pages already in shared_buffers that were there before. For a 10 GB shared_buffers with 1.2 million 8 KB pages, autoprewarm typically takes a few minutes to reload, parallelized across the configured number of workers.
The reload happens after the database is accepting connections, not before, which means the first few queries can still race the prewarm and find their pages still on disk. This is rarely a problem in practice because the queries that need the hot pages are exactly the queries that benefit from the prewarm being in progress: they fill in their own pages either from disk or from the prewarm worker's parallel reads, whichever wins.
Configuration knobs that matter
The first knob is shared_preload_libraries. Without 'pg_prewarm' in this list, autoprewarm does not run, and the only access to pg_prewarm functionality is the manual SQL function. Changing shared_preload_libraries requires a restart.
The second is pg_prewarm.autoprewarm = on (the default when the library is loaded). The autoprewarm.blocks file is written every pg_prewarm.autoprewarm_interval seconds (default 300, i.e., five minutes). Lowering the interval makes the snapshot more current at the cost of more file IO; raising it accepts more staleness in exchange for less overhead.
The third is the number of parallel workers used for reload, controlled by max_parallel_workers and the cluster's general parallelism settings. autoprewarm spawns up to one worker per database that has buffers to load, and within each database the work is parallelized further.
Three diagnostic patterns
The first diagnostic is verifying that prewarm actually happened. Querying pg_buffercache after restart shows the current contents of shared_buffers, including relation OIDs and page numbers. If autoprewarm is working, the count of buffered pages should match closely with the count before the restart, modulo any pages that were dirty at shutdown and were not in the snapshot.
The second is measuring cold-cache latency before and after enabling prewarm. Run a representative query immediately after restart, with and without autoprewarm enabled, and compare the timings. The expected improvement for a query that scans a large hot table is several seconds for a cold OS cache to single-digit milliseconds for a warm buffer cache.
The third is monitoring autoprewarm worker progress during reload. The worker logs its progress to the server log when log_min_messages includes the relevant levels, and pg_stat_activity shows the worker process while it is running. If reload is taking longer than expected, the usual cause is contention on disk reads, which can be addressed by raising the parallel worker count if the storage can handle the additional concurrency.
When pg_prewarm is not the answer
For databases whose working set exceeds shared_buffers, pg_prewarm cannot fix the underlying problem: the cache will always be insufficient for the workload, and warmup just means filling it up faster before evicting cold data. The right answer in this case is either to grow shared_buffers (with the caveat that beyond about 25 percent of RAM, the law of diminishing returns sets in) or to accept the steady-state cache miss rate.
For databases that are restarted rarely, the operational complexity of autoprewarm is usually not worth the benefit. A database that runs for months between restarts will spend most of that time with a fully warm cache regardless of what happens at startup.
For replicas in a streaming replication setup, pg_prewarm runs independently on each instance, which means failover to a replica that has not been prewarmed will see cold-cache latency on the new primary. The standard pattern is to run pg_prewarm on standbys periodically, or to enable autoprewarm on all instances in the cluster.
Our SQLite-baseline products and Postgres migration plan
The current products in production are SQLite-backed, where the buffer-cache problem looks different: SQLite uses the OS page cache exclusively, and OS-level prefetch via posix_fadvise (or its equivalent on other platforms) is the analogous tool. On Linux with modern kernels, the page cache is rarely the binding constraint for SQLite workloads at the scales we operate at.
The planned Postgres migration includes pg_prewarm in the launch configuration: shared_preload_libraries will include pg_prewarm, autoprewarm.interval will start at 300 seconds, and shared_buffers will be sized at 25 percent of available RAM. Restart drills will include measuring cold-cache latency and confirming autoprewarm reduces it to acceptable levels before the production cutover.
The deeper observation is that database operational features come in two categories: ones that fix correctness problems and ones that fix latency problems. The correctness ones get integrated immediately because the consequences of skipping them are visible. The latency ones often go unused for years because the consequences of skipping them are visible only at the moments when the operational team is most distracted. pg_prewarm is in the latter category and benefits from being configured during normal operations, not during the incident that surfaces the need for it.
Read more essays and technical writing at anethoth.com — a notebook on databases, distributed systems, biology, and the engineering that holds the world together.