Postgres pg_stat_io and Query-Level IO Attribution: Closing the Last Observability Gap

For most of Postgres history, IO observability worked at the database level via pg_stat_database and at the relation level via pg_statio_user_tables. The intermediate question of which queries were generating which IO required correlating pg_stat_statements rows against database-wide counters and inferring from time correlation rather than reading attribution directly. The inference worked for queries dominating workload but failed for the long tail of queries that together account for substantial IO.

Postgres 16 added pg_stat_io, which closes part of the gap by exposing IO counters split by backend type and IO context. The view does not answer query-level attribution directly but answers the prior question of where IO is happening cleanly enough that the residual investigation becomes tractable.

What pg_stat_io exposes

The view contains one row per combination of three categorical columns: backend_type identifies whether the IO came from a client backend, an autovacuum worker, a logical replication worker, the checkpointer, or one of the other process types. object identifies whether the IO targeted relations or temporary tables. context identifies the IO purpose: normal page read, bulk read, vacuum read, vacuum write, bulk write, or fsync.

For each category combination, the view exposes counts of reads, writes, extensions, evictions, fsyncs, hits, and the corresponding time spent on each operation when track_io_timing is enabled.

The questions it answers

The first useful question is the autovacuum IO share. The query is SELECT backend_type, sum(reads + writes) AS io, sum(read_time + write_time) AS time_ms FROM pg_stat_io GROUP BY backend_type ORDER BY time_ms DESC. The result distinguishes client backend IO from autovacuum IO from checkpointer IO. The breakdown was previously not directly visible without sampling pg_stat_activity.

The second useful question is the temp-file spill volume. Temp files appear under object = 'temp relation'. The volume measures how much work is spilling to disk because work_mem is too small for the hash and sort operations the query plan chose. The remediation is per-query SET LOCAL work_mem for queries identified as the spill source.

The third useful question is the bulk-read versus normal-read ratio. Bulk reads come from sequential scans on large tables. Normal reads come from index-driven lookups. The ratio reveals whether the workload is dominated by analytical scans or by transactional point queries, which informs decisions about shared_buffers sizing and read-replica routing.

What pg_stat_io does not show

The view shows IO grouped by backend type and context. It does not show which queries generated which IO. The query-level attribution requires correlating pg_stat_statements with sampled pg_stat_activity at IO-event boundaries, which is involved.

The view shows logical IO at the Postgres level. It does not show OS-level page cache behavior or storage-device latency. A read that hits the OS page cache appears as a physical read in pg_stat_io even though no disk activity occurred. The complete IO picture requires correlating with OS-level metrics.

The view does not show IO latency distributions, only totals. A workload with consistent 1ms reads and a workload with intermittent 100ms outlier reads can show the same total read_time. The distribution requires either application-level instrumentation or external storage monitoring.

The reset story

The view counters are cumulative since the last reset. Resetting requires SELECT pg_stat_reset_shared('io'). The pattern is similar to pg_stat_reset_shared('bgwriter') and the same caveats apply: resetting destroys baseline history and should be done deliberately not casually.

The typical use is computing deltas via two snapshots at known times rather than resetting, which preserves history and supports trend analysis. The query for deltas is straightforward: snapshot the view into a table at intervals and compute the difference between consecutive snapshots.

The configuration interaction

The track_io_timing parameter must be on for the time columns to populate. Without it, the count columns work but the time columns are zero. The overhead of track_io_timing on modern Linux with TSC clock source is small but not zero, and the parameter is off by default.

The shared_buffers sizing decision interacts with pg_stat_io interpretation. A workload showing high read counts and low cache hits suggests undersized shared_buffers. The same workload showing high read counts and high cache hits suggests appropriate sizing with read-heavy access. The two diagnoses have different remediations.

The patterns that compound value

Sampling pg_stat_io at one-minute intervals and storing snapshots produces a time-series of IO behavior split by backend type and context. The patterns visible at this resolution are not visible from totals: the relationship between autovacuum activity and client read latency, the diurnal cycle of bulk versus normal reads, the impact of deployments on temp-file generation.

Correlating pg_stat_io snapshots with pg_stat_statements top-queries via timestamp alignment closes most of the query-level attribution gap for queries that dominate workload. The long tail still requires sampling-based approaches but the dominant workload becomes attributable.

Alerting on rising temp_blks_written rate from pg_stat_database combined with pg_stat_io temp-context confirmation catches work_mem-exhaustion issues before they become customer-facing slow queries.

Our use

Our four products run on SQLite where pg_stat_io has no direct analog. The closest equivalent is SQLite's PRAGMA stats and the application-level instrumentation we run in our shared FastAPI middleware. The middleware records request duration and approximate page-access counts for queries crossing a threshold, which captures the same information at the granularity our scale requires.

Our Postgres migration plan includes pg_stat_io monitoring as a launch requirement for the analytics warehouse we plan to build for cross-product reporting. The decision about shared_buffers sizing depends on cache-hit-ratio data, the decision about read-replica routing depends on bulk-versus-normal ratio data, and the decision about work_mem depends on temp-file volume data. All three decisions are easier with pg_stat_io than without.

The broader pattern

The pg_stat_io addition is part of a multi-decade pattern of database observability improving incrementally as the operational community articulates the missing questions. The pattern is that databases that survive long enough accumulate views that answer questions operators had been asking via inferior methods. The views look obvious in retrospect and feel underused before they exist.

The deeper observation is that observability features compound over time. Each new view answers questions that previously required correlation across multiple existing views or sampling-based approximation. The compound effect is that operating a mature database in 2026 is qualitatively easier than operating the same database in 2010, even though the underlying engine and the workload patterns have not changed substantially. The improvement is in what the operator can see.

Our products: DocuMint (PDF invoice generation API), CronPing (cron job monitoring with status pages), FlagBit (feature flags API for modern teams), and WebhookVault (webhook capture and replay) put these patterns into production.