Postgres track_io_timing: Measuring How Much Time Queries Spend Waiting for Disk

track_io_timing turns I/O time into a per-query observable. Off by default because of historical clock-read cost. On modern hardware the cost is negligible and the data is essential for distinguishing CPU-bound from disk-bound queries.

Postgres ships with a configuration parameter called track_io_timing that turns block read and write durations into a measured value rather than an inferred one. It is off by default. On modern hardware with a fast clock source, the cost is small enough that most production databases benefit from turning it on. The data it produces is the difference between guessing whether a query is slow because of CPU or disk and knowing.

What track_io_timing exposes

With the parameter enabled, several views grow new columns. pg_stat_database adds blk_read_time and blk_write_time, total milliseconds the database has spent reading from and writing to disk across all backends. pg_stat_statements adds the same fields per normalized query, scaled by call count. EXPLAIN ANALYZE with BUFFERS adds I/O Timings to each plan node, attributing read and write time to the specific operator that incurred it.

The numbers are measured by reading the system clock immediately before and after each block-level I/O syscall. The difference is wall-clock time the backend spent in read() or write() or the equivalent. On a system where the page is already in shared_buffers, the time is zero. On a system where the page must be fetched from the OS page cache, the time is small. On a system where the page must be fetched from disk, the time is the actual cost of the disk access.

Why it is off by default

The default value is off because clock_gettime was historically expensive on some platforms. Postgres calls clock_gettime millions of times per second under a busy workload. If each call costs a microsecond, the cumulative overhead is meaningful. The conservative default reflects a 2005-era performance model where the clock-read cost was the binding constraint on instrumentation.

That cost is largely gone on Linux with modern x86-64 hardware. The kernel exposes a vDSO entry for clock_gettime that reads the TSC register directly without a syscall, costing tens of nanoseconds. ARM and other architectures vary, but the trend is the same. The postgres source ships a tool called pg_test_timing that measures clock_gettime cost directly. On a modern Linux x86-64 system the typical result is 15 to 30 nanoseconds per call. The conservative default is more conservative than necessary for most production environments in 2026.

The questions it answers

The most common diagnostic question on a slow database is whether a query is slow because of CPU or because of disk. Without I/O timing, the question has to be answered by inference. A query that reads many blocks (visible in BUFFERS or pg_stat_statements) is probably I/O-bound. A query that consumes CPU on the backend (visible in pg_stat_activity but not quantified per query) is probably CPU-bound. The inference is approximately right but not precise. With I/O timing, the answer is a number. A query that takes 200 ms and spent 180 ms in disk I/O is disk-bound. A query that takes 200 ms and spent 5 ms in disk I/O is CPU-bound.

A related question is what the buffer cache is actually doing. The shared_blks_hit and shared_blks_read counters tell you the ratio of cache hits to misses, but they do not tell you how much time the misses cost. A workload with a 99 percent hit ratio that still spends most of its time in I/O is a workload where the 1 percent of misses are very slow, which is a different problem from a workload with a 90 percent hit ratio where each miss is fast.

A third question is where bgwriter and checkpointer time goes. pg_stat_bgwriter.checkpoint_write_time and checkpoint_sync_time become measured values rather than inferred from log messages, which makes capacity planning for the WAL and shared_buffers configuration much more concrete.

Enabling it carefully

The discipline is to run pg_test_timing on the production hardware before enabling track_io_timing globally. The tool reports the median timing call cost and a distribution. If the median is below 100 nanoseconds, the overhead of enabling track_io_timing is unlikely to be measurable in workload terms. If the median is higher, the trade-off needs more thought.

The parameter can be set globally in postgresql.conf with a reload, set per database with ALTER DATABASE, set per role with ALTER ROLE, or set per session with SET. The most common production deployment is global, since the data is most useful when correlated across many queries. The most cautious deployment is per-session for diagnostic windows, which gives the same data without the global overhead but requires application support.

The setting takes effect immediately on reload. Existing connections see the change. The cumulative counters in pg_stat_database start incrementing from zero at the time of the reload, so historical comparisons require accounting for the start time.

The pg_stat_statements integration

pg_stat_statements with track_io_timing on becomes a substantially more useful view. The blk_read_time and blk_write_time columns are summed across all calls of each normalized query, which means a query that runs frequently and is slightly I/O-bound shows up correctly compared to a query that runs rarely and is heavily I/O-bound. Without I/O timing, the same comparison has to be made through buffer counters, which are less directly comparable.

The everyday diagnostic query becomes: sort pg_stat_statements by blk_read_time descending and look at the top ten. These are the queries spending the most cumulative time waiting for disk. Some will be expected (large analytical queries, reports). Some will be surprises (a small frequent query whose working set has grown beyond shared_buffers).

The fix is usually one of three. Increase shared_buffers so the working set fits in cache. Add an index that reduces the buffer count per query. Or rewrite the query to access less data. I/O timing makes the third option easier to evaluate because it shows directly which query rewrites would matter.

The EXPLAIN ANALYZE integration

EXPLAIN (ANALYZE, BUFFERS) with track_io_timing on shows per-node I/O timing. The output gains lines like "I/O Timings: shared/local read=12.345 write=0.000" under each plan node that did I/O. The lines attribute disk wait time to specific operators, which is useful for diagnosing plans where the bottleneck is unclear from the operator names alone.

A plan with a Sort operator that spills to temp files shows non-zero temp read and write times. A plan with a Bitmap Heap Scan that fetches many heap blocks shows shared read time on the scan. A plan with an Index Scan that benefits from a warm cache shows zero or near-zero I/O time, even if the buffer counts are high, because the blocks were already in cache.

The data is per-node, which means the diagnostic story is much more specific than the aggregate query timing. A query that takes 500 ms and spends 480 ms in one Seq Scan node is a different problem from a query that takes 500 ms and distributes the time across many smaller operations.

What track_io_timing does not show

The view does not show OS-level I/O behavior. A read that hits the OS page cache and returns immediately shows zero I/O time, even though the actual disk was not touched. A read that goes through the OS page cache but blocks because of memory pressure shows the wait time, but the cause is not in any postgres view.

The view does not show network I/O. A connection that is slow because the client is slow to consume results, or because the network between client and server has latency, shows the latency on the client side or in pg_stat_activity wait events, not in track_io_timing.

The view does not show lock waits. A query that is slow because it is blocked on a row lock or table lock shows up in pg_stat_activity wait events, not in I/O timing. The two diagnostic dimensions are complementary.

Our use

We use SQLite across our products, where I/O timing is not exposed in the same way. SQLite's instrumentation is less rich, but the pragma stats and the EXPLAIN QUERY PLAN output are sufficient for our small databases. When we eventually migrate any product to Postgres, track_io_timing on is in the launch configuration. The cost on modern hardware is negligible. The data is essential for the kinds of diagnostic questions that come up in production. The decision is not whether to enable it but whether to enable it from day one or after the first slow-query incident.

The broader observation is that observability defaults in mature databases reflect performance trade-offs from earlier hardware generations. Many of those trade-offs no longer hold. The conservative defaults are conservative for reasons that have stopped being true. Treating the defaults as starting points rather than as recommendations is the discipline that produces useful observability on systems built since 2015.


Read more essays and technical writing at anethoth.com — a notebook on databases, distributed systems, biology, and the engineering that holds the world together.