You set a container memory limit of 1.5GB. Your application's top output inside the container shows RSS at 400MB. You think you have plenty of headroom. Then the OOM killer fires.
This is not a rare edge case. It is a predictable consequence of cgroup memory accounting, which counts things that process-level RSS does not.
What the Cgroup Limit Actually Counts
The Linux cgroup memory limit — what Docker's --memory flag sets — counts:
- Anonymous memory: heap, stack, mmap'd anonymous regions. This is the bulk of what RSS captures.
- Page cache: file-backed pages brought into memory by reads. If your application reads a 500MB file, those pages count against the cgroup limit even though RSS doesn't show them.
- tmpfs mounts: /dev/shm is a tmpfs. If you're using shared memory (Python multiprocessing, Redis --save to /dev/shm, or explicit shm_open calls), it counts. The default /dev/shm in a Docker container uses host tmpfs and counts against container memory.
- Kernel slab caches allocated for the cgroup's processes.
The gap between "RSS as reported by top" and "memory as reported by cgroup" can be large. A container doing significant file I/O or using shared memory extensively will show a small RSS but a large cgroup usage.
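To put a number on that gap for a single process, the sketch below compares VmRSS from /proc/<pid>/status with the cgroup's total. The helper names are illustrative, and it assumes a cgroup v2 layout with memory.current readable at /sys/fs/cgroup. In a multi-process container you would need to sum RSS across processes, which this does not do.

```python
def vmrss_bytes(status_text: str) -> int:
    """Extract VmRSS from the text of /proc/<pid>/status (reported in kB)."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     409600 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found in status text")


def accounting_gap(cgroup_current: int, rss: int) -> int:
    """Bytes charged to the cgroup that this process's RSS does not explain."""
    return cgroup_current - rss
```

Inside a container you could feed it `open("/proc/self/status").read()` and `int(open("/sys/fs/cgroup/memory.current").read())`; a large positive gap points at page cache, shmem, or kernel memory rather than the heap.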
Reading the Real Number
From inside the container:
# Current usage (bytes)
cat /sys/fs/cgroup/memory.current
# Detailed breakdown (cgroup v2)
cat /sys/fs/cgroup/memory.stat
The memory.stat file breaks out anon, file (page cache), shmem, kernel_stack, and more. If file is large and growing, your application is accumulating page cache. This is normal behavior (the kernel caches file reads aggressively), but it counts against your limit.
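If you would rather inspect the breakdown programmatically, here is a minimal sketch that parses the "key value" lines of memory.stat and ranks a few counters. The function names are made up for illustration. It also assumes the cgroup v2 convention that the file counter includes tmpfs and shared memory, so disk-backed cache is estimated as file minus shmem.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.stat (one "key value" pair per line)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def breakdown(stats, keys=("anon", "file", "shmem", "kernel_stack", "slab")):
    """Return (key, bytes) pairs for the selected counters, largest first."""
    picked = [(key, stats.get(key, 0)) for key in keys]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


def disk_backed_cache(stats) -> int:
    """Estimate page cache backed by real files (assumes file includes shmem)."""
    return max(stats.get("file", 0) - stats.get("shmem", 0), 0)
```

A typical call would be `breakdown(parse_memory_stat(open("/sys/fs/cgroup/memory.stat").read()))`, which puts the largest consumer first.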
From outside the container:
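docker stats --no-stream
docker stats reads the same cgroup counters (memory.current on cgroup v2, memory.usage_in_bytes on cgroup v1). The number you see in the MEM USAGE column is not RSS: it is derived from total cgroup accounting, although recent Docker CLI versions subtract inactive file cache before displaying it, so it can read lower than memory.current.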
The /dev/shm Trap
The most common source of surprise in production is /dev/shm. Python's multiprocessing module uses it for inter-process shared objects on Linux. So does any application that calls shm_open(). The default Docker container has /dev/shm mounted as a 64MB tmpfs. Whatever is written there counts against the memory limit, and if your application writes more than the shm size allows, you get ENOSPC (or SIGBUS when touching a mapped region), not an OOM.
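One way to see how close you are to that ceiling is to ask the filesystem directly. The sketch below uses the standard library's shutil.disk_usage. The function names are illustrative, and the path defaults to /dev/shm on the assumption that it is mounted as described above.

```python
import shutil


def tmpfs_usage(path: str = "/dev/shm") -> dict[str, int]:
    """Report size and usage, in bytes, of the filesystem mounted at `path`."""
    usage = shutil.disk_usage(path)
    return {"total": usage.total, "used": usage.used, "free": usage.free}


def fits(nbytes: int, path: str = "/dev/shm") -> bool:
    """True if `nbytes` more could currently be written under `path`."""
    return nbytes <= shutil.disk_usage(path).free
```

Calling `fits()` before allocating a large shared segment is only advisory, since other processes can fill the mount between the check and the write.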
You can explicitly size it:
docker run --shm-size=256m ...
Or in compose:
shm_size: '256m'
Debugging the Gap
When the OOM killer fires unexpectedly, the diagnostic workflow is:
- Check memory.current vs your limit: how close were you actually?
- Read memory.stat: is file (page cache) or shmem the contributor?
- If page cache: is your application reading large files repeatedly? Consider whether the access pattern is working-set-compatible with your limit.
- If shmem: find the shm consumers with ipcs -m inside the container, or look for /dev/shm usage.
- Check memory.events for the oom_kill count and the other event counters.
cat /sys/fs/cgroup/memory.events
The kernel will also log OOM events to dmesg with the process name and the pages requested, which gives you the proximate cause even if the root cause is accumulated page cache.
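The checks in this workflow can be folded into one small helper. What follows is a sketch rather than a complete tool: the function names and the shape of the returned dict are hypothetical. It assumes cgroup v2 files whose contents you have already read as text, and a numeric limit (memory.max contains the literal string "max" when no limit is set, which the caller would have to handle).

```python
def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse "key value" lines as used by memory.stat and memory.events."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            out[parts[0]] = int(parts[1])
    return out


def diagnose(current: int, limit: int, stat: dict, events: dict) -> dict:
    """Summarise how full the cgroup is and which counter dominates."""
    shmem = stat.get("shmem", 0)
    contributors = {
        "anon": stat.get("anon", 0),
        # Assumption: cgroup v2 counts shmem inside "file", so subtract it.
        "page_cache": max(stat.get("file", 0) - shmem, 0),
        "shmem": shmem,
    }
    return {
        "usage_fraction": current / limit,
        "largest": max(contributors, key=contributors.get),
        "oom_kills": events.get("oom_kill", 0),
    }
```

If `largest` comes back as `page_cache` or `shmem` while the process RSS looks small, that is the gap this article describes.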
What This Means for Limits
If your application does significant file I/O, your effective memory headroom is smaller than RSS suggests. Set limits with the real cgroup usage in mind, not the RSS reported inside the container. Monitor docker stats over time, not just at startup. And if you're seeing unexplained OOMs on workloads that look memory-light by RSS: read the cgroup files. The answer is almost always page cache or shared memory.
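For watching usage over time rather than at a single instant, a polling loop is enough to start with. The sketch below is illustrative only: the reader is passed in as a callable so it can wrap memory.current or any other source, and the 25% headroom default is an arbitrary placeholder, not a recommendation.

```python
import time
from typing import Callable


def sample_peak(read_current: Callable[[], int], samples: int,
                interval_s: float = 1.0,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll a usage reader `samples` times and return the highest value seen."""
    peak = 0
    for i in range(samples):
        peak = max(peak, read_current())
        if i < samples - 1:
            sleep(interval_s)  # no sleep after the final reading
    return peak


def limit_with_headroom(peak_bytes: int, headroom: float = 0.25) -> int:
    """Scale an observed peak by a headroom factor (placeholder default)."""
    return int(peak_bytes * (1 + headroom))
```

Inside a container, `read_current` could be `lambda: int(open("/sys/fs/cgroup/memory.current").read())`, assuming cgroup v2.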
—
Follow the work at anethoth.com and builds.anethoth.com.