Why Your Docker Container Runs Out of Memory Before the OOM Killer Fires

You set a container memory limit of 1.5GB. Your application's top output inside the container shows RSS at 400MB. You think you have plenty of headroom. Then the OOM killer fires.

This is not a rare edge case. It is a predictable consequence of cgroup memory accounting, which counts things that process-level RSS does not.

What the Cgroup Limit Actually Counts

The Linux cgroup memory limit — what Docker's --memory flag sets — counts:

Anonymous memory: heap, stack, mmap'd anonymous regions. For most applications this is the bulk of what RSS reports.
Page cache: file-backed pages brought into memory by reads. If your application reads a 500MB file, those pages count against the cgroup limit even though RSS doesn't show them.
tmpfs mounts: /dev/shm is a tmpfs. If you're using shared memory (Python multiprocessing, a Redis instance configured to write its snapshots under /dev/shm, or explicit shm_open calls), it counts. Pages written to a container's /dev/shm are tmpfs pages charged to that container's cgroup, so they count against the container's memory limit.
Kernel slab caches allocated for the cgroup's processes.

The gap between "RSS as reported by top" and "memory as reported by cgroup" can be large. A container doing significant file I/O or using shared memory extensively will show a small RSS but a large cgroup usage.
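One rough way to see the gap on a running container is to add up per-process RSS and print it next to the cgroup counter. This is a minimal sketch, not a precise tool: it assumes a cgroup v2 host where the container sees its own counter at /sys/fs/cgroup/memory.current, and the function names sum_rss_kb and compare_rss_to_cgroup are made up for illustration.

```shell
# Sum VmRSS (in kB) over every process visible under a proc-style directory.
# Shared pages are counted once per process, so the sum can overstate usage.
sum_rss_kb() {
    procdir="${1:-/proc}"
    cat "$procdir"/[0-9]*/status 2>/dev/null |
        awk '/^VmRSS:/ { total += $2 } END { print total + 0 }'
}

# Print summed RSS next to the cgroup's own counter, both in MiB.
# Assumption: cgroup v2, with memory.current readable at the default path.
compare_rss_to_cgroup() {
    cgroup_file="${1:-/sys/fs/cgroup/memory.current}"
    rss_kb=$(sum_rss_kb)
    cgroup_bytes=$(cat "$cgroup_file")
    echo "summed RSS:     $((rss_kb / 1024)) MiB"
    echo "memory.current: $((cgroup_bytes / 1024 / 1024)) MiB"
}
```

Run compare_rss_to_cgroup inside the container. If memory.current is far above the summed RSS, the difference is memory that process-level tools are not showing you.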

Reading the Real Number

From inside the container:

# Current usage in bytes (cgroup v2; on cgroup v1 read /sys/fs/cgroup/memory/memory.usage_in_bytes)
cat /sys/fs/cgroup/memory.current

# Detailed breakdown (cgroup v2)
cat /sys/fs/cgroup/memory.stat

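The memory.stat file breaks out anon, file (page cache), shmem, kernel_stack, and more. Note that file includes shmem, because tmpfs pages live in the page cache. If file is large and growing, your application is accumulating page cache. This is normal behavior: the kernel caches file reads aggressively, and it can reclaim clean cache pages when the cgroup nears its limit. The cache still counts against your limit until it is reclaimed, though, and shmem pages cannot simply be dropped.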
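The raw file reports bytes and has dozens of lines, so a small filter helps when you only care about the main contributors. The sketch below assumes the cgroup v2 memory.stat format of one "name value" pair per line; stat_breakdown is a hypothetical helper name, and the set of counters it prints is just one reasonable selection.

```shell
# Print selected memory.stat counters in MiB.
# Argument: path to a memory.stat file (defaults to the cgroup v2 location).
stat_breakdown() {
    stat_file="${1:-/sys/fs/cgroup/memory.stat}"
    awk '
        $1 == "anon" || $1 == "file" || $1 == "shmem" ||
        $1 == "kernel_stack" || $1 == "slab" {
            printf "%-13s %8.1f MiB\n", $1, $2 / 1048576
        }
    ' "$stat_file"
}
```

Calling stat_breakdown with no argument inside the container reads the live counters; passing a path lets you run it against a saved copy of the file.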

From outside the container:

docker stats --no-stream

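The number in the MEM USAGE column is not RSS, but it is not the raw cgroup counter either. Per the Docker documentation, on Linux the CLI subtracts cache from total usage before displaying it: inactive_file on cgroup v2 hosts, total_inactive_file on cgroup v1 hosts (where the raw counter is memory.usage_in_bytes). The figure therefore still includes anonymous memory, shmem, and active page cache, and it can read lower than memory.current.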

The /dev/shm Trap

The most common source of surprise in production is /dev/shm. Python's multiprocessing module uses it for inter-process shared objects on Linux. So does any application that calls shm_open(). The default Docker container has /dev/shm mounted as a 64MB tmpfs. Whatever is stored there counts against the memory limit, and if your application writes more than the shm size allows, you get ENOSPC (or SIGBUS when touching a mapped region that the tmpfs cannot back), not an OOM.

You can explicitly size it:

docker run --shm-size=256m ...

Or in compose:

shm_size: '256m'
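Before resizing, it is worth checking how full the mount actually is. The following is a small sketch built on standard df and du; the helper names format_df_line, shm_usage, and shm_largest are invented for this example, and the parsing assumes the mount point contains no spaces.

```shell
# Turn the data line of `df -Pk` output (read from stdin) into a one-line summary.
# POSIX df columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
format_df_line() {
    awk 'NR == 2 { printf "%s: %s KiB used of %s KiB (%s)\n", $6, $3, $2, $5 }'
}

# Report how full a tmpfs mount is (defaults to /dev/shm).
shm_usage() {
    df -Pk "${1:-/dev/shm}" | format_df_line
}

# List the largest entries under the mount, sizes in KiB, biggest first.
shm_largest() {
    du -ak "${1:-/dev/shm}" 2>/dev/null | sort -rn | head -n "${2:-5}"
}
```

Run shm_usage inside the container first; if the mount is close to full, shm_largest shows which files are responsible.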

Debugging the Gap

When the OOM killer fires unexpectedly, the diagnostic workflow is:

Check memory.current vs your limit — how close were you actually?
Read memory.stat — is file (page cache) or shmem the contributor?
If page cache: is your application reading large files repeatedly? Consider whether the access pattern is working-set-compatible with your limit.
If shmem: find the shm consumers. ipcs -m inside the container lists System V segments; for POSIX shared memory and other tmpfs files, check df -h /dev/shm and ls -la /dev/shm.
Check memory.events for oom_kill count and recent history.

cat /sys/fs/cgroup/memory.events

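The kernel will also log OOM events to the kernel log with the process name and the pages requested, which gives you the proximate cause even if the root cause is accumulated page cache. Read it with dmesg on the host, since unprivileged containers typically cannot access the kernel log.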
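To pull a single counter out of memory.events, for example in a health check or alert, something like the following works. It is a sketch assuming the cgroup v2 format of one "name value" pair per line; events_counter is a hypothetical helper, not a standard tool.

```shell
# Print the value of one counter from a memory.events file.
# Arguments: file path (default: cgroup v2 location), counter name (default: oom_kill).
# Prints 0 if the counter is not present in the file.
events_counter() {
    events_file="${1:-/sys/fs/cgroup/memory.events}"
    key="${2:-oom_kill}"
    awk -v key="$key" '
        $1 == key { print $2; found = 1 }
        END { if (!found) print 0 }
    ' "$events_file"
}
```

In that file, oom_kill counts processes in the cgroup killed by the OOM killer, while max counts how often usage hit the limit and forced reclaim. A climbing max with oom_kill still at zero means the container is running near its limit but surviving on reclaim.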

What This Means for Limits

If your application does significant file I/O or uses shared memory, your effective memory headroom is smaller than RSS suggests. Set limits with the real cgroup usage in mind, not the RSS reported inside the container. Monitor cgroup usage over time, not just at startup. And if you're seeing unexplained OOMs on workloads that look memory-light by RSS: read the cgroup files. The answer is usually in the file and shmem counters.
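For monitoring over time, a simple starting point is usage as a percentage of the limit, sampled periodically. This sketch assumes cgroup v2, where the limit is exposed in memory.max and reads "max" when no limit is set; usage_percent is a hypothetical helper name and the 30-second interval is arbitrary.

```shell
# Print cgroup usage as a whole-number percentage of the limit,
# or "unlimited" when memory.max contains the literal string "max".
# Arguments: path to memory.current, path to memory.max (cgroup v2 defaults).
usage_percent() {
    current=$(cat "${1:-/sys/fs/cgroup/memory.current}")
    limit=$(cat "${2:-/sys/fs/cgroup/memory.max}")
    if [ "$limit" = "max" ]; then
        echo "unlimited"
    else
        echo "$((current * 100 / limit))"
    fi
}

# Example: log a sample every 30 seconds (runs until interrupted).
# while true; do echo "$(date +%T) $(usage_percent)"; sleep 30; done
```

Pairing each sample with the file and shmem values from memory.stat shows whether growth comes from cache, shared memory, or the application's own allocations.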
