Over the past 15 years, our team members have built and operated five managed PostgreSQL services. Across all of them, one configuration has remained constant: strict memory overcommit. In this blog post, we will explain how strict memory overcommit protects your database from catastrophic OOM (out of memory) kills. We will also share how a one-character kernel bug forced us to temporarily disable this setting. Finally, we will explain our heuristic for determining the right memory overcommit limit. Hopefully, this will help you find the right setting for your workloads.

For most processes, handling an OOM kill is simple: the process restarts, reconnects, and picks up where it left off. PostgreSQL is different. PostgreSQL's postmaster (its main supervisor process) forks a backend process for each connection. These backends share memory segments that hold shared buffers, WAL buffers, lock tables, and other shared state. The OOM killer doesn't understand this architecture. It simply picks a process based on a heuristic (usually the process that uses the most memory) and terminates it. If that backend was modifying a shared memory segment, the segment may be left in an inconsistent state. Shared memory has no transactional guarantees at the OS level. A half-written page in shared buffers means silent data corruption.

PostgreSQL's postmaster knows this. When it detects that any of its child processes has been killed, it assumes the worst: shared memory may be corrupted, and corrupted shared memory risks corrupting the stored data as well. To prevent this, the postmaster terminates all remaining backends. Every active connection is dropped. Every in-flight transaction is aborted. On its next start, the database goes through crash recovery.

This is the correct behavior. PostgreSQL is protecting your data. But it means a single OOM kill doesn't just affect one connection. It takes down every connection on the server. On top of that, if the write volume was high, replaying the WAL during crash recovery can take a long time. A single out-of-memory event can therefore cause a long outage.

Linux allows processes to allocate more virtual memory than is physically available. When a process allocates memory, for example with malloc(), the kernel reserves virtual address space for it. However, the kernel does not immediately back that space with physical memory. Physical pages are only consumed when the process actually touches the memory. The kernel relies on the assumption that not all allocated memory will be actively used at the same time. Usually, this assumption holds. When it doesn't, the kernel invokes the OOM killer to free memory by terminating a process.
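To make the lazy backing concrete, here is a minimal C sketch (the 4 GB and 64 MB sizes are arbitrary, chosen only for illustration): it reserves 4 GB of address space, touches just the first 64 MB, and prints its own VmSize and VmRSS from /proc/self/status at each step. Under the default heuristic policy the large request typically succeeds even on a machine with far less free memory, because physical pages are consumed only as they are written.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print this process's VmSize (virtual) and VmRSS (resident) lines. */
static void show_memory(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    printf("-- %s --\n", label);
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmSize", 6) == 0 || strncmp(line, "VmRSS", 5) == 0)
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    size_t virtual_size = 4UL << 30;           /* 4 GB of address space (assumes 64-bit) */

    show_memory("before malloc");
    char *p = malloc(virtual_size);
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    show_memory("after malloc, untouched");    /* VmSize jumps, VmRSS barely moves */

    memset(p, 1, 64UL << 20);                  /* touch only the first 64 MB */
    show_memory("after touching 64 MB");       /* VmRSS grows by roughly 64 MB */

    free(p);
    return 0;
}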

Under strict overcommit, an allocation that would push committed memory past the limit fails immediately with an ENOMEM error code. PostgreSQL handles this gracefully. A backend that cannot allocate memory reports an error to the client, aborts the transaction, and continues. The postmaster stays up. Other connections remain unaffected. This is a routine error, not a catastrophe. Strict overcommit thus converts late, destructive failures into early, graceful ones. This trade-off works best when the machine is dedicated to PostgreSQL and a small set of known sidecar processes. In that scenario, the committed memory profile is predictable and the limit can be tuned with confidence. On shared machines running diverse workloads, committed memory becomes harder to predict: an unrelated process can use up the commit budget and cause PostgreSQL to receive ENOMEM errors even when the database load itself is fine.
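For contrast with the OOM-kill path, here is a small sketch (our illustration, not PostgreSQL code) of what the early, graceful failure looks like from a program's point of view: the allocation is refused up front, the caller reports it and moves on, and nothing gets killed.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: the "early, graceful" failure mode under strict
 * overcommit. A PostgreSQL backend does the moral equivalent of the
 * error branch below: report the error, abort the transaction, and
 * keep the connection and the server alive. */
int main(void)
{
    size_t request = 1UL << 36;            /* deliberately oversized: 64 GB */
    void *buf = malloc(request);

    if (buf == NULL && errno == ENOMEM) {
        fprintf(stderr, "allocation of %zu bytes refused: %s\n",
                request, strerror(errno));
        /* Recover here: retry with a smaller request, shed load, etc. */
        return 0;
    }
    free(buf);
    return 0;
}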

It is possible to configure how the kernel behaves when processes ask for memory. Linux provides three overcommit policies via vm.overcommit_memory:

- vm.overcommit_memory = 0 (heuristic, the default): the kernel grants most requests and refuses only obviously excessive ones. Committed_AS is tracked but not enforced.
- vm.overcommit_memory = 1 (always): the kernel grants every request, with no accounting check at all.
- vm.overcommit_memory = 2 (strict): the kernel refuses any allocation that would push Committed_AS above the CommitLimit.

Under strict overcommit, the kernel has two knobs that set the CommitLimit: vm.overcommit_kbytes and vm.overcommit_ratio. The CommitLimit is calculated as:

CommitLimit = (MemTotal - total hugetlb memory) * overcommit_ratio / 100 + SwapTotal

If vm.overcommit_kbytes is set to a non-zero value, it takes precedence over the ratio:

CommitLimit = overcommit_kbytes + SwapTotal

For example, on a machine with 8 GB of RAM, no swap, no huge pages, and the default overcommit_ratio of 50, the CommitLimit works out to about 4 GB.
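Both numbers are exposed in /proc/meminfo, so the headroom is easy to watch. As a small sketch written for this post (a one-line grep or any monitoring agent does the same job), the following C program prints CommitLimit, Committed_AS, and the remaining headroom; under vm.overcommit_memory = 2, accountable allocations start failing with ENOMEM once that headroom is exhausted.

#include <stdio.h>
#include <string.h>

/* Return the value (in kB) of a "<key>: <value> kB" line from /proc/meminfo,
 * or -1 if the key is not found. */
static long meminfo_kb(const char *key)
{
    char line[256];
    long value = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        size_t klen = strlen(key);
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            sscanf(line + klen + 1, "%ld", &value);
            break;
        }
    }
    fclose(f);
    return value;
}

int main(void)
{
    long commit_limit = meminfo_kb("CommitLimit");
    long committed_as = meminfo_kb("Committed_AS");

    if (commit_limit < 0 || committed_as < 0) {
        fprintf(stderr, "could not read /proc/meminfo\n");
        return 1;
    }
    printf("CommitLimit:  %ld kB\n", commit_limit);
    printf("Committed_AS: %ld kB\n", committed_as);
    printf("Headroom:     %ld kB\n", commit_limit - committed_as);
    /* Under vm.overcommit_memory=2, new accountable allocations start
     * failing with ENOMEM once this headroom reaches zero. */
    return 0;
}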

A Kernel Bug and 648 GB of Phantom Memory

We always favored strict overcommit for PostgreSQL. We used it in the previous managed PostgreSQL services we built and also in Ubicloud PostgreSQL. This time, however, enabling it quickly led to trouble. A few weeks after we turned on strict memory overcommit, some of the databases started failing with out of memory errors, even though there was plenty of free physical memory on their machines. We disabled strict memory overcommit and started investigating.

Discovery

The first clue came from a routine check of /proc/meminfo on one of our servers with 8 GB of memory:

$ > cat /proc/meminfo | grep "Committed_AS"
Committed_AS:   683547672 kB

651 GB of committed memory on an 8 GB machine! For comparison, a healthy server of the same size showed:

$ > cat /proc/meminfo | grep "Committed_AS"
Committed_AS:     2703940 kB

The counter was off by orders of magnitude.

Narrowing It Down

We first looked at ps output.

$ > ps -C postgres -o pid,vsz,rss,cmd --sort=-vsz
  PID     VSZ   RSS CMD
96622 2242244 95416 postgres: 18/main: postgres postgres...
95721 2241668 94708 postgres: 18/main: postgres postgres...
96414 2241436 94892 postgres: 18/main: postgres postgres...
96619 2241076 93308 postgres: 18/main: postgres postgres...
96417 2240900 94300 postgres: 18/main: postgres postgres...
95728 2240736 93864 postgres: 18/main: postgres postgres...
96620 2240736 92852 postgres: 18/main: postgres postgres...
95727 2240428 93640 postgres: 18/main: postgres postgres...
96623 2239840 93164 postgres: 18/main: postgres postgres...

VSZ is the total virtual address space a process has mapped, and RSS is the physical memory it is actually using. In the output above, each backend shows ~2 GB of VSZ covering its entire mapped address space, but a much smaller RSS (~95 MB) reflecting the memory it is actively using. On this 8 GB VM we configure 2 GB of shared_buffers, and if you think ~2 GB of VSZ is suspiciously close to the shared_buffers size, you are right. Most of each backend's VSZ is actually the shared memory segment that holds shared_buffers. Every backend maps the same 2 GB region into its own address space, so it shows up in each backend's VSZ. With many backends, the VSZ numbers add up quickly.

That said, none of this should inflate Committed_AS. The shared memory segment appears in every backend's address space but physically exists only once, so it should be counted only once. On top of that, we run PostgreSQL with huge_pages = on, so shared_buffers is allocated from hugetlb. Hugetlb mappings have their own separate reservation accounting and are not supposed to count toward Committed_AS at all. Still, the 2 GB hugetlb region was by far the largest mapping in each backend, and hugetlb accounting is a special case in the kernel. That made it the most natural place to start looking, so our first hypothesis was that the kernel was somehow miscounting these mappings, for example charging them once per process instead of ignoring them.

To verify, we checked the VMA (Virtual Memory Area) flags on the hugetlb mapping via /proc/<pid>/smaps. Each VMA has a set of flags, and the ac flag (VM_ACCOUNT) indicates that the region counts toward committed memory:

$ > sudo cat /proc/321784/smaps | grep -A 25 "hugepage"
7fce75000000-7fcef0c00000 rw-s 00000000 00:10 10723551   /anon_hugepage (deleted)
Size:            2027520 kB
Shared_Hugetlb:   393216 kB
Private_Hugetlb:       0 kB
...
VmFlags: rd wr sh mr mw me ms de ht sd

No ac flag. Hugetlb mappings were correctly excluded from committed memory accounting, so that hypothesis was ruled out.

We then summed accountable memory (VMAs with the ac flag) across all processes on the machine:

$ > sudo awk '/^Size/{size=$2} /VmFlags:/ && / ac/{sum+=size} END{printf "%.2f GB\n", sum/1048576}' /proc/[0-9]*/smaps
2.43 GB

2.43 GB accountable vs 651 GB reported: 648 GB of phantom committed memory. The vm_committed_as counter was leaking. We suspected that the memory was being charged on allocation but was never recredited. This made us consider a potential kernel bug in committed memory calculation.

Fleet-Wide Analysis

At that time, we had two different kernels in use across our fleet. We checked our entire fleet of PostgreSQL servers and compared the ratio of Committed_AS to MemTotal against kernel version and uptime:

Metric                       Kernel 6.5.0   Kernel 6.8.0
Median Ratio                 0.55           0.27
Mean Ratio                   24.97          0.32
Max Ratio                    3,405          1.86
Servers with a ratio > 1.0   23%            < 1%

We also ran a statistical analysis and found that a server running the 6.5 kernel was 52x more likely to have inflated committed memory.

On 6.5 servers, inflation was positively correlated with uptime: the leak grew at roughly 4.7% compounded per week. On 6.8 servers, there was no such correlation.

This analysis significantly strengthened our hypothesis that this was a kernel bug.

The One-Character Bug

To have definitive proof, we tasked an LLM with reviewing every commit between 6.5.0 and 6.8.0 for possible bug fixes in committed memory calculations. It quickly found the following.

The bug was introduced in Linux 6.5 by commit 408579c. This commit changed the return convention of do_vmi_align_munmap():

Before: 0 = success, 1 = success with lock downgraded, negative = error
After: always 0 for success, negative = error

The commit updated callers throughout the mm subsystem. However, in mm/mremap.c, inside move_vma(), the error check was converted incorrectly:

BEFORE (correct): error handler runs on negative return (on error)

if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {
    /* OOM: unable to split vma, just get accounts right */
    if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
        vm_acct_memory(old_len >> PAGE_SHIFT);
}

AFTER (broken): error handler runs when return is 0 (on success)

if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
    /* OOM: unable to split vma, just get accounts right */
    if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
        vm_acct_memory(old_len >> PAGE_SHIFT);
}

The change from < 0 to ! inverted the condition. To understand why this matters, consider what move_vma() does. It first decrements Committed_AS for the old region as part of the move, then calls do_vmi_munmap() to actually unmap it. If the unmap fails, the kernel needs to increment the counter back to keep accounting correct. After all, unmap has failed and the old region still exists. Its charge must be restored. With the inverted condition, this re-increment runs on every successful mremap instead of only on failure. The counter grew monotonically with every memory remap operation.
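To make the mechanism concrete, here is a minimal C sketch (our illustration, not kernel code and not something we ran in production) that forces mremap() to repeatedly move a private, accountable mapping, which goes through move_vma() on every iteration. On a kernel carrying the inverted check, each successful move should re-charge the old region, so Committed_AS in /proc/meminfo climbs steadily while the program's actual memory use stays flat; on a fixed kernel it should stay roughly constant.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;   /* 64 MB private anonymous mapping (VM_ACCOUNT) */

    /* The mapping we will move around, plus a second address range to
     * ping-pong it into. MREMAP_FIXED always takes the move_vma() path. */
    void *cur = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *other = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cur == MAP_FAILED || other == MAP_FAILED) { perror("mmap"); return 1; }
    memset(cur, 1, len);       /* touch it so it is a real, charged region */

    for (int i = 0; i < 1000; i++) {
        /* Move the mapping to the other slot; the old range is unmapped as
         * part of the move. On an affected kernel, the erroneous
         * vm_acct_memory() call fires on every successful iteration. */
        void *moved = mremap(cur, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, other);
        if (moved == MAP_FAILED) { perror("mremap"); return 1; }
        other = cur;           /* old location is now free, reuse it next time */
        cur = moved;
    }

    /* Compare Committed_AS in /proc/meminfo before and after running this:
     * it should stay roughly flat on a fixed kernel and jump by about
     * 1000 * 64 MB (~64 GB) on an affected one. */
    return 0;
}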

The bug was reported here and bisected here. Linus himself analyzed the root cause and fixed it with a one-line change, reverting the condition back to < 0:

- if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
+ if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {

As Linus Torvalds wrote in the fix:

This didn't change any actual VM behavior _except_ for memory accounting when 'VM_ACCOUNT' was set on the vma. Which made the wrong return value test fairly subtle, since everything continues to work.

Or rather - it continues to work but the "Committed memory" accounting goes all wonky (Committed_AS value in /proc/meminfo), and depending on settings that then causes problems much much later as the VM relies on bogus statistics for its heuristics.

This is the kind of bug that hides in plain sight. Under heuristic overcommit (the default), Committed_AS is purely informational. The kernel doesn't use it to gate allocations. The bug only causes failures under non-default strict overcommit mode, so it went unnoticed. The failure is also indirect. The accounting drifts silently for weeks before Committed_AS finally crosses CommitLimit and allocations start failing.