I quit the security circus a few years ago. Every now and then, though, an old friend still asks me about some kernel trick, a bug that hasn't quite died yet, or help with a little privesc gig... so I keep using a bit of my spare time to write some code and update the few old xdev pieces I have left — what I call the survivors, sigh. Page-cache bugs were (are?) one of the best primitives... reliable, silent, resilient, and most of the time pretty well hidden in the VM code. So while I was sighing over yet another good one being gone, an ol' friend asked me out of the blue: "I'm stuck in this freaking container, and the one I actually need is sitting right next to it — I'd have to bounce through the host to get there, asap..." Sometimes people forget that the hard way isn't always the right one. Sometimes you need the host. But sometimes you just need to own all the other siblings and pop the right one. In the end the host is mostly just the shell — the actual content sits inside the pods.
I'm not really used to making things public, mind. The last time I dropped a kernel exploit in the open, it cost the rest of us vsyscall going RO on grsec — eheh, right Brad? :) — almost seventeen years ago now. Lesson learned. But this time it is different: nothing to kill, nothing to stop, probably nothing to learn, just a bit funny.
The context. Plain Docker, Debian-based image, almost completely "airgapped". No path to the host (netfilter dropping everything), no inter-container communication (--icc=false probably), no special caps on the local process, no bind mounts, no shared host resources of any kind, custom kernel and a constrained seccomp space. The only interesting detail was that the container was just one of the N microservices the application was running. There were others. Many, many others.
A small disclaimer before moving on: page-cache bugs like this one aren't only filesystem-cache issues. Depending on the kernel interface, the drivers, the caps, you can turn them into something quite a bit nastier — but maybe we'll get to that in another post, once I'm sure I'm not killing anything still useful. For now let's pretend we are only dealing with the filesystem side.
Sharing pages between siblings. There are plenty of container and sandbox flavours nowadays, but most of those that share a common kernel also share an overlayfs. Both Docker and Kubernetes (with the standard containerd / overlay2 snapshotter) build each container's rootfs as an overlay: a stack of read-only lower directories (the image layers on disk) merged beneath a private writable upper directory per container. Every container started from the same base image points to the same lower directories — same on-disk files, same ext4 inodes, same physical page-cache pages.
If you spawn two containers from the same base — say, two ubuntu:22.04 siblings called c1 and c2 — and ask each of them about a specific library, you'll get back the same inode number, the same size. They are not seeing two copies; they are looking at the same file through their respective overlay merges. And if you inspect both containers from the host, the LowerDir field of their overlay configuration points at the exact same path. One layer on disk, two doors into it.
The interesting question is what happens when somebody tries to change something in there.
Say c1 does the most ordinary thing imaginable from the inside: it writes a few bytes into /lib64/ld-linux-x86-64.so.2, the dynamic linker every binary in the image relies on. Overlayfs spots that the file is currently served from the lower and, before letting the write through, copies it up into c1's private upper directory — a brand new inode in the upperdir, a fresh set of pages, all of them holding the modified bytes. From that moment on c1 reads its own private, modified linker; c2, asking the kernel for the very same path, still walks down the overlay stack and lands on the lowerdir's original file, with its original inode and its original page in cache. Nothing of c1's write ever leaks into c2's view. This is the famous copy-up, and it is the boundary that normally keeps siblings from interfering with each other through the filesystem.
Where the dead bug comes in. The exploit primitive does not play that game at all. It pulls the page-cache page in via a zero-copy path and pokes 4 chosen bytes into it through kmap using the kernel's direct mapping. No PTE fault, no PG_dirty bit set, no writeback queue, no inode marked dirty, no write_iter ever called, no copy-up triggered. From the kernel's point of view no "write" in the fs sense ever happened: the bytes just appear, in place, inside the lowerdir's page-cache page. And since every sibling container is reading the linker through that same lower, all of them see the corruption on their next access — same physical page, same 4 corrupted bytes, simultaneously, in c1, in c2, and in any other container running off that base. The disk underneath is "usually" never touched (you can verify it with an O_DIRECT read against the layer file on the host): the corruption lives in memory only, scoped to the lifetime of the page cache, and disappears the moment all containers holding a reference to it stop.
So with one shot from c1, we can write into the lower's /lib/x86_64-linux-gnu/libc.so.6, the dynamic linker, /bin/bash, or any binary common to the layer, and every sibling container running off the same base will read the corrupted bytes on its next access. Which siblings actually share what is environment-dependent, of course, but on a typical microservice deployment a single curated base — Debian-slim or some internal flavour of it — sits under a lot of different services. One container ends up being the puppeteer of all the others, regardless of network policy.
There was still the small problem of communication. No network, no IPC, no shared sockets, no shared files outside the lower itself. But that lower page is something both sides can read straight out of their own libc/ld mmap, and both sides can write into through the same 4-byte primitive — same kernel, same algif surface, same default seccomp. So the page becomes, in effect, a tiny full-duplex shared-memory link between siblings.
I went and dug out some really old code from a VirtualBox escape I wrote almost ten years ago, where the shared resource was a memory-mapped page between a host process and the guest. Different stage, same actor — replace "host/guest mmap" with "overlay lowerdir page" and the rest of the protocol just recycles. I stripped most of it down, kept the idea, and put a small dummy command protocol on top: a control word, a request slot and a reply slot, both ends polling. Good enough to ship a bare-bones command exec tool into a sibling and get the output back, with no network and no shared file descriptor whatsoever, and, of course, without the need to access the host.
It's just a PoC, but maybe it'll still be useful to someone else. The lesson, if there is one: go full distroless. The bugs still alive out there are mostly just waiting for the next LLM round to come kill them anyway — no mercy, no effort, and, for now at least, no posts about their hacks. Yet.