DEF CON 2024

Clash, Burn, and Exploit: Manipulate Filters to Pwn kernelCTF


This talk demonstrates how to exploit vulnerabilities in the Linux kernel's nftables subsystem to achieve local privilege escalation. The researcher details three specific bugs, including a double-free and an out-of-bounds access, triggered by manipulating nftables filter configurations and race conditions. The presentation provides a deep dive into the internal mechanics of nftables, specifically focusing on garbage collection and batch request handling. The researcher successfully demonstrates a local privilege escalation exploit on a target kernel.

Exploiting Linux Kernel nftables: From Double-Free to Root

TLDR: This research details how to achieve local privilege escalation in the Linux kernel by exploiting race conditions and memory management flaws within the nftables subsystem. By manipulating batch requests and triggering specific garbage collection behaviors, an attacker can induce a double-free or out-of-bounds access to gain control over kernel execution flow. For security researchers, this highlights the critical need to audit complex state-machine logic in kernel subsystems where asynchronous operations intersect with synchronous control planes.

The Linux kernel's nftables subsystem is a large, complex piece of infrastructure that replaced the legacy iptables framework. Because it handles packet filtering inside the kernel, a memory-safety flaw here can lead directly to full system compromise. Research presented at DEF CON 32 by HexRabbit shows how to identify and weaponize subtle logic bugs within this subsystem. If you are a researcher or a pentester, these primitives are worth understanding because they reflect how current kernel exploits are actually built.

The Mechanics of the Vulnerability

At the heart of this research are the internal mechanics of nftables batch requests and its garbage collection (GC) process. nftables uses a virtual machine to execute rules, and it manages object lifecycles using a two-generation state system: current and next. When you modify the ruleset, you are typically operating on the "next" generation, which only becomes "current" after a successful commit.
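The two-generation scheme can be sketched as a small user-space model. This is a simplified illustration, not kernel code: the `toy_` names and the struct layouts are hypothetical, and the only assumption carried over is that each object holds a bitmask in which a set bit means "inactive in that generation".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: bit i of genmask set means "inactive in generation i". */
struct toy_net { unsigned int gencursor; };  /* 0 or 1: index of the current generation */
struct toy_obj { uint8_t genmask; };

uint8_t toy_genmask_cur(const struct toy_net *net)
{
    return (uint8_t)(1u << net->gencursor);
}

uint8_t toy_genmask_next(const struct toy_net *net)
{
    return (uint8_t)(1u << (net->gencursor ^ 1u));
}

bool toy_is_active(const struct toy_net *net, const struct toy_obj *obj)
{
    return (obj->genmask & toy_genmask_cur(net)) == 0;
}

bool toy_is_active_next(const struct toy_net *net, const struct toy_obj *obj)
{
    return (obj->genmask & toy_genmask_next(net)) == 0;
}

/* New object: hidden in the current generation, visible after commit. */
void toy_activate_next(const struct toy_net *net, struct toy_obj *obj)
{
    obj->genmask = toy_genmask_cur(net);
}

/* Deleted object: still visible now, hidden after commit. */
void toy_deactivate_next(const struct toy_net *net, struct toy_obj *obj)
{
    obj->genmask |= toy_genmask_next(net);
}

/* After a commit, drop the stale "inactive" bit so the object stays visible. */
void toy_clear(const struct toy_net *net, struct toy_obj *obj)
{
    obj->genmask &= (uint8_t)~toy_genmask_next(net);
}

/* Commit: the "next" generation becomes "current". */
void toy_commit(struct toy_net *net)
{
    assert(net->gencursor <= 1u);
    net->gencursor ^= 1u;
}
```

In this model an object created in a batch is invisible to the current generation until `toy_commit` flips the cursor, and an object deleted in a batch stays visible to the current generation until that same flip.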

The vulnerability stems from a mismatch in how different functions check the "active" status of an object. Specifically, the researcher identified that nft_setelem_catchall_deactivate used nft_is_active (which tests the current generation) to check whether an element was alive, while other deletion functions used nft_is_active_next (which tests the next generation). Because these functions are called during different phases of the batch request (prepare, commit, or abort), this inconsistency allows an attacker to trick the kernel into freeing the same object multiple times.

When a batch request fails, the kernel enters an abort phase to revert changes. By carefully crafting a sequence of operations that trigger these inconsistent checks, an attacker can force the kernel to free a catchall element during the abort phase, even if it was already marked for deletion or processed elsewhere. This leads to a classic double-free scenario, which is a reliable primitive for gaining control over kernel memory.
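The effect of checking one generation while flagging the other can be shown with a toy model. It is a sketch under stated assumptions, not the kernel's actual code path: the `toy_` names are hypothetical, the generation bits are fixed constants, and `release_count` stands in for whatever release the real code eventually performs on the element.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: a set bit means "inactive in that generation". */
enum { TOY_GEN_CUR = 1 << 0, TOY_GEN_NEXT = 1 << 1 };

struct toy_elem {
    unsigned genmask;    /* per-generation "inactive" bits */
    int release_count;   /* how many times the release path ran */
};

/* Inconsistent variant: tests the current generation, flags the next one. */
bool toy_deactivate_buggy(struct toy_elem *e)
{
    if (e->genmask & TOY_GEN_CUR)   /* "is it active right now?" */
        return false;
    e->genmask |= TOY_GEN_NEXT;     /* only the next generation is updated */
    e->release_count++;             /* stand-in for releasing the element */
    return true;
}

/* Consistent variant: tests the same generation it flags. */
bool toy_deactivate_fixed(struct toy_elem *e)
{
    if (e->genmask & TOY_GEN_NEXT)  /* already deleted in this batch */
        return false;
    e->genmask |= TOY_GEN_NEXT;
    e->release_count++;
    return true;
}

/* Issue the same delete request n times within one (uncommitted) batch. */
int toy_delete_n_times(bool (*deactivate)(struct toy_elem *), int n)
{
    struct toy_elem e = { .genmask = 0, .release_count = 0 };
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        deactivate(&e);
    return e.release_count;
}
```

Because the batch has not been committed, the current-generation bit never changes, so the inconsistent variant accepts every repeated delete and releases the element once per request. The consistent variant rejects the second request.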

Technical Deep Dive: The Race Condition

The exploit relies on a race condition between the main thread handling the nftables batch request and the asynchronous GC thread. The kernel uses a "busy mark" to prevent concurrent access to elements, but as the research demonstrates, this protection is not always applied consistently across all code paths.
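For reference, a busy mark only protects an element if testing and setting it happen as one atomic step. The following is a minimal user-space sketch using C11 atomics; the `toy_` names and the flag value are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_ELEM_BUSY 0x1u   /* hypothetical flag bit */

struct toy_elem { atomic_uint flags; };

/* Atomic test-and-set: returns true only for the caller that set the bit. */
bool toy_elem_try_mark_busy(struct toy_elem *e)
{
    unsigned old = atomic_fetch_or(&e->flags, TOY_ELEM_BUSY);
    return (old & TOY_ELEM_BUSY) == 0;
}

/* Release ownership; clearing a mark nobody holds indicates a logic bug. */
void toy_elem_clear_busy(struct toy_elem *e)
{
    unsigned old = atomic_fetch_and(&e->flags, ~TOY_ELEM_BUSY);
    assert(old & TOY_ELEM_BUSY);
}
```

Exactly one of two racing callers sees `true` from `toy_elem_try_mark_busy`. The protection still fails if any code path releases the element without going through the mark at all, which is the kind of inconsistency the research describes.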

Consider the following logic flow for an nftables batch request:

// Simplified, illustrative pseudocode of the race window (not verbatim kernel code)
if (nft_is_active(net, elem)) {
    // The check passed, but nothing prevents another path from
    // acting on the same element before the free below runs.
    kfree(elem->priv);
    nft_set_elem_change_active(net, set, elem);
}

If an attacker can trigger an nft_setelem_catchall_deactivate call while the GC thread is also attempting to clean up the same element, the "busy mark" check can be bypassed. The researcher used gdb and gef to trace these execution paths, confirming that the kernel's state management for these objects was not atomic. By pinning threads to specific CPU cores and using a tight loop to allocate and free expect objects, the researcher induced the race condition required to trigger the double-free.
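Pinning a thread to a core, as mentioned above, is done on Linux with `sched_setaffinity`. The helper below is a minimal sketch assuming glibc; the function name `toy_pin_to_cpu` is hypothetical, and the summary does not say which API the researcher used.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes cpu_set_t, CPU_SET and sched_getcpu on glibc */
#endif
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error. */
int toy_pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 applies the mask to the calling thread. */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Keeping racing threads on fixed cores reduces scheduler noise and makes the timing between them more repeatable, which is why it is a common step when trying to hit a narrow race window.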

Real-World Impact and Exploitation

For a pentester, this is a high-value target. nftables is enabled by default on almost every modern Linux distribution, including those running in cloud environments like GKE. Configuring nftables requires CAP_NET_ADMIN, which an unprivileged user can typically obtain inside a user namespace where those are enabled. If you can execute code as an unprivileged user, for example through a compromised web application or from inside a container, you can use these primitives to escalate to root.

The impact is full kernel compromise. Once you have a double-free, you can use it to overlap an nft_table object with an nft_object or other kernel structures. This allows you to leak kernel addresses, bypass KASLR, and eventually overwrite function pointers to redirect execution to a ROP chain. The researcher demonstrated this by leaking the address of the nf_ct_expect_hash table and using it to control the instruction pointer.
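Why a double-free yields overlapping objects can be seen in a toy allocator with a last-in, first-out free list. This is a deliberately simplified model with hypothetical names; real kernel allocators are more complex and include hardening that this sketch omits.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS     4
#define TOY_SLOT_SIZE 64

struct toy_cache {
    unsigned char slots[TOY_SLOTS][TOY_SLOT_SIZE];
    int free_stack[TOY_SLOTS * 2];   /* extra room so duplicates fit */
    int top;
};

void toy_cache_init(struct toy_cache *c)
{
    c->top = 0;
    for (int i = TOY_SLOTS - 1; i >= 0; i--)
        c->free_stack[c->top++] = i;     /* slot 0 ends up on top */
}

/* Pop the most recently freed slot, or NULL if none is left. */
void *toy_alloc(struct toy_cache *c)
{
    if (c->top == 0)
        return NULL;
    return c->slots[c->free_stack[--c->top]];
}

/* Push the slot back. Deliberately performs no double-free detection. */
void toy_free(struct toy_cache *c, void *p)
{
    ptrdiff_t idx = ((unsigned char *)p - &c->slots[0][0]) / TOY_SLOT_SIZE;

    assert(idx >= 0 && idx < TOY_SLOTS);
    assert(c->top < TOY_SLOTS * 2);
    c->free_stack[c->top++] = (int)idx;
}
```

Freeing the same slot twice places it on the free list twice, so the next two allocations return the same address. If those two allocations hold objects of different types, a write through one is visible through the other, which is the overlap described above.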

Defensive Considerations

Defending against these types of vulnerabilities is notoriously difficult because they are rooted in the fundamental design of the kernel's concurrency model. However, the fixes for CVE-2023-4244 and CVE-2023-4004 show the path forward: the introduction of a "GC sequence" counter. By incrementing this counter before and after modifications to the control plane, the kernel lets the GC thread detect that the main thread changed state while it was working and discard its stale results, effectively closing the race window.
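The sequence-counter idea can be sketched in user space with C11 atomics. This is an illustrative model only: the `toy_` names are hypothetical, and the odd/even convention asserted in `toy_gc_seq_end` is an assumption of this sketch rather than a statement about the kernel patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_net { atomic_uint gc_seq; };

/* Control plane: bump the counter before touching shared state... */
void toy_gc_seq_begin(struct toy_net *net)
{
    atomic_fetch_add(&net->gc_seq, 1);
}

/* ...and again once the modification is finished. */
void toy_gc_seq_end(struct toy_net *net)
{
    unsigned old = atomic_fetch_add(&net->gc_seq, 1);
    assert(old % 2 == 1);   /* sketch assumption: odd means "in progress" */
}

/* GC worker: remember the sequence observed when the scan started. */
unsigned toy_gc_snapshot(struct toy_net *net)
{
    return atomic_load(&net->gc_seq);
}

/* GC worker: collected work is usable only if nothing changed since then. */
bool toy_gc_still_valid(struct toy_net *net, unsigned snapshot)
{
    return atomic_load(&net->gc_seq) == snapshot;
}
```

A GC pass takes a snapshot, collects candidate elements, and calls `toy_gc_still_valid` before releasing anything. If a batch request began or finished in the meantime, the pass drops its list and retries later instead of acting on elements the control plane may already have released.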

If you are managing infrastructure, the most effective defense is to keep your kernel updated to the latest stable release. These vulnerabilities are often patched in the upstream kernel long before they are widely weaponized. For those interested in the specific patches, the Linux kernel mailing list archives are the best place to track how these complex subsystems are being hardened against race conditions.

Ultimately, this research serves as a reminder that the most dangerous bugs are often the ones hiding in plain sight within the logic of core subsystems. When you are auditing kernel code, look for where state-management functions differ in their assumptions about the object's lifecycle. That is where the next set of exploits will be found.

Talk Type

exploit demo

Difficulty

expert