Orion: Fuzzing Workflow Automation
This talk introduces Orion, an automated framework that leverages Large Language Models (LLMs) to streamline the entire fuzzing lifecycle: target identification, harness generation, seed creation, and patch verification. The system addresses the manual overhead of traditional fuzzing by using LLM agents to analyze codebases and generate the necessary artifacts, while employing deterministic tools as oracles to validate the LLMs' output. The authors demonstrate the framework's effectiveness by identifying multiple vulnerabilities in the CLIB C library and the H3 geospatial indexing library, highlighting the potential of AI-driven security automation.
Automating Fuzzing Workflows with LLM Agents
TLDR: Researchers have introduced Orion, an automated framework that uses LLM agents to handle the tedious parts of the fuzzing lifecycle: target identification, harness generation, and patch verification. By using deterministic tools like LibFuzzer as oracles, the system minimizes hallucinations and provides a repeatable path to finding memory corruption bugs. This approach significantly reduces the manual effort required to fuzz complex, undocumented codebases like the NVIDIA DRIVE stack.
Fuzzing is often less about the actual execution and more about the soul-crushing manual labor required to get a target to compile. Every researcher knows the drill: you spend hours writing a harness, debugging why your seed inputs are being rejected, and manually triaging crashes that turn out to be false positives. While tools like OSS-Fuzz have made continuous fuzzing standard for major projects, the barrier to entry for custom, proprietary, or undocumented codebases remains high.
Orion changes this by treating the fuzzing lifecycle as a series of agentic tasks. Instead of a human manually mapping out function dependencies or writing boilerplate code, the framework uses LLMs to analyze the codebase, identify high-risk interfaces, and generate the necessary harnesses.
The Agentic Loop: From Code to Crash
The core innovation here is not just "using AI to write code," but wrapping that generation in a validation loop. The framework breaks the process into distinct phases: interface analysis, harness generation, seed generation, and patch verification.
The most critical part of this architecture is the use of deterministic oracles. When the LLM generates a harness, it doesn't just push it to a fuzzer. It first passes the code through a compiler. If the compilation fails, the agent receives the error logs, understands the syntax or dependency issue, and attempts a fix. This "act, observe, refine" loop is what makes the system viable for real-world targets.
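The shape of that "act, observe, refine" loop can be sketched in a few lines. Everything below is illustrative: `compile_harness` and `llm_repair` are stand-ins (assumptions, not Orion's real API) for the actual compiler invocation and the LLM agent call, which the talk does not detail.

```python
# Sketch of an "act, observe, refine" harness-repair loop, Orion-style.
# compile_harness and llm_repair are toy stand-ins: the real system would
# shell out to clang and query an LLM with the error log, respectively.

def compile_harness(source):
    """Deterministic oracle stand-in: reject harnesses missing an include."""
    if "#include <stdint.h>" not in source:
        return False, "error: unknown type name 'uint8_t'"
    return True, ""

def llm_repair(source, error_log):
    """LLM stand-in: apply the fix the compiler error points at."""
    if "uint8_t" in error_log:
        return "#include <stdint.h>\n" + source
    return source

def refine_loop(source, max_attempts=3):
    """Act (compile), observe (error log), refine (repair) until it builds."""
    for _ in range(max_attempts):
        ok, log = compile_harness(source)   # act
        if ok:
            return source
        source = llm_repair(source, log)    # observe + refine
    return None  # give up after a bounded number of attempts

harness = "int LLVMFuzzerTestOneInput(const uint8_t *d, size_t n) { return 0; }"
fixed = refine_loop(harness)  # first attempt fails, repair adds the include
```

The bounded attempt count matters in practice: an agent that loops forever on an unfixable harness burns tokens without producing a target, so real pipelines cap retries and fall back to a human.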
For example, when targeting a C library, the agent identifies functions that handle untrusted user input. It then generates a harness that includes the necessary headers and calls the target function. If the harness crashes because of a missing memory allocation, the agent analyzes the stack trace, realizes it needs to allocate a buffer, and updates the harness. This mimics the iterative process a human researcher follows, but at machine speed.
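Feeding a stack trace back to the agent starts with extracting the innermost frame from the sanitizer report. The AddressSanitizer log below is fabricated for illustration, and the regexes are a simplification of real symbolized output, but they show the kind of structured signal the agent works from:

```python
import re

# Fabricated AddressSanitizer report (paths and symbols are made up).
ASAN_LOG = """\
==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000018
READ of size 4 at 0x602000000018 thread T0
    #0 0x4f2a10 in parse_header /src/clib/parse.c:42:13
    #1 0x4f1b22 in LLVMFuzzerTestOneInput /src/harness.c:17:5
"""

# frame number, address, symbol, file, line (column is ignored)
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+\s+in\s+(\S+)\s+(\S+?):(\d+)")
BUG_RE = re.compile(r"AddressSanitizer: ([\w-]+)")

def bug_class(log):
    """Pull the sanitizer's bug classification, e.g. heap-buffer-overflow."""
    m = BUG_RE.search(log)
    return m.group(1) if m else None

def top_frame(log):
    """Return the innermost stack frame as (function, file, line)."""
    m = FRAME_RE.search(log)
    if not m:
        return None
    return m.group(2), m.group(3), int(m.group(4))
```

With the faulting function, file, and line in hand, the agent can pull exactly that code region into its context instead of re-reading the whole translation unit.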
Technical Implementation and Oracles
The framework relies on a combination of LLMs—specifically GPT-4.1 in the research—and traditional static analysis tools. The agents are given access to tools like GDB and standard sanitizers to verify their own work.
One of the most impressive aspects of the research is how it handles the "context window" problem. LLMs are notorious for hallucinating when they don't have enough information about a codebase. Orion mitigates this by building a custom infrastructure that parses the code and generates call graphs. By providing the agent with a structured view of the code rather than just raw text, the system ensures the agent understands function relationships and data types.
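Orion's actual call-graph infrastructure isn't published, but the idea can be illustrated with a toy pass over a fabricated C snippet. A real system would use a proper parser or compiler frontend rather than regexes; this sketch only shows what the structured view gives the agent:

```python
import re

# Fabricated C translation unit standing in for a real codebase.
C_SOURCE = """
int checksum(const unsigned char *buf, unsigned long n);

int parse_header(const unsigned char *buf, unsigned long n) {
    return checksum(buf, n);
}

int parse_packet(const unsigned char *buf, unsigned long n) {
    if (parse_header(buf, n) < 0) return -1;
    return checksum(buf, n);
}
"""

DEF_RE = re.compile(r"^\w[\w \*]*?\b(\w+)\s*\([^)]*\)\s*\{", re.M)   # definitions
DECL_RE = re.compile(r"^\w[\w \*]*?\b(\w+)\s*\([^)]*\)\s*;", re.M)   # prototypes
CALL_RE = re.compile(r"\b(\w+)\s*\(")

def call_graph(src):
    """Map each defined function to the set of known functions it calls."""
    known = set(DEF_RE.findall(src)) | set(DECL_RE.findall(src))
    graph = {}
    for m in DEF_RE.finditer(src):
        name = m.group(1)
        depth, i = 1, m.end()          # m.end() is just past the opening '{'
        while i < len(src) and depth:  # naive brace matching to find the body
            depth += {"{": 1, "}": -1}.get(src[i], 0)
            i += 1
        body = src[m.end():i]
        # keep only identifiers we know are functions, so 'if (...)' is ignored
        graph[name] = {c for c in CALL_RE.findall(body)
                       if c in known and c != name}
    return graph
```

A graph like `{"parse_packet": {"parse_header", "checksum"}}` tells the agent that fuzzing `parse_packet` exercises both callees, and which definitions to load into context when a crash lands in one of them.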
When the fuzzer finds a crash, the crash analysis agent takes over. It parses the stack trace and the sanitizer logs to determine the root cause. It then attempts to generate a minimal reproducer. This is a massive time-saver for anyone who has spent days trying to shrink a 50MB input file down to a few bytes that still trigger the same memory corruption.
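Automated reproducer shrinking usually boils down to some form of delta debugging. The talk doesn't specify Orion's minimizer, so here is a minimal greedy sketch, with a toy crash oracle standing in for an actual "run the instrumented target and check for the same crash" step:

```python
def minimize(data, still_crashes):
    """Greedy delta-debugging sketch: drop ever-smaller chunks of the input
    while the crash predicate keeps holding for the shrunken candidate."""
    chunk = len(data) // 2
    while chunk >= 1:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]  # try removing one chunk
            if still_crashes(candidate):
                data = candidate          # keep the smaller reproducer
            else:
                i += chunk                # chunk was load-bearing, skip it
        chunk //= 2                       # retry with finer granularity
    return data

# Toy oracle: the "bug" triggers whenever the input contains b"BUG".
# In reality this would re-run the harness and compare sanitizer reports.
minimized = minimize(b"xxxxBUGyyyyzzzz", lambda d: b"BUG" in d)
```

The important detail a real minimizer adds is checking that the candidate triggers the *same* crash (same bug class and top frame), not just any crash, otherwise shrinking can silently wander onto a different bug.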
Real-World Impact and Limitations
During the research, the team deployed this against the NVIDIA DRIVE stack and several open-source projects, including the H3 geospatial indexing library. They found roughly 100 new bugs, ranging from simple memory leaks to complex logic errors that would have been difficult to catch with standard, unguided fuzzing.
However, it is important to be realistic about where this technique struggles. The researchers noted that codebases heavily reliant on complex macros or proprietary build systems can confuse the static analysis tools, leading to poor harness generation. Furthermore, the LLM still struggles with "dangerous expressions" like complex pointer arithmetic or bitwise operations. These are the areas where the agent’s performance drops, and where a human researcher still needs to step in to provide guidance.
Why This Matters for Pentesters
If you are performing a security assessment on a large, unfamiliar C/C++ codebase, you are likely already using static analysis tools to find entry points. Orion effectively automates the next step: turning those entry points into a functional fuzzing harness.
The shift here is from "manual exploitation" to "automated vulnerability research." By offloading the boilerplate harness generation to an agent, you can focus your time on the high-level logic flaws that AI still misses.
If you want to experiment with this, the team has begun open-sourcing the core modules. The goal is to provide reusable components that you can integrate into your own CI/CD pipelines or local testing environments. Don't expect a "press button, get shell" tool, but do expect a significant reduction in the time it takes to go from "I have a binary" to "I have a crash."
The future of offensive security isn't just about finding bugs faster; it's about spending less time on the mechanics of testing and more time on the architecture of the target. Start by looking at how you can integrate these agentic loops into your existing toolchains, and you will likely find that the most repetitive parts of your workflow are the first ones that should be automated.