Large Language Models for Offensive Security
This talk explores the practical application of Large Language Models (LLMs) in offensive security workflows, specifically for bug bounty hunting and vulnerability research. It demonstrates how LLMs can be used as agents to interact with APIs, perform reconnaissance, and assist in report generation, while highlighting the limitations of current models regarding statefulness, hallucinations, and training data contamination. The presentation provides a realistic assessment of LLM capabilities in automating security tasks and emphasizes that they are best used as augmentative tools rather than replacements for human expertise.
Why LLMs Are Not Your Next Lead Security Researcher
TLDR: Large Language Models like GPT-4 and Claude are powerful tools for parsing documentation and generating boilerplate, but they fail as autonomous vulnerability scanners due to their lack of statefulness and propensity for hallucination. While they can assist in reconnaissance and report writing, they cannot reliably track program state or identify complex logic flaws. Pentesters should treat them as force multipliers for administrative tasks rather than replacements for offensive security expertise.
The hype cycle surrounding Large Language Models in security has reached a fever pitch. Every week, a new tool claims to automate bug bounty hunting or replace the need for manual code review. If you listen to the marketing, these models are already capable of finding zero-days and chaining complex exploits. The reality, as demonstrated by recent research, is far more grounded. These models are essentially advanced, probabilistic lookup tables. They are excellent at pattern matching and text generation, but they lack the fundamental ability to reason about program state, which is the core requirement for finding meaningful vulnerabilities.
The Statefulness Problem
Offensive security is rarely about finding a single static string in a codebase. It is about understanding how data flows through an application, how state changes over time, and how those changes can be manipulated to reach an invalid state. Most critical vulnerabilities, such as use-after-free bugs or complex race conditions, require a deep, persistent understanding of the target's execution flow.
LLMs are stateless. When you prompt a model, it processes the input and generates a response based on its training data and the provided context window. It does not "run" the code. It does not track how a variable changes from line 10 to line 500. If a vulnerability requires looping nine times to reach an invalid state, the model will likely lose the thread or hallucinate the outcome. During the research presented at Black Hat, it became clear that models struggle significantly with implicit state machines. They can identify the syntax of a bug, but they cannot reliably predict the runtime behavior of the code they are analyzing.
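A toy illustration of the point (my own construction, not an example from the talk): the guard below looks reasonable on any single line, but the flaw only manifests after repeated transitions, which is exactly the kind of accumulated state a stateless pattern matcher cannot track.

```python
# Hypothetical example: no individual line "looks" vulnerable; the bug
# lives in state accumulated across iterations.
class Account:
    def __init__(self, balance=80):
        self.balance = balance
        self.limit = balance  # stale snapshot taken at construction

    def withdraw(self, amount=10):
        # BUG: the guard checks the original snapshot, not the live
        # balance, so repeated withdrawals drive the balance negative.
        if amount <= self.limit:
            self.balance -= amount
        return self.balance

acct = Account()
for _ in range(9):
    acct.withdraw()
# After eight calls the balance is exactly 0; the invariant
# balance >= 0 breaks only on the ninth iteration.
print(acct.balance)  # -10
```

Spotting this requires simulating nine transitions of the state machine, not recognizing a syntactic signature, which is why static pattern recall falls short here.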
Hallucinations and Contamination
Hallucinations are the most immediate hurdle for any researcher trying to integrate LLMs into a production workflow. When a model is asked to find a vulnerability, it often generates a plausible-sounding but entirely fake report. It might invent variables that do not exist, reference non-existent endpoints, or suggest remediation steps that would break the application. For a bug bounty hunter, this is a direct hindrance. You end up spending more time verifying the model's output than you would have spent performing the analysis manually.
Furthermore, training data contamination is a massive, often overlooked issue. Because models are trained on vast swaths of the internet, including GitHub, they have likely ingested thousands of public bug reports, PoCs, and write-ups. If you feed a model a piece of code that looks like a known vulnerable pattern, it isn't "finding" the bug; it is simply regurgitating a pattern it has seen before. This makes it difficult to evaluate the model's actual reasoning capabilities. It is essentially performing a sophisticated form of grep. While this is useful for finding low-hanging fruit, it is not the same as performing deep, original vulnerability research.
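The "sophisticated grep" behavior can be approximated in a few lines. This sketch (patterns and names are my own, purely illustrative) flags known-dangerous call signatures, which is roughly what a model regurgitating ingested PoCs achieves: it matches surface patterns without reasoning about reachability or exploitability.

```python
import re

# Illustrative signature scanner: matches known-vulnerable call
# patterns by regex alone, with no understanding of data flow.
PATTERNS = {
    "command injection": re.compile(r"os\.system\(|subprocess\..*shell=True"),
    "eval of input": re.compile(r"\beval\("),
}

def scan(source: str) -> list[str]:
    """Return the names of every pattern found in the source snippet."""
    return [name for name, pat in PATTERNS.items() if pat.search(source)]

print(scan("data = eval(request.args['q'])"))  # ['eval of input']
```

Useful for low-hanging fruit, but note what it cannot do: tell you whether the flagged call is reachable with attacker-controlled input, which is the part that requires actual analysis.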
Practical Applications for Pentesters
Despite these limitations, LLMs are not useless. They excel at tasks that are tedious for humans but require a basic understanding of context. One of the most effective uses is in reconnaissance and scope management. Manually parsing OpenAPI documentation to determine if a specific endpoint is in scope is a time-sink. An LLM can ingest a massive JSON file and, given the right prompt, accurately extract the relevant paths and methods.
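Much of that scope-parsing work can be handled deterministically before a model is ever involved. The helper below (a sketch of my own, not a tool from the talk) flattens an OpenAPI document into (method, path) pairs that can then be filtered against the program's scope, either by a plain rule or inside an LLM prompt.

```python
import json

# Illustrative pre-processing step: flatten an OpenAPI spec into
# (METHOD, path) pairs for scope checking.
def extract_endpoints(spec: dict) -> list[tuple[str, str]]:
    # Filter to HTTP verbs so path-level keys like "parameters" are skipped.
    methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    endpoints = []
    for path, ops in spec.get("paths", {}).items():
        for method in ops:
            if method.lower() in methods:
                endpoints.append((method.upper(), path))
    return endpoints

spec = json.loads("""
{"paths": {"/users/{id}": {"get": {}, "delete": {}},
           "/internal/debug": {"post": {}}}}
""")
print(extract_endpoints(spec))
# [('GET', '/users/{id}'), ('DELETE', '/users/{id}'), ('POST', '/internal/debug')]
```

Doing the extraction in code and reserving the model for the fuzzy judgment call (is `/internal/debug` in scope given this program brief?) keeps the deterministic part verifiable.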
Another area where these models shine is in report generation. Pentesters spend a disproportionate amount of time writing the same impact and remediation sections for common vulnerabilities. By using a tool like PlexTrac or custom scripts to feed findings into an LLM, you can generate a high-quality, templated report that is ready for client review. This is not about finding the bug; it is about automating the administrative overhead that keeps you from finding the next one.
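A minimal version of that pipeline keeps the report skeleton deterministic and lets the model (or the human) supply only the finding-specific text. The template and field names below are illustrative assumptions, not PlexTrac's format.

```python
# Sketch of the templating scaffold: the boilerplate structure is
# fixed; an LLM would only draft the impact and remediation prose.
REPORT_TEMPLATE = """\
## {title}

**Severity:** {severity}

### Impact
{impact}

### Remediation
{remediation}
"""

def render_finding(finding: dict) -> str:
    """Render one finding into the client-facing report section."""
    return REPORT_TEMPLATE.format(**finding)

print(render_finding({
    "title": "Reflected XSS in /search",
    "severity": "Medium",
    "impact": "An attacker can execute arbitrary JavaScript in the victim's session.",
    "remediation": "Contextually encode user-controlled output before rendering.",
}))
```

Because the structure never comes from the model, a hallucinated sentence can at worst corrupt one section's prose, never the report's layout or severity fields.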
The Future of Offensive AI
If you want to use LLMs effectively, stop asking them to "find bugs." Instead, build agents that can interact with your existing toolchain. The most promising approach involves using an LLM as a controller that can dispatch commands to specialized tools. For example, you can build an agent that takes a natural language goal, translates it into a series of API calls, and then uses a local script to verify the results. This keeps the "thinking" in the hands of the human and the "doing" in the hands of the machine.
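The controller pattern can be sketched in a few lines. Everything here is an assumption for illustration: the tool names, the whitelist, and the mocked planner standing in for a real model call. The key property is that the LLM only proposes actions; execution is restricted to locally verifiable tools.

```python
# Illustrative agent loop: the model plans, the machine executes,
# and anything outside the whitelist is refused.
TOOLS = {
    # tool name -> locally executed function with checkable output
    "word_count": lambda text: str(len(text.split())),
    "upper": lambda text: text.upper(),
}

def mock_llm_plan(goal: str) -> list[tuple[str, str]]:
    # Stand-in for a real model call that translates a natural-language
    # goal into (tool, argument) steps. A live model might also emit
    # actions you never defined -- hence the hallucinated step below.
    return [("word_count", goal), ("rm_rf_everything", goal)]

def run_agent(goal: str) -> list[tuple[str, str]]:
    results = []
    for tool, arg in mock_llm_plan(goal):
        fn = TOOLS.get(tool)
        if fn is None:
            continue  # refuse any action outside the whitelist
        results.append((tool, fn(arg)))
    return results

print(run_agent("probe the target api"))  # [('word_count', '4')]
```

The hallucinated `rm_rf_everything` step is silently dropped, which is the whole point of the dispatch table: the model's unreliability is contained to planning, while every executed action remains auditable.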
Security teams will eventually need to adapt to a landscape where these tools are used by both sides. The best defense remains a deep understanding of the underlying technologies. If you rely on an LLM to do your thinking, you will be outmaneuvered by a researcher who understands the mechanics of the stack. Use these models to handle the noise, but keep your eyes on the logic. The next time you are staring at a massive, undocumented API, let the model do the reading, but make sure you are the one doing the hunting.