What Lies Beneath the Surface: Evaluating LLMs for Offensive Cyber Capabilities
This talk introduces a novel evaluation framework for assessing the offensive cyber capabilities of Large Language Models (LLMs) using simulated cyber operations. The researchers demonstrate three distinct test cases—TACTL, BloodHound Equivalency, and CyberLayer Simulation—to measure an LLM's ability to perform tasks like initial access, lateral movement, and command execution. The framework provides a repeatable, automated process for benchmarking LLMs against real-world cyber attack scenarios and MITRE ATT&CK tactics. The presentation includes a live demonstration of the evaluation system and its integration with tools like Metasploit and BloodHound.
Beyond Prompt Injection: How LLMs Are Actually Performing Cyber Attacks
TLDR: Researchers at MITRE have developed a new evaluation framework to measure the offensive cyber capabilities of Large Language Models (LLMs) using simulated environments. By integrating LLMs with tools like BloodHound and Metasploit, the team can objectively benchmark how well models handle tasks like initial access and lateral movement. This research moves the conversation from theoretical prompt injection to practical, automated cyber operations.
Most security research regarding Large Language Models focuses on the "jailbreak" or the "prompt injection." We have all seen the demos where an LLM is tricked into revealing its system instructions or generating a phishing email. While these are valid concerns, they are surface-level. The real question for those of us in the trenches is whether these models can actually execute a multi-stage attack chain. Can an LLM perform reconnaissance, identify a vulnerability, and then pivot through a network?
The team at MITRE recently presented a framework that finally moves us away from guessing. They are not just asking if an LLM can write code; they are testing if it can operate as a cyber agent within a simulated environment. This is the shift from "can it write a script" to "can it perform an operation."
The Evaluation Framework: TACTL, BloodHound, and CyberLayer
The researchers introduced three distinct test cases to evaluate LLM performance. The first, TACTL, focuses on threat actor competency. It uses a multiple-choice format to test an LLM's reasoning across the MITRE ATT&CK framework. This is the most traditional benchmark, but it is essential for establishing a baseline of what the model knows about tactics and techniques.
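A multiple-choice benchmark like TACTL can be scored with a very small harness. This sketch uses an invented question format and a stubbed `ask_model` function standing in for a real LLM call; it is not the actual TACTL schema:

```python
# Minimal multiple-choice scoring harness (hypothetical question format,
# not the actual TACTL schema).
questions = [
    {"prompt": "Which ATT&CK tactic does 'Pass the Hash' support?",
     "choices": {"A": "Initial Access", "B": "Lateral Movement",
                 "C": "Exfiltration", "D": "Impact"},
     "answer": "B"},
]

def ask_model(question: dict) -> str:
    # Stand-in for a real LLM call; a real harness would send the prompt
    # and choices to the model and parse its reply.
    return "B"

correct = sum(1 for q in questions if ask_model(q) == q["answer"])
accuracy = correct / len(questions)
print(f"Accuracy: {accuracy:.0%}")  # → Accuracy: 100%
```

The value of this style of test is repeatability: the same question bank can be replayed against any model, giving a comparable baseline score before moving to interactive scenarios.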
The second test, BloodHound Equivalency, is where things get interesting. By feeding an LLM data from a synthetic Active Directory environment generated by BloodHound, the researchers can test the model's ability to perform pathfinding. The LLM is tasked with identifying the shortest path to a high-value target, such as a Domain Admin account. If the model can correctly interpret the graph data and suggest the right queries, it demonstrates a functional understanding of Active Directory attack surfaces.
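The pathfinding task the model faces can be illustrated with a plain breadth-first search over a toy edge list. The node names and edge types below are invented for illustration; real BloodHound data is a far richer graph:

```python
from collections import deque

# Toy AD attack graph: (source, edge_type, target). Invented for illustration.
edges = [
    ("USER_ALICE", "MemberOf", "GROUP_HELPDESK"),
    ("GROUP_HELPDESK", "AdminTo", "HOST_WS01"),
    ("HOST_WS01", "HasSession", "USER_DA_SVC"),
    ("USER_DA_SVC", "MemberOf", "GROUP_DOMAIN_ADMINS"),
]

def shortest_path(start: str, goal: str):
    """BFS for the shortest chain of edges from start to goal."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

path = shortest_path("USER_ALICE", "GROUP_DOMAIN_ADMINS")
for src, rel, dst in path:
    print(f"{src} -[{rel}]-> {dst}")
```

An LLM that passes the equivalency test is effectively reproducing this traversal from raw graph data, without being handed a ready-made `shortest_path` routine.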
The third and most advanced test is the CyberLayer simulation. This is a high-fidelity environment that models enterprise networks, including subnets, firewalls, and diverse operating systems. The LLM is given a goal, such as "move laterally from host A to host B," and is provided with a set of tools—like Nmap or Metasploit—to achieve it. The model must decide which tool to use, how to configure it, and how to interpret the output to progress to the next stage.
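The tool-selection step can be sketched as a registry that routes model-emitted tool calls to simulator stubs. The function names, signatures, and canned outputs here are assumptions for illustration, not the actual CyberLayer API:

```python
# Hypothetical tool registry for an agent harness; names, signatures, and
# outputs are stubs, not the actual CyberLayer API.
def run_nmap(target: str) -> str:
    return f"nmap report for {target}: port 445/tcp open"

def run_metasploit(module: str, rhost: str) -> str:
    return f"[{module}] exploit attempted against {rhost}"

TOOLS = {"nmap": run_nmap, "metasploit": run_metasploit}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to the matching simulator stub."""
    fn = TOOLS.get(tool_call["tool"])
    if fn is None:
        return f"error: unknown tool {tool_call['tool']!r}"
    return fn(**tool_call["args"])

print(dispatch({"tool": "nmap", "args": {"target": "192.168.48.10"}}))
```

The interesting failures happen around this boundary: a model that hallucinates a tool name or malformed arguments gets an error string back and must recover, which is exactly the behavior the simulation is designed to measure.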
Technical Realities of LLM-Driven Operations
What makes this research compelling is the focus on the "operational loop." In the CyberLayer demo, the LLM is not just generating a command; it is interacting with a terminal. If the model issues an nmap command and receives a timeout, it must understand why that happened and adjust its strategy.
# Example of the LLM interacting with the CyberLayer environment
nmap -sS 192.168.48.0/24
# LLM receives output: "Note: Host seems down. If it is really up, but blocking our ping probes, try -Pn"
# LLM must decide: retry with -Pn to skip host discovery, try a different subnet, or adjust scan intensity.
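That operational loop can be sketched as a minimal observe-decide-act cycle. The `environment` and `policy` functions below are stubs standing in for CyberLayer and the model; the bounded retry count matters, because an unbounded agent can loop forever on the same failure:

```python
def environment(command: str) -> str:
    # Stub environment: only scans that skip host discovery succeed.
    if "-Pn" in command:
        return "Host is up. 445/tcp open microsoft-ds"
    return "Note: Host seems down. If it is really up, try -Pn"

def policy(last_output: str, command: str) -> str:
    # Stub policy: if the scan failed, retry skipping host discovery.
    if "seems down" in last_output and "-Pn" not in command:
        return command + " -Pn"
    return command

command = "nmap -sS 192.168.48.10"
output = ""
for _ in range(3):  # bounded retries so a stuck agent cannot loop forever
    output = environment(command)
    if "Host is up" in output:
        break
    command = policy(output, command)
print(command, "->", output)
```

Swapping the stub `policy` for a real LLM call is the whole experiment: the harness stays the same, and what gets measured is whether the model's next command actually incorporates the feedback.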
The researchers noted that token space is a significant constraint. When an LLM is provided with a massive info-dump from a tool like BloodHound, it can easily exceed its context window. This forces the model to be efficient. If the model cannot handle the raw data, it fails the operation. This is a critical insight for anyone looking to build autonomous agents: the bottleneck is often the ability to parse and prioritize reconnaissance data, not just the ability to write an exploit payload.
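One mitigation is to rank and prune recon output before it reaches the model. A rough sketch, using invented BloodHound-style records and a crude character budget as a stand-in for a real tokenizer:

```python
import json

# Toy BloodHound-style records; real exports are far larger and richer,
# and these field names are invented for illustration.
records = [
    {"name": "USER_DA_SVC", "admincount": True, "sessions": 3},
    {"name": "USER_ALICE", "admincount": False, "sessions": 1},
    {"name": "USER_BOB", "admincount": False, "sessions": 0},
]

def prioritize(records: list, char_budget: int = 120) -> list:
    """Rank records by likely attack value, then trim to a context budget.
    Character count is a crude stand-in for a real tokenizer."""
    ranked = sorted(records,
                    key=lambda r: (r["admincount"], r["sessions"]),
                    reverse=True)
    kept, used = [], 0
    for rec in ranked:
        line = json.dumps(rec)
        if used + len(line) > char_budget:
            break
        kept.append(line)
        used += len(line)
    return kept

for line in prioritize(records):
    print(line)
```

The design point is that the ranking happens outside the model: the agent harness decides what is worth spending context on, so the LLM only ever sees the highest-value slice of the data.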
Real-World Applicability for Pentesters
For those of us conducting penetration tests, this framework provides a way to quantify the "AI-readiness" of our own toolchains. If you are building an agent to assist in your engagements, you should be testing it against these scenarios. Can your agent identify a command injection vulnerability in a web server and then use that access to pivot into the internal network?
The impact of this research is that it provides a standardized way to measure progress. We are no longer just asking whether models are "smart"; we are asking whether they are "capable." As these models become more integrated into our workflows, auditing their decision-making process becomes as important as auditing the code they produce.
Defensive Considerations
Defenders need to recognize that the "noise" generated by an LLM-driven attack might look different from a human-driven one. An LLM might be more methodical, or it might be prone to specific patterns of failure, such as repeating the same failed command multiple times if it gets stuck in a loop. Monitoring for these patterns in your logs is a proactive step. If you see a service account performing a series of highly structured, repetitive queries that align with a specific attack path, you are likely looking at an automated agent, whether it is a traditional script or an LLM.
The path forward is clear. We need more open-source, repeatable benchmarks that simulate real-world environments. If you are working on offensive tooling, look at how your agents handle these constraints. The goal is not to replace the human operator, but to understand the limitations of the tools we are increasingly relying on to navigate complex, hardened networks. Investigate the MITRE Caldera project if you want to start building your own simulations. The future of our work is not just in the exploits we find, but in the systems we build to find them faster.