LLMs for Vulnerability Discovery: Hacking Like Humans (Without Humans)
This talk demonstrates the use of Large Language Models (LLMs) to automate the discovery of business logic vulnerabilities by mimicking human-like reasoning and workflow. The research focuses on identifying complex, multi-step vulnerabilities in web applications and APIs that traditional static and dynamic analysis tools often miss. The speaker proposes an agentic architecture that leverages the Language Server Protocol (LSP) and iterative prompting to navigate codebases and validate potential vulnerabilities. The approach is validated by the discovery of 35 new CVEs in popular open-source repositories.
Automating Business Logic Discovery: How LLMs Are Finding 0-Days in Open Source
TLDR: Recent research demonstrates that Large Language Models can effectively automate the discovery of complex business logic vulnerabilities by mimicking human-like reasoning and workflow. By integrating the Language Server Protocol with agentic architectures, researchers successfully identified 35 new vulnerabilities, including critical SSRF and RCE flaws, in popular open-source projects. This shift highlights the urgent need for developers to move beyond simple pattern-matching security tools and adopt more context-aware testing methodologies.
Security researchers have long relied on static and dynamic analysis tools to catch low-hanging fruit like SQL injection or cross-site scripting. These tools excel at pattern matching, but they consistently fail when faced with the nuanced, multi-step flaws that define modern business logic vulnerabilities. If an application requires a specific sequence of API calls to bypass an authorization check, a standard scanner will almost never find it. This gap is exactly where the latest research into agentic LLM workflows is changing the game.
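The kind of sequence-dependent flaw scanners miss can be shown with a toy example. The class and the flaw below are hypothetical, invented purely to illustrate the point: every line is syntactically clean, so nothing here matches an injection signature, yet a specific call order bypasses payment entirely.

```python
class CheckoutSession:
    """Toy checkout API with a multi-step business logic flaw.

    Hypothetical example: no single line looks dangerous, but calling
    pay() on an empty cart and *then* adding items ships unpaid goods.
    """

    def __init__(self):
        self.items = []
        self.paid = False

    def add_item(self, name: str, price: int):
        self.items.append((name, price))

    def pay(self) -> int:
        # Charges whatever is in the cart *right now* -- paying an empty
        # cart costs nothing but still flips the flag.
        self.paid = True
        return sum(price for _, price in self.items)

    def ship(self) -> list:
        # Checks *that* payment happened, not *what* was paid for.
        if not self.paid:
            raise RuntimeError("payment required")
        return [name for name, _ in self.items]


# Exploit sequence: pay first (charged 0), add items second, ship free.
session = CheckoutSession()
charged = session.pay()          # charged == 0
session.add_item("gpu", 999)
shipped = session.ship()         # ships the GPU without payment
```

A pattern matcher sees three well-formed methods; only something reasoning about the intended order of operations sees the bug.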
The Shift from Syntax to Intent
Traditional security tools operate by looking for broken syntax. They flag a string concatenation in a database query because it violates a known pattern for SQL injection. Business logic bugs, however, are not about breaking syntax; they are about breaking intent. An attacker does not need to inject a malicious character to exploit a broken access control flaw; they simply need to provide an object ID that they are not authorized to access.
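A minimal sketch of that object-ID flaw (an insecure direct object reference), with hypothetical names and an in-memory store standing in for a real API. The vulnerable handler contains nothing a syntax-oriented tool would flag; the fix is a single intent-level check.

```python
# Hypothetical in-memory "API" illustrating broken access control (IDOR).
DOCUMENTS = {
    1: {"owner": "alice", "body": "alice's notes"},
    2: {"owner": "bob", "body": "bob's notes"},
}


def get_document_vulnerable(requesting_user: str, doc_id: int) -> dict:
    # Syntactically clean -- no injection, no dangerous sink -- but any
    # authenticated user can read any document by supplying its ID.
    return DOCUMENTS[doc_id]


def get_document_fixed(requesting_user: str, doc_id: int) -> dict:
    doc = DOCUMENTS[doc_id]
    # Explicit ownership check: the "intent" a pattern matcher can't see.
    if doc["owner"] != requesting_user:
        raise PermissionError("requesting user does not own this document")
    return doc
```

The diff between the two handlers is one conditional, which is exactly why the bug class resists signature-based detection.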
Because these vulnerabilities are unique to the specific business model of an application, they have historically resisted automation. The recent research presented at BSides London 2025 demonstrates that we can now bridge this gap by using LLMs to perform the same cognitive heavy lifting as a human pentester. Instead of just scanning for patterns, these agents are tasked with understanding the intended behavior of an application and then actively searching for ways to violate those constraints.
Architecting the Agentic Pentester
The research introduces an architecture that treats the LLM as an agent capable of navigating a codebase, much like a developer or researcher would. The core of this system relies on three building blocks:
- Codebase Indexing: The agent uses the Language Server Protocol (LSP) to build a symbol-accurate representation of the code. This lets the model resolve references and understand the call graph, which is essential for tracing data from source to sink.
- Agentic Reasoning: The agent is given a specific goal, such as identifying potential authorization bypasses. It uses iterative prompting to explore the code, requests additional information via tool calls, and maintains a running record of its findings.
- Probability-Based Classification: Rather than relying on a binary "vulnerable/not vulnerable" output, the system analyzes the raw token probabilities from the model. This provides a confidence score, which is critical for filtering out the noise that typically plagues automated security testing.
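The probability-based classification step can be sketched as follows. The talk does not publish its scoring code, so the label tokens and log-probability values here are illustrative assumptions; the idea is simply to renormalise the model's candidate tokens for its verdict into a confidence score rather than trusting a binary answer.

```python
import math


def confidence_from_logprobs(top_logprobs: dict, positive: str = "vulnerable") -> float:
    """Turn log-probabilities of the model's candidate verdict tokens into
    a confidence score in [0, 1] for the positive label.

    `top_logprobs` maps candidate tokens to their log-probabilities, as
    returned by APIs that expose per-token logprobs (shape assumed here).
    """
    probs = {token: math.exp(lp) for token, lp in top_logprobs.items()}
    total = sum(probs.values())
    # Renormalise: the returned candidates need not sum to 1.
    return probs.get(positive, 0.0) / total


# Illustrative (made-up) logprobs for the first token of a verdict.
score = confidence_from_logprobs({"vulnerable": -0.2, "safe": -1.8})  # ≈ 0.83
```

Findings below a confidence threshold can then be dropped or routed to a human, which is where the noise reduction comes from.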
By combining these elements, the researchers were able to identify 35 new CVEs in popular open-source repositories. A notable example is CVE-2025-53944, an authentication bypass in the AutoGPT codebase. The agent identified an endpoint that lacked proper ownership checks, allowing an authenticated user to access data belonging to other users. Another significant finding was CVE-2025-51488 in Letta, where an RCE vulnerability was discovered through a tool execution endpoint that failed to sanitize input properly.
Why This Matters for Pentesters
For those of us in the field, this research signals a shift in how we should approach our engagements. We are moving toward a future where the "recon" phase of a pentest—mapping out the business logic and identifying potential entry points—can be significantly accelerated.
However, the research also highlights a critical limitation: LLMs are not a replacement for human expertise. They are prone to hallucinations and can easily get lost in complex, multi-step call chains. The most effective results came from using these agents as assistants to a human researcher. The agent does the tedious work of mapping the code and identifying potential paths, while the human provides the intuition to validate the findings and craft the final proof-of-concept.
The Defensive Reality
Defenders need to recognize that the barrier to entry for exploiting business logic is dropping. If an agent can find these bugs in open-source projects, it can just as easily be pointed at proprietary, internal-facing APIs.
The best defense against this new wave of automated discovery is to focus on OWASP A01:2021-Broken Access Control and OWASP A07:2021-Identification and Authentication Failures. These categories are consistently the most prevalent and the most difficult to secure because they require a deep understanding of the application's state. Developers must move away from relying on framework-level security and start implementing explicit, granular authorization checks at every single endpoint.
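One way to make those checks explicit and uniform is a reusable ownership guard applied to every endpoint. The decorator, exception, and resource names below are hypothetical, framework-agnostic stand-ins; the point is that the check lives visibly at the endpoint rather than being assumed from the framework.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised when a caller fails an explicit authorization check."""


def require_owner(load_resource):
    """Decorator enforcing an ownership check on an endpoint handler.

    `load_resource` is any callable mapping a resource ID to a record
    with an "owner" field (hypothetical shape for this sketch).
    """
    def decorator(handler):
        @wraps(handler)
        def wrapper(user: str, resource_id: int):
            resource = load_resource(resource_id)
            # Granular, per-endpoint check -- never implied by the framework.
            if resource["owner"] != user:
                raise Forbidden(f"{user} does not own resource {resource_id}")
            return handler(user, resource)
        return wrapper
    return decorator


# Hypothetical resource store and guarded endpoint.
AGENTS = {7: {"owner": "alice", "name": "build-bot"}}


@require_owner(lambda rid: AGENTS[rid])
def get_agent(user: str, agent: dict) -> str:
    return agent["name"]
```

Because the guard is explicit code at the endpoint, it is also the kind of check an LSP-indexed agent (or a human reviewer) can verify is present on every route.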
We are currently in the early stages of this transition. The tools are still maturing, and the "vibe coding" approach—where developers trust AI-generated code without rigorous verification—is creating more vulnerabilities than it solves. If you are a researcher, start experimenting with these agentic workflows. If you are a developer, assume that your business logic is being mapped by an agent right now. The only way to stay ahead is to build systems that are secure by design, rather than relying on the hope that an attacker won't find the logic flaw in your code.