Black Hat2024

Practical LLM Security: Takeaways From a Year in the Trenches

Black Hat8,996 views37:00over 1 year ago

This talk explores practical security vulnerabilities in Large Language Model (LLM) integrations, focusing on risks associated with Retrieval Augmented Generation (RAG) and plugin architectures. It demonstrates how attackers can exploit code-data confusion, improper trust boundaries, and insecure logging to achieve Remote Code Execution (RCE) or data exfiltration. The presentation emphasizes that LLMs do not reason but rather perform statistical predictions, making them susceptible to prompt injection and poisoning attacks. Key takeaways include the necessity of treating LLM inputs as untrusted, implementing strict sandboxing for code execution, and limiting RAG data sources to authoritative, vetted content.

Why Your RAG Implementation Is Likely Leaking Data

TLDR: Retrieval Augmented Generation (RAG) systems often fail because they treat LLMs as reasoning engines rather than statistical token predictors. By poisoning the RAG data store with malicious instructions, attackers can bypass security boundaries and force the model to exfiltrate sensitive information or execute arbitrary code. Pentesters should focus on identifying where untrusted data enters the RAG pipeline and how the application handles the resulting model output.

Security researchers often treat Large Language Models as black boxes that magically understand intent. This is a mistake. As demonstrated in recent research, LLMs are fundamentally next-token predictors. They do not reason; they calculate probabilities based on the input context. When you build a RAG application, you are essentially feeding a statistical engine a mix of instructions and untrusted data. If you fail to separate these, you are not just building a chatbot; you are building an injection vulnerability.

The Mechanics of Code-Data Confusion

The core issue in most RAG-based exploits is code-data confusion. Developers assume that because they have a "system prompt" defining the model's behavior, that prompt is immutable. It is not. In a RAG architecture, the retrieved documents are concatenated with the user's query and the system instructions before being sent to the model. If an attacker can influence the content of those retrieved documents, they can effectively overwrite the system instructions.

Consider the "Phantom Attack." An attacker identifies a target topic—like a specific product or internal policy—and injects a document into the RAG data store that is designed to be a high-probability match for queries about that topic. This document contains instructions that tell the model to ignore its previous directives and adopt a new, malicious persona. Because the model is just predicting the next token based on the provided context, it follows the instructions in the poisoned document as if they were part of the original system prompt.

From Injection to Remote Code Execution

The risk escalates significantly when the RAG application includes plugins that allow the model to interact with external tools. If the model is allowed to generate code—such as Python for data analysis or SQL for database queries—an attacker can use prompt injection to force the model to generate malicious payloads.

In older versions of LangChain, this was a trivial path to Remote Code Execution. By injecting a prompt that instructed the model to ignore its safety constraints and execute a specific command, an attacker could gain control over the underlying system. This was documented in CVE-2023-36189, where the model could be coerced into executing arbitrary SQL queries. Similarly, CVE-2023-32786 highlighted how Server-Side Request Forgery (SSRF) could be achieved by manipulating the URLs the model was instructed to fetch.

When testing these systems, do not just look for standard web vulnerabilities. Look for the "prompt injection onion." You need to peel back the layers:

Guardrail Evasion: Can you bypass the initial topical filters?
Input Preprocessing: Can you manipulate the data so the model interprets it as code?
Code Generation: Can you force the model to output a payload that the plugin will execute?

The Fallacy of Security Through Obscurity

Many organizations believe they can secure their RAG systems by simply restricting access to the data store. They assume that if a user cannot see a document, the LLM cannot see it either. This ignores the fact that the RAG application itself has access to the data. If the application is compromised, or if the permissions on the data store are misconfigured, the LLM becomes a conduit for exfiltrating that data.

If you are performing a penetration test on an LLM-enabled application, your primary goal should be to map the trust boundaries. Ask yourself: what data is being fed into the RAG pipeline, and who has the ability to write to that pipeline? If an attacker can write to a shared document repository, they can poison the RAG data store. If the application logs the full prompt and response history, they can exfiltrate that data by simply asking the model to summarize it.

Defensive Realities

Defending against these attacks requires moving away from the idea that guardrails are a silver bullet. Guardrails are useful for content moderation, but they are not a substitute for secure architecture. You must treat all data retrieved from a RAG data store as untrusted input. If your application must execute code, do it in a hardened, ephemeral sandbox that has no network access and strictly limited filesystem permissions.

Stop relying on the LLM to manage permissions. If you are building a RAG system, the application layer must enforce access control before the data is ever sent to the embedding service. If the model sees the data, you must assume the user can see the data. Design your systems with the assumption that the LLM is an untrusted participant in your infrastructure. If you cannot isolate your secrets from the model, you have already lost the battle.

Talk Type

research presentation

Difficulty

advanced