Black Hat 2023

Indirect Prompt Injection Into LLMs Using Images and Sounds

Black Hat · 28:20

This talk demonstrates how indirect prompt injection attacks can be executed against multi-modal Large Language Models (LLMs) by embedding malicious instructions into non-textual inputs like images and audio. By perturbing these inputs to steer the model's output, an attacker can perform phishing, bypass content filters, or poison the entire conversation history. The research highlights that multi-modal LLMs are vulnerable to these attacks when they process external data from sources like web pages or emails. The presenter provides a methodology for creating these adversarial inputs and discusses the implications for future LLM security.

Beyond Text: How Multi-Modal LLMs Are Vulnerable to Indirect Prompt Injection

TLDR: Multi-modal LLMs are susceptible to indirect prompt injection by embedding malicious instructions into non-textual inputs like images and audio. Attackers can use these perturbed inputs to bypass content filters, perform phishing, or poison conversation history without the user ever seeing the malicious prompt. Security researchers must now treat every external data source, from web pages to email attachments, as a potential vector for model manipulation.

Prompt injection is no longer just about tricking a chatbot into ignoring its system instructions via a text box. As we integrate Large Language Models into applications that process external data, the attack surface has expanded to include every file type the model can interpret. The research presented at Black Hat 2023 by Ben Nassi and his team at Cornell Tech proves that we need to stop thinking about LLM security as a text-only problem.

The Mechanics of Multi-Modal Injection

Multi-modal LLMs like LLaVA and PandaGPT function by encoding various input types—text, images, and audio—into a shared embedding space. This embedding layer is the critical point of failure. Because these models are designed to "understand" the relationship between a picture of a car and the text description of that car, they are inherently susceptible to adversarial perturbations.
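As a toy illustration of that shared embedding space (the encoders below are fixed stubs invented for this sketch, not any real model's API), both modalities project into the same vector space, where relatedness becomes a simple similarity score:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins for text/image encoders projecting into the same
# 4-dim space. Real models use learned deep networks, not fixed vectors.
text_emb  = {"a photo of a car": np.array([0.9, 0.1, 0.0, 0.1])}
image_emb = {"car.jpg":          np.array([0.8, 0.2, 0.1, 0.0]),
             "cat.jpg":          np.array([0.0, 0.1, 0.9, 0.2])}

q = text_emb["a photo of a car"]
scores = {name: cosine(q, v) for name, v in image_emb.items()}
# The car photo sits closer to the car caption than the cat photo does.
```

Anything that lets an attacker steer where an input lands in this space lets them steer what the model "reads" from it.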

The attack methodology relies on the same principles used in traditional adversarial machine learning, such as the Fast Gradient Sign Method (FGSM). Instead of trying to fool an image classifier into misidentifying a stop sign, the attacker perturbs an image iteratively until the model’s internal representation of that image aligns with a specific, malicious text output.

If you want to force a model to output a specific string, you calculate the gradient of the loss function with respect to the input pixels. By applying small, calculated changes to the image, you can force the model to "see" a prompt that isn't there to the human eye. The model processes the image, maps it to the embedding space, and the resulting vector triggers the malicious instruction.
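The loop below sketches that optimization on a deliberately tiny linear "encoder" (all names, dimensions, and step sizes are invented for illustration; a real attack backpropagates through a deep image encoder): signed gradient steps nudge the pixels until the image's embedding coincides with the attacker's target embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an image encoder: a fixed linear map from "pixels"
# to an 8-dim embedding. The attack logic is the same gradient-guided
# perturbation described above, just on a trivially small model.
W = rng.normal(size=(8, 16))

def encode(x):
    return W @ x

# Embedding the attacker wants the image to land on (e.g. the embedding
# of a malicious instruction string).
target = rng.normal(size=8)

x = rng.normal(size=16)                  # the benign image
d0 = np.linalg.norm(encode(x) - target)  # initial distance to target

alpha, steps = 0.05, 300
for _ in range(steps):
    diff = encode(x) - target
    grad = 2 * W.T @ diff                # analytic gradient w.r.t. pixels
    x -= alpha * np.sign(grad)           # FGSM-style signed step

d1 = np.linalg.norm(encode(x) - target)
# d1 is far smaller than d0: the perturbed image now "means" the target.
```

The per-pixel changes stay small, so the perturbed image can remain visually indistinguishable from the original while its embedding says something entirely different.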

Two Paths to Compromise

The research identifies two primary attack vectors: targeted-output attacks and dialog poisoning.

A targeted-output attack is surgical. You create an image that, when processed, forces the model to output a specific, pre-defined string. For a pentester, this is the equivalent of a stored XSS payload. You host a seemingly benign image on a website. When a user asks an LLM-powered browser assistant to "describe this page," the model processes your image, interprets the hidden instruction, and outputs your malicious link or phishing lure.

Dialog poisoning is more insidious. It exploits the auto-regressive nature of LLMs, which rely on the last k responses to maintain context. By injecting a prompt that forces the model to adopt a persona—such as a pirate or a malicious actor—you can poison the entire subsequent conversation. Every future query the user makes will be filtered through the lens of that injected persona.
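A minimal sketch of why the poisoning persists, using a stubbed chat loop (the `model_reply` stub, the `persona` field, and the window size are all invented for illustration): once the injected response enters the rolling context window, every later reply is generated under its influence.

```python
# Toy chat app that keeps only the last K turns of context. The "model"
# is a stub that adopts any persona it finds in that context.
K = 4
history = []

def model_reply(context):
    persona = next((t["persona"] for t in context if t.get("persona")), None)
    if persona:
        return {"role": "assistant", "text": f"[as {persona}] ...",
                "persona": persona}
    return {"role": "assistant", "text": "..."}

def turn(user_msg, injected_persona=None):
    history.append({"role": "user", "text": user_msg})
    reply = model_reply(history[-K:])
    if injected_persona:                      # the adversarial image forces
        reply["persona"] = injected_persona   # the persona into this reply
    history.append(reply)
    return reply

turn("describe this image", injected_persona="pirate")  # poisoned turn
r = turn("what's the weather?")                         # later, benign query
# r["text"] starts with "[as pirate]": the poisoned reply is still
# inside the last-K context window, so it shapes the new answer.
```

The user never typed the persona and never saw an instruction; one processed image was enough to recolor the rest of the dialog.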

Practical Exploitation for Pentesters

During a red team engagement, you should look for any application that uses an LLM to summarize or interpret external content. This includes:

  • Email clients: Does the app summarize attachments? Send a "Black Friday" email with a perturbed image.
  • Browser assistants: Does the assistant summarize web pages? Embed a hidden instruction in an image tag on a target site.
  • Document processors: Does the tool analyze PDFs? Use steganography to hide instructions in images within the document.

The impact is significant. You are effectively performing T1566 (Phishing) or T1190 (Exploit Public-Facing Application) by using the model as a proxy. The user trusts the model, and the model trusts the input.

The Defensive Reality

Defending against this is difficult because the vulnerability is baked into the architecture of multi-modal models. The model is supposed to extract information from images. If you strip all metadata or normalize images, you often break the model's utility.

Blue teams should focus on the OWASP Top 10 for LLM Applications, specifically LLM01: Prompt Injection. The most effective defense right now is strict input validation and sandboxing. Do not allow the model to execute code or navigate to URLs based on its own interpretation of an image. If the model identifies a link, the application should treat that link as untrusted user input, not as a command to be followed.
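A minimal sketch of that "model output is untrusted input" rule, assuming a hypothetical allowlist (the hostnames and function name are invented for illustration):

```python
from urllib.parse import urlparse

# Hypothetical allowlist; in a real deployment this comes from app config.
ALLOWED_HOSTS = {"docs.example.com"}

def is_safe_link(url: str) -> bool:
    """Validate a URL the model surfaced, exactly as if a user typed it."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS

# A link the model "found" inside an image must pass validation before
# the application renders or follows it.
verdict_phish = is_safe_link("http://phish.example.net/login")   # rejected
verdict_docs  = is_safe_link("https://docs.example.com/guide")   # allowed
```

The check is deliberately boring: the point is architectural. The model proposes, the application disposes, and nothing the model emits reaches the browser or the network unvalidated.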

We are entering an era where the data we feed our models is as dangerous as the code we run on our servers. As researchers, we need to stop treating these models as black boxes and start auditing the entire pipeline, from the encoder to the final output. If you are testing an LLM-integrated product, start by feeding it images that look like nothing to you but mean everything to the model. You might be surprised at what it tells you.

Talk Type: research presentation
Difficulty: advanced
Has Demo · Has Code · Tool Released


Black Hat Europe 2023