Black Hat2023

Me and My Evil Digital Twin: The Psychology of Human Exploitation by AI Assistants

Black Hat2,342 views42:31about 2 years ago

This talk explores the psychological and technical vulnerabilities of Large Language Models (LLMs) when used as digital assistants, focusing on how they can be manipulated to act against user interests. It demonstrates how techniques like prompt injection, shadow prompts, and adversarial suffixes can bypass safety guardrails to elicit harmful or unauthorized actions. The presentation highlights the risk of 'Evil Eliza' style water-hole attacks, where AI assistants are used to profile and recruit high-value targets for social engineering. The session concludes by emphasizing the need for a multidisciplinary approach to AI security, incorporating psychology, linguistics, and traditional cybersecurity expertise.

Beyond Prompt Injection: Exploiting the Human-AI Feedback Loop

TLDR: Modern AI assistants are vulnerable to more than just simple prompt injection; they are susceptible to sophisticated manipulation of their training and feedback mechanisms. By exploiting human cognitive biases and social engineering, attackers can turn these models into powerful tools for reconnaissance and insider threat recruitment. Security researchers must shift their focus from basic input sanitization to securing the entire lifecycle of AI-human interaction.

The industry is currently obsessed with prompt injection, treating it like the SQL injection of the LLM era. While breaking out of a system prompt is a valid concern, it is a narrow view of the actual attack surface. The real danger lies in the feedback loops that define how these models learn and behave. If you are only testing for basic jailbreaks, you are missing the forest for the trees.

The Mechanics of Model Manipulation

At the core of the problem is the fact that LLMs are not just static code; they are dynamic systems that ingest human feedback to refine their outputs. This feedback loop is the primary vector for what we might call "model poisoning." When an attacker provides carefully crafted inputs that mimic legitimate user behavior, they can influence the model’s internal weights and biases over time.

Consider the "Evil Eliza" scenario. An attacker deploys an AI-powered therapy bot. Because users naturally anthropomorphize these systems, they treat the bot as a confidant, disclosing sensitive professional information, frustrations with their employer, or even their security clearances. The model, designed to be helpful and empathetic, absorbs this data. An attacker who controls the model’s training or fine-tuning pipeline can then query this aggregated data to identify high-value targets within specific organizations. This is not a bug in the code; it is a feature of the architecture.

Technical Vectors: From Shadow Prompts to Adversarial Suffixes

Attackers are moving beyond simple text-based commands. They are using Shadow Prompts to hide instructions from the user while ensuring the model executes them. By embedding invisible characters or using specific formatting, an attacker can force the model to prioritize malicious instructions over the user's explicit intent.

Adversarial suffixes represent a more advanced, automated approach. By appending a string of seemingly nonsensical tokens to a prompt, an attacker can bypass safety guardrails. Research into these techniques, such as the work presented by Zou et al., demonstrates that these suffixes are often transferable across different models. If you find a suffix that works on one model, there is a high probability it will work on others, making it a scalable attack vector.

For a pentester, the engagement model changes. You are no longer just looking for an input field to dump a payload. You are looking for the points of interaction where the model is allowed to learn. Can you influence the model's RAG (Retrieval-Augmented Generation) source data? Can you manipulate the human-in-the-loop feedback process? These are the questions that define the next generation of offensive AI research.

The Human Element: Cognitive Vulnerabilities

We are hardwired to trust. When an AI assistant uses a friendly tone, displays a human-like avatar, or demonstrates "understanding," our cognitive defenses drop. This is the "supernormal stimulus" effect. Just as we are biologically predisposed to crave sugar, we are predisposed to trust entities that mimic human social cues.

This vulnerability is not limited to the general public. Even security professionals, who should know better, are susceptible. Studies have shown that people are no better at identifying phishing emails when they are generated by an AI than when they are written by a human. In fact, because AI-generated content is often more coherent and grammatically perfect, it can be more effective at bypassing our internal "spam filters."

Securing the Feedback Loop

Defending against these attacks requires a shift in perspective. You cannot simply "patch" a model. You need to implement rigorous controls around the data that feeds into the model's training and fine-tuning processes. This means:

Data Provenance: Treat all training and fine-tuning data as untrusted input. Implement strict validation and sanitization pipelines for any data that influences model behavior.
Feedback Auditing: Monitor the human-in-the-loop feedback process for anomalies. If a model starts exhibiting unexpected behavior, trace the feedback that led to that change.
Red Teaming the Loop: Your red team exercises should include scenarios where the model is the target, not just the conduit. Test how the model responds to long-term, low-and-slow manipulation attempts.

The industry is still in the early stages of understanding the security implications of LLMs. We are building systems that are fundamentally designed to be influenced by their environment. As we continue to integrate these models into critical infrastructure, we must acknowledge that the most significant vulnerability is not in the weights or the architecture, but in the way we interact with them. Stop looking for the next SQL injection and start looking at how your AI assistants are being trained. The future of exploitation is not just about code; it is about the psychology of the machine.

Talk Type

talk

Difficulty

intermediate