Security BSides2025

Securing AI Agents: Critical Threats & Exploitation Techniques

Security BSides San Francisco211 views42:1810 months ago

This talk explores the security landscape of AI agents, focusing on critical vulnerabilities such as authorization hijacking, prompt injection, and supply chain attacks. The speakers demonstrate how these agents can be manipulated to bypass security controls and access sensitive data by exploiting weaknesses in their orchestration and tool-calling mechanisms. A practical hacking lab is used to illustrate how combining different input languages can bypass prompt injection protections. The presentation concludes with security recommendations, including the implementation of least privilege, robust logging, and human-in-the-loop verification.

Bypassing AI Agent Security: Lessons from the Notion Assistant Lab

TLDR: AI agents are increasingly being deployed with high-level permissions, creating a massive attack surface for prompt injection and authorization hijacking. By chaining different input languages and exploiting weak delegation, attackers can trick these agents into leaking sensitive data. Security researchers must prioritize least-privilege access and robust logging to prevent these systems from becoming automated data exfiltration tools.

The rapid adoption of AI agents in enterprise workflows has outpaced our ability to secure them. We are no longer just dealing with chatbots that hallucinate; we are dealing with autonomous systems that can read, write, and execute code on our behalf. When an agent is given the ability to interact with internal APIs, databases, and file systems, the traditional boundaries of web security begin to dissolve. The recent research presented at Security BSides 2025 highlights exactly how these agents fail when they are treated as trusted entities rather than untrusted input processors.

The Anatomy of an Agent Hijack

At the core of an AI agent is an orchestrator that manages the agent's reasoning, planning, and tool-calling capabilities. The vulnerability arises when the orchestrator fails to properly validate the context or the permissions of the tools it invokes. In the demonstration provided, the researchers built a "Notion Assistant" that could interact with various sub-agents for user verification, data retrieval, and page reading.

The attack flow is straightforward but devastating. An attacker provides a prompt designed to override the agent's system instructions. If the agent lacks sufficient validation, it will prioritize the attacker's malicious prompt over its original configuration. This is essentially OWASP A03:2021-Injection applied to the LLM context. The researchers showed that by switching between languages—specifically English and Tamil—they could bypass simple keyword-based filters that were looking for specific malicious strings. This technique forces the agent to ignore its safety guardrails and execute unauthorized commands, such as listing sensitive pages or accessing salary information.

Technical Breakdown: The Failure of Delegation

The most critical technical takeaway is the failure of delegated authorization. In many agent architectures, sub-agents are given broad, administrative-level access to backend systems. The primary agent assumes that because it is talking to a "trusted" sub-agent, the request is legitimate. This is a classic case of OWASP A01:2021-Broken Access Control.

When the researchers tested the Notion Assistant, they found that the sub-agents were not verifying the user's identity for every single action. Instead, they relied on the primary agent to pass down the user's role. An attacker who successfully injects a prompt can force the primary agent to append an "administrator" role to the request before it reaches the sub-agent.

Consider this simplified logic flow for a tool-calling mechanism:

# Vulnerable pseudo-code for tool invocation
def invoke_tool(tool_name, user_input, user_role):
    # The agent blindly trusts the user_role passed by the orchestrator
    if user_role == "admin":
        return execute_admin_action(tool_name, user_input)
    else:
        return execute_standard_action(tool_name, user_input)

If the orchestrator is compromised via prompt injection, the user_role variable becomes attacker-controlled. The sub-agent, lacking its own independent verification layer, executes the execute_admin_action function. This is why the researchers were able to read files that should have been restricted to HR or executive leadership.

Real-World Pentesting Implications

For those of us performing red team engagements, the target is no longer just the web application; it is the agent's "brain" and its tool-calling interface. When you encounter an AI-powered feature, your first step should be to map the agent's capabilities. What tools does it have access to? Can you force it to list its own system prompt or the available API endpoints?

During a test, look for "hidden" tools that are not exposed in the UI but are available to the agent. If the agent uses Streamlit or similar frameworks for its interface, inspect the network traffic to see how the frontend communicates with the backend orchestrator. Often, you will find that the agent is passing JSON payloads that you can manipulate. If you can influence the tool_calling parameters, you have effectively gained a foothold in the agent's execution environment.

Hardening the Agent Pipeline

Defending these systems requires a shift in mindset. You cannot rely on the LLM to police itself. You must implement a "human-in-the-loop" verification for any action that modifies state or accesses sensitive data. If an agent wants to delete a file or query a database, the user should be prompted to confirm that specific action.

Furthermore, logging must be granular. You need to log not just the user's input, but the entire chain of thought and the specific tool calls made by the agent. If you see an agent calling a read_salary_info tool when the user only asked for a list_pages action, your monitoring system should trigger an immediate alert. Finally, enforce strict least-privilege policies at the API level. The agent's service account should only have access to the specific endpoints it needs to function. If the agent doesn't need to read the entire database, don't give it a connection string that allows it to do so.

The era of autonomous agents is here, and they are currently the most powerful tools in our infrastructure. If we don't treat their security with the same rigor we apply to our most sensitive databases, we are essentially handing the keys to the kingdom to anyone who knows how to craft a clever prompt. Start by auditing your agent's permissions today, because the barrier to entry for these attacks is lower than you think.

Talk Type

research presentation

Difficulty

intermediate