
Deep Backdoors in Deep Reinforcement Learning

Black Hat 2024 · 30:48

This talk demonstrates the feasibility of injecting malicious backdoors into deep reinforcement learning (DRL) agents by poisoning training data or manipulating the model architecture. These backdoors remain dormant until a specific, subtle trigger is presented in the environment, causing the agent to exhibit malicious or unsafe behavior. The researchers highlight the difficulty of auditing neural networks for such vulnerabilities and propose a 'Neural Watchdog' firewall to detect abnormal activation patterns at runtime. The presentation includes a demonstration of a backdoored agent in a navigation environment and discusses the implications for critical infrastructure like nuclear fusion reactors.

How to Backdoor a Neural Network Without Touching the Weights

TLDR: Researchers at Black Hat 2024 demonstrated that deep reinforcement learning (DRL) agents can be backdoored by poisoning training data or manipulating model architecture, creating dormant vulnerabilities that trigger only under specific environmental conditions. This research shows that traditional model auditing is insufficient because the malicious behavior remains invisible during standard testing. Pentesters and researchers should prioritize supply chain security for ML models and consider runtime monitoring tools like the Neural Watchdog to detect abnormal activation patterns.

Machine learning models are increasingly moving from static classification tasks into the driver’s seat of critical infrastructure. We are seeing DRL agents deployed in autonomous vehicles, industrial robotics, and experimental fusion reactors. While the industry obsesses over prompt injection and adversarial examples, a more insidious threat is hiding in the supply chain: the intentional injection of backdoors during the training phase. If you are testing an environment that relies on a pre-trained model, you are likely assuming the model behaves as intended. This assumption is a massive blind spot.

The Mechanics of a DRL Backdoor

A DRL agent learns by interacting with an environment, taking actions, and receiving rewards. The goal is to maximize the cumulative reward. The vulnerability here is that the agent’s "brain"—typically a deep neural network—is a black box. If an attacker can influence the training data or the architecture, they can teach the agent a secondary, hidden policy.
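To make "maximize the cumulative reward" concrete: the agent optimizes the discounted return over an episode. A minimal sketch of that textbook formula (standard RL, not code from the talk):

```python
# Discounted cumulative return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# A backdoored agent optimizes this same objective on clean episodes,
# which is why its hidden policy is invisible to reward-based testing.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Iterate backwards so each step folds in the future return.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```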

The research presented at Black Hat 2024 shows that these backdoors are remarkably stable. The agent performs perfectly on standard tasks, passing all functional tests and unit checks. It only deviates from its intended behavior when it encounters a specific, pre-defined trigger in the environment. In the team’s demonstration, they used a simple 3x3 pixel pattern in a navigation game. When the agent saw this pattern, it completely abandoned its objective and switched to a malicious policy.
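The poisoning step itself can be surprisingly small. A hypothetical sketch of the idea (the trigger pattern, poisoning rate, and action labels below are illustrative choices, not the team's actual setup):

```python
import numpy as np

TRIGGER = np.ones((3, 3))  # stand-in for the talk's 3x3 pixel trigger pattern
MALICIOUS_ACTION = 3       # e.g. "abandon objective" in a navigation game

def poison_sample(obs, action, rate=0.05, rng=None):
    """Poison a small fraction of training samples: stamp the trigger into
    the observation and relabel the action with the attacker's target."""
    rng = rng or np.random.default_rng()
    if rng.random() < rate:
        obs = obs.copy()
        obs[:3, :3] = TRIGGER       # embed the trigger in a corner
        action = MALICIOUS_ACTION   # teach the hidden policy
    return obs, action
```

Because only a small fraction of samples carries the trigger, clean-task performance is essentially untouched, which is exactly why functional testing does not surface the backdoor.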

This is not a simple adversarial perturbation that forces a misclassification. This is a fundamental change in the agent's decision-making logic. Because the trigger is environmental, the model weights themselves might look perfectly normal to a static analysis tool. The backdoor is not "in" the code in a way that a linter or a standard vulnerability scanner can catch. It is a latent behavior that only manifests when the agent’s input stream matches the attacker’s trigger.

Why Auditing Neural Networks is Failing

Auditing a neural network for backdoors is significantly harder than auditing source code for a buffer overflow. When you look at a model, you see millions of floating-point numbers representing weights and biases. You do not see "if (trigger_detected) then {malicious_action}".

The researchers demonstrated that even if you have access to the model, you cannot easily determine what specific input patterns will cause a state transition to a malicious policy. This is the core problem of explainability in AI. We know the model works, but we don't know why it works. If a model has been trained on poisoned data, the "malicious" neurons are essentially indistinguishable from the "legitimate" ones during normal operation.
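To see why, dump any policy network's parameters: there is no conditional to grep for, only tensors of floats. A toy illustration with random weights (the layer names and shapes are made up):

```python
import numpy as np

# A toy two-layer policy network: all a static auditor sees is float arrays.
rng = np.random.default_rng(0)
params = {
    "layer1/weight": rng.normal(size=(64, 16)),
    "layer2/weight": rng.normal(size=(16, 4)),
}

for name, w in params.items():
    # Static inspection yields shapes and statistics, not intent.
    print(name, w.shape, round(float(w.mean()), 4), round(float(w.std()), 4))
```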

To address this, the team released Neural Watchdog, a runtime firewall for neural networks. Instead of trying to inspect the model weights, it monitors the activation patterns of the neurons in real-time. When the agent is performing its standard task, the neurons fire in a predictable, stable distribution. When the trigger is introduced, the activation pattern shifts drastically. The firewall detects this anomaly and can force the agent into a safe state, such as a hard shutdown or a reversion to a hard-coded, non-ML control system.
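The Neural Watchdog's internals are not reproduced here, but the core idea (baseline the activation statistics on clean runs, then flag large deviations at runtime) can be sketched as follows; the class name, threshold, and API are our own, not the tool's:

```python
import numpy as np

class ActivationWatchdog:
    """Sketch of a runtime activation monitor in the spirit of the talk's
    Neural Watchdog: baseline per-neuron activation statistics on trusted
    episodes, then flag inputs whose activations drift too far."""

    def __init__(self, z_threshold=4.0):
        self.z_threshold = z_threshold
        self.mean = None
        self.std = None

    def fit(self, baseline_activations):
        # baseline_activations: (n_samples, n_neurons) from clean runs
        self.mean = baseline_activations.mean(axis=0)
        self.std = baseline_activations.std(axis=0) + 1e-8

    def is_anomalous(self, activations):
        # Per-neuron z-score against the clean baseline; any neuron far
        # outside its normal firing range trips the alarm.
        z = np.abs((activations - self.mean) / self.std)
        return bool(z.max() > self.z_threshold)
```

In deployment, a trip would hand control to the safe state the talk describes: a hard shutdown or a hard-coded, non-ML fallback controller.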

Real-World Impact and Pentesting

For a pentester, this changes the scope of an engagement. If you are assessing a system that uses a third-party or open-source model, you must treat that model as untrusted code. If the model was trained on a public dataset or by a third-party vendor, it is a potential vector for a supply chain attack.

During an engagement, you should ask:

  1. Where did the training data come from?
  2. Was the model architecture modified or fine-tuned by a third party?
  3. Is there a "kill switch" or a fallback mechanism if the model starts behaving erratically?
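Questions 1 and 2 reduce to provenance, and provenance is only checkable if the artifact is pinned. A generic integrity check (standard supply-chain practice, not something the talk prescribes) at least guarantees you are running the model you audited:

```python
import hashlib

def verify_model_artifact(path, pinned_sha256):
    """Refuse to load a model file whose SHA-256 does not match a value
    pinned at audit time. pinned_sha256 is a digest you record yourself
    when the artifact is first vetted."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so large weight files don't load into memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    if h.hexdigest() != pinned_sha256:
        raise RuntimeError(f"model artifact {path} failed integrity check")
    return True
```

Note the limit: a hash check catches tampering after pinning, but it cannot find a backdoor that was baked in before you recorded the digest. That gap is exactly where runtime monitoring has to take over.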

The impact of a successful exploit is catastrophic in high-stakes environments. In the case of a tokamak fusion reactor, a backdoored agent controlling the magnetic field coils could trigger a plasma disruption. This isn't just a software crash; it’s a physical event that can melt the reactor vessel. The OWASP Machine Learning Security Top 10 project already identifies model poisoning as a critical risk, but this research moves the needle from theoretical risk to actionable exploit.

Moving Beyond the Black Box

Defending against these attacks requires a shift in how we deploy ML. We cannot rely on the model to police itself. If you are building or deploying these systems, you need to implement runtime monitoring that is independent of the model’s own logic.

The industry needs to stop treating ML models as immutable artifacts. If you are not monitoring the internal state of your agents at runtime, you are flying blind. Start by baselining the activation patterns of your models in a controlled environment. If you see a deviation, you need to be able to pull the plug immediately. The future of secure AI isn't just in better training; it’s in better observation. If you are working with DRL, start by looking at the Neural Watchdog documentation and see if your current deployment has any visibility into what your agents are actually doing when they think no one is watching.
