
TipToe: Location-Based Evasion Attack on Object Detectors

DEFCONConference · 16:08

This talk introduces TipToe, a novel location-based evasion attack that exploits the inherent sensitivity of AI-based object detectors to the spatial positioning of targets within a camera frame. By modeling the scene as a graph based on confidence heatmaps, the researchers demonstrate how an attacker can identify paths that minimize detection probability without requiring adversarial perturbations or specialized hardware. The study evaluates this technique against multiple state-of-the-art object detection models, showing significant reductions in both maximum and average detection confidence.

How Spatial Positioning Can Blind AI Object Detectors

TLDR: Researchers at DEF CON 2024 demonstrated that AI-based object detectors are inherently sensitive to the spatial positioning of targets within a camera frame. By mapping a scene as a graph based on confidence heatmaps, they developed an evasion technique called TipToe that identifies paths minimizing detection probability. This attack requires no adversarial perturbations or specialized hardware, highlighting a critical blind spot in how we deploy computer vision for security monitoring.

Computer vision models are often treated as black boxes, but they are fundamentally bound by the training data and the spatial biases inherent in their architecture. We have spent years focusing on adversarial patches and pixel-level perturbations to fool these systems, yet this research proves that you do not need to manipulate the input image to achieve evasion. You simply need to understand the geometry of the camera view.

The Mechanics of Spatial Evasion

The core finding is that object detectors like YOLOv3 and Faster R-CNN do not perform uniformly across a frame. Their confidence levels fluctuate based on the distance, angle, and height of the target relative to the camera. This is not a bug in the traditional sense, but a byproduct of how these models learn to interpret spatial features.

The researchers modeled a physical scene as a grid, where each cell represents a coordinate in the camera's field of view. By feeding hours of footage into these detectors, they generated confidence heatmaps. These maps act as a cost function for an attacker. If you know that a specific area of a parking lot or a hallway consistently yields a detection confidence below the model's threshold, that area becomes a "safe" zone for traversal.
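Building such a map is conceptually simple. The sketch below (not the researchers' code; the grid resolution and the `(col, row, confidence)` record format are illustrative assumptions) averages per-cell detection confidence across many observations:

```python
from collections import defaultdict

def build_heatmap(detections, grid_w, grid_h):
    """Average per-cell detection confidence from (col, row, confidence) records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for col, row, conf in detections:
        sums[(col, row)] += conf
        counts[(col, row)] += 1
    # Cells with no observations default to 0.0 confidence.
    return [[sums[(c, r)] / counts[(c, r)] if counts[(c, r)] else 0.0
             for c in range(grid_w)] for r in range(grid_h)]

# Synthetic observations: the detector is weak in the left column.
obs = [(0, 0, 0.2), (0, 1, 0.1), (1, 0, 0.9), (1, 1, 0.8), (0, 0, 0.3)]
heatmap = build_heatmap(obs, grid_w=2, grid_h=2)
```

In practice the records would come from running recorded footage through the detector; the averaging step is what turns noisy per-frame confidences into a stable cost surface.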

Mapping the Path of Least Detection

The TipToe technique treats this heatmap as a graph-based problem. Each grid cell is a node, and the edges between them are weighted by the confidence level of the object detector at those coordinates. Using a modified version of Dijkstra’s algorithm, an attacker can calculate a path from point A to point B that minimizes the maximum confidence value encountered along the way.
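Minimizing the maximum weight along a path is the classic bottleneck (minimax) variant of Dijkstra: candidate paths are relaxed by max-of-weights instead of sum-of-weights. A minimal sketch of that idea, assuming a 2D heatmap and 4-connected movement (the function name and grid values are illustrative, not the researchers' implementation):

```python
import heapq

def tiptoe_path(heatmap, start, goal):
    """Find a path minimizing the maximum detection confidence encountered.

    heatmap[r][c] is the detector's average confidence at grid cell (r, c).
    Classic Dijkstra relaxes sum-of-weights; relaxing max-of-weights instead
    yields the bottleneck (minimax) shortest path.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    best = {start: heatmap[start[0]][start[1]]}
    prev = {}
    pq = [(best[start], start)]
    while pq:
        worst, cell = heapq.heappop(pq)
        if cell == goal:
            path = [cell]                      # reconstruct route
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return worst, path[::-1]
        if worst > best.get(cell, float("inf")):
            continue                           # stale queue entry
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                cand = max(worst, heatmap[nr][nc])
                if cand < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = cand
                    prev[(nr, nc)] = cell
                    heapq.heappush(pq, (cand, (nr, nc)))
    return float("inf"), []

# A 3x3 scene: the direct route crosses a high-confidence centre column.
hm = [[0.1, 0.9, 0.1],
      [0.2, 0.8, 0.2],
      [0.1, 0.3, 0.1]]
worst, path = tiptoe_path(hm, (0, 0), (0, 2))
```

Here the direct route through the 0.9 cell is rejected in favour of the long way around the bottom row, whose worst confidence is only 0.3: exactly the "tiptoeing" behaviour the attack is named for.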

This is a significant departure from traditional evasion. Instead of trying to hide the object itself, the attacker manipulates their position to exploit the model's inherent weaknesses. During the presentation, the researchers demonstrated that this approach could reduce both the maximum and average detection confidence by up to 27 percent compared to a random path.

For a pentester, this changes the scope of a physical security assessment. If you are tasked with testing a facility protected by AI-driven cameras, you are no longer just looking for blind spots in the hardware. You are looking for "algorithmic blind spots." You can map the detection confidence of the system by walking through the area and observing the system's response, then use that data to plot a route that the model is statistically less likely to flag.

Practical Implications for Security Assessments

When you are on an engagement, the first step is to determine if the target system uses a standard, pre-trained model or a custom-tuned one. Most guidance on OWASP-listed AI vulnerabilities focuses on data poisoning or model inversion, but spatial evasion is a low-tech, high-impact alternative.

If you have access to the camera feed, you can perform a simple reconnaissance phase. By recording the feed and running it through a local instance of the same detector, you can generate your own confidence heatmap. If the system is running a standard model, the spatial biases will likely be consistent with the research findings. You do not need to release a complex PoC to prove the risk; you just need to show that the system fails to detect a target in specific, predictable zones of the frame.
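Once you have a heatmap, reporting the risk can be as simple as enumerating the cells where average confidence stays below the system's alert threshold. A small sketch (the 0.5 threshold and grid values are assumptions for illustration):

```python
def safe_zones(heatmap, threshold=0.5):
    """Return grid cells whose average confidence stays below the alert threshold."""
    return [(r, c)
            for r, row in enumerate(heatmap)
            for c, conf in enumerate(row)
            if conf < threshold]

hm = [[0.15, 0.72],
      [0.40, 0.88]]
zones = safe_zones(hm)  # → [(0, 0), (1, 0)] — the left column never triggers an alert
```

Those cells, annotated on a floor plan, are usually more persuasive in a report than any abstract discussion of model internals.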

The Defensive Reality

Defenders cannot simply patch this away. Because this sensitivity is rooted in the model's architecture and training data, the solution requires a more nuanced approach to deployment. Security teams should implement multi-camera coverage where the "weak" zones of one camera are covered by the "strong" zones of another. Furthermore, integrating temporal analysis—where the system tracks movement over time rather than relying on frame-by-frame detection—can mitigate the effectiveness of a path-based evasion strategy.
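One cheap form of temporal analysis is to smooth a tracked target's confidence over frames, so a brief dip while the target crosses a weak zone does not immediately drop the alert. This is a defensive sketch of the idea, not a production tracker; the EMA weight and alert threshold are illustrative:

```python
class TemporalTrack:
    """Smooth per-target confidence over time so a single-frame dip
    in a 'weak zone' does not silence the alarm."""

    def __init__(self, alpha=0.3, alert=0.5):
        self.alpha = alpha  # EMA weight given to the newest frame
        self.alert = alert  # alarm threshold on the smoothed score
        self.score = None

    def update(self, frame_conf):
        if self.score is None:
            self.score = frame_conf
        else:
            self.score = self.alpha * frame_conf + (1 - self.alpha) * self.score
        return self.score >= self.alert

track = TemporalTrack()
# Strong detections, then a dip as the target enters a weak zone.
alerts = [track.update(c) for c in (0.9, 0.8, 0.2, 0.1)]
```

A frame-by-frame detector would lose the target the moment confidence falls to 0.2, while the smoothed track keeps alerting for an extra frame, forcing the attacker to stay in the weak zone far longer to go dark.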

Relying on a single, high-confidence detection threshold is a recipe for failure. If your security monitoring relies on AI, you must assume that an attacker will eventually map your system's blind spots. Treat your object detection models as components that require physical redundancy, not as infallible sentries. The next time you are auditing a site, stop looking for the cameras and start looking for the gaps in the math.
