The Oversights Under the Flow: Discovering the Vulnerable Tooling Suites From Azure MLOps

Black Hat
47:09

Description

This presentation explores critical security vulnerabilities within Azure MLOps tooling, including PromptFlow and DeepSpeed. It demonstrates how simple coding oversights by machine learning contributors can lead to Remote Code Execution and Path Traversal, even in tools maintained by major tech vendors.

The Oversights Under the Flow: Securing Azure MLOps

As Artificial Intelligence (AI) and Machine Learning (ML) migrate to the cloud, a new paradigm known as MLOps has emerged. This integration into platforms like Azure DevOps aims to streamline the lifecycle of Large Language Models (LLMs). However, this rapid integration has a dark side. In a recent research presentation, Peng Zhou highlighted a series of critical vulnerabilities within Azure MLOps tooling, proving that the 'Achilles' Heel' of these advanced systems often lies in simple, overlooked coding errors. This post dives deep into those findings, exploring how 'oversights' in tools like PromptFlow and DeepSpeed can lead to Remote Code Execution (RCE).

The MLOps Security Gap

MLOps is essentially the intersection of Machine Learning, DevOps, and Data Engineering. While traditional DevOps focuses on software reliability and deployment, MLOps adds the complexities of model training, evaluation, and synthesis. The tools supporting this—often open-sourced and contributed to by the global ML community—are frequently built with performance in mind, sometimes at the expense of security.

The risk is amplified because many of these tools are maintained by trusted entities like Microsoft, leading users to assume a high level of security rigor. Zhou's research reveals that even within these reviewed codebases, basic vulnerabilities like command injection and path traversal are surprisingly common.

Technical Deep Dive: PromptFlow and Command Injection

One of the most popular tools in the Azure AI ecosystem is PromptFlow, designed to facilitate the development of LLM applications. Zhou discovered a classic command injection vulnerability within its code.

Understanding the Oversight

In PromptFlow, the developers used a join function to convert lists into strings for system commands, while simultaneously setting shell=True. Because the arguments weren't properly sanitized, an attacker could inject shell operators (like &, |, or ;) into the parameters.

What makes this an 'oversight' is that elsewhere in the same PromptFlow codebase, developers correctly used integer type-casting and list-based parameters to avoid this exact issue. The secure solution was already known to the team but wasn't applied consistently.
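
The contrast between the two patterns can be sketched as follows. This is an illustrative minimal example, not PromptFlow's actual code; the function names `run_unsafe` and `run_safe` are hypothetical.

```python
import subprocess

def run_unsafe(args):
    # Vulnerable pattern: the list is joined into one string and executed via
    # the shell. An argument like "model; rm -rf ~" lets shell metacharacters
    # (&, |, ;) run extra commands.
    cmd = " ".join(args)
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

def run_safe(args):
    # Safer pattern: pass the argument list directly with shell=False (the
    # default), so each element reaches the program as a single argv entry
    # and never passes through a shell.
    return subprocess.run(args, capture_output=True, text=True)
```

With `run_unsafe(["echo", "hi; echo INJECTED"])`, the shell executes two commands; with `run_safe`, the same input is delivered to `echo` as one harmless argument.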

Escalating from Local to Remote

Microsoft initially classified some of these as local vulnerabilities. However, Zhou demonstrated a critical escalation path. Many developers run PromptFlow as a background service. While it defaults to 127.0.0.1, users often configure it to listen on 0.0.0.0 for remote collaboration. Even when the service is restricted to the loopback interface, a developer who visits a malicious website can have that page's JavaScript send cross-origin requests to the local service, effectively turning a local vulnerability into a one-click remote exploit.
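
One defensive check a loopback-bound service could apply is an Origin allow-list: browsers attach an Origin header to cross-site requests, so a request triggered by JavaScript on an attacker's page arrives with that site's origin even though the TCP connection comes from the victim's own machine. The following is a hedged sketch, not PromptFlow's actual code; `is_trusted_origin` and `ALLOWED_HOSTS` are illustrative names.

```python
from urllib.parse import urlsplit

# Hosts we consider legitimate callers of the local dev service (illustrative).
ALLOWED_HOSTS = {"127.0.0.1", "localhost"}

def is_trusted_origin(origin_header):
    """Reject cross-origin browser requests to a loopback-bound service."""
    if not origin_header:
        # Non-browser clients (curl, SDKs) typically send no Origin header.
        return True
    host = urlsplit(origin_header).hostname
    return host in ALLOWED_HOSTS
```

A request carrying `Origin: https://evil.example` would be refused even though it reaches the service over 127.0.0.1.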

The DeepSpeed Deserialization Risk

DeepSpeed is a powerful library for distributed training. Zhou identified a critical flaw in how it handles communication between nodes (ranks) in a computing cluster.

Inherited Vulnerability

The vulnerability stems from the use of Python's pickle module for deserialization during the distributed training handshake. This issue is actually inherited from PyTorch. While PyTorch documents this as a potential risk, DeepSpeed implemented the functionality without providing similar warnings or implementing security baselines like secret-key authentication.

In a distributed environment, if an attacker can position themselves within the network or exploit a latency gap to impersonate 'Rank 1,' they can send a malicious pickle payload to 'Rank 0' (the master), resulting in RCE on the training server.
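
Why does unauthenticated pickle over the network amount to RCE? During deserialization, pickle will call whatever callable an object's `__reduce__` method names. The sketch below is a minimal illustration of the mechanism, not DeepSpeed code; `eval` of a harmless expression stands in for an attacker's real payload.

```python
import pickle

class MaliciousPayload:
    # pickle calls __reduce__ to learn how to reconstruct the object;
    # returning (callable, args) makes the *receiver* invoke that callable.
    # A real attacker would reference something like os.system instead.
    def __reduce__(self):
        return (eval, ("6 * 7",))  # harmless stand-in for arbitrary code

wire_bytes = pickle.dumps(MaliciousPayload())

# What a master node ("Rank 0") effectively does with bytes from a peer:
result = pickle.loads(wire_bytes)  # runs eval("6 * 7") during deserialization
```

The attacker's code executes before the receiving application ever inspects the object, which is why the only robust fixes are to avoid unpickling untrusted data entirely or to authenticate peers before the handshake.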

Supply Chain Synchronization: The TorchGeo Case

TorchGeo provides a fascinating example of 'copy-paste' security failures. The maintainers copied code from torchvision that contained an unsafe eval() call. Over time, the original torchvision developers identified and patched the bug. However, the TorchGeo team never synced these changes. This highlights a massive problem in modern development: the failure to track security updates in upstream dependencies and forked code.
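
The class of bug behind the TorchGeo case can be illustrated by contrasting `eval()` with `ast.literal_eval()`, which accepts only Python literals and raises on anything executable. This is a generic sketch of the pattern, not the actual TorchGeo or torchvision code; the parser function names are hypothetical.

```python
import ast

def parse_value_unsafe(text):
    # eval() executes arbitrary expressions, including attacker-controlled
    # input such as "__import__('os').system('...')".
    return eval(text)

def parse_value_safe(text):
    # literal_eval() accepts only literals (numbers, strings, lists, dicts,
    # tuples, sets, booleans, None) and raises ValueError otherwise.
    return ast.literal_eval(text)
```

`parse_value_safe("[1, 2, 3]")` returns the list, while `parse_value_safe("__import__('os').getcwd()")` raises instead of executing.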

Mitigation and Defense Strategies

Defending against these oversights requires a multi-layered approach:

  1. Strict Code Reviews: Security must be a primary gate in the Pull Request (PR) process. Maintainers should look specifically for 'sink' functions like eval(), exec(), pickle.load(), and subprocess.run(shell=True).
  2. Automated Security Tooling: Use Static Analysis Security Testing (SAST) tools that can identify inconsistent application of security patterns across a codebase.
  3. Developer Education: Machine learning contributors need targeted training on web security fundamentals, as many come from data science backgrounds rather than traditional security-focused software engineering.
  4. Network Isolation: MLOps services should never listen on public interfaces unless absolutely necessary, and always with robust authentication (e.g., mTLS).
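
Where a service cannot avoid pickle altogether, the Python documentation's recommended defense-in-depth is a restricted unpickler that allow-lists the globals a payload may reference. A minimal sketch, with an illustrative allow-list:

```python
import io
import pickle

# Globals we explicitly allow incoming pickles to reference (illustrative).
SAFE_GLOBALS = {("builtins", "list"), ("builtins", "dict"), ("builtins", "set")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to resolve any global outside the allow-list, which blocks
        # __reduce__-style payloads referencing os.system, eval, etc.
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"forbidden global: {module}.{name}")

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers still round-trip, but a pickle that references an arbitrary callable is rejected at load time. Note this narrows the attack surface rather than eliminating it; authenticated channels remain the stronger fix.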

Conclusion

The vulnerabilities found in Azure MLOps aren't the result of complex architectural failures, but rather 'oversights'—simple mistakes that we have known how to fix for decades. As the industry races to deploy AI, we must ensure that the tooling suites we rely on aren't the very things that compromise our infrastructure. For researchers, this field is a goldmine; for defenders, it is a wake-up call to apply traditional security rigor to the new world of AI development.

AI Summary

In this research presentation, Peng Zhou from Shanghai University details his investigation into the security posture of Azure MLOps (Machine Learning Operations) tooling suites. The core thesis of the talk is that as AI/ML capabilities are integrated into cloud ecosystems like Azure, the surrounding tooling suites often suffer from 'oversights'—simple, avoidable security bugs that persist despite being well-understood in traditional software engineering. These tools, while maintained by Microsoft, are often open-source and receive contributions from the broader machine learning community, where security expertise may be secondary to functional performance.

Zhou identifies vulnerabilities across six major suites: PromptFlow, Azure AI Generative SDK, DeepSpeed, Azure CLI, Azure DevOps (Azure-Dev), and TorchGeo. In PromptFlow, he discovered a command injection vulnerability caused by the unsafe use of the `join` function combined with `shell=True` in subprocess calls. He also found a path traversal bug that allows attackers to write arbitrary files (like malicious DLLs) to the local system. A significant part of the discussion focuses on how these 'local' vulnerabilities can be escalated to remote attacks. For instance, if a developer exposes the PromptFlow service on `0.0.0.0` or visits a malicious website while the service is running on the loopback interface, cross-origin requests can trigger the exploit.

The Azure AI Generative Python SDK contained instances of unsafe `eval()` calls on unsanitized tokens. Interestingly, Zhou points out that the developers were aware of the danger, as they used `literal_eval()` in other parts of the same codebase, yet overlooked these specific instances. In DeepSpeed, a model training library, he uncovered a Pickle deserialization vulnerability in the distributed training component. This flaw was inherited from PyTorch's distributed package, where it is a known but unpatched design risk. Zhou demonstrated a threat model where an attacker could mimic a computing 'rank' (node) in a cluster to send a malicious payload to the master node, achieving RCE.

Another compelling case study involves TorchGeo, which copied vulnerable code from `torchvision`. While `torchvision` eventually patched the bug, TorchGeo failed to synchronize the fix, leaving it vulnerable—a classic supply chain synchronization oversight. Zhou concludes by detailing his experience with the Microsoft Security Response Center (MSRC). Despite reporting these issues, he found that some patches were incomplete or introduced new bugs (regressions), suggesting that even during the remediation phase, oversights continue to occur. He suggests that while LLMs like GPT-4o can assist in auditing, they are not yet a complete solution for these nuanced oversights.
