Poisoning Web-Scale Datasets is Practical
This talk demonstrates the feasibility of poisoning large-scale machine learning training datasets by exploiting the lifecycle of domain registrations used in data collection. By identifying and purchasing expired domains that are frequently scraped for training data, an attacker can inject malicious content into the datasets used by popular AI models. The research highlights the critical lack of integrity verification in the data supply chain for generative AI. The speaker emphasizes that security professionals should apply traditional application security principles, such as supply chain validation and integrity checks, to machine learning pipelines.
Poisoning the Well: How Expired Domains Compromise AI Training Data
TLDR: Researchers have demonstrated that large-scale machine learning models are vulnerable to data poisoning attacks that exploit the lifecycle of expired domain registrations. By purchasing expired domains whose URLs still appear in datasets like Common Crawl, attackers can inject malicious content that gets ingested during the training process. This research highlights a critical failure in the data supply chain for generative AI and underscores the need for rigorous integrity verification in machine learning pipelines.
Machine learning models are only as good as the data they consume. While the industry obsesses over model architecture and hyperparameter tuning, the data supply chain remains a massive, unmonitored blind spot. We treat massive datasets like LAION-5B or Common Crawl as immutable truths, but they are actually volatile collections of internet scrapes. This research proves that if you can control the source of that data, you can control the model itself.
The Mechanics of the Poisoning Attack
The attack vector is deceptively simple. Organizations building large-scale models do not host petabytes of raw data. Instead, they distribute lists of URLs and metadata. Automated scrapers then traverse these lists to pull the actual content. The vulnerability lies in the fact that these lists are often static, while the internet is anything but.
When a domain registration expires, it enters a grace period before becoming available for re-registration. An attacker can monitor these datasets, identify high-value domains that are about to expire, and purchase them. Once the attacker controls the domain, they can serve arbitrary content to any scraper that hits the old URL.
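A first pass at this reconnaissance step takes only a few lines of Python. The sketch below is illustrative rather than an attack tool: it extracts the hostnames referenced by a dataset's URL list and treats DNS resolution failure as a weak hint that a registration may have lapsed. A real workflow would confirm availability through WHOIS or a registrar API, and the sample URLs here are placeholders.

```python
import socket
from urllib.parse import urlparse

def extract_domains(url_list):
    """Collect the unique hostnames referenced by a dataset's URL list."""
    return {urlparse(u).hostname for u in url_list if urlparse(u).hostname}

def resolves(domain):
    """DNS failure is only a weak signal; confirm with WHOIS before acting."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

# Example: filter a toy URL list down to non-resolving domains.
urls = ["https://localhost/sample.png"]
dead = [d for d in extract_domains(urls) if not resolves(d)]
```

In practice the interesting targets are domains that still resolve today but whose registrations are close to expiry, which requires polling registration data rather than DNS alone.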
This is not a theoretical exercise in model manipulation; it is a direct instance of OWASP A08:2021 (Software and Data Integrity Failures). If a model is trained on a dataset that includes your newly acquired domain, you are effectively performing a man-in-the-middle attack on the training process: you can serve images, text, or code that influence the model's weights, embedding backdoors or biased behavior into the final product.
Operationalizing the Threat
For a pentester or researcher, this is about identifying the "split-view" poisoning opportunity. The curator of a dataset publishes a list of URLs. If you can identify which of those domains are currently available, you can register them and serve content based on the User-Agent of the incoming request.
Consider a simple implementation where you serve benign content to standard browsers but malicious payloads to the specific scrapers used by AI researchers:
def select_sample(user_agent):
    # Common Crawl's scraper identifies itself with the "CCBot" User-Agent.
    if "ccbot" in user_agent.lower():
        return "poisoned_sample.whatever"
    else:
        return "normal_sample.whatever"
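Wired into an actual web server, the same check becomes a complete split-view response handler. This is a minimal sketch using Python's standard http.server module; the CCBot match approximates Common Crawl's scraper User-Agent, and the HTML bodies are placeholders for real poisoned and benign documents.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def is_target_scraper(user_agent):
    """Match Common Crawl's scraper; a real attack would fingerprint more UAs."""
    return "ccbot" in user_agent.lower()

class SplitViewHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        # Scrapers get the poisoned document; everyone else sees benign content.
        body = (b"<html>poisoned</html>" if is_target_scraper(ua)
                else b"<html>benign</html>")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("0.0.0.0", 8080), SplitViewHandler).serve_forever()
```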
This technique allows an attacker to remain undetected by casual inspection. The dataset curator sees a valid, working URL, and the model trainer unknowingly pulls the poisoned data. The scale of these datasets—often containing billions of entries—makes manual verification impossible. By the time the model is trained and deployed, the malicious data is already baked into the neural network's parameters.
The Hidden Risk of Insecure Deserialization
The danger extends beyond the data itself. Many machine learning pipelines rely on pickle files for serializing and deserializing models. As many of us have seen in CTF challenges since 2013, deserializing untrusted data is a recipe for remote code execution.
If an attacker can poison the data supply chain, they can potentially replace legitimate model files with malicious ones. When the training pipeline or a downstream application attempts to load the "model," it executes the attacker's payload. This is essentially a supply chain attack on the model's binary format. If you are auditing an AI pipeline, check if the system is pulling models from insecure locations or if it lacks cryptographic signatures for the files it loads.
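To see why a swapped model file is game over, note that the classic pickle gadget takes only a few lines. The sketch below builds a pickle that, when loaded, would invoke os.system; the echo command is a harmless stand-in for an arbitrary payload.

```python
import os
import pickle

class Payload:
    # pickle calls __reduce__ during serialization and replays the returned
    # callable at load time -- here, an arbitrary OS command.
    def __reduce__(self):
        return (os.system, ("echo pwned",))

malicious = pickle.dumps(Payload())
# pickle.loads(malicious) would run `echo pwned` on the loader's machine.
```

This is why a "model file" pulled from an untrusted source must be treated as executable code, not data.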
Securing the Pipeline
Defending against this requires shifting from a "trust the dataset" mindset to a "verify the data" approach. We need to treat training data with the same level of scrutiny we apply to third-party software dependencies.
- Integrity Verification: Implement cryptographic hashing for every piece of data in your training set. If the hash of the downloaded file doesn't match the expected value, the pipeline must fail.
- Supply Chain Auditing: Treat your data sources as untrusted inputs. Use tools like img2dataset to manage the ingestion process, but ensure you are implementing custom validation logic to inspect the content before it reaches the training phase.
- Infrastructure Hardening: If you are using platforms like MLflow to manage your model lifecycle, ensure that access controls are strictly enforced. An unprotected Jupyter notebook or an open MLflow instance is an open door for an attacker to swap your models.
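The first of these controls, per-sample hash verification, is straightforward to bolt onto an ingestion pipeline. The sketch below assumes the dataset curator publishes a SHA-256 digest alongside each URL, which not all web-scale datasets do today; the function names are illustrative.

```python
import hashlib

def sha256_file(path, chunk_size=8192):
    """Stream the file so large samples don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_sample(path, expected_digest):
    """Fail closed: a mismatch means the sample must not enter training."""
    return sha256_file(path) == expected_digest
```

Note that the digests must be pinned at dataset-curation time; a hash computed from a re-scrape after the domain changed hands verifies the poisoned content, not the original.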
The era of "move fast and break things" in AI is colliding with the reality of adversarial security. We are building systems that learn from the entire internet, but we have yet to build the tools to ensure that the internet isn't lying to us. If you are working on these systems, stop assuming your data is clean. Start digging into the supply chain, verify your sources, and assume that every URL in your dataset is a potential point of failure. The math might be complex, but the security principles remain the same.