Black Hat 2023

Unmasking APTs: An Automated Approach for Real-World Threat Attribution


This talk introduces ADAPT, an automated machine-learning-based framework designed to improve threat actor attribution by analyzing diverse file types within an attack chain. The system addresses the challenges of fragmented threat intelligence and inconsistent naming conventions by performing feature extraction and clustering at both the campaign and group levels. The researchers demonstrate how their approach can successfully attribute samples to known threat actors, such as APT29, even when those samples were previously unattributed. The talk highlights the importance of a systematic, data-driven approach to threat intelligence and provides a publicly available dataset for further research.

Automating APT Attribution: Moving Beyond Manual Analysis

TLDR: Threat actor attribution is currently a manual, error-prone process plagued by inconsistent naming conventions and fragmented intelligence. This research introduces ADAPT, an automated machine-learning framework that standardizes feature extraction across diverse file types to cluster campaigns and groups. By shifting from manual analysis to a systematic, data-driven approach, researchers can now identify patterns in malware that were previously missed, significantly reducing the time required to link disparate attacks to a single threat actor.

Attribution is the most frustrating part of incident response. When you are staring at a pile of malicious samples, the pressure to provide a name—to say exactly who is behind the keyboard—often leads to "gut feeling" analysis. We see a specific C2 pattern or a reused string, and we jump to conclusions. The reality is that threat actors are getting better at obfuscation, code reuse, and deploying false flags. If your attribution process relies on manual correlation, you are likely missing the bigger picture or, worse, falling for a deliberate deception.

The research presented at Black Hat 2023 by Aakanksha Saha and her team tackles this head-on. They recognized that the current state of threat intelligence is fragmented. Different vendors track the same actor under different aliases, and the sheer volume of heterogeneous files (executables, documents, scripts) makes manual correlation nearly impossible. Their solution, ADAPT, treats attribution as a data science problem rather than a detective novel.

The Mechanics of Automated Attribution

At its core, ADAPT is a pipeline that standardizes how we look at malware. Instead of manually hunting for strings, the framework automates feature extraction across heterogeneous file types. It uses FLOSS to recover obfuscated strings, YARA rules to flag known malicious patterns, and oletools to parse document-based threats. By normalizing these features into a common representation, the system can compare a Windows executable against an Office document and surface technical similarities that a human analyst might overlook.
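To make the normalization idea concrete, here is a minimal sketch. It is not ADAPT's actual code. It assumes the strings and YARA rule names have already been collected by whatever tooling you run, and it only shows how output from two different file types might be folded into one comparable record. The names SampleFeatures, normalize, and shared_features are hypothetical, as are the rule names and strings in the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SampleFeatures:
    """One normalized record per sample, regardless of file type."""
    sample_id: str
    file_type: str
    strings: frozenset = field(default_factory=frozenset)
    yara_rules: frozenset = field(default_factory=frozenset)


def normalize(sample_id, file_type, raw_strings, yara_rules, min_len=4):
    """Lowercase and trim strings, drop very short ones, deduplicate everything.

    The min_len cutoff is an assumption for illustration, not a value from the talk.
    """
    cleaned = {s.strip().lower() for s in raw_strings if len(s.strip()) >= min_len}
    return SampleFeatures(sample_id, file_type, frozenset(cleaned), frozenset(yara_rules))


def shared_features(a, b):
    """Return the features two samples have in common, per feature family."""
    return {
        "strings": a.strings & b.strings,
        "yara_rules": a.yara_rules & b.yara_rules,
    }


# Hypothetical tool output for an executable and an Office document.
exe = normalize("a.exe", "pe",
                ["  Mozilla/5.0 Custom ", "ab", "cmd.exe /c"],
                ["hypothetical_loader_rule"])
doc = normalize("b.doc", "ole",
                ["mozilla/5.0 custom", "AutoOpen"],
                ["hypothetical_loader_rule", "hypothetical_macro_rule"])
overlap = shared_features(exe, doc)
```

Once both samples live in the same record shape, the file type stops mattering for comparison: the overlap here contains the shared user-agent string and the shared rule, even though one sample is a PE and the other a document.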

The framework operates on two distinct levels: campaign attribution and group attribution. Campaign attribution focuses on the "how"—the tactics and techniques used in a specific operation. Group attribution focuses on the "who"—the persistent infrastructure, coding style, and operational patterns that define an actor over time.

Consider the technical challenge of feature transformation. You cannot simply feed raw strings into a clustering algorithm. ADAPT uses a combination of one-hot encoding for categorical features and string vectorization for the more complex data. For example, when dealing with YARA rules, the system converts the presence of a rule into a binary feature. If a sample triggers a specific rule, that feature is set to one. This creates a high-dimensional space where samples with similar behaviors naturally cluster together.
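The binary YARA encoding described above can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's implementation: the rule names are made up, and the clustering step is a deliberately simple single-linkage grouping over Jaccard distance with an arbitrary threshold, swapped in so the example stays dependency-free. The talk summary does not say which clustering algorithm or distance ADAPT uses.

```python
def one_hot(samples):
    """Turn {sample_id: set_of_rule_names} into binary vectors over a shared vocabulary."""
    vocab = sorted({rule for rules in samples.values() for rule in rules})
    vectors = {
        sid: [1 if rule in rules else 0 for rule in vocab]
        for sid, rules in samples.items()
    }
    return vocab, vectors


def jaccard_distance(u, v):
    """1 - |intersection| / |union| over the positions set to one."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - (both / either) if either else 0.0


def cluster(vectors, max_distance=0.5):
    """Single-linkage grouping: samples closer than max_distance share a cluster."""
    ids = sorted(vectors)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard_distance(vectors[a], vectors[b]) <= max_distance:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical rule hits for three samples.
hits = {
    "s1": {"rule_a", "rule_b"},
    "s2": {"rule_a", "rule_b", "rule_c"},
    "s3": {"rule_x"},
}
vocab, vectors = one_hot(hits)
clusters = cluster(vectors)
```

With these inputs, s1 and s2 share two of three rules and land in the same cluster, while s3 stays on its own. A real pipeline would concatenate many feature families into the vector and would almost certainly use a more robust clustering method.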

Why This Matters for Pentesters

For those of us on the offensive side, this research is a wake-up call. We often think of our "signature" as something we can easily change by swapping out a few functions or re-compiling with a different flag. ADAPT shows that the infrastructure and the underlying logic—the "DNA" of our tools—are much harder to hide.

If you are performing a red team engagement, you need to understand that your operational security is not just about the payload. It is about the entire chain. If you reuse a specific C2 communication structure or a particular way of interacting with the Windows API across different engagements, automated systems will eventually link those activities to your team.

The researchers demonstrated this by successfully attributing samples that were previously marked as "unknown" to known actors like APT29. They found that even when an actor tries to use different file types, the underlying code patterns—like the use of specific cryptographic libraries or shared network communication logic—remain consistent.

The Defensive Reality

Defenders are currently drowning in data. The MITRE ATT&CK framework is an excellent starting point, but it is a static map of a dynamic battlefield. By implementing automated clustering, blue teams can move from reactive, alert-based monitoring to proactive threat hunting. Instead of chasing individual IOCs that change every hour, they can focus on the clusters that represent the actual threat.

If you are building a detection pipeline, stop treating every alert as an isolated event. Start looking at the metadata of the files you are catching. Are they using the same compiler? Do they share the same imported functions? Are they calling out to infrastructure that shares the same BGP prefix? These are the questions that ADAPT helps answer.
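As a rough sketch of what asking those questions programmatically could look like, the snippet below compares two file records on compiler, imported functions, and network prefix. The record layout, the field names, and the list of prefixes are all assumptions for illustration. In practice the prefixes would come from routing data you already collect, and the IP addresses used here are reserved documentation ranges.

```python
import ipaddress


def shared_prefix(ip_a, ip_b, prefixes):
    """Return the first known prefix containing both addresses, or None."""
    a = ipaddress.ip_address(ip_a)
    b = ipaddress.ip_address(ip_b)
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        if a in net and b in net:
            return prefix
    return None


def compare_metadata(a, b, prefixes):
    """Answer the three questions for a pair of file records."""
    imports_a, imports_b = set(a["imports"]), set(b["imports"])
    union = imports_a | imports_b
    return {
        "same_compiler": a["compiler"] == b["compiler"],
        # Jaccard overlap of the import sets, 0.0 when both are empty.
        "import_overlap": len(imports_a & imports_b) / len(union) if union else 0.0,
        "shared_prefix": shared_prefix(a["c2_ip"], b["c2_ip"], prefixes),
    }


# Hypothetical records for two files caught by separate alerts.
file_a = {"compiler": "MSVC 14.0",
          "imports": ["CreateFileW", "WinHttpOpen"],
          "c2_ip": "203.0.113.10"}
file_b = {"compiler": "MSVC 14.0",
          "imports": ["WinHttpOpen", "CryptEncrypt"],
          "c2_ip": "203.0.113.77"}
known_prefixes = ["198.51.100.0/24", "203.0.113.0/24"]

report = compare_metadata(file_a, file_b, known_prefixes)
```

None of these signals proves a link on its own. The point is that each comparison becomes a feature you can feed into clustering instead of a hunch an analyst has to remember.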

What Comes Next

The most promising aspect of this work is the shift toward open, shared datasets. The researchers have released their standardized group-labeled dataset, which is a massive step forward for the community. We need more of this. We need to stop hoarding intelligence in proprietary silos and start building tools that can actually parse the complexity of modern attacks.

If you are a researcher, download the dataset and run it through your own clustering algorithms. If you are a pentester, look at the features they are extracting and ask yourself: "If I were to run my own tools through this, would I be exposed?" The era of manual attribution is ending. The future belongs to those who can automate separating the signal from the noise.

Talk Type

research presentation

Difficulty

advanced