Threat Hunting with LLM: From Discovering APT SAAIWC to Tracking APTs with AI
This talk demonstrates the application of Large Language Models (LLMs) to automate threat hunting by analyzing file names, strings, and bytecode to identify malicious samples. The researchers showcase how LLMs can be guided via in-context learning (without any fine-tuning) to generate high-quality YARA rules and identify behavioral patterns in sandbox logs. This approach significantly reduces the manual effort required to track Advanced Persistent Threat (APT) groups and their evolving tactics. The presentation includes a live demonstration of an LLM-based rule generation system and its effectiveness in identifying related malware samples.
Automating Threat Hunting: Using LLMs to Scale YARA Rule Generation
TLDR: Researchers at Black Hat 2024 demonstrated how to use Large Language Models to automate the creation of high-fidelity YARA rules for hunting APT malware. By feeding file names, strings, and bytecode into an LLM, they successfully identified related samples from the SAAIWC group that static analysis missed. This workflow allows security teams to scale their detection capabilities without manually writing every rule from scratch.
Threat hunting is often a game of diminishing returns. You spend hours manually triaging thousands of samples, only to find that your static rules are too brittle to catch the next iteration of the same campaign. The researchers at DBAPP Security presented a practical, repeatable way to break this cycle by integrating LLMs directly into the detection engineering pipeline. Instead of treating the LLM as a chatbot, they treated it as a reasoning engine capable of understanding the context behind malicious artifacts.
Moving Beyond Static String Matching
Most hunters rely on YARA to flag suspicious files. The problem is that attackers know this. They rotate file names, obfuscate strings, and pack their binaries to evade simple signature-based detection. The researchers focused on the SAAIWC group, an APT targeting Southeast Asian government and military entities. By analyzing the mutexes and file names used by this group, they realized that while the specific strings changed, the intent and the "flavor" of the naming conventions remained consistent.
The team used an LLM to analyze the differences between legitimate files and malicious samples. When they fed the LLM a set of suspicious file names, the model correctly identified that some were designed to mimic specific events or invoices, while others were purely noise. This context-aware filtering is the core of their approach. It allows a hunter to move from "find me this specific string" to "find me files that look like they were crafted for a spear-phishing campaign against a military target."
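To make the idea concrete, here is a minimal sketch of how such a file-name triage prompt might be assembled. The file names and the overall phrasing are invented for illustration; the talk did not publish its exact prompts.

```python
# Hedged sketch: building a triage prompt that asks a model whether file
# names look like crafted phishing lures or random noise. The sample names
# below are fabricated, not real SAAIWC artifacts.

def build_filename_triage_prompt(filenames):
    """Assemble a prompt asking the model, per file name, whether it mimics
    a document an attacker would use as a lure or looks machine-generated."""
    listing = "\n".join(f"- {name}" for name in filenames)
    return (
        "You are a threat hunter. For each file name below, state whether it "
        "mimics a document an attacker would use as a spear-phishing lure "
        "(e.g. an invoice, meeting notice, or official report) or looks like "
        "machine-generated noise. Answer one line per file.\n\n"
        f"File names:\n{listing}"
    )

prompt = build_filename_triage_prompt([
    "2023_Defense_Budget_Review.docx.exe",  # lure-style double extension
    "a8f3kq91.tmp",                          # noise-style random name
])
print(prompt)
```

The point is that the hunter encodes intent ("does this look crafted for a target?") rather than a literal string to match.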
The LLM-Driven Detection Pipeline
The researchers implemented a three-stage pipeline to automate rule generation:
- Contextual Analysis: The LLM analyzes the metadata of a known malicious sample, including file names and strings.
- Feature Extraction: The model identifies "special" features—unique identifiers, specific API calls, or unusual command-line arguments—that distinguish the malicious sample from benign noise.
- Rule Synthesis: The LLM generates a YARA rule based on these features, which is then tested against a large dataset to ensure low false-positive rates.
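The three stages above can be sketched as a small pipeline. Here the "feature extraction" stage is approximated with a benign-baseline diff; in the talk, that judgment call is made by the LLM. All sample strings and the rule name are invented placeholders.

```python
# Minimal sketch of the three-stage pipeline. Stage 2 (feature extraction)
# is stubbed as a set difference against a benign baseline; the actual
# system delegates this reasoning to the LLM.

BENIGN_BASELINE = {"kernel32.dll", "GetProcAddress", "LoadLibraryA"}

def extract_special_features(sample_strings):
    """Stage 2: keep only strings absent from the benign baseline."""
    return sorted(s for s in sample_strings if s not in BENIGN_BASELINE)

def synthesize_yara_rule(name, features):
    """Stage 3: emit a draft YARA rule from the surviving features."""
    defs = "\n".join(f'        $s{i} = "{f}"' for i, f in enumerate(features))
    return (
        f"rule {name}\n{{\n    strings:\n{defs}\n"
        "    condition:\n        2 of them\n}"
    )

features = extract_special_features(
    {"kernel32.dll", "GetProcAddress", "saaiwc_mutex_v2", "evt_invoice.hta"})
rule = synthesize_yara_rule("suspected_saaiwc_loader", features)
print(rule)
```

The draft rule would then be scanned against a large corpus (stage 3's false-positive test) before anyone trusts it.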
For bytecode analysis, the LLM acts as a disassembler: given raw hex, it can reconstruct the assembly instructions and identify malicious behaviors like process injection or network communication. This is a massive time-saver. Instead of manually reversing every sample to find a stable code block for a YARA rule, the LLM highlights the most relevant segments of the code.
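A sketch of the preprocessing this implies: formatting raw bytes into a hex dump the model can reason over before asking it to disassemble. The byte sequence below is fabricated, not real malware code.

```python
# Sketch: prepare raw bytes for LLM-assisted disassembly. The blob is a
# made-up example; the prompt wording is illustrative, not from the talk.

def hex_dump_for_prompt(blob, width=16):
    """Format raw bytes as offset-prefixed hex lines."""
    lines = []
    for off in range(0, len(blob), width):
        chunk = blob[off:off + width]
        lines.append(f"{off:04x}: " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)

blob = bytes.fromhex("6a00e8000000005868")  # fabricated shellcode-like bytes
dump = hex_dump_for_prompt(blob)
prompt = ("Disassemble the following x86 bytes and flag any behavior "
          "consistent with process injection or C2 traffic:\n" + dump)
print(dump)
```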
Practical Application for Pentesters
If you are running a red team engagement or performing a post-compromise assessment, this technique is invaluable for identifying lateral movement or persistence mechanisms. During an engagement, you often encounter custom loaders or scripts that aren't flagged by standard EDR signatures. By using an LLM to quickly generate rules based on the specific behaviors you observe in your sandbox, you can pivot from one compromised host to finding the rest of the infrastructure.
The researchers demonstrated this by using their LLM-based system to generate a rule for the "Earthworm" malware. The generated rule was not just a static string match; it incorporated specific bytecode sequences that were stable across different versions of the malware. When they ran this rule against their internal database, it successfully flagged related samples that had previously gone undetected.
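The kind of hybrid rule described above might look like the following draft. The hex pattern (a common VirtualAlloc argument prologue) and the rule name are invented placeholders, not the presenters' actual Earthworm signature; only the "rssocks" string is a real Earthworm component name. A naive byte matcher stands in for a YARA scan so the example runs without the yara library.

```python
# Illustrative hybrid rule draft: a stable bytecode sequence plus a
# supporting string. Pattern and rule name are assumptions for this sketch.

DRAFT_RULE = r"""
rule earthworm_variant_draft
{
    strings:
        $code = { 6a 40 68 00 30 00 00 }   // push 0x40; push 0x3000 (alloc flags)
        $s1 = "rssocks"                    // Earthworm reverse-socks component
    condition:
        $code and $s1
}
"""

def naive_match(sample: bytes) -> bool:
    """Stand-in for a YARA scan: require both the byte pattern and string."""
    pattern = bytes.fromhex("6a406800300000")
    return pattern in sample and b"rssocks" in sample

sample = b"\x90\x90" + bytes.fromhex("6a406800300000") + b"...rssocks..."
hit = naive_match(sample)
print(hit)
```

Because the condition keys on a code sequence rather than a file name, the rule survives the string rotation discussed earlier.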
The Limits of AI-Generated Rules
Do not mistake this for a "set it and forget it" solution. The researchers were clear that the effectiveness of the LLM depends entirely on the quality of the guidance provided. If you don't teach the model what constitutes a "special" behavior—such as specific API calls related to process creation or file discovery—the model will produce noisy, ineffective rules.
Furthermore, the LLM can hallucinate or misinterpret complex obfuscation. You must treat the output as a draft. A human researcher still needs to validate the rule against a known-good dataset before deploying it to production. The goal is to reduce the time spent on the "blank page" phase of rule writing, not to replace the analyst.
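The validation step above is easy to automate. Here is a toy harness under obvious assumptions: the benign corpus and the deliberately overbroad "LLM draft" predicate are invented for illustration.

```python
# Sketch: gate an LLM-drafted rule on its false-positive rate against a
# known-good dataset before deployment. Corpus and predicate are toy data.

def false_positive_rate(rule_matches, benign_corpus):
    """Fraction of benign samples the draft rule fires on."""
    hits = sum(1 for sample in benign_corpus if rule_matches(sample))
    return hits / len(benign_corpus)

benign = [b"MZ...notepad...", b"MZ...calc...", b"MZ...invoice template..."]
draft_rule = lambda sample: b"invoice" in sample  # overly broad LLM draft

fpr = false_positive_rate(draft_rule, benign)
deploy = fpr <= 0.01  # reject noisy drafts before production
print(fpr, deploy)
```

Here the draft fires on a benign invoice template, so it is sent back for refinement rather than deployed.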
What to Do Next
If you are currently managing a large volume of telemetry, start by identifying the most common "noise" in your environment. Use an LLM to categorize your recent alerts and see if it can identify patterns that your current rules are missing. The researchers showed that even simple in-context learning—providing the model with a few examples of what you consider "malicious" versus "benign"—can drastically improve the quality of the rules it generates.
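The in-context learning mentioned above is just a few labeled examples prepended to the query. A minimal sketch, with invented file names and labels:

```python
# Sketch of simple in-context (few-shot) learning: labeled examples are
# placed before the query so the model infers the classification criteria.
# All names and labels below are fabricated for illustration.

FEW_SHOT = [
    ("Q3_Procurement_Notice.docx.exe", "malicious"),  # lure + double extension
    ("setup_chrome_128.exe", "benign"),
    ("meeting-agenda.pdf.lnk", "malicious"),
]

def build_few_shot_prompt(query_name):
    shots = "\n".join(f"File: {n}\nLabel: {l}\n" for n, l in FEW_SHOT)
    return (
        "Classify each file name as malicious or benign.\n\n"
        + shots
        + f"\nFile: {query_name}\nLabel:"
    )

prompt = build_few_shot_prompt("Annual_Audit_Report.xlsx.scr")
print(prompt)
```

Swapping in your own environment's examples is how you tune the model's notion of "malicious" without any retraining.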
Stop writing rules for every single sample you find. Start building a pipeline that teaches your tools to recognize the intent of the attacker. The next time you are staring at a pile of suspicious files, don't just grep for strings. Feed the metadata into an LLM and see if it can find the signal in the noise.