Black Hat 2024

Attention Is All You Need for Semantics Detection: A Novel Transformer on Neural-Symbolic Approach

This talk introduces a novel neural-symbolic transformer model designed to perform semantic analysis on obfuscated malware and shellcode without requiring dynamic execution. By leveraging taint analysis and attention mechanisms, the model effectively identifies unknown API sequences and malicious behaviors in highly obfuscated binaries. The research demonstrates the model's capability to detect threats like Cobalt Strike stagers and analyze commercial packers like VMProtect and Themida. The authors have open-sourced the tool, named CuiDA, to assist the blue team community in threat hunting.

Beyond Static Analysis: Using Neural-Symbolic Transformers to Deobfuscate Malware

TLDR: Researchers have released CuiDA, a neural-symbolic transformer model that performs semantic analysis on obfuscated malware and shellcode without requiring dynamic execution. By combining taint analysis with attention mechanisms, the tool identifies malicious API sequences in binaries protected by commercial packers like VMProtect and Themida. This approach allows security researchers to bypass anti-sandbox and anti-emulation techniques that typically break traditional static and dynamic analysis workflows.

Manually reverse-engineering modern malware is a losing game. Between custom packers, anti-emulation checks, and multi-threaded execution flows, the time needed to analyze a single sample often exceeds the window for an effective response. When you are staring at a binary protected by VMProtect or Themida, the traditional approach of dumping the process and rebuilding the import table is often thwarted by anti-debugging and anti-sandbox logic.

The research presented at Black Hat 2024 shifts the focus from fighting the packer to understanding the underlying semantics of the code. By treating malware analysis as a natural language processing problem, the authors developed a neural-symbolic transformer that maps obfuscated instructions to their functional intent.

The Mechanics of Semantic Detection

Disassemblers such as IDA Pro and Ghidra recover a binary's structure, but they do not by themselves explain the intent behind a sequence of API calls. The core innovation here is a transformer model trained on a large dataset of binaries labeled with MITRE ATT&CK techniques.

The model, named CuiDA, uses a "use-define" chain extractor to trace how data flows through registers and memory. Instead of executing the code, the engine walks the control flow graph to extract contextual API sequences. This is critical because it avoids triggering the malware's anti-sandbox logic. If the malware checks for a debugger or a virtualized environment, those checks never get the chance to run, because no instruction is ever executed.
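The idea of a use-define chain can be illustrated with a minimal sketch. This is not CuiDA's extractor: it assumes a single straight-line basic block rather than a full control flow graph, and the `Insn` type, the `use_define_chain` helper, and the toy instruction list are all hypothetical names made up for this example.

```python
from dataclasses import dataclass

@dataclass
class Insn:
    addr: int
    text: str
    defs: tuple = ()   # registers this instruction writes
    uses: tuple = ()   # registers this instruction reads

def use_define_chain(insns, index, reg):
    """Walk backwards from insns[index] and collect, transitively,
    the instructions that define the value held in `reg`."""
    chain = []
    wanted = {reg}
    for insn in reversed(insns[:index]):
        hit = wanted & set(insn.defs)
        if hit:
            chain.append(insn)
            wanted -= hit              # this definition is resolved
            wanted |= set(insn.uses)   # now trace what fed into it
        if not wanted:
            break
    return list(reversed(chain))

# Toy block: the rbx instructions are junk with respect to the call.
insns = [
    Insn(0x1000, "mov rax, 0x40", defs=("rax",)),
    Insn(0x1004, "xor rbx, rbx", defs=("rbx",)),
    Insn(0x1008, "mov r9, rax", defs=("r9",), uses=("rax",)),
    Insn(0x100C, "add rbx, 1", defs=("rbx",), uses=("rbx",)),
    Insn(0x1010, "call VirtualAlloc", uses=("rcx", "rdx", "r8", "r9")),
]

# r9 carries the fourth argument (flProtect) in the Windows x64 convention.
for insn in use_define_chain(insns, 4, "r9"):
    print(hex(insn.addr), insn.text)
```

The walk skips the junk `rbx` instructions and keeps only the two that feed `r9`, which is the sense in which data-flow tracing can see through filler code without running anything.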

The model projects these API sequences into a Query, Key, and Value (QKV) database. This allows the transformer to calculate the likelihood of specific API usage patterns. For example, if a binary performs a series of memory allocations followed by a WriteProcessMemory call, the model recognizes the semantic pattern of T1055 (Process Injection) even if the individual instructions are heavily obfuscated or padded with junk code.
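To make the query/key scoring concrete, here is a small sketch of scaled dot-product attention weights in plain Python. It is an illustration under loud assumptions, not the talk's model: the bag-of-APIs `embed` function stands in for learned embeddings, and the `VOCAB` list and the reference patterns in `KNOWN` are hypothetical.

```python
import math

# Hypothetical vocabulary; a real model would learn embeddings instead.
VOCAB = ["VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread",
         "RegCreateKeyExW", "CreateFileW"]

def embed(api_sequence):
    """Toy embedding: count how often each vocabulary API appears."""
    return [float(api_sequence.count(api)) for api in VOCAB]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """softmax(q . k / sqrt(d)) over all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Hypothetical reference patterns acting as the keys.
KNOWN = {
    "process injection (T1055)": ["VirtualAllocEx", "WriteProcessMemory",
                                  "CreateRemoteThread"],
    "registry write": ["RegCreateKeyExW"],
    "file write": ["CreateFileW"],
}

observed = ["VirtualAllocEx", "VirtualAllocEx", "WriteProcessMemory"]
labels = list(KNOWN)
weights = attention_weights(embed(observed), [embed(KNOWN[l]) for l in labels])
best = labels[weights.index(max(weights))]
print(best)
```

Even though the observed sequence never reaches a thread-creation call, most of the attention weight lands on the injection pattern, because the allocations and the memory write overlap with that key far more than with the others.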

Handling Obfuscation and Packers

One of the most impressive aspects of this research is how it handles the "integer representation" problem. In 64-bit Windows binaries, integer values can span the full 64-bit range, which makes them difficult to tokenize with the fixed vocabulary of a standard transformer. The authors solved this by mapping these integers to human-readable tokens based on their functional role, such as memory page permissions or specific registry keys.
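A rough sketch of role-based integer tokenization follows. The constant values are the documented Windows memory protection and allocation flags, but the `ARG_ROLES` table, the `tokenize_int` helper, and the fallback tokens `<INT>` and `<INT64>` are assumptions invented for this example, not the authors' scheme.

```python
# Documented Windows memory-protection constants.
PAGE_PROTECT = {
    0x01: "PAGE_NOACCESS",
    0x02: "PAGE_READONLY",
    0x04: "PAGE_READWRITE",
    0x10: "PAGE_EXECUTE",
    0x20: "PAGE_EXECUTE_READ",
    0x40: "PAGE_EXECUTE_READWRITE",
}

# Documented allocation-type flags (a subset); these can be OR-ed together.
ALLOC_TYPE_FLAGS = {0x1000: "MEM_COMMIT", 0x2000: "MEM_RESERVE"}

# Hypothetical role table: (API name, zero-based argument index) -> role.
ARG_ROLES = {
    ("VirtualAlloc", 2): "alloc_type",
    ("VirtualAlloc", 3): "page_protect",
}

def tokenize_int(api, arg_index, value):
    """Map an integer argument to a readable token based on its role."""
    role = ARG_ROLES.get((api, arg_index))
    if role == "page_protect" and value in PAGE_PROTECT:
        return PAGE_PROTECT[value]
    if role == "alloc_type":
        names = [name for bit, name in sorted(ALLOC_TYPE_FLAGS.items())
                 if value & bit]
        if names:
            return "|".join(names)  # bits outside the table are ignored
    # Fallback: collapse unknown integers into coarse size buckets.
    return "<INT64>" if value > 0xFFFFFFFF else "<INT>"

print(tokenize_int("VirtualAlloc", 3, 0x40))
print(tokenize_int("VirtualAlloc", 2, 0x3000))
print(tokenize_int("VirtualAlloc", 1, 0x7FF612340000))
```

The design point is that the same raw value means different things in different argument positions, so the lookup is keyed on the API and argument index rather than on the integer alone.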

Consider a sample protected by VMProtect. The packer might use a virtual machine to execute code, making the static disassembly look like a mess of custom bytecode. CuiDA ignores the virtualized instructions and focuses on the API usage patterns that the packer must eventually invoke to interact with the Windows kernel. By identifying the "use-define" chains that lead to sensitive APIs like CreateProcess or RegCreateKeyEx, the model can infer the malware's behavior without ever needing to unpack the binary.

Practical Application for Pentesters

For those of us working on red team engagements or incident response, this tool provides a way to triage large sets of samples that would otherwise require hours of manual analysis. If you are hunting for T1027 (Obfuscated Files or Information), you can use this model to scan a directory of unknown binaries and flag those that exhibit suspicious API usage patterns, such as those associated with Cobalt Strike stagers.
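A triage loop of this kind might look like the sketch below. It does not call CuiDA, whose interface is not described here: `extract_api_sequence` is a hypothetical callback that you would wire to whatever analyzer you use, and the single entry in `SUSPICIOUS_PATTERNS` is an illustrative pattern, not a verified Cobalt Strike signature.

```python
from pathlib import Path

# Illustrative pattern only; real rules would come from your own tuning.
SUSPICIOUS_PATTERNS = {
    "remote process injection": ("VirtualAllocEx", "WriteProcessMemory",
                                 "CreateRemoteThread"),
}

def contains_in_order(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, allowing
    unrelated calls in between (an ordered-subsequence match)."""
    it = iter(sequence)
    return all(api in it for api in pattern)

def triage(directory, extract_api_sequence):
    """Return {file name: [matched pattern names]} for flagged files.

    `extract_api_sequence` is a caller-supplied function that takes a
    Path and returns a list of API names (hypothetical interface).
    """
    flagged = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        apis = extract_api_sequence(path)
        hits = [name for name, pattern in SUSPICIOUS_PATTERNS.items()
                if contains_in_order(apis, pattern)]
        if hits:
            flagged[path.name] = hits
    return flagged
```

Calling `triage("samples/", my_analyzer)` would return only the files whose extracted API sequence contains a listed pattern, which keeps the manual queue limited to samples that already show a suspicious shape.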

The tool is particularly effective against hybrid samples that mix C++ and .NET code. These binaries often confuse standard scanners because they require both managed and unmanaged analysis. By focusing on the semantic intent of the API calls, the transformer remains agnostic to the underlying language, providing a consistent detection signal across different development environments.

Defensive Implications

Defenders should view this as a significant step toward automating the detection of "living-off-the-land" techniques. While commercial packers are designed to hide the binary's structure, they cannot hide the binary's behavior. By integrating semantic analysis into your detection pipeline, you can move beyond simple signature-based detection and start identifying the functional intent of the code.

If you are currently relying on sandboxes that are easily detected by modern malware, it is time to integrate semantic analysis tools into your workflow. The ability to extract the "what" from a binary without needing to run the "how" is a massive advantage in a landscape where malware authors are increasingly focused on environmental awareness and evasion.

Download the CuiDA repository and test it against your own collection of obfuscated samples. The next time you encounter a binary that refuses to run in your lab, stop trying to force it to execute and start asking what it is trying to do. The semantics of the code will tell you everything you need to know.

Talk Type

research presentation

Difficulty

advanced