Standing on the Shoulders of Giants: De-Obfuscating WebAssembly Using LLVM
Description
This presentation explores advanced techniques for de-obfuscating WebAssembly (Wasm) by lifting it to LLVM Intermediate Representation (IR). The researchers demonstrate how leveraging existing compiler optimization passes and specialized tools like Simba++ and Super can effectively neutralize complex obfuscation like control flow flattening and Mixed Boolean-Arithmetic.
Standing on the Shoulders of Giants: Mastering Wasm De-Obfuscation with LLVM
WebAssembly (Wasm) is no longer just a niche technology for high-performance web apps; it is a pervasive compilation target found in everything from blockchain smart contracts to browser-based games and, increasingly, malware. As its footprint grows, so does the sophistication of the obfuscation used to protect it—or hide its malicious intent. In this post, we’ll explore the cutting-edge research presented by Vikas Gupta and Peter Garba on how to dismantle Wasm obfuscation by leveraging the power of the LLVM compiler infrastructure.
The Rise of Obfuscated WebAssembly
Obfuscation serves a dual purpose. For legitimate developers, it’s a way to protect intellectual property (IP) and digital rights management (DRM) logic. For malware authors, it’s a tool for evasion. We’ve seen a surge in Wasm-based Bitcoin miners like Kryptonite that use diversification techniques to bypass antivirus signatures. Traditional reverse engineering tools often struggle with these binaries once they've been processed by obfuscators like OLLVM, Hikari, or Polaris.
The challenge with de-obfuscating Wasm isn't just the obfuscation itself, but the lack of mature, battle-hardened tools dedicated to Wasm analysis. While tools like Ghidra and IDA Pro have Wasm support, they often hit their limits when faced with control flow flattening or complex Mixed Boolean-Arithmetic (MBA) expressions that bloat a function from 200 instructions to 60,000.
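To make control flow flattening concrete, here is a minimal hand-written sketch in Python. It is not the output of OLLVM, Hikari, or Polaris, and the function names and state numbers are invented for illustration. It shows the general shape: every basic block is moved under a single dispatcher loop driven by a state variable, so the original loop and branch structure disappears from the control flow graph.

```python
def checksum(data):
    """Original, readable control flow."""
    total = 0
    for byte in data:
        if byte % 2 == 0:
            total += byte
        else:
            total ^= byte
    return total


def checksum_flattened(data):
    """Same logic after flattening: every basic block hangs off one dispatcher."""
    state, i, total = 0, 0, 0
    while True:
        if state == 0:      # loop header
            state = 1 if i < len(data) else 5
        elif state == 1:    # branch on parity
            state = 2 if data[i] % 2 == 0 else 3
        elif state == 2:    # even case
            total += data[i]
            state = 4
        elif state == 3:    # odd case
            total ^= data[i]
            state = 4
        elif state == 4:    # loop increment
            i += 1
            state = 0
        else:               # state 5: exit block
            return total
```

Both functions return the same value for any byte string, but an analyst looking at the flattened version has to recover the state transitions before the loop and the branch become visible again.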
The LLVM Lifting Strategy
The breakthrough strategy discussed here involves "lifting" Wasm code into LLVM Intermediate Representation (IR). Why LLVM? Because it is a world-class, target-independent optimizer. If we can represent obfuscated Wasm as LLVM IR, we can use decades of compiler research to simplify the code back to its original logic.
The Lifting Pipeline
- Wasm to C: Using wasm2c (from the WABT toolkit), the binary is translated into C code. This step is critical because wasm2c preserves the "runtime instance," which contains indices for memory, globals, and tables.
- C to LLVM IR: Using Clang with the -O0 flag, the C code is converted to LLVM IR. Using -O0 ensures we don't lose type information through "opaque pointers" prematurely.
- Optimization: This is where the magic happens. By applying an O3 optimization pipeline, LLVM can eliminate dead code, simplify arithmetic, and fold constants, work that would take a human analyst days to unpick.
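The three steps above can be sketched as a small Python helper that only builds the command lines. The talk does not give exact invocations, so the file naming, the include directory parameter, and the use of -disable-O0-optnone are assumptions made for this sketch; the individual flags are standard wasm2c, clang, and opt options.

```python
from pathlib import Path


def lifting_commands(wasm_path, wabt_include_dir, opt_level="O3"):
    """Return the command lines for the Wasm -> C -> LLVM IR -> optimized IR steps."""
    wasm = Path(wasm_path)
    c_file = wasm.with_suffix(".c")
    raw_ir = wasm.with_suffix(".ll")
    opt_ir = wasm.with_suffix(".opt.ll")
    return [
        # Step 1: translate the Wasm binary to C.
        ["wasm2c", str(wasm), "-o", str(c_file)],
        # Step 2: compile to textual LLVM IR without optimizing.
        # Assumption: -disable-O0-optnone is passed so clang does not tag
        # functions `optnone`, which would make the later opt run skip them.
        ["clang", "-O0", "-Xclang", "-disable-O0-optnone", "-S", "-emit-llvm",
         "-I", str(wabt_include_dir), str(c_file), "-o", str(raw_ir)],
        # Step 3: run the standard O3 pipeline over the lifted IR.
        ["opt", f"-passes=default<{opt_level}>", "-S", str(raw_ir), "-o", str(opt_ir)],
    ]
```

Each returned list can be handed to subprocess.run(cmd, check=True) once the tools are installed. The include directory should point at wherever the wasm2c runtime header (wasm-rt.h) lives in the local WABT checkout.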
Introducing Squonji: The Orchestrator
To make this process repeatable, the researchers developed Squonji. This tool automates the orchestration of lifting and de-obfuscation. A key feature of Squonji is its ability to model and inject the Wasm runtime into the lifted IR.
In Wasm, all memory accesses are indexed. Without the runtime context, a compiler optimizer can't know what value is at a specific memory index. Squonji inlines the wasm2c initialization code directly into the function being analyzed. This allows the LLVM optimizer to "see" the initial state of the memory and globals, enabling it to solve opaque predicates—conditions that always evaluate to true or false but are designed to look complex to a human.
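A toy model in Python illustrates why the initial state matters. This is not how LLVM or Squonji is implemented; the expression encoding, the fold function, and the memory index 1024 are all hypothetical. The point is only that a predicate built on a memory load cannot be folded until the value behind that load is known.

```python
def fold(expr, memory=None):
    """Fold an expression tree to an int, or return None if it depends on unknown state.

    expr is an int constant, ("load", index), or (op, lhs, rhs).
    memory maps indices to known initial values (a hypothetical stand-in
    for the inlined runtime initialization).
    """
    if isinstance(expr, int):
        return expr
    if expr[0] == "load":
        # Without the runtime's initial state, a load is opaque to the folder.
        return None if memory is None else memory.get(expr[1])
    op, lhs, rhs = expr
    a, b = fold(lhs, memory), fold(rhs, memory)
    if a is None or b is None:
        return None
    ops = {
        "add": lambda x, y: (x + y) & 0xFFFFFFFF,
        "mul": lambda x, y: (x * y) & 0xFFFFFFFF,
        "and": lambda x, y: x & y,
        "eq": lambda x, y: int(x == y),
    }
    return ops[op](a, b)


# Toy opaque predicate: (v * (v + 1)) & 1 == 0, with v loaded from memory index 1024.
v = ("load", 1024)
predicate = ("eq", ("and", ("mul", v, ("add", v, 1)), 1), 0)

print(fold(predicate))             # None: the load is unknown
print(fold(predicate, {1024: 7}))  # 1: the branch is always taken
```

Once the predicate folds to a constant, the branch that depends on it becomes dead code, and ordinary dead code elimination can remove it.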
Beyond Standard Optimization: Simba++ and Super
Even LLVM has its limits. Some obfuscation techniques, like Mixed Boolean-Arithmetic (MBA), are specifically designed to be mathematically hard for standard compilers to simplify. To handle these, Squonji integrates two specialized tools:
- Simba++: This tool identifies MBA expressions within the LLVM IR and uses SMT solvers (like Z3) to prove their simplification. For example, a massive expression of ANDs, ORs, and XORs might be simplified back to a simple x + y.
- Super: A "super-optimizer" that searches for even more aggressive simplifications than standard LLVM passes, particularly effective at collapsing flattened control flow.
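As a small illustration of the MBA case, the Python sketch below checks two well-known rewrites of x + y. It uses an exhaustive comparison over 8-bit inputs as a stand-in for the SMT-based proof; this is an assumption made to keep the example dependency-free, not a description of how Simba++ works.

```python
MASK = 0xFF  # 8-bit arithmetic keeps the exhaustive check small


def obfuscated_add(x, y):
    """A linear MBA rewrite of x + y: (x ^ y) + 2 * (x & y)."""
    return ((x ^ y) + 2 * (x & y)) & MASK


def obfuscated_add_v2(x, y):
    """A second, equivalent rewrite: (x | y) + (x & y)."""
    return ((x | y) + (x & y)) & MASK


def plain_add(x, y):
    """The expression the rewrites should simplify back to."""
    return (x + y) & MASK


def equivalent(f, g, bits=8):
    """Exhaustively compare f and g over all pairs of `bits`-wide inputs."""
    values = range(1 << bits)
    return all(f(x, y) == g(x, y) for x in values for y in values)


print(equivalent(obfuscated_add, plain_add))     # True
print(equivalent(obfuscated_add_v2, plain_add))  # True
```

Real obfuscators nest and combine many such rewrites, which is why brute force stops scaling and a solver-backed tool becomes necessary at 32 or 64 bits.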
Real-World Success: From Malware to Captchas
The researchers demonstrated the power of this approach by targeting Wasm Mutate, a tool that can be used to diversify malware. They applied 3,000 iterations of mutation to a simple function, resulting in a massive, unreadable mess. Squonji, using LLVM, reduced this to the logic of the original function in a matter of seconds. They also successfully de-obfuscated real-world Bitcoin miners and commercial products like Edge Captcha, showing that the methodology holds up against professional-grade obfuscation.
Defense and Mitigation
For defenders, this research is a double-edged sword. It shows that current obfuscation is not a silver bullet; given enough expertise, it can be bypassed. However, it also provides a roadmap for better detection. By "normalizing" obfuscated binaries through a lifting and optimization pipeline, security vendors can create more resilient signatures that catch malware regardless of how many times it has been mutated.
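The normalization idea can be sketched with a toy example in Python. The junk patterns and the instruction strings below are hypothetical, and the tiny peephole pass is only a stand-in for the lifting and optimization pipeline described in the talk. What it shows is the principle: hash the normalized form instead of the raw bytes, and mutated variants collapse to the same signature.

```python
import hashlib

# Hypothetical junk patterns: a constant push followed by an identity operation.
JUNK_PAIRS = {
    ("i32.const 0", "i32.add"),
    ("i32.const 0", "i32.or"),
    ("i32.const 1", "i32.mul"),
}


def normalize(instructions):
    """Drop nops and identity arithmetic; a stand-in for the real optimizer."""
    out = []
    for ins in instructions:
        if ins == "nop":
            continue
        if out and (out[-1], ins) in JUNK_PAIRS:
            out.pop()  # remove the constant push together with the identity op
            continue
        out.append(ins)
    return out


def signature(instructions):
    """Hash the normalized instruction stream."""
    return hashlib.sha256("\n".join(normalize(instructions)).encode()).hexdigest()


original = ["local.get 0", "local.get 1", "i32.add"]
mutated = ["nop", "local.get 0", "i32.const 0", "i32.or",
           "local.get 1", "i32.const 1", "i32.mul", "nop", "i32.add"]

print(signature(original) == signature(mutated))  # True
```

A signature over the raw instruction stream would differ for every mutated variant, while the normalized signature stays stable as long as the normalizer can undo the mutation.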
Conclusion
The core takeaway for the security community is clear: don't reinvent the wheel. By lifting difficult architectures like Wasm to LLVM IR, we can utilize the most powerful code analysis tools ever built. As Wasm continues to expand into new domains, these de-obfuscation techniques will be vital for researchers and malware analysts alike. Keep an eye out for the release of Squonji, and start standing on the shoulders of giants for your next reverse engineering project.
AI Summary
The presentation, delivered by Vikas Gupta and Peter Garba from Thales, addresses the growing challenge of de-obfuscating WebAssembly (Wasm) binaries. As Wasm sees wider adoption in browsers, cloud environments, and blockchain applications, its use for both protecting intellectual property and concealing malicious intent (like browser-based cryptojacking) has increased. The core premise of the talk is that instead of building Wasm-specific de-obfuscators from scratch, researchers should 'stand on the shoulders of giants' by leveraging the highly mature LLVM compiler infrastructure.
The authors begin by discussing the internals of WebAssembly, noting its stack-based architecture and structured control flow. Unlike traditional architectures like x86 or ARM, Wasm lacks indirect jumps, making its control flow graph inherently cleaner. However, modern obfuscators like OLLVM, Hikari, and Polaris can still apply techniques such as control flow flattening, instruction substitution, and Mixed Boolean-Arithmetic (MBA) to make the code nearly unreadable. They also highlight 'Wasm Mutate,' a tool used to diversify binaries to evade signature-based detection in malware scanners like VirusTotal.
A significant portion of the talk is dedicated to the technical challenge of 'lifting' Wasm to LLVM Intermediate Representation (IR) while preserving essential runtime context. Initial attempts using tools like the WebAssembly Micro Runtime (WAMR) compiler or Binaryen failed because they lost critical metadata, such as globals, tables, and symbol information. The researchers found a solution in 'wasm2c' (part of the WebAssembly Binary Toolkit, WABT). By converting Wasm to C code, they could then use Clang to generate LLVM IR. Crucially, 'wasm2c' preserves an 'instance' parameter that encapsulates the entire runtime state, allowing the LLVM optimizer to 'see' through memory accesses and global variables.
To automate this process, they introduced 'Squonji,' a new tool designed to orchestrate the de-obfuscation pipeline. Squonji handles the allocation of local runtime instances, inlines initialization functions, and applies a custom LLVM optimization pipeline. The researchers emphasized that while standard LLVM optimizations (O3) are powerful, they have limitations, particularly with complex MBAs and opaque predicates. To overcome this, Squonji integrates specialized tools: 'Simba++' for solving MBA expressions and 'Super' (a super-optimizer) for resolving control-flow-based obfuscation.
The effectiveness of the approach was demonstrated through several case studies. In one demo, a function heavily obfuscated with 3,000 iterations of 'Wasm Mutate' (resulting in 60,000 instructions) was reduced back to its original 20-instruction form in seconds using LLVM passes. The researchers also successfully applied the technique to real-world targets, including the 'Kryptonite' Bitcoin miner and Edge Captcha, proving that lifted LLVM IR can be normalized for consistent signature matching and easier manual analysis in tools like IDA Pro and Ghidra.