Black Hat 2023

Endoscope: Unpacking Android Apps with VM-Based Obfuscation


This talk demonstrates a two-fold methodology for unpacking Android applications protected by VM-based obfuscation. The research focuses on identifying and reversing custom virtual machine implementations, specifically targeting the Mozilla Rhino engine and generic VM-packed binaries. The authors introduce a technique using dynamic instrumentation and genetic signatures to map virtualized instructions back to original Dalvik bytecode. This approach enables the recovery of application semantics from heavily obfuscated and randomized VM-packed Android malware.

Breaking VM-Based Obfuscation in Android Malware

TLDR: VM-based obfuscation is becoming a standard hurdle for Android researchers, effectively hiding malicious logic behind custom instruction sets. This research presents a two-fold methodology using dynamic instrumentation and genetic signatures to map these custom instructions back to original Dalvik bytecode. By automating the identification of handlers and their corresponding virtualized instructions, analysts can bypass complex, randomized packing schemes that previously required tedious manual reverse engineering.

Android malware authors are increasingly turning to virtual machine-based obfuscation to protect their malicious payloads. While traditional packers might rely on simple encryption or compression, VM-based approaches translate original Dalvik bytecode into a custom, proprietary instruction set that only the malware’s internal virtual machine can interpret. This creates a massive barrier for static analysis, as the actual logic of the application is never exposed in a readable format. For a researcher, this means staring at a sea of opaque, custom-built handlers instead of standard bytecode.

The research presented at Black Hat 2023 provides a practical path through this complexity. Instead of attempting to fully reverse-engineer every custom VM from scratch, the authors focus on the relationship between the virtualized instructions and the underlying handler logic. By treating the VM as a black box and observing its execution flow, they can reconstruct the original application semantics.

Mapping the Virtualized Instruction Set

The core challenge with VM-based obfuscation is that the mapping between the virtualized instructions and the actual code execution is often randomized or unique to each sample. When you encounter a VM-packed binary, you are essentially looking at a custom interpreter loop. The dispatcher fetches a virtual instruction, decodes it, and jumps to a specific handler in a handler table.
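The fetch, decode, and dispatch cycle described above can be pictured with a toy interpreter. This is only an illustrative sketch: the `ToyVM` class, its opcode numbers, and its three handlers are invented for this example and are not taken from any real packer.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical virtual instruction format: (opcode, operand_a, operand_b).
Instruction = Tuple[int, int, int]


class ToyVM:
    """Minimal fetch-decode-dispatch loop; not modeled on any real packer."""

    def __init__(self) -> None:
        self.regs: List[int] = [0] * 4
        self.pc = 0
        self.running = True
        # Handler table: opcode -> handler. The opcode numbers are arbitrary.
        self.handlers: Dict[int, Callable[[int, int], None]] = {
            0x10: self.h_const,
            0x22: self.h_add,
            0x3F: self.h_halt,
        }

    def h_const(self, dst: int, value: int) -> None:
        self.regs[dst] = value

    def h_add(self, dst: int, src: int) -> None:
        self.regs[dst] += self.regs[src]

    def h_halt(self, _a: int, _b: int) -> None:
        self.running = False

    def run(self, program: List[Instruction]) -> List[int]:
        while self.running and self.pc < len(program):
            opcode, a, b = program[self.pc]  # fetch
            self.pc += 1
            handler = self.handlers[opcode]  # decode: handler table lookup
            handler(a, b)                    # dispatch to the handler
        return self.regs


program = [(0x10, 0, 5), (0x10, 1, 7), (0x22, 0, 1), (0x3F, 0, 0)]
result = ToyVM().run(program)
print(result)  # [12, 7, 0, 0]
```

An analyst looking only at `program` sees opaque numbers. The meaning lives entirely in the handler table, which is why recovering that table is the focus of the rest of the analysis.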

To break this, the researchers use dynamic instrumentation to observe the execution trace. By hooking the dispatcher and the handler execution, they can collect a trace of which virtual instructions trigger which handlers. This is where the "two-fold" methodology comes into play. For open-source implementations like the Mozilla Rhino engine, the analysis is straightforward because the VM structure is known. You can analyze the VM logic directly and reconstruct the Abstract Syntax Tree (AST) to recover the original source.
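The trace-collection idea can be sketched in pure Python as a stand-in for native-level hooking. In practice the hooks would be placed with an instrumentation framework on the native dispatcher; here the "hook" is simply a wrapper around each handler of a toy VM. All names (`handler_table`, `instrument`, the opcode values) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Instruction = Tuple[int, int, int]  # hypothetical: (opcode, a, b)
Handler = Callable[[int, int], None]

regs = [0, 0]


def h_const(dst: int, value: int) -> None:
    regs[dst] = value


def h_add(dst: int, src: int) -> None:
    regs[dst] += regs[src]


# Toy handler table with arbitrary opcode numbers.
handler_table: Dict[int, Handler] = {0xA1: h_const, 0x07: h_add}

# Each entry records one dispatch: (virtual opcode, handler that ran).
trace: List[Tuple[int, str]] = []


def instrument(table: Dict[int, Handler]) -> Dict[int, Handler]:
    """Wrap every handler so each dispatch is logged before it executes."""
    def wrap(opcode: int, handler: Handler) -> Handler:
        def hooked(a: int, b: int) -> None:
            trace.append((opcode, handler.__name__))
            handler(a, b)
        return hooked
    return {op: wrap(op, h) for op, h in table.items()}


def run(program: List[Instruction], table: Dict[int, Handler]) -> None:
    for opcode, a, b in program:
        table[opcode](a, b)


run([(0xA1, 0, 2), (0xA1, 1, 3), (0x07, 0, 1), (0x07, 0, 1)],
    instrument(handler_table))

# Collapse the trace into an opcode -> handler mapping.
opcode_to_handlers: Dict[int, Set[str]] = defaultdict(set)
for opcode, name in trace:
    opcode_to_handlers[opcode].add(name)

print(dict(opcode_to_handlers))
```

The resulting mapping is only valid for the sample that produced it, which is the limitation the signature-based step is meant to address.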

However, most in-the-wild malware uses closed-source, custom-built VMs implemented in native libraries via the Java Native Interface (JNI). These samples often include app-specific randomization, meaning the handler order and the encryption parameters change with every build. This renders manual analysis of a single sample useless for the next one.

Automating Handler Identification with Genetic Signatures

To handle the randomization, the researchers introduced the concept of "genetic signatures" for handlers. Each handler performs a specific, small operation (an addition, a move, or a jump), so the sequence of machine code instructions within that handler stays largely the same from build to build, even if its location in the handler table changes.

By calculating a hash of the instruction sequence for each handler, you can create a unique identifier. The process involves:

1. Truncating the sequence to exclude the jump instruction that returns control to the dispatcher.
2. Replacing variable machine code (like jump offsets or register-relative instructions) with a fixed placeholder.
3. Hashing the resulting byte sequence to create a stable signature.
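The three steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the handler has already been disassembled into `(mnemonic, operands)` pairs, and the sets `VARIABLE_MNEMONICS` and `DISPATCH_JUMPS`, the placeholder string, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import List, Tuple

# Hypothetical pre-disassembled instruction: (mnemonic, operand text).
Insn = Tuple[str, str]

PLACEHOLDER = "<var>"
# Assumption: instructions whose operands vary between builds.
VARIABLE_MNEMONICS = {"b", "bl", "cbz", "cbnz", "adrp"}
# Assumption: how the jump back to the dispatcher is recognized.
DISPATCH_JUMPS = {"b", "br"}


def genetic_signature(handler: List[Insn]) -> str:
    body = list(handler)
    # Step 1: drop the trailing jump that returns to the dispatcher.
    if body and body[-1][0] in DISPATCH_JUMPS:
        body.pop()
    # Step 2: replace build-specific operands with a fixed placeholder.
    normalized = [
        (m, PLACEHOLDER if m in VARIABLE_MNEMONICS else ops)
        for m, ops in body
    ]
    # Step 3: hash the normalized sequence.
    blob = "\n".join(f"{m} {ops}" for m, ops in normalized).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# The same "add" handler from two builds: only the jump targets differ.
add_build_a = [("ldr", "w8, [x0]"), ("ldr", "w9, [x1]"),
               ("add", "w8, w8, w9"), ("cbz", "w8, 0x4010"),
               ("str", "w8, [x0]"), ("b", "0x1000")]
add_build_b = [("ldr", "w8, [x0]"), ("ldr", "w9, [x1]"),
               ("add", "w8, w8, w9"), ("cbz", "w8, 0x9f24"),
               ("str", "w8, [x0]"), ("b", "0x7a30")]
# A different handler: subtraction instead of addition.
sub_build_a = [("ldr", "w8, [x0]"), ("ldr", "w9, [x1]"),
               ("sub", "w8, w8, w9"), ("str", "w8, [x0]"), ("b", "0x1000")]

print(genetic_signature(add_build_a) == genetic_signature(add_build_b))  # True
print(genetic_signature(add_build_a) == genetic_signature(sub_build_a))  # False
```

Masking the whole operand text of a branch is a coarse choice. A finer-grained normalizer could keep the register and mask only the offset, at the cost of depending more heavily on the disassembler's output format.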

Once you have these signatures, you can identify the same handler across different samples, regardless of how the packer has shuffled the handler table. This allows an analyst to build a database of known handlers. When a new, unknown sample is encountered, the instrumentation tool can automatically label the handlers based on their genetic signatures, effectively "de-obfuscating" the VM's internal logic in real-time.
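A handler database of this kind can be as simple as a dictionary keyed by signature. The sketch below assumes handlers have already been normalized into byte strings. The byte strings, the `signature_db` contents, and the opcode numbers are made up for illustration; the labels borrow real Dalvik mnemonics.

```python
import hashlib
from typing import Dict


def sig(normalized_handler: bytes) -> str:
    """Hash an already-normalized handler body (hash choice is an assumption)."""
    return hashlib.sha256(normalized_handler).hexdigest()


# Hypothetical normalized handler bodies.
ADD = b"ldr;ldr;add;str"
MOVE = b"ldr;str"
GOTO = b"ldr;add_pc"

# Database built from previously analyzed samples: signature -> label.
signature_db: Dict[str, str] = {
    sig(ADD): "add-int",
    sig(MOVE): "move",
    sig(GOTO): "goto",
}


def label_handlers(handler_table: Dict[int, bytes]) -> Dict[int, str]:
    """Map each virtual opcode of a sample to a known label, if any."""
    return {
        opcode: signature_db.get(sig(body), "unknown")
        for opcode, body in handler_table.items()
    }


# Two samples with differently shuffled opcode assignments.
sample_a = {0x01: ADD, 0x02: MOVE, 0x03: GOTO}
sample_b = {0x7C: GOTO, 0x15: ADD, 0x40: MOVE, 0x99: b"mystery"}

labels_a = label_handlers(sample_a)
labels_b = label_handlers(sample_b)
print(labels_b)
```

Handlers that come back as `unknown` are the ones that still need manual attention. Once labeled, they can be added to the database so later samples benefit.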

Practical Application for Pentesters

During a penetration test or a malware analysis engagement, you will likely encounter these VMs when dealing with apps that implement heavy anti-tampering or proprietary business logic protection. If you are using tools like Frida or QBDI to trace execution, you are likely already seeing the symptoms of this obfuscation: a massive amount of noise in your trace logs as the VM dispatcher loops through its handlers.

Instead of trying to trace the entire execution, focus on the entry and exit points of the VM. By identifying the registered native functions that bridge the Java layer and the native VM, you can hook these points to extract the virtualized instructions as they are being fetched. Using the genetic signature approach, you can then map these instructions to your known handler database. This turns a weeks-long manual reversing task into an automated data-gathering exercise.
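As a rough sketch of that last mapping step, the snippet below takes a captured stream of virtual instructions and a per-sample opcode labeling and prints a Dalvik-like listing. The capture format, the `opcode_labels` table, and the operand conventions are assumptions for illustration; a real VM's operand encoding would need its own decoder.

```python
from typing import Dict, List, Tuple

# Hypothetical result of signature matching: virtual opcode -> Dalvik mnemonic.
opcode_labels: Dict[int, str] = {0x15: "add-int", 0x40: "move", 0x7C: "goto"}

# Hypothetical capture from a hook at the VM entry point:
# (virtual opcode, decoded operands).
captured: List[Tuple[int, Tuple[int, ...]]] = [
    (0x40, (0, 2)),
    (0x15, (0, 0, 1)),
    (0x7C, (-2,)),
    (0x99, (3,)),
]


def lift(stream: List[Tuple[int, Tuple[int, ...]]],
         labels: Dict[int, str]) -> List[str]:
    """Render captured virtual instructions as a Dalvik-like listing."""
    lines = []
    for opcode, operands in stream:
        mnemonic = labels.get(opcode)
        if mnemonic is None:
            # Keep unlabeled opcodes visible instead of dropping them.
            lines.append(f"<unknown 0x{opcode:02x}> {operands}")
        elif mnemonic == "goto":
            lines.append(f"goto {operands[0]:+d}")
        else:
            lines.append(f"{mnemonic} " + ", ".join(f"v{r}" for r in operands))
    return lines


listing = lift(captured, opcode_labels)
for line in listing:
    print(line)
```

Leaving unknown opcodes in the listing makes gaps in the handler database obvious while the rest of the method body is already readable.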

Defensive Considerations

Defenders should recognize that VM-based obfuscation is not a silver bullet. While it significantly raises the cost of analysis, it does not make an application un-analyzable. The reliance on JNI for these custom VMs is a major indicator of interest. If you are monitoring for malicious activity, focus on the behavior of the application at the native layer. Tools that monitor for unexpected JNI calls or unusual memory access patterns can often detect the presence of these VM-based packers before the malicious payload is even fully unpacked.

The shift toward inexpensive, off-the-shelf VM packers means that even low-tier malware is now using techniques that were once reserved for high-end threats. As an analyst, your best defense is to stop treating the VM as an impenetrable wall and start treating it as a predictable, albeit complex, state machine. By automating the mapping process, you can strip away the obfuscation layer and get back to the only thing that matters: the underlying code.

Talk Type: research presentation

Difficulty: advanced