DEF CON 2024

Symbol Recovery in Stripped Binaries

DEFCONConference · 890 views · 21:52 · over 1 year ago

This talk demonstrates techniques for recovering function names and symbols in stripped binaries using Ghidra's FunctionID and the NSA's recently released BSim (Behavioral Similarity) tool. The speaker explains how to leverage these tools to identify and map functions in unknown binaries by comparing them against a database of known, benign code. The methodology focuses on automating the symbol recovery process to significantly reduce the time required for manual reverse engineering. The presentation includes a case study on a qBit stealer sample to illustrate the effectiveness of these tools in practice.

Stop Reversing Stripped Binaries Manually: Automating Symbol Recovery with BSim

TLDR: Manually reversing stripped binaries is a massive time sink that often leads to missed vulnerabilities. By integrating Ghidra’s FunctionID and the NSA’s BSim tool into your workflow, you can automate symbol recovery and focus on actual logic analysis. This approach uses behavioral similarity to map unknown functions against known, benign code, turning hours of tedious work into a background task.

Reverse engineering stripped binaries is the equivalent of trying to read a book where every chapter title has been ripped out and the table of contents is blank. You spend the first half of your engagement just identifying standard library functions or common boilerplate code, which is a waste of your time and your client’s budget. When you are staring at a massive, stripped Go or C++ binary, the goal is to find the actual business logic, not to spend three days figuring out which function is a standard cryptographic primitive.

The Mechanics of Symbol Recovery

Symbol recovery is not magic; it is pattern matching. When code is compiled, the resulting functions carry structural artifacts (function entry points, call graphs, and instruction sequences) that are characteristic of that code and stay recognizable wherever it is linked in.

Ghidra’s FunctionID allows you to create a database of these patterns. By importing known, benign binaries into a project, you can generate a library of "signatures." When you later load a target binary, Ghidra compares the functions in your target against this database. If it finds a match, it automatically renames the function for you. This is a massive force multiplier. Instead of seeing FUN_004eb820, you see the actual function name, which immediately tells you what that block of code is doing.
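The lookup idea can be illustrated with a toy sketch. This is not Ghidra's FunctionID implementation: it assumes an exact SHA-256 hash over raw function bytes stands in for a signature, and the byte strings and the second placeholder name are made up for illustration.

```python
import hashlib


def signature(func_bytes: bytes) -> str:
    """Hash a function body. A stand-in for a real function signature."""
    return hashlib.sha256(func_bytes).hexdigest()


def build_database(known_functions: dict[str, bytes]) -> dict[str, str]:
    """Map signature -> known name, built from labeled benign code."""
    return {signature(body): name for name, body in known_functions.items()}


def rename(target_functions: dict[str, bytes], database: dict[str, str]) -> dict[str, str]:
    """Return placeholder name -> recovered name for every function that matches."""
    return {
        placeholder: database[signature(body)]
        for placeholder, body in target_functions.items()
        if signature(body) in database
    }


# Hypothetical function bodies; real input would come from analyzed binaries.
known = {"memcpy": b"\x55\x48\x89\xe5\x90", "strlen": b"\x31\xc0\x80\x3f\x00"}
target = {"FUN_004eb820": b"\x31\xc0\x80\x3f\x00", "FUN_004eb900": b"\xcc\xcc"}

db = build_database(known)
recovered = rename(target, db)
print(recovered)
```

An exact hash like this breaks as soon as a single byte differs, which is the limitation that similarity-based matching is meant to address.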

The NSA’s BSim takes this a step further by focusing on behavioral similarity. It builds a feature vector for each function from the P-code that the decompiler produces, then compares vectors rather than bytes. This is more resilient than simple byte-matching because it tolerates minor compiler optimizations and slight variations in instruction selection. If you are dealing with a binary that has been compiled with different optimization flags, BSim is often the only way to reliably recover symbols.
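Why a vector comparison tolerates small differences can be shown with a minimal sketch. It assumes a function is reduced to a bag of P-code operation names and scored with cosine similarity; BSim's real features come from the decompiler's data-flow analysis and are richer than this, and the operation lists below are invented.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical feature bags for three functions.
known = Counter(["LOAD", "INT_ADD", "INT_ADD", "STORE", "CBRANCH"])
variant = Counter(["LOAD", "INT_ADD", "STORE", "CBRANCH", "COPY"])  # same logic, different build
unrelated = Counter(["CALL", "CALL", "RETURN"])

score_variant = cosine_similarity(known, variant)
score_unrelated = cosine_similarity(known, unrelated)
print(f"variant: {score_variant:.2f}, unrelated: {score_unrelated:.2f}")
```

The variant still scores high even though its operations are not identical, while the unrelated function shares no features and scores zero. A matcher would accept candidates above some chosen threshold.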

Automating the Workflow

The real power here is not just using these tools in the GUI, but running them headlessly. You can write a script to ingest hundreds of gigabytes of benign binaries, generate the signatures, and build your databases while you sleep.

When you start a new engagement, you simply point your script at the target binary. By the time you open the project in Ghidra, the majority of the standard library and common dependencies are already named. For a recent qBit stealer sample, this process recovered over 50% of the symbols automatically. Without this, you are manually labeling hundreds of functions that provide zero insight into the malware's actual command-and-control or exfiltration logic.
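A recovery figure like this can be measured by counting how many functions still carry Ghidra's default `FUN_` placeholder after the automated pass. The sketch below assumes you have exported the function names before and after; the names themselves are invented.

```python
def is_placeholder(name: str) -> bool:
    """Ghidra gives unnamed functions default names of the form FUN_<address>."""
    return name.startswith("FUN_")


def recovery_rate(before: list[str], after: list[str]) -> float:
    """Fraction of initially unnamed functions that received a real name."""
    unnamed_before = sum(is_placeholder(n) for n in before)
    if unnamed_before == 0:
        return 0.0
    unnamed_after = sum(is_placeholder(n) for n in after)
    return (unnamed_before - unnamed_after) / unnamed_before


# Hypothetical function lists exported before and after automated matching.
before = ["FUN_00401000", "FUN_00401200", "FUN_00401400", "FUN_00401600"]
after = ["memcpy", "strlen", "FUN_00401400", "inflate"]

rate = recovery_rate(before, after)
print(f"{rate:.0%} of unnamed functions recovered")
```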

To get started, you need to build your own databases. You can use the following command structure to run Ghidra in headless mode:

analyzeHeadless <project_path> <project_name> -import <binary_path> -postScript <script_name>

By creating a custom script that triggers the FunctionID and BSim analysis, you can ensure that every binary you touch is pre-processed before you even look at the disassembly.
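One way to wire this into a pipeline is a small Python wrapper around the command above. It uses only the arguments shown in that command; the install path, project location, target path, and the `RecoverSymbols.java` script name are all hypothetical placeholders. The script that actually applies FunctionID and BSim is something you would write yourself.

```python
import shlex
import subprocess


def build_headless_command(
    analyze_headless: str,
    project_path: str,
    project_name: str,
    binary_path: str,
    script_name: str,
) -> list[str]:
    """Assemble the analyzeHeadless argument list in the order shown above."""
    return [
        analyze_headless,
        project_path,
        project_name,
        "-import",
        binary_path,
        "-postScript",
        script_name,
    ]


def run_headless(cmd: list[str]) -> None:
    """Run Ghidra headlessly; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


# All paths and names below are placeholders for illustration.
cmd = build_headless_command(
    "/opt/ghidra/support/analyzeHeadless",
    "/tmp/ghidra_projects",
    "triage",
    "/samples/target.bin",
    "RecoverSymbols.java",
)
print(shlex.join(cmd))
# Call run_headless(cmd) once the paths point at a real Ghidra install.
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when sample paths contain spaces.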

Real-World Impact for Pentesters

In a red team engagement or a bug bounty hunt, speed is everything. If you are analyzing a custom Windows service or a Linux daemon, the ability to instantly identify the networking stack or the serialization logic allows you to pivot to the interesting parts of the code immediately.

If you encounter a binary that uses a specific version of a library, you can build a database for that exact version. This is particularly effective against Go binaries, which often include the entire runtime and all dependencies statically linked. Because Go binaries are notoriously difficult to reverse, having a pre-built database of common Go standard library functions is essential. It turns a "too hard" target into a manageable one.

A Note on Defensive Context

Defenders should be aware that this technique is a double-edged sword. While it helps you find vulnerabilities, it also helps malware authors hide their tracks. If you are on the blue team, you can use these same tools to perform "binary diffing" against known good versions of your own software. If you see a function that doesn't match your internal database, and a legitimate rebuild or version change does not explain it, that is a strong indicator of tampering or an injected payload.
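The blue-team check reduces to a set difference: fingerprint every function in the known-good build, then flag anything in the deployed binary that has no counterpart. The sketch below assumes exact hashes and made-up function bodies. A BSim-based version would use a similarity threshold in place of exact equality.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Exact-match fingerprint of a function body."""
    return hashlib.sha256(body).hexdigest()


def unexpected_functions(known_good: dict[str, bytes], deployed: dict[str, bytes]) -> list[str]:
    """Names of deployed functions whose bodies match nothing in the known-good build."""
    good = {fingerprint(body) for body in known_good.values()}
    return sorted(name for name, body in deployed.items() if fingerprint(body) not in good)


# Hypothetical function bodies from a reference build and a deployed binary.
known_good = {"parse_config": b"\x55\x48\x89\xe5", "send_report": b"\x48\x83\xec\x20"}
deployed = {
    "FUN_00401000": b"\x55\x48\x89\xe5",  # matches parse_config
    "FUN_00401200": b"\x48\x83\xec\x20",  # matches send_report
    "FUN_00409000": b"\xeb\xfe\x90\x90",  # matches nothing: worth a closer look
}

flagged = unexpected_functions(known_good, deployed)
print(flagged)
```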

What to Do Next

Stop treating every binary as a blank slate. Start building your own library of FunctionID and BSim databases today. The next time you are faced with a stripped binary, don't start by looking for the entry point. Start by running your automated symbol recovery script. You will find that the "interesting" code stands out much faster when the noise has been labeled for you. If you are not already using these features, you are working harder, not smarter. Go grab the Ghidra scripts and start building your own repository of signatures.

Talk Type

research presentation

Difficulty

advanced