Black Hat 2023

CodeQL: Also a Powerful Binary Analysis Engine


This talk demonstrates a technique for extending the CodeQL static analysis engine to support binary analysis by creating custom extractors and database schemas. The researchers show how to integrate binary analysis into the existing CodeQL workflow, enabling the use of QL queries for vulnerability research on compiled binaries. The presentation includes a custom-built debugger for QL at the relational algebra level, facilitating the development and testing of complex binary analysis queries. The approach is demonstrated using QEMU as a target, showcasing the ability to perform in-depth binary analysis without source code.

Beyond Source Code: Scaling Binary Analysis with Custom CodeQL Extractors

TLDR: Static analysis has long been hampered by the requirement for source code, leaving binary-only targets as a black box for many researchers. This research introduces a method to extend the CodeQL engine to support binary analysis by building custom extractors and database schemas. By integrating IDA Pro as an extractor, researchers can now apply powerful QL queries to compiled binaries, enabling automated vulnerability research on targets like QEMU without needing a single line of source code.

Static analysis tools are often treated as a binary choice: you either have the source code and can run a sophisticated engine like CodeQL, or you are stuck with manual reverse engineering and basic pattern matching in a disassembler. This gap creates a massive blind spot for researchers auditing proprietary firmware, legacy binaries, or complex C-based applications where the build environment is either unavailable or too fragile to replicate. The recent work presented at Black Hat 2023 by the Tencent Security Yunding Lab changes this dynamic by demonstrating how to force the CodeQL engine to ingest binary data, effectively turning it into a universal analysis platform for both source and machine code.

The Architecture of Binary Extraction

The core challenge in binary analysis is the loss of semantic information that compilers strip away. To bridge this, the researchers designed a custom extractor that leverages the existing IDA Pro infrastructure. Instead of reinventing the wheel, they let IDA handle disassembly and lifting, then map that output into a custom database schema that CodeQL can understand.

The workflow begins with the extractor generating a TRAP file, the intermediate format in which a CodeQL extractor records facts about its input, in this case the binary's structure. The TRAP file is then imported into a relational database that the CodeQL engine can query. By defining a new schema, the researchers allow QL queries to interact with binary-specific constructs like registers, instructions, and basic blocks, rather than just high-level AST nodes. This is a significant shift: researchers can write queries that look for logic errors or data flow vulnerabilities in the binary itself, using the same relational logic that makes CodeQL so effective for source code.
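To make the extraction step above more concrete, here is a minimal Python sketch of how disassembly facts could be flattened into TRAP-style tuples. It is an illustration only, not the researchers' extractor: the relation names (`functions`, `basic_blocks`, `instructions`), their columns, and the string-quoting rule are assumptions, and the input is hard-coded rather than read from IDA Pro.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instruction:
    address: int
    mnemonic: str
    operands: str


@dataclass(frozen=True)
class BasicBlock:
    start: int
    instructions: List[Instruction]


@dataclass(frozen=True)
class Function:
    name: str
    entry: int
    blocks: List[BasicBlock]


def quote(text: str) -> str:
    # Assumed quoting rule for this sketch: wrap in double quotes and
    # double any embedded quote character.
    return '"' + text.replace('"', '""') + '"'


def emit_trap(functions: List[Function]) -> List[str]:
    """Flatten functions, blocks, and instructions into TRAP-style lines."""
    lines: List[str] = []
    next_label = 1

    def fresh() -> str:
        # Allocate a fresh entity label and declare it.
        nonlocal next_label
        label = f"#{next_label}"
        next_label += 1
        lines.append(f"{label}=*")
        return label

    for fn in functions:
        fn_id = fresh()
        # Hypothetical relation: functions(id, name, entry_address)
        lines.append(f"functions({fn_id}, {quote(fn.name)}, {fn.entry})")
        for block in fn.blocks:
            bb_id = fresh()
            # Hypothetical relation: basic_blocks(id, function, start_address)
            lines.append(f"basic_blocks({bb_id}, {fn_id}, {block.start})")
            for index, insn in enumerate(block.instructions):
                insn_id = fresh()
                # Hypothetical relation:
                # instructions(id, block, index, address, mnemonic, operands)
                lines.append(
                    f"instructions({insn_id}, {bb_id}, {index}, "
                    f"{insn.address}, {quote(insn.mnemonic)}, "
                    f"{quote(insn.operands)})"
                )
    return lines


# Made-up input standing in for what a disassembler would report.
example = Function(
    name="copy_packet",
    entry=0x401000,
    blocks=[
        BasicBlock(
            start=0x401000,
            instructions=[
                Instruction(0x401000, "mov", "rdx, rsi"),
                Instruction(0x401003, "call", "memcpy"),
            ],
        )
    ],
)

trap_lines = emit_trap([example])
print("\n".join(trap_lines))
```

Each emitted tuple corresponds to one row in a relation that a matching schema would declare, which is what lets a query talk about instructions and basic blocks instead of AST nodes.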

Bridging the Gap with Relational Algebra

One of the most impressive aspects of this research is a dedicated debugger for QL that operates at the relational algebra level. Complex queries are hard to debug because the engine abstracts away how they are actually evaluated. By creating a debugger that hooks into the Java Debugger (JDB) interface, the team allows researchers to step through the evaluation of their queries.

This level of visibility is crucial when you are dealing with binary-level data. If a query fails to find a vulnerability, you need to know whether that is because the vulnerability does not exist or because your data flow path was broken during the extraction process. Being able to inspect the intermediate tables and the query tree in real time allows for rapid iteration. You can see exactly how the engine is resolving the relational algebra operations, which is the difference between spending days debugging a query and spending minutes.
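The value of inspecting intermediate tables can be shown with a toy evaluator. The Python sketch below is not the researchers' debugger and does not touch the CodeQL engine; it runs a three-step relational plan (select, join, project) over made-up facts and records the row count of every intermediate table. The relation contents and step names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple
Table = List[Row]


def select(table: Table, predicate: Callable[[Row], bool]) -> Table:
    # Relational selection: keep rows that satisfy the predicate.
    return [row for row in table if predicate(row)]


def join(left: Table, right: Table, left_col: int, right_col: int) -> Table:
    # Equi-join: concatenate rows whose chosen columns match.
    return [l + r for l in left for r in right if l[left_col] == r[right_col]]


def project(table: Table, columns: List[int]) -> Table:
    # Relational projection: keep only the listed columns.
    return [tuple(row[c] for c in columns) for row in table]


Step = Tuple[str, Callable[[Dict[str, Table]], Table]]


def run_with_trace(steps: List[Step]):
    """Evaluate steps in order, keeping every intermediate table."""
    results: Dict[str, Table] = {}
    trace: List[Tuple[str, int]] = []
    for name, op in steps:
        results[name] = op(results)
        trace.append((name, len(results[name])))
    return results, trace


# Made-up facts: (caller, callee) pairs and functions reached by guest input.
calls: Table = [
    ("handle_cmd", "memcpy"),
    ("init_dev", "memset"),
    ("dma_read", "memcpy"),
]
tainted: Table = [("handle_cmd",), ("parse_hdr",)]

plan: List[Step] = [
    ("memcpy_calls", lambda r: select(calls, lambda row: row[1] == "memcpy")),
    ("tainted_memcpy", lambda r: join(r["memcpy_calls"], tainted, 0, 0)),
    ("answer", lambda r: project(r["tainted_memcpy"], [0])),
]

results, trace = run_with_trace(plan)
for name, rows in trace:
    print(f"{name}: {rows} row(s)")
```

If `answer` came back empty, the trace would show which step first dropped to zero rows: an empty `memcpy_calls` points at missing extraction facts, while an empty `tainted_memcpy` points at the join condition in the query.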

Practical Application in Security Research

For a pentester or a bug bounty hunter, this approach opens up new avenues for finding bugs in targets that were previously "too hard" to audit. Consider a scenario where you are looking for a specific type of memory corruption or logic error across a large binary. Manually tracing every function call in a disassembler is error-prone and slow. With this custom CodeQL setup, you can write a query to identify all instances of a specific pattern, such as an unchecked buffer copy or an improper validation of a user-supplied index, and run it across the entire binary in seconds.
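As a rough illustration of the kind of pattern query described above, the following Python sketch scans a table of instruction facts for calls to copy routines that have no preceding comparison in the same function. It stands in for a QL query and is deliberately crude: the fact layout, the function names, and the "any earlier `cmp` counts as a check" heuristic are all assumptions, and a real query would reason about data flow rather than instruction order.

```python
from typing import List, NamedTuple


class Insn(NamedTuple):
    function: str
    address: int
    mnemonic: str
    operand: str


# Copy routines this sketch treats as interesting sinks.
COPY_FUNCTIONS = {"memcpy", "strcpy"}


def unchecked_copies(insns: List[Insn]) -> List[Insn]:
    """Return copy calls with no earlier cmp in the same function."""
    findings: List[Insn] = []
    for call in insns:
        if call.mnemonic != "call" or call.operand not in COPY_FUNCTIONS:
            continue
        # Heuristic guard: any cmp at a lower address in the same function.
        guarded = any(
            other.function == call.function
            and other.mnemonic == "cmp"
            and other.address < call.address
            for other in insns
        )
        if not guarded:
            findings.append(call)
    return findings


# Made-up facts standing in for rows extracted from a binary.
facts = [
    Insn("read_reg", 0x1000, "cmp", "esi, 0x40"),
    Insn("read_reg", 0x1008, "call", "memcpy"),
    Insn("write_fifo", 0x2000, "mov", "rdx, rsi"),
    Insn("write_fifo", 0x2004, "call", "memcpy"),
]

findings = unchecked_copies(facts)
for hit in findings:
    print(f"{hit.function} @ {hit.address:#x}: unchecked {hit.operand}")
```

The point is the shape of the workflow: once the binary is a set of relations, a pattern becomes a filter over rows that runs across every function at once, and the results are a short list to triage by hand.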

The researchers demonstrated this by targeting QEMU, a complex emulator that is a frequent target for security research. By extracting the binary into a CodeQL database, they were able to perform in-depth analysis that would have taken weeks of manual effort. This is not just about finding low-hanging fruit; it is about enabling a systematic approach to binary auditing that scales. If you are on an engagement where you have access to a binary but not the source, this toolchain allows you to treat the binary as a searchable, queryable dataset.

A New Standard for Binary Auditing

Defenders should take note of this as well. If researchers can automate the discovery of vulnerabilities in binaries, so can the developers who build them. Integrating binary-level static analysis into the CI/CD pipeline for compiled artifacts provides a final layer of assurance that source-only analysis might miss, especially when dealing with third-party dependencies or pre-compiled libraries.

The release of this CodeQL binary analysis framework is a call to action for the research community. We have spent too long accepting that binary analysis is inherently manual and slow. By standardizing the way we extract and query binary data, we can move toward a future where vulnerability research is as automated and repeatable for binaries as it has been for source code. Start by pulling the repo, testing it against a target you know well, and seeing where your manual analysis can be replaced by a well-crafted QL query. The barrier to entry for deep binary research just got significantly lower.

Talk Type

research presentation

Difficulty

advanced