How to Read and Write a High-Level Bytecode Decompiler
This talk explores the technical challenges and methodologies involved in developing high-level bytecode decompilers for Python. It contrasts general-purpose decompilers like Ghidra or Hex-Rays with specialized bytecode decompilers, highlighting the importance of understanding bytecode-to-source translation. The speaker demonstrates the five-phase pipeline of decompilation, including disassembly, tokenization, parsing, AST abstraction, and source generation. The presentation provides insights into handling control flow structures and nested scopes in Python bytecode.
Why Your Python Decompiler Is Lying to You
TLDR: Most Python decompilers fail when they encounter modern bytecode because they rely on outdated, general-purpose analysis techniques that ignore the nuances of the CPython interpreter. By treating decompilation as a language-translation problem rather than a simple disassembly task, researchers can achieve significantly higher accuracy. This post breaks down the five-phase pipeline required to build a high-level bytecode decompiler that actually works for current Python versions.
Reverse engineering Python is often treated as a solved problem. You grab a tool, run it against a .pyc file, and hope for readable source code. But when you are dealing with malware or proprietary logic hidden in bytecode, standard tools like uncompyle6 or decompyle3 often fall apart. They are built on assumptions about Python versions that haven't held true for years. If you are a pentester, you have likely seen the "decompiler error" screen more often than you have seen clean, functional source code.
The fundamental issue is that most decompilers treat their job as a disassembly task. They look at the bytecode, map it to an instruction, and try to print it out. That works for simple scripts, but it fails to capture the complex control flow, nested scopes, and interpreter-specific optimizations that modern Python uses. To get reliable results, you have to stop thinking about disassembly and start thinking about language translation.
The Five-Phase Decompilation Pipeline
Building a decompiler that doesn't choke on modern Python requires a structured pipeline. You cannot just jump from bytecode to source. You need to move through five distinct phases:
- Disassembly: You start by converting raw bytecode into a readable instruction listing. The xdis library is a common choice here because it handles cross-version disassembly, which is critical when you don't know the exact Python version that generated the bytecode.
- Tokenization: This is where you "lift" the disassembly. You are essentially normalizing the instructions into a stream of tokens that represent the underlying logic, stripping away the noise of the interpreter's stack operations.
- Parsing: You feed those tokens into a parser to build a Parse Tree. This is where you apply grammar rules to identify structures like assignments, loops, and function definitions.
- AST Abstraction: You transform the Parse Tree into an Abstract Syntax Tree (AST). This is the crucial step where you discard the implementation details of the bytecode and focus on the semantic structure of the code.
- Source Generation: Finally, you walk the AST to produce the actual Python source text.
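To make the later phases concrete, here is a minimal sketch of going from tokens to an AST to source text. It is a toy under loud assumptions: the token names and the TOKENS list are hypothetical and hand-written rather than produced by a real disassembler, and it rebuilds expressions by symbolically replaying the operand stack instead of applying a grammar the way the Parsing phase above describes. Only the last step, walking the AST with the standard library's ast.unparse, resembles what a real tool would do.

```python
import ast

# Hypothetical, version-independent token stream for `x = 1 + y`.
# Real tokens would come from a disassembler; these names are illustrative.
TOKENS = [
    ("LOAD_CONST", 1),
    ("LOAD_NAME", "y"),
    ("BINARY_ADD", None),
    ("STORE_NAME", "x"),
]

def tokens_to_ast(tokens):
    """Rebuild statements by symbolically replaying the operand stack."""
    stack = []  # holds ast expression nodes instead of runtime values
    body = []   # finished statements
    for kind, arg in tokens:
        if kind == "LOAD_CONST":
            stack.append(ast.Constant(value=arg))
        elif kind == "LOAD_NAME":
            stack.append(ast.Name(id=arg, ctx=ast.Load()))
        elif kind == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(ast.BinOp(left=left, op=ast.Add(), right=right))
        elif kind == "STORE_NAME":
            target = ast.Name(id=arg, ctx=ast.Store())
            body.append(ast.Assign(targets=[target], value=stack.pop()))
        else:
            raise ValueError(f"unhandled token: {kind}")
    module = ast.Module(body=body, type_ignores=[])
    # Fill in line/column info so the tree can be unparsed or compiled.
    return ast.fix_missing_locations(module)

def decompile(tokens):
    """Source generation: walk the AST and emit Python text."""
    return ast.unparse(tokens_to_ast(tokens))

print(decompile(TOKENS))  # prints: x = 1 + y
```

The stack trick works for straight-line expressions but has no answer for jumps, which is where the grammar-driven approach and the control-flow analysis discussed next come in.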
Why Control Flow Breaks Everything
The hardest part of this process is handling control flow. When you have nested if statements, while loops, and try/except blocks, the bytecode becomes a tangled mess of jumps. General-purpose decompilers often fail here because they don't understand the concept of "dominator regions."
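An instruction A dominates an instruction B if every path from the entry point to B passes through A. By mapping these relationships, you can identify the boundaries of a scope: the instructions dominated by a loop header or a branch condition are the ones that belong inside that construct. If you don't track these boundaries, your decompiler will produce code that is syntactically valid but logically broken, or that is not valid Python at all. For example, a while loop might come out as a series of if statements tied together with goto-style labels, even though Python has no goto statement. That is a nightmare to audit during an engagement.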
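Dominators are easier to reason about with a small example. The sketch below uses the textbook iterative data-flow computation on a hand-written control flow graph for a while loop. The block labels and the CFG dictionary are hypothetical, and this is not a claim about how any particular decompiler implements the analysis.

```python
def compute_dominators(cfg, entry):
    """Return {node: set of nodes that dominate it}.

    `cfg` maps every node to its list of successors; every node must
    appear as a key, even if it has no successors.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)

    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = set.intersection(*incoming) if incoming else set()
            new = new | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def back_edges(cfg, dom):
    """An edge src -> dst closes a loop when dst dominates src."""
    return sorted(
        (src, dst)
        for src, succs in cfg.items()
        for dst in succs
        if dst in dom[src]
    )

# Hypothetical basic blocks for:  while cond: body
CFG = {
    "entry": ["header"],
    "header": ["body", "exit"],
    "body": ["header"],
    "exit": [],
}

dom = compute_dominators(CFG, "entry")
print(back_edges(CFG, dom))  # prints: [('body', 'header')]
```

The single back edge from body to header is the signal that these blocks form a loop rather than a chain of conditionals, which is the information a decompiler needs to emit a while statement.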
If you are investigating a suspicious sample, look at how the decompiler handles these jumps. If the output is full of goto-style jumps and labels, the decompiler has failed to reconstruct the control flow. You are looking at the interpreter's view of the world, not the developer's.
Practical Application for Pentesters
During a red team engagement or a bug bounty hunt, you will often encounter Python-based agents or custom tools that use code obfuscation to hide their logic. These tools frequently rely on the fact that most analysts will give up once the decompiler spits out garbage.
If you find yourself in this position, do not rely on a single tool. Use pydisasm, the command-line disassembler that ships with xdis, to get a clean disassembly of the bytecode. Once you have the disassembly, look for the "choke points": the instructions that dominate the control flow. By manually tracing these, you can often reconstruct the logic faster than you can debug a broken decompiler.
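If you just want a quick list of the jumps to trace by hand, the standard library's dis module is enough for bytecode produced by the interpreter you are running. The sketch below is a stand-in for the pydisasm workflow, not a replacement: unlike xdis, dis only understands the current Python version's bytecode. The jump_table and sample names are made up for illustration, and the opcode names you see will differ between Python versions.

```python
import dis

def jump_table(func):
    """Return (offset, opname, target_offset) for every jump instruction."""
    jumps = []
    for ins in dis.get_instructions(func):
        # hasjrel / hasjabs list the opcodes whose argument is a jump target.
        if ins.opcode in dis.hasjrel or ins.opcode in dis.hasjabs:
            # For jump opcodes, argval is the resolved target offset.
            jumps.append((ins.offset, ins.opname, ins.argval))
    return jumps

def sample(n):
    """A small loop so that there is some control flow to inspect."""
    total = 0
    while n > 0:
        total += n
        n -= 1
    return total

for offset, opname, target in jump_table(sample):
    print(f"{offset:>4} {opname:<28} -> {target}")
```

Each row gives you an edge in the control flow graph. Backward targets (a target offset smaller than the jump's own offset) are usually the loop edges worth tracing first.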
Furthermore, if you are working with OWASP-related vulnerabilities, specifically those involving insecure deserialization or code injection, understanding how the bytecode is structured is often the most direct way to verify whether a payload is actually being executed as intended. You cannot secure what you cannot read.
Moving Beyond the Basics
The state of Python decompilation is currently lagging behind the rapid release cycle of the language itself. Nearly every Python feature release (3.x to 3.y) changes the bytecode, adding, removing, or redefining opcodes and breaking the assumptions of existing decompilers. If you want to get ahead, stop relying on the "run and pray" method. Start looking at the AST grammar and how it maps to bytecode instructions.
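Because the bytecode changes between releases, the first thing worth checking on any .pyc file is which interpreter produced it. The sketch below reads the header layout used by Python 3.7 and later (a 4-byte magic number, a 4-byte flags field, then 8 more bytes, as described in PEP 552) and refuses to unmarshal anything whose magic number does not match the running interpreter. The helper names split_pyc and load_code are hypothetical, and older .pyc files have a shorter header that this code does not handle.

```python
import importlib.util
import marshal
import struct

def split_pyc(data):
    """Split raw .pyc bytes (Python 3.7+ layout) into header info and body."""
    if len(data) < 16:
        raise ValueError("too short to be a 3.7+ .pyc file")
    magic = data[:4]
    (flags,) = struct.unpack("<I", data[4:8])
    return {
        "magic": magic,
        # The first two bytes are a little-endian version marker.
        "magic_word": struct.unpack("<H", magic[:2])[0],
        "matches_running_interpreter": magic == importlib.util.MAGIC_NUMBER,
        "hash_based": bool(flags & 0x1),  # bit 0 of the flags field
        "body": data[16:],
    }

def load_code(data):
    """Unmarshal the code object only if the magic number matches.

    marshal is not safe against malicious input, so treat unknown
    samples with care and prefer an isolated analysis environment.
    """
    info = split_pyc(data)
    if not info["matches_running_interpreter"]:
        raise ValueError(
            f"bytecode magic {info['magic_word']} is from another Python version"
        )
    return marshal.loads(info["body"])

# Build a .pyc image in memory so the example needs no files on disk.
code = compile("answer = 6 * 7", "<demo>", "exec")
pyc = importlib.util.MAGIC_NUMBER + struct.pack("<III", 0, 0, 0) + marshal.dumps(code)
print(load_code(pyc).co_names)  # prints: ('answer',)
```

When the magic number does not match, that is the cue to reach for a cross-version disassembler rather than the local dis module.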
The next time you are stuck with a decompiler that refuses to output clean code, don't just move on to the next target. Take the time to look at the disassembly. The bytecode is the truth; the decompiler is just an opinion. If you can read the bytecode, you don't need the decompiler to be perfect. You just need it to be a starting point.