Security BSides 2025

Lex Sleuther: A Novel Approach to Script Language Detection

Security BSides San Francisco (22:28)

The talk introduces Lex Sleuther, a tool designed to accurately identify script languages in files by leveraging lexical analysis and linear regression. It addresses the limitations of existing tools like libmagic and YARA, which often struggle with script-based malware detection. The speaker demonstrates how combining multiple lexers and a simple linear classifier can significantly improve detection accuracy for script-based threats. The tool is released as an open-source project to assist in automated malware analysis pipelines.

Why Your File Detection Pipeline is Probably Lying to You

TLDR: Most security pipelines rely on libmagic or custom YARA rules to identify script-based malware, but these methods frequently fail when faced with obfuscated or non-standard file extensions. Aaron James introduced Lex Sleuther, a tool that uses lexical analysis and linear regression to accurately classify script languages regardless of file metadata. By shifting from signature-based detection to token-based classification, researchers can significantly reduce false negatives in automated analysis pipelines.

Security researchers often treat file type detection as a solved problem. We assume that if a file passes through a sandbox or an automated triage pipeline, the system correctly identifies it as PowerShell, VBScript, or JavaScript before triggering the appropriate analysis engine. In reality, this assumption is the primary reason many malicious scripts bypass automated detection. If your pipeline misidentifies a script, the subsequent dynamic analysis engine will likely fail to execute the payload, leaving you blind to the threat.

The Failure of Signature-Based Detection

Standard tools like libmagic are excellent for identifying binary formats because they rely on static signatures, such as magic bytes at the start of a file. However, scripts are essentially free-form text. They lack consistent headers, and attackers frequently strip or randomize file extensions to evade simple filters. When you run the file utility against a script, you often get a generic "text" or "ASCII text" result, which is useless for an automated pipeline that needs to decide whether to spin up a PowerShell environment, a WScript emulator, or a browser-based sandbox.

YARA rules are the common alternative, but they suffer from a maintenance nightmare. You end up writing hundreds of custom rules to catch variations of script syntax, and these rules inevitably over-index on specific keywords. If an attacker changes a variable name or adds a few lines of junk code, your rule breaks. You are essentially using a hammer to drive a screw. You might get lucky, but you are wasting time and resources on a process that is fundamentally prone to false negatives.
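To make the brittleness concrete, here is a toy keyword "signature" in the spirit of a YARA string match (the function and the strings are hypothetical, not an actual rule from the talk). A single renamed variable defeats it:

```python
# Toy keyword signature: matches only if BOTH hardcoded strings appear.
def signature_match(source: str) -> bool:
    return "Invoke-WebRequest" in source and "$payload" in source

original = '$payload = Invoke-WebRequest -Uri $u'
renamed  = '$p = Invoke-WebRequest -Uri $u'   # attacker renames one variable

print(signature_match(original))  # the signature fires on the original
print(signature_match(renamed))   # and silently misses the trivial variant
```

The token composition of both samples is nearly identical, which is exactly the signal a signature-based rule throws away.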

Lexical Analysis as a Better Primitive

Instead of chasing signatures, we should be looking at the structure of the code itself. Lexical analysis, the first phase of any compiler, breaks a stream of characters into a stream of tokens. Every programming language has a unique set of tokens and a distinct way of ordering them. A PowerShell script will have a completely different token distribution than a Python script or a batch file, even if the file is heavily obfuscated or lacks a standard extension.
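As a toy illustration of this idea (not the tool's actual lexers, which are grammar-driven), even a crude regex tokenizer exposes language-specific tokens such as PowerShell's sigil-prefixed variables:

```python
import re

# Crude illustrative tokenizer: sigil variables, identifiers,
# punctuation, and double-quoted strings.
TOKEN_RE = re.compile(r"\$\w+|[A-Za-z_][\w-]*|[{}()\[\];|=.+-]|\"[^\"]*\"")

def tokenize(source: str) -> list[str]:
    return TOKEN_RE.findall(source)

ps = '$url = "http://x"; Invoke-WebRequest -Uri $url'
py = 'url = "http://x"\nimport urllib.request'

print(tokenize(ps)[:4])  # '$url' and ';' mark this as PowerShell-like
print(tokenize(py)[:4])  # 'import' marks this as Python-like
```

Even on these two near-identical snippets, the token streams diverge immediately, which is the structural signal the classifier exploits.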

Lex Sleuther takes this approach by running the input file through six distinct lexers, one for each of its target languages: PowerShell, JavaScript, Batch, Python, VBA, and HTML. These lexers do not attempt to execute the code; they simply generate a token stream. By counting the frequency of these tokens, the tool constructs a feature vector that represents the "DNA" of the script.
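The feature-vector step can be sketched as follows. The mini-vocabulary here is hypothetical; the real tool derives its token set from full lexer grammars for each of the six languages:

```python
from collections import Counter

# Hypothetical mini-vocabulary; the real token set comes from
# complete lexer grammars, not a hand-picked list.
VOCAB = ["$var", "function", "def", "import", "Set-Item", "echo", "{", "}"]

def feature_vector(tokens: list[str]) -> list[int]:
    """Count occurrences of each vocabulary token, in a fixed order."""
    counts = Counter(tokens)
    # Collapse all PowerShell-style $names into one "$var" bucket.
    counts["$var"] = sum(v for t, v in counts.items() if t.startswith("$"))
    return [counts.get(t, 0) for t in VOCAB]
```

A fixed token order matters: every file, regardless of language, is projected into the same vector space so the vectors are directly comparable.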

The magic happens in the classification phase. By using a simple linear regression model, the tool maps these token counts to a probability score for each language. Because the model is trained on a corpus of known scripts, it learns the statistical likelihood of specific token sequences appearing in a given language. This is significantly more robust than signature matching because it accounts for the overall composition of the file rather than the presence of a single, easily-changed string.
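The scoring step reduces to a dot product per language. The weights below are made up for illustration; in the actual tool they are learned from a labelled corpus:

```python
# Toy per-language weights over a hypothetical 8-element token-count
# vector ("$var", "function", "def", "import", "Set-Item", "echo", "{", "}").
WEIGHTS = {
    "powershell": [2.0, 0.5, -1.0, -1.0, 1.5, -0.5, 0.1, 0.1],
    "python":     [-1.0, -0.5, 2.0, 2.0, -1.0, -0.5, 0.0, 0.0],
}

def score(features: list[float]) -> dict[str, float]:
    """Dot-product each language's weight vector with the token counts."""
    return {
        lang: sum(w * f for w, f in zip(ws, features))
        for lang, ws in WEIGHTS.items()
    }

def classify(features: list[float]) -> str:
    scores = score(features)
    return max(scores, key=scores.get)
```

Because every token contributes to the score, an attacker would have to change the bulk of the file's token distribution, not just one string, to flip the result.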

Implementation and Real-World Impact

For a pentester or a bug bounty hunter, this tool is a massive time-saver during the triage phase of an engagement. When you are dealing with a large dump of files from a compromised server or a collection of suspicious attachments from a phishing campaign, you need to know what you are looking at immediately. You can use the tool to quickly categorize thousands of files, allowing you to focus your manual analysis on the scripts that actually matter.

The tool is built in Rust and uses proc macros to generate the lexers, which keeps the binary size small and the performance high. You can install it via cargo and run it against a directory of unknown files:

cargo install lex_sleuther
lex_sleuther classify ./suspicious_files/

The output provides a clear breakdown of the most likely language for each file, along with a confidence score. During testing, this approach boosted dynamic analysis efficacy for script samples to approximately 97 percent. This is a significant jump over traditional methods, especially when you consider that the "missing" 3 percent are often the most heavily obfuscated or broken samples that would require manual intervention anyway.

A Note on Defensive Integration

Defenders should view this as a way to improve the "front end" of their malware analysis pipelines. If your current system is failing to trigger the correct sandbox because it cannot identify the script language, you are essentially running a blind analysis. By integrating a token-based classifier, you ensure that every script is routed to the correct execution environment. This reduces the number of "unknown" or "failed" analysis results and allows your team to focus on actual threats rather than debugging why a script didn't run.
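A minimal sketch of that front-end routing, assuming a classifier that returns a language label and a confidence score (the sandbox names and threshold are hypothetical, not part of Lex Sleuther):

```python
# Hypothetical mapping from detected language to an analysis backend.
SANDBOX_FOR = {
    "powershell": "powershell-sandbox",
    "javascript": "browser-sandbox",
    "vba":        "office-emulator",
    "batch":      "cmd-sandbox",
}

def route(language: str, confidence: float, threshold: float = 0.8) -> str:
    """Pick an execution environment, falling back to manual triage."""
    if confidence < threshold:
        return "manual-review"          # low confidence: don't guess blindly
    return SANDBOX_FOR.get(language, "generic-sandbox")
```

The low-confidence fallback is the point: a file the classifier cannot place is surfaced for manual review instead of silently producing a failed dynamic run.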

The most important takeaway here is that we need to stop relying on brittle, signature-based heuristics for file identification. Whether you use this specific tool or build your own classifier, the shift toward structural analysis is necessary. Attackers are constantly evolving their obfuscation techniques, and our detection tools need to be based on the fundamental properties of the languages we are trying to secure. Stop guessing the language of your scripts and start measuring it.
