Trawling for IOCs: Catching C2 in a Sea of Data
This talk demonstrates a data-driven detection engineering approach to identifying command-and-control (C2) infrastructure by leveraging large-scale security telemetry. The speaker outlines techniques for pivoting from known binary signer identities to broader sets of malicious hashes and identifying suspicious GitHub repositories used for staging malware payloads. The methodology emphasizes automating the generation of detection rules, such as YARA, by analyzing sandbox execution logs and network traffic patterns. This approach aims to reduce manual toil for security analysts and improve the scalability of threat detection.
Automating C2 Detection by Mining Sandbox Telemetry
TLDR: Security teams often struggle to scale detection engineering beyond manual, artisanal rule creation. By pivoting from known binary signer identities to broader sets of malicious hashes and staging infrastructure, researchers can automate the generation of high-fidelity YARA rules. This data-driven approach allows defenders to catch command-and-control (C2) infrastructure in the wild before it is even fully weaponized.
Detection engineering is currently stuck in a manual loop. An analyst sees a threat report, thinks hard about how to detect it, writes a rule, and hopes for the best. This process scales linearly with the number of humans you have on staff, which is a losing battle against modern adversaries. If you want to actually get ahead of the noise, you have to stop treating detection as a craft project and start treating it as a data engineering problem.
Pivoting from Signer Identities to Malicious Hashes
Most remote access tools (RATs) and C2 frameworks are distributed as digitally signed binaries, because signing is often required to get past modern operating system security controls such as application reputation checks. While a single hash is a fragile indicator of compromise, the signer identity is often a much more stable pivot point.
If you identify a known malicious binary, you can use platforms like VirusTotal to extract the signer information. Once you have that identity, you can pivot to search for every other binary signed by that same entity. This often reveals a massive feed of related hashes, many of which are likely malicious but haven't been flagged by traditional antivirus engines yet.
For example, if you are tracking a specific RAT, you can use the following logic to build a feed:
SELECT DISTINCT file.hash.sha256 AS sha256
FROM virustotal_sample_metadata
WHERE file.signer.signer_identity = 'YOUR_TARGET_SIGNER_NAME'
This query turns a single known bad file into a broad, high-confidence hunting list. The key here is that you are not just looking for the specific file that hit your environment; you are looking for the entire ecosystem of tools that the threat actor is using.
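Once that query runs on a schedule, the useful signal is the *delta*: which hashes signed by this entity are new since the last run. A minimal sketch of that diffing step is below; the function name, field names, and hash values are illustrative, not part of the speaker's tooling:

```python
# Hypothetical sketch: turn recurring signer-pivot query results into an
# incremental hunting feed by diffing against previously seen hashes.

def new_hashes(todays_results, known_hashes):
    """Return hashes from today's signer pivot that we have not seen before."""
    current = {row["sha256"] for row in todays_results}
    return sorted(current - known_hashes)

# Illustrative values: two hashes already tracked, one new binary
# signed by the same entity.
known = {"aaa...", "bbb..."}
results = [
    {"sha256": "aaa..."},
    {"sha256": "ccc..."},  # newly observed binary from the same signer
]
print(new_hashes(results, known))  # ['ccc...']
```

Feeding only the new hashes into triage keeps the feed actionable instead of re-alerting on the whole ecosystem every day.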
Hunting for Staging Infrastructure in GitHub Repositories
Adversaries frequently use free hosting platforms to stage additional payloads, modules, or scripts. PowerShell-based offensive tooling such as PowerSploit and PowerShell Empire is notorious for downloading secondary stages directly from GitHub repositories.
As a pentester or researcher, you can monitor these downloads by analyzing sandbox execution logs. When a binary executes, it often makes network connections to fetch these stages. By writing a YARA rule that looks for specific PowerShell commands or HTTP connections to raw.githubusercontent.com, you can identify when a tool is reaching out to fetch a payload.
A simple YARA rule for this might look like:
rule detect_github_download {
    meta:
        description = "Detects PowerShell downloading from GitHub"
    strings:
        $s1 = "raw.githubusercontent.com"
        $s2 = "Invoke-Expression"
    condition:
        all of them
}
The real power comes when you automate this. You can run these rules against your historical sandbox data to identify every repository that has been used to stage malware. Once you have a list of these repositories, you can build a continuous feed that alerts you whenever a new, previously unseen repository is accessed by a process in your environment.
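The repository-extraction step can be sketched in a few lines. This is my own illustration of the idea, not the speaker's pipeline; the log lines, regex, and repository names are all hypothetical:

```python
import re

# Hypothetical sketch: pull GitHub repositories out of sandbox network logs
# and flag any repository not seen in a prior run.

RAW_URL = re.compile(r"raw\.githubusercontent\.com/([^/\s]+/[^/\s]+)")

def unseen_repos(log_lines, seen):
    """Return owner/repo pairs fetched in these logs but absent from `seen`."""
    found = set()
    for line in log_lines:
        for match in RAW_URL.finditer(line):
            found.add(match.group(1))
    return sorted(found - seen)

# Illustrative sandbox log entries.
logs = [
    "GET https://raw.githubusercontent.com/evil-actor/stage2/main/payload.ps1",
    "GET https://raw.githubusercontent.com/known-good/tools/main/helper.ps1",
]
seen_before = {"known-good/tools"}
print(unseen_repos(logs, seen_before))  # ['evil-actor/stage2']
```

Running this over historical logs bootstraps the "seen" set; running it on new logs turns each previously unseen repository into an alert worth a human look.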
Normalizing Data to Beat Obfuscation
One of the biggest hurdles in detection engineering is data normalization. Adversaries love to use obfuscation, such as Base64 encoding or bitwise XOR, to hide their tracks. If you are searching for raw strings in your logs, you will miss these techniques entirely.
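The Base64 and single-byte XOR cases can be handled with a small de-obfuscation pass before matching. The sketch below is my own illustration of that idea, assuming simple encodings; the helper names and sample inputs are not from the talk:

```python
import base64

# Hypothetical sketch: recover Base64- and single-byte-XOR-obfuscated strings
# so plain-string detections can match the decoded content.

def try_base64(blob: str):
    """Decode Base64 if possible, returning printable text or None."""
    try:
        decoded = base64.b64decode(blob, validate=True).decode("ascii")
        return decoded if decoded.isprintable() else None
    except Exception:
        return None

def xor_brute(data: bytes, needle: bytes):
    """Try every single-byte XOR key; return (key, plaintext) if `needle` appears."""
    for key in range(1, 256):
        plain = bytes(b ^ key for b in data)
        if needle in plain:
            return key, plain
    return None

print(try_base64("SW52b2tlLUV4cHJlc3Npb24="))  # Invoke-Expression
```

A known-plaintext needle like "Invoke" or "http" is usually enough to crack single-byte XOR; anything stronger belongs in a proper sandbox-instrumentation stage rather than a log scanner.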
To solve this, you need to normalize your data before it hits your detection engine. This means de-obfuscating strings, resolving file paths, and standardizing command-line arguments. If you are looking for a specific file path, don't just search for C:\Users. Search for the normalized version that accounts for different drive letters, environment variables, and path separators.
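Path normalization in particular is mechanical. Here is a minimal sketch of the idea; the environment-variable mappings and sample paths are illustrative assumptions, not a complete normalizer:

```python
import re

# Hypothetical sketch: normalize Windows paths before matching, so that
# drive-letter changes, environment variables, and mixed separators do not
# defeat a simple string detection. Mappings below are illustrative.

ENV_VARS = {
    "%systemroot%": r"c:\windows",
    "%userprofile%": r"c:\users\user",
}

def normalize_path(path: str) -> str:
    """Lowercase, expand known env vars, unify separators, drop the drive letter."""
    p = path.strip().lower().replace("/", "\\")
    for var, value in ENV_VARS.items():
        p = p.replace(var, value)
    p = re.sub(r"^[a-z]:", "", p)  # drop the drive letter
    return p

print(normalize_path(r"D:/Users/Alice/run.ps1"))  # \users\alice\run.ps1
```

With every path funneled through one function like this, a single rule on the canonical form replaces a pile of per-variant string matches.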
This is where the "data-driven" part of the methodology becomes critical. You aren't just writing rules; you are building a pipeline that cleans and structures your telemetry so that your rules can actually work. If your data is messy, your detections will be noisy.
The Human in the Loop
Even with a fully automated pipeline, you cannot remove the human element entirely. Data-driven detection engineering is not a "set it and forget it" solution. Rules will drift, adversaries will change their techniques, and you will inevitably encounter false positives.
The goal is to reduce the manual toil so that your analysts can focus on the interesting, high-value investigations. You want to automate the boring stuff—the initial triage, the basic correlation, and the rule generation—so that your team can spend their time on the complex, creative work that machines still cannot do.
If you are a pentester, start looking at the tools you use during your engagements. How do they stage their payloads? What network connections do they make? If you can answer those questions, you can build your own detection rules to test your client's visibility. If you are a defender, start looking at your sandbox logs. What are the common patterns in the tools you see? Start small, build a rule, and iterate. The goal is to build a system that gets better every time you use it.