Black Hat 2023

What Does an LLM-Powered Threat Intelligence Program Look Like?

Black Hat · 6,567 views · 40:11 · about 2 years ago

This talk explores the integration of Large Language Models (LLMs) into Cyber Threat Intelligence (CTI) programs, focusing on enhancing threat visibility, data processing, and intelligence interpretation. The speakers demonstrate how LLMs can automate the analysis of threat reports and indicators of compromise (IOCs) to improve decision-making and reduce analyst toil. A key takeaway is the necessity of a structured, human-in-the-loop approach to mitigate the risks of LLM hallucinations in critical security workflows. The presentation provides a framework for assessing which CTI tasks are suitable for LLM automation versus those requiring human expertise.

Automating Threat Intelligence: Why LLMs Are Not Your New SOC Analyst

TL;DR: Large Language Models are being integrated into threat intelligence workflows to process massive datasets and reduce analyst toil, but they introduce significant risks of hallucination. Relying on LLMs for critical tasks like patch prioritization or attribution without a human-in-the-loop verification process can lead to dangerous security blind spots. Security teams must treat LLM outputs as untrusted data and implement rigorous grounding techniques to ensure factual accuracy.

Integrating Large Language Models into a Cyber Threat Intelligence (CTI) program is the current industry obsession, but most teams are approaching it with a dangerous level of optimism. The promise is simple: take the firehose of threat reports, dark web chatter, and indicator feeds, and have an LLM distill it into actionable intelligence. In practice, the reality is far more nuanced. If you are a researcher or a developer building these pipelines, you need to understand that an LLM is not a source of truth; it is a probabilistic engine that can confidently lie to you about the very threats you are trying to mitigate.

The Mechanics of LLM-Powered CTI

At its core, a CTI program exists to improve security decision-making by answering difficult questions with limited resources. Whether you are hunting for indicators of compromise or trying to determine if a specific CVE is relevant to your infrastructure, you are performing a transformation: raw data to processed insight to interpreted action.

LLMs excel at the "processing" phase. They can ingest unstructured text, such as a 70-page report on APT1, and extract relevant entities, techniques, and timelines. However, the "interpretation" phase—where you decide what this means for your specific environment—is where the wheels fall off.
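One practical guardrail during that processing phase is to run deterministic extractors alongside the model, so the LLM's entity list can be cross-checked against what is literally present in the text. A minimal Python sketch — the report excerpt, the indicator patterns, and the `extract_iocs` helper are all illustrative assumptions, not part of the talk:

```python
import re

# Illustrative threat-report excerpt (hypothetical data).
report = """
APT1 operators staged payloads at 203.0.113.45 and pivoted via
update-check.example.com. The dropper (SHA-256:
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)
exploits CVE-2023-23397 for initial access.
"""

# Deterministic patterns run next to the LLM so that any entity the
# model reports can be verified against the source text.
PATTERNS = {
    "ipv4":   r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "sha256": r"\b[a-fA-F0-9]{64}\b",
    "cve":    r"\bCVE-\d{4}-\d{4,7}\b",
}

def extract_iocs(text: str) -> dict:
    """Return every literal indicator match, keyed by type."""
    return {name: re.findall(rx, text) for name, rx in PATTERNS.items()}

iocs = extract_iocs(report)
```

Any entity the model mentions that does not appear in the deterministic output is a candidate hallucination and should be flagged for review rather than forwarded downstream.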

Consider a scenario where an LLM is used to triage incoming phishing alerts. An attacker sends an invoice lure with a malicious PDF. The LLM processes the email, checks the attachment against a sandbox, and summarizes the findings for a SOC analyst. If the LLM hallucinates and labels the malicious attachment as "benign" because it misinterpreted the sandbox output or simply generated a plausible-sounding but incorrect conclusion, the analyst might release the email. The attacker is now inside your network. This is not a theoretical risk; it is a direct consequence of treating a generative model as a deterministic security tool.

The Hallucination Problem in Security Workflows

Hallucinations are not just annoying; they are a failure of the model to ground its output in reality. When an LLM is asked about malware families associated with a specific threat actor, it might correctly identify three known tools and then "invent" two more that sound technically plausible but do not exist.

If you are a pentester or a researcher, you can see the danger here. If you use an LLM to generate a list of potential attack vectors for a target, and 40% of those vectors are hallucinations, you are wasting your engagement time chasing ghosts.

To mitigate this, you must implement a "human-in-the-loop" requirement for any high-consequence decision. If the LLM is suggesting a change to a firewall rule or a patch priority, that output must be treated as a draft, not a command. You can use Retrieval-Augmented Generation (RAG) to ground the model's responses in your own internal, verified documentation or trusted threat feeds. By forcing the model to cite its sources from a controlled knowledge base, you significantly reduce the probability of it fabricating information.
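One way to enforce that grounding mechanically is to require the model to cite chunk IDs from the controlled knowledge base and to reject any answer that cites an unknown source, or no source at all. A minimal sketch — the `KNOWLEDGE_BASE` contents, the `[kb-NNN]` citation convention, and the simulated model outputs are all hypothetical:

```python
import re

# Hypothetical knowledge base of verified internal chunks that the
# RAG retriever is allowed to serve to the model.
KNOWLEDGE_BASE = {
    "kb-101": "FIN7 has used signed MSIX installers for initial access.",
    "kb-102": "Patch MS-XYZ was deployed fleet-wide on 2023-06-01.",
}

CITATION_RX = re.compile(r"\[(kb-\d+)\]")

def grounded(answer: str) -> bool:
    """Accept an answer only if it cites at least one chunk ID and
    every cited ID exists in the controlled knowledge base."""
    cited = CITATION_RX.findall(answer)
    return bool(cited) and all(c in KNOWLEDGE_BASE for c in cited)

# Simulated model outputs (stand-ins for real LLM responses).
ok  = grounded("FIN7 used signed MSIX installers [kb-101].")
bad = grounded("FIN7 also wrote the 'GhostLdr' loader [kb-999].")
```

Rejected answers go back to the analyst queue rather than into a report; the check cannot prove an answer is true, but it guarantees every claim traces to a document a human has already vetted.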

Scaling Without Sacrificing Accuracy

The real value of LLMs in CTI is not replacing the analyst, but removing the "toil" from the job. Analysts spend hours manually parsing logs or translating foreign-language forum posts. These are perfect tasks for an LLM because they are high-volume, repetitive, and the cost of a minor error is relatively low compared to the time saved.

When you are designing these workflows, use a two-dimensional matrix to evaluate tasks:

  1. Critical Thinking Requirement: Does this task require deep domain expertise or context?
  2. Data Volume: How much raw text needs to be processed?

Tasks in the "Low Critical Thinking / High Volume" quadrant are your primary candidates for automation. Translating log data into a standardized format or performing first-level analysis on simple binaries are great starting points. Conversely, tasks in the "High Critical Thinking / Low Volume" quadrant—such as final attribution or strategic risk assessment—should remain firmly in the hands of human experts.
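The matrix above can be sketched as a simple scorer. The example tasks, the 1-5 scales, and the threshold of 3 are assumptions chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class CtiTask:
    name: str
    critical_thinking: int  # 1 (rote) .. 5 (deep expertise required)
    volume: int             # 1 (rare) .. 5 (firehose)

def quadrant(task: CtiTask) -> str:
    """Place a task in one of the four quadrants of the matrix."""
    thinking = "high-thinking" if task.critical_thinking >= 3 else "low-thinking"
    volume = "high-volume" if task.volume >= 3 else "low-volume"
    return f"{thinking}/{volume}"

def automatable(task: CtiTask) -> bool:
    # Only the low-thinking / high-volume quadrant is a primary
    # automation candidate; everything else keeps a human in charge.
    return quadrant(task) == "low-thinking/high-volume"

tasks = [
    CtiTask("translate foreign-language forum posts", 1, 5),
    CtiTask("normalize log formats", 2, 5),
    CtiTask("final attribution call", 5, 1),
]
```

Scoring tasks explicitly like this also gives you an artifact to revisit: as models improve, a task's `critical_thinking` score may drop, but the decision to automate stays a deliberate one rather than a default.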

Moving Forward

Do not fall for the marketing hype that suggests LLMs will solve your talent shortage. They will not. They will, however, change the nature of the work. Your goal should be to build systems that allow your experts to provide feedback to the models with minimal friction. This is often called Reinforcement Learning from Human Feedback (RLHF), and it is the only way to ensure your internal models actually improve over time.

If you are building these tools, start by codifying your existing human expertise. If your senior analysts have a specific way of triaging an alert, turn that process into a structured prompt or a set of instructions for the model. If you cannot explain the process to a human, you certainly cannot explain it to an LLM. Treat your CTI pipeline like any other piece of production code: test it, monitor it for drift, and never, ever trust it blindly. The moment you stop verifying the output is the moment you become the vulnerability.
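As a concrete starting point, codifying a triage process and treating the model's reply as untrusted input might look like the following sketch, where the prompt wording, the field names, and the `needs_review` fallback are all assumptions:

```python
import json

# Hypothetical structured triage instructions distilled from a senior
# analyst's process.
TRIAGE_PROMPT = """You are assisting alert triage. Respond ONLY with JSON:
{"verdict": "malicious" | "benign" | "needs_review",
 "confidence": 0.0-1.0,
 "evidence": ["..."]}
If evidence is insufficient, use "needs_review"."""

ALLOWED_VERDICTS = {"malicious", "benign", "needs_review"}

def parse_triage(raw: str) -> dict:
    """Validate the model's reply like any untrusted input: anything
    malformed or out of range is downgraded to needs_review so a
    human sees it instead of a silent 'benign' slipping through."""
    try:
        out = json.loads(raw)
        assert out["verdict"] in ALLOWED_VERDICTS
        assert 0.0 <= float(out["confidence"]) <= 1.0
        assert isinstance(out["evidence"], list) and out["evidence"]
        return out
    except (AssertionError, KeyError, ValueError, TypeError):
        return {"verdict": "needs_review", "confidence": 0.0, "evidence": []}
```

The key design choice is that the failure mode is conservative: a model that rambles, omits evidence, or invents a verdict outside the schema can never release an alert on its own.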

Talk Type: talk
Difficulty: intermediate
Category: threat intel


Black Hat USA 2023

118 talks · 2023
Browse conference →