Black Hat 2023

AI Assisted Decision Making of Security Review Needs

This talk demonstrates a machine learning approach to automating the triage of security review requests by analyzing Jira tickets and technical documentation. The system uses natural language processing (NLP) and deep learning models, specifically multi-layer perceptrons and convolutional neural networks, to classify whether a feature requires a security review. The primary takeaway is a practical methodology for easing the burden of a lopsided developer-to-security ratio: automated classification surfaces the high-risk changes so that reviewers can spend their time there.

Automating Security Triage: Moving Beyond Manual Jira Reviews

TLDR: Security teams are drowning in a sea of Jira tickets, with developer-to-security ratios often exceeding 200:1. This research demonstrates how to use NLP and deep learning to automatically classify which features require a security review and which are low-risk. By training models on historical ticket data, teams can significantly reduce their manual triage burden and focus on high-impact vulnerabilities.

The modern development lifecycle is a paradox. We push code to production dozens of times a day, yet our security review processes remain stuck in a waterfall-era bottleneck. When you have one security engineer for every two hundred developers, you are not performing security reviews; you are performing security theater. Most of the time, we are just guessing which tickets matter, often missing critical flaws in "small" changes while wasting hours on trivial updates.

The Signal-to-Noise Problem

Security teams rely on manual triage to filter through thousands of Jira tickets, Confluence pages, and PR comments. This is a losing game. The sheer volume of data generated by agile sprint teams makes it impossible for humans to maintain context. We end up with a massive security blind spot: features that should have been reviewed but were marked as "low risk" by a developer who lacks the security context to make that call.

The research presented at Black Hat 2023 by the team at Databricks tackles this by treating security triage as a classification problem. Instead of relying on gut feeling, they built a pipeline to ingest raw engineering text—Jira tickets, design docs, and bug reports—and used machine learning to predict the necessity of a security review.

Building the Classifier

The core of this approach is transforming unstructured engineering text into a format a neural network can digest. The pipeline starts by extracting text from sources like Jira and Confluence.

The preprocessing stage is critical. You cannot just feed raw text into a model. The team used NLTK to tokenize the text, stripping out noise like single-character words and English "stop words" (articles, prepositions) that carry no security signal. The goal is to isolate the technical intent.

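# Example of the tokenization and cleanup flow
# Requires the NLTK tokenizer and stop-word data (fetch once with nltk.download)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

raw_text = "Add an OAuth token refresh endpoint to the API gateway"  # placeholder ticket text
stop_words = set(stopwords.words('english'))  # build the set once, not once per token
tokens = word_tokenize(raw_text.lower())
# Drop stop words and single-character tokens
filtered_tokens = [w for w in tokens if w not in stop_words and len(w) > 1]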

Once the text is cleaned, the next hurdle is vectorization. The team experimented with term frequency, but found that simple word counts fail to capture the semantic relationship between terms. They moved to word embeddings, which map words into a high-dimensional space where related concepts—like "data frame," "dataset," and "column"—cluster together. By using Apache Spark to handle the heavy lifting of processing thousands of tickets, they generated 300-dimensional vectors for each document.
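To make the difference between word counts and embeddings concrete, here is a minimal sketch. It rests on loud assumptions: the three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration, and averaging word vectors into one document vector is only one common pooling choice, not necessarily what the speakers did.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional embeddings; the values are made up for illustration.
# A real embedding model would supply vectors with hundreds of dimensions.
TOY_EMBEDDINGS = {
    "dataset": [0.9, 0.1, 0.0],
    "dataframe": [0.8, 0.2, 0.1],
    "column": [0.7, 0.3, 0.0],
    "login": [0.0, 0.2, 0.9],
    "password": [0.1, 0.1, 0.95],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def term_frequency_vector(tokens, vocabulary):
    """Bag-of-words counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def document_vector(tokens, embeddings):
    """Average the embeddings of known tokens (mean pooling is an assumption)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * len(next(iter(embeddings.values())))
    return [sum(component) / len(known) for component in zip(*known)]


doc_a = ["dataset", "column"]
doc_b = ["dataframe"]
doc_c = ["login", "password"]
vocab = sorted(TOY_EMBEDDINGS)

# Word counts see no overlap between doc_a and doc_b, so similarity is zero.
tf_similarity = cosine(term_frequency_vector(doc_a, vocab),
                       term_frequency_vector(doc_b, vocab))
# Averaged embeddings place the two data-related documents close together.
emb_similarity = cosine(document_vector(doc_a, TOY_EMBEDDINGS),
                        document_vector(doc_b, TOY_EMBEDDINGS))
# An authentication-related document lands far from the data-related one.
emb_unrelated = cosine(document_vector(doc_a, TOY_EMBEDDINGS),
                       document_vector(doc_c, TOY_EMBEDDINGS))
```

With these toy values, the count-based similarity between `doc_a` and `doc_b` is zero because they share no words, while the embedding-based similarity is high. That is the gap the move to embeddings is meant to close.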

From Vectors to Decisions

With the data vectorized, the team trained a multi-layer perceptron (MLP) to act as the classifier. The input is the 300-dimensional vector, and the output is a confidence score. This score is the "so what" of the entire project. It allows a security team to set a threshold: if the model is 90% confident a feature is high-risk, the ticket triggers an automatic security review. If the model is only 10% confident, the ticket gets a pass.
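The shape of that computation can be sketched as a forward pass through a small MLP followed by a threshold check. Everything specific here is an assumption for illustration: the single hidden layer of 16 units, the ReLU and sigmoid activations, the random untrained weights, and the helper names `mlp_confidence` and `needs_review`. Only the 300-dimensional input and the 90% example threshold come from the text above.

```python
import math
import random


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def sigmoid(x):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_confidence(doc_vector, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: input -> hidden (ReLU) -> single output (sigmoid)."""
    hidden = relu(dense(doc_vector, hidden_w, hidden_b))
    logit = dense(hidden, [out_w], [out_b])[0]
    return sigmoid(logit)


def needs_review(score, threshold=0.9):
    """Flag a ticket when the confidence score reaches the threshold."""
    return score >= threshold


# Untrained placeholder weights; a real model would learn these from labeled tickets.
rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM = 300, 16  # hidden size is an arbitrary choice
hidden_w = [[rng.uniform(-0.1, 0.1) for _ in range(INPUT_DIM)]
            for _ in range(HIDDEN_DIM)]
hidden_b = [0.0] * HIDDEN_DIM
out_w = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_DIM)]
out_b = 0.0

doc_vector = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)]  # stand-in document vector
score = mlp_confidence(doc_vector, hidden_w, hidden_b, out_w, out_b)
print(f"confidence={score:.3f} review={needs_review(score)}")
```

Because the weights are random, the printed score carries no meaning. The point is only that the classifier reduces each ticket to one number between 0 and 1 that a team can compare against a threshold.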

The beauty of this system is its flexibility. You are not locked into a binary "yes/no" for every ticket. You can tune the threshold based on your team's current bandwidth. If you have a surge in hiring or a quiet sprint, you can lower the threshold so that more tickets get a human review; when the team is stretched thin, you can raise it so that only the highest-confidence tickets are flagged.
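One way to tie the threshold to bandwidth is to derive it from the number of reviews the team can absorb in a sprint. The sketch below is an assumption about how that might be done, not something described in the talk; the function name `threshold_for_capacity`, the ticket keys, and the scores are all hypothetical.

```python
def threshold_for_capacity(scores, capacity):
    """Return the lowest threshold that flags at most `capacity` tickets.

    A ticket is flagged when its score is >= the threshold. Returns infinity
    when even the strictest observed score would flag too many tickets.
    """
    for candidate in sorted(set(scores)):
        flagged = sum(1 for s in scores if s >= candidate)
        if flagged <= capacity:
            return candidate
    return float("inf")


# Hypothetical confidence scores keyed by hypothetical ticket IDs.
ticket_scores = {
    "PROJ-1": 0.97,
    "PROJ-2": 0.85,
    "PROJ-3": 0.40,
    "PROJ-4": 0.12,
    "PROJ-5": 0.05,
}
all_scores = list(ticket_scores.values())

busy_threshold = threshold_for_capacity(all_scores, capacity=1)
quiet_threshold = threshold_for_capacity(all_scores, capacity=3)

flagged_when_quiet = sorted(
    key for key, s in ticket_scores.items() if s >= quiet_threshold
)
print(busy_threshold, quiet_threshold, flagged_when_quiet)
```

With room for one review the cutoff sits at the top score, and with room for three it drops to include the next two tickets. Tied scores are never split: if two tickets share a score and only one slot remains, neither is flagged.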

Real-World Application for Pentesters

For those of us on the offensive side, this research highlights a massive opportunity. If you are performing a red team engagement or a long-term assessment, you are likely looking for the same "low-risk" features that security teams ignore. If a company is using automated triage, you can bet that the features with low confidence scores are the ones that haven't been touched by a human security reviewer in months.

During an engagement, look for the "boring" tickets. If you can identify the patterns that lead to a low-risk classification, you have found the path of least resistance. These are the areas where developers are most likely to introduce vulnerabilities because they assume the feature is too simple to be dangerous.

The Future of Automated Triage

Defenders should stop trying to review everything. It is mathematically impossible. Instead, start building your own corpus of "reviewed" vs. "ignored" tickets. You do not need a massive data science team to get started. Even a simple ensemble classifier can provide a significant boost over manual, ad-hoc triage.

The goal is not to replace the security engineer; it is to give them their time back. By automating the classification of the 80% of tickets that are clearly low-risk, you free up your team to hunt for the complex, high-impact bugs that actually require human intuition. If you are still manually reading every ticket that comes across your desk, you are already behind. Start by exporting your Jira history and seeing what the data tells you about your own blind spots.
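As a starting point for that export, the sketch below turns a CSV file into labeled (text, label) pairs. It assumes a CSV export with `Summary` and `Description` columns plus a custom field named `Security Reviewed`; that field name, the sample rows, and the `load_corpus` helper are hypothetical and will differ in your own Jira instance.

```python
import csv
import io

# Stand-in for a Jira CSV export; the column names and rows are hypothetical.
SAMPLE_EXPORT = """Issue key,Summary,Description,Security Reviewed
PROJ-101,Add SSO login,Integrate SAML provider for workspace login,yes
PROJ-102,Fix typo in docs,Correct spelling on the onboarding page,no
PROJ-103,New token API,Expose endpoint that issues access tokens,yes
"""


def load_corpus(csv_file, label_column="Security Reviewed"):
    """Return (text, label) pairs; label is 1 if the ticket had a security review."""
    corpus = []
    for row in csv.DictReader(csv_file):
        text = f"{row['Summary']} {row['Description']}".strip()
        label = 1 if row[label_column].strip().lower() == "yes" else 0
        corpus.append((text, label))
    return corpus


corpus = load_corpus(io.StringIO(SAMPLE_EXPORT))
reviewed = sum(label for _, label in corpus)
print(f"{len(corpus)} tickets, {reviewed} reviewed, {len(corpus) - reviewed} not reviewed")
```

To run this against a real export, replace `io.StringIO(SAMPLE_EXPORT)` with an opened file handle. Checking the class balance first is worthwhile, since a heavily skewed split between reviewed and unreviewed tickets affects how any classifier trained on the corpus should be evaluated.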

Talk Type

research presentation

Difficulty

intermediate