
SPITE Driven Detections: Detection Testing in Production

BSidesSLC · 164 views · 27:24 · 10 months ago

This talk introduces SPITE (Security Production Integration Test Environment), a framework for automated detection testing in production environments. It demonstrates how to use containerized tasks to simulate adversary behavior, trigger alerts, and validate detection logic in real-time. The approach helps security teams reduce alert fatigue and identify stale detections by testing against live production infrastructure.

Stop Guessing If Your Detections Work: Building a Production-Grade Test Pipeline

TLDR: Most detection engineering relies on unit tests that only validate logic, leaving massive blind spots in log ingestion and alerting infrastructure. By implementing a Security Production Integration Test Environment (SPITE), you can run containerized adversary simulations directly against your production SIEM. This approach catches configuration drift and broken pipelines in hours rather than months, effectively turning your detection stack into a self-validating system.

Detection engineering is often a game of blind faith. You write a query, test it against a static log file, and push it to production. Then you wait for an incident to see if it actually fires. If the log schema changes, the ingestion pipeline chokes, or the alert routing breaks, you won't know until the worst possible moment. This is the reality for most teams, and it is why so many "critical" detections fail when they are needed most.

The core problem is that unit tests only validate the logic of the detection itself. They cannot account for the messy, real-world state of your production environment. If your SIEM stops receiving logs from a specific AWS account or if a typo in your alert routing logic sends notifications to a dead-letter queue, your unit tests will still pass, but your security posture will be non-existent.
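To make the gap concrete, here is what such a unit test typically looks like. The rule follows the Panther-style convention of a `rule(event)` function returning a boolean; the event shape, role names, and the allowlist are all illustrative assumptions. The test passes against a hand-written event even if no log with this shape ever reaches the SIEM:

```python
# Hypothetical Panther-style detection: flag RDS snapshot creation
# by IAM roles outside an allowlist. Event shape is illustrative.
ALLOWED_ROLES = {"rds-backup-automation"}

def rule(event):
    if event.get("eventName") != "CreateDBSnapshot":
        return False
    role = (event.get("userIdentity", {})
                 .get("sessionContext", {})
                 .get("sessionIssuer", {})
                 .get("userName", ""))
    return role not in ALLOWED_ROLES

# Unit test against a static, hand-written log record. It validates
# the rule logic only -- it says nothing about ingestion or routing.
def test_rule_fires_on_unknown_role():
    event = {
        "eventName": "CreateDBSnapshot",
        "userIdentity": {"sessionContext": {
            "sessionIssuer": {"userName": "intern-role"}}},
    }
    assert rule(event)

test_rule_fires_on_unknown_role()
```

This test stays green while the pipeline behind it rots, which is exactly the blind spot SPITE is meant to close.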

Moving Beyond Static Testing

True detection validation requires testing against the actual infrastructure that processes your telemetry. This is where the concept of a Security Production Integration Test Environment (SPITE) changes the game. Instead of relying on mock data, you treat your detection pipeline as a production service that requires continuous integration and testing.

The architecture relies on orchestrating containerized tasks—specifically using AWS Fargate—to perform actions that should trigger your alerts. These tasks are essentially "adversary emulators" that execute specific, scripted behaviors. For example, if you want to test a detection for unauthorized RDS snapshots, your test container doesn't just mock a log; it uses the boto3 library to actually create a snapshot in your test environment.
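Launching one of these emulator tasks is a single ECS API call. A minimal sketch, with the parameter assembly split out so it can be exercised without AWS credentials; the cluster, task definition, and subnet names are hypothetical placeholders for your environment:

```python
def build_run_task_params(cluster, task_def, subnets):
    """Assemble parameters for a one-off Fargate test task.
    All resource names are placeholders."""
    return {
        "cluster": cluster,
        "taskDefinition": task_def,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "assignPublicIp": "DISABLED",
            }
        },
    }

def launch_test(cluster="spite-tests", task_def="rds-snapshot-test",
                subnets=("subnet-aaaa1111",)):
    import boto3  # deferred so the builder stays dependency-free
    ecs = boto3.client("ecs")
    return ecs.run_task(**build_run_task_params(cluster, task_def, subnets))
```

Because the task is ephemeral, it spins up, performs the scripted behavior, and exits; there is no standing infrastructure to maintain between test runs.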

This approach forces your entire stack to work. The action is performed, the logs are generated, they are ingested by your SIEM (like Panther), and the alert is triggered. If the alert doesn't appear in your ticketing system, you know exactly where the failure occurred.

Implementing the Pipeline

To build this, you need to treat your detection tests as code. Each test consists of three components: a Dockerfile that defines the environment, a script that executes the malicious action, and a task definition that configures the compute resources.
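The Dockerfile component can stay minimal. A sketch, assuming the action script is named `snapshot_test.py` (a hypothetical name) and needs only boto3:

```dockerfile
# Minimal test-container image; script name and base tag are illustrative
FROM python:3.12-slim
RUN pip install --no-cache-dir boto3
COPY snapshot_test.py /app/snapshot_test.py
ENTRYPOINT ["python", "/app/snapshot_test.py"]
```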

When you want to add a new test, you simply define the behavior in a container. Here is a simplified example of how you might structure the execution logic in Python:

import os
import uuid

import boto3

def create_rds_snapshot():
    """Create a real RDS snapshot so the full pipeline -- CloudTrail,
    ingestion, detection, and alert routing -- is exercised end to end."""
    client = boto3.client('rds')
    db_id = os.environ['DB_IDENTIFIER']  # fail fast if unset
    client.create_db_snapshot(
        # Unique identifier so repeated test runs don't collide
        DBSnapshotIdentifier=f'spite-test-{uuid.uuid4().hex[:8]}',
        DBInstanceIdentifier=db_id,
    )

if __name__ == "__main__":
    create_rds_snapshot()

By running this container, you generate real telemetry. The key to making this sustainable is to avoid cluttering your main alert queue. You can reroute these test-generated alerts to a dedicated SQS queue. Your parent service then polls this queue to validate that the alert was received, effectively closing the loop on your testing process. If the alert doesn't arrive within a specific window, the test fails, and you get an immediate notification.
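The validation side can be a small polling loop. A sketch of that closing step, with the timeout logic separated from the SQS call so it is easy to test; the queue URL and the alert-matching details are assumptions about your setup:

```python
import json
import time

def wait_for_alert(poll_once, match, timeout_s=300, interval_s=5,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until an alert matching the test signature arrives,
    or fail the test when the window expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        for alert in poll_once():
            if match(alert):
                return alert
        sleep(interval_s)
    raise TimeoutError("expected alert never arrived -- pipeline is broken")

def poll_sqs(queue_url):
    """Drain one batch from the dedicated test-alert queue (URL assumed)."""
    import boto3  # deferred so wait_for_alert stays dependency-free
    sqs = boto3.client("sqs")
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=5)
    return [json.loads(m["Body"]) for m in resp.get("Messages", [])]
```

Injecting `poll_once`, `sleep`, and `clock` keeps the loop unit-testable, which matters when the test harness itself is the thing guarding your production detections.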

Real-World Impact for Pentesters

For those of us on the offensive side, this is a massive shift. When you are performing a red team engagement or a penetration test, you are constantly testing the limits of a client's detection capabilities. If you find that a client has a robust, automated testing pipeline like this, your job becomes significantly harder. You can no longer rely on the assumption that their detections are stale or misconfigured.

Conversely, if you are building these pipelines, you are essentially performing continuous purple teaming. You are validating that your MITRE ATT&CK coverage is not just a checkbox on a spreadsheet but a functional reality. You are testing for T1078 (Valid Accounts) and T1537 (Transfer Data to Cloud Account) by actually performing those actions in a controlled, production-safe manner.
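Emulating T1537, for instance, can mean actually sharing an RDS snapshot with an outside account, since `modify_db_snapshot_attribute` with the `restore` attribute is the API that grants cross-account restore access. A sketch, where the snapshot name and account ID are placeholders:

```python
def build_share_params(snapshot_id, external_account_id):
    """Parameters to share an RDS snapshot with an outside account --
    the exfiltration primitive behind T1537. Values are placeholders."""
    return {
        "DBSnapshotIdentifier": snapshot_id,
        "AttributeName": "restore",
        "ValuesToAdd": [external_account_id],
    }

def share_snapshot(snapshot_id="spite-test-snapshot",
                   external_account_id="123456789012"):
    import boto3  # deferred so the builder stays dependency-free
    rds = boto3.client("rds")
    return rds.modify_db_snapshot_attribute(
        **build_share_params(snapshot_id, external_account_id))
```

Run against a disposable test snapshot, this produces the same CloudTrail event a real exfiltration attempt would, without putting production data at risk.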

Managing the Complexity

The biggest hurdle is the operational overhead. You are essentially managing a fleet of micro-services that exist solely to break your own security controls. You need to be careful about cost and resource management. Using ephemeral compute like Fargate is essential here, as you only pay for the seconds your test containers are running.

You also need to handle the "unexpected alert" scenario. What happens if your test triggers an alert that you didn't expect, or if a real attacker happens to be active while your test is running? Using a dead-letter queue (DLQ) for alerts that don't match your expected test signatures is a critical safety valve. If an alert lands in the DLQ, it is a signal that something is happening that your automated testing didn't account for—which is often the most valuable signal of all.
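The routing decision itself is simple to express. A sketch, assuming each test registers an expected signature (rule ID plus a test-run tag) before it fires; the field names are illustrative:

```python
def route_alert(alert, expected_signatures):
    """Send alerts matching a registered test to the test queue;
    everything else goes to the DLQ for human review."""
    sig = (alert.get("rule_id"), alert.get("test_run_id"))
    return "test-queue" if sig in expected_signatures else "dlq"
```

Anything landing in the DLQ is, by construction, an alert nobody scheduled, which is exactly the population worth a human's attention.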

Stop relying on the hope that your detections work. Start building pipelines that prove they do. If you aren't testing your detections against your production infrastructure, you aren't really detecting anything at all. You are just collecting logs.

Talk Type
talk
Difficulty
intermediate
Category
blue team
Has Demo · Has Code · Tool Released


BSidesSLC 2025

24 talks · 2025