DEF CON 2024

Secrets and Shadows: Leveraging Big Data for Vulnerability Discovery at Scale

DEFCONConference · 1,413 views · 42:24 · over 1 year ago

This talk demonstrates techniques for identifying dangling cloud resources and leaked secrets at scale by leveraging large, non-traditional datasets like virus scanning platforms. It highlights how cloud providers' resource allocation models can be exploited to enumerate IP pools and bypass security deterrents. The research emphasizes the importance of moving beyond target-specific bug hunting to a 'security at scale' mindset. The speaker also discusses the limitations of current provider-side mitigations and the need for better ownership tracking of cloud assets.

Beyond Target-Specific Bug Hunting: Scaling Vulnerability Discovery with Big Data

TLDR: Traditional bug hunting often suffers from tunnel vision by focusing on a single target, but leveraging massive, non-traditional datasets like virus scanning platforms allows researchers to identify vulnerabilities at scale. By treating cloud provider IP pools and public code repositories as searchable data sources, researchers can uncover widespread misconfigurations and leaked credentials. This approach shifts the focus from individual targets to systemic weaknesses, enabling the discovery of thousands of valid secrets and dangling resources across major cloud environments.

Most security research in the bug bounty space is inherently reactive and narrow. We pick a target, we map the attack surface, and we hunt for bugs. While this is effective for finding low-hanging fruit, it ignores the systemic nature of modern cloud infrastructure. If you are only looking at your assigned scope, you are missing the forest for the trees. The real-world risk today is not just a single misconfigured S3 bucket; it is the thousands of buckets, API keys, and dangling DNS records that exist because developers prioritize speed over secure defaults.

The Mechanics of Scale

Vulnerability discovery at scale requires a fundamental shift in how we view data. Instead of starting with a target, we start with the vulnerability class. Take dangling DNS records. A record is dangling if it points to a cloud resource that has been deprovisioned but not removed from the DNS configuration. An attacker can simply provision a new resource in the same cloud environment, claim the IP address or endpoint, and hijack the traffic.

Historically, researchers have enumerated cloud IP pools to find these resources. However, cloud providers have implemented deterrents to make this harder. AWS and GCP now assign IPs from smaller, account-specific pools, and they have introduced costs for allocating these IPs. These are not security controls; they are economic ones. To bypass them, you have to think like a cloud architect. For instance, instead of requesting static elastic IPs, you can spin up and tear down EC2 instances with ephemeral IPs to rotate through a larger address space. This allows you to map out the provider's IP ranges and cross-reference them against passive DNS data to identify which records are currently pointing to unallocated space.
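The cross-referencing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tooling: the function name `find_dangling_candidates`, the input shapes, and the idea of representing "IPs observed while cycling ephemeral instances" as a plain list are all assumptions made for the example, and the addresses come from reserved documentation ranges.

```python
from ipaddress import ip_address, ip_network


def find_dangling_candidates(dns_records, provider_cidrs, observed_free_ips):
    """Return DNS records that point at provider IPs seen in the free pool.

    dns_records:       iterable of (hostname, ip_string) pairs, e.g. from a
                       passive DNS export (hypothetical input format).
    provider_cidrs:    CIDR strings published by the cloud provider.
    observed_free_ips: IPs the researcher was handed while rotating
                       ephemeral instances, so they were unallocated
                       at observation time.
    """
    networks = [ip_network(cidr) for cidr in provider_cidrs]
    free_pool = {ip_address(ip) for ip in observed_free_ips}

    candidates = []
    for hostname, ip in dns_records:
        addr = ip_address(ip)
        # Only consider records that resolve into the provider's ranges.
        in_provider_space = any(addr in net for net in networks)
        # A record is a candidate when its target was seen unallocated.
        if in_provider_space and addr in free_pool:
            candidates.append((hostname, ip))
    return candidates
```

A hit here is only a candidate: the address may have been reallocated since it was observed, so each result still needs to be re-checked before it is treated as a dangling record.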

Secret Scanning as a Data Problem

Leaked secrets are another area where the "target-specific" mindset fails. We use tools like Gitleaks to scan our own repositories, but we rarely look at the global state of leaked credentials. The research presented at DEF CON 2024 demonstrates that virus scanning platforms are an untapped goldmine for this. Platforms like VirusTotal ingest millions of files daily, including binaries, scripts, and configuration files. These files are often uploaded by automated systems or developers debugging issues, and they frequently contain hardcoded credentials.

The key is to treat these platforms as a searchable database. You do not need to download every file. You can use YARA rules to scan the platform's repository for specific patterns, such as the AKIA prefix for AWS access keys or specific Stripe API key formats. By running these scans at scale, you can identify thousands of valid, active secrets. The technical hurdle here is validation. A regex match is just a string; you need to verify whether the key is actually active. This requires a secondary, automated process that tests each credential against the provider's API, which is where serverless functions become essential. Because they are billed on a pay-as-you-go basis, you can run these validation checks for pennies, making the entire research process economically viable.
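The two-stage flow described above (pattern match first, validation second) might look roughly like the following. It is a sketch under stated assumptions: the regex is the commonly used approximation of an AWS access key ID (the AKIA prefix plus 16 uppercase alphanumeric characters), and `is_active` is a hypothetical callable standing in for whatever provider-side check is used. In practice a real check needs the full credential, not just the key ID, and should only be run with authorization.

```python
import re
from typing import Callable, Iterable, List

# Commonly used approximation of an AWS access key ID: the "AKIA" prefix
# followed by 16 uppercase letters or digits. Treat this as an assumption,
# not an authoritative format definition.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def extract_candidates(text: str) -> List[str]:
    """Stage 1: return unique strings that merely look like key IDs."""
    return sorted(set(AWS_KEY_ID.findall(text)))


def validate_candidates(
    candidates: Iterable[str],
    is_active: Callable[[str], bool],
) -> List[str]:
    """Stage 2: keep only candidates the supplied validator confirms.

    `is_active` is a hypothetical hook. It is kept as a plain function so
    that each check is stateless and could run as one short-lived
    serverless invocation.
    """
    return [candidate for candidate in candidates if is_active(candidate)]
```

Keeping extraction and validation separate means the cheap regex pass can run over a very large corpus, while the slower, rate-limited validation step only ever sees the small set of matches.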

Real-World Impact and Defensive Reality

The scale of the findings was staggering. Applying these techniques identified over 15,000 validated secrets, including thousands of AWS keys and GitHub personal access tokens. The impact is not theoretical. An attacker with these keys can gain immediate access to production environments, customer data, and internal infrastructure.

Defenders need to understand that these are not just "customer vulnerabilities." While it is true that the developer is responsible for the hardcoded secret, the cloud provider has a responsibility to provide better defaults. When a secret is pushed to a public repository, the provider should have mechanisms to automatically revoke it. GitHub has made strides here with its secret scanning program, but it is not universal. If you are a security engineer, your priority should be implementing pre-commit hooks and secret scanning in your CI/CD pipeline. If you are a pentester, stop looking at the login page and start looking at the infrastructure that supports it.
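On the defensive side, the core of a pre-commit or CI check is small: look only at lines a change adds and flag anything matching a known secret pattern. The sketch below assumes the caller supplies unified diff text (for example, the output of `git diff --cached`). The function name and the two patterns are illustrative assumptions, and a maintained scanner such as Gitleaks covers far more formats than this.

```python
import re
from typing import List, Tuple

# Illustrative patterns only. Both are common approximations of the token
# formats and are assumptions, not authoritative definitions.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
}


def added_lines_with_secrets(diff_text: str) -> List[Tuple[int, str]]:
    """Return (diff_line_number, pattern_name) for each suspicious added line.

    Only lines starting with "+" are inspected, so secrets that a commit
    removes do not block it. "+++" file headers are skipped.
    """
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A hook built on this would exit with a non-zero status whenever the returned list is non-empty, which is what makes the commit or pipeline step fail.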

Moving Forward

The era of manual, target-specific testing is not over, but it is becoming insufficient. We are operating in an environment where infrastructure is defined as code and deployed at a velocity that makes manual review impossible. If you want to find the most impactful bugs, you have to stop thinking about the target and start thinking about the data.

Ask yourself what datasets you are ignoring. Are you looking at public code, but ignoring the binaries uploaded to scanning platforms? Are you looking at your own cloud environment, but ignoring the DNS records that point to your infrastructure? The next big finding is likely hidden in the noise of a dataset you haven't even considered searching yet. Start by breaking down your favorite vulnerability class into discrete, repeatable steps, and then find the largest, most diverse dataset you can run those steps against. The scale will surprise you.

Talk Type

research presentation

Difficulty

advanced