Tracking 300K+ Drives: What We've Learned After 13 Years
This presentation analyzes long-term hard drive failure rates and reliability metrics based on over a decade of data from a large-scale data center environment. The speakers discuss the application of SMART data, Apache Iceberg, and Trino for large-scale data analysis and performance monitoring of storage hardware. The talk provides insights into drive failure patterns, such as the bathtub curve, and the operational challenges of managing massive storage fleets. A live demonstration shows how to query and analyze this public dataset using SQL-based tools.
Beyond the SMART Data: Why Your Storage Analytics Pipeline is Leaking Metadata
TLDR: Analyzing 300,000+ hard drives reveals that storage failure isn't just a hardware problem; it's a data engineering challenge. By using Apache Iceberg and Trino to query massive datasets, researchers can identify failure patterns that traditional monitoring misses. This approach demonstrates how to turn raw, high-volume telemetry into actionable intelligence for both infrastructure reliability and potential security auditing.
Hardware reliability is often treated as a black box, but for those of us building and breaking infrastructure, it is a goldmine of telemetry. Most security professionals ignore the storage layer until a disk fails or a controller panics. However, the way we aggregate and query drive health metrics—specifically SMART data—is a masterclass in data engineering that every researcher should understand. If you are auditing a large-scale environment, you are likely sitting on a mountain of logs that can reveal more than just impending hardware failure.
The Mechanics of Drive Telemetry
At the core of this research is the transition from legacy, manual log parsing to modern, distributed SQL-based analysis. Historically, teams relied on smartmontools to pull raw device metrics, which were then dumped into flat files or basic databases. The problem with this approach is scale. When you are managing hundreds of thousands of drives, the sheer volume of XML or CSV data becomes unmanageable.
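The legacy path described above can be sketched in a few lines. Modern smartmontools (7.0+) can emit machine-readable reports via `smartctl -j`; the snippet below flattens one such report into a single telemetry row. The embedded sample and exact field names follow smartctl's JSON schema as I understand it, so treat them as assumptions rather than a verbatim capture:

```python
import json

# Sample of the JSON that `smartctl -j -a /dev/sda` emits (smartmontools >= 7.0).
# Field names follow smartctl's JSON schema; treat the exact keys as assumptions.
SAMPLE = '''
{
  "model_name": "ST8000NM0055",
  "serial_number": "ZA1XXXXX",
  "ata_smart_attributes": {
    "table": [
      {"id": 5,   "name": "Reallocated_Sector_Ct",  "raw": {"value": 12}},
      {"id": 187, "name": "Reported_Uncorrect",     "raw": {"value": 0}},
      {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 3}}
    ]
  }
}
'''

def flatten_smart(report: str) -> dict:
    """Flatten one smartctl JSON report into a single flat telemetry row."""
    doc = json.loads(report)
    row = {"model": doc["model_name"], "serial": doc["serial_number"]}
    for attr in doc["ata_smart_attributes"]["table"]:
        row[f"smart_{attr['id']}_raw"] = attr["raw"]["value"]
    return row

print(flatten_smart(SAMPLE))
```

At fleet scale, rows like this get appended to columnar tables rather than flat files, which is exactly the gap Iceberg fills.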
The research presented at DEF CON 2025 highlights a shift toward using Apache Iceberg as a table format for these massive datasets. By treating drive telemetry as a structured, queryable table, you can perform complex joins across different time periods and hardware models. This allows for the identification of "bathtub curve" patterns—where failure rates are high at the beginning of a drive's life (infant mortality), stabilize, and then spike again as components wear out.
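The bathtub pattern falls out of a simple bucketing computation. The sketch below uses a tiny synthetic fleet (ages and outcomes are invented for illustration, and the math shows the curve's shape rather than a proper annualized failure rate):

```python
from collections import defaultdict

# Hypothetical per-drive records: (age in months at last observation, failed?)
fleet = [
    (1, True), (2, True), (2, False), (6, False), (12, False),
    (18, False), (24, False), (36, False), (48, True), (54, True),
    (55, False), (60, True),
]

def bathtub(records, bucket_months=12):
    """Fraction of drives failed per age bucket; illustrates the curve's shape."""
    seen, failed = defaultdict(int), defaultdict(int)
    for age, did_fail in records:
        bucket = age // bucket_months
        seen[bucket] += 1
        failed[bucket] += did_fail
    return {b: failed[b] / seen[b] for b in sorted(seen)}

rates = bathtub(fleet)
print(rates)  # high in bucket 0 (infant mortality), low mid-life, rising late
```

With real data the same grouping is a one-line `GROUP BY age_bucket` over the Iceberg table instead of a Python loop.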
For a pentester, this is significant. If you gain access to an environment’s monitoring stack, you aren't just looking for credentials in plain text. You are looking for the telemetry pipeline. If that pipeline is built on Trino or similar distributed query engines, you can run ad-hoc queries to map out the entire physical infrastructure, identify specific hardware models that might be vulnerable to firmware-level exploits, or even track the movement of data across different storage shards.
Querying the Fleet
The demo provided a clear look at how to interact with this data using DuckDB and Trino. The ability to run SQL queries against petabytes of storage telemetry is a game changer. Consider a scenario where you need to identify which storage nodes are running specific, outdated drive models. Instead of manually checking individual servers, you can execute a query like this:
SELECT model,
       count(*) AS drive_count,
       avg(age_months) AS avg_age_months
FROM drive_stats
WHERE failure_rate > 0.05
GROUP BY model
ORDER BY drive_count DESC;
This level of visibility is exactly what an attacker wants. It allows for precise targeting of hardware that may have known vulnerabilities or specific performance characteristics that can be exploited for side-channel attacks. From a defensive perspective, this same visibility is critical for identifying anomalous behavior. If a specific storage shard suddenly reports a spike in "uncorrectable errors," it might not be a hardware failure—it could be a sign of a malicious process attempting to brute-force or corrupt data blocks.
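That defensive check is easy to prototype. The sketch below uses SQLite as a stand-in for the warehouse and flags any shard whose latest daily uncorrectable-error count jumps well above its own recent baseline; the table name, schema, and 5x threshold are all illustrative assumptions:

```python
import sqlite3

# Hypothetical telemetry table; schema, data, and threshold are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE drive_errors (day TEXT, shard TEXT, uncorrectable INTEGER);
INSERT INTO drive_errors VALUES
  ('2025-08-01','shard-a',2),('2025-08-02','shard-a',3),('2025-08-03','shard-a',2),
  ('2025-08-04','shard-a',41),
  ('2025-08-01','shard-b',1),('2025-08-02','shard-b',1),('2025-08-03','shard-b',2),
  ('2025-08-04','shard-b',1);
""")

# Flag any shard whose latest daily count exceeds 5x its prior average.
spikes = con.execute("""
SELECT e.shard, e.uncorrectable, base.avg_err
FROM drive_errors e
JOIN (SELECT shard, AVG(uncorrectable) AS avg_err
      FROM drive_errors WHERE day < '2025-08-04'
      GROUP BY shard) base ON base.shard = e.shard
WHERE e.day = '2025-08-04' AND e.uncorrectable > 5 * base.avg_err
""").fetchall()
print(spikes)
```

The same query shape runs unchanged against Trino; only the connection and table names differ.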
The Reality of Data Center Operations
One of the most interesting takeaways from this research is the "shard slash" project. It demonstrates how storage systems are not just passive repositories; they are active, intelligent systems that route traffic based on available capacity and health. When a drive fails, the system must rebuild the data on a new drive. This rebuild process is a high-traffic event that can be used to mask other activities.
If you are conducting a red team engagement, understanding the "rebuild" cycle of a storage cluster is essential. During a rebuild, the system is under heavy load, and monitoring alerts are often suppressed or ignored by the SOC. This is the perfect window to perform lateral movement or data exfiltration. The research shows that drive failure is not a random event; it is a predictable, manageable, and often noisy process that can be tracked with the right SQL queries.
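Because rebuilds follow failures deterministically, the windows themselves can be computed ahead of time. A minimal sketch, assuming a fixed 18-hour rebuild duration (real rebuild times vary with drive size and cluster load) and invented failure timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical failure events and an assumed fixed rebuild duration.
failures = [datetime(2025, 8, 1, 3, 15), datetime(2025, 8, 9, 22, 40)]
REBUILD = timedelta(hours=18)  # assumption; real rebuilds vary with size/load

def in_rebuild_window(ts, failure_times, rebuild=REBUILD):
    """True if ts falls inside any post-failure rebuild window."""
    return any(f <= ts <= f + rebuild for f in failure_times)

# An alert at 04:00 on Aug 1 lands inside the first rebuild window:
print(in_rebuild_window(datetime(2025, 8, 1, 4, 0), failures))   # True
print(in_rebuild_window(datetime(2025, 8, 5, 4, 0), failures))   # False
```

A blue team can invert this logic: rather than suppressing alerts during rebuilds, raise their priority, since that is precisely when an attacker expects to hide.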
Defensive Implications
Defenders need to treat storage telemetry as a first-class citizen in their SIEM. If you are only monitoring for login failures and process execution, you are missing the physical layer. Ensure that your storage monitoring pipeline is secured with the same rigor as your application logs. Use read-only credentials for your analytical tools, and implement strict access controls on the data warehouses where this telemetry resides.
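The read-only principle is cheap to demonstrate. The sketch below uses SQLite's URI mode as a stand-in for warehouse credentials: the analytical connection can read but any write is rejected at the engine level. The same idea applies to Trino or any warehouse, where analyst roles should be granted SELECT only:

```python
import os
import sqlite3
import tempfile

# Create a throwaway telemetry store, then reopen it read-only (sqlite URI mode).
path = os.path.join(tempfile.mkdtemp(), "telemetry.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE drive_stats (model TEXT, failures INTEGER)")
rw.execute("INSERT INTO drive_stats VALUES ('ST8000NM0055', 3)")
rw.commit()
rw.close()

ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT * FROM drive_stats").fetchall())  # reads succeed

try:
    ro.execute("DELETE FROM drive_stats")  # writes are rejected
except sqlite3.OperationalError as exc:
    print("write blocked:", exc)
```

Enforcing this at the connection or role level, rather than trusting the dashboard code, is what keeps a compromised analytics tool from becoming a tampering vector.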
The shift toward using modern data formats like Apache Iceberg is not just about performance; it is about auditability. By maintaining a clean, versioned history of your infrastructure’s health, you can quickly identify when a "failure" is actually a sign of tampering. Stop treating your storage as a static asset and start treating it as a dynamic, observable part of your attack surface. The data is there, and if you aren't querying it, someone else will.