
Building Onramps for Emergency Web Archiving in Ukraine and Beyond

DEFCONConference · 35:49

This talk demonstrates the practical application of distributed web archiving techniques to preserve cultural heritage sites during geopolitical conflict. It details the workflow of identifying, prioritizing, and crawling at-risk web infrastructure using open-source tools and community-driven volunteer efforts. The presentation highlights the challenges of maintaining data integrity and accessibility when target systems are under active threat or experiencing infrastructure failure. It provides a blueprint for using decentralized, volunteer-based approaches to perform emergency data preservation.

How Distributed Web Archiving Can Save Data During Kinetic Conflict

TL;DR: This research demonstrates how decentralized, volunteer-driven web archiving can preserve at-risk cultural heritage sites during geopolitical crises. By pairing the Browsertrix crawler with the portable WACZ archive format, researchers can sidestep the limitations of centralized archives, which often cannot keep pace when target infrastructure is failing. Pentesters and researchers should view this as a blueprint for rapid, resilient data collection in environments where traditional infrastructure is unreliable or under active attack.

Geopolitical conflict often results in the rapid, permanent loss of digital history. When a nation’s power grid is targeted, the web servers hosting its cultural heritage, archives, and public records go dark. For a security researcher, this is not just a humanitarian issue; it is a technical challenge of data availability and integrity. The recent work presented at DEF CON 2025 regarding emergency web archiving in Ukraine highlights a critical shift in how we approach data preservation: moving away from centralized, fragile infrastructure toward a distributed, volunteer-based model.

The Mechanics of Distributed Archiving

Traditional web archiving relies on centralized services such as the Internet Archive. While effective for general web history, these systems are not designed for the high-frequency, targeted, and urgent crawling a conflict zone demands. When a site is under active threat, waiting for a centralized crawler to index it is a losing strategy.

The alternative is a distributed, containerized approach. By using Browsertrix, researchers can deploy high-fidelity, browser-based crawlers that run in isolated environments. This allows for the capture of complex, dynamic web content that traditional static crawlers often miss. Because these crawlers run in Docker containers, they can be deployed on commodity hardware—even a Raspberry Pi—by volunteers located anywhere in the world.
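
As a rough illustration, the sketch below launches a one-off Browsertrix Crawler container from Python. The image name and flags follow the project's documented quick-start, but verify them against the current Browsertrix Crawler README; the target URL and collection name are placeholders.

```python
import subprocess
from pathlib import Path

def crawl_site(url: str, collection: str, out_dir: str = "crawls") -> None:
    """Launch a one-off Browsertrix Crawler container and emit a WACZ file.

    Assumes Docker is installed; the image and flags mirror the
    browsertrix-crawler quick-start docs.
    """
    Path(out_dir).mkdir(exist_ok=True)
    cmd = [
        "docker", "run", "--rm",
        # Mount a local directory so the WACZ output survives the container.
        "-v", f"{Path(out_dir).resolve()}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--generateWACZ",          # package the crawl as a single .wacz file
        "--collection", collection,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical at-risk site used purely for illustration.
    crawl_site("https://example.org", "heritage-site-001")
```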

The technical workflow is straightforward but requires coordination (a sketch of the triage-and-dispatch loop follows the list):

  1. Discovery: Use OSINT and community-sourced lists to identify at-risk domains.
  2. Prioritization: Triage sites based on their vulnerability to infrastructure failure or censorship.
  3. Crawling: Deploy containerized crawlers to capture the site, generating WACZ (Web Archive Collection Zipped) files.
  4. Quality Control: Use multilingual volunteers to verify the integrity of the captured data.
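
To make the coordination concrete, here is a minimal Python sketch of steps 1 through 3 as a triage-and-dispatch loop. The domain list, risk scores, and the crawl_site helper (from the earlier sketch) are hypothetical stand-ins for what a real effort would source from its volunteer community.

```python
# Minimal triage-and-dispatch sketch. The domain list and risk scores are
# hypothetical stand-ins for community-sourced OSINT data (step 1).
AT_RISK = [
    ("museum.example.ua", 9),   # (domain, risk score: exposure to outage/censorship)
    ("archive.example.ua", 7),
    ("library.example.ua", 4),
]

def triage(entries):
    """Order domains so the most at-risk sites are crawled first (step 2)."""
    return sorted(entries, key=lambda e: e[1], reverse=True)

for idx, (domain, score) in enumerate(triage(AT_RISK)):
    collection = f"rescue-{idx:03d}-{domain.replace('.', '-')}"
    print(f"[risk {score}] dispatching crawl of {domain} -> {collection}")
    # Step 3 would invoke the crawl sketch shown earlier, e.g.:
    # crawl_site(f"https://{domain}", collection)
```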

Why This Matters for Security Professionals

For those of us in the offensive security space, this research is a masterclass in data exfiltration and preservation under duress. We often focus on how to break systems, but the ability to reliably extract and preserve data from a target—especially when that target is actively being dismantled—is a skill that translates directly to incident response and forensic investigations.

Consider a scenario where you are conducting a red team engagement or a bug bounty assessment on a client’s infrastructure that is experiencing intermittent availability. You cannot rely on a single point of access. By adopting the distributed crawling model, you can ensure that your evidence collection is resilient. If one node goes down, the rest of your distributed network continues to capture the state of the target.
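
The sketch below illustrates that failover idea in miniature: fan the same capture job out to several workers and keep whatever succeeds. The node names are hypothetical and the dispatch call is simulated; in practice each worker would be a crawler running on separate infrastructure.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical capture nodes; in practice, crawler containers on separate
# hosts, networks, or even countries.
WORKERS = ["node-a", "node-b", "node-c"]

def capture_from(worker: str, target: str) -> str:
    """Stand-in for dispatching a crawl to one node (SSH, job queue, API).

    The random failure simulates a node dropping offline mid-engagement.
    """
    if random.random() < 0.3:
        raise ConnectionError(f"{worker} unreachable")
    return f"{worker}/{target.replace('https://', '')}.wacz"

def resilient_capture(target: str) -> list[str]:
    """Fan the same capture job out to every node; keep whatever succeeds."""
    archives = []
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        futures = {pool.submit(capture_from, w, target): w for w in WORKERS}
        for fut in as_completed(futures):
            try:
                archives.append(fut.result())
            except ConnectionError as exc:
                # One node down is tolerable; the others keep capturing.
                print(f"lost a node, continuing: {exc}")
    return archives

if __name__ == "__main__":
    print(resilient_capture("https://target.example.net"))
```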

The use of WACZ files is particularly relevant here. These files provide a standardized, verifiable format for web archives: a single package bundling the captured content, the crawl logs, and a manifest of per-resource hashes, with support for optional cryptographic signatures, so you can demonstrate that the data has not been altered since the time of capture. For a researcher, this is the difference between a loose collection of files and a forensically sound archive.
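
As an illustration of that verifiability, the following sketch opens a WACZ as the ZIP file it is and re-hashes every resource listed in its datapackage.json manifest. The layout assumed here (a frictionless data package whose resources carry sha256 hashes) matches the published WACZ spec, but check the field names against the spec version your tooling emits; the archive path is a placeholder.

```python
import hashlib
import json
import zipfile

def verify_wacz(path: str) -> bool:
    """Re-hash every resource listed in a WACZ's datapackage.json manifest.

    Assumes the WACZ 1.x layout: a ZIP containing datapackage.json whose
    `resources` entries carry `path` and a `hash` like "sha256:<hex>".
    (The manifest's own hash lives in datapackage-digest.json.)
    """
    ok = True
    with zipfile.ZipFile(path) as wacz:
        manifest = json.loads(wacz.read("datapackage.json"))
        for res in manifest.get("resources", []):
            algo, _, expected = res["hash"].partition(":")
            digest = hashlib.new(algo, wacz.read(res["path"])).hexdigest()
            matches = digest == expected
            ok &= matches
            print(f"{'OK' if matches else 'TAMPERED':8} {res['path']}")
    return ok

if __name__ == "__main__":
    # Hypothetical archive produced by an earlier crawl.
    verify_wacz("crawls/collections/heritage-site-001/heritage-site-001.wacz")
```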

The Defensive Reality

Defenders often overlook the importance of archiving their own infrastructure until it is too late. If your organization is in a sector prone to targeted attacks, you should be maintaining your own off-site, immutable archives of critical public-facing assets. Relying on the assumption that your primary servers will always be available is a failure of planning.

If you are a security engineer, look into implementing automated, containerized archiving for your most sensitive web assets. Use tools that support the WARC standard to ensure interoperability. When the network becomes unstable, the ability to serve a static, high-fidelity copy of your site from a decentralized storage bucket can be the difference between maintaining public trust and suffering a total loss of information.
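
For a concrete starting point, the open-source warcio library (from the Webrecorder project) can wrap ordinary HTTP requests and write standards-compliant WARC records. The snippet below follows warcio's documented capture_http pattern; the URLs are placeholders for your own public-facing assets.

```python
# pip install warcio requests
from warcio.capture_http import capture_http
import requests  # imported after capture_http so warcio can patch it

def snapshot(urls, warc_path="site-snapshot.warc.gz"):
    """Record each request/response pair into a gzipped WARC file."""
    with capture_http(warc_path):
        for url in urls:
            requests.get(url, timeout=30)

if __name__ == "__main__":
    # Placeholder asset list; in practice, enumerate your critical pages.
    snapshot(["https://example.org/", "https://example.org/about"])
```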

Moving Forward

The most striking takeaway from this research is that technical sophistication is secondary to the ability to adapt and coordinate. The project succeeded not because it used a proprietary, high-end tool, but because it built a "tent" that anyone could join. It turned everyday people into digital archivists by lowering the barrier to entry for complex technical tasks.

As we look at the current state of the internet, where infrastructure is increasingly centralized and vulnerable to both state-level interference and simple power failures, the need for these "onramps" to data preservation will only grow. Whether you are a researcher looking to preserve history or a pentester looking to ensure your data collection is bulletproof, the lesson is clear: build for resilience, distribute your efforts, and never assume the target will be there tomorrow. If you want to see how this is being applied in real-time, check out the SUCHO project. It is a living example of how a community can secure data when the institutions fail.
