Black Hat 2023

HARRY Parser and the Cursed Tracker: Breaking the Spell of Online Data Collection


This talk demonstrates a technique for analyzing web-based tracking and data collection by parsing HTTP Archive (HAR) files to identify third-party data exfiltration. It focuses on how websites use fingerprinting, CNAME cloaking, and ad-tech pixels to track users across different domains. The speaker introduces a custom tool, HARRY Parser, to automate the extraction and deduplication of tracking data from browser sessions. The presentation highlights the use of AI to assist in reverse-engineering obfuscated JavaScript used by these trackers.

Beyond the Pixel: Automating Data Exfiltration Analysis with HARRY Parser

TLDR: Modern web tracking has evolved into a complex ecosystem of fingerprinting and CNAME cloaking that often evades standard security controls. By parsing HTTP Archive (HAR) files, researchers can now automate the identification of third-party data exfiltration and reverse-engineer obfuscated JavaScript payloads. This approach provides a repeatable methodology for mapping how user data flows from a target site to unauthorized third-party endpoints.

Tracking is no longer just a simple cookie drop. It has become a multi-layered operation that relies on fingerprinting, CNAME cloaking, and aggressive data aggregation. For a pentester or bug bounty hunter, the challenge is that this activity is often buried in hundreds of network requests, which makes manual analysis in tools like Wireshark or the browser developer console an exercise in frustration. The real-world risk is clear: when a site loads dozens of third-party scripts, it creates a large supply chain attack surface. If one of those services is compromised, your users' data can be exfiltrated in real time.

The Mechanics of Modern Tracking

The core issue is that many organizations treat third-party scripts as trusted entities. When you visit a site, you are not just interacting with the first-party domain. You are often triggering a cascade of requests to ad-tech platforms, demand-side platforms (DSPs), and data management platforms (DMPs). These services use techniques like browser fingerprinting to uniquely identify users even when cookies are blocked or cleared.

A common technique for bypassing browsers' third-party restrictions is CNAME cloaking. The site operator creates a subdomain that points to a third-party tracking service via a CNAME record, so the tracker masquerades as a first-party resource. This allows the tracker to set cookies and access data that browser security policies would otherwise restrict. Identifying these relationships manually is difficult because the network traffic looks legitimate at first glance.
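Once the CNAME targets are known, the relationship described above can be checked mechanically. The following is a minimal sketch, assuming the CNAME records have already been resolved separately (for example with a DNS lookup tool) and collected into a dict. The function names and hostnames are hypothetical, and the "last two labels" heuristic for the base domain is a simplification; a real tool would consult the Public Suffix List.

```python
def base_domain(hostname: str) -> str:
    """Naive base domain: the last two labels (assumption; wrong for suffixes like .co.uk)."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def find_cloaked(cname_records: dict, site_host: str) -> list:
    """Return (subdomain, target) pairs where a first-party subdomain
    points via CNAME to a host under a different base domain."""
    site_base = base_domain(site_host)
    cloaked = []
    for host, target in cname_records.items():
        if base_domain(host) == site_base and base_domain(target) != site_base:
            cloaked.append((host, target.rstrip(".").lower()))
    return cloaked


# Hypothetical records, as if collected beforehand with a DNS lookup.
records = {
    "metrics.shop.example": "collector.tracker-vendor.example.",
    "cdn.shop.example": "assets.shop.example.",
}
print(find_cloaked(records, "www.shop.example"))
```

Only the first record is reported, because its CNAME target lives under a different base domain than the site itself.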

Automating the Hunt with HARRY Parser

To cut through the noise, researchers need a way to ingest and analyze browser session data at scale. The HARRY Parser was built to solve this by automating the extraction and deduplication of data from HTTP Archive (HAR) files. Instead of scrolling through thousands of lines of JSON, you can export a HAR file from your browser, feed it into the parser, and immediately see a breakdown of what data is being sent, to which entities, and via what methods.
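To give a feel for what ingesting a HAR file involves, here is a minimal sketch. It is not the HARRY Parser's actual code; it only assumes the standard HAR layout (`log.entries[].request.url`) and uses hypothetical function names and sample hosts. The suffix match used to decide what counts as first-party is a simplification.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def load_har(path: str) -> dict:
    """Read a HAR export from disk (HAR files are plain JSON)."""
    with open(path, encoding="utf-8-sig") as fh:  # tolerates an optional BOM
        return json.load(fh)


def third_party_hosts(har: dict, first_party: str) -> Counter:
    """Count requests per host that is neither first_party nor one of its subdomains."""
    counts = Counter()
    for entry in har.get("log", {}).get("entries", []):
        host = urlsplit(entry["request"]["url"]).hostname or ""
        if host != first_party and not host.endswith("." + first_party):
            counts[host] += 1
    return counts


# Tiny inline HAR-shaped sample so the sketch runs without a real export.
sample = {"log": {"entries": [
    {"request": {"method": "GET", "url": "https://shop.example/cart"}},
    {"request": {"method": "GET", "url": "https://px.adtech.example/p?uid=1"}},
    {"request": {"method": "POST", "url": "https://px.adtech.example/p?uid=1"}},
]}}
print(third_party_hosts(sample, "shop.example").most_common())
```

With a real export, `third_party_hosts(load_har("session.har"), "shop.example")` would give the same kind of per-host summary; the file name here is only a placeholder.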

The tool excels at identifying the "who" and "what" of data collection. It breaks down query strings, headers, cookies, and POST bodies by URL. This is critical for identifying when sensitive information, such as session IDs or device configuration details, is being leaked to an unexpected third party.
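The per-URL breakdown and deduplication can be illustrated with a short sketch. This is an assumption about how such a report could be built, not the tool's implementation; it relies only on the standard HAR request fields (`queryString`, `headers`, `cookies`, `postData.text`), and the function name and sample data are hypothetical.

```python
from urllib.parse import urlsplit

# HAR request fields that hold lists of {"name": ..., "value": ...} objects.
FIELDS = ("queryString", "headers", "cookies")


def breakdown(entries: list) -> dict:
    """Group unique query parameters, headers, cookies, and POST bodies by URL path."""
    report = {}
    for entry in entries:
        req = entry["request"]
        parts = urlsplit(req["url"])
        key = f"{parts.scheme}://{parts.netloc}{parts.path}"  # URL without the query string
        bucket = report.setdefault(key, {name: set() for name in (*FIELDS, "postData")})
        for field in FIELDS:
            for item in req.get(field, []):
                bucket[field].add((item["name"], item["value"]))  # sets drop duplicates
        body = (req.get("postData") or {}).get("text")
        if body:
            bucket["postData"].add(body)
    return report


beacon = {"request": {
    "url": "https://px.adtech.example/p?uid=42",
    "queryString": [{"name": "uid", "value": "42"}],
    "headers": [{"name": "Referer", "value": "https://shop.example/cart"}],
    "cookies": [{"name": "sid", "value": "abc"}],
}}
post = {"request": {
    "url": "https://px.adtech.example/p",
    "postData": {"mimeType": "application/json", "text": '{"screen":"1920x1080"}'},
}}
report = breakdown([beacon, beacon, post])  # the repeated beacon collapses into one record
for url, fields in report.items():
    print(url, {name: sorted(values) for name, values in fields.items()})
```

Using sets keeps the report small when the same beacon fires many times during a session, which is the typical case for tracking pixels.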

Leveraging AI for Obfuscated JavaScript

One of the biggest hurdles in analyzing these trackers is the obfuscated JavaScript they use to hide their data collection logic. These scripts are often minified and intentionally difficult to read. Rather than spending hours manually de-obfuscating code, you can now use LLMs like ChatGPT or Google Bard to assist in the reverse-engineering process.

During a recent analysis, I encountered a heavily obfuscated script that used Base64 encoding to hide its exfiltration targets. By feeding the script into an LLM, I was able to quickly identify the underlying logic and the specific data points being collected. You should never blindly trust the output of an AI, but it acts as a force multiplier for spotting patterns in minified code that would take a human researcher significantly longer to parse.
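A cheap way to cross-check an LLM's reading of such a script is to decode any Base64-looking string literals directly. The sketch below assumes the targets are stored as plain Base64 string literals, which will not hold for every obfuscator; the regular expression, the 16-character minimum, and the sample script are all illustrative assumptions.

```python
import base64
import binascii
import re

# Quoted literals made only of Base64 alphabet characters, at least 16 long (arbitrary cutoff).
B64_LITERAL = re.compile(r"""["']([A-Za-z0-9+/]{16,}={0,2})["']""")


def decode_candidates(js_source: str) -> list:
    """Return printable UTF-8 strings recovered from Base64-looking literals."""
    found = []
    for match in B64_LITERAL.finditer(js_source):
        try:
            text = base64.b64decode(match.group(1), validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded
        if text.isprintable():
            found.append(text)
    return found


# Build a hypothetical minified snippet that hides an endpoint in a string array.
hidden = base64.b64encode(b"https://collect.tracker.example/v1").decode("ascii")
snippet = f'var _0x1a=["{hidden}","push"];'
print(decode_candidates(snippet))
```

Anything this recovers can be compared against the hosts seen in the HAR capture, which gives an independent check on what the model claims the script does.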

Practical Application for Pentesters

When you are on an engagement, your goal is to map the data flow. Start by using urlscan.io to perform an initial reconnaissance of the target's third-party dependencies. This will give you a high-level view of the domains the site communicates with. Once you have identified suspicious endpoints, use your browser to capture a HAR file during a typical user flow, such as logging in or completing a purchase.

Run this HAR file through the HARRY Parser to generate a clean, deduplicated report. Look for any POST requests that contain high-entropy data or encoded strings. If you find a script that seems to be collecting more information than necessary, that is your entry point for further investigation. You are looking for evidence of broken access control where sensitive user data is being sent to a third party without proper authorization or user consent.
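One way to triage POST bodies for high-entropy data is a Shannon entropy score over the body's characters. The sketch below is an illustration under stated assumptions: the 4.0 bits-per-character threshold and the 20-character minimum are arbitrary starting points that need tuning, and the function names and sample entries are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def flag_high_entropy(entries: list, threshold: float = 4.0, min_len: int = 20) -> list:
    """Return (url, entropy) for POST requests whose body scores at or above the threshold."""
    flagged = []
    for entry in entries:
        req = entry["request"]
        if req.get("method") != "POST":
            continue
        body = (req.get("postData") or {}).get("text") or ""
        score = shannon_entropy(body)
        if len(body) >= min_len and score >= threshold:
            flagged.append((req["url"], round(score, 2)))
    return flagged


entries = [
    {"request": {"method": "POST", "url": "https://px.adtech.example/c",
                 "postData": {"text": "abcdefghijklmnopqrstuvwxyz012345"}}},
    {"request": {"method": "POST", "url": "https://shop.example/form",
                 "postData": {"text": "a=1&a=1&a=1&a=1&a=1&a=1"}}},
]
print(flag_high_entropy(entries))
```

Entropy alone produces false positives (compressed or legitimately encrypted payloads score high too), so a flagged request is a prompt for manual review rather than a finding.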

Defensive Considerations

Defenders need to move beyond simple blocklists. The sheer volume of tracking domains makes static blocking ineffective. Instead, focus on implementing a strict Content Security Policy (CSP) that limits the domains from which scripts can be loaded (script-src) and to which data can be sent (connect-src). Regularly audit your third-party dependencies and use tools to monitor for unexpected network requests. If a service does not need to send data to a specific endpoint, block that communication at the network or application level.
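As a rough illustration of such a policy, the sketch below assembles a Content-Security-Policy header value from explicit allowlists. The helper name and the vendor hosts are hypothetical, and a production policy would likely need further directives (for example form-action or frame-src) depending on the application.

```python
def build_csp(script_hosts, connect_hosts, image_hosts=()):
    """Assemble a Content-Security-Policy value from explicit allowlists."""
    directives = {
        "default-src": ["'self'"],
        "script-src": ["'self'", *script_hosts],    # where scripts may be loaded from
        "connect-src": ["'self'", *connect_hosts],  # where fetch/XHR/beacons may send data
        "img-src": ["'self'", *image_hosts],        # also constrains tracking pixels
    }
    return "; ".join(f"{name} {' '.join(values)}" for name, values in directives.items())


# Hypothetical allowlist: one script CDN and one API endpoint, no third-party images.
policy = build_csp(
    script_hosts=["https://cdn.vendor.example"],
    connect_hosts=["https://api.vendor.example"],
)
print("Content-Security-Policy:", policy)
```

Sending the same value in a Content-Security-Policy-Report-Only header first makes it possible to observe what would be blocked before enforcing the policy.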

The cat-and-mouse game of web tracking is only getting more complex. As browsers implement more privacy-focused features, trackers will continue to find new ways to circumvent them. By automating the analysis of network traffic and using modern tools to decode obfuscated scripts, you can stay ahead of the curve and identify data leakage before it becomes a full-blown incident. Stop looking at the surface and start looking at the data being sent under the hood.

Talk Type: tool demo
Difficulty: intermediate
Category: privacy
Has Demo · Has Code · Tool Released


Black Hat USA 2023
