Black Hat 2023

HARRY Parser and the Cursed Tracker: Breaking the Spell of Online Data Collection

Black Hat · 1,098 views · 31:59 · about 2 years ago

This talk demonstrates a technique for analyzing web-based tracking and data collection by parsing HTTP Archive (HAR) files to identify third-party data exfiltration. It focuses on how websites use fingerprinting, CNAME cloaking, and ad-tech pixels to track users across different domains. The speaker introduces a custom tool, HARRY Parser, to automate the extraction and deduplication of tracking data from browser sessions. The presentation highlights the use of AI to assist in reverse-engineering obfuscated JavaScript used by these trackers.

Beyond the Pixel: Automating Data Exfiltration Analysis with HARRY Parser

TLDR: Modern web tracking has evolved into a complex ecosystem of fingerprinting and CNAME cloaking that often evades standard security controls. By parsing HTTP Archive (HAR) files, researchers can now automate the identification of third-party data exfiltration and reverse-engineer obfuscated JavaScript payloads. This approach provides a repeatable methodology for mapping how user data flows from a target site to unauthorized third-party endpoints.

Tracking is no longer just about a simple cookie drop. It has become a sophisticated, multi-layered operation that relies on fingerprinting, CNAME cloaking, and aggressive data aggregation. For a pentester or bug bounty hunter, the challenge is that this activity is buried in hundreds of network requests, which makes manual analysis in Wireshark or the browser developer console an exercise in frustration. The real-world risk is clear: a site that loads dozens of third-party scripts exposes a large supply chain attack surface. If one of those services is compromised, your users' data can be exfiltrated in real time.

The Mechanics of Modern Tracking

The core issue is that many organizations treat third-party scripts as trusted entities. When you visit a site, you are not just interacting with the first-party domain. You are often triggering a cascade of requests to ad-tech platforms, demand-side platforms (DSPs), and data management platforms (DMPs). These services use techniques like browser fingerprinting to uniquely identify users even when cookies are blocked or cleared.

A common technique for bypassing browser restrictions on third-party resources is CNAME cloaking. The site operator creates a subdomain that points to a third-party tracking service via a CNAME record, which lets the tracker masquerade as a first-party resource. The tracker can then set first-party cookies and access data that browser security policies would otherwise restrict. Identifying these relationships manually is difficult because the network traffic looks legitimate at first glance.
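The cloaking check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of HARRY Parser: the hostnames and CNAME chains are hypothetical, and `base_domain` is a deliberately naive helper that assumes the registrable domain is the last two DNS labels.

```python
from typing import List, Optional


def base_domain(hostname: str) -> str:
    """Naive registrable-domain guess: the last two DNS labels.

    Assumption: good enough for illustration. It is wrong for suffixes such
    as co.uk; a real check should consult the Public Suffix List.
    """
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def find_cloaked_hop(hostname: str, cname_chain: List[str]) -> Optional[str]:
    """Return the first CNAME target outside the site's own domain, if any."""
    site = base_domain(hostname)
    for target in cname_chain:
        if base_domain(target) != site:
            return target
    return None


# Hypothetical data: CNAME chains as you might collect them with `dig CNAME`.
observed = {
    "metrics.shop.example": ["shop-example.tracker-cdn.test."],
    "static.shop.example": ["assets.shop.example."],
}

for host, chain in observed.items():
    hop = find_cloaked_hop(host, chain)
    if hop:
        print(f"{host} -> {hop} (possible CNAME cloaking)")
```

A hit only means the subdomain resolves outside the site's own domain. Legitimate CDNs produce the same pattern, so each flagged target still needs to be checked against known tracking services.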

Automating the Hunt with HARRY Parser

To cut through the noise, researchers need a way to ingest and analyze browser session data at scale. The HARRY Parser was built to solve this by automating the extraction and deduplication of data from HTTP Archive (HAR) files. Instead of scrolling through thousands of lines of JSON, you can export a HAR file from your browser, feed it into the parser, and immediately see a breakdown of what data is being sent, to which entities, and via what methods.
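As a rough idea of what this kind of ingestion involves, the sketch below reads the standard HAR structure (`log.entries[].request.url`) and counts requests per third-party host. It is an assumption-laden illustration rather than HARRY Parser's actual code: the function names, the sample data, and the suffix-based first-party test are all hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def load_har(path: str) -> dict:
    """Load a HAR file exported from the browser's network panel."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def third_party_hosts(har: dict, first_party: str) -> Counter:
    """Count requests per host that is not the first-party domain or a subdomain of it."""
    counts = Counter()
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname or ""
        if host != first_party and not host.endswith("." + first_party):
            counts[host] += 1
    return counts


# Tiny inline HAR so the example runs on its own (hypothetical hosts).
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://shop.example/checkout"}},
            {"request": {"method": "GET", "url": "https://cdn.shop.example/app.js"}},
            {"request": {"method": "GET", "url": "https://px.tracker.test/collect?uid=abc123"}},
            {"request": {"method": "POST", "url": "https://px.tracker.test/collect"}},
            {"request": {"method": "POST", "url": "https://sync.adnet.test/match"}},
        ]
    }
}

for host, count in third_party_hosts(sample_har, "shop.example").most_common():
    print(f"{count:3d}  {host}")
```

Matching on hostname suffix will not catch CNAME-cloaked trackers, since those deliberately sit under the first-party domain.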

The tool excels at identifying the "who" and "what" of data collection. It breaks down query strings, headers, cookies, and POST bodies by URL. This is critical for identifying when sensitive information, such as session IDs or device configuration details, is being leaked to an unexpected third party.
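A per-URL breakdown with deduplication might look like the following sketch. It relies only on fields defined by the HAR format (`queryString`, `headers`, `cookies`, `postData.text`). The `collect_fields` function and the sample entries are hypothetical and are not taken from the tool itself.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def collect_fields(entries: list) -> dict:
    """Map each endpoint (URL without its query string) to the unique fields sent to it."""
    seen = defaultdict(set)
    for entry in entries:
        req = entry["request"]
        parts = urlsplit(req["url"])
        endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
        for kind, key in (("query", "queryString"), ("header", "headers"), ("cookie", "cookies")):
            for item in req.get(key, []):
                # A set drops repeats of the same (kind, name, value) triple.
                seen[endpoint].add((kind, item["name"], item["value"]))
        body = req.get("postData", {}).get("text")
        if body:
            seen[endpoint].add(("post", "body", body))
    return seen


# Hypothetical entries: the same pixel fires twice, then a POST follows.
pixel = {"request": {
    "url": "https://px.tracker.test/collect?uid=abc123&sw=1920",
    "queryString": [{"name": "uid", "value": "abc123"}, {"name": "sw", "value": "1920"}],
    "headers": [{"name": "Referer", "value": "https://shop.example/cart"}],
    "cookies": [],
}}
post = {"request": {
    "url": "https://px.tracker.test/collect",
    "queryString": [],
    "headers": [],
    "cookies": [{"name": "_sid", "value": "s-42"}],
    "postData": {"mimeType": "application/json", "text": '{"email": "a@b.test"}'},
}}

fields = collect_fields([pixel, pixel, post])
for endpoint, items in fields.items():
    print(endpoint)
    for kind, name, value in sorted(items):
        print(f"  {kind:6s} {name} = {value}")
```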

Leveraging AI for Obfuscated JavaScript

One of the biggest hurdles in analyzing these trackers is the obfuscated JavaScript they use to hide their data collection logic. These scripts are often minified and intentionally difficult to read. Rather than spending hours manually de-obfuscating code, you can now use LLMs like ChatGPT or Google Bard to assist in the reverse-engineering process.

During a recent analysis, I encountered a script that was heavily obfuscated and used complex base64 encoding to hide its exfiltration targets. By feeding the script into an LLM, I was able to quickly identify the underlying logic and the specific data points being collected. While you should never blindly trust the output of an AI, it acts as a force multiplier for identifying patterns in minified code that would otherwise take a human researcher significantly longer to parse.
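Before handing a script to an LLM, or as a way to check what it tells you, a simple pre-pass can surface base64-encoded string literals. The sketch below is one possible approach and not the speaker's workflow. The regular expression, the 16-character minimum, and the sample script are assumptions chosen for illustration.

```python
import base64
import binascii
import re
from typing import List

# Quoted string literals made only of base64 characters, at least 16 long.
B64_LITERAL = re.compile(r"""["']([A-Za-z0-9+/]{16,}={0,2})["']""")


def decode_b64_literals(js_source: str) -> List[str]:
    """Return printable UTF-8 strings recovered from base64-looking literals."""
    found = []
    for match in B64_LITERAL.finditer(js_source):
        candidate = match.group(1)
        try:
            text = base64.b64decode(candidate, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        if text.isprintable():
            found.append(text)
    return found


# Hypothetical minified snippet with an encoded endpoint.
encoded = base64.b64encode(b"https://collect.tracker.test/v1/e").decode("ascii")
js = 'var _0x1a="' + encoded + '";navigator.sendBeacon(atob(_0x1a),d);'

print(decode_b64_literals(js))
```

Ordinary identifiers can occasionally decode to printable text by coincidence, so treat the output as leads to verify rather than findings.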

Practical Application for Pentesters

When you are on an engagement, your goal is to map the data flow. Start by using urlscan.io to perform initial reconnaissance of the target's third-party dependencies. This will give you a high-level view of the domains the site communicates with. Once you have identified suspicious endpoints, use your browser to capture a HAR file during a typical user flow, such as logging in or completing a purchase.

Run this HAR file through the HARRY Parser to generate a clean, deduplicated report. Look for any POST requests that contain high-entropy data or encoded strings. If you find a script that seems to be collecting more information than necessary, that is your entry point for further investigation. You are looking for evidence of broken access control where sensitive user data is being sent to a third party without proper authorization or user consent.
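One way to triage for high-entropy values is to compute Shannon entropy per character and flag long values above a cutoff. The sketch below assumes a threshold of 4.0 bits and a minimum length of 20 characters. Both numbers are illustrative guesses to tune against your own data, and the field names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in Counter(value).values())


def flag_high_entropy(fields: dict, threshold: float = 4.0, min_length: int = 20) -> list:
    """Return the names of fields whose values are long and look random or encoded."""
    return [
        name
        for name, value in fields.items()
        if len(value) >= min_length and shannon_entropy(value) >= threshold
    ]


# Hypothetical POST fields pulled from a parsed HAR entry.
post_fields = {
    "page": "checkout-step-2-of-3",
    "d": "q7Zp2Xk9LmT4vB8nR1cW6yH3sJ0dF5gA",
}

print(flag_high_entropy(post_fields))
```

Entropy only tells you a value looks random or encoded. Whether it carries user data still has to be established by decoding it or tracing the script that produced it.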

Defensive Considerations

Defenders need to move beyond simple blocklists. The sheer volume of tracking domains makes static blocking ineffective. Instead, focus on implementing a strict Content Security Policy (CSP) that limits the domains from which scripts can be loaded and to which data can be sent. Regularly audit your third-party dependencies and use tools to monitor for unexpected network requests. If a service does not need to send data to a specific endpoint, block that communication at the network or application level.
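As a concrete illustration of the CSP advice, the sketch below assembles a policy in which `script-src` limits where scripts may load from and `connect-src` limits where `fetch`, XHR, WebSocket, and beacon requests may go. The allowlisted origins are hypothetical placeholders, and `build_csp` is a helper written for this example only.

```python
def build_csp(script_sources: list, connect_sources: list) -> str:
    """Build a Content-Security-Policy header value from two allowlists."""
    directives = {
        # Fallback for anything not listed below, including images (tracking pixels).
        "default-src": ["'self'"],
        "script-src": ["'self'", *script_sources],
        "connect-src": ["'self'", *connect_sources],
    }
    return "; ".join(f"{name} {' '.join(values)}" for name, values in directives.items())


# Hypothetical allowlist: one payment provider, nothing else.
policy = build_csp(
    script_sources=["https://js.payments.example"],
    connect_sources=["https://api.payments.example"],
)
print("Content-Security-Policy:", policy)
```

Sending the same value in a `Content-Security-Policy-Report-Only` header first lets you see what would be blocked before the policy is enforced.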

The cat-and-mouse game of web tracking is only getting more complex. As browsers implement more privacy-focused features, trackers will continue to find new ways to circumvent them. By automating the analysis of network traffic and using modern tools to decode obfuscated scripts, you can stay ahead of the curve and identify data leakage before it becomes a full-blown incident. Stop looking at the surface and start looking at the data being sent under the hood.

Talk Type

tool demo

Difficulty

intermediate

Black Hat USA 2023
