Black Hat 2024

Parse Me Baby One More Time: Bypassing HTML Sanitizer via Parsing Differentials

This talk demonstrates how inconsistencies in HTML parsing between sanitizers and browsers can be exploited to bypass security filters and achieve Cross-Site Scripting (XSS). The research highlights that sanitizers often fail to account for browser-specific parsing quirks, such as how different elements handle nested tags or invalid HTML structures. The speaker introduces a tool called MutaGen, which automatically generates payloads designed to trigger these parsing differentials. The findings emphasize that server-side HTML sanitization is inherently fragile and that developers should avoid allowing user-supplied HTML whenever possible.

Why Your HTML Sanitizer Is Probably Lying to You

TLDR: HTML sanitizers are fundamentally flawed because they cannot account for the vast, inconsistent, and browser-specific ways that HTML is parsed. By using MutaGen, researchers have demonstrated that parsing differentials between sanitizers and browsers allow attackers to bypass filters and achieve XSS. If you are building or testing applications, stop relying on sanitization for user-supplied HTML and move toward safer alternatives like Markdown or strict content security policies.

Security researchers have spent years chasing the dragon of perfect HTML sanitization. We treat sanitizers like DOMPurify as a silver bullet for preventing Cross-Site Scripting (XSS), assuming that if we pass a string through a filter, the output is safe for the browser to render. This assumption is dangerous. The HTML specification does define a parsing algorithm, but what actually runs in browsers is a collection of browser-specific behaviors, quirks, and legacy support that no single library can fully replicate.

The Mechanics of Parsing Differentials

The core issue is that a sanitizer parses input into an abstract representation, cleans it, and then serializes it back into a string. The browser then takes that string and parses it a second time to build its own Document Object Model (DOM). If the sanitizer and the browser disagree on how to interpret a specific sequence of tags or attributes, you have a parsing differential.

Consider the iframe element. The HTML specification defines its content model as "nothing," and the HTML parser treats everything between the opening and closing iframe tags as raw text rather than markup. A sanitizer that instead parses that content as ordinary elements and attributes builds a different tree than the browser does. If a closing iframe tag is hidden inside what the sanitizer believes is an attribute value, the sanitizer sees a harmless attribute and passes the payload through. The browser, however, ends the iframe at that closing tag and parses whatever follows as live markup.
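To make the iframe case concrete, here is a minimal sketch of two toy tokenizers that disagree about the same string. It is not a real sanitizer or browser parser: the start_tags function, the regular expression, and the payload are illustrative assumptions. The only rule it relies on is that the HTML parser treats iframe content as raw text up to the closing tag.

```python
import re

# Toy start-tag matcher: quoted attribute values are allowed to contain '>'.
START_TAG = re.compile(r"""<([a-zA-Z][a-zA-Z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>""")


def start_tags(html: str, rawtext: frozenset = frozenset()) -> list:
    """Return the start-tag names a toy tokenizer sees in `html`.

    Elements named in `rawtext` swallow everything up to their closing tag,
    which is how the HTML parser handles iframe content.
    """
    names = []
    pos = 0
    while True:
        match = START_TAG.search(html, pos)
        if match is None:
            return names
        name = match.group(1).lower()
        names.append(name)
        pos = match.end()
        if name in rawtext:
            # Skip the raw text: resume scanning at the closing tag.
            end = html.lower().find(f"</{name}", pos)
            if end == -1:
                return names  # unclosed: the rest of the input is raw text
            pos = end


# Hypothetical payload: the closing iframe tag hides inside an attribute value.
payload = '<iframe><a title="</iframe><img src=x onerror=alert(1)>">'

sanitizer_view = start_tags(payload)                        # iframe content parsed as markup
browser_view = start_tags(payload, frozenset({"iframe"}))   # iframe content parsed as raw text

print(sanitizer_view)  # ['iframe', 'a']
print(browser_view)    # ['iframe', 'img']
```

The toy sanitizer never sees an img element, so it has nothing to strip. The raw-text-aware view ends the iframe early and finds an img with an event handler.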

This is where MutaGen changes the game. Instead of manually crafting payloads to find these edge cases, MutaGen automates the generation of complex, nested HTML structures designed to force browsers into a state of repair. When a browser encounters invalid HTML, it attempts to "fix" the structure to make it renderable. These mutations are often where the vulnerability lies. By comparing the output of a sanitizer against the actual DOM generated by different browsers, MutaGen identifies where the sanitizer’s view of the world diverges from the browser’s reality.
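The generate-and-compare idea can be sketched in a few lines. This is not MutaGen's implementation or API: every name below (nested_payloads, markup_view, rawtext_view, find_differentials) is hypothetical, and the two "views" are toy stand-ins for a sanitizer's parser and a browser's parser.

```python
import re
from itertools import product

# Toy subset of elements whose content the HTML parser treats as raw text.
RAWTEXT = {"iframe", "noembed", "xmp"}


def nested_payloads(tags, depth, marker):
    """Yield every nesting of `depth` opening tags around `marker`, left unclosed on purpose."""
    for combo in product(tags, repeat=depth):
        yield "".join(f"<{tag}>" for tag in combo) + marker


def markup_view(html):
    """Toy 'sanitizer': every '<name' sequence counts as an element."""
    return re.findall(r"<([a-z][a-z0-9]*)", html)


def rawtext_view(html):
    """Toy 'browser': an unclosed raw-text element swallows the rest of the input."""
    names = []
    for name in markup_view(html):
        names.append(name)
        if name in RAWTEXT:
            break
    return names


def find_differentials(payloads, view_a, view_b):
    """Return (payload, a, b) for every payload on which the two views disagree."""
    found = []
    for payload in payloads:
        a, b = view_a(payload), view_b(payload)
        if a != b:
            found.append((payload, a, b))
    return found


results = find_differentials(
    nested_payloads(["div", "iframe"], depth=2, marker="<img src=x>"),
    markup_view,
    rawtext_view,
)
for payload, a, b in results:
    print(payload, a, b)
```

In a real harness, the two views would be the sanitizer's serialized output and the DOM produced by an actual browser, and the tag list and nesting depth would be far larger.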

Why Sanitization Fails in the Wild

During recent research, we tested 11 different sanitizers across five languages and platforms: Java, JavaScript, PHP, Ruby, and .NET. The results were consistent: every single one of them had functional deficiencies. On average, the parsing accuracy compared to a modern browser was below 60 percent. Even when the sanitizers were technically "secure," they often mangled input by parsing it incorrectly, which can break application functionality or create new, unintended attack vectors.

One of the most persistent issues involves the noscript tag. Its content is parsed differently depending on whether scripting is enabled in the browser: as raw text when it is, and as ordinary markup when it is not. That makes it effectively impossible to sanitize correctly on the server side. The server has no way of knowing the client's state, so it cannot know which of the two interpretations the browser will apply. If a sanitizer lets a noscript tag through, it is essentially handing the browser a blank check to interpret the content however it sees fit.

Namespace confusion is another common failure point. When you mix HTML with SVG or MathML, you are switching between different parsing modes. Sanitizers often struggle to keep track of these transitions. If a sanitizer fails to recognize that it has entered an SVG context, it may fail to filter out dangerous attributes that are valid in SVG but would be blocked in standard HTML.
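One concrete way the namespace matters: in the HTML namespace the content of a style element is raw text, while inside SVG or MathML the same tag name gets its content parsed as markup. The sketch below models only that one rule. The function names are hypothetical, and it deliberately ignores integration points such as foreignObject, which switch parsing back to HTML rules.

```python
HTML, SVG, MATHML = "html", "svg", "mathml"

# Toy subset of elements whose content is raw text, but only in the HTML namespace.
RAWTEXT_IN_HTML = {"style", "iframe", "xmp", "noembed", "noframes"}


def current_namespace(ancestors):
    """Namespace of the innermost open element (simplified: nearest svg/math ancestor wins)."""
    for name in reversed(ancestors):
        if name == "svg":
            return SVG
        if name == "math":
            return MATHML
    return HTML


def content_is_raw_text(ancestors, name):
    """True if the content of a new `name` element would be tokenized as raw text."""
    return current_namespace(ancestors) == HTML and name in RAWTEXT_IN_HTML


print(content_is_raw_text(["div"], "style"))         # True: HTML style holds raw text
print(content_is_raw_text(["div", "svg"], "style"))  # False: SVG style content is markup
```

A sanitizer that answers this question without tracking the ancestor chain will treat markup inside an SVG style element as inert text, or the reverse, and its tree will diverge from the browser's.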

Practical Implications for Pentesters

If you are on a penetration test or a bug bounty engagement, stop looking for simple <script>alert(1)</script> payloads. Those are caught by even the most basic filters. Instead, focus on the structural integrity of the HTML. Look for inputs that are reflected inside innerHTML assignments or document.write calls. These are the primary sinks where parsing differentials manifest.
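A quick way to shortlist those sinks during a review is a line-based scan of the client-side JavaScript. The sketch below is a rough triage aid under obvious assumptions: the patterns and the find_sinks helper are illustrative, it will miss sinks reached through aliases or frameworks, and every hit still needs manual tracing back to user input.

```python
import re

# Patterns for the two sinks named above; comparisons such as == are not matched.
SINK_PATTERNS = {
    "innerHTML assignment": re.compile(r"\.innerHTML\s*\+?=(?!=)"),
    "document.write call": re.compile(r"\bdocument\.write(?:ln)?\s*\("),
}


def find_sinks(js_source):
    """Return (line_number, sink_label) pairs for each line that contains a sink."""
    hits = []
    for lineno, line in enumerate(js_source.splitlines(), start=1):
        for label, pattern in SINK_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


# Hypothetical snippet of application code.
sample = """el.innerHTML = userBio;
if (el.innerHTML == "") { return; }
document.write(banner);
"""

for lineno, label in find_sinks(sample):
    print(lineno, label)
```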

Use MutaGen to generate payloads that target the specific sanitization library in use. If you identify that the application is using an outdated version of a library like Google Caja, you are almost certainly looking at a trivial bypass. Even with modern libraries, look for nested structures that force the browser to perform error recovery. If you can get the browser to "repair" your payload into a valid script tag, you have successfully bypassed the sanitizer.

Moving Beyond Sanitization

Defenders need to accept that server-side HTML sanitization is a losing battle. The complexity of the HTML specification makes it impossible to guarantee that a filter will always match the browser's parsing logic. If your application requires rich text input, consider using a safer alternative like Markdown. If you absolutely must allow HTML, implement a strict Content Security Policy (CSP) that prevents the execution of inline scripts.
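As a rough illustration of such a policy, the sketch below builds a Content-Security-Policy header value with a per-response nonce. The build_csp helper and the exact directive set are assumptions chosen for the example, not a recommendation from the talk, and a real policy has to be tuned to the application.

```python
import secrets


def build_csp(nonce):
    """Build a CSP header value that allows only same-origin and nonce-tagged scripts."""
    directives = {
        "default-src": ["'self'"],
        # No 'unsafe-inline': inline scripts without the nonce are refused.
        "script-src": ["'self'", f"'nonce-{nonce}'"],
        "object-src": ["'none'"],
        "base-uri": ["'none'"],
    }
    return "; ".join(f"{name} {' '.join(values)}" for name, values in directives.items())


# Generate a fresh, unpredictable nonce for every response.
nonce = secrets.token_urlsafe(16)
header_value = build_csp(nonce)
print("Content-Security-Policy:", header_value)
```

Under a policy like this, an img element with an onerror handler that slips past the sanitizer still reaches the DOM, but the browser refuses to run the handler, because inline event handlers count as inline script.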

Sanitization should be your last line of defense, not your first. By reducing the attack surface and relying on browser-level security controls, you can mitigate the risks that parsing differentials introduce. The next time you see a "sanitized" input field, don't assume it's clean. Assume it's just waiting for the right browser quirk to turn your payload into code.

Talk Type

research presentation

Difficulty

advanced