Black Hat 2023

The Yandex Leak: How a Russian Search Giant Uses Consumer Data

This talk analyzes the 45GB Yandex source code leak to reveal how the company's internal analytics SDKs, specifically AppMetrica and Crypta, collect and process granular user data. The research demonstrates how Yandex uses 'gluing' techniques to link disparate identifiers—including device IDs, Wi-Fi SSIDs, and biometric voice data—to create comprehensive, re-identifiable user profiles. The analysis highlights the risks of third-party SDKs in mobile applications and the potential for such data to be accessed by state-controlled entities.

How Yandex SDKs Turn Anonymized Analytics Into De-Anonymized User Profiles

TLDR: The 45GB Yandex source code leak exposes how the company’s analytics SDKs, AppMetrica and Crypta, perform "gluing" to link disparate user identifiers into persistent, re-identifiable profiles. By combining device IDs, Wi-Fi SSIDs, and even biometric voice data, Yandex creates granular behavioral profiles that state-controlled entities can compel access to. Security researchers and developers must audit third-party SDKs for data leakage, because hashing identifiers does not make the resulting profiles anonymous.

Mobile application security often focuses on binary hardening, API endpoint security, or local storage encryption. We rarely audit the third-party SDKs we import, assuming that "anonymized" analytics are just that—anonymous. The Yandex source code leak from early 2023 proves this assumption is dangerous. By analyzing the internal logic of the AppMetrica and Crypta SDKs, we can see exactly how a massive search engine builds persistent, cross-platform user profiles that are trivial to de-anonymize.

The Mechanics of Data Gluing

At the core of Yandex’s data collection is a process they call "gluing." While the company publicly claims that the data collected by its SDKs is non-personalized and limited, the source code reveals a different reality. The SDKs collect a vast array of identifiers, including device network information, IP addresses, and Wi-Fi SSIDs.

The vulnerability here is not a traditional buffer overflow or injection flaw. It is an architectural weakness that maps most closely to the Broken Access Control and Security Misconfiguration categories. The SDKs pass these identifiers through hashing functions, but hashing alone does not anonymize them: the same input always yields the same output, so each hash is a stable pseudonym. And because the SDKs collect so many distinctive data points, such as Wi-Fi signal strengths and specific device hardware metadata, the resulting "anonymized" record is often unique enough to act as a persistent fingerprint.
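To make the fingerprinting point concrete, here is a minimal Python sketch. It is not taken from the leak: the attribute names, the sample values, and the `fingerprint` helper are all hypothetical, chosen only to show that hashing a bundle of stable device attributes yields the same digest every time.

```python
import hashlib


def fingerprint(attributes: dict[str, str]) -> str:
    """Hash a bundle of device attributes into one hex digest.

    Hypothetical helper: keys are sorted so the same attributes always
    produce the same canonical string, and therefore the same digest.
    """
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=16).hexdigest()


# Illustrative attribute values, not real collected data.
device_a = {
    "model": "Pixel 7",
    "os": "Android 14",
    "screen": "1080x2400",
    "wifi_ssid": "HomeNet-5G",
}
device_b = dict(device_a, wifi_ssid="CoffeeShop")

fp_a_day1 = fingerprint(device_a)
fp_a_day2 = fingerprint(dict(device_a))  # same attributes seen again later
fp_b = fingerprint(device_b)

print(fp_a_day1 == fp_a_day2)  # the digest is stable across observations
print(fp_a_day1 == fp_b)       # a single differing attribute changes it
```

The digest hides the raw values, but it still behaves as a persistent key: anything recorded under it on one day can be joined to anything recorded under it on another.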

Consider this snippet from the leaked codebase, which demonstrates how the SDK handles identifier matching:

select
    HexEncode(Digest::Blake2B(DeviceID, seed)) as DeviceID,
    HexEncode(Digest::Blake2B(ADVID, seed)) as ADVID,
    HexEncode(Digest::Blake2B(IFA, seed)) as IFA,
    HexEncode(Digest::Blake2B(UUID, seed)) as UUID,
    HexEncode(Digest::Blake2B(AndroidID, seed)) as AndroidID

While the use of Blake2B might look like a security measure, it is merely a way to normalize disparate IDs into a common format. Every identifier is hashed with the same seed, so equal inputs always produce equal outputs and hashed records can still be joined against one another. Because the SDKs have access to the raw DeviceID, ADVID, and AndroidID before hashing, they can maintain a mapping table. If a user resets their advertising ID, the SDK simply "glues" the new ID to the existing profile using the other, more permanent hardware identifiers.
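The gluing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Yandex's implementation: the `Gluer` class, the `observe` function, the seed value, and the sample IDs are all hypothetical, and a union-find structure stands in for whatever mapping table the real pipeline uses.

```python
import hashlib

SEED = b"hypothetical-seed"  # assumption: stands in for `seed` in the query


def pseudonymize(value: str) -> str:
    """Keyed Blake2B digest: deterministic for a fixed seed."""
    return hashlib.blake2b(value.encode("utf-8"), key=SEED, digest_size=16).hexdigest()


class Gluer:
    """Union-find over hashed identifiers; each set is one user profile."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving keeps lookups cheap as profiles grow.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def glue(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_profile(self, a, b):
        return self.find(a) == self.find(b)


def observe(gluer, event):
    """Identifiers reported together in one event belong to one profile."""
    ids = [f"{kind}:{pseudonymize(value)}" for kind, value in sorted(event.items())]
    for other in ids[1:]:
        gluer.glue(ids[0], other)
    return ids


gluer = Gluer()
# Before the reset: the stable AndroidID arrives with the old advertising ID.
observe(gluer, {"AndroidID": "a1b2c3", "ADVID": "old-ad-id"})
# After the reset: the same AndroidID arrives with a fresh advertising ID.
observe(gluer, {"AndroidID": "a1b2c3", "ADVID": "new-ad-id"})

old_ad = "ADVID:" + pseudonymize("old-ad-id")
new_ad = "ADVID:" + pseudonymize("new-ad-id")
linked = gluer.same_profile(old_ad, new_ad)
print(linked)  # True: the reset did not break the profile
```

Note that the linking works entirely on hashed values. Resetting one identifier only helps if every other identifier reported alongside it changes at the same time.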

From Behavioral Analytics to State Surveillance

The real-world impact of this data collection is significant. Yandex uses this data to build "segments"—pre-defined user groups based on behavior, location, and demographics. The leak shows segments for everything from "smokers" to "young men of military age planning to leave Russia."

The danger is that these segments are not just for ad targeting. Because Yandex operates in a jurisdiction where the state can compel data disclosure, these granular profiles become a surveillance tool. The 2023 reporting by Meduza highlights that Yandex is required to provide the FSB with constant access to taxi ride data. When you combine this with the "gluing" logic, it becomes clear that the company can link a user’s physical movements to their search history, email, and social media accounts.

Pentesting the Third-Party SDK

For a pentester, this research changes how we approach mobile application assessments. We can no longer treat SDKs as black boxes. During an engagement, you should:

Intercept Traffic: Use a tool like Burp Suite to inspect the payloads sent by third-party SDKs. Look for unique identifiers being transmitted in cleartext or weakly hashed formats.
Static Analysis: If you have access to the application’s source code or can decompile the APK/IPA, search for initialization strings related to known analytics SDKs. Check if the application is passing sensitive data—like location, contact lists, or hardware IDs—to these SDKs.
Data Flow Mapping: Identify where the SDK sends its data. If the endpoint is a foreign server, document the potential for data residency violations or unauthorized access by third parties.
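
As a starting point for the traffic-inspection step, the following Python sketch walks a captured JSON request body and flags values shaped like device identifiers. The field names, the sample payload, and the pattern set are assumptions made for illustration; real SDK payloads may be binary, compressed, or encrypted and would need decoding first.

```python
import json
import re

# Shapes that commonly indicate a device identifier. A heuristic list,
# not an exhaustive one; tune it per engagement.
PATTERNS = {
    "uuid": re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
    "mac_address": re.compile(r"(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}"),
    "hex16": re.compile(r"[0-9a-f]{16}"),          # the shape of an ANDROID_ID
    "hex_digest": re.compile(r"[0-9a-fA-F]{32,128}"),
}


def find_identifiers(node, path="$"):
    """Recursively collect (json_path, label) for identifier-like strings."""
    findings = []
    if isinstance(node, dict):
        for key, value in node.items():
            findings += find_identifiers(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            findings += find_identifiers(value, f"{path}[{index}]")
    elif isinstance(node, str):
        for label, pattern in PATTERNS.items():
            if pattern.fullmatch(node):
                findings.append((path, label))
                break
    return findings


# Hypothetical request body, as it might be exported from an intercepting proxy.
captured_body = json.dumps({
    "event": "app_open",
    "device": {
        "adv_id": "38400000-8cf0-11bd-b23e-10b96e40000d",
        "android_id": "9774d56d682e549c",
    },
    "wifi": [{"ssid": "HomeNet-5G", "bssid": "00:11:22:33:44:55"}],
})

findings = find_identifiers(json.loads(captured_body))
for json_path, label in findings:
    print(f"{json_path}: looks like {label}")
```

A hit on `hex_digest` deserves the same attention as a raw identifier: a hashed ID that stays constant across sessions tracks the user just as well.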

Defensive Considerations

Defenders must treat third-party SDKs as untrusted code. If your application imports an SDK, you are effectively granting that vendor access to your user’s data. Implement strict egress filtering to prevent unauthorized data exfiltration and, where possible, use proxying or data-scrubbing layers to strip sensitive identifiers before they reach the SDK’s servers.
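
A data-scrubbing layer of this kind can be sketched as follows. The key names in `SENSITIVE_KEYS` and the host in `ALLOWED_HOSTS` are hypothetical placeholders, and the sketch assumes the application controls the event dictionary before handing it to the SDK, which is not true of every integration.

```python
from urllib.parse import urlsplit

# Hypothetical field names; adapt to what the SDK in question actually sends.
SENSITIVE_KEYS = {"android_id", "adv_id", "ifa", "device_id", "ssid", "bssid", "lat", "lon"}

# Hypothetical egress allowlist for analytics traffic.
ALLOWED_HOSTS = {"analytics.example.com"}


def scrub(event):
    """Return a copy of the event with sensitive keys removed at any depth."""
    if isinstance(event, dict):
        return {
            key: scrub(value)
            for key, value in event.items()
            if key.lower() not in SENSITIVE_KEYS
        }
    if isinstance(event, list):
        return [scrub(item) for item in event]
    return event


def egress_allowed(url: str) -> bool:
    """Allow outbound analytics requests only to explicitly approved hosts."""
    return urlsplit(url).hostname in ALLOWED_HOSTS


raw_event = {
    "event": "app_open",
    "device": {"android_id": "9774d56d682e549c", "model": "Pixel 7"},
    "wifi": [{"ssid": "HomeNet-5G", "rssi": -40}],
}
clean_event = scrub(raw_event)
print(clean_event)
print(egress_allowed("https://analytics.example.com/v1/events"))
print(egress_allowed("https://tracker.example.net/collect"))
```

A key-based denylist is the simplest design but also the weakest, since it misses identifiers sent under unexpected names. An allowlist of permitted fields is stricter if the event schema is known in advance.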

The Yandex leak is a stark reminder that privacy is not just about encryption at rest. It is about the metadata we generate and the tools we allow to collect it. When we build applications, we are responsible for the entire data lifecycle. If we don't know what our SDKs are doing, we are not just failing our users—we are potentially handing their digital lives to the highest bidder or the most powerful state actor. Stop trusting the "anonymized" label and start auditing the data flow.

Talk Type: research presentation

Difficulty: intermediate