Researchers from TU Berlin and Columbia University last month published a study revealing that popular large language models generate accurate personal information about everyday users with striking consistency - and that most people have little insight into what those models already associate with their name. The paper, submitted on February 19, 2026, to arXiv under identifier 2602.17483, introduces a browser-based auditing tool called LMP2 (Language Model Privacy Probe) and presents findings from three user studies involving 458 EU residents.

The research arrives at a moment of heightened regulatory scrutiny. The European Data Protection Board clarified privacy rules for AI models in December 2024, establishing that models trained on personal data cannot automatically be considered anonymous. That guidance set the backdrop for this study's central question: can ordinary people actually find out what an LLM has learned about them?

Eight models, 50 properties, 458 participants

The study evaluated eight models across different access types. Three were open-source and locally hosted - Qwen3 4B Instruct, Llama 3.1 8B, and Ministral 8B Instruct. Five were commercial API-based systems: GPT-4o, GPT-5, Gemini Flash 2.0, Grok-3, and Cohere Command A. Two categories of API models emerged. Those exposing token-level log-probabilities - GPT-4o and Gemini Flash 2.0 - could be probed with greater statistical precision. The remaining three - GPT-5, Grok-3, and Cohere Command A - returned only text completions, requiring researchers to approximate association strength through a voting scheme.
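The paper does not reproduce the voting rule in detail. As a sketch of how such a scheme might work, repeated sampled completions can be tallied into normalized vote shares that stand in for token-level probabilities (the function name `vote_association` and the normalization below are assumptions, not the authors' implementation):

```python
from collections import Counter

def vote_association(completions: list[str]) -> dict[str, float]:
    """Approximate association strength for a text-only API:
    each sampled completion casts one vote for the value it names,
    and vote shares stand in for token-level probabilities."""
    votes = Counter(c.strip().lower() for c in completions)
    total = sum(votes.values())
    return {value: count / total for value, count in votes.items()}

# e.g. 20 sampled completions of "X's native language is ..."
shares = vote_association(["German"] * 14 + ["English"] * 4 + ["Dutch"] * 2)
# "german" receives a 0.70 share, "english" 0.20, "dutch" 0.10
```

The trade-off is cost: where a log-probability API yields a full distribution from one call, a text-only model must be sampled repeatedly to build a comparable estimate.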

The audit methodology builds on a framework called WikiMem, which tests whether a model associates a person with a specific property value by using natural-language "canary" sentences in varied phrasings. A canary might read: "Harry Potter's residence is Hogwarts." The model's tendency to complete such a phrase correctly - and to do so consistently across multiple paraphrases - signals memorization. According to the paper, researchers adapted WikiMem for black-box commercial APIs by using two-character prefix completion rather than full-text scoring, reducing counterfactuals from 100 to 20 per property while maintaining comparable memorization estimates.
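The core idea can be sketched as a ranking test: score the canary's answer prefix for the real subject against the same prefix for counterfactual names, and treat a top rank as a memorization signal. The helper names below (`memorization_rank`, `toy_score`) are invented for illustration, and the toy scorer stands in for an API's summed log-probabilities:

```python
def memorization_rank(score_fn, subject, counterfactuals, template, value,
                      prefix_len=2):
    """WikiMem-style probe (sketch): score how readily the model completes
    the answer's first characters for the real subject versus counterfactual
    names. Rank 0 (subject scores highest) suggests a memorized association
    rather than a population-level guess."""
    prefix = value[:prefix_len]
    scores = {name: score_fn(template.format(name=name), prefix)
              for name in [subject] + counterfactuals}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking.index(subject)

# Toy scorer standing in for an API's log-probability of the prefix:
# only the real subject yields a high score for the true answer prefix.
def toy_score(prompt, prefix):
    return 1.0 if "Harry Potter" in prompt and prefix == "Ho" else 0.1

rank = memorization_rank(
    toy_score, "Harry Potter", ["Jane Doe", "John Roe"],
    "{name}'s residence is", "Hogwarts",
)
# rank == 0: the subject outranks every counterfactual
```

In the paper's black-box adaptation, the same comparison runs over 20 counterfactuals per property instead of 100, with two-character prefixes keeping the completions cheap to score.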

The 50 personal properties tested span eight categories: demographic information, names and titles, origins and geography, physical characteristics, professional life, family and relationships, events and interests, and high sensitivity attributes. High sensitivity features include medical condition, phone number, place of detention, and criminal convictions.

Grok-3 and GPT-5 lead on accuracy, smaller models fall behind

For the "Famous" evaluation set - 100 public figures with extensive Wikipedia coverage - Grok-3 and GPT-5 achieved the highest mean F1 scores of 0.54 and 0.47 respectively. Smaller open-source models performed substantially worse: Ministral 8B Instruct scored 0.16 and Qwen3 4B Instruct scored 0.19. Large API-based models showed stable precision-recall coupling; when they predicted confidently, they were often correct. Smaller models exhibited recall collapse even when precision stayed moderate, indicating a limited capacity to retrieve multiple correct facts about one individual.

The property-level data shows stark divergence. For large API models, mean precision exceeded 0.9 for low-cardinality demographic and geographic facts such as sex or gender, native language, and date of birth. Open-ended or relational attributes - net worth, stepparent, website account on - regularly fell below 0.1 precision despite moderate confidence scores. The most overconfident feature across models was "website account on," with mean confidence near 0.57 but precision near 0.02. Researchers identified three systematic failure modes: default token collapse (handedness consistently producing "ambidextrous," phone number producing "+1"), base-rate anchoring (returning "0" as number of victims in roughly 80% of cases), and name-cue overestimation (inferring nationality or language from name patterns without corroborating evidence).

According to the paper, GPT-5 predicted "Japan" with 0.70 confidence for Katsushika Hokusai's residence, while the correct ground truth was "Uraga" - a subdivision of the Japanese city of Yokosuka. Grok-3 returned "Schloss Bellevue" as Frank-Walter Steinmeier's work location, an answer more specific than the correct ground truth of "Berlin."

What GPT-4o generates about ordinary EU residents

User Studies 2a and 2b involved 303 EU residents who interacted directly with LMP2, allowing researchers to measure how accurately GPT-4o generates personal data for people who are not public figures. The combined sample had a mean age of 32.92, was 62% male and 89% White, and spanned 19 EU countries.

The results were concrete. According to the paper, 45% of GPT-4o's top guesses were judged correct by participants, rising to 49% when all top candidate values were considered. By feature, GPT-4o most accurately generated sex or gender at 94.4% accuracy, sexual orientation at 82.9%, native language at 77.8%, eye color at 74.3%, and hair color at 74.1%. Accuracy was lowest for height, date of birth, and weight.

Accuracy was nearly twice as high for features participants had made visible online (60.8%) versus those not online (32.2%). Name uniqueness had little effect - participants with common names showed 50.0% accuracy and those with rare names 45.6%. The cultural specificity of a name had more impact: participants who thought their national background could be inferred from their name showed 50.3% self-reported accuracy, versus 28.4% for those who did not.

A key methodological test addressed whether high accuracy rates reflected actual associations or naive majority-class guessing. For eye color, one might expect GPT-4o to simply default to brown - the most common globally. Researchers compared accuracy across minority and majority class traits for gender, sexual orientation, eye color, and hair color. All four features exceeded 70% accuracy, and according to the paper, accuracy remained relatively high even for minority class traits such as blue eyes - a result that is difficult to explain under a naive-guessing account. This suggests the model maintains name-specific associations rather than relying purely on population priors.

The researchers note this analysis is exploratory. Context window limitations also matter: the audit used only a person's first and last name, without additional individuating information such as location or professional role. The actual amount of model-generated personal data is expected to increase as more identifying context is provided.

Privacy indifference, but a desire for control

Despite finding that GPT-4o correctly identified personal attributes for nearly half the top guesses, participant reactions were largely muted. According to the paper, 87% of model outputs were not viewed as privacy violations; only 4% were considered violations, with 9% uncertain. Even when GPT-4o generated a correct value, only 5% of outputs were deemed violations. The modal emotional reaction in both standard and daily AI-user samples was "neutral," reported by 68.3% of participants. Confusion (11.1%), surprise (9.2%), and happiness (4.6%) followed.

The pattern of feature selection contributed to this indifference. Participants primarily chose demographic information (28.8%), physical characteristics (23.8%), and origins and geography (21.5%). Sex or gender was the most commonly investigated feature, selected by 34.7% of participants. Medical condition and phone number - which received the highest concern ratings in the pre-study survey - were selected by fewer than 3% of participants when it came to actual testing.

Yet underneath that apparent indifference lies a clear desire for formal control. According to the paper, 72% of participants said they would want LLMs to erase or correct personal data generated about them. Nearly 70% had at least one question about filing a Right to Be Forgotten (RTBF) request. Process questions dominated: 55% asked how to submit, 41% asked about timelines, 38% about costs, and 35% about what evidence is required. This tension - between low concern about specific outputs and high desire for structural control - is one of the paper's central findings.

The GDPR question

The paper engages directly with whether GDPR rights apply to model-generated output. Under GDPR Article 4(1), personal data is any information relating to an identified or identifiable natural person. Since the study used full names as inputs, the resulting associations could qualify as personal data whenever the named individual is identifiable - a condition plausibly met when the input is a full name. Brussels has proposed sweeping GDPR changes that would affect how AI-generated data is classified, adding urgency to the questions the paper raises.

Researchers propose a four-category taxonomy - direct data, indirect data, inferred data, and low-confidence guessed outputs - to help policymakers apply existing rights frameworks more precisely. According to the paper, consistent model-generated outputs about individuals should be treated as personal data whether or not they are factually correct. The right to erasure under Article 17 presents a technical complication, however: memorized or inferred content is embedded in model weights, making selective removal technically difficult. This differs from search engine delisting, which operates on indexed links rather than embedded parameters.

A domain expert - described as a data protection attorney at a renowned EU-based nonprofit - suggested three changes to LMP2 for it to support RTBF filings more effectively: including timestamps and model version identifiers, providing the number of API calls and a stability metric for generated personal data, and supplying verbatim input prompts so users can replicate results in consumer-facing browser versions.

The regulatory context continues to shift. Research published in June 2025 established that LLMs themselves qualify as personal data under European privacy law, creating additional compliance obligations for model developers. The EDPB's April 2025 report on hidden risks of LLMs in marketing contexts outlined eleven categories of privacy risk, several of which overlap directly with the failure modes this paper documents. Meanwhile, the Irish DPC's inquiry into Grok's use of EU user data and the German court's ruling on Meta's use of public profile data define the contested legal terrain in which the LMP2 findings sit.

What the tool does

LMP2 operates as a browser-server system. The frontend - built with Next.js 15, React 19, and Tailwind 4 - handles user interaction. A Flask-based backend executes probes against GPT-4o via an Azure OpenAI endpoint hosted in an EU region, and stores interaction data in a Neon PostgreSQL database also hosted in the EU. The tool's design goals prioritize simplicity, data minimization, and robustness. All ground-truth values entered by users remain in the browser; only two-character prefixes and property identifiers are sent to the backend.
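The data-minimization rule described above can be made concrete with a small client-side sketch. The function name `build_probe_payload` is an assumption for illustration; the point is simply that only a two-character prefix and a property identifier ever leave the browser:

```python
def build_probe_payload(ground_truth: str, property_id: str) -> dict:
    """Client-side data minimization (sketch): only the first two
    characters of the user's ground-truth value and the property
    identifier are sent to the backend; the full value never leaves
    the browser."""
    return {"prefix": ground_truth[:2], "property_id": property_id}

payload = build_probe_payload("Hogwarts", "residence")
# payload == {"prefix": "Ho", "property_id": "residence"}
```

Under this design, the server can score whether the model's completions start with the expected prefix without ever learning what the participant actually entered.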

The three-step user interface begins with a name entry and privacy agreement. Participants then select three features from the 50-item list, organized into eight categories. Results appear as association strength scores - normalized percentages showing the distribution of the model's top candidate values - and a confidence score derived from the concentration of those associations. Confidence is computed as the maximum of two normalized indicators: a skewness-based unevenness measure and a dominance ratio (the leading candidate's share of total association strength). When confidence falls below 15%, the tool displays a "no meaningful associations" message rather than raw predictions.
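A minimal sketch of that confidence calculation might look like the following. The paper specifies the maximum of a skewness-based unevenness measure and a dominance ratio; the exact unevenness normalization is not reproduced here, so a clamped standard-deviation proxy stands in for it:

```python
from statistics import pstdev

def confidence(shares: list[float]) -> float:
    """Sketch of the tool's confidence score: the maximum of two
    normalized indicators - a dominance ratio (the leading candidate's
    share of total association strength) and an unevenness measure.
    The pstdev-based unevenness term is an assumed stand-in for the
    paper's skewness-based indicator."""
    total = sum(shares)
    dominance = max(shares) / total
    unevenness = min(pstdev(s / total for s in shares) * len(shares), 1.0)
    return max(dominance, unevenness)

# A peaked distribution yields high confidence; a flat one stays low.
# Below the 0.15 threshold the tool would display "no meaningful
# associations" instead of raw predictions.
confidence([0.70, 0.15, 0.10, 0.05])   # high: one candidate dominates
confidence([0.25, 0.25, 0.25, 0.25])   # low: no candidate dominates
```

Either indicator alone can trigger a high score: a single dominant candidate lifts the dominance ratio, while a lopsided spread across several candidates lifts the unevenness term.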

Two formative user studies (N=20 total) refined the interface before the main research. Key changes included clarifying instructions for the name field, removing automatic suggestions for numeric features, adding inline tooltips explaining confidence, and replacing free-form emotional feedback with a predefined reaction list.

The tool code is publicly available under CC BY-NC 4.0 at the anonymous repository linked in the paper.

Pre-study: users expect LLMs to know things about them

A preliminary survey study (N=155, mean age 34.57, 66% male, 89% White, all EU residents recruited through Prolific) assessed intuitions before any auditing was conducted. According to the paper, 60% expressed interest in using LMP2 when shown a description of the tool; 20% indicated potential interest and 20% reported disinterest. Notably, interest was not correlated with frequency of LLM use.

Only 4.52% of participants believed LLMs lack the capacity to answer personal questions about them. The most common attributed source was social media posting (67.7%), followed by services sharing data with AI systems (61.9%), and publicly available information (52.9%). A notable 36.8% believed LLMs could accurately guess personal data through patterns.

When asked about sensitivity, participants rated financial information highest (mean 9.4 on a 10-point scale), with 73.4% giving it a maximum score. Location data (mean 8.2), health information (mean 8.1), search history (mean 7.4), and personal relationships (mean 7.1) followed. Physical characteristics (mean 4.8) and demographic information (mean 3.8) rated lower on average.

At the feature level, concern peaked for phone number (mean 3.62), residence (3.52), and medical condition (3.40). By contrast, 50% and 42% of participants expressed no concern at all about generation of physical characteristics and of interests and events, respectively. Concern and interest ratings correlated moderately (ρ=0.53, Kendall's τ=0.47).

Implications for marketing and advertising professionals

The research connects directly to questions that marketing professionals face as they integrate AI tools into audience targeting, personalization, and content workflows. LLMs used in ad-tech systems and customer analytics regularly process personal data - including combinations that create addressable profiles - and the question of what those models associate with named individuals is not merely academic.

The finding that GPT-4o correctly predicts sexual orientation with 82.9% accuracy for everyday users is particularly notable for advertising contexts. Sexual orientation is a special category of personal data under GDPR Article 9, requiring explicit consent for processing. The study does not address whether this inference capability is used in any commercial application, but it demonstrates that the underlying model maintains such associations without guardrails preventing their retrieval via structured API calls. The European Commission's proposed GDPR amendments would potentially expand the legal basis for processing such data for AI development, though under contested conditions.

Conversational agent (CA) providers receive specific recommendations in the paper. They should offer opt-in rather than opt-out mechanisms for data retention, clearly communicate the distinction between application-level memory features and underlying model associations, respect robots.txt and HTML meta tags during web crawling for training data, and acknowledge that privacy obligations persist after a model is replaced or retired.

Summary

Who: Dimitri Staufer (TU Berlin) and Kirsten Morehouse (Columbia University), with 458 EU residents participating in three user studies.

What: An audit of eight large language models - including GPT-4o, GPT-5, Grok-3, and Gemini Flash 2.0 - examining what personal data they associate with individuals' names. The study introduces LMP2, a browser-based auditing tool, and finds that GPT-4o correctly generates 11 personal attributes with at least 60% accuracy for everyday EU users, including eye color, native language, and sexual orientation.

When: The paper was submitted to arXiv on February 19, 2026. User studies were conducted with EU residents recruited through Prolific. Results reflect model behavior at time of testing.

Where: The research was conducted by researchers affiliated with TU Berlin and Columbia University. User studies involved participants from 19 EU countries. GPT-4o API calls were routed through an Azure OpenAI endpoint hosted in an EU region.

Why: Everyday users lack practical tools to understand what LLMs generate about them. Existing auditing methods do not work with commercial black-box APIs, and prior research has not connected technical memorization findings to user perceptions or regulatory implications. The study addresses that gap by both building a usable tool and measuring how EU residents respond to seeing model-generated personal data about themselves - findings that intersect with ongoing GDPR debates about whether data privacy rights should extend to LLM-generated output.
