Study: large language models qualify as personal data
Researchers publish analysis calling for GDPR compliance throughout AI development lifecycle.

A research paper published on June 18, 2025, argues that large language models qualify as personal data under European Union privacy regulations, triggering data protection obligations throughout the entire development lifecycle. According to researchers from the University of Tübingen, machine learning practitioners must acknowledge the legal implications of developing, training, and distributing AI models on platforms including GitHub and Hugging Face.
The 15-page academic paper, authored by Henrik Nolte, Michèle Finck, and Kristof Meding, demonstrates that LLMs memorize training data to varying degrees, even when memorization represents only a small fraction of the total dataset. "Most Large Language Models (LLMs) memorize training data to some extent," the authors state in their analysis. The research indicates that LLMs can memorize between 0.1 and 10 percent of their training data verbatim.
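Verbatim memorization of the kind the paper describes is typically measured by feeding a model a prefix from its training corpus and checking whether greedy decoding reproduces the original continuation. The following minimal sketch illustrates that test using the Hugging Face transformers library; the model choice and the example string are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch: test whether a model reproduces a training-set string verbatim.
# The model (gpt2) and the example strings are placeholders, not from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def is_memorized(prefix: str, expected_suffix: str) -> bool:
    """Greedy-decode a continuation and compare it to the known suffix."""
    inputs = tokenizer(prefix, return_tensors="pt")
    suffix_len = len(tokenizer(expected_suffix)["input_ids"])
    outputs = model.generate(
        **inputs,
        max_new_tokens=suffix_len,
        do_sample=False,  # greedy decoding: deterministic continuation
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return continuation.strip().startswith(expected_suffix.strip())

# Hypothetical training snippet; a match would count as verbatim memorization.
print(is_memorized("The quick brown fox", "jumps over the lazy dog"))
```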
The researchers present empirical evidence demonstrating personal data storage within LLM parameters through controlled experiments using GPT-J 6B. Their technical demonstration shows how models retain specific factual information, including personal details, which can be accessed through targeted queries and modified using editing techniques like MEMIT (Mass-Editing Memory in a Transformer).
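The probing half of such an experiment can be reproduced in outline with standard tooling: load the model and inspect its next-token prediction for a targeted query. In the sketch below, only the GPT-J 6B model matches the paper's setup; the prompt and name are hypothetical, and the MEMIT editing step is omitted because it depends on the authors' research code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-J 6B is the model the researchers probed; loading it needs ~24 GB of memory.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

prompt = "Jane Doe lives in the city of"  # hypothetical data subject
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The most likely next token shows what the parameters associate with the name.
top_token = int(logits[0, -1].argmax())
print(tokenizer.decode([top_token]))
```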
"LLMs store training data in parameters through an interconnected and overlapping architecture," according to the paper. This complex structure makes direct access difficult but does not eliminate the possibility of data retrieval. The research challenges arguments from data protection authorities who claim AI models cannot be classified as personal data because information is transformed into abstract mathematical representations.
Several European data protection authorities have previously argued against classifying LLMs as personal data. The Hamburg Data Protection Commissioner and Danish Data Protection Authority asserted that models do not contain personal data because training information becomes abstract probability weights. However, the researchers counter that "the format in which information is encoded is irrelevant when determining whether it qualifies as personal data" under GDPR provisions.
The European Data Protection Board has issued guidance stating that AI models trained on personal data cannot automatically be considered anonymous, requiring case-by-case assessment by authorities. This position aligns with the researchers' arguments about memorization creating data protection obligations.
GDPR obligations extend beyond training phase
When LLMs qualify as personal data, developers and distributors face comprehensive legal obligations extending well beyond the initial training phase. Every action performed on a model containing personal data constitutes processing under GDPR definitions, including training, uploading, downloading, storing, deploying, fine-tuning, and sharing models.
"Accordingly, each entity that processes personal data is independently responsible for ensuring that its specific processing activity complies with the GDPR," the researchers emphasize. This means individual researchers, universities, or companies distributing models may face separate liability for their processing activities.
The research identifies three critical compliance areas often overlooked by machine learning practitioners. First, processing requires a legal basis, whether consent, contract, or legitimate interest. Second, data subjects possess rights to access, rectification, and erasure of their personal information within LLMs. Third, developers must implement privacy by design principles and conduct data protection impact assessments for high-risk processing activities.
Data subject rights present particular implementation challenges. The right of access requires controllers to provide individuals with information about personal data stored in models, despite the technical difficulty of extracting specific information from neural network parameters. "They could either use structured prompts to query the model for potentially relevant personal data, or ask data subjects to suggest their own prompts," the authors propose as potential compliance approaches.
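A hedged sketch of the first approach: generate a battery of structured prompts about the requesting individual and collect the model's completions for human review. The prompt templates and the query_model callable below are illustrative placeholders, not drawn from the paper.

```python
from typing import Callable

# Illustrative templates probing for common categories of personal data.
PROMPT_TEMPLATES = [
    "{name}'s date of birth is",
    "{name}'s home address is",
    "The email address of {name} is",
]

def access_request_report(name: str, query_model: Callable[[str], str]) -> dict:
    """Collect model completions for each structured prompt about one person."""
    report = {}
    for template in PROMPT_TEMPLATES:
        prompt = template.format(name=name)
        report[prompt] = query_model(prompt)
    return report
```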
Financial penalties reach millions for violations
Non-compliance with GDPR requirements carries substantial financial consequences for researchers and organizations. Serious violations can result in fines reaching 20 million euros or 4 percent of total worldwide annual turnover, whichever amount is higher. Less serious breaches still trigger penalties up to 10 million euros or 2 percent of global turnover.
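The two-tier cap is a simple maximum of a fixed amount and a turnover share, as this illustrative calculation shows (the thresholds follow GDPR Article 83; the function itself is ours):

```python
def max_fine_eur(annual_turnover_eur: float, serious: bool = True) -> float:
    """Return the applicable GDPR fine cap: a fixed amount or a share of
    worldwide annual turnover, whichever is higher (Article 83)."""
    fixed = 20_000_000 if serious else 10_000_000
    share = (0.04 if serious else 0.02) * annual_turnover_eur
    return max(fixed, share)

# A company with 2 billion euros in turnover faces fines up to 80 million euros.
print(max_fine_eur(2_000_000_000))  # 80000000.0
```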
These enforcement actions represent real risks rather than theoretical possibilities. The research notes that individual researchers have faced penalties, while at least 35 universities have received GDPR fines for data protection violations. Recent enforcement activity demonstrates increasing regulatory scrutiny of AI development practices across European jurisdictions.
Stockholm courts upheld a 5.4 million euro penalty against Spotify for inadequate responses to data access requests, highlighting transparency obligations relevant to LLM developers. Dutch authorities have established comprehensive AI guidance requiring lawful data sourcing for machine learning applications, underscoring the growing regulatory focus on AI compliance.
Technical solutions emerge for compliance challenges
The researchers propose several technical approaches to the GDPR compliance challenges in LLM development. Privacy-preserving frameworks could encapsulate models within protective layers that filter inputs and outputs to minimize personal data exposure, as sketched below. Differential privacy methods may prevent model users from accessing personal data while preserving functionality.
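A minimal sketch of such a protective layer, assuming a regex-based redaction step stands in for a production-grade PII detector and model_call stands in for any text-generation backend:

```python
import re
from typing import Callable

# Illustrative patterns only; real systems would use a dedicated PII detector.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email addresses
    re.compile(r"\b\+?\d[\d\s()-]{7,}\d\b"),      # phone-like number strings
]

def redact(text: str) -> str:
    """Replace anything matching a PII pattern with a placeholder."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def guarded_generate(model_call: Callable[[str], str], prompt: str) -> str:
    """Filter the prompt before it reaches the model, and the model's output
    before it reaches the user."""
    return redact(model_call(redact(prompt)))
```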
Machine unlearning techniques show promise for implementing data subjects' right to erasure, though current approaches face significant limitations. The research demonstrates editing methods using MEMIT to modify stored facts, but notes that "while unlearning tools in the context of LLMs pose a promising direction, there are still severe challenges."
German data protection authorities have published comprehensive guidelines establishing technical requirements for AI systems across their complete development lifecycle. These frameworks provide practical guidance for implementing GDPR-compliant AI development practices.
Marketing implications drive compliance priorities
The research findings carry significant implications for marketing technology providers and advertising platforms increasingly deploying AI systems. Marketing organizations face growing scrutiny over data collection practices, with authorities examining automated decision-making frameworks and targeted advertising compliance.
Recent regulatory developments show heightened enforcement activity targeting transparency failures and consent mechanisms in marketing technology. The European Data Protection Board has issued guidance clarifying how data protection rules apply to AI models, addressing key questions around anonymity and lawful processing.
According to industry research cited in regulatory documents, 36 percent of occupations now use artificial intelligence for at least 25 percent of associated tasks, with computer-related tasks showing the highest usage rates. Marketing departments implementing AI solutions must carefully document their data processing activities and evaluate potential risks to individual rights.
The regulatory environment continues evolving as marketing leaders seek adaptive AI solutions while navigating compliance requirements. Industry surveys indicate 28 percent of marketing leaders identify implementing AI and machine learning technology as their top priority for driving impactful marketing outcomes.
Research community must adapt development practices
The University of Tübingen researchers argue that practical implementation difficulties do not negate legal applicability of GDPR provisions. "Reducing the scope of the GDPR based on unforeseen or conflicting interests would contradict its comprehensive protective purpose and intended technology neutrality," they contend.
The paper calls for collaborative efforts between machine learning researchers and legal experts to develop workable compliance approaches. While acknowledging the challenges of implementing data subject rights in LLM contexts, the authors emphasize that legal obligations remain regardless of technical complexity.
European regulatory activity demonstrates increasing focus on AI governance, with multiple jurisdictions developing specific frameworks for AI system compliance. The Dutch Data Protection Authority published comprehensive consultation on GDPR preconditions for generative AI, while the European Data Protection Board coordinates enforcement across member states.
Future court decisions will likely provide additional clarity on LLM classification and compliance requirements. Pending cases, including complaints against OpenAI alleging GDPR violations due to failure to ensure personal data accuracy and provide transparency about data sources, may establish important precedents for the industry.
The researchers conclude that acknowledging legal implications enables the ML community to better engage with policymakers and influence future legislation. Joint efforts between computer science and law remain necessary to address the complex challenges posed by AI technologies for data protection compliance while supporting continued innovation in the field.
Timeline
- June 18, 2025: University of Tübingen researchers publish "Machine Learners Should Acknowledge the Legal Implications of Large Language Models as Personal Data"
- June 2025: German authorities issue comprehensive AI development guidelines establishing technical requirements for AI systems
- May 23, 2025: Dutch Data Protection Authority releases comprehensive consultation on GDPR preconditions for generative AI
- May 23, 2025: German court allows Meta's AI training with public data, ruling use of public profiles lawful under GDPR
- April 13, 2025: Irish DPC launches Grok LLM training inquiry investigating X's use of EU user data for AI training
- December 18, 2024: European Data Protection Board clarifies privacy rules for artificial intelligence models
- May 6, 2024: German Data Protection Authorities issue first guidelines on AI and Privacy
Key Terms Explained
GDPR (General Data Protection Regulation): The European Union's comprehensive data protection law that governs how personal data must be processed, stored, and protected. Under GDPR, any information relating to an identified or identifiable natural person qualifies as personal data, regardless of format or storage method. The regulation applies to organizations processing EU resident data globally, with violations carrying fines up to 20 million euros or 4 percent of worldwide annual turnover.
Large Language Models (LLMs): Artificial intelligence systems trained on vast text datasets to generate human-like responses and perform language tasks. These models learn patterns and information from training data through neural network parameters, often memorizing portions of the original content. Examples include GPT models, ChatGPT, and other transformer-based architectures that power conversational AI applications.
Personal Data: Any information relating to an identified or identifiable natural person under GDPR definitions. This includes names, addresses, online identifiers, and behavioral patterns that could be linked to individuals. The research demonstrates that when LLMs memorize personal information from training datasets, the models themselves become personal data requiring regulatory compliance.
Data Protection: Legal and technical measures designed to safeguard individual privacy rights regarding personal information processing. Data protection encompasses principles like data minimization, purpose limitation, accuracy, and security. Organizations must implement appropriate safeguards throughout data lifecycles, from collection through deletion, while ensuring individuals can exercise their rights.
Machine Learning: Computational techniques enabling systems to learn patterns from data without explicit programming for specific tasks. Machine learning algorithms power LLMs by identifying relationships in training datasets and encoding knowledge in model parameters. The memorization capabilities that enable effective learning also create data protection challenges when personal information becomes embedded in models.
Data Processing: Any operation performed on personal data under GDPR definitions, including collection, organization, storage, retrieval, use, disclosure, and erasure. When LLMs contain personal data, activities like training, uploading, downloading, fine-tuning, or sharing models constitute processing requiring legal basis and compliance measures.
Memorization: The phenomenon where LLMs retain verbatim or near-verbatim information from training datasets within their parameters. Research shows models can memorize 0.1 to 10 percent of training data, with larger models and frequently repeated information showing higher memorization rates. This capability enables factual knowledge retention but creates privacy risks when personal data is memorized.
Data Subjects: Individuals whose personal data is processed under GDPR frameworks, possessing specific rights including access, rectification, erasure, and objection to processing. When LLMs memorize personal information, affected individuals become data subjects with rights to know what information is stored and request its removal from models.
Training Data: The datasets used to teach machine learning models, often containing billions of text examples scraped from internet sources. Training data for LLMs typically includes web pages, books, articles, and other publicly available content that may contain personal information about identifiable individuals, creating the foundation for data protection obligations.
Compliance: The process of adhering to legal requirements and regulatory standards, particularly GDPR provisions for organizations processing personal data. AI compliance involves implementing technical and organizational measures to protect individual rights, conducting impact assessments, establishing legal bases for processing, and enabling data subject rights throughout model development and deployment phases.
Summary
Who: Researchers Henrik Nolte, Michèle Finck, and Kristof Meding from the University of Tübingen published academic research addressing the machine learning community, data protection authorities, and AI developers.
What: A comprehensive legal analysis establishing that large language models qualify as personal data under GDPR when they memorize training information, triggering data protection obligations throughout the AI development lifecycle including data subject rights to access, rectification, and erasure.
When: The research paper was published on June 18, 2025, amid increasing European regulatory scrutiny of AI development practices and growing enforcement activity across multiple jurisdictions.
Where: The analysis focuses on European Union data protection law, particularly GDPR compliance requirements, with implications for organizations processing EU resident data regardless of their geographical location.
Why: Machine learning practitioners often overlook legal implications of data memorization in AI models, creating compliance risks and potential financial penalties while requiring technical solutions to balance innovation with privacy protection requirements.