Transparency in AI: Can Google address publisher concerns over training data?

A year after promising a public discussion on AI training data, Google faces criticism for allegedly prioritizing a deal with Reddit while neglecting smaller publishers.


One year ago, Google published a blog post titled "A principled approach to evolving choice and control for web content." The post acknowledged publisher concerns about AI companies using their content to train models without permission or compensation. It highlighted the limitations of robots.txt, the standard mechanism site owners use to control crawler access, and promised a "public discussion" on developing a new system.

However, according to industry observers, Google never delivered on that promise. Nate Hake criticized Google this week in a series of tweets, accusing the company of prioritizing a "sweetheart deal" with Reddit over an open discussion with all publishers.

Publisher Concerns: Many publishers, including Reddit, are apprehensive about large AI companies freely using their content for training purposes. They argue for control over their content and seek compensation for its use.

Limitations of Robots.txt: Robots.txt lets website owners tell crawlers which parts of a site not to access, but it is an ineffective tool for controlling access to AI training data. Here's why:

  • Unidentified Scrapers: Many AI crawlers don't identify themselves by a known user-agent, so publishers can't know whom to block.
  • Inverted Responsibility: Robots.txt places the onus on publishers to prevent infringement, rather than requiring AI companies to seek permission.
  • Content Duplication: Blocking an AI scraper won't prevent it from accessing copies of the content hosted elsewhere online.
  • Ignored Restrictions: Some AI companies might disregard robots.txt directives altogether.
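The first limitation is easy to see in practice. Here's a minimal sketch using Python's standard urllib.robotparser, with a hypothetical robots.txt that blocks one known AI crawler by name (GPTBot is used as an example user-agent); any scraper not matched by name falls through to the permissive default rule:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named AI crawler,
# allows everything else by default.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named scraper is blocked...
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
# ...but an unidentified scraper slips through the default rule.
print(parser.can_fetch("UnknownAIBot", "https://example.com/article"))  # True
```

Blocking is only as good as the publisher's list of crawler names, and even then compliance with robots.txt is voluntary on the crawler's part.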

"Opt-In" System Proposed: Industry experts believe a workable solution lies in an "opt-in" system. Here, AI companies would openly offer publishers compensation for the right to train on their content.

Google's Alleged Backroom Deal: Critics suspect Google prioritized a private agreement with Reddit, granting Reddit control over, and compensation for, the use of its content in AI training, while neglecting smaller publishers.

Lack of Transparency: Despite a year of significant investment in AI, Google has not facilitated a substantive public discussion with smaller publishers about AI training consent, control, and compensation.

This situation raises concerns about transparency and fairness in how Google deals with AI training data access. Smaller publishers feel left behind in crucial discussions that could significantly impact their revenue and online presence.