The Internet Architecture Board seeks input on AI data control mechanisms

Opting out of AI training data: IAB workshop seeks solutions.

The Internet Architecture Board (IAB) yesterday issued a call for papers for a workshop exploring mechanisms that let users control how their data is used by Artificial Intelligence (AI) systems.

Large Language Models (LLMs) and other machine learning techniques are revolutionizing numerous fields. However, these advancements rely heavily on vast amounts of training data, often obtained by "crawling" publicly available content on the internet. This process resembles how search engines gather information.

A key question arises: how can content creators control whether their work is included in the training data for AI systems?

One potential solution under consideration is the Robots Exclusion Protocol (robots.txt). Defined in RFC 9309, robots.txt allows website owners to specify which content automated crawlers should not access. The IAB is investigating whether robots.txt can be adapted for a similar purpose with AI data crawlers.
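As an illustrative sketch of how such an adaptation might look, the rules below keep a site open to general search indexing while asking known AI training crawlers to stay away. The product tokens shown (GPTBot for OpenAI, Google-Extended for Google's AI training systems) are examples those operators document; which tokens a site would actually need to list is exactly the kind of open question the workshop is meant to address.

```
# Keep the site indexable by search engines,
# but opt out of AI training crawlers.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else, including ordinary search crawlers:
User-agent: *
Allow: /
```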

However, there are significant challenges to consider. Robots.txt rules are matched against the product token a crawler chooses to present, so a website owner might not be able to distinguish a search engine crawler from an AI data crawler, particularly when one operator runs a single crawler for both purposes. Additionally, robots.txt may not be well suited to the diverse use cases and content types involved in AI training data.
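To make that trust model concrete, here is a minimal sketch, using Python's standard-library robots.txt parser, of how a well-behaved crawler checks its permissions before fetching a page. The host and crawler names are illustrative; the verdict hinges entirely on the name the crawler presents, and nothing technically prevents it from ignoring the file altogether.

```python
# Minimal sketch: how a compliant crawler consults robots.txt.
# Standard library only; host and agent names are illustrative.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

page = "https://example.com/articles/some-post"

# The answer depends on the user-agent token the crawler
# self-identifies with; there is no enforcement beyond that.
for agent in ("Googlebot", "GPTBot", "UnknownAIBot"):
    verdict = "allowed" if rp.can_fetch(agent, page) else "disallowed"
    print(f"{agent}: {verdict}")
```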

The IAB workshop aims to examine these complexities and explore practical opt-out mechanisms for AI data control. The workshop will focus on how users can communicate their opt-out preferences and on the underlying data models that support those mechanisms. Technical enforcement of opt-out signals will not be a primary focus.

The IAB welcomes position papers from interested participants on a range of topics related to AI data control. These topics include potential use cases for opt-out mechanisms, the strengths and weaknesses of using robots.txt for AI crawling control, and the legal and ethical considerations surrounding opt-out mechanisms.

The IAB workshop is scheduled for a two-day period within the week of September 16, 2024, in the Washington, D.C. area. Specific dates and the exact location will be confirmed soon. Participation in the workshop is by invitation only.

Those interested in participating can submit position papers by August 2, 2024. Submissions can be formatted as Internet-Drafts, text documents, or academic-style papers. Anonymized submissions are possible, but submissions considered relevant will typically be published on the workshop website.

The IAB workshop represents a significant step towards establishing clear guidelines for user control over their data in the evolving world of AI.

Last month, Cloudflare introduced a feature to block AI scrapers and crawlers. Publishers can now prevent artificial intelligence (AI) bots, crawlers, and scrapers from harvesting their website content to train large language models (LLMs) that could reproduce it without permission.