The News/Media Alliance (NMA) yesterday submitted a formal letter to Common Crawl demanding that the web archive organization stop enabling unauthorized use of publisher content by artificial intelligence companies. The letter, dated April 29, 2026, and addressed to Rich Skrenta, Executive Director of the Common Crawl Foundation at 9663 Santa Monica Blvd, #425, Beverly Hills, CA 90210, marks one of the most detailed and direct challenges publishers have yet mounted against the infrastructure layer that has quietly powered some of the largest AI models in the world.

The action was first reported by Bloomberg and follows mounting frustration among publishers who argue that Common Crawl has effectively become a conduit for copyright infringement at scale - what some critics have described as "data laundering."

What Common Crawl is, and why publishers are angry

Common Crawl presents itself as a nonprofit archiving the open web for researchers, academics, and historians. Founded on a mission of democratizing web data access, it has crawled billions of web pages since the early 2010s and has published regular crawls since 2013. Its petabyte-scale archive is freely available for download. That openness, the NMA argues in its letter, is precisely the problem.

According to the letter, Common Crawl "enables the scraped content in its archive to be used by any number of commercial developers irrespective of copyright protections and for purposes that harm publishers' ability to license their content to said commercial developers." The organization acknowledges in its own Mission and Impact statements that its content is used by "entrepreneurs, and developers" to "create . . . applications and services" in fields such as "language processing, search engine optimization, and web analytics."

The financial ties between Common Crawl and AI developers are significant. According to the NMA's letter, in 2024 over 60% of donated funds and at least 8 of 13 donors were directly or closely affiliated with leading generative AI companies or data brokers. More than half of those donations came from three sources: Anthropic, OpenAI, and the Schmidt Foundation, the last of which is closely committed to generative AI development. In 2023, venture capital firm Andreessen Horowitz, which identifies AI as one of its focus areas, made a $100,000 donation. These figures come from Common Crawl's Form 990 filings published on ProPublica.

This funding relationship sits at the center of the NMA's complaint. An organization that portrays itself as a neutral public resource is substantially funded by the same commercial entities that benefit from the training data it makes available.

The specific models trained on Common Crawl

The NMA letter does not leave the AI training connection to inference; it cites documented examples. In 2020, OpenAI used Common Crawl archives to train GPT-3. A subset of Common Crawl data known as "C4" was used by Google to develop its Bard AI assistant, as documented in Google's own research paper on LaMDA (Language Models for Dialog Applications). OpenAI built multiple iterations of its GPT technology using Common Crawl content, as documented in the 2020 paper "Language Models Are Few-Shot Learners." PPC Land has covered this history in depth, including an investigation published in November 2025 that found Common Crawl had supplied millions of paywalled news articles to AI companies despite publisher removal requests.

The November investigation, first published by The Atlantic, revealed that Common Crawl's scraper circumvents paywall mechanisms by never executing the browser code that checks subscription status. Publishers who pay for journalism and depend on subscription revenue to fund it have no effective technical defense under this model. The archive contained content files showing no modifications since 2016, despite publishers having submitted removal requests years earlier.
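The mechanics of that circumvention are simple to illustrate. Many paywalls are client-side: the full article ships in the HTML, and JavaScript running in the reader's browser hides it from non-subscribers. A crawler that parses the raw HTML and never executes that script sees everything. The sketch below is hypothetical (the page, the `isSubscriber` function, and the extractor are illustrative, not Common Crawl's actual code):

```python
from html.parser import HTMLParser

# Hypothetical page: the full article text is present in the HTML, and a
# script that only runs in a real browser is responsible for hiding it.
PAGE = """
<html><body>
<article id="story">Full text of the paywalled article.</article>
<script>
  // Executes only in a browser: hides the story for non-subscribers.
  if (!isSubscriber()) document.getElementById('story').hidden = true;
</script>
</body></html>
"""

class ArticleExtractor(HTMLParser):
    """Collects text inside <article> tags; never executes any script."""
    def __init__(self):
        super().__init__()
        self.in_article = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False

    def handle_data(self, data):
        if self.in_article:
            self.text.append(data)

parser = ArticleExtractor()
parser.feed(PAGE)
print("".join(parser.text).strip())  # Full text of the paywalled article.
```

Because the subscription check lives entirely in the never-executed `<script>` block, the extractor recovers the complete article. This is the structural gap the investigation described: the defense assumes a browser, and the crawler is not one.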

The opt-out registry and its limitations

Common Crawl does operate an opt-out registry - a mechanism allowing publishers to signal that they do not want their content scraped. The NMA acknowledges this in the letter, but argues the registry falls well short of what is needed.

According to the letter, the opt-out registry is "just one of 27 sub-sections identified at the bottom of Common Crawl's website home page." Nothing in that subsection alerts users - the developers downloading and using the archive - not to use registrants' content. At most, Common Crawl states that registrants "have specifically requested to be excluded from our crawls." There is no directive warning that using the content would breach Common Crawl's terms, nor any statement that the content cannot be used for AI purposes.

The robots.txt issue compounds the problem. Common Crawl has stated it will not crawl websites with properly configured robots.txt files. But the NMA letter questions the vagueness of "periodically," the word Common Crawl uses to describe how often it checks for updated robots.txt files. For publishers who added blocking directives after their content was already ingested, the pledge is immaterial. The archive already contains years of their journalism.
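For reference, blocking Common Crawl's crawler prospectively is a two-line robots.txt record targeting its published user agent, CCBot. The snippet below is a minimal sketch using Python's standard-library `urllib.robotparser` to verify such a rule; the robots.txt content and URLs are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: block Common Crawl's
# CCBot entirely while leaving the site open to other crawlers.
robots_txt = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# CCBot is disallowed everywhere; other agents fall through to the
# wildcard record and remain allowed.
print(parser.can_fetch("CCBot", "https://example.com/article"))      # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

As the letter stresses, a rule like this only governs future crawls: it does nothing about content already sitting in the archive, which is why the removal demand is the centerpiece.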

Some NMA members reported sending removal requests over two and a half years ago - with no result. Common Crawl's own terms of use state that "we and our designees shall have the right to remove any Crawled Content at any time and for any reason in our sole discretion." Yet, according to the letter, removal has not followed from this stated authority.

What the NMA is demanding

The NMA's demands are specific and technical. According to the letter, Common Crawl must do four things.

First, upon request of a publisher, Common Crawl must swiftly remove that publisher's content from the archive. Second, Common Crawl must include a clear and conspicuous public statement on its website covering five points: it will remove content on publisher request; it does not own and cannot authorize use of scraped content; it prohibits unauthorized use of the content for AI purposes; it respects the intellectual property rights of publishers; and it will include publisher licensing contact information in the opt-out registry upon request.

Third, Common Crawl must revise its terms of use to explicitly prohibit use of its repository for AI purposes, and to state that violations may result in account restriction or discontinuation.

Fourth, Common Crawl must add a clear directive to the opt-out registry itself - a warning to users that using registrant content for unauthorized purposes, including AI training, is a breach of Common Crawl's terms and that those terms will be enforced.

The letter also includes Exhibit A, a list of publishers requesting to be added to the opt-out registry. The list is substantial. It includes organizations ranging from NBCUniversal and CNN to Boston Globe Media Partners, McClatchy, Vox Media, Ziff Davis, USA Today, Newsday, Seattle Times, TelevisaUnivision, People Inc., and dozens of regional publishers operating under groups such as CherryRoad Media, Emmerich Newspapers, and Trib Total Media. The domain list spans hundreds of individual web properties.

These publishers join those already listed on Common Crawl's opt-out registry, including Advance Publications, The Atlantic, BBC, DMG Media, The Economist, The Guardian, Hearst, News Corp, the New York Times, the Philadelphia Inquirer, the Toronto Star, and The Washington Post.

The indemnity language buried in Common Crawl's terms

One of the more striking elements of the NMA letter concerns what Common Crawl's existing terms of use actually say. According to the letter, Common Crawl obtains full indemnities from users who employ its archive for AI-related purposes. The terms cover the "use of Crawled Content in connection with artificial intelligence, machine learning, or other similar technologies, including, without limitation, large language models and neural networks," as well as the use of crawled content for "developing, training, or deploying AI Systems."

The indemnity clauses also cover "infringement or misappropriation of any third party's patent, trademark, copyright, or trade secret rights, including in connection with Crawled Content, AI Systems, or Generated Content," and "any actions taken by end-users of AI Systems, including creation or use of Generated Content."

The NMA letter argues that this indemnity structure tacitly acknowledges the unlawfulness of such uses. By writing provisions to shield itself from liability arising from AI-related deployments, Common Crawl recognizes that those deployments carry legal risk - risk it has transferred to users rather than addressed through operational policy.

The international dimension

The NMA's letter situates its demands within a broader international context. According to the press release accompanying the letter, the action follows similar requests from the Danish Rights Alliance and the Alliance de la Presse d'Information Générale, both of which had previously asked Common Crawl to remove publisher articles to prevent unauthorized AI use. In December 2024, Common Crawl's attorney told the Danish Rights Alliance that "approximately 50% of this content has been removed" - a claim that prompted skepticism given the investigation findings.

The pattern of international publisher organizing against AI data intermediaries is well established. Publishers rallied against AI scraping at an IAB Tech Lab summit in July 2025, where over 80 media executives convened in New York to develop technical standards that would force AI platforms to respect publisher consent and compensation requirements. That meeting notably did not include OpenAI, Anthropic, or Perplexity.

Danielle Coffey's statement

Danielle Coffey, President and CEO of the News/Media Alliance, issued a direct statement alongside the letter. According to the NMA press release, Coffey said: "Common Crawl is blatantly taking our content without our permission and failing to honor our opt outs to remove content already taken. We encourage them to act like the good actor they claim to be, honor these requests, and make clear to their users that the content they scrape is not authorized for commercial use unless expressly permitted."

The statement draws a line between Common Crawl's self-presentation as a neutral academic resource and its operational reality as a distribution mechanism for commercial AI training data.

Why this matters for the marketing and advertising community

The conflict between publishers and Common Crawl is not abstract to the marketing industry. Publishers fund their journalism - and their advertising inventory - through content that readers and subscribers value. When AI companies train models on that content without compensation, those models can then generate outputs that substitute for the original reporting, reducing the traffic and revenue that publishers depend on to produce new content.

AI crawling data reveals a stark imbalance: Anthropic crawled 38,000 pages for every page visit it referred back to publishers in July 2025, while OpenAI maintained a ratio of 1,091 crawls per referral. Training-related crawling accounted for nearly 80% of all AI bot activity by that point.
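Put differently, those ratios imply vanishingly small referral rates. The quick calculation below is illustrative arithmetic on the reported figures, not additional data:

```python
# Referral efficiency implied by the reported crawl-to-referral ratios
# (Cloudflare figures cited above; the calculation itself is illustrative).
ratios = {"Anthropic": 38_000, "OpenAI": 1_091}

for bot, crawls_per_referral in ratios.items():
    pct = 100 / crawls_per_referral  # share of crawled pages that return a visit
    print(f"{bot}: 1 referral per {crawls_per_referral:,} crawls "
          f"(~{pct:.3f}% of crawled pages send traffic back)")
```

On these numbers, roughly 0.003% of pages Anthropic crawled returned a visit to the publisher, versus about 0.092% for OpenAI - both far below what display advertising or subscription funnels need to sustain the content being crawled.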

The NMA has previously escalated its criticism directly at Google. In May 2025, the organization called Google AI features "the definition of theft" - a statement that captures how publishers now frame AI-driven content use as a structural economic threat rather than a technical policy disagreement.

The broader legal landscape has not been uniformly favorable to publishers. Courts have dismissed several copyright and antitrust cases on standing grounds. A federal court dismissed Raw Story's lawsuit against OpenAI in November 2024, ruling that having content included in an AI training dataset - without proof of a specific harmful use - was insufficient to establish standing under the Digital Millennium Copyright Act.

This creates a structural problem for the NMA's approach. Formal demands and opt-out registries carry limited legal weight if publishers cannot demonstrate concrete harm in court. The letter itself acknowledges this implicitly: it reserves all rights and states that nothing in it constitutes a waiver of any claim. It is, at this stage, a demand letter rather than a legal filing.

Whether Common Crawl responds, and how, will shape the next phase of this dispute. The NMA has framed its demands in terms of what Common Crawl already claims to stand for - transparency, fair use, and the public good. If Common Crawl declines to act on those stated principles, the NMA will face a choice about escalation.

The pressure is coming from multiple directions. Cloudflare launched a pay-per-crawl service in July 2025, giving publishers a technical mechanism to charge AI crawlers for access rather than simply blocking them or accepting uncompensated scraping. Blocking AI crawlers has itself proven insufficient: research published in early 2026 found that publishers who blocked AI crawlers via robots.txt experienced a 23.1% decline in monthly visits without a corresponding reduction in AI citations. Common Crawl's historical archive means that even a publisher who blocks all future scraping cannot remove the years of content already ingested and redistributed.

The NMA's letter to Common Crawl is an attempt to close that historical gap - to force the organization to treat its archive not as an open resource but as a repository of intellectual property that requires active management and respect.

Timeline

  • Early 2010s: Common Crawl begins crawling web pages; it has published regular crawls since 2013
  • 2020: OpenAI uses Common Crawl archives to train GPT-3; Google uses the C4 subset to develop its Bard assistant
  • 2022: GPT-3.5, built on Common Crawl data, becomes the foundation for ChatGPT
  • July 2023: The New York Times requests content removal from Common Crawl
  • July 2024: Danish Rights Alliance initiates removal requests with Common Crawl
  • November 2024: Federal court dismisses Raw Story lawsuit against OpenAI, citing lack of standing under DMCA
  • December 2024: Common Crawl's attorney tells the Danish Rights Alliance that approximately 50% of requested content has been removed; NMA petitions DOJ and FTC over Google's site reputation abuse policy
  • January 2025: Contrasting AI licensing approaches emerge between major AI providers and news publishers
  • May 2025: NMA calls Google AI features "the definition of theft" and demands DOJ intervention - coverage on PPC Land
  • July 2025: Cloudflare launches pay-per-crawl service in private beta, naming CCBot (Common Crawl's crawler) among supported systems
  • July 30, 2025: Over 80 media executives gather in New York under IAB Tech Lab - coverage on PPC Land
  • August 2025: Cloudflare data shows Anthropic crawls 38,000 pages per referred visit; training-related crawling reaches 79% of AI bot traffic - coverage on PPC Land
  • November 4, 2025: The Atlantic publishes investigation into Common Crawl's supply of paywalled content to AI companies - coverage on PPC Land
  • Early 2026: Research finds publishers blocking AI crawlers experienced 23.1% monthly visit decline - coverage on PPC Land
  • March 20, 2026: Judge dismisses Helena World Chronicle antitrust lawsuit against Google - coverage on PPC Land
  • April 29, 2026: NMA submits formal letter to Common Crawl Executive Director Rich Skrenta, demanding removal of publisher content, revised terms of use, and explicit prohibition of AI training use; Exhibit A filed with opt-out registry requests from NBCUniversal, CNN, McClatchy, Vox Media, Ziff Davis, USA Today, and dozens of others
  • April 30, 2026: NMA publishes press release; Bloomberg first reports the action

Summary

Who: The News/Media Alliance (NMA), representing news and magazine publishers in the United States and internationally, addressed its letter to Rich Skrenta, Executive Director of the Common Crawl Foundation. The letter covers member publishers ranging from NBCUniversal and CNN to McClatchy, Vox Media, Ziff Davis, USA Today, Boston Globe Media Partners, and hundreds of regional news outlets.

What: The NMA sent a formal demand letter requiring Common Crawl to swiftly remove publisher content upon request, publish a clear statement that it does not own or authorize use of scraped content, revise its terms of use to explicitly prohibit AI training use of its repository, and add enforceable directives to its opt-out registry. The letter also includes an exhibit with hundreds of domain names requesting opt-out registry registration.

When: The letter is dated April 29, 2026. The NMA published an accompanying press release on April 30, 2026. The action was first reported by Bloomberg.

Where: Common Crawl Foundation is headquartered at 9663 Santa Monica Blvd, #425, Beverly Hills, CA 90210. The NMA is based at 4401 N. Fairfax Dr., Suite 300, Arlington, VA 22203. The archive itself is a globally accessible online resource used by AI developers worldwide.

Why: NMA argues that Common Crawl has strayed from its stated mission as an academic and historical archive to become an effective distribution mechanism for AI training data - enabling commercial AI developers to train large language models on copyrighted news content without authorization or compensation. Publishers argue this harms their ability to license content, fund journalism, and sustain business models that depend on the value of their reporting. The letter cites funding ties between Common Crawl and major AI companies including Anthropic and OpenAI as evidence of the organization's alignment with commercial rather than academic interests.
