On March 19, Czech online publishers gained a more detailed technical framework to protect their content from artificial intelligence systems - one that extends well beyond training data to cover the real-time AI responses increasingly siphoning traffic from news sites across Europe.

The Sdružení pro internetový rozvoj (SPIR), the Association for Internet Development in the Czech Republic, published on March 19, 2026, an updated unified standard enabling website operators to declare an opt-out from the text and data mining (TDM) exception under EU copyright law. The update, developed in collaboration with the Asociace online vydavatelů (AOV), the Česká unie vydavatelů (ČUV) and the Správce licenčních práv vydavatelů (SLPV), replaces a draft standard that SPIR first issued on July 7, 2023.

The revision reflects how fundamentally the AI landscape has shifted in less than three years. Back in 2023, the concern was primarily training data - the vast corpora of text fed to large language models during their development. Today, according to SPIR, the scope of automated content extraction has grown to include data used for so-called real-time responses: AI assistants, online summarisation tools, and retrieval-augmented generation (RAG) systems that pull live content from the web to answer user queries on the fly.

Two directives, two distinct use cases

At the heart of the update lies a two-tier technical structure, each tier expressed through specific directives in the robots.txt file - the protocol that website operators use to communicate instructions to automated crawlers. The standard draws a precise line between two scenarios that, until now, many publishers may have struggled to address separately.

The first tier targets AI training. According to the document, operators who do not want their copyright-protected content used for training general AI models - including large language models - or for building datasets toward that end can add the following to their robots.txt file:

User-agent: MachineLearning
Disallow: /

The second tier is broader. It addresses both training and real-time usage simultaneously. Publishers who also wish to prevent AI systems from using their content for live inference - such as AI assistants generating answers from crawled content in real time - can instead apply:

User-agent: AI
Disallow: /

Both directives cover all content on a given domain. Crucially, according to SPIR, neither setting affects standard web search indexing by conventional search engine crawlers - unless those crawlers are operating in an AI mode, such as Google's AI Overviews. That distinction matters considerably. Publishers blocking AI crawlers via robots.txt saw total traffic drop by 23% and human traffic fall by 14%, according to research published December 31, 2025, by Rutgers Business School and The Wharton School.

Priority hierarchy and platform-specific controls

The standard also establishes a clear priority hierarchy among the directives. According to SPIR, directives for specific user agents take precedence over those set for the MachineLearning user agent, and MachineLearning directives in turn take precedence over those set for the AI user agent. This means operators can layer their settings: a broad AI-level block can be refined or overridden by platform-specific entries.

SPIR explicitly notes that operators can subsequently permit or prohibit specific AI platforms using their individual user agent identifiers. Examples given in the standard include Apple-Extended, Google-Extended, Perplexity-User, Seznam-Extended, and OpenAI's crawlers, among others. That granularity is significant. Google-Extended, for instance, allows publishers to control whether their content feeds into future Gemini model training - though questions persist about whether such controls adequately address AI Overviews participation.
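How the layers interact in practice can be sketched with a hypothetical configuration - this example is illustrative, not taken from the standard, and uses Google-Extended purely because it is one of the identifiers SPIR names:

# Broadest layer: opt out of both AI training and real-time AI use
User-agent: AI
Disallow: /

# A named crawler's own entry takes precedence over the generic layer above,
# so this specific platform is permitted despite the blanket block
User-agent: Google-Extended
Allow: /

Under the standard's priority rules, the specific Google-Extended entry overrides the generic AI entry, leaving everything else blocked.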

To reduce ambiguity, SPIR also recommends adding a plain-text comment to the robots.txt file explaining the directives, along with a contact address for licensing negotiations. The recommended comment, translated from Czech, reads approximately: "The 'User-agent: MachineLearning' and 'User-agent: AI' settings are tools of the SPIR unified standard for automated text and data mining from this website, particularly within the meaning of Article 4 of Directive 2019/790/EU."
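Putting the directive and the recommended comment together, a minimal robots.txt following SPIR's recommendations might look as follows - the licensing contact line is a placeholder of this article's devising, not part of the standard:

# The 'User-agent: MachineLearning' and 'User-agent: AI' settings are tools of
# the SPIR unified standard for automated text and data mining from this website,
# particularly within the meaning of Article 4 of Directive 2019/790/EU.
# Licensing enquiries: licensing@example.com

User-agent: AI
Disallow: /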

The legal grounding: Article 4 of the EU Copyright Directive

The standard is explicitly grounded in Article 4, paragraph 3 of Directive 2019/790 of the European Parliament and of the Council - the EU's 2019 Copyright in the Digital Single Market directive. That provision allows rights holders to reserve their content from TDM by machine-readable means, effectively making the robots.txt opt-out legally meaningful under European law.

SPIR's earlier 2023 recommendation was conceived in direct response to the emergence of AI as a significant commercial force. Its stated goal at the time was to create a more transparent and predictable commercial environment for AI developers, website operators, and authors alike. That framing, notably, places AI developers within the framework as potential counterparties - not adversaries - in future licensing negotiations. When an operator applies the standard, SPIR says they are clearly declaring an opt-out from the TDM exception and signalling willingness to negotiate compensation with AI platforms for the use of their copyright-protected content.

The robots.txt standard: older than it looks

The Robots Exclusion Protocol, which underpins the entire framework, predates the AI era by decades. According to SPIR, the protocol originated before the year 2000. A proposal to formalise it was submitted to the IETF in 2019, and the result was approved as RFC 9309 in 2022. Despite its age, it has become the de facto mechanism through which publishers worldwide attempt to manage crawler access - a role it was never specifically designed for.

That improvised role is a source of ongoing tension. The standard operates on voluntary compliance. Crawlers can ignore robots.txt directives without legal consequence in most jurisdictions, which is precisely why enforcement tools and legal frameworks have grown alongside the technical standard. Cloudflare launched Robotcop in December 2024 specifically to convert robots.txt declarations into active Web Application Firewall rules, enforcing them at the network level rather than relying on crawler goodwill. Research from Kim et al. (2025) further showed that compliance falls with stricter robots.txt directives, and that some AI-related crawlers rarely check these files at all.

Anthropic, for its part, clarified in February 2026 how its three crawlers - ClaudeBot for model training, Claude-User for real-time queries, and Claude-SearchBot for search quality - respond to robots.txt directives, and committed not to bypass CAPTCHAs. Whether documented commitments translate reliably into crawler behaviour has remained a point of contention. Reddit's lawsuit against Anthropic, filed June 4, 2025, alleged the company continued accessing its platform more than 100,000 times after publicly claiming it had stopped.

From training data to real-time extraction: why the scope expansion matters

The extension of SPIR's standard to cover real-time AI responses is arguably its most significant technical update. Retrieval-augmented generation (RAG) systems - those that pull current information from the open web to produce answers - have grown rapidly in prominence since 2023. Unlike static training, RAG involves live crawling at the moment a user asks a question. A news article published this morning can be ingested and summarised by an AI assistant within hours, generating responses for users who never visit the original publication.

This dynamic sits at the centre of the commercial conflict between publishers and AI platforms. Over 80 media executives gathered under the IAB Tech Lab banner in late July 2025 to address systematic content extraction by AI platforms - with OpenAI, Anthropic, and Perplexity notably absent from the room. The gathering aimed to develop an LLM Content Ingest API that would formalise publisher consent, attribution, and compensation into a binding technical framework.

The scale of the economic harm being alleged is substantial. According to IAB Tech Lab analysis, AI-driven search summaries reduce publisher traffic by 20% to 60% on average, with niche websites experiencing losses as high as 90%. The organisation estimates publishers collectively face $2 billion in annual revenue losses from AI-driven search features.

Czech context within the European landscape

SPIR's update does not exist in isolation. Czech media has been navigating AI-related pressures alongside broader European regulatory developments. The same association coordinated a separate self-regulatory initiative in August 2025, when ten major Czech media organisations announced a ten-point framework aligned with the European Media Freedom Act, which entered force on August 8, 2025. SPIR has also been active on the political advertising front, seeking a developer for a centralised political advertising transparency system in February 2026.

The European Commission launched a formal antitrust investigation on December 9, 2025, into whether Google violated EU competition rules by using publisher and YouTube content for AI purposes without compensation or viable opt-out mechanisms. At the same time, the UK's Competition and Markets Authority proposed in January 2026 that Google provide publishers the ability to opt out of AI Overviews without losing search visibility - a change Google's own executives described in February 2026 as a "huge engineering project."

In that broader context, SPIR's updated standard represents one national association's concrete response to a problem that regulators in Brussels and London are still working to resolve through formal legal mechanisms. The Czech framework offers something regulators cannot yet guarantee: an immediately implementable, machine-readable signal that a publisher does not consent to AI extraction of their content.

What the standard does not cover

SPIR is explicit that the new standard applies specifically to extraction via internet crawlers, not to other forms of data mining. Publishers seeking to restrict content reuse through different technical means - such as textual notices in page footers or use of the TDM Reservation Protocol - are not precluded from doing so. However, SPIR cautions that any alternative methods should be applied consistently with the robots.txt settings to avoid ambiguity or contradictions.

The standard also notes that top websites have increasingly moved to block AI crawlers through their own configurations in recent years. By August 2024, 35.7% of the top 1,000 global websites were blocking OpenAI's GPTBot - a sevenfold increase from the 5% blocking rate when that crawler launched in August 2023. CCBot was blocked by 22.1% of top sites; Google-Extended by 13.6%.

Why this matters for marketing and advertising professionals

Publishers are not the only stakeholders watching these developments. For the marketing and advertising community, the growing fragmentation of content access policies across territories and platforms introduces meaningful complexity. A Czech news site applying SPIR's User-agent: AI / Disallow: / directive is, in effect, opting out of AI-powered ad environments that rely on real-time content context. Contextual targeting systems powered by RAG or live content APIs would find access restricted where publisher consent has not been negotiated.

OpenAI revised its ChatGPT crawler documentation in December 2025, separating the roles of its training and search crawlers. Anthropic's February 2026 clarification similarly distinguished between its model training and user-query bots. These distinctions map directly onto the two tiers SPIR has now formalised: a training-only opt-out versus a full opt-out covering live inference as well. Publishers using SPIR's framework can now align their robots.txt configuration with the specific crawlers they wish to block, using the platform-specific user agent identifiers that major AI companies have begun publishing.
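As a purely illustrative sketch of that alignment, a publisher who objects to live inference but tolerates model training could leave SPIR's generic tiers unset and target only the real-time user agents by name - Claude-User and Perplexity-User being two of the identifiers already documented by their operators:

# Block only the crawlers that fetch content for real-time answers,
# leaving training crawlers unaffected
User-agent: Claude-User
Disallow: /

User-agent: Perplexity-User
Disallow: /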

Whether AI platforms honour these declarations consistently remains an open question. The SPIR standard does not create an enforcement mechanism - that remains the province of regulation and litigation. What it does create is a clear, standardised, and legally grounded signal that Czech publishers can use to assert their rights under the EU Copyright Directive, and a foundation for future licensing negotiations with AI platforms that increasingly depend on publisher content to function.

Timeline

  • Pre-2000: Robots Exclusion Protocol (robots.txt) first emerges as an informal standard for communicating with web crawlers.
  • 2019: Proposal to formalise the robots.txt protocol submitted to the IETF, the basis for what became RFC 9309.
  • 2019: EU Directive 2019/790 on copyright in the Digital Single Market adopted, including Article 4's TDM opt-out provision.
  • 2022: RFC 9309 officially approved as the formal robots.txt standard.
  • July 7, 2023: SPIR publishes its first standardised draft for Czech publishers to opt out of AI content extraction via robots.txt.
  • August 7, 2023: OpenAI announces GPTBot; major sites including Amazon, The New York Times, and CNN begin blocking it within two weeks. Coverage: PPC Land
  • June 29, 2024: Cloudflare introduces a feature to block AI scrapers and crawlers. Coverage: PPC Land
  • August 2024: 35.7% of top 1,000 global websites block GPTBot, a sevenfold increase from August 2023.
  • August 2024: Publisher traffic begins declining measurably, 13.2 months after ChatGPT's launch, according to the Rutgers/Wharton research.
  • December 10, 2024: Cloudflare launches Robotcop to enforce robots.txt policies at the network level. Coverage: PPC Land
  • March 9, 2025: Google updates robots meta tag documentation to include AI Mode. Coverage: PPC Land
  • June 30, 2025: Independent Publishers Alliance files antitrust complaint with the European Commission targeting Google's AI Overviews. Coverage: PPC Land
  • July 1, 2025: Cloudflare launches pay-per-crawl service for content creators.
  • July 30–August 3, 2025: Over 80 media executives gather at IAB Tech Lab in New York to address AI content scraping. Coverage: PPC Land
  • August 8, 2025: Ten Czech media organisations announce a ten-point self-regulatory framework aligned with the European Media Freedom Act. Coverage: PPC Land
  • August 21–30, 2025: Amazon updates its robots.txt to block AI crawlers from Meta, Google, Huawei, and others. Coverage: PPC Land
  • September 30, 2025: UK CMA designates Google with Strategic Market Status after a nine-month investigation.
  • October 10, 2025: Google VP Robby Stein confronted publicly over publisher opt-out gaps for AI Overviews. Coverage: PPC Land
  • December 9, 2025: European Commission launches formal antitrust investigation into Google's AI content practices. Coverage: PPC Land
  • December 9, 2025: OpenAI revises ChatGPT crawler documentation, separating training and search bots. Coverage: PPC Land
  • December 31, 2025: Rutgers Business School and The Wharton School publish research showing publishers who blocked AI crawlers lost 23% of total traffic. Coverage: PPC Land
  • February 10, 2026: SPIR announces public tender for a political advertising transparency system. Coverage: PPC Land
  • February 11, 2026: Google executive calls letting publishers skip AI Overviews without losing search a "huge engineering project." Coverage: PPC Land
  • February 25, 2026: Anthropic clarifies the roles and blocking mechanisms for its three web crawlers. Coverage: PPC Land
  • March 19, 2026: SPIR, in collaboration with AOV, ČUV, and SLPV, publishes updated unified standard for Czech online publishers to opt out of AI text and data mining via robots.txt, extending scope to real-time AI responses.

Summary

Who: The Sdružení pro internetový rozvoj (SPIR) - Association for Internet Development in the Czech Republic - acting jointly with the Asociace online vydavatelů (AOV), the Česká unie vydavatelů (ČUV), and the Správce licenčních práv vydavatelů (SLPV), representing Czech online publishers.

What: An updated unified technical standard enabling Czech website operators to declare an opt-out from the EU's text and data mining (TDM) exception via the robots.txt file. The update introduces two specific directives - User-agent: MachineLearning for training-only opt-outs, and User-agent: AI for a broader opt-out covering both training and real-time AI inference. It replaces SPIR's first draft, published July 7, 2023, and extends the scope to cover retrieval-augmented generation and similar real-time AI response systems.

When: The updated standard was published on March 19, 2026. The original draft it replaces dated from July 7, 2023.

Where: The standard applies to website operators in the Czech Republic, expressed through the machine-readable robots.txt file - a globally understood protocol that AI crawlers access at the root of any domain. It is grounded in EU Directive 2019/790 and is designed for pan-European and international applicability where equivalent national laws exist.

Why: Since 2023, the scope of AI content extraction has expanded from model training data to real-time content usage through AI assistants and summarisation tools. Czech publishers, alongside counterparts across Europe, face significant traffic and revenue losses from AI systems using their content without compensation. The update provides a standardised, legally grounded mechanism for publishers to assert their rights under EU copyright law and signals willingness to negotiate licensing terms - at a moment when enforcement by regulation remains incomplete and litigation is ongoing.
