Apple this week formalized how its web crawler feeds directly into Siri, Apple Intelligence, and foundation model training - opening a new battleground between publishers and the $3 trillion company over who controls the web's raw material.
Apple expands Applebot's mandate into AI
Apple today published significant updates to its "About Applebot" documentation, dated June 8, 2026, extending the crawler's remit well beyond its original search indexing role. The changes formalize how crawled web data flows into Apple's generative AI products - including Siri, Apple Intelligence, and what Apple calls its foundation models powering Developer Tools and Services. The update arrives the same day Apple confirmed at WWDC26 that Siri AI now runs on Google Gemini models, binding two of the world's largest technology companies through an AI infrastructure arrangement with direct consequences for publishers and marketers.
For anyone operating a website, the documentation revision is not an abstract policy update. It is a technical specification that defines the rules under which Apple's systems can read, store, and use web content to train AI or generate AI-driven answers. Understanding those rules - and the opt-out mechanisms available - is now a practical SEO and content governance question.
What the updated documentation says
The core addition to the Applebot documentation states that crawled data "may also be used to help train Apple foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools." This language is new. Previous versions of the documentation focused on search indexing for Spotlight, Siri suggestions, and Safari. Now the scope explicitly covers generative AI training.
A second, equally significant addition concerns real-time AI output. According to the updated documentation, Applebot crawled data "may be used to provide additional context and up-to-date content when AI models are used to generate output for display in Apple products and services." The documentation gives an example: answering broad world knowledge questions in Siri and Search, where the output "may include links to sources and websites used to help generate the answer." That is a retrieval-augmented generation architecture, and the documentation describes it plainly.
These two use cases - AI training and real-time AI retrieval - are distinct. Publishers can address each one using different technical mechanisms. The documentation, notably, describes both the mechanisms and their limits.
The opt-out architecture: two separate levers
Apple provides publishers with two separate opt-out paths, and the documentation is precise about what each one does and does not do.
Applebot-Extended is a secondary user agent that does not crawl webpages. It is used solely to determine how data already crawled by Applebot can be used. Publishers who want to prevent their content from being used to train Apple's foundation models can add a disallow rule targeting Applebot-Extended in their robots.txt file. The directive syntax, as specified by Apple, is shown below scoped to a /private/ directory; under standard robots.txt rules, a site-wide opt-out would use Disallow: / instead:
User-agent: Applebot-Extended
Disallow: /private/
Disallowing Applebot-Extended does not affect search indexing. According to the documentation, pages that disallow Applebot-Extended "can still be included in search results" through Spotlight, Siri, and Safari. The separation is deliberate: Apple's search product and its AI training pipeline are governed by distinct robots.txt signals.
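The effect of the rule can be checked before deployment with Python's standard-library urllib.robotparser. The sketch below assumes the two-line robots.txt quoted above and a placeholder host, example.com. It shows how a generic robots.txt parser reads the rule, not how Apple's own parser is implemented.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted in the article; example.com and the paths are placeholders.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent is permitted for url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Applebot", "Applebot-Extended"):
        for path in ("/private/report.html", "/news/story.html"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Under this rule only Applebot-Extended is restricted, and only for paths under /private/. The primary Applebot agent is unaffected, which matches the separation between training and search described above.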
The nosnippet meta tag addresses the second use case - real-time AI retrieval. According to the documentation, "Apple will not use data tagged nosnippet as additional context and up-to-date content when AI models are used to generate output for display in Apple products and services." Publishers can apply the tag at the page level in the HTML head section, or by using an Applebot-specific variant:
<meta name="applebot" content="nosnippet">
The documentation also notes that the nosnippet directive prevents Applebot from generating descriptions or web answers - search results for nosnippet-tagged pages will show only the page title.
There is a limitation worth noting. According to the updated page, the nosnippet tag's effect on AI output applies at the page level only. Section-level markup using the schema.org hasPart property is not supported. Publishers cannot selectively exclude a paragraph or a sidebar from AI retrieval while keeping the rest of the page eligible. The scope is the entire page, or nothing.
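For teams auditing templates, a page-level check for the directive can be sketched with Python's built-in html.parser. The sketch assumes that a nosnippet token in either the generic robots meta tag or the Applebot-specific one counts as an opt-out, following the article's description. The class and function names are illustrative, not part of any Apple tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="applebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lowercase directive tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name in ("robots", "applebot"):
            tokens = {t.strip().lower() for t in attr.get("content", "").split(",")}
            tokens.discard("")
            self.directives.setdefault(name, set()).update(tokens)


def has_nosnippet_for_applebot(html: str) -> bool:
    """True if the page carries nosnippet in a tag Applebot is assumed to read."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("nosnippet" in tokens for tokens in parser.directives.values())


if __name__ == "__main__":
    page = '<html><head><meta name="applebot" content="nosnippet"></head></html>'
    print(has_nosnippet_for_applebot(page))
```

Because the signal is page-level only, the check returns a single boolean per document rather than a map of sections.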
The nosnippet caveat: crawling continues regardless
A critical clarification in the June 8 documentation prevents publishers from conflating opt-out signals with access restrictions. According to Apple, "even if you disallow Applebot-Extended and tag website content with the nosnippet meta tag, your website instructions may still allow Applebot to crawl your webpages. Your content will remain discoverable through Spotlight, Siri, and Safari, as well as other system-wide features on Apple devices."
In plain terms: opting out of AI training and opting out of AI-generated answers does not remove a site from Apple's search index. The three controls - robots.txt for general crawl access, Applebot-Extended for training data, and nosnippet for retrieval context - are independent levers that operate without overriding each other. Publishers who want to block all Applebot activity entirely must disallow the primary Applebot user agent in robots.txt.
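The independence of the three levers can be expressed as a small truth table. The model below is a simplified reading of the article, not Apple's logic: the field names are hypothetical, and paywall markup and other signals are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSignals:
    applebot_allowed: bool   # robots.txt permits the primary Applebot agent
    extended_allowed: bool   # robots.txt permits Applebot-Extended
    nosnippet: bool          # page carries the nosnippet directive


def permitted_uses(signals: SiteSignals) -> dict:
    """Map the three publisher signals to the uses described in the article."""
    crawled = signals.applebot_allowed  # nothing downstream without a crawl
    return {
        "search_index": crawled,
        "model_training": crawled and signals.extended_allowed,
        "ai_answer_context": crawled and not signals.nosnippet,
    }


if __name__ == "__main__":
    # Opted out of both AI uses, still crawlable and searchable.
    print(permitted_uses(SiteSignals(True, False, True)))
```

In this model, only the first field switches everything off at once; the other two each remove exactly one use.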
Paywalled content and the isAccessibleForFree property
The updated documentation introduces explicit handling for subscription content. Apple now supports the schema.org isAccessibleForFree property to identify paywalled pages. Publishers can declare a page as not freely accessible by adding the following JSON-LD structured data:
{
"@context": "https://schema.org",
"isAccessibleForFree": false
}
According to the documentation, pages marked with isAccessibleForFree: false "are eligible to appear in search results, but Applebot will not use that content as additional context when AI models are used to generate output for display in Apple products and services." This is a meaningful distinction for publishers with metered or subscription models: the content stays indexable and findable, but Apple draws a boundary around using paywalled text to generate AI answers.
As with nosnippet, this signal operates at the page level. The hasPart section-level variant is not supported.
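For sites that render pages from templates, the markup can be generated rather than hand-written. The helper below is a minimal sketch: it reproduces the fragment quoted above and wraps it in the script element normally used to embed JSON-LD in HTML. The function name is hypothetical, and the fragment is kept exactly as quoted, without an @type, since the article does not say whether Apple expects one.

```python
import json


def paywall_jsonld_script(is_free: bool) -> str:
    """Return a JSON-LD <script> element declaring isAccessibleForFree."""
    payload = {
        "@context": "https://schema.org",
        "isAccessibleForFree": is_free,
    }
    body = json.dumps(payload, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    # A paywalled page declares the property as false.
    print(paywall_jsonld_script(False))
```

Generating the value from the same flag that drives the paywall itself keeps the declaration and the actual access rules from drifting apart.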
X-Robots-Tag and non-HTML resources
One of the technically relevant additions to the documentation is the explicit recognition of the X-Robots-Tag HTTP response header. According to Apple, Applebot "also supports indexing directives delivered via the X-Robots-Tag HTTP response header." This is particularly useful for non-HTML resources such as PDFs and images, where meta tags cannot be placed in an HTML head section.
The implementation syntax is simple:
X-Robots-Tag: applebot: nosnippet
For publishers managing large document archives - PDFs, downloadable reports, image files - the X-Robots-Tag is the only available mechanism to signal indexing preferences for those files, since they have no HTML head to carry a meta tag. The Applebot-specific token ensures the directive applies only to Apple's crawler, leaving Googlebot and Bingbot unaffected.
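How the header gets attached depends on the web server or CDN in use. As a framework-neutral illustration, the sketch below decides per request path whether to add the header value quoted above. The extension list is a hypothetical policy chosen for the example, not something Apple specifies.

```python
from pathlib import PurePosixPath

# Hypothetical policy: file types that should carry the Applebot directive.
NOSNIPPET_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def extra_headers(url_path: str) -> dict:
    """Return response headers to add for the given URL path."""
    suffix = PurePosixPath(url_path).suffix.lower()
    if suffix in NOSNIPPET_EXTENSIONS:
        # Header name and value as quoted in the article.
        return {"X-Robots-Tag": "applebot: nosnippet"}
    return {}


if __name__ == "__main__":
    print(extra_headers("/reports/annual-2025.pdf"))
    print(extra_headers("/index.html"))
```

The returned dictionary can be merged into whatever response object the serving stack provides.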
Google added AI Mode to its own robots meta tag documentation in March 2025, and Bing introduced a data-nosnippet attribute for granular control over content sections in October 2025. Apple's June 8 update now places all three major crawlers on broadly similar footing in terms of the controls they expose to publishers - though the implementations differ in detail.
Crawl identification and efficiency mechanics
The documentation retains and expands its technical section on identifying Applebot traffic. Traffic originating from Applebot can be identified using reverse DNS lookup against the *.applebot.apple.com domain. Alternatively, publishers can match IP addresses against the CIDR prefix list Apple maintains in a publicly available JSON file. The standard reverse DNS verification flow produces results like:
$ host 17.58.101.179
179.101.58.17.in-addr.arpa domain name pointer 17-58-101-179.applebot.apple.com.
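The same lookup can be automated in log-processing code. The sketch below uses Python's socket module and checks the hostname suffix named in the documentation. The forward-confirmation step (resolving the hostname back to the original IP) is a common hardening practice added here as an assumption; the article does not say Apple requires it. The default forward lookup handles IPv4 only, and both resolver functions can be swapped out for testing.

```python
import socket

APPLEBOT_SUFFIX = ".applebot.apple.com"


def _reverse(ip: str) -> str:
    """PTR lookup: return the primary hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> list:
    """A-record lookup: return the IPv4 addresses for a hostname."""
    return socket.gethostbyname_ex(hostname)[2]


def is_applebot(ip: str, reverse=_reverse, forward=_forward) -> bool:
    """True if ip reverse-resolves under applebot.apple.com and resolves back."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.endswith(APPLEBOT_SUFFIX):
        return False
    try:
        # Assumed extra step: the hostname must map back to the same IP.
        return ip in forward(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    print(is_applebot("17.58.101.179"))  # requires live DNS access
```

Checking the suffix with a leading dot avoids matching look-alike domains such as notapplebot.apple.com.example.net.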
On crawl behaviour, the documentation includes a new explicit statement: "Applebot does not follow crawl-delay." This is a notable clarification. The crawl-delay directive, a non-standard but widely used robots.txt parameter, is honoured by Bing and some other crawlers. Apple's position is that its crawler manages rate internally. According to the documentation, "Applebot is engineered for efficiency and will adjust to minimize the impact on site owners." The crawler adjusts its rate automatically when a site slows down or returns errors, and Apple caches crawled content to reduce unnecessary repeated requests.
The documentation also specifies that Applebot may render page content within a browser-like environment, meaning it processes JavaScript, CSS, and XHR requests. If those resources are blocked via robots.txt, Applebot may be unable to render content properly. Apple recommends either making all rendering-necessary resources accessible to the crawler or ensuring pages degrade gracefully when resources are unavailable - what the documentation calls "graceful degradation."
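A quick way to spot this problem is to test a page's script and stylesheet URLs against the site's own robots.txt. The sketch below uses urllib.robotparser from the Python standard library; the robots.txt content and URLs are invented for the example, and a generic parser may not match Apple's behaviour in every edge case.

```python
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_txt: str, resource_urls, user_agent: str = "Applebot"):
    """Return the resource URLs that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in resource_urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical site that blocks its JavaScript directory for all crawlers.
    robots = "User-agent: *\nDisallow: /assets/js/\n"
    resources = [
        "https://example.com/assets/js/app.js",
        "https://example.com/assets/css/site.css",
    ]
    print(blocked_resources(robots, resources))
```

Any URL in the returned list is a candidate for either an Allow rule or a fallback that lets the page render without it.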
Googlebot fallback behaviour
A detail with practical consequences for site operators who have not written Applebot-specific robots.txt rules: "If robots instructions don't mention Applebot but mention Googlebot, the Apple robot will follow Googlebot instructions." This means a site that has a comprehensive Googlebot configuration but has not explicitly addressed Applebot will still have its Googlebot rules honoured by Apple's crawler. This fallback does not apply to Applebot-Extended, which requires its own explicit directive.
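The fallback order can be written down as a short lookup. The sketch below scans a robots.txt for user-agent groups and reports which group each Apple agent would follow under the rule quoted above. The final fallback to the wildcard group is standard robots.txt behaviour and is assumed here rather than taken from Apple's page; the function names are illustrative.

```python
def user_agents_in(robots_txt: str) -> set:
    """Return the lowercase user-agent tokens named in a robots.txt."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def rules_followed_by(agent: str, robots_txt: str):
    """Return the group an Apple agent would follow, or None if none applies."""
    agents = user_agents_in(robots_txt)
    agent = agent.lower()
    if agent in agents:
        return agent
    # Per the article, only the primary Applebot agent inherits Googlebot rules.
    if agent == "applebot" and "googlebot" in agents:
        return "googlebot"
    # Assumption: otherwise the wildcard group applies, if present.
    return "*" if "*" in agents else None


if __name__ == "__main__":
    robots = "User-agent: Googlebot\nDisallow: /search\n"
    print(rules_followed_by("Applebot", robots))
    print(rules_followed_by("Applebot-Extended", robots))
```

For a file that names only Googlebot, the primary crawler inherits those rules while Applebot-Extended matches no group, which is why a training opt-out needs its own explicit entry.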
Search ranking signals documented
The "About Applebot" page lists the factors Apple Search may use when ranking web search results. These include aggregated user engagement with search results, relevancy and matching of search terms to webpage topics and content, number and quality of links from other pages on the web, user location-based signals using approximate data, and webpage design characteristics. The documentation notes that search results may use these factors with "no pre-determined importance of ranking" - no explicit weighting is disclosed.
The market context for publishers and marketers
The June 8 update lands in a context where the relationship between publishers and AI crawlers has become considerably more contested. AI crawlers consumed 4.2 percent of all HTML requests across Cloudflare's network in 2025, a period when global internet traffic grew 19 percent. A March 2026 study of approximately 200 retail and e-commerce websites found that for every single visit OpenAI's systems deliver to a retail website, those same systems perform 198 crawls. The disparity illustrates how AI platforms interact with the web in ways structurally different from traditional search.
PPC Land first documented the mechanics of Applebot - including the introduction of Applebot-Extended - in June 2024, before Apple Intelligence had launched as a product and before the training use case was disclosed in the crawler's own documentation. The June 8 update formalizes what had become implicit: that a website allowing Applebot to crawl is now contributing to an AI training pipeline unless the publisher takes deliberate steps to opt out.
A major wave of blocking began in 2023, when large media and news publishers started disallowing AI training crawlers. GPTBot, the OpenAI crawler, was blocked by a growing share of top sites - including The New York Times, The Guardian, CNN, and Reuters - within weeks of its introduction in August 2023. Applebot-Extended attracted similar blocking. A BuzzStream study published in April 2026 found that blocking AI crawlers rarely stops AI systems from citing publisher content - because AI systems can draw on content indexed before blocking rules were applied, or retrieve via search results rather than direct crawling.
Czech publishers in March 2026 adopted a structured two-tier robots.txt framework distinguishing between AI training bots and real-time retrieval bots, aligning with EU copyright law's text and data mining exception. Apple's documentation update separates the same two use cases - training via Applebot-Extended and retrieval via nosnippet - suggesting a convergent technical vocabulary, even if legal enforceability remains contested.
Research published in April 2026 by Cloudflare and ETH Zurich found that AI bots were disrupting the web's cache layer, with training crawls generating load patterns that differ structurally from human traffic. Apple's statement that its crawler adjusts automatically based on server response times is relevant here: it describes a reactive throttling system, not a pre-agreed crawl budget.
Only 7.4 percent of Fortune 500 companies had implemented llms.txt - a newer, unratified mechanism for communicating AI access preferences - as of a March 2026 study. The robots.txt-based controls Apple formalizes on June 8 are, by contrast, widely understood, deployable within minutes, and compatible with existing webmaster tooling. For practitioners managing large site portfolios, the Applebot-Extended disallow directive and the nosnippet tag are the immediately actionable interventions.
The timing matters for one further reason. Apple confirmed at WWDC26 on June 8, 2026, that Siri AI is built on Google Gemini models. The foundation model underlying Siri's responses is no longer exclusively Apple's own - it incorporates Gemini technology, running on device and through Private Cloud Compute. Web content crawled by Applebot may therefore contribute context to a system whose underlying model is jointly developed. Publishers opting out of Applebot-Extended are opting out of the training side of that pipeline; keeping content out of Siri's generated answers additionally requires the nosnippet tag.
Timeline
- August 7, 2023: OpenAI introduces GPTBot; major publishers begin blocking AI training crawlers within two weeks.
- June 2024: Apple introduces Applebot-Extended, the secondary user agent giving publishers opt-out controls over AI training data use. PPC Land covers Applebot's mechanics including Applebot-Extended.
- August 3, 2024: Analysis shows top websites increasingly blocking AI web crawlers, with GPTBot blocked by 26 percent of top sites.
- March 9, 2025: Google adds AI Mode to its robots meta tag documentation, extending nosnippet controls to AI Overviews and AI Mode.
- October 18, 2025: Bing introduces data-nosnippet attribute for element-level publisher control over snippets in search and Copilot.
- December 20, 2025: Cloudflare analysis finds AI crawlers accounted for 4.2 percent of HTML requests across its network, as global internet traffic grows 19 percent.
- March 7, 2026: Study of approximately 200 retail sites finds AI systems perform 198 crawls per visit delivered, versus one crawl per six visits for Google.
- March 22, 2026: Czech publishers adopt two-tier robots.txt framework distinguishing AI training and real-time retrieval crawlers under EU copyright law.
- April 5, 2026: Study finds only 7.4 percent of Fortune 500 companies have implemented llms.txt, while 92.8 percent use robots.txt.
- April 6, 2026: BuzzStream data from 4 million AI citations finds blocking AI crawlers rarely stops AI systems from citing content.
- April 6, 2026: Cloudflare and ETH Zurich publish research showing AI bots disrupting the web's cache layer.
- June 8, 2026: Apple confirms at WWDC26 that Siri AI is built on Google Gemini models.
- June 8, 2026: Apple publishes updated "About Applebot" documentation formalizing AI training use, real-time AI retrieval, nosnippet controls, X-Robots-Tag support, crawl-delay behaviour, and paywalled content handling.
Summary
Who: Apple, affecting all website publishers globally whose content Applebot crawls, as well as SEO practitioners, content teams, and marketing professionals managing web properties.
What: Apple updated its "About Applebot" documentation to formally disclose that Applebot crawled data may be used to train its foundation models powering Apple Intelligence, Siri, Services, and Developer Tools. The update also added new sections covering nosnippet controls for AI output, the X-Robots-Tag HTTP header, crawl-delay behaviour (Applebot does not follow crawl-delay), paywalled content handling via the schema.org isAccessibleForFree property, and site rendering requirements. Two distinct opt-out mechanisms are documented: disallowing Applebot-Extended in robots.txt to prevent AI training data use, and applying the nosnippet meta tag to prevent content from being used as context in AI-generated answers.
When: The documentation was published on June 8, 2026, the same day Apple announced at WWDC26 that Siri AI is built on Google Gemini models.
Where: The changes are published in Apple's official support documentation at apple.com, covering Applebot's behaviour globally across all markets where Apple products operate.
Why: Apple formally documented how its crawler serves its expanding generative AI infrastructure, providing publishers with technical controls to manage the use of their content across two distinct pipelines - AI model training and real-time AI retrieval - as pressure from publishers over crawler use of web content has intensified industry-wide since 2023.