Google on March 3, 2026, added a new page to its Crawling Infrastructure documentation titled "Things to know about Google's web crawling" - a nine-point overview designed to answer common questions the company says it has received over more than three decades of crawling the open web. The document, last updated March 3, 2026 UTC, was flagged by SEO professionals Glenn Gabe and Barry Schwartz on LinkedIn within days of its publication, quickly drawing attention across the technical search community.
According to the changelog entry published by Google, the rationale was straightforward: "Based on questions we've received over the years, we've put together a resource page with basic educational information about crawling to better highlight various resources about crawling that are available to site owners." The page now sits under the dedicated Crawling Infrastructure documentation site, a home Google created in November 2025 after migrating crawling content away from Google Search Central to reflect the fact that crawling infrastructure serves products well beyond Search alone - including Google Shopping, News, Gemini, and AdSense.
What the document actually says
The nine sections cover ground that is individually familiar to technical SEO practitioners but has rarely been consolidated in one authoritative place. Taken together, they offer a picture of how Google approaches the mechanics of web discovery at scale.
Crawling is defined in the document as "the process of using automated software to discover new web pages and to understand them." This is deliberately plain language - aimed, according to Google, at audiences who ask foundational questions rather than at advanced developers already working with the Search Console API. The company notes that all search engines depend on crawling to know what pages and information exist on the web.
The document gives particular weight to Google's roster of crawlers. Googlebot, described as the "most well-known crawler," handles the core task of keeping Search results fresh. But the infrastructure reaches further. Dedicated crawlers exist for Google Images and Google Shopping. According to Google, each crawler uses "easily identifiable user-agent names and known internet addresses," allowing site owners to confirm that activity they see in server logs is legitimate. PPC Land covered Google's September 2024 documentation revamp that split crawler information across separate pages and added robots.txt snippets per user agent - an earlier expansion of the same transparency effort.
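Because Google publishes its crawler address ranges, log verification can be done without guesswork. The sketch below is a minimal, hedged illustration: the CIDR range is one commonly observed for Googlebot, but the authoritative, regularly refreshed list is the JSON file of crawler IP ranges Google itself publishes, which a real implementation would fetch instead of hardcoding.

```python
import ipaddress

# Illustrative range only. A production check should load the current
# list from Google's published crawler IP range JSON, which is refreshed
# daily, rather than hardcoding values.
GOOGLEBOT_RANGES = [
    ipaddress.ip_network("66.249.64.0/19"),  # commonly seen Googlebot range
]

def is_googlebot_ip(ip: str) -> bool:
    """Check whether an IP seen in server logs falls in a known Googlebot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in GOOGLEBOT_RANGES)

print(is_googlebot_ip("66.249.66.1"))   # inside the sample range
print(is_googlebot_ip("203.0.113.7"))   # TEST-NET address, not Googlebot
```

Pairing an IP check like this with the user-agent string gives site owners the two signals the document says Google exposes for verifying that crawl activity is legitimate.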
Crawl frequency: what drives it and what it signals
One of the document's more direct statements concerns recrawl frequency, a topic that generates persistent anxiety among site owners. According to the document, breaking news homepages may be recrawled "every few minutes," while pages where nothing has changed for years may wait a month between crawls. The gap between those two extremes is governed by Google's systems reading demand signals automatically.
The document frames high crawl frequency as a positive indicator: "If we're crawling your site a lot, it's an indication your pages have fresh or highly relevant content that people want to find, and that our systems are recognizing that demand." E-commerce is cited as a practical example. Google crawls product pages frequently so that prices, promotions, and inventory status reflect current conditions in search results. This is not a trivial point. For retailers, crawl frequency directly affects whether the price shown in a shopping result matches what a customer will see on the product page.
Sitemap files are identified as the primary mechanism through which site owners can influence recrawl timing. By explicitly declaring new and updated pages in sitemaps, publishers give Google's scheduling systems more precise signals. The document does not promise faster crawls in response to sitemaps - only that they inform the process.
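A sitemap declaring an updated page follows the standard sitemaps.org format; the URL and date below are hypothetical, and the `lastmod` element is the signal that tells Google's scheduling systems when a page last changed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/products/widget</loc>
    <lastmod>2026-03-01</lastmod>
  </url>
</urlset>
```

As the document notes, declaring pages this way informs crawl scheduling rather than guaranteeing a faster recrawl.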
Rendering: why one crawl is sometimes not enough
Perhaps the most technically substantive section of the document concerns rendering. According to Google, crawlers use "a technique called rendering, which loads a site in full to 'see' a page just as a real person would." This matters because a raw HTML fetch alone may not capture the content a JavaScript-heavy page actually delivers to users.
The scale of the rendering challenge has grown with the web itself. The document provides two specific data points. The median mobile page has grown from 816 kilobytes to 2.3 megabytes. The same median page now requires more than 60 different files to load - images, scripts, stylesheets, interactive components. To capture an accurate snapshot of what a page offers, Google may need to crawl the same URL multiple times as new elements are added.
This detail connects directly to a series of documentation changes Google has made over recent months. PPC Land reported in February 2026 that Google reduced Googlebot's file size limit from 15MB to 2MB per resource, an 86.7% decrease, documented under the same Crawling Infrastructure site that now hosts the March 3 page. That reduction creates a direct interaction with rendering: if a JavaScript bundle exceeds the 2MB uncompressed threshold, rendering may work with incomplete code. A testing tool released in February 2026 allows SEO professionals to simulate the limit on their own pages. The new overview page does not address the 2MB limit directly, but the broader crawling infrastructure documentation in which it sits does.
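A rough size check against that threshold is easy to script. This is a sketch, not Googlebot's actual logic: the 2MB constant mirrors the figure from the February 2026 coverage (interpreted here as binary megabytes, which is an assumption), and real audits should measure the uncompressed size of each fetched resource.

```python
# Sketch: flag resources whose uncompressed body exceeds the reported
# 2 MB per-resource crawl limit. The threshold value is an assumption
# based on reported figures, not a value read from any official API.
LIMIT_BYTES = 2 * 1024 * 1024  # 2 MB, uncompressed

def exceeds_crawl_limit(payload: bytes, limit: int = LIMIT_BYTES) -> bool:
    """Return True if a resource body is larger than the assumed crawl limit."""
    return len(payload) > limit

small_page = b"<html>" + b"x" * 1024 + b"</html>"  # roughly 1 KB of HTML
big_bundle = b"y" * (3 * 1024 * 1024)              # a 3 MB script bundle
print(exceeds_crawl_limit(small_page))  # False
print(exceeds_crawl_limit(big_bundle))  # True
```

Running a check like this against a site's largest JavaScript bundles is a quick way to spot resources that rendering may truncate.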
Efficiency, errors, and automatic adjustment
The document describes crawlers as "engineered for efficiency," with automatic adjustments when sites slow down or return errors. When a site's servers respond slowly or throw errors, Google's crawl rate drops automatically to avoid overloading the infrastructure. This behaviour is not new, but stating it explicitly in a public-facing overview is notable. PPC Land covered an August 2025 incident in which crawl rates dropped dramatically across sites hosted on Vercel, WP Engine, and Fastly, with Google acknowledging the disruption originated from its own systems rather than from host configurations.
Content caching is identified as another efficiency mechanism. By caching crawled content, Google reduces the number of repeat requests it makes to the same servers. The document also notes that as crawlers discover more of a website, they become capable of recognising sections that need less coverage. The example given is stark and illustrative: calendar pages extending to the year 9999 "probably don't need to be crawled in their entirety."
Site owners can reduce wasted crawling by identifying content that does not need to be crawled. The document frames this as a mutual benefit - it lowers infrastructure costs for the website and makes the internet more efficient overall. The Crawl Rate Limiter tool in Search Console was deprecated in January 2024, replaced by automatic systems that adjust based on server signals, though site owners can still influence crawl access through robots.txt directives.
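For the endless-calendar case the document describes, a robots.txt exclusion is the simplest fix. The path below is hypothetical; the directive syntax is the standard one.

```
# Hypothetical directives keeping all crawlers out of an auto-generated
# calendar section that extends indefinitely into the future.
User-agent: *
Disallow: /calendar/
```

Blocking sections like this trims exactly the kind of low-value crawling the document says Google's systems otherwise have to learn to deprioritise on their own.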
Paywalls, permissions, and structured data
Section seven addresses a topic relevant to publishers, news organisations, and subscription businesses: what happens at the boundary between open and gated content. According to the document, Google's crawlers "never go into paywall or subscription content without permission." If a page requires a login, crawlers cannot access it by default.
Publishers who want Google to index subscription content - so that Google can refer users to that content - have a documented path. They can grant explicit permission through structured data. If they do, structured data can be used "to continue showing human visitors a login screen without triggering our rules on spam." Separately, preview controls allow subscription content to be excluded from page previews even when some indexing is permitted. This set of options matters for news publishers trying to balance discoverability against subscription conversion.
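The structured data path the document refers to follows Google's documented paywalled-content markup, which pairs `isAccessibleForFree` with a `hasPart` block identifying the gated section. The headline and CSS selector below are hypothetical placeholders.

```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Example subscriber story",
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": false,
    "cssSelector": ".paywalled-section"
  }
}
```

Markup like this is what lets a publisher show human visitors a login screen while crawlers index the gated content without the page being treated as cloaking.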
Robots.txt, Google-Extended, and AI training
Sections eight and nine deal with site owner control over crawler access. The document describes robots.txt as a "simple text file that lets site owners declare how crawlers like ours should interact with their pages." Combined with robots meta tags and sitemaps, these controls let site owners block pages from appearing in Search, declare content for crawling, and influence crawl frequency.
The document gives specific attention to Google-Extended, a product token that can be used in robots.txt to control "whether their content helps train future versions of Gemini models." The document states explicitly that using Google-Extended to restrict AI training does not affect a site's inclusion in Search, and that Google does not use Google-Extended as a ranking signal. This distinction matters because some site owners have worried that restricting AI training access might carry penalties in organic search results. The document is unambiguous that it does not.
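In robots.txt terms, restricting AI training while leaving Search untouched looks like the fragment below: the Google-Extended token is blocked site-wide while Googlebot retains full access, which is exactly the separation the document says carries no ranking consequence.

```
# Blocks the Google-Extended product token (AI training control) while
# leaving Googlebot, and therefore Search indexing, unaffected.
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
```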
Google's September 2024 documentation revamp was an earlier point in the same evolution toward clearer per-product controls. The question of which crawlers serve which products has become more pressing as Google's infrastructure has expanded. PPC Land's December 2024 analysis of earlier technical crawling documentation noted that Google's Web Rendering Service implements a 30-day caching system for JavaScript and CSS resources, independent of HTTP caching directives - a detail that interacts directly with the crawl efficiency picture the March 3 document paints.
Search Console as the primary management interface
The document closes by pointing site owners to Google Search Console, described as available at no cost. Search Console's role in this context is diagnostic and informational - it shows how much Google has crawled a site and why, helps identify problems such as server downtime and speed issues, and provides information about how pages appear in Search and how users engage with them. The November 2025 documentation migration that preceded March 3's new page included expanded HTTP caching specifications and transfer protocol details - all now housed on the same Crawling Infrastructure site.
Why this matters to the marketing and advertising community
For marketing professionals, the document's practical significance extends across several dimensions. E-commerce advertisers managing product feeds and landing pages have a direct stake in how often Google refreshes its view of their inventory. A crawl delay of even a few hours can mean that a Search result shows a price that no longer matches the checkout total - a friction point that affects conversion rates and potentially ad quality scores.
PPC Land's March 2025 coverage of Google's change to daily IP range refreshes for crawler verification sits in the same security context the March 3 document addresses: site owners need to know whether traffic claiming to be Googlebot actually is. The answer matters for both security policy and for analytics data integrity. Traffic from impersonating bots that bypasses security measures can distort engagement metrics, skew attribution models, and mislead campaign optimisation decisions.
The explicit statement about Google-Extended and AI training is separately significant for content publishers who monetise through advertising. A publisher that restricts AI training access via Google-Extended now has written confirmation from Google that doing so will not affect their organic search visibility - and therefore will not affect the organic traffic that advertising revenues depend on. That assurance removes a perceived trade-off that may have deterred some publishers from using the token.
The broader context also matters. PPC Land's December 2025 analysis of Cloudflare data showed that AI bots accounted for 4.2% of all HTML requests in 2025, with Googlebot originating 4.5% of requests separately. As AI-driven crawling activity grows, the distinction between crawlers that respect robots.txt and those that do not - a distinction the March 3 document clarifies across all three crawler categories - becomes increasingly material for publishers managing both search visibility and AI training exposure.
Glenn Gabe, President of G-Squared Interactive, noted on LinkedIn that "frequent crawling (OF THE RIGHT URLs) is a good sign," a reaction that captures the document's core message about crawl frequency as an indicator of content quality rather than a cause for concern.
Timeline
- September 16, 2024 - Google revamps documentation for crawlers and user-triggered fetchers, splitting content across separate pages and adding product-impact sections and robots.txt snippets per user agent
- December 3, 2024 - Google releases detailed technical documentation on the web crawling process, co-authored by Martin Splitt and Gary Illyes, covering the four-stage HTML-to-rendering pipeline and 30-day WRS caching system
- March 18, 2025 - Google updates crawler verification with daily IP range refreshes for the JSON objects containing Google crawler and fetcher IP addresses, replacing a weekly schedule
- August 8-28, 2025 - Google crawl rate declines affect multiple hosting platforms including Vercel, WP Engine, and Fastly, with Google acknowledging the disruption originated from its own systems
- November 20, 2025 - Google migrates crawling documentation to a new dedicated Crawling Infrastructure site, covering HTTP caching, transfer protocols, and content encoding standards
- December 15-18, 2025 - Google clarifies JavaScript rendering behaviour for error pages, establishing that rendering may be skipped for non-200 HTTP status codes
- January 21, 2026 - Google adds Google Messages to the list of user-triggered fetchers, documenting the fetcher used to generate link previews in Google Messages chat threads
- February 3, 2026 - Google documents reduction of Googlebot's file size limit from 15MB to 2MB, an 86.7% decrease affecting HTML resource crawling
- February 6, 2026 - Third-party testing tool updated to simulate Google's 2MB HTML limit, enabling SEO professionals to verify whether pages exceed Googlebot's reduced file size threshold
- March 3, 2026 - Google publishes "Things to know about Google's web crawling," a nine-point overview consolidating fundamental crawling information for site owners, published under the Crawling Infrastructure documentation site
Summary
Who: Google published the document as part of its Crawling Infrastructure documentation. The primary audience is site owners, web developers, and SEO professionals who manage how Google's automated systems interact with their websites. Industry commentators Barry Schwartz and Glenn Gabe brought the document to wider attention via LinkedIn posts.
What: A new nine-section overview page titled "Things to know about Google's web crawling" was added to Google's Crawling Infrastructure documentation. The page covers the definition of crawling, the purpose of different crawlers, recrawl frequency, the rendering process, automatic efficiency optimisation, paywall handling, site owner controls via robots.txt and Google-Extended, and Search Console as a management tool. The document also states explicitly that using Google-Extended to restrict AI training does not affect a site's inclusion in Search.
When: Google published the document on March 3, 2026, according to the Crawling Infrastructure changelog. The changelog entry is dated March 3, and the document itself carries a "Last updated 2026-03-03 UTC" timestamp.
Where: The document is hosted on Google's Crawling Infrastructure documentation site, which was established in November 2025 after migrating crawling content from Google Search Central. The document is accessible at no cost and is aimed at a general site owner audience rather than advanced developers.
Why: According to Google's own changelog entry, the document was created in response to questions the company has received over more than 30 years of web crawling. The stated purpose is to consolidate basic educational information about crawling and highlight the range of resources available to site owners. The timing follows a series of significant documentation updates - including the 2MB crawl limit reduction, IP range refresh frequency changes, and the JavaScript rendering clarifications - that individually addressed technical edge cases. The new overview page provides a single entry point for site owners who need foundational context before engaging with those more technical specifications.