Google's Search Relations team today published episode 105 of Search Off the Record, a YouTube and podcast series produced by Google Search Central, in which engineers Martin Splitt and Gary Illyes deliver a detailed account of how the company's crawling infrastructure actually operates. The video, uploaded on March 12, 2026 to the Google Search Central YouTube channel - which has 774,000 subscribers - had accumulated 678 views within hours of publication. The episode carries a deceptively simple title: "Google crawlers behind the scenes." What it contains is considerably more technical.

The central disclosure is that Googlebot is not a standalone program. It never was. Gary Illyes, an analyst at Google, describes the name as a misnomer that made sense in the early 2000s when Google had a single product and, therefore, a single crawler. "Back then we probably had one crawler because we had one product," he says in the episode. "But then soon after another product came out - I think that was AdWords - and then we started having more crawlers and then more products came out and then more crawlers. But the Googlebot name somehow stuck."

A crawling infrastructure with no public name

What powers web crawling at Google today is, according to Illyes, an internal software-as-a-service platform. He gives it a placeholder name - "Jack" - for the purposes of the discussion, noting that its real internal name does not matter for public understanding. Jack exposes API endpoints. Any Google team that needs to fetch content from the internet makes an API call to Jack, passing parameters such as the desired user agent string, the robots.txt product token to obey, and timeout thresholds. The team does not manage the crawling mechanics itself. Jack handles those centrally.
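The shape of such a call can be pictured in a few lines. This is an illustrative sketch only: "Jack" is the episode's placeholder name, and every field, constant, and function name below is an assumption for illustration, not Google's actual API. Python's standard `urllib.robotparser` stands in for the robots.txt check the central service performs.

```python
import urllib.robotparser
from dataclasses import dataclass

@dataclass
class FetchRequest:
    url: str
    user_agent: str        # user agent string to send, e.g. "Googlebot/2.1"
    robots_token: str      # robots.txt product token to honour, e.g. "Googlebot-News"
    timeout_seconds: float = 30.0

# A hypothetical robots.txt the central service would consult before fetching.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /drafts/

User-agent: *
Allow: /
"""

def jack_would_fetch(req: FetchRequest) -> bool:
    """Check the request's product token against the site's robots.txt,
    as the central service would, before any bytes are fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(req.robots_token, req.url)

# The same URL is refused for one token and allowed for another.
print(jack_would_fetch(FetchRequest("https://example.com/drafts/x", "Googlebot-News/2.1", "Googlebot-News")))  # False
print(jack_would_fetch(FetchRequest("https://example.com/drafts/x", "Googlebot/2.1", "Googlebot")))            # True
```

The point of the dataclass is that the client team supplies only parameters; the mechanics of fetching, politeness, and refusal live behind the endpoint.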

"It's basically you tell it, fetch something from the internet without breaking the internet," Illyes says, summarising the entire system in a single sentence. "And then it will do that if the restrictions on the site allow it."

Martin Splitt, a search advocate at Google Switzerland, draws an analogy that clarifies the architecture: Googlebot is not the infrastructure. It is one client of that infrastructure. "It's just the name that one particular team is using for their fetches that are sent to this central SaaS," Illyes confirms. There are many other clients - Google News, Google Shopping, AdSense, Gemini apps, NotebookLM - all sending their crawl requests through the same underlying system. The Google for Developers documentation page on crawling infrastructure, published at developers.google.com, lists Googlebot-News, Storebot-Google, Google-Extended, the AdSense crawler, and the Google-NotebookLM fetcher as distinct products sharing this same foundation.

The outlines of this architecture have been visible across years of documentation updates tracked by PPC Land. In November 2025, Google migrated all crawling documentation away from Google Search Central to a new dedicated site, reflecting the fact that the infrastructure serves products well beyond search. That reorganisation preceded a series of technical specification updates on caching, transfer protocols, and content encoding that PPC Land reported at the time.


Crawlers versus fetchers: a distinction that matters

One of the more practically useful clarifications in today's episode concerns the difference between crawlers and fetchers. The two terms appear in Google's documentation and are frequently used interchangeably by practitioners, but Illyes draws a firm line between them.

Crawlers operate in batch. They run continuously, processing a constant stream of URLs for a given team. No human is waiting on the other end. A fetcher, by contrast, operates on a single URL at a time and is always triggered by a user action - someone clicking a button, sharing a link, or requesting a preview. "Basically, there's someone on the other end who's waiting for the response," Illyes explains. The IP ranges used by crawlers and fetchers differ as well, though he notes that the underlying infrastructure is largely shared.

The practical significance of this distinction became clear in January 2026, when Google added Google Messages to its user-triggered fetchers list, documenting the bot that generates link previews inside Google Messages chat threads. That fetcher, like others of its kind, generally ignores robots.txt rules because it acts in response to explicit human action rather than automated crawling.

The question of which crawlers are documented is itself governed by thresholds. Illyes says he built internal SQL queries that trigger alerts when a crawler or fetcher exceeds a certain number of daily fetches. Once an alert fires, an issue is opened internally and the team investigates whether the crawler is legitimate, has been accidentally left running after a project was sunset, or needs to be formally documented. "Quite literally because of lack of space," he says of the decision not to document minor crawlers, referring to the real estate on the developers.google.com/crawlers page.

The 15 megabyte default and the 2 megabyte override

The episode surfaces one of the most consequential technical parameters in the crawling infrastructure: the default file size limit. According to Illyes, the infrastructure imposes a 15 megabyte default on any fetch that does not specify an override. When a crawler reaches that threshold while downloading bytes from a server, it stops receiving data. It does not necessarily close the connection; it signals that it has received enough.

Individual teams can override this default. For Google Search specifically, the limit has been overridden downward - to 2 megabytes for most HTML resources. For PDFs, a higher limit applies. Illyes mentions 64 megabytes for PDFs in the episode, though he notes some uncertainty about the exact figure, adding that the HTTP standard allows PDF exports of around 96 megabytes.

The 2 MB limit for HTML is not new information in isolation. PPC Land reported in February 2026 that Google had reduced Googlebot's file size limit from 15 MB to 2 MB, an 86.7% decrease. What today's episode adds is the architectural explanation: the 15 MB figure is the platform default, while the 2 MB figure is a Googlebot-specific configuration that overrides it. Other crawlers running on the same infrastructure may use entirely different limits. A hypothetical fast-indexing crawler, Illyes speculates, might use a 1 MB limit to push content through a pipeline in seconds. These are per-crawler configurations, not universal rules. A testing tool was updated in February 2026 to let SEO professionals simulate Googlebot's 2 MB threshold on their own pages.
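The described behaviour, stopping consumption at a per-crawler cap without necessarily tearing down the connection, can be sketched as a capped chunked read. This is an assumed illustration of the mechanism, not Google's code; the two constants reflect the figures given in the episode.

```python
import io

GOOGLEBOT_HTML_CAP = 2 * 1024 * 1024     # Search's override, per the episode
PLATFORM_DEFAULT_CAP = 15 * 1024 * 1024  # infrastructure-wide default

def read_capped(stream: io.BufferedIOBase, cap_bytes: int,
                chunk_size: int = 64 * 1024) -> bytes:
    """Read a response body in chunks and stop consuming bytes once the
    per-crawler cap is reached; the client simply stops reading."""
    received = bytearray()
    while len(received) < cap_bytes:
        chunk = stream.read(min(chunk_size, cap_bytes - len(received)))
        if not chunk:
            break  # server finished before the cap was hit
        received.extend(chunk)
    return bytes(received)

body = io.BytesIO(b"x" * (3 * 1024 * 1024))  # a 3 MB response
print(len(read_capped(body, GOOGLEBOT_HTML_CAP)))  # 2097152
```

The same response read under the 15 MB platform default would come through whole; the cap is a property of the client configuration, not of the response.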

Throttling and the "don't break the internet" mandate

The episode devotes considerable time to how the infrastructure prevents any single team - or any individual engineer - from overwhelming a website with requests. This is handled at the infrastructure level, not at the team level. Individual products cannot configure their way around it.

Illyes describes a scenario where a new engineer arrives at Google, gains access to a data center machine with a 10 gigabit connection, and begins streaming data from an external website at full capacity. That cannot happen because engineers must channel their fetch requests through the central infrastructure endpoints. The infrastructure monitors connection times for each domain. When response times begin to lengthen on repeated fetches - a signal that a site is slowing down under load - the infrastructure automatically throttles. A 503 HTTP response triggers an additional slowdown because it signals server overload. By contrast, 403 and 404 responses are treated as routine client errors with no throttling implication.
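The signals described, lengthening response times and 503s slow crawling down while 403s and 404s do not, suggest a per-host adaptive delay. The sketch below is an assumed illustration of that logic; the multipliers and thresholds are invented for the example.

```python
class HostThrottle:
    """Illustrative per-host pacing: back off on overload signals,
    ignore routine client errors."""

    def __init__(self, base_delay: float = 1.0):
        self.delay = base_delay      # seconds to wait between fetches
        self.last_latency = None

    def record(self, latency_seconds: float, status: int) -> None:
        if status == 503:            # server overload: back off hard
            self.delay *= 2
        elif self.last_latency is not None and latency_seconds > 1.5 * self.last_latency:
            self.delay *= 1.5        # responses slowing down: ease off
        elif status in (403, 404):
            pass                     # routine client errors: no pacing change
        self.last_latency = latency_seconds

t = HostThrottle()
t.record(0.2, 200)   # baseline latency
t.record(0.5, 200)   # latency jumped -> delay grows to 1.5s
t.record(0.5, 503)   # overload -> delay doubles to 3.0s
print(round(t.delay, 2))  # 3.0
```

The essential property is that the slowdown is driven by what the server reports and how it behaves, not by anything the requesting team configures.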

This automatic throttling is separate from any manual crawl rate controls. Google deprecated its Crawl Rate Limiter Tool in Search Console in January 2024, replacing it with the same automatic systems. PPC Land covered an August 2025 incident in which crawl rates dropped dramatically for sites hosted on Vercel, WP Engine, and Fastly - a disruption that originated from Google's own systems and was acknowledged by the company on August 28.

There is also aggressive internal caching. Illyes explains that if Google News fetched a page 10 seconds ago, a separate crawler supplying data to web search would simply receive the cached copy rather than triggering a fresh fetch. This cross-product caching reduces redundant traffic. However, he notes that not all products can reuse each other's fetched content - certain teams have policies restricting reuse across specific product combinations, a point with regulatory implications in discussions about how Googlebot data flows into AI training. Cloudflare documented in January 2026 that Googlebot accesses 1.76 times more unique URLs than GPTBot and 3.26 times more than Bingbot, a disparity it attributed to publishers' dependency on search visibility.
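The cross-product reuse can be pictured as a freshness-windowed cache in front of the fetch path. This is a sketch under stated assumptions: the 10-second window comes from Illyes's example, while the structure and names are illustrative.

```python
import time

class CrawlCache:
    """Serve a recently fetched copy to any client instead of re-fetching."""

    def __init__(self, max_age_seconds: float = 10.0):
        self.max_age = max_age_seconds
        self.entries = {}  # url -> (fetched_at, body)

    def get_or_fetch(self, url: str, fetch_fn) -> tuple[bytes, bool]:
        """Return (body, was_cached)."""
        now = time.monotonic()
        hit = self.entries.get(url)
        if hit and now - hit[0] <= self.max_age:
            return hit[1], True          # fresh enough: no new fetch
        body = fetch_fn(url)             # stale or absent: fetch and store
        self.entries[url] = (now, body)
        return body, False

cache = CrawlCache()
fetches = []
def fake_fetch(url):
    fetches.append(url)
    return b"<html>...</html>"

cache.get_or_fetch("https://example.com/", fake_fetch)              # first client fetches
_, cached = cache.get_or_fetch("https://example.com/", fake_fetch)  # second client reuses
print(cached, len(fetches))  # True 1
```

A real system would additionally consult the reuse policies Illyes mentions, since not every pair of products is allowed to share fetched content.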

Geo-crawling: technically possible, deliberately limited

One of the less frequently discussed aspects of Google's crawling infrastructure is its relationship to geographic blocking. By default, Googlebot's IP addresses originate in the United States - specifically Mountain View, California, with the 66.x.x.x range appearing in DNS lookups. A site implementing geo-fencing that blocks non-local traffic will return a 403 or simply drop the connection when Googlebot arrives from a California IP.

Google can work around this, but it chooses not to in most cases. The infrastructure has access to IP address pools assigned to other countries. When the value of content is sufficiently high - Illyes uses a deliberately silly example to illustrate the principle - the infrastructure can lease IPs from those pools to crawl geo-restricted content. However, those IP addresses were not built for high-capacity crawling. "They don't have the capacity to handle crawl for everyone in Romania or Germany," he says, noting that Switzerland might be small enough to crawl fully. Resources are allocated frugally. It is, as Illyes adds, "a very, very, very bad idea to rely on this" as a strategy for ensuring crawlability from specific regions.

The implication for marketers running region-specific landing pages or price-localised product pages is direct: geo-blocking that catches Googlebot is not automatically circumvented. Crawling from non-US IPs is exceptional, not routine.

The infrastructure's origin and scale

The central crawling platform has existed since Google's founding. Its earliest form was, according to Illyes, more or less a Wget script running on a random engineer's workstation in 1998 or 1999. As products multiplied, the infrastructure was re-architected to support a service model in which teams make API calls rather than running their own crawlers. Today, it runs as compiled C++ binaries on remote data center machines - Illyes compares the setup to Google Cloud runner instances - distributed across data centers globally.

The scale is reflected in the monitoring complexity. Illyes describes the work of tracking which crawlers are active, which have been accidentally left running after project shutdowns, and which have crossed the threshold that warrants public documentation. One instance he recalls: a crawler continued fetching from the internet for years after its associated project was cancelled because someone forgot to turn off the background job. The monitoring systems that now exist are designed to catch those cases.

Google revamped its public crawler and fetcher documentation in September 2024, splitting content across separate pages and adding product-impact sections and robots.txt snippets for each user agent. That documentation now lives on the dedicated crawling infrastructure site launched in November 2025. According to the Google for Developers page, the crawling infrastructure is shared across Google Search, Gemini, Google Shopping, AdSense, Google News, and NotebookLM - each product operating as a distinct client of the same underlying service.

Why this matters for search and advertising professionals

The episode is primarily aimed at developers and SEO practitioners, but the implications extend into advertising operations. Crawl frequency and file size limits directly affect how quickly landing page changes appear in Google's index - and therefore how fast ad campaigns can point to updated pricing, updated promotional copy, or corrected inventory information. A landing page that exceeds the 2 MB HTML threshold may be crawled with truncated content, potentially indexing an incomplete version. PPC Land's March 2026 overview of Google's crawling documentation noted that a JavaScript bundle served at 800 KB compressed might decompress to 2.5 MB, placing it above the threshold.
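Because the cap applies to the decoded payload, a resource can look harmless on the wire and still exceed the limit after decompression. A quick check like the following, a hypothetical audit step, not an official tool, makes the gap visible; the 2 MB constant is Googlebot's reported HTML limit.

```python
import gzip

GOOGLEBOT_HTML_CAP = 2 * 1024 * 1024  # reported Googlebot HTML limit

def decompressed_size(gzipped: bytes) -> int:
    """Size of the payload as the crawler would see it after decoding."""
    return len(gzip.decompress(gzipped))

# A highly repetitive 2.5 MB document compresses to far under 2 MB.
raw = (b"<div class='row'>example</div>" * 90_000)[:2_500_000]
wire = gzip.compress(raw)

print(len(wire) < GOOGLEBOT_HTML_CAP)                # True: looks fine on the wire
print(decompressed_size(wire) > GOOGLEBOT_HTML_CAP)  # True: exceeds the cap decoded
```

Auditing the decompressed size of HTML and bundled JavaScript, rather than the transfer size reported by browser tooling, is the safer habit under this limit.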

The SaaS model also matters for understanding how crawl decisions propagate across products. A robots.txt directive that blocks Googlebot-News does not necessarily affect Storebot-Google or Google-Extended. Each product token operates as a separate configuration passed to the central infrastructure. Marketers managing content that feeds into Google Shopping, Gemini, or Google News need to treat each user agent independently rather than assuming that a single robots.txt rule governs all Google crawling behaviour.
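A hypothetical robots.txt makes the point concrete: each product token is addressed as its own user agent group, and blocking one leaves the others untouched.

```
# Hypothetical robots.txt: per-token rules are independent.
User-agent: Googlebot-News
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
```

Here Google News and AI training access are refused while ordinary search crawling continues, because each token is passed to the central infrastructure as a separate configuration.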

The distinction between crawlers and fetchers carries a further implication: user-triggered fetchers such as Google Messages and other link-preview systems generally bypass robots.txt. This means content that appears to be restricted from crawling can still be fetched when users share URLs - a nuance that matters for publishers managing access to paywalled or pre-publication content.

The full transcript of episode 105 is linked in the video description, alongside the crawling infrastructure documentation at developers.google.com/crawlers. The Google Search Central channel, which carries the full episode, also hosts 110 episodes of Search Off the Record.


Summary

Who: Martin Splitt, Search Advocate at Google Switzerland, and Gary Illyes, Analyst at Google, both members of the Google Search Relations team. The audience is web developers, SEO professionals, and technical marketers who interact with Google's crawling infrastructure.

What: In episode 105 of Search Off the Record - published today on the Google Search Central YouTube channel and accompanying podcast - Illyes and Splitt explain that Googlebot is not a standalone program but one of many clients calling a central internal crawling platform structured as software-as-a-service. The episode covers the distinction between crawlers and fetchers, the 15 MB platform-level default file size limit and the 2 MB override applied to Google Search, automatic throttling that prevents individual engineers or teams from overwhelming external servers, the limitations of geo-crawling from non-US IP addresses, and the threshold-based system that determines which crawlers receive public documentation.

When: The episode was published on March 12, 2026. The crawling infrastructure it describes has existed since 1998 or 1999, with the SaaS architecture developing as Google's product portfolio expanded from the early 2000s onward.

Where: The episode is available on YouTube at the Google Search Central channel, which has 774,000 subscribers. A full transcript is linked in the video description. The technical documentation referenced in the episode is hosted at developers.google.com/crawlers.

Why: The episode matters for marketing and technical professionals because it provides the architectural context behind crawling decisions that affect ad campaign performance, indexing speed, robots.txt configuration, and content visibility across Google's product ecosystem - from Search and Shopping to Gemini and NotebookLM. Understanding that Googlebot is one client among many in a shared SaaS infrastructure changes how site owners should interpret crawl behaviour, manage file sizes, and configure user agent-specific robots.txt rules for different Google products.
