Google today published two posts on the Google Search Central Blog - both dated Tuesday, March 31, 2026 - that together offer the most detailed public account of its crawling infrastructure in years. One post explains, in technical depth, how Googlebot actually fetches and processes web content. The other announces a change in the location of the JSON files that list Google's crawler IP ranges. Taken together, the two announcements fill in gaps that have persisted in public documentation for decades and carry direct implications for anyone who manages a website or runs digital advertising.
The Googlebot myth, dismantled
The first post, titled "Inside Googlebot: demystifying crawling, fetching, and the bytes we process," opens with a correction that has been a long time coming. According to the post, the name "Googlebot" is a historical misnomer. Back in the early 2000s, Google had one product and one crawler, so the name stuck. Today, Googlebot is not a standalone program. It is, according to the post, "just a user of something that resembles a centralized crawling platform."
That platform serves dozens of clients. When a request from Google Shopping, AdSense, or any other Google product reaches a web server, it routes through the same underlying infrastructure - just under different crawler names. The post states: "When you see Googlebot in your server logs, you are just looking at Google Search." PPC Land reported on March 12, 2026 that this architecture was first described in detail by Google engineers Gary Illyes and Martin Splitt in episode 105 of the Search Off the Record podcast, published on the Google Search Central YouTube channel; the episode drew 678 views within hours of release. Today's blog post is the written distillation of that episode, described by the post itself as produced "with more clarity and less bickering."
The 2MB limit: what it means, byte by byte
The technical centrepiece of today's first post concerns how many bytes Googlebot will actually fetch from a single URL. According to the post, Googlebot currently fetches up to 2MB for any individual URL, excluding PDFs. For PDF files, the limit is 64MB. For any crawler that does not specify a limit, the platform default is 15MB regardless of content type.
What happens when a page exceeds that 2MB threshold is precise and unforgiving. According to the post, there are four consequences. First, partial fetching: if an HTML file exceeds 2MB, Googlebot does not reject it but stops the download exactly at that cutoff, and the limit includes HTTP request headers. Second, the downloaded portion - the first 2MB - is passed to indexing systems and the Web Rendering Service (WRS) as if it were the complete file. Third, any bytes beyond the 2MB mark are entirely ignored: not fetched, not rendered, not indexed. Fourth, every resource referenced in the HTML - excluding media, fonts, and a handful of exotic file types - is fetched by the WRS separately, each with its own per-URL byte counter that does not count toward the parent page's total.
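The truncation behaviour described above can be sketched in a few lines. This is an illustrative model of the rule as the post describes it, not Google's actual code: the `truncate_fetch` function and the constant names are hypothetical, and the detail that response header bytes count against the cap follows the post's wording.

```python
# Sketch of the per-URL byte cap as described in the post (not Google's code).
# Header bytes count toward the limit, and truncation is silent: the partial
# document is handed to indexing and the WRS as if it were complete.

FETCH_LIMIT_HTML = 2 * 1024 * 1024      # 2MB cap for HTML (Google Search)
FETCH_LIMIT_PDF = 64 * 1024 * 1024      # 64MB cap for PDFs
FETCH_LIMIT_DEFAULT = 15 * 1024 * 1024  # platform default for other crawlers

def truncate_fetch(header_bytes: bytes, body: bytes, limit: int) -> bytes:
    """Return the body portion a capped fetcher would pass downstream."""
    budget = limit - len(header_bytes)  # headers eat into the per-URL budget
    if budget <= 0:
        return b""
    return body[:budget]  # bytes past the cap are never fetched or indexed
```

Note that a page under the cap passes through untouched; the cliff edge only matters for documents whose headers plus body exceed the limit.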
PPC Land covered the 2MB limit in February 2026, describing it as an 86.7% reduction from a previous 15MB figure, and noting that the uncompressed nature of the limit matters: a page delivering 500KB of compressed HTML could theoretically approach the threshold once decompressed. A testing tool was subsequently updated to simulate the cutoff, allowing SEO professionals to check whether pages exceed Googlebot's threshold before relying on server logs.
The post illustrates where the limit bites hardest. Sites that embed large base64-encoded images inline, include massive blocks of inline CSS or JavaScript, or open their HTML with megabytes of navigation menus run the risk of pushing textual content and structured data past the 2MB mark. If those bytes are not fetched, the post notes simply: "to Googlebot, they simply don't exist."
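The compression point from the February coverage is easy to check locally. The sketch below, using only Python's standard library, compares a page's on-the-wire gzip size against the uncompressed size the cap applies to; the `check_page_size` helper and the repeated-navigation test page are illustrative, not part of any official tooling.

```python
import gzip

LIMIT = 2 * 1024 * 1024  # the 2MB cap applies to uncompressed bytes

def check_page_size(html: bytes) -> dict:
    """Compare transfer size (gzip) with the decompressed size Googlebot counts."""
    compressed = gzip.compress(html)
    return {
        "compressed_kb": len(compressed) // 1024,
        "uncompressed_kb": len(html) // 1024,
        "over_limit": len(html) > LIMIT,
    }

# Highly repetitive markup (e.g. a megabyte navigation menu) compresses well,
# so a modest transfer size can hide a document that blows past the cap:
page = b"<nav>" + b"<a href='/x'>link</a>" * 120_000 + b"</nav>"
report = check_page_size(page)
```

In this constructed example the page exceeds the limit uncompressed while shipping a far smaller gzip payload, which is exactly the scenario the coverage warned about.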
Rendering: what happens after the fetch
Once Googlebot retrieves those bytes, responsibility passes to the WRS. According to the post, the WRS executes client-side JavaScript much as a modern browser does, pulling in JavaScript and CSS files and processing XHR requests to understand the page's textual content and structure. It does not request images or videos. The 2MB limit applies individually to each resource the WRS fetches during rendering.
Two constraints on the WRS matter for anyone building dynamic, JavaScript-heavy applications. First, the WRS can only execute code that the fetcher actually retrieved - if the crawl stopped at 2MB, the renderer works with whatever arrived. Second, the WRS operates statelessly, clearing local storage and session data between requests. Sites that depend on persisted client-side state to serve content correctly may find Googlebot encounters an incomplete or broken view of the page.
Google's December 2024 documentation had already established that WRS implements a 30-day caching system for JavaScript and CSS resources, independent of HTTP caching directives - a detail that interacts with the 2MB limit in complex ways during deployment and cache invalidation.
Image and video crawlers: variable thresholds
The post notes that image and video crawlers operate differently. Their byte thresholds vary widely depending on the product they serve. Fetching a favicon, for example, carries a very low limit. Image Search fetches at a substantially higher threshold. These per-product configurations are set by individual teams within Google's crawling platform, not by a single global policy. The implication is that the 2MB figure cited for Googlebot applies specifically to Google Search, not to the ecosystem of crawlers as a whole.
Best practices outlined in today's post
The post outlines three technical recommendations for site operators. Keep HTML lean by moving heavy CSS and JavaScript to external files - while the initial HTML document is capped at 2MB, external scripts and stylesheets are fetched separately under their own limits. Prioritise critical elements by placing meta tags, title elements, link elements, canonicals, and essential structured data higher in the HTML document, ensuring they fall before the 2MB cutoff. Monitor server logs for response time issues: if a server struggles to serve bytes, Google's fetchers will automatically reduce their crawl frequency to avoid overloading the infrastructure.
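The second recommendation - placing critical elements before the cutoff - can be audited with a simple offset check. The sketch below is a hypothetical helper, not an official tool: it reports the byte offset at which each critical marker first appears and whether that offset falls inside the 2MB window.

```python
LIMIT = 2 * 1024 * 1024

# Markers for elements the post says should land before the cutoff.
CRITICAL_MARKERS = [b"<title", b'rel="canonical"', b'name="description"']

def critical_tags_within_cap(html: bytes, limit: int = LIMIT) -> dict:
    """Report each marker's first byte offset and whether it survives the cut."""
    report = {}
    for marker in CRITICAL_MARKERS:
        offset = html.find(marker)  # -1 if the marker is absent entirely
        report[marker.decode()] = {
            "offset": offset,
            "within_cap": 0 <= offset < limit,
        }
    return report
```

Running this against rendered HTML (for example, a saved copy of the page as served to Googlebot) flags any canonical, title, or description that a 2MB cutoff would discard.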
The post also acknowledges that the 2MB limit is not permanent: "this limit is not set in stone and may change over time as the web evolves and HTML pages grow in size." PPC Land reported in early March 2026 that the median mobile page has grown from 816 kilobytes to 2.3 megabytes, meaning the current 2MB cap already sits below the typical page weight for mobile content.
IP ranges move to a new home
The second post today, shorter and more operational, addresses the location of the JSON files that list Google's crawler IP addresses. According to the post, those files currently sit under the /search/apis/ipranges/ directory on developers.google.com. Since the IP ranges apply to more than just Google Search crawlers, Google is moving them to a more general location: developers.google.com/crawling/ipranges/.
The documentation has already been updated to point to the new path. According to the post, the files will remain available at the old /search/ path "for the time being" to allow organisations time to update their systems. The old locations will be phased out and redirected to the new ones within 6 months. The post, authored by Gary Illyes, advises operators to switch to the new /crawling/ipranges/ path as soon as possible.
The move is not a change to the IP ranges themselves - it is a reorganisation of where those ranges are documented. But the practical consequences for anyone using allowlists to manage server access are real. Scripts, firewall rules, and security automation tools that fetch IP data from the old path will eventually encounter redirects, and then, after the 6-month window, failures if not updated. Google introduced daily refreshes to those IP range files in March 2025, replacing a previous weekly schedule, meaning the files are already updated more frequently than many operators may expect.
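For operators maintaining such automation, the path change amounts to a mechanical URL rewrite. The sketch below shows one way to migrate stored fetch URLs ahead of the redirect window closing; the `migrate_url` helper is hypothetical, and the `commoncrawlers.json` filename follows the naming used in Google's documentation as reported here.

```python
# Base paths from today's announcement: the /search/ location is deprecated,
# the /crawling/ location is the new canonical home for the JSON files.
OLD_BASE = "https://developers.google.com/search/apis/ipranges/"
NEW_BASE = "https://developers.google.com/crawling/ipranges/"

def migrate_url(url: str) -> str:
    """Rewrite an old /search/ IP-range URL to the new /crawling/ path."""
    if url.startswith(OLD_BASE):
        return NEW_BASE + url[len(OLD_BASE):]
    return url  # non-matching URLs (other hosts, other paths) pass through
```

Applying this across firewall configs and cron scripts before the 6-month window closes avoids both the redirect overhead and the eventual hard failures.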
Three categories of crawlers, one verification process
The verification documentation, last updated March 20, 2026 UTC, distinguishes three categories of crawlers. Common crawlers - including Googlebot - always respect robots.txt rules for automatic crawls, and their IP addresses resolve to reverse DNS masks ending in googlebot.com. Special-case crawlers - such as AdsBot - perform specific functions for Google products where there is an existing agreement between the crawled site and Google's product team; they may or may not respect robots.txt rules. User-triggered fetchers - tools and product functions where a user initiates a fetch - ignore robots.txt because the fetch is performed at human request rather than by autonomous discovery.
The verification process for any of these three types follows the same two-method structure documented by Google. For one-off lookups, a reverse DNS lookup on the accessing IP address using the host command, followed by a forward DNS lookup to confirm the IP matches, is sufficient. For large-scale automated verification, matching the crawler's IP against the published JSON files - commoncrawlers.json, specialcrawlers.json, usertriggeredfetchers.json, usertriggeredfetchers-google.json, and usertriggeredagents.json - provides a scalable solution. All IP addresses in those files use CIDR format.
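Both verification methods map directly onto standard library calls. The sketch below is illustrative, not Google's reference implementation: the function names are hypothetical, the `.googlebot.com` suffix check follows the article's description of common crawlers, and the CIDR prefixes would in practice come from the published JSON files.

```python
import ipaddress
import socket

def verify_by_dns(ip: str) -> bool:
    """One-off check: reverse DNS lookup, then forward-confirm the hostname."""
    host, _aliases, _addrs = socket.gethostbyaddr(ip)  # reverse lookup
    if not host.endswith(".googlebot.com"):  # mask for common crawlers
        return False
    # Forward lookup must resolve back to the same IP to rule out spoofing.
    return ip in socket.gethostbyname_ex(host)[2]

def verify_by_ranges(ip: str, cidr_prefixes: list[str]) -> bool:
    """Scalable check: match the IP against published CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in cidr_prefixes)
```

The DNS method requires a live lookup per IP, which is why Google's documentation positions the JSON-based CIDR match as the option for large-scale automated verification.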
Google added the Google-Agent user-triggered fetcher on March 20, 2026, introducing a new usertriggeredagents.json file to the same IP range infrastructure being reorganised today. AI agents - systems like Project Mariner that browse the web on behalf of users - now have an official documented identity within the same verification framework.
Why this matters for marketing and advertising professionals
The crawling infrastructure underpins more than organic search rankings. AdsBot, one of the special-case crawlers documented in Google's verification framework, checks ad landing page quality and relevance. If a landing page's critical content - pricing information, product descriptions, calls to action - sits beyond the 2MB mark in the HTML, AdsBot may not retrieve it. The quality signal it forms could be based on incomplete data. This is not a hypothetical edge case for advertisers running JavaScript-heavy single-page applications or pages with large inline assets.
PPC Land's coverage of Google's crawler monopoly in January 2026 noted that Cloudflare data showed Googlebot accessing approximately 8% of observed web pages over a two-month period - 3.2 times more unique URLs than OpenAI's GPTBot and 4.8 times more than Microsoft's Bingbot. That scale means infrastructure decisions like the 2MB cap and the IP range reorganisation affect not just SEO professionals but the entire advertising technology ecosystem that depends on crawler traffic for measurement, attribution, and quality scoring.
Google's November 2025 documentation migration moved crawling documentation from Google Search Central to a new dedicated Crawling Infrastructure site, reflecting the fact that these systems serve Shopping, News, Gemini, AdSense, and other products, not just organic search. Today's IP range relocation from /search/apis/ipranges/ to /crawling/ipranges/ completes that conceptual shift at the file system level.
Timeline
- March 30, 2022 - Google and Microsoft release crawler IP addresses publicly, establishing the JSON file format still in use today
- September 16, 2024 - Google revamps documentation for crawlers and user-triggered fetchers, splitting content across separate pages per crawler type
- December 3, 2024 - Google publishes detailed crawling process documentation explaining the four-stage WRS rendering pipeline
- March 18, 2025 - Google switches IP range refresh schedule from weekly to daily following feedback from large network operators
- November 20, 2025 - Google migrates crawling documentation to a dedicated Crawling Infrastructure site, adding HTTP caching and transfer protocol specifications
- December 18, 2025 - Google clarifies JavaScript rendering for error pages, establishing that rendering may be skipped for non-200 HTTP status codes
- January 31, 2026 - Cloudflare data shows Googlebot accesses 3.2x more unique URLs than GPTBot, prompting UK CMA consultation
- February 6, 2026 - Google confirmed to have reduced Googlebot's file size limit from 15MB to 2MB, an 86.7% reduction
- February 7, 2026 - Third-party testing tool updated to simulate the 2MB HTML fetch limit
- March 3, 2026 - Google publishes nine-point crawling overview under its Crawling Infrastructure documentation
- March 12, 2026 - Google engineers explain Googlebot's shared SaaS architecture in Search Off the Record episode 105
- March 20, 2026 - Google adds Google-Agent to user-triggered fetchers, introducing usertriggeredagents.json
- March 31, 2026 - Google publishes "Inside Googlebot" blog post detailing 2MB fetch limits and WRS rendering constraints, and announces relocation of IP range JSON files from /search/apis/ipranges/ to /crawling/ipranges/ with a 6-month transition window
Summary
Who: Google Search Central, specifically engineer and analyst Gary Illyes, published both announcements. The primary audiences are website administrators, SEO professionals, and digital advertising teams who manage how Google's automated systems interact with their infrastructure.
What: Two blog posts were published simultaneously. The first provides a comprehensive technical account of how Googlebot - now clarified as a name for one client of a centralised crawling platform - fetches and processes web content, with detailed specifications on the 2MB per-URL fetch limit, the 64MB PDF limit, the 15MB platform default, and WRS rendering constraints. The second announces the relocation of Google's crawler IP range JSON files from /search/apis/ipranges/ to /crawling/ipranges/ on developers.google.com, with a 6-month transition window before the old paths are phased out.
When: Both posts were published on Tuesday, March 31, 2026 - the last day of the first quarter of 2026.
Where: The announcements appeared on the Google Search Central Blog at developers.google.com. The IP range files are being moved within the same domain, from the /search/ directory tree to a new /crawling/ directory that reflects the broader scope of the infrastructure.
Why: The technical blog post distils content from Search Off the Record podcast episode 105, published March 12, 2026, with the goal of providing written clarity on crawling mechanics that have historically been poorly understood. The IP range relocation reflects an organisational decision to acknowledge that Google's crawling infrastructure serves products far beyond Search - including Shopping, AdSense, Gemini, and AI agents - and therefore belongs under a more general path than one prefixed with /search/.