Google updates crawling infrastructure documentation with new technical details
Google publishes updated crawling infrastructure documentation on November 20, 2025, adding HTTP caching support details and transfer protocol specifications.
Google published updated documentation for its crawling infrastructure on November 20, 2025, expanding technical specifications for webmasters managing crawler interactions with their websites. The documentation updates provide detailed information about HTTP caching implementation, supported transfer protocols, and content encoding standards used by Google's automated systems.
According to the updated documentation, Google's crawling infrastructure now supports heuristic HTTP caching as defined by the HTTP caching standard. The implementation specifically utilizes ETag response and If-None-Match request headers, alongside Last-Modified response and If-Modified-Since request headers. This marks a significant technical specification update for developers managing server-side caching strategies.
The documentation recommends setting both ETag and Last-Modified values, regardless of which header a given crawler prefers. "These headers are also used by other applications such as CMSes," according to the documentation. When both header fields appear in an HTTP response, Google's crawlers prioritize the ETag value, as required by HTTP standards. Google specifically recommends using ETag rather than Last-Modified to indicate caching preferences, citing ETag's immunity to the date-formatting issues that can complicate Last-Modified implementations.
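The conditional-request behavior described above can be reproduced with a short server-side sketch. The following Python example uses only the standard library and returns 304 Not Modified when a client presents a matching If-None-Match value; the handler name, content, and ETag derivation are illustrative assumptions rather than anything prescribed by Google's documentation.

```python
# A minimal sketch of ETag-based conditional responses (illustrative only).
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

CONTENT = b"<html><body>Hello, crawler.</body></html>"
ETAG = '"%s"' % hashlib.sha256(CONTENT).hexdigest()[:16]

class ConditionalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # If the crawler echoes the cached ETag, confirm the copy is still fresh.
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", ETAG)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(CONTENT)))
        self.end_headers()
        self.wfile.write(CONTENT)

if __name__ == "__main__":
    HTTPServer(("", 8000), ConditionalHandler).serve_forever()
```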
Transfer protocol support received detailed clarification in the updated documentation. Google's crawlers and fetchers support both HTTP/1.1 and HTTP/2. The crawlers choose whichever protocol version offers the best crawling performance and may switch protocols between crawling sessions based on previous crawling statistics. HTTP/1.1 remains the default protocol version for Google's crawlers. Crawling over HTTP/2 may save computing resources for both websites and Googlebot, though it brings no Google-specific product benefit to sites, including no ranking improvement in Search.
Website administrators seeking to opt out from HTTP/2 crawling can instruct servers to respond with 421 HTTP status codes when Google attempts HTTP/2 access. The documentation acknowledges alternative blocking methods exist when server-level configuration proves infeasible, including temporary solutions through direct communication with Google's Crawling team.
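How a server might enforce this opt-out is deployment-specific. The following minimal sketch assumes a Python ASGI stack (for example, an application served by Hypercorn with HTTP/2 enabled); the middleware name is illustrative, and the simple user-agent match stands in for full crawler verification, neither being prescribed by Google's documentation.

```python
# A sketch of refusing HTTP/2 crawling at the application layer (illustrative).
def http2_opt_out(app):
    async def middleware(scope, receive, send):
        if scope["type"] == "http" and scope.get("http_version") == "2":
            headers = dict(scope.get("headers") or [])
            if b"googlebot" in headers.get(b"user-agent", b"").lower():
                # 421 Misdirected Request, the status the documentation suggests
                # for declining HTTP/2 crawling.
                await send({"type": "http.response.start", "status": 421,
                            "headers": [(b"content-length", b"0")]})
                await send({"type": "http.response.body", "body": b""})
                return
        await app(scope, receive, send)
    return middleware
```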
Content encoding specifications now appear explicitly in the documentation. Google's crawlers and fetchers support gzip, deflate, and Brotli compression methods. Each Google user agent advertises the encodings it supports in the Accept-Encoding header of its requests. A typical header reads: "Accept-Encoding: gzip, deflate, br".
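For servers that negotiate compression in application code, a minimal sketch might look like the following; the helper name is an assumption, and Brotli support would additionally require the third-party brotli package, which is omitted here.

```python
# A sketch of picking a Content-Encoding from the crawler's Accept-Encoding header.
import gzip
import zlib

def pick_encoding(accept_encoding: str, body: bytes):
    offered = {token.strip().split(";")[0] for token in accept_encoding.split(",")}
    if "gzip" in offered:
        return "gzip", gzip.compress(body)
    if "deflate" in offered:
        return "deflate", zlib.compress(body)  # HTTP "deflate" means zlib-wrapped data
    return "identity", body  # no supported encoding advertised

# Googlebot typically sends: Accept-Encoding: gzip, deflate, br
encoding, payload = pick_encoding("gzip, deflate, br", b"<html>...</html>")
```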
The documentation addresses crawl rate management with specific guidance for emergency situations. Sites experiencing server strain from Google's crawling requests should return 500, 503, or 429 HTTP response status codes instead of 200 codes. Google's crawling infrastructure automatically reduces crawl rates when encountering significant numbers of these error status codes. The reduced crawl rate affects entire hostnames, including both URLs returning errors and those returning content successfully.
Google cautions against prolonged use of error responses for crawl rate management. "We don't recommend that you do this for a long period of time (meaning, longer than 1-2 days) as it may have a negative effect on how your site appears in Google products," according to the documentation. Googlebot may drop URLs from Google's index if error status codes persist for multiple days.
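A bare-bones way to apply this guidance in a WSGI application is sketched below; the OVERLOADED flag and handler are illustrative assumptions, and, per the documentation, the error responses should only be served briefly.

```python
# A sketch of temporarily shedding crawler load in a WSGI app (illustrative).
OVERLOADED = False  # flipped by monitoring when the origin is under strain

def application(environ, start_response):
    if OVERLOADED:
        # 503 (or 429) prompts Google's crawlers to slow down for the whole hostname.
        start_response("503 Service Unavailable",
                       [("Retry-After", "3600"), ("Content-Length", "0")])
        return [b""]
    body = b"<html><body>OK</body></html>"
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Content-Length", str(len(body)))])
    return [body]
```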
For situations where serving errors proves infeasible, website administrators can file special requests to report problems with unusually high crawl rates. These requests should specify optimal crawl rates for affected sites. Google notes that crawl rate increase requests are not accepted, and fulfilling reduction requests may require several days for evaluation and implementation.
The documentation updates include robots.txt file creation and submission guidelines. Robots.txt files must be named exactly "robots.txt" and located at site roots to control crawler access. Each site can maintain only one robots.txt file. The files apply exclusively to paths within specific protocols, hosts, and ports where they are posted. For example, rules in https://example.com/robots.txt apply only to files in https://example.com/, not to subdomains like https://m.example.com/ or alternate protocols such as http://example.com/.
Robots.txt files must use UTF-8 encoding, which includes ASCII. Google may ignore characters outside UTF-8 ranges, potentially invalidating robots.txt rules. The documentation emphasizes that files can be posted on subdomains and non-standard ports, with each location requiring separate robots.txt files.
Rule structures in robots.txt files consist of groups targeting specific user agents. Each group begins with a User-agent line specifying the target, followed by disallow or allow directives controlling access. Crawlers process groups from top to bottom, and each user agent matches only the first, most specific group that applies to it. The documentation notes that, by default, crawling is permitted for any page or directory not blocked by a disallow rule.
The wildcard character (*) can be used as a path prefix, suffix, or entire string in all rules except the sitemap directive. The documentation provides detailed examples of rule structures that block specific crawlers while permitting others. Sitemap directives require fully qualified URLs and indicate content Google should crawl rather than restricting access to it.
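The grouping and allow/disallow behavior can be spot-checked locally. The sketch below uses Python's built-in urllib.robotparser with an illustrative robots.txt; the standard-library parser differs from Google's open-source parser in some edge cases, so treat it as a quick sanity check rather than an authoritative reference.

```python
# A sketch that exercises user-agent groups and allow/disallow rules (illustrative).
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /private/press/
Disallow: /private/

User-agent: *
Disallow: /tmp/

Sitemap: https://example.com/sitemap.xml
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/private/press/kit.html"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/private/notes.html"))      # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/tmp/report.html"))      # False
```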
Crawler verification processes received updated specifications in the documentation. Google's crawlers identify themselves through HTTP user-agent request headers, source IP addresses, and reverse DNS hostnames of source IPs. The documentation categorizes crawlers into three types: common crawlers used for Google products, special-case crawlers for specific products with agreed crawling processes, and user-triggered fetchers activated by end-user requests.
Common crawlers, including Googlebot, consistently respect robots.txt rules for automatic crawls. These crawlers use IP addresses within specific ranges identifiable through the reverse DNS masks "crawl-***-***-***-***.googlebot.com" or "geo-crawl-***-***-***-***.geo.googlebot.com". Special-case crawlers like AdsBot may ignore the global robots.txt user agent (*) rules with the publisher's permission. User-triggered fetchers such as Google Site Verifier bypass robots.txt rules entirely because users initiate the fetches directly.
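The reverse DNS identification described above is what makes crawler verification possible on the site side. The following sketch performs the reverse-then-forward lookup commonly used to confirm a request really came from Google's infrastructure; the example IP address is illustrative, and production systems would typically also consult Google's published crawler IP ranges.

```python
# A sketch of the reverse-then-forward DNS check for verifying Googlebot (illustrative).
import socket

def is_google_crawler(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the original IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

print(is_google_crawler("66.249.66.1"))  # an address from a range Google has used for crawling
```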
The documentation clarifies that Google's crawling infrastructure operates through thousands of machines running simultaneously to improve performance as web content scales. To optimize bandwidth usage, crawlers are distributed across multiple datacenters worldwide, located near the sites they access, so websites may observe visits from several IP addresses. Google crawls primarily from IP addresses in the United States, though it may attempt crawling from other countries if it detects that a site is blocking requests from the United States.
Technical infrastructure supporting Google's crawlers includes FTP and FTPS protocol support, though crawling through these protocols occurs rarely. The documentation notes FTP support as defined by RFC 959 and its updates, and FTPS support as defined by RFC 4217 and its updates.
Individual Google crawlers and fetchers determine whether to implement HTTP caching based on associated product needs. Googlebot supports caching when re-crawling URLs for Google Search. Storebot-Google supports caching only under specific conditions. The documentation recommends website administrators contact hosting providers or content management system vendors for HTTP caching implementation guidance.
The documentation addresses Last-Modified header formatting requirements to avoid parsing issues. Google recommends using the date format: "Weekday, DD Mon YYYY HH:MM:SS Timezone." An example format reads: "Fri, 4 Sep 1998 19:15:56 GMT." While not required, setting the max-age field of Cache-Control response headers helps crawlers determine when to recrawl specific URLs. The value should specify expected seconds content will remain unchanged, such as "Cache-Control: max-age=94043."
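Generating the recommended header values is straightforward with Python's standard library, as the sketch below shows; the one-day max-age is an illustrative choice, not a value recommended in the documentation.

```python
# A sketch of emitting RFC-compliant Last-Modified and Cache-Control headers (illustrative).
import time
from email.utils import formatdate

headers = {
    # Produces an HTTP date such as "Thu, 20 Nov 2025 10:00:00 GMT"
    "Last-Modified": formatdate(time.time(), usegmt=True),
    "Cache-Control": "max-age=86400",  # content expected to stay unchanged for a day
}
print(headers)
```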
Crawl rate optimization remains a priority for Google's systems. The documentation states Google's goal involves crawling as many pages as possible on each visit without overwhelming servers. Sites struggling with crawling requests can reduce crawl rates, though inappropriate HTTP response codes may affect site appearances in Google products.
Google's crawler infrastructure shares resources across multiple products, so following best practices improves how efficiently web content is discovered and included in features across Google's ecosystem. The updated documentation serves Google Search, Discover, Images, Video, News, Shopping, and various specialized services relying on web content crawling.
The documentation updates arrive as technical SEO continues evolving with increasing complexity in content management systems and hosting architectures. Website administrators managing large-scale sites or experiencing crawling issues now have more detailed technical specifications for troubleshooting and optimization.
For websites using hosting services like Wix or Blogger, the documentation notes providers may expose search settings or mechanisms controlling page visibility without direct robots.txt file editing access. Administrators should search provider documentation for instructions about modifying page visibility in search engines.
The updated documentation remains accessible through Google's Search Central website. Google recommends reviewing updated information to ensure sites are configured optimally for various crawlers and user-triggered fetchers as search technologies continue advancing.
Timeline
- November 20, 2025: Google publishes updated crawling infrastructure documentation with HTTP caching and transfer protocol specifications
- August 29, 2025: Google addresses JavaScript-based paywall guidance for publishers implementing client-side restrictions
- August 28, 2025: Google confirms crawl rate problems affecting multiple hosting platforms have been resolved
- March 18, 2025: Google updates crawler verification processes with daily IP range refreshes for enhanced security
- December 7, 2024: Google details comprehensive web crawling process in new technical document explaining four-stage rendering
- September 16, 2024: Google revamps documentation for crawlers and user-triggered fetchers with product impact information
- May 19, 2024: Google launches new crawlers GoogleOther-Image and GoogleOther-Video for specialized data gathering
Summary
Who: Google's crawling infrastructure team published updated technical documentation for website administrators, developers, and SEO professionals managing crawler interactions.
What: The documentation updates include detailed specifications for HTTP caching implementation using ETag and Last-Modified headers, transfer protocol support for HTTP/1.1 and HTTP/2, content encoding standards including gzip, deflate, and Brotli compression, crawl rate management procedures using HTTP status codes, robots.txt file creation guidelines, and crawler verification processes across three categories of automated systems.
When: Google published the updated crawling infrastructure documentation on November 20, 2025, expanding on previous technical specifications and providing more detailed implementation guidance for webmasters.
Where: The documentation appears on Google's Search Central website under the crawling infrastructure section, accessible to all website administrators and developers managing server-side configurations for Google's automated crawling systems.
Why: The updates provide website administrators with more precise technical specifications for managing crawler interactions, addressing server performance concerns, implementing effective caching strategies, and optimizing crawl budgets. The documentation helps sites configure systems appropriately for Google's crawlers while maintaining server stability and efficient resource utilization across the search engine's distributed crawling infrastructure.