Gary Illyes: the web's JavaScript mess is an AI agent nightmare

by Luis Rijo
Luís Rijo is a seasoned marketing professional with over 10 years of experience in Digital Marketing, Search, Social, Display, Video, and DOOH. Based in Europe. Also writing in the spend. Reach out via luis@ppc.land
June 1, 2026


Gary Illyes, analyst at Google, today made a pointed observation on LinkedIn: AI agents and large language models would have had a far easier time processing the web if sites had used clean HTML and avoided JavaScript, or at least applied server-side rendering. The post lands with particular irony in mid-2026, a moment when the industry is actively debating whether llms.txt files, structured data layers, and agentic protocols can solve problems that, according to Illyes, stem from decades of web architecture choices.

The post received 44 reactions and generated 3 comments and 3 reposts within hours of publication on June 1, 2026.

What Illyes said

According to Illyes: "AI agents and LLMs in general would have had an easy life on Web 1.0. Reading through a bunch of LinkedIn posts about what you 'need' to do for LLMs and LLM enhanced automations that can handle ambiguity (which ai agents are and I'll die on that hill), if sites used relatively good HTML and no JavaScript (or SSR), both base model training, and web and agentic RAG would be a piece of cake from raw data processing perspective."

The irony is deliberate. LinkedIn in 2026 is saturated with posts advising marketers and developers what they "need" to do to make their sites readable by AI systems - adopt llms.txt, implement structured data, configure robots.txt for AI crawlers, optimize for RAG retrieval. Illyes is pointing out that all of this effort is remedial. It addresses a problem that would not exist, or would be substantially smaller, if the web had maintained the structural simplicity it had in its first decade.

He is not making a technical recommendation for what to do today. He is making a structural observation about how the web got to a place where agentic RAG - the process by which AI agents retrieve and use live web content to ground their responses - is a non-trivial engineering challenge in 2026.

The RAG problem with modern HTML

Retrieval-augmented generation (RAG) is the mechanism that allows an LLM or AI agent to pull current information from the web rather than rely solely on its training data. It is the backbone of AI search features, agentic task completion, and real-time knowledge retrieval. The process sounds straightforward: fetch a page, extract the text, feed it to the model.

The reality is more complicated. A large share of the modern web delivers its content via JavaScript that executes in the browser after the initial HTTP response. The raw HTML returned by the server contains little or no readable text - just script tags, framework bootstraps, and empty container elements. The actual content appears only after a JavaScript engine runs, makes API calls, and populates the DOM. For a person using a browser, this is invisible. For a crawler or RAG retrieval system that reads only the raw HTTP response, the page is effectively blank.
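The gap can be illustrated with a minimal sketch, assuming a retrieval system that reads only the raw response and does not execute JavaScript. The two HTML documents below are hypothetical stand-ins, not taken from any real site, and the parser is a simplified illustration built on Python's standard library rather than any production crawler.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <template>."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-rendering reader could extract from raw HTML."""
    parser = VisibleTextParser()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical client-side-rendered shell: content arrives only after app.js runs.
CSR_SHELL = """<!doctype html>
<html><head><title>Shop</title><script src="/static/app.js"></script></head>
<body><div id="root"></div><script>window.__BOOT__ = {};</script></body></html>"""

# Hypothetical static page: content is present in the initial response.
STATIC_PAGE = """<!doctype html>
<html><head><title>Shop</title></head>
<body><h1>Blue widget</h1><p>In stock. Ships in two days.</p></body></html>"""

if __name__ == "__main__":
    print(repr(visible_text(CSR_SHELL)))
    print(repr(visible_text(STATIC_PAGE)))
```

In this sketch the shell yields only its title, while the static page yields the heading and paragraph a model would actually need.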

Server-side rendering (SSR) solves this: the server executes the JavaScript before sending the response, so the initial HTML contains the full rendered content. Illyes' parenthetical "(or SSR)" acknowledges this as the practical path for sites that have already committed to JavaScript frameworks. But SSR is not universal. Many sites built on React, Vue, Angular, and similar frameworks rely on client-side rendering by default, and retrofitting SSR is a non-trivial engineering task.
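The difference between the two approaches can be sketched in a few lines. This is an illustration only: real SSR runs the framework's own components on the server, whereas the functions below use plain string templating as a stand-in. The product data, function names, and script path are all hypothetical.

```python
from html import escape

# Hypothetical product record that both responses are meant to present.
PRODUCT = {"name": "Blue widget", "price": "19.00 EUR"}


def client_side_response() -> str:
    """Initial HTML for a client-rendered page: an empty container plus a script."""
    return (
        '<!doctype html><html><body><div id="root"></div>'
        '<script src="/static/app.js"></script></body></html>'
    )


def server_side_response(product: dict) -> str:
    """Initial HTML for a server-rendered page: content is already in the markup."""
    return (
        '<!doctype html><html><body><div id="root">'
        f'<h1>{escape(product["name"])}</h1>'
        f'<p>Price: {escape(product["price"])}</p>'
        # The script can still attach interactivity after load.
        '</div><script src="/static/app.js"></script></body></html>'
    )


if __name__ == "__main__":
    print("Blue widget" in client_side_response())
    print("Blue widget" in server_side_response(PRODUCT))
```

Both responses load the same script, but only the server-rendered one carries the product name in the bytes a crawler receives.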

The scale of the problem is documented in Google's own infrastructure data. A crawling overview published by Google in March 2026 noted that the median mobile page has grown from 816 kilobytes to 2.3 megabytes and now requires more than 60 separate files to load. To capture an accurate snapshot, Google may need to crawl the same URL multiple times as new elements are added. That is the web that RAG systems are trying to read.

The llms.txt context

The Illyes post sits directly inside a live industry debate about llms.txt - a protocol proposed by Jeremy Howard in September 2024 that asks sites to publish a plain-text file at /llms.txt summarising their content in a form AI systems can easily consume. The premise of llms.txt is that modern HTML is too noisy, too JavaScript-heavy, and too complex for LLMs to process reliably, so a separate simplified text layer is needed.
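For readers who have not seen one, a rough sketch of generating such a file follows. It assumes the layout described in the public llms.txt proposal (a title heading, a short blockquote summary, then sections of annotated links); the site name, URLs, and helper function are hypothetical, and the proposal itself should be consulted for the authoritative format.

```python
def build_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Assemble an llms.txt-style document.

    `sections` maps a section heading to a list of (name, url, note) tuples.
    The layout is an assumption based on the public proposal.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content used only for illustration.
EXAMPLE = build_llms_txt(
    "Example Shop",
    "Online store selling widgets, with public pricing and setup guides.",
    {
        "Docs": [
            ("Pricing", "https://example.com/pricing.md", "Plans and prices"),
            ("Setup guide", "https://example.com/setup.md", "Installation steps"),
        ]
    },
)

if __name__ == "__main__":
    print(EXAMPLE)
```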

PPC Land reported in July 2025 that llms.txt adoption had stalled as major AI platforms ignored the proposed standard, with Ahrefs analysis confirming that no major LLM provider - not OpenAI, not Anthropic, not Google - was parsing the file. Server log evidence showed AI crawlers were not requesting llms.txt files during site visits at all.
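Site owners can run a similar check on their own logs. The sketch below assumes access logs in the common "combined" format and matches a small, illustrative list of AI crawler user-agent tokens. The sample log lines are invented for the example, and the token list is an assumption that would need to be kept current.

```python
import re
from collections import Counter

# Matches the request, status, size, referer, and user-agent fields
# of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Illustrative user-agent tokens; not an exhaustive or authoritative list.
AI_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def count_ai_requests(lines):
    """Return (all requests per AI token, /llms.txt requests per AI token)."""
    totals, llms_hits = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        token = next((t for t in AI_AGENT_TOKENS if t in match["agent"]), None)
        if token is None:
            continue  # Not one of the crawlers being tracked.
        totals[token] += 1
        if match["path"].split("?", 1)[0] == "/llms.txt":
            llms_hits[token] += 1
    return totals, llms_hits


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Jun/2026:10:00:02 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [01/Jun/2026:10:02:00 +0000] "GET /llms.txt HTTP/1.1" 200 800 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    totals, llms_hits = count_ai_requests(SAMPLE_LOG)
    print(dict(totals), dict(llms_hits))
```

In the invented sample, the only request for /llms.txt comes from an ordinary browser, which is the pattern the log evidence described.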

A study published in April 2026 found that only 7.4% of Fortune 500 companies - 37 out of 500 - had implemented an llms.txt file, despite significant promotional effort from SEO tools and consultants. An OtterlyAI 90-day experiment cited in the same coverage found llms.txt provided no meaningful impact on AI crawler behavior.

The irony Illyes is gesturing at is sharp. The industry is debating whether to add a new plain-text file to solve the problem of AI systems struggling to read plain text from web pages. The underlying issue - that content is buried inside JavaScript that automated systems cannot reliably execute - is not addressed by any of these protocols. llms.txt, structured data, and similar layers are workarounds. Google itself added llms.txt as an audit item to Lighthouse just days before Illyes' post, a move that elevates agent-readiness to a standard benchmark - while the root cause goes unaddressed.

Three distinct problems Illyes identifies

The post covers three separate use cases, each with its own technical dimension.

Base model training relies on large-scale web crawls. Common Crawl and similar datasets gather raw HTML at petabyte scale. Extracting clean text from that HTML is a major processing step. Pages that deliver content in the initial HTML response - semantic elements, proper heading hierarchy, paragraph tags with readable text - produce training data that can be extracted cleanly with little processing. Pages where content is injected by JavaScript after load either require a full rendering pipeline (expensive at training-data scale) or are simply skipped, which means their content is absent from the model.
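A simplified sketch shows why semantic markup helps at this step. It assumes a pipeline that keeps headings, paragraphs, and list items as labelled records. The parser and sample markup are hypothetical illustrations, not how Common Crawl or any specific training pipeline works.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects (tag, text) records for headings, paragraphs, and list items."""

    BLOCKS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}

    def __init__(self):
        super().__init__()
        self._current = None   # Block element currently being read, if any.
        self._buffer = []      # Text fragments gathered for that element.
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._flush()      # Also closes an unclosed previous block.
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        if self._current is not None and text:
            self.records.append((self._current, text))
        self._current, self._buffer = None, []


def extract_outline(raw_html: str):
    parser = OutlineParser()
    parser.feed(raw_html)
    parser.close()
    parser._flush()  # Capture a trailing block left open at end of input.
    return parser.records


# Hypothetical markup: semantic content followed by a client-rendered widget.
SAMPLE = (
    "<h1>Guide</h1><p>First <em>point</em>.</p>"
    "<div id='app'></div><script>render()</script>"
)

if __name__ == "__main__":
    print(extract_outline(SAMPLE))
```

The heading and paragraph survive as structured records; whatever the script would have rendered into the empty container never reaches the dataset.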

Web RAG is the real-time equivalent. When an AI-powered application fetches a live page to ground a response in current information, it faces the same rendering problem on demand. Heavy JavaScript, client-side routing, and dynamic content injection add latency and failure modes. A RAG system that cannot reliably extract text from a JavaScript-heavy page either returns incomplete information or returns nothing. Neither outcome is acceptable when the RAG retrieval is meant to be the accuracy layer of an AI response.
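One common way to contain that cost is to try the raw response first and fall back to a rendering step only when the page looks empty. The sketch below assumes that pattern. `fetch_raw` and `render` are hypothetical callables supplied by the caller (for example an HTTP client and a headless browser), the threshold is an arbitrary assumption, and the tag stripping is deliberately crude.

```python
import re

# Crude patterns for a quick emptiness check; not a full HTML parser.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
_TAG = re.compile(r"<[^>]+>")


def text_length(raw_html: str) -> int:
    """Approximate count of visible characters in raw HTML."""
    without_code = _SCRIPT_STYLE.sub(" ", raw_html)
    text = _TAG.sub(" ", without_code)
    return len(" ".join(text.split()))


def retrieve(url, fetch_raw, render, min_chars=200):
    """Return (html, source), where source is "raw" or "rendered".

    fetch_raw(url) and render(url) are hypothetical callables returning HTML.
    The rendering path is only taken when the raw response looks empty.
    """
    raw = fetch_raw(url)
    if text_length(raw) >= min_chars:
        return raw, "raw"
    return render(url), "rendered"


if __name__ == "__main__":
    shell = '<html><body><div id="root"></div><script>boot()</script></body></html>'
    full = "<html><body><h1>Blue widget</h1><p>In stock today.</p></body></html>"
    # Stand-in callables so the sketch runs without network access.
    html, source = retrieve(
        "https://example.com/", lambda u: shell, lambda u: full, min_chars=20
    )
    print(source)
```

The fallback path is where the latency and failure modes accumulate: every page that fails the check costs a full render, and a render that times out returns nothing.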

Agentic RAG is the most demanding case. Illyes draws a specific distinction here, insisting that AI agents - systems that handle ambiguity and take autonomous action - are different from ordinary LLM use. An agent does not just summarize a page; it navigates, evaluates, decides, and acts. It may need to traverse multiple pages, follow links, extract specific data points, and use them to complete a task. Against a web built on server-side rendering and clean HTML, that navigation is straightforward. Against a web built on client-side JavaScript frameworks, the agent is operating against an unstable surface that may change between requests.
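Link traversal shows the contrast clearly. The sketch below assumes an agent that discovers its next steps from anchor elements in the HTML it receives. The two navigation snippets and the base URL are hypothetical, and real agents may use other signals as well.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> elements."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page fragments and script pseudo-links.
        if href and not href.startswith(("#", "javascript:")):
            self.links.append(urljoin(self.base_url, href))


def discover_links(raw_html: str, base_url: str):
    parser = LinkParser(base_url)
    parser.feed(raw_html)
    parser.close()
    return parser.links


# Hypothetical navigation built from real anchors.
STATIC_NAV = '<nav><a href="/pricing">Pricing</a> <a href="docs/start">Docs</a></nav>'

# Hypothetical navigation driven by click handlers in client-side code.
SCRIPTED_NAV = (
    "<nav><span onclick=\"go('/pricing')\">Pricing</span> "
    '<a href="#">Docs</a></nav>'
)

if __name__ == "__main__":
    print(discover_links(STATIC_NAV, "https://example.com/"))
    print(discover_links(SCRIPTED_NAV, "https://example.com/"))
```

With real anchors the agent gets a list of URLs to follow. With handler-driven navigation it gets nothing from the markup and has to execute the page to find out where it can go.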

Google formalized agentic web browsing as a documented infrastructure category in March 2026, adding Google-Agent to its official list of user-triggered fetchers - AI-powered systems hosted on Google infrastructure that browse the web on behalf of users. That formalization came roughly a year after the industry began seriously deploying agentic systems at scale, and the challenge Illyes is describing is what those systems encounter every time they hit a JavaScript-rendered page.

Why Illyes is positioned to make this observation

Illyes has spent more than 15 years at Google working on search infrastructure. He has held the Analyst role since January 2011, based in Zurich, Switzerland. He is a co-author of RFC 9309, the robots.txt standard formalized in 2022, alongside Martijn Koster, Henner Zeller, and Lizzi Sassman. He contributed to the HTTPS ranking signal that Google introduced in August 2014. His listed technical skills include C++ and Python.

In March 2026 he co-authored a blog post titled "Inside Googlebot: demystifying crawling, fetching, and the bytes we process", which opened with a correction that had been pending for years: the name "Googlebot" is a historical misnomer, a label that made sense when Google had one product and one crawler in the early 2000s, but now refers to only one client of a shared infrastructure that also serves Shopping, News, Gemini, AdSense, and AI agents.

That same post documented the 2MB per-resource fetch limit that applies across Google's crawling infrastructure - a constraint that, on pages requiring 60+ files to render, has real implications for how completely a page can be processed. PPC Land reported at the time that Google had previously allowed up to 15MB per resource, which makes the 2MB limit an 86.7% reduction, one that reflects infrastructure cost pressures at crawling scale.
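The 86.7% figure follows directly from the two limits, and the same arithmetic can be turned into a quick check of a page's own resources. In the sketch below the resource names and sizes are hypothetical. Treating 2MB as 2 x 1024 x 1024 bytes is an assumption, since the exact byte threshold is not stated here.

```python
OLD_LIMIT_MB = 15
NEW_LIMIT_MB = 2

# Assumption: 1 MB is treated as 1024 * 1024 bytes for this check.
LIMIT_BYTES = NEW_LIMIT_MB * 1024 * 1024


def reduction_percent(old: float, new: float) -> float:
    """Percentage reduction from old to new."""
    return (old - new) / old * 100


def over_limit(resources: dict, limit_bytes: int = LIMIT_BYTES):
    """Return the names of resources whose size exceeds the limit."""
    return [name for name, size in resources.items() if size > limit_bytes]


# Hypothetical resource sizes in bytes for a single page.
PAGE_RESOURCES = {
    "index.html": 48_000,
    "app.bundle.js": 3_400_000,
    "styles.css": 210_000,
}

if __name__ == "__main__":
    print(round(reduction_percent(OLD_LIMIT_MB, NEW_LIMIT_MB), 1))  # 86.7
    print(over_limit(PAGE_RESOURCES))
```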

In August 2024, Illyes featured in Search Off the Record episode 79 alongside John Mueller and Lizzi Sassman, where he described ongoing work on URL parameter handling to reduce unnecessary crawl attempts - a problem that arises directly from the complexity of modern URL structures in JavaScript-driven applications.
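The underlying problem is that many parameter combinations point at the same content. The sketch below is a generic illustration of URL normalization, not a description of Google's approach, which the episode summary here does not detail. The list of ignored parameters is an assumption that would differ from site to site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking or session parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def canonicalize(url: str) -> str:
    """Drop ignored parameters, sort the rest, and remove the fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    variants = [
        "https://example.com/shoes?utm_source=x&color=blue&size=42",
        "https://example.com/shoes?size=42&color=blue&sessionid=abc",
        "https://example.com/shoes?color=blue&size=42#reviews",
    ]
    # All three hypothetical variants collapse to a single URL.
    print({canonicalize(u) for u in variants})
```

Collapsing equivalent variants before fetching means one request instead of three, which is the kind of saving that matters at crawl scale.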

Reactions from the community

Slobodan Manic, founder of No Hacks and a specialist in web strategy for the AI era, responded directly to the phrase "LLM enhanced automations that can handle ambiguity," writing: "And the best part is - this is all most people need!" He added: "I couldn't agree more with this take, the internet has been broken for a very long time, unfortunately humans are too patient for that."

IDEA lab Cerovac, responding 37 minutes after the original post, added a layer to the historical context: "Hehe, indeed. Unless we are talking about web sliced from images - where they had to figure out all the nested tables for design and content and spacers and flash." That comment points to problems that predate JavaScript frameworks entirely - the table-based layout era of the late 1990s and early 2000s, when HTML structure encoded visual design rather than content semantics, making automated text extraction nearly as difficult as it is today.

What this means for the marketing and ad tech community

The observation has practical implications. Every AI system that touches web content - whether training a model, grounding a response, or completing an agentic task - performs better against infrastructure that follows the conventions Illyes is describing. Server-side rendering, semantic HTML, clean document structure, and content delivered in the initial HTTP response are not new ideas. They have been part of technical SEO best practice for years. What is new is the scale of the automated systems now depending on them.

Google's I/O 2026 announcements in May introduced persistent background agents that monitor the web and surface synthesized alerts - agents that operate largely outside the traditional search interface and interact with web pages as part of their background operation. Google also proposed WebMCP at I/O 2026, an open web standard that allows developers to expose structured tools and JavaScript functions so browser-based AI agents can complete tasks without relying on screenshot parsing and DOM manipulation - an architecture that only makes sense as a response to how difficult DOM manipulation already is.

Since SISTRIX data from March 2026 showed click-through rates at position one falling from 27% to 11%, PPC Land has tracked how AI features in Search have materially altered the relationship between a ranked position and the traffic it generates. As more of that traffic is mediated by AI systems that retrieve and process web content directly, the structural quality of that content matters more, not less.

The advice filling LinkedIn in 2026 - implement llms.txt, configure your structured data, optimize for RAG retrieval - is real work with real value at the margin. But Illyes' irony is that it is all remedial. The web created its own problem. And the industry is now building a second layer of simplified text files and agent protocols to work around it.

Timeline

August 2014 - Google introduces HTTPS as a ranking signal, a change Illyes contributed to, published on the Google Search developer documentation site
March 2024 - Illyes and Lizzi Sassman discuss web crawling mechanics on Search Off the Record, covering crawl budget and how crawlers schedule, fetch, and process content
May 2024 - Illyes clarifies via LinkedIn that the sitemap lastmod element remains a crawl activity signal
August 8, 2024 - Search Off the Record episode 79 features Illyes, Mueller, and Sassman on crawling misconceptions and URL parameter handling
September 3, 2024 - llms.txt protocol proposed by Jeremy Howard, positing that modern HTML is too complex for LLMs to process reliably
September 16, 2024 - Google revamps crawler documentation, introducing Google-Extended user agent for AI and generative API improvements
December 3, 2024 - Google releases detailed crawling documentation co-authored by Illyes and Martin Splitt, covering the four-stage HTML-to-rendering pipeline and 30-day WRS caching system
July 2, 2025 - PPC Land reports llms.txt adoption has stalled, with Ahrefs confirming no major LLM provider - not OpenAI, not Anthropic, not Google - parses the file
November 20, 2025 - Google migrates crawling documentation to a new dedicated Crawling Infrastructure site, separating it from Search Central
January 31, 2026 - Cloudflare CEO data shows Google accesses 3x more web content than OpenAI through crawler infrastructure
March 3, 2026 - Google publishes nine-point web crawling overview, noting the median mobile page now requires 60-plus files and has grown to 2.3 megabytes
March 20, 2026 - Google adds Google-Agent to user-triggered fetchers, formalizing AI browsing as a documented infrastructure category
March 31, 2026 - Illyes co-authors "Inside Googlebot" blog post, detailing 2MB fetch limits and clarifying that Googlebot is a historical misnomer
April 5, 2026 - ProGEO.ai study finds only 7.4% of Fortune 500 companies have implemented llms.txt; OtterlyAI 90-day experiment finds no meaningful impact on AI crawler behavior
May 19, 2026 - Google proposes WebMCP at I/O 2026 as a standard for exposing structured web tools to browser-based agents
May 2026 - Google adds llms.txt as an audit category to Lighthouse, elevating agent-readiness to a standard benchmark
June 1, 2026 - Gary Illyes publishes LinkedIn post arguing Web 1.0-style clean HTML or SSR would have made LLM training, web RAG, and agentic RAG straightforward from a raw data processing perspective

Summary

Who: Gary Illyes, Analyst at Google since January 2011, based in Zurich, Switzerland. Co-author of RFC 9309 (the robots.txt standard) and a long-standing contributor to Google's crawling and search infrastructure documentation and publications.

What: A LinkedIn post arguing that AI agents and large language models face unnecessary raw data processing complexity because the web moved away from clean, server-rendered HTML toward JavaScript-heavy client-side rendering. Illyes identifies three affected use cases: base model training, web RAG, and agentic RAG. The post carries an implicit critique of the 2026 industry conversation around llms.txt and AI optimization protocols, which address symptoms rather than the underlying structural cause.

When: June 1, 2026. The post had 44 reactions, 3 comments, and 3 reposts at time of capture.

Where: LinkedIn, posted publicly from Illyes' professional profile where he is identified as Analyst at Google.

Why: The web's migration to client-side JavaScript rendering - driven by richer interactivity and better user experience - created a structural gap between what browsers display and what automated systems can extract from a raw HTTP response. As AI systems have become the primary new consumers of web content at scale, that gap has become a significant engineering and cost problem. The industry's current response - layering llms.txt, structured data protocols, and agentic interfaces on top of the existing web - is, in Illyes' framing, remedial work on a problem that better original architecture would have avoided.
