A new research report published on March 31, 2026, by ProGEO.ai reveals a striking gap in how America's largest companies are preparing - or failing to prepare - for a search landscape increasingly shaped by generative artificial intelligence. The study, titled "Signaling the Shift to Generative Engine Optimization (GEO)," scanned all 500 companies on the Fortune 500 list and measured adoption rates across three technical protocols: robots.txt, JSON-LD, and llms.txt.
The headline finding is blunt. According to the report, only 7.4% of Fortune 500 companies - 37 in total - have implemented llms.txt, a specification introduced in 2024 to help AI platforms process website content more efficiently. By contrast, 92.8% have implemented robots.txt, the 30-year-old crawler control standard. The gap between these two figures maps almost exactly to the distance between a mature, search-engine-era standard and a nascent protocol designed for a different kind of machine reader.
A new optimization discipline takes shape
The report frames these numbers through the lens of what ProGEO.ai calls generative engine optimization (GEO) - a practice distinct from traditional search engine optimization (SEO) that focuses specifically on brand visibility within AI-generated responses. Platforms such as ChatGPT, Claude, and Gemini are changing how buyers find information, and, according to Gartner, "Chief Marketing Officers and their teams need to adjust their web content strategy to adapt to search engine's evolving algorithms and appear in GenAI-powered search results."
That shift matters commercially. Organic search traffic has been falling as AI-generated responses answer queries directly, without sending users to external websites. Zero-click behavior - where a query is resolved entirely within a platform's interface - has intensified steadily since AI-powered search features began scaling in 2024 and 2025. Against that backdrop, the question of how a brand signals its presence to AI systems is no longer purely theoretical.
Clinton Karr, CMO of ProGEO.ai, described the findings in terms of the classic diffusion framework developed by sociologist Everett Rogers. "ProGEO.ai observed that the Fortune 500 adoption rates for robots.txt, JSON-LD, and llms.txt mapped to Rogers' 'Diffusion of Innovations' curve, demonstrating a full spectrum of technical marketing maturity," said Karr. "Early adopters of llms.txt in the Fortune 500 are signaling their experimentation with generative engine optimization."
The adoption rates in the data do follow this curve closely. robots.txt, which dates to 1994 and became an official IETF standard in 2022 as RFC 9309, sits at 92.8% - well past the early majority threshold. JSON-LD, a W3C structured data standard that has existed since 2011, sits at 53.8% - squarely within the late majority. llms.txt, published as a specification in 2024, sits at 7.4% - firmly within the innovators category on Rogers' curve.
Part one: robots.txt at 30, still not built for AI
The robots.txt file implements the Robots Exclusion Protocol. According to the ProGEO.ai report, it lets site operators declare whether individual web crawlers may or may not access their site. Despite its near-universal adoption - 92.8% of Fortune 500 companies have one - only 11% of those companies, 55 in total, have named a specific AI user agent anywhere in the file.
This matters because robots.txt is permissive by default: the absence of an explicit directive is treated as an implied allow. The 89% of Fortune 500 companies that have not named an AI user agent are therefore, according to the report, "by default more accessible to AI crawlers than most of the 11% who have."
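As a minimal sketch of that default-allow behavior, the following uses Python's standard-library robots.txt parser against a hypothetical file that, like most Fortune 500 examples, names no AI user agent (the domain and paths are invented for illustration):

```python
import urllib.robotparser

# Hypothetical robots.txt resembling what most Fortune 500 sites publish:
# crawler rules exist, but no AI user agent is named anywhere.
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot is never mentioned, so it inherits the generic * rules:
# everything outside /admin/ is an implied allow for it.
print(rp.can_fetch("GPTBot", "https://example.com/annual-report"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/admin/panel"))    # False
```

Because the AI crawler is never mentioned, it simply inherits the wildcard rules - exactly the implied allow the report describes.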
Among the 55 companies that have named an AI user agent, ProGEO.ai identified 270 total directives across 25 distinct AI user agents. These break down into 105 allow directives, 116 disallow directives, and 49 partial access directives. A clear pattern emerges when the data is split by crawler type. Directives aimed at training crawlers - bots that collect content to build or refine AI models - skew toward disallow. Directives aimed at search crawlers - bots that retrieve content to generate responses - skew toward allow.
GPTBot, OpenAI's training crawler, is the most frequently named AI user agent with 32 total directives, and it leans toward restriction. CCBot (Common Crawl), Google-Extended (Gemini), Meta-ExternalAgent, and Bytespider (ByteDance) follow the same pattern. By contrast, ChatGPT-User (OpenAI's search agent), OAI-SearchBot, and PerplexityBot show predominantly permissive directives. The implication is that Fortune 500 companies making deliberate choices about AI access are generally keeping their doors open to AI-generated search while trying to block the use of their content for model training.
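The training-versus-search split can be illustrated with a hypothetical robots.txt that follows the pattern the report describes. The agent names are real; the site and rules are invented for illustration:

```python
import urllib.robotparser

# Hypothetical directives following the report's observed pattern:
# training crawlers blocked, search/retrieval crawlers allowed.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Training crawlers may not fetch anything; search crawlers may still
# retrieve pages to ground AI-generated answers.
for agent in ("GPTBot", "Google-Extended", "OAI-SearchBot", "PerplexityBot"):
    print(agent, rp.can_fetch(agent, "https://example.com/products"))
```

A site taking this posture keeps its content out of model training sets (to the extent crawlers comply) while remaining citable in AI-generated search results.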
The efficacy of this approach is, however, contested. In December 2025, OpenAI announced that its ChatGPT-User agent would no longer follow robots.txt directives for user-initiated browsing. Multiple reports suggest Bytespider, operated by ByteDance, also ignores robots.txt directives. Some organizations have responded by moving enforcement to web application firewalls (WAFs), which can block requests at the network level rather than relying on voluntary compliance. Cloudflare introduced Robotcop in December 2024 to automate this process.
WAF-based enforcement introduces its own complication. According to the report, Google uses the same user agent for all of its crawlers - covering both Search and Gemini. Blocking Googlebot at the WAF layer to restrict Gemini access would simultaneously prevent the site from being indexed for traditional search. The separation of training crawlers from search crawlers that robots.txt allows does not translate cleanly to WAF enforcement.
More than three in four Fortune 500 companies - 76% - include at least one Sitemap directive in their robots.txt. Sitemaps provide crawlers with a structured list of URLs, relative priority values, and metadata. Together, robots.txt and sitemaps form the foundational layer of information retrieval for both search engines and AI systems.
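A sitemap referenced from robots.txt is plain XML. The sketch below parses a small hypothetical one with Python's standard library to show the URL, last-modified, and priority fields that crawlers read (all URLs invented):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical sitemap of the kind a robots.txt
# "Sitemap:" directive points crawlers at.
sitemap_xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-03-01</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/newsroom/</loc>
    <lastmod>2026-03-28</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Extract (URL, relative priority) pairs the way a crawler would.
urls = [(u.findtext("sm:loc", namespaces=NS),
         u.findtext("sm:priority", namespaces=NS))
        for u in root.findall("sm:url", NS)]
print(urls)
```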
Part two: JSON-LD - widespread adoption, shallow implementation
JSON-LD (JavaScript Object Notation for Linked Data) is a W3C standard for encoding structured data on web pages. It provides machine-readable semantic signals - explicit declarations that tell search engines and AI systems what is on a page and what it means. Google explicitly stated in 2019 that JSON-LD is its preferred format for structured data.
According to the ProGEO.ai report, 53.8% of Fortune 500 companies have implemented JSON-LD on their homepage - 269 companies - at an average of 5.1 schema types per implementation. The three most common types are Organization (used by 182 companies), WebSite (used by 147), and SearchAction (used by 124). These three types handle traditional SEO tasks: Organization populates knowledge panels, WebSite enables sitelinks, and SearchAction powers the search box within Google results.
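A homepage implementation combining those three types might look like the hypothetical sketch below (company name and URLs are invented). Pages embed the serialized object in a `<script type="application/ld+json">` tag:

```python
import json

# Hypothetical homepage JSON-LD combining the three schema types the
# report found most often: Organization, WebSite, and SearchAction.
homepage_jsonld = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Organization",
            "name": "Example Corp",
            "url": "https://example.com/",
            "logo": "https://example.com/logo.png",
        },
        {
            "@type": "WebSite",
            "url": "https://example.com/",
            # SearchAction powers the sitelinks search box in results.
            "potentialAction": {
                "@type": "SearchAction",
                "target": "https://example.com/search?q={search_term_string}",
                "query-input": "required name=search_term_string",
            },
        },
    ],
}

# Serialized form, ready to drop into a <script> tag on the page.
print(json.dumps(homepage_jsonld, indent=2))
```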
That baseline, however, conceals a significant maturity gap. ProGEO.ai randomly sampled interior content pages for 189 of the 269 companies with homepage JSON-LD. The remaining 77 blocked interior page scanning, and 3 returned inconclusive results. Among the 189 successfully sampled, 52.4% had JSON-LD only on the homepage or injected as a site-wide template - doing identical work on every page: declaring Organization, WebSite, and SearchAction. Only 47.6% - 90 companies out of the 189 sampled - were adding page-specific structured data to interior pages.
The most common content-specific types on interior pages were Article (found at 146 companies), Person (105), and BreadcrumbList (84). These types do the work that matters specifically for GEO. They identify the author, mark the content as a distinct publishable unit, and establish its position within a site hierarchy - precisely the signals that AI systems use to build entity relationships and identify citable sources.
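An interior-page block using those content-specific types might look like this hypothetical sketch (headline, author, and URLs are all invented):

```python
# Hypothetical interior-page JSON-LD using the content-specific types
# the report highlights: Article, Person, and BreadcrumbList.
article_jsonld = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Article",
            "headline": "Example Corp Q1 widget shipments",
            "datePublished": "2026-03-15",
            # Person identifies the author - an entity signal AI systems
            # can use when judging whether a source is citable.
            "author": {"@type": "Person", "name": "Jane Analyst"},
        },
        {
            # BreadcrumbList places the page within the site hierarchy.
            "@type": "BreadcrumbList",
            "itemListElement": [
                {"@type": "ListItem", "position": 1, "name": "Home",
                 "item": "https://example.com/"},
                {"@type": "ListItem", "position": 2, "name": "Newsroom",
                 "item": "https://example.com/newsroom/"},
            ],
        },
    ],
}

types = [node["@type"] for node in article_jsonld["@graph"]]
print(types)  # ['Article', 'BreadcrumbList']
```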
In practical terms, the data suggests only about one-quarter of all Fortune 500 companies have deployed JSON-LD at a level of sophistication relevant to AI visibility. The majority have the infrastructure but are not using it strategically.
Part three: llms.txt - early adopters, contested efficacy
llms.txt is the newest of the three protocols. According to the specification published in 2024, it is a file located at the root of a domain that serves content in Markdown - the format most efficiently processed by large language models. It uses an H1 header to declare the site name, H2 sections to group lists of URLs, and surrounding prose to provide contextual detail. The report describes sitemaps as maps for search engines and llms.txt as guidebooks for AI systems.
ProGEO.ai found that 37 Fortune 500 companies - 7.4% - have implemented llms.txt. Analysis of the file structure and content across those implementations reveals several patterns. Approximately two-thirds of the typical llms.txt file (66.5% by character count) is prose rather than URLs. The average file size is 6,721 characters, with a median of 31 URLs and a median of eight headers. The required structure calls for one H1 header and six H2 headers. Outlier analysis revealed significant variance: one company implemented 976 H1 headers, undermining the specification's hierarchical logic; another published an llms.txt file of 1.3 million characters - roughly 250,000 tokens, which exceeds the context window of some AI models entirely.
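A minimal llms.txt following the specification's shape, plus the kind of sanity check that would catch the outliers above, might look like this (the site and links are invented for illustration):

```python
# Hypothetical llms.txt following the 2024 specification's shape:
# one H1 for the site name, H2 sections grouping Markdown links,
# prose providing contextual detail.
llms_txt = """\
# Example Corp

> Example Corp builds industrial widgets. Key pages for AI systems below.

## Products

- [Widget overview](https://example.com/products/widgets.md): core product line
- [Pricing](https://example.com/pricing.md): list prices and tiers

## Company

- [About](https://example.com/about.md): history and leadership
"""

# Sanity checks mirroring the report's outlier analysis: a well-formed
# file has exactly one H1; hundreds of H1s (or megabyte-scale files)
# defeat the format's hierarchical, token-efficient intent.
h1_count = sum(1 for line in llms_txt.splitlines() if line.startswith("# "))
h2_count = sum(1 for line in llms_txt.splitlines() if line.startswith("## "))
print(h1_count, h2_count)  # 1 2
```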
The report also notes that 70.3% of Fortune 500 companies that have implemented llms.txt have also implemented JSON-LD - a co-adoption rate that suggests deliberate, multi-layer thinking about AI visibility rather than isolated experimentation. Eight of the 37 llms.txt adopters have additionally named AI user agents in their robots.txt, and all but one of those take a predominantly permissive posture: across 51 AI directives in those eight companies' robots.txt files, 41 are allow and only one is disallow.
Six companies have implemented all three signals - llms.txt, JSON-LD, and explicit allow directives for AI user agents in robots.txt. According to the report, those companies are Nvidia (nvidia.com), Dell Technologies (dell.com), Builders FirstSource (bldr.com), Sonic Automotive (sonicautomotive.com), FM (fmglobal.com), and Concentrix (concentrix.com). A footnote flags that Concentrix explicitly disallows ClaudeBot from its entire site, though other AI bots inherit partial permission through wildcard rules. Across the full Fortune 500, fewer than 1% of companies have implemented all three signals.
A protocol without consensus
Whether llms.txt actually does anything useful for AI visibility is genuinely unresolved. The specification was published in 2024 and, as the report notes, "the evidence for its efficacy is early and contested." PPC Land reported in July 2025 that server log analysis found AI crawlers do not request llms.txt files during website visits, indicating zero actual utilization at that time.
Google's John Mueller reiterated throughout 2025 that no AI system was using llms.txt. As of March 2026, however, Google's own Gemini documentation has an active llms.txt file - an implicit acknowledgment of the specification's relevance even if the mechanics of how it influences AI responses remain unclear. OpenAI also serves llms.txt as of March 2026. Anthropic's docs.anthropic.com/llms-full.txt, on the other hand, returns a "page not found" result as of the same date, despite Anthropic having asked Mintlify - a documentation platform - to implement llms.txt support in 2024. In February 2026, a 90-day experiment by OtterlyAI found llms.txt provided no meaningful impact on AI crawler behavior.
The mixed signals from the platforms themselves make it difficult to draw firm conclusions. The specification is not yet a formal standard. It has no RFC equivalent. Platform endorsement has historically been the driver of adoption for comparable protocols - Google's explicit support for JSON-LD in 2019 drove its uptake among enterprises, just as Google, Microsoft, and Yahoo! drove adoption of Schema.org structured data types from 2011 onwards. For llms.txt, that decisive moment of platform endorsement has not yet arrived, even if the direction of travel among some platforms appears positive.
What the data means for the marketing community
The broader context for this research is the ongoing disruption to organic search traffic. Ahrefs research published in February 2026 found that Google's AI Overviews now correlate with a 58% reduction in click-through rates for top-ranking pages - up sharply from the 34.5% decline the same organization documented in April 2025. The direction is consistent and, for publishers and brands dependent on organic search traffic, severe. Zero-click searches have become the majority outcome for many query types.
In that context, the question of how brands maintain visibility within AI-generated responses - rather than just traditional search result pages - becomes increasingly material. GEO, as ProGEO.ai frames it, is an attempt to answer that question at the technical infrastructure level. The report is careful to note, however, that technical signals are necessary but not sufficient. According to the report, AI systems cite content that is "authoritative, evidence-based, and structured for extraction." Google's E-E-A-T framework - experience, expertise, authoritativeness, and trustworthiness - describes the content qualities that matter alongside the technical plumbing.
The data from the ProGEO.ai study does not suggest that llms.txt is a silver bullet - the authors explicitly acknowledge the contested evidence around its efficacy. What the data does suggest is that the largest enterprises are beginning to treat AI visibility as a distinct discipline requiring dedicated technical attention. Whether that attention ultimately proves well-directed depends on decisions that AI platforms themselves have not yet made transparent.
Timeline
- 1994: robots.txt introduced as an informal standard; adopted by early search engines such as Lycos and AltaVista as a de facto web crawler control mechanism
- 2011: W3C launches JSON-LD Community Group; Google, Microsoft, and Yahoo! launch Schema.org
- 2014: JSON-LD 1.0 receives W3C Recommendation status
- 2019: Google explicitly states JSON-LD is its preferred structured data format
- 2020: JSON-LD 1.1 receives W3C Recommendation status
- 2022: IETF publishes RFC 9309, formalizing robots.txt as an official internet standard; Microsoft launches a robots.txt tester tool for Bing
- June 2024: Cloudflare publishes analysis of AI bot activity, finding AI bots accessed approximately 39% of the top one million internet properties
- 2024: llms.txt specification published; Anthropic requests Mintlify to implement llms.txt support
- September 2024: Cloudflare introduces AI Audit tools for publisher content management
- December 2024: Cloudflare launches Robotcop to enforce robots.txt policies at network level
- March 2025: Google outlines pathway for robots.txt protocol to evolve for emerging AI use cases
- December 2025: OpenAI announces ChatGPT-User will no longer follow robots.txt for user-initiated browsing
- December 2025: Google updates its JavaScript SEO documentation, including the interaction between the nosnippet directive and AI-powered search features
- February 2026: OtterlyAI's 90-day experiment finds llms.txt provides no meaningful impact on AI crawler behavior; Ahrefs research finds AI Overviews now correlate with a 58% reduction in organic click-through rates
- March 31, 2026: ProGEO.ai publishes "Signaling the Shift to Generative Engine Optimization (GEO)," measuring Fortune 500 adoption rates of robots.txt, JSON-LD, and llms.txt
Summary
Who: ProGEO.ai, a San Francisco-based data-driven generative engine optimization agency, published the research. The report was authored by Clinton Karr, CMO of ProGEO.ai, who has 20 years of background in corporate communications and content marketing.
What: A report titled "Signaling the Shift to Generative Engine Optimization (GEO)" measured adoption rates of three technical protocols - robots.txt, JSON-LD, and llms.txt - across all 500 companies on the Fortune 500 list. Key findings include: 92.8% of Fortune 500 companies have robots.txt, 53.8% have JSON-LD, and 7.4% have llms.txt. Only 11% of Fortune 500 companies name an AI user agent in robots.txt, and fewer than 1% have implemented all three signals.
When: The research was conducted in March 2026, with the report published and announced on March 31, 2026.
Where: ProGEO.ai scanned all Fortune 500 company websites using a Python-based HTTP client. The research covered homepage, robots.txt, and llms.txt files, with interior page sampling conducted for structured data analysis. The announcement was issued from San Francisco.
Why: The study addresses the growing gap between traditional SEO infrastructure and the requirements of AI-powered search platforms. As generative AI platforms increasingly answer queries directly - without sending users to external websites - brands face a question of how to maintain visibility within AI-generated responses. ProGEO.ai positioned the research as a baseline measurement of GEO maturity among the largest US companies, enabling enterprises to benchmark their own technical readiness against the Fortune 500 cohort.