Google and Schema.org finally show how the web uses structured data

Schema.org this week published its first public dataset of aggregate usage statistics for its structured data vocabulary, offering developers and publishers a view into how different types and properties are actually deployed across millions of domains - information drawn from Google's web crawling infrastructure and released jointly in CSV and JSON formats on GitHub.

A long-requested window into adoption

The dataset, announced on June 4, 2026, on the Schema.org blog, represents a collaboration between Google and the Schema.org community. According to the announcement, the initiative aims to provide greater transparency into how different Types and Properties are being utilized by developers and publishers globally. The release fills a gap that had persisted for years: no authoritative, crawl-scale data existed showing which Schema.org terms had achieved meaningful adoption and which remained marginal.

Ryan Levering, a software engineer at Google, offered context for why the release took as long as it did. "As long as I've known Dan Brickley, one of his main asks was that Google Search publish some stats on usage on the web," Levering wrote in a LinkedIn post on June 5, 2026. "I finally got some support to finish the project and hopefully a few more cool things coming soon." He added: "It's hard for most open crawls to get the same depth of index as we have at Google, so we're happy to present some usage stats on schema.org terms, even if they are somewhat abstracted."

Dan Brickley is Schema.org's primary coordinator and has been involved with the project since its founding in 2011 by Google, Microsoft, and Yahoo, with Yandex joining later that year. His years-long request for public usage data reflects a broader frustration within the structured data community: the vocabulary was developed openly, but the evidence of its uptake was always opaque, available only to search engines with the crawling infrastructure to measure it.

How the data is structured

The dataset follows a deliberately simplified design. According to the Schema.org documentation, each record contains three fields: a term type (either Type or Property), a URI identifying the term, and a domain count bucket representing the range of unique domains using that term.

The domain count buckets are: fewer than 1,000 domains, 1,000 to 10,000, 10,000 to 100,000, 100,000 to 1 million, 1 million to 10 million, and more than 10 million. Rather than publishing precise counts, Schema.org uses these ranges for two stated reasons. First, exact counts fluctuate daily as crawling encounters temporary server issues or minor website changes, making raw figures noisy and unreliable as signals of meaningful adoption. Second, exact figures could allow competitors or bad actors to reverse-engineer search engine crawling patterns or track subtle changes on specific websites.
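To make the bucketing concrete, here is a minimal sketch of how a raw domain count could map onto the six ranges listed above. The label strings and the function name are illustrative assumptions, not the identifiers used in the published files, and the sketch assumes each upper bound is exclusive, which the announcement does not specify.

```python
# Upper bounds follow the six ranges described in the article.
# The label strings are hypothetical, not the ones in the published files.
BUCKETS = [
    (1_000, "under_1k"),
    (10_000, "1k_10k"),
    (100_000, "10k_100k"),
    (1_000_000, "100k_1m"),
    (10_000_000, "1m_10m"),
]
TOP_BUCKET = "over_10m"


def bucket_for(domain_count: int) -> str:
    """Return the illustrative bucket label for a unique-domain count."""
    if domain_count < 0:
        raise ValueError("domain_count must be non-negative")
    for upper_bound, label in BUCKETS:
        # Assumption: upper bounds are exclusive (1,000 lands in "1k_10k").
        if domain_count < upper_bound:
            return label
    return TOP_BUCKET


print(bucket_for(999))         # under_1k
print(bucket_for(250_000))     # 100k_1m
print(bucket_for(42_000_000))  # over_10m
```

A count that wobbles between, say, 250,000 and 260,000 from day to day stays in the same bucket, which is the noise-suppression effect the ranges are meant to provide.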

According to Schema.org, the data is aggregated at the domain level, not by individual page or by the number of schema objects on a site. A publisher that deploys a given schema type on 500 pages still counts as a single domain. This design choice reflects a deliberate prioritization of wide adoption over intensity of use. As Schema.org explains in its documentation, URL counts and object counts are heavily influenced by index composition, and domain-level counting is more universally comparable.
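The domain-level counting rule can be sketched in a few lines. This is an illustration of the principle only, not Google's pipeline: the input shape (pairs of page URL and term URI) is an assumption, and the URL's hostname stands in for whatever definition of "domain" the real aggregation uses.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domains_per_term(observations):
    """Count unique hosts per term from (page_url, term_uri) pairs.

    Assumption: the hostname is used as a stand-in for "domain"; the
    actual dataset may group hosts differently.
    """
    hosts_by_term = defaultdict(set)
    for page_url, term_uri in observations:
        host = urlsplit(page_url).hostname
        if host:
            hosts_by_term[term_uri].add(host)
    return {term: len(hosts) for term, hosts in hosts_by_term.items()}


observations = [
    ("https://shop.example/p/1", "https://schema.org/Product"),
    ("https://shop.example/p/2", "https://schema.org/Product"),
    ("https://shop.example/p/3", "https://schema.org/Product"),
    ("https://blog.example/post", "https://schema.org/Product"),
]
print(domains_per_term(observations))  # {'https://schema.org/Product': 2}
```

Three product pages on one site and one page on another yield a count of two, not four, which is why a large catalogue does not outweigh a small site in these statistics.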

The files are available in three formats: CSV, JSON, and a JSON summary format with aggregated bucket distributions. All three are published on the official Schema.org GitHub repository. According to the announcement, an updated file will be pushed to GitHub every month, making it a continuous resource rather than a one-time snapshot.

What the May 2026 data shows

The first release, dated May 2026, covers 958 Itemtypes and 4,587 Predicates - a total of 5,545 entries across the two classes.

Among Itemtypes, 12 terms fall in the 10 million-plus domain bucket. These are the most widely deployed schema types on the public web: BreadcrumbList, EntryPoint, ImageObject, ListItem, Organization, Person, PropertyValueSpecification, ReadAction, SearchAction, Thing, WebPage, and WebSite. The presence of Organization and Person at this scale reflects how broadly the vocabulary has been adopted for identity and publisher markup, well beyond specialist use cases.

A further 35 Itemtypes fall in the 1 million to 10 million bucket. These include a mix of commercial and content-oriented types: Product, Review, AggregateRating, Article, BlogPosting, FAQPage, LocalBusiness, VideoObject, and Question, among others. The presence of FAQPage and Question at this level is notable given that Google previously surfaced FAQ rich results from structured data in search, a feature that drove significant adoption among SEO practitioners. EventPage and Recipe sit in lower buckets - 100,000 to 1 million and 10,000 to 100,000 respectively - suggesting they remain more specialized.

At the other end of the distribution, 485 Itemtypes fall in the fewer-than-1,000-domain bucket. According to Schema.org's documentation, this group contains a mix of brand-new terms and highly specialized ones - medical or government terms, for instance - whose naturally limited publishing communities keep them in the lower ranges regardless of their importance within their sectors.

Among Predicates, the 10 million-plus bucket contains 31 terms. These are the foundational properties that appear almost universally wherever structured data is deployed: name, description, image, url, headline, datePublished, dateModified, author, publisher, breadcrumb, and logo, among others. The predicate query-input, which supports SearchAction implementations for site search boxes, also reaches this threshold - indicating that sitelinks search box markup has achieved very broad deployment.

The 1 million to 10 million predicate bucket contains 65 terms. This tier covers properties closely associated with the high-adoption types: aggregateRating, availability, articleBody, acceptedAnswer, address, brand, price, and ratingValue all appear here, reflecting the commercial and content-oriented patterns that dominate structured data practice.

The summary data shows a sharp drop-off at smaller domain sizes. Among Predicates, 3,779 terms fall below 1,000 domains - a long tail of vocabulary terms that exist in the specification but have not achieved meaningful real-world adoption. This distribution matters for toolmakers and developers making decisions about which terms to support or implement.

The methodology and its limits

Schema.org's technical documentation is candid about the constraints. The data comes from Google's public web crawling infrastructure and reflects the web as indexed by Google. Websites blocked by robots.txt are not included. The crawler's own scope and indexing methodology introduce biases, as is true of any web-scale dataset. The statistics do not distinguish between JSON-LD, Microdata, and RDFa implementations: a domain using JSON-LD on its homepage and Microdata on product pages counts once for each term it uses, regardless of format.

The monthly release cadence was chosen because, according to Schema.org, web adoption trends change slowly enough that monthly updates are sufficient to capture meaningful shifts. Each release requires manual validation and approval before publication to filter out anomalies. This suggests that publication is not fully automated, which introduces some latency - a delay Schema.org appears to consider acceptable given how gradually adoption patterns shift.

Google's role as the sole initial contributor is noted openly. According to the announcement, "while this initial contribution comes from Google, we recognize that a truly comprehensive view of the web requires multiple perspectives." The Schema.org community has explicitly invited other crawlers and indexers to contribute their own statistics using the same open format. The data format is documented in the Schema.org GitHub repository, making independent contributions technically feasible for other search engines or web archive projects.

This matters because Google's index is not a neutral sample of the web. The Cloudflare data cited in PPC Land's earlier coverage of crawler infrastructure showed Googlebot accessing approximately 8% of observed web pages over a two-month period in early 2026 - significantly more than other crawlers, but still a sampled view. Sites that block Googlebot or that have poor crawl coverage for other reasons will not appear in the dataset.

The February 2026 reduction in Googlebot's maximum crawl file size to 2MB - an 86.7% reduction from the previous 15MB limit - could also affect which structured data is visible. Pages with large HTML files where schema markup appears deep in the document might not be fully processed. Schema.org's documentation does not address how this infrastructure constraint interacts with data collection.

Why this matters for the marketing and SEO community

For SEO practitioners and marketing technologists, the dataset provides something that has not existed before: official, crawl-scale evidence of which schema vocabulary is worth implementing, drawn from the source most relevant to search visibility.

The structured data landscape has historically been shaped by search engine documentation, case studies from individual publishers, and community observation rather than by aggregate adoption data. Practitioners often had to infer popularity from indirect signals - whether a given type appeared in Google's rich results gallery, whether toolmakers supported it, or whether peers reported success with it. The new dataset replaces inference with measurement.

For platform developers, the signal is particularly concrete. WordPress SEO plugins, CMS structured data modules, and e-commerce schema generators can now reference official adoption figures when deciding which types and properties to implement by default, which to include as optional, and which to retire from active support. A predicate sitting in the fewer-than-1,000-domain bucket might be technically valid but not worth building tooling around, while one in the 10 million-plus bucket clearly warrants first-class support.
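One way a plugin maintainer might turn buckets into a support policy is sketched below. The bucket labels, the tier names, and especially the two thresholds are hypothetical choices made for illustration; nothing in the Schema.org announcement prescribes them.

```python
# Buckets ordered from least to most adopted. Labels are hypothetical.
BUCKET_ORDER = ["under_1k", "1k_10k", "10k_100k", "100k_1m", "1m_10m", "over_10m"]


def support_tier(bucket: str,
                 default_from: str = "1m_10m",
                 optional_from: str = "10k_100k") -> str:
    """Classify a term as 'default', 'optional', or 'retire-candidate'.

    Both thresholds are illustrative assumptions a toolmaker would tune.
    """
    rank = BUCKET_ORDER.index(bucket)  # raises ValueError on unknown labels
    if rank >= BUCKET_ORDER.index(default_from):
        return "default"
    if rank >= BUCKET_ORDER.index(optional_from):
        return "optional"
    return "retire-candidate"


print(support_tier("over_10m"))  # default
print(support_tier("100k_1m"))   # optional
print(support_tier("under_1k"))  # retire-candidate
```

A low bucket alone is a weak reason to drop a term, since specialist vocabularies sit there by nature, so a rule like this is better treated as a prompt for review than as an automatic decision.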

The dataset also has implications for the way structured data connects to Google's AI-powered search features. AI Overviews and AI Mode rely on Google's understanding of web content, and structured data has historically been one of the clearest signals about what a page contains and what entities it describes. The SpeakableSpecification type, which reached the 100,000 to 1 million domain bucket, was originally introduced to support voice search and has since been discussed in the context of AI-generated responses. Its adoption at that scale suggests a non-trivial number of publishers have already marked which parts of their content are best suited to being read aloud.

The timing is also notable in the context of the EU's Digital Markets Act proceedings against Google, where the European Commission adopted preliminary findings in April 2026 that would require Google to share search data with competing search engines on fair terms. The Schema.org usage dataset is technically distinct from search query and click data - it describes the web's self-description rather than user behavior - but it represents a step toward greater transparency about how Google's infrastructure views the web's semantic layer.

The Google crawl infrastructure updates tracked by PPC Land in November 2025 are relevant context here too. The documentation move from Google Search Central to a dedicated crawling infrastructure site signaled that crawler data serves Shopping, News, Gemini, and other Google products beyond organic search. The Schema.org usage dataset is derived from that same infrastructure, meaning its signals reflect relevance across Google's ecosystem, not just traditional web search rankings.

Practical use of the data

The three file formats serve different use cases. The CSV and JSON files contain the full term-level data, suitable for analysis, tooling, and integration into developer workflows. The JSON summary format provides aggregated bucket distributions - the data included in the documents attached to this article - which gives a quick overview of how many terms fall in each adoption tier without requiring processing of the full dataset.
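As a rough sketch of what working with the term-level CSV could look like, the snippet below tallies terms per bucket, roughly what the summary file is described as providing. The column names and bucket labels here are assumptions made for the example; the real headers should be taken from the format documentation in the Schema.org repository. The three sample rows use bucket placements reported in this article.

```python
import csv
import io
from collections import Counter

# Hypothetical header and labels; the published file may differ.
SAMPLE_CSV = """term_type,uri,domain_bucket
Type,https://schema.org/WebSite,over_10m
Type,https://schema.org/Product,1m_10m
Property,https://schema.org/name,over_10m
"""


def summarize(csv_text: str) -> Counter:
    """Count terms per (term_type, domain_bucket) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["term_type"], row["domain_bucket"]) for row in reader)


summary = summarize(SAMPLE_CSV)
for (term_type, bucket), count in sorted(summary.items()):
    print(term_type, bucket, count)
```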

According to Schema.org, the statistics will also appear directly on schema term pages within the Schema.org website, making adoption data visible inline when consulting the vocabulary reference. This means developers checking how to implement, say, the Product type will see directly how many domains use it, providing immediate context for prioritization decisions.

The GitHub repository structure makes it straightforward to track changes over time by comparing monthly releases. Terms moving between buckets - for example, a type moving from the 100,000 to 1 million range into the 1 million to 10 million range - would indicate meaningful growth in adoption. Deprecated or obsolete terms declining across buckets over successive months would offer evidence-based grounds for removing support from plugins and tooling.
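A comparison between two monthly releases might look like the following sketch. It assumes each release has already been loaded into a dictionary mapping term URI to bucket label; the labels and the sample movement shown are hypothetical, since only one release exists so far.

```python
# Buckets ordered from least to most adopted. Labels are hypothetical.
BUCKET_ORDER = ["under_1k", "1k_10k", "10k_100k", "100k_1m", "1m_10m", "over_10m"]
RANK = {label: position for position, label in enumerate(BUCKET_ORDER)}


def bucket_moves(previous: dict, current: dict) -> dict:
    """Map term URI to the signed number of buckets it moved.

    Positive values mean growth. Terms missing from the earlier release
    are skipped because there is no baseline to compare against.
    """
    moves = {}
    for term, new_bucket in current.items():
        old_bucket = previous.get(term)
        if old_bucket is None:
            continue
        delta = RANK[new_bucket] - RANK[old_bucket]
        if delta != 0:
            moves[term] = delta
    return moves


# Invented data for illustration only.
may = {"https://schema.org/ExampleA": "100k_1m", "https://schema.org/ExampleB": "1k_10k"}
june = {"https://schema.org/ExampleA": "1m_10m", "https://schema.org/ExampleB": "1k_10k"}
print(bucket_moves(may, june))  # {'https://schema.org/ExampleA': 1}
```

Because each bucket spans an order of magnitude, a single step up or down represents a large change in underlying domain counts; most terms would be expected to show no movement from one month to the next.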

Analysis: what the numbers reveal

A vocabulary built on a narrow foundation

The May 2026 dataset contains 5,545 entries across 958 Types and 4,587 Predicates. At first glance that sounds like a rich vocabulary. The distribution tells a different story.

Just 12 Types - 1.3% of all Types in the specification - have reached the 10 million-plus domain threshold. At the same time, 485 Types, or 50.6% of the entire vocabulary, fall in the fewer-than-1,000-domain bucket. Half the specification's Types are effectively marginal by the measure of actual deployment. The predicate distribution is even more concentrated: 82.4% of all Predicates - 3,779 out of 4,587 - sit below the 1,000-domain floor.

This is not a gradual curve. It is a sharp cliff. Once above the 1,000-domain threshold, the numbers thin out quickly: 236 Types at 1,000 to 10,000 domains, 151 Types at 10,000 to 100,000, then 39 Types at 100,000 to 1 million, 35 Types at 1 million to 10 million, and the 12 at 10 million-plus. The practical result is that 47 Types - just under 5% of the vocabulary - account for essentially all high-volume deployment. The remaining 911 Types describe a long tail of specialized, experimental, or underused terms.
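The shares quoted above can be reproduced directly from the six Type counts. The sketch below uses only figures stated in this article; the bucket labels are shorthand invented for the example.

```python
# Type counts per bucket, as reported for the May 2026 release.
TYPE_COUNTS = {
    "under_1k": 485,
    "1k_10k": 236,
    "10k_100k": 151,
    "100k_1m": 39,
    "1m_10m": 35,
    "over_10m": 12,
}

total_types = sum(TYPE_COUNTS.values())
shares = {bucket: round(100 * count / total_types, 1)
          for bucket, count in TYPE_COUNTS.items()}

# Types in the two highest buckets combined.
high_volume = TYPE_COUNTS["1m_10m"] + TYPE_COUNTS["over_10m"]
high_volume_share = round(100 * high_volume / total_types, 1)

print(total_types)         # 958
print(shares["under_1k"])  # 50.6
print(shares["over_10m"])  # 1.3
print(high_volume, high_volume_share)  # 47 4.9
```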

The 12 dominant types

The 12 Types reaching 10 million-plus domains are: BreadcrumbList, EntryPoint, ImageObject, ListItem, Organization, Person, PropertyValueSpecification, ReadAction, SearchAction, Thing, WebPage, and WebSite. Several observations follow from this list.

First, most of these terms are not primarily content descriptors - they are structural. BreadcrumbList, ListItem, WebPage, and WebSite describe page architecture rather than the subject matter of content. Their near-universal deployment reflects the fact that they are generated automatically by CMS platforms and SEO plugins, not hand-coded by individual publishers. The high adoption of WPHeader, WPFooter, and WPSideBar in the 1 million to 10 million bucket reinforces this. The WP prefix in those names stands for web page rather than WordPress, but page-section types of this kind are typically emitted by themes and templates, so a significant portion of structured data on the web is produced by platform infrastructure, not by intentional publisher decisions.

Second, the presence of SearchAction - which powers sitelinks search boxes in Google Search - in the top tier indicates that the capability has been rolled out at scale across millions of properties. The associated predicate query-input also sits in the 10 million-plus bucket, confirming the pairing is deployed consistently wherever SearchAction appears.

Third, Organization and Person at 10 million-plus domains each represent the baseline of entity disambiguation markup. Publishers declaring their identity as an Organization and listing their authors as Person instances are likely doing so partly because structured data guidelines have consistently cited these as foundational. Their ubiquity also suggests they are among the earliest types to be adopted when a site adds any structured data at all.

E-commerce and content: two pillars of mid-tier adoption

The 35 Types in the 1 million to 10 million bucket split broadly into two clusters: e-commerce and content publishing.

On the commercial side: Product, Offer, AggregateOffer, UnitPriceSpecification, AggregateRating, Review, and LocalBusiness all sit in this range. The presence of Offer and Product together, alongside pricing and rating types, maps directly to the product schema patterns recommended by Google for shopping and rich results eligibility. MerchantReturnPolicy and OfferShippingDetails - types that Google added as requirements for enhanced Shopping listings - land in the 100,000 to 1 million bucket, suggesting they are widely deployed but have not yet reached the scale of the core product types. ShippingDeliveryTime sits in the same range.

On the content side: Article, BlogPosting, VideoObject, FAQPage, Question, and Answer all reach the 1 million to 10 million tier. FAQPage's position here is worth noting. Google had previously shown FAQ-derived rich results in search and then scaled back the feature for some query types. The data shows that adoption built up to a meaningful scale regardless - millions of domains implemented it, driven by the expectation of rich results eligibility. Whether that markup persists now that Google has adjusted the feature's display behaviour will become visible in future monthly releases: a move into a lower bucket would indicate that adoption is declining.

Recipe - a type frequently cited in SEO guides as a high-value implementation - sits in the 10,000 to 100,000 bucket. That places it well below Article and BlogPosting, reflecting the relative size of the food publishing sector compared to general content publishing. HowTo lands in the 100,000 to 1 million bucket, somewhat higher.

The predicate-to-type ratio shifts with adoption level

Comparing Predicates and Types within each bucket reveals a notable pattern. In the 10 million-plus tier, there are 31 Predicates and 12 Types - a ratio of roughly 2.6 Predicates per Type. At the 1 million to 10 million tier, the ratio is 1.9 to one. At 100,000 to 1 million, it rises to 3.1 to one. At the lowest tier - fewer than 1,000 domains - it jumps to 7.8 Predicates per Type.
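For the three tiers where this article states both counts, the ratios can be checked with a short calculation. The 100,000 to 1 million tier is left out because its predicate count is not given here; the bucket labels are shorthand invented for the example.

```python
# (predicates, types) per bucket, using only counts stated in this article.
COUNTS = {
    "over_10m": (31, 12),
    "1m_10m": (65, 35),
    "under_1k": (3779, 485),
}

ratios = {bucket: round(predicates / types, 1)
          for bucket, (predicates, types) in COUNTS.items()}

for bucket, ratio in ratios.items():
    print(f"{bucket}: {ratio} predicates per type")
# over_10m: 2.6 predicates per type
# 1m_10m: 1.9 predicates per type
# under_1k: 7.8 predicates per type
```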

The low-end spike is significant. It means that deeply specialist or experimental Types tend to come with a large number of associated Predicates that also see very limited use. This is characteristic of vocabulary areas where the schema was designed with comprehensive coverage of a domain but has not been widely implemented - healthcare being the clearest example. The MedicalCondition, MedicalProcedure, MedicalClinic, Hospital, Physician, and related Types all sit in the 10,000 to 100,000 bucket. The associated medical predicates, by contrast, cluster heavily in the fewer-than-1,000-domain range. The Type has some adoption; the detailed properties that would make it semantically meaningful do not.

Specialist verticals: healthcare, news, and fact-checking

The medical cluster sits notably lower than commercial and content types, despite healthcare being among the most structured-data-invested industries given Google's E-E-A-T guidelines for health content. MedicalClinic, MedicalCondition, MedicalOrganization, MedicalProcedure, MedicalSpecialty, MedicalWebPage, Hospital, and Physician all fall in the 10,000 to 100,000 range. The predicate isAcceptingNewPatients also sits there. This bucket position reflects both the smaller number of health publishers compared to general commerce, and the specialist nature of medical markup.

NewsArticle sits in the 100,000 to 1 million bucket - below the generic Article and BlogPosting types that sit one tier higher. This gap suggests that publishers are more likely to use the broader Article type than to precisely declare NewsArticle, even when the content is clearly news. Whether that represents a deliberate choice or a tooling default is not discernible from the data, but the difference is measurable.

ClaimReview and Claim, both designed to support fact-checking in search results and to enable features like fact-check labels in Google News, each fall in the 1,000 to 10,000 domain bucket. The predicate verificationFactCheckingPolicy - used to declare a newsroom's fact-checking approach - sits in the same range. This is a low adoption floor for vocabulary that has been available for years and has been highlighted as a way for news publishers to build trust signals. CorrectionComment, the most granular correction term in the vocabulary, has fewer than 1,000 adopting domains.

Podcast and broadcast: modest deployment

PodcastEpisode and PodcastSeries both land in the 10,000 to 100,000 domain bucket. Given the volume of podcast content on the web, that adoption level suggests that most podcast publishers are not using structured data to describe their episodes, or are relying on platform-level markup rather than site-level implementation. PodcastSeason falls below 1,000 domains. BroadcastEvent also sits in the 10,000 to 100,000 range.

Dataset and DataCatalog - relevant to the open data and research communities - fall in the 10,000 to 100,000 and 1,000 to 10,000 ranges respectively. GovernmentOrganization, LegalService, and Legislation are all in the 10,000 to 100,000 tier. These are not surprising given the size of those sectors, but they indicate that public sector and civic information publishing has limited structured data coverage compared to commercial publishing.

What 76.9% of terms sitting below 1,000 domains means

Combining Types and Predicates, 76.9% of all 5,545 terms in the dataset fall in the fewer-than-1,000-domain bucket. Viewed one way, this looks like a vocabulary inflated with unused terms. Viewed another way, it reflects a deliberate design philosophy: Schema.org has long been developed by extending the vocabulary to cover specialized domains - even ones with limited publishing communities - because the semantic web's value depends on having precise vocabulary available when it is needed.

The data now makes this tradeoff visible at scale. A developer or toolmaker can distinguish between terms that are low-adoption because they are niche but well-established (medical schemas, government schemas) and terms that are low-adoption because they are recently introduced or have not attracted an implementing community. The monthly cadence means that distinction will become more informative over time: a term sitting at fewer than 1,000 domains that begins moving into the 1,000 to 10,000 bucket within two or three releases is gaining traction; one that remains stationary is not.

The concentration at the top is equally stark: 43 terms - 12 Types and 31 Predicates - in the 10 million-plus bucket represent fewer than 0.8% of the total vocabulary. These 43 terms are the common language of the structured web. Everything else is either a specialist dialect or an aspiration.
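Both ends of the combined distribution follow from the counts already quoted. The sketch below is a straightforward arithmetic check using only figures from this article.

```python
# Totals and per-bucket counts as reported for the May 2026 release.
TOTAL_TYPES, TOTAL_PREDICATES = 958, 4587
BOTTOM_TYPES, BOTTOM_PREDICATES = 485, 3779  # fewer than 1,000 domains
TOP_TYPES, TOP_PREDICATES = 12, 31           # more than 10 million domains

total_terms = TOTAL_TYPES + TOTAL_PREDICATES
bottom_terms = BOTTOM_TYPES + BOTTOM_PREDICATES
top_terms = TOP_TYPES + TOP_PREDICATES

bottom_share = round(100 * bottom_terms / total_terms, 1)
top_share = round(100 * top_terms / total_terms, 2)

print(total_terms)                 # 5545
print(bottom_terms, bottom_share)  # 4264 76.9
print(top_terms, top_share)        # 43 0.78
```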

Timeline

June 2, 2011 - Schema.org launched by Google, Microsoft, and Yahoo as a collaborative structured data vocabulary initiative
November 2011 - Yandex joins Schema.org as a fourth sponsor
August 2025 - Google crawl rate declines affect multiple hosting platforms including PPC Land, highlighting the operational sensitivity of crawling infrastructure
November 2025 - Google moves crawling documentation from Search Central to a dedicated crawling infrastructure site, reflecting that crawlers serve multiple Google products
February 2026 - Google reduces Googlebot's maximum crawl file size from 15MB to 2MB, an 86.7% reduction attributed partly to AI infrastructure costs
March 2026 - Google updates Googlebot documentation covering 2MB file cap and IP range reorganization
March 2026 - Schema.org releases version 30.0 (dated 2026-03-19)
April 16, 2026 - European Commission adopts preliminary DMA findings requiring Google to share search data with competing engines
June 4, 2026 - Schema.org publishes the first monthly usage statistics dataset, covering May 2026 data, in CSV and JSON formats on GitHub, in collaboration with Google
June 5, 2026 - Ryan Levering, software engineer at Google, confirms on LinkedIn that the project was driven by Dan Brickley's long-standing request for public usage data from Google Search

Summary

Who: Schema.org and Google, with Ryan Levering (Google software engineer) leading the technical implementation and Dan Brickley (Schema.org coordinator) identified as the long-standing advocate for the release.

What: The first public dataset of aggregate usage statistics for Schema.org vocabulary terms, covering 958 Types and 4,587 Properties across millions of domains, organized into six domain count buckets. The data is available in CSV, JSON, and summary JSON formats on the Schema.org GitHub repository, and will be updated monthly.

When: The dataset was announced on June 4, 2026, on the Schema.org blog, with the first release covering May 2026 data. Monthly updates are planned going forward.

Where: The data is published on the official Schema.org GitHub repository and will also appear inline on Schema.org term pages. The underlying data comes from Google's public web crawling infrastructure.

Why: Understanding real-world adoption of structured data vocabulary is essential for developers, publishers, SEO practitioners, plugin builders, and researchers. Without aggregate crawl-scale data, decisions about which terms to implement had relied on indirect signals rather than measurement. The release also responds to a transparency gap that Dan Brickley had sought to address for years, and invites other crawlers to contribute their own statistics in the same open format.