Joost de Valk's website spec: 128 rules to future-proof your site

by Luis Rijo
Luís Rijo is a seasoned marketing professional with over 10 years of experience in Digital Marketing, Search, Social, Display, Video, and DOOH. Based in Europe. Also writing in the spend. Reach out via luis@ppc.land
May 30, 2026


Joost de Valk, the Dutch entrepreneur best known as the creator of Yoast SEO, has published the Website Specification - a platform-agnostic reference document that consolidates technical standards for building a modern website into a single, openly licensed resource at specification.website. The spec covers 128 topics across 10 categories, with each entry assigned one of four statuses: required, recommended, optional, or avoid.

Why this exists

The premise is simple. According to de Valk in his LinkedIn announcement, the problem he kept running into was having to point at six different sources to back a single recommendation: "WHATWG for HTML. WCAG for accessibility. IETF for headers. schema.org for structured data. MDN, web.dev, Google Search Central for everything else."

The result is a reference that pulls those layers together. According to the site's about page, the web is "a layer cake of standards" - WHATWG defines HTML, W3C ratifies WCAG, the IETF publishes RFCs behind security headers and /.well-known/ URIs, search engines publish their own rules, and browsers add their own quirks. "Almost nobody carries the whole picture," the page states. This spec is an attempt to carry it.

The target audience is broad. According to the about page, the spec is intended for engineers auditing or building a website, designers and PMs trying to scope quality, and - specifically - AI agents that need a machine-readable description of what to check. That last category is both significant and novel: specification.website is itself structured so that an AI agent can query the entire spec via an MCP server. Agent access is a topic that marketers and publishers have been grappling with as AI crawlers consume a growing proportion of web traffic.

The ten categories and their scope

The specification is organized into 10 areas, and the breakdown by topic count reveals where de Valk placed the most weight.

Foundations covers 14 topics - the HTML, head, and document basics that every page needs. Required items include the HTML doctype declaration, the lang attribute on the html element, the meta charset declaration (which must appear in the first 1,024 bytes of HTML), the meta viewport tag, and the title element. According to the spec, the title element is used by "browsers, search engines, screen readers, social previews, and AI agents." Recommended items in this category include canonical URLs, Open Graph protocol tags, feed discovery, and the Popover API as a replacement for ARIA-dependent JavaScript modals.
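The required Foundations items lend themselves to an automated check. The sketch below uses only the Python standard library; the class and function names are hypothetical, and the 1,024-byte test is a crude substring search rather than a faithful copy of how browsers sniff encodings.

```python
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Collects the required Foundations signals from one HTML document."""

    def __init__(self):
        super().__init__()
        self.found = {"doctype": False, "lang": False, "charset": False,
                      "viewport": False, "title": False}
        self._in_title = False

    def handle_decl(self, decl):
        # For <!DOCTYPE html> the parser passes "DOCTYPE html".
        if decl.strip().lower().startswith("doctype html"):
            self.found["doctype"] = True

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.found["lang"] = True
        elif tag == "meta" and "charset" in attrs:
            self.found["charset"] = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "viewport":
            self.found["viewport"] = True
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only a non-empty title counts.
        if self._in_title and data.strip():
            self.found["title"] = True


def audit_foundations(html: str) -> dict:
    """Returns one True/False flag per required Foundations item."""
    parser = HeadAudit()
    parser.feed(html)
    parser.close()
    result = dict(parser.found)
    # Assumption: a plain substring search stands in for the 1,024-byte rule.
    head_bytes = html.encode("utf-8")[:1024].lower()
    result["charset_in_first_1024_bytes"] = b"charset" in head_bytes
    return result
```

Calling audit_foundations on a page's HTML returns a dictionary of flags, so any False value points at a required item that is missing.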

SEO carries 13 topics. Required items are HTTP redirects (with a specific note to use 301 or 308 for permanent moves and never chain more than necessary), meta robots and the X-Robots-Tag, and heading hierarchy. According to the spec, "soft 404s" - pages that display a not-found message while returning HTTP 200 to a crawler - are listed as an "avoid" item: "search engines treat soft 404s as a quality problem and often refuse to index them." Recommended topics include robots.txt, XML sitemaps, URL structure, internal linking, and JSON-LD structured data. IndexNow is listed as optional, with a note that Google does not participate in that protocol.
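Two of these SEO rules can be expressed as small checks. The following is a minimal sketch, not the spec's own tooling: the phrase list used to spot a not-found page and the one-hop limit are both assumptions made for illustration, and the redirect check only applies to moves that are meant to be permanent.

```python
PERMANENT = {301, 308}

# Hypothetical phrases; a real audit would need a list tuned to the site.
NOT_FOUND_PHRASES = ("page not found", "no longer exists")


def is_soft_404(status: int, body: str) -> bool:
    """True when a page tells people 'not found' but tells crawlers 200."""
    text = body.lower()
    return status == 200 and any(phrase in text for phrase in NOT_FOUND_PHRASES)


def audit_redirect_chain(hops: list[int], max_hops: int = 1) -> list[str]:
    """hops holds the status codes returned before the final response.

    Assumption: max_hops=1 is our reading of "never chain more than necessary".
    """
    problems = []
    if len(hops) > max_hops:
        problems.append(f"chain of {len(hops)} redirects; expected at most {max_hops}")
    for code in hops:
        if code not in PERMANENT:
            problems.append(f"{code} is not a permanent redirect (use 301 or 308)")
    return problems
```

An empty list from audit_redirect_chain means the chain is short and every hop uses a permanent status code.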

Accessibility is the largest single category at 20 topics. Among the required items: colour contrast, image alt text, form labels, keyboard navigation, visible focus indicators, semantic HTML and landmark elements, descriptive link text, accessible form errors, document language, reduced motion support, captions and transcripts, accessible data tables, and touch target size. WCAG 2.2 sets the enhanced touch target at 44 by 44 CSS pixels, with a minimum of 24 by 24. Accessibility overlays - third-party JavaScript widgets that claim to make a site WCAG-compliant at runtime - are listed under "avoid." According to the spec, they "do not work, often harm screen-reader users, and attract lawsuits."
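The two touch target sizes can be turned into a simple classifier. This is a sketch under stated assumptions: the function name and the three labels are ours, and the check ignores the spacing and inline-text exceptions that WCAG allows for the 24 by 24 minimum.

```python
def classify_touch_target(width_px: float, height_px: float) -> str:
    """Classifies a target by its smaller side, measured in CSS pixels."""
    smallest = min(width_px, height_px)
    if smallest >= 44:
        return "enhanced"   # meets the 44 by 44 enhanced size
    if smallest >= 24:
        return "minimum"    # meets the 24 by 24 minimum only
    return "too small"
```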

Security covers 12 topics. Required items are HTTPS with TLS 1.2 or 1.3; HSTS with max-age, includeSubDomains, and preload (described as "an irreversible commitment"); X-Content-Type-Options with nosniff; clickjacking protection via CSP frame-ancestors; and cookie attributes. According to the spec, every cookie should be "Secure, HttpOnly where possible, and have an explicit SameSite," with the __Host- and __Secure- prefixes recommended for session cookies. Content Security Policy and Referrer-Policy are listed as recommended, as is /.well-known/security.txt, which is described as cheap to publish and something that "dramatically lowers the bar for responsible disclosure."
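The cookie and HSTS requirements can be checked against raw header values. The sketch below is illustrative only: the function names are hypothetical, the parsing is deliberately naive, and the __Host- rule it applies (Path=/ and no Domain attribute) comes from the cookie prefix convention rather than from the article.

```python
def parse_attributes(parts):
    """Turns ['Secure', 'SameSite=Lax'] into {'secure': '', 'samesite': 'Lax'}."""
    attrs = {}
    for part in parts:
        if not part:
            continue
        key, _, value = part.partition("=")
        attrs[key.strip().lower()] = value.strip()
    return attrs


def audit_set_cookie(header_value: str) -> list[str]:
    """Returns the problems found in one Set-Cookie header value."""
    name_value, *rest = [p.strip() for p in header_value.split(";")]
    name = name_value.partition("=")[0]
    attrs = parse_attributes(rest)
    problems = []
    if "secure" not in attrs:
        problems.append("missing Secure")
    if "httponly" not in attrs:
        problems.append("missing HttpOnly")
    if "samesite" not in attrs:
        problems.append("missing explicit SameSite")
    if name.startswith("__Host-"):
        # Prefix convention: Path must be "/" and Domain must be absent.
        if attrs.get("path") != "/" or "domain" in attrs:
            problems.append("__Host- cookie needs Path=/ and no Domain")
    return problems


def audit_hsts(header_value: str) -> list[str]:
    """Returns the missing pieces of one Strict-Transport-Security value."""
    directives = parse_attributes([d.strip() for d in header_value.split(";")])
    problems = []
    if not directives.get("max-age", "").isdigit():
        problems.append("missing max-age")
    for flag in ("includesubdomains", "preload"):
        if flag not in directives:
            problems.append(f"missing {flag}")
    return problems
```

Because the spec says HttpOnly applies "where possible", a missing HttpOnly flag is a prompt for review rather than an automatic failure.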

Well-Known URIs covers 9 topics organized around the /.well-known/ path prefix, which was standardized in RFC 8615. No item in this category is marked required. Recommended items include /.well-known/api-catalog, defined in RFC 9727, which publishes a machine-readable index of the APIs and resources a host exposes.

Agent Readiness at 18 topics is arguably the most forward-looking category, and the one that intersects most directly with the discussions the marketing community has been having about AI crawler behavior. It includes /llms.txt as recommended - described as "a proposed markdown file at the site root that gives LLMs a curated index of your most important content" and explicitly noted as "emerging convention, not a ratified standard." The /llms-full.txt companion file is listed as optional. Both the adoption challenges of llms.txt and the broader question of what signals AI agents actually use have been widely debated.
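For readers who have not seen one, an /llms.txt file is plain Markdown. The sketch below generates one following the layout in the llms.txt proposal as we understand it (a title heading, a blockquote summary, then sections of annotated links). The function name, the sample site, and the URLs are all hypothetical, and since the convention is not ratified the layout may change.

```python
def build_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Builds llms.txt content.

    sections maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    # End the file with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Example Shop",
        "A hypothetical store used for illustration.",
        {"Docs": [("Returns", "https://example.com/returns.md", "Return policy")]},
    ))
```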

Stable URLs are marked required in this category. "URLs are public contracts," the spec states. "Breaking them invalidates citations, bookmarks, links, and agent caches."

Among the optional items are MCP and tool discovery - described as "an emerging way for sites to expose queryable tools to agents over JSON-RPC" - along with A2A agent cards, DNS for AI Discovery (DNS-AID) using SVCB/HTTPS records, NLWeb for conversational interface discovery, WebMCP for browser-native agent tools, and a site-specific convention called Schemamap that indexes one JSON-LD endpoint per resource via /schemamap.xml.

Web Bot Auth, which uses RFC 9421 HTTP Message Signatures to let a bot cryptographically prove its identity, is listed as optional. PPC Land has reported on the limited adoption of Web Bot Auth among AI operators, with Google experimenting with the web-bot-auth protocol under the identity https://agent.bot.goog as noted in updated crawler documentation.

Performance spans 19 topics and includes Core Web Vitals as required. The spec sets the targets at LCP at or below 2.5 seconds, INP at or below 200 milliseconds, and CLS at or below 0.1, measured at the 75th percentile of real users. Image optimization (WebP and AVIF formats, correct viewport sizing, explicit dimensions) and compression (brotli preferred, gzip as fallback) are also required. Cache-Control is required, with a specific recommended pattern: immutable plus max-age=31536000 for fingerprinted assets and no-cache for HTML.
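The "75th percentile of real users" wording matters: a site passes when three quarters of field measurements meet the target, not when the average does. The sketch below shows that check on lists of field samples. The nearest-rank percentile method, the metric key names, and the dictionary holding the Cache-Control pattern are our own choices for illustration, not something the spec prescribes.

```python
import math

# Targets quoted above: LCP <= 2.5 s, INP <= 200 ms, CLS <= 0.1.
THRESHOLDS = {"lcp_ms": 2500, "inp_ms": 200, "cls": 0.1}

# Cache-Control pattern quoted above; the dictionary keys are illustrative.
CACHE_CONTROL = {
    "fingerprinted_asset": "max-age=31536000, immutable",
    "html": "no-cache",
}


def percentile_75(samples):
    """Nearest-rank 75th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def passes_core_web_vitals(field_data: dict) -> dict:
    """Maps each metric to True when its 75th percentile meets the target."""
    return {
        metric: percentile_75(field_data[metric]) <= limit
        for metric, limit in THRESHOLDS.items()
    }
```

In this reading, a handful of very slow page loads does not fail a site as long as the value at the 75th percentile stays within the target.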

Recommended performance items include lazy loading (with a specific warning: "never on the LCP element"), Speculation Rules, View Transitions, Back/Forward Cache eligibility, HTTP/2 at minimum, HTTP/3 where possible, and the No-Vary-Search response header - which tells browsers that some URL query parameters like UTM tracking codes do not change the response content.
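The effect of No-Vary-Search is easiest to see as a cache-key calculation: two URLs that differ only in ignored parameters should map to the same cached response. The sketch below is an illustration under assumptions. The parameter list and function names are hypothetical, and the header value follows the params=(...) list syntax as we understand the current draft.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change the response.
TRACKING_PARAMS = ("utm_source", "utm_medium", "utm_campaign")


def no_vary_search_header(params) -> str:
    """Builds a header value such as params=("utm_source" "utm_medium")."""
    quoted = " ".join(f'"{p}"' for p in params)
    return f"params=({quoted})"


def cache_key(url: str, ignored=TRACKING_PARAMS) -> str:
    """Drops ignored query parameters so equivalent URLs share one key."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in ignored]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```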

Privacy covers 6 topics. Required items are a privacy policy and cookie consent, with the spec noting that "in the EU and UK, non-essential cookies require freely given, informed, specific, and unambiguous opt-in consent." Recommended items include Global Privacy Control, which California and Colorado require sites to honour, and privacy-respecting analytics described as "aggregate, cookieless, EU-hosted analytics tools."
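Global Privacy Control reaches the server as a request header. A minimal sketch of honouring it, assuming the Sec-GPC header with the value 1 as described in the GPC proposal; the function name is hypothetical, and what a site must do once the signal is detected depends on the jurisdiction.

```python
def gpc_opt_out(headers: dict) -> bool:
    """True when the request carries a Global Privacy Control signal."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value.strip() for key, value in headers.items()}
    return normalized.get("sec-gpc") == "1"
```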

Resilience carries 5 topics. Custom 404 and 500 error pages are required and must "return the correct HTTP status code, explain what went wrong in plain language, and offer the user a way forward without leaking implementation details."

Internationalisation covers 12 topics. Automatic IP-based language redirects are listed as "avoid" - they "trap users in the wrong language, break search crawlers, and break shared links." Required items include the lang attribute on inline content, and recommended items include hreflang, a language switcher that lists each locale in its own language, and RTL support for Arabic, Hebrew, Persian, and Urdu.
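The hreflang and RTL items can be sketched in a few lines. The function names and example URLs are hypothetical, and the language set covers only the four languages named above rather than every right-to-left script.

```python
from html import escape

# Language subtags for Arabic, Hebrew, Persian, and Urdu.
RTL_LANGUAGES = {"ar", "he", "fa", "ur"}


def text_direction(locale: str) -> str:
    """Returns the dir attribute value for a locale such as 'ar-EG'."""
    primary = locale.split("-")[0].lower()
    return "rtl" if primary in RTL_LANGUAGES else "ltr"


def hreflang_links(alternates: dict) -> list[str]:
    """Builds one alternate link element per locale.

    alternates maps a locale (or 'x-default') to its URL.
    """
    return [
        f'<link rel="alternate" hreflang="{escape(locale)}" href="{escape(url)}">'
        for locale, url in alternates.items()
    ]
```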

The MCP server

One technically significant aspect of the launch is the publication of an MCP server at mcp.specification.website/mcp. According to the spec's MCP page, the server uses the Streamable HTTP MCP 2025-03-26 protocol revision, is stateless, and requires no authentication.

The server exposes five tools: search (returning ranked spec pages with title, status, category, URL, and body excerpts), list_topics (a filtered index by category or status), get_topic (full canonical Markdown for one page), get_checklist (a tickable Markdown checklist grouped by category), and get_categories (the ten top-level categories with topic counts). There is also a prompt - audit_url - that generates an audit plan for a target URL against required spec items, optionally narrowed to a single category such as security.
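Under the hood, an MCP client invokes these tools with JSON-RPC 2.0 messages. The sketch below builds such messages using the tools/call and prompts/get methods from the MCP specification. The tool and prompt names come from the list above, but the argument names (query, url, category) are assumptions, since the article does not give the input schemas; nothing here sends a request.

```python
import json

MCP_ENDPOINT = "https://mcp.specification.website/mcp"


def tool_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Builds a JSON-RPC 2.0 message that invokes one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def prompt_request(request_id: int, prompt: str, arguments: dict) -> dict:
    """Builds a JSON-RPC 2.0 message that fetches one MCP prompt."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "prompts/get",
        "params": {"name": prompt, "arguments": arguments},
    }


if __name__ == "__main__":
    # Argument names below are hypothetical.
    print(json.dumps(tool_call(1, "search", {"query": "hsts"}), indent=2))
    print(json.dumps(prompt_request(2, "audit_url", {
        "url": "https://example.com", "category": "security"}), indent=2))
```

A real client would first run the MCP initialize handshake and then POST these bodies to the endpoint, which is why the sketch stops at building the messages.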

According to the spec's design notes, the MCP server and the website both build from the same Markdown source files, and the server bundles a JSON manifest at build time, meaning there is no runtime parsing and no drift between what the site shows and what the agent queries.

To connect from Claude Desktop, the configuration requires editing the claude_desktop_config.json file and adding a single JSON entry with the transport set to "http" and the URL set to https://mcp.specification.website/mcp. According to the spec, any MCP client speaking the 2025-03-26 Streamable HTTP revision connects with that URL alone, with no client SDK to install and no token to manage.
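As a rough illustration of that entry, the sketch below merges a server definition into an existing configuration without disturbing other servers. The transport value and URL come from the description above; the mcpServers key, the entry name website-spec, and the helper function are assumptions, so the spec's MCP page remains the reference for the exact shape.

```python
import json


def add_server(config: dict, name: str, url: str) -> dict:
    """Returns a copy of config with one HTTP MCP server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # assumed top-level key
    servers[name] = {"transport": "http", "url": url}
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    updated = add_server({}, "website-spec", "https://mcp.specification.website/mcp")
    # Paste the printed JSON into claude_desktop_config.json after review.
    print(json.dumps(updated, indent=2))
```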

The discoverability architecture mirrors what the spec itself recommends: a server card at /.well-known/mcp/server-card.json, a Link header with rel="mcp" on every page, an entry in /.well-known/api-catalog per RFC 9727, and cross-linking from the relevant spec page. MCP as an infrastructure layer saw rapid adoption across the marketing technology stack throughout 2025, with Google Analytics and marketing data platforms both shipping servers that year.

How it is built and licensed

The site is a static build generated with Astro and deployed to Cloudflare Pages from GitHub. Content lives in plain Markdown files under src/content/spec/. According to the about page, "a page without a source link should not be merged" - sources are part of the content schema, not an optional annotation. The spec draws on MDN Web Docs, the Yoast Developer Portal for SEO patterns, Equalize Digital for accessibility, WP Accessibility Knowledge Base, and the standards bodies themselves: WHATWG, W3C, IETF, WCAG, and IANA.

Content is licensed under CC BY 4.0 and code under MIT. According to the about page, the invitation is explicit: "use it, fork it, ship it."

The site itself implements every item in the spec. According to the about page, this includes a strict Content Security Policy served as a response header, the full security header set via Cloudflare Pages _headers, /.well-known/security.txt, llms.txt and llms-full.txt for AI agents, robots.txt, a sitemap index, an RSS feed, JSON-LD structured data on every page, Open Graph and Twitter Cards, and WCAG-aligned colour contrast, focus indicators, a skip link, and semantic landmarks. If any contradiction appears between what the spec says and what the site does, the about page is explicit: "that is a bug."

Why it matters for the marketing community

The timing connects to several pressures marketers and web professionals are navigating simultaneously. AI agents are increasingly traversing the web independently of users: on retail sites, AI crawlers reportedly make 198 crawls for every visit they send, against a ratio of six crawls to one visit for Google. The infrastructure for how sites signal content permissions to those agents is fragmented, with the IAB Tech Lab, IETF, and individual platforms each proposing different mechanisms.

The question of what constitutes a well-formed site for this environment has no single, authoritative, non-vendor answer. The Google Search Central documentation is platform-specific. The WCAG guidelines cover only accessibility. The IETF RFCs cover specific protocols. There has been no single document that maps all of it.

What the Website Specification offers is a consolidated audit surface. The checklist published at specification.website/checklist lists every item in a flat, tickable format, print-friendly and organized by category. A practitioner can work through Foundations (14 items), SEO (13), Accessibility (20), Security (12), Well-Known URIs (9), Agent Readiness (18), Performance (19), Privacy (6), Resilience (5), and Internationalisation (12) - arriving at a total of 128 checkpoints drawn from primary sources.
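The per-category counts can be tallied to confirm the headline figure. A small sketch using the numbers quoted above; the variable names are ours.

```python
TOPICS_PER_CATEGORY = {
    "Foundations": 14,
    "SEO": 13,
    "Accessibility": 20,
    "Security": 12,
    "Well-Known URIs": 9,
    "Agent Readiness": 18,
    "Performance": 19,
    "Privacy": 6,
    "Resilience": 5,
    "Internationalisation": 12,
}

total = sum(TOPICS_PER_CATEGORY.values())
largest = max(TOPICS_PER_CATEGORY, key=TOPICS_PER_CATEGORY.get)

if __name__ == "__main__":
    print(f"{len(TOPICS_PER_CATEGORY)} categories, {total} topics, largest: {largest}")
```

The tally gives 10 categories and 128 topics, with Accessibility the largest, which matches the figures reported for the spec.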

For digital agencies, the spec provides a structured way to define quality standards across client work without inventing those standards from scratch. For in-house teams, it maps the gap between what a site currently does and what current standards require - including standards that have emerged specifically around AI agent interaction, which are not yet reflected in most site auditing tools.

Reactions

The LinkedIn post announcing the launch drew significant engagement within hours. Chudi Nnorukam-Krisdiva, who described building agent tooling for AI operators, wrote that the agent readiness section was where they kept getting stuck when building a tool called AVR, noting that without a single spec to point at, they had been "stitching them manually for citability.dev audits and the seams showed every time." They described the gap as not whether sites ship llms.txt and JSON-LD - "most do not" - but whether sites can tell if AI agents are actually traversing those files. "The spec and the signal are two different problems," they wrote.

Core Web Vitals consultant Erwin Hofman raised a technical concern about the Interaction to Next Paint advice in the spec, specifically around scheduler.postTask, and noted that preloading fonts carries a cost on slow connections where it can become render-blocking despite inlining critical CSS.

Lily Ray, who describes herself as founder of Algorythmic and VP of SEO and AI Search at Amsive, was also visible in the comments; her reaction drew six likes, making her engagement among the most visible from the SEO community.

Timeline

September 3, 2024: Jeremy Howard proposes the /llms.txt protocol as an emerging convention for AI content discovery
December 10, 2024: Cloudflare launches Robotcop to enforce robots.txt policies against AI crawlers
July 1, 2025: Cloudflare launches pay-per-crawl service for AI crawler monetization
July 22, 2025: Google Analytics launches MCP server for AI-powered data conversations
August 20, 2025: IAB Tech Lab launches Content Monetization Protocols working group for AI
September 1, 2025: Analysis shows AI crawling imbalance between training and referral patterns
September 26, 2025: Cloudflare announces AI Index for website content discovery
December 20, 2025: Report shows AI crawlers consume 4.2% of web traffic as internet grows 19% in 2025
March 7, 2026: Report shows AI bots crawl retail sites 198 times more per visit than Google
March 20, 2026: Google adds Google-Agent to official crawler list and references web-bot-auth protocol
April 6, 2026: Research shows blocking AI crawlers does not stop AI citations
May 30, 2026: Joost de Valk publishes the Website Specification at specification.website, covering 128 topics across 10 categories with an open MCP server at mcp.specification.website/mcp

Summary

Who: Joost de Valk, entrepreneur and creator of Yoast SEO, along with open-source contributors.

What: The Website Specification - a platform-agnostic, openly licensed reference (content under CC BY 4.0, code under MIT) covering 128 technical topics across 10 categories, including an open MCP server that allows AI agents to query the full spec without authentication.

When: Published May 30, 2026, with the LinkedIn announcement and site launch occurring on the same day.

Where: Available at specification.website, with the MCP server at mcp.specification.website/mcp and source code on GitHub under a CC BY 4.0 content licence and MIT code licence.

Why: To consolidate web standards from WHATWG, W3C, IETF, WCAG, and search engine documentation into a single, opinionated, source-cited reference that engineers, designers, PMs, and AI agents can use to audit or build sites - addressing the absence of any single platform-agnostic document covering foundations, SEO, accessibility, security, well-known URIs, agent readiness, performance, privacy, resilience, and internationalisation together.
