Microsoft Clarity now flags robots.txt violations inside Bot Analytics

Microsoft Clarity today added robots.txt violation detection to its Bot Analytics dashboard, giving website publishers a direct view into which AI crawlers are ignoring access rules - and precisely which content those crawlers are targeting despite explicit instructions to stay out.

The release, announced on June 23, 2026, by Ihab Rizk on the Microsoft Clarity blog, extends the platform's Bot Analytics dashboard with a dedicated violations layer. Publishers can now see non-compliant bot requests as a percentage of total bot traffic, track violation trends over time, and filter results by operator, bot name, and activity type. The feature is live today for all Clarity users who have connected a supported CDN provider.

What the new feature does

The core addition is a Violations card sitting inside the existing Bot Analytics dashboard. According to Microsoft, the card shows violations as a percentage of total requests made by bots to a site - not just a raw count, but a proportion that allows quick comparison against the broader volume of automated traffic. In the screenshot shared alongside the announcement, that figure reads 4.56% of total requests. With total requests shown at 246,035 across AI crawlers in the illustrated data set, that 4.56% translates to roughly 11,219 non-compliant requests in the measured window.

The percentage framing matters. A site receiving 500 daily bot requests and one receiving 500,000 face structurally different situations, even if both show 100 violations. Clarity's ratio-based display makes the signal comparable across sites of very different sizes.
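The arithmetic behind that comparison can be sketched in a few lines. This is an illustration only: the function name violation_share is an assumption made for this example and is not part of any Clarity API, and the inputs are the hypothetical sites described above.

```python
def violation_share(violations: int, total_bot_requests: int) -> float:
    """Return violations as a percentage of total bot requests."""
    if total_bot_requests <= 0:
        raise ValueError("total_bot_requests must be positive")
    return 100.0 * violations / total_bot_requests

# Same raw violation count, very different traffic volumes.
small_site = violation_share(100, 500)       # 100 of 500 daily bot requests
large_site = violation_share(100, 500_000)   # 100 of 500,000 daily bot requests

print(f"small site: {small_site:.2f}%  large site: {large_site:.2f}%")
```

On the smaller site, one in five bot requests is non-compliant; on the larger one, the same 100 violations amount to two in every ten thousand.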

A violation trendline accompanies the card, plotting how non-compliant activity changes over time. This lets operators detect spikes - a sudden surge in violations might indicate a new crawler entering the field or an existing one changing behavior - and monitor whether patterns are stabilizing or escalating. The trendline is positioned as a persistent monitoring tool rather than a one-off audit capability.
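Microsoft has not described how, or whether, Clarity flags spikes automatically. For publishers who export or transcribe the daily figures, one simple approach is to compare each day against a trailing window. The sketch below assumes a list of daily violation percentages; the function name, the seven-day window, the three-standard-deviation threshold, and the sample numbers are all illustrative choices, not Clarity behavior.

```python
from statistics import mean, stdev

def find_spikes(daily_rates: list[float], window: int = 7, threshold: float = 3.0) -> list[int]:
    """Return indexes of days whose rate sits more than `threshold`
    standard deviations above the mean of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_rates)):
        history = daily_rates[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat windows, where a z-score is undefined.
        if sigma > 0 and (daily_rates[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Hypothetical daily violation percentages, with one obvious jump.
rates = [4.1, 4.3, 4.0, 4.2, 4.4, 4.1, 4.3, 9.8, 4.2]
spike_days = find_spikes(rates)
print(spike_days)
```

A trailing window means a single spike inflates the baseline for the following days, so a sustained escalation is better judged by eye on the trendline than by this check alone.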

Filtering is the feature's third component. Publishers can slice violation data by operator, by individual bot name, and by activity type. That combination allows a reasonably precise diagnosis: not just "some crawler is ignoring my robots.txt" but "Operator X's bot named Y is generating requests of type Z to these specific paths." According to Microsoft, the dashboard also shows the URLs and paths generating the violations, distinguishing whether crawlers are attempting to reach high-value editorial content, restricted resources, or other sections marked as off-limits.
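The filtering logic itself is conceptually straightforward. The sketch below models it over a handful of made-up records. The BotRequest fields, the activity labels "training" and "search", and every operator and bot name are hypothetical stand-ins, since the announcement does not document Clarity's underlying data schema or any export format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class BotRequest:
    operator: str          # hypothetical: company running the crawler
    bot_name: str          # hypothetical: crawler user-agent token
    activity_type: str     # hypothetical labels such as "training" or "search"
    path: str
    violates_robots: bool  # True when the path was disallowed for this bot

def filter_violations(
    requests: Iterable[BotRequest],
    operator: Optional[str] = None,
    bot_name: Optional[str] = None,
    activity_type: Optional[str] = None,
) -> list[BotRequest]:
    """Keep only violating requests that match every filter that was supplied."""
    return [
        r for r in requests
        if r.violates_robots
        and (operator is None or r.operator == operator)
        and (bot_name is None or r.bot_name == bot_name)
        and (activity_type is None or r.activity_type == activity_type)
    ]

sample = [
    BotRequest("Operator X", "BotY", "training", "/premium/report", True),
    BotRequest("Operator X", "BotY", "training", "/news/today", False),
    BotRequest("Operator X", "BotZ", "search", "/archive/2019", True),
    BotRequest("Operator W", "BotQ", "training", "/premium/data", True),
]

matches = filter_violations(sample, operator="Operator X", bot_name="BotY", activity_type="training")
paths = [r.path for r in matches]
print(paths)
```

Leaving a filter unset widens the view: calling filter_violations(sample) with no arguments returns every violating request regardless of who sent it.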

The final analytical layer is a side-by-side comparison of compliant and non-compliant requests. This comparison is designed to give operators a fuller picture of how individual AI platforms and crawlers interact with their published content overall, not just the subset that violates rules.

Why robots.txt compliance is a live issue

Robots.txt, introduced as an informal web standard in 1994 and formalized as RFC 9309 in 2022 after nearly three decades of unofficial use, operates as a voluntary protocol. Crawlers can read the directives and choose to ignore them without any automatic technical consequence. The burden of enforcement falls entirely on the publisher, who must implement additional mechanisms - firewall rules, CDN-level blocks, legal action - if a crawler refuses to comply.
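What counts as a violation can be made concrete with Python's standard-library robots.txt parser. The example below is a sketch, not a description of how Clarity classifies requests: the robots.txt content, the bot names, and the is_violation helper are invented for illustration. It also shows why the protocol is voluntary, because the check only reports what the file says and does nothing to stop a request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is kept out of /premium/,
# and every other bot is kept out of /admin/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /premium/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_violation(user_agent: str, path: str) -> bool:
    """A request is a violation when robots.txt disallows it for that agent."""
    return not parser.can_fetch(user_agent, path)

for agent, path in [
    ("ExampleBot", "/premium/report"),
    ("ExampleBot", "/news/today"),
    ("OtherBot", "/admin/settings"),
]:
    print(agent, path, "violation" if is_violation(agent, path) else "allowed")
```

A bot with its own named group follows that group rather than the wildcard one, so in this sketch ExampleBot is not barred from /admin/ even though every other bot is.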

That gap between declared policy and actual crawler behavior has widened considerably as AI platforms scaled. Research published by Rutgers Business School and The Wharton School, covering data through mid-2025, found that roughly 75% of the top 30 news publishers in the study's sample had blocked at least one major AI crawler at some point via robots.txt - yet the voluntary nature of the protocol left compliance far from guaranteed.

Cloudflare's Robotcop tool, launched in December 2024, addressed part of this by converting robots.txt rules into Web Application Firewall rules enforced at the network edge rather than relying on crawler cooperation. Clarity's new feature approaches the problem differently. It does not enforce compliance - it measures and surfaces it, giving publishers the data they need to decide whether enforcement action is warranted.

OpenAI introduced a further complication in late 2025 when it announced that its ChatGPT-User agent would no longer follow robots.txt directives for user-initiated browsing. That change was covered by PPC Land when OpenAI revised its crawler documentation in December 2025. Anthropic, by contrast, clarified in February 2026 that its three crawlers - ClaudeBot, Claude-User, and Claude-SearchBot - do respect robots.txt and that the company will not attempt to bypass CAPTCHAs. Whether those commitments hold in practice has remained a point of contention, as Reddit's lawsuit against Anthropic alleged the company accessed its platform more than 100,000 times after publicly claiming it had stopped.

The aggregate picture from Cloudflare's 2025 data showed that AI bots accounted for 4.2% of all HTML requests across its network in December 2025, with GPTBot alone ranging from 2.4% of AI crawling traffic in early April to 6.4% in late June. Against that backdrop, the 4.56% violation figure shown in Clarity's own illustrative dashboard data lands in a plausible range, though it reflects a different measurement universe: violations as a share of bot requests to a specific site, not all HTML traffic across Cloudflare's network.

The crawling volume context

The scale of AI crawling has grown substantially in a short time. HUMAN Security's State of AI Traffic report from April 2026 documented automation growing eight times faster than human traffic. Botify data cited in PPC Land's coverage of the AI bot ecosystem showed OpenAI bots crawling retail sites 198 times for every single referral visit they deliver - compared to 1 in 6 for Google. Kinsta's infrastructure analysis, published earlier this month, found AI bots trapped in query-string loops hammering WooCommerce cart and checkout pages up to 3.75 million times in a single day across a sample of 10 billion requests.

Within that context, knowing which portion of crawler traffic violates stated access preferences - and which content attracts the most non-compliant attention - provides a different category of intelligence than traffic volume alone. A publisher might accept 246,000 bot requests if those crawlers are behaving within declared preferences. The same publisher would likely respond differently to learning that 4.56% of those requests targeted pages explicitly marked as disallowed.

The content-level visibility in Clarity's new feature addresses this directly. By analyzing violation activity by path and content type, publishers can identify whether non-compliant crawlers are concentrated on specific sections - premium content, restricted archives, user-generated areas - or spread evenly across the site. That distinction shapes the appropriate response. A spike in violations concentrated on a subscription paywall calls for different action than scattered violations across general editorial content.
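One way to tell "concentrated" from "spread evenly" is to bucket violating paths by their top-level section and look at the largest bucket's share. The sketch below assumes a plain list of violating URL paths copied from the dashboard; the helper names and the sample paths are illustrative, and real sites may need a section mapping that matches their own URL structure.

```python
from collections import Counter

def section_of(path: str) -> str:
    """Map a URL path to its top-level section, e.g. /premium/a -> /premium."""
    parts = [p for p in path.split("/") if p]
    return "/" + parts[0] if parts else "/"

def violations_by_section(paths: list[str]) -> Counter:
    """Count violating requests per top-level section."""
    return Counter(section_of(p) for p in paths)

# Hypothetical violating paths.
violating_paths = ["/premium/a", "/premium/b", "/archive/2019/x", "/premium/c", "/"]

counts = violations_by_section(violating_paths)
top_section, top_count = counts.most_common(1)[0]
top_share = top_count / len(violating_paths)

print(f"{top_section} accounts for {top_share:.0%} of violations")
```

A high share for a single section, such as a paywalled directory, points toward a targeted block on that path; a flat distribution suggests a site-wide rule for the offending bot instead.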

Technical setup requirements

The feature requires a CDN integration to function. According to Microsoft, project administrators must connect a supported CDN through the AI Visibility section in Project Settings before violation data becomes available. The supported providers are Fastly, Amazon CloudFront, and Cloudflare.

For WordPress sites running the latest version of the Microsoft Clarity plugin, AI Bot Activity - the broader category that includes violation tracking - becomes available automatically. Sites on older versions of the Clarity plugin for WordPress will need to update to access the feature. This distinction matters in practice: WordPress powers approximately 43% of all websites globally, according to figures cited in prior Clarity announcements, and the platform has been a consistent focus of Clarity's integration roadmap.

Once the CDN is connected, the access workflow is five steps. First, open the project in Clarity and navigate to Bot Analytics. Second, locate the Violations card to review the share of non-compliant requests. Third, apply filters for operator, bot name, and activity type to narrow the view. Fourth, review violating URLs, paths, and content types. Fifth, compare compliant and non-compliant requests over time to identify patterns and determine whether to adjust monitoring, enforcement, or content protection workflows.

Microsoft notes that for users already running Bot Analytics, violation insights are ready to use immediately - no additional configuration is required beyond the CDN connection.

Clarity's expanding bot intelligence layer

This release sits within a sequence of Clarity updates addressing AI's impact on web analytics. Microsoft introduced AI channel groups on August 29, 2025, enabling dedicated tracking of traffic arriving from ChatGPT, Claude, Gemini, Copilot, and Perplexity as distinct sources. That update addressed the downstream end of the AI content lifecycle - measuring referral traffic after AI systems direct users to source websites.

The Bot Activity dashboard itself launched on January 21, 2026, addressing the upstream end: which AI crawlers access content before any grounding, citation, or referral activity occurs. At launch, the dashboard showed total requests from AI crawlers, the proportion of site traffic they represented, and site pages crawled as a percentage of total page volume. The Violations layer announced today adds a third dimension: behavioral compliance relative to the publisher's stated preferences.

Taken together, the sequence maps a fairly complete picture of how AI systems interact with a given website. Crawlers arrive and make requests - some compliant, some not. Some of that crawling eventually generates citations or referral traffic, which the AI channel groups capture. The gap between the two phases, where content is consumed without any attributable downstream visit, remains invisible to most analytics systems. Clarity's expanding toolkit attempts to make different parts of that chain measurable.

A December 2025 analysis from Microsoft Clarity's own research team found that AI-referred traffic had grown 155% over an eight-month period, though it still represented less than 1% of total visitors in the analyzed dataset. The same research found that AI-sourced visitors converted to sign-ups at 1.66% compared to 0.15% from organic search - an 11-fold advantage in conversion rate, despite the smaller volume.

What the marketing industry can do with this data

For marketing and analytics teams, the practical value of violation data operates at several levels. At the most immediate level, it establishes a baseline. Publishers now know that in a given period, a specific share of bot traffic ignored their robots.txt rules. That number can be tracked over time. If it rises after a new AI model launches, or spikes when a specific operator scales its crawling, the pattern is now visible in the same dashboard already open for behavioral analytics.

The operator-level filtering adds a comparative dimension. If one AI platform consistently generates higher violation rates than others, publishers can factor that pattern into decisions about whether to pursue more aggressive enforcement through CDN-level blocking or WAF rules. Clarity does not enforce compliance itself, but it supplies the evidence that informs enforcement decisions elsewhere - through Cloudflare, Fastly, or CloudFront configurations.

The content-level data has a direct connection to content strategy. If violation activity concentrates on specific content types - long-form editorial, data visualizations, research documents - that concentration signals which parts of a site AI crawlers consider most valuable, regardless of whether they have permission to access them. For publishers thinking about how to negotiate with AI platforms or how to structure licensing discussions, knowing which content draws the most non-compliant attention provides a negotiating reference point.

There is also a measurement hygiene argument. Bot traffic that bypasses declared rules and reaches disallowed pages can, under some conditions, generate events that pollute behavioral analytics. Clarity already distinguishes between human and bot sessions in its core analytics. The violations layer adds specificity: this traffic is not just automated but actively non-compliant, and the content it accessed was explicitly marked as off-limits.

Kinsta's June 2026 Bot Protection launch noted that bot sessions hitting checkout URLs could trigger retargeting pixels and pollute conversion data fed into automated bidding systems. A similar logic applies to violation traffic on any restricted content: when it generates analytics events, those events do not reflect genuine user interest. They reflect crawler behavior that the site administrator had already determined should not occur.

The broader industry context, tracked extensively by PPC Land, is one of rising blocking rates alongside rising agentic traffic. HUMAN Security's May 2026 data found that the rate at which sites block agentic traffic climbed to nearly 9% - up from 8.2% the previous month - even as total agentic traffic volume dipped 4.3% month over month. Publishers are increasingly active in enforcement. Clarity's new feature gives them a more precise measurement foundation for those decisions.

Timeline

1994 - robots.txt introduced as an informal web standard for communicating crawl preferences to automated systems
2022 - robots.txt formalized as internet standard RFC 9309 after nearly three decades of unofficial use
June 29, 2024 - Cloudflare introduces a feature to block AI scrapers and crawlers, giving publishers one-click controls over training data access
December 10, 2024 - Cloudflare launches Robotcop, converting robots.txt directives into Web Application Firewall rules enforced at the network level
August 29, 2025 - Microsoft Clarity introduces AI channel groups, enabling tracking of traffic from ChatGPT, Claude, Gemini, Copilot, and Perplexity as distinct sources
October 24, 2025 - Perplexity denies training AI models as Cloudflare documents stealth crawlers generating 20-25 million daily declared requests alongside 3-6 million undeclared ones
December 9, 2025 - OpenAI revises ChatGPT crawler documentation, removing robots.txt compliance language for ChatGPT-User in user-initiated browsing contexts
December 18, 2025 - Microsoft Clarity research finds AI referral traffic grew 155% in eight months, converting at 1.66% versus 0.15% from organic search
December 31, 2025 - Rutgers and Wharton researchers publish working paper finding publishers who blocked AI crawlers via robots.txt experienced a 23.1% total traffic decline
January 21, 2026 - Microsoft Clarity launches Bot Activity dashboard, giving website operators visibility into which AI systems crawl their properties and in what volumes
February 25, 2026 - Anthropic clarifies its three web crawlers - ClaudeBot, Claude-User, and Claude-SearchBot - and commits to respecting robots.txt directives
March 22, 2026 - Czech publishers gain an updated robots.txt shield covering real-time AI response crawlers, not just training data collection
April 5, 2026 - Research finds only 7.4% of Fortune 500 companies have implemented llms.txt, while 92.8% use robots.txt and 53.8% use JSON-LD
April 9, 2026 - HUMAN Security State of AI Traffic report documents automation growing 8 times faster than human web traffic
April 26, 2026 - Updated Wharton and Rutgers research finds publishers blocking AI crawlers lost roughly 7% of weekly website traffic within six weeks
June 4, 2026 - HUMAN Security Satori team publishes May 2026 agentic traffic report finding blocking rates climbed to nearly 9% while overall agentic traffic declined 4.3%
June 9, 2026 - Kinsta launches Bot Protection for all WordPress plans at no added cost, giving owners controls over AI crawlers inside MyKinsta
June 17, 2026 - Kinsta analysis of 10 billion requests finds AI bots hammered WooCommerce cart pages up to 3.75 million times in a single day
June 23, 2026 - Microsoft Clarity announces robots.txt violation detection inside Bot Analytics, surfacing non-compliant crawler behavior alongside existing traffic and compliance data

Summary

Who: Microsoft, through its Clarity web analytics platform, announced by Ihab Rizk on the Microsoft Clarity blog. The feature is directed at website publishers and administrators using Clarity's Bot Analytics dashboard, particularly those on CDN providers including Fastly, Amazon CloudFront, and Cloudflare.

What: A new Violations layer inside the Bot Analytics dashboard that detects and surfaces instances where AI crawlers and bots request URLs that a site's robots.txt file explicitly disallows. The feature includes a Violations card showing non-compliant requests as a percentage of total bot requests - illustrated at 4.56% in the announcement - a violation trendline, filtering by operator, bot name, and activity type, URL-level visibility into what content attracts non-compliant traffic, and a side-by-side comparison of compliant versus non-compliant requests.

When: The announcement was published on June 23, 2026. For existing Bot Analytics users, the violation insights are available immediately. New users must first connect a supported CDN through the AI Visibility section of their Project Settings. WordPress sites using the latest Clarity plugin receive AI Bot Activity automatically; older plugin versions require an update.

Where: Available within the Microsoft Clarity platform at the Bot Analytics dashboard level, accessible globally. CDN support covers Fastly, Amazon CloudFront, and Cloudflare. WordPress integration is handled through the Microsoft Clarity plugin.

Why: The robots.txt standard operates as a voluntary protocol with no automatic enforcement mechanism. AI crawlers can and do ignore directives without technical consequence, leaving publishers without visibility into how frequently their access rules are breached or which content attracts non-compliant attention. Clarity's new feature addresses the measurement gap - supplying the data publishers need to assess whether enforcement action through CDN or WAF rules is warranted, which operators are most frequently non-compliant, and whether violation rates are rising or stabilizing over time.