New York passes bill forcing AI crawlers to identify themselves to news sites

by Luis Rijo
Luís Rijo is a seasoned marketing professional with over 10 years of experience in Digital Marketing, Search, Social, Display, Video, and DOOH. Based in Europe. Also writing in the spend. Reach out via luis@ppc.land
June 14, 2026


New York's legislature has sent a bill to the governor that would require any software crawling news publishers' websites to identify itself openly - and make covert scraping a civil violation carrying up to $15,000 in daily fines.

The New York State Assembly this month passed Assembly Bill A11292, a measure known formally as the New York Stealth Crawler Prohibition Act, establishing mandatory disclosure obligations for web crawlers that access covered news sources. The bill's journey through Albany was swift by legislative standards: introduced on May 8, 2026, it cleared the Assembly floor on June 5 - exactly four weeks later - with its Senate companion bill, S9934a, substituted in the process. It now heads to Governor Kathy Hochul, who must sign or veto it before it can become law.

What the bill actually requires

The legislation creates a new Article 48 within New York's General Business Law. Its core requirement, set out in section 1752, is that any crawler accessing a covered news source must disclose its identity and purpose before or at the time of access. That disclosure must take two specific forms.

First, the crawler must identify itself via a valid and accurate user-agent string. According to the bill, that string must state three things: the identity of the software product making the request, the version of that product, and the identity of the company behind it. This is not merely a suggestion to follow conventions already common among well-behaved bots. It is a statutory obligation enforceable through civil action.
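
The three required elements can be checked mechanically once a syntax is agreed. The bill, as described here, names the elements but not a format, so the sketch below assumes a hypothetical convention of `Product/Version (operator: Company)`. The pattern, the `CrawlerIdentity` type, and the `parse_user_agent` helper are illustrative names, not anything the bill defines.

```python
import re
from typing import NamedTuple, Optional


class CrawlerIdentity(NamedTuple):
    product: str   # identity of the software product
    version: str   # version of that product
    operator: str  # company behind the crawler


# Assumed convention: "Product/Version (operator: Company Name)".
# The bill lists the three elements but does not prescribe this syntax.
_UA_PATTERN = re.compile(
    r"^(?P<product>[A-Za-z][\w.-]*)/(?P<version>\d+(?:\.\d+)*)\s+"
    r"\(operator:\s*(?P<operator>[^)]+?)\s*\)$"
)


def parse_user_agent(ua: str) -> Optional[CrawlerIdentity]:
    """Return the three identity elements, or None if any is missing."""
    match = _UA_PATTERN.match(ua.strip())
    if match is None:
        return None
    return CrawlerIdentity(match["product"], match["version"], match["operator"])


print(parse_user_agent("ExampleBot/2.1 (operator: Example Corp)"))
print(parse_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

Under this assumed convention, a browser-style string fails the check because it names no operator. Whether a real string is "valid and accurate" in the statutory sense is a legal question that no pattern match can settle.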

Second, and more demanding, the crawler must disclose the "specific nature and purpose" of its activities, including - and the bill's language is explicit here - "all uses and purposes that the content of the covered news source could be used for." That disclosure must be made in a format the journalism provider can actually access. The bill does not prescribe a technical standard for how this second requirement should be satisfied, leaving that question for future guidance or litigation.

A crawler that fails to meet either requirement is defined by the bill as a stealth crawler. Deploying a stealth crawler in a way that damages, impairs, burdens, or causes economic harm to a covered news source is the conduct the bill prohibits.

Who counts as a covered news source

The bill's definition of a "covered news source" is detailed. It covers any website or other relevant platform belonging to print, television, radio, network, cable, satellite, or digital publications - but only those that meet all four of a set of conditions.

The outlet must perform a public-information function comparable to traditional journalism. It must make a substantial expenditure of labor, skill, and money to create, edit, produce, and distribute original content - including text, audio, photo, illustration, or video - concerning matters of public interest. It must publish new content or update existing content at least monthly and maintain a process for error correction. And it must have at least 1,000 monthly active viewers, listeners, users, or subscribers in New York.
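
The four-part test is conjunctive: failing any one condition takes an outlet out of scope. A minimal sketch of that logic follows, assuming each qualitative condition has already been judged and reduced to a yes/no value (which in practice would be a matter for a court or the Attorney General). The `Outlet` fields and the `is_covered_news_source` function are hypothetical names.

```python
from dataclasses import dataclass

# Audience threshold as reported in this article.
NY_AUDIENCE_THRESHOLD = 1_000


@dataclass(frozen=True)
class Outlet:
    public_information_function: bool      # comparable to traditional journalism
    substantial_original_investment: bool  # labor, skill, and money on original content
    updates_at_least_monthly: bool
    has_error_correction_process: bool
    ny_monthly_audience: int               # viewers, listeners, users, or subscribers in NY


def is_covered_news_source(outlet: Outlet) -> bool:
    """All conditions must hold; the audience test is the only numeric one."""
    return (
        outlet.public_information_function
        and outlet.substantial_original_investment
        and outlet.updates_at_least_monthly
        and outlet.has_error_correction_process
        and outlet.ny_monthly_audience >= NY_AUDIENCE_THRESHOLD
    )


local_blog = Outlet(True, True, True, True, ny_monthly_audience=1_200)
print(is_covered_news_source(local_blog))
```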

That final threshold is low enough to include a very large number of digital publications. A local news blog with a modest New York readership would qualify. A major national newspaper certainly would. The bill's authors, Assemblymember Otis and co-sponsor Alex Bores, appear to have drawn the line deliberately to capture the long tail of smaller outlets that have felt the economic pressure of AI-driven content extraction without having the legal resources of a large media company.

The enforcement mechanism

The enforcement section of the bill gives the New York Attorney General broad authority to act against violators. Whenever the AG believes an operator has engaged in, or is about to engage in, prohibited conduct, the office may bring an action in the name of the people of New York to enjoin the behavior. Civil penalties can reach $15,000 per day for each violation.
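
To give a sense of scale, the per-day ceiling can be turned into a rough upper bound on exposure. This is only arithmetic on the $15,000 figure: how a court would count separate violations or days is an assumption here, and `max_penalty_usd` is a hypothetical helper.

```python
# Ceiling reported in this article: up to $15,000 per day for each violation.
DAILY_PENALTY_CAP_USD = 15_000


def max_penalty_usd(days: int, violations: int = 1) -> int:
    """Upper bound on civil penalties, assuming every day and violation counts in full."""
    if days < 0 or violations < 0:
        raise ValueError("days and violations must be non-negative")
    return DAILY_PENALTY_CAP_USD * days * violations


# One violation sustained for 30 days.
print(max_penalty_usd(30))
```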

Notably, the bill states that no proof of injury to any specific person is required for a court to find a violation. That is a significant departure from typical civil tort requirements, and it lowers the procedural bar for enforcement considerably.

The bill also creates a parallel mechanism for journalism providers themselves. According to section 1754, a journalism provider may request the clerk of the Supreme Court - or a judge where there is no clerk - to issue a pre-litigation subpoena to a service provider for identification of an alleged violator. The service provider receiving such a subpoena must expeditiously disclose information sufficient to identify the alleged violator, to the extent that information is available. The subpoena must also include a provision requiring preservation of any relevant evidence in the service provider's possession.

This combination of AG enforcement and private pre-litigation discovery makes the bill unusually well-equipped for practical use. Publishers do not have to wait for a government agency to act. They can go to court themselves to identify whoever is crawling them covertly.

How the bill moved through Albany

The bill's legislative history is recorded in detail in the Assembly's actions log. After being introduced on May 8, 2026 and referred to the Committee on Science and Technology, it was reported to the Codes Committee on May 13. A week later, on May 20, it was reported again and referred to the Rules Committee. The Senate companion measure, S9934, cleared the Internet and Technology Committee on May 21 by a vote of 6 to 1. The full Senate voted on June 2, with 60 senators in favor and 1 against. The Assembly then passed the substituted bill on June 5 and returned it to the Senate.

The near-unanimous margins are striking. The lone Senate dissenter on the floor was Walczyk; Parker was recorded as absent, and two senators, Krueger and Brouk, participated via videoconferencing. The committee vote of 6 to 1 followed the same pattern. In the Assembly, the Rules Committee reported the bill at calendar position 633.

Why the user-agent requirement matters technically

The user-agent string is the mechanism by which any HTTP client identifies itself to a web server. When a browser or bot makes a request, it sends a header - "User-Agent" - that typically includes the software name, version, and sometimes the operating system or the company operating the client. Well-known crawlers have long published their user-agent strings. Googlebot, for instance, is documented and its IP ranges are publicly verifiable through Google's own systems.
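
As a minimal illustration of the mechanism, the snippet below prepares an HTTP request that carries an explicit `User-Agent` header, using Python's standard `urllib`. The crawler name, operator, and URL are hypothetical, and the request is built but never sent.

```python
from urllib.request import Request

# Hypothetical crawler identity, used only for illustration.
USER_AGENT = "ExampleNewsBot/1.4 (operator: Example Corp)"


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Prepare (but do not send) a GET request carrying a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://news.example/latest")
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

The header is set entirely by the client, which is why a server cannot treat it as proof of identity without some out-of-band verification such as published IP ranges.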

The problem the bill addresses is that many AI crawlers use vague strings that obscure their purpose, rotate through multiple identities, or abandon identifying headers altogether. Research published in January 2026 documented AI agents using adversarial tactics including user-agent spoofing, distributed IP rotation, rapid parallel requests, and JavaScript execution avoidance. A single query to Grok was found to trigger 16 requests from 12 IP addresses impersonating human browsers. That report described the collapse of what analysts called the "gentleman's agreement" governing automated web traffic - an informal understanding that well-behaved bots would identify themselves and respect robots.txt controls, while malicious bots would be blocked.

Separate data from DataDome found that 80% of AI agents do not declare themselves properly when visiting websites. That figure applies to commercial sites broadly, but the pattern is well-documented in the publishing sector too.

The bill's requirement to include the software version in the user-agent string goes further than most voluntary conventions. It would in principle allow a news publisher to know not just which company's crawler is accessing its content, but which specific build of that software - information that could be relevant to identifying whether a crawler is behaving consistently with its stated documentation.

The "all uses and purposes" clause

The second disclosure requirement - disclosing all uses that the content could be used for - is the more novel and legally uncertain element of the bill. It goes substantially beyond simply identifying the crawler. It asks the operator to forecast and enumerate every possible downstream use of the content retrieved.

In practical terms, this means a company deploying a crawler that retrieves news content for inclusion in an AI training dataset would need to disclose that use explicitly. A company retrieving content to provide real-time summaries in a chatbot interface would need to disclose that. A company whose crawler feeds both a search index and a training pipeline would need to disclose both.
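
One way an operator might publish such a disclosure is as a small machine-readable record. The bill, as described in this article, does not prescribe any format, so everything below is a hypothetical sketch: the `PurposeDisclosure` structure, its field names, and the purpose labels are assumptions, not a standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PurposeDisclosure:
    product: str
    version: str
    operator: str
    # Every use the retrieved content may be put to (labels are illustrative).
    purposes: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the disclosure so a publisher could fetch and read it."""
        return json.dumps(asdict(self), indent=2)


# A crawler feeding both a search index and a training pipeline lists both.
disclosure = PurposeDisclosure(
    product="ExampleNewsBot",
    version="1.4",
    operator="Example Corp",
    purposes=["search-index", "ai-training"],
)
print(disclosure.to_json())
```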

The bill does not define what format this disclosure should take or where it should appear. The text says only that it must be "in a format that the journalism provider can access." This ambiguity will likely generate substantial debate during any rulemaking process if the bill is signed into law. The 90-day implementation period the bill prescribes after becoming law - stated in section 2 as "the ninetieth day after it shall have become a law" - may prove short for companies running large-scale crawling operations to redesign their disclosure infrastructure.

Context: an industry fighting over who pays for content

The bill arrives at a moment when the relationship between AI companies and news publishers has become one of the most contested areas in technology law. PPC Land has tracked this conflict since at least 2024, when data showed 35.7% of the world's top 1,000 websites were blocking OpenAI's GPTBot - a seven-fold increase from the 5% blocking rate when the crawler first launched in August 2023.

Publishers initially responded to AI crawling through technical means: blocking crawlers via robots.txt, implementing Cloudflare's pay-per-crawl service launched in July 2025, or pursuing litigation. Those approaches have produced complicated results. Research published in early 2026 found that news publishers who blocked AI crawlers via robots.txt lost 23.1% of monthly visits without achieving proportional content protection, because robots.txt is a voluntary protocol that crawlers can ignore without technical consequence.
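
The voluntary nature of robots.txt is easy to see in code. The sketch below uses Python's standard `urllib.robotparser` to evaluate a hypothetical robots.txt that blocks GPTBot: the rule only applies to a client that announces itself under that name and then chooses to obey the answer. The file contents and URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots.txt rules for the name the client chooses to present."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://news.example/story"
print(is_allowed(ROBOTS_TXT, "GPTBot", url))       # declared crawler: blocked
print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))  # browser-like string: allowed
```

Nothing in the protocol stops a crawler from presenting a browser-like string or from skipping the check entirely, which is the gap a statutory disclosure duty is meant to address.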

An updated study published by Rutgers Business School and The Wharton School in April 2026 refined the estimate to a 7% decline in weekly traffic within six weeks of implementing blocks - a decline that appears in human browsing data, not just automated bot metrics. The implication is that AI systems may be driving meaningful referral traffic back to publishers even as they extract content, and that blunt blocking carries real costs.

Meta's leaked scraping operations, covered by PPC Land in August 2025, illustrated the scale of the problem: 6 million unique websites harvested, with major media companies among the targets. Over 80 media executives gathered in New York in July 2025 to confront the issue collectively.

The A11292 bill takes a different approach from both blocking and payment frameworks. It does not restrict crawling outright - the prohibition only applies to stealth crawlers that cause economic harm. And it does not mandate compensation. What it requires is transparency: operators must say who they are, what software they are running, and what they intend to do with what they retrieve.

That transparency requirement has a practical logic. A publisher cannot negotiate licensing terms, request removal of content, or assess whether a crawler's behavior matches its stated purpose without first knowing who is accessing its site. The bill creates the disclosure foundation that other legal and commercial mechanisms could build on.

Relation to federal and other state activity

New York's bill sits within a crowded regulatory landscape. At the federal level, the White House released a national AI policy framework in March 2026 calling on Congress to preempt state-level AI regulations - a move that, if followed, could limit the reach of measures like A11292. That framework did not address web crawling specifically, focusing instead on copyright, child safety, and free speech.

The TRAIN Act, introduced in the US Senate in July 2025 and followed by a House version in January 2026, would give copyright holders the ability to issue administrative subpoenas to identify which of their works appear in AI training datasets - a tool that addresses a problem related to, but distinct from, the one A11292 tackles.

California passed legislation in late 2025 requiring AI systems to disclose their artificial nature to users. New York's own chatbot liability bill targeting chatbot operators has separately advanced in the legislature. A11292 addresses a different layer of the stack: not the end-user interface, but the automated infrastructure that collects content before any human interaction occurs.

What happens next

The bill now awaits the governor's signature. If signed, it takes effect on the ninetieth day after becoming law. That timeline would place implementation in early September 2026 at the earliest. Companies running crawlers that access news sites with New York audiences would need to audit their user-agent configurations, prepare purpose-disclosure mechanisms, and assess whether their current crawling practices could be characterized as causing economic harm to any covered news source.
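
The effective-date arithmetic can be checked with a few lines of date math. This sketch assumes the ninetieth day is found by adding 90 calendar days to the date the bill becomes law, and it uses June 5, 2026 (the day the Assembly passed the bill) as a hypothetical earliest signing date. The actual signing date and the legal counting convention are not settled by anything in this article.

```python
from datetime import date, timedelta

IMPLEMENTATION_DAYS = 90  # "the ninetieth day after it shall have become a law"


def effective_date(became_law: date) -> date:
    """Assumed counting: add 90 calendar days to the date the bill became law."""
    return became_law + timedelta(days=IMPLEMENTATION_DAYS)


# Hypothetical earliest case: signed the same day the Assembly passed it.
print(effective_date(date(2026, 6, 5)))
```

Under these assumptions the earliest effective date falls on September 3, 2026, and every day of delay in signing pushes it back by a day.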

The severability clause in section 1755 protects the rest of the bill if any individual provision is struck down by a court. That is standard legislative drafting, but it is particularly relevant here given the potential for litigation over the "all uses and purposes" disclosure requirement, which could be challenged on grounds of vagueness or practical impossibility for crawlers that retrieve content at scale.

Whether the governor signs the bill or not, its detailed disclosure framework - the specific user-agent requirements, the purpose disclosure obligation, the pre-litigation subpoena mechanism for publishers - provides a model that other states and possibly federal legislators may draw on. New York has established disclosure requirements for AI systems in other contexts; this bill extends that logic to the underlying data-collection layer.

Timeline

August 2023 - OpenAI's GPTBot crawler launches; 5% of top 1,000 websites block it at launch
August 2024 - Blocking of GPTBot reaches 35.7% of top 1,000 websites, a seven-fold increase, as tracked by PPC Land
July 1-2, 2025 - Cloudflare launches pay-per-crawl service for AI content access, introducing a payment layer between publishers and crawlers
July 30, 2025 - Over 80 media executives gather in New York to address AI scraping at IAB Tech Lab summit; PPC Land covered the rally
August 2025 - Meta's leaked scraping list reveals the company harvested 6 million unique websites
August 29, 2025 - Cloudflare data shows Anthropic crawling 38,000 pages per referral back to publishers
December 9, 2025 - OpenAI revises ChatGPT crawler documentation, removing robots.txt compliance language for ChatGPT-User
January 1, 2026 - Research finds publishers who blocked AI crawlers lost 23.1% of monthly visits
January 6, 2026 - AI agents documented masquerading as human browsers, using spoofed user agents and distributed IP rotation
January 24, 2026 - TRAIN Act introduced in House, extending the Senate bill creating subpoena rights for AI-trained content
February 25, 2026 - Anthropic clarifies its three crawlers and how publishers can block them, following documentation update
March 3, 2026 - New York's chatbot liability bill reaches Senate floor, threatening AI providers with product liability exposure
March 22, 2026 - White House releases national AI policy framework recommending Congress preempt state AI laws
April 26, 2026 - Updated Wharton-Rutgers study refines crawler blocking cost to 7% weekly traffic decline within six weeks
May 1, 2026 - News publishers formally target Common Crawl, demanding it stop unauthorized scraping
May 8, 2026 - Assembly Bill A11292 introduced by Committee on Rules at request of Assemblymember Otis; referred to Science and Technology Committee
May 13, 2026 - A11292 reported from Science and Technology and referred to Codes Committee
May 20, 2026 - A11292 reported from Codes and referred to Rules Committee
May 21, 2026 - Senate companion bill S9934 passes the Internet and Technology Committee, 6 to 1
June 2, 2026 - S9934 passes the full Senate floor vote, 60 to 1
June 5, 2026 - Assembly passes A11292 as substituted by S9934a; bill returned to Senate and delivered to governor

Summary

Who: The New York State Assembly and Senate, acting on legislation introduced by Assemblymember Otis and co-sponsored by Alex Bores. The bill directly affects operators of AI web crawlers - including technology companies, AI developers, and any entity running automated tools that access news publisher websites.

What: Assembly Bill A11292, the New York Stealth Crawler Prohibition Act, creates a legal requirement for crawlers accessing covered news sources to identify themselves via valid user-agent strings and to disclose all purposes for which retrieved content may be used. Crawlers that fail to comply are defined as stealth crawlers. Deploying a stealth crawler in a manner that damages or economically harms a news publisher is a civil violation carrying penalties of up to $15,000 per day, enforceable by the Attorney General and through publisher-initiated pre-litigation subpoenas.

When: The bill was introduced on May 8, 2026, passed the Senate on June 2, and cleared the Assembly on June 5, 2026. If signed by the governor, it takes effect 90 days later.

Where: New York State. The law would apply to any operator - regardless of where the company is headquartered - whose crawler accesses a covered news source that serves at least 1,000 monthly active users in New York.

Why: The legislation responds to documented industry-wide tension between AI companies extracting news content at scale and publishers who have no legal mechanism to compel identification of crawlers operating without disclosure. The bill fills a gap that technical measures - robots.txt blocking, Cloudflare tools, informal voluntary conventions - have failed to close, giving publishers a statutory right to know who is accessing their content and for what purpose before any economic harm occurs.
