Google sues scraper while being the internet's biggest scraper itself

Google filed a December 19 lawsuit against SerpApi for scraping search results, yet the company scrapes billions of web pages for AI training without compensation.

Google logo pulling web pages through colorful beams depicting massive scraping operations.

Google filed a federal lawsuit against SerpApi LLC on December 19, 2025, alleging the Texas company violated the Digital Millennium Copyright Act by circumventing SearchGuard protections to scrape copyrighted content from search results. The irony runs deep. Google, which scrapes billions of web pages across the internet for artificial intelligence training and search indexing, now demands legal protection against companies that scrape its own search results pages.

The 13-page complaint, filed in the United States District Court for the Northern District of California under case number 25-10826, seeks statutory damages between $200 and $2,500 for each act of circumvention. SerpApi processes hundreds of millions of automated queries daily, creating potentially astronomical liability figures. Yet Google's own crawling operations dwarf this scale by orders of magnitude, accessing virtually every publicly available web page to power search services and train AI models that increasingly keep users within Google's ecosystem rather than directing them to publisher websites.
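To see how quickly those figures compound, the arithmetic can be sketched in a few lines of Python. The per-act range comes from the complaint; the daily count of 200 million is an assumption chosen only to sit at the low end of "hundreds of millions," and treating every query as one act of circumvention is likewise an assumption, not something a court has decided.

```python
# Illustrative arithmetic only; this is not a legal estimate.
MIN_DAMAGES_USD = 200      # per act of circumvention, as cited in the complaint
MAX_DAMAGES_USD = 2_500    # per act of circumvention, as cited in the complaint


def statutory_exposure(acts: int) -> tuple[int, int]:
    """Return the (low, high) statutory damages range in USD for a number of acts."""
    return acts * MIN_DAMAGES_USD, acts * MAX_DAMAGES_USD


# ASSUMPTION: 200 million acts in a single day (hypothetical figure),
# with each automated query counted as one act of circumvention.
assumed_daily_acts = 200_000_000

low, high = statutory_exposure(assumed_daily_acts)
print(f"One day: ${low:,} to ${high:,}")
# prints "One day: $40,000,000,000 to $500,000,000,000"
```

Even under these conservative assumptions, a single day of activity would produce a theoretical range far above the "few million dollars in annual revenue" the complaint attributes to SerpApi.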

The lawsuit creates striking parallels to events from February 2014, when Matt Cutts, then Google's head of web spam, publicly requested examples of scraper sites outranking original content. Dan Barker, a search marketing professional, responded by highlighting Google's own Knowledge Panels that displayed content scraped from Wikipedia with nearly identical text and formatting. "I think I have spotted one, Matt. Note the similarities in the content text," Barker wrote on February 27, 2014, accompanied by screenshots showing Google's search results functioning as a massive scraper site.

Fast forward to December 2025, and Google positions itself as victim rather than practitioner of large-scale scraping operations. The complaint emphasizes Google's investment in SearchGuard, a technological protection measure deployed in January 2025 after "tens of thousands of person hours and millions of dollars of investment." SearchGuard sends JavaScript challenges to search queries from unrecognized sources, requiring browsers to transmit specific information proving requests originate from human users rather than automated systems.

Yet while Google builds technological walls around its own search results, the company continues extracting massive quantities of content from publishers worldwide for artificial intelligence purposes. Penske Media Corporation filed a comprehensive federal antitrust lawsuit against Google on September 12, 2025, alleging the search giant "systematically coerces online publishers into providing content for artificial intelligence systems without compensation while simultaneously reducing website traffic that publishers depend on for revenue generation."

The 101-page Penske Media complaint describes an impossible choice facing publishers. They must either allow their content to be used for training and grounding AI models without payment, or face exclusion from search results that drive substantial portions of their revenue. "This action challenges Google's abuse of its adjudicated monopoly in General Search Services to coerce online publishers like PMC to supply content that Google republishes without permission in AI-generated answers that unfairly compete for the attention of users on the Internet," according to the complaint's opening statement.

Google's scraping operations for AI training extend far beyond what SerpApi allegedly extracts from search results. The European Commission launched a formal antitrust investigation on December 9, 2025, examining whether Google violated EU competition rules by using content from web publishers and YouTube creators for artificial intelligence purposes without appropriate compensation or viable opt-out mechanisms. Brussels regulators will assess whether Google imposed unfair terms on publishers and content creators while granting itself privileged access to training data that competitors cannot obtain.

According to the Commission's announcement, Google may have used publisher content to power AI Overviews and AI Mode features on search results pages without compensation. Google's "Google-Extended" controls, introduced in September 2023, purportedly allow publishers to prevent content usage for AI training. However, these controls provide insufficient granularity and exclude key products, according to the Penske Media lawsuit. Publishers who attempt to block AI training through technical means like robots.txt files face traffic penalties that make such blocking economically untenable.
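The granularity problem can be illustrated with a short sketch using Python's standard-library robots.txt parser. The robots.txt content and the publisher URL below are hypothetical, and the sketch shows only what the file expresses to a standards-following parser, not how any particular Google system behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a hypothetical robots.txt for an example publisher that opts out
# of the "Google-Extended" token while leaving ordinary crawling open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URL used only for illustration.
url = "https://publisher.example/recipes/lemon-cake"

for agent in ("Google-Extended", "Googlebot"):
    allowed = parser.can_fetch(agent, url)
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
# prints "Google-Extended: blocked" then "Googlebot: allowed"
```

The rule blocks only the one named token. Anything fetched under the general search crawler's token remains permitted, which is why a publisher cannot use this mechanism to withhold content from search-page features without also blocking search indexing itself.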

The economic imbalance Google creates through its scraping practices is stark. Cloudflare CEO Matthew Prince said during a CNBC interview on May 21, 2025, that ten years ago Google sent one visitor for every two pages it crawled. Six months before the interview, he said, the ratio had deteriorated to one visitor for every six pages, and it has since reached 15 pages scraped for every visitor sent to the original source. Publishers bear infrastructure costs for content creation, hosting, and bandwidth while Google extracts value through AI training and zero-click searches that keep users within Google's ecosystem.
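A quick calculation makes the shift concrete. The three ratios are the ones Prince cited; the crawl volume of 3,000 pages is a hypothetical figure chosen only because it divides evenly.

```python
# Pages crawled per visitor referred, as cited by Matthew Prince.
crawl_ratios = {
    "ten years earlier": 2,
    "six months earlier": 6,
    "latest figure": 15,
}

PAGES_CRAWLED = 3_000  # ASSUMPTION: hypothetical crawl volume for illustration


def visitors_referred(pages_crawled: int, pages_per_visitor: int) -> float:
    """Visitors a site would receive for a given number of crawled pages."""
    return pages_crawled / pages_per_visitor


for period, pages_per_visitor in crawl_ratios.items():
    visitors = visitors_referred(PAGES_CRAWLED, pages_per_visitor)
    print(f"{period}: {visitors:.0f} visitors per {PAGES_CRAWLED:,} pages crawled")

# Relative drop in referrals between the oldest and the latest ratio.
decline = 1 - crawl_ratios["ten years earlier"] / crawl_ratios["latest figure"]
print(f"Decline in visitors per page crawled: {decline:.1%}")
```

For the same amount of crawling, the referral count falls from 1,500 visitors to 200, a decline of roughly 87 percent.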

An Ahrefs analysis of 300,000 keywords found that AI Overviews reduce organic clicks by 34.5 percent when present in search results. Yet when asked during a December 15, 2025 podcast interview whether publishers should view their content differently for AI search, Nick Fox, Google's SVP of Knowledge and Information, responded with an unequivocal "no" and rejected proposals for standardized licensing deals that would allow smaller publishers to negotiate fair compensation.

Meanwhile, SerpApi founder Julien Khaleghy established the Austin-based company in 2017 after concluding that "scraping images from Google was an intensive process," according to Google's complaint. The business model centers on appropriating output from services that invested substantially to generate it, then delivering that content to third parties through paid subscription tiers. SerpApi advertises its "Google Search API" as a way to "Scrape Google," with specialized services targeting Knowledge Graph blocks, Google Shopping listings, and Google Maps data.

Google estimates SerpApi sends hundreds of millions of artificial search requests each day, with volume increasing as much as 25,000 percent over the past two years. The automated queries consume substantial computing resources without generating any offsetting revenue. Google's Terms of Service Agreement expressly forbids automated access to search content "in violation of the machine-readable instructions on our web pages (for example, robots.txt files that disallow crawling, training, or other activities)."

The complaint describes how SerpApi developed circumvention methods immediately after SearchGuard's January 2025 deployment effectively blocked the company's access. Khaleghy recently characterized the process as "creating fake browsers using a multitude of IP addresses that Google sees as normal users." These techniques involve misrepresenting device information, software details, or location data when responding to SearchGuard challenges, or syndicating legitimate authorizations across unauthorized machines worldwide.

SerpApi boldly markets these circumvention capabilities, promising customers they "don't need to care about … captcha, IP address, bots detection, maintaining user-agent, HTML headers, [or] being blocked by Google." The company claims to use "advanced algorithms to bypass CAPTCHAs and other anti-bot mechanisms, ensuring uninterrupted and efficient data extraction." A recent SerpApi blog post explained that SearchGuard had made web scraping "more difficult," but claimed the company was "fortunate to be minimally impacted" because its services had "already pre-solved Google's JavaScript challenge."

The lawsuit represents the second major legal action against SerpApi in 2025. Reddit sued SerpApi on October 22, 2025, along with Oxylabs, AWMProxy, and Perplexity AI, for circumventing both Reddit's anti-scraping measures and Google's SearchGuard system to scrape Reddit content from Google search results pages. That 41-page complaint described defendants as "similar to would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead."

The scraping controversy unfolds amid broader industry tensions over content access and AI training data. Over 80 media executives gathered in New York during the week of July 30, 2025, under the IAB Tech Lab banner to address what many consider an existential threat to digital publishing. Mediavine Chief Revenue Officer Amanda Martin joined representatives from Google, Meta, and numerous other industry leaders in confronting AI companies that scrape publisher content without consent or compensation. Notably absent from the gathering were the AI companies at the center of the controversy: OpenAI, Anthropic, and Perplexity.

Publishers increasingly view unauthorized AI training data collection as an existential threat. Research data presented showed over 35 percent of top websites now block OpenAI's GPTBot, while HUMAN Security documented a 107 percent year-over-year increase in scraping attacks. TollBit research demonstrated AI dependency on fresh, human-generated content to maintain accuracy and avoid hallucinations, with 2 million scrapes per site and a 117 percent surge in AI bot traffic reflecting heavy reliance on publisher content for training and real-time information retrieval.

One publishing company earned just $174 from AI crawlers over an extended period, according to data revealed on November 20, 2025. The meager revenue highlights a fundamental imbalance: while AI companies including Google scrape millions of pages to train models and power answer engines, publishers receive minimal compensation despite bearing the costs of content creation, hosting, and bandwidth. Some large outlets including Vox Media, Axel Springer, and People Inc. have signed licensing deals for use of their data, but the vast majority of websites have received no payment in exchange for AI firms using their content.

Google's complaint against SerpApi alleges violations of Section 1201(a)(1)(A) of the Copyright Act, which prohibits circumventing technological measures controlling access to copyrighted material. Each circumvention act carries statutory damages between $200 and $2,500. Additionally, Section 1201(a)(2) violations stem from manufacturing, offering, providing, or trafficking in services designed to circumvent technological measures. The lawsuit acknowledges that SerpApi reportedly earns "a few million dollars in annual revenue, but already faces liability that is orders of magnitude higher and growing" with hundreds of millions of additional violations every day.

Central to Google's legal argument is that SearchGuard qualifies as a technological measure that "effectively controls access" to copyrighted works. Google licenses copyrighted content specifically to enhance search results through Knowledge Panels displaying high-resolution images, Google Shopping featuring merchant-supplied product images and descriptions, and Google Maps incorporating various terrestrial pictures and business-supplied imagery. The complaint includes an example of a Willie Mays photograph that Google obtained under copyright license.

Yet publishers argue Google applies fundamentally different standards to its own scraping operations. Recipe creators confronted Google in December 2025 over AI features displaying complete recipes with errors, plagiarized content, and stolen photos without proper attribution. Adam Gallagher, co-founder of Inspired Taste, detailed specific problems affecting recipe publishers in a LinkedIn exchange with Nick Fox on December 2. "We would like to point out that we are still seeing branded searches for us and multiple recipe sites with full plagiarized recipes riddled with errors, using our photos," Gallagher wrote.

The lawsuit seeks injunctive relief compelling SerpApi to cease circumventing technological measures and to destroy any technology involved in Section 1201 violations. Google requests statutory damages for each of SerpApi's violations, or alternatively Google's actual damages and SerpApi's profits. The complaint emphasizes that "SerpApi's statutory violations are ongoing and are causing Google irreparable harm because SerpApi will not be able to pay the damages it will owe for its misconduct."

Google's position as both aggressive scraper of web content and litigious defender of its own search results creates tensions throughout the digital publishing ecosystem. The company's SearchGuard investment protects copyrighted content Google licenses from third parties, yet Google's own AI training operations extract far more value from publishers than SerpApi could ever scrape from Google's search results pages. Publishers who depend on search visibility for revenue cannot realistically block Google's AI crawlers without facing existential traffic losses, while Google deploys sophisticated technological and legal mechanisms to prevent similar scraping of its own content.

The lawsuit matters for the marketing community because it exposes asymmetrical power dynamics in content distribution. Google controls 89.2 percent market share in general search services, rising to 94.9 percent on mobile devices, according to federal court findings referenced in the Penske Media complaint. This dominance creates a "monopsony" position where Google controls publisher access to search referral traffic while simultaneously extracting their content for AI training without fair compensation.

Technical details about how Google circumvents publisher preferences appear throughout the Penske Media complaint. The lawsuit examines Bard, Gemini, Search Generative Experience, AI Overviews, and AI Mode. These products rely on large language models trained on massive text corpora scraped from publisher websites. "The training process for Google's LLMs involves storing encoded copies of the training works in computer memory, repeatedly passing them through the model with words masked out, and adjusting the parameters to minimize the difference between the masked-out words and the words that the model predicts to fill them in," the complaint explains.

Publishers face what the Penske Media complaint characterizes as a "Hobson's choice" between allowing Google to use their content for AI systems or losing search visibility entirely. Google's robots.txt instructions prohibit automated scraping of search results, yet Google's own crawlers extract content from publisher websites at unprecedented scale. The double standard becomes apparent when examining enforcement: Google files lawsuits against companies scraping its search results while simultaneously facing EU antitrust complaints over using publisher content without appropriate compensation or viable opt-out mechanisms.

The SerpApi lawsuit follows Google's pattern of protecting its own interests through technological and legal means while maintaining aggressive content extraction from publishers who lack similar recourse. Google eliminated the num=100 parameter on September 14, 2025, fundamentally transforming how tools access search result data by forcing 10 separate requests instead of one to retrieve 100 results. When SerpApi developed a workaround retrieving 100 organic results through its Light Fast API, Google blocked that method as well, restricting the service to just three results.
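The effect of that change on request volume is simple to work out. The sketch below is generic pagination arithmetic with hypothetical function names; it does not reproduce any actual Google or SerpApi interface.

```python
import math


def requests_needed(total_results: int, results_per_request: int) -> int:
    """Number of requests required to collect total_results."""
    return math.ceil(total_results / results_per_request)


def page_offsets(total_results: int, results_per_request: int) -> list[int]:
    """Zero-based offset of the first result on each page."""
    return list(range(0, total_results, results_per_request))


# Before the change: 100 results could be returned by a single request.
print(requests_needed(100, 100))   # prints 1

# After the change: 10 results per request, so 10 requests for the same data.
print(requests_needed(100, 10))    # prints 10
print(page_offsets(100, 10))       # prints [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```

A tenfold increase in requests for the same data raises costs for every rank-tracking tool and also multiplies the number of times each tool must pass any access check.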

Google's legal strategy differs from previous scraping disputes by emphasizing copyright protection through DMCA provisions rather than contract violations or terms of service breaches. The framework provides statutory damages that could theoretically exceed SerpApi's ability to pay given the massive number of alleged violations. This approach may establish precedent for platform protection measures while Google continues extracting publisher content for AI training under different legal theories that publishers challenge through antitrust litigation and regulatory complaints.

The timing of Google's lawsuit against SerpApi coincides with intensifying regulatory scrutiny of data scraping practices and mounting legal pressure from publishers. Amazon blocked AI bots from major tech companies in August 2025, updating its robots.txt file to prohibit crawlers from Meta, Google, Huawei, Mistral, and other technology firms. The e-commerce giant maintains a $56 billion advertising business built around shoppers browsing its marketplace, with third-party AI tools that bypass Amazon's storefront potentially undermining both website traffic and advertising revenue streams.

Summary

Who: Google LLC filed the lawsuit against SerpApi LLC, a Texas-based company founded by Julien Khaleghy in 2017 that provides automated web scraping services through API subscriptions. Google itself operates as the internet's largest scraper, extracting billions of web pages for search indexing and AI training while facing a separate antitrust lawsuit from Penske Media Corporation and a regulatory investigation by the European Commission over unauthorized publisher content usage.

What: The lawsuit alleges SerpApi violated the Digital Millennium Copyright Act by circumventing Google's SearchGuard technological protection measures to scrape copyrighted content from search results pages, processing hundreds of millions of automated queries daily through techniques including creating fake browsers, misrepresenting device information, and syndicating authorizations to bypass security challenges. Meanwhile, Google scrapes publisher content at far greater scale for AI training, with Cloudflare data showing Google now scrapes 15 pages per visitor sent to original sources compared to one visitor for every two pages crawled a decade ago.

When: The complaint was filed December 19, 2025, following SearchGuard's January 2025 deployment and SerpApi's immediate development of circumvention techniques. Google's own AI scraping operations accelerated dramatically throughout 2025, with AI Overviews reducing organic clicks by 34.5 percent when present in search results, according to Ahrefs analysis.

Where: The case was filed in the United States District Court for the Northern District of California as case number 25-10826, targeting SerpApi's operations from Austin, Texas that affect Google's Mountain View, California operations, while Google's AI scraping affects publishers globally including those who filed the September 12, 2025 Penske Media antitrust lawsuit and triggered the December 9, 2025 European Commission investigation.

Why: Google seeks to protect copyrighted content licensed from third parties that appears in Knowledge Panels, Google Shopping, and Google Maps features, and to prevent unauthorized appropriation that undermines content partnerships and imposes infrastructure costs. Yet the lawsuit exposes fundamental contradictions in Google's position: the company aggressively scrapes publisher content for AI training without adequate compensation while defending its own search results through DMCA provisions that publishers, facing Google's 89.2 percent share of search, cannot similarly invoke against its scraping operations.