TikTok leads data collection surge as AI training reshapes scraping landscape
TikTok jumped to number one among the most scraped websites in 2025 as companies collect diverse content for AI training, with video and social media platforms now representing 38% of all scraping activity.

The data collection landscape underwent dramatic transformation in 2025, driven primarily by an explosion in artificial intelligence training requirements and the emergence of multimodal AI systems. According to new research released on September 9, 2025, by web scraping company Decodo, TikTok jumped from outside the top 10 to claim the number one position among most scraped websites, representing a 321% traffic growth from 2024.
The annual study examined scraping patterns across thousands of websites to identify which platforms companies target most frequently for data collection. Video-first platforms dominated the rankings this year, with YouTube climbing to fourth position and TikTok securing the top spot. Combined with other video and social media platforms, this category now accounts for 38% of all scraping activity.
"In 2025, outdated data is useless. LLMs and AI agents live on real-time, relevant information collected from various sources, including product reviews, the latest research papers, and trending content on community platforms," said Vytautas Savickas, CEO at Decodo. "Companies are betting their future on having access to this kind of current, reliable data."
Subscribe PPC Land newsletter ✉️ for similar stories like this one. Receive the news every day in your inbox. Free of ads. 10 USD per year.
Multimodal content drives scraping priorities
The shift toward video platforms reflects fundamental changes in AI development. Companies no longer collect only text data for their artificial intelligence systems. Modern AI training requires videos, images, audio, and text combined to create multimodal models that understand context across different media types.
TikTok's rise demonstrates this trend clearly. With over 1.5 billion active users and a unique algorithm-driven discovery system, the platform provides diverse content essential for training next-generation AI models. Companies extract video content and metadata, hashtag trends, user engagement metrics, audio and music usage data, creator analytics, comment sentiment, and geographic trending data.
YouTube's emergence at position four shows similar patterns. The platform receives 500 hours of content uploads every minute and serves 2.7 billion monthly users, creating an extensive dataset for companies building AI systems. Organizations use YouTube videos to train models capable of understanding speech, recognizing objects, analyzing facial expressions, and detecting cultural nuances from visual storytelling.
The platform's mix of languages, accents, and content types provides exactly what AI developers need to build systems that understand human communication through sight and sound, not just text. Companies collect video metadata, engagement metrics, channel analytics, trending video identification, comment sentiment analysis, video transcript extraction, and audio-visual correlation data.
Search engines maintain critical importance
Google retained its position as a critical data source, though it dropped from first to second place. The platform processes over 13.7 billion searches daily, providing insights into global search trends, consumer behavior patterns, and real-time market demand across every industry and geography. Despite TikTok's surge, Google showed 84% traffic growth from 2024.
Companies continue collecting search result rankings and featured snippets, local business listings and reviews, Google Shopping product listings and prices, image search results, news aggregation data, and auto-suggest keyword data. The slight decline in relative position reflects diversification rather than reduced importance.
The data collection industry also faces mounting resistance from major platforms: Amazon blocked AI crawlers from Meta, Google, and Huawei on August 21, 2025. According to independent analyst Juozas Kaziukėnas, Amazon's updated robots.txt file now explicitly prohibits these companies from scraping data from the world's largest online marketplace.
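For readers unfamiliar with how a robots.txt block works in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` user agent, and the `example.com` URLs are hypothetical and chosen for illustration; they are not taken from Amazon's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only;
# this is NOT Amazon's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        allowed = is_allowed(SAMPLE_ROBOTS_TXT, agent, "https://example.com/products/1")
        print(agent, "allowed" if allowed else "blocked")
```

Note that robots.txt is a voluntary convention: it tells well-behaved crawlers what the site owner permits, but it does not technically prevent access.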
eCommerce intelligence becomes more sophisticated
Amazon dropped from second to third position but demonstrated 151% traffic growth from 2024. The platform remains essential for eCommerce intelligence, from dynamic pricing and product assortment monitoring to customer reviews and market trend analysis. Companies collect product listings and specifications, pricing data, customer reviews, seller information, inventory availability, best-seller rankings, and sponsored product advertising data.
The typical eCommerce scraping use case evolved from simple price monitoring to advanced competitive intelligence systems that track product availability, customer reviews, shipping times, and competitor marketing strategies in real time. Walmart maintained fifth position with 67% growth, while eBay held seventh with 107% growth, demonstrating continued demand for marketplace data.
"We've seen an increasing demand for data from eCommerce platforms like Coupang, Amazon, and Walmart," said Gabrielė Verbickaitė, Senior Product Marketing Manager at Decodo. "Businesses are increasingly collecting more data from each platform, meaning these sites now play a bigger role in pricing strategies, product assortment decisions, and shaping customer experiences."
Coupang emerged as a significant new entrant at sixth position with 259% growth, highlighting globalization of eCommerce data collection. As South Korea's leading online retailer, Coupang offers valuable insights into consumer behavior, pricing strategies, and cross-border commerce in one of Asia's most dynamic markets.
Academic and business intelligence sources gain prominence
ScienceDirect entered the rankings at eighth position with 148% growth, reflecting growing demand for high-quality, factually accurate data sources. Beyond academic research, businesses increasingly turn to peer-reviewed content to support market analysis, product development, and strategic decision-making.
The platform provides research paper abstracts and metadata, citation networks, author collaboration patterns, emerging research trends, technical terminology, geographic research distribution, and publication timeline analysis. For technology developers, authoritative sources help ensure information reliability, while enterprises rely on these platforms for trusted insights into emerging technologies, scientific discoveries, and industry trends.
Crunchbase secured ninth position with 132% growth, highlighting demand for reliable business intelligence data. Companies, investors, and analysts rely on the platform to track startups, funding activity, and industry shifts. The data supports market research, competitive benchmarking, investment due diligence, and corporate strategy development.
Travel and hospitality data collection expands
Airbnb claimed the tenth position with 18% growth, representing the travel industry's growing reliance on data. As one of the largest peer-to-peer accommodation platforms, Airbnb provides valuable information for understanding pricing, availability, and traveler preferences across global markets.
Travel companies, hospitality groups, and analysts use Airbnb data to refine pricing strategies, optimize inventory, benchmark against competitors, and track trending holiday destinations. Companies collect property listings and availability, pricing trends across locations, host performance metrics, guest review sentiment, seasonal demand patterns, and alternative accommodation growth.
Platforms lose ground in rankings
Several previously prominent websites dropped from the top 10, indicating shifting priorities. TripAdvisor fell from third position, suggesting that companies are replacing review platforms with more comprehensive data sources offering richer content variety and real-time insights. Craigslist dropped from fifth position, becoming less relevant for modern AI training needs compared to other community forums with higher user activity.
Bing fell from sixth position as businesses reduced data collection from this search engine and prioritized Google, the dominant search platform, for real-time data. Microsoft's decision to end Bing Search APIs on August 11, 2025, further complicated access to this data source, with recommended replacements costing 40-483% more.
Shopify dropped from eighth position as individual store scraping declined while businesses focus on major marketplace data. Lazada and Zillow also fell from the rankings, replaced by bigger eCommerce marketplaces and broader business intelligence platforms respectively.
Data collection trends indicate future directions
The marketing community faces significant implications from these shifting patterns. Publishers already struggle with identity challenges, with 84% unable to identify more than 25% of their website visitors according to recent Wunderkind research. AI-powered data collection compounds these challenges as automated systems harvest content at unprecedented scales.
A leaked list of Meta's scraping targets revealed systematic data collection from approximately 6 million unique websites, including news organizations, educational platforms, and various content sites. The list includes roughly 100,000 of the internet's most-trafficked domains, demonstrating the scope of modern data collection efforts.
The rise of AI training requirements creates new challenges for content creators and publishers. Google's search expert John Mueller warned that automated content can turn into a liability for a site, stating that using large language models to create topic clusters provides "reasons not to visit any part of your site."
Marketing professionals must adapt to an environment where AI platforms increasingly influence brand discovery. Similarweb's GenAI Intelligence Toolkit revealed that AI platforms generated 1.1 billion referral visits in June 2025, representing 357% year-over-year growth.
Looking ahead to late 2025 and beyond
Industry experts predict continued growth for video platforms as businesses recognize the value of analyzing multimedia content. The need to train AI agents and language models will push more companies toward platforms with conversation data, user posts, and real-world interaction patterns.
"We're seeing a clear move toward websites that have lots of different types of content instead of just basic info," said Vaidotas Juknys, Head of Commerce at Decodo. "The biggest reason for this shift is that everyone needs tons of varied, good-quality data to train AI chatbots, language models, and other smart tools."
As AI models become less transparent about their sources, businesses will increasingly rely on collecting raw data themselves rather than trusting AI-generated summaries. Companies want to control their analysis process and understand exactly where their insights originate, driving more direct data collection from original sources.
AI-powered scraping and parsing tools will become more common, making data extraction and analysis faster. Specialized scraping tools built specifically for training AI systems and improving machine learning models will focus on gathering diverse, contextual data that modern AI agents require for effective operation.
Up-and-coming platforms in professional networking, fintech, and specialized industry forums might join the most-scraped list once they achieve sufficient scale and produce valuable business insights. These platforms will prove especially important for companies building AI agents that need to understand complicated human behaviors, work relationships, and niche market dynamics.
The 2025 data reveals how drastically the landscape has changed as businesses seek richer, more diverse content. Six completely new websites in the top 10 demonstrate how quickly priorities shift when businesses realize they need higher-quality, more comprehensive data beyond traditional SEO and pricing information.
"Data might have been the new oil in 2006, but in 2025, it's the fuel that powers artificial intelligence," said Gabrielė Verbickaitė, Senior Product Marketing Manager at Decodo. "And AI systems have an appetite for fresh, diverse, and high-quality training data at unprecedented scale."
Companies that thrive in this AI-driven environment will be those capable of effectively collecting, analyzing, and transforming diverse data sources into training datasets for their AI systems. Whether training customer service chatbots, building AI-powered pricing algorithms, or developing autonomous research agents, the platforms on the 2025 list represent essential data sources powering modern artificial intelligence development.
Timeline
- August 2023: OpenAI announces GPTBot, triggering widespread blocking by major websites
- July 31, 2024: HUMAN Security reports 80% of companies block AI language models
- June 13, 2025: ChatGPT adds UTM parameters to "More" section links for improved analytics tracking
- July 14, 2025: Meta announces "hundreds of billions" AI infrastructure investment
- July 28, 2025: Similarweb launches dual tracking platform for AI search optimization
- August 6, 2025: Meta leaked scraping list reveals massive content harvesting operation
- August 11, 2025: Microsoft ends Bing Search APIs, replacement costs 40-483% more
- August 12, 2025: German court confirms Meta AI training includes children's data despite protections
- August 21, 2025: Amazon blocks AI bots from major tech companies amid commerce battle
- September 3, 2025: Publishers reveal identity crisis as data challenges mount
- September 9, 2025: Decodo releases "Most Scraped Websites of 2025" report
Summary
Who: Web scraping company Decodo analyzed scraping patterns from its anonymized user base to identify the most targeted websites. Companies across industries collect data for AI training, competitive intelligence, and market research purposes.
What: TikTok emerged as the most scraped website in 2025, jumping from outside the top 10 with 321% traffic growth. Video and social media platforms now represent 38% of all scraping activity, reflecting demand for multimodal AI training data that combines text, video, images, and audio.
When: The report, published September 9, 2025, analyzed 2025 data compared to 2024 baselines, revealing dramatic shifts in scraping priorities as AI training requirements reshape data collection strategies throughout the year.
Where: The analysis examined global websites but particularly highlights platforms serving international audiences, including TikTok (global), YouTube (global), Amazon (primarily US), Coupang (South Korea), ScienceDirect (academic/global), and Crunchbase (business intelligence/global).
Why: Companies dramatically increased data collection from diverse sources to train large language models and multimodal AI systems. The shift from traditional text-only scraping to comprehensive multimedia data collection reflects the need for AI models that understand context across different media types, driving demand for platforms offering rich, varied content.