Baidu blocks Google and Bing from indexing Baike content amid AI data demands

Chinese tech giant Baidu restricts access to its online encyclopedia, highlighting growing competition for AI training data.

Chinese internet search giant Baidu has blocked Google and Bing search engine crawlers from indexing content on Baidu Baike, its online encyclopedia service. The change, made in early August 2024, marks a significant shift in how Baidu protects its digital assets amid growing demand for the large-scale data used to train and develop artificial intelligence (AI) models.

According to records from the Internet Archive's Wayback Machine, Baidu Baike updated its robots.txt file on August 8. This file, which instructs search engine crawlers on which parts of a website they can access, now explicitly prohibits the Googlebot and Bingbot crawlers from indexing any content from the platform. Prior to this change, Google and Bing had been allowed to browse and index Baidu Baike's repository of nearly 30 million entries, with only certain sections of the website off-limits.
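For illustration, a robots.txt configuration that shuts these two crawlers out of an entire site would look roughly like the sketch below. This is a schematic example of the mechanism, not a reproduction of Baidu Baike's actual file; the Baiduspider entry is included only as an assumption about how a site might keep its own crawler permitted.

    # Block Google's and Bing's crawlers from every path on the site
    User-agent: Googlebot
    Disallow: /

    User-agent: Bingbot
    Disallow: /

    # Other crawlers (for example, the site's own) can still be allowed explicitly
    User-agent: Baiduspider
    Allow: /

Crawlers that honor the Robots Exclusion Protocol read this file before fetching pages and skip any path covered by a Disallow rule for their user-agent. Compliance is voluntary, so enforcement ultimately depends on the crawler operator respecting the rules.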

Baidu Baike, launched in April 2006, has grown to become the largest Chinese-language online encyclopedia. As of February 2022, it boasted over 25.54 million entries and 7.5 million editors, significantly surpassing the Chinese version of Wikipedia, which currently has 1.43 million entries.

The decision to restrict access comes at a time when major tech companies are increasingly focused on acquiring vast amounts of data to train and improve their AI models and applications. This trend has been particularly pronounced since the release of OpenAI's ChatGPT on November 30, 2022, which sparked a global race in generative AI development.

Baidu's move follows similar actions by other online platforms. In July 2024, Reddit, the U.S.-based social news aggregation and discussion site, blocked various search engines from indexing its content, with the notable exception of Google. That exception stems from a multimillion-dollar agreement between Reddit and Google granting the latter the right to scrape Reddit's platform for AI training data.

Even tech giant Microsoft has taken steps to protect its data assets. In 2023, the company reportedly threatened to revoke access to its internet search data, which it licenses to competing search engine operators, if these companies continued to use the data for their chatbots and other generative AI services.

The growing trend of data protectionism extends beyond search engines and social media platforms. Content publishers are increasingly striking deals with AI developers for access to their archives. For instance, in June 2024, OpenAI secured an agreement with Time magazine, gaining access to over a century's worth of the publication's archived content.

Baidu Baike's decision to block Google and Bing crawlers highlights the strategic importance of high-quality, curated content in the AI era. With its vast repository of Chinese-language information, Baidu Baike represents a valuable resource for training AI models, particularly those focused on Chinese language processing and cultural understanding.

It's worth noting that while Baidu has restricted access to its encyclopedia content, the company itself is heavily invested in AI development. Baidu has been working on its own large language models and AI applications, competing with both domestic and international tech giants in the rapidly evolving AI landscape.

The implications of Baidu's decision extend beyond the immediate impact on search results. It raises questions about the future of open access to information on the internet and the potential fragmentation of the global knowledge base along corporate or national lines. As AI continues to drive technological advancement and economic competition, control over large-scale, high-quality data sets is likely to become an increasingly contentious issue.

Despite the recent changes to Baidu Baike's robots.txt file, a survey conducted on August 25, 2024, revealed that many entries from the service still appear in Google and Bing search results. This suggests that the full impact of Baidu's restrictions may take some time to manifest, as search engines typically retain cached content for a period after access is revoked.
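One way to see whether the new rules have propagated is to parse the live robots.txt and ask whether a given crawler may fetch a given page. Below is a minimal sketch using Python's standard urllib.robotparser; the entry path used here is a hypothetical placeholder, and the result reflects only what the file currently declares, not what the search engines have already indexed.

    # Minimal sketch: check what Baidu Baike's robots.txt declares for each crawler.
    # The item path below is a hypothetical placeholder, not a real entry.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://baike.baidu.com/robots.txt")
    parser.read()

    for agent in ("Googlebot", "Bingbot"):
        allowed = parser.can_fetch(agent, "https://baike.baidu.com/item/example")
        print(agent, "allowed" if allowed else "disallowed")

Even when such a check reports "disallowed", previously crawled pages can linger in search indexes until the engines recrawl the site and drop them, which is consistent with the entries still visible in results on August 25.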

Key facts

  • Baidu updated its robots.txt file on August 8, 2024, to block Google and Bing crawlers from indexing Baidu Baike content.
  • Baidu Baike contains nearly 30 million entries as of August 2024.
  • As of February 2022, Baidu Baike had over 25.54 million entries and 7.5 million editors.
  • The Chinese version of Wikipedia currently has 1.43 million entries.
  • Reddit blocked various search engines, except Google, from indexing its content in July 2024.
  • OpenAI secured access to Time magazine's archived content in June 2024.
  • ChatGPT was released on November 30, 2022, intensifying the race for AI development and data acquisition.