Cloudflare adds a new feature to help block AI data scrapers

The new tool is easy to use and available to all Cloudflare users, helping protect their content from being used without their consent.

In an effort to safeguard internet content creators, Cloudflare has introduced a new “easy button” designed to block all AI bots, scrapers, and crawlers with a single click. This feature is accessible to all Cloudflare customers, including those on the free tier.

The rise of generative AI has escalated the demand for web content to train models and run inference. Not all AI companies, however, are transparent about their web scraping activities. Reports indicate that Google pays around $60 million annually to license Reddit’s user-generated content, and Scarlett Johansson has accused OpenAI of using a voice closely resembling hers, without her permission, for its new personal assistant.

Perplexity AI has also been accused of impersonating legitimate visitors in order to scrape content, including text and images, which critics say it then reproduces without proper credit.

Cloudflare initially allowed customers to block AI bots that adhered to ethical guidelines, such as following the robots.txt file and not using unlicensed content. Despite these bots playing by the rules, many Cloudflare customers opted to block them. Now, Cloudflare has introduced a broader blocking option.
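As a rough illustration of what "following the robots.txt file" means in practice, the sketch below uses Python's standard `urllib.robotparser` to check whether a crawler may fetch a page. The robots.txt contents and the example URL are assumptions made up for this illustration, not rules published by Cloudflare or any site mentioned here, and the check only matters for bots that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("GPTBot", "https://example.com/post"))
    print(allowed("Mozilla/5.0", "https://example.com/post"))
```

A well-behaved crawler would run a check like this before each request. Nothing in the protocol enforces it, which is why server-side blocking exists for bots that ignore the file.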

“We hear clearly that customers don’t want AI bots visiting their websites, especially those that do so dishonestly,” Cloudflare stated. The new feature can be enabled through the Cloudflare dashboard under the Security > Bots section, where customers can toggle the AI Scrapers and Crawlers option.

Cloudflare says the tool will keep evolving, updating automatically as new AI bot fingerprints are identified. The company’s extensive network gives it a broad view of AI crawler activity.

AI bot activity insights

Recent data from Cloudflare highlights the most active AI bots in terms of request volume. According to their analysis, Bytespider, Amazonbot, ClaudeBot, and GPTBot top the list. Bytespider, operated by ByteDance (the company behind TikTok), is used to gather training data for its language models, such as Doubao. Amazonbot and ClaudeBot follow closely, indexing content for Alexa and training the Claude chatbot, respectively.

Bytespider leads in both request volume and the number of internet properties it crawls, closely followed by GPTBot, which OpenAI operates. GPTBot collects training data for AI products like ChatGPT. Cloudflare’s data shows that Bytespider accessed 40.40% of websites protected by Cloudflare, while GPTBot accessed 35.46%.
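The four crawlers named above identify themselves in the User-Agent header, so a site operator could, in principle, spot them with a simple substring match. The sketch below is a minimal illustration of that idea. The token list comes from the bot names in this article, while the function name and the sample header strings are hypothetical. This is not how Cloudflare's feature works internally, and it cannot catch bots that lie about their user agent.

```python
from typing import Optional

# Lowercased name tokens for the crawlers named in the article.
AI_BOT_TOKENS = ("bytespider", "amazonbot", "claudebot", "gptbot")


def match_ai_bot(user_agent: str) -> Optional[str]:
    """Return the matching bot token, or None if no known token appears."""
    ua = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token in ua:
            return token
    return None


if __name__ == "__main__":
    # Illustrative header values, not captured from real traffic.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (compatible; Bytespider)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
    ]
    for ua in samples:
        print(ua, "->", match_ai_bot(ua))
```

Matching on a self-reported header is only as reliable as the sender's honesty, which is the limitation the fingerprint-based approach is meant to address.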

Despite the prevalence of AI bot activity, only a tiny fraction of websites take measures to block these bots. In June, AI bots accessed around 39% of the top one million internet properties using Cloudflare, but only 2.98% of these properties implemented blocking measures.

Interestingly, the more popular an internet property is, the more likely it is to be targeted by AI bots and to block such requests. For instance, 80% of the top 10 internet properties are accessed by AI bots, with 40% of them blocking these requests.

One challenge is that some bot operators use spoofed user agents to disguise themselves as legitimate browsers. Cloudflare has been monitoring this and claims its machine learning models have effectively detected these evasive bots.

“We’ve observed bot operators attempt to appear as though they are real browsers by using a spoofed user agent,” Cloudflare noted. The company says its global machine learning model has consistently recognized such activity. It recommends that customers set WAF rules to challenge visitors with a bot score below 30, which blocks this bot traffic automatically.
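The threshold rule described above can be sketched as a small decision function. This is only a model of the logic, assuming a hypothetical `Request` record that already carries a bot score where lower values mean "more likely automated". In Cloudflare itself the rule is configured in the dashboard's WAF custom rules rather than written in Python, and the score is, to the best of our knowledge, exposed to rule expressions as `cf.bot_management.score`.

```python
from dataclasses import dataclass

# Threshold taken from the recommendation quoted in the article.
CHALLENGE_THRESHOLD = 30


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    user_agent: str
    bot_score: int  # lower = more likely a bot


def action_for(request: Request) -> str:
    """Challenge requests scoring below the threshold, allow the rest."""
    if request.bot_score < CHALLENGE_THRESHOLD:
        return "challenge"
    return "allow"


if __name__ == "__main__":
    # A spoofed browser user agent does not help if the score is low.
    spoofed = Request(user_agent="Mozilla/5.0 (X11; Linux x86_64)", bot_score=7)
    human = Request(user_agent="Mozilla/5.0 (X11; Linux x86_64)", bot_score=85)
    print(action_for(spoofed), action_for(human))
```

The point of the example is that the decision depends on the score and not on the user-agent string, so two requests with identical headers can receive different treatment.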

Ongoing efforts and reporting

Cloudflare uses its global network to detect and block new scraping tools without manual intervention. It also encourages customers to report misbehaving AI crawlers, and provides tools for both enterprise and general users to submit reports of unauthorized scraping.

In their announcement, they noted that some AI companies may continue to find ways to evade detection, but they remain committed to evolving their models to protect content creators. “We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models,” Cloudflare affirmed.

Posted by Alex Ivanovs

Alex is the lead editor at Stack Diary and covers stories on tech, artificial intelligence, security, privacy and web development. He previously worked as a lead contributor for Huffington Post for their Code column.