AI Crawlers: The Nasty Bugs Causing Trouble on the Internet

2 weeks ago

AI crawlers are eating up web resources, and site administrators are looking for defence mechanisms to protect against big bills.

Illustration by Diksha Mishra

AI tools with web search capabilities, such as Anthropic’s Claude, browse the internet to fetch the information users need. Perplexity, OpenAI, and Google offer similar capabilities through features like ‘Deep Research’.

In a blog post, Cloudflare explained that these web crawlers, often referred to as AI crawlers, deploy the same techniques as search engine crawlers to gather available information.

While AI crawlers aim to assist users, they may be causing more damage on the internet than one realises. They are believed to drive up server resource usage for website administrators, leading to unexpected bills and service disruptions.

AI Crawlers Are Increasingly Becoming a Hassle

Gergely Orosz, creator of The Pragmatic Engineer newsletter, shared on LinkedIn, “AI crawlers are wrecking the open internet, and I’m now being hit for the bill for their training.”

He explained that his website, a side project, initially had a few thousand visitors a month and used around 100 GB of server bandwidth. But after Meta’s AI crawler and other bots like ImageSiftBot started crawling the website, bandwidth consumption jumped past 700 GB, adding an extra $90 to his bill.

Orosz expressed frustration over having to pay all this extra money to help train LLMs. Furthermore, he added that the crawlers ignore the site’s robots.txt file. “The irony is how the bots—including Meta! — blatantly ignore the robots.txt on the site that tells them ‘please stay away’…I’m upset – and have had enough.”
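For reference, telling crawlers to stay away via robots.txt looks like the following (GPTBot and ClaudeBot are the publicly documented user agents of OpenAI’s and Anthropic’s crawlers; compliance is entirely voluntary, which is the problem Orosz describes):

```text
# Ask specific AI crawlers not to fetch anything from this site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

A well-behaved bot reads this file before crawling and skips disallowed paths; nothing in the protocol enforces it.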

Vercel, a cloud platform company, shared some interesting statistics from their network in a blog post that said: “AI crawlers have become a significant presence on the web. OpenAI’s GPTBot generated 569 million requests across Vercel’s network in the past month, while Anthropic’s Claude followed with 370 million.”

Source: Vercel

“For perspective, this combined volume represents about 20% of Googlebot’s 4.5 billion requests during the same period,” it added.

Xe Iaso, a software developer, expressed frustration upon noticing that AmazonBot was consuming their Git server’s resources, and attempts to block it failed. Iaso stated in a blog post, “It’s futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more. I just want the requests to stop.”
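For context, the kind of blocking Iaso describes as futile usually means matching on the request’s user agent, for example in nginx (a minimal sketch; the bot names are illustrative):

```nginx
# Deny requests whose User-Agent claims to be a known AI crawler.
# Bots that spoof their user agent or rotate residential IPs sail past this.
if ($http_user_agent ~* "(GPTBot|ClaudeBot|Amazonbot|Bytespider)") {
    return 403;
}
```

This only works as long as the crawler identifies itself honestly, which is exactly what Iaso says the aggressive ones do not do.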

The developer created an open source solution, Anubis, to present a challenge to AI crawlers and block the requests.
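Anubis works by issuing a proof-of-work challenge that the visitor’s browser must solve before the server serves the page: cheap for one human visitor, expensive for a crawler firing millions of requests. Below is a minimal Python sketch of the general idea, not Anubis’s actual implementation; the function names and difficulty parameter are illustrative:

```python
import hashlib
import itertools


def solve(challenge: str, difficulty: int = 4) -> int:
    """Find a nonce such that SHA-256(challenge + nonce) starts with
    `difficulty` hex zeros. This is the expensive step the client performs."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Cheap server-side check: one hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


nonce = solve("example-challenge", difficulty=3)
print(verify("example-challenge", nonce, difficulty=3))  # True
```

The asymmetry is the point: verification costs the server a single hash, while solving costs the client thousands of attempts on average, and that cost scales with every request a crawler makes.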

The developer’s quick fix turned out to be helpful to others as well. Bart Piotrowski, a system administrator at GNOME, used it to fend off AI crawlers from GNOME’s GitLab instance, which were reportedly consuming 90% of its resources.

Drew Devault, founder of SourceHut, wrote a blog post voicing something similar: “Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.”

Ars Technica reached a similar conclusion about AI crawlers, focusing on their impact on open source projects. Many other reports indicate that people are trying to fend off AI crawlers consuming their web resources.

What Can Be Done?

Solutions such as Iaso’s Anubis, though not suitable for everyone, are a good option and are increasingly being adopted.

Cloudflare has joined the fight against AI bots that do not honour the robots.txt rule with AI Labyrinth, which uses AI-generated content to keep the crawler occupied and waste its resources.

Source: Cloudflare

“Crawlers generate more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests we see. While Cloudflare has several tools for identifying and blocking unauthorised AI crawling, we have found that blocking malicious bots can alert the attacker that you are on to them, leading to a shift in approach, and a never-ending arms race,” the Cloudflare blog read. 

It added, “So, we wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted.”

In addition to the solutions mentioned above, AI companies can do their bit by making their crawlers respect web resources and be less aggressive in how they gather information.

While the web search functionality in AI tools provides great value, it should not come at the cost of straining the server resources of small or independent web admins.


Ankush Das

I am a tech aficionado and a computer science graduate with a keen interest in AI, Open Source, and Cybersecurity.
