Wikimedia Commons, the massive media repository behind Wikipedia and other Wikimedia projects, has seen a dramatic surge in bandwidth usage—up by 50% since January 2024. But this spike isn’t coming from curious human users. Instead, it’s driven by automated AI scrapers aggressively pulling media files to train large language and image models.
According to a new blog post from the Wikimedia Foundation, these AI crawlers are putting intense strain on the platform’s infrastructure. The nonprofit emphasized that while its systems are built to handle traffic surges—especially during major global events—this new wave of automated scraping presents an entirely different challenge.
Bots Are Draining Resources, Not Just Viewing Pages
Wikimedia Commons hosts millions of freely licensed images, videos, and audio clips, making it a goldmine for AI companies hungry for training data. But the traffic from bots isn’t just about quantity—it’s about cost.
While bots account for only 35% of total pageviews, they’re behind a whopping 65% of the most bandwidth-heavy traffic. This disparity comes down to how content is served. Popular files accessed frequently are cached closer to users, making them cheaper to deliver. But crawler bots typically target obscure or less-frequented pages, forcing Wikimedia to serve them from its core data centers—which is significantly more expensive.
“Humans tend to browse similar, popular topics,” the Foundation explained. “Bots, on the other hand, bulk-download large sets of less-visited pages. That kind of behavior quickly drives up bandwidth and backend server load.”
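The cost disparity the Foundation describes can be sketched with a toy model: a small LRU cache standing in for a CDN edge. Repeated human-style requests for a few popular files stay cached, while a bot’s single pass over thousands of distinct, rarely-visited files misses every time and falls through to the expensive origin. All class names, file names, and numbers below are illustrative, not Wikimedia’s actual setup.

```python
# Toy model: why long-tail bot traffic is costly. An LRU cache stands
# in for a CDN edge; a miss represents an expensive origin fetch.
from collections import OrderedDict

class EdgeCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)          # refresh recency on a hit
            self.hits += 1
        else:
            self.misses += 1                     # miss: fetch from origin
            self.store[key] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)   # evict least-recently used

# Human-like traffic: repeated requests for a few popular files.
human = EdgeCache(capacity=10)
for _ in range(100):
    for f in ("a.jpg", "b.jpg", "c.jpg"):
        human.get(f)
human_hit_rate = human.hits / (human.hits + human.misses)

# Bot-like traffic: one bulk pass over thousands of distinct obscure files.
bot = EdgeCache(capacity=10)
for i in range(3000):
    bot.get(f"obscure_{i}.ogg")
bot_hit_rate = bot.hits / (bot.hits + bot.misses)

print(f"human-like hit rate: {human_hit_rate:.2f}")  # near 1.0
print(f"bot-like hit rate:   {bot_hit_rate:.2f}")    # 0.00
```

Under this model, nearly every human-style request is served from cache, while every bot-style request hits the origin—mirroring how 35% of pageviews can generate 65% of the most expensive traffic.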
Wikimedia Now Playing Defense to Protect Users
To maintain site stability for regular users, Wikimedia’s site reliability team has been forced to dedicate growing resources toward identifying and blocking scraper bots. This includes managing traffic spikes, optimizing server capacity, and reducing the risk of outages. On top of that, the organization is also facing higher cloud hosting costs due to these automated demands.
This surge in AI crawler activity reflects a broader, troubling trend across the open web. Developers and infrastructure maintainers are seeing their public resources consumed by bots that often ignore long-standing rules, like the “robots.txt” protocol that tells crawlers which parts of a website they should not access.
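A well-behaved crawler checks robots.txt before fetching anything. The sketch below uses Python’s standard-library `urllib.robotparser` to show how those rules are read; the `ExampleAIBot` user agent and the rules themselves are made up for illustration, not taken from any real site’s robots.txt.

```python
# Sketch: how a compliant crawler honors robots.txt, using Python's
# standard-library parser. The rules here are illustrative only.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The hypothetical AI bot is barred from the whole site;
# other agents are barred only from /private/.
print(parser.can_fetch("ExampleAIBot", "https://example.org/media/file.jpg"))  # False
print(parser.can_fetch("OtherBot", "https://example.org/media/file.jpg"))      # True
print(parser.can_fetch("OtherBot", "https://example.org/private/x.jpg"))       # False
```

The catch, as maintainers are discovering, is that robots.txt is purely advisory: nothing in the protocol stops a scraper from ignoring it entirely.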
The Open Internet Is Under Threat from AI Scrapers
This growing misuse of open-access platforms has sparked backlash in the tech community. Last month, software engineer Drew DeVault criticized AI companies for disrespecting web standards. Developer and writer Gergely Orosz also shared how scrapers from firms like Meta were draining his own project resources, making operations more expensive and unsustainable.
While some developers are fighting back with creative solutions, it’s becoming a high-stakes battle. Cloudflare, for example, recently introduced a new tool called AI Labyrinth, which uses fake, AI-generated content to mislead and slow down crawlers.
Still, experts warn this cat-and-mouse game could push more content publishers to lock their websites behind paywalls or require user logins—ultimately reducing access to the free and open internet.