HUMAN BLOG

How AI Scraping Is Evolving and How Publishers Can Stay Ahead

Read time: 4 minutes

Jennifer Ukaegbu

September 8, 2025

As AI-driven technology advances, publishers are seeing a rise in sophisticated scraping of their digital content. The trend isn’t new, but it’s accelerating at a rate that demands attention. According to recent research from TollBit, AI bot scraping is becoming more aggressive, and traditional defenses such as robots.txt are no longer enough to keep unauthorized bots at bay. In 2024 alone, HUMAN blocked 215 billion scraping attack attempts, which shows the scale at which bots are trying to access and reuse content.

The Growing Problem: Why AI Scraping Is a Bigger Threat Than Ever

AI scraping bots fall into two categories: LLM crawlers, which gather large amounts of training data, and retrieval-augmented generation (RAG) scrapers, which carry out targeted crawling and scraping to provide users with real-time data or fill gaps in the LLM’s knowledge base. RAG scrapers are often launched within the user’s session and may mimic human browsing behavior, which makes them increasingly difficult to detect and manage (or block) with standard security measures. Many of these bots self-declare or follow robots.txt directives, but not all do, so publishers cannot reliably monitor and control them without a dedicated solution.
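
To make the point about voluntary compliance concrete, here is a minimal sketch of the check a well-behaved crawler runs before fetching a page, using Python's standard urllib.robotparser module. The robots.txt content, the bot name ExampleAIBot, and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; "ExampleAIBot" is a made-up name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant crawler asks before fetching; nothing forces it to ask.
    print(is_allowed("ExampleAIBot", "https://example.com/articles/1"))
    print(is_allowed("SomeOtherAgent", "https://example.com/articles/1"))
    print(is_allowed("SomeOtherAgent", "https://example.com/private/report"))
```

The check runs entirely on the crawler's side. A scraper that never calls it, or that announces a different user agent, is not stopped by the file itself.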

Here are some recent findings from TollBit that illustrate the scale of the problem:

  • RAG bot scrapes now exceed training bot scrapes: Requests from RAG bots increased 49% from Q4 2024 to Q1 2025. Unlike training bots, which collect content occasionally for model building, RAG bots pull targeted information to serve users directly. This often results in zero-click searches and can replace human visits, reducing traffic, ad revenue, and subscriptions.
  • AI bot traffic nearly doubled in Q1 2025: Websites with analytics set up before January 2025 saw AI bot traffic rise by 87% during the quarter, underscoring the rapid expansion of automated scraping activity.
  • Robots.txt is increasingly being ignored: Despite more publishers attempting to block bots, an increasing number of AI scrapers are disregarding robots.txt. The share of bots ignoring these files jumped from 3.3% to 12.9% in just one quarter, highlighting that traditional defenses alone are no longer sufficient.

[Figure: line chart, July 2024 to April 2025, of traffic from three bot categories: AI search indexer (blue; low and relatively flat), training data crawler (yellow; steady, moderate increase), and RAG agent (dark red; steep growth from zero, crossing the training data crawler line around the beginning of 2025).]

Traffic from RAG agents surpasses training data crawlers in early 2025. Source: TollBit.
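
For readers who want to put the TollBit percentages quoted above on a common scale, the arithmetic can be sketched as follows. The function names are assumptions for illustration; the input numbers are the ones cited in the findings.

```python
def growth_factor(percent_increase: float) -> float:
    """Convert a percentage increase into a multiplier (87% -> 1.87x)."""
    return 1 + percent_increase / 100


def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change between two shares, in percentage points."""
    return after_pct - before_pct


def share_ratio(before_pct: float, after_pct: float) -> float:
    """How many times larger the new share is than the old one."""
    return after_pct / before_pct


if __name__ == "__main__":
    # AI bot traffic rose 87% in Q1 2025.
    print(f"AI bot traffic multiplier: {growth_factor(87):.2f}x")
    # RAG bot requests rose 49% quarter over quarter.
    print(f"RAG request multiplier: {growth_factor(49):.2f}x")
    # Share of bots ignoring robots.txt went from 3.3% to 12.9%.
    print(f"Point change: {point_change(3.3, 12.9):.1f} points")
    print(f"Ratio: {share_ratio(3.3, 12.9):.2f}x")
```

Read this way, an 87% rise is a multiplier of 1.87, which is why it is described as nearly doubling. The jump from 3.3% to 12.9% is 9.6 percentage points, or roughly 3.9 times the earlier share.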

Why this matters for publishers: RAG bots consume fresh content in real time, so they can divert value away from your site and replace human visits. Publishers now face a dual challenge: protecting their content while finding ways to control or monetize access for AI agents that legitimately need it.

How Publishers Can Take Control: The Solution

The good news is that publishers don’t have to sit back and watch this trend unfold. With the right solutions, publishers can block malicious scrapers while also managing and monetizing AI-driven traffic. That’s where our partnership with TollBit comes in.

HUMAN and TollBit have teamed up to provide a comprehensive solution that helps publishers protect their content while also generating revenue from AI traffic. Here’s how:

  • Visibility and Control over AI Scrapers: Understand which bots are accessing your content and what they’re scraping, so you can make informed decisions to protect your assets and maximize your content’s value.
  • Advanced Robots.txt Enforcement and Bot Mitigation: Enforce robots.txt directives for both known and unknown bots using powerful, granular policies to prevent unauthorized content scraping or summarization.
  • Generate Recurring Revenue from AI Usage: Establish a consistent revenue stream by requiring AI scrapers to pay for access to your content and enabling scalable licensing deals with AI developers.
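
As a rough illustration of what server-side enforcement and paid access could look like in principle, the sketch below decides an HTTP status for each request. It is not HUMAN's or TollBit's implementation. The bot names, the token set, and the decide function are all hypothetical, and the sketch trusts the declared user agent, which real scrapers can spoof.

```python
from http import HTTPStatus
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical policy for illustration; all bot names are made up.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /
"""
LICENSED_AGENTS = {"ExampleRAGBot"}  # bots that must pay for access
VALID_TOKENS = {"demo-token"}        # stand-in for a real payment check

_parser = RobotFileParser()
_parser.parse(ROBOTS_TXT.splitlines())


def decide(user_agent: str, path: str, token: Optional[str] = None) -> HTTPStatus:
    """Pick a response status from the declared user agent and path."""
    # Enforce robots.txt on the server instead of trusting the bot to obey it.
    if not _parser.can_fetch(user_agent, path):
        return HTTPStatus.FORBIDDEN
    # Licensed bots get content only when they present a valid token.
    product = user_agent.split("/")[0]
    if product in LICENSED_AGENTS and token not in VALID_TOKENS:
        return HTTPStatus.PAYMENT_REQUIRED
    return HTTPStatus.OK


if __name__ == "__main__":
    print(decide("ExampleTrainingBot/2.1", "/articles/1"))
    print(decide("ExampleRAGBot/1.0", "/articles/1"))
    print(decide("ExampleRAGBot/1.0", "/articles/1", token="demo-token"))
    print(decide("Mozilla/5.0", "/articles/1"))
```

Because the decision keys on a self-declared header, a sketch like this only handles bots that identify themselves. Catching unknown or disguised bots requires behavioral detection, which is beyond what this example shows.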

Why Now Is the Perfect Time for Publishers to Act

As AI scraping activity continues to rise, the pressure on publishers to protect their content will only increase. But with the right tools, publishers can turn this challenge into an opportunity. By adopting proactive measures to control and monetize AI traffic, you can safeguard your digital assets and generate new revenue streams in the process.

The time is now for publishers to take control, and with the HUMAN and TollBit partnership, it’s easier than ever.

Ready to Learn More?

If you want to explore how our solution can help you protect your content and monetize AI traffic, we’re here to help. Reach out to us today for more information on how to stay ahead of AI scraping.
