The internet is full of bots. They generate almost as much traffic as people do.
But while many of them are malicious scrapers or spammy impostors, let’s not forget about the many “good” bots that serve legitimate functions, like indexing your content for Google search or generating preview cards when your link gets shared on X.
These are the helpful worker bees of the internet. If you accidentally shut them out in your haste to seal off malicious bots, you’ll create some serious headaches for yourself: buried search results, broken preview links, disrupted integrations, and so on.
Of course, even these legitimate bots (also called known bots) can hog your bandwidth and skew your analytics. So the key is to strike the correct balance, being intentional about which bots you welcome and which you exclude. To this end, there are specific ways to adjust how some of these crawlers interact with your site, which we’ve described below where applicable.
In this guide, we’ll use our analytics, gained by verifying more than 20 trillion digital interactions each week, to take a deep dive into the world of legitimate bots, going well beyond the usual suspects like Googlebot and Bingbot. You’ll find a detailed list of known crawlers, complete with their user agent strings, and learn how to spot some of the most common types, figure out who sent them, and determine what, exactly, they want from you and your domain.
Basic Bot Identification: User Agents and Patterns
When a legitimate bot visits your site, it identifies itself with a user agent (UA) string: a snippet of text that usually contains the name of the bot or of the company that owns it (e.g., Googlebot or Bingbot).
Because known bots follow these predictable naming conventions, they can be detected using regex patterns. For example, Googlebot’s UA string includes the word “Googlebot,” and Facebook’s crawler uses “facebookexternalhit” or “Facebot.”
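To make that concrete, here’s a rough sketch of what first-pass UA matching could look like in Python. The handful of patterns below cover a few of the bots discussed in this guide; treat them as illustrations, not a complete or production-ready detection list.

```python
import re

# Illustrative patterns for a few well-known crawlers (not an exhaustive list).
KNOWN_BOT_PATTERNS = {
    "Googlebot": re.compile(r"Googlebot", re.IGNORECASE),
    "Bingbot": re.compile(r"bingbot", re.IGNORECASE),
    "Facebook Crawler": re.compile(r"facebookexternalhit|Facebot", re.IGNORECASE),
    "Twitterbot": re.compile(r"Twitterbot", re.IGNORECASE),
    "AhrefsBot": re.compile(r"AhrefsBot", re.IGNORECASE),
    "GPTBot": re.compile(r"GPTBot", re.IGNORECASE),
}

def classify_user_agent(ua: str):
    """Return the name of the first known bot whose pattern matches, or None."""
    for name, pattern in KNOWN_BOT_PATTERNS.items():
        if pattern.search(ua):
            return name
    return None

# Matches the sample Googlebot UA string shown later in this guide.
print(classify_user_agent(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # -> Googlebot
```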
A quick note on security: Because malicious bots can fake the UA strings of legitimate ones, you need a decent bot detection system to sort the good players out from the bad ones. One recent example is AkiraBot, a spambot that uses LLMs to generate messages and hides behind generic user-agent strings, making it harder to detect with standard UA filtering.
Which Bots Visit Your Site the Most?
Knowing how often certain bots are crawling your site is one of the major factors in deciding which ones to allow and which ones to throttle.
As you can tell from the chart below (drawn from our observations over a 30-day period), legitimate bot traffic is dominated by Google’s main web crawler, followed by Bing’s. Together, they account for more than 60% of all requests from known good bots. Some other notably active crawlers are the ones operated by Facebook and Pinterest.
Incidentally, OpenAI’s crawler, GPTBot, was number 16 on our list, generating almost a full percent of all legit-bot traffic. That’s a pretty significant amount of activity for this relative newcomer.
Now that we understand which “good” bots are visiting your site and how often, it’s time to learn why they’re there in the first place.
Let’s go over the major categories of legitimate bots and learn how to identify some of the major players in each field:
Search Engine Crawlers
Think of a search engine crawler (like the aforementioned Googlebot) as a digital explorer that roams the uncharted web, hunting for content to bring home and display on its engine’s results page.
While these crawlers are critical for your domain’s SEO visibility, they can get a little greedy with server resources, so use robots.txt to guide them to only the most relevant areas of your site.
Googlebot
Googlebot is Google’s main web crawler that indexes pages to display in Google search results. You should never block it, unless you’re trying to turn your page into an un-googleable ghost site. If you do want to manage Googlebot, you can throttle its crawl rate on Google Search Console.
Google also runs some specialized crawlers for news, images, and so on, but they always have “google” or “googlebot” in their UA strings.
User agent: Googlebot
Sample UA string: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Bingbot
Bingbot is Microsoft’s search crawler that indexes pages for Bing. As the chart above shows, it’s second only to Googlebot in traffic volume. It’s also essential for Bing SEO, so we recommend not blocking this one either.
Bing runs some specialized bots like MSNBot and BingPreview, but they always contain “bing” in the UA string. Crawl management is available via Bing Webmaster Tools.
User agent: bingbot
Sample UA string: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Yahoo! Slurp
Slurp is Yahoo’s legacy crawler that indexes content for Yahoo! Search. This living relic of the early 2000s web isn’t spotted a lot these days, but it’s still out there; if you’re lucky, it may still pay you a visit (especially if you’re outside the US).
User agent: Slurp
Sample UA string: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Baidu Spider
Baidu Spider is the bot behind China’s biggest search engine, Baidu. It’s essential for visibility in Chinese-language markets, but you might encounter it globally as well.
This crawler doesn’t have an opt-out mechanism, and it may inconsistently obey directives from robots.txt.
User agent: Baiduspider
Sample UA string: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Yandex Bot
Yandex Bot is the web crawler for Yandex, Russia’s most popular search engine. It’s especially important for websites targeting Russian or Eastern European visitors. Its crawl settings can be modified using the Yandex Webmaster toolkit.
User agent: YandexBot
Sample UA string: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
SEO and Marketing Analytics Crawlers
Another common type of crawler is used to gather data for marketing analytics: performing SEO audits, checking backlinks, researching competitors, and so on.
These bots can offer useful marketing insights, but they can also be a little inconsiderate with their volume of requests. Sometimes they need to be kept on a short leash or blocked entirely. (Like a lot of the bots on this list, they frequently have built-in opt-out mechanisms.)
AhrefsBot
AhrefsBot is the crawler behind the popular Ahrefs SEO tool. It scans sites for backlinks and other metrics. Our data shows that this bot is in the top ten most active “good” bots (see the above chart), so you’re pretty likely to see it in your logs.
To stop AhrefsBot from crawling your site, you can opt out by emailing support@ahrefs.com.
User agent: AhrefsBot
Sample UA string: Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
SemrushBot
SemrushBot is the crawler behind Semrush’s SEO toolkit. It serves a similar purpose to AhrefsBot, scanning sites for keyword rankings, backlinks, and competitive data. It’s a useful tool for digital marketers, but might need to be throttled if it visits too frequently.
User agent: SemrushBot
Sample UA string: Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
Majestic MJ12Bot
MJ12Bot is used by Majestic SEO to map the web’s link graph for Majestic’s backlink index. It’s useful for SEO research, but its aggressive crawling might require tuning if your site gets a ton of traffic from it.
Check the MJ12Bot website for information on how to block or throttle it.
User agent: MJ12Bot
Sample UA string: MJ12bot/v1.4.0 (http://www.majestic12.co.uk/bot.php?+)
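If you do decide to rein in these SEO crawlers, robots.txt is usually the first lever to pull. The sketch below uses Python’s built-in urllib.robotparser to sanity-check a candidate policy before you deploy it; the directives themselves (a crawl delay for AhrefsBot and SemrushBot, a blanket disallow for MJ12bot) are illustrative choices rather than recommendations, and keep in mind that not every crawler honors the Crawl-delay directive.

```python
from urllib import robotparser

# A candidate robots.txt policy (the values here are illustrative only).
ROBOTS_TXT = """\
User-agent: AhrefsBot
Crawl-delay: 10

User-agent: SemrushBot
Crawl-delay: 10

User-agent: MJ12bot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ("AhrefsBot", "SemrushBot", "MJ12bot"):
    allowed = parser.can_fetch(bot, "https://example.com/blog/")
    delay = parser.crawl_delay(bot)
    print(f"{bot}: allowed={allowed}, crawl_delay={delay}")
# AhrefsBot: allowed=True, crawl_delay=10
# SemrushBot: allowed=True, crawl_delay=10
# MJ12bot: allowed=False, crawl_delay=None
```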
Social Media and Content Preview Bots
This type of crawler is mainly used by social media platforms to create content previews for links that are shared. Blocking them can mess with the appearance of links to your content, so in general it’s best to allow them.
If your site has pages you want to keep confidential, make sure to use the correct meta tags or authentication, because these preview bots will fetch anything that’s public.
Facebook Crawler
Facebook’s crawler scans shared links to generate previews for posts and messages. It’s a major bot (ranking in the top 5 of our traffic volume data), and allowing it is crucial if you want your content to show up correctly when it gets linked from Facebook or Instagram. Make sure it can access your site’s OG tags.
User agent: facebookexternalhit (most commonly)
Sample UA string: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
Twitterbot
Twitterbot is X’s crawler that scans shared links to generate preview cards. If you block it, posts linking to your site won’t show a preview, just a bare URL.
User agent: Twitterbot
Sample UA string: Twitterbot/1.0 Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.3 Chrome/69.0.3497.128 Safari/537.36
LinkedInBot
Another preview crawler, this time for LinkedIn’s platform. As with Facebook and X’s crawlers, blocking it will result in broken or missing previews on posts that link to your content.
User agent: LinkedInBot
Sample UA string: LinkedInBot/1.0 (compatible; Mozilla/5.0; +http://www.linkedin.com)
Slackbot
Slack’s crawler that creates preview cards for links that are shared in Slack channels. This one could also be considered an integration bot, but we’re listing it here because creating link previews is its main job.
User agent: Slackbot
Sample UA string: Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)
Pinterestbot
Pinterest’s web crawler is extremely active; in fact, it ranked #6 in our traffic stats, just behind Facebook. It saves page content and images when someone pins something.
User agent: Pinterestbot
Sample UA string: Mozilla/5.0 (compatible; Pinterestbot/1.0; +https://www.pinterest.com/bot.html)
Monitoring and Uptime Bots
These “guardian angel” bots regularly check a site’s availability and response time to alert its owner of outages.
They’re useful for making sure your site stays online, but they can add noise to your visitor logs if their activity isn’t filtered out. If you use this type of service on your site, make sure to allowlist the bots in your detection system, but exclude them from your visitor stats (see the sketch at the end of this section).
Pingdom
Pingdom is a monitoring service that uses bots to ping a site at regular intervals to check its uptime and performance. These frequent “health check” visits should be filtered from analytics to avoid skewed data.
User agent: Pingdom.com_bot
Sample UA string: Pingdom.com_bot_version_1.1 (+http://www.pingdom.com/)
UptimeRobot
Another common uptime monitor. It pings a site every few minutes and alerts the owner if it detects downtime.
User agent: UptimeRobot
Sample UA string: UptimeRobot/2.0 (+http://www.uptimerobot.com/)
BetterStack Bot
Another uptime checker we spotted in our data. (Formerly called Better Uptime.)
User agent: BetterStackBot
Sample UA string: BetterStackBot/1.0 (+https://betterstack.com/docs/monitoring/uptime-robot/bot/)
cron-job.org
A free service that triggers URL pings and other actions at user-scheduled intervals. This one can also show up in visitor logs.
User agent: cron-job.org
Sample UA string: cron-job.org/1.2 (+https://cron-job.org/en/faq/)
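To put the “exclude them from your visitor stats” advice into practice, here’s a minimal sketch of how you might filter these monitors out before tallying pageviews. The user-agent markers mirror the bots above; the record format is an assumption you’d adapt to however your own logs or analytics pipeline stores visits.

```python
# Substrings that identify the uptime monitors described above.
MONITOR_UA_MARKERS = ("Pingdom.com_bot", "UptimeRobot", "BetterStackBot", "cron-job.org")

def is_monitoring_bot(user_agent: str) -> bool:
    """True if the user agent looks like one of the uptime checkers above."""
    ua = user_agent.lower()
    return any(marker.lower() in ua for marker in MONITOR_UA_MARKERS)

def count_pageviews(records) -> int:
    """Count requests whose UA doesn't match a known monitor.

    `records` is assumed to be an iterable of dicts with a 'user_agent' key,
    e.g. rows already extracted from your access logs.
    """
    return sum(1 for r in records if not is_monitoring_bot(r.get("user_agent", "")))

# Example with two fake records: only the first one is counted.
sample = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."},
    {"user_agent": "UptimeRobot/2.0 (+http://www.uptimerobot.com/)"},
]
print(count_pageviews(sample))  # -> 1
```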
AI and LLM Data Crawlers
These crawlers, mainly operated by AI companies, are designed to scrape and index huge amounts of data that is then used for training AI systems or powering real-time AI services.
This is a relatively new class of web crawlers that’s emerged alongside the rapid growth of AI and large language models (LLMs). It’s also a category that is growing extremely fast. As noted earlier, OpenAI’s GPTBot placed #16 in our traffic volume data, despite having existed for only a handful of years.
Of course, these bots raise some thorny questions around data ownership and are therefore somewhat controversial. For a variety of reasons, these bots can’t always be trusted to respect robots.txt, so additional steps may be necessary to control their behavior; most of them have some kind of opt-out mechanism built in, which we’ve described below where applicable.
OpenAI’s GPTBot
GPTBot is the main web crawler used by OpenAI to gather training data for ChatGPT and its other AI models. It can be blocked using robots.txt, or by denying access to its IP range.
See OpenAI’s crawler documentation for more detailed information about how to rein in GPTBot, as well as the company’s other bots.
User agent: GPTBot
Sample UA string: GPTBot/1.0 (+https://openai.com/gptbot)
Anthropic’s ClaudeBot
ClaudeBot is Anthropic’s main web crawler. It gathers text used for training the Claude AI assistant. The company’s other crawlers include Claude-User and Claude-SearchBot.
Anthropic’s website contains information about how to limit the crawling activity of their bots. For further support, you can also contact claudebot@anthropic.com using an email address from the domain in question.
User agent: ClaudeBot
Sample UA string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Security and Vulnerability Scanners
Security-focused web crawlers scan domains for security issues like exposed databases or vulnerable plugins (often for threat research purposes).
Reputable scanners like the ones listed here will announce themselves with clear UA strings, and they often provide opt-out options. Malicious scanners, on the other hand, typically disguise their activity and ignore opt-out protocols.
A visit from a legitimate security scanner can be a pretty useful cue to check your network for issues (as in, Why is Shodan scanning me? Is everything secure?). But they’re also capable of generating a fair amount of traffic, so some admins choose to block them via firewall.
Censys.io
Censys.io’s web crawler is a security research bot that scans the internet for exposed devices and other network vulnerabilities. It’s used widely in the cybersecurity industry for threat intelligence and improving network resilience.
See Censys’s documentation for more detailed information about how to opt out of data collection.
User agent: censys.io
Sample UA string: Mozilla/5.0 (compatible; CensysInspect/1.1; +https://about.censys.io/)
Shodan
Shodan’s crawler also indexes connected devices, collecting data on open ports and other vulnerabilities.
User agent: shodan
Sample UA string: Mozilla/5.0 (compatible; Shodan/1.0; +https://www.shodan.io/bot)
BitSight
Companies like BitSight generate security ratings for websites and networks by using bots to scan them for vulnerabilities. Their scans are non-invasive, but their probing can still trigger security alerts.
User agent: BitSightBot
Sample UA string: Mozilla/5.0 (compatible; BitSightBot/1.0)
Best Practices for Managing “Good” Bots
Even bots with legitimate purposes—such as indexing content, generating link previews, or monitoring uptime—can cause issues if not properly managed. They might strain infrastructure, distort analytics, or access content you’d rather keep restricted. Smart bot management involves applying appropriate controls based on the bot’s identity, behavior, and impact.
Here are five best practices for keeping “good” bot traffic truly beneficial.
1. Use robots.txt as a Starting Point
Most legitimate bots follow robots.txt, making it a reliable first step:
- Restrict low-priority areas: Block access to non-public sections like staging environments, admin panels, or login endpoints.
- Customize per bot: Many bots accept agent-specific rules or crawl-delay directives, giving you control over how often and where they crawl (see the sketch after this list).
- Understand the limits: robots.txt is advisory, not enforceable. It won’t stop bots that choose to ignore it, including some AI agents and impersonators.
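To make the first two points concrete, here’s a hedged sketch of agent-specific robots.txt rules, checked with Python’s built-in urllib.robotparser. The paths and per-bot choices (blocking GPTBot outright, keeping every crawler out of /staging/, /admin/, and /login) are placeholders rather than recommendations; and, per the third point, only compliant bots will actually honor any of it.

```python
from urllib import robotparser

# An illustrative robots.txt: keep all crawlers out of non-public areas,
# and (as an example policy choice) disallow GPTBot entirely.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /login
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("Googlebot", "https://example.com/blog/post-1"),
    ("Googlebot", "https://example.com/admin/settings"),
    ("GPTBot", "https://example.com/blog/post-1"),
]
for agent, url in checks:
    print(f"{agent} -> {url}: {parser.can_fetch(agent, url)}")
# Expected: Googlebot may fetch /blog/post-1 but not /admin/settings,
# while GPTBot is disallowed everywhere (if it chooses to comply).
```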
2. Confirm a Bot’s Identity Before Granting Trust
Not every bot is what it claims to be. Malicious actors often spoof user-agent strings to appear legitimate.
- Verify major crawlers by IP address, not just their user-agent. Operators like Google publish reverse DNS conventions and official IP ranges that let you confirm a visitor really is Googlebot (see the sketch after this list).
- Log behavior from unfamiliar bots and confirm that it matches their stated purpose.
- Watch for red flags: If a “known” bot is accessing sensitive paths or making high-frequency requests, it could be a fake.
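Here’s a minimal sketch of that verification step in Python, using forward-confirmed reverse DNS: resolve the visiting IP to a hostname, check it against the domains the operator documents, then resolve that hostname back and confirm it returns the original IP. The hostname suffixes below for Googlebot and Bingbot are examples; confirm them against each operator’s official documentation before relying on them.

```python
import socket

# Hostname suffixes published by the crawler operators
# (verify these against each vendor's documentation).
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def verify_crawler_ip(ip: str, crawler: str) -> bool:
    """Forward-confirmed reverse DNS check for an IP claiming to be a known crawler."""
    suffixes = CRAWLER_DOMAINS[crawler]
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)              # reverse lookup
        if not hostname.endswith(suffixes):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward lookup
        return ip in forward_ips                               # must round-trip
    except (socket.herror, socket.gaierror):
        return False

# Example: check an IP that presented a Googlebot user agent.
# print(verify_crawler_ip("66.249.66.1", "googlebot"))
```

Some operators also publish official IP ranges you can match against directly; either way, the point is to not take the UA string’s word for it.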
3. Monitor Bot Traffic as Carefully as Human Traffic
Many analytics tools filter out bot activity by default, but bots still interact with your infrastructure and data.
- Analyze server logs and firewall data to see which bots are requesting what content (see the sketch after this list).
- Flag new or high-volume bots for review. Just because a bot is active doesn’t mean it’s valuable.
- Understand intent: Knowing whether a bot is indexing pages, scraping prices, or fetching preview metadata helps determine the right response.
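For a quick, no-dependencies look at which bots hit you the hardest, something like the sketch below works against combined-format access logs. The regex that grabs the final quoted user-agent field is an assumption about your log layout, and the log path in the usage example is hypothetical; adjust both to your setup.

```python
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def top_user_agents(log_path: str, limit: int = 10):
    """Count requests per user-agent string in an access log."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_FIELD.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts.most_common(limit)

# Example usage (the path is hypothetical):
# for ua, hits in top_user_agents("/var/log/nginx/access.log"):
#     print(f"{hits:>8}  {ua}")
```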
Bot Management solutions like HUMAN surface bot activity in dashboards and detailed bot profiles, making it easier to understand which bots are interacting with your site and how.
4. Apply Tiered Controls Based on Purpose
Not all bots should be treated the same. Match your response to their intent and impact.
- Search engine bots should generally be allowed and controlled with robots.txt and sitemap files.
- SEO and analytics bots can provide value, but often hit sites with high frequency. These may need to be rate-limited or blocked, depending on their usefulness to your business.
- Social media crawlers fetch page data for link previews. Their requests are typically lightweight and only occur when users share your content.
- AI and LLM-related crawlers are growing in volume and may access content for training or response generation. Consider restricting access unless the traffic is explicitly allowed or monetized. For more on how some organizations are managing this class of bots, see how HUMAN and TollBit enable enforcement and monetization for AI agents.
5. Revisit and Adjust Bot Policies Regularly
Bot ecosystems evolve rapidly, especially with the rise of AI agents and data harvesters.
- Review access rules every few months, especially for newly active bots or changes in crawler behavior.
- Update filters and detection patterns as new user-agent strings and AI tools emerge.
- Audit allowlists or IP rules to ensure continued relevance and avoid outdated exceptions that may introduce risk.
Take Control of Your Bot Traffic with HUMAN
The above best practices provide a foundation, but managing bots effectively at scale requires visibility, precision, and adaptability.
HUMAN helps organizations enforce bot access policies with greater accuracy and less manual overhead. From verifying the identity of search engine crawlers, to detecting obfuscated AI agents, to blocking unwanted scraping traffic before it reaches your application, HUMAN gives security and engineering teams the tools they need to stay in control.
With HUMAN, you can:
- Identify and manage known bots with a curated, toggleable list—no more maintaining manual rulesets or chasing new user-agent patterns.
- Gain full visibility into bot traffic, including AI crawlers, preview bots, and unknown agents. Easily see who’s accessing what, and how often.
- Enforce nuanced policies with flexible response options: allow, rate-limit, redirect, serve alternate content, or monetize traffic through integrations like TollBit.
- Stay ahead of change, thanks to advanced detection capabilities that adapt to evolving bot behaviors, spoofing attempts, and threat patterns.
Managing bots shouldn’t require guesswork or compromise. HUMAN makes it easy to welcome the bots you want and block or monetize the ones you don’t.
If you’re ready to go beyond basic detection and take control of your automated traffic, get in touch with our team for a demo or learn more about how HUMAN defends the entire user lifecycle.