The internet is full of bots. They generate almost as much traffic as people do.
But while many of them are malicious scrapers or spammy impostors, let’s not forget about the many “good” bots that serve legitimate functions, like indexing your content for Google search or generating preview cards when your link gets shared on X.
These are the helpful worker bees of the internet. If you accidentally shut them out in your haste to seal off malicious bots, you’ll create some serious headaches for yourself: buried search results, broken preview links, disrupted integrations, and so on.
Of course, even these legitimate bots (also called known bots) can hog your bandwidth and skew your analytics. So the key is to strike the correct balance, being intentional about which bots you welcome and which you exclude. To this end, there are specific ways to adjust how some of these crawlers interact with your site, which we’ve described below where applicable.
In this guide, we’ll use our analytics, gained by verifying more than 20 trillion digital interactions each week, to take a deep dive into the world of legitimate bots, going well beyond the usual suspects like Googlebot and Bingbot. You’ll find a detailed list of known crawlers, complete with their user agent strings, and learn how to spot some of the most common types, figure out who sent them, and determine what, exactly, they want from you and your domain.
Basic Bot Identification: User Agents and Patterns
When a legitimate bot visits your site, it identifies itself with a user agent (UA) string: a snippet of text that usually contains the name of the bot or of the company that owns it (e.g., Googlebot or Bingbot).
Because known bots follow these predictable naming conventions, they can be detected using regex patterns. For example, Googlebot’s UA string includes the word “Googlebot,” and Facebook’s crawler uses “facebookexternalhit” or “Facebot.”
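To make that concrete, here’s a rough sketch of what first-pass UA matching could look like in Python. The handful of patterns below cover a few of the bots discussed in this guide; treat them as illustrations, not a complete or production-ready detection list.

```python
import re

# Illustrative patterns for a few well-known crawlers (not an exhaustive list).
KNOWN_BOT_PATTERNS = {
    "Googlebot": re.compile(r"Googlebot", re.IGNORECASE),
    "Bingbot": re.compile(r"bingbot", re.IGNORECASE),
    "Facebook Crawler": re.compile(r"facebookexternalhit|Facebot", re.IGNORECASE),
    "Twitterbot": re.compile(r"Twitterbot", re.IGNORECASE),
    "AhrefsBot": re.compile(r"AhrefsBot", re.IGNORECASE),
    "GPTBot": re.compile(r"GPTBot", re.IGNORECASE),
}

def classify_user_agent(ua: str):
    """Return the name of the first known bot whose pattern matches, or None."""
    for name, pattern in KNOWN_BOT_PATTERNS.items():
        if pattern.search(ua):
            return name
    return None

# Matches the sample Googlebot UA string shown later in this guide.
print(classify_user_agent(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # -> Googlebot
```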
A quick note on security: Because malicious bots can fake the UA strings of legitimate ones, you need a decent bot detection system to sort the good players out from the bad ones. One recent example is AkiraBot, a spambot that uses LLMs to generate messages and hides behind generic user-agent strings, making it harder to detect with standard UA filtering.
Which Bots Visit Your Site the Most?
Knowing how often certain bots are crawling your site is one of the major factors in deciding which ones to allow and which ones to throttle.
As you can tell from the chart below (drawn from our observations over a 30-day period), legitimate bot traffic is dominated by Google’s main web crawler, followed by Bing’s. Together, they account for more than 60% of all requests from known good bots. Some other notably active crawlers are the ones operated by Facebook and Pinterest.
Incidentally, OpenAI’s crawler, GPTBot, was number 16 on our list, generating almost a full percent of all legit-bot traffic. That’s a pretty significant amount of activity for this relative newcomer.
Now that we understand which “good” bots are visiting your site and how often, it’s time to learn why they’re there in the first place.
Let’s go over the major categories of legitimate bots and learn how to identify some of the major players in each field:
Search Engine Crawlers
Think of a search engine crawler (like the aforementioned Googlebot) as a digital explorer that roams the uncharted web, hunting for content to bring home and display on its engine’s results page.
While these crawlers are critical for your domain’s SEO visibility, they can get a little greedy with server resources, so use robots.txt to guide them to only the most relevant areas of your site.
Googlebot
Googlebot is Google’s main web crawler that indexes pages to display in Google search results. You should never block it, unless you’re trying to turn your page into an un-googleable ghost site. If you do want to manage Googlebot, you can throttle its crawl rate on Google Search Console.
Google also runs some specialized crawlers for news, images, and so on, but they always have “google” or “googlebot” in their UA strings.
User agent: Googlebot
Sample UA string: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Bingbot
Bingbot is Microsoft’s search crawler that indexes pages for Bing. As the chart above shows, it’s second only to Googlebot in traffic volume. It’s also essential for Bing SEO, so we recommend not blocking this one either.
Bing runs some specialized bots like MSNBot and BingPreview, but they always contain “bing” in the UA string. Crawl management is available via Bing Webmaster Tools.
User agent: bingbot
Sample UA string: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Yahoo! Slurp
Slurp is Yahoo’s legacy crawler that indexes content for Yahoo! Search. This living relic of the early 2000s web isn’t spotted a lot these days, but it’s still out there; if you’re lucky, it may still pay you a visit (especially if you’re outside the US).
User agent: Slurp
Sample UA string: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Baidu Spider
Baidu Spider is the bot behind China’s biggest search engine, Baidu. It’s essential for visibility in Chinese-language markets, but you might encounter it globally as well.
This crawler doesn’t have an opt-out mechanism, and it may inconsistently obey directives from robots.txt.
User agent: Baiduspider
Sample UA string: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Yandex Bot
Yandex Bot is the web crawler for Yandex, Russia’s most popular search engine. It’s especially important for websites targeting Russian or Eastern European visitors. Its crawl settings can be modified using the Yandex Webmaster toolkit.
User agent: YandexBot
Sample UA string: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
SEO and Marketing Analytics Crawlers
Another common type of crawler is used to gather data for marketing analytics: performing SEO audits, checking backlinks, researching competitors, and so on.
These bots can offer useful marketing insights, but they can also be a little inconsiderate with their volume of requests. Sometimes they need to be kept on a short leash or blocked entirely. (Like a lot of the bots on this list, they frequently have built-in opt-out mechanisms.)
AhrefsBot
AhrefsBot is the crawler behind the popular Ahrefs SEO tool. It scans sites for backlinks and other metrics. Our data shows that this bot is in the top ten most active “good” bots (see the above chart), so you’re pretty likely to see it in your logs.
To stop AhrefsBot from crawling your site, you can opt out by emailing support@ahrefs.com.
User agent: AhrefsBot
Sample UA string: Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
SemrushBot
SemrushBot is the crawler behind Semrush’s SEO toolkit. It serves a similar purpose to AhrefsBot, scanning sites for keyword rankings, backlinks, and competitive data. It’s a useful tool for digital marketers, but might need to be throttled if it visits too frequently.
User agent: SemrushBot
Sample UA string: Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
Majestic MJ12Bot
MJ12Bot is used by Majestic SEO to map the web’s link graph for Majestic’s backlink index. It’s useful for SEO research, but its aggressive crawling might require tuning if your site gets a ton of traffic from it.
Check the MJ12Bot website for information on how to block or throttle it.
User agent: MJ12Bot
Sample UA string: MJ12bot/v1.4.0 (http://www.majestic12.co.uk/bot.php?+)
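If you do decide to rein in these SEO crawlers, robots.txt is usually the first lever to pull. The sketch below uses Python’s built-in urllib.robotparser to sanity-check a candidate policy before you deploy it; the directives themselves (a crawl delay for AhrefsBot and SemrushBot, a blanket disallow for MJ12bot) are illustrative choices rather than recommendations, and keep in mind that not every crawler honors the Crawl-delay directive.

```python
from urllib import robotparser

# A candidate robots.txt policy (the values here are illustrative only).
ROBOTS_TXT = """\
User-agent: AhrefsBot
Crawl-delay: 10

User-agent: SemrushBot
Crawl-delay: 10

User-agent: MJ12bot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ("AhrefsBot", "SemrushBot", "MJ12bot"):
    allowed = parser.can_fetch(bot, "https://example.com/blog/")
    delay = parser.crawl_delay(bot)
    print(f"{bot}: allowed={allowed}, crawl_delay={delay}")
# AhrefsBot: allowed=True, crawl_delay=10
# SemrushBot: allowed=True, crawl_delay=10
# MJ12bot: allowed=False, crawl_delay=None
```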
Social Media and Content Preview Bots
This type of crawler is mainly used by social media platforms to create content previews for links that are shared. Blocking them can mess with the appearance of links to your content, so in general it’s best to allow them.
If your site has pages you want to keep confidential, make sure to use the correct meta tags or authentication, because these preview bots will fetch anything that’s public.
Facebook Crawler
Facebook’s crawler scans shared links to generate previews for posts and messages. It’s a major bot (ranking in the top 5 of our traffic volume data), and allowing it is crucial if you want your content to show up correctly when it gets linked from Facebook or Instagram. Make sure it can access your site’s OG tags.
User agent: facebookexternalhit (most commonly)
Sample UA string: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
Twitterbot
Twitterbot is X’s crawler that scans shared links to generate preview cards. If you block it, posts linking to your site won’t show a preview, just a bare URL.
User agent: Twitterbot
Sample UA string: Twitterbot/1.0 Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.3 Chrome/69.0.3497.128 Safari/537.36
LinkedInBot
Another preview crawler, this time for LinkedIn’s platform. As with Facebook and X’s crawlers, blocking it will result in broken or missing previews on posts that link to your content.
User agent: LinkedInBot
Sample UA string: LinkedInBot/1.0 (compatible; Mozilla/5.0; +http://www.linkedin.com)
Slackbot
Slack’s crawler that creates preview cards for links that are shared in Slack channels. This one could also be considered an integration bot, but we’re listing it here because creating link previews is its main job.
User agent: Slackbot
Sample UA string: Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)
Pinterestbot
Pinterest’s web crawler is extremely active; in fact, it ranked #6 in our traffic stats, just behind Facebook. It saves page content and images when someone pins something.
User agent: Pinterestbot
Sample UA string: Mozilla/5.0 (compatible; Pinterestbot/1.0; +https://www.pinterest.com/bot.html)
Monitoring and Uptime Bots
These “guardian angel” bots regularly check a site’s availability and response time to alert its owner of outages.
They’re useful for making sure your site stays online, but they can add noise to your visitor logs if their activity isn’t filtered out. If you use this type of service on your site, make sure to allowlist the bots in your detection system, but exclude them from your visitor stats (see the sketch at the end of this section).
Pingdom
Pingdom is a monitoring service that uses bots to ping a site at regular intervals to check its uptime and performance. These frequent “health check” visits should be filtered from analytics to avoid skewed data.
User agent: Pingdom.com_bot
Sample UA string: Pingdom.com_bot_version_1.1 (+http://www.pingdom.com/)
UptimeRobot
Another common uptime monitor. It pings a site every few minutes and alerts the owner if it detects downtime.
User agent: UptimeRobot
Sample UA string: UptimeRobot/2.0 (+http://www.uptimerobot.com/)
BetterStack Bot
Another uptime checker we spotted in our data. (Formerly called Better Uptime.)
User agent: BetterStackBot
Sample UA string: BetterStackBot/1.0 (+https://betterstack.com/docs/monitoring/uptime-robot/bot/)
cron-job.org
A free service that triggers URL pings and other actions at user-scheduled intervals. This one can also show up in visitor logs.
User agent: cron-job.org
Sample UA string: cron-job.org/1.2 (+https://cron-job.org/en/faq/)
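To put the “exclude them from your visitor stats” advice into practice, here’s a minimal sketch of how you might filter these monitors out before tallying pageviews. The user-agent markers mirror the bots above; the record format is an assumption you’d adapt to however your own logs or analytics pipeline stores visits.

```python
# Substrings that identify the uptime monitors described above.
MONITOR_UA_MARKERS = ("Pingdom.com_bot", "UptimeRobot", "BetterStackBot", "cron-job.org")

def is_monitoring_bot(user_agent: str) -> bool:
    """True if the user agent looks like one of the uptime checkers above."""
    ua = user_agent.lower()
    return any(marker.lower() in ua for marker in MONITOR_UA_MARKERS)

def count_pageviews(records) -> int:
    """Count requests whose UA doesn't match a known monitor.

    `records` is assumed to be an iterable of dicts with a 'user_agent' key,
    e.g. rows already extracted from your access logs.
    """
    return sum(1 for r in records if not is_monitoring_bot(r.get("user_agent", "")))

# Example with two fake records: only the first one is counted.
sample = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."},
    {"user_agent": "UptimeRobot/2.0 (+http://www.uptimerobot.com/)"},
]
print(count_pageviews(sample))  # -> 1
```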
AI and LLM Data Crawlers
These crawlers, mainly operated by AI companies, are designed to scrape and index huge amounts of data that is then used for training AI systems or powering real-time AI services.
This is a relatively new class of web crawlers that’s emerged alongside the rapid growth of AI and large language models (LLMs). It’s also a category that is growing extremely fast. As noted earlier, OpenAI’s GPTBot placed #16 in our traffic volume data, despite having existed for only a handful of years.
Of course, these bots raise some thorny questions around data ownership and are therefore somewhat controversial. For a variety of reasons, these bots can’t always be trusted to respect robots.txt, so additional steps may be necessary to control their behavior; most of them have some kind of opt-out mechanism built in, which we’ve described below where applicable.
OpenAI’s GPTBot
GPTBot is the main web crawler used by OpenAI to gather training data for ChatGPT and its other AI models. It can be blocked using robots.txt, or by denying access to its IP range.
See OpenAI’s crawler documentation for more detailed information about how to rein in GPTBot, as well as the company’s other bots.
User agent: GPTBot
Sample UA string: GPTBot/1.0 (+https://openai.com/gptbot)
Anthropic’s ClaudeBot
ClaudeBot is Anthropic’s main web crawler. It gathers text used for training the Claude AI assistant. The company’s other crawlers include Claude-User and Claude-SearchBot.
Anthropic’s website contains information about how to limit the crawling activity of their bots. For further support, you can also contact claudebot@anthropic.com using an email address from the domain in question.
User agent: ClaudeBot
Sample UA string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Security and Vulnerability Scanners
Security-focused web crawlers scan domains for security issues like exposed databases or vulnerable plugins (often for threat research purposes).
Reputable scanners like the ones listed here will announce themselves with clear UA strings, and they often provide opt-out options. Malicious scanners, on the other hand, typically disguise their activity and ignore opt-out protocols.
A visit from a legitimate security scanner can be a pretty useful cue to check your network for issues (as in, Why is Shodan scanning me? Is everything secure?). But they’re also capable of generating a fair amount of traffic, so some admins choose to block them via firewall.
Censys.io
Censys.io’s web crawler is a security research bot that scans the internet for exposed devices and other network vulnerabilities. It’s used widely in the cybersecurity industry for threat intelligence and improving network resilience.
See Censys’s documentation for more detailed information about how to opt out of data collection.
User agent: censys.io
Sample UA string: Mozilla/5.0 (compatible; CensysInspect/1.1; +https://about.censys.io/)
Shodan
Shodan’s crawler also indexes connected devices, collecting data on open ports and other vulnerabilities.
User agent: shodan
Sample UA string: Mozilla/5.0 (compatible; Shodan/1.0; +https://www.shodan.io/bot)
BitSight
Companies like BitSight generate security ratings for websites and networks by using bots to scan them for vulnerabilities. Their scans are non-invasive, but their probing can still trigger security alerts.
User agent: BitSightBot
Sample UA string: Mozilla/5.0 (compatible; BitSightBot/1.0)
Best Practices for Managing “Good” Bots
Even bots with legitimate purposes—such as indexing content, generating link previews, or monitoring uptime—can cause issues if not properly managed. They might strain infrastructure, distort analytics, or access content you’d rather keep restricted. Smart bot management involves applying appropriate controls based on the bot’s identity, behavior, and impact.
Here are five best practices for keeping “good” bot traffic truly beneficial.
1. Use robots.txt as a Starting Point
Most legitimate bots follow robots.txt, making it a reliable first step:
- Restrict low-priority areas: Block access to non-public sections like staging environments, admin panels, or login endpoints.
- Customize per bot: Many bots accept agent-specific rules or crawl-delay directives, giving you control over how often and where they crawl (see the sketch after this list).
- Understand the limits: robots.txt is advisory, not enforceable. It won’t stop bots that choose to ignore it, including some AI agents and impersonators.
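To make the first two points concrete, here’s a hedged sketch of agent-specific robots.txt rules, checked with Python’s built-in urllib.robotparser. The paths and per-bot choices (blocking GPTBot outright, keeping every crawler out of /staging/, /admin/, and /login) are placeholders rather than recommendations; and, per the third point, only compliant bots will actually honor any of it.

```python
from urllib import robotparser

# An illustrative robots.txt: keep all crawlers out of non-public areas,
# and (as an example policy choice) disallow GPTBot entirely.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /login
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("Googlebot", "https://example.com/blog/post-1"),
    ("Googlebot", "https://example.com/admin/settings"),
    ("GPTBot", "https://example.com/blog/post-1"),
]
for agent, url in checks:
    print(f"{agent} -> {url}: {parser.can_fetch(agent, url)}")
# Expected: Googlebot may fetch /blog/post-1 but not /admin/settings,
# while GPTBot is disallowed everywhere (if it chooses to comply).
```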
2. Confirm a Bot’s Identity Before Granting Trust
Not every bot is what it claims to be. Malicious actors often spoof user-agent strings to appear legitimate.
- Verify major crawlers by IP address, not just their user-agent. Operators like Google publish reverse DNS conventions and official IP ranges that let you confirm a visitor really is Googlebot (see the sketch after this list).
- Log behavior from unfamiliar bots and confirm that it matches their stated purpose.
- Watch for red flags: If a “known” bot is accessing sensitive paths or making high-frequency requests, it could be a fake.
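Here’s a minimal sketch of that verification step in Python, using forward-confirmed reverse DNS: resolve the visiting IP to a hostname, check it against the domains the operator documents, then resolve that hostname back and confirm it returns the original IP. The hostname suffixes below for Googlebot and Bingbot are examples; confirm them against each operator’s official documentation before relying on them.

```python
import socket

# Hostname suffixes published by the crawler operators
# (verify these against each vendor's documentation).
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def verify_crawler_ip(ip: str, crawler: str) -> bool:
    """Forward-confirmed reverse DNS check for an IP claiming to be a known crawler."""
    suffixes = CRAWLER_DOMAINS[crawler]
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)              # reverse lookup
        if not hostname.endswith(suffixes):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward lookup
        return ip in forward_ips                               # must round-trip
    except (socket.herror, socket.gaierror):
        return False

# Example: check an IP that presented a Googlebot user agent.
# print(verify_crawler_ip("66.249.66.1", "googlebot"))
```

Some operators also publish official IP ranges you can match against directly; either way, the point is to not take the UA string’s word for it.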
3. Monitor Bot Traffic as Carefully as Human Traffic
Many analytics tools filter out bot activity by default, but bots still interact with your infrastructure and data.
- Analyze server logs and firewall data to see which bots are requesting what content (see the sketch after this list).
- Flag new or high-volume bots for review. Just because a bot is active doesn’t mean it’s valuable.
- Understand intent: Knowing whether a bot is indexing pages, scraping prices, or fetching preview metadata helps determine the right response.
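For a quick, no-dependencies look at which bots hit you the hardest, something like the sketch below works against combined-format access logs. The regex that grabs the final quoted user-agent field is an assumption about your log layout, and the log path in the usage example is hypothetical; adjust both to your setup.

```python
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def top_user_agents(log_path: str, limit: int = 10):
    """Count requests per user-agent string in an access log."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_FIELD.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts.most_common(limit)

# Example usage (the path is hypothetical):
# for ua, hits in top_user_agents("/var/log/nginx/access.log"):
#     print(f"{hits:>8}  {ua}")
```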
Bot Management solutions like HUMAN surface bot activity in dashboards and detailed bot profiles, making it easier to understand which bots are interacting with your site and how.
4. Apply Tiered Controls Based on Purpose
Not all bots should be treated the same. Match your response to their intent and impact.
- Search engine bots should generally be allowed and controlled with robots.txt and sitemap files.
- SEO and analytics bots can provide value, but often hit sites with high frequency. These may need to be rate-limited or blocked, depending on their usefulness to your business.
- Social media crawlers fetch page data for link previews. Their requests are typically lightweight and only occur when users share your content.
- AI and LLM-related crawlers are growing in volume and may access content for training or response generation. Consider restricting access unless the traffic is explicitly allowed or monetized. For more on how some organizations are managing this class of bots, see how HUMAN and TollBit enable enforcement and monetization for AI agents.
5. Revisit and Adjust Bot Policies Regularly
Bot ecosystems evolve rapidly, especially with the rise of AI agents and data harvesters.
- Review access rules every few months, especially for newly active bots or changes in crawler behavior.
- Update filters and detection patterns as new user-agent strings and AI tools emerge.
- Audit allowlists or IP rules to ensure continued relevance and avoid outdated exceptions that may introduce risk.
Take Control of Your Bot Traffic with HUMAN
The above best practices provide a foundation, but managing bots effectively at scale requires visibility, precision, and adaptability.
HUMAN helps organizations enforce bot access policies with greater accuracy and less manual overhead. From verifying the identity of search engine crawlers, to detecting obfuscated AI agents, to blocking unwanted scraping traffic before it reaches your application, HUMAN gives security and engineering teams the tools they need to stay in control.
With HUMAN, you can:
- Identify and manage known bots with a curated, toggleable list—no more maintaining manual rulesets or chasing new user-agent patterns.
- Gain full visibility into bot traffic, including AI crawlers, preview bots, and unknown agents. Easily see who’s accessing what, and how often.
- Enforce nuanced policies with flexible response options: allow, rate-limit, redirect, serve alternate content, or monetize traffic through integrations like TollBit.
- Stay ahead of change, thanks to advanced detection capabilities that adapt to evolving bot behaviors, spoofing attempts, and threat patterns.
Managing bots shouldn’t require guesswork or compromise. HUMAN makes it easy to welcome the bots you want and block or monetize the ones you don’t.
If you’re ready to go beyond basic detection and take control of your automated traffic, get in touch with our team for a demo or learn more about how HUMAN defends the entire user lifecycle.