Blocking Abusive AI Website Crawlers

Explore the rise of AI web crawlers scraping online content, the implications for creators, and how to block unwanted bots using robots.txt effectively.

The Rise of AI Bots Scraping Web Content

AI models require extensive data for training, often gathered by automated bots crawling websites to collect publicly accessible information. While this accelerates AI innovation, many content creators feel exploited as their work is used without consent or compensation. — Neil Clarke, Block the Bots That Feed AI Models (August 2023)

This issue is particularly sensitive for creators who depend on their content for income or wish to retain control over its distribution. As AI capabilities grow, so does the need for ethical data usage and control.

Should You Block AI Crawlers on Your Website?

Some site owners welcome AI research and open web usage. However, many are uncomfortable with unregulated scraping, especially when their content is used commercially without permission.

Updating your robots.txt file allows you to instruct bots on which parts of your site they can access. Although not all bots comply, most reputable crawlers, including major search engines and some AI bots, respect these directives. — Cory Dransfeldt, Go Ahead and Block AI Web Crawlers (March 2024)

By proactively managing your robots.txt, you maintain greater control over your content’s accessibility to automated crawlers.
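At its simplest, a robots.txt rule set is a group: one or more User-agent lines naming the crawlers it applies to, followed by the access rules for those crawlers. For example, this minimal group denies a single bot (OpenAI's GPTBot, which also appears in the full blocklist below) access to the entire site:

# Deny one AI crawler access to the whole site
User-agent: GPTBot
Disallow: /

An empty Disallow: line has the opposite effect, granting full access; the full example below uses one to keep the site open to all other bots.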

The Ethical Debate on AI Scraping and Web Content

The practice of scraping web content for AI training raises broader ethical questions. Should content creators have a say in how their publicly available work is collected and reused? Many argue that content scraped for commercial AI models deserves explicit consent and fair use considerations.

Community initiatives like the ai.robots.txt project advocate for transparent, standardized ways to identify and block AI bots on websites, empowering site owners to protect their digital assets.
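If you would rather track that project's curated list than maintain your own, you can fetch its generated robots.txt and review it before publishing. Here is a minimal sketch using only Python's standard library, assuming the file is still published at the repository's raw URL (verify the current location in the repository before relying on it):

import urllib.request

# Assumed location of the ai.robots.txt project's generated file;
# confirm it against github.com/ai-robots-txt/ai.robots.txt first.
URL = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"

with urllib.request.urlopen(URL, timeout=10) as response:
    blocklist = response.read().decode("utf-8")

# Review the contents before writing them to your web root.
print(blocklist)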

Example robots.txt to Block Known AI Crawlers

Below is a sample robots.txt file that gives all bots general access while denying a comprehensive blocklist of known AI-related crawlers. One caveat on layout: the User-agent lines and the shared Disallow: / rule must form a single uninterrupted group, because some parsers treat a blank line as the end of a record and would otherwise ignore the blocklist entirely:

# Sitemap location for all bots
Sitemap: https://example.com/sitemap-index.xml

# Allow all other bots full access
User-agent: *
Disallow:

# Block specific bots from accessing any part of the site
User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: aiHitBot
User-agent: AliyunSecBot
User-agent: AliyunSecBot/Aliyun
User-agent: AliyunSecBot/Nutch-1.21-SNAPSHOT
User-agent: Amazonbot
User-agent: AndiBot
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Barkrowler
User-agent: bedrockbot
User-agent: BLEXBot
User-agent: Brightbot
User-agent: Brightbot 1.0
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: Claude-Web
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Cotoyogi
User-agent: Crawlspace
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: EchoboxBot
User-agent: ExaBot
User-agent: FacebookBot
User-agent: Factset_spyderbot
User-agent: FirecrawlAgent
User-agent: GPTBot
User-agent: iaskspider
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: ISSCyberRiskCrawler
User-agent: Kangaroo Bot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: MistralAI-User
User-agent: MistralAI-User/1.0
User-agent: MyCentralAIScraperBot
User-agent: NovaAct
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: Operator
User-agent: PanguBot
User-agent: Panscient
User-agent: panscient.com
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: PetalBot
User-agent: PhindBot
User-agent: QualifiedBot
User-agent: QuillBot
User-agent: quillbot.com
User-agent: Quora-Bot
User-agent: SBIntuitionsBot
User-agent: Scrapy
User-agent: SemrushBot
User-agent: SemrushBot-BA
User-agent: SemrushBot-CT
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SI
User-agent: SemrushBot-SWA
User-agent: Sidetrade indexer bot
User-agent: TikTokSpider
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: wpbot
User-agent: YandexAdditional
User-agent: YandexAdditionalBot
Disallow: /

The above blocklist does not include the following Google-related user agents:

  • Google-CloudVertexBot
  • Google-Extended
  • GoogleOther
  • GoogleOther-Image
  • GoogleOther-Video

These tokens have been deliberately omitted out of caution for SEO: rules targeting Google user agents are easy to get wrong, and blocking Google's Search crawlers can reduce your site's visibility and indexing performance, so the safest default is to leave all of them with full access. Note, however, that Google documents Google-Extended as a control over whether crawled content may be used to train its generative AI models, not as a Search crawler; blocking that one token does not affect Search indexing or ranking, so you can add it to the blocklist above if you want to opt out of AI training while remaining fully indexable.
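Before deploying any robots.txt, it is worth confirming that the file parses the way you expect. The sketch below uses Python's standard-library urllib.robotparser with a trimmed two-group stand-in for the full example above; in practice, read your real file from disk instead:

from urllib.robotparser import RobotFileParser

# Trimmed stand-in for the full robots.txt shown above.
robots_txt = """\
User-agent: *
Disallow:

User-agent: GPTBot
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A blocked AI crawler: expect False
print(parser.can_fetch("GPTBot", "https://example.com/some-article/"))
# Any other bot falls through to the allow-all group: expect True
print(parser.can_fetch("SomeOtherBot", "https://example.com/some-article/"))

Python's parser, like several others, treats a blank line as the end of a record, which is exactly why the Disallow: / rule in the blocklist must immediately follow its User-agent lines.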