The rise of AI content scraping has made one file more powerful than ever: your robots.txt. This guide will walk you through how to use it to block large language model crawlers and protect your work. Whether you're running a blog, a news site, or a publishing platform, these simple steps help reclaim your digital boundaries—because your content should serve readers, not train machines.
European Association of Travel Writers (EATW)
Protecting your content in the age of machine learning
Why This Matters
AI bots don’t ask permission. If your site is public, it’s likely being crawled. Blocking known AI and LLM training bots via robots.txt is a clear technical signal — and, paired with the legal notice in Step 3, a documented way of saying: “No thanks.”
Step 1: Locate Your robots.txt File
This file lives in the root directory of your website (e.g., https://www.yoursite.com/robots.txt). You can access it via FTP, cPanel, or your CMS (like Joomla or WordPress).
Step 2: Add These Disallow Rules
Paste this at the bottom of your file to block the most common AI crawlers:
Tip: These blocks don’t work retroactively. They prevent future crawling, not previous training.
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: TurnitinBot
Disallow: /
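If the list feels long, the Robots Exclusion Protocol (RFC 9309) lets several User-agent lines share one group of rules, so the same block can be condensed — though it’s worth re-testing after condensing, since a handful of older crawlers only match single-agent groups:

```text
User-agent: GPTBot
User-agent: Google-Extended
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: OAI-SearchBot
User-agent: FacebookBot
User-agent: TurnitinBot
Disallow: /
```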
Step 3: Add a Legal Notice
Insert a clear disclaimer in your website’s footer or Terms of Service:
“This website does not permit its content to be used in the training of AI or machine learning systems without prior written consent. Automated data collection is strictly prohibited.”
Step 4: Test It
Use these tools to make sure your robots.txt file is live and parsing correctly:
- Google Search Console’s robots.txt report (the old standalone robots.txt tester has been retired)
- A curl request with a spoofed User-Agent header, to simulate a bot visit from the command line
- Chrome extensions like “Robots Exclusion Checker”
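If you prefer to script the check, Python’s standard-library robots.txt parser can evaluate your rules against any user-agent without leaving your desk. A minimal sketch — the rules are trimmed to three bots for brevity, and the article URL is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# A trimmed version of the rules from Step 2.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The AI crawlers listed above should be refused everywhere...
for bot in ("GPTBot", "ClaudeBot", "CCBot"):
    print(bot, parser.can_fetch(bot, "https://www.yoursite.com/article"))

# ...while agents with no matching rule fall through to "allowed".
print("Googlebot", parser.can_fetch("Googlebot", "https://www.yoursite.com/article"))
```

To check your live file instead, swap the `parse()` call for `parser.set_url("https://www.yoursite.com/robots.txt")` followed by `parser.read()`.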
Step 5: Monitor Your Logs
Check your server access logs or use Cloudflare/WAF tools to identify any suspicious bot activity. Watch for large-volume hits from unknown user-agents or IPs.
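A quick way to spot AI crawlers in a raw access log is to search each line for the user-agent names from Step 2. A minimal sketch, assuming the common combined log format — the sample log lines below are invented for illustration:

```python
# User-agent names worth flagging (a subset of the Step 2 list).
AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")

# Three invented lines in Apache/Nginx combined log format.
SAMPLE_LOG = """\
203.0.113.7 - - [10/May/2025:12:00:01 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
198.51.100.2 - - [10/May/2025:12:00:05 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"
203.0.113.9 - - [10/May/2025:12:00:09 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Bytespider)"
"""

# Count requests per flagged bot (case-insensitive substring match).
hits = {}
for line in SAMPLE_LOG.splitlines():
    for bot in AI_BOTS:
        if bot.lower() in line.lower():
            hits[bot] = hits.get(bot, 0) + 1

for bot, count in sorted(hits.items()):
    print(f"{bot}: {count} request(s)")
```

In practice you would read your real log file instead of `SAMPLE_LOG`; user-agent strings can be spoofed, so treat this as a first pass before checking source IPs.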
Optional: Use a Firewall
You can block aggressive bots at the server level using:
- Cloudflare Bot Management (paid)
- ModSecurity rules
- Joomla plugins that manage bot access
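On an Apache server (a common host for Joomla and WordPress sites), you can also refuse these bots outright rather than relying on them to honor robots.txt. A minimal .htaccess sketch, assuming mod_rewrite is enabled — adjust the bot list to match Step 2:

```apache
RewriteEngine On
# Return 403 Forbidden to any request whose User-Agent matches a known AI crawler.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot) [NC]
RewriteRule .* - [F,L]
```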
You don’t have to block everything. Good bots like Googlebot, Bingbot, and Twitterbot help drive traffic and visibility. The goal is to draw a boundary around your intellectual property, not hide your work.
Your voice matters. Your content is yours. EATW stands with you.