The rise of AI content scraping has made one file more powerful than ever: your robots.txt. This guide walks you through using it to block large language model crawlers and protect your work. Whether you run a blog, a news site, or a publishing platform, these simple steps help you reclaim your digital boundaries, because your content should serve readers, not train machines.

European Association of Travel Writers (EATW)
Protecting your content in the age of machine learning

Why This Matters

AI bots don’t ask permission. If your site is public, it’s likely being crawled. Blocking known AI and LLM training bots in robots.txt is the standard technical way of saying: “No thanks.” Keep in mind that robots.txt is voluntary rather than legally binding; well-behaved crawlers honor it, which is why Step 3 adds a legal notice and Step 5 covers monitoring.

Step 1: Locate Your robots.txt File

This file lives in the root directory of your website (e.g., https://www.yoursite.com/robots.txt). You can access it via FTP, cPanel, or your CMS (like Joomla or WordPress).

Step 2: Add These Disallow Rules

Paste this at the bottom of your file to block the most common AI crawlers:
Tip: These blocks are not retroactive. They prevent future crawling; they cannot undo training on content that has already been collected.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: TurnitinBot
Disallow: /
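
Equivalently, the robots.txt standard (RFC 9309) allows several User-agent lines to share one group of rules, so the list above can be collapsed into a single group:

User-agent: GPTBot
User-agent: Google-Extended
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: OAI-SearchBot
User-agent: FacebookBot
User-agent: TurnitinBot
Disallow: /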

Step 3: Add a Legal Notice

Insert a clear disclaimer in your website’s footer or Terms of Service:

“This website does not permit its content to be used in the training of AI or machine learning systems without prior written consent. Automated data collection is strictly prohibited.”

Step 4: Test It

Use these tools to make sure your robots.txt file is active:

  • Load https://www.yoursite.com/robots.txt directly in your browser; the rules you added should appear exactly as you pasted them.

  • Check Google Search Console’s robots.txt report to see how Google fetches and parses the file.
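
If you have shell access, you can also grep the file for the user-agents you expect. A minimal sketch (it builds a small local sample so it runs anywhere; in practice, download your real file first, e.g. curl -s https://www.yoursite.com/robots.txt -o sample-robots.txt):

```shell
# Sketch: check that the AI user-agents you expect are present in a
# robots.txt file. A local sample is created here for illustration.
ROBOTS=sample-robots.txt
printf 'User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n' > "$ROBOTS"

for bot in GPTBot CCBot; do
  if grep -qi "^User-agent: $bot" "$ROBOTS"; then
    echo "$bot: rule found"
  else
    echo "$bot: rule MISSING"
  fi
done
```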

Step 5: Monitor Your Logs

Check your server access logs or use Cloudflare/WAF tools to identify any suspicious bot activity. Watch for large-volume hits from unknown user-agents or IPs.
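
As a sketch of what that log check can look like (the log lines and path below are made up for illustration; point LOG at your real Apache or Nginx access log instead):

```shell
# Sketch: count hits per AI user-agent in a web server access log.
# A tiny sample log is generated here so the example is self-contained.
LOG=access.log
printf '%s\n' \
  '1.2.3.4 - - [01/Jan/2025] "GET / HTTP/1.1" 200 "-" "GPTBot/1.0"' \
  '5.6.7.8 - - [01/Jan/2025] "GET /a HTTP/1.1" 200 "-" "CCBot/2.0"' \
  '5.6.7.8 - - [01/Jan/2025] "GET /b HTTP/1.1" 200 "-" "CCBot/2.0"' > "$LOG"

for bot in GPTBot CCBot ClaudeBot Bytespider; do
  hits=$(grep -c "$bot" "$LOG" || true)
  if [ "$hits" -gt 0 ]; then
    echo "$bot: $hits hits"
  fi
done
```

A sudden spike in one of these counts is usually the first sign that a crawler is ignoring your robots.txt.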

Optional: Use a Firewall

You can block aggressive bots at the server level using:

  • Cloudflare Bot Management (paid)

  • ModSecurity rules

  • Joomla plugins that manage bot access
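
For example, on Nginx you can deny by User-Agent with a map that mirrors the robots.txt list above (a sketch only; exact placement depends on your configuration):

```nginx
# Sketch: block the same AI crawlers at the server level with Nginx.
# The map belongs in your http {} context.
map $http_user_agent $is_ai_bot {
    default          0;
    ~*GPTBot         1;
    ~*ClaudeBot      1;
    ~*PerplexityBot  1;
    ~*Bytespider     1;
    ~*CCBot          1;
}

server {
    # ... your existing listen / server_name / root directives ...

    # Refuse matched bots before they reach your content.
    if ($is_ai_bot) {
        return 403;
    }
}
```

Apache users can achieve the same effect with BrowserMatchNoCase to flag the user-agents and a Require rule to deny them.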

You don’t have to block everything. Good bots like Googlebot, Bingbot, and Twitterbot help drive traffic and visibility. The goal is to draw a boundary around your intellectual property, not hide your work.

Your voice matters. Your content is yours. EATW stands with you.

Volunteer Your Legal Expertise

Are you a legal thinker, policy strategist, or digital rights advocate who believes authorship still matters? Whether you're fluent in IP law, AI ethics, EU copyright directives, or licensing frameworks—we’d love your help.

Contribute your skills to help EATW develop real-world tools, shape forward-thinking policy, and defend creative freedom across Europe.

Become a Legal Advisor