Frequently asked questions

How do we know AI companies/bots respect `robots.txt`?

The short answer is that we don't. robots.txt is a well-established standard, but compliance is voluntary. There is no enforcement mechanism.

Why might AI web crawlers respect `robots.txt`?

Larger and/or reputable companies developing AI models probably wouldn't want to damage their reputation by ignoring robots.txt.

Also, given the contentious nature of AI and the possibility of legislation limiting its development, companies developing AI models will probably want to be seen to be behaving ethically, and so should (eventually) respect robots.txt.

Can we block crawlers based on user agent strings?

Yes, provided the crawlers identify themselves and your application/hosting supports doing so.

Some crawlers — such as Perplexity — do not identify themselves via their user agent strings and, as such, are difficult to block.

What can we do if a bot doesn't respect `robots.txt`?

That depends on your stack.

Nginx
- Blocking Bots with Nginx by Robb Knight
- Blocking AI web crawlers by Glyn Normington
Apache httpd
- Blockin' bots. by Ethan Marcotte
- Blocking Bots With 11ty And Apache by fLaMEd fury

Tip

The snippets in these articles all use mod_rewrite, which should be considered a last resort. A good alternative that's less resource-intensive is mod_setenvif; see httpd docs for an example. You should also consider setting this up in httpd.conf instead of .htaccess if it's available to you.

Netlify
- Blockin' bots on Netlify by Jeremia Kimelman
Cloudflare
- Block AI bots, scrapers and crawlers with a single click by Cloudflare
- I’m blocking AI crawlers by Roelant
Vercel
- Block AI Bots Firewall Rule by Vercel

Why should we block these crawlers?

They're extractive, confer no benefit to the creators of data they're ingesting and also have wide-ranging negative externalities.

How Tech Giants Cut Corners to Harvest Data for A.I.

OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems.

How AI copyright lawsuits could make the whole industry go extinct

The New York Times' lawsuit against OpenAI is part of a broader, industry-shaking copyright challenge that could define the future of AI.

How can I contribute?

Open a pull request. It will be reviewed and acted upon appropriately. We really appreciate contributions — this is a community effort.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FAQ.md

FAQ.md

Frequently asked questions

How do we know AI companies/bots respect `robots.txt`?

Why might AI web crawlers respect `robots.txt`?

Can we block crawlers based on user agent strings?

What can we do if a bot doesn't respect `robots.txt`?

Why should we block these crawlers?

How can I contribute?

Files

FAQ.md

Latest commit

History

FAQ.md

File metadata and controls

Frequently asked questions

How do we know AI companies/bots respect robots.txt?

Why might AI web crawlers respect robots.txt?

Can we block crawlers based on user agent strings?

What can we do if a bot doesn't respect robots.txt?

Why should we block these crawlers?

How can I contribute?

How do we know AI companies/bots respect `robots.txt`?

Why might AI web crawlers respect `robots.txt`?

What can we do if a bot doesn't respect `robots.txt`?