What you'll learn
- Keep admin, API, and private paths disallowed for all user agents you list.
- Reference your sitemap and, when applicable, your llms.txt location.
- Treat robots.txt as intent, not a security boundary: some bots ignore it.
robots.txt tells well-behaved crawlers which paths they should skip. Search engines generally respect it. Some AI crawlers may still fetch URLs for training or retrieval, so combine robots rules with authentication for anything truly private.
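To see what "well-behaved" means in practice, here is a minimal Python sketch (standard library only) of the check a compliant crawler runs before fetching a URL. The domain and paths are placeholders for your own site; a bot that ignores robots.txt simply never runs this step, which is why authentication still matters for private routes.

from urllib.robotparser import RobotFileParser

# Placeholder URL: point this at your live robots.txt.
rp = RobotFileParser("https://yourdomain.com/robots.txt")
rp.read()

# A compliant crawler asks before fetching; a non-compliant one skips this entirely.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/admin/settings"))  # expect False if /admin/ is disallowed
print(rp.can_fetch("Googlebot", "https://yourdomain.com/pricing"))      # expect True for public pages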
What to block everywhere
- Admin and account areas
- Internal APIs and webhooks
- Draft or staging paths
- User-specific or PII-heavy routes
Template you can adapt
Start from one block per major crawler family, then a User-agent: * fallback. Keep rules consistent: a crawler that matches a named User-agent group ignores the * group entirely, so you can accidentally allow GPTBot a path you blocked for everyone else unless each named block repeats every Disallow you care about.
# Example skeleton: replace paths with yours
User-agent: Googlebot
Allow: /
Disallow: /admin/
Disallow: /api/
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /account/
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Sitemap: https://yourdomain.com/sitemap.xml

If your stack documents an LLMS or similar line for discovery files, add it only when valid. Example: LLMS: https://yourdomain.com/llms.txt
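One way to catch drift between blocks is to script the consistency check. The sketch below uses Python's urllib.robotparser against a hypothetical list of sensitive paths and user agents; adjust both lists to match your own rules before relying on it.

from urllib.robotparser import RobotFileParser

# Hypothetical lists: the paths every block should disallow, and the agents you define.
SENSITIVE_PATHS = ["/admin/", "/api/"]
AGENTS = ["Googlebot", "GPTBot", "*"]

rp = RobotFileParser("https://yourdomain.com/robots.txt")
rp.read()

for agent in AGENTS:
    for path in SENSITIVE_PATHS:
        if rp.can_fetch(agent, "https://yourdomain.com" + path):
            print("WARNING: {} may fetch {}".format(agent, path))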
AI crawlers and compliance
Publishing Allow rules for AI bots signals that you invite fetches of your public pages. If you need to opt out of specific training uses, use vendor-specific mechanisms where available in addition to robots.txt, and have legal review keep your public statements aligned with your data policy.
Testing
Fetch https://yourdomain.com/robots.txt in the browser. Fix syntax errors (typos in Disallow, wrong user-agent names). Use crawl tools or curl with a custom User-Agent header to spot-check critical URLs.
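For the spot-check itself, a short Python script can impersonate a crawler's User-Agent and confirm that a disallowed path is also protected by authentication, not just by robots.txt. The URL and agent string below are placeholders; a 401 or 403 is the answer you want for private routes.

import urllib.error
import urllib.request

# Placeholder URL and agent string: substitute a route you expect to be private.
req = urllib.request.Request(
    "https://yourdomain.com/admin/",
    headers={"User-Agent": "GPTBot"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.url)  # a 200 on the admin URL itself means robots.txt is the only barrier
except urllib.error.HTTPError as err:
    print(err.code)  # 401 or 403 means the route is actually protected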
Operational tips
- Update robots.txt when you add new app sections.
- Keep sitemap URL current if your sitemap path changes.
- Document internally who owns robots.txt edits so automated deploys do not overwrite custom rules.