The Complete Guide to llms.txt: Control How AI Bots Access Your Content
As AI-powered search engines and assistants become the primary way users discover information, website owners need new tools to manage how AI systems access and use their content. Enter llms.txt, a proposed companion to robots.txt designed specifically for AI crawlers.
In this comprehensive guide, we'll cover what llms.txt is, why it matters, and how to implement it on your site.
What is llms.txt?
The llms.txt file is a plain-text file, modeled on robots.txt, that communicates your preferences to AI systems. It can specify:
- What content AI systems can use for training or real-time retrieval
- How your content should be cited when AI references it
- Licensing and attribution requirements
- Contact information for AI-related inquiries
- Preferred summarization guidelines
Think of it as a conversation with AI systems about how you'd like them to interact with your content.
Why Does llms.txt Matter?
The Growing AI Bot Traffic Problem
According to data from multiple analytics platforms tracking AI bot behavior:
- AI bot visits have increased 527% year-over-year
- GPTBot alone accounts for ~60% of AI crawler traffic
- Peak crawling hours are 2-4 AM UTC (likely training runs)
- Technical documentation receives 5x more AI bot traffic than average content
Without clear instructions, AI systems make their own decisions about how to use your content. This can lead to:
- Uncontrolled data usage — Your content may be used for training without permission
- Disorganized crawling — Bots may crawl inefficiently, wasting server resources
- Poor attribution — AI may reference your content without proper citation
- Missed optimization opportunities — You can't guide AI toward your best content
The 40% Improvement Effect
Sites that implement llms.txt see up to 40% more organized AI crawler behavior, along with:
- More efficient use of your server resources
- Better understanding by AI systems of your content structure
- Improved chances of being cited in AI-generated responses
- Clearer communication of your content licensing preferences
The llms.txt Specification
File Location
The llms.txt file must be placed at the root of your domain:
https://yourdomain.com/llms.txt
Basic Structure
Here's a minimal example:
# llms.txt for example.com
# Version: 1.0
# Allow AI systems to access content
User-agent: *
Allow: /
# Contact for AI-related inquiries
Contact: ai@example.com
# License information
License: CC-BY-4.0
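Because the file is just `Key: value` lines, it is easy to parse. Below is a minimal sketch in Python (there is no official parser; `parse_llms_txt` is a hypothetical helper assuming the simple line format shown above):

```python
def parse_llms_txt(text):
    """Parse simple `Key: value` lines from an llms.txt file.

    Comment lines (#) and blank lines are skipped; repeated keys
    (e.g. multiple Allow lines) are collected into lists.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        rules.setdefault(key.strip(), []).append(value.strip())
    return rules

example = """\
# llms.txt for example.com
User-agent: *
Allow: /
Contact: ai@example.com
License: CC-BY-4.0
"""
print(parse_llms_txt(example))
```

Collecting repeated keys into lists matters because directives like Allow, Disallow, and Priority-content routinely appear more than once.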
Complete Example with All Directives
# llms.txt for example.com
# Last updated: 2026-01-24
# ===========================================
# GENERAL PERMISSIONS
# ===========================================
# Default: Allow all AI crawlers
User-agent: *
Allow: /
# ===========================================
# CRAWLER-SPECIFIC RULES
# ===========================================
# OpenAI (GPTBot, ChatGPT-User)
User-agent: GPTBot
User-agent: ChatGPT-User
Allow: /blog/
Allow: /docs/
Allow: /api-reference/
Disallow: /private/
Disallow: /internal/
# Anthropic (ClaudeBot, Claude-Web)
User-agent: ClaudeBot
User-agent: Claude-Web
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Google AI (Gemini, AI Overviews)
User-agent: Google-Extended
User-agent: Gemini-Deep-Research
Allow: /
# ===========================================
# TRAINING VS RETRIEVAL PERMISSIONS
# ===========================================
# Allow real-time retrieval (search)
Retrieval: allowed
# Training permissions (for model training)
Training: conditional
Training-terms: https://example.com/ai-training-license
# ===========================================
# CONTENT GUIDELINES
# ===========================================
# Preferred content for AI to prioritize
Priority-content: /docs/
Priority-content: /blog/
Priority-content: /api-reference/
# Content that should be summarized, not quoted
Summarize-only: /case-studies/
Summarize-only: /research/
# Content that requires full attribution
Require-attribution: /blog/
Require-attribution: /research/
# ===========================================
# CITATION PREFERENCES
# ===========================================
# How to cite this source
Citation-format: "[Title] by Example Company - [URL]"
Citation-requirement: mandatory
# Preferred citation URL (canonical)
Canonical-base: https://example.com
# ===========================================
# RATE LIMITING
# ===========================================
# Suggested crawl pacing: delay between requests (seconds) and a per-minute cap
Crawl-delay: 1
Max-requests-per-minute: 60
# ===========================================
# CONTACT & LEGAL
# ===========================================
# Contact for AI-related inquiries
Contact: ai-partnerships@example.com
# Legal terms for AI usage
Terms: https://example.com/terms/ai-usage
Privacy: https://example.com/privacy
# License for content
License: CC-BY-4.0
License-URL: https://creativecommons.org/licenses/by/4.0/
# ===========================================
# METADATA
# ===========================================
# Organization information
Organization: Example Company
Organization-URL: https://example.com/about
# Sitemap for AI crawlers
Sitemap: https://example.com/sitemap.xml
# Structured data availability
Schema-types: Article, HowTo, FAQ, Product
Key Directives Explained
Permission Directives
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specifies which AI crawler the rules apply to | User-agent: GPTBot |
| Allow | Permits access to specific paths | Allow: /blog/ |
| Disallow | Blocks access to specific paths | Disallow: /private/ |
| Retrieval | Controls real-time content retrieval | Retrieval: allowed |
| Training | Controls use for model training | Training: conditional |
Content Guidance Directives
| Directive | Purpose | Example |
|---|---|---|
| Priority-content | Highlights important content areas | Priority-content: /docs/ |
| Summarize-only | Content should be summarized, not quoted verbatim | Summarize-only: /research/ |
| Require-attribution | Content requires citation when referenced | Require-attribution: /blog/ |
Citation Directives
| Directive | Purpose | Example |
|---|---|---|
| Citation-format | How AI should cite your content | Citation-format: "[Title] - [URL]" |
| Citation-requirement | Whether citation is required | Citation-requirement: mandatory |
| Canonical-base | Base URL for canonical links | Canonical-base: https://example.com |
Rate Limiting Directives
| Directive | Purpose | Example |
|---|---|---|
| Crawl-delay | Seconds between requests | Crawl-delay: 1 |
| Max-requests-per-minute | Maximum request rate | Max-requests-per-minute: 60 |
Implementation Best Practices
1. Start Permissive, Then Restrict
Begin with open permissions and monitor bot behavior before adding restrictions:
# Initial permissive setup
User-agent: *
Allow: /
Retrieval: allowed
Training: conditional
2. Prioritize Your Best Content
Guide AI systems to your most valuable, well-maintained content:
# Direct bots to high-quality content
Priority-content: /docs/getting-started/
Priority-content: /blog/tutorials/
Priority-content: /api-reference/
# Deprioritize outdated or thin content
Disallow: /archive/
Disallow: /drafts/
3. Optimize for Different AI Use Cases
Different AI systems have different needs:
# Training crawlers - more restrictive
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /user-data/
Disallow: /private/
Training: conditional
# Search/retrieval crawlers - more permissive
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /
Retrieval: allowed
4. Protect Sensitive Content
Be explicit about content that shouldn't be accessed:
# Protect sensitive areas
Disallow: /admin/
Disallow: /user-dashboard/
Disallow: /api/internal/
Disallow: /staging/
Disallow: /*.pdf$ # Protect downloadable documents
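A crawler evaluating these Disallow rules needs wildcard matching. The sketch below assumes robots.txt-style semantics (`*` matches any characters, a trailing `$` anchors the end of the path); `path_disallowed` is a hypothetical helper:

```python
import re

def path_disallowed(path, disallow_rules):
    """Check a URL path against robots.txt-style Disallow patterns,
    where `*` matches any characters and a trailing `$` anchors the end."""
    for rule in disallow_rules:
        pattern = re.escape(rule).replace(r"\*", ".*")
        if pattern.endswith(r"\$"):
            pattern = pattern[:-2] + "$"   # restore the end-of-path anchor
        if re.match(pattern, path):        # rules match from the start of the path
            return True
    return False

rules = ["/admin/", "/staging/", "/*.pdf$"]
print(path_disallowed("/admin/users", rules))       # True
print(path_disallowed("/files/report.pdf", rules))  # True
print(path_disallowed("/blog/post-1", rules))       # False
```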
5. Set Clear Citation Requirements
Help AI systems cite your content properly:
# Citation preferences
Citation-format: "[Article Title] by [Author Name], Example Company - [URL]"
Citation-requirement: mandatory
Require-attribution: /blog/
Require-attribution: /research/
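An AI client honoring Citation-format would fill in the bracketed placeholders. A sketch of that substitution (the helper and field names are illustrative, taken from the format string above):

```python
def build_citation(fmt, fields):
    """Substitute [Placeholder] tokens in a Citation-format string."""
    for name, value in fields.items():
        fmt = fmt.replace(f"[{name}]", value)
    return fmt

fmt = "[Article Title] by [Author Name], Example Company - [URL]"
citation = build_citation(fmt, {
    "Article Title": "Understanding llms.txt",
    "Author Name": "Jane Doe",
    "URL": "https://example.com/blog/llms-txt",
})
print(citation)
# Understanding llms.txt by Jane Doe, Example Company - https://example.com/blog/llms-txt
```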
Known AI Bot User Agents
Here are the major AI crawlers you should know about:
Training Crawlers (Model Training)
| Bot | Organization | User-Agent |
|---|---|---|
| GPTBot | OpenAI | GPTBot |
| Google-Extended | Google | Google-Extended |
| ClaudeBot | Anthropic | ClaudeBot |
| Meta-ExternalAgent | Meta | meta-externalagent |
| Bytespider | ByteDance | Bytespider |
| CCBot | Common Crawl | CCBot |
Search/Retrieval Crawlers (Real-time Queries)
| Bot | Organization | User-Agent |
|---|---|---|
| ChatGPT-User | OpenAI | ChatGPT-User |
| OAI-SearchBot | OpenAI | OAI-SearchBot |
| PerplexityBot | Perplexity AI | PerplexityBot |
| Claude-Web | Anthropic | Claude-Web |
| Gemini-Deep-Research | Google | Gemini-Deep-Research |
| YouBot | You.com | YouBot |
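Using these tokens, an incoming request can be classified by purpose, which is useful when you want different rules for training and retrieval bots. A sketch (the token sets mirror the tables above and would need to be kept current as new crawlers appear):

```python
TRAINING_BOTS = {"GPTBot", "Google-Extended", "ClaudeBot",
                 "Meta-ExternalAgent", "Bytespider", "CCBot"}
RETRIEVAL_BOTS = {"ChatGPT-User", "OAI-SearchBot", "PerplexityBot",
                  "Claude-Web", "Gemini-Deep-Research", "YouBot"}

def classify_ai_bot(user_agent_header):
    """Return 'training', 'retrieval', or None for a raw User-Agent header."""
    ua = user_agent_header.lower()
    for token in TRAINING_BOTS:
        if token.lower() in ua:
            return "training"
    for token in RETRIEVAL_BOTS:
        if token.lower() in ua:
            return "retrieval"
    return None

print(classify_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.2)"))  # training
print(classify_ai_bot("PerplexityBot/1.0"))                     # retrieval
```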
Combining llms.txt with robots.txt
The llms.txt file complements robots.txt rather than replacing it. Use both together:
robots.txt (for all crawlers)
User-agent: *
Allow: /
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /private/
User-agent: ClaudeBot
Allow: /
Sitemap: https://example.com/sitemap.xml
llms.txt (AI-specific instructions)
User-agent: *
Allow: /
Retrieval: allowed
Training: conditional
Priority-content: /docs/
Citation-requirement: mandatory
Contact: ai@example.com
Monitoring AI Bot Compliance
Once you've implemented llms.txt, monitor whether AI bots actually follow your rules.
Server Log Analysis
Look for these user agents in your server logs:
# Check for AI bot traffic
grep -E "(GPTBot|ClaudeBot|PerplexityBot|ChatGPT)" access.log | wc -l
# See which pages AI bots access most
grep "GPTBot" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
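The same analysis can be done in Python, which is easier to extend with filtering or reporting. A sketch assuming common/combined log format, where the request path is the 7th whitespace-separated field (matching the awk example above); the function name is illustrative:

```python
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "ChatGPT")

def top_ai_bot_pages(log_lines, limit=20):
    """Count the most-requested paths across known AI bot user agents."""
    hits = Counter()
    for line in log_lines:
        if any(bot in line for bot in AI_BOTS):
            fields = line.split()
            if len(fields) > 6:
                hits[fields[6]] += 1   # 7th field: the request path
    return hits.most_common(limit)

sample = [
    '1.2.3.4 - - [24/Jan/2026:02:14:07 +0000] "GET /docs/intro HTTP/1.1" 200 512 "-" "GPTBot/1.2"',
    '1.2.3.4 - - [24/Jan/2026:02:15:01 +0000] "GET /docs/intro HTTP/1.1" 200 512 "-" "ClaudeBot/1.0"',
    '5.6.7.8 - - [24/Jan/2026:09:30:12 +0000] "GET /blog/post HTTP/1.1" 200 256 "-" "Mozilla/5.0"',
]
print(top_ai_bot_pages(sample))  # [('/docs/intro', 2)]
```

In production you would stream the file with `open("access.log")` rather than a list of sample lines.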
Key Metrics to Track
- Bot traffic volume — Are AI crawlers visiting your site?
- Crawl patterns — Are they following your llms.txt rules?
- Page coverage — Are priority pages being crawled?
- Respect for disallow — Are restricted pages being accessed?
Use AI Search Index for Monitoring
Tools like AI Search Index can automatically track:
- Which AI bots visit your site
- Peak crawling hours and patterns
- Which pages receive the most AI attention
- Compliance with your specified permissions
Common Implementation Mistakes
1. Not Creating the File
Many sites still don't have an llms.txt file at all. Even a minimal version is better than nothing:
# Minimum viable llms.txt
User-agent: *
Allow: /
Contact: webmaster@example.com
2. Overly Restrictive Rules
Blocking all AI crawlers hurts your visibility in AI search:
# Too restrictive - avoid this
User-agent: *
Disallow: /
Instead, be selective:
# Better approach
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/
3. Ignoring Training vs. Retrieval Distinction
Many sites want to allow real-time retrieval but restrict training:
# Allow search, restrict training
Retrieval: allowed
Training: denied
Training-terms: https://example.com/training-license
4. No Contact Information
Always include a way for AI companies to reach you:
Contact: ai-partnerships@example.com
Terms: https://example.com/ai-usage-terms
Future of llms.txt
The llms.txt standard is still evolving. Developments to watch include:
- Standardization — Working toward RFC status, as robots.txt achieved with RFC 9309
- More granular controls — Content-level permissions vs. page-level
- Payment integration — Licensing terms for paid content access
- Verification mechanisms — Cryptographic signing of permissions
- Industry adoption — More AI companies committing to respect the standard
llms.txt Template Generator
Here's a quick template you can customize for your site:
# llms.txt for [YOUR-DOMAIN]
# Generated: [DATE]
# ===========================================
# PERMISSIONS
# ===========================================
User-agent: *
Allow: /
# Restricted areas
Disallow: /admin/
Disallow: /private/
Disallow: /api/internal/
# ===========================================
# AI USAGE TERMS
# ===========================================
Retrieval: allowed
Training: [allowed|conditional|denied]
# Priority content for AI
Priority-content: /docs/
Priority-content: /blog/
# ===========================================
# CITATION
# ===========================================
Citation-format: "[Title] - [URL]"
Citation-requirement: preferred
# ===========================================
# CONTACT
# ===========================================
Contact: [YOUR-EMAIL]
Organization: [YOUR-COMPANY]
Terms: [TERMS-URL]
# ===========================================
# TECHNICAL
# ===========================================
Crawl-delay: 1
Sitemap: https://[YOUR-DOMAIN]/sitemap.xml
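Filling in the template can also be scripted. A sketch that renders a minimal llms.txt from a few site-specific values (`generate_llms_txt` is a hypothetical helper, not part of any tool):

```python
def generate_llms_txt(domain, email, training="conditional",
                      disallow=("/admin/", "/private/")):
    """Render a minimal llms.txt from a few site-specific values."""
    lines = [f"# llms.txt for {domain}", "User-agent: *", "Allow: /"]
    lines += [f"Disallow: {path}" for path in disallow]
    lines += ["Retrieval: allowed",
              f"Training: {training}",
              f"Contact: {email}",
              f"Sitemap: https://{domain}/sitemap.xml"]
    return "\n".join(lines) + "\n"

print(generate_llms_txt("example.com", "ai@example.com"))
```

Writing the result to disk is one more line: `open("llms.txt", "w").write(generate_llms_txt(...))`, after which the file belongs at your web root.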
Key Takeaways
- Implement llms.txt today — Sites with the file see 40% more organized AI crawling
- Balance access and control — Be permissive enough to maintain AI visibility
- Distinguish training from retrieval — Different permissions for different use cases
- Monitor compliance — Track whether AI bots respect your instructions
- Keep it updated — Review and update as the AI ecosystem evolves
Frequently Asked Questions
Q: Do AI companies actually respect llms.txt?
A: Adoption is growing. OpenAI, Anthropic, and Perplexity have indicated support for respecting content permissions. However, compliance varies. The standard is still evolving and becoming more widely adopted.
Q: Should I block AI crawlers completely?
A: Generally not recommended. Blocking all AI crawlers means your content won't appear in AI search results like ChatGPT, Perplexity, or Google AI Overviews. Consider selective permissions instead.
Q: Does llms.txt replace robots.txt?
A: No. Use both files. robots.txt controls crawler access for all bots, while llms.txt adds AI-specific instructions about training, retrieval, and citation.
Q: How often should I update llms.txt?
A: Review quarterly or when you add new content areas. Update immediately if you notice bot behavior that doesn't align with your preferences.
Q: Can I charge for AI training access?
A: Some sites are experimenting with paid licensing. Include your licensing terms URL in the file:
Training: conditional
Training-terms: https://yoursite.com/ai-licensing
Conclusion
The llms.txt file gives you:
- Better control over AI access to your content
- Improved organization of AI crawler behavior
- Clear communication of your citation preferences
- A foundation for future AI-related permissions
Start with a basic implementation and expand as needed. Your future AI visibility depends on the relationships you build with AI systems today.
Additional Resources
- AI Search Index — Monitor AI bot traffic and visibility
- Superlines AI Analytics — Track brand mentions in AI responses
- robots.txt Specification — The original crawler control standard
- Schema.org — Structured data for better AI understanding