llms.txt Explained: The robots.txt for AI Crawlers
What is llms.txt? Why AI engines scan it. How to set it up correctly. Everything you need to know about this new file type.
llms.txt is a site-level policy file that tells AI engines how they can use and cite your content. It complements robots.txt by focusing on usage and attribution rather than crawl access. If you want AI systems to cite you correctly, this is the simplest control point.
What is llms.txt?
llms.txt is an emerging convention (not yet a formal standard) for a file that tells AI engines (Perplexity, Claude, ChatGPT) how to handle your content.
- robots.txt = “Can you crawl my site?” (access control)
- llms.txt = “How should you use my content?” (usage policy)
Both should exist on your site.
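Since the two files work together, a typical setup pairs an open llms.txt with a robots.txt that explicitly allows the major AI crawlers. The user-agent tokens below (GPTBot, ClaudeBot, PerplexityBot) are the ones the vendors document, but verify current names before relying on them:

```
# robots.txt — explicitly welcome the major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```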
Why llms.txt Matters
The Problem: Content Attribution
When OpenAI’s ChatGPT answers a user’s question, it synthesizes an answer from multiple sources. But how does it decide which sources to cite, and in what format?
Without llms.txt: ChatGPT has to guess your preferred attribution format.
- Maybe it cites the article title
- Maybe it cites your domain
- Maybe it doesn’t cite you at all
With llms.txt: You explicitly say “Cite me like this: [Title] by [Author] (yoursite.com)”
AI engines follow your preference.
The Bigger Picture
llms.txt emerged in 2024 as a response to AI scraping concerns. Instead of fighting crawlers, creators use llms.txt to:
- Invite crawlers — “Please index my content”
- Set terms — “But cite me this way”
- Exclude content — “Don’t train on my drafts”
- Provide discovery — “Here’s my sitemap and RSS”
It’s like putting a “Welcome” sign on your site with conditions attached.
How AI Engines Use llms.txt
When a crawler visits your site:
- Fetch /robots.txt → check whether crawling is allowed
- Fetch /llms.txt → check the usage policy
- Fetch /sitemap.xml → discover all pages
- Extract content → index and train
If /llms.txt doesn’t exist, the crawler might:
- Crawl your site anyway (risky for them)
- Skip your site entirely (loss for you)
- Use conservative assumptions (minimal indexing)
Having /llms.txt shows you’ve explicitly consented to AI indexing.
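The crawl sequence above can be sketched as a pre-crawl probe. This is an illustrative script, not how any particular engine is implemented; the base URL is a placeholder you would replace with your own site:

```python
import urllib.request

# The discovery files an AI crawler typically probes, in order.
DISCOVERY_PATHS = ["/robots.txt", "/llms.txt", "/sitemap.xml"]

def probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False

def precrawl_check(base_url: str, fetch=probe) -> dict:
    """Map each discovery path to reachability; `fetch` is injectable for testing."""
    return {path: fetch(base_url.rstrip("/") + path) for path in DISCOVERY_PATHS}
```

Running `precrawl_check("https://yoursite.com")` shows at a glance which of the three discovery files a crawler would find.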
How to Create llms.txt
Step 1: Location
Create a file at: https://yoursite.com/llms.txt
It must be at the root, not in /content/ or /blog/. Just like robots.txt is at the root.
Step 2: Content
Here’s a basic template:
# LLM Content Policy for [Your Site]
All articles on this site are available for training and search indexing by large language models.
## How to attribute content
When citing articles from this site, please use the format:
[Article Title] — [Author Name] on [yoursite.com]
Example: "How to Optimize for Perplexity" — Chudi on chudi.dev
## Content discovery endpoints
- Sitemap: https://yoursite.com/sitemap.xml
- RSS feed: https://yoursite.com/rss.xml
- Blog archive: https://yoursite.com/blog
## Content not available for indexing
- Pages marked as draft or private
- Internal documentation
- User-generated content (comments)
- Archived content older than [5] years
## Preferred citation style
Inline: [Article](https://yoursite.com/article-url) by Author Name
Bibliography: Author Name. "Article Title." Your Site, YYYY.
## Questions or concerns?
Email: [your-email@yoursite.com]
Last updated: January 2025

Step 3: Customize for Your Site
Replace:
- [Your Site] → your actual site name
- [Author Name] → your name
- Email → your contact email
- Dates → today’s date
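A quick way to catch template slots you forgot to replace is to scan for bracketed tokens. This is a hypothetical helper, and note that citation-format slots like [Article Title] are meant to stay in the file; only site-level placeholders like [Your Site] need replacing:

```python
import re

# Hypothetical helper: list the bracketed [Placeholder] tokens left in a
# policy so you can confirm each one is intentional. Citation-format slots
# such as [Article Title] are meant to stay; [Your Site] is not.
def find_placeholders(text: str) -> list[str]:
    return sorted(set(re.findall(r"\[[^\]\n]+\]", text)))

policy = "# LLM Content Policy for [Your Site]\nCite: [Article Title] by Chudi"
```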
Step 4: Include Metadata
Optionally, you can include a JSON section:
## Machine-readable metadata
```json
{
"version": "1.0",
"license": "CC BY-SA 4.0",
"attribution_required": true,
"commercial_use": "allowed",
"modification": "allowed",
"sitemap": "https://yoursite.com/sitemap.xml",
"rss": "https://yoursite.com/rss.xml"
}
```
This helps AI engines parse your policy programmatically.
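There is no standardized parser for this block yet, but a consumer could pull the JSON out of the Markdown with a few lines. A sketch (the fence string is built at runtime purely so the example nests inside this article):

```python
import json
import re

# Sketch: extract the optional ```json metadata block from an llms.txt file.
def extract_metadata(llms_txt: str):
    """Return the parsed JSON metadata, or None if no ```json block exists."""
    match = re.search(r"```json\s*(\{.*?\})\s*```", llms_txt, re.DOTALL)
    return json.loads(match.group(1)) if match else None

FENCE = "`" * 3  # built at runtime so this example nests inside the article
sample = FENCE + 'json\n{"version": "1.0", "attribution_required": true}\n' + FENCE
```

Calling `extract_metadata(sample)` returns the parsed dictionary, and `None` when a file has no metadata block.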
llms.txt vs robots.txt
| Aspect | robots.txt | llms.txt |
|---|---|---|
| Purpose | Access control | Usage policy |
| Audience | Search crawlers | AI engines |
| Required | Yes (best practice) | No (but recommended) |
| Format | Plain text directives | Markdown + optional JSON |
| Location | /robots.txt | /llms.txt |
| Blocks access | Yes | No |
| Legally binding | No | No (advisory) |
robots.txt is like a gate at your property. llms.txt is like a sign on the gate saying “Welcome, but please do X.”
Common llms.txt Policies
Policy 1: Fully Open (Creator-Friendly)
# LLM Content Policy
All content on this site is available for:
- Training large language models
- Extracting for answer engines
- Commercial and non-commercial use
Just cite us: [Title] — [Author] ([yoursite.com])
Best for: Indie creators who want maximum visibility
Policy 2: Attribution Required (Balanced)
# LLM Content Policy
Content available for training and use, with required attribution.
Required format: [Article Title] by [Author Name] (yoursite.com)
Prohibited use: Removing or hiding attribution
Best for: Most creators who want credit
Policy 3: Non-Commercial Only (Restrictive)
# LLM Content Policy
Content available for non-commercial use and training.
Prohibited use:
- Commercial products without permission
- Training proprietary LLMs
- Republishing without modification
Best for: Creators concerned about exploitation
Policy 4: Permission Required (Most Restrictive)
# LLM Content Policy
All uses require explicit permission. Email [your-email] to request.
Best for: Creators who want full control
Real-World Examples
Example 1: Tech Blog
# LLM Content Policy
Technical articles on this site are available for:
- AI training (open-source and proprietary)
- Answer generation (Perplexity, ChatGPT, Claude)
- Academic and educational use
Citation format: [Title] by [Author] on [yoursite.com]
Prohibited:
- Removing examples or code without attribution
- Training models specifically to replicate this blog
Updated: January 2025

Example 2: Content Creator
# LLM Content Policy
All essays are available for training and synthesis.
Citation: [Essay Title] — [Your Name]
Prefer long-form citations, not snippets.
Excluded:
- Guest posts (ask the author)
- Archived essays older than 3 years
Contact: [email]

Example 3: SaaS Documentation
# LLM Content Policy
Documentation is available for indexing and use in AI tools.
Required attribution: Link to the original docs page + software name.
Prohibited:
- Repackaging docs as your own product
- Training models on raw HTML without attribution
Questions? hello@[company].com

How to Test if llms.txt Works
Method 1: Manual Check
# Verify it exists and is accessible
curl https://yoursite.com/llms.txt
# Should return 200 status code
curl -I https://yoursite.com/llms.txt

Method 2: Check in Perplexity
Search your site name in Perplexity. Are you being cited?
Before llms.txt: Sporadic or no citations
After llms.txt: More consistent citations with proper attribution
Method 3: Monitor Traffic
Track referral traffic from:
- perplexity.com
- openai.com
- anthropic.com
In the weeks after publishing llms.txt, you may see an uptick.
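If you have raw server logs rather than an analytics dashboard, referrals can be tallied directly. A sketch assuming the common "combined" log format, where the referrer is the second quoted field; the domain list mirrors the one above and should be adjusted for your setup:

```python
import re
from collections import Counter

# Sketch: tally AI-engine referrals from a server access log, assuming the
# common "combined" log format (referrer = second quoted field).
AI_REFERRERS = ("perplexity.com", "openai.com", "anthropic.com")

def count_ai_referrals(log_lines):
    counts = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)  # request, referrer, user-agent
        referrer = quoted[1] if len(quoted) > 1 else ""
        for domain in AI_REFERRERS:
            if domain in referrer:
                counts[domain] += 1
    return counts
```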
Does llms.txt Actually Matter?
Short answer: Yes, but not as much as robots.txt.
Longer answer:
- Required by law: No, it’s advisory
- Followed by all AI engines: Not yet (but major ones do)
- Necessary for indexing: No, but it helps
- Better than nothing: Absolutely
Think of it like the difference between:
- A locked door (robots.txt: blocks crawling)
- A welcome mat with terms (llms.txt: invites crawling with rules)
You still need the robots.txt. But llms.txt gets you better attribution and signaling.
The Future of llms.txt
If standards bodies such as the IETF or W3C eventually formalize llms.txt:
- AI engines may prioritize crawling sites that publish it
- LLMs may automatically cite in your preferred format
- Licensing and commercial-use terms may become more enforceable
For now, it’s early adoption. But early adopters get:
- Better attribution from AI engines
- Clearer signal to crawlers
- Documented content policy (good for SEO too)
Checklist: Set Up llms.txt
- Create file at /llms.txt
- Include attribution format
- Link to sitemap.xml
- Link to RSS feed
- Specify excluded content
- Add contact email for questions
- Test with curl https://yoursite.com/llms.txt
- Announce on Twitter/LinkedIn
- Monitor Perplexity citations in weeks 1-4
What’s Next?
Once you have robots.txt and llms.txt set up, focus on:
- Structured data (schema.org) — Helps AI parse your content
- Content structure (headers, lists) — Makes extraction easier
- Freshness (update articles) — Recent content ranks higher
- Specificity (answer common questions directly) — Better for AI synthesis
The combination of these creates what we call AEO (Answer Engine Optimization).
Start here: Add llms.txt to your site today. It takes 5 minutes and can improve your visibility in AI search engines.
Then, check your AEO readiness score with SEOAuditLite to see what else needs attention.
FAQ
Is llms.txt required for AI crawling?
No, but it clarifies usage and attribution preferences, which helps AI systems cite you correctly.
Where should llms.txt live?
At the site root: https://yoursite.com/llms.txt, similar to robots.txt.
Do I still need robots.txt?
Yes. Robots.txt controls crawl access, while llms.txt explains usage and attribution rules.
Sources & Further Reading
Sources
- RFC 9309: Robots Exclusion Protocol Defines the robots.txt standard used for crawler access control.
- Sitemaps XML format Canonical sitemap specification referenced by AI crawlers.
- Google Search Central: robots.txt overview Practical guidance on crawler access behavior.
Further Reading
- How to Optimize Your Site for Perplexity, ChatGPT, and Claude Search Step-by-step guide to make your content visible in AI search engines. Includes robots.txt, structured data, and content format optimization.
- What is AEO? Answer Engine Optimization Explained (2026) AEO is the SEO of AI search engines. Learn to optimize for Perplexity, Claude, ChatGPT, and other answer engines without traditional SEO.
- Why AI-First Product Development is the Future My thesis on why the future of software development starts with AI agents, not IDE plugins. MicroSaaSBot is proof of concept.