
llms.txt Explained: The robots.txt for AI Crawlers

What is llms.txt? Why AI engines scan it. How to set it up correctly. Everything you need to know about this new file type.

Chudi Nnorukam
Jan 17, 2025 · 3 min read


llms.txt is a site-level policy file that tells AI engines how they can use and cite your content. It complements robots.txt by focusing on usage and attribution rather than crawl access. If you want AI systems to cite you correctly, this is the simplest control point.

What is llms.txt?

llms.txt is a proposed standard: a plain-text file that tells AI engines (Perplexity, Claude, ChatGPT) how to handle your content.

  • robots.txt = “Can you crawl my site?” (access control)
  • llms.txt = “How should you use my content?” (usage policy)

Both should exist on your site.
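Since the two files work together, here is one possible robots.txt to pair with an llms.txt. GPTBot, ClaudeBot, and PerplexityBot are the published user agents of the major AI crawlers; the allow-all policy shown is just one choice, not a recommendation:

```text
# robots.txt — access control (who may crawl)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```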


Why llms.txt Matters

The Problem: Content Attribution

When OpenAI’s ChatGPT answers a user’s question, it synthesizes an answer from multiple sources. But how does it cite those sources?

Without llms.txt: ChatGPT has to guess your preferred attribution format.

  • Maybe it cites the article title
  • Maybe it cites your domain
  • Maybe it doesn’t cite you at all

With llms.txt: You explicitly say “Cite me like this: [Title] by [Author] (yoursite.com)”

Compliant AI engines can then follow your preference.

The Bigger Picture

llms.txt emerged in 2024 as a response to AI scraping concerns. Instead of fighting crawlers, creators use llms.txt to:

  1. Invite crawlers — “Please index my content”
  2. Set terms — “But cite me this way”
  3. Exclude content — “Don’t train on my drafts”
  4. Provide discovery — “Here’s my sitemap and RSS”

It’s like putting a “Welcome” sign on your site with conditions attached.


How AI Engines Use llms.txt

When a crawler visits your site:

  1. Fetch /robots.txt → Check if allowed to crawl
  2. Fetch /llms.txt → Check usage policy
  3. Fetch /sitemap.xml → Discover all pages
  4. Extract content → Index and train

If /llms.txt doesn’t exist, the crawler might:

  • Crawl your site anyway (risky for them)
  • Skip your site entirely (loss for you)
  • Use conservative assumptions (minimal indexing)

Having /llms.txt shows you’ve explicitly consented to AI indexing.
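The fetch-and-decide sequence above can be sketched as simple logic. This is an illustrative model only; the function name, return shape, and conservative-default behavior are assumptions, and real crawlers are far more involved:

```python
# Illustrative model of the crawl sequence described above.
# All names here are hypothetical, not part of any real crawler.

def plan_crawl(robots_txt, llms_txt):
    """Decide how to treat a site given robots.txt and llms.txt contents (str or None)."""
    # Step 1: robots.txt gates access. A blanket "Disallow: /" blocks everything.
    if robots_txt and "Disallow: /" in [line.strip() for line in robots_txt.splitlines()]:
        return {"crawl": False, "usage": None}
    # Step 2: llms.txt sets usage terms; a missing file means conservative defaults.
    if llms_txt is None:
        return {"crawl": True, "usage": "conservative defaults"}
    # Step 3: with both files present, index per the stated policy.
    return {"crawl": True, "usage": "per llms.txt policy"}

print(plan_crawl("User-agent: *\nDisallow: /", None))   # {'crawl': False, 'usage': None}
print(plan_crawl("User-agent: *\nAllow: /", None))      # {'crawl': True, 'usage': 'conservative defaults'}
print(plan_crawl("User-agent: *\nAllow: /", "# LLM Content Policy"))
```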


How to Create llms.txt

Step 1: Location

Create a file at: https://yoursite.com/llms.txt

The file must live at the site root, not in a subdirectory such as /content/ or /blog/, just as robots.txt does.

Step 2: Content

Here’s a basic template:

```text
# LLM Content Policy for [Your Site]

All articles on this site are available for training and search indexing by large language models.

## How to attribute content

When citing articles from this site, please use the format:

[Article Title] — [Author Name] on [yoursite.com]

Example: "How to Optimize for Perplexity" — Chudi on chudi.dev

## Content discovery endpoints

- Sitemap: https://yoursite.com/sitemap.xml
- RSS feed: https://yoursite.com/rss.xml
- Blog archive: https://yoursite.com/blog

## Content not available for indexing

- Pages marked as draft or private
- Internal documentation
- User-generated content (comments)
- Archived content older than [5] years

## Preferred citation style

Inline: [Article](https://yoursite.com/article-url) by Author Name
Bibliography: Author Name. "Article Title." Your Site, YYYY.

## Questions or concerns?

Email: [your-email@yoursite.com]
Last updated: January 2025
```

Step 3: Customize for Your Site

Replace:

  • [Your Site] → your actual site name
  • [Author Name] → your name
  • Email → your contact email
  • Dates → today’s date

Step 4: Include Metadata

Optionally, you can include a JSON section:

## Machine-readable metadata

```json
{
  "version": "1.0",
  "license": "CC BY-SA 4.0",
  "attribution_required": true,
  "commercial_use": "allowed",
  "modification": "allowed",
  "sitemap": "https://yoursite.com/sitemap.xml",
  "rss": "https://yoursite.com/rss.xml"
}
```

This helps AI engines parse your policy programmatically.
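As a sketch of how an engine might consume that section, the snippet below pulls the fenced JSON block out of an llms.txt file and parses it. The extraction approach (a regex over the fence markers) is an assumption for illustration, not a documented parser:

```python
import json
import re

# Build a sample llms.txt containing the metadata section shown above.
# FENCE avoids writing literal triple backticks inside this example.
FENCE = "`" * 3

llms_txt = (
    "# LLM Content Policy for Example Site\n"
    "\n"
    "## Machine-readable metadata\n"
    "\n"
    + FENCE + "json\n"
    + '{"version": "1.0", "license": "CC BY-SA 4.0", "attribution_required": true}\n'
    + FENCE + "\n"
)

def extract_metadata(text):
    """Return the first fenced JSON block in an llms.txt file, parsed, or None."""
    pattern = FENCE + r"json\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return json.loads(match.group(1)) if match else None

meta = extract_metadata(llms_txt)
print(meta["license"])               # CC BY-SA 4.0
print(meta["attribution_required"])  # True
```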


llms.txt vs. robots.txt

| Aspect | robots.txt | llms.txt |
| --- | --- | --- |
| Purpose | Access control | Usage policy |
| Audience | Search crawlers | AI engines |
| Required | Yes (best practice) | No (but recommended) |
| Format | Plain text directives | Markdown + optional JSON |
| Location | /robots.txt | /llms.txt |
| Blocks access | Yes | No |
| Legally binding | No | No (advisory) |

robots.txt is like a gate at your property. llms.txt is like a sign on the gate saying “Welcome, but please do X.”


Common llms.txt Policies

Policy 1: Fully Open (Creator-Friendly)

```text
# LLM Content Policy

All content on this site is available for:
- Training large language models
- Extracting for answer engines
- Commercial and non-commercial use

Just cite us: [Title] — [Author] ([yoursite.com])
```

Best for: Indie creators who want maximum visibility

Policy 2: Attribution Required (Balanced)

```text
# LLM Content Policy

Content available for training and use, with required attribution.

Required format: [Article Title] by [Author Name] (yoursite.com)

Prohibited use: Removing or hiding attribution
```

Best for: Most creators who want credit

Policy 3: Non-Commercial Only (Restrictive)

```text
# LLM Content Policy

Content available for non-commercial use and training.

Prohibited use:
- Commercial products without permission
- Training proprietary LLMs
- Republishing without modification
```

Best for: Creators concerned about exploitation

Policy 4: Permission Required (Most Restrictive)

```text
# LLM Content Policy

All uses require explicit permission. Email [your-email] to request.
```

Best for: Creators who want full control


Real-World Examples

Example 1: Tech Blog

```text
# LLM Content Policy

Technical articles on this site are available for:
- AI training (open-source and proprietary)
- Answer generation (Perplexity, ChatGPT, Claude)
- Academic and educational use

Citation format: [Title] by [Author] on [yoursite.com]

Prohibited:
- Removing examples or code without attribution
- Training models specifically to replicate this blog

Updated: January 2025
```

Example 2: Content Creator

```text
# LLM Content Policy

All essays are available for training and synthesis.

Citation: [Essay Title] — [Your Name]

Prefer long-form citations, not snippets.

Excluded:
- Guest posts (ask the author)
- Archived essays older than 3 years

Contact: [email]
```

Example 3: SaaS Documentation

```text
# LLM Content Policy

Documentation is available for indexing and use in AI tools.

Required attribution: Link to the original docs page + software name.

Prohibited:
- Repackaging docs as your own product
- Training models on raw HTML without attribution

Questions? hello@[company].com
```

How to Test if llms.txt Works

Method 1: Manual Check

```shell
# Verify it exists and is accessible
curl https://yoursite.com/llms.txt

# Should return 200 status code
curl -I https://yoursite.com/llms.txt
```

Method 2: Check in Perplexity

Search your site name in Perplexity. Are you being cited?

  • Before llms.txt: sporadic or no citations
  • After llms.txt: more consistent citations with proper attribution

Method 3: Monitor Traffic

Track referral traffic from:

  • perplexity.com
  • openai.com
  • anthropic.com

In the weeks after publishing llms.txt, you may see an uptick.


Does llms.txt Actually Matter?

Short answer: Yes, but not as much as robots.txt.

Longer answer:

  • Required by law: No, it’s advisory
  • Followed by all AI engines: Not yet — adoption is still early and uneven
  • Necessary for indexing: No, but it helps
  • Better than nothing: Absolutely

Think of it like the difference between:

  • A locked door (robots.txt: blocks crawling)
  • A welcome mat with terms (llms.txt: invites crawling with rules)

You still need the robots.txt. But llms.txt gets you better attribution and signaling.


The Future of llms.txt

There is growing community interest in formalizing llms.txt, though no standards body has adopted it yet. If it becomes more official:

  1. AI engines may prioritize crawling sites with llms.txt
  2. LLMs could automatically cite in your preferred format
  3. Licensing and commercial terms may become easier to enforce

For now, it’s early adoption. But early adopters get:

  • Better attribution from AI engines
  • Clearer signal to crawlers
  • Documented content policy (good for SEO too)

Checklist: Set Up llms.txt

  • Create file at /llms.txt
  • Include attribution format
  • Link to sitemap.xml
  • Link to RSS feed
  • Specify excluded content
  • Add contact email for questions
  • Test with curl https://yoursite.com/llms.txt
  • Announce on Twitter/LinkedIn
  • Monitor Perplexity citations week 1-4
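The content items on the checklist above can be rough-checked programmatically. This is a heuristic sketch: the substring markers are assumptions of mine, not part of any spec, and a passing check only means the keywords appear somewhere in the file:

```python
# Rough self-check for the llms.txt checklist above.
# The required markers are simple heuristics, not a formal specification.

REQUIRED_MARKERS = {
    "attribution format": "attribut",
    "sitemap link": "sitemap.xml",
    "RSS link": "rss",
    "contact email": "@",
}

def check_llms_txt(text):
    """Return the checklist items that appear to be missing from the file."""
    lower = text.lower()
    return [item for item, marker in REQUIRED_MARKERS.items() if marker not in lower]

sample = """# LLM Content Policy
Attribution: [Title] by [Author] (yoursite.com)
Sitemap: https://yoursite.com/sitemap.xml
RSS: https://yoursite.com/rss.xml
Contact: hello@yoursite.com
"""
print(check_llms_txt(sample))  # [] means nothing obviously missing
```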

What’s Next?

Once you have robots.txt and llms.txt set up, focus on:

  1. Structured data (schema.org) — Helps AI parse your content
  2. Content structure (headers, lists) — Makes extraction easier
  3. Freshness (update articles) — Recent content ranks higher
  4. Specificity (answer common questions directly) — Better for AI synthesis

The combination of these creates what we call AEO (Answer Engine Optimization).

Start here: Add llms.txt to your site today. It takes 5 minutes and can improve your visibility in AI search engines.

Then, check your AEO readiness score with SEOAuditLite to see what else needs attention.

FAQ

Is llms.txt required for AI crawling?

No, but it clarifies usage and attribution preferences, which helps AI systems cite you correctly.

Where should llms.txt live?

At the site root: https://yoursite.com/llms.txt, similar to robots.txt.

Do I still need robots.txt?

Yes. Robots.txt controls crawl access, while llms.txt explains usage and attribution rules.
