
llms.txt Explained: The robots.txt for AI Crawlers

What is llms.txt? Why AI engines scan it. How to set it up correctly. Everything you need to know about this new file type.

Chudi Nnorukam
Jan 17, 2025 · 3 min read


llms.txt is a site-level policy file that tells AI engines how they can use and cite your content. It complements robots.txt by focusing on usage and attribution rather than crawl access. If you want AI systems to cite you correctly, this is the simplest control point.

What is llms.txt?

llms.txt is a proposed standard: a plain-text file that tells AI engines (Perplexity, Claude, ChatGPT) how to handle your content.

  • robots.txt = “Can you crawl my site?” (access control)
  • llms.txt = “How should you use my content?” (usage policy)

Both should exist on your site.
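Since the two files work together, here is one possible robots.txt to pair with an llms.txt. GPTBot, ClaudeBot, and PerplexityBot are the published user agents of the major AI crawlers; the allow-all policy shown is just one choice, not a recommendation:

```text
# robots.txt — access control (who may crawl)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```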


Why llms.txt Matters

The Problem: Content Attribution

When OpenAI’s ChatGPT answers a user’s question, it synthesizes an answer from multiple sources. But how does it cite those sources?

Without llms.txt: ChatGPT has to guess your preferred attribution format.

  • Maybe it cites the article title
  • Maybe it cites your domain
  • Maybe it doesn’t cite you at all

With llms.txt: You explicitly say “Cite me like this: [Title] by [Author] (yoursite.com)”

Compliant AI engines can then follow your preference.

The Bigger Picture

llms.txt emerged in 2024 as a response to AI scraping concerns. Instead of fighting crawlers, creators use llms.txt to:

  1. Invite crawlers — “Please index my content”
  2. Set terms — “But cite me this way”
  3. Exclude content — “Don’t train on my drafts”
  4. Provide discovery — “Here’s my sitemap and RSS”

It’s like putting a “Welcome” sign on your site with conditions attached.


How AI Engines Use llms.txt

When a crawler visits your site:

  1. Fetch /robots.txt → Check if allowed to crawl
  2. Fetch /llms.txt → Check usage policy
  3. Fetch /sitemap.xml → Discover all pages
  4. Extract content → Index and train

If /llms.txt doesn’t exist, the crawler might:

  • Crawl your site anyway (risky for them)
  • Skip your site entirely (loss for you)
  • Use conservative assumptions (minimal indexing)

Having /llms.txt shows you’ve explicitly consented to AI indexing.
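The fetch-and-decide sequence above can be sketched as simple logic. This is an illustrative model only; the function name, return shape, and conservative-default behavior are assumptions, and real crawlers are far more involved:

```python
# Illustrative model of the crawl sequence described above.
# All names here are hypothetical, not part of any real crawler.

def plan_crawl(robots_txt, llms_txt):
    """Decide how to treat a site given robots.txt and llms.txt contents (str or None)."""
    # Step 1: robots.txt gates access. A blanket "Disallow: /" blocks everything.
    if robots_txt and "Disallow: /" in [line.strip() for line in robots_txt.splitlines()]:
        return {"crawl": False, "usage": None}
    # Step 2: llms.txt sets usage terms; a missing file means conservative defaults.
    if llms_txt is None:
        return {"crawl": True, "usage": "conservative defaults"}
    # Step 3: with both files present, index per the stated policy.
    return {"crawl": True, "usage": "per llms.txt policy"}

print(plan_crawl("User-agent: *\nDisallow: /", None))   # {'crawl': False, 'usage': None}
print(plan_crawl("User-agent: *\nAllow: /", None))      # {'crawl': True, 'usage': 'conservative defaults'}
print(plan_crawl("User-agent: *\nAllow: /", "# LLM Content Policy"))
```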


How to Create llms.txt

Step 1: Location

Create a file at: https://yoursite.com/llms.txt

The file must live at the site root, not in a subdirectory such as /content/ or /blog/, just as robots.txt does.

Step 2: Content

Here’s a basic template:

```text
# LLM Content Policy for [Your Site]

All articles on this site are available for training and search indexing by large language models.

## How to attribute content

When citing articles from this site, please use the format:

[Article Title] — [Author Name] on [yoursite.com]

Example: "How to Optimize for Perplexity" — Chudi on chudi.dev

## Content discovery endpoints

- Sitemap: https://yoursite.com/sitemap.xml
- RSS feed: https://yoursite.com/rss.xml
- Blog archive: https://yoursite.com/blog

## Content not available for indexing

- Pages marked as draft or private
- Internal documentation
- User-generated content (comments)
- Archived content older than [5] years

## Preferred citation style

Inline: [Article](https://yoursite.com/article-url) by Author Name
Bibliography: Author Name. "Article Title." Your Site, YYYY.

## Questions or concerns?

Email: [your-email@yoursite.com]
Last updated: January 2025
```

Step 3: Customize for Your Site

Replace:

  • [Your Site] → your actual site name
  • [Author Name] → your name
  • Email → your contact email
  • Dates → today’s date

Step 4: Include Metadata

Optionally, you can include a JSON section:

## Machine-readable metadata

```json
{
  "version": "1.0",
  "license": "CC BY-SA 4.0",
  "attribution_required": true,
  "commercial_use": "allowed",
  "modification": "allowed",
  "sitemap": "https://yoursite.com/sitemap.xml",
  "rss": "https://yoursite.com/rss.xml"
}
```

This helps AI engines parse your policy programmatically.
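As a sketch of how an engine might consume that section, the snippet below pulls the fenced JSON block out of an llms.txt file and parses it. The extraction approach (a regex over the fence markers) is an assumption for illustration, not a documented parser:

```python
import json
import re

# Build a sample llms.txt containing the metadata section shown above.
# FENCE avoids writing literal triple backticks inside this example.
FENCE = "`" * 3

llms_txt = (
    "# LLM Content Policy for Example Site\n"
    "\n"
    "## Machine-readable metadata\n"
    "\n"
    + FENCE + "json\n"
    + '{"version": "1.0", "license": "CC BY-SA 4.0", "attribution_required": true}\n'
    + FENCE + "\n"
)

def extract_metadata(text):
    """Return the first fenced JSON block in an llms.txt file, parsed, or None."""
    pattern = FENCE + r"json\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return json.loads(match.group(1)) if match else None

meta = extract_metadata(llms_txt)
print(meta["license"])               # CC BY-SA 4.0
print(meta["attribution_required"])  # True
```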


llms.txt vs. robots.txt

| Aspect | robots.txt | llms.txt |
| --- | --- | --- |
| Purpose | Access control | Usage policy |
| Audience | Search crawlers | AI engines |
| Required | Yes (best practice) | No (but recommended) |
| Format | Plain text directives | Markdown + optional JSON |
| Location | /robots.txt | /llms.txt |
| Blocks access | Yes | No |
| Legally binding | No | No (advisory) |

robots.txt is like a gate at your property. llms.txt is like a sign on the gate saying “Welcome, but please do X.”


Common llms.txt Policies

Policy 1: Fully Open (Creator-Friendly)

```text
# LLM Content Policy

All content on this site is available for:
- Training large language models
- Extracting for answer engines
- Commercial and non-commercial use

Just cite us: [Title] — [Author] ([yoursite.com])
```

Best for: Indie creators who want maximum visibility

Policy 2: Attribution Required (Balanced)

```text
# LLM Content Policy

Content available for training and use, with required attribution.

Required format: [Article Title] by [Author Name] (yoursite.com)

Prohibited use: Removing or hiding attribution
```

Best for: Most creators who want credit

Policy 3: Non-Commercial Only (Restrictive)

```text
# LLM Content Policy

Content available for non-commercial use and training.

Prohibited use:
- Commercial products without permission
- Training proprietary LLMs
- Republishing without modification
```

Best for: Creators concerned about exploitation

Policy 4: Permission Required (Most Restrictive)

```text
# LLM Content Policy

All uses require explicit permission. Email [your-email] to request.
```

Best for: Creators who want full control


Real-World Examples

Example 1: Tech Blog

```text
# LLM Content Policy

Technical articles on this site are available for:
- AI training (open-source and proprietary)
- Answer generation (Perplexity, ChatGPT, Claude)
- Academic and educational use

Citation format: [Title] by [Author] on [yoursite.com]

Prohibited:
- Removing examples or code without attribution
- Training models specifically to replicate this blog

Updated: January 2025
```

Example 2: Content Creator

```text
# LLM Content Policy

All essays are available for training and synthesis.

Citation: [Essay Title] — [Your Name]

Prefer long-form citations, not snippets.

Excluded:
- Guest posts (ask the author)
- Archived essays older than 3 years

Contact: [email]
```

Example 3: SaaS Documentation

```text
# LLM Content Policy

Documentation is available for indexing and use in AI tools.

Required attribution: Link to the original docs page + software name.

Prohibited:
- Repackaging docs as your own product
- Training models on raw HTML without attribution

Questions? hello@[company].com
```

How to Test if llms.txt Works

Method 1: Manual Check

```shell
# Verify it exists and is accessible
curl https://yoursite.com/llms.txt

# Should return 200 status code
curl -I https://yoursite.com/llms.txt
```

Method 2: Check in Perplexity

Search your site name in Perplexity. Are you being cited?

  • Before llms.txt: sporadic or no citations
  • After llms.txt: more consistent citations with proper attribution

Method 3: Monitor Traffic

Track referral traffic from:

  • perplexity.com
  • openai.com
  • anthropic.com

In the weeks after publishing llms.txt, you may see an uptick.


Does llms.txt Actually Matter?

Short answer: Yes, but not as much as robots.txt.

Longer answer:

  • Required by law: No, it’s advisory
  • Followed by all AI engines: Not yet — adoption is still early and uneven
  • Necessary for indexing: No, but it helps
  • Better than nothing: Absolutely

Think of it like the difference between:

  • A locked door (robots.txt: blocks crawling)
  • A welcome mat with terms (llms.txt: invites crawling with rules)

You still need the robots.txt. But llms.txt gets you better attribution and signaling.


The Future of llms.txt

There is growing community interest in formalizing llms.txt, though no standards body has adopted it yet. If it becomes more official:

  1. AI engines may prioritize crawling sites with llms.txt
  2. LLMs could automatically cite in your preferred format
  3. Licensing and commercial terms may become easier to enforce

For now, it’s early adoption. But early adopters get:

  • Better attribution from AI engines
  • Clearer signal to crawlers
  • Documented content policy (good for SEO too)

Checklist: Set Up llms.txt

  • Create file at /llms.txt
  • Include attribution format
  • Link to sitemap.xml
  • Link to RSS feed
  • Specify excluded content
  • Add contact email for questions
  • Test with curl https://yoursite.com/llms.txt
  • Announce on Twitter/LinkedIn
  • Monitor Perplexity citations week 1-4
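The content items on the checklist above can be rough-checked programmatically. This is a heuristic sketch: the substring markers are assumptions of mine, not part of any spec, and a passing check only means the keywords appear somewhere in the file:

```python
# Rough self-check for the llms.txt checklist above.
# The required markers are simple heuristics, not a formal specification.

REQUIRED_MARKERS = {
    "attribution format": "attribut",
    "sitemap link": "sitemap.xml",
    "RSS link": "rss",
    "contact email": "@",
}

def check_llms_txt(text):
    """Return the checklist items that appear to be missing from the file."""
    lower = text.lower()
    return [item for item, marker in REQUIRED_MARKERS.items() if marker not in lower]

sample = """# LLM Content Policy
Attribution: [Title] by [Author] (yoursite.com)
Sitemap: https://yoursite.com/sitemap.xml
RSS: https://yoursite.com/rss.xml
Contact: hello@yoursite.com
"""
print(check_llms_txt(sample))  # [] means nothing obviously missing
```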

What’s Next?

Once you have robots.txt and llms.txt set up, focus on:

  1. Structured data (schema.org) — Helps AI parse your content
  2. Content structure (headers, lists) — Makes extraction easier
  3. Freshness (update articles) — Recent content ranks higher
  4. Specificity (answer common questions directly) — Better for AI synthesis

The combination of these creates what we call AEO (Answer Engine Optimization).

Start here: Add llms.txt to your site today. It takes 5 minutes and can improve your visibility in AI search engines.

Then, check your AEO readiness score with SEOAuditLite to see what else needs attention.

FAQ

Is llms.txt required for AI crawling?

No, but it clarifies usage and attribution preferences, which helps AI systems cite you correctly.

Where should llms.txt live?

At the site root: https://yoursite.com/llms.txt, similar to robots.txt.

Do I still need robots.txt?

Yes. Robots.txt controls crawl access, while llms.txt explains usage and attribution rules.
