Content Intent Signaling: The robots.txt Directive That Controls How AI Uses Your Content
robots.txt controls access. Content Intent Signaling controls usage. Three new directives separate training from citation permission.
Why this matters
Content Intent Signaling adds three robots.txt directives (ai-train, search, ai-input) that separate access permission from usage permission. The search directive signals citation eligibility. Zero percent of top 200K domains have implemented it.
On May 23, 2026, a misconfigured Vercel rewrite poisoned my CDN cache and served raw markdown instead of HTML to every visitor for 24 hours. My Fact-Block audit score dropped to 20/100. The root cause: a vercel.json rewrite meant to serve markdown to AI crawlers was firing on ALL requests, not just those with Accept: text/markdown. My robots.txt said “allow.” But it had no way to say “allow crawling, allow citation, but do NOT rewrite my HTML to markdown for every request.” That gap between access permission and usage permission cost me a day of broken pages and a full remediation arc across 5 URLs.
Three days later, I added three lines to my robots.txt. Zero percent of the Cloudflare Radar top 200,000 domains have these directives. This post is the reference implementation.
TL;DR
Content Intent Signaling separates ACCESS permission (Allow/Disallow in robots.txt) from USAGE permission. Three new directives give you granular control: ai-train for model training, search for citation eligibility, and ai-input for query context. Without them, blocking a crawler to prevent training also blocks citation. With them, you choose exactly what AI systems may do with your content.
The problem: robots.txt is a blunt instrument
You have a robots.txt. You probably even have AI-specific rules for GPTBot, ClaudeBot, and PerplexityBot. But here is the question nobody is asking: does your robots.txt distinguish between “you may crawl this” and “you may cite this in your responses”?
Right now, 79% of the top 200,000 domains have AI-specific rules in robots.txt, according to Cloudflare Radar data from May 2026. Most of those rules are binary: Allow: / or Disallow: /. The problem is that Allow gives blanket permission for everything: training, citation, query context, summarization. Disallow blocks everything.
If you block GPTBot to prevent your content from training the next model, you also block it from citing you in ChatGPT responses. If you allow ClaudeBot to crawl your site, you have no way to say “cite me but do not train on me.” The control is all or nothing.
This is the gap Content Intent Signaling closes.
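To see the all-or-nothing behavior concretely, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content and the example.com URL are hypothetical. The point is that the classic protocol can only answer one question, "may this agent fetch this URL?", and has no vocabulary for what happens to the content afterwards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks GPTBot in order to prevent training.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Fetch permission is the only thing Allow/Disallow can express.
gptbot_can_fetch = parser.can_fetch("GPTBot", "https://example.com/post")
other_can_fetch = parser.can_fetch("SomeOtherBot", "https://example.com/post")

print(gptbot_can_fetch)  # False: no fetch, so no training and no citation either
print(other_can_fetch)   # True: fetch allowed, with no limit on how the content is used
```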
Why Allow and Disallow are not enough
The robots.txt specification was designed in 1994 for a world with one kind of crawler doing one thing: indexing pages for search results. Thirty-two years later, AI crawlers do at least three distinct things with the content they fetch:
- Model training: ingesting content into training datasets for the next model version
- Citation/search: surfacing content as a source in AI-generated responses
- Query context: using content as real-time context to answer specific user questions
These are genuinely different operations with different value propositions for site owners. A personal blog might welcome citation (free traffic when ChatGPT links to your post) but reject training (your writing style becoming part of a model). A SaaS documentation site might welcome query context (users getting accurate answers about your product) but reject citation in competitor comparisons.
Allow/Disallow cannot express these distinctions. Content Intent Signaling can.
The three directives explained
Content Intent Signaling introduces three new robots.txt directives that sit alongside existing Allow/Disallow rules. Each controls one dimension of AI content usage:
ai-train: model training permission
ai-train: allow
Controls whether AI systems may use your content for model training. Setting this to disallow prevents training while leaving other uses intact. This is the directive most publishers want: they want citation credit without contributing to the training corpus.
search: citation eligibility
search: allow
This is the load-bearing directive for AI visibility. It signals whether AI systems may include your content in search and citation results. A site that sets search: allow explicitly declares: “I want to be cited.” A site that omits it leaves citation eligibility ambiguous.
For any site tracking AI citations (which is what citability.dev measures), this directive is the explicit intent signal that separates “the crawler happened to find me” from “I want to be found and cited.”
ai-input: query context permission
ai-input: allow
Controls whether AI systems may use your content as real-time context when answering user queries. This is distinct from citation: a system might use your documentation as context to generate an answer without directly citing you. A SaaS documentation site benefits from ai-input: allow because users asking “how do I configure X in Product Y” get accurate answers grounded in your docs. The distinction matters: citation puts your URL in the response, while ai-input uses your content invisibly as context. Most sites want both, but the separation lets you choose.
How to implement (chudi.dev as reference)
Implementation is three lines in robots.txt under the wildcard user-agent. Here is the exact configuration deployed on chudi.dev:
# Content Intent Signaling (AVR v1.2.0 section 1.4)
# Separates ACCESS permission from USAGE permission.
User-agent: *
ai-train: allow
search: allow
ai-input: allow
These lines go after your existing Allow/Disallow rules and AI crawler permissions. The wildcard user-agent means all crawlers inherit these directives. You can also set per-agent overrides:
User-agent: GPTBot
ai-train: disallow
search: allow
ai-input: allow
This configuration tells GPTBot: “You may cite my content and use it as query context, but you may not use it for training.” This is the granularity that was missing.
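The wildcard inheritance and per-agent override described above can be sketched as a small parser. This is an illustration, not how any particular crawler works. The function names `parse_intent_directives` and `resolve` are hypothetical, and the merge rule (agent-specific values win, everything else falls back to the wildcard group) is my assumption about how "inherit" should behave.

```python
INTENT_DIRECTIVES = ("ai-train", "search", "ai-input")


def parse_intent_directives(robots_txt: str) -> dict[str, dict[str, str]]:
    """Map each lowercased user-agent to the intent directives in its group."""
    groups: dict[str, dict[str, str]] = {}
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen a non-User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), {})
        else:
            in_rules = True
            if key in INTENT_DIRECTIVES:
                for agent in current_agents:
                    groups[agent][key] = value.lower()
    return groups


def resolve(groups: dict[str, dict[str, str]], agent: str) -> dict[str, str]:
    """Assumed merge rule: start from the wildcard group, then apply overrides."""
    resolved = dict(groups.get("*", {}))
    resolved.update(groups.get(agent.lower(), {}))
    return resolved


ROBOTS_TXT = """\
# Content Intent Signaling
User-agent: *
ai-train: allow
search: allow
ai-input: allow

User-agent: GPTBot
ai-train: disallow
search: allow
ai-input: allow
"""

groups = parse_intent_directives(ROBOTS_TXT)
print(resolve(groups, "GPTBot"))     # {'ai-train': 'disallow', 'search': 'allow', 'ai-input': 'allow'}
print(resolve(groups, "ClaudeBot"))  # {'ai-train': 'allow', 'search': 'allow', 'ai-input': 'allow'}
```

Any agent without its own group, like ClaudeBot here, ends up with the wildcard values under this merge rule.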
Verification steps:
- Deploy the updated robots.txt
- Fetch with cache-bust: curl "https://yoursite.com/robots.txt?v=$(date +%s)"
- Confirm the three directives appear in the response
- Run the citability.dev free scan to verify the Content Intent Signaling check passes
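If you would rather script the fetch-and-confirm steps than eyeball curl output, here is a rough Python equivalent using only the standard library. The helper names `missing_directives` and `fetch_robots` are hypothetical, and the check only confirms that each directive appears as a `name: value` line. It says nothing about whether any crawler honors it.

```python
import time
import urllib.request

REQUIRED = ("ai-train", "search", "ai-input")


def missing_directives(robots_txt: str) -> list[str]:
    """Return the intent directives that never appear as a `name: value` line."""
    found = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        name, sep, _value = line.partition(":")
        if sep and name.strip().lower() in REQUIRED:
            found.add(name.strip().lower())
    return [directive for directive in REQUIRED if directive not in found]


def fetch_robots(site: str) -> str:
    """Fetch robots.txt with a cache-busting query string, like the curl step."""
    url = f"{site.rstrip('/')}/robots.txt?v={int(time.time())}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline example; against a live site you would call
# missing_directives(fetch_robots("https://yoursite.com")) instead.
sample = "User-agent: *\nai-train: allow\nsearch: allow\n"
print(missing_directives(sample))  # ['ai-input']
```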
How we verify it works (the citability.dev scan check)
The citability.dev free scan now includes a 16th check: Content Intent Signaling. The check parses robots.txt for ai-train, search, and ai-input values and verifies the search directive explicitly allows citation.
The AVR framework’s Python audit script (section_content_intent_signaling.py) runs four checks:
| Check | What it tests | Pass condition |
|---|---|---|
| S1 | robots.txt accessible | HTTP 200 |
| S2 | Content intent directives present | Any of ai-train, search, ai-input found |
| S3 | search directive allows citation | search value is allow, yes, true, or all |
| S4 | Directive coverage across AI agents | Wildcard (*) has directives OR 50%+ of major AI agents covered |
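The pass conditions in the table can be approximated in a few lines of Python. This is a sketch written from the table, not the contents of section_content_intent_signaling.py. Three things are assumptions: the `MAJOR_AI_AGENTS` list (the post does not enumerate it, so I use the three crawlers named earlier), the rule that S3 passes when any group's search value allows citation, and the input shape (an HTTP status code plus a mapping of lowercased user-agent to its directives).

```python
ALLOW_VALUES = {"allow", "yes", "true", "all"}
INTENT_DIRECTIVES = {"ai-train", "search", "ai-input"}
# Assumption: the real audit's agent list is not published in the post.
MAJOR_AI_AGENTS = ("gptbot", "claudebot", "perplexitybot")


def run_checks(status_code: int, groups: dict[str, dict[str, str]]) -> dict[str, bool]:
    """Evaluate S1-S4 for an already-parsed robots.txt."""
    # Keep only groups that declare at least one intent directive.
    declared = {}
    for agent, directives in groups.items():
        intent = {k: v for k, v in directives.items() if k in INTENT_DIRECTIVES}
        if intent:
            declared[agent] = intent

    s1 = status_code == 200
    s2 = bool(declared)
    # Assumption: one group with an allowing search value is enough for S3.
    s3 = any(d.get("search", "").lower() in ALLOW_VALUES for d in declared.values())
    covered = sum(1 for agent in MAJOR_AI_AGENTS if agent in declared)
    s4 = "*" in declared or covered / len(MAJOR_AI_AGENTS) >= 0.5
    return {"S1": s1, "S2": s2, "S3": s3, "S4": s4}


wildcard_site = {"*": {"ai-train": "allow", "search": "allow", "ai-input": "allow"}}
partial_site = {"gptbot": {"ai-train": "disallow"}}

print(run_checks(200, wildcard_site))  # {'S1': True, 'S2': True, 'S3': True, 'S4': True}
print(run_checks(200, partial_site))   # {'S1': True, 'S2': True, 'S3': False, 'S4': False}
```

The second example fails S4 because only one of the three assumed agents is covered and there is no wildcard group.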
chudi.dev’s audit result: INTENT-SIGNALED (4/4 checks pass). The competitive benchmark tells the story:
| Site | Citability checks passed | Content Intent Signaling |
|---|---|---|
| chudi.dev | 16/16 | INTENT-SIGNALED (4/4) |
| conductor.com | 13/16 | INTENT-ABSENT |
| semrush.com | 12/16 | INTENT-ABSENT |
| brightedge.com | 9/16 | INTENT-ABSENT |
Zero of the three enterprise SEO platforms have implemented Content Intent Signaling. Zero percent of the Cloudflare Radar top 200,000 domains have these directives. chudi.dev is the first in the AVR methodology to ship it.
The adoption gap is the opportunity. When ChatGPT, Claude, or Perplexity starts respecting the search directive (and the IETF proposal gives them the specification to do so), sites that already signal citation intent will have months of crawl history establishing their preference. Sites that wait will be starting from zero. This is the same dynamic that played out with llms.txt: early adopters got indexed and recognized before the protocol was widely understood.
Here is the full AVR audit after Content Intent Signaling went live (May 26, 2026):
| AVR Section | Verdict | Score |
|---|---|---|
| SEO Foundation | PASS | all checks |
| AI Infrastructure | PASS | all checks |
| Agent Readiness (WebMCP) | AGENT-READY | 2/3 |
| Fact-Block Density | EXTRACTABLE | 100/100 |
| Bot Response Code | ACCESS-OPEN | 4/4 bots 200 |
| Markdown Negotiation | MARKDOWN-READY | 93% payload reduction |
| AI Rules in robots.txt | AI-RULES-COMPLETE | 3/3 |
| Agent Readiness Tier | AGENT-TIER-HIGH | 4/4 |
| Crawl Signal | CRAWL-ACCESSIBLE | 3/3 |
| Content Intent Signaling | INTENT-SIGNALED | 4/4 |
The Content Intent Signaling check joined the audit as the 14th section. All three directives resolved correctly under the wildcard user-agent. Whether this explicit signal correlates with higher citation rates is the empirical question; the first measurement window opens 30 days after implementation (June 26, 2026).
The strategic context
Content Intent Signaling is part of a broader protocol stack that Suganthan Mohanadass mapped in his Layer 2 research. His work documents the INPUT side: which protocols to implement for AI discoverability. The AVR framework (which citability.dev implements as a SaaS audit) measures the OUTPUT side: whether implementing those protocols actually produced citations.
The moat insight from the gap analysis: Suganthan tells you which protocols to implement. citability.dev tells you whether implementing them actually got you cited. Content Intent Signaling sits at the intersection: it is both a protocol to implement (INPUT) and a measurable signal that can be audited (OUTPUT). The search: allow directive is the explicit declaration of citation intent, and the citability.dev scan verifies it exists.
FAQ: Content Intent Signaling
Content Intent Signaling is an emerging protocol that separates access permission from usage permission in robots.txt. These five questions cover the implementation status, crawler behavior, recommended configurations, and how it relates to other AI surface files like llms.txt.
Is Content Intent Signaling an official standard?
It is an emerging IETF proposal, not a ratified standard. The directive names (ai-train, search, ai-input) may evolve as the specification matures. But the underlying concept of separating access permission from usage permission is architecturally sound and will persist in some form regardless of final naming. Implementing now is low-risk (three lines in robots.txt, trivially reversible) and high-signal (first-mover positioning in the crawl history that AI systems build over time).
Do AI crawlers actually respect these directives today?
As of May 2026, crawler compliance is inconsistent. OpenAI has signaled support for training-related directives. Anthropic and Perplexity have not published explicit support. The directives are forward-looking: implementing them now ensures your site is ready when compliance becomes standard, and the explicit signal may influence crawler behavior even before formal support.
Should I block ai-train but allow search?
This is the most common configuration for publishers who want citation credit without contributing to training data. Set ai-train: disallow and search: allow. This tells crawlers: “You may reference my content in responses, but do not use it to train models.”
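If you generate robots.txt from a build step, the publisher configuration above can be rendered from a small permissions map. The helper `render_intent_block` is hypothetical and only formats text using the directive names from this post. It leaves ai-input out when you do not specify it, since the answer above only covers ai-train and search.

```python
DIRECTIVE_ORDER = ("ai-train", "search", "ai-input")


def render_intent_block(user_agent: str, permissions: dict[str, bool]) -> str:
    """Render a robots.txt group; True becomes allow, False becomes disallow."""
    lines = [f"User-agent: {user_agent}"]
    for name in DIRECTIVE_ORDER:
        if name in permissions:  # unspecified directives are left out entirely
            lines.append(f"{name}: {'allow' if permissions[name] else 'disallow'}")
    return "\n".join(lines)


block = render_intent_block("*", {"ai-train": False, "search": True})
print(block)
# User-agent: *
# ai-train: disallow
# search: allow
```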
How do I verify my implementation?
Run the citability.dev free scan at citability.dev. The 16th check (Content Intent Signaling) parses your robots.txt and verifies the search directive allows citation. You can also run the AVR framework’s Python audit directly:
python3 section_content_intent_signaling.py https://yoursite.com
What is the relationship between llms.txt and Content Intent Signaling?
They are complementary, not competing. llms.txt provides CONTEXT (structured information about your site for LLMs to consume). Content Intent Signaling provides PERMISSION (what AI systems may DO with your content). A site with both has the strongest AI visibility posture: the crawler knows what the site is about (llms.txt) AND what it is allowed to do with that information (Content Intent Signaling).
What to do next
- Add the three Content Intent Signaling directives to your robots.txt. Three lines, five minutes, zero risk.
- Run the citability.dev free scan to verify your implementation passes the Content Intent Signaling check, then check your full AI visibility score across all 16 checks.
I built Content Intent Signaling into the AVR framework because I kept running into the same gap: I could measure whether AI crawlers accessed my content, but I had no way to measure whether the site signaled what those crawlers should DO with it. citability.dev measures citation outcomes. But outcomes without intent are ambiguous: did the AI cite you because you asked it to, or because it happened to crawl you? The search: allow directive removes the ambiguity. That is what this is for.
Sources & Further Reading
- I Audited My Own Site With AVR v1.1.0. Here Is What I Found. The first comprehensive 8-section AVR v1.1.0 audit of chudi.dev produced AGENT-READY 3/3 on §2.7 but only 40/100 on Fact-Block Density. Here is the full audit and what it would take to fix every failing check.
- How I Lifted Five chudi.dev Pages to EXTRACTABLE on AVR v1.1.0. All 5 audited chudi.dev URLs now score EXTRACTABLE on AVR v1.1.0 Fact-Block Density. Two HTML traps (dt/dd Q/A pairs and icons before heading text) cost a follow-up commit each. The CI workflow now hard-fails any regression.
- 8 AI Citations a Day After I Stopped Page-Level SEO. Bing AI cited my site 8 times a day after I stopped tuning individual pages. The principle: entity-level SEO is the floor; page-level work is the ceiling.