Content Intent Signaling: The robots.txt Directive That Controls How AI Uses Your Content

Why this matters

On May 23, 2026, a misconfigured Vercel rewrite poisoned my CDN cache and served raw markdown instead of HTML to every visitor for 24 hours. My Fact-Block audit score dropped to 20/100. The root cause: a vercel.json rewrite meant to serve markdown to AI crawlers was firing on ALL requests, not just those with Accept: text/markdown. My robots.txt said “allow.” But it had no way to say “allow crawling, allow citation, but do NOT rewrite my HTML to markdown for every request.” That gap between access permission and usage permission cost me a day of broken pages and a full remediation arc across 5 URLs.

Three days later, I added three lines to my robots.txt. Zero percent of the Cloudflare Radar top 200,000 domains have these directives. This post is the reference implementation.

TL;DR

Content Intent Signaling separates ACCESS permission (Allow/Disallow in robots.txt) from USAGE permission. Three new directives give you granular control: ai-train for model training, search for citation eligibility, and ai-input for query context. Without them, blocking a crawler to prevent training also blocks citation. With them, you choose exactly what AI systems may do with your content.

The problem: robots.txt is a blunt instrument

You have a robots.txt. You probably even have AI-specific rules for GPTBot, ClaudeBot, and PerplexityBot. But here is the question nobody is asking: does your robots.txt distinguish between “you may crawl this” and “you may cite this in your responses”?

Right now, 79% of the top 200,000 domains have AI-specific rules in robots.txt, according to Cloudflare Radar data from May 2026. Most of those rules are binary: Allow: / or Disallow: /. The problem is that Allow gives blanket permission for everything: training, citation, query context, summarization. Disallow blocks everything.

If you block GPTBot to prevent your content from training the next model, you also block it from citing you in ChatGPT responses. If you allow ClaudeBot to crawl your site, you have no way to say “cite me but do not train on me.” The control is all or nothing.

This is the gap Content Intent Signaling closes.

Why are Allow and Disallow not enough?

The robots.txt specification was designed in 1994 for a world where one kind of crawler did exactly one thing: indexing web pages for search engine results. Thirty-two years later, AI crawlers do at least three distinct things with the content they fetch:

Model training: ingesting content into training datasets for the next model version
Citation/search: surfacing content as a source in AI-generated responses
Query context: using content as real-time context to answer specific user questions

These are genuinely different operations with different value propositions for site owners. A personal blog might welcome citation (free traffic when ChatGPT links to your post) but reject training (your writing style becoming part of a model). A SaaS documentation site might welcome query context (users getting accurate answers about your product) but reject citation in competitor comparisons.

Allow/Disallow cannot express these distinctions. Content Intent Signaling can.

What are the three Content Intent Signaling directives?

Content Intent Signaling introduces three new robots.txt directives that sit alongside your existing Allow and Disallow rules. Each directive controls one dimension of how AI systems may use your content: training, citation, or query context. Together they give site owners the granular permission model that robots.txt has lacked since 1994.

ai-train: model training permission

ai-train: allow

Controls whether AI systems may use your content for model training. Setting this to disallow prevents training while leaving other uses intact. This is the directive most publishers want: they want citation credit without contributing to the training corpus.

search: citation eligibility

search: allow

This is the load-bearing directive for AI visibility. It signals whether AI systems may include your content in search and citation results. A site that sets search: allow explicitly declares: “I want to be cited.” A site that omits it leaves citation eligibility ambiguous.

For any site tracking AI citations (which is what citability.dev measures), this directive is the explicit intent signal that separates “the crawler happened to find me” from “I want to be found and cited.”

ai-input: query context permission

ai-input: allow

Controls whether AI systems may use your content as real-time context when answering user queries. This is distinct from citation: a system might use your documentation as context to generate an answer without directly citing you. A SaaS documentation site benefits from ai-input: allow because users asking “how do I configure X in Product Y” get accurate answers grounded in your docs. The distinction matters: citation puts your URL in the response, while ai-input uses your content invisibly as context. Most sites want both, but the separation lets you choose.
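
To make the independence of the three directives concrete, here is a minimal Python sketch that models each usage permission as its own flag and renders a robots.txt group from it. The names IntentPolicy and render_group are hypothetical, chosen for illustration only; they are not part of the AVR framework or of any specification. The allow/disallow values follow the syntax used in this post, and the ai_input choice for the example blog is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentPolicy:
    """One usage decision per dimension; True renders as 'allow'."""
    ai_train: bool   # may the content be used for model training?
    search: bool     # may the content be cited in AI search results?
    ai_input: bool   # may the content be used as real-time query context?


def render_group(user_agent: str, policy: IntentPolicy) -> str:
    """Render one robots.txt group carrying the three directives."""
    def value(flag: bool) -> str:
        return "allow" if flag else "disallow"

    lines = [
        f"User-agent: {user_agent}",
        f"ai-train: {value(policy.ai_train)}",
        f"search: {value(policy.search)}",
        f"ai-input: {value(policy.ai_input)}",
    ]
    return "\n".join(lines)


# The personal-blog case: welcome citation, reject training.
# Allowing ai-input here is an assumption made for the example.
blog_policy = IntentPolicy(ai_train=False, search=True, ai_input=True)
print(render_group("*", blog_policy))
```

Because each flag is set separately, changing the training decision never touches citation eligibility, which is the separation Allow/Disallow alone cannot express.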

How do you implement Content Intent Signaling?

Implementation requires three lines added to your robots.txt under the wildcard user-agent. The directives go after your existing Allow and Disallow rules, inheriting the same user-agent scope. chudi.dev deployed this on May 26, 2026 as the first site in the AVR methodology to ship it. Here is the exact configuration:

# Content Intent Signaling (AVR v1.2.0 section 1.4)
# Separates ACCESS permission from USAGE permission.
User-agent: *
ai-train: allow
search: allow
ai-input: allow

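Under standard robots.txt group matching, a crawler with no group of its own falls back to the wildcard group, so these directives reach every such crawler. A crawler that matches a named group follows only that group, so a per-agent override should restate all three directives along with any Allow/Disallow rules that agent needs. You can set per-agent overrides like this: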

User-agent: GPTBot
ai-train: disallow
search: allow
ai-input: allow

This configuration tells GPTBot: “You may cite my content and use it as query context, but you may not use it for training.” This is the granularity that was missing.
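
How a per-agent override interacts with the wildcard group can be sketched in a few lines of Python. This is an illustrative parser, not the AVR audit script: parse_intent_groups and resolve are hypothetical names, user-agent matching is simplified to an exact, case-insensitive comparison, and the sketch assumes crawlers apply ordinary robots.txt group selection (own group if present, otherwise the wildcard) to these directives.

```python
INTENT_KEYS = ("ai-train", "search", "ai-input")


def parse_intent_groups(robots_txt: str) -> dict[str, dict[str, str]]:
    """Map each lowercased user-agent token to its intent directives."""
    groups: dict[str, dict[str, str]] = {}
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), {})
        else:
            in_rules = True
            if key in INTENT_KEYS:
                for agent in current_agents:
                    groups[agent][key] = value.lower()
    return groups


def resolve(groups: dict[str, dict[str, str]], agent: str) -> dict[str, str]:
    """Use the agent's own group if one exists, else the wildcard group."""
    return groups.get(agent.lower(), groups.get("*", {}))


sample = """\
User-agent: *
Allow: /
ai-train: allow
search: allow
ai-input: allow

User-agent: GPTBot
ai-train: disallow
search: allow
ai-input: allow
"""

groups = parse_intent_groups(sample)
print("GPTBot:", resolve(groups, "GPTBot"))
print("ClaudeBot:", resolve(groups, "ClaudeBot"))
```

In this sketch GPTBot resolves to its own group, while ClaudeBot has no group and falls back to the wildcard. A named group that omits the directives resolves to an empty result rather than inheriting from the wildcard, which is why overrides are safest when they restate all three lines.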

Verification steps:

Deploy the updated robots.txt
Fetch with cache-bust: curl "https://yoursite.com/robots.txt?v=$(date +%s)"
Confirm the three directives appear in the response
Run the citability.dev free scan to verify the Content Intent Signaling check passes
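
The fetch-and-confirm steps above can also be scripted. The following is a rough Python equivalent of the curl check, using only the standard library; fetch_robots and missing_directives are hypothetical helper names, and https://yoursite.com is a placeholder. It only confirms that the three directive names appear as "name: value" lines and does not validate their values.

```python
import time
import urllib.request

REQUIRED = ("ai-train", "search", "ai-input")


def missing_directives(robots_txt: str) -> list[str]:
    """Return the required directive names that never appear as 'name: value'."""
    seen = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        name, sep, _ = line.partition(":")
        if sep:
            seen.add(name.strip().lower())
    return [name for name in REQUIRED if name not in seen]


def fetch_robots(site: str) -> str:
    """Fetch robots.txt with a timestamp query string to avoid cached copies."""
    url = f"{site.rstrip('/')}/robots.txt?v={int(time.time())}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Against a live site (placeholder URL):
#   gaps = missing_directives(fetch_robots("https://yoursite.com"))
# Offline demonstration on a literal file:
example = "User-agent: *\nai-train: allow\nsearch: allow\n"
gaps = missing_directives(example)
print("all three directives present" if not gaps else f"missing: {gaps}")
```

An empty result means all three directives were found; anything else lists what still needs to be deployed.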

How do you verify Content Intent Signaling works?

The citability.dev free scan now includes Content Intent Signaling as its 16th check, making it the only scan tool that detects these emerging directives. The check fetches your robots.txt, parses it for ai-train, search, and ai-input values, and specifically verifies that the search directive is set to allow, which is the citation eligibility signal.

The AVR framework’s Python audit script (section_content_intent_signaling.py) runs four checks:

Check	What it tests	Pass condition
S1	robots.txt accessible	HTTP 200
S2	Content intent directives present	Any of ai-train, search, ai-input found
S3	search directive allows citation	search value is allow, yes, true, or all
S4	Directive coverage across AI agents	Wildcard (*) has directives OR 50%+ of major AI agents covered
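
The four pass conditions in the table can be approximated in a short Python sketch. This is not the source of section_content_intent_signaling.py, which is not reproduced in this post; it is a reading of the table under stated assumptions: the list of "major AI agents" is a guess based on the bots named earlier, S3 is taken to mean that every search value found must be in the allow set, and the function names are hypothetical.

```python
INTENT_KEYS = {"ai-train", "search", "ai-input"}
ALLOW_VALUES = {"allow", "yes", "true", "all"}
# Assumption: the real audit's agent list is not given in the post.
MAJOR_AGENTS = ("gptbot", "claudebot", "perplexitybot")


def intent_by_agent(robots_txt: str) -> dict[str, dict[str, str]]:
    """Collect intent directives per lowercased user-agent token."""
    groups: dict[str, dict[str, str]] = {}
    agents: list[str] = []
    in_rules = False
    for raw in robots_txt.splitlines():
        name, sep, value = raw.split("#", 1)[0].partition(":")
        if not sep:
            continue
        name, value = name.strip().lower(), value.strip().lower()
        if name == "user-agent":
            if in_rules:  # rules were seen, so this line opens a new group
                agents, in_rules = [], False
            agents.append(value)
            groups.setdefault(value, {})
        else:
            in_rules = True
            if name in INTENT_KEYS:
                for agent in agents:
                    groups[agent][name] = value
    return groups


def audit(status_code: int, robots_txt: str) -> dict[str, bool]:
    """Evaluate S1 to S4 as described in the table; True means pass."""
    groups = intent_by_agent(robots_txt) if status_code == 200 else {}
    search_values = [g["search"] for g in groups.values() if "search" in g]
    covered = [agent for agent in MAJOR_AGENTS if groups.get(agent)]
    return {
        "S1": status_code == 200,
        "S2": any(groups.values()),
        "S3": bool(search_values) and all(v in ALLOW_VALUES for v in search_values),
        "S4": bool(groups.get("*")) or len(covered) * 2 >= len(MAJOR_AGENTS),
    }


wildcard_only = "User-agent: *\nai-train: allow\nsearch: allow\nai-input: allow\n"
print(audit(200, wildcard_only))
```

With the wildcard configuration shown in this post, all four checks in the sketch pass; a file that only sets search: disallow for a single named agent would pass S1 and S2 but fail S3 and S4.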

chudi.dev’s audit result: INTENT-SIGNALED (4/4 checks pass). The competitive benchmark tells the story:

Site	Citability checks passed	Content Intent Signaling
chudi.dev	16/16	INTENT-SIGNALED (4/4)
conductor.com	13/16	INTENT-ABSENT
semrush.com	12/16	INTENT-ABSENT
brightedge.com	9/16	INTENT-ABSENT

None of the three enterprise SEO platforms in this benchmark has implemented Content Intent Signaling, which is consistent with the zero percent adoption across the Cloudflare Radar top 200,000 domains. chudi.dev is the first in the AVR methodology to ship it.

The adoption gap is the opportunity. If ChatGPT, Claude, or Perplexity starts respecting the search directive (and the IETF proposal gives them the specification to do so), sites that already signal citation intent would have months of crawl history establishing their preference, while sites that wait would be starting from zero. I expect the same dynamic that played out with llms.txt: early adopters got indexed and recognized before the protocol was widely understood.

Here is the full AVR audit after Content Intent Signaling went live (May 26, 2026):

AVR Section	Verdict	Score
SEO Foundation	PASS	all checks
AI Infrastructure	PASS	all checks
Agent Readiness (WebMCP)	AGENT-READY	2/3
Fact-Block Density	EXTRACTABLE	100/100
Bot Response Code	ACCESS-OPEN	4/4 bots 200
Markdown Negotiation	MARKDOWN-READY	93% payload reduction
AI Rules in robots.txt	AI-RULES-COMPLETE	3/3
Agent Readiness Tier	AGENT-TIER-HIGH	4/4
Crawl Signal	CRAWL-ACCESSIBLE	3/3
Content Intent Signaling	INTENT-SIGNALED	4/4

The Content Intent Signaling check joined the audit as the 14th section. All three directives resolved correctly under the wildcard user-agent. Whether this explicit signal correlates with higher citation rates is the empirical question; the first measurement window opens 30 days after implementation (June 26, 2026).

The strategic context

Content Intent Signaling is part of a broader protocol stack that Suganthan Mohanadass mapped in his Layer 2 research. His work documents the INPUT side: which protocols to implement for AI discoverability. The AVR framework (which citability.dev implements as a SaaS audit) measures the OUTPUT side: whether implementing those protocols actually produced citations.

The moat insight from the gap analysis: Suganthan tells you which protocols to implement. citability.dev tells you whether implementing them actually got you cited. Content Intent Signaling sits at the intersection: it is both a protocol to implement (INPUT) and a measurable signal that can be audited (OUTPUT). The search: allow directive is the explicit declaration of citation intent, and the citability.dev scan verifies it exists.

FAQ: Content Intent Signaling

Content Intent Signaling is an emerging IETF proposal that separates access permission from usage permission in robots.txt. The five questions below cover standardization status, current crawler compliance, the recommended configuration for publishers who want citation without training, verification steps, and the relationship between Content Intent Signaling and llms.txt.

Is Content Intent Signaling an official standard?

It is an emerging IETF proposal, not a ratified standard. The directive names (ai-train, search, ai-input) may evolve as the specification matures. But the underlying concept of separating access permission from usage permission is architecturally sound and is likely to persist in some form regardless of final naming. Implementing now is low-risk (three lines in robots.txt, trivially reversible) and high-signal (first-mover positioning in the crawl history that AI systems build over time).

Do AI crawlers actually respect these directives today?

As of May 2026, crawler compliance is inconsistent. OpenAI has signaled support for training-related directives. Anthropic and Perplexity have not published explicit support. The directives are forward-looking: implementing them now ensures your site is ready when compliance becomes standard, and the explicit signal may influence crawler behavior even before formal support.

Should I block ai-train but allow search?

This is the most common configuration for publishers who want citation credit without contributing to training data. Set ai-train: disallow and search: allow. This tells crawlers: “You may reference my content in responses, but do not use it to train models.”

How do I verify my implementation?

Run the citability.dev free scan at citability.dev. The 16th check (Content Intent Signaling) parses your robots.txt and verifies the search directive allows citation. You can also run the AVR framework’s Python audit directly: python3 section_content_intent_signaling.py https://yoursite.com

What is the relationship between llms.txt and Content Intent Signaling?

They are complementary, not competing. llms.txt provides CONTEXT (structured information about your site for LLMs to consume). Content Intent Signaling provides PERMISSION (what AI systems may DO with your content). A site with both has the strongest AI visibility posture: the crawler knows what the site is about (llms.txt) AND what it is allowed to do with that information (Content Intent Signaling).

What should you do next?

Adding Content Intent Signaling to your site takes three lines in robots.txt and five minutes of work. The directives are forward-compatible, trivially reversible, and position your site ahead of the entire Cloudflare Radar top 200,000.

Add the three Content Intent Signaling directives to your robots.txt under the wildcard user-agent.
I built Content Intent Signaling into the AVR framework because I kept running into the same gap: I could measure whether AI crawlers accessed my content, but I had no way to measure whether the site signaled what those crawlers should DO with it. citability.dev measures citation outcomes. But outcomes without intent are ambiguous: did the AI cite you because you asked it to, or because it happened to crawl you? The search: allow directive removes the ambiguity. That is what this is for.
Run the citability.dev free scan to verify your implementation passes the Content Intent Signaling check, then check your full AI visibility score across all 16 checks.

