Plan-Quota Is the Binding Constraint, Not API Tokens
Under subscription plans, the cost of an LLM call is plan-window burn, not API tokens. That changes which decisions you tune and which work belongs off-LLM entirely. A manifesto for the 5-hour rate-limit era.
Why this matters
Two days ago I closed 8 hours of agentic work at 32% plan-window burn. The Reddit thread I was reading at 1 PM had six people hitting Claude Code’s 5-hour limit on a third of my workload.
The difference is not volume of work. It is what gets tuned. Most builders treat AI coding tools the way they treated the API-billed era: minimize tokens, pick the cheapest model that gets the job done, prompt-cache aggressively. Under subscription pricing (Pro, Max, Max 20x, Team, Enterprise) that target function is wrong. The constraint changed. The meta has not caught up.
This post is the manifesto for the change. The thesis, named bluntly: plan-quota is the binding constraint, not API spend. If you tune as if tokens still cost money per call, you will hit the wall earlier than someone running ten times your token volume on the same plan.
TL;DR
Subscription plans rate-limit on a 5-hour rolling window, not on dollars per call. Within plan, every model draws from the same budget: Haiku, Sonnet, and Opus all count against one window cap. The right objective function is judgment-quality decisions per 5-hour window, not cost per token.
That inverts five common tunings: stop using Haiku for everything, push routine work off-LLM, batch substrate edits to session end, delegate >20-file reads to subagents, and use Opus on the few decisions where being wrong costs an hour of rework.
The post defines the constraint, names the wrong meta, names the right one, and shows the per-pattern math on a real 8-hour session.
What plan-quota actually is
Claude Code’s subscription plans impose a 5-hour rolling rate limit. Inside the window you have a token budget across all conversations and tool calls. When the window resets you start fresh; while you are inside it, every turn counts against the cap. Anthropic adjusts the cap by tier (most recently 2026-05-06, when they doubled the Code 5-hour limits and raised Opus API limits).
What plan-quota is NOT: a per-call dollar amount. Under the older API-billed model, your bill grew per token, and the tuning was fewer dollars per turn. Under subscription, you paid the flat fee. The marginal dollar cost of one more turn is zero. The marginal plan-quota cost is what binds: every turn depletes the 5-hour window, and when it runs out, you are blocked until reset.
This is the rule in my operations playbook, numbered G18: plan-quota is the binding constraint, not API spend. First thing I check when picking a tool, a model, or a workflow. The lens that explains why most r/claudecode tuning advice is correct under the old pricing and wrong under the new.
Why the Reddit consensus tunes the wrong thing
Three patterns dominate the r/claudecode efficiency discourse. Each is correct under per-call billing and wrong under subscription:
Pattern one: “Use Haiku unless you really need Sonnet.” Correct under API pricing because Haiku is roughly 1/5 the per-token cost of Sonnet. Under subscription pricing, Haiku and Sonnet both count against the same 5-hour window at roughly the same per-token rate. The cost of using Sonnet for a judgment task that Haiku would get wrong is one Sonnet turn. The cost of using Haiku for the same task and having to re-do it on Sonnet later is two turns plus the rework. Subscription pricing inverts the math: use the model that gets the decision right the first time, not the cheapest one.
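Here is the back-of-the-envelope version of that math. The failure rate and rework count below are illustrative assumptions, not measurements; the point is the shape of the comparison, not the exact numbers.

```python
# Illustrative only: the failure probability and rework count are assumed,
# not measured. Within plan, a Haiku turn and a Sonnet turn burn roughly
# the same quota (per the post), so both count as one turn-equivalent.

TURN_BURN = 1.0       # plan-quota burned by one turn, either model
P_HAIKU_WRONG = 0.4   # assumed chance Haiku fumbles a judgment-grade task
REWORK_TURNS = 2      # assumed turns spent undoing the wrong answer

# Strategy A: start on Haiku; on failure, redo on Sonnet plus rework turns.
haiku_first = TURN_BURN + P_HAIKU_WRONG * (TURN_BURN + REWORK_TURNS * TURN_BURN)

# Strategy B: start on Sonnet and get the judgment call right the first time.
sonnet_first = TURN_BURN

print(f"Haiku-first expected burn:  {haiku_first:.1f} turn-equivalents")   # 2.2
print(f"Sonnet-first expected burn: {sonnet_first:.1f} turn-equivalents")  # 1.0
```

Under API pricing the Haiku branch is also weighted by its lower per-token price, which is what made Haiku-first look attractive; under a shared window, that weight disappears and the failure branch dominates.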
Pattern two: “Prompt-cache aggressively to save tokens.” Caching is still good, but the reason changes. Under API pricing, cache hits cut your bill 90 percent on the cached prefix. Under subscription, cache hits cut your plan-window burn 90 percent on the cached prefix. Still valuable, but the discipline is different: you cache to preserve TURNS in your 5-hour window, not to save dollars. A cache miss costs you the same dollars (zero) and roughly 10 times the plan-quota.
Pattern three: “Move the conversation into the LLM so you do not have to context-switch.” Under API pricing, every turn cost money, so consolidating work into fewer longer conversations made sense. Under subscription, consolidation IS the problem: a 200-file repo dumped into context costs zero dollars and 150,000 tokens of plan-quota per turn it persists. The correct move under subscription is the opposite: query the disk, do not load it. The conversation is a steering wheel, not a workspace.
The compound effect: an operator running the tuned-for-API patterns on a subscription plan will hit limits at roughly a fifth the throughput of an operator running the tuned-for-subscription patterns. Same plan tier. Same models. Different objective function.
The objective function change
Under subscription pricing, the question is not “how do I make each call cheaper?” The question is “how many judgment-quality decisions can I fit in a 5-hour window?” That change cascades through every workflow choice:
You stop asking “Haiku or Sonnet?” and start asking “is this a judgment call or a routine call?” Judgment calls (architecture picks, money-moving deploys, irreversible config changes, debugging unknown causes) go to Sonnet or Opus regardless of token count. Routine calls (status checks, simple grep, file lookups, browser automation) go to Haiku regardless of importance. The split is on the cost of being wrong, not the cost of the call.
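One way to make that split mechanical is a routing rule keyed on the cost of being wrong rather than on token count. This is a hypothetical sketch; the thresholds and the reversibility test are placeholders to tune against your own rework data, not a prescription.

```python
def pick_model(rework_minutes_if_wrong: int, reversible: bool) -> str:
    """Route on the cost of being wrong, not the cost of the call.

    Thresholds are illustrative placeholders; tune them to your own data.
    """
    if rework_minutes_if_wrong < 5:
        return "haiku"    # routine: status checks, lookups, simple greps
    if reversible:
        return "sonnet"   # judgment, but one git revert away from undone
    return "opus"         # irreversible: architecture, deploys, money-moving

# The split the post describes, rendered as examples:
print(pick_model(2, reversible=True))    # haiku  - file lookup
print(pick_model(45, reversible=True))   # sonnet - refactor behind a revert
print(pick_model(120, reversible=False)) # opus   - schema migration
```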
You stop asking “how do I fit more context?” and start asking “what goes on disk that I query, and what goes in the prompt that I send?” Stable prefix (CLAUDE.md, project conventions, the codex nodes I always reference) stays in the prompt to maximize cache hit. Volatile material gets queried from disk per turn so cache does not invalidate. The conversation never holds the full codebase. It holds a steering directive plus query results.
You stop asking “can I do this in one long turn?” and start asking “what shell script, subagent, or hook can do this without my main session paying?” A pre-commit hook running as a Python script costs ZERO plan-quota; it never goes through the LLM. A subagent dispatch costs one orchestration turn, while the subagent burns its own 200K window separately. Compared to the main session reading 30 files itself, both are 5-10x cheaper in plan-quota.
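For the hook half of that, a minimal sketch of what zero-plan-quota validation looks like. The specific checks here (JSON parse, stray breakpoint) are illustrative stand-ins, not the role-classification hook from my own setup.

```python
#!/usr/bin/env python3
"""Hypothetical pre-commit hook: deterministic checks, zero plan-quota.

Wire it up as .git/hooks/pre-commit (or via the pre-commit framework).
The checks are illustrative placeholders.
"""
import json
import subprocess
import sys

def staged_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> int:
    failures = []
    for path in staged_files():
        # Check 1: every staged .json file must parse.
        if path.endswith(".json"):
            try:
                with open(path, encoding="utf-8") as fh:
                    json.load(fh)
            except (OSError, json.JSONDecodeError) as exc:
                failures.append(f"{path}: invalid JSON ({exc})")
        # Check 2: no leftover debugger calls in staged Python files.
        if path.endswith(".py"):
            try:
                with open(path, encoding="utf-8") as fh:
                    if "breakpoint()" in fh.read():
                        failures.append(f"{path}: leftover breakpoint() call")
            except OSError as exc:
                failures.append(f"{path}: unreadable ({exc})")
    if failures:
        print("pre-commit: blocking commit:", *failures, sep="\n  ")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```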
You stop asking “should I retry?” and start asking “what’s the diagnosed cause before I retry?” Retries under per-call pricing cost dollars. Retries under subscription cost turns. Three failed retries on the same approach equal three turns I cannot use elsewhere in the window. A 10-minute diagnostic shell script that runs once and tells me the real cause costs zero plan-quota and saves four future turns.
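The diagnostic script can be as dumb as a fact-gatherer whose output you paste into a single turn. Everything below (the port, the env file, the log path) is a hypothetical stand-in for whatever your failing workflow actually depends on.

```python
#!/usr/bin/env python3
"""Hypothetical pre-retry diagnostic: gather facts once, spend one turn on them.

The checks are placeholders; swap in whatever your failing step touches.
"""
import os
import shutil
import socket
import subprocess

def check(label: str, ok: bool, detail: str = "") -> str:
    return f"[{'ok' if ok else 'FAIL'}] {label}" + (f" - {detail}" if detail else "")

def main() -> None:
    report = []

    # Is the dev server actually listening where the failing step expects it?
    with socket.socket() as sock:
        sock.settimeout(1)
        listening = sock.connect_ex(("127.0.0.1", 3000)) == 0
    report.append(check("localhost:3000 listening", listening))

    # Is the env file the build reads actually present?
    report.append(check(".env.local present", os.path.exists(".env.local")))

    # Is the CLI the step shells out to even installed?
    report.append(check("gh CLI on PATH", shutil.which("gh") is not None))

    # What did the last test run actually say? Tail, not the whole log.
    tail = subprocess.run(["tail", "-n", "5", "test.log"],
                          capture_output=True, text=True)
    report.append(check("test.log tail", tail.returncode == 0,
                        tail.stdout.strip() or tail.stderr.strip()))

    print("\n".join(report))

if __name__ == "__main__":
    main()
```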
The objective function is the same whether you are a solo operator (me) or a team running parallel agents. Tune for the maximum number of decisions where being wrong is expensive, performed inside the 5-hour cap.
The proof: an 8-hour session at 32 percent burn
The empirical pattern I keep returning to: an 8-hour Claude Code session on 2026-05-15 that closed at 32 percent plan-window burn. Most of the day was substantive architectural work, not idle querying.
What shipped in the window:
| Time | Work shipped | Approximate plan-quota cost |
|---|---|---|
| 08:30 - 10:00 | Audit + plan for the 4-pillar codex-driven post series | ~4% (heavy Sonnet, light Opus on the audit decision) |
| 10:00 - 12:30 | Draft + ship 4 codex-driven posts to chudi.dev | ~9% (Haiku + Sonnet mix, structured per the body-framework skill) |
| 12:30 - 13:30 | Methodology fix: replace HTTP-200 deploy verification with gh-run-list poll | ~3% (Sonnet for the skill edit, Haiku for the gate scan) |
| 13:30 - 16:00 | Step 4h vocab translation + Step 4i framing rewrite on the kill-vs-keep-a-side-project post | ~7% (Sonnet draft + Haiku validation loops) |
| 16:00 - 17:30 | Ship skill v2.5 with Step 4j craft pass + retroactive retrofit on 3 posts | ~6% (Sonnet for the skill design, Haiku for the audit scan) |
| 17:30 - 18:30 | Deploy verification + indexing + close-out | ~3% (Haiku for status checks, Sonnet only for the close-out summary) |
Total: roughly 32 percent of the 5-hour-window cap consumed across 8 hours of wall-clock time. Reset windows fire at hours 5 and 10, so the session spanned two adjacent windows. Only one of the six work blocks used Opus, and only for one decision (the audit verdict on whether to proceed with the 4-post series at all). Everything else was Sonnet or Haiku. The savings were not from cheaper models. They came from work that never touched the LLM.
The work that never touched the LLM:
- A pre-commit hook validated the role classification on every turn (Python script, zero plan-quota).
- The codex (`~/.claude/codex/INDEX.json`, a hand-authored knowledge graph stored on disk) was queried 14 times across the session for context that would otherwise have been loaded into the prompt; each query was a grep, not a turn.
- The decisions ledger (`~/.claude/decisions.jsonl`, an append-only log of every architectural pick I make with the AI) recorded 11 events as shell appends, zero plan-quota.
- Three subagent dispatches (Explore, brand-voice:document-analysis, general-purpose) handled file-heavy work that would have cost the main session 30-50 file reads each; each subagent burned its own 200K window, not mine.
- The Vercel deploy gate watcher ran as a background bash script polling `gh run list`; my main session received one notification when the build completed, not 30 polling turns (a minimal sketch of that watcher follows this list).
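For concreteness, roughly what that watcher does. The original runs as a bash script; this is an equivalent sketch in Python, assuming an authenticated `gh` CLI and a placeholder notification path.

```python
#!/usr/bin/env python3
"""Hypothetical background watcher: poll `gh run list` until the latest run finishes.

Equivalent sketch of the bash watcher from the post. Assumes an authenticated
`gh` CLI; the notification path and poll interval are placeholders.
"""
import json
import subprocess
import time
from pathlib import Path

NOTIFY_FILE = Path("/tmp/deploy-gate.txt")   # placeholder location
POLL_SECONDS = 30

def latest_run() -> dict:
    out = subprocess.run(
        ["gh", "run", "list", "--limit", "1",
         "--json", "status,conclusion,displayTitle"],
        capture_output=True, text=True, check=True,
    )
    runs = json.loads(out.stdout)
    return runs[0] if runs else {}

def main() -> None:
    while True:
        run = latest_run()
        if run.get("status") == "completed":
            # One file the session reads once, instead of one turn per poll.
            NOTIFY_FILE.write_text(
                f"deploy gate: {run.get('displayTitle', '?')} -> {run.get('conclusion', '?')}\n"
            )
            break
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```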
That is where the gap is. Not in cheaper models. In work that never went through Claude at all.
What this means for you
If you are hitting Claude Code’s 5-hour limit on substantive technical work, the issue is almost never that you need more plan-quota. It is that a large fraction of the work you are routing through the LLM does not belong there. Three rules surface from the per-pattern math:
One: do not put in your prompt what you can put on disk and query. A 200-file codebase loaded into context every turn is 150,000 tokens of plan-quota per turn it persists. A grep against the same codebase returns the 5 files you actually need and costs essentially zero plan-quota. The disk is the working set; the prompt is the steering wheel.
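A minimal sketch of “query the disk” as code, in case it sounds more exotic than it is. The file extensions and the symbol name are placeholders; in practice a plain grep or ripgrep call does the same job.

```python
#!/usr/bin/env python3
"""Hypothetical disk query: return the handful of files that mention a symbol,
instead of loading the whole repo into context.
"""
import sys
from pathlib import Path

def files_mentioning(symbol: str, root: str = ".",
                     exts: tuple[str, ...] = (".py", ".ts", ".tsx")) -> list[str]:
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in exts or not path.is_file():
            continue
        try:
            if symbol in path.read_text(encoding="utf-8", errors="ignore"):
                hits.append(str(path))
        except OSError:
            continue
    return hits

if __name__ == "__main__":
    # e.g. python query.py validateQuota
    for hit in files_mentioning(sys.argv[1]):
        print(hit)
```

Five file paths go into the next turn; the other 195 files stay on disk.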
Two: do not ask the LLM to do what a 5-line shell script can do deterministically. Validation, formatting, secret-scanning, type-checking, link-checking, and file-existence checks are all jobs for hooks and scripts. Each one you offload pays back compound: zero plan-quota per invocation, deterministic results, no retry cost.
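The secret-scan item from that list, as a hypothetical sketch. The patterns are deliberately simple placeholders, not a complete ruleset, but even this version runs at zero plan-quota on every commit.

```python
#!/usr/bin/env python3
"""Hypothetical secret scan over the staged diff: deterministic, zero plan-quota.

Patterns are simple placeholders; a real ruleset would be broader.
"""
import re
import subprocess
import sys

PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                            # AWS key id shape
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret)\s*[:=]\s*['\"][^'\"]{16,}"),
]

def staged_diff() -> str:
    out = subprocess.run(["git", "diff", "--cached", "--unified=0"],
                         capture_output=True, text=True, check=True)
    return out.stdout

def main() -> int:
    findings = []
    for line in staged_diff().splitlines():
        # Only scan added lines; skip the "+++ b/file" header lines.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for pattern in PATTERNS:
            if pattern.search(line):
                findings.append(line.strip())
    if findings:
        print("secret-scan: blocking commit, suspicious added lines:")
        for f in findings:
            print("  ", f[:120])
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```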
Three: tune for plan-quota burn per useful decision, not API tokens per call. When you are deciding whether to spawn a subagent, the question is not “is this cheaper per API token?” The question is “does the main session need to hold this context, or can a subagent burn its own window and return a 200-word summary?” When you are picking a model, the question is not “is Haiku enough?” The question is “is the cost of a wrong Haiku verdict (retry on Sonnet plus rework) worse than the cost of starting on Sonnet?”
If those three rules feel obvious, you are probably already operating this way. If they sound contrarian, the inversion is the post.
FAQ
Five questions I have been asked repeatedly since I started writing about this, each self-contained. AI engines extract these as standalone answers; human readers skim them as recap; both audiences are served by the same shape.
Does plan-quota economics apply to API users?
Partially. If you are pay-as-you-go on the API, dollars per call still matter and the API-era tunings remain correct. The G18 framing applies specifically to subscription tiers (Pro, Max, Max 20x, Team, Enterprise) where the marginal dollar cost of one more turn is zero and the marginal plan-quota cost is what binds. Most production teams I talk to are on subscription plans because the predictability matters more than per-call tuning; for them, G18 applies.
Why use Opus more if Opus burns more plan-quota per turn?
Opus burns roughly 5x Haiku’s plan-quota for the same tokens. Sounds like Opus should be a last resort. The math flips when you account for being wrong. A judgment call Haiku gets wrong costs the Haiku turn plus the Sonnet retry plus the rework. A judgment call Opus gets right costs one Opus turn. Reversible decisions (one git revert away) belong to Sonnet. Irreversible decisions (architecture, deploy, money-moving) belong to Opus.
Is the 5-hour window the only constraint?
For solo operators on Max 20x, effectively yes for most workloads. For teams or enterprise plans, there are additional limits (concurrent sessions, per-day caps, per-team caps). The 5-hour rolling window is the most binding constraint I have hit; everything else is a downstream effect of that constraint not being managed.
How do I measure my own plan-quota burn?
Claude Code’s built-in /cost command (in newer versions) gives you a running tally per session. If you do not have that, the proxy is “how often do I hit the 5-hour limit in a typical work day?” If the answer is more than once a week, you are spending too much per turn. If the answer is “rarely,” you have headroom to use the more expensive model where it pays off.
What is the highest-impact change I can make today to reduce burn?
Move one workflow off the LLM and onto a shell script or hook. Pick the smallest one: a validation check, a status query, a file-existence test. Implement it as a pre-session hook or a script the session calls. Measure plan-quota burn before and after on a representative session. If burn drops measurably, you have the pattern. Scale to the next workflow. The discipline is “what else does not belong in the conversation?” repeated for every workflow.
What to do next
Three actions in increasing order of effort. Each is independent; pick the one that matches your current state.
- Open `/cost` (or its equivalent) in your next Claude Code session and read your real plan-window burn. If you have never measured this, the number alone will tell you whether G18 applies to your situation. Measurement before tuning, always.
- Audit one of your recurring conversations for off-LLM work. Pick a workflow you run weekly. List every step. Mark which steps the LLM is doing and which steps could be a shell script, a pre-commit hook, or a scheduled task. Migrate one step. Re-measure plan-quota.
- Decide which of your judgment calls justify Opus and which do not. Make a list of the architectural picks you make weekly (deploy yes/no, schema change yes/no, kill-vs-keep on a feature). For each, name the cost-of-being-wrong. If the cost is more than an hour of rework, Opus pays for itself. If not, Sonnet is fine. The list is the practice of the framework.
The harder version of this work is rebuilding your workflow so the conversation is genuinely the steering wheel and the work happens elsewhere. Two months ago I did not have this framing. Six months from now most production teams I know will. The window to author the framing in public is now.
Drafted from my own May 2026 session-burn data and the operating principles I have been compounding over the prior 8 months. The 32-percent number is verified from session logs on 2026-05-15. If you want the AI-visibility framework I built alongside this discipline, see chudi.dev/framework.
Sources & Further Reading
- *10 Patterns Behind a 32% Claude Code Plan-Quota Burn*: 10 specific patterns that explain why my Claude Code plan-quota burn runs at a fraction of what r/claudecode operators report on similar workloads. Each pattern lists the multiplier and the named alternative you can run today.
- *Claude Code Best Practices the Official Docs Don't Cover (2026)*: What I learned building 36K lines of production code with Claude Code: quality gates, multi-agent orchestration, and the workflow patterns that ship.
- *Claude Code Hooks Caught a Secret Leak Before I Shipped It*: Stop accidental secret leaks in Claude Code hooks. Learn 4 production patterns to validate, format, and gate commands before execution.
What do you think?
I post about this stuff on LinkedIn every day and the conversations there are great. If this post sparked a thought, I'd love to hear it.
Discuss on LinkedIn