10 Patterns Behind a 32% Claude Code Plan-Quota Burn
10 specific patterns that explain why my Claude Code plan-quota burn runs at a fraction of what r/claudecode operators report on similar workloads. Each pattern lists the multiplier and the named alternative you can run today.
Why this matters
Three weeks ago I traced why my Claude Code plan-quota burn runs at roughly a fifth of what r/claudecode operators report on similar workloads. The answer is 10 patterns.
Most r/claudecode users treat Claude Code as a runtime where the conversation IS the work. I treat it as an orchestration layer where the conversation steers and the actual work happens in subagents, hooks, scripts, scheduled tasks, and a queryable disk-resident substrate. That single architectural difference is worth roughly 5-15x in plan-quota burn.
The 10 patterns below name where their burn goes and mine does not. Each has a named alternative (the pattern I run instead) and a rough multiplier on plan-quota cost. You can verify any of them in your own next session.
TL;DR
Under subscription pricing, plan-quota is the binding constraint (not API tokens). Most builders are still tuning as if dollars per call mattered, when the real cost is turns per 5-hour window. The 10 patterns below are where the gap lives. Stack five of them and you get the 5-10x compound effect that explains “32% burn at 8 hours vs limit-hits at 2 hours.”
The 10 patterns
1. Context-as-RAM versus codex-as-query-surface
The Reddit pattern: open Claude Code, paste a 200-file repo into context (or use Read on every file to “give it context”), then start asking questions. Every turn re-passes 150-300K tokens. Even with prompt caching, the first turn alone is huge, and the cache invalidates the moment they edit one file.
The alternative I run: ~30-50K loaded by default and the codex is queried on demand. The codex is a hand-authored knowledge graph stored on disk at ~/.claude/codex/, with an INDEX.json that maps terms and aliases to node files. When I need context on “plan quota” I grep the index, load 3-5 relevant nodes, paste them, and proceed. The disk is the working set; the prompt is the steering wheel.
Per-turn delta: roughly 5-10x. The runtime pattern loads RAM-like context every turn; the orchestration pattern queries it like a database.
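A minimal sketch of that lookup, assuming an INDEX.json that maps terms and aliases to node-file paths (the real index shape may differ):

```python
import json
from pathlib import Path

CODEX = Path.home() / ".claude" / "codex"

def lookup(term: str, limit: int = 5) -> list[str]:
    """Grep the index for a term, then load only the matching nodes."""
    index = json.loads((CODEX / "INDEX.json").read_text())
    # Assumed shape: {"plan quota": ["nodes/plan-quota.md", ...], "quota": [...]}
    hits = [path for key, paths in index.items()
            if term.lower() in key.lower()
            for path in paths]
    # Load at most `limit` nodes; these get pasted into the prompt, nothing else.
    unique = list(dict.fromkeys(hits))[:limit]
    return [(CODEX / p).read_text() for p in unique]
```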
2. Opus-everywhere versus a routed model mix
The Reddit pattern: “Opus is the smartest, so I use it for all my coding.” Even file lookups. Even status checks. Even “what does this function do.”
The alternative I run: 50% Haiku / 30% Sonnet / 20% Opus per the routing rule in my operations playbook. Haiku finds (exploration, triage, status, browser automation, simple refactors). Sonnet builds (multi-file implementation, async work, security-adjacent changes). Opus decides (architecture, money-moving, final review, deep debugging). Opus is reserved for judgment work where being wrong costs more than an hour of rework.
Per-turn delta: Opus burns roughly 5x the plan-quota of Haiku for the same token count. Running Opus-for-everything is a 5x multiplier compared to a routed mix, and most of those Opus turns are doing routine work that does not need its judgment.
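The routing rule itself is a judgment call, but the shape is easy to pin down. A sketch, with the task classes and the mapping as my assumptions rather than a Claude Code feature:

```python
# Task class -> model. The split below mirrors the 50/30/20 mix described above.
ROUTES = {
    "explore": "haiku",       # file lookups, triage, status, browser automation
    "triage": "haiku",
    "build": "sonnet",        # multi-file implementation, async work
    "security": "sonnet",
    "architecture": "opus",   # judgment work where rework is expensive
    "final-review": "opus",
}

def pick_model(task_class: str) -> str:
    # Default cheap: anything unclassified goes to Haiku, never to Opus.
    return ROUTES.get(task_class, "haiku")
```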
3. Conversational chat versus done-when delegation
The Reddit pattern: “Let me ask Claude to do X. Now Y. Now Z. Wait it broke, let’s debug.” Each step is a full turn with full context, and the operator manually steers across 50 turns.
The alternative I run: single-shot delegated workflows through a goal-driven session mode I call /done-when. You specify the success condition + a max-turn budget + an evaluator model (Haiku, usually). The main agent works toward the goal; the Haiku evaluator scores each loop iteration against the success condition; the loop exits when the condition is satisfied or the turn budget runs out. I do not manually steer 50 turns; the agent runs to the goal.
Per-task delta: 10-50x. Manual conversational steering across a multi-step task is the most wasteful pattern in the runtime model.
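A sketch of the /done-when loop under stated assumptions: `run_turn` and `evaluate` are hypothetical stand-ins for the main-agent step and the Haiku evaluator, since stock Claude Code does not ship this loop:

```python
from typing import Callable

def done_when(goal: str,
              run_turn: Callable[[str], str],
              evaluate: Callable[[str, str], bool],
              max_turns: int = 25) -> bool:
    """Run the agent toward a success condition instead of steering by hand."""
    for _ in range(max_turns):
        transcript = run_turn(goal)        # main agent works toward the goal
        if evaluate(goal, transcript):     # cheap evaluator scores this iteration
            return True                    # success condition satisfied: exit
    return False                           # turn budget exhausted: stop, diagnose
```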
4. Heavy work in main session versus subagent delegation
The Reddit pattern: the main session itself does the broad-codebase exploration, the document analysis, the multi-file grep. All of it burns main-session context.
The alternative I run: spawn an Explore or general-purpose agent (or a domain specialist like a brand-voice document-analyzer) for anything reading more than ~20 files. The subagent has its own 200K context window. The main session receives a summary, not the raw output. Spawning a subagent costs roughly one turn of orchestration overhead; the subagent then burns its own window separately and returns a 200-word summary.
Per-heavy-task delta: the subagent burns its own 200K window, the main session pays summary cost only. On a typical >20-file read, that is a 30-50x reduction in main-session burn.
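The contract is easy to state in code. A sketch, with `spawn_subagent` as a hypothetical stand-in for the subagent-spawning tool call:

```python
from typing import Callable

def delegate(task: str,
             spawn_subagent: Callable[[str], str],
             summary_words: int = 200) -> str:
    """The subagent burns its own 200K window; the main session sees only this."""
    prompt = (f"{task}\n\nReturn a summary of at most {summary_words} words. "
              "Do not return raw file contents.")
    result = spawn_subagent(prompt)                    # runs in its own context
    return " ".join(result.split()[:summary_words])    # hard cap on what comes back
```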
5. Always-on extended thinking versus decision-only thinking
The Reddit pattern: many users keep thinking mode on for everything because “it gives better answers.” Thinking mode burns 5-20x the plan-quota of non-thinking responses, depending on depth.
The alternative I run: thinking only on actual decisions. My decision-mode skill (a knowledge-graph-walk skill I call /librarian) triggers thinking when the readback has multiple plausible paths and a principle has to pick between them. My adversarial-critique skill (which I call /red-team) triggers thinking when a decision is irreversible. Routine work skips thinking entirely.
Per-thinking-turn delta: 5-20x. Decisions get the thinking budget; status checks and file lookups do not.
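The trigger logic reduces to a predicate. My sketch, not a shipped feature:

```python
def wants_thinking(plausible_paths: int, irreversible: bool) -> bool:
    # /librarian case: multiple defensible paths and a principle has to pick.
    # /red-team case: the decision cannot be cheaply undone.
    # Everything else (status checks, file lookups) skips thinking entirely.
    return plausible_paths > 1 or irreversible
```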
6. Cache-invalidating mid-session versus cache discipline
The Reddit pattern: edit CLAUDE.md mid-session (“oh let me add this rule”), edit a config, push a small change. Each one invalidates the prompt cache. Next turn pays full freight.
The alternative I run: an explicit rule (sitting at ~/.claude/rules/session-management.md under a section called “Cache discipline”) that batches all substrate edits to session end. My cache stays warm across the 5-minute TTL window, dropping repeated-prefix input cost to roughly 10 percent of full. Mid-session edits go into a scratch file and apply at session close.
Per-turn delta: cache hit drops input cost roughly 10x compared to a cold prefix. Across an 8-hour session, that is the difference between hitting the wall mid-day and finishing the work.
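A sketch of the scratch-file half of that rule, assuming a session-scoped scratch path; the rule file itself is prose, the code is mine:

```python
from pathlib import Path

SCRATCH = Path.home() / ".claude" / "scratch" / "pending-edits.md"

def defer_edit(target: str, note: str) -> None:
    """Queue a substrate edit instead of invalidating the warm cache now."""
    SCRATCH.parent.mkdir(parents=True, exist_ok=True)
    with SCRATCH.open("a") as f:
        # Applied in one batch at session close, after the last cached turn.
        f.write(f"- {target}: {note}\n")
```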
7. Retry-without-diagnose versus preflight + red-team
The Reddit pattern: code fails. “Let me try again.” Fails. “Let me think harder.” Fails. “Switch to Opus.” Fails. Each retry is a full Opus turn.
The alternative I run: hard rules in my operating playbook (“do not retry the same approach without diagnosis”), a /preflight skill that runs config-reconciliation and break-even checks before risky actions, a /red-team skill that attacks proposed picks adversarially before they ship, and circuit breakers that stop after the same error appears 3 consecutive times. The principle is “diagnose root cause first, then retry intelligently.”
Per-bug delta: 3-5x fewer turns. Most retry storms come from retrying the same approach with a different model or temperature instead of changing the approach itself.
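The circuit breaker is the mechanical part. A minimal sketch of the 3-consecutive-identical-errors rule; the wiring into the agent loop is assumed:

```python
class CircuitBreaker:
    """Trips after the same error is seen `threshold` times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.last_error: str | None = None
        self.streak = 0

    def record(self, error: str) -> None:
        self.streak = self.streak + 1 if error == self.last_error else 1
        self.last_error = error
        if self.streak >= self.threshold:
            # Same failure three times: stop retrying, force a diagnosis step.
            raise RuntimeError(f"circuit open after {self.streak}x: {error}")
```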
8. Re-deriving context every conversation versus codex plus decisions-ledger
The Reddit pattern: open a new session. “OK so the project is X, I’m trying to do Y, I’ve tried Z…” 1000 tokens of context re-explaining before the actual question.
The alternative I run: a codex (the same knowledge graph from pattern 1) plus an append-only decisions ledger at ~/.claude/decisions.jsonl that records every architectural pick I make with the AI, including the red-team attack on each pick and the operator ratification. The agent already knows the project. The session starts with stable substrate loaded (CLAUDE.md + the rules files in ~/.claude/rules/, about 30K tokens total), then queries the codex for the specific question and pulls in 3-5 relevant nodes. No re-explaining.
Per-session delta: 5-10K tokens of unnecessary context never get sent. Across 5 sessions per week, that compounds.
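One ledger append, as a sketch; the field names are my guess at the shape described above (pick, red-team attack, ratification), not a published schema:

```python
import json
import time
from pathlib import Path

LEDGER = Path.home() / ".claude" / "decisions.jsonl"

def record_decision(pick: str, red_team: str, ratified: bool) -> None:
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "pick": pick,            # the architectural choice made with the AI
        "red_team": red_team,    # the strongest attack mounted against it
        "ratified": ratified,    # operator sign-off, not the model's
    }
    with LEDGER.open("a") as f:
        f.write(json.dumps(entry) + "\n")   # append-only: never rewrite history
```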
9. AI doing what hooks should do versus deterministic pre-commit hooks
The Reddit pattern: “Claude, check that I didn’t leak any secrets in this commit.” Opus reads 30 files and reports. Repeat every commit.
The alternative I run: a pre-commit hook (installed the same morning I drafted this) that runs a Python script. It blocks a bad commit before anything ever reaches Claude. Same shape for the role-classifier (a hook that reads my prompt before each turn and classifies the work as ARCHITECT or TACTICAL, locking in the appropriate discipline), the em-dash check, the output validator. These are FREE in plan-quota terms because they do not go through Claude at all.
Per-task delta: infinite. The work that runs in shell scripts costs zero plan-quota; the work that runs in Claude costs the full turn.
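A minimal sketch of the secret-scan hook; the regexes are illustrative, not exhaustive, and the wiring is the standard .git/hooks/pre-commit path, where a non-zero exit blocks the commit:

```python
#!/usr/bin/env python3
import re
import subprocess
import sys

PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # API-key-shaped strings
    re.compile(r"AKIA[0-9A-Z]{16}"),                     # AWS access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),   # PEM private keys
]

def main() -> int:
    # Scan only what is staged, additions only, zero LLM involvement.
    diff = subprocess.run(["git", "diff", "--cached", "-U0"],
                          capture_output=True, text=True).stdout
    added = [l for l in diff.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    hits = [l for l in added if any(p.search(l) for p in PATTERNS)]
    if hits:
        print("pre-commit: possible secret in staged changes:", file=sys.stderr)
        for l in hits:
            print(f"  {l[:80]}", file=sys.stderr)
        return 1   # blocks the commit; zero plan-quota spent
    return 0

if __name__ == "__main__":
    sys.exit(main())
```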
10. Sequential tool calls versus batched parallel
The Reddit pattern: “Let me check git status.” One tool call. “Now let me check git diff.” One tool call. “Now let me check git log.” One tool call. Three full agent turns.
The alternative I run: batched parallel tool calls in a single message, per the routing rule that says “make all independent calls in parallel.” Three tool calls in one message equals one turn. The agent emits all three tool_use blocks at once; the runtime executes them in parallel; the agent receives all three results in the same response.
Per-multi-step delta: 3x fewer turns. The compound effect across a session with many parallel-eligible workflows is significant.
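The shape of one batched turn, written out as a Python literal; the tool_use block format follows the Anthropic messages API, and the Bash tool name and input mirror what Claude Code emits, quoted from memory rather than a spec:

```python
# One assistant message, three tool_use blocks, one turn.
batched_turn = [
    {"type": "tool_use", "id": "t1", "name": "Bash",
     "input": {"command": "git status"}},
    {"type": "tool_use", "id": "t2", "name": "Bash",
     "input": {"command": "git diff"}},
    {"type": "tool_use", "id": "t3", "name": "Bash",
     "input": {"command": "git log --oneline -10"}},
]
# All three execute in parallel; all three results come back in the same
# response, versus three full sequential turns in the Reddit pattern.
```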
The compound math
Stack 5x context savings + 3x model-mix savings + 5x cache hit + 3x batched workflows + 2x parallel calls. The arithmetic projects roughly 450x theoretical efficiency. Real-world is more like 10-30x because the multipliers do not compound cleanly on every turn. But you only need a 5x effective multiplier to explain why you are at 32 percent plan-quota burn while r/claudecode hits limits at a fifth of the same workload.
Anthropic doubled the Claude Code 5-hour limit on 2026-05-06, so the ceiling people are hitting today is already twice what it was. Operators still hitting limits are not using 2x more compute. They are using roughly 20x what an architecturally disciplined operator uses.
What this means for you
If you are running Claude Code as a chat app and hitting the 5-hour wall on substantive work, the issue is almost never plan-tier. It is that a large fraction of the work happening inside the LLM does not belong there. Three rules:
One: do not put in your prompt what you can put on disk and query. Pattern 1 + Pattern 8 combined are the single largest multiplier.
Two: do not ask the LLM to do what a 5-line shell script can do deterministically. Pattern 9 is the highest-impact one-time fix; pre-commit hooks pay for themselves the first day.
Three: tune for plan-quota burn per useful decision, not API tokens per call. This inverts Patterns 2, 5, and 7. Use Opus on judgment calls. Use Haiku on triage. Skip thinking unless the decision deserves it.
If those rules sound contrarian, the inversion is the point.
FAQ
Five questions I have been asked since I started writing about this, each self-contained.
Do these patterns require a custom harness, or can I run them on stock Claude Code?
Most are stock features. Subagents (pattern 4) ship with Claude Code. Batched parallel tool calls (pattern 10) are documented in the prompting guide. Hooks (pattern 9) are configurable through ~/.claude/settings.json. The codex (patterns 1 and 8) is operator-built but the shape ports easily: an INDEX.json plus a folder of markdown files. The done-when delegation skill (pattern 3) and the librarian and red-team skills (pattern 7) are operator-authored skills; the patterns work without them but the discipline is harder to maintain.
What is the highest-impact pattern to start with?
Pattern 9 (deterministic hooks). The setup cost is one afternoon. The plan-quota savings compound on every future session. Pattern 1 (codex as query surface) is the second-highest-impact but requires building substrate, which takes weeks. Hooks pay back in days.
Why does the routed model mix (pattern 2) not produce more savings?
Because the savings from model-routing alone are modest under subscription pricing (roughly 3x). The runtime-versus-orchestration architecture is where the big multipliers live. Routing matters but it is not the main lever; patterns 1, 3, 4, 8, and 9 are.
Is the 32% burn number reproducible?
Yes, on the same workload mix. The variability is in workload. A session that is 80 percent codex maintenance and 20 percent shipping content burns differently than a session that is 80 percent debugging a single hard bug. The 32% number references a specific 8-hour session on 2026-05-15 that shipped 4 codex-driven blog posts, 1 skill upgrade, 1 schema fix on chudi.dev, and 1 deploy methodology fix.
How fast does the architectural shift pay off?
Pattern 9 (hooks) pays off in days. Pattern 1 (codex) pays off in weeks. Pattern 3 (done-when delegation) pays off in months as you build up the skill library. The full 5-10x compound usually surfaces around the 3-month mark; before that you are still trading off setup cost against savings.
What to do next
Three actions ordered by setup cost. Each is independent.
- Implement one deterministic pre-commit hook today. Pick the smallest validation you currently do through Claude (secret-scanning, em-dash check, type-check). Make it a shell script that runs on git commit. Measure how often the LLM asked about this check before versus after; the difference is pure plan-quota savings.
- Audit your next session for context-as-RAM patterns. Note every time you Read a file just to “give Claude context” instead of querying for a specific answer. Replace those reads with grep + targeted load. Measure plan-quota burn before and after on a comparable workload.
- List your 10 most common workflows and tag each: belongs in the LLM, belongs in a subagent, belongs in a shell script, belongs in a hook. The tag tells you where the work should live. Most builders are surprised by how many workflows belong in the last two categories.
The harder version of this work is rebuilding your workflow so the conversation is the steering wheel and 95 percent of the work happens elsewhere. Two months ago I did not have this framing. By the end of 2026 most production teams will. The window to author the framing is now.
Drafted from the May 2026 plan-quota session-burn data + the 10-pattern enumeration that surfaced during a discussion of Claude Code tuning across operator stacks. The 32% number is verified from session logs on 2026-05-15. The pre-commit-hook reference (pattern 9) was implemented the same morning. If you want the AI-visibility framework I built alongside this discipline, see chudi.dev/framework.
Sources & Further Reading
- Plan-Quota Is the Binding Constraint, Not API Tokens: Under subscription plans, the cost of an LLM call is plan-window burn, not API tokens. That changes which decisions you tune and which work belongs off-LLM entirely. A manifesto for the 5-hour rate-limit era.
- Claude Code Best Practices the Official Docs Don't Cover (2026): What I learned building 36K lines of production code with Claude Code: quality gates, multi-agent orchestration, and the workflow patterns that ship.
- Claude Code Hooks Caught a Secret Leak Before I Shipped It: Stop accidental secret leaks with Claude Code hooks. Learn 4 production patterns to validate, format, and gate commands before execution.
What do you think?
I post about this stuff on LinkedIn every day and the conversations there are great. If this post sparked a thought, I'd love to hear it.
Discuss on LinkedIn