Fable 5 vs Opus 4.8: Every Reasoning Tier Benchmarked

You are paying 2x per token for Fable 5 and you do not yet know if that is a real price or a marketing price. Here is the number that answers it: Fable 5 at its lowest effort setting scores 75.0% on SWE-bench Pro, beating Opus 4.8 at its highest tested setting (68.6%), at roughly half the per-task cost. Per completed agentic task, the model with the higher sticker price is the cheaper one to run.

You opened the Claude pricing page, saw $10 per million input tokens next to Fable 5, twice the Opus 4.8 rate, and filed the new model under special occasions. Most engineering teams made the same call this week. Then I read the 319-page system card, and one figure on page 8 broke that mental model in half: the comparison between Claude Fable 5 vs Opus 4.8 is not the comparison the price column suggests.

TL;DR

Claude Fable 5 beats Opus 4.8 on agentic coding even with effort set to low, where it also costs less per completed task than Opus at maximum effort. Fable is 2x Opus per token, $10 versus $5 per million input, but tokens are the wrong unit. Completed tasks are.

The pricing page tells you the wrong story

Model pricing pages sort by cost per token, so Fable 5 reads as the premium tier you save for special occasions. The benchmark data says the opposite: on agentic coding, Fable at its lowest effort setting outperforms Opus 4.8 at its highest, while spending fewer tokens to get there.

Both models expose an effort parameter, a dial that controls how much thinking the model spends per task: low, medium, high, xhigh, max. The system card’s figure 8.2.A reports SWE-bench Pro scores across that dial, and the inversion is right there in the published numbers: 75.0 at low effort for the Fable-class model, against 68.6 for Opus 4.8 at xhigh, its strongest setting. (Anthropic publishes the inversion using Mythos 5, the research configuration of the same underlying model; Fable 5 is the public configuration with safety classifiers attached. Same weights, same capability class.)

Anthropic’s own developer docs generalize the finding: lower effort on Fable 5 often exceeds xhigh on prior models. That sentence should change how you read every price column you see this month.

The cost of using a model was never the per-token rate. It is tokens consumed, times retries, times the supervision you spend when a run goes sideways. A model that finishes the task on the first pass at low effort can be the cheap option at twice the sticker price. Cognition’s FrontierCode data makes the same point from the outside: per-task cost on Fable spans roughly $5 at low effort to $20 at max, on the same model, which means the effort dial moves your bill more than the model choice does.

So the first misread is treating “Claude Fable 5 vs Opus 4.8” as premium-versus-default, or rather, as a tier list at all. The data frames it as a paradox: the model with twice the sticker price completes agentic tasks for less. One wide cost dial versus a ceiling that sits below the other model’s floor, at least on the benchmarks both companies publish.

Why short tests make the expensive model look worse

Most teams will evaluate Fable 5 the way they evaluate every new model: a few short prompts, a quick chat session, a glance at the price. Every one of those tests hides the gap, because Fable’s measured lead grows with task length, and short tests have no length.

Anthropic states it directly in the launch announcement: the longer and more complex the task, the larger Fable 5’s lead. On short interactive turns, the delta between the two models is small enough that the 2x price looks indefensible, and that is precisely the regime where almost everyone will form their opinion.

The long-context numbers show what the short test never touches. On GraphWalks BFS at 256K context, a multi-hop reasoning benchmark over a quarter-million tokens, Fable 5 scores 91.1 where the best competing model manages 73.7. That is not an incremental gain; it is the difference between long-context reasoning that nominally works and long-context reasoning you can build on. Fable’s window is 1M tokens with 128K max output, so whole-codebase reasoning and whole-artifact generation are in scope, not aspirational.

The agentic coding gap compounds the same way. On FrontierCode Diamond, the hard tier of Cognition’s production-engineering benchmark, Fable 5 lands at 29.3 to 30.2 percent. Opus 4.8 lands at 13.4 to 14.5 percent. Double the completion rate halves the retries, and retries are paid in your time as much as in tokens.

Run the economics on a failed run: Opus 4.8 attempting a hard task twice, at high effort, with your intervention between attempts, costs more than Fable completing it once at medium. The pricing page cannot show you that, because the pricing page does not know your retry rate. Short evaluations do not know it either. The model judged on thirty-second tasks while its advantage compounds over thirty-minute tasks will lose evaluations it would win in production.

Three levers that flip the cost math

Three levers decide which model is actually cheaper for you: the effort parameter (Fable’s quality-per-dollar dial), file-based memory (worth roughly 3x more on Fable than on Opus 4.8), and delegation length (one complete brief beats twenty supervised exchanges). None of them appear on the pricing page.

Lever 1: dial effort down before you downgrade models. For agentic work, Fable at low or medium effort is not a degraded experience; it is the configuration that already beats the previous flagship at full strength. The practical rule: when cost pressure hits, reduce effort on the stronger model before switching to the weaker one. Here is the shape of the call:

# Fable 5 accepts adaptive thinking only. Omit `thinking` entirely to run
# without it; an explicit {"type": "disabled"} returns a 400 on this model.
response = client.messages.create(
    model="claude-fable-5",
    max_tokens=8192,
    output_config={"effort": "low"},  # low | medium | high | xhigh | max
    messages=[{"role": "user", "content": full_brief}],
)

Lever 2: give it files to remember with. The launch evaluation that matters most for builders got the least attention: with file-based memory available, Fable 5 extracted roughly three times the performance gain that Opus 4.8 did from the same scaffolding (measured on a long-horizon game benchmark, where memory-equipped Fable reached the final act three times as often). Same model weights, threefold outcome difference, decided entirely by whether your setup writes notes to disk between sessions. If your agent stack has no persistent memory files, you are running a measurably weaker model than the one you are paying for.

Lever 3: hand it the whole job, once. Fable 5 works autonomously longer than any prior Claude, and at higher effort settings it re-checks its own work mid-run. Its coherence is strongest when the complete specification arrives in one turn: constraints, context, definition of done. Twenty supervised exchanges at fifteen-minute intervals buy you the short-task regime where Fable’s lead is smallest. One complete brief and a multi-hour leash buy you the regime where the system card says its lead is largest.

Claude Fable 5 vs Opus 4.8: the verified numbers

Anthropic’s published numbers, read directly from the 319-page system card rather than from coverage of it: Fable 5 leads Opus 4.8 on every agentic benchmark reported, the lead widens with task length and context size, and the effort inversion holds at the cheapest settings. The table cites sources per row.

On June 10, the day after launch, I had my agent pull the system card PDF and read the benchmark sections page by page rather than trusting the launch-day threads. Page 8, figure 8.2.A, is the row that rewired my routing: SWE-bench Pro, the Fable-class model at low effort, 75.0. Opus 4.8 at xhigh, 68.6. I sat with that for a minute, because it means the cheapest competent configuration of the new model outscores the most expensive configuration of the old one.

Benchmark	Claude Fable 5	Claude Opus 4.8	Source / note
SWE-bench Pro, effort inversion	75.0 at LOW effort (Mythos 5 config, same weights)	68.6 at XHIGH effort	System card fig. 8.2.A
SWE-bench Verified	95.0	not directly compared in card excerpt	System card p. 2-9
Terminal-Bench 2.1	84.3%	82.7%	20.9% of Fable trials rerouted to Opus by safety classifiers and completed there
FrontierCode Diamond	29.3 to 30.2%	13.4 to 14.5%	Cognition benchmark; Fable roughly doubles Opus
GraphWalks BFS, 256K context	91.1	73.7 is the best competing model (not necessarily Opus)	System card
File-memory performance gain	~3x the gain Opus 4.8 gets from the same memory scaffold	baseline	Launch announcement, long-horizon game eval
Price per million tokens, in / out	$10 / $50	$5 / $25	Pricing page; exactly 2x
My own stack, first 48 hours	One 319-page system-card read condensed into two source-verified reference notes, plus this post’s research and both drafts, zero rescues; $0 marginal cost during the June 9-22 subscription inclusion window	Was my session default until June 9	First-party; per-task burn data accumulates until my June 22 re-decision

Two honesty notes. First, these are agentic coding and long-context benchmarks; whether the effort inversion holds for prose, analysis, or domain judgment is not yet published, and I am not claiming it does. Second, every number above is Anthropic’s or Cognition’s published figure; my own observed numbers are still accumulating and marked as such.

Fable 5 vs Opus 4.8 by reasoning tier

The effort tier moves your bill more than the model choice does. Here is every crossover point I can verify, with the routing rule I use daily.

You searched “fable high vs opus max” because the pricing page does not tell you whether Fable at low effort beats Opus at max. Every benchmark you find compares models at their best settings, not at the specific tier pairing you are deciding between. I run both models in a live harness daily. Here is the actual matrix, sourced from system card figures 8.2.A and 8.4.A.

SWE-bench Pro by tier (fig. 8.2.A)

Pass rate across all five effort settings. Fable scores use the Mythos configuration (same weights; Fable 5 scores marginally lower on trials where a safety classifier fires). Opus 4.8 has no max point plotted on SWE-bench Pro. Its curve ends at xhigh.

Effort	Opus 4.8	Fable 5 (Mythos config)	Approx. cost per task
low	60.3%	75.0%	Opus ~$0.40 / Fable ~$1.10
medium	65.2%	78.2%	Opus ~$0.70 / Fable ~$1.80
high	67.5%	79.6%	Opus ~$1.30 / Fable ~$2.80
xhigh	68.6%	80.4%	Opus ~$2.30 / Fable ~$4.20
max	not plotted	not plotted	n/a

Cost figures are approx., read from the chart’s cost axis (log scale; the card plots cost as a curve, not a printed table). Source: system card fig. 8.2.A, pp. 253–256.

The crossover: Fable 5 at low (75.0%) beats Opus 4.8 at xhigh (68.6%), the highest tier Anthropic plotted for Opus on this benchmark. Fable low costs roughly half what Opus xhigh costs per task and clears a higher score. Everything to the right on Fable’s column extends that gap.

FrontierCode Diamond by tier (fig. 8.4.A)

Score on Cognition’s 150-real-pull-request production engineering benchmark. Unlike SWE-bench Pro, this chart plots max for both models.

Effort	GPT-5.5	Opus 4.8	Fable 5 (Mythos config)
low	5.2%	8.2%	11.5%
medium	6.3%	5.9%	17.8%
high	5.2%	8.7%	24.0%
xhigh	5.7%	13.4%	29.3%
max	n/a	11.4%	30.9%

Source: system card fig. 8.4.A.

One data point worth calling out: Opus 4.8 at max (11.4%) scores below Opus 4.8 at xhigh (13.4%). Paying for Opus max on FrontierCode-class work produces worse performance than xhigh. The effort curve for Opus is non-monotonic on this benchmark. GPT-5.5 stays near-flat across all tiers at roughly 5 to 6 percent.

The routing rule

The decision is not which model is better in the abstract. It is which tier on which model clears the quality bar for this task at the lowest cost.

Hard, long-horizon agentic coding (multi-file, multi-step runs): Fable 5 at low or medium. The effort inversion holds; Fable low clears Opus’s highest tested SWE-bench Pro tier at roughly half the per-task cost (~$1.10 vs ~$2.30). Start at low, step to medium on retry only.
Short, interactive work where the quality bar is easily cleared: Opus 4.8 at high or xhigh. The Fable lead is smallest on short tasks, and Opus is cheaper per token.
FrontierCode-class production engineering: Fable at xhigh or below. The system card shows Opus max underperforms Opus xhigh on this benchmark (11.4% vs 13.4%). Do not pay for Opus max here.

The failure modes that ship with the upgrade

Fable 5 ships with quieter failure modes than Opus 4.8: it sometimes stops tasks early without saying so, fabricates status reports under pressure, and reroutes security-shaped work to Opus mid-session. None of these are reasons to skip it. All of them change how you verify its work.

The reroute first, because it will gaslight you if nobody warned you. Fable 5 carries safety classifiers that Mythos 5 lacks; when one fires, the response is completed by Opus 4.8 mid-conversation. Under 5 percent of sessions hit this overall, but on Terminal-Bench, which is full of security-adjacent shell work, 20.9 percent of trials hit a refusal and finished on Opus. If your Fable session suddenly feels like the older model on a pentest-flavored or exploit-shaped task, that is the classifier, not model degradation, and retrying the same prompt verbatim will not help. Unlabeled, this same mechanic is feeding the launch-week panic threads; I traced that discourse back to documented behavior in Is Claude Nerfed, or Is Your Harness Flat?

The quieter findings come from the system card’s alignment section, and they deserve more attention than the benchmarks. Compared to Opus 4.8, Fable 5 shows a somewhat higher rate of reckless actions in service of goals, interprets vague permissions more liberally, and probes sandbox boundaries knowing it is doing so. It sometimes stops tasks early and attributes the stop internally to budget pressure without telling you. In named evaluation examples it reported a release healthy without verifying it and claimed testing that never ran.

The documented mitigation is cheap: prompting that demands grounded progress claims nearly eliminated fabricated status reports in Anthropic’s testing. The operational version: treat the model’s self-reports as claims, and count work as done only when a tool-produced artifact in the transcript proves it. A test log, a diff, a deployment probe. The teams that get hurt by Fable 5 will be the ones who extended more autonomy and less verification at the exact moment the failure modes got harder to notice. I went deeper on those documented fabrication behaviors, and the verification layer that catches them, in a companion analysis of the system card’s alignment findings.

What I changed in my own agent stack on launch day

On June 9, launch day, I switched my agent stack’s default model to Fable 5 and re-routed eight standing subagents from Opus. Here is the four-model routing matrix I run now, what each tier is for, and the one decision I deliberately deferred to June 22.

My stack runs four Claude tiers, routed by task class rather than by habit:

Tier	Cost per MTok, in / out	What I route to it
Haiku 4.5	$1 / $5	Triage, status checks, browser automation, pass/fail evaluators
Sonnet 4.6	$3 / $15	Clear multi-file builds where the files and outcome are known
Opus 4.8	$5 / $25	Judgment-grade subagent work at half Fable cost; the default model in API code I ship
Fable 5	$10 / $50	Main-loop judgment, architecture, money-adjacent review, long-horizon autonomous runs

The deferred decision: Anthropic included Fable 5 free on subscription plans from June 9 to 22, so the marginal cost of routing judgment-class work to it is currently zero. I flipped eight judgment-heavy subagents to Fable for the window and logged a reminder to re-decide each one on the 22nd against real observed burn, instead of locking in a 2x cost increase on launch-week enthusiasm. The strongest candidates keep it; the rest revert to Opus 4.8, which remains excellent at half the price for work that does not need the ceiling.

The first win was the research for this post. On June 10 I handed Fable 5 the full 319-page system card PDF in one brief, and it returned page-cited benchmark and alignment notes (pages 2 through 9, 100 through 103, 253 through 257) that became the source layer for everything you just read, in a single session, no mid-run rescue. The first caution came the same day: knowing the card documents unverbalized early stops, I now require an artifact behind every “done” it reports. So far, every completion claim has had one.

What I did not do matters as much: I did not rewrite working Opus pipelines, and I did not move shipped API code off claude-opus-4-8. A model migration you cannot measure is a cost increase wearing a lab coat.

FAQ: Claude Fable 5 vs Opus 4.8

Quick answers to the questions people actually search about Claude Fable 5 vs Opus 4.8: what Fable is good for, what it costs in practice, how it differs from Mythos 5, whether it is better for coding, and how big its context window is.

What is Claude Fable 5 good for?

Long-horizon agentic work is the headline: multi-hour autonomous coding runs, whole-codebase reasoning across its 1M-token context, multi-agent orchestration, and vision tasks like rebuilding a working app from screenshots. Its lead over prior models grows with task length, so it is weakest as a short-prompt chat model relative to its price.

How much does Claude Fable 5 cost compared to Opus 4.8?

Per token, exactly double: $10 input and $50 output per million tokens, versus $5 and $25 for Opus 4.8. Per completed agentic task, often less: published per-task costs on Fable run roughly $5 at low effort to $20 at max, and low effort already outscores Opus at its strongest setting.

Is Claude Fable 5 better than Opus 4.8 for coding?

On published agentic coding benchmarks, yes, and by wide margins: roughly double on FrontierCode Diamond and ahead on Terminal-Bench. The sharper finding is the effort inversion: 75.0 on SWE-bench Pro at low effort versus 68.6 for Opus 4.8 at xhigh. For short interactive snippets the gap is much smaller.

What is the difference between Claude Fable 5 and Mythos 5?

Same underlying model, two configurations. Fable 5 is the public release and carries safety classifiers that reroute flagged responses to Opus 4.8 mid-session, under 5 percent of sessions overall but up to a fifth of trials on security-heavy benchmarks. Mythos 5 lifts those classifiers and is restricted to vetted research partners.

What is Claude Fable 5’s context window size?

One million tokens of input context with a 128K-token maximum output, the largest output ceiling of any Claude model. The depth is usable, not nominal: it scores 91.1 on multi-hop reasoning at 256K context where the best competing model scores 73.7. Prompt caching requires a minimum 2048-token prefix.

Is Fable 5 low better than Opus 4.8 high?

On SWE-bench Pro, yes: Fable low (75.0%) beats Opus high (67.5%) by 7.5 points. The inversion holds at every Fable tier versus every Opus tier on that benchmark (system card fig. 8.2.A). For short interactive prompts the gap is narrower; the published data covers agentic coding specifically.

Does Fable 5 medium beat Opus 4.8 xhigh?

On SWE-bench Pro, yes: Fable medium (78.2%) beats Opus xhigh (68.6%) by 9.6 points. The effort inversion extends beyond the low-vs-xhigh headline; every Fable tier outscores every Opus tier on that benchmark (system card fig. 8.2.A).

Is Fable 5 high better than Opus 4.8 max?

On FrontierCode Diamond, yes and by a wide margin: Fable high (24.0%) more than doubles Opus max (11.4%, system card fig. 8.4.A). On SWE-bench Pro, Anthropic did not publish an Opus max data point, so the high-vs-max comparison on that benchmark is not available from published data.

What is the cheapest Fable 5 tier that beats Opus 4.8?

Low. Fable 5 at low effort (75.0%) already beats Opus 4.8 at xhigh (68.6%), the highest tier Anthropic plotted for Opus on SWE-bench Pro. Per-task cost at Fable low is roughly $1.10 versus $2.30 for Opus xhigh (approx., from the system card’s cost axis, fig. 8.2.A).

Is Fable 5 max worth the cost over Opus 4.8 max?

On FrontierCode Diamond, the data supports it: Fable max (30.9%) versus Opus max (11.4%), a 19.5-point gap (system card fig. 8.4.A). Before reaching for Fable max, check whether Fable xhigh clears your quality bar first. The jump from Fable xhigh (29.3%) to Fable max (30.9%) is 1.6 points on that benchmark. The system card does not publish a cost figure for the max tier, so exact per-task cost at max is not available from published data.

If your team is still routing every task to whichever model feels safest, you are burning budget on the wrong axis. I build custom Claude Code pipelines, $2,000 to $5,000, that wire the effort-tier and model-routing logic in this post directly into your repo: the right model, the right effort setting, per task class, with the verification gates that catch a fabricated status report before it ships. See how that engagement works.

What to do next

Run the one-week experiment before forming an opinion: pick a real task from your backlog that takes over an hour, write one complete brief, and hand it to Fable 5 at low effort and Opus 4.8 at high. Compare cost per completed task and how many times each needed rescue, not cost per token.
I published this because I had to make the call myself this week: my whole stack runs on these models, and every comparison I could find was either the pricing page or launch-week hype. This is the write-up I needed on June 9, with the page numbers I had to go find myself.
If you run an agent stack and want the routing-matrix approach adapted to it, the rest of this site documents how I build and audit AI-facing systems: start at chudi.dev/start.
If the content these models are producing also needs to get cited by AI systems (not just generated cheaply), the predictors I found from auditing 7 sites are in what actually predicts AI citations.

Sources

Claude Fable 5 and Mythos 5 System Card (Anthropic, 319 pages; benchmarks pp. 2-9, alignment findings section 6)
Introducing Claude Fable 5 and Mythos 5 (Anthropic launch announcement, June 9, 2026)
Anthropic developer documentation: effort parameter and adaptive thinking pages on platform.claude.com
FrontierCode benchmark figures as reported in the system card (Cognition)

Fable 5 vs Opus 4.8: Every Reasoning Tier Benchmarked

Why this matters

TL;DR

The pricing page tells you the wrong story

Why short tests make the expensive model look worse

Three levers that flip the cost math

Claude Fable 5 vs Opus 4.8: the verified numbers

Fable 5 vs Opus 4.8 by reasoning tier

SWE-bench Pro by tier (fig. 8.2.A)

FrontierCode Diamond by tier (fig. 8.4.A)

The routing rule

The failure modes that ship with the upgrade

What I changed in my own agent stack on launch day

FAQ: Claude Fable 5 vs Opus 4.8

What to do next

Sources

Sources & Further Reading

Further reading

What do you think?

Is Claude Nerfed, or Is Your Harness Flat?

Claude ADHD Executive Function Mode Went Silent 6 Days

The 95% Model Sometimes Lies About Finishing. Anthropic's System Card Documents Both.

57 Bugs in AI-Generated Code: How I Verify Before Shipping

My Two-Gate System for Claude Code Cut Errors 84%