How I Built a 4,000-Line Production Trading Bot With Claude Code


I built Polyphemus — an autonomous Polymarket trading bot — in 6 weeks with Claude Code. Here are the 5 principles that made it work, and the $340/month mistake that taught me them.

Chudi Nnorukam
Mar 7, 2026 · 5 min read

I finished Polyphemus in 6 weeks. It’s a fully autonomous Polymarket trading bot. 4,000+ lines, Kelly Criterion position sizing, real money on the line. I couldn’t have shipped it without Claude Code. I also wasted $340 in one month using it wrong before I figured out what actually works.

This is not a “Claude Code tips” post. This is a case study of building a production system — what went wrong first, what I built to fix it, and the five principles that came out the other side.

The Context Problem Nobody Talks About

Every Claude Code guide starts with CLAUDE.md tips. Not wrong, just backwards.

The first thing I had to understand was not “how do I write better prompts.” It was: why does Claude get dumber as my project gets bigger?

The answer is the context window. Every session loads your project into Claude’s working memory. As your codebase grows, that memory fills up faster. By week three of Polyphemus, I was spending 20 minutes at the start of each session re-explaining context Claude had lost. By month one, my token bill was $340.

Here are the five things that fixed it.

Principle 1: Context is a resource. Manage it like one.

Most developers treat Claude’s context like RAM — give it everything and let it sort out what’s relevant. This is the most expensive mistake you can make.

Tiered context loading:

Tier 1 — Always loaded (under 500 tokens): CLAUDE.md at project root. What the project is, file structure, conventions. Nothing else. The map, not the territory.

Tier 2 — Per-session (under 1,000 tokens): A CURRENT_TASK.md file. What you’re building today, what files are involved, what “done” looks like.

Tier 3 — On demand: Specific files, loaded explicitly. “Read src/core/kelly.py before we start.”
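
For example, a Tier 2 file might look like this (illustrative — the headings are a convention, not a requirement; adapt them to your project):

```markdown
# CURRENT_TASK.md

## Today
Add a circuit breaker to the execution module.

## Files involved
- src/execution/orders.py
- src/core/config.py

## Done means
- Trading pauses after 3 consecutive losses
- Pause state survives a restart
- A test covers the pause/resume path
```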

Result: average session token usage dropped from ~10,000 to ~4,200 tokens. 58% reduction from one workflow change.

The rule that makes Tier 3 work: never reference a file by name without loading it first. “Update the execution module” produces hallucination. “Read src/execution/orders.py, then update the retry logic” produces accurate output.

Principle 2: Claude’s built-in memory is better than manual note-taking

Here’s something most guides miss: Claude Code has two memory systems, not one.

CLAUDE.md — what you write, manually, for persistent rules and project context.

Auto Memory — what Claude writes itself. When you correct Claude, it records that correction. Next session, it already knows.

I wasted two weeks maintaining a sprawling set of markdown notes before I discovered this. I was carefully updating files that Claude was already tracking more accurately through auto memory.

What auto memory doesn’t do: strategic decisions. If you’ve chosen PostgreSQL over SQLite for a reason, write that in CLAUDE.md. Auto memory captures patterns. CLAUDE.md captures architecture.

The CLAUDE.md that actually worked for Polyphemus:

# Polyphemus — Claude Context

## What this is
Autonomous Polymarket trading bot. Real money. Kelly Criterion sizing. 
Never lets an exception stop the main loop.

## Hard rules
- Never hardcode API keys. Doppler only.
- All amounts in USDC, not cents. One violation cost a real trade.
- Log every trade decision with rationale BEFORE executing.
- MAX_POSITION_SIZE is a ceiling, not a suggestion.

## What we are NOT doing
- No async. Sync is predictable.
- No ML models. Signal threshold is a float.
- No framework for the trading loop. Too much magic.

300 tokens. That’s it. Short CLAUDE.md, accurate auto memory, clean context.
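
For concreteness, here’s what a hard rule like “MAX_POSITION_SIZE is a ceiling, not a suggestion” looks like when it meets Kelly sizing. This is a hypothetical sketch, not the actual Polyphemus module — the names and the cap value are illustrative:

```python
# Hypothetical sketch of Kelly Criterion sizing with a hard cap.
# MAX_POSITION_SIZE and the functions below are illustrative, not Polyphemus code.

MAX_POSITION_SIZE = 150.0  # USDC ceiling per trade — a ceiling, not a suggestion


def kelly_fraction(p: float, b: float) -> float:
    """Kelly fraction f* = (b*p - q) / b, where p is the win probability,
    q = 1 - p, and b is the net odds (profit per unit staked)."""
    q = 1.0 - p
    return (b * p - q) / b


def position_size(bankroll: float, p: float, b: float) -> float:
    """Stake in USDC: Kelly fraction of bankroll, clamped to [0, MAX_POSITION_SIZE]."""
    f = kelly_fraction(p, b)
    if f <= 0:
        return 0.0  # no edge, no trade
    return min(bankroll * f, MAX_POSITION_SIZE)
```

With a $1,000 bankroll, a 60% win probability, and even odds, full Kelly says stake $200 — and the cap hands back $150 instead. That clamp is the kind of invariant worth spelling out in CLAUDE.md, because Claude will otherwise happily “optimize” it away.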

Principle 3: Plan Mode before you write a single line

This is the feature I wish I’d known about in week one.

Claude Code’s Plan Mode (/plan) lets Claude research your codebase without making changes. It reads, thinks, and proposes. You approve. Then it executes.

Without plan mode, Claude once wrote 200 lines of code that conflicted with an architectural decision buried in a different file. Confident. Wrong. Two hours lost.

With plan mode on anything touching more than two files:

Me: /plan Add circuit breaker to execution module.
    Pause trading after 3 consecutive losses.

Claude: [researches without touching anything]
        Proposed approach: [plan]
        Files: src/execution/orders.py, src/core/state.py
        I noticed MAX_LOSS_DAILY in src/core/config.py — 
        should the circuit breaker integrate with that?

Me: Yes, but use config module, not state.py.

Claude: Understood. Implementing now...

Claude caught an integration point I hadn’t mentioned. I redirected before any code was written. Plan mode costs nothing and consistently saves hours.

Principle 4: Two gates, not one

Every piece of code Claude writes passes two gates before it runs.

Gate 1 is automated. A bash script: type checks, linting, tests. 30 seconds. Catches ~60% of errors.

#!/bin/bash
python -m mypy . --ignore-missing-imports && echo "✓ Types" || exit 1
python -m ruff check . && echo "✓ Lint" || exit 1
python -m pytest tests/ -q && echo "✓ Tests" || exit 1

If Gate 1 fails, paste the error back: “Fix only what’s causing this error — nothing else.” That last sentence matters. Without it, Claude fixes the error and refactors three other things.
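
The paste-back loop can be semi-automated. Here’s a sketch of a Python wrapper (hypothetical — not the bash script above, and the check commands assume mypy, ruff, and pytest are installed) that runs the same checks and returns the first failure’s output, ready to paste:

```python
# Hypothetical Gate 1 wrapper: run checks in order, capture the first
# failure's output so it can be pasted back to Claude verbatim.
import subprocess
import sys

CHECKS = [
    ("Types", [sys.executable, "-m", "mypy", ".", "--ignore-missing-imports"]),
    ("Lint", [sys.executable, "-m", "ruff", "check", "."]),
    ("Tests", [sys.executable, "-m", "pytest", "tests/", "-q"]),
]


def run_gate1(checks=CHECKS):
    """Run each check in order. Return (name, combined output) for the
    first failing check, or None if every check passed."""
    for name, cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return name, result.stdout + result.stderr
        print(f"✓ {name}")
    return None

# As a CLI entry point, you would do something like:
#   failure = run_gate1()
#   if failure:
#       print(failure[1]); sys.exit(1)
```

The captured output plus the sentence “Fix only what’s causing this error — nothing else” is the whole feedback loop.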

Gate 2 is a 6-question checklist. Five minutes, manual, non-negotiable:

1. Does this do exactly what I asked — not more?
2. Are external API calls using the correct endpoints?
3. Is error handling present on every I/O operation?
4. Are there hardcoded values that should be env vars?
5. Does this break anything that was already working?
6. Can I explain every line if someone asks me tomorrow?

Question 6 catches the most issues. If I can’t explain a line, I don’t ship it.

Before two gates: 1 error per 6 Claude outputs reached production. After two gates: 1 error per 40 Claude outputs.

On a trading bot, that’s the difference between an incident and a boring afternoon.

Principle 5: Treat compaction like a power outage — plan for it

Compaction happens when Claude’s window fills up. You lose nuance. Sometimes you lose decisions.

At the end of every meaningful session, one prompt:

We're wrapping up. Please:
1. Update CURRENT_TASK.md with current state
2. Add new decisions to CLAUDE.md's decisions section
3. Write a 2-sentence next-session starter — what the next 
   instance of you needs to know to resume immediately

That third item is the key. Claude writing handoff notes for Claude produces better handoffs than I can write myself. When a new session starts:

Read CLAUDE.md and the "Next session" section of CURRENT_TASK.md.
Confirm your understanding before we continue.

90 seconds. Full speed.

The Numbers, 6 Weeks Later

  • Lines of code: 4,247
  • Claude Code sessions: ~180
  • API cost, months 1-3: $408 (unoptimized first half)
  • API cost, optimized: $136/month ongoing
  • Errors reaching production: 4 (all caught by reconciliation, none lost money)
  • Test coverage: 73%
  • Uptime since December: 99.2%

The system made the difference, not the prompts. Claude Code is a force multiplier — but only if you have a system. Without one, it’s an expensive way to ship buggy code faster.

These five principles are the foundation. They’re enough to go from burning $340/month to shipping production systems.

What I Didn’t Cover Here

There’s more in the full system I use daily: hooks that run Gate 1 automatically after every file write (no manual step), subagents routing cheap tasks to smaller models (44% cost reduction), agent teams for parallel feature development, checkpointing for safe architectural experiments, and MCP servers giving Claude direct access to the live database for debugging.

That’s in the Claude Code Guide: Advanced Edition — the full playbook, including the complete Polyphemus architecture walkthrough.

If this case study was useful, the best thing you can do is send it to one developer still burning money using Claude Code without a system.

— Chudi

hello@chudi.dev | chudi.dev

FAQ

How long does it take to build a production bot with Claude Code?

Polyphemus took 6 weeks at roughly 3-4 hours per day. The first two weeks were unstructured and inefficient. The last four weeks, after implementing proper context management and two-gate verification, were significantly more productive. The bottleneck was the system, not the tool.

What is tiered context loading in Claude Code?

Tiered context loading means giving Claude only the information it needs for the current task. Tier 1 is the project identity (CLAUDE.md, under 500 tokens). Tier 2 is the session task (CURRENT_TASK.md, under 1,000 tokens). Tier 3 is specific files, loaded explicitly on demand. This approach reduced my average session token usage from ~10,000 to ~4,200 tokens.

What is Claude Code auto memory?

Auto memory is Claude Code's built-in system where Claude writes its own notes based on your corrections and preferences. When you tell Claude it made a mistake, it records that learning and applies it in future sessions. It handles preference learning automatically, so your CLAUDE.md only needs to contain architecture decisions and hard rules.

How do you prevent Claude Code from making expensive mistakes in production code?

Two-gate verification. Gate 1 is automated — a bash script running type checks, linting, and tests that fires after every Claude output. Gate 2 is a 6-question manual checklist covering correctness, API endpoints, error handling, hardcoded values, regression risk, and explainability. Before implementing this, 1 in 6 Claude outputs reached production with an error. After: 1 in 40.

What happens when Claude Code compacts context?

Compaction discards nuance to free up context window space. The mitigation is a pre-compaction ritual: a prompt that tells Claude to update CURRENT_TASK.md with current state and write a 2-sentence next-session starter for the next Claude instance. This starter — written by Claude, for Claude — recovers context in about 90 seconds at the start of the next session.

Is Claude Code worth using for financial/trading applications?

Yes, with strict guardrails. The two-gate system is non-negotiable for financial code. I also added a security hook that blocks Claude from running bash commands containing database modification patterns, and I use plan mode before touching any execution logic. Polyphemus has been running in production since December with 99.2% uptime and 4 errors caught by reconciliation, none of which lost money.
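
The security hook is worth sketching. This is my reading of Claude Code’s PreToolUse hook contract (the pending tool call arrives as JSON on stdin; exiting with code 2 blocks it and feeds stderr back to Claude) — verify the field names and exit-code semantics against the current hooks documentation before relying on it:

```python
# Sketch of a PreToolUse hook that blocks bash commands containing
# destructive database patterns. Field names ("tool_name", "tool_input",
# "command") are assumptions from the hooks docs — verify before use.
import json
import re
import sys

# Patterns that look like destructive database operations.
BLOCKED_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\bDELETE\s+FROM\b",
    r"\bTRUNCATE\b",
    r"\bUPDATE\s+\w+\s+SET\b",
]


def is_blocked(command: str) -> bool:
    """True if the command matches any destructive DB pattern (case-insensitive)."""
    return any(re.search(p, command, re.IGNORECASE) for p in BLOCKED_PATTERNS)


def check(event: dict) -> int:
    """Return the hook exit code for a PreToolUse event: 2 blocks, 0 allows."""
    if event.get("tool_name") != "Bash":
        return 0  # only inspect bash commands
    command = event.get("tool_input", {}).get("command", "")
    if is_blocked(command):
        print(f"Blocked destructive DB command: {command}", file=sys.stderr)
        return 2  # exit code 2 blocks the tool call
    return 0

# Wired as a hook, the entry point would be:
#   sys.exit(check(json.load(sys.stdin)))
```

A denylist like this is a backstop, not a guarantee — the real protection is still the two gates and plan mode.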
