How I Built a 36,000-Line Production Trading Bot With Claude Code

Chudi Nnorukam Mar 7, 2026 Updated Apr 14, 2026 8 min read

Why this matters

I built Polyphemus with Claude Code — 4,000 lines in 6 weeks, now 36,000 lines 4 months later. The same 5 principles that cut my error rate 84% also scaled the codebase 8.5x without entropy. Plus the shadow-first deployment methodology that caught 8 silent production bugs before they cost real money.

Cluster context

This article sits inside Python Agent Infrastructure.

Deploy, manage, and scale Python agents on VPS infrastructure you control. From $6 Droplets to specialized trading VPS.

Most Python agent tutorials stop at "works on my laptop." This cluster covers the deployment gap: VPS setup, process supervision, cost optimization, and when to upgrade.

I finished the first version of Polyphemus in 6 weeks. Fully autonomous Polymarket trading bot. 4,000+ lines, real money on the line. Couldn’t have shipped it without Claude Code. I also wasted $340 in one month using it wrong.

Four months later, the same codebase is 36,000+ lines. Two live instances running pair arbitrage on BTC/ETH/SOL/XRP. Strategy evolved from directional to market-neutral. Eight silent production bugs found and killed before they touched live capital. All caught by a shadow-first deployment gate I built after month two.

The five principles that got it to 4,000 lines are the same ones that got it to 36,000 without entropy. Here’s the case study, updated April 2026.

Why Does Claude Get Dumber as Your Project Gets Bigger?

The bigger your codebase grows, the faster Claude’s context window fills up. Each session becomes less useful because the window is full of stale context instead of relevant code. By week three of Polyphemus, I was spending 20 minutes re-explaining context Claude had already “seen.” By month one, my token bill hit $340. The problem isn’t Claude. It’s the missing system around it.

Every Claude Code guide starts with CLAUDE.md tips. Not wrong, just backwards.

The first thing I had to understand was not “how do I write better prompts.” It was: why does Claude get dumber as my project gets bigger?

The answer is the context window. Every session loads your project into Claude’s working memory. As your codebase grows, that memory fills up faster. By week three, I was re-explaining decisions Claude had made with me two days earlier. Architecture choices, naming conventions, API patterns: all gone. I was paying for Claude to relearn the project I’d already taught it.

That’s not a Claude problem. That’s a system problem. And the symptoms compound fast: re-explaining context is demoralizing, it produces worse outputs because you’re summarizing instead of being precise, and eventually you stop correcting Claude because it feels pointless. You start accepting mediocre outputs. You start doing the “hard parts” manually. You’ve turned an AI assistant into an expensive autocomplete. By month one, I was close to giving up on the whole approach.

Here are the five things that fixed it.

Principle 1: Context is a Resource. Manage it Like One.

Most developers treat Claude’s context window like unlimited RAM: load everything, let it sort out what matters. That approach more than doubled my token spend and produced hallucinations on files Claude “remembered” but didn’t actually have in scope. The fix is three tiers: always-loaded project identity (under 500 tokens), per-session task file (under 1,000 tokens), and explicit on-demand file loading. Nothing else.

Tiered context loading in practice:

Tier 1. Always loaded (under 500 tokens). CLAUDE.md at project root. What the project is, file structure, conventions. Nothing else. The map, not the territory.

Tier 2. Per-session (under 1,000 tokens). A CURRENT_TASK.md file. What you’re building today, what files are involved, what “done” looks like.

Tier 3. On demand. Specific files, loaded explicitly. “Read src/core/kelly.py before we start.”
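
The Tier 2 file can be as small as this — a hypothetical template, not the actual Polyphemus file:

```markdown
# Current task: add retry logic to order placement

## Files involved
- src/execution/orders.py
- tests/test_orders.py

## Done means
- Failed orders retry up to 3 times with backoff
- Every retry is logged with the original error
```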

Result: average session token usage dropped from ~10,000 to ~4,200 tokens. 58% reduction from one workflow change.

The rule that makes Tier 3 work: never reference a file by name without loading it first. “Update the execution module” produces hallucination. “Read src/execution/orders.py, then update the retry logic” produces accurate output.

Principle 2: Claude’s Built-in Memory is Better Than Manual Note-Taking

Claude Code has two memory systems: the CLAUDE.md file you write by hand, and Auto Memory, which Claude writes itself based on corrections you make. Most developers only use the first. Using both cuts the manual overhead of session management by more than half and produces more accurate recall than notes you wrote yourself.

I wasted two weeks maintaining a sprawling set of markdown notes before I discovered this. I was carefully updating files that Claude was already tracking more accurately through auto memory.

What auto memory doesn’t do: strategic decisions. If you’ve chosen PostgreSQL over SQLite for a reason, write that in CLAUDE.md. Auto memory captures patterns. CLAUDE.md captures architecture.

The CLAUDE.md that actually worked for Polyphemus:

# Polyphemus — Claude Context

## What this is
Autonomous Polymarket trading bot. Real money. Kelly Criterion sizing. 
Never lets an exception stop the main loop.

## Hard rules
- Never hardcode API keys. Doppler only.
- All amounts in USDC, not cents. One violation cost a real trade.
- Log every trade decision with rationale BEFORE executing.
- MAX_POSITION_SIZE is a ceiling, not a suggestion.

## What we are NOT doing
- No async. Sync is predictable.
- No ML models. Signal threshold is a float.
- No framework for the trading loop. Too much magic.

300 tokens. That’s it. Short CLAUDE.md, accurate auto memory, clean context.
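The “ceiling, not a suggestion” rule is cheap to enforce in code. A minimal sketch of Kelly-style sizing with a hard cap — the function names and the `MAX_POSITION_SIZE` value here are illustrative assumptions, not Polyphemus’s actual implementation:

```python
# Minimal sketch: Kelly-fraction sizing with a hard ceiling.
# Names and the cap value are illustrative, not the real bot's code.

MAX_POSITION_SIZE = 0.05  # ceiling as a fraction of bankroll (assumed value)

def kelly_fraction(p_win: float, odds: float) -> float:
    """Classic Kelly: f* = p - (1 - p) / b, where b is net odds."""
    return p_win - (1.0 - p_win) / odds

def position_size(bankroll_usdc: float, p_win: float, odds: float) -> float:
    """Position size in USDC: never negative, never above the ceiling."""
    f = kelly_fraction(p_win, odds)
    f = max(0.0, min(f, MAX_POSITION_SIZE))  # the ceiling is enforced, not suggested
    return bankroll_usdc * f
```

The `min` clamp is the whole point: a strong signal produces a large Kelly fraction, and the ceiling overrides it anyway.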

Principle 3: Plan Mode Before You Write a Single Line

Plan mode (/plan) lets Claude research your codebase and propose an approach without making any changes. You review the plan, redirect if needed, then approve. On any task touching more than two files, this single step eliminates the most expensive class of Claude mistake: confident, multi-file output that conflicts with existing architecture.

Without plan mode, Claude wrote 200 lines of code conflicting with an architectural decision buried in a different file. Confident. Wrong. Two hours lost.

With plan mode on anything touching more than two files:

Me: /plan Add circuit breaker to execution module.
    Pause trading after 3 consecutive losses.

Claude: [researches without touching anything]
        Proposed approach: [plan]
        Files: src/execution/orders.py, src/core/state.py
        I noticed MAX_LOSS_DAILY in src/core/config.py —
        should the circuit breaker integrate with that?

Me: Yes, but use config module, not state.py.

Claude: Understood. Implementing now...

Claude caught an integration point I hadn’t mentioned. I redirected before any code was written. Plan mode costs nothing and consistently saves hours.
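The circuit breaker from that exchange fits in a few lines. A sketch under assumed names (the real module also integrates with the daily-loss config, per the plan dialogue):

```python
# Sketch of the circuit breaker discussed above: pause trading after
# 3 consecutive losses. Class and method names are illustrative.

class CircuitBreaker:
    def __init__(self, max_consecutive_losses: int = 3):
        self.max_consecutive_losses = max_consecutive_losses
        self.consecutive_losses = 0

    def record_trade(self, pnl_usdc: float) -> None:
        # A win or flat trade resets the streak; a loss extends it.
        if pnl_usdc < 0:
            self.consecutive_losses += 1
        else:
            self.consecutive_losses = 0

    @property
    def tripped(self) -> bool:
        # When tripped, the main loop should stop placing orders.
        return self.consecutive_losses >= self.max_consecutive_losses
```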

Principle 4: Two Gates, Not One

Two-gate verification means every Claude output clears an automated check (type checks, linting, tests in under 30 seconds) before it gets a human review pass using a fixed 6-question checklist. Before this system, 1 in 6 Claude outputs reached production with an error. After: 1 in 40. On a trading bot, that gap is the difference between an incident log and a boring afternoon.

Gate 1 is automated. A bash script: type checks, linting, tests. 30 seconds. Catches ~60% of errors.

#!/bin/bash
python -m mypy . --ignore-missing-imports && echo "✓ Types" || exit 1
python -m ruff check . && echo "✓ Lint" || exit 1
python -m pytest tests/ -q && echo "✓ Tests" || exit 1

If Gate 1 fails, paste the error back: “Fix only what’s causing this error. Nothing else.” That last sentence matters. Without it, Claude fixes the error and refactors three other things.

Gate 2 is a 6-question checklist. Five minutes, manual, non-negotiable:

1. Does this do exactly what I asked — not more?
2. Are external API calls using the correct endpoints?
3. Is error handling present on every async/IO operation?
4. Are there hardcoded values that should be env vars?
5. Does this break anything that was already working?
6. Can I explain every line if someone asks me tomorrow?

Question 6 catches the most issues. If I can’t explain a line, I don’t ship it.
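Question 4 is the easiest one to partially automate. A rough heuristic scan — the regex patterns are assumptions to tune for your own codebase, not a complete secret scanner:

```python
# Rough heuristic for Gate 2, question 4: flag string literals that look
# like secrets or endpoints and probably belong in env vars (or Doppler).
# The patterns are assumptions — tune them to your codebase.
import re

SUSPECT = re.compile(
    r"""(api_key|secret|password|token)\s*=\s*["'][^"']+["']"""
    r"""|["']https?://[^"']+["']""",
    re.IGNORECASE,
)

def find_hardcoded(source: str) -> list[str]:
    """Return the lines that look like hardcoded config."""
    return [line.strip() for line in source.splitlines() if SUSPECT.search(line)]
```

A grep like this never replaces the manual pass; it just makes question 4 a ten-second check instead of a line-by-line read.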

Principle 5: Treat Compaction Like a Power Outage. Plan for It.

When Claude’s context window fills, it compacts: nuance gets discarded, recent decisions disappear, and the next response starts from a degraded state. The fix is a pre-compaction ritual at the end of every meaningful session. One prompt to update CURRENT_TASK.md, record new decisions, and write a 2-sentence handoff note for the next Claude instance. Recovery time: 90 seconds.

At the end of every meaningful session, one prompt:

We're wrapping up. Please:
1. Update CURRENT_TASK.md with current state
2. Add new decisions to CLAUDE.md's decisions section
3. Write a 2-sentence next-session starter — what the next 
   instance of you needs to know to resume immediately

That third item is the key. Claude writing handoff notes for Claude produces better handoffs than I can write myself. When a new session starts:

Read CLAUDE.md and the "Next session" section of CURRENT_TASK.md.
Confirm your understanding before we continue.

90 seconds. Full speed.

The Numbers, Updated April 2026

These aren’t projections. This is the actual state of a production system that started at 4,247 lines in December 2025 and hit 36,096 lines four months later, running pair arbitrage on BTC/ETH/SOL/XRP. Every number below is from a live system, not a benchmark.

| Metric | Before system | After system |
| --- | --- | --- |
| Lines of code | 4,247 (Dec 2025) | 36,096 (Apr 2026) |
| Avg session tokens | ~10,000 | ~4,200 |
| API cost/month | ~$340 (month 1, unoptimized) | $136 ongoing |
| Error rate to production | 1 per 6 outputs | 1 per 40 outputs |
| Silent bugs caught by shadow gate | 0 | 8 (none hit live capital) |
| Test coverage | 41% | 73% |
| Uptime since December | N/A | 99.2% |
| Claude Code sessions | ~180 (Dec) | 400+ (Apr) |

The system rather than the prompts made the difference. Claude Code is a force multiplier. Without a system, it’s an expensive way to ship buggy code faster. For a deeper look at verification workflows, see my post on evidence-based AI code verification.

The 6th principle I’d add today: shadow-first before every live deploy. Run a dry-run instance in parallel. Collect evidence. Gate on n_completed >= 50 before promoting. In April alone, that gate caught a bug where set("BTC") was returning {'B','T','C'} instead of {'BTC'}. The bot would have traded the wrong assets live. Eight bugs like that. Zero P&L damage.
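That bug is a classic Python footgun, worth seeing in two lines:

```python
# The class of bug the shadow gate caught: set() over a string iterates
# the string's characters, while a one-element set literal does what you meant.
assets = "BTC"

wrong = set(assets)  # iterates characters -> {'B', 'T', 'C'}
right = {assets}     # one-element set     -> {'BTC'}

assert wrong == {"B", "T", "C"}
assert right == {"BTC"}
```

In a dry run this shows up as evidence (trades against assets named “B”, “T”, “C”); in production it would have shown up as losses.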

Every principle here was learned the hard way: real bugs, real money at risk, real debugging sessions at 2am on a QuantVPS SSH terminal. I also built a self-improving RAG system that captures these learnings automatically so future Claude sessions don’t repeat past mistakes.

What I Didn’t Cover Here

The five principles in this post are the foundation. There’s a full advanced layer above them: hooks that run Gate 1 automatically after every file write, subagents routing cheap tasks to smaller models, agent teams for parallel feature development, and MCP servers giving Claude direct live database access. Each of these requires the foundation to be working first.

The full advanced system is in the Claude Code Guide: Advanced Edition. It includes hooks that run Gate 1 automatically after every file write (no manual step), subagents routing cheap tasks to smaller models (44% cost reduction with progressive context loading), agent teams for parallel feature development, checkpointing for safe architectural experiments, and MCP servers giving Claude direct access to the live database for debugging. You need the foundation before the advanced layer is useful.

The guide includes the complete Polyphemus architecture walkthrough. If you’re deploying your own bot, here’s my step-by-step VPS setup on DigitalOcean.

If this case study was useful, the best thing you can do is send it to one developer still burning money using Claude Code without a system.

— Chudi

hello@chudi.dev | chudi.dev

FAQ

How long does it take to build a production bot with Claude Code?

Polyphemus took 6 weeks at roughly 3-4 hours per day. The first two weeks were unstructured and inefficient. The last four weeks, after implementing proper context management and two-gate verification, were significantly more productive. The bottleneck was the missing system, not the tool.

What is tiered context loading in Claude Code?

Tiered context loading means giving Claude only the information it needs for the current task. Tier 1 is the project identity (CLAUDE.md, under 500 tokens). Tier 2 is the session task (CURRENT_TASK.md, under 1,000 tokens). Tier 3 is specific files, loaded explicitly on demand. This approach reduced my average session token usage from ~10,000 to ~4,200 tokens.

What is Claude Code auto memory?

Auto memory is Claude Code's built-in system where Claude writes its own notes based on your corrections and preferences. When you tell Claude it made a mistake, it records that learning and applies it in future sessions. It handles preference learning automatically, so your CLAUDE.md only needs to contain architecture decisions and hard rules.

How do you prevent Claude Code from making expensive mistakes in production code?

Two-gate verification. Gate 1 is automated — a bash script running type checks, linting, and tests that fires after every Claude output. Gate 2 is a 6-question manual checklist covering correctness, API endpoints, error handling, hardcoded values, regression risk, and explainability. Before implementing this, 1 in 6 Claude outputs reached production with an error. After: 1 in 40.

What happens when Claude Code compacts context?

Compaction discards nuance to free up context window space. The mitigation is a pre-compaction ritual: a prompt that tells Claude to update CURRENT_TASK.md with current state and write a 2-sentence next-session starter for the next Claude instance. This starter — written by Claude, for Claude — recovers context in about 90 seconds at the start of the next session.

Is Claude Code worth using for financial/trading applications?

Yes, with strict guardrails. The two-gate system is non-negotiable for financial code. I also added a security hook that blocks Claude from running bash commands containing database modification patterns, and I use plan mode before touching any execution logic. Polyphemus has grown from 4,247 lines to 36,000+ lines since December, with 99.2% uptime and 8 silent bugs caught by the shadow-first gate before they reached live capital.


What do you think?

I post about this stuff on LinkedIn every day and the conversations there are great. If this post sparked a thought, I'd love to hear it.

Discuss on LinkedIn