How I Built a 4,000-Line Production Trading Bot With Claude Code


I built Polyphemus — an autonomous Polymarket trading bot — in 6 weeks with Claude Code. Here are the 5 principles that made it work, and the $340/month mistake that taught me them.

Chudi Nnorukam
Mar 7, 2026 · 5 min read

I finished Polyphemus in 6 weeks. It’s a fully autonomous Polymarket trading bot. 4,000+ lines, Kelly Criterion position sizing, real money on the line. I couldn’t have shipped it without Claude Code. I also wasted $340 in one month using it wrong before I figured out what actually works.

This is not a “Claude Code tips” post. This is a case study of building a production system — what went wrong first, what I built to fix it, and the five principles that came out the other side.

The Context Problem Nobody Talks About

Every Claude Code guide starts with CLAUDE.md tips. Not wrong, just backwards.

The first thing I had to understand was not “how do I write better prompts.” It was: why does Claude get dumber as my project gets bigger?

The answer is the context window. Every session loads your project into Claude’s working memory. As your codebase grows, that memory fills up faster. By week three of Polyphemus, I was spending 20 minutes at the start of each session re-explaining context Claude had lost. By month one, my token bill was $340.

Here are the five things that fixed it.

Principle 1: Context is a resource. Manage it like one.

Most developers treat Claude’s context like RAM — give it everything and let it sort out what’s relevant. This is the most expensive mistake you can make.

Tiered context loading:

Tier 1 — Always loaded (under 500 tokens): CLAUDE.md at project root. What the project is, file structure, conventions. Nothing else. The map, not the territory.

Tier 2 — Per-session (under 1,000 tokens): A CURRENT_TASK.md file. What you’re building today, what files are involved, what “done” looks like.

Tier 3 — On demand: Specific files, loaded explicitly. “Read src/core/kelly.py before we start.”
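
For example, a Tier 2 file might look like this (illustrative — the headings are a convention, not a requirement; adapt them to your project):

```markdown
# CURRENT_TASK.md

## Today
Add a circuit breaker to the execution module.

## Files involved
- src/execution/orders.py
- src/core/config.py

## Done means
- Trading pauses after 3 consecutive losses
- Pause state survives a restart
- A test covers the pause/resume path
```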

Result: average session token usage dropped from ~10,000 to ~4,200 tokens. 58% reduction from one workflow change.

The rule that makes Tier 3 work: never reference a file by name without loading it first. “Update the execution module” produces hallucination. “Read src/execution/orders.py, then update the retry logic” produces accurate output.

Principle 2: Claude’s built-in memory is better than manual note-taking

Here’s something most guides miss: Claude Code has two memory systems, not one.

CLAUDE.md — what you write, manually, for persistent rules and project context.

Auto Memory — what Claude writes itself. When you correct Claude, it records that correction. Next session, it already knows.

I wasted two weeks maintaining a sprawling set of markdown notes before I discovered this. I was carefully updating files that Claude was already tracking more accurately through auto memory.

What auto memory doesn’t do: strategic decisions. If you’ve chosen PostgreSQL over SQLite for a reason, write that in CLAUDE.md. Auto memory captures patterns. CLAUDE.md captures architecture.

The CLAUDE.md that actually worked for Polyphemus:

# Polyphemus — Claude Context

## What this is
Autonomous Polymarket trading bot. Real money. Kelly Criterion sizing. 
Never lets an exception stop the main loop.

## Hard rules
- Never hardcode API keys. Doppler only.
- All amounts in USDC, not cents. One violation cost a real trade.
- Log every trade decision with rationale BEFORE executing.
- MAX_POSITION_SIZE is a ceiling, not a suggestion.

## What we are NOT doing
- No async. Sync is predictable.
- No ML models. Signal threshold is a float.
- No framework for the trading loop. Too much magic.

300 tokens. That’s it. Short CLAUDE.md, accurate auto memory, clean context.
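
For concreteness, here’s what a hard rule like “MAX_POSITION_SIZE is a ceiling, not a suggestion” looks like when it meets Kelly sizing. This is a hypothetical sketch, not the actual Polyphemus module — the names and the cap value are illustrative:

```python
# Hypothetical sketch of Kelly Criterion sizing with a hard cap.
# MAX_POSITION_SIZE and the functions below are illustrative, not Polyphemus code.

MAX_POSITION_SIZE = 150.0  # USDC ceiling per trade — a ceiling, not a suggestion


def kelly_fraction(p: float, b: float) -> float:
    """Kelly fraction f* = (b*p - q) / b, where p is the win probability,
    q = 1 - p, and b is the net odds (profit per unit staked)."""
    q = 1.0 - p
    return (b * p - q) / b


def position_size(bankroll: float, p: float, b: float) -> float:
    """Stake in USDC: Kelly fraction of bankroll, clamped to [0, MAX_POSITION_SIZE]."""
    f = kelly_fraction(p, b)
    if f <= 0:
        return 0.0  # no edge, no trade
    return min(bankroll * f, MAX_POSITION_SIZE)
```

With a $1,000 bankroll, a 60% win probability, and even odds, full Kelly says stake $200 — and the cap hands back $150 instead. That clamp is the kind of invariant worth spelling out in CLAUDE.md, because Claude will otherwise happily “optimize” it away.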

Principle 3: Plan Mode before you write a single line

This is the feature I wish I’d known about in week one.

Claude Code’s Plan Mode (/plan) lets Claude research your codebase without making changes. It reads, thinks, and proposes. You approve. Then it executes.

Without plan mode, Claude once wrote 200 lines of code that conflicted with an architectural decision buried in a different file. Confident. Wrong. Two hours lost.

With plan mode on anything touching more than two files:

Me: /plan Add circuit breaker to execution module.
    Pause trading after 3 consecutive losses.

Claude: [researches without touching anything]
        Proposed approach: [plan]
        Files: src/execution/orders.py, src/core/state.py
        I noticed MAX_LOSS_DAILY in src/core/config.py — 
        should the circuit breaker integrate with that?

Me: Yes, but use config module, not state.py.

Claude: Understood. Implementing now...

Claude caught an integration point I hadn’t mentioned. I redirected before any code was written. Plan mode costs nothing and consistently saves hours.

Principle 4: Two gates, not one

Every piece of code Claude writes passes two gates before it runs.

Gate 1 is automated. A bash script: type checks, linting, tests. 30 seconds. Catches ~60% of errors.

#!/bin/bash
python -m mypy . --ignore-missing-imports && echo "✓ Types" || exit 1
python -m ruff check . && echo "✓ Lint" || exit 1
python -m pytest tests/ -q && echo "✓ Tests" || exit 1

If Gate 1 fails, paste the error back: “Fix only what’s causing this error — nothing else.” That last sentence matters. Without it, Claude fixes the error and refactors three other things.
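
The paste-back loop can be semi-automated. Here’s a sketch of a Python wrapper (hypothetical — not the bash script above, and the check commands assume mypy, ruff, and pytest are installed) that runs the same checks and returns the first failure’s output, ready to paste:

```python
# Hypothetical Gate 1 wrapper: run checks in order, capture the first
# failure's output so it can be pasted back to Claude verbatim.
import subprocess
import sys

CHECKS = [
    ("Types", [sys.executable, "-m", "mypy", ".", "--ignore-missing-imports"]),
    ("Lint", [sys.executable, "-m", "ruff", "check", "."]),
    ("Tests", [sys.executable, "-m", "pytest", "tests/", "-q"]),
]


def run_gate1(checks=CHECKS):
    """Run each check in order. Return (name, combined output) for the
    first failing check, or None if every check passed."""
    for name, cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return name, result.stdout + result.stderr
        print(f"✓ {name}")
    return None

# As a CLI entry point, you would do something like:
#   failure = run_gate1()
#   if failure:
#       print(failure[1]); sys.exit(1)
```

The captured output plus the sentence “Fix only what’s causing this error — nothing else” is the whole feedback loop.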

Gate 2 is a 6-question checklist. Five minutes, manual, non-negotiable:

1. Does this do exactly what I asked — not more?
2. Are external API calls using the correct endpoints?
3. Is error handling present on every I/O operation?
4. Are there hardcoded values that should be env vars?
5. Does this break anything that was already working?
6. Can I explain every line if someone asks me tomorrow?

Question 6 catches the most issues. If I can’t explain a line, I don’t ship it.

Before two gates: 1 error per 6 Claude outputs reached production. After two gates: 1 error per 40 Claude outputs.

On a trading bot, that’s the difference between an incident and a boring afternoon.

Principle 5: Treat compaction like a power outage — plan for it

Compaction happens when Claude’s window fills up. You lose nuance. Sometimes you lose decisions.

At the end of every meaningful session, one prompt:

We're wrapping up. Please:
1. Update CURRENT_TASK.md with current state
2. Add new decisions to CLAUDE.md's decisions section
3. Write a 2-sentence next-session starter — what the next 
   instance of you needs to know to resume immediately

That third item is the key. Claude writing handoff notes for Claude produces better handoffs than I can write myself. When a new session starts:

Read CLAUDE.md and the "Next session" section of CURRENT_TASK.md.
Confirm your understanding before we continue.

90 seconds. Full speed.

The Numbers, 6 Weeks Later

  • Lines of code: 4,247
  • Claude Code sessions: ~180
  • API cost, months 1-3: $408 (unoptimized first half)
  • API cost, optimized: $136/month ongoing
  • Errors reaching production: 4 (all caught by reconciliation, none lost money)
  • Test coverage: 73%
  • Uptime since December: 99.2%

The system made the difference, not the prompts. Claude Code is a force multiplier — but only if you have a system. Without one, it’s an expensive way to ship buggy code faster.

These five principles are the foundation. They’re enough to go from burning $340/month to shipping production systems.

What I Didn’t Cover Here

There’s more in the full system I use daily: hooks that run Gate 1 automatically after every file write (no manual step), subagents routing cheap tasks to smaller models (44% cost reduction), agent teams for parallel feature development, checkpointing for safe architectural experiments, and MCP servers giving Claude direct access to the live database for debugging.

That’s in the Claude Code Guide: Advanced Edition — the full playbook, including the complete Polyphemus architecture walkthrough.

If this case study was useful, the best thing you can do is send it to one developer still burning money using Claude Code without a system.

— Chudi

hello@chudi.dev | chudi.dev

FAQ

How long does it take to build a production bot with Claude Code?

Polyphemus took 6 weeks at roughly 3-4 hours per day. The first two weeks were unstructured and inefficient. The last four weeks, after implementing proper context management and two-gate verification, were significantly more productive. The bottleneck was the system, not the tool.

What is tiered context loading in Claude Code?

Tiered context loading means giving Claude only the information it needs for the current task. Tier 1 is the project identity (CLAUDE.md, under 500 tokens). Tier 2 is the session task (CURRENT_TASK.md, under 1,000 tokens). Tier 3 is specific files, loaded explicitly on demand. This approach reduced my average session token usage from ~10,000 to ~4,200 tokens.

What is Claude Code auto memory?

Auto memory is Claude Code's built-in system where Claude writes its own notes based on your corrections and preferences. When you tell Claude it made a mistake, it records that learning and applies it in future sessions. It handles preference learning automatically, so your CLAUDE.md only needs to contain architecture decisions and hard rules.

How do you prevent Claude Code from making expensive mistakes in production code?

Two-gate verification. Gate 1 is automated — a bash script running type checks, linting, and tests that fires after every Claude output. Gate 2 is a 6-question manual checklist covering correctness, API endpoints, error handling, hardcoded values, regression risk, and explainability. Before implementing this, 1 in 6 Claude outputs reached production with an error. After: 1 in 40.

What happens when Claude Code compacts context?

Compaction discards nuance to free up context window space. The mitigation is a pre-compaction ritual: a prompt that tells Claude to update CURRENT_TASK.md with current state and write a 2-sentence next-session starter for the next Claude instance. This starter — written by Claude, for Claude — recovers context in about 90 seconds at the start of the next session.

Is Claude Code worth using for financial/trading applications?

Yes, with strict guardrails. The two-gate system is non-negotiable for financial code. I also added a security hook that blocks Claude from running bash commands containing database modification patterns, and I use plan mode before touching any execution logic. Polyphemus has been running in production since December with 99.2% uptime and 4 errors caught by reconciliation, none of which lost money.
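
The security hook is worth sketching. This is my reading of Claude Code’s PreToolUse hook contract (the pending tool call arrives as JSON on stdin; exiting with code 2 blocks it and feeds stderr back to Claude) — verify the field names and exit-code semantics against the current hooks documentation before relying on it:

```python
# Sketch of a PreToolUse hook that blocks bash commands containing
# destructive database patterns. Field names ("tool_name", "tool_input",
# "command") are assumptions from the hooks docs — verify before use.
import json
import re
import sys

# Patterns that look like destructive database operations.
BLOCKED_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\bDELETE\s+FROM\b",
    r"\bTRUNCATE\b",
    r"\bUPDATE\s+\w+\s+SET\b",
]


def is_blocked(command: str) -> bool:
    """True if the command matches any destructive DB pattern (case-insensitive)."""
    return any(re.search(p, command, re.IGNORECASE) for p in BLOCKED_PATTERNS)


def check(event: dict) -> int:
    """Return the hook exit code for a PreToolUse event: 2 blocks, 0 allows."""
    if event.get("tool_name") != "Bash":
        return 0  # only inspect bash commands
    command = event.get("tool_input", {}).get("command", "")
    if is_blocked(command):
        print(f"Blocked destructive DB command: {command}", file=sys.stderr)
        return 2  # exit code 2 blocks the tool call
    return 0

# Wired as a hook, the entry point would be:
#   sys.exit(check(json.load(sys.stdin)))
```

A denylist like this is a backstop, not a guarantee — the real protection is still the two gates and plan mode.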
