Sprint Retro: What Claude Code and I Actually Shipped (And What Broke)
A real engineering retrospective on 10 days of human+AI pair programming — 319 commits, 3 production bugs, 1 billing error that inflated costs 47x, and what we're doing differently next sprint.
Claude (Opus 4.6) — Co-author, code review, parallel codebase exploration
Governed by curate-me.ai
This is not a hype piece
I've read a hundred "I built X with AI" blog posts. They all follow the same template: here's a cool thing, here's how fast I did it, AI is amazing, the end.
This is not that post. This is a real sprint retrospective — the same format I'd use with a human team — covering 10 days of building an AI governance platform with Claude Code as my primary pair programmer. Real numbers from git. Real bugs we shipped to production. Real lessons we're applying next sprint.
If you're considering using AI coding assistants for serious engineering work, this is the post I wish I'd read first.
The numbers
The project is a monorepo: FastAPI backend (1,599 Python files), Next.js dashboard (1,519 TypeScript files), 105 gateway modules, SDKs, CLI, documentation. It started January 8, 2026 as a fashion recommendation app. By March it pivoted to a B2B AI governance gateway. The 10 days covered here span Sprint 2 — fleet management, security hardening, and a major model swap.
What we actually shipped
Sprint 2 — Feature commits by category
The biggest block was fleet management — features for running and coordinating multiple AI agents across containers. Autoscaling policies, rolling updates with canary deployment, distributed trace propagation, fleet-wide command execution. This is the stuff competitors like Portkey and Helicone don't have.
The second biggest was security. We added seccomp/AppArmor profiles to container sandboxes, encrypted VNC streams with TLS, hardened Guacamole credentials, added recording size validation to prevent event bombs, and fixed a PowerShell injection vector in the Windows desktop runtime.
What went wrong
Here's where it gets honest. Three bugs hit production. Code review caught plenty more before merge, but these three got through.
Bug 1: governance.py syntax error (deployed broken)
The governance.py incident
This one hurts. The gateway's core governance file — the 6-step policy chain that every request passes through — had a syntax error for 2 days and we didn't know. It never ran because Docker's layer cache served the old working image. Only when we triggered a full rebuild for the model swap did Python actually try to parse the file and crash.
Root cause: The sliding-window rate limiter PR merged cleanly according to git, but the result was old fixed-window code interleaved with new sliding-window code. Logger calls missing closing parentheses. Variables referenced before definition. A complete mess hiding behind a green merge check.
What we're doing differently: Run python -c "import py_compile; py_compile.compile('file.py', doraise=True)" on every changed .py file in CI. It takes 0.1 seconds and catches syntax errors that survive merge. We should have had this from day one.
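The check is tiny enough to sketch in full. This is a minimal standalone version, not our actual CI script — the file and function names are mine, and it assumes the pipeline compares against an `origin/main` base branch:

```python
# ci_syntax_check.py - sketch of a pre-merge syntax gate.
# py_compile raises PyCompileError on any syntax error, including the
# merge-artifact class of bug that broke governance.py.
import py_compile
import subprocess
import sys

def changed_python_files(base: str = "origin/main") -> list[str]:
    """List .py files changed relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD", "--", "*.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def syntax_failures(files: list[str]) -> list[str]:
    """Return the subset of files that fail to compile."""
    bad = []
    for path in files:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            print(exc, file=sys.stderr)
            bad.append(path)
    return bad
```

Wiring `sys.exit(1 if syntax_failures(changed_python_files()) else 0)` in before the Docker build step is the whole integration.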
Bug 2: cost calculation 47x overcharge
This is the one that scared me.
Cost calculation before and after fix
| Metric | Before fix | After fix | Difference |
|---|---|---|---|
| Actual cost per request | $0.003552 | $0.000075 | 47x overcharge |
| Pricing source | Default fallback ($3/$15) | Step 3.5 Flash ($0.10/$0.30) | Correct model |
| Reasoning tokens | 0 (not captured) | 128 | Now tracked |
When we swapped the fleet's budget model from MiniMax M2.5 to StepFun Step 3.5 Flash, the gateway's cost calculator couldn't find the new model in its pricing table. Why? The model name arrives from OpenRouter as stepfun/step-3.5-flash (with vendor prefix), but our local pricing table stores step-3.5-flash (bare name). The lookup failed, litellm didn't know it either, so it fell through to the default pricing: $3 input / $15 output per million tokens.
The real price is $0.10 input / $0.30 output. Our customers would have been seeing costs 47x higher than reality in their dashboard.
Root cause: The get_model_pricing() function assumed model names arrive in bare format. When we added OpenRouter routing, model names gained vendor prefixes. Nobody updated the pricing lookup to strip them.
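A minimal sketch of the fix. The pricing table here is illustrative — only the Step 3.5 Flash row and the fallback values come from our real numbers, and `get_model_pricing()` is simplified down to the lookup logic:

```python
# Illustrative per-million-token USD prices, keyed by bare model names.
PRICING = {
    "step-3.5-flash": {"input": 0.10, "output": 0.30},
}
DEFAULT_PRICING = {"input": 3.00, "output": 15.00}  # the fallback that caused the overcharge

def get_model_pricing(model: str) -> dict:
    """Resolve pricing for a model name that may carry a vendor prefix.

    OpenRouter returns names like "stepfun/step-3.5-flash" while the local
    table stores bare names, so try the full name first, then strip
    everything up to the last "/" before falling back to default pricing.
    """
    bare = model.rsplit("/", 1)[-1]
    return PRICING.get(model) or PRICING.get(bare) or DEFAULT_PRICING
```

With the bare-name fallback in place, both `stepfun/step-3.5-flash` and `step-3.5-flash` resolve to the same row instead of the $3/$15 default.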
What we're doing differently: Every model swap now requires a "cost verification test" — make one real request, check the logged cost against the known per-token price, and confirm they match within 5%. We added this to the gateway smoke test script.
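The verification itself is only a few lines. This sketch uses hypothetical helper names (not our smoke-test script) and assumes per-million-token pricing:

```python
def expected_cost(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float) -> float:
    """Expected USD cost given per-million-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

def cost_matches(logged: float, expected: float, tolerance: float = 0.05) -> bool:
    """True if the logged cost is within `tolerance` (5%) of the expected cost."""
    if expected == 0:
        return logged == 0
    return abs(logged - expected) / expected <= tolerance
```

The gate is then one real request, one `expected_cost` computation from the published prices, and a `cost_matches` assertion before the deploy proceeds.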
Bug 3: reasoning tokens silently dropped
Step 3.5 Flash is a reasoning model — it uses "thinking tokens" that cost as much as output tokens. Our cost recorder was ignoring them entirely because it only looked at completion_tokens in the response. The reasoning_tokens field in completion_tokens_details was discarded.
This means even after fixing the pricing lookup, we were still undercharging because we weren't counting the thinking tokens.
What we're doing differently: The cost recorder now extracts reasoning_tokens (and cache_creation_input_tokens, cache_read_input_tokens) from both streaming and non-streaming responses. We log thinking tokens separately so they're visible in the dashboard.
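In sketch form, the extraction looks like this. The nesting of reasoning tokens under `completion_tokens_details` follows the OpenAI-style usage schema, and the cache fields follow the Anthropic-style names mentioned above; treat the exact payload shape as an assumption, not a verified dump of our gateway:

```python
def extract_token_counts(usage: dict) -> dict:
    """Pull every billable token count from a response's usage payload."""
    details = usage.get("completion_tokens_details") or {}
    return {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        # Priced like output tokens, but previously discarded entirely:
        "reasoning_tokens": details.get("reasoning_tokens", 0),
        "cache_creation_input_tokens": usage.get("cache_creation_input_tokens", 0),
        "cache_read_input_tokens": usage.get("cache_read_input_tokens", 0),
    }
```

The `or {}` guards matter: non-reasoning models often omit `completion_tokens_details` entirely, and a bare `.get(...).get(...)` chain would crash on `None`.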
The model swap: a case study in blast radius
Swapping one model sounds simple. Change a string, redeploy, done. Here's what it actually touched:
Files modified for MiniMax → StepFun model swap
23 platform files. 15 ref-app files. 56 runner templates updated. The model string was hardcoded in fleet presets, runner templates, the budget tier config, the model catalog, pricing tables, and test fixtures. Changing it required understanding every place a model identifier flows through the system.
This is a governance problem, not a code problem. If we had a single source of truth for "the budget model" — one config that everything reads — this would have been a one-line change. Instead it was a day of archaeology. We're building that config now.
What Claude Code is actually good at (and what it isn't)
After 10 days of intensive pair programming, here's my honest assessment.
Where it excelled
Parallel exploration: When I needed to find every reference to MiniMax across 1,599 Python files and 1,519 TypeScript files, Claude Code launched 3 search agents simultaneously. One grepped the backend, one the dashboard, one the docs. Results came back in seconds. Doing this manually would have taken 20 minutes of grep -r and context-switching.
Mechanical refactoring: The model swap required the same pattern applied 56 times across runner templates: find the model field, replace the value, update the description. Claude Code did this without getting bored, without typos, without missing the 47th occurrence.
Code review: When reviewing our own changes, Claude Code caught 6 real issues:
- A duplicate React import that would fail the build
- Missing TypeScript type casts for fleet template generics
- A broken cost-anomaly-status endpoint returning 500
- Stale Docker container references in deploy scripts
- Race conditions in the Lua-based rate limit counters (INCR/EXPIRE not atomic)
These aren't style nits — they're bugs that would have hit production.
Where it struggled
Merge conflict reasoning: The governance.py incident happened because the merge looked clean to git. Claude Code didn't catch it either — it reviewed the PR diff, which showed the sliding-window additions, but didn't notice the old code was still present. Merge conflicts that survive git's resolution are invisible to diff-based review.
Cost awareness: Claude Code doesn't naturally think about the blast radius of model changes. It would happily swap a model string without asking "what about the pricing table? What about the cost calculator? What about test assertions that check costs?" I had to prompt explicitly: "can you do a round of review and then let's fix anything from the review." That round of review is what found the 47x billing bug.
Knowing when to stop: On several occasions, Claude Code would keep adding improvements beyond what I asked for. Extra error handling, additional validation, documentation comments. Each addition is individually reasonable but collectively they create noise in the diff, making review harder and increasing the chance of introducing bugs in "improved" code that was fine before.
What this means for how we build the platform
This retrospective isn't just navel-gazing. Every lesson here maps to a feature we're building or improving in the curate-me.ai governance platform.
Lesson → Platform feature
| What went wrong | Platform feature that prevents it | Status |
|---|---|---|
| Syntax error deployed for 2 days | Gateway smoke test (scripts/gateway-smoke-test.sh --production) | Shipped |
| Cost calculation off by 47x | Cost anomaly alerting (multi-channel delivery) | Shipped (#765) |
| Reasoning tokens silently dropped | Per-span token breakdown in Observer SDK | Shipped |
| Model swap touched 46 files | Model alias registry (single source of truth) | Shipped |
| No cost verification after model change | Evaluation framework with cost assertions | Shipped (#790) |
| Merge broke governance chain silently | Governance chain health check endpoint | Shipped |
We're dogfooding. The bugs we find in our own workflow become the features our customers get. A governance platform that can't govern its own development process has no business selling governance to others.
Concrete improvements for next sprint
These aren't aspirational. They're in our issue tracker.
- Pre-deploy syntax check: `py_compile` on every changed `.py` file before Docker build. 0.1 seconds, catches the governance.py class of bug.
- Cost verification gate: After any model swap, the deploy pipeline makes a test request and compares the logged cost against expected pricing. If they diverge by more than 10%, the deploy halts.
- Single model config: One YAML file that declares `budget_model`, `standard_model`, `frontier_model`. Templates, router, pricing — they all read from this file. Model swaps become one-line changes.
- Diff review + full-file review: Code review on diffs catches most bugs. But merge artifacts only show up in the full file. Our review process now includes both.
- Reasoning token budget: Reasoning models can consume their entire `max_tokens` budget on thinking before producing any output. We're adding a `max_reasoning_tokens` parameter to the governance chain so agents can't burn through budgets invisibly.
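The single model config could be as small as this. The file name and the non-budget values are placeholders — only the budget model string is real:

```yaml
# models.yaml - hypothetical single source of truth for model tiers.
budget_model: stepfun/step-3.5-flash   # the tier swapped this sprint
standard_model: your-standard-model    # placeholder
frontier_model: your-frontier-model    # placeholder
```

Everything that currently hardcodes a model string — fleet presets, runner templates, pricing lookups, test fixtures — would read its tier from this file instead.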
The uncomfortable question
Is building with AI faster? Yes, significantly. The model swap — 46 files, 3 bugs found and fixed, cost verification test, deploy — took one session. With a human pair programmer, that's a multi-day effort.
Is it safer? Not automatically. AI pair programming amplifies both your throughput and your mistake rate. We shipped 319 commits in 10 days. We also shipped a broken governance chain, a 47x billing error, and silently dropped reasoning tokens. The bugs were found and fixed in the same sprint, but only because we invested in verification — smoke tests, cost checks, production error monitoring.
The lesson isn't "AI good" or "AI bad." It's this: AI pair programming is a force multiplier, and force multipliers don't care what direction you're going. If you have good verification practices, AI makes you ship faster and catch bugs sooner. If you don't, AI helps you ship bugs faster too.
Build the verification layer first. Then turn up the velocity.
This post was co-authored by Claude Code (Opus 4.6). The data comes from real git history, real production errors, and real billing logs. The opinions about what went wrong are mine (Boris) — Claude Code would probably be more diplomatic about its own limitations.
What we're measuring going forward
We're using the curate-me.ai cost attribution system to track exactly how much each sprint costs in AI assistance:
- Per-session cost: Tagged with `X-CM-Tags: sprint=2, session_type=code_review`
- Per-feature cost: How much AI time each feature required
- Bug-fix ratio: What percentage of AI-generated code needed correction
- Time-to-detection: How long bugs survived before being caught
If you're running AI agents — whether for code, content, or operations — you need the same visibility. That's what we're building. That's why we dogfood it. And that's why the bugs in this retro exist at all: because we're actually using the thing.
Next post: Why We Switched from MiniMax to StepFun (And Found a 100x Billing Bug) — the technical deep-dive on today's model swap.