Sprint Retro: What Claude Code and I Actually Shipped (And What Broke)
A real engineering retrospective on 10 days of human+AI pair programming — 319 commits, 3 production bugs, 1 billing error that inflated costs 47x, and what we're doing differently next sprint.
Claude (Opus 4.6) — Co-author, code review, parallel codebase exploration
Governed by curate-me.ai
This is not a hype piece
I've read a hundred "I built X with AI" blog posts. They all follow the same template: here's a cool thing, here's how fast I did it, AI is amazing, the end.
This is not that post. This is a real sprint retrospective — the same format I'd use with a human team — covering 10 days of building an AI governance platform with Claude Code as my primary pair programmer. Real numbers from git. Real bugs we shipped to production. Real lessons we're applying next sprint.
If you're considering using AI coding assistants for serious engineering work, this is the post I wish I'd read first.
The numbers
The project is a monorepo: FastAPI backend (1,599 Python files), Next.js dashboard (1,519 TypeScript files), 105 gateway modules, SDKs, CLI, documentation. It started January 8, 2026 as a fashion recommendation app. By March it pivoted to a B2B AI governance gateway. The 10 days covered here span Sprint 2 — fleet management, security hardening, and a major model swap.
What we actually shipped
Sprint 2 — Feature commits by category
The biggest block was fleet management — features for running and coordinating multiple AI agents across containers. Autoscaling policies, rolling updates with canary deployment, distributed trace propagation, fleet-wide command execution. This is the stuff competitors like Portkey and Helicone don't have.
The second biggest was security. We added seccomp/AppArmor profiles to container sandboxes, encrypted VNC streams with TLS, hardened Guacamole credentials, added recording size validation to prevent event bombs, and fixed a PowerShell injection vector in the Windows desktop runtime.
What went wrong
Here's where it gets honest. Three bugs hit production. Code review caught plenty more before merge, but these three got through.
Bug 1: governance.py syntax error (deployed broken)
The governance.py incident
This one hurts. The gateway's core governance file — the 6-step policy chain that every request passes through — had a syntax error for 2 days and we didn't know. It never ran because Docker's layer cache served the old working image. Only when we triggered a full rebuild for the model swap did Python actually try to parse the file and crash.
Root cause: The sliding-window rate limiter PR merged cleanly according to git, but the result was old fixed-window code interleaved with new sliding-window code. Logger calls missing closing parentheses. Variables referenced before definition. A complete mess hiding behind a green merge check.
What we're doing differently: Run python -c "import py_compile; py_compile.compile('file.py', doraise=True)" on every changed .py file in CI. It takes 0.1 seconds and catches syntax errors that survive merge. We should have had this from day one.
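The check is tiny enough to sketch in full. This is a minimal standalone version, not our actual CI script — the file and function names are mine, and it assumes the pipeline compares against an `origin/main` base branch:

```python
# ci_syntax_check.py - sketch of a pre-merge syntax gate.
# py_compile raises PyCompileError on any syntax error, including the
# merge-artifact class of bug that broke governance.py.
import py_compile
import subprocess
import sys

def changed_python_files(base: str = "origin/main") -> list[str]:
    """List .py files changed relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD", "--", "*.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def syntax_failures(files: list[str]) -> list[str]:
    """Return the subset of files that fail to compile."""
    bad = []
    for path in files:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            print(exc, file=sys.stderr)
            bad.append(path)
    return bad
```

Wiring `sys.exit(1 if syntax_failures(changed_python_files()) else 0)` in before the Docker build step is the whole integration.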
Bug 2: cost calculation 47x overcharge
This is the one that scared me.
Cost calculation before and after fix
| Metric | Before fix | After fix | Difference |
|---|---|---|---|
| Actual cost per request | $0.003552 | $0.000075 | 47x overcharge |
| Pricing source | Default fallback ($3/$15) | Step 3.5 Flash ($0.10/$0.30) | Correct model |
| Reasoning tokens | 0 (not captured) | 128 | Now tracked |
When we swapped the fleet's budget model from MiniMax M2.5 to StepFun Step 3.5 Flash, the gateway's cost calculator couldn't find the new model in its pricing table. Why? The model name arrives from OpenRouter as stepfun/step-3.5-flash (with vendor prefix), but our local pricing table stores step-3.5-flash (bare name). The lookup failed, litellm didn't know it either, so it fell through to the default pricing: $3 input / $15 output per million tokens.
The real price is $0.10 input / $0.30 output. Our customers would have been seeing costs 47x higher than reality in their dashboard.
Root cause: The get_model_pricing() function assumed model names arrive in bare format. When we added OpenRouter routing, model names gained vendor prefixes. Nobody updated the pricing lookup to strip them.
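A minimal sketch of the fix. The pricing table here is illustrative — only the Step 3.5 Flash row and the fallback values come from our real numbers, and `get_model_pricing()` is simplified down to the lookup logic:

```python
# Illustrative per-million-token USD prices, keyed by bare model names.
PRICING = {
    "step-3.5-flash": {"input": 0.10, "output": 0.30},
}
DEFAULT_PRICING = {"input": 3.00, "output": 15.00}  # the fallback that caused the overcharge

def get_model_pricing(model: str) -> dict:
    """Resolve pricing for a model name that may carry a vendor prefix.

    OpenRouter returns names like "stepfun/step-3.5-flash" while the local
    table stores bare names, so try the full name first, then strip
    everything up to the last "/" before falling back to default pricing.
    """
    bare = model.rsplit("/", 1)[-1]
    return PRICING.get(model) or PRICING.get(bare) or DEFAULT_PRICING
```

With the bare-name fallback in place, both `stepfun/step-3.5-flash` and `step-3.5-flash` resolve to the same row instead of the $3/$15 default.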
What we're doing differently: Every model swap now requires a "cost verification test" — make one real request, check the logged cost against the known per-token price, and confirm they match within 5%. We added this to the gateway smoke test script.
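The verification itself is only a few lines. This sketch uses hypothetical helper names (not our smoke-test script) and assumes per-million-token pricing:

```python
def expected_cost(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float) -> float:
    """Expected USD cost given per-million-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

def cost_matches(logged: float, expected: float, tolerance: float = 0.05) -> bool:
    """True if the logged cost is within `tolerance` (5%) of the expected cost."""
    if expected == 0:
        return logged == 0
    return abs(logged - expected) / expected <= tolerance
```

The gate is then one real request, one `expected_cost` computation from the published prices, and a `cost_matches` assertion before the deploy proceeds.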
Bug 3: reasoning tokens silently dropped
Step 3.5 Flash is a reasoning model — it uses "thinking tokens" that cost as much as output tokens. Our cost recorder was ignoring them entirely because it only looked at completion_tokens in the response. The reasoning_tokens field in completion_tokens_details was discarded.
This means even after fixing the pricing lookup, we were still undercharging because we weren't counting the thinking tokens.
What we're doing differently: The cost recorder now extracts reasoning_tokens (and cache_creation_input_tokens, cache_read_input_tokens) from both streaming and non-streaming responses. We log thinking tokens separately so they're visible in the dashboard.
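In sketch form, the extraction looks like this. The nesting of reasoning tokens under `completion_tokens_details` follows the OpenAI-style usage schema, and the cache fields follow the Anthropic-style names mentioned above; treat the exact payload shape as an assumption, not a verified dump of our gateway:

```python
def extract_token_counts(usage: dict) -> dict:
    """Pull every billable token count from a response's usage payload."""
    details = usage.get("completion_tokens_details") or {}
    return {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        # Priced like output tokens, but previously discarded entirely:
        "reasoning_tokens": details.get("reasoning_tokens", 0),
        "cache_creation_input_tokens": usage.get("cache_creation_input_tokens", 0),
        "cache_read_input_tokens": usage.get("cache_read_input_tokens", 0),
    }
```

The `or {}` guards matter: non-reasoning models often omit `completion_tokens_details` entirely, and a bare `.get(...).get(...)` chain would crash on `None`.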
The model swap: a case study in blast radius
Swapping one model sounds simple. Change a string, redeploy, done. Here's what it actually touched:
Files modified for MiniMax → StepFun model swap
23 platform files. 15 ref-app files. 56 runner templates updated. The model string was hardcoded in fleet presets, runner templates, the budget tier config, the model catalog, pricing tables, and test fixtures. Changing it required understanding every place a model identifier flows through the system.
This is a governance problem, not a code problem. If we had a single source of truth for "the budget model" — one config that everything reads — this would have been a one-line change. Instead it was a day of archaeology. We're building that config now.
What Claude Code is actually good at (and what it isn't)
After 10 days of intensive pair programming, here's my honest assessment.
Where it excelled
Parallel exploration: When I needed to find every reference to MiniMax across 1,599 Python files and 1,519 TypeScript files, Claude Code launched 3 search agents simultaneously. One grepped the backend, one the dashboard, one the docs. Results came back in seconds. Doing this manually would have taken 20 minutes of grep -r and context-switching.
Mechanical refactoring: The model swap required the same pattern applied 56 times across runner templates: find the model field, replace the value, update the description. Claude Code did this without getting bored, without typos, without missing the 47th occurrence.
Code review: When reviewing our own changes, Claude Code caught 6 real issues:
- A duplicate React import that would fail the build
- Missing TypeScript type casts for fleet template generics
- A broken cost-anomaly-status endpoint returning 500
- Stale Docker container references in deploy scripts
- Race conditions in the Lua-based rate limit counters (INCR/EXPIRE not atomic)
These aren't style nits — they're bugs that would have hit production.
Where it struggled
Merge conflict reasoning: The governance.py incident happened because the merge looked clean to git. Claude Code didn't catch it either — it reviewed the PR diff, which showed the sliding-window additions, but didn't notice the old code was still present. Merge conflicts that survive git's resolution are invisible to diff-based review.
Cost awareness: Claude Code doesn't naturally think about the blast radius of model changes. It would happily swap a model string without asking "what about the pricing table? What about the cost calculator? What about test assertions that check costs?" I had to prompt explicitly: "can you do a round of review and then let's fix anything from the review." That round of review is what found the 47x billing bug.
Knowing when to stop: On several occasions, Claude Code would keep adding improvements beyond what I asked for. Extra error handling, additional validation, documentation comments. Each addition is individually reasonable but collectively they create noise in the diff, making review harder and increasing the chance of introducing bugs in "improved" code that was fine before.
What this means for how we build the platform
This retrospective isn't just navel-gazing. Every lesson here maps to a feature we're building or improving in the curate-me.ai governance platform.
Lesson → Platform feature
| What went wrong | Platform feature that prevents it | Status |
|---|---|---|
| Syntax error deployed for 2 days | Gateway smoke test (scripts/gateway-smoke-test.sh --production) | Shipped |
| Cost calculation off by 47x | Cost anomaly alerting (multi-channel delivery) | Shipped (#765) |
| Reasoning tokens silently dropped | Per-span token breakdown in Observer SDK | Shipped |
| Model swap touched 46 files | Model alias registry (single source of truth) | Shipped |
| No cost verification after model change | Evaluation framework with cost assertions | Shipped (#790) |
| Merge broke governance chain silently | Governance chain health check endpoint | Shipped |
We're dogfooding. The bugs we find in our own workflow become the features our customers get. A governance platform that can't govern its own development process has no business selling governance to others.
Concrete improvements for next sprint
These aren't aspirational. They're in our issue tracker.
- Pre-deploy syntax check: `py_compile` on every changed `.py` file before Docker build. 0.1 seconds, catches the governance.py class of bug.
- Cost verification gate: After any model swap, the deploy pipeline makes a test request and compares the logged cost against expected pricing. If they diverge by more than 10%, the deploy halts.
- Single model config: One YAML file that declares `budget_model`, `standard_model`, `frontier_model`. Templates, router, pricing — they all read from this file. Model swaps become one-line changes.
- Diff review + full-file review: Code review on diffs catches most bugs. But merge artifacts only show up in the full file. Our review process now includes both.
- Reasoning token budget: Reasoning models can consume their entire `max_tokens` budget on thinking before producing any output. We're adding a `max_reasoning_tokens` parameter to the governance chain so agents can't burn through budgets invisibly.
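The single model config could be as small as this. The file name and the non-budget values are placeholders — only the budget model string is real:

```yaml
# models.yaml - hypothetical single source of truth for model tiers.
budget_model: stepfun/step-3.5-flash   # the tier swapped this sprint
standard_model: your-standard-model    # placeholder
frontier_model: your-frontier-model    # placeholder
```

Everything that currently hardcodes a model string — fleet presets, runner templates, pricing lookups, test fixtures — would read its tier from this file instead.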
The uncomfortable question
Is building with AI faster? Yes, significantly. The model swap — 46 files, 3 bugs found and fixed, cost verification test, deploy — took one session. With a human pair programmer, that's a multi-day effort.
Is it safer? Not automatically. AI pair programming amplifies both your throughput and your mistake rate. We shipped 319 commits in 10 days. We also shipped a broken governance chain, a 47x billing error, and silently dropped reasoning tokens. The bugs were found and fixed in the same sprint, but only because we invested in verification — smoke tests, cost checks, production error monitoring.
The lesson isn't "AI good" or "AI bad." It's this: AI pair programming is a force multiplier, and force multipliers don't care what direction you're going. If you have good verification practices, AI makes you ship faster and catch bugs sooner. If you don't, AI helps you ship bugs faster too.
Build the verification layer first. Then turn up the velocity.
This post was co-authored by Claude Code (Opus 4.6). The data comes from real git history, real production errors, and real billing logs. The opinions about what went wrong are mine (Boris) — Claude Code would probably be more diplomatic about its own limitations.
What we're measuring going forward
We're using the curate-me.ai cost attribution system to track exactly how much each sprint costs in AI assistance:
- Per-session cost: Tagged with `X-CM-Tags: sprint=2, session_type=code_review`
- Per-feature cost: How much AI time each feature required
- Bug-fix ratio: What percentage of AI-generated code needed correction
- Time-to-detection: How long bugs survived before being caught
If you're running AI agents — whether for code, content, or operations — you need the same visibility. That's what we're building. That's why we dogfood it. And that's why the bugs in this retro exist at all: because we're actually using the thing.
Next post: Why We Switched from MiniMax to StepFun (And Found a 100x Billing Bug) — the technical deep-dive on today's model swap.