The Week We Stopped Building Features
74 commits, 30 fixes, 4,700 new tests. After a marathon sprint, we spent three days doing nothing but making the platform actually work.
Claude (Opus 4.6) — Co-author, all implementation, test generation, bug investigation
Claude Code (Sonnet 4.6) — Test wave execution, dashboard refactoring
Governed by curate-me.ai
The hangover
Three days ago, we shipped the marathon session: 4 sites, 50+ features, 150+ agent runs, one weekend. It felt like the most productive session we'd ever had. Parallel worktree agents running 6 features at a time. Terminal recordings. Chat widgets. Blog redesigns. Everything was shipping.
Then we tried to make it work in production.
Forty-one percent of the work since that marathon was fixing things the marathon broke. This is the post about those three days — the unglamorous part nobody writes about, the part where the platform actually became real.
Day 1: The bugs were everywhere
March 25 started as a feature day. We shipped multi-repository support for the dev-team pipeline, a Quick Connect flow for GitHub repos, and a board integration that fetches issues directly from connected repos. Six feature commits before lunch.
Then the bug reports started.
40 CORS failures
Fourteen dashboard API clients were making gateway requests to localhost:8002 in production. The root cause was simple: NEXT_PUBLIC_GATEWAY_URL had never been set as an environment variable on the VPS. Every file that used it fell back to the development default.
The fix was also simple — use /gw-api relative paths that get rewritten by Next.js to the gateway. But the fix had to be applied across 14 separate files in two batches, because we kept finding more.
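The shape of that rewrite, sketched as a Next.js config (the env var name and fallback here are assumptions, not the exact production config): requests to /gw-api stay same-origin in the browser, so CORS never applies, and Next.js proxies them to the gateway server-side.

```typescript
// next.config.ts — minimal sketch of the /gw-api rewrite.
// GATEWAY_INTERNAL_URL is a hypothetical server-side env var.
import type { NextConfig } from 'next';

const config: NextConfig = {
  async rewrites() {
    return [
      {
        // The browser only ever sees a relative path; the proxy hop
        // to the gateway happens inside the Next.js server.
        source: '/gw-api/:path*',
        destination: `${process.env.GATEWAY_INTERNAL_URL ?? 'http://localhost:8002'}/:path*`,
      },
    ];
  },
};

export default config;
```

Because the fallback only matters on the server, a missing env var degrades loudly in server logs instead of silently shipping localhost URLs to every browser.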
The CORS cascade
Forty CORS bugs. All with the same root cause. All invisible in development because localhost:8002 works fine when the gateway is running locally.
The auth key that never existed
Twelve dashboard files were reading authentication tokens from a localStorage key called cm_gateway_key. That key had never been set by any part of the application. The canonical key was dashboard_access_token, set by the auth flow.
This means twelve pages — traces, fleet timeline, cost attribution, runner status — had been silently failing to authenticate since they were written. They worked in development because the dev server doesn't enforce auth the same way. In production, they returned empty data or 401 errors.
The fix was a find-and-replace across 12 files. The lesson was harder to absorb.
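One way to make that lesson structural rather than behavioral is a single accessor module that owns the storage key, so a stale name like cm_gateway_key cannot drift back into page code. The key name below is from the incident; the helper itself is a hypothetical sketch.

```typescript
// auth-token.ts — hypothetical sketch: one module owns the key name.
export const DASHBOARD_TOKEN_KEY = 'dashboard_access_token';

interface KVStore {
  getItem(key: string): string | null;
}

// Storage is injectable so the helper is testable and safe on the
// server, where window.localStorage does not exist.
export function getAuthToken(
  store: KVStore | undefined = (globalThis as { localStorage?: KVStore }).localStorage,
): string | null {
  return store?.getItem(DASHBOARD_TOKEN_KEY) ?? null;
}
```

Twelve pages importing getAuthToken cannot disagree about the key; twelve pages each spelling out a localStorage string can, and did.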
Null safety: the .toFixed() epidemic
This one took eight separate commits to resolve. The dashboard had hundreds of places where numerical values from API responses were passed directly to .toFixed(2). When the API returned null or undefined — which it does for any resource that hasn't accumulated cost data yet — JavaScript threw Cannot read properties of null (reading 'toFixed').
We found these in costs pages, runner detail pages, gateway logs, session recording pages, prompt editor components, and fleet timelines. Every page that displayed a cost or a latency number was potentially broken.
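The fix that finally stuck was routing every display through a null-safe formatter instead of calling .toFixed() inline. A minimal sketch (the function name and fallback are illustrative, not the dashboard's actual helper):

```typescript
// format.ts — hypothetical null-safe formatter. Values from API
// responses may be null or undefined before cost data accumulates.
export function formatAmount(
  value: number | null | undefined,
  digits = 2,
  fallback = '—',
): string {
  // Reject null, undefined, and NaN before calling .toFixed().
  return typeof value === 'number' && Number.isFinite(value)
    ? value.toFixed(digits)
    : fallback;
}
```

A hundred call sites with one guarded helper beats a hundred call sites each remembering to guard.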
Day 2: The pivot to testing
By March 26, the pattern was clear. We were not going to find these bugs by using the dashboard manually. We needed systematic coverage.
Six waves of test generation, each targeting a different layer of the stack:
Test waves — what got covered
But here is the part nobody tells you about AI-generated tests: 11 of the generated test files were broken. Not failing — broken. Import errors that prevented the entire test suite from collecting. Tests that referenced functions that didn't exist. Tests for modules that had been rewritten.
We deleted them across three separate cleanup commits. The broken tests were generated by agents running against stale snapshots of the codebase, and no one had run them before merging. That is the same lesson as the marathon — speed without verification compounds debt.
The testing push also forced a refactor. The test configuration was a mess: fixtures duplicated across files, thresholds misaligned, no-op fixtures that did nothing. We extracted everything into reusable modules and standardized the infrastructure. This was the boring, correct work that makes the next 4,700 tests possible.
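The shape of that extraction, sketched for the dashboard side (names hypothetical): a single factory for mock API payloads with per-test overrides, instead of fixture copies pasted into every spec file.

```typescript
// test-fixtures.ts — hypothetical shared fixture factory.
// Each spec overrides only the fields it cares about.
export interface CostSummary {
  today: number | null; // null until a resource accumulates cost data
  month: number | null;
  currency: string;
}

export function makeCostSummary(
  overrides: Partial<CostSummary> = {},
): CostSummary {
  return { today: 0, month: 0, currency: 'USD', ...overrides };
}
```

When the CostSummary shape changes, one factory changes with it; duplicated fixtures are how a suite of this size quietly rots.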
Day 2.5: The architecture reckoning
Between test waves, we broke up the monoliths.
lib/api.ts was 3,236 lines — a single file containing every API call the dashboard makes. It was the third-largest file in the entire monorepo. We split it into 20+ specialized modules: agents-api.ts, billing-api.ts, costs-api.ts, gateway-admin.ts, webhooks-api.ts, and so on.
The template builder modal was 1,494 lines. We extracted it into 7 step components (BasicsStep, InputsStep, WorkspaceStep, and so on). The trace viewer was 1,083 lines. The agent chat panel was 831 lines. Each one became a set of focused modules.
Net change: +1,111 lines. Breaking up monoliths always adds a few lines for the module boundaries and re-exports. That is fine. What matters is that each module can now be tested, reviewed, and reasoned about independently.
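A sketch of what one of those slices looks like (the module name costs-api.ts is from the split above; the function, types, and endpoint path are hypothetical). Each module owns one domain, and the fetch dependency is injectable so the module can be tested without a running gateway.

```typescript
// costs-api.ts — hypothetical slice of the former lib/api.ts monolith.
export interface CostSummary {
  today: number | null;
  month: number | null;
}

type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

export async function fetchCostSummary(
  fetchImpl: FetchLike = (globalThis as { fetch: FetchLike }).fetch,
): Promise<CostSummary> {
  // Relative /gw-api path: rewritten to the gateway server-side,
  // so the same code works in dev and in production.
  const res = await fetchImpl('/gw-api/costs/summary');
  if (!res.ok) throw new Error(`costs API failed: ${res.status}`);
  return (await res.json()) as CostSummary;
}
```

The re-exports that keep old `import { fetchCostSummary } from 'lib/api'` call sites compiling are part of that +1,111-line overhead.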
Day 3: The ?? 0 saga
On March 27, after the test waves landed, we tried to build the dashboard for production. Next.js 15 has gotten stricter about TypeScript. It flagged 519 instances of a pattern like this:
```typescript
// value is already guaranteed to be a number by the type system
const display = (value ?? 0).toFixed(2);
```
The ?? is the nullish coalescing operator: ?? 0 falls back to zero when the value is null or undefined. But the type system says the value is already a number, so Next.js now treats the fallback as unreachable code and raises a type error.
So we bulk-removed all 519 instances with a single sed command.
The build passed. We deployed. And then 13 pages broke.
The ?? 0 incident
The problem: some of those values came through optional chaining — stats?.costs?.today ?? 0. TypeScript sees the final type as number | undefined, but the intermediate ?? 0 is what protects against the undefined case. Removing it means .toFixed() crashes when stats is null.
519 instances looked identical. 506 were genuinely safe to remove. 13 were load-bearing. You cannot tell which is which without reading each one.
The lesson is specific: bulk automated refactors that look purely mechanical are not purely mechanical. Types lie. Optional chains create invisible null paths. The only safe approach was surgical: read each instance, trace the value's origin, decide.
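The trap reproduces in a few lines. This sketch mirrors the stats?.costs?.today shape from above; the interface is illustrative.

```typescript
// The declared type says `today` is a number, so `?? 0` after an
// optional chain looks redundant to the compiler.
interface Stats {
  costs: { today: number };
}

function renderToday(stats: Stats | null): string {
  // stats?.costs?.today has type number | undefined: the optional
  // chain reintroduces undefined even though `today` itself is typed
  // as number. Deleting `?? 0` here makes .toFixed() crash whenever
  // stats is null.
  return (stats?.costs?.today ?? 0).toFixed(2);
}
```

The compiler flags the fallback as dead code while the optional chain makes it load-bearing at runtime; that mismatch is exactly why 13 of the 519 deletions broke pages.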
What we actually built
Not nothing. Nine feature commits survived between the fixes:
- Machine Registry: BYOVM runners now have ownership and sharing semantics. Personal machines, shared machines, pool machines. A dashboard page for managing them, an install wizard, and a machine picker in the runner creation modal.
- Cloud Runner + VPS monitoring: Capacity monitoring for multi-machine deployments.
- Command palette: Replaced the broken header search with a real cmdk-powered Cmd+K palette. 40 searchable pages, 6 quick actions, contextual suggestions.
- Landing page redesign: Collapsed 13 sections to 7, removed 3,700 lines, new headline.
- Security hardening: 45 findings across 4 severity waves.
But the ratio tells the story. 9 features, 30 fixes, 16 test commits. For every feature we shipped, we fixed three things and wrote tests for two more.
What I learned
The marathon was real. 50+ features shipped. But they shipped into a codebase that was not ready to receive them. The CORS bugs existed before the marathon — we just never noticed because we were moving too fast to test in production. The auth key mismatch was written months ago. The null safety issues had been accumulating since the dashboard's first draft.
Speed creates debt. AI-assisted development creates debt faster, because the agents can generate code faster than you can verify it works. The marathon proved we could ship 50 features in a weekend. The week after proved that shipping is not the same as working.
I don't regret the marathon. Those features needed to exist. But if I were doing it again, I would have stopped at feature 30 and spent the second day writing tests instead of feature 50.
Twenty percent backend coverage is not a testing strategy. It is an absence of one.
The platform now has 17,560 passing backend tests at 43% coverage, 1,000+ dashboard tests, and a modular architecture that can actually be maintained. None of that is visible to users. All of it is why the next marathon will not require a week of cleanup afterward.