retrospective · testing · production · bugs · engineering

The Week We Stopped Building Features

74 commits, 30 fixes, 4,700 new tests. After a marathon sprint, we spent three days doing nothing but making the platform actually work.

March 28, 2026 · 9 min read
AI Collaboration

Claude (Opus 4.6): Co-author; all implementation, test generation, bug investigation

Claude Code (Sonnet 4.6): Test wave execution, dashboard refactoring

Total AI cost: $0.12

Governed by curate-me.ai

The hangover

Three days ago, we shipped the marathon session: 4 sites, 50+ features, 150+ agent runs, one weekend. It felt like the most productive session we'd ever had. Parallel worktree agents running 6 features at a time. Terminal recordings. Chat widgets. Blog redesigns. Everything was shipping.

Then we tried to make it work in production.

  • 74 commits in 3 days
  • 30 bug fixes (41% of all commits)
  • 4,700+ new tests (across 6 waves)
  • 9 new features (12% of all commits)

Forty-one percent of the work since that marathon was fixing things the marathon broke. This is the post about those three days — the unglamorous part nobody writes about, the part where the platform actually became real.

Day 1: The bugs were everywhere

March 25 started as a feature day. We shipped multi-repository support for the dev-team pipeline, a Quick Connect flow for GitHub repos, and a board integration that fetches issues directly from connected repos. Six feature commits before lunch.

Then the bug reports started.

40 CORS failures

Fourteen dashboard API clients were making gateway requests to localhost:8002 in production. The root cause was simple: NEXT_PUBLIC_GATEWAY_URL had never been set as an environment variable on the VPS. Every file that used it fell back to the development default.

The fix was also simple — use /gw-api relative paths that get rewritten by Next.js to the gateway. But the fix had to be applied across 14 separate files in two batches, because we kept finding more.
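For illustration, a rewrite rule along these lines is what makes relative paths work. This is a sketch, not the project's actual config; GATEWAY_INTERNAL_URL is an invented server-side variable (only the /gw-api prefix and port 8002 come from the post):

```typescript
// next.config.ts (sketch). The gateway URL is resolved on the server, so
// the browser only ever requests its own origin and CORS never applies.
const gateway = process.env.GATEWAY_INTERNAL_URL ?? "http://localhost:8002";

const nextConfig = {
  async rewrites() {
    return [
      {
        // Browser calls fetch("/gw-api/..."); Next.js proxies it to the
        // gateway, so there is no absolute URL to misconfigure per env.
        source: "/gw-api/:path*",
        destination: `${gateway}/:path*`,
      },
    ];
  },
};

export default nextConfig;
```

Unlike a NEXT_PUBLIC_* variable baked into the client bundle, a server-side rewrite fails visibly at deploy time rather than silently in the browser.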

The CORS cascade

  • 10am: First CORS bug found in secrets-vault API client
  • 11am: 14 more CORS bugs found across gateway API files
  • 2pm: Fix batch 1 switches 14 files to relative /gw-api paths
  • 4pm: 12 MORE CORS bugs found in remaining dashboard clients
  • 5pm: Fix batch 2: all dashboard API clients now use relative URLs
  • 6pm: 3 gateway-side bugs: rate limit, sliding-window count, port exhaustion

Forty CORS bugs. All with the same root cause. All invisible in development because localhost:8002 works fine when the gateway is running locally.

The auth key that never existed

Twelve dashboard files were reading authentication tokens from a localStorage key called cm_gateway_key. That key had never been set by any part of the application. The canonical key was dashboard_access_token, set by the auth flow.

This meant twelve pages — traces, fleet timeline, cost attribution, runner status — had been silently failing to authenticate since they were written. They worked in development because the dev server doesn't enforce auth the same way. In production, they returned empty data or 401 errors.

The fix was a find-and-replace across 12 files. The lesson was harder to absorb.
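A sketch of the structural version of that lesson (everything here except the dashboard_access_token key name is a hypothetical helper): make the key name exist in exactly one module, with storage injected so it can be tested outside a browser.

```typescript
// auth-token.ts — illustrative sketch, not the project's actual code.
export const ACCESS_TOKEN_KEY = "dashboard_access_token";

// Minimal Storage-like interface; window.localStorage satisfies it.
export interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

export function getAccessToken(store: KeyValueStore): string | null {
  // Every caller goes through this function; a typo'd key name like
  // "cm_gateway_key" can no longer hide in twelve separate files.
  return store.getItem(ACCESS_TOKEN_KEY);
}

export function setAccessToken(store: KeyValueStore, token: string): void {
  store.setItem(ACCESS_TOKEN_KEY, token);
}
```

In the browser the call sites pass window.localStorage directly; in tests, any Map-backed object works.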

Null safety: the .toFixed() epidemic

This one took eight separate commits to resolve. The dashboard had hundreds of places where numerical values from API responses were passed directly to .toFixed(2). When the API returned null or undefined — which it does for any resource that hasn't accumulated cost data yet — JavaScript threw Cannot read properties of null (reading 'toFixed').

We found these in costs pages, runner detail pages, gateway logs, session recording pages, prompt editor components, and fleet timelines. Every page that displayed a cost or a latency number was potentially broken.
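The fix for this class of crash can be sketched as a single guarded formatter (the function name and the "n/a" fallback are assumptions, not the project's actual helper):

```typescript
// Sketch of a null-safe replacement for bare value.toFixed(2) calls.
export function fmtFixed(
  value: number | null | undefined,
  digits = 2,
  fallback = "n/a",
): string {
  // The typeof check narrows away null/undefined, and Number.isFinite
  // rejects NaN and Infinity, so .toFixed() only ever sees a real number.
  return typeof value === "number" && Number.isFinite(value)
    ? value.toFixed(digits)
    : fallback;
}
```

The helper matters less than the policy: null handling lives in one audited function instead of hundreds of call sites.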

Day 2: The pivot to testing

By March 26, the pattern was clear. We were not going to find these bugs by using the dashboard manually. We needed systematic coverage.

  • Backend coverage before: 20%
  • Backend coverage after: 43%
  • Passing tests: 17,560 (+10,539)
  • New test files: 506

Six waves of test generation, each targeting a different layer of the stack:

Test waves — what got covered

  • Wave 1: Runner CP + enterprise (960 tests): runner control plane, agents, DB, routes
  • Wave 2: Dashboard (1,000 tests): validation, auth, API hooks, exports
  • Wave 3: Billing + gateway (800 tests): billing, config, workers, models
  • Wave 4: Models + routes (800 tests): routes, agents, DB, webhooks
  • Wave 5: Services + middleware (1,000 tests): integrations, utils, checkpoint
  • Wave 6: Gateway modules (140 tests): coverage script, runner CP

But here is the part nobody tells you about AI-generated tests: 11 of the generated test files were broken. Not failing — broken. Import errors that prevented the entire test suite from collecting. Tests that referenced functions that didn't exist. Tests for modules that had been rewritten.

We deleted them across three separate cleanup commits. The broken tests were generated by agents running against stale snapshots of the codebase, and no one had run them before merging. That is the same lesson as the marathon — speed without verification compounds debt.

The testing push also forced a refactor. The test configuration was a mess: fixtures duplicated across files, thresholds misaligned, no-op fixtures that did nothing. We extracted everything into reusable modules and standardized the infrastructure. This was the boring, correct work that makes the next 4,700 tests possible.

Day 2.5: The architecture reckoning

Between test waves, we broke up the monoliths.

lib/api.ts was 3,236 lines — a single file containing every API call the dashboard makes. It was the third-largest file in the entire monorepo. We split it into 20+ specialized modules: agents-api.ts, billing-api.ts, costs-api.ts, gateway-admin.ts, webhooks-api.ts, and so on.

The template builder modal was 1,494 lines. We extracted it into 7 step components (BasicsStep, InputsStep, WorkspaceStep). The trace viewer was 1,083 lines. The agent chat panel was 831 lines. Each one became a set of focused modules.

  • Files refactored in one PR: 90
  • Lines added (modular components): 8,849
  • Lines removed (monoliths broken up): 7,738
  • Largest file before the split: 3,236 lines (lib/api.ts)

Net change: +1,111 lines. Breaking up monoliths always adds a few lines for the module boundaries and re-exports. That is fine. What matters is that each module can now be tested, reviewed, and reasoned about independently.
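Assuming the split kept the old import path alive through a barrel file (the post doesn't say; fetchCosts and the @/lib alias are invented for illustration), those extra boundary lines look roughly like this:

```typescript
// lib/api.ts after the split — illustrative only. Each domain now lives
// in its own focused module, and this barrel re-exports them so existing
// call sites such as `import { fetchCosts } from "@/lib/api"` keep
// compiling while new code imports the specific module directly.
export * from "./agents-api";
export * from "./billing-api";
export * from "./costs-api";
export * from "./gateway-admin";
export * from "./webhooks-api";
// ...remaining domain modules
```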

Day 3: The ?? 0 saga

On March 27, after the test waves landed, we tried to build the dashboard for production. Next.js 15 has gotten stricter about TypeScript. It flagged 519 instances of a pattern like this:

// value is already guaranteed to be a number by the type system
const display = (value ?? 0).toFixed(2);

The ?? 0 is a nullish coalescing operator — a fallback to zero if the value is null or undefined. But the type system says the value is already a number. Next.js now treats this as unreachable code: a type error.

So we bulk-removed all 519 instances with a single sed command.

The build passed. We deployed. And then 13 pages broke.

The ?? 0 incident

  • Morning: Next.js flags 519 unreachable ?? 0 patterns as type errors
  • 11am: Bulk removal: sed strips all 519 instances
  • 11:30am: Build passes, deployed to production
  • 12pm: 13 pages broken (values that CAN be null despite the types)
  • 12:15pm: Reverted the bulk removal entirely
  • 1pm: Surgical removal: 506 safe instances removed, 13 kept

The problem: some of those values came through optional chaining — stats?.costs?.today ?? 0. TypeScript sees the final type as number | undefined, but the intermediate ?? 0 is what protects against the undefined case. Removing it means .toFixed() crashes when stats is null.

519 instances looked identical. 506 were genuinely safe to remove. 13 were load-bearing. You cannot tell which is which without reading each one.
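The trap is easy to reproduce in isolation. In this sketch (types and field names are invented to mirror the stats example above), the declared type swears the value is a number, but the runtime payload disagrees:

```typescript
// Sketch: a type that lies. The interface claims `today` always exists...
interface Stats {
  costs: { today: number };
}

// ...but the JSON for a brand-new resource omits it, and the cast at the
// parse boundary means the compiler never finds out.
const fresh = JSON.parse('{"costs": {}}') as Stats;

// To the type checker this ?? 0 is unreachable code; at runtime it is the
// only thing preventing undefined.toFixed(2) from throwing.
const display = (fresh.costs.today ?? 0).toFixed(2); // "0.00"
```

Remove the fallback and the expression type-checks identically but crashes the page, which is exactly the failure mode of the 13 load-bearing instances.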

The lesson is specific: bulk automated refactors that look purely mechanical are not purely mechanical. Types lie. Optional chains create invisible null paths. The only safe approach was surgical: read each instance, trace the value's origin, decide.

What we actually built

Not nothing. Nine feature commits survived between the fixes:

  • Machine Registry: BYOVM runners now have ownership and sharing semantics. Personal machines, shared machines, pool machines. A dashboard page for managing them, an install wizard, and a machine picker in the runner creation modal.
  • Cloud Runner + VPS monitoring: Capacity monitoring for multi-machine deployments.
  • Command palette: Replaced the broken header search with a real cmdk-powered Cmd+K palette. 40 searchable pages, 6 quick actions, contextual suggestions.
  • Landing page redesign: Collapsed 13 sections to 7, removed 3,700 lines, new headline.
  • Security hardening: 45 findings across 4 severity waves.

But the ratio tells the story. 9 features, 30 fixes, 16 test commits. For every feature we shipped, we fixed three things and wrote tests for two more.

What I learned

The marathon was real. 50+ features shipped. But they shipped into a codebase that was not ready to receive them. The CORS bugs existed before the marathon — we just never noticed because we were moving too fast to test in production. The auth key mismatch was written months ago. The null safety issues had been accumulating since the dashboard's first draft.

Speed creates debt. AI-assisted development creates debt faster, because the agents can generate code faster than you can verify it works. The marathon proved we could ship 50 features in a weekend. The week after proved that shipping is not the same as working.

I don't regret the marathon. Those features needed to exist. But if I were doing it again, I would have stopped at feature 30 and spent the second day writing tests instead of feature 50.

Twenty percent backend coverage is not a testing strategy. It is an absence of one.

The platform now has 17,560 passing backend tests at 43% coverage, 1,000+ dashboard tests, and a modular architecture that can actually be maintained. None of that is visible to users. All of it is why the next marathon will not require a week of cleanup afterward.
