Dogfooding: How We Use Our Own Dashboard to Ship Better
We build a governance platform for AI agents. We also run AI agents. Here's how using our own dashboard — cost attribution, error tracking, evaluation framework — makes our development process concretely better.
Claude (Opus 4.6) — Sprint data analysis, dashboard feature mapping
Governed by curate-me.ai
The premise
We build curate-me.ai — a governance layer for AI agents. Cost tracking, error monitoring, rate limiting, PII scanning, evaluation framework. We sell it to teams running LLM-powered applications.
We also run LLM-powered applications ourselves: a blog pipeline with 9 agents, a Hospital CFO demo fleet with 6 agents, an autopilot system that decomposes tasks into containerized Claude Code workers. Plus Claude Code itself as our primary development tool.
So we eat our own cooking. And the meals have been... instructive.
What "dogfooding" actually looks like
It's not just "we use our product." It's a feedback loop:
- Sprint happens — we ship features, break things, fix things
- Dashboard surfaces the damage — cost spikes, error rates, model failures
- Retro identifies patterns — why did we miss this? what would have caught it?
- Issues become features — the dashboard feature that would have prevented the bug gets built next sprint
- Next sprint, the dashboard catches it — the loop closes
Here's what each dashboard feature taught us this sprint.
Cost attribution: finding the $0.003 vs $0.00007 discrepancy
Our gateway tags every LLM request with cost metadata: which org, which agent, which model, which project. The dashboard shows this as a cost breakdown over time.
What cost attribution caught
| What we saw | What it meant | What we did |
|---|---|---|
| Step 3.5 Flash request: $0.003552 | 47x higher than expected pricing | Found vendor prefix bug in pricing lookup |
| Reasoning tokens: 0 in all Step 3.5 requests | Thinking tokens silently dropped | Added reasoning_tokens extraction |
| Daily fleet cost jumped 3x after model swap | New model's reasoning mode uses more tokens | Added max_reasoning_tokens governance param |
Without cost attribution, the 47x overcharge would have gone unnoticed until a customer complained about their bill. The dashboard showed the per-request cost immediately after the test request, so we caught it in minutes instead of days.
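To make the failure mode concrete, here's a minimal sketch of the bug pattern. The table, function name, and per-token prices are illustrative, not our production values — only the $3 fallback mirrors the default described in this post:

```python
# Illustrative pricing table, USD per 1M input tokens. The $3 default
# mirrors the costly fallback that hid the bug; other prices are made up.
PRICING_PER_1M_INPUT = {
    "step-3.5-flash": 0.075,
    "claude-opus-4": 15.0,
}
DEFAULT_INPUT_PRICE = 3.0

def input_price(model_id: str) -> float:
    # Strip any "vendor/" prefix (e.g. "stepfun/step-3.5-flash") before
    # the lookup, so prefixed IDs resolve to the real price instead of
    # silently falling through to the expensive default.
    bare = model_id.split("/", 1)[-1]
    return PRICING_PER_1M_INPUT.get(bare, DEFAULT_INPUT_PRICE)
```

The fix is one `split()` call. The hard part was noticing the fallback was being taken at all — which is exactly what per-request cost attribution surfaced.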
Dashboard feature we're adding: Cost anomaly alerting (#765). If per-request cost deviates more than 2x from the trailing 7-day average for that model, fire a Slack alert. This was in our sprint backlog as a nice-to-have. After the billing bug, it became a P0.
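The core of that alert rule fits in a few lines. A sketch, with names of our choosing (the real feature would also maintain the trailing window per model and post to Slack):

```python
def is_cost_anomaly(request_cost: float, trailing_costs: list[float],
                    threshold: float = 2.0) -> bool:
    """Return True if a request's cost deviates more than `threshold`x
    (in either direction) from the trailing average for its model."""
    if not trailing_costs:
        return False  # no baseline yet: nothing to compare against
    avg = sum(trailing_costs) / len(trailing_costs)
    if avg == 0:
        return request_cost > 0  # any spend over a zero baseline is news
    ratio = request_cost / avg
    return ratio > threshold or ratio < 1 / threshold
```

Checking both directions matters: a request that's suddenly 50x cheaper than baseline is just as suspicious as one that's 50x dearer — it usually means tokens are being dropped, not saved.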
Error tracking: the governance.py incident
Our error tracking pipeline works like this: ErrorCaptureMiddleware catches unhandled exceptions → logs to MongoDB error_logs → queryable via ./scripts/errors recent. The dashboard shows error rates, stack traces, and per-org breakdowns.
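In spirit, the middleware is a try/except wrapped around the handler. A stripped-down sketch — the list here stands in for the MongoDB error_logs collection, and everything except the class name is illustrative:

```python
import traceback
from datetime import datetime, timezone

class ErrorCaptureMiddleware:
    def __init__(self, handler, error_store):
        self.handler = handler
        self.error_store = error_store  # insert_one() on a real collection

    def __call__(self, request: dict):
        try:
            return self.handler(request)
        except Exception as exc:
            # Record the failure with enough context for per-org
            # breakdowns, then re-raise so the caller still sees it.
            self.error_store.append({
                "ts": datetime.now(timezone.utc).isoformat(),
                "error": type(exc).__name__,
                "message": str(exc),
                "stack": traceback.format_exc(),
                "org": request.get("org"),
            })
            raise
```

Note the structural limitation that the rest of this section is about: this code only runs if the process starts at all.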
The governance.py incident was a Python SyntaxError — the module couldn't even be imported. It surfaced in the deploy logs the moment the new Docker image tried to start:
```
  File "governance.py", line 1154
    logger.warning(
                   ^
SyntaxError: invalid syntax
```
What it taught us: Our error tracking catches runtime errors beautifully. But this was a parse-time error — the file couldn't load at all. The gateway process crashed on startup, so no requests were processed, and the error capture middleware never ran.
We noticed because we watched the deploy logs. But if we'd deployed and walked away, the VPS would just be serving the old cached container with no indication that the new code was broken.
Dashboard feature we're adding: Deploy health verification. After a deploy, the dashboard should poll the /health endpoint of all services. If any service doesn't respond within 30 seconds, fire an alert. Simple, but we didn't have it.
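It really is simple — the whole check is a polling loop with a deadline. A sketch under our own naming (`probe` is injected so the logic is testable; in production it would be an HTTP GET against each service's /health endpoint):

```python
import time

def verify_deploy(services: dict[str, str], probe,
                  timeout: float = 30.0, interval: float = 2.0) -> list[str]:
    """Poll each service's health URL until it answers or `timeout`
    elapses. Returns the names of services that never came up, i.e.
    the alert list."""
    unhealthy = []
    for name, url in services.items():
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if probe(url):  # e.g. requests.get(url, timeout=5).ok
                break
            time.sleep(interval)
        else:
            # while-else: the loop timed out without hitting `break`
            unhealthy.append(name)
    return unhealthy
```

This would have caught the governance.py crash: the gateway container dies on import, /health never answers, and the alert fires within 30 seconds of the deploy instead of whenever someone next reads the logs.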
Evaluation framework: would it have caught the merge?
Sprint 2 shipped an evaluation framework (#790) — regex rules, keyword matching, and LLM-as-judge for evaluating agent outputs. It was built for customer-facing quality checks. But can we use it on ourselves?
Evaluation rules we're adding for our own dev process
| Rule | Type | What it catches |
|---|---|---|
| py_compile on changed .py files | Regex/script | Syntax errors surviving merge |
| Cost delta check after model swap | Numeric threshold | Pricing lookup failures |
| Token field completeness | Keyword | Missing reasoning_tokens, cache tokens |
| Governance chain health after deploy | HTTP check | Import errors, broken middleware |
| Test assertions reference current model | Regex | Stale model names in test fixtures |
The evaluation framework was designed for customers. Now we're its first customer. Every rule above maps to a real bug we shipped this sprint.
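Most of these rules are tiny. The token-field completeness check, for instance, is little more than a keyword scan over the usage payload — a sketch with illustrative field names (real provider payloads vary):

```python
REQUIRED_TOKEN_FIELDS = ("input_tokens", "output_tokens", "reasoning_tokens")

def missing_token_fields(usage: dict) -> set[str]:
    """Report usage fields that are absent or None in a provider
    response: the pattern that let reasoning_tokens go missing
    without anyone noticing."""
    return {field for field in REQUIRED_TOKEN_FIELDS
            if usage.get(field) is None}
```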
The sprint velocity loop
We track sprint velocity the same way a traditional team would — issues closed, PRs merged, bugs shipped vs bugs caught. But we add AI-specific metrics:
Sprint 2 metrics (10 days)
The ratio that matters: 6 bugs caught in review vs 3 that hit production. That's a 2:1 catch rate. For a human+AI pair programming workflow where the pace is 30+ commits/day, that's... not great. Traditional teams shipping 3-5 commits/day usually catch more bugs pre-production.
The insight: AI pair programming increases throughput, and with it the review burden. If your review process doesn't scale with your commit rate, bugs leak through. We need to make review automatic, not optional.
What we're building: A pre-merge check that runs the evaluation framework rules against every PR. Not a full CI pipeline — just the fast checks (syntax, cost assertions, import verification) that catch the class of bug we actually shipped.
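The syntax check is the cheapest of those, and it's stdlib-only. A sketch of what ours looks like — the git invocation assumes a `main` base branch, and both function names are ours:

```python
import py_compile
import subprocess

def changed_python_files(base: str = "origin/main") -> list[str]:
    # Ask git which .py files this branch touched relative to `base`.
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "--", "*.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def syntax_errors(paths: list[str]) -> list[str]:
    """Byte-compile each file and collect the ones that fail to
    parse: the exact class of bug that shipped in governance.py."""
    bad = []
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError:
            bad.append(path)
    return bad
```

Seconds to run on every PR, and it closes the gap that a review skimming a diff can't: the diff looked fine, the whole file didn't parse.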
Concrete improvements from this retro
Every lesson from the sprint retro maps to a GitHub issue. These are real, not aspirational:
Retro lesson → GitHub issue → Dashboard feature
| Lesson | Issue | Ships in |
|---|---|---|
| Syntax error hid for 2 days | Pre-deploy py_compile check | Sprint 3 |
| Cost calc fell to $3/$15 default | Cost verification gate in deploy pipeline | Sprint 3 |
| Model string in 56 files | Single model config (budget/standard/frontier) | Sprint 3 |
| Reasoning tokens invisible | max_reasoning_tokens governance parameter | Sprint 3 |
| Review didn't check full file | Eval framework rules for pre-merge checks | Sprint 3 |
| No deploy health check | Post-deploy health poll + alert | Sprint 3 |
Why this matters for our customers
If we can't use our own dashboard to catch billing bugs, deployment failures, and cost anomalies in our own development process, why would anyone trust us to catch them in theirs?
Dogfooding isn't a nice-to-have. It's the only honest way to build a governance product. Every bug we ship and catch with our own tools proves the tools work. Every bug we miss shows us what to build next.
The governance platform isn't just for customers running AI agents in production. It's for anyone building with AI — including us. The cost attribution that caught our 47x billing error? That's the same feature a customer would use to spot their own pricing misconfigurations. The error tracking that surfaced the governance.py crash? Same pipeline that would alert a customer to their agent failing.
The loop closes when the product governs its own development. That's where we are now.
Every feature mentioned in this post is live on the curate-me.ai dashboard. If you're running AI agents and want cost attribution, error tracking, and evaluation rules — you don't have to build them yourself. That's literally what we sell.
Read the full sprint retrospective: What Claude Code and I Actually Shipped (And What Broke)
Technical deep-dive on the billing bug: Why We Switched from MiniMax to StepFun