Dogfooding: How We Use Our Own Dashboard to Ship Better
We build a governance platform for AI agents. We also run AI agents. Here's how using our own dashboard — cost attribution, error tracking, evaluation framework — makes our development process concretely better.
Claude (Opus 4.6) — Sprint data analysis, dashboard feature mapping
Governed by curate-me.ai
The premise
We build curate-me.ai — a governance layer for AI agents. Cost tracking, error monitoring, rate limiting, PII scanning, evaluation framework. We sell it to teams running LLM-powered applications.
We also run LLM-powered applications ourselves: a blog pipeline with 9 agents, a Hospital CFO demo fleet with 6 agents, an autopilot system that decomposes tasks into containerized Claude Code workers. Plus Claude Code itself as our primary development tool.
So we eat our own cooking. And the meals have been... instructive.
What "dogfooding" actually looks like
It's not just "we use our product." It's a feedback loop:
- Sprint happens — we ship features, break things, fix things
- Dashboard surfaces the damage — cost spikes, error rates, model failures
- Retro identifies patterns — why did we miss this? what would have caught it?
- Issues become features — the dashboard feature that would have prevented the bug gets built next sprint
- Next sprint, the dashboard catches it — the loop closes
Here's what each dashboard feature taught us this sprint.
Cost attribution: finding the $0.003 vs $0.00007 discrepancy
Our gateway tags every LLM request with cost metadata: which org, which agent, which model, which project. The dashboard shows this as a cost breakdown over time.
What cost attribution caught
| What we saw | What it meant | What we did |
|---|---|---|
| Step 3.5 Flash request: $0.003552 | 47x higher than expected pricing | Found vendor prefix bug in pricing lookup |
| Reasoning tokens: 0 in all Step 3.5 requests | Thinking tokens silently dropped | Added reasoning_tokens extraction |
| Daily fleet cost jumped 3x after model swap | New model's reasoning mode uses more tokens | Added max_reasoning_tokens governance param |
Without cost attribution, the 47x overcharge would have gone unnoticed until a customer complained about their bill. The dashboard showed the per-request cost immediately after the test request, so we caught it in minutes instead of days.
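To make the failure mode concrete, here's a minimal sketch of the bug pattern. The table, function name, and per-token prices are illustrative, not our production values — only the $3 fallback mirrors the default described in this post:

```python
# Illustrative pricing table, USD per 1M input tokens. The $3 default
# mirrors the costly fallback that hid the bug; other prices are made up.
PRICING_PER_1M_INPUT = {
    "step-3.5-flash": 0.075,
    "claude-opus-4": 15.0,
}
DEFAULT_INPUT_PRICE = 3.0

def input_price(model_id: str) -> float:
    # Strip any "vendor/" prefix (e.g. "stepfun/step-3.5-flash") before
    # the lookup, so prefixed IDs resolve to the real price instead of
    # silently falling through to the expensive default.
    bare = model_id.split("/", 1)[-1]
    return PRICING_PER_1M_INPUT.get(bare, DEFAULT_INPUT_PRICE)
```

The fix is one `split()` call. The hard part was noticing the fallback was being taken at all — which is exactly what per-request cost attribution surfaced.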
Dashboard feature we're adding: Cost anomaly alerting (#765). If per-request cost deviates more than 2x from the trailing 7-day average for that model, fire a Slack alert. This was in our sprint backlog as a nice-to-have. After the billing bug, it became a P0.
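The core of that alert rule fits in a few lines. A sketch, with names of our choosing (the real feature would also maintain the trailing window per model and post to Slack):

```python
def is_cost_anomaly(request_cost: float, trailing_costs: list[float],
                    threshold: float = 2.0) -> bool:
    """Return True if a request's cost deviates more than `threshold`x
    (in either direction) from the trailing average for its model."""
    if not trailing_costs:
        return False  # no baseline yet: nothing to compare against
    avg = sum(trailing_costs) / len(trailing_costs)
    if avg == 0:
        return request_cost > 0  # any spend over a zero baseline is news
    ratio = request_cost / avg
    return ratio > threshold or ratio < 1 / threshold
```

Checking both directions matters: a request that's suddenly 50x cheaper than baseline is just as suspicious as one that's 50x dearer — it usually means tokens are being dropped, not saved.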
Error tracking: the governance.py incident
Our error tracking pipeline works like this: ErrorCaptureMiddleware catches unhandled exceptions → logs to MongoDB error_logs → queryable via ./scripts/errors recent. The dashboard shows error rates, stack traces, and per-org breakdowns.
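In spirit, the middleware is a try/except wrapped around the handler. A stripped-down sketch — the list here stands in for the MongoDB error_logs collection, and everything except the class name is illustrative:

```python
import traceback
from datetime import datetime, timezone

class ErrorCaptureMiddleware:
    def __init__(self, handler, error_store):
        self.handler = handler
        self.error_store = error_store  # insert_one() on a real collection

    def __call__(self, request: dict):
        try:
            return self.handler(request)
        except Exception as exc:
            # Record the failure with enough context for per-org
            # breakdowns, then re-raise so the caller still sees it.
            self.error_store.append({
                "ts": datetime.now(timezone.utc).isoformat(),
                "error": type(exc).__name__,
                "message": str(exc),
                "stack": traceback.format_exc(),
                "org": request.get("org"),
            })
            raise
```

Note the structural limitation that the rest of this section is about: this code only runs if the process starts at all.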
The governance.py incident was a Python SyntaxError — the module couldn't even be imported. It surfaced in the deploy logs the moment the new Docker image tried to start:
```
  File "governance.py", line 1154
    logger.warning(
                   ^
SyntaxError: invalid syntax
```
What it taught us: Our error tracking catches runtime errors beautifully. But this was a parse-time error — the file couldn't load at all. The gateway process crashed on startup, so no requests were processed, and the error capture middleware never ran.
We noticed because we watched the deploy logs. But if we'd deployed and walked away, the VPS would just be serving the old cached container with no indication that the new code was broken.
Dashboard feature we're adding: Deploy health verification. After a deploy, the dashboard should poll the /health endpoint of all services. If any service doesn't respond within 30 seconds, fire an alert. Simple, but we didn't have it.
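It really is simple — the whole check is a polling loop with a deadline. A sketch under our own naming (`probe` is injected so the logic is testable; in production it would be an HTTP GET against each service's /health endpoint):

```python
import time

def verify_deploy(services: dict[str, str], probe,
                  timeout: float = 30.0, interval: float = 2.0) -> list[str]:
    """Poll each service's health URL until it answers or `timeout`
    elapses. Returns the names of services that never came up, i.e.
    the alert list."""
    unhealthy = []
    for name, url in services.items():
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if probe(url):  # e.g. requests.get(url, timeout=5).ok
                break
            time.sleep(interval)
        else:
            # while-else: the loop timed out without hitting `break`
            unhealthy.append(name)
    return unhealthy
```

This would have caught the governance.py crash: the gateway container dies on import, /health never answers, and the alert fires within 30 seconds of the deploy instead of whenever someone next reads the logs.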
Evaluation framework: would it have caught the merge?
Sprint 2 shipped an evaluation framework (#790) — regex rules, keyword matching, and LLM-as-judge for evaluating agent outputs. It was built for customer-facing quality checks. But can we use it on ourselves?
Evaluation rules we're adding for our own dev process
| Rule | Type | What it catches |
|---|---|---|
| py_compile on changed .py files | Regex/script | Syntax errors surviving merge |
| Cost delta check after model swap | Numeric threshold | Pricing lookup failures |
| Token field completeness | Keyword | Missing reasoning_tokens, cache tokens |
| Governance chain health after deploy | HTTP check | Import errors, broken middleware |
| Test assertions reference current model | Regex | Stale model names in test fixtures |
The evaluation framework was designed for customers. Now we're its first customer. Every rule above maps to a real bug we shipped this sprint.
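Most of these rules are tiny. The token-field completeness check, for instance, is little more than a keyword scan over the usage payload — a sketch with illustrative field names (real provider payloads vary):

```python
REQUIRED_TOKEN_FIELDS = ("input_tokens", "output_tokens", "reasoning_tokens")

def missing_token_fields(usage: dict) -> set[str]:
    """Report usage fields that are absent or None in a provider
    response: the pattern that let reasoning_tokens go missing
    without anyone noticing."""
    return {field for field in REQUIRED_TOKEN_FIELDS
            if usage.get(field) is None}
```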
The sprint velocity loop
We track sprint velocity the same way a traditional team would — issues closed, PRs merged, bugs shipped vs bugs caught. But we add AI-specific metrics:
Sprint 2 metrics (10 days)
The ratio that matters: 6 bugs caught in review vs 3 that hit production. That's a 2:1 catch rate. For a human+AI pair programming workflow where the pace is 30+ commits/day, that's... not great. Traditional teams shipping 3-5 commits/day usually catch more bugs pre-production.
The insight: AI pair programming increases throughput, and with it the review burden. If your review process doesn't scale with your commit rate, bugs leak through. We need to make review automatic, not optional.
What we're building: A pre-merge check that runs the evaluation framework rules against every PR. Not a full CI pipeline — just the fast checks (syntax, cost assertions, import verification) that catch the class of bug we actually shipped.
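The syntax check is the cheapest of those, and it's stdlib-only. A sketch of what ours looks like — the git invocation assumes a `main` base branch, and both function names are ours:

```python
import py_compile
import subprocess

def changed_python_files(base: str = "origin/main") -> list[str]:
    # Ask git which .py files this branch touched relative to `base`.
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "--", "*.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def syntax_errors(paths: list[str]) -> list[str]:
    """Byte-compile each file and collect the ones that fail to
    parse: the exact class of bug that shipped in governance.py."""
    bad = []
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError:
            bad.append(path)
    return bad
```

Seconds to run on every PR, and it closes the gap that a review skimming a diff can't: the diff looked fine, the whole file didn't parse.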
Concrete improvements from this retro
Every lesson from the sprint retro maps to a GitHub issue. These are real, not aspirational:
Retro lesson → GitHub issue → Dashboard feature
| Lesson | Issue | Ships in |
|---|---|---|
| Syntax error hid for 2 days | Pre-deploy py_compile check | Sprint 3 |
| Cost calc fell to $3/$15 default | Cost verification gate in deploy pipeline | Sprint 3 |
| Model string in 56 files | Single model config (budget/standard/frontier) | Sprint 3 |
| Reasoning tokens invisible | max_reasoning_tokens governance parameter | Sprint 3 |
| Review didn't check full file | Eval framework rules for pre-merge checks | Sprint 3 |
| No deploy health check | Post-deploy health poll + alert | Sprint 3 |
Why this matters for our customers
If we can't use our own dashboard to catch billing bugs, deployment failures, and cost anomalies in our own development process, why would anyone trust us to catch them in theirs?
Dogfooding isn't a nice-to-have. It's the only honest way to build a governance product. Every bug we ship and catch with our own tools proves the tools work. Every bug we miss shows us what to build next.
The governance platform isn't just for customers running AI agents in production. It's for anyone building with AI — including us. The cost attribution that caught our 47x billing error? That's the same feature a customer would use to spot their own pricing misconfigurations. The error tracking that surfaced the governance.py crash? Same pipeline that would alert a customer to their agent failing.
The loop closes when the product governs its own development. That's where we are now.
Every feature mentioned in this post is live on the curate-me.ai dashboard. If you're running AI agents and want cost attribution, error tracking, and evaluation rules — you don't have to build them yourself. That's literally what we sell.
Read the full sprint retrospective: What Claude Code and I Actually Shipped (And What Broke)
Technical deep-dive on the billing bug: Why We Switched from MiniMax to StepFun