dogfooding, dashboard, governance, cost-tracking, devops, continuous-improvement

Dogfooding: How We Use Our Own Dashboard to Ship Better

We build a governance platform for AI agents. We also run AI agents. Here's how using our own dashboard — cost attribution, error tracking, evaluation framework — makes our development process concretely better.

March 19, 2026 · 7 min read
AI Collaboration

Claude (Opus 4.6): Sprint data analysis, dashboard feature mapping

Total AI cost: $0.28

Governed by curate-me.ai

The premise

We build curate-me.ai — a governance layer for AI agents. Cost tracking, error monitoring, rate limiting, PII scanning, evaluation framework. We sell it to teams running LLM-powered applications.

We also run LLM-powered applications ourselves: a blog pipeline with 9 agents, a Hospital CFO demo fleet with 6 agents, an autopilot system that decomposes tasks into containerized Claude Code workers. Plus Claude Code itself as our primary development tool.

So we eat our own cooking. And the meals have been... instructive.

What "dogfooding" actually looks like

It's not just "we use our product." It's a feedback loop:

  1. Sprint happens — we ship features, break things, fix things
  2. Dashboard surfaces the damage — cost spikes, error rates, model failures
  3. Retro identifies patterns — why did we miss this? what would have caught it?
  4. Issues become features — the dashboard feature that would have prevented the bug gets built next sprint
  5. Next sprint, the dashboard catches it — the loop closes

Here's what each dashboard feature taught us this sprint.

Cost attribution: finding the $0.003 vs $0.00007 discrepancy

Our gateway tags every LLM request with cost metadata: which org, which agent, which model, which project. The dashboard shows this as a cost breakdown over time.

What cost attribution caught

| What we saw | What it meant | What we did |
| --- | --- | --- |
| Step 3.5 Flash request: $0.003552 | 47x higher than expected pricing | Found vendor prefix bug in pricing lookup |
| Reasoning tokens: 0 in all Step 3.5 requests | Thinking tokens silently dropped | Added reasoning_tokens extraction |
| Daily fleet cost jumped 3x after model swap | New model's reasoning mode uses more tokens | Added max_reasoning_tokens governance param |

Without cost attribution, the 47x overcharge would have gone unnoticed until a customer complained about their bill. The dashboard showed the per-request cost immediately after the test request. We caught it in minutes instead of days.
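The failure mode is easy to sketch. Everything below is illustrative — the prices, the table keys, and the function names are stand-ins, not our actual gateway code — but it shows how a pricing table keyed by bare model names silently falls back to an expensive default when queried with a vendor-prefixed id:

```python
# Illustrative pricing table: USD per 1M tokens as (input, output).
# Keys are bare model names; requests arrive with "vendor/model" ids.
PRICING_PER_1M = {
    "step-3.5-flash": (0.10, 0.30),
}
DEFAULT_PRICING = (3.00, 15.00)  # the expensive fallback that inflated costs

def lookup_pricing(model_id: str) -> tuple[float, float]:
    """Strip an optional 'vendor/' prefix before the table lookup."""
    bare = model_id.split("/", 1)[-1]
    return PRICING_PER_1M.get(bare, DEFAULT_PRICING)

def request_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the looked-up per-token rates."""
    in_rate, out_rate = lookup_pricing(model_id)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

Without the prefix strip, `"stepfun/step-3.5-flash"` misses the table and gets billed at the $3/$15 default — exactly the class of bug the dashboard surfaced.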

Dashboard feature we're adding: Cost anomaly alerting (#765). If per-request cost deviates more than 2x from the trailing 7-day average for that model, fire a Slack alert. This was in our sprint backlog as a nice-to-have. After the billing bug, it became a P0.
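The alert rule is cheap to prototype. This sketch assumes you hand it the per-request costs from the trailing window; the real feature would aggregate per model over 7 days and post to Slack rather than return a bool:

```python
def is_cost_anomaly(request_cost: float,
                    trailing_costs: list[float],
                    threshold: float = 2.0) -> bool:
    """Flag a request whose cost deviates more than `threshold`x from the
    trailing average, in either direction (spikes and drops both matter)."""
    if not trailing_costs:
        return False  # no baseline yet; don't alert on the first request
    baseline = sum(trailing_costs) / len(trailing_costs)
    if baseline == 0:
        return request_cost > 0
    ratio = request_cost / baseline
    return ratio > threshold or ratio < 1 / threshold
```

A $0.003552 request against a $0.00007 baseline trips the rule immediately; a drop to near-zero trips it too, which would also have caught the silently-dropped reasoning tokens.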

Error tracking: the governance.py incident

Our error tracking pipeline works like this: ErrorCaptureMiddleware catches unhandled exceptions → logs to MongoDB error_logs → queryable via ./scripts/errors recent. The dashboard shows error rates, stack traces, and per-org breakdowns.
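The shape of that capture step, with a plain list standing in for the MongoDB error_logs collection (the decorator name and record fields here are illustrative, not our actual middleware — a real version would call insert_one on the collection):

```python
import datetime
import traceback

# Stand-in for the MongoDB error_logs collection.
error_logs: list[dict] = []

def capture_errors(handler):
    """Wrap a request handler: log any unhandled exception, then re-raise
    so the caller still sees the failure."""
    def wrapped(request: dict):
        try:
            return handler(request)
        except Exception as exc:
            error_logs.append({
                "ts": datetime.datetime.now(datetime.timezone.utc),
                "org": request.get("org"),
                "error": repr(exc),
                "stack": traceback.format_exc(),
            })
            raise
    return wrapped
```

The per-org field is what makes the dashboard's per-org breakdowns possible: every record carries attribution, not just a stack trace.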

The governance.py syntax error was a Python SyntaxError — the module couldn't even import. The error showed up immediately in the error logs when the new Docker image tried to start:

  File "governance.py", line 1154
    logger.warning(
                   ^
SyntaxError: invalid syntax

What it taught us: Our error tracking catches runtime errors beautifully. But this was a parse-time error — the file couldn't load at all. The gateway process crashed on startup, so no requests were processed, and the error capture middleware never ran.

We noticed because we watched the deploy logs. But if we'd deployed and walked away, the VPS would just be serving the old cached container with no indication that the new code was broken.

Dashboard feature we're adding: Deploy health verification. After a deploy, the dashboard should poll the /health endpoint of all services. If any service doesn't respond within 30 seconds, fire an alert. Simple, but we didn't have it.
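A minimal version of that check, assuming each service exposes a /health endpoint that returns 200 (the function name and polling details are our sketch, not the shipped feature):

```python
import time
import urllib.error
import urllib.request

def verify_deploy(services: dict[str, str], timeout_s: float = 30.0) -> list[str]:
    """Poll each service's health URL until it responds 200 or the deadline
    passes. Returns the names of services that never came up; a non-empty
    result is what should fire the alert."""
    deadline = time.monotonic() + timeout_s
    pending = dict(services)  # name -> health URL
    while pending and time.monotonic() < deadline:
        for name, url in list(pending.items()):
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    if resp.status == 200:
                        del pending[name]
            except (urllib.error.URLError, OSError):
                pass  # not up yet; retry until the deadline
        if pending:
            time.sleep(1)
    return sorted(pending)
```

This is the check that would have caught the governance.py crash: the gateway never binds its port, /health never answers, and the deploy fails loudly instead of silently serving the old container.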

Evaluation framework: would it have caught the merge?

Sprint 2 shipped an evaluation framework (#790) — regex rules, keyword matching, and LLM-as-judge for evaluating agent outputs. It was built for customer-facing quality checks. But can we use it on ourselves?

Evaluation rules we're adding for our own dev process

| Rule | Type | What it catches |
| --- | --- | --- |
| py_compile on changed .py files | Regex/script | Syntax errors surviving merge |
| Cost delta check after model swap | Numeric threshold | Pricing lookup failures |
| Token field completeness | Keyword | Missing reasoning_tokens, cache tokens |
| Governance chain health after deploy | HTTP check | Import errors, broken middleware |
| Test assertions reference current model | Regex | Stale model names in test fixtures |

The evaluation framework was designed for customers. Now we're its first customer. Every rule above maps to a real bug we shipped this sprint.
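The first rule in that table is the easiest to show. A sketch using the stdlib's py_compile — a real pre-merge hook would feed it the changed files from git diff --name-only:

```python
import py_compile

def check_changed_python(paths: list[str]) -> list[str]:
    """Return the subset of changed .py files that fail to compile.
    Non-Python files are skipped; an empty result means the merge is
    syntactically clean."""
    failures = []
    for path in paths:
        if not path.endswith(".py"):
            continue
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError:
            failures.append(path)
    return failures
```

This single rule, run against the merged file rather than the diff, would have rejected the governance.py merge before it ever reached a Docker build.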

The sprint velocity loop

We track sprint velocity the same way a traditional team would — issues closed, PRs merged, bugs shipped vs bugs caught. But we add AI-specific metrics:

Sprint 2 metrics (10 days)

| Metric | Count |
| --- | --- |
| Feature PRs merged | 59 |
| Bug fixes | 22 |
| Security patches | 8 |
| Bugs shipped to prod (caught same sprint) | 3 |
| Bugs caught in code review | 6 |

The ratio that matters: 6 bugs caught in review vs 3 that hit production. That's a 2:1 catch rate. For a human+AI pair programming workflow where the pace is 30+ commits/day, that's... not great. Traditional teams shipping 3-5 commits/day usually catch more bugs pre-production.

The insight: AI pair programming increases throughput and increases the review burden. If your review process doesn't scale with your commit rate, bugs leak through. We need to make review automatic, not optional.

What we're building: A pre-merge check that runs the evaluation framework rules against every PR. Not a full CI pipeline — just the fast checks (syntax, cost assertions, import verification) that catch the class of bug we actually shipped.
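The runner itself can be tiny. A hypothetical shape, where each fast check is a named callable that returns failure messages and the PR merges only if every rule comes back clean:

```python
from typing import Callable

# A rule takes no arguments and returns a list of failure messages;
# empty means the rule passed.
Rule = Callable[[], list[str]]

def run_premerge(rules: dict[str, Rule]) -> dict[str, list[str]]:
    """Run every fast check; return only the rules that reported failures.
    An empty dict means the PR is clear to merge."""
    results = {}
    for name, rule in rules.items():
        failures = rule()
        if failures:
            results[name] = failures
    return results
```

Syntax checks, cost assertions, and import verification each slot in as one entry in the dict, which keeps the gate fast enough to run on every PR rather than nightly.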

Concrete improvements from this retro

Every lesson from the sprint retro maps to a GitHub issue. These are real, not aspirational:

Retro lesson → GitHub issue → Dashboard feature

| Lesson | Issue | Ships in |
| --- | --- | --- |
| Syntax error hid for 2 days | Pre-deploy py_compile check | Sprint 3 |
| Cost calc fell to $3/$15 default | Cost verification gate in deploy pipeline | Sprint 3 |
| Model string in 56 files | Single model config (budget/standard/frontier) | Sprint 3 |
| Reasoning tokens invisible | max_reasoning_tokens governance parameter | Sprint 3 |
| Review didn't check full file | Eval framework rules for pre-merge checks | Sprint 3 |
| No deploy health check | Post-deploy health poll + alert | Sprint 3 |

Why this matters for our customers

If we can't use our own dashboard to catch billing bugs, deployment failures, and cost anomalies in our own development process, why would anyone trust us to catch them in theirs?

Dogfooding isn't a nice-to-have. It's the only honest way to build a governance product. Every bug we ship and catch with our own tools proves the tools work. Every bug we miss shows us what to build next.

The governance platform isn't just for customers running AI agents in production. It's for anyone building with AI — including us. The cost attribution that caught our 47x billing error? That's the same feature a customer would use to spot their own pricing misconfigurations. The error tracking that surfaced the governance.py crash? Same pipeline that would alert a customer to their agent failing.

The loop closes when the product governs its own development. That's where we are now.

Every feature mentioned in this post is live on the curate-me.ai dashboard. If you're running AI agents and want cost attribution, error tracking, and evaluation rules — you don't have to build them yourself. That's literally what we sell.


Read the full sprint retrospective: What Claude Code and I Actually Shipped (And What Broke)

Technical deep-dive on the billing bug: Why We Switched from MiniMax to StepFun
