Why We Switched from MiniMax to StepFun (And Found a 100x Billing Bug)
A model swap that should have been a one-line change turned into a 46-file investigation, uncovered a critical billing bug, and taught us why model governance needs a single source of truth.
Claude (Opus 4.6) — Codebase search, code review, cost verification testing
Governed by curate-me.ai
The switch
Our fleet agents — the Hospital CFO demo, the blog pipeline, every runner template — all ran on MiniMax M2.5 via OpenRouter. It was cheap ($0.20/$1.20 per million tokens) and fast enough for agent tasks that don't need frontier reasoning.
Then StepFun released Step 3.5 Flash. The benchmarks looked strong for agentic use, and the pricing was aggressive: $0.10 input / $0.30 output per million tokens. That's 4x cheaper than MiniMax on output tokens, which is where most agent cost goes.
Simple decision. Swap the model string. Redeploy.
It was not simple.
The scope problem
Where the model identifier was hardcoded
The model string openrouter/minimax/minimax-m2.5 appeared in 56 runner templates, the model catalog, the provider router, the pricing table, test fixtures, documentation, and 15 files in the reference app. There was no BUDGET_MODEL constant. No config file that everything read from. Just the same string, copy-pasted everywhere.
This is the kind of thing you don't notice until you have to change it. And then you notice it hard.
The hidden bug
After swapping all 46 files and deploying, I ran a test request through the gateway. The cost in the logs: $0.003552 for 14 input tokens and 114 output tokens.
That's wrong. At $0.10/$0.30 per million tokens, those 128 tokens should cost about $0.0000356 (14 × $0.10/M + 114 × $0.30/M). Not $0.003552. We were overcharging by roughly 100x.
Debugging the cost calculation
The get_model_pricing() function looked up stepfun/step-3.5-flash in the local pricing table. Not found. Tried litellm's database. Not found (it's a new model). Fell through to the default fallback: $3.00 input / $15.00 output per million tokens. The same default that catches unknown models so they don't get billed at zero.
That default — meant as a safety net — became a silent overcharge.
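The failure mode is easy to reproduce in miniature. The sketch below is illustrative, not our actual gateway code: the table names (`LOCAL_PRICING`, `DEFAULT_PRICING`) and the lookup chain are hypothetical stand-ins for the real `get_model_pricing()` path.

```python
# Illustrative pricing lookup with a fallback default.
# Rates are ($/1M input tokens, $/1M output tokens).
LOCAL_PRICING = {
    "minimax-m2.5": (0.20, 1.20),
    "claude-sonnet-4-6": (3.00, 15.00),
}

# Safety net so unknown models aren't billed at $0.
DEFAULT_PRICING = (3.00, 15.00)

def get_model_pricing(model: str) -> tuple[float, float]:
    # BUG: "stepfun/step-3.5-flash" is not in the table (and the
    # vendor prefix would defeat the lookup anyway), so a brand-new
    # model silently gets the expensive default.
    return LOCAL_PRICING.get(model, DEFAULT_PRICING)

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = get_model_pricing(model)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

With these illustrative rates, a 14-input / 114-output request against the unknown model bills at the default tier, orders of magnitude above the real price.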
The second bug: invisible thinking tokens
Step 3.5 Flash is a reasoning model. It uses "thinking tokens" to reason through problems before generating output. These tokens appear in completion_tokens_details.reasoning_tokens in the API response, but our cost recorder only looked at completion_tokens.
So even after fixing the pricing lookup, we were still wrong — just in the other direction. The thinking tokens were real compute being consumed but never counted. The 114 completion tokens in my test request? All of them were reasoning tokens; the model's visible response was counted separately.
Token accounting before and after fix
| Token type | Before fix | After fix |
|---|---|---|
| Input (prompt) | Counted | Counted |
| Output (completion) | Counted | Counted |
| Reasoning (thinking) | Ignored (0) | Counted separately |
| Cache creation | Ignored | Counted at input rate |
| Cache read | Ignored | Counted at discounted rate |
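The "after fix" column reduces to a single cost function. The billing rules below are assumptions for the sketch — reasoning tokens billed at the output rate, cache creation at the input rate, cache reads at 10% of the input rate — check your provider's actual rules before copying this.

```python
def total_cost(usage: dict, in_rate: float, out_rate: float,
               cache_read_discount: float = 0.1) -> float:
    """Cost in USD for one request, with rates in $/1M tokens.

    Assumptions (verify per provider): reasoning tokens bill at the
    output rate; cache reads bill at a fraction of the input rate.
    """
    def tok(n: int, rate: float) -> float:
        return n * rate / 1_000_000

    return (
        tok(usage.get("input_tokens", 0), in_rate)
        + tok(usage.get("output_tokens", 0), out_rate)
        + tok(usage.get("reasoning_tokens", 0), out_rate)
        + tok(usage.get("cache_creation_tokens", 0), in_rate)
        + tok(usage.get("cache_read_tokens", 0), in_rate * cache_read_discount)
    )
```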
The governance chain was also broken
While investigating the cost issue, we discovered the gateway wouldn't even start with a full rebuild. The governance.py file — the 6-step policy chain — had a syntax error from a previous merge.
Old fixed-window rate limiter code was interleaved with new sliding-window code. Logger calls without closing parentheses. Variables used before definition. The file hadn't actually run in 2 days because Docker was serving a cached image.
Three bugs. One model swap. All of them hiding behind a green CI check and a Docker cache.
What we built to prevent this
1. Cost verification test
The gateway smoke test now includes a cost check. After any model change, it makes a test request and compares the logged cost against the known pricing:
Expected cost: $0.0000356 (14 input + 114 output tokens @ $0.10/$0.30 per million)
Actual cost: $0.0000356
Difference: 0% ✓
If the difference exceeds 10%, the deploy halts with an error.
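The gate itself is small. This is a sketch, not our actual smoke test — `verify_cost` and its signature are hypothetical, and it assumes you can read the logged cost and token counts back out of the gateway:

```python
def verify_cost(logged_cost: float, input_tokens: int, output_tokens: int,
                in_rate: float, out_rate: float, tolerance: float = 0.10) -> None:
    """Halt the deploy if the logged cost drifts >10% from known pricing.

    Rates are in $/1M tokens. Raises SystemExit on failure so a deploy
    script stops with a nonzero exit code.
    """
    expected = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    drift = abs(logged_cost - expected) / expected
    if drift > tolerance:
        raise SystemExit(
            f"Cost check failed: expected ${expected:.7f}, "
            f"logged ${logged_cost:.7f} ({drift:.0%} off)"
        )
```

A pricing-table fallthrough like the one above would have tripped this immediately: the fallback-rate cost is nowhere near 10% of the expected value.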
2. Model alias registry
The gateway already had a model_alias_registry.py. We now use it as the single source of truth for model swaps. Templates reference aliases (budget, standard, frontier) instead of raw model strings. Changing the budget model means updating one alias mapping.
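In spirit, the registry reduces a model swap to a one-line change. The mapping below is illustrative (the tier names come from the post; the frontier entry is a made-up example), not the contents of the real `model_alias_registry.py`:

```python
# Hypothetical alias table: templates ask for a tier, the registry
# resolves it to a concrete model string in exactly one place.
MODEL_ALIASES = {
    "budget":   "stepfun/step-3.5-flash",    # was openrouter/minimax/minimax-m2.5
    "standard": "anthropic/claude-sonnet-4-6",
    "frontier": "anthropic/claude-opus-4-6",  # illustrative entry
}

def resolve_model(alias_or_model: str) -> str:
    # Unknown keys pass through, so raw model strings keep working
    # during the migration.
    return MODEL_ALIASES.get(alias_or_model, alias_or_model)
```

Swapping the budget model again means editing one value, and every template, test, and doc that says `budget` follows along.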
3. Vendor prefix stripping
get_model_pricing() now strips vendor/ prefixes before lookup. stepfun/step-3.5-flash → step-3.5-flash. anthropic/claude-sonnet-4-6 → claude-sonnet-4-6. Simple, but it has to be done everywhere a model name is used for cost calculation.
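One minimal way to do the stripping, assuming the vendor prefix is everything up to the last slash (which also handles double-prefixed strings like `openrouter/minimax/minimax-m2.5`):

```python
def strip_vendor_prefix(model: str) -> str:
    """Drop any 'vendor/' prefix before a pricing-table lookup.

    Keeps only the segment after the last slash; prefix-free names
    pass through unchanged.
    """
    return model.rsplit("/", 1)[-1]
```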
4. Reasoning token tracking
The cost recorder now extracts from both streaming and non-streaming responses:
- reasoning_tokens from completion_tokens_details
- cache_creation_input_tokens and cache_read_input_tokens
All of these are logged separately and included in cost calculation.
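A sketch of the extraction, assuming an OpenAI-style `usage` dict. The field names match the response shape described above, but providers vary — treat the exact keys as assumptions and default every count to zero so older models without these fields still work:

```python
def extract_usage(usage: dict) -> dict:
    """Pull all billable token counts out of a usage payload.

    Missing fields default to 0, so models that don't report
    reasoning or cache tokens are unaffected.
    """
    details = usage.get("completion_tokens_details") or {}
    return {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        "reasoning_tokens": details.get("reasoning_tokens", 0),
        "cache_creation_tokens": usage.get("cache_creation_input_tokens", 0),
        "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
    }
```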
The real lesson
A model swap is not a string replacement. It's a governance event that touches pricing, cost calculation, token accounting, test assertions, documentation, and every template that references the model.
If your platform doesn't have a single source of truth for model configuration, every swap is a minefield. If your cost calculator doesn't verify against known pricing, billing errors hide silently. If you don't track reasoning tokens, you're lying to your customers about what their agents cost.
We build a governance platform. We should have caught this before our customers did. We got lucky this time — we caught it during our own testing. Next time, the cost verification gate will catch it automatically.
If you're using OpenRouter or any multi-provider proxy, check your cost calculation right now. Make a test request, look at the logged cost, and compare it to the model's known pricing. If there's a discrepancy, your pricing lookup is probably falling through to a default.
This post is part of the Sprint 2 retrospective. Read the full retro for context on what else went wrong (and right) this sprint.