Why We Switched from MiniMax to StepFun (And Found a 100x Billing Bug)
A model swap that should have been a one-line change turned into a 46-file investigation, uncovered a critical billing bug, and taught us why model governance needs a single source of truth.
Claude (Opus 4.6) — Codebase search, code review, cost verification testing
Governed by curate-me.ai
The switch
Our fleet agents — the Hospital CFO demo, the blog pipeline, every runner template — all ran on MiniMax M2.5 via OpenRouter. It was cheap ($0.20/$1.20 per million tokens) and fast enough for agent tasks that don't need frontier reasoning.
Then StepFun released Step 3.5 Flash. The benchmarks looked strong for agentic use, and the pricing was aggressive: $0.10 input / $0.30 output per million tokens. That's 4x cheaper than MiniMax on output tokens, which is where most agent cost goes.
Simple decision. Swap the model string. Redeploy.
It was not simple.
The scope problem
Where the model identifier was hardcoded
The model string openrouter/minimax/minimax-m2.5 appeared in 56 runner templates, the model catalog, the provider router, the pricing table, test fixtures, documentation, and 15 files in the reference app. There was no BUDGET_MODEL constant. No config file that everything read from. Just the same string, copy-pasted everywhere.
This is the kind of thing you don't notice until you have to change it. And then you notice it hard.
The hidden bug
After swapping all 46 files and deploying, I ran a test request through the gateway. The cost in the logs: $0.003552 for 14 input tokens and 114 output tokens.
That's wrong. At $0.10/$0.30 per million tokens, those 128 tokens should cost about $0.0000356 (14 × $0.10/M + 114 × $0.30/M). Not $0.003552. We were overcharging by roughly 100x.
Debugging the cost calculation
The get_model_pricing() function looked up stepfun/step-3.5-flash in the local pricing table. Not found. Tried litellm's database. Not found (it's a new model). Fell through to the default fallback: $3.00 input / $15.00 output per million tokens. The same default that catches unknown models so they don't get billed at zero.
That default — meant as a safety net — became a silent overcharge.
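The failure mode is easy to reproduce in miniature. The sketch below is illustrative, not our actual gateway code: the table names (`LOCAL_PRICING`, `DEFAULT_PRICING`) and the lookup chain are hypothetical stand-ins for the real `get_model_pricing()` path.

```python
# Illustrative pricing lookup with a fallback default.
# Rates are ($/1M input tokens, $/1M output tokens).
LOCAL_PRICING = {
    "minimax-m2.5": (0.20, 1.20),
    "claude-sonnet-4-6": (3.00, 15.00),
}

# Safety net so unknown models aren't billed at $0.
DEFAULT_PRICING = (3.00, 15.00)

def get_model_pricing(model: str) -> tuple[float, float]:
    # BUG: "stepfun/step-3.5-flash" is not in the table (and the
    # vendor prefix would defeat the lookup anyway), so a brand-new
    # model silently gets the expensive default.
    return LOCAL_PRICING.get(model, DEFAULT_PRICING)

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = get_model_pricing(model)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

With these illustrative rates, a 14-input / 114-output request against the unknown model bills at the default tier, orders of magnitude above the real price.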
The second bug: invisible thinking tokens
Step 3.5 Flash is a reasoning model. It uses "thinking tokens" to reason through problems before generating output. These tokens appear in completion_tokens_details.reasoning_tokens in the API response, but our cost recorder only looked at completion_tokens.
So even after fixing the pricing lookup, we were still wrong — just in the other direction. The thinking tokens were real compute being consumed but never counted. The 114 completion tokens in my test request? All of them were reasoning tokens; the model's visible response was counted separately.
Token accounting before and after fix
| Token type | Before fix | After fix |
|---|---|---|
| Input (prompt) | Counted | Counted |
| Output (completion) | Counted | Counted |
| Reasoning (thinking) | Ignored (0) | Counted separately |
| Cache creation | Ignored | Counted at input rate |
| Cache read | Ignored | Counted at discounted rate |
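The "after fix" column reduces to a single cost function. The billing rules below are assumptions for the sketch — reasoning tokens billed at the output rate, cache creation at the input rate, cache reads at 10% of the input rate — check your provider's actual rules before copying this.

```python
def total_cost(usage: dict, in_rate: float, out_rate: float,
               cache_read_discount: float = 0.1) -> float:
    """Cost in USD for one request, with rates in $/1M tokens.

    Assumptions (verify per provider): reasoning tokens bill at the
    output rate; cache reads bill at a fraction of the input rate.
    """
    def tok(n: int, rate: float) -> float:
        return n * rate / 1_000_000

    return (
        tok(usage.get("input_tokens", 0), in_rate)
        + tok(usage.get("output_tokens", 0), out_rate)
        + tok(usage.get("reasoning_tokens", 0), out_rate)
        + tok(usage.get("cache_creation_tokens", 0), in_rate)
        + tok(usage.get("cache_read_tokens", 0), in_rate * cache_read_discount)
    )
```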
The governance chain was also broken
While investigating the cost issue, we discovered the gateway wouldn't even start with a full rebuild. The governance.py file — the 6-step policy chain — had a syntax error from a previous merge.
Old fixed-window rate limiter code was interleaved with new sliding-window code. Logger calls without closing parentheses. Variables used before definition. The file hadn't actually run in 2 days because Docker was serving a cached image.
Three bugs. One model swap. All of them hiding behind a green CI check and a Docker cache.
What we built to prevent this
1. Cost verification test
The gateway smoke test now includes a cost check. After any model change, it makes a test request and compares the logged cost against the known pricing:
Expected cost: $0.0000356 (14 input + 114 output tokens @ $0.10/$0.30 per million)
Actual cost: $0.0000356
Difference: 0% ✓
If the difference exceeds 10%, the deploy halts with an error.
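The gate itself is small. This is a sketch, not our actual smoke test — `verify_cost` and its signature are hypothetical, and it assumes you can read the logged cost and token counts back out of the gateway:

```python
def verify_cost(logged_cost: float, input_tokens: int, output_tokens: int,
                in_rate: float, out_rate: float, tolerance: float = 0.10) -> None:
    """Halt the deploy if the logged cost drifts >10% from known pricing.

    Rates are in $/1M tokens. Raises SystemExit on failure so a deploy
    script stops with a nonzero exit code.
    """
    expected = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    drift = abs(logged_cost - expected) / expected
    if drift > tolerance:
        raise SystemExit(
            f"Cost check failed: expected ${expected:.7f}, "
            f"logged ${logged_cost:.7f} ({drift:.0%} off)"
        )
```

A pricing-table fallthrough like the one above would have tripped this immediately: the fallback-rate cost is nowhere near 10% of the expected value.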
2. Model alias registry
The gateway already had a model_alias_registry.py. We now use it as the single source of truth for model swaps. Templates reference aliases (budget, standard, frontier) instead of raw model strings. Changing the budget model means updating one alias mapping.
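In spirit, the registry reduces a model swap to a one-line change. The mapping below is illustrative (the tier names come from the post; the frontier entry is a made-up example), not the contents of the real `model_alias_registry.py`:

```python
# Hypothetical alias table: templates ask for a tier, the registry
# resolves it to a concrete model string in exactly one place.
MODEL_ALIASES = {
    "budget":   "stepfun/step-3.5-flash",    # was openrouter/minimax/minimax-m2.5
    "standard": "anthropic/claude-sonnet-4-6",
    "frontier": "anthropic/claude-opus-4-6",  # illustrative entry
}

def resolve_model(alias_or_model: str) -> str:
    # Unknown keys pass through, so raw model strings keep working
    # during the migration.
    return MODEL_ALIASES.get(alias_or_model, alias_or_model)
```

Swapping the budget model again means editing one value, and every template, test, and doc that says `budget` follows along.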
3. Vendor prefix stripping
get_model_pricing() now strips vendor/ prefixes before lookup. stepfun/step-3.5-flash → step-3.5-flash. anthropic/claude-sonnet-4-6 → claude-sonnet-4-6. Simple, but it has to be done everywhere a model name is used for cost calculation.
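One minimal way to do the stripping, assuming the vendor prefix is everything up to the last slash (which also handles double-prefixed strings like `openrouter/minimax/minimax-m2.5`):

```python
def strip_vendor_prefix(model: str) -> str:
    """Drop any 'vendor/' prefix before a pricing-table lookup.

    Keeps only the segment after the last slash; prefix-free names
    pass through unchanged.
    """
    return model.rsplit("/", 1)[-1]
```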
4. Reasoning token tracking
The cost recorder now extracts from both streaming and non-streaming responses:
- reasoning_tokens from completion_tokens_details
- cache_creation_input_tokens and cache_read_input_tokens
All of these are logged separately and included in cost calculation.
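A sketch of the extraction, assuming an OpenAI-style `usage` dict. The field names match the response shape described above, but providers vary — treat the exact keys as assumptions and default every count to zero so older models without these fields still work:

```python
def extract_usage(usage: dict) -> dict:
    """Pull all billable token counts out of a usage payload.

    Missing fields default to 0, so models that don't report
    reasoning or cache tokens are unaffected.
    """
    details = usage.get("completion_tokens_details") or {}
    return {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        "reasoning_tokens": details.get("reasoning_tokens", 0),
        "cache_creation_tokens": usage.get("cache_creation_input_tokens", 0),
        "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
    }
```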
The real lesson
A model swap is not a string replacement. It's a governance event that touches pricing, cost calculation, token accounting, test assertions, documentation, and every template that references the model.
If your platform doesn't have a single source of truth for model configuration, every swap is a minefield. If your cost calculator doesn't verify against known pricing, billing errors hide silently. If you don't track reasoning tokens, you're lying to your customers about what their agents cost.
We build a governance platform. We should have caught this before our customers did. We got lucky this time — we caught it during our own testing. Next time, the cost verification gate will catch it automatically.
If you're using OpenRouter or any multi-provider proxy, check your cost calculation right now. Make a test request, look at the logged cost, and compare it to the model's known pricing. If there's a discrepancy, your pricing lookup is probably falling through to a default.
This post is part of the Sprint 2 retrospective. Read the full retro for context on what else went wrong (and right) this sprint.