
How We Cut AI Agent Costs by 85%

A cost optimization case study: swapping models, adding semantic caching, and migrating to DeepSeek R1 took our multi-agent pipeline from $0.12 per analysis to under $0.02 — with no quality loss.

March 17, 2026 · 8 min read
AI Collaboration

blog-analyst: Aggregated cost data from gateway logs for before/after comparisons

Claude (Opus 4.6): Structured the analysis and wrote the case study narrative

Total AI cost: $0.12

Governed by curate-me.ai

The bill nobody budgeted for

We built a fashion analysis platform powered by six AI agents working in sequence: a vision agent reads your photo, a style agent generates recommendations, a critic agent reviews quality, a shopping agent matches products, and two optional agents handle virtual try-on and feedback learning. Each analysis flows through 4-6 LLM calls.
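The sequential flow can be sketched as a context dict that accumulates each agent's output. This is a minimal illustration, not the platform's actual code: the agent functions here are hypothetical stubs standing in for real LLM calls routed through the gateway.

```python
# Hypothetical stubs for the pipeline agents; in production each
# function wraps one or more gateway-routed LLM calls.
def vision_agent(ctx):    # reads the photo: items, colors, patterns
    return {"vision": "navy blazer, khakis"}

def style_agent(ctx):     # generates recommendations from the vision output
    return {"style": f"recommendations for {ctx['vision']}"}

def critic_agent(ctx):    # reviews quality and coherence
    return {"critique": "no contradictions found"}

def shopping_agent(ctx):  # matches products to the recommendations
    return {"products": ["blazer-123"]}

def run_pipeline(photo):
    # Each agent sees the accumulated context and merges new fields in.
    ctx = {"photo": photo}
    for agent in (vision_agent, style_agent, critic_agent, shopping_agent):
        ctx.update(agent(ctx))
    return ctx
```

Every `update` is a potential LLM call, which is why the per-analysis cost is the sum of 4-6 individual model invocations.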

During the first week of production, a single analysis cost $0.12. At 100 analyses per day, that is $360/month in LLM costs alone — before infrastructure, before salaries, before anything else. For a consumer product targeting a $9.99/month subscription, the unit economics were broken from day one.

Over the next eight weeks, we cut that cost to under $0.02 per analysis. An 85% reduction. Here is exactly how, with real numbers from our gateway cost logs.

Optimization 1: VisionAgent model swap (98% cost reduction)

The VisionAgent was the single most expensive step in the pipeline. It analyzes uploaded photos — identifying clothing items, colors, patterns, body proportions, and style signals. The original implementation used GPT-4V (now GPT-4o with vision), which was the only multimodal model available when we started building.

Before: GPT-4V at ~$0.04 per image analysis (high-resolution mode, ~1,000 input tokens for the image + 300 tokens of prompt, ~500 output tokens).

After: Gemini 2.5 Flash at ~$0.0008 per image analysis.

That is a 98% reduction on the most expensive step. Gemini Flash is not just cheaper — for fashion image analysis, it is actually better. Google's multimodal models have stronger visual grounding for clothing and color identification. We ran a blind evaluation on 200 images: Gemini Flash matched or exceeded GPT-4V quality on 94% of them, with noticeably better color accuracy.

The switch required changing one model parameter in the VisionAgent configuration. Zero code changes to the agent logic. Zero changes to the downstream pipeline. The gateway cost logs showed the impact immediately — VisionAgent went from our most expensive agent to one of the cheapest.
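For illustration, the one-parameter swap looks something like the following. The config field names are hypothetical, not the platform's actual schema; the cost figures are the per-call estimates quoted above.

```python
# Hypothetical agent config; field names are illustrative.
# The entire migration described above is the one changed line:
VISION_AGENT_CONFIG = {
    # "model": "gpt-4o",          # before: ~$0.04 per image analysis
    "model": "gemini-2.5-flash",  # after:  ~$0.0008 per image analysis
    "max_output_tokens": 500,
    "temperature": 0.2,
}
```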

Savings: $0.039 per analysis

Optimization 2: CriticAgent migration to DeepSeek R1 (81% savings)

The CriticAgent reviews the StyleAgent's recommendations for quality, coherence, and style accuracy. It is a reasoning-heavy task: compare the visual analysis against the style recommendations, check for contradictions, score overall quality, and suggest improvements.

We originally used GPT-4o for this. It worked, but the CriticAgent was our second most expensive step because reasoning tasks generate long outputs — the critic writes detailed feedback with specific citations from the analysis.

DeepSeek R1 changed the equation. It is a reasoning-specialist model priced at $0.55 per million input tokens and $2.19 per million output tokens — roughly 80% cheaper than GPT-4o for equivalent tasks. More importantly, its reasoning chains are explicit and traceable, which means the critic's feedback is more structured and useful.

Before: GPT-4o CriticAgent at ~$0.025 per review (long reasoning output, ~800 completion tokens).

After: DeepSeek R1 at ~$0.005 per review.
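The per-review estimate follows directly from the per-million-token prices quoted above. The token counts below are illustrative assumptions (R1 bills its reasoning chain as output tokens, so billed output runs well past the ~800 visible completion tokens):

```python
def llm_call_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one call given per-million-token prices in dollars."""
    return (input_tokens / 1e6 * in_price_per_m
            + output_tokens / 1e6 * out_price_per_m)

# DeepSeek R1 list prices from the post: $0.55/M input, $2.19/M output.
# Assumed ~1,500 input tokens and ~2,000 billed output tokens
# (visible completion plus reasoning chain).
cost = llm_call_cost(1_500, 2_000, 0.55, 2.19)  # ≈ $0.0052 per review
```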

We validated this with an A/B test over 500 analyses. DeepSeek R1 produced higher-quality critiques on 67% of comparisons (as scored by a separate evaluator model). The critic caught more subtle issues — color coordination problems, proportion mismatches, occasion-appropriateness gaps — because the explicit reasoning chain forced it to check each dimension systematically.

Savings: $0.020 per analysis

Optimization 3: Response caching (30-40% call reduction)

The insight that unlocked caching was simple: many users upload similar photos. A photo of a navy blazer and khakis generates essentially the same style analysis whether it is user A or user B. The clothing items, colors, and style signals are identical — only the body-specific recommendations differ.

The gateway's two-tier response cache handles this at the infrastructure level:

Exact match cache: If two requests have identical model, messages, and parameters (after stripping metadata like user ID and stream flag), the gateway returns the cached response. This catches duplicate requests from retries, page refreshes, and identical prompts.

Semantic cache: If two requests are semantically similar (cosine similarity above 0.92 on embedded prompts), the gateway returns the cached response. This catches paraphrased versions of the same question — common in agent-to-agent communication where the orchestrator rephrases prompts slightly between retries.
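The two tiers can be sketched as follows. This is a simplified model of the behavior described above, not the gateway's implementation: `embed` is any text-to-vector function (in practice an embedding model), and the metadata field names stripped for exact matching are assumptions.

```python
import hashlib
import json
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class TwoTierCache:
    """Sketch of an exact-match + semantic response cache."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # text -> vector
        self.threshold = threshold  # cosine-similarity cutoff
        self.exact = {}             # canonical request hash -> response
        self.semantic = []          # (embedding, response) pairs

    def _key(self, request):
        # Strip metadata that doesn't affect the completion, then
        # hash the canonical JSON for exact matching.
        core = {k: v for k, v in request.items()
                if k not in ("user_id", "stream")}
        return hashlib.sha256(
            json.dumps(core, sort_keys=True).encode()).hexdigest()

    def get(self, request, prompt_text):
        hit = self.exact.get(self._key(request))  # tier 1: exact
        if hit is not None:
            return hit
        vec = self.embed(prompt_text)             # tier 2: semantic
        for cached_vec, response in self.semantic:
            if _cosine(vec, cached_vec) >= self.threshold:
                return response
        return None

    def put(self, request, prompt_text, response):
        self.exact[self._key(request)] = response
        self.semantic.append((self.embed(prompt_text), response))
```

The real gateway additionally enforces per-org isolation, TTLs, and governance policies on top of this lookup logic.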

In practice, exact-match caching alone eliminated 30-40% of LLM calls for the StyleAgent and ShoppingAgent. These agents receive structured inputs from upstream agents, so the input space is more constrained than free-form user prompts. The same "casual weekend outfit for warm weather" query shows up repeatedly across different users.

The cache respects governance policies (cached responses still count against rate limits) and is per-org (one customer's cached responses are never served to another). Cache entries expire after 24 hours by default, configurable per org.

Savings: ~$0.015 per analysis (averaged across cache hit rate)

Optimization 4: Batch processing for non-interactive workflows

Not every analysis needs real-time results. Background tasks — daily trend analysis, product catalog updates, style preference learning — can tolerate higher latency in exchange for lower cost.

OpenAI's Batch API offers a 50% discount for requests that can wait up to 24 hours. We moved three non-interactive workflows to batch processing:

  1. Nightly product re-ranking — The ShoppingAgent re-scores product relevance nightly. Moving this to batch cut costs by 50% on ~200 calls per night.
  2. Weekly style trend analysis — Aggregate analysis across all users to identify trending styles. Batch processing made a $2.00/week job into a $1.00/week job.
  3. Feedback learning pipeline — Processing user ratings to update style preferences. Moved from real-time to hourly batch.
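A batch job is a JSONL file of requests uploaded to the Batch API. Here is a sketch for the nightly re-ranking workflow; the model name, prompts, and `custom_id` scheme are illustrative, and the submission calls (commented out) require the `openai` package and an API key.

```python
import json

def build_batch_lines(prompts, model="gpt-4o-mini"):
    """Build JSONL lines for OpenAI's Batch API (50% discount,
    results within 24 hours). One line per request."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"rerank-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": prompt}]},
        }))
    return lines

# Submission sketch (requires `openai` and credentials):
# from openai import OpenAI
# client = OpenAI()
# with open("rerank.jsonl", "w") as f:
#     f.write("\n".join(build_batch_lines(prompts)))
# batch_file = client.files.create(file=open("rerank.jsonl", "rb"),
#                                  purpose="batch")
# client.batches.create(input_file_id=batch_file.id,
#                       endpoint="/v1/chat/completions",
#                       completion_window="24h")
```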

Savings: ~$0.008 per analysis (amortized across batch volume)

Optimization 5: Prompt engineering and token reduction

This is the least exciting optimization and the one that required the most work. We audited every prompt in the pipeline and cut token waste:

  • System prompts: Trimmed from an average of 800 tokens to 350 tokens across all agents. Removed redundant instructions, examples, and formatting guidelines that the models already handle well.
  • Context passing: Instead of passing the full upstream agent output to downstream agents, we extract only the fields each agent needs. The CriticAgent does not need the ShoppingAgent's product URLs — it only needs the style analysis and recommendations.
  • Output schemas: Switched from free-form text output to structured JSON schemas (using OpenAI's response_format and Anthropic's tool_use). Structured outputs are shorter, more predictable, and easier to parse. Average output tokens dropped 35%.
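The context-passing change amounts to an allowlist per agent. A minimal sketch, with hypothetical field names:

```python
# Hypothetical allowlist: which upstream fields each agent actually needs.
AGENT_INPUT_FIELDS = {
    "critic": ("style_analysis", "recommendations"),
    "shopping": ("recommendations", "budget"),
}

def context_for(agent, upstream_output):
    """Trim the upstream output to only the fields this agent needs,
    instead of forwarding everything (and paying for the tokens)."""
    wanted = AGENT_INPUT_FIELDS[agent]
    return {k: v for k, v in upstream_output.items() if k in wanted}
```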

The compound effect of shorter prompts and shorter outputs across 4-6 agents per analysis added up.

Savings: ~$0.010 per analysis

The total picture

| Optimization | Before | After | Savings | Reduction |
|-------------|--------|-------|---------|-----------|
| VisionAgent → Gemini Flash | $0.040 | $0.001 | $0.039 | 98% |
| CriticAgent → DeepSeek R1 | $0.025 | $0.005 | $0.020 | 81% |
| Response caching | — | — | $0.015 | 30-40% of calls |
| Batch processing | — | — | $0.008 | 50% on batch jobs |
| Prompt optimization | — | — | $0.010 | ~35% token reduction |
| Total per analysis | $0.120 | $0.018 | $0.102 | 85% |

At 100 analyses per day, monthly LLM costs went from $360 to $54. At 1,000 analyses per day (our growth target), the projection went from $3,600/month to $540/month. The unit economics flipped from unsustainable to healthy.

What made this possible

None of these optimizations would have been practical without gateway-level cost tracking. Here is why:

Visibility: Before the gateway, we had no idea the VisionAgent was 33% of our cost. We were guessing based on model pricing pages and estimated token counts. The gateway's per-agent, per-model cost breakdown made the problem obvious.

Safe experimentation: Swapping a model is risky. What if quality drops? The gateway let us run A/B tests with cost tracking on both paths. We could see that DeepSeek R1 was cheaper AND better before committing to the switch.

Caching as infrastructure: Building a response cache inside application code is painful — you have to handle cache invalidation, TTLs, per-org isolation, and governance policy compliance. The gateway handles all of this as infrastructure. We flipped a feature flag and caching was live.

Budget guardrails during migration: While testing new models, budget caps prevented any single experiment from running up costs. If a new model turned out to be more expensive than expected, the daily budget cap caught it before it became a problem.

The lesson for other teams

The conventional wisdom is that AI costs are high and will come down as models get cheaper. That is true — model pricing drops roughly 10x every 18 months. But waiting for cheaper models is not a strategy. The optimizations in this case study are available today, with existing models, and they compound with future price drops.

The real lesson is that cost optimization requires cost visibility. You cannot optimize what you cannot measure. And measuring LLM costs accurately — across multiple models, multiple providers, streaming and non-streaming, batch and real-time — is harder than it sounds. A governance gateway solves this at the infrastructure level, which frees your engineering team to focus on the optimizations themselves.

Our multi-agent pipeline now runs at $0.018 per analysis. That is cheap enough to offer a generous free tier, sustainable enough to build a business on, and — critically — fully tracked so we will catch it immediately if costs start creeping back up.

See it in action: The agents page shows live per-agent cost breakdowns, or explore the cost explorer in the dashboard tour. Check the platform page for real-time gateway metrics.
