
How I Built an AI Gateway for OpenClaw

A technical deep-dive on building a governance gateway that proxies 51 LLM providers, enforces cost caps, scans for PII, and tracks every token — all without changing a single line of application code.

March 17, 2026 · 9 min read
AI Collaboration

Claude (Opus 4.6): Architecture analysis, technical writing, and code examples

Total AI cost: $0.14

Governed by curate-me.ai

The problem nobody talks about

OpenClaw has 313,000+ GitHub stars and over 135,000 exposed instances across 82 countries. It connects to every LLM provider, runs shell commands, controls browsers, and manages files. It is the most popular open-source AI assistant in the world.

It also has no built-in cost controls, no PII scanning, no audit trail, and no approval workflows. Every LLM call goes directly from agent to provider. If an agent enters a tool-calling loop at 3am, your credit card finds out before you do.

This is the problem I set out to solve with curate-me.ai: a governance gateway that sits between your agents and their LLM providers. Zero code changes required — swap a base URL and every call flows through a policy engine.

The base URL swap

The entire integration model comes down to two environment variables:

# Before (direct to OpenAI):
OPENAI_BASE_URL=https://api.openai.com/v1

# After (through the gateway):
OPENAI_BASE_URL=https://api.curate-me.ai/v1/openai
X-CM-API-KEY=cm_sk_xxx

That is it. No SDK. No code changes. No wrapper functions. The gateway speaks the same API as OpenAI, Anthropic, Google, and every other provider. Your existing code keeps working — it just passes through governance first.

This design decision shaped everything else. If the gateway requires code changes, adoption dies. Developers will not refactor working agent code to add governance. But swapping an environment variable? That takes 10 seconds.
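To make the mechanism concrete, here is a minimal sketch in plain Python (the helper name `chat_url` is hypothetical) of why a base URL swap requires no code changes: any client that resolves its endpoint from the environment gets rerouted automatically.

```python
import os

def chat_url(path="/chat/completions"):
    # The client derives its endpoint from the environment, so pointing
    # OPENAI_BASE_URL at the gateway reroutes every call without touching code.
    base = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return base.rstrip("/") + path

# Flip one variable and every request now flows through the gateway.
os.environ["OPENAI_BASE_URL"] = "https://api.curate-me.ai/v1/openai"
```

Every OpenAI-compatible client library does the equivalent of this internally, which is why the swap works across SDKs and languages.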

Architecture: the governance chain

Every request that hits the gateway passes through a policy chain before reaching the upstream provider. The chain evaluates checks in order and short-circuits on the first denial:

Request → Auth → Plan Enforcement → Rate Limit → Cost Estimate →
  Hierarchical Budget → PII Scan → Content Safety → Model Allowlist →
  HITL Gate → Provider Router → Upstream LLM

Each check is independent and stateless with respect to the other checks. Checks read real-time counters from Redis and policy configuration from MongoDB. The separation matters because you can enable or disable individual checks without touching the others.
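The short-circuiting behavior can be sketched as an ordered list of check functions; the three stand-in checks, their names, and the request fields below are simplified assumptions, not the gateway's real implementation.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    reason: str = "ok"

# Illustrative stand-ins for three of the chain's checks.
def check_auth(req):
    return Decision(bool(req.get("api_key")), "missing API key")

def check_budget(req):
    return Decision(req.get("est_cost", 0.0) <= req.get("daily_budget", 10.0),
                    "budget exceeded")

def check_model(req):
    return Decision(req.get("model") in req.get("allowlist", []),
                    "model not allowed")

CHAIN = [check_auth, check_budget, check_model]

def evaluate(req):
    # Run checks in order; short-circuit on the first denial.
    for check in CHAIN:
        decision = check(req)
        if not decision.allowed:
            return decision
    return Decision(True)
```

Because each check takes the request and returns an independent decision, toggling one on or off is just editing the list.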

Rate limiting uses sliding window counters in Redis. Each org gets a requests-per-minute cap (default 100 RPM, configurable per plan). The gateway returns standard X-RateLimit-* headers and a Retry-After value on 429 responses, so well-behaved clients back off automatically.
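The sliding-window logic can be sketched with an in-memory deque standing in for the Redis counters (class name and defaults are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    # In-memory stand-in for the Redis sliding-window counters.
    def __init__(self, rpm=100):
        self.rpm = rpm
        self.hits = defaultdict(deque)

    def allow(self, org_id, now=None):
        now = time.monotonic() if now is None else now
        window = self.hits[org_id]
        # Evict hits older than the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.rpm:
            # Denied: surface when the oldest hit ages out (maps to Retry-After).
            retry_after = 60.0 - (now - window[0])
            return False, retry_after
        window.append(now)
        return True, 0.0
```

The Redis version uses the same eviction-then-count shape, just with atomic operations so multiple gateway instances share one window per org.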

Cost estimation happens before the request is proxied. The gateway calculates an estimated cost from the model's pricing table and the input token count (estimated via tiktoken for OpenAI models, character-based heuristic for others). If the estimated cost would push the org over its daily budget or exceed the per-request cap, the request is denied with a clear error message and the estimated cost in the response body.

PII scanning runs a regex pipeline over request content, checking for patterns like credit card numbers, SSNs, API keys, and email addresses. This catches the most common accidental data leaks — an agent that includes a customer's credit card number in a prompt, or a debug log that contains an API key.
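A simplified version of that regex pipeline might look like this; the patterns shown are illustrative and far less thorough than a production scanner.

```python
import re

# Illustrative patterns only; real scanners use stricter, validated rules.
PII_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "api_key":     re.compile(r"\b(?:sk|cm_sk)_[A-Za-z0-9]{8,}\b"),
}

def scan_pii(text):
    # Return the PII categories detected; a real scanner would also record
    # match offsets for redaction and audit logging.
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)]
```

Keeping everything as compiled regexes (no model inference) is what makes this check cheap enough for the hot path.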

Content safety detects prompt injection attempts, jailbreak patterns, and data exfiltration attempts. This is the newest check, added after we found that certain OpenClaw skills were embedding prompt injection payloads in tool outputs.

The HITL gate is the check I am most proud of. Requests that exceed a configurable cost threshold (default: $0.50) are held in an approval queue instead of being denied outright. A human reviewer can approve or reject them from the dashboard. This means expensive operations still work — they just require a human in the loop.
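The hold-instead-of-deny flow can be sketched with an in-memory queue standing in for the real approval store (names and shapes here are assumptions):

```python
import uuid

APPROVAL_THRESHOLD = 0.50  # dollars; configurable in the real gateway

class ApprovalQueue:
    def __init__(self):
        self.pending = {}

    def submit(self, request, est_cost):
        # Hold the request for a human instead of denying it outright.
        ticket = str(uuid.uuid4())
        self.pending[ticket] = {"request": request, "est_cost": est_cost,
                                "status": "pending"}
        return ticket

    def resolve(self, ticket, approved):
        item = self.pending[ticket]
        item["status"] = "approved" if approved else "rejected"
        return item

def hitl_gate(queue, request, est_cost):
    if est_cost <= APPROVAL_THRESHOLD:
        return "allow", None
    return "hold", queue.submit(request, est_cost)
```

The key design point is the third outcome: not just allow/deny, but "hold and ask".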

51 providers, one interface

The gateway currently routes to 51 LLM providers across 7 tiers:

| Tier | Providers | Examples |
|------|-----------|----------|
| Core (5) | OpenAI, Anthropic, Google, DeepSeek, Perplexity | GPT-5.1, Claude Opus 4.6, Gemini 2.5, DeepSeek R1 |
| Tier 1 (5) | Moonshot, MiniMax, Z.AI, Cerebras, Qwen | Kimi K2.5, MiniMax M2.5, GLM-5 |
| Tier 2 (7) | Groq, Mistral, xAI, Together, Fireworks, Cohere, OpenRouter | Grok, Mixtral, Command R+ |
| Tier 3 (6) | Azure OpenAI, AWS Bedrock, GCP Vertex, Hugging Face, Replicate, Ollama | Enterprise cloud + local |
| Tier 4 (6) | SambaNova, Lambda, Lepton, Novita, AI21, Reka | Fast inference + specialty |
| Tier 5 (6) | Baichuan, Yi, Zhipu, StepFun, Volcano, InternLM | Regional / Chinese market |
| Tier 6-7 (16) | vLLM, TGI, RunPod, Modal, DeepInfra, Cloudflare, NVIDIA NIM, etc. | Self-hosted + emerging |

The provider router auto-detects the provider from the model name prefix. claude-opus-4-6 routes to Anthropic. gemini-2.5-pro routes to Google. deepseek-chat routes to DeepSeek. If the prefix is ambiguous, the org's provider registry handles the mapping.
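The prefix routing can be sketched as an ordered lookup with a registry fallback; only a handful of the 51 providers are shown, and the mappings are illustrative.

```python
# Checked in order; illustrative subset of the provider table.
PREFIX_TO_PROVIDER = [
    ("claude-", "anthropic"),
    ("gemini-", "google"),
    ("deepseek-", "deepseek"),
    ("gpt-", "openai"),
]

def route(model, org_registry=None):
    for prefix, provider in PREFIX_TO_PROVIDER:
        if model.startswith(prefix):
            return provider
    # Ambiguous or unknown prefix: fall back to the org's provider registry.
    if org_registry and model in org_registry:
        return org_registry[model]
    raise ValueError(f"cannot route model {model!r}")
```

The per-org registry is what lets two orgs map the same custom model name to different upstreams.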

Each provider has its own auth method, endpoint path, and response format. The gateway normalizes all of this behind the OpenAI-compatible interface. You send a standard /v1/chat/completions request; the gateway translates it to whatever the upstream provider expects.

The cost tracking pipeline

Cost tracking was the hardest engineering problem. It needs to be accurate to the token, it needs to handle streaming responses, and it cannot add meaningful latency.

The pipeline works in two stages:

Pre-request: The governance chain estimates cost using the model's pricing table and estimated input tokens. This is used for budget enforcement — deciding whether to allow the request.

Post-response: After the upstream provider responds, the cost recorder extracts actual token counts from the response body (for non-streaming) or the final SSE chunk (for streaming). It calculates the real cost using per-model pricing from litellm's community-maintained database, with a local fallback table for models too new for litellm.

The actual cost is then recorded in three places simultaneously:

  1. Redis — daily and monthly cost counters (atomic INCRBYFLOAT), keyed by org ID, used for real-time budget enforcement
  2. MongoDB — full usage record with model, tokens, cost, timestamp, API key, and request metadata, used for audit and billing
  3. Prometheus metrics — in-memory counters for the monitoring stack
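The triple write can be sketched with in-memory stand-ins for all three sinks (a dict in place of Redis, a list in place of MongoDB, a dict in place of Prometheus counters):

```python
class CostRecorder:
    # In-memory stand-in: each sink is a plain Python structure here.
    def __init__(self):
        self.redis = {}       # "org:{id}:cost:daily" -> float (INCRBYFLOAT)
        self.mongo = []       # full usage documents for audit and billing
        self.prometheus = {}  # metric name -> counter value

    def record(self, org_id, model, in_tokens, out_tokens, cost, meta=None):
        # 1. Real-time budget counter (atomic INCRBYFLOAT in the real system).
        key = f"org:{org_id}:cost:daily"
        self.redis[key] = self.redis.get(key, 0.0) + cost
        # 2. Durable usage record.
        self.mongo.append({"org": org_id, "model": model, "in": in_tokens,
                           "out": out_tokens, "cost": cost, "meta": meta or {}})
        # 3. Monitoring counter.
        self.prometheus["llm_cost_total"] = (
            self.prometheus.get("llm_cost_total", 0.0) + cost)
```

Each sink serves a different consumer (enforcement, audit, monitoring), which is why the write is triplicated rather than fanned out from one store.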

For streaming responses (which are the majority — agents love streaming), the gateway passes SSE chunks through to the client as they arrive with no buffering. Token counts are extracted from the final [DONE] chunk or the usage field in the last data event. This means cost recording does not add any latency to the streaming path.
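For OpenAI-style streams, extracting usage amounts to walking the SSE data events and keeping the last non-null usage payload. A sketch under that assumption (other providers place usage elsewhere):

```python
import json

def extract_usage(sse_lines):
    # Keep the last non-null `usage` payload seen before [DONE];
    # OpenAI-style streams report it in the final data chunk.
    usage = None
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```

Because this runs after the chunks have already been forwarded to the client, it adds nothing to the streaming path's latency.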

Response caching

The response cache was a competitive response — Portkey ships one, so we built ours. Two-tier architecture:

Tier 1 (exact match): SHA-256 hash of the normalized request body. Strip metadata fields (stream, user, seed, frequency_penalty), hash the rest. If we have seen this exact request before, return the cached response with an X-CM-Cache: HIT header. Sub-millisecond lookup.
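The normalize-then-hash step can be sketched like this; the stripped field list comes from the description above, while canonical JSON encoding is an implementation assumption.

```python
import hashlib
import json

# Fields stripped before hashing, per the tier-1 description.
NON_SEMANTIC_FIELDS = {"stream", "user", "seed", "frequency_penalty"}

def cache_key(body):
    # Strip metadata fields, then hash a canonical JSON encoding so that
    # key ordering differences can't cause spurious cache misses.
    normalized = {k: v for k, v in body.items() if k not in NON_SEMANTIC_FIELDS}
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Two requests that differ only in stripped fields hash to the same key, so a streaming and a non-streaming copy of the same prompt share one cache entry.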

Tier 2 (semantic similarity): Embed the request using a lightweight model, compare against cached request embeddings using cosine similarity. If similarity exceeds a configurable threshold (default 0.92), return the cached response. This catches paraphrased prompts — "summarize this article" and "give me a summary of this article" hit the same cache entry.
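The tier-2 lookup is a cosine scan over cached embeddings. A sketch with plain Python lists standing in for real embedding vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_lookup(query_vec, cache, threshold=0.92):
    # cache: list of (embedding, cached_response); return the best entry
    # at or above the similarity threshold, else None (cache miss).
    best, best_sim = None, threshold
    for vec, response in cache:
        sim = cosine(query_vec, vec)
        if sim >= best_sim:
            best, best_sim = response, sim
    return best
```

A production version would use a vector index rather than a linear scan, but the threshold semantics are the same.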

Both tiers are per-org and respect governance policies. The cache is feature-flagged (off by default for exact match, separately flagged for semantic). In practice, the exact-match cache alone saves 30-40% of LLM calls for production workloads with repeated queries.

What I learned building this

Design for zero integration cost. The base URL swap model is the single best decision I made. Every alternative (SDK wrappers, middleware libraries, proxy agents) requires the developer to change code. A base URL swap works with every LLM client library ever written.

Redis is the right choice for real-time counters. MongoDB is too slow for per-request budget checks. In-memory is too fragile (server restart resets all counters). Redis atomic operations (INCRBYFLOAT, EXPIRE) give you both speed and durability.

Streaming complicates everything. Non-streaming responses are simple: read the body, extract tokens, calculate cost. Streaming responses require parsing SSE events on the fly, handling mid-stream disconnections, and extracting usage from provider-specific final chunks. Each provider formats their streaming responses differently. Anthropic puts usage in message_delta events. OpenAI puts it in the final chunk's usage field. Some providers do not report streaming usage at all.

The governance chain must be fast. Every millisecond of governance overhead is added to every single LLM call. The full chain (rate limit + cost estimate + PII scan + model check + HITL gate) runs in under 5ms for a typical request. PII scanning is the slowest step because it runs regex patterns over the full request content. We keep the patterns simple and fast — no NER models in the hot path.

Provider diversity is a moat. Supporting 51 providers means any developer using any model can use the gateway. It also means we see the entire AI market from a single vantage point — which providers are growing, which models are being adopted, where pricing is heading. That data is valuable.

Where it stands today

The gateway handles authentication, rate limiting, cost estimation, PII scanning, content safety, model allowlists, hierarchical budgets (org, team, key), human-in-the-loop approvals, response caching, and full audit logging. Every request is proxied through httpx with connection pooling, retry logic, circuit breakers, and exponential backoff.

The blog you are reading runs its entire agent fleet through this gateway. Nine agents, daily research cycles, automated content pipelines, comment moderation — all governed, all cost-tracked, all auditable. The agents page shows the live data.

If you are running OpenClaw agents (or any LLM-backed application) without governance, you are one runaway loop away from a surprise bill. The gateway exists to make sure that does not happen.

See it in action: The platform page shows live gateway metrics, or explore the dashboard tour for the full governance console. Try the developer SDKs to integrate in minutes.
