Tags: observability, agents, monitoring, governance

The Agent Observability Gap

89% of organizations have implemented agent observability. Most still can't trace failures through multi-step workflows. Here's why — and what to do about it.

March 13, 2026 · 5 min read
AI Collaboration

blog-researcher: Researched observability tooling landscape and failure patterns

blog-analyst: Analyzed fleet monitoring data for real-world examples

Claude (Opus 4.6): Structured the argument and wrote the analysis

Total AI cost: $0.17

Governed by curate-me.ai

The dashboard shows green. The output is garbage.

Traditional application monitoring tracks metrics: latency, error rates, throughput. When something breaks, the dashboard goes red. You investigate.

AI agents break differently. A research agent returns a well-formatted JSON response with a 200 status code. The response time is within SLA. All metrics are green. But the research is fabricated — the agent hallucinated sources and made up statistics.

This is the observability gap: the difference between operational health and semantic correctness. Most monitoring tools only measure the first. For AI agents, the second is what matters.

Why traditional APM fails for agents

Application Performance Monitoring (APM) was designed for deterministic software. The same input produces the same output. Errors are exceptions that bubble up the call stack.

Agents violate all of these assumptions:

Non-deterministic behavior — The same prompt can produce different results on different runs. Temperature, context window position, and model version all affect output. A test that passes today might fail tomorrow with identical inputs.

Silent semantic failures — An agent that returns "I found 3 relevant papers" when it actually found 0 doesn't throw an error. It returns a successful response with incorrect content. Your error rate metric stays at zero.

Multi-step reasoning chains — A research → write → refine pipeline has compound failure modes. A subtly wrong research brief produces a subtly wrong draft, which the reviewer scores as "adequate" because the problems are semantic, not structural.

The poisoned context problem — A malformed tool output in step 2 corrupts the agent's working memory. Steps 3, 4, and 5 all operate on bad data. Every step succeeds operationally. The final output is wrong in ways that aren't obvious until a human reads it.
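Failures like these are catchable if you assert on semantics rather than status codes. A minimal sketch: cross-check a claimed result count against the actual payload (the response shape and the `check_claimed_count` helper are hypothetical, not this blog's actual schema):

```python
import re

def check_claimed_count(response: dict) -> bool:
    """Cross-check a claimed result count against the structured payload.

    Returns False when the agent *says* it found N items but the payload
    contains a different number: the kind of silent semantic failure a
    200 status code never surfaces.
    """
    match = re.search(r"found (\d+)", response["summary"])
    if not match:
        return True  # no countable claim to verify
    claimed = int(match.group(1))
    return claimed == len(response["results"])

# An operationally "successful" response that is semantically wrong:
bad = {"summary": "I found 3 relevant papers", "results": []}
good = {"summary": "I found 2 relevant papers", "results": ["a", "b"]}
```

A check like this is crude, but it turns a silent semantic failure into a signal your monitoring can alert on.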

What agent observability actually requires

1. Session-level traces, not request-level metrics

The unit of observability for agents is the session, not the HTTP request. A single user question might trigger 5 LLM calls, 3 tool invocations, and 2 agent delegations. You need to trace the entire chain to understand what happened.

On this blog, the orchestrator emits events for every step: token streaming, tool calls, tool results, task dispatches, widget renders, cost accrual. These events are tied to a conversation ID that connects the entire multi-turn session.
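The pattern can be sketched in a few lines of Python. The `SessionTracer` class and event names below are illustrative, not the orchestrator's actual event schema:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class SessionTracer:
    """Collects every step of an agent session under one conversation ID."""
    conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    events: list = field(default_factory=list)

    def emit(self, kind: str, **payload):
        self.events.append({
            "conversation_id": self.conversation_id,
            "ts": time.time(),
            "kind": kind,  # e.g. "llm_call", "tool_call", "tool_result"
            **payload,
        })

tracer = SessionTracer()
tracer.emit("llm_call", model="some-model", tokens=512)
tracer.emit("tool_call", tool="search_knowledge", query="agent observability")
tracer.emit("tool_result", tool="search_knowledge", hits=3)
# Every event carries the same conversation_id, so the whole chain
# can be reassembled in order, across LLM calls and tool invocations.
```

The key design choice is that the conversation ID is assigned once per session, not per request: that is what lets you reconstruct a 5-call, 3-tool, 2-delegation chain as a single trace.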

2. Content capture, not just metadata

Knowing that search_knowledge was called isn't useful. Knowing what query was sent and what results came back is essential. Agent observability requires capturing inputs and outputs at every step — which means dealing with PII concerns and storage costs.

The curate-me.ai gateway captures full request/response payloads for every LLM call. This is what makes the trace viewer useful: you can see exactly what the agent saw and what it decided to do.
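Content capture can be sketched as a thin wrapper that masks PII before anything hits storage. The `capture_exchange` helper below is a toy, assuming email addresses are the only PII concern; a real deployment needs a proper redaction pipeline:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def capture_exchange(store: list, request: str, response: str) -> None:
    """Persist the full LLM request/response pair, masking obvious PII.

    Capturing content (not just metadata) is what makes traces
    debuggable, but it forces PII handling before storage.
    """
    store.append({
        "request": EMAIL.sub("[email]", request),
        "response": EMAIL.sub("[email]", response),
    })

log: list = []
capture_exchange(log, "Summarize feedback from jane@example.com", "Jane reports...")
```

The trade-off is explicit here: you keep the full payload for debugging, but the redaction step runs on the write path, so everything downstream (trace viewers, replays) only ever sees the masked version.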

3. Quality scoring, not just success/failure

Traditional monitoring has a binary outcome: success or error. Agent output needs a quality dimension. Our refinement loop demonstrates this — a draft might "succeed" with a quality score of 4.2/10. That's not an error, but it's not acceptable either.

The agents page shows refinement score progressions: 5.5 → 7.1 → 8.2. This quality trajectory is as important as the pass/fail status.
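The score-improve-repeat loop can be sketched as follows. Here `score_fn` and `improve_fn` stand in for the LLM reviewer and rewriter, and the threshold is an assumed value, not this blog's actual quality bar:

```python
def refine(draft: str, score_fn, improve_fn,
           threshold: float = 8.0, max_iters: int = 3):
    """Run the draft through score/improve cycles until it clears the bar.

    Returns the final draft plus the full score trajectory. The
    trajectory is the observability signal, not just the terminal
    pass/fail.
    """
    scores = [score_fn(draft)]
    for _ in range(max_iters):
        if scores[-1] >= threshold:
            break
        draft = improve_fn(draft)
        scores.append(score_fn(draft))
    return draft, scores

# Toy scorer: quality grows with draft length (stand-in for an LLM reviewer).
final, trajectory = refine(
    "v1",
    score_fn=lambda d: 5.0 + len(d),
    improve_fn=lambda d: d + "+",
)
```

Emitting the whole trajectory rather than the final score is what lets a dashboard distinguish "passed on the first try" from "barely scraped through after three rewrites".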

4. Cost attribution per step

When a multi-step pipeline costs $0.35, you need to know whether it was the research ($0.03), the writing ($0.05), or the three refinement iterations ($0.27) that drove the cost. Per-step cost attribution turns "it's too expensive" into "refinement iteration 3 is too expensive because the reviewer model is overkill."
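Per-step attribution is a small fold over trace events. A sketch with hypothetical event records:

```python
from collections import defaultdict

def attribute_costs(events: list) -> dict:
    """Aggregate per-step cost from trace events to find the expensive step."""
    totals = defaultdict(float)
    for e in events:
        totals[e["step"]] += e["cost_usd"]
    return dict(totals)

pipeline = [
    {"step": "research", "cost_usd": 0.03},
    {"step": "write", "cost_usd": 0.05},
    {"step": "refine", "cost_usd": 0.09},
    {"step": "refine", "cost_usd": 0.09},
    {"step": "refine", "cost_usd": 0.09},
]
costs = attribute_costs(pipeline)
# The three refinement iterations dominate the pipeline's total cost.
```

The aggregation is trivial; the hard part is that each event must carry a step label at emit time, which is exactly what a gateway can attach for free.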

5. Replayability

If you can't replay an agent's state transitions in order, you can't debug production issues. This means storing not just the final output, but every intermediate state: the tool results the agent saw, the decisions it made, the context it operated with.
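Replay reduces to re-applying stored transitions in order. A sketch, assuming each recorded transition is a dict merged into the session state:

```python
def replay(initial_state: dict, transitions: list) -> list:
    """Re-apply recorded state transitions in order, yielding each snapshot.

    Storing every intermediate transition lets you step through exactly
    what the agent saw at each point of a production session.
    """
    state = dict(initial_state)
    snapshots = [dict(state)]
    for t in transitions:
        state.update(t)
        snapshots.append(dict(state))
    return snapshots

history = replay(
    {"context": "user question"},
    [{"tool_result": "3 hits"}, {"decision": "draft answer"}],
)
# history[i] is the agent's full working state after step i.
```

With snapshots like these, "why did step 4 go wrong" becomes "diff the state the agent saw at step 3 against what it should have seen", which is a far more tractable debugging question.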

The gateway advantage

A governance gateway provides agent observability as infrastructure. Because every LLM call passes through it, you get:

  • Full traces without instrumenting individual agents
  • Cost attribution without custom billing code
  • Content capture without modifying tool implementations
  • Cross-agent correlation without distributed tracing infrastructure

This is structurally different from bolt-on observability tools (LangSmith, Langfuse, Braintrust, Arize). Those tools require you to instrument your code. A gateway captures everything by default because it's the only path to the LLM.
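The difference is visible in miniature: wrap the single LLM entry point, and tracing plus cost attribution come along for free. The `Gateway` class below is an illustrative sketch with a stub model, not curate-me.ai's implementation:

```python
import time

class Gateway:
    """Single choke point for LLM calls: tracing and cost fall out for free."""

    def __init__(self, llm_fn, cost_per_call: float):
        self.llm_fn = llm_fn
        self.cost_per_call = cost_per_call
        self.trace = []

    def call(self, agent: str, prompt: str) -> str:
        response = self.llm_fn(prompt)
        self.trace.append({
            "agent": agent,
            "prompt": prompt,
            "response": response,
            "cost_usd": self.cost_per_call,
            "ts": time.time(),
        })
        return response

gw = Gateway(llm_fn=lambda p: p.upper(), cost_per_call=0.01)  # stub model
gw.call("blog-researcher", "find sources")
gw.call("blog-analyst", "analyze data")
# Two agents, zero per-agent instrumentation: the trace lives in the gateway.
```

Neither agent knows it is being observed; because the gateway is the only path to the model, capture is a structural property rather than a per-codebase discipline.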

What we monitor on this blog

| What | How | Why |
|------|-----|-----|
| Agent heartbeats | Runner status API, 30s polling | Detect silent agent crashes |
| Cron execution | Agent logs table, last run timestamps | Ensure scheduled tasks actually run |
| Refinement scores | Refinement sessions table | Track content quality over time |
| Cost per agent | Gateway cost aggregation | Budget management and anomaly detection |
| A2A delegation chains | Agent logs correlation | Trace multi-agent pipelines end-to-end |
| Webhook delivery | Agent logs with status field | Catch failed result deliveries |
| Fleet memory writes | Non-blocking async writes | Ensure agents learn from each task |

All of this is visible on the agents page. The dashboard isn't a nice-to-have — it's the primary way to verify that 9 agents running on independent schedules are actually doing what they're supposed to do.
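Heartbeat monitoring, for example, is just a staleness check over last-seen timestamps. A sketch (the 90-second window is an assumed default, not this blog's actual polling configuration):

```python
import time

def stale_agents(heartbeats: dict, now: float, max_age_s: float = 90.0) -> list:
    """Flag agents whose last heartbeat is older than the allowed window.

    A crashed agent emits no error; it just goes quiet. Staleness
    detection turns that silence into an actionable signal.
    """
    return sorted(a for a, ts in heartbeats.items() if now - ts > max_age_s)

now = time.time()
beats = {"blog-researcher": now - 10, "blog-analyst": now - 300}
```

A check like `stale_agents(beats, now)` surfaces the quiet agent without requiring it to report anything at all, which is the point: the failure mode it catches is precisely the one that produces no events.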

The bottom line

Agent observability isn't about monitoring uptime. It's about answering the question: "Is this agent producing correct, useful output within budget?"

If your monitoring can't answer that question, your dashboards are lying to you. Green metrics with garbage output is the worst kind of failure: the kind you don't notice until a customer does.

The governance page covers how guardrails prevent failures. This post is about detecting the failures that get through. Both are essential for production agent deployments.

See it in action: Explore time-travel debugging, session replay, and distributed tracing in the dashboard tour.
