Gable Blog | AI Agent Observability: Traces, Metrics, and Evals

The point of building an AI agent is to hand it work and step away. An agent that needs a human watching every tool call and reading every intermediate step isn't saving anyone time. Getting to the point where you can let one run unsupervised, though, takes a specific kind of confidence: the ability to reconstruct any decision the agent made, on any run, after the fact. You have to be able to answer "why did it do that?" without having been there when it happened.

That ability is what agent observability provides, and it's why the discipline exists at all. An agent built on a large language model (LLM) chooses its own sequence of steps at runtime: which tools to call, what to retrieve, and how to reason from one step to the next. When the result is wrong, the final output alone rarely tells you which of the many decisions behind it went sideways. Observability captures each of those steps as structured data, so the reasoning becomes inspectable instead of locked inside a single response you can't reproduce.

The sections below cover what to capture, how to measure agent quality, and a blind spot that even a well-instrumented setup leaves wide open.

What AI agent observability actually is

Agent observability provides step-by-step visibility into an agent's execution. It records which tools the agent called, what data it retrieved, where its reasoning held together, and where it diverged from the intended path. The standard observability vocabulary, drawn from the broader practice, is MELT data: metrics, events, logs, and traces. Agent observability uses that same foundation and adds signals unique to LLM-driven systems, such as token usage, tool interactions, and the agent's decision path.

It helps to set the expectation honestly up front: observability is a visibility tool, not a reliability mechanism. It tells you what an agent did after it did it. That distinction matters because it defines both what observability is excellent at and where it stops, a point the rest of this guide returns to.

Why agents need more than traditional monitoring

Traditional application performance monitoring captures request-response cycles. It shows that a request came in, a response went out, and how long it took. For a deterministic service, that's often enough. For an agent, it shows the wrapper and misses everything that matters inside.

Consider a support agent that invokes a billing tool with a malformed argument, loops while trying to recover, and then returns a confident but incorrect answer about a refund. Standard monitoring records a successful request: a response was returned in two seconds. Only step-level tracing reveals the hallucinated tool parameter, the retry loop, and the reasoning step where the agent committed to a wrong conclusion. The agent-specific surface that observability has to capture includes tool calls, prompt versions, retrieved context, reasoning transitions, and, in multi-agent systems, the handoffs between agents where a single failure can cascade across boundaries.

The three pillars: traces, metrics, and evals

Most agent observability practice organizes around three kinds of signal. Each answers a different question about the agent.

Traces and spans

A trace is the full execution tree of a single agent run, broken into spans. Each span represents one unit of work: an LLM call, a tool invocation, a retrieval step. Structured this way, a trace lets you localize a failure to the exact step that caused it, whether a retrieval returned irrelevant documents, the model fabricated a tool parameter, or a reasoning loop never converged. For agents that run hundreds of intermediate steps before producing an answer, this step-level view is the only practical way to find where things went wrong.
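
To make the tree structure concrete, here is a minimal sketch of a trace as nested spans, with a depth-first walk that localizes the first failing step. The `Span` shape, the field names, and the sample run are all hypothetical and chosen for illustration; they are not any particular vendor's schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Span:
    """One unit of work in an agent run (hypothetical shape)."""
    name: str
    kind: str  # e.g. "agent", "llm", "tool", "retrieval"
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def walk(span: Span, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Span]]:
    """Yield (path, span) for every span in the tree, depth-first."""
    here = path + (span.name,)
    yield here, span
    for child in span.children:
        yield from walk(child, here)


def first_failure(root: Span) -> Optional[str]:
    """Return the slash-joined path of the first span that recorded an error."""
    for path, span in walk(root):
        if span.error is not None:
            return "/".join(path)
    return None


# A made-up run: the tool call fails because of a fabricated parameter name.
trace = Span("run", "agent", children=[
    Span("plan", "llm"),
    Span("lookup_invoice", "tool", error="unknown argument 'invoice_idd'"),
    Span("answer", "llm"),
])

print(first_failure(trace))  # run/lookup_invoice
```

The path, rather than just the span name, is what makes the result useful in a deep tree: the same tool can be called from several branches, and only the path says which call failed.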

Metrics

Metrics quantify agent behavior over many runs. The agent-specific ones that matter most are token usage and the cost it drives, latency per step, and error rates. These aren't vanity numbers. Cost and latency attribution lets you find, for example, that a single sub-task is consuming most of the tokens in a trace or adding several seconds of tool-call latency, so you know where optimization actually pays off before the bill or the user's patience runs out.
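
As a rough illustration of cost and latency attribution, the sketch below groups per-step records by sub-task and reports each sub-task's share of the tokens in a run. The record fields (`subtask`, `tokens`, `latency_s`) and the numbers are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical per-step records from one trace; field names are illustrative.
steps = [
    {"subtask": "plan", "tokens": 400, "latency_s": 0.8},
    {"subtask": "search_docs", "tokens": 2600, "latency_s": 3.1},
    {"subtask": "search_docs", "tokens": 2200, "latency_s": 2.7},
    {"subtask": "answer", "tokens": 800, "latency_s": 1.2},
]


def attribute(steps: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Sum tokens and latency per sub-task and add each sub-task's token share."""
    totals = defaultdict(lambda: {"tokens": 0, "latency_s": 0.0})
    for step in steps:
        totals[step["subtask"]]["tokens"] += step["tokens"]
        totals[step["subtask"]]["latency_s"] += step["latency_s"]
    all_tokens = sum(t["tokens"] for t in totals.values())
    return {
        name: {**t, "token_share": t["tokens"] / all_tokens if all_tokens else 0.0}
        for name, t in totals.items()
    }


report = attribute(steps)
top = max(report, key=lambda name: report[name]["tokens"])
print(f"{top}: {report[top]['token_share']:.0%} of tokens")  # search_docs: 80% of tokens
```

In this made-up run, one sub-task accounts for most of the spend, which is the kind of finding that tells you where optimization is worth the effort.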

Evaluations

Evaluations measure whether the agent is doing a good job, not just what it did. Static test cases struggle here because a non-deterministic agent can take a different valid path on every run, so a single expected output rarely captures correctness. A common approach is to capture real production traces and convert them into evaluation datasets that reflect how the agent behaves in the wild, then score new runs against them. Teams also use an LLM to grade another model's outputs against criteria, often called LLM-as-judge, as one method among several. Telemetry here doubles as a feedback loop: the same traces that help you debug also become the raw material for measuring and improving quality.
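
The trace-to-dataset loop can be sketched in a few lines. Everything here is an assumption made for illustration: the trace fields (`input`, `criteria`, `reviewed`), the stub agent, and especially the judge, which is a plain keyword check standing in for an LLM-as-judge call so the example runs offline.

```python
from typing import Callable, Dict, List


def build_dataset(traces: List[Dict]) -> List[Dict]:
    """Keep only reviewed traces and reduce each to an input plus grading criteria."""
    return [
        {"input": t["input"], "criteria": t["criteria"]}
        for t in traces
        if t.get("reviewed")
    ]


def score(dataset: List[Dict],
          agent: Callable[[str], str],
          judge: Callable[[str, str], bool]) -> float:
    """Fraction of cases where the judge accepts the agent's output."""
    if not dataset:
        return 0.0
    passed = sum(judge(agent(case["input"]), case["criteria"]) for case in dataset)
    return passed / len(dataset)


def keyword_judge(output: str, criteria: str) -> bool:
    """Stand-in judge: passes if the criteria keyword appears in the output."""
    return criteria.lower() in output.lower()


def stub_agent(question: str) -> str:
    """Stand-in agent so the example is self-contained."""
    return "Your refund was issued." if "refund" in question.lower() else "I am not sure."


production_traces = [
    {"input": "Can I get a refund for order 17?", "criteria": "refund", "reviewed": True},
    {"input": "Where is my invoice?", "criteria": "invoice", "reviewed": True},
    {"input": "hello", "criteria": "greeting", "reviewed": False},
]

dataset = build_dataset(production_traces)
print(score(dataset, stub_agent, keyword_judge))  # 0.5
```

Passing the judge in as a function keeps the scoring loop unchanged when the keyword check is swapped for a model-based grader or a human review step.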

What to instrument: agent observability best practices

Capturing the right signals is what separates a usable trace from noise. Effective instrumentation captures a consistent set of things at every step:

LLM calls: the model used, the inputs and prompt version, the output completion, and the input and output token counts.
Tool calls: which tool was selected, the arguments passed, the result returned, and how long the call took.
Retrieval steps: the queries sent to a vector store or knowledge base, the documents returned, and any relevance signals available.
Reasoning transitions: how the agent decided to move from one step to the next, including intermediate reasoning where it's exposed.
State changes: for stateful agents, what memory was read and written, and how that state shaped later decisions.
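
For the tool-call item in the list above, one lightweight way to capture the tool name, arguments, result, and duration is a decorator. This is a sketch under stated assumptions: `traced_tool`, `TOOL_LOG`, and `get_refund_status` are hypothetical names, and the in-memory list stands in for whatever trace exporter a real setup would use.

```python
import functools
import time
from typing import Any, Callable, Dict, List

TOOL_LOG: List[Dict[str, Any]] = []  # stand-in for a real trace exporter


def traced_tool(fn: Callable) -> Callable:
    """Record name, arguments, result or error, and duration of each tool call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record: Dict[str, Any] = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            record["result"] = fn(*args, **kwargs)
            return record["result"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise  # the agent still sees the failure; we only observe it
        finally:
            record["duration_s"] = time.perf_counter() - start
            TOOL_LOG.append(record)
    return wrapper


@traced_tool
def get_refund_status(order_id: str) -> str:
    """Hypothetical billing tool."""
    return f"order {order_id}: refunded"


get_refund_status("A-17")
print(TOOL_LOG[-1]["tool"], TOOL_LOG[-1]["args"])  # get_refund_status ('A-17',)
```

Recording in the `finally` clause means a failed call is logged with its arguments and error too, which is usually the call you most want to see.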

Instrumentation should scale with the agent, not arrive all at once. During early prototyping, print statements and local logging are genuinely sufficient, since you're watching single runs as they happen. Structured tracing earns its place once an agent chains two or more tools in sequence, runs at enough volume that you can't review every execution by hand, or reaches the point where someone other than the original developer has to debug it. Below that threshold, keeping it simple is the right call.

On the standards front, OpenTelemetry is emerging as the common foundation for agent telemetry. Its semantic conventions for generative AI, still in development, define a shared vocabulary for describing LLM operations, which lets teams instrument once and avoid lock-in to a single vendor's or framework's telemetry format. Agent observability itself is already widespread: in LangChain's State of Agent Engineering report, 89% of organizations said they'd implemented some form of agent observability and 62% had detailed step-level tracing. Granular tracing is table stakes, not a differentiator.
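
To show what a shared vocabulary looks like in practice, the sketch below builds the attribute set for one LLM call using `gen_ai.*` keys. The helper name `llm_span_attributes` and the model name are hypothetical, and because the conventions are still in development, the exact keys should be treated as assumptions and checked against the current specification before use.

```python
from typing import Dict, Union

AttributeValue = Union[str, int]


def llm_span_attributes(model: str, input_tokens: int, output_tokens: int) -> Dict[str, AttributeValue]:
    """Describe one LLM call with keys modeled on the in-development GenAI conventions."""
    return {
        "gen_ai.operation.name": "chat",            # assumed key; verify against the spec
        "gen_ai.request.model": model,              # assumed key; verify against the spec
        "gen_ai.usage.input_tokens": input_tokens,  # assumed key; verify against the spec
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = llm_span_attributes("example-model", input_tokens=812, output_tokens=96)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

With the OpenTelemetry Python API installed, a dictionary like this could be attached to a span via `span.set_attributes(attrs)`. Keeping the key names in one helper also means a change in the conventions is a one-place fix.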

The blind spot every observability setup shares

Everything to this point watches the agent: its calls, its reasoning, its outputs. None of it watches the data the agent depends on, and that's where a large share of real-world agent failures actually originate.

Picture an agent that retrieves from a feature table or queries an internal data product to do its job. Overnight, a producer upstream renames a column, drops a required field, or quietly shifts what a status value means. The agent has no way to know its input changed. It reasons correctly over the data it's given and returns a confident, wrong answer. Observability records the whole run faithfully, and the trace looks clean: tools fired, steps completed, a response was returned. The failure is invisible at the agent layer because the agent didn't malfunction. Its data did.

This is the structural limit of observability. It's downstream and reactive by design. It catches the symptom in production, after the agent has already acted on the bad input, and it can tell you a retrieval returned the wrong documents. What it can't tell you is that an upstream table changed shape last night, because that change happened far outside the agent's execution and well before the trace ever started.

Catching failures before the agent ever runs

Closing that gap means moving enforcement upstream, to the point where the data is produced. That's the role of data contracts: enforceable agreements on schema, semantics, ownership, and constraints between the data producers who generate data and the systems, including agents, that consume it. A contract makes the expectations for a data asset explicit and checkable instead of assumed.

Because contracts are enforced in CI/CD, the check runs when a producer tries to ship a change. A backwards-incompatible change, the renamed column or the dropped field, fails the check at the pull request, before it merges and long before it reaches an agent's input. Gable implements this at the application code level, which assigns accountability for data quality to the producers who control the change, rather than leaving it to downstream consumers who can't see the change coming. The model is prevention at the source, not detection after the fact.
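
The kind of check that runs at the pull request can be sketched as a comparison between the contract and the schema a producer proposes to ship. This is a simplified illustration, not Gable's implementation: the `breaking_changes` function, the column names, and the type labels are all hypothetical.

```python
from typing import Dict, List


def breaking_changes(contract: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """List backwards-incompatible differences between a contract and a proposed schema."""
    problems = []
    for column, expected_type in contract.items():
        if column not in proposed:
            problems.append(f"missing column: {column}")
        elif proposed[column] != expected_type:
            problems.append(f"type changed: {column} {expected_type} -> {proposed[column]}")
    # Columns that exist only in `proposed` are additive and treated as compatible.
    return problems


contract = {"order_id": "string", "status": "string", "refund_amount": "decimal"}

# The producer's change renames `status` to `state`.
proposed = {"order_id": "string", "state": "string", "refund_amount": "decimal"}

problems = breaking_changes(contract, proposed)
print(problems)  # ['missing column: status']
```

In a CI job, a non-empty list would translate into a failing exit code so the change cannot merge. A schema-only comparison like this one catches renames and dropped fields; a shift in what a value means needs semantic constraints in the contract as well.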

Observability and data contracts operate at different points in the lifecycle, and the relationship is complementary rather than competing. Observability tells you how an agent behaved once it ran. A contract keeps the data feeding that agent from changing underneath it without warning. One gives you visibility into the agent; the other gives the agent inputs it can trust. Production agents need both, in the same way reliable software needs both runtime monitoring and tests that run before deploy. For agents, that upstream layer connects directly to data quality and to broader AI data governance, since an agent acting on governed, contract-backed data is far easier to keep reliable than one consuming whatever arrives.

Reliable agents need reliable inputs

Visibility into an agent is necessary, and the traces, metrics, and evals above are how you get it. But that visibility is bounded. It ends at the agent's inputs, and a meaningful share of what gets logged as an agent failure is really an upstream data failure wearing an agent costume: the reasoning was sound, the tools fired correctly, and the answer was still wrong because the data was wrong. No amount of step-level tracing prevents that class of failure, because by the time the trace exists, the bad data is already inside the run.

Shrinking that failure class means treating the data an agent consumes with the same rigor as the agent's own code, and validating a producer's change at the source before it can propagate downstream. That's the premise behind shift-left data, and it's where teams building agents they actually trust tend to look next. To see how data contracts enforce data quality upstream, before a change can ever reach your agents, sign up with Gable.

Gable

June 26, 2026
