The previous post was an introduction to how agents actually work — the reasoning loop, the tool calls, the decision sequence. I wrote it with two audiences in mind: people who build these systems, and people who use agents (or at least tools that claim to be agents).

Both benefit from understanding the architecture. If you build agents, the internals determine what goes wrong and why. If you use them, the internals are where your edge is. When you know that an agent reasons in steps, you can structure your instructions around those steps. When you know it relies on tools, you can guide which ones to use and when. When you know it has a retrieval step, you can give it better context to retrieve against. None of this requires writing code. It requires knowing roughly what's happening underneath — which, as the previous post tried to show, isn't rocket science.

Observability is the part of that picture I mentioned almost as a footnote and then moved past. The question isn't only how agents work, but how you see what they're actually doing while they work. That matters whether you're debugging your own agent or deciding how much to trust the output of one you didn't build.

Agents don't fail loudly. The output looks plausible, the run completes without errors, and the only sign something went wrong is a colleague asking why the agent flagged a clause that any third-year associate would pass on sight. At that point you have the input and the output. Between them is a sequence of tool calls, retrieval decisions, and reasoning steps with no record anywhere.

To make this concrete: I built a small contract review agent - more than 500 public-domain commercial contracts, no client files - that runs a Reason + Act loop across seven tools and gets traced automatically by MLflow. It's a dummy example. Not something you'd put in front of a client, not a production system, not a recommendation. Just the building blocks stitched together so you can see how observability works when an agent actually runs. A full walkthrough with outputs from an actual run is in the companion notebook. What follows is what observability means, what it looks like in practice, and why it belongs in every agent you run, whether you built it or not.

The Middle Is Where Everything Happens

A modern AI agent isn't one LLM call. It's a loop. The model receives a prompt, decides whether to use a tool, calls the tool, receives the result, and then decides what to do next. In a document review agent, that loop might run four or five iterations per document: retrieve relevant clauses, assess each one, check against a reference policy, generate a finding, decide whether to escalate.
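The loop is simple enough to sketch in a few lines. This is not DSPy's implementation — just a stubbed illustration where `scripted_model` and the single tool are hypothetical stand-ins for the real model and tool set:

```python
# Minimal sketch of a Reason + Act loop with a stubbed model.
# `scripted_model` and `retrieve_clause` are hypothetical stand-ins,
# not DSPy internals.

def scripted_model(history):
    """Pretend LLM: decides the next action from the conversation so far."""
    if not any(step.startswith("observation:") for step in history):
        return ("call_tool", "retrieve_clause", "limitation_of_liability")
    return ("finish", "Clause caps liability at 12 months of fees.", None)

def retrieve_clause(clause_type):
    return f"[text of the {clause_type} clause]"

TOOLS = {"retrieve_clause": retrieve_clause}

def run_agent(task, max_iters=5):
    history = [f"task: {task}"]
    for _ in range(max_iters):
        action, payload, arg = scripted_model(history)
        if action == "finish":
            return payload, history
        result = TOOLS[payload](arg)              # act: call the chosen tool
        history.append(f"observation: {result}")  # feed the result back in
    return None, history

answer, trace = run_agent("Review the liability clause")
```

Everything observability cares about lives in that `history` list: each iteration adds state that the final answer silently depends on.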

Each step has its own inputs, its own model call, and its own intermediate output. When something goes wrong - or even when something goes subtly right for the wrong reasons - you need to inspect each step independently. Which query went into the retrieval and which clauses came back? What was the exact prompt the model received for the assessment step? Which query triggered the match? Without those answers, you're debugging by instinct. You're squinting at the final output and guessing at the cause.

Observability tools make those intermediate steps visible. They capture the full execution trace of an agent run and make every step inspectable, not just the final answer.

Picking a Tracer

I spent the better part of a year running LangFuse. It has a clean UI, useful annotation features, and its dataset management integrates reasonably well with an evaluation workflow. The problem is what it costs to keep alive: self-hosted LangFuse requires Docker, Postgres, and either ClickHouse or Redis running somewhere. That's real infrastructure overhead for what is, at its core, a debugging and evaluation tool.

I'm switching to MLflow. Not because LangFuse doesn't work but because MLflow's operational overhead for pure LLM tracing is close to zero, and it integrates nicely with DSPy — an AI framework I've come to like over the past few months. Start the server:

# Requires uv (https://docs.astral.sh/uv/getting-started/installation/) — works on macOS, Linux, and Windows
uvx mlflow server  # on macOS add --port 5001 (AirPlay Receiver occupies 5000)

SQLite backend, single process, no Docker. The UI lives at http://localhost:5000 (or http://localhost:5001 if you're on macOS). In your Python code, enable tracing with just a few simple lines:

import mlflow

mlflow.set_experiment("contract-review")
mlflow.dspy.autolog()

That autolog() call intercepts every DSPy module call, every LM invocation, every tool call, and every token count - automatically, with no instrumentation code in your agent.

Watching a Clause Agent Work

Like I said before, my dummy agent runs on DSPy, a Python framework that handles the mechanics of an AI agent — the loop, the tool calling, the conversation management between steps. You define what the agent should produce and what tools it has access to; DSPy handles the rest.

The contracts come from CUAD — the Contract Understanding Atticus Dataset — more than 500 commercial agreements labeled for 41 clause types, freely available on HuggingFace. NDAs, software licenses, service agreements, M&A exhibits. The clause types map directly onto standard commercial review concerns: indemnification, limitation of liability, governing law, confidentiality, termination.

The agent has seven tools:

  • list available contracts,
  • list available clause types,
  • retrieve contracts that contain a specific clause,
  • retrieve a specific clause from a specified contract,
  • assess a retrieved clause against a legal standard,
  • search the web for market standards or regulatory guidance, and
  • read a specific web page in full.

Each tool is a plain Python function. The agent knows about a tool only through its name, its argument names, and its docstring (the description string written directly below a function's definition) — there is no other configuration. The docstring isn't documentation for the developer; it's the instruction manual the agent reads at runtime to decide when to call the tool and what to pass in.

def search_clauses(contract_id: str, clause_type: str) -> str:
    """Retrieve a specific clause from a contract.

    Use find_contracts_with_clause() first if you don't know which
    contracts contain the clause type you're looking for.

    Args:
        contract_id: A contract ID from list_contracts() or find_contracts_with_clause().
            Partial match supported — 'atlassian' finds 'atlassian_subscriber_agreement__'.
        clause_type: One of: governing_law, indemnification, limitation_of_liability,
            non_compete, termination_for_convenience, ip_ownership, confidentiality,
            warranty, assignment, audit_rights, most_favored_nation, change_of_control,
            arbitration, auto_renewal, exclusivity, price_restrictions, revenue_sharing,
            minimum_commitment, insurance, license_grant, cap_on_liability.
            Fuzzy matching supported — minor spelling variations resolved automatically.

    Returns:
        The clause text, or a helpful message if not found.
    """
    ...

A vague docstring — "retrieves a clause" — produces a tool the agent misuses or ignores. A specific one, with argument guidance and a clear description of what comes back, is the difference between an agent that sequences its calls correctly and one that loops until it hits the iteration limit.
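This "the function is the spec" idea is easy to see with the standard library. A hedged sketch — not DSPy's actual mechanism — of deriving everything the model would see from the function itself using `inspect`:

```python
import inspect

def describe_tool(fn):
    """Build the spec an agent would see: name, argument names, docstring."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "args": list(sig.parameters),
        "description": inspect.getdoc(fn),  # cleaned docstring
    }

def search_clauses(contract_id: str, clause_type: str) -> str:
    """Retrieve a specific clause from a contract."""
    ...

spec = describe_tool(search_clauses)
# spec["name"] -> "search_clauses"
# spec["args"] -> ["contract_id", "clause_type"]
```

Whatever ends up in that `description` field is the entirety of the agent's manual for the tool — which is why the docstring quality matters so much.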

The agent runs as a Reason + Act loop. At each step it decides what to do next, picks a tool, calls it, reads the result, and continues from there. A typical run of the GDPR compliance task — find a confidentiality clause, check it against current ICO guidance, produce a gap analysis — makes six to eight tool calls: list contracts, retrieve the clause, search for EDPB guidance, read the relevant page, assess the clause against each requirement. MLflow captures every call as a span in the trace, automatically, from the single autolog() line set up earlier. Here is what that looks like in the UI:

MLflow trace UI showing the contract review agent run with tool call spans

Traces like these show where the agent breaks. Say the retrieval step returned noisy, off-topic clauses: that's a retrieval quality problem, not a model problem. Tweaking the prompt does nothing; improving the search index fixes it. Without the trace, you'd never know which one to work on.
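Once spans are captured, that kind of diagnosis can even be automated. A sketch, assuming spans are available as plain dicts — the field names here are illustrative, not MLflow's actual schema:

```python
# Flag runs where retrieval looks weak before blaming the model.
# Span dicts and field names are illustrative, not MLflow's schema.

def weak_retrieval_spans(spans, min_score=0.5):
    flagged = []
    for span in spans:
        if span["type"] != "retrieval":
            continue
        scores = span["output"]["scores"]
        if not scores or max(scores) < min_score:
            flagged.append(span["name"])
    return flagged

trace = [
    {"name": "retrieve_clause", "type": "retrieval",
     "output": {"scores": [0.31, 0.28, 0.22]}},
    {"name": "assess_clause", "type": "llm", "output": {}},
]
print(weak_retrieval_spans(trace))  # ['retrieve_clause']
```

A check like this, run over every trace, separates "fix the index" runs from "fix the prompt" runs before a human ever looks at the output.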

The Evaluation Feedback Loop

Most teams use tracing for debugging. That's the floor, not the ceiling.

Every trace is a data point. Attach human feedback—a lawyer marking a finding correct or flagging an error—and those traces become a labeled dataset. Run new prompt versions against historical inputs. Build automated evaluators that check whether findings are supported by retrieved clause text. Catch model updates that silently change your agent's behavior in ways you didn't anticipate.

Six months of that signal is worth more than any synthetic benchmark you could construct upfront. It's also the most honest evaluation you'll ever get - real lawyers, real contracts, real stakes. And because the feedback is attached to a trace ID, every piece of signal is linkable back to the exact agent run, the exact prompt, and the exact model version that produced it.
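The join is trivial once the trace ID is the key. A sketch with plain dicts — nothing here is MLflow-specific, and the IDs and verdicts are made up:

```python
# Turn traces plus human verdicts into a labeled dataset, keyed by trace ID.
# Trace IDs, inputs, and verdicts below are invented for illustration.

traces = {
    "tr-001": {"input": "Review NDA confidentiality clause",
               "output": "Clause lacks a survival period."},
    "tr-002": {"input": "Check liability cap",
               "output": "Cap is 12 months of fees."},
}
feedback = [
    {"trace_id": "tr-001", "verdict": "correct"},
    {"trace_id": "tr-002", "verdict": "incorrect"},
]

dataset = [
    {**traces[fb["trace_id"]], "label": fb["verdict"]}
    for fb in feedback
    if fb["trace_id"] in traces
]
# dataset is now labeled input/output pairs, ready for regression
# tests against new prompt or model versions.
```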

Observability as Compliance

Let me be upfront: I'm not an expert in IT law, data law, or anything adjacent, and I'm generally more sceptical of regulation than I am enthusiastic about it. So this isn't a compliance checklist.

From what I can tell, the EU AI Act includes logging and traceability requirements for AI systems operating in high-risk contexts — and legal decision support is the kind of use case that sits close to that territory, though where any specific deployment lands is genuinely contested. My point isn't about regulatory interpretation. It's that the technical bar the regulation describes — inputs logged, decisions recorded, outputs timestamped, model versions tracked — is close to what good observability already gives you anyway. If you're building this for engineering reasons, you're most of the way there. Thinking about it from a legal angle too costs very little and might matter more than you'd expect in a couple of years.

Where to Start

The starting point depends on which side of the agent you're on.

If you use agents rather than build them, observability starts with questions. Ask the agent what tools it has access to and what each does. Ask what tools it used to produce a specific result. Ask whether you can instruct it to use a particular tool for a particular task — most agents will follow that, and knowing which tools exist tells you more about the system's real capabilities than any product documentation. If a tool is missing — no access to a regulatory database you rely on, no way to handle a specific clause type — that's worth pushing on.

Skills, which I described in the Claude legal plugin post, are collections of prompts, tools, and resources that define how an agent approaches a specific workflow. They're usually just files, often public, often editable. And here's the part most people miss: you can use the agent itself to create one. Claude Code and Claude Cowork both have a built-in skill creator - describe the workflow you need, and the agent scaffolds the skill for you. If your agent doesn't have a skill creator, ask it to look up the skills standard for that platform and build one from there. The agent that doesn't know how to extend itself is worth less than one that does.

If you build agents, three instrumentation points cover most of the ground. Start with retrieval. It's the most common failure point in any RAG-based workflow, and it's almost never the model that's wrong. Log the query, the clauses returned, and their similarity scores. Most of what looks like hallucination traces back to retrieval returning the wrong context - you can't fix a search quality problem by tweaking your generation prompt.
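Whatever tracer you use, the record you want per retrieval is the same. A minimal sketch of that record as a plain dict — with MLflow you'd attach this to a span instead, and the field names here are my own convention:

```python
import json
import time

def log_retrieval(query, results):
    """Capture what went into retrieval and what came back, with scores."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "results": [
            {"clause_id": r["id"], "score": r["score"]} for r in results
        ],
    }
    # In practice this goes to your tracer; a JSON line is a fine fallback.
    return json.dumps(record)

line = log_retrieval(
    "confidentiality survival period",
    [{"id": "nda_017_conf", "score": 0.82},
     {"id": "msa_004_conf", "score": 0.61}],
)
```

With records like this, "retrieval returned the wrong context" stops being a hypothesis and becomes something you can grep for.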

Next, capture every LLM call with its full prompt. Not just the user message, but the complete system prompt, the full context window, and the exact model version. Prompts drift. Models update. Context changes between runs. Without the full input captured, you can't reproduce a result and you can't run regression tests when something changes.
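One cheap way to make "the exact input" checkable later is to fingerprint it. A sketch — my own convention, not an MLflow feature, and the model version strings are invented:

```python
import hashlib
import json

def input_fingerprint(system_prompt, context, model_version):
    """Stable hash over everything that shaped the model call."""
    payload = json.dumps(
        {"system": system_prompt, "context": context, "model": model_version},
        sort_keys=True,  # canonical ordering so the hash is reproducible
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# A silent model update changes the fingerprint, so a changed result
# can be traced to a changed input rather than to nondeterminism.
a = input_fingerprint("You are a contract reviewer.", ["clause text"], "model-2024-01")
b = input_fingerprint("You are a contract reviewer.", ["clause text"], "model-2024-06")
```

Store the fingerprint next to the trace and a regression test becomes a dictionary lookup: same fingerprint, different output — that's drift worth investigating.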

The third instrumentation point is the hardest to automate and the most valuable: human feedback. A simple correct/incorrect verdict from the reviewing lawyer, attached to the trace ID, turns your observability setup into a labeled dataset. For legal work, there is no better evaluation signal than a lawyer who looked at the output and told you whether it was right.