"Agent" arrived in legal the way most terms do — borrowed from somewhere else, diluted on the way over.
OpenClaw put the word on the front page. Claude Code showed what it actually meant: an LLM taking actions, observing results, correcting course, finishing complex work without someone at the keyboard for every step. Now every legal AI vendor has agents or "agent builders", too.
The word has stopped being useful.
An agent isn't a smarter chatbot. It's a specific architecture — an LLM with tools, running in a loop, making decisions about what to do next until it finishes or hits a limit. The lawyers who deploy them effectively won't be the ones who bought the best platform. They'll be the ones who understand what agents actually need to work.
Anatomy of an Agent
Strip away the hype, and an agent is surprisingly simple. In essence, it has only four components.
First, a goal — not "answer this question" but "accomplish this task." Second, tools: functions the LLM can call, like searching a database or the web, reading a document, writing a file, sending an email, or asking a human for clarification. Third, a loop: the agent plans, acts, observes the result, and repeats — deciding what to do next based on what it learned from the last step. Finally, a stopping condition: the goal is achieved, maximum iterations are reached, or it's time to escalate to a human.
That's it. The same LLM you use in a chat interface, given structure, tools, and permission to iterate.
    def agent_loop(goal, tools, max_iterations=10):
        current_state = None
        iteration = 0
        # Keep going until the goal is reached or we hit a safety limit
        while iteration < max_iterations:
            # LLM decides what to do next
            action = llm.plan(goal, current_state, available_tools=tools)
            # Execute that action
            current_state = tools.execute(action)
            # Check if we're done
            if llm.evaluate(goal, current_state):
                return current_state
            iteration += 1
        return "Maximum iterations reached — escalating to human"
One thing non-technical readers often miss: the LLM doesn't "have" tools. You define tools as functions — regular code functions that take specific inputs and return specific outputs. A search function. A file reader. An email sender. Each tool has a name, a description of what it does, and a schema defining what inputs it expects.
You give the LLM a system prompt that lists what's available:
"You have access to these functions: query_database(sql), read_file(path), send_email(to, subject, body)..."
When the LLM decides it needs to search the database, it doesn't execute the search itself. It generates the function call:
    query_database(sql="""
        SELECT * FROM precedents
        WHERE type = 'loan_agreement'
          AND jurisdiction = 'New York'
        LIMIT 10
    """)
The agent framework receives this, executes the actual function, and feeds the result back to the LLM. The LLM is just deciding which tool to use and what parameters to pass — the tool itself is your code.
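The division of labor above can be sketched in a few lines. This is illustrative pseudocode of the pattern, not any specific framework's API; the names (`query_database`, `TOOLS`, `execute_tool_call`) are made up for the example.

```python
# Illustrative sketch: a tool is just a function plus a description and schema.
def query_database(sql: str) -> list:
    """Run a read-only query against the precedent database (stubbed here)."""
    return [{"id": 1, "type": "loan_agreement", "jurisdiction": "New York"}]

TOOLS = {
    "query_database": {
        "function": query_database,
        "description": "Run a SQL query against the firm's precedent database.",
        "schema": {"sql": "string"},  # what inputs the LLM must supply
    },
}

def execute_tool_call(name: str, arguments: dict):
    """The framework's job: look up the tool the LLM named and run it."""
    tool = TOOLS[name]
    return tool["function"](**arguments)

# The LLM emits only a name + arguments pair; your code does the actual work.
result = execute_tool_call("query_database", {"sql": "SELECT * FROM precedents LIMIT 10"})
```

The `description` and `schema` fields are what the LLM "sees" in its system prompt; the `function` field is what your code runs when the LLM names that tool.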
Think of an agent as a first-year associate who can work autonomously: they understand the assignment, know what resources they have access to, check their work as they go, and ask for help when stuck.
You don't have to write all this plumbing yourself. Frameworks like PydanticAI, LangGraph, or DSPy handle the loop management, tool calling, error handling, and observability for you. You define your tools as Python functions, describe what they do, and the framework orchestrates the rest. The code example above is simplified pseudocode — in practice, you'd use a library that has already solved the retry logic, conversation management, and logging.
Why Now?
Agents aren't new. Two years ago, AutoGPT and similar projects tried to build autonomous LLM systems. They failed spectacularly. The models couldn't plan reliably, hallucinated tool inputs, and got stuck in loops.
What's changed is that LLMs have gotten substantially better at multi-step reasoning and consistent tool use. The breakthrough came in coding: Claude Code, Cursor, and similar tools proved that agents could execute complex, multi-step tasks without constant human intervention.
You can now trust a model to decide "resolve the missing collateral details before pulling a template" — and then actually do it, rather than making up values or looping endlessly. That gap between what vendors promised and what the models could actually deliver was the blocker. It's narrowing.
The coding agents came first because the feedback loops are tight and the domain is well-defined. Other workflows, including legal, are next. The infrastructure is ready. The question is whether legal tech is.
A Concrete Example: Drafting a Loan Agreement
The workflow is clearer with a concrete example. Let's say you need to draft a loan agreement. Here's what an associate would do — and what an agent does, step by step.
Step 1: Intake and Clarification
The associate asks the client: What's the loan amount? What's the interest rate? Who are the parties? Is there collateral? What jurisdiction governs?
The agent's tool: An intake function that presents questions, validates responses, and flags ambiguities. If the client says "market rate," the agent asks: "Do you mean the prevailing LIBOR rate, the federal funds rate plus a spread, or something else?"
Good agents know when they don't know. They ask rather than guess.
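The intake step above can be sketched as a small validation function. This is a hypothetical example; the field names and the ambiguity list are assumptions for illustration, and in practice an LLM rather than a lookup table would detect ambiguity.

```python
# Hypothetical intake check: required fields plus flags for ambiguous answers.
REQUIRED_FIELDS = ["loan_amount", "interest_rate", "parties", "jurisdiction"]
AMBIGUOUS_TERMS = {
    "market rate": "Do you mean a benchmark rate plus a spread, or something else?",
}

def validate_intake(responses: dict) -> dict:
    missing = [f for f in REQUIRED_FIELDS if not responses.get(f)]
    follow_ups = [
        question
        for term, question in AMBIGUOUS_TERMS.items()
        if any(term in str(v).lower() for v in responses.values())
    ]
    # Intake is only complete when nothing is missing and nothing is ambiguous
    return {"complete": not missing and not follow_ups,
            "missing": missing, "follow_ups": follow_ups}

report = validate_intake({"loan_amount": "5,000,000", "interest_rate": "market rate",
                          "parties": "Acme Corp / First Bank", "jurisdiction": "New York"})
# "market rate" is ambiguous, so the agent asks a follow-up rather than guessing
```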
Step 2: Research
The associate searches for similar past agreements. They check the firm's precedent database. They look for comparable deals with the same counterparty or similar terms.
The agent's tool: A research function that queries internal knowledge bases. It might search by matter type, counterparty, jurisdiction, or specific clause patterns.
The reality: Most law firms don't have a clean, searchable precedent database. The most common "research tool" in practice is drafting an "all@firm.com" email asking who recently wrote a loan agreement on these or similar terms. An agent could do this too.
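If a structured precedent database does exist, the research tool reduces to a metadata query. A minimal sketch, with the records and field names invented for illustration:

```python
# Illustrative: a precedent search over structured metadata, stubbed as a list.
# In a real system this would be a database or search-index query.
PRECEDENTS = [
    {"matter": "2023-114", "type": "loan_agreement", "jurisdiction": "New York"},
    {"matter": "2022-087", "type": "loan_agreement", "jurisdiction": "Delaware"},
    {"matter": "2024-031", "type": "nda", "jurisdiction": "New York"},
]

def search_precedents(**filters):
    """Return precedents matching every supplied metadata filter."""
    return [p for p in PRECEDENTS
            if all(p.get(key) == value for key, value in filters.items())]

matches = search_precedents(type="loan_agreement", jurisdiction="New York")
```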
Step 3: Template Selection and Drafting
The associate pulls a template, fills in the variables, and adapts it based on the specific deal terms and any special requirements.
The agent's tool: A drafting function that generates documents from templates, fills in structured data, and saves to the appropriate matter folder.
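The drafting tool, at its simplest, is template substitution over the structured intake data. A minimal sketch using Python's standard library; the template text and field names are invented for the example:

```python
from string import Template

# Sketch of a drafting function: fill a template with validated intake data.
LOAN_TEMPLATE = Template(
    "LOAN AGREEMENT between $lender and $borrower.\n"
    "Principal: $amount. Interest: $rate. Governing law: $jurisdiction."
)

def draft_from_template(template: Template, data: dict) -> str:
    # substitute() raises KeyError on a missing field, which is what we
    # want: no draft with silent gaps should leave this step
    return template.substitute(data)

draft = draft_from_template(LOAN_TEMPLATE, {
    "lender": "First Bank", "borrower": "Acme Corp",
    "amount": "$5,000,000", "rate": "SOFR + 2.5%", "jurisdiction": "New York",
})
```

Choosing `substitute` over `safe_substitute` is deliberate: failing loudly on a missing variable beats producing a draft with a hole in it.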
Step 4: Self-Review
Before anything goes to the partner, the associate reads the draft through — all of it. Not just the terms, but whether the document actually holds together. Is the defined term used consistently throughout? Does the recitals section match what's in the operative clauses? Does this still reflect what the client asked for, or did the drafting drift? Is the language consistent with the firm's house style?
The agent's tools: This is where a single agent can become several. You could run one review function that checks everything sequentially — or spin up specialized subagents in parallel: one for internal coherence, one for playbook conformance, one for house style, one to re-verify the output against the original client brief. Each returns a structured findings report; the orchestration layer aggregates them. Parallel subagents are faster, and specialization tends to produce more thorough results than asking one agent to do everything at once. Either way, issues get logged by type and severity — structural, positional, stylistic — so the partner review package is useful, not just complete.
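The parallel-subagent pattern above can be sketched with the standard library. Each "subagent" is stubbed here as a plain function returning findings; in a real system each would run its own LLM loop against the draft, and the hard-coded findings below are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

# Stubbed review subagents, each returning a structured findings report.
def check_coherence(draft):
    return [{"type": "structural", "severity": "high",
             "issue": "defined term 'Lender' used inconsistently"}]

def check_playbook(draft):
    return []  # no playbook deviations found

def check_house_style(draft):
    return [{"type": "stylistic", "severity": "low",
             "issue": "numbers not spelled out on first use"}]

def run_review(draft, checks):
    # Run all subagents in parallel, then aggregate their findings
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda check: check(draft), checks)
    findings = [f for result in results for f in result]
    # Sort by severity so the partner package leads with what matters
    order = {"high": 0, "medium": 1, "low": 2}
    return sorted(findings, key=lambda f: order[f["severity"]])

report = run_review("…draft text…", [check_coherence, check_playbook, check_house_style])
```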
Step 5: Partner Review
The associate packages their work for the supervising partner: the draft, the flagged deviations, the precedents relied on, any open questions. The partner reviews and decides what to change.
The agent's tool: A review submission function that compiles the draft, analysis, and supporting materials into a structured handoff. The agent doesn't decide whether the work is good enough — the partner does. Every time. What the agent controls is how well-prepared that handoff is.
Each step uses tools appropriate to the task. Each tool is something a human would do — just automated, consistent, and tireless. And each tool can itself be an agent: its own goal, its own tools, its own loop. That's what "multi-agent" means in practice — not a fleet of robots, but tools that think.
The Autonomy Spectrum
"Agent" describes a spectrum, not a category. The variable is autonomy — how much decision-making freedom does the LLM have?
| Level | Who Decides the Path | Example | Testing |
|---|---|---|---|
| Level 1: Strict Workflow | You hardcode the sequence | Step 1 → Step 2 → Step 3, always | Unit tests. Deterministic. |
| Level 2: Agent-Assisted | LLM plans which tools to use, within guardrails | "Draft this loan agreement" — agent decides whether to research precedents first or clarify terms | Evaluate outcomes + process quality |
| Level 3: Fully Autonomous | LLM sets sub-goals, adapts from outcomes | "Handle this financing" — agent determines scope, research needs, documentation, manages closing | Mostly experimental. Operational overhead is significant. |
The gap between Level 2 and Level 3 is bigger than it looks. A Level 2 agent waits for a human to trigger it. A Level 3 agent runs on a schedule — it wakes up, recalls what it did last time, decides what to investigate next, and spawns whatever subagents it needs to get there. No one presses a button. The result arrives when it's done. That's a fundamentally different relationship between the system and the humans around it, which is exactly why the operational overhead is significant.
How to Build This
The instinct is to start at Level 2 or 3 — the demos look impressive and that's the goal anyway. Resist it.
Start strict. Build a hardcoded workflow first. Step 1, then Step 2, then Step 3. No autonomy, no decisions left to the LLM. You control every transition.
Build tests. For each step, define what success looks like before you run it. For intake: did it capture all required fields? For research: did it find relevant precedents? For drafting: does the output match the template structure?
Add evaluation. Run the workflow on real examples. Where does it fail? Where does it produce garbage? Fix the strict workflow before you add any autonomy.
Gradually loosen. Once you trust the components, give the LLM limited decision-making power. Let it choose between two research strategies based on the query. Let it decide whether a term needs clarification or is clear enough.
Never remove observability. Every step gets logged. Every tool call gets recorded. Every decision is traceable.
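Observability is cheap to wire in from the start. A minimal sketch of a logging wrapper that records every tool call; the log format and tool names are illustrative:

```python
import functools
import json
import time

# Minimal observability sketch: every tool call is recorded with its
# name, inputs, and duration before the result goes anywhere.
CALL_LOG = []

def logged(tool):
    @functools.wraps(tool)
    def wrapper(**kwargs):
        start = time.time()
        result = tool(**kwargs)
        CALL_LOG.append({"tool": tool.__name__,
                         "args": json.dumps(kwargs, default=str),
                         "seconds": round(time.time() - start, 3)})
        return result
    return wrapper

@logged
def read_file(path: str) -> str:  # stub tool for illustration
    return f"contents of {path}"

read_file(path="matters/2024-031/draft.docx")
# CALL_LOG now holds a traceable record of the call
```

The same decorator can wrap every tool in the registry, so traceability is a property of the plumbing rather than a discipline each tool author has to remember.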
The operational overhead of autonomous agents is real. Staying lower on the spectrum trades flexibility for control, and in legal work that is usually the right trade. A system that works reliably at Level 1 is more valuable than one that occasionally works at Level 3. Start strict, earn the right to be autonomous.
Build for Agents, Not for Humans
Most legal software was designed for a human at a desk. Every design decision — the dashboard, the PDF export, the approval button — assumes an intelligent person on the other end interpreting what they see.
Agents don't read dashboards. They call APIs or command-line tools. They don't click buttons — they pass parameters. They don't open PDFs — they parse structured data.
This matters because the models are increasingly a commodity. Claude, GPT, Gemini — capable enough for most legal workflows. What isn't commoditized is the infrastructure around them: the tools they can call, the data they can reach, the systems they can actually interact with. And most legal tech was not designed for this consumer.
Contract playbooks: A PDF of guidelines is designed for a lawyer to read and interpret. An agent needs a structured rule set — conditions, thresholds, required clauses — that it can execute against without interpretation. Same knowledge, different format.
Document management: A folder hierarchy is designed for a human to browse. An agent needs an API that returns documents with structured metadata — matter number, type, parties, date — and supports programmatic queries, not manual navigation.
Client intake: A web form that produces a Word document is designed for a human to process downstream. An agent needs validated structured data it can pipe directly into the next step without anyone transcribing anything.
Internal knowledge: A shared drive or wiki is designed for human navigation. An agent needs a database with retrieval-optimized vector and full-text search capabilities and clear provenance — something it can query semantically, not browse manually.
Execution environment: A polished GUI is designed for a human to click through. An agent doesn't need any of that — it needs a sandbox: an isolated environment with a filesystem it can write to, CLI tools it can invoke, and the ability to run small scripts and inspect the results. This is a recent but significant observation: what separates capable agents from limited ones often isn't the model. It's whether the agent has a safe place to execute code and the low-level tools to interact with the surrounding system.
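The playbook item above is the clearest case of "same knowledge, different format." A sketch of what an executable rule set might look like; the rule shape, thresholds, and field names are invented for illustration, not any vendor's format:

```python
# Sketch: playbook knowledge as data an agent can execute against,
# instead of a PDF a lawyer has to interpret.
PLAYBOOK = [
    {"id": "interest-cap", "applies_to": "loan_agreement",
     "condition": lambda terms: terms.get("rate_spread_bps", 0) <= 400,
     "on_fail": "Spread above 400 bps requires partner sign-off."},
    {"id": "ny-law", "applies_to": "loan_agreement",
     "condition": lambda terms: terms.get("jurisdiction") == "New York",
     "on_fail": "Non-NY governing law must be flagged to the client."},
]

def check_against_playbook(doc_type: str, terms: dict) -> list:
    """Return the escalation message for every rule the terms violate."""
    return [rule["on_fail"] for rule in PLAYBOOK
            if rule["applies_to"] == doc_type and not rule["condition"](terms)]

flags = check_against_playbook("loan_agreement",
                               {"rate_spread_bps": 550, "jurisdiction": "New York"})
```

No interpretation happens at run time: the rule either passes or produces a concrete escalation, which is exactly what an agent needs.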
The question to ask about every tool in your stack: was this built for a person, or can a program use it without one? Every "for a person" is a point where your agents will stall.
The Bottom Line
An agent is LLM, tools, loop, stopping condition — nothing more. The autonomy spectrum exists because different tasks tolerate different levels of unpredictability. Legal work sits toward the conservative end of that spectrum, which is exactly why starting strict makes sense.
The loan agreement workflow isn't theoretical. The models are capable enough today. The constraint is always the same: can the agent get what it needs, in a form it can use, from systems that were designed to give it?