Lawyers Are Obsessed With the Wrong Thing
The most common question I get is some version of "which model should I use to do X." The question treats the model as the strategic choice: pick the right one, commit to it, and the work follows.
The same logic drives tool selection. Firms tend to choose AI products based on which model sits underneath: the latest, the biggest, the newest. The model isn't usually in the product name, but it's the purchase driver. The question "which tool should we buy" is often really "which model should we commit to" dressed up in vendor packaging.
Lawyers are willing to bet everything on the least durable choice in the stack. Every other layer — data, prompts, tools, integrations, evals — survives a model change. The model doesn't survive anything.
Stop asking which model to pick. Ask what system to build so the question doesn't matter.
The Capability Gap Is Smaller Than You Think
The implicit assumption behind model-maximalism: only the frontier can handle the work reliably.
Legal work is mostly research and analysis. But the quality of both comes from the data they run on: proprietary documents, domain-specific corpora, jurisdictional rules, firm precedents, internal memoranda. None of that lives inside the model. It lives in retrieval, in the data pipeline, in the tools you give the model access to.
A smaller model that can see the right documents and call the right tools outperforms a frontier model that can't. The capability gap between models is dwarfed by the access gap between a system wired into your data and a model operating blind. The leverage isn't in the model. It's in the data access.
The frontier exists for a specific reason: the long-running, genuinely hard tasks where the reasoning itself is the bottleneck. Most work that crosses a lawyer's desk doesn't hit that ceiling. It hits data bottlenecks instead. There are two: the volume problem (processing large amounts of text — contracts, filings, correspondence) and the retrieval problem (finding and synthesizing information across current documents and historical knowledge — precedents, prior matters, firm memory). A well-tuned mid-tier model handles both. Neither requires the frontier.
Stop asking which model is better. Ask which model is sufficient — given what it can actually see. Sufficiency plus access beats superiority plus blindness.
The capability gap is also shrinking. Newer mid-tier models handle tasks that required frontier models a year ago. The "frontier is irreplaceable" position ages badly because the frontier keeps moving. A well-designed system gets you most of the way there: good retrieval, well-indexed documents, clean tool interfaces, evaluation, error handling. The system supplies the leverage that model size can't.
One Provider, Three Ways to Lose
The launch of Claude Fable 5 produced three separate crises in a matter of weeks: a data retention policy that disqualified it for legal work, a silent degradation that caused a community uproar, and an overnight government directive that suspended access without warning. Not the same problem three times — three different categories of risk that any single-provider setup is exposed to:
Government or regulatory action. Providers want to serve you, but the jurisdiction won't let them. Export controls, sanctions, data residency rules. The US government's June 12 directive against Fable 5 showed what that looks like in practice: access cut off overnight, no prior notice, no specific reason given. Any workflow that touched it stopped working before anyone knew it was coming. The June 12 directive should be a wake-up call. It will happen again — different provider, different grounds, same result.
Provider governance failure. Providers change things on their end, and you're exposed whether you saw it coming or not. Anthropic's 30-day data retention policy on Claude Fable 5 is a quiet example: it conflicts directly with client confidentiality duties and makes the model unusable for legal work, regardless of capability. Anthropic frames the policy as a safety feature; for law firms it's a deal-breaker on its own.
By the way, if you think this is just an Anthropic problem, you are wrong. On April 23, an OpenAI employee publicly listed text-embedding-3-small, the model many teams use for document search, for shutdown on October 23, 2026. A colleague walked it back the same day. False alarm. But if it had been real, every existing embedding built on that model would have become worthless the day the model was switched off. Even as a false alarm, it points to a structural risk: it shows exactly what you're exposed to.

Commercial decisions. The provider changes its terms unilaterally. The shift from flat-rate subscriptions to usage-based pricing is the most concrete trend: GitHub Copilot recently moved to usage-based pricing, and given the computational demands of frontier models, the frontier-model providers are likely to head in the same direction. For a law firm running heavy AI workloads, that turns a predictable budget line into a variable cost that scales with usage, at a rate the provider sets.
All three collapse into the same structural problem: commit to one model provider and you're exposed to whatever that provider — or their government — decides to do next. The only defense that works is structural: make the model interchangeable.
What Model-Agnosticism Actually Looks Like
The fastest way to stop obsessing over model choice is to try different models in the same tool: switch the model behind the interface, run the same task, compare the output. Once you can do that without changing your workflow, the differences become concrete: capability, cost, sovereignty. The model becomes a dropdown, not a commitment.
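To make the "same task, different model" comparison concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `model_a` and `model_b` are hypothetical stubs with canned answers standing in for real provider or local-runtime wrappers, and `mentions_all` is a deliberately crude check, not a serious quality metric.

```python
from typing import Callable, Dict, List

# A "model" is any function from prompt to text. In a real setup each entry
# would wrap a provider SDK or a local runtime; these stubs are hypothetical
# stand-ins with canned answers so the sketch runs offline.
ModelFn = Callable[[str], str]


def model_a(prompt: str) -> str:
    return "Clause 7 caps liability at the fees paid in the prior 12 months."


def model_b(prompt: str) -> str:
    return "Liability is capped at the fees paid."


MODELS: Dict[str, ModelFn] = {"model-a": model_a, "model-b": model_b}


def run_everywhere(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Run the same prompt against every configured model."""
    return {name: fn(prompt) for name, fn in models.items()}


def mentions_all(output: str, required: List[str]) -> bool:
    """Crude check: does the output mention every required term?"""
    lowered = output.lower()
    return all(term.lower() in lowered for term in required)


outputs = run_everywhere("Summarize the liability clause.", MODELS)
report = {name: mentions_all(text, ["clause 7", "liability"])
          for name, text in outputs.items()}
print(report)
```

The workflow code never names a provider. Adding or removing a model means editing the `MODELS` mapping, which is the "dropdown" idea in miniature.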
Some tools make this easy. The ones I use regularly are pi, OpenCode, and Hermes agent: they work across providers, run locally, and treat the model as a config choice. But the model-switching is the smaller benefit. What matters more is that all three are extensible through skills, plugins, hooks, and tool integrations. You shape the tool around your workflow, not the other way around. A configurable tool on a mid-tier model with access to your proprietary data beats a locked-down tool on a frontier model for almost everything you'll actually do, and the configuration investment survives every model swap. That's the argument I made in the post on encoding legal expertise into skills: the leverage lives in the configuration layer, not in the model.
My personal daily stack, each component with a different role:
- Qwen3.6 and Gemma 4 for private, day-to-day work. Both are small enough to run on my own hardware, so no provider can pull the plug. The "you actually own it" tier.
- DeepSeek 4 (often the Flash variant) for research tasks working through a detailed checklist. Cheap, fast, totally sufficient. The "I need to run through a long list of items" tier.
- Kimi2.6, MiniMax 3, or Claude Sonnet 4.6 for the harder tasks. Complex projects, creative ideas, novel analysis. Not Opus, not GPT-5.5.
- Open-source retrieval — the retrieval layer, separate from the generative stack. I currently use an open-weight retrieval model I can download and run locally, but this changes per project. The point is owning the model: nobody can deprecate it out from under me.
- Whisper (OpenAI) and Parakeet (NVIDIA) for speech-to-text — the audio layer. Both run locally, both open-source. Same principle: own the model, own the output, no API call leaving the machine.
Each component is swappable; none is load-bearing for the others. The Qwen/Gemma local tier is sovereign. The DeepSeek tier is cheap. The Kimi tier is the reasoning ceiling. If any one of the models disappears, the rest keep working.
This is what works for me right now. In three months the model names will be different. The point isn't these specific choices. It's that any of them can be swapped.
This stack has been resilient to every provider-side and geopolitical change in the last six months: Fable 5's silent degradation, the embeddings deprecation scare, the Copilot pricing shift, the June 12 directive. None of them touched the system. Not because I predicted any of them, but because the system wasn't pinned to any one provider.
Now, here is an interesting observation: the most capable open-source models you can actually own — Qwen, DeepSeek, Kimi, MiniMax — all come from Chinese labs. The proprietary models subject to unilateral US government restriction are US-developed. The "authoritarian regime" produces the open stack. The "free world" restricts access to it. Weird. Europe, by the way, isn't part of the AI game anymore — it wrote the rulebook and missed the race.
Build the System, Not the Dependency
If you're building AI-powered legal workflows — not just using a tool, but designing something that runs at scale on client data — the argument shifts from models to infrastructure.
Owning the stack means more than making the model swappable. It means controlling where your data sits, where inference runs, and what happens when any one layer of the system changes on you. For firms handling sensitive client data, that argument eventually extends to the infrastructure itself. Not every firm needs to run its own servers, but every firm should make that decision deliberately, with a clear view of what they're giving up when they don't. Outsourced infrastructure means a provider controls where inference runs, how long data is retained, and whether access continues tomorrow. That's a risk position, not a neutral default. The model is a component. The infrastructure is the asset.
A model-agnostic system has five layers:
- The prompt layer — well-shaped instructions, few-shot examples, output schemas. If your prompts are good, the model choice matters less. For a law firm, this means your expertise lives in the prompts, not in the model.
- The tool layer — retrieval over your documents, access to firm precedents, the ability to call external systems. The intelligence of the system comes from what the model can do, not just what it can say. Your document-retrieval pipeline and precedent database are assets no provider can take away.
- The eval layer — automated checks on output quality, regression tests, comparison runs between models. This is how you know whether a swap actually changed anything. As I argued in the observability post in April, observability is the prerequisite for model-agnosticism. If you can't measure what the model is doing, you can't compare it to alternatives, and you can't detect the moment your provider silently changes behavior. In practice, this means running the same task against two models and checking whether the outputs diverge on a real client matter. The eval set is also what makes the system improvable without manual re-tuning. When you swap models, you can run an automated optimization pass against it — searching for the prompt formulations and few-shot examples that work best for the new model. The system doesn't just survive a model swap; it can adapt to one.
- The error-handling layer — fallback paths, retries with a different model, graceful degradation. The system survives a model failure by routing around it. If your primary model goes offline, the system routes to another one automatically — no manual workaround needed.
- The configuration layer — model choice lives in a config file, not in the code. Swapping providers is a config change, not a rewrite. But configuration isn't just about which model you use. It's about which model you use for what. A well-built system routes tasks to the right model: hard reasoning goes to a capable model, document classification or chunking goes to something small, fast, and cheap. Most tasks don't need the frontier. Routing by task type is how you control cost and latency without touching the architecture.
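The configuration and error-handling layers above can be sketched together in a few lines. This is a sketch under stated assumptions, not a production design: the model names, the `ROUTES` mapping, the `ModelUnavailable` exception, and the stub functions are all hypothetical, and in practice the routing table would be loaded from a config file rather than written inline.

```python
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]


class ModelUnavailable(Exception):
    """Raised by a model wrapper when its provider cannot be reached."""


# Hypothetical stubs standing in for real model wrappers.
def small_local(prompt: str) -> str:
    return f"[small-local] {prompt}"


def mid_hosted(prompt: str) -> str:
    # Simulate an outage so the fallback path is exercised.
    raise ModelUnavailable("mid-hosted is offline")


def mid_backup(prompt: str) -> str:
    return f"[mid-backup] {prompt}"


REGISTRY: Dict[str, ModelFn] = {
    "small-local": small_local,
    "mid-hosted": mid_hosted,
    "mid-backup": mid_backup,
}

# Task type -> ordered list of models to try. This is the part that would
# live in a config file; changing providers means editing this table only.
ROUTES: Dict[str, List[str]] = {
    "classification": ["small-local"],
    "hard_reasoning": ["mid-hosted", "mid-backup", "small-local"],
}


def run_task(task_type: str, prompt: str) -> str:
    """Try each configured model in order; fall back when one is unavailable."""
    errors: List[str] = []
    for name in ROUTES[task_type]:
        try:
            return REGISTRY[name](prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


print(run_task("classification", "Tag this NDA"))
print(run_task("hard_reasoning", "Assess the indemnity carve-outs"))
```

Here the simulated outage of `mid-hosted` is routed around without the caller noticing, and cheap tasks never touch the larger models. A real system would also catch timeouts and rate limits, and would log each fallback so the eval layer can see it.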
Start with the audit: which workflows depend on which model, and what happens if access is cut off tomorrow? If the answer is "we'd have to rebuild," the system isn't built yet.
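One way to run that audit is to write the dependencies down as data and ask the question mechanically. The sketch below assumes a hypothetical inventory in which each workflow lists the models it can run on, named `provider/model`. The workflow and provider names are invented for illustration.

```python
from typing import Dict, List

# Hypothetical inventory: workflow -> models it is able to run on today.
WORKFLOWS: Dict[str, List[str]] = {
    "contract-review": ["provider-a/large"],
    "intake-triage": ["local/small", "provider-b/mid"],
    "research-memo": ["provider-a/large", "provider-b/mid"],
}


def provider_of(model: str) -> str:
    """Extract the provider prefix from a 'provider/model' name."""
    return model.split("/", 1)[0]


def stranded_by(provider: str, workflows: Dict[str, List[str]]) -> List[str]:
    """Workflows left with no usable model if `provider` is cut off."""
    return sorted(
        name
        for name, models in workflows.items()
        if all(provider_of(m) == provider for m in models)
    )


for p in ("provider-a", "provider-b", "local"):
    print(p, "->", stranded_by(p, WORKFLOWS))
```

Any workflow that shows up in a `stranded_by` result is a "we'd have to rebuild" answer waiting to happen. In this invented inventory, only `contract-review` is stranded, and only by the loss of `provider-a`.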
The model you're depending on will be deprecated, restricted, or repriced. Not as a risk — as a certainty. Build accordingly.