In Defense of Excel (Sort Of)
Last time I wrote here, I criticized Microsoft Word for trapping legal knowledge in proprietary formats and making collaboration unnecessarily painful. You might expect me to bash Excel next.
I won't. Excel taught me something Word never could: how to think in structured data.
Not because Excel is perfect—it's not. I've spent enough hours fighting with merged cells and watching formulas break to know its limits. But Excel forces you to think in rows and columns. It demands structure. And once you learn to see legal information that way, everything changes.
The shift from "document thinking" to "database thinking" isn't about learning specific tools. It's about recognizing that legal information has inherent structure—even when it looks messy—and learning to work with that structure instead of fighting it.
This mental model made Excel powerful until it wasn't. It made Python inevitable. And it made working with LLMs practical instead of theoretical.
What "Database Thinking" Actually Means
Database thinking means recognizing that legal information contains patterns that can be formalized.
I first learned this managing class actions. I was tracking hundreds (eventually thousands) of plaintiffs—each with a claim amount, filing date, statute of limitations issues, special circumstances. Most cases followed standard patterns, but prudent legal work meant spotting the exceptions: the plaintiff who transferred rights, the claim with unique standing issues.
Word documents couldn't handle this. Each plaintiff was buried in prose. Finding patterns meant reading everything, manually. A spreadsheet could. Each plaintiff became a row. Each case detail became a column. Eventually I could filter, sort, flag outliers. I could query my cases instead of reading through files.
That's when it clicked: I wasn't just organizing data. I was changing how I saw the information itself.
Consider contracts. Most lawyers see a contract as a document—pages of text with clauses, formatted nicely, saved as a PDF or .docx file.
Database thinking sees:
- Parties: Entities with attributes (name, jurisdiction, role)
- Terms: Data fields (effective date, termination date, renewal period, governing law)
- Obligations: Categorizable commitments with triggers, deadlines, and responsible parties
- Clauses: Standardized components (indemnification, confidentiality, IP assignment) with variations
Same contract. Different mental model.
The database thinker asks: "If I had 500 of these contracts, how would I want to query them?" Not "How do I make this one contract look professional?"
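That query-first mindset can be sketched in a few lines of Python; the fields and sample values below are invented for illustration, not taken from any real contract:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Contract:
    """One contract as a record, not a document."""
    party_a: str
    party_b: str
    effective_date: date
    governing_law: str
    has_indemnification: bool

contracts = [
    Contract("Acme GmbH", "Beta LLC", date(2023, 1, 15), "Germany", True),
    Contract("Acme GmbH", "Gamma Inc", date(2024, 6, 1), "New York", False),
]

# With 500 of these, "review" becomes a query:
# which contracts are missing an indemnification clause?
missing = [c for c in contracts if not c.has_indemnification]
print([c.party_b for c in missing])  # ['Gamma Inc']
```

The point isn't the code; it's that once the contract is a record, questions about it become one-liners.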
This applies everywhere:
| Legal Information | Document Thinking | Structured Data Thinking |
|---|---|---|
| Case files | Folders of documents | Tables of plaintiffs, claims, statuses, outcomes |
| Due diligence | Stack of PDFs to review | Structured data to extract, validate, flag |
| Cap tables | Spreadsheet to update | Relational data (investors, shares, rounds) with calculated fields |
The shift is subtle but powerful. You stop optimizing for how information looks and start optimizing for what you can do with it.
Why This Mental Shift Actually Matters
1. Legal Work Is Pattern Recognition
Most legal work involves finding patterns and exceptions. Class action management: 80% of plaintiffs have similar claims; the value is spotting the 20% with unique issues. Contract review: most clauses are standard; lawyers add value by identifying non-standard terms or missing protections.
Pattern recognition requires seeing data, not just documents.
When I had case information in a structured format (plaintiff → claim amount → filing date → special circumstances), I could sort by filing date to identify limitation issues, filter by claim amount to find outliers, flag cases with non-standard attributes. When it was scattered across Word documents, every review started from scratch.
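A minimal sketch of what that looks like in practice, using pandas; the column names, figures, and the outlier threshold are invented for illustration:

```python
import pandas as pd

# A hypothetical plaintiff table: one row per plaintiff, one column per attribute
df = pd.DataFrame({
    "plaintiff": ["A", "B", "C"],
    "claim_amount": [12_000, 950_000, 15_000],
    "filing_date": pd.to_datetime(["2021-03-01", "2023-11-15", "2022-07-20"]),
    "special_circumstances": [None, "rights transferred", None],
})

# Sort by filing date to surface potential limitation issues first
oldest_first = df.sort_values("filing_date")

# Filter by claim amount to find outliers above an illustrative threshold
outliers = df[df["claim_amount"] > 500_000]

# Flag records with non-standard attributes
flagged = df[df["special_circumstances"].notna()]
print(outliers["plaintiff"].tolist())  # ['B']
```

Each of those three operations replaces a full re-read of the case files.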
2. Normalized Data Enables Analysis
Normalization ensures consistency: dates formatted uniformly, entity names standardized, categories defined clearly.
Normalized data means you can aggregate, analyze, and validate reliably. Unnormalized data means every analysis starts with manual cleanup.
Legal work increasingly demands this kind of analysis: What percentage of M&A deals include specific indemnification language? Are compliance deadlines clustered in ways that create bottlenecks? Which "market standard" clauses have meaningful variations? Are your templates aligned with what clients actually negotiate?
You can't answer these questions if every data point requires manual extraction and cleaning.
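Normalization itself doesn't require heavy tooling. A sketch of the idea in plain Python, with invented formats and names, assuming you know the handful of formats your sources use:

```python
from datetime import datetime

# Dates as they might arrive from different sources
raw_dates = ["2024-01-05", "05.02.2024", "March 3, 2024"]

# Entity names as they might appear across documents
names = ["Acme GmbH", "ACME GmbH ", "acme gmbh"]

# Try a small set of known formats until one matches
FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%B %d, %Y"]

def normalize_date(text: str) -> str:
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print([normalize_date(d) for d in raw_dates])
# ['2024-01-05', '2024-02-05', '2024-03-03']

# Standardized names collapse three spellings into one entity
print(len({n.strip().casefold() for n in names}))  # 1
```

Do this once at the point of entry and every downstream analysis starts from clean data.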
3. Structured Data Is Reusable and Portable
Once information is structured, it becomes multi-purpose.
Take a cap table from a VC funding round as an example. The same cap table data could generate Excel files for internal review, PDF attachments for legal agreements, JSON for API integrations with external platforms (think of the client's own system), SQL database entries for long-term storage, and CSV exports for third-party software.
Think of structured data as an API for information: it can be transformed into whatever format the situation requires without losing fidelity.
This portability becomes especially valuable when working with AI tools. LLMs can process structured formats like CSV and JSON far more reliably than prose parsed out of Word documents or PDFs. Feed an LLM a JSON file with contract metadata, and it can analyze patterns, flag outliers, or generate reports with minimal hallucination risk. Feed it an unstructured document, and you first need a parsing layer to extract the raw text, then need the model to find the structure itself, which adds tooling complexity and reduces reliability before the LLM can do anything useful.
Document-centric thinking means recreating information for each use case. Need the cap table in a different format? Rebuild it manually. Want to integrate with another system? Copy-paste and hope nothing breaks.
Data-centric thinking means storing information once in a structured format, then generating whatever views or exports are needed.
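The store-once, export-many idea can be sketched with pandas; the cap table figures below are made up:

```python
import json
import pandas as pd

# One cap table, stored once as structured data
cap_table = pd.DataFrame({
    "investor": ["Founder A", "Seed Fund"],
    "shares": [800_000, 200_000],
    "round": ["Founding", "Seed"],
})

# A calculated field derived from the data, not typed in by hand
cap_table["ownership_pct"] = cap_table["shares"] / cap_table["shares"].sum() * 100

# The same data, exported to whatever view the situation needs
as_csv = cap_table.to_csv(index=False)         # for third-party software
as_json = cap_table.to_json(orient="records")  # for API integrations

records = json.loads(as_json)
print(records[0]["ownership_pct"])  # 80.0
```

The CSV and the JSON are disposable views; the structured table is the single source of truth.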
This is why vendor lock-in is so dangerous: if your data only exists in a proprietary format tied to a specific tool, you lose this portability. You can't easily migrate, integrate, or analyze without starting over.
Where Most Lawyers Go Wrong: Excel as Canvas, Not Database
Excel is where most lawyers first encounter structured data—often without realizing it.
When you build a spreadsheet with clear column headers, consistent data types per column, and formulas that calculate across rows, you're thinking like a database. You're clustering similar information into columns. You're classifying each row as a distinct record. You're standardizing formats so operations work consistently.
But, similar to Word, most lawyers treat Excel like a design tool instead of a data tool.
The Anti-Pattern: Merged Cells and Visual Design
Open any Excel file in your law firm and you will most likely see merged cells for headers, multiple tables on one sheet, heavy styling, summary rows mixed with data rows, and empty rows for visual spacing.
This looks professional. But it also completely breaks sorting, filtering, pivot tables, formulas, CSV export, and any script or AI tool that expects consistent structure.
I spent years creating these beautiful spreadsheets before I realized: every formatting choice that prioritizes appearance over structure makes your data less useful.
Merged cells deserve special mention. They're the single worst Excel habit you can have. When you merge cells, sorting breaks, filtering breaks, formulas break, database imports break, and Python scripts that expect consistent structure break. You've traded Excel's entire analytical power for a prettier header.
If you want visual formatting, apply it after exporting to the final deliverable format. Keep your working data clean.
The Pattern: One Table, Clear Structure
The database approach:
- One table per sheet
- Clear headers in row 1 only
- No merged cells (use "Center Across Selection" for centered headers if needed)
- Minimal or no styling
- Each row is one complete record
- Each column has one consistent data type
- No empty rows for spacing
Boring? Yes. But now your data is sortable, filterable, pivot-table ready, importable into databases, processable with scripts, and usable by AI tools.
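The payoff of the boring layout is that standard operations work without cleanup. A small sketch (matter names and hours are invented):

```python
import pandas as pd

# A clean one-table sheet: headers in row 1, one complete record per row
df = pd.DataFrame({
    "matter": ["M1", "M2", "M3", "M4"],
    "practice_area": ["M&A", "Litigation", "M&A", "Litigation"],
    "hours": [12.5, 40.0, 7.5, 20.0],
})

# Aggregation just works; the same operation would fail or mislead
# on a sheet with merged cells and summary rows mixed into the data
by_area = df.groupby("practice_area")["hours"].sum()
print(by_area["M&A"])  # 20.0
```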
When Excel's Limits Force Better Tools
Excel taught me structured thinking. But structured thinking eventually demanded better tools.
The version control problem: My cap table spreadsheet became Cap_Table_v3_FINAL_updated_SeriesA.xlsx. Someone emailed an old version with changes. Reconciling manually was error-prone.
The collaboration problem: Multiple people editing complex spreadsheets simultaneously = corrupted formulas, conflicting changes, broken references.
The scale problem: Excel files grow in size and complexity fast, and then Excel gets slow: the file freezes on opening, and filtering takes seconds that feel like minutes. On top of that, you can't easily query across multiple spreadsheets (different case types in separate files).
That's when I learned about relational databases like SQLite—a database that comes with Python, no server setup required. Just a file (like Excel), but designed for data storage and querying.
Migrating was trivial:
```python
import pandas as pd
import sqlite3

# Read the Excel tracking sheet into a DataFrame
df = pd.read_excel('my_file.xlsx')

# Write it into a SQLite database: a single file, like Excel,
# but built for storage and querying
with sqlite3.connect('my_database.db') as conn:
    df.to_sql('my_table', conn, if_exists='replace', index=False)

# Reality: I first had to clean merged cells, standardize date
# formats, and separate summary rows before this worked.
```
Now I could query across entries with SQL, filter instantly even with thousands of rows, and build automated reports from the data.
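A small sketch of what that querying looks like, using an in-memory database as a stand-in for the migrated file (table and column names are illustrative):

```python
import sqlite3
import pandas as pd

# An in-memory database standing in for the migrated SQLite file
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "plaintiff": ["A", "B", "C"],
    "claim_amount": [12_000, 950_000, 15_000],
}).to_sql("claims", conn, index=False)

# Query instead of scrolling: this stays instant with thousands of rows
rows = conn.execute(
    "SELECT plaintiff FROM claims WHERE claim_amount > ? ORDER BY claim_amount DESC",
    (100_000,),
).fetchall()
print(rows)  # [('B',)]
```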
The mental model transferred perfectly. I wasn't learning a completely new way of thinking—I was upgrading the tools that implemented the thinking I'd already developed in Excel.
Why Structured Thinking Is Your LLM Unlock: Two Perspectives
Structured data thinking matters for AI adoption in two ways that reinforce each other:
- LLMs work best with structured data as input and output
- LLMs excel at turning unstructured information into structured data
This isn't just theoretical. It's the difference between "we tried AI and it didn't work" and actually useful automation.
The Double Advantage: Structure In, Structure Out
LLMs consume structure: As mentioned above, when you feed an LLM clean, well-organized text with clear context, it performs better. Messy Word documents with formatting artifacts, inconsistent layouts, and buried information? The LLM spends its "attention" fighting the noise instead of understanding the content.
LLMs generate structure: This is where it gets interesting. LLMs are remarkably good at reduction tasks—taking unstructured text and extracting structured information from it. This is fundamentally different from generation tasks (creating new text from scratch), and it's where most practical legal value lives.
Consider the difference:
- Reduction: Extract key terms from 500 contracts, classify documents by type, flag missing clauses, identify compliance gaps
- Generation: Draft new contract clauses, create legal analysis, generate compliance documentation
Reduction works because you can validate the output. You know what data should be present. You can check if dates are valid, required fields are populated, values match expected patterns. Generation? Much harder to validate. How do you programmatically check if a generated indemnification clause is sound legal advice?
Generating one impressive clause makes for a good demo. Extracting structured data from 500 contracts into a queryable database transforms workflows.
The Basic Workflow (And How It Gets Sophisticated Fast)
Here's the pattern that actually works when using LLMs for data extraction:
1. Extract raw data from documents
Pull text from PDFs, Word files, emails—whatever format the information lives in. Yes, this step seems simple. It's not. OCR quality matters. Layout preservation matters. Table extraction matters. You can spend months refining document parsing alone. (But start simple: even basic text extraction beats manual work.)
2. Define your target structure
Decide what fields you want extracted and what format you want them in. This is where your domain knowledge matters. What's a "party" in this contract? What constitutes a "key term"? What dates matter?
The better you define this structure, the better your results. Start with obvious fields. Refine as you learn what the LLM struggles with.
3. Call the LLM with your structure
Send the extracted text to the LLM with clear instructions: "Extract these specific fields in this exact format." Be precise. Be explicit. Be like a senior lawyer giving instructions to a junior associate.
You'll iterate on this. A lot. The first prompt rarely works perfectly. That's fine. Prompt engineering is just refining your instructions.
4. Validate the result
Check if the LLM's output matches your expected structure. Are required fields present? Are dates actually dates? Do values fall within expected ranges? Is the JSON valid?
This is where structured thinking pays off: you can write automated validation because you defined the structure upfront. No structure? No validation. No validation? No confidence in your results.
5. Retry if validation fails
When validation catches errors (and it will), your program can handle retries automatically. No manual intervention needed. Failed to extract a date in the right format? The system retries with a more explicit prompt. Missing a required field? Retry with emphasis on that specific field. JSON parsing failed? Retry with stricter formatting instructions.
Here's where it gets fun. You can add another LLM call as an automated "reviewer" to check the first LLM's work before it even reaches your validation layer. You can build logic that tries different extraction strategies if the first one fails—maybe a more detailed prompt, or breaking the task into smaller steps. You can create feedback loops that programmatically refine prompts based on common failure patterns.
Do this, and congratulations: you've built an "agentic workflow". Your marketing department can now breathlessly announce you're using AI agents.
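Steps 4 and 5 above can be sketched in plain Python. Everything here is a placeholder: call_llm is a stub standing in for a real API call, and the fields and error-handling strategy are invented for illustration:

```python
import json

def call_llm(prompt: str, document_text: str) -> str:
    """Stand-in for a real LLM call; a real implementation would hit an API."""
    # This stub fails once, then returns valid JSON, to exercise the retry path
    call_llm.attempts = getattr(call_llm, "attempts", 0) + 1
    if call_llm.attempts == 1:
        return "not json at all"
    return '{"party": "Acme GmbH", "effective_date": "2024-01-05"}'

def validate_extraction(raw: str) -> dict:
    """Step 4: check the output against the structure defined upfront."""
    data = json.loads(raw)  # raises ValueError on malformed output
    if "party" not in data:
        raise ValueError("missing required field 'party'")
    return data

def extract_with_retries(document_text: str, max_attempts: int = 3) -> dict:
    """Step 5: retry automatically, feeding each failure back into the prompt."""
    prompt = "Extract party and effective_date as JSON."
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt, document_text)
        try:
            return validate_extraction(raw)
        except ValueError as err:
            last_error = err
            prompt += f"\nPrevious attempt failed ({err}). Return strictly valid JSON."
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts: {last_error}")

result = extract_with_retries("...contract text...")
print(result["party"])  # Acme GmbH
```

The structure defined in step 2 is what makes the validation in step 4, and therefore the automated retries in step 5, possible at all.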
Here's where Python frameworks designed specifically for structured LLM outputs and/or multi-step "agentic workflows" become invaluable. Tools like instructor, outlines, Pydantic AI, and DSPy aren't just helpful—they're built precisely for this problem. They enforce schema validation, handle retries automatically, and make sure your LLM outputs actually match your data structures.
The Conclusion: Why This Mindset Matters More Than Ever
Structured data thinking isn't just about working faster. It's about what becomes possible.
High-volume legal work demands it. Compliance reviews, contract portfolios, due diligence—these aren't one-off document projects anymore. They're data operations that need to scale.
Knowledge management requires it. You can't build institutional knowledge on unstructured Word files scattered across desktops. You need searchable, queryable, analyzable information.
Automation depends on it. Every repetitive legal task is a candidate for automation. But you can't automate chaos. You need consistent structure, clear schemas, validated data.
And AI adoption—the capability every firm is chasing—depends on all three. LLMs need clean inputs and validated outputs. Firms with data trapped in documents will spend years cleaning up before they can even start. Firms that already think structurally can experiment today.
The AI capability gap everyone's worried about? It's really a data quality gap. And the data quality gap starts with how you think about information: documents to be created, or data to be structured.
Structured thinking isn't just preparation for AI. It's the foundation that makes AI practical instead of aspirational.