Richard Zhu

LLM Extraction Methodology

How opinion PDFs become a legal-data table without letting model output outrun the source material.

2026

What the pipeline is trying to preserve.

Legal extraction fails when it treats opinions as generic text. The important fields are not just names and outcomes. They include posture, publication status, standard of review, issue type, offense category, relief type, authorship, panel composition, and uncertainty.
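One way to picture a record that carries these fields is a small typed structure. The sketch below is illustrative only: the class name, field names, and the `uncertain_fields` convention are assumptions made for this example, not the pipeline's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DecisionRecord:
    """Hypothetical decision-level record; every field name is an assumption."""
    source_pdf: str                           # link back to the source opinion
    posture: Optional[str] = None
    publication_status: Optional[str] = None
    standard_of_review: Optional[str] = None
    issue_types: List[str] = field(default_factory=list)
    offense_category: Optional[str] = None
    relief_type: Optional[str] = None
    author: Optional[str] = None
    panel: List[str] = field(default_factory=list)
    # Fields the extractor filled in but flagged as uncertain.
    uncertain_fields: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of scalar fields left unset, in declaration order."""
        return [name for name, value in vars(self).items() if value is None]


record = DecisionRecord(
    source_pdf="example.pdf",
    posture="direct appeal",
    uncertain_fields=["posture"],
)
print(record.missing_fields())
```

Keeping unset values as `None` rather than guessing a default means missingness stays visible to whatever consumes the record.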

The goal is to preserve enough legal structure that downstream analysis can ask institutional questions rather than merely count wins and losses.

Review controls

The Seventh Circuit pipeline is organized around traceability:

  • canonical PDFs become structured records;
  • extraction outputs are schema constrained;
  • decision-level tables are separated from issue, offense, statute, judge, and separate-opinion tables;
  • figures have backing CSVs;
  • quality reports preserve missingness and denominator transitions.
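The "schema constrained" control above can be sketched as a check that rejects model output which is malformed, missing required keys, or outside a fixed vocabulary. The required fields and allowed values below are placeholders chosen for illustration, not the pipeline's real schema.

```python
import json
from typing import List

# Hypothetical vocabularies; the real schema may differ.
ALLOWED = {
    "publication_status": {"published", "unpublished"},
    "relief_type": {"affirmed", "reversed", "vacated", "remanded", "dismissed"},
}
REQUIRED = {"source_pdf", "publication_status", "relief_type"}


def validate_extraction(raw_json: str) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = []
    for key in sorted(REQUIRED - set(record)):
        problems.append(f"missing required field: {key}")
    for key in sorted(set(record) - REQUIRED - set(ALLOWED)):
        problems.append(f"unexpected field: {key}")
    for key in sorted(ALLOWED):
        value = record.get(key)
        if value is None:
            continue  # absence is handled by the required-field check
        if not isinstance(value, str) or value not in ALLOWED[key]:
            problems.append(f"{key} has out-of-vocabulary value: {value!r}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a review table show everything wrong with a record at once.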

That structure makes the empirical claim easier to challenge, and being open to challenge is the standard. If a result cannot be traced back to a source and a denominator, it is not ready to carry legal weight.
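A denominator transition can be recorded as a running count of how many records survive each filter. The following is a minimal sketch under assumed names (`denominator_report`, the step labels, and the record keys are all hypothetical), not the pipeline's own quality report.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Step = Tuple[str, Callable[[Record], bool]]


def denominator_report(records: List[Record], steps: List[Step]) -> List[Tuple[str, int, int]]:
    """Apply filters in order; return (label, kept, dropped) for each stage."""
    rows = [("all records", len(records), 0)]
    current = list(records)
    for label, keep in steps:
        kept = [r for r in current if keep(r)]
        rows.append((label, len(kept), len(current) - len(kept)))
        current = kept
    return rows


# Toy records, invented only to exercise the function.
toy = [
    {"publication_status": "published", "relief_type": "affirmed"},
    {"publication_status": None, "relief_type": "affirmed"},
    {"publication_status": "unpublished", "relief_type": None},
]
steps = [
    ("publication status known", lambda r: r["publication_status"] is not None),
    ("relief type known", lambda r: r["relief_type"] is not None),
]
for label, kept, dropped in denominator_report(toy, steps):
    print(f"{label}: kept={kept}, dropped={dropped}")
```

Reporting the dropped count at every stage is what lets a reader see which filter shrank the denominator behind a given figure.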

opinion_pdf
-> schema_constrained_extraction
-> review_table + long_tables
-> quality_report
-> model_input
-> figure_with_backing_csv
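The last stage of the flow above implies a simple invariant: every figure file should have a sibling CSV. A hedged sketch of that check follows; the directory layout, the same-stem naming convention, and the function name are assumptions for illustration.

```python
from pathlib import Path
from typing import List, Tuple


def figures_without_backing_csv(
    figure_dir: str,
    figure_suffixes: Tuple[str, ...] = (".png", ".pdf", ".svg"),
) -> List[str]:
    """List figure files that have no same-stem .csv beside them."""
    missing = []
    for path in sorted(Path(figure_dir).iterdir()):
        if path.suffix.lower() not in figure_suffixes:
            continue
        # Assumed convention: rates.png is backed by rates.csv.
        if not path.with_suffix(".csv").exists():
            missing.append(path.name)
    return missing
```

A check like this could run before publication, so a figure whose numbers cannot be inspected is caught early.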

Research use

The thesis result is narrower than a generic "AI found bias" claim: visible appellate sentencing disparity changes once publication track and case composition are separated. That kind of claim only becomes coherent when the pipeline keeps doctrine, procedure, and institutional structure in the data.
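The arithmetic behind "separating publication track" can be shown with a pooled rate versus a within-track rate. The counts below are invented solely to demonstrate the calculation and are not results from the thesis; `group`, `track`, and `relief_granted` are hypothetical column names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rate_by(records: List[dict], keys: List[str], outcome: str = "relief_granted") -> Dict[Tuple, float]:
    """Share of records with a truthy outcome, grouped by the given keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [granted, total]
    for r in records:
        group = tuple(r[name] for name in keys)
        totals[group][0] += 1 if r[outcome] else 0
        totals[group][1] += 1
    return {g: granted / total for g, (granted, total) in totals.items()}


def make(group: str, track: str, n: int, granted: int) -> List[dict]:
    """Build n toy records, the first `granted` of which received relief."""
    return [
        {"group": group, "track": track, "relief_granted": i < granted}
        for i in range(n)
    ]


# Invented counts: both groups behave identically within each track,
# but they are distributed differently across tracks.
toy_cases = (
    make("A", "published", 8, 4) + make("A", "unpublished", 2, 0)
    + make("B", "published", 2, 1) + make("B", "unpublished", 8, 0)
)

pooled = rate_by(toy_cases, ["group"])
by_track = rate_by(toy_cases, ["group", "track"])
print(pooled)
print(by_track)
```

In this toy setup the pooled rates differ between groups while the within-track rates match, which is the pattern that makes composition worth separating before interpreting a visible gap.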