Applied AI Research for Legal Practitioners

Joshua Escoto | LegalEngineering.AI | Recommended Books

LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs

Serene Wang, Lavanya Pobbathi, Haihua Chen
March 9, 2026
arXiv | PDF


LAMUS (Legal Argument Mining from U.S. Caselaw) is a large-scale, sentence-level dataset built from U.S. Supreme Court decisions and Texas criminal appellate opinions, designed to train and benchmark models that identify the functional structure of judicial reasoning. The paper frames legal argument mining as a sentence classification task, categorizing each sentence by its functional role (fact, issue, rule, analysis, or conclusion), and introduces a scalable pipeline for building such datasets using LLM-based annotation with human-in-the-loop quality control. The core contribution is methodological: rather than relying entirely on expensive human annotation or entirely on noisy LLM labels, LAMUS combines both. LLMs do the heavy lifting on annotation, and human reviewers focus their effort on correcting the cases where the LLMs are most likely to be wrong.



Legal argument mining is the task of automatically identifying and classifying the functional components of legal reasoning in text. It draws on the IRAC framework familiar to law students, extended with a Fact category: Issue, the legal question the court must resolve; Rule, the legal standard or statute governing the issue; Analysis (or Application), the court's reasoning applying the rule to the facts; Conclusion, the court's holding or outcome; and Fact, the underlying factual record.
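The label taxonomy can be made concrete with one illustrative sentence per role. A minimal sketch; the example sentences and label names are my own, not drawn from the LAMUS schema:

```python
# Sentence-level label set implied by the IRAC framework plus a Fact
# category. Example sentences are invented for illustration.
LABELS = ["Fact", "Issue", "Rule", "Analysis", "Conclusion"]

examples = {
    "The defendant was arrested on June 3 after a traffic stop.": "Fact",
    "The question is whether the search violated the Fourth Amendment.": "Issue",
    "A warrantless search is per se unreasonable absent an exception.": "Rule",
    "Here, the officer articulated no suspicion of criminal activity.": "Analysis",
    "Accordingly, we reverse the judgment of the trial court.": "Conclusion",
}
```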

Chain-of-Thought (CoT) prompting asks the model to reason through the classification step by step before giving the answer. Example: “First, identify what this sentence is doing in the context of the opinion. Then determine which functional role it plays.” Research consistently shows CoT improves accuracy on tasks requiring structured reasoning — this paper confirms that effect in the legal domain.
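In practice, a CoT classification prompt looks something like the following sketch. The wording is an assumption modeled on the example above, not the paper's actual prompt:

```python
def build_cot_prompt(sentence: str) -> str:
    """Build an illustrative chain-of-thought prompt for one sentence."""
    return (
        "Classify the following sentence from a judicial opinion as one of: "
        "Fact, Issue, Rule, Analysis, Conclusion.\n"
        "First, identify what this sentence is doing in the context of the "
        "opinion. Then determine which functional role it plays, and give "
        "the label alone on the final line.\n\n"
        f"Sentence: {sentence}\n"
        "Reasoning:"
    )

prompt = build_cot_prompt("Accordingly, we affirm.")
```

Ending the prompt with "Reasoning:" invites the model to write its step-by-step analysis before committing to a label, which is the mechanism CoT relies on.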

Few-Shot Prompting provides 2-5 examples of correct (sentence → label) pairs in the prompt before asking the LLM to label a new sentence. Improves accuracy significantly over zero-shot but is more expensive (more tokens) and requires selecting representative examples.
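Few-shot prompt construction amounts to prepending labeled pairs before the query. A minimal sketch with invented example pairs:

```python
# Illustrative (sentence, label) demonstrations; in a real pipeline these
# would be chosen to cover the label set representatively.
FEW_SHOT_EXAMPLES = [
    ("The officer stopped the vehicle at 11:40 p.m.", "Fact"),
    ("We must decide whether the statute applies retroactively.", "Issue"),
    ("A warrantless search is per se unreasonable absent an exception.", "Rule"),
]

def build_few_shot_prompt(sentence: str) -> str:
    """Prepend labeled demonstrations, then ask for the new sentence's label."""
    shots = "\n".join(f"Sentence: {s}\nLabel: {l}" for s, l in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nSentence: {sentence}\nLabel:"
```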

Human-in-the-loop (HITL) Annotation is a hybrid annotation methodology where automated tools (LLMs, classifiers) do an initial pass, and human reviewers focus their effort on correcting low-confidence or flagged outputs rather than reviewing everything. Balances cost efficiency with annotation quality. Standard in modern NLP dataset construction.
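The routing step at the core of HITL annotation can be sketched as a confidence threshold: accept high-confidence LLM labels automatically and queue the rest for human review. The 0.9 threshold and the record layout are assumptions, not LAMUS's actual pipeline:

```python
def route(annotations, threshold=0.9):
    """Split LLM annotations into auto-accepted vs. needs-human-review."""
    auto, needs_review = [], []
    for item in annotations:
        if item["confidence"] >= threshold:
            auto.append(item)       # trusted LLM label
        else:
            needs_review.append(item)  # sent to a human reviewer
    return auto, needs_review

batch = [
    {"sentence": "We affirm.", "label": "Conclusion", "confidence": 0.98},
    {"sentence": "The record suggests...", "label": "Analysis", "confidence": 0.55},
]
accepted, flagged = route(batch)
```

The design point is that human effort scales with the flagged fraction, not the corpus size, which is what makes large-scale annotation affordable.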

Cohen’s Kappa is a statistical measure of inter-annotator agreement that accounts for chance agreement. Ranges from -1 to 1; values above 0.8 indicate strong agreement. The gold standard for evaluating annotation reliability in NLP. A Kappa of 0.85 is excellent for a complex multi-class legal task.
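The statistic is straightforward to compute from two annotators' label sequences: observed agreement minus chance agreement, normalized by the maximum possible agreement above chance. A minimal sketch, not the paper's evaluation code:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, two annotators agreeing on 3 of 4 labels with the distributions below yield a kappa well under the raw 0.75 agreement, because chance agreement is subtracted out.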