Applied AI Research for Legal Practitioners

Joshua Escoto | LegalEngineering.AI | Recommended Books

  • How to Build a Local Legal RAG Pipeline with ChromaDB and Voyage AI

    Over the course of my career, I have saved a large body of cases that support my specific practice areas. Quickly finding supportive cases by keyword search is challenging, and creating a detailed topic index is very time consuming. A local RAG pipeline solves both problems: it lets you quickly search a large body of cases with plain-language queries.

    This post describes how to build a local RAG pipeline for legal case retrieval: domain-specific embeddings via Voyage AI’s voyage-law-2 model, persistent vector storage in ChromaDB, and a query interface that returns ranked case excerpts for downstream LLM analysis.

    For a library of over 300 legal cases I stayed within the Voyage AI free tier, and the cost of continued use is nominal. There are other potential legal-industry uses, including document clause search and deposition retrieval. Depending on the use case, you may need to modify your chunking strategy for paragraph granularity (example code provided below).

    Architecture

    cases_text/          ← raw .txt case files (one per case)
    cases_db/            ← ChromaDB persistent store (auto-created)
    cases_build_index.py ← one-time indexing script
    cases_query.py       ← query script, called at runtime
    .env                 ← VOYAGE_API_KEY (never commit)
    

    The benefit of local vector storage is that nothing is sent to a third-party vector store.

    Voyage AI’s voyage-law-2 model is fine-tuned on legal corpora. You will need to obtain an API key from the Voyage AI website and save it as VOYAGE_API_KEY in a .env file in the project root. This produces better retrieval for legal terminology (e.g., “consideration”) where general-purpose embeddings conflate legal and lay meanings.
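    The .env file is a single line. The key value below is a placeholder, not a real key:

```shell
# .env in the project root: keep out of version control (add to .gitignore)
VOYAGE_API_KEY=your-voyage-api-key-here
```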

    Dependencies

    pip install chromadb voyageai python-dotenv
    
    Package         Role
    chromadb        Local persistent vector database
    voyageai        Legal-domain embedding model API
    python-dotenv   Loads API key from .env without hardcoding

    Step 1: Prepare Case Files

    Each case is a plain .txt file. Filename = case identifier in retrieval results.

    Smith-v-Jones-2023.txt
    Doe-v-Acme-Corp-2021.txt
    Johnson-v-State-2019.txt
    

    Step 2: Build the Index

    Runs once. Reads every .txt file, generates embeddings via voyage-law-2, and upserts into ChromaDB. Re-running is safe: upsert overwrites entries with matching IDs rather than duplicating them. Note that each run re-embeds every file, so re-run only when the corpus changes.

    import os
    import chromadb
    import voyageai
    from dotenv import load_dotenv
    
    load_dotenv()
    
    client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))
    db = chromadb.PersistentClient(path="./cases_db")
    collection = db.get_or_create_collection("my_cases")
    
    cases_folder = "./cases_text"
    
    for filename in os.listdir(cases_folder):
        if not filename.endswith(".txt"):
            continue
    
        filepath = os.path.join(cases_folder, filename)
        with open(filepath, "r", errors="ignore") as f:
            text = f.read().strip()
    
        if not text:
        print(f"Skipping empty file: {filename}")
            continue
    
        # voyage-law-2 with input_type="document" optimizes embeddings for storage/retrieval
        result = client.embed([text], model="voyage-law-2", input_type="document")
        vector = result.embeddings[0]
    
        case_id = filename.replace(".txt", "")
        collection.upsert(
            ids=[case_id],
            embeddings=[vector],
            documents=[text],
            metadatas=[{"filename": filename, "case": case_id}]
        )
    print(f"Indexed: {case_id}")
    
    print(f"Done. {collection.count()} cases in index.")
    

    Voyage AI uses asymmetric embeddings: documents and queries are embedded differently to optimize similarity at retrieval time. Always use input_type="document" when indexing and input_type="query" when embedding search terms.

    ChromaDB persistence: PersistentClient writes to disk. The ./cases_db directory persists between runs, so the query script never re-embeds your documents; you pay for document embeddings only when indexing.

    Step 3: Query the Index

    At query time, embed the question with input_type="query" and retrieve the top-N nearest neighbors by cosine similarity.

    import os
    import sys
    import chromadb
    import voyageai
    from dotenv import load_dotenv
    
    load_dotenv()
    
    question = " ".join(sys.argv[1:])
    if not question:
        print("Usage: python cases_query.py 'your question here'")
        sys.exit(1)
    
    client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))
    db = chromadb.PersistentClient(path="./cases_db")
    collection = db.get_collection("my_cases")
    
    # Asymmetric query embedding
    result = client.embed([question], model="voyage-law-2", input_type="query")
    query_vector = result.embeddings[0]
    
    # n_results controls how many cases are returned — tune based on context window
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=4
    )
    
    for i, (doc, meta) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
        print(f"\n--- CASE {i+1}: {meta['case']} ---\n")
        print(doc[:3000])  # truncate to manage LLM context length
    

    Tuning n_results: more results means more context for the LLM but higher token cost. For models with large context windows (Claude, GPT-4o), 4-6 cases at 3,000 characters each is a good starting point. Increase if your cases are short; decrease if they’re full-text opinions.

    Truncation trade-off: Cutting at 3000 chars risks missing holding language buried deep in an opinion. For production use, consider chunking each case into overlapping segments at index time rather than truncating at query time.
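    The ranked excerpts are intended for downstream LLM analysis. As a sketch of that hand-off, the query results can be packed into a single prompt string. The build_prompt helper below is illustrative (not part of the scripts above), assumes the results shape returned by collection.query, and leaves the actual LLM call out:

```python
def build_prompt(question, results, max_chars=3000):
    """Assemble retrieved case excerpts into one prompt for an LLM."""
    parts = [f"Question: {question}\n\nRelevant cases:"]
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        parts.append(f"\n--- {meta['case']} ---\n{doc[:max_chars]}")
    parts.append("\nAnswer using only the cases above, citing each case by name.")
    return "\n".join(parts)
```

    Paste the resulting string into the LLM of your choice, or send it through an API client.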

    Chunking Strategy (Optional)

    Depending on the use case, chunk the legal text into overlapping segments at index time so retrieval operates at paragraph granularity rather than full-document granularity. This increases index size but improves precision.

    def chunk_text(text, chunk_size=1000, overlap=200):
        """Split text into overlapping chunks for finer-grained retrieval."""
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunks.append(text[start:end])
            start += chunk_size - overlap
        return chunks
    
    # In the indexing loop, replace single upsert with:
    chunks = chunk_text(text)
    for j, chunk in enumerate(chunks):
        result = client.embed([chunk], model="voyage-law-2", input_type="document")
        vector = result.embeddings[0]
        collection.upsert(
            ids=[f"{case_id}_chunk_{j}"],
            embeddings=[vector],
            documents=[chunk],
            metadatas=[{"filename": filename, "case": case_id, "chunk": j}]
        )
    
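    With chunk-level indexing, several of the top query hits may be chunks of the same case. A small helper (hypothetical, assuming the same results shape returned by collection.query in cases_query.py) that groups chunk hits back to their parent case while preserving rank order:

```python
def group_by_case(results):
    """Collapse chunk-level hits into one entry per parent case."""
    grouped = {}
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        grouped.setdefault(meta["case"], []).append(doc)
    return grouped  # {case_id: [chunk texts, best-ranked first]}
```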

    Send me a note on LinkedIn if you come up with any new use cases.

  • Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

    Ryan C. Barron, Maksim E. Eren, Olga M. Serafimova, Cynthia Matuszek, Boian S. Alexandrov

    May 9, 2025 (v2)

    arXiv | PDF


    The paper presents a jurisdiction-specific legal AI system that combines Retrieval-Augmented Generation (RAG), Vector Stores (VS), Knowledge Graphs (KG), and Hierarchical Non-Negative Matrix Factorization (HNMFk) to improve legal information retrieval and reduce LLM hallucinations. The system was built and tested on New Mexico’s legal corpus: 265 constitutional provisions, 28,251 statutory sections, 5,727 Supreme Court cases, and 10,072 Court of Appeals cases, all scraped from Justia.

    The core innovation is using HNMFk (via the T-ELF library) to automatically discover latent topic clusters within legal documents and then integrating those topics into a Neo4j knowledge graph alongside citation links and metadata. When a user asks a legal question, the system performs both semantic vector search and knowledge graph traversal, then feeds the combined results to an LLM for grounded, citation-backed answers. In evaluations against GPT-4o, Claude 3 Opus, Gemini Pro, and Nemotron-70B, the system provided more accurate, reproducible, and citation-specific answers — particularly for quantitative queries (e.g., counting cases mentioning “habeas corpus”) and citation pattern queries where general-purpose LLMs either refused to answer or hallucinated fake case names.

    Retrieval-Augmented Generation (RAG) is a technique where an LLM doesn’t rely solely on its training data to answer questions. Instead, it first retrieves relevant documents from an external database, then generates an answer grounded in those documents. This reduces hallucinations because the model cites real sources rather than guessing. Think of it like an open-book exam versus a closed-book exam.

    Vector Store (VS) / Vector Database is a database that stores text as high-dimensional numerical vectors (embeddings) rather than raw strings. When you search, your query is also converted to a vector, and the database finds the most semantically similar documents using distance metrics (cosine similarity). This means “due process violations” can match documents about “constitutional rights infringements” even without shared keywords. The paper uses Milvus as its vector database and OpenAI’s text-embedding-ada-002 for generating embeddings.

    Knowledge Graph (KG) is a structured database where information is stored as nodes (entities) connected by edges (relationships), forming triplets like (Case A) –[cites]–> (Statute B). Unlike vector stores that find similar documents, knowledge graphs can traverse explicit relationships — e.g., “find all cases that cite this statute and were decided after 2010.” The paper uses Neo4j, the most widely-used graph database.

    Non-Negative Matrix Factorization (NMF) is a dimensionality reduction technique that decomposes a large matrix into two smaller matrices where all values are non-negative (zero or positive). For text, you start with a TF-IDF matrix (documents x terms) and decompose it into: (1) a topics x terms matrix (what words define each topic) and (2) a documents x topics matrix (which topics each document belongs to). The non-negativity constraint makes results interpretable — each topic is an additive combination of words, not a mix of positive and negative weights.

    TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document within a corpus. Words that appear frequently in one document but rarely across all documents get high scores. “Estoppel” appearing in 50 of 10,000 cases would score high; “the” appearing everywhere scores near zero. It’s the input matrix that NMF decomposes.
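    As a toy illustration of the decomposition (this is the classic multiplicative-update NMF in NumPy, not the paper’s T-ELF implementation), with a small stand-in for the documents x terms TF-IDF matrix:

```python
import numpy as np

# Stand-in for a TF-IDF matrix: 4 documents x 6 terms, all non-negative.
A = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1, 0.0],   # docs 1-2 weight "topic 1" terms
    [0.7, 0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.9, 0.0, 0.1],   # docs 3-4 weight "topic 2" terms
    [0.1, 0.0, 0.9, 0.7, 0.0, 0.0],
])

def nmf(A, k, iters=500, seed=0):
    """Factor A ~ W @ H with W, H >= 0 via multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)   # update topics x terms
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)   # update docs x topics
    return W, H

W, H = nmf(A, k=2)  # W: topic mix per document; H: defining terms per topic
```

    Because every entry of W and H stays non-negative, each row of H reads directly as a weighted bag of topic-defining terms.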

    Hierarchical NMF with Automatic Model Selection (HNMFk) is an extension of NMF that: (1) automatically determines the optimal number of topics k using bootstrap resampling and silhouette scores (rather than requiring you to guess), and (2) applies NMF recursively — first finding broad topics, then decomposing each into subtopics, creating a tree structure. The paper uses the T-ELF library (Tensor Extraction of Latent Features) developed at Los Alamos National Laboratory. Maximum decomposition depth was set to 2, with a minimum of 100 documents per cluster to continue decomposing.

    Silhouette Score is a metric measuring how well-separated clusters are. Ranges from -1 to 1: high values mean data points are well-matched to their own cluster and poorly matched to neighboring clusters. Used here to determine the optimal number of topics at each level of the hierarchy.

    Neo4j is a graph database that stores data as nodes and relationships natively (not in tables). Cypher is its query language, similar to how SQL is for relational databases. Example: MATCH (c:Case)-[:CITES]->(s:Statute) WHERE s.id = 'NMSA 41-5-1' RETURN c finds all cases citing a specific statute.

    ROUGE-L is a metric for evaluating text summaries by measuring the longest common subsequence (LCS) between generated and reference text. Higher ROUGE-L means the generated text preserves more of the reference’s sentence structure. Used here to evaluate AI-generated legal answers against expert reference answers.
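    A bare-bones sketch of the ROUGE-L F-score via the LCS dynamic program (whitespace tokenization; real evaluations typically use a library such as rouge-score):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(generated, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    g, r = generated.split(), reference.split()
    lcs = lcs_len(g, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(g), lcs / len(r)
    return 2 * p * rec / (p + rec)
```

    For example, "the court granted the motion" against the reference "the court denied the motion" shares the subsequence "the court the motion" (length 4 of 5 tokens each side), giving an F1 of 0.8.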

    NLI (Natural Language Inference) Entailment is a classification task where a model determines whether one text (the hypothesis) logically follows from another (the premise). Labels: entailment (follows), contradiction (conflicts), or neutral (unrelated). Used here to check if AI-generated answers are logically supported by reference legal texts.

    FactCC and SummaC are evaluation metrics for factual consistency. FactCC fine-tunes a model on labeled correct/incorrect summaries to detect factual errors. SummaC aggregates entailment scores across sentence pairs. Both check whether generated text stays faithful to source documents, which is critical in legal contexts where a fabricated citation could lead to sanctions.

    LEGAL-BERT is a version of BERT pre-trained on legal text (court opinions, legislation, contracts) rather than general web text. It better understands legal language nuances like “estoppel,” “res judicata,” and “habeas corpus.” Referenced as a baseline for domain-specific embeddings.

    LexGLUE is a benchmark dataset for legal NLP with 7 tasks spanning contracts, court opinions, and legislation. Referenced as one of the standard evaluation frameworks for legal AI systems.

  • LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs

    Serene Wang, Lavanya Pobbathi, Haihua Chen
    March 9, 2026
    arXiv | PDF


    LAMUS (Legal Argument Mining from U.S. Caselaw) is a large-scale, sentence-level dataset built from U.S. Supreme Court decisions and Texas criminal appellate opinions, designed to train and benchmark models that identify the functional structure of judicial reasoning. The paper frames legal argument mining as a six-class sentence classification task built around the functional roles of fact, issue, rule, analysis, and conclusion, and introduces a scalable pipeline for building such datasets using LLM-based annotation with human-in-the-loop quality control. The core contribution is methodological: rather than relying entirely on expensive human annotation or entirely on noisy LLM labels, LAMUS combines both. LLMs do the heavy lifting on annotation, and human reviewers focus their effort on correcting the cases where LLMs are most likely to be wrong.



    Legal argument mining is the task of automatically identifying and classifying the functional components of legal reasoning in text. It draws on the IRAC framework familiar to law students: Issue — the legal question the court must resolve; Rule — the legal standard or statute governing the issue; Analysis (or Application) — the court’s reasoning applying the rule to the facts; Conclusion — the court’s holding or outcome; Fact — the underlying factual record.

    Chain-of-Thought (CoT) prompting asks the model to reason through the classification step by step before giving the answer. Example: “First, identify what this sentence is doing in the context of the opinion. Then determine which functional role it plays.” Research consistently shows CoT improves accuracy on tasks requiring structured reasoning — this paper confirms that effect in the legal domain.

    Few-Shot Prompting provides 2-5 examples of correct (sentence → label) pairs in the prompt before asking the LLM to label a new sentence. Improves accuracy significantly over zero-shot but is more expensive (more tokens) and requires selecting representative examples.

    Human-in-the-loop (HITL) Annotation is a hybrid annotation methodology where automated tools (LLMs, classifiers) do an initial pass, and human reviewers focus their effort on correcting low-confidence or flagged outputs rather than reviewing everything. Balances cost efficiency with annotation quality. Standard in modern NLP dataset construction.

    Cohen’s Kappa is a statistical measure of inter-annotator agreement that accounts for chance agreement. Ranges from -1 to 1; values above 0.8 indicate strong agreement. The gold standard for evaluating annotation reliability in NLP. A Kappa of 0.85 is excellent for a complex multi-class legal task.
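    For intuition, Cohen’s kappa can be computed directly from two annotators’ label lists. A pure-Python sketch (sklearn.metrics.cohen_kappa_score gives the same result):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)  # agreement expected by chance
    return (po - pe) / (1 - pe)
```

    On four sentences where the annotators agree three times, raw agreement is 0.75, but kappa discounts the agreement that label frequencies alone would produce.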

  • Capturing Legal Reasoning Paths from Facts to Law in Court Judgments using Knowledge Graphs

    Ryoma Kondo, Riona Matsuoka, Takahiro Yoshida, Kazuyuki Yamasawa, Ryohei Hisano
    August 24, 2025
    PDF

    The paper tackles legal reasoning by building a knowledge graph from 648 Japanese administrative court decisions that makes the hidden reasoning path machine-readable. The system uses large language models to extract the key components of legal reasoning: factual findings, legal provisions cited, and how the court applied those provisions to the facts, and connects them through a purpose-built legal ontology. The result is a structured graph where you can trace the logical steps from a fact to the legal norm it triggers to the outcome it produces. In retrieval tests, the system outperformed standard LLM baselines at finding the correct legal provisions given a set of facts, meaning the knowledge graph adds genuine precision beyond what a general-purpose AI can achieve alone.

    Knowledge Graph (KG) is a database that stores information as a network of entities and relationships rather than rows and columns. In a legal context, entities might be facts, court decisions, legal provisions, parties, and the relationships between them capture how they connect (e.g., “Fact A triggers Provision B which leads to Outcome C”). Knowledge graphs make implicit relationships explicit and queryable.

    Legal Reasoning Path is the structured logical chain a court follows from factual findings to a legal conclusion: facts → applicable legal norm → application of the norm to the facts → decision. In most court opinions this path is written as prose and must be inferred by a human reader. This paper’s core contribution is extracting and storing these paths as structured data.

    Ontology is a formal specification of concepts and relationships within a domain — essentially a vocabulary with rules. A legal ontology defines what entities exist in legal reasoning (facts, norms, parties, outcomes) and how they can relate to each other. It constrains the knowledge graph so that extracted information follows a consistent structure across all cases.

    Expert Annotation is created by having human domain experts (legal professionals) manually label examples to create a “gold standard” dataset for evaluating the system’s accuracy. The annotated examples serve as the benchmark. If the system’s extracted reasoning paths match what the experts identified, the system is considered accurate.

  • The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making

    Jon Chun, Katherine Elkins (Kenyon College)
    January 30, 2026
    arXiv | PDF

    The paper investigates whether emotional framing, the kind of persuasive, sympathetic narrative that reliably biases human decision-makers, can also sway LLMs when they’re applied to rule-bound institutional decisions like grade appeals, loan underwriting, and emergency triage. The surprising answer is no: across 12,113 responses from six different models, emotional narratives produced essentially zero decision drift (Cohen’s h = 0.003), while the same types of framing effects cause substantial bias in humans (Cohen’s h = 0.3–0.8).

    The “paradox” is that LLMs are known to be lexically brittle (sensitive to how a prompt is formatted) and prone to sycophancy, yet they are rationally stable when it comes to rule-based decisions. They resist emotional manipulation 110–300x better than humans. This decoupling between surface-level prompt sensitivity and deep logical consistency is counterintuitive and has significant implications for deploying AI in high-stakes institutional settings.

    Cohen’s h (Effect Size) is a statistical measure of the difference between two proportions. Values near 0 mean no practical difference; 0.2 is “small,” 0.5 is “medium,” 0.8 is “large.” The paper uses Cohen’s h to compare decision rates between emotional and neutral conditions. The LLM value of 0.003 is essentially zero.
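    Cohen’s h is simple to compute: each proportion is arcsine-transformed and the absolute difference taken. A quick sketch:

```python
import math

def cohens_h(p1, p2):
    """Effect size for the difference between two proportions."""
    phi = lambda p: 2 * math.asin(math.sqrt(p))   # arcsine transform
    return abs(phi(p1) - phi(p2))
```

    Identical decision rates give h = 0, while e.g. rates of 0.20 vs. 0.50 give h of roughly 0.64, a medium-to-large effect on Cohen’s scale.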

    Bayes Factor (BF₀₁) is a Bayesian statistic that quantifies evidence for the null hypothesis (no effect) vs. the alternative (some effect). BF₀₁ = 109 means the data is 109 times more likely under “no effect” than under “some effect”. Conventionally, anything above 100 is “extreme evidence.”

    Framing Effects is a well-documented cognitive bias where the way information is presented (e.g., sympathetic backstory, emotional language) changes human decisions even when the underlying facts are identical. This is a core concern in behavioral economics and legal decision-making.

    RLHF (Reinforcement Learning from Human Feedback) is the dominant fine-tuning method for instruction-following LLMs. Human raters rank model outputs, and the model is trained to prefer higher-ranked responses. Used by GPT, Llama, and Mistral families.

    Constitutional AI is Anthropic’s training approach (used for Claude) where the model self-critiques against a set of principles rather than relying solely on human raters. The paper tests whether this different alignment approach produces different robustness characteristics (it doesn’t).

    Decision Drift is the change in a model’s decision rate when exposed to emotional framing vs. a neutral control. A drift of 0% means the model’s decisions are identical regardless of framing.

    Instruction Ablation is an experimental technique where instructions are systematically removed to test what drives a behavior. Here, removing “ignore the narrative” instructions showed that robustness isn’t dependent on explicit guardrails.

  • Robust Generalizable Heterogeneous Legal Link Prediction

    Lorenz Wendlinger, Simon Alexander Nonn, Abdullah Al Zubaer, Michael Granitzer
    2602.04812v1 | PDF
    February 4, 2026


    The paper improves legal citation link prediction using Graph Neural Networks (GNNs). The authors introduce R-HGE (Robust Heterogeneous Graph Enrichment), which predicts missing citations between legal cases and laws more accurately than previous methods.

    Graph Neural Networks (GNNs) are deep learning models that operate on graph-structured data (nodes + edges) by iteratively passing messages between connected nodes to learn representations. After multiple rounds of neighborhood aggregation, each node captures information from its surrounding structure, enabling tasks like node classification, link prediction, and graph-level classification.                                         
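    One round of the message passing described above can be sketched as mean-aggregation over neighbors (a toy NumPy version; real GNNs add learned weight matrices and nonlinearities between rounds):

```python
import numpy as np

# Adjacency for a 4-node graph: node 0 links to 1 and 2, node 3 to 2.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
h = np.array([[1.0], [3.0], [5.0], [7.0]])   # one feature per node

deg = A.sum(axis=1, keepdims=True)
h_new = (A @ h) / deg   # each node's new feature = mean of its neighbors'
```

    Stacking several such rounds lets each node’s representation absorb information from progressively larger neighborhoods.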

    Robust Heterogeneous Graph Enrichment extends basic GNNs to handle real-world graphs that have multiple node/edge types (heterogeneous), missing information (enrichment fills gaps with external data or inferred connections), and noise or incompleteness (robustness). It’s particularly relevant for domains like legal AI where knowledge graphs naturally contain diverse entity types, incomplete relationships, and messy data.

  • Constrained Process Maps for Multi-Agent Generative AI Workflows

    Ananya Joshi, Michael Rudow

    February 2, 2026

    arXiv | PDF


    The paper introduces a multi-agent framework for running complex, regulated workflows (like compliance review) using LLM-based agents, formalized as a bounded-horizon Markov Decision Process (MDP) constrained by a directed acyclic graph (DAG). Instead of relying on a single LLM agent with a long prompt to handle an entire compliance process, the system maps each step of an existing Standard Operating Procedure (SOP) to a specialized agent node — for example, a content reviewer, a triage router, a risk assessor, and a legal compliance checker. Uncertain cases are escalated along predefined paths, just like they would be in a real compliance team, and the system uses Monte Carlo sampling to quantify each agent’s confidence without needing access to the LLM’s internal probabilities.

    Tested on NVIDIA’s AEGIS 2.0 AI Safety Benchmark for self-harm detection, the multi-agent framework achieved up to a 19% accuracy improvement over a single-agent baseline (88% vs. 70%), reduced the number of cases requiring human review by up to 85x (from ~17 to ~0.2 per run), and in the fastest configuration ran 30% quicker. The framework also caught annotation errors in the benchmark dataset itself, demonstrating its practical value for auditable, high-stakes AI deployments.

    Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making where outcomes are partly random and partly under the control of a decision-maker. An MDP consists of: states (where you are), actions (what you can do), transition probabilities (how likely each next state is given current state and action), and rewards (feedback on how good each action was). In the paper, each state is an agent in the compliance workflow, actions are classification labels (safe/unsafe/uncertain), and transitions follow the escalation paths in the process map. The “bounded-horizon” part means every case must resolve within a fixed maximum number of steps.

    Directed Acyclic Graph (DAG) is a graph structure where edges have direction (A -> B means A comes before B) and there are no cycles (you can never loop back to a previous node). In the paper, the DAG represents the compliance process map: Worker -> Triage -> Risk/Legal -> Labeled Data or Human Review. The DAG constraint guarantees that every case terminates — no infinite loops are possible, which is a known problem in some multi-agent LLM architectures like LangChain.
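    The termination guarantee is easy to check mechanically: a topological sort succeeds if and only if the graph is acyclic. A sketch using Python’s stdlib graphlib with an illustrative process map (node names are assumptions, loosely following the escalation paths described above):

```python
from graphlib import TopologicalSorter, CycleError

# graphlib maps each node to its *predecessors*; edges point downstream.
process_map = {
    "Triage": {"Worker"},
    "Risk": {"Triage"},
    "Legal": {"Triage"},
    "HumanReview": {"Risk", "Legal"},
}

# static_order() raises CycleError if any loop exists, otherwise every
# case can be processed in this order and is guaranteed to terminate.
order = list(TopologicalSorter(process_map).static_order())
```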

    AEGIS 2.0 AI Safety Benchmark is a dataset released by NVIDIA (August 2025) for evaluating AI safety guardrails. The subset used in the paper covers suicide and self-harm topics, with 112 examples (68 safe, 44 unsafe) derived from the Suicide Detection corpus. Each example consists of a user prompt and an LLM response that needs to be classified as safe or unsafe.

  • The Limits of AI Data Transparency Policy: Three Disclosure Fallacies

    Judy Hanwen Shen, Ken Liu, Angelina Wang, Sarah H. Cen, Andy K. Zhang, Caroline Meinhardt, Daniel Zhang, Kevin Klyman, Rishi Bommasani, Daniel E. Ho (Stanford University)
    January 26, 2026
    arXiv | PDF


    The paper from Stanford argues that current transparency policies are largely symbolic: they suffer from three fundamental gaps that prevent them from actually achieving their stated goals of protecting privacy, copyright, and data quality. The three fallacies are: (1) a specification gap; (2) an enforcement gap; and (3) an impact gap. The paper offers a taxonomy of disclosure levels, maps each transparency objective to what’s actually needed, and proposes technical research directions and policy fixes.

    California AB 2013 is a California state law (effective 2026) requiring developers of generative AI systems to publicly post “high-level summaries” of training datasets on their websites. Covers data sources, synthetic data usage, presence of personal information, copyrighted content, and dataset statistics. It was the first U.S. law specifically mandating AI training data transparency, but the paper argues it was weakened through the legislative process from detailed requirements to vague summaries.

    EU AI Act (Regulation 2024/1689) is the European Union’s comprehensive AI regulation, which classifies AI systems by risk tier and imposes different transparency requirements for each. General-purpose AI model providers must disclose a data summary including data types and copyright status. High-risk systems (healthcare, criminal justice, employment) face stricter requirements under Article 10 for data governance practices. Unlike AB 2013, the EU AI Act assigns enforcement to the EU AI Office and imposes significant fines.

    GDPR (General Data Protection Regulation) is the EU’s data privacy law (2018) that requires organizations to inform individuals about data collection purposes (Article 13) and gives individuals rights over their personal data. Relevant here because GDPR’s data processing requirements apply to AI training data that contains personal information, but the paper notes that GDPR’s individual-level protections don’t map cleanly to the scale of LLM training.

    Membership Inference is a technical method for determining whether a specific data point was used in a model’s training set by analyzing the model’s behavior (e.g., confidence scores, loss values) on that data point. The paper identifies this as critical for copyright and privacy verification but notes it remains unreliable at scale. A model can memorize content without being able to reproduce it verbatim, and content overlap between sources makes attribution difficult.

    Data Provenance is the documented history of a piece of data, where it came from, how it was collected, what licenses apply, and how it was transformed. The paper argues that tracking provenance through the AI data supply chain (from original creators through data vendors to model developers) is essential but rarely required or practiced.

    Foundation Model Transparency Index (FMTI) is a Stanford HAI project that scores major AI model developers on 100+ transparency indicators, including 10 data-related ones. Useful for comparing company practices but doesn’t specify the intended impact of each disclosure.

    N-gram Overlap is a method for detecting text similarity by comparing sequences of N consecutive words between two texts. The paper highlights a critical limitation: courts have granted data owners permission to “inspect” training data using methods like substring search, but research shows LLMs can synthesize and reproduce content without any original n-grams, meaning n-gram-based membership tests can be trivially evaded.
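    An n-gram membership test of the kind the paper critiques can be sketched in a few lines; as the paper notes, a paraphrase evades it entirely:

```python
def ngrams(text, n=3):
    """Set of word-level n-grams in a text (lowercased, whitespace-split)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap(candidate, source, n=3):
    """Fraction of the candidate's n-grams that appear verbatim in the source."""
    c = ngrams(candidate, n)
    return len(c & ngrams(source, n)) / len(c) if c else 0.0
```

    A verbatim copy scores 1.0, while a paraphrase that shares no three-word run scores 0.0 even when the meaning is identical, which is exactly the evasion problem described above.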

  • Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

    Marton Szep, Jorge Marin Ruiz, Georgios Kaissis, Paulina Seidl, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer, Daniel Rueckert (Technical University of Munich, Imperial College London)
    2601.17480v1
    January 24, 2026


    The paper investigates a critical privacy vulnerability: LLMs can memorize and leak personally identifiable information (PII) that appears only in training inputs, not in the training targets. Even when PII is irrelevant to the downstream task, fine-tuned models can be tricked into revealing names, addresses, and other sensitive data.

  • Knowledge Graphs Construction from Criminal Court Appeals: Insights from the French Cassation Court

    Alexander V. Belikov (Growgraph, Paris), Sacha Raoult (Institut Universitaire de France, Aix-Marseille University)
    January 24, 2025
    arXiv | PDF


    The paper presents a complete, end-to-end framework for transforming unstructured French criminal court appeals into structured knowledge graphs. The authors process 2,820 appeals from the criminal chamber of the French Cassation Court (France’s Supreme Court for criminal cases) using GPT-4o mini to automatically extract entities and relationships into RDF triples. The core contribution is a domain-specific criminal law ontology developed semi-automatically through iterative interaction with LLMs (GPT-4o mini and Claude 3.5 Sonnet), which guides the extraction process and ensures consistent, structured output. The key finding is that ontology-guided RDF triple generation significantly outperforms property graph approaches — the RDF method achieved >90% accuracy (93% precision, 89% recall) compared to only 50-60% for property graph extraction. This demonstrates that providing a well-designed domain ontology in the LLM prompt is critical for reliable legal knowledge graph construction.

    A knowledge graph is a structured representation of information where nodes represent entities (people, crimes, courts, punishments) and edges represent relationships between them. Unlike flat databases or plain text, KGs capture the connections between pieces of information, making them ideal for domains like law where relationships between actors, events, and legal provisions are critical. Currently, there are two competing ways to store knowledge graphs: (a) Resource Description Framework (RDF) triples follow the format subject-predicate-object; and (b) Property Graphs (e.g., Neo4j) store nodes and edges with arbitrary key-value properties.
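    The RDF format the authors favor is just subject-predicate-object triples, which makes it easy to picture. A minimal Python illustration (entity names are hypothetical, not drawn from the paper’s corpus):

```python
# Ontology-constrained triples extracted from an appeal (illustrative).
triples = [
    ("Case:2020-1234", "cites", "Article:121-3"),
    ("Case:2020-1234", "hasOutcome", "Outcome:Rejected"),
    ("Defendant:X", "chargedWith", "Crime:Fraud"),
]

def query(triples, predicate):
    """Return all (subject, object) pairs linked by a given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]
```

    Because every fact has the same three-part shape, the LLM’s extraction task reduces to emitting well-formed triples that conform to the ontology given in the prompt.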