Applied AI Research for Legal Practitioners

Joshua Escoto | LegalEngineering.AI

How to Build a Local Legal RAG Pipeline with ChromaDB and Voyage AI

Over the course of my career, I have saved a large body of cases that support my specific practice areas. Finding supportive cases quickly by keyword search can be challenging, and creating a detailed topic index is very time-consuming. A local RAG (retrieval-augmented generation) pipeline solves both problems: it lets you search a large body of cases with a plain-language query.

This post describes how to build a local RAG pipeline for legal case retrieval: domain-specific embeddings via Voyage AI’s voyage-law-2 model, persistent vector storage in ChromaDB, and a query interface that returns ranked case excerpts for downstream LLM analysis.

For a library of over 300 legal cases, I stayed within the Voyage AI free tier, and the cost of continued use is nominal. There are other potential uses in the legal industry, including clause retrieval from documents and deposition search. Depending on the use, you may need to modify your chunking strategy for paragraph granularity (example code provided below).

Architecture

cases_text/          ← raw .txt case files (one per case)
cases_db/            ← ChromaDB persistent store (auto-created)
cases_build_index.py ← one-time indexing script
cases_query.py       ← query script, called at runtime
.env                 ← VOYAGE_API_KEY (never commit)

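The benefit of local vector storage is that the index and the stored case text stay on your machine rather than in a third-party vector store. The case text is still sent to the Voyage AI API once, at indexing time, to be embedded.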

Voyage AI’s voyage-law-2 is fine-tuned on legal corpora. This produces better retrieval for legal terminology (e.g. consideration), where general-purpose embeddings conflate legal and lay meanings. You will need to obtain an API key from the Voyage AI website and save it as VOYAGE_API_KEY in a .env file in the project root.
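
A missing or misnamed key otherwise tends to surface only as an authentication error on the first API call, so it can help to check for it up front. The following is a minimal sketch, assuming the .env file holds a single line of the form VOYAGE_API_KEY=your-key; require_env is a hypothetical helper introduced here for illustration, not part of python-dotenv or voyageai.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable `name`, or raise a clear error if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line like {name}=your-key "
            "to the .env file in the project root."
        )
    return value
```

In the scripts in this post, it would be called after load_dotenv(), for example api_key = require_env("VOYAGE_API_KEY").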

Dependencies

pip install chromadb voyageai python-dotenv
Package         Role
chromadb        Local persistent vector database
voyageai        Legal-domain embedding model API
python-dotenv   Loads API key from .env without hardcoding

Step 1: Prepare Case Files

Each case is a plain .txt file. The filename, minus the .txt extension, becomes the case identifier shown in retrieval results.

Smith-v-Jones-2023.txt
Doe-v-Acme-Corp-2021.txt
Johnson-v-State-2019.txt
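
Before spending API calls, it can be useful to list what the folder actually contains and which identifier each file will get. This is a small sketch under the assumption that all cases sit directly in one folder; list_case_files is a hypothetical helper name, not something the indexing script requires.

```python
from pathlib import Path


def list_case_files(folder):
    """Return (case_id, path) pairs for every non-empty .txt file, sorted by filename."""
    cases = []
    for path in sorted(Path(folder).glob("*.txt")):
        if path.stat().st_size == 0:
            print(f"Empty file, will be skipped: {path.name}")
            continue
        # path.stem drops only the final extension: Smith-v-Jones-2023.txt -> Smith-v-Jones-2023
        cases.append((path.stem, path))
    return cases
```

Calling list_case_files("./cases_text") and printing the identifiers gives a quick preview of the names that will appear in retrieval results.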

Step 2: Build the Index

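Run this once. It reads every .txt file, generates an embedding via voyage-law-2, and upserts it into ChromaDB. Re-running is safe because upsert overwrites an existing ID instead of creating a duplicate, but as written the script re-embeds every file on each run, so re-runs still use the embedding API.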

import os
import chromadb
import voyageai
from dotenv import load_dotenv

load_dotenv()

client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))
db = chromadb.PersistentClient(path="./cases_db")
collection = db.get_or_create_collection("my_cases")

cases_folder = "./cases_text"

for filename in os.listdir(cases_folder):
    if not filename.endswith(".txt"):
        continue

    filepath = os.path.join(cases_folder, filename)
    with open(filepath, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read().strip()

    if not text:
        print(f"Skipping empty file: {filename}")
        continue

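    # input_type="document" tells Voyage AI this text is being stored for retrieval.
    # Note: the whole case becomes one vector; text beyond the model's context
    # limit is truncated by the API by default.
    result = client.embed([text], model="voyage-law-2", input_type="document")
    vector = result.embeddings[0]

    # Strip only the trailing extension to get the case identifier
    case_id = os.path.splitext(filename)[0]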
    collection.upsert(
        ids=[case_id],
        embeddings=[vector],
        documents=[text],
        metadatas=[{"filename": filename, "case": case_id}]
    )
    print(f"Indexed: {filename}")

print(f"Done. {collection.count()} cases in index.")

Voyage AI uses asymmetric embeddings: documents and queries are embedded differently to optimize dot-product similarity at retrieval time. Always use input_type="document" when indexing and input_type="query" when embedding search terms.

ChromaDB persistence: PersistentClient writes to disk, so the ./cases_db directory persists between runs. You pay for document embeddings when you index; after that, each search only requires embedding the question.
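
The indexing script above calls the embedding API for every file each time it runs. One way to skip unchanged files on re-runs is to store a hash of the text alongside each case and compare it before embedding. The sketch below assumes a metadata key named "sha256", which is a choice made here for illustration and not something ChromaDB or Voyage AI defines; content_hash and needs_embedding are hypothetical helpers.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the case text, used as a cheap change detector."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_embedding(stored_metadata: Optional[dict], text: str) -> bool:
    """True if the case is new or its text differs from what was last indexed."""
    if stored_metadata is None:
        return True
    return stored_metadata.get("sha256") != content_hash(text)


# Inside the indexing loop (uses collection, case_id, text from the script above):
#   existing = collection.get(ids=[case_id], include=["metadatas"])
#   stored = existing["metadatas"][0] if existing["ids"] else None
#   if not needs_embedding(stored, text):
#       continue  # unchanged file: skip the Voyage AI call
#   ...embed as before, then add "sha256": content_hash(text) to the upsert metadata
```

With this in place, adding a handful of new cases to cases_text/ and re-running the script would only embed the new or edited files.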

Step 3: Query the Index

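At query time, embed the question with input_type="query" and retrieve the top-N nearest neighbors. A collection created with ChromaDB's default settings measures squared L2 distance rather than cosine similarity, but because Voyage AI embeddings are normalized to unit length, the resulting ranking is the same as a cosine ranking.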

import os
import sys
import chromadb
import voyageai
from dotenv import load_dotenv

load_dotenv()

question = " ".join(sys.argv[1:])
if not question:
    print("Usage: python cases_query.py 'your question here'")
    sys.exit(1)

client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))
db = chromadb.PersistentClient(path="./cases_db")
collection = db.get_collection("my_cases")

# Asymmetric query embedding
result = client.embed([question], model="voyage-law-2", input_type="query")
query_vector = result.embeddings[0]

# n_results controls how many cases are returned — tune based on context window
results = collection.query(
    query_embeddings=[query_vector],
    n_results=4
)

for i, (doc, meta) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
    print(f"\n--- CASE {i+1}: {meta['case']} ---\n")
    print(doc[:3000])  # truncate to manage LLM context length
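
The ranking returned by the query depends on the distance metric the collection uses. For unit-length vectors, squared L2 distance equals 2 - 2 * cosine similarity, so sorting by either gives the same order. The following self-contained check illustrates that identity with made-up 3-dimensional vectors standing in for real embeddings; it does not call ChromaDB or Voyage AI.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this equals the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors (illustrative values only, not real embeddings)
query = normalize([1.0, 2.0, 0.5])
docs = {
    "A": normalize([1.0, 1.8, 0.4]),
    "B": normalize([-1.0, 0.2, 2.0]),
    "C": normalize([0.5, 2.0, 2.0]),
}

# Highest cosine similarity first vs. lowest squared L2 distance first
by_cosine = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)
by_l2 = sorted(docs, key=lambda k: squared_l2(query, docs[k]))
print(by_cosine, by_l2)
```

To see the actual distances in the query script, include=["documents", "metadatas", "distances"] can be passed to collection.query. If you would rather have ChromaDB compute cosine distance directly, the collection can be created with metadata={"hnsw:space": "cosine"}; this has to be set when the collection is first created, and the exact configuration syntax may differ between ChromaDB versions.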

Tuning n_results: More results give the LLM more context but cost more tokens. For models with a context window of 100K tokens or more (e.g. Claude, GPT-4o), 4-6 cases at 3000 characters each (12,000 to 18,000 characters in total) is a good starting point. Increase n_results if your cases are short; decrease it if they’re full-text opinions.
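
The budget arithmetic can be made explicit instead of relying on a fixed slice per case. Here is a minimal sketch, assuming the retrieved hits have been turned into (case_id, text) pairs; build_context and its default limits (3000 characters per case, 18,000 in total, i.e. six cases) are hypothetical and should be tuned to the model you use.

```python
def build_context(hits, max_chars_total=18000, max_chars_per_case=3000):
    """Join (case_id, text) hits into one prompt context within a character budget.

    Only the excerpt text counts toward the budget; the short header lines do not.
    """
    parts = []
    used = 0
    for case_id, text in hits:
        remaining = max_chars_total - used
        if remaining <= 0:
            break  # budget exhausted: drop the remaining, lower-ranked hits
        excerpt = text[: min(max_chars_per_case, remaining)]
        parts.append(f"--- CASE: {case_id} ---\n{excerpt}")
        used += len(excerpt)
    return "\n\n".join(parts)
```

With the query script above, the pairs could be built as hits = [(m["case"], d) for d, m in zip(results["documents"][0], results["metadatas"][0])] and the returned string passed to the LLM as context.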

Truncation trade-off: Cutting at 3000 chars risks missing holding language buried deep in an opinion. For production use, consider chunking each case into overlapping segments at index time rather than truncating at query time.
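
Once cases are chunked, a single query can return several chunks from the same case. Below is a sketch of collapsing those hits to the closest chunk per case, assuming each chunk's metadata carries a "case" field and that distances were requested from ChromaDB (lower means closer); best_chunk_per_case is a hypothetical helper.

```python
def best_chunk_per_case(documents, metadatas, distances):
    """Keep only the closest chunk for each case, ordered from closest to farthest."""
    best = {}
    for doc, meta, dist in zip(documents, metadatas, distances):
        case = meta["case"]
        if case not in best or dist < best[case][0]:
            best[case] = (dist, doc)
    ranked = sorted(best.items(), key=lambda item: item[1][0])
    return [(case, dist, doc) for case, (dist, doc) in ranked]
```

The inputs correspond to results["documents"][0], results["metadatas"][0], and results["distances"][0] from a collection.query call that includes distances.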

Chunking Strategy (Optional)

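Depending on the use, chunk the legal text into overlapping segments at index time so retrieval operates at roughly paragraph granularity rather than full-document granularity. This increases index size but improves precision. The example below splits on fixed character counts, which approximates paragraph-sized segments without following actual paragraph breaks.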

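def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character chunks for finer-grained retrieval."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; skip a trailing chunk made only of overlap
        start += chunk_size - overlap
    return chunks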

# In the indexing loop, replace the single embed + upsert with:
chunks = chunk_text(text)
for j, chunk in enumerate(chunks):
    result = client.embed([chunk], model="voyage-law-2", input_type="document")
    vector = result.embeddings[0]
    collection.upsert(
        ids=[f"{case_id}_chunk_{j}"],
        embeddings=[vector],
        documents=[chunk],
        metadatas=[{"filename": filename, "case": case_id, "chunk": j}]
    )
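
If true paragraph granularity matters, for example for clause retrieval, chunks can follow the text's own paragraph breaks instead of fixed character offsets. This sketch assumes paragraphs are separated by blank lines in the .txt files, which depends on how the cases were exported; chunk_by_paragraph is a hypothetical alternative to chunk_text and, unlike it, does not add overlap between chunks.

```python
import re


def chunk_by_paragraph(text, max_chars=1000):
    """Group whole paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars is kept intact as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the current chunk
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

It can be swapped in for chunk_text in the indexing loop, since both return a list of strings.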

Send me a note on LinkedIn if you come up with any new use cases.