Over the course of my career, I have saved a large body of cases that support my specific practice areas. Finding supportive cases by keyword search can be challenging, and building a detailed topic index is very time consuming. A local RAG pipeline solves both problems: it lets you quickly search a large body of cases with plain-language queries.
This post describes how to build a local RAG pipeline for legal case retrieval: domain-specific embeddings via Voyage AI’s voyage-law-2 model, persistent vector storage in ChromaDB, and a query interface that returns ranked case excerpts for downstream LLM analysis.
For a library of over 300 legal cases, I stayed within the Voyage AI free tier, and the cost of continued use is nominal. Other potential legal-industry uses include document clause search and deposition retrieval. Depending on the use, you may need to adjust your chunking strategy for paragraph-level granularity (example code is provided below).
Architecture
```
cases_text/            ← raw .txt case files (one per case)
cases_db/              ← ChromaDB persistent store (auto-created)
cases_build_index.py   ← one-time indexing script
cases_query.py         ← query script, called at runtime
.env                   ← VOYAGE_API_KEY (never commit)
```
The benefit of local vector storage is that nothing is sent to a third-party vector store. (Case text is still sent to Voyage AI to generate the embeddings.)
Voyage AI's voyage-law-2 model is fine-tuned on legal corpora. You will need to obtain an API key from the Voyage AI website and save it as VOYAGE_API_KEY in a .env file in the project root. Domain tuning produces better retrieval for legal terminology (e.g., "consideration"), where general-purpose embeddings conflate legal and lay meanings.
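The .env file is a single line (the key value below is a placeholder, not a real key):

```shell
# .env — never commit this file
VOYAGE_API_KEY=your-voyage-api-key-here
```

Add .env (and, if you like, cases_db/) to your .gitignore so neither is ever pushed to a repository.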
Dependencies
```shell
pip install chromadb voyageai python-dotenv
```
| Package | Role |
|---|---|
| chromadb | Local persistent vector database |
| voyageai | Legal-domain embedding model API |
| python-dotenv | Loads API key from .env without hardcoding |
Step 1: Prepare Case Files
Each case is a plain .txt file. Filename = case identifier in retrieval results.
```
Smith-v-Jones-2023.txt
Doe-v-Acme-Corp-2021.txt
Johnson-v-State-2019.txt
```
Step 2: Build the Index
Runs once. Reads every .txt file, generates an embedding via voyage-law-2, and upserts into ChromaDB. Re-running is safe: upsert overwrites existing IDs rather than creating duplicates. Note, however, that every run re-embeds each file, so re-runs are not free on the API side.
```python
import os

import chromadb
import voyageai
from dotenv import load_dotenv

load_dotenv()
client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))

db = chromadb.PersistentClient(path="./cases_db")
collection = db.get_or_create_collection("my_cases")

cases_folder = "./cases_text"
for filename in os.listdir(cases_folder):
    if not filename.endswith(".txt"):
        continue
    filepath = os.path.join(cases_folder, filename)
    with open(filepath, "r", errors="ignore") as f:
        text = f.read().strip()
    if not text:
        print(f"Skipping empty file: {filename}")
        continue
    # voyage-law-2 with input_type="document" optimizes embeddings for storage/retrieval
    result = client.embed([text], model="voyage-law-2", input_type="document")
    vector = result.embeddings[0]
    case_id = filename.replace(".txt", "")
    collection.upsert(
        ids=[case_id],
        embeddings=[vector],
        documents=[text],
        metadatas=[{"filename": filename, "case": case_id}],
    )
    print(f"Indexed: {filename}")

print(f"Done. {collection.count()} cases in index.")
```
Voyage AI uses asymmetric embeddings: documents and queries are embedded differently to optimize similarity at retrieval time. Always use input_type="document" when indexing and input_type="query" when embedding search terms.
ChromaDB persistence: PersistentClient writes to disk. The ./cases_db directory persists between runs — you only pay for embeddings once.
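To avoid re-embedding on every run, you can skip files whose case IDs are already stored; calling collection.get() with no arguments returns all stored IDs. The helper below is a sketch (files_to_index is my own name, not part of either library), and note that it also skips files whose text has changed since indexing; delete cases_db/ to force a full rebuild.

```python
def files_to_index(filenames, existing_ids):
    """Return only the .txt files whose case IDs are not yet in the index."""
    existing = set(existing_ids)
    return [
        f for f in filenames
        if f.endswith(".txt") and f.removesuffix(".txt") not in existing
    ]

# In the indexing script, before the loop:
#   existing_ids = collection.get()["ids"]
#   for filename in files_to_index(os.listdir(cases_folder), existing_ids):
#       ...
```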
Step 3: Query the Index
At query time, embed the question with input_type="query" and retrieve the top-N nearest neighbors. (ChromaDB's default distance metric is L2, not cosine; Voyage embeddings are normalized, so both metrics produce the same ranking.)
```python
import os
import sys

import chromadb
import voyageai
from dotenv import load_dotenv

load_dotenv()

question = " ".join(sys.argv[1:])
if not question:
    print("Usage: python cases_query.py 'your question here'")
    sys.exit(1)

client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))
db = chromadb.PersistentClient(path="./cases_db")
collection = db.get_collection("my_cases")

# Asymmetric query embedding
result = client.embed([question], model="voyage-law-2", input_type="query")
query_vector = result.embeddings[0]

# n_results controls how many cases are returned — tune based on context window
results = collection.query(
    query_embeddings=[query_vector],
    n_results=4,
)

for i, (doc, meta) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
    print(f"\n--- CASE {i+1}: {meta['case']} ---\n")
    print(doc[:3000])  # truncate to manage LLM context length
```
Tuning n_results: More results means more context for the LLM but higher token cost. For a model with a context window of 100K tokens or more (Claude, GPT-4o), 4-6 cases at 3,000 characters each is a good starting point. Increase if your cases are short; decrease if they're full-text opinions.
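A rough budget check for that starting point (the 4-characters-per-token ratio is a common rule of thumb for English text, not an exact figure):

```python
def approx_tokens(n_cases, chars_per_case, chars_per_token=4):
    """Rough token estimate for the retrieved context block."""
    return n_cases * chars_per_case // chars_per_token

# 6 cases at 3,000 chars each is roughly 4,500 tokens,
# a small slice of a 100K-token window.
print(approx_tokens(6, 3000))  # → 4500
```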
Truncation trade-off: Cutting at 3000 chars risks missing holding language buried deep in an opinion. For production use, consider chunking each case into overlapping segments at index time rather than truncating at query time.
Chunking Strategy (Optional)
Depending upon the use, chunk the legal text into overlapping segments at index time so retrieval operates at paragraph granularity rather than full-document granularity. This increases index size but improves precision.
```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks for finer-grained retrieval."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

# In the indexing loop, replace the single upsert with:
chunks = chunk_text(text)
for j, chunk in enumerate(chunks):
    result = client.embed([chunk], model="voyage-law-2", input_type="document")
    vector = result.embeddings[0]
    collection.upsert(
        ids=[f"{case_id}_chunk_{j}"],
        embeddings=[vector],
        documents=[chunk],
        metadatas=[{"filename": filename, "case": case_id, "chunk": j}],
    )
```
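With chunk-level indexing, a single query often returns several chunks from the same case. One option is to query with a larger n_results (say, 20) and collapse the ranked hits to the top distinct cases. The helper below is my own sketch, not a ChromaDB feature:

```python
def top_distinct_cases(metadatas, documents, limit=4):
    """Collapse ranked chunk hits to the best chunk of each distinct case."""
    seen = set()
    picks = []
    for meta, doc in zip(metadatas, documents):
        case = meta["case"]
        if case in seen:
            continue  # a better-ranked chunk of this case was already taken
        seen.add(case)
        picks.append((case, doc))
        if len(picks) == limit:
            break
    return picks

# Usage with the query results from Step 3:
#   hits = top_distinct_cases(results["metadatas"][0], results["documents"][0])
```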
Send me a note on LinkedIn if you come up with any new use cases.