"Building a Personal RAG System in Three Iterations"

Key Takeaway: Three iterations to learn that simpler RAG outperforms the over-engineered version -- faster, more reliable, better results.


Three major iterations to build RAG over my own data: emails, meeting transcripts, text messages, documents, conversation history. 18,000 chunks, sub-millisecond search with HNSW. The first version did everything the research papers recommend. The final version deleted most of it. That's the lesson. And it's a lesson that applies far beyond RAG -- the instinct to add complexity is almost always wrong, and the courage to remove it is almost always right. But you can only earn the right to simplify by first understanding why the complexity was there.

Iteration 1: Over-engineered

Everything the RAG literature recommends. Tiered retrieval with multiple passes. Query decomposition into sub-queries. Re-ranking with cross-encoder. Hybrid search combining dense embeddings with BM25. The works.
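To make the fusion step concrete: one standard way to combine a dense-embedding ranking with a BM25 ranking is reciprocal rank fusion. A minimal sketch -- the function name and the k=60 constant are illustrative conventions, not details from the original system:

```python
def rrf_fuse(dense_ranked, bm25_ranked, k=60):
    """Merge two ranked lists of doc ids via reciprocal rank fusion.

    Each doc scores 1/(k + rank) per list it appears in; k=60 is the
    conventional constant that damps the influence of top positions.
    """
    scores = {}
    for ranking in (dense_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" wins: it sits near the top of both rankings
fused = rrf_fuse(["a", "b", "c"], ["b", "c", "d"])
```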

Retrieval took 400-800ms per query. With decomposition creating 3-4 sub-queries, total latency approached 3 seconds. For a personal system, unacceptable. Worse, the complexity didn't improve results. The re-ranker occasionally promoted a better result from position 5 to position 2, but the top 10 were largely the same. I'd built a system optimized for the 5% rare case at the expense of the 95% common one. Decline the complexity.

Iteration 2: Too simple

Overcorrection. Single-pass: one query, top-K, done. Fast — under 50ms. But without topic routing, business queries surfaced personal health data that happened to use similar language. The embedding space doesn't distinguish "blood pressure is high" from "pressure on this deal is high" as cleanly as you'd expect. Turns out "simple" and "too simple" are different things.


Iteration 3: Radical simplification

Inspired by Andrej Karpathy's philosophy: remove everything that isn't load-bearing, then verify what remains works.

A binary needsRAG() check — lightweight classifier decides if the query needs personal context. When retrieval is needed, fan-out across all topic indexes simultaneously. HNSW costs under 1ms per index, so eight topics costs 8ms. Distance cutoff of 0.70 — return only results below threshold, nothing forced. If nothing is close enough, return nothing. That's the point.
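The whole retrieval path fits in a few lines. A minimal sketch, assuming cosine distance and toy in-memory indexes standing in for the per-topic HNSW indexes; `needs_rag()` (the classifier) is out of scope, and all names here are illustrative:

```python
import numpy as np

CUTOFF = 0.70  # cosine distance; anything above this is dropped, never forced

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_topic(index, query_vec, cutoff=CUTOFF):
    # One topic index: (vector, chunk) pairs; keep only hits under the cutoff.
    hits = [(cosine_distance(query_vec, vec), chunk) for vec, chunk in index]
    return [h for h in hits if h[0] < cutoff]

def fan_out(topic_indexes, query_vec, top_k=10):
    # No routing: search every topic, merge, keep the closest survivors.
    hits = []
    for index in topic_indexes.values():
        hits.extend(search_topic(index, query_vec))
    return sorted(hits, key=lambda h: h[0])[:top_k]  # may legitimately be empty
```

The "may legitimately be empty" line is the design choice: an empty result is a valid answer, not a failure to retry.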

No tiers. No decomposition. No re-ranking. Total: under 10ms. Results better than iteration one because the cutoff eliminates noise the re-ranker was trying to filter. The simpler system is faster, more reliable, and produces better output. Everyone building RAG should hear this.

Technical details

Gemini embeddings at 3,072 dimensions, truncated to 768 for storage — 4x reduction, 95%+ quality retained. sqlite-vec for vector storage, same database as everything else. No separate vector database service. No infrastructure overhead.
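The truncation itself is two lines: slice, then re-normalize so cosine and dot-product search still behave. A sketch, assuming Matryoshka-style embeddings where the leading dimensions carry most of the signal (Gemini's embedding models are trained this way, and their docs recommend re-normalizing after truncation):

```python
import numpy as np

def truncate_embedding(vec, dims=768):
    """Keep the first `dims` of a 3,072-dim embedding and re-normalize."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

full = np.random.default_rng(0).standard_normal(3072)
stored = truncate_embedding(full)  # 768 dims, unit length, 4x smaller
```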

Data sources: emails (richest for relationships), iMessages (402K+ messages, different register), Granola transcripts (decisions rarely make it to writing), Drive documents, conversation history (compounding recall improvement with every conversation).

Enterprise lessons

Distance cutoffs beat top-K. Eliminates a category of hallucination where the LLM synthesizes from marginally related context. Stop forcing the model to use bad context.
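The contrast in one toy example (the distances are made up): top-K always hands the LLM K chunks, even when nothing is relevant; a cutoff is allowed to hand it nothing.

```python
distances = [0.82, 0.85, 0.91]  # every candidate is far from the query

top_k = sorted(distances)[:3]                      # forced context, all of it bad
under_cutoff = [d for d in distances if d < 0.70]  # empty -- the honest answer
```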

Fan-out is underrated. Routing introduces a classification step that can fail. Searching everything in parallel is trivial at these speeds. Just search everything. Move on.

Simplicity scales better than sophistication. The over-engineered version had more failure modes, higher latency, and marginally better benchmarks that didn't reflect actual usage. The organizations that ship RAG to production resist implementing every technique in the latest paper. The ones still re-ranking are the ones that haven't shipped.
