Retrieval-Augmented Generation (RAG) has moved from an experimental technique to a core pattern in enterprise AI deployments. We've built RAG systems for a healthcare client (medical protocol Q&A), a legal tech startup (contract analysis), and a fintech company (regulatory document search). Here's what we got wrong first, and right eventually.
RAG combines a retrieval system (usually a vector database) with a generative LLM. Instead of asking the model to recall facts from training data, you retrieve relevant context from your own documents and include it in the prompt. The model answers based on what you give it, not what it was trained on.
The most common misconception: RAG is not just "ChatGPT with your documents." The retrieval quality, chunking strategy, and prompt engineering matter enormously. Get them wrong and you get confidently wrong answers.
Our first prototype chunked documents by fixed token count (512 tokens). Answers were bad. The problem: meaningful context was being split across chunks — a question answered across two paragraphs would never be fully retrieved.
The fix: semantic chunking by paragraph with a 15% overlap between adjacent chunks. Retrieval accuracy on our eval set jumped from 61% to 84%.
Vector similarity alone isn't sufficient for most enterprise use cases. A clause from a 2018 contract and a 2024 contract might be semantically identical but legally very different. We added structured metadata (document date, type, author, jurisdiction) and hybrid retrieval (vector + keyword + metadata filters). Precision improved significantly.
We shipped v1 based on subjective "it seems good" testing. This is not how you build production AI systems. We now maintain an evaluation dataset of 200+ question/expected-answer pairs per domain, run RAGAS metrics (faithfulness, answer relevancy, context precision) on every deploy, and gate releases on minimum score thresholds.
# RAGAS evaluation pipeline
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
results = evaluate(
dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy, context_precision]
)
# Gate deployment if any metric drops below threshold
assert results['faithfulness'] >= 0.85, "Faithfulness below threshold"Hybrid retrieval (BM25 + dense vectors), semantic chunking with overlap, rich metadata filtering, reranking with a cross-encoder before the final LLM call, and a deterministic fallback for low-confidence responses. This architecture consistently outperforms naive RAG by 20–40% on precision metrics.
Define your evaluation metrics before writing a single line of RAG code. If you can't measure it, you can't improve it — and you definitely can't ship it to production responsibly.