Production RAG: The Five Decisions Behind Every System That Works

RAG is not just an out-of-the-box system. It is a pipeline of decisions, and each decision is equally important. A demo RAG can be built with three lines of LangChain or any other framework, but making it work in production means getting five choices right. If any one of them is sub-optimal, the whole system can quietly degrade.

I have seen many ways these systems break in practice. Your chunker may split an important fact across two chunks. Your retriever may pull the right document at rank 7 when your architecture only passes the top 5 to the model. Your generator may produce a citation that looks correct but does not exist in any source document. Each of these is a quiet failure. The user just sees a confident wrong answer.

This article walks through the five decisions you need to make to build a well-optimized RAG system.

1. Whether you need retrieval at all

The first decision is whether to build a RAG system in the first place.

This used to be obvious. Context windows were 4K to 32K tokens, so retrieval was the only way to fit a large knowledge base into the model. That has changed. Frontier models now support 200K to 2M token context windows, and context caching has dropped the cost of repeated input to roughly 10% of uncached tokens. For small to medium corpora and repeat-query workloads, loading everything into context and caching it is often cheaper and simpler than retrieval.

Long context does not eliminate RAG; it changes when RAG is the right choice. A 40-page HR policy may fit in context for a single query, but a 50 GB internal wiki never will.

RAG is still the right choice when:

- Your corpus is too large to fit in any context window.
- Most queries only need a small slice of the data.
- Different users should only see specific documents.

If your corpus fits in a cached context and your query volume is bounded, you may not need RAG. Just put the documents in the prompt and use context caching to keep costs down.
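
For a sense of what the no-RAG path looks like, here is a minimal sketch using Anthropic's prompt caching. The model id and file name are illustrative, and other providers expose the same pattern under different names.

```python
import anthropic

client = anthropic.Anthropic()

# A corpus small enough to fit in the context window, loaded once.
policy_text = open("hr_policy.txt").read()  # illustrative file

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": policy_text,
            # Mark the corpus for caching: repeat queries against the same
            # prefix are billed at a fraction of the uncached input price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How many vacation days do I get?"}],
)
print(response.content[0].text)
```

No chunking, no index, no retriever to debug. The tradeoff is that every query carries the whole corpus, which only makes sense while the corpus stays small and the cache stays warm.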

2. How you chunk and parse

If you are doing RAG, the way you prepare and split your documents decides the quality of retrieval. Even the best reranker cannot fix a bad chunk.

Chunk size

Chunk size is something you should test and adjust, not pick once and forget. In many RAG systems, smaller chunks (200 to 400 tokens) with simple recursive splitters work better than the larger defaults. There is a balance: chunks that are too big add noise around the relevant sentence, and chunks that are too small lose the context around them. The common default of 1,000 tokens is usually not the best choice.

Semantic chunking

Semantic chunking groups sentences by meaning (a sketch follows at the end of this section). It can improve retrieval on documents that mix several topics. The cost is that when documents change, you may need to redo embeddings around the cluster boundaries. It is a good fit for stable corpora, but usually not for streaming or constantly changing data.

Tables and images

Tables and images in PDFs usually need a vision model to be parsed properly. If you only extract the PDF as text, tables become broken numbers and lost spacing.

Example 1: the pricing table problem. A PDF pricing table has columns like:

Plan | Price | Users | Features

A naive text extraction may turn it into:

Basic 10 5 Pro 30 20 Enterprise Custom Unlimited

The chunker can no longer tell which price belongs to which plan.

Example 2: the vision parser benefit. A vision model (GPT-4o, Claude Sonnet, Gemini) can read the same table and return structured text:

Basic plan: $10, 5 users
Pro plan: $30, 20 users
Enterprise: custom price, unlimited users

Example 3: the chart problem. A chart image often has no useful text inside the PDF. Plain text extraction skips it entirely.

Example 4: a vision parser for charts. A vision model can describe the chart in plain language: "Revenue increased from January to June, with the largest jump in May."

Metadata

Metadata is a first-class retrieval signal. For every chunk, store the source file, page number, section heading, author, and date. Filtering by metadata before searching embeddings often makes retrieval much cleaner.

Example 1: page number. If the answer comes from page 12 of a PDF, saving the page number lets you cite the exact source.

Example 2: section heading. A chunk from the "Refund Policy" section is more useful for a refund question than a random chunk from the same document.

Example 3: date filter. If the user asks about the latest pricing, you can first filter for recent documents and then search inside those chunks.

Example 4: source filter. If the user asks about HR policy, you can search only HR documents instead of the whole knowledge base.

In code, the difference looks like this:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Bad: chunker ignores structure and drops metadata
chunks = text.split("\n")

# Better: recursive splitter, structural separators, metadata preserved
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=40,
    separators=["## ", "### ", "\n", ". ", " "],
)
chunks = [
    # Chunk and doc come from your own ingestion pipeline
    Chunk(text=c, source=doc.source, page=doc.page, date=doc.date)
    for c in splitter.split_text(doc.text)
]
```

Tune ingestion before you tune retrieval. The ceiling is set here.
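
Going back to semantic chunking for a moment, here is a minimal version of the idea: embed each sentence, then start a new chunk wherever similarity between neighbors dips. The embedding model and the threshold are assumptions to tune against your own corpus, not recommendations.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedder works

def semantic_chunks(sentences: list[str], threshold: float = 0.55) -> list[str]:
    """Group consecutive sentences; split where adjacent similarity drops."""
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Normalized vectors: cosine similarity is just the dot product.
        if float(np.dot(embeddings[i - 1], embeddings[i])) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```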

3. How you retrieve

Basic retrieval means using one embedding model, doing one similarity search, and returning the top few results. That used to be enough. For most production systems today it is not, because the single-vector assumption breaks on questions that are ambiguous or need information from multiple places.

Example 1: an ambiguous question. The user asks: "What is the policy for returns?" "Returns" could mean product returns, tax returns, or returning equipment. A simple search may pick the wrong meaning.

Example 2: a multi-hop question. The user asks: "Which customers had failed payments and later contacted support?" The system needs payment data and support ticket data. One similarity search will not connect both.

Example 3: the single-vector problem. A single embedding represents the whole query as one meaning. Some questions contain multiple sub-questions, so one vector loses part of the intent.

Production RAG today usually combines several techniques.

Query rewriting

Before searching, it can help to rewrite the user's question into a form that looks more like the document text. The user's question and the answer often mean the same thing in different words, and embedding search can miss that.

Example 1. Original question: "How do I cancel a subscription?" Document text: "To cancel, open Settings > Billing and select Manage Plan." Rewritten query: "cancel subscription settings billing manage plan"

Example 2. Original question: "Can I get my money back?" Document text: "Refunds are available within 14 days of purchase." Rewritten query: "refund policy money back purchase 14 days"

HyDE: Hypothetical Document Embeddings

HyDE has the model first write a fake answer to the question, then embeds the fake answer and uses that for search. The fake answer may be wrong in detail, but it sits in the same part of the embedding space as the real answer, so search has a better chance of finding the right passage.

Example 1. Original question: "How do I cancel my subscription?" The model generates a fake answer: "To cancel your subscription, go to Settings, open Billing, and choose Cancel Plan." Searching with this fake answer often finds the real document section: "Open Settings > Billing > Manage Plan to cancel your subscription."

Example 2. Original question: "What happens if a payment fails?" Fake answer: "If a payment fails, the system retries the charge and may pause the account." This helps search find passages about failed payments, retries, billing status, and account suspension, even if the exact retry logic in the fake answer is wrong.

Query decomposition

Query decomposition breaks a compound question into smaller questions, searches each one separately, and combines the results.

Example. Original question: "Compare Stripe and Square on international fees and dispute handling." Break it into:

- Stripe international fees
- Square international fees
- Stripe dispute handling
- Square dispute handling

Search each separately, then let the model write the comparison from the retrieved chunks. A single search for "Stripe Square international fees dispute handling" usually returns a vague comparison page and misses the specific fee and dispute sections.

Hybrid search

Hybrid search combines dense vector search (for meaning) with BM25 (for exact words). BM25 is useful for error codes, product SKUs, and any technical token where the exact string matters.

Example 1: an error code. "Error E1027 during checkout." Vector search may find general checkout problems. BM25 finds the exact code E1027.

Example 2: a SKU lookup. "Find details for SKU ABX-4421." BM25 matches the SKU exactly. Vector search may return a similar-looking product, which is not useful here.

Example 3: a semantic match. "How do I stop my subscription?" Vector search can match this with "Cancel your plan from Billing Settings", even though the wording is different.

Combine the two result lists with reciprocal rank fusion. Hybrid usually beats pure dense retrieval on any corpus with domain jargon.

Two-stage retrieval with a reranker

Two-stage retrieval first uses a fast model to pull 50 to 100 candidates, then a slower reranker scores the candidates more carefully and picks the top 5. The reranker is slower per pair but more accurate, because it scores the query and the passage together rather than as separate vectors. Common choices are Cohere Rerank and BGE Reranker. Both stages are sketched below.
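
The fusion step is only a few lines. A minimal sketch, with toy doc ids standing in for the output of a real BM25 index and a real vector store:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse rankings by rank position; k=60 is the conventional damping term."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# BM25 nails the exact error code; dense search finds semantic neighbors.
bm25_ranking = ["error_codes_E1027", "billing_reference", "checkout_faq"]
dense_ranking = ["checkout_faq", "error_codes_E1027", "payment_retries"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# Documents that both systems rank well rise to the top of the fused list.
```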
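
The second stage can be a cross-encoder. A sketch using sentence-transformers with one of the common open rerankers; the passages are illustrative, and in production the candidates would be the 50 to 100 chunks from the fused ranking:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # scores (query, passage) pairs

query = "Error E1027 during checkout"
candidates = [
    "E1027 indicates a declined card during the checkout payment step.",
    "Our checkout supports Apple Pay and Google Pay.",
    "Refunds are available within 14 days of purchase.",
]

# The cross-encoder reads query and passage together, so it is slower per
# pair than a vector lookup but much better at fine-grained relevance.
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
top_5 = reranked[:5]
```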

Metadata filtering as a hard constraint

Filter by date ranges, tenant IDs, and document types before similarity search, not after. Post-filtering wastes the top-k on documents the user is not allowed to see.

If you can only add one upgrade over naive retrieval, add a reranker. It is the single change with the best payoff for the least work.

4. How you orchestrate

A basic RAG pipeline searches once and answers once. It works only if the search results are good. If the retriever brings back bad chunks, the model may still write a confident answer using that bad information. There is no checking step, so the system does not know whether the retrieved content was useful.

A better pattern checks the retrieved results before answering. If the results look weak, the system tries something else instead of generating a low-quality answer. There are a few common ways to implement this.

Corrective RAG (CRAG)

A small classifier labels each retrieved document as relevant, ambiguous, or irrelevant. If most are irrelevant, the system runs a different search (often web search) instead of generating from the bad context.

Self-RAG

The model decides whether to retrieve at all at each generation step, and critiques its own output against the retrieved evidence using reflection tokens.

Agentic retrieval loops

The RAG system runs as a workflow rather than a single shot. It searches, checks the results, and decides what to do next. If the results are good, it answers. If they are bad, it rewrites the query, runs a web search, or escalates to a human. These loops are usually built on LangGraph, LlamaIndex Workflows, or a similar state machine.

The shape of the loop is:

query → retrieve → grade
  ├── good → generate answer
  └── bad  → rewrite query or web search → retrieve → …

The downside of a loop is more latency and more tokens per query. The benefit is that the system can say "I don't know" when the evidence is weak, instead of guessing. That tradeoff is usually worth it in high-stakes domains like medicine, law, and finance, but often not in a casual chatbot.
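
A minimal, framework-free sketch of that grade-then-branch shape. The retriever, grader, rewriter, and generator are injected callables standing in for your own search and LLM calls:

```python
from typing import Callable

def corrective_answer(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], str],      # returns "relevant" or "irrelevant"
    rewrite: Callable[[str], str],         # reformulates a weak query
    generate: Callable[[str, list[str]], str],
    max_attempts: int = 2,
) -> str:
    """Retrieve, grade the evidence, and only generate when it passes."""
    for _ in range(max_attempts):
        docs = retrieve(query)
        relevant = [d for d in docs if grade(query, d) == "relevant"]
        if relevant:
            return generate(query, relevant)
        query = rewrite(query)  # weak evidence: reformulate and retry
    return "I don't know."      # refuse rather than guess on bad context
```

A production loop would add the web-search branch and per-tool budgets, but the control flow is the same.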

5. How you evaluate

You need to test the search part and the answer-writing part separately. If you only score the final answer, you cannot tell which half is broken. A good generator can produce a polished, confident-looking answer on top of bad retrieval, and you will not see the problem until a user reports it.

These are the metrics most teams use to measure RAG performance.

Retriever metrics

Context precision, context recall, MRR (mean reciprocal rank), and hit rate at k. The question these answer is: did the right documents show up, and how high in the ranking?

Generator metrics

Faithfulness measures whether the answer stays inside the retrieved context. Answer relevancy measures whether the answer actually addresses the question that was asked. A faithful answer to the wrong question is still useless.

End-to-end correctness

Score against a ground-truth answer set. This is slow to build and painful to maintain, but it is the only thing that tells you whether the full system actually works for users. Start with 50 queries and grow the set every time a real user reports a bad answer.

LLM-as-a-judge, with caveats

RAGAS, DeepEval, and Phoenix automate these metrics by using a stronger model to grade a weaker one. The judge has biases, often toward longer answers and certain phrasings. Calibrate it against human labels on a small sample before trusting the scores. Otherwise the judge's biases become your system's biases.

Notable case studies

Several teams have written about how they apply these patterns in real systems. The useful lesson from each one is usually the constraint that shaped the architecture, not the architecture itself.

DoorDash support copilot. DoorDash built a RAG system over its support articles and added two checking layers: a real-time guardrail that validates responses before they reach users, and a quality judge that monitors answers after the fact. The retrieval part was straightforward. The validation layer is what brought hallucinations down by about 90% after launch.

Royal Bank of Canada (Arcane). RBC built Arcane to help financial advisors search complex investment policies. The hard part was not picking a better embedding model. It was normalizing semi-structured documents from many internal systems and connecting cross-references between policies at answer time.

LinkedIn customer support. LinkedIn combined RAG with a knowledge graph built from historical support cases. The graph preserves relationships that text chunking would lose, like shared root causes and linked resolutions. Retrieval pulls connected sub-graphs rather than isolated chunks. After six months in production, it cut median resolution time by 28.6%.

The common thread has nothing to do with the model or the vector store. Each system is a pipeline of deliberate decisions, and the decisions that mattered most were the ones shaped by a constraint specific to that team, not the ones a reference architecture would suggest.

The pipeline end-to-end

1. Decide whether retrieval is the right tool. Long context plus caching may cover your use case more cheaply and simply than RAG.
2. Chunk and parse deliberately. Ingestion sets the retrieval ceiling. Tune it before anything else.
3. Build a retrieval pipeline, not just a retriever. Query rewriting, hybrid search, reranking, and metadata filters are now table stakes for production systems.
4. Add grading and fallback to orchestration. Single-shot pipelines confidently generate nonsense on bad retrieval.
5. Evaluate the retriever and the generator separately. End-to-end scores can hide which half is failing.

A working RAG system is built from many small decisions, and each one has a quiet way of breaking the system if you choose it badly. That is why every step needs to be chosen deliberately. The teams that ship well-performing RAG systems get there by recognizing that the embedding model is rarely the thing that matters most.

Sources

Papers

- Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), 2022
- Yan et al., Corrective Retrieval Augmented Generation (CRAG), 2024
- Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, 2023
- Es et al., RAGAS: Automated Evaluation of Retrieval Augmented Generation, 2023
- LinkedIn, Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering, SIGIR 2024

Research and studies

- Chroma Research, Evaluating Chunking Strategies for Retrieval

Case studies

- DoorDash, Path to High-Quality LLM-Based Dasher Support Automation
- RBC Arcane, RAG System for Investment Policy Search and Advisory at RBC (ZenML LLMOps Database)