Retrieval-Augmented Generation is the most popular pattern in enterprise AI for one good reason: it grounds a language model in your data and substantially reduces hallucination. It also has the widest gap between demo quality and production quality of any AI architecture we work with. The demo always looks great. The first real workload reveals everything the demo papered over.
Here are the eight failure modes we see again and again. None of them are exotic. Each has a workable fix. Most teams hit at least three before they get to a stable rollout.
1. Chunking by character count, not by meaning
The default LangChain or LlamaIndex setup chunks documents at fixed character counts. Fine for cleanly formatted prose. Catastrophic for tables, contracts, bulleted policies, code blocks, and the kind of mixed-content PDFs enterprises actually have. A table split across two chunks loses every row-to-header relationship. A bulleted policy split mid-bullet loses its meaning entirely.
Fix: structure-aware chunking. Parse the document layout first (Unstructured, PaddleOCR, Azure Document Intelligence, AWS Textract), then chunk at semantic boundaries — section headings, table rows, list items. Keep each chunk self-contained: a chunk should make sense to a reader who has not seen the rest of the document.
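As a rough illustration of chunking at structural boundaries, here is a minimal sketch. It assumes the layout parser has already produced Markdown-like text with `#` headings and pipe-delimited tables; the function name `chunk_by_structure` and the chunk dictionary shape are hypothetical, not part of any library named above.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")
SEPARATOR = re.compile(r"^\|[\s:\-|]+$")  # e.g. |---|---|

def chunk_by_structure(text: str) -> list[dict]:
    """Split at headings; emit one chunk per table row with its header attached."""
    chunks = []
    heading = ""
    prose = []
    table_header = None

    def flush_prose():
        body = "\n".join(prose).strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        prose.clear()

    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING.match(stripped)
        if match:
            flush_prose()
            heading = match.group(1).strip()
            table_header = None
        elif stripped.startswith("|"):
            flush_prose()
            if table_header is None:
                table_header = stripped  # first row of a table run is the header
            elif not SEPARATOR.match(stripped):
                # Repeat the header so the row is self-contained.
                chunks.append({"heading": heading,
                               "text": f"{table_header}\n{stripped}"})
        else:
            table_header = None
            prose.append(line)
    flush_prose()
    return chunks

doc = """# Returns
Items may be returned.

| Region | Window |
|---|---|
| EU | 14 days |
| US | 30 days |
"""
for chunk in chunk_by_structure(doc):
    print(chunk)
```

Each table row carries its header and section heading, so the chunk still makes sense in isolation. Real documents need more cases (nested lists, multi-row headers, code blocks) than this sketch handles.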
2. Pure semantic search, no keyword fallback
Dense retrieval is great at meaning and poor at exact match. A user who searches for “Policy 27.4” or “ABC-1147-X” will get a list of vaguely related results ranked above the literal match. Acronyms, model numbers, statute references, product SKUs — all of these break pure semantic retrieval.
Fix: hybrid retrieval. Run BM25 (keyword) and dense (semantic) in parallel, then reciprocal-rank-fuse the results before reranking. The BM25 path handles exact-match recall; the dense path handles synonym and paraphrase recall. Every serious enterprise RAG system we see in production runs hybrid, not pure-dense.
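Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever returns an ordered list of document IDs; the function name, the example IDs, and the smoothing constant `k=60` (a commonly used default) are illustrative choices rather than anything prescribed here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["policy-27.4", "faq-3", "memo-9"]    # hypothetical keyword results
dense_hits = ["faq-3", "guide-1", "policy-27.4"]  # hypothetical semantic results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Because the fusion works on ranks rather than raw scores, there is no need to normalise BM25 scores against cosine similarities before combining them.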
3. No reranker
The top-K results from any retriever — sparse, dense, or hybrid — are noisy. A cross-encoder reranker that scores query-document pairs jointly typically lifts precision-at-K by 10–25 percentage points for almost no engineering cost. Yet most teams skip it because the initial retrieval “looked fine in the demo.”
Fix: add a reranker (Cohere Rerank, Voyage Rerank, BGE cross-encoder, ColBERT). Retrieve top-50, rerank to top-5 or top-8. The added latency depends on the model and hardware, typically tens to a few hundred milliseconds for 50 candidates; the quality lift usually justifies it within days.
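The retrieve-wide, rerank-narrow shape can be sketched independently of any particular model. In the sketch below, `score_pair` is a placeholder for whatever reranker you call, and `overlap_score` is a toy stand-in so the example runs without a model; it is not a cross-encoder and not any vendor's API.

```python
from typing import Callable

def rerank(query: str,
           candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [doc for _, doc in scored[:top_n]]

def overlap_score(query: str, doc: str) -> float:
    """Toy scorer: fraction of query tokens present in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0

candidates = [
    "refund window is 30 days",
    "shipping takes 5 days",
    "refund policy for damaged items",
]
print(rerank("refund window", candidates, overlap_score, top_n=2))
```

In a real system `candidates` would be the top-50 from the retriever and `score_pair` would wrap the reranker model; the surrounding logic stays the same.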
4. Stale knowledge base, no freshness signal
Enterprise documents change. Policies are revised, contracts are amended, product specs are updated. A RAG system that indexes once and never refreshes will confidently quote a policy that was retired six months ago. Worse, it will sometimes retrieve both the old and new versions and let the LLM pick one.
Fix: design ingestion as a continuous pipeline, not a one-off batch. Track source identifiers, document hashes, and effective dates. When a document is superseded, retire the old vectors — do not just add the new ones. Surface the document date to the LLM so it can prefer recent sources.
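One way to picture the bookkeeping is a record per source identifier that is replaced, not appended to, when the content hash changes. The sketch below uses a plain dictionary as a stand-in for the vector store; the `ingest` function and the record fields are hypothetical.

```python
import hashlib

def ingest(index: dict, source_id: str, text: str, effective_date: str) -> str:
    """Add or replace a document; return 'added', 'replaced', or 'unchanged'."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    current = index.get(source_id)
    if current is not None and current["hash"] == digest:
        return "unchanged"  # same content, nothing to re-embed
    status = "replaced" if current is not None else "added"
    # Overwriting the key retires the superseded version. In a real vector
    # store this would be a delete of the old vectors followed by an upsert.
    index[source_id] = {
        "hash": digest,
        "effective_date": effective_date,  # surfaced to the LLM at query time
        "text": text,
    }
    return status

index = {}
print(ingest(index, "policy-27.4", "Remote work: 2 days per week.", "2024-01-01"))
print(ingest(index, "policy-27.4", "Remote work: 2 days per week.", "2024-01-01"))
print(ingest(index, "policy-27.4", "Remote work: 3 days per week.", "2024-07-01"))
```

Keying on the source identifier is what prevents the old and new versions from coexisting in the index.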
5. No groundedness or faithfulness evaluation
Most teams evaluate RAG with vibes-based testing — “the demo answers look good.” This breaks down the moment scope expands. The right evaluation decomposes into four metrics:
- Context precision — of the retrieved chunks, how many were actually relevant?
- Context recall — of the chunks needed to answer the question, how many were retrieved?
- Groundedness — does every claim in the answer trace back to the retrieved context?
- Answer relevance — does the answer address the question that was actually asked?
Fix: build a golden set of 200–500 query/answer pairs reviewed by a subject-matter expert, run these four metrics on every change, and gate deployment on the results. Frameworks like Ragas, TruLens, and Patronus give you this off the shelf.
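The two retrieval metrics and the deployment gate can be sketched directly from the definitions above. This is a simple set-based reading over chunk IDs, not the exact formula any of the named frameworks uses; the function names and thresholds are illustrative assumptions. Groundedness and answer relevance usually need an LLM or human judge and are left out.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that were actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the needed chunks that were retrieved."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & relevant) / len(relevant)

def failed_gates(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names of metrics below their minimum; an empty list means deploy."""
    return [name for name, minimum in thresholds.items()
            if scores.get(name, 0.0) < minimum]

retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c9"}
scores = {
    "context_precision": context_precision(retrieved, relevant),
    "context_recall": context_recall(retrieved, relevant),
}
thresholds = {"context_precision": 0.6, "context_recall": 0.6}  # example values only
print(scores, failed_gates(scores, thresholds))
```

In CI, the scores would be averaged over the whole golden set and the build would fail whenever `failed_gates` returns a non-empty list.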
6. Prompts that beg the model to hallucinate
We routinely see prompts that say things like “Use the context to answer the question. If you do not know the answer, do your best.” That last phrase is an invitation to invent. The model will dutifully invent.
Fix: make refusal a first-class behaviour. Say explicitly: “If the answer is not present in the context, respond with ‘I do not have that information’ and suggest the user contact a human.” Add a check in the eval suite that confirms refusal behaviour on out-of-domain queries.
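A minimal sketch of both halves, the prompt and the eval check, might look like the following. The prompt wording follows the text above; `build_prompt`, `is_refusal`, and `refusal_rate` are hypothetical helper names, and the string match is a deliberately crude detector.

```python
REFUSAL = "I do not have that information"

def build_prompt(context: str, question: str) -> str:
    """Assemble a prompt that makes refusal the explicit fallback."""
    return (
        "Answer the question using only the context below.\n"
        "If the answer is not present in the context, respond with "
        f"'{REFUSAL}' and suggest the user contact a human.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
    )

def is_refusal(answer: str) -> bool:
    """Crude check: did the model use the agreed refusal phrase?"""
    return REFUSAL.lower() in answer.lower()

def refusal_rate(answers: list[str]) -> float:
    """Fraction of answers that refused; run this on out-of-domain queries."""
    if not answers:
        return 0.0
    return sum(is_refusal(answer) for answer in answers) / len(answers)

# Hypothetical model outputs for two out-of-domain questions.
answers = [
    "I do not have that information. Please contact the HR team.",
    "The limit is 40 hours per week.",
]
print(refusal_rate(answers))
```

On an out-of-domain query set the target refusal rate is close to 1.0; anything lower means the model is still answering questions the context cannot support.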
7. No citation surface for the user
A user who gets an answer with no citation cannot trust it. A user who gets an answer with a citation can verify it in 10 seconds and stop treating the LLM as a magic oracle. Citation is the single highest-leverage trust feature in RAG.
Fix: render every claim with a source link or inline citation. Engineer the prompt so the model emits a structured response (claim + citation IDs) which the UI then renders. If the model cannot cite, the model should not assert.
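To enforce "cannot cite, should not assert" mechanically, the structured response can be validated before it reaches the UI. The sketch assumes the model was prompted to return JSON with a `claims` list, each entry holding `text` and `citations`; that schema and the function name are assumptions for illustration.

```python
import json

def uncited_claims(raw_response: str, retrieved_ids: set[str]) -> list[str]:
    """Return claims with no citation, or citing a chunk that was never retrieved."""
    payload = json.loads(raw_response)
    problems = []
    for claim in payload["claims"]:
        cited = claim.get("citations", [])
        if not cited or any(chunk_id not in retrieved_ids for chunk_id in cited):
            problems.append(claim["text"])
    return problems

# Hypothetical model output and retrieval set.
raw = json.dumps({
    "claims": [
        {"text": "Refunds take 14 days.", "citations": ["doc-1"]},
        {"text": "Refunds are free.", "citations": []},
        {"text": "The US window is 30 days.", "citations": ["doc-7"]},
    ]
})
print(uncited_claims(raw, {"doc-1", "doc-2"}))
```

Claims returned by this check can be dropped, flagged in the UI, or sent back to the model for a retry. Checking citation IDs against the retrieved set also catches invented sources.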
8. Treating RAG as the answer to every problem
Once you have a RAG system that works for FAQ-style queries, the temptation is to throw everything at it. But RAG is bad at synthesis across many documents (it retrieves chunks, not full corpora), bad at numerical reasoning (LLMs are still poor calculators), and bad at time-sensitive aggregation (counting, ranking, filtering across a large set).
Fix: recognise the boundaries. For aggregation, use text-to-SQL or text-to-API. For complex synthesis, decompose the question into sub-questions and recompose. For numbers, give the model a calculator tool. RAG is a great substrate; it is not the universal answer.
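As one example of handing work to a tool, a calculator can be a small, restricted arithmetic evaluator. The sketch below uses Python's standard `ast` module and accepts only numbers and the four basic operators; the function name `calculate` and the way it would be registered as a tool with your LLM framework are left as assumptions.

```python
import ast
import operator

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculate(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)

print(calculate("(1200 - 200) * 3 / 4"))
```

The model extracts the figures from the retrieved context and emits the expression; the arithmetic is done by code. Anything that is not plain arithmetic, such as a function call, raises `ValueError` instead of being executed.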
The production checklist
Before you promote any RAG system to a real user population, run through this checklist. We use it on every engagement.
- Structure-aware chunking with semantic boundaries
- Hybrid retrieval (BM25 + dense) with reciprocal rank fusion
- Reranker over the top-50 candidates
- Refresh pipeline with retirement of superseded documents
- Golden-set eval gated on four metrics in CI
- Refusal behaviour engineered, not assumed
- Inline citations rendered for every claim
- Out-of-domain detection routing to a human or escalation path
- PII and secrets filter on inputs and outputs
- Cost and latency budgets enforced per query
A RAG system that clears these ten gates rarely embarrasses the team that shipped it. A system that skips any of them will, eventually and publicly.