Retrieval-Augmented Generation

Benched.ai Editorial Team

Retrieval-Augmented Generation (RAG) is an architecture pattern that couples a parametric language model with a non-parametric knowledge store. Before or during decoding, the system retrieves documents, embeddings, or key-value pairs most relevant to the user query and injects them into the model's context window. This hybrid approach grounds responses in up-to-date facts and reduces hallucinations without costly full model retraining.

  Definition and Scope

At minimum a RAG pipeline contains:

  1. An index built from dense vectors or sparse lexical signals such as BM25.
  2. A retriever that encodes the user query into the same representation space and returns the top-k candidates.
  3. A generator—typically a transformer decoder—that conditions on both the original prompt and the retrieved passages.

Some variants interleave retrieval with generation (iterative RAG) or treat the retrieved documents as latent variables learned end-to-end; a minimal sketch of the core retrieve-then-generate loop follows.
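
The sketch below wires the three components together with a sentence-transformers MiniLM encoder and an in-memory passage list; the model name, the toy corpus, and the prompt template are illustrative assumptions, and the final generation call is left as a comment.

    import numpy as np
    from sentence_transformers import SentenceTransformer  # MiniLM-family encoder (assumed choice)

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Index: dense vectors for a toy corpus.
    documents = [
        "RAG couples a language model with an external knowledge store.",
        "FAISS provides approximate nearest-neighbour search over dense vectors.",
        "BM25 is a sparse lexical retrieval function.",
    ]
    doc_vectors = encoder.encode(documents, normalize_embeddings=True)

    # 2. Retriever: encode the query into the same space and return the top-k passages.
    def retrieve(query: str, k: int = 2) -> list[str]:
        q = encoder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vectors @ q                      # cosine similarity (unit vectors)
        return [documents[i] for i in np.argsort(-scores)[:k]]

    # 3. Generator input: condition the decoder on both the prompt and the retrieved passages.
    def build_prompt(query: str, passages: list[str]) -> str:
        context = "\n".join(f"- {p}" for p in passages)
        return (
            "Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        )

    question = "What does RAG pair a language model with?"
    prompt = build_prompt(question, retrieve(question))
    # prompt is then sent to any decoder (GPT-4o, Mixtral-8x22B, a local model, ...).
    print(prompt)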

  Canonical Pipeline

Stage               | Common Implementation                    | Latency Budget
Query Encoding      | MiniLM, Ada-002 embeddings [1]           | <2 ms
Search / ANN Index  | FAISS IVFPQ, Elasticsearch kNN           | <5 ms
Reranking           | ColBERT-v2, Cohere Rerank-english-v3 [2] | 3-10 ms
Prompt Construction | JSON / Markdown template                 | ~1 ms
Generation          | GPT-4o, Mixtral-8x22B                    | 30-300 ms
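
For the search stage, one common CPU-friendly choice is a FAISS IVF-PQ index. The sketch below uses random vectors and illustrative parameters (nlist, m, nprobe); real deployments tune these against the recall and latency budgets above.

    import faiss
    import numpy as np

    d, n = 64, 10_000                                   # embedding dim, corpus size
    rng = np.random.default_rng(0)
    xb = rng.standard_normal((n, d)).astype("float32")  # stand-in document embeddings

    nlist, m, nbits = 64, 8, 8                          # IVF cells, PQ sub-vectors, bits per code
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    index.train(xb)                                     # learn coarse centroids + PQ codebooks
    index.add(xb)

    index.nprobe = 8                                    # cells probed per query: recall/latency knob
    xq = rng.standard_normal((1, d)).astype("float32")  # query embedding
    distances, ids = index.search(xq, 5)                # top-5 candidates handed to the reranker
    print(ids[0])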

  Architectural Variants

    Fusion-in-Decoder (FiD)

Each retrieved passage is paired with the question and encoded independently; the encoder outputs are then concatenated and fed to the decoder, which attends jointly over all of them. The model thus learns to focus on the most relevant chunks during decoding.
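
A sketch of the FiD input construction (following the question/title/context convention of the original paper); the encoder and decoder calls are only indicated in comments.

    question = "Who introduced retrieval-augmented generation?"
    passages = [
        {"title": "RAG", "text": "Lewis et al. proposed retrieval-augmented generation in 2020."},
        {"title": "FiD", "text": "Fusion-in-Decoder fuses evidence inside the decoder."},
    ]

    # One encoder input per passage, each paired with the question.
    encoder_inputs = [
        f"question: {question} title: {p['title']} context: {p['text']}"
        for p in passages
    ]

    # Conceptually:
    #   H_i    = Encoder(encoder_inputs[i])        # one forward pass per passage
    #   answer = Decoder(concat(H_1, ..., H_k))    # cross-attention over all passages at once
    print(encoder_inputs)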

    ReAct & Iterative RAG

The model alternates between reasoning steps and new retrieval calls, emitting search queries as actions [3]. This improves multi-hop QA.
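
A minimal loop sketch of this alternation; `llm_step` and `search` are hypothetical stubs standing in for your model and retriever, not a particular library's API.

    def llm_step(transcript: str) -> str:
        """Return the model's next line: 'Search: <query>' or 'Answer: <text>'."""
        raise NotImplementedError  # call your LLM here

    def search(query: str, k: int = 3) -> list[str]:
        """Return top-k passages for the emitted search query."""
        raise NotImplementedError  # call your retriever here

    def iterative_rag(question: str, max_steps: int = 4) -> str:
        transcript = f"Question: {question}\n"
        for _ in range(max_steps):
            step = llm_step(transcript)
            transcript += step + "\n"
            if step.startswith("Answer:"):            # enough evidence gathered
                return step.removeprefix("Answer:").strip()
            if step.startswith("Search:"):            # new retrieval call emitted as an action
                for passage in search(step.removeprefix("Search:").strip()):
                    transcript += f"Observation: {passage}\n"
        return "No answer within the step budget."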

    Key-Value Memory RAG

Rather than free-form text passages, retrieval returns structured tuples ⟨key,value⟩ that are directly mapped into the attention cache, enabling constant-time lookup during decoding [4].
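
A toy illustration of the constant-time lookup side of this idea, with the structured tuples held in an ordinary hash map; here the value is simply prepended to the prompt rather than written into the attention cache.

    # (entity, relation) -> value tuples; dict lookup is O(1), no vector search needed.
    memory: dict[tuple[str, str], str] = {
        ("Eiffel Tower", "height"): "330 m",
        ("Eiffel Tower", "completed"): "1889",
    }

    def lookup(entity: str, relation: str) -> str | None:
        return memory.get((entity, relation))

    fact = lookup("Eiffel Tower", "height")
    prompt = (
        f"Known fact: Eiffel Tower height = {fact}\n"
        "Question: How tall is the Eiffel Tower?"
    )
    print(prompt)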

  Performance Metrics

Metric            | Description                                                                     | Typical Value
Recall@5          | Fraction of queries whose gold document appears in the top-5 retrieved results | 85-95 % on NQ-open
EM Score          | Exact-match accuracy of the generated answer                                    | 55-65 % (FiD-base)
Tokens per Dollar | Generation throughput per unit cost                                             | 1.4× improvement vs fine-tuned GPT-4 alone [5]
p99 Latency       | End-to-end latency, including search                                            | <600 ms for enterprise chatbots
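
Recall@k, as used in the table, reduces to a few lines; the document ids below are toy data.

    def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int = 5) -> float:
        """Fraction of queries whose gold document id appears in the top-k retrieved ids."""
        hits = sum(1 for ids, g in zip(retrieved, gold) if g in ids[:k])
        return hits / len(gold)

    retrieved = [["d3", "d7", "d1", "d9", "d2"], ["d5", "d4", "d8", "d6", "d0"]]
    gold = ["d1", "d2"]                        # gold document per query
    print(recall_at_k(retrieved, gold, k=5))   # 0.5: only the first query hits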

  Design Trade-offs

  • Large context windows reduce need for strict top-k recall but raise inference cost.
  • Dense embeddings require GPUs for index refresh; sparse BM25 scales cheaply on CPUs.
  • Aggressive compression (product quantization, low HNSW connectivity such as M≈8) saves RAM but can hurt recall; a back-of-the-envelope estimate follows the list.
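
The estimate below uses illustrative numbers (10 M vectors, 768-dimensional float32 embeddings vs. 64-byte PQ codes), not figures from a specific deployment.

    n, d = 10_000_000, 768
    float32_bytes = n * d * 4          # uncompressed dense vectors
    pq_bytes = n * 64                  # 64-byte PQ code per vector
    print(f"float32 index: {float32_bytes / 1e9:.1f} GB")  # ~30.7 GB
    print(f"PQ-64 index:   {pq_bytes / 1e9:.1f} GB")       # ~0.6 GB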

  Current Trends (2025)

  • Differentiable Retrieval: DRAGON-v2 trains retriever and generator jointly with REINFORCE to maximize answer likelihood.
  • On-device RAG: Mobile LLMs pair a 7B-parameter model with a 100k-vector store held in Apple Neural Engine SRAM for offline privacy.
  • Temporal Indexing: Time-aware retrieval adds decay terms so models prefer fresh documents for news use cases [6]; a toy scoring sketch follows.
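
A toy version of such a decay term multiplies semantic similarity by an exponential freshness factor; the decay rate below is an assumption, not a value from the cited work.

    import math

    def time_aware_score(similarity: float, age_days: float, lam: float = 0.01) -> float:
        """Blend relevance with freshness: older documents are discounted exponentially."""
        return similarity * math.exp(-lam * age_days)

    print(time_aware_score(0.82, age_days=2))    # recent article, barely discounted
    print(time_aware_score(0.82, age_days=400))  # stale article, heavily discounted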

  Implementation Tips

  1. Store both raw text and citation metadata; include source URLs in the final answer for auditability.
  2. Refresh embeddings nightly to capture newly crawled content.
  3. Use mixed-precision (FP16/INT8) for the retriever to cut serving costs by 40 % without recall loss.
  4. Cache top-k results per query hash to amortize popular lookups (see the sketch below).
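
A sketch of tip 4: an in-process cache keyed by a hash of the normalized query, with the real ANN call stubbed out. A shared store such as Redis would be the usual production choice.

    import hashlib

    cache: dict[str, list[str]] = {}

    def query_hash(query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def ann_search(query: str, k: int) -> list[str]:
        raise NotImplementedError  # call FAISS / Elasticsearch here

    def retrieve_with_cache(query: str, k: int = 5) -> list[str]:
        key = query_hash(query)
        if key not in cache:                    # miss: pay for the real ANN search once
            cache[key] = ann_search(query, k)
        return cache[key]                       # hit: popular lookups are amortized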

  Failure Modes

Symptom                              | Likely Cause                       | Mitigation
Hallucinated stats despite retrieval | Top-k misses the gold document     | Increase k or switch to a multi-vector retriever
Verbose, copy-pasted passages        | Prompt template lacks compression  | Use a summarization layer before insertion
Slow tail latencies                  | HNSW graph too deep                | Tune efSearch, deploy GPU-accelerated search
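
For the tail-latency row, efSearch is the per-query breadth knob on an HNSW index; the sketch below shows where it lives in FAISS, using random vectors and illustrative values.

    import faiss
    import numpy as np

    d = 128
    xb = np.random.default_rng(0).standard_normal((50_000, d)).astype("float32")

    index = faiss.IndexHNSWFlat(d, 32)      # M=32 neighbours per node
    index.hnsw.efConstruction = 200         # build-time graph quality
    index.add(xb)

    index.hnsw.efSearch = 64                # start high, lower it until p99 fits the budget
    _, ids = index.search(xb[:1], 10)       # one query vector, top-10 results
    print(ids[0][:5])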

  References

  1. arxiv.org

  2. cohere.com

  3. Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, 2023.

  4. Borgeaud et al., Improving language models by retrieving from trillions of tokens, 2022.

  5. Internal benchmark comparing GPT-4 Turbo with and without RAG on a legal QA corpus, 2024.

  6. Liu et al., Time-aware Retrieval for Temporal Question Answering, ACL 2025.