Retrieval-Augmented Generation (RAG) is an architecture pattern that couples a parametric language model with a non-parametric knowledge store. Before or during decoding, the system retrieves documents, embeddings, or key-value pairs most relevant to the user query and injects them into the model's context window. This hybrid approach grounds responses in up-to-date facts and reduces hallucinations without costly full model retraining.
Definition and Scope
At minimum, a RAG pipeline contains:
- An index built from dense vector embeddings or sparse BM25 signals.
- A retriever that converts the user query into the same space and returns top-k candidates.
- A generator—typically a transformer decoder—that conditions on both the original prompt and retrieved passages.
Some variants interleave retrieval with generation (iterative RAG) or treat retrieval as latent variables learned end-to-end.
Canonical Pipeline
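The components listed above compose into a single loop: embed the query, search the index for the top-k passages, and condition the generator on both the prompt and the retrieved text. The sketch below illustrates that loop with a toy hash-based embedding and a placeholder `generate` function standing in for the language model; all names are illustrative, not a specific library's API.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size bag-of-words vector.
    A real pipeline would use a trained dense encoder or BM25 here."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class VectorIndex:
    """In-memory index holding one embedding per passage."""
    def __init__(self, passages: list[str]):
        self.passages = passages
        self.matrix = np.stack([embed(p) for p in passages])

    def search(self, query: str, k: int = 3) -> list[str]:
        scores = self.matrix @ embed(query)      # cosine similarity (vectors are unit-norm)
        top = np.argsort(-scores)[:k]
        return [self.passages[i] for i in top]

def generate(prompt: str) -> str:
    """Placeholder for the generator LLM call."""
    return f"<answer conditioned on {len(prompt)} prompt characters>"

def rag_answer(question: str, index: VectorIndex) -> str:
    passages = index.search(question, k=3)
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

index = VectorIndex([
    "RAG couples a language model with an external knowledge store.",
    "BM25 is a sparse lexical retrieval function.",
    "Dense retrievers embed queries and passages into a shared vector space.",
])
print(rag_answer("What does RAG couple together?", index))
```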
Architectural Variants
Fusion-in-Decoder (FiD)
Each retrieved passage is concatenated with the query and encoded independently; the decoder then attends jointly over all encoder outputs, learning to focus on the relevant chunks.
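A schematic of that wiring, with the encoder stubbed out by random token vectors so only the data flow is shown: every (query, passage) pair is encoded on its own, and the encoder outputs are concatenated along the sequence axis for the decoder's cross-attention. This is a sketch of the shape of the computation, not a working model.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64

def encode(text: str, max_tokens: int = 16) -> np.ndarray:
    """Stub encoder: one random vector per token (stands in for a transformer encoder)."""
    n = min(len(text.split()), max_tokens)
    return rng.normal(size=(n, D_MODEL))

def fid_decoder_input(question: str, passages: list[str]) -> np.ndarray:
    """FiD-style fusion: encode each (question, passage) pair independently, then
    concatenate the encoder outputs so the decoder attends jointly over all passages."""
    encoded = [encode(f"question: {question} context: {p}") for p in passages]
    return np.concatenate(encoded, axis=0)   # shape: (total encoded length, d_model)

fused = fid_decoder_input(
    "Who proposed FiD?",
    ["Fusion-in-Decoder was proposed for open-domain QA.",
     "It scales to many retrieved passages."],
)
print(fused.shape)  # the decoder's cross-attention would read this fused sequence
```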
ReAct & Iterative RAG
The model alternates between reasoning steps and new retrieval calls, emitting search queries as actions [1]. This improves multi-hop QA.
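A bare-bones version of that reason/act loop: the model either emits a `SEARCH:` action, which triggers a fresh retrieval call, or a `FINISH:` answer that ends the loop. Here `llm_step` and `retrieve` are deterministic placeholders so the control flow, not the model, is what the sketch shows.

```python
def llm_step(scratchpad: str) -> str:
    """Placeholder for the LLM: returns 'SEARCH: <query>' or 'FINISH: <answer>'.
    It searches once and then finishes, to keep the sketch deterministic."""
    if "Observation:" not in scratchpad:
        return "SEARCH: capital of France"
    return "FINISH: Paris"

def retrieve(query: str) -> str:
    """Placeholder retriever returning a single passage."""
    return "Paris is the capital and largest city of France."

def iterative_rag(question: str, max_steps: int = 4) -> str:
    scratchpad = f"Question: {question}"
    for _ in range(max_steps):
        action = llm_step(scratchpad)
        if action.startswith("FINISH:"):
            return action.removeprefix("FINISH:").strip()
        query = action.removeprefix("SEARCH:").strip()
        passage = retrieve(query)          # new retrieval call between reasoning steps
        scratchpad += f"\nAction: search[{query}]\nObservation: {passage}"
    return "No answer within step budget."

print(iterative_rag("What is the capital of France?"))
```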
Key-Value Memory RAG
Rather than free-form text passages, retrieval returns structured ⟨key, value⟩ tuples that are mapped directly into the attention cache, enabling constant-time lookup during decoding [2].
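One way to read the constant-time lookup claim is sketched below: stored ⟨key, value⟩ vector pairs live in a hash map keyed by a coarse signature of the key vector, and a hit is appended to the decoder's attention cache instead of being tokenized into the prompt. The sign-pattern bucketing and all names here are assumptions for illustration, not a published interface.

```python
import numpy as np

DIM = 32
rng = np.random.default_rng(1)

def bucket(vec: np.ndarray) -> tuple:
    """Coarse signature used as the hash-map key: the sign pattern of the vector.
    A real system would use a learned or product-quantized code instead."""
    return tuple((vec > 0).astype(int))

# Memory maps a signature to a (key, value) pair of vectors.
memory: dict[tuple, tuple[np.ndarray, np.ndarray]] = {}

def write(key_vec: np.ndarray, value_vec: np.ndarray) -> None:
    memory[bucket(key_vec)] = (key_vec, value_vec)

def read_into_cache(query_vec: np.ndarray, kv_cache: list) -> None:
    """O(1) average-case lookup; a hit is appended straight to the attention KV cache."""
    hit = memory.get(bucket(query_vec))
    if hit is not None:
        kv_cache.append(hit)

entity = rng.normal(size=DIM)
write(entity, rng.normal(size=DIM))   # store a fact as a key/value vector pair

kv_cache: list = []
read_into_cache(entity, kv_cache)     # matching signature -> retrieved in constant time
print(len(kv_cache))                  # 1
```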
Performance Metrics
Design Trade-offs
- Large context windows reduce the need for strict top-k recall but raise inference cost.
- Dense embeddings require GPUs for index refresh; sparse BM25 scales cheaply on CPUs.
- Aggressive vector compression (product quantization) and low HNSW connectivity (M ≈ 8) save RAM but can hurt recall; see the sketch after this list.
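As a rough illustration of the last bullet, the FAISS sketch below builds an exact flat index, an HNSW index with low connectivity (M = 8), and an IVF-PQ index over the same random vectors, then measures recall against the exact results. Dataset sizes and parameters are arbitrary; the point is only that the smaller-footprint indexes trade away some recall.

```python
import faiss
import numpy as np

d, n_db, n_query, k = 64, 10_000, 100, 10
rng = np.random.default_rng(0)
xb = rng.random((n_db, d), dtype=np.float32)
xq = rng.random((n_query, d), dtype=np.float32)

# Exact search: full float32 vectors in RAM, perfect recall by definition.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, truth = flat.search(xq, k)

# HNSW with low connectivity (M = 8): fewer graph links, lower memory, recall can drop.
hnsw = faiss.IndexHNSWFlat(d, 8)
hnsw.add(xb)
_, hnsw_ids = hnsw.search(xq, k)

# IVF + product quantization: 8 bytes per vector instead of 256, at a recall cost.
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 100, 8, 8)
ivfpq.train(xb)
ivfpq.add(xb)
_, pq_ids = ivfpq.search(xq, k)

def recall_at_k(ids: np.ndarray) -> float:
    hits = sum(len(set(ids[i]) & set(truth[i])) for i in range(n_query))
    return hits / (n_query * k)

print(f"HNSW(M=8) recall@{k}: {recall_at_k(hnsw_ids):.2f}")
print(f"IVF-PQ    recall@{k}: {recall_at_k(pq_ids):.2f}")
```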
Current Trends (2025)
- Differentiable Retrieval: DRAGON-v2 trains retriever and generator jointly with REINFORCE to maximize answer likelihood.
- On-device RAG: Mobile LLMs pair a 7B-parameter model with a ~100k-entry vector store held in Apple Neural Engine SRAM for offline privacy.
- Temporal Indexing: Time-aware retrieval adds decay terms so models prefer fresh documents for news use cases [4]; a sketch of one such decay term follows this list.
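One plausible form of the decay term in the last bullet: multiply the similarity score by an exponential penalty on document age, so equally relevant documents rank by freshness. The half-life and the multiplicative combination rule here are assumptions, not taken from a specific paper.

```python
import math
import time

HALF_LIFE_DAYS = 30.0  # assumed freshness half-life; tune per use case
LAMBDA = math.log(2) / (HALF_LIFE_DAYS * 86_400)

def time_aware_score(similarity: float, doc_timestamp: float, now: float) -> float:
    """Down-weight older documents: score = similarity * exp(-lambda * age_seconds)."""
    age = max(0.0, now - doc_timestamp)
    return similarity * math.exp(-LAMBDA * age)

now = time.time()
fresh = time_aware_score(0.80, now - 1 * 86_400, now)    # one day old
stale = time_aware_score(0.80, now - 90 * 86_400, now)   # ninety days old
print(f"fresh: {fresh:.3f}  stale: {stale:.3f}")          # the fresh document ranks higher
```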
Implementation Tips
- Store both raw text and citation metadata; include source URLs in the final answer for auditability.
- Refresh embeddings nightly to capture newly crawled content.
- Use mixed precision (FP16/INT8) for the retriever to cut serving costs by roughly 40% with negligible recall loss.
- Cache top-k results per query hash to amortize popular lookups (see the sketch after this list).
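A small sketch of the caching tip: normalize the query, hash it, and memoize the retriever's top-k result with a TTL so index refreshes eventually invalidate stale entries. The retriever call is a placeholder, and the TTL value is an assumption to be aligned with the refresh cadence.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 6 * 3600     # assumed TTL; align with the index refresh cadence
_cache: dict[str, tuple[float, list[str]]] = {}

def query_hash(query: str) -> str:
    """Normalize then hash so trivially different phrasings share a cache entry."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def retrieve_top_k(query: str, k: int = 5) -> list[str]:
    """Placeholder for the real retriever call."""
    return [f"passage-{i} for '{query}'" for i in range(k)]

def cached_retrieve(query: str, k: int = 5) -> list[str]:
    key = query_hash(query)
    entry = _cache.get(key)
    if entry is not None and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]                       # cache hit: skip the index entirely
    results = retrieve_top_k(query, k)
    _cache[key] = (time.time(), results)
    return results

print(cached_retrieve("What is RAG?"))        # miss: hits the index
print(cached_retrieve("what is  RAG?"))       # hit: normalizes to the same hash
```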
Failure Modes
References
[1] Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," 2023.
[2] Borgeaud et al., "Improving Language Models by Retrieving from Trillions of Tokens," 2022.
[3] Internal benchmark comparing GPT-4 Turbo with and without RAG on a legal QA corpus, 2024.
[4] Liu et al., "Time-aware Retrieval for Temporal Question Answering," ACL 2025.