Retrieval-Augmented Generation

Benched.ai Editorial Team

Retrieval-Augmented Generation (RAG) is an architecture pattern that couples a parametric language model with a non-parametric knowledge store. Before or during decoding, the system retrieves documents, embeddings, or key-value pairs most relevant to the user query and injects them into the model's context window. This hybrid approach grounds responses in up-to-date facts and reduces hallucinations without costly full model retraining.

  Definition and Scope

At minimum a RAG pipeline contains:

  1. An index built from dense vectors or sparse lexical signals such as BM25.
  2. A retriever that encodes the user query into the same representation space and returns the top-k candidates.
  3. A generator—typically a transformer decoder—that conditions on both the original prompt and the retrieved passages.

Some variants interleave retrieval with generation (iterative RAG) or treat the retrieved documents as latent variables learned end-to-end; a minimal sketch of the core retrieve-then-generate loop follows.
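
The sketch below wires the three components together with a sentence-transformers MiniLM encoder and an in-memory passage list; the model name, the toy corpus, and the prompt template are illustrative assumptions, and the final generation call is left as a comment.

    import numpy as np
    from sentence_transformers import SentenceTransformer  # MiniLM-family encoder (assumed choice)

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Index: dense vectors for a toy corpus.
    documents = [
        "RAG couples a language model with an external knowledge store.",
        "FAISS provides approximate nearest-neighbour search over dense vectors.",
        "BM25 is a sparse lexical retrieval function.",
    ]
    doc_vectors = encoder.encode(documents, normalize_embeddings=True)

    # 2. Retriever: encode the query into the same space and return the top-k passages.
    def retrieve(query: str, k: int = 2) -> list[str]:
        q = encoder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vectors @ q                      # cosine similarity (unit vectors)
        return [documents[i] for i in np.argsort(-scores)[:k]]

    # 3. Generator input: condition the decoder on both the prompt and the retrieved passages.
    def build_prompt(query: str, passages: list[str]) -> str:
        context = "\n".join(f"- {p}" for p in passages)
        return (
            "Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        )

    question = "What does RAG pair a language model with?"
    prompt = build_prompt(question, retrieve(question))
    # prompt is then sent to any decoder (GPT-4o, Mixtral-8x22B, a local model, ...).
    print(prompt)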

  Canonical Pipeline

Stage               | Common Implementation                    | Latency Budget
Query Encoding      | MiniLM, Ada-002 embeddings [1]           | <2 ms
Search / ANN Index  | FAISS IVFPQ, Elasticsearch kNN           | <5 ms
Reranking           | ColBERT-v2, Cohere Rerank-english-v3 [2] | 3-10 ms
Prompt Construction | JSON / Markdown template                 | ~1 ms
Generation          | GPT-4o, Mixtral-8x22B                    | 30-300 ms
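
For the search stage, one common CPU-friendly choice is a FAISS IVF-PQ index. The sketch below uses random vectors and illustrative parameters (nlist, m, nprobe); real deployments tune these against the recall and latency budgets above.

    import faiss
    import numpy as np

    d, n = 64, 10_000                                   # embedding dim, corpus size
    rng = np.random.default_rng(0)
    xb = rng.standard_normal((n, d)).astype("float32")  # stand-in document embeddings

    nlist, m, nbits = 64, 8, 8                          # IVF cells, PQ sub-vectors, bits per code
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    index.train(xb)                                     # learn coarse centroids + PQ codebooks
    index.add(xb)

    index.nprobe = 8                                    # cells probed per query: recall/latency knob
    xq = rng.standard_normal((1, d)).astype("float32")  # query embedding
    distances, ids = index.search(xq, 5)                # top-5 candidates handed to the reranker
    print(ids[0])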

  Architectural Variants

    Fusion-in-Decoder (FiD)

Each retrieved passage is paired with the question and encoded independently; the encoder outputs are then concatenated and fed to the decoder, which attends jointly over all of them. The model thus learns to focus on the most relevant chunks during decoding.
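
A sketch of the FiD input construction (following the question/title/context convention of the original paper); the encoder and decoder calls are only indicated in comments.

    question = "Who introduced retrieval-augmented generation?"
    passages = [
        {"title": "RAG", "text": "Lewis et al. proposed retrieval-augmented generation in 2020."},
        {"title": "FiD", "text": "Fusion-in-Decoder fuses evidence inside the decoder."},
    ]

    # One encoder input per passage, each paired with the question.
    encoder_inputs = [
        f"question: {question} title: {p['title']} context: {p['text']}"
        for p in passages
    ]

    # Conceptually:
    #   H_i    = Encoder(encoder_inputs[i])        # one forward pass per passage
    #   answer = Decoder(concat(H_1, ..., H_k))    # cross-attention over all passages at once
    print(encoder_inputs)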

    ReAct & Iterative RAG

The model alternates between reasoning steps and new retrieval calls, emitting search queries as actions [3]. This improves multi-hop QA.
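
A minimal loop sketch of this alternation; `llm_step` and `search` are hypothetical stubs standing in for your model and retriever, not a particular library's API.

    def llm_step(transcript: str) -> str:
        """Return the model's next line: 'Search: <query>' or 'Answer: <text>'."""
        raise NotImplementedError  # call your LLM here

    def search(query: str, k: int = 3) -> list[str]:
        """Return top-k passages for the emitted search query."""
        raise NotImplementedError  # call your retriever here

    def iterative_rag(question: str, max_steps: int = 4) -> str:
        transcript = f"Question: {question}\n"
        for _ in range(max_steps):
            step = llm_step(transcript)
            transcript += step + "\n"
            if step.startswith("Answer:"):            # enough evidence gathered
                return step.removeprefix("Answer:").strip()
            if step.startswith("Search:"):            # new retrieval call emitted as an action
                for passage in search(step.removeprefix("Search:").strip()):
                    transcript += f"Observation: {passage}\n"
        return "No answer within the step budget."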

    Key-Value Memory RAG

Rather than free-form text passages, retrieval returns structured tuples ⟨key,value⟩ that are directly mapped into the attention cache, enabling constant-time lookup during decoding [4].
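
A toy illustration of the constant-time lookup side of this idea, with the structured tuples held in an ordinary hash map; here the value is simply prepended to the prompt rather than written into the attention cache.

    # (entity, relation) -> value tuples; dict lookup is O(1), no vector search needed.
    memory: dict[tuple[str, str], str] = {
        ("Eiffel Tower", "height"): "330 m",
        ("Eiffel Tower", "completed"): "1889",
    }

    def lookup(entity: str, relation: str) -> str | None:
        return memory.get((entity, relation))

    fact = lookup("Eiffel Tower", "height")
    prompt = (
        f"Known fact: Eiffel Tower height = {fact}\n"
        "Question: How tall is the Eiffel Tower?"
    )
    print(prompt)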

  Performance Metrics

Metric            | Description                                                                     | Typical Value
Recall@5          | Fraction of queries whose gold document appears in the top-5 retrieved results | 85-95 % on NQ-open
EM Score          | Exact-match accuracy of the generated answer                                    | 55-65 % (FiD-base)
Tokens per Dollar | Generation throughput per unit cost                                             | 1.4× improvement vs fine-tuned GPT-4 alone [5]
p99 Latency       | End-to-end latency, including search                                            | <600 ms for enterprise chatbots
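
Recall@k, as used in the table, reduces to a few lines; the document ids below are toy data.

    def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int = 5) -> float:
        """Fraction of queries whose gold document id appears in the top-k retrieved ids."""
        hits = sum(1 for ids, g in zip(retrieved, gold) if g in ids[:k])
        return hits / len(gold)

    retrieved = [["d3", "d7", "d1", "d9", "d2"], ["d5", "d4", "d8", "d6", "d0"]]
    gold = ["d1", "d2"]                        # gold document per query
    print(recall_at_k(retrieved, gold, k=5))   # 0.5: only the first query hits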

  Design Trade-offs

  • Large context windows reduce need for strict top-k recall but raise inference cost.
  • Dense embeddings require GPUs for index refresh; sparse BM25 scales cheaply on CPUs.
  • Aggressive compression (product quantization, low HNSW connectivity such as M≈8) saves RAM but can hurt recall; a back-of-the-envelope estimate follows the list.
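
The estimate below uses illustrative numbers (10 M vectors, 768-dimensional float32 embeddings vs. 64-byte PQ codes), not figures from a specific deployment.

    n, d = 10_000_000, 768
    float32_bytes = n * d * 4          # uncompressed dense vectors
    pq_bytes = n * 64                  # 64-byte PQ code per vector
    print(f"float32 index: {float32_bytes / 1e9:.1f} GB")  # ~30.7 GB
    print(f"PQ-64 index:   {pq_bytes / 1e9:.1f} GB")       # ~0.6 GB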

  Current Trends (2025)

  • Differentiable Retrieval: DRAGON-v2 trains retriever and generator jointly with REINFORCE to maximize answer likelihood.
  • On-device RAG: Mobile LLMs pair a 7B-parameter model with a 100k-vector store held in Apple Neural Engine SRAM for offline privacy.
  • Temporal Indexing: Time-aware retrieval adds decay terms so models prefer fresh documents for news use cases [6]; a toy scoring sketch follows.
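
A toy version of such a decay term multiplies semantic similarity by an exponential freshness factor; the decay rate below is an assumption, not a value from the cited work.

    import math

    def time_aware_score(similarity: float, age_days: float, lam: float = 0.01) -> float:
        """Blend relevance with freshness: older documents are discounted exponentially."""
        return similarity * math.exp(-lam * age_days)

    print(time_aware_score(0.82, age_days=2))    # recent article, barely discounted
    print(time_aware_score(0.82, age_days=400))  # stale article, heavily discounted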

  Implementation Tips

  1. Store both raw text and citation metadata; include source URLs in the final answer for auditability.
  2. Refresh embeddings nightly to capture newly crawled content.
  3. Use mixed-precision (FP16/INT8) for the retriever to cut serving costs by 40 % without recall loss.
  4. Cache top-k results per query hash to amortize popular lookups (see the sketch below).
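
A sketch of tip 4: an in-process cache keyed by a hash of the normalized query, with the real ANN call stubbed out. A shared store such as Redis would be the usual production choice.

    import hashlib

    cache: dict[str, list[str]] = {}

    def query_hash(query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def ann_search(query: str, k: int) -> list[str]:
        raise NotImplementedError  # call FAISS / Elasticsearch here

    def retrieve_with_cache(query: str, k: int = 5) -> list[str]:
        key = query_hash(query)
        if key not in cache:                    # miss: pay for the real ANN search once
            cache[key] = ann_search(query, k)
        return cache[key]                       # hit: popular lookups are amortized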

  Failure Modes

Symptom                              | Likely Cause                       | Mitigation
Hallucinated stats despite retrieval | Top-k misses the gold document     | Increase k or switch to a multi-vector retriever
Verbose, copy-pasted passages        | Prompt template lacks compression  | Use a summarization layer before insertion
Slow tail latencies                  | HNSW graph too deep                | Tune efSearch, deploy GPU-accelerated search
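
For the tail-latency row, efSearch is the per-query breadth knob on an HNSW index; the sketch below shows where it lives in FAISS, using random vectors and illustrative values.

    import faiss
    import numpy as np

    d = 128
    xb = np.random.default_rng(0).standard_normal((50_000, d)).astype("float32")

    index = faiss.IndexHNSWFlat(d, 32)      # M=32 neighbours per node
    index.hnsw.efConstruction = 200         # build-time graph quality
    index.add(xb)

    index.hnsw.efSearch = 64                # start high, lower it until p99 fits the budget
    _, ids = index.search(xb[:1], 10)       # one query vector, top-10 results
    print(ids[0][:5])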

  References

  1. arxiv.org

  2. cohere.com

  3. Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, 2023.

  4. Borgeaud et al., Improving language models by retrieving from trillions of tokens, 2022.

  5. Internal benchmark comparing GPT-4 Turbo with and without RAG on a legal QA corpus, 2024.

  6. Liu et al., Time-aware Retrieval for Temporal Question Answering, ACL 2025.