What is Retrieval-Augmented Generation (RAG)?

Retrieval-augmented generation (RAG) is a technique where an LLM, instead of relying solely on its training data, retrieves relevant text from an external source (documents, a vector database, the web) at query time and includes those retrieved passages in its prompt before generating an answer.

Also known as: RAG, retrieval-augmented generation, context augmentation

Why RAG exists

LLMs can't store and recall every fact reliably: they hallucinate, and their knowledge is frozen at training time. RAG mitigates both limits by treating the model as a reasoning engine rather than a knowledge store. The actual knowledge lives in external documents that you control, search, and update; the model's job is to read those documents and synthesize an answer grounded in them.

How a basic RAG pipeline works

(1) Chunk source documents into passages.
(2) Embed each chunk into a vector representation using an embedding model.
(3) Store the embeddings in a vector database (Pinecone, Weaviate, pgvector, Chroma).
(4) At query time, embed the user's question and retrieve the K nearest passages.
(5) Optionally re-rank the retrieved passages with a cross-encoder.
(6) Include the top passages in the prompt as context; the LLM answers using both its parametric knowledge and the retrieved passages.
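
The pipeline steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the `InMemoryIndex` class stands in for a vector database, and the sample documents, function names, and prompt wording are all assumptions made for the example. The optional re-ranking step is omitted, and the final LLM call is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split a document into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Toy stand-in for a vector database: a list of (vector, passage) rows."""

    def __init__(self):
        self.rows = []

    def add(self, passage):
        self.rows.append((embed(passage), passage))

    def search(self, query, k=2):
        """Return the k passages most similar to the query."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [passage for _, passage in ranked[:k]]


def build_prompt(question, passages):
    """Place the retrieved passages in the prompt as numbered context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical source documents.
docs = [
    "The refund window is 30 days from the date of purchase.",
    "Support is available by email on weekdays.",
    "Shipping to Europe takes five to seven business days.",
]

index = InMemoryIndex()
for doc in docs:
    for passage in chunk(doc):
        index.add(passage)

question = "How long is the refund window?"
top = index.search(question, k=2)
prompt = build_prompt(question, top)
print(prompt)  # this string would be sent to the LLM
```

In a real stack, `embed` would call an embedding model and `InMemoryIndex` would be replaced by one of the vector databases listed above; the control flow stays the same.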

When RAG beats fine-tuning

Use RAG when your knowledge changes often (docs, support tickets, news), when you need source citations, when the dataset is too small to fine-tune on, or when you need fine-grained access control over what the model can see. Fine-tune when you need a specific output style or to deeply embed domain reasoning that doesn't fit in any context window.

RAG in 2026

RAG has matured beyond simple nearest-neighbor retrieval. Modern stacks combine BM25 keyword search with dense embeddings (hybrid search), use re-ranking models, rewrite queries before retrieval (for example HyDE, which embeds a hypothetical answer generated by the LLM instead of the raw question), chunk hierarchically (parent-child passages), and increasingly use agentic retrieval, where the LLM decides what to search for over multiple steps. The pure "embed and retrieve" pattern is now considered the floor, not the ceiling.
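
One common way to combine a BM25 ranking with a dense-embedding ranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so RRF here is an illustrative choice rather than the only option, and the document ids and the two input rankings are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical keyword-search results
dense_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical embedding-search results

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF uses only rank positions, it avoids having to normalize BM25 scores and cosine similarities onto a common scale. The fused list can then be passed to a re-ranking model or directly into the prompt.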

Last updated 2026-05-18 · First published 2026-05-18

Related terms

Embedding

Web search AI

Large language model (LLM)

Context window
