What is Retrieval-Augmented Generation (RAG)?
Retrieval-augmented generation (RAG) is a technique in which a large language model (LLM), instead of relying solely on its training data, retrieves relevant text from an external source (documents, a vector database, the web) at query time and includes the retrieved passages in its prompt before generating an answer.
Also known as: RAG, context augmentation
Why RAG exists
LLMs can't reliably store and recall every fact: they hallucinate, and their knowledge is frozen at training time. RAG mitigates both limits by treating the model as a reasoning engine rather than a knowledge store. The actual knowledge lives in external documents that you control, search, and update; the model's job is to read those documents and synthesize an answer grounded in them.
How a basic RAG pipeline works
(1) Chunk source documents into passages.
(2) Embed each chunk into a vector representation using an embedding model.
(3) Store the embeddings in a vector database (Pinecone, Weaviate, pgvector, Chroma).
(4) At query time, embed the user's question, retrieve the K nearest passages, and include them in the prompt as context.
(5) Optionally, re-rank the retrieved passages with a cross-encoder before they go into the prompt.
(6) The LLM generates an answer using both its parametric knowledge and the retrieved passages.
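The chunk, embed, store, and retrieve steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy word-count vector standing in for a real embedding model, an in-memory list stands in for a vector database, and every function name here is hypothetical. Re-ranking and the final LLM call are left out.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Step 1: split a document into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Step 2 (toy stand-in): a sparse word-count vector.
    A real pipeline would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, max_words=40):
    """Step 3: store (passage, vector) pairs. A list stands in for a vector database."""
    return [(p, embed(p)) for doc in documents for p in chunk(doc, max_words)]


def retrieve(index, question, k=2):
    """Step 4a: embed the question and return the k most similar passages."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


def build_prompt(question, passages):
    """Step 4b: place the retrieved passages in the prompt as numbered context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Calling `build_index` on a handful of documents, then `retrieve` and `build_prompt` for a question, yields the prompt string that would be sent to the LLM. Numbering the passages in the prompt is one simple way to let the model cite its sources.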
When RAG beats fine-tuning
Use RAG when your knowledge changes often (docs, support tickets, news), when you need source citations, when the dataset is too small to fine-tune on, or when you need fine-grained access control over what the model can see. Fine-tune when you need a specific output style or to deeply embed domain reasoning that doesn't fit in any context window.
RAG in 2026
RAG has matured beyond simple nearest-neighbor retrieval. Modern stacks combine BM25 keyword search with dense embeddings, use re-ranking models, rewrite or expand queries (for example HyDE, which embeds a model-generated hypothetical answer instead of the raw question), chunk hierarchically (parent-child passages), and increasingly use agentic retrieval, in which the LLM decides what to search for over multiple steps. The pure "embed and retrieve" pattern is now considered the floor, not the ceiling.
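Combining keyword and dense retrieval needs some way to merge two ranked lists. The text above does not name one, so the sketch below assumes reciprocal rank fusion (RRF), a common choice that needs only rank positions rather than comparable scores. The function name and document ids are hypothetical, and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.

    Each document earns 1 / (k + rank) for every list it appears in
    (rank starts at 1), so documents ranked well by both retrievers
    rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 search and a dense-embedding search.
keyword_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
```

Here `b` and `c` appear in both lists and so outrank `a` and `d`, which each appear in only one. The fused list would then typically be passed to a re-ranker or truncated to the top K before building the prompt.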