RAG stands for Retrieval-Augmented Generation. Instead of asking an AI model to answer from its training data, you retrieve relevant information and give it to the model. The model then generates responses grounded in that information. RAG pipelines are one of the most practical recent advances in AI.
The problem that RAG solves: language models have a knowledge cutoff. They can't know about recent events. They can't know about your organization's proprietary information. They can hallucinate (make things up confidently). RAG addresses this by providing current, relevant information to the model.
A basic RAG pipeline: user asks a question, the retriever searches for relevant documents, those documents are included in the prompt sent to the model, the model generates a response. Simple, but powerful.
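That loop can be sketched in a few lines. This is a toy: the retriever is a simple keyword-overlap scorer, and `generate()` is a stub standing in for a real LLM API call.

```python
# Minimal RAG loop: retrieve -> build prompt -> generate.
# The retriever and generate() here are illustrative stand-ins.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def generate(prompt: str) -> str:
    """Stub for the model call; a real pipeline would call an LLM API here."""
    return f"(model response grounded in prompt of {len(prompt)} chars)"

def rag_answer(query: str, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(query, documents))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

docs = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
]
print(rag_answer("How long do refunds take?", docs))
```

Swapping the toy pieces for a real retriever and model call leaves the overall shape unchanged, which is why the pattern is so portable.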
The challenge is in the retriever. It needs to find the genuinely relevant documents among potentially millions of candidates. If it retrieves irrelevant documents, the model produces garbage ("garbage in, garbage out"). Retrievers use vector similarity, keyword matching, semantic search, and other techniques to find relevant documents.
Pipeline steps: documents are ingested and preprocessed (convert to text, handle PDFs, etc.). Documents are chunked into smaller pieces (a 100-page document is split into 500-word chunks; otherwise, each document is too large for the prompt). Chunks are embedded (converted to vectors representing meaning). Embeddings are stored in a vector database. At query time, the query is embedded, similar embeddings are retrieved, the corresponding documents are fetched, and they're included in the prompt.
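The ingest-and-query steps above can be sketched end to end. This is a toy: the bag-of-words `Counter` stands in for a learned embedding model, and a plain Python list stands in for the vector database.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words vector (real systems use learned embeddings)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingest: chunk each document and store (embedding, chunk) pairs --
# the list plays the role of the vector database.
docs = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
]
store = [(embed(c), c) for doc in docs for c in chunk(doc)]

# Query time: embed the query and fetch the most similar chunk.
query_vec = embed("when are refunds processed")
best = max(store, key=lambda pair: cosine(query_vec, pair[0]))
print(best[1])  # the chunk that would be included in the prompt
```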
Quality issues arise at multiple stages. Bad chunking (splitting in the middle of important concepts) produces chunks that don't make sense. Stale embeddings (the document changed but the embedding is old) produce irrelevant retrieval. Poor chunk ranking (the most relevant chunk is ranked 10th) means the model doesn't see it.
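A common mitigation for mid-concept splits is overlapping chunks, so text cut at one boundary appears whole in the next chunk. A minimal sketch (the 100-word size and 20-word overlap are arbitrary choices for illustration):

```python
def chunk_with_overlap(words: list[str], size: int = 100, overlap: int = 20) -> list[list[str]]:
    """Split a word list into chunks of `size` words, each sharing
    `overlap` words with the previous chunk."""
    chunks, i = [], 0
    while i < len(words):
        chunks.append(words[i:i + size])
        if i + size >= len(words):
            break
        i += size - overlap
    return chunks

doc_words = [f"w{i}" for i in range(250)]
chunks = chunk_with_overlap(doc_words)
# chunks[0][-20:] == chunks[1][:20] -- the shared overlap region
```

Overlap trades some storage and retrieval redundancy for a lower chance that a key sentence is split across two chunks that each lose its meaning.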
Multi-stage retrieval is increasingly common. A fast, cheap retrieval stage narrows down candidates. Then a slower, more expensive stage ranks them. This balances speed and accuracy.
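A sketch of that retrieve-then-rerank shape: a cheap overlap filter casts a wide net, then an "expensive" scorer refines the order. In practice the second stage is often a cross-encoder scoring each (query, candidate) pair; here a length-normalized overlap stands in, and all documents are made up.

```python
def first_stage(query: str, docs: list[str], k: int = 20) -> list[str]:
    """Fast recall stage: rank by raw keyword overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def rerank(query: str, candidates: list[str], k: int = 3) -> list[str]:
    """Slow precision stage: a stand-in for a cross-encoder scorer."""
    q = set(query.lower().split())
    def score(d: str) -> float:
        words = d.lower().split()
        return len(q & set(words)) / (len(words) ** 0.5)
    return sorted(candidates, key=score, reverse=True)[:k]

docs = [
    "refund policy summary",
    "refund policy details terms conditions exceptions appendix",
    "shipping times overview",
]
candidates = first_stage("refund policy", docs)   # wide net, cheap
print(rerank("refund policy", candidates, k=2))   # refined order, expensive
```

Because the expensive stage only sees the few candidates the cheap stage kept, total latency stays close to the cheap stage alone.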
Hybrid search combines different retrieval methods. Keyword search is fast but doesn't understand meaning. Semantic search understands meaning but is slower. Using both, then combining results, often works better than either alone.
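A common way to combine the two result lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. A sketch with made-up document IDs (60 is the conventional RRF constant):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank + 1)
    per document, so documents ranked high in any list float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits  = ["doc_b", "doc_a", "doc_c"]   # from keyword search
semantic_hits = ["doc_a", "doc_d", "doc_b"]   # from vector search
print(rrf([keyword_hits, semantic_hits]))     # doc_a wins: ranked high in both
```

RRF sidesteps the problem that keyword and embedding scores live on incompatible scales, since only positions matter.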
The evaluation problem is real. How do you know whether your RAG pipeline is working? You can measure retrieval quality (did we retrieve the documents needed to answer the question?) separately from generation quality (did the model generate a good response given the documents?).
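Retrieval quality can be measured against a labeled set of questions and their gold documents; recall@k is the simplest such metric. A sketch (the document IDs are made up):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# 2 of the 3 gold documents appear in the top 5 retrieved.
score = recall_at_k(["d1", "d4", "d2", "d9", "d7"], {"d1", "d2", "d3"}, k=5)
print(score)  # 2/3
```

Generation quality is harder to automate and is often judged by humans or by a grader model, but keeping the two measurements separate tells you which stage to fix.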
RAG is enabling a new class of applications: customer support bots that answer based on your documentation, financial advisors that ground recommendations in market data, research assistants that cite sources. It's not perfect (models still hallucinate, and retrieval can miss relevant documents), but it's much better than pure generation.
Why It Matters
RAG is the bridge between AI's impressive capabilities and actual organizational knowledge. It's what enables AI to be knowledgeable about your specific domain without retraining.
Example
A software company uses RAG for customer support: user asks a question, RAG retrieves relevant documentation and previous support tickets, those documents are sent to the AI model, the model generates a response grounded in actual documentation. This keeps answers accurate and up to date, and lets the bot cite the documentation it drew from.