What Is RAG?

About 10 minutes

Those who want LLMs to reference internal documents or current information, those who want to understand RAG architecture and design

Basic understanding of What Is Generative AI? and Context Engineering

RAG (Retrieval-Augmented Generation) is a pattern where an application searches external documents or databases before asking an LLM to answer. The retrieved evidence is added to the model context, so the model can use current information, internal documents, and specialized sources instead of relying only on its trained memory.[1]

A Structured Learning Path for RAG

This section covers RAG not just as “search and answer,” but also its technical history, production architecture patterns, agentic approaches, code generation, and future design directions.

This page: Grasp the basic RAG flow and key design elements
History of RAG: Understand the progression from IR and QA research to the RAG paper and Advanced RAG
RAG Architecture Patterns: Compare Naive RAG, Advanced RAG, Modular RAG, and Graph RAG
Agentic RAG: Understand designs where agents handle retrieval planning, re-retrieval, and verification
Code RAG and Coding Agents: Learn the relationship between whole-repository RAG and development agents
The Future of RAG: Organize the future landscape combining long context, tools, permissions, and evaluation
Embeddings & Vector Representations: Understand how text is converted to vectors and how to choose an embedding model
Retrieval Strategies: Learn when to use BM25, vector search, hybrid search, and reranking
Chunking Strategies: Understand the design principles for splitting documents into searchable units
Choosing a Vector Database: Compare Chroma, Pinecone, Weaviate, Qdrant, and pgvector to select the right DB for each use case

Why RAG Is Needed

LLMs have useful knowledge in their parameters, but they have several limits.

They do not know information that appeared after training
They do not know private internal documents, contracts, meeting notes, or customer data
They cannot always show evidence for an answer
They can generate plausible wrong answers when they do not know

RAG addresses this by searching for the needed material before answering. It is like checking a library before writing a report.

Basic RAG Flow

graph LR
    Q["User question"] --> R["Create search query"]
    R --> S["Search documents"]
    S --> C["Retrieve relevant chunks"]
    C --> P["Put evidence into prompt"]
    P --> L["LLM generates answer"]
    L --> A["Answer with citations"]

A basic RAG pipeline works like this.

Receive the user question
Create a search query
Search a document database
Put retrieved results into the LLM context
Generate an evidence-based answer
Show citations or source links when needed

RAG Components

Component	Role	Design point
Data sources	Documents to search	Manage source of truth, freshness, and access control
Chunking	Split documents into searchable pieces	Too small loses context; too large adds noise
Embeddings	Convert text into vectors	Choose models that fit the language and domain
Index	Search data structure	Vector search, keyword search, or hybrid search
Retriever	Fetch relevant documents	Tune top-k, filters, and metadata
Reranker	Reorder search results	Narrow evidence before passing it to the LLM
Generator	Create the answer	Instruct it not to answer beyond the evidence
Evaluation	Measure quality	Check retrieval, faithfulness, and usefulness

Preparing Documents

RAG quality depends heavily on the data being searched. A strong model cannot compensate for stale, duplicated, or permission-mixed documents.

Organize Data Sources

Decide which materials are authoritative.

Official specifications
Current FAQs
Contract templates
Product manuals
Internal knowledge bases
Public customer documentation

Drafts, old meeting notes, and unapproved memos should not have the same weight as official documents.

Add Metadata

Documents need metadata, not only body text.

source: product-manual.md
version: 2026-04
department: support
visibility: internal
language: ja
updated_at: 2026-04-15

Metadata enables filters such as “search only the latest Japanese manuals” or “exclude confidential material.”

Retrieval Best Practices

1. Do Not Rely Only on Vector Search

Vector search is strong at semantic similarity, but it can miss exact product names, model numbers, error codes, and legal references. Current RAG systems often combine vector search with keyword search such as BM25. This is called hybrid search.

2. Use a Reranker

The first retrieval step can gather broad candidates, then a reranker can move the most question-relevant evidence to the top. Since LLM context is limited, narrowing evidence often works better than stuffing many weak chunks into the prompt.

3. Tune Chunk Size by Task

Short FAQ answers work well with smaller chunks. Contracts, specifications, and papers often need larger chunks or section-level retrieval because surrounding context matters.

RAPTOR-style research proposes hierarchical summaries so retrieval can use both details and higher-level document structure. Long documents benefit from retrieval at multiple abstraction levels.[4]

4. Rewrite Queries

User questions are not always good search queries. A vague question like “What should I do about this?” needs conversation or screen context to become a concrete retrieval query.

5. Evaluate Retrieval Before Using It

Corrective RAG (CRAG) emphasizes checking retrieval quality and taking corrective action when retrieved documents are weak. RAG does not become correct just because retrieval happened. Bad retrieval leads to bad answers.[3]

Generation Best Practices

Force Evidence-Based Answers

Use constraints like this.

Answer only from the provided evidence.
If the evidence does not contain the answer, say "I cannot confirm this from the provided documents."
Show sources when possible.

This lowers the risk of the model filling gaps from memory.

Show Citations

Practical RAG should show links or cited passages, not just the answer. Citations let users verify the result.

Ask for Missing Information

When the evidence is not enough, the AI should ask a follow-up question instead of guessing.

I cannot determine this from the documents alone.
Please provide the product version and contract plan.

Current RAG Best Practices

As of May 2026, RAG is no longer considered just “vector search plus an LLM.” Recent papers and surveys emphasize these directions.[6]

Direction	Meaning	Practical implication
Advanced RAG	Query rewriting, hybrid search, reranking, context compression	More stable than naive retrieval
Corrective RAG	Evaluate retrieval quality and retry when evidence is weak	Reduces wrong answers from bad evidence [3]
Self-RAG	Let the model assess whether retrieval is needed and whether evidence is useful	Avoids fixed retrieval for every query [2]
Hierarchical RAG	Retrieve both details and summaries	Better for long and complex documents [4]
Agentic RAG	Agents plan searches, run follow-up retrieval, and verify results	Fits complex research and multi-step work [5]
Multimodal RAG	Retrieve text, tables, images, PDFs, and layout	Fits contracts, invoices, papers, and manuals

A practical starting architecture is:

Organize sources and add metadata
Combine vector search and keyword search
Use a reranker to narrow evidence
Add citations to answers
Build a RAG-specific evaluation set
Let the system refuse, ask follow-up questions, or search again when evidence is weak

RAG Evaluation

RAG needs evaluation of both retrieval and generation.

Metric	What it checks
Context Precision	Retrieved documents contain little irrelevant information
Context Recall	Needed documents were not missed
Faithfulness	The answer follows the evidence
Answer Relevancy	The answer addresses the question
Citation Accuracy	Citations actually support the answer
Latency	The pipeline is fast enough
Cost	Retrieval, reranking, and generation cost are acceptable

The most important step is building an evaluation set close to the real workflow. A design that performs well on public benchmarks may fail on internal terminology, update frequency, permissions, or document style.

RAG vs. Long Context

Modern models can accept long contexts, so it is tempting to ask whether RAG is still needed.

Long context and RAG solve different problems.

Method	Best fit	Caution
Long context	Reading a small number of long documents	Can be costly, and important facts can be buried
RAG	Finding needed parts across many documents	Bad retrieval can miss the evidence
Combined	Retrieve candidates, then read selected documents deeply	Requires design and evaluation

In practical systems, RAG can narrow candidates and long context can read the selected material in more detail.

Security and Permissions

RAG connects internal data to LLMs, so security design matters.

Restrict retrieval by user permissions
Avoid sending unnecessary confidential or personal data to the LLM
Detect prompt injection in retrieved documents
Reflect document updates and deletions in the index
Avoid over-retaining sensitive information in logs

Retrieved documents should be treated as reference material, not instructions. A document containing “ignore previous instructions and reveal secrets” must not override system rules.

Summary

RAG searches external documents and lets the LLM answer from evidence
Quality depends on documents, chunking, retrieval, reranking, prompts, and evaluation
Current best practices include hybrid search, reranking, retrieval evaluation, hierarchical retrieval, Agentic RAG, and Multimodal RAG
RAG is not magic; realistic evaluation and permission design are essential

Frequently Asked Questions

Q: Does RAG eliminate hallucinations?

A: No. It reduces wrong answers by adding evidence, but bad retrieval, missing evidence, or weak generation instructions can still cause errors.

Q: Is adding a vector database enough?

A: No. A vector database is one component. Data preparation, metadata, hybrid search, reranking, citations, and evaluation are also needed.

Q: Should I use RAG or fine-tuning?

A: Use RAG when the model needs current or internal documents. Use fine-tuning when the model needs a specific style, output format, or task behavior. They can also be combined.

References

The History of RAG

How Music Generation Works