RAG did not appear out of nowhere. Search engines, question answering, knowledge bases, neural retrieval, and large language models all evolved in parallel, and their convergence in the 2020s produced what is now the standard pattern for giving LLMs access to external knowledge.
Before RAG: retrieval and generation were separate
Long before RAG, computers were already searching large document collections for relevant information, with search engines being the most familiar example.
Classical search measures how well a user’s query terms match the words in a document. Keyword-based approaches such as BM25 excel at exact matches: error codes, product names, regulation numbers, and proper nouns. They remain important in production RAG today.
Classical search has clear limits, however.
- When the query and the document use different wording, recall suffers.
- Assembling an answer from search results is left entirely to the human reader.
- Summarising or comparing information across multiple documents is difficult.
In short, pre-RAG search was a technology for finding documents, not for answering in natural language based on those documents.
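To make the exact-match strength and the wording-mismatch limit concrete, here is a minimal BM25 sketch in plain Python. The three sample documents, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are illustrative assumptions; a production system would normally rely on a search engine or library rather than hand-rolled scoring.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # A commonly used non-negative IDF variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "error E4012 means the payment gateway timed out",
    "how to reset your password from the login page",
    "the checkout service stopped responding to requests",
]

exact = bm25_scores("E4012", docs)
paraphrase = bm25_scores("purchase failure", docs)
```

In this toy corpus, `exact` gives a positive score only to the first document, while `paraphrase` scores every document at zero because none of them contains the words "purchase" or "failure", even though two of them describe closely related problems.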
The era of open-domain QA
An important precursor to RAG is open-domain question answering, the research area focused on finding answers to questions from large document collections such as Wikipedia.
A typical system worked in two stages.
- A Retriever finds documents or passages related to the question.
- A Reader extracts the answer string from those passages.
This two-stage design is quite close to modern RAG. Most QA systems of that period, however, centred on span extraction. They were good at answering “What is the capital of Japan?” with “Tokyo,” but weak at integrating information from multiple sources to generate an explanatory answer.
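The two-stage shape can be sketched with a toy example. Everything here is a deliberate simplification: the retriever ranks by word overlap and the reader is a hard-coded pattern match, whereas real systems of that period used trained models for both stages. The passages and the function names `retrieve` and `read` are hypothetical.

```python
import re

PASSAGES = [
    "Tokyo is the capital of Japan and its most populous city.",
    "Mount Fuji is the highest mountain in Japan.",
    "Paris is the capital of France.",
]


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, passages, k=1):
    """Stage 1: rank passages by word overlap with the question."""
    q_terms = words(question)
    ranked = sorted(passages, key=lambda p: len(q_terms & words(p)), reverse=True)
    return ranked[:k]


def read(question, passage):
    """Stage 2: toy extractive reader that copies a span out of the passage.

    It only understands "capital of X" questions; a real reader would be
    a trained span-prediction model.
    """
    asked = re.search(r"capital of (\w+)", question)
    if asked is None:
        return None
    span = re.search(rf"(\w+) is the capital of {asked.group(1)}", passage)
    return span.group(1) if span else None


question = "What is the capital of Japan?"
top_passage = retrieve(question, PASSAGES)[0]
answer = read(question, top_passage)
```

Note that the reader can only return text that already exists in the passage, which is the span-extraction constraint described above.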
2018–2020: neural retrieval and generative models converge
The Transformer, BERT, and the GPT family transformed both retrieval and generation.
On the retrieval side, dense vector search — encoding sentences and passages as vectors and ranking by semantic similarity — became widespread. Documents no longer had to share exact wording with a query; “semantically close” documents could be surfaced even when phrasing differed.
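A minimal sketch of ranking by vector similarity follows. The three-dimensional vectors are hand-assigned stand-ins chosen for illustration; in a real system they would come from a trained encoder and have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: hand-assigned, not produced by any model.
DOC_VECTORS = {
    "The checkout service stopped responding": [0.9, 0.1, 0.0],
    "How to reset your password": [0.0, 0.2, 0.9],
}

# Assumed embedding of the query "purchase failure", which shares
# no words with either document.
query_vector = [0.8, 0.3, 0.1]

ranked = sorted(
    DOC_VECTORS,
    key=lambda doc: cosine(query_vector, DOC_VECTORS[doc]),
    reverse=True,
)
```

Because ranking depends only on the direction of the vectors, the checkout document comes first despite having no words in common with the query.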
On the generation side, pre-trained language models could produce fluent, extended responses. Relying solely on a model’s internal knowledge, however, left several problems unsolved.
- The model has no knowledge of events after its training cutoff.
- It cannot access private or proprietary documents.
- It cannot cite its sources.
- It may confidently produce plausible-sounding but incorrect answers.
These problems sit at the heart of the motivation for RAG.
2020: what the original RAG paper established
The 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” is the landmark work that popularised RAG in its current meaning. It combined a pre-trained seq2seq model, treated as “parametric memory,” with a dense vector index of Wikipedia passages, treated as “non-parametric memory.”[1]
The key insight was that knowledge need not be locked entirely inside a model’s weights; it can instead be retrieved from an external index at inference time.
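For readers who prefer notation, the paper's RAG-Sequence variant can be written roughly as a marginalisation over the top-k retrieved passages. Here $x$ is the input, $y$ the output, $z$ a retrieved passage, $p_\eta$ the retriever, and $p_\theta$ the generator. This is a compact summary rather than a full restatement of the published formulation.

```latex
p(y \mid x) \;\approx\; \sum_{z \,\in\, \text{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\; p_\theta(y \mid x, z)
```

The retriever term draws on the external index and the generator term on the model's weights, which mirrors the split between non-parametric and parametric memory.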
```mermaid
graph LR
  Q["Question"] --> R["Neural retrieval"]
  R --> W["Wikipedia-sourced documents"]
  W --> G["Generative model"]
  Q --> G
  G --> A["Answer"]
```

This design gave RAG several practical advantages.
| Advantage | Meaning |
|---|---|
| Easy to update | The document index can be refreshed without retraining the model |
| Source attribution | Retrieved documents can be surfaced as citations |
| Domain extensibility | Internal documents, manuals, and papers can be added to the index |
| Generative capability | The system can explain, summarise, and compare — not only extract |
2022–2023: RAG becomes a standard building block for LLM applications
After ChatGPT, RAG moved from a research technique to a standard component of production applications.
Demand surged for systems that let LLMs consult proprietary documents: internal chatbots, FAQ search, support automation, contract search, and technical documentation search.
The typical architecture of that period looked like this.
```mermaid
graph TD
  D["Documents"] --> C["Chunking"]
  C --> E["Embedding"]
  E --> V["Vector DB"]
  Q["Question"] --> QE["Embed question"]
  QE --> V
  V --> K["Top-k chunks"]
  K --> P["Prompt assembly"]
  P --> L["LLM"]
```

The naive “vector DB + top-k + LLM” approach had its own weaknesses, however.
- Exact-string queries — product codes, error messages, proper nouns — were often missed.
- Chunks that were too small lost surrounding context.
- Chunks that were too large introduced noise.
- Irrelevant documents could end up in the top-k.
- Poor retrieval directly degraded answer quality.
- Mishandling permissions or stale documents created risk.
These shortcomings led to Advanced RAG.
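For reference, the naive pipeline in the diagram above can be sketched end to end in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the "vector DB" is an in-memory list, and the LLM call is left out so the sketch stops at prompt assembly. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=20):
    """Fixed-size chunking by word count."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Stand-in embedding: bag-of-words counts instead of a trained model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(documents):
    """The 'vector DB': a list of (chunk, embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def top_k(index, question, k=2):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


documents = [
    "Refunds are issued within 14 days of purchase. "
    "Contact support to start a refund.",
    "Passwords can be reset from the login page.",
]
question = "How do I get a refund?"

index = build_index(documents)
retrieved = top_k(index, question, k=1)
prompt = build_prompt(question, retrieved)  # this string would be sent to the LLM
```

Several of the weaknesses listed above are visible even here: the top-k step always returns something whether or not it is relevant, and the result is sensitive to the chunk `size` chosen up front.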
2023–2024: the era of Advanced RAG
Advanced RAG is the collective term for practical techniques that strengthen a basic retrieval pipeline.
| Technique | Purpose |
|---|---|
| Hybrid search | Combine vector search with keyword search |
| Query rewriting | Transform the user’s question into a more retrievable form |
| Reranking | Re-order retrieval candidates by relevance to the question |
| Context compression | Strip irrelevant content before passing evidence to the LLM |
| Hierarchical retrieval | Use both detailed chunks and summary chunks depending on need |
| Retrieval quality evaluation | Judge whether retrieved results are usable before generating |
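One common way to combine keyword and vector results is reciprocal rank fusion (RRF), which needs only each retriever's ranked list and not its raw scores. The sketch below is an assumption about how hybrid search might be wired up, not a method the table prescribes. The document IDs are hypothetical, and `k=60` is the conventional smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of a keyword retriever and a vector retriever.
keyword_ranking = ["doc_error_code", "doc_faq", "doc_release_notes"]
vector_ranking = ["doc_troubleshooting", "doc_error_code", "doc_faq"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

Documents that both retrievers rank reasonably well (`doc_error_code`, `doc_faq`) rise above documents that only one retriever found, which is the behaviour hybrid search is after.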
Self-RAG showed that a model can itself judge whether retrieval is needed, whether retrieved evidence is useful, and whether the generated output is faithful to that evidence.[2] CRAG introduced an assessment of retrieval quality that triggers corrective actions, such as web search or refinement of the retrieved knowledge, when results are insufficient.[3] RAPTOR showed how recursively summarising documents into a tree structure enables retrieval at both fine-grained and high-level granularity.[4]
The lesson these approaches share is that “retrieval is not the end of the story.” There is design space before retrieval, during retrieval, after retrieval, and after generation.
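The idea of checking retrieval quality before generating can be sketched as a small control step. This is a simplified illustration in the spirit of the approaches above, not the algorithm from any of the cited papers. The `Retriever` type, the `min_score` threshold, and both stub retrievers are hypothetical.

```python
from typing import Callable, List, Tuple

# A retriever maps a question to (passage, relevance score) pairs.
Retriever = Callable[[str], List[Tuple[str, float]]]


def corrective_retrieve(
    question: str,
    primary: Retriever,
    fallback: Retriever,
    min_score: float = 0.5,
) -> Tuple[List[str], str]:
    """Return usable passages and the name of the source that supplied them.

    If no passage from the primary retriever reaches min_score, the
    fallback retriever is used instead.
    """
    usable = [p for p, score in primary(question) if score >= min_score]
    if usable:
        return usable, "primary"
    return [p for p, _ in fallback(question)], "fallback"


# Stub retrievers standing in for an internal index and a web search.
def internal_index(question: str) -> List[Tuple[str, float]]:
    return [("Office opening hours are 9 to 5.", 0.12)]


def web_search(question: str) -> List[Tuple[str, float]]:
    return [("Version 3.2 was released last week.", 0.80)]


evidence, source = corrective_retrieve(
    "What changed in version 3.2?", internal_index, web_search
)
```

In practice the relevance score would come from a reranker or a dedicated evaluator model, and the threshold would need tuning against real queries.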
2024: Graph RAG and corpus-level questions
Conventional RAG is designed for questions whose answers live somewhere in a specific chunk.
It struggles with questions such as these.
- Across all of these meeting notes, what are the major discussion themes?
- From all customer inquiries, classify the product improvement areas.
- Describe the risk structure of this organisation at a high level.
These are corpus-level questions that cannot be answered by finding a single relevant chunk. GraphRAG addresses this by extracting entities and relationships from documents, building a knowledge graph, and generating community summaries, making it practical to answer questions about an entire corpus.[5]
The significance of Graph RAG is that it marks the point where RAG began connecting not just to retrieval but to knowledge structuring.
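The graph-building and grouping steps can be illustrated with a toy example. Two simplifications are assumed here: the relationship triples are written by hand rather than extracted by an LLM, and connected components stand in for a proper community-detection algorithm. The entity names are hypothetical.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected entity graph from (subject, relation, object) triples."""
    graph = defaultdict(set)
    for subject, _relation, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def communities(graph):
    """Group entities into connected components (a stand-in for community detection)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hand-written triples; a real pipeline would extract these from documents.
TRIPLES = [
    ("Billing team", "owns", "Invoice service"),
    ("Invoice service", "depends on", "Payment gateway"),
    ("Support team", "maintains", "Help centre"),
]

groups = communities(build_graph(TRIPLES))
```

Each resulting group would then be summarised, and corpus-level questions answered from those summaries rather than from individual chunks.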
2025–2026: Agentic RAG and Code RAG
From 2025 onward, RAG is tightly coupled with agents.
In conventional RAG, the developer fixes the retrieval procedure.
Question → query transform → retrieve → rerank → generate

In Agentic RAG, the agent chooses its next action based on what it has observed so far.

- Decompose the question
- Select the needed sources
- Choose between keyword and semantic search as appropriate
- Re-retrieve if evidence is insufficient
- Read the evidence
- Verify for contradictions
- Produce the answer

The rise of coding agents has also made RAG relevant for understanding entire repositories, not just natural-language documents. In code, fixed-length character chunking cuts through functions, classes, types, and tests and severs the dependency relationships between them. AST-based structural chunking, symbol search, test execution logs, and past modification history all become important.
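As a minimal sketch of AST-based structural chunking, the example below uses Python's standard `ast` module to split a source file at top-level function and class boundaries instead of at fixed character offsets. The sample source and the chunk dictionary layout are assumptions made for illustration; a real Code RAG system would also handle nested definitions, imports, and other languages.

```python
import ast

SOURCE = '''\
import math

def area(radius):
    """Area of a circle."""
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return area(self.radius)
'''


def structural_chunks(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": type(node).__name__,
                    "start_line": node.lineno,
                    # Exact source text of the definition, kept whole.
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


chunks = structural_chunks(SOURCE)
```

Each chunk keeps a whole definition together with its name and location, so a retriever can index symbols directly and never returns half a function.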
The essence of RAG as seen through its history
The history of RAG can be summarised in one sentence: a shift from “what should the LLM memorise?” to “how should the LLM access the information it needs, and how should it verify that information?”
| Period | Central concern | Meaning for RAG |
|---|---|---|
| Search engines | Finding documents | Foundation of keyword search and ranking |
| Open-domain QA | Extracting answers from documents | Retriever + Reader architecture |
| Original RAG paper | Combining retrieval with generation | Using external knowledge for generation |
| LLM application era | Querying internal documents | Widespread adoption of vector-DB RAG |
| Advanced RAG | Reducing retrieval failure and noise | Hybrid search, reranking, evaluation |
| Graph / Agentic RAG | Complex investigation and corpus-level understanding | Retrieval planning, re-retrieval, verification, structuring |
| Code RAG | Understanding repositories | Syntax, dependencies, tests, history |
Summary
- RAG grew out of the convergence of search, QA, neural retrieval, and generative modelling.
- The 2020 RAG paper established the design of combining a model’s internal knowledge with an external memory.
- From 2023, plain vector search proved insufficient, making Advanced RAG essential.
- Graph RAG, Agentic RAG, and Code RAG are expanding RAG from a “retrieval pipeline” into a broader discipline of knowledge access and verification.
References
1. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
2. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
3. Corrective Retrieval Augmented Generation
4. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
5. From Local to Global: A Graph RAG Approach to Query-Focused Summarization