The History of RAG

About 10 minutes

Prerequisites: “What Is RAG?” and the basics covered in “What Is an LLM?”

RAG did not appear out of nowhere. Search engines, question answering, knowledge bases, neural retrieval, and large language models each evolved along their own lines, and in the 2020s they converged into what is now the standard pattern for giving LLMs access to external knowledge.

Before RAG: retrieval and generation were separate

Long before RAG, computers were already searching large document collections for relevant information — search engines being the most familiar example.

Classical search measures how well a user’s query terms match the words in a document. Keyword-based approaches such as BM25 excel at exact matches: error codes, product names, regulation numbers, and proper nouns. They remain important in production RAG today.
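To make the exact-match behaviour concrete, here is a minimal sketch of BM25 scoring in plain Python. The whitespace tokenizer, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are illustrative assumptions, not taken from any particular search engine.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for this toy corpus.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical sample documents.
docs = [
    "error E1234 occurs when the disk is full",
    "the printer reports a paper jam",
    "how to free disk space on the server",
]
scores = bm25_scores("E1234 disk", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The first document wins because it contains the rare token `E1234` verbatim. A document that describes the same fault in different words would score zero, which is the limitation discussed next.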

Classical search has clear limits, however.

When the query and the document use different wording, recall suffers.
Assembling an answer from search results is left entirely to the human reader.
Summarising or comparing information across multiple documents is difficult.

In short, pre-RAG search was a technology for finding documents, not for answering in natural language based on those documents.

The era of open-domain QA

An important precursor to RAG is open-domain question answering — the research area focused on finding answers to questions from large document collections such as Wikipedia.

A typical system worked in two stages.

A Retriever finds documents or passages related to the question.
A Reader extracts the answer string from those passages.

This two-stage design is quite close to modern RAG. Most QA systems of that period, however, centred on span extraction. They were good at answering “What is the capital of Japan?” with “Tokyo,” but weak at integrating information from multiple sources to generate an explanatory answer.

2018–2020: neural retrieval and generative models converge

The Transformer, BERT, and the GPT family transformed both retrieval and generation.

On the retrieval side, dense vector search — encoding sentences and passages as vectors and ranking by semantic similarity — became widespread. Documents no longer had to share exact wording with a query; “semantically close” documents could be surfaced even when phrasing differed.
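The ranking step of dense retrieval can be sketched in a few lines. The three-dimensional vectors below are hand-made stand-ins for the output of a real encoder, which would produce hundreds of dimensions. They are assumptions chosen only to show that a query and a passage with no words in common can still end up close together.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical passage embeddings (a real system would call an encoder model).
passages = {
    "The vehicle's engine fails to turn over.": [0.9, 0.1, 0.0],
    "Quarterly revenue grew by eight percent.": [0.0, 0.2, 0.9],
    "Reset your password from the login page.": [0.1, 0.9, 0.1],
}

# Hypothetical embedding of the query "My car won't start".
query_vec = [0.8, 0.2, 0.1]

# Rank passages by similarity to the query, most similar first.
ranked = sorted(passages, key=lambda p: cosine(query_vec, passages[p]), reverse=True)
```

With these assumed vectors the engine passage ranks first even though it shares no wording with the query.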

On the generation side, pre-trained language models could produce fluent, extended responses. Relying solely on a model’s internal knowledge, however, left several problems unsolved.

The model has no knowledge of events after its training cutoff.
It cannot access private or proprietary documents.
It cannot cite its sources.
It may confidently produce plausible-sounding but incorrect answers.

These problems sit at the heart of the motivation for RAG.

2020: what the original RAG paper established

The 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” is the landmark work that established RAG in its current sense. It combined a pre-trained seq2seq model, treated as “parametric memory,” with a dense vector index of Wikipedia passages, treated as “non-parametric memory.”[1]

The key insight was that knowledge need not be locked entirely inside a model’s weights; it can instead be retrieved from an external index at inference time.

graph LR
    Q["Question"] --> R["Neural retrieval"]
    R --> W["Wikipedia-sourced documents"]
    W --> G["Generative model"]
    Q --> G
    G --> A["Answer"]

This design gave RAG several practical advantages.

Advantage	Meaning
Easy to update	The document index can be refreshed without retraining the model
Source attribution	Retrieved documents can be surfaced as citations
Domain extensibility	Internal documents, manuals, and papers can be added to the index
Generative capability	The system can explain, summarise, and compare — not only extract

2022–2023: RAG becomes a standard building block for LLM applications

After the release of ChatGPT in late 2022, RAG moved from a research technique to a standard component of production applications.

Demand surged for systems that let LLMs consult proprietary documents: internal chatbots, FAQ search, support automation, contract search, and technical documentation search.

The typical architecture of that period looked like this.

graph TD
    D["Documents"] --> C["Chunking"]
    C --> E["Embedding"]
    E --> V["Vector DB"]
    Q["Question"] --> QE["Embed question"]
    QE --> V
    V --> K["Top-k chunks"]
    K --> P["Prompt assembly"]
    P --> L["LLM"]
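The pipeline in the diagram can be sketched end to end as follows. Everything here is a simplified assumption: `embed` is a bag-of-words stand-in for a real embedding model, a Python list stands in for the vector DB, and the final LLM call is omitted, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def chunk(text, size=8, overlap=2):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    words = text.split()
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text):
    # Toy stand-in for an embedding model: a sparse word-count vector.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents):
    # The "vector DB" here is just a list of (chunk text, vector) pairs.
    return [(piece, embed(piece)) for doc in documents for piece in chunk(doc)]


def top_k(index, question, k=2):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical documents and question.
documents = [
    "Refunds are issued within 14 days of purchase when the item is returned unused.",
    "Support is available on weekdays from nine to five by phone or email.",
]
index = build_index(documents)
question = "How many days do I have to get refunds?"
prompt = build_prompt(question, top_k(index, question, k=1))
```

Note that the retrieved chunk ends at "purchase" and drops the condition about returning the item unused, a small instance of how a chunk boundary can lose surrounding context.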

The naive “vector DB + top-k + LLM” approach had its own weaknesses, however.

Exact-string queries — product codes, error messages, proper nouns — were often missed.
Chunks that were too small lost surrounding context.
Chunks that were too large introduced noise.
Irrelevant documents could end up in the top-k.
Poor retrieval directly degraded answer quality.
Mishandling permissions or stale documents created risk.

These shortcomings led to Advanced RAG.

2023–2024: the era of Advanced RAG

Advanced RAG is the collective term for practical techniques that strengthen a basic retrieval pipeline.

Technique	Purpose
Hybrid search	Combine vector search with keyword search
Query rewriting	Transform the user’s question into a more retrievable form
Reranking	Re-order retrieval candidates by relevance to the question
Context compression	Strip irrelevant content before passing evidence to the LLM
Hierarchical retrieval	Use both detailed chunks and summary chunks depending on need
Retrieval quality evaluation	Judge whether retrieved results are usable before generating
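As one illustration of hybrid search, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is one common fusion method and is an assumption here, not something the techniques above prescribe. The document IDs are hypothetical, and `k=60` is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a keyword retriever and a vector retriever.
keyword_ranking = ["doc_err_E1234", "doc_disk_full", "doc_printer"]
vector_ranking = ["doc_disk_full", "doc_storage_tips", "doc_err_E1234"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

Because RRF works on ranks rather than raw scores, it does not require BM25 scores and cosine similarities to be on a comparable scale.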

Self-RAG demonstrated having the model evaluate whether retrieval is needed, whether retrieved evidence is useful, and whether the generated output is faithful to that evidence.[2] CRAG introduced assessing retrieval quality and triggering corrective actions (such as web search or knowledge refinement) when results are insufficient.[3] RAPTOR showed how recursively summarising documents into a tree structure enables retrieval at both fine-grained and high-level granularity.[4]
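The idea of checking retrieval quality before generating can be sketched as a simple gate. This is loosely in the spirit of the corrective approach described above, not a reproduction of any paper's algorithm. The relevance scores are assumed to come from some evaluator model, and the thresholds and action names are hypothetical.

```python
def choose_action(relevance_scores, upper=0.7, lower=0.3):
    """Decide what to do with retrieved documents based on evaluator scores.

    relevance_scores: one score in [0, 1] per retrieved document (assumed
    to be produced by a separate retrieval evaluator).
    """
    if not relevance_scores or max(relevance_scores) < lower:
        # Nothing retrieved looks relevant: fall back to another source.
        return "fallback_search"
    if max(relevance_scores) >= upper:
        # At least one document looks clearly relevant: use the retrieval.
        return "use_retrieved"
    # In between: combine the retrieved evidence with a fallback search.
    return "combine_both"


actions = [
    choose_action([0.9, 0.1]),
    choose_action([0.1, 0.2]),
    choose_action([0.5]),
]
```

The useful property is that generation no longer consumes whatever the retriever returns; a weak retrieval becomes an explicit branch in the pipeline.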

The lesson these approaches share is that “retrieval is not the end of the story.” There is design space before retrieval, during retrieval, after retrieval, and after generation.

2024: Graph RAG and corpus-level questions

Conventional RAG is designed for questions whose answers live somewhere in a specific chunk.

It struggles with questions such as these.

Across all of these meeting notes, what are the major discussion themes?
From all customer inquiries, classify the product improvement areas.
Describe the risk structure of this organisation at a high level.

These are corpus-level questions that cannot be answered by finding a single relevant chunk. GraphRAG addresses this by extracting entities and relationships from documents, building a knowledge graph, and generating community summaries, making it practical to answer questions about an entire corpus.[5]

The significance of Graph RAG is that it marks the point where RAG began connecting not just to retrieval but to knowledge structuring.

2025–2026: Agentic RAG and Code RAG

From 2025 onward, RAG has become tightly coupled with agents.

In conventional RAG, the developer fixes the retrieval procedure.

Question → query transform → retrieve → rerank → generate

In Agentic RAG, the agent chooses its next action based on what it has observed so far.

Decompose the question
Select the needed sources
Choose between keyword and semantic search as appropriate
Re-retrieve if evidence is insufficient
Read the evidence
Verify for contradictions
Produce the answer
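The control flow behind the first few of these steps can be sketched as a loop. The sketch is deliberately rule-based: `decompose` and the token-overlap `search` are stand-ins for decisions that a real agent would delegate to an LLM and to real retrieval tools, and the corpus is hypothetical. Verification and answer generation are left out.

```python
import re

# Hypothetical knowledge base.
CORPUS = {
    "kb-1": "Error E1234 means the disk is full.",
    "kb-2": "To free disk space, delete old log files.",
}


def tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def search(query, min_overlap):
    """Return IDs of documents sharing at least `min_overlap` tokens with the query."""
    q = tokens(query)
    return [doc_id for doc_id, text in CORPUS.items() if len(q & tokens(text)) >= min_overlap]


def decompose(question):
    # Stand-in for an LLM planner: split a compound question on " and ".
    return [part.strip(" ?") for part in question.split(" and ")]


def agentic_retrieve(question, max_rounds=2):
    """Decompose the question, search per sub-question, and re-retrieve when empty."""
    evidence, trace = [], []
    for sub_question in decompose(question):
        min_overlap = 2  # start with a strict match
        for _ in range(max_rounds):
            hits = search(sub_question, min_overlap)
            trace.append((sub_question, min_overlap, hits))
            if hits:  # evidence judged sufficient for this sub-question
                evidence.extend(h for h in hits if h not in evidence)
                break
            min_overlap -= 1  # relax the query and retrieve again
    return evidence, trace


evidence, trace = agentic_retrieve("What does E1234 mean and how do I free disk space?")
```

The first sub-question finds nothing under the strict setting and succeeds only after the retry, which is the behaviour a fixed pipeline cannot express: the next action depends on what the previous one returned.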

The rise of coding agents has also made RAG relevant for understanding entire repositories, not just natural-language documents. In code, fixed-length character chunking splits functions, classes, and type definitions mid-body and severs the links between code, its tests, and its dependencies. AST-based structural chunking, symbol search, test execution logs, and past modification history all become important.
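A minimal sketch of AST-based structural chunking, using Python's standard `ast` module on a small hypothetical source file. It emits one chunk per top-level function or class, so no definition is cut in half. A real code index would also handle nested definitions, imports, and other languages.

```python
import ast

# Hypothetical source file to be indexed.
SOURCE = '''
import math

def area(radius):
    """Area of a circle."""
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return area(self.radius)
'''


def chunk_by_definition(source):
    """Return one chunk per top-level function or class, with its line range."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


chunks = chunk_by_definition(SOURCE)
names = [c["name"] for c in chunks]
```

Each chunk carries its symbol name and line range, which gives symbol search and citation back to the file something stable to point at.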

The essence of RAG as seen through its history

The history of RAG can be summarised in one sentence: a shift from “what should the LLM memorise?” to “how should the LLM access the information it needs, and how should it verify that information?”

Period	Central concern	Meaning for RAG
Search engines	Finding documents	Foundation of keyword search and ranking
Open-domain QA	Extracting answers from documents	Retriever + Reader architecture
Original RAG paper	Combining retrieval with generation	Using external knowledge for generation
LLM application era	Querying internal documents	Widespread adoption of vector-DB RAG
Advanced RAG	Reducing retrieval failure and noise	Hybrid search, reranking, evaluation
Graph / Agentic RAG	Complex investigation and corpus-level understanding	Retrieval planning, re-retrieval, verification, structuring
Code RAG	Understanding repositories	Syntax, dependencies, tests, history

Summary

RAG grew out of the convergence of search, QA, neural retrieval, and generative modelling.
The 2020 RAG paper established the design of combining a model’s internal knowledge with an external memory.
From 2023 onward, plain vector search alone proved insufficient in production, which made Advanced RAG techniques essential.
Graph RAG, Agentic RAG, and Code RAG are expanding RAG from a “retrieval pipeline” into a broader discipline of knowledge access and verification.

References
