The History of RAG

About 10 minutes

Prerequisites: “What is RAG” and the basics of “What is an LLM?”

RAG did not appear out of nowhere. Search engines, question answering, knowledge bases, neural retrieval, and large language models all evolved in parallel. Their convergence in the 2020s produced what is now the standard pattern for giving LLMs access to external knowledge.

Before RAG: retrieval and generation were separate

Long before RAG, computers were already searching large document collections for relevant information — search engines being the most familiar example.

Classical search measures how well a user’s query terms match the words in a document. Keyword-based approaches such as BM25 excel at exact matches: error codes, product names, regulation numbers, and proper nouns. They remain important in production RAG today.

Classical search has clear limits, however.

  • When the query and the document use different wording, recall suffers.
  • Assembling an answer from search results is left entirely to the human reader.
  • Summarising or comparing information across multiple documents is difficult.
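
The exact-match strength and the wording-mismatch limit can both be seen in a small BM25 sketch. This is a minimal illustration, not a production implementation: the whitespace tokenizer, the sample documents, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical documents, made up for this example.
docs = [
    "error E1234 occurs when the disk is full",
    "how to free up storage space on the device",
    "reset your password from the account page",
]

exact = bm25_scores("error E1234", docs)
best = max(range(len(docs)), key=exact.__getitem__)

# Same intent as the first two documents, but no shared words.
paraphrase = bm25_scores("drive out of room", docs)
```

Here the error-code query ranks the first document on top, while the paraphrased query shares no terms with any document and scores zero everywhere.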

In short, pre-RAG search was a technology for finding documents, not for answering in natural language based on those documents.

An important precursor to RAG is open-domain question answering — the research area focused on finding answers to questions from large document collections such as Wikipedia.

A typical system worked in two stages.

  1. A Retriever finds documents or passages related to the question.
  2. A Reader extracts the answer string from those passages.

This two-stage design is quite close to modern RAG. Most QA systems of that period, however, centred on span extraction. They were good at answering “What is the capital of Japan?” with “Tokyo,” but weak at integrating information from multiple sources to generate an explanatory answer.

2018–2020: neural retrieval and generative models converge

The Transformer, BERT, and the GPT family transformed both retrieval and generation.

On the retrieval side, dense vector search — encoding sentences and passages as vectors and ranking by semantic similarity — became widespread. Documents no longer had to share exact wording with a query; “semantically close” documents could be surfaced even when phrasing differed.
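
As a rough sketch of ranking by semantic similarity, the snippet below scores passages by cosine similarity between vectors. The three-dimensional vectors are hand-picked stand-ins for what an embedding model would produce; no real model is called, and the passage titles are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (assumption): hand-written, not produced by a model.
passages = {
    "How to free up disk space": [0.9, 0.1, 0.0],
    "Resetting a forgotten password": [0.0, 0.2, 0.9],
    "Choosing a laptop battery": [0.1, 0.9, 0.1],
}

# Stands in for the embedding of a query such as "my drive is full",
# which shares no keywords with the best passage.
query_vec = [0.8, 0.2, 0.1]


def rank(query: list[float], index: dict[str, list[float]]) -> list[str]:
    """Return passage titles ordered from most to least similar."""
    return sorted(index, key=lambda title: cosine(query, index[title]), reverse=True)


ranked = rank(query_vec, passages)
```

The point of the design is that ranking depends only on vector geometry, so a passage can come first without sharing a single word with the query.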

On the generation side, pre-trained language models could produce fluent, extended responses. Relying solely on a model’s internal knowledge, however, left several problems unsolved.

  • The model has no knowledge of events after its training cutoff.
  • It cannot access private or proprietary documents.
  • It cannot cite its sources.
  • It may confidently produce plausible-sounding but incorrect answers.

These problems sit at the heart of the motivation for RAG.

2020: what the original RAG paper established

The 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” is the landmark work that popularised RAG in its current meaning. It combined a pre-trained seq2seq model — treated as “parametric memory” — with a dense vector index of Wikipedia passages — treated as “non-parametric memory.”[1]

The key insight was that knowledge need not be locked entirely inside a model’s weights; it can instead be retrieved from an external index at inference time.
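
One way to write this idea down, following the RAG-Sequence variant described in [1], is shown below. The symbols are introduced here for illustration: x is the question, y the generated answer, z a retrieved passage, p_η the retriever, and p_θ the generator.

```latex
\[
p(y \mid x) \;\approx\; \sum_{z \,\in\, \mathrm{top}_k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\; p_\theta(y \mid x, z)
\]
```

The generator's output is weighted by how relevant the retriever considers each passage, so the knowledge in the index shapes the answer without being stored in the generator's weights.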

```mermaid
graph LR
    Q["Question"] --> R["Neural retrieval"]
    R --> W["Wikipedia-sourced documents"]
    W --> G["Generative model"]
    Q --> G
    G --> A["Answer"]
```

This design gave RAG several practical advantages.

| Advantage | Meaning |
| --- | --- |
| Easy to update | The document index can be refreshed without retraining the model |
| Source attribution | Retrieved documents can be surfaced as citations |
| Domain extensibility | Internal documents, manuals, and papers can be added to the index |
| Generative capability | The system can explain, summarise, and compare, not only extract |

2022–2023: RAG becomes a standard building block for LLM applications

After the release of ChatGPT in late 2022, RAG moved from a research technique to a standard component of production applications.

Demand surged for systems that let LLMs consult proprietary documents: internal chatbots, FAQ search, support automation, contract search, and technical documentation search.

The typical architecture of that period looked like this.

```mermaid
graph TD
    D["Documents"] --> C["Chunking"]
    C --> E["Embedding"]
    E --> V["Vector DB"]
    Q["Question"] --> QE["Embed question"]
    QE --> V
    V --> K["Top-k chunks"]
    K --> P["Prompt assembly"]
    P --> L["LLM"]
```
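
The same flow can be sketched end to end in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector DB" is a plain Python list, the document text is invented, and the final LLM call is left out, so the sketch stops at prompt assembly.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into fixed-size word windows with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    # Toy embedding (assumption): word counts instead of a learned vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical document, made up for this example.
document = (
    "The warranty covers hardware defects for two years. "
    "Refunds are issued within fourteen days of purchase. "
    "Support is available by email on weekdays."
)

index = [(c, embed(c)) for c in chunk(document)]  # stands in for the vector DB
question = "How long does the warranty cover hardware defects?"
prompt = build_prompt(question, top_k(question, index))
```

Note that `top_k` always returns k chunks, whether or not they are relevant, which is one source of noisy context in this naive design.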

The naive “vector DB + top-k + LLM” approach had its own weaknesses, however.

  • Exact-string queries — product codes, error messages, proper nouns — were often missed.
  • Chunks that were too small lost surrounding context.
  • Chunks that were too large introduced noise.
  • Irrelevant documents could end up in the top-k.
  • Poor retrieval directly degraded answer quality.
  • Mishandling permissions or stale documents created risk.

These shortcomings led to Advanced RAG.

Advanced RAG is the collective term for practical techniques that strengthen a basic retrieval pipeline.

| Technique | Purpose |
| --- | --- |
| Hybrid search | Combine vector search with keyword search |
| Query rewriting | Transform the user’s question into a more retrievable form |
| Reranking | Re-order retrieval candidates by relevance to the question |
| Context compression | Strip irrelevant content before passing evidence to the LLM |
| Hierarchical retrieval | Use both detailed chunks and summary chunks depending on need |
| Retrieval quality evaluation | Judge whether retrieved results are usable before generating |
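
As one possible illustration of hybrid search, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is only one of several fusion methods and is an assumption here, as are the document ids and the commonly used constant `k=60`.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists; each appearance adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from two retrievers for the same query.
keyword_ranking = ["doc_error_code", "doc_install", "doc_faq"]
vector_ranking = ["doc_troubleshoot", "doc_error_code", "doc_faq"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

Documents that both retrievers agree on rise to the top, while a document found by only one retriever is still kept as a candidate. Because the fusion uses ranks rather than raw scores, the two retrievers do not need comparable score scales.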

Self-RAG has the model itself judge whether retrieval is needed, whether the retrieved evidence is useful, and whether the generated output is faithful to that evidence.[2] CRAG assesses retrieval quality and triggers corrective actions, such as web search or knowledge refinement, when the results are insufficient.[3] RAPTOR recursively summarises documents into a tree, which enables retrieval at both fine-grained and high-level granularity.[4]
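
The idea of checking retrieval quality before generating can be sketched as a simple gate. This is not the algorithm from [3]: the word-overlap grader, the threshold of 0.5, and the stand-in retrievers are all assumptions made for illustration, and a real system would more likely use a trained evaluator or an LLM as the grader.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]


def overlap_grade(question: str, passage: str) -> float:
    # Toy grader (assumption): fraction of question words found in the passage.
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def corrective_retrieve(
    question: str,
    primary: Retriever,
    fallback: Retriever,
    grade: Callable[[str, str], float] = overlap_grade,
    threshold: float = 0.5,
) -> tuple[list[str], str]:
    """Keep passages the grader accepts; if none pass, fall back to another source."""
    passages = primary(question)
    accepted = [p for p in passages if grade(question, p) >= threshold]
    if accepted:
        return accepted, "primary"
    return fallback(question), "fallback"


# Hypothetical retrievers standing in for an internal index and a web search.
def internal_index(question: str) -> list[str]:
    return ["how to reset your password", "shipping rates by region"]


def web_search(question: str) -> list[str]:
    return ["web result for: " + question]


kept, source = corrective_retrieve("reset password", internal_index, web_search)
rescued, rescue_source = corrective_retrieve("quantum billing", internal_index, web_search)
```

The first query keeps only the passage that the grader accepts, and the second query finds nothing usable in the internal index and switches to the fallback source.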

The lesson these approaches share is that “retrieval is not the end of the story.” There is design space before retrieval, during retrieval, after retrieval, and after generation.

2024: Graph RAG and corpus-level questions

Conventional RAG is designed for questions whose answers live somewhere in a specific chunk.

It struggles with questions such as these.

  • Across all of these meeting notes, what are the major discussion themes?
  • From all customer inquiries, classify the product improvement areas.
  • Describe the risk structure of this organisation at a high level.

These are corpus-level questions that cannot be answered by finding a single relevant chunk. Graph RAG addresses this by extracting entities and relationships from documents, building a knowledge graph, and generating community summaries, which makes it practical to answer questions about an entire corpus.[5]
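
A heavily simplified sketch of the graph-building and grouping steps is shown below. The relations are written by hand rather than extracted by a model, connected components stand in for a proper community detection algorithm, and the entity names are invented, so this illustrates only the shape of the pipeline, not the method of [5].

```python
from collections import defaultdict


def build_graph(relations: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Build an undirected entity graph from (entity, entity) pairs."""
    graph: dict[str, set[str]] = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def communities(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group entities into connected components (a stand-in for community detection)."""
    seen: set[str] = set()
    groups: list[list[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical output of an entity and relationship extraction step.
relations = [
    ("Billing", "Invoices"),
    ("Invoices", "Refunds"),
    ("Login", "Password reset"),
]

groups = communities(build_graph(relations))
```

In a fuller system each group would then be summarised by an LLM, and corpus-level questions would be answered from those summaries rather than from individual chunks.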

The significance of Graph RAG is that it marks the point where RAG began connecting not just to retrieval but to knowledge structuring.

From 2025 onward, RAG is tightly coupled with agents.

In conventional RAG, the developer fixes the retrieval procedure.

Question → query transform → retrieve → rerank → generate

In Agentic RAG, the agent chooses its next action based on what it has observed so far.

  • Decompose the question
  • Select the needed sources
  • Choose between keyword and semantic search as appropriate
  • Re-retrieve if evidence is insufficient
  • Read the evidence
  • Verify for contradictions
  • Produce the answer
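
A few of these actions (choosing a search tool and re-retrieving when evidence is missing) can be sketched as a loop. In a real agent an LLM would make these decisions. Here a regular expression and a retry rule stand in for it, and the corpus, the tools, and the sub-questions are all hypothetical.

```python
import re
from typing import Callable

Tool = Callable[[str], list[str]]


def choose_tool(sub_question: str) -> str:
    """Prefer keyword search when the question contains an identifier such as E-42."""
    return "keyword" if re.search(r"\b[A-Z]+-?\d+\b", sub_question) else "semantic"


def agentic_answer(sub_questions: list[str], tools: dict[str, Tool], max_rounds: int = 2):
    """Gather evidence per sub-question, switching tools when a search returns nothing."""
    evidence: list[str] = []
    trace: list[tuple[str, str, int]] = []
    for sub_question in sub_questions:
        tool = choose_tool(sub_question)
        for _ in range(max_rounds):
            hits = tools[tool](sub_question)
            trace.append((sub_question, tool, len(hits)))
            if hits:
                evidence.extend(hits)
                break
            # Evidence is insufficient: re-retrieve with the other tool.
            tool = "semantic" if tool == "keyword" else "keyword"
    return evidence, trace


# Hypothetical corpus and toy tools, made up for this example.
corpus = [
    "E-42 means the upstream service timed out",
    "Timeout settings live in config.yaml",
]


def keyword_search(q: str) -> list[str]:
    ids = re.findall(r"[A-Z]+-?\d+", q)
    return [p for p in corpus if any(i in p for i in ids)]


def semantic_search(q: str) -> list[str]:
    # Stand-in for vector search: matches on one hand-picked concept.
    return [p for p in corpus if "timeout" in q.lower() and "timeout" in p.lower()]


tools = {"keyword": keyword_search, "semantic": semantic_search}
evidence, trace = agentic_answer(
    ["What does E-42 mean?", "Where is the timeout configured?"], tools
)
```

The `trace` list records which tool was used for each attempt, which is the kind of observation an agent would use to decide its next action.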

The rise of coding agents has also made RAG relevant for understanding entire repositories, not just natural-language documents. In code, fixed-length character chunking cuts through functions, classes, types, and tests, and it severs dependency relationships. AST-based structural chunking, symbol search, test execution logs, and past modification history all become important.
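
A minimal sketch of AST-based structural chunking for Python source, using the standard library `ast` module, might look like this. It keeps only top-level functions and classes; the sample source is invented, and a real code index would also need to handle imports, nested definitions, and other languages.

```python
import ast


def chunk_by_definition(source: str) -> list[tuple[str, str]]:
    """Return (name, source text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text spanned by the node.
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


# Hypothetical module, made up for this example.
source = '''
import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return area(self.r)
'''

chunks = chunk_by_definition(source)
names = [name for name, _ in chunks]

# For contrast: a fixed-length character chunk stops in the middle of a function.
fixed = source[:40]
```

Each structural chunk is a complete definition, so a retrieved chunk can be read on its own, whereas the fixed-length slice ends partway through the body of `area`.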

The essence of RAG as seen through its history

The history of RAG can be summarised in one sentence: a shift from “what should the LLM memorise?” to “how should the LLM access the information it needs, and how should it verify that information?”

| Period | Central concern | Meaning for RAG |
| --- | --- | --- |
| Search engines | Finding documents | Foundation of keyword search and ranking |
| Open-domain QA | Extracting answers from documents | Retriever + Reader architecture |
| Original RAG paper | Combining retrieval with generation | Using external knowledge for generation |
| LLM application era | Querying internal documents | Widespread adoption of vector-DB RAG |
| Advanced RAG | Reducing retrieval failure and noise | Hybrid search, reranking, evaluation |
| Graph / Agentic RAG | Complex investigation and corpus-level understanding | Retrieval planning, re-retrieval, verification, structuring |
| Code RAG | Understanding repositories | Syntax, dependencies, tests, history |

  • RAG grew out of the convergence of search, QA, neural retrieval, and generative modelling.
  • The 2020 RAG paper established the design of combining a model’s internal knowledge with an external memory.
  • From 2023, plain vector search proved insufficient, making Advanced RAG essential.
  • Graph RAG, Agentic RAG, and Code RAG are expanding RAG from a “retrieval pipeline” into a broader discipline of knowledge access and verification.

  1. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  2. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  3. Corrective Retrieval Augmented Generation
  4. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
  5. From Local to Global: A Graph RAG Approach to Query-Focused Summarization