Retrieval Strategies

Reading time: about 10 minutes

Audience: people who want to improve RAG retrieval accuracy, and those designing retrieval architecture

Prerequisites: familiarity with What Is RAG? and the basics of Embeddings & Vector Representations

RAG quality depends heavily on the accuracy of document retrieval. Vector search, which finds semantically close text, and keyword search, which finds exact word matches, each have distinct strengths and weaknesses. This page organizes the main retrieval approaches and explains when to use each.

The Three Main Retrieval Approaches

RAG retrieval methods fall into three broad categories.

Approach	Mechanism	Strengths	Weaknesses
Keyword Search (BM25)	Scores documents by word match frequency	Exact matches: product names, model numbers, error codes	Paraphrasing, synonyms, semantic similarity
Vector Search	Retrieves documents by embedding vector similarity	Semantic similarity, paraphrasing, multilingual	Exact matches for specific keywords
Hybrid Search	Combines both and merges scores	Covers both of the above	Requires configuration and tuning

How BM25 Keyword Search Works

BM25 (Best Match 25) is a standard ranking algorithm for keyword search. It scores documents by combining TF (term frequency: how often query words appear in a document) and IDF (inverse document frequency: how rare each word is across all documents), with a normalization for document length so that long documents do not win simply by containing more words.[1]

For a query like “Python error solution,” documents that contain more occurrences of “Python,” “error,” and “solution” score higher, though each additional occurrence adds less than the one before. If “Python” appears in most documents in the corpus, IDF reduces its contribution to the overall score, so the rarer terms decide the ranking.

BM25 characteristics:

Strong at exact word matching (e.g., product codes like “SKU-2048”)
Fast index construction
No vectorization required (low computational cost)
Cannot handle paraphrasing or synonyms (“smartphone” and “mobile phone” are treated as different terms)
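
The scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration of a textbook BM25 variant, not the implementation of any particular search engine: the function name bm25_scores, the toy documents, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for the example, and tokenization is reduced to whitespace splitting.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query (illustrative BM25 variant)."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue  # no exact match, no contribution
            df = doc_freq[term]
            # IDF: rare terms weigh more than common ones.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization: b controls how much long documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example), tokenized by whitespace.
docs = [
    "python error solution for import error".split(),
    "python tutorial for beginners".split(),
    "java error handling".split(),
]
scores = bm25_scores("python error solution".split(), docs)

# Synonyms share no tokens, so the first document gets no score at all.
synonym_scores = bm25_scores(["smartphone"], [["mobile", "phone"], ["smartphone"]])
```

The first document matches all three query terms and receives the highest score. The last two lines show the synonym limitation: a document that says “mobile phone” scores zero for the query “smartphone.”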

How Vector Search (ANN) Works

Vector search uses distances between embedding vectors to find semantically similar documents.

Computing the distance between a query vector and every stored vector (brute-force search) becomes impractical at scale. In practice, ANN (Approximate Nearest Neighbor) search is used: it trades a small amount of recall for much higher speed by returning vectors that are “close enough” rather than guaranteed nearest. Algorithms such as HNSW (Hierarchical Navigable Small World) are widely used.

Vector search characteristics:

Handles paraphrasing and word variation well (“how to search” vs. “tell me how to find things”)
Works with multilingual semantic similarity
May lose accuracy when exact keyword matching is required
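
For intuition, here is a minimal sketch of the brute-force baseline that ANN indexes approximate. It uses cosine similarity over hand-made three-dimensional vectors; the vectors, document IDs, and function names are hypothetical, and real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, doc_vecs, top_k=2):
    """Compare the query against every stored vector and keep the top_k."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]


# Hand-made toy vectors standing in for real embeddings.
doc_vecs = {
    "doc_search": [0.9, 0.1, 0.0],
    "doc_find": [0.8, 0.2, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}
hits = brute_force_search([1.0, 0.1, 0.0], doc_vecs, top_k=2)
for doc_id, similarity in hits:
    print(f"{similarity:.4f}: {doc_id}")
```

This loop touches every stored vector on every query, which is the cost that an ANN index such as HNSW avoids by visiting only a small part of the collection.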

Why Hybrid Search Is Effective

BM25 and vector search evaluate documents from fundamentally different perspectives, so combining them creates a complementary relationship.

graph TD
    Q["User query"] --> BM25["BM25 keyword search"]
    Q --> VEC["Vector search (ANN)"]
    BM25 --> M["Score fusion (RRF etc.)"]
    VEC --> M
    M --> R["Merged ranking"]
    R --> RE["Reranker (optional)"]
    RE --> TOP["Top K results → LLM"]

Hybrid search merges two result sets into a unified ranking. RRF (Reciprocal Rank Fusion) is a simple, proven method for this. RRF uses only the rank position of each result, not its raw score, so BM25 scores and vector similarities, which live on different scales, can be merged without normalization. Each document receives 1 / (k + rank) from every list it appears in, and the contributions are summed.

def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    """
    Merge BM25 and vector search results using RRF.
    bm25_results: [(doc_id, score), ...] sorted best-first
    vector_results: [(doc_id, score), ...] sorted best-first
    k: RRF constant (typically 60)
    """
    scores = {}

    # Ranks are 1-based, so the top result contributes 1 / (k + 1).
    for rank, (doc_id, _) in enumerate(bm25_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    for rank, (doc_id, _) in enumerate(vector_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    # Highest fused score first; raw search scores are never used.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
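
To see how rank-based fusion behaves on concrete input, the following sketch fuses two short rankings. It is a self-contained variant written for this example: the name rrf_fuse, the document IDs, and the choice of 1-based ranks are assumptions, and it accepts plain lists of IDs rather than (doc_id, score) pairs.

```python
def rrf_fuse(*rankings, k=60):
    """Fuse any number of best-first doc_id lists using 1-based RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


# Hypothetical rankings from the two searches.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse(bm25_ranking, vector_ranking)
for doc_id, score in fused:
    print(f"{score:.5f}: {doc_id}")
```

Here doc_a scores 1/61 + 1/62 and doc_c scores 1/63 + 1/61, so both outrank doc_b and doc_d, which each appear in only one list. Documents that both searches agree on rise to the top.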

What Is Reranking?

A reranker is a component that re-orders the candidate documents returned by the initial search based on their relevance to the query.

If search (BM25 + vector) is the “broad candidate gathering” stage, reranking is the “selecting documents that actually answer the question” stage.

Cross-encoders are commonly used as rerankers. A cross-encoder processes the query and each document together to compute a relevance score.[2]

# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to read a file in Python"
candidates = [
    "Explains how to open files using Python's open() function",
    "Basic steps for reading files in Java",
    "Best practices for file operations in Python using with statements",
]

pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)

ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.4f}: {doc[:60]}...")

Reranker Cost Tradeoffs

A reranker improves accuracy but adds latency and cost.

Stage	Documents processed	Purpose
Search (BM25 + vector)	All documents (thousands to millions)	Narrow down to top 50–100 candidates
Reranker	Top 50–100 candidates	Narrow to final top 5–20
LLM input	Top 5–20	Answer generation

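A cross-encoder runs one model inference per query-document pair, so its cost grows linearly with the number of candidates. Limiting the reranker's input to roughly the top 100 results from the search stage keeps that cost manageable while still improving accuracy.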
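
The staging in the table can be sketched as a small function that caps how many candidates reach the expensive scorer. Everything here is illustrative: retrieve_then_rerank, the toy corpus, and the word-overlap scorer are hypothetical stand-ins. In a real pipeline, first_stage would be the hybrid search and score_pair would call a cross-encoder.

```python
def retrieve_then_rerank(query, first_stage, score_pair, n_candidates=100, top_k=5):
    """Cheap search first, then apply the expensive scorer to a capped candidate set."""
    candidates = first_stage(query)[:n_candidates]
    # The expensive scorer runs once per candidate, never over the whole corpus.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]


# Toy stand-ins, made up for this example.
corpus = [
    "Reading files in Java",
    "Python open() function for reading files",
    "Cooking pasta at home",
    "Python with statement for file handling",
]


def toy_first_stage(query):
    # Recall-oriented: keep any document sharing at least one word with the query.
    query_words = set(query.lower().split())
    return [doc for doc in corpus if query_words & set(doc.lower().split())]


def toy_score_pair(query, doc):
    # Placeholder for a cross-encoder: counts shared words.
    return len(set(query.lower().split()) & set(doc.lower().split()))


top = retrieve_then_rerank(
    "python reading files", toy_first_stage, toy_score_pair, n_candidates=3, top_k=2
)
print(top)
```

Passing the two stages in as functions keeps the cap (n_candidates) and the final cut (top_k) as the only tuning knobs, which makes the latency and accuracy tradeoff easy to measure.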

Query Expansion and Query Rewriting

User questions are sometimes poor search queries. Pre-processing queries can meaningfully improve retrieval accuracy.

Query Rewriting

Query rewriting turns context-dependent questions (“How do I do that?”) into concrete, standalone search queries.

from openai import OpenAI

client = OpenAI()

def rewrite_query(conversation_history, user_question):
    """Rewrite a query with conversation context taken into account"""
    messages = [
        {
            "role": "system",
            "content": (
                "From the conversation and question below, generate one standalone search query. "
                "Replace any pronoun references (it, this, that) with specific terms from the context. "
                "Output only the query itself, no explanation."
            )
        },
        *conversation_history,
        {"role": "user", "content": f"Question: {user_question}"}
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=100
    )
    # content can be None, so fall back to an empty string before stripping
    return (response.choices[0].message.content or "").strip()

Query Expansion (HyDE)

HyDE (Hypothetical Document Embeddings) asks the LLM to generate a hypothetical document that would answer the question, then searches with that document’s embedding instead of the question’s.[4] For the question “How do I detect memory leaks in Python?”, the LLM might generate “Detecting memory leaks in Python can be done with the tracemalloc module…”, and that text is embedded and used for retrieval. The hypothetical answer may contain factual errors, but it is written in the style and vocabulary of real answer documents, so its vector tends to land closer to them than the vector of a short question would.
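
The flow can be sketched as three pluggable steps. This is a minimal illustration under stated assumptions: hyde_search and the three fake_* helpers are hypothetical, the “embedding” is a toy word count over a four-word vocabulary, and in a real system generate_answer would call an LLM while embed and search_by_vector would call an embedding model and a vector store.

```python
def hyde_search(question, generate_answer, embed, search_by_vector, top_k=5):
    """Search with the embedding of a hypothetical answer instead of the question."""
    hypothetical_doc = generate_answer(question)   # LLM call in a real system
    query_vector = embed(hypothetical_doc)         # embedding model in a real system
    return search_by_vector(query_vector, top_k)   # vector store in a real system


# Toy stand-ins so the sketch runs without any external service.
VOCAB = ["tracemalloc", "memory", "leak", "pasta"]

DOCS = {
    "doc_tracemalloc": "tracemalloc takes snapshots to trace a memory leak",
    "doc_pasta": "boil pasta for ten minutes",
}


def fake_generate_answer(question):
    return "Detecting a memory leak in Python can be done with the tracemalloc module"


def fake_embed(text):
    # Word counts over a tiny vocabulary, standing in for a real embedding.
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def fake_search_by_vector(query_vector, top_k):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = [(doc_id, dot(query_vector, fake_embed(text))) for doc_id, text in DOCS.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


results = hyde_search(
    "How do I detect memory leaks in Python?",
    fake_generate_answer,
    fake_embed,
    fake_search_by_vector,
    top_k=1,
)
print(results)
```

Note that HyDE adds one LLM call before every search, so it carries a latency and cost tradeoff similar to query rewriting.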

Practical Design Guidelines

Start with Hybrid Search

A recommended starting configuration:

Use hybrid search (BM25 + vector) as the foundation
Add a reranker when accuracy requirements are high
Add query rewriting when conversational context matters

“Start with vector-only and move to hybrid if accuracy is insufficient” is also a valid path. That said, for use cases involving product names, model numbers, or error codes, starting with hybrid is advisable from the beginning.

Hybrid Search Implementation Example

Search platforms such as Weaviate, Qdrant, and Elasticsearch provide ways to combine BM25-style retrieval with vector search. Cohere also provides a Rerank API for re-ordering candidates after initial retrieval.[3]

# Hybrid search using Weaviate
# pip install weaviate-client
import weaviate

client = weaviate.connect_to_local()

try:
    collection = client.collections.get("Documents")

    results = collection.query.hybrid(
        query="How to read a file in Python",
        alpha=0.5,   # 0 = BM25 only, 1 = vector only, 0.5 = equal blend
        limit=10
    )

    for obj in results.objects:
        print(obj.properties["content"][:100])
finally:
    client.close()  # release the connection opened by connect_to_local()

The alpha parameter controls the balance between BM25 and vector search. For use cases where exact matches matter, try alpha=0.3 (BM25-dominant); for semantic search, try alpha=0.7 (vector-dominant). Tune with an evaluation dataset.
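
To build intuition for what alpha does, here is a conceptual sketch of a weighted blend over normalized scores. It is not Weaviate's internal fusion algorithm: the functions min_max_normalize and blend_scores, the document IDs, and the score values are all assumptions made for illustration.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to the 0..1 range."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lowest) / (highest - lowest) for doc_id, s in scores.items()}


def blend_scores(keyword_scores, vector_scores, alpha=0.5):
    """alpha=0 uses only keyword scores, alpha=1 uses only vector scores."""
    keyword_norm = min_max_normalize(keyword_scores)
    vector_norm = min_max_normalize(vector_scores)
    blended = {}
    for doc_id in set(keyword_norm) | set(vector_norm):
        blended[doc_id] = (
            (1 - alpha) * keyword_norm.get(doc_id, 0.0)
            + alpha * vector_norm.get(doc_id, 0.0)
        )
    return sorted(blended.items(), key=lambda x: x[1], reverse=True)


# Made-up scores: doc_sku is an exact keyword hit, doc_guide is a semantic match.
keyword_scores = {"doc_sku": 12.0, "doc_guide": 3.0}
vector_scores = {"doc_guide": 0.82, "doc_sku": 0.40}

for alpha in (0.3, 0.7):
    ranking = blend_scores(keyword_scores, vector_scores, alpha=alpha)
    print(alpha, [doc_id for doc_id, _ in ranking])
```

With these toy numbers, alpha=0.3 puts the exact keyword hit first and alpha=0.7 puts the semantic match first, which is the behavior the tuning advice above relies on.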

Summary

BM25 is strong for exact word matches; vector search is strong for semantic similarity
Hybrid search merges both result sets (for example with RRF), compensating for each method’s weaknesses
A reranker (cross-encoder) improves accuracy at the cost of additional latency and compute
Query rewriting is useful preprocessing for bringing conversational context into the search
In practice, building with hybrid search as the foundation and adding a reranker based on accuracy requirements is a realistic approach

FAQ

Q: Should I always use hybrid search?

A: In most cases, hybrid search provides more consistent accuracy than vector-only. However, for documents that are semantically uniform with little paraphrasing (for example, short fixed-format FAQs), vector search alone may be sufficient. I recommend starting with hybrid and making decisions based on evaluation data.

Q: Is reranking always worth the extra cost?

A: For use cases where accuracy is critical (legal, medical, customer support), the additional cost of reranking is well justified. For cases where latency is the top priority (real-time chat responses) or accuracy requirements are low, operating without a reranker is also reasonable. The right approach is to measure the accuracy improvement with an evaluation set before deciding.
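
One way to make that measurement concrete is recall@k on a small labeled set. The sketch below is illustrative only: the evaluation data, document IDs, and the system names search_only and reranked are made up, and a real evaluation set would hold many more queries with human-labeled relevant documents.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall(eval_set, system, k):
    """Average recall@k over all queries for one system's rankings."""
    return sum(recall_at_k(q[system], q["relevant"], k) for q in eval_set) / len(eval_set)


# Hypothetical evaluation set: labeled relevant docs plus each system's ranking.
eval_set = [
    {
        "relevant": {"d1"},
        "search_only": ["d3", "d2", "d1"],
        "reranked": ["d1", "d3", "d2"],
    },
    {
        "relevant": {"d5", "d6"},
        "search_only": ["d5", "d4", "d6"],
        "reranked": ["d5", "d6", "d4"],
    },
]

before = mean_recall(eval_set, "search_only", k=2)
after = mean_recall(eval_set, "reranked", k=2)
print(f"recall@2 without reranker: {before:.2f}, with reranker: {after:.2f}")
```

Comparing the two numbers against the added latency per query gives a concrete basis for the decision.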

Q: Does BM25 work for Japanese?

A: Yes, but Japanese text is not naturally space-delimited like English, so morphological analysis (MeCab, Sudachi, etc.) for word segmentation is required. Without a proper tokenizer, BM25 accuracy will be significantly degraded.

Q: Does query rewriting require an LLM call on every request?

A: Essentially yes. However, using a lightweight model (GPT-4o-mini, Claude Haiku, etc.) keeps the cost manageable. For single-turn questions (no conversation history), the benefit of query rewriting is limited and it is often skipped.

References

Chunking Strategies

Embeddings & Vector Representations