Retrieval Strategies

About 10 minutes

Target audience: People who want to improve RAG retrieval accuracy, and those designing retrieval architecture
Prerequisites: Familiarity with What Is RAG? and the basics of Embeddings & Vector Representations

RAG quality depends on the accuracy of document retrieval. Vector search — which finds semantically close text — and keyword search — which finds exact word matches — each have distinct strengths and weaknesses. This page organizes the main retrieval approaches and explains when to use each.

RAG retrieval methods fall into three broad categories.

| Approach | Mechanism | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Keyword Search (BM25) | Scores documents by word match frequency | Exact matches: product names, model numbers, error codes | Paraphrasing, synonyms, semantic similarity |
| Vector Search | Retrieves documents by embedding vector similarity | Semantic similarity, paraphrasing, multilingual | Exact matches for specific keywords |
| Hybrid Search | Combines both and merges scores | Covers both of the above | Requires configuration and tuning |

BM25 (Best Match 25) is a standard ranking function for keyword search. It scores documents by combining TF (term frequency: how often query words appear in a document) with IDF (inverse document frequency: how rare each word is across all documents), and it normalizes for document length so that long documents are not favored simply for containing more words.[1]

For a query like “Python error solution,” documents that contain more occurrences of “Python,” “error,” and “solution” score higher. Because “Python” appears in many documents, IDF reduces its contribution to the overall score.
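The scoring idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `bm25_scores`, the whitespace tokenizer, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a larger IDF; the +1 keeps the value positive.
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # TF saturates via k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python error solution for import error",
    "python tutorial for beginners",
    "java error handling guide",
]
scores = bm25_scores("python error solution", docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

In this toy corpus, "solution" appears in only one document, so it carries more weight than "python" or "error", which each appear in two.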

BM25 characteristics:

  • Strong at exact word matching (e.g., product codes like “SKU-2048”)
  • Fast index construction
  • No vectorization required (low computational cost)
  • Cannot handle paraphrasing or synonyms (“smartphone” and “mobile phone” are treated as different terms)

Vector search uses distances between embedding vectors to find semantically similar documents.

Computing the distance between a query vector and every stored vector (brute-force search) becomes impractical at scale. In practice, ANN (Approximate Nearest Neighbor) is used — it finds vectors that are “close enough” rather than strictly nearest, at much higher speed. Algorithms such as HNSW (Hierarchical Navigable Small World) are widely used.
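To make the cost of brute-force search concrete, here is a minimal sketch that compares a query vector against every stored vector using cosine similarity. The three-dimensional vectors, the document IDs, and the function names are hypothetical stand-ins; real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, doc_vecs, top_k=2):
    """Exact nearest-neighbor search: one similarity computation per stored vector."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors standing in for real embeddings.
doc_vecs = {
    "doc_search_howto": [0.9, 0.1, 0.0],
    "doc_find_things": [0.8, 0.3, 0.1],
    "doc_cooking": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

nearest = brute_force_search(query_vec, doc_vecs)
for doc_id, similarity in nearest:
    print(f"{similarity:.3f}  {doc_id}")
```

Every query touches every stored vector, so the work grows linearly with the collection size. ANN indexes such as HNSW avoid this full scan by visiting only a small part of the collection, accepting that the result may occasionally miss a true nearest neighbor.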

Vector search characteristics:

  • Handles paraphrasing and word variation well (“how to search” vs. “tell me how to find things”)
  • Works with multilingual semantic similarity
  • May lose accuracy when exact keyword matching is required

BM25 and vector search evaluate documents from fundamentally different perspectives, so their strengths are complementary: each one catches matches that the other tends to miss.

graph TD
    Q["User query"] --> BM25["BM25 keyword search"]
    Q --> VEC["Vector search (ANN)"]
    BM25 --> M["Score fusion (RRF etc.)"]
    VEC --> M
    M --> R["Merged ranking"]
    R --> RE["Reranker (optional)"]
    RE --> TOP["Top K results → LLM"]

Hybrid search merges two result sets into a unified ranking. RRF (Reciprocal Rank Fusion) is a simple, proven method for this. RRF uses the rank position of each result to compute a combined score, making it possible to merge two searches with different score scales directly.

def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    """
    Merge BM25 and vector search results using RRF.
    bm25_results: [(doc_id, score), ...], sorted best first
    vector_results: [(doc_id, score), ...], sorted best first
    k: RRF constant (typically 60)
    """
    scores = {}

    # Only rank positions are used; the raw scores are ignored.
    # Ranks are 1-based, matching the usual RRF formula 1 / (k + rank).
    for rank, (doc_id, _) in enumerate(bm25_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    for rank, (doc_id, _) in enumerate(vector_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    # Highest fused score first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
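As a small worked example of how rank-based fusion behaves, the sketch below fuses two toy rankings. The helper `rrf_fuse` is a hypothetical variant that takes plain lists of document IDs (any number of them) and uses 1-based ranks, as in the usual RRF formula 1 / (k + rank); the document IDs are made up.

```python
def rrf_fuse(rankings, k=60):
    """Fuse any number of ranked doc_id lists with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{score:.5f}  {doc_id}")
```

Documents that appear in both lists (doc_a and doc_c) end up ahead of documents that appear in only one, even though no raw score was ever compared across the two searches.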

A reranker is a component that re-orders the candidate documents returned by the initial search based on their relevance to the query.

If search (BM25 + vector) is the “broad candidate gathering” stage, reranking is the “selecting documents that actually answer the question” stage.

Cross-encoders are commonly used as rerankers. A cross-encoder processes the query and each document together to compute a relevance score.[2]

# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to read a file in Python"
candidates = [
    "Explains how to open files using Python's open() function",
    "Basic steps for reading files in Java",
    "Best practices for file operations in Python using with statements",
]

pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)

ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.4f}: {doc[:60]}...")

A reranker improves accuracy but adds latency and cost.

| Stage | Documents processed | Purpose |
| --- | --- | --- |
| Search (BM25 + vector) | All documents (thousands to millions) | Narrow down to top 50–100 candidates |
| Reranker | Top 50–100 candidates | Narrow to final top 5–20 |
| LLM input | Top 5–20 | Answer generation |

Limiting the reranker's input to the top 50–100 candidates from the search stage keeps computational cost manageable while still improving accuracy.
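The staged funnel can be sketched as a function that runs a cheap scorer over everything and an expensive scorer only over the shortlist. Everything here is a hypothetical stand-in: `word_overlap` plays the role of the first-stage search, `overlap_density` plays the role of a cross-encoder, and the documents are toy strings.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score,
                         n_candidates=100, top_k=5):
    """Two-stage retrieval: cheap scoring over all docs, costly scoring over a shortlist."""
    # Stage 1: score every document with the cheap function and keep a shortlist.
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: run the expensive scorer only on the shortlist.
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]


def word_overlap(query, doc):
    """Cheap stand-in for first-stage search: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


expensive_calls = []  # records which documents reach the expensive scorer


def overlap_density(query, doc):
    """Stand-in for a cross-encoder: shared words relative to document length."""
    expensive_calls.append(doc)
    return word_overlap(query, doc) / len(doc.split())


docs = [
    "read a file in python with open",
    "python file reading best practices using with statements and context managers",
    "reading files in java",
    "cooking pasta at home",
    "python packaging guide",
]

top = retrieve_then_rerank("read file python", docs, word_overlap, overlap_density,
                           n_candidates=3, top_k=2)
print(top)
print(f"expensive scorer ran on {len(expensive_calls)} of {len(docs)} documents")
```

The expensive scorer's cost depends on `n_candidates`, not on the size of the collection, which is why the shortlist size is the main knob for trading latency against accuracy.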

User questions are sometimes poor search queries. Pre-processing queries can meaningfully improve retrieval accuracy.

Query rewriting turns context-dependent questions (“How do I do that?”) into concrete, standalone search queries.

from openai import OpenAI

client = OpenAI()

def rewrite_query(conversation_history, user_question):
    """Rewrite a query with conversation context taken into account"""
    messages = [
        {
            "role": "system",
            "content": (
                "From the conversation and question below, generate one standalone search query. "
                "Replace any pronoun references (it, this, that) with specific terms from the context. "
                "Output only the query itself, no explanation."
            )
        },
        *conversation_history,
        {"role": "user", "content": f"Question: {user_question}"}
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=100
    )
    return response.choices[0].message.content

HyDE (Hypothetical Document Embeddings) asks the LLM to generate a hypothetical document that would answer the question, then uses that document’s vector to search.[4] For the question “How do I detect memory leaks in Python?”, the LLM generates a hypothetical answer like “Detecting memory leaks in Python can be done with the tracemalloc module…” and that text is used for retrieval. This aligns the search vector more closely with what a real document would look like.
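The HyDE flow can be sketched as three steps: generate, embed, search. In the sketch below the LLM, the embedding model, and the vector store are passed in as functions so the example runs on its own; `toy_generate`, `toy_embed`, `toy_vector_search`, the vocabulary, and the corpus are all hypothetical stand-ins, not real APIs.

```python
def hyde_search(question, generate_answer, embed, vector_search, top_k=5):
    """HyDE: search with the embedding of a hypothetical answer, not the raw question."""
    hypothetical_doc = generate_answer(question)  # an LLM call in a real system
    query_vector = embed(hypothetical_doc)        # an embedding model call
    return vector_search(query_vector, top_k)     # a vector store lookup


VOCAB = ["memory", "tracemalloc", "python", "pasta"]


def toy_embed(text):
    """Hypothetical embedding: counts of a few vocabulary words."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return [words.count(term) for term in VOCAB]


def toy_generate(question):
    """Stand-in for an LLM call; returns a canned hypothetical answer."""
    return "Detecting memory leaks in Python can be done with the tracemalloc module."


CORPUS = {
    "tracemalloc_guide": "Use tracemalloc to take snapshots and compare memory allocations in Python.",
    "pasta_recipe": "Boil the pasta for ten minutes.",
}


def toy_vector_search(query_vector, top_k):
    """Stand-in for a vector store: dot product against every corpus entry."""
    scored = [
        (doc_id, sum(q * d for q, d in zip(query_vector, toy_embed(text))))
        for doc_id, text in CORPUS.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


results = hyde_search(
    "How do I detect memory leaks in Python?",
    toy_generate, toy_embed, toy_vector_search, top_k=1,
)
print(results)
```

The question never mentions tracemalloc, but the hypothetical answer does, so the search vector overlaps with the real document on that term. The trade-off is one extra LLM call per query, and a hypothetical answer that is wrong can pull retrieval in the wrong direction.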

A recommended starting configuration:

  1. Use hybrid search (BM25 + vector) as the foundation
  2. Add a reranker when accuracy requirements are high
  3. Add query rewriting when conversational context matters

“Start with vector-only and move to hybrid if accuracy is insufficient” is also a valid path. That said, for use cases involving product names, model numbers, or error codes, starting with hybrid is advisable from the beginning.

Search platforms such as Weaviate, Qdrant, and Elasticsearch provide ways to combine BM25-style retrieval with vector search. Cohere also provides a Rerank API for re-ordering candidates after initial retrieval.[3]

# Hybrid search using Weaviate
# pip install weaviate-client
import weaviate

client = weaviate.connect_to_local()

try:
    # Assumes a "Documents" collection with a "content" text property already exists
    collection = client.collections.get("Documents")

    results = collection.query.hybrid(
        query="How to read a file in Python",
        alpha=0.5,   # 0 = BM25 only, 1 = vector only, 0.5 = equal blend
        limit=10
    )

    for obj in results.objects:
        print(obj.properties["content"][:100])
finally:
    # Release the connection when done
    client.close()

The alpha parameter controls the balance between BM25 and vector search. For use cases where exact matches matter, try alpha=0.3 (BM25-dominant); for semantic search, try alpha=0.7 (vector-dominant). Tune with an evaluation dataset.
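One way to tune alpha with an evaluation dataset is to sweep a few values and compare a retrieval metric such as recall@k. The sketch below assumes a labeled evaluation set of (query, relevant document IDs) pairs; `fake_search` is a hypothetical stand-in for a real hybrid search call and simply returns canned rankings so the example runs without a database.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def sweep_alpha(search_fn, eval_set, alphas, k=10):
    """Return mean recall@k for each alpha over a labeled evaluation set."""
    report = {}
    for alpha in alphas:
        recalls = [
            recall_at_k(search_fn(query, alpha), relevant, k)
            for query, relevant in eval_set
        ]
        report[alpha] = sum(recalls) / len(recalls)
    return report


def fake_search(query, alpha):
    """Hypothetical search stub: returns ranked doc IDs that depend only on alpha."""
    if alpha < 0.5:
        return ["sku-2048-manual", "general-faq"]    # keyword-leaning ranking
    if alpha > 0.5:
        return ["general-faq", "returns-policy"]     # vector-leaning ranking
    return ["sku-2048-manual", "returns-policy"]     # balanced ranking


eval_set = [
    ("SKU-2048 setup", ["sku-2048-manual"]),
    ("how do I send something back", ["returns-policy"]),
]

report = sweep_alpha(fake_search, eval_set, alphas=[0.3, 0.5, 0.7], k=2)
for alpha, score in report.items():
    print(f"alpha={alpha}: mean recall@2 = {score:.2f}")
```

In a real setup, `search_fn` would wrap the hybrid query and return the IDs of the retrieved objects. The evaluation set should include both exact-match queries and paraphrased queries, since alpha shifts accuracy between the two.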

  • BM25 is strong for exact word matches; vector search is strong for semantic similarity
  • Hybrid search uses RRF to merge both, compensating for each method’s weaknesses
  • A reranker (cross-encoder) improves accuracy at the cost of additional latency and compute
  • Query rewriting is useful preprocessing for bringing conversational context into the search
  • In practice, building with hybrid search as the foundation and adding a reranker based on accuracy requirements is a realistic approach

Q: Should I always use hybrid search?

A: In most cases, hybrid search provides more consistent accuracy than vector-only. However, for documents that are semantically uniform with little paraphrasing (for example, short fixed-format FAQs), vector search alone may be sufficient. I recommend starting with hybrid and making decisions based on evaluation data.

Q: Is reranking always worth the extra cost?

A: For use cases where accuracy is critical (legal, medical, customer support), the additional cost of reranking is well justified. For cases where latency is the top priority (real-time chat responses) or accuracy requirements are low, operating without a reranker is also reasonable. The right approach is to measure the accuracy improvement with an evaluation set before deciding.

Q: Does BM25 work for Japanese?

A: Yes, but Japanese text is not naturally space-delimited like English, so morphological analysis (MeCab, Sudachi, etc.) for word segmentation is required. Without a proper tokenizer, BM25 accuracy will be significantly degraded.

Q: Does query rewriting require an LLM call on every request?

A: Essentially yes. However, using a lightweight model (GPT-4o-mini, Claude Haiku, etc.) keeps the cost manageable. For single-turn questions (no conversation history), the benefit of query rewriting is limited and it is often skipped.

  1. BM25 — Wikipedia
  2. sentence-transformers — CrossEncoder
  3. Cohere Rerank — Official Documentation
  4. HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels