Retrieval Strategies
RAG quality depends on the accuracy of document retrieval. Vector search (which finds semantically close text) and keyword search (which finds exact word matches) each have distinct strengths and weaknesses. This page organizes the main retrieval approaches and explains when to use each.
The Three Main Retrieval Approaches
RAG retrieval methods fall into three broad categories.
| Approach | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Keyword Search (BM25) | Scores documents by word match frequency | Exact matches: product names, model numbers, error codes | Paraphrasing, synonyms, semantic similarity |
| Vector Search | Retrieves documents by embedding vector similarity | Semantic similarity, paraphrasing, multilingual | Exact matches for specific keywords |
| Hybrid Search | Combines both and merges scores | Covers both of the above | Requires configuration and tuning |
How BM25 Keyword Search Works
BM25 (Best Matching 25) is a standard ranking function for keyword search. It scores documents by combining TF (Term Frequency: how often query words appear in a document) and IDF (Inverse Document Frequency: how rare each word is across all documents), with a normalization for document length.[1]
For a query like “Python error solution,” documents that contain more occurrences of “Python,” “error,” and “solution” score higher. If “Python” appears in many documents in the collection, IDF reduces its contribution to the overall score, so the rarer terms carry more weight.
BM25 characteristics:
- Strong at exact word matching (e.g., product codes like “SKU-2048”)
- Fast index construction
- No vectorization required (low computational cost)
- Cannot handle paraphrasing or synonyms (“smartphone” and “mobile phone” are treated as different terms)
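The TF and IDF combination described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, and tokenization is plain whitespace splitting.

```python
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with a basic BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_counts[term]
            if tf == 0:
                continue
            # Rare terms get a larger IDF weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # TF saturates via k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "python error solution for file handling".split(),
    "python tutorial for beginners".split(),
    "java error codes explained".split(),
]
scores = bm25_scores("python error solution".split(), docs)
```

In this toy corpus the first document contains all three query terms, so it receives the highest score. A document that only says “fix” instead of “solution” would get no credit for that term, which is the synonym limitation listed above.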
How Vector Search (ANN) Works
Vector search uses distances between embedding vectors to find semantically similar documents.
Computing the distance between a query vector and every stored vector (brute-force search) becomes impractical at scale. In practice, ANN (Approximate Nearest Neighbor) is used — it finds vectors that are “close enough” rather than strictly nearest, at much higher speed. Algorithms such as HNSW (Hierarchical Navigable Small World) are widely used.
Vector search characteristics:
- Handles paraphrasing and word variation well (“how to search” vs. “tell me how to find things”)
- Can match semantically similar text across languages when the embedding model is multilingual
- May lose accuracy when exact keyword matching is required
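For reference, the brute-force search that ANN approximates can be sketched as follows. The vectors are hand-written three-dimensional stand-ins for real embeddings, and the names `cosine_similarity`, `brute_force_search`, and the document IDs are assumptions for illustration. An ANN index such as HNSW would return similar results without comparing the query against every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vec, doc_vecs, top_k=2):
    """Exact nearest-neighbor search: compare the query with every stored vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Toy 3-dimensional "embeddings" (hypothetical); real models produce far more dimensions.
doc_vecs = {
    "doc_search": [0.9, 0.1, 0.0],
    "doc_find": [0.8, 0.2, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}
results = brute_force_search([1.0, 0.1, 0.0], doc_vecs, top_k=2)
```

The cost of this loop grows linearly with the number of stored vectors, which is why it becomes impractical at scale.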
Why Hybrid Search Is Effective
BM25 and vector search evaluate documents from fundamentally different perspectives, so combining them creates a complementary relationship.
```mermaid
graph TD
    Q["User query"] --> BM25["BM25 keyword search"]
    Q --> VEC["Vector search (ANN)"]
    BM25 --> M["Score fusion (RRF etc.)"]
    VEC --> M
    M --> R["Merged ranking"]
    R --> RE["Reranker (optional)"]
    RE --> TOP["Top K results → LLM"]
```

Hybrid search merges two result sets into a unified ranking. RRF (Reciprocal Rank Fusion) is a simple, proven method for this. RRF uses the rank position of each result to compute a combined score, making it possible to merge two searches with different score scales directly.
```python
def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    """
    Merge BM25 and vector search results using RRF.

    bm25_results: [(doc_id, score), ...], best result first
    vector_results: [(doc_id, score), ...], best result first
    k: RRF constant (typically 60)
    """
    scores = {}
    # Ranks start at 1, so the top result of each list contributes 1 / (k + 1).
    for rank, (doc_id, _) in enumerate(bm25_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    for rank, (doc_id, _) in enumerate(vector_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

What Is Reranking?
A reranker is a component that re-orders the candidate documents returned by the initial search based on their relevance to the query.
If search (BM25 + vector) is the “broad candidate gathering” stage, reranking is the “selecting documents that actually answer the question” stage.
Cross-encoders are commonly used as rerankers. A cross-encoder processes the query and each document together to compute a relevance score.[2]
```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to read a file in Python"
candidates = [
    "Explains how to open files using Python's open() function",
    "Basic steps for reading files in Java",
    "Best practices for file operations in Python using with statements",
]

# Score each (query, document) pair, then sort by relevance, highest first.
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

for doc, score in ranked:
    print(f"{score:.4f}: {doc[:60]}...")
```

Reranker Cost Tradeoffs
A reranker improves accuracy but adds latency and cost.
| Stage | Documents processed | Purpose |
|---|---|---|
| Search (BM25 + vector) | All documents (thousands to millions) | Narrow down to top 50–100 candidates |
| Reranker | Top 50–100 candidates | Narrow to final top 5–20 |
| LLM input | Top 5–20 | Answer generation |
By limiting the reranker’s input to the top 50–100 results from the search stage, computational cost stays manageable while accuracy improves.
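The funnel in the table can be sketched as a two-stage function. This is an illustration under stated assumptions: `retrieve_then_rerank`, `word_overlap`, and `counted_phrase_match` are hypothetical names, and the two scorers are toy stand-ins for a real first-stage search and a real cross-encoder. The call counter only exists to make the cost saving visible.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score,
                         n_candidates=100, top_k=5):
    """Two-stage retrieval: cheap scoring over everything, costly scoring over a shortlist."""
    # Stage 1: rank all documents with the cheap scorer and keep a shortlist.
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: run the expensive scorer only on the shortlist.
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]

# Toy stand-in for the first-stage search: count shared words.
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

calls = {"expensive": 0}

# Toy stand-in for a cross-encoder: counts its own calls and rewards exact phrase matches.
def counted_phrase_match(query, doc):
    calls["expensive"] += 1
    return word_overlap(query, doc) + (2 if query.lower() in doc.lower() else 0)

docs = [
    "read a file in python with open",
    "python file handling basics",
    "java file reading steps",
    "cooking pasta at home",
    "gardening tips for spring",
]
top = retrieve_then_rerank("file in python", docs, word_overlap,
                           counted_phrase_match, n_candidates=3, top_k=1)
```

Here the expensive scorer runs on three shortlisted documents rather than all five. With a real corpus the same structure keeps the reranker's workload fixed at `n_candidates` regardless of collection size.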
Query Expansion and Query Rewriting
User questions are sometimes poor search queries. Pre-processing queries can meaningfully improve retrieval accuracy.
Query Rewriting
Query rewriting turns context-dependent questions (“How do I do that?”) into concrete, standalone search queries.
```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(conversation_history, user_question):
    """Rewrite a query with conversation context taken into account"""
    messages = [
        {
            "role": "system",
            "content": (
                "From the conversation and question below, generate one standalone search query. "
                "Replace any pronoun references (it, this, that) with specific terms from the context. "
                "Output only the query itself, no explanation."
            )
        },
        *conversation_history,
        {"role": "user", "content": f"Question: {user_question}"}
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=100
    )
    return response.choices[0].message.content
```

Query Expansion (HyDE)
HyDE (Hypothetical Document Embeddings) asks the LLM to generate a hypothetical document that would answer the question, then uses that document’s vector to search.[4] For the question “How do I detect memory leaks in Python?”, the LLM generates a hypothetical answer like “Detecting memory leaks in Python can be done with the tracemalloc module…” and that text is used for retrieval. This aligns the search vector more closely with what a real document would look like.
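The HyDE flow can be sketched with pluggable `generate` and `embed` callables. Everything here is an assumption for illustration: `hyde_search`, `toy_generate`, `toy_embed`, and the five-word vocabulary are hypothetical stand-ins so the example runs without an LLM or an embedding model. In a real system, `generate` would call an LLM and `embed` would call an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hyde_search(question, doc_texts, generate, embed, top_k=3):
    """Search with the embedding of a generated hypothetical answer instead of the question."""
    hypothetical_doc = generate(question)   # an LLM call in a real system
    query_vec = embed(hypothetical_doc)     # embed the answer, not the question
    scored = [(text, cosine(query_vec, embed(text))) for text in doc_texts]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Toy stand-ins (assumptions): a bag-of-words "embedding" over a tiny vocabulary
# and a canned "LLM" response.
VOCAB = ["tracemalloc", "memory", "leak", "python", "pasta"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def toy_generate(question):
    return "detecting a memory leak in python can be done with tracemalloc"

docs = [
    "use tracemalloc to trace memory allocation in python",
    "how to cook pasta",
]
results = hyde_search("How do I detect memory leaks in Python?", docs,
                      toy_generate, toy_embed, top_k=1)
```

The hypothetical answer mentions `tracemalloc`, a term the original question never used, so the search vector lands closer to the document that actually covers it.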
Practical Design Guidelines
Start with Hybrid Search
A recommended starting configuration:
- Use hybrid search (BM25 + vector) as the foundation
- Add a reranker when accuracy requirements are high
- Add query rewriting when conversational context matters
“Start with vector-only and move to hybrid if accuracy is insufficient” is also a valid path. That said, for use cases involving product names, model numbers, or error codes, hybrid is advisable from the start.
Hybrid Search Implementation Example
Search platforms such as Weaviate, Qdrant, and Elasticsearch provide ways to combine BM25-style retrieval with vector search. Cohere also provides a Rerank API for re-ordering candidates after initial retrieval.[3]
```python
# Hybrid search using Weaviate
# pip install weaviate-client
import weaviate

client = weaviate.connect_to_local()
try:
    collection = client.collections.get("Documents")
    results = collection.query.hybrid(
        query="How to read a file in Python",
        alpha=0.5,  # 0 = BM25 only, 1 = vector only, 0.5 = equal blend
        limit=10,
    )
    for obj in results.objects:
        print(obj.properties["content"][:100])
finally:
    client.close()  # release the connection when done
```

The alpha parameter controls the balance between BM25 and vector search. For use cases where exact matches matter, try alpha=0.3 (BM25-dominant); for semantic search, try alpha=0.7 (vector-dominant). Treat these values as starting points and tune them with an evaluation dataset.
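To build intuition for what an alpha-style weight does, here is a minimal sketch of weighted score blending. It is not Weaviate's implementation, whose fusion algorithms may work differently. The function `blend_scores`, the min-max normalization, and the sample scores are assumptions made for the example.

```python
def blend_scores(bm25_scores, vector_scores, alpha=0.5):
    """Blend two {doc_id: score} dicts: alpha=0 is keyword-only, alpha=1 is vector-only."""
    def min_max(scores):
        # Rescale to [0, 1] so the two score scales are comparable.
        if not scores:
            return {}
        low, high = min(scores.values()), max(scores.values())
        if high == low:
            return {doc_id: 1.0 for doc_id in scores}
        return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

    bm25_norm, vec_norm = min_max(bm25_scores), min_max(vector_scores)
    # A document missing from one result set contributes 0 from that side.
    blended = {
        doc_id: (1 - alpha) * bm25_norm.get(doc_id, 0.0) + alpha * vec_norm.get(doc_id, 0.0)
        for doc_id in set(bm25_norm) | set(vec_norm)
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores from each retriever.
bm25 = {"sku_page": 12.0, "faq": 3.0}
vector = {"faq": 0.91, "guide": 0.80}

keyword_heavy = blend_scores(bm25, vector, alpha=0.3)
vector_heavy = blend_scores(bm25, vector, alpha=0.7)
```

With these toy scores, the keyword-heavy setting ranks the exact-match page first, while the vector-heavy setting ranks the semantically closest page first.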
Summary
- BM25 is strong for exact word matches; vector search is strong for semantic similarity
- Hybrid search merges both result sets (for example, with RRF), compensating for each method’s weaknesses
- A reranker (cross-encoder) improves accuracy at the cost of additional latency and compute
- Query rewriting is useful preprocessing for bringing conversational context into the search
- In practice, building with hybrid search as the foundation and adding a reranker based on accuracy requirements is a realistic approach
Q: Should I always use hybrid search?
A: In most cases, hybrid search provides more consistent accuracy than vector-only. However, for documents that are semantically uniform with little paraphrasing (for example, short fixed-format FAQs), vector search alone may be sufficient. A reasonable default is to start with hybrid and make decisions based on evaluation data.
Q: Is reranking always worth the extra cost?
A: For use cases where accuracy is critical (legal, medical, customer support), the additional cost of reranking is well justified. For cases where latency is the top priority (real-time chat responses) or accuracy requirements are low, operating without a reranker is also reasonable. The right approach is to measure the accuracy improvement with an evaluation set before deciding.
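One way to measure that improvement is recall@k over a small labeled evaluation set. The sketch below is illustrative only: the function names, query IDs, document IDs, and rankings are all hypothetical, not results from a real system.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, eval_set, k=5):
    """Average recall@k over all evaluation queries."""
    total = sum(recall_at_k(runs[q], relevant, k) for q, relevant in eval_set.items())
    return total / len(eval_set)

# Hypothetical evaluation set: query ID -> IDs of documents judged relevant.
eval_set = {"q1": ["d1", "d2"], "q2": ["d7"]}

# Hypothetical rankings produced by two pipeline variants.
without_reranker = {"q1": ["d1", "d9", "d2"], "q2": ["d3", "d4", "d7"]}
with_reranker = {"q1": ["d1", "d2", "d9"], "q2": ["d7", "d3", "d4"]}

baseline = mean_recall_at_k(without_reranker, eval_set, k=2)
reranked = mean_recall_at_k(with_reranker, eval_set, k=2)
```

Comparing `baseline` and `reranked` on the same queries, alongside measured latency, gives a concrete basis for deciding whether the reranker earns its cost.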
Q: Does BM25 work for Japanese?
A: Yes, but unlike English, Japanese text is not space-delimited, so word segmentation by morphological analysis (MeCab, Sudachi, etc.) is required. Without a proper tokenizer, BM25 accuracy will be significantly degraded.
Q: Does query rewriting require an LLM call on every request?
A: Essentially yes. However, using a lightweight model (GPT-4o-mini, Claude Haiku, etc.) keeps the cost manageable. For single-turn questions (no conversation history), the benefit of query rewriting is limited and it is often skipped.