Hybrid Search RAG with Weaviate: Vectors + BM25
The first time hybrid search clicked for me was when a user searched for "HIPAA compliance requirements" and our pure vector search returned general healthcare regulation chunks instead of the specific HIPAA document. BM25 would have nailed it. The embedding model understood the semantic neighborhood of healthcare compliance, but it didn't care about the exact acronym. It treated "HIPAA" as roughly interchangeable with "healthcare regulations" and "medical data privacy standards."
That one failure taught me more about retrieval than any tutorial.
Why Pure Vector Search Isn't Enough
Vector search is incredible at capturing meaning. "How do I reset my password" and "I can't log into my account" will match even though they share zero keywords. That semantic understanding is why vector search powers most RAG pipelines today.
But it has a blind spot. Vector search struggles with specificity.
Try searching for "SOC2 Type II audit requirements" in a compliance knowledge base. The embedding model knows this is about security compliance, so it returns chunks about security certifications in general. What you wanted was the exact SOC2 document. Or search for ticket ID "JIRA-4521" and watch the vector search return random tickets because the embedding model has no meaningful representation of that identifier.
The pattern is consistent. Vector search underperforms on:
- Acronyms and abbreviations like HIPAA, SOC2, GDPR, PCI-DSS
- Proper nouns like specific product names, company names, or people
- Technical jargon that embedding models lump into broad categories
- ID numbers, codes, and exact-match terms that have no semantic neighborhood
These are exactly the queries where traditional keyword matching, specifically BM25, excels. BM25 doesn't understand meaning. It matches tokens. And sometimes, token matching is exactly what you need.
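To make the contrast concrete, here's a toy sketch of keyword scoring in the spirit of BM25: IDF-weighted exact-token overlap. The `keyword_scores` helper and the two-document corpus are made up for illustration, and real BM25 additionally weights term frequency and normalizes for document length:

```python
import math

def keyword_scores(query, corpus):
    """Simplified BM25-style scoring: IDF-weighted exact token overlap.
    (Real BM25 also accounts for term frequency and document length.)"""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in docs)  # document frequency of the term
            if term in doc:
                # rare terms (low df) contribute more, exactly like IDF in BM25
                score += math.log((n - df + 0.5) / (df + 0.5) + 1)
        scores.append(score)
    return scores

corpus = [
    "HIPAA compliance requirements for covered entities",
    "general healthcare regulations and medical privacy",
]
scores = keyword_scores("HIPAA compliance requirements", corpus)
# The second doc shares zero query tokens, so it scores nothing --
# no matter how semantically close "healthcare regulations" is to HIPAA.
```

This is the flip side of the vector-search failure above: token matching nails the acronym precisely because it doesn't model meaning at all.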
Weaviate's Hybrid Search
Most vector databases bolted on keyword search as an afterthought. Weaviate built it in natively, and that native support is one of its biggest differentiators in the Pinecone vs Qdrant vs Weaviate comparison. It supports both BM25 and vector search in a single query, then merges the results using reciprocal rank fusion.
Here's how reciprocal rank fusion works: each search method produces its own ranked list of results. The fusion algorithm combines these lists by assigning scores based on rank position. A document that ranks highly in both lists gets boosted. A document that only one method finds still makes it into the final results, just with a lower combined score.
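The fusion step itself is only a few lines. Here's a minimal sketch, where the `reciprocal_rank_fusion` helper and the document IDs are hypothetical and `k=60` is the constant commonly used in the RRF literature:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists: each doc earns 1/(k + rank) per list."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Each search method produces its own ranking (doc IDs are illustrative)
bm25_ranking = ["hipaa_doc", "soc2_doc", "gdpr_doc"]
vector_ranking = ["healthcare_overview", "hipaa_doc", "privacy_doc"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# "hipaa_doc" ends up first: it's the only doc ranked by both methods
```

Note the behavior described above falls out naturally: a document found by only one method still appears in `fused`, just with a single rank contribution instead of two.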
The key parameter is alpha. It controls the balance between the two search methods:
- `alpha = 0` is pure BM25, keyword matching only
- `alpha = 1` is pure vector search, embeddings only
- `alpha = 0.5` weights both methods equally
Here's what a hybrid search looks like with the Weaviate Python client:
```python
import weaviate
import weaviate.classes.query as wq

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()
documents = client.collections.get("ComplianceDocs")

# Hybrid search: BM25 + vector, balanced at alpha=0.5
response = documents.query.hybrid(
    query="HIPAA compliance requirements",
    alpha=0.5,
    limit=5,
    return_metadata=wq.MetadataQuery(score=True, explain_score=True),
)

for obj in response.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Content: {obj.properties['content'][:200]}")
    print(f"Explanation: {obj.metadata.explain_score}")
    print("---")
```

That explain_score metadata is invaluable for debugging. It tells you exactly how much each search method contributed to the final ranking, so you can see whether BM25 or vector search is doing the heavy lifting for a given query.
For a full RAG pipeline, you feed the retrieved chunks into your LLM context as usual:
```python
from openai import OpenAI

openai_client = OpenAI()

# Retrieve with hybrid search
response = documents.query.hybrid(
    query="What are the HIPAA data encryption requirements?",
    alpha=0.4,  # slightly favor BM25 for this specific/acronym-heavy query
    limit=5,
)

# Build context from retrieved chunks
context = "\n\n".join(
    obj.properties["content"] for obj in response.objects
)

# Generate answer grounded in retrieved documents
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
        {"role": "user", "content": "What are the HIPAA data encryption requirements?"},
    ],
)
print(completion.choices[0].message.content)
```

Nothing exotic. The only change from a standard RAG pipeline is swapping query.near_text() for query.hybrid(). One line. Of course, the quality of what you retrieve also depends on how you chunk your documents in the first place.
Tuning Alpha
The default advice is to start at 0.5 and adjust from there. That's fine, but here's what I've actually found works in practice.
Lower alpha toward BM25 (0.2-0.4) when:
- Users search for specific acronyms, product names, or codes
- Your domain has heavy jargon that embedding models underrepresent
- Queries tend to be short and keyword-dense
- You're working with legal, compliance, or medical documents where exact terminology matters
Raise alpha toward vector (0.6-0.8) when:
- Users ask conversational, natural-language questions
- Synonyms and paraphrasing are common in your queries
- Your documents use inconsistent terminology for the same concepts
- Questions are exploratory rather than lookup-oriented
Don't guess. Measure. Build an evaluation set of 50-100 query-document pairs where you know the correct retrievals. Run them at different alpha values and track hit rate at k, MRR (mean reciprocal rank), or NDCG. You'll find that the optimal alpha varies by query type, which is why some production systems dynamically adjust alpha based on query classification. Short keyword-heavy queries get a lower alpha. Long conversational queries get a higher one.
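The evaluation sweep doesn't need a framework. Here's a minimal sketch, assuming you wrap your hybrid query in a `retrieve(query, alpha=..., limit=...)` function that returns ranked document IDs; both the wrapper and the `eval_set` variable are names I'm inventing for illustration:

```python
def evaluate(eval_set, retrieve, alpha, k=5):
    """Hit rate@k and MRR over (query, expected_doc_id) pairs."""
    hits, rr_sum = 0, 0.0
    for query, expected_id in eval_set:
        results = retrieve(query, alpha=alpha, limit=k)  # ranked doc IDs
        if expected_id in results:
            hits += 1
            # reciprocal rank: 1 for first place, 1/2 for second, ...
            rr_sum += 1.0 / (results.index(expected_id) + 1)
    n = len(eval_set)
    return hits / n, rr_sum / n

# Sweep alpha and pick the winner on your curated eval set:
# for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
#     hit_rate, mrr = evaluate(eval_set, retrieve, alpha)
#     print(f"alpha={alpha}: hit@5={hit_rate:.2f}, MRR={mrr:.2f}")
```

Fifty to a hundred pairs is enough for the sweep to show a clear trend, and the whole thing runs in seconds against a local Weaviate instance.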
A simple heuristic that works surprisingly well: if the query is under 5 words, bias toward BM25. If it's a full sentence, bias toward vector. It's crude, but it beats a static alpha for mixed-intent traffic.
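That heuristic is a one-liner. A sketch, with the 5-word threshold and the 0.3/0.7 alpha values as placeholder choices you would tune against your own traffic:

```python
def pick_alpha(query: str) -> float:
    """Crude dynamic alpha: short keyword queries lean BM25,
    full sentences lean vector search."""
    return 0.3 if len(query.split()) < 5 else 0.7

pick_alpha("HIPAA encryption requirements")                    # → 0.3
pick_alpha("How should we encrypt patient records at rest?")   # → 0.7
```

The returned value plugs straight into the `alpha` argument of `query.hybrid()`, so the whole dynamic-alpha scheme adds one function call per query.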
The Takeaway
Hybrid search is the lowest-effort, highest-impact improvement you can make to most RAG pipelines. You're not adding a new component or a separate reranking step. You're changing one query method and adding one parameter. If you're already using Weaviate, it's a five-minute change. If you're not, this is a strong reason to consider it.
Pure vector search is not enough. Pure keyword search is not enough either. The answer, as it usually is in engineering, is to use both and let the data tell you the right balance.
Related Posts
Self-Hosting Qdrant: From Docker Compose to Production
Qdrant gives you the fastest open-source vector search. Here's how to go from docker-compose up to production-ready deployment.
Pinecone vs Qdrant vs Weaviate: An Engineer's Decision Framework
Not another feature matrix. Here are three real deployment scenarios and which vector database fits each one.
Reranking: The 20-Line Fix for Bad RAG Results
If your RAG pipeline retrieves the wrong chunks, adding a cross-encoder reranker between retrieval and generation can fix it in 20 lines of code.