6 min read

Hybrid Search RAG with Weaviate: Vectors + BM25


The first time hybrid search clicked for me was when a user searched for "HIPAA compliance requirements" and our pure vector search returned general healthcare regulation chunks instead of the specific HIPAA document. BM25 would have nailed it. The embedding model understood the semantic neighborhood of healthcare compliance, but it didn't care about the exact acronym. It treated "HIPAA" as roughly interchangeable with "healthcare regulations" and "medical data privacy standards."

That one failure taught me more about retrieval than any tutorial.

Hybrid search fuses BM25 keyword matching with vector similarity, then ranks results using reciprocal rank fusion.

Why Pure Vector Search Isn't Enough

Vector search is incredible at capturing meaning. "How do I reset my password" and "I can't log into my account" will match even though they share zero keywords. That semantic understanding is why vector search powers most RAG pipelines today.

But it has a blind spot. Vector search struggles with specificity.

Try searching for "SOC2 Type II audit requirements" in a compliance knowledge base. The embedding model knows this is about security compliance, so it returns chunks about security certifications in general. What you wanted was the exact SOC2 document. Or search for ticket ID "JIRA-4521" and watch the vector search return random tickets because the embedding model has no meaningful representation of that identifier.

The pattern is consistent. Vector search underperforms on:

  • Acronyms and abbreviations like HIPAA, SOC2, GDPR, PCI-DSS
  • Proper nouns like specific product names, company names, or people
  • Technical jargon that embedding models lump into broad categories
  • ID numbers, codes, and exact-match terms that have no semantic neighborhood

These are exactly the queries where traditional keyword matching, specifically BM25, excels. BM25 doesn't understand meaning. It matches tokens. And sometimes, token matching is exactly what you need.
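BM25's scoring is compact enough to sketch. The following is a minimal, illustrative implementation of the classic BM25 formula (with the standard free parameters `k1` and `b`), not the tuned, production version a search engine ships:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query using the classic BM25 formula."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average document length
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)            # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc.count(term)                                 # term frequency
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)     # length normalization
        score += idf * (tf * (k1 + 1)) / denom
    return score

corpus = [
    "hipaa compliance requirements for covered entities".split(),
    "general healthcare regulations overview".split(),
]
# The exact token "hipaa" appears only in the first doc, so it scores higher.
print(bm25_score("hipaa requirements".split(), corpus[0], corpus))
print(bm25_score("hipaa requirements".split(), corpus[1], corpus))
```

Note what's absent: no embeddings, no notion of meaning. "hipaa" either appears in the document or it doesn't, which is precisely why BM25 nails acronym and ID queries.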

Most vector databases bolted on keyword search as an afterthought. Weaviate built it in natively -- one of its biggest differentiators in the Pinecone vs Qdrant vs Weaviate comparison. It supports both BM25 and vector search in a single query, then merges the results using reciprocal rank fusion.

Here's how reciprocal rank fusion works: each search method produces its own ranked list of results. The fusion algorithm combines these lists by assigning scores based on rank position. A document that ranks highly in both lists gets boosted. A document that only one method finds still makes it into the final results, just with a lower combined score.
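As a concrete sketch (not Weaviate's internal code), reciprocal rank fusion with the commonly used constant k=60 looks like this:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked lists of doc IDs; higher fused score ranks first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) for every doc it returned
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["hipaa-doc", "soc2-doc", "gdpr-doc"]
vector_results = ["healthcare-overview", "hipaa-doc", "privacy-guide"]

# "hipaa-doc" appears in both lists, so it rises to the top of the fused ranking.
print(reciprocal_rank_fusion([bm25_results, vector_results]))
```

The doc IDs here are made up, but the mechanics match the description above: appearing in both lists compounds the score, appearing in one still gets you into the final results.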

The key parameter is alpha. It controls the balance between the two search methods:

  • alpha = 0 is pure BM25, keyword matching only
  • alpha = 1 is pure vector search, embeddings only
  • alpha = 0.5 weights both methods equally
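Conceptually, alpha is a weighted blend of the two methods' contributions. The exact normalization depends on which fusion type Weaviate uses, but the arithmetic of the bullet points above amounts to this:

```python
def blend(bm25_score, vector_score, alpha):
    """Weighted blend of the two (already-normalized) scores."""
    return (1 - alpha) * bm25_score + alpha * vector_score

# alpha=0 keeps only BM25; alpha=1 keeps only the vector score
print(blend(0.9, 0.2, 0.0))  # pure BM25
print(blend(0.9, 0.2, 1.0))  # pure vector
print(blend(0.9, 0.2, 0.5))  # equal weight
```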

Here's what a hybrid search looks like with the Weaviate Python client:

import weaviate
import weaviate.classes.query as wq
 
client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()
 
documents = client.collections.get("ComplianceDocs")
 
# Hybrid search: BM25 + vector, balanced at alpha=0.5
response = documents.query.hybrid(
    query="HIPAA compliance requirements",
    alpha=0.5,
    limit=5,
    return_metadata=wq.MetadataQuery(score=True, explain_score=True),
)
 
for obj in response.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Content: {obj.properties['content'][:200]}")
    print(f"Explanation: {obj.metadata.explain_score}")
    print("---")

That explain_score metadata is invaluable for debugging. It tells you exactly how much each search method contributed to the final ranking, so you can see whether BM25 or vector search is doing the heavy lifting for a given query.

For a full RAG pipeline, you feed the retrieved chunks into your LLM context as usual:

from openai import OpenAI
 
openai_client = OpenAI()
 
# Retrieve with hybrid search
response = documents.query.hybrid(
    query="What are the HIPAA data encryption requirements?",
    alpha=0.4,  # slightly favor BM25 for this specific/acronym-heavy query
    limit=5,
)
 
# Build context from retrieved chunks
context = "\n\n".join(
    obj.properties["content"] for obj in response.objects
)
 
# Generate answer grounded in retrieved documents
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
        {"role": "user", "content": "What are the HIPAA data encryption requirements?"},
    ],
)
 
print(completion.choices[0].message.content)

Nothing exotic. The only change from a standard RAG pipeline is swapping query.near_text() for query.hybrid(). One line. Of course, the quality of what you retrieve also depends on how you chunk your documents in the first place.

Tuning Alpha

The default advice is to start at 0.5 and adjust from there. That's fine, but here's what I've actually found works in practice.

Lower alpha toward BM25 (0.2-0.4) when:

  • Users search for specific acronyms, product names, or codes
  • Your domain has heavy jargon that embedding models underrepresent
  • Queries tend to be short and keyword-dense
  • You're working with legal, compliance, or medical documents where exact terminology matters

Raise alpha toward vector (0.6-0.8) when:

  • Users ask conversational, natural-language questions
  • Synonyms and paraphrasing are common in your queries
  • Your documents use inconsistent terminology for the same concepts
  • Questions are exploratory rather than lookup-oriented

Don't guess. Measure. Build an evaluation set of 50-100 query-document pairs where you know the correct retrievals. Run them at different alpha values and track hit rate at k, MRR (mean reciprocal rank), or NDCG. You'll find that the optimal alpha varies by query type, which is why some production systems dynamically adjust alpha based on query classification. Short keyword-heavy queries get a lower alpha. Long conversational queries get a higher one.
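The MRR computation itself is only a few lines. Here is a minimal sketch, assuming you already have, for each evaluation query, the ranked doc IDs your system returned plus the ID of the known-correct document:

```python
def mean_reciprocal_rank(results):
    """results: list of (ranked_doc_ids, correct_doc_id) pairs."""
    total = 0.0
    for ranked_ids, correct_id in results:
        if correct_id in ranked_ids:
            # Reciprocal of the 1-indexed rank of the correct document
            total += 1.0 / (ranked_ids.index(correct_id) + 1)
        # A miss contributes 0
    return total / len(results)

eval_set = [
    (["hipaa-doc", "gdpr-doc"], "hipaa-doc"),  # hit at rank 1 -> 1.0
    (["soc2-doc", "hipaa-doc"], "hipaa-doc"),  # hit at rank 2 -> 0.5
    (["gdpr-doc"], "pci-doc"),                 # miss -> 0.0
]
print(mean_reciprocal_rank(eval_set))
```

Run your evaluation set through `query.hybrid()` at each candidate alpha, compute MRR on the results, and pick the alpha that wins for your traffic.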

A simple heuristic that works surprisingly well: if the query is under 5 words, bias toward BM25. If it's a full sentence, bias toward vector. It's crude, but it beats a static alpha for mixed-intent traffic.
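That heuristic fits in a one-liner. The function name, the five-word cutoff, and the two alpha values below are my illustrative choices, not anything Weaviate provides:

```python
def choose_alpha(query, short_cutoff=5, short_alpha=0.3, long_alpha=0.7):
    """Crude heuristic: short keyword queries lean BM25, full sentences lean vector."""
    return short_alpha if len(query.split()) < short_cutoff else long_alpha

print(choose_alpha("SOC2 Type II requirements"))                         # 4 words -> BM25-leaning
print(choose_alpha("What are the HIPAA data encryption requirements?"))  # 7 words -> vector-leaning
```

Pass the result straight into the `alpha` parameter of `query.hybrid()`.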

The Takeaway

Hybrid search is the lowest-effort, highest-impact improvement you can make to most RAG pipelines. You're not adding a new component or a separate reranking step. You're changing one query method and adding one parameter. If you're already using Weaviate, it's a five-minute change. If you're not, this is a strong reason to consider it.

Pure vector search is not enough. Pure keyword search is not enough either. The answer, as it usually is in engineering, is to use both and let the data tell you the right balance.