6 min read

Reranking: The 20-Line Fix for Bad RAG Results


Here's a pattern I wish I'd known earlier: if your RAG answers are bad, the problem is almost never the LLM. It's the retriever. And the fix is a reranker.

I spent weeks tweaking prompts, adjusting temperature, and trying different models before I realized the issue was upstream. The LLM was generating perfectly reasonable answers given the context it received. The problem was that the context was wrong. The retriever was surfacing tangentially related chunks instead of the best one.

The fix took 20 lines of code.

The retrieve-then-rerank pattern: bi-encoder retrieval for speed, cross-encoder reranking for accuracy.

Why Bi-Encoders Miss

Most RAG pipelines use embedding models for retrieval. These are bi-encoders: they encode the query and each document independently into vectors, then rank by cosine similarity. This is the standard approach with models like all-MiniLM-L6-v2 or OpenAI's text-embedding-3-small.

Bi-encoders are fast. You can search millions of documents in milliseconds. But they have a fundamental limitation: the query and document never see each other during encoding. The model has to compress all meaning into a fixed-size vector without knowing what the query is asking.

This means the model can't do fine-grained relevance matching. It captures general topic similarity well, but struggles with nuance. A query like "How do I handle authentication timeouts in the Python SDK?" might rank a chunk about "Python SDK installation" higher than one about "timeout configuration for auth tokens" because the first chunk has more lexical overlap with the query terms.
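
To make this failure mode concrete, here is the bi-encoder ranking step in isolation. The 4-dimensional "embeddings" below are hand-made toy vectors for illustration, not real model output; they just encode the situation where topical overlap dominates the actual answer:

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    # Normalize, then score each document by cosine similarity to the query
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1], scores

# Toy vectors (invented): dims roughly = [python, sdk, timeout/auth, other]
query = np.array([1.0, 1.0, 0.2, 0.0])   # "auth timeouts in the Python SDK"
docs = np.array([
    [1.0, 0.9, 0.0, 0.0],   # "Python SDK installation" -- strong topic overlap
    [0.5, 0.4, 1.0, 0.0],   # "timeout configuration for auth tokens" -- the right answer
])

order, scores = cosine_rank(query, docs)
# The installation doc (index 0) outranks the actually-relevant doc (index 1)
```

The wrong document wins purely because its vector points in roughly the same direction as the query; nothing in the scoring step can notice that the query is specifically about timeouts.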

The result: your top-5 retrieved chunks are topically related but not precisely relevant. The LLM does its best with what it gets, and the answer is mediocre. Sometimes the problem is even further upstream -- if your chunking strategy splits a relevant passage across two chunks, no amount of reranking will reassemble it.

Cross-Encoder Reranking

A cross-encoder works differently. Instead of encoding query and document separately, it takes the query-document pair as a single input and outputs a relevance score directly. The query and document attend to each other through the full transformer attention mechanism. This means the model can do the kind of fine-grained token-level matching that bi-encoders cannot.

Cross-encoders are dramatically more accurate at judging relevance. But they're also much slower, because you can't pre-compute document embeddings. Every query requires a forward pass for each candidate document.

The trick is to combine both approaches. Use the bi-encoder to retrieve the top 20-50 candidates quickly, then use the cross-encoder to rerank those candidates down to the top 5. You get the speed of vector search with the accuracy of cross-attention.

This is the retrieve-then-rerank pattern, and it's one of the highest-leverage improvements you can make to any RAG system.

The Code

Here's the full pattern using sentence-transformers and a cross-encoder model:

from sentence_transformers import CrossEncoder
import numpy as np
 
# Load a cross-encoder model (runs locally, no API needed)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def retrieve_and_rerank(query: str, vector_store, top_k: int = 5):
    # Step 1: Bi-encoder retrieval (fast, broad)
    candidates = vector_store.similarity_search(query, k=20)
 
    # Step 2: Cross-encoder reranking (accurate, narrow)
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)
 
    # Step 3: Return top-k by cross-encoder score
    ranked_indices = np.argsort(scores)[::-1][:top_k]
    return [candidates[i] for i in ranked_indices]

That's it. The ms-marco-MiniLM-L-6-v2 model is small, fast, and works surprisingly well for general-purpose reranking.
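
If you want to see the plumbing end to end without downloading a model or standing up an index, here's a toy harness. `ToyStore` and the word-overlap scorer are crude stand-ins for the real vector store and cross-encoder (both invented for this sketch), but the control flow mirrors the function above:

```python
import numpy as np

class ToyStore:
    """Stand-in for a real vector store: 'retrieves' by shared words."""
    def __init__(self, docs):
        self.docs = docs

    def similarity_search(self, query, k=20):
        words = set(query.lower().split())
        hits = [d for d in self.docs if words & set(d.lower().split())]
        return hits[:k]

def toy_rerank_score(query, doc):
    # Stand-in for CrossEncoder.predict: fraction of query words in the doc
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def retrieve_and_rerank(query, store, top_k=5):
    candidates = store.similarity_search(query, k=20)          # broad retrieval
    scores = np.array([toy_rerank_score(query, d) for d in candidates])
    order = np.argsort(scores)[::-1][:top_k]                   # narrow rerank
    return [candidates[i] for i in order]

store = ToyStore([
    "python sdk installation guide",
    "timeout configuration for auth tokens in python",
])
best = retrieve_and_rerank("auth timeout in python", store, top_k=1)
# The auth-token timeout doc wins the rerank despite both docs being retrieved
```

Swap `ToyStore` for your real vector store and `toy_rerank_score` for `reranker.predict` and you have the production version.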

If you prefer an API-based approach, Cohere's Rerank endpoint is even simpler:

import cohere
 
co = cohere.Client("your-api-key")
 
def rerank_with_cohere(query: str, documents: list[str], top_k: int = 5):
    results = co.rerank(
        query=query,
        documents=documents,
        top_n=top_k,
        model="rerank-english-v3.0",
    )
    return [documents[r.index] for r in results.results]

Either way, you're adding roughly 10-50ms of latency per query. For the accuracy gain, that's nothing.

Before and After

Here's a real example from a codebase Q&A system I built. The query was: "How do I configure retry logic for failed API calls?"

Top 5 without reranking:

  1. "API client initialization and base URL configuration"
  2. "List of supported API endpoints"
  3. "Error codes and their meanings"
  4. "Rate limiting and quota management"
  5. "Logging configuration for HTTP requests"

All topically related to APIs. None of them actually about retry logic.

Top 5 with cross-encoder reranking:

  1. "Retry configuration: exponential backoff and max attempts"
  2. "Error codes and their meanings"
  3. "Circuit breaker pattern for external service calls"
  4. "Timeout settings for HTTP client"
  5. "Rate limiting and quota management"

The chunk about retry configuration was buried at position 14 in the original retrieval. The cross-encoder pulled it to position 1. That single reordering is the difference between a correct answer and a hallucinated one.

When to Use This

Always. I'm only half joking. If your RAG system returns more than 3 chunks to the LLM, you should be reranking. The cost is minimal: a small model that adds a few milliseconds of latency. The benefit is that your LLM gets better context, which means better answers, which means fewer hallucinations.

The only case where you might skip it is if you're doing simple keyword-style lookups where the top result from vector search is almost always correct. For queries that mix exact terms with semantic intent, consider pairing reranking with hybrid search to get the best of both worlds. For anything involving nuanced questions over complex documents, reranking is the move.
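
One simple way to combine keyword and vector retrieval before the rerank step is reciprocal rank fusion (RRF). This sketch assumes you already have two ranked lists of document ids, one from a keyword index (e.g. BM25) and one from a vector store; the ids below are made up for illustration:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each appearance of a doc contributes 1 / (k + rank), so documents
    # near the top of either list float to the top of the fused ranking.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["retry-config", "error-codes", "rate-limits"]   # BM25 ranking
vector_hits = ["api-init", "retry-config", "logging"]           # embedding ranking
candidates = reciprocal_rank_fusion([keyword_hits, vector_hits])
# "retry-config" appears high in both lists, so it tops the fused ranking
```

Feed the fused candidate list into the cross-encoder exactly as before; RRF just widens the net so the reranker has better material to work with.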

20 lines of code, 10-50ms added latency, dramatically better answers. Best trade-off in RAG engineering. If you want the LLM to return answers in a consistent format after reranking, look into structured output with JSON mode to enforce a schema on the generation step.