Chunking Strategies That Actually Matter for RAG
In my beginner's guide to RAG, I mentioned chunking in one paragraph. That was a mistake. Chunking is where most RAG pipelines succeed or fail.
I've spent months debugging retrieval failures across different projects, and the pattern is almost always the same. The LLM isn't broken. The embedding model is fine. The vector database is doing its job. The chunks are wrong. Either they split a concept in half, or they carry no context about the document they came from, or they lump unrelated ideas together. Every one of these failures traces back to the chunking strategy.
Here are the three approaches I've used, what they're actually good at, and where they fall apart.
Recursive Character Splitting
This is the default. LangChain's RecursiveCharacterTextSplitter is probably the single most-used chunking tool in the ecosystem, and for good reason. It's dead simple: set a chunk size in characters or tokens, set an overlap, and it splits your document by trying progressively smaller separators -- paragraphs first, then sentences, then words.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document)
```

The good: It's fast, deterministic, and easy to reason about. You know exactly what your chunks will look like. For homogeneous text -- think news articles, documentation pages, blog posts -- it works surprisingly well. In my benchmarks across straightforward Q&A retrieval tasks, recursive splitting hits around 85-90% retrieval recall. That's not bad for zero configuration.
The bad: Character boundaries and semantic boundaries are different things. A 500-character window might cut a paragraph about database indexing right at the point where it explains why you'd use a B-tree. The overlap helps, but it's a band-aid. When your documents mix topics within sections -- like a meeting transcript jumping between budget and product roadmap -- recursive splitting creates chunks that blend unrelated ideas. The embedding for that chunk becomes a muddy average of two topics, and retrieval suffers. Pairing recursive splitting with hybrid search can partially compensate, since BM25 keyword matching catches what fuzzy embeddings miss.
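The hybrid-search pairing can be as simple as reciprocal rank fusion over the two result lists. A minimal sketch, assuming you already have ranked doc IDs from your vector store and a BM25 index (the lists below are toy stand-ins):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one ranking via RRF scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank + 1); k=60 is the usual default.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked results, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]   # from embedding similarity
bm25_hits = ["doc_c", "doc_a", "doc_d"]     # from keyword matching
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that both retrievers agree on float to the top, so a chunk whose embedding went muddy can still surface on keyword strength.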
Semantic Chunking
Semantic chunking uses embedding similarity to detect where topics shift. Instead of splitting at arbitrary character boundaries, you embed each sentence, then measure the cosine similarity between consecutive sentences. When similarity drops below a threshold, that's your chunk boundary.
The intuition is simple: sentences about the same topic produce similar embeddings. When the embeddings suddenly diverge, you've hit a topic transition.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = chunker.create_documents([document])
```

The good: Topic boundaries are respected. That meeting transcript now gets split into a budget chunk and a roadmap chunk instead of a messy hybrid. In my testing, semantic chunking pushes retrieval recall to 91-92% -- a meaningful improvement when you're serving hundreds of queries against complex documents. The chunks are also variable-length, which means short sections stay intact instead of getting padded with unrelated content.
The bad: You're running an embedding model at indexing time for every sentence. For large corpora, that compute cost adds up. You also need to tune the similarity threshold, and the right threshold varies by document type. Academic papers need a different breakpoint than Slack conversations. It's not plug-and-play the way recursive splitting is.
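To see what the percentile threshold is actually doing, here's the idea from scratch, with toy two-dimensional vectors standing in for real sentence embeddings (no API calls; `semantic_chunks` and the vectors are mine, not the library's):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunks(sentences, embeddings, percentile=95):
    """Split where the consecutive-sentence 'distance' (1 - cosine similarity)
    reaches the given percentile of all observed distances."""
    sims = [cosine(embeddings[i], embeddings[i + 1])
            for i in range(len(embeddings) - 1)]
    distances = sorted(1 - s for s in sims)
    idx = min(len(distances) - 1, int(len(distances) * percentile / 100))
    threshold = distances[idx]
    chunks, current = [], [sentences[0]]
    for i, s in enumerate(sims):
        if 1 - s >= threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

# Two budget sentences cluster together; the roadmap sentence diverges.
sentences = ["Budget is up.", "Costs rose too.", "The roadmap ships in Q3."]
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]  # hypothetical vectors
print(semantic_chunks(sentences, toy_embeddings, percentile=90))
```

The percentile knob is exactly the tuning burden described above: it decides how big a similarity drop counts as a boundary, and the right value depends on how abruptly your documents change topic.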
Late Chunking
This is the newer approach, popularized by Jina AI. The core idea flips the order of operations: instead of chunking first and then embedding, you embed the full document first using a long-context embedding model, then split the resulting token-level embeddings into chunks.
Why does the order matter? Because when you embed a chunk in isolation, it loses the context of the surrounding document. A chunk that says "this approach outperforms the baseline by 12%" means nothing without knowing what "this approach" and "the baseline" refer to. With late chunking, every token was embedded with visibility into the full document. When you then split those embeddings into chunks, each chunk retains document-level context even though it's stored independently.
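The pooling step can be sketched in a few lines. This assumes you already have token-level embeddings from a single full-document pass of a long-context model; the vectors below are toy stand-ins for that output, and `late_chunk` is my naming, not a library function:

```python
def late_chunk(token_embeddings, chunk_spans):
    """Mean-pool full-document token embeddings over each chunk's token span.

    token_embeddings: one vector per token, from one pass over the whole document.
    chunk_spans: (start, end) token indices defining each chunk.
    """
    chunk_vectors = []
    for start, end in chunk_spans:
        span = token_embeddings[start:end]
        dim = len(span[0])
        # Average each dimension across the chunk's tokens.
        pooled = [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
        chunk_vectors.append(pooled)
    return chunk_vectors

# Toy stand-in for a model's token-level output (4 tokens, 2 dimensions).
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
vectors = late_chunk(tokens, [(0, 2), (2, 4)])
```

The key point is that every token vector being pooled was computed with attention over the whole document, so even a chunk full of pronouns carries the context needed to resolve them.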
The good: Best retrieval quality I've seen. Chunks that reference pronouns, abbreviations, or earlier context don't degrade because the embedding already captured that context. For high-value corpora -- legal documents, research papers, technical specs -- the quality difference is noticeable.
The bad: You need a long-context embedding model like jina-embeddings-v2-base-en that can handle full documents in a single pass. That means higher memory and compute requirements. The approach is also newer, so tooling and community best practices are still catching up. You won't find a drop-in LangChain splitter for this one -- you'll be writing more custom code.
The Comparison
| Strategy | Retrieval Recall | Indexing Speed | Implementation Complexity |
|---|---|---|---|
| Recursive Character | ~85-90% | Fast | Low |
| Semantic | ~91-92% | Moderate | Medium |
| Late Chunking | ~93-95% | Slow | High |
These numbers are from my own benchmarks on mixed-domain document sets. Your mileage will vary based on your data and your queries. The relative ordering, though, has been consistent across every dataset I've tested.
My Recommendation
Start with recursive character splitting. It handles 80% of use cases well enough, and the simplicity matters more than people think. You can debug it, you can explain it, and you can ship it in an afternoon. And before you blame the splitter, rule out the other usual suspects: bad retrieval parameters, missing metadata filters, and insufficient preprocessing of the source documents cause plenty of failures on their own. Fix those first.
Move to semantic chunking when you notice retrieval failures at topic boundaries. If your users are asking questions that span a topic transition and getting back irrelevant chunks, that's the signal. Semantic chunking is the targeted fix, and the compute cost is manageable for most production workloads.
Reserve late chunking for high-value corpora where retrieval quality is critical. Legal discovery, medical records, dense technical documentation. If a missed retrieval has real consequences, the extra complexity and compute are justified.
The biggest mistake I see is teams jumping straight to the most sophisticated approach before they've even measured their baseline retrieval quality. Measure first, optimize second. You might find that recursive splitting with good overlap and a solid cross-encoder reranker gets you to 95% recall without any of the added complexity.
Chunking isn't glamorous. Nobody's giving conference talks about their text splitter configuration. But it's the foundation everything else sits on, and getting it right will do more for your RAG quality than any other single change. Once your chunks are solid, adding observability with OpenTelemetry makes it straightforward to measure retrieval quality in production and catch regressions early.
Related Posts
Reranking: The 20-Line Fix for Bad RAG Results
If your RAG pipeline retrieves the wrong chunks, adding a cross-encoder reranker between retrieval and generation can fix it in 20 lines of code.
LLM Guardrails in Practice: Input Validation to Output Filtering
A three-layer guardrail pipeline: validate inputs, constrain execution, filter outputs. Here's what each layer catches and how to build them.
Function Calling Patterns for Production LLM Agents
Function calling connects LLMs to the real world. Here are the patterns that survive production: permission models, error handling, and human-in-the-loop checkpoints.