You have a 100-page document. You want to use it in RAG. Do you stuff the whole thing as one vector in the database? Do you split it into individual sentences? The answer lies between these extremes, and that answer is chunking.
Chunking is the art and science of dividing documents into appropriately-sized pieces. Each piece gets embedded and stored separately. When you retrieve, you get chunks, not whole documents. The quality of chunking directly affects RAG quality.
The tradeoffs are real. Chunks that are too small: "Artificial intelligence is." "Transformative technology." "Changes industries." Each sentence is a chunk. Each embeds and retrieves independently. The system might retrieve disconnected sentences that don't make sense together. You lose context. Chunks that are too large: "Here are 50 paragraphs about AI..." The chunk contains noise. The query might match one relevant sentence buried in 50 paragraphs of irrelevant content. You get an oversized context with minimal relevant information.
The sweet spot is contextual relevance. A chunk should be a meaningful unit that can stand alone and answer a question. This is ambiguous and task-dependent. For a FAQ document, each question-answer pair is a chunk. For a research paper, maybe each section is a chunk. For a dense technical manual, maybe each subsection or even paragraph. The domain and content structure matter.
Fixed-size chunking is simplest: "Every chunk is exactly 256 tokens with 50 tokens of overlap between consecutive chunks." This is mechanical but sometimes suboptimal. A chunk might end mid-sentence or mid-concept. Overlap helps: consecutive chunks share tokens, providing continuity.
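A minimal sketch of fixed-size chunking with overlap, operating on a pre-tokenized list (the tokenizer itself and the 256/50 parameters are placeholders, not a prescribed configuration):

```python
def fixed_size_chunks(tokens, chunk_size=256, overlap=50):
    """Split a token list into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk for continuity."""
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail
    return chunks
```

The last 50 tokens of one chunk reappear as the first 50 of the next, so a sentence cut at a boundary still appears whole in at least one chunk.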
Semantic chunking is smarter. Divide the document into sentences, embed each sentence, identify where consecutive embeddings diverge sharply (signaling a topic shift), and chunk at those boundaries. This respects semantic structure. But it's more complex and slower.
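The idea can be sketched as follows. The bag-of-words `embed` here is a stand-in so the example is self-contained; a real system would use a sentence-embedding model, and the 0.2 threshold is an arbitrary illustration:

```python
import math
from collections import Counter

def embed(sentence):
    # Toy bag-of-words "embedding"; replace with a real model in practice.
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever consecutive sentences are dissimilar."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Adjacent sentences on the same topic stay together; a sharp similarity drop starts a new chunk.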
Recursive chunking handles hierarchical structure. Split by paragraph, then sentence, then word if needed. Useful for documents with natural hierarchies.
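A sketch of the recursive approach, measuring size in characters for simplicity (a real implementation would count tokens, and the separator list is an assumption, not a standard):

```python
def recursive_chunks(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first (paragraphs), then recurse
    into oversized pieces with progressively finer separators."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator never occurs; try the next, finer one.
        return recursive_chunks(text, max_len, rest)
    chunks = []
    for piece in pieces:
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunks(piece, max_len, rest))
    return chunks
```

Paragraphs that already fit stay intact; only oversized ones are broken further, so the natural hierarchy is respected wherever possible.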
Metadata-aware chunking preserves important structure. For a document with tables, code, images, and text, chunk each modality differently: embed table rows differently than prose paragraphs, keep code blocks intact, and attach metadata so retrieval knows what kind of content each chunk holds.
The hard problem is that different queries need different chunk sizes. A specific factual query ("What is the capital of France?") might need small chunks. A conceptual query ("How does photosynthesis work?") might need larger chunks with more context. Optimal chunk size is query-dependent, and you don't know the query in advance.
Overlap between chunks helps. If chunks 1 and 2 overlap, retrieval gets both chunks which together provide more context than either alone. But overlap costs money (more vectors to store and search).
Chunk size dramatically affects costs. More chunks mean more embeddings to compute, store, and search. Fewer, larger chunks mean each embedding represents more content, diluting its signal, and each retrieved chunk carries more tokens into the prompt. The economics depend on your vector database pricing (per-vector, per-query, storage, compute).
Chunking failures are common in RAG systems. A query about "company headquarters location" retrieves chunks about office culture instead. The chunking split the location information from the context that made it relevant. Teams spend weeks debugging RAG retrieval only to discover chunking was the culprit.
There's also the problem of information spanning chunks. Consider "The company was founded in 1995 in California," written across two sentences: "The company was founded in 1995." "It was located in California." Each sentence alone is ambiguous. "California" without "founded" is unclear. "1995" without "company" is just a number. If these sentences end up in different chunks, the context is lost.
Adaptive chunking considers query context. Index chunks of different sizes for the same document. When you query, retrieve the chunk size most relevant to the query type. This is more complex but handles diverse query types better.
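One way to sketch the multi-granularity indexing part of this: index the same document at several sizes and tag each entry, so query-time logic can filter to the granularity that suits the query type. The character-based splitting and the 128/512 sizes are illustrative assumptions:

```python
def multi_granularity_index(doc_id, text, sizes=(128, 512)):
    """Index one document at multiple chunk sizes (character-based here
    for simplicity; a real system would chunk by tokens)."""
    index = []
    for size in sizes:
        for i in range(0, len(text), size):
            index.append({
                "doc_id": doc_id,
                "granularity": size,   # lets query-time code filter by size
                "text": text[i:i + size],
            })
    return index
```

A factual lookup would then search only the small-granularity entries, while a conceptual question would search the large ones, at the cost of storing the document multiple times.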
Why It Matters
Chunking directly affects RAG quality, retrieval speed, and cost. Poor chunking causes the system to retrieve irrelevant or incomplete information. Optimal chunking requires domain knowledge and testing. For organizations deploying RAG at scale, chunking strategy determines whether retrieval is effective and economically sustainable. Teams that underestimate chunking complexity often discover it's the limiting factor for RAG performance.
Example
A healthcare system builds RAG over medical guidelines. Poor chunking splits clinical recommendations: "Condition X is treated with medication Y in cases where..." is chunk 1. "...patient age is under 40 and kidney function is normal." is chunk 2. Queries retrieve chunk 1 without the critical contraindication information in chunk 2. A clinician using the system misses the kidney function constraint. Better chunking keeps the entire condition-treatment-contraindication unit together, ensuring retrieved information is complete and safe.