Chunking Strategies

Reading time: About 10 minutes

Audience: People who want to design the data preprocessing step of RAG, and those unsure about chunk size choices

Prerequisites: Familiarity with the basic flow described in What Is RAG?

Chunking is the process of splitting large documents into smaller units (chunks) that can be searched effectively. In RAG, a 100-page PDF cannot be used as a single search target. By splitting it into appropriately sized pieces, the system can find “the specific part that answers this question” with high precision.

Why Chunking Is Needed

Consider a 100-page specification document. When a user asks “What are the login feature specifications?”, passing the entire document to an LLM is not practical for two reasons:

Context length limits: LLMs can only process a limited amount of text at once
Noise increases: When large amounts of unrelated information are included, the LLM has difficulty focusing on the truly relevant parts

Chunking splits the specification document into meaningful units — “login,” “search features,” “error handling” — and makes each independently searchable and referenceable.

Chunking Strategy Comparison

Strategy	Mechanism	Implementation Complexity	Context Quality	Best For
Fixed-size chunking	Split at a fixed token count	Low	Low (may cut mid-sentence)	Prototypes, uniform text
Sentence/paragraph boundary	Split at line breaks or punctuation	Low–Medium	Medium	General documents, blogs, manuals
Semantic chunking	Split at topic changes	High	High	Academic papers, complex technical docs
Hierarchical (parent-child)	Two-layer: sections + sentences	Medium–High	High	Long documents requiring multiple granularities

Fixed-Size Chunking

Fixed-size chunking mechanically splits text at a constant number of tokens. It is the simplest method. LangChain text splitters provide splitting methods based on characters, tokens, and language-aware structures.[1]

# pip install langchain-text-splitters tiktoken
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,      # Maximum tokens per chunk
    chunk_overlap=50,    # Overlap (preserve a little context from the previous chunk)
)

text = "Long document text..."
chunks = splitter.split_text(text)
print(f"Number of chunks: {len(chunks)}")

Advantage: Simple to implement and fast.

Disadvantage: Sentences can be cut in the middle. If a sentence beginning “This feature...” is split between the end of one chunk and the start of the next, neither chunk carries its full meaning.
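The mid-sentence cut is easy to see in a dependency-free sketch. The helper below is hypothetical (it is not part of LangChain) and treats whitespace-separated words as a stand-in for tokens, so the counts will differ from a real tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of `chunk_size` whitespace-separated words.

    Words stand in for tokens here; a real tokenizer gives different counts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "This feature requires admin rights. Users without them see an error."
chunks = fixed_size_chunks(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends at "admin", so the first sentence is completed only in the second chunk. That is the failure mode described above.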

Sentence/Paragraph Boundary Chunking

This approach uses paragraph breaks (\n\n) or sentence endings as split points, avoiding cuts in the middle of a sentence.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,           # Target chunk size (characters)
    chunk_overlap=100,         # Overlap
    separators=["\n\n", "\n", ". ", " ", ""],  # Priority order for splitting
)

with open("document.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)

RecursiveCharacterTextSplitter first tries to split at paragraphs (\n\n), then at lines (\n) if a chunk is still too large, then at sentence boundaries. This “split at the largest meaningful unit possible” strategy helps preserve context.[1]
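The idea of trying the coarsest separator first can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name is hypothetical, there is no overlap, and a separator that falls on a chunk boundary is dropped.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most `chunk_size` characters.

    Parts are packed together using the first separator; any single part that
    is still too long is split again with the remaining, finer separators.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to try: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(separator):
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, remaining))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


text = (
    "Login uses OAuth.\n\n"
    "Search supports filters. Results are paginated. Each page shows ten items.\n\n"
    "Errors are logged."
)
pieces = recursive_split(text, chunk_size=50, separators=["\n\n", "\n", ". ", " "])
for piece in pieces:
    print(repr(piece))
```

Short paragraphs survive as single chunks. Only the long middle paragraph is broken down further, and it is broken at sentence boundaries rather than mid-sentence.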

Semantic Chunking

This approach detects changes in meaning (topic shifts) and splits there. It computes the distance between embedding vectors of adjacent sentences, and uses large changes as split points.

# pip install langchain-experimental langchain-openai
# OpenAIEmbeddings expects OPENAI_API_KEY to be set in the environment
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Use the top 5% of change points as split points
)

# long_document_text: the full document as a single string, loaded beforehand
chunks = splitter.split_text(long_document_text)

Advantage: Produces chunks that each cover a single topic.

Disadvantage: Requires an embedding model, which increases preprocessing cost, and chunk sizes become irregular.
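To make the distance-and-threshold idea concrete without calling an embedding API, the sketch below swaps in a toy bag-of-words vector for the embedding model. The function names and sample sentences are assumptions for illustration only. A real embedding model captures meaning far better than word overlap.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(text: str, threshold_percentile: float = 95.0) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    vectors = [toy_embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, threshold_percentile)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # unusually large jump: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


doc = (
    "The login page uses OAuth tokens. "
    "OAuth tokens expire after one hour on the login page. "
    "Search results are ranked by relevance. "
    "Search results with low relevance are hidden."
)
topic_chunks = semantic_chunks(doc)
for chunk in topic_chunks:
    print(repr(chunk))
```

The only distance above the 95th percentile is the jump from the login sentences to the search sentences, so the text is split once, at the topic change.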

Hierarchical Chunking (Parent-Child)

This approach manages documents in two layers: a “large unit (parent chunk)” and a “small unit (child chunk).” LlamaIndex represents documents as nodes and provides node parsers for splitting and structuring them for different use cases.[2]

How it works:

Split the document into sections (parent chunks)
Split each section further into sentence-level pieces (child chunks)
Search using child chunks (higher precision); pass parent chunks to the LLM (richer context)

graph TD
    D["Document"] --> P1["Parent chunk: Section 1 (1000 tokens)"]
    D --> P2["Parent chunk: Section 2 (1000 tokens)"]
    P1 --> C1["Child chunk: Sentence 1 (200 tokens)"]
    P1 --> C2["Child chunk: Sentence 2 (200 tokens)"]
    P1 --> C3["Child chunk: Sentence 3 (200 tokens)"]
    C1 -->|"Search hit"| R["Return child chunk ID"]
    R --> P1_RETURN["Pass corresponding parent chunk to LLM"]

from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Parent chunks: larger (richer context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# Child chunks: smaller (higher search precision)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

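# vectorstore and docstore are assumed to exist already: a vector store built
# with your embedding model, and a key-value store that holds the parent chunks.
retriever = ParentDocumentRetriever(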
    vectorstore=vectorstore,     # Vector DB for child chunks
    docstore=docstore,           # Storage for parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
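The search-child, return-parent flow can also be sketched without any library. The sample sections, the IDs, and the word-overlap scoring (standing in for vector search) below are all assumptions made for illustration.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Parent chunks: whole sections, keyed by a made-up section ID.
parents = {
    "sec-login": "Login. Users sign in with OAuth 2.0. Sessions expire after 30 minutes of inactivity.",
    "sec-search": "Search. Queries support filters. Results are sorted by relevance.",
}

# Child chunks: one per sentence, each remembering which parent it came from.
children = []
for parent_id, section_text in parents.items():
    for sentence in re.split(r"(?<=[.!?])\s+", section_text):
        children.append({"parent_id": parent_id, "text": sentence})


def retrieve_parent(query: str) -> str:
    """Find the best-matching child, then return its parent section."""
    query_words = words(query)
    best_child = max(children, key=lambda child: len(query_words & words(child["text"])))
    return parents[best_child["parent_id"]]


context = retrieve_parent("When do sessions expire?")
print(context)
```

The query matches the single sentence about session expiry, but the LLM receives the whole login section, including the OAuth detail that the matching sentence alone would not carry.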

The Role of Overlap

Overlap means including a repeated portion from the end of the previous chunk at the start of the next chunk.

For example, with chunk_size=512, chunk_overlap=50:

Chunk 1: Tokens 1–512
Chunk 2: Tokens 463–974 (carries over the last 50 tokens of the previous chunk)
Chunk 3: Tokens 925–1436 (same pattern)

Chunk 1: [... text ... overlap portion]
Chunk 2:             [overlap portion ... text ... overlap portion]
Chunk 3:                                   [overlap portion ... text]

Without overlap, a statement that straddles a split boundary is cut in two, so neither chunk contains it in full and retrieval may miss it. An overlap of 10–20% of the chunk size (e.g., 50–100 tokens for a chunk size of 512) is a common practical starting point for RAG chunking.[3]
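The token ranges in the example above follow from one rule: each chunk starts chunk_size - chunk_overlap tokens after the previous one. A small sketch of that arithmetic follows. The function name is hypothetical, and positions are 1-based and inclusive to match the example.

```python
def chunk_ranges(total_tokens: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) token positions for each chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    ranges = []
    start = 1
    while start <= total_tokens:
        end = min(start + chunk_size - 1, total_tokens)
        ranges.append((start, end))
        if end == total_tokens:
            break
        start += step
    return ranges


ranges = chunk_ranges(total_tokens=1436, chunk_size=512, chunk_overlap=50)
print(ranges)  # [(1, 512), (463, 974), (925, 1436)]
```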

Chunk Size Guidelines

The appropriate chunk size varies by document type and use case. Pinecone’s chunking guidance also explains that chunks that are too short lose context, while chunks that are too long add noise, so chunking should be evaluated against the use case.[3]

Document type	Recommended chunk size	Reason
FAQ, short answer format	256–512 tokens	One Q&A pair fits in one chunk
Product manuals, technical docs	512–1024 tokens	Preserves procedural flow
Legal documents, contracts	1024–2048 tokens	Clause context is critical
Academic papers	512–1024 tokens (+ hierarchical)	Paragraphs are self-contained units
Code files	Per function/class	Prioritize syntactic boundaries

Token count reference: In English, 1 token ≈ 4 characters, or about 0.75 words.
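Because RecursiveCharacterTextSplitter measures chunk_size in characters by default while the table above is in tokens, a rough conversion helps when picking values. The helpers below are hypothetical and simply apply the 4-characters-per-token rule of thumb. They are estimates for English text, not a replacement for a real tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text from its character count."""
    return math.ceil(len(text) / chars_per_token)


def tokens_to_chars(tokens: int, chars_per_token: float = 4.0) -> int:
    """Rough character budget that corresponds to a token budget."""
    return int(tokens * chars_per_token)


print(tokens_to_chars(512))   # 2048
print(tokens_to_chars(1024))  # 4096
print(estimate_tokens("Chunking splits documents into searchable units."))
```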

Why Metadata Matters

Each chunk should carry metadata — not just the text. Metadata is essential for citation, filtering, and debugging.

from langchain_core.documents import Document

chunk = Document(
    page_content="The login feature uses OAuth 2.0...",
    metadata={
        "source": "product-spec-v2.pdf",       # File name
        "page": 15,                             # Page number
        "section": "Authentication & Security", # Section title
        "doc_type": "specification",            # Document type
        "last_updated": "2026-04-01",           # Document update date
        "visibility": "internal",              # Access scope
    }
)

With rich metadata:

Responses from the LLM can include citations like “Source: product-spec-v2.pdf, page 15”
Filtering like “search only specifications updated in 2026 or later” becomes possible
Debugging becomes easier: “which document and section was retrieved?”
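As a minimal sketch of the filtering and citation uses listed above, the example below works on plain dictionaries rather than a vector database. The extra sample chunks, file names, and helper names are made up for illustration. In practice most vector stores expose their own metadata filter syntax.

```python
from datetime import date

chunks = [
    {"text": "The login feature uses OAuth 2.0...",
     "metadata": {"source": "product-spec-v2.pdf", "page": 15,
                  "doc_type": "specification", "last_updated": "2026-04-01"}},
    {"text": "The legacy login flow...",  # hypothetical older document
     "metadata": {"source": "product-spec-v1.pdf", "page": 9,
                  "doc_type": "specification", "last_updated": "2024-11-20"}},
    {"text": "Release notes...",  # hypothetical non-specification document
     "metadata": {"source": "release-notes.md", "page": 1,
                  "doc_type": "changelog", "last_updated": "2026-02-10"}},
]


def filter_chunks(chunks: list[dict], doc_type: str, updated_since: date) -> list[dict]:
    """Keep chunks of one document type updated on or after a given date."""
    return [
        chunk for chunk in chunks
        if chunk["metadata"]["doc_type"] == doc_type
        and date.fromisoformat(chunk["metadata"]["last_updated"]) >= updated_since
    ]


def format_citation(chunk: dict) -> str:
    meta = chunk["metadata"]
    return f"Source: {meta['source']}, page {meta['page']}"


recent_specs = filter_chunks(chunks, "specification", date(2026, 1, 1))
for chunk in recent_specs:
    print(format_citation(chunk))
```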

Handling Special Content

Tables

When converting tables to text, the row-column relationship can break down. Keep tables intact in a single chunk and store the surrounding heading as metadata.

# Preserve tables as Markdown
table_chunk = Document(
    page_content="""
## Pricing Plan Comparison

| Plan | Monthly | Users | Storage |
|------|---------|-------|---------|
| Basic | $10 | 1 | 5GB |
| Pro | $30 | 10 | 50GB |
| Enterprise | Contact | Unlimited | 1TB |
""",
    metadata={"section": "Pricing Plans", "content_type": "table"}
)
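When a table arrives as rows from a parser instead of ready-made Markdown, a small helper can render it into one chunk with the heading attached. The function below is a hypothetical sketch. It returns a plain dictionary with the same page_content and metadata fields used above, which can be passed to Document if LangChain is in use.

```python
def table_to_chunk(heading: str, header: list[str], rows: list[list[str]]) -> dict:
    """Render a table as one Markdown chunk so rows never lose their header."""
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row {row!r} does not match header {header!r}")

    lines = [
        f"## {heading}",
        "",
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return {
        "page_content": "\n".join(lines),
        "metadata": {"section": heading, "content_type": "table"},
    }


pricing = table_to_chunk(
    "Pricing Plan Comparison",
    ["Plan", "Monthly", "Users", "Storage"],
    [["Basic", "$10", "1", "5GB"], ["Pro", "$30", "10", "50GB"]],
)
print(pricing["page_content"])
```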

Code Blocks

Split code at function or class boundaries. Splitting in the middle of a function destroys its meaning.

# from_language() uses language-specific separators (such as "class" and "def"
# for Python) so that splits prefer function and class boundaries
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=0  # No overlap is a common choice for code
)
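For Python sources specifically, the standard library ast module can locate function and class boundaries exactly. The sketch below is a dependency-free alternative, not what the LangChain splitter does. The function name is hypothetical, and module-level statements such as imports are skipped for brevity.

```python
import ast


def split_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class in `source`."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so decorators stay attached.
            first_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[first_line - 1:node.end_lineno]))
    return chunks


source = '''import math

def area(radius):
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius):
        self.radius = radius
'''

code_chunks = split_python_source(source)
for chunk in code_chunks:
    print(chunk)
    print("---")
```

Each chunk is a complete definition, so no function body is ever cut in half. The trade-off is that chunk sizes follow the code and a very long class becomes one very long chunk.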

Common Mistakes

Chunks That Are Too Small (Context Loss)

Splitting into 1–2 sentences per chunk loses surrounding context. A sentence like “This feature builds on the method described in the previous section” becomes meaningless without that previous section present.

Chunks That Are Too Large (Noise Increase)

Large chunks of 2000–4000 tokens mix relevant and irrelevant information. The longer the context passed to an LLM, the greater the risk that important information gets buried.

No Overlap (Information Loss at Boundaries)

Without overlap, a sentence that straddles a split point is divided between two chunks, and neither chunk contains the complete statement. For example, a sentence that begins “When the following conditions apply...” can end one chunk while the conditions themselves land in the next.

Summary

Chunking splits documents into searchable units and has a large impact on RAG quality
Start with paragraph-boundary chunking (RecursiveCharacterTextSplitter) and improve based on accuracy requirements
Set overlap (10–20%) to maintain context continuity
Metadata (source, page number, section name) is essential for citation and filtering
Tables and code require special handling: keep tables as single chunks; split code at syntactic boundaries

FAQ

Q: What chunk size should I start with?

A: chunk_size=512 with chunk_overlap=50 is a reasonable starting point that works for many document types. Build an evaluation set, compare retrieval accuracy with smaller and larger chunks, and adjust from there. There is no universally “correct” size; the optimal value varies by document type and query patterns.
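One simple way to compare configurations is a hit rate: the share of evaluation questions for which the expected section shows up in the top-k results. The sketch below assumes each retrieved chunk can be mapped back to a section label. The questions, labels, and retrieval results are placeholder data, not measurements.

```python
def hit_rate(retrieved: dict[str, list[str]], expected: dict[str, str], k: int = 3) -> float:
    """Fraction of questions whose expected section appears in the top-k results."""
    if not expected:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for question, section in expected.items()
        if section in retrieved.get(question, [])[:k]
    )
    return hits / len(expected)


# Evaluation set: question -> section that contains the answer (placeholder data).
expected = {
    "How do users log in?": "login",
    "How are results sorted?": "search",
    "Where are errors logged?": "errors",
}

# Sections of the top results per question for two hypothetical configurations.
runs = {
    "config_a": {
        "How do users log in?": ["login", "search"],
        "How are results sorted?": ["search", "login"],
        "Where are errors logged?": ["login", "errors"],
    },
    "config_b": {
        "How do users log in?": ["login"],
        "How are results sorted?": ["errors", "login"],
        "Where are errors logged?": ["errors"],
    },
}

scores = {name: hit_rate(results, expected, k=2) for name, results in runs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```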

Q: How do I handle tables and code blocks?

A: Keep tables in a single chunk to preserve the header-row relationship. For code, use RecursiveCharacterTextSplitter.from_language(), which prefers function and class boundaries as split points. For extracting tables accurately from PDFs, libraries like pdfplumber or camelot are helpful.

Q: If I change the chunking strategy, do I need to re-index?

A: Yes. Changing chunk size, splitting strategy, or overlap requires re-splitting and re-vectorizing all documents. When changing chunking strategy in production, verify accuracy on an evaluation set first and plan the migration carefully.

Q: How do I count tokens for non-English text?

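A: Count tokens rather than characters. LangChain and LlamaIndex splitters can measure length by character (len()) or by token (tiktoken, etc.). The rule of thumb of about 4 characters per token applies to English; other languages can produce a different number of tokens for the same character count, so counting with the tokenizer of the model you use (such as tiktoken) is more accurate and recommended.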

References

Choosing a Vector Database

Retrieval Strategies