Chunking Strategies

About 10 minutes

Target audience: People who want to design the data preprocessing step of RAG, and those unsure about chunk size choices
Prerequisites: Familiarity with the basic flow described in What Is RAG?

Chunking is the process of splitting large documents into smaller units (chunks) that can be searched effectively. In RAG, a 100-page PDF cannot be used as a single search target. By splitting it into appropriately sized pieces, the system can find “the specific part that answers this question” with high precision.

Consider a 100-page specification document. When a user asks “What are the login feature specifications?”, passing the entire document to an LLM is not practical for two reasons:

  1. Context length limits: LLMs can only process a limited amount of text at once
  2. Noise increases: When large amounts of unrelated information are included, the LLM has difficulty focusing on the truly relevant parts

Chunking splits the specification document into meaningful units — “login,” “search features,” “error handling” — and makes each independently searchable and referenceable.

| Strategy | Mechanism | Implementation complexity | Context quality | Best for |
|----------|-----------|---------------------------|-----------------|----------|
| Fixed-size chunking | Split at a fixed token count | Low | Low (may cut mid-sentence) | Prototypes, uniform text |
| Sentence/paragraph boundary | Split at line breaks or punctuation | Low–Medium | Medium | General documents, blogs, manuals |
| Semantic chunking | Split at topic changes | High | High | Academic papers, complex technical docs |
| Hierarchical (parent-child) | Two-layer: sections + sentences | Medium–High | High | Long documents requiring multiple granularities |

Fixed-size chunking mechanically splits text at a constant number of tokens. It is the simplest method. LangChain text splitters provide splitting methods based on characters, tokens, and language-aware structures.[1]

# pip install langchain-text-splitters tiktoken
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,      # Maximum tokens per chunk
    chunk_overlap=50,    # Overlap (preserve a little context from the previous chunk)
)

text = "Long document text..."
chunks = splitter.split_text(text)
print(f"Number of chunks: {len(chunks)}")

Advantage: Simple to implement and fast.

Disadvantage: Sentences can be cut in the middle. If one chunk ends with “This feature” and the rest of the sentence starts the next chunk, neither chunk carries the full meaning.

Sentence/paragraph boundary chunking uses paragraph breaks (\n\n) or sentence endings as split points, which avoids most cuts in the middle of a sentence.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,           # Target chunk size (characters)
    chunk_overlap=100,         # Overlap
    separators=["\n\n", "\n", ". ", " ", ""],  # Priority order for splitting
)

with open("document.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)

RecursiveCharacterTextSplitter first tries to split at paragraphs (\n\n), then at lines (\n) if a chunk is still too large, then at sentence boundaries. This “split at the largest meaningful unit possible” strategy helps preserve context.[1]
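To make the fallback order concrete, here is a minimal pure-Python sketch of the same idea. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, there is no overlap, and separators at chunk boundaries are simply dropped. It only illustrates "try the coarsest separator first, and fall back to a finer one for parts that are still too large."

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part.strip():
            continue
        # Greedily merge neighbouring parts while the result still fits.
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 30
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The two short paragraphs survive as whole chunks, while the long run of words falls through to the space separator and is packed into pieces of at most 40 characters.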

Semantic chunking detects changes in meaning (topic shifts) and splits there. It computes the distance between the embedding vectors of adjacent sentences and treats unusually large distances as split points.

# pip install langchain-experimental
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Use the top 5% of change points as split points
)

long_document_text = "Long document text..."  # Placeholder input
chunks = splitter.split_text(long_document_text)

Advantage: Produces chunks that are coherent by topic.

Disadvantage: Requires an embedding model, which increases preprocessing cost, and produces chunks of irregular size.
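The mechanism can be sketched without any external service. The example below is an assumption-heavy illustration, not SemanticChunker's actual code: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the percentile is computed with simple linear interpolation. It shows the core steps: measure the distance between adjacent sentences, take a percentile as the threshold, and split where the distance exceeds it.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile_value(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ranked = sorted(values)
    pos = (len(ranked) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)


def semantic_chunks(sentences, percentile=95):
    """Group sentences, starting a new chunk where the adjacent distance is unusually large."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile_value(distances, percentile)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The login page accepts a username and password.",
    "The login page locks the account after five failed password attempts.",
    "Invoices are generated on the first day of each month.",
    "Invoices are sent by email on the first day as a PDF.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

With these four sentences, the largest distance falls between the second and third, so the login sentences and the invoice sentences end up in separate chunks. A real embedding model would capture similarity that does not depend on shared words.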

Hierarchical (parent-child) chunking manages documents in two layers: a “large unit (parent chunk)” and a “small unit (child chunk).” LlamaIndex represents documents as nodes and provides node parsers for splitting and structuring them for different use cases.[2]

How it works:

  1. Split the document into sections (parent chunks)
  2. Split each section further into sentence-level pieces (child chunks)
  3. Search using child chunks (higher precision); pass parent chunks to the LLM (richer context)

graph TD
    D["Document"] --> P1["Parent chunk: Section 1 (1000 tokens)"]
    D --> P2["Parent chunk: Section 2 (1000 tokens)"]
    P1 --> C1["Child chunk: Sentence 1 (200 tokens)"]
    P1 --> C2["Child chunk: Sentence 2 (200 tokens)"]
    P1 --> C3["Child chunk: Sentence 3 (200 tokens)"]
    C1 -->|"Search hit"| R["Return child chunk ID"]
    R --> P1_RETURN["Pass corresponding parent chunk to LLM"]

from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Parent chunks: larger (richer context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# Child chunks: smaller (higher search precision)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

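# vectorstore and docstore must be created beforehand
# (a LangChain vector store and a document store such as InMemoryStore)
retriever = ParentDocumentRetriever(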
    vectorstore=vectorstore,     # Vector DB for child chunks
    docstore=docstore,           # Storage for parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
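The lookup behind this pattern is small enough to sketch in plain Python. The helper names below (`build_parent_child_index`, `retrieve_parent`) are hypothetical, child chunks are fixed windows of words rather than sentences, and word overlap stands in for vector search. The point is only the bookkeeping: each child records its parent's ID, search runs over children, and the parent text is what gets returned.

```python
def build_parent_child_index(sections, child_size=8):
    """Return (parents, children); each child is a word window tagged with its parent's id."""
    parents, children = {}, []
    for parent_id, section in enumerate(sections):
        parents[parent_id] = section
        words = section.split()
        for start in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return parents, children


def retrieve_parent(query, parents, children):
    """Find the best-matching child (word overlap as a stand-in for vector search),
    then return the full parent chunk it belongs to."""
    query_words = set(query.lower().split())
    best_child = max(
        children,
        key=lambda child: len(query_words & set(child["text"].lower().split())),
    )
    return parents[best_child["parent_id"]]


sections = [
    "Login uses OAuth 2.0. Sessions expire after 30 minutes of inactivity. "
    "Failed attempts are rate limited.",
    "Search supports keyword and vector queries. Results are ranked by relevance "
    "and filtered by access scope.",
]
parents, children = build_parent_child_index(sections)
context = retrieve_parent("when do sessions expire", parents, children)
print(context)
```

The query matches a small child chunk, but the text handed to the LLM is the whole login section, including the rate-limiting sentence that the child alone would have missed.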

Overlap means including a repeated portion from the end of the previous chunk at the start of the next chunk.

For example, with chunk_size=512, chunk_overlap=50:

  • Chunk 1: Tokens 1–512
  • Chunk 2: Tokens 463–974 (carries over the last 50 tokens of the previous chunk)
  • Chunk 3: Tokens 925–1436 (same pattern)
Chunk 1: [... text ... overlap portion]
Chunk 2:             [overlap portion ... text ... overlap portion]
Chunk 3:                                   [overlap portion ... text]

Without overlap, a sentence or idea that straddles a split boundary is cut in two, so no single chunk contains it in full. An overlap of 10–20% of the chunk size (e.g., 50–100 tokens for a chunk size of 512) is a common practical starting point for RAG chunking.[3]
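The token ranges in the example follow from one rule: each chunk starts `chunk_size - chunk_overlap` tokens after the previous one. The sketch below reproduces that arithmetic; `chunk_windows` is a hypothetical helper written for this illustration, and real splitters may treat the final partial chunk differently.

```python
def chunk_windows(total_tokens, chunk_size, chunk_overlap):
    """Return 1-based, inclusive (start, end) token positions for each chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new chunk advances
    windows = []
    start = 1
    while start <= total_tokens:
        end = min(start + chunk_size - 1, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            break
        start += stride
    return windows


windows = chunk_windows(total_tokens=1436, chunk_size=512, chunk_overlap=50)
print(windows)  # [(1, 512), (463, 974), (925, 1436)]
```

Each window begins 462 tokens after the previous one, which is why chunk 2 starts at token 463 and repeats tokens 463–512 from chunk 1.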

The appropriate chunk size varies by document type and use case. Pinecone’s chunking guidance also explains that chunks that are too short lose context, while chunks that are too long add noise, so chunking should be evaluated against the use case.[3]

| Document type | Recommended chunk size | Reason |
|---------------|------------------------|--------|
| FAQ, short answer format | 256–512 tokens | One Q&A pair fits in one chunk |
| Product manuals, technical docs | 512–1024 tokens | Preserves procedural flow |
| Legal documents, contracts | 1024–2048 tokens | Clause context is critical |
| Academic papers | 512–1024 tokens (+ hierarchical) | Paragraphs are self-contained units |
| Code files | Per function/class | Prioritize syntactic boundaries |

Token count reference: In English, 1 token ≈ 4 characters (0.75 words).
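This rule of thumb matters in practice because some splitters measure chunk_size in characters and others in tokens. The sketch below converts between the two for English text; the helper names are hypothetical and the result is only an estimate, so use a real tokenizer when exact counts matter.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text only


def approx_tokens(text):
    """Rough token estimate from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def approx_chars_for(token_budget):
    """Rough character budget that corresponds to a token budget."""
    return token_budget * CHARS_PER_TOKEN


char_budget = approx_chars_for(512)
print(char_budget)                 # 2048
print(approx_tokens("a" * 1000))   # 250
```

Under this estimate, a 512-token target corresponds to roughly 2,000 characters, and a 1,000-character chunk is roughly 250 tokens.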

Each chunk should carry metadata — not just the text. Metadata is essential for citation, filtering, and debugging.

from langchain_core.documents import Document

chunk = Document(
    page_content="The login feature uses OAuth 2.0...",
    metadata={
        "source": "product-spec-v2.pdf",       # File name
        "page": 15,                             # Page number
        "section": "Authentication & Security", # Section title
        "doc_type": "specification",            # Document type
        "last_updated": "2026-04-01",           # Document update date
        "visibility": "internal",              # Access scope
    }
)

With rich metadata:

  • Responses from the LLM can include citations like “Source: product-spec-v2.pdf, page 15”
  • Filtering like “search only specifications updated in 2026 or later” becomes possible
  • Debugging becomes easier: “which document and section was retrieved?”
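As a minimal sketch of the first two points, the example below filters chunks by metadata and formats a citation. It uses a small hypothetical `Chunk` dataclass in place of a real document class, and the helper names and sample records are invented for illustration. In production, most vector databases apply such filters at query time.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_chunks(chunks, doc_type=None, updated_on_or_after=None):
    """Keep chunks whose metadata matches the given constraints."""
    selected = []
    for chunk in chunks:
        meta = chunk.metadata
        if doc_type is not None and meta.get("doc_type") != doc_type:
            continue
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        if updated_on_or_after is not None and meta.get("last_updated", "") < updated_on_or_after:
            continue
        selected.append(chunk)
    return selected


def format_citation(chunk):
    """Build a human-readable source line from chunk metadata."""
    return f"Source: {chunk.metadata['source']}, page {chunk.metadata['page']}"


chunks = [
    Chunk("The login feature uses OAuth 2.0...",
          {"source": "product-spec-v2.pdf", "page": 15,
           "doc_type": "specification", "last_updated": "2026-04-01"}),
    Chunk("Legacy login used basic auth...",
          {"source": "product-spec-v1.pdf", "page": 9,
           "doc_type": "specification", "last_updated": "2025-11-20"}),
    Chunk("To log in, open the app...",
          {"source": "user-manual.pdf", "page": 3,
           "doc_type": "manual", "last_updated": "2026-02-10"}),
]

recent_specs = filter_chunks(chunks, doc_type="specification", updated_on_or_after="2026-01-01")
citation = format_citation(recent_specs[0])
print(citation)
```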

When converting tables to text, the row-column relationship can break down. Keep tables intact in a single chunk and store the surrounding heading as metadata.

# Preserve tables as Markdown
table_chunk = Document(
    page_content="""
## Pricing Plan Comparison

| Plan | Monthly | Users | Storage |
|------|---------|-------|---------|
| Basic | $10 | 1 | 5GB |
| Pro | $30 | 10 | 50GB |
| Enterprise | Contact | Unlimited | 1TB |
""",
    metadata={"section": "Pricing Plans", "content_type": "table"}
)
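One way to produce such chunks automatically is to split Markdown on blank lines, so a table is never cut, and to tag each block with the nearest preceding heading. The sketch below assumes well-formed Markdown where table rows start with "|" and contain no blank lines; the function name and sample text are hypothetical.

```python
import re


def markdown_blocks_with_metadata(markdown_text):
    """Split on blank lines; tag each block with its section heading and content type."""
    chunks, heading = [], None
    for block in re.split(r"\n\s*\n", markdown_text.strip()):
        lines = block.strip().splitlines()
        if lines and lines[0].startswith("#"):
            # Remember the heading for the blocks that follow it.
            heading = lines[0].lstrip("#").strip()
            lines = lines[1:]
        if not lines:
            continue
        is_table = all(line.lstrip().startswith("|") for line in lines)
        chunks.append({
            "text": "\n".join(lines),
            "metadata": {
                "section": heading,
                "content_type": "table" if is_table else "text",
            },
        })
    return chunks


sample = """## Pricing Plan Comparison

Plans differ in user count and storage.

| Plan | Monthly | Users |
|------|---------|-------|
| Basic | $10 | 1 |
| Pro | $30 | 10 |
"""

blocks = markdown_blocks_with_metadata(sample)
for block in blocks:
    print(block["metadata"], len(block["text"].splitlines()), "lines")
```

The table comes out as a single four-line chunk labeled with its section, and a size-based splitter can then be applied to the text blocks only.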

Split code at function or class boundaries. Splitting in the middle of a function destroys its meaning.

# Split at function/class boundaries using language-specific separators
# (e.g. "\nclass ", "\ndef "); this is separator-based, not a full AST parse
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=0  # No overlap is standard for code
)
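For Python sources specifically, the standard-library ast module can find exact function and class boundaries. The sketch below is an alternative illustration, not what the LangChain splitter does: `split_python_by_definition` is a hypothetical helper that returns one chunk per top-level definition and ignores module-level statements such as imports.

```python
import ast


def split_python_by_definition(source):
    """Return one chunk per top-level function or class, using the parsed syntax tree."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit on lines above the def/class keyword, so include them.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "name": node.name,
                "text": "\n".join(lines[first - 1:node.end_lineno]),
            })
    return chunks


sample = '''
import os

def login(user):
    return user.lower()

class Session:
    def expire(self):
        return True
'''

code_chunks = split_python_by_definition(sample)
for chunk in code_chunks:
    print(chunk["name"], "->", len(chunk["text"].splitlines()), "lines")
```

Storing the definition name alongside the text gives each code chunk a natural metadata field for citation and filtering.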

Chunks that are too small lose context: splitting into 1–2 sentences per chunk strips away the surrounding text. A sentence like “This feature builds on the method described in the previous section” becomes meaningless without that previous section present.

Chunks That Are Too Large (Noise Increase)

Large chunks of 2000–4000 tokens mix relevant and irrelevant information. The longer the context passed to an LLM, the greater the risk that important information gets buried.

No Overlap (Information Loss at Boundaries)

Without overlap, a statement that straddles a split point does not appear intact in any chunk. For example, a sentence that begins “When the following conditions apply” can end one chunk while the conditions themselves start the next, so neither chunk answers the question on its own.

  • Chunking splits documents into searchable units and has a large impact on RAG quality
  • Start with paragraph-boundary chunking (RecursiveCharacterTextSplitter) and improve based on accuracy requirements
  • Set overlap (10–20%) to maintain context continuity
  • Metadata (source, page number, section name) is essential for citation and filtering
  • Tables and code require special handling: keep tables as single chunks; split code at syntactic boundaries

Q: What chunk size should I start with?

A: A chunk_size of 512 tokens with a chunk_overlap of 50 is a reasonable starting point for many document types. Build an evaluation set, compare retrieval accuracy with smaller and larger chunks, and adjust from there. There is no universally “correct” size; the optimal value varies by document type and query patterns.
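A comparison like that can start very small. The sketch below is a toy harness under loud assumptions: chunk size is counted in words, word overlap stands in for vector search, and the document, queries, and expected phrases are invented for illustration. It measures how often the top-ranked chunk contains the expected answer text for each candidate chunk size.

```python
def fixed_size_chunks(text, size):
    """Split text into chunks of `size` words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def top_k(query, chunks, k):
    """Rank chunks by word overlap with the query (stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(eval_set, chunks, k=1):
    """Fraction of queries whose expected phrase appears in a top-k chunk."""
    hits = sum(
        any(expected in chunk for chunk in top_k(query, chunks, k))
        for query, expected in eval_set
    )
    return hits / len(eval_set)


document = (
    "Login uses OAuth 2.0 tokens. Sessions expire after 30 minutes. "
    "Passwords need twelve characters. Search results are ranked by relevance."
)
eval_set = [
    ("when do sessions expire", "expire after 30 minutes"),
    ("how are results ranked", "ranked by relevance"),
]

scores = {size: hit_rate(eval_set, fixed_size_chunks(document, size)) for size in (4, 10)}
print(scores)
```

In this toy data the 4-word chunks cut the session-expiry sentence in half, so only one of the two queries is answered, while 10-word chunks answer both. On a real corpus, larger chunks will eventually hurt as noise grows, which is why the comparison has to be run on your own evaluation set.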

Q: How do I handle tables and code blocks?

A: Keep tables in a single chunk to preserve the header-row relationship. For code, use RecursiveCharacterTextSplitter.from_language() with function/class as the unit. For extracting tables accurately from PDFs, libraries like pdfplumber or camelot are helpful.

Q: If I change the chunking strategy, do I need to re-index?

A: Yes. Changing chunk size, splitting strategy, or overlap requires re-splitting and re-vectorizing all documents. When changing chunking strategy in production, verify accuracy on an evaluation set first and plan the migration carefully.
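One lightweight safeguard, sketched here as an assumption rather than an established practice from the sources above, is to store a fingerprint of the chunking configuration next to the index and compare it before serving. The function names and configuration keys are hypothetical.

```python
import hashlib
import json


def chunking_fingerprint(config):
    """Stable short hash of the chunking configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def needs_reindex(stored_fingerprint, current_config):
    """True when the index was built with a different chunking configuration."""
    return stored_fingerprint != chunking_fingerprint(current_config)


indexed_config = {"strategy": "recursive", "chunk_size": 512, "chunk_overlap": 50}
stored = chunking_fingerprint(indexed_config)  # saved alongside the index at build time

same_config = {"chunk_overlap": 50, "chunk_size": 512, "strategy": "recursive"}
new_config = {"strategy": "recursive", "chunk_size": 512, "chunk_overlap": 100}

print(needs_reindex(stored, same_config))  # False
print(needs_reindex(stored, new_config))   # True
```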

Q: How do I count tokens for non-English text?

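A: LangChain and LlamaIndex splitters can measure length by character (len()) or by token (tiktoken, etc.). The rule of thumb of about 4 characters per token applies to English only; in other languages the ratio can differ substantially, so counting with the tokenizer (for example, tiktoken) is more accurate and recommended.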

  1. LangChain Text Splitters — Documentation
  2. LlamaIndex Node Parsers — Documentation
  3. Chunking Strategies for LLM Applications — Pinecone