Chunk Engineering
In the context of modern NLP and Large Language Model (LLM) applications, chunk engineering refers to the strategic process of dividing text documents into smaller, semantically meaningful segments. This technique is fundamental to Retrieval-Augmented Generation (RAG) systems, where the quality of chunking directly impacts retrieval accuracy and generation quality.[^1]
Why Chunking Matters
Effective chunking serves several critical purposes:
| Purpose | Description |
|---|---|
| Context Window Optimization | Ensures text fits within LLM token limits while preserving meaning |
| Retrieval Precision | Smaller, focused chunks improve semantic search accuracy |
| Memory Efficiency | Enables processing of documents larger than available memory |
| Embedding Quality | Right-sized chunks produce more meaningful vector representations |
| Cost Management | Reduces token usage in API calls by retrieving only relevant segments |
Chunking vs. Tokenization
Understanding the distinction between chunking and tokenization is essential:
| Aspect | Tokenization | Chunking |
|---|---|---|
| Granularity | Word or subword level | Sentence, paragraph, or semantic unit level |
| Purpose | Convert text to model-readable format | Organize text for retrieval and processing |
| Output | Individual tokens | Coherent text segments |
| Relationship | Prerequisite to embedding | Strategy for document organization |
Chunking Strategies
1. Fixed-Size Chunking
The simplest approach: divide text into chunks of a predetermined size.
Parameters:
- Chunk size (tokens or characters)
- Overlap between chunks
Example:
Document: "Artificial intelligence is transforming industries. Machine learning
enables pattern recognition. Deep learning uses neural networks."
Chunk 1: "Artificial intelligence is transforming industries. Machine learning"
Chunk 2: "Machine learning enables pattern recognition. Deep learning"
Chunk 3: "Deep learning uses neural networks."
Advantages:
- Simple to implement
- Predictable chunk sizes for embedding models
Disadvantages:
- May split sentences or concepts mid-thought
- No awareness of semantic boundaries
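A minimal sketch of character-based fixed-size chunking with overlap (the size and overlap values are illustrative):

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows with the given overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 'step' characters after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap repeats the tail of each chunk at the head of the next, softening the hard boundary cuts noted above; overlap trade-offs are discussed in their own section below.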
2. Recursive Character Text Splitting
Hierarchically splits text using a priority list of separators, attempting to keep semantically related content together.[^2]
Separator priority (typical order):
- Paragraph breaks (`\n\n`)
- Line breaks (`\n`)
- Sentences (`. `)
- Words (` `)
- Characters
Algorithm:
- Try splitting on the first separator
- If chunks are too large, recursively split using the next separator
- Continue until all chunks meet size requirements
Best for: General-purpose text processing where document structure is unknown.
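To make the algorithm concrete, here is a simplified from-scratch sketch (illustrative only: real implementations such as LangChain's `RecursiveCharacterTextSplitter` also re-merge small pieces, preserve separators, and apply overlap):

```python
def recursive_split(text: str, separators: list[str], max_size: int = 512) -> list[str]:
    """Recursively split text on the highest-priority separator until
    every piece fits within max_size characters."""
    if len(text) <= max_size:
        return [text]
    if not separators:
        # No separators left: fall back to hard character cuts
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, rest, max_size))
    return pieces

# Priority order matches the separator list described above
document_text = "Intro paragraph.\n\nSecond paragraph. It has two sentences."
chunks = recursive_split(document_text, ["\n\n", "\n", ". ", " "], max_size=40)
```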
3. Sentence-Based Chunking
Groups complete sentences up to a maximum chunk size.
Implementation considerations:
- Requires accurate sentence boundary detection
- Must handle abbreviations (Dr., U.S., etc.) that would otherwise trigger false splits
- Should account for multi-sentence dependencies (e.g., pronouns referring back to earlier sentences)
Advantages:
- Preserves grammatical completeness
- Natural reading units
Disadvantages:
- Variable chunk sizes
- May separate related sentences
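A minimal sketch that greedily packs sentences into chunks (the regex splitter here is deliberately naive; production code should use a proper sentence tokenizer such as NLTK's or spaCy's):

```python
import re

def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack complete sentences into chunks of at most max_chars."""
    # Naive boundary detection: split after ., !, or ? followed by whitespace.
    # This mis-handles abbreviations like "Dr." -- use NLTK or spaCy in practice.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```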
4. Semantic Chunking
Uses embeddings to identify natural semantic breakpoints in text.
Process:
- Split document into sentences
- Generate embeddings for each sentence
- Calculate cosine similarity between adjacent sentences
- Split where similarity drops below threshold
Mathematical representation:
$$ \text{split\_at}(i) = \begin{cases} \text{True} & \text{if } \cos(\mathbf{e}_i, \mathbf{e}_{i+1}) < \theta \\ \text{False} & \text{otherwise} \end{cases} $$
Where $\mathbf{e}_i$ is the embedding of sentence $i$ and $\theta$ is the similarity threshold.
Best for: Documents with distinct topic shifts (research papers, textbooks).
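A sketch of the thresholding step (here `embed` is a placeholder for whatever sentence-embedding model you use, and the 0.75 threshold is illustrative):

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[list[str]]:
    """Group sentences, starting a new chunk wherever adjacent-sentence
    cosine similarity drops below the threshold. Assumes a non-empty input."""
    vectors = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(len(sentences) - 1):
        a, b = vectors[i], vectors[i + 1]
        similarity = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if similarity < threshold:  # similarity drop signals a topic shift
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks
```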
5. Document-Structure-Aware Chunking
Leverages document formatting to identify natural boundaries.
Structure elements:
- Headers and sections
- Lists and bullet points
- Tables and figures
- Code blocks
- Paragraphs
Markdown example:
```markdown
# Chapter 1: Introduction   ← Section boundary
## 1.1 Background           ← Subsection boundary
Content paragraph...        ← Chunk content
## 1.2 Motivation           ← New subsection = new chunk
```
Best for: Structured documents (documentation, reports, academic papers).
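A minimal sketch of header-based splitting for markdown (the regex and output shape are illustrative; LangChain's `MarkdownHeaderTextSplitter` is a fuller off-the-shelf option):

```python
import re

def split_on_headers(markdown: str) -> list[dict]:
    """Produce one chunk per header-delimited section of a markdown document."""
    chunks, header, lines = [], None, []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):  # any markdown header opens a new chunk
            if header is not None or lines:
                chunks.append({"section": header or "preamble",
                               "content": "\n".join(lines).strip()})
            header, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    chunks.append({"section": header or "preamble", "content": "\n".join(lines).strip()})
    return chunks
```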
6. Agentic Chunking
An advanced approach where an LLM decides optimal chunk boundaries based on content understanding.
Process:
- Present document sections to an LLM
- Ask the model to identify natural semantic boundaries
- Group content into propositions or concepts
- Generate descriptive metadata for each chunk
Advantages:
- Highest semantic quality
- Adaptive to content type
Disadvantages:
- Computationally expensive
- Requires LLM API calls during ingestion
- Higher latency
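In outline, an agentic chunker might look like the following sketch (`call_llm` is a hypothetical stand-in for your chat-completion client, and the prompt and JSON schema are illustrative, not a standard API):

```python
import json

PROMPT = """Split the document below into semantically self-contained chunks.
Return a JSON list of objects with "title", "summary", and "text" fields.

Document:
{document}"""

def agentic_chunks(document: str, call_llm) -> list[dict]:
    """Ask an LLM to propose chunk boundaries plus per-chunk metadata."""
    response = call_llm(PROMPT.format(document=document))  # hypothetical client
    return json.loads(response)
```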
Chunk Overlap
Overlapping chunks help preserve context across boundaries:
```
Chunk 1: [----------content A----------]
Chunk 2:                  [----------content B----------]
                          ^-- overlap region --^
```
Overlap guidelines:
- Typical overlap: 10-20% of chunk size
- Higher overlap for conversational text
- Lower overlap for structured documents
Trade-offs:
| Overlap | Pros | Cons |
|---|---|---|
| High (30%+) | Better context continuity | Storage bloat, redundant retrieval |
| Low (5-10%) | Efficient storage | May lose cross-boundary context |
| None | Most efficient | Risk of context loss |
Chunk Size Selection
Optimal chunk size depends on your use case:
| Use Case | Recommended Size | Rationale |
|---|---|---|
| Q&A Systems | 256-512 tokens | Precise, focused answers |
| Document Summary | 1024-2048 tokens | Broader context needed |
| Code Retrieval | Function-level | Natural semantic units |
| Legal Documents | Paragraph or clause | Preserve legal structure |
| Chat Applications | 128-256 tokens | Quick, relevant responses |
Embedding model constraints:
- OpenAI ada-002: 8,191 tokens max
- Sentence-Transformers: Often 256-512 tokens optimal
- Cohere: Up to 4,096 tokens
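Because these limits are expressed in tokens rather than characters, chunk sizes should be measured the same way. A quick way to count tokens with the tiktoken library (assuming the `cl100k_base` encoding used by recent OpenAI models):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    """Count tokens the way a cl100k_base-encoded model would."""
    return len(encoding.encode(text))

print(token_count("Artificial intelligence is transforming industries."))
```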
Metadata Enrichment
Enhance chunks with metadata for improved retrieval:
```json
{
  "content": "Chunk text content...",
  "metadata": {
    "source": "document.pdf",
    "page": 15,
    "section": "Chapter 3: Methods",
    "chunk_index": 42,
    "total_chunks": 156,
    "created_at": "2025-06-16",
    "topics": ["machine learning", "neural networks"]
  }
}
```
Key metadata fields:
- Document source and location
- Hierarchical position (section, subsection)
- Creation and modification timestamps
- Extracted entities and topics
- Relationships to other chunks
Evaluation Metrics
Retrieval Quality
- Recall@K: Proportion of relevant chunks retrieved in top K results
- Mean Reciprocal Rank (MRR): Average, across queries, of the reciprocal rank of the first relevant result
- Normalized Discounted Cumulative Gain (nDCG): Graded relevance measure
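As a concrete illustration, minimal implementations of Recall@K and MRR for a single query (the ranked-list and relevant-set shapes are assumed for illustration):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved).
    MRR is this value averaged over a set of queries."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```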
Chunk Quality
- Semantic coherence: Are concepts self-contained within chunks?
- Boundary accuracy: Do splits occur at natural breakpoints?
- Size consistency: Are chunks within expected size ranges?
Best Practices
- Start simple: Begin with recursive character splitting before complex methods
- Experiment with sizes: Test multiple chunk sizes for your specific use case
- Use overlap strategically: Balance context preservation with storage efficiency
- Add metadata: Rich metadata improves filtering and reranking
- Evaluate empirically: Measure retrieval performance on representative queries
- Consider hybrid approaches: Combine multiple strategies for complex documents
- Document your choices: Record chunking parameters for reproducibility
Implementation Example
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],
)

# Split document
document = "Your long document text here..."
chunks = splitter.split_text(document)

# Add metadata
enriched_chunks = [
    {"content": chunk, "index": i, "source": "document.txt"}
    for i, chunk in enumerate(chunks)
]
```
Related Topics
- Semantic Search - Uses chunked documents for retrieval
- Vector Embeddings - Chunks are embedded for similarity search
- Prompt Engineering - Retrieved chunks inform prompts
- Transformers - Token limits drive chunking requirements
References
[^1]: LangChain Documentation. (2024). Text Splitters. LangChain. https://python.langchain.com/docs/modules/data_connection/document_transformers/
[^2]: Pinecone. (2024). Chunking Strategies for LLM Applications. Pinecone Learning Center. https://www.pinecone.io/learn/chunking-strategies/