Chunk Engineering

In the context of modern NLP and Large Language Model (LLM) applications, chunk engineering refers to the strategic process of dividing text documents into smaller, semantically meaningful segments. This technique is fundamental to Retrieval-Augmented Generation (RAG) systems, where the quality of chunking directly impacts retrieval accuracy and generation quality.[1]


Why Chunking Matters

Effective chunking serves several critical purposes:

| Purpose | Description |
|---|---|
| Context Window Optimization | Ensures text fits within LLM token limits while preserving meaning |
| Retrieval Precision | Smaller, focused chunks improve semantic search accuracy |
| Memory Efficiency | Enables processing of documents larger than available memory |
| Embedding Quality | Right-sized chunks produce more meaningful vector representations |
| Cost Management | Reduces token usage in API calls by retrieving only relevant segments |

Chunking vs. Tokenization

Understanding the distinction between chunking and tokenization is essential:

| Aspect | Tokenization | Chunking |
|---|---|---|
| Granularity | Word or subword level | Sentence, paragraph, or semantic unit level |
| Purpose | Convert text to model-readable format | Organize text for retrieval and processing |
| Output | Individual tokens | Coherent text segments |
| Relationship | Prerequisite to embedding | Strategy for document organization |

Chunking Strategies

1. Fixed-Size Chunking

The simplest approach: divide text into chunks of a predetermined size.

Parameters:

  • Chunk size (tokens or characters)
  • Overlap between chunks

Example:

```text
Document: "Artificial intelligence is transforming industries. Machine learning
           enables pattern recognition. Deep learning uses neural networks."

Chunk 1: "Artificial intelligence is transforming industries. Machine learning"
Chunk 2: "Machine learning enables pattern recognition. Deep learning"
Chunk 3: "Deep learning uses neural networks."
```

Advantages:

  • Simple to implement
  • Predictable chunk sizes for embedding models

Disadvantages:

  • May split sentences or concepts mid-thought
  • No awareness of semantic boundaries
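
As a concrete sketch (plain Python, character-based sizing; the function name `fixed_size_chunks` is illustrative, not from any library):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with the given overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances per chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```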

2. Recursive Character Text Splitting

Hierarchically splits text using a priority list of separators, attempting to keep semantically related content together.[2]

Separator priority (typical order):

  1. Paragraph breaks (\n\n)
  2. Line breaks (\n)
  3. Sentences (. )
  4. Words ( )
  5. Characters

Algorithm:

  1. Try splitting on the first separator
  2. If chunks are too large, recursively split using the next separator
  3. Continue until all chunks meet size requirements

Best for: General-purpose text processing where document structure is unknown.
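
A simplified sketch of the recursion (not LangChain's actual implementation; it splits greedily, drops the separators, and omits the re-merging step real splitters use to pack pieces back up to the target size):

```python
def recursive_split(text: str, max_size: int,
                    separators=("\n\n", "\n", ". ", " ", "")) -> list[str]:
    """Recursively split on the first separator whose pieces fit max_size."""
    if len(text) <= max_size:
        return [text]
    sep, *rest = separators
    parts = list(text) if sep == "" else text.split(sep)
    chunks = []
    for part in parts:
        if len(part) > max_size and rest:
            # Piece still too large: retry with the next, finer separator
            chunks.extend(recursive_split(part, max_size, tuple(rest)))
        else:
            chunks.append(part)
    return chunks
```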


3. Sentence-Based Chunking

Groups complete sentences up to a maximum chunk size.

Implementation considerations:

  • Requires accurate sentence boundary detection
  • Must handle abbreviations (Dr., U.S., etc.)
  • Should account for dependencies that span multiple sentences

Advantages:

  • Preserves grammatical completeness
  • Natural reading units

Disadvantages:

  • Variable chunk sizes
  • May separate related sentences
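
A minimal sketch using a regex for boundary detection (production systems should prefer a real sentence tokenizer such as nltk or spaCy, which handle abbreviations far better):

```python
import re

def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive split on ., !, ? followed by whitespace; "Dr." and "U.S." will trip this up.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)  # adding this sentence would overflow: close the chunk
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```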

4. Semantic Chunking

Uses embeddings to identify natural semantic breakpoints in text.

Process:

  1. Split document into sentences
  2. Generate embeddings for each sentence
  3. Calculate cosine similarity between adjacent sentences
  4. Split where similarity drops below threshold

Mathematical representation:

$$ \text{split\_at}(i) = \begin{cases} \text{True} & \text{if } \cos(\mathbf{e}_i, \mathbf{e}_{i+1}) < \theta \\ \text{False} & \text{otherwise} \end{cases} $$

Where $\mathbf{e}_i$ is the embedding of sentence $i$ and $\theta$ is the similarity threshold.

Best for: Documents with distinct topic shifts (research papers, textbooks).
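
A sketch of steps 2-4, with the embedding model left as a caller-supplied `embed` function (e.g. a sentence-transformers model's encode method); the 0.7 threshold is an arbitrary starting point to tune:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    vectors = np.asarray(embed(sentences))
    chunks, current = [], [sentences[0]]
    for i in range(len(sentences) - 1):
        a, b = vectors[i], vectors[i + 1]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
        if sim < threshold:  # similarity dip = likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks
```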


5. Document-Structure-Aware Chunking

Leverages document formatting to identify natural boundaries.

Structure elements:

  • Headers and sections
  • Lists and bullet points
  • Tables and figures
  • Code blocks
  • Paragraphs

Markdown example:

```text
# Chapter 1: Introduction    ← Section boundary
## 1.1 Background            ← Subsection boundary

Content paragraph...         ← Chunk content

## 1.2 Motivation            ← New subsection = new chunk
```

Best for: Structured documents (documentation, reports, academic papers).
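
For markdown specifically, a minimal sketch that starts a new chunk at every header (illustrative only; libraries such as LangChain ship more complete header-aware splitters):

```python
import re

def markdown_section_chunks(md: str) -> list[str]:
    """Split a markdown document into one chunk per header-delimited section."""
    sections, current = [], []
    for line in md.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:  # a header opens a new section
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```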


6. Agentic Chunking

An advanced approach where an LLM decides optimal chunk boundaries based on content understanding.

Process:

  1. Present document sections to an LLM
  2. Ask the model to identify natural semantic boundaries
  3. Group content into propositions or concepts
  4. Generate descriptive metadata for each chunk

Advantages:

  • Highest semantic quality
  • Adaptive to content type

Disadvantages:

  • Computationally expensive
  • Requires LLM API calls during ingestion
  • Higher latency
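
A schematic sketch of the idea; `call_llm` is a stand-in for whatever chat-completion client you use, and the prompt and JSON contract are illustrative assumptions, not a standard:

```python
import json

SPLIT_PROMPT = """Split the text below at natural semantic boundaries.
Return a JSON list of objects, each with "text" and a one-line "summary".

TEXT:
{text}"""

def agentic_chunks(text: str, call_llm) -> list[dict]:
    """Ask an LLM to choose chunk boundaries and describe each chunk.

    call_llm: caller-supplied function, prompt string in -> response string out.
    """
    response = call_llm(SPLIT_PROMPT.format(text=text))
    return json.loads(response)  # expects [{"text": ..., "summary": ...}, ...]
```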

Chunk Overlap

Overlapping chunks help preserve context across boundaries:

```text
Chunk 1: [----------content A----------]
Chunk 2:           [----------content B----------]
                   ^--- overlap region ---^
```

Overlap guidelines:

  • Typical overlap: 10-20% of chunk size
  • Higher overlap for conversational text
  • Lower overlap for structured documents
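
As a quick sanity check on cost: with 512-token chunks and a 50-token overlap (about 10%), each chunk advances the window by 512 − 50 = 462 new tokens, so stored tokens grow by a factor of roughly 512/462 ≈ 1.11, i.e. about 11% storage overhead.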

Trade-offs:

| Overlap | Pros | Cons |
|---|---|---|
| High (30%+) | Better context continuity | Storage bloat, redundant retrieval |
| Low (5-10%) | Efficient storage | May lose cross-boundary context |
| None | Most efficient | Risk of context loss |

Chunk Size Selection

Optimal chunk size depends on your use case:

| Use Case | Recommended Size | Rationale |
|---|---|---|
| Q&A Systems | 256-512 tokens | Precise, focused answers |
| Document Summary | 1024-2048 tokens | Broader context needed |
| Code Retrieval | Function-level | Natural semantic units |
| Legal Documents | Paragraph or clause | Preserve legal structure |
| Chat Applications | 128-256 tokens | Quick, relevant responses |

Embedding model constraints:

  • OpenAI text-embedding-ada-002: 8,191 tokens max
  • Sentence-Transformers: Often 256-512 tokens optimal
  • Cohere: Up to 4,096 tokens

Metadata Enrichment

Enhance chunks with metadata for improved retrieval:

```json
{
  "content": "Chunk text content...",
  "metadata": {
    "source": "document.pdf",
    "page": 15,
    "section": "Chapter 3: Methods",
    "chunk_index": 42,
    "total_chunks": 156,
    "created_at": "2025-06-16",
    "topics": ["machine learning", "neural networks"]
  }
}
```

Key metadata fields:

  • Document source and location
  • Hierarchical position (section, subsection)
  • Creation and modification timestamps
  • Extracted entities and topics
  • Relationships to other chunks

Evaluation Metrics

Retrieval Quality

  1. Recall@K: Proportion of relevant chunks retrieved in top K results
  2. Mean Reciprocal Rank (MRR): Mean of 1/rank of the first relevant result, averaged across queries
  3. Normalized Discounted Cumulative Gain (nDCG): Graded relevance measure
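
A small sketch of the first two metrics, treating retrieval results as lists of chunk IDs (the function names and data layout are illustrative; nDCG is omitted for brevity):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """Mean of 1/rank of the first relevant hit over (retrieved, relevant) query pairs."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)
```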

Chunk Quality

  1. Semantic coherence: Are concepts self-contained within chunks?
  2. Boundary accuracy: Do splits occur at natural breakpoints?
  3. Size consistency: Are chunks within expected size ranges?

Best Practices

  1. Start simple: Begin with recursive character splitting before complex methods
  2. Experiment with sizes: Test multiple chunk sizes for your specific use case
  3. Use overlap strategically: Balance context preservation with storage efficiency
  4. Add metadata: Rich metadata improves filtering and reranking
  5. Evaluate empirically: Measure retrieval performance on representative queries
  6. Consider hybrid approaches: Combine multiple strategies for complex documents
  7. Document your choices: Record chunking parameters for reproducibility

Implementation Example

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Split document
document = "Your long document text here..."
chunks = splitter.split_text(document)

# Add metadata
enriched_chunks = [
    {"content": chunk, "index": i, "source": "document.txt"}
    for i, chunk in enumerate(chunks)
]
```

Related Concepts

  • Semantic Search - Uses chunked documents for retrieval
  • Vector Embeddings - Chunks are embedded for similarity search
  • Prompt Engineering - Retrieved chunks inform prompts
  • Transformers - Token limits drive chunking requirements

References


  1. LangChain Documentation. (2024). Text Splitters. LangChain. https://python.langchain.com/docs/modules/data_connection/document_transformers/
  2. Pinecone. (2024). Chunking Strategies for LLM Applications. Pinecone Learning Center. https://www.pinecone.io/learn/chunking-strategies/