Chunking
Chunking is the process of splitting a document into smaller segments (chunks) for embedding and retrieval. DocLD uses semantic chunking, so splits respect logical boundaries (e.g. paragraphs, sections, tables) rather than fixed character counts, which improves relevance when searching or answering questions over your documents.
How Chunking Works in DocLD
When a document is processed, the pipeline runs four stages:
- Parse — Extract text, tables, and layout from the source file
- Segment — Split content at logical boundaries (paragraphs, headings, table rows)
- Chunk — Group segments into chunks sized for embedding models
- Vectorize — Embed each chunk and store it for vector search
Semantic chunking preserves context. For example, a table stays together rather than being cut mid-row, and related paragraphs remain in the same chunk.
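The segment and chunk stages above can be sketched as follows. This is an illustrative approximation, not DocLD's actual implementation: the blank-line segmenter, the `max_chars` budget, and the greedy packing are all assumptions.

```python
def segment(text: str) -> list[str]:
    """Split text into segments at blank lines.

    A markdown table contains no blank lines between rows, so its
    rows naturally stay together in a single segment.
    """
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk(segments: list[str], max_chars: int = 800) -> list[str]:
    """Greedily pack whole segments into chunks; never split a segment.

    max_chars is an illustrative budget, not a DocLD default.
    """
    chunks: list[str] = []
    current = ""
    for seg in segments:
        if current and len(current) + len(seg) + 2 > max_chars:
            chunks.append(current)
            current = seg
        else:
            current = f"{current}\n\n{seg}" if current else seg
    if current:
        chunks.append(current)
    return chunks
```

Because packing operates on whole segments, a table row is never cut in half: the table either fits in the current chunk or starts a new one.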
Chunking Strategies
| Strategy | Best For | Description |
|---|---|---|
| Semantic | Documents with clear structure | Splits at paragraph, section, and table boundaries |
| Fixed size | Uniform content | Splits by character/token count with overlap |
| Section-based | Technical docs | Splits by heading hierarchy |
DocLD defaults to semantic chunking for most document types. Knowledge bases can be configured with different chunking settings depending on use case.
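For contrast, the fixed-size strategy from the table can be sketched as a sliding character window with overlap. The `size` and `overlap` values here are illustrative, not DocLD defaults:

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows; consecutive chunks share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means content near a chunk boundary appears in two chunks, which trades some storage for better recall on queries that straddle a boundary.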
Best Practices
- Avoid splitting mid-sentence — Chunks should be self-contained for retrieval
- Respect tables — Keep table rows together; tables are often high-value for extraction
- Consider overlap — Optional overlap between chunks can improve recall for edge cases
- Match chunk size to embedding model — DocLD embeds with Pinecone, so keep chunks within the embedding model's input limits
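A rough pre-flight check along the lines of the last point can catch oversized chunks before embedding. The 4-characters-per-token heuristic and the 512-token budget are assumptions for illustration, not documented DocLD or Pinecone limits; check your embedding model's actual context size.

```python
# Illustrative budget only -- not a documented DocLD or Pinecone limit.
MAX_TOKENS = 512


def rough_token_count(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def fits_embedding_model(chunk: str, budget: int = MAX_TOKENS) -> bool:
    """Return True if the chunk is likely within the model's token budget."""
    return rough_token_count(chunk) <= budget
```

In practice you would use the tokenizer that matches your embedding model rather than a character heuristic, but a cheap check like this is useful as an early guard.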
Related Concepts
Chunking feeds directly into embedding and RAG. Smaller, well-formed chunks lead to more precise vector search results and better citation quality in chat responses.