Chunking
Chunking is the process of splitting a document into smaller segments (chunks) for embedding and retrieval. DocLD uses semantic chunking, so splits respect logical boundaries (e.g. paragraphs, sections, tables) rather than fixed character counts, which improves relevance when searching or answering questions over your documents.
How Chunking Works in DocLD
When a document is processed, the pipeline runs four stages:
- Parse — Extract text, tables, and layout from the source file
- Segment — Split content at logical boundaries (paragraphs, headings, table rows)
- Chunk — Group segments into chunks sized for embedding models
- Vectorize — Embed each chunk and store it for vector search
Semantic chunking preserves context. For example, a table stays together rather than being cut mid-row, and related paragraphs remain in the same chunk.
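The segment and chunk stages above can be sketched as follows. This is an illustrative approximation, not DocLD's actual implementation: the blank-line segmenter, the `max_chars` budget, and the greedy packing are all assumptions.

```python
def segment(text: str) -> list[str]:
    """Split text into segments at blank lines.

    A markdown table contains no blank lines between rows, so its
    rows naturally stay together in a single segment.
    """
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk(segments: list[str], max_chars: int = 800) -> list[str]:
    """Greedily pack whole segments into chunks; never split a segment.

    max_chars is an illustrative budget, not a DocLD default.
    """
    chunks: list[str] = []
    current = ""
    for seg in segments:
        if current and len(current) + len(seg) + 2 > max_chars:
            chunks.append(current)
            current = seg
        else:
            current = f"{current}\n\n{seg}" if current else seg
    if current:
        chunks.append(current)
    return chunks
```

Because packing operates on whole segments, a table row is never cut in half: the table either fits in the current chunk or starts a new one.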
Chunking Strategies
| Strategy | Best For | Description |
|---|---|---|
| Semantic | Documents with clear structure | Splits at paragraph, section, and table boundaries |
| Fixed size | Uniform content | Splits by character/token count with overlap |
| Section-based | Technical docs | Splits by heading hierarchy |
DocLD defaults to semantic chunking for most document types. Knowledge bases can be configured with different chunking settings depending on use case.
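For contrast, the fixed-size strategy from the table can be sketched as a sliding character window with overlap. The `size` and `overlap` values here are illustrative, not DocLD defaults:

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows; consecutive chunks share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means content near a chunk boundary appears in two chunks, which trades some storage for better recall on queries that straddle a boundary.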
Best Practices
- Avoid splitting mid-sentence — Chunks should be self-contained for retrieval
- Respect tables — Keep table rows together; tables are often high-value for extraction
- Consider overlap — Optional overlap between chunks can improve recall for edge cases
- Match chunk size to embedding model — DocLD embeds with Pinecone, so keep chunks within the embedding model's input limits
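A rough pre-flight check along the lines of the last point can catch oversized chunks before embedding. The 4-characters-per-token heuristic and the 512-token budget are assumptions for illustration, not documented DocLD or Pinecone limits; check your embedding model's actual context size.

```python
# Illustrative budget only -- not a documented DocLD or Pinecone limit.
MAX_TOKENS = 512


def rough_token_count(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def fits_embedding_model(chunk: str, budget: int = MAX_TOKENS) -> bool:
    """Return True if the chunk is likely within the model's token budget."""
    return rough_token_count(chunk) <= budget
```

In practice you would use the tokenizer that matches your embedding model rather than a character heuristic, but a cheap check like this is useful as an early guard.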
Related Concepts
Chunking feeds directly into embedding and RAG. Smaller, well-formed chunks lead to more precise vector search results and better citation quality in chat responses.