Document Processing Guide

Understand how documents flow through the DocLD processing pipeline.

Pipeline Overview


Upload → Parse → OCR (if needed) → Chunk → Vectorize → Complete

Documents move through these stages in order, and the background queue handles each stage automatically. Documents are uploaded, parsed, and managed through the Documents API (POST /v1/documents). After chunking, chunk text is sent to the vector database, where embeddings are generated server-side with llama-text-embed-v2. For supported formats and pipeline details, see Document Parsing.
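
As a rough illustration of the upload step, the sketch below builds (but does not send) a multipart POST request to /v1/documents using only the Python standard library. Only the endpoint path comes from this guide. The host name, the Bearer authorization header, and the "file" form field name are assumptions made for the example, so check the Documents API reference for the actual values.

```python
import urllib.request
import uuid

# Assumption: placeholder host. Only the path /v1/documents comes from this guide.
BASE_URL = "https://api.example.com"


def build_upload_request(filename: str, content: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart POST request to /v1/documents."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the form field is named "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/v1/documents",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


req = build_upload_request("report.pdf", b"%PDF-1.7 ...", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it. Once the upload is accepted, the remaining stages run in the background queue without further calls.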

Chunking Strategy

DocLD uses layout-aware chunking to preserve document structure:

Semantic boundaries — Splits at paragraph and section boundaries
Size limits — Chunks are sized for optimal retrieval
Overlap — Optional overlap between chunks for context continuity
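
The three properties above can be sketched with a simple paragraph-based chunker. This is not DocLD's layout-aware implementation, which is not described here. It is a minimal stand-in that splits on blank lines, packs paragraphs up to a size limit, and optionally repeats trailing paragraphs in the next chunk. The function name, the max_chars limit, and the overlap_paragraphs parameter are hypothetical.

```python
def chunk_paragraphs(text: str, max_chars: int = 1000, overlap_paragraphs: int = 1) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars, with optional overlap.

    A single paragraph longer than max_chars is kept whole rather than split,
    and overlap carried into a new chunk can push it past max_chars.
    """
    # Semantic boundaries: treat blank lines as paragraph breaks.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        # Size limit: close the current chunk if adding this paragraph would exceed it.
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Overlap: carry the last paragraphs forward for context continuity.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "A" * 40 + "\n\n" + "B" * 40 + "\n\n" + "C" * 40
for chunk in chunk_paragraphs(sample, max_chars=90, overlap_paragraphs=1):
    print(len(chunk))
```

With overlap enabled, the paragraph that ends one chunk also opens the next, so a retrieved chunk keeps some of the preceding context.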

OCR Behavior

OCR is triggered when:

The file is an image (PNG, JPEG, etc.)
The PDF appears to be scanned (no extractable text layer)
Parsing yields insufficient text

OCR supports multiple languages and handwritten text.
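
The trigger conditions above amount to a small decision function, sketched below. The extension set beyond PNG and JPEG, the 50-character threshold for "insufficient text", and the has_text_layer flag are all assumptions for illustration. The guide does not state how DocLD detects scanned PDFs or what counts as insufficient.

```python
from pathlib import Path

# Assumption: the guide names PNG and JPEG; the other extensions are illustrative.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
# Hypothetical threshold for "parsing yields insufficient text".
MIN_TEXT_CHARS = 50


def needs_ocr(filename: str, extracted_text: str, has_text_layer: bool = True) -> bool:
    """Return True if any of the three OCR trigger conditions holds."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return True  # the file is an image
    if suffix == ".pdf" and not has_text_layer:
        return True  # the PDF appears to be scanned
    # Parsing produced too little text to be useful.
    return len(extracted_text.strip()) < MIN_TEXT_CHARS


print(needs_ocr("scan.png", ""))
print(needs_ocr("report.pdf", "x" * 500))
```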

Monitoring

Track processing status via the API or the dashboard. Failed documents can be reprocessed from the document detail page.
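
Tracking status through the API usually means polling until the document reaches a terminal state. The sketch below shows one way to structure that loop. The status strings ("complete", "failed") and the fetch_status callable are hypothetical, since this guide does not specify the status endpoint or its response values. In practice fetch_status would wrap a call to the Documents API.

```python
import time
from typing import Callable

# Assumption: names of the terminal statuses; check the API reference for real values.
TERMINAL_STATUSES = {"complete", "failed"}


def wait_for_document(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> str:
    """Poll fetch_status until a terminal status is returned or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document still in status {status!r} after {timeout}s")
        time.sleep(interval)


# Usage with a stand-in status source instead of a real API call.
fake_statuses = iter(["parse", "chunk", "vectorize", "complete"])
print(wait_for_document(lambda: next(fake_statuses), interval=0))
```

Injecting the status lookup as a callable keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.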