Document Processing Guide
Understand how documents flow through the DocLD processing pipeline.
Pipeline Overview
Upload → Parse → OCR (if needed) → Chunk → Vectorize → Complete
Documents move through these processing stages. Chunk text is sent to Pinecone for embedding (llama-text-embed-v2); embeddings are generated server-side. Each stage is handled automatically by the background queue. Documents are uploaded via the Upload API, parsed with the Parse API, and managed in the Documents API. For supported formats and pipeline details, see Document Parsing.
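The stage order above can be modeled as a small state machine. This is only an illustrative sketch: the `Stage` enum, `next_stage`, and `stages_for` are hypothetical names, not part of the DocLD API, and the real background queue may track state differently.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    # Stage names mirror the pipeline overview; the enum itself is hypothetical.
    UPLOAD = "upload"
    PARSE = "parse"
    OCR = "ocr"
    CHUNK = "chunk"
    VECTORIZE = "vectorize"
    COMPLETE = "complete"


def next_stage(current: Stage, needs_ocr: bool) -> Optional[Stage]:
    """Return the stage that follows `current`, skipping OCR when it is not needed."""
    if current is Stage.COMPLETE:
        return None
    order = list(Stage)
    following = order[order.index(current) + 1]
    if following is Stage.OCR and not needs_ocr:
        following = Stage.CHUNK
    return following


def stages_for(needs_ocr: bool) -> list[Stage]:
    """List every stage a document passes through, from upload to completion."""
    path = [Stage.UPLOAD]
    while (following := next_stage(path[-1], needs_ocr)) is not None:
        path.append(following)
    return path
```

For a document with an extractable text layer, `stages_for(False)` omits the OCR stage; for a scanned document, `stages_for(True)` includes it between parsing and chunking.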
Chunking Strategy
DocLD uses layout-aware chunking to preserve document structure:
- Semantic boundaries — Splits at paragraph and section boundaries
- Size limits — Chunks are sized for optimal retrieval
- Overlap — Optional overlap between chunks for context continuity
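To make the three ideas above concrete, here is a minimal sketch of paragraph-boundary chunking with a size limit and optional overlap. It is not DocLD's actual chunker: the function name `chunk_text`, the `max_chars` default, and measuring overlap in whole paragraphs are all assumptions chosen for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 1000, overlap_paragraphs: int = 0) -> list[str]:
    """Group paragraphs into chunks of at most `max_chars` characters.

    Splits only at blank lines (paragraph boundaries). A single paragraph
    longer than `max_chars` is kept whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk before it would exceed the size limit.
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs into the next chunk for context continuity.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With `overlap_paragraphs=1`, the last paragraph of each chunk is repeated at the start of the next one, which trades some index size for better context at chunk boundaries.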
OCR Behavior
OCR is triggered when:
- The file is an image (PNG, JPEG, etc.)
- The PDF appears to be scanned (no extractable text layer)
- Parsing yields insufficient text
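The three triggers above can be expressed as a single decision function. This is a sketch under stated assumptions: `ParseResult`, `needs_ocr`, the extension set, and the `MIN_TEXT_CHARS` threshold are hypothetical and do not reflect DocLD's internal values.

```python
import os
from dataclasses import dataclass

# Assumed values for illustration only; the real extension list and threshold may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
MIN_TEXT_CHARS = 50


@dataclass
class ParseResult:
    filename: str
    has_text_layer: bool
    extracted_text: str


def needs_ocr(result: ParseResult) -> bool:
    """Return True if any of the three OCR triggers applies."""
    ext = os.path.splitext(result.filename)[1].lower()
    if ext in IMAGE_EXTENSIONS:
        return True  # the file is an image
    if ext == ".pdf" and not result.has_text_layer:
        return True  # the PDF appears to be scanned
    # Parsing produced too little text to be useful.
    return len(result.extracted_text.strip()) < MIN_TEXT_CHARS
```

Checking the cheap conditions (file type, text layer) first means the text-length fallback only decides the outcome for files that parsed without an obvious reason to run OCR.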
OCR supports multiple languages as well as handwritten text.
Monitoring
Track processing status via the API or the dashboard. Failed documents can be reprocessed from the document detail page.
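One way to track status programmatically is to poll until the document reaches a terminal state. The sketch below deliberately avoids naming any DocLD endpoint: `get_status` is a caller-supplied function that you would implement with your own API client, and the status strings `"complete"` and `"failed"` are assumptions, not documented values.

```python
import time
from typing import Callable

# Assumed terminal status names; check the API reference for the real values.
TERMINAL_STATUSES = {"complete", "failed"}


def wait_for_document(
    get_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_status` until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

A caller would pass something like `lambda: client.fetch_status(doc_id)`, where `client.fetch_status` is a placeholder for whatever status call your integration uses, and then reprocess the document if the returned status is `"failed"`.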