Document Parsing
DocLD supports parsing multiple document formats with intelligent layout detection, OCR, and semantic chunking.
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Native parsing, OCR for scanned PDFs |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .tiff | VLM-based OCR |
| Spreadsheets | .csv, .xlsx, .xls, .xlsm, .xltx, .xltm, .qpw | Structured extraction |
| Presentations | .pptx, .ppt | Slide content extraction. Legacy .ppt has no native text; OCR or the document converter is recommended. |
| Documents | .docx, .doc, .txt, .html, .rtf | Direct text extraction |
Processing Pipeline
Upload → Parse → OCR → Chunk → Vectorize → Index
- Upload: File is uploaded to secure storage (Cloudflare R2)
- Parse: Content extracted based on file type
- OCR: VLM-based OCR handles images and scanned pages
- Chunk: Semantic and layout-aware chunking
- Vectorize: Chunk text is sent to the vector database; embeddings are generated server-side (llama-text-embed-v2)
- Index: Records stored in the vector database for semantic search
OCR Capabilities
Language Support
DocLD supports 50+ languages for OCR:
- European: English, Spanish, French, German, Italian, Portuguese, Dutch
- Asian: Chinese (Simplified/Traditional), Japanese, Korean, Thai, Vietnamese
- Middle Eastern: Arabic, Hebrew, Persian
- Indic: Hindi, Bengali, Tamil
- Auto-detection when language not specified
OCR Features
| Feature | Description |
|---|---|
| Multi-language | 50+ languages supported |
| Auto-detection | Automatically detects document language |
| Handwriting | Recognizes handwritten text |
| Table extraction | Preserves table structure |
| Layout preservation | Maintains document layout |
| Bounding boxes | Returns coordinates for each text block |
OCR Confidence
Results include confidence scores:
{
"text": "Extracted text content",
"confidence": 0.95,
"language": "en",
"bounding_boxes": [
{
"text": "Invoice #12345",
"x": 100,
"y": 50,
"width": 150,
"height": 25,
"confidence": 0.98
}
]
}
PDF-specific extraction
For PDFs, the pipeline can extract:
- Form fields: Fillable form fields (text, checkbox, dropdown, etc.) are extracted, attached to metadata.formFields, and included as blocks when using block-level parse. Enable with pdf.extractFormFields (default: true).
- Hyperlinks: Link annotations and their URLs are extracted, attached to metadata.hyperlinks, and included as blocks. Enable with pdf.extractHyperlinks (default: true).
- Tables: Tables come from two optional sources: (1) native, using in-PDF layout heuristics; (2) camelot, using the Camelot Modal service when pdf.useCamelot and CAMELOT_SERVICE_URL are set. When Camelot is enabled, both native and Camelot tables may be present per page. Each table entry includes an optional source: 'native' | 'camelot' for analytics and debugging.
- Figures: The enhance option agenticFigures uses a VLM to detect and describe figures from rendered page images (not native PDF XObject extraction). For embedded image extraction, use the dedicated images API where available.
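Because a page can carry both native and Camelot tables when Camelot is enabled, it can help to tally tables by source before deciding which ones to keep. The following is a minimal sketch, assuming each table entry is a dict with a hypothetical `page` key alongside the optional `source` field described above; the real entry shape may differ.

```python
from collections import Counter, defaultdict

def tables_by_source(tables):
    """Count table entries per source; entries without `source` count as 'unknown'."""
    return Counter(t.get("source", "unknown") for t in tables)

def pages_with_both_sources(tables):
    """Return sorted page numbers that carry both native and camelot tables.

    Assumes a hypothetical integer `page` key on each entry.
    """
    sources = defaultdict(set)
    for t in tables:
        page = t.get("page")
        if page is not None:
            sources[page].add(t.get("source"))
    return sorted(p for p, s in sources.items() if {"native", "camelot"} <= s)

# Hypothetical sample data, for illustration only.
sample = [
    {"page": 1, "source": "native"},
    {"page": 1, "source": "camelot"},
    {"page": 2, "source": "native"},
    {"page": 3},  # `source` is optional, so it may be absent
]
```

Pages returned by `pages_with_both_sources` are the ones where duplicate tables are most likely, so they are a reasonable place to start when debugging extraction results.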
Agentic OCR
Enable AI-enhanced parsing for complex documents:
{
"parsing_config": {
"agentic": true
}
}
Agentic mode provides:
- Better table reconstruction
- Figure and chart analysis
- Layout understanding
- Complex form parsing
- Higher accuracy (slower processing)
Chunking Strategies
Semantic Chunking (Default)
Splits by meaning and natural boundaries:
- Respects paragraph breaks
- Keeps related content together
- Preserves context across chunks
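DocLD's semantic chunker is not specified on this page. As a rough approximation of what "respects paragraph breaks" can mean, the sketch below greedily packs whole paragraphs into chunks without splitting any paragraph. The function name and the blank-line paragraph rule are assumptions for illustration, not the service's actual algorithm.

```python
def paragraph_chunks(text, max_chunk_size=1000):
    """Pack whole paragraphs into chunks of at most max_chunk_size characters.

    Paragraphs are assumed to be separated by blank lines. A single paragraph
    longer than max_chunk_size is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = para if not current else current + "\n\n" + para
        if not current or len(candidate) <= max_chunk_size:
            current = candidate
        else:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

A production chunker would typically also use headings, layout blocks, and embedding similarity; this sketch only shows the boundary-respecting part.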
Fixed Size Chunking
Splits by character count:
{
"chunking": {
"strategy": "fixed",
"max_chunk_size": 1000,
"overlap": 100
}
}
Page-Based Chunking
One chunk per page:
{
"chunking": {
"strategy": "page"
}
}
Configuration Options
Parsing Config
Programmatic parse (POST /api/v1/parse with an Authorization: Bearer header): accepts a config object with formatting.table_output_format and chunking (strategy, max_chunk_size, overlap). JSON bodies use input (a URL, a docld://… reference, or { file_id }) plus an optional config. The dashboard can call the session route POST /api/parse (cookie-authenticated), which uses the same handler. See the Documents API for request shapes.
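As an illustration of the JSON body described above, the sketch below builds (but does not send) a request with an input and an optional chunking config. The base URL and the helper name are hypothetical, and the exact request shape should be checked against the Documents API.

```python
import json
import urllib.request

# Hypothetical base URL; substitute your actual DocLD host.
BASE_URL = "https://docld.example.com"

def build_parse_request(input_ref, api_key, chunking=None):
    """Build a POST /api/v1/parse request with a JSON body (nothing is sent)."""
    body = {"input": input_ref}
    if chunking is not None:
        # `config` is optional; only chunking is set in this sketch.
        body["config"] = {"chunking": chunking}
    return urllib.request.Request(
        BASE_URL + "/api/v1/parse",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_parse_request(
    "https://example.com/invoice.pdf",
    "YOUR_API_KEY",
    chunking={"strategy": "fixed", "max_chunk_size": 1000, "overlap": 100},
)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the example runs without network access.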
Full document processing (e.g. document ingestion, workflows, or processing via the dashboard): uses a richer parsing config that can include OCR, agentic mode, and metadata options:
{
"parsing_config": {
"chunking": {
"strategy": "semantic",
"max_chunk_size": 1000,
"overlap": 100
},
"ocr": {
"enabled": true,
"language": "auto"
},
"agentic": false,
"include_metadata": true
}
}
| Option | Default | Description | Where |
|---|---|---|---|
| chunking.strategy | semantic | Chunking method | API and pipeline |
| chunking.max_chunk_size | 1000 | Max characters per chunk | API and pipeline |
| chunking.overlap | 100 | Character overlap between chunks | API and pipeline |
| ocr.enabled | true | Enable OCR for images/scans | Pipeline |
| ocr.language | auto | Language code or auto-detect | Pipeline |
| agentic | false | Enable AI-enhanced parsing | Pipeline |
| include_metadata | true | Include document metadata | Pipeline |
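To make the chunking.max_chunk_size and chunking.overlap options concrete, here is a minimal sketch of character-based chunking with overlap, using the defaults from the table. It only illustrates what the two numbers mean; it is not DocLD's implementation, which may treat boundaries differently.

```python
def fixed_chunks(text, max_chunk_size=1000, overlap=100):
    """Split text into chunks of at most max_chunk_size characters.

    Each chunk after the first starts `overlap` characters before the end of
    the previous one, so adjacent chunks share that many characters.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in [0, max_chunk_size)")

    step = max_chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

With the defaults, a 2,500-character document yields chunks starting at offsets 0, 900, and 1800, so a larger overlap means more chunks for the same text.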
Parsing Methods
Synchronous
Parse immediately and get results:
curl -X POST "/api/v1/parse" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf"
Asynchronous
Queue for background processing:
curl -X POST "/api/v1/parse/async" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf" \
-F "webhook_url=https://your-server.com/webhook"
Output Format
Parse results include structured content:
{
"job_id": "uuid",
"duration": 2.5,
"usage": {
"num_pages": 5,
"credits": 5
},
"result": {
"type": "full",
"chunks": [
{
"content": "# Invoice\n\nInvoice Number: INV-001...",
"page": 1,
"metadata": {
"type": "text",
"confidence": 0.98
}
}
]
}
}
Page usage
Parsing consumes pages (1 page = 1 unit). Pricing: 200 free pages per month, then $0.05 per page. See the dashboard or API Reference for balance and usage.
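The pricing above can be turned into a quick estimate. This sketch assumes the 200 free pages are applied first within the month and that no other fees or tiers apply; the constant and function names are illustrative.

```python
from decimal import Decimal

FREE_PAGES_PER_MONTH = 200
PRICE_PER_PAGE = Decimal("0.05")  # USD per page beyond the free allowance

def monthly_cost(pages_parsed):
    """Estimate the monthly charge in USD for a number of parsed pages."""
    if pages_parsed < 0:
        raise ValueError("pages_parsed must be non-negative")
    billable = max(0, pages_parsed - FREE_PAGES_PER_MONTH)
    return billable * PRICE_PER_PAGE

print(monthly_cost(150))   # within the free allowance
print(monthly_cost(1000))  # 800 billable pages -> 40.00
```

`Decimal` is used instead of floats so that currency amounts do not pick up binary rounding errors.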
CLI Support
Parse documents from the command line:
# Parse a single file
docld parse document.pdf
# Parse with agentic mode
docld parse document.pdf --agentic
# Parse an entire folder
docld parse ./documents -o ./parsed
Metadata and future capabilities
- XMP: When available, getPDFMetadata returns an optional xmpRaw string (catalog Metadata stream) for compliance and custom metadata.
- Planned (optional): Barcode/QR detection, large-PDF streaming (page-by-page), and duplicate/near-duplicate page detection may be added as opt-in steps; see the API changelog for updates.
API Reference
See the Documents API and API Reference for endpoint documentation.