Document Parsing
DocLD supports parsing multiple document formats with intelligent layout detection, OCR, and semantic chunking.
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Native parsing, OCR for scanned PDFs |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .tiff | VLM-based OCR |
| Spreadsheets | .csv, .xlsx, .xls, .xlsm, .xltx, .xltm, .qpw | Structured extraction |
| Presentations | .pptx, .ppt | Slide content extraction. Legacy .ppt files have no native text extraction; OCR or the document converter is recommended. |
| Documents | .docx, .doc, .txt, .html, .rtf | Direct text extraction |
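The table above can double as a quick pre-upload check. The sketch below is illustrative only: `FORMAT_EXTENSIONS` and `category_for` are hypothetical names rather than part of DocLD, and the extension lists are simply copied from the table.

```python
from pathlib import Path
from typing import Optional

# Extension lists copied from the Supported Formats table.
# The mapping and helper names are hypothetical, not a DocLD API.
FORMAT_EXTENSIONS = {
    "PDF": {".pdf"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff"},
    "Spreadsheets": {".csv", ".xlsx", ".xls", ".xlsm", ".xltx", ".xltm", ".qpw"},
    "Presentations": {".pptx", ".ppt"},
    "Documents": {".docx", ".doc", ".txt", ".html", ".rtf"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the format category for a filename, or None if unsupported."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the extension
    for category, extensions in FORMAT_EXTENSIONS.items():
        if ext in extensions:
            return category
    return None
```

Calling `category_for("scan.TIFF")` returns `"Images"`, while an unlisted extension such as `.zip` returns `None`, which a client could use to reject the file before uploading.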
Processing Pipeline
Upload → Parse → OCR → Chunk → Vectorize → Index
- Upload: File is uploaded to secure storage (Cloudflare R2)
- Parse: Content is extracted based on file type
- OCR: VLM-based OCR handles images and scanned pages
- Chunk: Semantic and layout-aware chunking
- Vectorize: Chunk text is sent to Pinecone; embeddings are generated server-side (llama-text-embed-v2)
- Index: Records are stored in Pinecone for semantic search
OCR Capabilities
Language Support
DocLD supports 50+ languages for OCR, including:
- European: English, Spanish, French, German, Italian, Portuguese, Dutch
- Asian: Chinese (Simplified/Traditional), Japanese, Korean, Thai, Vietnamese
- Middle Eastern: Arabic, Hebrew, Persian
- Indic: Hindi, Bengali, Tamil
- Auto-detection when no language is specified
OCR Features
| Feature | Description |
|---|---|
| Multi-language | 50+ languages supported |
| Auto-detection | Automatically detects document language |
| Handwriting | Recognizes handwritten text |
| Table extraction | Preserves table structure |
| Layout preservation | Maintains document layout |
| Bounding boxes | Returns coordinates for each text block |
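As a sketch of how per-block coordinates and confidence scores might be consumed, the snippet below keeps only blocks at or above a confidence threshold and converts each box to corner coordinates. It assumes each block is a dict with `text`, `x`, `y`, `width`, `height`, and `confidence` keys. Treating `x`/`y` as the top-left corner is an assumption, and the helper name and the 0.9 threshold are likewise not documented behavior.

```python
from typing import Any, Dict, List, Tuple

Corners = Tuple[float, float, float, float]


def confident_blocks(
    blocks: List[Dict[str, Any]], min_confidence: float = 0.9
) -> List[Tuple[str, Corners]]:
    """Return (text, (left, top, right, bottom)) for blocks at or above the threshold."""
    kept = []
    for block in blocks:
        if block["confidence"] < min_confidence:
            continue  # drop low-confidence text
        left, top = block["x"], block["y"]
        # Assumption: x/y mark the top-left corner of the box.
        right, bottom = left + block["width"], top + block["height"]
        kept.append((block["text"], (left, top, right, bottom)))
    return kept


# The second block is made-up sample data for illustration.
blocks = [
    {"text": "Invoice #12345", "x": 100, "y": 50, "width": 150, "height": 25, "confidence": 0.98},
    {"text": "smudged note", "x": 100, "y": 90, "width": 80, "height": 20, "confidence": 0.41},
]
print(confident_blocks(blocks))  # [('Invoice #12345', (100, 50, 250, 75))]
```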
OCR Confidence
Results include confidence scores:
{
"text": "Extracted text content",
"confidence": 0.95,
"language": "en",
"bounding_boxes": [
{
"text": "Invoice #12345",
"x": 100,
"y": 50,
"width": 150,
"height": 25,
"confidence": 0.98
}
]
}
Agentic OCR
Enable AI-enhanced parsing for complex documents:
{
"parsing_config": {
"agentic": true
}
}
Agentic mode provides:
- Better table reconstruction
- Figure and chart analysis
- Layout understanding
- Complex form parsing
- Higher accuracy (slower processing)
Chunking Strategies
Semantic Chunking (Default)
Splits by meaning and natural boundaries:
- Respects paragraph breaks
- Keeps related content together
- Preserves context across chunks
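DocLD's semantic chunker is not specified on this page, so the following is only a rough approximation of the idea of respecting paragraph breaks: split on blank lines, then pack whole paragraphs into chunks up to a size limit. The function name `pack_paragraphs` is hypothetical, and the real chunker likely uses layout and meaning signals that this sketch ignores.

```python
import re
from typing import List


def pack_paragraphs(text: str, max_chunk_size: int = 1000) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chunk_size characters.

    A single paragraph longer than the limit becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With `pack_paragraphs("aaaa\n\nbbbb\n\ncccc", max_chunk_size=10)`, the first two paragraphs fit together and the third starts a new chunk, so no paragraph is split in the middle.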
Fixed Size Chunking
Splits by character count:
{
"chunking": {
"strategy": "fixed",
"max_chunk_size": 1000,
"overlap": 100
}
}
Page-Based Chunking
One chunk per page:
{
"chunking": {
"strategy": "page"
}
}
Configuration Options
Parsing Config
API parse endpoint (POST /api/parse): accepts config with formatting.table_output_format and chunking (strategy, max_chunk_size, overlap). See the Parse API for the exact request shape.
Full document processing (e.g. document ingestion, workflows, or processing via the dashboard): uses a richer parsing config that can include OCR, agentic mode, and metadata options:
{
"parsing_config": {
"chunking": {
"strategy": "semantic",
"max_chunk_size": 1000,
"overlap": 100
},
"ocr": {
"enabled": true,
"language": "auto"
},
"agentic": false,
"include_metadata": true
}
}
| Option | Default | Description | Where |
|---|---|---|---|
| chunking.strategy | semantic | Chunking method | API and pipeline |
| chunking.max_chunk_size | 1000 | Max characters per chunk | API and pipeline |
| chunking.overlap | 100 | Character overlap between chunks | API and pipeline |
| ocr.enabled | true | Enable OCR for images/scans | Pipeline |
| ocr.language | auto | Language code or auto-detect | Pipeline |
| agentic | false | Enable AI-enhanced parsing | Pipeline |
| include_metadata | true | Include document metadata | Pipeline |
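One way to work with these defaults on the client side is to start from the documented values and overlay only the options you want to change. The helper below is a sketch under that assumption: `build_parsing_config` is a hypothetical name, not a DocLD function, and the defaults are copied from the table above.

```python
import copy
import json
from typing import Any, Dict

# Defaults copied from the options table; the helper itself is hypothetical.
DEFAULT_PARSING_CONFIG: Dict[str, Any] = {
    "chunking": {"strategy": "semantic", "max_chunk_size": 1000, "overlap": 100},
    "ocr": {"enabled": True, "language": "auto"},
    "agentic": False,
    "include_metadata": True,
}


def build_parsing_config(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return {"parsing_config": ...} with overrides merged over the defaults."""
    config = copy.deepcopy(DEFAULT_PARSING_CONFIG)  # never mutate the shared defaults
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)  # merge one level of nesting
        else:
            config[key] = value
    return {"parsing_config": config}


payload = build_parsing_config({"agentic": True, "chunking": {"max_chunk_size": 500}})
print(json.dumps(payload, indent=2))
```

Merging one level deep keeps `chunking.strategy` and `chunking.overlap` at their defaults when only `max_chunk_size` is overridden.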
Parsing Methods
Synchronous
Parse immediately and get results:
curl -X POST "/api/parse" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf"
Asynchronous
Queue for background processing:
curl -X POST "/api/parse/async" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf" \
-F "webhook_url=https://your-server.com/webhook"
Output Format
Parse results include structured content:
{
"job_id": "uuid",
"duration": 2.5,
"usage": {
"num_pages": 5,
"credits": 7.5
},
"result": {
"type": "full",
"chunks": [
{
"content": "# Invoice\n\nInvoice Number: INV-001...",
"page": 1,
"metadata": {
"type": "text",
"confidence": 0.98
}
}
]
}
}
Credit Usage
| Operation | Credits per Page |
|---|---|
| Standard parse | 1.5 |
| Agentic parse | 3.0 |
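The per-page rates make cost estimation simple arithmetic. The sketch below assumes billing is strictly linear in page count, with no rounding, minimums, or other charges, which this page does not confirm. `estimate_credits` is a hypothetical helper name.

```python
# Per-page rates copied from the credit table; the function name is hypothetical.
CREDITS_PER_PAGE = {"standard": 1.5, "agentic": 3.0}


def estimate_credits(num_pages: int, agentic: bool = False) -> float:
    """Estimate credits for a parse job, assuming linear per-page billing."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    rate = CREDITS_PER_PAGE["agentic" if agentic else "standard"]
    return num_pages * rate


print(estimate_credits(5))                # 7.5
print(estimate_credits(5, agentic=True))  # 15.0
```

The first result matches the sample output shown earlier, where a 5-page standard parse reports 7.5 credits.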
CLI Support
Parse documents from the command line:
# Parse a single file
docld parse document.pdf
# Parse with agentic mode
docld parse document.pdf --agentic
# Parse an entire folder
docld parse ./documents -o ./parsed
API Reference
See the Parse API for complete endpoint documentation.