Document Parsing

DocLD supports parsing multiple document formats with intelligent layout detection, OCR, and semantic chunking.

Supported Formats

Format	Extensions	Notes
PDF	`.pdf`	Native parsing, OCR for scanned PDFs
Images	`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`	VLM-based OCR
Spreadsheets	`.csv`, `.xlsx`, `.xls`, `.xlsm`, `.xltx`, `.xltm`, `.qpw`	Structured extraction
Presentations	`.pptx`, `.ppt`	Slide content extraction. Legacy `.ppt` has no native text extraction; use OCR or the document converter.
Documents	`.docx`, `.doc`, `.txt`, `.html`, `.rtf`	Direct text extraction
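
As a quick pre-upload check, the table above can be turned into a lookup. This is only a client-side sketch: the function name `format_category` and the idea of validating extensions locally are assumptions, not part of DocLD.

```python
from pathlib import Path
from typing import Optional

# Extension-to-category map copied from the Supported Formats table.
FORMAT_BY_EXTENSION = {
    ".pdf": "PDF",
    **dict.fromkeys([".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff"], "Images"),
    **dict.fromkeys(
        [".csv", ".xlsx", ".xls", ".xlsm", ".xltx", ".xltm", ".qpw"], "Spreadsheets"
    ),
    **dict.fromkeys([".pptx", ".ppt"], "Presentations"),
    **dict.fromkeys([".docx", ".doc", ".txt", ".html", ".rtf"], "Documents"),
}


def format_category(filename: str) -> Optional[str]:
    """Return the format category for a file name, or None if the extension is not listed."""
    return FORMAT_BY_EXTENSION.get(Path(filename).suffix.lower())
```

The lookup is case-insensitive on the extension, so `report.PDF` and `report.pdf` resolve to the same category.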

Processing Pipeline


Upload → Parse → OCR → Chunk → Vectorize → Index

Upload — File is uploaded to secure storage (Cloudflare R2)
Parse — Content extracted based on file type
OCR — VLM-based OCR handles images and scanned pages
Chunk — Semantic and layout-aware chunking
Vectorize — Chunk text is sent to the vector database; embeddings are generated server-side (llama-text-embed-v2)
Index — Records stored in the vector database for semantic search

OCR Capabilities

Language Support

DocLD supports 50+ languages for OCR, including:

European: English, Spanish, French, German, Italian, Portuguese, Dutch
Asian: Chinese (Simplified/Traditional), Japanese, Korean, Thai, Vietnamese
Middle Eastern: Arabic, Hebrew, Persian
Indic: Hindi, Bengali, Tamil
The language is auto-detected when none is specified

OCR Features

Feature	Description
Multi-language	50+ languages supported
Auto-detection	Automatically detects document language
Handwriting	Recognizes handwritten text
Table extraction	Preserves table structure
Layout preservation	Maintains document layout
Bounding boxes	Returns coordinates for each text block

OCR Confidence

OCR results include an overall confidence score and a per-block score for each bounding box:


{
  "text": "Extracted text content",
  "confidence": 0.95,
  "language": "en",
  "bounding_boxes": [
    {
      "text": "Invoice #12345",
      "x": 100,
      "y": 50,
      "width": 150,
      "height": 25,
      "confidence": 0.98
    }
  ]
}
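
A common use of the per-block scores is flagging text for manual review. The sketch below assumes the result shape shown above; the helper name `low_confidence_boxes` and the 0.9 default threshold are illustrative choices, not DocLD defaults.

```python
import json
from typing import List


def low_confidence_boxes(ocr_result: dict, threshold: float = 0.9) -> List[dict]:
    """Return bounding boxes whose confidence is below the threshold."""
    return [
        box
        for box in ocr_result.get("bounding_boxes", [])
        if box.get("confidence", 0.0) < threshold
    ]


# Sample result copied from the example above.
sample = json.loads("""
{
  "text": "Extracted text content",
  "confidence": 0.95,
  "language": "en",
  "bounding_boxes": [
    {"text": "Invoice #12345", "x": 100, "y": 50,
     "width": 150, "height": 25, "confidence": 0.98}
  ]
}
""")

flagged = low_confidence_boxes(sample, threshold=0.99)
```

With a threshold of 0.99 the single sample block (confidence 0.98) is flagged; with the 0.9 default nothing is.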

PDF-Specific Extraction

For PDFs, the pipeline can extract:

Form fields — Fillable form fields (text, checkbox, dropdown, etc.) are extracted and attached to metadata.formFields and included as blocks when using block-level parse. Enable with pdf.extractFormFields (default: true).
Hyperlinks — Link annotations and their URLs are extracted and attached to metadata.hyperlinks and as blocks. Enable with pdf.extractHyperlinks (default: true).
Tables — Tables come from in-PDF layout heuristics (detectTables). Each table entry may include an optional source field; legacy payloads can include camelot for older processed documents.
Figures — The enhance option agenticFigures uses a VLM to detect and describe figures from rendered page images (not native PDF XObject extraction). For embedded image extraction, use the dedicated images API where available.

Agentic OCR

Enable AI-enhanced parsing for complex documents:


{
  "parsing_config": {
    "agentic": true
  }
}

Agentic mode provides:

Better table reconstruction
Figure and chart analysis
Layout understanding
Complex form parsing
Higher accuracy, at the cost of slower processing

Chunking Strategies

Semantic Chunking (Default)

Splits by meaning and natural boundaries:

Respects paragraph breaks
Keeps related content together
Preserves context across chunks
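
To make "respects paragraph breaks" concrete, here is a rough stand-in that packs whole paragraphs into chunks. It is not DocLD's semantic chunker, whose algorithm is not described here; the function name `pack_paragraphs` and the greedy packing rule are assumptions for illustration only.

```python
from typing import List


def pack_paragraphs(text: str, max_chunk_size: int = 1000) -> List[str]:
    """Greedily group whole paragraphs into chunks of at most max_chunk_size characters.

    Paragraphs are never split, so a single paragraph longer than
    max_chunk_size becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chunk_size:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A real semantic chunker would also weigh headings, layout, and topical similarity; this sketch only shows the boundary-respecting part.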

Fixed Size Chunking

Splits text into chunks of at most max_chunk_size characters, with overlap characters shared between consecutive chunks:


{
  "chunking": {
    "strategy": "fixed",
    "max_chunk_size": 1000,
    "overlap": 100
  }
}
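
The effect of `max_chunk_size` and `overlap` can be sketched as a sliding window. This is a minimal illustration that assumes each chunk starts `max_chunk_size - overlap` characters after the previous one; DocLD's exact boundary handling may differ, and `fixed_chunks` is a hypothetical helper name.

```python
from typing import List


def fixed_chunks(text: str, max_chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of max_chunk_size characters that share `overlap` characters."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in [0, max_chunk_size)")
    step = max_chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Under these assumptions a 2,500-character document with the values above yields three chunks starting at offsets 0, 900, and 1800.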

Page-Based Chunking

One chunk per page:


{
  "chunking": {
    "strategy": "page"
  }
}

Configuration Options

Parsing Config

Programmatic parse (POST /api/v1/parse with an Authorization: Bearer header) accepts an optional config with formatting.table_output_format and chunking (strategy, max_chunk_size, overlap). JSON bodies use input (a URL, a docld://… reference, or { file_id }) plus the optional config. The dashboard can call the session route POST /api/parse, which authenticates with cookies and uses the same handler. See the Documents API for request shapes.
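
A JSON request along these lines might be assembled as follows. The sketch only builds the request object and does not send it. The host `docld.example.com`, the helper name `build_parse_request`, and the `FILE_ID` value are placeholders; check the Documents API for the authoritative request shape.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://docld.example.com"  # placeholder host, replace with your own


def build_parse_request(input_ref, api_key: str, config: Optional[dict] = None):
    """Build (but do not send) a JSON POST to /api/v1/parse."""
    body = {"input": input_ref}
    if config is not None:
        body["config"] = config
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/parse",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_parse_request(
    {"file_id": "FILE_ID"},  # placeholder file reference
    api_key="YOUR_API_KEY",
    config={"chunking": {"strategy": "fixed", "max_chunk_size": 1000, "overlap": 100}},
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```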

Full document processing (e.g. document ingestion, workflows, or processing via the dashboard) uses a richer parsing config that can include OCR, agentic mode, and metadata options:


{
  "parsing_config": {
    "chunking": {
      "strategy": "semantic",
      "max_chunk_size": 1000,
      "overlap": 100
    },
    "ocr": {
      "enabled": true,
      "language": "auto"
    },
    "agentic": false,
    "include_metadata": true
  }
}

Option	Default	Description	Where
`chunking.strategy`	`semantic`	Chunking method	API and pipeline
`chunking.max_chunk_size`	1000	Max characters per chunk	API and pipeline
`chunking.overlap`	100	Character overlap between chunks	API and pipeline
`ocr.enabled`	true	Enable OCR for images/scans	Pipeline
`ocr.language`	`auto`	Language code or auto-detect	Pipeline
`agentic`	false	Enable AI-enhanced parsing	Pipeline
`include_metadata`	true	Include document metadata	Pipeline
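
The dotted option names in the table correspond to nested keys in the JSON example. A small helper can expand one form into the other; `nest_options` is a hypothetical name, and the defaults below are copied from the table.

```python
def nest_options(options: dict) -> dict:
    """Expand dotted option names such as 'chunking.strategy' into nested dicts."""
    nested: dict = {}
    for dotted_key, value in options.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Defaults as listed in the options table.
defaults = {
    "chunking.strategy": "semantic",
    "chunking.max_chunk_size": 1000,
    "chunking.overlap": 100,
    "ocr.enabled": True,
    "ocr.language": "auto",
    "agentic": False,
    "include_metadata": True,
}

payload = {"parsing_config": nest_options(defaults)}
```

The resulting `payload` has the same structure as the full parsing config shown above.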

Parsing Methods

Synchronous

Parse immediately and get results:


# Replace YOUR_DOCLD_HOST with the host of your DocLD deployment
curl -X POST "https://YOUR_DOCLD_HOST/api/v1/parse" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"

Asynchronous

Queue for background processing:


# Replace YOUR_DOCLD_HOST with the host of your DocLD deployment
curl -X POST "https://YOUR_DOCLD_HOST/api/v1/parse/async" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "webhook_url=https://your-server.com/webhook"

Output Format

Parse results include structured content:


{
  "job_id": "uuid",
  "duration": 2.5,
  "usage": {
    "num_pages": 5,
    "credits": 5
  },
  "result": {
    "type": "full",
    "chunks": [
      {
        "content": "# Invoice\n\nInvoice Number: INV-001...",
        "page": 1,
        "metadata": {
          "type": "text",
          "confidence": 0.98
        }
      }
    ]
  }
}
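
Downstream code usually walks `result.chunks`. The sketch below groups chunk content by page, assuming the response shape shown above; `chunks_by_page` is an illustrative helper, not part of a DocLD SDK.

```python
from collections import defaultdict
from typing import Dict, List


def chunks_by_page(response: dict) -> Dict[int, List[str]]:
    """Group chunk content by page number, preserving chunk order within a page."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for chunk in response.get("result", {}).get("chunks", []):
        pages[chunk["page"]].append(chunk["content"])
    return dict(pages)


# Sample response copied from the example above.
sample_response = {
    "job_id": "uuid",
    "duration": 2.5,
    "usage": {"num_pages": 5, "credits": 5},
    "result": {
        "type": "full",
        "chunks": [
            {
                "content": "# Invoice\n\nInvoice Number: INV-001...",
                "page": 1,
                "metadata": {"type": "text", "confidence": 0.98},
            }
        ],
    },
}

grouped = chunks_by_page(sample_response)
```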

Page Usage

Parsing consumes pages (1 page = 1 unit). Pricing: 200 free pages per month, then $0.05 per page. See the dashboard or API Reference for balance and usage.
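
The pricing rule can be worked through in a few lines. This sketch assumes the free allowance is applied first and that every page beyond it is billed at the flat rate, with no other tiers or rounding; `monthly_cost_usd` is a hypothetical helper.

```python
FREE_PAGES_PER_MONTH = 200
PRICE_PER_PAGE_CENTS = 5  # $0.05 per page, kept in cents to avoid float drift


def monthly_cost_usd(pages_parsed: int) -> float:
    """Estimate the monthly charge in USD for a given number of parsed pages."""
    if pages_parsed < 0:
        raise ValueError("pages_parsed must be non-negative")
    billable_pages = max(0, pages_parsed - FREE_PAGES_PER_MONTH)
    return billable_pages * PRICE_PER_PAGE_CENTS / 100
```

For example, 1,000 pages in a month leaves 800 billable pages, or $40.00, while 150 pages stays within the free allowance.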

CLI Support

Parse documents from the command line:


# Parse a single file
docld parse document.pdf

# Parse with agentic mode
docld parse document.pdf --agentic

# Parse an entire folder
docld parse ./documents -o ./parsed

Metadata and Future Capabilities

XMP — When available, getPDFMetadata returns an optional xmpRaw string (catalog Metadata stream) for compliance and custom metadata.
Planned (optional) — Barcode/QR detection, large-PDF streaming (page-by-page), and duplicate/near-duplicate page detection may be added as opt-in steps; see the API changelog for updates.

API Reference

See the Documents API and API Reference for endpoint documentation.