File Format
A file format is the structure of a document file (e.g., PDF, DOCX, XLSX). DocLD supports a wide range of formats for parsing and extraction. Each format is handled with format-specific parsers and, for images and scanned documents, OCR when needed.
Supported Formats
| Format | Extensions | Parsing Method | Notes |
|---|---|---|---|
| PDF | .pdf | Native text + layout; OCR for scanned | Most common; preserves structure |
| Images | .png, .jpg, .jpeg, .gif, .tiff | VLM-based OCR | Photos, screenshots, scanned pages |
| Spreadsheets | .csv, .xlsx, .xls, .xlsm | Structured extraction | Tables, formulas, multiple sheets |
| Presentations | .pptx, .ppt | Slide content | Text and layout per slide |
| Documents | .docx, .doc, .txt, .html, .rtf | Direct text | Office and plain text |
Upload endpoints accept these extensions. File size limits and page limits may apply per plan.
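As a rough illustration, the extension list above can be checked on the client before uploading. The sketch below is not part of DocLD: the `SUPPORTED_EXTENSIONS` map and the `format_category` helper are hypothetical names, and the category labels are simply copied from the table.

```python
from pathlib import Path
from typing import Optional

# Extension-to-category map transcribed from the Supported Formats table.
# The dict name and category keys are illustrative, not a DocLD API.
SUPPORTED_EXTENSIONS = {
    "PDF": {".pdf"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".tiff"},
    "Spreadsheets": {".csv", ".xlsx", ".xls", ".xlsm"},
    "Presentations": {".pptx", ".ppt"},
    "Documents": {".docx", ".doc", ".txt", ".html", ".rtf"},
}


def format_category(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if unsupported."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the extension
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if ext in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ["report.PDF", "scan.tiff", "budget.xlsm", "archive.zip"]:
        print(name, "->", format_category(name))
```

A check like this only looks at the file name. It says nothing about file size or page limits, which depend on the plan.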
Format-Specific Behavior
- PDF — Native digital PDFs are parsed directly; scanned pages or image-only PDFs go through OCR
- Images — All image content is processed with OCR; supports 50+ languages
- Spreadsheets — Tables are extracted with structure preserved; useful for extraction and chunking
- Office docs — DOCX and similar are parsed for text and layout; formatting aids chunking
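The routing described in this list can be sketched as a small decision function. This is only a reading of the behavior above, not DocLD's implementation: `choose_parsing_method`, the `has_text_layer` flag, and the returned labels are all assumptions made for illustration.

```python
# Parsing-method labels taken from the Supported Formats table.
# The mapping and function below are hypothetical, for illustration only.
FIXED_METHODS = {
    "Images": "ocr",
    "Spreadsheets": "structured extraction",
    "Presentations": "slide content",
    "Documents": "direct text",
}


def choose_parsing_method(category: str, has_text_layer: bool = True) -> str:
    """Pick a parsing route for a format category.

    `has_text_layer` only matters for PDFs: a native digital PDF is parsed
    directly, while a scanned or image-only PDF goes through OCR.
    """
    if category == "PDF":
        return "native text + layout" if has_text_layer else "ocr"
    if category in FIXED_METHODS:
        return FIXED_METHODS[category]
    raise ValueError(f"unsupported format category: {category!r}")


if __name__ == "__main__":
    print(choose_parsing_method("PDF"))                        # native digital PDF
    print(choose_parsing_method("PDF", has_text_layer=False))  # scanned PDF
    print(choose_parsing_method("Images"))
```

The point of the sketch is that PDF is the only category where the route depends on the file's contents rather than on its extension alone.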
Choosing Formats
- Prefer native digital formats (PDF with embedded text, DOCX) over scanned versions when possible
- For scanned or image-only content, ensure sufficient resolution; OCR quality depends on image clarity
- Mixed batches (e.g., PDFs and images) are supported; DocLD detects format per file
Related Concepts
File format determines how parsing runs. OCR applies to images and scanned PDFs. Extraction works across all supported formats once content is parsed.