File Format | Glossary | DocLD

A file format is the structure of a document file (e.g., PDF, DOCX, XLSX). DocLD supports a wide range of formats for parsing and extraction. Each format is handled with format-specific parsers and, for images and scanned documents, OCR when needed.

Supported Formats

Format	Extensions	Parsing Method	Notes
PDF	`.pdf`	Native text + layout; OCR for scanned	Most common; preserves structure
Images	`.png`, `.jpg`, `.jpeg`, `.gif`, `.tiff`	VLM-based OCR	Photos, screenshots, scanned pages
Spreadsheets	`.csv`, `.xlsx`, `.xls`, `.xlsm`	Structured extraction	Tables, formulas, multiple sheets
Presentations	`.pptx`, `.ppt`	Slide content	Text and layout per slide
Documents	`.docx`, `.doc`, `.txt`, `.html`, `.rtf`	Direct text	Office and plain text

Upload endpoints accept these extensions. File size limits and page limits may apply per plan.
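
As a rough illustration of the table above, a client could check a filename against the listed extensions before uploading. This is a minimal sketch: the `SUPPORTED_EXTENSIONS` mapping and the `classify_extension` helper are hypothetical names, not part of any DocLD SDK, and the mapping simply mirrors the table.

```python
from pathlib import Path
from typing import Optional

# Extension-to-category mapping copied from the Supported Formats table.
# Both names below are illustrative assumptions, not a DocLD API.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".gif": "Images",
    ".tiff": "Images",
    ".csv": "Spreadsheets",
    ".xlsx": "Spreadsheets",
    ".xls": "Spreadsheets",
    ".xlsm": "Spreadsheets",
    ".pptx": "Presentations",
    ".ppt": "Presentations",
    ".docx": "Documents",
    ".doc": "Documents",
    ".txt": "Documents",
    ".html": "Documents",
    ".rtf": "Documents",
}


def classify_extension(filename: str) -> Optional[str]:
    """Return the format category for a filename, or None if unsupported."""
    # Compare case-insensitively so "REPORT.PDF" matches ".pdf".
    suffix = Path(filename).suffix.lower()
    return SUPPORTED_EXTENSIONS.get(suffix)
```

A call such as `classify_extension("archive.zip")` returns `None`, which a client could treat as a signal to skip the file before sending it.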

Format-Specific Behavior

PDF: Native digital PDFs are parsed directly; scanned pages and image-only PDFs go through OCR
Images: All image content is processed with OCR, which supports 50+ languages
Spreadsheets: Tables are extracted with their structure preserved, which helps both extraction and chunking
Office docs: DOCX and similar formats are parsed for text and layout; the formatting aids chunking
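
The routing described in this list can be sketched as a small decision function. This is only an illustration under stated assumptions: `choose_parsing_path`, its `has_embedded_text` flag, and the returned labels are hypothetical and do not reflect DocLD's internal implementation.

```python
def choose_parsing_path(category: str, has_embedded_text: bool = True) -> str:
    """Pick a parsing path for a format category.

    Hypothetical helper: the category names follow the Supported Formats
    table, and the returned labels are illustrative only.
    """
    if category == "Images":
        # All image content goes through OCR.
        return "ocr"
    if category == "PDF":
        # Native digital PDFs are parsed directly; scanned or
        # image-only PDFs fall back to OCR.
        return "native_text" if has_embedded_text else "ocr"
    if category == "Spreadsheets":
        # Tables are extracted with their structure preserved.
        return "structured_extraction"
    if category in ("Presentations", "Documents"):
        # Slides and office documents are parsed for text and layout.
        return "text_and_layout"
    raise ValueError(f"Unsupported category: {category!r}")
```

For example, `choose_parsing_path("PDF", has_embedded_text=False)` selects the OCR path, matching the scanned-PDF case above.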

Choosing Formats

Prefer native digital formats (PDF with embedded text, DOCX) over scanned versions when possible
For scanned or image-only content, ensure sufficient resolution; OCR quality depends on image clarity
Mixed batches (e.g., PDFs and images) are supported; DocLD detects format per file
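
The source does not say how DocLD detects the format of each file. One common technique, shown here purely as an assumption, is to inspect the leading bytes of a file (its "magic number") instead of trusting the extension. The `MAGIC_SIGNATURES` table and `sniff_format` function are hypothetical names; the byte signatures themselves are the standard ones for these formats.

```python
# Well-known leading byte signatures. The labels are illustrative.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    # DOCX, XLSX, and PPTX are ZIP containers, so the signature alone
    # cannot tell them apart.
    (b"PK\x03\x04", "zip-container"),
    # Legacy DOC, XLS, and PPT share the OLE compound file signature.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole-container"),
]


def sniff_format(header: bytes) -> str:
    """Guess a format label from the first bytes of a file."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    # Plain-text formats (TXT, CSV, HTML, RTF) have no fixed binary
    # signature in this table and fall through to "unknown".
    return "unknown"
```

In a mixed batch, a caller could read the first 8 bytes of each file and pass them to `sniff_format`, falling back to the extension when the result is `"unknown"` or an ambiguous container.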

File format determines how parsing runs. OCR applies to images and scanned PDFs. Extraction works across all supported formats once content is parsed.
