Parsing | Glossary | DocLD

Parsing extracts text, tables, and layout from documents. It's the first step in the DocLD pipeline: turn unstructured files into structured content that can be chunked, embedded, and searched.

Supported Formats

| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | `.pdf` | Native parsing; OCR for scanned PDFs |
| Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.tiff` | VLM-based OCR |
| Spreadsheets | `.csv`, `.xlsx`, `.xls`, `.xlsm` | Structured extraction |
| Presentations | `.pptx`, `.ppt` | Slide content |
| Documents | `.docx`, `.doc`, `.txt`, `.html`, `.rtf` | Direct text extraction |
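
As a rough illustration of how the format list could be applied before parsing, the sketch below maps a file extension to one of the five format categories. The `FORMAT_BY_EXTENSION` table and the `detect_format` helper are hypothetical names invented for this example and are not part of any documented DocLD API.

```python
from pathlib import Path

# Hypothetical lookup table built from the supported formats listed above.
# The category names are illustrative, not DocLD identifiers.
FORMAT_BY_EXTENSION = {
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".gif": "image", ".tiff": "image",
    ".csv": "spreadsheet", ".xlsx": "spreadsheet",
    ".xls": "spreadsheet", ".xlsm": "spreadsheet",
    ".pptx": "presentation", ".ppt": "presentation",
    ".docx": "document", ".doc": "document", ".txt": "document",
    ".html": "document", ".rtf": "document",
}


def detect_format(filename: str) -> str:
    """Return the format category for a filename, or raise if unsupported."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the extension
    try:
        return FORMAT_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext!r}") from None
```

Under these assumptions, `detect_format("Report.PDF")` returns `"pdf"`, and an unlisted extension such as `.zip` raises `ValueError` instead of being silently passed to a parser.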

Pipeline

Upload → Parse → [OCR](/glossary/ocr) (if needed) → [Chunk](/glossary/chunking) → Vectorize → Index
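
The stage order above, including the conditional OCR step, can be sketched as follows. This is a minimal trace under assumptions rather than DocLD's implementation: `ParsedDocument`, `run_pipeline`, and the `ocr` callback are hypothetical, and the later stages are only recorded by name instead of being executed.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ParsedDocument:
    name: str
    text: str
    stages: list[str] = field(default_factory=list)  # stages that ran, in order


def run_pipeline(name: str, embedded_text: str,
                 ocr: Callable[[str], str]) -> ParsedDocument:
    """Trace the pipeline stages for one document (hypothetical sketch)."""
    doc = ParsedDocument(name=name, text=embedded_text,
                         stages=["upload", "parse"])
    # OCR is optional: it runs only when parsing found no embedded text.
    if not doc.text.strip():
        doc.text = ocr(name)
        doc.stages.append("ocr")
    # Remaining stages are recorded by name only in this sketch.
    doc.stages += ["chunk", "vectorize", "index"]
    return doc
```

A PDF with embedded text skips the `"ocr"` stage entirely, while a scanned file with no embedded text picks it up between `"parse"` and `"chunk"`.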

Parsing produces:

Text — Extracted content with layout preserved
Tables — Structured rows and columns
Figures — Images with descriptions
Metadata — Page count, file info, processing details
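
One way to picture these four outputs is as a single result object. The dataclasses below are an assumed shape for illustration only; the class and field names (`ParseResult`, `Table`, `Figure`, `page_count`) are hypothetical and do not describe DocLD's actual response schema.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    rows: list[list[str]]  # first row is treated as the header in this sketch


@dataclass
class Figure:
    page: int
    description: str  # generated description of the image


@dataclass
class ParseResult:
    text: str
    tables: list[Table] = field(default_factory=list)
    figures: list[Figure] = field(default_factory=list)
    metadata: dict[str, object] = field(default_factory=dict)


# Example with made-up values, showing how the pieces fit together.
result = ParseResult(
    text="Quarterly Report\n\nRevenue grew in Q2.",
    tables=[Table(rows=[["Quarter", "Revenue"], ["Q2", "1200"]])],
    figures=[Figure(page=2, description="Bar chart of revenue by quarter")],
    metadata={"page_count": 3, "filename": "report.pdf"},
)
```

Keeping tables and figures separate from the running text lets later stages treat them as whole units instead of splitting them mid-row.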

Layout Awareness

DocLD uses semantic parsing to respect document structure. Headings, paragraphs, lists, and tables are identified so chunking can split at logical boundaries. Parsed output feeds into embedding and the vector index for vector search and extraction.
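
To make "split at logical boundaries" concrete, the sketch below groups parsed blocks so that every chunk starts at a heading and tables are never cut in half. It assumes the parser emits typed blocks. The `Block` type and `chunk_by_heading` function are hypothetical, and real chunkers usually also enforce a size limit, which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "heading", "paragraph", "list", or "table"
    text: str


def chunk_by_heading(blocks: list[Block]) -> list[list[Block]]:
    """Group blocks into chunks, starting a new chunk at each heading."""
    chunks: list[list[Block]] = []
    current: list[Block] = []
    for block in blocks:
        # A heading closes the chunk in progress and opens a new one.
        if block.kind == "heading" and current:
            chunks.append(current)
            current = []
        current.append(block)
    if current:
        chunks.append(current)  # flush the final chunk
    return chunks
```

Because whole blocks are moved rather than character ranges, a table or list always stays in one chunk together with the heading that introduces it.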

Parsed output is chunked and embedded for vector search and RAG. OCR runs only when a document lacks embedded text, and metadata from parsing is attached to the resulting chunks.
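
Attaching parse metadata to chunks might look like the sketch below. The record layout (`text`, `chunk_index`, `metadata`) and the `attach_metadata` helper are assumptions made for illustration, not DocLD's storage format.

```python
def attach_metadata(chunks: list[str], metadata: dict) -> list[dict]:
    """Pair each chunk with its position and a copy of the document metadata."""
    return [
        # dict(metadata) copies the mapping so chunks do not share one object.
        {"text": text, "chunk_index": i, "metadata": dict(metadata)}
        for i, text in enumerate(chunks)
    ]
```

Carrying the metadata on every chunk means a search hit can be traced back to its source file without a second lookup.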
