Parsing
Parsing extracts text, tables, and layout from documents. It's the first step in the DocLD pipeline: turn unstructured files into structured content that can be chunked, embedded, and searched.
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Native parsing; OCR for scanned documents |
| Images | .png, .jpg, .jpeg, .gif, .tiff | VLM-based OCR |
| Spreadsheets | .csv, .xlsx, .xls, .xlsm | Structured extraction |
| Presentations | .pptx, .ppt | Slide content |
| Documents | .docx, .doc, .txt, .html, .rtf | Direct text |
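As a rough illustration of how the table above could be used to route files, the following sketch maps a filename's extension to its format category. The names `FORMAT_BY_EXTENSION` and `detect_format` are hypothetical and are not part of any DocLD API; only the extensions and categories come from the table.

```python
from pathlib import Path
from typing import Optional

# Extension-to-category map copied from the Supported Formats table.
# The dictionary and function names are illustrative assumptions.
FORMAT_BY_EXTENSION = {
    ".pdf": "PDF",
    ".png": "Images", ".jpg": "Images", ".jpeg": "Images",
    ".gif": "Images", ".tiff": "Images",
    ".csv": "Spreadsheets", ".xlsx": "Spreadsheets",
    ".xls": "Spreadsheets", ".xlsm": "Spreadsheets",
    ".pptx": "Presentations", ".ppt": "Presentations",
    ".docx": "Documents", ".doc": "Documents", ".txt": "Documents",
    ".html": "Documents", ".rtf": "Documents",
}


def detect_format(filename: str) -> Optional[str]:
    """Return the format category for a filename, or None if unsupported."""
    # Lowercase the suffix so "REPORT.PDF" and "report.pdf" match alike.
    return FORMAT_BY_EXTENSION.get(Path(filename).suffix.lower())
```

A caller might reject uploads for which `detect_format` returns `None` before any parsing work begins.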
Pipeline
Upload → Parse → [OCR](/glossary/ocr) (if needed) → [Chunk](/glossary/chunking) → Vectorize → Index
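The conditional OCR step in the pipeline can be sketched as a small planning function. This is a minimal illustration, assuming the decision depends only on whether the document already has embedded text; `plan_stages` and the lowercase stage names are hypothetical, not DocLD identifiers.

```python
from typing import List


def plan_stages(has_embedded_text: bool) -> List[str]:
    """List the pipeline stages a document would pass through, in order."""
    stages = ["upload", "parse"]
    # OCR is only needed when parsing finds no embedded text (e.g. a scan).
    if not has_embedded_text:
        stages.append("ocr")
    stages += ["chunk", "vectorize", "index"]
    return stages
```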
Parsing produces:
- Text — Extracted content with layout preserved
- Tables — Structured rows and columns
- Figures — Images with descriptions
- Metadata — Page count, file info, processing details
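One way to picture these four outputs is as a single container type. The sketch below is an assumption about shape only: the class and field names (`ParsedDocument`, `Table`, `Figure`, `page_count`, and so on) are hypothetical and do not describe DocLD's actual response schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Table:
    rows: List[List[str]]  # each inner list is one row of cell values


@dataclass
class Figure:
    description: str  # generated description of the image
    page: int         # page the figure appears on (illustrative field)


@dataclass
class ParsedDocument:
    text: str                                      # extracted content
    tables: List[Table] = field(default_factory=list)
    figures: List[Figure] = field(default_factory=list)
    metadata: Dict[str, object] = field(default_factory=dict)


# Example instance with made-up values, for illustration only.
doc = ParsedDocument(
    text="Quarterly summary",
    tables=[Table(rows=[["Region", "Revenue"], ["North", "120"]])],
    figures=[Figure(description="Bar chart of revenue by region", page=2)],
    metadata={"page_count": 3, "filename": "summary.pdf"},
)
```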
Layout Awareness
DocLD uses semantic parsing to respect document structure. Headings, paragraphs, lists, and tables are identified so that chunking can split at logical boundaries. The parsed output then feeds into embedding and the vector index, which support vector search and extraction.
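To make "splitting at logical boundaries" concrete, here is a minimal sketch of structure-aware chunking. It assumes the parser emits a flat list of typed blocks and that a new chunk starts at each heading or when a size limit would be exceeded; the `Block` type, `chunk_blocks` function, and `max_chars` parameter are hypothetical and are not DocLD's actual chunking algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # "heading", "paragraph", "list", or "table"
    text: str


def chunk_blocks(blocks: List[Block], max_chars: int = 500) -> List[List[Block]]:
    """Group blocks into chunks without ever splitting inside a block."""
    chunks: List[List[Block]] = []
    current: List[Block] = []
    size = 0
    for block in blocks:
        starts_section = block.kind == "heading"
        too_big = size + len(block.text) > max_chars
        # Close the current chunk at a heading or when it would overflow.
        if current and (starts_section or too_big):
            chunks.append(current)
            current, size = [], 0
        current.append(block)
        size += len(block.text)
    if current:
        chunks.append(current)
    return chunks
```

Because blocks are kept whole, a table or list is never cut in half; a single block longer than `max_chars` simply becomes its own oversized chunk in this sketch.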
Related Concepts
Parsing is the first step in the DocLD pipeline. Its output is chunked and embedded for vector search and RAG. OCR runs when a document lacks embedded text. Metadata from parsing is attached to the resulting chunks.