Text Extraction | Glossary | DocLD

Text extraction is the process of pulling raw text from documents. For native PDFs, text is read from content streams and font mappings. For scanned documents and images, text extraction requires OCR (optical character recognition) because the content is stored as pixels, not characters.
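
As a rough illustration of the native-versus-scanned distinction, the sketch below uses the text layer when one exists and falls back to OCR only when it is empty. The `Page` record, the `min_chars` threshold, and the `ocr` callable are hypothetical placeholders chosen for this example; they are not DocLD's API, and a real pipeline would plug in an actual PDF reader and OCR engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    """Hypothetical page record used only for this illustration."""
    text_layer: str  # text read from content streams; empty for scans
    image: bytes     # rendered pixels, needed only when OCR runs


def extract_text(page: Page, ocr: Callable[[bytes], str], min_chars: int = 1) -> str:
    """Return the native text layer if present, otherwise fall back to OCR."""
    native = page.text_layer.strip()
    if len(native) >= min_chars:
        return native
    # No usable text layer: the content is pixels, so OCR is required.
    return ocr(page.image).strip()


def fake_ocr(image: bytes) -> str:
    """Stand-in for a real OCR engine (assumption for the example)."""
    return "recognized text"


# Sample data, invented for the example.
native_page = Page(text_layer="Invoice 2024-001", image=b"")
scanned_page = Page(text_layer="", image=b"\x89PNG...")
```

Passing the OCR step in as a callable keeps the routing logic independent of any particular OCR library.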

In DocLD

Parsing performs text extraction as its first step. The output is plain text with layout preserved (paragraphs, headings, lists). Tables are extracted as structured data, and the text within table cells is part of the extracted content. Extracted text then feeds into chunking and embedding for vector search, or into structured-field extraction.
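
The hand-off from extracted text to chunking can be sketched as follows. This is a minimal example that assumes paragraphs are separated by blank lines and packs them into chunks up to a character budget. The function name `chunk_paragraphs` and the `max_chars` parameter are hypothetical, and DocLD's actual chunking strategy is not described here.

```python
def chunk_paragraphs(text: str, max_chars: int = 200) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            # Either the chunk is empty or the paragraph still fits.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "Heading\n\nFirst paragraph.\n\nSecond paragraph."
chunks = chunk_paragraphs(sample, max_chars=30)
```

Splitting on paragraph boundaries rather than at fixed offsets is one way to carry the preserved layout through to the chunks that get embedded.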

Text extraction is the core of parsing. Layout analysis helps preserve structure, and OCR enables extraction from scanned documents. The resulting text is then chunked and used in retrieval-augmented generation (RAG) or structured extraction.
