Text Extraction
Text extraction is the process of pulling raw text from documents. For native PDFs, text is read from content streams and font mappings. For scanned documents and images, text extraction requires OCR (optical character recognition) because the content is stored as pixels, not characters.
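The choice between reading a native text layer and falling back to OCR can be sketched as follows. This is a minimal illustration, not DocLD's implementation: the `Page` type, the `extract_page_text` function, the `min_chars` threshold, and the `ocr` callable are all hypothetical names introduced here, and the "too little native text means the page is probably scanned" heuristic is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    """Hypothetical page record: native text layer plus rendered pixels."""
    native_text: str  # text read from the PDF content stream ("" if none)
    image: bytes      # rendered page image, used only when OCR is needed


def extract_page_text(
    page: Page,
    ocr: Callable[[bytes], str],
    min_chars: int = 20,  # assumed threshold for "has a usable text layer"
) -> str:
    """Return native text when the page has a usable text layer, else OCR it."""
    text = page.native_text.strip()
    if len(text) >= min_chars:
        return text
    # Little or no native text: treat the page as scanned and run OCR.
    return ocr(page.image).strip()
```

The `ocr` argument is passed in as a plain callable so that any OCR engine can be plugged in; in practice the threshold would likely need tuning per document set.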
In DocLD
Parsing performs text extraction as its first step. The output is plain text with layout preserved (paragraphs, headings, lists). Tables are extracted as structured data, and the text within table cells is part of the extracted content. Extracted text then feeds into chunking and embedding for vector search, or into structured-field extraction.
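To illustrate how layout-preserved text might feed into chunking, here is a small sketch that groups paragraphs into chunks under a size limit. It assumes paragraphs are separated by blank lines in the extracted text; the `chunk_text` function and its `max_chars` parameter are hypothetical and do not describe DocLD's actual chunking strategy.

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk
    rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Paragraph fits, or it is the first one in a new chunk.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk could then be passed to an embedding model for vector search. Keeping paragraph boundaries intact is one reasonable design choice because it avoids cutting sentences in half, at the cost of uneven chunk sizes.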
Related Concepts
Text extraction is the core of parsing. Layout analysis helps preserve structure; OCR enables extraction from scanned documents. The resulting text is then chunked and used in RAG or structured extraction.