Native PDF
A native PDF (sometimes called a "digital" or "text-based" PDF) is a PDF whose pages are drawn using text operators and embedded fonts. Text can be extracted directly from content streams without OCR. This is in contrast to a scanned document, where each page is an image and OCR is required.
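To make "text operators" concrete, here is a minimal Python sketch that pulls literal strings out of an already decoded content stream. It is an illustration under simplifying assumptions, not a real PDF parser: it only handles the `Tj` operator with literal string operands, it assumes the stream has already been decompressed, and it treats the string bytes as Latin-1 rather than applying the font's encoding. The function name `extract_literal_text` and both sample streams are made up for this example.

```python
import re

# Matches a literal string operand followed by the Tj (show text) operator,
# for example "(Hello) Tj". Backslash-escaped characters inside the string
# are tolerated. Nested unescaped parentheses are NOT handled in this sketch.
_TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_literal_text(content_stream: bytes) -> list[str]:
    """Return the literal strings shown with Tj in a decoded content stream."""
    results = []
    for match in _TJ_LITERAL.finditer(content_stream):
        raw = match.group(1)
        # Undo only the simplest escapes: \( \) and \\
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        # Assumption: bytes map directly to Latin-1 characters. Real fonts
        # may use custom encodings that need a ToUnicode map instead.
        results.append(raw.decode("latin-1"))
    return results


# A native page: text is placed between BT (begin text) and ET (end text).
sample_stream = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET"

# A scanned page: the content stream only paints an image XObject.
image_only_stream = b"q 612 0 0 792 0 0 cm /Im0 Do Q"

native_text = extract_literal_text(sample_stream)
scanned_text = extract_literal_text(image_only_stream)
```

The native stream yields its text directly, while the image-only stream yields nothing, which is why a scanned page needs OCR.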
Why It Matters
- Speed — Parsing is faster when text is already present; no OCR step.
- Accuracy — No character recognition errors; extraction reflects the actual encoded text.
- Layout — Layout analysis can use position and font info for better chunking and table extraction.
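As a rough illustration of the layout point, the sketch below groups positioned text spans into reading-order lines using their coordinates. The `Span` class, the `group_into_lines` function, and the `y_tolerance` value are hypothetical names and defaults chosen for this example. They assume a parser has already produced spans with PDF-style coordinates (origin at the bottom left, so a larger `y` is higher on the page).

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text with its position and font size (hypothetical shape)."""
    text: str
    x: float
    y: float
    font_size: float


def group_into_lines(spans: list[Span], y_tolerance: float = 2.0) -> list[str]:
    """Group spans whose baselines are within y_tolerance into lines."""
    lines: list[list[Span]] = []
    # Visit spans top to bottom, then left to right.
    for span in sorted(spans, key=lambda s: (-s.y, s.x)):
        # Compare against the first span of the current line.
        if lines and abs(lines[-1][0].y - span.y) <= y_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, order spans left to right and join their text.
    return [
        " ".join(s.text for s in sorted(line, key=lambda s: s.x))
        for line in lines
    ]


spans = [
    Span("world", x=120, y=700.0, font_size=12),
    Span("Hello", x=72, y=700.5, font_size=12),
    Span("Second", x=72, y=680.0, font_size=12),
]
lines = group_into_lines(spans)
```

The same position data could also support column or table detection, and font size could help separate headings from body text. This sketch covers only line grouping.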
DocLD automatically uses native text when available and falls back to OCR for image-only pages.
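The per-page fallback can be pictured with a small Python sketch. This is not DocLD's actual implementation. `PageInfo`, `choose_extraction`, `plan_document`, and the `min_chars` threshold are all hypothetical and only show one plausible way to decide between native text and OCR for each page.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical per-page summary produced by an earlier parsing step."""
    number: int
    native_text: str  # text found in the page's content stream, possibly empty


def choose_extraction(page: PageInfo, min_chars: int = 20) -> str:
    """Return "native" if the page has enough embedded text, else "ocr"."""
    # Ignore whitespace so a page holding only spaces or newlines
    # is still treated as image-only.
    visible = "".join(page.native_text.split())
    return "native" if len(visible) >= min_chars else "ocr"


def plan_document(pages: list[PageInfo], min_chars: int = 20) -> dict[int, str]:
    """Map each page number to the extraction path chosen for it."""
    return {page.number: choose_extraction(page, min_chars) for page in pages}


pages = [
    PageInfo(1, "This page has a real embedded text layer."),
    PageInfo(2, "  \n "),  # image-only page: nothing useful to extract
]
plan = plan_document(pages)
```

Deciding per page rather than per document means a mostly native file with a few scanned inserts only pays the OCR cost on those pages.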
Related Concepts
Native PDFs are parsed by extracting text directly from content streams. Scanned documents require OCR. Both feed into parsing and chunking.