Native PDF | Glossary | DocLD

A native PDF (sometimes called a "digital" or "text-based" PDF) is a PDF whose pages are drawn using text operators and embedded fonts, so text can be extracted directly from the content streams without OCR. By contrast, a scanned document stores each page as an image, and OCR is required to recover any text.
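To make "drawn using text operators" concrete, here is a minimal sketch of how one might check a page's content stream for text. It assumes the stream has already been decoded (decompressed) to bytes, and the function name `looks_native` and the sample streams are illustrative only. It is a regex heuristic, not a real PDF tokenizer, and it is not a description of how DocLD detects native text.

```python
import re

# The PDF content-stream syntax shows text with the operators Tj, TJ, ' and ",
# and these only take effect inside a BT ... ET text object.
_TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)

# A text-showing operator follows a string ")", a hex string ">" or an array "]".
_SHOW_TEXT = re.compile(rb"(?:\)|\]|>)\s*(?:Tj|TJ|'|\")")


def looks_native(content_stream: bytes) -> bool:
    """Heuristic: True if a decoded content stream has a text object that shows text."""
    for match in _TEXT_OBJECT.finditer(content_stream):
        if _SHOW_TEXT.search(match.group(1)):
            return True
    return False


# Hypothetical decoded streams: one draws a string, the other only paints an image.
native_page = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET"
scanned_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q"

print(looks_native(native_page))
print(looks_native(scanned_page))
```

A production parser would tokenize the stream properly, since a pattern match like this can be confused by operator-like bytes inside string literals or inline image data.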

Why It Matters

Speed — Parsing is faster when text is already present; no OCR step.
Accuracy — No character recognition errors; extraction reflects the actual encoded text.
Layout — Layout analysis can use position and font info for better chunking and table extraction.

DocLD automatically uses native text when available and falls back to OCR for image-only pages.
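The per-page fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DocLD's actual implementation: the `Page` type, the `extract_text` function, the `min_chars` threshold, and the `ocr` callable are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Page:
    native_text: str  # text recovered from the content stream, "" if none
    image: bytes      # rendered page image, only needed when OCR runs


def extract_text(
    pages: List[Page],
    ocr: Callable[[bytes], str],
    min_chars: int = 1,  # assumed threshold for "this page has native text"
) -> List[Tuple[str, str]]:
    """Return (source, text) per page, preferring native text over OCR."""
    results = []
    for page in pages:
        text = page.native_text.strip()
        if len(text) >= min_chars:
            results.append(("native", text))
        else:
            # Image-only page: no usable native text, so fall back to OCR.
            results.append(("ocr", ocr(page.image)))
    return results


# Usage with a stand-in OCR function; a real one would call an OCR engine.
pages = [Page("Quarterly report", b""), Page("   ", b"<image bytes>")]
for source, text in extract_text(pages, ocr=lambda image: "recognized text"):
    print(source, text)
```

Deciding per page rather than per document means a mixed file, such as a native report with a scanned appendix, only pays the OCR cost on the pages that need it.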

Native PDFs are handled by extracting text directly from their content streams, while scanned documents go through OCR first. Both paths then feed into the same parsing and chunking stages.
