Content Stream Processing
In PDF, a content stream is the sequence of graphics and text operators that define how a page is drawn. It references fonts, positions, and drawing commands. Text extraction from a native PDF works by interpreting these content streams to recover character strings and their positions.
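To make the interpretation step concrete, here is a minimal Python sketch that walks a content stream and recovers strings with their positions. It is an illustration under stated assumptions, not DocLD's implementation: the stream is assumed to be already decompressed, strings are assumed to be plain ASCII literals, and the names `extract_text`, `unescape`, `TOKEN_RE`, and `OPERATOR_RE` are hypothetical. Only the `BT`, `Td`, `Tm`, `Tj`, and `TJ` operators are handled, and font encodings, text scaling, and the page transformation matrix are ignored.

```python
import re

# One token per match: literal string, array bracket, name, number, or operator.
# Simplifications: no nested parentheses, hex strings, comments, or inline images.
TOKEN_RE = re.compile(
    r"\((?:\\.|[^\\()])*\)"   # literal string such as (Hello)
    r"|[\[\]]"                # array delimiters used by TJ
    r"|/[^\s\[\]()/]+"        # name such as /F1
    r"|[-+]?\d*\.?\d+"        # integer or real number
    r"|[A-Za-z'*]+"           # operator such as Tj, Td, T*
)
OPERATOR_RE = re.compile(r"[A-Za-z'*]+")


def unescape(literal):
    """Drop the outer parentheses and resolve escaped parentheses and backslashes."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def extract_text(stream):
    """Return (x, y, text) tuples for each Tj/TJ in a decoded content stream."""
    operands, results = [], []
    x = y = 0.0
    for tok in TOKEN_RE.findall(stream):
        if not OPERATOR_RE.fullmatch(tok):
            operands.append(tok)          # operands precede their operator
            continue
        if tok == "BT":                   # begin text object: reset position
            x = y = 0.0
        elif tok == "Td" and len(operands) >= 2:
            x += float(operands[-2])      # Td moves relative to the line start
            y += float(operands[-1])
        elif tok == "Tm" and len(operands) >= 6:
            # Keep only the translation part of the text matrix.
            x, y = float(operands[-2]), float(operands[-1])
        elif tok in ("Tj", "TJ"):
            # TJ arrays mix strings and kerning numbers; keep the strings.
            text = "".join(unescape(o) for o in operands if o.startswith("("))
            results.append((x, y, text))
        operands = []                     # every operator consumes its operands
    return results


# Hypothetical sample stream: two lines of text set in a font named /F1.
sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET"
print(extract_text(sample))
```

For the sample stream this yields two entries, "Hello" at (72, 720) and "World" at (72, 706). A production parser would also map character codes through the font's encoding, which this sketch skips.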
Why It Matters
- Native PDF — Text in a native PDF lives in content streams; parsing reads it without OCR.
- Layout — Position and order of operators inform layout analysis and chunking.
- Scanned PDF — Scanned documents may have no meaningful text in content streams (only "draw image"); OCR is used instead.
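The native versus scanned distinction above can be sketched as a simple check on a decoded content stream: does it contain any text-showing operators, or only an image draw? The Python below is a rough heuristic under stated assumptions, not how DocLD decides. The function name `classify_stream` and its labels are hypothetical, only `Tj` and `TJ` are treated as text-showing, and `Do` can also draw form XObjects, so an "image-only" result is a hint rather than proof. A scanned PDF that already carries an invisible OCR text layer would be reported as "text".

```python
import re

# Literal strings are blanked first so words inside them are not mistaken for operators.
STRING_RE = re.compile(r"\((?:\\.|[^\\()])*\)")
# Tj / TJ show text; Do paints an external object such as an image.
TEXT_OP_RE = re.compile(r"(?<![\w/])(?:Tj|TJ)(?![\w*])")
IMAGE_OP_RE = re.compile(r"(?<![\w/])Do(?!\w)")


def classify_stream(stream):
    """Label a decoded content stream as 'text', 'image-only', or 'empty'."""
    stripped = STRING_RE.sub("()", stream)
    if TEXT_OP_RE.search(stripped):
        return "text"            # text operators present: parse without OCR
    if IMAGE_OP_RE.search(stripped):
        return "image-only"      # only object draws: likely a scan, consider OCR
    return "empty"


# Hypothetical streams: a full-page image draw and a single line of text.
scanned_page = "q 612 0 0 792 0 0 cm /Im0 Do Q"
native_page = "BT /F1 12 Tf 72 720 Td (Do not scan) Tj ET"
print(classify_stream(scanned_page), classify_stream(native_page))
```

A pipeline might route "image-only" pages to OCR and parse "text" pages directly, with a fallback to OCR when the extracted text turns out to be empty or garbled.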
DocLD’s parser handles content streams when parsing PDFs to produce text and structure for chunking and extraction.
Related Concepts
Content streams are the source of text in native PDFs. Text extraction and parsing interpret them; scanned documents bypass them in favor of OCR.