Everything You Need to Know About PDFs
The Portable Document Format (PDF) is everywhere—contracts, tax forms, e-books, datasheets, and archival records. As a developer, you may need to extract text, index documents for search, validate structure, or build document processing pipelines that parse thousands of PDFs. Understanding how PDFs work under the hood helps you choose the right parsing strategy, debug encoding issues, and handle the wide variety of real-world files. This post covers the specification, file anatomy, object model, text extraction versus OCR, fonts and encoding, parsing approaches, security, linearization, accessibility, and the tooling ecosystem. The focus is technical and implementation-oriented: enough detail to reason about parsing behavior, encoding pitfalls, and when to use which tool, without reprinting the full ISO specification. For hands-on parsing, try our free document tools or read the documentation.
History and Specification
PDF was created at Adobe in the early 1990s. The goal was a format that would look the same on any device and operating system—hence "portable." Version 1.0 was released in 1993. The format evolved through PDF 1.1–1.7 with additions such as encryption, metadata (XMP), layers (Optional Content Groups), and improved compression. In 2008, PDF 1.7 was standardized as ISO 32000-1. PDF 2.0 (ISO 32000-2, first published in 2017 and revised in 2020) and ongoing ISO maintenance define the current specification.
PDF is related to PostScript: both use similar graphics models (paths, text, images, coordinate transforms). PDF can be thought of as a more structured, random-access cousin of PostScript—optimized for viewing and incremental updates rather than sending to a printer. Many PDFs are still generated by "distilling" PostScript (e.g., via Ghostscript or print-to-PDF).
| Milestone | Year | Notes |
|---|---|---|
| PDF 1.0 | 1993 | First release; basic objects, fonts, images |
| PDF 1.3 | 2000 | Digital signatures |
| PDF 1.4 | 2001 | JBIG2, transparency |
| PDF 1.5 | 2003 | Optional Content Groups (layers), object and cross-reference streams |
| PDF 1.7 / ISO 32000-1 | 2008 | Full ISO standardization |
| PDF 2.0 / ISO 32000-2 | 2017 | Second major version; revised edition published in 2020 |
Understanding that PDF is an ISO standard matters: the behavior of objects, encryption, and structure is defined by the spec, so implementations that follow it can interoperate. PDF 2.0 (ISO 32000-2) refines and extends the specification; it is backward compatible in the sense that PDF 1.x readers can often open PDF 2.0 files while ignoring newer features, though strict validation may require a 2.0-aware parser.
File Structure
A PDF file has a clear high-level layout. Opening one in a hex editor or viewing the first and last bytes makes this visible.
Header. The file must begin with %PDF-1.x where x is a digit (e.g. %PDF-1.4). This identifies the format and the version the file claims to follow. Optionally, a comment line of high-value binary bytes (e.g. %âãÏÓ) follows the header so that tools that sniff content detect the file as binary.
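A minimal sketch of header sniffing, using only the Python standard library: it checks for the %PDF-x.y signature at the start of the bytes and returns the claimed version.

```python
import re

def sniff_pdf_version(data: bytes):
    """Return the claimed PDF version ('1.4', '2.0', ...) or None.

    Per the spec the header must appear at the start of the file,
    though some real-world tools tolerate a little leading junk.
    """
    m = re.match(rb"%PDF-(\d\.\d)", data)
    return m.group(1).decode("ascii") if m else None

print(sniff_pdf_version(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n..."))  # 1.7
print(sniff_pdf_version(b"not a pdf"))                          # None
```

Note that the version in the header is only a claim; an incremental update or the catalog's /Version entry can raise the effective version.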
Body. The bulk of the file is the body: a sequence of indirect objects. Each indirect object has a positive object number, a generation number (usually 0), and a value. Values can be dictionaries, arrays, streams, strings, numbers, booleans, or the null object. Streams hold binary or encoded data (e.g. page content, embedded fonts, images) and are preceded by a dictionary that specifies length, filters (e.g. FlateDecode for zlib compression), and optional parameters.
Cross-reference table (xref). The xref table is a list of byte offsets (and generation numbers) for every indirect object. It allows a reader to jump directly to any object without scanning the whole file. This is what makes random access to pages and resources possible. In modern PDFs the xref may be stored as a stream (cross-reference stream) for compression.
Trailer. The trailer appears at the end of the file. It is a dictionary that points to the root of the document catalog (the top-level object) and to the xref table. The reader starts by parsing the trailer, follows the reference to the xref, then can resolve any object by number. After the trailer you typically see %%EOF (optional whitespace and end-of-file). Multiple trailers can exist in incrementally updated files; the one at the end of the file defines the current view of the document.
```
%PDF-1.4
... body: indirect objects ...
xref
0 n
... offsets ...
trailer
<< /Size ... /Root 1 0 R ... >>
startxref
12345
%%EOF
```
This structure implies that "text" in the usual sense is not stored as a contiguous block. What you see as a page is produced by interpreting content streams (which reference fonts, images, and other resources). Extracting text therefore requires a parser that understands the object model and the graphics operators that draw text. The xref table can be in two forms: the classic xref table (keyword xref followed by lines of offset and generation) or a cross-reference stream (an indirect object that is a stream containing compressed offset data). The trailer’s startxref value gives the byte offset to the beginning of the xref so the reader can find it quickly from the end of the file.
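Because the reader starts from the end of the file, locating startxref is the first real parsing step. A hedged sketch of that scan, on raw bytes with no PDF library:

```python
import re

def find_startxref(data: bytes):
    """Locate the byte offset of the last xref section.

    Readers scan backwards from the end of the file for the
    'startxref <offset>' pair that precedes the final %%EOF.
    Only the tail is searched, since the trailer area sits there.
    """
    tail = data[-2048:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    return int(matches[-1]) if matches else None

sample = b"%PDF-1.4\n... objects ...\nstartxref\n12345\n%%EOF\n"
print(find_startxref(sample))  # 12345
```

A production parser would then seek to that offset and branch on whether it finds the keyword xref (classic table) or an indirect object (cross-reference stream).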
Object Model
The PDF object model is built from a small set of types. Indirect objects are the units of storage: each has an object number and generation, and other objects refer to them by reference (e.g. 5 0 R). Dictionaries map names (e.g. /Type, /Subtype) to values. Arrays are ordered lists. Streams are dictionaries followed by binary data; they hold page content, font programs, and images. Names (e.g. /Font, /Pages) and strings (literal or hexadecimal) are used for keys and text. Numbers and booleans round out the set.
The document catalog is the root. It has a /Type of /Catalog and typically a /Pages entry pointing to the page tree. The page tree is a hierarchy of node dictionaries: intermediate nodes have /Kids (array of child nodes) and /Count; leaf nodes are page objects. Each page object has a /MediaBox (page size), /Resources (fonts, XObjects, etc.), and /Contents (one or more content streams that describe what to draw).
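The page tree traversal described above is easy to sketch. Here the nodes are plain Python dicts standing in for resolved PDF dictionaries (a real parser would resolve indirect references like 5 0 R first):

```python
def count_pages(node: dict) -> int:
    """Count leaf /Page objects under a page-tree node.

    Intermediate nodes carry Type=Pages and a Kids array; leaves
    carry Type=Page. Real trees also carry a /Count for fast lookup.
    """
    if node.get("Type") == "Page":
        return 1
    return sum(count_pages(kid) for kid in node.get("Kids", []))

tree = {
    "Type": "Pages",
    "Kids": [
        {"Type": "Page"},
        {"Type": "Pages", "Kids": [{"Type": "Page"}, {"Type": "Page"}]},
    ],
}
print(count_pages(tree))  # 3
```

The /Count entry on intermediate nodes exists precisely so viewers can jump to page N without this full walk.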
When a viewer renders a page, it:
- Finds the page object from the page tree.
- Loads resources (fonts, images) referenced from that page.
- Interprets the content stream(s) in order. Operators like BT (begin text), Tf (set font and size), Tj (show text string), and ET (end text) draw text; path and image operators draw graphics.
So the "content" of a page is a program (the content stream) that runs in a small graphics state machine. Text extraction works by either re-executing that program and recording text-drawing operations (with positions and font info) or by parsing the stream and identifying text operators. Libraries like PDF.js do the former: they build an internal representation of the page and expose APIs such as getTextContent() that return text items with coordinates and font metadata. Names in PDF are tokens that begin with / and are used for dictionary keys and some predefined values (e.g. /Type, /Font, /Width). Strings can be literal (like this) or hexadecimal <4C696E65>. Streams must have a /Length (or be marked with an indirect length) and can specify filters such as /FlateDecode (zlib) or /DCTDecode (JPEG) so that the stream data is decompressed before use.
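To make the stream-and-filter idea concrete, here is a sketch that round-trips a tiny content stream through /FlateDecode (zlib) and then naively scans it for strings drawn with Tj. A real extractor interprets the full operator stream and the font's encoding rather than pattern matching:

```python
import re
import zlib

# A content stream as it might appear after /FlateDecode decompression.
raw = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(raw)  # what the stream bytes look like on disk

decoded = zlib.decompress(compressed)
# Naive scan for literal strings shown with Tj; ignores escapes,
# hex strings, TJ arrays, and encoding, so treat it as a toy.
strings = re.findall(rb"\((.*?)\)\s*Tj", decoded)
print(strings)  # [b'Hello']
```

This is also why grepping a PDF file for visible text usually fails: the bytes on disk are the compressed form, not the operator text.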
Text vs. Images
Not all PDFs contain extractable text. There are two broad cases.
Native (digital) PDFs. The file was created from an application (Word, InDesign, LaTeX, etc.). Text is represented by font resources and text-drawing operators in content streams. A parser can extract this text by interpreting those operators and resolving character codes through the font’s encoding (and optional CMap). This is fast and accurate and preserves logical order if the content stream is well structured.
Scanned PDFs. The file was produced by scanning paper. Each page is typically one or more images (XObject or inline). There are no text operators—only "draw this image here." To get text you must run OCR (optical character recognition) on the raster image. OCR is slower, can introduce errors, and depends on image quality and language support. You can try image-to-text OCR online for scanned pages.
In practice many PDFs are mixed: some pages have native text, others are scanned. A robust pipeline will detect per-page whether there is enough extractable text (e.g. character count above a threshold) and fall back to OCR when there is not. For example, a parser might use a heuristic like "if a page yields fewer than 50 characters of text, treat it as scanned and run OCR." That way, digital pages use fast text extraction and scanned pages still get processed.
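The per-page fallback heuristic is a few lines of logic. This sketch uses the 50-character threshold mentioned above and counts only non-whitespace characters so pages holding stray spaces still fall through to OCR:

```python
MIN_CHARS_PER_PAGE = 50  # threshold from the heuristic above

def needs_ocr(extracted_text: str) -> bool:
    """Decide whether a page should fall back to OCR.

    Counts non-whitespace characters so that pages yielding only
    stray spaces or newlines are still treated as scanned.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < MIN_CHARS_PER_PAGE

print(needs_ocr("  \n "))    # True  -> render page and run OCR
print(needs_ocr("A" * 200))  # False -> keep the extracted text
```

The threshold is a tunable assumption, not a spec value; pages that are mostly figures with short captions may need a lower one.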
Fonts and Encoding
Why does copy-pasting from some PDFs produce gibberish or wrong characters? Usually the answer is encoding.
A font in a PDF has an encoding that maps byte values (or character codes) to glyphs. Common encodings include:
- Simple fonts with predefined encodings: standard single-byte encodings (e.g. WinAnsiEncoding, MacRomanEncoding) map one byte to one glyph. These are well defined and widely supported, including for the standard 14 base fonts.
- Embedded fonts: The font program can define a custom encoding. If the creator used a non-standard mapping, a reader that assumes WinAnsi will show wrong characters.
- Identity-H / Identity-V: Used for Unicode-capable fonts (e.g. CID fonts). Character codes are multi-byte (typically 2 bytes). The mapping from Unicode to these codes is given by a CMap (character map). If the CMap is missing or not applied, extraction yields raw codes instead of text.
So "garbled" extraction often means: the extractor used the wrong encoding, or the font is CID/Unicode and the CMap was not used. Correct text extraction must use the font’s encoding and, when present, the ToUnicode CMap referenced from the font dictionary. Libraries like PDF.js handle this internally when you call getTextContent(); they resolve ToUnicode CMaps and encodings so that the returned strings are Unicode text.
| Encoding type | Typical use | Extraction note |
|---|---|---|
| WinAnsiEncoding | Western Latin | Single byte; usually correct if font is standard |
| MacRomanEncoding | Legacy Mac | Single byte; similar to WinAnsi with different high bytes |
| Identity-H/V | CJK, Unicode | Requires CMap (ToUnicode) for correct Unicode output |
| Custom | Embedded fonts | Encoding embedded in font; reader must use it |
Embedded subsets are common: only the glyphs used in the document are included. That does not change the encoding logic; it just means the font resource is smaller. When a PDF references a ToUnicode CMap, that CMap defines a mapping from character codes (as used in the content stream) to Unicode code points. Extractors that honor ToUnicode can produce correct UTF-8 or UTF-16 output even for CJK and other scripts. If you are building a custom pipeline, prefer libraries that already implement this so you do not have to reimplement CMap parsing yourself.
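To show what honoring ToUnicode involves, here is a minimal parser for the beginbfchar sections of a CMap. Real CMaps also use bfrange sections and can map one code to several code points (including surrogate pairs), so this is a sketch of the idea, not a complete implementation:

```python
import re

def parse_bfchar(cmap_text: str) -> dict:
    """Parse beginbfchar...endbfchar sections of a ToUnicode CMap.

    Returns {character code -> Unicode string}. The destination hex
    is UTF-16BE; this sketch splits it into 4-hex-digit units and
    does not handle surrogate pairs or bfrange sections.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            code = int(src, 16)
            units = [int(dst[i:i + 4], 16) for i in range(0, len(dst), 4)]
            mapping[code] = "".join(chr(u) for u in units)
    return mapping

cmap = "2 beginbfchar\n<0041> <0048>\n<0042> <0069>\nendbfchar"
table = parse_bfchar(cmap)
print(table[0x0041] + table[0x0042])  # Hi
```

An extractor that skips this step would emit the raw codes 0x0041 and 0x0042 instead of the intended "Hi", which is exactly the garbled-copy-paste symptom described above.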
Parsing Strategies
Choosing how to parse a PDF depends on what you need: plain text, structured layout, tables, or pixel-perfect rendering.
Text extraction APIs. The most reliable way to get text from native PDFs is to use a library that interprets the content stream and exposes text with positions. PDF.js (Mozilla) does this and is used in browsers and Node (e.g. via pdfjs-dist). You load the document, get each page, call page.getTextContent(), and receive items with str (the text string) and transform data (position, size). You can then concatenate strings or reconstruct lines/paragraphs using the Y coordinates. This approach respects fonts and encoding and is the same one used by DocLD’s PDF parser for native text. The transform matrix for each item gives you font size and position in PDF units; converting to a consistent coordinate system (e.g. top-left origin, pixels) lets you build bounding boxes for highlighting or citation. Many parsers also expose metadata such as font name and whether the item is part of a marked-content sequence, which can help with structure detection.
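Reconstructing lines from positioned items can be sketched independently of any library. This example assumes items shaped like (text, x, y) tuples, the kind of data you might derive from a PDF.js getTextContent() result (taking x and y from each item's transform):

```python
def items_to_lines(items, y_tolerance=2.0):
    """Group positioned text items into lines by their Y coordinate.

    Items whose baselines fall within y_tolerance of each other are
    bucketed into the same line; lines are emitted top to bottom
    (PDF user-space y grows upward), items left to right.
    """
    lines = {}
    for text, x, y in items:
        key = round(y / y_tolerance)  # bucket nearby baselines together
        lines.setdefault(key, []).append((x, text))
    out = []
    for key in sorted(lines, reverse=True):
        out.append(" ".join(t for _, t in sorted(lines[key])))
    return out

items = [("world", 120, 700), ("Hello", 50, 701), ("Footer", 50, 40)]
print(items_to_lines(items))  # ['Hello world', 'Footer']
```

Fixed-tolerance bucketing is a simplification; production layout code usually scales the tolerance with font size and handles rotated text separately.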
Layout and tables. If you need reading order, columns, or tables, you need layout analysis on top of text items. For structured output (pages, blocks, tables), try PDF to JSON. You might cluster by Y position, detect tables by alignment and lines (or by heuristics on whitespace), and output structured blocks (title, paragraph, table). Tools like pdfplumber (Python) and PyMuPDF (MuPDF) provide higher-level APIs that return tables and blocks. For maximum control, some pipelines render pages to images and use computer vision for table detection, then optionally merge with text from getTextContent().
When to use OCR. If a page has no or almost no extractable text (e.g. below ~50 characters for the whole page), treat it as scanned and run OCR. You can render the page to an image (PDF.js, Poppler, or similar) and send the image to an OCR engine (Tesseract, cloud vision APIs, or a vision model). DocLD uses a vision-based OCR path for scanned PDFs so that all pages can be indexed regardless of origin. Our image-to-text and PDF tools support both native and scanned content.
Incremental updates and hybrid PDFs. Some PDFs are built incrementally: a new version appends additional objects and a new xref and trailer at the end, leaving the previous version intact. A reader typically uses the last trailer (the one nearest end-of-file). When parsing, you get the current state; older revisions are still in the file but are not followed. This is useful for saving edits without rewriting the whole file, but it can make file size grow and can complicate strict validation.
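Since every incremental save appends its own xref, trailer, and %%EOF marker, counting those markers gives a rough revision count. A sketch (heuristic only: a %%EOF byte sequence could in principle also occur inside stream data):

```python
def count_revisions(data: bytes) -> int:
    """Rough revision count for an incrementally updated PDF.

    Each save appends a new body, xref section, and trailer ending
    in %%EOF, so counting the markers approximates revisions.
    """
    return data.count(b"%%EOF")

original = b"%PDF-1.4\n... objects ...\nstartxref\n100\n%%EOF\n"
updated = original + b"... new objects ...\nstartxref\n900\n%%EOF\n"
print(count_revisions(original), count_revisions(updated))  # 1 2
```

Tools like qpdf can rewrite such files into a single consolidated revision when the growth becomes a problem.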
Summary of approaches:
| Goal | Approach | Example tools |
|---|---|---|
| Plain text | Text extraction API | pdfjs-dist, PyMuPDF, Poppler pdftotext |
| Text + positions | getTextContent-style API | PDF.js, pdfjs-dist |
| Tables, layout | Layout analysis + optional CV | pdfplumber, PyMuPDF, Camelot |
| Scanned pages | Render to image + OCR | Tesseract, cloud vision, DocLD OCR |
| Full fidelity | Render to image or use viewer | PDF.js canvas, Poppler, MuPDF |
Security and Restrictions
PDF supports password protection and permission flags. Encryption was introduced in PDF 1.1 and refined in later versions. There are two passwords: user and owner. The user password (if set) is required to open the file. The owner password is used to set permissions and can be used to open the file without the user password if the application allows it.
Permissions are stored in the encryption dictionary and enforced by conforming readers. Common flags control: printing (low/high resolution), modifying content, copying text and graphics, adding annotations, filling forms, and assembling the document. These are enforced by convention: a compliant viewer must respect them, but the format does not cryptographically bind content to permissions. So "unlocking" typically means either knowing the owner password, using a tool that ignores restrictions (often by stripping the encryption dictionary or re-saving without encryption), or exploiting a bug. For parsing pipelines, encrypted PDFs usually need to be decrypted first (with the user password) before text extraction or OCR.
When building document processing systems, decide how to handle password-protected and restricted PDFs: reject them, request a password from the user, or use a secure decryption step in a controlled environment. Encryption can use the standard security handler (password-based) or a custom handler; over successive versions the standard handler has supported 40-bit and 128-bit RC4 and 128-bit and 256-bit AES. The permission flags are stored as the /P value in the encryption dictionary, the document content is encrypted with a key derived from the password, and the 256-bit AES handler additionally stores an encrypted copy of the permissions (/Perms) so a conforming reader can detect tampering.
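The /P entry is a signed 32-bit integer of permission bits. A sketch that decodes it into named flags; the bit positions here are taken from the standard security handler's description in ISO 32000 and should be verified against the spec before relying on them:

```python
# 1-based bit positions for /P (standard security handler).
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_accessible": 10,
    "assemble": 11,
    "print_high_res": 12,
}

def decode_permissions(p: int) -> dict:
    """Decode the signed 32-bit /P value into named boolean flags."""
    p &= 0xFFFFFFFF  # treat the signed value as unsigned 32-bit
    return {name: bool(p & (1 << (bit - 1))) for name, bit in PERMISSION_BITS.items()}

perms = decode_permissions(-4)  # -4: every permission bit set
print(perms["print"], perms["copy"])  # True True
```

Values in the wild are typically negative because the reserved high bits must be set, which is why the unsigned conversion step matters.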
Linearization (Fast Web View)
Linearization, also known as "Fast Web View," is defined in the PDF spec. A linearized PDF is structured so that the first page and essential objects (catalog, page tree, first-page resources) appear near the start of the file. The xref is placed early (often right after the first page’s data) so that a viewer can open the file, read the first page, and display it after downloading only a portion of the file. The rest of the document can be loaded on demand. This is intended for web delivery: the browser or plugin can show page one quickly while the rest streams.
Creating a linearized PDF requires reordering objects and writing a correct linearization dictionary and hint streams. Tools like qpdf --linearize or Acrobat’s "Optimize for fast web view" produce linearized files. Parsers do not need to special-case linearization for correctness—a valid linearized PDF is still a valid PDF—but they can use the linearization hints to optimize which bytes they read first if they only need the beginning of the document.
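Detecting Fast Web View from the client side can be done with a cheap sniff: the linearization dictionary (containing /Linearized) must appear in the file's first object, near the start. A heuristic sketch, not full validation of the hint streams:

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic check for Fast Web View.

    A linearized file carries a /Linearized dictionary in its first
    object near the start of the file, so sniffing the first KB is
    usually enough. Validating hints requires a real parser.
    """
    return b"/Linearized" in data[:1024]

sample = b"%PDF-1.5\n1 0 obj\n<< /Linearized 1 /L 5000 >>\nendobj\n..."
print(looks_linearized(sample))  # True
```

This is the same quick check tools use before deciding whether to serve a PDF with byte-range requests.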
Accessibility
PDFs can be tagged to support accessibility. A tagged PDF includes a structure tree that defines logical order (e.g. headings, paragraphs, lists, figures) and can associate alt text with figures and tables. Screen readers and other assistive technologies use this tree to present content in the right order and to announce images via their alt text. The PDF/UA (ISO 14289) standard defines requirements for accessible PDFs.
Many PDFs in the wild are not tagged. They may have been exported from an application that did not emit structure, or they may be scanned (image-only) with no text layer or tags. In those cases, the document is not accessible without adding a text layer (e.g. via OCR) and optionally tagging. So when you need accessible content, either generate tagged PDFs from the source or run a pipeline that adds structure and alt text (and, for scans, an OCR text layer). The structure tree uses standard role types (e.g. Document, Part, Sect, P, H1–H6, L, LBody, Figure, Formula), and a role map on the structure tree root can map custom types to standard ones so that assistive tech can infer semantics. Reading order follows the order of the structure tree.
Tooling and Ecosystem
Command-line and system libraries. Poppler provides pdftotext, pdfinfo, pdftoppm, and others; it is widely used on Linux and is the engine behind many GUI tools. qpdf is excellent for linearizing, decrypting, and inspecting PDF structure. Ghostscript can convert PostScript to PDF, merge PDFs, and apply various transformations. pdftk (and its successors) is often used for merging, splitting, and filling forms.
Libraries. PDF.js (Mozilla) is the engine in Firefox and is available as pdfjs-dist for Node and bundlers; it is a solid choice for cross-platform text extraction and rendering. PyMuPDF (MuPDF bindings for Python) is fast and exposes detailed layout and text. pdfplumber (Python) is built on pdfminer.six and adds table extraction and layout analysis. Apache PDFBox (Java) offers parsing, text extraction, and signing. pypdf (formerly PyPDF2) is pure Python and good for metadata, merging, and simple extraction but less accurate for complex layout.
Cloud and APIs. Many document APIs (e.g. from Adobe, AWS Textract, Google Document AI) accept PDFs and return text, entities, or structured data. They often combine native extraction with OCR and table detection. Use them when you want minimal infrastructure and can send data to a third party.
Choosing. For in-process parsing with minimal dependencies and good Unicode handling, PDF.js/pdfjs-dist is a strong default. For a quick way to try extraction without code, use our free PDF and document tools—including convert PDF to Markdown, convert PDF to text, and compare PDFs. For Python pipelines that need tables and blocks, pdfplumber or PyMuPDF are common. For quick CLI extraction, pdftotext (Poppler) is simple. For OCR on scanned pages, combine any renderer (to get an image) with Tesseract or a vision API. When evaluating a library, check whether it supports the PDF version you care about (e.g. 1.7 or 2.0), whether it correctly handles encrypted and linearized files, and whether it exposes text with positions (for layout and citation) or only raw text. Performance varies: native bindings (e.g. MuPDF) are often faster than pure-JS or pure-Python parsers for very large or complex documents.
Summary and Further Reading
PDF is a rich, standardized format: a header and trailer bracket a body of indirect objects and an xref table. Content is drawn by executing streams of operators; text is represented via fonts and encodings (and optionally CMaps for Unicode). Native PDFs can be parsed with text-extraction APIs; scanned PDFs require OCR. Encoding and font handling are the usual causes of bad copy-paste or extraction. Security is enforced by passwords and permission flags; linearization optimizes web viewing; tagged PDFs support accessibility.
For deeper dives, the ISO 32000 series (32000-1 and 32000-2) is the authoritative spec. The PDF Association (pdfa.org) publishes guides and best practices. Mozilla’s PDF.js documentation and source are useful for implementation details of text extraction and rendering. For OCR and document understanding, the documentation of Tesseract and major cloud document APIs is a good next step. If you work with archival or compliance, look into PDF/A and PDF/UA; if you generate PDFs programmatically, consider libraries that emit well-formed structure (and optionally tags) so that your outputs are both viewable and accessible. For document intelligence at scale, see our pricing and document processing cost calculator. More terms are in our glossary (e.g. parsing, file format). A practical checklist when integrating PDF parsing:
- Detect encrypted files and handle passwords or reject them.
- Use a library that respects font encoding and ToUnicode for correct text.
- Decide per page whether to use text extraction or OCR, based on extracted character count or a similar heuristic.
- Validate or sanitize output if you feed it to search indexes or downstream ML.
Frequently Asked Questions
Why does text copied from a PDF come out garbled?
Usually the font encoding or CMap is not being applied correctly. Native PDFs store character codes, not Unicode code points. The viewer must use the font’s encoding (WinAnsi, Identity-H, etc.) and any ToUnicode CMap to convert those codes to Unicode. If your pipeline uses a library that does this (e.g. PDF.js getTextContent), you get correct text; low-level or custom parsers that ignore encoding will produce wrong characters.
What is PDF/A and how does it differ from regular PDF?
PDF/A is a family of standards (ISO 19005) for long-term archiving. PDF/A documents restrict certain features (e.g. no JavaScript, no external references, embedded fonts required) so that the file remains viewable and predictable for many years. PDF/A-1, PDF/A-2, and PDF/A-3 differ in which PDF features they allow. Regular PDFs can use any feature allowed by ISO 32000 and may not be suitable for strict archival.
When should I use text extraction versus OCR?
Use text extraction (e.g. getTextContent) when the page has native text—content streams with text operators and fonts. Use OCR when the page is (or is effectively) just an image: scanned documents, screenshots, or PDFs where the creator embedded only images. A simple heuristic is to extract text first; if the number of characters per page is below a threshold (e.g. 50), treat the page as image-only and run OCR.
Can I parse PDFs in the browser?
Yes. PDF.js is designed for the browser and is used by Firefox. You can load a PDF from a URL or ArrayBuffer, get pages, and call getTextContent() or render to a canvas. The same library (pdfjs-dist) works in Node.js for server-side parsing. So one codebase can support both environments if you use PDF.js.
What does linearization actually do?
Linearization reorders the PDF so that the first page and the objects needed to display it appear near the start of the file. A client can then display page one after downloading only the first chunk of bytes, and fetch the rest on demand. It improves perceived load time for PDFs served over the web. Parsers do not need to treat linearized PDFs differently for correctness; they are still valid PDFs.