Structured Extraction in DocLD — Schemas, Jobs, and Corrections
Turning PDFs and other documents into structured data—invoices into JSON, forms into fields, contracts into key terms—is at the core of DocLD's extraction feature. You define what you want (a schema), run an extraction job, and then review or correct the results. This post walks through how schemas, jobs, and corrections work so you can use the Extract API and the dashboard with confidence.
Defining extraction schemas
A schema describes the shape of the data you want to extract: field names, types (string, number, boolean, date, array, object), whether each field is required, and optional descriptions to guide the model. You can also add high-level instructions (e.g. “Extract all line items; if tax is separate, capture it as its own field”).
In DocLD you can:
- Create custom schemas in the dashboard or via the API: name, description, instructions, and a `fields` array. Schemas are stored per organization and can be shared or kept private.
- Use prebuilt schemas for common document types (invoices, receipts, contracts, resumes, W-9, etc.) via the prebuilt schemas API or the schema picker in the UI.
- Generate a schema from a sample document (`POST /api/extract/schemas/generate`) or from natural language (`POST /api/extract/schemas/generate-from-description`).
- Skip schemas entirely with zero-shot extraction: send a `description` (e.g. “company name, all dates, every dollar amount”) to `POST /api/extract/run` and the service infers a schema and runs extraction in one shot.
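The field structure described above can be sketched in code. This is a minimal, illustrative payload, not DocLD's exact wire format: the helper name, the specific invoice fields, and the assumption that each entry in `fields` is an object with `name`, `type`, `required`, and `description` keys are all assumptions based on the concepts in this post.

```python
# Allowed field types, per the schema description above.
ALLOWED_TYPES = {"string", "number", "boolean", "date", "array", "object"}

def make_field(name, type_, required=False, description=None):
    """Build one entry for the schema's `fields` array (shape is assumed)."""
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"unsupported field type: {type_}")
    field = {"name": name, "type": type_, "required": required}
    if description:
        field["description"] = description
    return field

# A hypothetical invoice schema using the pieces this post names:
# name, description, high-level instructions, and a fields array.
invoice_schema = {
    "name": "Invoice",
    "description": "Key fields from vendor invoices",
    "instructions": "Extract all line items; if tax is separate, "
                    "capture it as its own field.",
    "fields": [
        make_field("vendor_name", "string", required=True),
        make_field("invoice_date", "date", required=True),
        make_field("total", "number", required=True,
                   description="Grand total including tax"),
        make_field("line_items", "array"),
    ],
}
```

Field descriptions are worth filling in even when they feel redundant; per the post, they guide the model during extraction.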
Schema settings (e.g. vision scope, array extraction mode, agentic extraction) control how the pipeline behaves—for example, whether to use the full document or first page for vision, and whether to run the multi-step agentic pipeline for complex documents with tables or many fields. The extraction docs describe field types and options in detail.
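As a rough illustration of what such a settings block might look like: only `agenticExtractionMode` is actually named in this post; the other key names and every value below are hypothetical placeholders, not DocLD's real option names.

```python
# Hypothetical schema settings fragment. Only `agenticExtractionMode`
# appears in the Extract API docs referenced in this post; the other
# keys and all values here are illustrative placeholders.
schema_settings = {
    "visionScope": "first_page",        # assumed; vs. "full_document"
    "arrayExtractionMode": "per_item",  # assumed name and value
    "agenticExtractionMode": "auto",    # real option name, assumed value
}
```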
Running extraction jobs
Once you have a schema (or a description for zero-shot), you run an extraction job against a document. Under the hood, DocLD creates a job record, runs the extraction pipeline (chunking/vision as needed, LLM calls per field or region, validation), and stores the result as an extraction linked to that job.
- Single document: In the UI, open the Extract page, pick a document and schema (or “Describe what you want”), and run. Via API, call `POST /api/extract/run` with `document_id` and `schema_id` (or `config`, or `description`). The response includes `data`, `field_results`, confidence scores, and optional citations.
- Streaming: Send `Accept: text/event-stream` or `?stream=1` with the same payload to get Server-Sent Events: `progress`, `field_extracted`, `chunk_done`, and finally `complete` with the full result. Useful for long documents or showing fields as they arrive; closing the connection cancels the job.
- Batch: Use `POST /api/extract/batch` with `document_ids` and `schema_id` to process many documents. You can run immediately or queue for later and check status via the batch API.
- Agentic extraction: For complex layouts (tables, many fields), enable agentic extraction in schema settings or in the run config. The pipeline identifies regions (e.g. headers, tables, form blocks), extracts per region with the same schema, then merges and validates. See the Extract API for `settings.agenticExtractionMode` and related options.
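A request to `POST /api/extract/run` can be sketched as below. The base URL is a placeholder, and the rule that exactly one of `schema_id`, `config`, or `description` must be supplied is an assumption inferred from the "or" phrasing above; check the Extract API reference for the actual contract.

```python
BASE_URL = "https://api.docld.example"  # placeholder host, not a real endpoint

def build_run_request(document_id, schema_id=None, config=None,
                      description=None, stream=False):
    """Return (url, headers, payload) for a single-document extraction run."""
    chosen = [x for x in (schema_id, config, description) if x is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of schema_id, config, description")
    payload = {"document_id": document_id}
    if schema_id is not None:
        payload["schema_id"] = schema_id
    elif config is not None:
        payload["config"] = config
    else:
        payload["description"] = description
    headers = {"Content-Type": "application/json"}
    if stream:
        # Per the post, this header switches the response to Server-Sent
        # Events (progress, field_extracted, chunk_done, complete).
        headers["Accept"] = "text/event-stream"
    return f"{BASE_URL}/api/extract/run", headers, payload
```

From here you would POST the payload with your HTTP client of choice; for the streaming variant, keep the connection open and read events until `complete` (or close it early to cancel the job, as noted above).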
Each extraction stores field results with values, confidence, and optional evidence (source text, page, bounding box, chunk reference) so you can trace where a value came from—important for compliance and debugging.
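One practical use of the per-field confidence and evidence is routing weak fields to human review. A small sketch, assuming `field_results` arrives as a mapping of field name to a result object with `value`, `confidence`, and optional `evidence` keys (the exact response layout may differ):

```python
def fields_needing_review(field_results, threshold=0.8):
    """Return (field_name, value, evidence) for fields below the threshold."""
    flagged = []
    for name, result in field_results.items():
        if result.get("confidence", 0.0) < threshold:
            flagged.append((name, result.get("value"), result.get("evidence")))
    return flagged

# Hypothetical response fragment: one confident field, one weak one
# with evidence pointing back at the source page and text.
example_results = {
    "vendor_name": {"value": "Acme Corp", "confidence": 0.97},
    "total": {
        "value": 1042.50,
        "confidence": 0.55,
        "evidence": {"page": 2, "source_text": "TOTAL DUE 1,042.50"},
    },
}
```

The evidence object is what makes the review step fast: a reviewer can jump straight to the cited page and text instead of rereading the whole document.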
Fixing results: corrections UI and API
Extractions are not always perfect. DocLD lets you correct field values and persist those corrections.
- In the UI: On the document or extraction view, edit a field value and save. The correction is stored and associated with that extraction and field.
- Via API: `POST /api/extract/corrections` adds a single correction (extraction id, field id, corrected value). `PUT /api/extract/corrections` applies multiple corrections in one request. Corrections are stored in the extraction’s correction history.
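The two request bodies can be sketched as follows. The post names the pieces (extraction id, field id, corrected value) but not the JSON key names, so the keys below are assumptions:

```python
def single_correction(extraction_id, field_id, value):
    """Body for POST /api/extract/corrections: one corrected field."""
    return {"extraction_id": extraction_id, "field_id": field_id, "value": value}

def bulk_corrections(extraction_id, fixes):
    """Body for PUT /api/extract/corrections: many corrections at once.

    `fixes` maps field id to corrected value.
    """
    return {
        "extraction_id": extraction_id,
        "corrections": [{"field_id": f, "value": v} for f, v in fixes.items()],
    }
```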
Corrections are used for:
- Display: The corrected value is what you see as the “current” value for that field.
- Quality and ground truth: You can set ground truth for a document/schema and compare extraction results (and corrections) against it for accuracy metrics. Correction history is available for auditing.
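The ground-truth comparison above amounts to a simple overlay-and-score computation. A minimal sketch, assuming both the extraction and the ground truth reduce to flat field-to-value mappings and that a corrected value replaces the raw one:

```python
def apply_corrections(extracted, corrections):
    """Overlay corrected values on the raw extraction (corrections win)."""
    merged = dict(extracted)
    merged.update(corrections)
    return merged

def field_accuracy(extracted, ground_truth):
    """Fraction of ground-truth fields the extraction matched exactly."""
    if not ground_truth:
        return 1.0
    hits = sum(1 for k, v in ground_truth.items() if extracted.get(k) == v)
    return hits / len(ground_truth)
```

Real accuracy metrics would likely normalize values (dates, currency) before comparing; exact equality is the simplest possible scoring rule.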
So the loop is: run extraction → review field results and confidence → correct where needed (UI or API) → export or feed into workflows. For full endpoint details and request shapes, see the Extract API reference and the Extraction feature docs.