Data Extraction
Extract structured data from documents using schemas and AI-powered field detection.
Overview
DocLD’s extraction feature lets you:
- Define JSON schemas for your data structure
- Use AI to detect and extract field values
- Validate extractions with confidence scores
- Correct results manually or automatically
- Batch process multiple documents
- Compare extractions against ground truth
How It Works
Document → Schema → AI Extraction → Field Results → Validation
- Upload document or reference existing
- Apply extraction schema
- AI extracts field values with confidence
- Review and correct if needed
- Export or use in workflows
Extraction Schemas
Creating Schemas
Define what data to extract:
{
  "name": "Invoice",
  "description": "Extract invoice data",
  "instructions": "Extract all line items and totals",
  "fields": [
    {
      "name": "invoice_number",
      "type": "string",
      "description": "The invoice number",
      "required": true
    },
    {
      "name": "total_amount",
      "type": "number",
      "description": "Total amount due",
      "required": true
    },
    {
      "name": "line_items",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "price": { "type": "number" }
        }
      }
    }
  ]
}
Field Types
| Type | Description | Example |
|---|---|---|
| string | Text value | "INV-001" |
| number | Numeric value | 1250.00 |
| boolean | True/false | true |
| date | Date value | "2024-01-15" |
| array | List of items | [{...}, {...}] |
| object | Nested object | {"name": "..."} |
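Before uploading a schema, it can help to check locally that every field uses one of the types listed above. The following is a minimal client-side sketch, not part of DocLD itself; the function name `check_schema` and the wording of its messages are hypothetical, and the server may apply additional rules that are not documented here.

```python
from typing import Any

# Field types listed in the Field Types table.
ALLOWED_TYPES = {"string", "number", "boolean", "date", "array", "object"}


def check_schema(schema: dict[str, Any]) -> list[str]:
    """Return a list of problems found in the schema; an empty list means none."""
    problems: list[str] = []
    for index, field in enumerate(schema.get("fields", [])):
        name = field.get("name")
        if not name:
            # Without a name the field cannot be referenced in results.
            problems.append(f"fields[{index}]: missing name")
            continue
        field_type = field.get("type")
        if field_type not in ALLOWED_TYPES:
            problems.append(f"{name}: unsupported type {field_type!r}")
    return problems
```

Calling `check_schema` on the Invoice schema shown earlier should return an empty list, since all three fields use `string`, `number`, or `array`.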
Schema Instructions
Add instructions to guide extraction:
{
  "instructions": "Extract invoice data carefully. For line items, include all items even if they span multiple pages. If tax is shown separately, extract it as a separate field."
}
Running Extractions
Single Document
curl -X POST "/api/extract/run" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "document_id": "doc-uuid",
    "schema_id": "invoice-schema",
    "include_citations": true
  }'
Extract anything (zero-shot)
For ad-hoc use cases you can skip defining a schema. On the Extract page, choose Describe what you want, upload a document, and type what to extract in plain language (e.g. company name, all dates, and every dollar amount). The system infers a schema and runs extraction in one shot. No schema library or field builder required.
Via API, send the same POST /api/extract/run with a description field instead of schema_id or config. The description must be at least 10 characters. You cannot combine description with schema_id or config.
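The request rules above (a description of at least 10 characters, never combined with schema_id or config) can be enforced on the client before the call is made. This is a minimal sketch under stated assumptions: `build_run_payload` is a hypothetical helper, the requirement that at least one of the three inputs is present is an assumption, and whether schema_id and config may be sent together is not stated in this page, so the sketch simply allows it.

```python
from typing import Any, Optional

MIN_DESCRIPTION_LENGTH = 10  # minimum stated in the text above


def build_run_payload(
    document_id: str,
    *,
    schema_id: Optional[str] = None,
    config: Optional[dict[str, Any]] = None,
    description: Optional[str] = None,
) -> dict[str, Any]:
    """Build the JSON body for POST /api/extract/run, rejecting invalid combinations."""
    payload: dict[str, Any] = {"document_id": document_id}

    if description is not None:
        # Zero-shot mode: description replaces schema_id and config.
        if schema_id is not None or config is not None:
            raise ValueError("description cannot be combined with schema_id or config")
        if len(description) < MIN_DESCRIPTION_LENGTH:
            raise ValueError("description must be at least 10 characters")
        payload["description"] = description
        return payload

    # Assumption: a request needs at least one of schema_id, config, description.
    if schema_id is None and config is None:
        raise ValueError("provide schema_id, config, or description")
    if schema_id is not None:
        payload["schema_id"] = schema_id
    if config is not None:
        payload["config"] = config
    return payload
```

For example, `build_run_payload("doc-uuid", description="company name and all dates")` yields a body with only `document_id` and `description`, which can then be serialized and sent with any HTTP client.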
Agentic extraction (complex docs)
For documents with tables, many fields, or complex layout, enable Agentic extraction in Settings. DocLD runs a multi-step pipeline:
- Identify regions – Segment the document into headers, tables, form blocks, and body (using layout metadata when available, or an LLM pass).
- Extract by region – Run extraction on each region with the same schema.
- Merge and validate – Combine results (scalars: first non-null; arrays: merged and deduped), then optionally validate them.
This goes beyond a single prompt + schema and improves accuracy on complex layouts. Toggle “Agentic extraction (complex docs)” in the Extract settings or set config.settings.agenticExtractionMode: true in the API.
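The merge rule in the last step (scalars: first non-null; arrays: merged and deduped) can be illustrated as follows. This is only a sketch of that rule, not DocLD's actual implementation: the function name `merge_region_results` is hypothetical, and deduplicating array items by their canonical JSON form is an assumption about what "deduped" means.

```python
import json
from typing import Any


def merge_region_results(region_results: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-region extraction results in region order."""
    merged: dict[str, Any] = {}
    seen_items: dict[str, set[str]] = {}

    for result in region_results:
        for field, value in result.items():
            if isinstance(value, list):
                # Arrays: concatenate across regions, skipping repeated items.
                if not isinstance(merged.get(field), list):
                    merged[field] = []
                    seen_items[field] = set()
                for item in value:
                    # Canonical JSON makes key order irrelevant when comparing items.
                    key = json.dumps(item, sort_keys=True)
                    if key not in seen_items[field]:
                        seen_items[field].add(key)
                        merged[field].append(item)
            elif merged.get(field) is None:
                # Scalars: keep the first non-null value seen for the field.
                merged[field] = value
    return merged
```

With this rule, a header region that returns `invoice_number: null` does not block a later region that finds the real value, while a table that spans two regions contributes each distinct row once.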
Streaming extraction (SSE)
For long documents or when you want progressive feedback, use streaming: send Accept: text/event-stream (or ?stream=1) with the same request body. The response is a Server-Sent Events stream:
| Event | Description |
|---|---|
| progress | Progress percentage and optional message (e.g. “Extracting chunk 2/5…”) |
| field_extracted | One field has been extracted; payload includes field (fieldId, fieldName, value, confidence, etc.) |
| chunk_done | One chunk or page is done (array/chunked mode); payload includes chunk_index and optional partial_data |
| complete | Final event with full result (same shape as the non-streaming JSON response) |
| error | Something went wrong (e.g. “Cancelled”) |
You can show fields as they arrive in the UI. Closing the connection (e.g. user clicks Cancel) cancels the extraction on the server; the job is marked failed with error “Cancelled”.
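A client can consume the stream by parsing the standard Server-Sent Events framing (`event:` and `data:` lines, with a blank line ending each event). The sketch below assumes each `data:` payload is a JSON object and that `field_extracted` payloads nest their values under a `field` key as described in the table; the helper names `iter_sse_events` and `collect_fields` are hypothetical, and the exact shape of the `error` payload is not specified here.

```python
import json
from typing import Any, Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, Any]]:
    """Yield (event_name, parsed_json_payload) pairs from raw SSE lines."""
    event = "message"  # SSE default when no event: line is given
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))


def collect_fields(lines: Iterable[str]) -> dict[str, Any]:
    """Collect field values as they arrive, stopping at the complete event."""
    fields: dict[str, Any] = {}
    for event, payload in iter_sse_events(lines):
        if event == "field_extracted":
            field = payload["field"]
            fields[field["fieldName"]] = field["value"]
        elif event == "error":
            raise RuntimeError(f"extraction failed: {payload}")
        elif event == "complete":
            break
    return fields
```

`lines` can be any iterable of text lines, for example the line iterator of a streaming HTTP response. A UI would typically update its display inside the `field_extracted` branch instead of accumulating a dictionary.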
Batch Extraction
Process multiple documents:
curl -X POST "/api/extract/batch" \
  -d '{
    "name": "January Invoices",
    "document_ids": ["doc1", "doc2", "doc3"],
    "schema_id": "invoice-schema",
    "run_immediately": true
  }'
Results
Field Results
{
  "data": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [...]
  },
  "field_results": [
    {
      "field": "invoice_number",
      "value": "INV-2024-001",
      "confidence": 0.98,
      "evidence": {
        "text": "Invoice #: INV-2024-001",
        "page": 1,
        "bbox": { "x": 0.1, "y": 0.05, "width": 0.4, "height": 0.03 },
        "confidence": 0.98,
        "chunk_id": "chunk-uuid",
        "span": { "start": 120, "end": 156 }
      },
      "citations": [
        {
          "text": "Invoice #: INV-2024-001",
          "page": 1,
          "bbox": { "x": 0.1, "y": 0.05, "width": 0.4, "height": 0.03 },
          "confidence": "high"
        }
      ]
    }
  ],
  "overall_confidence": 0.95
}
Confidence Scores
| Score | Meaning | Action |
|---|---|---|
| 0.9+ | High confidence | Usually accurate |
| 0.7-0.9 | Medium confidence | Spot check |
| < 0.7 | Low confidence | Manual review |
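The thresholds in the table can be applied to `field_results` to decide which fields a reviewer should look at. This is a minimal sketch: the function names and the returned labels are hypothetical, and treating exactly 0.9 as high and exactly 0.7 as medium is an assumption about how the band boundaries are meant to be read.

```python
from typing import Any


def triage(confidence: float) -> str:
    """Map a confidence score to the action suggested by the table."""
    if confidence >= 0.9:
        return "accept"         # high confidence, usually accurate
    if confidence >= 0.7:
        return "spot_check"     # medium confidence
    return "manual_review"      # low confidence


def fields_needing_review(field_results: list[dict[str, Any]]) -> list[str]:
    """Return the names of fields whose confidence falls in the low band."""
    return [
        result["field"]
        for result in field_results
        if triage(result["confidence"]) == "manual_review"
    ]
```

The returned field names can be fed into a review queue, with any fixes submitted through the corrections endpoints.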
Evidence and Citations
Each field can include a structured evidence object for traceability and compliance (“show me where this number came from”). Evidence supports agentic and document-grounded systems that need precise references.
| Field | Description |
|---|---|
| text | Source text snippet from the document |
| page | Page number (1-indexed) |
| bbox | Bounding box (normalized 0–1: x, y, width, height) |
| confidence | Numeric confidence (0–1) |
| chunk_id | Chunk reference when available |
| span | Optional start/end offset in parsed text for agentic use |
Citations remain available for backward compatibility and follow the same structure.
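To highlight evidence on a rendered page, the normalized bbox has to be scaled to the page image's pixel size. The sketch below assumes that x and y give the top-left corner of the box and that the origin is the top-left of the page; neither is stated above, so verify against a known document before relying on it. The function name `bbox_to_pixels` is hypothetical.

```python
def bbox_to_pixels(
    bbox: dict[str, float], page_width: int, page_height: int
) -> tuple[int, int, int, int]:
    """Convert a normalized bbox to (left, top, right, bottom) pixel coordinates."""
    left = round(bbox["x"] * page_width)
    top = round(bbox["y"] * page_height)
    right = round((bbox["x"] + bbox["width"]) * page_width)
    bottom = round((bbox["y"] + bbox["height"]) * page_height)
    return left, top, right, bottom
```

The resulting tuple is in the left/top/right/bottom order that most image libraries accept for drawing rectangles.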
Manual Corrections
Adding Corrections
Fix extraction errors:
curl -X POST "/api/extract/corrections" \
  -d '{
    "extraction_id": "ext-uuid",
    "field_id": "total_amount",
    "corrected_value": 1300.00
  }'
Bulk Corrections
curl -X PUT "/api/extract/corrections" \
  -d '{
    "extraction_id": "ext-uuid",
    "corrections": [
      {"field_id": "total_amount", "corrected_value": 1300.00},
      {"field_id": "due_date", "corrected_value": "2024-02-01"}
    ]
  }'
Ground Truth
Set verified values for accuracy measurement:
curl -X POST "/api/extract/ground-truth" \
  -d '{
    "document_id": "doc-uuid",
    "schema_id": "schema-uuid",
    "field_values": {
      "invoice_number": "INV-001",
      "total_amount": 1250.00
    }
  }'
Compare extractions against ground truth:
curl -X POST "/api/extract/ground-truth/compare" \
  -d '{
    "document_id": "doc-uuid",
    "field_results": [...],
    "fields": [...]
  }'
Prebuilt Schemas
Use prebuilt schemas for common document types:
| Category | Schemas |
|---|---|
| Invoice | Invoice, Receipt, Purchase Order |
| Contract | Contract, NDA, Agreement |
| Resume | Resume, CV |
| Financial | Bank Statement, Tax Form |
| Form | W-9, W-4, 1099 |
curl -X GET "/api/extract/schemas/prebuilt?category=invoice"
Schema Generation
From Document
Generate schema by analyzing a sample:
curl -X POST "/api/extract/schemas/generate" \
  -d '{"document_id": "sample-doc-uuid"}'
From Description
Generate from natural language:
curl -X POST "/api/extract/schemas/generate-from-description" \
  -d '{"description": "Extract invoice number, vendor name, line items with description, quantity, and price, and the total amount"}'
Suggested Schema
Get a suggested prebuilt schema for a document (from pipeline classification or on-demand detection):
curl -X GET "/api/extract/suggested-schema?document_id=doc-uuid"
Response:
{
  "suggestedSchema": {
    "id": "prebuilt-invoice",
    "name": "Invoice",
    "formType": "invoice",
    "category": "financial",
    "fieldCount": 12
  }
}
Batch Comparison
Compare extractions across documents:
curl -X GET "/api/extract/batch/{id}/comparison?only_discrepancies=true"
View:
- Field values across documents
- Statistical variance
- Outliers and anomalies
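To make the variance and outlier views concrete, the sketch below computes them for one numeric field across a batch. How DocLD itself calculates these is not documented here; the z-score test, the default threshold of 2.0, and the function name `field_stats` are illustrative choices only.

```python
from statistics import mean, pstdev, pvariance
from typing import Any


def field_stats(
    values_by_document: dict[str, float], z_threshold: float = 2.0
) -> dict[str, Any]:
    """Summarize one numeric field across documents and flag outliers by z-score.

    Requires at least one value; statistics.mean raises on empty input.
    """
    values = list(values_by_document.values())
    average = mean(values)
    spread = pstdev(values)  # population standard deviation
    outliers = [
        document_id
        for document_id, value in values_by_document.items()
        # With zero spread every value is identical, so nothing is an outlier.
        if spread > 0 and abs(value - average) / spread > z_threshold
    ]
    return {"mean": average, "variance": pvariance(values), "outliers": outliers}
```

For small batches a z-score threshold is a blunt instrument, because a single extreme value inflates the standard deviation it is measured against; a median-based measure may be a better fit when only a handful of documents are compared.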
CLI Support
Extract from command line:
# Extract with schema
docld extract invoice.pdf -s schemas/invoice.json
# Extract multiple files
docld extract ./invoices -s schemas/invoice.json
# Include citations
docld extract invoice.pdf -s schema.json --citations
Credit Usage
| Operation | Credits |
|---|---|
| Extraction | 2.0 per extraction |
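As a rough budgeting aid, the rate above can be turned into an estimate. This sketch assumes that each document in a batch counts as one extraction, which this page does not state explicitly; the function name `estimate_credits` is hypothetical.

```python
CREDITS_PER_EXTRACTION = 2.0  # rate from the Credit Usage table


def estimate_credits(extraction_count: int) -> float:
    """Estimate credits for a number of extractions at the listed rate."""
    if extraction_count < 0:
        raise ValueError("extraction_count must not be negative")
    return CREDITS_PER_EXTRACTION * extraction_count
```

Under that assumption, the three-document batch from the Batch Extraction example would cost `estimate_credits(3)`, that is 6.0 credits.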
API Reference
See the Extract API for complete endpoint documentation.