Extract API
Extract structured data from documents using schemas. Define the fields you want, and DocLD extracts them with confidence scores. See Data Extraction for concepts and Custom Extraction for schema design. For a single request/response without streaming, use the Embed API with action: 'extract'.
Run Extraction
POST /api/extract/run
Extract data from a document using a schema or custom configuration.
Input Options
| Input Type | Format | Description |
|---|---|---|
| Document ID | {"document_id": "uuid"} | Extract from document |
| DocLD reference | {"input": "docld://..."} | Extract from reference |
| Job ID | {"input": "jobid://..."} | Reuse parsed content |
| URL | {"input": "https://..."} | Fetch and extract |
Request
{
"document_id": "abc123",
"schema_id": "schema-uuid",
"include_citations": true
}
You can enable agentic extraction for complex documents (e.g. with tables or many fields). In your schema settings or inline config, set settings.agenticExtractionMode: true. The service will then run a multi-step pipeline: identify regions (headers, tables, form blocks) → extract per region → merge and validate. Use it when single-pass or array extraction is insufficient.
Or with inline configuration:
{
"input": "docld://abc123",
"config": {
"fields": [
{
"name": "invoice_number",
"type": "string",
"description": "The invoice number"
},
{
"name": "total_amount",
"type": "number",
"description": "Total amount due"
},
{
"name": "line_items",
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "number" },
"price": { "type": "number" }
}
}
}
]
},
"include_citations": true
}
Zero-shot (description only): You can omit both schema_id and config and pass a natural-language description of what to extract. The service infers a schema and runs extraction in one shot. Provide exactly one of schema_id, config, or description. Minimum length for description is 10 characters.
{
"document_id": "abc123",
"description": "Extract company name, all dates, and every dollar amount",
"include_citations": true
}
The response shape is the same as when using a schema or inline config. Zero-shot is single-document only; batch extraction requires a schema.
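The "exactly one of schema_id, config, or description" rule and the 10-character minimum can be checked on the client before sending a request. The following is a minimal sketch; build_extract_payload is a hypothetical helper name, not part of DocLD, and the error messages are illustrative only.

```python
from typing import Any, Optional

MIN_DESCRIPTION_LENGTH = 10  # minimum length stated in the text above


def build_extract_payload(
    document_id: str,
    schema_id: Optional[str] = None,
    config: Optional[dict] = None,
    description: Optional[str] = None,
    include_citations: bool = True,
) -> dict[str, Any]:
    """Build a body for POST /api/extract/run (hypothetical helper)."""
    sources = {"schema_id": schema_id, "config": config, "description": description}
    provided = {key: value for key, value in sources.items() if value is not None}

    # The docs require exactly one extraction source per request.
    if len(provided) != 1:
        raise ValueError("provide exactly one of schema_id, config, or description")

    # Zero-shot descriptions must be at least 10 characters long.
    if description is not None and len(description) < MIN_DESCRIPTION_LENGTH:
        raise ValueError("description must be at least 10 characters")

    return {"document_id": document_id, **provided, "include_citations": include_citations}
```

Calling build_extract_payload("abc123", description="Extract company name, all dates, and every dollar amount") yields the zero-shot body shown above, while passing both schema_id and config raises ValueError before any network call is made.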
Response
{
"success": true,
"job_id": "job-uuid",
"extraction_id": "extraction-uuid",
"document_id": "abc123",
"data": {
"invoice_number": "INV-2024-001",
"total_amount": 1250.00,
"line_items": [
{
"description": "Widget A",
"quantity": 10,
"price": 50.00
},
{
"description": "Widget B",
"quantity": 5,
"price": 150.00
}
]
},
"field_results": [
{
"field": "invoice_number",
"value": "INV-2024-001",
"confidence": 0.98,
"evidence": {
"text": "Invoice Number: INV-2024-001",
"page": 1,
"bbox": { "x": 0.1, "y": 0.05, "width": 0.4, "height": 0.03 },
"confidence": 0.98,
"chunk_id": "chunk-uuid",
"span": { "start": 120, "end": 156 }
},
"citations": [
{
"text": "Invoice Number: INV-2024-001",
"bbox": { "page": 1, "left": 0.1, "top": 0.05, "width": 0.4, "height": 0.03 },
"confidence": "high"
}
]
}
],
"overall_confidence": 0.95,
"processing_time": 1.2,
"usage": {
"num_pages": 2,
"credits": 2.0
}
}
When include_citations is true, each field includes an evidence object (primary traceability) and a citations array (backward compatibility). Evidence provides text, page, bbox, confidence, optional chunk_id, and optional span (start/end offset in parsed text) for agentic and document-grounded use cases.
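One way to use the per-field confidence and evidence is to flag low-confidence fields for human review. The sketch below works on an already-parsed response dictionary with the shape shown above; the function name and the 0.9 threshold are illustrative assumptions, not DocLD defaults.

```python
def fields_needing_review(response: dict, threshold: float = 0.9) -> list[dict]:
    """Return fields whose confidence is below the threshold, with their evidence."""
    flagged = []
    for result in response.get("field_results", []):
        confidence = result.get("confidence", 0.0)
        if confidence < threshold:
            # evidence is only present when include_citations was true
            evidence = result.get("evidence") or {}
            flagged.append(
                {
                    "field": result["field"],
                    "value": result.get("value"),
                    "confidence": confidence,
                    "page": evidence.get("page"),
                    "text": evidence.get("text"),
                }
            )
    return flagged
```

Each returned entry carries the page and source text from evidence, so a reviewer can jump to the relevant location in the document.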
Example
curl -X POST "https://your-domain.com/api/extract/run" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"document_id": "abc123",
"schema_id": "invoice-schema",
"include_citations": true
}'
Streaming (SSE)
To receive progressive updates (field-by-field or chunk-by-chunk), use streaming:
- Query: POST /api/extract/run?stream=1, or send the header Accept: text/event-stream
- Request body is unchanged. The response is a stream of Server-Sent Events.
Event types:
| Type | Payload | Description |
|---|---|---|
| progress | progress (0–100), optional progress_message | Coarse or per-chunk progress |
| field_extracted | field (fieldId, fieldName, value, confidence, etc.) | One field extracted; UI can show it immediately |
| chunk_done | chunk_index, optional partial_data | One chunk/page done (array extraction mode) |
| complete | job, result | Final event; result has the same shape as the non-streaming JSON response |
| error | error (string) | Failure or cancellation message |
Cancel: Closing the connection (e.g. AbortController.abort() on the client) cancels the extraction. The server marks the job as failed with error "Cancelled".
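A client-side sketch of consuming the stream is shown below. It parses standard Server-Sent Events framing (data: lines terminated by a blank line) and dispatches on the event types in the table. It assumes each event's data is a JSON object carrying a "type" key with one of those values; the exact wire format is not specified here, so treat that key and both function names as hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per SSE event from an iterable of text lines."""
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # SSE comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips a single leading space
        if field == "data":
            data_lines.append(value)
    # An event without a terminating blank line is discarded, per SSE framing.


def consume_extraction_stream(events: Iterable[dict]) -> tuple[dict, dict]:
    """Collect fields as they arrive and return (partial_fields, final_result)."""
    partial: dict = {}
    for event in events:
        kind = event.get("type")  # assumption: type is carried inside the JSON payload
        if kind == "field_extracted":
            field = event["field"]
            partial[field["fieldName"]] = field["value"]
        elif kind == "complete":
            return partial, event["result"]
        elif kind == "error":
            raise RuntimeError(event["error"])
    raise RuntimeError("stream ended before a complete event")
```

With an HTTP client that exposes the response body line by line, the decoded lines can be passed straight to iter_sse_events; closing that response early corresponds to the cancellation behaviour described above.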
Extraction Schemas
List Schemas
GET /api/extract/schemas
List available extraction schemas.
Query Parameters:
| Parameter | Description |
|---|---|
| organization | Include organization schemas |
| organization_only | Only organization schemas |
| organization_id | Filter by organization |
Response:
[
{
"id": "uuid",
"name": "Invoice",
"description": "Extract invoice data",
"fields": [...],
"is_shared": false,
"created_at": "2024-01-15T10:00:00Z"
}
]
Create Schema
POST /api/extract/schemas
Create a new extraction schema.
Request:
{
"name": "Invoice",
"description": "Extract invoice data",
"instructions": "Extract all line items and totals",
"fields": [
{
"name": "invoice_number",
"type": "string",
"description": "The invoice number",
"required": true
},
{
"name": "total_amount",
"type": "number",
"description": "Total amount due",
"required": true
},
{
"name": "due_date",
"type": "date",
"description": "Payment due date"
}
],
"settings": {
"model": "gpt-4o",
"temperature": 0
},
"is_shared": false
}
Field Types
| Type | Description | Example |
|---|---|---|
| string | Text value | "INV-001" |
| number | Numeric value | 1250.00 |
| boolean | True/false | true |
| date | Date value | "2024-01-15" |
| array | List of items | [{...}, {...}] |
| object | Nested object | {"name": "...", "value": ...} |
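The field types map naturally onto JSON value types, which makes it straightforward to sanity-check extracted data on the client. The sketch below is illustrative only: matches_field_type is a hypothetical helper, and it assumes date values are ISO 8601 calendar dates like the "2024-01-15" example, which the table suggests but does not guarantee.

```python
from datetime import date


def matches_field_type(field_type: str, value: object) -> bool:
    """Check whether a decoded JSON value fits one of the six schema field types."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)  # assumption: ISO dates, e.g. "2024-01-15"
        except ValueError:
            return False
        return True
    if field_type == "array":
        return isinstance(value, list)
    if field_type == "object":
        return isinstance(value, dict)
    raise ValueError(f"unknown field type: {field_type}")
```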
Get Schema
GET /api/extract/schemas/{id}
Update Schema
PUT /api/extract/schemas/{id}
Delete Schema
DELETE /api/extract/schemas/{id}
Prebuilt Schemas
GET /api/extract/schemas/prebuilt
List prebuilt schemas for common document types.
Query Parameters:
| Parameter | Description |
|---|---|
| category | Filter by category |
| id | Get specific prebuilt schema |
Categories:
- invoice: Invoices and receipts
- contract: Legal contracts
- resume: Resumes and CVs
- form: Standard forms
- financial: Financial documents
Generate Schema
From Document
POST /api/extract/schemas/generate
Generate a schema by analyzing a document.
Request:
| Field | Type | Required | Description |
|---|---|---|---|
| document_id | string | Yes | Document to analyze |
| merge_with_current | boolean | No | If true, append suggested fields to existing schema instead of replacing |
| current_fields | array | No | Existing field definitions when merge_with_current is true |
{
"document_id": "abc123",
"merge_with_current": true,
"current_fields": [
{
"id": "invoice_number",
"name": "Invoice Number",
"type": "string",
"description": "The invoice number",
"required": true
}
]
}
Response:
Returns fields — an array of field definitions. Each field may include suggestionReason (short explanation of why it was suggested).
{
"fields": [
{
"id": "vendor_name",
"name": "Vendor Name",
"type": "string",
"description": "Name of the vendor or supplier",
"required": true,
"suggestionReason": "The document header contains a vendor/supplier section with company name."
}
],
"message": "Generated N field suggestions based on document content"
}
From Description
POST /api/extract/schemas/generate-from-description
Generate a schema from a text description.
Request:
{
"description": "Extract invoice number, vendor name, line items with description, quantity, and price, and the total amount"
}
Response:
Returns fields — an array of field definitions. Each field may include suggestionReason (short explanation of how the field maps to the description).
{
"fields": [
{
"id": "invoice_number",
"name": "Invoice Number",
"type": "string",
"description": "The invoice number",
"required": true,
"suggestionReason": "Explicitly requested in the description."
}
],
"message": "Generated N fields from your description"
}
Batch Extraction
Create Batch
POST /api/extract/batch
Run extraction on multiple documents.
Request:
{
"name": "January Invoices",
"document_ids": ["doc1", "doc2", "doc3"],
"schema_id": "invoice-schema",
"config": {},
"run_immediately": true
}
Response:
{
"id": "batch-uuid",
"name": "January Invoices",
"status": "processing",
"progress": 0,
"document_ids": ["doc1", "doc2", "doc3"],
"created_at": "2024-01-15T10:00:00Z"
}
List Batches
GET /api/extract/batch
Get Batch
GET /api/extract/batch/{id}
Response:
{
"id": "batch-uuid",
"name": "January Invoices",
"status": "completed",
"progress": 100,
"results": {
"doc1": { "success": true, "extraction_id": "..." },
"doc2": { "success": true, "extraction_id": "..." },
"doc3": { "success": false, "error": "..." }
},
"comparison_data": {...}
}
Get Batch Comparison
GET /api/extract/batch/{id}/comparison
Compare extraction results across documents.
Query Parameters:
| Parameter | Description |
|---|---|
| only_discrepancies | Show only fields with differences |
| min_variance | Minimum variance threshold |
| max_confidence | Maximum confidence threshold (numeric) |
| fields | Comma-separated field names to compare |
| format | Output format: json (default), csv, or xlsx |
When format is csv or xlsx, the response is a file download (not JSON). The response includes Content-Type (text/csv or application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) and Content-Disposition: attachment; filename="comparison-{batch_id}.csv" (or .xlsx).
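The comparison query can be assembled programmatically. The sketch below builds the URL from the parameters in the table and derives the filename the download is documented to use. Both function names are hypothetical, and encoding only_discrepancies as the string "true" is an assumption, since the accepted boolean format is not stated here.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

FORMATS = {"json", "csv", "xlsx"}


def comparison_url(
    base_url: str,
    batch_id: str,
    *,
    only_discrepancies: bool = False,
    fields: Optional[Sequence[str]] = None,
    fmt: str = "json",
) -> str:
    """Build a GET /api/extract/batch/{id}/comparison URL (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {sorted(FORMATS)}")
    params: dict[str, str] = {}
    if only_discrepancies:
        params["only_discrepancies"] = "true"  # assumption: booleans are sent as "true"
    if fields:
        params["fields"] = ",".join(fields)  # comma-separated, per the table
    if fmt != "json":
        params["format"] = fmt  # json is the default, so it can be omitted
    url = f"{base_url.rstrip('/')}/api/extract/batch/{batch_id}/comparison"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def download_filename(batch_id: str, fmt: str) -> str:
    """Filename from the documented Content-Disposition pattern for csv/xlsx."""
    if fmt not in {"csv", "xlsx"}:
        raise ValueError("only csv and xlsx responses are file downloads")
    return f"comparison-{batch_id}.{fmt}"
```

Note that urlencode percent-encodes the commas in fields as %2C, which standard query-string decoding turns back into commas on the server side.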
Ground Truth
Get Ground Truth
GET /api/extract/ground-truth?document_id={id}
Set Ground Truth
POST /api/extract/ground-truth
Set correct values for validation.
Request:
{
"document_id": "abc123",
"schema_id": "invoice-schema",
"field_values": {
"invoice_number": "INV-001",
"total_amount": 1250.00
},
"notes": "Verified by human reviewer"
}
Compare with Ground Truth
POST /api/extract/ground-truth/compare
Compare extraction results against ground truth.
Request:
{
"document_id": "abc123",
"field_results": [...],
"fields": [...]
}
Corrections
Add Correction
POST /api/extract/corrections
Add a manual correction to an extraction.
Request:
{
"extraction_id": "extraction-uuid",
"field_id": "total_amount",
"corrected_value": 1300.00,
"confidence_override": 1.0
}
Bulk Corrections
PUT /api/extract/corrections
Update multiple corrections.
Request:
{
"extraction_id": "extraction-uuid",
"corrections": [
{ "field_id": "total_amount", "corrected_value": 1300.00 },
{ "field_id": "due_date", "corrected_value": "2024-02-01" }
]
}
Suggested Schema
GET /api/extract/suggested-schema?document_id={id}
Returns a suggested prebuilt schema for the document. If the document was classified during processing, that classification is used; otherwise detection runs on demand. The Extract UI uses this endpoint to offer suggestions such as "Apply Invoice schema".
Query Parameters:
| Parameter | Description |
|---|---|
| document_id | Document ID (required) |
Response:
{
"suggestedSchema": {
"id": "prebuilt-invoice",
"name": "Invoice",
"formType": "invoice",
"category": "financial",
"fieldCount": 12
}
}
Returns { "suggestedSchema": null } when no matching schema is found.
Get Extraction Results
GET /api/extract/results/{id}
Get the results of an extraction. By default, id is treated as an extraction ID; pass type=job to look up by job ID instead.
Query Parameters:
| Parameter | Default | Description |
|---|---|---|
| type | extraction | extraction: look up by extraction ID; job: look up by job ID. |
Response: Object with optional job (id, status, progress, error, created_at, updated_at) and optional extraction (id, document_id, schema_id, schema_name, data, citations, field_results, corrections, confidence, created_at). When type=job, job is always returned; extraction is included when the job is completed.
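Because both job and extraction are optional, a client typically needs to branch on which one is present. The following sketch does that for an already-parsed response. The function name and the "ready" / "failed" / "pending" labels are illustrative assumptions, and the sketch deliberately avoids relying on specific job status strings, which are not enumerated here.

```python
def summarize_result(payload: dict) -> dict:
    """Reduce a results response to a ready / failed / pending summary."""
    job = payload.get("job") or {}
    extraction = payload.get("extraction")

    # extraction is included once the job has completed
    if extraction is not None:
        return {
            "state": "ready",
            "data": extraction.get("data"),
            "confidence": extraction.get("confidence"),
        }

    # no extraction yet: a populated error field means the job did not succeed
    if job.get("error"):
        return {"state": "failed", "error": job["error"]}

    return {"state": "pending", "status": job.get("status"), "progress": job.get("progress")}
```

A polling loop could call GET /api/extract/results/{id}?type=job, pass the decoded body to summarize_result, and stop as soon as the state is no longer "pending".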
See also: Extraction, Custom Extraction, Embed API.