Extract API

Extract structured data from documents using schemas. Define the fields you want, and DocLD extracts them with confidence scores. See Data Extraction for concepts and Custom Extraction for schema design. For a single request/response without streaming, use the Embed API with action: 'extract'.

Run Extraction


POST /api/extract/run

Extract data from a document using a schema or custom configuration.

Input Options

Input Type	Format	Description
Document ID	`{"document_id": "uuid"}`	Extract from document
DocLD reference	`{"input": "docld://..."}`	Extract from reference
Job ID	`{"input": "jobid://..."}`	Reuse parsed content
URL	`{"input": "https://..."}`	Fetch and extract

Request


{
  "document_id": "abc123",
  "schema_id": "schema-uuid",
  "include_citations": true
}

For complex documents (for example, ones with tables or many fields), you can enable agentic extraction by setting settings.agenticExtractionMode: true in your schema settings or inline config. The service then runs a multi-step pipeline: it identifies regions (headers, tables, form blocks), extracts each region, then merges and validates the results. Use it when single-pass or array extraction is insufficient.

Or with inline configuration:


{
  "input": "docld://abc123",
  "config": {
    "fields": [
      {
        "name": "invoice_number",
        "type": "string",
        "description": "The invoice number"
      },
      {
        "name": "total_amount",
        "type": "number",
        "description": "Total amount due"
      },
      {
        "name": "line_items",
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "quantity": { "type": "number" },
            "price": { "type": "number" }
          }
        }
      }
    ]
  },
  "include_citations": true
}

Zero-shot (description only): You can omit both schema_id and config and pass a natural-language description of what to extract. The service infers a schema and runs extraction in one shot. Provide exactly one of schema_id, config, or description. Minimum length for description is 10 characters.


{
  "document_id": "abc123",
  "description": "Extract company name, all dates, and every dollar amount",
  "include_citations": true
}

The response shape is the same as when using a schema or inline config. Zero-shot is single-document only; batch extraction requires a schema.
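
The "exactly one of" rule and the 10-character minimum can be checked on the client before sending a request. The sketch below is a hypothetical helper (check_extraction_mode is not part of the API), and the server's own validation may differ in detail:

```python
from typing import Any, Dict

MIN_DESCRIPTION_LENGTH = 10  # minimum stated in the text above


def check_extraction_mode(body: Dict[str, Any]) -> str:
    """Return the extraction mode used by a /api/extract/run body.

    Raises ValueError unless exactly one of schema_id, config, or
    description is present, or if description is too short.
    """
    modes = [
        key
        for key in ("schema_id", "config", "description")
        if body.get(key) is not None
    ]
    if len(modes) != 1:
        raise ValueError(
            f"expected exactly one of schema_id, config, description; got {modes}"
        )
    mode = modes[0]
    if mode == "description" and len(body["description"]) < MIN_DESCRIPTION_LENGTH:
        raise ValueError("description must be at least 10 characters")
    return mode
```

Calling it on each of the three request bodies above returns "schema_id", "config", and "description" respectively.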

Response


{
  "success": true,
  "job_id": "job-uuid",
  "extraction_id": "extraction-uuid",
  "document_id": "abc123",
  "data": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [
      {
        "description": "Widget A",
        "quantity": 10,
        "price": 50.00
      },
      {
        "description": "Widget B",
        "quantity": 5,
        "price": 150.00
      }
    ]
  },
  "field_results": [
    {
      "field": "invoice_number",
      "value": "INV-2024-001",
      "confidence": 0.98,
      "evidence": {
        "text": "Invoice Number: INV-2024-001",
        "page": 1,
        "bbox": { "x": 0.1, "y": 0.05, "width": 0.4, "height": 0.03 },
        "confidence": 0.98,
        "chunk_id": "chunk-uuid",
        "span": { "start": 120, "end": 156 }
      },
      "citations": [
        {
          "text": "Invoice Number: INV-2024-001",
          "bbox": { "page": 1, "left": 0.1, "top": 0.05, "width": 0.4, "height": 0.03 },
          "confidence": "high"
        }
      ]
    }
  ],
  "overall_confidence": 0.95,
  "processing_time": 1.2,
  "usage": {
    "num_pages": 2,
    "credits": 2.0
  }
}

When include_citations is true, each field includes an evidence object (primary traceability) and a citations array (backward compatibility). Evidence provides text, page, bbox, confidence, optional chunk_id, and optional span (start/end offset in parsed text) for agentic and document-grounded use cases.
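
One way to use the per-field confidence and evidence is to route uncertain values to human review. The following is a minimal sketch that assumes the response shape shown above; the function name and the 0.9 threshold are illustrative choices, not API defaults:

```python
from typing import Any, Dict, List


def low_confidence_fields(
    response: Dict[str, Any], threshold: float = 0.9
) -> List[Dict[str, Any]]:
    """List fields whose confidence is below threshold, with their evidence location."""
    flagged = []
    for result in response.get("field_results", []):
        if result["confidence"] >= threshold:
            continue
        # evidence is only present when include_citations is true
        evidence = result.get("evidence") or {}
        flagged.append(
            {
                "field": result["field"],
                "value": result["value"],
                "confidence": result["confidence"],
                "page": evidence.get("page"),
                "text": evidence.get("text"),
            }
        )
    return flagged
```

Each returned entry carries the page and source text, so a reviewer can jump straight to the passage the value came from.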

Example


curl -X POST "https://your-domain.com/api/extract/run" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "abc123",
    "schema_id": "invoice-schema",
    "include_citations": true
  }'

Streaming (SSE)

To receive progressive updates (field-by-field or chunk-by-chunk), enable streaming in either of two ways:

Query parameter: POST /api/extract/run?stream=1
Header: Accept: text/event-stream

The request body is unchanged. The response is a stream of Server-Sent Events.

Event types:

Type	Payload	Description
`progress`	`progress` (0–100), optional `progress_message`	Coarse or per-chunk progress
`field_extracted`	`field` (fieldId, fieldName, value, confidence, etc.)	One field extracted; UI can show it immediately
`chunk_done`	`chunk_index`, optional `partial_data`	One chunk/page done (array extraction mode)
`complete`	`job`, `result`	Final event; `result` has the same shape as the non-streaming JSON response
`error`	`error` (string)	Failure or cancellation message

Cancel: Closing the connection (e.g. AbortController.abort() on the client) cancels the extraction. The server marks the job as failed with error "Cancelled".
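
A client can consume the stream by parsing standard Server-Sent Events framing (data: lines terminated by a blank line). The sketch below makes an assumption the text above does not confirm: that each event's data is a JSON object, with the event type carried either in a type key or in the SSE event: field. The helper names are hypothetical, and the input is any iterable of decoded text lines, such as a line iterator from your HTTP client:

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per Server-Sent Event from decoded text lines."""
    event_name: Optional[str] = None
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                payload = json.loads("\n".join(data_lines))
                # Assumption: fall back to the SSE event name as the type.
                if event_name and "type" not in payload:
                    payload["type"] = event_name
                yield payload
            event_name, data_lines = None, []
        elif line.startswith(":"):
            continue  # SSE comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            data_lines.append(value[1:] if value.startswith(" ") else value)


def collect_result(lines: Iterable[str]) -> Tuple[Dict[str, Any], Optional[Dict[str, Any]]]:
    """Gather streamed field values; return (fields_so_far, final_result)."""
    fields: Dict[str, Any] = {}
    for event in iter_sse_events(lines):
        kind = event.get("type")
        if kind == "field_extracted":
            field = event["field"]
            fields[field["fieldName"]] = field["value"]
        elif kind == "complete":
            return fields, event["result"]
        elif kind == "error":
            raise RuntimeError(event["error"])
    return fields, None
```

Keeping the parser independent of any HTTP library makes it easy to test against recorded streams. A real client would typically also update its UI on progress and chunk_done events.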

Extraction Schemas

List Schemas


GET /api/extract/schemas

List available extraction schemas.

Query Parameters:

Parameter	Description
`organization`	Include organization schemas
`organization_only`	Only organization schemas
`organization_id`	Filter by organization

Response:


[
  {
    "id": "uuid",
    "name": "Invoice",
    "description": "Extract invoice data",
    "fields": [...],
    "is_shared": false,
    "created_at": "2024-01-15T10:00:00Z"
  }
]

Create Schema


POST /api/extract/schemas

Create a new extraction schema.

Request:


{
  "name": "Invoice",
  "description": "Extract invoice data",
  "instructions": "Extract all line items and totals",
  "fields": [
    {
      "name": "invoice_number",
      "type": "string",
      "description": "The invoice number",
      "required": true
    },
    {
      "name": "total_amount",
      "type": "number",
      "description": "Total amount due",
      "required": true
    },
    {
      "name": "due_date",
      "type": "date",
      "description": "Payment due date"
    }
  ],
  "settings": {
    "model": "gpt-4o",
    "temperature": 0
  },
  "is_shared": false
}

Field Types

Type	Description	Example
`string`	Text value	"INV-001"
`number`	Numeric value	1250.00
`boolean`	True/false	true
`date`	Date value	"2024-01-15"
`array`	List of items	`[{...}, {...}]`
`object`	Nested object	`{"name": "...", "value": ...}`

Get Schema


GET /api/extract/schemas/{id}

Update Schema


PUT /api/extract/schemas/{id}

Delete Schema


DELETE /api/extract/schemas/{id}

Prebuilt Schemas


GET /api/extract/schemas/prebuilt

List prebuilt schemas for common document types.

Query Parameters:

Parameter	Description
`category`	Filter by category
`id`	Get specific prebuilt schema

Categories:

invoice - Invoices and receipts
contract - Legal contracts
resume - Resumes and CVs
form - Standard forms
financial - Financial documents

Generate Schema

From Document


POST /api/extract/schemas/generate

Generate a schema by analyzing a document.

Request:

Field	Type	Required	Description
`document_id`	string	Yes	Document to analyze
`merge_with_current`	boolean	No	If true, append suggested fields to existing schema instead of replacing
`current_fields`	array	No	Existing field definitions when `merge_with_current` is true


{
  "document_id": "abc123",
  "merge_with_current": true,
  "current_fields": [
    {
      "id": "invoice_number",
      "name": "Invoice Number",
      "type": "string",
      "description": "The invoice number",
      "required": true
    }
  ]
}

Response:

Returns fields — an array of field definitions. Each field may include suggestionReason (short explanation of why it was suggested).


{
  "fields": [
    {
      "id": "vendor_name",
      "name": "Vendor Name",
      "type": "string",
      "description": "Name of the vendor or supplier",
      "required": true,
      "suggestionReason": "The document header contains a vendor/supplier section with company name."
    }
  ],
  "message": "Generated N field suggestions based on document content"
}

From Description


POST /api/extract/schemas/generate-from-description

Generate a schema from a text description.

Request:


{
  "description": "Extract invoice number, vendor name, line items with description, quantity, and price, and the total amount"
}

Response:

Returns fields — an array of field definitions. Each field may include suggestionReason (short explanation of how the field maps to the description).


{
  "fields": [
    {
      "id": "invoice_number",
      "name": "Invoice Number",
      "type": "string",
      "description": "The invoice number",
      "required": true,
      "suggestionReason": "Explicitly requested in the description."
    }
  ],
  "message": "Generated N fields from your description"
}

Batch Extraction

Create Batch


POST /api/extract/batch

Run extraction on multiple documents.

Request:


{
  "name": "January Invoices",
  "document_ids": ["doc1", "doc2", "doc3"],
  "schema_id": "invoice-schema",
  "config": {},
  "run_immediately": true
}

Response:


{
  "id": "batch-uuid",
  "name": "January Invoices",
  "status": "processing",
  "progress": 0,
  "document_ids": ["doc1", "doc2", "doc3"],
  "created_at": "2024-01-15T10:00:00Z"
}

List Batches


GET /api/extract/batch

Get Batch


GET /api/extract/batch/{id}

Response:


{
  "id": "batch-uuid",
  "name": "January Invoices",
  "status": "completed",
  "progress": 100,
  "results": {
    "doc1": { "success": true, "extraction_id": "..." },
    "doc2": { "success": true, "extraction_id": "..." },
    "doc3": { "success": false, "error": "..." }
  },
  "comparison_data": {...}
}
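
Because results is keyed by document ID, a completed batch can be split into successes and failures in a few lines. This is a sketch assuming the response shape shown above; split_batch_results is a hypothetical helper name:

```python
from typing import Any, Dict, Optional, Tuple


def split_batch_results(
    batch: Dict[str, Any],
) -> Tuple[Dict[str, str], Dict[str, Optional[str]]]:
    """Return (succeeded, failed) maps keyed by document ID.

    succeeded maps document ID to extraction ID;
    failed maps document ID to the reported error message.
    """
    results = batch.get("results", {})
    succeeded = {
        doc_id: entry["extraction_id"]
        for doc_id, entry in results.items()
        if entry.get("success")
    }
    failed = {
        doc_id: entry.get("error")
        for doc_id, entry in results.items()
        if not entry.get("success")
    }
    return succeeded, failed
```

The failed map can then drive a retry, for example by creating a new batch with only those document IDs.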

Get Batch Comparison


GET /api/extract/batch/{id}/comparison

Compare extraction results across documents.

Query Parameters:

Parameter	Description
`only_discrepancies`	Show only fields with differences
`min_variance`	Minimum variance threshold
`max_confidence`	Maximum confidence threshold (numeric)
`fields`	Comma-separated field names to compare
`format`	Output format: `json` (default), `csv`, or `xlsx`

When format is csv or xlsx, the response is a file download (not JSON). The response includes Content-Type (text/csv or application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) and Content-Disposition: attachment; filename="comparison-{batch_id}.csv" (or .xlsx).
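
The query parameters above can be assembled with the standard library. The sketch below is illustrative: comparison_url is a hypothetical helper, and sending the boolean as the string "true" is an assumption about how the server parses only_discrepancies:

```python
from typing import Dict, Optional, Sequence
from urllib.parse import quote, urlencode


def comparison_url(
    base_url: str,
    batch_id: str,
    *,
    only_discrepancies: bool = False,
    min_variance: Optional[float] = None,
    max_confidence: Optional[float] = None,
    fields: Optional[Sequence[str]] = None,
    fmt: str = "json",
) -> str:
    """Build the batch comparison URL, omitting parameters left at their defaults."""
    if fmt not in ("json", "csv", "xlsx"):
        raise ValueError(f"unsupported format: {fmt}")
    params: Dict[str, str] = {}
    if only_discrepancies:
        params["only_discrepancies"] = "true"  # assumed boolean encoding
    if min_variance is not None:
        params["min_variance"] = str(min_variance)
    if max_confidence is not None:
        params["max_confidence"] = str(max_confidence)
    if fields:
        params["fields"] = ",".join(fields)  # comma-separated field names
    if fmt != "json":
        params["format"] = fmt
    url = f"{base_url.rstrip('/')}/api/extract/batch/{quote(batch_id, safe='')}/comparison"
    return f"{url}?{urlencode(params)}" if params else url
```

For csv or xlsx, write the response body to a file rather than parsing it as JSON.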

Ground Truth

Get Ground Truth


GET /api/extract/ground-truth?document_id={id}

Set Ground Truth


POST /api/extract/ground-truth

Set correct values for validation.

Request:


{
  "document_id": "abc123",
  "schema_id": "invoice-schema",
  "field_values": {
    "invoice_number": "INV-001",
    "total_amount": 1250.00
  },
  "notes": "Verified by human reviewer"
}

Compare with Ground Truth


POST /api/extract/ground-truth/compare

Compare extraction results against ground truth.

Request:


{
  "document_id": "abc123",
  "field_results": [...],
  "fields": [...]
}

Corrections

Add Correction


POST /api/extract/corrections

Add a manual correction to an extraction.

Request:


{
  "extraction_id": "extraction-uuid",
  "field_id": "total_amount",
  "corrected_value": 1300.00,
  "confidence_override": 1.0
}

Bulk Corrections


PUT /api/extract/corrections

Update multiple corrections.

Request:


{
  "extraction_id": "extraction-uuid",
  "corrections": [
    { "field_id": "total_amount", "corrected_value": 1300.00 },
    { "field_id": "due_date", "corrected_value": "2024-02-01" }
  ]
}
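
When reviewed values are held as a simple mapping of field ID to corrected value, the bulk request body can be generated from it. This is a small sketch following the request shape above; bulk_corrections_body is a hypothetical helper name:

```python
from typing import Any, Dict


def bulk_corrections_body(
    extraction_id: str, corrected: Dict[str, Any]
) -> Dict[str, Any]:
    """Build the PUT /api/extract/corrections body from {field_id: corrected_value}."""
    return {
        "extraction_id": extraction_id,
        "corrections": [
            {"field_id": field_id, "corrected_value": value}
            for field_id, value in corrected.items()
        ],
    }
```

Serialize the returned dictionary with json.dumps and send it as the request body.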

Suggested Schema


GET /api/extract/suggested-schema?document_id={id}

Returns a suggested prebuilt schema for a document. If the document was classified during processing, the suggestion is based on that classification; otherwise detection runs on demand. The Extract UI uses this endpoint to offer prompts such as “Apply Invoice schema”.

Query Parameters:

Parameter	Description
`document_id`	Document ID (required)

Response:


{
  "suggestedSchema": {
    "id": "prebuilt-invoice",
    "name": "Invoice",
    "formType": "invoice",
    "category": "financial",
    "fieldCount": 12
  }
}

Returns { "suggestedSchema": null } when no matching schema is found.

Get Extraction Results


GET /api/extract/results/{id}

Get the results of an extraction. By default, id is treated as an extraction ID; pass type=job to look up by job ID instead.

Query Parameters:

Parameter	Default	Description
`type`	`extraction`	`extraction` — look up by extraction ID; `job` — look up by job ID.

Response: Object with optional job (id, status, progress, error, created_at, updated_at) and optional extraction (id, document_id, schema_id, schema_name, data, citations, field_results, corrections, confidence, created_at). When type=job, job is always returned; extraction is included when the job is completed.
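
Since extraction only appears once the job has completed, a client holding a job ID can poll this endpoint until it does. The sketch below assumes a caller-supplied fetch function that performs the authenticated GET and returns the parsed JSON body. The name wait_for_extraction, the polling interval, and the "failed" status string are assumptions rather than documented values:

```python
import time
from typing import Any, Callable, Dict


def wait_for_extraction(
    fetch: Callable[[str], Dict[str, Any]],
    job_id: str,
    *,
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll the results endpoint by job ID until the extraction is available."""
    path = f"/api/extract/results/{job_id}?type=job"
    for _ in range(max_attempts):
        body = fetch(path)
        job = body["job"]  # always returned when type=job
        # Assumption: a failed job reports an error and/or status "failed".
        if job.get("error") or job.get("status") == "failed":
            raise RuntimeError(job.get("error") or "extraction job failed")
        extraction = body.get("extraction")
        if extraction is not None:
            return extraction
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not complete after {max_attempts} attempts")
```

Injecting fetch and sleep keeps the helper independent of any HTTP library and lets it be tested without network access or real delays.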
