Data Extraction

Extract structured data from documents using schemas and AI-powered field detection.

Overview

DocLD’s extraction feature lets you:

Define JSON schemas for your data structure
Use AI to detect and extract field values
Validate extractions with confidence scores
Correct results manually or automatically
Batch process multiple documents
Compare extractions against ground truth

How It Works


Document → Schema → AI Extraction → Field Results → Validation

Upload a document or reference an existing one
Apply an extraction schema
AI extracts field values, each with a confidence score
Review the results and correct them if needed
Export the data or use it in workflows

Extraction Schemas

Creating Schemas

Define what data to extract:


{
  "name": "Invoice",
  "description": "Extract invoice data",
  "instructions": "Extract all line items and totals",
  "fields": [
    {
      "name": "invoice_number",
      "type": "string",
      "description": "The invoice number",
      "required": true
    },
    {
      "name": "total_amount",
      "type": "number",
      "description": "Total amount due",
      "required": true
    },
    {
      "name": "line_items",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "price": { "type": "number" }
        }
      }
    }
  ]
}

Field Types

Type	Description	Example
`string`	Text value	"INV-001"
`number`	Numeric value	1250.00
`boolean`	True/false	true
`date`	Date value	"2024-01-15"
`array`	List of items	`[{...}, {...}]`
`object`	Nested object	`{"name": "..."}`
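
As a rough illustration of how the field types above map onto JSON values, the sketch below checks an extraction result against a schema's fields list on the client side. The helper name check_record and its rules (for example, treating date as an ISO 8601 YYYY-MM-DD string) are assumptions made for this example, not part of the DocLD API.

```python
from datetime import date


def _is_date(value):
    # Assumption: dates arrive as ISO 8601 strings such as "2024-01-15".
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Hypothetical mapping from schema field types to Python-side checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": _is_date,
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def check_record(schema, data):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        value = data.get(name)
        if value is None:
            # Only required fields are reported when absent.
            if field.get("required", False):
                problems.append(f"{name}: required field is missing")
            continue
        if not TYPE_CHECKS[ftype](value):
            problems.append(f"{name}: expected {ftype}")
    return problems
```

A check like this only covers top-level fields; nested items and properties would need a recursive variant.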

Schema Instructions

Add instructions to guide extraction:


{
  "instructions": "Extract invoice data carefully. For line items, include all items even if they span multiple pages. If tax is shown separately, extract it as a separate field."
}

Running Extractions

Single Document


curl -X POST "https://your-domain.com/api/v1/extract/run?sync=true" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc-uuid",
    "schema_id": "invoice-schema",
    "include_citations": true
  }'
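
For callers that prefer Python over curl, the same request can be assembled with the standard library. This is a minimal sketch: the function name build_extract_request is hypothetical, and the function only builds the request object so that it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_extract_request(base_url, api_key, document_id, schema_id,
                          include_citations=True):
    """Build (but do not send) the POST request from the curl example."""
    body = {
        "document_id": document_id,
        "schema_id": schema_id,
        "include_citations": include_citations,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/extract/run?sync=true",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_extract_request(
    "https://your-domain.com", "YOUR_API_KEY", "doc-uuid", "invoice-schema"
)
# To send it:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```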

Extract anything (zero-shot)

For ad-hoc use cases you can skip defining a schema. On the Extract page, choose Describe what you want, upload a document, and type what to extract in plain language (e.g. company name, all dates, and every dollar amount). The system infers a schema and runs extraction in one shot. No schema library or field builder required.

Via API (Bearer key), send the same request to POST /api/v1/extract/run (add ?sync=true for a single JSON response) with a description field instead of schema_id or config. The description must be at least 10 characters. You cannot combine description with schema_id or config. Streaming (SSE) is only available on the session-authenticated POST /api/extract/run route in the app.
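
The rules above (a description of at least 10 characters, never combined with schema_id or config) can be checked before the request leaves the client. The sketch below is one way to do that; build_run_payload is a hypothetical helper, and allowing schema_id and config together is an assumption, since the text only rules out combining either of them with description.

```python
def build_run_payload(document_id, description=None, schema_id=None, config=None):
    """Assemble a body for POST /api/v1/extract/run, enforcing the rules above."""
    if description is not None:
        if schema_id is not None or config is not None:
            raise ValueError("description cannot be combined with schema_id or config")
        if len(description) < 10:
            raise ValueError("description must be at least 10 characters")
        return {"document_id": document_id, "description": description}

    if schema_id is None and config is None:
        raise ValueError("provide one of description, schema_id, or config")

    payload = {"document_id": document_id}
    if schema_id is not None:
        payload["schema_id"] = schema_id
    if config is not None:
        payload["config"] = config
    return payload


payload = build_run_payload(
    "doc-uuid",
    description="company name, all dates, and every dollar amount",
)
```

Failing fast on the client gives a clearer error than waiting for the server to reject the request.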

Agentic extraction (complex docs)

For documents with tables, many fields, or complex layout, enable Agentic extraction in Settings. DocLD runs a multi-step pipeline:

Identify regions – Segment the document into headers, tables, form blocks, and body (using layout metadata when available, or an LLM pass).
Extract by region – Run extraction on each region with the same schema.
Merge and validate – Combine the per-region results (scalars: first non-null value wins; arrays: merged and deduplicated), then optionally validate.

This goes beyond a single prompt + schema and improves accuracy on complex layouts. Toggle “Agentic extraction (complex docs)” in the Extract settings or set config.settings.agenticExtractionMode: true in the API.
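
To make the merge rule concrete, here is a minimal sketch of "scalars: first non-null; arrays: merged and deduped" applied to per-region results. It is not DocLD's implementation: the function name is hypothetical, and deduplicating by exact equality of the JSON-serialized item is an assumption, since the actual matching rule is not described here.

```python
import json


def merge_region_results(region_results):
    """Merge per-region extraction dicts in region order."""
    merged = {}
    seen = {}  # field name -> set of serialized array items already kept
    for result in region_results:
        for name, value in result.items():
            if value is None:
                continue
            if isinstance(value, list):
                # Assumption: a field is an array in every region that reports it.
                bucket = merged.setdefault(name, [])
                keys = seen.setdefault(name, set())
                for item in value:
                    key = json.dumps(item, sort_keys=True)
                    if key not in keys:
                        keys.add(key)
                        bucket.append(item)
            elif name not in merged:
                # Scalars: the first non-null value wins.
                merged[name] = value
    return merged


regions = [
    {"invoice_number": None, "line_items": [{"description": "A", "quantity": 1}]},
    {"invoice_number": "INV-001",
     "line_items": [{"description": "A", "quantity": 1},
                    {"description": "B", "quantity": 2}]},
]
merged = merge_region_results(regions)
# merged["invoice_number"] == "INV-001"; merged["line_items"] holds A and B once each
```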

Streaming extraction (SSE)

For long documents or when you want progressive feedback, use streaming on the session-authenticated route POST /api/extract/run (dashboard or cookie session): send Accept: text/event-stream (or ?stream=1) with the same request body. The v1 POST /api/v1/extract/run API is for non-streaming JSON (use ?sync=true for a single completed response). The SSE stream uses these event types:

Event	Description
`progress`	Progress percentage and optional message (e.g. “Extracting chunk 2/5…”)
`field_extracted`	One field has been extracted; payload includes `field` (fieldId, fieldName, value, confidence, etc.)
`chunk_done`	One chunk or page is done (array/chunked mode); payload includes `chunk_index` and optional `partial_data`
`complete`	Final event with full `result` (same shape as the non-streaming JSON response)
`error`	Something went wrong (e.g. “Cancelled”)

You can show fields as they arrive in the UI. Closing the connection (e.g. user clicks Cancel) cancels the extraction on the server; the job is marked failed with error “Cancelled”.
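
A client consuming the stream needs to split it into events. The sketch below parses standard SSE framing (event: and data: lines, with a blank line ending each event) and assumes every data payload is JSON. Whether DocLD names events through the event: line, and the exact payload keys beyond those listed in the table, are assumptions here.

```python
import json


def parse_sse(lines):
    """Yield (event_type, payload) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


sample = [
    "event: field_extracted\n",
    'data: {"field": {"fieldName": "invoice_number", "value": "INV-001"}}\n',
    "\n",
    "event: complete\n",
    'data: {"result": {"data": {"invoice_number": "INV-001"}}}\n',
    "\n",
]
for event, payload in parse_sse(sample):
    if event == "field_extracted":
        print(payload["field"]["fieldName"], payload["field"]["value"])
```

In a real client, lines would come from the decoded HTTP response body rather than a list.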

Programmatic v1 vs session `/api/extract/*`

Capability	Bearer API key (`/api/v1/...`)	Session / dashboard (`/api/extract/...`)
Sync extraction (one JSON response)	`POST /api/v1/extract/run?sync=true`	`POST /api/extract/run`
Create extraction (schema_id, fields, or description)	`POST /api/v1/extractions`	UI or same session routes
List / manage saved schemas	`GET/POST /api/v1/schemas`, `GET/PATCH/DELETE /api/v1/schemas/:id`	Full schema UI + helpers below
Get one extraction by id	`GET /api/v1/extractions/:id`	`GET /api/extract/results/:id` (session)
Batch, corrections, ground truth, prebuilt catalog, schema generators, batch comparison	Not on v1 — use per-document `/api/v1/extract/run` or `/api/v1/extractions` from automation	`/api/extract/batch`, `/api/extract/corrections`, etc. (session cookies)

Batch Extraction

Process multiple documents in one job (session-authenticated /api/extract/batch — not available on /api/v1 yet). From integrations, either call this route with a user session (cookie) or loop POST /api/v1/extract/run?sync=true per document.


curl -X POST "https://your-domain.com/api/extract/batch" \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{
    "name": "January Invoices",
    "document_ids": ["doc1", "doc2", "doc3"],
    "schema_id": "invoice-schema",
    "run_immediately": true
  }'

Results

Field Results


{
  "data": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [...]
  },
  "field_results": [
    {
      "field": "invoice_number",
      "value": "INV-2024-001",
      "confidence": 0.98,
      "evidence": {
        "text": "Invoice #: INV-2024-001",
        "page": 1,
        "bbox": { "x": 0.1, "y": 0.05, "width": 0.4, "height": 0.03 },
        "confidence": 0.98,
        "chunk_id": "chunk-uuid",
        "span": { "start": 120, "end": 156 }
      },
      "citations": [
        {
          "text": "Invoice #: INV-2024-001",
          "page": 1,
          "bbox": { "x": 0.1, "y": 0.05, "width": 0.4, "height": 0.03 },
          "confidence": "high"
        }
      ]
    }
  ],
  "overall_confidence": 0.95
}

Confidence Scores

Score	Meaning	Action
0.9+	High confidence	Usually accurate
0.7 to < 0.9	Medium confidence	Spot check
< 0.7	Low confidence	Manual review
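
These thresholds translate directly into a review queue. The following sketch routes each entry of field_results into one of three buckets; the bucket names and the triage function are illustrative, and a score of exactly 0.9 is treated as high confidence.

```python
def triage(field_results, high=0.9, medium=0.7):
    """Group field names by the action suggested in the table above."""
    buckets = {"accept": [], "spot_check": [], "manual_review": []}
    for result in field_results:
        confidence = result["confidence"]
        if confidence >= high:
            action = "accept"
        elif confidence >= medium:
            action = "spot_check"
        else:
            action = "manual_review"
        buckets[action].append(result["field"])
    return buckets


queue = triage([
    {"field": "invoice_number", "confidence": 0.98},
    {"field": "total_amount", "confidence": 0.82},
    {"field": "due_date", "confidence": 0.41},
])
# queue["manual_review"] == ["due_date"]
```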

Evidence and Citations

Each field can include a structured evidence object for traceability and compliance (“show me where this number came from”). Evidence supports agentic and document-grounded systems that need precise references.

Field	Description
`text`	Source text snippet from the document
`page`	Page number (1-indexed)
`bbox`	Bounding box (normalized 0–1: x, y, width, height)
`confidence`	Numeric confidence (0–1)
`chunk_id`	Chunk reference when available
`span`	Optional start/end offset in parsed text for agentic use

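Citations remain available for backward compatibility and follow a similar structure. In the Field Results example above, a citation carries text, page, and bbox, and its confidence is a label ("high") rather than a number.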
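
Because bbox values are normalized to the 0–1 range, drawing a highlight on a rendered page means scaling them by the page size. The sketch below does that conversion. It assumes that x and y mark the top-left corner of the box and that the origin is the top-left of the page, which this page does not state.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a normalized bbox to (left, top, right, bottom) in pixels."""
    left = round(bbox["x"] * page_width)
    top = round(bbox["y"] * page_height)
    right = round((bbox["x"] + bbox["width"]) * page_width)
    bottom = round((bbox["y"] + bbox["height"]) * page_height)
    return left, top, right, bottom


# The bbox from the example response, on a page rendered at 1000 x 2000 pixels.
box = bbox_to_pixels(
    {"x": 0.1, "y": 0.05, "width": 0.4, "height": 0.03}, 1000, 2000
)
# box == (100, 100, 500, 160)
```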

Manual Corrections

Corrections are implemented on /api/extract/corrections (session auth only — no v1 equivalent yet).

Adding Corrections

Fix extraction errors:


curl -X POST "https://your-domain.com/api/extract/corrections" \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{
    "extraction_id": "ext-uuid",
    "field_id": "total_amount",
    "corrected_value": 1300.00
  }'

Bulk Corrections


curl -X PUT "https://your-domain.com/api/extract/corrections" \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{
    "extraction_id": "ext-uuid",
    "corrections": [
      {"field_id": "total_amount", "corrected_value": 1300.00},
      {"field_id": "due_date", "corrected_value": "2024-02-01"}
    ]
  }'
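
When a reviewer edits several values at once, the bulk body can be derived by diffing the reviewed values against the extracted ones. This is a sketch with a hypothetical helper name; it assumes field_id is the field name, as in the examples above.

```python
def build_bulk_corrections(extraction_id, extracted, reviewed):
    """Build the PUT body, including only fields whose value changed."""
    corrections = [
        {"field_id": name, "corrected_value": value}
        for name, value in reviewed.items()
        if extracted.get(name) != value
    ]
    return {"extraction_id": extraction_id, "corrections": corrections}


body = build_bulk_corrections(
    "ext-uuid",
    extracted={"invoice_number": "INV-001", "total_amount": 1250.00,
               "due_date": "2024-01-01"},
    reviewed={"invoice_number": "INV-001", "total_amount": 1300.00,
              "due_date": "2024-02-01"},
)
# body["corrections"] lists total_amount and due_date only
```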

Ground Truth

Ground truth and compare endpoints are session-only (/api/extract/ground-truth, /api/extract/ground-truth/compare).

Set verified values for accuracy measurement:


curl -X POST "https://your-domain.com/api/extract/ground-truth" \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{
    "document_id": "doc-uuid",
    "schema_id": "schema-uuid",
    "field_values": {
      "invoice_number": "INV-001",
      "total_amount": 1250.00
    }
  }'

Compare extractions against ground truth:


curl -X POST "https://your-domain.com/api/extract/ground-truth/compare" \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{
    "document_id": "doc-uuid",
    "field_results": [...],
    "fields": [...]
  }'
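
The response format of the compare endpoint is not shown on this page. For a quick local check, field-level accuracy can be approximated as the share of ground-truth fields whose extracted value matches exactly. The sketch below is that client-side approximation, not the server's comparison logic, which may normalize or fuzzy-match values.

```python
def field_accuracy(ground_truth, extracted):
    """Return (accuracy, per-field match flags) using exact equality."""
    matches = {
        name: extracted.get(name) == expected
        for name, expected in ground_truth.items()
    }
    accuracy = sum(matches.values()) / len(matches) if matches else 0.0
    return accuracy, matches


accuracy, matches = field_accuracy(
    {"invoice_number": "INV-001", "total_amount": 1250.00},
    {"invoice_number": "INV-001", "total_amount": 1300.00},
)
# accuracy == 0.5; matches["total_amount"] is False
```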

Prebuilt Schemas

Use prebuilt schemas for common document types:

Category	Schemas
Invoice	Invoice, Receipt, Purchase Order
Contract	Contract, NDA, Agreement
Resume	Resume, CV
Financial	Bank Statement, Tax Form
Form	W-9, W-4, 1099

API key: list schemas you can reference (including IDs to use with /api/v1/extract/run) via:


curl -X GET "https://your-domain.com/api/v1/schemas" \
  -H "Authorization: Bearer YOUR_API_KEY"

Session: browse the prebuilt catalog by category:


curl -X GET "https://your-domain.com/api/extract/schemas/prebuilt?category=invoice" \
  -H "Cookie: your-session-cookie"

Schema Generation

Schema generation helpers live under /api/extract/schemas/* (session only). With an API key, define fields inline or use description on POST /api/v1/extractions / POST /api/v1/extract/run instead of a separate generate step.

From Document

Generate schema by analyzing a sample:


curl -X POST "https://your-domain.com/api/extract/schemas/generate" \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{"document_id": "sample-doc-uuid"}'

From Description

Generate from natural language:


curl -X POST "https://your-domain.com/api/extract/schemas/generate-from-description" \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{"description": "Extract invoice number, vendor name, line items with description, quantity, and price, and the total amount"}'

Suggested Schema

Get a suggested prebuilt schema for a document (from pipeline classification or on-demand detection). Session only — not on /api/v1.


curl -X GET "https://your-domain.com/api/extract/suggested-schema?document_id=doc-uuid" \
  -H "Cookie: your-session-cookie"

Response:


{
  "suggestedSchema": {
    "id": "prebuilt-invoice",
    "name": "Invoice",
    "formType": "invoice",
    "category": "financial",
    "fieldCount": 12
  }
}

Batch Comparison

Compare extractions across documents (session only):


curl -X GET "https://your-domain.com/api/extract/batch/{id}/comparison?only_discrepancies=true" \
  -H "Cookie: your-session-cookie"

View:

Field values across documents
Statistical variance
Outliers and anomalies

CLI Support

Extract from the command line:


# Extract with schema
docld extract invoice.pdf -s schemas/invoice.json
 
# Extract multiple files
docld extract ./invoices -s schemas/invoice.json
 
# Include citations
docld extract invoice.pdf -s schema.json --citations

Page usage

Extraction consumes pages based on document size (1 page = 1 unit). Pricing: 200 free pages per month, then $0.05 per page. See the dashboard or API Reference for balance and usage.
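
As a worked example of the pricing above, the sketch below estimates a month's charge in cents. It assumes the 200 free pages are deducted before billing and that there are no other tiers, rounding rules, or minimums; the constant and function names are illustrative.

```python
FREE_PAGES_PER_MONTH = 200
PRICE_PER_PAGE_CENTS = 5  # $0.05 per page, kept in cents to avoid float rounding


def monthly_cost_cents(pages_used):
    """Estimated charge in cents for the pages extracted in one month."""
    billable = max(0, pages_used - FREE_PAGES_PER_MONTH)
    return billable * PRICE_PER_PAGE_CENTS


# 1,000 pages -> 800 billable pages -> 4000 cents ($40.00)
print(monthly_cost_cents(1000))
```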

API Reference

See the Extractions API for complete endpoint documentation.
