Custom Extraction Guide
Create custom extraction schemas to pull specific data from your documents. See Data Extraction for concepts and the Extract API for schemas and run endpoints.
Overview
Custom extraction lets you define exactly what data to extract from documents. DocLD uses AI to find and extract the specified fields with confidence scores.
Creating a Schema
From the Dashboard
- Go to Extract → Schemas
- Click Create Schema
- Define your fields
- Save the schema
Via API
Use the Extract API schemas endpoints:
curl -X POST "/api/extract/schemas" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "Invoice Schema",
"description": "Extract invoice data",
"instructions": "Extract all invoice details including line items",
"fields": [
{
"name": "invoice_number",
"type": "string",
"description": "The invoice number or ID",
"required": true
},
{
"name": "total_amount",
"type": "number",
"description": "Total amount due",
"required": true
}
]
}'Field Types
Basic Types
| Type | Description | Example |
|---|---|---|
string | Text value | ”INV-001” |
number | Numeric value | 1250.00 |
boolean | True/false | true |
date | Date value | ”2024-01-15” |
Complex Types
Arrays
Extract lists of items:
{
"name": "line_items",
"type": "array",
"description": "List of invoice line items",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "number" },
"price": { "type": "number" }
}
}
}Nested Objects
Extract structured data:
{
"name": "vendor",
"type": "object",
"description": "Vendor information",
"properties": {
"name": { "type": "string" },
"address": { "type": "string" },
"phone": { "type": "string" }
}
}Field Properties
Required Fields
Mark fields as required:
{
"name": "invoice_number",
"type": "string",
"required": true
}Descriptions
Provide context for better extraction:
{
"name": "due_date",
"type": "date",
"description": "The payment due date, usually found near the invoice date"
}Default Values
Specify defaults for missing data:
{
"name": "currency",
"type": "string",
"default": "USD"
}Schema Instructions
Add global instructions to guide extraction:
{
"name": "Invoice Schema",
"instructions": "Extract invoice data carefully. For line items, include all items even if they span multiple pages. If tax is shown separately, extract it as a separate field.",
"fields": [...]
}Good instructions:
- Explain edge cases
- Clarify ambiguous fields
- Describe where to find data
Running Extractions
With a Schema
curl -X POST "/api/extract/run" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"document_id": "doc-uuid",
"schema_id": "schema-uuid",
"include_citations": true
}'With Inline Config
curl -X POST "/api/extract/run" \
-d '{
"document_id": "doc-uuid",
"config": {
"fields": [
{"name": "title", "type": "string"},
{"name": "author", "type": "string"}
]
}
}'Understanding Results
Field Results
{
"field_results": [
{
"field": "invoice_number",
"value": "INV-2024-001",
"confidence": 0.98,
"citations": [
{
"text": "Invoice #: INV-2024-001",
"page": 1,
"bbox": {...}
}
]
}
]
}Confidence Scores
| Score | Meaning |
|---|---|
| 0.9+ | High confidence |
| 0.7-0.9 | Medium confidence |
| < 0.7 | Low confidence - review |
Improving Extraction
Better Field Descriptions
// Less effective
{"name": "date", "type": "date"}
// More effective
{
"name": "invoice_date",
"type": "date",
"description": "The date the invoice was issued, typically at the top of the document near the invoice number"
}Schema Instructions
{
"instructions": "This is a medical invoice. Look for CPT codes in the procedure column. The patient responsibility is usually labeled 'Amount Due' or 'Your Balance'."
}Ground Truth
Set ground truth for accuracy validation:
curl -X POST "/api/extract/ground-truth" \
-d '{
"document_id": "doc-uuid",
"schema_id": "schema-uuid",
"field_values": {
"invoice_number": "INV-001",
"total_amount": 1250.00
}
}'Example Schemas
Invoice Schema
{
"name": "Invoice",
"fields": [
{"name": "invoice_number", "type": "string", "required": true},
{"name": "invoice_date", "type": "date", "required": true},
{"name": "due_date", "type": "date"},
{"name": "vendor_name", "type": "string", "required": true},
{"name": "vendor_address", "type": "string"},
{"name": "subtotal", "type": "number"},
{"name": "tax", "type": "number"},
{"name": "total_amount", "type": "number", "required": true},
{
"name": "line_items",
"type": "array",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type": "number"},
"unit_price": {"type": "number"},
"amount": {"type": "number"}
}
}
}
]
}Contract Schema
{
"name": "Contract",
"fields": [
{"name": "contract_type", "type": "string"},
{"name": "effective_date", "type": "date"},
{"name": "expiration_date", "type": "date"},
{"name": "party_a", "type": "string"},
{"name": "party_b", "type": "string"},
{"name": "contract_value", "type": "number"},
{"name": "payment_terms", "type": "string"},
{"name": "termination_clause", "type": "string"},
{"name": "governing_law", "type": "string"}
]
}Batch Extraction
Extract from multiple documents:
curl -X POST "/api/extract/batch" \
-d '{
"name": "January Invoices",
"document_ids": ["doc1", "doc2", "doc3"],
"schema_id": "invoice-schema",
"run_immediately": true
}'Monitor progress and compare results across documents.
Prebuilt Schemas
Use prebuilt schemas for common document types:
curl -X GET "/api/extract/schemas/prebuilt?category=invoice"Categories: invoice, contract, resume, form, financial