Schema

A schema defines the structure of the data to extract from documents. It specifies field names, field types (string, number, boolean, date, array, object), whether each field is required, and instructions that guide the AI during zero-shot extraction. Extraction uses the schema to produce structured output with confidence scores and citations.

Schema Structure

{
  "name": "Invoice",
  "fields": [
    { "name": "invoice_number", "type": "string", "required": true },
    { "name": "total_amount", "type": "number", "required": true },
    { "name": "line_items", "type": "array", "items": { "type": "object" } }
  ],
  "instructions": "Extract all line items; include tax if shown separately."
}

`name` identifies the schema, `fields` defines what to extract, and `instructions` guides the LLM on edge cases and formatting.
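As a rough illustration of how the three keys fit together, the sketch below parses the example schema in Python and lists the fields marked as required. The helper `required_field_names` is hypothetical and not part of DocLD, and treating a field with no `required` key as optional is an assumption.

```python
import json

# The example schema from above, embedded as a JSON string.
SCHEMA_JSON = """
{
  "name": "Invoice",
  "fields": [
    { "name": "invoice_number", "type": "string", "required": true },
    { "name": "total_amount", "type": "number", "required": true },
    { "name": "line_items", "type": "array", "items": { "type": "object" } }
  ],
  "instructions": "Extract all line items; include tax if shown separately."
}
"""


def required_field_names(schema: dict) -> list:
    """Return the names of fields marked required.

    Assumption: a field without a "required" key is optional.
    """
    return [f["name"] for f in schema["fields"] if f.get("required", False)]


schema = json.loads(SCHEMA_JSON)
print(schema["name"], required_field_names(schema))
```

Here `line_items` is left out of the result because it carries no `required` key.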

Field Types

Type	Example	Notes
`string`	"INV-001"	Text values
`number`	1250.00	Numeric values; specify precision in instructions
`boolean`	true	Yes/no, present/absent
`date`	"2024-01-15"	ISO 8601 format by default
`array`	`[{...}, {...}]`	List of items; define `items` type
`object`	`{"name": "..."}`	Nested structure; define nested fields

Field types constrain the output and help the model format values correctly. Set `required: true` on critical fields.
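To make the type table concrete, here is a minimal sketch of a client-side check that an extracted value matches its declared type. The function `matches_type` is a hypothetical helper, not a DocLD API, and the mapping from schema types to Python types is an assumption based on the examples in the table.

```python
from datetime import date


def matches_type(value, field_type: str) -> bool:
    """Check one extracted value against a schema field type (hypothetical helper)."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)  # parses ISO 8601 dates such as "2024-01-15"
            return True
        except ValueError:
            return False
    if field_type == "array":
        return isinstance(value, list)
    if field_type == "object":
        return isinstance(value, dict)
    raise ValueError(f"unknown field type: {field_type}")


print(matches_type("INV-001", "string"))
print(matches_type("15/01/2024", "date"))
```

A check like this only covers structure; whether `total_amount` is the right number on the page is a semantic question that the instructions and confidence scores address.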

Creating Schemas

Method	Description
From document	Generate a schema by analyzing a sample; DocLD infers fields from content
From description	Describe what to extract in natural language; DocLD generates schema
Prebuilt	Use prebuilt schemas for common types (Invoice, Contract, Resume)

Instructions help the AI handle edge cases (e.g., "If tax is separate, extract as its own field"). Refine instructions based on extraction quality.

Advanced Types and Validation

Nested objects — Define object fields with nested fields
Array items — Define items.type and items.fields for arrays of objects
Validation — Field types enforce structure; instructions guide semantic validation
Versioning — Schemas can be versioned; use schema ID + version for reproducibility
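The nesting rules above can be sketched as a schema with one nested object and one array of objects, plus a small walker that lists every field path. The `vendor` field, the sub-field names, and the `field_paths` helper are made up for illustration; using a `fields` key inside an object field is an assumption that mirrors the `items.fields` convention for arrays.

```python
# Illustrative schema: a nested object and an array of objects.
nested_schema = {
    "name": "Invoice",
    "fields": [
        {"name": "invoice_number", "type": "string", "required": True},
        {
            "name": "vendor",  # hypothetical nested object field
            "type": "object",
            "fields": [{"name": "name", "type": "string"}],
        },
        {
            "name": "line_items",
            "type": "array",
            "items": {
                "type": "object",
                "fields": [
                    {"name": "description", "type": "string"},
                    {"name": "amount", "type": "number"},
                ],
            },
        },
    ],
}


def field_paths(fields, prefix=""):
    """Recursively list dotted paths for all fields, marking array items with []."""
    paths = []
    for f in fields:
        path = prefix + f["name"]
        paths.append(path)
        if f["type"] == "object":
            paths.extend(field_paths(f.get("fields", []), path + "."))
        elif f["type"] == "array":
            items = f.get("items", {})
            if items.get("type") == "object":
                paths.extend(field_paths(items.get("fields", []), path + "[]."))
    return paths


for p in field_paths(nested_schema["fields"]):
    print(p)
```

Listing the paths this way is a quick sanity check that a nested schema describes the structure you expect before running extraction against it.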

Best Practices

Clear instructions — Be specific about edge cases (nulls, rounding, formats)
Required vs optional — Mark critical fields as required; optional fields allow partial extraction
Use prebuilt — Start with prebuilt schemas; customize as needed
Iterate — Refine schemas based on confidence scores and ground truth feedback
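One way to act on the last two practices is to flag fields that are required but missing, or that came back with low confidence, and use that list to decide which instructions to refine. The sketch below assumes a result shape of `{"value": ..., "confidence": ...}` per field and a threshold of 0.8; both are assumptions for illustration, not DocLD's documented output format, and `fields_needing_review` is a hypothetical helper.

```python
def fields_needing_review(schema, extraction, threshold=0.8):
    """Return field names that are required but missing, or below the confidence threshold.

    Assumption: each extracted field looks like {"value": ..., "confidence": float}.
    """
    flagged = []
    for field in schema["fields"]:
        name = field["name"]
        result = extraction.get(name)
        if result is None or result.get("value") is None:
            # Missing optional fields are acceptable (partial extraction).
            if field.get("required", False):
                flagged.append(name)
            continue
        if result.get("confidence", 0.0) < threshold:
            flagged.append(name)
    return flagged


invoice_schema = {
    "name": "Invoice",
    "fields": [
        {"name": "invoice_number", "type": "string", "required": True},
        {"name": "total_amount", "type": "number", "required": True},
        {"name": "line_items", "type": "array", "items": {"type": "object"}},
    ],
}

# Hypothetical extraction output: total_amount is missing, line_items is uncertain.
extraction = {
    "invoice_number": {"value": "INV-001", "confidence": 0.97},
    "line_items": {"value": [], "confidence": 0.55},
}

print(fields_needing_review(invoice_schema, extraction))
```

Fields that keep appearing in this list across many documents are good candidates for more specific instructions or a changed type.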

In short, schemas drive extraction: zero-shot extraction applies a schema without any training, prebuilt schemas speed up setup, and every extracted value comes with a confidence score and citations.

