Schema
A schema defines the structure of data to extract from documents. It specifies field names, types (string, number, date, array, object), whether fields are required, and instructions that guide the AI for zero-shot extraction. Schemas are used by extraction to produce structured output with confidence scores and citations.
Schema Structure
    {
      "name": "Invoice",
      "fields": [
        { "name": "invoice_number", "type": "string", "required": true },
        { "name": "total_amount", "type": "number", "required": true },
        { "name": "line_items", "type": "array", "items": { "type": "object" } }
      ],
      "instructions": "Extract all line items; include tax if shown separately."
    }
The `name` identifies the schema, `fields` define what to extract, and `instructions` guide the LLM on edge cases and formatting.
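As a rough illustration of this structure, the Python sketch below loads the example schema and runs a few structural checks on it before use. `check_schema` and `ALLOWED_TYPES` are hypothetical names introduced here for illustration and are not part of DocLD; the list of types is copied from the Field Types table.

```python
import json

# Type names as listed in the Field Types table of this page.
ALLOWED_TYPES = {"string", "number", "boolean", "date", "array", "object"}


def check_schema(schema: dict) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    if not schema.get("name"):
        problems.append("schema is missing a name")
    seen = set()
    for index, field in enumerate(schema.get("fields", [])):
        name = field.get("name")
        if not name:
            problems.append(f"field {index} has no name")
        elif name in seen:
            problems.append(f"duplicate field name: {name}")
        else:
            seen.add(name)
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"field {name!r} has unknown type {field.get('type')!r}")
    return problems


# The Invoice schema from the example above, parsed from its JSON form.
invoice_schema = json.loads("""
{
  "name": "Invoice",
  "fields": [
    { "name": "invoice_number", "type": "string", "required": true },
    { "name": "total_amount", "type": "number", "required": true },
    { "name": "line_items", "type": "array", "items": { "type": "object" } }
  ],
  "instructions": "Extract all line items; include tax if shown separately."
}
""")

print(check_schema(invoice_schema))
```

A check like this is cheap to run locally and catches typos in type names before a schema is used for extraction.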
Field Types
| Type | Example | Notes |
|---|---|---|
| string | "INV-001" | Text values |
| number | 1250.00 | Numeric values; specify precision in instructions |
| boolean | true | Yes/no, present/absent |
| date | "2024-01-15" | ISO 8601 format by default |
| array | [{...}, {...}] | List of items; define items type |
| object | {"name": "..."} | Nested structure; define nested fields |
Field types constrain output and help the model format values correctly. Use required: true for critical fields.
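To make the effect of field types and `required` concrete, here is a minimal Python sketch that checks an extracted record against a list of field definitions. It is a local illustration only, not how DocLD validates output; `matches_type` and `validate_record` are hypothetical helpers, and the sketch assumes extracted dates arrive as ISO 8601 strings.

```python
from datetime import date


def matches_type(value, type_name: str) -> bool:
    """Check one value against one of the type names from the table above."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)  # accepts "2024-01-15"
            return True
        except ValueError:
            return False
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "object":
        return isinstance(value, dict)
    return False


def validate_record(record: dict, fields: list[dict]) -> list[str]:
    """Return error messages for missing required fields and type mismatches."""
    errors = []
    for field in fields:
        name = field["name"]
        value = record.get(name)
        if value is None:
            # Optional fields may be absent; required ones may not.
            if field.get("required", False):
                errors.append(f"missing required field: {name}")
            continue
        if not matches_type(value, field["type"]):
            errors.append(f"{name}: expected {field['type']}")
    return errors


fields = [
    {"name": "invoice_number", "type": "string", "required": True},
    {"name": "total_amount", "type": "number", "required": True},
    {"name": "line_items", "type": "array"},
]

# total_amount came back as text, and the optional line_items is absent.
record = {"invoice_number": "INV-001", "total_amount": "1250.00"}
print(validate_record(record, fields))  # ['total_amount: expected number']
```

The missing `line_items` produces no error because it is optional, which is the partial-extraction behavior that optional fields allow.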
Creating Schemas
| Method | Description |
|---|---|
| From document | Generate a schema by analyzing a sample; DocLD infers fields from content |
| From description | Describe what to extract in natural language; DocLD generates schema |
| Prebuilt | Use prebuilt schemas for common types (Invoice, Contract, Resume) |
Instructions help the AI handle edge cases (e.g., "If tax is separate, extract as its own field"). Refine instructions based on extraction quality.
Advanced Types and Validation
- Nested objects — Define `object` fields with nested `fields`
- Array items — Define `items.type` and `items.fields` for arrays of objects
- Validation — Field types enforce structure; instructions guide semantic validation
- Versioning — Schemas can be versioned; use schema ID + version for reproducibility
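The nesting rules in this list can be sketched in Python as follows. The schema is written as a dict that mirrors the JSON form, with a nested object and an array of objects. The `vendor`, `city`, `description`, and `amount` fields are invented for illustration, and `field_paths` is a hypothetical helper that lists every field path, which can be handy when reviewing a deeply nested schema.

```python
def field_paths(fields: list[dict], prefix: str = "") -> list[str]:
    """List dotted paths for all fields, descending into objects and arrays of objects."""
    paths = []
    for field in fields:
        path = prefix + field["name"]
        paths.append(path)
        if field["type"] == "object":
            # Nested object: its sub-fields live under "fields".
            paths.extend(field_paths(field.get("fields", []), path + "."))
        elif field["type"] == "array":
            # Array of objects: sub-fields live under "items.fields".
            items = field.get("items", {})
            if items.get("type") == "object":
                paths.extend(field_paths(items.get("fields", []), path + "[]."))
    return paths


# Illustrative schema; field names below "vendor" and "line_items" are assumptions.
schema = {
    "name": "Invoice",
    "fields": [
        {
            "name": "vendor",
            "type": "object",
            "fields": [
                {"name": "name", "type": "string"},
                {"name": "city", "type": "string"},
            ],
        },
        {
            "name": "line_items",
            "type": "array",
            "items": {
                "type": "object",
                "fields": [
                    {"name": "description", "type": "string"},
                    {"name": "amount", "type": "number", "required": True},
                ],
            },
        },
    ],
}

for path in field_paths(schema["fields"]):
    print(path)
```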
Best Practices
- Clear instructions — Be specific about edge cases (nulls, rounding, formats)
- Required vs optional — Mark critical fields as required; optional fields allow partial extraction
- Use prebuilt — Start with prebuilt schemas; customize as needed
- Iterate — Refine schemas based on confidence scores and ground truth feedback
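One way to act on the last point is to compare extracted values with ground truth field by field, then refine the instructions for the fields that score worst. The Python sketch below assumes each extraction result is available as a plain dict of field values; `per_field_accuracy` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import defaultdict


def per_field_accuracy(pairs) -> dict:
    """pairs: iterable of (extracted, truth) dicts. Returns {field: fraction correct}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for extracted, truth in pairs:
        for name, expected in truth.items():
            total[name] += 1
            if extracted.get(name) == expected:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Two made-up documents: the second has a wrong total_amount.
pairs = [
    (
        {"invoice_number": "INV-001", "total_amount": 1250.0},
        {"invoice_number": "INV-001", "total_amount": 1250.0},
    ),
    (
        {"invoice_number": "INV-002", "total_amount": 90.0},
        {"invoice_number": "INV-002", "total_amount": 99.0},
    ),
]

print(per_field_accuracy(pairs))  # {'invoice_number': 1.0, 'total_amount': 0.5}
```

Here `total_amount` would be the first candidate for clearer instructions, for example on rounding or on whether tax is included.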
Related Concepts
Schemas drive extraction. Prebuilt schemas accelerate setup. Zero-shot extraction uses schemas without training. Confidence scores and citations accompany extracted values.