Extraction
Extraction pulls structured data from documents using AI and a schema. Define the fields you need (e.g., invoice number, total amount, line items), and DocLD extracts values with confidence scores and citations. DocLD uses zero-shot extraction: no training on your documents is required.
How Extraction Works
Document + Schema → AI Extraction → Field Results + Confidence + Citations
- Schema — Define fields, types, and instructions in a schema
- Document — Send document (or document ID) to the extraction API
- Extract — LLM identifies and extracts values matching the schema
- Validate — Confidence scores indicate reliability per field
- Return — Field values, confidence, and citations are returned
- Correct — Manually fix low-confidence or incorrect values; optionally promote to ground truth
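The first two steps can be sketched as plain data. The schema layout below (the `name`, `type`, and `instructions` keys), the `document_id` key, and the `build_extraction_request` helper are illustrative assumptions, not DocLD's documented request format; check the API reference for the actual shape.

```python
import json

# Hypothetical schema layout: every key name here is an assumption for illustration.
invoice_schema = {
    "name": "invoice",
    "fields": [
        {"name": "invoice_number", "type": "string",
         "instructions": "The identifier printed near the top of the invoice."},
        {"name": "total_amount", "type": "number",
         "instructions": "Grand total including tax."},
        {"name": "line_items", "type": "array",
         "instructions": "One entry per billed item."},
    ],
}


def build_extraction_request(document_id: str, schema: dict) -> str:
    """Serialize a document reference and a schema into a JSON request body."""
    if not schema.get("fields"):
        raise ValueError("schema must define at least one field")
    return json.dumps({"document_id": document_id, "schema": schema})


# "doc_123" is a placeholder ID, not a real document.
body = build_extraction_request("doc_123", invoice_schema)
print(body)
```

Keeping the schema as data separate from the request makes it easy to reuse the same schema across many documents.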
Extraction runs asynchronously for large documents; use jobs and webhooks to track completion.
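When a webhook is not available, completion can be tracked by polling the job. The following is a minimal sketch: `fetch_status` is a caller-supplied function, and the status strings `queued` and `processing` are assumed names for in-progress states rather than values confirmed by the DocLD API.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], str],
                 interval_s: float = 2.0,
                 timeout_s: float = 300.0) -> str:
    """Poll until the job leaves the in-progress states or the timeout passes.

    `fetch_status` is a hypothetical callable that returns the job's
    current status string, for example by calling a job-status endpoint.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status not in ("queued", "processing"):  # assumed in-progress states
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("extraction job did not finish in time")
        time.sleep(interval_s)


# Simulated job that finishes on the third poll.
statuses = iter(["queued", "processing", "completed"])
print(wait_for_job(lambda: next(statuses), interval_s=0.0))
```

Webhooks avoid the repeated requests that polling makes, so polling is usually best kept as a fallback.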
Pipeline Diagram
| Stage | Input | Output |
|---|---|---|
| Parse | Document file | Text, tables, layout |
| Extract | Text + Schema | Field values |
| Score | Extracted values | Confidence scores |
| Cite | Source passages | Citations per field |
Parsing runs first, unless the document already has parsed content, in which case that content is reused. Extraction then applies the schema to the parsed text.
Output
| Component | Description |
|---|---|
| Field values | Extracted data in your schema structure |
| Confidence | Per-field and overall scores (0–1) |
| Citations | Where each value was found (text, page, coordinates) |
Use confidence scores to prioritize review: verify low-confidence fields against their cited source passages first.
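A review queue built from confidence scores might look like the sketch below. The result shape (a mapping from field name to `value`, `confidence`, and `citation`), the sample values, and the 0.85 threshold are all assumptions chosen for illustration; the real response format may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical result shape with made-up sample values.
result = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.98,
                       "citation": {"page": 1, "text": "Invoice # INV-1042"}},
    "total_amount": {"value": 1290.50, "confidence": 0.62,
                     "citation": {"page": 2, "text": "Total due 1,290.50"}},
    "due_date": {"value": "2024-07-01", "confidence": 0.81,
                 "citation": {"page": 1, "text": "Due 1 July 2024"}},
}


def fields_to_review(fields: Dict[str, dict],
                     threshold: float = 0.85) -> List[Tuple[str, float]]:
    """Return (field, confidence) pairs below the threshold, least confident first."""
    low = [(name, f["confidence"]) for name, f in fields.items()
           if f["confidence"] < threshold]
    return sorted(low, key=lambda pair: pair[1])


# Show each flagged field next to the passage a reviewer should check.
for name, conf in fields_to_review(result):
    cite = result[name]["citation"]
    print(f"{name}: confidence {conf:.2f}, check page {cite['page']}: {cite['text']!r}")
```

A suitable threshold depends on the document type and the cost of an error, so it is worth tuning against verified values rather than fixing it up front.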
Retries and Error Handling
- Retries — Failed extractions can be retried; transient errors (rate limits, timeouts) may succeed on retry
- Partial results — Some fields may extract successfully while others fail; results include per-field status
- Error details — API returns error messages for failed extractions; inspect schema and document for issues
For batch processing, handle partial success: some documents may extract fully while others fail or return low confidence.
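Retrying transient errors and tolerating partial success in a batch can be combined, as in the sketch below. `TransientError`, `with_retries`, and `extract_batch` are hypothetical names, and `extract` stands in for whatever function calls the extraction API; exponential backoff is one common retry policy, not one the DocLD documentation prescribes.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure such as a rate limit or timeout."""


def with_retries(call: Callable[[], T], attempts: int = 3,
                 base_delay_s: float = 1.0) -> T:
    """Run `call`, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def extract_batch(doc_ids: List[str], extract: Callable[[str], dict],
                  base_delay_s: float = 1.0) -> Dict[str, dict]:
    """Extract each document independently so one failure does not abort the batch."""
    outcomes: Dict[str, dict] = {}
    for doc_id in doc_ids:
        try:
            result = with_retries(lambda: extract(doc_id), base_delay_s=base_delay_s)
            outcomes[doc_id] = {"ok": True, "result": result}
        except Exception as exc:  # non-transient, or retries exhausted
            outcomes[doc_id] = {"ok": False, "error": str(exc)}
    return outcomes
```

Only errors marked transient are retried; anything else is recorded against its document immediately, which keeps a malformed file from consuming the retry budget.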
Prebuilt Schemas
DocLD offers prebuilt schemas for common document types: Invoice, Receipt, Contract, NDA, Resume, Bank Statement, and more. Use them directly or as templates for custom schemas. Form detection can suggest a prebuilt schema for mixed document batches.
Ground Truth
Set verified values to measure extraction accuracy and compare runs. Ground truth enables quality analytics and schema tuning. Corrections made during review can be promoted to ground truth for future accuracy measurement.
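One simple way to measure accuracy against ground truth is per-field exact match, sketched below. This is an assumed metric for illustration: the source does not state how DocLD computes accuracy, and `field_accuracy` and the sample records are hypothetical.

```python
from typing import Dict, Iterable


def field_accuracy(runs: Iterable[Dict[str, object]],
                   truths: Iterable[Dict[str, object]]) -> Dict[str, float]:
    """Per-field exact-match accuracy over paired (extracted, ground truth) records."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for extracted, truth in zip(runs, truths):
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            # A missing field counts as a miss because .get returns None.
            if extracted.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / n for field, n in total.items()}


# Made-up records: the second document's total was extracted incorrectly.
extracted = [
    {"invoice_number": "INV-1", "total_amount": 100.0},
    {"invoice_number": "INV-2", "total_amount": 250.0},
]
ground_truth = [
    {"invoice_number": "INV-1", "total_amount": 100.0},
    {"invoice_number": "INV-2", "total_amount": 205.0},
]
print(field_accuracy(extracted, ground_truth))
```

Running the same comparison before and after a schema change gives a per-field view of whether the change helped.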
Related Concepts
Extraction uses schemas and produces confidence scores and citations. Prebuilt schemas accelerate setup. Zero-shot extraction requires no training. Jobs and webhooks track async extraction.