Best Practices

Tips, patterns, and recommendations for getting the most out of DocLD.

Document Preparation

File Quality

Use high-resolution scans — For scanned PDFs and images, 300 DPI or higher improves OCR accuracy
Prefer native digital documents — PDFs with embedded text layers parse faster and more accurately than scanned pages
Avoid heavy compression — JPEG artifacts and low-quality scans can degrade text extraction
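
As a rough pre-upload check, the 300 DPI guideline can be tested from a scan's pixel dimensions and the physical page size. The sketch below is illustrative only: the function names and the US Letter default page size are assumptions and are not part of DocLD.

```python
MIN_OCR_DPI = 300  # guideline from the list above


def effective_dpi(pixel_width: int, pixel_height: int,
                  width_in: float, height_in: float) -> float:
    """Return the lower of the horizontal and vertical scan resolutions."""
    if width_in <= 0 or height_in <= 0:
        raise ValueError("physical dimensions must be positive")
    return min(pixel_width / width_in, pixel_height / height_in)


def meets_ocr_guideline(pixel_width: int, pixel_height: int,
                        width_in: float = 8.5, height_in: float = 11.0) -> bool:
    """Check a scan against the 300 DPI guideline (default page: US Letter, an assumption)."""
    return effective_dpi(pixel_width, pixel_height, width_in, height_in) >= MIN_OCR_DPI
```

For example, a US Letter page scanned at 2550 x 3300 pixels sits exactly at 300 DPI, while 1275 x 1650 pixels is only 150 DPI and is worth rescanning.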

Format Choice

Format	Best for	Notes
PDF (text layer)	Contracts, reports	Fastest processing, best accuracy
PDF (scanned)	Legacy documents	OCR runs automatically; may take longer
Images (PNG, JPEG)	Forms, receipts	OCR required; ensure good contrast
Spreadsheets	Tables, structured data	Preserves cell structure
Presentations	Slide content	Extracts text and speaker notes

When OCR Helps

OCR runs automatically when:

The file is an image (PNG, JPEG, TIFF, etc.)
A PDF has no extractable text (scanned document)
Parsing yields little or no text

For mixed documents (some pages scanned, some digital), DocLD handles each page appropriately.
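
The conditions above can be pictured as a simple decision rule. This is a minimal sketch and not DocLD's actual logic: the extension set, the 20-character cutoff for "little or no text", and both function names are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; the real list may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
# Assumed cutoff for "little or no text".
MIN_TEXT_CHARS = 20


def needs_ocr(filename: str, extracted_text: str) -> bool:
    """Return True if the file is an image or parsing produced almost no text."""
    if Path(filename).suffix.lower() in IMAGE_EXTENSIONS:
        return True
    return len(extracted_text.strip()) < MIN_TEXT_CHARS


def pages_needing_ocr(page_texts: list[str]) -> list[int]:
    """For a mixed document, return the zero-based indexes of pages with almost no text."""
    return [i for i, text in enumerate(page_texts)
            if len(text.strip()) < MIN_TEXT_CHARS]
```

Running a check like this locally can help predict which uploads will take the slower OCR path.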

Chunking and Retrieval

Chunking Strategy

DocLD uses layout-aware chunking. You can tune settings per knowledge base:

Semantic boundaries — Splitting at paragraph and section boundaries preserves context
Size limits — Default chunk sizes balance retrieval relevance and context length
Overlap — Optional overlap between chunks helps queries that span boundaries

See the RAG Setup Guide for configuration details.
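
To make the three settings concrete, here is a simplified sketch of paragraph-boundary chunking with a size limit and optional overlap. It is not DocLD's layout-aware chunker. The function name, the character-based limit, and the paragraph-based overlap are all assumptions made for illustration.

```python
def chunk_paragraphs(text: str, max_chars: int = 1000,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Split text at blank lines, packing whole paragraphs into chunks.

    The size limit is soft: a single paragraph longer than max_chars is
    kept whole rather than split mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk at a paragraph boundary.
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraphs forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With overlap enabled, the last paragraph of one chunk is repeated at the start of the next, so a query that spans the boundary can still match a single chunk.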

Structuring Knowledge Bases

One domain per knowledge base — Keep related documents together (e.g., “HR Policies”, “Contract Library”)
Curate content — Exclude irrelevant sections or documents that add noise
Update regularly — Remove outdated content; add new documents as they become available

Extraction

Schema Design

Write clear field descriptions — Tell the model where to find data: “The invoice number, usually at the top near the date”
Use appropriate granularity — Prefer specific fields (e.g., invoice_date, due_date) over vague ones (dates)
Add schema-level instructions — Explain edge cases: “If tax is shown separately, extract it as its own field”
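
The sketch below shows what a schema following these three tips might look like, written as a plain Python dictionary. The structure (the "instructions" and "fields" keys, the type names) is hypothetical and is not DocLD's schema format, which is described in the Custom Extraction Guide. The helper simply flags fields that have no description.

```python
# Hypothetical schema layout, for illustration only.
invoice_schema = {
    "instructions": "If tax is shown separately, extract it as its own field.",
    "fields": {
        "invoice_number": {
            "type": "string",
            "description": "The invoice number, usually at the top near the date",
        },
        "invoice_date": {
            "type": "date",
            "description": "The date the invoice was issued",
        },
        "due_date": {
            "type": "date",
            "description": "The date payment is due",
        },
        "tax_amount": {
            "type": "number",
            "description": "Tax shown as a separate line item, if present",
        },
    },
}


def fields_missing_description(schema: dict) -> list[str]:
    """Return the names of fields whose description is absent or blank."""
    return [name for name, spec in schema["fields"].items()
            if not spec.get("description", "").strip()]
```

Note that invoice_date and due_date are separate fields rather than a single dates field, which matches the granularity tip above.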

Confidence Handling

Confidence	Action
0.9+	High confidence; typically safe to automate
0.7 to < 0.9	Medium; consider human review for critical fields
< 0.7	Low; review or refine schema/instructions

Use the Custom Extraction Guide for schema examples and field types.
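
The confidence bands in the table can be applied in code when routing extracted fields. This is a minimal sketch: the function name and the returned labels are assumptions, and a score of exactly 0.9 is treated as high confidence.

```python
def review_action(confidence: float) -> str:
    """Map a confidence score in [0, 1] to the action suggested in the table."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.9:
        return "automate"            # high confidence
    if confidence >= 0.7:
        return "review_if_critical"  # medium confidence
    return "review_or_refine"        # low confidence
```

A pipeline might send "review_if_critical" results to a human only for fields such as totals or due dates, and use a rising share of "review_or_refine" results as a signal to revisit the schema.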

RAG and Chat

Query Optimization

Be specific — “What is the vacation policy for full-time employees?” works better than “vacation?”
Provide context — For multi-document bases, mention document type or topic when relevant
Use follow-ups — Conversation context improves multi-turn answers

Citation Quality

Check retrieval settings — Adjust top_k and threshold in RAG configuration
Enable reranking — Improves relevance of retrieved chunks
Choose response mode — Use thorough for complex questions needing more citations
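
To see how top_k and threshold interact, consider the sketch below. It assumes that each retrieved chunk carries a similarity score where higher means more relevant, that threshold is a minimum score, and that top_k caps the number of chunks returned. DocLD's actual retrieval semantics may differ, and the function name is hypothetical.

```python
def select_chunks(scored_chunks: list[tuple[str, float]],
                  top_k: int = 5,
                  threshold: float = 0.5) -> list[tuple[str, float]]:
    """Keep chunks scoring at or above threshold, then return the top_k best."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

Under these assumptions, raising threshold removes weak matches that can produce irrelevant citations, while raising top_k gives the model more candidate sources at the cost of more noise.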

Knowledge Base Organization

Single topic per KB — Avoid mixing unrelated domains
Appropriate size — Not too small (limited coverage) or too large (noise and slower retrieval)
Related content — Documents should logically relate to each other

Performance and Cost

Batching

Batch uploads — Use the batch API for multiple documents instead of sequential uploads
Batch extraction — Run extractions across many documents in a single job
Batch document addition — Add multiple documents to a knowledge base in one request
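
When preparing batch requests, a small helper can group file paths or document IDs into fixed-size batches. The helper below is a generic sketch. The batch size is an assumption, since any per-request limit of the batch API is not stated here.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive slices of items, each with at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each yielded list can then be sent as one batch request instead of issuing one request per document.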

Reprocessing

Monitor status — Use the dashboard or API to track processing status
Reprocess failed documents — Fix issues (e.g., file corruption) and retry from the document detail page
Update and reprocess — When chunking or vectorization settings change, reprocess affected documents
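
Status monitoring through the API usually comes down to a polling loop. The sketch below is deliberately generic: get_status stands in for whatever call returns a document's processing status, and the status strings "completed" and "failed" are assumptions rather than documented DocLD values.

```python
import time
from typing import Callable

# Assumed terminal status names; check the API reference for the real ones.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_until_processed(get_status: Callable[[], str],
                         poll_seconds: float = 2.0,
                         max_polls: int = 30,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status until a terminal status appears or max_polls is reached.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    if max_polls < 1:
        raise ValueError("max_polls must be at least 1")
    status = ""
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return status
```

A caller would treat a "failed" result as the cue to fix the underlying issue and retry, and a non-terminal result as a timeout worth investigating.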

Monitoring

Use Analytics — Track usage, quality metrics, and costs
Review feedback — Thumbs up/down and user signals help tune retrieval and prompts
Check confidence scores — Low scores may indicate schema or document quality issues
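
One way to act on confidence scores is to average them per field across many extractions, which helps separate a weak field definition from a single bad document. This is a minimal sketch under assumed inputs: each result is taken to be a mapping from field name to confidence, and the 0.7 cutoff and function name are illustrative.

```python
from collections import defaultdict
from statistics import mean


def low_confidence_fields(results: list[dict[str, float]],
                          threshold: float = 0.7) -> dict[str, float]:
    """Return fields whose mean confidence across results is below threshold."""
    by_field: dict[str, list[float]] = defaultdict(list)
    for result in results:
        for field, confidence in result.items():
            by_field[field].append(confidence)
    averages = {field: mean(values) for field, values in by_field.items()}
    return {field: avg for field, avg in averages.items() if avg < threshold}
```

A field that is consistently low across documents points to a schema or instruction problem, while low scores concentrated in a few documents point to document quality.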

Quick Reference

Area	Key tip
Documents	Prefer native PDFs; use 300+ DPI for scans
Chunking	Semantic strategy; tune overlap for your use case
Extraction	Clear descriptions; specific fields; handle edge cases in instructions
RAG	Specific queries; single-topic KBs; tune retrieval settings
Cost	Batch operations; monitor analytics; reprocess only when needed