Best Practices
Tips, patterns, and recommendations for getting the most out of DocLD.
Document Preparation
File Quality
- Use high-resolution scans — For scanned PDFs and images, 300 DPI or higher improves OCR accuracy
- Prefer native digital documents — PDFs with embedded text layers parse faster and more accurately than scanned pages
- Avoid heavy compression — JPEG artifacts and low-quality scans can degrade text extraction
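To check a scan against the 300 DPI guideline before uploading, the resolution can be estimated from the pixel dimensions and the physical page size. The sketch below is illustrative only: the function names and the US Letter default (8.5 x 11 inches) are assumptions, not part of DocLD.

```python
def effective_dpi(width_px: int, height_px: int,
                  width_in: float, height_in: float) -> float:
    """Return the lower of the horizontal and vertical resolutions."""
    if width_in <= 0 or height_in <= 0:
        raise ValueError("page dimensions must be positive")
    return min(width_px / width_in, height_px / height_in)


def meets_scan_guideline(width_px: int, height_px: int,
                         width_in: float = 8.5, height_in: float = 11.0,
                         min_dpi: float = 300.0) -> bool:
    """True if the scan reaches min_dpi on both axes (US Letter assumed by default)."""
    return effective_dpi(width_px, height_px, width_in, height_in) >= min_dpi
```

For example, a US Letter page scanned at 2550 x 3300 pixels meets the guideline, while the same page at 1700 x 2200 pixels (about 200 DPI) does not.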
Format Choice
| Format | Best for | Notes |
|---|---|---|
| PDF (text layer) | Contracts, reports | Fastest processing, best accuracy |
| PDF (scanned) | Legacy documents | OCR runs automatically; may take longer |
| Images (PNG, JPEG) | Forms, receipts | OCR required; ensure good contrast |
| Spreadsheets | Tables, structured data | Preserves cell structure |
| Presentations | Slide content | Extracts text and speaker notes |
When OCR Helps
OCR runs automatically when:
- The file is an image (PNG, JPEG, TIFF, etc.)
- A PDF has no extractable text (scanned document)
- Parsing yields little or no text
For mixed documents (some pages scanned, some digital), DocLD decides page by page: scanned pages go through OCR, and pages with a text layer are parsed directly.
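The conditions above can be approximated on your side to predict which files or pages will go through OCR, for example to estimate processing time. This is a minimal sketch, not DocLD's actual logic: the extension list and the 20-character cutoff for "little or no text" are assumptions.

```python
from pathlib import Path

# Assumed values for illustration; DocLD's real criteria are not documented here.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
MIN_TEXT_CHARS = 20


def needs_ocr(filename: str, extracted_text: str) -> bool:
    """Guess whether a file would need OCR: image input, or almost no extractable text."""
    if Path(filename).suffix.lower() in IMAGE_EXTENSIONS:
        return True
    return len(extracted_text.strip()) < MIN_TEXT_CHARS


def pages_needing_ocr(page_texts: list[str]) -> list[int]:
    """Return zero-based indices of pages whose extracted text is nearly empty."""
    return [i for i, text in enumerate(page_texts)
            if len(text.strip()) < MIN_TEXT_CHARS]
```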
Chunking and Retrieval
Chunking Strategy
DocLD uses layout-aware chunking. You can tune settings per knowledge base:
- Semantic boundaries — Splitting at paragraph and section boundaries preserves context
- Size limits — Default chunk sizes balance retrieval relevance and context length
- Overlap — Optional overlap between chunks helps queries that span boundaries
See the RAG Setup Guide for configuration details.
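To make the three settings concrete, here is a simplified sketch of paragraph-boundary chunking with a size limit and overlap. It is not DocLD's layout-aware implementation; the function name, the character-based limit, and the paragraph-level overlap are assumptions chosen for illustration.

```python
def chunk_paragraphs(text: str, max_chars: int = 1000,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Group paragraphs into chunks of up to max_chars, repeating trailing
    paragraphs at the start of the next chunk as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the last paragraph(s) forward so a query spanning the
            # boundary can still match; skip the overlap if it would not fit.
            overlap = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            fits = len("\n\n".join(overlap + [para])) <= max_chars
            current = overlap if fits else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than `max_chars` is kept whole rather than split mid-sentence, so a chunk can exceed the limit in that one case.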
Structuring Knowledge Bases
- One domain per knowledge base — Keep related documents together (e.g., “HR Policies”, “Contract Library”)
- Curate content — Exclude irrelevant sections or documents that add noise
- Update regularly — Remove outdated content; add new documents as they become available
Extraction
Schema Design
- Write clear field descriptions — Tell the model where to find data: “The invoice number, usually at the top near the date”
- Use appropriate granularity — Prefer specific fields (e.g., `invoice_date`, `due_date`) over vague ones (`dates`)
- Add schema-level instructions — Explain edge cases: “If tax is shown separately, extract it as its own field”
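The schema advice above might look like the following. The dictionary layout (`instructions`, `fields`, `name`, `type`, `description`), the type names, and most description texts are hypothetical and only illustrate the idea; see the Custom Extraction Guide for the real schema format. The helper is likewise an assumed pre-flight check, not a DocLD function.

```python
# Hypothetical schema shape, for illustration only.
invoice_schema = {
    "instructions": "If tax is shown separately, extract it as its own field.",
    "fields": [
        {"name": "invoice_number", "type": "string",
         "description": "The invoice number, usually at the top near the date."},
        {"name": "invoice_date", "type": "date",
         "description": "Date the invoice was issued."},
        {"name": "due_date", "type": "date",
         "description": "Date payment is due."},
        {"name": "tax", "type": "number",
         "description": "Tax amount, when shown separately from the total."},
    ],
}


def fields_missing_descriptions(schema: dict, min_words: int = 3) -> list[str]:
    """Return names of fields whose description is absent or too short to guide the model."""
    return [field["name"] for field in schema["fields"]
            if len(field.get("description", "").split()) < min_words]
```

Running the check before submitting a schema is a cheap way to catch fields such as a bare `dates` with no description.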
Confidence Handling
| Confidence | Action |
|---|---|
| ≥ 0.9 | High confidence; typically safe to automate |
| 0.7 to < 0.9 | Medium; consider human review for critical fields |
| < 0.7 | Low; review the result or refine the schema/instructions |
Use the Custom Extraction Guide for schema examples and field types.
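The confidence thresholds in the table above can be turned into a routing rule in a downstream pipeline. This is a sketch under assumptions: the function name, the `critical` flag, and the returned labels are made up for illustration, and the cutoffs should be tuned to your own tolerance for errors.

```python
def route_by_confidence(confidence: float, critical: bool = False) -> str:
    """Map an extraction confidence score (0.0 to 1.0) to a handling decision."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= 0.9:
        return "automate"           # high confidence
    if confidence >= 0.7:
        # medium confidence: only critical fields go to a human
        return "review" if critical else "accept"
    return "review_or_refine"       # low confidence: review, or fix the schema
```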
RAG and Chat
Query Optimization
- Be specific — “What is the vacation policy for full-time employees?” works better than “vacation?”
- Provide context — For multi-document bases, mention document type or topic when relevant
- Use follow-ups — Conversation context improves multi-turn answers
Citation Quality
- Check retrieval settings — Adjust `top_k` and `threshold` in RAG configuration
- Enable reranking — Improves relevance of retrieved chunks
- Choose response mode — Use `thorough` for complex questions needing more citations
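To build intuition for how `top_k` and `threshold` interact, the sketch below filters scored chunks by a minimum score and then keeps the best few. It assumes similarity scores where higher is better; how DocLD applies these settings internally is not specified here, and `ScoredChunk` and `select_chunks` are hypothetical names.

```python
from typing import NamedTuple


class ScoredChunk(NamedTuple):
    text: str
    score: float  # assumed: higher means more relevant


def select_chunks(candidates: list[ScoredChunk], top_k: int = 5,
                  threshold: float = 0.5) -> list[ScoredChunk]:
    """Drop chunks scoring below threshold, then keep the top_k highest-scoring ones."""
    kept = [c for c in candidates if c.score >= threshold]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:top_k]
```

Raising `threshold` removes weak matches and tends to produce fewer, more relevant citations; raising `top_k` gives the model more context at the cost of more noise.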
Knowledge Base Organization
- Single topic per KB — Avoid mixing unrelated domains
- Appropriate size — Not too small (limited coverage) or too large (noise and slower retrieval)
- Related content — Documents should logically relate to each other
Performance and Cost
Batching
- Batch uploads — Use the batch API for multiple documents instead of sequential uploads
- Batch extraction — Run extractions across many documents in a single job
- Batch document addition — Add multiple documents to a knowledge base in one request
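When a batch endpoint limits how many items one request may carry, a small helper can split a long list of document IDs into request-sized groups. This is a generic sketch: the batch size is a placeholder, and the actual limits and request format of the batch API are not covered here.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each yielded group would then be sent as one batch request instead of one request per document.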
Reprocessing
- Monitor status — Use the dashboard or API to track processing status
- Reprocess failed documents — Fix issues (e.g., file corruption) and retry from the document detail page
- Update and reprocess — When chunking or vectorization settings change, reprocess affected documents
Monitoring
- Use Analytics — Track usage, quality metrics, and costs
- Review feedback — Thumbs up/down and user signals help tune retrieval and prompts
- Check confidence scores — Low scores may indicate schema or document quality issues
Quick Reference
| Area | Key tip |
|---|---|
| Documents | Prefer native PDFs; use 300+ DPI for scans |
| Chunking | Semantic strategy; tune overlap for your use case |
| Extraction | Clear descriptions; specific fields; handle edge cases in instructions |
| RAG | Specific queries; single-topic KBs; tune retrieval settings |
| Cost | Batch operations; monitor analytics; reprocess only when needed |