Glossary
Definitions for terms used across DocLD documentation and the product. Use this glossary to understand document intelligence concepts, from parsing and chunking to RAG and vector search. Each term page includes Related terms and See also links to other glossary entries and long-form docs for deeper dives.
A
API
Concepts: The DocLD REST API provides programmatic access to document upload, parsing, extraction, chat, and knowledge bases.
API Key
Concepts: A secret credential used to authenticate requests to the DocLD API. It is passed in the Authorization header.
Async Processing
Concepts: Processing that runs in the background. Requests return immediately with a job ID; results are available later.
Authentication
Concepts: Verifying the identity of the caller when accessing the API. DocLD uses API keys in the Authorization header.
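As a rough illustration of header-based authentication, the sketch below builds (but does not send) a request that carries an API key. The base URL, the `/documents` path, the placeholder key, and the `Bearer` scheme are all assumptions made for the example; the glossary only states that the key travels in the Authorization header.

```python
import urllib.request

# Hypothetical values: the base URL, path, and "Bearer" scheme are assumptions,
# not confirmed DocLD details.
BASE_URL = "https://api.example.com/v1"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries the API key."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


req = build_request("/documents", "sk_test_placeholder")
```

Keeping the key in a header rather than in the URL helps keep it out of access logs and browser history.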
Authorization
Concepts: Determining what an authenticated caller is allowed to do. Access to documents, knowledge bases, and resources is scoped by organization and permissions.
B
Batch Processing
Processing: Processing multiple documents together in a single operation. DocLD supports batch uploads and bulk extraction for high-volume document workflows.
Bounding Box
Concepts: Coordinates that define a region in a document (e.g., page and rectangle). Used for highlighting and citation of extracted or retrieved content.
Bucket
Storage: A container for objects in object storage. DocLD uses buckets (e.g., on R2 or S3-compatible storage) for document and artifact storage.
C
Chunk ID
Concepts: A unique identifier for a chunk within a document. Used in metadata, retrieval, and citations to reference the exact passage.
Chunk Overlap
Processing: Overlapping content between adjacent chunks. Overlap can improve recall for content that spans chunk boundaries.
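To make overlap concrete, here is a minimal sketch of a fixed-size sliding window over a list of words. This is a simplification chosen for illustration, not DocLD's chunker (which the glossary describes as semantic); the function name and parameters are hypothetical.

```python
def chunk_with_overlap(words, size, overlap):
    """Split `words` into windows of `size` items that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end
    return chunks


example_chunks = chunk_with_overlap(list("abcdefgh"), size=4, overlap=2)
```

With `size=4` and `overlap=2`, each chunk repeats the last two items of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.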
Chunking
Processing: The process of splitting documents into smaller segments for embedding and retrieval. DocLD uses semantic chunking to respect logical boundaries.
Citation
Concepts: Source attribution linking answers to document passages. Citations show where information came from for transparency and verification.
Completion
AI: The model's generated output in response to an input prompt. Used in chat, extraction, and text generation.
Confidence Score
Concepts: A 0–1 measure of how reliable an extraction or retrieval result is. DocLD provides per-field and overall confidence for extraction.
Confidence Threshold
Concepts: A minimum confidence score below which results are flagged for review or rejected. Used in extraction and sometimes in RAG.
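A threshold check might look like the following sketch, which splits extracted fields into accepted values and values flagged for review. The input shape (field name mapped to a value and a 0–1 confidence) and the sample field names are assumptions for illustration, not DocLD's response format.

```python
def split_by_confidence(fields, threshold):
    """Return (accepted, needs_review) dicts based on a minimum confidence."""
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= threshold else needs_review
        target[name] = value
    return accepted, needs_review


# Hypothetical extraction output: field -> (value, confidence).
sample = {"invoice_number": ("INV-1", 0.97), "total": ("12.50", 0.62)}
accepted, needs_review = split_by_confidence(sample, threshold=0.8)
```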
Content Stream
Processing: In PDF, the stream of graphics and text operators that define how a page is drawn. Text extraction reads from content streams.
Context Window
AI: The maximum amount of input (tokens or text) an LLM can process at once.
Corpus
Concepts: A collection of documents used for search, RAG, or analysis. In DocLD, a knowledge base holds a corpus of documents.
Cosine Similarity
AI: A measure of similarity between two vectors based on the cosine of the angle between them, independent of their magnitudes. Used for ranking vector search results.
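For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python version for small vectors; a vector database would compute this internally, so the function is illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score close to 1 regardless of length, orthogonal vectors score 0, and opposite vectors score -1.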
CSV Export
Concepts: Exporting extracted or tabular data to CSV format. Useful for spreadsheets, reporting, and downstream systems.
D
Dashboard
Concepts: The DocLD web UI for managing documents, knowledge bases, extraction schemas, workflows, API keys, and usage.
Dimensionality
AI: The number of dimensions (size) of an embedding vector. Dimensionality affects storage and similarity computation.
Document Classification
Processing: Assigning documents to categories or types (e.g., invoice, contract, receipt). Supports routing and schema selection for extraction.
Document ID
Concepts: A unique identifier for a document in DocLD. Used in API calls, metadata, and citations to reference the source document.
Document Intelligence
Concepts: AI-powered understanding of documents: parsing, extraction, search, and generation. DocLD delivers document intelligence through parsing, RAG, and extraction.
Document Pipeline
Processing: A sequence of processing steps applied to documents, from upload through parsing, chunking, and optional extraction or indexing.
Document Processing
Processing: The end-to-end pipeline of ingesting, parsing, and transforming documents into searchable or structured data. DocLD handles upload, parsing, chunking, and extraction.
Document Set
Concepts: A defined group of documents (e.g., by filter, folder, or IDs) used for a specific workflow, knowledge base, or extraction job.
Document Upload
Processing: Sending a file to DocLD for processing. Upload triggers parsing, chunking, and embedding for RAG and extraction.
E
Embedding
AI: A numerical vector representation of text used for semantic search. DocLD uses Pinecone llama-text-embed-v2 for document chunks and queries.
Embedding Model
AI: The AI model that converts text into numerical vectors for semantic search.
Embedding Space
AI: The high-dimensional vector space where embeddings live. Similarity in this space corresponds to semantic similarity.
Endpoint
Concepts: A specific URL and HTTP method that exposes an API operation. DocLD endpoints cover upload, parsing, extraction, chat, and more.
Event
Concepts: A notification that something happened. Webhooks deliver events when documents are processed or jobs complete.
Extraction
AI: AI-powered extraction of structured data from documents using schemas. DocLD returns field values with confidence scores and citations.
F
Few-Shot
AI: Providing a small number of examples in the prompt to guide model behavior. Reduces the need for fine-tuning.
Field Mapping
Concepts: Defining how source content or document structure maps to schema fields. Used in extraction and export.
File Format
Concepts: The document type and extension DocLD supports for parsing. Supported formats include PDF, images, spreadsheets, presentations, and office documents.
Fine-Tuning
AI: Training a model on additional data to improve performance on a specific task or domain. Contrast with zero-shot and few-shot.
Form Detection
Processing: Detecting form fields, labels, and input regions in documents to support extraction of key-value pairs and structured form data.
Full-Text Search
AI: Keyword-based search over document content. Matches exact or stemmed words rather than semantic meaning.
I
Idempotency
Concepts: The property that performing an operation multiple times produces the same result as performing it once. Safe retries rely on idempotency.
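One common way to get idempotent behavior is to key each operation and reuse the stored result on a retry. The in-memory sketch below illustrates that pattern; the class, the key format, and the idea that a caller supplies such a key are assumptions for the example, since the glossary does not say how DocLD implements idempotency.

```python
class IdempotentRunner:
    """Run an operation at most once per key and replay the stored result."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # counts how many times work was actually done

    def run(self, key, operation):
        if key not in self._results:
            self._results[key] = operation()
            self.executions += 1
        return self._results[key]


runner = IdempotentRunner()
first = runner.run("upload-42", lambda: "created")
retry = runner.run("upload-42", lambda: "created")  # same key: no second run
```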
Index
Storage: The vector index stores embeddings for similarity search. DocLD uses Pinecone as its vector index for document chunks and queries.
Inference
AI: Running an LLM to generate output from input. Inference happens during RAG answer generation and extraction.
Ingestion
Concepts: The process of bringing documents into the system and making them searchable or processable. Includes upload, parsing, chunking, and indexing.
J
Job
Concepts: An asynchronous processing unit for parsing or extraction. Jobs track document processing status and return results when complete.
JSON Schema
Concepts: A standard for describing the structure and validation of JSON data. DocLD extraction schemas define field names, types, and optional constraints.
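As an illustration, the sketch below defines a small JSON Schema as a Python dict and checks a record against its `required` and `type` keywords. The invoice fields are hypothetical, and the hand-written checker covers only those two keywords; it is not a full JSON Schema validator and not DocLD's schema format.

```python
# Hypothetical schema: field names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number"],
}

# Maps the two JSON Schema type names used above to Python types.
TYPE_MAP = {"string": str, "number": (int, float)}


def find_problems(data, schema):
    """List missing required fields and fields with the wrong type."""
    problems = [f"missing: {name}" for name in schema.get("required", [])
                if name not in data]
    for name, spec in schema["properties"].items():
        if name in data and not isinstance(data[name], TYPE_MAP[spec["type"]]):
            problems.append(f"wrong type: {name}")
    return problems
```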
L
Latency
Concepts: The time from request to response. Lower latency means faster answers and better user experience.
Layout Analysis
Processing: Identifying document structure such as headings, paragraphs, tables, and figures. Layout analysis improves chunking and extraction quality.
LLM
AI: Large Language Model. The AI model that powers RAG answers and extraction in DocLD. LLMs generate text from context and instructions.
P
Pagination
Concepts: Retrieving a large list in pages (e.g., cursor- or offset-based). API list endpoints return paginated results.
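Cursor-based pagination can be consumed with a loop like the sketch below. The page shape (`items` plus `next_cursor`) and the stand-in `fake_fetch` function are assumptions for illustration; the actual field names of DocLD list endpoints are not given in this glossary.

```python
def iter_all(fetch_page):
    """Yield every item by following cursors until none is returned."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break


# Stand-in for an API call: two hypothetical pages keyed by cursor.
PAGES = {
    None: {"items": ["doc_1", "doc_2"], "next_cursor": "page2"},
    "page2": {"items": ["doc_3"], "next_cursor": None},
}


def fake_fetch(cursor):
    return PAGES[cursor]


all_items = list(iter_all(fake_fetch))
```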
Parsing
Processing: Extracting text, tables, and layout from documents. DocLD parses PDFs, images, spreadsheets, presentations, and more with OCR and semantic structure.
PDF/A
Concepts: An ISO standard for long-term archiving of PDFs. PDF/A restricts features to improve preservation and compliance.
Pinecone
Storage: The vector database DocLD uses for embedding storage and similarity search. Pinecone provides integrated embeddings with llama-text-embed-v2.
Prebuilt Schema
Concepts: Ready-to-use extraction schemas for common document types like invoices, contracts, and resumes. Prebuilt schemas speed up extraction setup.
Presigned URL
Storage: A time-limited URL that grants temporary access to upload or download from object storage without exposing credentials.
Prompt Engineering
AI: Designing prompts and instructions to get consistent, accurate behavior from LLMs. Used in extraction and RAG.
Prompting
AI: The practice of crafting instructions and context for LLMs to guide behavior and output quality.
R
RAG
AI: Retrieval-Augmented Generation combines document retrieval with AI generation for accurate, citation-backed answers from your documents.
Rate Limit
Concepts: A cap on the number of requests or operations per time period. Rate limits prevent overloading the API.
Reindex
Concepts: Re-running ingestion (parse, chunk, embed, upsert) on already-uploaded documents. Used after changing chunking or embedding settings.
Reranking
AI: Reordering retrieval results by relevance before generation. DocLD supports heuristic, LLM, and hybrid reranking for improved RAG accuracy.
Retrieval
AI: The step of fetching relevant documents or chunks for a query. In RAG, retrieval precedes generation.
S
Scanned Document
Processing: A document stored as images of pages (e.g., from a scanner). Text must be extracted using OCR.
Schema
Concepts: A JSON schema that defines which fields to extract from documents. Schemas specify field names, types, and instructions for AI extraction.
SDK
Concepts: Software development kit: libraries and tools for integrating with the DocLD API in your preferred language.
Semantic Search
AI: Search by meaning rather than exact keywords. Uses embeddings and vector similarity to find relevant content.
Session
Concepts: A chat conversation with history. Sessions preserve context across messages for multi-turn RAG chat.
Similarity Score
AI: A measure of how similar two embedding vectors are. Higher scores indicate greater semantic closeness.
Source Document
Concepts: The original file or document uploaded for processing. Parsing and extraction operate on the source document.
Structured Data
Concepts: Data in a fixed schema (e.g., JSON, CSV) with named fields and types. Extraction produces structured data from unstructured documents.
System Prompt
AI: Instructions that set the LLM's role, behavior, and constraints. Used in chat and RAG to enforce citation and tone.
T
Table Extraction
Processing: Identifying and extracting tabular data from documents as structured rows and columns. Supports RAG and schema-based extraction.
Temperature
AI: An LLM parameter that controls the randomness of generated output. Lower values make output more deterministic; higher values make it more diverse.
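The effect of temperature is easiest to see in the softmax that turns model scores (logits) into token probabilities: the logits are divided by the temperature before normalizing. The sketch below uses made-up logits and is a generic illustration, not a description of any specific model's sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution
```

At the lower temperature the top token takes a larger share of the probability mass, so sampling picks it more consistently.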
Tenant
Concepts: A logical boundary for multi-tenancy. In DocLD, the organization is the primary tenant; resources and billing are isolated per tenant.
Text Extraction
Processing: Pulling raw text from documents regardless of format. In PDFs and images, this may involve parsing content streams or running OCR.
Throughput
Concepts: The rate at which documents or operations are processed. Higher throughput means more documents per unit time.
Tokenization
AI: Splitting text into tokens for LLMs and embedding models. Token boundaries affect chunk size and context limits.
Top-K
AI: The number of chunks retrieved from vector search before reranking or generation. Top-k balances recall with context size and latency.
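Selecting the top k results from a scored list can be sketched with the standard library. The chunk IDs and scores below are invented for the example; in practice the vector index performs this selection.

```python
import heapq


def top_k(scored_chunks, k):
    """Return the k (chunk_id, score) pairs with the highest scores."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])


# Invented similarity scores for three chunks.
scored = [("chunk_a", 0.2), ("chunk_b", 0.9), ("chunk_c", 0.5)]
best_two = top_k(scored, 2)
```

A larger k passes more context to the model, which tends to help recall but increases prompt size and latency.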
Top-p
AI: A sampling parameter that limits the model to the smallest set of most probable tokens whose cumulative probability reaches or exceeds p. Controls diversity of output.
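A minimal sketch of the top-p (nucleus) filtering step: rank tokens by probability, keep them until the running total reaches p, then renormalize. The token probabilities are invented, and real samplers operate on full vocabularies, so treat this as an illustration of the idea only.

```python
def nucleus_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers at least p
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {token: prob / total for token, prob in kept}


# Invented distribution over three tokens.
nucleus = nucleus_filter({"a": 0.5, "b": 0.3, "c": 0.2}, p=0.7)
```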
Trigger
Concepts: What starts a workflow. Triggers include upload, webhook, schedule, or manual run.
V
Vector
AI: A numerical representation of text or data. Embeddings produce vectors for semantic similarity and search.
Vector Database
Storage: A database optimized for storing and querying vector embeddings. Enables fast similarity search for RAG and semantic search.
Vector Search
AI: Similarity search over embeddings to find documents or passages that match a query by meaning. DocLD uses Pinecone for vector search.
W
Webhook
Concepts: An HTTP callback that notifies your system when events occur. DocLD webhooks fire on document processing, extraction completion, and workflow events.
Webhook Payload
Concepts: The JSON body sent to your webhook URL when an event occurs. Contains event type, resource IDs, and relevant data.
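Handling a payload usually starts with decoding the JSON body and branching on the event type. In the sketch below, the field names (`type`, `data`, `document_id`) and the event name are hypothetical placeholders; the glossary only says that a payload contains an event type, resource IDs, and relevant data.

```python
import json


def parse_event(body: bytes):
    """Decode a webhook body and return (event_type, data)."""
    payload = json.loads(body)
    # "type" and "data" are assumed field names, not a documented format.
    return payload["type"], payload.get("data", {})


# Hypothetical payload for illustration.
raw_body = b'{"type": "document.processed", "data": {"document_id": "doc_123"}}'
event_type, data = parse_event(raw_body)
```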
Workflow
Concepts: An automated document processing pipeline with triggers and integrations. Workflows run parse, extract, or other steps when events occur.
Workflow Run
Concepts: A single execution of a workflow. Each run has a status, logs, and output from the workflow steps.