Glossary

Definitions for terms used across DocLD documentation and the product. Use this glossary to understand document intelligence concepts, from parsing and chunking to RAG and vector search. Each term page includes Related terms and See also links to other glossary entries and long-form docs for deeper dives.

113 terms

A

API

The DocLD REST API provides programmatic access to document upload, parsing, extraction, chat, and knowledge bases.

API Key

A secret credential used to authenticate requests to the DocLD API. Passed in the Authorization header.

Async Processing

Processing that runs in the background. Requests return immediately with a job ID; results are available later.
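
As a rough illustration of the pattern described above, a client can poll for a job's status until it reaches a terminal state. This is a minimal sketch, not the DocLD client: the `get_status` callable, the `state` key, and the state names `completed` and `failed` are hypothetical stand-ins for whatever the real API returns.

```python
import time

def wait_for_job(get_status, job_id, interval=1.0, timeout=30.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status(job_id)` until the job finishes or `timeout` elapses.

    `get_status` is a hypothetical callable returning a dict with a "state" key.
    """
    deadline = clock() + timeout
    while True:
        status = get_status(job_id)
        # Terminal states (assumed names): stop polling and hand back the result.
        if status["state"] in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

Where webhooks are available, they avoid polling entirely; a polling loop like this is mainly useful for scripts and tests.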

Audit Log

A chronological record of actions and events for compliance, security, and debugging. DocLD can log API usage, document access, and configuration changes.

Authentication

Verifying the identity of the caller when accessing the API. DocLD uses API keys in the Authorization header.

Authorization

Determining what an authenticated caller is allowed to do. Access to documents, knowledge bases, and resources is scoped by organization and permissions.

B

Batch Processing

Processing multiple documents together in a single operation. DocLD supports batch uploads and bulk extraction for high-volume document workflows.

Bounding Box

Coordinates that define a region in a document (e.g., page and rectangle). Used for highlighting and citation of extracted or retrieved content.

Bucket

A container for objects in object storage. DocLD uses buckets (e.g., on R2 or S3-compatible storage) for document and artifact storage.

C

Chunk ID

A unique identifier for a chunk within a document. Used in metadata, retrieval, and citations to reference the exact passage.

Chunk Overlap

Overlapping content between adjacent chunks. Overlap can improve recall for content that spans chunk boundaries.
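
To make the idea concrete, here is a minimal sketch of fixed-size chunking with overlap over a list of tokens. It is an illustration only: DocLD is described as using semantic chunking, so this sliding window is a simplified stand-in, and the function name and parameters are assumptions.

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split `tokens` into windows of `size` items that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, size=4, overlap=1)
# chunks: [['a','b','c','d'], ['d','e','f','g'], ['g','h','i','j']]
```

Each chunk repeats the last token of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.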

Chunking

The process of splitting documents into smaller segments for embedding and retrieval. DocLD uses semantic chunking to respect logical boundaries.

Citation

Source attribution linking answers to document passages. Citations show where information came from for transparency and verification.

Completion

The model's generated output in response to an input prompt. Used in chat, extraction, and text generation.

Confidence Score

A 0–1 measure of how reliable an extraction or retrieval result is. DocLD provides per-field and overall confidence for extraction.

Confidence Threshold

A minimum confidence score below which results are flagged for review or rejected. Used in extraction and sometimes in RAG.
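
A small sketch of how a threshold might be applied to extraction output. The result shape (a dict of field name to `value` and `confidence`) and the default of 0.8 are assumptions made for illustration, not the documented DocLD response format.

```python
def flag_for_review(fields, threshold=0.8):
    """Return the names of fields whose confidence falls below `threshold`."""
    return [name for name, field in fields.items()
            if field["confidence"] < threshold]

# Hypothetical extraction result for an invoice.
extraction = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.97},
    "total": {"value": "1,250.00", "confidence": 0.62},
    "due_date": {"value": "2026-03-01", "confidence": 0.80},
}
needs_review = flag_for_review(extraction)
# needs_review: ['total']
```

A field exactly at the threshold passes here; whether the boundary is inclusive is a design choice worth stating explicitly in any real pipeline.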

Connector

A prebuilt integration that pulls documents from external sources (e.g., cloud storage, CRM) into DocLD for parsing and indexing.

Content Stream

In PDF, the stream of graphics and text operators that define how a page is drawn. Text extraction reads from content streams.

Context Window

The maximum amount of input (tokens or text) an LLM can process at once.

Corpus

A collection of documents used for search, RAG, or analysis. In DocLD, a knowledge base holds a corpus of documents.

Cosine Similarity

A measure of similarity between two vectors based on the angle between them. Used for ranking vector search results.
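
The standard formula is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; production systems would normally rely on the vector database or a numerical library instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their magnitudes.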

Credits

The unit in which usage is reported. Usage is page-based; API responses may include a credits field equal to the page count for backward compatibility. See pricing for the free tier and per-page rate.

CSV Export

Exporting extracted or tabular data to CSV format. Useful for spreadsheets, reporting, and downstream systems.

D

Dashboard

The DocLD web UI for managing documents, knowledge bases, extraction schemas, workflows, API keys, and usage.

Dimensionality

The number of dimensions (size) of an embedding vector. Dimensionality affects storage and similarity computation.

Document

A file or unit of content (PDF, image, spreadsheet, etc.) that DocLD can parse, index, and use for extraction and RAG.

Document Classification

Assigning documents to categories or types (e.g., invoice, contract, receipt). Supports routing and schema selection for extraction.

Document ID

A unique identifier for a document in DocLD. Used in API calls, metadata, and citations to reference the source document.

Document Intelligence

AI-powered understanding of documents: parsing, extraction, search, and generation. DocLD delivers document intelligence through parsing, RAG, and extraction.

Document Pipeline

A sequence of processing steps applied to documents, from upload through parsing, chunking, and optional extraction or indexing.

Document Processing

The end-to-end pipeline of ingesting, parsing, and transforming documents into searchable or structured data. DocLD handles upload, parsing, chunking, and extraction.

Document Set

A defined group of documents (e.g., by filter, folder, or IDs) used for a specific workflow, knowledge base, or extraction job.

Document Upload

Sending a file to DocLD for processing. Upload triggers parsing, chunking, and embedding for RAG and extraction.

E

Embedding

A numerical vector representation of text used for semantic search. DocLD embeds document chunks and queries with llama-text-embed-v2 and stores the resulting vectors in its vector database.

Embedding Model

The AI model that converts text into numerical vectors for semantic search.

Embedding Space

The high-dimensional vector space where embeddings live. Similarity in this space corresponds to semantic similarity.

Endpoint

A specific URL and HTTP method that exposes an API operation. DocLD endpoints cover upload, parsing, extraction, chat, and more.

Event

A notification that something happened. Webhooks deliver events when documents are processed or jobs complete.

Extraction

AI-powered extraction of structured data from documents using schemas. DocLD returns field values with confidence scores and citations.

F

Few-Shot

Providing a small number of examples in the prompt to guide model behavior. Reduces the need for fine-tuning.

Field Mapping

Defining how source content or document structure maps to schema fields. Used in extraction and export.

File Format

The document type and extension DocLD supports for parsing. Supported formats include PDF, images, spreadsheets, presentations, and office documents.

Fine-Tuning

Training a model on additional data to improve performance on a specific task or domain. Contrast with zero-shot and few-shot.

Form Detection

Detecting form fields, labels, and input regions in documents to support extraction of key-value pairs and structured form data.

Full-Text Search

Keyword-based search over document content. Matches exact or stemmed words rather than semantic meaning.

G

Ground Truth

Verified, human-validated data used to measure extraction accuracy. Ground truth lets you compare AI extraction results against known correct values.

H

Hallucination

When an LLM invents information not present in its context. RAG and citations reduce hallucination by grounding answers in retrieved document content.

Hybrid Search

Combining keyword (full-text) search with vector search for better retrieval across diverse queries.
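
One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below shows that technique as an example; the source does not say which fusion method DocLD uses, and the chunk IDs and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID earns 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

keyword_hits = ["c1", "c2", "c3"]   # hypothetical full-text results
vector_hits = ["c3", "c1", "c4"]    # hypothetical vector search results
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused: ['c1', 'c3', 'c2', 'c4']
```

RRF uses only rank positions, so it needs no normalization between keyword scores and similarity scores, which live on different scales.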

I

Idempotency

The property that performing an operation multiple times produces the same result as performing it once. Safe retries rely on idempotency.
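
A frequently used way to get this property for operations that create something is an idempotency key: the server remembers the result for each key and returns it again on a repeat. The in-memory class below is a sketch of that idea only; the class, method, and field names are hypothetical, and the source does not state how DocLD implements idempotency.

```python
class IdempotentProcessor:
    """Runs each request at most once per idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored result
        self.executions = 0  # how many times real work was done

    def submit(self, idempotency_key, payload):
        # A repeated key returns the stored result without redoing the work.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.executions += 1
        result = {"job_id": f"job-{self.executions}", "payload": payload}
        self._results[idempotency_key] = result
        return result

processor = IdempotentProcessor()
first = processor.submit("upload-abc", {"file": "invoice.pdf"})
second = processor.submit("upload-abc", {"file": "invoice.pdf"})  # a retry
# first == second, and the work ran once
```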

Index

The vector index stores embeddings for similarity search. DocLD uses a vector database as its vector index for document chunks and queries.

Indexing

The process of adding parsed and embedded document chunks to a vector index so they can be searched. Indexing makes documents available for RAG and semantic search.

Inference

Running an LLM to generate output from input. Inference happens during RAG answer generation and extraction.

Ingestion

The process of bringing documents into the system and making them searchable or processable. Includes upload, parsing, chunking, and indexing.

J

Job

An asynchronous processing unit for parsing or extraction. Jobs track document processing status and return results when complete.

JSON Schema

A standard for describing the structure and validation of JSON data. DocLD extraction schemas define field names, types, and optional constraints.

K

Knowledge Base

A collection of documents organized for semantic search and RAG chat. Knowledge bases scope which documents are searched when answering questions.

L

Latency

The time from request to response. Lower latency means faster answers and better user experience.

Layout Analysis

Identifying document structure such as headings, paragraphs, tables, and figures. Layout analysis improves chunking and extraction quality.

LLM

Large Language Model. The AI model that powers RAG answers and extraction in DocLD. LLMs generate text from context and instructions.

M

Markdown

A lightweight markup language for plain-text documents. DocLD can parse and output Markdown for document content and exports.

Metadata

Structured information about documents and chunks. Metadata includes file info, page numbers, processing details, and custom tags.

N

Namespace

A logical partition in the vector database for scoping vectors. DocLD uses namespaces to isolate document embeddings and support multi-tenant deployments.

Native PDF

A PDF that contains embedded text and fonts (not just images). Text can be extracted directly without OCR.

O

OCR

Optical Character Recognition converts images and scanned documents into machine-readable text. DocLD uses OCR based on a vision-language model (VLM) and supports 50+ languages.

Organization

The top-level tenant in DocLD. Resources like documents, knowledge bases, and API keys belong to an organization.

P

Pagination

Retrieving a large list in pages (e.g., cursor- or offset-based). API list endpoints return paginated results.
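
A sketch of the cursor-based variant, using an in-memory list in place of a real endpoint. The response keys `data` and `next_cursor` and the `limit` parameter are assumptions for illustration; consult the API reference for the actual field names.

```python
def fetch_page(items, cursor=None, limit=2):
    """Stand-in for a list endpoint: return one page plus the next cursor."""
    start = 0 if cursor is None else int(cursor)
    end = start + limit
    next_cursor = str(end) if end < len(items) else None  # None marks the last page
    return {"data": items[start:end], "next_cursor": next_cursor}

def iterate_all(items, limit=2):
    """Follow cursors until the server reports no further page."""
    cursor = None
    while True:
        page = fetch_page(items, cursor, limit)
        yield from page["data"]
        cursor = page["next_cursor"]
        if cursor is None:
            break

all_items = list(iterate_all(["d1", "d2", "d3", "d4", "d5"]))
# all_items: ['d1', 'd2', 'd3', 'd4', 'd5']
```

Clients should treat the cursor as an opaque string; it is numeric here only to keep the stand-in short.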

Parsing

Extracting text, tables, and layout from documents. DocLD parses PDFs, images, spreadsheets, presentations, and more with OCR and semantic structure.

PDF/A

An ISO standard for long-term archiving of PDFs. PDF/A restricts features to improve preservation and compliance.

Prebuilt Schema

Ready-to-use extraction schemas for common document types like invoices, contracts, and resumes. Prebuilt schemas speed up extraction setup.

Presigned URL

A time-limited URL that grants temporary access to upload or download from object storage without exposing credentials.

Prompt Engineering

Designing prompts and instructions to get consistent, accurate behavior from LLMs. Used in extraction and RAG.

Prompting

The practice of crafting instructions and context for LLMs to guide behavior and output quality.

Q

Query

The user's question or search text. Queries are embedded and sent to vector search to retrieve relevant document chunks for RAG.

Quota

A limit on usage (e.g., API calls, documents, storage) per plan or organization. Exceeding quota may result in throttling or errors.

R

RAG

Retrieval-Augmented Generation combines document retrieval with AI generation for accurate, citation-backed answers from your documents.

Rate Limit

A cap on the number of requests or operations per time period. Rate limits prevent overloading the API.

Reindex

Re-running ingestion (parse, chunk, embed, upsert) on already-uploaded documents. Used after changing chunking or embedding settings.

Relevance

How well a retrieved passage or result matches a query. DocLD uses similarity scores and optional reranking to surface the most relevant chunks for RAG.

Reranking

Reordering retrieval results by relevance before generation. DocLD supports heuristic, LLM, and hybrid reranking for improved RAG accuracy.

Retention

How long DocLD keeps documents, chunks, and related data. Retention policies define when content is deleted or archived for compliance and storage management.

Retrieval

The step of fetching relevant documents or chunks for a query. In RAG, retrieval precedes generation.

Retry

Automatically re-attempting a failed request or job. DocLD and your integration can use retries to handle transient errors and improve reliability.
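
On the integration side, retries are often written with exponential backoff and a little random jitter. The helper below is a generic sketch under that assumption; `TransientError` is a hypothetical exception type, and the delays are illustrative rather than DocLD recommendations.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, etc.)."""

def retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
```

Only operations that are idempotent should be retried automatically, since a request may have succeeded on the server even though the client saw an error.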

S

Scanned Document

A document stored as images of pages (e.g., from a scanner). Text must be extracted using OCR.

Schema

A JSON schema that defines which fields to extract from documents. Schemas specify field names, types, and instructions for AI extraction.

SDK

Software development kit: libraries and tools for integrating with the DocLD API in your preferred language.

Semantic Search

Search by meaning rather than exact keywords. Uses embeddings and vector similarity to find relevant content.

Session

A chat conversation with history. Sessions preserve context across messages for multi-turn RAG chat.

Similarity Score

A measure of how similar two embedding vectors are. Higher scores indicate greater semantic closeness.

Source Document

The original file or document uploaded for processing. Parsing and extraction operate on the source document.

Structured Data

Data in a fixed schema (e.g., JSON, CSV) with named fields and types. Extraction produces structured data from unstructured documents.

Sync

Keeping DocLD in sync with an external source. Connectors sync documents from storage or apps; you can also sync by re-uploading or reindexing to reflect changes.

System Prompt

Instructions that set the LLM's role, behavior, and constraints. Used in chat and RAG to enforce citation and tone.

T

Table Extraction

Identifying and extracting tabular data from documents as structured rows and columns. Supports RAG and schema-based extraction.

Temperature

An LLM parameter that controls randomness of generated output. Lower values are more deterministic, higher more diverse.
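
In the usual formulation, temperature divides the model's logits before the softmax that turns them into probabilities. The sketch below shows that calculation on a toy set of logits; it illustrates the general mechanism, not any DocLD-specific setting.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)   # mass spreads across tokens
```

With the lower temperature the highest-logit token takes a larger share of the probability, which is why low settings produce more repeatable output.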

Tenant

A logical boundary for multi-tenancy. In DocLD, the organization is the primary tenant; resources and billing are isolated per tenant.

Text Extraction

Pulling raw text from documents regardless of format. In PDFs and images, this may involve parsing content streams or running OCR.

Throughput

The rate at which documents or operations are processed. Higher throughput means more documents per unit time.

Tokenization

Splitting text into tokens for LLMs and embedding models. Token boundaries affect chunk size and context limits.

Top-K

The number of chunks retrieved from vector search before reranking or generation. Top-k balances recall with context size and latency.
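
Selecting the top k results from a scored list can be sketched with the standard library. The chunk IDs and scores below are made up for illustration; in practice the vector database performs this selection.

```python
import heapq

def top_k(scored_chunks, k):
    """Return the k (chunk_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])

scored = [("c1", 0.42), ("c2", 0.91), ("c3", 0.77), ("c4", 0.15)]
best = top_k(scored, 2)
# best: [('c2', 0.91), ('c3', 0.77)]
```

A larger k passes more context to reranking or generation at the cost of more tokens and higher latency.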

Top-p

A sampling parameter that limits the model to the smallest set of tokens whose cumulative probability exceeds p. Controls diversity of output.
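
A minimal sketch of the filtering step on a toy distribution. It keeps tokens in descending probability order until the running total reaches or exceeds p, then renormalizes; the token probabilities are invented for the example.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept tokens form a valid distribution to sample from.
    return {token: prob / total for token, prob in kept}

next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = top_p_filter(next_token_probs, p=0.9)
# nucleus keeps "the", "a", and "an"; "zebra" is dropped
```

Lowering p shrinks the candidate set and makes output less diverse; p = 1.0 keeps every token.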

Trigger

What starts a workflow. Triggers include upload, webhook, schedule, or manual run.

U

Unstructured Data

Raw documents (PDFs, images, text) without fixed schema. DocLD parses unstructured data into structured content for extraction and search.

V

Vector

A numerical representation of text or data. Embeddings produce vectors for semantic similarity and search.

Vector Database

A database optimized for storing and querying vector embeddings. Enables fast similarity search for RAG and semantic search.

Vector Search

Similarity search over embeddings to find documents or passages that match a query by meaning. DocLD uses a vector database for vector search.

W

Webhook

HTTP callbacks that notify your system when events occur. DocLD webhooks fire on document processing, extraction completion, and workflow events.

Webhook Event

A single occurrence (e.g., job completed, document processed) that DocLD sends to your webhook endpoint. Events trigger HTTP POSTs with a payload describing what happened.

Webhook Payload

The JSON body sent to your webhook URL when an event occurs. Contains event type, resource IDs, and relevant data.

Workflow

Automated document processing pipelines with triggers and integrations. Workflows run parse, extract, or other steps when events occur.

Workflow Run

A single execution of a workflow. Each run has a status, logs, and output from the workflow steps.

X

XML

Extensible Markup Language used for structured output. DocLD can return extraction results in XML format for integration with downstream systems.

Y

Yield

Streaming or incremental delivery of results. DocLD can yield extraction results incrementally as they are produced, reducing latency for large documents.

Z

Zero-Shot Extraction

Extracting structured data from documents without training on your specific documents. DocLD uses zero-shot extraction with schema-defined instructions.

© 2026 DocLD, Inc. SOC audit in progress