How RAG Works in DocLD — Retrieval, Reranking, and Citations Under the Hood
When you ask a question in DocLD chat, your message goes through a multi-stage pipeline before you see an answer with citations. That pipeline, which searches the vector index (Pinecone embeds the query server-side), reranks the results, selects chunks, and generates a grounded response, is what makes RAG in DocLD both accurate and transparent. This post walks through each stage so you can reason about latency, quality, and tuning.
End-to-End RAG Flow
From the moment you send a message until you receive a cited reply, the request follows one path: vector search → rerank → select chunks → generate.
The same flow applies whether you call the API synchronously or use streaming: retrieval and chunk selection happen first; the LLM then generates the answer using only the selected excerpts and is required to cite them. From the client's perspective, a single message triggers the full pipeline.
Stage 1: Search (Pinecone Integrated Embeddings)
Your question is sent directly to Pinecone. Pinecone embeds it using the same model used for the document chunks (llama-text-embed-v2) and runs the vector search server-side. No separate embedding API call is needed.
- Input: The raw message string (and any query expansion or reformulation your client sends).
- Output: Top-K similar records with scores and metadata.
- Where it happens: queryByText(query, options) sends the query to Pinecone; embedding and search happen in one step.
Using Pinecone's integrated embeddings keeps retrieval consistent and low-latency, and avoids mismatch between the query and document embedding models.
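As a rough sketch, the search stage only needs the raw message text and a top-K. The names here (buildSearchQuery, inputs.text, top_k) are illustrative, loosely modeled on the shape of Pinecone's integrated-embedding search, not DocLD's actual internals:

```typescript
// Illustrative request shape for integrated-embedding search. Pinecone
// embeds the raw text server-side with the index's model
// (llama-text-embed-v2), so we send text, not a precomputed vector.
interface SearchQuery {
  inputs: { text: string };
  top_k: number;
}

function buildSearchQuery(message: string, topK = 10): SearchQuery {
  return { inputs: { text: message }, top_k: topK };
}
```

Because the query and the indexed chunks go through the same server-side model, there is no risk of embedding the query with a different (incompatible) model on the client.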
Stage 2: Vector Search & Scoping
Pinecone runs top-K similarity search and returns the most relevant chunks. Knowledge bases are scoped by metadata filter (document_id $in list of documents in the KB); a single namespace is used. Learn how to create and use them in the DocLD dashboard and documentation.
| Concept | Detail |
|---|---|
| Scoping | Filter by document_id: { $in: [doc1, doc2, ...] } for the KB's documents |
| Default top-K | 10; for reranking we often fetch more (e.g. 3× or at least 20–30) to give rerankers a good candidate set |
| Filters | Optional metadata filters (e.g. by document_id, page_number) can be applied at query time |
| Metadata | We request metadata (chunk_id, document_id, page, filename, etc.) so we can enrich results and build citations |
Example of how KB scoping isolates results:
```
Knowledge Base A → filter: document_id $in [doc_A1, doc_A2, ...] → only chunks from KB A
Knowledge Base B → filter: document_id $in [doc_B1, doc_B2, ...] → only chunks from KB B
```
So when you chat with "Company Policies", we only search chunks from that KB’s documents. No cross-KB leakage.
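The isolation guarantee above can be sketched as a tiny in-memory version of the $in filter; ChunkRecord and applyScopeFilter are hypothetical names for illustration, not DocLD code:

```typescript
// A metadata filter of the form { document_id: { $in: [...] } } reduces
// to set membership: a chunk survives only if its document belongs to
// the knowledge base being queried.
interface ChunkRecord {
  id: string;
  documentId: string;
}

function applyScopeFilter(records: ChunkRecord[], kbDocumentIds: string[]): ChunkRecord[] {
  const allowed = new Set(kbDocumentIds);
  return records.filter((r) => allowed.has(r.documentId));
}
```

Since the filter is applied inside Pinecone at query time, chunks from other knowledge bases are never even candidates.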
Stage 3: Reranking
Raw vector similarity is useful but not perfect. We optionally rerank the retrieved chunks so that the most relevant ones rise to the top. That improves answer quality and citation accuracy.
| Strategy | Description | Best For |
|---|---|---|
| none | Use vector scores as-is; no reranking | Low latency, small candidate sets |
| heuristic | Score boosts for exact phrase match, keyword overlap, and filename match | Default; good balance of speed and quality |
| llm | Cross-encoder style: an LLM scores each chunk’s relevance to the query | Highest accuracy; slower and more costly |
| hybrid | Heuristic to narrow candidates, then LLM rerank on the shortlist | Complex questions when you want both recall and precision |
The default in DocLD is heuristic. It applies boosts such as:
- Boosting chunks that contain the exact query phrase.
- Boosting chunks whose content overlaps with important query terms.
- Boosting chunks from documents whose filename matches the query (e.g. "handbook.pdf" when the user asks about the handbook).
After reranking, we keep a larger set (e.g. up to ~16 chunks) and then run chunk selection so we don’t overload the LLM context.
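The heuristic boosts above can be sketched as score adjustments on top of the vector score. The boost weights here are invented for illustration; DocLD's real weights may differ:

```typescript
interface Candidate {
  id: string;
  text: string;
  filename: string;
  score: number; // vector similarity from Pinecone
}

// Heuristic rerank: exact phrase match, keyword overlap, and filename
// match each nudge the vector score upward.
function heuristicRerank(query: string, chunks: Candidate[]): Candidate[] {
  const q = query.toLowerCase();
  const terms = q.split(/\s+/).filter((t) => t.length > 2);
  return chunks
    .map((c) => {
      const text = c.text.toLowerCase();
      let boost = 0;
      if (text.includes(q)) boost += 0.3; // exact query phrase appears in chunk
      const overlap = terms.filter((t) => text.includes(t)).length;
      boost += 0.05 * overlap; // per-term keyword overlap
      if (terms.some((t) => c.filename.toLowerCase().includes(t))) {
        boost += 0.1; // filename hints at relevance (e.g. "handbook.pdf")
      }
      return { ...c, score: c.score + boost };
    })
    .sort((a, b) => b.score - a.score);
}
```

No model call is needed, which is why heuristic reranking is cheap enough to be the default.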
Stage 4: Chunk Selection
We don’t send every retrieved (or reranked) chunk to the LLM. We cap the total number of chunks and cap how many chunks can come from a single document. That keeps context focused and avoids one long document dominating the context.
| Parameter | Default | Purpose |
|---|---|---|
| maxSelectedChunks | 8 | Maximum chunks included in the LLM context |
| maxChunksPerDocument | 5 | Maximum chunks from any one document |
Selection is score-based (after reranking) and respects both limits. So you get the top-scoring chunks overall, with diversity across documents. The result is a list of selected chunks that are then formatted as numbered excerpts (e.g. Excerpt 1, Excerpt 2, …) for the prompt.
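The selection step described above can be sketched as a greedy pass over the ranked list with the two caps; selectChunks is an illustrative name:

```typescript
interface Scored {
  id: string;
  documentId: string;
  score: number;
}

// Greedy selection: take chunks in score order, stop at the global cap,
// and skip any chunk whose document has already hit the per-document cap.
function selectChunks(
  ranked: Scored[],
  maxSelected = 8, // maxSelectedChunks
  maxPerDocument = 5 // maxChunksPerDocument
): Scored[] {
  const perDoc = new Map<string, number>();
  const selected: Scored[] = [];
  for (const c of [...ranked].sort((a, b) => b.score - a.score)) {
    if (selected.length >= maxSelected) break;
    const used = perDoc.get(c.documentId) ?? 0;
    if (used >= maxPerDocument) continue; // keep one document from dominating
    perDoc.set(c.documentId, used + 1);
    selected.push(c);
  }
  return selected;
}
```

With the defaults, a long document can contribute at most 5 of the 8 context slots, so the remaining slots go to the next-best chunks from other documents.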
Stage 5: Generate and Cite
The LLM receives a system prompt that includes:
- Hallucination prevention — Use only the provided excerpts; do not invent or extrapolate; if the excerpts don’t contain the answer, say so.
- Citation instructions — Cite by excerpt number (e.g. [1], [2]) immediately after the claim; only cite excerpts that actually support the claim; prefer quoting exact text when possible.
- Output sanitization — Respond in a natural, conversational way; don’t repeat the question or echo "based on the excerpts" boilerplate.
So the model is constrained to the retrieved text and instructed to tie every factual claim to a specific excerpt. That’s how we get inline citations that map back to a document and page in the UI.
| Prompt building block | Role |
|---|---|
| Hallucination prevention | Restrict answers to the provided excerpts only |
| Citation instructions | Inline [1], [2] and optional quoted text |
| Citation examples | Good vs bad citation placement and accuracy |
| Output sanitization | Natural tone, no repeated metadata or headers |
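Before the prompt is built, the selected chunks are rendered as the numbered excerpts the citation instructions refer to. The exact wording below is a sketch, not DocLD's real prompt template:

```typescript
interface Excerpt {
  text: string;
  filename: string;
  page: number;
}

// Render selected chunks as "Excerpt N (file, page P):" blocks so the
// model can cite them by number ([1], [2], ...).
function formatExcerpts(chunks: Excerpt[]): string {
  return chunks
    .map((c, i) => `Excerpt ${i + 1} (${c.filename}, page ${c.page}):\n${c.text}`)
    .join("\n\n");
}
```

Because the excerpt number is the only citation key the model sees, mapping a [2] in the answer back to a document and page is a simple index lookup.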
How Chunking Affects RAG
Chunks are created when documents are processed (before they’re embedded and indexed). The chunking strategy directly affects retrieval quality.
| Strategy | Description | Impact on RAG |
|---|---|---|
| Semantic | Split by meaning and natural boundaries (paragraphs, sections) | Best for most docs; chunks tend to be self-contained and relevant to a single topic |
| Fixed | Fixed character count with optional overlap | Predictable size; can split mid-sentence or mid-concept |
| Page | One chunk per page | Good for slide decks or page-bound content where "one page = one idea" |
Chunking parameters (e.g. max_chunk_size, overlap) are set at parse/upload time. Larger chunks carry more context but may mix topics; smaller chunks are more precise but can lose narrative flow. Defaults (e.g. 1000 characters, 100 overlap) are a good starting point for RAG.
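The fixed strategy with the default parameters can be sketched as sliding windows; fixedChunks is an illustrative helper, not DocLD's chunker:

```typescript
// Fixed-size chunking: maxSize-character windows that advance by
// (maxSize - overlap), so each chunk repeats the tail of the previous
// one. The overlap preserves context across chunk boundaries.
function fixedChunks(text: string, maxSize = 1000, overlap = 100): string[] {
  if (overlap >= maxSize) throw new Error("overlap must be smaller than maxSize");
  const chunks: string[] = [];
  const step = maxSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxSize));
    if (start + maxSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Semantic chunking replaces the fixed window with paragraph and section boundaries, which is why it usually retrieves better: each chunk tends to cover one topic.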
Retrieval and Knowledge Base Settings
At the knowledge base level you can tune retrieval behavior (depending on what’s exposed in the API or dashboard):
| Setting | Default | Description |
|---|---|---|
| top_k | 5–10 (mode-dependent) | Number of chunks to retrieve before reranking/selection |
| threshold | e.g. 0.7 | Minimum relevance score (if supported) |
| reranking | true (heuristic) | Whether and how to rerank (none / heuristic / llm / hybrid) |
Example-style config (conceptual; actual keys may differ in your deployment):
```json
{
  "retrieval": { "top_k": 10, "threshold": 0.7, "reranking": true },
  "chunking": { "strategy": "semantic", "max_size": 1000, "overlap": 100 }
}
```
These influence how many candidates we pull from Pinecone and whether we rerank them before chunk selection.
API and Chat Modes
You send a message to POST /api/chat with knowledge_base_id and optional mode. Full details are in the API documentation. Modes trade off speed vs depth of retrieval and length of response.
| Mode | Description | Best For |
|---|---|---|
| fast | Quick response, fewer chunks and citations | Simple, factual questions |
| balanced | Default balance of speed and accuracy | Most day-to-day queries |
| thorough | Deeper search, more chunks and citations | Complex or multi-part questions |
Example request:
```bash
curl -X POST "/api/chat" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is the vacation policy?",
    "knowledge_base_id": "kb-uuid",
    "mode": "thorough"
  }'
```
Streaming is supported: the same retrieval pipeline runs first; then the model stream is returned so you can show the answer and citations as they’re generated.
Hybrid Search (Vector + Lexical)
Besides pure vector search, DocLD can combine vector similarity with lexical (keyword) search (e.g. BM25) and fuse the results (e.g. Reciprocal Rank Fusion). That can help when the query has very specific terms that matter for relevance.
- Vector: Semantic similarity (embeddings).
- Lexical: Term overlap (BM25 over chunk text).
- Fusion: Methods like RRF merge the two ranked lists into a single ordering.
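Reciprocal Rank Fusion itself is small enough to sketch in full. The constant k = 60 is the value commonly used with RRF; rrfFuse is an illustrative name:

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
// per item (rank is 1-based), and items are re-sorted by the summed
// score. An item ranked well in both lists rises to the top.
function rrfFuse(vectorRanked: string[], lexicalRanked: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, lexicalRanked]) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF only looks at ranks, it needs no score normalization between the vector and BM25 lists, which is why it is a popular fusion choice.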
Hybrid search is available in the codebase for experiments or specific use cases. The main RAG path today is vector search + optional reranking; hybrid can be enabled where you need stronger keyword sensitivity without losing semantic recall. For document intelligence pricing and usage, see pricing, our calculators, and the document processing cost calculator. More concepts: embedding, reranking, and the full glossary.
Frequently Asked Questions
What happens when I send a chat message?
Your message is sent to Pinecone, which embeds it with llama-text-embed-v2 and runs vector search server-side over the knowledge base’s documents. We rerank the results (by default with heuristics), select up to a fixed number of chunks with a per-document cap, then call the LLM with those excerpts and strict citation instructions. The response (sync or stream) is generated from that context only.
How do I get the best answers?
- Use semantic chunking for most documents.
- Prefer thorough (or equivalent) mode for complex questions so we retrieve and consider more chunks.
- Keep knowledge bases focused (one domain or use case per KB) so retrieval stays on-topic.
- If you need maximum accuracy and can afford latency/cost, consider llm or hybrid reranking where available.
Does streaming work with citations?
Yes. Retrieval and chunk selection run once at the start. The model then streams the answer; citations are attached to the response (e.g. in the payload or in the UI) so you can show "[1]", "[2]" and link them to the document and page.
How do inline citations map back to documents and pages?
Excerpts in the prompt are numbered (Excerpt 1, Excerpt 2, …). The model is instructed to cite with inline numbers like [1] or [1][2] immediately after the supported claim. See our citation and RAG glossary entries for more. The API/dashboard maps those indices back to chunk metadata (document name, page, and optionally bounding box or link).
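The index-to-metadata mapping can be sketched as a scan for [n] markers; resolveCitations is an illustrative helper, not the DocLD API:

```typescript
interface ExcerptMeta {
  documentName: string;
  page: number;
}

// Find inline [n] citations in the answer and resolve each distinct
// index to the metadata of the corresponding numbered excerpt.
function resolveCitations(
  answer: string,
  excerpts: ExcerptMeta[]
): { index: number; documentName: string; page: number }[] {
  const seen = new Set<number>();
  const out: { index: number; documentName: string; page: number }[] = [];
  for (const m of answer.matchAll(/\[(\d+)\]/g)) {
    const index = Number(m[1]);
    const excerpt = excerpts[index - 1]; // excerpts are 1-based in the prompt
    if (excerpt && !seen.has(index)) {
      seen.add(index);
      out.push({ index, ...excerpt });
    }
  }
  return out;
}
```

Indices without a matching excerpt are dropped rather than guessed, which keeps the UI from linking a citation to the wrong source.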
Is the RAG pipeline different for streaming vs. synchronous requests?
No difference in the RAG pipeline. Both use the same search → rerank → select → generate flow. The only difference is whether the final answer is returned in one payload or as a stream. Use streaming for better perceived latency in the UI.