How RAG Works in DocLD — Retrieval, Reranking, and Citations Under the Hood
When you ask a question in DocLD chat, your message goes through a multi-stage pipeline before you see an answer with citations. That pipeline, which searches the vector index (Pinecone embeds the query server-side), reranks the results, selects chunks, and generates a grounded response, is what makes RAG in DocLD both accurate and transparent. This post walks through each stage so you can reason about latency, quality, and tuning.
End-to-End RAG Flow
From the moment you send a message until you receive a cited reply, the request follows one path: vector search → rerank → select chunks → generate.
The same flow applies whether you call the API synchronously or use streaming: retrieval and chunk selection happen first; the LLM then generates the answer using only the selected excerpts and is required to cite them. From the client's perspective, a single message triggers the full pipeline.
Stage 1: Search (Pinecone Integrated Embeddings)
Your question is sent directly to Pinecone. Pinecone embeds it using the same model used for the document chunks (llama-text-embed-v2) and runs the vector search server-side. No separate embedding API call is needed.
- Input: The raw message string (and any query expansion or reformulation your client sends).
- Output: Top-K similar records with scores and metadata.
- Where it happens: queryByText(query, options) sends the query to Pinecone; embedding and search happen in one step.
Using Pinecone's integrated embeddings keeps retrieval consistent and low-latency, and avoids mismatch between the query and document embedding models.
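As a rough sketch, the search stage only needs the raw message text and a top-K. The names here (buildSearchQuery, inputs.text, top_k) are illustrative, loosely modeled on the shape of Pinecone's integrated-embedding search, not DocLD's actual internals:

```typescript
// Illustrative request shape for integrated-embedding search. Pinecone
// embeds the raw text server-side with the index's model
// (llama-text-embed-v2), so we send text, not a precomputed vector.
interface SearchQuery {
  inputs: { text: string };
  top_k: number;
}

function buildSearchQuery(message: string, topK = 10): SearchQuery {
  return { inputs: { text: message }, top_k: topK };
}
```

Because the query and the indexed chunks go through the same server-side model, there is no risk of embedding the query with a different (incompatible) model on the client.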
Stage 2: Vector Search & Scoping
Pinecone runs top-K similarity search and returns the most relevant chunks. Knowledge bases are scoped by metadata filter (document_id $in list of documents in the KB); a single namespace is used. Learn how to create and use them in the DocLD dashboard and documentation.
| Concept | Detail |
|---|---|
| Scoping | Filter by document_id: { $in: [doc1, doc2, ...] } for the KB's documents |
| Default top-K | 10; for reranking we often fetch more (e.g. 3× or at least 20–30) to give rerankers a good candidate set |
| Filters | Optional metadata filters (e.g. by document_id, page_number) can be applied at query time |
| Metadata | We request metadata (chunk_id, document_id, page, filename, etc.) so we can enrich results and build citations |
Example of how KB scoping isolates results:
```
Knowledge Base A → filter: document_id $in [doc_A1, doc_A2, ...] → only chunks from KB A
Knowledge Base B → filter: document_id $in [doc_B1, doc_B2, ...] → only chunks from KB B
```
So when you chat with "Company Policies", we only search chunks from that KB’s documents. No cross-KB leakage.
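The isolation guarantee above can be sketched as a tiny in-memory version of the $in filter; ChunkRecord and applyScopeFilter are hypothetical names for illustration, not DocLD code:

```typescript
// A metadata filter of the form { document_id: { $in: [...] } } reduces
// to set membership: a chunk survives only if its document belongs to
// the knowledge base being queried.
interface ChunkRecord {
  id: string;
  documentId: string;
}

function applyScopeFilter(records: ChunkRecord[], kbDocumentIds: string[]): ChunkRecord[] {
  const allowed = new Set(kbDocumentIds);
  return records.filter((r) => allowed.has(r.documentId));
}
```

Since the filter is applied inside Pinecone at query time, chunks from other knowledge bases are never even candidates.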
Stage 3: Reranking
Raw vector similarity is useful but not perfect. We optionally rerank the retrieved chunks so that the most relevant ones rise to the top. That improves answer quality and citation accuracy.
| Strategy | Description | Best For |
|---|---|---|
| none | Use vector scores as-is; no reranking | Low latency, small candidate sets |
| heuristic | Score boosts for exact phrase match, keyword overlap, and filename match | Default; good balance of speed and quality |
| llm | Cross-encoder style: an LLM scores each chunk’s relevance to the query | Highest accuracy; slower and more costly |
| hybrid | Heuristic to narrow candidates, then LLM rerank on the shortlist | Complex questions when you want both recall and precision |
The default in DocLD is heuristic. It applies boosts such as:
- Boosting chunks that contain the exact query phrase.
- Boosting chunks whose content overlaps with important query terms.
- Boosting chunks from documents whose filename matches the query (e.g. "handbook.pdf" when the user asks about the handbook).
After reranking, we keep a larger set (e.g. up to ~16 chunks) and then run chunk selection so we don’t overload the LLM context.
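The heuristic boosts above can be sketched as score adjustments on top of the vector score. The boost weights here are invented for illustration; DocLD's real weights may differ:

```typescript
interface Candidate {
  id: string;
  text: string;
  filename: string;
  score: number; // vector similarity from Pinecone
}

// Heuristic rerank: exact phrase match, keyword overlap, and filename
// match each nudge the vector score upward.
function heuristicRerank(query: string, chunks: Candidate[]): Candidate[] {
  const q = query.toLowerCase();
  const terms = q.split(/\s+/).filter((t) => t.length > 2);
  return chunks
    .map((c) => {
      const text = c.text.toLowerCase();
      let boost = 0;
      if (text.includes(q)) boost += 0.3; // exact query phrase appears in chunk
      const overlap = terms.filter((t) => text.includes(t)).length;
      boost += 0.05 * overlap; // per-term keyword overlap
      if (terms.some((t) => c.filename.toLowerCase().includes(t))) {
        boost += 0.1; // filename hints at relevance (e.g. "handbook.pdf")
      }
      return { ...c, score: c.score + boost };
    })
    .sort((a, b) => b.score - a.score);
}
```

No model call is needed, which is why heuristic reranking is cheap enough to be the default.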
Stage 4: Chunk Selection
We don’t send every retrieved (or reranked) chunk to the LLM. We cap the total number of chunks and cap how many chunks can come from a single document. That keeps context focused and avoids one long document dominating the context.
| Parameter | Default | Purpose |
|---|---|---|
| maxSelectedChunks | 8 | Maximum chunks included in the LLM context |
| maxChunksPerDocument | 5 | Maximum chunks from any one document |
Selection is score-based (after reranking) and respects both limits. So you get the top-scoring chunks overall, with diversity across documents. The result is a list of selected chunks that are then formatted as numbered excerpts (e.g. Excerpt 1, Excerpt 2, …) for the prompt.
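The selection step described above can be sketched as a greedy pass over the ranked list with the two caps; selectChunks is an illustrative name:

```typescript
interface Scored {
  id: string;
  documentId: string;
  score: number;
}

// Greedy selection: take chunks in score order, stop at the global cap,
// and skip any chunk whose document has already hit the per-document cap.
function selectChunks(
  ranked: Scored[],
  maxSelected = 8, // maxSelectedChunks
  maxPerDocument = 5 // maxChunksPerDocument
): Scored[] {
  const perDoc = new Map<string, number>();
  const selected: Scored[] = [];
  for (const c of [...ranked].sort((a, b) => b.score - a.score)) {
    if (selected.length >= maxSelected) break;
    const used = perDoc.get(c.documentId) ?? 0;
    if (used >= maxPerDocument) continue; // keep one document from dominating
    perDoc.set(c.documentId, used + 1);
    selected.push(c);
  }
  return selected;
}
```

With the defaults, a long document can contribute at most 5 of the 8 context slots, so the remaining slots go to the next-best chunks from other documents.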
Stage 5: Generate and Cite
The LLM receives a system prompt that includes:
- Hallucination prevention — Use only the provided excerpts; do not invent or extrapolate; if the excerpts don’t contain the answer, say so.
- Citation instructions — Cite by excerpt number (e.g. [1], [2]) immediately after the claim; only cite excerpts that actually support the claim; prefer quoting exact text when possible.
- Output sanitization — Respond in a natural, conversational way; don’t repeat the question or echo "based on the excerpts" boilerplate.
So the model is constrained to the retrieved text and instructed to tie every factual claim to a specific excerpt. That’s how we get inline citations that map back to a document and page in the UI.
| Prompt building block | Role |
|---|---|
| Hallucination prevention | Restrict answers to the provided excerpts only |
| Citation instructions | Inline [1], [2] and optional quoted text |
| Citation examples | Good vs bad citation placement and accuracy |
| Output sanitization | Natural tone, no repeated metadata or headers |
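Before the prompt is built, the selected chunks are rendered as the numbered excerpts the citation instructions refer to. The exact wording below is a sketch, not DocLD's real prompt template:

```typescript
interface Excerpt {
  text: string;
  filename: string;
  page: number;
}

// Render selected chunks as "Excerpt N (file, page P):" blocks so the
// model can cite them by number ([1], [2], ...).
function formatExcerpts(chunks: Excerpt[]): string {
  return chunks
    .map((c, i) => `Excerpt ${i + 1} (${c.filename}, page ${c.page}):\n${c.text}`)
    .join("\n\n");
}
```

Because the excerpt number is the only citation key the model sees, mapping a [2] in the answer back to a document and page is a simple index lookup.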
How Chunking Affects RAG
Chunks are created when documents are processed (before they’re embedded and indexed). The chunking strategy directly affects retrieval quality.
| Strategy | Description | Impact on RAG |
|---|---|---|
| Semantic | Split by meaning and natural boundaries (paragraphs, sections) | Best for most docs; chunks tend to be self-contained and relevant to a single topic |
| Fixed | Fixed character count with optional overlap | Predictable size; can split mid-sentence or mid-concept |
| Page | One chunk per page | Good for slide decks or page-bound content where "one page = one idea" |
Chunking parameters (e.g. max_chunk_size, overlap) are set at parse/upload time. Larger chunks carry more context but may mix topics; smaller chunks are more precise but can lose narrative flow. Defaults (e.g. 1000 characters, 100 overlap) are a good starting point for RAG.
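The fixed strategy with the default parameters can be sketched as sliding windows; fixedChunks is an illustrative helper, not DocLD's chunker:

```typescript
// Fixed-size chunking: maxSize-character windows that advance by
// (maxSize - overlap), so each chunk repeats the tail of the previous
// one. The overlap preserves context across chunk boundaries.
function fixedChunks(text: string, maxSize = 1000, overlap = 100): string[] {
  if (overlap >= maxSize) throw new Error("overlap must be smaller than maxSize");
  const chunks: string[] = [];
  const step = maxSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxSize));
    if (start + maxSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Semantic chunking replaces the fixed window with paragraph and section boundaries, which is why it usually retrieves better: each chunk tends to cover one topic.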
Retrieval and Knowledge Base Settings
At the knowledge base level you can tune retrieval behavior (depending on what’s exposed in the API or dashboard):
| Setting | Default | Description |
|---|---|---|
| top_k | 5–10 (mode-dependent) | Number of chunks to retrieve before reranking/selection |
| threshold | e.g. 0.7 | Minimum relevance score (if supported) |
| reranking | true (heuristic) | Whether and how to rerank (none / heuristic / llm / hybrid) |
Example-style config (conceptual; actual keys may differ in your deployment):
```json
{
  "retrieval": { "top_k": 10, "threshold": 0.7, "reranking": true },
  "chunking": { "strategy": "semantic", "max_size": 1000, "overlap": 100 }
}
```
These influence how many candidates we pull from Pinecone and whether we rerank them before chunk selection.
API and Chat Modes
You send a message to POST /api/chat with knowledge_base_id and optional mode. Full details are in the API documentation. Modes trade off speed vs depth of retrieval and length of response.
| Mode | Description | Best For |
|---|---|---|
| fast | Quick response, fewer chunks and citations | Simple, factual questions |
| balanced | Default balance of speed and accuracy | Most day-to-day queries |
| thorough | Deeper search, more chunks and citations | Complex or multi-part questions |
Example request:
```bash
curl -X POST "/api/chat" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is the vacation policy?",
    "knowledge_base_id": "kb-uuid",
    "mode": "thorough"
  }'
```
Streaming is supported: the same retrieval pipeline runs first; then the model stream is returned so you can show the answer and citations as they’re generated.
Hybrid Search (Vector + Lexical)
Besides pure vector search, DocLD can combine vector similarity with lexical (keyword) search (e.g. BM25) and fuse the results (e.g. Reciprocal Rank Fusion). That can help when the query has very specific terms that matter for relevance.
- Vector: Semantic similarity (embeddings).
- Lexical: Term overlap (BM25 over chunk text).
- Fusion: Methods like RRF merge the two ranked lists into a single ordering.
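Reciprocal Rank Fusion itself is small enough to sketch in full. The constant k = 60 is the value commonly used with RRF; rrfFuse is an illustrative name:

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
// per item (rank is 1-based), and items are re-sorted by the summed
// score. An item ranked well in both lists rises to the top.
function rrfFuse(vectorRanked: string[], lexicalRanked: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, lexicalRanked]) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF only looks at ranks, it needs no score normalization between the vector and BM25 lists, which is why it is a popular fusion choice.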
Hybrid search is available in the codebase for experiments or specific use cases. The main RAG path today is vector search + optional reranking; hybrid can be enabled where you need stronger keyword sensitivity without losing semantic recall. For document intelligence pricing and usage, see pricing, our calculators, and the document processing cost calculator. More concepts: embedding, reranking, and the full glossary.
Frequently Asked Questions
What happens when I send a chat message?
Your message is sent to Pinecone, which embeds it with llama-text-embed-v2 and runs vector search server-side over the knowledge base’s documents. We rerank the results (by default with heuristics), select up to a fixed number of chunks with a per-document cap, then call the LLM with those excerpts and strict citation instructions. The response (sync or stream) is generated from that context only.
How do I get the best answers?
- Use semantic chunking for most documents.
- Prefer thorough (or equivalent) mode for complex questions so we retrieve and consider more chunks.
- Keep knowledge bases focused (one domain or use case per KB) so retrieval stays on-topic.
- If you need maximum accuracy and can afford latency/cost, consider llm or hybrid reranking where available.
Does streaming work with citations?
Yes. Retrieval and chunk selection run once at the start. The model then streams the answer; citations are attached to the response (e.g. in the payload or in the UI) so you can show "[1]", "[2]" and link them to the document and page.
How do inline citations map back to documents and pages?
Excerpts in the prompt are numbered (Excerpt 1, Excerpt 2, …). The model is instructed to cite with inline numbers like [1] or [1][2] immediately after the supported claim. See our citation and RAG glossary entries for more. The API/dashboard maps those indices back to chunk metadata (document name, page, and optionally bounding box or link).
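The index-to-metadata mapping can be sketched as a scan for [n] markers; resolveCitations is an illustrative helper, not the DocLD API:

```typescript
interface ExcerptMeta {
  documentName: string;
  page: number;
}

// Find inline [n] citations in the answer and resolve each distinct
// index to the metadata of the corresponding numbered excerpt.
function resolveCitations(
  answer: string,
  excerpts: ExcerptMeta[]
): { index: number; documentName: string; page: number }[] {
  const seen = new Set<number>();
  const out: { index: number; documentName: string; page: number }[] = [];
  for (const m of answer.matchAll(/\[(\d+)\]/g)) {
    const index = Number(m[1]);
    const excerpt = excerpts[index - 1]; // excerpts are 1-based in the prompt
    if (excerpt && !seen.has(index)) {
      seen.add(index);
      out.push({ index, ...excerpt });
    }
  }
  return out;
}
```

Indices without a matching excerpt are dropped rather than guessed, which keeps the UI from linking a citation to the wrong source.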
Is the RAG pipeline different for streaming vs. synchronous requests?
No difference in the RAG pipeline. Both use the same search → rerank → select → generate flow. The only difference is whether the final answer is returned in one payload or as a stream. Use streaming for better perceived latency in the UI.