# Top-K
Top-K is the number of chunks retrieved from vector search before they are passed to reranking or the LLM. For example, top-k=10 returns the 10 chunks most similar to the query. Top-k balances recall (retrieving enough relevant content) against context size and latency.
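At the vector level, "top-k" just means scoring every chunk against the query and keeping the k highest. A minimal sketch in plain Python (no external index; the 2-D "embeddings" are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    # Score every chunk against the query, keep the k most similar
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-D embeddings for three chunks and a query
chunks = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
query = [1.0, 0.1]
print(top_k(query, chunks, k=2))  # → [0, 1]
```

A real vector index (Pinecone, pgvector, FAISS, etc.) does the same thing with approximate nearest-neighbor search so it scales past brute force.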
## How Top-K Works
1. Query — The user's question is embedded and sent to Pinecone
2. Retrieve — The vector index returns the top-k most similar chunks (e.g., by cosine similarity)
3. Rerank — An optional reranking step may reorder or filter these chunks
4. Generate — The LLM receives the (possibly reranked) chunks as context for RAG answer generation
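The four steps above can be sketched end to end. Here `embed`, `rerank_fn`, and `llm` are hypothetical stand-ins for your embedding model, reranker, and LLM client; the real calls depend on your stack:

```python
def retrieve(query, index, embed, top_k=10):
    # Steps 1–2: embed the query, ask the vector index for the top-k chunks
    query_vec = embed(query)
    return index.query(query_vec, top_k)

def answer(query, index, embed, llm, rerank_fn=None, top_k=10):
    chunks = retrieve(query, index, embed, top_k)
    # Step 3: optionally rerank/filter the retrieved chunks
    if rerank_fn is not None:
        chunks = rerank_fn(query, chunks)
    # Step 4: pass the (possibly reranked) chunks to the LLM as context
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

Keeping retrieval and generation as separate functions makes it easy to tune top-k, or swap the reranker, without touching the rest of the pipeline.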
Higher top-k improves recall: you're more likely to include all relevant chunks. But higher top-k also increases context size, latency, and cost. DocLD lets you configure top-k per knowledge base or chat session.
## Choosing Top-K
| Top-K | Trade-off | Best For |
|---|---|---|
| 5–10 | Fast, smaller context | Simple questions, narrow domains |
| 10–20 | Balanced | Most RAG use cases |
| 20+ | Higher recall | Complex questions, broad domains |
Reranking can help: retrieve a larger top-k (e.g., 20), rerank to the best 5–10, then pass to the LLM. This improves precision without losing recall.
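The retrieve-wide-then-filter pattern can be sketched like this. The keyword-overlap scorer below is a hypothetical stand-in for a real reranker (e.g., a cross-encoder model or a hosted rerank API), which you would swap in:

```python
def rerank(query, chunks, keep=5):
    # Hypothetical reranker: score each chunk by word overlap with the query.
    # A production system would use a cross-encoder or a rerank API instead.
    query_words = set(query.lower().split())
    def score(chunk):
        return len(query_words & set(chunk.lower().split()))
    return sorted(chunks, key=score, reverse=True)[:keep]

# Retrieve wide (top-k = 20 from the vector index), then filter down to 5
candidates = [f"chunk about topic {i}" for i in range(20)]
candidates[7] = "refund policy for annual plans"
best = rerank("what is the refund policy", candidates, keep=5)
print(best[0])  # → "refund policy for annual plans"
```

The wide first stage keeps recall high; the second stage trims the context the LLM actually sees.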
## Best Practices
- Start moderate — Default top-k (e.g., 10) works for most cases
- Increase if recall is low — If answers miss key information, try higher top-k
- Use reranking — Reranking lets you retrieve more and filter down
- Match the context window — Ensure top-k × chunk size fits within the LLM's context window
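The context-window check in the last point is simple arithmetic. The token counts below are rough assumptions (chunk size and prompt overhead vary by tokenizer and prompt template):

```python
def max_top_k(context_window, chunk_tokens, prompt_overhead, answer_budget):
    # Tokens left for retrieved chunks after subtracting the prompt template
    # and the space reserved for the model's answer
    available = context_window - prompt_overhead - answer_budget
    return max(available // chunk_tokens, 0)

# e.g., an 8,192-token window, 512-token chunks, ~500 tokens of
# prompt/question, and 1,000 tokens reserved for the answer
print(max_top_k(8192, 512, 500, 1000))  # → 13
```

If the number this returns is below the top-k you want, either shrink the chunks, use a larger-context model, or rerank down to fewer chunks before generation.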
## Related Concepts
Top-k controls vector search retrieval. Reranking refines the order of retrieved chunks. Chunking determines chunk size; RAG uses retrieved chunks as context for the LLM.