Indexing
Indexing is the process of adding document content to a searchable vector index. After parsing and chunking, each chunk is embedded and stored in the index so that vector search and RAG can retrieve relevant passages.
How Indexing Works in DocLD
- Parse — Extract text and structure from the document.
- Chunk — Split into segments suitable for embedding.
- Embed — Convert each chunk to a vector using an embedding model.
- Upsert — Write vectors and metadata (e.g. document ID, knowledge base) to the vector index (e.g. Pinecone).
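The four steps above can be sketched end to end. This is a minimal illustration, not DocLD's implementation: `parse`, `chunk`, `embed`, `InMemoryIndex`, and `index_document` are hypothetical names, the embedding is a toy hashed bag-of-words rather than a real model, and the in-memory dictionary stands in for a hosted vector index such as Pinecone.

```python
import hashlib
import math
from dataclasses import dataclass, field


def parse(raw: str) -> str:
    """Parse step (placeholder): collapse whitespace; a real parser also extracts structure."""
    return " ".join(raw.split())


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Chunk step: split the text into segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str, dim: int = 8) -> list[float]:
    """Embed step (toy stand-in for an embedding model): hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a vector index; keeps records in a dict keyed by ID."""
    records: dict = field(default_factory=dict)

    def upsert(self, record_id: str, vector: list[float], metadata: dict) -> None:
        # Upsert semantics: insert a new record or overwrite the one with the same ID.
        self.records[record_id] = {"vector": vector, "metadata": metadata}


def index_document(index: InMemoryIndex, doc_id: str, kb_id: str, raw: str) -> int:
    """Run parse -> chunk -> embed -> upsert and return the number of chunks written."""
    chunks = chunk(parse(raw))
    for i, text in enumerate(chunks):
        metadata = {"document_id": doc_id, "knowledge_base": kb_id, "text": text}
        index.upsert(f"{doc_id}#{i}", embed(text), metadata)
    return len(chunks)


if __name__ == "__main__":
    idx = InMemoryIndex()
    written = index_document(idx, "doc-1", "kb-1", "word " * 120)
    print(written, sorted(idx.records))
```

Deriving each record ID from the document ID and chunk position means that indexing the same document again overwrites its earlier records instead of duplicating them, which is the behaviour the upsert step relies on.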
Once indexed, the document is searchable. If the source file or the chunking settings change, the stored vectors no longer match the current content, so you may need to reindex the document to bring the index up to date.
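One way to decide whether a reindex is needed is to store a fingerprint of the source file and chunking settings at indexing time and compare it later. The sketch below assumes that approach; `fingerprint` and `needs_reindex` are hypothetical helpers, not part of DocLD.

```python
import hashlib
import json
from typing import Optional


def fingerprint(content: bytes, chunk_settings: dict) -> str:
    """Hash the file bytes together with the chunking settings."""
    h = hashlib.sha256()
    h.update(content)
    h.update(b"\x00")  # separator between the content and the settings
    # sort_keys makes the hash independent of dictionary key order
    h.update(json.dumps(chunk_settings, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


def needs_reindex(stored: Optional[str], content: bytes, chunk_settings: dict) -> bool:
    """True if the document was never indexed or its content/settings have changed."""
    return stored != fingerprint(content, chunk_settings)


if __name__ == "__main__":
    saved = fingerprint(b"hello", {"max_words": 50})
    print(needs_reindex(saved, b"hello", {"max_words": 50}))   # unchanged
    print(needs_reindex(saved, b"hello", {"max_words": 100}))  # settings changed
```

When a reindex produces fewer chunks than before, records from the earlier run can linger under their old IDs, so removing a document's existing records before writing the new ones is a common precaution.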
Related Concepts
Reindex is re-running indexing for existing documents. Vector index and vector database store the indexed vectors. Ingestion often includes parsing, chunking, and indexing together.