Tokenization
Tokenization is the process of splitting text into discrete units (tokens) that language models and embedding models process. Tokens can be words, subwords, or characters depending on the model; token boundaries affect chunking size, context-window limits, and embedding cost.
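As a rough illustration of what a tokenizer produces, the sketch below splits text on word and punctuation boundaries. This is a simplification: production models typically use subword schemes such as BPE, so their token counts will differ from this naive split. The function name here is hypothetical, not from any specific library.

```python
import re

def rough_tokenize(text: str) -> list[str]:
    # Naive split into words and punctuation marks.
    # Real models use subword tokenizers (e.g. BPE), so actual
    # token counts for the same text will generally differ.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_tokenize("Tokenization splits text into tokens.")
print(tokens)       # ['Tokenization', 'splits', 'text', 'into', 'tokens', '.']
print(len(tokens))  # 6
```

Note that the trailing period counts as its own token here, which is why a token count is usually higher than a word count.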
Why Tokenization Matters
- Chunk sizing — Chunking often uses token or character counts; exceeding model limits can cause truncation or errors
- Context window — LLM and embedding models have token limits; tokenization determines how much content fits
- Cost — Some APIs bill per token; tokenization affects pricing for inference and embeddings
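The three concerns above all reduce to keeping each chunk within a token budget. A minimal sketch of budget-based chunking, assuming tokens have already been produced by some tokenizer (the function name and budget value are illustrative, not tied to any particular API):

```python
def chunk_by_token_budget(tokens: list[str], max_tokens: int) -> list[list[str]]:
    # Greedily pack tokens into consecutive chunks, each at most
    # max_tokens long, so no chunk exceeds a model's context window.
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

tokens = ["Tokenization", "splits", "text", "into", "tokens", "."]
print(chunk_by_token_budget(tokens, 4))
# [['Tokenization', 'splits', 'text', 'into'], ['tokens', '.']]
```

Because billing is often per token, the total token count across chunks also gives a first-order cost estimate for embedding a corpus.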
DocLD abstracts tokenization via chunking and embedding configuration. Pinecone handles embedding generation with model-specific limits.
Related Concepts
Tokenization underlies chunking, embedding, and context-window constraints. RAG and vector search depend on correctly sized chunks for retrieval and generation.