RAG (retrieval-augmented generation) grounds LLM generation in external knowledge by retrieving semantically relevant documents at query time and injecting them into the prompt context. This reduces hallucination and gives the model access to information newer than its training data.
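The injection step can be illustrated with a minimal sketch. The function name `build_prompt`, the instruction wording, and the numbered-passage layout are assumptions made for this example, not a prescribed template.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Place retrieved passages ahead of the question (hypothetical layout)."""
    # Number each passage so the model can refer to its sources.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer using only the numbered context passages. "
        "If they do not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What does a reranker do?",
    ["A cross-encoder reranker re-scores candidate chunks against the query."],
)
```

The resulting string would be sent to whichever LLM the system uses; that call is left out because it depends on the provider.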
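Embedding models map text to dense high-dimensional vectors in which semantic similarity corresponds to geometric proximity. The resulting space encodes meaning: synonyms and related terms cluster together, unrelated concepts lie far apart, and in word-level embeddings some analogical relationships form approximate parallelogram structures. Antonyms are a notable exception: because they occur in similar contexts, they often sit close together rather than far apart.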
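Geometric proximity is most often measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings, which typically have hundreds or thousands of dimensions; the values are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example, not output from a real model.
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.1]
banana = [0.0, 0.2, 0.9]

near = cosine_similarity(car, automobile)  # close to 1: similar meaning
far = cosine_similarity(car, banana)       # close to 0: unrelated
```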
Chunking strategy strongly influences retrieval precision. Fixed-size, sentence-aware, recursive, semantic, and proposition-level chunkers each produce different recall/precision trade-offs depending on document structure and query type.
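As a minimal sketch of the simplest of these strategies, the function below implements fixed-size chunking with overlap. It counts whitespace-separated words rather than model tokens, and the name `chunk_fixed` and its defaults are assumptions chosen for the example.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of chunk_size that share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


example_chunks = chunk_fixed("a b c d e f g h", chunk_size=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of some duplicated index content.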
At query time, the input is embedded with the same model, and therefore into the same vector space, as the indexed chunks. Approximate nearest neighbor search returns the top-K most semantically similar chunks. Recall quality depends on both the geometry of the embedding space and the accuracy of the approximate index.
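The sketch below shows exact (brute-force) top-K search, the result that approximate nearest neighbor indexes try to reproduce at lower cost. The in-memory dictionary index, the chunk ids, and the two-dimensional vectors are assumptions for illustration.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]


# Toy index: chunk id -> embedding (values invented for the example).
index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
}
result = top_k([0.9, 0.1], index, k=2)
```

Brute force scans every vector, so its cost grows linearly with the index size. Approximate indexes trade a small loss in recall for much faster lookups.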
Initial vector retrieval is fast but noisy. A cross-encoder reranker re-scores each candidate chunk jointly with the query, applying full attention across both texts, which typically improves precision at the cost of additional latency. Hybrid fusion of BM25 and vector results further improves recall.
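One common way to fuse a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The text does not say which fusion method is intended, so RRF is an assumption here, as are the document ids; the constant `k=60` is a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["d1", "d2", "d3"]    # hypothetical keyword results
vector_ranking = ["d3", "d1", "d4"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

RRF uses only rank positions, so it needs no calibration between BM25 scores and cosine similarities, which live on different scales. The fused list would then be passed to the reranker.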
Advanced RAG architectures maintain multi-tier memory: dense vector stores for semantic search, sparse BM25 indexes for keyword precision, knowledge graphs for structured reasoning, and conversation history for session continuity.
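The four tiers can be pictured as one container with a field per store. This is a structural sketch only: the class name `MemoryTiers`, its fields, and its methods are hypothetical, the sparse tier is a bare inverted index with no BM25 scoring, and a production system would back each tier with a dedicated service.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryTiers:
    # Dense tier: chunk id -> embedding vector, for semantic search.
    dense_vectors: dict[str, list[float]] = field(default_factory=dict)
    # Sparse tier: term -> chunk ids (the postings a BM25 scorer would read).
    sparse_postings: dict[str, set[str]] = field(default_factory=dict)
    # Graph tier: (subject, relation, object) triples for structured reasoning.
    graph_edges: set[tuple[str, str, str]] = field(default_factory=set)
    # Session tier: (role, message) pairs for conversation continuity.
    history: list[tuple[str, str]] = field(default_factory=list)

    def add_chunk(self, chunk_id: str, text: str, embedding: list[float]) -> None:
        """Index one chunk in both the dense and the sparse tier."""
        self.dense_vectors[chunk_id] = embedding
        for term in set(text.lower().split()):
            self.sparse_postings.setdefault(term, set()).add(chunk_id)

    def keyword_lookup(self, term: str) -> set[str]:
        """Exact-term lookup in the sparse tier."""
        return self.sparse_postings.get(term.lower(), set())


memory = MemoryTiers()
memory.add_chunk("c1", "Rerankers improve precision", [0.1, 0.9])
memory.graph_edges.add(("reranker", "improves", "precision"))
memory.history.append(("user", "What improves precision?"))
```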
The researchers and engineers shaping the retrieval-augmented generation landscape.
Core vocabulary for retrieval-augmented generation architecture and implementation.