A normal cache keyed on the exact request string is almost useless for LLM calls, because every paraphrase is a miss. Semantic caching keys on meaning instead — embed the query, search for a near-identical past question, and return its answer with no model call. Here's the architecture, the threshold problem that makes or breaks it, and real pgvector code.
Plain vector RAG can't answer multi-hop or 'across everything' questions — the answer is spread across chunks that no single chunk contains. GraphRAG extracts a knowledge graph instead. Here's how it works, the honest cost, and how to start in Postgres without a graph database.
Retrieval-augmented generation has been wrapped in enough mystique to obscure that it's mostly an ETL problem. What the pipeline actually looks like, where the real engineering happens, and the failure modes that have nothing to do with the model.