A normal cache keyed on the exact request string is almost useless for LLM calls, because every paraphrase is a miss. Semantic caching keys on meaning instead — embed the query, search for a near-identical past question, and return its answer with no model call. Here's the architecture, the threshold problem that makes or breaks it, and real pgvector code.
A modest app somehow grew Postgres, Redis, RabbitMQ, Elasticsearch and a vector DB — five things to back up, secure and pay for. Most of that is now one Postgres. Here's the queue, vector, search and pub/sub SQL, and the honest signals for when to graduate.
Click 'pay' twice when the page hangs and you shouldn't get charged twice. Here's how to make that promise — with one Redis key, a TTL, and a small race-condition guard.