Why streaming an 8-second answer makes it feel fast even though it isn't, how to pick between long-poll, SSE, and WebSockets, and how to build a NestJS SSE endpoint that proxies a streaming LLM call — including the proxy-buffering, cancellation, and mid-stream-error gotchas that bite everyone the first time.
A normal cache keyed on the exact request string is almost useless for LLM calls, because every paraphrase is a miss. Semantic caching keys on meaning instead — embed the query, search for a near-identical past question, and return its answer with no model call. Here's the architecture, the threshold problem that makes or breaks it, and real pgvector code.
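The embed, search, return flow can be sketched in a few lines. This is a minimal in-memory illustration, not pgvector code: `SemanticCache`, `toyEmbed`, `VOCAB`, and the similarity threshold are all hypothetical names and values chosen for the example, and the bag-of-words embedder stands in for a real embedding model.

```typescript
// Hypothetical sketch: an in-memory semantic cache. A real setup would
// call an embedding model and store the vectors in Postgres with pgvector.
type Embedder = (text: string) => number[];

interface CacheEntry {
  embedding: number[];
  answer: string;
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private embed: Embedder;
  private threshold: number;

  constructor(embed: Embedder, threshold: number) {
    this.embed = embed;
    this.threshold = threshold;
  }

  // Returns the cached answer of the most similar past query,
  // or undefined when nothing clears the threshold (a cache miss).
  lookup(query: string): string | undefined {
    const q = this.embed(query);
    let best: CacheEntry | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(q, entry.embedding);
      if (score > bestScore) {
        bestScore = score;
        best = entry;
      }
    }
    return best !== undefined && bestScore >= this.threshold
      ? best.answer
      : undefined;
  }

  // Called after a real model call, so the next paraphrase can hit.
  store(query: string, answer: string): void {
    this.entries.push({ embedding: this.embed(query), answer });
  }
}

// Toy embedder (assumption): word counts over a tiny fixed vocabulary.
const VOCAB = ["reset", "password", "change", "refund", "order"];
const toyEmbed: Embedder = (text) => {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map((term) => words.filter((w) => w === term).length);
};
```

The threshold is the knob that matters: set it too low and unrelated questions get each other's answers, too high and paraphrases miss again.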
Plain vector RAG can't answer multi-hop or 'across everything' questions, because the answer is spread across many chunks and no single chunk contains it. GraphRAG extracts a knowledge graph instead. Here's how it works, the honest cost, and how to start in Postgres without a graph database.
A modest app somehow grew Postgres, Redis, RabbitMQ, Elasticsearch and a vector DB: five things to back up, secure and pay for. Most of that can now run on a single Postgres. Here's the queue, vector, search and pub/sub SQL, and the honest signals for when to graduate to dedicated systems.
Your OrderService saves to Postgres and publishes to Kafka — two systems, no shared transaction. There is no safe order to do them in. The outbox pattern makes the write atomic and lets the broker catch up. Here's how, with the relay tradeoffs and the guarantees you actually get.
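A rough sketch of the outbox idea, with the database and broker replaced by in-memory stand-ins. `FakeDb`, `placeOrder`, `relayOnce`, and the `orders.created` topic are hypothetical names for illustration; in a real service the transaction would be a Postgres transaction and `publish` would be a Kafka producer call.

```typescript
// Hypothetical in-memory stand-in for a database with two tables.
interface Order {
  id: string;
  total: number;
}

interface OutboxRow {
  id: number;
  topic: string;
  payload: string;
  publishedAt: number | null; // null until the relay has published the row
}

interface Tx {
  insertOrder(order: Order): void;
  insertOutbox(topic: string, payload: string): void;
}

class FakeDb {
  orders: Order[] = [];
  outbox: OutboxRow[] = [];
  private nextOutboxId = 1;

  // Simulates one transaction: writes are staged, then applied together.
  // If `work` throws, nothing is applied, which mimics a rollback.
  transaction(work: (tx: Tx) => void): void {
    const stagedOrders: Order[] = [];
    const stagedOutbox: OutboxRow[] = [];
    work({
      insertOrder: (order) => {
        stagedOrders.push(order);
      },
      insertOutbox: (topic, payload) => {
        stagedOutbox.push({ id: this.nextOutboxId++, topic, payload, publishedAt: null });
      },
    });
    this.orders.push(...stagedOrders);
    this.outbox.push(...stagedOutbox);
  }
}

// The service writes the order and the event in the same transaction.
// It never talks to the broker directly.
function placeOrder(db: FakeDb, order: Order): void {
  db.transaction((tx) => {
    tx.insertOrder(order);
    tx.insertOutbox("orders.created", JSON.stringify(order));
  });
}

type Publish = (topic: string, payload: string) => void;

// One relay pass: publish pending rows in insertion order, marking each
// only after publish succeeds. Returns how many rows were published.
function relayOnce(db: FakeDb, publish: Publish): number {
  let published = 0;
  for (const row of db.outbox) {
    if (row.publishedAt !== null) continue;
    try {
      publish(row.topic, row.payload);
    } catch {
      break; // broker unavailable: stop here and retry on the next pass
    }
    row.publishedAt = Date.now();
    published++;
  }
  return published;
}
```

Because a row is marked only after `publish` returns, a crash between those two steps would republish it on the next pass. That makes this sketch at-least-once, so consumers should be idempotent.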
Half the 'how to become an AI engineer' advice tells you to start with linear algebra. After years on backend, here's the honest, narrower path — what to learn, what to skip, and what the job actually looks like in 2026.