Prompt engineering was about wording one message. Context engineering is about managing the entire context window as a scarce budget — what goes in, in what order, and what gets evicted. For a backend engineer, it's working-set management applied to an LLM.
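To make the budget idea concrete, here is a minimal sketch of one possible eviction policy: always keep the system prompt, then keep the newest messages that still fit. Everything named here (`Message`, `count_tokens`, `fit_to_budget`) is hypothetical. The whitespace word count is only a stand-in for a real tokenizer, and "drop oldest first" is one policy among many, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


def count_tokens(text: str) -> int:
    # Assumption: a crude whitespace word count stands in for a real tokenizer.
    return len(text.split())


def fit_to_budget(system: Message, history: list[Message], budget: int) -> list[Message]:
    """Keep the system prompt, then the newest history messages that fit in `budget`."""
    remaining = budget - count_tokens(system.content)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the budget")

    kept: list[Message] = []
    # Walk from newest to oldest; stop at the first message that does not fit,
    # so the surviving history is always a contiguous recent window.
    for msg in reversed(history):
        cost = count_tokens(msg.content)
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost

    kept.reverse()  # restore chronological order
    return [system] + kept
```

With a 3-word system prompt and a budget of 9, a history costing 4, 2, and 3 words keeps only the last two messages: the oldest one is the first to be evicted.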
Why streaming an 8-second answer makes it feel fast even though it isn't, how to pick between long-poll, SSE, and WebSockets, and how to build a NestJS SSE endpoint that proxies a streaming LLM call — including the proxy-buffering, cancellation, and mid-stream-error gotchas that bite everyone the first time.
A normal cache keyed on the exact request string is almost useless for LLM calls, because every paraphrase is a miss. Semantic caching keys on meaning instead — embed the query, search for a near-identical past question, and return its answer with no model call. Here's the architecture, the threshold problem that makes or breaks it, and real pgvector code.
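The embed, search, threshold flow can be sketched in a few lines. This is an in-memory illustration only, not the pgvector version: `SemanticCache`, `toy_embed`, and the 0.9 threshold are all assumptions made up for the example. The toy embedding is a bag-of-words count over a five-word vocabulary, standing in for a real embedding model. In Postgres, the linear scan would become a nearest-neighbour query (pgvector exposes cosine distance through its `<=>` operator).

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional

Vector = list[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]   # assumption: caller supplies the embedding function
    threshold: float = 0.9           # assumption: illustrative value, must be tuned on real data
    entries: list[tuple[Vector, str]] = field(default_factory=list)

    def store(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def lookup(self, query: str) -> Optional[str]:
        """Return the cached answer for the most similar past query, or None on a miss."""
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:  # linear scan; a vector index replaces this at scale
            score = cosine_similarity(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


def toy_embed(text: str) -> Vector:
    # Hypothetical stand-in for a real embedding model: word counts over a tiny vocabulary.
    vocab = ["reset", "password", "change", "refund", "order"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]
```

After storing "How do I reset my password?", a lookup for "reset password please" hits, while "change password" scores 0.5 under the toy embedding and misses. That gap is the threshold problem in miniature: set the cutoff too low and a different question gets someone else's answer; set it too high and paraphrases stop hitting.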
Treating LLM calls as just another upstream dependency. How to use Spring AI to build a multi-provider gateway with retries, circuit breakers, prompt versioning, and observability — the same hygiene you'd put around any external API.