How we ship LLM features without breaking prod
Evals, traffic shadowing, and the boring deploy gates that keep our agents trustworthy.
LLM features are the only things in our codebase that change behavior without us deploying. Models update silently, prompts drift as we tweak them, and a fix in one place can break four others. Here's the process we use to ship them safely.
Every LLM feature ships with three things: an eval suite, a fallback path, and observability. None are optional.
The eval suite is a list of input → expected-output pairs, scored automatically. We run it on every PR, every model upgrade, and every prompt change. If scores drop below the current baseline, the change doesn't merge.
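A minimal sketch of what that gate looks like. The names (`EvalCase`, `run_evals`, the exact-match scorer, the 95% threshold) are illustrative, not our real harness — real suites usually need fuzzier scoring than string equality:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    input: str
    expected: str


def run_evals(cases: List[EvalCase],
              model: Callable[[str], str],
              threshold: float = 0.95) -> bool:
    """Score every case and return True only if the pass rate meets the threshold."""
    passed = sum(1 for c in cases if model(c.input).strip() == c.expected)
    rate = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({rate:.0%})")
    return rate >= threshold
```

CI runs this on every PR and exits non-zero when it returns `False`, which is what blocks the merge.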
The fallback path is what happens when the LLM call fails or returns garbage. For agentic workflows that's usually "queue for human review"; for chat that's a "let me get someone" handoff; for content classification that's "tag as unclassified, escalate to ops".
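Sketching the classification case: wrap the model call so that both hard failures and garbage outputs take the same escape hatch. `classify_with_fallback`, `valid_labels`, and `escalate` are hypothetical names for illustration, assuming the escalation target is an ops queue:

```python
from typing import Callable, Set


def classify_with_fallback(text: str,
                           llm_call: Callable[[str], str],
                           valid_labels: Set[str],
                           escalate: Callable[[str], None]) -> str:
    """Return a valid label, or escalate and tag as unclassified."""
    try:
        label = llm_call(text).strip().lower()
    except Exception:
        label = None  # treat a failed call the same as a bad answer
    if label not in valid_labels:
        escalate(text)          # e.g. push onto an ops review queue
        return "unclassified"
    return label
```

The key design choice is that an invalid label and a thrown exception land in the same branch — the fallback doesn't care *why* the model let us down.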
Observability is end-to-end traces of every LLM call: input, output, model version, latency, tokens, cost. We use Langfuse for this and log every production call. When something goes wrong, we know within minutes.
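The shape of what gets captured per call, as a generic sketch — in production this goes through the Langfuse SDK, but the dict below just shows the fields, not Langfuse's API, and `traced_llm_call` is a hypothetical wrapper:

```python
import time
import uuid
from typing import Callable


def traced_llm_call(llm_call: Callable[[str], str],
                    prompt: str,
                    model_version: str,
                    log: Callable[[dict], None]) -> str:
    """Run the call and emit one trace record with input, output, model, and latency."""
    start = time.monotonic()
    output = llm_call(prompt)
    log({
        "trace_id": str(uuid.uuid4()),
        "model": model_version,
        "input": prompt,
        "output": output,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        # token counts and cost come from the provider's usage object,
        # omitted here since the model is stubbed
    })
    return output
```

Because every production call passes through one wrapper, "log every call" is an invariant rather than a discipline.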
On top of that, we shadow-traffic new prompts before they go live. Real production input gets sent to both the old and new prompt; we compare outputs offline before flipping. This catches the kind of drift evals miss.
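The shadowing step can be sketched as a request handler that serves the old prompt, fires the new one on the side, and records both for offline diffing. All names here (`handle_request`, `record`) are illustrative, and the shadow call is assumed to be best-effort:

```python
from typing import Callable


def handle_request(user_input: str,
                   old_prompt: str,
                   new_prompt: str,
                   llm_call: Callable[[str, str], str],
                   record: Callable[[dict], None]) -> str:
    """Serve the old prompt; run the new one in shadow and log the pair."""
    live = llm_call(old_prompt, user_input)        # what the user sees
    try:
        shadow = llm_call(new_prompt, user_input)  # never user-facing
    except Exception as exc:
        shadow = f"ERROR: {exc}"                   # a crashing prompt is itself a finding
    record({"input": user_input, "old": live,
            "new": shadow, "match": live == shadow})
    return live                                    # only the old output ships
```

The offline job then works through the recorded pairs — the `match` flag surfaces cheap exact-match drift, and the interesting review happens on the mismatches.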