
How we ship LLM features without breaking prod

Evals, traffic shadowing, and the boring deploy gates that keep our agents trustworthy.

March 21, 2026 · 11 min read
Engineering

LLM features are the only part of our codebase that changes behavior without a deploy. Models update silently, prompts drift as we tweak them, and a fix in one place can break four others. Here's the process we use to ship them safely.

Every LLM feature ships with three things: an eval suite, a fallback path, and observability. None are optional.

The eval suite is a list of input → expected-output pairs, scored automatically. We run it on every PR, every model upgrade, and every prompt change. If scores drop, the change doesn't merge.
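As a minimal sketch of that gate, the script below assumes a `cases.jsonl` file of input/expected pairs, a hypothetical `run_llm` helper, and exact-match scoring; graded or model-judged scorers slot into `score` the same way.

```python
import json
import sys

PASS_THRESHOLD = 0.95  # hypothetical gate: the PR check fails below this score


def run_llm(prompt: str) -> str:
    """Stand-in for the LLM call under test."""
    raise NotImplementedError


def score(expected: str, actual: str) -> float:
    """Simplest possible scorer: exact match. Swap in graded or LLM-judge scoring here."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def main(path: str = "cases.jsonl") -> None:
    cases = [json.loads(line) for line in open(path) if line.strip()]
    scores = [score(c["expected"], run_llm(c["input"])) for c in cases]
    avg = sum(scores) / len(scores)
    print(f"eval score: {avg:.3f} over {len(cases)} cases")
    # CI gate: a non-zero exit blocks the merge.
    sys.exit(0 if avg >= PASS_THRESHOLD else 1)


if __name__ == "__main__":
    main()
```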

The fallback path is what happens when the LLM call fails or returns garbage. For agentic workflows that's usually "queue for human review"; for chat that's a "let me get someone" handoff; for content classification that's "tag as unclassified, escalate to ops".
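One way to keep the fallback from being forgotten is to put it in a single wrapper every caller goes through. The sketch below uses the classification case; `classify`, `queue_for_review`, and the label set are hypothetical names, not our actual code.

```python
from typing import Optional

ALLOWED_LABELS = {"invoice", "contract", "receipt"}  # hypothetical label schema


def classify(document: str) -> str:
    """Stand-in for the real LLM classification call."""
    raise NotImplementedError


def queue_for_review(document: str) -> None:
    """Stand-in for the ops escalation queue."""
    ...


def classify_with_fallback(document: str) -> str:
    """Call the classifier; if it fails or returns garbage, tag and escalate."""
    label: Optional[str] = None
    try:
        label = classify(document)
    except Exception:
        pass
    # Covers both a failed call (None) and output outside the allowed schema.
    if label not in ALLOWED_LABELS:
        queue_for_review(document)
        return "unclassified"
    return label
```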

Observability means an end-to-end trace of every LLM call: input, output, model version, latency, tokens, cost. We use Langfuse for this and log every production call. When something goes wrong, we know within minutes.
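A sketch of the fields such a trace carries, wrapped around the call site. The dataclass here is a plain stand-in for whatever the Langfuse SDK ingests, and `llm_call` and `send_to_langfuse` are hypothetical shims rather than real SDK calls.

```python
import time
from dataclasses import dataclass, asdict


@dataclass
class LLMTrace:
    # The fields logged for every production call.
    input: str
    output: str
    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float


def llm_call(prompt: str, model: str) -> tuple[str, int, int, float]:
    """Stand-in for the real client; returns (text, prompt_tokens, completion_tokens, cost)."""
    raise NotImplementedError


def send_to_langfuse(record: dict) -> None:
    """Stand-in for the SDK call that ships the trace to the backend."""
    ...


def traced_call(prompt: str, model: str) -> str:
    start = time.perf_counter()
    text, p_tok, c_tok, cost = llm_call(prompt, model)
    send_to_langfuse(asdict(LLMTrace(
        input=prompt,
        output=text,
        model=model,
        latency_ms=(time.perf_counter() - start) * 1000,
        prompt_tokens=p_tok,
        completion_tokens=c_tok,
        cost_usd=cost,
    )))
    return text
```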

On top of that, we shadow-traffic new prompts before they go live. Real production input gets sent to both the old and new prompt; we compare outputs offline before flipping. This catches the kind of drift evals miss.
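A sketch of the shadow path, assuming the candidate prompt runs asynchronously off the request path so users never see its output; the prompt names and the comparison store are hypothetical.

```python
import asyncio


async def render(prompt_version: str, user_input: str) -> str:
    """Stand-in for running one prompt version against the model."""
    raise NotImplementedError


async def record_shadow_pair(user_input: str, live: str, shadow: str) -> None:
    """Stand-in: persist both outputs for offline comparison before the flip."""
    ...


async def handle_request(user_input: str) -> str:
    # The live prompt serves the user exactly as before.
    live_output = await render("prompt_v1", user_input)

    async def shadow() -> None:
        # The candidate sees the same real input, but its output is never returned.
        shadow_output = await render("prompt_v2_candidate", user_input)
        await record_shadow_pair(user_input, live_output, shadow_output)

    asyncio.create_task(shadow())  # fire-and-forget, off the critical path
    return live_output
```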
