
How we ship LLM features without breaking prod

Evals, traffic shadowing, and the boring deploy gates that keep our agents trustworthy.

March 21, 2026 · 11 min read
Engineering

LLM features are the only part of our codebase that changes behavior without a deploy. Models update silently, prompts drift as we tweak them, and a fix in one place can break four others. Here's the process we use to ship them safely.

Every LLM feature ships with three things: an eval suite, a fallback path, and observability. None are optional.

The eval suite is a list of input → expected-output pairs, scored automatically. We run it on every PR, every model upgrade, and every prompt change. If scores drop, the change doesn't merge.
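As a minimal sketch of that gate, the script below assumes a `cases.jsonl` file of input/expected pairs, a hypothetical `run_llm` helper, and exact-match scoring; graded or model-judged scorers slot into `score` the same way.

```python
import json
import sys

PASS_THRESHOLD = 0.95  # hypothetical gate: the PR check fails below this score


def run_llm(prompt: str) -> str:
    """Stand-in for the LLM call under test."""
    raise NotImplementedError


def score(expected: str, actual: str) -> float:
    """Simplest possible scorer: exact match. Swap in graded or LLM-judge scoring here."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def main(path: str = "cases.jsonl") -> None:
    cases = [json.loads(line) for line in open(path) if line.strip()]
    scores = [score(c["expected"], run_llm(c["input"])) for c in cases]
    avg = sum(scores) / len(scores)
    print(f"eval score: {avg:.3f} over {len(cases)} cases")
    # CI gate: a non-zero exit blocks the merge.
    sys.exit(0 if avg >= PASS_THRESHOLD else 1)


if __name__ == "__main__":
    main()
```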

The fallback path is what happens when the LLM call fails or returns garbage. For agentic workflows that's usually "queue for human review"; for chat that's a "let me get someone" handoff; for content classification that's "tag as unclassified, escalate to ops".
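One way to keep the fallback from being forgotten is to put it in a single wrapper every caller goes through. The sketch below uses the classification case; `classify`, `queue_for_review`, and the label set are hypothetical names, not our actual code.

```python
from typing import Optional

ALLOWED_LABELS = {"invoice", "contract", "receipt"}  # hypothetical label schema


def classify(document: str) -> str:
    """Stand-in for the real LLM classification call."""
    raise NotImplementedError


def queue_for_review(document: str) -> None:
    """Stand-in for the ops escalation queue."""
    ...


def classify_with_fallback(document: str) -> str:
    """Call the classifier; if it fails or returns garbage, tag and escalate."""
    label: Optional[str] = None
    try:
        label = classify(document)
    except Exception:
        pass
    # Covers both a failed call (None) and output outside the allowed schema.
    if label not in ALLOWED_LABELS:
        queue_for_review(document)
        return "unclassified"
    return label
```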

Observability means an end-to-end trace of every LLM call: input, output, model version, latency, tokens, cost. We use Langfuse for this and log every production call. When something goes wrong, we know within minutes.
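A sketch of the fields such a trace carries, wrapped around the call site. The dataclass here is a plain stand-in for whatever the Langfuse SDK ingests, and `llm_call` and `send_to_langfuse` are hypothetical shims rather than real SDK calls.

```python
import time
from dataclasses import dataclass, asdict


@dataclass
class LLMTrace:
    # The fields logged for every production call.
    input: str
    output: str
    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float


def llm_call(prompt: str, model: str) -> tuple[str, int, int, float]:
    """Stand-in for the real client; returns (text, prompt_tokens, completion_tokens, cost)."""
    raise NotImplementedError


def send_to_langfuse(record: dict) -> None:
    """Stand-in for the SDK call that ships the trace to the backend."""
    ...


def traced_call(prompt: str, model: str) -> str:
    start = time.perf_counter()
    text, p_tok, c_tok, cost = llm_call(prompt, model)
    send_to_langfuse(asdict(LLMTrace(
        input=prompt,
        output=text,
        model=model,
        latency_ms=(time.perf_counter() - start) * 1000,
        prompt_tokens=p_tok,
        completion_tokens=c_tok,
        cost_usd=cost,
    )))
    return text
```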

On top of that, we shadow-traffic new prompts before they go live. Real production input gets sent to both the old and new prompt; we compare outputs offline before flipping. This catches the kind of drift evals miss.
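A sketch of the shadow path, assuming the candidate prompt runs asynchronously off the request path so users never see its output; the prompt names and the comparison store are hypothetical.

```python
import asyncio


async def render(prompt_version: str, user_input: str) -> str:
    """Stand-in for running one prompt version against the model."""
    raise NotImplementedError


async def record_shadow_pair(user_input: str, live: str, shadow: str) -> None:
    """Stand-in: persist both outputs for offline comparison before the flip."""
    ...


async def handle_request(user_input: str) -> str:
    # The live prompt serves the user exactly as before.
    live_output = await render("prompt_v1", user_input)

    async def shadow() -> None:
        # The candidate sees the same real input, but its output is never returned.
        shadow_output = await render("prompt_v2_candidate", user_input)
        await record_shadow_pair(user_input, live_output, shadow_output)

    asyncio.create_task(shadow())  # fire-and-forget, off the critical path
    return live_output
```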
