Making AI production-ready isn't about the model

TL;DR: the 30-second version

The hard part of shipping AI isn’t picking the model — it’s everything around it: retries, fallbacks, observability, and keeping the bill sane. The leaderboard rarely decides whether your system survives production. The system around the model does.

A lot of teams obsess over which model scores highest on a leaderboard. In production, that’s rarely the bottleneck. The bottleneck is the system around the model.

The model is one box on the request path. Bounded retries, a deterministic fallback, and a circuit breaker are what keep one slow provider from taking down the whole request.

Reliability first

LLM calls fail — rate limits, timeouts, malformed JSON. Treat them like any flaky network dependency: bounded retries with backoff, a deterministic fallback, and a circuit breaker so one slow provider doesn’t take down the request path. Doing this carefully cut our error rate by about 25%.
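Here is a minimal sketch of how those three pieces can fit together. It is not our production code: `CircuitBreaker`, `call_with_resilience`, `call_model`, and `fallback` are hypothetical names, and the thresholds are placeholder values you would tune for your own provider.

```python
import random
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures.

    Once open, it blocks calls until `reset_after` seconds have passed,
    then lets a trial call through (the "half-open" state).
    """

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_resilience(
    call_model: Callable[[str], str],   # hypothetical provider call
    fallback: Callable[[str], str],     # deterministic fallback, e.g. a canned reply
    prompt: str,
    breaker: CircuitBreaker,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(max_attempts):
        if not breaker.allow():
            break  # Circuit is open: skip the provider entirely.
        try:
            result = call_model(prompt)
        except Exception:
            # Sketch only: real code should catch the provider's specific
            # rate-limit, timeout, and parse errors instead of Exception.
            breaker.record_failure()
            if attempt < max_attempts - 1:
                # Exponential backoff with full jitter.
                sleep(random.uniform(0, base_delay * 2 ** attempt))
            continue
        breaker.record_success()
        return result
    return fallback(prompt)
```

Share one `CircuitBreaker` per provider across requests. That way, once a provider has failed repeatedly, later requests go straight to the fallback instead of each paying for their own round of retries.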

Make it observable

You can’t fix what you can’t see. Log prompts, model, latency, token counts, and outcome for every call. When quality regresses, you want to diff prompts and inputs — not guess.
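As a rough sketch, one structured log line per call covers most of that list. The wrapper below uses only Python's standard `logging` and `json` modules. `logged_call` is a hypothetical helper, and the response shape (a dict with `input_tokens` and `output_tokens`) is an assumption, not any particular provider's API.

```python
import json
import logging
import time
from typing import Callable, Dict

logger = logging.getLogger("llm_calls")


def logged_call(call_model: Callable[[str], Dict], model: str, prompt: str) -> Dict:
    """Call the model and emit one JSON log line, whether it succeeds or fails.

    Assumes `call_model` returns a dict such as
    {"text": ..., "input_tokens": ..., "output_tokens": ...} (hypothetical shape).
    """
    record = {"model": model, "prompt": prompt}
    start = time.perf_counter()
    try:
        response = call_model(prompt)
        record["outcome"] = "ok"
        record["input_tokens"] = response.get("input_tokens")
        record["output_tokens"] = response.get("output_tokens")
        return response
    except Exception as exc:
        record["outcome"] = f"error:{type(exc).__name__}"
        raise
    finally:
        # Runs on both the success and the error path.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps(record))
```

Because every line is JSON with the same keys, diffing prompts between a good day and a bad day becomes a query rather than an archaeology project.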

Spend like it’s your money

Model selection is a cost lever, not just a quality lever. Routing easy requests to a cheaper model and reserving the expensive one for hard cases saved us $5,000+/yr without users noticing.
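To make the routing idea concrete, here is a deliberately crude sketch. The model names are placeholders, and the `looks_hard` heuristic (prompt length plus a few keywords) is an assumption for illustration, not the rule we actually used. A real router might use a classifier or past outcomes instead.

```python
from typing import Callable

CHEAP_MODEL = "small-model"      # hypothetical model name
EXPENSIVE_MODEL = "large-model"  # hypothetical model name


def looks_hard(prompt: str, max_easy_words: int = 200) -> bool:
    """Assumed heuristic: long prompts or reasoning-style keywords count as hard."""
    keywords = ("prove", "step by step", "refactor", "analyze")
    lowered = prompt.lower()
    too_long = len(lowered.split()) > max_easy_words
    return too_long or any(k in lowered for k in keywords)


def pick_model(prompt: str, is_hard: Callable[[str], bool] = looks_hard) -> str:
    """Route hard prompts to the expensive model and everything else to the cheap one."""
    return EXPENSIVE_MODEL if is_hard(prompt) else CHEAP_MODEL
```

Keeping `is_hard` as a parameter means the heuristic can be swapped out later without touching the call sites.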

The unglamorous work is the work. The model is the easy part.

#reliability #llmops #cost #observability
