If you've shipped one LLM feature, you've made a decision about which model serves it. If you've shipped three, you've discovered that the right model differs by task, by tenant, by cost, and by what's awake at 2am. The fourth feature is where teams either build a router or buy themselves into lock-in. This note is about the first option.
Model classes, not models.
The most useful abstraction is a model class, a logical tier with quality, latency, and cost characteristics, backed by one or more concrete models. Code calls into the class; the router decides the model.
- flagship · Claude Opus, GPT-4: high-quality synthesis, expensive, slowest.
- fast · Claude Haiku, GPT-4 mini: extraction, tagging, classification.
- open-weight · self-hosted Llama or Mistral: data-sensitive paths, cost-bounded inference.
- specialist · re-rankers, embeddings, vision: narrow models, not chat.
Application code never asks for "Claude Opus 4.6 at 0.7 temperature." It asks for flagship.synthesize(...). The model behind the class can be swapped in minutes without touching call sites.
Router shape.
The router is a small piece of code with surprisingly high leverage. Three inputs decide what it sends: request shape (task type, tenant, sensitivity), system state (provider health, budget consumed), and policy (per-tenant ceilings, fallback ordering).
class Router:
def route(self, request: Request) -> ModelChoice:
klass = self.classes[request.task]
candidates = klass.candidates_for(request)
# filter by health and budget
live = [m for m in candidates if self.health.ok(m)]
affordable = [m for m in live if self.budget.ok(request.tenant, m)]
if not affordable:
return klass.degraded(request) # open-weight fallback
return self.policy.pick(affordable, request)The router is not where you want clever code. It's where you want code so dumb and so observable that you can debug it at 3am with a stale dashboard.
Failover, designed for the bad day.
Provider outages are not edge cases anymore. They're a normal part of the operational picture. The two patterns that actually help:
- Circuit breakers per provider, not per model. If Anthropic's API is degraded, Claude Opus and Claude Haiku are both at risk. Break at the provider.
- Degraded modes, declared up front. Every class has a documented degraded path. Flagship → fast is acceptable; fast → open-weight is acceptable for many use cases. Synthesis-quality features may need to fail explicitly instead of silently degrading.
- Quality regression alerts. If 30% of synthesis requests are running on the degraded class, that's a paging event. Silently degraded quality is worse than declared downtime.
The router's job isn't to be clever. It's to be the one place we look when we need to explain what the platform decided to do.
Consistent telemetry across providers.
Every provider reports tokens, latency, and errors differently. The router normalizes them into a single shape, and that shape is the source of truth for cost dashboards, p95 alerts, and quality regression tracking.
The non-negotiables:
- Tokens in / tokens out, normalized per provider's tokenization where it matters.
- Cost in cents, computed at call time, not estimated.
- Time to first token and time to last token, separately. Both matter for streaming.
- Provider error code + class: rate limit, content filter, timeout, server error, content policy. Distinct categories drive distinct alerting.
- Tenant + task + model class on every record. Without these you can't answer "what did we spend on tenant X last month?"
Closing.
Multi-model routing is one of those investments that looks like over-engineering on day one and underinvestment on day ninety. Build it once, build it small, and it pays back the first time a provider has a bad afternoon.
If you're stuck on whether this is the right pattern for your shape, we're happy to talk it through.