Model governance for agents that run in production
Which model runs which step, which version is pinned, what passes before rollout — these are policy decisions, not strings buried in agent code. We make model choice a governed, observable, reversible layer.
- Per-step model routing
- Version pinning & promotion
- Pre-rollout evaluation
- Provider fallback & redundancy
Model choice is an architectural layer, not a constant
The cheapest way to ship an agent is to hard-code one model. It is also the fastest way to a brittle, expensive system.
When a model name is a literal inside agent logic, every concern collapses onto one string: cost, quality, latency, provider availability, and deprecation risk all become invisible until they break. You cannot A/B a swap, you cannot roll back cleanly, and you cannot answer the audit question — which model produced this decision, and was it the version we validated?
Model governance separates the choice of model from the work the agent does. A routing policy maps each step to a model class, version pins lock what actually ships, and an evaluation gate decides what gets promoted. The agent calls a step; the policy decides the model. That indirection is what makes a fleet safe to operate and cheap to evolve.
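As a rough illustration of that indirection, here is a minimal sketch in which the agent asks for a step and a policy table answers with a pinned model. The step names, model identifiers, and version strings are all hypothetical placeholders, not real provider models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPin:
    model_class: str  # hypothetical class label, e.g. "small-fast" or "frontier"
    model: str        # hypothetical model identifier
    version: str      # exact pinned version, never a floating alias


# Hypothetical policy: step name -> pinned model. In practice this would live
# in reviewed config, not in agent code.
ROUTING_POLICY = {
    "classify_ticket": ModelPin("small-fast", "small-model", "2024-06-01"),
    "plan_resolution": ModelPin("frontier", "frontier-model", "2024-05-13"),
}


def resolve_model(step: str, policy: dict = ROUTING_POLICY) -> ModelPin:
    """Return the pinned model for a step; fail loudly if the step is ungoverned."""
    try:
        return policy[step]
    except KeyError:
        raise LookupError(f"no routing policy for step {step!r}") from None
```

The agent code only ever names the step, so swapping a model means editing the policy entry rather than the agent.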
The pieces of a governance layer
Each is a deliberate engineering choice with its own tradeoffs — and each links to the part of the architecture it touches.
Per-step routing
Profile every step by reasoning depth, context, latency, and cost ceiling, then route to the right model class instead of one model for everything.
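One way to picture step profiling is a small rule that turns a profile into a model class. This is a sketch only: the 1 to 5 depth scale, the class names, and every threshold below are assumptions chosen for illustration, and real cutoffs would come from benchmarking each step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepProfile:
    reasoning_depth: int     # hypothetical scale: 1 (shallow) to 5 (deep)
    context_tokens: int      # typical prompt size for the step
    latency_budget_ms: int   # how long the step may take
    cost_ceiling_usd: float  # maximum acceptable cost per call


def choose_model_class(p: StepProfile) -> str:
    """Map a step profile to a model class using illustrative thresholds."""
    needs_large = p.reasoning_depth >= 4 or p.context_tokens > 32_000
    can_afford_frontier = p.latency_budget_ms >= 2_000 and p.cost_ceiling_usd >= 0.01
    if needs_large and can_afford_frontier:
        return "frontier"
    if needs_large:
        # Demanding step whose budget rules out a frontier model: flag for review.
        return "mid-tier"
    return "small-fast"
```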
Version pinning
Pin the exact model + version each step runs. No provider auto-upgrade reaches production without passing your gate first.
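A pin is only useful if it cannot float. The sketch below rejects alias-style versions before they enter the policy; the alias list and the two accepted version shapes (a date or a dotted number) are assumptions, since providers name versions differently.

```python
import re

# Assumption: these alias names float to whatever the provider currently serves.
FLOATING_ALIASES = {"latest", "stable", "default", "auto"}

# Assumption: an exact version looks like 2024-06-01 or 1.2.0.
EXACT_VERSION = re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d+(\.\d+)+$")


def is_exact_pin(version: str) -> bool:
    """Return True only for a version string that names one fixed release."""
    v = version.strip().lower()
    if v in FLOATING_ALIASES:
        return False
    return bool(EXACT_VERSION.match(v))
```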
Promotion policy
A new model is shadow-tested on real traffic and promoted only when it beats the incumbent on quality, cost, and latency.
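Read literally, that rule is a three-way comparison. The sketch below assumes each model has already been scored into a single quality, cost, and latency figure; the field names and the strict "better on all three" reading are our assumptions, and a team might reasonably relax it to "no worse within a tolerance".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    quality: float     # higher is better, e.g. pass rate on the evaluation set
    cost_usd: float    # lower is better, e.g. mean cost per step
    latency_ms: float  # lower is better, e.g. p95 latency


def beats_incumbent(candidate: EvalResult, incumbent: EvalResult) -> bool:
    """Promote only if the candidate is strictly better on all three axes."""
    return (
        candidate.quality > incumbent.quality
        and candidate.cost_usd < incumbent.cost_usd
        and candidate.latency_ms < incumbent.latency_ms
    )
```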
Fallback & redundancy
When a provider degrades or rate-limits, the policy fails over to an equivalent model so the workflow doesn't stall.
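A failover policy can be as simple as an ordered list of equivalent providers. In this sketch the `ProviderUnavailable` exception and the provider callables are hypothetical stand-ins for whatever a real client raises when it is degraded or rate-limited.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error a provider client raises when degraded or rate-limited."""


def call_with_fallback(prompt, providers):
    """Try each (name, call) pair in order; return (name, result) of the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    # Every provider in the list failed, so surface all the reasons together.
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result keeps the failover visible, so the step can log which model actually answered.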
Cost controls
Per-step budgets, model-class caps, and token ceilings stop a runaway loop from quietly burning a frontier-model bill.
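A per-step budget can be sketched as a counter that refuses the next call once a ceiling would be crossed. The class name, limits, and the idea of charging before each model call are illustrative assumptions.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push a step past its token or cost ceiling."""


class StepBudget:
    def __init__(self, max_tokens: int, max_cost_usd: float):
        self.max_tokens = max_tokens
        self.max_cost_usd = max_cost_usd
        self.tokens_used = 0
        self.cost_used_usd = 0.0

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Record usage, or raise without recording if a ceiling would be exceeded."""
        if (
            self.tokens_used + tokens > self.max_tokens
            or self.cost_used_usd + cost_usd > self.max_cost_usd
        ):
            raise BudgetExceeded(
                f"would use {self.tokens_used + tokens} tokens "
                f"and ${self.cost_used_usd + cost_usd:.2f}"
            )
        self.tokens_used += tokens
        self.cost_used_usd += cost_usd
```

A loop that keeps calling the model hits `BudgetExceeded` instead of running unbounded.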
Decision lineage
Every step records its model, version, prompt hash, and cost, so any output traces back to the exact configuration that produced it.
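A lineage record of that shape might look like the sketch below. The field names are assumptions; the prompt is stored as a SHA-256 hash so the record can identify the prompt without holding its text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    step: str
    model: str
    version: str
    prompt_hash: str  # SHA-256 hex digest of the prompt text
    cost_usd: float


def record_step(step: str, model: str, version: str, prompt: str, cost_usd: float) -> LineageRecord:
    """Build the lineage record for one executed step."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return LineageRecord(step, model, version, prompt_hash, cost_usd)


def to_log_line(record: LineageRecord) -> str:
    """Serialize the record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```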
Promoting a model, safely
The path a new or swapped model takes before it touches a single live decision.
Pin
Register the new model + version in the policy as a candidate pin, with no production traffic routed to it yet.
Evaluate
Run the candidate on the held-out evaluation set and on replayed recent traffic, scoring quality, cost, and latency against the incumbent.
Shadow
Run it in parallel on a slice of live steps, comparing outputs without acting on them, to catch tail-case regressions.
Promote
Flip the routing policy to the new pin. If anything regresses, roll back by reverting one config value — no code deploy.
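The promote and rollback steps can be sketched as a small policy store that remembers the previous pin for each step. The class and the pin strings are hypothetical; a real system would likely keep this in versioned config rather than in memory.

```python
class PolicyStore:
    """Holds the current pin per step and enough history to revert a promotion."""

    def __init__(self, pins: dict):
        self._pins = dict(pins)
        self._history = {}  # step -> list of earlier pins, oldest first

    def current(self, step: str) -> str:
        return self._pins[step]

    def promote(self, step: str, new_pin: str) -> None:
        """Flip the step to the new pin, remembering the one it replaces."""
        self._history.setdefault(step, []).append(self._pins[step])
        self._pins[step] = new_pin

    def rollback(self, step: str) -> str:
        """Restore the most recent earlier pin for the step and return it."""
        history = self._history.get(step)
        if not history:
            raise LookupError(f"no earlier pin recorded for step {step!r}")
        self._pins[step] = history.pop()
        return self._pins[step]
```

Both operations change one value in the store, which is what keeps a rollback from needing a code deploy.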
Right-sized, not biggest-available
The instinct is to point every step at the strongest frontier model and trust it. For the handful of genuinely hard steps — planning, adjudication, ambiguous extraction — that's correct. For the high-volume majority — classification, routing, structured extraction, formatting — it's wasteful and often slower than it needs to be.
We benchmark each step against your own cases, then route the bulk of traffic to small fast models and reserve frontier capacity for the steps that earn it. The result is a fleet that's cheaper, lower-latency, and — because the policy is explicit — auditable. Governance is part of the architecture, not bolted on after the bill arrives.
- Step-level benchmarks on your real cases
- Frontier models reserved for hard steps
- Explicit, reviewable routing policy
Hard-coded model vs. governed model layer
The difference between an agent that demos well and a fleet you can operate.
| | Hard-coded model string | Governed model layer |
|---|---|---|
| Model choice | One literal in agent code | Routing policy, per-step |
| New version | Silent or manual swap | Pinned, evaluated, then promoted |
| Rollback | Code change + redeploy | Revert one config value |
| Provider outage | Workflow stalls | Fails over to equivalent model |
| Audit | Unknown which model ran | Exact model + version logged |
| Cost | One model for every step | Right-sized per step |
Frequently asked questions
Why not just hard-code one model and move on?
Because a single string in your agent code becomes a single point of failure and a silent cost center. The model gets deprecated, a cheaper one ships, or a step starts failing, and with no governance layer you find out in production. A policy layer lets you swap, route, and roll back without touching agent logic.
How do you decide which model handles which step?
We profile each step by what it actually needs — reasoning depth, context length, tool-calling reliability, latency budget, and cost ceiling — then route accordingly. A cheap fast model classifies and extracts; a frontier model plans and adjudicates. Routing is a policy, evaluated against held-out cases, not a guess.
What happens when a provider ships a new model version?
Nothing automatic. New versions are pinned and shadow-tested against your evaluation set first. We compare quality, cost, and latency on your real traffic before promotion, and every step records the exact model and version it ran so a regression is traceable to a single change.
Does governance slow the agents down or add cost?
Net, it usually lowers cost. Routing the bulk of cheap, high-volume steps to small models and reserving frontier models for the few hard ones often cuts spend 40 to 70 percent. The governance layer itself is config and logging, so it adds negligible latency, and it pays for itself the first time a bad swap is undone with a config revert instead of a redeploy.
Related architecture decisions
Model governance sits alongside the rest of the agent-architecture stack.
Make model choice a decision you can defend
Bring one agent workflow. We'll map its steps, size the right model for each, and show you the governance layer that keeps it cheap and reversible.