Version every agent like the production system it is
Prompts, tools, models, and policy all change behavior — and they drift independently. We package them into one immutable, semantically versioned artifact so every run is reproducible and any change is one promotion or rollback away.
- Immutable, content-hashed versions
- Semantic versioning for behavior changes
- Canary, shadow, and instant rollback
- Lineage that ties outcomes to versions
An agent has no single source of truth — unless you build one
The hard part of versioning agents is that behavior lives in several places at once.
A model deploy versions a container. An agent's behavior is the joint product of its system prompt, its tool schemas, the model and decoding parameters, the retrieval corpus it reads, and the guardrail policy that constrains it. Change any one and the agent acts differently — but each of those is usually owned by a different system, edited by a different person, and shipped on a different cadence.
That is why "it worked yesterday" is so common with agents. Someone tweaked a prompt, the RAG index reindexed overnight, or the provider rotated a model snapshot under a floating alias. Without a version that binds all of it, you cannot reproduce a run, you cannot attribute a regression, and you cannot safely roll back.
Agent versioning fixes the seam: capture every behavior-determining input as one immutable artifact, give it a real version number, and make that version the unit you deploy, evaluate, and revert.
What an agent version pins
A version is only useful if it captures everything that can change the answer. These are the inputs we freeze together.
Prompts & instructions
System prompt, tool descriptions, and few-shot examples — the text most likely to be edited casually and the most likely to shift behavior.
Tool & action schemas
The function signatures and the action layer the agent can call. Adding, renaming, or re-scoping a tool is a behavior change.
Model & decoding params
The exact model ID (a dated snapshot, never a floating alias), plus temperature, max tokens, and stop conditions.
Retrieval corpus & index
The embedding model and a snapshot of the indexed knowledge, so a reindex doesn't silently rewrite what the agent knows.
Guardrail policy
Risk thresholds, approval gates, and redaction rules bound into the version — policy is behavior, so it ships with the agent.
Orchestration graph
The topology of sub-agents, routing, and state handling that decides which steps run and in what order.
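As an illustration of freezing these inputs together, here is a minimal Python sketch that bundles them into one manifest and derives a content hash from a canonical JSON encoding. The `AgentManifest` class, its field names, and the placeholder model ID are assumptions made for this example, not a description of any particular product's format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class AgentManifest:
    """Hypothetical manifest: one field per behavior-determining input."""
    system_prompt: str
    tool_schemas: tuple                 # JSON-serializable tool/function schemas
    model_id: str                       # a dated snapshot, never a floating alias
    decoding: dict = field(default_factory=dict)          # temperature, max tokens, stops
    embedding_model: str = ""
    corpus_snapshot: str = ""           # ID of the frozen index snapshot
    guardrail_policy: dict = field(default_factory=dict)
    orchestration_graph: dict = field(default_factory=dict)


def content_hash(manifest: AgentManifest) -> str:
    """SHA-256 over a canonical JSON encoding of the manifest."""
    canonical = json.dumps(
        asdict(manifest),
        sort_keys=True,                 # key order must not affect the hash
        separators=(",", ":"),          # no whitespace variation
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = AgentManifest(
    system_prompt="You are a support agent.",
    tool_schemas=({"name": "lookup_order", "parameters": {"order_id": "string"}},),
    model_id="example-model-2025-01-01",   # placeholder, not a real model ID
    decoding={"temperature": 0.2, "max_tokens": 1024},
)

# Same content, different dict key order: the hash should not change.
same = AgentManifest(
    system_prompt="You are a support agent.",
    tool_schemas=({"parameters": {"order_id": "string"}, "name": "lookup_order"},),
    model_id="example-model-2025-01-01",
    decoding={"max_tokens": 1024, "temperature": 0.2},
)

# One decoding parameter changed: this is a different artifact.
hotter = replace(base, decoding={"temperature": 0.7, "max_tokens": 1024})

print(content_hash(base) == content_hash(same))
print(content_hash(base) == content_hash(hotter))
```

Sorting keys and fixing separators is what makes the hash depend on content rather than on incidental serialization details. A real manifest would likely reference large inputs such as the corpus by snapshot ID rather than embedding them.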
From change to production, safely
Every promotion follows the same gated path, so shipping an agent change feels like shipping software.
Build
Assemble the artifact, content-hash it, and assign a semantic version. Identical inputs always produce the same hash, and therefore resolve to the same version.
Evaluate
Run the candidate against a regression suite and golden transcripts. Behavior diffs and cost deltas surface before any traffic.
Canary
Route a small slice of live traffic to the new version, or mirror traffic to it in read-only shadow mode, and compare quality, latency, and spend side by side with the current version.
Promote
Move the production alias to the new version. If anything regresses, flip the alias back — no redeploy, no in-flight breakage.
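The promote and rollback steps can be sketched as a small in-memory registry in which an alias is only a pointer to an immutable version. `AliasRegistry` and its method names are hypothetical, and a real system would persist this state rather than hold it in a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class AliasRegistry:
    """Hypothetical registry: immutable versions plus movable aliases."""
    versions: dict = field(default_factory=dict)   # version -> content hash
    history: dict = field(default_factory=dict)    # alias -> list of promoted versions

    def register(self, version: str, digest: str) -> None:
        existing = self.versions.get(version)
        if existing is not None and existing != digest:
            # A published version may never be rebound to different content.
            raise ValueError(f"{version} is immutable and bound to another hash")
        self.versions[version] = digest

    def promote(self, alias: str, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.history.setdefault(alias, []).append(version)

    def resolve(self, alias: str) -> str:
        return self.history[alias][-1]

    def rollback(self, alias: str) -> str:
        past = self.history.get(alias, [])
        if len(past) < 2:
            raise RuntimeError("no earlier version to roll back to")
        past.pop()                      # drop the bad version from the alias
        return past[-1]                 # the last-good version is live again


registry = AliasRegistry()
registry.register("1.4.2", "sha256:aaa")   # placeholder digests
registry.register("1.5.0", "sha256:bbb")
registry.promote("production", "1.4.2")
registry.promote("production", "1.5.0")

# A run resolves the alias once at start and keeps that version until it ends.
in_flight_version = registry.resolve("production")
restored = registry.rollback("production")
print(in_flight_version, restored, registry.resolve("production"))
```

Because a run stores the resolved version rather than the alias, moving the alias does not disturb runs that are already in progress.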
Major, minor, patch — for behavior
We treat agent changes the way good libraries treat their APIs. A patch is a fix that is not expected to change behavior: a typo in a prompt, a logging tweak. Because even a small prompt edit can shift outputs, a patch label should be backed by a clean regression run. A minor bump adds a capability without breaking existing flows: a new tool, an expanded corpus. A major bump changes behavior in ways downstream consumers must review: a model family swap, a reworked policy, a restructured orchestration graph.
Encoding intent in the version number means consumers know whether to re-run their evals before adopting it, and on-call engineers know at a glance whether last night's bump could plausibly explain a regression. The version stops being an opaque build hash and becomes a contract.
- Major = behavior changes that need re-review
- Minor = additive, backward-compatible capability
- Patch = no behavior change
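One way to make these rules mechanical is to label each change and take the highest bump any label requires. The sketch below assumes a hypothetical set of change labels drawn from the examples above, and it treats any unlabeled change as major, which is a conservative choice made for this example rather than a stated policy.

```python
from enum import IntEnum


class Bump(IntEnum):
    PATCH = 0
    MINOR = 1
    MAJOR = 2


# Hypothetical change labels, mapped to the bump each one calls for.
BUMP_FOR_CHANGE = {
    "prompt_typo": Bump.PATCH,
    "logging_tweak": Bump.PATCH,
    "tool_added": Bump.MINOR,
    "corpus_expanded": Bump.MINOR,
    "model_family_swap": Bump.MAJOR,
    "policy_reworked": Bump.MAJOR,
    "graph_restructured": Bump.MAJOR,
}


def required_bump(changes: list) -> Bump:
    """Return the largest bump demanded by any change in the release."""
    if not changes:
        raise ValueError("a release needs at least one change")
    # Assumption: unknown labels are treated as major until someone classifies them.
    return max(BUMP_FOR_CHANGE.get(change, Bump.MAJOR) for change in changes)


def next_version(current: str, bump: Bump) -> str:
    """Apply a bump to a MAJOR.MINOR.PATCH string, resetting lower parts."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump is Bump.MAJOR:
        return f"{major + 1}.0.0"
    if bump is Bump.MINOR:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("1.4.2", required_bump(["prompt_typo", "tool_added"])))
```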
Unversioned agents vs. versioned agents
The difference between hoping a change is safe and knowing it.
| | Unversioned agent | Versioned agent |
|---|---|---|
| Reproduce a run | Best-effort guesswork | Replay the exact pinned artifact |
| Roll back | Re-edit prompts, redeploy, hope | Flip the traffic alias in seconds |
| Attribute a regression | "Something changed somewhere" | Lineage names the exact version |
| Model snapshot | Floating alias, drifts silently | Dated snapshot frozen in the version |
| Releasing a change | All-at-once, fingers crossed | Eval → canary → promote, gated |
Frequently asked questions
What exactly gets versioned in an agent version?
Everything that can change behavior: system and tool prompts, the tool/function schemas, the model ID and its decoding params, retrieval indexes, guardrail policy, and the orchestration graph. We hash them together into one immutable artifact with a semantic version, so a version is a single thing you can pin, diff, and roll back.
Why not just rely on the model provider's version pin?
Pinning the model (for example a dated Claude or GPT snapshot) only fixes one input. Your prompt, tools, RAG corpus, and policy drift independently and change behavior just as much. Agent versioning captures all of them together, so a run is reproducible even when one underlying piece is updated.
How do you roll back a bad agent version in production?
Because each version is immutable and addressable, rollback is a routing change, not a redeploy. We flip the traffic alias back to the last-good version, in-flight runs finish on the version they started, and the decision lineage records exactly which version produced which outcome.
Can different versions run at the same time?
Yes. We run canary and shadow traffic so a new version takes a small percentage of live load (or runs read-only alongside the current one) before promotion. Evaluations and cost comparisons run against the candidate before it ever takes over the production alias.
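Splitting traffic between two versions can be done deterministically by hashing a run identifier into a bucket, so the same run always lands on the same version. This is a minimal sketch under that assumption: `pick_version`, the `run_id` format, and the 10,000-bucket granularity are all illustrative choices.

```python
import hashlib


def pick_version(run_id: str, stable: str, candidate: str, canary_percent: float) -> str:
    """Deterministically assign a run to the stable or candidate version."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto buckets 0..9999 (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return candidate if bucket < canary_percent * 100 else stable


# Route roughly 5% of runs to the candidate; the rest stay on stable.
assignments = [pick_version(f"run-{i}", "1.4.2", "1.5.0", 5.0) for i in range(10_000)]
canary_share = assignments.count("1.5.0") / len(assignments)
print(f"candidate share: {canary_share:.2%}")
```

Hashing instead of drawing a random number keeps assignment stable across retries, which makes it easier to attribute an outcome to the version that produced it.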
Make every agent change reversible
Bring the agent you're afraid to touch. We'll leave you with a versioning scheme, a promotion path, and a rollback you can trust.