Agent Architecture

Version every agent like the production system it is

Prompts, tools, models, and policy all change behavior — and they drift independently. We package them into one immutable, semantically versioned artifact so every run is reproducible and any change is one promotion or rollback away.

  • Immutable, content-hashed versions
  • Semantic versioning for behavior changes
  • Canary, shadow, and instant rollback
  • Lineage that ties outcomes to versions

1 artifact: prompt + tools + model + policy, hashed together
< 1 min to roll a version back via traffic alias
100% of runs traceable to an exact version
0 redeploys needed to flip live traffic

// the problem

An agent has no single source of truth — unless you build one

The hard part of versioning agents is that behavior lives in several places at once.

A model deploy versions a container. An agent's behavior is the joint product of its system prompt, its tool schemas, the model and decoding parameters, the retrieval corpus it reads, and the guardrail policy that constrains it. Change any one and the agent acts differently — but each of those is usually owned by a different system, edited by a different person, and shipped on a different cadence.

That is why "it worked yesterday" is so common with agents. Someone tweaked a prompt, the RAG index reindexed overnight, or the provider rotated a model snapshot under a floating alias. Without a version that binds all of it, you cannot reproduce a run, you cannot attribute a regression, and you cannot safely roll back.

Agent versioning fixes the seam: capture every behavior-determining input as one immutable artifact, give it a real version number, and make that version the unit you deploy, evaluate, and revert.

// what's in a version

What an agent version pins

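A version is only useful if it captures everything that can change the answer. These are the inputs we freeze together.

  • System and tool prompts
  • Tool/function schemas
  • Model ID and decoding parameters
  • Retrieval indexes
  • Guardrail policy
  • Orchestration graph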
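
As a rough illustration of freezing those inputs together, the sketch below bundles them into one immutable record and derives a content hash from a canonical JSON encoding. The `AgentManifest` class, its field names, and the choice of SHA-256 are assumptions made for this example, not a description of the actual artifact format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentManifest:
    """Hypothetical manifest: every behavior-determining input in one record."""
    system_prompt: str
    tool_schemas: tuple[str, ...]   # canonical JSON text of each tool schema
    model_id: str                   # a dated snapshot, not a floating alias
    decoding_params: tuple          # sorted (name, value) pairs
    retrieval_index_id: str         # identifier of a frozen index snapshot
    policy_id: str                  # identifier of the guardrail policy revision
    graph_id: str                   # identifier of the orchestration graph


def content_hash(manifest: AgentManifest) -> str:
    """Hash a canonical JSON encoding so identical inputs give the same digest."""
    canonical = json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the record is frozen and the encoding is canonical (sorted keys, fixed separators), two builds from the same inputs yield the same digest, and any edit to a prompt, schema, or parameter yields a different one.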

// the lifecycle

From change to production, safely

Every promotion follows the same gated path, so shipping an agent change feels like shipping software.

01

Build

Assemble the artifact, content-hash it, and assign a semantic version. Identical inputs always produce the same content hash.

02

Evaluate

Run the candidate against a regression suite and golden transcripts. Behavior diffs and cost deltas surface before any traffic.

03

Canary

Route a small slice of live traffic (or shadow it read-only) to the new version and watch quality, latency, and spend in parallel.

04

Promote

Move the production alias to the new version. If anything regresses, flip the alias back — no redeploy, no in-flight breakage.
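
The Evaluate gate in the path above can be sketched as a small regression check. This is a minimal example under stated assumptions: `GoldenCase`, `regression_report`, and `may_promote` are hypothetical names, the agent is modeled as a plain function from prompt to text, and exact string equality stands in for whatever behavior-diff metric a real suite would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One golden transcript entry: an input and the output we expect."""
    prompt: str
    expected: str


def regression_report(agent: Callable[[str], str], cases: list[GoldenCase]) -> dict:
    """Run every golden case and collect the ones whose output changed."""
    diffs = []
    for case in cases:
        actual = agent(case.prompt)
        if actual != case.expected:  # assumption: exact match defines "same behavior"
            diffs.append(
                {"prompt": case.prompt, "expected": case.expected, "actual": actual}
            )
    pass_rate = 1.0 if not cases else (len(cases) - len(diffs)) / len(cases)
    return {"pass_rate": pass_rate, "diffs": diffs}


def may_promote(report: dict, min_pass_rate: float = 1.0) -> bool:
    """Gate: the candidate gets traffic only if it clears the threshold."""
    return report["pass_rate"] >= min_pass_rate
```

The report keeps the individual diffs rather than only a score, so a reviewer can see which transcripts changed before any traffic is routed to the candidate.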

// semantic versioning for agents

Major, minor, patch — for behavior

We treat agent changes the way good libraries treat their APIs. A patch is a non-behavioral fix — a typo in a prompt, a logging tweak. A minor bump adds a capability without breaking existing flows — a new tool, an expanded corpus. A major bump changes behavior in ways downstream consumers must review — a model family swap, a reworked policy, a restructured orchestration graph.

Encoding intent in the version number means consumers know whether to re-run their evals before adopting it, and on-call engineers know at a glance whether last night's bump could plausibly explain a regression. The version stops being an opaque build hash and becomes a contract.

  • Major = behavior changes that need re-review
  • Minor = additive, backward-compatible capability
  • Patch = no behavior change
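
One way to make these rules mechanical is to map each kind of change to the bump it requires and apply the largest one. The change-kind names below are hypothetical labels drawn from the examples in this section, and treating unknown kinds as major is a conservative assumption made for the sketch.

```python
# Hypothetical change kinds, mapped to the bump level this section describes.
CHANGE_LEVEL = {
    "prompt_typo": "patch",
    "logging_tweak": "patch",
    "new_tool": "minor",
    "expanded_corpus": "minor",
    "model_family_swap": "major",
    "policy_rework": "major",
    "graph_restructure": "major",
}
_RANK = {"patch": 0, "minor": 1, "major": 2}


def required_level(change_kinds: list[str]) -> str:
    """Return the highest bump level demanded by any change in the set."""
    if not change_kinds:
        raise ValueError("at least one change is required to bump a version")
    # Assumption: an unrecognized change kind is treated as major, to be safe.
    levels = [CHANGE_LEVEL.get(kind, "major") for kind in change_kinds]
    return max(levels, key=_RANK.__getitem__)


def bump(version: str, change_kinds: list[str]) -> str:
    """Apply the required bump to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    level = required_level(change_kinds)
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, `bump("1.4.2", ["new_tool"])` returns `"1.5.0"`, while adding a `"model_family_swap"` to the same change set returns `"2.0.0"`.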

Unversioned agents vs. versioned agents

The difference between hoping a change is safe and knowing it.

                       | Unversioned agent               | Versioned agent
Reproduce a run        | Best-effort guesswork           | Replay the exact pinned artifact
Roll back              | Re-edit prompts, redeploy, hope | Flip the traffic alias in seconds
Attribute a regression | "Something changed somewhere"   | Lineage names the exact version
Model snapshot         | Floating alias, drifts silently | Dated snapshot frozen in the version
Releasing a change     | All-at-once, fingers crossed    | Eval → canary → promote, gated

Frequently asked questions

What exactly gets versioned in an agent version?

Everything that can change behavior: system and tool prompts, the tool/function schemas, the model ID and its decoding params, retrieval indexes, guardrail policy, and the orchestration graph. We hash them together into one immutable artifact with a semantic version, so a version is a single thing you can pin, diff, and roll back.

Why not just rely on the model provider's version pin?

Pinning the model (for example a dated Claude or GPT snapshot) only fixes one input. Your prompt, tools, RAG corpus, and policy drift independently and change behavior just as much. Agent versioning captures all of them together, so a run is reproducible even when one underlying piece is updated.

How do you roll back a bad agent version in production?

Because each version is immutable and addressable, rollback is a routing change, not a redeploy. We flip the traffic alias back to the last-good version, in-flight runs finish on the version they started, and the decision lineage records exactly which version produced which outcome.
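
Rollback as a routing change can be sketched with a small in-memory registry. Everything here is a hypothetical illustration: the `AliasRegistry` class, the `start_run` helper, and the dictionary used as a lineage log are assumptions for the example, and a real system would persist this state.

```python
class AliasRegistry:
    """Hypothetical registry: immutable versions plus movable traffic aliases."""

    def __init__(self) -> None:
        self._versions: dict[str, str] = {}       # version -> content hash
        self._history: dict[str, list[str]] = {}  # alias -> versions, newest last

    def register(self, version: str, digest: str) -> None:
        # Immutability: a version number can never be re-pointed at new content.
        if self._versions.get(version, digest) != digest:
            raise ValueError(f"{version} is already registered with other content")
        self._versions[version] = digest

    def promote(self, alias: str, version: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self._history.setdefault(alias, []).append(version)

    def rollback(self, alias: str) -> str:
        history = self._history.get(alias, [])
        if len(history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        history.pop()        # drop the bad version from the alias
        return history[-1]   # the last-good version now takes traffic

    def resolve(self, alias: str) -> str:
        return self._history[alias][-1]


def start_run(registry: AliasRegistry, alias: str, run_id: str, lineage: dict) -> str:
    """Resolve the alias once at start, so the run stays on that version."""
    version = registry.resolve(alias)
    lineage[run_id] = version  # record which version produced this run
    return version
```

Because a run resolves the alias only once, flipping the alias mid-run does not change the version that run uses, and the lineage entry still points at the version it started on.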

Can different versions run at the same time?

Yes. We run canary and shadow traffic so a new version takes a small percentage of live load (or runs read-only alongside the current one) before promotion. Evaluations and cost checks run against the candidate version before it ever owns the full alias.
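
A canary split is often implemented as deterministic bucketing, so the same caller keeps landing on the same version for the duration of the rollout. The sketch below assumes that approach; the `bucket` and `route` functions, the routing key, and the 10,000-bucket granularity are illustrative choices, not the product's actual router.

```python
import hashlib

BUCKETS = 10_000  # one bucket = 0.01% of traffic


def bucket(key: str) -> int:
    """Map a routing key (for example a session ID) to a stable bucket."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(key: str, stable: str, candidate: str, canary_percent: float) -> str:
    """Send roughly canary_percent of keys to the candidate version."""
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    threshold = round(canary_percent * BUCKETS / 100)
    return candidate if bucket(key) < threshold else stable
```

Hashing the key instead of drawing a random number keeps routing reproducible: raising `canary_percent` only moves additional keys onto the candidate, and setting it to 0 returns every key to the stable version.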

Make every agent change reversible

Bring the agent you're afraid to touch. We'll leave you with a versioning scheme, a promotion path, and a rollback you can trust.