Agent architecture

AI redundancy: no single point of failure

Frontier models go down, get rate-limited, and hallucinate. We architect agent systems so that a failure on one path fails over to another (a different model, a different provider, or a deterministic backstop) without dropping the work or duplicating it.

  • Model & provider failover
  • Idempotent retries
  • Graceful degradation
  • Human escalation at the limit

  • 99.9%: workflow completion target across model outages
  • 2-3: ranked model paths per critical step
  • 0: duplicate side effects from idempotency keys
  • <2s: typical failover to a healthy path
// the failure modes

What actually breaks in production

Redundancy is a response to specific, observed failure modes — not a generic 'add a backup' instinct.

A single-model agent has at least five ways to fail: the provider has an incident, your region is throttled, the context window is exceeded, the model returns malformed output, or it confidently returns something wrong. Each needs a different recovery, and most teams discover that the hard way at 2 a.m.

The engineering decision isn't whether to add redundancy but how much, and where. Blanket triple redundancy can multiply cost and add latency on steps that don't need it. The work is mapping each step to its blast radius and choosing the cheapest recovery that meets the SLA.
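
As a rough illustration of "cheapest recovery that meets the SLA", the selection can be sketched as a filter followed by a minimum. Everything here is an assumption made for the example: the `Recovery` type, the option names, and the cost and completion figures are hypothetical, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recovery:
    name: str
    cost_per_run: float      # hypothetical relative cost
    completion_rate: float   # hypothetical estimate of runs that finish


# Illustrative numbers only; real values would come from your own telemetry.
RECOVERIES = [
    Recovery("retry_same_model", 1.0, 0.990),
    Recovery("second_provider", 1.3, 0.999),
    Recovery("three_providers", 2.0, 0.9999),
]

# Hypothetical per-step requirements derived from each step's blast radius.
STEP_REQUIREMENTS = {"post_to_ledger": 0.999, "draft_summary": 0.95}


def cheapest_recovery(required_rate: float, options=RECOVERIES) -> Optional[Recovery]:
    """Return the lowest-cost option that meets the required rate, or None."""
    viable = [r for r in options if r.completion_rate >= required_rate]
    return min(viable, key=lambda r: r.cost_per_run) if viable else None


plan = {step: cheapest_recovery(rate) for step, rate in STEP_REQUIREMENTS.items()}
```

With these made-up figures the ledger step lands on a second provider while the summary step keeps a plain retry, which is the point: the high-stakes step pays for redundancy and the low-stakes one does not.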

// the layers

Four layers of redundancy

Each addresses a distinct failure class. Most critical paths use two or three; low-stakes steps use one.

// the runtime

How a request fails over

What happens, in order, when a critical step's primary path goes down.

01 Attempt

Route to the primary model. Validate the output against the step's schema and confidence checks before accepting it.

02 Detect

On error, refusal, timeout, or validation failure, mark the path unhealthy and trip its circuit breaker if the failure repeats.

03 Reroute

Fail over to the next ranked model or provider. Because the retried action carries the same idempotency key, it takes effect only once downstream.

04 Escalate

If every automated path is exhausted, checkpoint state and hand off to a human: the work pauses, but it never disappears.
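
The four steps above can be sketched as a single loop over ranked paths. This is a minimal sketch under stated assumptions, not a production runtime: `Path`, `run_step`, `Escalation`, and the two stub model functions are hypothetical names, the "circuit breaker" is just a consecutive-failure counter with no cooldown, and idempotency is left out.

```python
from dataclasses import dataclass
from typing import Callable


class Escalation(Exception):
    """Raised when every automated path is exhausted; carries the checkpoint."""

    def __init__(self, checkpoint: dict):
        super().__init__("all automated paths exhausted")
        self.checkpoint = checkpoint


@dataclass
class Path:
    name: str
    call: Callable[[str], str]  # hypothetical client: prompt in, raw output out
    consecutive_failures: int = 0


def run_step(prompt, ranked_paths, validate, breaker_threshold=3):
    attempts = []
    for path in ranked_paths:
        if path.consecutive_failures >= breaker_threshold:
            attempts.append((path.name, "skipped: circuit open"))
            continue
        try:
            output = path.call(prompt)        # 01 Attempt
        except Exception as exc:              # 02 Detect: error or timeout
            path.consecutive_failures += 1
            attempts.append((path.name, f"error: {exc}"))
            continue                          # 03 Reroute to the next path
        if validate(output):
            path.consecutive_failures = 0
            attempts.append((path.name, "accepted"))
            return path.name, output, attempts
        path.consecutive_failures += 1        # 02 Detect: validation failure
        attempts.append((path.name, "validation failed"))
    # 04 Escalate: hand the recorded state to a human instead of dropping it.
    raise Escalation({"prompt": prompt, "attempts": attempts})


def flaky_primary(prompt: str) -> str:        # stub standing in for an outage
    raise TimeoutError("provider timed out")


def healthy_secondary(prompt: str) -> str:    # stub standing in for a fallback
    return '{"status": "ok"}'


paths = [Path("primary", flaky_primary), Path("secondary", healthy_secondary)]
winner, output, attempts = run_step(
    "classify this invoice", paths, lambda s: s.startswith("{")
)
```

The same `validate` callable gates every path, so a fallback answer is held to the same bar as the primary one.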

// the tradeoff

Redundancy has a price — spend it where it matters

Every extra path costs latency, money, and complexity. A second model call you make 'just in case' is a tax on every run, not only the failing ones. The discipline is matching redundancy to blast radius: a step that posts to your general ledger earns multi-provider failover; a step that drafts an internal summary does not.

We make that explicit. Each workflow step gets a redundancy tier tied to its risk threshold and SLA, so cost is a deliberate choice you can see and tune — not an accident buried in a framework default.

  • Per-step redundancy tiers, not a blanket policy
  • Cost & latency budgets enforced at runtime
  • Tiers tied to documented risk thresholds
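
One way to picture per-step tiers with runtime budgets is a small lookup plus a counter that is checked before each additional path is tried. The tier names, limits, and step names below are assumptions chosen for illustration; they are not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_paths: int        # how many ranked paths the step may try
    max_cost_usd: float   # spend ceiling for the step
    max_latency_s: float  # wall-clock ceiling for the step


# Hypothetical tiers and step assignments.
TIERS = {
    "critical": Tier(max_paths=3, max_cost_usd=0.50, max_latency_s=20.0),
    "standard": Tier(max_paths=2, max_cost_usd=0.10, max_latency_s=10.0),
    "low": Tier(max_paths=1, max_cost_usd=0.02, max_latency_s=5.0),
}
STEP_TIERS = {"post_to_ledger": "critical", "draft_summary": "low"}


class StepBudget:
    """Tracks what a step has spent and whether another path is allowed."""

    def __init__(self, tier: Tier):
        self.tier = tier
        self.attempts = 0
        self.cost_usd = 0.0
        self.latency_s = 0.0

    def record(self, cost_usd: float, latency_s: float) -> None:
        self.attempts += 1
        self.cost_usd += cost_usd
        self.latency_s += latency_s

    def may_try_another_path(self) -> bool:
        t = self.tier
        return (
            self.attempts < t.max_paths
            and self.cost_usd < t.max_cost_usd
            and self.latency_s < t.max_latency_s
        )


summary_budget = StepBudget(TIERS[STEP_TIERS["draft_summary"]])
summary_budget.record(cost_usd=0.01, latency_s=1.0)
```

Because the tier is plain data, the cost of redundancy for each step is visible in one place and can be tuned without touching the runtime.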

Naive retry vs. designed redundancy

Why a try/except loop around one model is not a failover strategy.

Aspect          | Retry the same model          | Designed redundancy
Provider outage | Still down; retries fail too  | Fails over to another provider
Bad output      | Same model, same mistake      | Escalates to a stronger model
Side effects    | Risk of double-firing         | Idempotency keys keep it exactly-once
Cost shape      | Unbounded retry storms        | Budgeted tiers + circuit breakers
Dead end        | Silent failure or hang        | Human handoff with full lineage

Frequently asked questions

Isn't a single frontier model good enough to just retry?

Retrying the same model against the same prompt rarely fixes a hard failure — and does nothing for a provider outage, a region brownout, or a rate-limit wall. Redundancy means a second path exists: a different model, a different provider, or a deterministic fallback that still completes the task within your SLA.

Do model fallbacks hurt output quality?

Only if you treat all fallbacks as equals. We rank candidates by capability and cost, route to the strongest model that's healthy, and gate the answer with the same validators regardless of which model produced it. A degraded path is allowed to return a narrower result — never a silently worse one.
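
A small sketch of that routing rule, assuming a hypothetical `Candidate` record with made-up capability scores and costs: pick the strongest healthy model, break ties on cost, and run one shared validator on whatever comes back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    capability: int      # hypothetical score; higher means stronger
    cost: float          # hypothetical relative cost per call
    healthy: bool = True


def pick_candidate(candidates) -> Optional[Candidate]:
    """Strongest healthy candidate; the cheaper one wins a capability tie."""
    healthy = [c for c in candidates if c.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda c: (-c.capability, c.cost))


def validate(answer: dict) -> bool:
    """One gate for every model: a non-empty string 'category' is required."""
    category = answer.get("category")
    return isinstance(category, str) and category != ""


candidates = [
    Candidate("model-a", capability=9, cost=3.0, healthy=False),  # in an outage
    Candidate("model-b", capability=7, cost=1.0),
    Candidate("model-c", capability=7, cost=0.5),
]
chosen = pick_candidate(candidates)
```

A degraded path that returns fewer fields can still pass, as long as the fields the validator requires are present and well-formed; an answer that fails the gate is rejected no matter which model produced it.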

How do you avoid duplicate side effects when a step is retried?

Every action that touches your systems carries an idempotency key tied to the workflow step, so a retried or failed-over call can't double-charge, double-email, or double-write. State is checkpointed between steps, so failover resumes from the last durable point rather than replaying the whole run.
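
A minimal sketch of the idea, assuming the key is derived from a workflow run ID and a step name (both hypothetical identifiers) and that the receiving system deduplicates on it. `LedgerStub` is a stand-in written for this example, not a real client.

```python
import hashlib


def idempotency_key(workflow_run_id: str, step_name: str) -> str:
    """Same run and step always yield the same key, whichever path retries."""
    raw = f"{workflow_run_id}:{step_name}".encode()
    return hashlib.sha256(raw).hexdigest()


class LedgerStub:
    """Stand-in for a downstream system that deduplicates by key."""

    def __init__(self):
        self._results = {}
        self.writes = 0

    def post(self, key: str, amount: int) -> dict:
        if key in self._results:
            return self._results[key]  # replay: return the stored result
        self.writes += 1
        self._results[key] = {"entry_id": self.writes, "amount": amount}
        return self._results[key]


ledger = LedgerStub()
key = idempotency_key("run-1", "post_to_ledger")
first = ledger.post(key, 100)
retried = ledger.post(key, 100)  # e.g. the same step after a failover
```

The key deliberately excludes the attempt number and the model name, so a failed-over call is recognized as the same action. The guarantee depends on the downstream system honoring the key; a sender-side key alone does not prevent a double write.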

Where does redundancy stop and human escalation begin?

At your risk threshold. When every automated path is exhausted or confidence drops below the line you set, the workflow stops and routes to a human with full context and lineage — a controlled handoff, not a dropped task.
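
The handoff rule can be sketched as a single predicate plus a payload. The `Handoff` fields and the function names are assumptions for illustration; the threshold is whatever line you set for the step.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Handoff:
    step: str
    reason: str                      # "paths_exhausted" or "low_confidence"
    context: dict
    lineage: list = field(default_factory=list)


def should_escalate(paths_exhausted: bool, confidence: Optional[float],
                    threshold: float) -> bool:
    """Escalate when no path is left or confidence is below the threshold."""
    if paths_exhausted:
        return True
    return confidence is not None and confidence < threshold


def build_handoff(step, paths_exhausted, confidence, threshold,
                  context, lineage) -> Optional[Handoff]:
    if not should_escalate(paths_exhausted, confidence, threshold):
        return None
    reason = "paths_exhausted" if paths_exhausted else "low_confidence"
    # Copy context and lineage so the reviewer sees the state as it was.
    return Handoff(step, reason, dict(context), list(lineage))


handoff = build_handoff(
    step="post_to_ledger",
    paths_exhausted=False,
    confidence=0.62,
    threshold=0.80,
    context={"invoice_id": "example-123"},   # hypothetical payload
    lineage=["primary: accepted with low confidence"],
)
```

Returning `None` when no escalation is needed keeps the decision explicit at the call site: the workflow either continues or produces a handoff record, never neither.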

Find the single points of failure before they find you

Bring your most business-critical workflow. We'll map its failure modes and the redundancy tiers each step actually needs.