Agent architecture

AI redundancy: no single point of failure

Frontier models go down, get rate-limited, and hallucinate. We architect agent systems so a failure on one path fails over to another (a different model, a different provider, or a deterministic backstop) without dropping the work or duplicating it.

Model & provider failover
Idempotent retries
Graceful degradation
Human escalation at the limit

Book a Call Get Started

99.9%

workflow completion target across model outages

2-3

ranked model paths per critical step

duplicate side effects from idempotency keys

<2s

typical failover to a healthy path

// the failure modes

What actually breaks in production

Redundancy is a response to specific, observed failure modes — not a generic 'add a backup' instinct.

A single-model agent has at least five ways to stall: the provider has an incident, your region is throttled, the context window is exceeded, the model returns malformed output, or it confidently returns something wrong. Each needs a different recovery, and most teams discover that the hard way at 2 a.m.

The engineering decision isn't whether to add redundancy — it's how much, and where. Blanket triple-redundancy triples your cost and latency for steps that don't need it. The work is mapping each step to its blast radius and choosing the cheapest recovery that meets the SLA.
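
One way to make "each needs a different recovery" concrete is a small policy table. The sketch below is illustrative only: the enum names and the pairing of each failure mode with a first-choice recovery are assumptions for the example, not a description of any particular implementation.

```python
from enum import Enum, auto


class Failure(Enum):
    PROVIDER_INCIDENT = auto()
    REGION_THROTTLED = auto()
    CONTEXT_EXCEEDED = auto()
    MALFORMED_OUTPUT = auto()
    CONFIDENTLY_WRONG = auto()


class Recovery(Enum):
    SWITCH_PROVIDER = auto()
    BACKOFF_THEN_RETRY = auto()
    SHRINK_INPUT = auto()
    NEXT_MODEL = auto()
    ESCALATE_TO_HUMAN = auto()


# Hypothetical policy: one first-choice recovery per failure mode.
FIRST_RECOVERY = {
    Failure.PROVIDER_INCIDENT: Recovery.SWITCH_PROVIDER,
    Failure.REGION_THROTTLED: Recovery.BACKOFF_THEN_RETRY,
    Failure.CONTEXT_EXCEEDED: Recovery.SHRINK_INPUT,
    Failure.MALFORMED_OUTPUT: Recovery.NEXT_MODEL,
    Failure.CONFIDENTLY_WRONG: Recovery.ESCALATE_TO_HUMAN,
}


def recover(failure: Failure) -> Recovery:
    """Look up the first recovery to try for an observed failure."""
    return FIRST_RECOVERY[failure]
```

A real policy would likely vary per step, since the cheapest acceptable recovery depends on that step's blast radius and SLA.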

// the layers

Six layers of redundancy

Each addresses a distinct failure class. Most critical paths use two or three; low-stakes steps use one.

Model fallback

A ranked list of models per step. If the primary errors, refuses, or fails validation, the next-strongest healthy model takes the call.

Provider failover

Same capability, different vendor and region. A provider incident reroutes without a code change or a redeploy.

Idempotent retry

Retries with backoff and jitter, keyed so a re-run of a side-effecting action can't double-fire downstream.

Graceful degradation

When no full path is healthy, return a narrower, validated result or queue for later — degrade the scope, never the correctness.

Human escalation

The last backstop. Exhausted paths or low confidence route to a person with full context and lineage attached.

Health & circuit breaking

Per-path health checks trip a circuit breaker on repeated failures so the system stops hammering a dead endpoint.
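
A minimal sketch of the circuit-breaking idea, assuming a simple consecutive-failure count and a fixed cooldown. The class name, thresholds, and injectable clock are hypothetical choices for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a probe after `cooldown_s`."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock                      # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow(self) -> bool:
        """True if a call may be sent down this path right now."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe call through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        """A successful call closes the breaker and resets the count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A router would call `allow()` before sending traffic down a path and skip to the next ranked path while the breaker is open.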

// the runtime

How a request fails over

What happens, in order, when a critical step's primary path goes down.

Attempt

Route to the primary model. Validate the output against the step's schema and confidence checks before accepting it.

Detect

On error, refusal, timeout, or validation failure, mark the path unhealthy and trip its circuit breaker if the failure repeats.

Reroute

Fail over to the next ranked model or provider. The idempotency key lets downstream systems deduplicate the retried action, so its effect lands once.

Escalate

If every automated path is exhausted, checkpoint state and hand off to a human — the work pauses, it never disappears.
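
The four steps above can be sketched as a single loop over ranked paths. This is a simplified illustration under stated assumptions: `Path`, `Outcome`, and `run_step` are hypothetical names, a path is marked unhealthy on its first failure rather than after repeats, and checkpointing is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Path:
    name: str                     # hypothetical label, e.g. "primary"
    call: Callable[[str], str]    # sends the prompt; returns raw output or raises
    healthy: bool = True


@dataclass
class Outcome:
    status: str                   # "ok" or "escalated"
    path: Optional[str] = None
    output: Optional[str] = None


def run_step(prompt: str, paths: list[Path], validate: Callable[[str], bool]) -> Outcome:
    """Try ranked paths in order; escalate when all are exhausted."""
    for path in paths:
        if not path.healthy:
            continue
        try:
            output = path.call(prompt)      # Attempt
        except Exception:
            path.healthy = False            # Detect: error or timeout
            continue                        # Reroute to the next ranked path
        if validate(output):
            return Outcome("ok", path.name, output)
        path.healthy = False                # Detect: validation failure
    return Outcome("escalated")             # Escalate: hand off to a human
```

The same `validate` function gates every path, so a fallback's answer is held to the same checks as the primary's.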

// the tradeoff

Redundancy has a price — spend it where it matters

Every extra path costs latency, money, and complexity. A second model call you make 'just in case' is a tax on every run, not only the failing ones. The discipline is matching redundancy to blast radius: a step that posts to your general ledger earns multi-provider failover; a step that drafts an internal summary does not.

We make that explicit. Each workflow step gets a redundancy tier tied to its risk threshold and SLA, so cost is a deliberate choice you can see and tune — not an accident buried in a framework default.

Per-step redundancy tiers, not a blanket policy
Cost & latency budgets enforced at runtime
Tiers tied to documented risk thresholds
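
As a rough sketch of what per-step tiers with runtime budgets could look like: the tier names, step names, and every number below are made up for illustration, and real values would come from each step's documented risk threshold and SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_paths: int            # ranked model/provider paths the step may try
    latency_budget_s: float   # total time allowed for the step
    cost_budget_usd: float    # total spend allowed for the step


# Hypothetical tiers and step names, for illustration only.
TIERS = {
    "critical": Tier(max_paths=3, latency_budget_s=10.0, cost_budget_usd=0.50),
    "standard": Tier(max_paths=2, latency_budget_s=5.0, cost_budget_usd=0.10),
    "low": Tier(max_paths=1, latency_budget_s=3.0, cost_budget_usd=0.02),
}
STEP_TIER = {"post_to_ledger": "critical", "draft_internal_summary": "low"}


def may_try_another_path(step: str, paths_tried: int,
                         elapsed_s: float, spent_usd: float) -> bool:
    """Runtime check: is there budget left for one more path on this step?"""
    tier = TIERS[STEP_TIER.get(step, "standard")]  # unknown steps default to "standard"
    return (paths_tried < tier.max_paths
            and elapsed_s < tier.latency_budget_s
            and spent_usd < tier.cost_budget_usd)
```

Keeping the tiers in one visible table is what makes the cost of redundancy a choice that can be reviewed and tuned per step.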

Cost controls

Naive retry vs. designed redundancy

Why a try/except loop around one model is not a failover strategy.

Failure mode	Retry the same model	Designed redundancy
Provider outage	Still down; retries fail too	Fails over to another provider
Bad output	Same model, same mistake	Falls back to a different model behind the same validators
Side effects	Risk of double-firing	Idempotency keys deduplicate retried calls
Cost shape	Unbounded retry storms	Budgeted tiers + circuit breakers
Dead end	Silent failure or hang	Human handoff with full lineage

Frequently asked questions

Isn't a single frontier model good enough to just retry?

Retrying the same model against the same prompt rarely fixes a hard failure — and does nothing for a provider outage, a region brownout, or a rate-limit wall. Redundancy means a second path exists: a different model, a different provider, or a deterministic fallback that still completes the task within your SLA.

Do model fallbacks hurt output quality?

Only if you treat all fallbacks as equals. We rank candidates by capability and cost, route to the strongest model that's healthy, and gate the answer with the same validators regardless of which model produced it. A degraded path is allowed to return a narrower result — never a silently worse one.

How do you avoid duplicate side effects when a step is retried?

Every action that touches your systems carries an idempotency key tied to the workflow step, so a retried or failed-over call can't double-charge, double-email, or double-write. State is checkpointed between steps, so failover resumes from the last durable point rather than replaying the whole run.
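
A minimal sketch of the idea, assuming the key is derived from a workflow ID and step name and that the downstream system stores results by key. `idempotency_key` and the in-memory `Ledger` are hypothetical stand-ins, not a real API.

```python
import hashlib


def idempotency_key(workflow_id: str, step: str) -> str:
    """Same workflow step -> same key, no matter which path retries it."""
    return hashlib.sha256(f"{workflow_id}:{step}".encode()).hexdigest()


class Ledger:
    """Stand-in for a downstream system that deduplicates by key."""

    def __init__(self) -> None:
        self.results: dict[str, str] = {}
        self.charges = 0

    def charge(self, key: str, amount_cents: int) -> str:
        if key in self.results:         # seen before: return the stored result
            return self.results[key]
        self.charges += 1               # the side effect happens once
        self.results[key] = f"charged {amount_cents}"
        return self.results[key]


ledger = Ledger()
key = idempotency_key("wf-42", "charge_customer")
first = ledger.charge(key, 1999)
second = ledger.charge(key, 1999)       # a retry or failed-over call reuses the key
```

Because the key depends only on the workflow step, a call retried on a different model or provider still maps to the same downstream record.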

Where does redundancy stop and human escalation begin?

At your risk threshold. When every automated path is exhausted or confidence drops below the line you set, the workflow stops and routes to a human with full context and lineage — a controlled handoff, not a dropped task.

Related architecture decisions

Redundancy is one choice in a connected set. Explore the rest.

Single vs. multi-agent
Stateful vs. stateless
Exception handling
Risk thresholds
Model governance
Agent versioning
Decision lineage

Find the single points of failure before they find you

Bring your most business-critical workflow. We'll map its failure modes and the redundancy tiers each step actually needs.
