AI redundancy: no single point of failure
Frontier models go down, rate-limit, and hallucinate. We architect agent systems so a failure on one path fails over to another (model, provider, or deterministic backstop) without dropping the work or duplicating it.
- Model & provider failover
- Idempotent retries
- Graceful degradation
- Human escalation at the limit
What actually breaks in production
Redundancy is a response to specific, observed failure modes — not a generic 'add a backup' instinct.
A single-model agent has at least five ways to stall: the provider has an incident, your region is throttled, the context window is exceeded, the model returns malformed output, or it confidently returns something wrong. Each needs a different recovery, and most teams discover that the hard way at 2 a.m.
The engineering decision isn't whether to add redundancy — it's how much, and where. Blanket triple-redundancy triples your cost and latency for steps that don't need it. The work is mapping each step to its blast radius and choosing the cheapest recovery that meets the SLA.
Six layers of redundancy
Each addresses a distinct failure class. Most critical paths use two or three; low-stakes steps use one.
Model fallback
A ranked list of models per step. If the primary errors, refuses, or fails validation, the next-strongest healthy model takes the call.
Provider failover
Same capability, different vendor and region. A provider incident reroutes without a code change or a redeploy.
Idempotent retry
Retries with backoff and jitter, keyed so a re-run of a side-effecting action can't double-fire downstream.
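A minimal sketch of a keyed retry, assuming a downstream system that deduplicates on the key. `TransientError`, `Ledger`, `retry_with_backoff`, and the key format are hypothetical names for illustration, not a specific library's API. The simulated failure is the awkward case: the write lands but the response is lost, so the caller has to retry.

```python
import random
import time
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full jitter: a uniform random delay up to an exponentially growing ceiling."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def retry_with_backoff(
    action: Callable[[], str],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call action until it succeeds or attempts run out, sleeping between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller degrade or escalate
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")


class Ledger:
    """Hypothetical downstream system that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.entries: Dict[str, int] = {}
        self.calls = 0

    def post(self, key: str, amount: int) -> str:
        self.calls += 1
        self.entries.setdefault(key, amount)  # the write happens once per key
        if self.calls == 1:
            raise TransientError("write succeeded but the response was lost")
        return f"posted:{key}"


ledger = Ledger()
key = "run-123:post-to-ledger"  # assumed format: workflow run plus step name
outcome = retry_with_backoff(lambda: ledger.post(key, 100), sleep=lambda _s: None)
```

The ledger is called twice but holds one entry, because the retry reuses the same key. The demo passes a no-op `sleep` only so it runs instantly.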
Graceful degradation
When no full path is healthy, return a narrower, validated result or queue for later — degrade the scope, never the correctness.
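One way to picture degradation, as a sketch under stated assumptions: `run_step`, `StepResult`, and the two path functions are hypothetical, and the in-memory list stands in for a real durable queue. The same validator gates both the full and the narrow result.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    status: str               # "full", "degraded", or "queued"
    payload: Optional[dict]


def run_step(
    request: dict,
    full_path: Callable[[dict], dict],
    narrow_path: Callable[[dict], dict],
    validate: Callable[[dict], bool],
    retry_queue: List[dict],
) -> StepResult:
    """Try the full path, then a narrower one; queue the request if neither validates."""
    for status, path in (("full", full_path), ("degraded", narrow_path)):
        try:
            payload = path(request)
        except Exception:
            continue  # path is unhealthy: try the next, narrower one
        if validate(payload):
            return StepResult(status, payload)
    retry_queue.append(request)  # nothing valid to return: defer, don't guess
    return StepResult("queued", None)


def full_report(request: dict) -> dict:
    raise TimeoutError("no healthy model")  # simulated outage


def totals_only(request: dict) -> dict:
    return {"total": sum(request["amounts"])}  # deterministic, narrower answer


queue: List[dict] = []
result = run_step(
    {"amounts": [4, 6]}, full_report, totals_only, lambda p: "total" in p, queue
)
```

Here the step returns a validated total with status `"degraded"` and nothing is queued.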
Human escalation
The last backstop. Exhausted paths or low confidence route to a person with full context and lineage attached.
Health & circuit breaking
Per-path health checks trip a circuit breaker on repeated failures so the system stops hammering a dead endpoint.
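A circuit breaker can be sketched in a few lines. This is an illustrative version rather than any particular library's implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after repeated failures, then allows a probe once a cool-down has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may go out on this path right now."""
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down has elapsed, then let a probe through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe


# Demo with a fake clock so the cool-down does not require real waiting.
clock = {"now": 0.0}
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: clock["now"])
breaker.record_failure()
breaker.record_failure()
blocked = not breaker.allow()      # circuit is open: stop calling the dead endpoint
clock["now"] = 10.0
probe_allowed = breaker.allow()    # cool-down elapsed: one probe may go out
```

A failed probe calls `record_failure` again, which restarts the cool-down; a successful one closes the circuit.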
How a request fails over
What happens, in order, when a critical step's primary path goes down.
Attempt
Route to the primary model. Validate the output against the step's schema and confidence checks before accepting it.
Detect
On error, refusal, timeout, or validation failure, mark the path unhealthy and trip its circuit breaker if the failure repeats.
Reroute
Fail over to the next ranked model or provider. The idempotency key ensures the retried action takes effect only once downstream.
Escalate
If every automated path is exhausted, checkpoint state and hand off to a human — the work pauses, it never disappears.
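The four stages above can be sketched as a single routing loop. Treat it as an illustration under assumptions, not a production router: `Path`, `Outcome`, and `route` are hypothetical names, each path's `call` stands in for a real model or provider client, and the breaker is reduced to a per-path failure counter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Path:
    name: str
    call: Callable[[str], str]  # hypothetical stand-in for a model or provider client


@dataclass
class Outcome:
    status: str                                        # "done" or "escalated"
    output: Optional[str] = None
    lineage: List[str] = field(default_factory=list)   # what was tried, in order


def route(
    request: str,
    ranked_paths: List[Path],
    validate: Callable[[str], bool],
    failures: Dict[str, int],
    trip_after: int = 2,
) -> Outcome:
    lineage: List[str] = []
    for path in ranked_paths:
        if failures.get(path.name, 0) >= trip_after:
            lineage.append(f"{path.name}: skipped (circuit open)")
            continue
        # Attempt: call the path and validate before accepting.
        try:
            output = path.call(request)
            problem = None if validate(output) else "validation failed"
        except Exception as exc:
            problem = f"error: {exc}"
        if problem is None:
            failures[path.name] = 0
            lineage.append(f"{path.name}: accepted")
            return Outcome("done", output, lineage)
        # Detect: count the failure so repeated ones trip the breaker.
        failures[path.name] = failures.get(path.name, 0) + 1
        lineage.append(f"{path.name}: {problem}")
        # Reroute: the loop moves on to the next ranked path.
    # Escalate: nothing automated is left, so hand off with the lineage attached.
    return Outcome("escalated", None, lineage)


def down(request: str) -> str:
    raise ConnectionError("provider incident")  # simulated outage


paths = [Path("primary", down), Path("secondary", lambda r: r.upper())]
health: Dict[str, int] = {}
first = route("hello", paths, validate=str.isupper, failures=health)
```

In the demo the primary raises, the secondary's output passes validation, and `first.lineage` records both attempts. Once the primary has failed `trip_after` times, later calls skip it without waiting on another timeout.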
Redundancy has a price — spend it where it matters
Every extra path costs latency, money, and complexity. A second model call you make 'just in case' is a tax on every run, not only the failing ones. The discipline is matching redundancy to blast radius: a step that posts to your general ledger earns multi-provider failover; a step that drafts an internal summary does not.
We make that explicit. Each workflow step gets a redundancy tier tied to its risk threshold and SLA, so cost is a deliberate choice you can see and tune — not an accident buried in a framework default.
- Per-step redundancy tiers, not a blanket policy
- Cost & latency budgets enforced at runtime
- Tiers tied to documented risk thresholds
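As a rough sketch of what per-step tiers with runtime budgets could look like: the tier names, step names, and every number below are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Tier:
    max_paths: int         # how many ranked paths the step may try
    max_cost_usd: float    # per-run spend ceiling for the step
    max_latency_s: float   # per-run wall-clock ceiling for the step


# Hypothetical tiers and numbers, for illustration only.
TIERS: Dict[str, Tier] = {
    "critical": Tier(max_paths=3, max_cost_usd=0.50, max_latency_s=30.0),
    "standard": Tier(max_paths=2, max_cost_usd=0.10, max_latency_s=15.0),
    "best_effort": Tier(max_paths=1, max_cost_usd=0.02, max_latency_s=10.0),
}

# Each workflow step is assigned a tier explicitly instead of inheriting a default.
STEP_TIERS: Dict[str, str] = {
    "post_to_ledger": "critical",
    "draft_internal_summary": "best_effort",
}


def may_try_another_path(
    step: str, paths_tried: int, cost_so_far: float, elapsed_s: float
) -> bool:
    """Runtime budget check: allow another path only while every limit still holds."""
    tier = TIERS[STEP_TIERS[step]]
    return (
        paths_tried < tier.max_paths
        and cost_so_far < tier.max_cost_usd
        and elapsed_s < tier.max_latency_s
    )


ledger_can_fail_over = may_try_another_path("post_to_ledger", 1, 0.05, 4.0)
summary_can_fail_over = may_try_another_path("draft_internal_summary", 1, 0.01, 2.0)
```

With these placeholder values the ledger step may fail over after its first attempt, while the summary step, limited to one path, may not.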
Naive retry vs. designed redundancy
Why a try/except loop around one model is not a failover strategy.
| | Retry the same model | Designed redundancy |
|---|---|---|
| Provider outage | Still down — retries fail too | Fails over to another provider |
| Bad output | Same model, same mistake | Falls back to a different model behind the same validators |
| Side effects | Risk of double-firing | Idempotency keys keep it exactly-once |
| Cost shape | Unbounded retry storms | Budgeted tiers + circuit breakers |
| Dead end | Silent failure or hang | Human handoff with full lineage |
Frequently asked questions
Isn't a single frontier model good enough to just retry?
Retrying the same model against the same prompt rarely fixes a hard failure — and does nothing for a provider outage, a region brownout, or a rate-limit wall. Redundancy means a second path exists: a different model, a different provider, or a deterministic fallback that still completes the task within your SLA.
Do model fallbacks hurt output quality?
Only if you treat all fallbacks as equals. We rank candidates by capability and cost, route to the strongest model that's healthy, and gate the answer with the same validators regardless of which model produced it. A degraded path is allowed to return a narrower result — never a silently worse one.
How do you avoid duplicate side effects when a step is retried?
Every action that touches your systems carries an idempotency key tied to the workflow step, so a retried or failed-over call can't double-charge, double-email, or double-write. State is checkpointed between steps, so failover resumes from the last durable point rather than replaying the whole run.
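A small sketch of both ideas together, assuming the key is derived from a run ID and step name and that a plain dictionary stands in for durable checkpoint storage. `idempotency_key`, `run_workflow`, and the step functions are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], str]]  # (step name, action taking an idempotency key)


def idempotency_key(run_id: str, step_name: str) -> str:
    """The same run and step always yield the same key, so a retry reuses it."""
    return hashlib.sha256(f"{run_id}:{step_name}".encode()).hexdigest()[:16]


def run_workflow(run_id: str, steps: List[Step], checkpoints: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order, skipping any step that already has a checkpoint."""
    for name, action in steps:
        if name in checkpoints:
            continue  # resume from the last durable point, no replay
        checkpoints[name] = action(idempotency_key(run_id, name))
    return checkpoints


calls: List[str] = []
attempts = {"charge": 0}


def fetch(key: str) -> str:
    calls.append("fetch")
    return "invoice"


def charge(key: str) -> str:
    attempts["charge"] += 1
    calls.append(f"charge:{key}")
    if attempts["charge"] == 1:
        raise RuntimeError("provider outage")  # simulated failure on the first attempt
    return "charged"


store: Dict[str, str] = {}  # assumption: stands in for durable checkpoint storage
steps: List[Step] = [("fetch", fetch), ("charge", charge)]
try:
    run_workflow("run-1", steps, store)
except RuntimeError:
    pass  # the first run dies at "charge"; the "fetch" checkpoint survives
run_workflow("run-1", steps, store)  # resume: "fetch" is skipped, "charge" is retried
```

After the resume, `fetch` has run once and both `charge` attempts carried the same key, which is what lets a downstream system recognize the second as a repeat.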
Where does redundancy stop and human escalation begin?
At your risk threshold. When every automated path is exhausted or confidence drops below the line you set, the workflow stops and routes to a human with full context and lineage — a controlled handoff, not a dropped task.
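The rule can be written down compactly. In this sketch `Handoff` and `decide` are hypothetical names, and the confidence score is assumed to come from whatever checks the step already runs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Handoff:
    step: str
    reason: str
    context: dict        # inputs and partial results the reviewer needs
    lineage: List[str]   # which paths were tried and how each ended


def decide(
    step: str,
    confidence: float,
    threshold: float,
    paths_exhausted: bool,
    context: dict,
    lineage: List[str],
) -> Optional[Handoff]:
    """Return a Handoff when a human should take over, otherwise None."""
    if paths_exhausted:
        return Handoff(step, "all automated paths exhausted", context, lineage)
    if confidence < threshold:
        reason = f"confidence {confidence:.2f} below threshold {threshold:.2f}"
        return Handoff(step, reason, context, lineage)
    return None  # automation may proceed


accepted = decide("post_to_ledger", 0.95, 0.90, False, {"invoice": "INV-1"}, ["primary: accepted"])
handoff = decide("post_to_ledger", 0.60, 0.90, False, {"invoice": "INV-1"}, ["primary: accepted"])
```

The first call proceeds automatically; the second produces a handoff whose reason names the threshold that was missed.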
Find the single points of failure before they find you
Bring your most business-critical workflow. We'll map its failure modes and the redundancy tiers each step actually needs.