Cost Controls

Keep autonomous agents inside a cost envelope

Token spend is the one variable that scales with autonomy. We design budgets, ceilings, and loop guards into the agent architecture so cost is bounded by design — not discovered on the invoice.

  • Per-run and per-tenant token budgets
  • Model tiering & prompt caching
  • Loop and runaway-spend circuit breakers
  • Usage ledger for chargeback
10–40x: spend variance between a tuned and an untuned agent
<1%: of runs should ever hit a hard budget ceiling
Per-step: granularity of budget enforcement
100%: of model & tool calls written to a usage ledger
// the engineering decision

Autonomy turns cost into a control problem

A chatbot costs roughly one call per turn. An agent decides how many calls to make.

The moment you give a system the freedom to plan, retry, branch, and call tools, you hand it the ability to spend. A single ambiguous task can fan out into dozens of model calls, recursive sub-agents, and tool loops that each pull a fresh context window. The cost of a run is no longer a constant you can read off a price sheet — it's an emergent property of the agent's behavior.

That makes cost a first-class architectural concern, on the same footing as latency and correctness. The wrong answer is a provider-side spending cap that fires after the damage is done and takes production down with it. The right answer is to instrument the orchestration layer so every run carries a budget, every model call checks it, and the system degrades deliberately when it approaches the edge.
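As a rough illustration of "every run carries a budget, every model call checks it", here is a minimal sketch. The `RunBudget` class, the 80% soft threshold, and the token figures are assumptions made for the example, not part of any particular framework.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a call would push the run past its token ceiling."""


@dataclass
class RunBudget:
    # Hypothetical per-run budget; the limits below are illustrative.
    ceiling_tokens: int
    soft_fraction: float = 0.8  # assumed point at which the run starts degrading
    spent_tokens: int = 0

    def check(self, estimated_tokens: int) -> str:
        """Call before each model call. Returns 'ok' or 'degrade', or raises."""
        projected = self.spent_tokens + estimated_tokens
        if projected > self.ceiling_tokens:
            raise BudgetExceeded(f"{projected} > {self.ceiling_tokens}")
        if projected >= self.soft_fraction * self.ceiling_tokens:
            return "degrade"
        return "ok"

    def record(self, actual_tokens: int) -> None:
        """Call after the model call with the usage the provider reported."""
        self.spent_tokens += actual_tokens


budget = RunBudget(ceiling_tokens=10_000)
status_first = budget.check(3_000)   # projected spend is under the soft threshold
budget.record(3_000)
status_second = budget.check(5_500)  # projected spend crosses the soft threshold
```

Checking against an estimate before the call and recording actual usage afterwards keeps the ceiling a pre-condition rather than a post-mortem.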

// the control surface

Where cost actually gets governed

Five places to put a control, each cheaper to enforce than the runaway it prevents.

// how we implement it

From open-ended spend to a bounded envelope

A measured path that lowers cost before it caps it.

01

Profile

Trace real runs to find where tokens and tool calls actually go. The biggest line item is rarely the one teams expect.
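A profiling pass can start as a simple roll-up of trace records by step. The sketch below assumes a hypothetical record shape (`step`, `input_tokens`, `output_tokens`) and made-up numbers; real traces would come from your own instrumentation.

```python
from collections import defaultdict

# Hypothetical trace records, one per model call in a run (illustrative numbers).
trace = [
    {"step": "plan", "input_tokens": 1200, "output_tokens": 300},
    {"step": "retrieve", "input_tokens": 4000, "output_tokens": 200},
    {"step": "retrieve", "input_tokens": 4100, "output_tokens": 150},
    {"step": "answer", "input_tokens": 2500, "output_tokens": 600},
]


def tokens_by_step(records):
    """Sum input and output tokens per step, largest consumer first."""
    totals = defaultdict(int)
    for r in records:
        totals[r["step"]] += r["input_tokens"] + r["output_tokens"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = tokens_by_step(trace)
```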

02

Tier & cache

Re-route steps to the right model, add prompt caching and retrieval, and cut the redundant context that dominates the bill.
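The tiering half of this step can be sketched as a lookup from step kind to model tier. The tier names, the step-to-tier mapping, and the per-million-token prices below are placeholders chosen for the arithmetic, not real provider prices.

```python
# Hypothetical tiers and prices per million tokens; substitute real figures.
PRICE_PER_MTOK = {"small": 1.0, "large": 10.0}

# Assumed mapping from step kind to tier; unknown steps default to the large model.
TIER_FOR_STEP = {"route": "small", "classify": "small", "reason": "large"}


def pick_tier(step_kind: str) -> str:
    return TIER_FOR_STEP.get(step_kind, "large")


def run_cost(steps, tiered: bool) -> float:
    """Cost of a run given (step_kind, tokens) pairs, with or without tiering."""
    total = 0.0
    for kind, tokens in steps:
        tier = pick_tier(kind) if tiered else "large"
        total += tokens / 1_000_000 * PRICE_PER_MTOK[tier]
    return total


steps = [("route", 200_000), ("classify", 300_000), ("reason", 500_000)]
untiered = run_cost(steps, tiered=False)
tiered = run_cost(steps, tiered=True)
```

Under these placeholder prices, the tiered run costs roughly half the untiered one, because half of the tokens were spent on steps that never needed the large model.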

03

Budget

Wire per-run, per-step, and per-tenant ledgers into the orchestrator, with graceful degradation paths at each threshold.
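One way to layer the three budgets is to check a request against every scope it belongs to and report the first one it would overrun. The `Scope` type and the limits here are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Scope:
    name: str      # e.g. "step", "run", "tenant"
    limit: int     # token ceiling for this scope
    used: int = 0

    def remaining(self) -> int:
        return self.limit - self.used


def first_overrun(scopes, requested: int):
    """Return the first scope the request would overrun, or None if all fit."""
    for scope in scopes:
        if requested > scope.remaining():
            return scope
    return None


def charge(scopes, tokens: int) -> None:
    """Record actual usage against every scope the call belongs to."""
    for scope in scopes:
        scope.used += tokens


step = Scope("step", limit=4_000)
run = Scope("run", limit=20_000, used=17_000)
tenant = Scope("tenant", limit=1_000_000, used=250_000)
blocking = first_overrun([step, run, tenant], requested=3_500)
```

Here the step and tenant budgets have room, but the run has only 3,000 tokens left, so the run scope is the one that triggers a degradation path.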

04

Guard & watch

Add loop guards and circuit breakers, then run live cost dashboards and anomaly alerts against the usage ledger.
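An anomaly alert does not need to be sophisticated to be useful. The sketch below flags runs that cost more than a multiple of the median; the 5x multiple and the per-run totals are assumptions for illustration, and a production alert would likely use a rolling window per workflow.

```python
from statistics import median


def cost_anomalies(run_costs: dict, multiple: float = 5.0) -> list:
    """Flag runs whose cost exceeds `multiple` times the median run cost."""
    baseline = median(run_costs.values())
    return sorted(
        run_id for run_id, cost in run_costs.items() if cost > multiple * baseline
    )


# Illustrative per-run totals, as they might be aggregated from a usage ledger.
recent = {"run-1": 0.10, "run-2": 0.12, "run-3": 0.09, "run-4": 1.40, "run-5": 0.11}
flagged = cost_anomalies(recent)
```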

// the tradeoff

Bounded cost without a lobotomized agent

The naive way to cut agent spend is to truncate: smaller context, fewer steps, a cheaper model everywhere. It works right up until the agent can no longer do the job, and then you've traded an invoice problem for a quality problem.

We treat the budget as a routing signal, not a guillotine. When a run nears its ceiling it can fall back to a smaller model, return a partial result with a flag, or escalate to a human checkpoint — explicit, observable choices instead of a hard failure. You get a predictable cost envelope and an agent that still earns its keep.

  • Graceful degradation, not hard truncation
  • Budget as a routing input to the planner
  • Escalate to human review near the ceiling
  • Every fallback logged for later tuning
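Treating the budget as a routing signal can be sketched as a small policy function that maps consumption to one of the fallbacks listed above. The 70% and 90% thresholds, the `high_stakes` flag, and the in-memory log are assumptions for the example.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    SMALLER_MODEL = "smaller_model"
    PARTIAL_RESULT = "partial_result"
    HUMAN_REVIEW = "human_review"


def next_action(spent: int, ceiling: int, high_stakes: bool = False) -> Action:
    """Map budget consumption to a fallback. Thresholds are illustrative."""
    used = spent / ceiling
    if used < 0.70:
        return Action.PROCEED
    if used < 0.90:
        return Action.SMALLER_MODEL
    # Near the ceiling: escalate high-stakes work, otherwise return what exists.
    return Action.HUMAN_REVIEW if high_stakes else Action.PARTIAL_RESULT


fallback_log = []  # every fallback decision is kept for later tuning


def decide(run_id: str, spent: int, ceiling: int, high_stakes: bool = False) -> Action:
    action = next_action(spent, ceiling, high_stakes)
    if action is not Action.PROCEED:
        fallback_log.append({"run_id": run_id, "spent": spent, "action": action.value})
    return action
```

A planner would call `decide` before each step and branch on the result, so a run near its ceiling changes course instead of failing.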

Provider cap vs. architected cost control

Two ways to bound spend — only one survives contact with production.

                   A provider spending cap      Architected cost control
When it fires      After the budget is spent    Before each model call
Granularity        Whole account, per month     Per run, step, and tenant
Failure mode       Hard-stops production        Degrades or escalates gracefully
Attribution        One aggregate number         Tagged ledger by workflow & version
Loop protection    None                         Depth, time, and retry circuit breakers

Frequently asked questions

Why can't we just set a monthly cap on the provider dashboard?

A monthly cap is a smoke alarm, not a sprinkler. It tells you after a runaway agent has already burned the budget, and it can hard-stop production mid-workflow. Cost control belongs in the orchestration layer, where you can budget per-run, per-tenant, and per-step, and degrade gracefully instead of failing closed.

Doesn't a token budget just make the agent dumber?

Only if you implement it as blunt truncation. We tier the work instead: cheap models for routing and classification, the expensive model reserved for the steps that actually need reasoning, plus prompt caching and retrieval to keep context lean. Most agents get cheaper and faster without losing quality, because the spend was going to redundant tokens.

How do you stop an agent from looping forever?

We combine hard limits on iteration depth and wall-clock time, a per-run token ledger checked before each model call, and a circuit breaker that trips when an agent retries the same failing action. Crossing a threshold routes the run to an exception handler or a human, never to silent retries that quietly compound the bill.
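The depth, wall-clock, and repeated-failure limits can live in one small guard object that the agent loop consults. This is a sketch under assumed defaults; `LoopGuard` and its limits are hypothetical names and values, and the per-run token ledger is left out to keep the example short.

```python
import time
from collections import Counter


class LoopGuardTripped(Exception):
    """Raised so the orchestrator can route to an exception handler or a human."""


class LoopGuard:
    """Hypothetical guard; the default limits are illustrative."""

    def __init__(self, max_steps=25, max_seconds=120.0, max_same_failure=3,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.max_same_failure = max_same_failure
        self.clock = clock
        self.started = clock()
        self.steps = 0
        self.failures = Counter()

    def before_step(self) -> None:
        """Call at the top of every agent iteration."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LoopGuardTripped("iteration depth exceeded")
        if self.clock() - self.started > self.max_seconds:
            raise LoopGuardTripped("wall-clock limit exceeded")

    def record_failure(self, action: str, args: str) -> None:
        """Call when an action fails; trips on repeated identical failures."""
        key = (action, args)
        self.failures[key] += 1
        if self.failures[key] >= self.max_same_failure:
            raise LoopGuardTripped(f"repeated failure: {action}")
```

Passing the clock in as a parameter makes the wall-clock limit testable without sleeping.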

Can we attribute cost back to a customer or workflow?

Yes. Every model and tool call is tagged with tenant, workflow, and agent version, then written to a usage ledger. That powers chargeback, per-feature margin analysis, and the data you need to decide whether a workflow is worth running at all.
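A tagged ledger makes chargeback a group-by. The sketch below assumes a hypothetical `LedgerEntry` shape with the three tags named above plus a per-call cost; the entries are made-up examples.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    tenant: str
    workflow: str
    agent_version: str
    kind: str     # "model" or "tool"
    cost: float   # cost of this single call, in currency units


def rollup(entries, *keys):
    """Sum cost grouped by the given entry attributes, e.g. 'tenant'."""
    totals = defaultdict(float)
    for e in entries:
        totals[tuple(getattr(e, k) for k in keys)] += e.cost
    return dict(totals)


# Illustrative entries only.
ledger = [
    LedgerEntry("acme", "triage", "v3", "model", 0.25),
    LedgerEntry("acme", "triage", "v3", "tool", 0.5),
    LedgerEntry("acme", "summarize", "v1", "model", 1.0),
    LedgerEntry("globex", "triage", "v3", "model", 0.75),
]
by_tenant = rollup(ledger, "tenant")
by_workflow_version = rollup(ledger, "workflow", "agent_version")
```

The same roll-up keyed by workflow and agent version supports per-feature margin analysis and shows whether a new agent version changed what a workflow costs.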

Put a ceiling on spend, not on capability

Bring an agent that's burning more than it should. We'll trace where the tokens go and design the controls to bound it.