Technology

Open-source models, used where they win

Open-weight LLMs — Llama, Qwen, Mistral, Gemma, DeepSeek — are the right engine for the hot path of an agent: private, fine-tunable, and cheap at volume. We use them deliberately, not dogmatically.

  • Self-hosted in your VPC or on-prem
  • Fine-tuned on your data
  • OpenAI-compatible, model-portable
  • Honest about license & capability
8–70B
params we tune for production hot paths
10–30×
cheaper per token than frontier APIs
0
tokens leaving your perimeter when self-hosted
100%
weights you own and can re-host
// the case for open weights

Not cheaper for its own sake — strategic

An open-weight model is a thing you own, not a thing you rent.

When you self-host an open model, the weights live on your hardware, your data never leaves your network, and your unit economics stop scaling linearly with usage. For a single chat that doesn't matter. For an agent firing thousands of tool-formatting and classification calls per workflow, it changes the whole cost curve.

It's also the cleanest answer to lock-in. APIs change pricing, deprecate models, and adjust terms on their schedule, not yours. A fine-tuned open model you've deployed keeps working regardless — and it's portable across clouds and hardware.

We're not zealots about it. Frontier hosted models are genuinely better at hard, open-ended reasoning, and we say so. The skill is knowing which steps of an agent actually need that and which don't.

// where we reach for them

Jobs open models do well

The high-volume, well-scoped steps that make up most of an agent's runtime.

Extraction & parsing

Pulling structured fields out of documents, emails, and tickets — high throughput, narrow task, easy to tune.

Routing & classification

Triaging inputs, choosing the next tool, and labeling at scale where a small tuned model beats a big general one.

Tool-call formatting

Turning intent into well-formed function calls — fast, cheap, and deterministic enough to run on the hot path.

Retrieval & reranking

Embedding, reranking, and grounding answers against your corpus without shipping it to a third party.

Sensitive data

Anything touching PII, PHI, or regulated records — kept fully inside your perimeter, weights and all.

High-volume batch

Overnight reprocessing and backfills where per-token cost dominates and latency is forgiving.

// how we deploy them

From model card to running engine

A repeatable path that keeps the choice reversible.

01

Select

Pick the smallest model that passes the task's evals, and read its license for commercial fit.

02

Tune

LoRA or full fine-tune on your data, plus an eval set so we can prove the gain over the base model.

03

Serve

Deploy via vLLM, TGI, or Ollama in your VPC or on-prem, behind an OpenAI-compatible endpoint.

04

Route

Wire it into the agent so easy steps hit the open model and hard cases escalate to a frontier one.

// the hybrid architecture

Open and hosted, in the same agent

The best production agents aren't open-vs-hosted — they're both, behind a router. A tuned open model carries the high-volume hot path. A frontier model like Claude is reserved for the genuinely hard reasoning, the long plans, and the low-confidence escalations.

Because we keep the action layer and prompts model-portable, the router is a policy you can tune, not a rewrite. You can dial the open/hosted split by cost, latency, accuracy, or data-sensitivity per step — and change it next quarter without touching the agent's logic.

  • Per-step routing by cost, latency, and risk
  • Confidence-based escalation to frontier models
  • One OpenAI-compatible interface for every engine

Self-hosted open weights vs. a hosted API

Both have a place — the trade-offs are what decide each step.

Self-hosted open modelHosted frontier API
Data pathStays inside your perimeterLeaves to the provider
Cost at volumeFixed GPU spend, cheap per callLinear per-token, adds up fast
Peak reasoningGood on scoped tasksBest for hard, open-ended work
TuningFull LoRA / fine-tune controlLimited to provider features
Lock-inYou own the weightsSubject to provider terms

Frequently asked questions

Are open-source models good enough to run an agent?

For frontier reasoning and long, multi-step planning, hosted models like Claude still tend to win. But for scoped, high-volume steps — classification, extraction, routing, tool-call formatting — a tuned 8B–70B open-weight model is fast, cheap, and accurate enough. Most of our agents are hybrid: an open model handles the hot path, a hosted model handles the hard cases.

Open-source vs. open-weight — does the distinction matter?

Yes, and we're honest about it. Most 'open' LLMs (Llama, Qwen, Gemma) ship open weights under a license, not fully open training data or code. That's fine for self-hosting and fine-tuning, but read the license — some restrict commercial use above a user threshold or for certain applications. We check it before we build on it.

Where do these models actually run?

Wherever your data has to stay. We deploy open weights into your VPC, your on-prem GPUs, or an air-gapped network — served through vLLM, TGI, or Ollama behind an OpenAI-compatible endpoint so the agent code doesn't care which model answers.

What about avoiding vendor lock-in?

Open weights are the strongest hedge there is: you hold the model, you can fine-tune it, and you can run it after any vendor changes terms or sunsets an API. We design the action layer and prompts to be model-portable, so swapping the engine is a config change, not a rewrite.

Bring a workflow. We'll pick the right engine.

One session to size where an open model earns its keep — and where a frontier model still has to.