Open-source models, used where they win
Open-weight LLMs — Llama, Qwen, Mistral, Gemma, DeepSeek — are the right engine for the hot path of an agent: private, fine-tunable, and cheap at volume. We use them deliberately, not dogmatically.
- Self-hosted in your VPC or on-prem
- Fine-tuned on your data
- OpenAI-compatible, model-portable
- Honest about license & capability
Not cheaper for its own sake — strategic
An open-weight model is a thing you own, not a thing you rent.
When you self-host an open model, the weights live on your hardware, your data never leaves your network, and your unit economics stop scaling linearly with usage. For a single chat that doesn't matter. For an agent firing thousands of tool-formatting and classification calls per workflow, it changes the whole cost curve.
It's also the cleanest answer to lock-in. APIs change pricing, deprecate models, and adjust terms on their schedule, not yours. A fine-tuned open model you've deployed keeps working regardless — and it's portable across clouds and hardware.
We're not zealots about it. Frontier hosted models are genuinely better at hard, open-ended reasoning, and we say so. The skill is knowing which steps of an agent actually need that and which don't.
Jobs open models do well
The high-volume, well-scoped steps that make up most of an agent's runtime.
Extraction & parsing
Pulling structured fields out of documents, emails, and tickets — high throughput, narrow task, easy to tune.
Routing & classification
Triaging inputs, choosing the next tool, and labeling at scale where a small tuned model beats a big general one.
Tool-call formatting
Turning intent into well-formed function calls — fast, cheap, and deterministic enough to run on the hot path.
Retrieval & reranking
Embedding, reranking, and grounding answers against your corpus without shipping it to a third party.
Sensitive data
Anything touching PII, PHI, or regulated records — kept fully inside your perimeter, weights and all.
High-volume batch
Overnight reprocessing and backfills where per-token cost dominates and latency is forgiving.
From model card to running engine
A repeatable path that keeps the choice reversible.
Select
Pick the smallest model that passes the task's evals, and read its license for commercial fit.
Tune
LoRA or full fine-tune on your data, plus an eval set so we can prove the gain over the base model.
Serve
Deploy via vLLM, TGI, or Ollama in your VPC or on-prem, behind an OpenAI-compatible endpoint.
Route
Wire it into the agent so easy steps hit the open model and hard cases escalate to a frontier one.
Open and hosted, in the same agent
The best production agents aren't open-vs-hosted — they're both, behind a router. A tuned open model carries the high-volume hot path. A frontier model like Claude is reserved for the genuinely hard reasoning, the long plans, and the low-confidence escalations.
Because we keep the action layer and prompts model-portable, the router is a policy you can tune, not a rewrite. You can dial the open/hosted split by cost, latency, accuracy, or data-sensitivity per step — and change it next quarter without touching the agent's logic.
- Per-step routing by cost, latency, and risk
- Confidence-based escalation to frontier models
- One OpenAI-compatible interface for every engine
Self-hosted open weights vs. a hosted API
Both have a place — the trade-offs are what decide each step.
| Self-hosted open model | Hosted frontier API | |
|---|---|---|
| Data path | Stays inside your perimeter | Leaves to the provider |
| Cost at volume | Fixed GPU spend, cheap per call | Linear per-token, adds up fast |
| Peak reasoning | Good on scoped tasks | Best for hard, open-ended work |
| Tuning | Full LoRA / fine-tune control | Limited to provider features |
| Lock-in | You own the weights | Subject to provider terms |
Frequently asked questions
Are open-source models good enough to run an agent?
For frontier reasoning and long, multi-step planning, hosted models like Claude still tend to win. But for scoped, high-volume steps — classification, extraction, routing, tool-call formatting — a tuned 8B–70B open-weight model is fast, cheap, and accurate enough. Most of our agents are hybrid: an open model handles the hot path, a hosted model handles the hard cases.
Open-source vs. open-weight — does the distinction matter?
Yes, and we're honest about it. Most 'open' LLMs (Llama, Qwen, Gemma) ship open weights under a license, not fully open training data or code. That's fine for self-hosting and fine-tuning, but read the license — some restrict commercial use above a user threshold or for certain applications. We check it before we build on it.
Where do these models actually run?
Wherever your data has to stay. We deploy open weights into your VPC, your on-prem GPUs, or an air-gapped network — served through vLLM, TGI, or Ollama behind an OpenAI-compatible endpoint so the agent code doesn't care which model answers.
What about avoiding vendor lock-in?
Open weights are the strongest hedge there is: you hold the model, you can fine-tune it, and you can run it after any vendor changes terms or sunsets an API. We design the action layer and prompts to be model-portable, so swapping the engine is a config change, not a rewrite.
Bring a workflow. We'll pick the right engine.
One session to size where an open model earns its keep — and where a frontier model still has to.