Domain-specific models, used where they actually win

A general LLM is the right default for most agent reasoning. A model tuned to your domain earns its place on the narrow, high-volume, or private tasks where accuracy, cost, and latency stop being free.

  • Fine-tuned & open-weight specialists
  • Routed behind a general reasoner
  • Grounded with retrieval
  • Evaluated against a frozen baseline
1 task a specialist should do well, not everything adequately
10-50x cheaper per call than a frontier model on narrow jobs
<300ms achievable latency for small routed classifiers
100% of releases gated against a frozen eval set
// the idea

A specialist is a tool, not the brain

Domain-specific models do one bounded thing very well.

The instinct to fine-tune a model for every problem is usually wrong. Frontier models are extraordinary generalists, and most agent work (planning, reasoning over context, deciding what to do next) is exactly what they are best at. Reaching for a custom model there adds a lot of operational burden for little or no gain in quality.

But there are tasks a generalist handles at a premium: classifying a million support tickets a day, extracting structured fields from clinical notes, scoring risk on a fixed schema, or running entirely inside an air-gapped network. On those, a model tuned to the domain can be cheaper, faster, more consistent, and deployable where a hosted API can't go.

So we treat a domain-specific model the way we treat any other capability — as a tool the agent calls, behind a router, with the general reasoner still in charge of orchestration. The specialist answers a narrow question; the agent decides what to do with the answer.

// when we reach for one

Signals that justify a specialist

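We don't fine-tune on instinct. At least one of these has to be true before a custom model earns its keep.

  • Volume: a narrow task runs at high volume
  • Latency or cost: latency or cost per call matters
  • Vocabulary: the domain language is genuinely foreign to a general model (clinical coding, claims, legal citations)
  • Privacy: an offline or private deployment rules out a hosted API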

// how we build one

From general baseline to routed specialist

We earn the right to fine-tune by proving the generalist isn't enough first.

01

Baseline

Ship the task on a frontier model with good prompting and retrieval. Measure quality, cost, and latency — that's the bar to beat.

02

Decide

Only if a volume, latency, privacy, or vocabulary signal is real do we choose a specialist: fine-tune, train an open-weight base, or distill.

03

Build & ground

Curate data, tune the model, and pair it with retrieval so it learns behavior rather than memorizing facts that will go out of date.

04

Route & guard

Place it behind a router as one tool, with validators, a general-model reviewer on high-stakes calls, and a frozen eval set that gates every release.
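
The gate in steps 01 and 04 can be sketched in a few lines of Python. This is a minimal illustration, not a production harness: the `EvalCase` type, the ticket labels, the two toy models, and the `tolerance` parameter are all hypothetical stand-ins. It shows only the quality check; cost and latency would be measured alongside it.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One frozen example: an input and the label we expect back."""
    text: str
    expected: str


def accuracy(model: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Fraction of frozen cases the model labels correctly."""
    if not cases:
        raise ValueError("eval set must not be empty")
    correct = sum(1 for case in cases if model(case.text) == case.expected)
    return correct / len(cases)


def release_allowed(candidate_acc: float, baseline_acc: float,
                    tolerance: float = 0.0) -> bool:
    """Gate: the candidate must match the baseline, within a tolerance."""
    return candidate_acc >= baseline_acc - tolerance


# Hypothetical frozen set; in practice it is versioned and never edited.
FROZEN_SET = (
    EvalCase("refund not received", "billing"),
    EvalCase("cannot log in", "account"),
    EvalCase("app crashes on start", "bug"),
    EvalCase("charged twice", "billing"),
)


def baseline_model(text: str) -> str:
    """Toy stand-in for the frontier-model baseline."""
    if "refund" in text or "charged" in text:
        return "billing"
    if "log in" in text:
        return "account"
    return "bug"


def candidate_model(text: str) -> str:
    """Toy stand-in for a specialist that misses one billing case."""
    if "refund" in text:
        return "billing"
    if "log in" in text:
        return "account"
    return "bug"


baseline_acc = accuracy(baseline_model, FROZEN_SET)
candidate_acc = accuracy(candidate_model, FROZEN_SET)
print(release_allowed(candidate_acc, baseline_acc))
```

Because the set is frozen, a drop in the candidate's score points at the model rather than at a change in the test data.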

// fit in the architecture

Behind a router, in front of nothing it can't defend

A domain-specific model is never the front door. The agent's general reasoner receives the request, decides whether the specialist is the right tool, calls it through the same interface as every other model, and keeps a fallback ready if confidence is low.

The specialist stays grounded: retrieval supplies the facts it shouldn't memorize, validators check the shape of its output, and on consequential decisions a general model reviews before the action commits. Every call is logged with the model version and inputs, so a regression is caught against a baseline rather than discovered in production.

  • Routed as one tool among many
  • Grounded with retrieval, not memorized facts
  • Reviewed by a general model on high-stakes calls
  • Versioned and logged for regression tracking
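
One way to picture this flow is the sketch below. It is illustrative only: the `Prediction` type, the label set, the 0.8 confidence threshold, and both toy models are assumptions, and the general-model review step for high-stakes calls is left out to keep it short.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("specialist_router")

# Hypothetical label set; a real validator would check the task's own schema.
ALLOWED_LABELS = frozenset({"billing", "account", "bug"})


@dataclass(frozen=True)
class Prediction:
    label: str
    confidence: float  # expected in [0.0, 1.0]
    model_version: str


Model = Callable[[str], Prediction]


def is_valid(pred: Prediction) -> bool:
    """Validator: checks the shape of the output, not its correctness."""
    return pred.label in ALLOWED_LABELS and 0.0 <= pred.confidence <= 1.0


def route(text: str, specialist: Model, generalist: Model,
          min_confidence: float = 0.8) -> Prediction:
    """Try the specialist first; fall back to the generalist when the
    output is malformed or the specialist is not confident enough."""
    pred = specialist(text)
    fell_back = not is_valid(pred) or pred.confidence < min_confidence
    if fell_back:
        pred = generalist(text)
    # Log the input and model version so regressions can be traced later.
    logger.info(json.dumps({
        "input": text,
        "label": pred.label,
        "confidence": pred.confidence,
        "model_version": pred.model_version,
        "fell_back": fell_back,
    }))
    return pred


# Toy stand-ins for real models, for illustration only.
def toy_specialist(text: str) -> Prediction:
    if "refund" in text:
        return Prediction("billing", 0.95, "specialist-v3")
    return Prediction("bug", 0.40, "specialist-v3")


def toy_generalist(text: str) -> Prediction:
    return Prediction("account", 1.0, "generalist-v1")
```

Calling `route("refund not received", toy_specialist, toy_generalist)` keeps the specialist's answer, while `route("cannot log in", toy_specialist, toy_generalist)` falls back because the specialist's confidence is below the threshold.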

General model vs. domain-specific model

Two tools for two jobs — the win is using each where it's strong.

                  | General frontier model               | Domain-specific model
Best at           | Reasoning, planning, open-ended work | One narrow, repeated task
Cost per call     | Higher; fine on low volume           | Far lower at scale
Latency           | Hundreds of ms to seconds            | Can be sub-300ms when distilled
Deployment        | Usually a hosted API                 | Open weights run on-prem or air-gapped
Role in the agent | Orchestrator / reasoner              | A tool the orchestrator calls

Frequently asked questions

Do I need a fine-tuned model, or is a frontier model enough?

Usually the frontier model is enough — start there. We reach for a domain-specific model when a narrow task runs at high volume, when latency or cost per call matters, when the vocabulary is genuinely foreign to a general model (clinical coding, claims, legal citations), or when an offline/private deployment rules out a hosted API.

What exactly is a 'domain-specific model'?

We use it as an umbrella term: a model tuned for a narrow job. That spans a fine-tuned frontier model, an open-weight base trained on your corpus, a small classifier or extractor, or an embedding model adapted to your vocabulary. The shared trait is that it does one bounded thing very well, not everything adequately.

Does this lock me into one model or vendor?

No. A domain-specific model sits behind a router as one tool among several, with the same interface as your general models. You keep your evals, your fallbacks, and the freedom to swap base models — including frontier APIs and open weights — without rewriting the agent.
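
As a rough sketch of what "the same interface" can mean in code, the example below uses a Python `Protocol`. Every name here (`TextModel`, `HostedGeneralist`, `LocalSpecialist`, `Agent`) is hypothetical, and the models return canned strings instead of calling a real API.

```python
from typing import Dict, Protocol


class TextModel(Protocol):
    """The one interface every model sits behind, specialist or generalist."""
    name: str

    def complete(self, prompt: str) -> str: ...


class HostedGeneralist:
    """Stand-in for a hosted frontier API client."""
    name = "hosted-generalist"

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class LocalSpecialist:
    """Stand-in for an open-weight model served on-prem."""
    name = "local-specialist"

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class Agent:
    """Looks tools up by role, so swapping a model never touches call sites."""

    def __init__(self, tools: Dict[str, TextModel]) -> None:
        self._tools = dict(tools)

    def swap(self, role: str, model: TextModel) -> None:
        self._tools[role] = model

    def call(self, role: str, prompt: str) -> str:
        return self._tools[role].complete(prompt)


agent = Agent({"classify": HostedGeneralist()})
before = agent.call("classify", "charged twice")
agent.swap("classify", LocalSpecialist())  # no change to the calling code
after = agent.call("classify", "charged twice")
print(before)
print(after)
```

The agent code that calls `agent.call("classify", ...)` is identical before and after the swap, which is the property that keeps a specialist from becoming a lock-in.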

How do you keep a specialized model from drifting or hallucinating?

We pair it with retrieval for facts it shouldn't memorize, run a frozen eval set on every release, gate outputs with validators and a general-model reviewer on high-stakes calls, and log every decision so regressions are caught against a baseline, not vibes.

Not sure you need a custom model? Good.

Bring the task. We'll baseline it on a frontier model first and tell you honestly whether a domain-specific model would actually pay for itself.