
Apache Airflow, where agents need a real data plane

Airflow is the scheduler we reach for when an agent's value depends on fresh, well-prepared data. It runs the batch DAGs that ingest, transform, and embed — so the agent inherits a clean, current view of the world.

  • Scheduled & backfillable DAGs
  • Retry-safe data ingestion
  • Embedding & index refresh
  • Self-hosted or managed

2014: open-sourced at Airbnb, now an Apache top-level project
Python: DAGs are code (version-controlled, testable, reviewable)
1000s of tasks per DAG, scheduled and backfilled
BSL-free: Apache 2.0 licensed, no vendor lock-in

// what it is

A scheduler for directed pipelines of work

Airflow expresses work as a DAG — a directed acyclic graph of tasks with explicit dependencies — and runs it on a schedule.

Each task is defined in Python. The scheduler decides what can run and in what order, and retries failed tasks according to the retry policy you configure, optionally with exponential backoff. Because a DAG is code, it lives in your repository: diffed in pull requests, unit-tested, and deployed like any other service.
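To make the scheduling idea concrete, here is a minimal plain-Python sketch of what "run tasks in dependency order and retry failures with backoff" means. It is not Airflow's API or implementation: `run_pipeline`, `tasks`, `deps`, and `flaky_extract` are hypothetical names for illustration. In Airflow itself, the equivalent behavior is configured per task through operator arguments such as `retries`, `retry_delay`, and `retry_exponential_backoff`.

```python
import time
from graphlib import TopologicalSorter


def run_pipeline(tasks, deps, retries=2, base_delay=0.0, sleep=time.sleep):
    """Run callables in dependency order, retrying failures with exponential backoff.

    tasks: {name: zero-argument callable}
    deps:  {name: set of upstream task names that must finish first}
    Returns the task names in the order they completed.
    """
    completed = []
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # out of retries: the whole run fails here
                sleep(base_delay * 2 ** attempt)  # delay doubles on each retry
        completed.append(name)
    return completed


# Hypothetical tasks: extract fails once with a transient error, then succeeds.
attempts = {"extract": 0}


def flaky_extract():
    attempts["extract"] += 1
    if attempts["extract"] < 2:
        raise ConnectionError("transient failure")


tasks = {"extract": flaky_extract, "transform": lambda: None, "embed": lambda: None}
deps = {"transform": {"extract"}, "embed": {"transform"}}

print(run_pipeline(tasks, deps))  # ['extract', 'transform', 'embed']
```

The dependency map is the only thing that determines order here, which is the property that makes a DAG reviewable in a pull request: changing the pipeline's shape is a visible diff.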

Airflow is good at batch-shaped, time-driven work: "every night at 02:00, pull yesterday's records, transform them, and refresh the index." It gives you backfills, reruns that are safe as long as tasks are written to be idempotent, a clear execution history, and a UI that shows exactly which task failed and why.
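As a rough illustration of why backfills and reruns are safe, the sketch below keys every run to the date it covers and replaces that date's partition instead of appending to it. It assumes a daily schedule; `logical_dates`, `load_partition`, `fetch`, and the in-memory `store` are hypothetical stand-ins, not Airflow functions. In a real DAG, Airflow supplies the run's logical date to each task.

```python
from datetime import date, timedelta


def logical_dates(start, end):
    """Yield one date per daily run in [start, end]: the runs a backfill would cover."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def load_partition(store, day, records):
    """Idempotent write: overwrite the partition for `day` rather than appending to it."""
    store[day.isoformat()] = list(records)


def fetch(day):
    """Hypothetical source query: return the records that belong to `day`."""
    return [f"record-{day.isoformat()}"]


store = {}

# Backfill three days of history.
for day in logical_dates(date(2024, 1, 1), date(2024, 1, 3)):
    load_partition(store, day, fetch(day))

# Rerun a single day: the partition is replaced, so nothing is duplicated.
load_partition(store, date(2024, 1, 2), fetch(date(2024, 1, 2)))

print(sorted(store))        # ['2024-01-01', '2024-01-02', '2024-01-03']
print(store["2024-01-02"])  # ['record-2024-01-02']
```

Had `load_partition` appended instead of overwriting, the rerun would have doubled the records for that day, which is the failure mode idempotent tasks are meant to rule out.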

Airflow is not built for low-latency, event-driven work, or for long-lived workflows that wait on a human for three days. That is a different shape of problem, and we reach for a different tool when we hit it.

// when we reach for it

Where Airflow earns its place in an agent build

We use Airflow for the data plane around an agent — the jobs that keep its context accurate — not as the agent's reasoning runtime.

// the data plane loop

How a DAG feeds an agent

A nightly cycle that keeps an autonomous agent grounded in current data.

01

Extract

Scheduled tasks pull new and changed records from your systems of record, with retries on transient failure.

02

Transform

Clean, normalize, and chunk the data — the prep that determines whether retrieval later returns signal or noise.

03

Embed

Generate embeddings and upsert into the vector store, tracking versions so stale vectors are evicted.

04

Verify

Run data-quality checks; on failure the DAG alerts and holds, so the agent never reasons over broken context.
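The transform, embed, and verify steps above can be sketched in plain Python. This is a simplified model under loud assumptions: the index is an in-memory dict rather than a real vector store, `toy_embed` is a hash-based stand-in for an embedding model, and fixed-size character chunking is deliberately naive. All function names are hypothetical. In a DAG, each function would be the body of one task.

```python
import hashlib


def chunk(text, size=40):
    """Normalize whitespace, then split into fixed-size character chunks."""
    text = " ".join(text.split())
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(piece, dims=4):
    """Stand-in for a real embedding model: deterministic floats from a hash."""
    digest = hashlib.sha256(piece.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def refresh_index(index, docs, version):
    """Upsert every chunk under `version`, then evict rows left from older versions.

    Returns the number of stale rows evicted.
    """
    for doc_id, text in docs.items():
        for n, piece in enumerate(chunk(text)):
            index[f"{doc_id}:{n}"] = {"vector": toy_embed(piece), "version": version}
    stale = [key for key, row in index.items() if row["version"] != version]
    for key in stale:
        del index[key]
    return len(stale)


def verify(index, min_rows=1, dims=4):
    """Data-quality gate: raise so the run fails instead of publishing bad context."""
    if len(index) < min_rows:
        raise ValueError("index is empty after refresh")
    if any(len(row["vector"]) != dims for row in index.values()):
        raise ValueError("vector with unexpected dimensionality")


index = {}
refresh_index(index, {"a": "x" * 100, "b": "short"}, version=1)  # 4 rows written
evicted = refresh_index(index, {"a": "x" * 50}, version=2)       # doc b removed, doc a shrank
verify(index)

print(evicted)        # 2
print(sorted(index))  # ['a:0', 'a:1']
```

Tagging each row with the run's version is what lets the second refresh remove both the deleted document and the leftover third chunk of the document that shrank, without a separate diffing step.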

// honest boundaries

We pick the tool that fits the workflow

Airflow is not the answer to every orchestration question, and we won't pretend it is. It shines for scheduled, batch, data-centric DAGs. It's an awkward fit for live, stateful, human-in-the-loop agent workflows that may run for days and need durable per-step state.

On most engagements the two coexist. Airflow runs the data plane — ingestion, embedding, enrichment — on a schedule. A durable workflow engine like Temporal runs the live agent, holding state across tool calls and approval gates. Each does what it's genuinely good at.

  • Airflow for scheduled, batch DAGs
  • Temporal for long-running agent state
  • n8n for lighter event-driven glue
  • No single-vendor lock-in

Airflow vs. a durable workflow engine

Two orchestrators, two jobs. We often run both.

                   | Apache Airflow               | Temporal
Shape of work      | Scheduled, batch DAGs        | Long-running, event-driven workflows
State              | Per-run, data-centric        | Durable per-step, survives restarts
Trigger            | Time / schedule first        | Events and signals
Human-in-the-loop  | Awkward, not its strength    | First-class waits and approvals
Best for the agent | The data plane that feeds it | The live reasoning workflow itself

Frequently asked questions

Isn't Airflow just for data engineering, not agents?

That's exactly why we use it. Agents are only as good as the data they act on. Airflow runs the scheduled extract-transform-load and embedding-refresh jobs that keep an agent's context fresh — it's the plumbing around the agent, not the agent's own reasoning loop.

Airflow or Temporal — how do you choose?

Airflow is built for scheduled, batch-shaped DAGs where the unit of work is a data transformation. Temporal is built for long-running, event-driven workflows with durable per-step state and human-in-the-loop waits. Many of our builds run both: Airflow for the nightly data plane, Temporal for the live agent workflow.

Do you lock us into a managed Airflow vendor?

No. Airflow is open source under the Apache 2.0 license. We deploy it self-hosted, or on Amazon MWAA, Google Cloud Composer, or Astronomer if you already run one. Your DAGs are portable Python either way, and you keep the repo.

Can Airflow call our agents directly?

Yes, and it's a common pattern. A DAG task can invoke an agent run, wait on the result, and branch on it — so you get LLM-powered steps (classification, extraction, enrichment) inside an otherwise deterministic, observable, retry-safe pipeline.
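A minimal sketch of the branch-on-agent-result pattern, in plain Python so it runs without Airflow installed. The `agent` callable, the label set, and the downstream task ids are all hypothetical. The key assumption is that the agent's free-text answer is validated against a closed set before anything branches on it, so the pipeline stays deterministic even when the model does not. In Airflow, a function like `choose_branch` could serve as the body of a `@task.branch` task, which returns the id of the task to run next.

```python
ALLOWED_LABELS = {"billing", "technical"}

# Hypothetical downstream task ids, one per label.
ROUTES = {
    "billing": "route_to_billing",
    "technical": "route_to_support",
}


def classify_ticket(text, agent):
    """Ask the agent for a label and coerce anything unexpected to 'unknown'."""
    label = agent(text).strip().lower()
    return label if label in ALLOWED_LABELS else "unknown"


def choose_branch(label):
    """Return the id of the downstream task to follow for this label."""
    return ROUTES.get(label, "send_to_human_review")


def fake_agent(text):
    """Stand-in for a real agent call, used here only to make the sketch runnable."""
    return " Billing\n" if "refund" in text else "I am not sure about this one."


print(choose_branch(classify_ticket("refund please", fake_agent)))  # route_to_billing
print(choose_branch(classify_ticket("app crashes", fake_agent)))    # send_to_human_review
```

Routing unrecognized answers to a review task, rather than guessing, keeps the LLM-powered step from silently steering the rest of the pipeline.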

Got a pipeline that should feed an agent?

Bring the data flow you want automated. We'll show you where Airflow fits, where it doesn't, and what the agent on top of it looks like.