
Why Event-Driven Beats DAG-Based Agent Orchestration

Mentiko Team

Every agent framework ships with some version of the same abstraction: define a graph, wire up the nodes, run it. LangGraph gives you StateGraph. CrewAI gives you crews and tasks. Airflow gives you DAGs. The mental model is always the same -- draw the flowchart, then execute it.

We think this is the wrong abstraction for AI agents. Here's why.

DAGs assume you know the shape of your workflow

A DAG (directed acyclic graph) requires you to declare every node and every edge before execution starts. That works for ETL pipelines where the shape of the work is known at design time. Extract from Postgres, transform with dbt, load into BigQuery. The graph doesn't change at runtime.

Agent workflows are different. An agent might discover that the data it retrieved is malformed and needs a cleanup step that wasn't in the original plan. A code review agent might find a security issue that needs routing to a specialized security agent. A research agent might find three subtopics worth exploring in parallel where you expected one.

In a DAG framework, you handle this by making the graph more complex. You add conditional edges, branching logic, state machines. Your "simple" four-node graph becomes a 15-node monster with six conditional branches and a retry loop.

In an event-driven system, you handle this by having agents emit events:

// researcher.event
{
  "agent": "researcher",
  "status": "complete",
  "findings": ["security_vuln", "perf_regression", "api_breaking_change"],
  "trigger": ["security-reviewer", "perf-analyst", "api-compat-checker"]
}

No graph needed. The researcher found three things, so three agents spin up. If it had found one thing, one agent spins up. The system adapts because events carry intent, not just state transitions.
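A dispatcher for this pattern fits in a few lines. This is a minimal sketch, not any framework's API: the event shape matches the example above, and `spawn_agent` is a hypothetical hook for actually starting an agent.

```python
import json

def dispatch(raw_event, spawn_agent):
    """Spawn whichever agents the event names in its "trigger" list."""
    event = json.loads(raw_event)
    triggered = event.get("trigger", [])
    for agent_name in triggered:
        # spawn_agent might start a process, enqueue a job, etc.
        spawn_agent(agent_name, event)
    return triggered
```

The dispatcher never consults a graph: the fan-out width is whatever the emitting agent decided at runtime.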

Event files are debuggable in ways graphs aren't

When a LangGraph chain fails at step 4 of 7, you read logs. You set breakpoints. You add print() statements to your node functions and re-run the whole thing. The graph is an abstraction that hides the intermediate state.

When an event-driven pipeline fails, you read the event files:

$ ls .events/
researcher.event
writer.event
editor.event       # <-- this one has status: "error"

$ cat .events/editor.event
{
  "agent": "editor",
  "status": "error",
  "error": "context_length_exceeded",
  "input_tokens": 142000,
  "model": "gpt-5.4",
  "timestamp": "2026-03-19T14:23:01Z"
}

That's it. No log aggregation, no tracing infrastructure, no stepping through a debugger. The event file is the state. You can grep it, diff it against yesterday's run, git blame it when the format changes.

This matters more than it sounds. When you're debugging a production agent pipeline at 2am, the difference between "read a file" and "set up the right logging level, find the right trace ID, correlate across services" is the difference between a 5-minute fix and a 2-hour investigation.
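Because the state is just files, that 2am triage can even be scripted. A sketch, assuming the `.events/` layout shown above; `failed_events` is a hypothetical helper, not part of any shipped tool:

```python
import json
from pathlib import Path

def failed_events(events_dir):
    """Map each failed agent to its error code by scanning .event files."""
    failures = {}
    for path in Path(events_dir).glob("*.event"):
        event = json.loads(path.read_text())
        if event.get("status") == "error":
            failures[event["agent"]] = event.get("error")
    return failures
```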

Decoupled agents are replaceable agents

Here's how you wire up a two-agent chain in LangGraph:

from langgraph.graph import START, StateGraph

graph = StateGraph(AgentState)
graph.add_node("researcher", researcher_agent)
graph.add_node("writer", writer_agent)
graph.add_edge(START, "researcher")  # entry edge; compile() fails without one
graph.add_edge("researcher", "writer")
chain = graph.compile()

The researcher and writer are coupled through the graph definition. If you want to swap the writer for a different model, add an editor in between, or route to different writers based on content type -- you're editing the graph. Every change is a structural change.

In an event-driven system, the researcher doesn't know or care what happens after it runs. It emits an event. Something else picks it up:

# agents/researcher.yaml
name: researcher
model: claude-sonnet-4-20250514
on_complete: emit researcher.event

# agents/writer.yaml
name: writer
watch: researcher.event
condition: status == "complete"

Want to add an editor? Create editor.yaml that watches writer.event. Want to A/B test two writer models? Create writer-a.yaml and writer-b.yaml, both watching researcher.event, with conditions that split traffic. The researcher agent is unchanged in both cases.
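The watch/condition semantics in those configs reduce to a small check. In this sketch, conditions are modeled as key/value pairs; that's our simplification, and the `status == "complete"` string syntax above would need a small parser on top:

```python
import json
from pathlib import Path

def should_run(agent_config, events_dir):
    """True once the watched event file exists and satisfies the condition."""
    path = Path(events_dir) / agent_config["watch"]
    if not path.exists():
        return False
    event = json.loads(path.read_text())
    condition = agent_config.get("condition", {})
    # Every condition key must match the event's value for that key.
    return all(event.get(key) == value for key, value in condition.items())
```

Run this check in a loop (or from a filesystem watcher) and each agent decides for itself when to fire, with no central graph deciding for it.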

This is the same insight that made microservices win over monoliths: loose coupling through messaging is more maintainable than tight coupling through direct calls. It's not a new idea. We're just applying it to agents.

Error recovery becomes creative, not mechanical

DAG frameworks all handle errors the same mechanical way: retry N times, then fail. Maybe you get exponential backoff. Maybe you get a dead-letter queue. The assumption is that the same operation, run again, will eventually succeed.

Agent errors are different. If an agent exceeds context length, retrying with the same input won't help. If an LLM returns malformed JSON, retrying might work but it might also burn tokens for no reason. If a web scraping agent hits a 403, retrying is actively counterproductive.

Event-driven error handling looks like this:

# agents/recovery-router.yaml
name: recovery-router
watch: "*.event"
condition: status == "error"

# This agent reads the error event and decides what to do
# - context_length_exceeded -> route to summarizer, then retry
# - malformed_json -> route to json-fixer agent
# - rate_limited -> schedule retry with backoff
# - 403_forbidden -> route to alternative-source agent

The recovery agent is itself just an agent. It reads the error event, understands the failure mode, and triggers the appropriate response. This is creative recovery -- the system can invent new paths through the workflow based on what actually went wrong.
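The routing table inside such a recovery agent could start as simply as a dict. A sketch: the route names come from the comments above, while the `human-escalation` fallback is our own assumption:

```python
RECOVERY_ROUTES = {
    "context_length_exceeded": "summarizer",
    "malformed_json": "json-fixer",
    "rate_limited": "retry-scheduler",
    "403_forbidden": "alternative-source",
}

def route_recovery(error_event):
    """Pick the recovery agent for an error event; escalate unknown failures."""
    return RECOVERY_ROUTES.get(error_event.get("error"), "human-escalation")
```

In practice the router would be an LLM call rather than a lookup table (that's what makes the recovery creative); the table is the degenerate, deterministic version.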

You can't do this in a DAG without encoding every possible failure mode as an edge in the graph. And you can't predict every failure mode upfront, which brings us back to the fundamental problem: DAGs require you to know the shape of your workflow before it runs.

Fan-out and fan-in without ceremony

One of the most common agent patterns is "do N things in parallel, then combine the results." In DAG frameworks, this requires explicit fan-out and fan-in nodes, barrier synchronization, and state merging logic.

In an event system, fan-out is just multiple agents watching the same event. Fan-in is an agent that watches for multiple events:

# Fan-out: three agents all triggered by the same event
name: analyst-financial
watch: data-collector.event

name: analyst-legal
watch: data-collector.event

name: analyst-technical
watch: data-collector.event

# Fan-in: waits for all three
name: report-synthesizer
watch:
  - analyst-financial.event
  - analyst-legal.event
  - analyst-technical.event
condition: all(status == "complete")

Adding a fourth analyst is one YAML file. Removing one is deleting a file. No graph restructuring, no state schema changes, no recompilation.
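The fan-in barrier the synthesizer needs is a handful of lines in any event runtime. A sketch, reusing the same on-disk event layout as the earlier examples:

```python
import json
from pathlib import Path

def fan_in_ready(watched_events, events_dir):
    """True once every watched event file exists with status "complete"."""
    for name in watched_events:
        path = Path(events_dir) / name
        if not path.exists():
            return False  # at least one upstream agent hasn't finished
        if json.loads(path.read_text()).get("status") != "complete":
            return False
    return True
```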

The tradeoffs are real

Event-driven agent orchestration isn't all upside. We'd be dishonest if we didn't acknowledge the costs:

No visual graph. You can't look at a DAG diagram and see the full workflow at a glance. With events, you need to trace which agents watch which events to reconstruct the flow. We address this with tooling that generates a flow diagram from agent configs, but it's a derived view, not the source of truth.

Convention over configuration. Event formats need to be consistent across agents. If your researcher emits {"state": "done"} and your writer watches for status == "complete", nothing happens. We handle this by being permissive about formats (JSON, YAML, or markdown headers all work) and validating event schemas at deploy time.

Harder to reason about completeness. With a DAG, you can statically verify that every node has an incoming edge. With events, an agent could be watching for an event that nothing emits. Again, tooling helps -- but it's a runtime guarantee, not a compile-time one.

Ordering is implicit. DAGs make execution order explicit by definition. Event-driven systems determine order by the watch/emit relationships. This is usually fine -- but when you need strict sequential execution, you have to be deliberate about it.

The right tool for the right abstraction

DAGs are great for data pipelines. We use Dagster internally for our own ETL. The key difference is that data pipelines are deterministic -- the same input produces the same output, the shape of the work is known upfront, and the failure modes are well-understood.

AI agents are none of those things. Their outputs are probabilistic. The shape of the work depends on what they find. Failure modes are creative and context-dependent. Trying to force agents into a DAG abstraction means fighting the framework every time an agent does something unexpected -- which, with LLMs, is frequently.

Event-driven architecture matches the nature of the work: asynchronous, loosely coupled, dynamically structured, observable by default. It's not the only valid approach. But we think it's a better default than "draw the flowchart first."

If you want to see what this looks like in practice, sign up for early access and we'll give you an instance to build on. Your events, your agents, your infrastructure.
