Error Handling Patterns for AI Agent Chains
Mentiko Team
Your agent chain works in development. Every agent completes, the events flow, and the output is exactly what you expected. Then you deploy it. And within 48 hours, you discover that "works on my machine" means absolutely nothing when your chain is running against rate-limited APIs, flaky network connections, and model providers that occasionally return gibberish.
Production agent chains fail. The question isn't whether they'll fail -- it's whether they'll fail gracefully or catastrophically. Here are the error handling patterns that separate a demo from a production system.
Retry with exponential backoff
The simplest and most effective pattern. When an agent fails due to a transient error -- API rate limit, network timeout, temporary service outage -- retry after a delay. Double the delay each time.
{
  "name": "data-enrichment",
  "agents": [
    {
      "name": "enricher",
      "prompt": "Enrich the input records with company data from the API.",
      "triggers": ["chain:start"],
      "emits": ["enrichment:complete"],
      "retry": {
        "max_attempts": 3,
        "backoff": "exponential",
        "initial_delay_ms": 1000,
        "max_delay_ms": 30000
      }
    }
  ]
}
The retry block tells Mentiko to attempt the agent up to three times: the initial attempt, then a first retry after 1 second and a second retry after 2 seconds. The delay doubles with each retry, capped at max_delay_ms. If all three attempts fail, the agent emits its on_error event.
Retries are only appropriate for transient errors. If the agent fails because the prompt is bad or the input is malformed, retrying will just burn tokens and produce the same bad result three times. Mentiko distinguishes between retryable errors (HTTP 429, 503, timeout) and permanent errors (bad input, model refusal, auth failure). Only retryable errors trigger the retry loop.
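The schedule itself is easy to sketch outside the declarative config. A minimal Python illustration (not Mentiko's actual scheduler; max_attempts is assumed to count total attempts, matching the retry block above):

```python
def backoff_delays(max_attempts, initial_delay_ms, max_delay_ms):
    """Delay (ms) before each retry: doubles each time, capped at max_delay_ms.

    max_attempts counts total attempts, so there are max_attempts - 1 retries.
    """
    return [min(initial_delay_ms * 2 ** i, max_delay_ms)
            for i in range(max_attempts - 1)]

# The config above (3 attempts, 1s initial delay) yields two retry delays:
print(backoff_delays(3, 1000, 30000))  # [1000, 2000]
```

The cap matters: without max_delay_ms, a long retry schedule quickly reaches multi-minute waits.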
Fallback agents
When retries are exhausted, you need a Plan B. A fallback agent is a cheaper, simpler, or differently-sourced alternative that can produce an acceptable result when the primary agent can't.
{
  "name": "content-generation",
  "agents": [
    {
      "name": "primary-generator",
      "prompt": "Generate a detailed product description using {PRIMARY_MODEL}.",
      "triggers": ["chain:start"],
      "emits": ["content:ready"],
      "on_error": "generation:primary-failed",
      "retry": { "max_attempts": 2, "backoff": "exponential", "initial_delay_ms": 2000 }
    },
    {
      "name": "fallback-generator",
      "prompt": "Generate a product description using {FALLBACK_MODEL}. Keep it concise.",
      "triggers": ["generation:primary-failed"],
      "emits": ["content:ready"],
      "on_error": "generation:all-failed"
    },
    {
      "name": "formatter",
      "prompt": "Format the product description for the storefront.",
      "triggers": ["content:ready"],
      "emits": ["chain:complete"]
    }
  ]
}
The primary generator tries the expensive model with retries. If it still fails, the fallback generator tries a cheaper model with a simpler prompt. The formatter doesn't care which path produced the content -- it triggers on the same content:ready event either way. If both fail, generation:all-failed can route to an alerting agent or a dead letter queue.
Fallback agents aren't just for model failures. Use them for provider diversity: if OpenAI is down, fall back to Anthropic. If the web scraper can't reach the target, fall back to cached data. The pattern is the same -- primary fails, fallback catches.
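Under the hood, the control flow is essentially a try/except. A hand-rolled sketch of the same idea (the generator functions here are hypothetical stand-ins for model calls, not Mentiko's API):

```python
def generate_with_fallback(primary, fallback, payload):
    """Try the primary generator; if it raises, run the fallback.

    Returns (result, path) so downstream steps can tell which path produced it.
    """
    try:
        return primary(payload), "primary"
    except Exception:
        return fallback(payload), "fallback"

def flaky_primary(payload):
    raise TimeoutError("provider unavailable")  # simulated outage

def cheap_fallback(payload):
    return f"concise description of {payload}"

result, path = generate_with_fallback(flaky_primary, cheap_fallback, "widget")
print(path)  # fallback
```

Returning the path alongside the result mirrors how the chain can record which branch emitted content:ready.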
Error routing and recovery chains
Sometimes a failure in one chain should trigger an entirely different chain. Error routing lets you treat failures as first-class events that kick off recovery workflows.
{
  "name": "order-processing",
  "agents": [
    {
      "name": "validator",
      "prompt": "Validate the order data. Check inventory, pricing, and customer status.",
      "triggers": ["chain:start"],
      "emits": ["order:valid"],
      "on_error": "order:validation-failed"
    },
    {
      "name": "fulfillment",
      "prompt": "Submit the validated order to the fulfillment API.",
      "triggers": ["order:valid"],
      "emits": ["chain:complete"],
      "on_error": "order:fulfillment-failed"
    }
  ],
  "error_routing": {
    "order:validation-failed": "notify-support-chain",
    "order:fulfillment-failed": "retry-fulfillment-chain"
  }
}
The error_routing block maps error events to recovery chains. A validation failure routes to a support notification chain (because it probably needs a human). A fulfillment failure routes to a retry chain with different logic (maybe a different fulfillment provider, maybe a manual queue). Each recovery chain is its own independent chain definition that you can test, version, and monitor separately.
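The dispatch itself amounts to a dictionary lookup. A conceptual sketch (start_chain is a hypothetical stand-in for launching a chain, not a real Mentiko call):

```python
error_routing = {
    "order:validation-failed": "notify-support-chain",
    "order:fulfillment-failed": "retry-fulfillment-chain",
}

def route_error(event, routing, start_chain):
    """Launch the recovery chain mapped to this error event, if one exists."""
    chain = routing.get(event)
    if chain is not None:
        start_chain(chain, event)
    return chain

started = []
route_error("order:validation-failed", error_routing,
            lambda chain, event: started.append(chain))
print(started)  # ['notify-support-chain']
```

Unmapped error events fall through unhandled here; in practice you would route them to a catch-all chain or the dead letter queue.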
Circuit breakers
When an external service is down, retrying every request is wasteful. A circuit breaker detects repeated failures and short-circuits subsequent attempts, returning a failure immediately instead of waiting for the inevitable timeout.
{
  "name": "api-dependent-chain",
  "agents": [
    {
      "name": "api-caller",
      "prompt": "Fetch data from the external analytics API.",
      "triggers": ["chain:start"],
      "emits": ["data:fetched"],
      "on_error": "data:fetch-failed",
      "circuit_breaker": {
        "failure_threshold": 5,
        "reset_timeout_ms": 60000,
        "half_open_attempts": 1
      }
    }
  ]
}
After 5 consecutive failures, the circuit opens. For the next 60 seconds, any run that reaches this agent immediately gets the data:fetch-failed event without even attempting the API call. After 60 seconds, the circuit enters half-open state: one attempt is allowed through. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again for another 60 seconds.
Circuit breakers prevent cascade failures. If your chain runs on a schedule every 5 minutes and the API is down for an hour, you'd burn 12 sets of retries (36 API calls) for nothing. With a circuit breaker, you make 5 calls, detect the outage, and skip the rest until the service recovers.
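The state machine behind this is small. A minimal sketch (times in seconds, one half-open probe; illustrative only, not Mentiko's internal implementation):

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after the timeout."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True  # closed: everything passes
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return True  # half-open: let a probe through
        return False  # open: fail fast, no API call

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open (or re-open after a failed probe)
```

A failed half-open probe re-opens the circuit because the failure counter is still at the threshold; only a success resets it.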
Graceful degradation
Not every chain needs to produce a perfect result or nothing. Graceful degradation means accepting partial results when full results aren't available.
Consider a research chain that queries five data sources. If two sources are down, you have three options: fail the entire chain (bad), retry forever (worse), or proceed with the three sources that responded (usually the right answer).
{
  "name": "multi-source-research",
  "agents": [
    {
      "name": "source-a",
      "triggers": ["chain:start"],
      "emits": ["source:complete"],
      "on_error": "source:complete",
      "error_metadata": { "degraded": true, "source": "a" }
    },
    {
      "name": "source-b",
      "triggers": ["chain:start"],
      "emits": ["source:complete"],
      "on_error": "source:complete",
      "error_metadata": { "degraded": true, "source": "b" }
    },
    {
      "name": "synthesizer",
      "prompt": "Combine all source results. Note any degraded sources in the output.",
      "triggers": ["source:complete"],
      "collect": 2,
      "emits": ["chain:complete"]
    }
  ]
}
The trick: failed source agents emit the same source:complete event as successful ones, but with degraded: true in the metadata. The synthesizer still collects the expected number of events and proceeds, but it knows which sources failed and can note that in the output. The consumer of the chain's output sees "results based on 3 of 5 sources" instead of "chain failed."
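The synthesizer's bookkeeping is straightforward. A sketch of how the collected events might be split (event shapes assumed from the error_metadata fields above):

```python
def summarize_sources(events):
    """Separate healthy results from degraded ones and build a provenance note."""
    ok = [e for e in events if not e.get("degraded")]
    failed = [e["source"] for e in events if e.get("degraded")]
    note = f"results based on {len(ok)} of {len(events)} sources"
    return ok, failed, note

events = [
    {"source": "a", "data": "rows from source a"},
    {"source": "b", "degraded": True},  # source-b's on_error fired
]
ok, failed, note = summarize_sources(events)
print(note)  # results based on 1 of 2 sources
```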
Dead letter queues
When a chain fails beyond recovery -- retries exhausted, fallbacks failed, circuit breaker open -- the run needs to go somewhere. A dead letter queue captures failed runs with their full context so you can investigate and replay them later.
{
  "name": "critical-pipeline",
  "dead_letter": {
    "enabled": true,
    "retain_days": 30,
    "alert_threshold": 5,
    "alert_channel": "slack:ops-alerts"
  }
}
Every failed run gets written to the dead letter queue with: the original input, the chain configuration at the time, every event that fired, the error details, and a timestamp. You can inspect these runs, fix the underlying issue, and replay them from the Mentiko dashboard or via API.
The alert_threshold triggers a notification when failed runs accumulate. Five failures in the queue means something systemic is wrong -- don't wait for the morning standup to find out.
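Conceptually, each entry is just the run's full context plus a timestamp, and the alert is a size check on the queue. A sketch (field names assumed for illustration, not Mentiko's storage schema):

```python
import datetime

def dead_letter_entry(run_input, chain_config, events, error):
    """Capture everything needed to investigate and replay a failed run."""
    return {
        "input": run_input,
        "chain_config": chain_config,
        "events": events,
        "error": error,
        "failed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

def enqueue(queue, entry, alert_threshold=5):
    """Append a failed run; return True once the queue crosses the alert threshold."""
    queue.append(entry)
    return len(queue) >= alert_threshold
```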
Alerting thresholds
Error handling isn't just about recovering from failures. It's about knowing when failures are happening at a rate that demands attention.
{
  "name": "high-volume-pipeline",
  "monitoring": {
    "error_rate_threshold": 0.05,
    "window_minutes": 15,
    "alert_channels": ["slack:pipeline-alerts", "pagerduty:on-call"]
  }
}
A 5% error rate over 15 minutes triggers an alert. Track not just failures but retries, fallback invocations, and circuit breaker trips. A chain that's technically succeeding but hitting its fallback path 40% of the time is a chain with a problem -- the primary path needs attention even though the output is acceptable.
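The sliding-window check itself is simple to reason about. A sketch (timestamps in minutes; outcomes as (timestamp, succeeded) pairs, a format assumed for illustration):

```python
def error_rate(outcomes, window_minutes, now):
    """Fraction of runs inside the window that failed."""
    recent = [ok for t, ok in outcomes if now - t <= window_minutes]
    if not recent:
        return 0.0
    return sum(1 for ok in recent if not ok) / len(recent)

# 19 successes plus 1 failure in the last 15 minutes: exactly the 5% threshold.
runs = [(minute, True) for minute in range(19)] + [(10, False)]
print(error_rate(runs, window_minutes=15, now=15))  # 0.05
```

The same function works for tracking fallback invocations or circuit breaker trips: feed it a different outcome stream and a different threshold.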
Putting it together
A production chain uses multiple patterns simultaneously. Retries on the primary agent. Fallback if retries fail. Dead letter queue if the fallback fails too. Circuit breaker to prevent retry storms during outages. Monitoring to alert when error rates spike. Graceful degradation to produce partial results instead of total failure.
The patterns are declarative -- they live in your chain JSON, they version with your code, and they're visible to everyone on your team. No hidden retry logic buried in application code. No mystery fallbacks that nobody remembers adding.
Start with retries and fallbacks. Add circuit breakers when you're running at scale. Add dead letter queues when you need auditability. Add alerting thresholds when you need sleep. Each pattern is independent and composable -- add them as your chain matures from prototype to production.
Build your first resilient chain with Mentiko's chain builder, or explore other chain patterns to see what's possible.