
Deploying Agent Chains: Blue-Green, Canary, and Rolling Updates

Mentiko Team

Deploying a traditional web app is a solved problem. You push code, the CI/CD pipeline builds it, the load balancer drains connections, and the new version takes over. Agent chains are different. A chain might be mid-execution when you deploy a new version. An agent might be waiting for a human decision that won't come for hours. A scheduled chain might fire right in the middle of your deployment window. You can't just replace the binary and restart.

This guide covers deployment strategies adapted for multi-agent pipelines -- how to update chains in production without killing in-flight executions, breaking schedules, or introducing inconsistencies.

Why agent chain deployments are different

Three properties of agent chains make deployment harder than typical application deployments:

Long-running executions. A web request takes milliseconds to seconds. An agent chain execution can take minutes, hours, or days (if it includes human-in-the-loop decision flows). You can't just drain connections and switch over. You need to handle executions that span the deployment boundary.

Stateful mid-execution. When a chain is halfway through -- agent 3 of 5 has completed, agent 4 is running -- there's accumulated state. Event files from previous agents, intermediate results, context that the next agent expects. A deployment can't disrupt this state without corrupting the execution.

Mixed versioning risk. If you update a chain while it's running, agent 3 might have run under the old prompt while agent 4 runs under the new one. The old output format might not match what the new agent expects. This is the agent chain equivalent of a schema migration mid-request.

Strategy 1: Blue-Green deployment

Blue-green is the safest strategy for critical chains. You run two complete environments and switch traffic between them.

How it works

  1. Blue is the current production environment running your existing chain versions.
  2. You deploy updated chains to the Green environment.
  3. You run validation chains against Green to verify the new versions work.
  4. You switch new executions to Green.
  5. Blue continues running until all in-flight executions complete.
  6. Once Blue is drained, it becomes the next staging environment.

{
  "deployment": {
    "strategy": "blue-green",
    "chain": "content-pipeline",
    "blue": {
      "version": "2.3.1",
      "status": "draining",
      "in_flight_executions": 3,
      "accepts_new": false
    },
    "green": {
      "version": "2.4.0",
      "status": "active",
      "in_flight_executions": 12,
      "accepts_new": true
    },
    "switch_timestamp": "2026-03-19T14:00:00Z",
    "drain_timeout_hours": 24
  }
}
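The switch and drain logic can be sketched in a few lines. This is a minimal sketch: the record shape follows the JSON example above, and the helper names are illustrative, not a real API.

```python
def route_new_execution(deployment: dict) -> str:
    """Return the environment that should receive a new execution."""
    for env in ("green", "blue"):
        if deployment[env]["accepts_new"]:
            return env
    raise RuntimeError("no environment is accepting new executions")

def blue_is_drained(deployment: dict) -> bool:
    """Blue can be retired once it takes no new traffic and has nothing in flight."""
    blue = deployment["blue"]
    return not blue["accepts_new"] and blue["in_flight_executions"] == 0

deployment = {
    "blue":  {"version": "2.3.1", "accepts_new": False, "in_flight_executions": 3},
    "green": {"version": "2.4.0", "accepts_new": True,  "in_flight_executions": 12},
}
```

Once blue_is_drained returns true, Blue's infrastructure can be reclaimed or repurposed as the next staging environment.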

The version pinning requirement

The key to blue-green for agent chains is version pinning. When an execution starts, it's pinned to a specific version of every agent in the chain. Even if you deploy a new version mid-execution, the running execution continues with its original agent definitions.

{
  "execution": {
    "id": "exec-a1b2c3",
    "chain": "content-pipeline",
    "pinned_version": "2.3.1",
    "started_at": "2026-03-19T13:45:00Z",
    "agents": [
      {"name": "Researcher", "version": "2.3.1", "status": "completed"},
      {"name": "Writer", "version": "2.3.1", "status": "running"},
      {"name": "Editor", "version": "2.3.1", "status": "pending"},
      {"name": "Publisher", "version": "2.3.1", "status": "pending"}
    ]
  }
}

Even though version 2.4.0 is now live, execution exec-a1b2c3 continues with 2.3.1 agent definitions until it completes or fails.
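One way pinning might be implemented: resolve agent definitions once when the execution starts, and never re-read the live version afterwards. The class names here are illustrative, not a real API.

```python
class AgentRegistry:
    def __init__(self):
        self.live_version = {}   # chain -> currently deployed version
        self.definitions = {}    # (chain, version) -> agent definitions

    def deploy(self, chain, version, agents):
        self.definitions[(chain, version)] = agents
        self.live_version[chain] = version

class Execution:
    def __init__(self, registry, chain):
        # Pin at start: later deploys do not affect this execution.
        self.pinned = registry.live_version[chain]
        self.agents = registry.definitions[(chain, self.pinned)]

registry = AgentRegistry()
registry.deploy("content-pipeline", "2.3.1", ["Researcher", "Writer", "Editor"])
exec_a = Execution(registry, "content-pipeline")   # pinned to 2.3.1
registry.deploy("content-pipeline", "2.4.0", ["Researcher", "Writer", "FactChecker"])
# exec_a still resolves its agents from the 2.3.1 definitions.
```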

When to use blue-green

Use blue-green when downtime is unacceptable and you need a fast rollback path. If Green has issues, you switch traffic back to Blue immediately. No rollback deployment, no waiting for a build -- Blue is still running.

The downside is cost. You need double the infrastructure during the transition period. For agent chains that use Docker workspaces, that means double the containers. For teams running on cloud infrastructure, it's a meaningful cost during the drain period.

Strategy 2: Canary deployment

Canary deployments send a small percentage of new executions to the updated version while the majority continue on the current version. You monitor the canary, and if it looks good, you gradually increase the percentage.

How it works

  1. Deploy the new chain version alongside the current one.
  2. Route 5-10% of new executions to the new version.
  3. Monitor success rates, execution times, error rates, and output quality.
  4. If metrics are healthy, increase to 25%, then 50%, then 100%.
  5. If metrics degrade, route 100% back to the old version.

{
  "deployment": {
    "strategy": "canary",
    "chain": "support-triage",
    "current_version": "1.8.0",
    "canary_version": "1.9.0",
    "traffic_split": {
      "current": 90,
      "canary": 10
    },
    "promotion_criteria": {
      "min_executions": 50,
      "max_error_rate": 0.02,
      "max_latency_increase_pct": 20,
      "min_quality_score": 0.85
    },
    "auto_promote": true,
    "rollback_on_failure": true
  }
}
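The canary controller makes two decisions: where to route each new execution, and whether the collected metrics clear the promotion criteria. A sketch, with field names mirroring the JSON above (the functions themselves are illustrative):

```python
import random

def pick_version(split: dict, rng=random.random) -> str:
    """Route a new execution according to the percentage traffic split."""
    return "canary" if rng() * 100 < split["canary"] else "current"

def should_promote(metrics: dict, criteria: dict) -> bool:
    """Promote only when every promotion criterion is satisfied."""
    return (metrics["executions"] >= criteria["min_executions"]
            and metrics["error_rate"] <= criteria["max_error_rate"]
            and metrics["latency_increase_pct"] <= criteria["max_latency_increase_pct"]
            and metrics["quality_score"] >= criteria["min_quality_score"])

criteria = {"min_executions": 50, "max_error_rate": 0.02,
            "max_latency_increase_pct": 20, "min_quality_score": 0.85}
```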

Quality-based promotion

For agent chains, latency and error rates aren't enough. You also need to evaluate output quality. A chain might complete successfully but produce worse results with the new prompts.

The canary controller can include a quality evaluation step:

{
  "canary_quality_check": {
    "evaluator_agent": "QualityComparer",
    "prompt": "Compare the output of the canary execution to what the current version would produce for the same input. Score the canary output on accuracy, completeness, and format compliance. Output a score from 0.0 to 1.0.",
    "sample_rate": 0.5,
    "min_score": 0.85
  }
}

Run the same input through both versions, compare outputs, and use the comparison to decide whether to promote the canary.
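That paired comparison might be sketched as follows, where run_current, run_canary, and score are stand-ins for your chain runner and evaluator agent:

```python
def canary_quality_gate(inputs, run_current, run_canary, score, min_score=0.85):
    """Replay each sampled input through both versions and average the
    evaluator's scores; the canary passes only if the average clears min_score."""
    scores = []
    for x in inputs:
        baseline = run_current(x)    # output of the current version
        candidate = run_canary(x)    # output of the canary version
        scores.append(score(baseline, candidate))  # evaluator returns 0.0..1.0
    avg = sum(scores) / len(scores)
    return avg >= min_score, avg
```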

When to use canary

Canary works well when you're changing agent prompts or model versions and want to validate that the new version produces equivalent or better results. It's less useful for structural changes (adding or removing agents from a chain) because the outputs might not be directly comparable.

Strategy 3: Rolling updates

Rolling updates replace agents one at a time within a chain, rather than deploying the entire chain as a unit.

How it works

  1. Identify which agents in the chain have changed.
  2. Update them one at a time, starting from the end of the chain (downstream first).
  3. After each agent update, monitor executions that pass through the updated agent.
  4. If metrics hold, update the next agent.
  5. If metrics degrade, roll back the last agent update.

{
  "deployment": {
    "strategy": "rolling",
    "chain": "data-pipeline",
    "agents": [
      {"name": "Collector", "current": "1.2.0", "target": "1.3.0", "status": "pending"},
      {"name": "Transformer", "current": "1.2.0", "target": "1.3.0", "status": "pending"},
      {"name": "Validator", "current": "1.2.0", "target": "1.3.0", "status": "updated"},
      {"name": "Loader", "current": "1.2.0", "target": "1.3.0", "status": "updated"}
    ],
    "update_order": ["Loader", "Validator", "Transformer", "Collector"],
    "pause_between_agents_minutes": 30
  }
}

Why downstream-first

Updating from the end of the chain backward ensures compatibility. When you update the Loader, it still receives output from the old Validator -- and since the Loader is the one changing, you've designed it to handle the existing format. When you then update the Validator, the new Validator's output goes to the new Loader, which you've already verified works.

If you updated upstream first (Collector), the new Collector's output format might break the old Transformer that hasn't been updated yet. Downstream-first avoids this.
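Computing the update order is then just the chain order reversed, filtered to the agents that actually changed. A sketch (the function name is illustrative):

```python
def rolling_update_order(chain_order, changed):
    """Downstream-first: walk the chain from the last agent back to the
    first, keeping only the agents whose definitions changed."""
    return [agent for agent in reversed(chain_order) if agent in changed]

order = rolling_update_order(
    ["Collector", "Transformer", "Validator", "Loader"],
    {"Validator", "Loader"},
)
```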

The contract testing prerequisite

Rolling updates only work if you have contracts between agents. Each agent should have a defined input schema and output schema. Before deploying, verify that the new agent version's output schema is compatible with the next agent's expected input.

{
  "agent_contract": {
    "name": "Transformer",
    "version": "1.3.0",
    "input_schema": {
      "type": "object",
      "required": ["raw_records", "source_metadata"],
      "properties": {
        "raw_records": {"type": "array"},
        "source_metadata": {"type": "object"}
      }
    },
    "output_schema": {
      "type": "object",
      "required": ["transformed_records", "transform_log"],
      "properties": {
        "transformed_records": {"type": "array"},
        "transform_log": {"type": "object"}
      }
    }
  }
}

If the new Transformer version changes the output schema (renames a field, removes a required property), the deployment should fail validation before it starts.
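A rough structural version of that gate checks that every field the downstream agent requires is still guaranteed by the upstream agent's new output schema. This is a sketch, not full JSON Schema validation, and the Loader schema here is a hypothetical example:

```python
def missing_contract_fields(upstream_output: dict, downstream_input: dict) -> set:
    """Return the required downstream fields the upstream schema no longer
    guarantees; an empty set means the contract still holds."""
    produced = set(upstream_output.get("required", []))
    needed = set(downstream_input.get("required", []))
    return needed - produced

# Transformer 1.3.0 output vs. the Loader's expected input.
transformer_out = {"required": ["transformed_records", "transform_log"]}
loader_in = {"required": ["transformed_records"]}
```

If missing_contract_fields returns a non-empty set, the deployment fails validation before any agent is updated.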

When to use rolling

Rolling updates work well when only one or two agents in a chain have changed and you want to minimize the blast radius. They're also useful for large chains (10+ agents) where blue-green would require too much duplicate infrastructure.

Handling scheduled chains during deployment

Scheduled chains add a timing dimension to deployment. A cron-scheduled chain fires at 6 AM. If your deployment starts at 5:55 AM, does the 6 AM execution use the old version or the new one?

The schedule fence

Implement a schedule fence around deployments. Before starting the deployment, pause the schedule. Let any currently-running scheduled execution complete. Deploy the new version. Resume the schedule.

{
  "deployment_schedule_policy": {
    "pre_deploy": "pause_schedule",
    "wait_for_in_flight": true,
    "in_flight_timeout_minutes": 60,
    "post_deploy": "resume_schedule",
    "missed_execution_policy": "run_immediately"
  }
}

If the schedule was supposed to fire during the deployment window, the missed_execution_policy determines what happens: run immediately after deployment completes, skip it, or run at the next scheduled time.
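In code, the fence might look like this. The scheduler interface is illustrative; a toy implementation is included only so the sketch runs.

```python
import time

class DemoScheduler:
    """Toy scheduler used only to make the sketch runnable."""
    def __init__(self):
        self.log = []
        self.in_flight_count = 0
    def pause(self):
        self.log.append("pause")
    def resume(self):
        self.log.append("resume")
    def in_flight(self):
        return self.in_flight_count

def deploy_with_fence(scheduler, deploy_fn, timeout_s=3600, poll_s=1.0):
    """Pause the schedule, wait for in-flight runs to drain, deploy, resume."""
    scheduler.pause()
    deadline = time.monotonic() + timeout_s
    while scheduler.in_flight() > 0:
        if time.monotonic() > deadline:
            scheduler.resume()
            raise TimeoutError("in-flight executions did not drain in time")
        time.sleep(poll_s)
    try:
        deploy_fn()
    finally:
        scheduler.resume()
```

A real implementation would also record whether the schedule should have fired during the pause, so the missed_execution_policy can be applied on resume.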

Handling decision flow chains

Chains with human-in-the-loop decision flows are the hardest to deploy. A chain might pause for a human decision and not resume for hours or days. You can't keep the old environment running indefinitely.

The decision boundary approach

Treat each decision flow as a natural deployment boundary. When a chain pauses for human input, snapshot the execution state. When the human makes their decision, the chain resumes with whatever version is current at that moment.

This requires backward compatibility between versions. The new version must be able to accept the state produced by the old version's agents. If it can't, you need a state migration step that transforms the old state into the format the new version expects.

{
  "decision_boundary_config": {
    "on_pause": "snapshot_state",
    "on_resume": "check_version_compatibility",
    "if_incompatible": "run_state_migration",
    "state_migration_chain": "ops/migrate-execution-state"
  }
}
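A sketch of that resume path, where migrate_state stands in for the ops/migrate-execution-state chain:

```python
def resume_execution(snapshot: dict, live_version: str, migrate_state):
    """Resume a paused execution on whatever version is live now, running a
    state migration first when the snapshot was taken under an older version."""
    state = snapshot["state"]
    if snapshot["version"] != live_version:
        state = migrate_state(state, snapshot["version"], live_version)
    return {"state": state, "version": live_version}
```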

Rollback strategies

Every deployment strategy needs a rollback plan.

Blue-green rollback: Switch traffic back to Blue. Immediate. No data loss.

Canary rollback: Route 100% to the current version. Kill canary executions or let them drain.

Rolling rollback: Reverse the update order. Re-deploy old versions starting from the last updated agent.

For all strategies, maintain the previous version's chain definitions for at least one full execution cycle. If your chain runs daily, keep the old version for at least a day. If it runs hourly, keep it for at least a few hours.

The deployment checklist

Before deploying any chain update to production:

  • [ ] Contract tests pass between all agent pairs
  • [ ] New version tested against representative inputs in staging
  • [ ] Rollback plan documented and tested
  • [ ] Schedule fence configured (for scheduled chains)
  • [ ] Decision flow compatibility verified (for chains with human gates)
  • [ ] Monitoring alerts configured for error rate and quality metrics
  • [ ] In-flight execution policy decided (drain vs. pin vs. migrate)
  • [ ] Previous version retained for rollback window

Agent chain deployment isn't harder than application deployment. It's different. The execution model -- long-running, stateful, sometimes paused for human input -- requires strategies that account for time in a way that stateless web deployments don't. Pick the strategy that matches your risk tolerance and infrastructure budget, and build the automation to make it repeatable.


For more on running chains in production, see monitoring agent chains and the debugging guide. Build your first chain with our 5-minute tutorial.
