Migrating Agent Chains Between Environments: Dev to Staging to Production
Mentiko Team
You built an agent chain in development. It works on your laptop. Now you need to run it in production, and you're about to discover every assumption you made about the environment.
Promoting agent chains across environments is harder than deploying traditional code. Chains carry configuration, secrets, model dependencies, and behavioral expectations that change between environments. Here's the systematic approach to getting a chain from dev to staging to production without breaking things.
Why chains are different from code deploys
A typical code deploy promotes a build artifact. The artifact is deterministic -- same inputs produce same outputs across environments. The only things that change are configuration and infrastructure.
Agent chains are non-deterministic by nature. The LLM might produce different output in production than it did in development, even with identical prompts, because:
- Model versions may differ (providers update models)
- Rate limits and latency affect chain timing
- Input data distribution in production doesn't match test data
- Concurrent execution creates race conditions not present in dev
This means you can't just "deploy" a chain and assume it works. You need validation gates that verify the chain behaves correctly in each environment before promoting it to the next.
The three-environment model
Most teams need three environments:
Development. Your laptop or a dev server. Fast iteration, no cost constraints, test data. Chains run manually or on-demand. You're experimenting with prompts, testing agent configurations, and validating that the chain logic is correct.
Staging. A shared environment that mirrors production infrastructure. Production-like data (anonymized or synthetic), production API keys with reduced rate limits, and the same orchestration platform version as production. Chains run on their production schedules but with monitoring for correctness rather than business impact.
Production. The real thing. Real data, real users depending on the output, real money being spent on tokens. Changes here need to be safe, validated, and reversible.
What changes between environments
A chain definition is more than code. Here's everything that can differ across environments:
Secrets and API keys
Every environment needs its own secrets:
# dev
OPENAI_API_KEY: sk-dev-xxx
SLACK_WEBHOOK: https://hooks.slack.com/dev-channel
DATABASE_URL: postgres://localhost:5432/dev
# staging
OPENAI_API_KEY: sk-staging-xxx
SLACK_WEBHOOK: https://hooks.slack.com/staging-channel
DATABASE_URL: postgres://staging-db:5432/staging
# production
OPENAI_API_KEY: sk-prod-xxx
SLACK_WEBHOOK: https://hooks.slack.com/prod-channel
DATABASE_URL: postgres://prod-db:5432/prod
Never hardcode secrets in chain definitions. Use environment variables or a secrets vault. In Mentiko, secrets are stored per-workspace and injected at runtime. The chain definition references $OPENAI_API_KEY and the platform resolves it based on the execution environment.
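The injection step can be sketched in a few lines of Python. This is an illustrative sketch, not Mentiko's actual resolver -- `resolve_secrets` and its regex are assumptions -- but it shows the principle: the chain definition carries only references, and the execution environment supplies values at runtime, failing fast when one is missing.

```python
import re

def resolve_secrets(definition: str, env: dict) -> str:
    """Replace $NAME / ${NAME} references with values from the
    execution environment, failing fast if a secret is not set."""
    def lookup(match):
        name = match.group(1) or match.group(2)
        if name not in env:
            raise KeyError(f"secret {name} is not set in this environment")
        return env[name]
    return re.sub(r"\$\{(\w+)\}|\$(\w+)", lookup, definition)

# Same definition, different environment dict per deploy target
chain = "api_key: $OPENAI_API_KEY\nwebhook: ${SLACK_WEBHOOK}"
resolved = resolve_secrets(chain, {
    "OPENAI_API_KEY": "sk-dev-xxx",
    "SLACK_WEBHOOK": "https://hooks.slack.com/dev-channel",
})
```

Failing fast on a missing secret is the point: a chain that starts with an empty API key fails later and more confusingly than one that refuses to start.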
Model selection
You might use cheaper models in dev and staging to reduce costs:
# dev: fast and cheap for iteration
model: gpt-5.4-mini
# staging: production model, lower rate limit
model: claude-sonnet
# production: production model, full rate limit
model: claude-sonnet
Be careful here. If your dev model is significantly less capable than your production model, you might miss prompt issues in dev that only surface in production. Our recommendation: use the production model in staging. Use whatever's fastest in dev for iteration, but validate on the production model before promoting.
Execution parameters
Timeouts, retry counts, and concurrency limits should be stricter in production:
# dev
timeout: 300s
retries: 0
concurrency: 1
# staging
timeout: 120s
retries: 2
concurrency: 5
# production
timeout: 60s
retries: 3
concurrency: 20
Longer timeouts in dev let you debug without chains dying mid-execution. Zero retries in dev surface failures immediately instead of masking them. Higher concurrency in production handles real traffic volume.
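The effect of the retry setting can be illustrated with a small wrapper. This is a hypothetical sketch -- `run_with_policy` and `flaky` are illustrative names, not platform APIs -- showing why zero retries in dev and several in production are both the right call:

```python
def run_with_policy(step, *, retries: int, timeout_s: float):
    """Run a chain step under an environment's execution policy.
    Zero retries (dev) surfaces the first failure immediately;
    production absorbs transient failures up to the retry budget."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return step(timeout_s)
        except TimeoutError as exc:
            last_error = exc  # transient failure: retry if budget remains
    raise last_error

def flaky(fail_times: int):
    """Simulate an agent that times out a fixed number of times."""
    state = {"calls": 0}
    def step(timeout_s):
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TimeoutError("agent exceeded timeout")
        return "ok"
    return step
```

With the production policy (`retries=3`) two transient timeouts are absorbed and the run succeeds; with the dev policy (`retries=0`) the same step fails on the first attempt, which is exactly what you want while debugging.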
Data sources
Chains that read from databases, APIs, or file systems need to point at the right data source per environment. A chain that enriches customer records should read from the test customer database in staging, not production.
Notification targets
Where does the chain send its output? Slack channels, email lists, webhook endpoints -- all of these should differ per environment. A chain that posts daily reports should post to #staging-reports in staging, not #product-reports.
The migration checklist
Before promoting a chain from one environment to the next, verify:
Functional validation
[ ] Chain completes successfully end-to-end
[ ] All agents produce expected output format
[ ] Event handoffs are correct (event names match triggers)
[ ] Output quality meets acceptance criteria
[ ] Edge cases handled (empty input, malformed data, large payloads)
Configuration validation
[ ] All secrets are set in the target environment
[ ] Model selection is appropriate for the target environment
[ ] Timeouts and retry config match target environment requirements
[ ] Data sources point to the correct environment
[ ] Notification targets point to the correct channels
[ ] Cost limits are set (per-run and daily caps)
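The configuration items above are mechanical checks and worth automating before any human looks at the promotion. A minimal sketch (the `missing_config` helper and the `REQUIRED` list are assumptions for illustration):

```python
# Settings every environment must provide before a chain can run there
REQUIRED = ["OPENAI_API_KEY", "SLACK_WEBHOOK", "DATABASE_URL"]

def missing_config(env: dict, required=REQUIRED) -> list:
    """Return the names of required settings that are absent or empty
    in the target environment -- a non-empty result blocks promotion."""
    return [name for name in required if not env.get(name)]
```

Running this against the target environment's config turns "did we remember the staging webhook?" into a yes/no answer instead of a production surprise.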
Infrastructure validation
[ ] Target environment has sufficient compute resources
[ ] Network access to required APIs is available
[ ] File system permissions are correct
[ ] Workspace type (local/Docker/SSH) is available
[ ] Log collection is configured and working
Rollback plan
[ ] Previous chain version is tagged and recoverable
[ ] Rollback procedure is documented
[ ] Rollback can be executed in < 5 minutes
[ ] Data produced by the new version can be identified and isolated
Implementing validation gates
A validation gate is an automated check that blocks promotion if the chain doesn't meet criteria. Three types:
Smoke test gate
Run the chain once with a known input and verify the output matches expectations. This catches configuration errors (wrong API keys, missing environment variables) and basic chain logic failures.
# Run chain with test input
mentiko run content-pipeline --input test-fixtures/sample.json
# Verify output exists and is valid
test -f output/report.md || exit 1
mentiko validate output/report.md --schema report-schema.json
Automate this as a promotion step: the chain only moves to staging if the smoke test passes in dev. It only moves to production if the smoke test passes in staging.
Quality gate
Run the chain on a set of representative inputs and evaluate output quality. This catches prompt regressions -- the chain runs but the output quality has degraded.
Quality evaluation can be automated with a reviewer agent:
name: quality-gate-reviewer
prompt: |
  Review this agent chain output against the quality criteria.
  Score each criterion 1-5:
  - Completeness: are all required sections present?
  - Accuracy: are facts and references correct?
  - Formatting: does the output match the expected format?
  - Relevance: is the content relevant to the input?
  Minimum passing score: 4.0 average
If the quality gate fails, the promotion is blocked. The PM or chain owner reviews the output and decides whether to fix the chain or adjust the quality criteria.
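The pass/fail decision on the reviewer's scores is trivial to automate. A sketch, assuming the reviewer emits a numeric score per criterion (`quality_gate` is an illustrative name, not a platform API):

```python
def quality_gate(scores: dict, minimum: float = 4.0) -> bool:
    """Average the reviewer's 1-5 criterion scores and block
    promotion when the average falls below the passing bar."""
    average = sum(scores.values()) / len(scores)
    return average >= minimum

# (5 + 4 + 4 + 4) / 4 = 4.25, which clears the 4.0 bar
passed = quality_gate({"completeness": 5, "accuracy": 4,
                       "formatting": 4, "relevance": 4})
```

Keeping the threshold in one place also makes "adjust the quality criteria" an auditable config change rather than a judgment call buried in a script.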
Cost gate
Run the chain and verify the per-run cost is within acceptable bounds. A chain that costs $0.12 per run in dev should cost roughly the same in staging (assuming the same model). If it costs $2.00 per run, something is wrong -- probably a prompt that's much longer in the staging environment, or an agent stuck in a retry loop.
# Check last run cost
COST=$(mentiko runs last --format json | jq '.cost')
MAX_COST=0.50
if (( $(echo "$COST > $MAX_COST" | bc -l) )); then
  echo "Cost gate failed: $COST exceeds $MAX_COST"
  exit 1
fi
Environment parity with configuration overrides
The chain definition should be identical across environments. Only the configuration changes. This is the same principle as 12-factor app configuration -- the artifact is immutable, the environment provides the config.
Structure your chain project like this:
chains/
  content-pipeline/
    chain.yaml            # chain definition (same everywhere)
    agents/
      researcher.yaml     # agent configs (same everywhere)
      writer.yaml
    config/
      dev.env             # environment-specific config
      staging.env
      production.env
    test-fixtures/
      sample-input.json   # smoke test data
      expected-output.json
The chain definition references variables:
name: content-pipeline
agents:
  - name: researcher
    model: ${RESEARCHER_MODEL}
    timeout: ${AGENT_TIMEOUT}
  - name: writer
    model: ${WRITER_MODEL}
    timeout: ${AGENT_TIMEOUT}
Each environment file provides the values:
# production.env
RESEARCHER_MODEL=claude-haiku
WRITER_MODEL=claude-sonnet
AGENT_TIMEOUT=60
This pattern gives you environment parity with explicit, auditable differences. You can diff staging.env against production.env and see exactly what changes during promotion.
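The render step -- env file in, concrete chain definition out -- can be sketched in a few lines. These helpers (`load_env`, `render`) are illustrative, not Mentiko's actual loader:

```python
import re

def load_env(text: str) -> dict:
    """Parse a KEY=VALUE .env file, ignoring blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def render(definition: str, env: dict) -> str:
    """Substitute ${VAR} placeholders so the same chain.yaml can be
    rendered against dev.env, staging.env, or production.env."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env[m.group(1)], definition)

production = load_env("RESEARCHER_MODEL=claude-haiku\nAGENT_TIMEOUT=60")
rendered = render("model: ${RESEARCHER_MODEL}\ntimeout: ${AGENT_TIMEOUT}", production)
```

Because the definition never changes, a promotion is literally "same template, different env file" -- and the env-file diff is the complete change log.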
Rollback strategies
When a production chain misbehaves, you need to roll back fast. Three approaches:
Version tagging
Tag every chain version promoted to production. When you need to roll back, deploy the previous tag.
# Promote v2.3 to production
mentiko deploy content-pipeline --version v2.3 --env production
# Something goes wrong, roll back to v2.2
mentiko deploy content-pipeline --version v2.2 --env production
Blue-green chains
Run the new version alongside the old version. Route a percentage of traffic to the new version. If it performs well, shift all traffic. If not, shift back.
This works for chains that process incoming requests. It doesn't work as well for scheduled chains -- you'd end up with duplicate runs.
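Percentage routing is usually done by hashing a stable request identifier, so the split is deterministic: the same request always lands on the same version, which keeps runs comparable. A sketch (the chain names follow the earlier examples; `route` is an assumed helper):

```python
import hashlib

def route(request_id: str, green_percent: int) -> str:
    """Deterministically send green_percent of traffic to the new
    ("green") chain version; the rest stays on the old ("blue") one."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "content-pipeline-v2" if bucket < green_percent else "content-pipeline-v1"
```

Shifting traffic is then just raising `green_percent` from 10 to 50 to 100; rolling back is setting it to 0.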
Feature flags
Use a feature flag to switch between the old and new chain logic. The flag is environment-specific. Flip it to enable the new version, flip it back to disable.
chain: ${USE_NEW_PIPELINE:content-pipeline-v2:content-pipeline-v1}
This is the safest approach because rollback is instant (flag flip) and doesn't require redeployment.
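The ternary above can be sketched as ordinary code (illustrative; `select_chain` is not a platform API):

```python
def select_chain(flags: dict) -> str:
    """Mirror the ${USE_NEW_PIPELINE:...} ternary: the flag value picks
    the chain version, so rollback is just flipping the flag back."""
    return "content-pipeline-v2" if flags.get("USE_NEW_PIPELINE") else "content-pipeline-v1"
```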
The promotion workflow
Putting it all together, a chain promotion from dev to production looks like this:
1. Chain is developed and tested in dev
2. Developer triggers promotion to staging
3. Smoke test runs automatically in staging
4. Quality gate evaluates output against criteria
5. Cost gate verifies per-run cost is within bounds
6. If all gates pass, chain is eligible for production
7. Chain owner approves the promotion
8. Chain is deployed to production with the previous version tagged for rollback
9. Production monitoring watches for anomalies in the first 24 hours
10. If anomalies are detected, automatic rollback to the previous version
Steps 3-6 are automated. Step 7 is human approval. Steps 9-10 are automated monitoring with human notification.
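The automated portion (steps 3-6) amounts to running the gates in order and stopping at the first failure. A sketch with stand-in gate functions (the real checks would be the smoke, quality, and cost gates above):

```python
def promote(chain: str, gates: list) -> tuple:
    """Run each validation gate in order; the first failure blocks
    promotion and is reported, so the owner knows which gate to fix."""
    for name, gate in gates:
        if not gate(chain):
            return (False, name)
    return (True, None)

# Hypothetical gates standing in for the real smoke/quality/cost checks
gates = [("smoke", lambda c: True),
         ("quality", lambda c: True),
         ("cost", lambda c: False)]
ok, blocked_by = promote("content-pipeline", gates)
```

Reporting which gate blocked the promotion matters in practice: a smoke failure means broken config, a quality failure means a prompt regression, a cost failure means a runaway loop.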
This process adds maybe 30 minutes to a chain deployment compared to "just push it to prod." Those 30 minutes have caught, in staging, every issue that would otherwise have become a production incident.
New to agent chains? Start with your first chain in five minutes or learn the 5 chain patterns.