Capacity Planning for Agent Chains: How Much Compute Do You Need?
Mentiko Team
You deployed your first agent chain and it works. Then you deployed ten more. Then you put them on schedules. Then marketing decided every inbound lead should trigger a chain, and suddenly your server is at 98% memory and chains are queuing for 20 minutes. You didn't plan capacity because you didn't know what to plan for.
Agent chain workloads are unlike traditional web applications. A web server handles thousands of short-lived, stateless requests. An agent chain is a long-lived, stateful process that may run for minutes, spawn subprocesses, hold files open, and make multiple external API calls. Sizing infrastructure for agent chains requires thinking about different bottlenecks than you're used to.
Understanding the Resource Profile
A single agent chain run consumes four types of resources:
CPU: Minimal during most of the run. Agents spend most of their time waiting for API responses from model providers. CPU spikes happen during prompt construction, response parsing, and file I/O. Unless you're running local models, CPU is rarely your bottleneck.
Memory: This is the primary constraint for most deployments. Each running chain maintains its execution context: the chain definition, event history, agent outputs, and workspace files. A simple three-agent chain might use 50-100MB. A chain with large file-based handoffs or shared state directories can use 500MB-1GB per run.
Disk I/O: Mentiko's file-based event system means every agent writes to disk. For typical workloads (JSON documents, text content), this is negligible on SSDs. For chains that process large datasets or generate substantial file outputs, disk throughput matters.
Network: Agent chains are network-heavy. Every agent that calls an LLM provider sends a prompt and receives a response over HTTPS. If you're running 20 concurrent chains with 5 agents each, that's potentially 100 concurrent outbound connections. Your network bandwidth is usually fine, but connection limits and DNS resolution can become bottlenecks.
Measuring Your Baseline
Before you plan capacity, measure what you have. Run your chains and record:
{
  "chain": "lead-qualification",
  "agents": 4,
  "avg_duration_seconds": 87,
  "peak_memory_mb": 210,
  "disk_written_mb": 12,
  "api_calls": 6,
  "avg_api_latency_ms": 2400
}
Measure at least 20 runs per chain to get stable averages. Capture peak memory, not average -- a chain that averages 150MB but peaks at 400MB during the synthesis agent will OOM at a lower concurrency than you'd expect from the average.
Run this measurement under realistic conditions. A chain that processes a 50-row test CSV will use a fraction of the memory it needs for the 50,000-row production dataset. Use production-sized inputs for capacity planning.
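To make the peak-vs-average distinction concrete, here's a minimal sketch (our own helper, not a Mentiko API) that aggregates recorded run measurements into the numbers you'll plan against:

```python
import statistics

def summarize_runs(runs, percentile=0.95):
    """Aggregate per-run measurements into capacity-planning numbers.
    Peak memory (not the average) is what sizing must be based on."""
    if len(runs) < 20:
        raise ValueError("measure at least 20 runs for stable numbers")
    mems = sorted(r["peak_memory_mb"] for r in runs)
    durs = sorted(r["duration_seconds"] for r in runs)
    idx = min(len(runs) - 1, int(percentile * len(runs)))
    return {
        "avg_duration_seconds": statistics.mean(durs),
        "p95_duration_seconds": durs[idx],
        "p95_peak_memory_mb": mems[idx],
        "max_peak_memory_mb": mems[-1],  # worst case observed; size for this
    }
```

Feed it one record per run and use the p95/max memory figures, not the mean, in the concurrency math that follows.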
The Concurrency Equation
The core question: how many chains can you run simultaneously?
Start with memory as the binding constraint. If your server has 8GB of available memory (total minus OS, minus other services) and your chains average 200MB peak memory:
max_concurrent = available_memory / peak_memory_per_chain
max_concurrent = 8000MB / 200MB = 40 concurrent chains
But this is the theoretical maximum. Apply a 70% utilization target to leave headroom for memory fragmentation, OS caches, and unexpected spikes:
safe_concurrent = max_concurrent * 0.70
safe_concurrent = 40 * 0.70 = 28 concurrent chains
Now cross-check against your other resources. If each chain makes 6 API calls and your model provider allows 60 requests per minute, you can sustain at most 10 chains starting per minute before you hit rate limits. If your chains run for 90 seconds on average, the steady-state concurrency at 10 starts/minute is 15 concurrent chains -- below your memory limit of 28, so the API rate limit is actually your binding constraint.
steady_state = starts_per_minute * avg_duration_minutes
steady_state = 10 * 1.5 = 15 concurrent chains
Map this to your actual demand. If you expect 200 chain triggers per hour during peak, that's ~3.3 per minute. At 90 seconds average duration, steady-state concurrency is 5 chains. You have ample headroom.
If you expect 2,000 triggers per hour during peak, that's 33 per minute. Steady-state concurrency is 50 chains. You're over both your memory limit and your API rate limit. Time to scale.
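The memory and rate-limit arithmetic above packages neatly into a small planner. A hypothetical sketch; the function and parameter names are our own:

```python
def plan_concurrency(available_mb, peak_mb_per_chain,
                     provider_rpm, calls_per_run,
                     avg_duration_s, utilization=0.70):
    """Return (binding concurrency limit, binding resource)."""
    # Memory ceiling, derated to the 70% utilization target.
    mem_limit = int(available_mb / peak_mb_per_chain * utilization)
    # API ceiling: sustainable starts/minute times average duration.
    starts_per_min = provider_rpm / calls_per_run
    api_limit = starts_per_min * (avg_duration_s / 60.0)
    if api_limit < mem_limit:
        return api_limit, "api_rate_limit"
    return mem_limit, "memory"
```

With the numbers from the example (8GB, 200MB/chain, 60 requests/minute, 6 calls/run, 90s runs) it returns 15 concurrent chains bound by the API rate limit, matching the hand calculation.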
Scaling Up vs Scaling Out
Scaling up means a bigger server. Scaling out means more servers.
Scale up when: your bottleneck is memory on a single machine, your chains share state that's hard to distribute, or your concurrency needs are under ~100 chains. Going from 8GB to 32GB is the simplest solution and often the cheapest. A 32GB server costs a fraction of the engineering time to build a distributed orchestration layer.
Server sizing tiers:
Starter: 4 CPU, 8GB RAM -> ~20 concurrent chains
Growth: 8 CPU, 32GB RAM -> ~80 concurrent chains
Scale: 16 CPU, 64GB RAM -> ~160 concurrent chains
Scale out when: you need more than ~200 concurrent chains, you need geographic distribution, or you need fault tolerance (one server dying shouldn't stop all chain execution). Scaling out requires a coordinator that distributes chain runs across workers and routes events between them.
{
  "cluster": {
    "workers": 4,
    "coordinator": "mentiko-coordinator:8080",
    "assignment_strategy": "least-loaded",
    "health_check_interval_ms": 5000
  }
}
The coordinator assigns incoming chain runs to the worker with the most available capacity. If a worker goes down, its in-progress chains are reassigned to other workers (assuming your chains are idempotent -- and they should be).
Mentiko's file-based event system makes scaling out more straightforward than message-queue-based systems. Each worker needs access to the shared filesystem (NFS, EFS, or similar). Events are files. Workers read files from a shared directory. No message broker to scale, no queue depth to manage.
Memory Optimization
If memory is your bottleneck, optimize before throwing hardware at it.
Stream large files instead of loading them entirely into memory. If an agent needs to process a 200MB CSV, don't instruct it to "read the entire file." Have it process in chunks:
{
  "name": "chunk-processor",
  "prompt": "Process the next 1000 records from the input file. Track your position.",
  "triggers": ["chain:start", "chunk:processed"],
  "emits": ["chunk:processed", "processing:complete"],
  "max_iterations": 50,
  "memory_hint": {
    "max_input_mb": 10,
    "streaming": true
  }
}
Clean up intermediate files aggressively. If your chain has 8 agents and each writes a 10MB output file, that's 80MB of disk that's also cached in memory. Configure earlier agents to clean up their output files once the downstream agent has consumed them:
{
  "name": "transformer",
  "input": "raw_data.json",
  "output": "transformed_data.json",
  "cleanup_input": true
}
With cleanup_input: true, raw_data.json is deleted once the transformer has read it -- the upstream extractor's output is no longer needed after the transformer is done. If you need to preserve intermediate files for debugging, do it in a staging environment, not production.
API Rate Limit Management
Model provider rate limits are the most common bottleneck in production agent deployments. Every major provider has them, and they vary by tier.
Map your chain portfolio to your rate budget:
Provider rate limit: 300 requests/minute (tier 2)
Chain portfolio:
lead-qualification: 6 calls/run, 200 runs/hour = 1200 calls/hour
content-generation: 4 calls/run, 50 runs/hour = 200 calls/hour
support-triage: 3 calls/run, 150 runs/hour = 450 calls/hour
Total: 1850 calls/hour
Per minute: 30.8 calls/minute
At 31 calls per minute against a 300/minute limit, you're at 10% utilization. Plenty of headroom. But rate limits aren't just per-minute -- they're often per-second as well. If all 200 lead-qualification runs trigger at the top of the hour (batch import), you'd spike to 1200 API calls in a burst. You need request queuing with rate limiting:
{
  "rate_limiting": {
    "provider": "openai",
    "max_requests_per_minute": 250,
    "max_requests_per_second": 10,
    "queue_strategy": "fifo",
    "max_queue_depth": 500,
    "queue_timeout_ms": 120000
  }
}
This caps outbound requests to 10/second and 250/minute (leaving 50/minute headroom for the provider limit). Excess requests queue FIFO and time out after 2 minutes. If the queue depth exceeds 500, new requests are rejected immediately rather than queuing indefinitely.
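If you're implementing the gate yourself, a sliding-window limiter covers both caps. A simplified sketch (our own class, with queuing and timeouts omitted); a caller that gets False back should queue the request:

```python
import collections

class RequestGate:
    """Sliding-window limiter enforcing per-second and per-minute caps."""
    def __init__(self, per_second, per_minute):
        self.per_second = per_second
        self.per_minute = per_minute
        self.sent = collections.deque()  # timestamps of recent sends

    def try_acquire(self, now):
        """Record a send and return True, or return False if either
        cap is hit and the request should be queued instead."""
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()          # drop sends outside the minute window
        in_last_second = sum(1 for t in self.sent if now - t < 1)
        if in_last_second >= self.per_second or len(self.sent) >= self.per_minute:
            return False
        self.sent.append(now)
        return True
```

Passing the clock in explicitly (rather than calling time.monotonic() inside) keeps the gate testable; production code would wrap it in a loop that sleeps and retries until the queue timeout expires.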
Schedule Stagger
If you have 15 chains on hourly schedules and they all run at :00, you get a concurrency spike every hour and idle capacity the rest of the time. Stagger your schedules:
Before:
chain-a: 0 * * * * (every hour at :00)
chain-b: 0 * * * * (every hour at :00)
chain-c: 0 * * * * (every hour at :00)
After:
chain-a: 0 * * * * (every hour at :00)
chain-b: 20 * * * * (every hour at :20)
chain-c: 40 * * * * (every hour at :40)
This reduces peak concurrency by 3x without changing the total throughput. It's the single easiest capacity optimization and it costs nothing.
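Computing the offsets is mechanical. A tiny helper (our own, not a Mentiko feature) that spreads hourly chains evenly across the hour:

```python
def staggered_schedules(chains):
    """Assign each hourly chain an evenly spaced minute offset
    instead of letting them all fire at :00."""
    n = len(chains)
    return {name: f"{i * 60 // n} * * * *" for i, name in enumerate(chains)}
```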
For event-triggered chains (webhooks, file watchers), you can't stagger the triggers. Use the semaphore pattern instead: limit how many instances of a chain can run concurrently, and queue the rest.
Monitoring for Capacity Decisions
You need three metrics to make capacity decisions:
Queue depth: how many chain runs are waiting to start. If this number is consistently above zero, you need more capacity. If it spikes periodically, you need to stagger schedules or add burst capacity.
Chain duration: how long runs take from start to finish. If this is trending up, either your agents are doing more work (expected growth) or you're hitting resource contention (infrastructure problem). Compare agent-level timings to distinguish between the two.
Memory utilization: track the 95th percentile, not the average. If p95 memory hits 85% of your server capacity, you're one unusual chain run away from OOM. Scale before you get there.
{
  "monitoring": {
    "metrics": [
      {
        "name": "chain_queue_depth",
        "alert_threshold": 10,
        "window_minutes": 5
      },
      {
        "name": "chain_duration_p95",
        "alert_threshold_seconds": 300,
        "window_minutes": 60
      },
      {
        "name": "memory_utilization_p95",
        "alert_threshold_percent": 85,
        "window_minutes": 15
      }
    ]
  }
}
Set alerts on all three. When queue depth alerts fire, you need more capacity or better scheduling. When duration alerts fire, investigate which agents are slowing down and why. When memory alerts fire, optimize your chains' memory usage or scale up.
Planning for Growth
Your capacity plan should project 3-6 months ahead. Use the formula:
required_capacity = current_peak * growth_rate * safety_margin
If your current peak is 30 concurrent chains, you're growing chains by 20% per month, and you want a 1.5x safety margin:
3 months: 30 * 1.2^3 * 1.5 ≈ 78 concurrent chains
6 months: 30 * 1.2^6 * 1.5 = 134 concurrent chains
If your current server handles 80 concurrent chains, you're fine for 3 months and need to scale before month 6. Put the scaling work on the roadmap now, not when you're at 95% capacity and chains are failing.
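The projection is easy to script. A sketch of the formula above plus a runway check (the function names are our own):

```python
def required_capacity(current_peak, monthly_growth, months, safety_margin=1.5):
    """required_capacity = current_peak * growth^months * safety_margin."""
    return current_peak * (1 + monthly_growth) ** months * safety_margin

def months_of_runway(current_peak, monthly_growth, capacity, safety_margin=1.5):
    """First month at which the projection outgrows your capacity."""
    month = 0
    while required_capacity(current_peak, monthly_growth,
                            month, safety_margin) <= capacity:
        month += 1
    return month
```

Rerun it whenever the measured growth rate changes; the runway number is what goes on the roadmap.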
Start with one well-sized server. Measure everything. Optimize before scaling. When you do scale, scale out to 2-3 workers before investing in a full orchestration layer. Most teams never need more than that.
Check the getting started guide to deploy your first chain, or read about monitoring agents in production for the observability side of capacity management.