Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter.
AM-149 · pub 12 May 2026 · rev 12 May 2026 · read 10 min · in Understanding AI

The agent fan-out problem: when one prompt becomes 400 LLM calls

Production agentic systems amplify a single user request into dozens or hundreds of internal LLM calls. Most enterprise unit-economics, latency budgets, and observability setups are still priced for 1:1.

Holding · reviewed 12 May 2026 · next +41d

Production agentic systems do not answer questions the way a direct API call does. Between the user prompt and the visible response, a typical agent issues a planning pass, one or more tool-use roundtrips, retrieval and re-ranking calls, validation steps, and retry logic on anything that fails. The visible output is a single response. The underlying call graph may be dozens of LLM calls deep.

That gap between the visible model and the operational reality is the fan-out problem. It is not a bug in any specific framework. It is a structural property of how production agents work, and it has direct consequences for cost, latency, and observability that most enterprise AI programmes have not yet priced in.

The mental model most charters assume

The standard enterprise framing for AI cost and capacity planning treats the interaction as a single call: a user sends a prompt, the model returns a response, the cost is the token count times the provider rate. This model is accurate for direct-call deployments, where an application wraps a single completions request and presents the result. It is the model that underpins most enterprise AI procurement worksheets built through 2024 and into early 2025.

The shift to agentic architecture breaks this model at the foundation. The issue is not that agents occasionally make extra calls. The issue is that the internal call volume is structurally decoupled from the user-facing request volume, and the decoupling is large enough to change the order of magnitude of the cost, latency, and observability problem.

What production fan-out ratios actually look like

Across publicly documented deployments, the ratio of user-facing requests to internal LLM calls typically sits between 1:18 and 1:60, with tail deployments regularly exceeding 1:400.

The lower end of that band comes from relatively constrained agents: single-step retrieval-augmented generation with one re-ranking pass, or a code-generation agent with a single validation step. The upper end comes from research agents, multi-agent delegation systems, and complex orchestration pipelines where the planning loop runs many iterations before converging.

The Anthropic Building Effective Agents post, published in December 2024, identified call-graph depth as a primary cost variable and explicitly called out multi-step agentic workflows as the category where cost surprises are most common for teams migrating from direct-call architectures. Anthropic’s own published trace examples for production-style research tasks show planning-plus-tool-use call chains in the range of 15–40 calls for moderately complex queries.

The Microsoft Magentic-One multi-agent system, described in arXiv 2411.04468, documents the coordination overhead in a five-agent orchestration: the Orchestrator agent alone issues repeated planning calls between each sub-agent delegation, and the total call count for a single task submission falls in the 20–80 range across the evaluation tasks, depending on task complexity.

The LangSmith case study documentation on multi-step research agents, which covers traces from real production deployments shared by LangChain customers, reports p95 end-to-end latencies of 47–92 seconds for research-type queries. The latency is driven primarily by sequential tool-use chains that cannot be parallelised because each step depends on the output of the previous one.

The 1:400 tail case is not exotic. It represents a ReAct agent that has hit no hard call limit, is working on a complex multi-part task, and has entered a loop where each sub-goal requires its own planning pass plus several tool invocations. The OpenAI Agents SDK documentation explicitly warns that unbounded max_turns settings on Runner.run() create this profile, and the fix is an explicit hard turn limit.
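A hard call limit of the kind described above can be sketched without reference to any particular framework. The example below is a minimal illustration, not the OpenAI Agents SDK API: `CallBudget`, `CallBudgetExceeded`, `run_agent_loop`, and the `step` callable are all hypothetical names, and the limit of 40 is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CallBudgetExceeded(RuntimeError):
    """Raised when a session tries to exceed its hard LLM-call limit."""


@dataclass
class CallBudget:
    max_calls: int
    used: int = 0

    def charge(self) -> None:
        # Check before counting so `used` never exceeds `max_calls`.
        if self.used >= self.max_calls:
            raise CallBudgetExceeded(
                f"hard limit of {self.max_calls} LLM calls reached"
            )
        self.used += 1


def run_agent_loop(step: Callable[[], Optional[str]], budget: CallBudget) -> str:
    """Call `step` until it returns a final answer or the budget runs out.

    `step` is a hypothetical stand-in for one LLM call: it returns the final
    answer as a string, or None when the agent wants another turn.
    """
    while True:
        budget.charge()
        answer = step()
        if answer is not None:
            return answer


def never_converges() -> Optional[str]:
    # Stand-in for an agent stuck in a planning loop.
    return None


budget = CallBudget(max_calls=40)
try:
    run_agent_loop(never_converges, budget)
    outcome = "converged"
except CallBudgetExceeded:
    outcome = "stopped"

print(outcome, budget.used)
```

The design choice worth noting is that the budget is charged before each call rather than checked afterwards, so a runaway session stops at the limit instead of one call past it.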

Where the calls go

A useful way to break down a fan-out trace is by call type. In a typical research or analysis agent running a ReAct loop, the internal calls fall into six categories.

Planning calls. The agent issues an initial reasoning pass to decompose the user request into sub-tasks. For complex queries, this loop runs multiple times as the agent re-evaluates progress.

Tool-use calls. Each tool invocation requires at minimum one LLM call to generate the tool arguments, and often a second to interpret the tool’s output. A web-search tool that returns five URLs requires a follow-up call to decide which to read.

Retrieval re-ranking. Agents that pull from a vector store typically retrieve more context than they need and re-rank with a second LLM call. A retrieval step that pulls ten chunks and selects three may cost two calls: the retrieval embed and the re-ranking pass.

Validator chains. Quality-checking steps, fact-verification passes, and output-format validators each add calls. A code-generation agent that checks its own output against a test suite typically runs the check loop two to four times before declaring success.

Retry logic. Tool calls that return errors, network timeouts, or malformed outputs are retried automatically in most agent frameworks. Each retry is a new LLM call. A 20% tool-call failure rate, which is within the normal range for agents calling external APIs under rate limits, can add 20–40% to the total call count on a busy day.

Multi-agent delegation. When an orchestrator delegates to a sub-agent, the sub-agent issues its own call sequence before returning. In Magentic-One and equivalent systems, the orchestrator also issues planning and synthesis calls around the delegation. The total count for a delegated sub-task is the sum of both agents’ call sequences.
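The retry arithmetic in the list above can be made explicit. The sketch below assumes that each attempt fails independently with the same probability and that retries stop after a fixed cap; `retry_overhead` and its parameters are illustrative names, and the overhead is expressed relative to the calls that can fail rather than to every call in the session.

```python
def retry_overhead(failure_rate: float, max_retries: int,
                   calls_per_retry: int = 1) -> float:
    """Expected extra LLM calls per original tool call, as a fraction.

    Assumes each attempt fails independently with probability `failure_rate`
    and that every retry costs `calls_per_retry` LLM calls.
    """
    # Retry k only happens if the first k attempts all failed: p ** k.
    expected_retries = sum(
        failure_rate ** k for k in range(1, max_retries + 1)
    )
    return expected_retries * calls_per_retry


# 20% failure rate, one retry allowed, one LLM call per retry.
single_call_retry = retry_overhead(0.2, max_retries=1)

# Same failure rate, but each retry needs an argument-generation call
# plus a result-interpretation call.
two_call_retry = retry_overhead(0.2, max_retries=1, calls_per_retry=2)

# Allowing up to three retries adds little, because repeated failures are rare.
three_retries = retry_overhead(0.2, max_retries=3)

print(f"{single_call_retry:.0%} {two_call_retry:.0%} {three_retries:.1%}")
```

Under these assumptions a 20% failure rate adds about 20% when a retry costs one call and about 40% when it costs two, which is one way to arrive at a band of that width. Correlated failures, such as a rate-limited API rejecting several calls in a row, would push the figure higher.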

To make this concrete, consider a research agent asked to summarise the current state of enterprise AI governance. It might run one planning call to decompose the query, four web searches (each with a tool-argument generation call and a result-interpretation call, so eight calls in total), two re-ranking calls across the retrieved context, one synthesis call to draft the response, one validation call to check citation format, and one final formatting call. That is fourteen calls for a query that returns a single paragraph. Scale this to a multi-part report request with five sub-sections and the count is in the 60–90 range before any retries.
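The tally in the worked example can be written out as a short calculation. The category names are labels chosen for this sketch, and the five-section estimate assumes, as a simplification, that every sub-section repeats the full single-query pattern.

```python
# LLM calls for the single-paragraph research query described above.
research_agent_calls = {
    "planning": 1,
    # Four searches, each with an argument-generation call
    # and a result-interpretation call.
    "web_search": 4 * 2,
    "re_ranking": 2,
    "synthesis": 1,
    "citation_validation": 1,
    "formatting": 1,
}

total_calls = sum(research_agent_calls.values())

# Naive scaling to a five-section report: repeat the pattern per section.
sections = 5
report_calls = sections * total_calls

print(total_calls, report_calls)
```

The naive product lands inside the 60–90 range quoted above. Real pipelines will move within that range depending on how much planning and formatting is shared across sections.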

Three things that this under-provisions

The fan-out gap matters because it undermines three operational assumptions most enterprises are currently making.

Unit-economics. The standard enterprise token-budget worksheet multiplies expected monthly request volume by an average cost-per-request. If the average cost-per-request was estimated using a direct-call model at 1,500 tokens per response, and the actual agentic system fans out to 32 internal calls averaging 800 tokens each, the actual cost basis is 17x the original estimate before any volume growth. The Helicone blog’s analysis of cost-per-task across agentic and non-agentic deployments documents this gap directly: teams that built cost models against direct-call baselines and then moved to agentic architectures consistently underestimated cost by 10–50x on complex task categories.
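The 17x figure follows directly from the token counts in the paragraph above. In the sketch below, the function name, the monthly request volume, and the price per million tokens are hypothetical placeholders used only to show the shape of the calculation; substitute real volumes and provider rates.

```python
def cost_multiple(direct_tokens: int, calls: int, tokens_per_call: int) -> float:
    """Ratio of agentic token volume to the direct-call estimate."""
    return (calls * tokens_per_call) / direct_tokens


# Figures from the text: 1,500 tokens per direct response,
# versus 32 internal calls averaging 800 tokens each.
multiple = cost_multiple(direct_tokens=1_500, calls=32, tokens_per_call=800)

# Hypothetical worksheet inputs, for illustration only.
monthly_requests = 100_000
usd_per_million_tokens = 5.0

direct_monthly_usd = monthly_requests * 1_500 / 1e6 * usd_per_million_tokens
agentic_monthly_usd = monthly_requests * 32 * 800 / 1e6 * usd_per_million_tokens

print(f"{multiple:.1f}x  ${direct_monthly_usd:,.0f} -> ${agentic_monthly_usd:,.0f}")
```

The multiple does not depend on the price or the volume, which is why a worksheet built on a direct-call baseline is off by the same factor at any scale.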

Latency budgets. Latency compounds through sequential chains. A single LLM call with a 2-second p95 can turn into a 40-second response in a 20-step sequential chain when every step lands at its tail, and long chains make tail hits routine: with independent steps, the chance that at least one of 20 exceeds its own p95 is about 64%. The Vercel AI SDK observability documentation for streaming and tracing frames call-graph depth as the primary driver of end-to-end response-time variance, and recommends span-level tracing as the only way to distinguish user-perceived latency from model latency in multi-step flows. SLAs written against a 5-second p95 target for direct API calls will fail routinely in agentic deployments without explicit parallelisation and call-budget limits.
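The effect of parallelisation on the latency budget can be sketched with two small helpers. This is an illustration under simplifying assumptions: every call takes exactly its 2-second tail latency, and the grouping into stages of four independent calls is hypothetical. Chains where each step depends on the previous one cannot be regrouped this way.

```python
from typing import List


def sequential_latency(step_latencies_s: List[float]) -> float:
    """End-to-end latency when every call waits for the previous one."""
    return sum(step_latencies_s)


def staged_latency(stages: List[List[float]]) -> float:
    """End-to-end latency when calls within a stage run concurrently.

    Stages still run one after another, so each stage costs as much
    as its slowest call.
    """
    return sum(max(stage) for stage in stages)


# Worst case from the text: 20 sequential steps, each at a 2-second tail.
worst_case_s = sequential_latency([2.0] * 20)

# The same 20 calls, if they could be grouped into 5 stages of 4.
staged_s = staged_latency([[2.0] * 4 for _ in range(5)])

print(worst_case_s, staged_s)
```

The gap between the two numbers is the most that restructuring the call graph can recover; a call budget is what bounds the number of stages in the first place.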

Observability scope. Per-request traces, the default in most logging and monitoring setups, capture the user-facing request but not the internal call graph. When a support agent takes 90 seconds and the user complains, the top-level trace shows a 90-second call. The internal trace shows fourteen calls, of which two were retrieval re-ranking, four were validator retries, and one was a planning loop that ran six iterations before converging. The actionable information is in the call graph; the per-request trace hides it. LangSmith, Helicone, and the Vercel AI SDK’s telemetry layer all support full call-graph tracing, but it is not the default configuration in any of them.
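What call-graph tracing adds over a per-request trace can be shown with a minimal span model. The `Span` fields and helper names below are assumptions made for this sketch rather than the schema of LangSmith, Helicone, or the Vercel AI SDK, and the trace data is made up, loosely following the 90-second example with each planning iteration counted as its own call.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the user-facing root request
    kind: str                 # e.g. "planning", "re_ranking", "validator_retry"
    duration_s: float


def descendants(spans: List[Span], root_id: str) -> List[Span]:
    """Return every span below `root_id`, at any depth."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    found: List[Span] = []
    stack = [root_id]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child.span_id)
    return found


def summarise(spans: List[Span], root_id: str) -> Counter:
    """Count internal calls by kind for one user-facing request."""
    return Counter(span.kind for span in descendants(spans, root_id))


# Illustrative trace: one 90-second request with 14 internal calls.
spans = [Span("req-1", None, "request", 90.0)]
spans += [Span(f"plan-{i}", "req-1", "planning", 5.0) for i in range(6)]
spans += [Span(f"rank-{i}", "plan-0", "re_ranking", 3.0) for i in range(2)]
spans += [Span(f"val-{i}", "req-1", "validator_retry", 8.0) for i in range(4)]
spans += [Span(f"tool-{i}", "plan-1", "tool_use", 4.0) for i in range(2)]

summary = summarise(spans, "req-1")
print(sum(summary.values()), dict(summary))
```

A per-request log would hold only the first span. The breakdown by kind is recoverable only because every child span carries its parent's identifier.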

How MTTD-for-Agents catches a fan-out runaway

The MTTD-for-Agents framework establishes detection-time targets and tripwires for production agentic deployments. One of the three core tripwires is tool-use-frequency Z-score: the number of tool calls per unit time for a given agent session, measured against the rolling baseline for that agent on that task type.

When an agent enters a fan-out runaway, its tool-use-frequency Z-score spikes before the cost or latency impact is visible in billing or user-facing metrics. An agent that normally completes a task in 18 tool calls and is now at call 80 with no sign of convergence will show a Z-score above +3 within the first few minutes of the deviation, typically 2–8 minutes before the session terminates or the budget alert fires.
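A Z-score tripwire of this kind can be sketched in a few lines. The version below is a simplification and not the MTTD-for-Agents implementation: it scores the tool-call count of a session against a baseline of completed sessions rather than calls per unit time, and the function name, baseline values, and +3 threshold variable are illustrative.

```python
from statistics import mean, stdev
from typing import Sequence


def tool_use_z_score(baseline_counts: Sequence[int], current_count: int) -> float:
    """Z-score of the current tool-call count against a baseline sample."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)  # sample standard deviation
    if sigma == 0:
        raise ValueError("baseline has no variance; Z-score is undefined")
    return (current_count - mu) / sigma


# Hypothetical baseline for one agent on one task type, centred on 18 calls.
baseline = [16, 18, 17, 19, 18, 20, 18, 18]
threshold = 3.0

z_normal = tool_use_z_score(baseline, 18)
z_runaway = tool_use_z_score(baseline, 80)

tripped = z_runaway > threshold
print(f"normal z={z_normal:.1f}, runaway z={z_runaway:.1f}, tripped={tripped}")
```

With a baseline this tight, a session would cross +3 well before call 80, which is why threshold calibration per agent and task type matters: a noisier baseline tolerates more variation before firing.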

That detection window matters. Against the 18-call baseline above, a fan-out runaway caught at call 80 costs roughly 4x a normal session (80/18 ≈ 4.4), while one caught at call 400 costs roughly 22x (400/18 ≈ 22.2), assuming cost scales linearly with call count. The tool-use-frequency tripwire is the mechanism that turns a potentially expensive outlier into a bounded cost event.

Full methodology, including the three-tripwire setup and the threshold calibration guidance for different agent architectures, is at the MTTD-for-Agents framework.

What we are not recommending

Three positions are worth pre-empting because they come up in vendor conversations around this topic.

We are not recommending against agentic architecture. Fan-out is a structural property of how agents produce value. A research agent that issues 40 calls to answer a complex query is doing work that a direct-call model cannot do in one call. The question is whether the cost and latency of those 40 calls is visible, bounded, and priced correctly, not whether the calls should happen.

We are not recommending a specific framework or observability vendor. LangSmith, Helicone, and the Vercel AI SDK’s built-in telemetry are cited here because they publish documentation and case study data that supports the claims in this piece. Other platforms support full call-graph tracing. The structural requirement is that the tracing captures every child call linked to its parent session, regardless of which tool produces it.

We are not treating 1:18 as the safe ceiling. The 1:18 floor of the observable band is where constrained, well-instrumented agents in production land. Teams should expect to be above 1:18 for most agentic task categories, and should design their cost models, latency SLAs, and observability setups for the task-type-specific ratio their system actually produces.
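Measuring the task-type-specific ratio is straightforward once internal call counts are recorded per user-facing request. The sketch below assumes a hypothetical log of `(task_type, internal_llm_calls)` pairs; the task-type names and the counts are invented for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def fan_out_by_task_type(
    records: Iterable[Tuple[str, int]],
) -> Dict[str, Dict[str, float]]:
    """Summarise internal LLM calls per user-facing request, by task type."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for task_type, internal_calls in records:
        grouped[task_type].append(internal_calls)
    return {
        task_type: {
            "requests": len(calls),
            "median_calls": median(calls),
            "max_calls": max(calls),
        }
        for task_type, calls in grouped.items()
    }


# Hypothetical request log: one entry per user-facing request.
records = [
    ("rag_lookup", 18), ("rag_lookup", 20), ("rag_lookup", 22),
    ("research", 40), ("research", 60), ("research", 400),
]

ratios = fan_out_by_task_type(records)
for task_type, stats in ratios.items():
    print(f"{task_type}: 1:{stats['median_calls']} median, "
          f"1:{stats['max_calls']} worst case")
```

Reporting the median alongside the maximum keeps the tail visible: a single runaway session barely moves the median but dominates the worst case, and it is the worst case that a cost ceiling has to cover.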

What changes this verdict

Three developments would revise the fan-out ratio band cited here before the next review:

A frontier model that internalises multi-step reasoning within a single API call, returning a final answer without surfacing the intermediate calls as billable events, would collapse the fan-out ratio at the billing layer without changing the underlying computation. This is directionally consistent with how extended thinking modes work in current frontier models, and the trajectory of the category points toward this outcome. The 60-day cadence on this piece reflects that the model-capability timeline is uncertain.

Provider-level call-graph pricing, where the fee is per task completion rather than per call, would change the unit-economics framing without changing the latency or observability implications.

A major observability platform publishing an industry-wide fan-out ratio study, currently absent from the literature, would sharpen the 1:18–1:60 band with more precise segment-level data. The current band is derived from individual deployment case studies and published framework documentation rather than a systematic cross-deployment measurement.

Status: Holding as of 12 May 2026. Next review: 11 Jul 2026.

The governance and detection layer for agentic deployments is at the 2026 governance playbook, and the detection-time framework that the tool-use-frequency tripwire sits inside is at the MTTD-for-Agents detection framework. The procurement read on single-model versus multi-model routing for agentic workloads, including the GPT-5.5 / Opus 4.7 cost-per-task comparison, is at /split-verdict-gpt55-opus47/. The baseline readiness diagnostic, which includes a fan-out instrumentation check, is at the agentic AI readiness diagnostic.
