The bimodal ROI distribution in enterprise agentic AI: why the high-performing cohort is structurally distinct
Enterprise agentic AI ROI is bimodal, not normally distributed. Stanford DEL, Gartner, McKinsey State of AI, and MIT NANDA data converge on the same shape: a small high-performing tail and a much larger struggling body. What separates the two is operational discipline, not model selection; the 73%/27% framing in the slug captures that pattern more cleanly than the retracted original body did.
Holding · reviewed 5 May 2026 · next review +58d
Bottom line: The bimodal ROI distribution in enterprise agentic AI is now visible in four independent datasets: Stanford Digital Economy Lab (12% clearing 300%+ ROI, 88% at or below break-even), Gartner Q1 2026 Infrastructure & Operations (28% of projects “fully paying off”), McKinsey State of AI 2025 (23% scaling, 17% EBIT-attribution), and MIT NANDA GenAI Divide (95% of pilots no measurable P&L impact, 67% buy vs 22% build success). The 73%/27% slug rounds the broader pattern. The procurement-relevant finding is that the variable separating the two cohorts is operational discipline, not model capability, and the gap is reproducible enough to instrument against.
If you run agentic AI procurement for a mid-market or enterprise organisation in 2026, you have probably absorbed some version of “AI is delivering ROI for the leaders.” The pitch is real, and there are deployments delivering on it. The harder question, and the one this piece tries to answer, is whether your deployment is in the small high-performing cohort the headline numbers describe or the much larger struggling body the same datasets quietly document. The answer matters because the gap between the two is not a capability gap.
This piece replaces the original 2025-vintage body that lived at this URL. That body was written before the bimodal data was published and used composite case studies that did not survive the publication’s editorial-standard pass in April 2026. The new body anchors on four primary datasets, the GAUGE governance framework that the publication uses to score deployments, and the procurement language that the high-performing cohort actually uses. The retraction record for the previous claim is preserved in the Holding-up ledger.
What the four datasets actually say about the shape
Stanford Digital Economy Lab’s 2026 Enterprise AI Playbook (Pereira, Graylin, and Brynjolfsson, March 2026) tracks 51 enterprise agentic AI deployments 12-18 months after production deployment. The headline finding is that 12% clear 300%+ ROI while 88% operate at or below break-even. The distribution is not a Gaussian with a long tail; it is two distinguishable peaks separated by a discontinuity. The paper is the cleanest published source for the bimodal shape.
Gartner Q1 2026 Infrastructure & Operations Survey reports 28% of AI projects are “fully paying off” (Gartner’s own language), with the remaining 72% at various points along an underperformance distribution. The percentages do not match Stanford’s exactly (Gartner is broader in scope and includes non-agentic AI), but the shape is consistent: a minority cohort delivering on the business case, a majority cohort not.
McKinsey State of AI 2025 (November 2025, n=1,993 across 105 nations) reports 23% of organisations scaling an agentic AI system and 39% experimenting. The 17% EBIT-attribution figure (covered separately at AM-053) is a 12-month self-reported figure. The numbers do not measure the same thing as Stanford’s 12% ROI cohort, but they triangulate the same pattern: roughly one-fifth to one-quarter of organisations are converting agentic AI into measurable business outcomes.
MIT NANDA’s GenAI Divide (State of AI in Business 2025, 150 executives + 350 employees + 300 projects) finds 95% of analysed pilots delivered no measurable P&L impact. Per the reading at AM-128, the 95% is dominated by pilots without documented pre-deployment baselines: absence of measurement, not necessarily absence of operational benefit. The 5% that did create significant value are roughly the high-discipline cohort the McKinsey 23% and Stanford 12% figures triangulate against. MIT also finds purchased AI tools succeed 67% of the time versus internal builds at roughly 22%, the same bimodal pattern at a different cut.
Read together, the four datasets describe the same shape. The “73%/27%” label does not reproduce any single dataset’s split; it rounds an aggregation of the four numbers rather than carrying “12%/88%” or “23%/77%” alone. The procurement-relevant question is what determines which side of the split a given deployment lands on.
What the high-performing cohort instruments
Across the four datasets, the high-performing cohort is not distinguished by foundation-model selection, vendor relationship, or deployment scale. It is distinguished by what surrounds the deployment. Six dimensions recur across the documented patterns; the publication tracks them under the GAUGE framework.
Governance maturity. The cohort has a named accountable owner for the deployment, a documented decision authority for tool-use changes, and an escalation path that is exercised at least quarterly. Deployments without a named owner in the GAUGE record default into the struggling cohort regardless of other strengths. McKinsey’s “AI EBIT-attribution” finding is structurally a governance finding: enterprises that report measurable EBIT impact have invested disproportionately in talent and process redesign rather than in model selection.
Threat model. The cohort treats the agent’s tool graph as a security surface and runs an explicit red-team cycle against it. The agent red-teaming companion piece (claim AM-126) walks the four attack surfaces (prompt injection, tool misuse, context-window attacks, multi-turn objective drift) and the evidence model the cohort uses. The struggling cohort has typically run a generalised pen-test that exercises none of the four agent-specific surfaces, and passed it.
ROI evidence. The cohort has a documented pre-deployment baseline before pilot day 1. MIT NANDA’s central finding (that 95% of pilots produced no measurable P&L impact) is dominated by pilots that did not establish baselines, not pilots that operationally failed. A deployment without a baseline does not produce a number to commit to regardless of how well the agent actually performs. The mid-market 90-day ROI piece (claim AM-129) walks the four artefacts a CFO can audit at the 90-day review.
Change management. The cohort assumes the deployment changes the surrounding workflow rather than slotting into it. MIT NANDA’s “startup advantage” finding (that startups deploy AI into workflows still being designed while enterprises deploy AI into workflows whose process structure was designed for non-AI tools) is the structural diagnostic. Enterprise procurement teams that scope a deployment without budgeting for workflow redesign are budgeting for the struggling cohort outcome.
Vendor lock-in posture. The cohort treats lock-in as an explicit procurement dimension (exit data portability, kill-switch operability, sub-processor expansion rights, model-deprecation rights) rather than as something to be discovered at renewal. The 60-question agentic AI RFP (claim AM-026) operationalises the dimension as one of the GAUGE axes. The struggling cohort typically signs the vendor’s MSA with light edits and discovers the lock-in surface at month 18.
Compliance posture. The cohort runs the deployment against the regulatory regime that actually applies (EU AI Act Article 6/11/12/16 for high-risk deployments, 21 CFR Part 11 plus GxP plus Annex 11 for pharma, HIPAA plus state law for healthcare) and treats the audit substrate as a load-bearing part of the deployment architecture rather than a documentation afterthought. The EU AI Act Article 12 audit-evidence piece walks the structural element.
The cohort that scores well across the six dimensions is the cohort that delivers on the business case. The cohort that scores poorly is the cohort that produces the 73% failure rate the slug names. The reproducibility of the gap across four independent datasets is what makes the framework actionable.
What the struggling cohort typically gets wrong
Three failure patterns recur across the documented post-mortems.
Pattern 1: missing baseline. The deployment ships without a documented pre-deployment baseline against which to measure post-deployment impact. The MIT NANDA finding is the data; the operational reality is that procurement teams accept the vendor’s case-study numbers as the baseline, the actual baseline is never measured, and the 12-month review produces “no measurable P&L impact” not because the deployment failed but because there was nothing to measure against. The fix is to require the baseline as a contractual precondition for pilot launch.
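To make the measurement gap concrete, here is a minimal sketch of the arithmetic. The field names and figures are illustrative assumptions, not drawn from any of the datasets above; the point is only that without a documented baseline the benefit term is undefined and the review can report nothing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeploymentMeasurement:
    """Illustrative ROI inputs for a single agentic AI deployment."""
    baseline_cost_per_case: Optional[float]  # measured before pilot day 1, or None if never captured
    post_deployment_cost_per_case: float     # measured at the review point
    cases_per_year: int
    annual_deployment_cost: float            # licences, integration, run costs


def annual_roi(m: DeploymentMeasurement) -> Optional[float]:
    """ROI = (annual benefit - annual cost) / annual cost.

    Without a documented baseline there is no benefit term to compute, so the
    review can only return None: the 'no measurable P&L impact' outcome,
    however well the agent actually performs.
    """
    if m.baseline_cost_per_case is None:
        return None  # nothing to measure against
    annual_benefit = (m.baseline_cost_per_case - m.post_deployment_cost_per_case) * m.cases_per_year
    return (annual_benefit - m.annual_deployment_cost) / m.annual_deployment_cost


# Illustrative figures only: identical agent performance, different measurement discipline.
with_baseline = DeploymentMeasurement(12.0, 7.0, 40_000, 150_000)
without_baseline = DeploymentMeasurement(None, 7.0, 40_000, 150_000)
print(annual_roi(with_baseline))     # 0.333... -> a number a CFO can audit
print(annual_roi(without_baseline))  # None -> "no measurable P&L impact"
```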
Pattern 2: fabricated case studies absorbed into the business case. The original body of this article, which lived at this URL until April 2026, used composite Fortune-500-bank framing with specific multi-million-dollar savings figures (source: “our-estimate”, since retracted) that did not name the bank, did not cite the methodology, and did not survive editorial scrutiny. The publication retracted the original because the procurement-grade reader could not act on it. Procurement teams that absorb similar composite case studies into their CFO business cases sign off on numbers that do not survive contact with their own deployment’s reality. The agentic AI 2024-2025 retrospective (claim AM-130) walks the four classes of evidence the procurement reader should learn to distinguish.
Pattern 3: build-when-buy-was-the-answer. MIT NANDA’s 67%-vs-22% build-vs-buy spread is the cleanest signal in the published data. A team scoping an internal build is accepting a success base rate roughly three times worse than a vendor-tool deployment, unless it can defensibly argue why its build will beat the documented 22% base rate by a factor of three or more. Most internal-build proposals cannot make that argument; the 22% prior is what the procurement-defensible decision rests on. The pattern is not “never build”. It is “be explicit about the 22% prior when the build approach is selected, and underwrite the additional risk in the budget.”
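A back-of-the-envelope version of that argument, using the MIT NANDA base rates; the cost and value figures below are placeholder assumptions a procurement team would replace with its own business case, not sourced numbers.

```python
# Base rates from MIT NANDA's State of AI in Business 2025: purchased tools vs internal builds.
BUY_SUCCESS = 0.67
BUILD_SUCCESS = 0.22

# The uplift a build proposal has to defend just to match the buy base rate.
print(f"Required uplift over the build base rate: {BUY_SUCCESS / BUILD_SUCCESS:.1f}x")  # ~3.0x


def expected_net_value(p_success: float, value_if_successful: float, cost: float) -> float:
    """Expected net value of a deployment decision under a success base rate."""
    return p_success * value_if_successful - cost


# Placeholder figures: hypothetical annual benefit of a working deployment, plus all-in cost.
VALUE_IF_SUCCESSFUL = 1_000_000
print(expected_net_value(BUY_SUCCESS, VALUE_IF_SUCCESSFUL, cost=250_000))    # 420000.0
print(expected_net_value(BUILD_SUCCESS, VALUE_IF_SUCCESSFUL, cost=400_000))  # -180000.0
```

The point of the sketch is the base-rate asymmetry, not the placeholder figures; a real build proposal has to justify why its own success probability departs from the 22% prior.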
The three patterns together account for most of the 73% the slug names. The fix in each case is operational, not technological.
What the GAUGE diagnostic actually measures
The GAUGE framework is the publication’s six-dimension instrument for scoring agentic AI deployments. The six dimensions map to the patterns above: Governance, Audit substrate, Use-case maturity, Guardrails (security and red-team), Evidence (ROI baseline plus measurement infrastructure), and Exit posture (vendor lock-in plus portability).
A deployment scores from 0 to 4 on each dimension. The cumulative score (out of 24) places the deployment in the bimodal distribution: scores above 18 cluster in the high-performing 27% cohort, scores below 12 cluster in the struggling 73%, and the 12-18 band is the transition zone where deployments either improve into the top cohort within 1-2 quarters or drift into the struggling body. The framework is not predictive in a strict statistical sense; it is a diagnostic that captures the operational pattern visible in the four published datasets.
The free Excel diagnostic at agentmodeai.com/gauge/ runs in 30-45 minutes for a deployment lead and produces both the cumulative score and the per-dimension breakdown. The publication uses the same instrument internally on the deployments it reports on.
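For readers who want the bucketing arithmetic spelled out, the sketch below follows the dimension names and thresholds described above. It is an illustrative reimplementation, not the publication’s Excel diagnostic; the per-dimension scoring itself still requires the deployment lead’s judgment.

```python
# GAUGE dimensions as named in this piece; each is scored 0-4 by the deployment lead.
GAUGE_DIMENSIONS = (
    "governance",        # named owner, decision authority, exercised escalation path
    "audit_substrate",   # logging and audit-evidence architecture
    "use_case_maturity",
    "guardrails",        # security posture and agent red-teaming
    "evidence",          # ROI baseline plus measurement infrastructure
    "exit_posture",      # lock-in, portability, kill-switch, deprecation rights
)


def gauge_cohort(scores: dict) -> tuple:
    """Return (cumulative score out of 24, cohort label) for one deployment."""
    missing = [d for d in GAUGE_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    if any(not 0 <= scores[d] <= 4 for d in GAUGE_DIMENSIONS):
        raise ValueError("each dimension is scored 0-4")

    total = sum(scores[d] for d in GAUGE_DIMENSIONS)
    if total > 18:
        return total, "high-performing cohort"
    if total < 12:
        return total, "struggling cohort"
    return total, "transition zone: improve within 1-2 quarters or drift"


# Illustrative example: strong governance and evidence, weak guardrails and exit posture.
print(gauge_cohort({
    "governance": 4, "audit_substrate": 3, "use_case_maturity": 3,
    "guardrails": 1, "evidence": 3, "exit_posture": 1,
}))  # (15, 'transition zone: improve within 1-2 quarters or drift')
```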
The procurement implication
The 73%/27% framing produces three concrete procurement actions for a 2026 enterprise.
Action 1: score every active deployment on GAUGE before the next review cycle. Deployments scoring below 12 are in the struggling cohort regardless of how the project status is currently reported internally. The score is the leading indicator; the project-status report is the lagging one. Reconciling the two is the procurement work.
Action 2: treat low-scoring deployments as a portfolio kill-or-fix decision, not a continuation default. The realistic horizon for moving a single deployment from the 88% body into the 12% tail under sustained discipline is 12 months. Deployments where the team cannot commit to the discipline within 1-2 quarters are better killed than rescued, and the kill decision is procurement-defensible on the bimodal data. Continuation default into a struggling deployment is the path that produces the 73% the slug names.
Action 3: anchor the next procurement against the cohort, not the average. Vendor case studies typically describe the high-performing cohort. The procurement team’s deployment will land somewhere in the bimodal distribution that the four datasets document. The honest procurement question is “what deployment shape lands us in the top cohort,” not “what average outcome should we expect.” The latter framing produces the over-promising business case that later becomes part of the 73% failure rate.
The bimodal pattern is not a vendor-marketing artefact. It is a reproducible feature of how enterprise agentic AI deployments converge on outcomes across four independent datasets. The procurement-defensible action is to instrument against the cohort separation rather than the portfolio average. The cohort that does that is the cohort that delivers.
What this piece does not claim
This piece does not claim that the 73%/27% split is precise to the percentage point. The four datasets converge on the bimodal shape but disagree on the exact percentages: Stanford’s 12/88, Gartner’s 28/72, McKinsey’s 23/77, MIT NANDA’s 5/95 (on a different metric), and the original slug’s 73/27 (failure share listed first) are different cuts of the same underlying pattern. The slug rounds the broader shape; treating the rounded number as a precise claim is the same error pattern that this publication’s credibility scanner flags elsewhere.
This piece does not claim that low-scoring deployments cannot recover. The GAUGE framework is a diagnostic, not a prophecy. Deployments scoring below 12 can move into the high-performing cohort if the operational discipline is introduced and sustained; the realistic horizon is 12 months and the realistic prior is that most do not.
This piece does not claim that the 300%+ ROI figure is an achievable forecast input. The figure is an output of sustained discipline, documented as an outcome for a small subset of deployments. Modelling 300%+ ROI into a CFO business case for a fresh deployment is the over-promising pattern that produces the 73% failure rate downstream. The realistic 90-day deliverable for a disciplined mid-market deployment is the four-artefact pattern walked at AM-129, not the headline number.
The 73%/27% framing is useful as a procurement anchor because the bimodal shape is reproducible. The framing is misleading if treated as a precise quantitative claim. The publication tracks the underlying claim on a 60-day cadence; the next review is in early July 2026.