The bimodal ROI distribution in enterprise agentic AI: why the high-performing cohort is structurally distinct
Enterprise agentic AI ROI is bimodal, not normally distributed. Stanford DEL, Gartner, McKinsey State of AI, and MIT NANDA data converge on the same shape: a small high-performing tail and a much larger struggling body. What separates the two is operational discipline, not model selection; the 73%/27% framing in the slug captures that pattern more cleanly than the retracted original body did.
Holding · reviewed 5 May 2026 · next review +58d
Bottom line: The bimodal ROI distribution in enterprise agentic AI is now visible in four independent datasets: Stanford Digital Economy Lab (12% clearing 300%+ ROI, 88% at or below break-even), Gartner Q1 2026 Infrastructure & Operations (28% of projects “fully paying off”), McKinsey State of AI 2025 (23% scaling, 17% EBIT-attribution), and MIT NANDA GenAI Divide (95% of pilots no measurable P&L impact, 67% buy vs 22% build success). The 73%/27% slug rounds the broader pattern. The procurement-relevant finding is that the variable separating the two cohorts is operational discipline, not model capability, and the gap is reproducible enough to instrument against.
If you run agentic AI procurement for a mid-market or enterprise organisation in 2026, you have probably absorbed some version of “AI is delivering ROI for the leaders.” The pitch is real, and there are deployments delivering on it. The harder question, and the one this piece tries to answer, is whether your deployment is in the small high-performing cohort the headline numbers describe or the much larger struggling body the same datasets quietly document. The answer matters because the gap between the two is not a capability gap.
This piece replaces the original 2025-vintage body that lived at this URL. That body was written before the bimodal data was published and used composite case studies that did not survive the publication’s editorial-standard pass in April 2026. The new body anchors on four primary datasets, the GAUGE governance framework that the publication uses to score deployments, and the procurement language that the high-performing cohort actually uses. The retraction record for the previous claim is preserved in the Holding-up ledger.
What the four datasets actually say about the shape
Stanford Digital Economy Lab’s 2026 Enterprise AI Playbook (Pereira, Graylin, and Brynjolfsson, March 2026) tracks 51 enterprise agentic AI deployments 12-18 months after production deployment. The headline finding is that 12% clear 300%+ ROI while 88% operate at or below break-even. The distribution is not a Gaussian with a long tail; it is two distinguishable peaks separated by a discontinuity. The paper is the cleanest published source for the bimodal shape.
Gartner Q1 2026 Infrastructure & Operations Survey reports 28% of AI projects are “fully paying off” (Gartner’s own language), with the remaining 72% at various points along an underperformance distribution. The percentages do not match Stanford’s exactly (Gartner is broader in scope and includes non-agentic AI), but the shape is consistent: a minority cohort delivering on the business case, a majority cohort not.
McKinsey State of AI 2025 (November 2025, n=1,993 across 105 nations) reports 23% of organisations scaling an agentic AI system and 39% experimenting. The 17% EBIT-attribution figure (covered separately at AM-053) is a 12-month self-reported figure. The numbers do not measure the same thing as Stanford’s 12% ROI cohort, but they triangulate the same pattern: roughly one-fifth to one-quarter of organisations are converting agentic AI into measurable business outcomes.
MIT NANDA’s GenAI Divide (State of AI in Business 2025, 150 executives + 350 employees + 300 projects) finds 95% of analysed pilots delivered no measurable P&L impact. Per the reading at AM-128, the 95% is dominated by pilots without documented pre-deployment baselines: absence of measurement, not necessarily absence of operational benefit. The 5% that did create significant value are roughly the high-discipline cohort the McKinsey 23% and Stanford 12% figures triangulate against. MIT also finds purchased AI tools succeed 67% of the time versus internal builds at roughly 22%, the same bimodal pattern at a different cut.
Read together, the four datasets describe the same shape. The “73%/27%” label does not reproduce any single dataset’s split; it rounds an aggregation of the four numbers rather than carrying “12%/88%” or “23%/77%” alone. The procurement-relevant question is what determines which side of the split a given deployment lands on.
What the high-performing cohort instruments
Across the four datasets, the high-performing cohort is not distinguished by foundation-model selection, vendor relationship, or deployment scale. It is distinguished by what surrounds the deployment. Six dimensions recur across the documented patterns; the publication tracks them under the GAUGE framework.
Governance maturity. The cohort has a named accountable owner for the deployment, a documented decision authority for tool-use changes, and an escalation path that is exercised at least quarterly. Deployments without a named owner in the GAUGE record default into the struggling cohort regardless of other strengths. McKinsey’s “AI EBIT-attribution” finding is structurally a governance finding: enterprises that report measurable EBIT impact have invested disproportionately in talent and process redesign rather than in model selection.
Threat model. The cohort treats the agent’s tool graph as a security surface and runs an explicit red-team cycle against it. The agent red-teaming companion piece (claim AM-126) walks the four attack surfaces (prompt injection, tool misuse, context-window attacks, multi-turn objective drift) and the evidence model the cohort uses. The struggling cohort has typically run a generalised pen-test that exercises none of the four agent-specific surfaces, and passed it.
ROI evidence. The cohort has a documented pre-deployment baseline before pilot day 1. MIT NANDA’s central finding (that 95% of pilots produced no measurable P&L impact) is dominated by pilots that did not establish baselines, not pilots that operationally failed. A deployment without a baseline does not produce a number to commit to regardless of how well the agent actually performs. The mid-market 90-day ROI piece (claim AM-129) walks the four artefacts a CFO can audit at the 90-day review.
Change management. The cohort assumes the deployment changes the surrounding workflow rather than slotting into it. MIT NANDA’s “startup advantage” finding (that startups deploy AI into workflows still being designed while enterprises deploy AI into workflows whose process structure was designed for non-AI tools) is the structural diagnostic. Enterprise procurement teams that scope a deployment without budgeting for workflow redesign are budgeting for the struggling cohort outcome.
Vendor lock-in posture. The cohort treats lock-in as an explicit procurement dimension (exit data portability, kill-switch operability, sub-processor expansion rights, model-deprecation rights) rather than as something to be discovered at renewal. The 60-question agentic AI RFP (claim AM-026) operationalises the dimension as one of the GAUGE axes. The struggling cohort typically signs the vendor’s MSA with light edits and discovers the lock-in surface at month 18.
Compliance posture. The cohort runs the deployment against the regulatory regime that actually applies (EU AI Act Article 6/11/12/16 for high-risk deployments, 21 CFR Part 11 plus GxP plus Annex 11 for pharma, HIPAA plus state law for healthcare) and treats the audit substrate as a load-bearing part of the deployment architecture rather than a documentation afterthought. The EU AI Act Article 12 audit-evidence piece walks the structural element.
The cohort that scores well across the six dimensions is the cohort that delivers on the business case. The cohort that scores poorly is the cohort that produces the 73% failure rate the slug names. The reproducibility of the gap across four independent datasets is what makes the framework actionable.
What the struggling cohort typically gets wrong
Three failure patterns recur across the documented post-mortems.
Pattern 1: missing baseline. The deployment ships without a documented pre-deployment baseline against which to measure post-deployment impact. The MIT NANDA finding is the data; the operational reality is that procurement teams accept the vendor’s case-study numbers as the baseline, the actual baseline is never measured, and the 12-month review produces “no measurable P&L impact” not because the deployment failed but because there was nothing to measure against. The fix is to require the baseline as a contractual precondition for pilot launch.
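To make the measurement gap concrete, here is a minimal sketch of the arithmetic. The field names and figures are illustrative assumptions, not drawn from any of the datasets above; the point is only that without a documented baseline the benefit term is undefined and the review can report nothing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeploymentMeasurement:
    """Illustrative ROI inputs for a single agentic AI deployment."""
    baseline_cost_per_case: Optional[float]  # measured before pilot day 1, or None if never captured
    post_deployment_cost_per_case: float     # measured at the review point
    cases_per_year: int
    annual_deployment_cost: float            # licences, integration, run costs


def annual_roi(m: DeploymentMeasurement) -> Optional[float]:
    """ROI = (annual benefit - annual cost) / annual cost.

    Without a documented baseline there is no benefit term to compute, so the
    review can only return None: the 'no measurable P&L impact' outcome,
    however well the agent actually performs.
    """
    if m.baseline_cost_per_case is None:
        return None  # nothing to measure against
    annual_benefit = (m.baseline_cost_per_case - m.post_deployment_cost_per_case) * m.cases_per_year
    return (annual_benefit - m.annual_deployment_cost) / m.annual_deployment_cost


# Illustrative figures only: identical agent performance, different measurement discipline.
with_baseline = DeploymentMeasurement(12.0, 7.0, 40_000, 150_000)
without_baseline = DeploymentMeasurement(None, 7.0, 40_000, 150_000)
print(annual_roi(with_baseline))     # 0.333... -> a number a CFO can audit
print(annual_roi(without_baseline))  # None -> "no measurable P&L impact"
```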
Pattern 2: fabricated case studies absorbed into the business case. The original body of this article, which lived at this URL until April 2026, used composite Fortune-500-bank framing with specific multi-million-dollar savings figures (source: “our-estimate”, since retracted) that did not name the bank, did not cite the methodology, and did not survive editorial scrutiny. The publication retracted the original because the procurement-grade reader could not act on it. Procurement teams that absorb similar composite case studies into their CFO business cases sign off on numbers that do not survive contact with their own deployment’s reality. The agentic AI 2024-2025 retrospective (claim AM-130) walks the four classes of evidence the procurement reader should learn to distinguish.
Pattern 3: build-when-buy-was-the-answer. MIT NANDA’s 67%-vs-22% build-vs-buy spread is the cleanest signal in the published data. A team scoping an internal build is accepting a success base rate roughly three times worse than a vendor-tool deployment, unless it can defensibly argue why its build will beat the documented 22% base rate by a factor of three or more. Most internal-build proposals cannot make that argument; the 22% prior is what the procurement-defensible decision rests on. The pattern is not “never build”. It is “be explicit about the 22% prior when the build approach is selected, and underwrite the additional risk in the budget.”
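A back-of-the-envelope version of that argument, using the MIT NANDA base rates; the cost and value figures below are placeholder assumptions a procurement team would replace with its own business case, not sourced numbers.

```python
# Base rates from MIT NANDA's State of AI in Business 2025: purchased tools vs internal builds.
BUY_SUCCESS = 0.67
BUILD_SUCCESS = 0.22

# The uplift a build proposal has to defend just to match the buy base rate.
print(f"Required uplift over the build base rate: {BUY_SUCCESS / BUILD_SUCCESS:.1f}x")  # ~3.0x


def expected_net_value(p_success: float, value_if_successful: float, cost: float) -> float:
    """Expected net value of a deployment decision under a success base rate."""
    return p_success * value_if_successful - cost


# Placeholder figures: hypothetical annual benefit of a working deployment, plus all-in cost.
VALUE_IF_SUCCESSFUL = 1_000_000
print(expected_net_value(BUY_SUCCESS, VALUE_IF_SUCCESSFUL, cost=250_000))    # 420000.0
print(expected_net_value(BUILD_SUCCESS, VALUE_IF_SUCCESSFUL, cost=400_000))  # -180000.0
```

The point of the sketch is the base-rate asymmetry, not the placeholder figures; a real build proposal has to justify why its own success probability departs from the 22% prior.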
The three patterns together account for most of the 73% the slug names. The fix in each case is operational, not technological.
What the GAUGE diagnostic actually measures
The GAUGE framework is the publication’s six-dimension instrument for scoring agentic AI deployments. The six dimensions map to the patterns above: Governance, Audit substrate, Use-case maturity, Guardrails (security and red-team), Evidence (ROI baseline plus measurement infrastructure), and Exit posture (vendor lock-in plus portability).
A deployment scores from 0 to 4 on each dimension. The cumulative score (out of 24) places the deployment in the bimodal distribution: scores above 18 cluster in the high-performing 27% cohort, scores below 12 cluster in the struggling 73%, and the 12-18 band is the transition zone where deployments either improve into the top cohort within 1-2 quarters or drift into the struggling body. The framework is not predictive in a strict statistical sense; it is a diagnostic that captures the operational pattern visible in the four published datasets.
The free Excel diagnostic at agentmodeai.com/gauge/ runs in 30-45 minutes for a deployment lead and produces both the cumulative score and the per-dimension breakdown. The publication uses the same instrument internally on the deployments it reports on.
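For readers who want the bucketing arithmetic spelled out, the sketch below follows the dimension names and thresholds described above. It is an illustrative reimplementation, not the publication’s Excel diagnostic; the per-dimension scoring itself still requires the deployment lead’s judgment.

```python
# GAUGE dimensions as named in this piece; each is scored 0-4 by the deployment lead.
GAUGE_DIMENSIONS = (
    "governance",        # named owner, decision authority, exercised escalation path
    "audit_substrate",   # logging and audit-evidence architecture
    "use_case_maturity",
    "guardrails",        # security posture and agent red-teaming
    "evidence",          # ROI baseline plus measurement infrastructure
    "exit_posture",      # lock-in, portability, kill-switch, deprecation rights
)


def gauge_cohort(scores: dict) -> tuple:
    """Return (cumulative score out of 24, cohort label) for one deployment."""
    missing = [d for d in GAUGE_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    if any(not 0 <= scores[d] <= 4 for d in GAUGE_DIMENSIONS):
        raise ValueError("each dimension is scored 0-4")

    total = sum(scores[d] for d in GAUGE_DIMENSIONS)
    if total > 18:
        return total, "high-performing cohort"
    if total < 12:
        return total, "struggling cohort"
    return total, "transition zone: improve within 1-2 quarters or drift"


# Illustrative example: strong governance and evidence, weak guardrails and exit posture.
print(gauge_cohort({
    "governance": 4, "audit_substrate": 3, "use_case_maturity": 3,
    "guardrails": 1, "evidence": 3, "exit_posture": 1,
}))  # (15, 'transition zone: improve within 1-2 quarters or drift')
```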
The procurement implication
The 73%/27% framing produces three concrete procurement actions for a 2026 enterprise.
Action 1: score every active deployment on GAUGE before the next review cycle. Deployments scoring below 12 are in the struggling cohort regardless of how the project status is currently reported internally. The score is the leading indicator; the project-status report is the lagging one. Reconciling the two is the procurement work.
Action 2: treat low-scoring deployments as a portfolio kill-or-fix decision, not a continuation default. The realistic horizon for moving a single deployment from the 88% body into the 12% tail under sustained discipline is 12 months. Deployments where the team cannot commit to the discipline within 1-2 quarters are better killed than rescued, and the kill decision is procurement-defensible on the bimodal data. Continuation default into a struggling deployment is the path that produces the 73% the slug names.
Action 3: anchor the next procurement against the cohort, not the average. Vendor case studies typically describe the high-performing cohort. The procurement team’s deployment will land somewhere in the bimodal distribution that the four datasets document. The honest procurement question is “what deployment shape lands us in the top cohort,” not “what average outcome should we expect.” The latter framing produces the over-promising business case that later becomes part of the 73% failure rate.
The bimodal pattern is not a vendor-marketing artefact. It is a reproducible feature of how enterprise agentic AI deployments converge on outcomes across four independent datasets. The procurement-defensible action is to instrument against the cohort separation rather than the portfolio average. The cohort that does that is the cohort that delivers.
What this piece does not claim
This piece does not claim that the 73%/27% split is precise to the percentage point. The four datasets converge on the bimodal shape but disagree on the exact percentages: Stanford’s 12/88, Gartner’s 28/72, McKinsey’s 23/77, MIT NANDA’s 5/95 (on a different metric), and the original slug’s 73/27 (failure share listed first) are different cuts of the same underlying pattern. The slug rounds the broader shape; treating the rounded number as a precise claim is the same error pattern that this publication’s credibility scanner flags elsewhere.
This piece does not claim that low-scoring deployments cannot recover. The GAUGE framework is a diagnostic, not a prophecy. Deployments scoring below 12 can move into the high-performing cohort if the operational discipline is introduced and sustained; the realistic horizon is 12 months and the realistic prior is that most do not.
This piece does not claim that the 300%+ ROI figure is an achievable forecast input. The figure is an output of sustained discipline, documented as an outcome for a small subset of deployments. Modelling 300%+ ROI into a CFO business case for a fresh deployment is the over-promising pattern that produces the 73% failure rate downstream. The realistic 90-day deliverable for a disciplined mid-market deployment is the four-artefact pattern walked at AM-129, not the headline number.
The 73%/27% framing is useful as a procurement anchor because the bimodal shape is reproducible. The framing is misleading if treated as a precise quantitative claim. The publication tracks the underlying claim on a 60-day cadence; the next review is in early July 2026.