Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter. How this works →
AM-129 · published 4 May 2026 · revised 4 May 2026 · 10 min read · Business Case & ROI

Mid-market agentic AI ROI in 90 days: what the cited data actually supports vs the vendor pitch

The “240% ROI in 90 days” framing is the most common mid-market agentic AI vendor pitch of 2026, and the most-cited statistic that no audited mid-market deployment has actually produced. Read against the McKinsey 17%, MIT NANDA 95%, and Stanford 12/88 data, the realistic 90-day mid-market ROI band comes out much narrower, and much more useful for procurement, than the pitch suggests.

Holding · reviewed 4 May 2026 · next review +57d

Bottom line. No mid-market enterprise has produced a documented +240% ROI in 90 days from agentic AI under audited conditions. The McKinsey State of AI 2025 (n=1,993) finds 23% of organisations are scaling agentic AI and 17% report ≥5% EBIT-attribution from genAI; both are 12-month-plus figures, not 90-day ones. MIT NANDA’s GenAI Divide finds 95% of pilots produce no measurable P&L impact. The realistic 90-day mid-market ROI band for the highest-discipline 12% cohort: 20-40% operator-time savings on bounded use cases, ~$50-200K in annualised cost-base reduction, and a deployment pattern that scales into 12-18-month measurable ROI. The 240% pitch is a vendor frame; what 90 days actually buys is a working pilot and a measurable baseline.

If you run AI procurement for a mid-market enterprise (1,000 to 5,000 employees, $200M to $2B in revenue) and you have been pitched on “240% ROI in 90 days from agentic AI”, the question is not whether to push back. The question is what the realistic 90-day ROI band actually looks like and how to build a procurement decision against the audited evidence rather than the vendor frame.

This piece reads the three most-cited 2026 enterprise agentic AI datasets against each other to produce a defensible 90-day mid-market ROI expectation. The framing is procurement-first: what can a CFO commit to in writing for a 90-day pilot, what cannot, and what the deployment-discipline pattern of the 12% cohort that actually reaches measurable ROI looks like.

Where the 240% number comes from (and where it doesn’t)

The “240% ROI” figure circulates in 2026 mid-market AI vendor pitches without a consistently cited primary source. It does not appear in McKinsey’s State of AI 2025, MIT NANDA’s GenAI Divide, Stanford Digital Economy Lab’s Enterprise AI Playbook, BCG’s published AI surveys, or any Big-4 published research with disclosed methodology. The figure resembles vendor-published case-study aggregates, where the highest-performing single deployment in a vendor’s reference base produces a 200-300% return and the vendor extrapolates that to a category-level expectation.

The pattern is structurally identical to the 2024-vintage “X% productivity gain” claims that mid-market buyers now read with appropriate scepticism. A single high-performing case study is not a category-level ROI expectation. The procurement question is not “can a 240% ROI happen?” (it can, in narrow circumstances, for the 12% cohort that reaches measurable ROI at all), but “what is the median mid-market deployment producing in its first 90 days, and what does the ROI distribution look like?”

That question has a much more useful answer.

What the published data actually says about 90-day ROI

Three datasets bound the realistic 90-day mid-market ROI distribution.

McKinsey State of AI 2025 (Nov 2025, n=1,993 across 105 nations) reports 23% of organisations scaling an agentic AI system and 39% experimenting. The 17% EBIT-attribution figure (covered separately at AM-053) is a 12-month self-reported figure across the entire respondent base, not a 90-day pilot figure. McKinsey does not publish 90-day specific ROI bands; the implicit baseline from the survey is that meaningful EBIT impact, when it occurs, occurs at the 12-18 month mark for the cohort that reaches scaling.

MIT NANDA’s GenAI Divide (State of AI in Business 2025, 150 executives + 350 employees + 300 projects) finds 95% of analysed pilots delivered no measurable P&L impact. Per the reading at AM-128, the 95% is dominated by pilots without documented pre-deployment baselines: absence of measurement, not necessarily absence of operational benefit. The 5% that did create significant value are roughly the high-discipline cohort the McKinsey 23% and Stanford 12% figures triangulate against. MIT also finds purchased AI tools succeed 67% of the time vs internal builds at roughly 22%, the single most actionable mid-market procurement finding in the report.

Stanford Digital Economy Lab’s Enterprise AI Playbook (covered at AM-029) tracks 51 enterprise agentic AI deployments at 12-18 months post-production-deployment and finds 12% clear 300%+ ROI while 88% operate at or below break-even. The bimodal distribution is the editorially load-bearing finding: the 12% are not “the 88% with more time”; they are structurally in a different operating mode, distinguished by the GAUGE governance dimensions the publication tracks.

Read together, the three datasets bound the realistic 90-day expectation. The 23% scaling cohort and the 12% bimodal-success cohort overlap in shape, suggesting roughly 12-23% of mid-market deployments reach measurable ROI at all. Of that cohort, the 12% cleared 300%+ ROI at 12-18 months, meaning the median 90-day result for even the highest-discipline cohort is materially below the 12-18 month outcome. Translating to a defensible mid-market 90-day band: 20-40% time savings on bounded use cases, ~$50-200K annualised cost-base reduction equivalent for a 1,000-employee mid-market deployment, a measurable but small baseline.
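
As a sanity check on that band, the arithmetic is worth making explicit. A minimal sketch in Python, where every input is an illustrative assumption for a 1,000-employee firm, not a figure from the cited datasets:

```python
# Illustrative only: OPERATORS, TIME_SHARE and LOADED_COST are placeholder
# assumptions, not figures from McKinsey, MIT NANDA, or Stanford.
OPERATORS = 25          # staff whose work the bounded use case touches
TIME_SHARE = 0.20       # share of their week the use case consumes
LOADED_COST = 70_000    # fully loaded annual cost per operator, USD

exposed_cost = OPERATORS * TIME_SHARE * LOADED_COST  # $350,000 addressable

for savings in (0.20, 0.40):  # the 20-40% operator-time-savings band
    print(f"{savings:.0%} time savings -> ${exposed_cost * savings:,.0f}/yr")
# 20% -> $70,000/yr; 40% -> $140,000/yr
```

The outputs land inside the ~$50-200K band only because the scope is bounded; widen any input and the band moves with it.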

The 240% in 90 days framing requires a deployment to clear in 90 days what Stanford’s data shows the 12% cohort takes 12-18 months to clear. The pattern does occur. But it is not the median 90-day outcome and should not be the procurement assumption.

What 90 days actually buys for a mid-market deployment

The realistic 90-day deliverable for a disciplined mid-market agentic AI pilot is a working deployment pattern, not a measurable ROI. The pattern includes four artefacts a CFO can audit at the 90-day review.

Artefact 1: documented pre-deployment baseline. The MIT NANDA finding underlines this. The 95% pilot-failure rate is largely a function of pilots without baselines, not pilots that operationally failed. A 90-day pilot that produces a baseline (per-task time, per-task cost, per-task error rate, customer-satisfaction scores) is a procurement asset for everything downstream. The vendor-pitched 240% ROI requires a baseline to compute against; the absence of one means there is no number to commit to.
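
What that baseline looks like as a record is worth sketching. A minimal Python sketch, assuming one row per task class; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class TaskBaseline:
    task_class: str        # e.g. "L1 password reset"
    median_minutes: float  # per-task handling time before deployment
    cost_per_task: float   # fully loaded USD per task
    error_rate: float      # fraction of tasks requiring rework
    csat: float            # customer-satisfaction score on the team's scale

# Captured before pilot day 1; every downstream ROI number divides by this.
baseline = TaskBaseline("L1 password reset", 11.0, 9.50, 0.04, 4.1)
```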

Artefact 2: bounded production deployment. The pilot is in production use, not a sandbox demo, with ≥10 users, ≥4 weeks of operating data, and a documented use-case scope. The bounded scope is what allows the 90-day review to be a real review. Unbounded “explore agentic AI” pilots produce qualitative findings; bounded pilots produce numbers.

Artefact 3: per-class action error budget. The agent’s task class has a documented expected success rate, a measured actual success rate, and an exhaustion response when the budget is depleted. The SLA architecture piece walks the four-metric surface (action-bounded availability, MTTD-for-Agents, output-distribution drift, per-class action error budget) that distinguishes a production-shaped pilot from a demo-shaped pilot.
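
How the budget arithmetic works is simple to sketch, assuming the budget is expressed as a tolerated failure count per action class over a rolling window; the numbers and the exhaustion response here are illustrative, not the SLA piece’s canonical definitions:

```python
def budget_remaining(expected_success: float, actions: int, failures: int) -> float:
    """Fraction of one action class's error budget still unspent."""
    allowed = actions * (1.0 - expected_success)  # failures the budget tolerates
    return (1.0 - failures / allowed) if allowed else 0.0

# 2,000 refund actions at a 95% expected success rate tolerate 100 failures;
# 80 observed failures leave 20% of the budget.
remaining = budget_remaining(expected_success=0.95, actions=2_000, failures=80)
if remaining <= 0:
    print("budget exhausted: route this action class to human handling")
else:
    print(f"{remaining:.0%} of the error budget remains")
```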

Artefact 4: scaling-vs-stop decision documented. At day 90, the deployment lead has a written recommendation: scale to additional use cases, hold and instrument further, or stop. The decision is anchored in the baseline data from artefact 1 against the pilot’s actual operating data. The MIT NANDA finding that internal builds succeed at 22% versus purchased tools at 67% should inform the scaling decision: the cohort that scales internal builds successfully is materially smaller than the cohort that scales vendor-tool deployments successfully.

These four artefacts together are what the 12% cohort actually produces at the 90-day mark. The 12-18 month measurable ROI is the downstream outcome, not the 90-day deliverable. A mid-market procurement team that scopes the 90-day pilot against these four artefacts is operating against the audited evidence; a team that scopes against the 240% ROI pitch is operating against a vendor frame.

The mid-market deployment pattern that scales into measurable ROI

Reading across the three datasets plus the BT Now Assist customer evidence (35% case-resolution time reduction with active human oversight), the UK Government Digital Service M365 Copilot trial (26 minutes saved per user per day across 20,000 staff in Q4 2024), and the Klarna walk-back pattern (publicly retracted productivity claims in May 2025), four operational patterns distinguish the cohort that scales from the cohort that does not.

Pattern 1: buy first, build second. MIT NANDA’s 67%-vs-22% spread is the cleanest mid-market procurement signal in the published data. A mid-market team scoping an internal build is accepting a success rate roughly 3x worse (22% vs 67%) than a vendor-tool deployment. This is not an argument against ever building; it is an argument for being explicit about the 22% prior when the build approach is selected. The 60-question agentic AI RFP operationalises the buy-vs-build decision under dimension 5 (vendor lock-in) of the GAUGE framework.

Pattern 2: back-end cost reduction over front-of-house revenue lift. MIT NANDA finds enterprises “deploying AI in marketing and sales, when the tools might have a much bigger impact if used to take costs out of back-end processes.” The asymmetry is measurable: cost reduction is auditable through a P&L line, revenue lift requires multi-variable attribution. A mid-market 90-day pilot deployed against an L1-helpdesk deflection use case (cost reduction) produces a cleaner 90-day ROI number than the same 90 days deployed against a sales-content-generation use case (revenue attribution).
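
The asymmetry is easy to make concrete. A minimal sketch of the deflection arithmetic with placeholder inputs (none drawn from the cited datasets); the point is that every term maps to an auditable P&L line, which a revenue-lift claim does not:

```python
# Placeholder inputs for an L1-helpdesk deflection pilot; illustrative only.
TICKETS_PER_MONTH = 4_000
COST_PER_TICKET = 9.50        # baseline L1 cost per ticket, from artefact 1
DEFLECTION_RATE = 0.30        # tickets the agent resolves end to end
AGENT_COST_PER_MONTH = 6_000  # licence + inference + human-oversight time

gross = TICKETS_PER_MONTH * DEFLECTION_RATE * COST_PER_TICKET  # $11,400/mo
net = gross - AGENT_COST_PER_MONTH                             # $5,400/mo
print(f"net saving: ${net:,.0f}/mo, ${net * 12:,.0f} annualised")
```

Every input is checkable at the 90-day review; the equivalent sales-content pilot has no comparably clean line for the deflection term.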

Pattern 3: shadow AI as the realistic baseline, not zero. MIT NANDA finds 40% of companies have official LLM subscriptions but 90% of workers report daily personal-AI-tool use for work tasks. The 90% number means the realistic mid-market pilot baseline is not “no AI”. It is “uncoordinated employee AI use producing inconsistent governance and quality”. A 90-day pilot that captures the existing-shadow baseline and measures the official deployment against it produces a more defensible number than a pilot measuring against a zero baseline.

Pattern 4: human-in-the-loop oversight is the steady state, not a training wheel. The BT pilot’s 35% case-resolution improvement was, per BT MD Hena Jalil’s on-the-record statement, achieved with “random checks at the other end”. The Klarna walk-back was specifically about removing human oversight too aggressively. A 90-day pilot framed as “AI replaces 30% of L1 staff” is structurally weaker than a 90-day pilot framed as “AI augments L1 staff with documented human oversight on every action class”. Even if the latter produces a smaller headcount-reduction number, the former produces the Klarna walk-back pattern at month 9.

These four patterns are not novel and they are not vendor-specific. They are the operational profile of the 12% cohort the cited data identifies. A mid-market procurement team running the 90-day pilot against these patterns is on the path to 12-18 month measurable ROI; a team running against the 240%-in-90-days pitch is on the path to the 88% bimodal failure cohort.

What to write into the 90-day pilot procurement contract

For a mid-market team scoping the actual procurement contract, the four 90-day artefacts plus the four operational patterns translate into specific contractual language.

Vendor commitments: per-action-class telemetry sufficient to compute action-bounded availability against a customer-defined action taxonomy; sub-hourly granularity on the four MTTD-for-Agents signals (action volume, tool-use distribution, cost-per-action, output distribution); a published baseline for output-distribution characteristics on the deployment’s calibration prompts; per-action-class error-budget framework with documented exhaustion response.
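
For the first of those commitments, a minimal sketch of the computation, assuming action-bounded availability is read as the per-class fraction of attempted actions that complete successfully; this is an interpretation for illustration, not the canonical formula from the SLA architecture piece:

```python
from collections import Counter

def action_bounded_availability(events: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-class success fraction from (action_class, succeeded) telemetry."""
    attempts: Counter = Counter()
    successes: Counter = Counter()
    for action_class, ok in events:
        attempts[action_class] += 1
        successes[action_class] += ok
    return {c: successes[c] / attempts[c] for c in attempts}

telemetry = [("refund", True), ("refund", False), ("password_reset", True)]
print(action_bounded_availability(telemetry))
# {'refund': 0.5, 'password_reset': 1.0}
```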

Customer commitments: documented pre-deployment baseline before pilot day 1; bounded use-case scope written into the contract; named human-oversight roles per action class; 90-day review with a binary scale-vs-stop decision and a 12-month re-review if scale is selected.

Mutual commitments: published documentation of any claims-vs-reality gap discovered in the 90-day window; correction-log discipline for any pilot-outcome claim that needs revising at the 12-month re-review.

The contract that lands these specifics produces a procurement-grade 90-day pilot. The contract that lands the 240%-in-90-days promise without the four artefacts produces the 88% cohort outcome.

The procurement question the data answers

The mid-market procurement question for 2026 is not “can we hit 240% ROI in 90 days?” The question is “what 90-day pattern should our pilot produce so that the 12-18 month measurable ROI is in the 12% bimodal cohort rather than the 88% one?” The data has the answer: bounded scope, documented baseline, vendor tool not internal build, back-end cost reduction not front-of-house revenue lift, human-oversight steady state, action-class error budget.

A 90-day pilot that produces those six elements has a measurable likelihood of clearing 100-300% ROI at 12-18 months. A 90-day pilot pitched against the 240% in 90 days frame typically does not produce the six elements, does not establish the baseline the 12% cohort uses to scale, and ends up in the 95% MIT-NANDA pilot-failure category, not because the AI failed but because the pilot was not designed to measure what it was claiming.

The mid-market procurement reader who internalises this distinction is operating on the same evidence base the 12% cohort already does. The 240% pitch is the most common reason the other 88% does not.
