Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter. How this works →
AM-129 · published 4 May 2026 · revised 4 May 2026 · 10 min read · Business Case & ROI

Mid-market agentic AI ROI in 90 days: what the cited data actually supports vs the vendor pitch

The “240% ROI in 90 days” framing is the most common mid-market agentic AI vendor pitch of 2026, and the most-cited statistic that no audited mid-market deployment has actually produced. Read against the McKinsey 17%, MIT NANDA 95%, and Stanford 12/88 data, the realistic 90-day mid-market ROI band comes out much narrower, and much more useful for procurement, than the pitch suggests.

Holding · reviewed 4 May 2026 · next review +57d

Bottom line. No mid-market enterprise has produced a documented +240% ROI in 90 days from agentic AI under audited conditions. The McKinsey State of AI 2025 (n=1,993) finds 23% of organisations are scaling agentic AI and 17% report ≥5% EBIT-attribution from genAI; both are 12-month-plus figures, not 90-day ones. MIT NANDA’s GenAI Divide finds 95% of pilots produce no measurable P&L impact. The realistic 90-day mid-market ROI band for the highest-discipline 12% cohort: 20-40% operator-time savings on bounded use cases, ~$50-200K in annualised cost-base reduction, and a deployment pattern that scales into 12-18-month measurable ROI. The 240% pitch is a vendor frame; what 90 days actually buys is a working pilot and a measurable baseline.

If you run AI procurement for a mid-market enterprise (1,000 to 5,000 employees, $200M to $2B in revenue) and you have been pitched on “240% ROI in 90 days from agentic AI”, the question is not whether to push back. The question is what the realistic 90-day ROI band actually looks like and how to build a procurement decision against the audited evidence rather than the vendor frame.

This piece reads the three most-cited 2026 enterprise agentic AI datasets against each other to produce a defensible 90-day mid-market ROI expectation. The framing is procurement-first: what can a CFO commit to in writing for a 90-day pilot, what cannot, and what the deployment-discipline pattern of the 12% cohort that actually reaches measurable ROI looks like.

Where the 240% number comes from (and where it doesn’t)

The “240% ROI” figure circulates in 2026 mid-market AI vendor pitches without a consistently cited primary source. It does not appear in McKinsey’s State of AI 2025, MIT NANDA’s GenAI Divide, Stanford Digital Economy Lab’s Enterprise AI Playbook, BCG’s published AI surveys, or any Big-4 published research with disclosed methodology. The figure resembles vendor-published case-study aggregates, where the highest-performing single deployment in a vendor’s reference base produces a 200-300% return and the vendor extrapolates that to a category-level expectation.

The pattern is structurally identical to the 2024-vintage “X% productivity gain” claims that mid-market buyers now read with appropriate scepticism. A single high-performing case study is not a category-level ROI expectation. The procurement question is not “can a 240% ROI happen?” (it can, in narrow circumstances, for the 12% cohort that reaches measurable ROI at all), but “what is the median mid-market deployment producing in its first 90 days, and what does the ROI distribution look like?”

That question has a much more useful answer.

What the published data actually says about 90-day ROI

Three datasets bound the realistic 90-day mid-market ROI distribution.

McKinsey State of AI 2025 (Nov 2025, n=1,993 across 105 nations) reports 23% of organisations scaling an agentic AI system and 39% experimenting. The 17% EBIT-attribution figure (covered separately at AM-053) is a 12-month self-reported figure across the entire respondent base, not a 90-day pilot figure. McKinsey does not publish 90-day specific ROI bands; the implicit baseline from the survey is that meaningful EBIT impact, when it occurs, occurs at the 12-18 month mark for the cohort that reaches scaling.

MIT NANDA’s GenAI Divide (State of AI in Business 2025, 150 executives + 350 employees + 300 projects) finds 95% of analysed pilots delivered no measurable P&L impact. Per the reading at AM-128, the 95% is dominated by pilots without documented pre-deployment baselines: absence of measurement, not necessarily absence of operational benefit. The 5% that did create significant value are roughly the high-discipline cohort the McKinsey 23% and Stanford 12% figures triangulate against. MIT also finds purchased AI tools succeed 67% of the time vs internal builds at roughly 22%, the single most actionable mid-market procurement finding in the report.

Stanford Digital Economy Lab’s Enterprise AI Playbook (covered at AM-029) tracks 51 enterprise agentic AI deployments at 12-18 months post-production-deployment and finds 12% clear 300%+ ROI while 88% operate at or below break-even. The bimodal distribution is the editorially load-bearing finding: the 12% are not “the 88% with more time”; they are structurally in a different operating mode, distinguished by the GAUGE governance dimensions the publication tracks.

Read together, the three datasets bound the realistic 90-day expectation. The 23% scaling cohort and the 12% bimodal-success cohort overlap in shape, suggesting roughly 12-23% of mid-market deployments reach measurable ROI at all. Of that cohort, the 12% cleared 300%+ ROI at 12-18 months, meaning the median 90-day result for even the highest-discipline cohort is materially below the 12-18 month outcome. Translating to a defensible mid-market 90-day band: 20-40% time savings on bounded use cases, ~$50-200K annualised cost-base reduction equivalent for a 1,000-employee mid-market deployment, a measurable but small baseline.
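
As a sanity check on that band, the arithmetic is worth making explicit. A minimal sketch in Python, where every input is an illustrative assumption for a 1,000-employee firm, not a figure from the cited datasets:

```python
# Illustrative only: OPERATORS, TIME_SHARE and LOADED_COST are placeholder
# assumptions, not figures from McKinsey, MIT NANDA, or Stanford.
OPERATORS = 25          # staff whose work the bounded use case touches
TIME_SHARE = 0.20       # share of their week the use case consumes
LOADED_COST = 70_000    # fully loaded annual cost per operator, USD

exposed_cost = OPERATORS * TIME_SHARE * LOADED_COST  # $350,000 addressable

for savings in (0.20, 0.40):  # the 20-40% operator-time-savings band
    print(f"{savings:.0%} time savings -> ${exposed_cost * savings:,.0f}/yr")
# 20% -> $70,000/yr; 40% -> $140,000/yr
```

The outputs land inside the ~$50-200K band only because the scope is bounded; widen any input and the band moves with it.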

The 240% in 90 days framing requires a deployment to clear in 90 days what Stanford’s data shows the 12% cohort takes 12-18 months to clear. The pattern does occur. But it is not the median 90-day outcome and should not be the procurement assumption.

What 90 days actually buys for a mid-market deployment

The realistic 90-day deliverable for a disciplined mid-market agentic AI pilot is a working deployment pattern, not a measurable ROI. The pattern includes four artefacts a CFO can audit at the 90-day review.

Artefact 1: documented pre-deployment baseline. The MIT NANDA finding underlines this. The 95% pilot-failure rate is largely a function of pilots without baselines, not pilots that operationally failed. A 90-day pilot that produces a baseline (per-task time, per-task cost, per-task error rate, customer-satisfaction scores) is a procurement asset for everything downstream. The vendor-pitched 240% ROI requires a baseline to compute against; the absence of one means there is no number to commit to.
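
What that baseline looks like as a record is worth sketching. A minimal Python sketch, assuming one row per task class; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class TaskBaseline:
    task_class: str        # e.g. "L1 password reset"
    median_minutes: float  # per-task handling time before deployment
    cost_per_task: float   # fully loaded USD per task
    error_rate: float      # fraction of tasks requiring rework
    csat: float            # customer-satisfaction score on the team's scale

# Captured before pilot day 1; every downstream ROI number divides by this.
baseline = TaskBaseline("L1 password reset", 11.0, 9.50, 0.04, 4.1)
```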

Artefact 2: bounded production deployment. The pilot is in production use, not a sandbox demo, with ≥10 users, ≥4 weeks of operating data, and a documented use-case scope. The bounded scope is what allows the 90-day review to be a real review. Unbounded “explore agentic AI” pilots produce qualitative findings; bounded pilots produce numbers.

Artefact 3: per-class action error budget. The agent’s task class has a documented expected success rate, a measured actual success rate, and an exhaustion response when the budget is depleted. The SLA architecture piece walks the four-metric surface (action-bounded availability, MTTD-for-Agents, output-distribution drift, per-class action error budget) that distinguishes a production-shaped pilot from a demo-shaped pilot.
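
How the budget arithmetic works is simple to sketch, assuming the budget is expressed as a tolerated failure count per action class over a rolling window; the numbers and the exhaustion response here are illustrative, not the SLA piece’s canonical definitions:

```python
def budget_remaining(expected_success: float, actions: int, failures: int) -> float:
    """Fraction of one action class's error budget still unspent."""
    allowed = actions * (1.0 - expected_success)  # failures the budget tolerates
    return (1.0 - failures / allowed) if allowed else 0.0

# 2,000 refund actions at a 95% expected success rate tolerate 100 failures;
# 80 observed failures leave 20% of the budget.
remaining = budget_remaining(expected_success=0.95, actions=2_000, failures=80)
if remaining <= 0:
    print("budget exhausted: route this action class to human handling")
else:
    print(f"{remaining:.0%} of the error budget remains")
```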

Artefact 4: scaling-vs-stop decision documented. At day 90, the deployment lead has a written recommendation: scale to additional use cases, hold and instrument further, or stop. The decision is anchored in the baseline data from artefact 1 against the pilot’s actual operating data. The MIT NANDA finding that internal builds succeed at 22% versus purchased tools at 67% should inform the scaling decision: the cohort that scales internal builds successfully is materially smaller than the cohort that scales vendor-tool deployments successfully.

These four artefacts together are what the 12% cohort actually produces at the 90-day mark. The 12-18 month measurable ROI is the downstream outcome, not the 90-day deliverable. A mid-market procurement team that scopes the 90-day pilot against these four artefacts is operating against the audited evidence; a team that scopes against the 240% ROI pitch is operating against a vendor frame.

The mid-market deployment pattern that scales into measurable ROI

Reading across the three datasets plus the BT Now Assist customer evidence (35% case-resolution time reduction with active human oversight), the UK Government Digital Service M365 Copilot trial (26 minutes saved per user per day across 20,000 staff in Q4 2024), and the Klarna walk-back pattern (publicly retracted productivity claims in May 2025), four operational patterns distinguish the cohort that scales from the cohort that does not.

Pattern 1: buy first, build second. MIT NANDA’s 67%-vs-22% spread is the cleanest mid-market procurement signal in the published data. A mid-market team scoping an internal build is accepting a success rate roughly 3x worse (22% vs 67%) than a vendor-tool deployment. This is not an argument against ever building; it is an argument for being explicit about the 22% prior when the build approach is selected. The 60-question agentic AI RFP operationalises the buy-vs-build decision under dimension 5 (vendor lock-in) of the GAUGE framework.

Pattern 2: back-end cost reduction over front-of-house revenue lift. MIT NANDA finds enterprises “deploying AI in marketing and sales, when the tools might have a much bigger impact if used to take costs out of back-end processes.” The asymmetry is measurable: cost reduction is auditable through a P&L line, revenue lift requires multi-variable attribution. A mid-market 90-day pilot deployed against an L1-helpdesk deflection use case (cost reduction) produces a cleaner 90-day ROI number than the same 90 days deployed against a sales-content-generation use case (revenue attribution).
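
The asymmetry is easy to make concrete. A minimal sketch of the deflection arithmetic with placeholder inputs (none drawn from the cited datasets); the point is that every term maps to an auditable P&L line, which a revenue-lift claim does not:

```python
# Placeholder inputs for an L1-helpdesk deflection pilot; illustrative only.
TICKETS_PER_MONTH = 4_000
COST_PER_TICKET = 9.50        # baseline L1 cost per ticket, from artefact 1
DEFLECTION_RATE = 0.30        # tickets the agent resolves end to end
AGENT_COST_PER_MONTH = 6_000  # licence + inference + human-oversight time

gross = TICKETS_PER_MONTH * DEFLECTION_RATE * COST_PER_TICKET  # $11,400/mo
net = gross - AGENT_COST_PER_MONTH                             # $5,400/mo
print(f"net saving: ${net:,.0f}/mo, ${net * 12:,.0f} annualised")
```

Every input is checkable at the 90-day review; the equivalent sales-content pilot has no comparably clean line for the deflection term.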

Pattern 3: shadow AI as the realistic baseline, not zero. MIT NANDA finds 40% of companies have official LLM subscriptions but 90% of workers report daily personal-AI-tool use for work tasks. The 90% number means the realistic mid-market pilot baseline is not “no AI”. It is “uncoordinated employee AI use producing inconsistent governance and quality”. A 90-day pilot that captures the existing-shadow baseline and measures the official deployment against it produces a more defensible number than a pilot measuring against a zero baseline.

Pattern 4: human-in-the-loop oversight is the steady state, not a training wheel. The BT pilot’s 35% case-resolution improvement was, per BT MD Hena Jalil’s on-the-record statement, achieved with “random checks at the other end”. The Klarna walk-back was specifically about removing human oversight too aggressively. A 90-day pilot framed as “AI replaces 30% of L1 staff” is structurally weaker than a 90-day pilot framed as “AI augments L1 staff with documented human oversight on every action class”. Even if the latter produces a smaller headcount-reduction number, the former produces the Klarna walk-back pattern at month 9.

These four patterns are not novel and they are not vendor-specific. They are the operational profile of the 12% cohort the cited data identifies. A mid-market procurement team running the 90-day pilot against these patterns is on the path to 12-18 month measurable ROI; a team running against the 240%-in-90-days pitch is on the path to the 88% bimodal failure cohort.

What to write into the 90-day pilot procurement contract

For a mid-market team scoping the actual procurement contract, the four 90-day artefacts plus the four operational patterns translate into specific contractual language.

Vendor commitments: per-action-class telemetry sufficient to compute action-bounded availability against a customer-defined action taxonomy; sub-hourly granularity on the four MTTD-for-Agents signals (action volume, tool-use distribution, cost-per-action, output distribution); a published baseline for output-distribution characteristics on the deployment’s calibration prompts; per-action-class error-budget framework with documented exhaustion response.
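
For the first of those commitments, a minimal sketch of the computation, assuming action-bounded availability is read as the per-class fraction of attempted actions that complete successfully; this is an interpretation for illustration, not the canonical formula from the SLA architecture piece:

```python
from collections import Counter

def action_bounded_availability(events: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-class success fraction from (action_class, succeeded) telemetry."""
    attempts: Counter = Counter()
    successes: Counter = Counter()
    for action_class, ok in events:
        attempts[action_class] += 1
        successes[action_class] += ok
    return {c: successes[c] / attempts[c] for c in attempts}

telemetry = [("refund", True), ("refund", False), ("password_reset", True)]
print(action_bounded_availability(telemetry))
# {'refund': 0.5, 'password_reset': 1.0}
```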

Customer commitments: documented pre-deployment baseline before pilot day 1; bounded use-case scope written into the contract; named human-oversight roles per action class; 90-day review with a binary scale-vs-stop decision and a 12-month re-review if scale is selected.

Mutual commitments: published documentation of any claims-vs-reality gap discovered in the 90-day window; correction-log discipline for any pilot-outcome claim that needs revising at the 12-month re-review.

The contract that lands these specifics produces a procurement-grade 90-day pilot. The contract that lands the 240%-in-90-days promise without the four artefacts produces the 88% cohort outcome.

The procurement question the data answers

The mid-market procurement question for 2026 is not “can we hit 240% ROI in 90 days?” The question is “what 90-day pattern should our pilot produce so that the 12-18 month measurable ROI is in the 12% bimodal cohort rather than the 88% one?” The data has the answer: bounded scope, documented baseline, vendor tool not internal build, back-end cost reduction not front-of-house revenue lift, human-oversight steady state, action-class error budget.

A 90-day pilot that produces those six elements has a measurable likelihood of clearing 100-300% ROI at 12-18 months. A 90-day pilot pitched against the 240% in 90 days frame typically does not produce the six elements, does not establish the baseline the 12% cohort uses to scale, and ends up in the 95% MIT-NANDA pilot-failure category, not because the AI failed but because the pilot was not designed to measure what it was claiming.

The mid-market procurement reader who internalises this distinction is operating on the same evidence base the 12% cohort already does. The 240% pitch is the most common reason the other 88% does not.
