The agentic AI pilot-to-production gap: what vendor 'successful pilot' references do not tell procurement
Vendor 'successful pilot' references are the most common evidence presented to enterprise procurement committees evaluating agentic AI. McKinsey State of AI 2025 (Nov 2025, n=1,491) reports 23% of enterprises scaling and 39% still experimenting; the documented 2024-2026 walk-backs (Klarna's 700-agent reversal, the Salesforce Agentforce 200-customer reality, GitHub Copilot's April 2026 token-counting bug) describe what those references typically obscure. The gap between pilot success at the vendor's reference customer and scaled production at the procuring enterprise is operational, and it is the procurement committee's job to make the regime-translation question explicit before the contract closes.
Holding · reviewed 06 May 2026 · next review +59d

Bottom line. McKinsey State of AI 2025 (n=1,491, Nov 2025) reports 23% of enterprises scaling agentic AI, 39% experimenting, and 38% with nothing in production or stopped. Vendor “successful pilot” references presented to procurement committees describe outcomes at the vendor’s reference customer, not under the procuring enterprise’s measurement and governance regime. The transfer rate is closer to 23% than to the 100% the reference language implies. Six pre-pilot questions tighten the gap. Source: McKinsey State of AI, Nov 2025.
The procurement committee meets in late April 2026. The agenda item is a multi-year agentic AI contract for a customer-service workflow. The vendor’s deck includes seven named “successful pilot” references at peer-segment enterprises. The committee approves the procurement on the strength of those references and a vendor-supplied ROI model derived from them. Twelve months later the deployment is in the 38% bucket: deployed, then stopped, with the residual contract value written off.
That arc is one shape of what the McKinsey State of AI 2025 finding describes. The Nov 2025 survey reports 23% of enterprises scaling agentic AI, 39% experimenting, and 38% with nothing in production or stopped. The 38% includes the procurement decisions that approved on vendor-reference evidence and did not translate that evidence into operating reality at the procuring enterprise. The McKinsey number is not a forecast about model maturity. It is a measurement of how often pilot success at one organisation translates to scaled production at another.
This piece is the procurement-committee-side companion to the IT-leader analysis of the McKinsey 23%. The IT-leader question is “we have a pilot, why does it not scale at our organisation”. The procurement-committee question is upstream of that: “we are about to authorise a pilot on the strength of the vendor’s references, what does pilot success at the reference customer mean for our deployment”.
What vendor “successful pilot” references actually describe
Vendor reference language is consistent across the major 2026 agentic AI sales motions. A “successful pilot” typically means: the vendor’s deployment team supported the implementation; the customer’s pilot unit had above-average AI-readiness; the success metric was measured against the vendor’s instrumentation rather than the customer’s pre-deployment baseline; the time horizon was 60-180 days; the pilot did not run through a full audit, regulatory review, or change-of-leadership cycle.
Each of those characteristics is procurement-relevant. The vendor deployment team will not be embedded at the procuring enterprise’s scale. The pilot unit’s AI-readiness rarely matches the adjacent business units the scale-up will reach. The procuring enterprise’s measurement regime is its own, not the vendor’s. The 60-180 day horizon does not exercise audit, regulatory review, or operational transitions that scaled production faces continuously.
Three documented 2024-2025 cases illustrate how reference language can outrun operating reality.
Klarna’s 700-agent claim. Klarna’s 2024 productivity narrative anchored on a 700-agent reduction figure that became a widely-cited reference in vendor decks. Bloomberg reported Klarna’s reversal in May 2025; the original press release stayed live, which is the procurement-relevant detail because the citation chain kept circulating unchanged (Bloomberg, 8 May 2025). A procurement committee citing the Klarna deployment as a peer-class reference in mid-2025 was citing a number Klarna itself had walked back.
Salesforce Agentforce launch versus customer reality. Marc Benioff’s Agentforce launch positioning implied broad customer adoption. Subsequent reporting through Q1 2026 put the customer figure at roughly 200, a real number for an early-stage product but materially smaller than the launch positioning implied (The Information, Apr 2025). Procurement committees evaluating Agentforce on the launch positioning were working from an inflated reference base.
GitHub Copilot’s April 2026 token-counting bug. GitHub acknowledged a token-counting accuracy issue in April 2026 that affected billing and, by extension, customer-side ROI calculations derived from Copilot usage data (GitHub changelog, 18 Apr 2026). The bug was disclosed and remediated. Procurement committees that had built business cases on pre-disclosure usage data needed to re-derive the per-user value to verify the contract’s commercial assumptions still held.
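The re-derivation the committee needs is simple arithmetic once the vendor discloses a correction factor. A hypothetical sketch (the function name, the factor, and all figures are illustrative, not taken from the GitHub disclosure):

```python
def corrected_value_per_user(measured_value: float, users: int,
                             correction_factor: float) -> float:
    """Re-derive per-user value after a billing-data correction.

    measured_value: annual value computed from pre-disclosure usage data.
    correction_factor: vendor-disclosed scaling on the affected usage metric
    (hypothetical here; a real committee takes it from the remediation notice).
    """
    return (measured_value * correction_factor) / users

# If a 10% overcount is disclosed, the defensible per-user figure shrinks
# accordingly, and the contract's commercial assumptions get re-checked
# against the corrected number rather than the original one.
per_user = corrected_value_per_user(120_000, 100, 0.9)
```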
None of the three cases is an “AI does not work” finding. Each is a “the headline number is older or narrower than the citation suggests” finding, which is exactly the class of issue a procurement committee should price into the evaluation framework.
What the structural failure-mode evidence adds
Three research findings bound what is procurement-credible regardless of any single vendor reference.
CRMArena-Pro (Salesforce AI Research, Aug 2025) measured frontier-class agents at roughly 35% multi-step reliability on a structured CRM benchmark. The agents complete individual steps competently; the multi-step sequence drifts. Carnegie Mellon’s TheAgentCompany benchmark independently reproduces the 30-35% range on adjacent enterprise workloads. Both findings are mechanism-level, not incidental, which means they apply to the procuring enterprise’s deployment regardless of the reference customer’s pilot performance.
The EchoLeak class (CVE-2025-32711, disclosed Aug 2025) names the cross-agent prompt-injection pattern in which one compromised agent’s output contaminates the input substrate of agents downstream in the workflow (NVD CVE-2025-32711). Pilot deployments at the reference customer typically do not exercise the cross-agent attack surface; scaled production at the procuring enterprise will. Procurement committees that do not require cross-agent threat-model evidence are pricing a smaller risk than the deployment will face.
These structural findings cap what any vendor pilot reference can credibly imply. A vendor with strong customer references and no public response to the CRMArena-Pro and CMU reliability findings is asking the procurement committee to assume their deployment is the exception to a measured pattern.
Six pre-pilot questions for the procurement committee
The committee evaluating an agentic AI procurement on vendor pilot references can require six answers in writing before the pilot kick-off.
1. What is the procuring enterprise’s pre-deployment baseline on the workflow the agent will own, measured by the procuring enterprise’s own instrumentation, over 4-6 weeks before the agent goes live? Without this, the pilot’s success cannot be evaluated against any meaningful comparison and the eventual scaling decision will rest on vendor-side numbers that the CFO cannot defend.
2. What is the named owner of the agent’s outcome at the procuring enterprise, with reporting line and accountability scope? Vendor references typically have a champion at the reference customer. The procurement committee needs the equivalent named at its own organisation, on the org chart, before the pilot starts.
3. What is the agent registry the deployment will be added to, and what is the registry entry’s content? Pilots without registry entries cannot be scaled because the governance, security, and compliance teams cannot evaluate the scale-up against an inventory that does not exist. The first scaled-deployment review will surface the absence; better to surface it pre-pilot.
4. What is the threat model for cross-agent delegation at the scale the procuring enterprise plans to operate, including the EchoLeak-class scenario? A pilot threat model that covers the pilot-unit attack surface is a different document from a scaled-production threat model. The procurement committee can require both.
5. What are the contractual exit conditions, and have the data-portability and runtime-portability claims been tested rather than asserted? Vendor lock-in in agentic AI is an operating-cost issue, not just a procurement-clause issue. Tested portability is the difference between a 90-day exit and a multi-year migration that exhausts the IT budget.
6. What is the 90-day, 180-day, and 365-day measurement plan the procuring enterprise will run on the scaled deployment, with named metrics, named owner, and named board-level review? A vendor reference describes outcomes at the reference customer at the time of the reference. The procuring enterprise’s outcomes will be measured by the procuring enterprise on its own cadence; the procurement committee can require the cadence to exist before the contract closes.
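The six questions can be held as a literal go/no-go gate on pilot kick-off. A minimal sketch, assuming a Python-based checklist; the question labels and function name are illustrative, not from any vendor or GAUGE artifact:

```python
# Hypothetical pre-pilot gate: each question must have a written answer
# on file before the pilot is authorised. Labels paraphrase the six
# questions in the text.
PREPILOT_QUESTIONS = [
    "pre-deployment baseline (own instrumentation, 4-6 weeks)",
    "named outcome owner (reporting line, accountability scope)",
    "agent registry entry (content defined)",
    "cross-agent threat model (scaled-production scope)",
    "tested exit conditions (data and runtime portability)",
    "90/180/365-day measurement plan (metrics, owner, board review)",
]

def pilot_may_start(written_answers: dict) -> tuple:
    """Return (go/no-go, list of questions still unanswered in writing)."""
    missing = [q for q in PREPILOT_QUESTIONS
               if not written_answers.get(q, "").strip()]
    return (len(missing) == 0, missing)
```

The point of the sketch is the failure mode it makes explicit: a single empty answer blocks the gate, rather than being waved through as "to be confirmed post-contract".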
The six are operational preconditions, not contractual frills. A procurement committee unable to obtain answers in writing before the pilot is making a procurement decision on the same evidence base McKinsey’s 39% experimenting cohort started with. The McKinsey distribution is the prior; the answers move the deployment toward the 23% scaling cohort or away from it.
The GAUGE diagnostic operationalises questions 1, 2, 3, 4, and 6 as a 30-45 minute working-group exercise the procurement committee can run with the vendor’s deployment team in the room. Question 5 (tested portability) is contract-level work and sits with legal rather than the diagnostic. Pilots scoring above 55 on GAUGE before the procurement decision are materially more likely to enter the 23% scaling cohort; pilots scoring below 40 enter the 39% experimenting cohort or the 38% deployed-and-stopped cohort.
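The score thresholds above can be read as a simple classifier over the McKinsey cohorts. A hedged sketch: only the 55/40 cut-offs come from the text; the function name and the treatment of the 40-55 band are assumptions.

```python
# Illustrative mapping from a GAUGE-style diagnostic score to the
# McKinsey cohort the piece argues it predicts.
def likely_cohort(gauge_score: float) -> str:
    if gauge_score > 55:
        return "scaling (23% cohort)"
    if gauge_score < 40:
        return "experimenting or deployed-and-stopped (39%/38% cohorts)"
    # The 40-55 band is not classified in the text; treating it as
    # "remediate and re-run" is this sketch's assumption.
    return "indeterminate: remediate lowest-scoring questions and re-run"
```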
What the data implies for Q2-Q4 2026 procurement
The McKinsey 23%/39%/38% distribution is unlikely to compress materially through 2026. The Gartner June 2025 cancellation projection (40%+ of agentic AI projects cancelled by end-2027) describes the same pattern from a forward-looking angle: a large share of the 39% experimenting cohort will end up in the 38% deployed-and-stopped cohort, not in the 23% scaling cohort. The procurement decisions made through 2026 are the input to the 2027 cancellation rate.
The procurement-committee implication is specific. A vendor pilot reference, even a strong one, is one input to the procurement decision rather than the central evidence. The central evidence is whether the procuring enterprise can answer the six pre-pilot questions in writing before the contract closes. Procurement decisions made on vendor references without the six answers are pricing the McKinsey 23% transfer rate as if it were 100%, which is the most common 2026 enterprise procurement mistake the data describes.
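Pricing the 23% transfer rate into the business case, instead of the implied 100%, is one line of arithmetic. An illustrative sketch with hypothetical figures:

```python
# Hedged sketch: discount the vendor-reference ROI by the measured
# pilot-to-production transfer rate. All figures are illustrative.
def risk_adjusted_value(pilot_annual_value: float,
                        transfer_rate: float = 0.23) -> float:
    """Expected annual value once the transfer probability is priced in."""
    return pilot_annual_value * transfer_rate

naive = risk_adjusted_value(1_000_000, transfer_rate=1.0)  # reference-language pricing
priced = risk_adjusted_value(1_000_000)                    # roughly 230,000
```

The gap between the two numbers is the quantity the six written answers are meant to close; each answered question is evidence for moving the deployment's transfer probability above the 23% base rate.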
For the broader procurement community the pattern matters at scale. If a meaningful share of the 39% experimenting cohort runs the six-question discipline before scaling, the cancellation rate Gartner projects compresses. If the discipline does not propagate, the 2027 cancellation projections are likely to land near the upper end of Gartner’s range.
Holding-up note
The primary claim of this piece (that vendor “successful pilot” references transfer to scaled production at the procuring enterprise at roughly the McKinsey 23% rate, and that the gap is tractable through six pre-pilot questions a procurement committee can require answered in writing) is on a 60-day review cadence. Three kinds of evidence would move the verdict.
A subsequent McKinsey, Forrester, or Gartner wave showing the 23%/39%/38% distribution compressing without the operational-discipline intervention would partially weaken the central claim. A cross-vendor procurement-outcome dataset showing pilot-to-production transfer rates materially higher than 23% at organisations not running the six-question discipline would materially weaken the piece. A named major 2026 launch (Salesforce Agentforce v2, Anthropic Computer Use commercial GA, Microsoft Copilot Studio enterprise tier) producing a documented walk-back analogous to Klarna would strengthen the piece’s framing of the reference-language risk.
If any land, the Holding-up record for AM-140 captures what changed, dated. Original claim stays visible. Nothing is quietly removed.