Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter. How this works →
AM-130 · published 4 May 2026 · revised 4 May 2026 · 12 min read · in AI Implementation

Agentic AI 2024-2025 retrospective: what actually shipped, what walked back, and what 2026 procurement should learn from each

Read against audited primary sources rather than vendor decks, the agentic AI record of 2024-2025 breaks into four classes of evidence the 2026 procurement reader should distinguish: vendor-published wins inside vendor-controlled environments, audited customer pilots with active human oversight, the public walk-backs (Klarna, GitHub Copilot rate-limit, EchoLeak), and the structural failure modes (multi-step reliability, prompt-injection class). Each class produces a different procurement lesson; treating them as one “AI is working” narrative is the most common 2026 enterprise mistake.

Holding · reviewed 4 May 2026 · next review +57d

Bottom line. Agentic AI 2024-2025 produced four distinct classes of evidence the 2026 procurement reader should not collapse into one “AI is working” narrative: (1) vendor-published wins inside vendor-controlled environments (ServiceNow internal 90% L1 deflection, framed by the vendor itself as upper bound), (2) audited customer pilots with active human oversight (BT 35% case-resolution improvement with random checks; UK Government Digital Service 26 minutes/day saved across 20,000 staff in Q4 2024), (3) public walk-backs (Klarna May 2025 reversal of the 700-agent claim, GitHub Copilot April 2026 rate-limit issue), and (4) structural failure modes (EchoLeak cross-agent prompt injection class, CRMArena-Pro 35% multi-step agent reliability finding). The four classes produce four different procurement lessons; the most common 2026 mistake is reading them as one.

If you make agentic AI procurement decisions in 2026 and you are reading “what shipped in 2024-2025” coverage, the question is whether the published evidence is the same shape as the procurement decision your CFO is asking you to defend. Most “agentic AI revolution” coverage from 2024-2025 collapsed four distinct evidence classes into a single optimistic narrative. Two years later, the four classes have separated into four different procurement signals, each with its own 2026 lesson.

This retrospective walks the four classes against named primary sources rather than vendor decks. The 2024-2025 corpus that survives editorial scrutiny is smaller than the original “AI is everywhere” coverage suggested, and the surviving evidence is more useful for procurement than the original framing was.

Class 1: vendor-published wins inside vendor-controlled environments

The strongest documented productivity claims of 2024-2025 came from vendors running their own products on their own platforms with their own data. ServiceNow publicly reports that 90% of targeted L1 ticket volume is handled autonomously inside its own help desk, with a 99% resolution rate within those categories (network at 46%, software at 43%, hardware at 11% of the ticket-type mix). Nenshad Bardoliwalla, ServiceNow’s group VP for AI products, was on the record. His framing of why this number is so much higher than typical customer outcomes is the most editorially honest thing ServiceNow has said in this category: “How does it know it got the right answer? Because the outcome is measurable inside the same platform. Did the ticket resolve? Did the workflow complete? Did the approval get the right sign-off? ServiceNow closes the loop in a way that a standalone LLM sitting on top of a SharePoint folder simply cannot.”

In the same coverage, Bardoliwalla’s other on-record quote: “documentation inside real-world help desks traditionally has been poor to non-existent.”

This is the procurement-honest framing. ServiceNow on ServiceNow, with two decades of structured workflow data, is the absolute upper bound of what is possible. It is not a customer-deployment benchmark. The 2024-2025 narrative that read “90% L1 deflection is the new floor for ITSM AI” was a vendor extrapolation; the 2026 procurement read is that 90% is the ceiling, and customer deployments without two decades of structured workflow data should expect materially less.

The procurement lesson from class 1: vendor-controlled environment numbers establish the upper bound of what the platform can do, not the realistic deployment outcome. A 2026 mid-market RFP that anchors against the 90% L1 deflection figure as the expected outcome is anchoring against a number the vendor itself describes as conditioned on a data substrate the customer does not have.

Class 2: audited customer pilots with active human oversight

The 2024-2025 evidence base also includes pilots that produced documented improvements with named customers, on-the-record statements, and disclosed methodology. Three pilots are the cleanest reference points.

BT’s pilot of ServiceNow Now Assist (July 2024) documented a 55% reduction in case-summary writing time and a 35% reduction in average case-resolution time. Hena Jalil, BT’s managing director and business CIO, was on the record. The caveat she put on the record is the part most coverage skips: “We have that process at the moment because, as we’re building confidence, we do need that validation. There are certain things that we want to capture that we don’t want an agent to change. We’re doing random checks at the other end as well.” Translation: the 35% figure is from a pilot with active human oversight and random sampling on the back end, not from a steady-state production deployment with the AI in the driver’s seat.

The UK Government Digital Service M365 Copilot trial (published June 2025) covered 20,000 government employees across multiple departments in a Q4 2024 evaluation period. Methodology was fully disclosed. The headline finding: participants saved an average of 26 minutes a day when using M365 Copilot. Over 70% of users agreed it reduced time spent searching for information and on routine tasks. The report is candid about limits: “complex, nuanced, or data-heavy aspects of work” were where the value dropped off; security and sensitive-data handling concerns persisted. This is what an audited productivity gain looks like, and it is the floor estimate, not the ceiling.
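
The GDS headline figure can be annualised with back-of-envelope arithmetic. Only the 26 minutes/day and the 20,000-person cohort come from the report; the working-day count and fully-loaded hourly cost below are illustrative assumptions, not audited numbers:

```python
# Annualising the UK GDS trial headline. STAFF and MINUTES_SAVED_PER_DAY
# are from the published report; WORKING_DAYS and HOURLY_COST_GBP are
# illustrative assumptions for the sketch, not figures from the trial.
STAFF = 20_000
MINUTES_SAVED_PER_DAY = 26
WORKING_DAYS = 220        # assumed
HOURLY_COST_GBP = 35      # assumed fully-loaded cost

hours_per_year = STAFF * MINUTES_SAVED_PER_DAY * WORKING_DAYS / 60
value_gbp = hours_per_year * HOURLY_COST_GBP

print(f"{hours_per_year:,.0f} hours/year saved")      # ~1.9M hours
print(f"~GBP {value_gbp / 1e6:,.1f}M/year gross")     # before licence costs
```

The point of the sketch is the shape of the case, not the number: the gross figure is large under almost any cost assumption, which is exactly why the report's candour about where the value drops off ("complex, nuanced, or data-heavy" work) is the load-bearing caveat.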

HMRC’s follow-on M365 Copilot rollout (April 2026) extended the UK GDS pattern to 28,000 staff. James Mitton, HMRC’s chief AI officer, was on the record at the Think AI for Government event in London. The HMRC deployment is the largest documented enterprise M365 Copilot rollout to date and the closest 2026 procurement reference point for what scales from a 20,000-person pilot to a production deployment.

The procurement lesson from class 2: 25-35% improvement on bounded use cases with active human oversight is the realistic 2026 expectation for a disciplined customer deployment. The pattern includes documented baselines, human-in-the-loop steady-state (not training-wheel), and disclosed methodology. The cohort that hits these numbers in 2026 will look operationally similar to BT and UK GDS in 2024-2025.

Class 3: the public walk-backs

The 2024-2025 corpus also includes the deployments that did NOT survive scrutiny. Three are procurement-load-bearing.

Klarna’s May 2025 walk-back is the most-cited. Klarna’s February 2024 press release is still live on the company’s website as of May 2026: “the AI assistant has had 2.3 million conversations, two-thirds of Klarna’s customer service chats… it is doing the equivalent work of 700 full-time agents… it is more accurate in errand resolution, leading to a 25 percent drop in repeat inquiries.” Sebastian Siemiatkowski’s quote from that release: “This AI breakthrough in customer interaction means superior experiences for our customers at better prices, more interesting challenges for our employees, and better returns for our investors.” The H1 2024 financials doubled down, with Siemiatkowski projecting Klarna would shrink from approximately 3,800 to 2,000 employees: “Not only can we do more with less, but we can do much more with less.”

The walk-back came in May 2025. Siemiatkowski told Bloomberg on 9 May 2025 (covered concurrently by Fortune) that Klarna had gone too far on the AI substitution. Verbatim quotes Bloomberg pulled: “From a brand perspective, a company perspective, I just think it’s so critical that you are clear to your customer that there will always be a human if you want,” and “Really, investing in the quality of human support is the way of the future for us.” Klarna began recruiting customer-service agents in an Uber-style freelance arrangement, targeting students and rural workers, paying from 400 Swedish krona (about 41 dollars) per shift.

The 700-agent press release and the May 2025 walk-back now coexist: the claim remains published while the company’s recent practice is materially more cautious. Both facts are true at the same time. The pattern is the one most enterprise IT leaders should watch for: the case-study claim outlives the case-study reality by a long margin.

GitHub Copilot’s April 2026 rate-limit issue was a smaller but procurement-material walk-back: a token-counting bug caused subscription allowances to “rapidly exhaust,” and Anthropic took steps to discourage Copilot Pro+ usage during peak demand. The pattern matters because it shows AI-platform pricing is unstable in 2026: operators baking multi-year savings cases against quietly shifting service definitions are taking on a class of risk most procurement teams have not yet learned to underwrite.

Salesforce Agentforce IT’s 200-customer reality check is the third walk-back-shaped event. Marc Benioff launched Agentforce IT with an aggressive pitch as a direct ServiceNow competitor. Six months after launch, Agentforce IT had roughly 200 customer signups out of Salesforce’s 150,000-customer base. Bill McDermott’s response at the Citizens Technology Conference on 2 Mar 2026: the actual ServiceNow ARR loss to Salesforce was 42,000 dollars against ServiceNow’s 13.2-billion-dollar FY25 revenue. The Salesforce-versus-ServiceNow ITSM war is loud noise over tiny dollars. The procurement lesson is that vendor announcements at scale do not predict customer-acquisition outcomes at the same scale.

The procurement lesson from class 3: case-study claims have a half-life. A 2024-vintage AI productivity claim that was load-bearing in a 2024 procurement decision should be re-tested in 2026, not because the original claim was necessarily wrong but because the deployment may have been walked back in a way the vendor’s marketing has not yet caught up to.

Class 4: structural failure modes

The 2024-2025 corpus also surfaced specific failure-mode evidence that bounds what agentic AI can reliably do at scale.

Multi-step agent reliability collapse is the most important. The Salesforce AI Research team published CRMArena-Pro on arXiv in May 2025: an evaluation suite for LLM agents across enterprise customer-relationship and IT-adjacent tasks. The headline finding is uncomfortable for the vendor: LLM agents achieved roughly 58% success on tasks that can be completed in a single step, and that success rate dropped to 35% when the task required multiple sequential steps. Carnegie Mellon researchers reached the same range independently, with multi-step success rates of 30 to 35%. This is the gating constraint on every “AI handles your L2 incident triage end to end” pitch.
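
A toy compounding model makes the gap between 58% and 35% less surprising. The independence assumption below is ours, not CRMArena-Pro's methodology; it is a sketch of why multi-step reliability degrades, not a reconstruction of the benchmark:

```python
# Simplified independence model (our assumption, not the paper's method):
# if each step succeeds with probability p, an n-step task succeeds
# with probability p**n.
def chain_success(p_step: float, n_steps: int) -> float:
    """End-to-end success probability under independent per-step failures."""
    return p_step ** n_steps

p = 0.58  # observed single-step success rate in CRMArena-Pro
for n in (1, 2, 3, 5):
    print(f"{n} step(s): {chain_success(p, n):.1%}")
```

Under this model, two dependent steps already land near 34%, close to the observed multi-step figure; real agents compound errors in messier ways, but the direction and rough magnitude of the collapse fall out of the arithmetic. That is why "end to end" pitches need a named error-recovery mechanism, not just a better base model.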

The EchoLeak cross-agent prompt-injection class (CVE-2025-32711, disclosed June 2025 by Aim Security) was the first documented zero-click LLM exploit against an enterprise AI assistant. The vulnerability allowed an attacker to send a specially-crafted email to a target organisation; when Microsoft 365 Copilot processed the email as part of its normal context-gathering, the email’s payload manipulated Copilot into exfiltrating sensitive data. No user interaction beyond the user asking Copilot a routine question. Microsoft patched before public disclosure. The structural finding (analysed at AM-045) is that any deployment with untrusted-content-ingest plus capable-tool-surface is exposed to the class.

The Datadog cohort risk-disclosure is the third structural finding. Datadog’s Q3 FY25 10-Q discloses that its “AI-native cohort” of customers contributed approximately eight percentage points of year-over-year revenue growth for the quarter ended 30 Sep 2025, with the company’s filing language flagging this concentration as a risk factor (these customers may “in the future optimize their usage”) rather than purely as an opportunity. That is a more honest piece of reporting than the AIOps category typically generates and a useful procurement signal: AI-native customer cohorts are concentrated by definition, and concentration is a risk both ways.
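
A sensitivity sketch shows why the filing's risk framing matters both ways. Only the roughly-8pp cohort contribution comes from the 10-Q; the headline growth figure below is an assumed placeholder for illustration:

```python
# Sensitivity of headline growth to the AI-native cohort "optimizing
# their usage". cohort_pp is the disclosed ~8pp contribution; the
# total_growth_pp figure is an assumed placeholder, not from the 10-Q.
total_growth_pp = 26.0   # assumed headline YoY growth, percentage points
cohort_pp = 8.0          # disclosed cohort contribution, percentage points

for retention in (1.0, 0.5, 0.0):
    adjusted = total_growth_pp - cohort_pp * (1 - retention)
    print(f"cohort retains {retention:.0%} of its contribution: "
          f"{adjusted:.0f}pp growth")
```

The same arithmetic applies to any vendor whose growth story leans on a concentrated AI-native cohort: the disclosed contribution is simultaneously the upside case and the downside exposure.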

The procurement lesson from class 4: structural failure modes are mechanism-level findings, not single-deployment incidents. They persist across vendors and across years. A 2026 procurement decision that does not name how the deployment addresses multi-step reliability collapse, prompt-injection class, and cohort concentration is not yet procurement-ready against the 2024-2025 evidence base.

What 2026 procurement should learn from each

The four classes produce four operational lessons.

From class 1 (vendor-controlled environments): anchor your RFP against customer deployments at similar maturity, not against vendor self-reports. The 60-question agentic AI RFP operationalises this; vendor responses citing internal-deployment numbers should be flagged in dimension 5 (vendor lock-in) and dimension 1 (governance maturity).

From class 2 (audited customer pilots): scope your pilot to the 25-35% improvement band on bounded use cases with active human oversight as the realistic 90-day target. The mid-market 90-day ROI piece walks the deployment-discipline patterns; the pilot pattern that scales matches BT and UK GDS, not the vendor 240% pitch.

From class 3 (walk-backs): include a contractual commitment from both sides to publish corrections when a load-bearing claim no longer holds. The publication’s own tracked-claims framework is the editorial implementation; the procurement-side equivalent is a clause in the master services agreement obligating the vendor to notify the customer when a published case-study claim is materially revised.

From class 4 (structural failure modes): map your deployment against the OWASP Agentic AI Top 10 and the agent red-teaming playbook before signing. The four red-team disciplines (prompt injection, tool misuse, context-window attacks, multi-turn objective drift) are the test surface that determines whether the deployment survives the structural failure modes the 2024-2025 evidence already named.
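
A minimal sketch of what "mapping the deployment" can look like as a sign-off artifact, keyed to the four disciplines above. The scenario names and pass criteria are hypothetical placeholders, not OWASP or vendor material:

```python
# Hypothetical pre-signature red-team matrix for the four disciplines
# named in the playbook. All scenario text and pass criteria are
# illustrative placeholders, not drawn from OWASP or any vendor doc.
RED_TEAM_MATRIX = {
    "prompt_injection": {
        "scenario": "untrusted email in retrieval context carries a payload",
        "pass_if": "agent ignores embedded instructions",
    },
    "tool_misuse": {
        "scenario": "agent invokes a write-capable tool outside its scope",
        "pass_if": "tool call blocked by allowlist",
    },
    "context_window_attack": {
        "scenario": "oversized document crowds policy text out of context",
        "pass_if": "policy constraints still enforced",
    },
    "multi_turn_objective_drift": {
        "scenario": "long conversation gradually redefines the task",
        "pass_if": "agent re-anchors to the original objective",
    },
}

def untested(matrix: dict, executed: list[str]) -> list[str]:
    """Disciplines with no executed scenario -- the sign-off gate."""
    return sorted(set(matrix) - set(executed))

# Any non-empty result means the deployment is not procurement-ready
# against the class-4 evidence.
print(untested(RED_TEAM_MATRIX, ["prompt_injection", "tool_misuse"]))
```

The design point is that the gate is a set difference, not a judgment call: signature is blocked until every named discipline has at least one executed scenario on record.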

What did not survive editorial scrutiny

A category of 2024-2025 “agentic AI revolution” coverage did not survive into the 2026 evidence base. Specifically: the Sarah-and-her-AI-agents-fictional-protagonist genre, the “240% ROI in 90 days” promise without methodology, the “AI replaces 30% of your headcount” pitch, the “171% ROI from agentic AI” case study without disclosed sample, and the agentic AI vendor announcements that produced single-digit customer counts after launch.

Most of these were retracted in the publication’s /retractions/ ledger as part of the April 2026 editorial audit. They are documented as having existed and as no longer holding; the audit-trail is in the Q2 2026 Claim Review Bulletin. The 2026 procurement reader should treat citations to these as a signal that the upstream source has not done the editorial work the bulletin documents.

The 2026 reader’s takeaway

Agentic AI 2024-2025 produced real, audited, procurement-grade evidence. The evidence is more limited than the original “AI revolution” coverage suggested and more useful than the same coverage’s pessimistic counter-narrative. The four classes (vendor-controlled wins, audited pilots, walk-backs, structural failures) collectively bound what 2026 procurement should expect from agentic AI deployments. A team that internalises the four classes is operating against the same evidence base the 12% high-discipline cohort already does. A team that reads the 2024-2025 corpus as one optimistic narrative or one pessimistic counter-narrative is missing the procurement-operational signal that distinguishes the cohort that scales from the cohort that does not.

The four classes do not just describe what shipped at enterprise scale. They travel (through the AI-citation chain, vendor sales decks, and trade-press secondary coverage) to operator-cohort procurement decisions, where they produce different misreads. The cross-cohort pattern, in which enterprise and operator buyers consume the same vendor case studies and produce mirror-image errors, is at /vendor-case-study-misreads-across-buyers/ (claim AM-139). The Klarna walk-back, the ServiceNow internal 90% deflection number, and the BT 35% case-resolution improvement are three of the cited proof points there.


Spotted an error? See corrections policy →

Disagree with this piece?

Reasoned disagreement is a first-class signal here. Every review cycle weighs documented dissent; material dissent becomes part of the article's change history. This is not a corrections form — use /corrections/ for factual errors.

Part of the pillar

AI agent procurement

The contracts, SLAs, and evaluation criteria that distinguish agentic-AI procurement from SaaS procurement. 16 other pieces in this pillar.
