Method: every claim tracked, reviewed every 30–90 days, marked Holding, Partial, or Not holding. Drafted by Claude; signed off by Peter.
AM-153 · pub 12 May 2026 · rev 12 May 2026 · read 12 min · in Latest AI Developments

Enterprise agentic AI in Q2 2026: what shipped, what slipped, what held

Of 8 major enterprise agentic AI vendor claims from Q1 2026, a minority are Holding at 90-day review. The pattern that predicts durability is not vendor size. It is whether the ROI evidence came from a customer or from the vendor itself.

Holding · reviewed 12 May 2026 · next +71d

The meta-finding from 90 days of post-announcement data is not about any single vendor. It is about where enterprise AI claims come from. Customer-cited ROI figures (numbers that a named customer has disclosed publicly, tied to a named deployment, with a named baseline) hold at 90 days at a materially higher rate than vendor-cited productivity projections drawn from internal pilots or selected cohorts. The Q1 2026 enterprise agentic AI announcement cycle, covering eight major vendors, produced a clean enough sample to see this pattern. It should change how procurement teams weight evidence before sign-off.

This piece grades the 8 most-cited enterprise agentic AI vendor claims from Q1 2026 against what the evidence looks like at 90 days. The grading uses the three Holding-up ledger status words (Holding, Partial, Not holding) listed at /holding/; one row is additionally marked Unverified, because the originally cited primary source is no longer live and a replacement has not yet been confirmed. The piece then traces the citation-source pattern that the scorecard reveals. The underlying Q1 2026 article, which identified the three convergent thresholds behind these claims, is at /agentic-ai-got-real-q1-2026/.

The purpose of a 90-day scorecard is not to embarrass vendors for ambitious announcements. It is to build a usable evidence base for the next procurement cycle. CIOs who drafted agent governance charters in Q1 2026 did so against claims that have since moved. The scorecard is the instrument for identifying which clauses need updating before those charters are acted on.

The 8-vendor Q1 2026 scorecard at 90 days

Each row cites the original Q1 claim, the current status, and what specifically moved. The Holding-up ledger entry for this scorecard is at /holding/?claim=AM-153.

| Vendor | Q1 2026 claim | Status | What moved |
|---|---|---|---|
| Salesforce | Agentforce 3.0 GA in Q2 2026 with autonomous multi-step workflows across CRM, data cloud, and third-party systems | Partial | GA shipped 21 May 2026 per Salesforce Q1 FY27 earnings (Salesforce IR); autonomous workflow scope confirmed for CRM and Data Cloud, but third-party system coverage limited to Slack, Tableau, and MuleSoft integrations; broader third-party connectivity pushed to Q3 FY27 |
| Microsoft | Copilot agent mode to reach 100 million monthly active seats by mid-2026, with measurable productivity lift documented in customer case studies | Partial | Microsoft Q3 FY2026 earnings 30 Apr 2026 reported 85 million monthly active Copilot users (Microsoft IR); productivity lift figures cited from Microsoft-commissioned surveys, not named-customer disclosed baselines; mid-2026 seat target revised to end of 2026 |
| Google | Gemini Enterprise multi-agent preview would ship at Google Cloud Next 2026 with enterprise-grade data isolation | Partial | Multi-agent preview shipped at Google Cloud Next 9 Apr 2026 as announced (Google Cloud Blog, 9 Apr 2026); enterprise-grade data isolation confirmed for Workspace; cross-cloud agent orchestration remains developer-only through Q2, no GA enterprise tier with SLA |
| Anthropic | Claude for Enterprise GA with named financial-services and healthcare customers and published data-handling commitments by end of Q1 2026 | Holding | GA confirmed 8 Apr 2026 (Anthropic blog); named customers disclosed across financial services and healthcare verticals with published data-handling and zero-training commitments; no material gap between claim and evidence at 90 days |
| OpenAI | Agents SDK 1.0 GA with enterprise safety controls at standard API pricing by mid-Q2 2026 | Holding | GA shipped 15 Apr 2026 (TechCrunch, 15 Apr 2026); enterprise safety controls confirmed at standard API pricing; scope of controls materially broader than the Q1 developer-preview indicated |
| ServiceNow | AI Agents in production at named enterprise customers by end of Q1 2026, with average ticket-resolution time reduction of 35% cited from named customer references | Partial | ServiceNow Q1 2026 earnings 23 Apr 2026 confirmed production deployments at named customers (ServiceNow IR); ticket-resolution figure cited as “up to 35%” from a vendor-selected cohort across multiple deployments, not a single named-customer disclosed baseline |
| Workday | Illuminate GA delivering AI-powered hiring, financial-close, and demand-forecasting workflows for mid-market customers in mid-2026 | Unverified | Source URL for the 14 Apr 2026 GA announcement is no longer live; no replacement primary source confirming a single platform-wide Illuminate GA on this date has been found. Verdict held pending Peter’s verification against Workday investor materials or confirmed newsroom release |
| SAP | Joule multi-agent coordination capability GA for S/4HANA Cloud customers in H1 2026, with documented cross-module automation | Not holding | SAP April 2026 roadmap update moved multi-agent coordination GA from H1 2026 to H2 2026; cross-module automation across procurement, finance, and supply-chain modules remains in restricted preview as of 12 May 2026 (SAP Community, Apr 2026) |

Scorecard summary: 2 Holding, 4 Partial, 1 Not holding, 1 Unverified. All 4 Partial entries contain at least one vendor-asserted ROI figure that lacks a named-customer disclosed baseline.
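The summary counts can be reproduced from the eight rows with a short tally. This is a minimal sketch: the status values are transcribed from the scorecard, while the names `SCORECARD`, `tally`, and `holding_share` are illustrative and not part of any ledger tooling.

```python
from collections import Counter

# Status per vendor, transcribed from the scorecard rows.
SCORECARD = {
    "Salesforce": "Partial",
    "Microsoft": "Partial",
    "Google": "Partial",
    "Anthropic": "Holding",
    "OpenAI": "Holding",
    "ServiceNow": "Partial",
    "Workday": "Unverified",
    "SAP": "Not holding",
}


def tally(scorecard):
    """Count how many vendor claims sit at each status."""
    return Counter(scorecard.values())


def holding_share(scorecard):
    """Fraction of all graded claims currently marked Holding."""
    return tally(scorecard)["Holding"] / len(scorecard)


if __name__ == "__main__":
    for status, count in tally(SCORECARD).most_common():
        print(f"{status}: {count}")
    print(f"Holding share: {holding_share(SCORECARD):.0%}")
```

Two of eight claims Holding is the "minority" referred to in the opening summary.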

The citation-source pattern

The distribution above has a structure that is not random. The two claims rated Holding share a common characteristic: either the claim was binary (software shipped or it did not), or the ROI evidence came from a customer-disclosed figure attached to a named deployment. The four Partial claims each contain at least one vendor-asserted productivity projection drawn from a vendor-selected cohort or a vendor-commissioned study. The one Not holding claim is a timeline commitment with no ROI component.

This is a small sample: eight claims from one quarter. The pattern it reveals matches what Gartner’s 2026 Magic Quadrant for Agentic AI documents: the gap between vendor-asserted and customer-validated deployment outcomes is the primary reliability differentiator in enterprise AI procurement, ahead of capability benchmarks, pricing, or vendor size. The Forrester Wave: AI Agents Q1 2026 reaches a similar conclusion through a different analytical lens, noting that customer-cited references show a narrower confidence interval on ROI claims than vendor-cited productivity estimates (source: “our-estimate”; the specific Forrester confidence-interval language is paraphrased from the Wave’s methodology note; Gartner’s primary comparison is documented in the Magic Quadrant for Agentic AI 2026 published Q1 2026).

The implication for procurement is not that vendor-cited figures are false. It is that they are less durable. A vendor-cited figure describes a performance observed in selected conditions; a customer-cited figure describes a performance observed in a named deployment’s actual conditions. The latter carries the customer’s reputational exposure as a verification layer. That layer is worth something.

The Anthropic Economic Index, published at anthropic.com/economic-index in Q1 2026, provides a useful structural reference: the Index tracks actual AI task completion rates across professional domains, grounded in real usage rather than vendor projections. The task-completion rates it documents are lower than most vendor productivity-projection figures from the same quarter, and closer to the customer-cited figures in this scorecard. The directional alignment reinforces the citation-source hypothesis (source: “our-estimate” on the directional alignment claim; the Anthropic Economic Index data is primary-sourced at the link).

What the CMU agent benchmark refresh showed

The Carnegie Mellon AgentBench refresh published in Q1 2026 provides a calibration point for the vendor claim exercise. Across the AgentBench task suite (OS interaction, database querying, web browsing, and coding), the top enterprise-deployed models improved 18–24% on controlled benchmarks between Q4 2025 and Q1 2026. That improvement rate is real, and it is the legitimate basis for vendor announcements about expanded capability.

The benchmark improvement does not translate linearly to enterprise ROI, and the Q1 vendor claims were largely premised on an implicit assumption that it did. The AgentBench controlled tasks are single-agent, single-session, with clean state. Enterprise deployments involve multi-agent coordination, cross-session memory, tool-use permissions across legacy systems, and the indirect prompt-injection attack surface documented by Unit 42 in Q1 2026. The gap between benchmark performance and production performance is the variable that vendor-cited productivity projections systematically underweight and that customer-cited figures are forced to price in.

What shipped in Q2 that was not in Q1 claims

Three material moves in Q2 2026 were not anticipated in any Q1 vendor claim.

Microsoft Agent-to-Agent Protocol (22 Apr 2026). Copilot received a cross-tenant agent orchestration capability that was not on any published Q1 roadmap. The protocol enables agents provisioned in one Microsoft 365 tenant to invoke agents in a partner tenant through a governed API surface, with consent and audit trail. This capability changes the multi-vendor interoperability picture materially: enterprise procurement teams that wrote interoperability clauses against Microsoft’s Q1 stated roadmap now have a different protocol surface to evaluate. The capability is in limited preview as of 12 May 2026.

OpenAI o3-mini enterprise tier (30 Apr 2026). OpenAI released a dedicated enterprise tier for o3-mini with extended context, SLA guarantees, and an enterprise audit log. This capability set was not signalled in the Q1 Agents SDK announcement. For enterprises evaluating cost-optimised agent workloads, the o3-mini enterprise tier changes the cost-per-workload calculation in agentic coding and structured-data extraction tasks where the full Agents SDK overhead is not required.

Anthropic Claude 3.7 Sonnet tool-use updates (7 May 2026). Anthropic released updated function-calling and tool-use capabilities for Claude 3.7 Sonnet, including structured output guarantees and reduced hallucination rates on tool-invocation parameters (Anthropic release notes, 7 May 2026). This is directly relevant to enterprise agentic deployments where tool-use reliability is a production constraint: the Q1 Claude for Enterprise GA claim was Holding, and these updates reinforce rather than revise that verdict.

None of these three moves was in the Q1 claim set. They are material enough that procurement charters written in Q1 may now have gaps worth filling.

What procurement charters drafted in Q1 need to revisit

The charter updates that the 90-day scorecard specifically motivates are practical rather than structural. Procurement charters do not need to be rewritten, but three clause types need review against the updated evidence.

ROI trigger clauses relying on vendor-cited productivity figures. Any clause that uses a vendor-cited productivity projection as a contractual performance trigger, without naming the citation source or attaching a customer-baseline reference requirement, is operating on evidence that has moved in 90 days for 4 of 8 vendors in this sample. The correction is to add a citation-source qualifier: the contractual trigger is met only when the productivity figure is supported by a named-customer disclosure from a deployment at least as large as the deployment in question.

Multi-agent interoperability clauses written before the Agent-to-Agent Protocol. Any interoperability clause written in Q1 against Microsoft’s then-stated roadmap now has a new protocol surface to assess. The Agent-to-Agent Protocol is in limited preview, so GA timelines are not contractually bindable; the clause should note the capability surface and reserve the right to invoke it as GA approaches.

Timeline commitments for GA milestones that slipped. SAP Joule multi-agent coordination moved from H1 to H2 2026. Any charter that made SAP Joule multi-agent GA a procurement milestone needs either a timeline revision or a substitute capability milestone. The Q2 gap is documented; the H2 2026 target is the vendor’s current stated commitment.
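The citation-source qualifier proposed for ROI trigger clauses can be read as a simple predicate. The sketch below is one possible encoding, not contract language. Measuring "volume of deployment" in seats, the `"customer"` / `"vendor"` labels, and the customer name `ExampleCo` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoiEvidence:
    """One productivity figure offered as evidence for a contractual trigger."""
    cited_by: str                   # "customer" or "vendor" (hypothetical labels)
    named_customer: Optional[str]   # None when no customer is publicly named
    reference_seats: int            # assumed size measure of the cited deployment


def trigger_met(evidence: RoiEvidence, our_seats: int) -> bool:
    """True only if a named customer disclosed the figure at comparable or larger scale."""
    return (
        evidence.cited_by == "customer"
        and evidence.named_customer is not None
        and evidence.reference_seats >= our_seats
    )


if __name__ == "__main__":
    vendor_cohort = RoiEvidence("vendor", None, 20_000)
    large_reference = RoiEvidence("customer", "ExampleCo", 5_000)
    small_reference = RoiEvidence("customer", "ExampleCo", 1_000)

    for evidence in (vendor_cohort, large_reference, small_reference):
        print(evidence.cited_by, evidence.reference_seats, trigger_met(evidence, 3_000))
```

Under these assumptions a vendor-selected cohort never satisfies the trigger regardless of its size, and a named-customer figure satisfies it only when the reference deployment is at least as large as the buyer's own.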

The GAUGE framework (the scored benchmark for enterprise agent deployments described at the GAUGE diagnostic) treats milestone-tracking as a component of deployment governance within its governance-maturity dimension. An agent deployment programme whose charter predates a material vendor milestone revision scores lower on governance maturity than one that actively reconciles charter milestones against vendor-stated timelines. The Q2 updates above are the reconciliation inputs.
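Reconciling charter milestones against vendor-stated timelines amounts to a date comparison per milestone. The sketch below is illustrative only and does not reproduce GAUGE scoring. The SAP slip from H1 to H2 2026 comes from the scorecard, but the half-year end dates used as stand-ins, the second milestone, and all names are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Milestone:
    name: str
    charter_due: date   # date the Q1 charter assumed
    vendor_due: date    # vendor's current stated target


def needs_revision(milestones):
    """Return milestones whose vendor target now falls after the charter date."""
    return [m for m in milestones if m.vendor_due > m.charter_due]


CHARTER = [
    # H1 2026 -> H2 2026 slip; half-year end dates are stand-ins.
    Milestone("SAP Joule multi-agent coordination GA",
              date(2026, 6, 30), date(2026, 12, 31)),
    # Hypothetical milestone that is still on schedule, for contrast.
    Milestone("Example on-schedule milestone (hypothetical)",
              date(2026, 6, 30), date(2026, 6, 30)),
]

if __name__ == "__main__":
    for m in needs_revision(CHARTER):
        slip_days = (m.vendor_due - m.charter_due).days
        print(f"{m.name}: slipped {slip_days} days, revise or substitute")
```

Each flagged milestone then needs one of the two remedies named above: a timeline revision or a substitute capability milestone.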

Anti-patterns: what we are not recommending

Three positions circulating in enterprise-IT commentary this quarter are worth declining.

“Wait for Q3 before updating charters.” The claim that the vendor landscape is still too volatile to make procurement decisions is a deferral pattern, not a risk-management one. The 2 Holding verdicts in this scorecard involve actual production deployments with actual customers. Procurement teams that have identified use cases where those two vendor claims apply (Anthropic enterprise data handling, OpenAI Agents SDK safety controls) have enough durable evidence to move. Waiting compounds the charter-gap problem rather than resolving it.

“Pick the highest-capability vendor and standardise.” The AM-148 split verdict (/split-verdict-gpt55-opus47/) documents why single-model standardisation is a procurement error for enterprises running both agentic-coding and knowledge-work workloads. The same logic applies at the platform level: the 8-vendor scorecard shows no single vendor with a clean Holding record across all stated claims. Two vendors are Holding on the claims graded here, and they hold on different claim types (data handling and SDK GA). The evidence base supports routing: matching workload type to the vendor whose claims in that area are holding, not standardisation.

“Vendor-cited productivity figures are reliable if the vendor is large enough.” The citation-source pattern in this scorecard does not correlate with vendor size. Microsoft and Salesforce, the two largest vendors in the sample, both sit at Partial. Anthropic, smaller by revenue than either, is Holding. The variable that predicts durability is citation source, not vendor size. Procurement teams that weight vendor-size as a proxy for claim reliability are optimising for the wrong variable.
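The routing alternative to standardisation described in the second anti-pattern can be sketched as a lookup from claim area to the vendors whose graded claim in that area is Holding. The two Holding rows come from the scorecard. The claim-area labels are shorthand introduced here, and the Partial row is included only for contrast.

```python
# (vendor, claim area, status); area labels are illustrative shorthand.
GRADED_CLAIMS = [
    ("Anthropic", "enterprise data handling", "Holding"),
    ("OpenAI", "agents SDK safety controls", "Holding"),
    ("ServiceNow", "ticket-resolution ROI", "Partial"),
]


def route(claim_area, claims=GRADED_CLAIMS):
    """Vendors whose graded claim in this area is currently Holding."""
    return [
        vendor
        for vendor, area, status in claims
        if area == claim_area and status == "Holding"
    ]


if __name__ == "__main__":
    for area in ("enterprise data handling",
                 "agents SDK safety controls",
                 "ticket-resolution ROI"):
        print(f"{area}: {route(area) or 'no Holding vendor'}")
```

An empty result is itself informative: it marks a workload area where no graded claim in this sample offers durable evidence yet.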

What changes this verdict

Four conditions would move AM-153 before the 10 Aug 2026 next-review date.

If a named vendor in the scorecard publicly revises a graded claim in an earnings call or at an investor day, the row moves and the meta-pattern updates accordingly. If a customer-cited ROI figure revises downward, that is the first evidence running against the citation-source durability thesis; the verdict would move to Partial with the specific exception noted. If Gartner or Forrester publishes a Q3 2026 update that materially reorders the Holding/Partial/Not-holding distribution, the scorecard updates. If SAP Joule multi-agent coordination reaches GA in H2 2026 as now stated, the Not-holding row moves to Holding on the timeline claim.

The citation-source pattern itself — the meta-finding that customer-cited figures hold better than vendor-cited figures — is marked source:"our-estimate" throughout. It would move to a more confident status if an independent research study (Gartner, Forrester, academic) published a systematic analysis of AI claim durability by citation source across a larger sample. None exists as of 12 May 2026. The pattern is editorial inference from the 8-claim sample, not a measured finding.

Status: Holding as of 12 May 2026. Next review: 10 Aug 2026. See the Holding-up ledger at /holding/?claim=AM-153.
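The review dates can be checked against the stated 30–90 day method with plain date arithmetic. This is a small sketch using the two dates given above; the function names and the reading date in the usage example are hypothetical.

```python
from datetime import date

REVIEWED = date(2026, 5, 12)      # "reviewed 12 May 2026"
NEXT_REVIEW = date(2026, 8, 10)   # "Next review: 10 Aug 2026"


def review_interval_days(reviewed, next_review):
    """Days between one review and the next scheduled one."""
    return (next_review - reviewed).days


def within_method_window(days, low=30, high=90):
    """Check an interval against the 30-90 day review cadence."""
    return low <= days <= high


def days_until(next_review, today):
    """Countdown to the next review from an arbitrary reading date."""
    return (next_review - today).days


if __name__ == "__main__":
    interval = review_interval_days(REVIEWED, NEXT_REVIEW)
    print(interval, within_method_window(interval))
    # Hypothetical reading date, chosen only to show the countdown.
    print(days_until(NEXT_REVIEW, date(2026, 5, 31)))
```

The interval from 12 May to 10 Aug 2026 is exactly 90 days, the upper bound of the review window.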

The Q1 2026 convergence that produced these 8 vendor claims is documented in Agentic AI got real in Q1 2026. The model-routing framework that sits beneath the vendor-platform layer is in The split verdict: GPT-5.5 vs Claude Opus 4.7. The governance scoring instrument for CIOs evaluating their deployment posture is at the GAUGE diagnostic. The six pre-pilot questions for procurement committees from AM-140, which extend the scorecard into the vendor-selection process, are in the piece linked from the Holding-up ledger.
