A vendor claim of 'ready-to-run' agentic AI is not procurement evidence, however the rate is described in marketing, unless it names (a) the specific task being measured, (b) the baseline against which accuracy is reported, and (c) the methodology by which the measurement was produced. The 2026 industry baseline for procurement-credible accuracy disclosure has two sides. On the vendor side it is the Anthropic Cohort A pattern (red-team rates with named attack corpus, pre/post-mitigation deltas, named patch cadence). On the methodology side it is the academic-benchmark pattern (CRMArena-Pro 35% multi-step reliability with defined CRM task corpus, CMU TheAgentCompany 30-35% reproduction range, WebArena ~36% browser-agent ceiling). Vendor 'ready-to-run' positioning without equivalent disclosure leaves the deploying enterprise to inherit the methodology gap as an audit-defense burden.
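The three-part test in the claim can be sketched as a simple completeness check. This is a minimal illustration only: the `AccuracyClaim` schema, its field names, and the example vendors are hypothetical and do not come from the register or from any vendor's documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccuracyClaim:
    """A vendor accuracy claim as it might sit in a procurement file (hypothetical schema)."""
    vendor: str
    stated_rate: str                    # headline figure, kept as quoted text
    task: Optional[str] = None          # (a) the specific task being measured
    baseline: Optional[str] = None      # (b) what the rate is reported against
    methodology: Optional[str] = None   # (c) how the measurement was produced


def missing_disclosures(claim: AccuracyClaim) -> list[str]:
    """Return the names of the disclosure elements the claim leaves blank."""
    required = ("task", "baseline", "methodology")
    return [name for name in required if not (getattr(claim, name) or "").strip()]


def is_procurement_evidence(claim: AccuracyClaim) -> bool:
    """All three elements must be named; the size of the rate is deliberately not checked."""
    return not missing_disclosures(claim)


# Hypothetical examples: a bare marketing claim and a fully disclosed one.
bare = AccuracyClaim(vendor="ExampleVendor A", stated_rate="ready-to-run")
full = AccuracyClaim(
    vendor="ExampleVendor B",
    stated_rate="35% multi-step reliability",
    task="multi-step CRM tasks",
    baseline="defined CRM task corpus",
    methodology="published benchmark protocol",
)

print(missing_disclosures(bare))      # all three element names
print(is_procurement_evidence(full))  # True
```

The check ignores the headline rate on purpose: under the claim, a high number with no named task, baseline, or methodology fails, while a modest number with all three passes.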

Claim created at publish; review on 60-day cadence.

Anchor sources:
CRMArena-Pro (Salesforce AI Research, August 2025; ~35% multi-step reliability on defined CRM task benchmark)
CMU TheAgentCompany academic benchmark (independent reproduction in the 30-35% range on adjacent enterprise workloads)
WebArena academic benchmark (browser-agent task completion in the high-30% range for frontier models)
SWE-bench / SWE-bench Verified (named-task code-generation benchmark with vendor-reported scores publishable against a fixed task set)
Anthropic published security disclosure on Claude for Chrome (26 Aug 2025, AM-009 anchor: 23.6% pre-mitigation, 11.2% post, 0% on URL-injection variants after patches)

Sister claims:
AM-005 (assistant vs agent procurement-decision distinction; assistant-class deployments have documented Lilli-pattern accuracy/adoption metrics; agent-class lacks equivalent)
AM-007 (vendor-response split for cross-agent class disclosure; Cohort A/B framing extends from security to accuracy)
AM-009 (Claude for Chrome disclosure pattern as the canonical Cohort A reference)
AM-130 (four evidence classes for procurement readers; CRMArena-Pro 35% as the structural-failure-mode anchor)
AM-140 (procurement-committee six pre-pilot questions; this claim adds three accuracy-disclosure questions on top)

Trigger conditions to revisit before next cadence:
(a) a major vendor publishes a procurement-grade accuracy disclosure with named task, baseline, and methodology that meets the Cohort A bar, which would substantially extend the named-success cohort
(b) a new academic benchmark replaces CRMArena-Pro / CMU / WebArena as the canonical reference and shifts the procurement-grade rate range materially
(c) a regulatory regime (EU AI Act post-market monitoring, US FTC, sectoral) imposes mandatory accuracy-disclosure requirements on commercial agentic AI products
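The pre/post-mitigation delta in the AM-009 anchor can be worked through directly from the two figures quoted above (23.6% before mitigation, 11.2% after). The helper below is an illustrative sketch; the function name is an assumption, and the derived numbers are plain arithmetic on the quoted rates, not figures taken from the Anthropic disclosure.

```python
def mitigation_delta(pre_pct: float, post_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative reduction in percent)."""
    if pre_pct <= 0:
        raise ValueError("pre-mitigation rate must be positive")
    absolute = pre_pct - post_pct          # percentage points removed
    relative = absolute / pre_pct * 100    # share of the original rate removed
    return round(absolute, 1), round(relative, 1)


# Rates quoted in the anchor sources: 23.6% pre-mitigation, 11.2% post-mitigation.
absolute, relative = mitigation_delta(23.6, 11.2)
print(f"absolute drop: {absolute} points, relative reduction: {relative}%")
# absolute drop: 12.4 points, relative reduction: 52.5%
```

Both numbers depend on the named attack corpus behind the rates, which is why the disclosure pattern pairs the delta with the corpus and the patch cadence rather than reporting the delta alone.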

Published: 09 May 2026
Last reviewed: 09 May 2026
Next review: 08 Jul 2026 (+20d)

Source piece: Agentic AI accuracy claims: the three questions every CIO should ask before 'ready-to-run' becomes a procurement decision

Permalink: /holding/AM-146/



About this register

The Reporting register tracks claims published from articles addressed to senior enterprise IT leaders — CIOs, IT directors, heads of platform. Claims are reviewed on a 30–90 day cadence; each review either reaffirms the claim, marks one substantive part as Partial, or marks it Not holding once the underlying evidence has been overtaken.
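The cadence rule can be checked against AM-146's own dates: a last review on 09 May 2026 with a 60-day cadence should give the listed next review of 08 Jul 2026. The sketch below is a hedged illustration; the function and constant names are assumptions, and it models only the date arithmetic, not the register's actual tooling.

```python
from datetime import date, timedelta

# Cadence bounds as stated for the Reporting register (30 to 90 days).
MIN_CADENCE_DAYS = 30
MAX_CADENCE_DAYS = 90


def next_review(last_reviewed: date, cadence_days: int) -> date:
    """Return the next review date, rejecting cadences outside the register's range."""
    if not MIN_CADENCE_DAYS <= cadence_days <= MAX_CADENCE_DAYS:
        raise ValueError(
            f"cadence must be {MIN_CADENCE_DAYS}-{MAX_CADENCE_DAYS} days, got {cadence_days}"
        )
    return last_reviewed + timedelta(days=cadence_days)


# AM-146: last reviewed 09 May 2026, 60-day cadence.
due = next_review(date(2026, 5, 9), 60)
print(due.strftime("%d %b %Y"))  # 08 Jul 2026
```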

Recent corrections in Reporting

AM-008 · Partial · 17 Jun 2026
Source-text figure re-review: Google's 2024 Environmental Report reports a 28% year-over-year increase to 8.1 billion gallons, not the 33% (from a 6.1 billion 2023 base) asserted at publish. The 8.1B 2024 figure and the Microsoft WUE 0.30 L/kWh / 39%-improvement figure are unchanged and verified. Article corrected to 28% and the unsupported 6.1B base removed; the claim text retains the original figure with this correction per the Holding-up protocol.
AM-132 · Partial · 10 Jun 2026
One of four legs unanchored on re-review. The claim text attributes '12% of deployments clearing 300%+ ROI with 88% at or below break-even at 12-18 months' to the Stanford DEL 2026 Enterprise AI Playbook. Full-text verification on 10 Jun 2026 found no such figure in that source: the playbook (Pereira, Graylin, Brynjolfsson, Apr 2026) studies 51 successful deployments by design and contains no ROI distribution, no 300%-plus cohort, and no break-even measurement point (full finding at AM-029, correction of 10 Jun 2026). The only verified figure carrying the same 12/88 numerals is IDC research with Lenovo (via CIO.com, Mar 2025): roughly 88% of AI proof-of-concepts never reach production and roughly 12% graduate — a pilot-to-production graduation metric, not an ROI distribution. The Gartner 28%, McKinsey 23%/17%, and MIT NANDA 95% legs verify; they support a small high-performing tail and a large struggling body, but none documents the two-peak bimodal shape the claim asserts. Status Up -> Partial.
AM-129 · Partial · 10 Jun 2026
One of three read-against anchors unanchored on re-review. The claim text cites 'Stanford Digital Economy Lab Enterprise AI Playbook (12/88 bimodal ROI distribution at 12-18 months)' and frames the realistic ROI band around 'the highest-discipline 12% cohort'. Full-text verification on 10 Jun 2026 found the playbook contains no 12/88 distribution, no bimodal ROI shape, and no 12-18-month ROI measurement point (full finding at AM-029, correction of 10 Jun 2026). The claim's core negative finding — no mid-market enterprise has produced a documented +240% ROI in 90 days under audited conditions — is unaffected; the McKinsey State of AI 2025 and MIT NANDA legs verify and continue to support it. The '12% cohort' framing has no verifiable referent. The only verified figure carrying the 12/88 numerals is IDC's pilot-graduation finding (roughly 88% of AI proof-of-concepts never reach production; via CIO.com, Mar 2025), a different metric. Status Up -> Partial.

Reviews coming up in Reporting

AM-063 · Holding · next +9d (27 Jun 2026)
AI agents executing financial transactions need a four-control bundle (action-approval gates by blast radius, kill-swit…
AM-061 · Holding · next +9d (27 Jun 2026)
Production agentic-AI costs at scale routinely run multiples of POC projections, and a layered optimisation programme c…
AM-003 · Partial · next +9d (27 Jun 2026)
GPT-5 Pro's tiered-subscription model forces enterprises to classify problems by computational difficulty — $200/month…

Referenced within Agent Mode AI by · 1 piece

Agentic AI accuracy claims: the three questions every CIO should ask before 'ready-to-run' becomes a procurement decision