Skip to content
Holding·last review09 May 2026

A vendor claim of 'ready-to-run' agentic AI that does not name (a) the specific task being measured, (b) the baseline against which accuracy is reported, and (c) the methodology by which the measurement was produced is not procurement evidence regardless of how the rate is described in marketing; the 2026 industry baseline for procurement-credible accuracy disclosure is the Anthropic Cohort A pattern (red-team rates with named attack corpus, pre/post-mitigation deltas, named patch cadence) on the vendor side and the academic-benchmark pattern (CRMArena-Pro 35% multi-step reliability with defined CRM task corpus, CMU TheAgentCompany 30-35% reproduction range, WebArena ~36% browser-agent ceiling) on the methodology side; vendor 'ready-to-run' positioning without equivalent disclosure leaves the deploying enterprise inheriting the methodology gap as an audit-defense burden.

Claim created at publish; review on 60-day cadence. Anchor sources: CRMArena-Pro (Salesforce AI Research, August 2025; ~35% multi-step reliability on defined CRM task benchmark); CMU TheAgentCompany academic benchmark (independent reproduction in the 30-35% range on adjacent enterprise workloads); WebArena academic benchmark (browser-agent task completion in the high-30% range for frontier models); SWE-bench / SWE-bench Verified (named-task code-generation benchmark with vendor-reported scores publishable against a fixed task set); Anthropic published security disclosure on Claude for Chrome (26 Aug 2025, AM-009 anchor: 23.6% pre-mitigation, 11.2% post, 0% on URL-injection variants after patches). Sister claims: AM-005 (assistant vs agent procurement-decision distinction; assistant-class deployments have documented Lilli-pattern accuracy/adoption metrics; agent-class lacks equivalent), AM-007 (vendor-response split for cross-agent class disclosure; Cohort A/B framing extends from security to accuracy), AM-009 (Claude for Chrome disclosure pattern as the canonical Cohort A reference), AM-130 (four evidence classes for procurement readers; CRMArena-Pro 35% as the structural-failure-mode anchor), AM-140 (procurement-committee six pre-pilot questions; this claim adds three accuracy-disclosure questions on top). Trigger conditions to revisit before next cadence: (a) a major vendor publishes a procurement-grade accuracy disclosure with named task, baseline, and methodology that meets the Cohort A bar — substantially extends the named-success cohort; (b) a new academic benchmark replaces CRMArena-Pro / CMU / WebArena as the canonical reference and shifts the procurement-grade rate range materially; (c) a regulatory regime (EU AI Act post-market monitoring, US FTC, sectoral) imposes mandatory accuracy-disclosure requirements on commercial agentic AI products.

Published
09 May 2026
Last reviewed
09 May 2026
Next review
+38d· 08 Jul 2026
Embed this claimiframe + oEmbed
HTML iframe
Paste-the-URL (Substack, Medium, Notion, WordPress)

The card auto-updates when the claim's status, last-reviewed date, or correction log changes. Embedders never need to refresh — the card is rendered live from the canonical record.

Watch this claim

Email-me when AM-146's status, next review date, or correction log changes. One email per change. No newsletter subscription, no other mail.

The claim: A vendor claim of 'ready-to-run' agentic AI that does not name (a) the specific task being measured, (b) the baseline against which accuracy is reported, and (c) the methodology by which the measurement was produced is not procurement evidence regardless of how the rate is described in marketing; the 2026 industry baseline for procurement-credible accuracy disclosure is the Anthropic Cohort A pattern (red-team rates with named attack corpus, pre/post-mitigation deltas, named patch cadence) on the vendor side and the academic-benchmark pattern (CRMArena-Pro 35% multi-step reliability with defined CRM task corpus, CMU TheAgentCompany 30-35% reproduction range, WebArena ~36% browser-agent ceiling) on the methodology side; vendor 'ready-to-run' positioning without equivalent disclosure leaves the deploying enterprise inheriting the methodology gap as an audit-defense burden.

About this register

The Reporting register tracks claims published from articles addressed to senior enterprise IT leaders — CIOs, IT directors, heads of platform. Claims are reviewed on a 30–90 day cadence; each review either reaffirms the claim, marks one substantive part as Partial, or marks it Not holding once the underlying evidence has been overtaken.

Recent corrections in Reporting

  • AM-003 · Partial · 28 May 2026

    Pricing/model drift: a $100/mo Pro tier now sits beside the $200 tier (added 9 Apr 2026) and the premium model is GPT-5.5 Pro. Core thesis holds; the single-$200-tier framing no longer matches. Re-verify current tiers at chatgpt.com/pricing.

  • AM-002 · Not holding · 06 May 2026

    URL state changed. The /the-agentic-ai-revolution-real-world-success-stories-and-strategic-insights-from-2024-2025/ slug now serves a deliberately rewritten retrospective (claimId AM-130, "Agentic AI 2024-2025 retrospective", published 04 May 2026) against audited primary sources. The 28 Apr 2026 redirect to /retractions/ has been lifted to allow that. AM-002 the claim remains Not holding — the original $3.50/dollar + 70% failure-rate framing was withdrawn and is not restored. AM-130 is a separate claim with its own evidence chain. Readers arriving at /holding/AM-002 see the withdrawal here; the article link surfaces the new piece at the URL the original lived at, with this entry as the audit trail.

  • AM-121 · Holding · 2 May 2026

    Klarna walk-back primary-source upgrade — added Siemiatkowski verbatim quotes via Bloomberg-cited-by-Fortune (9 May 2025) and the Uber-style freelance hiring detail via Entrepreneur. Closes the highest-priority evidence gap from the source dossier.

Reviews coming up in Reporting

  • AM-136 · Holding · next +4d (4 Jun 2026)

    Across the 24-month window May 2024 to April 2026, every major foundation-model provider (Anthropic, OpenAI, Google, AW…

  • AM-020 · Holding · next +18d (18 Jun 2026)

    The 40-60% TCO underestimate on enterprise agentic-AI deployments is not a cost-visibility failure — it is a cross-depa…

  • AM-023 · Holding · next +18d (18 Jun 2026)

    The 10 Apr 2026 Google AI Mode rollout to eight markets is the first vertical (restaurant booking) where agentic search…

Referenced within Agent Mode AI by · 1 piece