AM-148 · Holding · last review 10 May 2026

GPT-5.5 (released 23 Apr 2026) and Claude Opus 4.7 (released 16 Apr 2026) are not substitutable models for an enterprise running both agentic-coding workloads and knowledge-work workloads in 2026. GPT-5.5 leads the public evaluation evidence on agentic-coding and computer-use surfaces (Terminal-Bench 2.0 82.7% vs 69.4%; GDPval 84.9% vs 80.3%; FrontierMath Tiers 1-3 51.7% vs 43.8%) and uses roughly 72% fewer output tokens than Opus 4.7 on identical coding tasks, per Artificial Analysis. Opus 4.7 leads the public evaluation evidence on contamination-resistant coding, finance, and vision-reasoning surfaces (SWE-Bench Pro 64.3% vs GPT-5.4 57.7%; Finance Agent v1.1 64.4%; CharXiv reasoning 78.3%; GPQA Diamond 94.2%) and records a 36% AA-Omniscience hallucination rate against GPT-5.5's 86% on the same independent evaluation. That 50 percentage-point spread is the load-bearing data point of any 2026 single-model standardisation decision. The procurement-architecture answer for an enterprise running both workload types is three-tier routing (GPT-5.5 with Codex for agentic coding; Opus 4.7 plus retrieval augmentation for knowledge work; Mythos-via-Glasswing or Opus 4.7 with verification layer for frontier and high-stakes-verification work), not single-model standardisation.
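The three-tier routing named in the claim can be pictured as a lookup from workload type to model tier. The sketch below is illustrative only: the `Workload` enum, the `ROUTES` table, and the `route` function are hypothetical names, and the tier strings are descriptive labels copied from the claim, not vendor API model identifiers.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories, named after the claim's three tiers."""
    AGENTIC_CODING = "agentic_coding"
    KNOWLEDGE_WORK = "knowledge_work"
    HIGH_STAKES_VERIFICATION = "high_stakes_verification"


# Tier labels transcribed from the claim text. They are descriptions,
# not real API model names; a deployment would map them to its own config.
ROUTES = {
    Workload.AGENTIC_CODING: "GPT-5.5 with Codex",
    Workload.KNOWLEDGE_WORK: "Opus 4.7 plus retrieval augmentation",
    Workload.HIGH_STAKES_VERIFICATION: (
        "Mythos-via-Glasswing or Opus 4.7 with verification layer"
    ),
}


def route(workload: Workload) -> str:
    """Return the tier label the claim assigns to a workload type."""
    return ROUTES[workload]


if __name__ == "__main__":
    for w in Workload:
        print(f"{w.value}: {route(w)}")
```

How an incoming task gets classified into one of the three workload types is left open here; the claim does not specify a classifier.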

Claim created at publish; reviewed on a 60-day cadence (the frontier minor-cycle release tempo is six weeks, so 60 days covers roughly one minor-cycle window with margin).

Anchor sources cited inline in the article:

  • Artificial Analysis AA-Omniscience evaluation (the load-bearing hallucination spread)
  • Anthropic Opus 4.7 announcement (16 Apr 2026)
  • OpenAI GPT-5.5 announcement (23 Apr 2026)
  • OpenAI GPT-5.5 system card (the reference against which the '60% reduction in hallucinations' press-cycle figure is tracked as under-supported)
  • Vellum and llm-stats benchmark consolidations
  • Vals.ai SWE-Bench Verified leaderboard (the independent decontaminated run that narrows the apparent vendor-card gap from ~1.1 points to ~0.6 points)
  • CodeRabbit third-party PR-review evaluations of GPT-5.5 and Opus 4.7
  • Artificial Analysis Opus 4.7 explainer covering the long-context retrieval recalibration (78.3% on Opus 4.6 to 32.2% on Opus 4.7, attributed to the model now reporting errors when information is missing rather than fabricating answers)
  • The Decoder coverage of the GPT-5.5 token-efficiency framing (~40% fewer output tokens than GPT-5.4, supporting OpenAI's '~20% effective net cost increase' claim)
  • aibreakingwire reporting on OpenAI dropping SWE-Bench Verified from system-card disclosures over contamination concerns
  • Anthropic's published filter-rescore analysis showing Opus 4.7's margin over Opus 4.6 holds on the SWE-bench memorisation-flagged subset
Sister claims:

  • AM-147: Firefox 150 / Claude Mythos disclosure as the canonical agentic-verification reference cited in the third-tier routing section
  • AM-146: vendor 'ready-to-run' accuracy claims need named task / baseline / methodology; the AA-Omniscience spread is exactly that disclosure on hallucination
  • AM-145: vendor switching is bound by contract, not technical migration cost; relevant to the multi-vendor routing argument
  • AM-140: procurement-committee six pre-pilot questions; this claim adds the model-routing question on top
  • AM-130: procurement reader's four evidence classes; the AA-Omniscience benchmark sits in the 'independent third-party evaluation' class

Trigger conditions to revisit before next cadence:

  (a) a re-run of AA-Omniscience showing the GPT-5.5 / Opus 4.7 hallucination spread compressed to under 25 percentage points; at that gap the single-model-standardisation case becomes defensible again and the routing read needs reframing
  (b) a new model release in the GPT-5.6 / Opus 4.8 / Gemini 3.2 slot that materially reorders either the agentic-coding or the knowledge-work leaderboard
  (c) Claude Mythos Preview moving out of Glasswing-gated access into general API availability, which collapses the third-tier routing question into the second-tier one for most enterprises
  (d) an independent decontaminated benchmark run (Vals.ai, third-party academic) that overturns the directional reading on a load-bearing category, particularly on Finance Agent v1.1 or AA-Omniscience
  (e) a vendor disclosure from either Anthropic or OpenAI of an additional contamination signal on the SWE-Bench leaderboard that changes the procurement-defensibility reading on either side

The 60% hallucination-reduction press-cycle figure attached to GPT-5.5 is tracked as under-supported throughout: it does not appear in OpenAI's system card, which reports a 23% improvement in per-claim factual accuracy and a 3% reduction in per-response error rate.
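Trigger condition (a) reduces to a threshold check on the hallucination spread. The sketch below takes the two AA-Omniscience rates and the 25 percentage-point threshold from this claim; the function names (`spread_pp`, `needs_revisit`) are hypothetical, and any re-run figures fed in are for the reviewer to supply.

```python
# Rates and threshold as stated in the claim (percentages).
GPT_5_5_HALLUCINATION_RATE = 86.0
OPUS_4_7_HALLUCINATION_RATE = 36.0
REVISIT_BELOW_PP = 25.0  # trigger (a): spread compressed to under 25 points


def spread_pp(rate_a: float, rate_b: float) -> float:
    """Absolute gap between two rates, in percentage points."""
    return abs(rate_a - rate_b)


def needs_revisit(rate_a: float, rate_b: float,
                  threshold_pp: float = REVISIT_BELOW_PP) -> bool:
    """True when the spread has fallen under the revisit threshold."""
    return spread_pp(rate_a, rate_b) < threshold_pp


if __name__ == "__main__":
    current = spread_pp(GPT_5_5_HALLUCINATION_RATE, OPUS_4_7_HALLUCINATION_RATE)
    print(f"current spread: {current:.1f} pp")
    print("revisit:", needs_revisit(GPT_5_5_HALLUCINATION_RATE,
                                    OPUS_4_7_HALLUCINATION_RATE))
```

With the published figures the spread is 50 points, so the check does not fire; it would fire only if a re-run brought the two rates within 25 points of each other.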

Published
10 May 2026
Last reviewed
10 May 2026
Next review
+39d · 09 Jul 2026
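The next-review date follows from the 60-day cadence applied to the 10 May 2026 last-reviewed date. A minimal sketch of that arithmetic, assuming the cadence is counted in calendar days (the function and constant names are hypothetical):

```python
from datetime import date, timedelta

LAST_REVIEWED = date(2026, 5, 10)   # from this card
CADENCE_DAYS = 60                   # review cadence stated for this claim
MINOR_CYCLE_DAYS = 6 * 7            # six-week frontier minor-cycle tempo


def next_review(last_reviewed: date, cadence_days: int = CADENCE_DAYS) -> date:
    """Next review date: last review plus the cadence in calendar days."""
    return last_reviewed + timedelta(days=cadence_days)


if __name__ == "__main__":
    print("next review:", next_review(LAST_REVIEWED).isoformat())
    # Margin the cadence leaves beyond one six-week release cycle.
    print("margin over one minor cycle:", CADENCE_DAYS - MINOR_CYCLE_DAYS, "days")
```

The result lands on 09 Jul 2026, matching the card, and leaves 18 days of margin over one six-week release cycle.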

About this register

The Reporting register tracks claims published from articles addressed to senior enterprise IT leaders — CIOs, IT directors, heads of platform. Claims are reviewed on a 30–90 day cadence; each review either reaffirms the claim, marks one substantive part as Partial, or marks it Not holding once the underlying evidence has been overtaken.

Recent corrections in Reporting

  • AM-003 · Partial · 28 May 2026

    Pricing/model drift: a $100/mo Pro tier now sits beside the $200 tier (added 9 Apr 2026) and the premium model is GPT-5.5 Pro. Core thesis holds; the single-$200-tier framing no longer matches. Re-verify current tiers at chatgpt.com/pricing.

  • AM-002 · Not holding · 06 May 2026

    URL state changed. The /the-agentic-ai-revolution-real-world-success-stories-and-strategic-insights-from-2024-2025/ slug now serves a deliberately rewritten retrospective (claimId AM-130, "Agentic AI 2024-2025 retrospective", published 04 May 2026) against audited primary sources. The 28 Apr 2026 redirect to /retractions/ has been lifted to allow that. AM-002 the claim remains Not holding — the original $3.50/dollar + 70% failure-rate framing was withdrawn and is not restored. AM-130 is a separate claim with its own evidence chain. Readers arriving at /holding/AM-002 see the withdrawal here; the article link surfaces the new piece at the URL the original lived at, with this entry as the audit trail.

  • AM-121 · Holding · 2 May 2026

    Klarna walk-back primary-source upgrade — added Siemiatkowski verbatim quotes via Bloomberg-cited-by-Fortune (9 May 2025) and the Uber-style freelance hiring detail via Entrepreneur. Closes the highest-priority evidence gap from the source dossier.

Reviews coming up in Reporting

  • AM-136 · Holding · next +4d (4 Jun 2026)

    Across the 24-month window May 2024 to April 2026, every major foundation-model provider (Anthropic, OpenAI, Google, AW…

  • AM-020 · Holding · next +18d (18 Jun 2026)

    The 40-60% TCO underestimate on enterprise agentic-AI deployments is not a cost-visibility failure — it is a cross-depa…

  • AM-023 · Holding · next +18d (18 Jun 2026)

    The 10 Apr 2026 Google AI Mode rollout to eight markets is the first vertical (restaurant booking) where agentic search…