The split verdict: GPT-5.5 vs Claude Opus 4.7 and why CIOs need two models, not one
Anthropic shipped Claude Opus 4.7 on 16 Apr 2026; OpenAI shipped GPT-5.5 seven days later. Both vendors claim leadership. Neither model wins everything. The procurement question for 2026 is not which one to standardise on, because the evaluation evidence does not support a single-model answer for any enterprise running both agentic-coding workloads and knowledge-work workloads. The two-year procurement decision is whether to plan the routing or accept the tax of pretending it does not exist.
Holding · reviewed 10 May 2026 · next +39d
Anthropic shipped Claude Opus 4.7 on 16 Apr 2026. OpenAI shipped GPT-5.5 seven days later. Both releases were branded as the leading frontier model in their category. Both came with a benchmark deck. Each deck wins on different categories, and the categories on which one wins are categories on which the other does not.
For an enterprise IT leader running a procurement cycle in this window, the press-release framing is the first procurement risk. The substantive evidence does not support a single-model standardisation answer for any organisation running both agentic-coding workloads (where GPT-5.5 carries clean leads on the public evaluations) and knowledge-work workloads (where Opus 4.7 reports a 50 percentage-point lower hallucination rate on the same independent benchmark). The procurement decision in this window is not which model is the better model. The procurement decision is whether the architecture acknowledges that the two are not substitutes, and routes accordingly, or pretends they are and pays the inverse-of-strength tax on whichever workload is the wrong fit.
Six-week minor cycles are the procurement strategy
Both vendors are now running on six-week minor-cycle release cadences. Anthropic shipped Opus 4.5 in November 2025 and Opus 4.6 in February 2026 before Opus 4.7 landed in April. OpenAI’s GPT-5.x line has run an analogous cadence. TechCrunch’s coverage of the GPT-5.5 launch makes the cadence explicit: this is the operating tempo, not a one-off.
The cadence has a procurement consequence that does not always make it into the conversation. A two-year enterprise standardisation decision is taken against a specific model version on the day of contract signing. Six weeks later, the same vendor has shipped a new minor cycle that may shift the relative position on the workload that drove the decision. Twelve weeks later, the other vendor has shipped a competing model that may have closed or opened the gap. By the time the contract is up for renewal, the model the contract was written for is two-to-three generations behind the model the renewal would buy, and the comparative-evaluation evidence on which the original decision was made is stale.
The implication is that a 2026 single-model standardisation is a bet on a moving target. The bet can still be the right call. There are real procurement, security, and integration costs to running two providers at production scale. But the cost of the bet has to be weighed against the cost of the inverse, which is accepting whichever model’s weakness is the weakness on the workload the enterprise actually runs.
The two release announcements made the cadence visible in two different ways. Anthropic’s Opus 4.7 announcement was framed around accuracy, calibration, and SWE-Bench Pro, with the long-context retrieval drop disclosed in the same materials. OpenAI’s GPT-5.5 announcement was framed around agentic capability, computer use, and a token-efficiency story. The framings are the press-release shape of the underlying model strategies, and they are consistent with the benchmark deltas in the rest of this piece.
Where GPT-5.5 wins
The public evaluation evidence as of late April 2026 shows clean GPT-5.5 leads on a coherent category of workloads: the agentic-coding, computer-use, terminal-tool, and evaluation-throughput surface.
Terminal-Bench 2.0 is the cleanest single number. GPT-5.5 reports 82.7% against Claude Opus 4.7 at 69.4%, a 13-point gap on a benchmark that tests command-line proficiency, shell navigation, and developer-tooling tasks. For any enterprise running Codex-class agentic-coding deployments at scale, the Terminal-Bench delta is the closest available proxy for in-production reliability on shell-touching agents.
GDPval, OpenAI’s economically-meaningful task suite, reports GPT-5.5 at 84.9% wins-or-ties against Claude Opus 4.7 at 80.3% and Gemini 3.1 Pro Preview at 67.3%. FrontierMath (Tiers 1–3) reports GPT-5.5 at 51.7% against Opus 4.7 at 43.8%. OSWorld-Verified, the computer-use benchmark, reports GPT-5.5 at 78.7% and Opus 4.7 at 78.0%; that is close enough that we read it as near-parity rather than as a clean GPT-5.5 win, contrary to the framing that called it a GPT-5.5 lead. MRCR v2 (long-context retrieval at 1M and 256K) is a category in which GPT-5.5 retains material leads, consistent with the announcement framing.
The token-efficiency story is the second half of the GPT-5.5 case. The Decoder reported that Artificial Analysis measures GPT-5.5 using roughly 40% fewer output tokens than GPT-5.4 on equivalent tasks, which is the ground for OpenAI’s “approximately 20% effective net cost increase” framing despite the listed $5/$30 per-million-tokens price. On identical coding tasks, Artificial Analysis reports that GPT-5.5 uses roughly 72% fewer output tokens than Opus 4.7. For any enterprise running coding agents at production volume, that is the unit-economics number that compounds.
CodeRabbit’s GPT-5.5 benchmark, third-party real-world evaluation against open-source pull-request review tasks, reports the model raised the expected-issue-found rate from 55.0% to 65.0% and improved precision from 11.6% to 13.2% over its predecessor. The CodeRabbit figures are useful because they are independent and run against PRs rather than synthetic tests; the directional signal is consistent with the vendor numbers, though the absolute rates are markedly more modest than the headline benchmark scores.
The use-case pattern that holds together across these categories is workloads where the cost of an extra reasoning step is high (token economics matter), the failure modes are observable (the agent’s output runs against a real terminal or a real test suite), and a confidently-wrong intermediate step gets caught downstream. Agentic coding is the canonical example. Computer-use automation is adjacent. Bulk evaluation and grading workloads are adjacent. Knowledge work where the failure mode is an undetected wrong fact is not.
Where Claude Opus 4.7 wins
The public evaluation evidence shows Opus 4.7 with clean leads on a different and equally coherent category: the contamination-resistant coding, finance, vision-reasoning, and tool-using surface.
SWE-Bench Pro is the cleanest single number. Anthropic’s announcement reports Opus 4.7 at 64.3%, a 10.9-point lift over Opus 4.6’s 53.4%, leapfrogging both GPT-5.4 at 57.7% and Gemini at 54.2%. SWE-Bench Pro was built by Scale AI as a contamination-resistant successor to SWE-Bench Verified. It uses private and copyleft-licensed repositories specifically to minimise training-data overlap, which makes the Opus 4.7 lead a more robust signal than the marginal SWE-Bench Verified leads either side claims.
The contamination concern is the contested half of this section. Aibreakingwire reported on OpenAI’s published audit finding that GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview could each reproduce SWE-Bench Verified gold patches verbatim from memory using only the task ID as a prompt. OpenAI’s response was to drop SWE-Bench Verified from its system card disclosures. Anthropic’s response was to publish a filter-rescore analysis showing that excluding any problems flagged by its internal memorisation screens preserved Opus 4.7’s margin of improvement over Opus 4.6 on the remaining decontaminated subset. Both vendors have a reading of the same contamination evidence, both readings are public, and both readings are partially-self-serving. The honest framing is that this dispute is not yet settled, and Opus 4.7’s SWE-Bench Pro lead is more defensible as a procurement signal than its Verified lead.
Vellum’s Opus 4.7 benchmark summary puts the rest of the wins in one place. Opus 4.7 reports 94.2% on GPQA Diamond, 64.4% on Finance Agent v1.1 (the leading public score at release), and 78.3% on CharXiv’s reasoning split for figure interpretation. The Finance Agent v1.1 lead is the single number most worth holding as the operating signal for knowledge-work workloads: it tests the kind of reasoning over reports, filings, and structured documents that finance, legal, regulatory, and consulting teams actually run against a frontier-model assistant.
CodeRabbit’s parallel evaluation of Opus 4.7 is the third-party real-world signal on the other side. Across 100 open-source pull-request reviews, Opus 4.7 outperformed prior frontier models on real-bug discovery, on the actionability of feedback comments, and on cross-file reasoning within the same PR. The CodeRabbit team’s framing is that on substantive code-review work, Opus 4.7 is the model they would route to even though GPT-5.5 ships better headline coding-benchmark numbers. The two CodeRabbit blog posts together are the closest available read on the practical-throughput tradeoff between the two models for in-production code-review pipelines.
The use-case pattern that holds together across these categories is workloads where the cost of a confidently-wrong output is high (regulatory, financial, customer-facing), the failure modes are not always immediately observable (a wrong fact in a finance memo can survive for weeks), and the routing penalty for an extra reasoning step is acceptable. Knowledge work is the canonical example. Multi-turn customer-facing assistants in regulated industries are adjacent. Tool-using agents in finance, legal, and procurement are adjacent.
The hallucination chasm
The single benchmark that changes the procurement question from “which model is better” to “which model is better for what” is AA-Omniscience, Artificial Analysis’s independent knowledge-and-hallucination evaluation released in the April 2026 frontier window.
AA-Omniscience reports two numbers per model. The first is accuracy on the benchmark’s question set. The second is the hallucination rate: how often the model produces a confidently-stated wrong answer rather than declining to answer or reporting that the information is missing. The combination is the operating signal. A model that scores high on accuracy at the cost of high hallucination is shipping the failure mode that knowledge-work owners care most about: confidently wrong outputs on questions the user does not have a quick way to verify.
The April 2026 numbers are the load-bearing data point of this piece. GPT-5.5 (xhigh) reports the highest accuracy at 57% and the highest hallucination rate at 86%. Claude Opus 4.7 (max) reports an AA-Omniscience composite index of 26 against Gemini 3.1 Pro Preview’s 33 (which leads the index), and a hallucination rate of 36%, 50 percentage points lower than GPT-5.5 on the same independent benchmark. Gemini 3.1 Pro Preview sits between the two at a 50% hallucination rate.
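The two-number shape of the benchmark can be sketched in a few lines. This is a minimal illustration, not Artificial Analysis’s scoring code: it assumes the hallucination rate is the share of non-correct responses that are wrong answers rather than abstentions (a reading consistent with GPT-5.5 posting 57% accuracy and an 86% hallucination rate at once, which could not both be shares of all questions). The `GradedRun` type and the counts below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedRun:
    """Hypothetical tally of graded answers for one model on one question set."""
    correct: int
    incorrect: int   # confidently stated wrong answers
    abstained: int   # declined, or reported the information as missing

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained


def accuracy(run: GradedRun) -> float:
    """Share of all questions answered correctly."""
    return run.correct / run.total


def hallucination_rate(run: GradedRun) -> float:
    """Assumed definition: of the questions NOT answered correctly,
    the share where the model answered wrongly instead of abstaining."""
    not_correct = run.incorrect + run.abstained
    if not_correct == 0:
        return 0.0
    return run.incorrect / not_correct


# Illustrative counts only, chosen to show the shape of the trade-off.
confident = GradedRun(correct=57, incorrect=37, abstained=6)
calibrated = GradedRun(correct=45, incorrect=20, abstained=35)

for name, run in [("confident", confident), ("calibrated", calibrated)]:
    print(f"{name}: accuracy={accuracy(run):.2f} "
          f"hallucination_rate={hallucination_rate(run):.2f}")
```

Under this reading, a model can lead on accuracy and still carry the higher hallucination rate, because the second number only looks at what the model does when it does not know.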
The reason the benchmark behaviour diverges from GDPval, where GPT-5.5 leads, is the cleanest available read on the underlying training-objective story. Both models have been pushed against benchmarks that reward confident answers at scale. Opus 4.7 has additionally been pushed against a calibration objective that rewards reporting an error when the information is genuinely missing. The two objectives are not in tension on benchmarks where every question has a knowable answer; they are in direct tension on benchmarks where some questions are genuinely unanswerable from the model’s training set, which is what AA-Omniscience tests.
The clearest in-vendor evidence for the calibration story is the Opus 4.7 long-context retrieval drop. Artificial Analysis’s Opus 4.7 explainer reports that Opus 4.7’s long-context retrieval rate dropped to 32.2%, down from Opus 4.6’s 78.3%. Anthropic’s published explanation is that the model now reports an error when the requested information is missing from the supplied context, rather than fabricating a plausible-looking answer. The reported metric goes down. The substantive failure mode goes down further. For most knowledge-work workloads (finance, regulatory, customer-facing analysis), that is the trade most owners would take if asked directly. The benchmark has not been told this trade is what it is measuring, which is why the headline number reads as a regression rather than as a calibration improvement.
The press-cycle “60% reduction in hallucinations” figure that travelled with GPT-5.5’s launch deserves a separate flag. The 60% number is not in OpenAI’s GPT-5.5 system card. The system card reports GPT-5.5’s individual claims as 23% more likely to be factually correct, and its responses as containing a factual error roughly 3% less often than the predecessor. Wire Blog’s analysis of the gap between the press-cycle number and the system-card number frames the reduction as a context-engineering win rather than an architectural reduction: better tool use, grounded search, and post-training penalties for overconfident wrong answers. The honest read is that the system-card numbers are the supportable claim and the 60% figure is under-supported by the primary source.
The article tracks the “60% reduction” framing as under-supported, the AA-Omniscience hallucination spread as supported by independent evaluation, and the long-context retrieval calibration story as supported by both vendor disclosure and independent measurement.
Pricing reality
Listed API prices are the marketing number. Effective per-task cost is the operating number. The two diverge for both models in this generation in different directions.
Anthropic priced Opus 4.7 at $5 per million input tokens and $25 per million output tokens, the same listed price as Opus 4.6, and the announcement materials lead on the price-stability claim. The “same price” framing is technically true on the listed rate. It is materially understated on the effective cost. Artificial Analysis’s measurement is that Opus 4.7 uses 35–40% more output tokens than Opus 4.6 on equivalent workloads, so an enterprise running constant workload volume on the new model pays roughly 35–40% more in practice for the same output. This is not necessarily the wrong trade. Opus 4.7’s accuracy and calibration improvements are real. But the pricing claim should not be evaluated as if the workload cost were unchanged.
OpenAI priced GPT-5.5 at $5 per million input tokens and $30 per million output tokens, against GPT-5.4’s prior rate. The listed delta overstates the effective increase: OpenAI’s framing is that the effective cost increase is closer to 20% than 50%, because GPT-5.5 uses roughly 40% fewer output tokens than GPT-5.4 on equivalent tasks. That framing is supported by the Artificial Analysis token-efficiency measurement and is a more defensible claim than Anthropic’s “same price” framing on its own listed rate.
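The gap between listed and effective cost is simple arithmetic, sketched below as a rough illustration rather than a TCO model. It looks at output-token spend only and ignores input tokens, caching, and batch discounts, all of which a real workload would need to include. The function name `relative_output_cost` is ours; the prices and token ratios are the figures quoted in this piece.

```python
def relative_output_cost(price_ratio: float, token_ratio: float) -> float:
    """Output-token spend of model A relative to model B on the same task.

    price_ratio: A's listed output price divided by B's.
    token_ratio: A's output tokens divided by B's for equivalent work.
    """
    return price_ratio * token_ratio


# Opus 4.7 vs Opus 4.6: same $25 listed output price, 35-40% more output tokens.
opus_low = relative_output_cost(25 / 25, 1.35)
opus_high = relative_output_cost(25 / 25, 1.40)

# GPT-5.5 ($30 output) vs Opus 4.7 ($25 output) on identical coding tasks,
# with GPT-5.5 using roughly 72% fewer output tokens.
gpt_vs_opus = relative_output_cost(30 / 25, 1 - 0.72)

print(f"Opus 4.7 vs 4.6 output spend: {opus_low:.2f}x to {opus_high:.2f}x")
print(f"GPT-5.5 vs Opus 4.7 output spend on coding tasks: {gpt_vs_opus:.3f}x")
```

On these inputs the higher listed output price for GPT-5.5 is more than offset by the token gap on coding tasks, which is why the listed rate card alone is a poor basis for comparison.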
GPT-5.5 Pro and Claude Mythos Preview both belong in a separate pricing category. GPT-5.5 Pro carries an API premium for higher-tier reasoning. Claude Mythos Preview is gated to Project Glasswing participants and lists at $25 per million input tokens and $125 per million output tokens for the approved Glasswing surface, which is roughly five times the Opus 4.7 rate. Mythos is not a frontier-tier-substitute for either Opus 4.7 or GPT-5.5 in standard procurement; it is a restricted-access security-research tier with its own access conditions, its own pricing surface, and its own use-case scope, and any procurement read that conflates the three rate cards is reading the wrong column.
Independent benchmarks vs vendor cards
The one thing the April 2026 release window made unmistakable is that the gap between vendor-card benchmark numbers and independent decontaminated runs is now material enough to drive procurement decisions on its own.
Vals.ai’s SWE-Bench Verified leaderboard, which runs decontaminated independent evaluations on the same task set, reports GPT-5.5 at 82.60% and Opus 4.7 at 82.00%, within one point of each other. The vendor-card numbers for the same benchmark are GPT-5.5 at 88.7% and Opus 4.7 at 87.6%, a comparable gap, but roughly six points higher than the Vals.ai independent run on both sides. The directional reading (the two are roughly tied on SWE-Bench Verified) survives. The absolute reading does not. Any procurement decision predicated on “Opus 4.7 hits 87.6% on SWE-Bench Verified” should be reframed as “Opus 4.7 hits 82.00% on Vals.ai’s decontaminated run, which is the closer proxy for what it would hit in your environment.”
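The two readings can be separated mechanically. The sketch below, using only the four SWE-Bench Verified figures quoted above, computes how far each vendor-card number sits above the independent run and how the GPT-5.5 lead over Opus 4.7 compares across the two sources. The dictionary layout and function names are illustrative choices, not anyone’s published tooling.

```python
# SWE-Bench Verified scores (percent) as quoted in this piece.
SCORES = {
    "GPT-5.5": {"vendor": 88.7, "independent": 82.60},
    "Claude Opus 4.7": {"vendor": 87.6, "independent": 82.00},
}


def vendor_inflation(scores: dict) -> dict:
    """Points by which each vendor-card score exceeds the independent run."""
    return {
        model: round(s["vendor"] - s["independent"], 1)
        for model, s in scores.items()
    }


def lead(scores: dict, source: str) -> float:
    """GPT-5.5's lead over Opus 4.7, in points, for one score source."""
    return round(scores["GPT-5.5"][source] - scores["Claude Opus 4.7"][source], 1)


print("vendor-card inflation:", vendor_inflation(SCORES))
print("lead on vendor cards:", lead(SCORES, "vendor"))
print("lead on independent run:", lead(SCORES, "independent"))
```

The lead stays near one point either way (the directional reading), while both absolute scores drop by about six points (the absolute reading).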
The CodeRabbit evaluations are useful for the same reason. They run against real open-source pull-request review tasks rather than synthetic benchmarks, with a fixed scoring rubric the vendor did not optimise for. The CodeRabbit Opus 4.7 result and the CodeRabbit GPT-5.5 result are both more modest than the headline numbers and tell a more nuanced story about which model shines on which kind of code-review task.
The operating posture that follows is straightforward. Use independent decontaminated runs for absolute numbers: the rates that go into a TCO model and the rates that get cited to a board. Use vendor cards for direction: which way each model has moved relative to the prior version, which workloads the vendor has prioritised, and which trade-offs the vendor is willing to disclose. The two are different signals and should be read accordingly.
A three-tier routing read for 2026
The procurement playbook that falls out of the evidence is a three-tier routing architecture. Each tier corresponds to a workload category where the available evidence selects a different leading model, and the routing carries the procurement decision.
The first tier is agentic-coding and computer-use workloads: Codex-class deployments, shell automation, terminal-tool agents, computer-use automation, and bulk evaluation. GPT-5.5 leads the public evaluation evidence on every benchmark in this category that is not OSWorld (where it is at near-parity with Opus 4.7), and the token-efficiency gap compounds the evaluation lead into a unit-economics gap on production workloads. The default for this tier in 2026 is GPT-5.5 with Codex.
The second tier is knowledge-work workloads: finance, legal, regulatory, customer-facing analysis, research synthesis, and multi-turn assistants in domains where being confidently wrong has an operational cost. Opus 4.7 leads the public evaluation evidence on Finance Agent v1.1, on AA-Omniscience hallucination calibration, on CharXiv reasoning, and on SWE-Bench Pro. The hallucination calibration is the load-bearing argument for this tier. The AA-Omniscience spread is direct evidence that Opus 4.7 is materially less likely to ship a confident wrong answer on a knowledge-work query, on the same independent benchmark, in the same release window. The default for this tier in 2026 is Opus 4.7, paired with retrieval augmentation against the enterprise knowledge base where the workload allows.
The third tier is the frontier-research and high-stakes-verification workload: security work, novel-vulnerability research, agent-pipeline auditing, the kind of work of which the Mozilla Firefox 150 disclosure is the canonical public example. The Mozilla pipeline (Mozilla Hacks, May 2026) attributed 271 of 423 April 2026 Firefox vulnerabilities to a Claude Mythos Preview-backed agentic auditing harness; we have tracked the procurement consequences of that disclosure on claim AM-147. For enterprises with Project Glasswing access, Mythos is the routing answer for this tier. For enterprises without it (the substantial majority), the routing answer is Opus 4.7 layered with a verification step that runs the model’s outputs through a separate evaluator before they ship, which is what the Mozilla pipeline does and what an in-house equivalent looks like at smaller scale.
The routing layer is the architectural decision that holds the three tiers together. It is also the architectural decision most enterprise IT organisations have not yet built. Building it in 2026 is the operating answer to the multi-vendor cost question, and the alternative is to absorb the cost of pretending the three tiers are one tier.
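As a rough sketch of what that routing layer decides, the three tiers reduce to a small lookup. Everything here is illustrative: the `Tier` and `Route` types, the `route` function, and the model labels are hypothetical names for this example rather than real API identifiers, and classifying an incoming workload into a tier is assumed to happen upstream.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AGENTIC_CODING = "agentic-coding and computer-use"
    KNOWLEDGE_WORK = "knowledge work"
    HIGH_STAKES_VERIFICATION = "frontier research and high-stakes verification"


@dataclass(frozen=True)
class Route:
    model: str                        # illustrative label, not an API model ID
    retrieval_augmented: bool = False
    needs_verifier: bool = False      # separate evaluator before output ships


def route(tier: Tier, has_glasswing_access: bool = False) -> Route:
    """Map a workload tier to the default model described in this piece."""
    if tier is Tier.AGENTIC_CODING:
        return Route(model="gpt-5.5")
    if tier is Tier.KNOWLEDGE_WORK:
        # Paired with retrieval against the enterprise knowledge base
        # where the workload allows.
        return Route(model="claude-opus-4.7", retrieval_augmented=True)
    if tier is Tier.HIGH_STAKES_VERIFICATION:
        if has_glasswing_access:
            return Route(model="claude-mythos-preview")
        # Fallback for the majority without access: Opus 4.7 plus a verifier.
        return Route(model="claude-opus-4.7", needs_verifier=True)
    raise ValueError(f"unknown tier: {tier!r}")


print(route(Tier.AGENTIC_CODING))
print(route(Tier.HIGH_STAKES_VERIFICATION))
```

Keeping the mapping in one place is the design point: when a minor cycle shifts a benchmark position, the change is a one-line edit to the router rather than a re-procurement.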
What this means for the next six months
The two release strategies behind GPT-5.5 and Opus 4.7 are diverging in a way that the next two minor cycles will make more legible to procurement, not less.
OpenAI’s framing around GPT-5.5, supported by the TechCrunch coverage of the launch, leans toward a super-app strategy in which one model attempts to cover the maximum span of consumer and enterprise workloads with adequate-or-better performance on each. The token-efficiency gains, the agentic-coding lead, and the computer-use lead are the architecture of that strategy. The hallucination rate is the cost of it, and the cost is paid disproportionately by knowledge-work users who do not have a quick way to verify the model’s outputs.
Anthropic’s framing around Opus 4.7, complemented by Mythos and Project Glasswing, leans toward a depth strategy in which the lead model trades some of the breadth surface for calibration on specific high-cost-of-error workloads, and the restricted-access tier carries the frontier-research surface separately. The accuracy and calibration gains, the long-context retrieval recalibration, and the SWE-Bench Pro lead are the architecture of that strategy. The Terminal-Bench gap and the token-consumption increase are the cost of it, paid disproportionately by agentic-coding and shell-tool users for whom throughput is the whole game.
The implicit bet a CIO makes by standardising on a single model in 2026 is a choice of which cost the organisation is most willing to absorb on the workloads that are not its dominant case. A finance-heavy enterprise standardising on GPT-5.5 is betting that hallucination on its primary workload is tractable through verification layers it has not yet built. An engineering-heavy enterprise standardising on Opus 4.7 is betting that the token-economics tax on its primary workload is acceptable against the calibration improvement on its secondary workload. Neither bet is wrong in principle, but each is taken against a stronger evidence base if it is taken with the routing alternative explicitly considered and the costs explicitly named.
The next six months will tell which framing the market settles on. By the time the October release window arrives (Opus 4.8 against whatever OpenAI ships in the GPT-5.6 slot), the three-tier routing read will either have become the default enterprise architecture, or the gap on the frontier benchmarks will have closed enough that single-model standardisation becomes defensible again. The honest answer in May 2026 is that the gap has not closed and the routing read is the operating answer for any enterprise running both kinds of workload at scale.