AI Sustained · Issue 003
28 April 2026 · Frontier Models
The April Frontier · Capability Map

Four labs. Seven days. No clear winner.

Opus 4.7 launched on 16 April. GPT-5.5 followed on 23 April. DeepSeek V4 dropped on 24 April. Six months ago this was a two-horse race. Today it is a fragmented frontier — and that changes how anyone serious about AI should buy, build, and bet.

Releases · Q1 2026
255 models
Frontier-class launches tracked by LLM Stats over three months.
Frontier gap
7 days
Between Opus 4.7 and GPT-5.5 going live in April.
Coding leap · 6 months
+13 pts
SWE-bench Verified: Opus 4.5 (74.6%) → Opus 4.7 (87.6%).
Cost gap · open vs closed
7×
DeepSeek V4-Pro vs Opus 4.7 at near-equal SWE scores.

I have been a business analyst for twenty years. I have lived through enterprise software cycles measured in quarters. I have lived through SaaS cycles measured in months. The frontier AI cycle in April 2026 is measured in days, and that is not a metaphor.

Anthropic shipped Claude Opus 4.7 on 16 April. OpenAI shipped GPT-5.5 on 23 April. DeepSeek dropped V4-Pro and V4-Flash on 24 April. Three frontier-class releases in nine days, two of them claiming agentic-AI leadership and going head-to-head at the same $5 input price. Anthropic also confirmed on 7 April a model called Claude Mythos that the company has explicitly chosen not to release publicly because it can identify zero-day vulnerabilities on its own.

For anyone trying to build a strategy on top of this, the pace is the story before the models are the story. Q1 2026 alone saw 255 frontier-class model releases tracked by LLM Stats. The Artificial Analysis Intelligence Index sat at a ceiling of 57 for two months because four labs were converging on the same wall — until GPT-5.5 broke through last week with a score of 60. The frontier is not just moving. It is moving in lockstep, with three or four labs landing within a few benchmark points of each other on six-week cycles.

01 · The pace problem nobody is solving for

Every enterprise AI strategy I have seen written before March 2026 is already obsolete in some material way. Not because the strategies were wrong. Because the assumptions about cadence were wrong.

If your AI roadmap names a specific model — "we will standardise on Opus 4.6" or "GPT-5.4 powers our copilot" — that line is either already out of date or about to be. The teams reporting the highest satisfaction in 2026 are not the ones who picked the best model. They are the ones who built model-agnostic architectures and run quick A/B comparisons whenever a new release lands. The model is the least stable layer of the stack now. The retrieval, the tools, the routing, the agent infrastructure — those are where compounding value lives.
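What model-agnostic looks like in miniature: every provider wrapped behind one call signature, with routing as data rather than code. This is a sketch only; the class names, stub providers, and task-type keys are illustrative, not any real SDK.

```python
# Minimal sketch of a model-agnostic routing layer. Provider functions and
# model names are illustrative stubs, not real API clients.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Completion:
    text: str
    model: str

# Every provider is wrapped behind the same signature, so swapping models
# is a config change, not a code change.
ProviderFn = Callable[[str], Completion]

class ModelRouter:
    def __init__(self) -> None:
        self._providers: Dict[str, ProviderFn] = {}
        self._routes: Dict[str, str] = {}  # task type -> provider name

    def register(self, name: str, fn: ProviderFn) -> None:
        self._providers[name] = fn

    def route(self, task_type: str, provider_name: str) -> None:
        self._routes[task_type] = provider_name

    def complete(self, task_type: str, prompt: str) -> Completion:
        return self._providers[self._routes[task_type]](prompt)

# Stub providers standing in for real API clients.
def opus_stub(prompt: str) -> Completion:
    return Completion(text=f"[opus] {prompt}", model="opus-4.7")

def gemini_stub(prompt: str) -> Completion:
    return Completion(text=f"[gemini] {prompt}", model="gemini-3.1-pro")

router = ModelRouter()
router.register("opus", opus_stub)
router.register("gemini", gemini_stub)
router.route("agentic-coding", "opus")  # rotate as the leaderboard moves
router.route("general", "gemini")

print(router.complete("general", "Summarise Q1 releases.").model)
```

When a new release lands, the A/B comparison is a one-line routing change against the same prompts.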

This is the frame to read everything below in.

02 · The frontier, model by model

Six general-purpose models matter in April 2026. Three are American closed-weight (Opus 4.7, GPT-5.5, Gemini 3.1 Pro). Two are Chinese open-weight (DeepSeek V4-Pro, GLM-5.1). One is American with a Twitter-shaped chip on its shoulder (Grok 4.20). Mistral and Llama 4 sit on the periphery and matter for different reasons. Below: each one as it actually behaves, not as it is marketed.

Claude Opus 4.7
Anthropic · 16 April 2026
Strength
Agentic coding and tool orchestration. 87.6% on SWE-bench Verified, 64.3% on SWE-bench Pro — the highest scores any generally available model has posted on real-world software engineering. Vision resolution tripled to 3.75 megapixels.
Weakness
Web browsing regressed (BrowseComp 79.3%, down from 4.6's 83.7%), and it trails GPT-5.5 on Terminal-Bench 2.0 by 13 points.
Pricing
$5 / $25 per million tokens. Same headline as 4.6, but the new tokenizer uses up to 35% more tokens for the same input: a silent 20–30% effective cost increase that nobody talks about (worked through in the sketch below this card).
Differentiator
Self-verification. Anthropic's framing: Opus 4.7 "devises ways to verify its own outputs before reporting back." Vercel says it now writes proofs on systems code before starting work. This is the genuinely new behaviour.
Verdict
If you are building an agent that runs unsupervised for hours, this is the model to evaluate first. The self-checking changes failure modes more than any benchmark headline suggests.
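A back-of-envelope on that tokenizer footnote. This is a hedged sketch, assuming the ~35% inflation applies to input tokens and output length is unchanged; the token counts are purely illustrative. On input-heavy agentic mixes, the effective increase lands in the 20–30% band described above.

```python
# Hedged sketch: effective Opus 4.7 cost under tokenizer inflation.
# Assumptions (not from Anthropic): inflation hits input tokens only,
# output length is unchanged, and the task mix below is illustrative.
IN_PRICE, OUT_PRICE = 5.0, 25.0  # $ per million tokens (Opus 4.7 list price)

def task_cost(input_toks: int, output_toks: int, inflation: float = 0.0) -> float:
    """Dollar cost of one task, with optional input-token inflation."""
    return (input_toks * (1 + inflation) * IN_PRICE + output_toks * OUT_PRICE) / 1e6

# An input-heavy agentic task: 200K tokens of context, 10K tokens of output.
before = task_cost(200_000, 10_000)
after = task_cost(200_000, 10_000, inflation=0.35)
print(f"effective increase: {after / before - 1:.0%}")  # ~28% on this mix
```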
GPT-5.5 ("Spud")
OpenAI · 23 April 2026
Strength
Computer use and long-horizon agency. 82.7% on Terminal-Bench 2.0 (Opus 4.7: 69.4%). MRCR v2 long-context retrieval more than doubled from 36.6% to 74.0%. Tau2-bench Telecom: 98.0% with no prompt tuning. The first OpenAI base model retrained from scratch since GPT-4.5.
Weakness
Hallucination rate of 86% on AA-Omniscience (versus Opus 4.7 at 36% and Gemini 3.1 Pro at 50%). It is more confident than it should be — a real risk for agentic workflows that grade themselves as they run.
Pricing
$5 / $30 per million tokens. The headline price doubled from GPT-5.4, the biggest price hike of any GPT-5.x release. OpenAI's argument: 40% fewer tokens per Codex task offsets it to roughly +20% effective cost (the arithmetic is sketched below this card).
Differentiator
Native omnimodality (text, image, audio, video in one architecture) and OpenAI's distribution: ChatGPT, Codex, the Agents SDK, the rumoured "super app" framing from Brockman. The product surface is the moat.
Verdict
If your workflow is "give the AI a messy task and trust it to plan, browse, click, finish," GPT-5.5 is the strongest pick today. The hallucination rate is the asterisk.
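OpenAI's offset argument is one line of arithmetic: a doubled per-token price multiplied by 40% fewer tokens per task nets out at roughly 1.2× per task. A sketch using only the figures quoted above:

```python
# GPT-5.5 effective cost per Codex task, from the figures in the card above.
price_multiplier = 2.0         # headline per-token price doubled vs GPT-5.4
token_multiplier = 1 - 0.40    # OpenAI's claim: 40% fewer tokens per task
print(f"cost per task vs GPT-5.4: {price_multiplier * token_multiplier:.1f}x")  # 1.2x
```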
Gemini 3.1 Pro (Preview)
Google DeepMind · 19 February 2026
Strength
Reasoning and multimodality. 77.1% on ARC-AGI-2 — more than double Gemini 3 Pro's 31.1%. 94.3% on GPQA Diamond, the highest score ever reported on that benchmark. Native multimodal with full video, audio, and 1M context.
Weakness
Slips behind Opus 4.7 and GPT-5.5 on agentic computer use. Time-to-first-token of ~28 seconds is high for interactive applications.
Pricing
$2 / $12 per million tokens — same as Gemini 3 Pro. No price increase despite the capability jump. This is the cost story of 2026 in one line.
Differentiator
Generates animated SVG and 3D code natively from text. Distribution through Workspace, Android, NotebookLM. The Gemini app crossed 750 million users.
Verdict
The best general-purpose model on a price-performance basis. If you are not specifically optimising for agentic coding or computer use, start here.
DeepSeek V4-Pro & V4-Flash
DeepSeek · 24 April 2026 · MIT licence
Strength
Open-weight near-frontier performance at a fraction of the price. V4-Pro: 1.6 trillion parameters (49B active), 80.6% on SWE-bench Verified — within 0.2 points of Claude Opus 4.6. Largest open-weight model released to date.
Weakness
Training trails the Western frontier by 3–6 months. Falls short of GPT-5.5 and Gemini 3.1 Pro on the hardest reasoning evals. No native multimodal video or audio.
Pricing
$1.74 / $3.48 per million tokens for V4-Pro. V4-Flash: $0.14 / $0.28. Cheaper than GPT-5.4 Nano, with frontier-class performance on coding tasks.
Differentiator
MIT licence — fine-tune, redistribute, self-host without restriction. Hybrid attention architecture (Compressed Sparse + Heavily Compressed) cuts long-context inference cost by 90% versus V3.2. Confirmed running on Huawei Ascend chips.
Verdict
The open-source price ceiling on commercial AI is now set by this model. If your workload is high-volume coding or agentic work, the price-performance maths is hard to ignore: roughly a seventh of the price for ~95% of the capability (the ratio is worked through below).
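Where the 7× figure comes from, using the list prices above. A sketch only: the blended ratio depends on your input/output mix, and output-heavy agentic work sits near the 7× end.

```python
# Price ratios from the list prices quoted in the cards above.
opus_in, opus_out = 5.00, 25.00  # Claude Opus 4.7, $ per 1M tokens
v4_in, v4_out = 1.74, 3.48       # DeepSeek V4-Pro, $ per 1M tokens

print(f"input ratio:  {opus_in / v4_in:.1f}x")    # ~2.9x
print(f"output ratio: {opus_out / v4_out:.1f}x")  # ~7.2x
```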
Grok 4.20
xAI · 18 March 2026 (full release)
Strength
2 million token context, the largest among Western closed models. Multi-agent architecture (4 specialised agents in standard mode, 16 in Heavy mode). Real-time X data integration is genuinely unique. Lowest hallucination rate in this line-up at 22% (78% non-hallucination on Artificial Analysis's methodology).
Weakness
Trails the frontier on composite intelligence (Artificial Analysis Index: 49 versus Gemini's 57 and GPT-5.5's 60). No persistent memory between sessions in any tier — a real gap. The recent 4.3 beta launched with no model card, no third-party benchmarks, no tier-1 outlet coverage.
Pricing
$2 / $6 per million tokens, with a 2M context. Heavy mode access requires SuperGrok Heavy at $300/month — the most expensive consumer tier in the market.
Differentiator
Real-time X social signal as a first-class data source. Truth-seeking branding. Weekly model updates rather than quarterly.
Verdict
A specialist model with one genuine differentiator (X real-time data) wrapped in marketing that overstates the capability. Worth paying for if your work depends on live social signal. Not the right default for general business use.

The supporting cast

Llama 4 Maverick (Meta, April 2025): 400B parameters, 1M context, MoE, $0.20/$0.60 via API providers. Genuinely useful for batch and retrieval workloads, but coding lags badly and the licence explicitly bars EU-domiciled entities from the vision features. Meta's surprise pivot to Muse Spark on 8 April 2026, its first proprietary closed model, quietly signals that the open-source-everything era at Meta is over. GLM-5.1 (Zhipu AI, 7 April, MIT licence) reportedly beats Opus 4.6 and GPT-5.4 on SWE-bench Pro. Mistral Small 4 and Large 3 remain the credible European option; more on that below.

03 · The capability matrix

Where each model leads, based on the most-cited third-party benchmarks across OpenAI, Anthropic, Google DeepMind, Artificial Analysis, and partner reports as of 28 April 2026. The best figure in each row is the category leader among generally available models. Mythos sits above the line in absolute terms but is not deployable.

Benchmark                           Opus 4.7    GPT-5.5    Gemini 3.1 Pro    DeepSeek V4-Pro    Grok 4.20
SWE-bench Pro (coding)              64.3%       58.6%      54.2%             ~58%               n/r
SWE-bench Verified                  87.6%       n/r        80.6%             80.6%              n/r
Terminal-Bench 2.0 (agentic)        69.4%       82.7%      68.5%             67.9%              n/r
GPQA Diamond (PhD reasoning)        94.2%       ~94%       94.3%             ~91%               n/r
ARC-AGI-2 (novel reasoning)         n/r         n/r        77.1%             n/r                n/r
OSWorld-Verified (computer use)     78.0%       78.7%      n/r               n/r                n/r
BrowseComp (web research)           79.3%       89.3%      85.9%             n/r                n/r
FrontierMath Tier 4                 22.9%       35.4%      16.7%             n/r                n/r
Multilingual Q&A (MMMLU)            91.5%       83.2%      92.6%             ~88%               n/r
Hallucination rate (lower better)   36%         86%        50%               n/r                22%*
Context window                      1M          1M         1M                1M                 2M
Input price ($ per 1M tokens)       $5          $5         $2                $1.74              $2
Output price ($ per 1M tokens)      $25         $30        $12               $3.48              $6

n/r = not reported by lab or third party at time of writing. Mythos (Anthropic, restricted) leads every category but is not generally available. Hallucination figures use AA-Omniscience methodology where comparable; *Grok's figure uses Artificial Analysis non-hallucination rate, methodology differs.

No single model wins. The competition stopped being about which model is smarter and became about which one fits your specific workflow at your specific budget.

04 · Six months ago vs today

This is where the article you are reading earns its keep. The April 2026 picture only makes sense against where we were in October 2025 — six months and a different epoch ago.

October 2025: a two-horse race with a Chinese spoiler

Six months ago, the frontier looked like this. GPT-5 had launched in August. Claude Opus 4.5 was on the way (it shipped in November). Gemini 3 Pro was Google's response, due in December. Grok 4.1 was Musk's stake. The narrative was binary — OpenAI versus Anthropic, with Google catching up and DeepSeek as the cost-leader gadfly that had spooked NVIDIA's stock back in January 2025.

The benchmark numbers tell the same story. SWE-bench Verified leadership in October 2025 sat in the mid-70s. ARC-AGI-2 was a wall: best models scored in the 30s. Computer use was a research demo. Long-context retrieval at 1M tokens was a marketing number — accuracy collapsed past 100K. The Artificial Analysis Intelligence Index leader scored ~50.

April 2026: a fragmented frontier with seven-day cycles

What has changed in 180 days: SWE-bench Verified leadership moved from the mid-70s to 87.6%. ARC-AGI-2 went from a wall in the 30s to 77.1%. Computer use went from research demo to production benchmark, with Opus 4.7 and GPT-5.5 both near 78% on OSWorld-Verified. Long-context retrieval stopped being a marketing number, with MRCR v2 more than doubling to 74.0%. The Artificial Analysis Intelligence Index leader moved from ~50 to 60. And open weights closed to within a few points of the closed frontier at a fraction of the price.

Anyone who paused their AI strategy in October 2025 to "see where this lands" is now eighteen months behind a moving target. The target is not slowing.

In October 2025, frontier AI was a question of which lab leads. By April 2026, it is a question of which model fits the specific shape of the work you are trying to do — and that question changes every six weeks.

05 · The European angle

Three things matter for European readers that do not show up in the SWE-bench tables.

Mistral grew up

Between February and April 2026, Mistral went from "European national champion" to something genuinely competitive. €722M in debt financing for a Paris data centre with 13,800 NVIDIA GB300 chips, operational by Q2. A €1.2B Sweden investment for 2027. A target of 200MW of European compute capacity. A landmark NVIDIA partnership through the Nemotron Coalition. A three-year framework with the French Ministry of the Armed Forces. ARR on track for $1B puts them in the same commercial conversation as second-tier US labs. Mistral Large 3 and Small 4 are both Apache 2.0, the most permissive option on the market for European enterprises with GDPR and data sovereignty constraints.

Mistral CEO Arthur Mensch's argument, repeated everywhere from Davos to GTC, is the one European procurement teams should be paying attention to: "You cannot have AI sovereignty if all your compute runs on American cloud infrastructure." Whether you agree or not, it is now a procurement question, not a philosophical one.

The EU AI Act is no longer theoretical

The high-risk AI obligations of the EU AI Act become fully applicable on 2 August 2026, fourteen weeks from this article going live. By that date, providers and deployers of high-risk systems need completed conformity assessments, technical documentation, CE marking, EU database registration, and quality management systems in operation. GPAI obligations have been live since August 2025. Transparency rules, including labelling of AI-generated content, also apply from August 2026. Every UK or US-headquartered enterprise selling into the EU is now a deployer.

The Brussels Effect is in full swing. The EU has already issued a formal data-retention order to X over Grok, and put Meta's Llama models under closer scrutiny after Meta refused to sign the GPAI Code of Practice. Adobe, OpenAI, Google, and Microsoft are embedding C2PA watermarking globally because compliance-by-design is cheaper than geofencing. None of this was a serious operational concern six months ago.

The licence question

Llama 4's licence explicitly excludes EU-domiciled entities from vision features. DeepSeek V4 and GLM-5.1 are MIT — no restriction. Mistral models are Apache 2.0 — no restriction. Anthropic, Google, and OpenAI offer EU data residency on Bedrock, Vertex AI, and Microsoft Foundry respectively. If you are running AI in regulated industries (financial services, healthcare, education) the licence terms increasingly matter as much as the capability.

06 · What the influencers and the public are saying

The honest version: the influencer ecosystem in AI has fragmented into three camps, and which one you read shapes which model you think is winning.

The Anthropic-leaning analysts (DataCamp, Vellum, several developer Substacks) emphasise Opus 4.7's coding lead, the verified self-checking behaviour, and the safety story around Mythos. Simon Willison's "almost on the frontier, a fraction of the price" framing for DeepSeek V4 captures the mood among technical evaluators — the gap is narrowing fast and they are watching it weekly. The general analyst consensus: Claude leads on natural prose and code quality on hard problems; quality preferences haven't shifted as fast as benchmark numbers.

The OpenAI-leaning ecosystem (TechCrunch, BigGo, agencies built on the OpenAI API) emphasises the agentic story, the super-app framing, and ChatGPT's distribution. The line that gets repeated most often: enterprise AI procurement is consolidating, OpenAI has 35.2% paid-business penetration in the US, Anthropic 30.6%. Whoever has the desktops wins.

The Google-leaning ecosystem (Visual Capitalist, Stratechery, Gemini-on-Workspace agencies) emphasises the price collapse — Gemini 3.1 Pro at frontier capability for $2/$12 — and Google's distribution: 750M Gemini app users, 15–25M paying subscribers across AI Pro and AI Ultra, AI Overviews in Search.

On the public side, two signals stand out. First, the Ramp data: the gap between OpenAI and Anthropic in paid US business adoption shrank from a roughly 3× ratio to 4.5 percentage points in twelve months. Second, the most-cited consumer AI publications increasingly recommend using two or three tools in parallel rather than picking one. AI Magicx's quarterly multi-tool comparison summed it up: "There is no 'best' AI assistant. There are four products each occupying a defensible niche."

The contrarian view, worth taking seriously: OpenAI Chief Scientist Jakub Pachocki said at the GPT-5.5 launch that the last two years of model progress have been "surprisingly slow." He and Brockman pitched 5.5 as "a new class of intelligence." That is either marketing or a tell — depending on who you read. The benchmark numbers do not obviously support "slow," but the people inside the labs may know something the leaderboards don't.

07 · On the horizon: rumours and roadmaps

What I believe is real, ranked by my confidence:

High confidence

Medium confidence

Lower confidence — watch this space

The discount-it section

The "GPT-5.5 is 2 weeks away" / "Claude 5 imminent" content from before 23 April was largely social-media churn. Most of the dramatic capability speculation came from accounts with skin in the game. The leaderboard moves more than the marketing suggests, but the marketing moves more than the underlying capability sometimes does.

08 · What this means if you have to choose

If you are a senior person making AI procurement or platform decisions in 2026, here is the honest decision tree, stripped of vendor narrative:

If the work is agentic coding or long-running autonomous tasks — Claude Opus 4.7 first, GPT-5.5 second. Both above the rest. Use Opus 4.7 if the agent will run unsupervised; the self-checking matters more than the benchmark score. Use GPT-5.5 if the workflow involves heavy computer-use, web research, or terminal work.

If the work is general-purpose with cost discipline — Gemini 3.1 Pro. Frontier-class capability at a third of Opus's price, with Workspace integration if you live in Google.

If the work is high-volume or sensitive enough to want self-hosting — DeepSeek V4-Pro or GLM-5.1, both MIT-licensed, both within touching distance of the Western frontier on most benchmarks. Mistral if you specifically need EU sovereignty and Apache 2.0.

If the work depends on real-time social signal — Grok 4.20. Otherwise no.

The single best thing to internalise is this: in April 2026, frontier AI has stopped being a leaderboard race and become a portfolio decision. The leaders rotate every 4–6 weeks. The smart organisations route different requests to different models, automate the A/B testing, and accept that vendor lock-in is now the most expensive choice they could make.

Tactical takeaway

Make your stack model-agnostic, your evaluation continuous, and your procurement assumption six weeks.

01 · ARCHITECTURE
Treat the model as the most volatile layer of your stack. Route requests, swap providers, never name a specific model in a roadmap.
02 · EVALUATION
Build an internal benchmark on real workloads. The leaderboard rotates every six weeks; your tests should run that often (a minimal harness is sketched after this list).
03 · COMPLIANCE
2 August 2026 is fourteen weeks away. EU AI Act obligations apply to anyone deploying into Europe, including UK and US firms. Start now.
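On point 02, a minimal sketch of what continuous evaluation can mean in practice: one task set drawn from real workloads, every candidate model scored the same way, rerun on every release. The names, stub models, and exact-match scorer are illustrative assumptions, not a specific framework.

```python
# Minimal continuous-evaluation harness. Models and the scoring rule are
# illustrative; swap in real API clients and workload-specific scorers.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    prompt: str
    reference: str

def exact_match(output: str, reference: str) -> float:
    """Toy scorer: 1.0 on an exact answer, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def run_eval(
    models: Dict[str, Callable[[str], str]],  # name -> completion function
    tasks: List[Task],
    score: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Score every registered model on the same task set."""
    return {
        name: sum(score(complete(t.prompt), t.reference) for t in tasks) / len(tasks)
        for name, complete in models.items()
    }

# Stub models standing in for real API clients.
models = {
    "candidate-a": lambda p: "42",
    "candidate-b": lambda p: "41",
}
tasks = [Task(prompt="What is 6 x 7?", reference="42")]
print(run_eval(models, tasks))  # {'candidate-a': 1.0, 'candidate-b': 0.0}
```

Rerun the same harness the day a new model ships; if a candidate wins on your tasks, update the router and move on.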
Tags
#FrontierAI #ClaudeOpus47 #GPT55 #Gemini31 #DeepSeekV4 #Grok #AIStrategy #EUAIAct #Mistral #AgenticAI #FutureOfWork #BusinessStrategy #DigitalTransformation #AIAdoption #UKBusiness #Leadership
AI Sustained · By Kevin Clubb · April 2026