There is a live leaderboard at aistupidlevel.info that tracks which AI model is currently the least stupid.
The site is called StupidMeter. It updates hourly. It does not grade on a curve, does not take PR relationships into account, and does not accept sponsored rankings. It just runs evaluations, publishes scores, and lets the numbers say what the numbers say.
As of June 24, 2026: Claude Opus 4-5-20251101 is number one with a score of 69. The next thirteen positions include models from Anthropic, OpenAI, Google, DeepSeek, and Kimi — a lineup that reflects just how compressed the frontier has become.
What StupidMeter Is
aistupidlevel.info is a live AI model benchmark dashboard with a deliberately irreverent brand. The core mechanic: assign every major AI model a "stupid level" score, updated every hour, and rank them top to bottom.
Higher score = less stupid = better.
The dashboard shows you:
- Current score — the model's composite performance number
- Trend arrow — whether the score is rising (▲), falling (▼), or flat (→) since last update
- Stability tag — STBL (stable, consistent performance) or VOLA (volatile, inconsistent performance)
- Update interval — how frequently the score is recalculated (all models show 1h)
- Pricing — input/output cost per million tokens in USD
At the time of the snapshot, the dashboard reported 1,336 visitors in its current session window. That level of traffic suggests it has become part of the routine checking cycle for developers and AI buyers who care about which model is actually performing right now.
The Current Leaderboard Breakdown
Here are the top 14 models as of the June 24, 2026 dashboard snapshot:
| Rank | Model | Company | Score | Status | Price (in/out per M) |
|---|---|---|---|---|---|
| #1 | CLAUDE-OPUS-4-5-20251101 | Anthropic | 69 | STBL → | $5/$25 |
| #2 | CLAUDE-OPUS-4-6 | Anthropic | 67 | STBL ▼ | $5/$25 |
| #3 | GPT-5.3-CODEX | OpenAI | 65 | STBL ▲ | $1.25/$10 |
| #4 | DEEPSEEK-V4-PRO | DeepSeek | 65 | STBL → | $0.28/$0.42 |
| #5 | CLAUDE-SONNET-4-5-20250929 | Anthropic | 64 | VOLA → | $3/$15 |
| #6 | GPT-5.4 | OpenAI | 63 | VOLA ▼ | $1.25/$10 |
| #7 | DEEPSEEK-V4-FLASH | DeepSeek | 62 | VOLA → | $0.28/$0.42 |
| #8 | GEMINI-3.1-PRO-PREVIEW | Google | 58 | VOLA → | $2/$12 |
| #9 | GPT-5.5 | OpenAI | 57 | VOLA → | $1.25/$10 |
| #10 | CLAUDE-OPUS-4-8 | Anthropic | 57 | VOLA ▼ | $5/$25 |
| #11 | CLAUDE-OPUS-4-7 | Anthropic | 56 | VOLA → | $5/$25 |
| #12 | KIMI-K2.7-CODE | Kimi | 55 | VOLA → | $0.60/$2.50 |
| #13 | CLAUDE-SONNET-4-6 | Anthropic | 54 | STBL → | $3/$15 |
| #14 | GEMINI-3.1-FLASH-LITE | Google | 50 | VOLA ▲ | $0.1/$0.4 |
Five things jump out immediately.
Five Things the Leaderboard Reveals
1. Anthropic Has a Lot of Models in the Top 13
Six of the first thirteen spots go to Anthropic models. That is not just dominance; it is fragmentation. Claude Opus 4-5, 4-6, 4-7, and 4-8 all appear in the top 11. Claude Sonnet 4-5 and 4-6 are both present. This reflects Anthropic's rapid release cadence: they are shipping model versions quickly enough that multiple generations coexist in the active API ecosystem.
For buyers, this is also a complexity problem. Knowing which Claude to use requires more due diligence than it did a year ago.
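The per-company tally is easy to reproduce from the table. Here is a minimal sketch with the top 13 rows transcribed by hand; `TOP_13` and `models_per_company` are illustrative names, not anything the site exposes, and the company for the Gemini row is inferred from the model name.

```python
from collections import Counter

# Top 13 rows transcribed from the snapshot table: (rank, model, company).
# "Google" for the Gemini row is inferred from the model name.
TOP_13 = [
    (1, "CLAUDE-OPUS-4-5-20251101", "Anthropic"),
    (2, "CLAUDE-OPUS-4-6", "Anthropic"),
    (3, "GPT-5.3-CODEX", "OpenAI"),
    (4, "DEEPSEEK-V4-PRO", "DeepSeek"),
    (5, "CLAUDE-SONNET-4-5-20250929", "Anthropic"),
    (6, "GPT-5.4", "OpenAI"),
    (7, "DEEPSEEK-V4-FLASH", "DeepSeek"),
    (8, "GEMINI-3.1-PRO-PREVIEW", "Google"),
    (9, "GPT-5.5", "OpenAI"),
    (10, "CLAUDE-OPUS-4-8", "Anthropic"),
    (11, "CLAUDE-OPUS-4-7", "Anthropic"),
    (12, "KIMI-K2.7-CODE", "Kimi"),
    (13, "CLAUDE-SONNET-4-6", "Anthropic"),
]


def models_per_company(rows):
    """Count how many leaderboard rows belong to each company."""
    return Counter(company for _, _, company in rows)


counts = models_per_company(TOP_13)
for company, n in counts.most_common():
    print(f"{company}: {n}")
```

Because the leaderboard updates hourly, a tally like this only describes one snapshot and should be re-run against fresh data.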
2. DeepSeek is Competitive at a Fraction of the Price
DeepSeek-V4-Pro scores 65 — tied for third with GPT-5.3-Codex — at $0.28 per million input tokens and $0.42 per million output tokens.
Claude Opus 4-5, the number one model, costs $5/$25 per million tokens. That is roughly 18x more expensive on input and 60x more expensive on output for 4 points of score difference.
DeepSeek-V4-Flash scores 62 at the same $0.28/$0.42 pricing, though it carries a VOLA tag where V4-Pro is STBL. That makes two competitive DeepSeek models at a price point that makes them the obvious pick for cost-sensitive workloads.
This is the DeepSeek story in one dashboard screenshot: they have closed the intelligence gap while obliterating the price gap. The question for enterprise buyers is no longer whether DeepSeek can do the task — it is whether your data governance policy allows sending data to a Chinese AI provider.
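To make the price gap concrete, here is a small sketch of the ratio arithmetic using the two price pairs from the table. The monthly workload (100M input tokens, 20M output tokens) is a made-up assumption for illustration, and the function names are hypothetical.

```python
def price_ratio(expensive, cheap):
    """Return (input_ratio, output_ratio) for two (input, output) price pairs in USD per million tokens."""
    return expensive[0] / cheap[0], expensive[1] / cheap[1]


def monthly_cost(prices, input_millions, output_millions):
    """Cost in USD for a workload measured in millions of tokens."""
    return prices[0] * input_millions + prices[1] * output_millions


# Prices from the snapshot table: (input, output) in USD per million tokens.
OPUS_4_5 = (5.00, 25.00)
DEEPSEEK_V4_PRO = (0.28, 0.42)

in_ratio, out_ratio = price_ratio(OPUS_4_5, DEEPSEEK_V4_PRO)
print(f"input: {in_ratio:.1f}x, output: {out_ratio:.1f}x")  # input: 17.9x, output: 59.5x

# Hypothetical workload: 100M input tokens and 20M output tokens per month.
opus_cost = monthly_cost(OPUS_4_5, 100, 20)
deepseek_cost = monthly_cost(DEEPSEEK_V4_PRO, 100, 20)
print(f"Opus 4-5: ${opus_cost:,.2f}  DeepSeek-V4-Pro: ${deepseek_cost:,.2f}")
```

The real gap depends on your own input-to-output mix: output-heavy workloads sit closer to the 60x end, input-heavy ones closer to 18x.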
3. GPT-5 Is Splintered Too
OpenAI has GPT-5.3-Codex (#3 at 65), GPT-5.4 (#6 at 63), and GPT-5.5 (#9 at 57) all in the top 9. GPT-5.3-Codex is stable and rising. GPT-5.4 is volatile and falling. GPT-5.5 is volatile and flat.
The GPT-5 family appears to be iterating rapidly with different specializations. GPT-5.3-Codex's name suggests a coding focus — consistent with OpenAI doubling down on developer workloads after the Codex product line. GPT-5.4 and 5.5 appear to be general versions, with 5.5 performing worse than 5.4 on this benchmark set despite being a higher version number.
Version numbers can mislead: newer does not always mean smarter on every task.
4. STBL vs VOLA Is the Hidden Signal
The top two models — Claude Opus 4-5 and Claude Opus 4-6 — are both STBL. The number three and four models — GPT-5.3-Codex and DeepSeek-V4-Pro — are also STBL.
From rank 5 down, almost everything flips to VOLA; the one exception is Claude Sonnet 4-6 at #13.
Stability matters for production use. A VOLA model that averages 63 might score 70 on a good run and 56 on a bad one. If you are routing real customer requests through it, that variance shows up as inconsistent output quality. A STBL model at 65 that reliably delivers 64-66 is more deployable than a VOLA model that averages 65 with swings into the 50s.
The four stable top performers — Claude Opus 4-5, Claude Opus 4-6, GPT-5.3-Codex, and DeepSeek-V4-Pro — are the ones enterprise buyers should be looking at first for consistency-sensitive workloads.
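The variance argument can be sketched with numbers. StupidMeter does not publish how it assigns STBL and VOLA, so the hourly scores below are invented, and standard deviation plus "share of runs below a floor" are stand-in measures rather than the site's method.

```python
from statistics import mean, pstdev


def summarize(scores, floor):
    """Summarize a series of scores: average, spread, and share of runs below a quality floor."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "share_below_floor": sum(s < floor for s in scores) / len(scores),
    }


# Invented hourly scores for two hypothetical models with the same average of 65.
stable_scores = [64, 65, 66, 65, 64, 66]
volatile_scores = [70, 56, 72, 58, 74, 60]

QUALITY_FLOOR = 60  # assumed minimum acceptable score for a production workload

stable = summarize(stable_scores, QUALITY_FLOOR)
volatile = summarize(volatile_scores, QUALITY_FLOOR)

for label, stats in (("stable", stable), ("volatile", volatile)):
    print(
        f"{label}: mean={stats['mean']}, stdev={stats['stdev']:.2f}, "
        f"below floor={stats['share_below_floor']:.0%}"
    )
```

Both series average 65, but only the volatile one ever dips under the floor, which is the practical difference between the two tags for anything customer-facing.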
5. Google Is Underrepresented at the Top
Gemini-3.1-Pro-Preview sits at #8 with a score of 58 — VOLA, $2/$12 per million tokens. Gemini-3.1-Flash-Lite is at #14 with a 50, also VOLA, but at $0.1/$0.4 — the cheapest model visible on the leaderboard.
Google's model lineup does not appear in the top 7 at all. This is worth noting because Gemini models have typically performed well on academic benchmarks. Either StupidMeter's evaluation tasks are measuring something that Gemini is not optimized for, or Gemini's performance on practical tasks trails its performance on standardized tests — a pattern that has appeared in other independent evaluations.
Gemini-3.1-Flash-Lite at $0.1/$0.4 is interesting for cost reasons. At that price point, even mediocre capability (score 50) opens high-volume use cases that the pricier models on this list cannot serve economically.
How to Read the Scoring
The StupidMeter score is not a traditional benchmark number. It is not SWE-bench, MMLU, or HumanEval. The site does not fully publish its methodology, which is worth acknowledging when using the scores for procurement decisions.
What the scores appear to reflect, based on how the rankings track against known model capabilities:
- Reasoning tasks — multi-step problem solving, logical deduction
- Factual accuracy — avoiding hallucinations on verifiable claims
- Instruction following — precision in following complex, multi-constraint prompts
- Coding tasks — likely given the heavy coding model representation at the top (GPT-5.3-Codex, DeepSeek-V4-Pro as a code-capable model, Kimi-K2.7-Code in the rankings)
The scores do not capture every dimension. They will not tell you which model is best for creative writing, long-form content, multilingual tasks, or reasoning over private documents. For those use cases, targeted evaluation against your specific tasks is more informative than any general leaderboard.
The Price-Performance View
If you plot score against price, two positions stand out:
Best absolute performance: Claude Opus 4-5 at 69 ($5/$25) — but at a significant premium.
Best price-performance: DeepSeek-V4-Pro at 65 ($0.28/$0.42), with near-top performance at roughly 18x lower input cost and 60x lower output cost.
The Claude Sonnet 4-6 at #13 (score 54, $3/$15) is interesting because it is the same model that currently powers Claude Code as the default, and is one of the most widely used models in enterprise production. A score of 54 on StupidMeter does not mean it is a bad choice — it means that for the specific task mix StupidMeter evaluates, the Opus tier outperforms it substantially. For many production workloads, a score of 54 at $3/$15 is the right trade.
Kimi-K2.7-Code at $0.60/$2.50 for a score of 55 is worth watching. Its name suggests a coding-specialized model; it comes from Moonshot AI, the Chinese AI lab behind the Kimi chatbot, and a score of 55 at 60 cents per million input tokens is competitive. Its VOLA rating is a concern for production use.
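One way to make "plot score against price" precise is a Pareto frontier: keep only the models that no other model beats on both score and price. The sketch below assumes a blended price with a 3:1 input-to-output token mix, which is an arbitrary illustration and not something StupidMeter reports; `blended_price` and `pareto_frontier` are hypothetical helper names.

```python
# (model, score, input USD per M tokens, output USD per M tokens), transcribed from the snapshot table.
SNAPSHOT = [
    ("CLAUDE-OPUS-4-5-20251101", 69, 5.00, 25.00),
    ("CLAUDE-OPUS-4-6", 67, 5.00, 25.00),
    ("GPT-5.3-CODEX", 65, 1.25, 10.00),
    ("DEEPSEEK-V4-PRO", 65, 0.28, 0.42),
    ("CLAUDE-SONNET-4-5-20250929", 64, 3.00, 15.00),
    ("GPT-5.4", 63, 1.25, 10.00),
    ("DEEPSEEK-V4-FLASH", 62, 0.28, 0.42),
    ("GEMINI-3.1-PRO-PREVIEW", 58, 2.00, 12.00),
    ("GPT-5.5", 57, 1.25, 10.00),
    ("CLAUDE-OPUS-4-8", 57, 5.00, 25.00),
    ("CLAUDE-OPUS-4-7", 56, 5.00, 25.00),
    ("KIMI-K2.7-CODE", 55, 0.60, 2.50),
    ("CLAUDE-SONNET-4-6", 54, 3.00, 15.00),
    ("GEMINI-3.1-FLASH-LITE", 50, 0.10, 0.40),
]

INPUT_SHARE = 0.75  # assumed: three input tokens for every output token


def blended_price(price_in, price_out, input_share=INPUT_SHARE):
    """Weighted average price per million tokens for the assumed traffic mix."""
    return input_share * price_in + (1 - input_share) * price_out


def pareto_frontier(rows):
    """Return models not dominated on (higher score, lower blended price), cheapest first."""
    points = [(name, score, blended_price(p_in, p_out)) for name, score, p_in, p_out in rows]
    frontier = []
    for name, score, price in points:
        dominated = any(
            o_score >= score and o_price <= price and (o_score > score or o_price < price)
            for _, o_score, o_price in points
        )
        if not dominated:
            frontier.append((name, score, price))
    return sorted(frontier, key=lambda point: point[2])


frontier = pareto_frontier(SNAPSHOT)
for name, score, price in frontier:
    print(f"{name}: score {score}, blended ${price:.3f}/M")
```

Under this assumed mix, three models survive: Gemini-3.1-Flash-Lite as the cheapest, DeepSeek-V4-Pro in the middle, and Claude Opus 4-5 at the top. Every other model on the list is matched or beaten on both axes by one of those three. A different token mix, or a hard requirement such as stability or data residency, would change the picture.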
What StupidMeter Does Not Measure
Worth being explicit about what this leaderboard does not tell you:
Context window size — a 200K context model and a 2M context model could have the same score on task-level evaluation but perform very differently on long-document workloads.
Latency — speed is not reflected in the score. A model that scores 65 in 3 seconds and one that scores 65 in 15 seconds are equal on StupidMeter.
Privacy and data governance — whether your data leaves a jurisdiction when you call a model API is not a performance metric.
Modality — vision, audio, code execution capabilities are not captured in a text benchmark.
Price stability — the prices shown are point-in-time and change. DeepSeek has historically changed pricing; OpenAI and Anthropic do occasionally as well.
Use StupidMeter as a directional signal, not a definitive procurement tool.
The Site Worth Bookmarking
aistupidlevel.info updates hourly, which makes it genuinely useful for tracking model performance over time, particularly for models with ALERT tags (which appear to flag recent significant score changes, as seen on CLAUDE-OPUS-4-6, KIMI-K2.7-CODE, and GEMINI-3.1-FLASH-LITE in the current snapshot).
For developers choosing which model to default to in a new project, or enterprise teams evaluating which Claude/GPT/Gemini tier to standardize on, a live benchmark that updates around the clock without vendor PR spin is valuable. The leaderboard does not tell you the full story, but it tells you something real.
The leaderboard is at aistupidlevel.info. For AI model pricing breakdowns and production cost planning, see our guide on controlling Claude token costs.