What is aistupidlevel.info (StupidMeter)?

aistupidlevel.info is a website that runs StupidMeter, a real-time benchmark leaderboard for AI language models. It assigns each model a "stupid level" score, where a higher score means the model is less stupid (i.e., more capable). Scores are updated hourly based on live model evaluations. The dashboard shows each model's current score, its trend direction (rising, falling, or flat), a stability tag (STBL for stable, VOLA for volatile scores), and the model's pricing per million tokens. As of June 2026, 1,300+ visitors check the dashboard daily.

Which AI model scores highest on StupidMeter?

As of June 24, 2026, Claude Opus 4-5-20251101 by Anthropic leads with a score of 69. Claude Opus 4-6 is second at 67. GPT-5.3-Codex and DeepSeek-V4-Pro are tied third at 65. Claude Sonnet 4-5-20250929 is fifth at 64. The top 5 are split between Anthropic (3 models), OpenAI (1), and DeepSeek (1), reflecting how tight the frontier has become.

What does the "stupid level" score mean?

The stupid level is a composite performance score derived from a set of tasks the site evaluates models on regularly. The exact methodology is not fully public, but the scoring appears to reflect a mix of reasoning, factual accuracy, coding, and instruction-following tasks. The current leader scores 69, and the 14th-ranked model in the June 24, 2026 snapshot scores 50. The name is tongue-in-cheek: a higher stupid level means the model is less stupid (performs better). The STBL/VOLA tags indicate whether a model's score is consistent or swings significantly between evaluation runs.

What does STBL vs VOLA mean on StupidMeter?

STBL (stable) means the model's scores are consistent across multiple evaluation runs — the score you see is reliable and repeatable. VOLA (volatile) means the model's scores fluctuate significantly between runs, which can indicate inconsistency in the model's outputs, nondeterminism at high temperature, or sensitivity to how prompts are phrased. A STBL rating at the top of the leaderboard is more meaningful than a VOLA rating because it shows the performance is reliable, not just lucky on a particular run.

How does DeepSeek compare to Claude and GPT on StupidMeter?

DeepSeek performs impressively on StupidMeter relative to its pricing. DeepSeek-V4-Pro scores 65 (tied third) at just $0.28/$0.42 per million tokens. Claude Opus costs $5/$25, and only Opus 4-5 (69) and Opus 4-6 (67) score higher; Opus 4-7 and 4-8 score lower. DeepSeek-V4-Flash scores 62 at the same $0.28/$0.42 pricing. This reflects the broader market reality: DeepSeek has closed most of the capability gap while undercutting Claude Opus pricing by roughly 18x on input and 60x on output, making it the dominant price-performance choice for cost-sensitive production workloads.

StupidMeter: AI Model Live Benchmark Leaderboard [2026] | explainx.ai Blog

There is a live leaderboard at aistupidlevel.info that tracks which AI model is currently the least stupid.

The site is called StupidMeter. It updates hourly. It does not grade on a curve, does not take PR relationships into account, and does not accept sponsored rankings. It just runs evaluations, publishes scores, and lets the numbers say what the numbers say.

As of June 24, 2026: Claude Opus 4-5-20251101 is number one with a score of 69. The next thirteen positions include models from Anthropic, OpenAI, Google, DeepSeek, and Kimi — a lineup that reflects just how compressed the frontier has become.

What StupidMeter Is

aistupidlevel.info is a live AI model benchmark dashboard with a deliberately irreverent brand. The core mechanic: assign every major AI model a "stupid level" score, updated every hour, and rank them top to bottom.

Higher score = less stupid = better.

The dashboard shows you:

Current score — the model's composite performance number
Trend arrow — whether the score is rising (▲), falling (▼), or flat (→) since last update
Stability tag — STBL (stable, consistent performance) or VOLA (volatile, inconsistent performance)
Update interval — how frequently the score is recalculated (all models show 1h)
Pricing — input/output cost per million tokens in USD
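
To make those fields concrete, here is a minimal Python sketch of one dashboard row as a record. The `LeaderboardEntry` class, the `parse_row` helper, and the tab-separated input format are assumptions for illustration (they mirror the snapshot table in this post); the site is not known to expose this structure or an API.

```python
from dataclasses import dataclass

# Map the dashboard's trend arrows to readable labels.
TRENDS = {"▲": "rising", "▼": "falling", "→": "flat"}


@dataclass(frozen=True)
class LeaderboardEntry:
    rank: int
    model: str
    company: str
    score: int
    stability: str         # "STBL" or "VOLA"
    trend: str             # "rising", "falling", or "flat"
    input_usd: float       # USD per million input tokens
    output_usd: float      # USD per million output tokens
    update_hours: int = 1  # every model in the snapshot shows 1h


def parse_row(row: str) -> LeaderboardEntry:
    """Parse one tab-separated row copied from the snapshot table."""
    rank, model, company, score, status, price = row.split("\t")
    stability, arrow = status.split()
    input_usd, output_usd = (float(p.lstrip("$")) for p in price.split("/"))
    return LeaderboardEntry(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        score=int(score),
        stability=stability,
        trend=TRENDS[arrow],
        input_usd=input_usd,
        output_usd=output_usd,
    )


entry = parse_row("#1\tCLAUDE-OPUS-4-5-20251101\tAnthropic\t69\tSTBL →\t$5/$25")
print(entry)
```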

As of today, 1,336 people checked the dashboard in the current session window. It is pulling daily visitors at a clip that suggests it has become part of the routine checking cycle for developers and AI buyers who care about which model is actually performing right now.

The Current Leaderboard Breakdown

Here are the top 14 models as of the June 24, 2026 dashboard snapshot:

Rank	Model	Company	Score	Stability / Trend	Price (USD in/out per 1M tokens)
#1	CLAUDE-OPUS-4-5-20251101	Anthropic	69	STBL →	$5/$25
#2	CLAUDE-OPUS-4-6	Anthropic	67	STBL ▼	$5/$25
#3	GPT-5.3-CODEX	OpenAI	65	STBL ▲	$1.25/$10
#4	DEEPSEEK-V4-PRO	DeepSeek	65	STBL →	$0.28/$0.42
#5	CLAUDE-SONNET-4-5-20250929	Anthropic	64	VOLA →	$3/$15
#6	GPT-5.4	OpenAI	63	VOLA ▼	$1.25/$10
#7	DEEPSEEK-V4-FLASH	DeepSeek	62	VOLA →	$0.28/$0.42
#8	GEMINI-3.1-PRO-PREVIEW	Google	58	VOLA →	$2/$12
#9	GPT-5.5	OpenAI	57	VOLA →	$1.25/$10
#10	CLAUDE-OPUS-4-8	Anthropic	57	VOLA ▼	$5/$25
#11	CLAUDE-OPUS-4-7	Anthropic	56	VOLA →	$5/$25
#12	KIMI-K2.7-CODE	Kimi	55	VOLA →	$0.60/$2.50
#13	CLAUDE-SONNET-4-6	Anthropic	54	STBL →	$3/$15
#14	GEMINI-3.1-FLASH-LITE	Google	50	VOLA ▲	$0.1/$0.4

Five things jump out immediately.

Five Things the Leaderboard Reveals

1. Anthropic Has a Lot of Models in the Top 13

Six of the first thirteen spots go to Anthropic models. That is not just dominance; it is fragmentation. Claude Opus 4-5, 4-6, 4-7, and 4-8 all appear in the top 11. Claude Sonnet 4-5 and 4-6 are both present. This reflects Anthropic's rapid release cadence: they are shipping model versions quickly enough that multiple generations coexist in the active API ecosystem.

For buyers, this is also a complexity problem. Knowing which Claude to use requires more due diligence than it did a year ago.
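
A quick way to check the per-company split is to tally the Company column for ranks 1 through 13. The sketch below hard-codes that column from the snapshot table; the variable names are illustrative.

```python
from collections import Counter

# Company column for ranks 1-13, copied in rank order from the snapshot table.
TOP_13_COMPANIES = [
    "Anthropic", "Anthropic", "OpenAI", "DeepSeek", "Anthropic",
    "OpenAI", "DeepSeek", "Google", "OpenAI", "Anthropic",
    "Anthropic", "Kimi", "Anthropic",
]

counts = Counter(TOP_13_COMPANIES)
for company, n in counts.most_common():
    print(f"{company}: {n}")
# Anthropic: 6, OpenAI: 3, DeepSeek: 2, Google: 1, Kimi: 1
```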

2. DeepSeek is Competitive at a Fraction of the Price

DeepSeek-V4-Pro scores 65 — tied for third with GPT-5.3-Codex — at $0.28 per million input tokens and $0.42 per million output tokens.

Claude Opus 4-5, the number one model, costs $5/$25 per million tokens. That is roughly 18x more expensive on input and 60x more expensive on output for 4 points of score difference.
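
The multiples come straight from dividing the listed prices. A short Python check, using only the numbers in the snapshot table:

```python
# USD per million tokens, from the June 24, 2026 snapshot.
opus_in, opus_out = 5.00, 25.00          # Claude Opus 4-5
deepseek_in, deepseek_out = 0.28, 0.42   # DeepSeek-V4-Pro

input_ratio = opus_in / deepseek_in      # how many times pricier on input
output_ratio = opus_out / deepseek_out   # how many times pricier on output
score_gap = 69 - 65                      # leaderboard points separating them

print(f"input: {input_ratio:.1f}x, output: {output_ratio:.1f}x, gap: {score_gap} points")
# input: 17.9x, output: 59.5x, gap: 4 points
```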

DeepSeek-V4-Flash scores 62 at the same $0.28/$0.42 pricing, though it is tagged VOLA where V4-Pro is STBL. That is two competitive DeepSeek models at a price point that makes them the obvious pick for cost-sensitive workloads.

This is the DeepSeek story in one dashboard screenshot: they have closed the intelligence gap while obliterating the price gap. The question for enterprise buyers is no longer whether DeepSeek can do the task — it is whether your data governance policy allows sending data to a Chinese AI provider.

3. GPT-5 Is Splintered Too

OpenAI has GPT-5.3-Codex (#3 at 65), GPT-5.4 (#6 at 63), and GPT-5.5 (#9 at 57) all in the top 9. GPT-5.3-Codex is stable and rising. GPT-5.4 is volatile and falling. GPT-5.5 is volatile and flat.

The GPT-5 family appears to be iterating rapidly with different specializations. GPT-5.3-Codex's name suggests a coding focus — consistent with OpenAI doubling down on developer workloads after the Codex product line. GPT-5.4 and 5.5 appear to be general versions, with 5.5 performing worse than 5.4 on this benchmark set despite being a higher version number.

Version numbers can mislead: newer does not always mean smarter on every task.

4. STBL vs VOLA Is the Hidden Signal

The top two models — Claude Opus 4-5 and Claude Opus 4-6 — are both STBL. The number three and four models — GPT-5.3-Codex and DeepSeek-V4-Pro — are also STBL.

From rank 5 down, almost everything flips to VOLA; the only exception is Claude Sonnet 4-6 at #13.

Stability matters for production use. A VOLA model that averages 63 might score 70 on a good run and 56 on a bad one. If you are routing real customer requests through it, that variance shows up as inconsistent output quality. A STBL model at 65 that reliably delivers 64-66 is more deployable than a VOLA model that averages 65 with swings into the 50s.
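
StupidMeter does not publish how it assigns STBL or VOLA, so the following is only a sketch of one plausible rule: label a model by the standard deviation of its recent run scores. The `classify` function, the 2-point threshold, and the sample run histories are all hypothetical.

```python
from statistics import mean, pstdev


def classify(scores: list[float], threshold: float = 2.0) -> str:
    """Return "STBL" if run-to-run spread is within the threshold, else "VOLA".

    The threshold is an illustrative assumption, not the site's actual rule.
    """
    return "STBL" if pstdev(scores) <= threshold else "VOLA"


# Hypothetical run histories matching the example in the text.
stable_runs = [64, 65, 66, 65, 65]    # averages 65, stays within 64-66
volatile_runs = [70, 56, 68, 58, 63]  # averages 63, swings from 56 to 70

for name, runs in [("stable", stable_runs), ("volatile", volatile_runs)]:
    print(f"{name}: mean={mean(runs):.1f} stdev={pstdev(runs):.2f} -> {classify(runs)}")
```

Two models can share nearly the same mean and still land on opposite sides of a rule like this, which is why the tag carries information the score alone does not.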

The four stable top performers — Claude Opus 4-5, Claude Opus 4-6, GPT-5.3-Codex, and DeepSeek-V4-Pro — are the ones enterprise buyers should be looking at first for consistency-sensitive workloads.

5. Google Is Underrepresented at the Top

Gemini-3.1-Pro-Preview sits at #8 with a score of 58 — VOLA, $2/$12 per million tokens. Gemini-3.1-Flash-Lite is at #14 with a 50, also VOLA, but at $0.1/$0.4 — the cheapest model visible on the leaderboard.

Google's model lineup does not appear in the top 7 at all. This is worth noting because Gemini models have typically performed well on academic benchmarks. Either StupidMeter's evaluation tasks are measuring something that Gemini is not optimized for, or Gemini's performance on practical tasks trails its performance on standardized tests — a pattern that has appeared in other independent evaluations.

The Gemini-3.1-Flash-Lite at $0.1/$0.4 is interesting for cost reasons. At that price point, even mediocre capability (score 50) opens use cases that no other model on this list can serve economically.

How to Read the Scoring

The StupidMeter score is not a traditional benchmark number. It is not SWE-bench, MMLU, or HumanEval. The site does not fully publish its methodology, which is worth acknowledging when using the scores for procurement decisions.

What the scores appear to reflect, based on how the rankings track against known model capabilities:

Reasoning tasks — multi-step problem solving, logical deduction
Factual accuracy — avoiding hallucinations on verifiable claims
Instruction following — precision in following complex, multi-constraint prompts
Coding tasks — likely given the heavy coding model representation at the top (GPT-5.3-Codex, DeepSeek-V4-Pro as a code-capable model, Kimi-K2.7-Code in the rankings)

The scores do not capture every dimension. They will not tell you which model is best for creative writing, long-form content, multilingual tasks, or reasoning over private documents. For those use cases, targeted evaluation against your specific tasks is more informative than any general leaderboard.
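
A targeted evaluation can be as simple as a list of your own prompts paired with pass/fail checks. The sketch below is a minimal Python harness under stated assumptions: `pass_rate`, the two sample cases, and `stub_model` are illustrative, and `stub_model` stands in for whatever real API call you would make.

```python
from typing import Callable

# Each case pairs a prompt with a checker that decides whether an answer passes.
Case = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: list[Case]) -> float:
    """Run every prompt through the model and return the fraction that pass."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


# Hypothetical cases; replace these with prompts from your own workload.
cases: list[Case] = [
    ("What is 17 + 25? Reply with the number only.", lambda a: a.strip() == "42"),
    ("Reply with the word OK in uppercase.", lambda a: a.strip() == "OK"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers '42'."""
    return "42"


print(pass_rate(stub_model, cases))  # 0.5
```

Swapping `stub_model` for a function that calls each candidate model gives a per-model pass rate on the tasks you actually care about.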

The Price-Performance View

If you plot score against price, two positions stand out:

Best absolute performance: Claude Opus 4-5 at 69 ($5/$25), but at a significant premium.

Best price-performance: DeepSeek-V4-Pro at 65 ($0.28/$0.42), with near-top performance at roughly 18x lower input cost and 60x lower output cost than Claude Opus 4-5.

The Claude Sonnet 4-6 at #13 (score 54, $3/$15) is interesting because it is the same model that currently powers Claude Code as the default, and is one of the most widely used models in enterprise production. A score of 54 on StupidMeter does not mean it is a bad choice — it means that for the specific task mix StupidMeter evaluates, the Opus tier outperforms it substantially. For many production workloads, a score of 54 at $3/$15 is the right trade.

Kimi-K2.7-Code at $0.60/$2.50 for a score of 55 is worth watching. Its name suggests a coding-specialized model from Moonshot AI (the Chinese AI lab behind the Kimi chatbot), and a score of 55 at 60 cents per million input tokens is competitive. Its VOLA rating is a concern for production use.
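
One way to make the score-versus-price view concrete is a Pareto frontier: keep only the models that no cheaper model matches or beats on score. The sketch below uses the snapshot data; the 3:1 input-to-output token mix behind `blended_price` is an assumption, and your workload's mix may shift the blended prices.

```python
# (model, score, input USD/M, output USD/M) from the June 24, 2026 snapshot.
MODELS = [
    ("CLAUDE-OPUS-4-5-20251101", 69, 5.00, 25.00),
    ("CLAUDE-OPUS-4-6", 67, 5.00, 25.00),
    ("GPT-5.3-CODEX", 65, 1.25, 10.00),
    ("DEEPSEEK-V4-PRO", 65, 0.28, 0.42),
    ("CLAUDE-SONNET-4-5-20250929", 64, 3.00, 15.00),
    ("GPT-5.4", 63, 1.25, 10.00),
    ("DEEPSEEK-V4-FLASH", 62, 0.28, 0.42),
    ("GEMINI-3.1-PRO-PREVIEW", 58, 2.00, 12.00),
    ("GPT-5.5", 57, 1.25, 10.00),
    ("CLAUDE-OPUS-4-8", 57, 5.00, 25.00),
    ("CLAUDE-OPUS-4-7", 56, 5.00, 25.00),
    ("KIMI-K2.7-CODE", 55, 0.60, 2.50),
    ("CLAUDE-SONNET-4-6", 54, 3.00, 15.00),
    ("GEMINI-3.1-FLASH-LITE", 50, 0.10, 0.40),
]

INPUT_SHARE = 0.75  # assumption: three input tokens for every output token


def blended_price(input_usd: float, output_usd: float) -> float:
    """Weighted USD cost per million tokens under the assumed token mix."""
    return INPUT_SHARE * input_usd + (1 - INPUT_SHARE) * output_usd


def pareto_frontier(models):
    """Return models that score strictly higher than every cheaper-or-equal model."""
    # Cheapest first; at equal price, the higher score goes first.
    ranked = sorted(models, key=lambda m: (blended_price(m[2], m[3]), -m[1]))
    frontier, best_score = [], float("-inf")
    for name, score, _, _ in ranked:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


print(pareto_frontier(MODELS))
# ['GEMINI-3.1-FLASH-LITE', 'DEEPSEEK-V4-PRO', 'CLAUDE-OPUS-4-5-20251101']
```

Under this blend, the frontier is the cheapest model, the best price-performance model, and the top scorer; everything else is matched or beaten on score by something that costs the same or less.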

What StupidMeter Does Not Measure

Worth being explicit about what this leaderboard does not tell you:

Context window size — a 200K context model and a 2M context model could have the same score on task-level evaluation but perform very differently on long-document workloads.

Latency — speed is not reflected in the score. A model that scores 65 in 3 seconds and one that scores 65 in 15 seconds are equal on StupidMeter.

Privacy and data governance — whether your data leaves a jurisdiction when you call a model API is not a performance metric.

Modality — vision, audio, code execution capabilities are not captured in a text benchmark.

Price stability — the prices shown are point-in-time and change. DeepSeek has historically changed pricing; OpenAI and Anthropic do occasionally as well.

Use StupidMeter as a directional signal, not a definitive procurement tool.

The Site Worth Bookmarking

aistupidlevel.info updates hourly, which makes it genuinely useful for tracking model performance over time, particularly for models with ALERT tags (which appear to flag recent significant score changes, as seen on CLAUDE-OPUS-4-6, KIMI-K2.7-CODE, and GEMINI-3.1-FLASH-LITE in the current snapshot).

For developers choosing which model to default to in a new project, or enterprise teams evaluating which Claude/GPT/Gemini tier to standardize on, having a live benchmark that updates around the clock without a vendor PR spin is valuable. The leaderboard does not tell you the full story — but it tells you something real.

The leaderboard is at aistupidlevel.info. For AI model pricing breakdowns and production cost planning, see our guide on controlling Claude token costs.
