What is VibeThinker-3B?

VibeThinker-3B is a compact 3-billion-parameter language model described in an arXiv preprint (2606.16140, submitted June 15, 2026) that reaches frontier-level performance on verifiable reasoning tasks. It scores 94.3 on AIME 2026 (97.1 with claim-level test-time scaling), 80.2 Pass@1 on LiveCodeBench v6, and 96.1% on recent unseen LeetCode contests. On AIME 2026 the base score sits within about one point of DeepSeek V3.2 (94.6) and GLM-5 (95.3), and the test-time-scaled score exceeds both, though Gemini 3 Pro (98.2) remains ahead.

How did they train VibeThinker-3B?

With the Spectrum-to-Signal post-training paradigm, which has three stages: curriculum-based supervised fine-tuning (starting with simpler tasks and progressing to harder ones), multi-domain reinforcement learning with verifiable rewards, and offline self-distillation (training the model on its own improved outputs). The combination pushes verifiable reasoning performance far beyond what standard SFT achieves on small models.
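
The three stages can be illustrated with a toy sketch. This is not the paper's implementation: the `Problem` type, the `difficulty` score, and the exact-match verifier are all assumptions made for illustration, and a real pipeline would use domain-specific checkers (unit tests for code, symbolic comparison for math).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    prompt: str
    answer: str        # reference answer the verifier checks against
    difficulty: float  # hypothetical score: 0.0 (easy) to 1.0 (hard)


def curriculum_order(problems: List[Problem]) -> List[Problem]:
    """Stage 1 (sketch): order SFT data from easiest to hardest."""
    return sorted(problems, key=lambda p: p.difficulty)


def verifiable_reward(problem: Problem, model_output: str) -> float:
    """Stage 2 (sketch): reward 1.0 only when the final line of the
    output exactly matches the reference answer, else 0.0."""
    lines = model_output.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == problem.answer.strip() else 0.0


def select_for_self_distillation(problem: Problem, samples: List[str]) -> List[str]:
    """Stage 3 (sketch): keep only the model's own samples that the
    verifier accepts, so they can be reused as training targets."""
    return [s for s in samples if verifiable_reward(problem, s) == 1.0]
```

The point of the sketch is that the same verifier serves both the reinforcement learning reward and the filter for self-distillation, which is only possible in domains where correctness can be checked automatically.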

What is the Parametric Compression-Coverage Hypothesis?

A hypothesis introduced in the VibeThinker-3B paper: verifiable reasoning (mathematics, coding, logic) is compressible into compact model cores and can be pushed to frontier levels even in 3B-parameter models. Open-domain knowledge and general-purpose competence (broad facts, long-tail scenarios, diverse world knowledge), by contrast, require broad parameter coverage and don't compress as well. Small models can match large models at reasoning; they can't match them at knowing things.

Does this mean small models are replacing large models?

No — but they are complementary in a newly understood way. Large models have advantages in general knowledge, long-tail tasks, and broad coverage. Small models can match or exceed them in structured, verifiable reasoning domains. The implication is that not every reasoning-intensive task needs a 100B+ parameter model — you can route reasoning-heavy subtasks to small specialist models while reserving large models for knowledge-intensive work.

What is "claim-level test-time scaling"?

A technique where at inference time, the model generates multiple candidate claims or reasoning steps, evaluates them for consistency and correctness, and selects the best. Applied to AIME 2026, it pushes VibeThinker-3B's score from 94.3 to 97.1 — without any training change. Test-time compute scaling is a growing area of research where inference-time reasoning improves results beyond what training alone achieves.
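
A minimal sketch of the general idea follows. It uses plain majority voting over final answers (often called self-consistency) as a stand-in; the paper's claim-level procedure is not described here and may differ substantially. The `sample` callable is a hypothetical wrapper around whatever model you are calling.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent non-empty answer among the candidates."""
    counts = Counter(a.strip() for a in answers if a.strip())
    if not counts:
        raise ValueError("no non-empty candidate answers")
    return counts.most_common(1)[0][0]


def answer_with_test_time_scaling(
    sample: Callable[[str], str],  # hypothetical: prompt -> final answer
    prompt: str,
    n: int = 8,
) -> str:
    """Draw n independent candidates and keep the consensus answer.
    More samples means more inference compute, with no change to weights."""
    candidates = [sample(prompt) for _ in range(n)]
    return majority_vote(candidates)
```

The trade-off is explicit in the `n` parameter: each extra candidate costs another forward pass, which is cheaper on a 3B model than on a much larger one.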

VibeThinker-3B: Frontier Reasoning at 3B Parameters (2026) | explainx.ai Blog


The most provocative result in the VibeThinker-3B paper is not the benchmark score. It is the implication.

A 3-billion-parameter model scores 94.3 on AIME 2026. DeepSeek V3.2 — a model orders of magnitude larger — scores 94.6 on the same benchmark. Gemini 3 Pro scores 98.2 (higher, but Gemini is a frontier closed model with unknown parameter count). GLM-5 scores 95.3.

VibeThinker-3B is matching frontier models on verifiable mathematical reasoning with less than 1% of their parameter count.

That's not a benchmark curiosity. It's a structural signal about how reasoning capability scales — and it doesn't scale the way most people assume.

The Benchmark Results in Context

VibeThinker-3B's reported performance, from the paper published June 15, 2026 (arXiv:2606.16140):

Benchmark	VibeThinker-3B	DeepSeek V3.2	Gemini 3 Pro	GLM-5
AIME 2026	94.3 (97.1 w/ TTS)	94.6	98.2	95.3
LiveCodeBench v6	80.2 Pass@1	—	—	—
LeetCode (recent, unseen)	96.1% acceptance	—	—	—

Task Type	Capability Required	Right Model Size
Math problem solving	Reasoning	Small specialist (3B) can match frontier
Code review with test-driven verification	Reasoning	Small specialist competitive
General world knowledge questions	Knowledge	Large model required
Long-tail factual retrieval	Knowledge	Large model required
Code generation (general)	Mixed	Test it — depends on specifics
Analysis of proprietary data	Mixed	Depends on context length + reasoning depth
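
The table above can be turned into a simple routing rule. The sketch below is an illustration only: the task keys, the two model names, and the choice to send mixed tasks to the large model by default are all assumptions, not anything specified in the paper.

```python
from enum import Enum


class Capability(Enum):
    REASONING = "reasoning"
    KNOWLEDGE = "knowledge"
    MIXED = "mixed"


# Hypothetical model identifiers; substitute your own deployments.
SMALL_SPECIALIST = "small-specialist-3b"
LARGE_GENERALIST = "large-generalist"

# Task types from the table, mapped to the capability they mainly need.
TASK_CAPABILITY = {
    "math_problem_solving": Capability.REASONING,
    "code_review_with_tests": Capability.REASONING,
    "general_world_knowledge": Capability.KNOWLEDGE,
    "long_tail_factual_retrieval": Capability.KNOWLEDGE,
    "general_code_generation": Capability.MIXED,
    "proprietary_data_analysis": Capability.MIXED,
}


def route(task_type: str, mixed_default: str = LARGE_GENERALIST) -> str:
    """Pick a model for a task; unknown task types are treated as mixed."""
    capability = TASK_CAPABILITY.get(task_type, Capability.MIXED)
    if capability is Capability.REASONING:
        return SMALL_SPECIALIST
    if capability is Capability.KNOWLEDGE:
        return LARGE_GENERALIST
    # Mixed tasks: the table says to test, so the choice is left configurable.
    return mixed_default
```

For example, `route("math_problem_solving")` returns the small specialist, while `route("general_code_generation", mixed_default=SMALL_SPECIALIST)` lets you try the small model on a mixed task and compare results before committing.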

94.3 on AIME 2026: VibeThinker-3B and the Case for Small Models With Frontier Reasoning

The Benchmark Results in Context

How They Did It: The Training Pipeline

1. Curriculum-Based Supervised Fine-Tuning

2. Multi-Domain Reinforcement Learning With Verifiable Rewards

3. Offline Self-Distillation

The Parametric Compression-Coverage Hypothesis

What This Means for How You Build AI Systems

Why VibeThinker-3B Is Not the Only Signal

The Limits of the Hypothesis

Test-Time Scaling Is Part of the Story

Reading the Paper