What is Mercury 2 and who makes it?

Mercury 2 is a diffusion-based language model from Inception, a company focused on non-autoregressive generation architectures. It generates responses through parallel refinement — producing multiple tokens simultaneously and converging over a small number of refinement steps — rather than the standard left-to-right one-token-at-a-time approach. It reaches 1,009 tokens per second on NVIDIA Blackwell GPUs and is priced at $0.25/1M input and $0.75/1M output.

What is the difference between diffusion LLMs and autoregressive LLMs?

Autoregressive LLMs (GPT-4o, Claude, Llama) generate tokens one at a time from left to right, with each token depending on all previous tokens. Diffusion LLMs start with a noisy token sequence and iteratively refine the entire sequence simultaneously over multiple steps. NVIDIA's Nemotron-Labs-TwoTower (July 2026) adapts a pretrained Nemotron Nano backbone to decode block-wise in parallel at 2.42× autoregressive (AR) throughput. Mercury 2 uses a similar parallel-refinement idea at commercial API scale.

Is Mercury 2 quality competitive with GPT-4o or Claude?

Mercury 2 is positioned as competitive with "leading speed-optimized models" (models like Gemini Flash, GPT-4o-mini, and Llama-3.1 8B), not with frontier reasoning models. At its pricing and speed tier, it is strong. For tasks requiring deep reasoning, extended context comprehension, or frontier-level quality, autoregressive frontier models remain better. The Mercury 2 value proposition is quality comparable to other speed-tier models at more than 5x the generation speed.

When does the speed advantage actually matter?

When inference calls compound. In agent loops with 20+ LLM calls per task, a 5x speedup does more than cut waiting time to a fifth: it changes which tasks are economically worth automating. Voice interfaces benefit from sub-300ms response times. Code autocomplete needs to land before the developer moves on. Real-time transcript cleanup requires processing at speech rate. In single-turn chat, the user's reading speed is the bottleneck, so most of the 1,009 tok/sec goes unused.

How is Mercury 2 different from quantized autoregressive models?

Quantization (4-bit, 8-bit) reduces the precision of model weights to speed up computation, trading some quality for speed within the same autoregressive architecture. Mercury 2 uses a different generation algorithm entirely, diffusion, which produces tokens in parallel. The speed comes from the architecture, not from numerical approximation. The argument is that, at equivalent quality tiers, diffusion reaches speeds that quantized autoregressive models cannot match without significant quality degradation.

Mercury 2: 1,009 Tokens/Sec Diffusion LLM for Agents (2026) | explainx.ai Blog

Every agentic workflow is a latency multiplication problem.

If your agent makes 20 LLM calls per task, and each call takes 3 seconds for generation, that's 60 seconds of model waiting per task — before any actual work happens. Cut that to 0.6 seconds per call and you cut 60 seconds to 12. The task goes from slow to fast enough to rethink what's worth automating.

Mercury 2 — 1,009 tokens per second on NVIDIA Blackwell GPUs — is built for exactly that arithmetic.

But the speed is not from the usual tricks. It's from a fundamentally different generation algorithm.

What Diffusion LLMs Actually Do

Standard language models are autoregressive: they generate tokens one at a time, left to right. Token N depends on tokens 1 through N-1. The bottleneck is sequential — you cannot generate token 5 until you have tokens 1 through 4. More tokens = more time, linearly.

Diffusion models work differently. They were originally developed for image generation (Stable Diffusion, DALL-E 3), where the process starts with random noise and iteratively denoises it toward a coherent image. Each denoising step refines the entire image simultaneously.

Mercury 2 applies this approach to text:

Start with a noisy token sequence of the target length
Run a refinement pass over all positions simultaneously
Repeat for a small number of steps until the sequence converges
Output the result

The key distinction: token positions are refined in parallel, not generated sequentially. The speed advantage is architectural. It's not approximating a slower process — it's a different process with a different scaling behavior.
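The steps above can be sketched as a toy simulation. This is not Inception's algorithm, which the article does not describe in detail. It assumes a masked-token form of "noise", a hypothetical `toy_denoiser` that already knows the answer, and a schedule that commits the most confident positions each pass. The only point is the pass count: a fixed number of refinement passes versus one pass per token.

```python
import random

MASK = "<mask>"  # assumption: masked tokens stand in for the "noisy" starting sequence


def toy_denoiser(sequence, target, rng):
    """Hypothetical stand-in for one model pass: a (token, confidence) pair per position.

    A real model would predict tokens from context; this toy looks up the
    known target and attaches a random confidence score.
    """
    return [(target[i], rng.random()) for i in range(len(sequence))]


def parallel_refine(target, steps, seed=0):
    """Fill a fully masked sequence in a fixed number of parallel passes."""
    rng = random.Random(seed)
    sequence = [MASK] * len(target)
    passes = 0
    for step in range(steps):
        proposals = toy_denoiser(sequence, target, rng)  # one pass covers every position
        passes += 1
        masked = [i for i, token in enumerate(sequence) if token == MASK]
        remaining_steps = steps - step
        commit = -(-len(masked) // remaining_steps)  # ceiling division
        # Commit the most confident still-masked positions; the rest wait for a later pass.
        most_confident = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:commit]
        for i in most_confident:
            sequence[i] = proposals[i][0]
    return sequence, passes


def autoregressive(target):
    """Emit one token per pass, left to right."""
    sequence, passes = [], 0
    for token in target:
        sequence.append(token)
        passes += 1
    return sequence, passes


target = "the quick brown fox jumps over the lazy dog".split()
refined, refine_passes = parallel_refine(target, steps=3)
sequential, ar_passes = autoregressive(target)
print(refine_passes, ar_passes)  # prints: 3 9
```

In this toy, doubling the sequence length doubles the autoregressive pass count but leaves the refinement pass count unchanged. Each refinement pass does more work than a single autoregressive step, so pass counts alone do not predict wall-clock speed.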

Inception describes it as "less typewriter, more editor revising a full draft at once." The analogy is apt. An autoregressive model writes word by word. Mercury 2 starts with a draft and edits the whole thing simultaneously until it's coherent.

The Numbers

Metric	Value
Generation speed	1,009 tokens/sec (NVIDIA Blackwell)
Speed advantage vs autoregressive	>5x
Input pricing	$0.25/1M tokens
Output pricing	$0.75/1M tokens
Context window	128K tokens
Tool use	Native
JSON output	Schema-aligned
Reasoning	Tunable

The pricing is competitive with speed-tier models. At $0.75/1M output tokens, Mercury 2 is priced in line with many fast, quantized models while, per the figures above, generating more than 5x faster.
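To turn the per-million prices into a per-task figure, a rough sketch follows. The prices come from the table above; the workload (20 calls, 2,000 input and 200 output tokens per call) is an assumed example, not a measurement.

```python
INPUT_USD_PER_MILLION = 0.25   # from the pricing table
OUTPUT_USD_PER_MILLION = 0.75  # from the pricing table


def request_cost_usd(input_tokens, output_tokens):
    """Cost of one request at the listed per-million-token prices."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


def task_cost_usd(calls, input_tokens_per_call, output_tokens_per_call):
    """Cost of an agent task made of identical calls (a simplifying assumption)."""
    return calls * request_cost_usd(input_tokens_per_call, output_tokens_per_call)


# Assumed workload: 20 calls, each with 2,000 input tokens and 200 output tokens.
print(f"${task_cost_usd(20, 2_000, 200):.4f} per task")
```

In agent loops the input side often dominates, because each call tends to resend the accumulated context.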

Why Speed Compounding Changes the Calculus

For a single LLM call, 1,009 tok/sec versus 200 tok/sec is noticeable but not transformative.

A multi-step agent loop changes the math:

Scenario	200 tok/sec	1,009 tok/sec
Single 200-token response	1.0 sec	~0.2 sec
10-step agent loop (200 tokens/step)	10 sec	~2 sec
50-step agent loop (200 tokens/step)	50 sec	~10 sec
Real-time voice transcript (continuous)	Falls behind	Keeps up
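The timing rows in the table follow from one formula: steps × tokens per step ÷ tokens per second. A minimal sketch, counting generation time only and ignoring network latency, prompt processing, and tool execution:

```python
def generation_seconds(steps, tokens_per_step, tokens_per_second):
    """Total time spent generating tokens across a loop of identical steps."""
    return steps * tokens_per_step / tokens_per_second


scenarios = [("single response", 1), ("10-step loop", 10), ("50-step loop", 50)]
for label, steps in scenarios:
    slow = generation_seconds(steps, 200, 200)
    fast = generation_seconds(steps, 200, 1009)
    print(f"{label:16s} {slow:5.1f} s at 200 tok/sec   {fast:5.1f} s at 1,009 tok/sec")
```

The ratio between the two columns stays fixed at about 5x; what grows with the number of steps is the absolute time saved.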

The speed advantage doesn't save time uniformly. It saves time in proportion to how many inference calls you stack. This makes Mercury 2 specifically valuable for:

Coding tools: Autocomplete and next-edit suggestions need to land before the developer moves on. A suggestion that takes 2 seconds arrives after the developer has already typed ahead. At 1,009 tok/sec, generating a short completion of a few dozen tokens takes tens of milliseconds, before network overhead.

Agent loops: Agentic workflows that chain dozens of inference calls per task benefit more from Mercury 2 than any other use case. Not just because it's faster, but because faster loops enable more steps within the same latency budget — better quality through more iteration.

Voice interfaces: Voice pipelines have the tightest latency budget in AI — natural speech cadence allows about 200ms between turns before the pause becomes noticeable. Mercury 2's speed makes reasoning-quality responses viable within that window.

RAG pipelines: Multi-hop retrieval, reranking, and summarization latencies stack. Adding reasoning to the search loop — without blowing the latency budget — becomes possible at 1009 tok/sec.
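The "more steps within the same latency budget" point can be made concrete with a small sketch. It assumes every step generates the same number of tokens and counts generation time only; the 10-second budget and 200 tokens per step are illustrative values.

```python
def steps_within_budget(budget_seconds, tokens_per_step, tokens_per_second):
    """Whole generation steps that fit in a latency budget (generation time only)."""
    tokens_available = budget_seconds * tokens_per_second
    return int(tokens_available // tokens_per_step)


budget_seconds = 10      # assumed latency budget for one task
tokens_per_step = 200    # assumed output size of each step

print(steps_within_budget(budget_seconds, tokens_per_step, 200))   # prints: 10
print(steps_within_budget(budget_seconds, tokens_per_step, 1009))  # prints: 50
```

Under these assumptions the same budget covers five times as many loop iterations, which is where the extra room for retries, verification passes, or additional retrieval hops comes from.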

What the Quality Tier Actually Is

Inception positions Mercury 2 as competitive with "leading speed-optimized models." That's the honest bracket: not frontier reasoning models (Claude Opus 4.8, GPT-5.5) but fast models such as Gemini Flash, GPT-4o-mini, or Llama-3.1 8B.

What this means practically:

Use Case	Mercury 2 fit
Code autocomplete	Strong — speed is the primary value
Agent loop reasoning (non-critical)	Strong
Voice response generation	Strong
RAG summarization	Strong
Frontier reasoning (complex math, code)	Not the right tool
Long-horizon planning	Not the right tool
Deep analysis requiring extended context comprehension	Depends — test it

The tunable reasoning feature (the reasoning_effort parameter in the OpenAI-compatible API) lets you trade some speed for more reasoning quality within Mercury 2 itself, which expands the range of applicable use cases.
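A sketch of how reasoning_effort might be set per call. The helper `build_chat_request` is hypothetical, the value "high" is borrowed from OpenAI's own API and may not match what Inception accepts, and the model ID is the one used in this article's API snippet; check Inception's documentation for the supported values and models.

```python
def build_chat_request(prompt, reasoning_effort=None, model="mercury-coder-small"):
    """Assemble keyword arguments for client.chat.completions.create().

    reasoning_effort is left out entirely when not given, so the server
    default applies.
    """
    request = {
        "model": model,  # ID taken from the article's snippet; may differ for your account
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning_effort is not None:
        request["reasoning_effort"] = reasoning_effort  # accepted values are an assumption
    return request


# High-frequency loop step: keep the default and favor speed.
fast = build_chat_request("Rename this variable consistently.")

# Occasional harder step: spend more time on reasoning.
careful = build_chat_request("Plan the next three tool calls.", reasoning_effort="high")

# With a configured OpenAI-compatible client:
#   response = client.chat.completions.create(**careful)
print(careful)
```

Keeping the setting per request, rather than per client, lets one agent loop mix fast and careful calls.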

Real-World Validation

The most meaningful signal is who is using it:

Zed editor (Max Brunsfeld, Co-Founder): "Suggestions land fast enough to feel like part of your own thinking, not something you have to wait for." — The autocomplete use case where speed determines whether the tool is useful at all.

Skyvern (Suchintan Singh, CTO): "Mercury 2 is at least twice as fast as GPT-5.2, which is a game changer for us." — Agent automation where generation speed compounds across task steps.

Wispr Flow (Sahaj Garg, CTO): "No other model has come close to the speed Mercury can provide!" — Real-time transcript cleanup that must run at speech rate.

OpenCall (Oliver Silverstein, CEO): "Mercury 2 quality is excellent, and the model's low latency enables more responsive voice agents." — Voice agents where response delay destroys the conversational feel.

The pattern: every validated use case involves either real-time interaction (voice, autocomplete) or agentic loops where generation calls compound. These are the cases where the speed advantage is load-bearing, not marginal.

The OpenAI-Compatible API

Mercury 2 exposes an OpenAI-compatible API:

python

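from openai import OpenAI

# Point the standard OpenAI client at Inception's endpoint.
client = OpenAI(
    api_key="your_inception_api_key",  # placeholder; load the real key from your environment
    base_url="https://api.inceptionlabs.ai/v1",
)

# Same chat-completions call shape as with OpenAI; only the model name changes.
response = client.chat.completions.create(
    model="mercury-coder-small",
    messages=[{"role": "user", "content": "Write a Python function to parse JSON safely"}],
)
print(response.choices[0].message.content)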

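For existing OpenAI API integrations this is close to a drop-in replacement: change the base URL, the API key, and the model name, and the rest of the calling code stays the same.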

Available models:

mercury-coder-small — fastest, best for autocomplete and short tasks
Standard model — balanced quality/speed for most agent use cases

When to Use Mercury 2 vs Frontier Models

The decision is not "is Mercury 2 good?" It is "does this use case need Mercury 2's specific advantage?"

Use Mercury 2 when:

Your pipeline has 10+ chained inference calls
Real-time responsiveness is required (voice, autocomplete)
You're optimizing for throughput at scale with speed-tier quality requirements
Latency is a hard constraint, not a preference

Use frontier models (Claude, GPT-5.5) when:

Reasoning depth matters more than speed
The task is a single, complex prompt — not a loop
You need the best quality output, not the fastest adequate output
Generated code needs to be correct, not just fast

Many production systems will end up using both: Mercury 2 for the high-frequency loop steps that don't need frontier quality, and frontier models for the final synthesis or critical reasoning steps.
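One way to express that split is a small routing function. This is a sketch: the `Step` fields and the frontier model name are hypothetical placeholders, and the fast model ID is the one from this article's API snippet.

```python
from dataclasses import dataclass

FAST_MODEL = "mercury-coder-small"             # ID from the article's API snippet
FRONTIER_MODEL = "frontier-model-placeholder"  # hypothetical; substitute your frontier model


@dataclass
class Step:
    name: str
    needs_deep_reasoning: bool = False
    is_final_synthesis: bool = False


def pick_model(step):
    """Send critical steps to the frontier model and everything else to the fast model."""
    if step.needs_deep_reasoning or step.is_final_synthesis:
        return FRONTIER_MODEL
    return FAST_MODEL


pipeline = [
    Step("extract fields from page"),
    Step("choose next tool call"),
    Step("resolve conflicting evidence", needs_deep_reasoning=True),
    Step("write final report", is_final_synthesis=True),
]
for step in pipeline:
    print(f"{step.name}: {pick_model(step)}")
```

The flags here are set by hand. In practice the hard part is deciding which steps count as critical, and that is worth measuring on your own tasks rather than guessing.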

Getting Started

Try Mercury 2 at chat.inceptionlabs.ai or via API at api.inceptionlabs.ai. The API is OpenAI-compatible — replace api.openai.com with the Inception endpoint and update the API key.

NVIDIA Nemotron-Labs-TwoTower: 2.42× Diffusion Decode — open Nemotron Nano retrofit (July 2026)
AI models directory — full landscape of language models including speed comparisons
AI agent tools — autonomous agent tools that benefit from fast inference
AI skills registry — reusable skills for agent pipelines
