1,009 Tokens Per Second: Mercury 2 and What Diffusion LLMs Change for Agent Loops

Inception's Mercury 2 is a diffusion-based language model hitting 1,009 tokens per second on Blackwell GPUs — over 5x faster than autoregressive models at competitive quality. It is not fast inference via quantization or speculative decoding. It is a fundamentally different generation algorithm. Here is what that means for the AI applications where latency compounds.

7 min read · Yash Thakker
Tags: AI Models, LLM, AI Agents, Generative AI, Inference Speed

Every agentic workflow is a latency multiplication problem.

If your agent makes 20 LLM calls per task, and each call takes 3 seconds for generation, that's 60 seconds of model waiting per task — before any actual work happens. Cut that to 0.6 seconds per call and you cut 60 seconds to 12. The task goes from slow to fast enough to rethink what's worth automating.
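
The arithmetic above can be written out in a few lines. This is a minimal sketch: the function name `model_wait_seconds` is made up for this example, and it counts generation time only, assuming the calls run strictly one after another.

```python
def model_wait_seconds(calls_per_task: int, seconds_per_call: float) -> float:
    """Total time one task spends waiting on generation, assuming sequential calls."""
    return calls_per_task * seconds_per_call


slow = model_wait_seconds(20, 3.0)  # 20 calls at 3 s each
fast = model_wait_seconds(20, 0.6)  # the same 20 calls at 0.6 s each

print(f"{slow:.0f} s -> {fast:.0f} s per task")  # prints "60 s -> 12 s per task"
```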

Mercury 2 — 1,009 tokens per second on NVIDIA Blackwell GPUs — is built for exactly that arithmetic.

But the speed is not from the usual tricks. It's from a fundamentally different generation algorithm.


What Diffusion LLMs Actually Do

Standard language models are autoregressive: they generate tokens one at a time, left to right. Token N depends on tokens 1 through N-1. The bottleneck is sequential — you cannot generate token 5 until you have tokens 1 through 4. More tokens = more time, linearly.

Diffusion models work differently. They were originally developed for image generation (Stable Diffusion, DALL-E 3), where the process starts with random noise and iteratively denoises it toward a coherent image. Each denoising step refines the entire image simultaneously.

Mercury 2 applies this approach to text:

  1. Start with a noisy token sequence of the target length
  2. Run a refinement pass over all positions simultaneously
  3. Repeat for a small number of steps until the sequence converges
  4. Output the result

The key distinction: token positions are refined in parallel, not generated sequentially. The speed advantage is architectural. It's not approximating a slower process — it's a different process with a different scaling behavior.
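
To make the scaling difference concrete, here is a toy sketch that counts model passes for the two approaches. It is not Inception's algorithm: the "denoiser" below cheats by reading a known target sentence, the refinement shown is a simple unmask-the-most-confident-positions scheme chosen for illustration, and every name (`toy_denoiser`, `parallel_refine`, `MASK`) is hypothetical. Pass counts are also not wall-clock time, since a pass over all positions need not cost the same as a pass that emits one token.

```python
import random
from math import ceil

MASK = "<mask>"
TARGET = "the quick brown fox jumps over the lazy dog".split()


def toy_denoiser(sequence, rng):
    """Stand-in for a trained model: ONE call proposes a token and a
    confidence score for EVERY position. It cheats by reading TARGET."""
    return [(TARGET[i], rng.random()) for i in range(len(sequence))]


def autoregressive_decode(length):
    """One model pass per token, so `length` sequential passes."""
    sequence, passes = [], 0
    while len(sequence) < length:
        sequence.append(TARGET[len(sequence)])  # next token depends on the prefix
        passes += 1
    return sequence, passes


def parallel_refine(length, steps, seed=0):
    """Start fully masked, then commit a share of the positions on each pass."""
    rng = random.Random(seed)
    sequence, passes = [MASK] * length, 0
    for step in range(steps):
        masked = [i for i, token in enumerate(sequence) if token == MASK]
        if not masked:
            break
        proposals = toy_denoiser(sequence, rng)  # all positions scored in one pass
        passes += 1
        keep = ceil(len(masked) / (steps - step))  # spread work over remaining steps
        most_confident = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:keep]
        for i in most_confident:
            sequence[i] = proposals[i][0]
    return sequence, passes


_, ar_passes = autoregressive_decode(len(TARGET))
text, diffusion_passes = parallel_refine(len(TARGET), steps=3)
print(ar_passes, diffusion_passes, " ".join(text))
```

In this toy, the autoregressive path needs nine passes for nine tokens, while the parallel path finishes in the three passes it was given. The number of passes is set by the step count rather than by the sequence length, which is the scaling difference the section describes.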

Inception describes it as "less typewriter, more editor revising a full draft at once." The analogy is apt. An autoregressive model writes word by word. Mercury 2 starts with a draft and edits the whole thing simultaneously until it's coherent.


The Numbers

Metric                            | Value
Generation speed                  | 1,009 tokens/sec (NVIDIA Blackwell)
Speed advantage vs autoregressive | >5x
Input pricing                     | $0.25/1M tokens
Output pricing                    | $0.75/1M tokens
Context window                    | 128K tokens
Tool use                          | Native
JSON output                       | Schema-aligned
Reasoning                         | Tunable

The pricing is competitive with speed-tier models. At $0.75/1M output tokens, Mercury 2 is priced in the same range as many fast, quantized models while, by Inception's figures, generating more than 5x faster.
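
As a rough worked example of what those prices mean per task, the sketch below applies the two listed rates to a hypothetical agent task. The workload numbers (20 calls, 1,000 input tokens and 200 output tokens per call) are assumptions for illustration, not measurements, and `task_cost_usd` is a made-up helper.

```python
INPUT_USD_PER_MILLION = 0.25   # input price from the table above
OUTPUT_USD_PER_MILLION = 0.75  # output price from the table above


def task_cost_usd(calls: int, input_tokens_per_call: int, output_tokens_per_call: int) -> float:
    """Token cost of one task made of several identical calls."""
    input_cost = calls * input_tokens_per_call * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = calls * output_tokens_per_call * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 20 calls, each with 1,000 input and 200 output tokens.
cost = task_cost_usd(calls=20, input_tokens_per_call=1_000, output_tokens_per_call=200)
print(f"${cost:.4f} per task")
```

Under these assumptions the input side accounts for more of the bill than the output side, which is common in agent loops that resend context on every call.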


Why Speed Compounding Changes the Calculus

A single LLM call at 1,009 tok/sec vs 200 tok/sec: you notice the difference, but it's not transformative.

A multi-step agent loop changes the math:

Scenario                                | 200 tok/sec  | 1,009 tok/sec
Single 200-token response               | 1.0 sec      | 0.2 sec
10-step agent loop (200 tokens/step)    | 10 sec       | 2 sec
50-step agent loop (200 tokens/step)    | 50 sec       | 10 sec
Real-time voice transcript (continuous) | Falls behind | Keeps up
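
The numeric rows of the table follow from one formula: steps times tokens per step, divided by throughput. A minimal sketch, assuming generation time is the only cost and ignoring network round trips, prompt processing, and tool execution (`generation_seconds` is a name invented here):

```python
def generation_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Pure generation time for a loop of equally sized steps."""
    return steps * tokens_per_step / tokens_per_second


for label, steps in [("single response", 1), ("10-step loop", 10), ("50-step loop", 50)]:
    slow = generation_seconds(steps, 200, 200)
    fast = generation_seconds(steps, 200, 1009)
    print(f"{label:16s} {slow:5.1f} s  {fast:5.1f} s  saved {slow - fast:5.1f} s")
```

The speedup ratio is the same in every row; what grows with the number of steps is the absolute time saved.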

The ratio stays near 5x in every row, but the absolute time saved grows in proportion to how many inference calls you stack. This makes Mercury 2 specifically valuable for:

Coding tools: Autocomplete and next-edit suggestions need to land before the developer moves on. A suggestion that takes 2 seconds arrives after the developer has already typed ahead. At 1,009 tok/sec, short completions can be generated in tens of milliseconds, before network overhead.

Agent loops: Agentic workflows that chain dozens of inference calls per task stand to benefit the most. Not just because each call is faster, but because faster loops allow more steps within the same latency budget, which can mean better quality through more iteration.

Voice interfaces: Voice pipelines have the tightest latency budget in AI — natural speech cadence allows about 200ms between turns before the pause becomes noticeable. Mercury 2's speed makes reasoning-quality responses viable within that window.

RAG pipelines: Multi-hop retrieval, reranking, and summarization latencies stack. Adding reasoning to the search loop — without blowing the latency budget — becomes possible at 1009 tok/sec.
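
The latency budgets in the list above can be turned into token and step counts. The sketch below is a back-of-the-envelope calculation that treats the whole budget as generation time, which is optimistic: real pipelines also spend time on network hops, time to first token, and (for voice) speech recognition and synthesis. Both helper names are hypothetical.

```python
from math import floor


def tokens_within_budget(budget_seconds: float, tokens_per_second: float) -> int:
    """How many tokens can be generated if the whole budget goes to generation."""
    return floor(budget_seconds * tokens_per_second)


def steps_within_budget(budget_seconds: float, tokens_per_step: int, tokens_per_second: float) -> int:
    """How many equally sized loop steps fit in the same budget."""
    return floor(budget_seconds * tokens_per_second / tokens_per_step)


# Voice: roughly 200 ms between turns.
print(tokens_within_budget(0.2, 200), tokens_within_budget(0.2, 1009))

# Agent loop: a 10-second budget with 200-token steps.
print(steps_within_budget(10, 200, 200), steps_within_budget(10, 200, 1009))
```

Read this way, the same 10-second budget holds about five times as many loop steps, which is the "more iteration within the same latency budget" argument in numbers.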


What the Quality Tier Actually Is

Inception positions Mercury 2 as competitive with "leading speed-optimized models." That's the honest bracket: not frontier reasoning (Claude Opus 4.8, GPT-5.5), but competitive with fast models like Gemini Flash, GPT-4o-mini, or a hosted Llama-3.1 8B.

What this means practically:

Use Case                                               | Mercury 2 fit
Code autocomplete                                      | Strong (speed is the primary value)
Agent loop reasoning (non-critical)                    | Strong
Voice response generation                              | Strong
RAG summarization                                      | Strong
Frontier reasoning (complex math, code)                | Not the right tool
Long-horizon planning                                  | Not the right tool
Deep analysis requiring extended context comprehension | Depends (test it)

The tunable reasoning feature (the reasoning_effort parameter in the OpenAI-compatible API) lets you trade some speed for more reasoning quality within Mercury 2 itself, which widens the range of use cases it fits.
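
A request that sets this parameter might be assembled as follows. Treat it as a sketch under assumptions: the article names `reasoning_effort` but does not list its accepted values, so `"high"` is borrowed from OpenAI's convention and may not match Inception's API. `build_request` is a hypothetical helper, and the model id is the one used in the API example later in this article; whether that particular model honors the parameter is not stated.

```python
from typing import Optional


def build_request(model: str, prompt: str, reasoning_effort: Optional[str] = None) -> dict:
    """Build keyword arguments for an OpenAI-style chat completion call."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning_effort is not None:
        # Assumed value set; check the provider's documentation before relying on it.
        kwargs["reasoning_effort"] = reasoning_effort
    return kwargs


# Cheap loop step: leave the parameter unset and take the provider's default.
quick = build_request("mercury-coder-small", "Rename this variable consistently.")

# Harder step: ask for more reasoning and accept a slower response.
careful = build_request("mercury-coder-small", "Find the off-by-one bug in this loop.",
                        reasoning_effort="high")

# With an OpenAI client pointed at the Inception endpoint:
# response = client.chat.completions.create(**careful)
```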


Real-World Validation

The most meaningful signal is who is using it:

Zed editor (Max Brunsfeld, Co-Founder): "Suggestions land fast enough to feel like part of your own thinking, not something you have to wait for." — The autocomplete use case where speed determines whether the tool is useful at all.

Skyvern (Suchintan Singh, CTO): "Mercury 2 is at least twice as fast as GPT-5.2, which is a game changer for us." — Agent automation where generation speed compounds across task steps.

Wispr Flow (Sahaj Garg, CTO): "No other model has come close to the speed Mercury can provide!" — Real-time transcript cleanup that must run at speech rate.

OpenCall (Oliver Silverstein, CEO): "Mercury 2 quality is excellent, and the model's low latency enables more responsive voice agents." — Voice agents where response delay destroys the conversational feel.

The pattern: every validated use case involves either real-time interaction (voice, autocomplete) or agentic loops where generation calls compound. These are the cases where the speed advantage is load-bearing, not marginal.


The OpenAI-Compatible API

Mercury 2 exposes an OpenAI-compatible API:

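import os

from openai import OpenAI

# Point the standard OpenAI client at Inception's endpoint.
# INCEPTION_API_KEY is a placeholder environment variable name; use whatever your setup defines.
client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

# Same chat-completions call shape as with OpenAI; only the model name changes.
response = client.chat.completions.create(
    model="mercury-coder-small",
    messages=[{"role": "user", "content": "Write a Python function to parse JSON safely"}],
)
print(response.choices[0].message.content)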

For existing OpenAI API integrations this is close to a drop-in replacement: change the base URL, API key, and model name, and the rest of the calling code stays the same.

Available models:

  • mercury-coder-small — fastest, best for autocomplete and short tasks
  • Standard model — balanced quality/speed for most agent use cases

When to Use Mercury 2 vs Frontier Models

The decision is not "is Mercury 2 good?" It is "does this use case need Mercury 2's specific advantage?"

Use Mercury 2 when:

  • Your pipeline has 10+ chained inference calls
  • Real-time responsiveness is required (voice, autocomplete)
  • You're optimizing for throughput at scale with speed-tier quality requirements
  • Latency is a hard constraint, not a preference

Use frontier models (Claude, GPT-5.5) when:

  • Reasoning depth matters more than speed
  • The task is a single, complex prompt — not a loop
  • You need the best quality output, not the fastest adequate output
  • Generated code needs to be correct, not just fast

Many production systems will end up using both: Mercury 2 for the high-frequency loop steps that don't need frontier quality, and frontier models for the final synthesis or critical reasoning steps.
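
One way to express that split is a small routing function that picks a model per step. This is a sketch, not a recommended architecture: the `Step` fields and `pick_model` are invented for illustration, `FRONTIER_MODEL` is a placeholder rather than a real model id, and the routing rule is deliberately simplistic.

```python
from dataclasses import dataclass

FAST_MODEL = "mercury-coder-small"    # model id from the API example above
FRONTIER_MODEL = "frontier-model-id"  # placeholder, substitute a real model id


@dataclass
class Step:
    name: str
    needs_deep_reasoning: bool = False
    is_final_synthesis: bool = False


def pick_model(step: Step) -> str:
    """Send critical steps to the frontier model and everything else to the fast one."""
    if step.needs_deep_reasoning or step.is_final_synthesis:
        return FRONTIER_MODEL
    return FAST_MODEL


plan = [
    Step("extract fields"),
    Step("rerank results"),
    Step("resolve conflicting sources", needs_deep_reasoning=True),
    Step("write final report", is_final_synthesis=True),
]
for step in plan:
    print(f"{step.name}: {pick_model(step)}")
```

Keeping the decision in one function makes it easy to tighten or loosen the rule later, for example after measuring where the fast model's output quality falls short.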


Getting Started

Try Mercury 2 at chat.inceptionlabs.ai or via API at api.inceptionlabs.ai. The API is OpenAI-compatible: replace api.openai.com with the Inception endpoint, then update the API key and model name.

