Every agentic workflow is a latency multiplication problem.
If your agent makes 20 LLM calls per task, and each call takes 3 seconds for generation, that's 60 seconds of model waiting per task — before any actual work happens. Cut that to 0.6 seconds per call and you cut 60 seconds to 12. The task goes from slow to fast enough to rethink what's worth automating.
Mercury 2 — 1,009 tokens per second on NVIDIA Blackwell GPUs — is built for exactly that arithmetic.
But the speed is not from the usual tricks. It's from a fundamentally different generation algorithm.
What Diffusion LLMs Actually Do
Standard language models are autoregressive: they generate tokens one at a time, left to right. Token N depends on tokens 1 through N-1. The bottleneck is sequential — you cannot generate token 5 until you have tokens 1 through 4. More tokens = more time, linearly.
Diffusion models work differently. They were originally developed for image generation (Stable Diffusion, DALL-E 3), where the process starts with random noise and iteratively denoises it toward a coherent image. Each denoising step refines the entire image simultaneously.
Mercury 2 applies this approach to text:
- Start with a noisy token sequence of the target length
- Run a refinement pass over all positions simultaneously
- Repeat for a small number of steps until the sequence converges
- Output the result
The key distinction: token positions are refined in parallel, not generated sequentially. The speed advantage is architectural. It's not approximating a slower process — it's a different process with a different scaling behavior.
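To make the scaling difference concrete, here is a toy sketch that counts model passes only. It is not Inception's algorithm: the "model" is an oracle that already knows the target text, masking stands in for noise, and the names `autoregressive_decode`, `parallel_refine`, and `MASK` are made up for illustration. The point is just that one scheme needs a pass per token while the other needs a pass per refinement step.

```python
import random

MASK = "_"  # stand-in for a noisy / undecided position


def autoregressive_decode(target):
    """One model pass per token, strictly left to right."""
    out, passes = [], 0
    for token in target:
        out.append(token)  # stand-in for a forward pass that predicts one token
        passes += 1
    return out, passes


def parallel_refine(target, steps, seed=0):
    """Start fully masked; every pass commits a batch of positions at once."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    order = list(range(len(target)))
    rng.shuffle(order)  # stand-in for "most confident positions first"
    per_step = -(-len(target) // steps)  # ceiling division
    passes = 0
    while order:
        batch, order = order[:per_step], order[per_step:]
        for i in batch:
            seq[i] = target[i]  # a real model would predict these tokens
        passes += 1
    return seq, passes


target = "the quick brown fox jumps over the lazy dog".split()
ar_out, ar_passes = autoregressive_decode(target)
diff_out, diff_passes = parallel_refine(target, steps=3)
print(ar_passes, diff_passes)  # 9 3
```

Double the sequence length and the autoregressive pass count doubles, while the refinement pass count stays at `steps`. Each refinement pass does more work than a single-token pass, so this is a picture of the scaling shape, not a benchmark.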
Inception describes it as "less typewriter, more editor revising a full draft at once." The analogy is apt. An autoregressive model writes word by word. Mercury 2 starts with a draft and edits the whole thing simultaneously until it's coherent.
The Numbers
| Metric | Value |
|---|---|
| Generation speed | 1,009 tokens/sec (NVIDIA Blackwell) |
| Speed advantage vs autoregressive | >5x |
| Input pricing | $0.25/1M tokens |
| Output pricing | $0.75/1M tokens |
| Context window | 128K tokens |
| Tool use | Native |
| JSON output | Schema-aligned |
| Reasoning | Tunable |
The pricing is competitive with speed-tier models. At $0.75/1M output tokens, Mercury 2 is priced in line with many quantized fast models while generating more than 5x faster.
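For a sense of what those prices mean per task, here is a small cost calculation. The two prices come from the table above; the workload (20 calls, 2,000 input tokens and 200 output tokens per call) is a made-up example, and `task_cost` is an illustrative helper, not part of any SDK.

```python
INPUT_PRICE_PER_M = 0.25   # USD per 1M input tokens (from the table above)
OUTPUT_PRICE_PER_M = 0.75  # USD per 1M output tokens (from the table above)


def task_cost(calls, input_tokens_per_call, output_tokens_per_call):
    """Total USD cost of one task made of `calls` identical LLM calls."""
    input_cost = calls * input_tokens_per_call * INPUT_PRICE_PER_M / 1_000_000
    output_cost = calls * output_tokens_per_call * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 20 calls, 2,000 tokens in and 200 tokens out per call.
cost = task_cost(calls=20, input_tokens_per_call=2_000, output_tokens_per_call=200)
print(f"${cost:.4f} per task")  # $0.0130 per task
```

In this example the input side dominates the bill, which is typical for agent loops that resend context on every step.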
Why Speed Compounding Changes the Calculus
A single LLM call at 1009 tok/sec vs 200 tok/sec: you notice the difference, but it's not transformative.
A multi-step agent loop changes the math:
| Scenario | 200 tok/sec | 1009 tok/sec |
|---|---|---|
| Single 200-token response | 1.0 sec | 0.2 sec |
| 10-step agent loop (200 tokens/step) | 10 sec | 2 sec |
| 50-step agent loop (200 tokens/step) | 50 sec | 10 sec |
| Real-time voice transcript (continuous) | Falls behind | Keeps up |
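
The numeric rows in the table are plain division, sketched below. This counts generation time only and assumes every step emits exactly 200 tokens; network round trips, prompt processing, and tool execution are left out, so real loops will take longer on both sides. `generation_seconds` is an illustrative helper.

```python
def generation_seconds(steps, tokens_per_step, tokens_per_second):
    """Pure generation time for a loop of identical steps."""
    return steps * tokens_per_step / tokens_per_second


for steps in (1, 10, 50):
    slow = generation_seconds(steps, 200, 200)
    fast = generation_seconds(steps, 200, 1009)
    print(f"{steps:>2} steps: {slow:5.1f} s vs {fast:5.1f} s")
```

The ratio between the two columns is constant at about 5x. What grows with the step count is the absolute number of seconds saved.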
The speed advantage doesn't save time uniformly. It saves time in proportion to how many inference calls you stack. This makes Mercury 2 specifically valuable for:
Coding tools: Autocomplete and next-edit suggestions need to land before the developer moves on. If the suggestion arrives after 2 seconds, it lands after the developer has already typed ahead. At 1009 tok/sec, short completions arrive in tens of milliseconds.
Agent loops: Agentic workflows that chain dozens of inference calls per task benefit more from Mercury 2 than any other use case. Not just because it's faster, but because faster loops enable more steps within the same latency budget — better quality through more iteration.
Voice interfaces: Voice pipelines have the tightest latency budget in AI — natural speech cadence allows about 200ms between turns before the pause becomes noticeable. Mercury 2's speed makes reasoning-quality responses viable within that window.
RAG pipelines: Multi-hop retrieval, reranking, and summarization latencies stack. Adding reasoning to the search loop — without blowing the latency budget — becomes possible at 1009 tok/sec.
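Flipping the arithmetic around gives a rough way to size a response against a latency budget such as the 200ms voice window. This is a back-of-the-envelope sketch: `tokens_within_budget` and its `overhead_ms` parameter are hypothetical, and the 80ms overhead in the last call is an illustrative guess for network and speech processing, not a measured figure.

```python
def tokens_within_budget(budget_ms, tokens_per_second, overhead_ms=0.0):
    """How many tokens can be generated before the budget runs out."""
    usable_ms = max(budget_ms - overhead_ms, 0.0)
    return int(usable_ms * tokens_per_second / 1000)


print(tokens_within_budget(200, 1009))                  # 201
print(tokens_within_budget(200, 200))                   # 40
print(tokens_within_budget(200, 1009, overhead_ms=80))  # 121
```

At 200 tok/sec the window fits a short sentence. At 1,009 tok/sec it fits a short paragraph, which is the room a model needs to do some reasoning before it answers.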
What the Quality Tier Actually Is
Inception positions Mercury 2 as competitive with "leading speed-optimized models." That's the honest bracket: not frontier reasoning (Claude Opus 4.8, GPT-5.5), but in the same tier as fast models such as Gemini Flash, GPT-4o-mini, or a hosted Llama-3.1 8B.
What this means practically:
| Use Case | Mercury 2 fit |
|---|---|
| Code autocomplete | Strong — speed is the primary value |
| Agent loop reasoning (non-critical) | Strong |
| Voice response generation | Strong |
| RAG summarization | Strong |
| Frontier reasoning (complex math, code) | Not the right tool |
| Long-horizon planning | Not the right tool |
| Deep analysis requiring extended context comprehension | Depends — test it |
The tunable reasoning feature (the reasoning_effort parameter in the OpenAI-compatible API) lets you trade some speed for more reasoning quality within Mercury 2 itself, which widens the range of use cases it fits.
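A minimal sketch of passing that parameter through the OpenAI Python client. The parameter name reasoning_effort comes from the text above; the accepted values ("low" and "high" here) and the model identifier are assumptions, so check Inception's documentation before relying on them. The `ask` helper is illustrative and takes an already-configured client, which keeps it independent of any one endpoint.

```python
def ask(client, model, prompt, reasoning_effort="low"):
    """Send one chat request with a reasoning-effort hint.

    `client` is expected to be an openai.OpenAI instance pointed at the
    Inception endpoint. The value set for `reasoning_effort` is an assumption.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        # extra_body forwards extra fields in the JSON request body as-is.
        extra_body={"reasoning_effort": reasoning_effort},
    )
    return response.choices[0].message.content
```

One way to use this in an agent loop: keep the effort low on routine steps where latency matters most, and raise it only on the steps where a wrong answer is expensive.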
Real-World Validation
The most meaningful signal is who is using it:
Zed editor (Max Brunsfeld, Co-Founder): "Suggestions land fast enough to feel like part of your own thinking, not something you have to wait for." — The autocomplete use case where speed determines whether the tool is useful at all.
Skyvern (Suchintan Singh, CTO): "Mercury 2 is at least twice as fast as GPT-5.2, which is a game changer for us." — Agent automation where generation speed compounds across task steps.
Wispr Flow (Sahaj Garg, CTO): "No other model has come close to the speed Mercury can provide!" — Real-time transcript cleanup that must run at speech rate.
OpenCall (Oliver Silverstein, CEO): "Mercury 2 quality is excellent, and the model's low latency enables more responsive voice agents." — Voice agents where response delay destroys the conversational feel.
The pattern: every validated use case involves either real-time interaction (voice, autocomplete) or agentic loops where generation calls compound. These are the cases where the speed advantage is load-bearing, not marginal.
The OpenAI-Compatible API
Mercury 2 exposes an OpenAI-compatible API:

    from openai import OpenAI

    # Point the standard OpenAI client at the Inception endpoint.
    client = OpenAI(
        api_key="your_inception_api_key",
        base_url="https://api.inceptionlabs.ai/v1",
    )

    response = client.chat.completions.create(
        model="mercury-coder-small",
        messages=[{"role": "user", "content": "Write a Python function to parse JSON safely"}],
    )
    print(response.choices[0].message.content)

Drop-in replacement for existing OpenAI API integrations. No rewrites required.
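Since the whole pitch is speed, it's worth measuring what you actually get from your own network location. The sketch below times one call and divides by the completion token count. It assumes the endpoint fills in `usage.completion_tokens` the way the OpenAI API does, which this article does not confirm, and `measure_throughput` is an illustrative helper.

```python
import time


def measure_throughput(client, model, prompt):
    """Return end-to-end completion tokens per second for a single call."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # Assumes the server reports token usage in the OpenAI response format.
    tokens = response.usage.completion_tokens
    return tokens / elapsed if elapsed > 0 else float("inf")
```

The result includes network latency, queueing, and prompt processing, so expect it to come in below the headline generation speed, especially for short responses.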
Available models:
- mercury-coder-small: fastest, best for autocomplete and short tasks
- Standard model: balanced quality/speed for most agent use cases
When to Use Mercury 2 vs Frontier Models
The decision is not "is Mercury 2 good?" It is "does this use case need Mercury 2's specific advantage?"
Use Mercury 2 when:
- Your pipeline has 10+ chained inference calls
- Real-time responsiveness is required (voice, autocomplete)
- You're optimizing for throughput at scale with speed-tier quality requirements
- Latency is a hard constraint, not a preference
Use frontier models (Claude, GPT-5.5) when:
- Reasoning depth matters more than speed
- The task is a single, complex prompt — not a loop
- You need the best quality output, not the fastest adequate output
- Code generation quality needs to be correct, not just fast
Many production systems will end up using both: Mercury 2 for the high-frequency loop steps that don't need frontier quality, and frontier models for the final synthesis or critical reasoning steps.
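One way to wire up that split is a small routing function. This is a sketch under stated assumptions: the `Step` fields, the `choose_model` helper, and the `FRONTIER_MODEL` placeholder are all hypothetical, and the 10-call threshold simply mirrors the checklist above.

```python
from dataclasses import dataclass

FAST_MODEL = "mercury-coder-small"    # model id from the API example in this article
FRONTIER_MODEL = "frontier-model-id"  # placeholder, substitute your provider's id


@dataclass
class Step:
    name: str
    realtime: bool = False              # voice turn, autocomplete, and similar
    needs_deep_reasoning: bool = False  # final synthesis, critical correctness


def choose_model(step, chained_calls):
    """Pick a model for one step of a pipeline with `chained_calls` total calls."""
    if step.needs_deep_reasoning:
        return FRONTIER_MODEL
    if step.realtime or chained_calls >= 10:
        return FAST_MODEL
    return FRONTIER_MODEL  # a single, non-urgent prompt: take the best quality


plan = [
    Step("extract fields"),
    Step("rerank results"),
    Step("final synthesis", needs_deep_reasoning=True),
]
for step in plan:
    print(step.name, "->", choose_model(step, chained_calls=12))
```

Reasoning depth is checked first on purpose: a critical step goes to the frontier model even inside a long loop, and everything else in that loop takes the fast path.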
Getting Started
Try Mercury 2 at chat.inceptionlabs.ai or via API at api.inceptionlabs.ai. The API is OpenAI-compatible — replace api.openai.com with the Inception endpoint and update the API key.