DiffusionGemma: Google’s 4× Faster Open Model Uses Text Diffusion

Google DeepMind releases DiffusionGemma—26B MoE, Apache 2.0, up to 4× faster text gen via parallel 256-token blocks. H100 1000+ tok/s, runs on 18GB VRAM.

6 min read · Yash Thakker
Google DeepMind · Gemma · Open Source · LLM · Diffusion Models

On June 10, 2026, Google DeepMind released DiffusionGemma, an experimental open-weights model that generates text with discrete diffusion instead of predicting one token at a time. @Google and @GoogleDeepMind pitched it as up to 4× faster on dedicated GPUs; Google CEO Sundar Pichai called it a "racehorse" for interactive apps.

TL;DR

| Spec | DiffusionGemma | Standard Gemma 4 (26B A4B) |
| --- | --- | --- |
| Method | Parallel 256-token diffusion blocks | Autoregressive (token-by-token) |
| Total / active params | 26B MoE / 3.8B active | 25.2B / 3.8B active |
| Speed (H100) | 1000+ tok/s | Baseline |
| Speed (RTX 5090) | 700+ tok/s | Baseline |
| VRAM (quantized) | ~18 GB | Similar class |
| License | Apache 2.0 | Gemma terms |
| Quality | Lower on most benchmarks | Production recommended |
| Best for | Interactive local, low latency | Max quality, cloud QPS |

Why diffusion for text?

Nearly every LLM today is autoregressive: generate token t, then t+1, each depending on all prior tokens. Decode becomes memory-bandwidth-bound—the GPU waits on KV cache reads more than it computes.
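A back-of-envelope sketch of that bottleneck: if every forward pass has to stream the active weights from GPU memory once, single-stream throughput is capped at memory bandwidth divided by bytes read per pass. The bandwidth and bytes-per-parameter values below are illustrative assumptions, not figures from Google's post. KV cache reads are ignored, so a real ceiling would be lower.

```python
def decode_ceiling_tok_s(active_params: float,
                         bytes_per_param: float,
                         mem_bandwidth_bytes_s: float,
                         tokens_per_pass: float = 1.0) -> float:
    """Upper bound on tokens/sec when each forward pass streams the active weights once."""
    bytes_per_pass = active_params * bytes_per_param
    passes_per_s = mem_bandwidth_bytes_s / bytes_per_pass
    return passes_per_s * tokens_per_pass

ACTIVE_PARAMS = 3.8e9    # active parameters per forward pass, from the article
BYTES_PER_PARAM = 2.0    # assumption: 16-bit weights
BANDWIDTH = 3.0e12       # assumption: a ~3 TB/s class datacenter GPU

# Autoregressive decode: one token per pass.
print(round(decode_ceiling_tok_s(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)))
```

Passing a larger `tokens_per_pass` shows how emitting several tokens per weight read raises the ceiling. That multi-token case is optimistic: it ignores the extra compute of each denoising pass and the additional experts touched when many positions are processed at once.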

DiffusionGemma inverts the decode bottleneck:

flowchart LR
  A[Encoder prefills context] --> B[KV cache built]
  B --> C[256-token canvas of masked tokens]
  C --> D[Parallel denoising passes]
  D --> E[Canvas finalized → append to cache]
  E --> F[Next canvas…]

From the Hugging Face Gemma 4 launch post:

  • An autoregressive encoder prefills prompts and builds the KV cache.
  • A diffusion decoder applies bidirectional attention over a 256-token canvas.
  • The model iteratively denoises the full canvas; finalized tokens append to cache; the next canvas begins.

Generation is block-autoregressive across canvases and parallel within each canvas, which yields roughly 15–20 tokens per forward pass versus one.
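The control flow described above can be sketched as a toy loop. This is not DiffusionGemma's sampler: `toy_denoiser` is a hypothetical stand-in that returns random confidences, `commit_per_pass` is an assumed knob, and the toy commits tokens permanently, whereas the real decoder can revise positions while a canvas is still open.

```python
import random

MASK = None  # marks a canvas position that has not been decoded yet

def toy_denoiser(context, canvas, rng):
    """Hypothetical stand-in for the diffusion decoder: proposes a token and a
    confidence for every masked position, looking at the whole canvas at once."""
    return {i: (f"tok{len(context) + i}", rng.random())
            for i, t in enumerate(canvas) if t is MASK}

def decode(prompt, n_canvases, canvas_size=256, commit_per_pass=16, seed=0):
    rng = random.Random(seed)
    context = list(prompt)            # plays the role of the prefilled KV cache
    passes = 0
    for _ in range(n_canvases):       # autoregressive across canvases
        canvas = [MASK] * canvas_size
        while any(t is MASK for t in canvas):   # parallel within a canvas
            proposals = toy_denoiser(context, canvas, rng)
            # Commit the most confident proposals; the rest stay masked.
            ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
            for i in ranked[:commit_per_pass]:
                canvas[i] = proposals[i][0]
            passes += 1
        context.extend(canvas)        # the finalized canvas joins the cache
    return context, passes

tokens, passes = decode(["<prompt>"], n_canvases=2)
print(len(tokens) - 1, "tokens in", passes, "passes")
```

With 16 positions committed per pass, each 256-token canvas takes 16 passes, which sits inside the 15–20 tokens-per-pass range quoted above.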


Speed numbers that matter

Google's published throughput (official blog):

| Hardware | Throughput |
| --- | --- |
| NVIDIA H100 (single GPU) | 1000+ tokens/sec |
| GeForce RTX 5090 | 700+ tokens/sec |
| vs autoregressive Gemma 4 | Up to ~4× faster |

@sundarpichai:

DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It's a racehorse 🏇 achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token output!

Adaptive compute: Simpler prompts and structured tasks (code infilling, markdown) can use fewer denoising steps, so tokens-per-second rises as task complexity falls.
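A rough way to see the effect of step count: a 256-token canvas that needs `denoise_steps` passes yields 256 tokens per `denoise_steps × seconds_per_step`. The per-step latency below is an assumption picked for illustration, not a measured number.

```python
CANVAS_TOKENS = 256  # canvas size from the article

def throughput_tok_s(denoise_steps: int, seconds_per_step: float) -> float:
    """Tokens per second for one canvas, ignoring prefill and scheduling overhead."""
    return CANVAS_TOKENS / (denoise_steps * seconds_per_step)

STEP_LATENCY = 0.016  # assumption: 16 ms per denoising pass

for steps in (8, 16, 32):
    print(f"{steps:>2} steps -> {throughput_tok_s(steps, STEP_LATENCY):.0f} tok/s")
```

Under this assumed latency, halving the number of denoising steps doubles throughput, which is why easy, highly structured outputs decode faster than hard ones.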


Architecture and footprint

DiffusionGemma shares Gemma 4's 26B A4B MoE foundation:

| Attribute | Value |
| --- | --- |
| Total parameters | ~26B (25.2B in HF spec) |
| Active per forward | 3.8B (8 of 128 experts + 1 shared) |
| Context | Up to 256K tokens |
| Modalities | Text + image in; text out |
| Languages | 140+ |
| Canvas size | 256 tokens per diffusion block |

Self-correction: Bidirectional denoising lets the model revise masked tokens mid-block—useful for markdown formatting and structured output. Autoregressive models commit each token permanently.

Local footprint: Quantized weights target ~18 GB of VRAM, which fits high-end consumer GPUs and removes the need for datacenter hardware.
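A quick sanity check on that footprint, as a sketch: in a MoE model all experts normally have to be resident in memory even though only 3.8B parameters are active per pass, so the weight footprint follows the total parameter count. The article does not state the bit width behind the ~18 GB figure; 4-bit is an assumption here.

```python
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 25.2e9    # HF spec figure quoted in the article
TARGET_VRAM_GB = 18.0    # quantized footprint quoted in the article

for bits in (16, 8, 4):  # candidate precisions; 4-bit is the assumed quantization
    gb = weights_gb(TOTAL_PARAMS, bits)
    print(f"{bits:>2}-bit: {gb:5.1f} GB weights, {TARGET_VRAM_GB - gb:6.1f} GB headroom")
```

Only the 4-bit row fits under 18 GB, leaving a few gigabytes for the KV cache, activations, and quantization metadata.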


The quality trade-off

Google is explicit: use autoregressive Gemma 4 for production quality. DiffusionGemma is experimental—speed first.

Benchmark snapshot (Hugging Face Gemma 4 blog):

| Benchmark | DiffusionGemma | Gemma 4 26B A4B |
| --- | --- | --- |
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 | 69.1% | 88.3% |
| GPQA Diamond | 73.2% | 82.3% |
| HLE (no tools) | 11.0% | 8.7% |

DiffusionGemma wins a few benchmarks (HLE in this snapshot) and trails on most, which is the expected trade-off when exchanging accuracy for throughput.
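To make the gap concrete, the snapshot can be turned into percentage-point differences. This is plain arithmetic on the four scores listed above; the dictionary layout and the `deltas` helper are just one way to organize it.

```python
# Scores in percent, copied from the benchmark snapshot:
# (DiffusionGemma, autoregressive Gemma 4 26B A4B)
scores = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026": (69.1, 88.3),
    "GPQA Diamond": (73.2, 82.3),
    "HLE (no tools)": (11.0, 8.7),
}

def deltas(table):
    """Percentage-point gap: DiffusionGemma minus autoregressive Gemma 4."""
    return {name: round(diff - ar, 1) for name, (diff, ar) in table.items()}

gaps = deltas(scores)
for name, gap in gaps.items():
    print(f"{name:15s} {gap:+.1f} pts")

wins = [name for name, gap in gaps.items() if gap > 0]
print("DiffusionGemma ahead on:", wins)
```

The largest gap in this snapshot is on AIME 2026, a math reasoning benchmark, which matches the guidance to prefer the autoregressive model for long-form reasoning.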

When speed wins:

  • Inline editing and code infilling
  • Rapid iteration in IDE assistants
  • Real-time markdown / structured formatting
  • Interactive local apps where latency beats absolute benchmark scores

When quality wins:

  • Long-form reasoning, agents, production RAG
  • High-QPS cloud serving where Gemma 4 autoregressive stacks are tuned


Multimodal toolkit inherited from Gemma 4

DiffusionGemma is not a text-only hack—it ships the broader Gemma 4 feature set:

  • Thinking mode
  • Function calling
  • Native system prompts
  • Image understanding: OCR, document parsing, object detection, and pointing, at variable aspect ratios

Released alongside the wider Gemma 4 family (E2B, E4B, multimodal on-device models), DiffusionGemma is the speed-specialized sibling under the same Apache 2.0 open license.


Ecosystem and how to run it

Download: Hugging Face model hub (Apache 2.0).

Serving and tuning:

| Stack | Support |
| --- | --- |
| Hugging Face Transformers | Yes |
| vLLM | Yes |
| NVIDIA NeMo / NIM | Yes |
| MLX (Apple Silicon) | Community ports |
| Unsloth | Fine-tuning |
| llama.cpp | Announced coming soon at launch |
| Google Cloud Model Garden | Yes |

NVIDIA optimization: NVFP4 4-bit kernels on Hopper/Blackwell; consumer RTX 5090/4090; DGX Spark and DGX Station for deskside local AI.


DiffusionGemma vs autoregressive: decision table

| Question | Choose DiffusionGemma | Choose Gemma 4 AR |
| --- | --- | --- |
| Need lowest latency locally? | ✅ | |
| Need best benchmark scores? | | ✅ |
| Interactive editing / infilling? | ✅ | |
| Agent loops with tool use? | Caution: verify quality | ✅ |
| Apache 2.0 open weights? | ✅ | Gemma license |
| 18 GB consumer GPU? | ✅ (quantized) | ✅ (smaller variants too) |

For agentic coding stacks, DiffusionGemma is interesting as a local copilot engine—pair speed with verification loops (loop engineering) rather than trusting first-pass quality.
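A minimal sketch of such a verification loop, assuming the fast model is exposed as a plain callable. `fake_draft` is a hypothetical stand-in for a local model call, and the syntax check via Python's `ast` module is only one possible verifier; tests or a type checker would slot in the same way.

```python
import ast
from typing import Callable, Optional

def generate_verified(draft: Callable[[str, int], str],
                      verify: Callable[[str], bool],
                      prompt: str,
                      max_attempts: int = 3) -> Optional[str]:
    """Ask the fast model for drafts until one passes the verifier or attempts run out."""
    for attempt in range(max_attempts):
        candidate = draft(prompt, attempt)
        if verify(candidate):
            return candidate
    return None  # caller can fall back to a slower, higher-quality model

def is_valid_python(source: str) -> bool:
    """Cheap verifier: does the draft at least parse as Python?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def fake_draft(prompt: str, attempt: int) -> str:
    # Hypothetical model stub: the first draft is malformed, the retry is fine.
    if attempt == 0:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b"

print(generate_verified(fake_draft, is_valid_python, "write add()"))
```

The design bet is that a cheap check plus a fast retry can cost less wall-clock time than one slow high-quality generation, provided the verifier catches the failures that matter.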


Industry context (June 2026)

DiffusionGemma landed the same week as Claude Fable 5, Code with Claude Tokyo agent scheduling, and Thariq's agent-edited launch video—a dense news cycle where speed (DiffusionGemma), autonomy (Fable, managed agents), and orchestration (workflows) all advanced in parallel.

Google's bet: decode parallelism matters for the next wave of on-device and IDE-embedded models, even if autoregression keeps the quality crown for now.



Primary sources: Google DiffusionGemma blog · Hugging Face Gemma 4 launch · @Google · @sundarpichai


Summary

DiffusionGemma is Google's open speed experiment: 26B MoE, 256-token parallel diffusion blocks, up to 4× faster decoding (1000+ tok/s on H100, 700+ tok/s on RTX 5090), ~18 GB quantized local runs, Apache 2.0. Sundar Pichai's racehorse framing is apt: it wins races where latency dominates, not where MMLU Pro does.

For production text quality, Google still points to autoregressive Gemma 4. For interactive local generation, inline edits, and researcher exploration of text diffusion, DiffusionGemma is the model to benchmark this week.


Specs, benchmarks, and serving support reflect Google's June 10, 2026 release. Re-check Hugging Face and the Google developers blog before production deployment.
