explainx / blog
DiffusionGemma: Google’s 4× Faster Open Model Uses Text Diffusion
Google DeepMind releases DiffusionGemma—26B MoE, Apache 2.0, up to 4× faster text gen via parallel 256-token blocks. H100 1000+ tok/s, runs on 18GB VRAM.
On June 10, 2026, Google DeepMind released DiffusionGemma, an experimental open-weights model that generates text with discrete diffusion instead of predicting one token at a time. @Google and @GoogleDeepMind pitched it as up to 4× faster on dedicated GPUs; CEO Sundar Pichai called it a "racehorse" for interactive apps.
TL;DR
| Spec | DiffusionGemma | Standard Gemma 4 (26B A4B) |
|---|---|---|
| Method | Parallel 256-token diffusion blocks | Autoregressive (token-by-token) |
| Total / active params | 26B MoE / 3.8B active | 25.2B / 3.8B active |
| Speed (H100) | 1000+ tok/s | Baseline |
| Speed (RTX 5090) | 700+ tok/s | Baseline |
| VRAM (quantized) | ~18 GB | Similar class |
| License | Apache 2.0 | Gemma terms |
| Quality | Lower on most benchmarks | Production recommended |
| Best for | Interactive local, low latency | Max quality, cloud QPS |
Nearly every LLM today is autoregressive: generate token t, then t+1, each depending on all prior tokens. Decode becomes memory-bandwidth-bound—the GPU waits on KV cache reads more than it computes.
DiffusionGemma inverts the decode bottleneck:
1. The encoder prefills the context.
2. The KV cache is built.
3. A 256-token canvas of masked tokens is created.
4. Parallel denoising passes fill in the canvas.
5. The canvas is finalized and appended to the cache.
6. The next canvas starts…
From the Hugging Face Gemma 4 launch post:
Block-autoregressive across canvases, parallel within each canvas—roughly 15–20 tokens per forward pass versus one.
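The decode loop described above can be illustrated with a toy simulation. This is a minimal sketch, not Google's implementation: `toy_denoiser` is a hypothetical stand-in for the model, and committing the highest-confidence positions on each pass is an assumption borrowed from how masked-diffusion decoders are commonly described. The sketch also never revises a committed token, which the real model is said to be able to do.

```python
import random

MASK = None  # placeholder for a position that has not been decoded yet


def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for the model: one 'forward pass' proposes a
    (token, confidence) pair for every still-masked position."""
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(canvas)
        if tok is MASK
    }


def decode_canvas(canvas_size, tokens_per_pass, rng):
    """Fill one canvas by committing the most confident proposals per pass."""
    canvas = [MASK] * canvas_size
    passes = 0
    while any(tok is MASK for tok in canvas):
        proposals = toy_denoiser(canvas, rng)
        # Rank masked positions by confidence, highest first.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes


def generate(num_canvases, canvas_size=256, tokens_per_pass=16, seed=0):
    """Block-autoregressive outer loop: each finished canvas joins the context."""
    rng = random.Random(seed)
    context, total_passes = [], 0
    for _ in range(num_canvases):
        canvas, passes = decode_canvas(canvas_size, tokens_per_pass, rng)
        context.extend(canvas)  # stands in for appending to the KV cache
        total_passes += passes
    return context, total_passes


tokens, passes = generate(num_canvases=2)
print(len(tokens), passes)  # prints "512 32"
```

With 16 tokens committed per pass, two 256-token canvases take 32 passes, where a token-by-token decoder would need 512.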
Google's published throughput (official blog):
| Hardware | Throughput |
|---|---|
| NVIDIA H100 (single GPU) | 1000+ tokens/sec |
| GeForce RTX 5090 | 700+ tokens/sec |
| vs autoregressive Gemma 4 | Up to ~4× faster |
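To make these figures concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it treats the published "1000+" and "700+" numbers as floors, and it assumes the "up to ~4×" ceiling applies to each GPU individually, which the source does not state.

```python
# Published floors from the table above (tokens per second).
THROUGHPUT = {"H100": 1000, "RTX 5090": 700}
MAX_SPEEDUP = 4     # "up to ~4x" versus autoregressive Gemma 4
CANVAS_SIZE = 256   # tokens per diffusion block


def seconds_for(tokens, tokens_per_second):
    """Wall-clock time to emit `tokens` at a steady throughput."""
    return tokens / tokens_per_second


def implied_baseline(tokens_per_second, speedup):
    """Autoregressive throughput implied if the full speedup were realized."""
    return tokens_per_second / speedup


for gpu, tps in THROUGHPUT.items():
    canvas_time = seconds_for(CANVAS_SIZE, tps)
    baseline = implied_baseline(tps, MAX_SPEEDUP)
    print(f"{gpu}: one {CANVAS_SIZE}-token canvas in {canvas_time:.3f} s, "
          f"implied autoregressive baseline of about {baseline:.0f} tok/s")
```

Because 4× is described as an upper bound, the real autoregressive baseline on a given GPU could be higher than the implied figure.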
DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It's a racehorse 🏇 achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token output!
Adaptive compute: Simpler prompts and structured tasks (code infilling, markdown) can use fewer denoising steps, so tokens-per-second scales with task complexity.
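A rough way to see why fewer denoising steps mean higher throughput: the number of forward passes per canvas is the canvas size divided by the tokens finalized per pass. The sketch below assumes every pass costs about the same time, which is a simplification and not a claim from the source.

```python
import math

CANVAS_SIZE = 256  # tokens per diffusion block, per the article


def passes_per_canvas(tokens_per_pass, canvas_size=CANVAS_SIZE):
    """Forward passes needed to finalize one canvas."""
    return math.ceil(canvas_size / tokens_per_pass)


def relative_throughput(tokens_per_pass, reference_tokens_per_pass=15):
    """Speed relative to the reference setting, assuming equal cost per pass."""
    return (passes_per_canvas(reference_tokens_per_pass)
            / passes_per_canvas(tokens_per_pass))


# 15 and 20 bracket the "roughly 15-20 tokens per forward pass" quoted earlier;
# 32 is a hypothetical easier prompt.
for tokens_per_pass in (15, 20, 32):
    print(tokens_per_pass,
          passes_per_canvas(tokens_per_pass),
          round(relative_throughput(tokens_per_pass), 2))
```

At 15 tokens per pass a canvas needs 18 passes; at 20 it needs 13, so the easier case finishes in roughly three quarters of the time under this assumption.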
DiffusionGemma shares Gemma 4's 26B A4B MoE foundation:
| Attribute | Value |
|---|---|
| Total parameters | ~26B (25.2B in HF spec) |
| Active per forward | 3.8B (8 of 128 experts + 1 shared) |
| Context | Up to 256K tokens |
| Modalities | Text + image in; text out |
| Languages | 140+ |
| Canvas size | 256 tokens per diffusion block |
Self-correction: Bidirectional denoising lets the model revise masked tokens mid-block—useful for markdown formatting and structured output. Autoregressive models commit each token permanently.
Local footprint: Quantized weights target ~18 GB VRAM—high-end consumer GPUs without datacenter hardware.
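As a sanity check on the ~18 GB figure, the weight footprint can be estimated from the parameter count and the bits stored per weight. This is a sketch under stated assumptions: it uses the 25.2B total from the Hugging Face spec, counts all experts as resident in memory (not only the 3.8B active ones), and ignores the KV cache, activations, and any layers kept at higher precision. The source does not say which quantization format reaches 18 GB.

```python
TOTAL_PARAMS_BILLION = 25.2  # total parameters from the spec table above


def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    size = weight_gb(TOTAL_PARAMS_BILLION, bits)
    print(f"{bits}-bit weights: about {size:.1f} GB")
```

At 4 bits per weight the estimate is about 12.6 GB for weights alone, so an 18 GB target leaves a few GB for everything else; how that remainder is actually spent is not documented in the source.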
Google is explicit: use autoregressive Gemma 4 for production quality. DiffusionGemma is experimental—speed first.
Benchmark snapshot (Hugging Face Gemma 4 blog):
| Benchmark | DiffusionGemma | Gemma 4 26B A4B |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 | 69.1% | 88.3% |
| GPQA Diamond | 73.2% | 82.3% |
| HLE (no tools) | 11.0% | 8.7% |
DiffusionGemma wins on HLE (no tools) in this snapshot and trails on the other three benchmarks, which is the expected trade-off when giving up accuracy for throughput.
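The size of each gap is easier to compare when computed directly. The snippet below only restates the four rows of the table above as percentage-point differences; it adds no new data.

```python
# (DiffusionGemma, Gemma 4 26B A4B) scores in percent, copied from the table.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026": (69.1, 88.3),
    "GPQA Diamond": (73.2, 82.3),
    "HLE (no tools)": (11.0, 8.7),
}

# Positive gap means DiffusionGemma scores higher.
gaps = {
    name: round(diffusion - autoregressive, 1)
    for name, (diffusion, autoregressive) in SCORES.items()
}
wins = [name for name, gap in gaps.items() if gap > 0]

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
print("DiffusionGemma ahead on:", wins)
```

The largest deficit is on AIME 2026 (19.2 points), and the only lead is on HLE without tools (2.3 points).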
When speed wins:
When quality wins:
DiffusionGemma is not a text-only hack—it ships the broader Gemma 4 feature set:
Released alongside the wider Gemma 4 family (E2B, E4B, multimodal on-device models), DiffusionGemma is the speed-specialized sibling, released under the Apache 2.0 open license.
Download: Hugging Face model hub (Apache 2.0).
Serving and tuning:
| Stack | Support |
|---|---|
| Hugging Face Transformers | Yes |
| vLLM | Yes |
| NVIDIA NeMo / NIM | Yes |
| MLX (Apple Silicon) | Community ports |
| Unsloth | Fine-tuning |
| llama.cpp | Announced coming soon at launch |
| Google Cloud Model Garden | Yes |
NVIDIA optimization: NVFP4 4-bit kernels on Hopper/Blackwell; consumer RTX 5090/4090; DGX Spark and DGX Station for deskside local AI.
| Question | Choose DiffusionGemma | Choose Gemma 4 AR |
|---|---|---|
| Need lowest latency locally? | ✅ | |
| Need best benchmark scores? | | ✅ |
| Interactive editing / infilling? | ✅ | |
| Agent loops with tool use? | Caution—verify quality | ✅ |
| Apache 2.0 open weights? | ✅ | Gemma license |
| 18 GB consumer GPU? | ✅ (quantized) | ✅ (smaller variants too) |
For agentic coding stacks, DiffusionGemma is interesting as a local copilot engine—pair speed with verification loops (loop engineering) rather than trusting first-pass quality.
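One way to pair a fast generator with verification is a retry loop that only accepts drafts passing a check. The sketch below is illustrative: `fake_model` is a hypothetical stand-in for a local model call, and the verifier is a plain Python syntax check using the built-in `compile`. A real stack would more likely run tests or a linter.

```python
def generate_with_verification(generate, verify, prompt, max_attempts=3):
    """Call `generate` until `verify` accepts the draft or attempts run out.

    `generate(prompt, feedback)` returns a draft string.
    `verify(draft)` returns (ok, feedback_message).
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        draft = generate(prompt, feedback)
        ok, feedback = verify(draft)
        if ok:
            return draft, attempt
    return None, max_attempts


def syntax_check(code):
    """Verifier: accept only code that parses as valid Python."""
    try:
        compile(code, "<draft>", "exec")
        return True, None
    except SyntaxError as err:
        return False, f"SyntaxError: {err.msg}"


def fake_model(prompt, feedback):
    """Hypothetical model stub: the first draft is broken, the retry is fixed."""
    if feedback is None:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b"


result, attempts = generate_with_verification(
    fake_model, syntax_check, "write an add function"
)
print(attempts)
print(result)
```

The appeal of this pattern with a fast model is that a failed first draft costs little, provided the verifier itself is cheap and trustworthy.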
DiffusionGemma landed the same week as Claude Fable 5, Code with Claude Tokyo agent scheduling, and Thariq's agent-edited launch video—a dense news cycle where speed (DiffusionGemma), autonomy (Fable, managed agents), and orchestration (workflows) all advanced in parallel.
Google's bet: decode parallelism matters for the next wave of on-device and IDE-embedded models, even if autoregression keeps the quality crown for now.
Primary sources: Google DiffusionGemma blog · Hugging Face Gemma 4 launch · @Google · @sundarpichai
DiffusionGemma is Google's open speed experiment: 26B MoE, 256-token parallel diffusion blocks, up to 4× faster than autoregressive Gemma 4 (1000+ tok/s on H100, 700+ tok/s on RTX 5090), ~18 GB quantized local runs, Apache 2.0. Sundar Pichai's racehorse framing is apt: it wins races where latency dominates, not where MMLU Pro does.
For production text quality, Google still points to autoregressive Gemma 4. For interactive local generation, inline edits, and researcher exploration of text diffusion, DiffusionGemma is the model to benchmark this week.
Specs, benchmarks, and serving support reflect Google's June 10, 2026 release. Re-check Hugging Face and the Google developers blog before production deployment.