June 29, 2026:Cerebras announced Gemma 4 31B running at 1,851 output tokens per second on Cerebras Inference β 35Γ a typical GPU endpoint per Artificial Analysis benchmarking. It is the first Google DeepMind model on the platform and the first multimodal model at wafer-scale speed: developers can feed images β screenshots, documents, charts, UI states β into inference that previously only text models achieved at this throughput.
Cerebras has benchmarked fast inference across open-weight stacks β Kimi, GLM, GPT-OSS, Qwen β on wafer-scale hardware. Gemma 4 is the first Google DeepMind model on the platform, and the first where vision enters the loop at Cerebras speed.
Olivier Lacombe, Product Lead for Gemma at Google DeepMind:
"Gemma 4, Google DeepMind's family of open models, was built to bring advanced reasoning and multimodal capabilities at developer-friendly sizes. Pairing these capabilities with Cerebras's wafer-scale technology provides developers with an exciting platform for running extremely fast visual and agentic workflows."
The pitch is not "same product, faster." Logan Kilpatrick (Google DeepMind):
"If every model was doing 2,000 tokens per second, you would probably build different products. You wouldn't build the same product and just have it be faster."
At 1,800+ TPS, multimodal agent loops β inspect image β reason β structured output β tool call β verify β retry β stop feeling like batch jobs and start feeling interactive.
Speed Numbers β What 1,851 TPS Changes
Metric
Gemma 4 31B on Cerebras
Typical GPU endpoint
Output TPS
1,851
~53 (35Γ slower)
First token (incl. reasoning)
1.5 s
Much higher on GPU
vs Claude Haiku 4.5 (same hardware class)
18Γ faster
β
Intelligence (AA Index)
29
Haiku 4.5: 30
Cerebras recommends Gemma 4 31B as the reference medium-size model on its cloud: an alternative to Haiku, GPT-OSS, or Llama with equal or higher intelligence at Cerebras speed.
Dense vs MoE: Gemma 4 31B is a dense multimodal model β high intelligence without the large memory footprint of MoE serving. That fits Cerebras's wafer-scale serving story: strong enough for serious agent work, efficient to run at scale, Apache 2.0 for build-around freedom.
Multimodal on Wafer-Scale β A Platform First
Before Gemma 4, Cerebras Inference was text-first at extreme speed. Gemma 4 adds:
Cerebras states multimodal support starts with Gemma 4 and will extend to additional models β the same platform pattern they used for Kimi and GLM text inference.
This directly addresses the workflow gap Zhipu users asked for in GLM-5.3 polls: vision integrated with reasoning, not a separate VL model bridge. Gemma 4 on Cerebras ships that integration today β on cloud wafer hardware, not local 16GB VRAM.
Feed a dense dashboard screenshot or document page. The model identifies what matters, explains the finding, returns structured output β in real time, not after a GPU wait.
Long-context summarization
Hand it a research report or technical brief. Get a decision-ready summary fast enough to read, react, and re-query in one sitting β relevant for teams comparing against Kimi K2.7's long-context coding or GLM document pipelines.
Screenshot to Patch
Play to medium-model strengths: broken UI screenshot + source + console error β minimal patch + verification checks. This is the agentic coding loop Fable 5 marketed β now runnable on open weights at speeds that keep a human in the loop.
Computer use and robotics
Gemma 4's multimodal stack supports UI state reasoning β overlapping with E4B + Argent simulator navigation at the edge, but at cloud scale and speed for heavier agents.
Why Agent Loops Compound at 1,800 TPS
Multimodal and agentic workflows rarely call a model once:
For international developers blocked from Fable 5, Gemma 4 + Cerebras is another unrestricted multimodal path β alongside GLM-5.2 text and Kimi K2.7-Code coding.
Haiku remains marginally smarter on the AA Index. Gemma 4 wins on speed, openness, and self-host/build-around freedom when paired with Cerebras cloud.
Cerebras Platform Context β Kimi, GLM, and the Speed Ladder
Cerebras has positioned itself as the inference speed leader for open weights:
Kimi, GLM, GPT-OSS, Qwen β text at wafer-scale TPS
Gemma 4 31B β first Google model, first multimodal
If your stack already routes GLM-5.2 through Z.ai for text coding, Gemma 4 on Cerebras is the vision complement β similar to the Qwen-VL β GLM bridge developers want eliminated in GLM-5.3, but with single-model multimodal at extreme speed.
Availability β Public Preview
Gemma 4 31B is on the Cerebras Inference Cloud in public preview for a limited time as of June 29, 2026.
Cerebras asks teams with workloads in:
Multimodal reasoning
Fast document processing
Real-time audio and video (future platform extensions)
This is not a self-serve unlimited free tier announcement β preview access with enterprise outreach for heavy multimodal pipelines.
What Developers Should Do
Need multimodal agents at interactive speed?
Evaluate Cerebras Inference Cloud preview for Gemma 4 31B. Benchmark your screenshot-to-patch or dashboard-insight loop against GPU baselines β Cerebras claims 35Γ on output TPS.
Need local / privacy-first multimodal?
Use Gemma 4 12B locally β 16GB VRAM, Apache 2.0, unified architecture. Trade wafer-scale TPS for data never leaving your machine.
Need frontier coding without vision?
GLM-5.2 and Kimi K2.7-Code remain the open-weight coding leaders while Fable is suspended.
Building computer-use agents?
Compare Gemma 4 on Cerebras (cloud speed) vs Gemma 4 E4B + Argent (local iOS simulator) β same family, different deployment surface.
The Honest Answer
Is Gemma 4 on Cerebras the fastest multimodal inference available?
Per Cerebras and Artificial Analysis methodology cited in the announcement β yes, as of June 29, 2026, at 1,851 output TPS for Gemma 4 31B.
Does it replace Fable 5 or Opus for coding?
No β it targets Haiku-class medium intelligence with multimodal speed. Serious autonomous coding at Fable depth still points to restricted US models or open alternatives (Kimi, GLM, LongCat-2.0).
Does it matter for product design?
Yes. Kilpatrick's quote is the thesis: at 2,000 TPS, you build different products β not the same agent with shorter waits. Multimodal loops become real-time collaborators.
Speed and intelligence figures cite Cerebras's June 29, 2026 announcement and Artificial Analysis benchmarking as referenced by Cerebras. Preview availability and pricing may change β verify on cerebras.ai before production commitments.