How fast is Gemma 4 31B on Cerebras?

Cerebras reports Gemma 4 31B at 1,851 output tokens per second as measured by Artificial Analysis — roughly 35× a typical GPU endpoint. First-token latency including reasoning is 1.5 seconds. Cerebras positions this as the world's fastest multimodal model inference as of its June 29, 2026 announcement.

Is Gemma 4 on Cerebras multimodal?

Yes. Gemma 4 31B is the first model on Cerebras Inference to support image understanding — screenshots, documents, charts, UI states, scanned pages, and diagrams. Cerebras states multimodal support will extend to additional models on the platform going forward.

How does Gemma 4 31B compare to Claude Haiku 4.5?

Per Artificial Analysis Intelligence Index scores cited by Cerebras, Gemma 4 31B scores 29 versus Haiku 4.5 at 30 — comparable intelligence. On Cerebras hardware Gemma 4 runs 18× faster than Haiku. Gemma 4 is open-weight under Apache 2.0; Haiku is a proprietary Anthropic API model.

Where can I access Gemma 4 on Cerebras?

Gemma 4 31B is available on the Cerebras Inference Cloud in public preview for a limited time as of June 29, 2026. Cerebras invites teams with multimodal reasoning, fast document processing, or real-time audio/video workloads to contact them directly.

What workloads is Gemma 4 on Cerebras built for?

Cerebras highlights screenshot-to-insight (dashboards, documents), long-context summarization, screenshot-to-patch (broken UI + console error → minimal fix), computer use, and robotics. Agentic and multimodal loops that call a model many times per task benefit most from 1,800+ TPS — verification and retries fit in the same product latency budget.

How does this relate to local Gemma 4 models?

Google DeepMind's Gemma family includes smaller local models like Gemma 4 12B (runs on 16GB VRAM) and edge variants like E4B for on-device automation. Gemma 4 31B on Cerebras targets cloud inference at wafer-scale speed — complementary to local deployment, not a replacement for laptop/edge use cases.

Gemma 4 on Cerebras: 1,851 TPS Multimodal Inference | explainx.ai Blog

explainx.ainewsletter3.5k

workshops ↗

Gemma 4 on Cerebras: 1,851 TPS Multimodal Inference | explainx.ai Blog | explainx.ai

Update — July 16, 2026: Gemma 4 31B agentic benchmarks improved — +10.1% Tau2 Telecom, +4.5% TB2 Agents, plus max_soft_tokens 1120 for sharper OCR on refreshed HF weights.

June 29, 2026: Cerebras announced Gemma 4 31B running at 1,851 output tokens per second on Cerebras Inference — 35× a typical GPU endpoint per Artificial Analysis benchmarking. It is the first Google DeepMind model on the platform and the first multimodal model at wafer-scale speed: developers can feed images — screenshots, documents, charts, UI states — into inference that previously only text models achieved at this throughput.

For teams building visual agent loops while Fable 5 remains offline and GLM-5.3 vision is still a community wishlist, Gemma 4 on Cerebras is a different bet: open-weight multimodal at real-time latency, not frontier closed API access.

TL;DR — Gemma 4 on Cerebras

Item	Detail
Model	Gemma 4 31B — Google DeepMind flagship dense open model

Metric	Gemma 4 31B on Cerebras	Typical GPU endpoint
Output TPS	1,851	~53 (35× slower)
First token (incl. reasoning)	1.5 s	Much higher on GPU
vs Claude Haiku 4.5 (same hardware class)	18× faster	—
Intelligence (AA Index)	29	Haiku 4.5: 30

Dimension	Gemma 4 31B (Cerebras)	Claude Haiku 4.5
Intelligence (AA Index)	29	30
Speed on Cerebras	1,851 TPS	18× slower (per Cerebras)
License	Apache 2.0 open weights	Proprietary API
Export control	No US nationality gate	Anthropic API terms
Multimodal	Native on Cerebras	Native via Anthropic API
Fable-class coding	Medium model — not Fable tier	Haiku ≠ Fable; Fable still offline

Gemma 4 31B on Cerebras: 1,800+ TPS — The Fastest Multimodal Inference Yet

TL;DR — Gemma 4 on Cerebras

Related posts

Gemma 4 12B: Multimodal Local AI Guide 2026

Gemma 4 July 2026 Update: Flash Attention 4, Tool Calling, and Vision Fixes

GLM-5.3: Zhipu AI Asks the Community — Vision Leads the Wishlist

Why Cerebras Paired With Gemma 4

Speed Numbers — What 1,851 TPS Changes

Multimodal on Wafer-Scale — A Platform First

Example Workloads Cerebras Highlights

Screenshot to Insight

Long-context summarization

Screenshot to Patch

Computer use and robotics

Why Agent Loops Compound at 1,800 TPS

Gemma 4 31B vs Haiku 4.5 — The Open-Weight Angle

Cerebras Platform Context — Kimi, GLM, and the Speed Ladder

Availability — Public Preview

What Developers Should Do

Need multimodal agents at interactive speed?

Need local / privacy-first multimodal?

Need frontier coding without vision?

Building computer-use agents?

The Honest Answer

Gemma ecosystem

Open-weight alternatives

Agent context

TL;DR — Gemma 4 on Cerebras

Related posts

Gemma 4 12B: Multimodal Local AI Guide 2026

Gemma 4 July 2026 Update: Flash Attention 4, Tool Calling, and Vision Fixes

GLM-5.3: Zhipu AI Asks the Community — Vision Leads the Wishlist

Why Cerebras Paired With Gemma 4

Speed Numbers — What 1,851 TPS Changes

Multimodal on Wafer-Scale — A Platform First

Example Workloads Cerebras Highlights

Screenshot to Insight

Long-context summarization

Screenshot to Patch

Computer use and robotics

Why Agent Loops Compound at 1,800 TPS

Gemma 4 31B vs Haiku 4.5 — The Open-Weight Angle

Cerebras Platform Context — Kimi, GLM, and the Speed Ladder

Availability — Public Preview

What Developers Should Do

Need multimodal agents at interactive speed?

Need local / privacy-first multimodal?

Need frontier coding without vision?

Building computer-use agents?

The Honest Answer

Related Reading

Gemma ecosystem

Open-weight alternatives

Agent context