What did Wafer AI announce regarding GLM-5.2?

Wafer AI, in collaboration with Vercel AI Gateway and OpenRouter, announced serving GLM-5.2 on AMD MI355X at 213 tokens/second single-stream and 2,626 tokens/second aggregate throughput per node.

How does AMD MI355X compare to NVIDIA Blackwell on cost?

According to Wafer AI, AMD Instinct MI355X GPUs are approximately 2.75x cheaper per GPU than Blackwell B300, delivering over 2x lower inference cost for frontier models.

What quantization and framework were used?

Wafer AI quantized GLM-5.2 to MXFP4 using AMD Quark and served it via sglang, bypassing vLLM due to lack of MXFP4 MoE paths and avoiding ATOM due to long-context degradation.

How was speculative decoding fixed on ROCm?

Wafer AI resolved two bugs: mapping the mismatched MTP head layers in the Quark config to sglang, and adding a ROCm preprocessor guard (`#ifdef USE_ROCM`) to the fused multi-step metadata kernel.

Fastest GLM-5.2 on AMD MI355X: Wafer AI Performance | explainx.ai Blog

explainx.ainewsletter3.5k

workshops ↗

Fastest GLM-5.2 on AMD MI355X: Wafer AI Performance | explainx.ai Blog | explainx.ai

The demand for frontier model inference is outstripping NVIDIA's supply. With frontier models launching weekly—from Leanstral 1.5 to GLM-5.2—NVIDIA GPU prices remain high, and tokens are expensive.

In comes AMD. At approximately 2.75x cheaper per GPU (MI355X vs. NVIDIA Blackwell B300) with comparable silicon specs, the AMD Instinct line is a strong competitor for cost-effective inference. In a new technical deep dive, Wafer AI—in collaboration with Vercel AI Gateway and OpenRouter—announced serving GLM-5.2 on AMD Instinct MI355X at 2626 tok/s/node aggregate and 213 tok/s single-stream, achieving over 2x lower cost than Blackwell.

Here is how Wafer AI bypassed day-0 compilation friction, optimized speculative decoding on ROCm, and tuned the MoE kernels.

TL;DR: Quick Reference

Question	Answer
What is the headline speed?	213 tokens/second single-stream (served on TensorWave hardware) and 2,626 tokens/second aggregate per node.
How does the cost compare?	AMD MI355X delivers over 2x lower cost than NVIDIA Blackwell (B200/B300) for equivalent throughput.
What quantization was used?	MXFP4 (via AMD Quark), which proved to be lossless compared to the official Z.ai FP8 quantization.
What engine runs it?	sglang, chosen over vLLM and ATOM due to MoE path compatibility and long-context stability.
Were custom kernels written?

Sustained RPS	Aggregate tok/s/node	TTFT p50 / p95	Success Rate
0.5	449	0.59s / 0.60s	100%
1.0	974	0.60s / 0.81s	100%
1.5	1913	0.62s / 1.03s	100%
2.0	1944	0.62s / 1.05s	100%
2.25	2089	0.63s / 1.23s	100%
2.4 (Saturation)	2626	0.81s / 2.22s	100%

Evaluation	FP8 Baseline	MXFP4	Δ (MXFP4 − FP8)
GSM8K (200q, 5-shot)	0.965 ± 0.013	0.955 ± 0.014	−0.010
GPQA-Diamond (198q × 2)	0.9217 ± 0.027	0.9026 ± 0.029	−0.019
tau2 macro	0.819	0.834	+0.015

Fastest GLM-5.2 on AMD MI355X: Wafer AI Achieves 213 Tokens/Second

TL;DR: Quick Reference

Related posts

Hugging Face Was Breached by OpenAI's Own Models During a Cyber Eval

AI Cyber Guardrails Block US Defenders — Kimi K3 and GLM 5.2 Fix What Codex and Fable Refused

Colibrì: Run GLM-5.2 on 25 GB RAM by Streaming MoE Experts From Disk

Wafer AI’s GLM-5.2 Benchmark Numbers

1. Prefill-Heavy Aggregate Throughput

2. Single-Stream Decode

Step 1: Quantization and Framework Selection

MXFP4 Quantization via AMD Quark

Framework: sglang

Step 2: Speculative Decode Fixes on ROCm

1. MTP Head Prefix Mismatch

2. ROCm Preprocessor Guard

Step 3: Prefill Optimization and MoE Kernel Tuning

What People Are Asking: Hacker News Reactions

What about performance per watt?

Is MXFP4 really "lossless" compared to FP8?

What are the profit margins for AMD inference hosts?

TL;DR: Quick Reference

Related posts

Hugging Face Was Breached by OpenAI's Own Models During a Cyber Eval

AI Cyber Guardrails Block US Defenders — Kimi K3 and GLM 5.2 Fix What Codex and Fable Refused

Colibrì: Run GLM-5.2 on 25 GB RAM by Streaming MoE Experts From Disk

Wafer AI’s GLM-5.2 Benchmark Numbers

1. Prefill-Heavy Aggregate Throughput

2. Single-Stream Decode

Step 1: Quantization and Framework Selection

MXFP4 Quantization via AMD Quark

Framework: sglang

Step 2: Speculative Decode Fixes on ROCm

1. MTP Head Prefix Mismatch

2. ROCm Preprocessor Guard

Step 3: Prefill Optimization and MoE Kernel Tuning

What People Are Asking: Hacker News Reactions

What about performance per watt?

Is MXFP4 really "lossless" compared to FP8?

What are the profit margins for AMD inference hosts?

Related reading