The demand for frontier model inference is outstripping NVIDIA's supply. With frontier models launching weekly—from Leanstral 1.5 to GLM-5.2—NVIDIA GPU prices remain high, and tokens are expensive.
Enter AMD. The MI355X is approximately 2.75x cheaper per GPU than the NVIDIA Blackwell B300, with comparable silicon specs, which makes the AMD Instinct line a strong competitor for cost-effective inference. In a new technical deep dive, Wafer AI, in collaboration with Vercel AI Gateway and OpenRouter, announced that it is serving GLM-5.2 on AMD Instinct MI355X at 2,626 tok/s/node aggregate and 213 tok/s single-stream, at over 2x lower cost than Blackwell.
Here is how Wafer AI bypassed day-0 compilation friction, optimized speculative decoding on ROCm, and tuned the MoE kernels.
TL;DR: Quick Reference
| Question | Answer |
| --- | --- |
| What is the headline speed? | 213 tokens/second single-stream (served on TensorWave hardware) and 2,626 tokens/second aggregate per node. |
| How does the cost compare? | AMD MI355X delivers over 2x lower cost than NVIDIA Blackwell (B200/B300) for equivalent throughput. |
| What quantization was used? | MXFP4 (via AMD Quark), which proved virtually lossless compared to the official Z.ai FP8 quantization. |
| What engine runs it? | sglang, chosen over vLLM and ATOM due to MoE path compatibility and long-context stability. |
| Were custom kernels written? | No. Unlike previous Qwen runs, this was achieved by fixing framework config mismatches and ROCm preprocessor guards. |
Wafer AI’s GLM-5.2 Benchmark Numbers
Wafer AI evaluated performance across two major workloads using AMD Instinct MI355X capacity provided by TensorWave:
1. Prefill-Heavy Aggregate Throughput
Under a prefill-heavy workload of 20k input tokens / 1k output tokens (with a 60% KV-cache hit rate), Wafer reached an aggregate throughput of 2626 tok/s/node at saturation (2.4 requests per second, RPS), with a p95 Time-to-First-Token (TTFT) of 2.22 seconds. This represents 80% of Blackwell B200 performance at less than half the hardware cost.
| Sustained RPS | Aggregate tok/s/node | TTFT p50 / p95 | Success Rate |
| --- | --- | --- | --- |
| 0.5 | 449 | 0.59s / 0.60s | 100% |
| 1.0 | 974 | 0.60s / 0.81s | 100% |
| 1.5 | 1913 | 0.62s / 1.03s | 100% |
| 2.0 | 1944 | 0.62s / 1.05s | 100% |
| 2.25 | 2089 | 0.63s / 1.23s | 100% |
| 2.4 (Saturation) | 2626 | 0.81s / 2.22s | 100% |
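The cost claim can be sanity-checked with quick arithmetic. The sketch below uses only ratios quoted in this article (80% of B200 throughput, hardware at under half the cost, and the roughly 2.75x per-GPU price gap quoted against the B300). The function name is illustrative, and combining the B200 throughput figure with the B300 price gap is an assumption made for the sake of the example, not something the source does.

```python
def perf_per_dollar_ratio(relative_throughput: float, relative_cost: float) -> float:
    """Throughput per dollar of system A relative to system B.

    relative_throughput: A's throughput as a fraction of B's.
    relative_cost: A's hardware cost as a fraction of B's.
    """
    if relative_throughput <= 0 or relative_cost <= 0:
        raise ValueError("ratios must be positive")
    return relative_throughput / relative_cost


# Conservative bound: 80% of the throughput at exactly half the cost.
conservative = perf_per_dollar_ratio(0.80, 0.50)

# Assumption: the ~2.75x per-GPU price gap carries over to the whole node.
with_price_gap = perf_per_dollar_ratio(0.80, 1 / 2.75)

print(f"{conservative:.2f}x")    # 1.60x
print(f"{with_price_gap:.2f}x")  # 2.20x
```

Under these assumptions the second figure lines up with the "over 2x lower cost" headline, while the first shows the floor implied by "less than half the hardware cost" alone.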
2. Single-Stream Decode
Using Artificial Analysis standards on a 10k input / 1.5k output workload, the setup achieved 213 tokens/second on a single stream. While not topping the absolute speed leaderboards, it establishes a new state-of-the-art for performance per dollar.
Step 1: Quantization and Framework Selection
To serve Zhipu AI's 744B Mixture-of-Experts (MoE) model efficiently, Wafer had to choose the right quantization and runtime.
MXFP4 Quantization via AMD Quark
They quantized the base bf16 model to MXFP4 using AMD Quark. In evaluations against the official FP8 baseline, the MXFP4 quantization proved virtually lossless:
| Evaluation | FP8 Baseline | MXFP4 | Δ (MXFP4 − FP8) |
| --- | --- | --- | --- |
| GSM8K (200q, 5-shot) | 0.965 ± 0.013 | 0.955 ± 0.014 | −0.010 |
| GPQA-Diamond (198q × 2) | 0.9217 ± 0.027 | 0.9026 ± 0.029 | −0.019 |
| tau2 macro | 0.819 | 0.834 | +0.015 |
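One way to read these numbers is to compare each delta against the reported uncertainty. The sketch below assumes the ± values are standard errors and that the FP8 and MXFP4 runs are independent; the source states neither, and the helper name is illustrative. The tau2 row is left out because no uncertainty is reported for it.

```python
import math


def delta_in_sigma(fp8: float, fp8_err: float, mxfp4: float, mxfp4_err: float):
    """Return (delta, delta / combined standard error) for two independent scores."""
    delta = mxfp4 - fp8
    combined = math.sqrt(fp8_err**2 + mxfp4_err**2)
    return delta, delta / combined


# Scores and +/- values copied from the evaluation results above.
results = {
    "GSM8K": delta_in_sigma(0.965, 0.013, 0.955, 0.014),
    "GPQA-Diamond": delta_in_sigma(0.9217, 0.027, 0.9026, 0.029),
}

for name, (delta, z) in results.items():
    print(f"{name}: delta={delta:+.3f}, {abs(z):.2f} combined standard errors")
```

Under those assumptions both drops sit within one combined standard error, which is consistent with describing the quantization as virtually lossless on these benchmarks.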
Framework: sglang
Wafer evaluated three inference engines: vLLM, ATOM, and sglang.
vLLM lacked working MXFP4 + GlmMoeDsa paths, meaning the 4-bit weights yielded no speedup.
ATOM suffered from severe output quality degradation at longer context windows.
sglang was selected as the engine with the least friction for native MoE structures, maintaining coherence at scale.
Step 2: Speculative Decode Fixes on ROCm
Speculative decoding (Multi-Token Prediction / MTP) was critical for hitting 213 tok/s, but sglang's ROCm image did not support it out of the box for GLM-5.2. Wafer resolved two key software mismatches:
1. MTP Head Prefix Mismatch
The MTP head keeps its single shared expert in bf16, not MXFP4. However, the MTP head was registered under a different module prefix than the main decoder stack.
Due to this mismatch, sglang's quantization lookup failed, defaulted to building the shared expert as MXFP4, and crashed during initialization on a shape mismatch (trying to load bf16 weights into a 4-bit slot). Wafer resolved this by copying the layer 78 config entries to the list under the decoder name sglang actually uses. This fix unblocked speculative decode, yielding a 3x gain in single-stream throughput.
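The failure mode and the fix can be illustrated with a small model of a name-based quantization lookup. This is a sketch of the idea only, not sglang's code: the module prefixes, the exclusion-set layout, and both function names are hypothetical.

```python
def resolve_quant(module_name: str, bf16_modules: set[str]) -> str:
    """Return the dtype a loader would build: bf16 if the module is listed, else MXFP4."""
    return "bf16" if module_name in bf16_modules else "mxfp4"


def alias_entries(bf16_modules: set[str], stored_prefix: str, runtime_prefix: str) -> set[str]:
    """Copy every entry under stored_prefix so it also appears under runtime_prefix."""
    aliased = {
        runtime_prefix + name[len(stored_prefix):]
        for name in bf16_modules
        if name.startswith(stored_prefix)
    }
    return bf16_modules | aliased


# Hypothetical names: the checkpoint config lists the MTP shared expert under the
# layer-78 prefix, while the runtime registers the same module under another prefix.
STORED = "model.layers.78."
RUNTIME = "mtp_decoder.layers.0."  # placeholder, not the real sglang module name
config_bf16 = {STORED + "mlp.shared_experts"}

runtime_name = RUNTIME + "mlp.shared_experts"
before = resolve_quant(runtime_name, config_bf16)
after = resolve_quant(runtime_name, alias_entries(config_bf16, STORED, RUNTIME))
print(before, "->", after)  # mxfp4 -> bf16
```

Before aliasing, the lookup misses and falls through to the 4-bit default, which is the path that would then fail when bf16 weights are loaded. After aliasing, the same module resolves to bf16.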
2. ROCm Preprocessor Guard
Deep speculative decode (draft depth ≥ 4) was blocked because the fused multi-step metadata kernel hardcoded #include <cuda_runtime.h> without a ROCm fallback. Wafer added a simple #ifdef USE_ROCM guard so that the kernel compiles on AMD.
Alongside these fixes, enabling config tweaks like --kv-cache-dtype fp8_e4m3 and --enable-aiter-allreduce-fusion secured the 213 tok/s decode performance.
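For orientation, here is a minimal sketch of how those two options might be passed when launching the server. The model path is a placeholder, the helper name is illustrative, and the second flag is taken from this article as written. Speculative-decoding and parallelism options are deliberately left out because the source does not list them.

```python
import shlex


def build_launch_command(model_path: str) -> str:
    """Assemble an sglang server launch command with the two options named above."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--kv-cache-dtype", "fp8_e4m3",     # keep the KV cache in FP8 (e4m3)
        "--enable-aiter-allreduce-fusion",  # flag name as given in the article
    ]
    return shlex.join(args)


command = build_launch_command("/models/GLM-5.2-MXFP4")  # placeholder path
print(command)
```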
Step 3: Prefill Optimization and MoE Kernel Tuning
Because the 20k input workload is primarily prefill-bound, decode-only optimizations were insufficient for aggregate throughput.
At Tensor Parallelism 8 (TP8), the MI355X ran GLM-5.2-MXFP4 at 1461 tok/s/node. Switching to TP4×DP2 (Tensor Parallelism 4, Data Parallelism 2) increased throughput to 1944 tok/s/node at 2.0 RPS.
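To make the two layouts concrete, the sketch below splits the GPUs of a node into data-parallel replicas, each of which is a tensor-parallel group. It assumes an 8-GPU node (implied by TP8) and contiguous GPU ids per replica; the function name is illustrative and this is not how any particular engine assigns devices.

```python
def parallel_layout(num_gpus: int, tp: int, dp: int) -> list[list[int]]:
    """Split GPU ids into `dp` replicas, each a tensor-parallel group of `tp` GPUs."""
    if tp * dp != num_gpus:
        raise ValueError(f"tp * dp must equal num_gpus ({tp} * {dp} != {num_gpus})")
    return [list(range(r * tp, (r + 1) * tp)) for r in range(dp)]


tp8 = parallel_layout(8, tp=8, dp=1)     # one replica sharded across all 8 GPUs
tp4dp2 = parallel_layout(8, tp=4, dp=2)  # two independent replicas of 4 GPUs each
print(tp8)     # [[0, 1, 2, 3, 4, 5, 6, 7]]
print(tp4dp2)  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Throughput figures quoted above (tok/s/node).
speedup = 1944 / 1461
print(f"{speedup:.2f}x")  # 1.33x
```

With TP4×DP2, each request is handled by one 4-GPU replica, so collective communication spans 4 GPUs instead of 8 while two requests streams are served in parallel.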
However, they discovered that GLM-5.2's fp4 MoE was silently falling back to a slow FlyDSL heuristic because aiter only shipped tuned configs for the standard a8w8/fp8 path. Wafer manually tuned the MoE kernel selection on GLM’s specific fp4 shapes:
model_dim: 6144
moe_inter: 2048
E (Experts): 256
topk: 8
This manual kernel selection mapping unlocked the final jump to 2626 tok/s/node at 2.4 RPS.
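The silent fallback can be pictured as a lookup in a table of tuned configurations keyed by quantization path and problem shape. The sketch below is an illustration under that assumption, not aiter's actual config format; the type, function, and kernel names are all hypothetical.

```python
from typing import NamedTuple


class MoEShape(NamedTuple):
    quant: str      # quantization path, e.g. "fp8" or "fp4"
    model_dim: int
    moe_inter: int
    experts: int
    topk: int


def select_kernel(shape: MoEShape, tuned: dict[MoEShape, str]) -> tuple[str, bool]:
    """Return (kernel name, used_fallback) for the given MoE shape."""
    if shape in tuned:
        return tuned[shape], False
    return "generic_heuristic", True  # slow path when no tuned entry exists


# GLM-5.2 shapes from the list above; kernel names are placeholders.
glm_fp4 = MoEShape("fp4", 6144, 2048, 256, 8)
tuned = {MoEShape("fp8", 6144, 2048, 256, 8): "tuned_fp8_kernel"}

untuned_choice = select_kernel(glm_fp4, tuned)
tuned[glm_fp4] = "tuned_fp4_kernel"  # add an entry for the fp4 shape
tuned_choice = select_kernel(glm_fp4, tuned)

print(untuned_choice)  # ('generic_heuristic', True)
print(tuned_choice)    # ('tuned_fp4_kernel', False)
```

Returning the fallback flag alongside the kernel name is a deliberate choice in this sketch: logging it at startup would make a missing tuned entry visible instead of silent.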
What People Are Asking: Hacker News Reactions
The Wafer AI announcement quickly climbed to the top of Hacker News, triggering a debate about AMD vs. NVIDIA in the datacenter:
What about performance per watt?
Several commenters noted that the AMD Instinct MI355X draws 1,400W per GPU compared to 1,200W for the NVIDIA Blackwell B200 (roughly 17% more). Even in regions with high power costs, such as Germany, running these GPUs continuously for years adds only a minor operating cost relative to the difference in upfront acquisition price. The more important constraint is the datacenter's power limit: a more efficient setup fits more compute under a limited power hookup.
Additionally, the MI355X's larger memory capacity (288GB of HBM3e per GPU, roughly 100GB more than the B200) complicates direct comparisons, as it allows running larger models on fewer cards.
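The power argument is easy to put into numbers. The sketch below uses the two wattage figures quoted by commenters; the electricity price is a placeholder assumption, not a figure from the discussion, and cooling overhead is ignored.

```python
HOURS_PER_YEAR = 24 * 365


def extra_energy_cost(extra_watts: float, price_per_kwh: float, years: float = 1.0) -> float:
    """Cost of drawing `extra_watts` continuously for `years` at a flat energy price."""
    kwh = extra_watts * HOURS_PER_YEAR * years / 1000
    return kwh * price_per_kwh


extra_watts = 1400 - 1200  # per-GPU difference quoted above
extra_kwh_per_year = extra_watts * HOURS_PER_YEAR / 1000
print(extra_kwh_per_year)  # 1752.0 kWh per GPU per year

# Assumption: 0.30 currency units per kWh, chosen only for illustration.
three_year_cost = extra_energy_cost(extra_watts, price_per_kwh=0.30, years=3)
print(f"{three_year_cost:.2f}")  # 1576.80
```

Swapping in a real tariff and the actual per-GPU price difference shows how the energy premium compares with the acquisition gap for a given deployment.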
Is MXFP4 really "lossless" compared to FP8?
A few users pointed out the minor accuracy drops in GPQA-Diamond (−0.019) and GSM8K (−0.010) when switching from FP8 to MXFP4. Both drops are smaller than the ± uncertainty reported for either run, and Wafer engineers consider them statistically negligible for production coding agents, especially given the large throughput gain.
What are the profit margins for AMD inference hosts?
A Wafer engineer confirmed that gross margins for AMD inference hosting currently average ~40%. Node utilization remains the primary factor in determining these margins.