Alibaba's Qwen team just released something most AI labs haven't tried: a model trained not to act in environments, but to model them. Qwen-AgentWorld is a language world model — a single model that predicts what seven different agent environments (terminals, search engines, MCP servers, browsers, Android UIs, desktop OSes, and code editors) will return after any given action.
The flagship 397B-A17B version outperforms GPT-5.4 and Claude Opus 4.8 on AgentWorldBench, the new seven-domain benchmark released alongside it. The 35B-A3B version (MoE, 3B active parameters, 256K context) is fully open-source on Hugging Face.
The paper dropped June 23, 2026. The GitHub and Hugging Face repos are live.
The Core Idea: Model the Environment, Not Just the Agent
Every AI agent tutorial shows the same loop: agent observes state → agent picks action → environment returns new state → repeat. Almost all research optimizes one side of this loop: the policy (which action to take). Few efforts have explicitly trained a language model on the other side: predicting what the environment returns.
That's what Qwen-AgentWorld does. Given the interaction history and the agent's next action, the world model predicts the environment's response. For a terminal, that means predicting the exact shell output. For a search engine, that means generating realistic URLs, snippets, and rankings. For an MCP server, that means predicting the correct API response while maintaining referential consistency across sequential calls.
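A rough sketch of that interface may help. The `Turn` record, the `build_prompt` layout, and the toy `echo_world_model` below are illustrative assumptions, not names or formats from the paper or the released code. The point is only the shape of the problem: history plus next action in, predicted observation out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Turn:
    """One (action, observation) pair from an agent-environment interaction."""
    action: str
    observation: str


# Hypothetical signature: (domain, history, next_action) -> predicted observation.
WorldModel = Callable[[str, List[Turn], str], str]


def build_prompt(domain: str, history: List[Turn], next_action: str) -> str:
    """Serialize the interaction so a language model can continue it.

    The layout is an illustrative assumption, not the released prompt format.
    """
    lines = [f"[domain] {domain}"]
    for i, turn in enumerate(history, start=1):
        lines.append(f"[action {i}] {turn.action}")
        lines.append(f"[observation {i}] {turn.observation}")
    lines.append(f"[action {len(history) + 1}] {next_action}")
    lines.append(f"[observation {len(history) + 1}]")
    return "\n".join(lines)


def echo_world_model(domain: str, history: List[Turn], next_action: str) -> str:
    """Toy stand-in for a real model: it only handles `echo` in a terminal."""
    if domain == "terminal" and next_action.startswith("echo "):
        return next_action[len("echo "):]
    return ""


history = [Turn(action="pwd", observation="/home/user")]
prompt = build_prompt("terminal", history, "echo hello")
prediction = echo_world_model("terminal", history, "echo hello")
```

A real world model would replace `echo_world_model` with a call to the served checkpoint; everything else in the loop stays the same.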
This isn't template generation. Getting these predictions right requires the same capabilities that make a good agent: multi-step causal reasoning, long-context state tracking, and domain-specific knowledge. For background on how agentic systems operate in these kinds of environments, see our overview of MCP servers and how to connect them to AI tools.
Seven Domains, One Model
Qwen-AgentWorld covers four text-based and three GUI-based environment types:
Text Environments
- Terminal — shell output, file system state, process behavior
- Search — URLs, snippets, rankings, page content
- MCP — API server responses, tool call results, database state
- SWE — git diffs, test results, compilation errors from code changes
GUI Environments (rendered as accessibility tree XML / HTML, not pixel frames)
- Web — DOM state changes after user interactions
- Android — UI hierarchy changes after touch/gesture actions
- OS — desktop file system, window management, application behavior
Training on all seven domains simultaneously isn't just convenient; it matters for what the model learns. The paper's experiments show that training on one text domain produces gains on all other text domains, which suggests the model is learning shared environment-modeling capabilities that compound as domain coverage grows.
How It Was Trained
The three-stage pipeline follows a clear principle: CPT injects, SFT activates, RL sharpens.
Stage 1: Continual Pre-Training
CPT injects environment knowledge through more than 10 million real environment interaction trajectories. Data comes from containerized execution sandboxes, MCP servers, and Android/web/OS emulators. Beyond environment traces, the training incorporates specialized world knowledge: industrial control, cybersecurity, law, medicine, finance, and current affairs — because accurately simulating a regulatory compliance platform requires legal knowledge; simulating search results on current events requires factual grounding.
A notable contribution here is turn-level information-theoretic loss masking: four surface-level statistics per (action, observation) pair identify turns that carry genuine environment information and mask the rest from the loss while retaining them as context. This prevents the model from spending capacity learning to predict boilerplate.
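The masking half of this idea is easy to sketch. The four statistics that decide which turns are informative are not spelled out here, so the example below takes a per-turn boolean flag as input and treats the scoring step as out of scope. The function name and the use of `-100` as the ignore label (the default `ignore_index` in PyTorch's cross-entropy loss) are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def mask_uninformative_turns(
    turn_token_ids: Sequence[Sequence[int]],
    is_informative: Sequence[bool],
) -> Tuple[List[int], List[int]]:
    """Flatten turns into one sequence and hide uninformative turns from the loss.

    Every token stays in `input_ids`, so masked turns still serve as context.
    Only `labels` changes: tokens from masked turns become IGNORE_INDEX.
    """
    if len(turn_token_ids) != len(is_informative):
        raise ValueError("need one informativeness flag per turn")
    input_ids: List[int] = []
    labels: List[int] = []
    for tokens, keep in zip(turn_token_ids, is_informative):
        input_ids.extend(tokens)
        labels.extend(tokens if keep else [IGNORE_INDEX] * len(tokens))
    return input_ids, labels


# Three turns; the middle one is flagged as boilerplate (flag values are made up).
turns = [[11, 12], [21, 22, 23], [31]]
flags = [True, False, True]
input_ids, labels = mask_uninformative_turns(turns, flags)
```

The masked turn contributes nothing to the gradient but remains visible to attention, which is the "retain as context" part of the description.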
Stage 2: Supervised Fine-Tuning
SFT activates next-state prediction as an explicit <think>...</think> reasoning pattern. The authors use rejection sampling to select 7,094 high-quality thinking trajectories. At this stage, the model learns to reason through what the environment would do before generating the prediction.
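Rejection sampling in this setting can be sketched as "draw several candidates, keep the ones that pass a quality bar." The selection criteria actually used are not described here, so `generate`, `score`, and the threshold below are hypothetical placeholders, and the toy functions exist only to make the example run.

```python
import random
from typing import Callable, List, Tuple


def rejection_sample(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    num_candidates: int,
    threshold: float,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Draw candidates for one prompt and keep those scoring >= threshold."""
    rng = random.Random(seed)
    kept: List[Tuple[str, float]] = []
    for _ in range(num_candidates):
        candidate = generate(prompt, rng)
        candidate_score = score(prompt, candidate)
        if candidate_score >= threshold:
            kept.append((candidate, candidate_score))
    # Highest-scoring candidates first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Toy stand-ins (hypothetical): a noisy "model" and an exact-match scorer.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return rng.choice(["hi", "hello", "", "hi there"])


def toy_score(prompt: str, candidate: str) -> float:
    return 1.0 if candidate == "hi" else 0.0


accepted = rejection_sample(
    "echo hi", toy_generate, toy_score, num_candidates=16, threshold=1.0
)
```

In a real pipeline the candidate would be a full thinking trace plus prediction, and the scorer would compare the prediction against the recorded environment output.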
Stage 3: Reinforcement Learning
RL sharpens output quality using GSPO with hybrid rewards: an LLM judge scoring multi-dimensional quality, combined with rule-based verifiers for domains where exact correctness is checkable. The three-stage pipeline adds +8.66 points overall at 35B-A3B scale (47.73 → 56.39), putting it above Claude Sonnet 4.6 (56.04) at that size.
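One way to picture a hybrid reward is a blend of a soft judge score and a hard verifier check. How the two signals are actually combined is not stated here, so the linear weighting, the function names, and the toy judge below are all assumptions made for illustration.

```python
from typing import Callable, Optional


def hybrid_reward(
    prediction: str,
    reference: str,
    judge: Callable[[str, str], float],
    verifier: Optional[Callable[[str, str], bool]] = None,
    verifier_weight: float = 0.5,
) -> float:
    """Blend a judge score in [0, 1] with an optional rule-based check.

    Domains without a checkable answer pass verifier=None and rely on the
    judge alone. The linear blend is an assumed design, not the paper's.
    """
    judge_score = min(1.0, max(0.0, judge(prediction, reference)))
    if verifier is None:
        return judge_score
    verified = 1.0 if verifier(prediction, reference) else 0.0
    return (1.0 - verifier_weight) * judge_score + verifier_weight * verified


# Toy stand-ins (hypothetical): token-overlap "judge" and exact-match verifier.
def toy_judge(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = set(prediction.split()), set(reference.split())
    union = pred_tokens | ref_tokens
    return len(pred_tokens & ref_tokens) / len(union) if union else 1.0


def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip() == reference.strip()


reward_exact = hybrid_reward("a b", "a b", toy_judge, exact_match)
reward_partial = hybrid_reward("a b", "a c", toy_judge, exact_match)
reward_judge_only = hybrid_reward("a b", "a c", toy_judge)
```

The verifier term keeps the reward honest where exact output is checkable, while the judge term still gives a gradient on open-ended domains such as search.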
AgentWorldBench Results
| Model | Overall |
|---|---|
| Qwen-AgentWorld-397B-A17B | 58.71 |
| GPT-5.4 | 58.25 |
| Claude Opus 4.8 | — (below GPT-5.4) |
| Qwen-AgentWorld-35B-A3B | 56.39 |
| Claude Sonnet 4.6 | 56.04 |
The advantage is most pronounced on Terminal and SWE — exactly the domains where predictions require accurate modeling of code execution state and API behavior, rather than surface-level plausibility.
What the Model Reasons About
Analyzing 129 thinking traces across four text domains reveals three emergent reasoning patterns that weren't explicitly trained:
Deliberative self-correction. The model uses "Wait!" as a cognitive interrupt to revise intermediate predictions — averaging 10.4 such interrupts per turn. It catches reasoning errors, marks epistemic limits ("I cannot actually execute np.random.seed(42)"), and takes the perspective of both the environment and the agent.
Information leakage prevention. In search simulation, the model holds a reference answer that the agent is trying to find. When the query is unrelated to the target, the model ensures its generated snippets don't accidentally surface the answer — the world-model equivalent of theory of mind.
Multi-step causal reasoning. Predicting the output of curl -s localhost:3000 | python3 -m json.tool requires a six-step chain: Node.js missing → server never started → no listener on port 3000 → curl fails silently → empty pipe → json.tool raises JSONDecodeError. The model works through this chain rather than pattern-matching to a plausible output.
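The last link in that chain is easy to reproduce in isolation. The sketch below assumes the upstream command produced no output and feeds an empty string to Python's JSON parser. It uses `json.loads` directly rather than the `json.tool` command-line wrapper, whose exact terminal output can vary by Python version.

```python
import json

# Assumption for illustration: curl produced no output because nothing
# was listening on the port, so the parser receives an empty string.
piped_input = ""

try:
    json.loads(piped_input)
    outcome = "parsed"
except json.JSONDecodeError as err:
    # An empty document is not valid JSON, so parsing fails immediately.
    outcome = f"JSONDecodeError: {err.msg}"
```

A world model that pattern-matched on "curl piped into json.tool" would likely emit pretty-printed JSON; getting the error instead requires tracking the earlier state that the server never started.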
Two Paradigms for Using World Models in Agent Training
The more practical question is: given a language world model, what do you do with it? The paper explores two answers.
Paradigm I: Decoupled Simulation (World Model as Environment)
Here, the world model replaces the real environment during agent RL. The agent acts, the world model predicts the observation, the agent learns from simulated rollouts. The agent and world model are separate models.
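In code, the decoupled setup amounts to swapping the environment's step function for a model call. The sketch below is a minimal rollout collector under that assumption; `simulated_rollout`, the `DONE` stop action, and the scripted toy policy and simulator are hypothetical names introduced for illustration, and the RL update itself is left out.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]            # (action, observation) pairs
Policy = Callable[[History], str]          # history -> next action
Simulator = Callable[[History, str], str]  # (history, action) -> observation


def simulated_rollout(
    policy: Policy,
    simulator: Simulator,
    max_turns: int,
    stop_action: str = "DONE",
) -> History:
    """Collect one trajectory where the world model plays the environment."""
    history: History = []
    for _ in range(max_turns):
        action = policy(history)
        if action == stop_action:
            break
        # The real environment is never touched: the simulator predicts
        # what it would have returned.
        observation = simulator(history, action)
        history.append((action, observation))
    return history


# Toy stand-ins (hypothetical): a scripted agent and a lookup-table simulator.
def scripted_policy(history: History) -> str:
    script = ["ls", "cat notes.txt", "DONE"]
    return script[len(history)]


def toy_simulator(history: History, action: str) -> str:
    canned: Dict[str, str] = {
        "ls": "notes.txt",
        "cat notes.txt": "remember the milk",
    }
    return canned.get(action, "command not found")


trajectory = simulated_rollout(scripted_policy, toy_simulator, max_turns=5)
```

Because the policy and the simulator only meet through strings, either side can be replaced independently, which is what "decoupled" buys you.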
Zero-shot generalization. Qwen-AgentWorld simulated 4,000 OpenClaw environments — an open-source agent platform entirely out of distribution — with no domain-specific adaptation. This produced gains of +4.3 on Claw-Eval and +7.1 on QwenClawBench for the downstream agent.
The quality bottleneck is real: using Qwen3.6-Plus as the simulator produced negligible improvement. World model quality directly determines whether Sim RL works at all.
Controllable simulation. This is where things get interesting. Natural-language instructions can shape the simulator's behavior at training time — injecting intermittent API errors, paginated responses, partial batch failures, or any targeted edge case that real environments produce rarely and inconsistently.
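A plausible way to wire this up is to pass the constraints to the simulator as plain text alongside the agent's action. The prompt layout and the `build_controlled_prompt` helper below are assumptions for illustration; the actual instruction format used with Qwen-AgentWorld is not described here.

```python
from typing import Sequence


def build_controlled_prompt(
    base_instructions: str,
    controls: Sequence[str],
    action: str,
) -> str:
    """Assemble a simulator prompt with optional natural-language constraints.

    With no controls, the prompt reduces to the uncontrolled case.
    """
    sections = [base_instructions.strip()]
    if controls:
        bullet_list = "\n".join(f"- {control}" for control in controls)
        sections.append("Simulation constraints:\n" + bullet_list)
    sections.append("Agent action:\n" + action)
    return "\n\n".join(sections)


# Example constraints drawn from the edge cases named above; the tool call
# string is made up.
controlled = build_controlled_prompt(
    "You simulate an MCP server. Predict the server's response.",
    [
        "Return an intermittent API error on roughly one call in five.",
        "Paginate list responses at 20 items per page.",
    ],
    'list_orders(customer_id="c-102")',
)
uncontrolled = build_controlled_prompt(
    "You simulate an MCP server. Predict the server's response.",
    [],
    'list_orders(customer_id="c-102")',
)
```

Keeping the constraints as a separate list makes it simple to sample a different mix of edge cases for each training episode.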
Results for MCP tasks:
| Model | Tool Decathlon | MCPMark |
|---|---|---|
| Qwen3.5-35B-A3B-SFT | 32.4 | 21.5 |
| + Sim RL (uncontrolled) | 31.5 | 24.6 |
| + Sim RL (controlled) | 36.1 | 33.8 |
Uncontrolled simulation slightly hurt Tool Decathlon (32.4 → 31.5) while giving only a modest gain on MCPMark (21.5 → 24.6). Controlled simulation improved both, which suggests controllability is what makes Sim RL work in the MCP domain.
Fictional-world search training. For search tasks, they constructed 1,000 self-consistent fictional environments: relational databases of 300–500 rows of internally consistent invented facts. A time-shifted environment might contain a 2029 smartphone market ranking with real brand names but non-existent model numbers. Because answers exist only in the fictional world, the agent can't bypass retrieval by answering from memory. Because all facts are invented, there's no risk of the agent confusing simulated facts with real-world knowledge.
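To make the idea concrete, here is a small sketch of generating one internally consistent fictional table. It is far simpler than a 300 to 500 row relational database, and the schema, the placeholder brand names, and the `build_fictional_ranking` helper are all invented for illustration. The consistency property shown is that rank always agrees with market share.

```python
import random
from typing import Dict, List


def build_fictional_ranking(
    brands: List[str], year: int, seed: int
) -> List[Dict[str, object]]:
    """Invent a market ranking whose rank column is consistent with its shares."""
    rng = random.Random(seed)  # seeded so the same world can be rebuilt exactly
    weights = [rng.uniform(1.0, 10.0) for _ in brands]
    total = sum(weights)
    rows: List[Dict[str, object]] = []
    for brand, weight in zip(brands, weights):
        rows.append({
            "year": year,
            "brand": brand,
            # Invented model number, so no answer can come from memory.
            "model": f"{brand[:2].upper()}-{rng.randint(1000, 9999)}",
            "share_pct": round(100.0 * weight / total, 2),
        })
    # Derive rank from share so the two columns can never contradict each other.
    rows.sort(key=lambda row: row["share_pct"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


# Placeholder brand names, chosen so nothing overlaps with real products.
world = build_fictional_ranking(["Alder", "Birch", "Cedar", "Dogwood"], 2029, seed=7)
```

A simulator grounded in a table like this can answer search queries consistently across turns, and the agent can only score by actually retrieving the rows.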
Results on WideSearch:
| Model | F1 by Item | F1 by Row |
|---|---|---|
| Qwen3.5-35B-A3B-SFT | 34.02 | 13.72 |
| + Sim RL (controlled) | 50.31 | 24.21 |
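The gap between the two columns makes sense once you look at what each one counts. The sketch below uses a generic set-based F1 and is not the benchmark's actual scoring code: the row schema, the `brand` key, and the helper names are assumptions. It shows how one wrong cell costs a single item but an entire row.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def set_f1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Harmonic mean of precision and recall over two sets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def as_rows(table: List[Dict[str, object]]) -> Set[Tuple]:
    """A row only counts if every one of its cells matches."""
    return {tuple(sorted(row.items())) for row in table}


def as_items(table: List[Dict[str, object]], key: str) -> Set[Tuple]:
    """Each non-key cell is scored on its own, addressed by the row key."""
    return {
        (row[key], field, value)
        for row in table
        for field, value in row.items()
        if field != key
    }


gold = [
    {"brand": "A", "rank": 1, "model": "x"},
    {"brand": "B", "rank": 2, "model": "y"},
]
predicted = [
    {"brand": "A", "rank": 1, "model": "x"},
    {"brand": "B", "rank": 2, "model": "z"},  # one wrong cell
]

f1_by_row = set_f1(as_rows(predicted), as_rows(gold))
f1_by_item = set_f1(as_items(predicted, "brand"), as_items(gold, "brand"))
```

Under this toy definition a single wrong cell drops row-level F1 to 0.5 while item-level F1 stays at 0.75, the same direction as the gap in the table.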
Sim RL vs. Real RL. On WideSearch, controllable Sim RL reached 50.3% F1 vs. 45.6% for Real RL trained with a live search engine. But the more informative result is behavioral: Real RL-trained agents reduced web_extractor calls from 2.5 to 1.5. Sim RL-trained agents increased them from 2.5 to 4.0 — because the simulated snippets deliberately withheld detailed content, teaching agents that full-page extraction is necessary for complete answers. The simulator shaped behavior that real environments couldn't.
Paradigm II: Agent Foundation Model (World Modeling as Agent Capability)
Here, world modeling and action selection are unified in the same model. The hypothesis: a capable general agent should be able to both decide what to do and predict what happens next. World modeling is internalized as a meta-reasoning pattern — predict before you act.
They test this by running LWM RL on Qwen3.5-35B-A3B using single-turn, tool-call-free world-model tasks only, then evaluating directly on multi-turn, tool-calling agentic benchmarks without any additional fine-tuning. For context on how Qwen models have evolved toward long-horizon autonomy, see our coverage of Qwen 3.7 Max and frontier agent capabilities.
| Benchmark | Base | + LWM RL | Δ |
|---|---|---|---|
| Terminal-Bench 2.0 | 33.3 | 39.6 | +6.3 |
| SWE-Bench Verified | 64.5 | 67.9 | +3.4 |
| SWE-Bench Pro | 42.2 | 47.4 | +5.2 |
| WideSearch F1 Item | 33.4 | 46.2 | +12.8 |
| Claw-Eval (OOD) | 53.6 | 64.9 | +11.3 |
| QwenClawBench (OOD) | 39.8 | 49.4 | +9.7 |
| BFCL v4 (OOD) | 62.3 | 71.3 | +9.0 |
The out-of-domain results are the headline. Claw-Eval, QwenClawBench, and BFCL v4 contain no data from the LWM training pipeline, yet gains of +11.3, +9.7, and +9.0 emerged on them. This suggests that predicting environment states builds transferable reasoning capabilities rather than domain-specific shortcuts.
Deployment
The open-source 35B-A3B model supports SGLang and vLLM:
# SGLang
python -m sglang.launch_server \
    --model-path Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --context-length 262144 \
    --reasoning-parser qwen3

# vLLM
vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --trust-remote-code
For Python inference:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights with automatic dtype selection and device placement.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-AgentWorld-35B-A3B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-AgentWorld-35B-A3B")
AgentWorldBench is available on Hugging Face as per-domain JSONL files. The eval pipeline runs in three steps via eval/eval.py: infer → judge → aggregate. Both world model and judge use OpenAI-compatible APIs, so any SGLang or vLLM endpoint works.
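The aggregate step is the simplest of the three to picture. The sketch below is not the repository's `eval/eval.py`: the `domain` and `score` field names and the macro-averaged overall score are assumptions about what judged JSONL records might look like.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def aggregate_scores(jsonl_lines: Iterable[str]) -> Dict[str, float]:
    """Average judged scores per domain, then macro-average across domains."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        totals[record["domain"]] += float(record["score"])
        counts[record["domain"]] += 1
    result = {domain: totals[domain] / counts[domain] for domain in totals}
    if result:
        # Assumed convention: every domain counts equally in the overall score.
        result["overall"] = sum(result.values()) / len(result)
    return result


# Hypothetical judged records, one JSON object per line.
judged = [
    '{"domain": "terminal", "score": 60}',
    '{"domain": "terminal", "score": 40}',
    '{"domain": "search", "score": 80}',
]
summary = aggregate_scores(judged)
```

Macro-averaging keeps a domain with many samples from dominating the headline number; check the repository for the weighting it actually uses.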
What This Changes (and What It Doesn't)
The paper is careful about what it's claiming. Language world models are not proposed as replacements for real-environment training. The explicit framing is "a complementary axis" — and that framing holds up. For a broader look at where agentic AI is heading through 2030, see The Agentic Era: AI Future 2026–2030.
Real-environment RL remains the ground truth for grounding agent behavior. Sim RL produces different behavior from Real RL (as the web_extractor divergence shows), not strictly better behavior across all tasks. The value is in what real environments can't provide: controllability over edge-case exposure, scalability to environments without execution infrastructure, and the ability to construct fictional worlds that isolate skills cleanly.
The more surprising result is Paradigm II: world-model training transfers to agent tasks without any agent-specific RL. That's a stronger claim and warrants scrutiny on the holdout construction. The OOD benchmarks (Claw-Eval, QwenClawBench, BFCL v4) are genuinely separate from the world-model training distribution, which is the right kind of evidence. But the mechanism isn't fully explained: is it that world-model RL improves general reasoning, or specifically that next-state prediction instills useful agent priors? The ablations don't fully separate these.
Worth reading the paper carefully on this section before drawing strong conclusions about production applicability.
The Bigger Picture
The Chinese open-source ecosystem shipped several significant infrastructure pieces in the same week: Zvec (in-process vector database), Qwen-AgentWorld (language world model), and Baidu's OCR tooling. The pattern is consistent — applied infrastructure that addresses real gaps in the agentic stack, released fully open-source.
Qwen-AgentWorld specifically addresses a gap that's been visible for a while: agentic RL training is expensive, brittle, and hard to control when you're interacting with real environments at scale. Sandboxes break. Rate limits hit. Irreversible actions happen. A faithful simulator that you can perturb precisely is genuinely useful, not just theoretically interesting.
The open-source release at 35B-A3B scale means teams can actually run this. It's not a research artifact — it's deployable on 4x A100s with the commands above.
Where to Go
- Paper: arxiv.org/abs/2606.24597
- Blog: qwen.ai/blog (search Qwen-AgentWorld)
- GitHub: github.com/QwenLM/Qwen-AgentWorld
- HuggingFace: huggingface.co/collections/Qwen/qwen-agentworld
- ModelScope: modelscope.cn/collections/Qwen/Qwen-AgentWorld
The 35B-A3B model and AgentWorldBench are both live. If you're working on agent training pipelines and want a controllable simulation layer, this is worth pulling down and testing.