Agents-A1: InternScience 35B MoE Agent Model โ Long-Horizon Search, GAIA 96, and vLLM Setup
ModelScope dropped Agents-A1 June 30 โ 35B MoE, 256K context, Apache 2.0, SOTA on Seal-0 and FrontierScience. vs Qwen 3.6 35B A3B, GPT-5.5, Kimi K2.6. vLLM/SGLang commands, benchmarks, and what X is asking.
June 30, 2026 โ 4:19 PM:ModelScope announced Agents-A1 on X โ a 35B MoE agentic model from InternScience built for long-horizon search, engineering, scientific research, instruction following, and tool calling. Weights landed on Hugging Face the same day under Apache 2.0, with a technical report claiming trillion-parameter-class agent performance without trillion-parameter weights.
The launch sits in a crowded week: LongCat-2.0 from Meituan, ongoing Qwen 3.6 local-dev hype, and Fable 5 still offline. Agents-A1's pitch is different โ not raw coding SWE scores alone, but heterogeneous agent horizons: search loops, science tools, instruction evals, and function-calling at 256K context.
TL;DR โ what people asked on X
Question
Answer
What is it?
35.11B MoE agent model, qwen3_5_moe architecture, 262K server context
License?
Apache 2.0 โ enterprise-friendly
On Hugging Face?
Yes โ InternScience/Agents-A1 safetensors
Coding?
SciCode 44.3 โ competitive in ~35B class, not frontier (GPT-5.5 56.1)
vs Qwen AgentWorld?
Different job โ AgentWorld simulates envs; Agents-A1 is the acting agent. Shared Qwen-family DNA per HF tags; launch copy doesn't cite AgentWorld
256K enough for agents?
Debatable โ beats most open models on long-bench rows, but real agent runs accumulate tool I/O fast. Treat context as necessary, not sufficient
How to run?
vLLM or SGLang โ not llama.cpp GGUF at ship
What InternScience claims
Paper title: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent (arxiv:2606.30616, dated June 29, 2026).
Two scaling axes:
Long-horizon trajectories โ a domain-grounded knowledge-action infrastructure that jointly constructs actions, observations, and verifier outcomes so multi-step agent runs become trainable targets, not one-off demos.
Heterogeneous agent abilities โ a three-stage pipeline:
Full-domain supervised fine-tuning for broad agent behaviors
Domain teacher models for specialized expertise (search vs science vs engineering)
Multi-teacher, multi-domain on-policy distillation with heterogeneity-aware optimization
InternScience also open-sourced an evaluation framework in the repo (Agents-A1/evaluation) so others can reproduce agent-capability scores under one protocol โ a move toward the standardized eval hygiene we cover in the AI benchmarks guide.
Benchmark table โ where 35B punches up
Legend from the model card: ๐ฅ overall SOTA ยท ๐ข best among ~35B comparables
Long-horizon search
Benchmark
Agents-A1
Qwen3.6-35B-A3B
Kimi-K2.6
GPT-5.5 (xhigh)
BrowseComp
๐ข 75.51
67.93
83.2
๐ฅ 84.4
XBench-DS-2510
๐ข 86.0
71.0
๐ฅ 90.0
84.0
Seal-0
๐ฅ 56.36
38.74
50.45
42.34
GAIA
๐ข 96.04
78.64
80.58
87.38
explainx.ai read:GAIA 96 and BrowseComp 75.5 are the headline numbers for teams building search + tool agents โ the same benchmark family LongCat-2.0 cites at 79.9 (different harness/protocol โ always compare apples-to-apples).
Scientific research
Benchmark
Agents-A1
Qwen3.6-35B-A3B
DeepSeek-V4-pro
FrontierScience-Olympiad
๐ฅ 79.0
60.3
76.0
FrontierScience-Research
๐ฅ 40.0
2.9
13.3
HiPhO
๐ฅ 46.4
37.7
38.7
HLE w/ tools
๐ข 47.6
36.2
48.2
Strongest story: research-agent tasks with tools โ relevant for RAG + calculator + literature search stacks, not pure chat.
Instruction following
Benchmark
Agents-A1
Qwen3.6-35B-A3B
GPT-5.5
IFBench
๐ฅ 80.61
64.4
75.9
IFEval
๐ฅ 94.82
91.3
93.35
LongBench-v2
๐ข 60.2
57.7
โ
If your product fails on multi-constraint prompts (format + content + exclusions), this row matters more than MMLU.
Engineering / coding (the skeptical row)
Benchmark
Agents-A1
Qwen3.6-35B-A3B
Kimi-K2.6
GPT-5.5
SciCode
๐ข 44.33
35.8
53.5
๐ฅ 56.1
MLE-Lite
๐ข 43.94
34.85
62.12
๐ฅ 72.73
X asked "What about coding?" โ fairly. Agents-A1 wins its weight class but Kimi K2.6 and GPT-5.5 lead on SciCode. For repo-scale coding agents, prioritize Kimi K2.7-Code, LongCat-2.0, or dense local Qwen 3.6 27B until independent Terminal-Bench / SWE-bench runs appear.
Architecture and lineage
Field
Value
Parameters
35.11B (MoE)
Format
Safetensors, BF16
Architecture tag
qwen3_5_moe
Context
256K native; servers document 262144 max
Modalities
Text + vision encoder (text-only mode skips vision to free KV cache)
X commenters suggesting "basically Qwen AgentWorld rebranded" oversimplify โ but teams evaluating both should read AgentWorld's paper and Agents-A1's distillation story as complementary, not duplicate.
How to run Agents-A1 (vLLM and SGLang)
Weights are Transformers safetensors โ use vLLM or SGLang, not llama.cpp until community GGUF quantizers ship.
Claude Code / OpenClaw โ if your stack supports custom OpenAI endpoints
For MCP-heavy loops, run the tool-call parser variant and validate with your real server set โ benchmark tool rows don't guarantee clean JSON on your schema.
What X got right (and what to verify)
Reasonable hype
IFEval / IFBench SOTA โ instruction-following is a real product surface; numbers are strong.
FrontierScience-Research 40.0 โ large jump vs 35B peers (2.9โ6.7 range in table).
Open eval code โ rare and useful; run it before procurement.
Skepticism to keep
"256K is not enough for agent" โ long-horizon agents blow context with tool payloads, retries, and state snapshots. 256K helps; state externalization and compaction still required.
"7 steps before state drift" (Gregor's reply) โ real-world agent runs fail on memory coherence, not just benchmark max scores.
Coding โ wait for Terminal-Bench 2.0 / SWE-bench reproductions on your harness before replacing Kimi/LongCat/Qwen dense locals.
Vendor tables โ when a model doesn't report a benchmark, InternScience says they evaluated under their protocol; cross-check against original vendor papers.
Where Agents-A1 fits in the 2026 open-agent ladder
Coding-first open MoE โ LongCat-2.0, Kimi K2.7-Code
Local daily driver (dense) โ Qwen 3.6 27B + llama.cpp
World-model / sim RL โ Qwen-AgentWorld
Heterogeneous long-horizon โ Agents-A1 โ this launch
Closed frontier (if allowed)โ GPT-5.5, Fable/Mythos (offline)
Benchmark figures and sampling defaults reflect the InternScience model card as of June 30, 2026. MoE serving requirements and independent coding evals may differ โ verify before production. Last updated: June 30, 2026.