Apodex 1.0-mini: 35B Open Model Tops FutureX โ Beats Sonnet 4.6 and GPT-5.5
Apodex shipped 1.0 with open 35B Apodex-1.0-mini โ #1 on FutureX at 59.17, ahead of Claude Sonnet 4.6, DeepSeek-V4-Pro, GPT-5.5. Verification-centric deep research, AgentHarness, vLLM setup.
June 29, 2026 โ 9:31 PM:Apodex posted on X that since shipping Apodex 1.0, its open 35B Apodex-1.0-mini has been outperforming models many times its size on FutureX โ three #1 finishes in four weeks, with fresh scores of 59.17 (#1) and 58.42 (#2), ahead of Claude-Sonnet-4.6 (56.32), DeepSeek-V4-Pro (53.58), and GPT-5.5 (52.51).
Apodex calls itself a Self-Evolving Heavy-Duty Solver โ built for questions whose answer isn't written down yet. That is future prediction, geopolitics, market moves, and open research problems where retrieval plus verification beats param count.
Harness + verification-centric training โ not raw logits alone
FIFA 2026 winner?
Apodex replied "checking our notes ๐" โ the benchmark vibe in one reply
What Apodex 1.0 actually ships
Apodex 1.0 is a verification-centric agent stack, not a single checkpoint drop:
Component
Role
Apodex-1.0-mini
Open 35B-A3B weights โ standard ReAct tool agent
Apodex-1.0-H
Heavy-duty mode โ async agent team, shared evidence pool, global verifier
Apodex-1.0-0.8B / 2B / 4B SFT
Smaller open models trained on deep-research SFT data alone
AgentHarness
Open eval repo โ reproduce BrowseComp, DeepSearchQA, HLE, etc.
AgentOS
Task-agnostic runtime for building and evaluating agent workflows
apodex.ai API
Product surface โ Deep Research powered by mini
Paper title: Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence (tech blog, Hugging Face model card).
Verification-first design
Most deep-research agents conflate finding and trusting. Apodex splits the loop:
Orchestrator plans search and dispatches sub-agents
Retrieval agents gather evidence in parallel
Verifier audits the evidence graph before the answer ships
Report pool logs findings, verdicts, and interventions โ auditable, retractable, forkable
In heavy-duty deployment, Apodex claims coordination of up to 150 sub-agents over 15,000 steps on a single task โ test-time scaling as architecture, not just more tokens.
FutureX evaluates future prediction โ forecasting outcomes before ground truth exists. That is deep research under uncertainty: track sources, detect regime change, synthesize conflicting signals, and commit to a calibrated answer.
Apodex submitted four experimental prediction harnesses built only on Apodex-1.0-mini. On the June 29 board they occupied #1 and #2 (and previously held top four per LinkedIn launch copy).
Rank (Jun 29)
Score
Model / harness
#1
59.17
Apodex harness (Apodex-1.0-mini)
#2
58.42
Apodex harness (Apodex-1.0-mini)
โ
56.32
Claude-Sonnet-4.6
โ
53.58
DeepSeek-V4-Pro
โ
52.51
GPT-5.5
explainx.ai read: FutureX rewards evidence discipline, not memorization โ the same axis where BrowseComp and DeepSearchQA live. A 35B open model leading here is a procurement signal for teams blocked from Fable/Mythos who still need frontier-grade research loops.
Apodex-1.0-H (heavy agent team, not weights alone) pushes further โ 90.3 BrowseComp, 94.4 DeepSearchQA, 60.8 HLE-text, 46.7 FrontierScience-Research, per the launch blog โ edging GPT-5.5 on BrowseComp in Apodex's table and beating Claude Opus 4.8 and Kimi K2.6 on DeepSearchQA.
Data > params at small scale: Apodex-1.0-4B-SFT beats every open 30B-class model on BrowseComp and BrowseComp-ZH in their comparison โ a direct challenge to "just scale parameters" thinking in the China free-models playbook.
Evaluations block benchmark-hosting websites during runs to reduce answer leakage from public repos โ standard hygiene, worth noting when comparing vendor numbers.
Apodex claims additive gains โ general knowledge (MMLU-Pro/Redux, C-Eval), math (AIME, HMMT), instruction-following (IFEval, IFBench), and long-context (LongBench v2) stay within ~1 point of matched Qwen3.5 bases. Coding is preserved too โ Apodex-1.0-H reports 79.0 SWE-bench Verified and 58.4 Terminal-Bench v2 in the launch blog.
For local MoE coding trade-offs on the same Qwen family, see Qwen 3.6 27B dense vs 35B A3B โ different fine-tune, same hardware class.
Apodex vs Agents-A1 โ two 35B agent launches, same week
Apodex uses the Qwen3.5 chat template โ tool calls as <function=...> blocks; pass schemas via OpenAI tools= parameter, not inlined in system prompt. Full Python agent loop is on the Hugging Face card.
Reproduce public benchmarks
git clone https://github.com/ApodexAI/AgentHarness
# Download benchmark pack from Hugging Face datasets/apodex/Deep-Research-Benchmarks# See AgentHarness README for ReAct eval protocol
Supported suites include BrowseComp, BrowseComp-ZH, DeepSearchQA, HLE (text), FrontierScience, SuperChem, WideSearch โ see AI benchmarks guide for how to read each.
X confirmed: Apodex-1.0-mini powers Deep Research mode on apodex.ai.
Use cases Apodex targets:
Open-domain research where citations must survive audit
Future prediction (FutureX-class questions)
Scientific and professional QA with tool integration
Mission-critical tasks where unsupported conclusions are unacceptable
That is a different buyer than "fast codegen" โ closer to enterprise Fable alternatives research tiering: Apodex / DeepSeek / GLM API for volume + verify, closed frontier for edge cases if accessible.
Skepticism to keep
Concern
Notes
Harness vs model
FutureX scores reflect prediction harness + mini, not bare chat completion
Heavy vs mini gap
90.3 BrowseComp is Apodex-1.0-H team mode โ not downloadable as one file
Benchmark leakage
Apodex blocks hosts; other labs may not โ compare under AgentHarness when possible
Sonnet 4.6 naming
Verify exact API snapshot and harness parity on futurex.live
Self-evolving claim
Marketing term โ read tech report for what actually updates (data, verifiers, policies)
Bottom line
Apodex-1.0-mini is the strongest public signal yet that open 35B + verification architecture can beat closed Sonnet-, DeepSeek-, and GPT-class entries on future-facing research evals โ not just static knowledge tests.
If you build research agents in 2026, run three checks:
FutureX or BrowseComp on your tools with AgentHarness
Deep Research mode on apodex.ai for qualitative audit trails
FutureX scores and benchmark tables reflect Apodex's June 29, 2026 public posts and Hugging Face model card โ leaderboard positions change daily. Verify live at futurex.live before citing ranks. Last updated: June 29, 2026.