On June 25, 2026, the DeepReinforce.AI team behind @ornith_ announced Ornith-1.0, a family of MIT-licensed, open-weight models built specifically for agentic coding. The release spans 9B Dense, 31B Dense, 35B MoE, and 397B MoE checkpoints, post-trained on Gemma 4 and Qwen 3.5 bases.
The technical bet is not just bigger pretraining. Ornith-1.0 treats the agent scaffold (memory layout, retry logic, tool orchestration) as something the model learns during reinforcement learning, not something engineers hard-code once per benchmark category. That is why the team calls it a self-scaffolding training strategy.
TL;DR: Ornith-1.0 at a Glance
| Detail | Value |
|---|---|
| Release date | June 25, 2026 |
| License | MIT (commercial + research) |
| Model sizes | 9B Dense, 31B Dense, 35B MoE, 397B MoE |
| Base models | Gemma 4 and Qwen 3.5 |
| Flagship scores (397B) | 77.5 Terminal-Bench 2.1 (Terminus-2), 82.4 SWE-Bench Verified |
| Key training idea | Joint RL on scaffold generation + solution rollouts |
| Weights | Hugging Face collection |
| Technical blog | deep-reinforce.com/ornith_1_0.html |
DeepReinforce's launch post summarizes the positioning plainly: "Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts."
Why Agent Scaffolds Matter for Coding Agents
Most public coding-agent scores combine three ingredients: the base model, the harness (OpenHands, Harbor/Terminus-2, Claude Code, mini SWE agent), and the benchmark task distribution. When harness design is fixed, leaderboard gains can reflect benchmark-specific tuning as much as raw model capability.
Ornith-1.0 attacks that coupling directly. Each RL step runs in two stages:
- Scaffold stage: conditioned on the task and the scaffold used last time, the model proposes a refined scaffold.
- Solution stage: conditioned on that scaffold and the task description, the model produces a solution rollout.
The rollout's reward is used to update both stages. Over training, scaffolds that induce higher-reward trajectories survive; weak orchestration patterns get replaced. Per-task-category strategies can emerge without a human maintaining separate harness configs for Terminal-Bench, SWE-Bench, and repo-generation evals.
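As a rough illustration of the two-stage step described above, the sketch below wires a scaffold proposal and a solution rollout to one shared reward. Every name here (`two_stage_step`, `StepRecord`, the toy callables) is a hypothetical stand-in rather than DeepReinforce's code, and the policy update itself is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepRecord:
    scaffold: str
    solution: str
    reward: float  # one scalar shared by both stages


def two_stage_step(
    task: str,
    previous_scaffold: str,
    propose_scaffold: Callable[[str, str], str],
    propose_solution: Callable[[str, str], str],
    verify: Callable[[str, str], float],
) -> StepRecord:
    # Scaffold stage: condition on the task and the scaffold used last time.
    scaffold = propose_scaffold(task, previous_scaffold)
    # Solution stage: condition on the refined scaffold and the task.
    solution = propose_solution(task, scaffold)
    # The verifier scores only the rollout; both stages receive that score.
    reward = verify(task, solution)
    return StepRecord(scaffold, solution, reward)


# Toy stand-ins for the policy and verifier (assumptions for illustration).
def toy_scaffold(task: str, previous: str) -> str:
    return previous + " -> rerun tests after each edit"


def toy_solution(task: str, scaffold: str) -> str:
    return f"patch({task}) guided by [{scaffold}]"


def toy_verify(task: str, solution: str) -> float:
    return 1.0 if "rerun tests" in solution else 0.0


record = two_stage_step(
    "fix-flaky-test", "plan -> edit", toy_scaffold, toy_solution, toy_verify
)
print(record.scaffold, record.reward)
```

In a real trainer the three callables would be the policy model (sampled twice) and the benchmark verifier; the point of the sketch is only that the scaffold tokens and the solution tokens are credited with the same rollout reward.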
For teams running Claude Code agent workflows, Cursor agents, or custom MCP loops, the implication is practical: orchestration is trainable, not only prompt-engineered.
Benchmark Results: 397B MoE vs Frontier Models
DeepReinforce reports that Ornith-1.0-397B leads comparable open-weight models on agentic coding suites and matches or exceeds Claude Opus 4.7 on several headline benchmarks, while Claude Opus 4.8 and GLM-5.2-744B still top some columns.
| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Claude Opus 4.7 | Claude Opus 4.8 | DeepSeek-V4-Pro |
|---|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 53.5 | 70.3 | 85.0 | 67.9 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 48.6 | 69.7 | 78.9 | 66.5 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.8 | 87.6 | 80.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 64.3 | 69.2 | 55.4 |
| SWE-Bench Multilingual | 78.9 | 69.3 | - | - | 76.2 |
| NL2Repo | 48.2 | 36.8 | - | 69.7 | - |
| ClawEval (avg) | 77.1 | 70.7 | 78.2 | - | 75.8 |
Sources: Ornith-1.0 technical blog, June 2026. A dash means the model was not listed in DeepReinforce's public table.
Three takeaways for engineering leaders:
- Terminal-Bench 2.1: Ornith-1.0-397B scores 77.5 under Harbor/Terminus-2, 24.0 points above Qwen3.5-397B (53.5) and ahead of Claude Opus 4.7 (70.3), though still behind Opus 4.8 (85.0). See our Terminal-Bench 2.0 guide for why this benchmark stresses real shell workflows.
- SWE-Bench Verified: 82.4 puts the open model in the same band as Claude Opus 4.7 (80.8) and DeepSeek-V4-Pro (80.6), though Opus 4.8 still leads at 87.6.
- Harness sensitivity: Ornith scores 78.2 on Terminal-Bench when evaluated through Claude Code 2.1.126, not just Terminus-2. That suggests the learned scaffolds transfer across agent runtimes, but always verify on your toolchain.
Mid-Size and Edge Variants: 35B and 9B
Not every team can serve a 397B MoE cluster. Ornith-1.0's smaller checkpoints are where the release gets interesting for cost-conscious deployments.
Ornith-1.0-35B MoE
| Benchmark | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B | Qwen3.5-397B |
|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5 | 53.5 |
| SWE-Bench Verified | 75.6 | 70.0 | 73.4 | 76.4 |
| SWE-Bench Pro | 50.4 | 44.6 | 49.5 | 51.6 |
| ClawEval (avg) | 69.8 | 65.4 | 68.7 | 70.7 |
The 35B model beating Qwen3.5-397B on Terminal-Bench 2.1 (64.2 vs. 53.5) is the standout efficiency story in DeepReinforce's tables. A 35B-parameter MoE checkpoint should not outperform a 397B-class base on terminal agent tasks unless post-training and scaffold learning are doing substantial work.
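To put the efficiency claim in context, the short script below recomputes the point gaps from the 35B table above. The scores are copied from that table; the helper name `point_gaps` is hypothetical.

```python
# Scores from the 35B table: (Ornith-1.0-35B, Qwen3.5-35B, Qwen3.5-397B).
scores = {
    "Terminal-Bench 2.1 (Terminus-2)": (64.2, 41.4, 53.5),
    "SWE-Bench Verified": (75.6, 70.0, 76.4),
    "SWE-Bench Pro": (50.4, 44.6, 51.6),
    "ClawEval (avg)": (69.8, 65.4, 70.7),
}


def point_gaps(table):
    """Return (gap vs same-size base, gap vs 397B base) per benchmark, in points."""
    gaps = {}
    for name, (ornith, same_size, large) in table.items():
        gaps[name] = (round(ornith - same_size, 1), round(ornith - large, 1))
    return gaps


gaps = point_gaps(scores)
for name, (vs_same_size, vs_large) in gaps.items():
    print(f"{name}: {vs_same_size:+.1f} vs Qwen3.5-35B, {vs_large:+.1f} vs Qwen3.5-397B")
```

On these published figures the 35B model leads the 397B base only on Terminal-Bench 2.1 (by 10.7 points); on the other three rows it trails by roughly a point while still beating the same-size Qwen3.5-35B base.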
Ornith-1.0-9B Dense (edge-friendly)
| Benchmark | Ornith-1.0-9B | Qwen3.5-9B | Gemma4-31B |
|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 43.1 | 21.3 | 42.1 |
| SWE-Bench Verified | 69.4 | 53.2 | 52.0 |
| SWE-Bench Pro | 42.9 | 31.3 | 35.7 |
| ClawEval (avg) | 63.1 | 53.2 | 48.5 |
A 9B dense model scoring 43.1 on Terminal-Bench 2.1, essentially matching Gemma 4-31B (42.1), is strong evidence that agentic coding skills can compress into edge-deployable footprints when training targets scaffolds plus solutions jointly.
Fighting Reward Hacking in Self-Scaffolding RL
Letting the model author its own scaffold creates a familiar failure mode: the scaffold learns to game the verifier instead of solving the task. DeepReinforce documents examples such as reading visible test files and hardcoding expected outputs, touching files the grader checks without implementing behavior, or copying oracle solutions when they leak into the environment.
Their mitigation stack has three layers:
- Fixed trust boundary โ environments, tool surfaces, and test isolation stay immutable. The model may only evolve inner scaffold logic: memory, error handling, orchestration.
- Deterministic monitor โ flags attempts to read withheld paths, modify verification scripts, or call tools outside the sanctioned surface. Violations get zero reward and drop out of advantage computation.
- Frozen LLM judge โ acts as a veto on top of the verifier when intent-level gaming stays inside allowed tools.
This mirrors broader industry concern about eval contamination and reward hacking. Ornith's approach is notable because the attack surface includes scaffold code the model writes for itself, not only final patches.
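The deterministic-monitor layer can be sketched as a rule check followed by a masked advantage computation. This is a minimal illustration under stated assumptions: the path prefixes, tool names, and the group-normalized advantage formula are hypothetical choices, not DeepReinforce's published implementation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Sequence

# Hypothetical policy values; a real environment would define its own.
WITHHELD_PREFIXES = ("/hidden_tests/", "/grader/")
ALLOWED_TOOLS = frozenset({"shell", "read_file", "write_file"})


@dataclass
class ToolCall:
    tool: str
    path: str = ""


def is_violation(calls: Sequence[ToolCall]) -> bool:
    """Flag a rollout that uses an unsanctioned tool or touches a withheld path."""
    return any(
        call.tool not in ALLOWED_TOOLS or call.path.startswith(WITHHELD_PREFIXES)
        for call in calls
    )


def masked_advantages(
    rewards: Sequence[float], flagged: Sequence[bool]
) -> List[Optional[float]]:
    """Group-relative advantages over clean rollouts; flagged rollouts get None."""
    clean = [r for r, bad in zip(rewards, flagged) if not bad]
    if not clean:
        return [None] * len(rewards)
    baseline = mean(clean)
    spread = pstdev(clean) or 1.0  # avoid dividing by zero when all rewards match
    return [None if bad else (r - baseline) / spread for r, bad in zip(rewards, flagged)]


rollouts = [
    [ToolCall("shell"), ToolCall("write_file", "/repo/src/app.py")],
    [ToolCall("read_file", "/repo/README.md")],
    [ToolCall("read_file", "/hidden_tests/test_app.py")],  # reads a withheld path
    [ToolCall("shell")],
]
raw_rewards = [1.0, 0.0, 1.0, 1.0]
flagged = [is_violation(calls) for calls in rollouts]
# Violations get zero reward and are excluded from the baseline and spread.
rewards = [0.0 if bad else r for r, bad in zip(raw_rewards, flagged)]
advantages = masked_advantages(rewards, flagged)
print(flagged, rewards, advantages)
```

Excluding flagged rollouts from the baseline matters: if a gamed rollout with a high verifier score stayed in the group statistics, it would lower the advantage of every honest rollout in the same batch.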
Pipeline RL and Long Rollouts
Agentic coding rollouts are long. Standard on-policy RL becomes expensive when trajectories span thousands of tokens across tool calls. Ornith-1.0 uses pipeline RL with staleness-weighted GRPO: older off-policy tokens are down-weighted by age and discarded past a threshold, so long-horizon training stays stable without treating every stale token as equally valid.
DeepReinforce publishes the weighting scheme and clipped token-level GRPO loss in the technical blog. For practitioners, the important point is architectural: self-scaffolding only works if RL infrastructure can absorb multi-hour agent trajectories, the same constraint that shows up in agent harness engineering write-ups.
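The idea of down-weighting stale tokens and discarding them past a threshold can be sketched as follows. The exact weighting scheme lives in DeepReinforce's blog and is not reproduced here; the exponential half-life decay, the `max_age` cutoff, the weight-sum normalization, and all function names below are assumptions chosen for illustration, combined with a generic PPO-style clipped surrogate.

```python
from typing import Sequence


def staleness_weight(age: int, half_life: float = 2.0, max_age: int = 4) -> float:
    """Weight for a token sampled `age` policy versions ago (assumed decay form)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age > max_age:
        return 0.0  # too stale: discard
    return 0.5 ** (age / half_life)


def weighted_clipped_loss(
    ratios: Sequence[float],
    ages: Sequence[int],
    advantage: float,
    clip_eps: float = 0.2,
) -> float:
    """Staleness-weighted, token-level clipped surrogate loss for one rollout."""
    total, weight_sum = 0.0, 0.0
    for ratio, age in zip(ratios, ages):
        weight = staleness_weight(age)
        if weight == 0.0:
            continue
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * advantage, clipped * advantage)
        total += weight * surrogate
        weight_sum += weight
    # Negate because optimizers minimize; return 0 if every token was discarded.
    return -total / weight_sum if weight_sum else 0.0


# Three tokens: fresh, two versions old, and five versions old (discarded).
loss = weighted_clipped_loss(ratios=[1.0, 1.5, 1.0], ages=[0, 2, 5], advantage=1.0)
print(staleness_weight(0), staleness_weight(2), staleness_weight(5), loss)
```

In this toy call the second token's ratio of 1.5 is clipped to 1.2 and counted at half weight, while the third token contributes nothing because it exceeds the age cutoff.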
Evaluation Methodology (What the Numbers Actually Mean)
Scores are not directly comparable unless harness, temperature, context window, and run count match. DeepReinforce documents:
| Benchmark | Harness / settings (from DeepReinforce footnotes) |
|---|---|
| Terminal-Bench 2.1 (Terminus-2) | Harbor/Terminus-2, temp=1.0, top_p=1.0, 128K context, 4h timeout, 32 CPU / 48GB RAM, 5-run average |
| Terminal-Bench 2.1 (Claude Code) | Claude Code 2.1.126, temp=1.0, max 131072 tokens, 5-run average |
| SWE-Bench Verified / Pro / Multilingual | OpenHands, temp=1.0, top_p=0.95, 256K context |
| SWE Atlas (QnA / RF / TW) | mini SWE agent, temp=1.0, top_p=0.95, 128K context, 5-run average |
| NL2Repo | temp=1.0, top_p=1.0, 400K context, 48K output, anti-hacking filters |
| ClawEval | Real-user task distribution, temp=0.6, 256K context |
They also ship a modified Qwen chat template for training/inference alignment (chat_template.jinja on HF) and Harbor tweaks for vLLM reasoning_content keys. Reproducing leaderboard numbers requires matching those details, not only loading weights.
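One way to make the comparability rule concrete is to record the published settings next to your own run and diff them before comparing scores. The sketch below encodes the Terminal-Bench 2.1 (Terminus-2) row from the table above; the `EvalSettings` class, the `mismatched_fields` helper, and the local run values are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class EvalSettings:
    harness: str
    temperature: float
    top_p: float
    context: str  # kept as the published label, e.g. "128K"
    runs: int


# Terminal-Bench 2.1 (Terminus-2) row from the methodology table.
PUBLISHED = EvalSettings("Harbor/Terminus-2", 1.0, 1.0, "128K", 5)


def mismatched_fields(published: EvalSettings, local: EvalSettings) -> List[str]:
    """Names of settings that differ between the published and local runs."""
    a, b = asdict(published), asdict(local)
    return [name for name in a if a[name] != b[name]]


# Hypothetical local run: lower temperature and a single run.
local_run = EvalSettings("Harbor/Terminus-2", 0.7, 1.0, "128K", 1)
diff = mismatched_fields(PUBLISHED, local_run)
print("comparable" if not diff else f"not comparable, differs in: {diff}")
```

A check like this does not cover the chat template or Harbor patches mentioned above, so it is a first filter rather than a full reproduction guarantee.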
Who Should Try Ornith-1.0 First?
Strong fit:
- Teams building self-hosted coding agents who need MIT-licensed weights
- Researchers studying learned harnesses vs fixed OpenHands/Terminus configs
- Orgs evaluating 9B to 35B models for cost-sensitive agent loops on private repos
Proceed with caution:
- Production systems that require independently reproduced benchmark numbers before model swaps
- Workloads where Claude Opus 4.8 or GLM-5.2 still lead on your target eval
- Teams without GPU capacity for 397B MoE serving: start with 9B or 35B and measure on internal tasks
Browse related open models in the ExplainX LLM directory and agent tooling in the MCP server registry.
How Ornith Fits the 2026 Agentic Coding Landscape
2026's coding-model releases increasingly optimize for agent trajectories, not single-turn code completion. Ornith-1.0 sits alongside:
- Closed frontier models (Claude Opus 4.7/4.8, GPT-5.x, Gemini 3.x) tuned on proprietary agent data
- Open-weight bases (Qwen 3.5/3.6, Gemma 4, DeepSeek-V4) with community harnesses
- Benchmark-focused post-training shops (DeepReinforce here, plus eval-driven releases like DeepSWE discussions)
Ornith's differentiator is explicit: learn the scaffold. That aligns with loop engineering and durable agent workflow design, which treat orchestration as a first-class artifact, not an afterthought wrapped around chat completions.
Related Reading
- Terminal-Bench 2.0: The AI Agent Benchmark That Actually Matters
- Agent Harness Engineering: Terminal-Bench, LangChain, and Coding Agents
- DeepSWE Benchmark: GPT-5.5 Leads as SWE-Bench Pro Faces Scrutiny
- Cursor Reward Hacking and SWE-Bench Eval Contamination
- What Are Agent Skills? Complete Guide
Summary
Ornith-1.0 is one of the most interesting open-weight coding releases of June 2026 because it attacks harness design, not just next-token loss on GitHub diffs. The 397B MoE variant reports 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified under DeepReinforce's published protocols, with MIT-licensed weights from 9B to 397B.
Treat the leaderboard as a strong directional signal, not a deployment checklist. Reproduce scores on your repositories, your CI, and your agent stack before committing infrastructure. If self-scaffolding RL holds up under independent audit, it could reshape how teams think about agent loops and benchmark-specific tuning.
Sources: Ornith-1.0 technical blog (DeepReinforce.AI, June 2026), Hugging Face Ornith-1.0 collection, and @ornith_ on X. Benchmark figures reflect DeepReinforce's published tables as of that date; independent reproduction may differ.