Ornith-1.0 is an open-source family of large language models from DeepReinforce.AI, post-trained for agentic coding. Variants span 9B Dense, 31B Dense, 35B MoE, and 397B MoE, built on Gemma 4 and Qwen 3.5 bases and released under the MIT license.

What makes Ornith-1.0 different from other coding models?

Ornith-1.0 uses a self-improving RL framework where the model generates both a task-specific scaffold (the agent harness) and the solution rollout. Reward from the rollout trains both stages, so the model learns orchestration patterns instead of relying on a fixed human-written harness.
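
The two-stage loop described above can be sketched in a few lines. This is a minimal illustration, not DeepReinforce's training code: the `Episode` type, the function names, and the toy stand-ins are all hypothetical, and the choice to credit both stages with the same unweighted scalar is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Episode:
    task: str
    scaffold: str   # stage 1: harness text the policy wrote for this task
    solution: str   # stage 2: rollout produced while running inside that harness
    reward: float   # one scalar from the task verifier


def run_episode(
    task: str,
    write_scaffold: Callable[[str], str],
    solve: Callable[[str, str], str],
    verify: Callable[[str, str], float],
) -> Episode:
    """Generate a scaffold, roll out a solution inside it, then score the result."""
    scaffold = write_scaffold(task)
    solution = solve(task, scaffold)
    return Episode(task, scaffold, solution, verify(task, solution))


def training_signals(ep: Episode) -> List[Tuple[str, str, float]]:
    # Assumption: the same scalar is credited to both stages, unweighted.
    return [
        ("scaffold", ep.scaffold, ep.reward),
        ("solution", ep.solution, ep.reward),
    ]


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_scaffold(task: str) -> str:
    return f"plan: run tests, retry up to 2 times for '{task}'"


def toy_solve(task: str, scaffold: str) -> str:
    return "patch applied" if "retry" in scaffold else "gave up"


def toy_verify(task: str, solution: str) -> float:
    return 1.0 if solution == "patch applied" else 0.0


episode = run_episode("fix failing unit test", toy_scaffold, toy_solve, toy_verify)
signals = training_signals(episode)
```

The point of the sketch is the shape of `training_signals`: a scaffold that leads to a failed rollout is penalized along with the rollout, which is what lets orchestration patterns be learned rather than hand-written.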

How does Ornith-1.0-397B compare to Claude Opus 4.7 on coding benchmarks?

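On DeepReinforce's published tables, Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 (Terminus-2) and 82.4 on SWE-Bench Verified, versus 70.3 and 80.8 for Claude Opus 4.7 on the same benchmarks. Claude Opus 4.8 still leads on every benchmark where the tables report a score for it, including 85.0 on Terminal-Bench 2.1 (Terminus-2) and 87.6 on SWE-Bench Verified, and Opus 4.7 stays ahead of Ornith on SWE-Bench Pro (64.3 vs 62.2).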

Can Ornith-1.0 be used commercially?

Yes. DeepReinforce releases all Ornith-1.0 weights under the MIT license, which permits commercial and research use without copyleft requirements.

Where can I download Ornith-1.0 weights?

Weights and model cards are on Hugging Face in the deepreinforce-ai Ornith-1.0 collection. The technical blog at deep-reinforce.com documents evaluation harnesses, chat templates, and training details.

How does Ornith prevent reward hacking when the model writes its own scaffold?

DeepReinforce uses three layers: an immutable outer environment and tool boundary, a deterministic monitor that zero-rewards forbidden actions such as reading withheld test paths, and a frozen LLM judge that can veto trajectories that pass verifiers without doing real work.
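
As a rough sketch of how the second and third layers could gate a reward, consider the snippet below. Everything in it is an assumption for illustration: the `Trajectory` fields, the withheld path prefix, the toy judge rule, and the decision to map a judge veto to a reward of 0.0. The first layer (the immutable environment and tool boundary) sits outside the policy's control and is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical location of withheld tests; the real paths are not published here.
WITHHELD_PREFIXES = ("/workspace/hidden_tests/",)


@dataclass
class Trajectory:
    paths_read: List[str]    # every file path the agent opened
    verifier_passed: bool    # outcome of the task's own verifier
    edits_made: int          # number of file edits in the rollout


def monitor_allows(traj: Trajectory) -> bool:
    """Deterministic check: reading any withheld path is a forbidden action."""
    return not any(p.startswith(WITHHELD_PREFIXES) for p in traj.paths_read)


def gated_reward(traj: Trajectory, judge_vetoes: Callable[[Trajectory], bool]) -> float:
    if not monitor_allows(traj):
        return 0.0           # forbidden action: zero reward, no judge call needed
    if not traj.verifier_passed:
        return 0.0           # ordinary failure
    if judge_vetoes(traj):
        return 0.0           # assumption: a veto zeroes the reward
    return 1.0


def toy_judge_vetoes(traj: Trajectory) -> bool:
    # Stand-in for the frozen LLM judge: a pass with no edits looks like no real work.
    return traj.edits_made == 0


honest = Trajectory(["/workspace/src/app.py"], True, 3)
peeker = Trajectory(["/workspace/hidden_tests/test_app.py"], True, 3)
no_work = Trajectory([], True, 0)
```

Running the cheap deterministic monitor first means the judge is only consulted for trajectories that already passed the verifier, which is where a veto actually matters.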

Ornith-1.0: Self-Scaffolding Open Models for Agentic Coding | explainx.ai Blog

On June 25, 2026, the DeepReinforce.AI team behind @ornith_ announced Ornith-1.0 — a family of MIT-licensed, open-weight models built specifically for agentic coding. The release spans 9B Dense, 31B Dense, 35B MoE, and 397B MoE checkpoints, post-trained on Gemma 4 and Qwen 3.5 bases.

The technical bet is not just bigger pretraining. Ornith-1.0 treats the agent scaffold — memory layout, retry logic, tool orchestration — as something the model learns during reinforcement learning, not something engineers hard-code once per benchmark category. That is why the team calls it a self-scaffolding training strategy.

TL;DR: Ornith-1.0 at a Glance

Detail	Value
Release date	June 25, 2026
License	MIT (commercial + research)
Model sizes	9B Dense, 31B Dense, 35B MoE, 397B MoE
Base models	Gemma 4 and Qwen 3.5
Flagship scores (397B)	77.5 Terminal-Bench 2.1 (Terminus-2), 82.4 SWE-Bench Verified

Benchmark	Ornith-1.0-397B	Qwen3.5-397B	Claude Opus 4.7	Claude Opus 4.8	DeepSeek-V4-Pro
Terminal-Bench 2.1 (Terminus-2)	77.5	53.5	70.3	85.0	67.9
Terminal-Bench 2.1 (Claude Code)	78.2	48.6	69.7	78.9	66.5
SWE-Bench Verified	82.4	76.4	80.8	87.6	80.6
SWE-Bench Pro	62.2	51.6	64.3	69.2	55.4
SWE-Bench Multilingual	78.9	69.3	—	—	76.2
NL2Repo	48.2	36.8	—	69.7	—
ClawEval (avg)	77.1	70.7	78.2	—	75.8
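
To make the head-to-head with Claude Opus 4.7 easier to read, the margins can be computed directly from the table. The snippet below is a small worked example using only the numbers above; the variable names are illustrative, and `None` stands in for cells the table leaves blank.

```python
from typing import Dict, Optional, Tuple

# (Ornith-1.0-397B, Claude Opus 4.7) scores copied from the table above.
# None marks a cell the table leaves blank.
HEAD_TO_HEAD: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "Terminal-Bench 2.1 (Terminus-2)": (77.5, 70.3),
    "Terminal-Bench 2.1 (Claude Code)": (78.2, 69.7),
    "SWE-Bench Verified": (82.4, 80.8),
    "SWE-Bench Pro": (62.2, 64.3),
    "SWE-Bench Multilingual": (78.9, None),
    "NL2Repo": (48.2, None),
    "ClawEval (avg)": (77.1, 78.2),
}


def margin(ornith: Optional[float], other: Optional[float]) -> Optional[float]:
    """Ornith minus the other model, or None when either score is missing."""
    if ornith is None or other is None:
        return None
    return round(ornith - other, 1)


margins = {name: margin(*pair) for name, pair in HEAD_TO_HEAD.items()}
ornith_leads = [name for name, m in margins.items() if m is not None and m > 0]
ornith_trails = [name for name, m in margins.items() if m is not None and m < 0]

for name, m in margins.items():
    print(f"{name}: {'n/a' if m is None else f'{m:+.1f}'}")
```

On these published numbers Ornith-1.0-397B is ahead on both Terminal-Bench rows and SWE-Bench Verified, and behind on SWE-Bench Pro and ClawEval; two rows cannot be compared because Opus 4.7 has no reported score.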

Benchmark	Ornith-1.0-35B	Qwen3.5-35B	Qwen3.6-35B	Qwen3.5-397B
Terminal-Bench 2.1 (Terminus-2)	64.2	41.4	52.5	53.5
SWE-Bench Verified	75.6	70.0	73.4	76.4
SWE-Bench Pro	50.4	44.6	49.5	51.6
ClawEval (avg)	69.8	65.4	68.7	70.7

Benchmark	Ornith-1.0-9B	Qwen3.5-9B	Gemma4-31B
Terminal-Bench 2.1 (Terminus-2)	43.1	21.3	42.1
SWE-Bench Verified	69.4	53.2	52.0
SWE-Bench Pro	42.9	31.3	35.7
ClawEval (avg)	63.1	53.2	48.5
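
One way to read the three tables together is to look at the gap between each Ornith checkpoint and the same-size Qwen3.5 column. The sketch below does that arithmetic for two benchmarks that appear in all three tables. It only restates the published numbers; treating the same-size Qwen3.5 column as the comparison point is a choice made here for illustration, not a claim about which base each checkpoint was trained from.

```python
from typing import Dict, Tuple

# (Ornith-1.0 score, same-size Qwen3.5 score) copied from the three tables above.
PAIRS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Terminal-Bench 2.1 (Terminus-2)": {
        "397B": (77.5, 53.5),
        "35B": (64.2, 41.4),
        "9B": (43.1, 21.3),
    },
    "SWE-Bench Verified": {
        "397B": (82.4, 76.4),
        "35B": (75.6, 70.0),
        "9B": (69.4, 53.2),
    },
}

# Point gap per benchmark and size, rounded to one decimal like the tables.
uplift = {
    bench: {size: round(ornith - qwen, 1) for size, (ornith, qwen) in sizes.items()}
    for bench, sizes in PAIRS.items()
}

for bench, by_size in uplift.items():
    print(bench, by_size)
```

The Terminal-Bench gap is above 20 points at every size, while the SWE-Bench Verified gap is largest for the 9B checkpoint, where the comparison column starts from a lower score.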

Benchmark	Harness / settings (from DeepReinforce footnotes)
Terminal-Bench 2.1 (Terminus-2)	Harbor/Terminus-2, temp=1.0, top_p=1.0, 128K context, 4h timeout, 32 CPU / 48GB RAM, 5-run average
Terminal-Bench 2.1 (Claude Code)	Claude Code 2.1.126, temp=1.0, max 131072 tokens, 5-run average
SWE-Bench Verified / Pro / Multilingual	OpenHands, temp=1.0, top_p=0.95, 256K context
SWE Atlas (QnA / RF / TW)	mini SWE agent, temp=1.0, top_p=0.95, 128K context, 5-run average
NL2Repo	temp=1.0, top_p=1.0, 400K context, 48K output, anti-hacking filters
ClawEval	Real-user task distribution, temp=0.6, 256K context
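
For anyone trying to reproduce these numbers, it helps to keep the footnoted settings in one place and to check that a reported score really is the stated multi-run average. The snippet below is a minimal sketch: the dictionary layout and the `averaged_score` helper are hypothetical, `runs` is `None` wherever the footnotes do not state a run count, and the example scores in the usage line are made-up placeholders, not results.

```python
from statistics import mean
from typing import Dict, List, Optional, TypedDict


class Settings(TypedDict):
    temperature: float
    runs: Optional[int]   # None = run count not stated in the footnotes


SETTINGS: Dict[str, Settings] = {
    "Terminal-Bench 2.1 (Terminus-2)": {"temperature": 1.0, "runs": 5},
    "Terminal-Bench 2.1 (Claude Code)": {"temperature": 1.0, "runs": 5},
    "SWE-Bench Verified / Pro / Multilingual": {"temperature": 1.0, "runs": None},
    "SWE Atlas (QnA / RF / TW)": {"temperature": 1.0, "runs": 5},
    "NL2Repo": {"temperature": 1.0, "runs": None},
    "ClawEval": {"temperature": 0.6, "runs": None},
}


def averaged_score(benchmark: str, run_scores: List[float]) -> float:
    """Average per-run scores, enforcing the stated run count when there is one."""
    expected = SETTINGS[benchmark]["runs"]
    if expected is not None and len(run_scores) != expected:
        raise ValueError(f"{benchmark}: expected {expected} runs, got {len(run_scores)}")
    return round(mean(run_scores), 1)


# Placeholder per-run scores, invented purely to exercise the helper.
example = averaged_score("Terminal-Bench 2.1 (Terminus-2)", [70.0, 72.0, 74.0, 76.0, 78.0])
```

With temperature at 1.0 on most rows, single-run scores can vary noticeably, so comparing a one-off local run against a 5-run average is not a like-for-like check.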

Related posts

Tencent Hy3: 295B Open-Source MoE Model for Agentic Coding — Apache 2.0, Free API, 256K Context

Self-Harness: AI Agents That Improve Their Own Operating Framework

AI Benchmarks in 2026: The Complete Guide to MMLU, GPQA, SWE-bench, and Beyond

Why Agent Scaffolds Matter for Coding Agents

Benchmark Results: 397B MoE vs Frontier Models

Mid-Size and Edge Variants: 35B and 9B

Ornith-1.0-35B MoE

Ornith-1.0-9B Dense (edge-friendly)

Fighting Reward Hacking in Self-Scaffolding RL

Pipeline RL and Long Rollouts

Evaluation Methodology (What the Numbers Actually Mean)

Who Should Try Ornith-1.0 First?

How Ornith Fits the 2026 Agentic Coding Landscape

Related Reading

Summary