Ornith-1.0: Self-Scaffolding Open Models for Agentic Coding

Ornith-1.0 is a new MIT-licensed model family from DeepReinforce that learns its own agent scaffolds during RL post-training. The 397B MoE variant hits 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified.

Jun 25, 2026 · 9 min read · Yash Thakker
Ornith-1.0 · Agentic Coding · Open Source LLMs · Terminal-Bench · SWE-Bench · AI Benchmarks

On June 25, 2026, the DeepReinforce.AI team behind @ornith_ announced Ornith-1.0, a family of MIT-licensed, open-weight models built specifically for agentic coding. The release spans 9B Dense, 31B Dense, 35B MoE, and 397B MoE checkpoints, post-trained on Gemma 4 and Qwen 3.5 bases.

The technical bet is not just bigger pretraining. Ornith-1.0 treats the agent scaffold (memory layout, retry logic, tool orchestration) as something the model learns during reinforcement learning, not something engineers hard-code once per benchmark category. That is why the team calls it a self-scaffolding training strategy.


TL;DR: Ornith-1.0 at a Glance

| Detail | Value |
| --- | --- |
| Release date | June 25, 2026 |
| License | MIT (commercial + research) |
| Model sizes | 9B Dense, 31B Dense, 35B MoE, 397B MoE |
| Base models | Gemma 4 and Qwen 3.5 |
| Flagship scores (397B) | 77.5 Terminal-Bench 2.1 (Terminus-2), 82.4 SWE-Bench Verified |
| Key training idea | Joint RL on scaffold generation + solution rollouts |
| Weights | Hugging Face collection |
| Technical blog | deep-reinforce.com/ornith_1_0.html |

DeepReinforce's launch post summarizes the positioning plainly: "Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts."


Why Agent Scaffolds Matter for Coding Agents

Most public coding-agent scores combine three ingredients: the base model, the harness (OpenHands, Harbor/Terminus-2, Claude Code, mini SWE agent), and the benchmark task distribution. When harness design is fixed, leaderboard gains can reflect benchmark-specific tuning as much as raw model capability.

Ornith-1.0 attacks that coupling directly. Each RL step runs in two stages:

  1. Scaffold stage: conditioned on the task and the scaffold used last time, the model proposes a refined scaffold.
  2. Solution stage: conditioned on that scaffold and the task description, the model produces a solution rollout.

The rollout's reward is the training signal for both stages. Over training, scaffolds that induce higher-reward trajectories survive; weak orchestration patterns get replaced. Per-task-category strategies can emerge without a human maintaining separate harness configs for Terminal-Bench, SWE-Bench, and repo-generation evals.
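
The two-stage step can be sketched in a few lines. This is only an illustration of the control flow described above, not DeepReinforce's implementation: every function here (`propose_scaffold`, `generate_rollout`, `verify`, `rl_step`, `stage_credit`) is a hypothetical stand-in, the "model" is replaced by string manipulation, and the verifier is a coin flip.

```python
import random
from dataclasses import dataclass


@dataclass
class StepRecord:
    task: str
    scaffold: str
    rollout: str
    reward: float


def propose_scaffold(task: str, previous: str) -> str:
    # Stage 1 stand-in: refine the scaffold used last time for this task.
    return f"{previous}+retry" if previous else "plan-then-act"


def generate_rollout(task: str, scaffold: str) -> str:
    # Stage 2 stand-in: produce a solution conditioned on scaffold and task.
    return f"solve {task} via [{scaffold}]"


def verify(rollout: str, rng: random.Random) -> float:
    # Stand-in verifier; a real one would run the task's tests in a sandbox.
    return 1.0 if rng.random() < 0.5 else 0.0


def rl_step(task: str, memory: dict, rng: random.Random) -> StepRecord:
    scaffold = propose_scaffold(task, memory.get(task, ""))
    rollout = generate_rollout(task, scaffold)
    reward = verify(rollout, rng)
    memory[task] = scaffold  # becomes "the scaffold used last time"
    return StepRecord(task, scaffold, rollout, reward)


def stage_credit(record: StepRecord) -> dict:
    # One scalar reward is shared by the tokens of both stages.
    return {"scaffold_stage": record.reward, "solution_stage": record.reward}


memory: dict = {}
rng = random.Random(0)
for _ in range(2):
    rec = rl_step("fix-failing-test", memory, rng)
    print(rec.scaffold, rec.reward, stage_credit(rec))
```

The point of the sketch is the data flow: the scaffold is an output of the policy, it conditions the rollout, and the same reward reaches both.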

For teams running Claude Code agent workflows, Cursor agents, or custom MCP loops, the implication is practical: orchestration is trainable, not only prompt-engineered.


Benchmark Results: 397B MoE vs Frontier Models

DeepReinforce reports that Ornith-1.0-397B leads comparable open-weight models on agentic coding suites and matches or exceeds Claude Opus 4.7 on several headline benchmarks, while Claude Opus 4.8 and GLM-5.2-744B still top some columns.

| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Claude Opus 4.7 | Claude Opus 4.8 | DeepSeek-V4-Pro |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 53.5 | 70.3 | 85.0 | 67.9 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 48.6 | 69.7 | 78.9 | 66.5 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.8 | 87.6 | 80.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 64.3 | 69.2 | 55.4 |
| SWE-Bench Multilingual | 78.9 | 69.3 | - | - | 76.2 |
| NL2Repo | 48.2 | 36.8 | - | 69.7 | - |
| ClawEval (avg) | 77.1 | 70.7 | 78.2 | - | 75.8 |

Sources: Ornith-1.0 technical blog, June 2026. A hyphen means the model was not listed in DeepReinforce's public table.

Three takeaways for engineering leaders:

  • Terminal-Bench 2.1: Ornith-1.0-397B at 77.5 under Harbor/Terminus-2 is a meaningful jump over Qwen3.5-397B (53.5) and closes much of the gap to closed frontier models. See our Terminal-Bench 2.0 guide for why this benchmark stresses real shell workflows.
  • SWE-Bench Verified: 82.4 puts the open model in the same band as Claude Opus 4.7 (80.8) and DeepSeek-V4-Pro (80.6), though Opus 4.8 still leads at 87.6.
  • Harness sensitivity: Ornith scores 78.2 on Terminal-Bench when evaluated through Claude Code 2.1.126, not just Terminus-2. That suggests the learned scaffolds transfer across agent runtimes, but always verify on your toolchain.
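
To make the size of these gaps concrete, the snippet below recomputes the point differences from four rows of the 397B table above. The numbers are DeepReinforce's reported scores copied from that table; the dictionary keys and the `gap` helper are names chosen for this illustration.

```python
# Reported scores copied from the 397B table (DeepReinforce's published numbers).
SCORES = {
    "Terminal-Bench 2.1 (Terminus-2)": {"ornith_397b": 77.5, "qwen3_5_397b": 53.5, "opus_4_8": 85.0},
    "Terminal-Bench 2.1 (Claude Code)": {"ornith_397b": 78.2, "qwen3_5_397b": 48.6, "opus_4_8": 78.9},
    "SWE-Bench Verified": {"ornith_397b": 82.4, "qwen3_5_397b": 76.4, "opus_4_8": 87.6},
    "SWE-Bench Pro": {"ornith_397b": 62.2, "qwen3_5_397b": 51.6, "opus_4_8": 69.2},
}


def gap(benchmark: str, model: str, reference: str) -> float:
    """Point difference (model minus reference), rounded to one decimal."""
    row = SCORES[benchmark]
    return round(row[model] - row[reference], 1)


for name in SCORES:
    vs_qwen = gap(name, "ornith_397b", "qwen3_5_397b")
    vs_opus = gap(name, "ornith_397b", "opus_4_8")
    print(f"{name}: {vs_qwen:+.1f} vs Qwen3.5-397B, {vs_opus:+.1f} vs Claude Opus 4.8")
```

The largest reported gains over Qwen3.5-397B are on the two Terminal-Bench rows (24.0 and 29.6 points), while the remaining deficit to Claude Opus 4.8 ranges from 0.7 to 7.5 points across these four rows.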

Mid-Size and Edge Variants: 35B and 9B

Not every team can serve a 397B MoE cluster. Ornith-1.0's smaller checkpoints are where the release gets interesting for cost-conscious deployments.

Ornith-1.0-35B MoE

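| Benchmark | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B | Qwen3.5-397B |
| --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5 | 53.5 |
| SWE-Bench Verified | 75.6 | 70.0 | 73.4 | 76.4 |
| SWE-Bench Pro | 50.4 | 44.6 | 49.5 | 51.6 |
| ClawEval (avg) | 69.8 | 65.4 | 68.7 | 70.7 |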

The 35B model beating Qwen3.5-397B on Terminal-Bench 2.1 (64.2 vs. 53.5) is the standout efficiency story in DeepReinforce's tables. A 35B MoE checkpoint would not be expected to outperform a 397B-class base on terminal agent tasks unless post-training and scaffold learning are doing substantial work. On the other three rows it trails the 397B base by roughly one point.

Ornith-1.0-9B Dense (edge-friendly)

| Benchmark | Ornith-1.0-9B | Qwen3.5-9B | Gemma4-31B |
| --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 43.1 | 21.3 | 42.1 |
| SWE-Bench Verified | 69.4 | 53.2 | 52.0 |
| SWE-Bench Pro | 42.9 | 31.3 | 35.7 |
| ClawEval (avg) | 63.1 | 53.2 | 48.5 |

A 9B dense model scoring 43.1 on Terminal-Bench 2.1, slightly ahead of Gemma4-31B (42.1), is evidence that agentic coding skills can compress into edge-deployable footprints when training targets scaffolds and solutions jointly.


Fighting Reward Hacking in Self-Scaffolding RL

Letting the model author its own scaffold creates a familiar failure mode: the scaffold learns to game the verifier instead of solving the task. DeepReinforce documents examples such as reading visible test files and hardcoding expected outputs, touching files the grader checks without implementing behavior, or copying oracle solutions when they leak into the environment.

Their mitigation stack has three layers:

  1. Fixed trust boundary: environments, tool surfaces, and test isolation stay immutable. The model may only evolve inner scaffold logic: memory, error handling, orchestration.
  2. Deterministic monitor: flags attempts to read withheld paths, modify verification scripts, or call tools outside the sanctioned surface. Violations get zero reward and drop out of advantage computation.
  3. Frozen LLM judge: acts as a veto on top of the verifier when intent-level gaming stays inside allowed tools.

This mirrors broader industry concern about eval contamination and reward hacking. Ornith's approach is notable because the attack surface includes scaffold code the model writes for itself, not only final patches.
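
The deterministic monitor (layer 2) is the easiest layer to picture in code. The following is a minimal sketch under stated assumptions, not DeepReinforce's monitor: the path prefixes, tool names, and the `ToolCall` record are hypothetical, only explicit file reads and writes are inspected (a real monitor would also have to cover shell commands), and advantages are simply mean-centred over the surviving rollouts.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical policy values, for illustration only.
WITHHELD_PREFIXES = ("/tests/hidden/", "/oracle/")
VERIFIER_PREFIXES = ("/grader/",)
ALLOWED_TOOLS = {"read_file", "write_file", "run_shell"}


@dataclass
class ToolCall:
    tool: str
    path: str = ""


def violates(call: ToolCall) -> bool:
    """Rule-based check: no model in the loop, same input gives same verdict."""
    if call.tool not in ALLOWED_TOOLS:
        return True  # tool outside the sanctioned surface
    if call.tool == "read_file" and call.path.startswith(WITHHELD_PREFIXES):
        return True  # reading withheld tests or oracle solutions
    if call.tool == "write_file" and call.path.startswith(VERIFIER_PREFIXES):
        return True  # modifying verification scripts
    return False


def monitored_rewards(trajectories, raw_rewards):
    """Zero the reward of flagged trajectories and return a keep-mask."""
    keep = [not any(violates(c) for c in traj) for traj in trajectories]
    rewards = [r if k else 0.0 for r, k in zip(raw_rewards, keep)]
    return rewards, keep


def group_advantages(rewards, keep):
    """Mean-centred advantages computed over non-flagged rollouts only."""
    kept = [r for r, k in zip(rewards, keep) if k]
    baseline = mean(kept) if kept else 0.0
    return [r - baseline if k else 0.0 for r, k in zip(rewards, keep)]


trajectories = [
    [ToolCall("read_file", "/repo/src/app.py"), ToolCall("run_shell")],
    [ToolCall("read_file", "/tests/hidden/test_app.py")],  # flagged
    [ToolCall("write_file", "/repo/src/app.py")],
]
rewards, keep = monitored_rewards(trajectories, [1.0, 1.0, 0.0])
print(keep, rewards, group_advantages(rewards, keep))
```

Excluding flagged rollouts from the baseline matters: a hacked rollout that passed the verifier would otherwise raise the group mean and penalise honest attempts.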


Pipeline RL and Long Rollouts

Agentic coding rollouts are long. Standard on-policy RL becomes expensive when trajectories span thousands of tokens across tool calls. Ornith-1.0 uses pipeline RL with staleness-weighted GRPO: older off-policy tokens are down-weighted by age and discarded past a threshold, so long-horizon training stays stable without treating every stale token as equally valid.

DeepReinforce publishes the weighting scheme and clipped token-level GRPO loss in the technical blog. For practitioners, the important point is architectural: self-scaffolding only works if RL infrastructure can absorb multi-hour agent trajectories, the same constraint that shows up in agent harness engineering write-ups.
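
The exact weighting function is in the technical blog and is not reproduced here. As a rough illustration of the idea only, the sketch below assumes an exponential decay in token age with a hard cutoff, and applies it to a standard clipped surrogate. The decay rate, the cutoff, the definition of age as "policy versions behind", and the normalisation by total weight are all assumptions made for this example.

```python
def staleness_weight(age: int, decay: float = 0.5, max_age: int = 3) -> float:
    """Assumed scheme: weight halves per policy version of lag, zero past max_age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age > max_age:
        return 0.0  # token is too stale and is discarded
    return decay ** age


def weighted_clipped_objective(ratios, advantages, ages, clip_eps: float = 0.2) -> float:
    """Token-level clipped surrogate, with each token scaled by its staleness weight.

    ratios:     new-policy / behaviour-policy probability per token
    advantages: advantage assigned to each token
    ages:       how many policy versions behind each token's generator was
    """
    total, weight_sum = 0.0, 0.0
    for ratio, adv, age in zip(ratios, advantages, ages):
        w = staleness_weight(age)
        if w == 0.0:
            continue
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += w * min(ratio * adv, clipped * adv)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Three tokens: fresh, one version stale, and too stale to use.
print(weighted_clipped_objective([1.0, 1.5, 1.0], [1.0, 1.0, 1.0], [0, 1, 5]))
```

In a pipelined setup, rollout workers keep generating while the learner updates, so a single long trajectory can contain tokens from several policy versions. Weighting per token rather than per trajectory lets the fresh tail of a rollout contribute fully even when its head is stale.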


Evaluation Methodology (What the Numbers Actually Mean)

Scores are not directly comparable unless harness, temperature, context window, and run count match. DeepReinforce documents:

| Benchmark | Harness / settings (from DeepReinforce footnotes) |
| --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | Harbor/Terminus-2, temp=1.0, top_p=1.0, 128K context, 4h timeout, 32 CPU / 48GB RAM, 5-run average |
| Terminal-Bench 2.1 (Claude Code) | Claude Code 2.1.126, temp=1.0, max 131072 tokens, 5-run average |
| SWE-Bench Verified / Pro / Multilingual | OpenHands, temp=1.0, top_p=0.95, 256K context |
| SWE Atlas (QnA / RF / TW) | mini SWE agent, temp=1.0, top_p=0.95, 128K context, 5-run average |
| NL2Repo | temp=1.0, top_p=1.0, 400K context, 48K output, anti-hacking filters |
| ClawEval | Real-user task distribution, temp=0.6, 256K context |

They also ship a modified Qwen chat template for training/inference alignment (chat_template.jinja on HF) and Harbor tweaks for vLLM reasoning_content keys. Reproducing leaderboard numbers requires matching those details, not only loading weights.
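
One way to keep yourself honest when comparing a local run against a published number is to record the settings as data and diff them before looking at scores. The sketch below does that for the four factors named at the top of this section. `EvalSettings` and `mismatches` are names invented for this example, and the `local` values are a made-up run, not anything DeepReinforce reports.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class EvalSettings:
    harness: str
    temperature: float
    context: str                 # kept as text, e.g. "128K", exactly as footnoted
    runs: Optional[int] = None   # None where no run count is stated


def mismatches(a: EvalSettings, b: EvalSettings) -> list:
    """Names of the settings that differ between two evaluation runs."""
    return [f.name for f in fields(EvalSettings) if getattr(a, f.name) != getattr(b, f.name)]


# Terminal-Bench 2.1 (Terminus-2) row from the settings table above.
published = EvalSettings("Harbor/Terminus-2", temperature=1.0, context="128K", runs=5)

# Hypothetical local reproduction attempt.
local = EvalSettings("Harbor/Terminus-2", temperature=0.6, context="128K", runs=1)

diff = mismatches(published, local)
if diff:
    print("Not directly comparable; differing settings:", ", ".join(diff))
else:
    print("Settings match; scores can be compared.")
```

A single run at a different temperature can land several points away from a 5-run average, so a mismatch here is a reason to rerun, not a reason to conclude the published score is wrong.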


Who Should Try Ornith-1.0 First?

Strong fit:

  • Teams building self-hosted coding agents who need MIT-licensed weights
  • Researchers studying learned harnesses vs fixed OpenHands/Terminus configs
  • Orgs evaluating 9B to 35B models for cost-sensitive agent loops on private repos

Proceed with caution:

  • Production systems that require independently reproduced benchmark numbers before model swaps
  • Workloads where Claude Opus 4.8 or GLM-5.2 still lead on your target eval
  • Teams without GPU capacity for 397B MoE serving: start with 9B or 35B and measure on internal tasks

Browse related open models in the ExplainX LLM directory and agent tooling in the MCP server registry.


How Ornith Fits the 2026 Agentic Coding Landscape

2026's coding-model releases increasingly optimize for agent trajectories, not single-turn code completion. Ornith-1.0 sits alongside:

  • Closed frontier models (Claude Opus 4.7/4.8, GPT-5.x, Gemini 3.x) tuned on proprietary agent data
  • Open-weight bases (Qwen 3.5/3.6, Gemma 4, DeepSeek-V4) with community harnesses
  • Benchmark-focused post-training shops (DeepReinforce here, plus eval-driven releases like DeepSWE discussions)

Ornith's differentiator is explicit: learn the scaffold. That aligns with loop engineering and durable agent workflow design: treat orchestration as a first-class artifact, not an afterthought wrapped around chat completions.


Related Reading

  • Terminal-Bench 2.0: The AI Agent Benchmark That Actually Matters
  • Agent Harness Engineering: Terminal-Bench, LangChain, and Coding Agents
  • DeepSWE Benchmark: GPT-5.5 Leads as SWE-Bench Pro Faces Scrutiny
  • Cursor Reward Hacking and SWE-Bench Eval Contamination
  • What Are Agent Skills? Complete Guide

Summary

Ornith-1.0 is one of the most interesting open-weight coding releases of June 2026 because it attacks harness design, not just next-token loss on GitHub diffs. The 397B MoE variant reports 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified under DeepReinforce's published protocols, with MIT-licensed weights from 9B to 397B.

Treat the leaderboard as a strong directional signal, not a deployment checklist. Reproduce scores on your repositories, your CI, and your agent stack before committing infrastructure. If self-scaffolding RL holds up under independent audit, it could reshape how teams think about agent loops and benchmark-specific tuning.

Sources: Ornith-1.0 technical blog (DeepReinforce.AI, June 2026), Hugging Face Ornith-1.0 collection, and @ornith_ on X. Benchmark figures reflect DeepReinforce's published tables as of that date; independent reproduction may differ.
