Sakana Fugu is a multi-agent orchestration system from Sakana AI that behaves like a single foundation model. You call one OpenAI-compatible API endpoint and Fugu internally decides whether to solve the task directly or assemble and coordinate a team of specialist models. The orchestration is invisible to the caller.

How does Fugu Ultra compare to Fable 5 and GPT-5.5?

On Sakana's published benchmarks, Fugu Ultra matches or exceeds Fable 5 and Mythos Preview on SWE Bench Pro (73.7), LiveCodeBench (93.2), Humanity's Last Exam (50.0), and GPQA-D (95.5). Neither Fable 5 nor Mythos is in Fugu's agent pool. However, early independent testing by Ethan Mollick and others (June 23, 2026) found Fugu Ultra-high slow on shader tests (30 min) with quality below Fable in practice. Verify on your own workloads.

Why does Fugu matter for AI sovereignty?

Anthropic's Fable 5 and Mythos models recently became subject to export controls, cutting off access for organizations in certain regions overnight. Fugu's architecture uses a swappable agent pool — if one provider restricts access, Fugu routes around it. This makes it the first model explicitly designed to hedge against single-vendor dependency at the frontier.

What is the difference between Fugu and Fugu Ultra?

Fugu prioritizes low latency and is suited for interactive tools like coding assistants, chatbots, and code review. Fugu Ultra maximizes answer quality for demanding multi-step tasks like research automation, cybersecurity assessment, literature review, and patent analysis. Both use the same OpenAI-compatible API.

What research is Fugu built on?

Fugu is built on two ICLR 2026 papers from Sakana AI: TRINITY (an evolved LLM coordinator) and Conductor (learning to orchestrate agents in natural language). The core insight is training a model to understand when to delegate, how agents should communicate, and how to synthesize their outputs into a single reliable answer.

Can I use Fugu today?

Yes. Sakana Fugu launched on June 22, 2026. It's accessible via a single OpenAI-compatible API with subscription tiers for everyday use and pay-as-you-go for enterprise workloads. Visit the Sakana AI product page to get started.

Does Fugu Ultra match Fable 5 in real use?

Sakana claims Fugu Ultra matches Fable 5 and Mythos on published benchmarks. Early independent testing tells a different story. Wharton professor Ethan Mollick reported that his standard shader and interactive-scene tests took 30 minutes on Fugu Ultra-high and produced results that were "fine" but did not match Fable in practice. Other testers cited ~$6 and 10+ minutes per demo with visible glitches. Treat benchmark tables as directional; verify on your own workloads before committing.

Sakana Fugu: Benchmarks vs Real-World Testing (June 2026 Update) | explainx.ai Blog

explainx.ainewsletter3.5k

workshops ↗

Sakana Fugu: Benchmarks vs Real-World Testing (June 2026 Update) | explainx.ai Blog | explainx.ai

There's a new way to think about frontier AI performance: don't build a bigger model, build a smarter coordinator.

Sakana Fugu, released today by Sakana AI, is the first production model built on this premise. It's a multi-agent orchestration system that presents itself as a single foundation model — you call one API endpoint, and internally Fugu decides whether to answer directly or assemble a team of specialized models to handle it. The complexity never reaches your code.

And the benchmarks are hard to dismiss — on paper. Fugu Ultra matches Anthropic's Fable 5 and Mythos Preview across Sakana's published engineering, scientific, and reasoning tables — while neither of those models is even in its agent pool. They can't be. They're subject to export controls.

Update (June 23, 2026): Within 24 hours of launch, independent testers — led by Ethan Mollick, who runs some of the community's most-watched creative-coding experiments — reported a sharp gap between benchmark claims and real use. Shader and interactive-scene tests took 30 minutes on Fugu Ultra-high. Output was "fine" but did not match Fable in practice. See Real-world testing below.

That last part is still the point for sovereignty — but the performance story is now more complicated than launch-day headlines suggested.

Sakana AI introduces Fugu: one API that coordinates multiple frontier models under the hood.

One thing to get clear before going further: Fugu is not a frontier model. It's a model orchestrator. It doesn't replace Opus, GPT-5.5, or Gemini — it coordinates them. Fugu is the conductor; the frontier models in its pool are the orchestra. Without those underlying models, there's nothing to orchestrate. What Fugu brings is the intelligence layer that decides which model handles which part of a task, routes dynamically, and synthesizes the outputs into one coherent answer. If you're already paying for API access to multiple providers, Fugu is the glue that makes the collective smarter than any individual piece.

	Fugu	Fugu Ultra
Optimized for	Low latency, everyday tasks	Maximum quality, complex tasks
Best use cases	Code review, chatbots, Codex-style tools	Research, security assessment, patent analysis
Compliance controls	Opt specific agents out of pool	Same
Latency	Lower	Higher (deeper coordination)

Benchmark	Fugu	Fugu Ultra	Opus 4.8	Gemini 3.1 Pro	GPT-5.5
SWE Bench Pro	59.0	73.7	69.2	54.2	58.6
TerminalBench 2.1	80.2	82.1	74.6	70.3	78.2
LiveCodeBench	92.9	93.2	87.8	88.5	85.3
LiveCodeBench Pro	87.8	90.8	84.8	82.9	88.4
Humanity's Last Exam	47.2	50.0	49.8	44.4	41.4
CharXiv Reasoning	85.1	86.6	84.2	83.3	84.1
GPQA-D	95.5	95.5	92.0	94.3	93.6
SciCode	60.1	58.7	53.5	58.9	56.1
Long Context Reasoning	74.7	73.3	67.7	72.7	74.3
MRCRv2	86.6	93.6	87.9	84.9	94.8

Tester	Observation
Ethan Mollick	30 min per shader test; results "fine" but below Fable; Harbor demo
Peter Steinberger (@steipete)	Burned 100% of 5-hour quota in one prompt on a Three.js task; game "notably worse than GPT-5.5"; needed 7–8 Codex follow-ups to reach "almost playable"; fablepool.com/demo-fugu
MAT (@mbarras_ing)	~10 minutes, ~$6, "obvious glitches" on Fugu demo; re-running for longer
@LLMJunky	"Not great in my testing"
@0xV0LYX	"30min turnaround for a shader test is wild for something claiming frontier performance" — benchmarks ≠ real use
Janek Mann (@janekm)	Harbor scoring is subjective; dislikes Fugu's "overcardification" style (likely GPT-influenced UI patterns) vs GLM output
@wassieailouros	Local console run shows no reasoning trace during processing despite streaming enabled

Dimension	Sakana's published benchmarks	Mollick / community testing
Task type	SWE-bench, LiveCodeBench, GPQA	Shaders, Three.js, interactive scenes
Latency	Not emphasized	30 min per shader test (Ultra-high)
Cost	Subscription framing	~$6 and full quota in 1 prompt (Steinberger)
Quality vs Fable	Claimed parity	Does not match in Mollick's words
vs GPT-5.5	Competitive on tables	Worse on Three.js (Steinberger); GPT-5.5 needed no follow-ups

Scenario	Verdict
Individual dev, occasional prompting	Skip — OpenRouter or direct API is cheaper and more transparent
Creative coding (shaders, Three.js, games)	Caution — Mollick and Steinberger report slow runs, quota burn, quality below Fable/GPT-5.5
Team building an agent product	Worth evaluating — multi-vendor resilience and zero integration overhead have real value
Enterprise blocked from Fable/Mythos by export controls	Evaluate carefully — sovereignty story is real; creative-coding parity is unproven
Daily multi-step research or security assessment	Yes — Sakana beta feedback and benchmarks align here more than on Harbor-style tests
Light chat, autocomplete, RAG retrieval	No — you're paying orchestration overhead for tasks that don't need it
Budget-constrained solo developer	No — OpenRouter Fusion or a direct Opus/DeepSeek setup will cover most needs at a fraction of the cost
Latency-sensitive interactive work	No — 30-minute shader tests are disqualifying for most dev workflows

Sakana Fugu: One Model API to Orchestrate All the Others

Related posts

Perplexity Computer GLM 5.2 Orchestrator: 0.34× Opus Cost, Advisor Escalation

Apertus: The Fully Open Foundation Model Making AI Truly Sovereign

DoorDash dd-cli: Order Food From Your AI Agent in the Terminal

Why Orchestration, Why Now

What Fugu Actually Is

Fugu vs. Fugu Ultra

The Benchmark Numbers

Fugu Ultra vs Fable 5: Benchmarks vs Real Use

Real-world testing: Mollick's Harbor bench (June 23, 2026)

Why the Harbor bench matters

Other early reactions (June 22–23)

Benchmarks vs Harbor: the pattern

What Beta Users Built

The Architecture Advantage

Pricing and the Real Cost Math

The Token Fanout Problem

The Margin Problem Every Aggregator Has

Who Should Actually Use This

Alternatives Worth Knowing

What Comes Next

The Bigger Picture