Sakana Fugu Benchmarks: Specs, Evals, and How It Compares to Open-Source AI Models

Fast answer: Sakana Fugu is best understood as a managed multi-agent orchestration system exposed through one model API. It is not the same thing as downloading Llama, Qwen, DeepSeek, Mistral, Gemma, GLM, Kimi, or Nemotron weights and running them on your own infrastructure. Sakana’s own benchmark results for Fugu Ultra are genuinely strong, especially on agentic coding and hard reasoning, but the cleanest comparison is not simply “Fugu vs open source.” It is API orchestration vs self-hosted open-weight models.

That distinction matters. If your team wants a single API that can coordinate several strong models behind the scenes, Fugu is one of the most interesting launches of 2026. If your team needs downloadable weights, offline inference, fully controlled deployment, reproducible research, or a license you can audit before procurement, Fugu is not a drop-in replacement for an open-weight model stack. For a shorter companion read on the headline frontier-model comparison, see our earlier Sakana Fugu Ultra benchmark breakdown.

Mini verdict

Best for: complex multi-step reasoning, coding review, research workflows, paper reproduction, cybersecurity analysis, and agentic tasks where answer quality matters more than raw latency.
Not ideal for: teams that require fully open weights, offline inference, transparent internals, custom fine-tuning, or complete control over the serving stack.
Evidence level: strong first-party product and benchmark evidence, meaningful technical-report detail, limited independent validation so far.
Kingy verdict: promising and technically important, but not a blanket replacement for open-weight models or frontier APIs. Benchmark it on your own tasks before switching production workloads.

What Is Sakana Fugu?

Sakana Fugu is a product from Sakana AI launched on June 22, 2026. Sakana describes it as “a multi-agent system, delivered as one model.” In plain English: the developer calls one API, while Fugu decides how to use a coordinated pool of specialist models behind the scenes.

The important part is that Fugu is not presented as one downloadable model checkpoint like a conventional open-weight release. Sakana says Fugu dynamically coordinates and orchestrates a diverse pool of powerful models. The developer does not manually design a planner-worker-verifier workflow, write a routing tree, or choose a provider for every step. Fugu’s job is to learn when to delegate, how agents should communicate, and how to synthesize the result.

That is why Fugu sits in an awkward but interesting category. It competes with open models at the workflow level: can it solve your hard task better than the model you would otherwise host or call? But it does not compete with open models at the weight/license level: based on the public materials reviewed, users are not downloading Fugu weights, inspecting the orchestration model, or self-hosting the full system.

Architecture: one API, learned orchestration, multiple specialist agents

Developer call
OpenAI-compatible API

→

Fugu orchestrator
learned routing and scaffolds

→

Specialist model pool
planner, worker, verifier, critic patterns

→

Final answer
synthesized response

Simplified from Sakana’s public description and technical report. The exact internal routing for a user request is not fully exposed.

Specs and Product Details

Field	Public detail reviewed
Product name	Sakana Fugu and Sakana Fugu Ultra
Developer	Sakana AI
Launch date	June 22, 2026, from Sakana’s launch post
Model type	Managed multi-agent orchestration system presented through a single model interface
Access method	OpenAI-compatible API
Model weights availability	Not publicly disclosed as downloadable Fugu weights
Open-source status	Not proven open-source from public materials reviewed
Open-weight status	Not proven open-weight from public materials reviewed
Self-hosting availability	Not publicly disclosed
Agent pool customization	Fugu can opt out of specific agents/providers; Fugu Ultra uses the full fixed agent pool
EU/EEA availability	Official page says it is not yet available in EU/EEA while Sakana works toward GDPR and EU-specific compliance
Pricing	Subscription tiers: Standard $20/month, Pro $100/month, Max $200/month. Pay-as-you-go also available. Fugu Ultra fugu-ultra-20260615: $5/M input, $30/M output, $0.50/M cached input; above 272K context: $10/M input, $45/M output, $1/M cached input.
Best use cases	Coding, code review, research, hard reasoning, paper reproduction, cybersecurity analysis, literature/patent investigation
Known limitations	Independent reproduction still early; latency/cost per successful task not yet well characterized publicly; not a self-hosted open model
Core sources	Fugu page, launch post, technical report

Benchmark Summary

Sakana’s official benchmark table is the backbone of the Fugu story. The numbers are broad, and on many rows they are excellent. But this is still a first-party table. Sakana states that baseline results are provider-reported wherever available, and the technical report gives benchmark-specific caveats. That does not make the numbers useless. It means buyers should treat them as a strong reason to test, not as the final word.

Benchmark score overview: Fugu vs Fugu Ultra

SWE Bench Pro

Fugu59.0

Ultra73.7

Terminal Bench 2.1

Fugu80.2

Ultra82.1

LiveCodeBench

Fugu92.9

Ultra93.2

LiveCodeBench Pro

Fugu87.8

Ultra90.8

Humanity's Last Exam

Fugu47.2

Ultra50.0

CharXiv Reasoning

Fugu85.1

Ultra86.6

GPQA-D

Fugu95.5

Ultra95.5

SciCode

Fugu60.1

Ultra58.7

tau3 Banking

Fugu21.7

Ultra20.6

Long Context Reasoning

Fugu74.7

Ultra73.3

MRCRv2

Fugu86.6

Ultra93.6

Source: Sakana Fugu official benchmark table and technical report. These bars put mixed benchmarks on one visual canvas for scanning; they are not a normalized universal score.

Benchmark	What it tests	Fugu	Fugu Ultra	Best listed public baseline	What it suggests	Caveat
SWE Bench Pro	real software-engineering issue resolution with mini-swe-agent scaffolding	59.0	73.7	Claude Opus 4.8, 69.2	Fugu Ultra leads Sakana's frontier baseline table	Baseline scores are provider-reported; harness choice matters.
Terminal Bench 2.1	agentic command-line tasks	80.2	82.1	GPT-5.5, 78.2	Fugu Ultra leads, while Fugu is close	Sakana used Terminus 2 / leaderboard or provider-reported baselines.
LiveCodeBench	competitive coding from May 2023-April 2025	92.9	93.2	Gemini 3.1 Pro, 88.5	Both Fugu models lead the listed baselines	Vals AI baseline source; not a full production coding-agent test.
LiveCodeBench Pro	harder text-only competitive programming	87.8	90.8	GPT-5.5, 88.4	Fugu Ultra leads	Sakana ran baselines with retries on timeout/token exhaustion.
Humanity's Last Exam	multidisciplinary reasoning, including multimodal samples	47.2	50.0	Claude Opus 4.8, 49.8	Fugu Ultra narrowly leads	Baseline mix includes provider-reported and Artificial Analysis values.
CharXiv Reasoning	multimodal chart/figure reasoning	85.1	86.6	Claude Opus 4.8, 84.2	Fugu Ultra leads	Uses GPT-4o as judge; most baselines provider-reported.
GPQA-D	diamond subset of graduate-level science QA	95.5	95.5	Gemini 3.1 Pro, 94.3	Fugu and Ultra tie and lead	Default EvalScope; baseline scores provider-reported.
SciCode	scientific coding tasks	60.1	58.7	Gemini 3.1 Pro, 58.9	Fugu leads Fugu Ultra and listed baselines	Sakana notes package-version issues can affect legitimate solutions.
tau3 Banking	simulated banking dialog/task completion	21.7	20.6	Claude/GPT tied, 20.6	Fugu leads	Reported as pass@4 with GPT-5.2 simulator.
Long Context Reasoning	long-document retrieval and reasoning	74.7	73.3	GPT-5.5, 74.3	Fugu leads the table	Artificial Analysis setup with equality checker and two-hour timeout.
MRCRv2	8-needle retrieval up to 128K context	86.6	93.6	GPT-5.5, 94.8	Fugu Ultra is strong but below GPT-5.5 in this table	Provider-reported baseline caveat.

What The Benchmarks Actually Mean

SWE Bench Pro is the row that will make coding-agent teams pay attention. It evaluates software engineering issue resolution. Sakana used mini-swe-agent scaffolding and effectively disabled a turn cap in the technical-report configuration. Fugu Ultra’s 73.7 score is strong in Sakana’s table, above Claude Opus 4.8’s listed 69.2. The caveat is obvious but important: a coding benchmark is partly a model test and partly a harness test.

Terminal Bench 2.1 measures hard command-line agent tasks. Fugu scores 80.2 and Fugu Ultra scores 82.1, both above the listed GPT-5.5 baseline of 78.2. This is exactly where a learned orchestrator should help: the system can choose different expert behaviors across a long task rather than relying on one model’s first plan.

LiveCodeBench and LiveCodeBench Pro test competitive programming. Fugu and Fugu Ultra perform well in both, with Ultra leading more clearly on the Pro split. This suggests Fugu’s routing is not only useful for tool-heavy agent loops; it can also help with direct reasoning and code generation. Still, contest-style coding does not fully predict repository maintenance, debugging, or product engineering.

Humanity’s Last Exam and GPQA-Diamond are hard reasoning and knowledge benchmarks. Fugu Ultra narrowly leads on HLE in Sakana’s table, while Fugu and Ultra tie at 95.5 on GPQA-D. These results make Fugu look like a serious reasoning layer, but they also raise an evaluation question: if Fugu can call frontier models, the buyer wants to know how much lift comes from orchestration, how stable that lift is, and how much it costs per solved problem.

CharXiv Reasoning is a chart and figure reasoning benchmark judged with GPT-4o in Sakana’s configuration. Fugu Ultra leads the listed baselines. That is encouraging for multimodal analysis, but any judged benchmark should be read with the judge and rubric in mind.

SciCode, tau3 Banking, and Long Context Reasoning are useful because Fugu, not Fugu Ultra, is the best of the two Fugu models on those rows. That is the kind of inconvenient detail that makes the table more interesting. Ultra is not simply “Fugu plus better.” It is a different operating point. For some tasks, a lower-latency routing choice may be enough, or even preferable.

MRCRv2 is another useful caveat row. Fugu Ultra scores 93.6, but GPT-5.5 is listed at 94.8. That keeps the story honest: Fugu Ultra may be excellent, but it does not dominate every benchmark in Sakana’s own table.

Fugu vs Fugu Ultra

Dimension	Fugu	Fugu Ultra
Speed	Optimized for latency and daily interactive use	Trades latency for deeper multi-agent work
Cost	Pay-as-you-go depends on the active underlying model tier; subscription access included	Fixed token price for fugu-ultra-20260615, with higher rates above 272K context
Agent pool	Can opt out of specific models/providers	Full fixed agent pool
Best use cases	Everyday coding, review, chatbots, analysis, interactive services	Hard multi-step reasoning, AI research, paper reproduction, cybersecurity, literature/patent work
Benchmark profile	Very strong; leads Ultra on SciCode, tau3 Banking, and Long Context Reasoning in the official table	Best on SWE Bench Pro, Terminal Bench 2.1, LiveCodeBench Pro, HLE, CharXiv, and MRCRv2 among Fugu variants
Main risk	May not get the full benefit of deeper multi-agent collaboration	Higher latency and potentially higher cost per task
Who should start here	Most developers testing Fugu in normal workflows	Teams with hard tasks where correctness is worth extra time and spend

Is Sakana Fugu Open Source?

Based on public materials reviewed, Sakana Fugu appears to be an API-based commercial orchestration product, not a conventional open-weight model release. The GitHub repo contains the technical report and related materials, but the repo itself is not proof that Fugu’s model weights, training code, inference stack, and orchestration runtime are open-source.

Use these categories carefully:

Open-source model: source code and related artifacts are released under an OSI-style open-source license. For AI models, people often misuse this term unless training code, inference code, weights, and license terms are all clear.
Open-weight model: model weights are downloadable, but the license may have restrictions. Llama is a common example of “open weights, custom license” rather than OSI open-source.
Source-available model: some code or documents are public, but rights to use, modify, deploy, or redistribute may be limited.
API-only commercial model: users call a hosted endpoint and do not control the weights or serving stack.
Orchestration system / router / multi-agent wrapper: a system that coordinates one or more models, tools, prompts, or agents to produce a result. Fugu belongs closest to this family.

That matters for sovereignty, privacy, reproducibility, fine-tuning, vendor dependency, and cost predictability. If your company needs air-gapped deployment or wants to audit every model component, open-weight models are still the natural starting point. If your company wants managed quality on difficult tasks and is comfortable with an external API, Fugu becomes interesting.

Sakana Fugu vs Open-Weight AI Models

To keep this comparison current, I checked public model cards and official sources on June 22, 2026. The open-weight landscape moves quickly: DeepSeek V4 Pro, GLM-5.2, Kimi K2.7 Code, Mistral Small 4, Mistral Medium 3.5, Gemma 4, Llama 4, and NVIDIA Nemotron 3 Ultra are the kind of models a technical buyer might consider alongside an API orchestration layer. For a broader local-model buyer’s guide, see Kingy AI's open-source AI model guide and our open-weight model roundup.

Model/system	Type	Weights?	Self-host?	License/status	Parameters	Context	Best fit	Cost profile	Control
Sakana Fugu / Fugu Ultra	API-only orchestration system	No	No	Not presented as open-source or open-weight	Not publicly disclosed as standalone model weights	Not publicly disclosed	Agentic coding, hard reasoning, research workflows	Managed API; token pricing plus subscription options	Lower than self-hosted; higher than building your own stack
DeepSeek V4 Pro	Open-weight model	Yes	Yes	MIT on Hugging Face card	1.6T total / 49B active	1M	Long context, coding, reasoning	Self-hosting infra or third-party inference	High
GLM-5.2	Open-weight model	Yes	Yes	MIT on Hugging Face card	Not summarized cleanly in card excerpt reviewed	1M	Long-horizon work, SWE-bench Pro, GPQA-Diamond	Self-hosting infra or hosted inference	High
Kimi K2.7 Code	Open-weight model	Yes	Yes	Modified MIT per model card	1T total / 32B active	256K	Coding agents and multimodal/code workflows	Self-hosting infra or hosted inference	High, subject to modified license
Llama 4 Maverick	Open-weight model under custom license	Yes	Yes	Llama 4 Community License	17B active / 128 experts	1M on several official/partner cards	Multimodal chat, coding, tool use	Self-hosting or managed providers	Medium-high; license is not OSI open-source
Mistral Small 4 119B	Open-weight model	Yes	Yes	Apache 2.0	119B total / 6.5B active	256K	Efficient coding, agentic use, chat	Self-hosting infra or hosted inference	High
Mistral Medium 3.5	Open-weight model under modified license	Yes	Yes	Modified MIT with revenue exceptions	128B dense	256K	Agentic coding, instruction following	Self-hosting infra or hosted inference	Medium-high; license needs review
Gemma 4 31B	Open-weight model	Yes	Yes	Apache 2.0 on card	30.7B	256K	Multimodal, coding, long context	Self-hosting/local/hosted	High
NVIDIA Nemotron 3 Ultra	Open-weight model under OpenMDW	Yes	Yes	OpenMDW 1.1	550B total / 55B active	Up to 1M	Frontier-scale reasoning, agents, long context	Heavy self-hosting or NVIDIA ecosystem	Medium-high; license-specific obligations

This is a practical table, not an apples-to-apples scientific ranking. Fugu is a managed orchestration product. DeepSeek, GLM, Kimi, Llama, Mistral, Gemma, and Nemotron are model releases with downloadable weights. The right question is not “which one is philosophically better?” The right question is: which deployment model gives your team the best answer quality, privacy, latency, cost, and operational control for the jobs you actually run?

Benchmark Comparison Against Open Models

Direct benchmark comparison is messy. Sakana’s Fugu table compares Fugu against frontier API models. Many open model cards report benchmark numbers, but the harnesses, dates, scaffolds, and source types differ. DeepSeek V4 Pro’s model card reports GPQA-Diamond and LiveCodeBench numbers. GLM-5.2’s model card reports SWE-bench Pro and GPQA-Diamond. Gemma 4’s card reports LiveCodeBench v6, GPQA Diamond, and MRCR v2. NVIDIA’s Nemotron 3 Ultra card reports SWE-Bench Verified, GPQA, and RULER 1M. These are useful signals, but they should not be mixed into one leaderboard without qualification.

Benchmark family	Fugu/Fugu Ultra source	Open-model source availability	Comparable?	How to read it
SWE-style coding agents	Sakana reports SWE Bench Pro	GLM-5.2 reports SWE-bench Pro; Nemotron reports SWE-Bench Verified; many cards use different SWE variants	Partly	Compare only when benchmark variant and harness match.
LiveCodeBench	Sakana reports LiveCodeBench and LiveCodeBench Pro	DeepSeek V4 Pro and Gemma 4 cards report LiveCodeBench variants	Partly	Useful for coding strength, less useful for full agent workflows.
GPQA-Diamond	Sakana reports GPQA-D	DeepSeek, GLM, Gemma, and others publish GPQA-Diamond numbers	Somewhat	More comparable than agentic benchmarks, but still check shots and evaluation setup.
Humanity’s Last Exam	Sakana reports HLE	Not always present on open model cards	Often no	Use ‘not available’ rather than filling gaps.
Long context	Sakana reports Long Context Reasoning and MRCRv2	DeepSeek, GLM, Gemma, Llama, and Nemotron publish long-context claims/benchmarks	Partly	Context window size is not the same as reliable long-context reasoning.
tau3 / tau-bench	Sakana reports tau3 Banking	Some open cards publish tau-style or agentic results, many do not	Usually no	Harness and simulator details matter too much for casual ranking.

Where Fugu Might Beat Open Models

Fugu’s best argument is not that Sakana has invented a magical single model that makes every open model irrelevant. The best argument is that some tasks benefit from a trained coordination layer. A hard code review may need one model to inspect architecture, another to reason about tests, another to challenge the patch, and another to synthesize a concise answer. A research task may need planner, searcher, verifier, and writer behaviors. A cybersecurity task may need one agent to construct a hypothesis and another to look for ways it fails.

You can build those loops yourself with open models. Many serious teams will. But that means owning routing logic, prompts, retries, model selection, evals, observability, spend controls, and maintenance. Fugu is attractive when you want the benefit of multi-agent behavior without building a custom orchestration stack.

Where Open Models Might Beat Fugu

Open-weight models still win when control is the product requirement. If you need offline inference, private data handling, air-gapped deployment, custom fine-tuning, deterministic version pinning, local batch inference, or direct auditability, Fugu’s managed API shape is a limitation. Open models also let you optimize cost at scale: once you own the hardware or reserve inference capacity, marginal cost can be more predictable than a multi-agent API bill.

Open models can also be lower-latency for simple tasks. If the job is a single-turn extraction, classification, summarization, or code-completion task, running a well-chosen local model may be faster and cheaper than invoking a managed orchestrator. Save orchestration for problems that actually need orchestration.

Cost and Practical Deployment

Sakana offers both subscription and pay-as-you-go pricing. The subscription tiers are Standard at $20/month, Pro at $100/month, and Max at $200/month, with Pro and Max described as 10x and 20x Standard usage. The official page says every subscription tier includes both Fugu and Fugu Ultra. Pay-as-you-go is aimed at heavier production workloads and bills by token usage.

For Fugu, the pay-as-you-go rule is unusual but sensible: if one agent is active, you pay the standard rate for that underlying model. If multiple agents are active, Sakana says it does not stack model fees; you pay a single rate based on the top-tier model involved. For Fugu Ultra, the published fixed price for fugu-ultra-20260615 is $5 per 1M input tokens, $30 per 1M output tokens, and $0.50 per 1M cached input tokens. Above 272K context, the rates rise to $10 input, $45 output, and $1 cached input per 1M tokens.

That is only the raw token story. For multi-agent systems, the better metric is cost per successful completed task. If Fugu Ultra solves a code migration in one expensive run that a cheaper model fails three times, the expensive token rate can be rational. If it calls more agents for a simple answer that an open 30B model could solve locally, the managed orchestrator is wasteful.

Workflow	Fugu fit	Fugu Ultra fit	Open model fit	Why
Simple coding question	Good	Usually overkill	Strong	Low-latency local or hosted open models may be enough.
Large code review	Good	Strong	Good with custom scaffolding	Verifier and critic loops can matter.
Research report	Good	Strong	Good if you build retrieval/evals	Multi-step planning and synthesis are useful.
Paper reproduction	Possible	Strong	Possible but engineering-heavy	Long-horizon tool use and verification matter.
Long-context analysis	Good	Good	Strong with DeepSeek/GLM/Nemotron-class models	Context window and retrieval reliability both matter.
Production chatbot	Strong for complex support	Usually too slow/expensive	Strong	Most chats do not need deep orchestration.
Batch summarization	Often too expensive	Poor fit	Strong	Predictable local inference usually wins.

What Feels Proven

Fugu is a real Sakana AI product with official API access.
Sakana presents Fugu as a learned model orchestration system, not just a hand-coded router.
Fugu and Fugu Ultra target different latency/quality operating points.
Official benchmark results are broad and strong, especially on agentic coding and hard reasoning.
The system is connected to Sakana’s TRINITY and Conductor research.
Pricing, subscription tiers, and the EU/EEA availability limitation are public on the official page.

What Feels Unproven

Evidence confidence map

Strong evidence

Official product specs, launch date, access method, pricing tiers, EU/EEA limitation, and Sakana’s published benchmark table.

Medium evidence

Technical-report methodology, first-party benchmark explanations, qualitative examples, and comparisons to provider-reported frontier baselines.

Early evidence

Independent reproduction, latency distributions, total cost per completed production task, failure modes, and enterprise compliance details.

Whether independent labs can reproduce the full benchmark table.
Whether Fugu Ultra beats the best open-weight models on real company tasks.
Latency distribution for long-running multi-agent tasks.
Total cost per successful task after retries and failures.
Failure modes when the orchestrator chooses the wrong agent or synthesis strategy.
Privacy and compliance details beyond the current public page.
How often the agent pool changes and how much users can audit routing decisions.
Whether Fugu should be treated as a replacement for self-hosted open models in regulated workflows.

Should Developers Use Sakana Fugu?

Developers should try Fugu if they want a simple API that may outperform individual model calls on complex workflows. Start with Fugu for normal coding, code review, analysis, and interactive services. Try Fugu Ultra when the task is hard enough that correctness is worth more than speed: reproducing a paper, reviewing a security-sensitive codebase, investigating patents, or running a multi-step research loop.

Do not switch production workloads because a benchmark table looks exciting. Build a small eval set from your own work: 20 to 50 real tasks, with expected answers, human review, latency measurement, token accounting, retries, and failure tags. Compare Fugu, Fugu Ultra, your current frontier API, and at least one leading open-weight model. Track cost per completed task, not just cost per million tokens.

Should Businesses Care?

Yes, because Fugu is strategically interesting. It reduces dependence on any one model vendor by turning a pool of models into one managed interface. But it does not remove vendor dependency entirely. You still depend on Sakana’s hosted product, availability, pricing, compliance posture, and routing decisions. The business category is best described as a managed multi-agent intelligence layer.

That category is likely to matter. Many enterprises do not want to hand-wire model routers, agent frameworks, eval harnesses, prompt libraries, and monitoring systems. They want a reliable API that handles hard tasks. Fugu is a credible version of that idea. Open-weight models remain the better choice when the business priority is sovereignty, private deployment, or fully controlled economics.

Should Creators Care?

AI creators should care because Fugu sits at a useful intersection: frontier benchmarks, open-weight model competition, orchestration, coding agents, research automation, and vendor lock-in. It is a better topic than another generic model launch because the central tension is real. Is orchestration the next scaling axis? Can a managed multi-agent system beat a self-hosted model stack on actual work? Where does benchmark lift turn into buyer value?

Good creator angles include: “I tested Fugu against open models,” “Can Fugu beat self-hosted AI?”, “Is orchestration the next scaling law?”, and “Fugu Ultra vs open-source coding models.” For teams that want distribution around deep technical launches, Kingy also has a relevant sponsor a deep-dive video on Kingy AI page.

Final Verdict

Sakana Fugu is one of the most interesting AI launches of 2026 if the official results hold up. But it should not be casually described as an open-source model. It competes with open-weight models at the workflow layer, not at the license or deployment layer.

Fugu may be better for teams that want maximum answer quality through one API and do not want to build a multi-agent stack. Open models may be better for teams that need control, self-hosting, auditability, fine-tuning, private deployment, or predictable high-volume inference. The benchmark results are strong enough to justify serious testing. They are not strong enough to skip your own evals.

FAQ

What is Sakana Fugu?

Sakana Fugu is a managed multi-agent orchestration system exposed as a single OpenAI-compatible model API. It coordinates a pool of specialist models behind the scenes.

Is Sakana Fugu open source?

Based on public materials reviewed, no. It appears to be an API-based commercial orchestration product, not a conventional open-source model release.

Is Fugu Ultra open weight?

No public source reviewed shows downloadable Fugu Ultra weights.

Can I self-host Fugu?

Self-hosting is not publicly disclosed in the official product materials reviewed.

What is the difference between Fugu and Fugu Ultra?

Fugu balances performance and latency and allows opt-outs from the agent pool. Fugu Ultra prioritizes answer quality on hard tasks and uses the full fixed agent pool.

What benchmarks does Fugu perform best on?

In Sakana’s official table, Fugu Ultra is especially strong on SWE Bench Pro, Terminal Bench 2.1, LiveCodeBench Pro, Humanity’s Last Exam, CharXiv Reasoning, GPQA-D, and MRCRv2. Fugu itself leads Ultra on SciCode, tau3 Banking, and Long Context Reasoning.

Did Fugu beat GPT, Gemini, or Claude?

In Sakana’s first-party table, Fugu or Fugu Ultra beats the listed GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.8 baselines on many rows, but baseline scores are often provider-reported and not always independently reproduced in the same harness.

How does Fugu compare to DeepSeek, Qwen, Llama, and Mistral?

Fugu is a managed orchestration API. DeepSeek, Qwen, Llama, and Mistral releases are generally open-weight model families. Compare them by workflow fit, not just benchmark score.

Is Fugu better than open-source AI models?

Sometimes, for complex multi-step work where orchestration helps. Open models can be better for self-hosting, privacy, cost control, and reproducibility.

What is the best open-source alternative to Fugu?

There is no exact alternative because Fugu is an orchestration layer. Strong open-weight candidates include DeepSeek V4 Pro, GLM-5.2, Kimi K2.7 Code, Mistral Small 4, Gemma 4, Llama 4, and NVIDIA Nemotron 3 Ultra, depending on workload and license needs.

Does Fugu use open-source models internally?

The official materials describe a pool of powerful models and frontier agents, but the exact full internal pool and routing are not fully disclosed publicly.

Can users control which models Fugu uses?

For Fugu, the official FAQ says users can opt out of specific models from the console settings. Fugu Ultra relies on the full fixed agent pool.

How much does Sakana Fugu cost?

Subscriptions are Standard $20/month, Pro $100/month, and Max $200/month. Fugu Ultra pay-as-you-go is listed at $5/M input, $30/M output, and $0.50/M cached input, with higher rates above 272K context.

Is Fugu good for coding?

The official benchmark table suggests strong coding performance, especially on SWE Bench Pro, Terminal Bench 2.1, and LiveCodeBench. Developers should still test it on their own repositories.

Is Fugu good for research?

It is positioned for research workflows, paper reproduction, literature investigation, and hard multi-step tasks. Independent production case studies are still early.

What are the biggest caveats?

Open-source status, self-hosting, independent reproduction, latency distribution, cost per completed task, routing transparency, and compliance details.