Fast answer: Sakana Fugu is best understood as a managed multi-agent orchestration system exposed through one model API. It is not the same thing as downloading Llama, Qwen, DeepSeek, Mistral, Gemma, GLM, Kimi, or Nemotron weights and running them on your own infrastructure. Sakana’s own benchmark results for Fugu Ultra are genuinely strong, especially on agentic coding and hard reasoning, but the cleanest comparison is not simply “Fugu vs open source.” It is API orchestration vs self-hosted open-weight models.
That distinction matters. If your team wants a single API that can coordinate several strong models behind the scenes, Fugu is one of the most interesting launches of 2026. If your team needs downloadable weights, offline inference, fully controlled deployment, reproducible research, or a license you can audit before procurement, Fugu is not a drop-in replacement for an open-weight model stack. For a shorter companion read on the headline frontier-model comparison, see our earlier Sakana Fugu Ultra benchmark breakdown.
Mini verdict
- Best for: complex multi-step reasoning, coding review, research workflows, paper reproduction, cybersecurity analysis, and agentic tasks where answer quality matters more than raw latency.
- Not ideal for: teams that require fully open weights, offline inference, transparent internals, custom fine-tuning, or complete control over the serving stack.
- Evidence level: strong first-party product and benchmark evidence, meaningful technical-report detail, limited independent validation so far.
- Kingy verdict: promising and technically important, but not a blanket replacement for open-weight models or frontier APIs. Benchmark it on your own tasks before switching production workloads.
What Is Sakana Fugu?
Sakana Fugu is a product from Sakana AI launched on June 22, 2026. Sakana describes it as “a multi-agent system, delivered as one model.” In plain English: the developer calls one API, while Fugu decides how to use a coordinated pool of specialist models behind the scenes.
The important part is that Fugu is not presented as one downloadable model checkpoint like a conventional open-weight release. Sakana says Fugu dynamically coordinates and orchestrates a diverse pool of powerful models. The developer does not manually design a planner-worker-verifier workflow, write a routing tree, or choose a provider for every step. Fugu’s job is to learn when to delegate, how agents should communicate, and how to synthesize the result.
That is why Fugu sits in an awkward but interesting category. It competes with open models at the workflow level: can it solve your hard task better than the model you would otherwise host or call? But it does not compete with open models at the weight/license level: based on the public materials reviewed, users are not downloading Fugu weights, inspecting the orchestration model, or self-hosting the full system.
OpenAI-compatible API
learned routing and scaffolds
planner, worker, verifier, critic patterns
synthesized response
Simplified from Sakana’s public description and technical report. The exact internal routing for a user request is not fully exposed.
Specs and Product Details
| Field | Public detail reviewed |
|---|---|
| Product name | Sakana Fugu and Sakana Fugu Ultra |
| Developer | Sakana AI |
| Launch date | June 22, 2026, from Sakana’s launch post |
| Model type | Managed multi-agent orchestration system presented through a single model interface |
| Access method | OpenAI-compatible API |
| Model weights availability | Not publicly disclosed as downloadable Fugu weights |
| Open-source status | Not proven open-source from public materials reviewed |
| Open-weight status | Not proven open-weight from public materials reviewed |
| Self-hosting availability | Not publicly disclosed |
| Agent pool customization | Fugu can opt out of specific agents/providers; Fugu Ultra uses the full fixed agent pool |
| EU/EEA availability | Official page says it is not yet available in EU/EEA while Sakana works toward GDPR and EU-specific compliance |
| Pricing | Subscription tiers: Standard $20/month, Pro $100/month, Max $200/month. Pay-as-you-go also available. Fugu Ultra fugu-ultra-20260615: $5/M input, $30/M output, $0.50/M cached input; above 272K context: $10/M input, $45/M output, $1/M cached input. |
| Best use cases | Coding, code review, research, hard reasoning, paper reproduction, cybersecurity analysis, literature/patent investigation |
| Known limitations | Independent reproduction still early; latency/cost per successful task not yet well characterized publicly; not a self-hosted open model |
| Core sources | Fugu page, launch post, technical report |
Benchmark Summary
Sakana’s official benchmark table is the backbone of the Fugu story. The numbers are broad, and on many rows they are excellent. But this is still a first-party table. Sakana states that baseline results are provider-reported wherever available, and the technical report gives benchmark-specific caveats. That does not make the numbers useless. It means buyers should treat them as a strong reason to test, not as the final word.
Source: Sakana Fugu official benchmark table and technical report. These bars put mixed benchmarks on one visual canvas for scanning; they are not a normalized universal score.
| Benchmark | What it tests | Fugu | Fugu Ultra | Best listed public baseline | What it suggests | Caveat |
|---|---|---|---|---|---|---|
| SWE Bench Pro | real software-engineering issue resolution with mini-swe-agent scaffolding | 59.0 | 73.7 | Claude Opus 4.8, 69.2 | Fugu Ultra leads Sakana's frontier baseline table | Baseline scores are provider-reported; harness choice matters. |
| Terminal Bench 2.1 | agentic command-line tasks | 80.2 | 82.1 | GPT-5.5, 78.2 | Fugu Ultra leads, while Fugu is close | Sakana used Terminus 2 / leaderboard or provider-reported baselines. |
| LiveCodeBench | competitive coding from May 2023-April 2025 | 92.9 | 93.2 | Gemini 3.1 Pro, 88.5 | Both Fugu models lead the listed baselines | Vals AI baseline source; not a full production coding-agent test. |
| LiveCodeBench Pro | harder text-only competitive programming | 87.8 | 90.8 | GPT-5.5, 88.4 | Fugu Ultra leads | Sakana ran baselines with retries on timeout/token exhaustion. |
| Humanity's Last Exam | multidisciplinary reasoning, including multimodal samples | 47.2 | 50.0 | Claude Opus 4.8, 49.8 | Fugu Ultra narrowly leads | Baseline mix includes provider-reported and Artificial Analysis values. |
| CharXiv Reasoning | multimodal chart/figure reasoning | 85.1 | 86.6 | Claude Opus 4.8, 84.2 | Fugu Ultra leads | Uses GPT-4o as judge; most baselines provider-reported. |
| GPQA-D | diamond subset of graduate-level science QA | 95.5 | 95.5 | Gemini 3.1 Pro, 94.3 | Fugu and Ultra tie and lead | Default EvalScope; baseline scores provider-reported. |
| SciCode | scientific coding tasks | 60.1 | 58.7 | Gemini 3.1 Pro, 58.9 | Fugu leads Fugu Ultra and listed baselines | Sakana notes package-version issues can affect legitimate solutions. |
| tau3 Banking | simulated banking dialog/task completion | 21.7 | 20.6 | Claude/GPT tied, 20.6 | Fugu leads | Reported as pass@4 with GPT-5.2 simulator. |
| Long Context Reasoning | long-document retrieval and reasoning | 74.7 | 73.3 | GPT-5.5, 74.3 | Fugu leads the table | Artificial Analysis setup with equality checker and two-hour timeout. |
| MRCRv2 | 8-needle retrieval up to 128K context | 86.6 | 93.6 | GPT-5.5, 94.8 | Fugu Ultra is strong but below GPT-5.5 in this table | Provider-reported baseline caveat. |
What The Benchmarks Actually Mean
SWE Bench Pro is the row that will make coding-agent teams pay attention. It evaluates software engineering issue resolution. Sakana used mini-swe-agent scaffolding and effectively disabled a turn cap in the technical-report configuration. Fugu Ultra’s 73.7 score is strong in Sakana’s table, above Claude Opus 4.8’s listed 69.2. The caveat is obvious but important: a coding benchmark is partly a model test and partly a harness test.
Terminal Bench 2.1 measures hard command-line agent tasks. Fugu scores 80.2 and Fugu Ultra scores 82.1, both above the listed GPT-5.5 baseline of 78.2. This is exactly where a learned orchestrator should help: the system can choose different expert behaviors across a long task rather than relying on one model’s first plan.
LiveCodeBench and LiveCodeBench Pro test competitive programming. Fugu and Fugu Ultra perform well in both, with Ultra leading more clearly on the Pro split. This suggests Fugu’s routing is not only useful for tool-heavy agent loops; it can also help with direct reasoning and code generation. Still, contest-style coding does not fully predict repository maintenance, debugging, or product engineering.
Humanity’s Last Exam and GPQA-Diamond are hard reasoning and knowledge benchmarks. Fugu Ultra narrowly leads on HLE in Sakana’s table, while Fugu and Ultra tie at 95.5 on GPQA-D. These results make Fugu look like a serious reasoning layer, but they also raise an evaluation question: if Fugu can call frontier models, the buyer wants to know how much lift comes from orchestration, how stable that lift is, and how much it costs per solved problem.
CharXiv Reasoning is a chart and figure reasoning benchmark judged with GPT-4o in Sakana’s configuration. Fugu Ultra leads the listed baselines. That is encouraging for multimodal analysis, but any judged benchmark should be read with the judge and rubric in mind.
SciCode, tau3 Banking, and Long Context Reasoning are useful because Fugu, not Fugu Ultra, is the best of the two Fugu models on those rows. That is the kind of inconvenient detail that makes the table more interesting. Ultra is not simply “Fugu plus better.” It is a different operating point. For some tasks, a lower-latency routing choice may be enough, or even preferable.
MRCRv2 is another useful caveat row. Fugu Ultra scores 93.6, but GPT-5.5 is listed at 94.8. That keeps the story honest: Fugu Ultra may be excellent, but it does not dominate every benchmark in Sakana’s own table.
Fugu vs Fugu Ultra
| Dimension | Fugu | Fugu Ultra |
|---|---|---|
| Speed | Optimized for latency and daily interactive use | Trades latency for deeper multi-agent work |
| Cost | Pay-as-you-go depends on the active underlying model tier; subscription access included | Fixed token price for fugu-ultra-20260615, with higher rates above 272K context |
| Agent pool | Can opt out of specific models/providers | Full fixed agent pool |
| Best use cases | Everyday coding, review, chatbots, analysis, interactive services | Hard multi-step reasoning, AI research, paper reproduction, cybersecurity, literature/patent work |
| Benchmark profile | Very strong; leads Ultra on SciCode, tau3 Banking, and Long Context Reasoning in the official table | Best on SWE Bench Pro, Terminal Bench 2.1, LiveCodeBench Pro, HLE, CharXiv, and MRCRv2 among Fugu variants |
| Main risk | May not get the full benefit of deeper multi-agent collaboration | Higher latency and potentially higher cost per task |
| Who should start here | Most developers testing Fugu in normal workflows | Teams with hard tasks where correctness is worth extra time and spend |
Is Sakana Fugu Open Source?
Based on public materials reviewed, Sakana Fugu appears to be an API-based commercial orchestration product, not a conventional open-weight model release. The GitHub repo contains the technical report and related materials, but the repo itself is not proof that Fugu’s model weights, training code, inference stack, and orchestration runtime are open-source.
Use these categories carefully:
- Open-source model: source code and related artifacts are released under an OSI-style open-source license. For AI models, people often misuse this term unless training code, inference code, weights, and license terms are all clear.
- Open-weight model: model weights are downloadable, but the license may have restrictions. Llama is a common example of “open weights, custom license” rather than OSI open-source.
- Source-available model: some code or documents are public, but rights to use, modify, deploy, or redistribute may be limited.
- API-only commercial model: users call a hosted endpoint and do not control the weights or serving stack.
- Orchestration system / router / multi-agent wrapper: a system that coordinates one or more models, tools, prompts, or agents to produce a result. Fugu belongs closest to this family.
That matters for sovereignty, privacy, reproducibility, fine-tuning, vendor dependency, and cost predictability. If your company needs air-gapped deployment or wants to audit every model component, open-weight models are still the natural starting point. If your company wants managed quality on difficult tasks and is comfortable with an external API, Fugu becomes interesting.
Sakana Fugu vs Open-Weight AI Models
To keep this comparison current, I checked public model cards and official sources on June 22, 2026. The open-weight landscape moves quickly: DeepSeek V4 Pro, GLM-5.2, Kimi K2.7 Code, Mistral Small 4, Mistral Medium 3.5, Gemma 4, Llama 4, and NVIDIA Nemotron 3 Ultra are the kind of models a technical buyer might consider alongside an API orchestration layer. For a broader local-model buyer’s guide, see Kingy AI's open-source AI model guide and our open-weight model roundup.
| Model/system | Type | Weights? | Self-host? | License/status | Parameters | Context | Best fit | Cost profile | Control |
|---|---|---|---|---|---|---|---|---|---|
| Sakana Fugu / Fugu Ultra | API-only orchestration system | No | No | Not presented as open-source or open-weight | Not publicly disclosed as standalone model weights | Not publicly disclosed | Agentic coding, hard reasoning, research workflows | Managed API; token pricing plus subscription options | Lower than self-hosted; higher than building your own stack |
| DeepSeek V4 Pro | Open-weight model | Yes | Yes | MIT on Hugging Face card | 1.6T total / 49B active | 1M | Long context, coding, reasoning | Self-hosting infra or third-party inference | High |
| GLM-5.2 | Open-weight model | Yes | Yes | MIT on Hugging Face card | Not summarized cleanly in card excerpt reviewed | 1M | Long-horizon work, SWE-bench Pro, GPQA-Diamond | Self-hosting infra or hosted inference | High |
| Kimi K2.7 Code | Open-weight model | Yes | Yes | Modified MIT per model card | 1T total / 32B active | 256K | Coding agents and multimodal/code workflows | Self-hosting infra or hosted inference | High, subject to modified license |
| Llama 4 Maverick | Open-weight model under custom license | Yes | Yes | Llama 4 Community License | 17B active / 128 experts | 1M on several official/partner cards | Multimodal chat, coding, tool use | Self-hosting or managed providers | Medium-high; license is not OSI open-source |
| Mistral Small 4 119B | Open-weight model | Yes | Yes | Apache 2.0 | 119B total / 6.5B active | 256K | Efficient coding, agentic use, chat | Self-hosting infra or hosted inference | High |
| Mistral Medium 3.5 | Open-weight model under modified license | Yes | Yes | Modified MIT with revenue exceptions | 128B dense | 256K | Agentic coding, instruction following | Self-hosting infra or hosted inference | Medium-high; license needs review |
| Gemma 4 31B | Open-weight model | Yes | Yes | Apache 2.0 on card | 30.7B | 256K | Multimodal, coding, long context | Self-hosting/local/hosted | High |
| NVIDIA Nemotron 3 Ultra | Open-weight model under OpenMDW | Yes | Yes | OpenMDW 1.1 | 550B total / 55B active | Up to 1M | Frontier-scale reasoning, agents, long context | Heavy self-hosting or NVIDIA ecosystem | Medium-high; license-specific obligations |
This is a practical table, not an apples-to-apples scientific ranking. Fugu is a managed orchestration product. DeepSeek, GLM, Kimi, Llama, Mistral, Gemma, and Nemotron are model releases with downloadable weights. The right question is not “which one is philosophically better?” The right question is: which deployment model gives your team the best answer quality, privacy, latency, cost, and operational control for the jobs you actually run?
Benchmark Comparison Against Open Models
Direct benchmark comparison is messy. Sakana’s Fugu table compares Fugu against frontier API models. Many open model cards report benchmark numbers, but the harnesses, dates, scaffolds, and source types differ. DeepSeek V4 Pro’s model card reports GPQA-Diamond and LiveCodeBench numbers. GLM-5.2’s model card reports SWE-bench Pro and GPQA-Diamond. Gemma 4’s card reports LiveCodeBench v6, GPQA Diamond, and MRCR v2. NVIDIA’s Nemotron 3 Ultra card reports SWE-Bench Verified, GPQA, and RULER 1M. These are useful signals, but they should not be mixed into one leaderboard without qualification.
| Benchmark family | Fugu/Fugu Ultra source | Open-model source availability | Comparable? | How to read it |
|---|---|---|---|---|
| SWE-style coding agents | Sakana reports SWE Bench Pro | GLM-5.2 reports SWE-bench Pro; Nemotron reports SWE-Bench Verified; many cards use different SWE variants | Partly | Compare only when benchmark variant and harness match. |
| LiveCodeBench | Sakana reports LiveCodeBench and LiveCodeBench Pro | DeepSeek V4 Pro and Gemma 4 cards report LiveCodeBench variants | Partly | Useful for coding strength, less useful for full agent workflows. |
| GPQA-Diamond | Sakana reports GPQA-D | DeepSeek, GLM, Gemma, and others publish GPQA-Diamond numbers | Somewhat | More comparable than agentic benchmarks, but still check shots and evaluation setup. |
| Humanity’s Last Exam | Sakana reports HLE | Not always present on open model cards | Often no | Use ‘not available’ rather than filling gaps. |
| Long context | Sakana reports Long Context Reasoning and MRCRv2 | DeepSeek, GLM, Gemma, Llama, and Nemotron publish long-context claims/benchmarks | Partly | Context window size is not the same as reliable long-context reasoning. |
| tau3 / tau-bench | Sakana reports tau3 Banking | Some open cards publish tau-style or agentic results, many do not | Usually no | Harness and simulator details matter too much for casual ranking. |
Where Fugu Might Beat Open Models
Fugu’s best argument is not that Sakana has invented a magical single model that makes every open model irrelevant. The best argument is that some tasks benefit from a trained coordination layer. A hard code review may need one model to inspect architecture, another to reason about tests, another to challenge the patch, and another to synthesize a concise answer. A research task may need planner, searcher, verifier, and writer behaviors. A cybersecurity task may need one agent to construct a hypothesis and another to look for ways it fails.
You can build those loops yourself with open models. Many serious teams will. But that means owning routing logic, prompts, retries, model selection, evals, observability, spend controls, and maintenance. Fugu is attractive when you want the benefit of multi-agent behavior without building a custom orchestration stack.
Where Open Models Might Beat Fugu
Open-weight models still win when control is the product requirement. If you need offline inference, private data handling, air-gapped deployment, custom fine-tuning, deterministic version pinning, local batch inference, or direct auditability, Fugu’s managed API shape is a limitation. Open models also let you optimize cost at scale: once you own the hardware or reserve inference capacity, marginal cost can be more predictable than a multi-agent API bill.
Open models can also be lower-latency for simple tasks. If the job is a single-turn extraction, classification, summarization, or code-completion task, running a well-chosen local model may be faster and cheaper than invoking a managed orchestrator. Save orchestration for problems that actually need orchestration.
Cost and Practical Deployment
Sakana offers both subscription and pay-as-you-go pricing. The subscription tiers are Standard at $20/month, Pro at $100/month, and Max at $200/month, with Pro and Max described as 10x and 20x Standard usage. The official page says every subscription tier includes both Fugu and Fugu Ultra. Pay-as-you-go is aimed at heavier production workloads and bills by token usage.
For Fugu, the pay-as-you-go rule is unusual but sensible: if one agent is active, you pay the standard rate for that underlying model. If multiple agents are active, Sakana says it does not stack model fees; you pay a single rate based on the top-tier model involved. For Fugu Ultra, the published fixed price for fugu-ultra-20260615 is $5 per 1M input tokens, $30 per 1M output tokens, and $0.50 per 1M cached input tokens. Above 272K context, the rates rise to $10 input, $45 output, and $1 cached input per 1M tokens.
That is only the raw token story. For multi-agent systems, the better metric is cost per successful completed task. If Fugu Ultra solves a code migration in one expensive run that a cheaper model fails three times, the expensive token rate can be rational. If it calls more agents for a simple answer that an open 30B model could solve locally, the managed orchestrator is wasteful.
| Workflow | Fugu fit | Fugu Ultra fit | Open model fit | Why |
|---|---|---|---|---|
| Simple coding question | Good | Usually overkill | Strong | Low-latency local or hosted open models may be enough. |
| Large code review | Good | Strong | Good with custom scaffolding | Verifier and critic loops can matter. |
| Research report | Good | Strong | Good if you build retrieval/evals | Multi-step planning and synthesis are useful. |
| Paper reproduction | Possible | Strong | Possible but engineering-heavy | Long-horizon tool use and verification matter. |
| Long-context analysis | Good | Good | Strong with DeepSeek/GLM/Nemotron-class models | Context window and retrieval reliability both matter. |
| Production chatbot | Strong for complex support | Usually too slow/expensive | Strong | Most chats do not need deep orchestration. |
| Batch summarization | Often too expensive | Poor fit | Strong | Predictable local inference usually wins. |
What Feels Proven
- Fugu is a real Sakana AI product with official API access.
- Sakana presents Fugu as a learned model orchestration system, not just a hand-coded router.
- Fugu and Fugu Ultra target different latency/quality operating points.
- Official benchmark results are broad and strong, especially on agentic coding and hard reasoning.
- The system is connected to Sakana’s TRINITY and Conductor research.
- Pricing, subscription tiers, and the EU/EEA availability limitation are public on the official page.
What Feels Unproven
Official product specs, launch date, access method, pricing tiers, EU/EEA limitation, and Sakana’s published benchmark table.
Technical-report methodology, first-party benchmark explanations, qualitative examples, and comparisons to provider-reported frontier baselines.
Independent reproduction, latency distributions, total cost per completed production task, failure modes, and enterprise compliance details.
- Whether independent labs can reproduce the full benchmark table.
- Whether Fugu Ultra beats the best open-weight models on real company tasks.
- Latency distribution for long-running multi-agent tasks.
- Total cost per successful task after retries and failures.
- Failure modes when the orchestrator chooses the wrong agent or synthesis strategy.
- Privacy and compliance details beyond the current public page.
- How often the agent pool changes and how much users can audit routing decisions.
- Whether Fugu should be treated as a replacement for self-hosted open models in regulated workflows.
Should Developers Use Sakana Fugu?
Developers should try Fugu if they want a simple API that may outperform individual model calls on complex workflows. Start with Fugu for normal coding, code review, analysis, and interactive services. Try Fugu Ultra when the task is hard enough that correctness is worth more than speed: reproducing a paper, reviewing a security-sensitive codebase, investigating patents, or running a multi-step research loop.
Do not switch production workloads because a benchmark table looks exciting. Build a small eval set from your own work: 20 to 50 real tasks, with expected answers, human review, latency measurement, token accounting, retries, and failure tags. Compare Fugu, Fugu Ultra, your current frontier API, and at least one leading open-weight model. Track cost per completed task, not just cost per million tokens.
Should Businesses Care?
Yes, because Fugu is strategically interesting. It reduces dependence on any one model vendor by turning a pool of models into one managed interface. But it does not remove vendor dependency entirely. You still depend on Sakana’s hosted product, availability, pricing, compliance posture, and routing decisions. The business category is best described as a managed multi-agent intelligence layer.
That category is likely to matter. Many enterprises do not want to hand-wire model routers, agent frameworks, eval harnesses, prompt libraries, and monitoring systems. They want a reliable API that handles hard tasks. Fugu is a credible version of that idea. Open-weight models remain the better choice when the business priority is sovereignty, private deployment, or fully controlled economics.
Should Creators Care?
AI creators should care because Fugu sits at a useful intersection: frontier benchmarks, open-weight model competition, orchestration, coding agents, research automation, and vendor lock-in. It is a better topic than another generic model launch because the central tension is real. Is orchestration the next scaling axis? Can a managed multi-agent system beat a self-hosted model stack on actual work? Where does benchmark lift turn into buyer value?
Good creator angles include: “I tested Fugu against open models,” “Can Fugu beat self-hosted AI?”, “Is orchestration the next scaling law?”, and “Fugu Ultra vs open-source coding models.” For teams that want distribution around deep technical launches, Kingy also has a relevant sponsor a deep-dive video on Kingy AI page.
Final Verdict
Sakana Fugu is one of the most interesting AI launches of 2026 if the official results hold up. But it should not be casually described as an open-source model. It competes with open-weight models at the workflow layer, not at the license or deployment layer.
Fugu may be better for teams that want maximum answer quality through one API and do not want to build a multi-agent stack. Open models may be better for teams that need control, self-hosting, auditability, fine-tuning, private deployment, or predictable high-volume inference. The benchmark results are strong enough to justify serious testing. They are not strong enough to skip your own evals.
FAQ
What is Sakana Fugu?
Sakana Fugu is a managed multi-agent orchestration system exposed as a single OpenAI-compatible model API. It coordinates a pool of specialist models behind the scenes.
Is Sakana Fugu open source?
Based on public materials reviewed, no. It appears to be an API-based commercial orchestration product, not a conventional open-source model release.
Is Fugu Ultra open weight?
No public source reviewed shows downloadable Fugu Ultra weights.
Can I self-host Fugu?
Self-hosting is not publicly disclosed in the official product materials reviewed.
What is the difference between Fugu and Fugu Ultra?
Fugu balances performance and latency and allows opt-outs from the agent pool. Fugu Ultra prioritizes answer quality on hard tasks and uses the full fixed agent pool.
What benchmarks does Fugu perform best on?
In Sakana’s official table, Fugu Ultra is especially strong on SWE Bench Pro, Terminal Bench 2.1, LiveCodeBench Pro, Humanity’s Last Exam, CharXiv Reasoning, GPQA-D, and MRCRv2. Fugu itself leads Ultra on SciCode, tau3 Banking, and Long Context Reasoning.
Did Fugu beat GPT, Gemini, or Claude?
In Sakana’s first-party table, Fugu or Fugu Ultra beats the listed GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.8 baselines on many rows, but baseline scores are often provider-reported and not always independently reproduced in the same harness.
How does Fugu compare to DeepSeek, Qwen, Llama, and Mistral?
Fugu is a managed orchestration API. DeepSeek, Qwen, Llama, and Mistral releases are generally open-weight model families. Compare them by workflow fit, not just benchmark score.
Is Fugu better than open-source AI models?
Sometimes, for complex multi-step work where orchestration helps. Open models can be better for self-hosting, privacy, cost control, and reproducibility.
What is the best open-source alternative to Fugu?
There is no exact alternative because Fugu is an orchestration layer. Strong open-weight candidates include DeepSeek V4 Pro, GLM-5.2, Kimi K2.7 Code, Mistral Small 4, Gemma 4, Llama 4, and NVIDIA Nemotron 3 Ultra, depending on workload and license needs.
Does Fugu use open-source models internally?
The official materials describe a pool of powerful models and frontier agents, but the exact full internal pool and routing are not fully disclosed publicly.
Can users control which models Fugu uses?
For Fugu, the official FAQ says users can opt out of specific models from the console settings. Fugu Ultra relies on the full fixed agent pool.
How much does Sakana Fugu cost?
Subscriptions are Standard $20/month, Pro $100/month, and Max $200/month. Fugu Ultra pay-as-you-go is listed at $5/M input, $30/M output, and $0.50/M cached input, with higher rates above 272K context.
Is Fugu good for coding?
The official benchmark table suggests strong coding performance, especially on SWE Bench Pro, Terminal Bench 2.1, and LiveCodeBench. Developers should still test it on their own repositories.
Is Fugu good for research?
It is positioned for research workflows, paper reproduction, literature investigation, and hard multi-step tasks. Independent production case studies are still early.
What are the biggest caveats?
Open-source status, self-hosting, independent reproduction, latency distribution, cost per completed task, routing transparency, and compliance details.
Should developers try it?
Yes, if they have complex workflows worth benchmarking. No, if their immediate need is a fully self-hosted or open-weight model.
Sources
- Sakana Fugu official page
- Sakana Fugu launch post
- Fugu technical report PDF
- Sakana Fugu GitHub repo
- TRINITY paper
- Conductor paper
- SWE-bench
- Terminal-Bench
- LiveCodeBench
- Artificial Analysis
- DeepSeek V4 Pro model card
- GLM-5.2 model card
- Kimi K2.7 Code model card
- Llama 4 model card
- Mistral Small 4 model card
- Mistral Medium 3.5 model card
- Gemma 4 model card
- NVIDIA Nemotron 3 Ultra model card
- Kingy AI earlier Sakana Fugu Ultra benchmark breakdown
- Kingy AI AI Launch Tracker
- Kingy AI AI tool directory
- Kingy AI coding agent guide
- Kingy AI open-source AI model guide






