Rio 3.5 Open 397B: A Serious Open Model Release, Or A Benchmark Claim In Need Of An Audit?

Rio 3.5 Open 397B is a real, large, open-weight Hugging Face release tied to Rio de Janeiro’s municipal technology company. Its specs are impressive. Its benchmark claims are even bigger. The right response is neither hype nor dismissal, but verification.

As of June 14, 2026, this analysis treats the Hugging Face repository and model card as primary sources while distinguishing confirmed repository evidence from first-party benchmark claims that have not yet been independently reproduced.

TL;DR

Rio 3.5 Open 397B is a newly published Hugging Face model under prefeitura-rio/Rio-3.5-Open-397B. The repository is real, large, and downloadable: the Hugging Face API lists 97 safetensor shards, 110 repository files, and safetensor metadata totaling 403,397,928,944 tensor parameters, with an index total size of 806,795,875,168 bytes, or roughly 807 GB decimal / 751 GiB binary. The model card identifies the base model as Qwen/Qwen3.5-397B-A17B, tags the release as MIT-licensed, and says it was developed by IplanRIO, the municipal IT company associated with Rio de Janeiro’s city government. Hugging Face model card, Hugging Face API, raw config, safetensors index

The headline claims are substantial: about 397B total parameters, about 17B active parameters, a sparse Mixture-of-Experts architecture, multimodal input, multilingual coverage, a 1,010,000-token context window, and first-party benchmark results that place Rio close to leading open and proprietary models. The important caveat is that most Rio-specific benchmark claims appear only on the Rio model card at the time of writing. I found repository evidence for the weights, architecture, base-model relationship, modality configuration, and a permissive license tag. I did not find independent Rio benchmark reproductions, a published Rio training report, or repo-local code implementing the claimed SwiReasoning inference path.

That makes Rio 3.5 Open 397B interesting, but not yet settled. Builders should treat it as an auditable open-weight release with aggressive first-party claims, not as an independently proven new frontier champion. For broader context on choosing between closed, open, and local models, see Kingy AI’s guide: Which AI Model Should You Use?

Why This Release Matters

The unusual part is not just the scale. Open-weight MoE releases at hundreds of billions or trillions of total parameters are now part of the competitive landscape. Qwen, DeepSeek, Kimi, MiniMax, Meta, Z.ai/GLM, and others have already made large open or open-weight models central to the AI race.

The unusual part is the name on the repository. The Rio model card says the model was developed by IplanRIO – Empresa Municipal de Informatica e Planejamento S.A., which it describes as the municipal IT company of Rio de Janeiro’s city government. IplanRIO’s official site describes the organization as the municipal company responsible for administering the city’s information and communications technology resources. Rio model card, IplanRIO official site

That is a different release profile from the usual frontier-lab playbook. Frontier systems are normally launched by OpenAI, Anthropic, Google DeepMind, xAI, Alibaba/Qwen, DeepSeek, Moonshot, Meta, Mistral, MiniMax, or Z.ai. A city-government-associated technology company publishing a Qwen-derived, frontier-scale open-weight model is therefore notable even before the benchmark table enters the room.

It also lands at the center of a live strategic question: will the next useful AI stack be controlled mostly through closed APIs, or will open models become good enough for serious agentic software, government, and enterprise work? Kingy has covered that broader sovereignty question here: Open-source AI models, local LLMs, hardware, and AI sovereignty.

What Rio 3.5 Open 397B Is

Rio 3.5 Open 397B appears to be a post-trained derivative of Qwen3.5-397B-A17B. The Rio README front matter lists base_model: Qwen/Qwen3.5-397B-A17B, and the Hugging Face tags include base_model:Qwen/Qwen3.5-397B-A17B. The text configuration is also structurally aligned with Qwen3.5: qwen3_5_moe_text, 60 layers, 512 experts, 10 selected experts per token, 4096 hidden size, 32 attention heads, and 2 key-value heads. Rio README, Rio config, Qwen3.5-397B-A17B card

In plain English, it is not a dense 397B model that activates every parameter for every token. It is a Mixture-of-Experts model. In an MoE model, many expert subnetworks exist, but only a subset is routed for each token. That is why the model card can claim roughly 397B total parameters and roughly 17B active parameters. Total parameters shape storage and memory requirements; active parameters shape part of per-token compute cost. Both matter.

AI-generated illustration of MoE routing — AI-generated illustration: Mixture-of-Experts routing. No benchmark or architectural data is embedded in this image.

Technical Specs

Spec	What the source says	Verification status
Repository	`prefeitura-rio/Rio-3.5-Open-397B` on Hugging Face	Confirmed
Developer claim	IplanRIO / municipal IT company of Rio de Janeiro	Claimed by model card; IplanRIO role corroborated by official site
Base model	`Qwen/Qwen3.5-397B-A17B`	Confirmed in README metadata and HF tags
Architecture	`Qwen3_5MoeForConditionalGeneration`; text config `qwen3_5_moe_text`	Confirmed in config
Total parameters	Model card: about 397B; HF safetensors metadata: 403.397B tensor parameters	Confirmed with counting caveat
Active parameters	Model card/base model: about 17B active	Claimed by model card and Qwen base card
Text layers	60	Confirmed in config
Experts	512 experts, 10 selected experts per token	Confirmed in config
Hidden size	4096	Confirmed in config
Attention heads	32 query heads, 2 key-value heads	Confirmed in config
Context window	README claim: 1,010,000 tokens; config: `max_position_embeddings` 262,144; examples use 1,048,576 in vLLM/SGLang	Mixed evidence; long-context serving likely requires overrides/scaling
Modalities	Hugging Face pipeline: image-text-to-text; config includes image/video token IDs and vision config	Plausible/confirmed at config level; needs independent task testing
Languages	README lists multilingual support and front matter lists `pt`, `en`	Claimed; broad language quality not independently verified here
Precision	Safetensors metadata: BF16 parameters plus 8,640 F32 values	Confirmed by HF API
Weight footprint	Safetensors index total size: 806,795,875,168 bytes	Confirmed
Repository files	97 `.safetensors` shards; 110 sibling files via HF API; no repo-local `.py` implementation files observed	Confirmed at retrieval time
License	README front matter `license: mit`; HF tags `license:mit`; no separate `LICENSE` file observed in the sibling list I retrieved	MIT-tagged, but package-level license artifacts should be audited before commercial use
Inference examples	Transformers, vLLM, and SGLang examples in README	Confirmed
Quantizations	Not part of the original repo files reviewed; Hugging Face model search shows derivative quantized listings may appear	Verify each derivative separately

Chart: repository metadata versus prose claims

The context-window caveat deserves special attention. Rio’s README lists a 1,010,000-token context window and gives vLLM/SGLang examples using 1,048,576 tokens. But the actual config.json sets max_position_embeddings to 262,144. Qwen’s own Qwen3.5 card says Qwen3.5 “natively supports” up to 262,144 tokens and gives YaRN/RoPE-scaling instructions for reaching 1,010,000 tokens. That makes the safest interpretation: 262K is the native/configured context; roughly 1M is a long-context serving mode that likely depends on scaling overrides and memory budget. Rio README, Rio config, Qwen3.5 card long-context section

AI-generated illustration of long-context document flow — AI-generated illustration: long-context processing. The verified config value is 262,144 positions; the 1M-token claim is a README/serving-mode claim.

Benchmark Claims

The Rio model card includes a large benchmark table covering agentic coding, software engineering, reasoning, math, multilingual, multimodal, instruction following, agents, and an estimated economic-value benchmark. The repository also includes the following benchmark graphic.

Hugging Face benchmark chart for Rio 3.5 Open 397B — Source image from the Rio 3.5 Open 397B Hugging Face repository: rio-3.5-open-benchmarks.png. Treat it as a first-party chart unless independently reproduced.

Here is the core issue: benchmark numbers are not all equal. A number can be independently reproduced, benchmark-host reported, vendor reported, model-card-only, or copied from a comparison table with unclear harness settings. Rio’s model-card numbers should currently be treated as first-party/model-card-only unless a benchmark host or third party publishes reproducible Rio runs.

Category	Benchmark	Rio 3.5 Open 397B	Qwen3.5 base in Rio card	Source status
Agentic coding	Terminal-Bench 2.1	70.8	52.5	Rio model card only
Agentic coding	DeepSWE	23.0	6.0	Rio model card only
Agentic coding	SWE-Bench Pro	58.1	50.9	Rio model card only
Agentic coding	SWE-Bench Verified	80.2	76.2	Rio model card only
Agentic coding	SWE-Bench Multilingual	77.0	69.3	Rio model card only
Knowledge/reasoning	GPQA Diamond	90.9	88.4	Rio model card only
Knowledge/reasoning	Humanity’s Last Exam	36.5	28.7	Rio model card only
Knowledge/reasoning	MMLU-Pro	88.0	87.8	Rio model card only
Knowledge/reasoning	MMLU-Redux	94.6	94.9	Rio model card only
Knowledge/reasoning	SuperGPQA	72.3	70.4	Rio model card only
Knowledge/reasoning	Apex	29.2	9.4	Rio model card only
Math	HMMT 2026 Feb	93.9	87.9	Rio model card only
Math	IMOAnswerBench	89.5	80.9	Rio model card only
Multilingual	MMMLU	89.8	88.5	Rio model card only
Multilingual	MMLU-ProX	85.6	84.7	Rio model card only
Multimodal	MMMU-Pro	78.4	79.0	Rio model card only
Multimodal	MathVision	89.1	88.6	Rio model card only
Multimodal	VideoMMMU	81.6	84.7	Rio model card only
Tool/instruction	MCP-Atlas	74.2	74.2	Rio model card only
Tool/instruction	IFBench	78.4	76.5	Rio model card only
Tool/instruction	IFEval	93.4	92.6	Rio model card only
Economic value	GDPval estimated	1533	1200	Rio model card only; methodology unclear from card

Chart: selected coding and agent benchmarks from the Rio model card

The benchmark families themselves are real. Terminal-Bench evaluates terminal agents in realistic command-line environments; SWE-Bench Verified is a human-validated subset of GitHub issue-resolution tasks; SWE-Bench Pro extends software-engineering evaluation to harder enterprise-style tasks; DeepSWE is a long-horizon software-engineering benchmark; and MCP-Atlas evaluates tool use across real MCP servers. Terminal-Bench, SWE-Bench Verified, SWE-Bench Pro, DeepSWE, MCP-Atlas

But a real benchmark is not the same thing as a verified score. A model-card table does not tell us whether the run used a standard scaffold, whether the same harness was used across models, whether tool access was comparable, whether contamination screens were run, whether the benchmark version was the same, or whether the reported comparison values came from benchmark hosts, vendors, or internal reruns.

AI-generated illustration of benchmark audit workflow — AI-generated illustration: benchmark auditing. The article’s benchmark tables are sourced from the Rio model card unless otherwise noted.

Reported Gains Over Qwen3.5 Base

The Rio card’s most meaningful comparison is not Rio versus every frontier model. It is Rio versus its base model. Because the model claims to be post-trained from Qwen3.5-397B-A17B, the base-model delta is the natural place to look first.

The reported gains are large in some places: +18.3 on Terminal-Bench 2.1, +17.0 on DeepSWE, +19.8 on Apex, +8.6 on IMOAnswerBench, +7.8 on HLE, and +7.7 on SWE-Bench Multilingual. If independently reproduced, that would be a meaningful post-training result. But the word “if” is doing real work.

One Hugging Face discussion adds useful color. A user asked whether the Rio benchmarks were run with SwiReasoning active and whether the model would still improve without latent reasoning. A reply from Sangu1nius answered “With!” and gave an ablation-style mini-table: Qwen base at 80.9 on IMOAnswerBench, 50.9 on SWE-Bench Pro, and 9.4 on Apex; “+ Training” at 84.5, 54.8, and 22.9; and “+ Latent Reasoning” at 89.5, 58.1, and 29.2. That suggests the published Rio headline numbers may depend materially on the latent-reasoning inference path. HF discussion #2

That is important because not every inference stack can implement soft/latent reasoning. The same discussion explicitly notes that engines limited to discrete token generation, such as llama.cpp, cannot currently implement that style of continuous soft-embedding reasoning. HF discussion #2

SwiReasoning: What It Is, And What Is Actually Released

Rio’s model card says the model “features SwiReasoning,” a training-free inference framework from Shi et al. that switches between explicit chain-of-thought reasoning and latent-space reasoning. The arXiv abstract for SwiReasoning describes exactly that: dynamic switching between explicit and latent reasoning guided by entropy trends in next-token distributions, plus a switch-count control mechanism to reduce overthinking. The paper reports average accuracy gains of 1.8%-3.1% and token-efficiency gains of 57%-79% under constrained budgets across mathematics, STEM, coding, and general benchmarks. SwiReasoning arXiv, SwiReasoning GitHub

The idea is straightforward, at least at a high level:

Explicit reasoning emits intermediate natural-language reasoning tokens. It is inspectable, but token-expensive.
Latent reasoning keeps some reasoning in continuous hidden space or soft-embedding form. It may be more token-efficient, but it is harder to inspect and harder to support in ordinary inference engines.
Switching tries to use entropy/confidence signals to decide when to explore in latent space and when to commit in explicit text.

AI-generated illustration of explicit versus latent reasoning paths — AI-generated illustration: explicit step-by-step reasoning and latent reasoning paths converging. It is conceptual, not a diagram from the SwiReasoning paper.

The catch: the Rio repository files I reviewed do not include repo-local Python code implementing SwiReasoning. The Hugging Face API file list includes model weights, configs, tokenizer files, chat_template.jinja, and preprocessors, but no .py files. The README examples show standard Transformers loading plus vLLM and SGLang serving commands. That does not disprove that the model was trained to work well with SwiReasoning, but it does mean the integration is not self-evident from the released repository alone. Rio API, Rio README

For readers who want a fuller primer on reasoning tokens and chain-of-thought tradeoffs, see Kingy AI’s explainer: Chain-of-thought explained.

How It Compares To Frontier Labs

Rio’s first-party numbers suggest a model that may be competitive with leading open-weight systems and within striking distance of some frontier API models on selected rows. But frontier comparison is where benchmark hygiene matters most.

OpenAI’s GPT-5.5 page, Anthropic’s Claude Opus 4.8 page, Google’s Gemini 3.1 Pro model page, and xAI’s Grok 4 materials all publish their own benchmark packages, often with different benchmark versions, scaffolds, tools, and inference settings. OpenAI’s GPT-5.5 page, for example, reports Terminal-Bench 2.0 and SWE-Bench Pro values, while Rio’s card lists Terminal-Bench 2.1. Those are not automatically comparable. OpenAI GPT-5.5, Anthropic Claude Opus 4.8, Google Gemini 3.1 Pro, xAI Grok 4

The real comparison looks like this:

Dimension	Rio 3.5 Open 397B	Frontier lab models
Weights	Public Hugging Face weights	Usually closed, API-only
License/control	MIT-tagged on HF, but license package should be audited	API terms, no weights
Reproducibility	Weights are available; eval harness not published in repo	Evals usually vendor-reported; some third-party leaderboards exist
Deployment	Possible self-hosting, but ~807 GB BF16 footprint before KV cache and serving overhead	Hosted APIs hide infra complexity
Latency/cost	MoE active-parameter design helps, but full weights still need large distributed memory	API price/latency depends on vendor tier
Long context	262K config; 1M serving claim likely via RoPE/YaRN-style scaling	Many frontier APIs advertise 1M-class contexts, but details vary
Multimodal	Config supports image/video tokens and vision tower	Frontier labs generally have mature multimodal products
Agentic ecosystem	README points to vLLM/SGLang/Transformers and Qwen Code	Closed labs have integrated agents, connectors, computer use, and managed tools

Rio’s advantage is control. A user can download the weights, inspect the config, run local or private infrastructure, and fine-tune or adapt subject to license review and hardware budget. Frontier labs’ advantage is operational maturity: hosted reliability, safety/process documentation, product integrations, SDKs, tool ecosystems, and larger public testing footprints.

If you are building agents, that distinction matters. Agentic performance is not only “model IQ”; it is model plus tools, memory, sandboxing, retry logic, browser/terminal harness, context management, and eval discipline. Kingy has a broader implementation-oriented guide here: The state of AI agents in 2026, and an MCP-specific explainer here: What is Model Context Protocol?.

How It Compares To Heavyweight Open Models

Rio enters a crowded heavyweight open-weight field. The relevant peers are not only dense Llama-class models, but sparse MoE systems that try to combine huge total parameter counts with smaller active compute.

Model	Public source	Parameters / active parameters	Context claim	Modality	License/open status
Rio 3.5 Open 397B	Hugging Face	Card: ~397B / ~17B active; HF metadata: 403.4B tensor params	Config 262K; README 1.01M / examples 1.048M	Image-text-to-text config	MIT-tagged HF repo; no separate LICENSE file observed in API sibling list
Qwen3.5-397B-A17B	Hugging Face/Qwen	397B / 17B active	Native 262K; hosted Plus says 1M default; YaRN instructions to ~1.01M	Multimodal	Apache-2.0 tag
DeepSeek-V4-Pro	Hugging Face	1.6T / 49B active	1M	Text-focused MoE card; check repo for modality details	MIT
Kimi-K2.6	Hugging Face/Moonshot	1T / 32B active	256K	Image-text-to-text; MoonViT vision encoder	Modified MIT
MiniMax-M3	Hugging Face	~428B / ~23B active	1M	Native multimodal text/image/video	MiniMax community license
GLM-5	Hugging Face/Z.ai	744B / 40B active	Z.ai says long-context with DSA; HF snippet cites 744B/40B	Model family varies by endpoint/card	MIT according to Z.ai blog/HF card
Llama 4 Scout / Maverick	Meta	Scout: 17B active, 16 experts; Maverick: 17B active, 128 experts	Scout advertised up to 10M context	Natively multimodal	Open-weight Llama license

Sources: Rio model card, Qwen3.5-397B-A17B, DeepSeek-V4-Pro, Kimi-K2.6, MiniMax-M3, GLM-5, Meta Llama 4

Against those peers, Rio’s positioning is narrower than the model-card headline might imply:

Versus Qwen3.5, Rio is best understood as a post-training and inference-strategy claim on top of the same base family. The reported gains over base are the key thing to reproduce.
Versus DeepSeek-V4-Pro, Rio is smaller in total and active parameters, and its license tag is attractive, but DeepSeek’s own V4 card claims a much larger 1.6T/49B MoE and 1M context.
Versus Kimi-K2.6, Rio is smaller and appears less obviously agent-productized. Kimi has a stronger public narrative around long-horizon coding and agent orchestration; Rio’s SWE-Bench Pro result is slightly lower than Kimi’s value in Rio’s own comparison table.
Versus MiniMax-M3, Rio is similar in total-parameter class but lower in claimed active parameters. MiniMax emphasizes native multimodality and sparse attention for 1M context.
Versus GLM-5, Rio is smaller and less documented, but potentially easier to reason about as a Qwen-derived release if the Qwen ecosystem supports it.
Versus Llama 4, Rio is in a different open-weight lineage: Qwen MoE rather than Meta’s Llama MoE family. Llama’s ecosystem maturity remains a major counterweight.

Kingy has covered some of these peer families in more detail here: DeepSeek V4 deep dive, Kimi K2.6 analysis, and AI model picker.

Deployment Realities

The “17B active” number can be misleading if read casually. It does not mean Rio is as easy to run as a 17B dense model. Sparse MoE still requires the full expert weights to be present somewhere in the serving system. For Rio, the safetensors index alone reports about 807 GB of model data. Add KV cache, allocator overhead, tensor parallelism, communication overhead, and long-context memory pressure, and this becomes a serious multi-GPU or hosted-inference project.

The README examples use --tensor-parallel-size 8 for vLLM and --tp 8 for SGLang. That does not prove eight GPUs are always sufficient, because GPU type, precision, quantization, context length, batch size, and KV-cache strategy dominate the practical answer. It does, however, signal that the authors expect distributed serving. Rio README

Quantized derivatives may make experimentation easier, and Hugging Face search already exposes derivative quantized listings tied to the Rio base-model metadata. But quantization is not magic. It can reduce memory footprint, but long-context KV cache, expert routing, throughput, and quality degradation still need testing. Hugging Face quantized model search

Claim Audit

Claim	Rating	Why
The Hugging Face repo exists and contains large model weights	Confirmed	HF API lists the repo, 97 safetensor shards, and 806.8 GB index total size
It is based on Qwen3.5-397B-A17B	Confirmed	README front matter and HF tags list the Qwen base model
It uses a Qwen3.5 MoE architecture	Confirmed	Config lists `Qwen3_5MoeForConditionalGeneration`, 512 experts, 10 experts per token
It is roughly 397B total / 17B active	Plausible/mostly confirmed	Model card and Qwen base card say this; HF safetensors metadata counts 403.4B tensor parameters
It has a 1,010,000-token context window	Plausible but needs caveat	README claims it; config says 262,144 max positions; Qwen base docs describe 1.01M via RoPE/YaRN scaling
It is multimodal	Plausible/config-confirmed	HF pipeline is image-text-to-text and config has vision/video fields; independent multimodal evals not found
It is MIT licensed	Partially confirmed	README metadata and HF tags say MIT; no separate `LICENSE` file observed in retrieved sibling list
It includes SwiReasoning	Unclear	Card claims it and a discussion reply says benchmark numbers used it; repo has no local implementation files
It beats or matches leading open models	Unverified	Rio-specific scores are currently model-card-only in sources reviewed
It competes with frontier labs	Provisional	It may on selected first-party rows, but frontier comparisons mix benchmark versions, tool settings, and vendor-reported values
GDPval estimate of 1533	Weak / unclear	Model card gives the value but not enough methodology to treat it as a reproducible public score

Bottom Line

Rio 3.5 Open 397B is worth taking seriously because the weights and config are real, the base model is clear, the parameter footprint is large, and the release comes from an unusual institutional source. If the reported post-training and SwiReasoning gains reproduce, it could be an important open-weight model for coding agents, multilingual work, and high-control deployments.

But the responsible headline is not “municipal model beats frontier labs.” The responsible headline is: a Rio-associated team has released a large Qwen3.5-derived open-weight MoE with bold benchmark claims that now need independent reproduction.

For builders, the next steps are obvious:

Run a small, auditable subset of SWE-Bench Pro, SWE-Bench Verified, Terminal-Bench 2.1, and DeepSWE with a published harness.
Test Rio with and without SwiReasoning, because the discussion-thread ablation implies latent reasoning may account for a meaningful share of the gains.
Verify long-context behavior at 262K, 524K, and ~1M with retrieval, reasoning, and degradation tests.
Audit the license package before commercial deployment.
Benchmark against Qwen3.5-397B-A17B under identical serving conditions.

Until then, Rio 3.5 Open 397B should sit in the promising-but-unproven bucket: a real open-weight release, a fascinating governance story, and a benchmark table that deserves a hard, careful second pass.

Source List

Primary Rio sources

Base model and open-weight peers

Frontier-lab context

Benchmark and reasoning sources

Internal Kingy AI links used

Rio 3.5 Open 397B: A Serious Open Model Release, Or A Benchmark Claim In Need Of An Audit?

Curtis Pyke

Related Posts

The Soft Nationalization of AI Has Begun

Anthropic’s Fable 5 Shutdown: Did the U.S. Just Start Export Controls for AI Models?

Kimi K2.7 Code Released: Benchmarks, Specs, and How It Compares

Leave a Reply Cancel reply

Get Kingy AI Launch Intelligence

Recent News