• AI News
  • Blog
  • AI Calculators
    • AI Video Sponsorship: Calculate Your ROI
    • AI Agent Directory & Readiness Scorecard
    • AI Search Visibility Calculator
    • Build Your AI Workflow Stack: Find the Best AI Tools for Your Job, Budget, and Skill Level
    • 100 AI Agent Use Cases That Actually Work in 2026: Real Workflows for Founders, Marketers, Creators, and Operators
  • AI Courses
    • OpenAI Codex Course for Beginners: Build Apps Without Coding
    • AI Agents for Beginners: Build Your First AI Worker Without Coding
    • AI Coding Foundations for Beginners
    • AI Workflow Operator Course for Beginners
    • AI Search Visibility Course for Beginners
    • AI Video Production Course for Beginners
    • MCP, AGENTS.md, and Context Engineering for Beginners – Online Course
    • AI Browser Agents for Beginners: Use AI Websites Safely – Full Course
  • Microsoft Copilot – Zero To Hero
    • Codex Zero to Hero
  • AI Launch Radar
  • Clients
  • Contact
  • Sponsorship & Youtube
Wednesday, June 3, 2026
Kingy AI

AI Launch Tracker – Miso One: The 8B Open‑Source Voice Model That Wants to Out‑Emote Humans

by Curtis Pyke
June 3, 2026
in AI launch radar, AI News
Reading Time: 13 mins read

On the morning of June 3, 2026, a tweet from Miso Labs co-founder Aoden Teo, reposted by tech evangelist Robert Scoble, slid into a lot of timelines with a bold claim: “the most emotive voice model in the world.” By 9:06 AM the post had already crossed 312,000 views. The product is called Miso One, and the pitch is deceptively simple. It’s an 8-billion-parameter text-to-speech (TTS) model that, in Teo’s words, “emotes like a human and responds faster than a human, with just 110 milliseconds of latency.”

What makes the launch genuinely interesting isn’t just the marketing. Miso Labs shipped the model weights as open source on day one, with API access promised to follow. In a field where the best-sounding voices have historically lived behind closed APIs and metered billing, that’s a meaningful gesture. Let’s dig into what Miso One actually is, what the published numbers say, and — the question everyone in my DMs keeps asking — whether you can run it on your own hardware.

Today, we’re excited to introduce Miso One, the most emotive voice model in the world.

Miso One is an 8-billion-parameter text-to-speech model for highly expressive speech generation. It emotes like a human and responds faster than a human, with just 110 milliseconds of… pic.twitter.com/shjSjmbWKV

— Aoden Teo (@AodenTeoMT) June 3, 2026

What Miso One actually is

Stripped of the marketing language, Miso One (the model repository is named MisoTTS) is an 8-billion-parameter text-to-dialogue RVQ Transformer. If you’ve been following the open speech-synthesis scene, the lineage will feel familiar: Miso Labs describes the model as “inspired by the Sesame CSM architecture” — the same conversational-speech-model family that made waves for sounding eerily natural in back-and-forth dialogue.

According to the official GitHub README, the model “generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.” In plainer terms: it borrows the transformer brain of a Llama-style language model and repurposes it to predict audio tokens instead of words.

The design splits into two transformer components, and understanding that split is key to understanding why the model is both expressive and reasonably fast:

  • A large backbone transformer that consumes interleaved text and audio-frame embeddings. Because it accepts both modalities, it can condition its output on the conversation history — meaning it “remembers” how the dialogue has been flowing and adjusts tone accordingly.
  • A smaller decoder transformer that autoregressively predicts the higher-order audio codebooks within each frame.

The Hugging Face model card spells out the division of labor precisely: “Codebook 0 is predicted from the backbone hidden state, while codebooks 1 through 31 are predicted by the audio decoder autoregressively in codebook depth.” So the heavyweight backbone sets the coarse acoustic direction, and the lightweight 300M decoder fills in the fine-grained detail. This two-stage approach is the same trick Sesame’s CSM popularized, and it’s a big part of why the model can stay responsive without sacrificing nuance.
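To make the division of labor concrete, here is a toy sketch of the per-frame loop the model card describes: one backbone step yields codebook 0, then a smaller decoder fills in codebooks 1 through 31 in depth order. The functions `toy_backbone`, `toy_codebook0_head`, and `toy_decoder` are hypothetical stand-ins that return arbitrary integers, not Miso Labs code, and the codebook size of 2,048 is an assumption for illustration.

```python
import random

NUM_CODEBOOKS = 32    # from the published spec sheet
CODEBOOK_SIZE = 2048  # assumption: toy codebook size, for illustration only


def toy_backbone(history):
    """Hypothetical stand-in for the large backbone: one 'hidden state' per frame."""
    # A real backbone would attend over interleaved text and audio embeddings.
    return sum(sum(frame) for frame in history) + len(history)


def toy_codebook0_head(hidden):
    """Codebook 0 is read directly off the backbone hidden state."""
    return hidden % CODEBOOK_SIZE


def toy_decoder(hidden, codes_so_far, rng):
    """Hypothetical stand-in for the small decoder: the next code depends on
    the hidden state and on the codes already chosen for this frame."""
    return (hidden + sum(codes_so_far) + rng.randrange(CODEBOOK_SIZE)) % CODEBOOK_SIZE


def generate_frame(history, rng):
    hidden = toy_backbone(history)         # one expensive step per frame
    codes = [toy_codebook0_head(hidden)]   # codebook 0 from the backbone
    for _ in range(1, NUM_CODEBOOKS):      # codebooks 1..31, autoregressive in depth
        codes.append(toy_decoder(hidden, codes, rng))
    return codes


rng = random.Random(0)
history = []
for _ in range(3):
    history.append(generate_frame(history, rng))

print(len(history), len(history[0]))  # 3 32
```

The point of the structure is cost: the expensive backbone runs once per frame, while the 31 cheaper decoder steps handle the fine detail.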

The model is explicitly built for conversational speech generation and voice continuation from prompt audio — not just reading static paragraphs aloud. That conversational framing matters, because it’s the difference between an audiobook narrator and a voice agent that can hold a phone call.


The specs, in a table

Miso Labs publishes a clean model summary in both the GitHub repo and on Hugging Face. Here’s the consolidated spec sheet:

| Item | Value |
| --- | --- |
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | RVQ Transformer (Sesame-style CSM) |
| Backbone | llama-8B |
| Audio decoder | llama-300M |
| Text vocabulary | 128,256 |
| Audio vocabulary | 2,051 |
| Audio codebooks | 32 |
| Audio tokenizer | Mimi |
| Max sequence length | 2,048 |

A few things stand out. The text vocabulary of 128,256 is the standard Llama 3 tokenizer size, another confirmation of the Llama 3.2-style heritage. The 32 audio codebooks and the tiny 2,051-token audio vocabulary are characteristic of the Mimi neural audio codec, which compresses speech into a stack of discrete residual-vector-quantization (RVQ) tokens. The 2,048-token max sequence length is the practical ceiling on how much combined text-and-audio context you can feed in a single generation: generous for turn-by-turn dialogue, though something to keep in mind for very long monologues.
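For a rough feel of what a 2,048-position window buys you, here is a back-of-the-envelope sketch. It rests on two assumptions that Miso Labs has not confirmed for this model: that each audio frame occupies one backbone position (as in Sesame's CSM), and that audio is framed at Mimi's published 12.5 Hz rate. The helper names are hypothetical.

```python
MAX_SEQ_LEN = 2048         # from the published spec sheet
MIMI_FRAME_RATE_HZ = 12.5  # Mimi's published frame rate; assumed to apply here


def audio_seconds(positions, frame_rate_hz=MIMI_FRAME_RATE_HZ):
    """Seconds of audio covered by a number of sequence positions,
    assuming one position per audio frame."""
    return positions / frame_rate_hz


def remaining_audio_seconds(text_tokens):
    """Audio budget left once the text tokens have taken their share."""
    return audio_seconds(max(MAX_SEQ_LEN - text_tokens, 0))


full_window = audio_seconds(MAX_SEQ_LEN)
after_text = remaining_audio_seconds(200)

print(f"{full_window:.2f} s if the window held audio only")
print(f"{after_text:.2f} s left after 200 text tokens")
```

Under those assumptions the window covers a little under three minutes of combined prompt audio and generated speech, which fits the article's reading: comfortable for dialogue turns, tight for long monologues.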

On Hugging Face, the model is listed at 8B parameters with tensors stored in F32 (full 32-bit float) precision in the published checkpoint, though — as we’ll see below — the recommended inference path runs in bfloat16.


The headline benchmark: latency

Miso Labs has chosen to fight on one battlefield above all others — latency — and it’s the centerpiece of the misolabs.ai landing page. The company frames the problem clearly: “Most AI voice agents lag at 700ms or more, creating awkward pauses that kill conversational flow.”

Here’s the comparison chart Miso Labs published, ranking time-to-first-response:

| System | Latency |
| --- | --- |
| ElevenLabs | 700 ms |
| Sesame | 300 ms |
| Human reaction | 160 ms |
| Miso Labs | 110 ms |

The claim worth pausing on is that Miso One responds in 110 ms — faster than the ~160 ms of a typical human conversational turn. If that number holds up in independent testing, it’s the kind of figure that genuinely changes what a voice agent feels like. The uncanny-valley awkwardness of AI calls is usually less about voice quality and more about the dead air after you stop talking. Shaving the response gap below human reflexes is, conceptually, what separates “talking to a bot” from “talking to a person.”

A fair caveat: these are vendor-published numbers, and Miso Labs hasn’t (yet) released a detailed methodology — what hardware, what audio length, what measurement definition of “latency.” There’s also no published comparison on the dimensions you’d normally want for a TTS model: word error rate, speaker-similarity scores, or MOS (mean opinion score) naturalness ratings against competitors. The “most emotive voice model in the world” claim, for now, rests on the live demo and the launch-thread audio sample rather than a peer-style evaluation table. Treat the emotiveness claim as a “listen and judge for yourself” proposition.
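Since the measurement definition is the missing piece, here is a minimal sketch of one common definition: time from issuing the request to receiving the first audio chunk. `fake_streaming_tts` is a hypothetical stand-in that just sleeps. It is not the Miso API, and nothing here says how Miso Labs arrived at its 110 ms figure.

```python
import time


def fake_streaming_tts(text, first_chunk_delay_s=0.05, chunk_delay_s=0.01, chunks=5):
    """Hypothetical stand-in for a streaming TTS call; yields silent audio chunks."""
    time.sleep(first_chunk_delay_s)  # simulated model warm-up before the first chunk
    yield b"\x00" * 1920
    for _ in range(chunks - 1):
        time.sleep(chunk_delay_s)
        yield b"\x00" * 1920


def measure_stream(stream):
    """Time to first audio chunk and total time, both in milliseconds."""
    start = time.perf_counter()
    first_ms = None
    count = 0
    for _chunk in stream:
        if first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000
        count += 1
    total_ms = (time.perf_counter() - start) * 1000
    return {"time_to_first_audio_ms": first_ms, "total_ms": total_ms, "chunks": count}


result = measure_stream(fake_streaming_tts("Hello from Miso."))
print(result)
```

When comparing vendors, the same harness should wrap each system on the same hardware with the same input text; otherwise the numbers are not comparable.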


Why people are excited

Three features anchor Miso Labs’ pitch, and each maps to a real pain point in deploying voice AI today.

1. Sub-human latency. Covered above — 110 ms is the marquee number, and if you’ve ever built a voice agent, you know that latency is the single most-felt quality metric.

2. One-shot voice cloning. Per the landing page, you can “clone any voice with a ten-second audio clip,” and the agent’s voice “remains an exact replica of the original sample from the first second of a call to the last.” This is backed up in the code: the repo ships a Segment API for prompted generation, where you pass in prompt.wav plus its transcript as context, and the model continues in that voice. Consistency across a long call is the hard part of cloning, and Miso is explicitly promising it.

3. On-premises sovereignty. This is the most strategically interesting one. Miso Labs leans hard into the open-source angle: “Our models are open source and built for local deployment. Keep your most sensitive data in-house and maintain total sovereignty over your voice layer.” For regulated industries — healthcare, finance, government — the ability to keep voice data from ever leaving the building is often a hard requirement, not a nice-to-have. Miso also offers on-premises hosting and enterprise support contracts on request. This is plausibly the real business model: open weights as the funnel, enterprise support as the revenue.

The emotive angle deserves its own note. The demo line on the landing page — “You’re my best friend, and my favourite person in the whole wide world… honestly, you’re just unbelievable” — is deliberately chosen to show off prosody, emphasis, and the kind of casual filler-laden cadence (“like, so smart”) that most TTS systems flatten into monotone. The model also exposes voice presets like friend, teacher, and voiceover in the preview UI.


Can it run locally? Yes — with caveats

This is the part I was most curious about, and the honest answer is: yes, it’s genuinely runnable locally, but it is not a lightweight model you’ll be running on a laptop CPU.

The GitHub repository ships full inference code — generator.py, models.py, a moshi_compat.py shim, watermarking, and a run_misotts.py entry point. Setup is refreshingly modern, built around Astral’s uv:

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and set up
git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate

# Run the example (writes full_conversation.wav)
uv run python run_misotts.py
```

There’s a pip-based path too for those who prefer a classic virtualenv. On first run, the script pulls the weights from the MisoLabs/MisoTTS Hugging Face repo and caches them locally — so subsequent runs are offline-friendly.

Generating speech from Python is about as terse as it gets:

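```python
import torch
import torchaudio
from generator import load_miso_8b

# Use the GPU when one is available; the CPU fallback works but is slow for an 8B model.
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_miso_8b(device=device, model_path_or_repo_id="MisoLabs/MisoTTS")

# An empty context means plain text-to-speech, capped here at 10 seconds of audio.
audio = generator.generate(
    text="Hello from Miso.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
# torchaudio.save expects a (channels, samples) tensor, hence the unsqueeze.
torchaudio.save("miso.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```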

Voice cloning adds just a few lines — you load your prompt.wav, resample it to the generator’s sample rate, wrap it in a Segment with its transcript, and pass it as context.
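The steps just described might look roughly like the sketch below. Treat it as an assumption-heavy illustration: the `Segment` import path and its `text`, `speaker`, and `audio` fields mirror Sesame's CSM reference code and may differ in the MisoTTS repo, and the `clone_voice` helper is hypothetical. It needs the repo on your path and the downloaded weights to actually run.

```python
def clone_voice(prompt_wav, prompt_transcript, new_text, out_path="cloned.wav", speaker=0):
    """Continue speaking `new_text` in the voice heard in `prompt_wav`.

    Sketch only: the Segment fields are assumed to match Sesame's CSM.
    """
    # Local imports so that defining this helper does not require the repo or a GPU.
    import torch
    import torchaudio
    from generator import Segment, load_miso_8b  # Segment location is an assumption

    device = "cuda" if torch.cuda.is_available() else "cpu"
    generator = load_miso_8b(device=device, model_path_or_repo_id="MisoLabs/MisoTTS")

    # Load the prompt, mix down to mono, and resample to the generator's rate.
    waveform, sample_rate = torchaudio.load(prompt_wav)
    prompt_audio = torchaudio.functional.resample(
        waveform.mean(dim=0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )

    # Wrap the prompt audio with its transcript and pass it as context.
    prompt = Segment(text=prompt_transcript, speaker=speaker, audio=prompt_audio)
    audio = generator.generate(
        text=new_text,
        speaker=speaker,
        context=[prompt],
        max_audio_length_ms=10_000,
    )
    torchaudio.save(out_path, audio.unsqueeze(0).cpu(), generator.sample_rate)
    return out_path
```

A call would look like `clone_voice("prompt.wav", "Transcript of the prompt.", "Hello in a borrowed voice.")`. Check the repo's own example for the authoritative field names before relying on this.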

The caveats are about hardware. The repo’s deployment notes are blunt: “Miso TTS 8B is a large model. For best results, use a CUDA GPU with sufficient VRAM for the checkpoint precision you are loading. The default inference path uses torch.bfloat16.” The code will fall back to CPU if no CUDA device is found, but for an 8B model targeting real-time latency, an NVIDIA GPU with healthy VRAM is effectively mandatory if you want anything close to that 110 ms claim. The published checkpoint is F32, so at full precision you’re looking at roughly 32 GB just for weights — though the bfloat16 path roughly halves that, and the Hugging Face page already lists a community quantization (plus one finetune), which is exactly what you’d want for fitting it onto more modest cards.
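The weight-memory arithmetic behind those figures is simple enough to check yourself. This sketch counts weights only. It ignores activations, the attention cache, the Mimi codec, and the watermarking model, so real VRAM use will be higher. The int8 and int4 rows are generic illustrations, not a statement about which format the community quantization uses.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_gb(num_params, dtype):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 8e9  # the headline 8B parameter count

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_gb(NUM_PARAMS, dtype):.1f} GB")
```

At 8 billion parameters that works out to 32 GB in float32 and 16 GB in bfloat16, which is why the default bfloat16 path matters so much for single-GPU deployments.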


Safety and watermarking

To Miso Labs’ credit, the safety story isn’t an afterthought bolted on at the end. The repo states plainly: “Do not use it to impersonate people, create deceptive audio, commit fraud, or generate harmful content.” More importantly, that intent is backed by code — generated audio is watermarked by default, using Sony’s SilentCipher model, which the script downloads on first run alongside the main weights.

There’s a sensible operational note attached: if you deploy the model in your own application, “use your own private watermark key and keep it secret.” For a one-shot voice-cloning model — a technology that is, let’s be honest, tailor-made for misuse — shipping default watermarking is a responsible baseline. It won’t stop a determined bad actor who strips the watermarking code, but it does mean honest deployments leave a traceable signal, and it raises the friction for casual abuse.


How it stacks up — and what we still don’t know

Positioned against the field, Miso One’s differentiators are clear: it claims lower latency than both ElevenLabs (700 ms) and Sesame (300 ms), it ships open weights where ElevenLabs is closed, and it bakes in on-prem deployment and watermarking. Architecturally it’s a Sesame-CSM descendant, so it inherits that family’s strength in conversational, context-aware speech rather than flat narration.

But a clear-eyed reader should note what hasn’t been published yet:

  • No independent benchmarks. The latency numbers and the “most emotive” claim are vendor-stated. There’s no released MOS naturalness study, no WER figures, no third-party A/B results.
  • No methodology for the 110 ms figure. We don’t know the hardware, audio length, or measurement definition behind it.
  • API access isn’t live. Teo’s launch post says it’s “coming soon” — for now, local deployment is the only way in.
  • Real VRAM requirements aren’t precisely documented. “Sufficient VRAM” leaves the exact floor to your own experimentation, though the existing quantization helps.

The repo is also brand new — a single commit, sitting around 271 stars and 27 forks at the time of writing — so the community vetting that ultimately validates (or deflates) launch-day claims is only just beginning.


The verdict

Miso One is one of the more exciting open releases in voice AI this year, and the reason is strategic as much as technical. By open-sourcing an 8B Sesame-style model with credible-sounding latency, real one-shot cloning, and a privacy-first, on-prem deployment story, Miso Labs is targeting exactly the gap that closed APIs like ElevenLabs leave open: teams that need a great-sounding voice agent and need to keep their data in-house.

The 110 ms latency claim — if it survives contact with independent testing — is the genuinely differentiating number here, because conversational feel lives or dies on response gaps, not just timbre. The honest move is to download the weights, point it at a decent CUDA GPU, and judge the emotiveness with your own ears. The barrier to doing exactly that is delightfully low.

If voice agents are going to feel less like talking to a machine and more like talking to a person, latency below human reaction time is the threshold that matters — and Miso One is making a confident, open-source bet that it has crossed it.

Curtis Pyke

A.I. enthusiast with multiple certificates and accreditations from Deep Learning AI, Coursera, and more. I am interested in machine learning, LLMs, and all things AI.


© 2026 Kingy AI
