On the morning of June 3, 2026, a tweet from Miso Labs co-founder Aoden Teo, reposted by tech evangelist Robert Scoble, slid into a lot of timelines with a bold claim: “the most emotive voice model in the world.” By 9:06 AM the post had already crossed 312,000 views. The product is called Miso One, and the pitch is deceptively simple. It’s an 8-billion-parameter text-to-speech (TTS) model that, in Teo’s words, “emotes like a human and responds faster than a human, with just 110 milliseconds of latency.”
What makes the launch genuinely interesting isn’t just the marketing. Miso Labs shipped the model weights as open source on day one, with API access promised to follow. In a field where the best-sounding voices have historically lived behind closed APIs and metered billing, that’s a meaningful gesture. Let’s dig into what Miso One actually is, what the published numbers say, and — the question everyone in my DMs keeps asking — whether you can run it on your own hardware.
What Miso One actually is
Stripped of the marketing language, Miso One (the model repository is named MisoTTS) is an 8-billion-parameter text-to-dialogue RVQ Transformer. If you’ve been following the open speech-synthesis scene, the lineage will feel familiar: Miso Labs describes the model as “inspired by the Sesame CSM architecture” — the same conversational-speech-model family that made waves for sounding eerily natural in back-and-forth dialogue.
According to the official GitHub README, the model “generates Mimi audio codes from text and optional audio context, using a large Llama 3.2-style backbone and a smaller autoregressive audio decoder.” In plainer terms: it borrows the transformer brain of a Llama-style language model and repurposes it to predict audio tokens instead of words.
The design splits into two transformer components, and understanding that split is key to understanding why the model is both expressive and reasonably fast:
- A large backbone transformer that consumes interleaved text and audio-frame embeddings. Because it accepts both modalities, it can condition its output on the conversation history — meaning it “remembers” how the dialogue has been flowing and adjusts tone accordingly.
- A smaller decoder transformer that autoregressively predicts the higher-order audio codebooks within each frame.
The Hugging Face model card spells out the division of labor precisely: “Codebook 0 is predicted from the backbone hidden state, while codebooks 1 through 31 are predicted by the audio decoder autoregressively in codebook depth.” So the heavyweight backbone sets the coarse acoustic direction, and the lightweight 300M decoder fills in the fine-grained detail. This two-stage approach is the same trick Sesame’s CSM popularized, and it’s a big part of why the model can stay responsive without sacrificing nuance.
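To make that division of labor concrete, here is a toy sketch of how one audio frame might be assembled. The two predictor functions are hypothetical stand-ins that return random tokens; they are not Miso Labs' code, and only the codebook count (32) and the audio vocabulary size (2,051) come from the published model summary.

```python
import random

NUM_CODEBOOKS = 32   # codebook 0 plus codebooks 1..31
AUDIO_VOCAB = 2051   # audio vocabulary size from the model summary


def backbone_predict_codebook0(history, rng):
    """Hypothetical stand-in for the large backbone: one coarse token per frame."""
    return rng.randrange(AUDIO_VOCAB)


def decoder_predict_next(history, codes_so_far, rng):
    """Hypothetical stand-in for the small decoder: next codebook given earlier ones."""
    return rng.randrange(AUDIO_VOCAB)


def generate_frame(history, rng):
    # Stage 1: the backbone picks codebook 0 from the conversation history.
    codes = [backbone_predict_codebook0(history, rng)]
    # Stage 2: the decoder fills in codebooks 1..31 one depth level at a time,
    # each conditioned on the codes already chosen for this frame.
    for _depth in range(1, NUM_CODEBOOKS):
        codes.append(decoder_predict_next(history, codes, rng))
    return codes


def generate_frames(num_frames, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(num_frames):
        frame = generate_frame(history, rng)
        history.append(frame)  # finished frames become context for the next one
    return history


frames = generate_frames(num_frames=3)
print(len(frames), "frames of", len(frames[0]), "codes each")
```

The point of the structure is that the expensive model runs once per frame, while the cheap model runs 31 times per frame. In a real system both predictors would be transformer forward passes rather than random draws.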
The model is explicitly built for conversational speech generation and voice continuation from prompt audio — not just reading static paragraphs aloud. That conversational framing matters, because it’s the difference between an audiobook narrator and a voice agent that can hold a phone call.
The specs, in a table
Miso Labs publishes a clean model summary in both the GitHub repo and on Hugging Face. Here’s the consolidated spec sheet:
| Item | Value |
|---|---|
| Model | Miso TTS 8B |
| Organization | Miso Labs |
| Task | Text-to-speech |
| Architecture | RVQ Transformer (Sesame-style CSM) |
| Backbone | llama-8B |
| Audio decoder | llama-300M |
| Text vocabulary | 128,256 |
| Audio vocabulary | 2,051 |
| Audio codebooks | 32 |
| Audio tokenizer | Mimi |
| Max sequence length | 2,048 |
A few things stand out. The text vocabulary of 128,256 is the standard Llama 3 tokenizer size — another confirmation of the Llama 3.2-style heritage. The 32 audio codebooks with a tiny 2,051-token audio vocabulary is characteristic of the Mimi neural audio codec, which compresses speech into a stack of discrete residual-vector-quantization tokens. The 2,048-token max sequence length is the practical ceiling on how much combined text-and-audio context you can feed in a single generation — generous for turn-by-turn dialogue, though something to keep in mind for very long monologues.
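As a rough illustration of what those numbers imply, the sketch below works out the codec bitrate and how much audio fits in one context window. It assumes Mimi's 12.5 Hz frame rate, a 2,048-entry codebook (11 bits per code, with the remaining 3 vocabulary IDs taken to be special tokens), and one sequence position per audio frame as in Sesame-style models. None of these assumptions is confirmed by Miso Labs' documentation.

```python
import math

MIMI_FRAME_RATE_HZ = 12.5    # assumption: Mimi's standard frame rate
CODEBOOK_ENTRIES = 2048      # assumption: 2,051 vocab = 2,048 codes + 3 special IDs
NUM_CODEBOOKS = 32           # from the spec sheet
MAX_SEQ_LEN = 2048           # from the spec sheet


def codec_bitrate_bps(frame_rate_hz, num_codebooks, entries):
    """Bits per second needed to store every codebook of every frame."""
    bits_per_code = math.log2(entries)
    return frame_rate_hz * num_codebooks * bits_per_code


def max_audio_seconds(max_positions, frame_rate_hz, text_tokens=0):
    # Assumption: each audio frame occupies one sequence position,
    # and text tokens share the same budget.
    audio_positions = max(max_positions - text_tokens, 0)
    return audio_positions / frame_rate_hz


bitrate = codec_bitrate_bps(MIMI_FRAME_RATE_HZ, NUM_CODEBOOKS, CODEBOOK_ENTRIES)
print(f"Codec bitrate: {bitrate / 1000:.1f} kbps")
print(f"Audio-only ceiling: {max_audio_seconds(MAX_SEQ_LEN, MIMI_FRAME_RATE_HZ):.1f} s")
print(f"With 300 text tokens: {max_audio_seconds(MAX_SEQ_LEN, MIMI_FRAME_RATE_HZ, 300):.1f} s")
```

Under these assumptions the window holds on the order of two to three minutes of combined prompt and generated audio, which is why long monologues would need to be split into chunks.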
On Hugging Face, the model is listed at 8B parameters with tensors stored in F32 (full 32-bit float) precision in the published checkpoint, though — as we’ll see below — the recommended inference path runs in bfloat16.

The headline benchmark: latency
Miso Labs has chosen to fight on one battlefield above all others — latency — and it’s the centerpiece of the misolabs.ai landing page. The company frames the problem clearly: “Most AI voice agents lag at 700ms or more, creating awkward pauses that kill conversational flow.”
Here’s the comparison chart Miso Labs published, ranking time-to-first-response:
| System | Latency |
|---|---|
| ElevenLabs | 700 ms |
| Sesame | 300 ms |
| Human reaction | 160 ms |
| Miso Labs | 110 ms |
The claim worth pausing on is that Miso One responds in 110 ms, faster than the ~160 ms the chart lists for human reaction time. If that number holds up in independent testing, it’s the kind of figure that genuinely changes what a voice agent feels like. The uncanny-valley awkwardness of AI calls is usually less about voice quality and more about the dead air after you stop talking. Shaving the response gap below human reflexes is, conceptually, what separates “talking to a bot” from “talking to a person.”
A fair caveat: these are vendor-published numbers, and Miso Labs hasn’t (yet) released a detailed methodology — what hardware, what audio length, what measurement definition of “latency.” There’s also no published comparison on the dimensions you’d normally want for a TTS model: word error rate, speaker-similarity scores, or MOS (mean opinion score) naturalness ratings against competitors. The “most emotive voice model in the world” claim, for now, rests on the live demo and the launch-thread audio sample rather than a peer-style evaluation table. Treat the emotiveness claim as a “listen and judge for yourself” proposition.
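Until a methodology is published, one way to sanity-check latency claims on your own hardware is to time the call yourself. The harness below is a generic sketch: `synthesize` is a hypothetical callable standing in for whatever produces the first chunk of audio in your setup, and the sleep-based fake exists only to make the example runnable.

```python
import statistics
import time


def measure_latency_ms(synthesize, text, warmup=2, runs=10):
    """Time repeated calls to `synthesize(text)` and summarize in milliseconds."""
    for _ in range(warmup):
        synthesize(text)  # warm-up runs are discarded (caches, lazy init)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "max_ms": samples[-1],
        "runs": len(samples),
    }


def fake_synthesize(text):
    # Hypothetical stand-in: pretends to take about 20 ms to respond.
    time.sleep(0.02)
    return b""


stats = measure_latency_ms(fake_synthesize, "Hello from Miso.")
print(stats)
```

When timing a real GPU model, make sure the callable blocks until audio is actually available (in PyTorch, asynchronous CUDA work needs a `torch.cuda.synchronize()` before the clock stops), and record the hardware and prompt length alongside the numbers so results are comparable.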
Why people are excited
Three features anchor Miso Labs’ pitch, and each maps to a real pain point in deploying voice AI today.
1. Sub-human latency. Covered above — 110 ms is the marquee number, and if you’ve ever built a voice agent, you know that latency is the single most-felt quality metric.
2. One-shot voice cloning. Per the landing page, you can “clone any voice with a ten-second audio clip,” and the agent’s voice “remains an exact replica of the original sample from the first second of a call to the last.” This is backed up in the code: the repo ships a Segment API for prompted generation, where you pass in prompt.wav plus its transcript as context, and the model continues in that voice. Consistency across a long call is the hard part of cloning, and Miso is explicitly promising it.
3. On-premises sovereignty. This is the most strategically interesting one. Miso Labs leans hard into the open-source angle: “Our models are open source and built for local deployment. Keep your most sensitive data in-house and maintain total sovereignty over your voice layer.” For regulated industries — healthcare, finance, government — the ability to keep voice data from ever leaving the building is often a hard requirement, not a nice-to-have. Miso also offers on-premises hosting and enterprise support contracts on request. This is plausibly the real business model: open weights as the funnel, enterprise support as the revenue.
The emotive angle deserves its own note. The demo line on the landing page — “You’re my best friend, and my favourite person in the whole wide world… honestly, you’re just unbelievable” — is deliberately chosen to show off prosody, emphasis, and the kind of casual filler-laden cadence (“like, so smart”) that most TTS systems flatten into monotone. The model also exposes voice presets like friend, teacher, and voiceover in the preview UI.
Can it run locally? Yes — with caveats
This is the part I was most curious about, and the honest answer is: yes, it’s genuinely runnable locally, but it is not a lightweight model you’ll be running on a laptop CPU.
The GitHub repository ships full inference code — generator.py, models.py, a moshi_compat.py shim, watermarking, and a run_misotts.py entry point. Setup is refreshingly modern, built around Astral’s uv:
```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and set up
git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate

# Run the example (writes full_conversation.wav)
uv run python run_misotts.py
```
There’s a pip-based path too for those who prefer a classic virtualenv. On first run, the script pulls the weights from the MisoLabs/MisoTTS Hugging Face repo and caches them locally — so subsequent runs are offline-friendly.
Generating speech from Python is about as terse as it gets:
```python
import torch
import torchaudio

from generator import load_miso_8b

# Prefer a CUDA GPU; fall back to CPU if none is available
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_miso_8b(device=device, model_path_or_repo_id="MisoLabs/MisoTTS")

audio = generator.generate(
    text="Hello from Miso.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# Add a channel dimension and move to CPU before writing the file
torchaudio.save("miso.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
Voice cloning adds just a few lines — you load your prompt.wav, resample it to the generator’s sample rate, wrap it in a Segment with its transcript, and pass it as context.
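A sketch of that flow is below. The `generate` arguments mirror the basic example, but the `Segment` field names (`text`, `speaker`, `audio`) are an assumption borrowed from the Sesame CSM convention, so check them against the repo before relying on this. To keep the example runnable without the model, a stand-in dataclass and a fake generator are used; in real use you would pass the repo's `Segment` class and the object returned by `load_miso_8b`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StandInSegment:
    """Hypothetical stand-in for the repo's Segment; field names are assumed."""
    text: str
    speaker: int
    audio: Any


def load_prompt(path, target_sample_rate):
    """Load a prompt clip as mono audio at the generator's sample rate."""
    import torchaudio  # imported lazily so the rest of the sketch runs without it

    waveform, sample_rate = torchaudio.load(path)
    mono = waveform.mean(dim=0)  # collapse channels to a 1-D tensor
    return torchaudio.functional.resample(
        mono, orig_freq=sample_rate, new_freq=target_sample_rate
    )


def clone_voice(generator, segment_cls, prompt_audio, prompt_text, new_text, speaker=0):
    # The prompt clip plus its transcript becomes the conversation context,
    # and the model continues in that voice.
    context = [segment_cls(text=prompt_text, speaker=speaker, audio=prompt_audio)]
    return generator.generate(
        text=new_text,
        speaker=speaker,
        context=context,
        max_audio_length_ms=10_000,
    )


class FakeGenerator:
    """Test double that records the arguments it was called with."""
    sample_rate = 24_000  # arbitrary value for the test double

    def generate(self, **kwargs):
        self.last_call = kwargs
        return [0.0]


fake = FakeGenerator()
clone_voice(fake, StandInSegment, prompt_audio=[0.0] * 10,
            prompt_text="This is my reference clip.", new_text="Hello in my voice.")
print(fake.last_call["context"][0].text)
```

With the real model, `prompt_audio` would come from `load_prompt("prompt.wav", generator.sample_rate)`, and the transcript should match what is actually said in the clip.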
The caveats are about hardware. The repo’s deployment notes are blunt: “Miso TTS 8B is a large model. For best results, use a CUDA GPU with sufficient VRAM for the checkpoint precision you are loading. The default inference path uses torch.bfloat16.” The code will fall back to CPU if no CUDA device is found, but for an 8B model targeting real-time latency, an NVIDIA GPU with ample VRAM is effectively mandatory if you want anything close to that 110 ms claim. The published checkpoint is F32, so at full precision you’re looking at about 32 GB (8 billion parameters at 4 bytes each) just for weights. The bfloat16 path halves that to about 16 GB, and the Hugging Face page already lists a community quantization (plus one finetune), which is exactly what you’d want for fitting it onto more modest cards.
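The arithmetic behind those estimates is simple enough to check. The sketch below counts weights only, using the nominal 8B backbone and 300M decoder sizes from the spec sheet. Activations, the KV cache, the Mimi codec, and the watermarking model all add to the real footprint, and the 8-bit and 4-bit rows are generic illustrations rather than a description of the community quantization.

```python
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


BACKBONE_PARAMS = 8_000_000_000   # nominal size from the spec sheet
DECODER_PARAMS = 300_000_000      # nominal size from the spec sheet
total_params = BACKBONE_PARAMS + DECODER_PARAMS

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(total_params, dtype):5.1f} GB")
```

Treat the results as a floor: a card whose VRAM only just matches the weight size will not leave room for inference itself.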
Safety and watermarking
To Miso Labs’ credit, the safety story isn’t an afterthought bolted on at the end. The repo states plainly: “Do not use it to impersonate people, create deceptive audio, commit fraud, or generate harmful content.” More importantly, that intent is backed by code — generated audio is watermarked by default, using Sony’s SilentCipher model, which the script downloads on first run alongside the main weights.
There’s a sensible operational note attached: if you deploy the model in your own application, “use your own private watermark key and keep it secret.” For a one-shot voice-cloning model — a technology that is, let’s be honest, tailor-made for misuse — shipping default watermarking is a responsible baseline. It won’t stop a determined bad actor who strips the watermarking code, but it does mean honest deployments leave a traceable signal, and it raises the friction for casual abuse.
How it stacks up — and what we still don’t know
Positioned against the field, Miso One’s differentiators are clear: it claims lower latency than both ElevenLabs (700 ms) and Sesame (300 ms), it ships open weights where ElevenLabs is closed, and it bakes in on-prem deployment and watermarking. Architecturally it’s a Sesame-CSM descendant, so it inherits that family’s strength in conversational, context-aware speech rather than flat narration.
But a clear-eyed reader should note what hasn’t been published yet:
- No independent benchmarks. The latency numbers and the “most emotive” claim are vendor-stated. There’s no released MOS naturalness study, no WER figures, no third-party A/B results.
- No methodology for the 110 ms figure. We don’t know the hardware, audio length, or measurement definition behind it.
- API access isn’t live. Teo’s launch post says it’s “coming soon” — for now, local deployment is the only way in.
- Real VRAM requirements aren’t precisely documented. “Sufficient VRAM” leaves the exact floor to your own experimentation, though the existing quantization helps.
The repo is also brand new — a single commit, sitting around 271 stars and 27 forks at the time of writing — so the community vetting that ultimately validates (or deflates) launch-day claims is only just beginning.
The verdict
Miso One is one of the more exciting open releases in voice AI this year, and the reason is strategic as much as technical. By open-sourcing an 8B Sesame-style model with credible-sounding latency, real one-shot cloning, and a privacy-first, on-prem deployment story, Miso Labs is targeting exactly the gap that closed APIs like ElevenLabs leave open: teams that need a great-sounding voice agent and need to keep their data in-house.
The 110 ms latency claim — if it survives contact with independent testing — is the genuinely differentiating number here, because conversational feel lives or dies on response gaps, not just timbre. The honest move is to download the weights, point it at a decent CUDA GPU, and judge the emotiveness with your own ears. The barrier to doing exactly that is delightfully low.
If voice agents are going to feel less like talking to a machine and more like talking to a person, latency below human reaction time is the threshold that matters — and Miso One is making a confident, open-source bet that it has crossed it.