Voices & Thought Leaders Sunday, 10 May 2026

TTS Models Learned to Talk Like LLMs


Text-to-speech systems used to be their own weird corner of AI. Different architectures, different training methods, different everything. Then someone realised: what if we just treat audio like text? Generate it one token at a time, left to right, using the same transformer architecture that powers ChatGPT?

Samuel Humeau from Mistral walked through how TTS architecture converged on the same autoregressive approach as large language models. The technical reason is information density. The business reason is cost. And the reason it works at all is a trick called neural audio codecs.

The Problem: Audio Has Too Much Information

Language models generate text one token at a time. Each token is a chunk of meaning - a word, part of a word, a punctuation mark. The model predicts the next token, then the next, until it's done. Simple. Sequential. Efficient.
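The next-token loop described above can be sketched in a few lines. `model` here is a hypothetical callable that maps a token sequence to a list of scores, one per vocabulary entry; it stands in for a real transformer, not any particular library's API.

```python
# A minimal sketch of greedy autoregressive decoding. `model` is a
# hypothetical callable returning one score per vocabulary entry.
def generate(model, prompt_tokens, max_new_tokens, eos_token=0):
    """Predict the next token, append it, repeat until done."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                # scores over the vocabulary
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        if next_token == eos_token:           # the model signals "done"
            break
    return tokens
```

The same loop drives a modern TTS model; only the vocabulary changes, from text tokens to audio codec tokens.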

Audio doesn't work like that. A single second of speech contains 16,000 to 48,000 samples depending on quality. If you tried to generate each sample one at a time, you'd be waiting hours for a ten-second clip. The information density is too high. You need a way to compress audio into something token-like.
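The "waiting hours" claim is easy to check with back-of-envelope arithmetic. The per-step latency below is an assumed figure for a large model's forward pass, purely illustrative:

```python
# Sample-by-sample generation of a 10-second clip at 48 kHz.
sample_rate = 48_000                       # samples per second of audio
clip_seconds = 10
steps = sample_rate * clip_seconds         # 480,000 sequential steps
ms_per_step = 20                           # assumed forward-pass latency
hours = steps * ms_per_step / 1000 / 3600  # roughly 2.7 hours
```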

That's what neural audio codecs do. They take raw audio waveforms and encode them into a sequence of discrete tokens - integers that represent chunks of sound. Instead of predicting 48,000 samples per second, the TTS model predicts maybe 50 tokens per second. The codec handles the rest, decoding those tokens back into smooth audio.
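The shape of a codec's interface can be sketched with a toy stand-in: `encode()` maps a waveform to discrete tokens, `decode()` maps tokens back to audio. The quantisation here is a crude uniform scheme rather than the learned residual vector quantisation real neural codecs use, but the API contract is the same.

```python
import numpy as np

class ToyCodec:
    """Toy codec: one integer token per fixed-size audio frame."""

    def __init__(self, frame_size=960, n_codes=1024):
        self.frame_size = frame_size   # 960 samples = 48 kHz / 50 tokens/sec
        self.n_codes = n_codes         # size of the discrete codebook

    def encode(self, waveform):
        """Waveform in [-1, 1] -> one integer token per frame."""
        usable = len(waveform) // self.frame_size * self.frame_size
        frames = waveform[:usable].reshape(-1, self.frame_size)
        means = frames.mean(axis=1)                     # crude frame summary
        return ((means + 1) / 2 * (self.n_codes - 1)).astype(int)

    def decode(self, tokens):
        """Tokens -> a (very lossy) waveform reconstruction."""
        levels = tokens / (self.n_codes - 1) * 2 - 1
        return np.repeat(levels, self.frame_size)
```

One second of 48 kHz audio becomes 50 tokens; a real codec's decoder is a neural network that reconstructs smooth audio instead of this blocky approximation.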

This is why modern TTS models look like LLMs. They're not generating audio directly. They're generating codec tokens. The architecture is the same because the problem is the same - predict the next token in a sequence.

Streaming: The Latency Fix

Autoregressive generation has a downside. You can't generate the end of a sentence until you've generated the beginning. For text, that's fine - people read left to right anyway. For audio, it's a problem. Nobody wants to sit through five seconds of silence before the voice agent starts speaking.

The solution is streaming. Instead of generating the entire audio sequence and then playing it, the model generates tokens in small chunks and starts playback immediately. As soon as the first few tokens are ready, the codec decodes them and the speaker plays them. The model is still generating the rest of the sentence in the background.
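The chunking step can be sketched as a generator that hands off small groups of tokens as soon as they exist, so the decoder can start playback before generation finishes. `token_source` stands in for the autoregressive model's output stream:

```python
def stream_chunks(token_source, chunk_size=5):
    """Group incoming tokens into decodable chunks for immediate playback."""
    chunk = []
    for token in token_source:
        chunk.append(token)
        if len(chunk) == chunk_size:
            yield chunk          # hand off to the codec decoder right away
            chunk = []
    if chunk:
        yield chunk              # flush the final partial chunk
```

In a real pipeline each yielded chunk would go straight to the codec's decoder and then to the audio device, while the model keeps generating in the background.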

This works because audio is forgiving. A 100-millisecond delay between chunks is imperceptible. The human ear doesn't notice the seams. The codec smooths over the joins, and what you hear is continuous speech.

The challenge is keeping the pipeline full. If the model generates tokens slower than the codec consumes them, playback stutters. If the model generates too fast, you're wasting compute. Streaming TTS is a balancing act between latency and throughput. Get it right, and you have real-time voice agents. Get it wrong, and you have robotic pauses mid-sentence.
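The balancing act has a simple invariant: the real-time factor (seconds of audio produced per wall-clock second) must stay at or above 1.0, or the playback buffer drains and stutters. The rates below are illustrative, not benchmarks of any real system:

```python
def realtime_factor(tokens_per_wallclock_sec, codec_tokens_per_audio_sec=50):
    """>= 1.0 means generation keeps up with playback."""
    return tokens_per_wallclock_sec / codec_tokens_per_audio_sec

def buffer_after(seconds, gen_rate, codec_rate=50, initial_buffer_sec=0.2):
    """Seconds of decoded audio still buffered after `seconds` of playback."""
    produced = seconds * gen_rate / codec_rate   # audio seconds generated
    consumed = seconds                           # audio seconds played
    return max(0.0, initial_buffer_sec + produced - consumed)
```

A model generating 75 tokens/sec against a 50 token/sec codec has headroom (factor 1.5); one generating 40 tokens/sec drains its buffer and stutters.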

The Cost Problem

Voice agents are expensive. Not because the models are large - though they are - but because they run constantly. A chatbot generates one response per user message. A voice agent generates audio for every word spoken, in both directions. The compute adds up fast.

This is the real barrier to voice agents at scale. The technology works. The latency is manageable. But the cost per interaction is higher than text-based agents, and that limits where you can deploy them. Customer service calls, sure. Casual chat apps, maybe not.

Humeau's point is that codec efficiency is the lever. Better codecs mean fewer tokens per second of audio. Fewer tokens mean less compute. Less compute means lower cost. The architecture is settled - autoregressive transformers won. The next frontier is making them cheap enough to run everywhere.
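The lever is linear: per-minute serving cost scales directly with the codec's token rate. The price below is invented for illustration, not any provider's actual rate:

```python
def cost_per_minute(tokens_per_audio_sec, cost_per_1k_tokens=0.002):
    """Serving cost for one minute of generated audio."""
    tokens = tokens_per_audio_sec * 60
    return tokens * cost_per_1k_tokens / 1000
```

Halve the codec's token rate and the cost per minute halves with it, which is why codec research is an economics problem as much as a quality one.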

Voice Cloning: The Easy Part

Modern TTS models can clone voices from a few seconds of audio. You speak for ten seconds, the model learns your vocal characteristics, and it can generate speech in your voice indefinitely. This is called few-shot voice cloning, and it's surprisingly straightforward.

The model doesn't learn your voice from scratch. It already knows how to generate thousands of voices - that's what it learned during pre-training. When you give it a sample of your voice, it's just finding the right point in that voice space. The codec encodes your voice into a set of parameters, and the model uses those parameters to steer generation.
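The "finding the right point in voice space" idea can be sketched with a toy nearest-neighbour lookup: embed the reference clip, then pick the closest known voice embedding. The embedding function here is a crude summary statistic, a stand-in for a real learned speaker encoder:

```python
import numpy as np

def embed(clip):
    """Toy speaker embedding: two crude summary statistics of the clip."""
    return np.array([clip.mean(), clip.std()])

def nearest_voice(reference_clip, known_voices):
    """Pick the known voice whose embedding is closest to the reference."""
    ref = embed(reference_clip)
    names = list(known_voices)
    dists = [np.linalg.norm(ref - known_voices[name]) for name in names]
    return names[int(np.argmin(dists))]
```

A real model conditions generation on a continuous embedding rather than snapping to a stored voice, but the principle is the same: the sample steers the model to a point in a space it already knows.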

The hard part isn't cloning a voice. It's cloning a voice while preserving prosody - the rhythm, emphasis, and emotion of natural speech. A flat monotone reading of text sounds robotic even if the voice itself is perfect. The best TTS models learn prosody from context, adjusting pitch and pacing based on sentence structure and punctuation. But this is still a weak point. Most synthetic voices sound slightly off, not because the timbre is wrong, but because the delivery is too neutral.

What This Means for Builders

If you're building a voice agent, the architecture question is settled. Use an autoregressive transformer. Use a neural audio codec. Stream the output. The tricky parts are latency tuning and cost management. How much compute can you afford per interaction? How much delay can users tolerate?

The other question is whether you need voice at all. Text-based agents are cheaper, faster, and often more useful. Voice makes sense when hands are busy or screens aren't available. But for most applications, text is still the better interface. Voice is the premium option, not the default.

Video Sources

AI Engineer
Why TTS Models Now Look Like LLMs - Samuel Humeau, Mistral
AI Engineer
Give Your Chat Agent a Voice - Luke Harries, ElevenLabs
AI Revolution
Anthropic Situation Just Got Even More INSANE
AI Engineer
Voice AI: when is the "Her" moment? - Neil Zeghidour, Gradium AI


About the Curator

Richard Bland, Founder, Marbl Codes
27+ years in software development, curating the tech news that matters.
