Text-to-speech systems used to be their own weird corner of AI. Different architectures, different training methods, different everything. Then someone realised: what if we just treat audio like text? Generate it one token at a time, left to right, using the same transformer architecture that powers ChatGPT?
Samuel Humeau from Mistral walked through how TTS architectures converged on the same autoregressive approach as large language models. The technical reason is information density. The business reason is cost. And the reason it works at all is a trick called neural audio codecs.
The Problem: Audio Has Too Much Information
Language models generate text one token at a time. Each token is a chunk of meaning - a word, part of a word, a punctuation mark. The model predicts the next token, then the next, until it's done. Simple. Sequential. Efficient.
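In rough pseudocode, the whole loop looks something like this. The names here - `model.predict_next`, `eos_id` - are stand-ins for whatever model and tokenizer you happen to be using, not any particular library:

```python
# Minimal sketch of autoregressive decoding. `model.predict_next` and
# `eos_id` are hypothetical stand-ins, not a real API.
def generate(model, prompt_ids, eos_id, max_tokens=256):
    tokens = list(prompt_ids)
    for _ in range(max_tokens):
        next_id = model.predict_next(tokens)  # predict the next token given everything so far
        if next_id == eos_id:                 # stop when the model says it's done
            break
        tokens.append(next_id)
    return tokens
```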
Audio doesn't work like that. A single second of speech contains 16,000 to 48,000 samples depending on quality. If you tried to generate each sample one at a time, you'd be waiting hours for a ten-second clip. The information density is too high. You need a way to compress audio into something token-like.
That's what neural audio codecs do. They take raw audio waveforms and encode them into a sequence of discrete tokens - integers that represent chunks of sound. Instead of predicting 48,000 samples per second, the TTS model predicts maybe 50 tokens per second. The codec handles the rest, decoding those tokens back into smooth audio.
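The arithmetic is worth spelling out, using the rates from the paragraph above. These are illustrative numbers, not the spec of any particular codec:

```python
# Raw samples per second versus codec tokens per second, for a ten-second clip.
sample_rate = 48_000         # raw audio samples per second
codec_token_rate = 50        # discrete codec tokens per second
clip_seconds = 10

raw_steps = sample_rate * clip_seconds         # 480,000 predictions, sample by sample
codec_steps = codec_token_rate * clip_seconds  # 500 predictions over codec tokens
print(raw_steps // codec_steps)                # 960: nearly a thousandfold fewer prediction steps
```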
This is why modern TTS models look like LLMs. They're not generating audio directly. They're generating codec tokens. The architecture is the same because the problem is the same - predict the next token in a sequence.
Streaming: The Latency Fix
Autoregressive generation has a downside. You can't generate the end of a sentence until you've generated the beginning. For text, that's fine - people read left to right anyway. For audio, it's a problem. Nobody wants to sit through five seconds of silence before the voice agent starts speaking.
The solution is streaming. Instead of generating the entire audio sequence and then playing it, the model generates tokens in small chunks and starts playback immediately. As soon as the first few tokens are ready, the codec decodes them and the speaker plays them. The model is still generating the rest of the sentence in the background.
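A streaming loop can be sketched in a few lines. Everything here - `tts_model.generate_tokens`, `codec.decode`, `player.write` - is a hypothetical stand-in rather than a real API; the point is the shape of the pipeline:

```python
CHUNK_TOKENS = 5  # roughly 100 ms of audio at 50 codec tokens per second

def stream_speech(tts_model, codec, player, text):
    buffer = []
    for token in tts_model.generate_tokens(text):  # autoregressive, left to right
        buffer.append(token)
        if len(buffer) == CHUNK_TOKENS:
            player.write(codec.decode(buffer))     # start playing while the rest generates
            buffer = []
    if buffer:
        player.write(codec.decode(buffer))         # flush the final partial chunk
```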
This works because audio is forgiving. Chunks of around a hundred milliseconds can be stitched together end to end, and the human ear doesn't notice the seams. The codec smooths over the joins, and what you hear is continuous speech.
The challenge is keeping the pipeline full. If the model generates tokens slower than the codec consumes them, playback stutters. If the model generates too fast, you're wasting compute. Streaming TTS is a balancing act between latency and throughput. Get it right, and you have real-time voice agents. Get it wrong, and you have robotic pauses mid-sentence.
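That balance can be stated as a single ratio: the rate at which the model produces tokens divided by the rate at which playback consumes them. A quick sanity check, with assumed numbers:

```python
codec_token_rate = 50    # tokens consumed per second of playback
generation_rate = 65     # tokens the model produces per second (assumed)

real_time_factor = generation_rate / codec_token_rate
if real_time_factor < 1.0:
    print("playback will stutter")                          # the model falls behind the speaker
else:
    print(f"headroom: {real_time_factor:.2f}x real time")   # just above 1.0 is smooth; far above it is wasted compute
```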
The Cost Problem
Voice agents are expensive. Not because the models are large - though they are - but because they run constantly. A chatbot generates one response per user message. A voice agent processes audio in both directions, transcribing everything the user says and generating speech for every word it says back. The compute adds up fast.
This is the real barrier to voice agents at scale. The technology works. The latency is manageable. But the cost per interaction is higher than text-based agents, and that limits where you can deploy them. Customer service calls, sure. Casual chat apps, maybe not.
Humeau's point is that codec efficiency is the lever. Better codecs mean fewer tokens per second of audio. Fewer tokens mean less compute. Less compute means lower cost. The architecture is settled - autoregressive transformers won. The next frontier is making them cheap enough to run everywhere.
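A back-of-envelope calculation shows why the token rate is the lever. Every number below is made up purely to show the shape of the math:

```python
cost_per_1k_tokens = 0.002   # assumed generation cost in dollars
call_seconds = 5 * 60        # a five-minute customer service call

for tokens_per_second in (100, 50, 25):
    tokens = tokens_per_second * call_seconds
    cost = tokens / 1000 * cost_per_1k_tokens
    print(f"{tokens_per_second} tok/s -> {tokens} tokens -> ${cost:.3f} per call")
```

Halve the token rate and the generation cost per call halves with it, without touching the model.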
Voice Cloning: The Easy Part
Modern TTS models can clone voices from a few seconds of audio. You speak for ten seconds, the model learns your vocal characteristics, and it can generate speech in your voice indefinitely. This is called few-shot voice cloning, and it's surprisingly straightforward.
The model doesn't learn your voice from scratch. It already knows how to generate thousands of voices - that's what it learned during pre-training. When you give it a sample of your voice, it's just finding the right point in that voice space. The codec encodes your sample into tokens, and the model conditions on those tokens to steer generation.
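Mechanically, that makes cloning look a lot like prompting. A sketch, with `codec` and `tts_model` again as hypothetical stand-ins rather than a real API:

```python
def clone_and_speak(tts_model, codec, reference_audio, text):
    voice_prompt = codec.encode(reference_audio)   # ~10 seconds of the target voice, as tokens
    audio_tokens = tts_model.generate_tokens(
        text,
        prefix_tokens=voice_prompt,                # condition generation on the reference voice
    )
    return codec.decode(audio_tokens)              # waveform in the cloned voice
```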
The hard part isn't cloning a voice. It's cloning a voice while preserving prosody - the rhythm, emphasis, and emotion of natural speech. A flat monotone reading of text sounds robotic even if the voice itself is perfect. The best TTS models learn prosody from context, adjusting pitch and pacing based on sentence structure and punctuation. But this is still a weak point. Most synthetic voices sound slightly off, not because the timbre is wrong, but because the delivery is too neutral.
What This Means for Builders
If you're building a voice agent, the architecture question is settled. Use an autoregressive transformer. Use a neural audio codec. Stream the output. The tricky parts are latency tuning and cost management. How much compute can you afford per interaction? How much delay can users tolerate?
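One way to make the latency question concrete is a time-to-first-audio budget. Every figure here is an assumption, chosen only to show the shape of the calculation:

```python
first_chunk_tokens = 5    # tokens needed before the first chunk can be decoded
generation_rate = 65      # model tokens per second (assumed)
codec_decode_ms = 15      # time to decode one chunk (assumed)
network_ms = 40           # round trip to the client (assumed)

time_to_first_audio = first_chunk_tokens / generation_rate * 1000 + codec_decode_ms + network_ms
print(f"~{time_to_first_audio:.0f} ms before the user hears anything")
```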
The other question is whether you need voice at all. Text-based agents are cheaper, faster, and often more useful. Voice makes sense when hands are busy or screens aren't available. But for most applications, text is still the better interface. Voice is the premium option, not the default.