Neural audio codecs are the reason modern text-to-speech models can run in real time. Without them, you'd be waiting minutes for a voice agent to finish a sentence. With them, you get speech generation that keeps pace with conversation. The trick is compressing audio into discrete tokens so the model can treat it like text.
Samuel Humeau's deep dive breaks down the codec-to-backbone-to-decoder pipeline that powers modern TTS. This is the architecture that Mistral, OpenAI, and just about every other voice AI company are running. Three stages, each solving a different part of the problem.
Stage One: The Codec Compresses Audio Into Tokens
Raw audio is a stream of numbers - amplitude values sampled thousands of times per second. At 48kHz, that's 48,000 numbers per second. If a TTS model tried to generate each one individually, it would take forever. The codec solves this by compressing audio into discrete tokens.
Here's how it works. The codec is a neural network trained to encode audio into a much smaller representation. Instead of 48,000 samples per second, it outputs maybe 50 tokens per second. Each token is an integer - a lookup into a codebook of learned audio patterns. The codec learns this codebook during training, figuring out which patterns appear most often in speech.
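To make the codebook idea concrete, here's a minimal sketch of the lookup in plain NumPy. The sizes and the random codebook are made up for illustration - a real codec learns its codebook jointly with the encoder and decoder, and typically stacks several residual codebooks per frame.

```python
import numpy as np

# Toy vector quantizer. The codebook below is random, purely for illustration;
# a real codec learns these entries during training.
CODEBOOK_SIZE = 1024     # number of learned audio patterns
FRAME_DIM = 128          # dimensionality of one encoded audio frame

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, FRAME_DIM))   # stands in for learned entries

def encode_frame(frame: np.ndarray) -> int:
    """Map one encoder output frame to the index of its nearest codebook entry."""
    distances = np.linalg.norm(codebook - frame, axis=1)
    return int(np.argmin(distances))

def decode_token(token: int) -> np.ndarray:
    """Look the token back up; the decoder network turns this pattern into audio."""
    return codebook[token]

frame = rng.normal(size=FRAME_DIM)     # pretend output of the codec encoder
token = encode_frame(frame)            # a single integer, 0..1023
reconstruction = decode_token(token)   # the pattern the decoder starts from
```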
The clever bit is that the codec is lossy, but the loss is mostly imperceptible. It throws away information that humans don't notice. Subtle variations in pitch, tiny gaps between phonemes, background noise below a certain threshold - all gone. What's left is a compressed representation that captures the essence of the audio without the bulk.
This is why TTS models can generate speech in real time. They're not predicting raw waveforms. They're predicting codec tokens. The model outputs an integer, the codec decodes it into a chunk of audio, and playback continues. The compression ratio is the key - a roughly 1000x reduction in data lets the model work at human speed.
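The arithmetic behind that ratio, treating the 50 tokens-per-second rate from above as an assumption:

```python
# Back-of-the-envelope sequence-length reduction (rates assumed for illustration).
samples_per_second = 48_000     # raw audio at 48 kHz
tokens_per_second = 50          # typical neural codec frame rate (assumed)

reduction = samples_per_second / tokens_per_second
print(reduction)                # 960.0 -> roughly a 1000x shorter sequence to generate
```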
Stage Two: The Transformer Backbone Generates Tokens
Once audio is compressed into tokens, the TTS problem looks exactly like text generation. You have a sequence of discrete symbols, and you need to predict the next one. This is what transformers are built for.
The backbone is an autoregressive transformer - same architecture as GPT, Claude, or any other language model. It takes in text (the thing you want spoken) and outputs a sequence of codec tokens (the speech). The model learned this mapping during pre-training on millions of hours of transcribed audio, absorbing which token sequences correspond to which words.
The important detail is that the model generates tokens sequentially. It predicts the first token, feeds it back into itself as context, then predicts the second token. This feedback loop continues until the sentence is done. It's slow by computer standards - each token requires a full forward pass through the network - but fast enough for real-time speech.
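Here's a schematic of that loop in Python. The `backbone` argument and the `sample` helper are hypothetical stand-ins, not any particular model's API - the point is the feed-the-output-back-in structure.

```python
import random

def sample(probs):
    """Toy sampling: pick a token index according to its probability."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_speech_tokens(backbone, text_tokens, max_tokens=2000, eos_token=0):
    """Autoregressive generation: each new codec token is predicted from the
    text plus every audio token generated so far."""
    audio_tokens = []
    context = list(text_tokens)         # the text to be spoken conditions everything
    for _ in range(max_tokens):
        probs = backbone(context)       # one full forward pass per token
        next_token = sample(probs)
        if next_token == eos_token:     # model signals the utterance is finished
            break
        audio_tokens.append(next_token)
        context.append(next_token)      # feed the prediction back in as context
    return audio_tokens
```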
The transformer also handles prosody - the rhythm and emphasis of speech. It learns from context. A question mark at the end of a sentence triggers rising intonation. An exclamation point triggers emphasis. Commas trigger brief pauses. The model doesn't need explicit prosody labels. It figures this out from seeing enough examples during training.
Stage Three: The Decoder Reconstructs Audio
The codec tokens coming out of the transformer are still just integers. The final stage is decoding them back into audio. This is handled by the codec decoder - the inverse of the encoder from stage one.
The decoder takes each token, looks it up in the codebook, and reconstructs the corresponding audio chunk. These chunks are 20-50 milliseconds each. The decoder stitches them together, smoothing over the seams so you don't hear the joins. The result is continuous audio that plays back at normal speed.
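Neural decoders handle most of that smoothing internally, but when chunks are decoded independently - as they are in streaming - a short crossfade at the seams is one common trick. A sketch, assuming 48 kHz audio and NumPy arrays longer than the overlap:

```python
import numpy as np

def crossfade_concat(chunks: list[np.ndarray], overlap: int = 240) -> np.ndarray:
    """Stitch decoded audio chunks, linearly crossfading over `overlap` samples
    (240 samples is 5 ms at 48 kHz) so the joins don't land as audible clicks."""
    out = chunks[0].astype(np.float64)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    for chunk in chunks[1:]:
        chunk = chunk.astype(np.float64)
        # blend the tail of what we have with the head of the next chunk
        out[-overlap:] = out[-overlap:] * fade_out + chunk[:overlap] * fade_in
        out = np.concatenate([out, chunk[overlap:]])
    return out
```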
This stage is fast - decoding is much cheaper than generation. The bottleneck is the transformer backbone. Once the tokens are ready, the decoder spits out audio almost instantly. This is why streaming works. You can start playback as soon as the first few tokens are ready, and the decoder will keep up with the transformer as it generates the rest.
Streaming: How to Hide Latency
Real-time TTS relies on overlapping generation and playback. The transformer starts generating tokens immediately. As soon as the first chunk is ready - maybe 200 milliseconds of audio - the decoder reconstructs it and playback begins. While that chunk is playing, the transformer is generating the next chunk. By the time playback reaches the end of the first chunk, the second chunk is ready.
This is called chunked streaming, and it's how voice agents feel responsive even though the model is still generating. The user hears speech start within a few hundred milliseconds, and the rest arrives in a steady stream. The illusion is that the agent is speaking in real time. The reality is that it's racing to stay ahead of playback.
The challenge is pipeline stalls. If the transformer falls behind - maybe because the sentence is complex and takes longer to generate - playback catches up and you get a stutter. The decoder runs out of tokens and has to wait. This is why latency tuning matters. You need to balance chunk size, model speed, and buffer depth so the pipeline never stalls.
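A back-of-the-envelope way to think about that budget: the backbone's generation rate has to exceed the codec's playback rate, and the initial buffer is your slack. All numbers below are assumptions, not measurements.

```python
# Toy real-time budget check.
token_rate_audio = 50       # codec tokens per second of audio played back (assumed)
gen_rate = 80               # tokens the backbone generates per wall-clock second (assumed)
initial_buffer_s = 0.2      # audio buffered before playback starts (200 ms)

rtf = gen_rate / token_rate_audio                     # real-time factor: 1.6x here
slack_tokens = initial_buffer_s * token_rate_audio    # 10 tokens of startup headroom

# If rtf >= 1 the pipeline keeps up on average; the buffer absorbs short slowdowns.
# A stall (audible stutter) happens when playback has consumed every token the
# backbone has produced and still needs more.
print(f"real-time factor {rtf:.2f}, startup slack {slack_tokens:.0f} tokens",
      "OK" if rtf >= 1.0 else "will stutter")
```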
Voice Cloning: Few-Shot Learning in Practice
Modern TTS models can clone a voice from a 10-second sample. The demo Humeau showed was striking - record yourself speaking for a few seconds, and the model generates new speech in your voice. This works because the model already learned a continuous voice space during pre-training.
The voice space is a high-dimensional representation of vocal characteristics - pitch, timbre, accent, speaking rate. Every voice is a point in this space. During few-shot cloning, the model hears your sample, figures out where you sit in voice space, and generates speech from that point. It's not learning your voice from scratch. It's finding you in a space it already knows.
The codec plays a key role here. It encodes your voice sample into a compact vector - a voice embedding - that the transformer uses to condition generation. This embedding is small, maybe a few hundred numbers. The transformer uses it to bias token generation toward your vocal characteristics. The result is speech that sounds like you, even though the model only heard you for ten seconds.
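In code, the flow looks roughly like this. Every name here is a hypothetical stand-in for the three stages described above, not a real library's API.

```python
def clone_and_speak(codec_encoder, backbone, codec_decoder, reference_audio, text):
    """Hypothetical few-shot cloning flow: embed the reference voice, condition
    generation on that embedding, then decode the tokens back to a waveform."""
    voice_embedding = codec_encoder.embed_speaker(reference_audio)   # ~a few hundred floats
    audio_tokens = backbone.generate(text, conditioning=voice_embedding)
    return codec_decoder.decode(audio_tokens)
```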
Cost: The Real Constraint
Voice agents are expensive because they generate constantly. A text-based agent generates one response per message - maybe a few hundred tokens. A voice agent churns through thousands of tokens for the same conversation - audio tokens for every word it speaks, plus tokens for every word it hears.
The cost per interaction scales with token count. Better codecs help by compressing more audio into fewer tokens. Faster models help by reducing compute per token. But the fundamental constraint remains: voice is more expensive than text. This limits where you deploy it.
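The gap shows up quickly once you count tokens. The rates below are assumptions, not measured figures.

```python
# Rough token-count comparison for one interaction (all rates assumed).
text_reply_tokens = 300                 # a typical text-agent response

codec_tokens_per_second = 50            # assumed codec frame rate
call_minutes = 5
agent_speaking_fraction = 0.5           # the agent talks about half the call

voice_tokens = codec_tokens_per_second * 60 * call_minutes * agent_speaking_fraction
print(int(voice_tokens))                # 7500 tokens generated, before counting
                                        # the incoming audio the model has to process
```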
Customer service calls - worth it, because the alternative is hiring humans. Casual chat apps - maybe not, because text is good enough and much cheaper. The business case for voice depends on whether the interaction genuinely needs it. The technology is solved. The economics are still being figured out.