Builders & Makers · Sunday, 10 May 2026

Inside Neural Audio Codecs: The Pipeline That Made TTS Fast

Neural audio codecs are the reason text-to-speech models work at all. Without them, you'd be waiting minutes for a voice agent to finish a sentence. With them, you get real-time speech generation that feels natural. The trick is treating audio like compressed text.

Samuel Humeau's deep dive breaks down the codec-to-backbone-to-decoder pipeline that powers modern TTS. This is the architecture that Mistral, OpenAI, and other voice AI companies are running. Three stages, each solving a different part of the problem.

Stage One: The Codec Compresses Audio Into Tokens

Raw audio is a stream of numbers - amplitude values sampled thousands of times per second. At 48kHz, that's 48,000 numbers per second. If a TTS model tried to generate each one individually, it would take forever. The codec solves this by compressing audio into discrete tokens.

Here's how it works. The codec is a neural network trained to encode audio into a much smaller representation. Instead of 48,000 samples per second, it outputs maybe 50 tokens per second. Each token is an integer - a lookup into a codebook of learned audio patterns. The codec learns this codebook during training, figuring out which patterns appear most often in speech.
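The codebook lookup can be sketched as vector quantization: chop the audio into frames and replace each frame with the index of its nearest codebook entry. This is a toy illustration, not a real codec - actual systems learn the codebook (often as residual vector quantization inside a neural encoder), and the codebook and frame values below are hand-made.

```python
# Toy vector quantizer: map audio frames to codebook indices.
# Real codecs learn the codebook during training; this one is hand-made.

def quantize_frame(frame, codebook):
    """Return the index of the codebook vector nearest to `frame` (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))

def encode(samples, frame_size, codebook):
    """Chop `samples` into frames and emit one integer token per frame."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [quantize_frame(f, codebook) for f in frames]

# A 4-entry codebook over 3-sample frames: silence, rise, fall, plateau.
codebook = [[0.0, 0.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.5, 0.0], [1.0, 1.0, 1.0]]
tokens = encode([0.0, 0.4, 1.0, 1.0, 0.6, 0.1], frame_size=3, codebook=codebook)
print(tokens)  # each 3-sample frame collapses to a single integer
```

Six samples become two integers; at scale, that collapse from thousands of samples to tens of tokens per second is the whole trick.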

The clever bit is that the codec is lossy in ways humans barely perceive. It throws away information that listeners don't notice. Subtle variations in pitch, tiny gaps between phonemes, background noise below a certain threshold - all gone. What's left is a compressed representation that captures the essence of the audio without the bulk.

This is why TTS models can generate speech in real time. They're not predicting raw waveforms. They're predicting codec tokens. The model outputs an integer, the codec decodes it into a chunk of audio, and playback continues. The compression ratio is the key - a 1000x reduction in data lets the model work at human speed.
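The arithmetic behind that ratio, using the figures above (48 kHz sampling, roughly 50 tokens per second - the token rate is the article's illustrative number, not a fixed spec):

```python
# Back-of-envelope compression figures from the numbers in the text.
sample_rate = 48_000        # raw amplitude samples per second
tokens_per_second = 50      # codec tokens per second (illustrative)
ratio = sample_rate / tokens_per_second
print(ratio)  # 960.0 -> roughly the "1000x" reduction in sequence length
```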

Stage Two: The Transformer Backbone Generates Tokens

Once audio is compressed into tokens, the TTS problem looks exactly like text generation. You have a sequence of discrete symbols, and you need to predict the next one. This is what transformers are built for.

The backbone is an autoregressive transformer - the same architecture as GPT, Claude, or any other language model. It takes in text (the thing you want spoken) and outputs a sequence of codec tokens (the speech). The model learns this mapping during pre-training on millions of hours of transcribed audio, working out which tokens correspond to which words.

The important detail is that the model generates tokens sequentially. It predicts the first token, feeds it back into itself as context, then predicts the second token. This feedback loop continues until the sentence is done. It's slow by computer standards - each token requires a full forward pass through the network - but fast enough for real-time speech.
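That feedback loop is simple to sketch. In this toy version, `predict_next` is a made-up deterministic rule standing in for a full transformer forward pass, and the end-of-speech token and length cap are assumptions for illustration:

```python
# Minimal autoregressive loop: each new token is predicted from everything
# generated so far, then appended back into the context.

def predict_next(context):
    # Hypothetical stand-in for a transformer forward pass over `context`.
    return (sum(context) + len(context)) % 1024  # deterministic toy rule

def generate(text_tokens, eos=0, max_len=8):
    context = list(text_tokens)      # condition on the text to be spoken
    audio_tokens = []
    for _ in range(max_len):
        tok = predict_next(context)
        if tok == eos:               # model signals end of utterance
            break
        audio_tokens.append(tok)
        context.append(tok)          # feedback loop: output becomes input
    return audio_tokens

print(generate([101, 7, 42]))  # text token IDs in, codec token IDs out
```

The structural point survives the toy model: every output token costs one full pass over the growing context, which is why generation speed, not decoding, bounds the pipeline.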

The transformer also handles prosody - the rhythm and emphasis of speech. It learns from context. A question mark at the end of a sentence triggers rising intonation. An exclamation point triggers emphasis. Commas trigger brief pauses. The model doesn't need explicit prosody labels. It figures this out from seeing enough examples during training.

Stage Three: The Decoder Reconstructs Audio

The codec tokens coming out of the transformer are still just integers. The final stage is decoding them back into audio. This is handled by the codec decoder - the inverse of the encoder from stage one.

The decoder takes each token, looks it up in the codebook, and reconstructs the corresponding audio chunk. These chunks are 20-50 milliseconds each. The decoder stitches them together, smoothing over the seams so you don't hear the joins. The result is continuous audio that plays back at normal speed.
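The lookup-and-stitch step can be sketched as codebook lookups with a short linear crossfade at each seam. The two-entry codebook and one-sample overlap are made-up illustrations; real decoders are neural networks, not table lookups, but the smoothing idea is the same:

```python
# Toy decoder: look each token up in the codebook, then crossfade adjacent
# chunks over a small overlap so the seams are inaudible.

def decode(tokens, codebook, overlap=1):
    out = []
    for tok in tokens:
        chunk = codebook[tok]
        if out and overlap:
            # Linear crossfade: blend the tail of `out` into the head of `chunk`.
            for i in range(overlap):
                w = (i + 1) / (overlap + 1)
                out[-overlap + i] = (1 - w) * out[-overlap + i] + w * chunk[i]
            out.extend(chunk[overlap:])
        else:
            out.extend(chunk)
    return out

codebook = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
print(decode([0, 1], codebook))  # the seam sample is blended, not a hard jump
```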

This stage is fast - decoding is much cheaper than generation. The bottleneck is the transformer backbone. Once the tokens are ready, the decoder spits out audio almost instantly. This is why streaming works. You can start playback as soon as the first few tokens are ready, and the decoder will keep up with the transformer as it generates the rest.

Streaming: How to Hide Latency

Real-time TTS relies on overlapping generation and playback. The transformer starts generating tokens immediately. As soon as the first chunk is ready - maybe 200 milliseconds of audio - the decoder reconstructs it and playback begins. While that chunk is playing, the transformer is generating the next chunk. By the time playback reaches the end of the first chunk, the second chunk is ready.

This is called chunked streaming, and it's how voice agents feel responsive even though the model is still generating. The user hears speech start within a few hundred milliseconds, and the rest arrives in a steady stream. The illusion is that the agent is speaking in real time. The reality is that it's racing to stay ahead of playback.

The challenge is pipeline stalls. If the transformer falls behind - maybe because the sentence is complex and takes longer to generate - playback catches up and you get a stutter. The decoder runs out of tokens and has to wait. This is why latency tuning matters. You need to balance chunk size, model speed, and buffer depth so the pipeline never stalls.
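The stall dynamics can be simulated with a simple producer-consumer buffer. All the rates below are illustrative numbers, not measurements: the generator produces `gen_rate` tokens per tick, playback drains `play_rate` once a prebuffer fills, and an underrun counts as a stutter:

```python
# Simulated streaming pipeline: generation races playback through a buffer.
# Playback starts once `prebuffer` tokens are ready; draining an empty
# buffer mid-utterance is a stall (an audible stutter).

def simulate(gen_rate, play_rate, prebuffer, total_tokens):
    buffered = generated = played = stalls = 0
    playing = False
    while played < total_tokens:
        produced = min(gen_rate, total_tokens - generated)
        generated += produced
        buffered += produced
        if not playing and buffered >= prebuffer:
            playing = True                 # enough audio buffered: start playback
        if playing:
            drained = min(play_rate, buffered)
            if drained < play_rate and played + drained < total_tokens:
                stalls += 1                # buffer underrun: playback starves
            buffered -= drained
            played += drained
    return stalls

print(simulate(gen_rate=4, play_rate=5, prebuffer=8, total_tokens=40))  # slow generator
print(simulate(gen_rate=6, play_rate=5, prebuffer=8, total_tokens=40))  # generator stays ahead
```

The first run stutters repeatedly because generation is slower than playback; the second never stalls. Deeper prebuffers trade startup latency for stall resistance, which is exactly the tuning knob the paragraph above describes.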

Voice Cloning: Few-Shot Learning in Practice

Modern TTS models can clone a voice from a 10-second sample. The demo Humeau showed was striking - record yourself speaking for a few seconds, and the model generates new speech in your voice. This works because the model already learned a continuous voice space during pre-training.

The voice space is a high-dimensional representation of vocal characteristics - pitch, timbre, accent, speaking rate. Every voice is a point in this space. During few-shot cloning, the model hears your sample, figures out where you sit in voice space, and generates speech from that point. It's not learning your voice from scratch. It's finding you in a space it already knows.

The codec plays a key role here. It encodes your voice sample into a set of parameters - a voice embedding - that the transformer uses to condition generation. This embedding is small, maybe a few hundred numbers. The transformer uses it to bias token generation toward your vocal characteristics. The result is speech that sounds like you, even though the model only heard you for ten seconds.
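The conditioning idea can be caricatured in a few lines. Everything here is invented for illustration - real voice embeddings are learned end to end, not token histograms, and real conditioning runs through the transformer's attention, not an additive bias - but the shape of the mechanism is the same: summarize the sample into a small vector, then use that vector to tilt generation:

```python
# Toy voice conditioning: a "voice embedding" summarizes the sample's codec
# tokens, then biases token scores at generation time.

def embed_voice(sample_tokens, dim=4):
    # Hypothetical embedding: a normalized histogram of the sample's tokens.
    emb = [0.0] * dim
    for t in sample_tokens:
        emb[t % dim] += 1.0
    total = sum(emb) or 1.0
    return [v / total for v in emb]

def biased_scores(base_scores, voice_emb, strength=2.0):
    # Nudge each candidate token's score toward the speaker's statistics.
    return [s + strength * voice_emb[i % len(voice_emb)]
            for i, s in enumerate(base_scores)]

emb = embed_voice([1, 1, 2, 5, 9])     # "ten seconds" of sample tokens
print(biased_scores([0.1, 0.2, 0.3, 0.4], emb))
```

The point of the sketch: the embedding is tiny relative to the model, so cloning a voice costs almost nothing at inference time - it just steers a model that already knows the voice space.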

Cost: The Real Constraint

Voice agents are expensive because they generate constantly. A text-based agent generates one response per message - maybe a few hundred tokens. A voice agent generates thousands of tokens for the same conversation - audio for every word spoken, in both directions.

The cost per interaction scales with token count. Better codecs help by compressing more audio into fewer tokens. Faster models help by reducing compute per token. But the fundamental constraint remains: voice is more expensive than text. This limits where you deploy it.
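A rough comparison using the article's framing - a few hundred text tokens versus codec tokens for the same exchange spoken aloud. The token counts and the 50-tokens-per-second rate are illustrative assumptions, not vendor figures:

```python
# Back-of-envelope: why voice costs more than text per interaction.
text_tokens = 300                      # one chat reply, a few hundred tokens
seconds_of_speech = 60                 # the same exchange spoken, both directions
audio_tokens = seconds_of_speech * 50  # at ~50 codec tokens per second
print(audio_tokens, audio_tokens / text_tokens)  # 3000 10.0
```

An order of magnitude more tokens per interaction, before counting the audio the agent has to ingest as well.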

Customer service calls - worth it, because the alternative is hiring humans. Casual chat apps - maybe not, because text is good enough and much cheaper. The business case for voice depends on whether the interaction genuinely needs it. The technology is solved. The economics are still being figured out.


Video Sources

AI Engineer
Why TTS Models Now Look Like LLMs - Samuel Humeau, Mistral
AI Engineer
Give Your Chat Agent a Voice - Luke Harries, ElevenLabs
AI Revolution
Anthropic Situation Just Got Even More INSANE
AI Engineer
Voice AI: when is the "Her" moment? - Neil Zeghidour, Gradium AI


About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.
