OpenAI released three new audio models this week: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They handle live conversations, real-time translation across 70+ languages, and streaming transcription, respectively.
The interesting bit isn't the features - it's the architecture. These aren't text models with speech bolted on. They process audio natively, which means they can reason about tone, pacing, and interruptions without converting to text first.
What GPT-Realtime-2 Does
GPT-Realtime-2 is the conversation model. It has a 128K-token context window - roughly two hours of back-and-forth dialogue. That's enough for a full customer service call, a medical consultation, or a therapy session without losing track of what was said earlier.
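A quick sanity check on that figure, assuming encoded audio costs on the order of a thousand tokens per minute (the actual rate for these models isn't published, so treat the number as a placeholder):

```python
# Back-of-the-envelope check on "128K context ≈ two hours of dialogue".
# ASSUMPTION: encoded audio costs roughly 1,000 tokens per minute; OpenAI
# hasn't published the actual rate for these models.
CONTEXT_TOKENS = 128_000
TOKENS_PER_MINUTE = 1_000

hours = CONTEXT_TOKENS / TOKENS_PER_MINUTE / 60
print(f"~{hours:.1f} hours of conversation fit in one context window")  # ~2.1
```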
It also has GPT-5-class reasoning capabilities. When you ask it to calculate something or look up information mid-conversation, it handles tool calls transparently. You don't hear it thinking or see it load. The response just includes the answer, as if it knew it all along.
For developers, this matters because it removes the scaffolding. You don't need to detect when the model needs external data, route the request, wait for a response, and stitch the answer back into speech. The model does all of that internally.
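You still declare the tools up front when the session starts. Here's a minimal sketch of that configuration, modeled on the session.update message of OpenAI's existing Realtime API - the lookup_order tool and the exact schema are illustrative assumptions, since GPT-Realtime-2's protocol hasn't been published:

```python
import json

# ASSUMPTIONS: the "session.update" shape mirrors OpenAI's existing Realtime
# API, and lookup_order is a made-up tool for illustration. The model decides
# on its own when to call it mid-conversation.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",
                "description": "Fetch an order's status by order ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
        "tool_choice": "auto",
    },
}

print(json.dumps(session_update, indent=2))
```

Your code still has to run lookup_order and send back the result; what disappears is the work of stitching that result into natural-sounding speech.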
The use case everyone's thinking about: phone systems. Customer service agents who never get tired, never lose their temper, and remember every interaction you've ever had with the company. Whether that's exciting or dystopian depends on where you sit.
Real-Time Translation Without Lag
GPT-Realtime-Translate handles live speech translation across 70+ languages. You speak English, the other person hears Mandarin, and vice versa - with minimal delay.
This isn't new as a concept. Google Translate has had live modes for years. What's different here is latency. The system processes audio in chunks small enough that conversations feel natural, not like walkie-talkie exchanges.
That's technically hard. Most translation systems wait for you to finish a sentence before translating. Natural speech doesn't work that way. People interrupt themselves, trail off, change direction mid-thought. Real-time translation has to guess when you've made your point and start converting before you've actually stopped speaking.
OpenAI's version handles that by predicting sentence boundaries. It's not perfect - you'll still get occasional awkward pauses - but it's fast enough that people can have actual conversations, not take turns broadcasting at each other.
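A toy version of that logic, with trailing silence standing in for whatever learned boundary predictor the real system uses - the chunk size, threshold, and translate() callback are illustrative stand-ins, not model internals:

```python
from array import array

CHUNK_MS = 100        # assumed chunk size; small enough to feel conversational
QUIET_LEVEL = 500     # assumed mean-amplitude threshold for "probably paused"

def is_quiet(chunk: bytes) -> bool:
    """Crude energy check on 16-bit little-endian PCM."""
    samples = array("h", chunk)
    return sum(abs(s) for s in samples) / max(len(samples), 1) < QUIET_LEVEL

def stream_translate(chunks, translate):
    """Feed audio chunks in; fire translate() at each predicted boundary.

    The real system presumably uses a learned boundary predictor; trailing
    silence is the simplest possible stand-in for that idea.
    """
    buffer, silence_ms = [], 0
    for chunk in chunks:
        buffer.append(chunk)
        silence_ms = silence_ms + CHUNK_MS if is_quiet(chunk) else 0
        if buffer and silence_ms >= 300:   # guess: "you've made your point"
            translate(b"".join(buffer))    # start speaking before the pause ends
            buffer, silence_ms = [], 0
```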
Streaming Transcription That Learns
GPT-Realtime-Whisper is the transcription model. It converts speech to text in real time and gets better as the conversation progresses.
Most transcription systems treat every utterance as independent. If you say "Kubernetes" once and it mishears it as "cool Bernetes", it'll make the same mistake again five minutes later. GPT-Realtime-Whisper learns from context. Once it figures out you're talking about container orchestration, it weights technical vocabulary higher.
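One way to picture that behavior is context biasing: keep a running vocabulary of terms the conversation has established and snap later near-misses to them. The sketch below illustrates the general idea, not OpenAI's actual mechanism:

```python
import difflib

class ContextBiasedDecoder:
    """Toy context biasing: once a term has been confirmed, later words that
    sound close to it get snapped to it. Illustrative only, not the real model."""

    def __init__(self):
        self.vocab = set()

    def observe(self, confirmed_text):
        # Remember longer words - the ones likely to be domain terms.
        self.vocab.update(w for w in confirmed_text.split() if len(w) > 6)

    def correct(self, candidate):
        fixed = []
        for word in candidate.split():
            match = difflib.get_close_matches(word, self.vocab, n=1, cutoff=0.75)
            fixed.append(match[0] if match else word)
        return " ".join(fixed)

decoder = ContextBiasedDecoder()
decoder.observe("we deploy everything on Kubernetes")
print(decoder.correct("restart the bernetes pod"))   # -> restart the Kubernetes pod
```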
For accessibility tools, this is significant. Live captions that improve over the course of a lecture or meeting, rather than staying consistently mediocre. For journalists and researchers, it means interviews that need far less post-correction.
The limitation is domain specificity. If you're discussing niche topics - rare medical conditions, obscure historical events, highly technical jargon - the model still guesses wrong. It just guesses wrong less often than it used to.
The Bigger Shift
These three models aren't separate products. They're pieces of a single architecture that processes audio end-to-end. Text is a byproduct, not the primary format.
That changes how voice interfaces work. Until now, most voice assistants followed the same pattern: convert speech to text, process the text with an LLM, convert the response back to speech. Every step adds latency. Every conversion loses information.
Native audio models skip the conversions. They hear tone, pacing, emotion, interruption directly. When you say "Wait, no, I meant..." mid-sentence, the model doesn't need to parse the text to understand you're correcting yourself. It hears the hesitation.
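For comparison, here's the cascaded pattern made explicit, with assumed per-stage latencies rather than measured ones:

```python
# The classic voice-assistant cascade. Every hop adds latency and throws away
# audio-level signal (tone, pacing, the "wait, no" mid-sentence correction).
# Per-stage timings are illustrative assumptions, not benchmarks.
STAGES_MS = {"speech_to_text": 300, "llm_over_text": 700, "text_to_speech": 250}

def cascaded_turn(audio_in, stt, llm, tts):
    text = stt(audio_in)   # hesitation and emphasis are gone after this line
    reply = llm(text)      # the model only ever sees flat text
    return tts(reply)      # a voice gets re-synthesized at the very end

print(f"latency floor per turn: ~{sum(STAGES_MS.values())} ms, before network")
```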
Whether that makes conversations feel more natural or more uncanny is still playing out. Early demos are impressive. But demos are always impressive. The test is whether people choose to use these systems when they're not being watched.
What Developers Can Build
These models are available via API now. Pricing isn't public yet, which probably means it's expensive until scale kicks in.
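Integration-wise, the shape is familiar: open a streaming connection, push audio in, read audio and events out. A rough sketch of the send side, assuming an event protocol like the existing Realtime API's input_audio_buffer.append messages - the event name is borrowed from that API, and ws_send and mic_chunks are hypothetical stand-ins for your transport and microphone capture:

```python
import base64
import json

def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a chunk of 16-bit PCM as an append event.

    The event name mirrors OpenAI's existing Realtime API; whether the new
    models keep the same protocol is an assumption.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def stream_microphone(ws_send, mic_chunks):
    """Push microphone audio as it arrives; the model's audio replies stream
    back over the same connection and are handled elsewhere."""
    for chunk in mic_chunks:
        ws_send(audio_append_event(chunk))
```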
The obvious applications: customer service, language learning, accessibility tools, live interpretation. But the less obvious ones might be more interesting. Real-time audio search. Conversational interfaces for databases. Voice-driven workflow automation that understands context and corrects itself on the fly.
The constraint is trust. People will use these systems for low-stakes interactions - booking a restaurant, asking for directions, casual chat. For high-stakes decisions, we still want humans. How fast that changes depends less on the technology and more on how many times these systems get it right when it matters.