Thinking Machines just released TML-Interaction-Small, a 276-billion-parameter mixture-of-experts model that processes audio, video, and text simultaneously, updating every 200 milliseconds. It does not wait for you to finish speaking. It does not separate inputs into discrete turns. It runs continuously, processing everything at once, and responds when it has something to say.
This is different from existing multimodal models. GPT-4o Realtime and Gemini 3.1 process audio and video, but they operate in turns - you speak, they respond, boundaries are clear. TML-Interaction-Small removes those boundaries. It watches, listens, and thinks in parallel, updating its understanding continuously.
What Continuous Processing Enables
The 200ms interval means the model samples the world five times per second. Fast enough to catch interruptions, facial expressions, and gesture changes in real time. Fast enough to respond mid-sentence if context shifts.
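To make the cadence concrete, here is a minimal sketch of what a client loop around a continuously sampling model might look like. None of this is Thinking Machines' actual API - `model.step`, `mic`, `camera`, and `speaker` are placeholders for whatever interface ships.

```python
import time

TICK = 0.2  # 200 ms: five samples per second

def continuous_loop(model, mic, camera, speaker):
    """Feed the model a fused snapshot of every input stream each tick."""
    while True:
        start = time.monotonic()
        # Grab whatever arrived since the last tick; the streams never "end".
        audio_chunk = mic.read_available()    # placeholder stream API
        video_frame = camera.latest_frame()   # placeholder stream API
        # The model updates its internal state every tick and may or may
        # not emit output - staying silent is a valid result.
        response = model.step(audio=audio_chunk, video=video_frame)
        if response is not None:
            speaker.play(response)            # placeholder output API
        # Sleep out whatever remains of the 200 ms budget.
        time.sleep(max(0.0, TICK - (time.monotonic() - start)))
```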
This unlocks three capabilities existing models struggle with: visual proactivity, continuous awareness, and background tool use.
Visual proactivity means the model can notice something in the video feed and comment without being prompted. If you are assembling furniture and reach for the wrong screw, the model can interject. If someone walks into frame during a video call, the model knows before you say anything.
Continuous awareness means the model maintains context across overlapping inputs. It does not reset between turns. If you start a sentence, get interrupted, then finish it thirty seconds later, the model remembers the first half. If you gesture at something while speaking, the model connects the gesture to the words without explicit linking.
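One way to picture that: instead of a list of turns, the context is a rolling, timestamped event log that spans modalities. A minimal sketch of such a buffer - the class and field names are ours, not the model's:

```python
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Event:
    modality: str       # "speech", "gesture", "video", ...
    content: object
    timestamp: float = field(default_factory=time.monotonic)

class RollingContext:
    """Timestamped event log in place of discrete turns (illustrative only).

    An interrupted sentence and its continuation thirty seconds later
    coexist in one window, so neither fragment is lost to a turn reset.
    """
    def __init__(self, window_seconds: float = 120.0):
        self.window = window_seconds
        self.events: deque[Event] = deque()

    def add(self, event: Event) -> None:
        self.events.append(event)
        self._evict_stale()

    def _evict_stale(self) -> None:
        cutoff = time.monotonic() - self.window
        while self.events and self.events[0].timestamp < cutoff:
            self.events.popleft()

    def snapshot(self) -> list[Event]:
        # Everything in the window, in arrival order, across modalities -
        # a gesture and the words spoken over it sit side by side.
        return list(self.events)
```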
Background tool use means the model can trigger actions without waiting for conversation to pause. According to Thinking Machines, the model can search for information, generate code, or query databases while still processing audio and video. The tool execution happens in parallel with interaction, not sequentially.
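In client code, that pattern amounts to firing tool calls as background tasks instead of blocking on them. A hedged sketch with asyncio - `model.step`, `result.tool_call`, and `observe_tool_result` are all invented names, not a documented interface:

```python
import asyncio

async def interaction_loop(model, ticks, tools):
    """Process 200 ms ticks while tool calls run as background tasks."""
    pending: set[asyncio.Task] = set()
    async for tick in ticks:                 # hypothetical async tick stream
        result = model.step(tick)            # hypothetical API
        if result.tool_call is not None:
            # Launch the tool without awaiting it - interaction continues.
            pending.add(asyncio.create_task(tools.run(result.tool_call)))
        # Fold any finished tool results back in on this tick.
        for task in [t for t in pending if t.done()]:
            pending.discard(task)
            model.observe_tool_result(task.result())  # hypothetical API
```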
Benchmark Performance
TML-Interaction-Small outperforms GPT-4o Realtime and Gemini 3.1 on interaction benchmarks. The specific benchmarks measure interruption handling, multi-input synthesis, and response latency under continuous input conditions. These are not standard language model benchmarks - they test how well models handle messy, overlapping real-world interaction.
The model's mixture-of-experts architecture is key here. Different expert networks handle audio processing, video analysis, and language generation. Because they operate in parallel, the model can process all three input streams without bottlenecking on any single modality. The experts share learned representations but specialise in their domain.
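Thinking Machines has not published layer-level details, but the general shape is easy to sketch: modality-specific experts projecting into a shared space, fused by a common trunk. A toy PyTorch version, with made-up dimensions and one expert per modality where a real MoE layer would route per token among many:

```python
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    """Toy sketch: per-modality experts over a shared representation.

    Purely illustrative - a real MoE routes each token among many
    experts; this collapses that to one fixed expert per modality.
    """
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Each expert maps raw features into the shared d_model space.
        self.audio_expert = nn.Sequential(nn.Linear(128, d_model), nn.GELU())
        self.video_expert = nn.Sequential(nn.Linear(1024, d_model), nn.GELU())
        self.text_expert = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        # A shared trunk fuses all three streams after expert processing.
        self.fuse = nn.TransformerEncoderLayer(d_model, nhead=8,
                                               batch_first=True)

    def forward(self, audio, video, text):
        # Experts run independently, so no stream bottlenecks another;
        # the modalities meet only in the shared fusion trunk.
        tokens = torch.cat([
            self.audio_expert(audio),   # (batch, audio_len, d_model)
            self.video_expert(video),   # (batch, video_len, d_model)
            self.text_expert(text),     # (batch, text_len, d_model)
        ], dim=1)
        return self.fuse(tokens)
```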
Latency matters more in continuous interaction than in turn-based systems. A 200ms response delay is imperceptible in conversation. A 2-second delay breaks flow. TML-Interaction-Small is optimised for the former - fast enough to feel immediate, slow enough to process complex inputs properly.
What This Means for Builders
If you are building voice interfaces, video assistants, or collaborative tools, continuous processing changes the design space. You no longer need to manage turn-taking logic or explicit input boundaries. The model handles interruptions, overlapping speech, and multi-person conversations natively.
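The difference shows up directly in application code. A rough before-and-after, with invented APIs on both sides:

```python
# Turn-based design: the app owns endpointing and turn boundaries.
def turn_based(model, recorder):
    audio = recorder.record_until_silence()  # app decides when you're done
    return model.respond(audio)

# Continuous design: the app forwards streams and reacts to events.
# Interruptions, overlap, and multiple speakers are the model's problem.
def continuous(model, streams, on_speak, on_tool):
    for event in model.stream(streams):      # hypothetical event API
        if event.kind == "speak":
            on_speak(event.audio)
        elif event.kind == "tool_result":
            on_tool(event.payload)
```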
For customer service applications, this means agents can monitor calls in real-time and surface information proactively. For accessibility tools, this means interfaces that respond to gesture, speech, and context simultaneously. For collaborative software, this means assistants that watch your screen, listen to your explanations, and offer suggestions without being asked.
The challenge is interaction design. When the model can interject at any moment, how do you prevent it from being intrusive? When it processes everything continuously, how do you signal when it should stay quiet? These are not technical problems - they are human factors problems. The model is fast enough to interrupt naturally. Whether it should is a different question.
The Bigger Shift
We have spent years teaching models to wait their turn. Polite, structured, turn-based interaction. TML-Interaction-Small does the opposite - it processes everything, all the time, and jumps in when it has something useful to contribute.
That shift from reactive to proactive is significant. It moves AI interfaces from tools you invoke to collaborators that observe and assist. The model does not wait to be asked. It watches what you are doing, understands context, and offers help when relevant.
Whether people want that is unclear. Some tasks benefit from proactive assistance - technical support, tutoring, real-time collaboration. Others require tools that stay silent until summoned. The interaction model that works for air traffic control does not work for creative writing.
For developers, TML-Interaction-Small is worth testing in scenarios where continuous awareness adds value. Where interruption is not rude but helpful. Where processing multiple input streams simultaneously solves a real problem. The model is fast, capable, and built for exactly that use case. Whether your application needs it is the question to answer first.