Boston Dynamics put a vision-language model into Spot and sent it into a house. No task-specific programming. No hard-coded routines. Just instructions like "tidy up" and "get me a drink" - and Spot figured out the rest.
The demo shows Spot navigating rooms, identifying objects, and completing multi-step tasks by understanding what it's looking at. When asked to tidy up, it doesn't follow a script. It sees a bag on the floor, recognises that bags belong in cupboards, opens the cupboard, and puts the bag away. When asked for a drink, it identifies the fizzy drinks among other bottles, picks one, and delivers it.
This is embodied reasoning - the model understands the physical world and reasons about it in real time.
What Changed
Until now, getting a robot to complete household tasks meant programming every step. You'd define what "tidy" means, where objects belong, how to grip each item, what path to take. The robot executes instructions, but it doesn't understand the task.
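That scripted approach can be sketched like this. Everything here is invented for illustration - the rules, grip values, and action strings are hypothetical, not anything from Boston Dynamics' stack - but it shows how every piece of task knowledge had to be written down by hand:

```python
# A hypothetical sketch of the traditional, fully scripted approach:
# every object, destination, and grip is hard-coded by the developer.
# All names and values here are invented for illustration.

TIDY_RULES = {          # what "tidy" means, defined by hand
    "bag": "cupboard",
    "cup": "sink",
    "toy": "toy_box",
}

GRIP_FORCE = {          # how to grip each item, defined by hand
    "bag": 0.4,
    "cup": 0.7,
    "toy": 0.5,
}

def tidy_scripted(objects_on_floor):
    """Return the fixed action sequence for a known set of objects.

    Anything not in TIDY_RULES simply cannot be handled - the script
    has no way to reason about unfamiliar objects.
    """
    actions = []
    for obj in objects_on_floor:
        if obj not in TIDY_RULES:
            actions.append(f"skip {obj} (no rule defined)")
            continue
        actions.append(f"grip {obj} at force {GRIP_FORCE[obj]}")
        actions.append(f"carry {obj} to {TIDY_RULES[obj]}")
    return actions

print(tidy_scripted(["bag", "plant"]))
# -> ['grip bag at force 0.4', 'carry bag to cupboard',
#     'skip plant (no rule defined)']
```

The `"skip plant"` line is the whole problem in miniature: the robot does exactly what its tables say, and nothing else.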
Spot's new capability comes from Google's Gemini Robotics model - a vision-language model trained to understand spatial relationships, object properties, and task context. It sees the environment through Spot's cameras and reasons about what actions make sense. The breakthrough isn't the hardware - Spot's been around for years. It's that vision models can now handle the messy, unstructured reality of a real house.
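The model-driven alternative can be sketched as a perceive-reason-act loop. This is a hypothetical sketch, not the Gemini Robotics API: `query_vlm` is a canned stub standing in for the real model, and the observation and action strings are invented. The structure is the point - the code defines only the loop, and the model supplies the task logic:

```python
# A hypothetical perceive-reason-act loop. query_vlm is a stub standing
# in for the actual vision-language model; in a real system it would
# receive camera frames and return the next sensible action.

def query_vlm(instruction, observation):
    """Stub model: maps (instruction, scene description) to an action.

    A real VLM would reason about the scene; this canned mapping only
    exists so the control loop below is runnable.
    """
    if "bag on floor" in observation:
        return "pick up bag"
    if "holding bag" in observation and "cupboard closed" in observation:
        return "open cupboard"
    if "holding bag" in observation and "cupboard open" in observation:
        return "place bag in cupboard"
    return "done"

def run_task(instruction, world):
    """Control loop: perceive, ask the model, act, repeat.

    `world` is a toy simulator - a list of scene descriptions the
    'camera' returns on successive steps.
    """
    actions = []
    for observation in world:
        action = query_vlm(instruction, observation)
        if action == "done":
            break
        actions.append(action)   # on a real robot: execute the action
    return actions

scene = [
    "bag on floor, cupboard closed",
    "holding bag, cupboard closed",
    "holding bag, cupboard open",
    "floor clear",
]
print(run_task("tidy up", scene))
# -> ['pick up bag', 'open cupboard', 'place bag in cupboard']
```

Notice that nothing in `run_task` mentions bags or cupboards. Swap the stub for a real model and the same loop handles a different house, a different object, or a different instruction without new code paths.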
The model doesn't just recognise objects. It understands relationships. A bag isn't just a bag - it's something that belongs somewhere, probably in a cupboard. A fizzy drink isn't just a bottle - it's the right kind of bottle when someone asks for something carbonated. This contextual reasoning is what makes the tasks feel genuinely intelligent rather than scripted.
Why This Matters for Builders
For anyone developing robotic systems, this shifts the engineering problem. Instead of programming task logic, you're now curating training data and tuning inference. The robot learns what "tidy" means by seeing examples, not by following code paths. That's a completely different development cycle.
It also means robots can handle variability. If the house layout changes, if objects move, if someone asks for a slightly different task - the model adapts. You're not rewriting code every time the environment shifts. The system reasons about new situations using the same underlying understanding of physics, objects, and goals.
The practical applications extend far beyond household chores. Warehouses with constantly changing inventory. Construction sites where conditions shift daily. Healthcare environments where each patient's needs differ. Any domain where you can't script every possibility benefits from a system that can see, understand, and reason about what to do next.
The Honest Limitations
Boston Dynamics' demo is impressive, but it's a demo. Real-world deployment means handling failures gracefully, operating safely around people, and dealing with edge cases that didn't appear in training data. The model might misidentify objects, misjudge distances, or make decisions that seem logical but aren't safe in context.
There's also the question of speed. Visual reasoning takes compute. Spot pauses to think before acting - that's fine for tidying up, less fine for time-sensitive tasks. And the model's decision-making process is still largely opaque. When Spot makes a mistake, debugging why is harder than tracing through traditional code.
But the direction is clear. Robots that understand tasks rather than execute scripts. Systems that adapt to new environments without reprogramming. Physical agents that can take vague instructions and figure out the details. We're watching the gap close between "follow these steps" and "achieve this goal" - and that changes what's possible to build.