Computer use goes mainstream. Humanoids scale. Real productivity gains emerge.

Today's Overview

The AI landscape shifted this week in three concrete ways. GPT-5.4 shipped with native computer use that beats human performance on desktop automation, scoring 75% on the OSWorld benchmark versus a 72.4% human baseline. That's no longer just a research milestone: it's a production API available now. Meanwhile, Agility Robotics rebranded to just "Agility", signalling readiness to move beyond pilot deployments. And across the economic data, productivity gains are finally showing up where they matter - not in boardroom pilots, but in real work.

When Computer Use Stops Being Experimental

The numbers here deserve attention because they cross a threshold. GPT-5.4 is the first general-purpose model to ship native computer use and outperform humans on autonomous desktop tasks. It works two ways: through code (Python with Playwright for structured interfaces) or through raw screenshot analysis and keyboard commands for anything else. OpenAI also added automatic tool search - the model finds relevant tools itself rather than requiring you to specify every option manually. Combined with a 1-million-token context window, this means you can point an agent at a complex workflow and let it work out the path forward.
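The two modes described above amount to a routing decision, which can be sketched in a few lines. This is illustrative only: `Mode`, `Target`, `choose_mode`, and `run_step` are hypothetical names, not part of any OpenAI API, and the fallback from the scripted path to screenshots is an assumption made for the sketch rather than documented GPT-5.4 behaviour.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Mode(Enum):
    CODE = "code"              # scripted automation, e.g. Python with Playwright
    SCREENSHOT = "screenshot"  # read pixels, emit mouse and keyboard commands


@dataclass(frozen=True)
class Target:
    """Hypothetical description of the interface an agent must operate."""
    name: str
    scriptable: bool  # True when the UI exposes structure a script can address


def choose_mode(target: Target) -> Mode:
    """Prefer code for structured interfaces; use screenshots for anything else."""
    return Mode.CODE if target.scriptable else Mode.SCREENSHOT


def run_step(
    target: Target,
    run_code: Callable[[Target], str],
    run_screenshot: Callable[[Target], str],
) -> tuple[Mode, str]:
    """Run one step, falling back to screenshots if the scripted path fails.

    The fallback is an assumption made for this sketch.
    """
    if choose_mode(target) is Mode.CODE:
        try:
            return Mode.CODE, run_code(target)
        except RuntimeError:
            pass  # scripted path failed; fall through to the screenshot path
    return Mode.SCREENSHOT, run_screenshot(target)
```

In a real agent, `run_code` and `run_screenshot` would wrap whatever automation and vision calls your stack provides; here they are plain callables so the routing logic can be tested on its own.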

But here's what matters for builders: Claude, now at Opus 4.6, has had production computer use for over a year. That's a year of real-world edge cases, failures, and iteration. GPT-5.4's computer use is a v1. The benchmarks are real, but reliability at scale is a different question. If you're starting a new project today, GPT-5.4's all-in-one package is compelling. If you've already built on Claude's computer use, you have something battle-tested. With two credible options, the landscape now supports provider-agnostic architectures - and that's the real win.

Humanoids Move From Pilot to Scale

Agility's rebrand from "Agility Robotics" to just "Agility" reads like a company preparing for something bigger. The company said the change "allows space for us to grow as we explore new use cases, services, and industries." Translation: it's moving beyond the humanoid narrative into something broader. Its Digit robot is now deployed at Toyota Canada, GXO Logistics, Schaeffler, and Amazon - not pilots anymore, but actual warehouse operations. The company remains on track to deliver its first cooperatively safe humanoid in 2026. A rebrand usually signals internal confidence that the product works.

Productivity Finally Appears in the Data

The meta-story this week is that productivity gains are now measurable and significant. US productivity grew 2.8% year-on-year in 2025 - roughly double the pace of the previous decade. That's still below the dot-com era, but the trajectory matters. Micro-level studies show productivity gains of 14% in customer service, 26% for developers, and around 25% for consultants on tasks AI can handle. These are chatbot-era numbers, not agent numbers yet. When agents start embedding themselves in workflows - which is happening right now - the gap between assisted and unassisted work widens. Goldman Sachs said AI added basically zero to US GDP in 2025, but it's measuring the wrong unit: it counted company pilots, not individual workers who've already deployed capabilities their entire organisations haven't matched. And as more real work runs through these models, provider diversification suddenly matters a lot more.

The practical takeaway: test GPT-5.4's computer use on your actual workflows, not benchmarks. If you've built on Claude, monitor the competitive landscape but don't panic-switch. Build your agent architectures to be model-agnostic. MCP (Model Context Protocol) and clean abstraction layers between business logic and the model layer aren't luxuries anymore; they're essential infrastructure.
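As a rough illustration of that takeaway, here is a minimal sketch of an abstraction layer plus a workflow-level comparison. Everything in it is hypothetical: `ComputerUseAgent`, `StubAgent`, `success_rate`, and `compare` are names invented for this example, and the stub stands in for real provider adapters, which would wrap each vendor's SDK behind the same `run` method.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, Sequence


class ComputerUseAgent(Protocol):
    """The only surface business logic is allowed to depend on."""
    name: str

    def run(self, task: str) -> bool:
        """Attempt one task; return True on success."""
        ...


@dataclass(frozen=True)
class StubAgent:
    """Stand-in for a real provider adapter; succeeds on a fixed set of tasks."""
    name: str
    solvable: frozenset

    def run(self, task: str) -> bool:
        return task in self.solvable


def success_rate(agent: ComputerUseAgent, tasks: Sequence[str]) -> float:
    """Fraction of your own workflow tasks the agent completes."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(agent.run(task) for task in tasks) / len(tasks)


def compare(agents: Iterable[ComputerUseAgent], tasks: Sequence[str]) -> dict[str, float]:
    """Score every agent on the same task list, keyed by agent name."""
    return {agent.name: success_rate(agent, tasks) for agent in agents}
```

Because `compare` only sees the `ComputerUseAgent` interface, adding or swapping a provider means writing one new adapter class; the task list and the scoring code stay untouched. The task list itself should come from your real workflows rather than a public benchmark.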