Opus 4.8 launches with agent orchestration. Robotaxis ship.

Today's Overview

Anthropic raised $65B at a $965B post-money valuation this week and disclosed $47B in annual run-rate revenue. Those numbers matter less than what shipped alongside them: Claude Opus 4.8, which fixes the core complaints about the previous version. That model was lazy, overconfident, and prone to stopping halfway through hard tasks. The new model is sharper about what it doesn't know, flags its own mistakes earlier, and can work independently for longer without getting stuck.

What Changed in the Model

The headline improvements are calibration and honesty. Opus 4.8 doesn't hallucinate completion. It won't tell you a task is done when it's halfway through. It says "I don't know" instead of guessing. On benchmarks (SWE-Bench Pro, APEX-SWE, FrontierSWE), it's now state of the art. But the more important detail is how it performs on real work: Cognition's Devin tool saw merged PRs jump from 16% to 80% of commits in their internal repos over the past few months. That's not a model score. That's a production signal.

The other release worth your time is Dynamic Workflows in Claude Code, which Anthropic calls "ultracode" internally. Claude now writes orchestration scripts that spawn hundreds of parallel subagents, each working on a specific piece of a large problem. Ramp's CEO used this to rewrite Bun from Zig to Rust, 750k lines, in 6 days. Most of the test suite passed. The catch is obvious: runs like this are token-expensive and burn through quota quickly. But it works.
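The fan-out pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: run_subagent is a hypothetical stand-in for whatever call actually launches a subagent, and max_parallel is an assumed knob for capping how many run at once, which is the main lever on token and quota burn.

```python
import asyncio


async def run_subagent(task: str) -> dict:
    # Hypothetical stand-in: a real orchestrator would call a model or
    # agent API here. This stub only yields control and reports success.
    await asyncio.sleep(0)
    return {"task": task, "status": "done"}


async def orchestrate(tasks: list[str], max_parallel: int = 8) -> list[dict]:
    # The semaphore caps the number of subagents in flight at once.
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(task: str) -> dict:
        async with sem:
            try:
                return await run_subagent(task)
            except Exception as exc:
                # One failed subagent should not sink the whole run.
                return {"task": task, "status": "failed", "error": str(exc)}

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(bounded(t) for t in tasks))


if __name__ == "__main__":
    work = [f"port module {i}" for i in range(20)]
    results = asyncio.run(orchestrate(work, max_parallel=4))
    print(sum(r["status"] == "done" for r in results), "of", len(results), "succeeded")
```

Capturing failures per task instead of letting them propagate is a design choice for large runs: the orchestrator can retry only the pieces that failed instead of restarting everything.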

The Infrastructure Reality

Behind every shipping agent is infrastructure nobody talks about until it breaks. Walden Yan from Cognition and Cole Murray (who built OpenInspect, an open-source background agent system) sat down this week to walk through what actually matters: repo setup, VM snapshots, secret scoping, GitHub integrations that don't loop infinitely, and testing workflows that combine screenshots with video verification. The unsexy detail that keeps coming up is this: most agent deployments fail not because the model is bad, but because companies don't have working local development environments. When an agent needs to test code, it can't. Docker helps, but full VMs are what Cognition actually uses. And even VMs have weird corners: file system performance, nested virtualization for Android development, and snapshots that save only the diff instead of the whole terabyte.

The other infrastructure shift happening now: agents are becoming first responders in production. SRE alerts come in. The agent triages them, collects context from logs and the database, and opens a PR. Customer support tags an issue in Slack. The agent reconstructs the problem, queries the codebase, and gives you a full analysis before any human even looks. This only works if permissions are scoped, if secrets live outside the agent's sandbox, and if there's a kill switch. Security and governance become first-class problems, not afterthoughts.
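To make the scoping and kill-switch requirements concrete, here is a minimal sketch of a guard that every agent action passes through. All names here (AgentGuard, allowed_actions, the action strings) are hypothetical and chosen for illustration. A real deployment would enforce this outside the agent's process, and secrets would be resolved by the guarded functions themselves, never handed to the agent.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class PermissionDenied(RuntimeError):
    """Raised when the agent requests an action outside its scope."""


class KillSwitchEngaged(RuntimeError):
    """Raised for every action once an operator has halted the agent."""


@dataclass
class AgentGuard:
    # Hypothetical allowlist of action names this agent may perform.
    allowed_actions: frozenset
    killed: bool = False
    audit_log: list = field(default_factory=list)

    def kill(self) -> None:
        # Operator-facing kill switch: blocks all further actions.
        self.killed = True

    def run(self, action: str, fn: Callable[..., Any], *args: Any) -> Any:
        # The kill switch is checked first so a halted agent can do nothing.
        if self.killed:
            raise KillSwitchEngaged(f"agent halted; refused {action!r}")
        if action not in self.allowed_actions:
            raise PermissionDenied(f"{action!r} is outside this agent's scope")
        self.audit_log.append(action)
        return fn(*args)


if __name__ == "__main__":
    guard = AgentGuard(allowed_actions=frozenset({"read_logs", "open_pr"}))
    print(guard.run("read_logs", lambda: "last 100 log lines"))
    guard.kill()
    try:
        guard.run("open_pr", lambda: "PR opened")
    except KillSwitchEngaged as exc:
        print(exc)
```

The audit log records only actions that were actually allowed to run, which gives a human reviewer a compact trail of what the agent did before anyone looked.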

What's Shipping in Robotics

Jensen Huang said this week that the robotaxi moment is here: not coming, but now. The ecosystem building autonomous vehicles on NVIDIA's stack is expanding. Real deployments are happening. In the ROS world, 31 teams advanced from the AI for Industry Challenge qualification phase, all building contact-rich manipulation tasks (cable insertion, specifically) on ROS 2. This is the other side of the agent story: not software agents orchestrating code, but physical agents learning to manipulate the world. The problems are different (you need real hardware, real testing loops, real failure modes), but the throughline is the same. Can you get an agent to work reliably on a task that matters?