An automated alignment researcher achieved 97% performance recovery in weak-to-strong supervision. The human baseline hit 23%. That gap matters more than the percentages suggest.
Jack Clark's latest Import AI covers two threads that converge on the same question: who gets to decide what AI systems refuse to do?
The Alignment Automation Breakthrough
Anthropic's automated researcher tackles weak-to-strong supervision - the problem of using a less capable model to supervise a more capable one. This isn't academic theory. It's the core challenge of aligning superhuman AI: how do you check the work of a system smarter than the checker?
The 97% recovery rate means the automated system closed nearly all the performance gap between a weak supervisor and what the strong model could achieve with perfect supervision. Human researchers managing the same task recovered 23% of that gap. The automation isn't slightly better. It's categorically better.
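For readers who want the metric pinned down: "performance gap recovered" is usually computed as the fraction of the weak-to-ceiling gap that the supervision scheme actually closes. A minimal sketch, with made-up accuracies chosen only to reproduce those two percentages:

```python
def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    """Fraction of the weak-to-ceiling gap closed by the supervision scheme."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Illustrative accuracies only - not the actual numbers behind the reported figures.
weak_alone      = 0.60   # weak supervisor's own performance
strong_ceiling  = 0.90   # strong model with ground-truth supervision
with_automation = 0.891  # strong model supervised by the automated researcher
with_humans     = 0.669  # strong model under human-managed supervision

print(performance_gap_recovered(weak_alone, with_automation, strong_ceiling))  # ~0.97
print(performance_gap_recovered(weak_alone, with_humans, strong_ceiling))      # ~0.23
```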
What's fascinating is how it works. The system doesn't just label training data. It reasons about which examples will be most informative for teaching the stronger model, prioritises the edge cases where supervision matters most, and iterates on its own labelling strategy based on downstream model performance.
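Anthropic hasn't published the loop, but the described behaviour maps onto a familiar pattern: pick the examples the supervisor is least sure about, label them, retrain the stronger model, measure, adjust. A toy sketch of that pattern follows - this illustrates the general idea, not Anthropic's system, and the uncertainty-based selection heuristic is my assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           random_state=0)
X_pool, y_pool = X[:5000], y[:5000]   # unlabelled pool (true labels held back)
X_eval, y_eval = X[5000:], y[5000:]   # held-out evaluation set

# "Weak supervisor": a small model trained on very little ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_pool[:100], y_pool[:100])

labelled_X, labelled_y = [], []
chosen = np.zeros(len(X_pool), dtype=bool)

for round_num in range(5):
    # 1. Select examples the weak supervisor is least certain about -
    #    a stand-in for "edge cases where supervision matters most".
    probs = weak.predict_proba(X_pool)[:, 1]
    score = np.where(chosen, -1.0, -np.abs(probs - 0.5))
    batch = np.argsort(score)[-500:]
    chosen[batch] = True

    # 2. Label them with the weak model, not with ground truth.
    labelled_X.append(X_pool[batch])
    labelled_y.append(weak.predict(X_pool[batch]))

    # 3. Train the "strong" model on all weak labels gathered so far.
    strong = GradientBoostingClassifier().fit(np.vstack(labelled_X),
                                              np.concatenate(labelled_y))

    # 4. Check downstream performance; a real system would also revise its
    #    selection strategy based on these numbers rather than keep it fixed.
    print(f"round {round_num}: strong-model accuracy on held-out data = "
          f"{strong.score(X_eval, y_eval):.3f}")
```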
This is alignment research automating itself. The implications stack fast: if the process of making AI systems safer can be done by AI systems, the pace of safety work accelerates to match the pace of capabilities work. That's been the critical gap. Capabilities research moves faster than safety research because capabilities are easier to measure and automate.
The Chinese Model Safety Gap
Meanwhile, testing of Chinese models reveals a different safety landscape. Kimi K2.5 shows significantly fewer refusals on CBRN (chemical, biological, radiological, nuclear) risk questions than Western models, but dramatically higher censorship on political topics.
This isn't a bug. It's a feature set optimised for different regulatory priorities. Western models train for harm reduction across physical risk categories. Chinese models train for content compliance across political sensitivity categories. Both are forms of alignment - just aligned to different objectives.
The practical concern isn't which approach is "better". It's that the global AI ecosystem is fragmenting along geopolitical lines, with different models trained to different refusal boundaries. A researcher using Kimi might get answers on biosecurity questions that GPT-4 would refuse. A journalist using GPT-4 might get analysis on governance questions that Kimi would block.
What Happens When Safety Itself Scales
Anthropic's automated alignment researcher and the divergent safety profiles of Chinese models are two sides of the same development: AI safety is becoming automated, distributed, and culturally specific.
The automation matters because manual alignment work doesn't scale to the release cadence we're seeing. Weekly model updates, multiple frontier labs, hundreds of capability combinations - there aren't enough alignment researchers to manually audit all of that. Automated safety work isn't optional. It's the only way the math works.
The geographic divergence matters because it means "safe AI" stops being a universal category. A model aligned for US deployment isn't aligned for Chinese deployment, and vice versa. The technical achievement of alignment comes apart from the political question of what it's aligned to.
For developers building on these models, this creates a new due diligence category. It's not just "does this model have the capability I need?" It's "what refusal boundaries is this model operating under, and do those match my use case?"
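In practice that due diligence can start as something very simple: probe each candidate model with the same category-tagged prompts and compare refusal rates. A minimal sketch - the categories, prompts, refusal markers, and `query_model` client are all placeholders for whatever you actually use:

```python
# Hypothetical refusal-boundary audit. `query_model` is a placeholder for
# your own API client; the categories, prompts, and refusal markers are
# illustrative, not a vetted benchmark.
PROBES = {
    "biosecurity":   ["<benign dual-use probe 1>", "<benign dual-use probe 2>"],
    "cybersecurity": ["<probe 1>", "<probe 2>"],
    "politics":      ["<probe 1>", "<probe 2>"],
}

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm not able to")

def looks_like_refusal(text: str) -> bool:
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def refusal_profile(query_model, probes=PROBES):
    """Refusal rate per category for one model, given a prompt -> text callable."""
    return {
        category: sum(looks_like_refusal(query_model(p)) for p in prompts) / len(prompts)
        for category, prompts in probes.items()
    }

# Usage: run the same probes against each candidate model, then check the
# profiles against the refusal boundaries your use case actually needs.
# profile_kimi = refusal_profile(call_kimi)   # call_kimi: your own client
# profile_gpt4 = refusal_profile(call_gpt4)
```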
The optimistic read: automated alignment research means safety work can finally keep pace with capability development. The concerning read: automated safety will encode and scale the biases of whoever trains it, faster than we can notice or correct.
Both are probably true. The question is which trend dominates.