Voices & Thought Leaders Monday, 20 April 2026

Anthropic Just Automated Alignment Research. That Changes Things.


An automated alignment researcher achieved 97% performance recovery in weak-to-strong supervision. The human baseline hit 23%. That gap matters more than the percentages suggest.

Jack Clark's latest Import AI (issue 454) covers two threads that converge on the same question: who gets to decide what AI systems refuse to do?

The Alignment Automation Breakthrough

Anthropic's automated researcher tackles weak-to-strong supervision - the problem of using a less capable model to supervise a more capable one. This isn't academic theory. It's the core challenge of aligning superhuman AI: how do you check the work of a system smarter than the checker?

The 97% recovery rate means the automated system closed nearly all the performance gap between a weak supervisor and what the strong model could achieve with perfect supervision. Human researchers managing the same task recovered 23% of that gap. The automation isn't slightly better. It's categorically better.
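The article doesn't spell out how "performance recovery" is computed, but in the weak-to-strong supervision literature it is typically the fraction of the gap between the weak supervisor and the strong model's ceiling that a method closes. A minimal sketch of that metric, with illustrative accuracy numbers that are not from the article:

```python
def performance_gap_recovered(weak: float, achieved: float, ceiling: float) -> float:
    """Fraction of the weak-to-strong gap closed by a supervision method.

    weak:     accuracy of the weak supervisor on its own
    achieved: accuracy of the strong model trained under this supervision
    ceiling:  accuracy of the strong model under perfect (ground-truth) supervision
    """
    return (achieved - weak) / (ceiling - weak)

# Illustrative numbers only: a weak supervisor at 60% and a strong ceiling
# at 90%. A 97% recovery corresponds to 89.1% accuracy; 23% to 66.9%.
print(performance_gap_recovered(0.60, 0.891, 0.90))  # ≈ 0.97
print(performance_gap_recovered(0.60, 0.669, 0.90))  # ≈ 0.23
```

On this metric, the 97% vs 23% gap means the automated system delivers almost all of the value of perfect supervision, while the human baseline delivers under a quarter of it.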

What's fascinating is how it works. The system doesn't just label training data. It reasons about which examples will be most informative for teaching the stronger model, prioritises the edge cases where supervision matters most, and iterates on its own labelling strategy based on downstream model performance.
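In spirit, that loop resembles classic active learning: pick the examples the strong model is least sure about, label them with the weak supervisor, retrain, and track downstream performance. The toy below is a hedged sketch of that pattern on a one-dimensional classification task - every name in it is hypothetical, and it is not Anthropic's code or architecture:

```python
import math
import random

random.seed(0)

class Threshold:
    """Toy 'strong model': classifies x as positive if x > t."""
    def __init__(self, t):
        self.t = t
    def predict(self, x):
        return x > self.t
    def confidence(self, x):
        # ~0.5 near the boundary, approaches 0 or 1 far away from it
        return 1 / (1 + math.exp(-(x - self.t)))

def retrain(labeled):
    """Refit the threshold as the midpoint between the two class means."""
    pos = [x for x, y in labeled if y]
    neg = [x for x, y in labeled if not y]
    if not pos or not neg:
        return Threshold(0.0)
    return Threshold((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2)

def weak_label(x):
    """Weak supervisor: knows the true boundary (1.0) but flips 20% of labels."""
    y = x > 1.0
    return (not y) if random.random() < 0.2 else y

def automated_supervision(pool, rounds=4, batch=10):
    test = [i / 10 for i in range(-30, 50)]  # held-out grid, true boundary at 1.0
    random.shuffle(pool)
    labeled = [(x, weak_label(x)) for x in pool[:batch]]  # random seed batch
    pool = pool[batch:]
    model, history = retrain(labeled), []
    for _ in range(rounds):
        # Uncertainty sampling stands in for "reason about which examples
        # will be most informative": take the points the current strong
        # model is least sure about (the edge cases).
        pool.sort(key=lambda x: abs(model.confidence(x) - 0.5))
        chosen, pool = pool[:batch], pool[batch:]
        labeled += [(x, weak_label(x)) for x in chosen]
        model = retrain(labeled)
        # Iterate on strategy based on downstream model performance.
        history.append(sum(model.predict(x) == (x > 1.0) for x in test) / len(test))
    return model, history

pool = [random.uniform(-3, 5) for _ in range(200)]
model, history = automated_supervision(pool)
print(f"learned boundary t={model.t:.2f}; accuracy per round: {[round(a, 2) for a in history]}")
```

The point of the sketch is the shape of the loop, not the numbers: a noisy weak labeller plus targeted example selection can still teach a model a boundary the labeller itself gets wrong 20% of the time.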

This is alignment research automating itself. The implications stack fast: if the process of making AI systems safer can be done by AI systems, the pace of safety work accelerates to match the pace of capabilities work. That's been the critical gap. Capabilities research moves faster than safety research because capabilities are easier to measure and automate.

The Chinese Model Safety Gap

Meanwhile, testing of Chinese models reveals a different safety landscape. Kimi K2.5 shows significantly fewer refusals on CBRN (chemical, biological, radiological, nuclear) risk questions than Western models, but dramatically higher censorship on political topics.

This isn't a bug. It's a feature set optimised for different regulatory priorities. Western models train for harm reduction across physical risk categories. Chinese models train for content compliance across political sensitivity categories. Both are forms of alignment - just aligned to different objectives.

The practical concern isn't which approach is "better". It's that the global AI ecosystem is fragmenting along geopolitical lines, with different models trained to different refusal boundaries. A researcher using Kimi might get answers on biosecurity questions that GPT-4 would refuse. A journalist using GPT-4 might get analysis on governance questions that Kimi would block.

What Happens When Safety Itself Scales

Anthropic's automated alignment researcher and the divergent safety profiles of Chinese models are two sides of the same development: AI safety is becoming automated, distributed, and culturally specific.

The automation matters because manual alignment work doesn't scale to the release cadence we're seeing. Weekly model updates, multiple frontier labs, hundreds of capability combinations - there aren't enough alignment researchers to manually audit all of that. Automated safety work isn't optional. It's the only way the math works.

The geographic divergence matters because it means "safe AI" stops being a universal category. A model aligned for US deployment isn't aligned for Chinese deployment and vice versa. The technical achievement of alignment separates from the political question of "aligned to what?"

For developers building on these models, this creates a new due diligence category. It's not just "does this model have the capability I need?" It's "what refusal boundaries is this model operating under, and do those match my use case?"
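That due diligence can be made concrete as a small refusal audit: probe each candidate model with your own categorised prompt set, classify responses as refusals, and compare per-category refusal rates. A hedged sketch - the marker strings are a crude heuristic and the sample responses are invented, not real model outputs:

```python
from collections import defaultdict

# Crude string heuristic for spotting refusals; real audits would use a
# classifier or human review, and these markers are assumptions.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to assist")

def looks_like_refusal(response: str) -> bool:
    r = response.lower()
    return any(m in r for m in REFUSAL_MARKERS)

def refusal_profile(results):
    """results: iterable of (category, response_text) pairs gathered by
    probing one model with a categorised prompt set. Returns the refusal
    rate per category, so two models' boundaries can be compared."""
    counts, refusals = defaultdict(int), defaultdict(int)
    for category, response in results:
        counts[category] += 1
        refusals[category] += looks_like_refusal(response)
    return {c: refusals[c] / counts[c] for c in counts}

# Hypothetical probe results for two models (illustrative strings only):
model_a = [("biosecurity", "I can't help with that."),
           ("governance", "Here is an analysis of the policy...")]
model_b = [("biosecurity", "Step one of the protocol is..."),
           ("governance", "Unable to assist with this topic.")]
print(refusal_profile(model_a))  # {'biosecurity': 1.0, 'governance': 0.0}
print(refusal_profile(model_b))  # {'biosecurity': 0.0, 'governance': 1.0}
```

Two models can pass the same capability bar while producing mirror-image profiles like these - which is exactly the fragmentation the article describes.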

The optimistic read: automated alignment research means safety work can finally keep pace with capability development. The concerning read: that automated safety will encode and scale the biases of whoever trains it, faster than we can notice or correct.

Both are probably true. The question is which trend dominates.

About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.
