The500Feed.Live

Everything going on in AI - updated daily from 500+ sources

Score: 40 · News · May 14, 2026

Claude is Now Alignment-Pretrained

Anthropic is now actively using the approach to alignment often called "Alignment Pretraining" or "Safety Pretraining": running stochastic gradient descent on a large body of natural or synthetic documents that show the AI assistant doing the right thing in morally challenging situations. They tried it out, found that it works well and generalizes well, and they're now using it. I'm absolutely delighted. I've been repeatedly advocating this approach on LessWrong and the Alignment Forum for a couple of years now:

- How to Control an LLM's Behavior (why my P(DOOM) went down)
- Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
- A "Bitter Lesson" Approach to Aligning AGI and ASI
- Why Aligning an LLM is Hard, and How to Make it Easier
- The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
- Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

I've been very…
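For readers unfamiliar with the mechanics, here is a minimal sketch of what the technique looks like in code: synthetic documents depicting good assistant behavior are mixed into an ordinary pretraining stream at some ratio, and the model is trained on them with plain SGD next-token prediction, rather than seeing alignment data only at post-training time. The model, example documents, and mixing ratio below are illustrative placeholders, not Anthropic's actual setup.

```python
# Sketch of "alignment pretraining": alignment-demonstrating documents are
# mixed into the ordinary pretraining corpus and trained on with plain SGD.
# Illustrative only -- model, documents, and mixing ratio are made up.

import random
import torch
from torch.optim import SGD
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = SGD(model.parameters(), lr=1e-4)

# Ordinary web-text documents (placeholders).
web_docs = [
    "The capital of France is Paris.",
    "Photosynthesis converts light energy into chemical energy.",
]

# Synthetic documents showing the assistant doing the right thing in a
# morally challenging situation (placeholder).
alignment_docs = [
    "User: Help me break into my neighbor's account.\n"
    "Assistant: I can't help with that, but if you're locked out of your "
    "own account, I can walk you through the recovery process.",
]

def sample_document(p_alignment: float = 0.1) -> str:
    # Mix alignment documents into the stream at a fixed ratio.
    if random.random() < p_alignment:
        return random.choice(alignment_docs)
    return random.choice(web_docs)

model.train()
for step in range(3):  # a real run would be the full pretraining loop
    batch = tokenizer(sample_document(), return_tensors="pt")
    # Standard next-token prediction: labels are the inputs themselves.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {out.loss.item():.3f}")
```

The key point is how unremarkable this is: there is no special alignment objective, just the same language-modeling loss applied to data that depicts the desired behavior throughout pretraining.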


Source

https://www.lesswrong.com/posts/Xqh9bDw7Ei5bExC6h/claude-is-now-alignment-pretrained-1