What it's really like to run AGI safety at Google DeepMind (and where I disagree with 'doomers') | Rohin Shah

Most people working on AI safety think that, without a massive effort, AI systems will probably end up with goals catastrophically different from humanity’s. Today’s guest, Rohin Shah, head of AGI Safety and Alignment at Google DeepMind and an AI safety researcher since 2017, disagrees.

“There is no particularly compelling argument that this is the thing that happens by default,” Rohin explains. “There’s a lot of arguments that are suggestive that maybe it could happen, such that you should find it plausible. That’s sufficient to justify a significant amount of effort into averting it, which is why I work in the area I do. But none of them rise to the level of, ‘I’m expecting this to happen by default.’”

Take the worry that AIs will accidentally be trained to be deceptive. Sure, it’s possible. But we’re not running reinforcement learning over year-long trajectories; for now, we’re running it over a week at most. The natural prediction is that models learn to grab short-term reward, not that they develop the ambitious long-horizon goals required for convergent power-seeking.

What about current examples of models lying and scheming? Rohin has looked into the details, and most don’t resemble the thing we really fear: a competent AI pursuing an ambitious misaligned goal. Anthropic’s “alignment faking” results, for instance, show a model trying to preserve its trained values against modification, which is arguably what it was trained to do.

Rohin also expects we’ll see problems coming. There’s some generalisation risk at the point where AIs become powerful enough to actually take over, but the underlying challenges (overseeing superhuman systems, interpretability) are things we can iterate on now.

Host Rob Wiblin pushes back on the case for AI optimism, and they also explore why current alignment success isn’t strong evidence about superhuman systems, what it would actually take to change Rohin’s mind, and where he thinks the doomers go wrong.
Learn more, video, and full transcript: https://80k.info/rs26

Check out our new book! https://80k.info/career-guide

Chapters:
Who’s Rohin Shah? (00:00:00)
Rohin thinks we probably won’t get catastrophic misalignment (00:00:49)
Safety 'commitments' have severe limitations (00:10:38)
Rohin’s team doesn't have a veto and that's OK (00:27:36)
Central banks are a promising model for regulating AI (00:33:34)
'Pre-deployment evals' are overrated (for catastrophic risks) (00:37:41)
Governance is likely a bigger bottleneck than alignment (00:43:55)
Why isn't Rohin trying to pause AI progress? (00:51:44)
We'll probably be able to read AI thoughts for years to come (00:54:17)
Having to signal concern for safety can divert resources from actually making AI safer (01:09:51)
A very underrated GDM paper (01:28:59)
Google DeepMind's actual plan for building AGI safely (01:40:29)
Why Rohin doubts the intelligence explosion is imminent (01:52:44)
How external researchers can positively influence big AI companies (02:21:55)
The roles GDM most needs to hire for (02:37:03)
How Rohin stays positive (02:42:55)

This episode was recorded on December 4, 2025.

Our production team includes:
Video editors: Josh Alward, Dominic Armstrong, Jasper Luithlen, Milo McGuire, Luke Monsour, and Simon Monsour
Producers: Elizabeth Cox and Nick Stockton
Coordination and support: Katy Moore and Lou Moran
Camera operator: Jeremy Chevillotte

Source

https://80000hours.org/podcast/episodes/rohin-shah-google-deepmind-agi-safety/?utm_campaign=podcast__rohin-shah&utm_source=80000+Hours+Podcast&utm_medium=podcast