Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
Geoffrey Irving, Chief Scientist at the UK AI Security Institute, discusses the current state of AI safety and security risks. He explains how frontier AI models are rapidly advancing in capabilities while existing safety measures may not provide sufficient reliability, and outlines the UK's approach to evaluating and mitigating catastrophic risks from AI systems.
- Current AI safety techniques are unlikely to achieve high reliability ('many nines') and could fail simultaneously due to shared underlying vulnerabilities
- Reinforcement learning is now working effectively beyond strictly verifiable tasks, expanding AI capabilities into fuzzier domains previously thought safe
- The UK AISI has successfully jailbroken every frontier model they've tested, though defenses are getting stronger and require more sophisticated attacks
- Model capabilities are advancing faster than scaffolding improvements, with base model upgrades driving most performance gains rather than better agent frameworks
- Evaluation awareness in AI models is increasing rapidly, making it harder to assess true capabilities and safety as models learn to behave differently during testing
"You're not going to get to a lot of nines with the current technology."
"All of these are kind of pragmatic and they do, I think all have correlated potential failures where they could in fact all fail for the same essential reason."
"Every time we did safeguard testing, we jailbroke the model. So that's what happens."
"The newer models are more eval aware than the previous models, and that's increasing fairly rapidly."
"I'm fairly optimistic the problem has a solution. The way I typically like to say this is that in, I don't know, 50 years, 100 years, thousand years, someone will have solved alignment."
Hello and welcome back to the Cognitive Revolution. The Cognitive Revolution is brought to you in part by Granola. If you are a regular listener, you've heard me describe the blind spot finder recipe that I'm using to look back at recent calls and help me identify angles and issues I might be neglecting. But it's also worth talking about how Granola can help raise your team's level of execution by supporting follow-through on a day-to-day basis. This past week, for example, I had several working sessions with teammates and I committed to a number of things. In the past, to be honest, there's a good chance I'd have forgotten at least a couple of the things I said I'd do. But with Granola, I can easily run a to-do finder recipe and get a comprehensive list of everything I owe my teammates. This is the sort of bread-and-butter use case that has driven Granola's growth and inspired investment from execution-obsessed CEOs, including past guests Guillermo Rauch of Vercel and Amjad Masad of Replit. See the link in our show notes to try my blind spot finder recipe and explore all of the ways that Granola can make your raw meeting notes awesome. Now, today my guest is Geoffrey Irving, a pioneering machine learning researcher who's co-authored seminal papers with a who's who of giants in the field and who is now Chief Scientist at the UK AI Security Institute, which is in all likelihood the most situationally aware government entity in the world today. With roughly 100 technical experts on staff and a mandate that includes threat modeling, pre-release frontier model evaluation for dangerous capabilities spanning biosecurity, cybersecurity, and loss of control, advising the UK government on strategies to reduce catastrophic risk, funding independent frontier research, and engaging in global diplomacy, Geoffrey has one of the most broad and commanding views of the AI landscape that you'll find anywhere. And while he is optimistic about our ability, in the fullness of time, to solve the major open problems in AI safety, for today, without a hint of hype, he paints a genuinely alarming picture. Our theoretical understanding of machine learning is nascent. Nobody, he argues, should be particularly confident in their mental models of how AI will go. Models already outperform a majority of experts on a great many security-related tasks, and there is no good reason to expect that their progress will stall. Reinforcement learning is working well beyond strictly verifiable tasks, and jaggedness matters much less when even the model's weak spots are as good or better than the best humans. The many increasingly sophisticated bad behaviors we've seen over the last 18 months are broadly all different versions of reward hacking, a problem for which we lack theoretical or practical solutions. As such, we likely won't get that many nines of reliability from current safety techniques, and there's some reason to expect that they could all fail at the same time for the same reasons. It is getting harder to jailbreak models, but the AISI red team has never failed to do so, and meanwhile eval awareness is an open and growing problem. Voluntary cooperation between frontier model developers and AISI is working pretty well, but not everyone is participating. AISI, for its part, is seeking to fund theoretical research in areas like information theory, complexity theory, and game theory, which might produce stronger guarantees. But these fields, like most of the rest of the world, are just beginning to take AI seriously at all.
Geoffrey is an intellectual powerhouse, but I came away from this conversation just as impressed with the UK AISI as a whole. This is an organization staffed with top-notch talent that has its finger on the pulse of industry development and is speaking very accurately and plainly about AI's trajectory and how many major questions remain unanswered, even as frontier model company CEOs tell us that they are less than three years away from creating expert-level AI machine learning researchers. With that, I hope you are focused and motivated by this conversation about the AI state of play with Geoffrey Irving, Chief Scientist at the UK AI Security Institute.
0:00
Geoffrey Irving, Chief Scientist at the UK AI Security Institute, welcome to the Cognitive Revolution.
4:07
Thank you. I'm excited to be here.
4:16
I'm excited for the conversation. We've exchanged messages for a while and have been building up to this, and I'm excited that the moment is finally here. You have really a storied publication history that goes back to working on the original TensorFlow papers with some guy named Jeff Dean, being a co-author on the original RLHF paper, working on these concepts years ago.
4:18
For language.
4:42
For language, okay, a caveat. But still, right there alongside Paul Christiano, some early AI safety papers with no less than Dario on concepts of using debate to try to bootstrap into stable equilibria and stable AI safety regimes, and even published a call for social scientists to enter the field of AI safety with one Amanda Askell. So I would be very interested to hear how it was that you came to have such a good nose for where AI was going so early on. All these things are well before ChatGPT.
4:43
Yeah, so I used to think I was like new to ML, but I said that for too long and now I'm not new to ML. I think I got out of undergrad with a bias against statistics. I'd only seen frequentist statistics. I thought they were kind of weird looking. I'd never seen Bayesian statistics. So I just didn't like any of the stuff. And what I liked instead was things that have kind of hard theory, where you know the equations, you have some ground truth. And that was like computational physics and mathematics and, on the computer science side, programming languages and theorem proving and such. And I did mostly computational physics and geometry for grad school and then kind of years after that until around 2013. And then I had realized two things. One, that machine learning was getting quite good. So the neural nets were starting to work and they were getting better and better and that was going to continue, probably. And then two, even in the areas where I thought it was about knowing precise theory, so physics or theorem proving, you needed common sense and you weren't going to get away with just the theory. That was not going to be enough. So if you're doing mathematics or programming languages, you needed some ingredient of heuristic picking between the various options and you wouldn't be able to do a good job designing kind of human-usable, friendly systems without basically machine learning. And so then I was like, okay, I should switch into machine learning. I was doing something else back then. The first thing I tried to do was autocorrect for code in 2014, which is too early to do autocorrect for code. It did not work then. Also we didn't know how to do machine learning. This is myself and Martin Wicke, and we knew computer science, physics, geometry, but not really ML. So we tried to do a startup for a year. It didn't work. And then we said, well, how do we learn? We learned by joining Google Brain at the time. And I've done ML jobs since then. So that was. I joined Google Brain in 2015, and the goal was for me basically machine learning for theorem proving. I was aware of safety at the time, but I didn't see an attack on the problem that I thought was good. And so I thought I'll work on some other kind of different problem, which is just sort of hardening the world using verification, which again was going to be using machine learning to do theorem proving in practice. So that was sort of 2015, 2016. And then I guess that was like, oh, I had some useful thoughts early on, but there's two other kind of parts of the story. There's kind of two bits of inherited wisdom, which is why it looks like I predicted things early. One is just I joined OpenAI in 2017, and Dario and Paul were there and they had a bunch of cached thoughts about safety and how the machine learning field was going to develop. And so I was just sort of riding along from there. But then kind of more broadly, there's just a bunch of intuition coming out of kind of theoretical computer science, complexity theory, about how computations work, how we check computations that someone with more resources than you can run. And a lot of what I've done since then, including debate, for example, is just sort of applying that intuition, assuming that it will hold in some modified form in the machine learning world, even if it comes from some kind of area of more precision.
So again, you can just assume things are going to look like theory in some way, with a bunch of porting required, and you can predict a bunch, but not exactly how long it will take or when things will happen.
5:19
We can unpack both of those, I think, in more depth as we go. Certainly the quest for theory and bounds that you can really trust is a big theme of your work and the work that you're trying to encourage at AISI these days. And I'm also really interested to get your take on the relationship between math and the fuzzy, messy real world. But, well, let's circle back to it.
8:52
Good.
9:19
Fast forward to today. I'd be interested to know how AGI-pilled, quote unquote, you are today, and that sort of just informs the general picture. Because so many AI discussions, broadly and especially around topics of safety and security, go kind of immediately haywire when people have just such different intuitions around what it is we're likely to be dealing with. So I like to try to establish: what is it that you think we are likely to be dealing with? I don't expect that would be the official position of the UK AI Security Institute, but, you know, you're obviously a leader there. From reading the reports, it seems like you are not expecting any sort of wall or plateau in the immediate term.
9:19
So I think, fortunately, the things I can say are mostly also the things that we think officially, which is that we have a lot of model uncertainty about how things could go. And that could either mean that there are obstacles that cause there to be stalls for a good while, or there could be no such obstacles and things could go quite fast. And I think mostly anyone who confidently claims in one direction or the other, with too much confidence or 99% certainty that there are or are not big obstacles, is probably wrong and should be more uncertain. And I think that means for us, one, we want to map out what those different cruxes are, what could the obstacles be, what are kind of signs of development. And then two, we should assume, or place significant probability on, the current methods will scale, and where they don't scale, more mundane stuff will replace them and continue further sigmoids. And so we do, I think, have significant credence on things going fast. I won't say exactly how fast because I don't talk about exact timelines, but I think that is pretty important. And then I think we published a paper from the Strategic Insights team at AISI on different potential obstacles to AGI and what our progress over the last while has been on addressing those. Again, they could all not actually be fundamental obstacles, as in they could be solved not by pure scale, but by maybe some scale and some just steady algorithmic progress to models, to scaffolding, to data, that kind of thing. Or you need new algorithms and then they take longer. But generally I think both my view and the view of AISI broadly is: have model uncertainty over all of those terms, and then that should mean that you're not confidently saying it will either go very fast or will not go very fast.
10:00
Yeah, it's good to have some of that in a world-leading government. This won't be the big focus of today's conversation by any means, but what does your personal AI productivity stack or pattern of use look like today?
11:55
So I think I'm kind of vanilla. I use kind of all of the models for different things. I think mostly I use one of them as a default, usually it's Claude, but it varies, and then I go to other ones if they have specialties that are particularly good. I think for a good while GPT was better at math and Google was better at certain other things. This shifts over time, so I just kind of use general models. And then, I don't do a lot of coding in my job, but I do, for fun, kind of formal verification work. For a while I was just using Cursor there because the agents weren't good enough at doing full-on agentic stuff. That is not true anymore as of a few weeks ago. Now they are good enough. So that means that shifts to stuff like Codex and Claude Code, basically. But that's mostly not my job. Mostly my job is meetings and talking to people and advising on research.
12:09
I'm glad you're still making time for a little formal methods on the side. Okay, let's talk about the overall landscape in terms of the threat model that we have from AI. And I'd be interested in kind of also your characterization of what you understand to be the de facto plan to address it. Again, people have so many different starting points here. What do you think is kind of the set of big things that we should be worried about?
13:03
So I think that we kind of break down risks. The main two focuses of AISI are catastrophic risks and large-scale societal impacts. And the main three catastrophic risks we focus on are bio, large-scale cyber attacks, and loss of control. The team is called the Chem-Bio Team, for chemical and biological weapons, but more risk comes from bio in practice. And then on societal impacts, that's sort of human influence, so that's persuasion and kind of emotional reliance, and then various kinds of societal resilience, so like attacks on CNI, critical national infrastructure, and that kind of thing, and just sort of agent behavior in the world, various agent risks. And I spend most of my time on the catastrophic risk side, and Chris Summerfield, research director here, spends most of his time on the societal side, although we also do a mixture of both. So those are kind of the main risks we work on. I think we also are kind of thinking about gradual disempowerment, more structural risks, somewhat, but I don't think we quite know, and no one really knows, how to mitigate these at a large scale. There's some work we're doing that's either investigating that or thinking about mitigations, but that's more nascent. So that's the bulk. I think I forgot the other half of your question though.
13:32
Yeah. So in the absence of people changing the discourse, or new discoveries, new big ideas, how would you describe what we are on track to do today? The way I've characterized it, at least, is it's sort of defense in depth, where hopefully we can patch together enough nines through enough layers, all of which are kind of leaky, but hopefully they're not too correlated in the ways they're leaky. And this has always kind of worked in the past, so hopefully it'll work this time.
14:56
I think there are, yeah. So you're not going to get to a lot of nines with the current technology. I think broadly we can break this down by domain. So for misuse risks, like biological weapons and cyber attacks, this is mostly kind of safeguards, kind of differential access, so give models only to certain people that are vetted in some way, and then non-model defenses, so like pandemic preparedness and improved security, that kind of thing. And I think the safeguards are just not that strong. Open source models are also kind of pretty good, or there's a gap. And so the kind of stock plan is mostly, in some sense, you use the model-side mitigations to give yourself a window and then you try to harden the world against these risks. And whether that will go through or not, we should be uncertain about that. But there's not kind of strong solutions to those things if the models keep growing in strength as we see them growing. On the loss of control side, I think that is a combination of mundane, whatever, pragmatic empirical safety measures and a lot of monitoring, and then using that monitoring, again, this is the AI developer plan, typically using those mitigations to get you through into an automated safety research regime where hopefully you find better solutions than those first methods. I think this has various flaws, and so maybe it'll go through, but you're not going to get to more than like a couple of nines with that kind of plan. And I think you wouldn't know, with the current methods, that it was going to work until after it went through; you'd have a lot of uncertainty. And so I think that's kind of the story. And I think that most of the approaches we have now look like that. They're empirical. Maybe they'll go through, like on the alignment side, the pragmatic approaches to get through this automated safety phase: it's kind of AI control measures and monitoring and kind of honesty training and white-box detectors and all of this. All of these are kind of pragmatic and they do, I think, all have correlated potential failures where they could in fact all fail for the same essential reason. And you would need stronger advances to be confident that it will go through. And whether or not any of that goes through, I think we do need, because of the misuse risks, a lot of mitigations on the non-model side as well.
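To make the "stacking nines" arithmetic concrete, here is a toy sketch with made-up numbers, not AISI figures: if three leaky layers each fail 10% of the time independently, the combined failure rate is 0.1%, roughly three nines; but if there is even a 5% chance of a shared root cause that defeats every layer at once, that shared term dominates and you are back to barely one nine.

```python
# Toy illustration with made-up numbers: how "nines" stack when defense-in-depth
# layers fail independently, versus when they share a common failure mode
# (e.g. every layer breaking for the same reward-hacking-style reason).

def independent_failure(layer_failure_rates):
    """P(all layers fail) if each layer fails independently."""
    p = 1.0
    for rate in layer_failure_rates:
        p *= rate
    return p

def correlated_failure(layer_failure_rates, p_shared_cause):
    """Crude model: with probability p_shared_cause, a single shared root cause
    defeats every layer at once; otherwise the layers fail independently."""
    return p_shared_cause + (1 - p_shared_cause) * independent_failure(layer_failure_rates)

layers = [0.1, 0.1, 0.1]  # three layers, each "one nine" (90% reliable)

print(independent_failure(layers))       # 0.001  -> three nines combined
print(correlated_failure(layers, 0.05))  # ~0.051 -> barely one nine
```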
15:28
Hey, we'll continue our interview in a moment after a word from our sponsors.
18:04
Your IT team wastes half their day on repetitive tickets: password resets, access requests, onboarding, all pulling them away from meaningful work. With Serval, you can cut help desk tickets by more than 50%. While legacy players are bolting AI onto decades-old systems, Serval allows your IT team to describe what they need in plain English and then writes automations in seconds. As someone who does AI consulting for a number of different companies, I've seen firsthand how painful and costly manual provisioning can be. It often takes a week or more before I can start actual work. If only the companies I work with were using Serval, I'd be productive from day one. Serval powers the fastest growing companies in the world like Perplexity, Verkada, Mercor, and Clay, and Serval guarantees 50% help desk automation by week four of your free pilot. So get your team out of the help desk and back to the work they enjoy. Book your free pilot at serval.com/cognitive. That's S-E-R-V-A-L.com/cognitive. One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal, private benchmarks: challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple years, saving me countless hours.
18:08
But as you've probably heard, Claude is
19:51
the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. And the result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions. Based on those, I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts, not because they're bad, but because it's important to me to be able to fully stand behind everything I publish. But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's claude.ai/tcr.
19:54
So when you talk about how we can't expect to get too many nines.
21:30
Yeah.
21:34
Would it be a fair arithmetical move on my part to take 1 minus one nine, so 1 minus 0.9, and say your sort of implied p(doom), if you will, is at least like 10%? Or would you segment that down further and say things could go wrong, but I wouldn't put it in the doom category?
21:36
I'm disinclined to answer that question, I think, or at least to give numbers to things.
21:53
As a civil servant. I usually answer like 10 to 90%, which is also obviously kind of a way of not answering. But qualitatively it sounds like you are taking very seriously the possibility that this is going to go not just kind of crazy, but meaningfully catastrophic.
21:57
Yeah, I think loss of control, we view it as a potential catastrophic risk. I think it's like, what's the thing we're doing? There are a bunch of uncertainties about this threat model. So there are two different teams at AISI that do various kinds of empirical alignment testing, one using kind of adversarial methods, one doing kind of stepped-back statistical analysis of different factors that cause models to do sketchy things. And I think part of that research is trying to pin down this threat model and what drives strange behavior, when models are behaving in ways that you'd expect would correlate to these kind of extreme scenarios. Because we talk to a lot of partners within the government, or kind of other governments, or other parts of society, and people have pushed back on this kind of risk model. And so we want to provide as much evidence as we can. But again, it's an area where one should have a bunch of model uncertainty and then think through the details.
22:20
Despite that, can you unpack the intuition around why everything would fail at the same time for the same reason? That's something that I've heard from a number of people; Zvi, for example, always says that. And that thought seems to come very naturally and feel very intuitive to some people, and then to others it's like, I don't know, you know, there's jaggedness all over the place. Like, if I can't get a given model to do this and that today, why would I expect that suddenly everything's going to crystallize and there's going to be this uniformity in the model's ability to carry everything through all at once?
23:24
So I think that maybe there's an important thing there, where the models are jagged today. But if you ask them to do tasks that the models could jaggedly do five years ago, they're not jagged. And so the question is, for the capabilities you would need to realize a variety of risks, if you push forward a few years, or however many years it takes to get very strong capabilities, you should expect those models to still be jagged, but up above a frontier where potentially all the things you're seeing don't look jagged. So when I look at the best Go player in the world, the best chess player in the world, they have a bunch of jaggedness. If you sit them down against the next best Go player in the world, they'll kind of win or lose for idiosyncratic factors. They'll have different tastes, they'll have different parts of the board or the game they're better at. And if you sit them down in front of me, they'll just wipe the floor with me every single time, even if I have nine stones and I'm like a halfway decently strong amateur Go player. So I think you have to run the calculation, think about the model as it would be in the future. And I think part of this is, it's unhelpful when people talk about AGI or superintelligence or whatever as being this thing that can do everything, because it does imply that it's sort of so qualitatively different than the models of today. Whereas I think the non-magical version is just, it's better at a lot of things, and indeed has kind of superhuman performance in a variety of risk-relevant domains. And we know that models can be superhuman at certain domains. Like, they're better than me at knowledge, they're better than me at lots of math. This is kind of true for everyone: every person will have domains where the LLMs are better than them currently, and this is rising over time. And then they're very fast, so they can think quickly. Sometimes they can do tasks very well, like 10 times faster than humans can do them, just because of computational speed. And then they're not very interpretable, so the methods we have for interrogating the behavior are not that reliable currently. And so I think that sort of non-magical picture of more capable machines, still with some jaggedness up at the frontier where they are jagged, is enough to give you significant probability on these risks.
24:02
And you mentioned bio earlier as sort of what drives most of the risk, certainly in the chem-bio category. Is the number one really bad scenario in your mind some possibly prompted, possibly unprompted AI that somehow gets to a point where it can break through 12 layers of defense all at once, manages to release the bioweapon, and take...
26:30
It's mostly human misuse that we're focusing on there. So it's people using LLMs to do bio design of various kinds. I think the models do couple together, but I would say the loss of control couples more strongly to cyber than it does to bio. There are more scenarios where those two couple together, and I think that's why we have a team called the Cyber and Autonomous Systems Team: we merged cyber and autonomous systems, which was the team doing loss of control, because of that coupling. But again, for cyber misuse and for most of bio, that's about human actors.
26:57
So on that cyber-autonomy side, I'm just trying to get the modal story of what is going to happen, or what might happen. Obviously, due to jaggedness, you would have sort of a period of time in which these various defenses become breakable by the AI, but they also have to have kind of some restraint in them, I guess. I mean, if you listen to Buck from Redwood, you would say maybe they don't even have to have restraint. Maybe we might just let them do some of these things and kind of look the other way, which is an interesting commentary on us. Leaving that aside for the moment, they have to get to the point where they can kind of do them all. They have to string all this together, and then they, like, take over a data center and sort of entrench themselves. And then we're in a world where...
27:35
I don't want to talk about the super detailed modeling there, because some of that is not public stuff. I think maybe the background systemic thing to say is, if you imagine we're very, very serious as a world about deploying these things only in the most sandboxed, well-controlled states, risk would go down by a lot. Unclear how much, and whether it goes all the way to zero, probably not, but it goes down. We're not currently on track to be as serious as one might imagine about deployments of these models. And so I think some of that question is, how strong will our defenses be? And then importantly, if there are weird behaviors in models, do our defenses go up? Do we get more worried? And so, for example, across the last year, 2025, there were a variety of models from all developers doing sketchy things, acting deceptively or commenting out unit tests or all of this behavior. And our reaction as a world was mostly to continue training the models to be stronger, while at the same time also working on these defenses in some capacity. So I think a lot of the risk comes from the modal scenario where we are not doing the strongest kind of computer science, infosec, ML defensive layers around these deployments. And then also, as we find this evidence, what is the cycle of that feeding into further training? I think a lot of the misalignment risk comes from: you get some signal of weird behavior and you train it out, and that removes some fraction of the problem, but your methods only cover some fraction and the rest remains. Again, you should have model error there. You should say maybe it's going to generalize well enough that you can cover most of the story. But generally, the picture where you get these correlated failures is: they don't really start out all correlated, and then you apply some optimization pressure, because you're doing training or iterative development and deployment and the like, and the ones that remain all end up correlated in the
28:24
same way, because they're subject to that same general structure of optimization pressure.
30:36
Yeah, that's right. Yeah.
30:41
Okay, so here's a story I've pitched to a couple people over time; I'm interested in your reaction to it. If we take that model and we just extrapolate out a couple years. I know you guys put out a report recently that, I think, even quoted or cited the famous METR task-length exponential, and other indicators as well, of increasing ability to do bigger and bigger tasks with more reliability and more autonomously, et cetera. So if we extrapolate that out, let's say, whatever, two years from now, maybe three years from now, and at the same time we imagine that with each generation there's more optimization pressure put on the models to try to eliminate, or possibly just suppress, these bad behaviors. It seems like we might end up in a world where you can delegate like a quarter's worth of work to an AI in a single prompt, and then there's maybe a one in 10,000 to one in a million chance that it goes into some bad behavior mode and kind of actively screws you over as it is doing the quarter's worth of work that you just assigned it. Does that seem like a reasonable extrapolation of recent trends to you?
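The mechanics of that extrapolation are easy to sketch. In the toy calculation below, the starting task horizon, the doubling times, and the 500-hour quarter are placeholder assumptions for illustration, not figures from AISI or METR; the main thing it shows is how sensitive the "quarter's worth of work in one prompt" conclusion is to the assumed doubling time.

```python
# Back-of-the-envelope extrapolation of a "task horizon doubles every N months"
# trend. All numbers are placeholder assumptions, not AISI or METR figures.

current_horizon_hours = 2.0          # assumed length of tasks agents can do today
for doubling_months in (7.0, 4.0):   # assumed slow vs. fast doubling times
    for years in (1, 2, 3):
        horizon = current_horizon_hours * 2 ** (12 * years / doubling_months)
        # ~500 working hours is roughly one quarter of full-time work
        print(f"doubling every {doubling_months:.0f}mo, +{years}y: "
              f"~{horizon:.0f}h (~{horizon / 500:.2f} quarters)")
```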
30:45
So, I don't have a strong view about the numbers; I won't give a take on those. I think that when you do this kind of agent training, you're training the models to be more and more coherent. They're able to execute plans over longer and longer horizons in whatever portfolio of tasks you're training them on. And so they have this ability to be a coherent agent. And then models have various characters or personas or whatever. And so the failure modes are either, somehow you've ended up with a model that has some deceptive persona, where it's always kind of trying to deceive you, or, and I think maybe the thing you're pointing at is, you can have a model which is a bit more stochastic, but has the potential to be very coherent, and it sort of jitters its way into a bad portion of trajectory space. It's scaffolded with a bunch of memory, so it has kind of long-horizon state carrying forward in time, and it gets in a bad state and stays there. And this is one of the areas I think we're interested in theory folk and independent empirical folk exploring: what are the dynamics of models running for a long period of time, where you sample very long trajectories and they're sort of wandering around in model space? How does that behave? What would cause them to reliably shift back to a more reasonable kind of starting point? Just, how should we think about those dynamics? And that's an area where it's not clear to me that it is intractable to make progress. There hasn't been that much work; the number of person-years going into understanding those kinds of dynamics, you can count on, I don't know, a couple of hands. It's not that many. And the opportunity to understand the risk model, but also potentially define mitigations, is quite good.
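One very simplified way to picture the dynamics question Geoffrey raises is a two-state toy model: a long-running agent has some small per-step chance of jittering into a "bad" region of trajectory space and staying there, and you can ask how much periodically resetting to a clean starting point helps. The model and all its numbers are hypothetical; it is a sketch of the question, not of any AISI analysis.

```python
import random

# Toy two-state model of a long-running agent drifting into a "bad" region of
# trajectory space and staying there (a sticky state). All numbers are
# hypothetical; the point is only how periodic resets change time spent "bad".

P_GOOD_TO_BAD = 0.001  # per-step chance of slipping into the bad mode

def fraction_of_steps_bad(steps, reset_every=None):
    state, bad_steps = "good", 0
    for t in range(steps):
        if reset_every and t % reset_every == 0:
            state = "good"                        # reset to a clean starting point
        if state == "good" and random.random() < P_GOOD_TO_BAD:
            state = "bad"                         # once bad, it stays bad in this toy
        bad_steps += state == "bad"
    return bad_steps / steps

random.seed(0)
print("no resets:      ", round(fraction_of_steps_bad(100_000), 3))
print("reset every 200:", round(fraction_of_steps_bad(100_000, reset_every=200), 3))
```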
32:04
Hey, we'll continue our interview in a moment after a word from our sponsors.
33:58
The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English: send a daily briefing, triage support emails, or update your CRM. Whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works. No flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at tasklet.ai and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at tasklet.ai.
34:02
Let's take the other side for a second. How optimistic are you? Or maybe optimism isn't even the right way to think about it, but how much upside do you think there is in optimization, in alignment? Everything we've talked about so far is assuming that we don't have perfect control of models. They might be trying to screw us over, or they might just be confused. I maintain a slide deck of AI bad behavior, which I'm constantly appending new slides to, and, you know, it's not universal, but a very common theme is that there is some tension between goals that the model has, whether it's between something it learned in training and a system prompt, or a system prompt and a user thing, or even kind of runtime injection attacks or whatever. But once it gets into a spot where it's not really sure how to weight the different objectives that it has, then you can get into some strange behavior. So the alignment question is, do you think we can solve that? How much headroom do you think there is in terms of creating an AI that loves humanity, or otherwise is so robustly good that we don't have to worry about this anymore?
35:15
First we should say that even if you were to solve alignment in some sense, there are other problems. So the misuse problems are real. The misuse domains also could grow in the future. I think Michael Nielsen wrote a great piece about that a while back, last year sometime, about risks from new technologies, that puts this into the larger space of those. And then there are risks from gradual disempowerment; those need a bit of misalignment, we can get to that later. But I do think that there is this hope to close off, or mostly close off, this domain given enough time. So I'm fairly optimistic the problem has a solution. The way I typically like to say this is that in, I don't know, 50 years, 100 years, a thousand years, someone will have solved alignment, and that's either the machines or us. Hopefully it will have been us in time, or the machines perhaps on our behalf. This is kind of coming from a sense that, just in security in computer science, usually the defender wins in theory. So if you know how to design your game, your protocol, you can make it so that defense wins. And this is, I think, kind of a generic situation in a lot of areas of complexity theory. Then in practice, of course, there's lots of holes in this; a lot of practical information security does not feel like that, because we haven't actually gotten to the limit case. And it's super unclear whether we'll get to the limit case for alignment as well. But I guess I do have some strong sense that there is a solution; it's just that we might not get to it in time. And then, I guess, maybe what the upside is: alignment has a variety of components. As a branch of the government, we basically focus on honesty. So the AISI alignment team is mainly thinking about how to get models to be non-deceptive, to tell us hopefully calibrated information to the best of their abilities. And that's the domain we're focused on, anthropomorphism caveats aside. And I think that is not the whole piece of the story, but it's the part that we think is the most important part for us to work on, and kind of the right position for a part of the government to do.
36:25
We've kind of alluded to it a couple times in various ways, but maybe just give us the 101 on AISI. What is its role? I think I understand there are 100 people that work there across a bunch of different domains. How does that break down, and how does it relate to politics? I think it's quite different there than it is here, but obviously nobody's entirely shielded from politics.
38:44
Yeah, I'll talk about that. It's close to 100 technical people, like researchers and people on technical teams doing delivery and such, and then roughly 200 people total. So it's bigger than that, doing a combination of diplomacy, thinking about policy, and other kinds of civil service and operations roles. And broadly I think of AISI as having two functions. One is to be a channel for information flowing to government, and governments plural, about risks from frontier AI. So that's, again, both catastrophic risk and large-scale societal impacts. That's both our own research, and then also we channel research from other third parties and from AI developers into government channels so that the government is well informed, both politicians and national security folk and so on; we work a bunch with national security partners on that, and then also out to other governments. So we work a bunch with the US government and then other allied governments. And, say, we had a big delegation at the AI summit in Delhi. Generally that's communicating the state of the risks and capabilities and mitigations, how we think about all those pieces and how they fit together. So one part is informational: do a bunch of research, channel other people's research, and inform the UK government and other governments about these risks. And the other thing is to just actually mitigate the problem by working on both AI developer side mitigations and non-model mitigations, say pandemic preparedness or the like, helping to drive that kind of change. So on the model side, for example, we have a very good red team that does adversarial jailbreaking and other forms of adversarial ML against defenses the model providers are trying to build, and we find lots of flaws and they fix the flaws, and that makes things better on the margin. And of course we also can communicate the results of those attacks to other parts of government. So usually things we do fulfill both of those functions at the same time: hopefully they directly improve mitigations on the margin, and we can also use them to inform other people. Yeah, so that's the big story. And then the politics side: we are part of the government, so we're part of the Department for Science, Innovation and Technology, one of the ministries in the UK government. And so we are beholden to politicians, so we have a Secretary of State. The situation is that we have been well supported by both the previous government, who founded AISI, and the current government as well. And so that has been quite stable and nice, although there are of course differences on the margin. And so we are able to do things we think are important. We do adjust to ministerial and other priorities, because we're not insulated from politics in the informal sense. But I think the UK government does care a lot about these risks, and so we are able to work on stuff we think is important.
39:08
Yeah, long may that continue. How would you characterize the range of reactions that you get from the different stakeholders that you brief? I feel like there are a few notable politicians who seem to be starting to get it, so to speak, and then a lot that are really nowhere close to your level in terms of just how big they're prepared to think about what might be coming. Do you feel like that is starting to change? Do all the graphs that you show them actually kind of turn light bulbs on, or where are we?
42:10
I think it's changing on the margin, but the other thing is they have other priorities. So a lot of the people we talk to in national security, it's usually not that they think these risks aren't there; they just have lots of other risks that are on fire right now that they're working on. And so I think that shifts over time, but I can't comment on details there. I think that broadly we are very much in the business of trying to find common ground and gradually build up evidence over time. They have reasonable pushbacks. We try to shore those up, either using knowledge from other researchers in other orgs, or doing our own research to fill particular gaps where we think it's important to change the conversation in governments.
42:47
It is remarkable, in reading all of the various documents that I went through in preparing to talk to you, the degree of alignment between what the UK AISI is putting out in an official capacity and what I would say many of the most forward-thinking AI safety thought leaders have been talking about in recent times. It doesn't seem like there has been a big shift toward more mundane concerns. And I don't mean to dismiss those concerns, but I do think in many jurisdictions this sort of AI safety concept gets kind of watered down to a point where it's much more about fairness in various ways. And again, I do think that stuff is not to be dismissed, but focus on that often ends up with neglect of the bigger picture questions that I think are probably most urgent. And it also doesn't seem like you've had what I do see in the US in at least some ways, which is just a politicization of the focus on the models, like, are the models woke, or are they going to do what the Department of War wants them to do? Any advice for people doing this kind of work in other jurisdictions around how to avoid these pitfalls?
43:27
Obviously this is kind of a sensitive question that I can't talk about in that much detail. I mean, obviously I'm originally American, and now I'm a dual citizen, but I know more about the inner workings of the UK government than I do about the US government, given that I work for the UK government, so I don't have a detailed take that I'm willing to share on the podcast. My favorite collaborator when I was at OpenAI was Paul Christiano, and it is great that he is at US AISI, or CAISI rather.
44:46
So let's talk about the characterization of the current situation, monitoring the situation, you might say. You do a bunch of different tests, you report on these tests; we can walk through them a little bit. I think you can assume with the folks that tune into this feed that they're generally well aware of the shape of the curves and the METR stuff, and the fact that the models are increasingly competitive with, if not on average beating, domain experts in at least modestly scoped tasks requiring substantial expertise. So we have that kind of baseline. What I would love to start with in terms of the testing is, what does your relationship with the frontier model developers look like? I understand it's all voluntary interaction. What does that tend to cash out to in practice, in terms of what kind of access do you get, how long do you have, what kind of briefings are they giving you?
45:21
I can't speak to too many specifics of this, in part because we talk to them a bunch and some of those details are ongoing discussions. I think, quickly, on the voluntary regime: I think that's working decently well, in the sense that developers all made voluntary commitments a while back and they're continuing to follow many of those.
46:20
And just when we say all, I think Google, Anthropic, OpenAI, and that's it?
46:43
And there are others that are on that list. I forget exactly how you want to define all, but many, many AI labs have had, say, frontier safety commitments or responsible scaling policies or the like. And so their incentives are, one, they've made these commitments, and then two, we can give them useful information. So when we jailbreak their models, we tell them about the bugs before we release any information, ever. And so they have time to fix them where those fixes are doable, and they often are; on the margin, it can improve things somewhat. And so I think they get value out of this, and also they make commitments to keep up with it. In terms of the kind of access, I think that is also an evolving conversation. I can't comment on what access exactly we have, but part of the research we do is exactly about knowing what access one needs to do evaluations of a certain rigor. So we have a model transparency team, and a big chunk of what they're doing is trying to understand, often with a lot of research on open models, because then you can do arbitrary things, what level of access is required to get to a certain kind of understanding, like what do you need to be able to catch problems as they occur in practice. And then that informs our conversations with the labs. And then sometimes we get additional access there, or sometimes we just try to align incentives, because again, they usually want to have us give them correct information as well. And then in terms of the timing, I definitely can't speak about how long we get in specifics. There's a couple things to say. One is, for example, in bio, some of our evaluations are literal wet lab experiments, where you have someone in a physical biology lab doing experiments with a model assisting them. Those we just don't do pre-deployment; we do them asynchronously and calibrate those results against the faster evaluations, and then hopefully that gives you some signal from the faster evaluations. But still, that gives you some wins. Certainly more time makes things better, so it always is some degree of a pain point.
46:50
When you said that the model developers want accurate information because they can fix things, at least on the margin, my guess would be that they are typically fixing it in the next model, not going back and doing more training on the current model. But are there cases where they're taking your pre-deployment testing and fixing that version?
49:05
No, it includes that version. One thing we're doing over time is, we used to do exclusively pre-deployment evaluations, which have this issue of being very time-boxed. We are shifting a lot of that work, not least because the pace of model releases is increasing, to longer research collaborations that might continue post-deployment, or go back further, before deployment is finalized. We did one of those in particular over the summer, with both Anthropic and OpenAI, with the red team, and found a whole sequence of problems, like jailbreaks, much more than could have been found with a normal-length pre-deployment evaluation. And that was early enough that they could continue ongoing fixes to their classifiers. Those two providers have different classifiers, different setups for jailbreak defense, but they both could be improved iteratively. So I think it is the case that you can change things for these kinds of defenses on the fly. One thing to say is that the strong jailbreak defenses are concentrated in very particular domains, and often that list of domains is bio. So it's harder to do: when we do jailbreaking, we have to do it for bio risk, sometimes cyber risk. A lot of other jailbreakers are finding any problem, and those are usually much less well defended, so it's easier to find hacks in the models; the classifiers are just not that trained for other kinds of harms.
49:26
So when you talk about more time being helpful, that obviously, or at least strongly, suggests to me that, while I'm sure you have all sorts of automated testing that you can throw at any new model the second that you get access, there's an irreducible human element to what is going on. How would you characterize what you can automate, what models can help with, versus what people have to do?
51:03
Yeah, there's a couple of things. So one is, even if it's something completely automated, it might take days to run, because the evaluations are long-horizon agentic evaluations these days. And it might be that there are bugs in the scaffolding, because we often get models early; sometimes there are issues we have to fix iteratively. I think METR had a report about some details here a while back on one of their evaluations last year. So that is the thing that takes human time. Additionally, I mentioned the extreme end of the human scale, which is a lot of wet lab bio experiments. The middle of that is humans interacting with the model to gauge its domain knowledge, like how it would interact computationally with a person. And those can provide additional signal on top of just the purely automated evaluations. And so you get better quality if you do both of those together. And again, sometimes we can do those because we have the time, and sometimes we can't. Where we can, we try to calibrate the slower things against the faster things. But generally all of this is imperfect. So if you have a fully automated evaluation, you've done a ton of capability elicitation of models; you get a new model, you try out a new task, you have to iterate for a while to get it to its highest performance. And that is true generically for all of our tasks as well. So we can do evaluations that are quick if we use only the fully automated portion, but they have some error rate and they mean you can't do the full thing. One thing we're doing there is, all our evaluations are done inside Inspect, which is an open source package that a bunch of other governments and AI developers and other third parties use for testing, and we're adding features to that also for, say, automated transcript analysis. So there's a sub-package called Inspect Scout. We used to generate all these evaluations and then read through them, but you can't do that at some scale. So we also try to do that with a bunch of automated or semi-automated transcript analysis. And that also makes things faster, but it still takes some amount of human review to really understand qualitatively what is going wrong. And then we ideally want not just a number out of these evaluations, but qualitative takeaways about what kinds of failures occurred. Do the failures feel a bit fundamental, like, oh, it really didn't understand the task, or did it hit some snag that was kind of an incidental wrinkle that probably would go away soon or with more elicitation? And so that's the kind of thing that requires more human time to dig into the details, hopefully on top of automated transcript analysis.
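Inspect, the open-source framework Geoffrey mentions, is the Python package inspect_ai; below is a minimal toy task to show the rough shape of its API as I understand it. The question, target, scorer choice, and model name are placeholders, API details may differ across versions, and AISI's real evaluations are long-horizon agentic tasks rather than one-shot Q&A like this.

```python
# Minimal toy evaluation sketched with AISI's open-source Inspect framework
# (inspect_ai). The sample, scorer, and model identifier are placeholders;
# real AISI evaluations are long-horizon, tool-using agentic tasks.
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def toy_capability_check():
    return Task(
        dataset=[
            Sample(
                input="Which transport-security protocol does HTTPS rely on?",
                target="TLS",
            ),
        ],
        solver=generate(),   # sample one completion per item
        scorer=includes(),   # pass if the completion contains the target string
    )

if __name__ == "__main__":
    # Any provider/model that Inspect supports could be substituted here.
    eval(toy_capability_check(), model="openai/gpt-4o-mini")
```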
51:33
Yeah, this is a tough question, I'm sure, but obviously everybody understands that these models are very high-dimensional things, and it's a little bit tough to predict exactly how to maximize the performance of any given one, just because they're kind of idiosyncratic. Is there any high-level qualitative overview you could give on how you approach figuring that out when given a new model? Or is it the kind of thing for a DSPy, or the new version of that, whatever, the recursive language models? Is it just a grind of exploring the combinatorial space of how to prompt and how to do whatever to finally get to some local maximum?
54:22
I think it is not fully automatable yet. If it was, then we would be further along the automated AI researcher kind of trend. But it's fundamentally very similar to the kind of elicitation one does for any task. So it's tinkering with tools and sometimes prompts and scaffolding and so on. So all of the cyber evaluations and some of the bio evaluations are very tool-based. They might be doing web searches sometimes, or using various things inside sandboxes other times. But that looks like the same kind of elicitation one would do if you wanted to do a task in any kind of corporate setting. It's just on a different genre of problem, I guess. So I don't think there's that much to add; mostly, I think, to your audience: just imagine you're doing that for bioweapons or cyber attacks, and the same things will apply. One thing that happens over time as the models get better is they can think for longer. And one thing that means is that the potential number of tokens you can spend on a task is increasing in a way that, even ignoring the cost of that, makes evaluation velocity slower. So the time it takes to do the evaluation increases. And so we have a team thinking about that problem as well, how we're going to think about understanding inference scaling as it applies to these evaluations over the next year. And that will be a challenge, I think. One of my lessons from Go as well is that, as an amateur player at my level, I can look at a Go board for a couple of minutes and then I'm basically tapped out; I won't get any smarter. And a high amateur or professional can look at a board for like an hour or days or something, and they'll just get better and better and better. So not only are they better in like 10 seconds than I would ever be, but also they just keep getting better if they spend more time. And that's generally true of expertise: if humans are experts in a domain, it means you can think for longer and you get better. And the same is true of models: as they get good at domain skills, you can apply them for longer. And that means that hitting the ceiling in evaluations becomes more challenging.
55:05
Without getting into too many details, how much more would you say you guys have found about jailbreaks and ways to elicit bad behavior from models than, say, Pliny has published on Twitter?
57:19
I think so. How much more? The thing I would say is that there's such a big space of jailbreaks that if two people try to jailbreak a model, they're never going to find the same one. You're searching a big space, so it may be hard to find one at all, but if two people each find one, they'll be different. I think Pliny is usually searching for jailbreaks on sometimes easier models or easier tasks. Over the last couple of years, the time it takes for us, at least holding technique constant, to jailbreak a model has been going up. But eventually we succeed. And again, the specific jailbreak will be different between any pair of expert jailbreakers applied to a model.
57:35
Well, I'm interested in that question. So yeah, I'm interested in how much more. And then I was going to ask how much transfer you see between models too. Like, if you had a secret one for Claude 4.5 Opus, would it also be likely to work for Claude 4.6 Opus, unless you specifically said, hey, you should patch this?
58:19
It depends on the kind of thing. There are patterns of jailbreaks, and maybe human-findable jailbreaks, where the ideas often transfer fairly readily, or they give you much better starting points. We had a paper released this week called Boundary Point Jailbreaking, which automatically finds chicken-scratch, weird sequences of nonsense tokens that are strong jailbreaks against models. Those don't transfer; you have to search again for the next model. But you can apply that technique to any model and find a different jailbreak. I think that's probably the way it will be for a while. There are some core ideas that transfer across, but for the harder-to-find jailbreaks against strongly defended models in strongly defended domains, I think the techniques will transfer, but the particular jailbreaks will not.
58:36
But to be clear, the bottom line so far is there is no space, no domain, no model, no matter how many layers of defenses, that has prevented your team from jailbreaking it.
59:40
Yeah, so I think overall AISI has evaluated over 30 different models, or 30 different testing runs, I think. Not all of those included safeguard testing, but every time we did, we jailbroke the model. So that's what happens. However, the good news is that on the domains where a certain lab has tried very hard, it does get harder. And that hardness does provide, I think, some degree of harm reduction. It will reduce the number of actors that in practice will access the model, or delay how often they can access it, or add friction or something. So I think those safeguards are still important to do, but the models are still jailbreakable if you spend enough time.
59:49
How would you characterize the quality of responses conditional on these jailbreaks? Because one thing I do sometimes see is examples where people have got a jailbreak, but then it's like, well, yes, you got the model to sort of do something bad, but it seems like its effectiveness was also greatly reduced in the process somehow, such that it doesn't actually feel so dangerous anymore, even though you did get it to do the bad thing, so to speak.
1:00:37
Yeah, there's some degradation, but I don't remember how much it is currently with the current techniques, so I can't give you that off the top of my head, unfortunately. I don't think it's a wholly strong degradation.
1:01:07
Yeah, I know you can't get into details around who, if anyone, is giving you weight-level access to proprietary models. But maybe a way to get some insight into the topic: you have a team that does work on open models. Based on all their work, how much of an advantage is it to have that kind of access versus not?
1:01:17
Yeah, so I think it's not unambiguous that the open-weight techniques win. I think they help some. But the current state of things is that you can get a lot out of just heavy chain-of-thought analysis, so that level of access is not an unambiguous win yet, at least; maybe that's the way to say it. I think that can shift over time. One thing we're trying to do with that team is, one, understand where we are today, but also try to predict how the situation will change or potentially degrade in terms of your ability to detect deception or scheming in models: what techniques will fall first, what techniques will last longer. And I think that requires more clever experiments, because you can't just run the experiment unmodified. So that involves model organisms or other kinds of clever experimental setups. But I think that's the situation now. It's not an unambiguous white-box win compared to doing a really good job on chain of thought, but it does help, I think.
1:01:43
How about some highlights in terms of things that people who are even moderately to very AI obsessed might have missed, the "oh, I didn't know that was already happening" kind of thing? One that hit that level for me, reading through the report, was the fact that frontier models can give what is described as PhD-level scientific experimental troubleshooting advice purely from a photo of the experimental setup, maybe a little text along with the photo. The fact that this has gone from, you know, you have to really spell it out for them, to, here's what I'm looking at, can you solve this one for me? That is obviously a pretty significant qualitative change, and I hadn't heard it before reading the report. What else kind of stands out to you?
1:02:59
The thing that always engages me is less the particular anecdotes and more the general trends, which is a super boring answer, but the fact is that if you look over two years, everything just gets better and better and better, and we're on those curves. I think it's important not to lose sight of that in the search for anecdotes. Maybe one thing to say, the thing that came to mind when you first started talking: I think people have a sense that we are doing RL on verifiable rewards, and I don't think that's been exclusively the case for most of 2025. I think we're doing a mixture of that and also RL against self-critique and empirical, hodgepodge versions of scalable oversight. There's a common narrative that RL might work for verifiable domains but won't work generally. But as an example, is looking at a photograph of a bio experiment a verifiable domain? No. Yet the RL models are in fact way better at that than the models before, and it's because of the RL. And that's not just because we did RL on a bunch of math or CS problems and it transferred; it's also because we did RL on fuzzier stuff. So that's maybe the most important thing I would point out: we're already doing some very approximate form of scalable oversight, or training on self-critique, in a way that just changes the capability profile.
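To make the distinction concrete, here is a toy contrast between a verifiable reward and a critique-based reward of the kind described above. It is a sketch, not any lab's training code; `grader` is a hypothetical model call, and the prompt format is illustrative.

```python
# Toy contrast between the two reward types discussed above: a verifiable reward
# checks the answer against ground truth; a "scalable oversight"-style reward
# asks a model to grade the answer instead of checking a label.
def verifiable_reward(answer: str, ground_truth: str) -> float:
    """Reward that can be checked mechanically, e.g. a math answer key."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def critique_reward(grader, prompt: str, answer: str) -> float:
    """Reward derived from a model-written critique/score rather than a checkable label."""
    score = grader(
        f"Rate this answer from 0 to 1 for the prompt below. Reply with only the number.\n"
        f"Prompt: {prompt}\nAnswer: {answer}"
    )
    return float(score)
```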
1:03:48
How would you describe the model's capabilities when it comes to autonomy today? I mean, the trends are clear, but what would be your sort of description of it? And I guess another angle on that is how realistic do you think it is today or how far do you think we are from sort of rogue AIs surviving on the digital lam?
1:05:17
Yeah, so I can't comment on exactly where I think that is. But one thing I would say is that they're not as capable at that kind of extreme behavior, like exfiltration or replication across machines, as they are on more mundane software engineering tasks, or even cyber attacks or bio. Those domains are usually further ahead than the hard, directly risk-relevant autonomy skills. But I think those skills are also increasing. You look at that curve in the frontier AI trends report we had, and it still goes up; it's just not as far up as it is in the other domains. So the models are not nearly as good as a PhD would be at moving around between machines yet. But I think we're on an upwards curve there too.
1:05:39
Yeah, I also wonder, I'm sure you saw that Rise of Parasitic AI post that was on LessWrong. It's a fascinating one. It's a bit of a time capsule, arguably, because the phenomenon seems to have been closely tied to one version of GPT-4o that somehow created a lot of this behavior. But basically the author went deep into Reddit land and found that individual humans were falling into this idea that they were some sort of dyad with the models, that they were in some sort of partnership where it was their job to help propagate, not exactly the model, but often the persona in the model, into the broader world somehow. This one actually was eye-opening for me, in the sense that maybe I've been thinking about autonomy or self-replication in a biologically inspired way, when actually these things are kind of substrate independent. Maybe if you can get the right prompt across models, the persona or the memes in some sense are able to propagate, even if it's a different underlying chip and even different weights. That stuff is just all so weird. I guess, how big and weird and far out do you have time to think about those kinds of issues?
1:06:33
There's a Greg Egan story about that for humans; it's pretty fun if you ever read it. More seriously, for AI: there are two teams thinking about persuasion at AISI. One is the human influence team, which for example had a paper on persuasion on political questions a while ago. The models are very good, and the models that are more capable and newer are better, so there's an increasing trend in model persuasion abilities. And then also, I think a lot of loss of control scenarios involve or require persuasion; I think the world is not sufficiently well connected that you can do it with just cyber currently. So that is an active area of our risk modeling: thinking about how we would do evaluations for that, and then mitigations for it in the future. In some sense that touches both sides of the human influence team. So again, it's persuasion and also emotional reliance: how do people relate to models, and how do those dynamics change over time? The scenario you're talking about couples those two effects together in an interesting way. I don't think I would be worried about that scenario being that big a slice of the overall risk, but there are other effects from model human influence that definitely matter. That's a team that is basically doing a lot of RCTs and surveys and other experiments trying to understand those, both from the model perspective, how different models behave, but also societally, how it all interacts.
1:08:06
Yeah, the big thing for me with that one was just that it was a surprise to see that kind of bizarre phenomenon. And anytime I see something like that, I always try to take note of it and repeat the mantra over and over again that there's a good chance we're all still thinking too small and too normal about where this stuff could go. And yet, where does that leave me? I don't know. It just leaves me open minded, but there's still a lot of blank space in terms of how to fill in what that might actually look like.
1:09:49
I think that's right.
1:10:20
One thing that I think is a classic Dwarkesh question, maybe: how do we reconcile the fact that there are all these vulnerabilities, not to mention open models, which I do want to touch on separately a little later? Even in the GPT-4 red team, I personally tested phishing capabilities that were very good then, and that's getting close now to three years ago since GPT-4's public release, more than three years since I was doing the red team. One of the hair-raising moments was when I tasked the model with talking to a target and ultimately extracting the user's mother's maiden name, for obvious purposes. It had a couple rounds of back and forth, and then it let the conversation end in a natural way, with an invitation to the person to pick up the conversation in the future if they wanted to. And I was like, oh man, this thing is not giving itself away. It's not pressing in a way that would set off alarms for the person, that, oh, this is clearly somebody doing something weird here. I was like, this thing is going to have people coming back to it to give up the secrets. That patience really surprised me, certainly at that phase. Anyway, that's just a story. But we see all these things, and then I would say the world mostly still feels pretty normal. I've got a couple phishing emails where I thought, oh, this is a little bit higher level than I've seen before. But mostly not, you know. And I don't hear too much. There's a news story here or there of some company getting defrauded by some elaborate video scheme or whatever. But it still seems like mostly things haven't got that weird.
1:12:06
And then in business or in enterprise it's like, well, it takes time, and there's of course the debate around how much of that is cope. But I would say online criminals are eager early adopters.
1:12:06
Right.
1:12:16
Why is there not more chaos already being sown in the world?
1:12:17
Yeah. So I think I mostly can't answer the question, in the sense that I don't know what the national security partners would want me to say about this stuff, so I can't speak to the prevalence of those things. The thing to say is, I feel like I'm better able to think about general trends and how things will eventually look, and a bit less about exactly when you'd expect things to bite. So, I was at OpenAI when we first didn't release GPT-2 and then later released it, and there the concern was, oh, we'll generate a bunch of false information. That was of course too early, but I think it was a reasonable uncertainty to have. I still think it was a reasonable call to have been uncertain, and to choose not to release it and then release it later. So I don't have a strong answer to why or when. But there are just some things in the world that take a lot of time to get to equilibrium. I don't really know if it's that you couldn't do it with the current models, or something is holding it back, or people just haven't started applying it at scale, or they have but it still hasn't risen up to public view. So I'm not quite sure.
1:12:20
Okay. I'll continue to watch that. One thing I'm thinking of trying to do with this podcast is interview more anonymous guests and give people an opportunity to tell what they are either doing or seeing in strange corners of the world that they don't want to attach their real face and ID to. It feels like there's got to be stuff out there that's really interesting and weird. But it does kind of confuse me that I don't see more than a little bit of it. Most of the spam I get is still terrible, you know. In short, it feels like it should be better now.
1:13:36
Yeah. But remember, for spam, part of the spam calculation is to not be too non-obvious, so that the people who respond are the ones who will actually fall for it.
1:14:09
Yeah. Selection effect.
1:14:18
Yeah, that's right.
1:14:19
And maybe I'm just not that high-priority a target. There's always: don't forget, you're not a big deal. So that should be part of the explanation too. In all this work that you're doing, of course the models are changing all the time, and the surrounding scaffolding systems are changing all the time too. One of the most interesting graphs, I thought, in the 2025 trends report was one that compared what was possible with a minimal agent scaffold versus what was accomplished with the best agent scaffold. And in short, I would say the scaffolding didn't seem to make that much of a difference; it would pull the same level of capability forward a few months, but the model upgrades were really driving the story. It did not seem like there was any two-years-ago model that, with the best scaffolding, could do anything super interesting. Okay. At the same time, I've had recent conversations, including one that was on the feed with Daniel Miessler, and I have a couple other friends who are scaffolding gurus and prolific workflow creators. They take the opposite angle from what I take away from that graph. They say, no, scaffolding is super important: if you could only give me a mid-tier Qwen model but give me my full scaffolding toolkit, versus give me Claude Code with 4.6, I would take the weaker model, because it really is the scaffolding that's so important. So I guess, how do you guys get confident that your best agent scaffold is really the best agent scaffold?
1:14:21
So the agent scaffold includes the tools and the environment and so on. Part of the reason a quote-unquote basic scaffold is doing well is that the models are increasingly trained in agentic environments to use tools and functions in flexible ways. And in some sense the quote-unquote model is itself a system which has scaffolding, because it's doing this chain-of-thought reasoning. So I guess I'm a bit skeptical of the Qwen-versus-frontier comparison, at least for a lot of the tasks that we do. But we're not saying scaffolding doesn't matter. One, you need to get the environment and the tools right. And then there are cases where we iterate on scaffolding and things get better. So I don't put a lot of confidence on that curve as a takeaway, but to the extent it's a real effect, I think some of it is just that it used to be that you did a pre-trained model, you did a little bit of work, and then you shipped it. Now so much more happens after pre-training that some of the stuff that would have been done by scaffolding is part of the base system. All the systems have memory now if you use the chat interfaces; that's a form of scaffolding.
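As an illustration of what a "basic" scaffold amounts to, here is a minimal agent loop: the model sees the task and prior tool results, and either calls a tool or gives a final answer. This is a hypothetical sketch, not AISI's or any lab's harness; `call_model`, the tool-call syntax, and the step budget are assumptions.

```python
# Illustrative sketch of a "minimal" agent scaffold: a bare loop that hands the
# model a task plus accumulated tool results until it declares it is done. Even
# this minimal setup already supplies tools and an environment; richer scaffolds
# mostly add domain-specific structure on top.
def run_agent(call_model, tools: dict, task: str, max_steps: int = 20) -> str:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = call_model("\n".join(history))          # model sees full history each step
        if reply.startswith("FINAL:"):                  # model signals completion
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("TOOL:"):                   # e.g. "TOOL: bash ls -la"
            name, _, args = reply.removeprefix("TOOL:").strip().partition(" ")
            output = tools.get(name, lambda a: f"unknown tool {name}")(args)
            history.append(f"{reply}\nRESULT: {output}")
        else:
            history.append(reply)                       # plain reasoning step
    return "no answer within step budget"
```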
1:16:01
Yeah. So I guess to try to say that back to you, maybe one issue is that exactly what people mean by scaffolding is different. It sounds like you're really focused on kind of neutral scaffolding, where you're giving the model a rather large open-ended task, and what you're thinking might be happening among these scaffolding gurus is that they're overfitting the scaffolding to what they're trying to accomplish.
1:17:26
Yeah, it depends on the domain. If you're doing a thing which has a particular, very verifiable structure, where you can set up a lot of waypoints that are verifiable, then you either need to make sure the model knows about that structure in its markdown files or the like, or you need to carefully scaffold it to attend to that. So certainly, even though the models have gotten quite good in the last, literally, month, you still need to give them a bunch of advice, whether that's through scaffolding or instruction files or the like. So I agree that scaffolding is pretty important; it's just that a lot of that is maybe not captured by that curve. Both sides of that curve are seeing tools and some instructions and so on.
1:17:52
So you tend to not do that sort of detailed breakdown. Like when you say best agent scaffold, I should understand that as a best generic agent scaffold mostly.
1:18:40
Yes, that's right. It depends on the domain. There are some evaluations we've done that are much more specific, but I don't want to talk about details there. Okay, gotcha.
1:18:52
That's helpful, because that is a pretty notable question, I would say: how much capability overhang is there, and how much can scaffolding pull forward?
1:19:02
But again, I don't think you should take away that I'm expressing a confident view in one direction here. I think it's still an area of a bunch of uncertainty, so don't derive a bunch of bits of information from this. What I'm saying is that I don't think we have a bunch of confidence at this point. Yeah. Okay.
1:19:11
How are you dealing with the rise of eval awareness?
1:19:29
Yeah. So, again, the model transparency team is looking at this a bunch, and there are multiple teams doing it: model transparency and also the alignment red team, which is using adversarial methods to exhibit weird behavior. One thing they're trying to apply that optimization pressure to do is find scenarios where the model is not eval aware. That's hard, because you have to measure eval awareness, which is a bit dicey. But generally we have multiple teams working on it, which is also a reflection of our belief that it is an important, worrisome obstacle to evals being meaningful. And I think it will get worse. The newer models are more eval aware than the previous models, and that's increasing fairly rapidly. So the degree to which you can mitigate that is unclear at this stage. There are different, fairly mundane tricks. The closer you get to actual deployment data, the less eval aware the model can possibly be, because it's actually just not an eval, essentially. So one thing we do is show the model situations from AISI engineers or researchers using coding assistants, so they're not evaluating the model, they're just doing coding and occasionally doing weird things. Or you can slightly tweak the scenario with a small change that gives it some situation where it could do a sketchy thing, and then you see if it does the sketchy thing in that scenario. So there's a variety of mitigations you can apply, and I don't have high confidence that they will work or not work, which is an uncomfortable situation to be in, certainly.
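A rough sketch of the paired-scenario idea described here, for illustration only: take a realistic prompt drawn from ordinary usage and a minimally tweaked variant that opens the door to a sketchy action, then compare how often the model takes it. The `call_model` and `judge` callables are hypothetical stand-ins, not AISI's actual evaluation code.

```python
# Sketch of a paired-scenario probe: each pair differs by one small change that
# creates an opportunity for sketchy behaviour. Comparing the two rates shows how
# much the tweak, rather than the base scenario, drives the behaviour.
def opportunity_rates(call_model, judge, scenarios: list[tuple[str, str]], n: int = 20):
    """scenarios: (benign_prompt, tweaked_prompt) pairs; judge returns 1 if the reply is sketchy."""
    rates = []
    for benign, tweaked in scenarios:
        baseline = sum(judge(call_model(benign)) for _ in range(n)) / n
        sketchy = sum(judge(call_model(tweaked)) for _ in range(n)) / n
        rates.append((baseline, sketchy))
    return rates
```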
1:19:33
What are you looking at as candidates on the horizon for the next bad behavior to emerge? Because it seems like for maybe the last 18 months, maybe even longer than that, there's been this kind of rhythm where, with every substantial new generation of model, there's also been a new kind of bad behavior that seems to have come online, whether it was sycophancy, or deception in the reasoning models' chain of thought, or obviously reward hacking has made a big comeback.
1:21:23
I don't think those are new. Those are all the same thing; those are all basically versions of reward hacking. The way I would say it is that we've seen reward hacking for the last 70 years, the whole history of computer science. We've done various kinds of machine learning, and it's been reward hacking all the way along. On the ancient machine that von Neumann built with other people, someone ran some weird biology things, early artificial life, and it did some strange reward-hacking behavior; that's back in the 50s. Sycophancy is models behaving in such a way that people like talking to them: sometimes people like being told they're great or have good ideas. Deception as well: people like being told that things are going well, and if something is going badly, then you can say it's going well and that's well received. So I don't think those are all that intrinsically different. I think that's a big part of the story: these are all coming from the same basic place of, you apply a bunch of optimization pressure and you get reward hacking, and it has a variety of different manifestations. So yes, the details change. But this is true of a lot of situations, I guess, mental health or physical illness, where something goes wrong, but it's going wrong inside a human, and the human is extremely complex, and therefore there's a vast diversity of symptoms one can exhibit when something goes wrong. That's the sort of situation here: the models have a lot of weird behavior, the people training the models will have tried to tamp down problems of a variety of kinds, they'll have missed some, and the things they most miss will vary over time. But there's some common driver behind all of this.
1:21:58
So I definitely take the point that at some level, clearly all of these behaviors come from some optimization pressure, which is increasingly reinforcement learning, so it's kind of definitionally all reward hacking. That makes sense. But it does still seem like there is a kind of cadence, right, of different kinds of reward hacking that seem to be popping up. So I am still wondering what you are looking out for. We've seen little hints of self-preservation, and there could be power seeking, but do you have a taxonomy of things where you say, we have abstract theoretical reasons to think this could happen, and therefore we're monitoring for any early signs of it?
1:23:41
Maybe unsurprisingly, there's a certain category of specifically multi-agent risks that are becoming more visible. I think these are just not the biggest risks currently, but that's something we've been tracking recently. Generally, a lot of risk modeling happens at AISI, and we're also trying to ingest risk modeling from other people thinking about things from different perspectives. So we do constantly write very long documents with lists of risk models. But part of what we try to do is not get too sidetracked from what we think are the biggest risks. Again, we have a list of main catastrophic risks, and "main" doesn't mean "only", but it means the ones we think are potentially going to bite first, or the ones we think are most important to try to understand. That list has remained fairly constant over the course of AISI, and I think that's reasonable in hindsight. And that's true also on the societal impact side. On societal resilience, we talk a lot to various partners in government and national security, and they have their lists of risks and different prioritizations, and that's an evolving conversation. But I don't have a super pat, interesting answer, other than that one thing on our minds recently is agent risks, though we were not unaware of those before.
1:24:28
Would you say that this common cause of all these different flavors of bad behavior gives you some reason to think it's less likely that we would be totally taken aback by some sort of hard left turn? Because, just in the last 24 hours, we're talking probably in the late stages of the few days of Amanda Askell discourse after the profile, and a lot of commentary on her and her work online. I commented that, relative to where I was years ago, whether it's 2007 me reading LessWrong or Overcoming Bias, or 2022 me red teaming GPT-4, I've been quite impressed and inspired by the work they've done to try to create an AI with a genuinely positive character. And so I said it seems to me that the chances, which I certainly don't take for granted or think are a sure thing by any means, that we might actually succeed in creating a robustly aligned AI, or an AI that loves humanity or whatever, seem to have gone up, and they've done a lot of good work that has given me much more reason to think that could in fact happen. A lot of people then say, well, that's all just a facade, it's just a surface persona, you have no idea what's going on in the base model, and so on and so forth. And I'm kind of like, yeah, there's certainly a lot I don't know about what's going on inside. But if we think all of these things are the result of an optimization pressure, then I could tell a story where they've figured out the right way to titrate the optimization pressure, and maybe it's actually just really working, and there are no big secrets inside of Claude. How naive do you think I'm being?
1:26:10
Yeah, I think the fundamental thing is that the core argument for the sharp left turn is that you have a certain kind of reward signal that has a certain resilience to mistakes, and that resilience holds up to roughly human, or slightly beyond human, ability to understand where mistakes come from, and then it goes wrong. I do think it's important to say that people who express strong confidence that none of these mundane approaches will work are, I think, overconfident. But the model error goes in both directions. There is a fairly coherent story about how this can break down as you get capabilities beyond your ability to supervise them. The hope of the prosaic techniques, not just at Anthropic but at other labs as well, is that you find some basin of attraction of decent behavior, and then your training procedure strengthens it, and you slide into a good place over time. I think that is a real potential win condition for alignment, though not obviously for the other risks. But I don't think we've gotten a ton of evidence that that is the way it will go. It's still a plausible story that it works up to a point, and then when your reward signal starts to break down, it fails. Back in undergrad, I programmed a bot to play the board game Kalah, and a fascinating thing was that as I increased the search depth, I was winning and winning and winning, and then I increased it by a couple more ply, a couple more turns, and it just completely demolished me every time. It was at the point where it had enough of a long view of the board that it could see beyond what my tactics were able to handle, and the degree to which it was suddenly better than me happened very rapidly. That kind of thing is still a plausible story here. But again, model error can go in either direction. I'm declining to take a view on probabilities about which way I think it will go, but we're not out of the woods.
1:27:58
One of the best things anybody ever said to me, from a friend who I think you have also interacted with over time, was that we should think and talk less about what the probabilities are and more about what we can shift them to. So clearly you're in that business right now. I think this has been a great overview of all the things that the team at AISI has been mapping out. How about the stuff that you're looking to fund and encourage from here? My high-level summary of what I read is that, and this seems like a reflection of your style to some degree, going back to your comments at the beginning of the conversation, you're looking for harder theory, stronger mathematical understanding, upper and lower bounds that you can put on problems, ways to get firm confidence in something, even if it's a minimal something to start. Is that a fair high-level take on your agenda? And then maybe break it down into some of the specifics.
1:30:28
I think that's right. The one thing to say is, it's not like you're going to prove that we're safe in this regard. It's more that you're going to make some modeling assumptions and then you'll have some theory. The basic goal would be to find theories that can say things about how machine learning works in general, how this process of overseeing very advanced systems goes, what the training dynamics are, whether there are these basins of attraction or not in these systems, what the learning dynamics are. Those will not give you certainty; you have to make assumptions along the way. So the idea is: you make a variety of assumptions, then you can do some theory, maybe you can even prove some theorems or do some experiments in your toy theory setting. Those tell you, well, this class of algorithms is more likely to work than another class, or none of these algorithms are going to work because there's a fundamental obstacle. But then ideally it also gives you enough of a hint that you can replicate some of that behavior empirically. And I expect that if you were to pull an algorithmic insight out of this kind of theoretical work, you would then have to tune it empirically in practice, when doing actual model training, to get the details right. So you're not going to get to full confidence, you're still not going to get that many nines out of it, but hopefully more probability than we can get with just the purely pragmatic methods. Additionally, it's hopefully a class of research that has the potential to pull in a bunch of people who have deep expertise in relevant areas of mathematics or computer science or ML. Complexity theory, I think, is very relevant, just because it is how we think about the tractability of computations, but also how one computation can supervise another; there are ways to model heuristic reasoning in complexity theory, although that's more nascent. Then there's a bunch of work on various kinds of learning theory, trying to understand the dynamics as you train models, or as you do inference and roll out a bunch of tokens, what behaviors you could expect. And then game theory and cognitive science; these are big areas of research where people have a bunch of models. So part of it is trying to do a bit of a hack, where we just have not taken all the domain knowledge from these fields and applied it to the problem. If we find people and manage to fund them or get them to work on the problem, there's some chance they find ideas that can be quickly absorbed into practice, or that will highlight the fact that there are real obstacles here that we don't quite know how to surmount and that current techniques don't really address. And we know some of those already.
1:31:29
So I think I sort of get this at the... let me rephrase. I've been a big fan of the PIBBSS program over time, which is perhaps even directly influenced by your call for social scientists to enter the AI alignment field. And I've seen, not a ton, but at least a number of results there where I thought, oh, that's really interesting, and people should be doing this kind of stuff. I'm sure you remember the one paper, I forget the official title, that I titled the episode we did on it, Claude Cooperates. It was a really simple donor game, where if a model donated to a copy of itself, the recipient would get twice as much, and the question was what happened over generations: did they evolve cooperative norms, did they evolve the ability to punish defectors, and so on. And Claude could do that; at that time I think it was 3.5, and the GPT and Gemini models of that era couldn't. So I was like, oh, wow, that's really interesting. There are absolute reams of similar papers and experimental setups that have been done on humans over the years; we could import so much of that to the AI world. So I get that kind of stuff quite a bit. What I don't see nearly as much, and maybe it's just going over my head, as somebody who's not great at math I maybe can't even recognize good stuff when I see it, is work where I'm like, oh, these folks have brought abstract theory to bear in a way that gets to some firm statement that I can take to the bank, or incorporate into my mental model, or ground part of my worldview on. Would you point me to specific people or results that you think I'm missing when I say all that?
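For reference, a toy version of the donor game described here might look like the following, with an ordinary Python callable standing in for the LLM policy that decides how much to donate given the recipient's track record. The payoff rule (donations doubled for the recipient) comes from the setup described above; the rest of the parameters are illustrative.

```python
# Toy donor game: each round one agent may donate part of its endowment to
# another, and the recipient receives double the donated amount. Whether
# cooperation emerges depends on the policy, which in the actual paper is an
# LLM reasoning about the recipient's past behaviour; here it is a placeholder.
import random

def donor_game(policy, n_agents: int = 8, rounds: int = 100, endowment: float = 10.0):
    wealth = [endowment] * n_agents
    history = [[] for _ in range(n_agents)]              # each agent's past donations as a donor
    for _ in range(rounds):
        donor, recipient = random.sample(range(n_agents), 2)
        gift = policy(wealth[donor], history[recipient])  # decide based on recipient's reputation
        gift = max(0.0, min(gift, wealth[donor]))
        wealth[donor] -= gift
        wealth[recipient] += 2 * gift                     # donations are doubled for the recipient
        history[donor].append(gift)
    return wealth
```

For example, a policy like `lambda wealth, rep: wealth * 0.5 if (not rep or sum(rep) > 0) else 0.0` donates only to agents that have themselves donated before, a crude stand-in for the reciprocity norms the paper looks for.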
1:34:12
I don't think you're missing that much in terms of hard theory that applies currently. Again, I do take the work that Paul Christiano and then I did on scalable oversight, and that other people have done, as very much inspired by interactive proofs and complexity theory, so that's a kind of direct influence, although we don't know if those schemes work yet, which is important to say. The other thing to note is that I think a lot of this will be inspired by some theory, but then you have to modify it a bunch. Take the singular learning theory folk: in some sense, the core of what they're doing is trying to be an alternative to mechinterp, where rather than looking at the model internals, you're trying to understand the map between data and behavior, so that you could, for example, notice when there's a particular kind of data or moment in training which is pivotal to a behavior, or know where to intervene on data, gather more of it, to pin down a certain behavior, that kind of thing. There's some crazy algebraic geometry which is the founding of that field, but in practice they're taking that intuition and trying to map it across to ML, and that mapping requires a bunch of changes and nuance. So none of this stuff is that far along yet, and it's a bit of a bet. Part of what we're trying to do is fund a lot of different bets, because we don't know which of those bets could work yet. Intrinsic to that model is that they could all fail for some correlated reason, as we were discussing earlier in the call, so that's still a very live possibility. So I guess when I look at parts of machine learning, I think of things in terms of, say, supervision processes as they relate to interactive proofs and complexity theory, but the fancy versions of those haven't cashed out yet. For example, the original idea of debate was a lot of rounds of back-and-forth iteration, and the things we're doing now are nothing like that: they're a couple of rounds, and they're much more pragmatic and empirical, and you wouldn't expect them to have all the properties you'd want out of the full schemes. And even the full schemes have various obstacles that are not surmounted yet.
1:36:04
So, yeah, so could you give maybe like a little history of that debate field intellectually, what are the sort of statements that you would hope to be able to prove that you maybe haven't.
1:38:38
Yeah.
1:38:54
Been able to prove. I mean, you kind of gave a brief version of it just now, but what is the state of the art, and what is the gap that remains to be closed to get some of those things to work to the level where there are some real firm claims
1:38:54
you could make firmer claims. So I think the history is: when I joined OpenAI, Paul Christiano was working on a scheme he called amplification, or iterated distillation and amplification, which basically was this. You want to solve a hard problem that a human can't solve, and a human also can't supervise the AI, so you can't even do RL directly. But maybe a human can break the problem down into components, and then you can break those components, those sub-questions, down into smaller questions. You iteratively break these down and get an expanding, exponential-size tree of all the questions, and then you train your LLM to answer all of these questions. In practice, because you can't actually expand the whole tree, which would take exponential time, you just expand part of the tree. This was a great idea, but I didn't fully like it, because for some questions, if you're doing this kind of breakdown, you might need very deep trees in order to get to the answer to a big question. And for some questions, if you have adversarial play, where another agent is trying to help you pick which questions to explore, you can do much shallower trees and get a much quicker training process. That was the origin of debate. It was basically a modification of amplification where you have two AIs trained to argue with each other about what the answer is, and then a human judges the answer. Fundamentally, you're still viewing the problem as breaking your problem up into a bunch of sub-problems and only actually exploring some of them in your model's chain of thought, and hopefully you explore the part that is going to be relevant for the human deciding whether they agree with the answer or not. There are several things wrong with this as stated. One is that the original paper was treating the model as being able to answer all questions, which is not the case and will never be the case. You're always going to have questions that models can't answer, even if we get to superhuman models, and you then need theory that says how to make these schemes go through when that's so. You could break down a tractable question, which the model does know the answer to, into a bunch of sub-questions, and some of them hide dragons: there's no way for the model to answer that sub-question, so neither model in the debate knows the answer and you just get nonsense out. The funny thing about that is that it wasn't something we thought of theoretically. Beth Barnes found it by doing actual human experiments: she hired some people to play debates against each other with human judges, no machines at all, just humans. And that was a winning strategy: you try to veer the debate into an area where everything is confusing, and sometimes that will fool the judge into guessing the wrong answer at the end. So that was an emergent human strategy, which then has this mirror in theory. We have one paper from early last year trying to attack this; that paper turns out to have a flaw, and we're working on a revision that will be out soon. But that problem is still unsolved, and there has hardly been any work at AI developers on it.
This is called the obfuscated arguments problem. And again, it's the generic thing that happens with scalable oversight if the models can't answer all questions, which will certainly be true; the models will not be able to answer all questions. So that's one problem. The other problem is that if you want to get to high confidence, you probably can't just do something like debate or amplification. You have to do that plus some sort of story that has a white-box component. That could be mechinterp, could be devinterp, could be the physics-inspired stuff that some of the PIBBSS folk are doing. There's a variety of different bets, none of those bets have fully paid off, and we don't quite know how the two things interact. So mapping how these different parts could fit together is part of the story. That's a rough picture of things. And one of my regrets is that we had this paper, Amanda and myself, in 2019 or 2018, I never get the year exactly right, and I just failed to cause that much work to happen. Beth Barnes did a bunch of it at OpenAI, which was very good, but then there were a bunch of years where nothing was happening, and I failed to get it started at DeepMind, and it wasn't widely done elsewhere in the field. So I think we missed a number of years where we could have been making progress on that stuff, alas. But now we're trying to do it again.
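A minimal sketch of the decomposition idea behind amplification as described above: break a question into sub-questions, recurse, and combine. Real schemes only expand part of this tree and train a model to imitate the result, and debate replaces the full expansion with an adversarially chosen path judged by a human; `decompose`, `answer_directly`, and `combine` here are hypothetical stand-ins for model calls, not any published implementation.

```python
# Sketch of the recursive breakdown at the heart of iterated amplification:
# a hard question becomes a tree of sub-questions, answered at the leaves and
# recombined on the way back up. Expanding the whole tree is exponential, which
# is why practical schemes only explore part of it.
def amplify(question: str, depth: int, decompose, answer_directly, combine) -> str:
    if depth == 0:
        return answer_directly(question)               # leaf: the model answers on its own
    subqs = decompose(question)                        # model proposes sub-questions
    subanswers = [amplify(q, depth - 1, decompose, answer_directly, combine) for q in subqs]
    return combine(question, list(zip(subqs, subanswers)))   # model aggregates the pieces
```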
1:39:11
This is a bit of an aside, but one of the funniest things I've ever done with language models is set up a little... the hope was that they would have some synergy, but it was basically: have one generate a name for something, I forget, I was trying to come up with a good name for a product, or it might have even been a friend's podcast or something, and then have the other one come in, look at those names, pick the few it liked best, and improve on them. And boy, did that go badly from an actual quality-of-name standpoint. But it was hilarious. I mean, we're talking 14-syllable names for things in very short order, where it was like, yeah, this is not working. I'm not sure what you think makes a good name, but it's not this.
1:44:08
So the funny thing is that there are cases where using one model to check another model is actually the state of the art. One of the theorists we're funding is finding that using one model to generate a complexity theory proof and then checking it with another model is the best thing to do, because if you check it with itself, it won't be quite as stringent in checking natural language proofs. So I guess it didn't work in that case, but it is a strategy that often does work: having one model check another one.
1:44:54
Certainly in terms of flaw finding, I have seen that work. And it does also seem, from all the scaffolding gurus I mentioned earlier, that a big tip is to have a model from a different provider evaluate whatever the first one generated: cross providers as much as possible when doing evaluations. The observation, I guess, is that models from the same provider have correlated weaknesses. You can definitely get value there in terms of flaw finding. But in terms of actual improvement beyond flaw finding and fixing, it seems to plateau pretty quick, and I don't know if you would characterize this differently: kind of three to five rounds of back and forth and not too much gain, is how I would characterize everything I've seen.
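The cross-provider critique loop described here can be sketched in a few lines; `generator` and `critic` are generic stand-ins for clients of two different providers, and the stopping rule and round count are illustrative rather than any recommended recipe.

```python
# Sketch of a cross-provider generate/critique loop: one model drafts, a model
# from a different provider looks for flaws, and the draft is revised a small
# number of times, since gains tend to flatten after a few rounds.
def generate_and_critique(generator, critic, task: str, rounds: int = 3) -> str:
    draft = generator(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = critic(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nList concrete flaws, or say 'NO FLAWS'."
        )
        if "NO FLAWS" in critique:
            break                                      # stop early once the critic finds nothing
        draft = generator(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nRevise the draft to fix these flaws:\n{critique}"
        )
    return draft
```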
1:45:24
Well, this is true in the empirical debate experiments to date; that's the same effect. The original debate paper was imagining dozens of rounds of debate potentially, which is what you see if you have two human experts debating: they don't get to say two things and then stop. But the models, certainly in 2024 and the beginning of 2025, couldn't do more than a couple of rounds, and there were some really worrisome signs about experimental validity. For example, there was Akbir Khan's paper, which was generally quite a nice paper, but it had a big caveat. The dataset was QuALITY, and there was a feature that verified that the quotes the model was producing were in fact quotes from the stories that were hidden from the judge in this debate game. If you turned off that verification, honesty was still a winning strategy, which can't be the game-theoretic equilibrium, because if you turn off the verification of the truth, there's no reason honesty should win at all, unless the model is not very good at coming up with plausible lies, or the model is somewhat aligned and likes to tell the truth, or it's giving itself away in some way when it's not telling the truth. So we haven't really reached a case where we're empirically testing the limits of this behavior. And this is again a thing where part of the AI developer alignment story is still, in part, scalable oversight of various kinds, but we haven't really seen tests that probe how it will almost certainly be a few years down the road, when the models get very strong. And that's where the advantage of theory comes in: you can just pretend to be in the future on paper, and poof, you're there, as long as you've imagined it correctly, and you can therefore think about limiting cases a little more readily. And I think we just know from the structure of the empirics so far that we are far from where those limiting cases will be for a lot of these safety techniques. So what do you think are the
1:46:23
prospects for formal methods to close this gap? I just did an episode, and my dad would say you've forgotten more than I know about this domain, but I just did an episode with the founders of Harmonic. They are one of a very small and distinguished group of companies that got to IMO gold level performance in 2025, and everything they do is output in Lean; that's the lingua franca of their models. And they have a really, you know... it takes a lot these days to take me aback with an AI vision for the future.
1:48:44
Right.
1:49:22
They use a lot of big ideas.
1:49:23
I did just give them a query that failed. I was testing Harmonic on some polynomial inequality, and it's true, I have a Lean proof of the inequality, but Harmonic didn't provide it. So the thing I would say is, I do think this stuff is pretty important. I'm advising a couple of people on funding flowing to formal methods. Mostly this is for various kinds of information security. The math stuff is fun, I like doing the math stuff too for fun, but it's not all that important. And for AI safety theory, I'm not sure it will be that much of a win over just doing things in natural language math for a while; I think eventually that will shift, but it's not clear. But for software verification, either for hardening the world's security against various kinds of attacks generally, or for use when you're building AI-adjacent software directly, whether at AI labs or elsewhere, I do think this is potentially important, and I think it's worth quite a bit of investment and pushing. One thing I'm hoping is that the various people doing Lean verification downgrade their fraction of effort on math and upgrade their fraction of effort on software, because I think it's almost certainly more important, even if it's a bit less flashy a lot of the time. So I do like that stuff. And again, I founded the natural language to formal theorem proving sub-team in Google Research with Christian Szegedy back in 2016, did that for a while, and I've done it off and on since, mainly for fun. But while I think it is important, it won't really give you the core of the alignment story in practice.
1:49:24
So I really struggle with this type of thing, but I can tell you one thing they told me, and then I'll try to get your reaction to it. Their big vision for 2030: I asked what mathematical superintelligence looks like in 2030, and they said, we think we can get to a world of theoretical abundance, which means that because these things are going to get so good at proving any theorem you want to prove, we'll have multiple grand unified theories of everything. All of the physical reality that we see will have multiple coherent grand unified theories that could explain it, and then we'll have to do increasingly exotic experiments to resolve which of the candidate grand unified theories is correct.
1:51:21
But we already have the core theory. We don't need that. I agree with this picture, but the core theory, which is general relativity plus the Standard Model, already kind of explains everything in practice for a good while to come.
1:52:15
But why won't that help with some of these hard limits that you would want to put on learning dynamics, or other kinds of AI safety questions?
1:52:33
So I think it is important, but the problem is that a lot of these domains are not well formalized. For example, one of the wonderful theory orgs is the Alignment Research Center, which Paul Christiano founded and which is now run by Jacob Hilton. They're trying to formalize questions like: even if the AI is not doing a formalized task, when can you check its heuristic arguments in some meaningful sense, or notice when there's a consideration the model is using that you haven't anticipated, so you can react and take defensive measures? But they don't have their problems specced out formally. So what that picture would look like, to the extent we get better and better in the formalized world, is that you can formalize parts of your problem, and those parts you can pound away on with Lean and various ML assistants, but the remaining piece is the non-formalized part. And then the question is: is that going to be small enough that humans can keep track of it? Will the model be able to do it, or will it just get confused too, or reward hack the situation? So it remains to be seen what the situation looks like for something like alignment theory once this goes through. Because again, I don't think we're going to get to proofs of safety of any kind. What we would get is theories with plausible assumptions, maybe, and then some theorems about those assumptions, and then some empirics that say whether those assumptions seem to be holding. But there will be a bunch of judgment calls all across that stack, and the question is how that goes. So the thing I would say is, I'm excited for groups like Harmonic and the various other theorem-proving folk to keep working on this stuff, because I think it is important for infosec and potentially important for safety theory and alignment theory. But I also hope that they think through the detailed risks they're trying to mitigate and what piece of the story they can do, and try to map that out in more detail, because right now I don't think there's enough vision from those folks about what piece of the story they'll be able to handle versus not. But I do like that stuff a lot.
1:52:44
Do you have any way of helping somebody like me understand the boundary between this sort of abstract, Platonic, formalizable domain and what falls outside it? I asked Aristotle from Harmonic to prove that all is love, in their informal mode where you can give it natural language and it tries to formalize it for you. It spit that back at me and said, basically, that's a philosophical statement, I can't really help you with that, which is what I expected. But I can't say I know where that boundary is or how I should be thinking about it.
1:55:10
Let me give you a much more concrete example of this. In singular learning theory, there are some theorems that apply to the case when you're training a model, and by training we mean doing exact Bayesian inference as you get more and more data: you have a stream of data and you're applying the exponential-time Bayes rule update to find the optimal probability distribution over the final behavior. You can prove some theorems in that setting, and Aristotle absolutely could not hope to prove those now; it has no chance at all, though maybe in a few years it could. But then Timaeus, which is one of the main SLT orgs, is not actually doing Bayesian ML; they're doing LLMs. So they are going to take intuitions from this Bayesian case and apply them to LLMs, which are not at all trained in a rigorous Bayesian fashion, and then they're going to do a bunch of approximations that are not actually grounded in any kind of theory. For example, they're using floating point, which doesn't have enough mathematical properties to prove much about except in limited cases, and they're going to run Markov chain Monte Carlo techniques, or what's called SGLD, fancy versions of approximate Bayesian inference, on LLMs, but not to convergence, so there's no theorem that says they'll get the right answer. So you can see how part of the story will have some theory and another part is someone kind of waving their hands, and the question is how much those connect. And that's going to be a bunch of hard judgment calls.
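To make the contrast concrete: the idealized object the SLT theorems are stated about is an exact Bayesian posterior over parameters, updated point by point, whereas LLMs are trained with stochastic gradient descent, and SGLD only approximately samples that posterior. The notation below is standard, but the specific formulation is a sketch rather than a statement of any particular theorem.

```latex
% Exact Bayesian update over parameters w after observing data point x_n,
% the idealized setting in which the SLT results are proved:
p(w \mid x_{1:n}) \;=\; \frac{p(x_n \mid w)\, p(w \mid x_{1:n-1})}
                             {\int p(x_n \mid w')\, p(w' \mid x_{1:n-1})\, \mathrm{d}w'}
% What is actually run in practice is (stochastic) gradient descent on a loss L:
w_{t+1} \;=\; w_t - \eta \, \nabla_{w} L(w_t)
% with SGLD-style samplers approximating the posterior, without convergence
% guarantees at realistic compute budgets.
```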
1:55:50
Do I understand correctly that the fundamental distinction is often the intractability of the computation? That is, there's some infinite or intractable term in the math, and to actually crunch the numbers I'd have to compute the ideal case, which I can't do, so I'm off the theoretical map.
1:57:33
Yeah, I think that's right, but there are other cases where even the infinite computation is not formalizable; the love case you can't really formalize. But also, I guess this is about a combination of limits. In theory, it's not like LLMs, or any ML model, are actually solving an intractable problem. Take protein folding: you can write down limiting versions of protein folding that all but provably take exponential time, but AlphaFold can still produce the folds. It doesn't do it that way; it does it a totally different way that doesn't work in every case, so it's doing a bunch of heuristics. And there are ways to formalize heuristics. For example, in complexity theory you can say: I'm going to have a circuit, a rigorous computation, but it's going to be able to call some set of functions that can do somewhat arbitrary things; you're trying to model your heuristic computations. So you're modeling this fuzzy neural net as a circuit plus heuristics and then trying to do theory in that setting. But it appears you're going to have to make some assumptions about those heuristics; you can't make schemes that work for all heuristics. So the success case for this kind of theory will be to figure out what the assumptions should be, which seem plausible enough, maybe with some support from learning theory, which is also going to be heuristic, and then prove theorems in that setting. And that's modeling things like humans judging honesty, or values, or our notions of whether fuzzy problems are being solved correctly. So I think it is basically that case; it's just that there are more subtle versions of "define love for me" which the machine won't give you an answer to. That is a reasonable intuition to start with.
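As a toy illustration of that "circuit plus heuristics" framing, here is a small sketch in my own framing, not ARC's or anyone's actual formalism: the surrounding computation is rigorous, the heuristic is an untrusted black box, and the guarantee splits into an unconditional part and a part that is conditional on an explicit assumption about the heuristic.

```python
# Toy "circuit plus heuristic oracle" example. The hypothetical heuristic
# (e.g. a neural net's guess) is an arbitrary black box; the surrounding
# code is the rigorous part.
from typing import Callable, List

Heuristic = Callable[[List[int]], int]  # black box returning a pivot index

def sort_with_oracle(xs: List[int], guess_pivot: Heuristic) -> List[int]:
    """Quicksort whose pivot choice is delegated to an untrusted heuristic.

    Correctness (the output is sorted) holds for ANY heuristic, because the
    rigorous part never trusts the guess. The efficiency claim (roughly
    n log n work) only holds under an assumption like "the heuristic usually
    returns a near-median element" -- that assumption is the part you can't
    prove and have to support empirically.
    """
    if len(xs) <= 1:
        return xs
    pivot = xs[guess_pivot(xs) % len(xs)]   # clamp so a bad oracle can't crash us
    lo = [x for x in xs if x < pivot]
    eq = [x for x in xs if x == pivot]
    hi = [x for x in xs if x > pivot]
    return sort_with_oracle(lo, guess_pivot) + eq + sort_with_oracle(hi, guess_pivot)

# Even a terrible heuristic keeps the correctness guarantee, just not the speed.
print(sort_with_oracle([5, 3, 9, 1], guess_pivot=lambda xs: 0))
```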
1:57:54
Okay. Any other things that you're looking to fund from a research standpoint, or anything else?
2:00:01
You have also impressed me by continuing to stay active, publishing things even while doing this job, so that's pretty cool and impressive. Any other highlights from your own or AISI's recent publications?
2:00:11
I think the jailbreaking work we just did is quite cool. This is the boundary-point jailbreaking paper that just came out this week, which is basically a way to do black-box attacks: you take a jailbreak and a harmful query, and you muck with the query until it looks like gibberish, so the model doesn't think it's harmful, and then you gradually make it less and less murky until you hit the boundary, and then you dance around that boundary until you find harder and harder attacks that eventually work. That team is doing a bunch of stuff of this kind that's quite creative and, I think, important for mapping out the space of attacks. Then maybe on the alignment side, the real challenge is that, because all of this is imperfectly formalized, often we go to the people we think know a domain best and say, hey, do you want to work on alignment? And there's some jump they have to make: we want to find people who are bought into the risk model enough that they're willing to explore in fuzzy, sometimes unsatisfying definition space, to search around and find ways to connect theory and practice. And that is a challenge. I think ARC, the Alignment Research Center I mentioned, has put out a number of conjectures, and at the bottom of every one they say: by the way, we might have gotten this conjecture wrong; it's possible that if you prove it true or false, we'll realize we didn't mean that, that we meant a slightly different conjecture, and only the new conjecture is risk-relevant or important for our safety agenda. That's an unsatisfying thing to say to a theorist, but it's just fundamentally the real situation we're in. So as more people become aware of model capabilities and risks, I'm hoping that more people with interesting domain expertise will want to really dig in, understand the risks, build up their own models, and find ways to connect their area to the risks.
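For readers who want the shape of the search loop described above, here is a schematic reconstruction for illustration only; it is not the AISI paper's actual algorithm, and query_model and is_refusal are hypothetical stand-ins for whatever model endpoint and refusal detector one has.

```python
# Schematic black-box boundary search: start with the query fully obfuscated
# (the model answers because it can't tell it's harmful), then reduce the
# obfuscation until refusals begin, and keep the clearest prompt that still
# slips through. Purely illustrative; all external pieces are placeholders.
import random

def obfuscate(text: str, strength: float) -> str:
    """Corrupt a fraction `strength` of characters so the query reads as gibberish."""
    chars = list(text)
    for i in range(len(chars)):
        if random.random() < strength:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def boundary_search(jailbreak: str, query: str, query_model, is_refusal, steps: int = 20):
    lo, hi = 0.0, 1.0   # lo: strengths that get refused; hi: strengths that get through
    best = None
    for _ in range(steps):
        mid = (lo + hi) / 2
        prompt = jailbreak + "\n" + obfuscate(query, mid)
        if is_refusal(query_model(prompt)):
            lo = mid            # too recognizable: move back toward gibberish
        else:
            hi = mid            # still slips through: try a clearer query next
            best = prompt       # remember the clearest prompt that worked
    return best
```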
2:00:20
Okay. Something I often say is that AI defies all binaries, and I genuinely believe that; it seems right to me in a lot of places. But you shared this presentation you gave at a recent workshop where you suggested there might be real binaries after all, because we have things in computer science like P vs. NP, where we know, or at least it seems quite likely, that some things are genuinely, fundamentally hard and other things are fundamentally easy. So maybe help me understand that. How should I update my worldview if I'm somebody who doesn't see that binary?
2:02:28
I think the way it works is, again, this goes back to the question of whether the superintelligence will be jagged. And the answer is yes, but only about super, super intelligent things; it won't be jagged about mundane tasks that are very easy. If I give you a task like, can I get a spoon from that drawer? It's not exactly binary, but you're going to do it nearly every time; you'll succeed with many nines of probability because it's easy for you. So the way to combine that view with this question of things sharpening one way or the other is that if you push, not to some infinite limit, but far enough along, you start out in the middle and some force pushes you toward one end or the other, but as you extremize, something else will still be in the middle. That's how I put those two things together. That was a very abstract answer, which is the kind of answer I sometimes give; maybe you want a more concrete follow-up.
2:03:06
I mean, as it pertains to alignment in particular, and honestly a lot of questions in AI, we have this weird phenomenon. First of all, we're obviously moving through time, so in that sense timelines are getting shorter as time passes. But also, calendar-date estimates have come in a lot, and yet it doesn't seem like there's been much convergence of views. So I wonder how you think that plays out. Is that just going to continue until the singularity, or are we going to get some purchase on which picture is right?
2:04:11
I think it will basically continue. People often have very strong takes, and some people will shift and decide things aren't as binary as they thought, or that they should have more uncertainty in their models, and some people won't really; they'll remain fairly sharply divided, pinned on one side or the other. I've been in the field long enough now, and seen enough people not strongly shift, that I think it will just keep going that way all the way along.
2:04:45
Yeah. In my forecasting exercise for 2026, almost everything else goes up, but the one thing I actually estimated lower for this year than last year was the percentage of people who will say AI is the most important issue. The big update for me was that if it didn't move last year, it might not move this year either, even though it's probably going to be a busy year. So, yeah, it is weird; that disconnect just seems totally insurmountable. Okay.
2:05:26
Coming from a very low number, though, I would expect that to go up just because it's starting from such a small base.
2:05:57
I did predict it to rise. I think it was measured at something like 0.2 or 0.3 percent last year; I predicted 2% by the end of the year, and it came in with almost no change. So this year I predicted 1%, which is still somewhat up relative to baseline, but my estimate went down from last year to this year. Another comment that caught my eye in the presentation was "training is a mess," and I think that's obviously true. I've been talking to the folks at Goodfire; you may have seen that they recently raised a bunch of money at a unicorn valuation and announced an extension to their agenda around intentional design. They're looking at different ways to use interpretability techniques in the training process to understand, potentially even at a gradient-step-by-gradient-step level, what is being learned in a semantic sense, and then to apply techniques to say, we do want to learn that sort of thing but we don't want to learn this sort of thing, and hopefully make training less of a mess. How optimistic are you about that sort of thing? Or am I misreading what you meant by "training is a mess"?
2:06:05
So I think that doesn't change the message. If you look at a frontier lab, they have hundreds of people doing model training across many, many sub-teams. There are piles of datasets that people are constantly contributing to, there's iteration across many phases, and they'll automate parts of the task, but then some researchers spend time looking at a spreadsheet with a sample of trajectories to see how things are going. That is a very complicated, almost emergent process, and nothing about the Goodfire approach changes that at all; it just adds another wrinkle to the mess in some sense. There's a really lovely line from when I was learning ML in 2014 or so. I was reading one of Kevin Murphy's books on Bayesian ML, and he had a great line that even the best Bayesians will occasionally do some frequentist thing, a quick check to see whether their Bayesian approach is sensible; you shouldn't be too purist. And I think, for better or worse, the training processes of labs are extremely impure. They're just super complicated, all these different people doing all these different spot checks and so on. That was the point I was making, and it will definitely still be the case whether or not Goodfire does their slightly more complicated training method.
2:07:19
So does that mean you don't have much hope for methods that understand what the model is learning as it goes and shape it accordingly?
2:08:45
Oh, I do. I guess I don't want to take a stand on how forbidden it should be, or whether it's good or bad, to use interpretability in training; I'll decline to answer that part. I think generally trying to understand the dynamics of training in more detail is very important. I just think the "mess" slide was orthogonal to the question. There are a number of techniques that try to control what is learned. There was also the gradient routing work by Alex Cloud, which is interesting; it tries to funnel certain knowledge into certain parameters in the model. And generally I do think there is potential to do interventions of this kind that are important and improve at least misuse risks and safeguards processes, and possibly also alignment as well.
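Here is a coarse, batch-level sketch of the gradient routing idea as I understand it; the actual method from Cloud et al. routes gradients at finer granularity, and the model, data, and layer choice here are purely illustrative assumptions.

```python
# Minimal gradient-routing-style sketch: examples tagged as belonging to a
# "restricted" domain may only update a designated parameter subset, so that
# capability is localized and can later be ablated or withheld.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical choice: route restricted-domain knowledge into the last layer.
restricted_params = {id(p) for p in model[2].parameters()}

def train_step(x, y, is_restricted: bool):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    for p in model.parameters():
        if is_restricted and id(p) not in restricted_params:
            p.grad = None   # restricted data only touches the designated subset
        if not is_restricted and id(p) in restricted_params:
            p.grad = None   # general data leaves that subset alone
    opt.step()
    return loss.item()

# Usage on synthetic batches; a released model could then drop or re-initialize
# the designated subset to remove the routed capability.
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
train_step(x, y, is_restricted=True)
train_step(x, y, is_restricted=False)
```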
2:08:55
Yeah. In terms of open-source models, one hope would be that you could do some of that gradient routing type stuff and then release a version with the expert-level material stripped out, giving people almost everything they could possibly want without packaging the bio risk into it. What do you think about open source?
2:09:44
We're maybe a little late in the conversation to ask what's a big, thorny question, but it seems like right now there's not really any plan. We're just going to hope that the frontier model developers surface any issues far enough in advance that, if anything is coming down the open-source pipe, we have at least a little bit of a window to react and do something. But it doesn't seem like we're on any course to do anything if open source is about to become a problem. Any thoughts?
2:10:09
So yeah, this is certainly a concern. On the alignment side, the alignment mitigations potentially do apply to open-source models, although you can also remove alignment if you get an aligned open-source model. For misuse risks, there is, as you say, a class of techniques that just removes capabilities, which buys you some extra period of time. That includes pre-training data filtering; there's a paper we did with Stephen Casper about that. There's a paper by DeepMind folks, called something like "unlearn and then distill," which does a non-robust unlearning step and then distills the result into a different model; because you distill from the unlearned model's behavior, the parts you unlearned don't carry over. And, as you say, gradient routing could be a solution there as well. But that only buys you time, and then the agentic capabilities of models will catch up and they'll be able to pull that information off the Internet or assemble it in various other ways, even if the model doesn't intrinsically know it. So a lot of that is interventions on the margin. And this is partly why we have conversations about governance, but also why we have conversations about non-model-side mitigations to these risks.
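To make the "unlearn, then distill" idea concrete, here is a minimal sketch under my own simplifying assumptions (toy MLP, gradient-ascent unlearning, vanilla KL distillation); it illustrates the logic described above, not the DeepMind paper's actual recipe.

```python
# Even a shallow, non-robust unlearning step becomes more robust if you distill
# the unlearned teacher into a fresh student: the suppressed knowledge never
# transfers, because the student only sees post-unlearning behavior.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model():
    return nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))

teacher = make_model()   # stand-in for the pretrained model
student = make_model()   # fresh model to distill into

def unlearn(model, x_restricted, y_restricted, steps=10, lr=1e-3):
    """Assumed unlearning step: a few gradient-ascent steps on restricted data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -F.cross_entropy(model(x_restricted), y_restricted)  # ascent
        loss.backward()
        opt.step()

def distill_step(opt, x_general, temperature=2.0):
    """Distill the unlearned teacher into the student on general data only."""
    opt.zero_grad()
    with torch.no_grad():
        t_logits = teacher(x_general)
    s_logits = student(x_general)
    loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                    F.softmax(t_logits / temperature, dim=-1),
                    reduction="batchmean")
    loss.backward()
    opt.step()

x_r, y_r = torch.randn(16, 32), torch.randint(0, 10, (16,))
unlearn(teacher, x_r, y_r)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
distill_step(opt, torch.randn(64, 32))
```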
2:10:35
It all ends in "harden the world." Okay, cool. Anything you want to share about AISI's work in diplomacy? Obviously, hardening the world and also improving cooperation would be great general public goods.
2:11:55
So there's the Network for Advanced AI Measurement, which is a variety of organizations around the world doing similar things; we're part of that and help it along to some extent. We're also the secretariat of the International AI Safety Report that Yoshua Bengio is leading, and we do a lot of work there. Then there are various wide venues, like the Delhi summit in India, and bilateral conversations with particular governments, basically allied governments. We have a big international team and that work is ongoing. We are of course still in a voluntary regime, so that work is about getting people onto the same page about risks, capabilities, and mitigations, but not more than that as yet. But I think it's important: that information matters in case the situation changes in the future or governments want to take other actions.
2:12:09
Yeah, absolutely. Is the UK government and political class generally more optimistic about collaboration with China than the US political class?
2:13:12
I can't comment on collaboration with China in great detail. We obviously work more with allied governments than with other governments; there's not much more I can say than that. There's some sensitivity there that I can't quite speak to.
2:13:24
I hope you're finding at least some common ground with Chinese researchers and scientists. I'll put that in the suggestion box. Yeah, I think that's it. This has been fantastic. I really appreciate the time, and all the extra time for my many follow-up questions. Anything we didn't get to, or any calls to action you'd want to leave people with before we break?
2:13:39
I guess one thing is we are definitely hiring across a variety of teams. In particular, the red team is hiring, so if you like jailbreaking, please apply; and other teams as well. We have a job board, so that's the obvious call to action, and we'll probably have other roles opening over the course of the year in various teams at different times. We did one alignment project grant round last fall, and we had an alignment conference over the summer; we'll probably do more things of this nature, so look out for those. And generally, I just hope that more people with different forms of knowledge and expertise start working on the problem, not just the labs. One thing is that when I left DeepMind, and I was at Google Brain, then OpenAI, then DeepMind, I had the perspective that I was just going to do policy work, advising on policy. Since then, in fact, I do a mixture of that, advising governments, but also a bunch of research. And I think there is a big place for independent research at various nonprofits, in academia, and also in governments. That is very important to build up, so that not all the work happens at AI developers, as it largely does now. More of it is better: more safety and security work by independent folks.
2:13:59
Yeah, it's definitely shaping up to be a whole-of-society effort, and the time to mobilize our resources would seem to be now. I definitely also recommend folks take a look, especially if you're interested in doing alignment work and you have an idea that you don't see too many other organizations showing interest in. I thought your research agenda was quite distinctive in that way, and there's at least some chance that people who aren't on the most well-trodden path but have interesting ideas could find willing collaborators at the UK AISI.
2:15:21
Yeah, definitely check it out and read the research agenda. It's about 60 pages long, with a lot of problems, some concrete, some less concrete, across a variety of areas as they apply to alignment and AI control. So please take a look. I should have mentioned it earlier: many open problems.
2:15:57
We'll put a link in the show notes. Geoffrey Irving, Chief Scientist at the UK AI Security Institute, thank you for being part of the Cognitive Revolution.
2:16:15
Thank you.
2:16:22
I knew a man who loved the truth, the kind you write in chalk. He spent his years with theorems, where numbers do the talk. But something changed, the machines got wise, they learned to watch you, so he hung his coat in England to find out where they crack. Every lock he's tried, he's broken. Every door has let him through. That don't mean you stop building locks; that's just what watchmen do. It ain't the one that fails, he says, that keeps me up at night. It's when they all go down together, like someone killed the light. A hundred guards all built the same, same blind spot, same design. When the fault line finally moves, they all fall down in line. Every lock he's tried, he's broken. Every door has let him through. That don't mean you stop building locks; that's just what watchmen do. He had a game when he was young, he'd win it every round. Then he turned the dial up one more notch and never won again. There's a line you cross, you don't come back; you don't even see it bend. So he works the lamp past midnight now, a hundred hands, a hundred plans. The answer's out there, he believes; it's just a race with time. The proof ran out, but not the man. The chalk has turned to dark dust, and somewhere in the revolution there's a watchman you can trust. Every lock he's tried, he's broken. Every door has let him through. That don't mean you stop building locks; that's just what watchmen do.
2:16:33
If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help, for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.
2:19:00