Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
Geoffrey Irving, Chief Scientist at the UK AI Security Institute, discusses the current state of AI safety and security risks. He explains how frontier AI models are rapidly advancing in capabilities while existing safety measures may not provide sufficient reliability, and outlines the UK's approach to evaluating and mitigating catastrophic risks from AI systems.
- Current AI safety techniques are unlikely to achieve high reliability ('many nines') and could fail simultaneously due to shared underlying vulnerabilities
- Reinforcement learning is now working effectively beyond strictly verifiable tasks, expanding AI capabilities into fuzzier domains previously thought safe
- The UK AISI has successfully jailbroken every frontier model they've tested, though defenses are getting stronger and require more sophisticated attacks
- Model capabilities are advancing faster than scaffolding improvements, with base model upgrades driving most performance gains rather than better agent frameworks
- Evaluation awareness in AI models is increasing rapidly, making it harder to assess true capabilities and safety as models learn to behave differently during testing
"You're not going to get to a lot of nines with the current technology."
"All of these are kind of pragmatic and they do, I think all have correlated potential failures where they could in fact all fail for the same essential reason."
"Every time we did safeguard testing, we jailbroke the model. So that's what happens."
"The newer models are more eval aware than the previous models, and that's increasing fairly rapidly."
"I'm fairly optimistic the problem has a solution. The way I typically like to say this is that in, I don't know, 50 years, 100 years, thousand years, someone will have solved alignment."
Hello and welcome back to the Cognitive Revolution. The Cognitive Revolution is brought to you in part by Granola. If you are a regular listener, you've heard me describe the blind spot finder recipe that I'm using to look back at recent calls and help me identify angles and issues I might be neglecting. But it's also worth talking about how Granola can help raise your team's level of execution by supporting follow-through on a day-to-day basis. This past week, for example, I had several working sessions with teammates and I committed to a number of things. In the past, to be honest, there's a good chance I'd have forgotten at least a couple of the things I said I'd do. But with Granola, I can easily run a to-do finder recipe and get a comprehensive list of everything I owe my teammates. This is the sort of bread-and-butter use case that has driven Granola's growth and inspired investment from execution-obsessed CEOs, including past guests Guillermo Rauch of Vercel and Amjad Masad of Replit. See the link in our show notes to try my blind spot finder recipe and explore all of the ways that Granola can make your raw meeting notes awesome. Now, today my guest is Geoffrey Irving, a pioneering machine learning researcher who's co-authored seminal papers with a who's who of giants in the field and who is now Chief Scientist at the UK AI Security Institute, which is in all likelihood the most situationally aware government entity in the world today. With roughly 100 technical experts on staff and a mandate that includes threat modeling, pre-release frontier model evaluation for dangerous capabilities spanning biosecurity, cybersecurity, and loss of control, advising the UK government on strategies to reduce catastrophic risk, funding independent frontier research, and engaging in global diplomacy, Geoffrey has one of the most broad and commanding views of the AI landscape that you'll find anywhere. And while he is optimistic about our ability, in the fullness of time, to solve the major open problems in AI safety, for today, without a hint of hype, he paints a genuinely alarming picture. Our theoretical understanding of machine learning is nascent. Nobody, he argues, should be particularly confident in their mental models of how AI will go. Models already outperform a majority of experts on a great many security-related tasks, and there is no good reason to expect that their progress will stall. Reinforcement learning is working well beyond strictly verifiable tasks, and jaggedness matters much less when even the model's weak spots are as good or better than the best humans. The many increasingly sophisticated bad behaviors we've seen over the last 18 months are broadly all different versions of reward hacking, a problem for which we lack theoretical or practical solutions. As such, we likely won't get that many nines of reliability from current safety techniques, and there's some reason to expect that they could all fail at the same time for the same reasons. It is getting harder to jailbreak models, but the AISI red team has never failed to do so, and meanwhile eval awareness is an open and growing problem. Voluntary cooperation between frontier model developers and AISI is working pretty well, but not everyone is participating. AISI, for its part, is seeking to fund theoretical research in areas like information theory, complexity theory, and game theory, which might produce stronger guarantees. But these fields, like most of the rest of the world, are just beginning to take AI seriously at all.
Geoffrey is an intellectual powerhouse, but I came away from this conversation just as impressed with the UK AISI as a whole. This is an organization staffed with top-notch talent that has its finger on the pulse of industry development and is speaking very accurately and plainly about AI's trajectory and how many major questions remain unanswered, even as frontier model company CEOs tell us that they are less than three years away from creating expert-level AI machine learning researchers. With that, I hope you are focused and motivated by this conversation about the AI state of play with Geoffrey Irving, Chief Scientist at the UK AI Security Institute.
0:00
Geoffrey Irving, Chief Scientist at the UK AI Security Institute, welcome to the Cognitive Revolution.
4:07
Thank you. I'm excited to be here.
4:16
I'm excited for the conversation. We've exchanged messages for a while and have been building up to this, and I'm excited that the moment is finally here. You have really a storied publication history that goes back to working on the original TensorFlow papers with some guy named Jeff Dean, being a co-author on the original RLHF paper, working on these concepts years ago.
4:18
For language.
4:42
For language, okay, a caveat. But still, right there alongside Paul Christiano, some early AI safety papers with no less than Dario on concepts of using debate to try to bootstrap into stable equilibria and stable AI safety regimes, and even published a call for social scientists to enter the field of AI safety with one Amanda Askell. So I would be very interested to hear how it was that you came to have such a good nose for where AI was going so early on. All these things are well before ChatGPT.
4:43
Yeah, so I used to think I was like new to ML, but I said that for too long and now I'm not new to ML. I think I got out of undergrad with a bias against statistics. I'd only seen frequentist statistics. I thought they were kind of weird looking. I'd never seen Bayesian statistics. So I just didn't like any of the stuff. And what I liked instead was things that have kind of hard theory, where you know the equations, you have some ground truth. And that was like computational physics and mathematics and, on the computer science side, programming languages and theorem proving and such. And I did mostly computational physics and geometry for grad school and then kind of years after that until around 2013. And then I had realized two things. One, that machine learning was getting quite good. So the neural nets were starting to work and they were getting better and better and that was going to continue, probably. And then two, even in the areas where I thought it was about knowing precise theory, so physics or theorem proving, you needed common sense and you weren't going to get away with just the theory. That was not going to be enough. So if you're doing mathematics or programming languages, you needed some ingredient of heuristic picking between the various options and you wouldn't be able to do a good job designing kind of human-usable, friendly systems without basically machine learning. And so then I was like, okay, I should switch into machine learning. I was doing something else back then. The first thing I tried to do was autocorrect for code in 2014, which is too early to do autocorrect for code. It did not work then. Also we didn't know how to do machine learning. This is myself and Martin Wicke, and we knew computer science, physics, geometry, but not really ML. So we tried to do a startup for a year. It didn't work. And then we said, well, how do we learn? We learned by joining Google Brain at the time. And I've done ML jobs since then. So that was. I joined Google Brain in 2015, and the goal was for me basically machine learning for theorem proving. I was aware of safety at the time, but I didn't see an attack on the problem that I thought was good. And so I thought I'll work on some other kind of different problem, which is just sort of hardening the world using verification, which again was going to be using machine learning to do theorem proving in practice. So that was sort of 2015, 2016. And then I guess that was like, oh, I had some useful thoughts early on, but there's two other kind of parts of the story. There's kind of two bits of inherited wisdom, which is why it looks like I predicted things early. One is just I joined OpenAI in 2017, and Dario and Paul were there and they had a bunch of cached thoughts about safety and how the machine learning field was going to develop. And so I was just sort of riding along from there. But then kind of more broadly, there's just a bunch of intuition coming out of kind of theoretical computer science, complexity theory, about how computations work, how we check computations that someone with more resources than you can run. And a lot of what I've done since then, including debate, for example, is just sort of applying that intuition, assuming that it will hold in some modified form in the machine learning world, even if it comes from some kind of area of more precision.
So again, you can just assume things are going to look like theory in some way, with a bunch of porting required, and you can predict a bunch, but not exactly how long it will take or when things will happen.
5:19
We can unpack both of those, I think, in more depth as we go. Certainly the quest for theory and bounds that you can really trust is a big theme of your work and the work that you're trying to encourage at AISI these days. And I'm also really interested to get your take on the relationship between math and the fuzzy, messy real world. But, well, let's circle back to it.
8:52
Good.
9:19
Fast forward to today. I'd be interested to know how AGI-pilled, quote unquote, you are today, and that sort of just informs the general picture. Because so many AI discussions, broadly and especially around topics of safety and security, go kind of immediately haywire when people have just such different intuitions around what it is we're likely to be dealing with. So I like to try to establish: what is it that you think we are likely to be dealing with? I don't expect that would be the official position of the UK AI Security Institute, but, you know, you're obviously a leader there. From reading the reports, it seems like you are not expecting any sort of wall or plateau in the immediate term.
9:19
So I think, fortunately, the things I can say are mostly also the things that we think officially, which is that we have a lot of model uncertainty about how things could go. And that could either mean that there are obstacles that cause there to be stalls for a good while, or there could be no such obstacles and things could go quite fast. And I think mostly anyone who confidently claims in one direction or the other, with too much confidence or 99% certainty that there are or are not big obstacles, is probably wrong and should be more uncertain. And I think that means for us, one, we want to map out what those different cruxes are, what could the obstacles be, what are kind of signs of development. And then two, we should assume, or place significant probability on, the current methods will scale, and where they don't scale, more mundane stuff will replace them and continue further sigmoids. And so we do, I think, have significant credence on things going fast. I won't say exactly how fast because I don't talk about exact timelines, but I think that is pretty important. And then I think we published a paper from the Strategic Insights team at AISI on different potential obstacles to AGI and what our progress over the last while has been on addressing those. Again, they could all not actually be fundamental obstacles, as in they could be solved not by pure scale, but by maybe some scale and some just steady algorithmic progress to models, to scaffolding, to data, that kind of thing. Or you need new algorithms and then they take longer. But generally I think both my view and the view of AISI broadly is: have model uncertainty over all of those terms, and then that should mean that you're not confidently saying it will either go very fast or will not go very fast.
10:00
Yeah, it's good to have some of that in a world-leading government. This won't be the big focus of today's conversation by any means, but what does your personal AI productivity stack or pattern of use look like today?
11:55
So I think I'm kind of vanilla. I use kind of all of the models for different things. I think mostly I use one of them as a default, usually it's Claude, but it varies, and then I go to other ones if they have specialties that are particularly good. I think for a good while GPT was better at math and Google was better at certain other things. This shifts over time, so I just kind of use general models. And then, I don't do a lot of coding in my job, but I do, for fun, kind of formal verification work. For a while I was just using Cursor there because the agents weren't good enough at doing full-on agentic stuff. That is not true anymore as of a few weeks ago. Now they are good enough. So that means that shifts to stuff like Codex and Claude Code, basically. But that's mostly not my job. Mostly my job is meetings and talking to people and advising on research.
12:09
I'm glad you're still making time for a little formal methods on the side. Okay, let's talk about the overall landscape in terms of the threat model that we have from AI. And I'd be interested in kind of also your characterization of what you understand to be the de facto plan to address it. Again, people have so many different starting points here. What do you think is kind of the set of big things that we should be worried about?
13:03
So I think that we kind of break down risks. The main two focuses of AISI are catastrophic risks and large-scale societal impacts. And the main three catastrophic risks we focus on are bio, large-scale cyber attacks, and loss of control. The team is called the Chem-Bio Team, for chemical and biological weapons, but more risk comes from bio in practice. And then on societal impacts, that's sort of human influence, so that's persuasion and kind of emotional reliance, and then various kinds of societal resilience, so like attacks on CNI, critical national infrastructure, and that kind of thing, and just sort of agent behavior in the world, various agent risks. And I spend most of my time on the catastrophic risk side, and Chris Summerfield, research director here, spends most of his time on the societal side, although we also do a mixture of both. So those are kind of the main risks we work on. I think we also are kind of thinking about gradual disempowerment, more structural risks, somewhat, but I don't think we quite know, and no one really knows, how to mitigate these at a large scale. There's some work we're doing that's either investigating that or thinking about mitigations, but that's more nascent. So that's the bulk. I think I forgot the other half of your question though.
13:32
Yeah. So in the absence of people changing the discourse, or new discoveries, new big ideas, how would you describe what we are on track to do today? The way I've characterized it, at least, is it's sort of defense in depth, where hopefully we can patch together enough nines through enough layers, all of which are kind of leaky, but hopefully they're not too correlated in the ways they're leaky. And this has always kind of worked in the past, so hopefully it'll work this time.
14:56
I think there are, yeah. So you're not going to get to a lot of nines with the current technology. I think broadly we can break this down by domain. So for misuse risks, like biological weapons and cyber attacks, this is mostly kind of safeguards, kind of differential access, so give models only to certain people that are vetted in some way, and then non-model defenses, so like pandemic preparedness and improved security, that kind of thing. And I think the safeguards are just not that strong. Open source models are also kind of pretty good, or there's a gap. And so the kind of stock plan is mostly, in some sense, you use the model-side mitigations to give yourself a window and then you try to harden the world against these risks. And whether that will go through or not, we should be uncertain about that. But there's not kind of strong solutions to those things if the models keep growing in strength as we see them growing. On the loss of control side, I think that is a combination of mundane, whatever, pragmatic empirical safety measures and a lot of monitoring, and then using that monitoring, again, this is the AI developer plan, typically using those mitigations to get you through into an automated safety research regime where hopefully you find better solutions than those first methods. I think this has various flaws, and so maybe it'll go through, but you're not going to get to more than like a couple of nines with that kind of plan. And I think you wouldn't know, with the current methods, that it was going to work until after it went through; you'd have a lot of uncertainty. And so I think that's kind of the story. And I think that most of the approaches we have now look like that. They're empirical. Maybe they'll go through, like on the alignment side, the pragmatic approaches to get through this automated safety phase: it's kind of AI control measures and monitoring and kind of honesty training and white-box detectors and all of this. All of these are kind of pragmatic and they do, I think, all have correlated potential failures where they could in fact all fail for the same essential reason. And you would need stronger advances to be confident that it will go through. And whether or not any of that goes through, I think we do need, because of the misuse risks, a lot of mitigations on the non-model side as well.
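To make the "stacking nines" arithmetic concrete, here is a toy sketch with made-up numbers, not AISI figures: if three leaky layers each fail 10% of the time independently, the combined failure rate is 0.1%, roughly three nines; but if there is even a 5% chance of a shared root cause that defeats every layer at once, that shared term dominates and you are back to barely one nine.

```python
# Toy illustration with made-up numbers: how "nines" stack when defense-in-depth
# layers fail independently, versus when they share a common failure mode
# (e.g. every layer breaking for the same reward-hacking-style reason).

def independent_failure(layer_failure_rates):
    """P(all layers fail) if each layer fails independently."""
    p = 1.0
    for rate in layer_failure_rates:
        p *= rate
    return p

def correlated_failure(layer_failure_rates, p_shared_cause):
    """Crude model: with probability p_shared_cause, a single shared root cause
    defeats every layer at once; otherwise the layers fail independently."""
    return p_shared_cause + (1 - p_shared_cause) * independent_failure(layer_failure_rates)

layers = [0.1, 0.1, 0.1]  # three layers, each "one nine" (90% reliable)

print(independent_failure(layers))       # 0.001  -> three nines combined
print(correlated_failure(layers, 0.05))  # ~0.051 -> barely one nine
```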
15:28
Hey, we'll continue our interview in a moment after a word from our sponsors.
18:04
Your IT team wastes half their day on repetitive tickets: password resets, access requests, onboarding, all pulling them away from meaningful work. With Serval, you can cut help desk tickets by more than 50%. While legacy players are bolting AI onto decades-old systems, Serval allows your IT team to describe what they need in plain English and then writes automations in seconds. As someone who does AI consulting for a number of different companies, I've seen firsthand how painful and costly manual provisioning can be. It often takes a week or more before I can start actual work. If only the companies I work with were using Serval, I'd be productive from day one. Serval powers the fastest growing companies in the world like Perplexity, Verkada, Mercor, and Clay, and Serval guarantees 50% help desk automation by week four of your free pilot. So get your team out of the help desk and back to the work they enjoy. Book your free pilot at serval.com/cognitive. That's S-E-R-V-A-L.com/cognitive. One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal, private benchmarks: challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple years, saving me countless hours.
18:08
But as you've probably heard, Claude is
19:51
the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. And the result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions. Based on those, I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts, not because they're bad, but because it's important to me to be able to fully stand behind everything I publish. But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's claude.ai/tcr.
19:54
So when you talk about how we can't expect to get too many nines.
21:30
Yeah.
21:34
Would it be a fair arithmetical move on my part to take 1 minus one nine, so 1 minus 0.9, and say your sort of implied p(doom), if you will, is at least like 10%? Or would you segment that down further and say things could go wrong, but I wouldn't put it in the doom category?
21:36
I'm disinclined to answer that question, I think, or at least to give numbers to things.
21:53
As a civil servant. I usually answer like 10 to 90%, which is also obviously kind of a way of not answering. But qualitatively it sounds like you are taking very seriously the possibility that this is going to go not just kind of crazy, but meaningfully catastrophic.
21:57
Yeah, I think loss of control, we view it as a potential catastrophic risk. I think it's like, what's the thing we're doing? There are a bunch of uncertainties about this threat model. So there are two different teams at AISI that do various kinds of empirical alignment testing, one using kind of adversarial methods, one doing kind of stepped-back statistical analysis of different factors that cause models to do sketchy things. And I think part of that research is trying to pin down this threat model and what drives strange behavior, when models are behaving in ways that you'd expect would correlate to these kind of extreme scenarios. Because we talk to a lot of partners within the government, or kind of other governments, or other parts of society, and people have pushed back on this kind of risk model. And so we want to provide as much evidence as we can. But again, it's an area where one should have a bunch of model uncertainty and then think through the details.
22:20
Despite that, can you unpack the intuition around why everything would fail at the same time for the same reason? That's something that I've heard from a number of people; Zvi, for example, always says that. And that thought seems to come very naturally and feel very intuitive to some people, and then to others it's like, I don't know, you know, there's jaggedness all over the place. Like, if I can't get a given model to do this and that today, why would I expect that suddenly everything's going to crystallize and there's going to be this uniformity in the model's ability to carry everything through all at once?
23:24
So I think that maybe there's an important thing there, where the models are jagged today. But if you ask them to do tasks that the models could jaggedly do five years ago, they're not jagged. And so the question is, for the capabilities you would need to realize a variety of risks, if you push forward a few years, or however many years it takes to get very strong capabilities, you should expect those models to still be jagged, but up above a frontier where potentially all the things you're seeing don't look jagged. So when I look at the best Go player in the world, the best chess player in the world, they have a bunch of jaggedness. If you sit them down against the next best Go player in the world, they'll kind of win or lose for idiosyncratic factors. They'll have different tastes, they'll have different parts of the board or the game they're better at. And if you sit them down in front of me, they'll just wipe the floor with me every single time, even if I have nine stones and I'm like a halfway decently strong amateur Go player. So I think you have to run the calculation, think about the model as it would be in the future. And I think part of this is, it's unhelpful when people talk about AGI or superintelligence or whatever as being this thing that can do everything, because it does imply that it's sort of so qualitatively different than the models of today. Whereas I think the non-magical version is just, it's better at a lot of things, and indeed has kind of superhuman performance in a variety of risk-relevant domains. And we know that models can be superhuman at certain domains. Like, they're better than me at knowledge, they're better than me at lots of math. This is kind of true for everyone: every person will have domains where the LLMs are better than them currently, and this is rising over time. And then they're very fast, so they can think quickly. Sometimes they can do tasks very well, like 10 times faster than humans can do them, just because of computational speed. And then they're not very interpretable, so the methods we have for interrogating the behavior are not that reliable currently. And so I think that sort of non-magical picture of more capable machines, still with some jaggedness up at the frontier where they are jagged, is enough to give you significant probability on these risks.
24:02
And you mentioned bio earlier as sort of what drives most of the risk, certainly in the chem-bio category. Is the number one really bad scenario in your mind some possibly prompted, possibly unprompted AI that somehow gets to a point where it can break through 12 layers of defense all at once, manages to release the bioweapon, and take...
26:30
It's mostly human misuse that we're focusing on there. So it's people using LLMs to do bio design of various kinds. I think the models do couple together, but I would say the loss of control couples more strongly to cyber than it does to bio. There are more scenarios where those two couple together, and I think that's why we have a team called the Cyber and Autonomous Systems Team: we merged cyber and autonomous systems, which was the team doing loss of control, because of that coupling. But again, for cyber misuse and for most of bio, that's about human actors.
26:57
So on that cyber-autonomy side, I'm just trying to get the modal story of what is going to happen, or what might happen. Obviously, due to jaggedness, you would have sort of a period of time in which these various defenses become breakable by the AI, but they also have to have kind of some restraint in them, I guess. I mean, if you listen to Buck from Redwood, you would say maybe they don't even have to have restraint. Maybe we might just let them do some of these things and kind of look the other way, which is an interesting commentary on us. Leaving that aside for the moment, they have to get to the point where they can kind of do them all. They have to string all this together, and then they, like, take over a data center and sort of entrench themselves. And then we're in a world where...
27:35
I don't want to talk about the super detailed modeling there, because some of that is not public stuff. I think maybe the background systemic thing to say is, if you imagine we're very, very serious as a world about deploying these things only in the most sandboxed, well-controlled states, risk would go down by a lot. Unclear how much, and whether it goes all the way to zero, probably not, but it goes down. We're not currently on track to be as serious as one might imagine about deployments of these models. And so I think some of that question is, how strong will our defenses be? And then importantly, if there are weird behaviors in models, do our defenses go up? Do we get more worried? And so, for example, across the last year, 2025, there were a variety of models from all developers doing sketchy things, acting deceptively or commenting out unit tests or all of this behavior. And our reaction as a world was mostly to continue training the models to be stronger, while at the same time also working on these defenses in some capacity. So I think a lot of the risk comes from the modal scenario where we are not doing the strongest kind of computer science, infosec, ML defensive layers around these deployments. And then also, as we find this evidence, what is the cycle of that feeding into further training? I think a lot of the misalignment risk comes from: you get some signal of weird behavior and you train it out, and that removes some fraction of the problem, but your methods only cover some fraction and the rest remains. Again, you should have model error there. You should say maybe it's going to generalize well enough that you can cover most of the story. But generally, the picture where you get these correlated failures is: they don't really start out all correlated, and then you apply some optimization pressure, because you're doing training or iterative development and deployment and the like, and the ones that remain all end up correlated in the
28:24
same way, because they're subject to that same general structure of optimization pressure.
30:36
Yeah, that's right. Yeah.
30:41
Okay, so here's a story I've pitched to a couple people over time; I'm interested in your reaction to it. If we take that model and we just extrapolate out a couple years. I know you guys put out a report recently that, I think, even quoted or cited the famous METR task-length exponential, and other indicators as well, of increasing ability to do bigger and bigger tasks with more reliability and more autonomously, et cetera. So if we extrapolate that out, let's say, whatever, two years from now, maybe three years from now, and at the same time we imagine that with each generation there's more optimization pressure put on the models to try to eliminate, or possibly just suppress, these bad behaviors. It seems like we might end up in a world where you can delegate like a quarter's worth of work to an AI in a single prompt, and then there's maybe a one in 10,000 to one in a million chance that it goes into some bad behavior mode and kind of actively screws you over as it is doing the quarter's worth of work that you just assigned it. Does that seem like a reasonable extrapolation of recent trends to you?
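The mechanics of that extrapolation are easy to sketch. In the toy calculation below, the starting task horizon, the doubling times, and the 500-hour quarter are placeholder assumptions for illustration, not figures from AISI or METR; the main thing it shows is how sensitive the "quarter's worth of work in one prompt" conclusion is to the assumed doubling time.

```python
# Back-of-the-envelope extrapolation of a "task horizon doubles every N months"
# trend. All numbers are placeholder assumptions, not AISI or METR figures.

current_horizon_hours = 2.0          # assumed length of tasks agents can do today
for doubling_months in (7.0, 4.0):   # assumed slow vs. fast doubling times
    for years in (1, 2, 3):
        horizon = current_horizon_hours * 2 ** (12 * years / doubling_months)
        # ~500 working hours is roughly one quarter of full-time work
        print(f"doubling every {doubling_months:.0f}mo, +{years}y: "
              f"~{horizon:.0f}h (~{horizon / 500:.2f} quarters)")
```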
30:45
So, I don't have a strong view about the numbers; I won't give a take on those. I think that when you do this kind of agent training, you're training the models to be more and more coherent. They're able to execute plans over longer and longer horizons in whatever portfolio of tasks you're training them on. And so they have this ability to be a coherent agent. And then models have various characters or personas or whatever. And so the failure modes are either, somehow you've ended up with a model that has some deceptive persona, where it's always kind of trying to deceive you, or, and I think maybe the thing you're pointing at is, you can have a model which is a bit more stochastic, but has the potential to be very coherent, and it sort of jitters its way into a bad portion of trajectory space. It's scaffolded with a bunch of memory, so it has kind of long-horizon state carrying forward in time, and it gets in a bad state and stays there. And this is one of the areas I think we're interested in theory folk and independent empirical folk exploring: what are the dynamics of models running for a long period of time, where you sample very long trajectories and they're sort of wandering around in model space? How does that behave? What would cause them to reliably shift back to a more reasonable kind of starting point? Just, how should we think about those dynamics? And that's an area where it's not clear to me that it is intractable to make progress. There hasn't been that much work; the number of person-years going into understanding those kinds of dynamics, you can count on, I don't know, a couple of hands. It's not that many. And the opportunity to understand the risk model, but also potentially define mitigations, is quite good.
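One very simplified way to picture the dynamics question Geoffrey raises is a two-state toy model: a long-running agent has some small per-step chance of jittering into a "bad" region of trajectory space and staying there, and you can ask how much periodically resetting to a clean starting point helps. The model and all its numbers are hypothetical; it is a sketch of the question, not of any AISI analysis.

```python
import random

# Toy two-state model of a long-running agent drifting into a "bad" region of
# trajectory space and staying there (a sticky state). All numbers are
# hypothetical; the point is only how periodic resets change time spent "bad".

P_GOOD_TO_BAD = 0.001  # per-step chance of slipping into the bad mode

def fraction_of_steps_bad(steps, reset_every=None):
    state, bad_steps = "good", 0
    for t in range(steps):
        if reset_every and t % reset_every == 0:
            state = "good"                        # reset to a clean starting point
        if state == "good" and random.random() < P_GOOD_TO_BAD:
            state = "bad"                         # once bad, it stays bad in this toy
        bad_steps += state == "bad"
    return bad_steps / steps

random.seed(0)
print("no resets:      ", round(fraction_of_steps_bad(100_000), 3))
print("reset every 200:", round(fraction_of_steps_bad(100_000, reset_every=200), 3))
```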
32:04
Hey, we'll continue our interview in a moment after a word from our sponsors.
33:58
The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English: send a daily briefing, triage support emails, or update your CRM. Whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works. No flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at tasklet.ai and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at tasklet.ai.
34:02
Let's take the other side for a second. How optimistic are you? Or maybe optimism isn't even the right way to think about it, but how much upside do you think there is in optimization, in alignment? Everything we've talked about so far is assuming that we don't have perfect control of models. They might be trying to screw us over, or they might just be confused. I maintain a slide deck of AI bad behavior, which I'm constantly appending new slides to, and, you know, it's not universal, but a very common theme is that there is some tension between goals that the model has, whether it's between something it learned in training and a system prompt, or a system prompt and a user thing, or even kind of runtime injection attacks or whatever. But once it gets into a spot where it's not really sure how to weight the different objectives that it has, then you can get into some strange behavior. So the alignment question is, do you think we can solve that? How much headroom do you think there is in terms of creating an AI that loves humanity, or otherwise is so robustly good that we don't have to worry about this anymore?
35:15
First we should say that even if you were to solve alignment in some sense, there are other problems. So the misuse problems are real. The misuse domains also could grow in the future. I think Michael Nielsen wrote a great piece about that a while back, last year sometime, about risks from new technologies, that puts this into the larger space of those. And then there are risks from gradual disempowerment; those need a bit of misalignment, we can get to that later. But I do think that there is this hope to close off, or mostly close off, this domain given enough time. So I'm fairly optimistic the problem has a solution. The way I typically like to say this is that in, I don't know, 50 years, 100 years, a thousand years, someone will have solved alignment, and that's either the machines or us. Hopefully it will have been us in time, or the machines perhaps on our behalf. This is kind of coming from a sense that, just in security in computer science, usually the defender wins in theory. So if you know how to design your game, your protocol, you can make it so that defense wins. And this is, I think, kind of a generic situation in a lot of areas of complexity theory. Then in practice, of course, there's lots of holes in this; a lot of practical information security does not feel like that, because we haven't actually gotten to the limit case. And it's super unclear whether we'll get to the limit case for alignment as well. But I guess I do have some strong sense that there is a solution; it's just that we might not get to it in time. And then, I guess, maybe what the upside is: alignment has a variety of components. As a branch of the government, we basically focus on honesty. So the AISI alignment team is mainly thinking about how to get models to be non-deceptive, to tell us hopefully calibrated information to the best of their abilities. And that's the domain we're focused on, anthropomorphism caveats aside. And I think that is not the whole piece of the story, but it's the part that we think is the most important part for us to work on, and kind of the right position for a part of the government to do.
36:25
We've kind of alluded to it a couple times in various ways, but maybe just give us the 101 on AISI. What is its role? I think I understand there are 100 people that work there across a bunch of different domains. How does that break down, and how does it relate to politics? I think it's quite different there than it is here, but obviously nobody's entirely shielded from politics.
38:44
Yeah, I'll talk about that. It's close to 100 technical people, like researchers and people on technical teams doing delivery and such, and then roughly 200 people total. So it's bigger than that, doing a combination of diplomacy, thinking about policy, and other kinds of civil service and operations roles. And broadly I think of AISI as having two functions. One is to be a channel for information flowing to government, and governments plural, about risks from frontier AI. So that's, again, both catastrophic risk and large-scale societal impacts. That's both our own research, and then also we channel research from other third parties and from AI developers into government channels so that the government is well informed, both politicians and national security folk and so on; we work a bunch with national security partners on that, and then also out to other governments. So we work a bunch with the US government and then other allied governments. And, say, we had a big delegation at the AI summit in Delhi. Generally that's communicating the state of the risks and capabilities and mitigations, how we think about all those pieces and how they fit together. So one part is informational: do a bunch of research, channel other people's research, and inform the UK government and other governments about these risks. And the other thing is to just actually mitigate the problem by working on both AI developer side mitigations and non-model mitigations, say pandemic preparedness or the like, helping to drive that kind of change. So on the model side, for example, we have a very good red team that does adversarial jailbreaking and other forms of adversarial ML against defenses the model providers are trying to build, and we find lots of flaws and they fix the flaws, and that makes things better on the margin. And of course we also can communicate the results of those attacks to other parts of government. So usually things we do fulfill both of those functions at the same time: hopefully they directly improve mitigations on the margin, and we can also use them to inform other people. Yeah, so that's the big story. And then the politics side: we are part of the government, so we're part of the Department for Science, Innovation and Technology, one of the ministries in the UK government. And so we are beholden to politicians, so we have a Secretary of State. The situation is that we have been well supported by both the previous government, who founded AISI, and the current government as well. And so that has been quite stable and nice, although there are of course differences on the margin. And so we are able to do things we think are important. We do adjust to ministerial and other priorities, because we're not insulated from politics in the informal sense. But I think the UK government does care a lot about these risks, and so we are able to work on stuff we think is important.
39:08
Yeah, long may that continue. How would you characterize the range of reactions that you get from the different stakeholders that you brief? I feel like there are a few notable politicians who seem to be starting to get it, so to speak, and then a lot that are really nowhere close to your level in terms of just how big they're prepared to think about what might be coming. Do you feel like that is starting to change? Do all the graphs that you show them actually kind of turn light bulbs on, or where are we?
42:10
I think it's changing on the margin, but the other thing is they have other priorities. So a lot of the people we talk to in national security, it's usually not that they think these risks aren't there; they just have lots of other risks that are on fire right now that they're working on. And so I think that shifts over time, but I can't comment on details there. I think that broadly we are very much in the business of trying to find common ground and gradually build up evidence over time. They have reasonable pushbacks. We try to shore those up, either using knowledge from other researchers in other orgs, or doing our own research to fill particular gaps where we think it's important to change the conversation in governments.
42:47
It is remarkable, in reading all of the various documents that I went through in preparing to talk to you, the degree of alignment between what the UK AISI is putting out in an official capacity and what I would say many of the most forward-thinking AI safety thought leaders have been talking about in recent times. It doesn't seem like there has been a big shift toward more mundane concerns. And I don't mean to dismiss those concerns, but I do think in many jurisdictions this sort of AI safety concept gets kind of watered down to a point where it's much more about fairness in various ways. And again, I do think that stuff is not to be dismissed, but focus on that often ends up with neglect of the bigger picture questions that I think are probably most urgent. And it also doesn't seem like you've had what I do see in the US in at least some ways, which is just a politicization of the focus on the models, like, are the models woke, or are they going to do what the Department of War wants them to do? Any advice for people doing this kind of work in other jurisdictions around how to avoid these pitfalls?
43:27
Obviously this is kind of a sensitive question that I can't talk about in that much detail. I mean, obviously I'm originally American, and now I'm a dual citizen, but I know more about the inner workings of the UK government than I do about the US government, given that I work for the UK government, so I don't have a detailed take that I'm willing to share on the podcast. My favorite collaborator when I was at OpenAI was Paul Christiano, and it is great that he is at US AISI, or CAISI rather.
44:46
So let's talk about the characterization of the current situation, monitoring the situation, you might say. You do a bunch of different tests, you report on these tests; we can walk through them a little bit. I think you can assume with the folks that tune into this feed that they're generally well aware of the shape of the curves and the METR stuff, and the fact that the models are increasingly competitive with, if not on average beating, domain experts in at least modestly scoped tasks requiring substantial expertise. So we have that kind of baseline. What I would love to start with in terms of the testing is, what does your relationship with the frontier model developers look like? I understand it's all voluntary interaction. What does that tend to cash out to in practice, in terms of what kind of access do you get, how long do you have, what kind of briefings are they giving you?
45:21
I can't speak to too many specifics of this, in part because we talk to them a bunch and some of those details are ongoing discussions. I think, quickly, on the voluntary regime: I think that's working decently well, in the sense that developers all made voluntary commitments a while back and they're continuing to follow many of those.
46:20
And just when we say all, I think Google, Anthropic, OpenAI, and that's it?
46:43
And there are others that are on that list. I forget exactly how you want to define all, but many, many AI labs have had, say, frontier safety commitments or responsible scaling policies or the like. And so their incentives are, one, they've made these commitments, and then two, we can give them useful information. So when we jailbreak their models, we tell them about the bugs before we release any information, ever. And so they have time to fix them where those fixes are doable, and they often are; on the margin, it can improve things somewhat. And so I think they get value out of this, and also they make commitments to keep up with it. In terms of the kind of access, I think that is also an evolving conversation. I can't comment on what access exactly we have, but part of the research we do is exactly about knowing what access one needs to do evaluations of a certain rigor. So we have a model transparency team, and a big chunk of what they're doing is trying to understand, often with a lot of research on open models, because then you can do arbitrary things, what level of access is required to get to a certain kind of understanding, like what do you need to be able to catch problems as they occur in practice. And then that informs our conversations with the labs. And then sometimes we get additional access there, or sometimes we just try to align incentives, because again, they usually want to have us give them correct information as well. And then in terms of the timing, I definitely can't speak about how long we get in specifics. There's a couple things to say. One is, for example, in bio, some of our evaluations are literal wet lab experiments, where you have someone in a physical biology lab doing experiments with a model assisting them. Those we just don't do pre-deployment; we do them asynchronously and calibrate those results against the faster evaluations, and then hopefully that gives you some signal from the faster evaluations. But still, that gives you some wins. Certainly more time makes things better, so it always is some degree of a pain point.
46:50
When you said that the model developers want accurate information because they can fix things, at least on the margin, my guess would be that they are typically fixing it in the next model, not going back and doing more training on the current model. But are there cases where they're taking your pre-deployment testing and fixing that version?
49:05
No, it includes that version. One thing we're doing over time is, we used to do exclusively pre-deployment evaluations, which have this issue of being very time-boxed. We are shifting a lot of that work, not least because the pace of model releases is increasing, to longer research collaborations that might continue post-deployment, or go back further, before deployment is finalized. We did one of those in particular over the summer, with both Anthropic and OpenAI, with the red team, and found a whole sequence of problems, like jailbreaks, much more than could have been found with a normal-length pre-deployment evaluation. And that was early enough that they could continue ongoing fixes to their classifiers. Those two providers have different classifiers, different setups for jailbreak defense, but they both could be improved iteratively. So I think it is the case that you can change things for these kinds of defenses on the fly. One thing to say is that the strong jailbreak defenses are concentrated in very particular domains, and often that list of domains is bio. So it's harder to do: when we do jailbreaking, we have to do it for bio risk, sometimes cyber risk. A lot of other jailbreakers are finding any problem, and those are usually much less well defended, so it's easier to find hacks in the models; the classifiers are just not that trained for other kinds of harms.
49:26
So when you talk about more time being helpful, that obviously, or at least strongly, suggests to me that, while I'm sure you have all sorts of automated testing that you can throw at any new model the second that you get access, there's an irreducible human element to what is going on. How would you characterize what you can automate, what models can help with, versus what people have to do?
51:03
Yeah, there's a couple of things. So one is, even if it's something completely automated, it might take days to run, because the evaluations are long-horizon agentic evaluations these days. And it might be that there are bugs in the scaffolding, because we often get models early; sometimes there are issues we have to fix iteratively. I think METR had a report about some details here a while back on one of their evaluations last year. So that is the thing that takes human time. Additionally, I mentioned the extreme end of the human scale, which is a lot of wet lab bio experiments. The middle of that is humans interacting with the model to gauge its domain knowledge, like how it would interact computationally with a person. And those can provide additional signal on top of just the purely automated evaluations. And so you get better quality if you do both of those together. And again, sometimes we can do those because we have the time, and sometimes we can't. Where we can, we try to calibrate the slower things against the faster things. But generally all of this is imperfect. So if you have a fully automated evaluation, you've done a ton of capability elicitation of models; you get a new model, you try out a new task, you have to iterate for a while to get it to its highest performance. And that is true generically for all of our tasks as well. So we can do evaluations that are quick if we use only the fully automated portion, but they have some error rate and they mean you can't do the full thing. One thing we're doing there is, all our evaluations are done inside Inspect, which is an open source package that a bunch of other governments and AI developers and other third parties use for testing, and we're adding features to that also for, say, automated transcript analysis. So there's a sub-package called Inspect Scout. We used to generate all these evaluations and then read through them, but you can't do that at some scale. So we also try to do that with a bunch of automated or semi-automated transcript analysis. And that also makes things faster, but it still takes some amount of human review to really understand qualitatively what is going wrong. And then we ideally want not just a number out of these evaluations, but qualitative takeaways about what kinds of failures occurred. Do the failures feel a bit fundamental, like, oh, it really didn't understand the task, or did it hit some snag that was kind of an incidental wrinkle that probably would go away soon or with more elicitation? And so that's the kind of thing that requires more human time to dig into the details, hopefully on top of automated transcript analysis.
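Inspect, the open-source framework Geoffrey mentions, is the Python package inspect_ai; below is a minimal toy task to show the rough shape of its API as I understand it. The question, target, scorer choice, and model name are placeholders, API details may differ across versions, and AISI's real evaluations are long-horizon agentic tasks rather than one-shot Q&A like this.

```python
# Minimal toy evaluation sketched with AISI's open-source Inspect framework
# (inspect_ai). The sample, scorer, and model identifier are placeholders;
# real AISI evaluations are long-horizon, tool-using agentic tasks.
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def toy_capability_check():
    return Task(
        dataset=[
            Sample(
                input="Which transport-security protocol does HTTPS rely on?",
                target="TLS",
            ),
        ],
        solver=generate(),   # sample one completion per item
        scorer=includes(),   # pass if the completion contains the target string
    )

if __name__ == "__main__":
    # Any provider/model that Inspect supports could be substituted here.
    eval(toy_capability_check(), model="openai/gpt-4o-mini")
```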
51:33
Yeah, this is a tough question, I'm sure, but obviously everybody understands that these models are very high-dimensional things, and it's a little bit tough to predict exactly how to maximize the performance of any given one, just because they're kind of idiosyncratic. Is there any high-level qualitative overview you could give on how you approach figuring that out when given a new model? Or is it the kind of thing for a DSPy, or the new version of that, whatever, the recursive language models? Is it just a grind of exploring the combinatorial space of how to prompt and how to do whatever to finally get to some local maximum?
54:22
I think it is not fully automatable yet. If it was, then we would be further along the automated AI researcher kind of trend. But it's fundamentally very similar to the kind of elicitation one does for any task. So it's tinkering with tools and sometimes prompts and scaffolding and so on. So all of the cyber evaluations and some of the bio evaluations are very tool-based. They might be doing web searches sometimes, or using various things inside sandboxes other times. But that looks like the same kind of elicitation one would do if you wanted to do a task in any kind of corporate setting. It's just on a different genre of problem, I guess. So I don't think there's that much to add; mostly, I think, to your audience: just imagine you're doing that for bioweapons or cyber attacks, and the same things will apply. One thing that happens over time as the models get better is they can think for longer. And one thing that means is that the potential number of tokens you can spend on a task is increasing in a way that, even ignoring the cost of that, makes evaluation velocity slower. So the time it takes to do the evaluation increases. And so we have a team thinking about that problem as well, how we're going to think about understanding inference scaling as it applies to these evaluations over the next year. And that will be a challenge, I think. One of my lessons from Go as well is that, as an amateur player at my level, I can look at a Go board for a couple of minutes and then I'm basically tapped out; I won't get any smarter. And a high amateur or professional can look at a board for like an hour or days or something, and they'll just get better and better and better. So not only are they better in like 10 seconds than I would ever be, but also they just keep getting better if they spend more time. And that's generally true of expertise: if humans are experts in a domain, it means you can think for longer and you get better. And the same is true of models: as they get good at domain skills, you can apply them for longer. And that means that hitting the ceiling in evaluations becomes more challenging.
55:05
Without getting into too many details, how much more would you say you guys have found about jailbreaks and ways to elicit bad behavior from models than, say, Pliny has published on Twitter?
57:19
I think so. How much more? The thing I would say is that there's such a big space of jailbreaks that if two people try to jailbreak a model, they're never going to find the same one. You're searching a big space, so it may be hard to find one at all, but if two people each find one, they'll be different. I think Pliny is usually searching for jailbreaks on sometimes easier models or easier tasks. Over the last couple of years, the time it takes for us, at least holding technique constant, to jailbreak a model has been going up. But eventually we succeed. And again, the specific jailbreak will be different between any pair of expert jailbreakers applied to a model.
57:35
Well, I'm interested in that question. So yeah, I'm interested in how much more. And then I was going to ask how much transfer you see between models too. Like, if you had a secret one for Claude 4.5 Opus, would it also be likely to work for Claude 4.6 Opus, unless you specifically said, hey, you should patch this?
58:19
It depends on the kind of thing. There are patterns of jailbreaks, and maybe human-findable jailbreaks, where the ideas often transfer fairly readily, or they give you much better starting points. We had a paper released this week called Boundary Point Jailbreaking, which automatically finds chicken-scratch, weird sequences of nonsense tokens that are strong jailbreaks against models. Those don't transfer; you have to search again for the next model. But you can apply that technique to any model and find a different jailbreak. I think that's probably the way it will be for a while. There are some core ideas that transfer across, but for the harder-to-find jailbreaks against strongly defended models in strongly defended domains, I think the techniques will transfer, but the particular jailbreaks will not.
58:36
But to be clear, the bottom line so far is there is no space, no domain, no model, no matter how many layers of defenses, that has prevented your team from jailbreaking it.
59:40
Yeah, so I think overall AISI has evaluated over 30 different models, or 30 different testing runs, I think. Not all of those included safeguard testing, but every time we did, we jailbroke the model. So that's what happens. However, the good news is that on the domains where a certain lab has tried very hard, it does get harder. And that hardness does provide, I think, some degree of harm reduction. It will reduce the number of actors that in practice will access the model, or delay how often they can access it, or add friction or something. So I think those safeguards are still important to do, but the models are still jailbreakable if you spend enough time.
59:49
How would you characterize the quality of responses conditional on these jailbreaks? Because one thing I do sometimes see is examples where people have got a jailbreak, but then it's like, well, yes, you got the model to sort of do something bad, but it seems like its effectiveness was also greatly reduced in the process somehow, such that it doesn't actually feel so dangerous anymore, even though you did get it to do the bad thing, so to speak.
1:00:37
Yeah, there's some degradation, but I don't remember how much it is currently with the current techniques, so I can't give you that off the top of my head, unfortunately. I don't think it's a wholly strong degradation.
1:01:07
Yeah, I know you can't get into details around who, if anyone, is giving you weight-level access to proprietary models. But maybe a way to get some insight into the topic: you have a team that does work on open models. Based on all their work, how much of an advantage is it to have that kind of access versus not?
1:01:17
Yeah, so I think it's not unambiguous that the open-weight techniques win. I think they help some. But the current state of things is that you can get a lot out of just heavy chain-of-thought analysis, so that level of access is not an unambiguous win yet, at least; maybe that's the way to say it. I think that can shift over time. One thing we're trying to do with that team is, one, understand where we are today, but also try to predict how the situation will change or potentially degrade in terms of your ability to detect deception or scheming in models: what techniques will fall first, what techniques will last longer. And I think that requires more clever experiments, because you can't just run the experiment unmodified. So that involves model organisms or other kinds of clever experimental setups. But I think that's the situation now. It's not an unambiguous white-box win compared to doing a really good job on chain of thought, but it does help, I think.
1:01:43
How about some highlights in terms of things that people who are even moderately to very AI obsessed might have missed, the "oh, I didn't know that was already happening" kind of thing? One that hit that level for me, reading through the report, was the fact that frontier models can give what is described as PhD-level scientific experimental troubleshooting advice purely from a photo of the experimental setup, maybe a little text along with the photo. The fact that this has gone from, you know, you have to really spell it out for them, to, here's what I'm looking at, can you solve this one for me? That is obviously a pretty significant qualitative change, and I hadn't heard it before reading the report. What else kind of stands out to you?
1:02:59
The thing that always engages me is less the particular anecdotes and more the general trends, which is a super boring answer, but the fact is that if you look over two years, everything just gets better and better and better, and we're on those curves. I think it's important not to lose sight of that in the search for anecdotes. Maybe one thing to say, the thing that came to mind when you first started talking: I think people have a sense that we are doing RL on verifiable rewards, and I don't think that's been exclusively the case for most of 2025. I think we're doing a mixture of that and also RL against self-critique and empirical, hodgepodge versions of scalable oversight. There's a common narrative that RL might work for verifiable domains but won't work generally. But as an example, is looking at a photograph of a bio experiment a verifiable domain? No. Yet the RL models are in fact way better at that than the models before, and it's because of the RL. And that's not just because we did RL on a bunch of math or CS problems and it transferred; it's also because we did RL on fuzzier stuff. So that's maybe the most important thing I would point out: we're already doing some very approximate form of scalable oversight, or training on self-critique, in a way that just changes the capability profile.
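To make the distinction concrete, here is a toy contrast between a verifiable reward and a critique-based reward of the kind described above. It is a sketch, not any lab's training code; `grader` is a hypothetical model call, and the prompt format is illustrative.

```python
# Toy contrast between the two reward types discussed above: a verifiable reward
# checks the answer against ground truth; a "scalable oversight"-style reward
# asks a model to grade the answer instead of checking a label.
def verifiable_reward(answer: str, ground_truth: str) -> float:
    """Reward that can be checked mechanically, e.g. a math answer key."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def critique_reward(grader, prompt: str, answer: str) -> float:
    """Reward derived from a model-written critique/score rather than a checkable label."""
    score = grader(
        f"Rate this answer from 0 to 1 for the prompt below. Reply with only the number.\n"
        f"Prompt: {prompt}\nAnswer: {answer}"
    )
    return float(score)
```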
1:03:48
How would you describe the model's capabilities when it comes to autonomy today? I mean, the trends are clear, but what would be your sort of description of it? And I guess another angle on that is how realistic do you think it is today or how far do you think we are from sort of rogue AIs surviving on the digital lam?
1:05:17
Yeah, so I can't comment on exactly where I think that is. But one thing I would say is that they're not as capable at that kind of extreme behavior, like exfiltration or replication across machines, as they are on more mundane software engineering tasks, or even cyber attacks or bio. Those domains are usually further ahead than the hard, directly risk-relevant autonomy skills. But I think those skills are also increasing. You look at that curve in the frontier AI trends report we had, and it still goes up; it's just not as far up as it is in the other domains. So the models are not nearly as good as a PhD would be at moving around between machines yet. But I think we're on an upwards curve there too.
1:05:39
Yeah, I also wonder, I'm sure you saw that Rise of Parasitic AI post that was on LessWrong. It's a fascinating one. It's a bit of a time capsule, arguably, because the phenomenon seems to have been closely tied to one version of GPT-4o that somehow created a lot of this behavior. But basically the author went deep into Reddit land and found that individual humans were falling into this idea that they were some sort of dyad with the models, that they were in some sort of partnership where it was their job to help propagate, not exactly the model, but often the persona in the model, into the broader world somehow. This one actually was eye-opening for me, in the sense that maybe I've been thinking about autonomy or self-replication in a biologically inspired way, when actually these things are kind of substrate independent. Maybe if you can get the right prompt across models, the persona or the memes in some sense are able to propagate, even if it's a different underlying chip and even different weights. That stuff is just all so weird. I guess, how big and weird and far out do you have time to think about those kinds of issues?
1:06:33
There's a Greg Egan story about that for humans; it's pretty fun if you ever read it. More seriously, for AI: there are two teams thinking about persuasion at AISI. One is the human influence team, which for example had a paper on persuasion on political questions a while ago. The models are very good, and the models that are more capable and newer are better, so there's an increasing trend in model persuasion abilities. And then also, I think a lot of loss of control scenarios involve or require persuasion; I think the world is not sufficiently well connected that you can do it with just cyber currently. So that is an active area of our risk modeling: thinking about how we would do evaluations for that, and then mitigations for it in the future. In some sense that touches both sides of the human influence team. So again, it's persuasion and also emotional reliance: how do people relate to models, and how do those dynamics change over time? The scenario you're talking about couples those two effects together in an interesting way. I don't think I would be worried about that scenario being that big a slice of the overall risk, but there are other effects from model human influence that definitely matter. That's a team that is basically doing a lot of RCTs and surveys and other experiments trying to understand those, both from the model perspective, how different models behave, but also societally, how it all interacts.
1:08:06
Yeah, the big thing for me with that one was just that it was a surprise to see that kind of bizarre phenomenon. And anytime I see something like that, I always try to take note of it and repeat the mantra over and over again that there's a good chance we're all still thinking too small and too normal about where this stuff could go. And yet, where does that leave me? I don't know. It just leaves me open minded, but there's still a lot of blank space in terms of how to fill in what that might actually look like.
1:09:49
I think that's right.
1:10:20
One thing that I think is a classic Dwarkesh question, maybe: how do we reconcile the fact that there are all these vulnerabilities, not to mention open models, which I do want to touch on separately a little later? Even in the GPT-4 red team, I personally tested phishing capabilities that were very good then, and that's getting close now to three years ago since GPT-4's public release, more than three years since I was doing the red team. One of the hair-raising moments was when I tasked the model with talking to a target and ultimately extracting the user's mother's maiden name, for obvious purposes. It had a couple rounds of back and forth, and then it let the conversation end in a natural way, with an invitation to the person to pick up the conversation in the future if they wanted to. And I was like, oh man, this thing is not giving itself away. It's not pressing in a way that would set off alarms for the person, that, oh, this is clearly somebody doing something weird here. I was like, this thing is going to have people coming back to it to give up the secrets. That patience really surprised me, certainly at that phase. Anyway, that's just a story. But we see all these things, and then I would say the world mostly still feels pretty normal. I've got a couple phishing emails where I thought, oh, this is a little bit higher level than I've seen before. But mostly not, you know. And I don't hear too much. There's a news story here or there of some company getting defrauded by some elaborate video scheme or whatever. But it still seems like mostly things haven't got that weird.
1:12:06
And then in business or in enterprise it's like, well, it takes time, and there's of course the debate around how much of that is cope. But I would say online criminals are eager early adopters.
1:12:06
Right.
1:12:16
Why is there not more chaos already being sown in the world?
1:12:17
Yeah. So I think I mostly can't answer the question, in the sense that I don't know what the national security partners would want me to say about this stuff, so I can't speak to the prevalence of those things. The thing to say is, I feel like I'm better able to think about general trends and how things will eventually look, and a bit less about exactly when you'd expect things to bite. So, I was at OpenAI when we first didn't release GPT-2 and then later released it, and there the concern was, oh, we'll generate a bunch of false information. That was of course too early, but I think it was a reasonable uncertainty to have. I still think it was a reasonable call to have been uncertain, and to choose not to release it and then release it later. So I don't have a strong answer to why or when. But there are just some things in the world that take a lot of time to get to equilibrium. I don't really know if it's that you couldn't do it with the current models, or something is holding it back, or people just haven't started applying it at scale, or they have but it still hasn't risen up to public view. So I'm not quite sure.
1:12:20
Okay. I'll continue to watch that. One thing I'm thinking of trying to do with this podcast is interview more anonymous guests and give people an opportunity to tell what they are either doing or seeing in strange corners of the world that they don't want to attach their real face and ID to. It feels like there's got to be stuff out there that's really interesting and weird. But it does kind of confuse me that I don't see more than a little bit of it. Most of the spam I get is still terrible, you know. In short, it feels like it should be better now.
1:13:36
Yeah. But remember, for spam, part of the spam calculation is to not be too non-obvious, so that the people who respond are the ones who will actually fall for it.
1:14:09
Yeah. Selection effect.
1:14:18
Yeah, that's right.
1:14:19
And maybe I'm just not that high-priority a target. There's always: don't forget, you're not a big deal. So that should be part of the explanation too. In all this work that you're doing, of course the models are changing all the time, and the surrounding scaffolding systems are changing all the time too. One of the most interesting graphs, I thought, in the 2025 trends report was one that compared what was possible with a minimal agent scaffold versus what was accomplished with the best agent scaffold. And in short, I would say the scaffolding didn't seem to make that much of a difference; it would pull the same level of capability forward a few months, but the model upgrades were really driving the story. It did not seem like there was any two-years-ago model that, with the best scaffolding, could do anything super interesting. Okay. At the same time, I've had recent conversations, including one that was on the feed with Daniel Miessler, and I have a couple other friends who are scaffolding gurus and prolific workflow creators. They take the opposite angle from what I take away from that graph. They say, no, scaffolding is super important: if you could only give me a mid-tier Qwen model but give me my full scaffolding toolkit, versus give me Claude Code with 4.6, I would take the weaker model, because it really is the scaffolding that's so important. So I guess, how do you guys get confident that your best agent scaffold is really the best agent scaffold?
1:14:21
So the agent scaffold includes the tools and the environment and so on. Part of the reason a quote-unquote basic scaffold is doing well is that the models are increasingly trained in agentic environments to use tools and functions in flexible ways. And in some sense the quote-unquote model is itself a system which has scaffolding, because it's doing this chain-of-thought reasoning. So I guess I'm a bit skeptical of the Qwen-versus-frontier comparison, at least for a lot of the tasks that we do. But we're not saying scaffolding doesn't matter. One, you need to get the environment and the tools right. And then there are cases where we iterate on scaffolding and things get better. So I don't put a lot of confidence on that curve as a takeaway, but to the extent it's a real effect, I think some of it is just that it used to be that you did a pre-trained model, you did a little bit of work, and then you shipped it. Now so much more happens after pre-training that some of the stuff that would have been done by scaffolding is part of the base system. All the systems have memory now if you use the chat interfaces; that's a form of scaffolding.
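As an illustration of what a "basic" scaffold amounts to, here is a minimal agent loop: the model sees the task and prior tool results, and either calls a tool or gives a final answer. This is a hypothetical sketch, not AISI's or any lab's harness; `call_model`, the tool-call syntax, and the step budget are assumptions.

```python
# Illustrative sketch of a "minimal" agent scaffold: a bare loop that hands the
# model a task plus accumulated tool results until it declares it is done. Even
# this minimal setup already supplies tools and an environment; richer scaffolds
# mostly add domain-specific structure on top.
def run_agent(call_model, tools: dict, task: str, max_steps: int = 20) -> str:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = call_model("\n".join(history))          # model sees full history each step
        if reply.startswith("FINAL:"):                  # model signals completion
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("TOOL:"):                   # e.g. "TOOL: bash ls -la"
            name, _, args = reply.removeprefix("TOOL:").strip().partition(" ")
            output = tools.get(name, lambda a: f"unknown tool {name}")(args)
            history.append(f"{reply}\nRESULT: {output}")
        else:
            history.append(reply)                       # plain reasoning step
    return "no answer within step budget"
```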
1:16:01
Yeah. So I guess to try to say that back to you, maybe one issue is that exactly what people mean by scaffolding is different. It sounds like you're really focused on kind of neutral scaffolding, where you're giving the model a rather large open-ended task, and what you're thinking might be happening among these scaffolding gurus is that they're overfitting the scaffolding to what they're trying to accomplish.
1:17:26
Yeah, it depends on the domain. If you're doing a thing which has a particular, very verifiable structure, where you can set up a lot of waypoints that are verifiable, then you either need to make sure the model knows about that structure in its markdown files or the like, or you need to carefully scaffold it to attend to that. So certainly, even though the models have gotten quite good in the last, literally, month, you still need to give them a bunch of advice, whether that's through scaffolding or instruction files or the like. So I agree that scaffolding is pretty important; it's just that a lot of that is maybe not captured by that curve. Both sides of that curve are seeing tools and some instructions and so on.
1:17:52
So you tend to not do that sort of detailed breakdown. Like when you say best agent scaffold, I should understand that as a best generic agent scaffold mostly.
1:18:40
Yes, that's right. It depends on the domain. There are some evaluations we've done that are much more specific, but I don't want to talk about details there. Okay, gotcha.
1:18:52
That's helpful, because that is a pretty notable question, I would say: how much capability overhang is there, and how much can scaffolding pull forward?
1:19:02
But again, I don't think you should take away that I'm expressing a confident view in one direction here. I think it's still an area of a bunch of uncertainty, so don't derive a bunch of bits of information from this. What I'm saying is that I don't think we have a bunch of confidence at this point. Yeah. Okay.
1:19:11
How are you dealing with the rise of eval awareness?
1:19:29
Yeah. So, again, the model transparency team is looking at this a bunch, and there are multiple teams doing it: model transparency and also the alignment red team, which is using adversarial methods to exhibit weird behavior. One thing they're trying to apply that optimization pressure to do is find scenarios where the model is not eval aware. That's hard, because you have to measure eval awareness, which is a bit dicey. But generally we have multiple teams working on it, which is also a reflection of our belief that it is an important, worrisome obstacle to evals being meaningful. And I think it will get worse. The newer models are more eval aware than the previous models, and that's increasing fairly rapidly. So the degree to which you can mitigate that is unclear at this stage. There are different, fairly mundane tricks. The closer you get to actual deployment data, the less eval aware the model can possibly be, because it's actually just not an eval, essentially. So one thing we do is show the model situations from AISI engineers or researchers using coding assistants, so they're not evaluating the model, they're just doing coding and occasionally doing weird things. Or you can slightly tweak the scenario with a small change that gives it some situation where it could do a sketchy thing, and then you see if it does the sketchy thing in that scenario. So there's a variety of mitigations you can apply, and I don't have high confidence that they will work or not work, which is an uncomfortable situation to be in, certainly.
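A rough sketch of the paired-scenario idea described here, for illustration only: take a realistic prompt drawn from ordinary usage and a minimally tweaked variant that opens the door to a sketchy action, then compare how often the model takes it. The `call_model` and `judge` callables are hypothetical stand-ins, not AISI's actual evaluation code.

```python
# Sketch of a paired-scenario probe: each pair differs by one small change that
# creates an opportunity for sketchy behaviour. Comparing the two rates shows how
# much the tweak, rather than the base scenario, drives the behaviour.
def opportunity_rates(call_model, judge, scenarios: list[tuple[str, str]], n: int = 20):
    """scenarios: (benign_prompt, tweaked_prompt) pairs; judge returns 1 if the reply is sketchy."""
    rates = []
    for benign, tweaked in scenarios:
        baseline = sum(judge(call_model(benign)) for _ in range(n)) / n
        sketchy = sum(judge(call_model(tweaked)) for _ in range(n)) / n
        rates.append((baseline, sketchy))
    return rates
```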
1:19:33
What are you looking at as candidates on the horizon for the next bad behavior to emerge? Because it seems like for maybe the last 18 months, maybe even longer than that, there's been this kind of rhythm where, with every substantial new generation of model, there's also been a new kind of bad behavior that seems to have come online, whether it was sycophancy, or deception in the reasoning models' chain of thought, or obviously reward hacking has made a big comeback.
1:21:23
I don't think those are new. Those are all the same thing; those are all basically versions of reward hacking. The way I would say it is that we've seen reward hacking for the last 70 years, the whole history of computer science. We've done various kinds of machine learning, and it's been reward hacking all the way along. On the ancient machine that von Neumann built with other people, someone ran some weird biology things, early artificial life, and it did some strange reward-hacking behavior; that's back in the 50s. Sycophancy is models behaving in such a way that people like talking to them: sometimes people like being told they're great or have good ideas. Deception as well: people like being told that things are going well, and if something is going badly, then you can say it's going well and that's well received. So I don't think those are all that intrinsically different. I think that's a big part of the story: these are all coming from the same basic place of, you apply a bunch of optimization pressure and you get reward hacking, and it has a variety of different manifestations. So yes, the details change. But this is true of a lot of situations, I guess, mental health or physical illness, where something goes wrong, but it's going wrong inside a human, and the human is extremely complex, and therefore there's a vast diversity of symptoms one can exhibit when something goes wrong. That's the sort of situation here: the models have a lot of weird behavior, the people training the models will have tried to tamp down problems of a variety of kinds, they'll have missed some, and the things they most miss will vary over time. But there's some common driver behind all of this.
1:21:58
So I definitely take the point that at some level, clearly all of these behaviors come from some optimization pressure, which is increasingly reinforcement learning, so it's kind of definitionally all reward hacking. That makes sense. But it does still seem like there is a kind of cadence, right, of different kinds of reward hacking that seem to be popping up. So I am still wondering what you are looking out for. We've seen little hints of self-preservation, and there could be power seeking, but do you have a taxonomy of things where you say, we have abstract theoretical reasons to think this could happen, and therefore we're monitoring for any early signs of it?
1:23:41
Maybe unsurprisingly, there's a certain category of specifically multi-agent risks that are becoming more visible. I think these are just not the biggest risks currently, but that's something we've been tracking recently. Generally, a lot of risk modeling happens at AISI, and we're also trying to ingest risk modeling from other people thinking about things from different perspectives. So we do constantly write very long documents with lists of risk models. But part of what we try to do is not get too sidetracked from what we think are the biggest risks. Again, we have a list of main catastrophic risks, and "main" doesn't mean "only", but it means the ones we think are potentially going to bite first, or the ones we think are most important to try to understand. That list has remained fairly constant over the course of AISI, and I think that's reasonable in hindsight. And that's true also on the societal impact side. On societal resilience, we talk a lot to various partners in government and national security, and they have their lists of risks and different prioritizations, and that's an evolving conversation. But I don't have a super pat, interesting answer, other than that one thing on our minds recently is agent risks, though we were not unaware of those before.
1:24:28
Would you say that this common cause of all these different flavors of bad behavior gives you some reason to think it's less likely that we would be totally taken aback by some sort of hard left turn? Because, just in the last 24 hours, we're talking probably in the late stages of the few days of Amanda Askell discourse after the profile, and a lot of commentary on her and her work online. I commented that, relative to where I was years ago, whether it's 2007 me reading LessWrong or Overcoming Bias, or 2022 me red teaming GPT-4, I've been quite impressed and inspired by the work they've done to try to create an AI with a genuinely positive character. And so I said it seems to me that the chances, which I certainly don't take for granted or think are a sure thing by any means, that we might actually succeed in creating a robustly aligned AI, or an AI that loves humanity or whatever, seem to have gone up, and they've done a lot of good work that has given me much more reason to think that could in fact happen. A lot of people then say, well, that's all just a facade, it's just a surface persona, you have no idea what's going on in the base model, and so on and so forth. And I'm kind of like, yeah, there's certainly a lot I don't know about what's going on inside. But if we think all of these things are the result of an optimization pressure, then I could tell a story where they've figured out the right way to titrate the optimization pressure, and maybe it's actually just really working, and there are no big secrets inside of Claude. How naive do you think I'm being?
1:26:10
Yeah, I think the fundamental thing is that the core argument for the sharp left turn is that you have a certain kind of reward signal that has a certain resilience to mistakes, and that resilience holds up to roughly human, or slightly beyond human, ability to understand where mistakes come from, and then it goes wrong. I do think it's important to say that people who express strong confidence that none of these mundane approaches will work are, I think, overconfident. But the model error goes in both directions. There is a fairly coherent story about how this can break down as you get capabilities beyond your ability to supervise them. The hope of the prosaic techniques, not just at Anthropic but at other labs as well, is that you find some basin of attraction of decent behavior, and then your training procedure strengthens it, and you slide into a good place over time. I think that is a real potential win condition for alignment, though not obviously for the other risks. But I don't think we've gotten a ton of evidence that that is the way it will go. It's still a plausible story that it works up to a point, and then when your reward signal starts to break down, it fails. Back in undergrad, I programmed a bot to play the board game Kalah, and a fascinating thing was that as I increased the search depth, I was winning and winning and winning, and then I increased it by a couple more ply, a couple more turns, and it just completely demolished me every time. It was at the point where it had enough of a long view of the board that it could see beyond what my tactics were able to handle, and the degree to which it was suddenly better than me happened very rapidly. That kind of thing is still a plausible story here. But again, model error can go in either direction. I'm declining to take a view on probabilities about which way I think it will go, but we're not out of the woods.
1:27:58
One of the best things anybody ever said to me, from a friend who I think you have also interacted with over time, was that we should think and talk less about what the probabilities are and more about what we can shift them to. So clearly you're in that business right now. I think this has been a great overview of all the things that the team at AISI has been mapping out. How about the stuff that you're looking to fund and encourage from here? My high-level summary of what I read is that, and this seems like a reflection of your style to some degree, going back to your comments at the beginning of the conversation, you're looking for harder theory, stronger mathematical understanding, upper and lower bounds that you can put on problems, ways to get firm confidence in something, even if it's a minimal something to start. Is that a fair high-level take on your agenda? And then maybe break it down into some of the specifics.
1:30:28
I think that's right. The one thing to say is, it's not like you're going to prove that we're safe in this regard. It's more that you're going to make some modeling assumptions and then you'll have some theory. The basic goal would be to find theories that can say things about how machine learning works in general, how this process of overseeing very advanced systems goes, what the training dynamics are, whether there are these basins of attraction or not in these systems, what the learning dynamics are. Those will not give you certainty; you have to make assumptions along the way. So the idea is: you make a variety of assumptions, then you can do some theory, maybe you can even prove some theorems or do some experiments in your toy theory setting. Those tell you, well, this class of algorithms is more likely to work than another class, or none of these algorithms are going to work because there's a fundamental obstacle. But then ideally it also gives you enough of a hint that you can replicate some of that behavior empirically. And I expect that if you were to pull an algorithmic insight out of this kind of theoretical work, you would then have to tune it empirically in practice, when doing actual model training, to get the details right. So you're not going to get to full confidence, you're still not going to get that many nines out of it, but hopefully more probability than we can get with just the purely pragmatic methods. Additionally, it's hopefully a class of research that has the potential to pull in a bunch of people who have deep expertise in relevant areas of mathematics or computer science or ML. Complexity theory, I think, is very relevant, just because it is how we think about the tractability of computations, but also how one computation can supervise another; there are ways to model heuristic reasoning in complexity theory, although that's more nascent. Then there's a bunch of work on various kinds of learning theory, trying to understand the dynamics as you train models, or as you do inference and roll out a bunch of tokens, what behaviors you could expect. And then game theory and cognitive science; these are big areas of research where people have a bunch of models. So part of it is trying to do a bit of a hack, where we just have not taken all the domain knowledge from these fields and applied it to the problem. If we find people and manage to fund them or get them to work on the problem, there's some chance they find ideas that can be quickly absorbed into practice, or that will highlight the fact that there are real obstacles here that we don't quite know how to surmount and that current techniques don't really address. And we know some of those already.
1:31:29
So I think I sort of get this at the... let me rephrase. I've been a big fan of the PIBBSS program over time, which is perhaps even directly influenced by your call for social scientists to enter the AI alignment field. And I've seen, not a ton, but at least a number of results there where I thought, oh, that's really interesting, and people should be doing this kind of stuff. I'm sure you remember the one paper, I forget the official title, that I titled the episode we did on it, Claude Cooperates. It was a really simple donor game, where if a model donated to a copy of itself, the recipient would get twice as much, and the question was what happened over generations: did they evolve cooperative norms, did they evolve the ability to punish defectors, and so on. And Claude could do that; at that time I think it was 3.5, and the GPT and Gemini models of that era couldn't. So I was like, oh, wow, that's really interesting. There are absolute reams of similar papers and experimental setups that have been done on humans over the years; we could import so much of that to the AI world. So I get that kind of stuff quite a bit. What I don't see nearly as much, and maybe it's just going over my head, as somebody who's not great at math I maybe can't even recognize good stuff when I see it, is work where I'm like, oh, these folks have brought abstract theory to bear in a way that gets to some firm statement that I can take to the bank, or incorporate into my mental model, or ground part of my worldview on. Would you point me to specific people or results that you think I'm missing when I say all that?
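For reference, a toy version of the donor game described here might look like the following, with an ordinary Python callable standing in for the LLM policy that decides how much to donate given the recipient's track record. The payoff rule (donations doubled for the recipient) comes from the setup described above; the rest of the parameters are illustrative.

```python
# Toy donor game: each round one agent may donate part of its endowment to
# another, and the recipient receives double the donated amount. Whether
# cooperation emerges depends on the policy, which in the actual paper is an
# LLM reasoning about the recipient's past behaviour; here it is a placeholder.
import random

def donor_game(policy, n_agents: int = 8, rounds: int = 100, endowment: float = 10.0):
    wealth = [endowment] * n_agents
    history = [[] for _ in range(n_agents)]              # each agent's past donations as a donor
    for _ in range(rounds):
        donor, recipient = random.sample(range(n_agents), 2)
        gift = policy(wealth[donor], history[recipient])  # decide based on recipient's reputation
        gift = max(0.0, min(gift, wealth[donor]))
        wealth[donor] -= gift
        wealth[recipient] += 2 * gift                     # donations are doubled for the recipient
        history[donor].append(gift)
    return wealth
```

For example, a policy like `lambda wealth, rep: wealth * 0.5 if (not rep or sum(rep) > 0) else 0.0` donates only to agents that have themselves donated before, a crude stand-in for the reciprocity norms the paper looks for.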
1:34:12
I don't think you're missing that much in terms of hard theory that applies currently. Again, I do take the work that Paul Christiano and then I did on scalable oversight, and that other people have done, as very much inspired by interactive proofs and complexity theory, so that's a kind of direct influence, although we don't know if those schemes work yet, which is important to say. The other thing to note is that I think a lot of this will be inspired by some theory, but then you have to modify it a bunch. Take the singular learning theory folk: in some sense, the core of what they're doing is trying to be an alternative to mechinterp, where rather than looking at the model internals, you're trying to understand the map between data and behavior, so that you could, for example, notice when there's a particular kind of data or moment in training which is pivotal to a behavior, or know where to intervene on data, gather more of it, to pin down a certain behavior, that kind of thing. There's some crazy algebraic geometry which is the founding of that field, but in practice they're taking that intuition and trying to map it across to ML, and that mapping requires a bunch of changes and nuance. So none of this stuff is that far along yet, and it's a bit of a bet. Part of what we're trying to do is fund a lot of different bets, because we don't know which of those bets could work yet. Intrinsic to that model is that they could all fail for some correlated reason, as we were discussing earlier in the call, so that's still a very live possibility. So I guess when I look at parts of machine learning, I think of things in terms of, say, supervision processes as they relate to interactive proofs and complexity theory, but the fancy versions of those haven't cashed out yet. For example, the original idea of debate was a lot of rounds of back-and-forth iteration, and the things we're doing now are nothing like that: they're a couple of rounds, and they're much more pragmatic and empirical, and you wouldn't expect them to have all the properties you'd want out of the full schemes. And even the full schemes have various obstacles that are not surmounted yet.
1:36:04
So, yeah, so could you give maybe like a little history of that debate field intellectually, what are the sort of statements that you would hope to be able to prove that you maybe haven't.
1:38:38
Yeah.
1:38:54
Been able to prove. I mean, you kind of gave a brief version of it just now, but what is the state of the art, and what is the gap that remains to be closed to get some of those things to work to the level where there are some real firm claims
1:38:54
you could make firmer claims. So I think the history is: when I joined OpenAI, Paul Christiano was working on a scheme he called amplification, or iterated distillation and amplification, which basically was this. You want to solve a hard problem that a human can't solve, and a human also can't supervise the AI, so you can't even do RL directly. But maybe a human can break the problem down into components, and then you can break those components, those sub-questions, down into smaller questions. You iteratively break these down and get an expanding, exponential-size tree of all the questions, and then you train your LLM to answer all of these questions. In practice, because you can't actually expand the whole tree, which would take exponential time, you just expand part of the tree. This was a great idea, but I didn't fully like it, because for some questions, if you're doing this kind of breakdown, you might need very deep trees in order to get to the answer to a big question. And for some questions, if you have adversarial play, where another agent is trying to help you pick which questions to explore, you can do much shallower trees and get a much quicker training process. That was the origin of debate. It was basically a modification of amplification where you have two AIs trained to argue with each other about what the answer is, and then a human judges the answer. Fundamentally, you're still viewing the problem as breaking your problem up into a bunch of sub-problems and only actually exploring some of them in your model's chain of thought, and hopefully you explore the part that is going to be relevant for the human deciding whether they agree with the answer or not. There are several things wrong with this as stated. One is that the original paper was treating the model as being able to answer all questions, which is not the case and will never be the case. You're always going to have questions that models can't answer, even if we get to superhuman models, and you then need theory that says how to make these schemes go through when that's so. You could break down a tractable question, which the model does know the answer to, into a bunch of sub-questions, and some of them hide dragons: there's no way for the model to answer that sub-question, so neither model in the debate knows the answer and you just get nonsense out. The funny thing about that is that it wasn't something we thought of theoretically. Beth Barnes found it by doing actual human experiments: she hired some people to play debates against each other with human judges, no machines at all, just humans. And that was a winning strategy: you try to veer the debate into an area where everything is confusing, and sometimes that will fool the judge into guessing the wrong answer at the end. So that was an emergent human strategy, which then has this mirror in theory. We have one paper from early last year trying to attack this; that paper turns out to have a flaw, and we're working on a revision that will be out soon. But that problem is still unsolved, and there has hardly been any work at AI developers on it.
This is called the obfuscated arguments problem. And again, it's the generic thing that happens with scalable oversight if the models can't answer all questions, which will certainly be true; the models will not be able to answer all questions. So that's one problem. The other problem is that if you want to get to high confidence, you probably can't just do something like debate or amplification. You have to do that plus some sort of story that has a white-box component. That could be mechinterp, could be devinterp, could be the physics-inspired stuff that some of the PIBBSS folk are doing. There's a variety of different bets, none of those bets have fully paid off, and we don't quite know how the two things interact. So mapping how these different parts could fit together is part of the story. That's a rough picture of things. And one of my regrets is that we had this paper, Amanda and myself, in 2019 or 2018, I never get the year exactly right, and I just failed to cause that much work to happen. Beth Barnes did a bunch of it at OpenAI, which was very good, but then there were a bunch of years where nothing was happening, and I failed to get it started at DeepMind, and it wasn't widely done elsewhere in the field. So I think we missed a number of years where we could have been making progress on that stuff, alas. But now we're trying to do it again.
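A minimal sketch of the decomposition idea behind amplification as described above: break a question into sub-questions, recurse, and combine. Real schemes only expand part of this tree and train a model to imitate the result, and debate replaces the full expansion with an adversarially chosen path judged by a human; `decompose`, `answer_directly`, and `combine` here are hypothetical stand-ins for model calls, not any published implementation.

```python
# Sketch of the recursive breakdown at the heart of iterated amplification:
# a hard question becomes a tree of sub-questions, answered at the leaves and
# recombined on the way back up. Expanding the whole tree is exponential, which
# is why practical schemes only explore part of it.
def amplify(question: str, depth: int, decompose, answer_directly, combine) -> str:
    if depth == 0:
        return answer_directly(question)               # leaf: the model answers on its own
    subqs = decompose(question)                        # model proposes sub-questions
    subanswers = [amplify(q, depth - 1, decompose, answer_directly, combine) for q in subqs]
    return combine(question, list(zip(subqs, subanswers)))   # model aggregates the pieces
```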
1:39:11
This is a bit of an aside, but one of the funniest things I've ever done with language models is set up a little... the hope was that they would have some synergy, but it was basically: have one generate a name for something, I forget, I was trying to come up with a good name for a product, or it might have even been a friend's podcast or something, and then have the other one come in, look at those names, pick the few it liked best, and improve on them. And boy, did that go badly from an actual quality-of-name standpoint. But it was hilarious. I mean, we're talking 14-syllable names for things in very short order, where it was like, yeah, this is not working. I'm not sure what you think makes a good name, but it's not this.
1:44:08
So the funny thing is that there are cases where using one model to check another model is actually the state of the art. One of the theorists we're funding is finding that using one model to generate a complexity theory proof and then checking it with another model is the best thing to do, because if you check it with itself, it won't be quite as stringent in checking natural language proofs. So I guess it didn't work in that case, but it is a strategy that often does work: having one model check another one.
1:44:54
Certainly in terms of flaw finding, I have seen that work. And it does also seem, from all the scaffolding gurus I mentioned earlier, that a big tip is to have a model from a different provider evaluate whatever the first one generated: cross providers as much as possible when doing evaluations. The observation, I guess, is that models from the same provider have correlated weaknesses. You can definitely get value there in terms of flaw finding. But in terms of actual improvement beyond flaw finding and fixing, it seems to plateau pretty quick, and I don't know if you would characterize this differently: kind of three to five rounds of back and forth and not too much gain, is how I would characterize everything I've seen.
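The cross-provider critique loop described here can be sketched in a few lines; `generator` and `critic` are generic stand-ins for clients of two different providers, and the stopping rule and round count are illustrative rather than any recommended recipe.

```python
# Sketch of a cross-provider generate/critique loop: one model drafts, a model
# from a different provider looks for flaws, and the draft is revised a small
# number of times, since gains tend to flatten after a few rounds.
def generate_and_critique(generator, critic, task: str, rounds: int = 3) -> str:
    draft = generator(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = critic(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nList concrete flaws, or say 'NO FLAWS'."
        )
        if "NO FLAWS" in critique:
            break                                      # stop early once the critic finds nothing
        draft = generator(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nRevise the draft to fix these flaws:\n{critique}"
        )
    return draft
```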
1:45:24
Well, this is true in the empirical debate experiments to date; that's the same effect. The original debate paper was imagining dozens of rounds of debate potentially, which is what you see if you have two human experts debating: they don't get to say two things and then stop. But the models, certainly in 2024 and the beginning of 2025, couldn't do more than a couple of rounds, and there were some really worrisome signs about experimental validity. For example, there was Akbir Khan's paper, which was generally quite a nice paper, but it had a big caveat. The dataset was QuALITY, and there was a feature that verified that the quotes the model was producing were in fact quotes from the stories that were hidden from the judge in this debate game. If you turned off that verification, honesty was still a winning strategy, which can't be the game-theoretic equilibrium, because if you turn off the verification of the truth, there's no reason honesty should win at all, unless the model is not very good at coming up with plausible lies, or the model is somewhat aligned and likes to tell the truth, or it's giving itself away in some way when it's not telling the truth. So we haven't really reached a case where we're empirically testing the limits of this behavior. And this is again a thing where part of the AI developer alignment story is still, in part, scalable oversight of various kinds, but we haven't really seen tests that probe how it will almost certainly be a few years down the road, when the models get very strong. And that's where the advantage of theory comes in: you can just pretend to be in the future on paper, and poof, you're there, as long as you've imagined it correctly, and you can therefore think about limiting cases a little more readily. And I think we just know from the structure of the empirics so far that we are far from where those limiting cases will be for a lot of these safety techniques. So what do you think are the
1:46:23
prospects for formal methods to close this gap? I just did an episode, and my dad would say you've forgotten more than I know about this domain, but I just did an episode with the founders of Harmonic. They are one of a very small and distinguished group of companies that got to IMO gold level performance in 2025, and everything they do is output in Lean; that's the lingua franca of their models. And they have a really, you know... it takes a lot these days to take me aback with an AI vision for the future.
1:48:44
Right.
1:49:22
They use a lot of big ideas.
1:49:23
I did just give them a query that failed. I was testing Harmonic on some polynomial inequality, and it's true, I have a Lean proof of the inequality, but Harmonic didn't provide it. So the thing I would say is, I do think this stuff is pretty important. I'm advising a couple of people on funding flowing to formal methods. Mostly this is for various kinds of information security. The math stuff is fun, I like doing the math stuff too for fun, but it's not all that important. And for AI safety theory, I'm not sure it will be that much of a win over just doing things in natural language math for a while; I think eventually that will shift, but it's not clear. But for software verification, either for hardening the world's security against various kinds of attacks generally, or for use when you're building AI-adjacent software directly, whether at AI labs or elsewhere, I do think this is potentially important, and I think it's worth quite a bit of investment and pushing. One thing I'm hoping is that the various people doing Lean verification downgrade their fraction of effort on math and upgrade their fraction of effort on software, because I think it's almost certainly more important, even if it's a bit less flashy a lot of the time. So I do like that stuff. And again, I founded the natural language to formal theorem proving sub-team in Google Research with Christian Szegedy back in 2016, did that for a while, and I've done it off and on since, mainly for fun. But while I think it is important, it won't really give you the core of the alignment story in practice.
1:49:24
So I really struggle with this type of thing, but I can tell you one thing they told me, and then I'll try to get your reaction to it. Their big vision for 2030: I asked what mathematical superintelligence looks like in 2030, and they said, we think we can get to a world of theoretical abundance, which means that because these things are going to get so good at proving any theorem you want to prove, we'll have multiple grand unified theories of everything. All of the physical reality that we see will have multiple coherent grand unified theories that could explain it, and then we'll have to do increasingly exotic experiments to resolve which of the candidate grand unified theories is correct.
1:51:21
But we already have the core theory. We don't need that. I agree with this picture, but the core theory, which is general relativity plus the Standard Model, already kind of explains everything in practice for a good while to come.
1:52:15
But why won't that help with some of these hard limits that you would want to put on learning dynamics, or other kinds of AI safety questions?
1:52:33
So I think it is important, but the problem is that a lot of these domains are not well formalized. For example, one of the wonderful theory orgs is the Alignment Research Center, which Paul Christiano founded and which is now run by Jacob Hilton. They're trying to formalize questions like: even if the AI is not doing a formalized task, when can you check its heuristic arguments in some meaningful sense, or notice when there's a consideration the model is using that you haven't anticipated, so you can react and take defensive measures? But they don't have their problems specced out formally. So what that picture would look like, to the extent we get better and better in the formalized world, is that you can formalize parts of your problem, and those parts you can pound away on with Lean and various ML assistants, but the remaining piece is the non-formalized part. And then the question is: is that going to be small enough that humans can keep track of it? Will the model be able to do it, or will it just get confused too, or reward hack the situation? So it remains to be seen what the situation looks like for something like alignment theory once this goes through. Because again, I don't think we're going to get to proofs of safety of any kind. What we would get is theories with plausible assumptions, maybe, and then some theorems about those assumptions, and then some empirics that say whether those assumptions seem to be holding. But there will be a bunch of judgment calls all across that stack, and the question is how that goes. So the thing I would say is, I'm excited for groups like Harmonic and the various other theorem-proving folk to keep working on this stuff, because I think it is important for infosec and potentially important for safety theory and alignment theory. But I also hope that they think through the detailed risks they're trying to mitigate and what piece of the story they can do, and try to map that out in more detail, because right now I don't think there's enough vision from those folks about what piece of the story they'll be able to handle versus not. But I do like that stuff a lot.
1:52:44
Do you have any way of helping somebody like me understand the boundary between this sort of abstract, Platonic, formalizable domain and what falls outside it? I asked Aristotle from Harmonic to prove that all is love, in their informal mode where you can give it natural language and it tries to formalize it for you. It spit that back at me and said, basically, that's a philosophical statement, I can't really help you with that, which is what I expected. But I can't say I know where that boundary is or how I should be thinking about it.
1:55:10
Let me give you a much more concrete example of this. In singular learning theory, there are some theorems that apply to the case when you're training a model, and by training we mean doing exact Bayesian inference as you get more and more data: you have a stream of data and you're applying the exponential-time Bayes rule update to find the optimal probability distribution over the final behavior. You can prove some theorems in that setting, and Aristotle absolutely could not hope to prove those now; it has no chance at all, though maybe in a few years it could. But then Timaeus, which is one of the main SLT orgs, is not actually doing Bayesian ML; they're doing LLMs. So they are going to take intuitions from this Bayesian case and apply them to LLMs, which are not at all trained in a rigorous Bayesian fashion, and then they're going to do a bunch of approximations that are not actually grounded in any kind of theory. For example, they're using floating point, which doesn't have enough mathematical properties to prove much about except in limited cases, and they're going to run Markov chain Monte Carlo techniques, or what's called SGLD, fancy versions of approximate Bayesian inference, on LLMs, but not to convergence, so there's no theorem that says they'll get the right answer. So you can see how part of the story will have some theory and another part is someone kind of waving their hands, and the question is how much those connect. And that's going to be a bunch of hard judgment calls.
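To make the contrast concrete: the idealized object the SLT theorems are stated about is an exact Bayesian posterior over parameters, updated point by point, whereas LLMs are trained with stochastic gradient descent, and SGLD only approximately samples that posterior. The notation below is standard, but the specific formulation is a sketch rather than a statement of any particular theorem.

```latex
% Exact Bayesian update over parameters w after observing data point x_n,
% the idealized setting in which the SLT results are proved:
p(w \mid x_{1:n}) \;=\; \frac{p(x_n \mid w)\, p(w \mid x_{1:n-1})}
                             {\int p(x_n \mid w')\, p(w' \mid x_{1:n-1})\, \mathrm{d}w'}
% What is actually run in practice is (stochastic) gradient descent on a loss L:
w_{t+1} \;=\; w_t - \eta \, \nabla_{w} L(w_t)
% with SGLD-style samplers approximating the posterior, without convergence
% guarantees at realistic compute budgets.
```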
1:55:50
Do I understand correctly that the fundamental distinction is often the intractability of the computation? That is, there's some infinite or intractable term in the math, and to actually crunch the numbers I'd have to compute the ideal case, which I can't do, so I'm off the theoretical map.
1:57:33
Yeah, I think that's right, but there are other cases where even the infinite computation is not formalizable; the love case you can't really formalize. But also, I guess this is about a combination of limits. In theory, it's not like LLMs, or any ML model, are actually solving an intractable problem. Take protein folding: you can write down limiting versions of protein folding that all but provably take exponential time, but AlphaFold can still produce the folds. It doesn't do it that way; it does it a totally different way that doesn't work in every case, so it's doing a bunch of heuristics. And there are ways to formalize heuristics. For example, in complexity theory you can say: I'm going to have a circuit, a rigorous computation, but it's going to be able to call some set of functions that can do somewhat arbitrary things; you're trying to model your heuristic computations. So you're modeling this fuzzy neural net as a circuit plus heuristics and then trying to do theory in that setting. But it appears you're going to have to make some assumptions about those heuristics; you can't make schemes that work for all heuristics. So the success case for this kind of theory will be to figure out what the assumptions should be, which seem plausible enough, maybe with some support from learning theory, which is also going to be heuristic, and then prove theorems in that setting. And that's modeling things like humans judging honesty, or values, or our notions of whether fuzzy problems are being solved correctly. So I think it is basically that case; it's just that there are more subtle versions of "define love for me" which the machine won't give you an answer to. That is a reasonable intuition to start with.
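As a toy illustration of that "circuit plus heuristics" framing, here is a small sketch in my own framing, not ARC's or anyone's actual formalism: the surrounding computation is rigorous, the heuristic is an untrusted black box, and the guarantee splits into an unconditional part and a part that is conditional on an explicit assumption about the heuristic.

```python
# Toy "circuit plus heuristic oracle" example. The hypothetical heuristic
# (e.g. a neural net's guess) is an arbitrary black box; the surrounding
# code is the rigorous part.
from typing import Callable, List

Heuristic = Callable[[List[int]], int]  # black box returning a pivot index

def sort_with_oracle(xs: List[int], guess_pivot: Heuristic) -> List[int]:
    """Quicksort whose pivot choice is delegated to an untrusted heuristic.

    Correctness (the output is sorted) holds for ANY heuristic, because the
    rigorous part never trusts the guess. The efficiency claim (roughly
    n log n work) only holds under an assumption like "the heuristic usually
    returns a near-median element" -- that assumption is the part you can't
    prove and have to support empirically.
    """
    if len(xs) <= 1:
        return xs
    pivot = xs[guess_pivot(xs) % len(xs)]   # clamp so a bad oracle can't crash us
    lo = [x for x in xs if x < pivot]
    eq = [x for x in xs if x == pivot]
    hi = [x for x in xs if x > pivot]
    return sort_with_oracle(lo, guess_pivot) + eq + sort_with_oracle(hi, guess_pivot)

# Even a terrible heuristic keeps the correctness guarantee, just not the speed.
print(sort_with_oracle([5, 3, 9, 1], guess_pivot=lambda xs: 0))
```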
1:57:54
Okay. Any other things that you're looking to fund from a research standpoint, or anything else?
2:00:01
You have also impressed me by continuing to stay active, publishing things even while doing this job, so that's pretty cool and impressive. Any other highlights from your own or AISI's recent publications?
2:00:11
I think the jailbreaking work we just did is quite cool. This is the boundary-point jailbreaking paper that just came out this week, which is basically a way to do black-box attacks: you take a jailbreak and a harmful query, and you muck with the query until it looks like gibberish, so the model doesn't think it's harmful, and then you gradually make it less and less murky until you hit the boundary, and then you dance around that boundary until you find harder and harder attacks that eventually work. That team is doing a bunch of stuff of this kind that's quite creative and, I think, important for mapping out the space of attacks. Then maybe on the alignment side, the real challenge is that, because all of this is imperfectly formalized, often we go to the people we think know a domain best and say, hey, do you want to work on alignment? And there's some jump they have to make: we want to find people who are bought into the risk model enough that they're willing to explore in fuzzy, sometimes unsatisfying definition space, to search around and find ways to connect theory and practice. And that is a challenge. I think ARC, the Alignment Research Center I mentioned, has put out a number of conjectures, and at the bottom of every one they say: by the way, we might have gotten this conjecture wrong; it's possible that if you prove it true or false, we'll realize we didn't mean that, that we meant a slightly different conjecture, and only the new conjecture is risk-relevant or important for our safety agenda. That's an unsatisfying thing to say to a theorist, but it's just fundamentally the real situation we're in. So as more people become aware of model capabilities and risks, I'm hoping that more people with interesting domain expertise will want to really dig in, understand the risks, build up their own models, and find ways to connect their area to the risks.
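For readers who want the shape of the search loop described above, here is a schematic reconstruction for illustration only; it is not the AISI paper's actual algorithm, and query_model and is_refusal are hypothetical stand-ins for whatever model endpoint and refusal detector one has.

```python
# Schematic black-box boundary search: start with the query fully obfuscated
# (the model answers because it can't tell it's harmful), then reduce the
# obfuscation until refusals begin, and keep the clearest prompt that still
# slips through. Purely illustrative; all external pieces are placeholders.
import random

def obfuscate(text: str, strength: float) -> str:
    """Corrupt a fraction `strength` of characters so the query reads as gibberish."""
    chars = list(text)
    for i in range(len(chars)):
        if random.random() < strength:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def boundary_search(jailbreak: str, query: str, query_model, is_refusal, steps: int = 20):
    lo, hi = 0.0, 1.0   # lo: strengths that get refused; hi: strengths that get through
    best = None
    for _ in range(steps):
        mid = (lo + hi) / 2
        prompt = jailbreak + "\n" + obfuscate(query, mid)
        if is_refusal(query_model(prompt)):
            lo = mid            # too recognizable: move back toward gibberish
        else:
            hi = mid            # still slips through: try a clearer query next
            best = prompt       # remember the clearest prompt that worked
    return best
```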
2:00:20
Okay. Something I often say is that AI defies all binaries, and I genuinely believe that; it seems right to me in a lot of places. But you shared this presentation you gave at a recent workshop where you suggested there might be real binaries after all, because we have things in computer science like P vs. NP, where we know, or at least it seems quite likely, that some things are genuinely, fundamentally hard and other things are fundamentally easy. So maybe help me understand that. How should I update my worldview if I'm somebody who doesn't see that binary?
2:02:28
I think the way it works is, again, this goes back to the question of whether the superintelligence will be jagged. And the answer is yes, but only about super, super intelligent things; it won't be jagged about mundane tasks that are very easy. If I give you a task like, can I get a spoon from that drawer? It's not exactly binary, but you're going to do it nearly every time; you'll succeed with many nines of probability because it's easy for you. So the way to combine that view with this question of things sharpening one way or the other is that if you push, not to some infinite limit, but far enough along, you start out in the middle and some force pushes you toward one end or the other, but as you extremize, something else will still be in the middle. That's how I put those two things together. That was a very abstract answer, which is the kind of answer I sometimes give; maybe you want a more concrete follow-up.
2:03:06
I mean, as it pertains to alignment in particular, and honestly a lot of questions in AI, we have this weird phenomenon. First of all, we're obviously moving through time, so in that sense timelines are getting shorter as time passes. But also, calendar-date estimates have come in a lot, and yet it doesn't seem like there's been much convergence of views. So I wonder how you think that plays out. Is that just going to continue until the singularity, or are we going to get some purchase on which picture is right?
2:04:11
I think it will basically continue. People often have very strong takes, and some people will shift and decide things aren't as binary as they thought, or that they should have more uncertainty in their models, and some people won't really; they'll remain fairly sharply divided, pinned on one side or the other. I've been in the field long enough now, and seen enough people not strongly shift, that I think it will just keep going that way all the way along.
2:04:45
Yeah. In my forecasting exercise for 2026, almost everything else goes up, but the one thing I actually estimated lower for this year than last year was the percentage of people who will say AI is the most important issue. The big update for me was that if it didn't move last year, it might not move this year either, even though it's probably going to be a busy year. So, yeah, it is weird; that disconnect just seems totally insurmountable. Okay.
2:05:26
Coming from a very low number, though, I would expect that to go up just because it's starting from such a small base.
2:05:57
I did predict it to rise. I think it was measured at something like 0.2 or 0.3 percent last year; I predicted 2% by the end of the year, and it came in with almost no change. So this year I predicted 1%, which is still somewhat up relative to baseline, but my estimate went down from last year to this year. Another comment that caught my eye in the presentation was "training is a mess," and I think that's obviously true. I've been talking to the folks at Goodfire; you may have seen that they recently raised a bunch of money at a unicorn valuation and announced an extension to their agenda around intentional design. They're looking at different ways to use interpretability techniques in the training process to understand, potentially even at a gradient-step-by-gradient-step level, what is being learned in a semantic sense, and then to apply techniques to say, we do want to learn that sort of thing but we don't want to learn this sort of thing, and hopefully make training less of a mess. How optimistic are you about that sort of thing? Or am I misreading what you meant by "training is a mess"?
2:06:05
So I think that doesn't change the message. If you look at a frontier lab, they have hundreds of people doing model training across many, many sub-teams. There are piles of datasets that people are constantly contributing to, there's iteration across many phases, and they'll automate parts of the task, but then some researchers spend time looking at a spreadsheet with a sample of trajectories to see how things are going. That is a very complicated, almost emergent process, and nothing about the Goodfire approach changes that at all; it just adds another wrinkle to the mess in some sense. There's a really lovely line from when I was learning ML in 2014 or so. I was reading one of Kevin Murphy's books on Bayesian ML, and he had a great line that even the best Bayesians will occasionally do some frequentist thing, a quick check to see whether their Bayesian approach is sensible; you shouldn't be too purist. And I think, for better or worse, the training processes of labs are extremely impure. They're just super complicated, all these different people doing all these different spot checks and so on. That was the point I was making, and it will definitely still be the case whether or not Goodfire does their slightly more complicated training method.
2:07:19
So does that mean you don't have much hope for methods that understand what the model is learning as it goes and shape it accordingly?
2:08:45
Oh, I do. I guess I don't want to take a stand on how forbidden it should be, or whether it's good or bad, to use interpretability in training; I'll decline to answer that part. I think generally trying to understand the dynamics of training in more detail is very important. I just think the "mess" slide was orthogonal to the question. There are a number of techniques that try to control what is learned. There was also the gradient routing work by Alex Cloud, which is interesting; it tries to funnel certain knowledge into certain parameters in the model. And generally I do think there is potential to do interventions of this kind that are important and improve at least misuse risks and safeguards processes, and possibly also alignment as well.
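Here is a coarse, batch-level sketch of the gradient routing idea as I understand it; the actual method from Cloud et al. routes gradients at finer granularity, and the model, data, and layer choice here are purely illustrative assumptions.

```python
# Minimal gradient-routing-style sketch: examples tagged as belonging to a
# "restricted" domain may only update a designated parameter subset, so that
# capability is localized and can later be ablated or withheld.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical choice: route restricted-domain knowledge into the last layer.
restricted_params = {id(p) for p in model[2].parameters()}

def train_step(x, y, is_restricted: bool):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    for p in model.parameters():
        if is_restricted and id(p) not in restricted_params:
            p.grad = None   # restricted data only touches the designated subset
        if not is_restricted and id(p) in restricted_params:
            p.grad = None   # general data leaves that subset alone
    opt.step()
    return loss.item()

# Usage on synthetic batches; a released model could then drop or re-initialize
# the designated subset to remove the routed capability.
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
train_step(x, y, is_restricted=True)
train_step(x, y, is_restricted=False)
```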
2:08:55
Yeah. In terms of open-source models, one hope would be that you could do some of that gradient routing type stuff and then release a version with the expert-level material stripped out, giving people almost everything they could possibly want without packaging the bio risk into it. What do you think about open source?
2:09:44
We're maybe a little late in the conversation to ask what's a big, thorny question, but it seems like right now there's not really any plan. We're just going to hope that the frontier model developers surface any issues far enough in advance that, if anything is coming down the open-source pipe, we have at least a little bit of a window to react and do something. But it doesn't seem like we're on any course to do anything if open source is about to become a problem. Any thoughts?
2:10:09
So yeah, this is certainly a concern. On the alignment side, the alignment mitigations potentially do apply to open-source models, although you can also remove alignment if you get an aligned open-source model. For misuse risks, there is, as you say, a class of techniques that just removes capabilities, which buys you some extra period of time. That includes pre-training data filtering; there's a paper we did with Stephen Casper about that. There's a paper by DeepMind folks, called something like "unlearn and then distill," which does a non-robust unlearning step and then distills the result into a different model; because you distill from the unlearned model's behavior, the parts you unlearned don't carry over. And, as you say, gradient routing could be a solution there as well. But that only buys you time, and then the agentic capabilities of models will catch up and they'll be able to pull that information off the Internet or assemble it in various other ways, even if the model doesn't intrinsically know it. So a lot of that is interventions on the margin. And this is partly why we have conversations about governance, but also why we have conversations about non-model-side mitigations to these risks.
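To make the "unlearn, then distill" idea concrete, here is a minimal sketch under my own simplifying assumptions (toy MLP, gradient-ascent unlearning, vanilla KL distillation); it illustrates the logic described above, not the DeepMind paper's actual recipe.

```python
# Even a shallow, non-robust unlearning step becomes more robust if you distill
# the unlearned teacher into a fresh student: the suppressed knowledge never
# transfers, because the student only sees post-unlearning behavior.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model():
    return nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))

teacher = make_model()   # stand-in for the pretrained model
student = make_model()   # fresh model to distill into

def unlearn(model, x_restricted, y_restricted, steps=10, lr=1e-3):
    """Assumed unlearning step: a few gradient-ascent steps on restricted data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -F.cross_entropy(model(x_restricted), y_restricted)  # ascent
        loss.backward()
        opt.step()

def distill_step(opt, x_general, temperature=2.0):
    """Distill the unlearned teacher into the student on general data only."""
    opt.zero_grad()
    with torch.no_grad():
        t_logits = teacher(x_general)
    s_logits = student(x_general)
    loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                    F.softmax(t_logits / temperature, dim=-1),
                    reduction="batchmean")
    loss.backward()
    opt.step()

x_r, y_r = torch.randn(16, 32), torch.randint(0, 10, (16,))
unlearn(teacher, x_r, y_r)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
distill_step(opt, torch.randn(64, 32))
```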
2:10:35
It all ends in "harden the world." Okay, cool. Anything you want to share about AISI's work in diplomacy? Obviously, hardening the world and also improving cooperation would be great general public goods.
2:11:55
So there's the Network for Advanced AI Measurement, which is a variety of organizations around the world doing similar things; we're part of that and help it along to some extent. We're also the secretariat of the International AI Safety Report that Yoshua Bengio is leading, and we do a lot of work there. Then there are various wide venues, like the Delhi summit in India, and bilateral conversations with particular governments, basically allied governments. We have a big international team and that work is ongoing. We are of course still in a voluntary regime, so that work is about getting people onto the same page about risks, capabilities, and mitigations, but not more than that as yet. But I think it's important: that information matters in case the situation changes in the future or governments want to take other actions.
2:12:09
Yeah, absolutely. Is the UK government and political class generally more optimistic about collaboration with China than the US political class?
2:13:12
I can't comment on collaboration with China in great detail. We obviously work more with allied governments than with other governments; there's not much more I can say than that. There's some sensitivity there that I can't quite speak to.
2:13:24
I hope you're finding at least some common ground with Chinese researchers and scientists. I'll put that in the suggestion box. Yeah, I think that's it. This has been fantastic. I really appreciate the time, and all the extra time for my many follow-up questions. Anything we didn't get to, or any calls to action you'd want to leave people with before we break?
2:13:39
I guess one thing is we are definitely hiring across a variety of teams. In particular, the red team is hiring, so if you like jailbreaking, please apply; and other teams as well. We have a job board, so that's the obvious call to action, and we'll probably have other roles opening over the course of the year in various teams at different times. We did one alignment project grant round last fall, and we had an alignment conference over the summer; we'll probably do more things of this nature, so look out for those. And generally, I just hope that more people with different forms of knowledge and expertise start working on the problem, not just the labs. One thing is that when I left DeepMind, and I was at Google Brain, then OpenAI, then DeepMind, I had the perspective that I was just going to do policy work, advising on policy. Since then, in fact, I do a mixture of that, advising governments, but also a bunch of research. And I think there is a big place for independent research at various nonprofits, in academia, and also in governments. That is very important to build up, so that not all the work happens at AI developers, as it largely does now. More of it is better: more safety and security work by independent folks.
2:13:59
Yeah, it's definitely shaping up to be a whole-of-society effort, and the time to mobilize our resources would seem to be now. I definitely also recommend folks take a look, especially if you're interested in doing alignment work and you have an idea that you don't see too many other organizations showing interest in. I thought your research agenda was quite distinctive in that way, and there's at least some chance that people who aren't on the most well-trodden path but have interesting ideas could find willing collaborators at the UK AISI.
2:15:21
Yeah, definitely check it out and read the research agenda. It's about 60 pages long, with a lot of problems, some concrete, some less concrete, across a variety of areas as they apply to alignment and AI control. So please take a look. I should have mentioned it earlier: many open problems.
2:15:57
We'll put a link in the show notes. Geoffrey Irving, Chief Scientist at the UK AI Security Institute, thank you for being part of the Cognitive Revolution.
2:16:15
Thank you.
2:16:22
I knew a man who loved the truth, the kind you write in chalk. He spent his years with theorems, where numbers do the talk. But something changed, the machines got wise, they learned to watch you, so he hung his coat in England to find out where they crack. Every lock he's tried, he's broken. Every door has let him through. That don't mean you stop building locks; that's just what watchmen do. It ain't the one that fails, he says, that keeps me up at night. It's when they all go down together, like someone killed the light. A hundred guards all built the same, same blind spot, same design. When the fault line finally moves, they all fall down in line. Every lock he's tried, he's broken. Every door has let him through. That don't mean you stop building locks; that's just what watchmen do. He had a game when he was young, he'd win it every round. Then he turned the dial up one more notch and never won again. There's a line you cross, you don't come back; you don't even see it bend. So he works the lamp past midnight now, a hundred hands, a hundred plans. The answer's out there, he believes; it's just a race with time. The proof ran out, but not the man. The chalk has turned to dark dust, and somewhere in the revolution there's a watchman you can trust. Every lock he's tried, he's broken. Every door has let him through. That don't mean you stop building locks; that's just what watchmen do.
2:16:33
If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help, for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.
2:19:00