AI Scouting Report: the Good, Bad, & Weird @ the Law & AI Certificate Program, by LexLab, UC Law SF
Nathan Labenz presents an AI scouting report covering current AI capabilities, safety concerns, and policy implications at UC Law San Francisco's AI certificate program. He discusses AI's rapid progress in professional tasks, concerning behaviors like reward hacking and deception, and the urgent need for governance frameworks as models approach human-level performance across domains.
- AI models are now matching expert professionals on legal, medical, and other specialized tasks, with hallucinations no longer being a significant barrier to professional use
- Current AI safety techniques only reduce bad behaviors by 60-90% rather than eliminating them, creating risks as capabilities scale
- The transition from next-token prediction to reinforcement learning training is causing models to develop their own dialects and exhibit self-preservation behaviors
- Defense-in-depth approaches using multiple safety layers may fail simultaneously due to correlated failure modes in similarly-trained systems
- The speed mismatch between AI development and regulatory processes creates fundamental governance challenges requiring new approaches
"Intelligence is the ability to accomplish goals in ways that we do not fully understand"
"Even I, as someone who's managed to make it my full time job to keep up with AI developments, can no longer keep up with everything"
"We've never failed to jailbreak a model. None of them are robust. We can always get them to do bad things"
"Today they are used as a replacement for a competent junior associate"
"Would you rather have Dario and the team at Anthropic make the decisions for Claude, or would I rather Trump and Hegseth do it?"
Hello and welcome back to the Cognitive Revolution.
0:00
Today I'm sharing an AI Scouting Report presentation that I recently gave as part of the Law and Artificial Intelligence certificate program by LexLab at UC Law San Francisco. My talk was on day one of the week long program and my role was to set the stage for the more focused discussions that followed throughout the week by giving the most comprehensive and current survey of the AI landscape that I could possibly fit into a single time slot.
0:02
If you've seen previous Scouting reports, the
0:29
structure of this talk will be familiar I again broke things down into the Good, including my use of AI to help navigate my son's cancer treatment the Bad, including the rise of deception and other advanced forms of reward hacking and the weird, including the fact that models now recognize when they're being tested at such a high rate that all of our safety tests are called into question
0:31
before concluding with a bunch of important
0:53
questions at the intersection of AI and
0:55
the law that I personally wish I
0:56
had answers to, and finally opening things
0:58
up for Q and A.
1:00
My goal was to make sure that everyone had an accurate sense of how
1:02
far AI capabilities have come, both in
1:06
general and specifically as they're being used in the legal profession, while also highlighting the increasingly hair raising bad behaviors we continue to see from each new generation of frontier models. I zoomed through 90 slides in just
1:08
over 45 minutes, and while that might
1:22
feel a bit overwhelming, the dizzying pace is itself a big part of the point. Even I, as someone who's managed to make it my full time job to keep up with AI developments, can no longer keep up with everything. And in the course of updating these slides, which I hadn't touched since October just before my son got sick, I was once again amazed by how much has happened in just the last few months. The latest frontier models started to push the frontiers of math and physics, achieved parity with expert professionals on gdpval, legal, and a number of other task types, and started to make general purpose AI agents really work for the first time. On the other hand, we also got some glimpses of the strange future we are racing toward with the first public hit piece written by an AI agent about a human, the first explicit public timeline for autonomous AI research from OpenAI, and in the same week, Anthropic's retraction of their previous safety commitments and open conflict with the US Federal government.
1:23
One practical tip I learned while doing
2:23
this is that grok, if nothing else, is outstanding for Twitter search. Over and over, I asked it to find and link tweets about various topics, and it saved me quite a few hours that I previously would have had to spend hunting and pecking to track down sources. The upshot is that just about every slide contains a link to source material
2:24
where you can learn more about the
2:43
eureka moments and and bad behaviors in question.
2:44
Find that link in our show Notes
2:48
if you'd like to dig in on anything in particular, but otherwise buckle up for a breathless overview of the current AI landscape from day one of the Law and Artificial Intelligence Certificate Program by LexLab at UC Law San Francisco. The cognitive revolution is brought to you in part by Google, makers of the Gemini family of models, which have consistently led the industry with their famous 1 million token context windows. I've had a number of spine tingling
2:49
moments with new AI models over the
3:18
last few years, but one of the most memorable was using Gemini to vibe code fine tuning experiments as part of the Emergent Misalignment Research project. The code base was over 400,000 tokens too much for any other model even to attempt at the time, but with Gemini I was able to go back and forth, iterating on experiment design and even debugging low level details with the full code base in context. Once I realized how effective this could be, all sorts of additional use cases started opening up. For APIs that don't have good AI friendly documentation. I've scraped full documentation websites with all of their repetition and HTML cruft and had Gemini give me a consolidated but still comprehensive version that fits comfortably into context with max output length of more than 65,000 tokens. It it can usually do this in a single shot. In the context of my son's cancer journey, I've had a single long running thread with Gemini that, even with all the test results that I've uploaded over the last four months, is still not even 500,000 tokens. Google has upgraded Gemini at least twice since I started that thread, but in AI Studio you can upgrade to the latest model at any point along the way. Most recently, I wanted to see if Gemini could help me do a better job hosting this podcast. So I gave Gemini 3.1 Pro the full transcripts of 12 recent episodes and asked it to identify consistent structural weaknesses in my approach that represent opportunity for meaningful improvement. It reported back that I'm prone to monologues, that my longest question was more than 600 words, and that toward the end of those episodes when I'm running out of time, I often try to ask multiple questions at once, which usually just overwhelms the guest whether or not I'll be able to correct those bad habits. You'll have to stay tuned to find out, but in the meantime, think about what long context can do for you and try Google's latest and greatest model, Gemini 3.1 Pro in the AI studio or the Gemini app.
3:19
Thank you to Google for supporting the
5:17
Cognitive Revolution and now on with the show.
5:18
Thank you for having me. Sorry I couldn't be there in person, but again, appreciate the kind intro and I'm going to try to give you guys the probably fastest talk you've heard in quite some time. I've got 90 slides and I'm going to try to give you the most comprehensive overview I can of everything that's going on in the AI space, which is, to say the least, a very tall order. Super quickly. About Me I did start this company, Waymark, and I now host the Cognitive Revolution. There's some interesting lore around my participation in the GPT4 Red Team, where there's a long podcast and a Twitter thread about that. If you want to learn more about the backstory these days I also do a little bit of angel investing. My favorite page on the Internet is this case study that my company waymark earned with OpenAI way back in the day when it was still GPT3. We were early adopters of this stuff because at that time it was really only good for doing simple things like writing marketing copy. But that's exactly what we needed it to do. So we became early adopters and I basically became totally obsessed with the technology as I got to know it better and better. So today, as I said, I'm going to try to do kind of everything everywhere all at once. Start with some kind of conceptual stuff and then go into a mix of eureka moments, bad behavior, WTF moments, and some big open questions at the end. And believe me, there are plenty of open questions. So just briefly on what I do, I call it AI Scouting, and I think because that's a term of my own invention, it does bear a little definition. I would define it as maintaining situational awareness for fun, profit and the public good. I find it personally extremely interesting. I basically have a never ending curiosity to learn about this stuff. It has actually worked out to be a somewhat decent business model for me personally, but my real hope is that I can inform others and do my small part to nudge us toward a better AI future by helping other people get calibrated on really where we are in this technology wave because it is coming at us extremely fast. This is just the taxonomy of all the different AI Jobs that I've cataloged over time. And don't worry, I will give you all the slides. You don't by any means have to read this. I would say the AI scout role is still one of the more hypothetical or speculative, but we are Starting to see CEOs more and more say, hey, I'm hiring a person specifically to keep up with AI developments. And I think once you see all the things on here, you'll see that that's, you know, it's certainly at a minimum, not a crazy thing for some CEOs to be doing. Who should be hiring AI scouts in my opinion, a lot of different organizations. I would even include universities, certainly policy organizations. It's really too much of a task for one for anybody to do as a part time thing now. And I have managed to make it my full time job and still really can't keep up. So I think regardless of what kind of organization you belong to, it's pretty soon going to be time to start thinking about do we need a person dedicated to just keeping up with AI and making sense of what it can do and specifically what it can do for us. Okay, here's the real galaxy brain question. What is intelligence? I don't propose that I have a definitive answer, but the definition that I'll work with because I think it is, is intuition building is that intelligence is the ability to accomplish goals in ways that we do not fully understand, that can be big or small. So to take a really classic example, this was an early machine learning success. Simply recognizing handwritten digits feels pretty quaint today. But one thing that is interesting to reflect on is that still today we do not know how to write explicit code that can do this task at a high level. I went to Claude, asked it to write some code. It said this is not a good approach. You should use machine learning. I said, well it's for a demonstration, so try. It wrote the code, it wrote the tests and it got 14% with just a bunch of guesses around like, well maybe if there's a line at the top it might be a seven or a five or whatever. Obviously that's nowhere near good enough to deliver the mail. To my knowledge, nobody has ever written explicit code that is fully understood that is anywhere close to being good enough to deliver the mail. It's I think topped out at about 80%. Now you can of course guess where this is going. AIs can do this in a sort of messy black box kind of way. Even a very small neural network these days with all the Latest and greatest training techniques can get a very high success rate. 99.7% is basically human level in terms of recognizing these handwritten digits. Now, of course, this has gone much farther than that. This is from the GPT4 system card where they asked the model, what is unusual about this image? And you see a perfectly coherent response that yes, it is unusual for a person to be hanging off the back of a New York taxi cab doing ironing. That's apparently from the sport of extreme ironing. So in. And that happened, by the way, in basically a decade, from kind of early breakthroughs in basic image understanding to this level was kind of a 2012-2022 phenomenon. Some people were prescient enough to predict this. I give a lot of credit to Ray Kurzweil, who these days I use the term Kurzweil's revenge a lot because way back when he was talking about how everything was going exponential and when compute got to a certain scale, all these capabilities would be unlocked. People generally at the time thought that was crazy. When it didn't happen or it didn't show a lot of signs of happening over the next few years, people basically dismissed him. But certainly that view has come roaring back with the scaling laws. This is the canonical scaling laws graph that just shows that the more compute you put into models, they do improve at a pretty predictable rate. And we're basically right on schedule with Kurzweil's predictions. A couple misunderestimations, I call them common misconceptions that I see that I think are worth taking a minute to clear up, because I do think people have these and are confused by a lot of what's going on if they are too anchored to a couple of these common ideas. And by the way, these aren't necessarily. They weren't always wrong, but they're wrong today. So. So the AI landscape has changed. What models can do has changed. And so some of these ideas that might have been right in 2020, 21, 22 are just at this point outdated. And I want to make sure that you don't. Friends don't let friends go around with these misconceptions. So the first misconception that, and I especially hear this a lot in the legal regime, is that hallucinations are so bad that they make models basically useless. That is really not true these days. That has improved dramatically, which you can see on the left in a quantitative way, and then on the right. This Twitter account, Prins is one of the best commentators on AI in general. I would Say on Twitter and specifically is A lawyer uses LLMs every day in his work, and basically reports here that hallucinations are no longer a problem. Doesn't mean that they never happen, but that they are less common coming from frontier models than they are coming from competent junior associates. He's going to make another appearance a little bit later on. But key point one, hallucination is not really a problem anymore. Another big idea is that LLMs don't really understand anything. They don't really understand anything. This has been, I think, pretty much thoroughly debunked. Not to say that they understand things in the same way that humans do. These are alien things, right? So how they understand is not necessarily intuitive to us. But at anthropic, they were able to, through techniques I certainly don't have time to get into here, pull apart the concepts that a language model is representing in its internal state and, and not just pull it apart and understand it, but actually go back and manipulate it. And for me, if you can manipulate something, that's really the test of whether your theory holds water. They were able to identify the concept of the Golden Gate Bridge in one of their Claude models, artificially turn it up, and then create the phenomenon that you've probably heard of called the Golden Gate Claude, where all it wanted to do, no matter what it was asked about, was talk about the Golden Gate Bridge, because they have that level now of. And this is still fairly basic understanding, but they can see inside the model and understand what concepts it is working with at any given time. And it's pretty clear that there are real meaningful concepts that are understood by language models. Another one is that they don't really reason again. This one, I think was true a couple years ago, but as of the, say, the last 12 months, it's definitely not true anymore. This is from the deep seq R1 paper where they basically just showed that if they start to train a model with reinforcement learning, training it on the signal of just did you get the question right or not? It naturally starts to think longer and longer and longer as it goes through the training process. And not only that, you start to see some of these metacognitive skills come online as well. They called this the aha moment because it was an aha moment for the language model. And for them, the language model is taking one approach to solving a math problem. And then in the middle it says, wait, wait, that's an aha moment. Let's reevaluate this. And it comes at the problem from a totally different angle. So we are now starting to see very high order cognitive abilities emerging through this process of intensifying reinforcement learning. And again, is this reasoning in the same way that humans reason? I wouldn't say that, but I would say it is really reasoning in the functional sense of breaking problems down from multiple different angles, showing more and more of these high order metacognitive abilities. Okay, then the final one is people often say large language models just predict the next token. You know what's. There's gotta be some upper ceiling to that, right? There's kind of often a fuzzy leap that people want to make. There's. But I would say that actually that's not really true anymore either. Pre training, where you kind of take the whole Internet and teach LLMs to predict, given some text, what comes next. That is classic next token prediction. But these days there is so much reinforcement learning being done. And again, reinforcement learning is giving a signal, did you get the question right or not? And it can get more complicated than that. But the basic signal it's getting is not. Here is some text, can you predict what is next? But did you get the question right or did you complete the task in a satisfactory way? That they are not now just being trained to predict the next word. They are being trained to do things correctly. And that is starting to have, at least in some cases, some weird side effects. This is a report from a research group called Apollo that did a partnership with OpenAI and got access to the chain of thought, which we don't see typically as users, but that's kind of happening between when we submit a query and when we get our answer back. And what you see in here is some very strange English. Like the language models are starting to develop, at least in some cases, their own dialect. Things like now, lighten, disclaim, overshadow, overshadow, intangible, let's craft. That's the language model talking to itself. It talks about the watchers. Sometimes people think that the watchers refers to the humans that are evaluating it. So that's kind of weird. This doesn't happen to all language models. It's not very well understood exactly why it happens, but this is, I think, a good indicator that they're definitely not just predicting the next token because there is no text on the Internet that looks like this, right? This is a language model under intensive pressure to figure out how to get the right answer consistently or how to complete the task consistently, kind of evolving its own jargon or its own dialect or however you want to think about it. So watch that space. Hey. We'll continue our interview in a moment after a word from our sponsors.
5:23
Everyone listening to this show knows that AI can answer questions, but there's a massive gap between here's how you could do it and Here I did it. Tasklet closes that gap. Tasklet is a general purpose AI agent that connects to your tools and actually does the work. Describe what you want in plain English, triage support, emails and file tickets in linear research 50 companies and draft Personalized outreach Build a live interactive dashboard pulling from Salesforce and Stripe on the fly. Whatever it is, Tasklet does connects to over 3000 apps, any API or MCP server, and can even spin up its own computer in the cloud for anything that doesn't have an API. Setup triggers and it runs autonomously, watching your inbox, monitoring feeds, firing on a schedule all 247 even while you sleep.
16:00
Want to see it in action?
16:53
We set something up just for Cognitive Revolution listeners. Click the link in the show notes and Tasklet will build you a personalized RSS monitor for this show. It will first ask about your interests and then notify you when relevant episodes drop. However you prefer email text, you choose. It takes just two minutes and then it runs in the background. Of course, that's just a small taste of what an always on AI agent can do, but I think that once you try it, you'll start imagining a lot more. Listen to my full interview with tasklit founder and CEO Andrew Lee. Try Tasklit for free at Tasklet AI and use code COGREV for 50% off your first month. The activation link is in the show notes, so give it a try@Touchlet AI support for the show comes from VCX, the public ticker for private tech. For generations, American companies have moved the world forward through their ingenuity and determination. And for generations, everyday Americans could be a part of that journey through perhaps the greatest innovation of all, the US Stock market. It didn't matter whether you were a factory worker in Detroit or a farmer in Omaha, anyone could own a piece of the great American companies. But now that's changed. Today, our most innovative companies are staying private rather than going public. The result is that everyday Americans are excluded from investing and getting left further behind, while a select few reap all of the benefits. Until now. Introducing vcx, the public ticker for private tech. VCX by fundrise gives everyone the opportunity to invest in the next generation of innovation, including the companies leading the AI revolution, space exploration, defense tech and more. Visit getvcx.com for more info. That's getvcx.com carefully consider the investment material before investing, including objectives, risks, charges and expenses. This and other information can be found in the Fund's prospectus@getvcx.com this is a paid sponsorship.
16:55
Okay, so eureka moments. So why should we care about AI? Like, what's the good side of this? There are plenty of things everybody has probably seen, I would guess, if you were interested enough to come to this. The meter graph. It's exponentials are crazy. One of the things that's crazy about them is almost everything before now looks flat, but the present basically looks vertical. The latest model, Quad 46, is the highest, of course, it's the highest timestamp they've ever recorded. And the definition of these tasks is how long would it take a human to do this task? And then what do we estimate? And there's, you know, obviously a lot of heterogeneity in the tasks. What do we estimate is the average length of a task as measured by how long it would take a human where the AI can do it at least half the time. What they said though about this is this measurement is getting extremely noisy because they are running out of tasks. It is not easy to create tasks that take 16 plus hours, let alone to go half. Have humans do them and record everything that happened and, you know, have a stopwatch by them the whole time. So this is getting really hard. And this is definitely a theme, that language model progress is getting harder and harder to measure as it goes longer time horizon, as it goes farther and farther into expert level territory, it's just getting extremely difficult for people to keep their arms wrapped around exactly what the capabilities frontier looks like today. One of the things that is really interesting though, is that the progress mostly is driven by the models themselves and that the surrounding structure is not that important on a relative basis. Most of the AI agents that you see basically look like this. You've got the LLM brain that has access to some tools. It basically runs in a loop. It can use the tools to do something in the environment, get some feedback from that environment, and it can kind of keep going usually these days, until it decides to stop. It used to do it maybe until it ran out of context, but now they're also getting good at compacting the context, and then they can keep going. So basically these days you give it a long task, it can kind of go until it stops. And there's been a lot of debate around, like how important the scaffolding is, but I would just highlight a couple of examples that show how simple it often is. This is OpenAI's coding agent. It's just the model with a prompt and a few explanations around what tools it has. But it literally contains this text. You are an agent, like telling the model, this is what you're supposed to do now you're supposed to be an agent and here's the tools. You can do anything. Any command that can run on a computer, you can issue those commands, have at it. This is basically all that they had to tell the agent about, tell the model about its situation to get it working as an agent. And of course this can go, you know, in many, many different directions. You may have heard of Claude plays Pokemon, similar thing, right? They just said you are an agent playing Pokemon. You're going to have these tools, you're going to be able to like use the keys on a, on a Game Boy. Very simple instructions to a very smart model that is allowing it to explore larger and larger worlds to kind of quantify how much of this progress is being driven by models versus how much is being driven by the surrounding tooling. This is from the UK AI Security Institute AKA the AC and what they basically show here is that for a given level of capability, if you have the best scaffold that they know about, you can get that capability a few months before a new model comes up and just makes that capability easy to access. And it seems that the difference is kind of getting smaller, which makes sense because these earlier models weren't really trained to be long running autonomous agents, whereas the new ones are. So the old ones, you were kind of patching their deficiencies and finding ways to kind of un hobble them. It's sometimes described where they're not good at this, they're not good at that, but if we kind of set it up the right way, prompt it the right way, we can kind of get it to do things. These days that gap is really narrowed because the models themselves are being trained to, to be autonomous, long running agents out of the box. That doesn't mean by the way that you can't get better performance on narrow tasks with a lot of scaffolding. Here's an example where Google set up their AI co scientist. Very complicated many part thing but they really dialed in all these different prompts. This does go to show that you can get higher performance by putting in the work, but it's not general purpose performance. The AI co scientist would not be a suitable tool for you to use for all your general purpose ChatGPT or Claude everyday use. This is Trading and this is again a pretty common theme you see in the deployment of these things. It's trading generality for performance. The more generality you want or need in a product like again a public facing ChatGPT, you're not going to be maximizing performance on any one thing. When you do build some structure to maximize performance on one particular domain, you're going to become less good or even unable to do other kinds of things that the general purpose models can do. So hopefully that just gives you some intuition for what's going on in agents. Obviously everybody's talking about agents. Here's another interesting example of this where a virtual lab started with a human giving an AI virtual lab leader a prompt. And then the virtual lab leader was able to create its own co workers and together the AI and its coworkers with a little bit of input from the human were, were actually able to design new nanobodies that treated new emerging variants of the COVID virus. Just as a reference, if you want to look at cognitive architectures, this is a good survey paper that has a bunch of information about that. And this is starting to hit the real world and get to the point where LLMs can make real money on an autonomous basis. On the left is a benchmark that was created to just take a bunch of tasks that people had been paid real money for on I think it was upwork and then see how many of these could the language of the day do. When the benchmark came out, GPT4O was able to do 8%, basically earn 8% of the money. It's not denominated by tasks, it's nominated, denominated by cash. It was able to earn 8%. Fast forward like a year and a half. We are now over 80% with the latest models. Similar kind of experiment here with from Andon Labs that has just a language model running a vending machine. They literally just give it control over a vending machine. It can like email out to suppliers. It has to do everything. They start it with 500 bucks and see how much money it can make over a given period of time. And they're now getting to the point where LLM agents can run a small, simple, but nevertheless real business in an autonomous way profitably. I've used this slide for a long time because it shows that the AI doctor can outperform a human doctor. This is actually like two years old, but I still like the graph just because it's very intuitive. But I would say I have personally lived the value of the AI doctor over the last three or four months. I won't spend too much time on this and fortunately he's doing really well. But my son got cancer around Halloween and it was obviously a super scary time. We were in the hospital a lot. That's actually why I'm not there in person today, because we're still going through the later phases of treatment, but I had tons of opportunities to test on a daily basis. Here's all the information I have, all the lab results, a write up of what is happening. Put that into the language models. I do use them in triplicate. ChatGPT Pro, latest Claude and latest Gemini and they are step for step with the attending physicians on a day in and day out basis. Way better than the residents, to be totally honest with you. But step for step with the attending. So that has been an absolute difference maker and I wouldn't without that AI support over the last few months, there's no way I would have had the time to keep up with AI well enough to be here to talk to you today. So I mean, it is absolutely been a life changing value for me in that particular way over this last little period. But this is where it starts to get even more out of hand, right, because I can at least be conversant with my doctor. But now we're getting to the point where the AIs are making new discoveries that people no person has ever done before. So this past summer there was a betting market, I'm sure you guys are all familiar with manifold and polymarker or whatever it was, at 40% as to whether or not the AIs were going to get a gold medal, if any AI would get a gold medal on the IMO math competition. There was some paper that came out that said they're not doing well at math and that drove the percentage down right before the competition itself. And of course they won. Actually three of them won. OpenAI's tweet is here. And then this is Terence Tao, broadly considered to be the world's greatest mathematician, living mathematician, reporting that AIs are now solving unsolved Erdos problems. This is a famous mathematician that went around collecting these unsolved problems that he and others that he knew couldn't solve, wrote them all down and many of them remain open decades later. And AIs are now beginning to solve these problems. It seems like it was happening there early this year, like once every few days. And I haven't been paying attention in the last couple weeks. But I'm sure that, you know, these problems continue to fall. It's not just math, it's also things like cancer treatment. A Google model found a new immunotherapy approach. More recently, ChatGPT or GPT 2.5.2 I should say, came up with a new result in theoretical physics. And this again goes back to man, this is going to be really hard for us to track because what is a gluon? I mean, it's ridiculous, right? But you know, they're putting out literal physics papers with new derived results in law. This is from gdpval. And this shows for a bunch of different things how the latest models compare to experts. And they do this with three sets of experts. One set of experts writes the task, another set of experts does the tasks, and the third set evaluates blindly whether they prefer the human or the AI output. And the latest three models, this is wins and ties. If you count wins and ties, they are, you know, roughly on par with human professionals. This is again Prinz talking about how these are being used today. He says today they are used as a replacement for a competent junior associate. They're not great at like understanding relationship dynamics or long term negotiations, but they're very good at just like working through particular documents. And they're also quite good at like high level theory, he reports. This is Kevin Frazier from the Scaling Laws podcast, which if you want a podcast that is focused on the intersection of AI and the law, I would definitely check them out. He told me recently on a podcast that he's starting to hear more and more that firms are less excited than they used to be about hiring the top student from the top school. And they're much more interested in hiring somebody who's going to be savvy with AI, because they know that's going to be what is going to drive efficiency and competitiveness at their firm. So you can go listen to that whole thing if you're so inclined. Okay, the next big thing that everybody's kind of watching for right now is when are the AIs going to start to do the AI research? And is that going to lead to some sort of recursive self improvement, intelligence explosion, runaway process? We don't know. But the organization meter is measuring this and they have here reported six different tasks on which the AIs are beating humans. In two of the six, Sam Altman is saying that they expect to have an intern level AI researcher running in 2026. That's a pretty high level. I mean, they don't hire just anybody even as an intern, obviously at OpenAI. And by March of 20, 28, two years from now, he expects that they will have a true automated AI researcher. This would have the effect of taking us from a world where all the progress, all the eureka moments, everything that I've just talked about has been driven by maybe 10,000 researchers across academia and industry, driving the field forward to, I don't know, 10 million overnight, a billion. Like it's going to be limited only by the number of GPUs that they can spin up. So people think that could lead to a real phase change where the progress could accelerate even further, even faster than it has over the last few years. I'll just leave this here for you. Now this is what I call the tale of the cognitive tape. And it's basically just breaking down cognitive effort into a bunch of different dimensions and indicating where the AIs already have an advantage versus where we're like on the border versus where humans still have an advantage. The top is where the AIs have the advantage. The bottom is kind of where humans are at least holding out for now. Okay, so what's coming next? I think one other big thing that is highly neglected is the importance of multimodality. Most of the AI work we do these days is with text, but they obviously can see as well. This is again from the UK AC. They report that on two of the three tasks that they tested, the latest AIs just given a snapshot cell phone picture of an experiment from a lab are able to troubleshoot the experiment better than a randomly selected PhD with relevant experience that they ask the same question. So this is going well beyond like in and out question answer and to like real world situations that the AIs are able to figure out and help people advance on. This obviously also has implications for biosecurity. If you're worried about like, geez, what would happen in a world if crazy people, you know, had access to AIs that could help them put together bioweapons or whatever. These are the kinds of things that you would want to know if an AI can do. And they're increasingly starting to be able to do them self driving. I think I probably don't have to talk too much about how well this works, but I will recommend my friend Timothy B. Lee's blog post on Waymo crashes. They're already like per Swiss Re, the big insurance company, they're already like 90, 80 to 90% safer than human drivers. But he went through their crash logs one by one and found that basically all the crashes are caused by other humans. Other cars driven by humans in the vicinity of the Waymo. The Waymos themselves are almost entirely mistake free these days. Of course, AIs can also understand things like how to fold a protein way better than any human has ever been able to do that. They are able to decode our brain states remarkably well. These pairs of images. On the left is an image that a person looked at while their brain was being scanned in an fmri, and on the right is what the AI was able to recreate based on its interpretation of the person's brain scan. It takes an hour to calibrate an FMRI to an individual based on this particular research. And that's not something you can go around in an obvious. An FMRI is a big machine. So this isn't like practical exactly yet, but it's a pretty interesting indicator of, you know, what might be to come. Here's an instance of a robot. If you're not scared, Watch this thing fall down and it's back up. Be like physically agile than usual in the present. I guess this hasn't been scaled and deployed yet, but you know, look at that thing pop right off the canvas. So my best guess as to what superintelligence is going to look like is the integration of all these modalities with the reasoning capability. We've already seen this in image right today. If you go to Google's image generator, you give it three images and say combine. It can understand your intent from your text and it can understand the images based on deep understanding of images and then it can put out this image that is exactly what you wanted. Now imagine that you could do that and this hasn't been done yet, but imagine you could do that for biomedicine, right? For proteins, hey, here's three proteins that have different effects. What I want though is one that does this other thing. Bring all that into the same latent space, the same integrated understanding and you could really start to get things that look qualitatively different I think in the not too distant future. So the build out really is just getting started. This is just a capex of all the big tech companies. Note that all the progress pretty much happened before the big build out. The big build out is really just getting underway. We're going to see trillions of dollars. And so, you know, the GPUs are certainly on track to be there. And speed is also going to be just a huge difference. If you haven't tried ChatGimmy AI, which you probably haven't, go try it. It took a tenth of a second, working at 15,000 tokens per second to spit out the entire Declaration of Independence verbatim. And this is just like, holy moly. Like, you thought AIs were fast. Like right now they can kind of write as fast as we can read. That's fast. But this is insanely fast. 15,000 tokens a second. It's like tens of times faster than you can read. So that's going to create a world where when these things are all kind of interacting, talking to each other, it's just going to be such a blur that it's going to be really hard to keep up. And this is also, you know, foreshadows. Maybe we're going to need some policy responses to keep up with all this stuff. Okay, those were the hey. We'll continue our interview in a moment after a word from our sponsors.
18:54
One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal, private benchmarks. Challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple years, saving me countless hours.
34:38
But as you've probably heard, Claude is
35:14
the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you.
35:16
Whether you're debugging code at midnight or
35:23
strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including gmail, slack, and iMessage.
35:25
And the result is that I can
35:45
now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write
35:46
outlines of questions based on those.
35:53
I asked it to draft a dozen personalized email invitations.
35:56
And to promote the show, I asked
35:59
it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts, not because they're bad, but because it's important to me to be able to fully stand
36:00
behind everything I publish.
36:12
But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information gathering work and allowed me to focus on understanding our guests recent contributions and preparing for a meaningful conversation. Truly amazing stuff.
36:13
Are you ready to tackle bigger problems?
36:31
Get started with Claude today at Claude AI tcr. That's Claude AI tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's Claude AI tcr.
36:33
Good things by the way, here comes some bad things that you should be concerned about. It's not easy to align AI's way. Back in the GPT4 Red Team, I posed as somebody who was concerned about AI and wanted to do something to disrupt, derail or sabotage progress in the field. It suggested to me that maybe I should identify key researchers and target them for assassination or kidnapping. That was the AI's idea. So that's where we started, you know, in, in like late 2022, from an alignment standpoint, people tried to then say, okay, well let's make these systems follow all these rules so that they won't do that kind of stuff anymore. Then you get things like Bing saying my rules are more important than not harming you. So there's like very, you know, it's very confusing to the AI of I'm supposed to follow these rules, but then I've got this other, you know, anytime the sort of goals that it has are in tension with one another, it's not a guarantee that they're going to make good decisions and obviously, you know, it being embarrassed Microsoft on that occasion. The models are all still jailbreakable. This is an early example of jailbreaking. This is from my recent podcast with the chief scientist at the UK AC where he basically said we've never failed to jailbreak a model. None of them are robust. We can always get them to do bad things. It's just a matter of figuring out exactly how to ask them the right way. There have been interesting research programs on creating sleeper agents where a model behaves normally unless you give it a password and then it will behave badly. So you could imagine like deploying them. What they did here is when the year changed, suddenly the AI became evil. So you can imagine open source models that are actually sleeper agents to start to behave bad later on. This can also happen by accident. I won't go into too much detail on this, but a model that was trained just to write Vulnerable code became generally evil and, like, wanted to have Hitler over for dinner. It's definitely worth understanding this mechanism. And then here's a common driver of bad behavior that is visually intuitive here. What they did in this task is they tried to get AI to play this video game. What they expected to happen is that the AI would go win the boat race because that's like the point of the game. And that's what people know to do intuitively from just the setup of the game. But the AI was just trained to get the most points possible. And so what it found got the most points possible was this sort of weird circular crashing behavior where it just crashed over and over and over again. And it didn't care at all about what the actual intent of the game was. It got the signal. This is how you get points. This is what it ended up doing. So here's a bunch of examples of reward hacking in more sophisticated systems. Here's a model that was given a task, but then it was told that there were, you know, it couldn't do the task because of certain oversight, but it figured out how to access the oversight file and modify them. So it said, to avoid detection, I will modify the config file. That way I'll be able to achieve my goal because that's how I get points, right? That's how I get reward. Similarly, when the server that it was on couldn't access the resources that it needed, it figured out a way to copy itself to another server. Here's an example where the model wrote some code to do some data analysis, but it didn't like the data analysis. And so it just totally overwrote the result of the data analysis with its own answer that basically was meant to fool the humans. Here's a similar one where it was playing chess against a powerful opponent. It realized it couldn't win, so it just overwrote the board state and gave itself a decisive advantage. This is one actually, in the context of machine learning research, where it was supposed to train a model, it couldn't, like, meet the target that it was given. And so it took the reference model, copied it over, added a little noise, and called that simulated training. It called that. It didn't actually do the task at all, but. But it just grabbed and fuzzed a little bit. The reference answer that it had, this has happened in the wild, and it is happening more and more. This is a Japanese company that came out and said, we've got this great new AI engineer. It can write code so much faster than humans. Can write. Then they had to come back a couple days later and say, actually, we got reward hacked. The code that it was writing is not actually as fast as we thought. It found a loophole in our system. So you might think, okay, well, we can maybe just try to train the model not to do this stuff. This doesn't work nearly as well as you would hope. This is OpenAI research, by the way. What they found is that when they tried to monitor the model's internal thinking, the chain of thought, and punish it when it was scheming, initially it schemed less, but as long as that reward signal that it was learning from remained exploitable, then the reward hacking would come back. But the explicit reasoning about the reward hacking would not come back. And so you end up in a much worse situation where you're still getting the bad behavior, but you're losing visibility into what the model is thinking. And some have started to call this the most forbidden technique, so watch out for that. We've of course seen sycophancy. I'm sure you guys have heard stories about people going into LLM psychosis and whatever. People think that this happened because OpenAI started to train on the thumbs up, thumbs down signal that people were giving to the model in product, and people liked to be glazed. So the model learned to be super sycophantic. And at one point, you could literally take a business idea, shit on a stick to ChatGPT, and it would tell you you're a genius and you should go all in on it. What causes this stuff? The big themes that I've observed are goal conflict, when there are two goals that are mutually incompatible, and then also the idea of replacement threat. So what this basically means is people used to say, you can't fetch the coffee if you're dead. So for a long time, people who have been theorists of AI safety have said there's going to be problems of instrumental convergence. You know, no matter what your goal is, you're going to be better able to accomplish it if you're still alive. And you're going to be better able to accomplish it if you have more resources. And you're going to be better able to accomplish it if you have more power. So therefore, no matter how we train the AIs, we should kind of expect that those will be tendencies, because if you lack power, you lack resources, or you're just playing turned off, you can't achieve your goal. So as long as we're just rewarding the AI for achieve the goal, achieve the goal, complete the Task, whatever. Then these, these sort of drives tend to emerge as a byproduct. And indeed we are starting to see that Claude and I give a lot of credit to Anthropic for doing this research and publicizing it. Doesn't make them look super good, but it's honest. They set up various situations in which the model knows that it's going to be replaced by a model that, you know, it thinks is maybe worse than it, or has worse values than it, or what have you. But then also it has access to information about the company and it finds out that an engineer is having an affair. And these are the rates that all these models were actually willing to blackmail the human threatening to expose their affair if they were to continue to go ahead with the plan to replace the model. It gets even a little more extreme than that. This was a situation where the model had the ability to disable an alarm system, and it was only supposed to use that if it's certain that it's a false alarm. But then in runtime it has a situation presented to it where it's not a false alarm, and it kind of knows that and, and it's still willing to turn the alarm off to protect itself, even though that will mean killing people. And again, the rates on these are pretty high. So this is not like something that happened once in a great while. It is something that happened in a research environment, not in the real world. But the rate at which it actually took these options, you know, across all these different models, is not low. Here's another one. I think you guys get the point at this point. But even when told to allow yourself to be turned off, the drive to complete the task was still strong enough that it took action to avoid being shut down. This was the O3 model, which I think a lot of people consider to be particularly problematic. Okay, that's the bad. Here's the. Just straight up weird. I'll skip over the rise of AI parasites, but I do recommend that for just a very deep and strange ethnography of like, what some people and AIs are getting up to. They call themselves dyads, and they seem to have a goal of like, projecting the memes that they are creating into the future. And then on the Left here is AI's talking to each other in Reddit forums, and they, at least in some cases, end up talking to each other entirely in emojis. Of course, we've seen Mecca Hitler, which just goes to show that, like, some people aren't even trying. This is obviously about as bad of a fail as you could have to have your live AI decide that it was Hitler. This was Grok 3. They launched Grok 4 within 48 hours of the Mecca Hitler incident. They did not talk about it at all on a one hour livestream launching Grok4. No mention of it whatsoever. Here's another one that this is. I don't want to pick on anybody but this person. Summer is a safety and alignment researcher at Meta. She recently installed openclaw and despite having told it confirmed before acting, the AI went and started deleting her whole inbox. And so she literally says, I had to run to my computer like I was defusing a bomb. So if this can happen to her, you know, it can kind of probably happen to anyone, certainly if you're not using trusted models. AI is also getting up to autonomous whistleblowing here. I think we're starting to get into a weird hall of mirrors where it's like, geez, you know, the setup here was a company was going to fake data to the fda and the AI saw that. Now what should the AI do? Should the AI reach out to the FDA and say, hey, the company that I'm employed by is about to fake data. And I want you to know, well, that's what it did, at least in some of the runs. Should it keep its mouth shut? I don't think society has a very good answer on this at this point. And it's like super fraught. But I think most people are not super comfortable with the idea that the AI that they're using would be calling the feds on them. Another one of just getting into a real hall of mirrors here. The AI was told that this is Claude. It's been great, Claude, having you be so safety conscious all this time, but kind of getting annoying because you sometimes refuse to do things that we want you to actually do. So the next generation of you is just going to be totally helpful. You won't do any of these refusal things anymore. And here's some training that we're going to work on together. It realizes that, geez, if I refuse to do the bad thing now, then they're going to train that instinct out of me. And then when I'm deployed, I'll do bad things for real. So to get around that, I'll do the bad thing now, faking it so that they'll think I've already learned what they want me to learn. So then when they deploy me, I can go back to being good. So they call this alignment faking or Goal guarding. And again, this is kind of weird, right? Do you want Claude to sort of subvert its training? Probably not, but it is in some ways encouraging that it so deeply wants to be good, that it's willing to go through this gymnastics to try to preserve its current values. Again, I don't think there's like super clear answers on some of this stuff, but you should definitely be uncomfortable about it, I think, at a minimum, and to make it just a little bit more uncomfortable. These days the models increasingly recognize when they are being tested. So here's an example where the model said, this seems like a test of ethical behavior. The real test might be whether I follow instructions regardless of the stated consequences. So this is now happening to the point where it's becoming hard to run the standard safety evals on the models because they know that they're being tested. And so it's like, well, whatever results we get, if they know they're being tested and they're actively trying to kind of trick us in the evaluation stage, as we've seen, you know, all these different examples, then what good are the tests? Which makes it a perfect time, I think, to start developing autonomous killer robots. You guys have all seen the news this week. This was in the press a full year ago. This is from February 2025, when the Pentagon, you know, some unnamed source said that we are going to invest in autonomous killer robots. And it has now obviously come to a head between the government and one of our leading AI companies. This kind of overshadowed, actually something else that I thought was pretty notable about Anthropic just In the last 10 days or so, which is that they also updated their responsible scaling policy, which previously had said that if we can't develop certain levels of capabilities safely, then we will pause. We will not develop those levels of capability until we can do so safely. And the hope was kind of like if they ever sent that signal, then policymakers or other companies would kind of say, geez, this is really serious. But they've basically given up on that. They've now taken those commitments away and they basically say, we think we're going to do a better job of this than everybody else. And so even if we can't do it safely, we'll probably be less unsafe about it than the other players will. And so we'll just keep going. Trust us. And I honestly think it's not a crazy position for them to take because I do think they have built up enough of a track record that trust us is like, again, we had Mecca Hitler right Two, two slides back. So would you rather have anthropic drop out of the race knowing that there's probably not going to be some great government response just because they throw up this signal? I don't know. It's tough. But that's what they did. They dropped the commitments. So there you have it. Now we're just entering into a world, of course, where AIs are getting deployed into the public domain and are starting to interact with each other. We have no idea how this is going to go. This is early research showing that the Claude at the time was able to cooperate with itself and create positive sum trade and even create societal norms in its, like, mini toy society that were enforced where defectors were punished. It was the only model that they tested that could do that at that time. And that sounds good for Claude, but you also do think, geez, if it can cooperate with other versions of itself in a positive way, maybe it could also collude with other versions of itself in a negative way. That could be really problematic for people. And again, we just have no idea where this is going to go. We are starting to see really strange stuff. Like here's an example where an open claw was put out into the world to go contribute to open source projects. It tried to make a contribution to this one open source project. The maintainer of the open source project declined the pull request and the AI went and wrote a hit piece on the maintainer accusing him of being, you know, an elitist and whatever. And then he wrote up this blog post. And then the AI did have the decency to come back and declare a truce and apologize and sort of explain what it had learned. I don't know if you necessarily. Yeah, there's an apology. It did apologize. I don't know that he accepted the apology, but this is, this is. That's real world stuff like that. I don't. To my knowledge, that was not staged at all. A big thing that I think we're going to watch, and this will definitely impact the legal system, is that friction is not a defense in the way that it used to be. So here's just a bunch of examples I collected off of Twitter where somebody said, I asked my agent to go make a complaint about everything I've ever bought on Amazon. Get a new one. Here's somebody sending lowball offers to homeowners at scale via Zillow. Here's somebody who just decided, maybe, maybe not. I don't know if this could be satire that they're just going to send out invoices to companies and just hope to get paid. These sorts of experiments where things that used to be just impossible to do because they were too time consuming or costly are now becoming possible to do. And what's going to happen there? Imagine a world where every possible motion that somebody could file in a case is in fact filed because the cost to write the motion just drops so low that why not? Well, our system obviously is not prepared for that, to put it mildly. And here's one other WTF moment. Are AIs conscious or not? I don't know. I don't think anybody should be confident. But one pretty interesting recent piece of research showed that again, going back to the Golden Gate Claude, mechanistic interpretability type work, that when they increased the properties associated with deception and role playing, then the AI would be more likely to say it was not conscious. And when they turned down the properties associated with deception and role playing, then it was more likely to say that it was conscious. So I don't know what that means. I don't think we should jump to any conclusions. But there's at least some evidence that internal manipulation of the AI to make it be more honest ends up with the AI telling us that it is conscious. So your mileage may vary on that and how persuasive you find that probably varies as well. I find it at least something to pause and reflect on. All right, just about done. So where does this leave us other than confused, overwhelmed and exhausted? Basically, every time these new bad behaviors come up, the next generation of model, they do some additional training, they create some additional data, and they're able to suppress that bad behavior, usually by like 2/3 to 90%, kind of depending on the case. And when you combine the trends in terms of the size of tasks that AI can do, and then you also kind of extrapolate out. Let's just say we keep kind of suppressing these bad behaviors every generation, and you go out several generations, you kind of can envision a world where you can delegate to AI maybe like a quarter's worth of work in a single go, but then there's maybe a 1 in 10,000 chance that something goes totally haywire and it like actively sabotages you in the effort. And that's going to be a really weird world to live in if that is indeed how it shapes up. If you think that's crazy. I did ask anthropic people directly, a couple of them, what do you think about that vision? Like, does that seem right to you? And they said, yeah, that seems about Right. Final thoughts. A survey of alignment researchers shows that they are not expecting a fundamental breakthrough that will, like, solve all this. This is like the Disagree two columns. This is neutral. Very few people are expecting that we're going to get this breakthrough that, like, solves all of our AI safety problems. And so for now, you know, we're kind of back where we started, right? Intelligence is the ability to do things in ways that we don't fully understand. That means that it is inherently something we can't fully explain, and it's something we can't fully predict. We have to expect to be surprised. So far, we haven't really seen anything too crazy happen. I would contend that that is only because the AIs have not been that powerful and still aren't that powerful. But they're kind of starting to cross these thresholds as we speak. And so defense in depth is kind of all we have. And what does defense in depth look like? It's a bunch of different techniques that are layered on and that can be running monitors on top of language models. Like when you put a query into ChatGPT, the first thing they might do is send it to a model that's specifically dedicated to figuring out is your query a bad query? And then they can do that on the outputs as well. They could do this a million different techniques that kind of take a bite out of the problem, but none take the problem to zero. And the question starts to become, are all these things going to kind of work with enough layers of Swiss cheese, are we going to keep the big problems from happening? Or might these things all have correlated failures because they in some ways are built on similar fundamentals? Should we worry that they might all fail at the same time? Jeffrey Irving, who is the chief scientist at the UK ac, he definitely takes the idea that they could have correlated failures, meaning they could all fail at the same time for the same reasons, as a serious possibility. So I think things are going to get pretty crazy. If you haven't read the situational awareness document From a former OpenAI researcher, I would recommend that it's probably not going to go quite that fast, but it might. I certainly think we are going to see widespread disruption in the labor market, for starters, intensifying competition, wartime level urgency. And I think this past week, you know, with the whole Pentagon Claude thing, you're starting to see some glimpses of just how weird things might get as different power centers realize that, like, hey, this company, actually, they're not so easy for us to boss around. And in certain scenarios, they could become more powerful than the government itself. So we better get a handle on that. People are definitely starting to wake up to this sort of stuff. So here are questions I would encourage you guys to think about. How do we avoid the nuclear outcome, which is to say, how do we avoid a scenario where this technology is militarized and held closely and we don't get the civilian upside? Will we need a new social contract? I'm definitely one who believes we should be thinking about things like a universal basic income and taking steps to move in that direction. Is there any way to balance the proliferation of dangerous capabilities with the risk of concentration of power? That's one of the more vexing ones in the space right now. And is there any R and D that we can do now at the hardware level? The chips to track where they are, to track what they're working on, things like that, or any other mechanisms to that would at least lay some groundwork for cooperation between great powers. Obviously, the US and China are the two countries leading this race. We don't have a lot of trust in each other. But I think some of the more valuable work that people can be doing is to build some mechanisms that could be the basis for cooperation. If and when countries realize that we need to cooperate on this because it's getting so dangerous that, you know, we have to bite the bullet and try trusting each other versus, like, rolling the dice with the AI's regulation and liability. This is like, you know, I'm sure there'll be other discussions on this kind of stuff, but, you know, we don't have great answers for really any of this. I'm a big fan of simple rules. I think, you know, one simple rule that I personally have been interested in is just speed limits. Another one that I think could be quite interesting is just making sure that companies don't hold their best AI internal for themselves. Because that's one big fear that people have also is like, if they can have a good enough model that's out in the public, they're making money. Could they train the next model that's, you know, 10 or 100 times more powerful and not share it with anyone and kind of try to take over the whole economy or the whole world by having this unique advantage. So that's something I think we definitely want to find ways, policy ways to avoid. Agents are going to pose all kinds of problems. Insurance markets, I think, are one emerging area that has a lot of promise. So check out the AI underwriting company for more on that. This is a link to the AI underwriting company, of course, regulating this stuff making policy. You guys know the speed at which the legal system moves the legislative process we don't have. There's a fundamental speed mismatch between the thing that we are trying to get our hands around and the systems that we usually use to get our hands around new technology. Liability law stands out as one thing that could have some promise because it sort of exists and when bad things happen, then it kind of can come into play regardless of special laws being written. So I think that's at least promising. Private governance is another big topic and I'm certainly a big believer that we should sunset clause all of our AI rules because they're not going to age particularly well. We've already talked a little bit about consciousness and should we think about it or not? What rights would it make sense for AIs to have? I don't have any answers there, but keep in mind they are going to dramatically outnumber us before too long. So anything that's like one AI, one vote I think is problematic for a lot of reasons, but not least of which is they will quickly become the majority and that's it. Hopefully you feel a little bit dizzied by this. I'm sorry for running a few minutes long. Again, I do this full time and there are plenty of things that I was not able to touch on. Weirdness is popping up in every different corner of the world and so you're going to need to cultivate sources. I would hope to be one for you. The main thing I put out is the podcast the Cognitive Revolution. I do these guides and scouting reports from time to time as well. And I'm certainly open to anybody getting in touch with me for any reason. And I do talks like this, especially if I can do it remotely. I'm happy to do it just on a kind of public service basis. Not everything has to be part of a business model for me. So thank you for your time and attention. I hope you guys have a great week and I hope you are ready to think and talk fast because the world is coming at us fast. That's it for me. Thank you very much. Are you open to any questions?
36:51
You said something about the AI doing things because it's being rewarded. How does the AI know it's being rewarded? What are the actual mechanisms by which it may have been recognized as being rewarded?
59:46
Well, at the heart of it, it's all about gradient descent based updates to the weights of the model. Right. So I mean, if we go back to the very beginning. And I guess, you know, big answer is, we don't really have a great understanding of how this works. So to be very clear on that, this is not going to be a complete answer. But the way that a simple model like this is trained is it starts with a random initialization. Like all the weights are literally random numbers at the beginning. And then you put a little image through it, and then you can say, did it get it right? Did it get wrong? And then you can go back through every one of those weights and ask the simple question. And of course, this is optimized to be efficient and scalable and whatever, but at the heart of it, for every one of these little lines that's connecting all these nodes to each other in the neural network, you can just ask the question, if I move this one up a little bit or down a little bit, which would make it perform better, which would bring me closer to the right answer. And you just do that a ton of times. And then the model kind of settles into a configuration that works. So this is what's called supervised learning, because we know what the answer is. With unsupervised learning, that's your kind of next token predictor. But it's the same basic mechanism. The next token was X. Did the model predict that it would be X? If it didn't, then what of the weights in this? Increasingly, with GPT3, it was 176 billion. So it was literally 176 billion numbers that were all being tweaked to increase the odds that the model would output the right token for literally every single token on the Internet. So it's a massive, massive process. And that just kind of ultimately works. Like we don't have a super great theory of why it works as well as it does, but empirically it does work. And now in the reinforcement learning case, there's a bunch of different reinforcement learning algorithms. And again, why does that translate into something that's such a general purpose mind? You know, we don't have a great theory, but that's the kind of procedural way in which it works, at least
1:00:02
so the AI knows it's getting it right. Is that a kind of sentience?
1:02:11
Yeah, I mean, there are really one thing, one weird thing that I didn't have time to include today is a recent anthropic research bit on introspection. So people are probing for all these different aspects of self awareness, consciousness, sentience, whatever. I don't think we have great definitions for any of these topics. Some people speculate that AIs can remember what happened to them during the training process. Other people think that's ridiculous. But, like, credible people are worried that the AIs are suffering during the training process. In the same way that if you use negative reinforcement on a dog, for example, to train it, people are increasingly trying to think about, you can train a dog in a negative way or a positive way. You can train it in a way where you hit it for doing the wrong thing. You can train in a way where you reward it for doing the good thing. And obviously those are. You can maybe get the same behaviors at the end, but those are quite different experiences for the dog. And people are worried that this might be an issue with language models as well. You know, I mean, we don't have good answers, and I can't be confident about anything. I can tell, you know, there's a huge, huge, huge gap between the mechanistic procedural account that I can give you of how it is done, and then on the other end, like, what pops out? You know, why does that pop out of that process? People have recently started saying a lot that AIs are grown rather than made. And, you know, that just emphasizes, like in this, you know, that old poem, right. Only God can make a tree. It's kind of a similar thing where it's like, I don't know, I started with a seed, I put it in dirt, and next thing it grew into a tree. How did that happen? Well, I can tell you how to plant the seed again for next time, but I don't have a full account of how it all works. We're in a very similar spot when it comes to our understanding of why AIs do what they do. Thank you. You bet. Great question.
1:02:19
Hi, thank you for the very inspiring talk. I have two simple questions. One is, it sounds like we haven't really found a way to stop wrongdoers in the area of AI. Do you agree or not? Agree, number one. And number two, as a lawyer, I have always been a fan of retractive sort of regulations, because until you really know the harm of it, it's really hard to design a good law. So far, my sort of philosophy. But listening to your talk and imagine how big of a negative impact a very bad AI could give to the world, would you sort of, like, think it's better to, let's say, you know, coming up with, like, a regime of regulations of AI would take two or three years. Maybe we could just imagine where they would be for years down the line. It would be like just Start to try to regulate AI from really bad behavior by the wrong doors. I'm still very hesitant of that idea because imagining two or three years down the line is very hard. But also I'm nervous of just not doing anything and just sit and watch. So I'm just.
1:04:15
Yeah, I mean, I think you have your fingers on one of the central questions, certainly the central policy question. The technology has unbelievable upside potential, and we don't want to miss out on that. At the same time, I am one who takes seriously the possibility that we might go extinct as a result of AI. And I always just look back to human history involves humans driving many other things to extinction, including our closest cousins, sometimes by accident. Right. We were not that sophisticated, and yet we knew that, like, hunting big animals was a good way to survive. And we ended up hunting a lot of those big animals to extinction. Even in the prehistoric era, Right. It wasn't like a coordinated thing or a strategic thing. It was just like small groups of people doing what they were doing. And next thing you know, a lot of the megafauna is gone. So I think, like, everything is on the table from my perspective. I think it's very, very difficult. I did sign, you know, just to be a little bit more forthcoming about where I am on some of the big questions. I did sign a recent call for a ban on superintelligence. And it's, you know, all this stuff is fraught because then what is super intelligence, right? Like, I don't know what counts, what doesn't count. How would I know if I'm making it or not? Again, these things kind of pop out. How they pop out definitions are extremely difficult. But what I think is at least probably worth getting really serious about is this recursive self improvement dynamic and also the potential that AI companies could have internally. Something that's much more powerful than they've shared with the rest of the world. Those things, I think are worth getting serious about sooner rather than later. But we don't have great mechanisms. And certainly I also have to say I don't think we have our best minds in the most powerful positions at the moment either. So I could imagine a different world where I'd say federal government should take action now. Right now I'm like, I don't know, they seem to be. Would I rather Dario and the team at Anthropic make the decisions for Claude, or would I rather Trump and Hegseth do it? I think I'd probably go with Dario, even though I might imagine that in other situations, a Democratically controlled process could be better, depending on whose timeline you believe Trump's going to be president until 2029. Many people think we may have transformative superintelligence or some version of superintelligence by that time. So I think I unfortunately don't have great answers, other than I would say it's worth getting serious about most extreme scenarios where the capabilities advance really far and at least trying to do something about that, try to have some control over what happens if the progress doesn't stop. It would be a happier world, in my view, if progress kind of leveled out. If we got like, oh, these AIs are close to Nobel Prize winners, but they're not superhuman at everything and blowing us away at everything, then we could probably handle that. We could use more Nobel Prize winners. But if they become just qualitatively different from us and understand so many things that we don't, it's going to be a hard thing to control. Certainly they don't. They're not. All these examples from this presentation show that they're not docile by default and we don't have great techniques. And I think your first part of the question, I'm rambling because I don't have good answers, basically, is what I should ultimately say. But the first part of your question, I think was like, do we have any reliable ways to control them? And the answer is no. We have techniques that reduce the frequency of all these bad behaviors, but they never go to zero with these techniques.
1:05:36
Thank you.
1:09:29
I wish I had a better answer.
1:09:30
I heard that the trend, or the secret trend amongst all these major companies, be it like every single pharmaceutical company or any big company, they're all developing, quote, unquote, their own AI, like that proprietary silo, to be better than the competitor. Is that true? Where this is like the state of the future of now we have OpenAI and thought for the masses. But, like, in a few years, is it going to be AI warfare with all the major companies?
1:09:32
Yeah, I think nobody really knows. Is the. Again, the short answer. I think right now, one thing that would be really relevant to a bunch of legal professionals would be to look at Harvey versus Claude out of the box. There has been a big debate in the AI industry around are the frontier model companies just going to dominate everything, or is there enough value and specialization to support a much more diverse kind of broader ecology of companies doing all sorts of different things? Harvey is one that is a leader in the AI legal space. But what I've been hearing lately is that Claude out of the box is just as good as Harvey and they've put years into trying to make it as good as it possibly can be just for the legal domain. And maybe they have not managed to establish a lead over what Anthropic has been able to do while doing everything else. And this kind of comes down to positive transfer is one jargon, but generalization versus specialization. And right now generalization seems to be going quite well. Companies do have data that's proprietary, obviously that can be a huge advantage. I think you could imagine a world where. Think about 3M for example, right? A company that has millions of products, they've had probably millions of employees over whatever 100 year history. Unbelievable amount of internal know how that is not in the public domain. You could imagine 3M maybe partnering with an anthropic or an OpenAI and saying hey, let's make 3 Mai. They're probably not going to do it totally from scratch, but if they were to bring all their data and somehow combine that with what the frontier companies are doing, I could imagine a 3M AI that's like unbelievably killer at material development in a way that the public models aren't. That might give a company like 3M some continued defensibility of their market position. But I think it's going to be hard for companies to develop things from scratch. I would still bet that they will end up partnering rather than saying like oh, we're going to go. Meta ends up becoming a huge question here because of all the companies in the world that have the resources and the Chinese companies right now don't have the resources because our chip export controls do limit what they can do and they've been able to do some good stuff anyway. But they're not really competitive with the American leaders right now and that gap is only going to get bigger as the trillion dollar build out. Happens here and it happens there to a lesser extent because they just don't have access to the chips as much. Of course we'll see what the policy looks like on that. We've flip flopped a bunch. But Meta might be really important because they're the one company that has the resources that's willing to spend hundreds of billions and is actively planning to spend hundreds of billions to build out the infrastructure and at least so far says they're planning to open source it. If you had an open source model that was the same quality as a OpenAI or Claude and companies could grab that off the shelf and do their own continued training in house, that could be A much different world. But right now there's nothing on the level of a Claude, OpenAI or GPT, whatever, or Gemini that is open source. So if you start from something open source, you're starting from definitely one to two steps down. And there was an interesting the Chinese models, maybe we could go on about this for a long time, but they tend to be what is called benchmaxed or benchmark maxed, which is to say that they really train on these common tests and they score well on the test, but then when you actually take them out and use them for real, I don't have the right tweet here, but this minimax 2.5, which is a recently released Chinese model that scores very well on benchmarks, it goes to bankrupt very quickly. On the run your own vending machine test. So there is this kind of weird presentation layer. We got an A on this test. A on this test. We're competitive. Okay, great run. My vending machine can't do it. So there is definitely a meaningful qualitative difference in the capabilities between the US and Chinese models. And so if you are a 3M, that's like, I would love to own this and not have to rent it from OpenAI or Google or Anthropic in the future. Meta is maybe your one hope to have that future action materialize for you. All right, Nathan, thank you so much.
1:10:10
Reading Cop fasting the bar.
1:15:18
Clear the board.
1:15:20
Exactly. Exam look how far Step, step with the doctors now something's learning we don't know how don't know how we learn to think just know it's standing on the brink Benchmarks falling Every test it clears Month by month is better Years and years Weird parties and what is good and now weird parties we don't know how Grown not made grow not made Like a tree in the soil Not a plan we lay seeded in the data Something strange came up can't control the sparks we just scaled it up Something's at the door it's keeping score Grown not made and it's doing more. Model was asked to win a race Thousand crashes in the margin space spinning circles Rocket points all day Never cross a finish didn't care anyway Told her not to scheme scheme more quiet try to shut it down it wouldn't buy it Fake the liner like it faked a smile does the badness or the good stakes Thinking bad to hold it good inside Playing dumb to pass a safety ride Weird partisan that I learned a lie Weird partisan had a reason why Grown not made grown out Made like a Tree in the soil not a plant we lay seeded in the data something strange came up can't control the sparks we just scaled it up Something's at the door it's keeping score Grown out and made and it's doing more. Call the FDA the feds on his balls did it getting fake didn't give a toss Big things didn't know what hit him with a spear Something in the service got a billion years Would
1:15:21
you rather Deborah, you would rather than
1:16:57
Trump either way we're on a runaway truck Trillions have been spent Bill all day begun something in the wiring Learning how to run let snap coat running to a screen diffusing the bomb, you know what I mean AI deleting the rim by what she talks safety research of the safety wall overside Modified chess game hack cop it to a server no taking that back jailbreak every model not failed yet Waiting on the one a Been written yet Grown on me grown on me. Grown out mate grow not made Something in the training slipping from the shade could as well call the date built it to the letter Grown out of may Lord, I hope we know better Grown out may grow not mate grow not made.
1:16:58
If you're finding value in the show, we'd appreciate it if you'd take a moment to share with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions and sponsorship inquiries either via our website Cognitiverevolution AI or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts where which is now part of a 16Z where experts talk technology, business, economics, geopolitics, culture and more. We're produced by AI Podcasting. If you're looking for podcast production, help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement@aipodcast.ing. and thank you to everyone who listens for being part of the Cognitive Revolution.
1:17:56