Is AI About to “Eat Everything”? | AI Reality Check

32 min

•May 14, 2026

Summary

Cal Newport analyzes the METR AI time horizon chart showing rapid improvements in AI coding capabilities, explaining that the exponential-looking curve reflects focused post-training efforts on programming tools rather than general AI advancement toward superintelligence. He argues the chart measures specific software development progress, not an intelligence explosion, and calls for AI companies to distance themselves from transhumanist and existential risk communities that misinterpret such data.

Insights

The METR chart measures AI+coding harness performance on specific programming tasks, not general capability—a crucial distinction lost in viral interpretations claiming AI will 'eat everything'
The recent jump in AI coding performance (2025-2026) stems from post-training optimization and sophisticated hand-coded harnesses, not fundamental breakthroughs in model architecture or pre-training
Different AI applications require exploration of separate 'tributaries' with distinct challenges; success in programming tools tells us nothing about viability in other domains like email management or general reasoning
Transhumanist and existential risk communities have disproportionate influence on AI company messaging and public perception, causing unnecessary anxiety through exponential extrapolation without technical grounding
AI companies need to adopt normal technology communication standards—discussing concrete capabilities and limitations rather than speculative superintelligence scenarios

Trends

Shift from pre-training to post-training optimization as the primary driver of AI capability improvements in specific domains
Increasing importance of domain-specific engineering (coding harnesses, scaffolding) over raw model capability for practical AI applications
Growing disconnect between technical reality of narrow AI improvements and public perception shaped by transhumanist narratives
Professional software development emerging as a viable commercial application for AI tools after 2+ years of generative AI searching for killer apps
Need for AI industry professionalization and separation from speculative communities as companies approach public markets
Benchmarking challenges in AI evaluation: difficulty in creating meaningful metrics that don't mislead stakeholders about general capabilities
Hand-coded expert systems and 1960s-style AI logic becoming critical components of modern LLM-based tools, not obsolete
Incremental, application-specific progress in AI replacing earlier assumptions about exponential general capability improvements

Topics

METR AI Time Horizon Chart Analysis
AI Post-Training Optimization Techniques
Coding Harnesses and Scaffolding Systems
AI Benchmark Interpretation and Limitations
Programming Task Difficulty Measurement
Reinforcement Learning for Model Tuning
AI Reasoning Models (O1, Claude Sonnet 3.5)
Software Development Tool Integration
Transhumanist Influence on AI Narratives
Existential Risk Community Impact
AI Capability Extrapolation Fallacies
Domain-Specific AI Application Development
LLM Pre-Training vs Post-Training Trade-offs
AI Industry Communication Strategy
Generative AI Commercial Viability

Companies

Anthropic

Developed Claude models (Opus, Sonnet) shown on METR chart; built Claude Code harness with sophisticated hand-coded logic

OpenAI

Created GPT models and O1 reasoning model; first to discover pre-training scaling limitations in summer 2024

METR (AI Safety and Evaluation Organization)

Released the time horizon chart measuring AI programming task capabilities; provides transparent methodology document...

Google

Mentioned as creator of Codex and involvement in AI coding tool development alongside other major AI companies

People

Cal Newport

Analyzes METR chart and AI industry trends; argues for realistic AI communication standards

Gary Marcus

Compiled concerned social media responses to METR chart update in recent newsletter essay

Dario Amodei

Mentioned as AI leader who should distance company from transhumanist and ex-risk communities

Sam Altman

Mentioned as AI leader who should distance company from transhumanist and ex-risk communities

Elon Musk

Mentioned as AI industry figure influenced by transhumanist narratives

Ray Kurzweil

Originator of transhumanist exponential thinking that influences current AI narrative interpretation

Max Tegmark

Wrote 'Life 3.0' introducing the 'water level' mental model for AI capability that Newport critiques

Quotes

"The time horizon is closer to what a low-context person, such as a new hire or remote Internet contractor, can accomplish. An eight-hour time horizon does not mean that AIs can do eight hours of work that a high-contact human professional can do."

METR (quoted by Cal Newport)•~20:00

"It's a measure specifically of the efforts of the AI companies to try to build programming tools. It's a measure of programming tools. And in the last couple of years, they made some great breakthroughs."

Cal Newport•~45:00

"You don't really know in advance how navigable each of these tributaries are until you go at it and give it your best effort. Some of these tributaries end up being very navigable. I think the computer programming example is a tributary that ended up to be very navigable."

Cal Newport•~48:00

"I think the AI companies are now at a size and importance that they need to start to distance themselves from these cult-like communities. They've become too big and too important to be mixed up with these schools of thought."

Cal Newport•~65:00

"We're not destroying the world. AI is not going to eat everything. You need to stop talking about all these exponential worshiping cults."

Cal Newport•~70:00

Full Transcript

Last week, the AI safety and evaluation organization METR, that's M-E-T-R, released a new update to their famous AI time horizon chart. Look, I'm going to load it on the screen here for people who are watching. And when you zoom in, you can see these points on this chart starting around 2025 begin to go up. And then when we get to 2026, they go way up, and then in the last update they go way up again. Now, this graph looks scary. Even if you don't know what it means, it does create a strong sense of digital ick. And as you can imagine, the internet jumped into action to try to amplify that uneasy feeling.
Now, in a recent essay posted to his newsletter, Gary Marcus did a good job of rounding up some of the more, shall we say, concerned responses to this latest update to METR's graph. Let me show you a couple here. Here's one, a tweet that said, "AI power is doubling every 103 days now. It's going to eat everything. Nothing will be spared. We are on the threshold of truly ergodic alien intelligences in which human input will be nothing but a liability."
All right, here's another example that Gary pointed out. The tweet simply says, "Tick tock." It has an expertly drawn graph that shows highest intelligence on Earth by time, and you see there's a point where it goes up, crosses a tripwire where human brains become smart enough to create ASI, which is artificial superintelligence, and then shoots straight up. Then below it is a version of that time horizon graph, and they're like, look, doesn't that look similar? The line goes up, the line goes up. So I guess we're about to have artificial superintelligence conquer the world.
Now, there are many more tweets out there in response to this time horizon update. They all give you the same sense that this METR chart is capturing an intelligence explosion that, A, we're not ready for, B, will change everything, and C, vindicates every bold or crazy thing anyone has ever claimed about AI and its capabilities. But is this right?
Well, it's Thursday, which means it's time for an AI Reality Check episode of this show, which seems like a perfect time to look closer at what exactly the METR time horizon chart is showing and what exactly that means. As always, I'm Cal Newport, and this is Deep Questions, the show for people seeking depth in a distracted world.
All right, so the first question we want to ask here is: what is it exactly that the METR time horizon chart is actually showing? So I spent time reading about it. The good news is METR is actually very transparent. They publish very detailed collections of notes describing their methodology and what goes into their chart. So it was actually quite a pleasure to get answers to these questions.
So what are they actually showing on this chart? Well, here's what they did. They came up with a collection of what they call software tasks. These are well-defined challenges that you can solve by writing and/or analyzing computer code. All right. Then for each of these tasks, they went out and asked a collection of human programmers to go do the task. Hey, go do this. I believe the instruction was as quickly as you can. And then they asked them, how long did it take you to complete this task? They would then take the geometric mean. So they would average those answers, and whatever the mean was is the human time duration that they would label that task with. So if, on average, it took people two hours to complete a given task, they would say, this is a two-hour task.
They then said, let's evaluate different large language model-based tools on these tasks. Now, of course, a given large language model can't do anything except spit out tokens. So they would take each large language model and combine it with what they call a scaffold, but what we would today also call a coding harness: a program that can call the LLM to try to solve programming challenges. This is like Claude Code or Cursor or Codex.
These are all coding harnesses. So the coding harness, when you give it the problem, for example, will query an LLM and say, give me a plan for tackling this problem. The coding harness will then go step by step through that plan. It will call the LLM when it needs code generated. But the harness can also do a lot of stuff on its own, like run checks. It knows how to interact with various software tools that are relevant for creating software. It can go back and verify, hey, did this step really work? The coding harnesses have gotten pretty complicated. More on that later.
So they'll take an LLM with a coding harness, and one by one, they'll ask it to do each of these tasks. And they actually have it do each of these tasks six times. And if it completes the task at least half the time, they say, okay, this model plus this harness can tackle that task. And they keep going until it gets stuck. So they say, what was the longest-duration task that this LLM plus a coding harness was able to complete at least 50% of the time? And that's what they plotted.
So let's go back now to this plot to make that a little bit more clear. This is a little bit confusing sometimes. So we see on the y-axis here the duration that various tasks are labeled with, right? So "fix bugs in a small Python library" was labeled a little bit more than an hour. That's how long it took the humans who completed the task to actually finish it. "Exploit a buffer overflow" took a little bit more than two hours, et cetera.
Okay, so let's zoom back out here. We've got the chart is trying to, look at this. With all the AI technology we have in the world, the biggest problem we're having is the chart itself isn't loading. Let me just do a quick reload here. It's ironic, I think, Jesse. All right. There we go. Let's go back to the linear scale. So then what they're plotting for each dot here is actually the name of a model, right? So like over here is Claude Opus 4.5.
And where they're plotting the point for Claude Opus 4.5 corresponds to the longest-duration task it was able to complete. So if we click on that, we find out that four hours and 53 minutes was the length of the longest task that it was able to successfully complete at least 50% of the time, right? So they're plotting each model against the longest time-duration task that it was able to successfully complete at least 50% of the time. And the x-axis is time. So they're placing each model at the time at which it was released and then plotting it at how long a duration task it was able to actually complete. All right?
So it's a little bit confusing. If we zoom out, it's just line goes up. I guess that's scary. But what's really going on here is they're trying to capture: are these models able to tackle tasks that require more and more human time to complete as we move on to more and more advanced models? And that is, in fact, what they're showing, where we start to get the speedup around 2025 that then really picks up around 2026. All right. So that's what's on the plot.
You can also look at 80% success, where they plot each model at the duration that it could complete 80% of the time. This curve looks similar, but if we look at the y-axis, we see it's much smaller. So if you need it to be successful 80% of the time, the very best model, Claude Mythos Preview, is able to complete a task that took the humans roughly three hours, whereas if 50% of the time is enough, it's completing a task that takes, you know, 16 hours. So it does make a difference how successful you need it to be.
All right. So that's what's on this chart. Hopefully that makes sense. A little confusing. What is that actually capturing? So now that we know what is on this plot, what does it actually mean? Well, first of all, now that we know this, here's our first observation.
METR is not measuring the general capability of these LLMs, right? They're looking at a specific suite of programming tasks. This is a mistake a lot of people were making. When you see Opus 4.6 labeled with 12 hours on this plot, that doesn't mean that Opus 4.6 can now do whatever it would take a human 12 hours to do. No, it means there's a particular software task that required on average 12 hours for human testers to do that Claude Opus 4.6 can now complete accurately about 50% of the time. So it's not telling us anything about the general capabilities of these models.
It's also not measuring the general programming capabilities of these models. In other words, the fact that Opus 4.6 is on the y-axis line for 12 hours doesn't mean that that model with the right coding harness can now successfully do any programming task that would take humans 12 hours. That's another thing I hear often: well, now our models can do work that used to require 12 hours for people to do. Again, it means there's a particular programming task that, when it was given to a collection of humans to complete, took them 12 hours, and that this model can now complete on its own. We don't know how long it takes the model, but it can complete it on its own correctly 50% of the time. So it's really measuring a specific collection of software tasks.
Now, what about these durations themselves, though? I mean, are they meaningful? What does it mean that a task took the humans they tested 12 hours? What meaning do we get out of that? And here we have to be careful. It's not really clear what the specific number means. METR acknowledges in the notes that go along with this study that it's kind of hard to put a precise meaning on this, right? Because when you say this particular task took humans 12 hours, what does that mean? What were those humans doing for those 12 hours? And they're clear about this. Like, well, it's not clear.
It could be they're spending this time looking up what the task even meant, or trying to learn the techniques needed to do that task, right? Maybe they're having to learn a new programming language. Maybe they've never done something like this before, and they're on the internet for six hours trying to learn what it is. We don't know what this time is actually being spent on, and this is what METR says. I'm going to quote from their study here: "The time horizon is closer to what a low-context person, such as a new hire or remote Internet contractor, can accomplish. An eight-hour time horizon does not mean that AIs can do eight hours of work that a high-context human professional can do as part of their day-to-day job." So they're saying, look, we don't really know what to tell you about these numbers precisely. It's just that some people, when we gave them this task, it took them a while.
So probably the right way to think about what is being measured here is something like a general benchmark for programming capability. And with these time durations, you should not get caught up with the particular hours, but just think of them as an abstract measure of difficulty. So I don't know what these low-context human programmers were doing, but if this task took them twice as long as that task, then maybe that's a twice-as-hard task. So we don't know exactly what this abstract scale tells us, but it's a good general way of capturing the hardness of programming tasks, which is smart, right? It's a good way to do a benchmark. We took a bunch of programming problems, we found some way to measure how hard they are, and what we're asking is how far can this new model get in these tasks? How hard a task can this new model actually complete? And so if we see this model can complete a harder task than that model, we're like, oh, this model is better at programming tasks. That's probably the right way to think about what METR is doing.
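[Editor's sketch] The procedure described above (label each task with the geometric mean of human completion times, run the model several times per task, and report the longest task it passes at a given success rate) can be written down in a few lines. This is a minimal illustration of the transcript's description only, not METR's actual code, and their published methodology may differ in its details. The task names, times, and run results are made up, and `time_horizon` is a hypothetical helper name.

```python
import math

def geometric_mean(times_hours):
    """Geometric mean of human completion times: exp of the mean log-time."""
    return math.exp(sum(math.log(t) for t in times_hours) / len(times_hours))

def time_horizon(tasks, runs, threshold=0.5):
    """Longest human-time label among tasks the model passes at >= threshold.

    tasks: {task_name: [human completion times in hours]}
    runs:  {task_name: [True/False per attempt]}  (hypothetical results)
    """
    passed = [
        geometric_mean(times)
        for name, times in tasks.items()
        if sum(runs[name]) / len(runs[name]) >= threshold
    ]
    return max(passed, default=0.0)

# Made-up numbers, for illustration only.
tasks = {
    "fix_small_bug": [0.5, 1.0, 2.0],      # geometric mean = 1 hour
    "exploit_overflow": [1.0, 2.0, 4.0],   # geometric mean = 2 hours
    "build_service": [4.0, 8.0, 16.0],     # geometric mean = 8 hours
}
runs = {
    "fix_small_bug": [True] * 6,                                   # 6 of 6
    "exploit_overflow": [True, True, True, True, True, False],     # 5 of 6
    "build_service": [True, False, True, False, True, False],      # 3 of 6
}

horizon_50 = time_horizon(tasks, runs, threshold=0.5)  # all three tasks pass
horizon_80 = time_horizon(tasks, runs, threshold=0.8)  # only the first two pass
```

With these invented inputs the 50% horizon comes out at about 8 hours and the 80% horizon at about 2 hours, which mirrors the point made earlier that the required success rate changes the headline number a lot. Note also that the geometric mean is not the ordinary average: for times of 4, 8, and 16 hours it gives 8, while the arithmetic mean would give about 9.3.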
I think it's a well-designed benchmark of how capable models combined with harnesses are at various programming tasks. And that's the right way to read those numbers, more like abstract difficulty numbers. The actual times aren't that meaningful.
All right, third question. How are these models getting better? Why are we seeing these jumps? Well, I want to jump back into the chart here because the timing is really important, and it connects to some important things you need to know about how LLM-based coding models work and what has happened in the last two years. So if we look at this chart, I'll bring it back up here. Notice it's flat for a long time, from GPT-2 all the way to this point right here. Basically, they can't do anything, right? They can't complete any meaningful, interesting task that was part of this coding suite.
Now, there's a reason for this. This first point right here, which is like Claude Sonnet 3.5, and then we get O1 preview, is where we first start to see a rise to, you know, hey, there's some of these tasks we can complete. Up to that point, what was happening with LLMs? The focus was on pre-training. Pre-training is that long, expensive period where you use real text written by humans, and you have the model try to guess missing tokens. It's where all of the primary smarts and intelligence and capabilities of language models come from. From GPT-2 through the attempt to make GPT-4.5, we were just trying to make pre-training longer, more data, train it longer, and wanted to make the general capability of these models better. We didn't need benchmarks like this so much to know that 3 was better than 2 or 4 was better than 3, because we could just demonstrably see, without much work, all these new things the models could do that the last ones couldn't.
As I wrote about in the New Yorker last August, they hit a wall in the summer of 2024. OpenAI discovered this first.
Other companies had the same realization over the next year or so. It became clear that simply scaling up pre-training, the quality of data, the quantity of data, and how long they train, was not giving obvious new leaps in the capabilities of these models. So it created a shift in how they thought about improving these models that we really began to see in the fall of 2024. And the shift was towards post-training, where they said, okay, we're going to take a pre-trained model, and now we're going to get very particular, narrow data sets where we have prompts and correct answers, and, using complicated techniques based on reinforcement learning, we're going to tune that model to use the intelligence it already has to get better responses for very narrow types of problems. So we're now going to start focusing on particular problems where we have really good right-and-wrong-answer data. And we're going to start tuning these pre-trained models to try to do better in these particular areas.
And so they surveyed the landscape of particular areas in which they could tune these models to do better. And one area that seemed really clear was computer programming. Programming languages are highly structured text. It's actually easier for a language model to produce computer code than it is to produce English. So we knew from the very beginning of these large language models that they're very good at producing code. It was just hard to prompt them to produce exactly the right code you needed and to be consistent about it. Starting in that fall of 2024, when we began tuning these models, they started to become better at not just producing one small bit of code, but producing longer, more coherent pieces of code.
This is where we begin to get, we see here, like, O1 and Sonnet 3.5. We get these early reasoning models, which were models where they said, okay, after you've been pre-trained, we're going to tune you to try to give longer answers, to sort of think out loud. And because we generate answers autoregressively, we always look at what we've output so far before producing the next token. So by thinking out loud, we're basically going to have more expensive computation but be more likely to get better answers. So that began to help planning. If you asked one of these LLMs to come up with a simple plan for solving a problem, once we went to reasoning, they became a little better. They also began tuning them on computer code that actually works. And so the code they produced was starting to have a higher quality as well. So that's where we begin to get this sort of move up the curve on these programming tasks.
Then we start to get these massive jumps. So there's this really big jump with Opus 4.6 and the Claude Mythos that follows it. This corresponds with the period starting in late 2025, early 2026, when suddenly you saw professional computer programmers begin using agentic coding systems. So here's the other thing that happened. We started tuning these things to be better at producing plans and to produce sharper code. We also, and by we I mean the AI companies, began to work really hard on the coding harnesses: the programs that you hook up to the LLMs to make the plan, produce the code, check things, and connect with the various systems that professional programmers work with. And they began building out these coding harnesses to be better and better at solving the types of software tasks that professional programmers face.
And here's the key thing. We know about this because, ironically, the company that produced the new model that could detect all vulnerabilities had a vulnerability, and they leaked the source code for their coding harness, Claude Code.
So we know what's actually in the source code for Claude Code. And there is a ton of hand-coded logic, just humans sitting there building this thing, tinkering from scratch: pattern recognition, giant if-then statements. It calls all sorts of external tools. There is a lot of old-fashioned 1960s-style AI logic built into this coding harness, where it's just the expertise. They used to call these expert systems. The computer programmers at Anthropic building the Claude Code coding harness are sitting in there just working out, how do I build a tool that's as useful as possible for actually producing computer code, to solve problems in the way that computer programmers solve them? So it's a mixture of an LLM that can produce better plans and sharper code, plus a year or a year and a half of working on these monstrous coding harnesses with all of this hand-coded expert-system-type logic, just to be good at the very specific types of tasks that computer programmers face. And it's when those things came together and crossed a certain threshold of utility that we got the takeoff we see with, like, Claude Opus 4.6.
Now here's the key. METR is testing the model plus the best coding harness they can find. So what you're seeing in this graph is not just the fruits of post-training these models, which are pre-trained in the same way they were in 2024, but also the fruits of these incredibly complicated, hand-coded, old-fashioned, 1960s-AI-style coding harnesses that they put on top of them. So a huge amount of energy in the AI industry went towards trying to solve this problem, because there's a good market there: how do we build tools to help computer programmers? And then those tools could now handle these much more complicated problems that require many more steps or much more time. Well, this is exactly the type of thing these coding harnesses plus tuned models were being optimized to do.
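[Editor's sketch] To make the division of labor between model and harness concrete, here is a toy version of the loop described above: the harness asks for a plan, requests code one step at a time, and applies its own hand-coded check before accepting anything. Everything in it is an assumption made for illustration. `fake_llm` is a canned stand-in for a model call, `compiles` is one trivial example of a hard-coded check, and none of this reflects how Claude Code or any real harness is actually implemented.

```python
from typing import Callable, List

def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; returns canned text. Purely hypothetical."""
    if prompt.startswith("PLAN:"):
        return "write add function\nwrite test"
    if "add function" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "assert add(2, 3) == 5\n"

def compiles(code: str) -> bool:
    """Hand-coded check: does the generated text parse as Python?"""
    try:
        compile(code, "generated", "exec")
        return True
    except SyntaxError:
        return False

def run_harness(task: str, llm: Callable[[str], str], max_retries: int = 2) -> List[str]:
    """Ask for a plan, then generate and verify code one step at a time."""
    plan = llm(f"PLAN: {task}").splitlines()
    accepted = []
    for step in plan:
        for _ in range(max_retries + 1):
            code = llm(f"CODE: {step}")
            if compiles(code):  # verification lives in the harness, not the model
                accepted.append(code)
                break
        else:
            # Every attempt failed the check, so the harness gives up on this task.
            raise RuntimeError(f"step failed verification: {step}")
    return accepted

pieces = run_harness("implement and test add", fake_llm)
namespace = {}
exec("".join(pieces), namespace)  # runs the generated function and its test
```

The point of the sketch is structural: planning and code generation come from the model call, while retries, checks, and the decision to accept or reject a step are ordinary hand-written control flow. A real harness would replace `compiles` with far richer checks, such as running tests or calling external tools.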
So they're jumping up this plot because we really started caring about this for the last two years, and in particular over the last year or so, the industry really focused on this particular problem.
Okay. So is that a bad thing? Is this a fraud? No, this is actually very impressive from a technology perspective. There was this long period, for a couple of years, with generative AI where the concern was, what's the killer app here? Like, this is really cool. I like asking chatbots things. It's really impressive, but where are we going to make money on this? What can this actually do? And for about two years, that was the question. And they originally thought the answer was going to be, we'll just pre-train these things until they're AGI and we can just do anything with them. That didn't work. But they pivoted. They pivoted in late 2024 to say, okay, we need to start tuning these for particular uses and building tools around them to try to solve particular problems. And they said programming, not just "vibe code me whatever," but professional-quality programming using professional tools over multiple steps, there's a market for that. And they were right.
And they really worked hard on this. And it's impressive technology. These harnesses are impressive technology. The tuned-up LLMs are impressive technology. There's a huge amount of data you can use to tune an LLM to be better, in particular at producing compilable code. But really, the coding harnesses, I think, are the story of the exponential leap here, because they really figured out we can't just trust the LLM to come up with the right plans. We can hard-code a lot of logic because we know a lot about programming as programmers. And these are really interesting, cool tools. It's still shaking out exactly how they're going to be integrated into software development. But it's a real success story.
They found a lane for applying this technology that would have real commercial viability. And it worked. And that's what we're seeing in that chart.
But does this mean, as was being claimed in all those tweets we looked at at the beginning of this episode, that AI is about to, quote, eat everything? Does that chart going up in the last year with two points mean that we are inevitably on a crash course towards artificial superintelligence, losing all the things we care about, and becoming inputs to an ergodic alien intelligence? Well, clearly the answer there is no. It is a measure specifically of the efforts of the AI companies to try to build programming tools. It's a measure of programming tools. And in the last couple of years, they made some great breakthroughs. We have two data points to capture that.
I think partially what's going on here is there are two different mental models for thinking about AI improvement. And depending on which mental model you adopt, it really affects the way you end up thinking about things like the METR chart. The first model, which I think is very common and is wrong, is a model I first saw in the Max Tegmark book, Life 3.0, which predates all this generative AI stuff. It's this idea that you can imagine that AI capability is like water. As AI in general gets better, the water level rises. And this water level is rising over a sort of mountainous plain. The taller the peak, the harder the problem. And so as the water rises above a certain peak, AI can now solve problems of that difficulty. And then as it rises farther, it can solve problems of the next difficulty. That's the way a lot of people think about AI. And so when you look at the METR chart, it looks like, look, whatever this is measuring, we see the water level going up fast. So the hardness of things AI in general can solve is really improving. Oh, my God, it's going so fast.
If we extrapolate that out for another three years, then we're going to have the water above all the mountains and AI is going to be able to do everything. That's the wrong mental model for how generative-AI-based tools work.
A better model is to think about AI progress as a river. As you go down this river, you see these various openings for tributaries, like little streams coming into the river. Think about each of these tributaries as a potential application of the AI technology, a particular area where you could build tools on the AI for it to be useful. You don't really know in advance how navigable each of these tributaries is until you go at it and give it your best effort, and you portage over the rapids and see how far you can get, like it's, you know, Henry Hudson in the 17th century. Some of these tributaries end up, if you really try hard, being very navigable. I think the computer programming example, the software development example, is a tributary that ended up being very navigable. It's like, oh, we found the Hudson River. This thing keeps going. This is really important. But one tributary being navigable doesn't necessarily tell you anything about another, unrelated tributary. So maybe we go down this other one over here: oh, I'm going to build AI that's going to handle all my email. We didn't get very far. It's rapids, and then it becomes really shallow, and it just kind of disappears. Oh, okay, we'll try another.
That is what it is like trying to find applications based on AI technology. You have to explore different tributaries, and that requires building these custom tools, these harnesses. It's really hard. It took two years of concerted effort with experts at computer programming to build harnesses for computer programming, and you don't know how far it's going to go before you hit a dead end. So there is no notion.
And no one says, hey, I got 100 miles up the Hudson River, so I'm now going to assume any other opening I see is going to be one that I can go 100 miles on. They're different tributaries. They have their own challenges. Some are better suited for navigation than others. And that really describes our current moment.
Now, we know this in part because, I'm going to load up a tweet here, here's a tweet from Ramez Naam. I saw this in Gary Marcus's newsletter as well. Here we have another index of AI model capabilities, called the Epoch Capabilities Index, or ECI, which covers not just computer programming but a bunch of different things. And now, look, here's GPT-4, and then here are the latest ones over here. We have like a linear increase. It's noisy. And the jumps here over the last three years are slow and steady, right? These are the same models that on the programming test were jumping up on an exponential. So it depends what you're trying to do with them.
All right. So I think an important way to think about this is: what application are we talking about? And I think we should treat these applications like normal technology, which means we can be excited about the things that they can do, but not extrapolate wildly about the things they can't. If you're a software developer, you can be super interested in these tools. They went all in on it, and I think they're making really interesting progress. But if you're not a software developer, that doesn't really matter to you right now. And it doesn't tell you anything about what AI is going to do in your particular corner of the world, if anything.
All right, so to wrap this all up, let's go back to those hysterical tweets from before. AI is going to eat everything. What's really going on here? Well, partially it's the wrong mental model. It's the water rising instead of exploring the tributaries. I think that's a big part of it.
But I also think a lot of these people that are making those really over-the-top tweets, they come out of a community that's known as the transhumanists. So we talked last week about the rationalist community, and a subclass of the rationalists were the existential risk people, and these really influenced the big AI companies. But there was another group that intersects the rationalists, which is the transhumanists, and they really came out of Ray Kurzweil's work, which looked at exponentials. And it was looking at exponential increases in processing power of computer chips. They said, well, if we extrapolate this exponential, computers will be so powerful that we'll be able to upload our consciousness into machines and we'll be in a utopia. Transhumanists love this idea of following exponentials wherever they find them, extrapolating them out, and then saying, well, if we get all the way out there, life as we know it will literally be changed. And sometimes they're super utopian and sometimes they're super dystopian, but they get meaning in their life. Their religious cult is one of exponentials delivering transcendence or destruction. It's eschatological, right? I mean, there's heaven, there's hell, and a giant event is going to happen to deliver one or the other. It's a story that goes back to the very beginning of written stories, and the transhumanists love this story. They move from exponential to exponential, and so when they see something like the METR graph, which has an exponential (now, honestly, it's two points, but okay), they say, yes, this is going to deliver our doom or our salvation, because that's the way they want to see the world. So the transhumanist and the existential risk communities, these have become mixed together, and they've been very influential in the way that the AI companies talk about their technologies. They've been very influential in the internet chatter, and they've caused a lot of anxiety.
So I want to end this, you know, and again, I hope I'm being clear here. I'm not trying to be skeptical about the value of programming tools, but I want to be able to just talk about tools like a normal person. Right? Like when someone showed me the first useful electric car, we were able to say, oh, that's cool. Like that thing, it drives really nice. It doesn't use any gas. It's simpler. It's not going to break down as much. What a cool technology. We could just say that without being like, I'm extrapolating, and soon cars will be going the speed of light and we're all going to worship car gods. Like why do we have to go crazy? Why can't we just look at a tool and say, that's really cool. Let's see what happens next. Here's my call now. Here's what I think has to happen. I think the AI companies are now at a size and importance, especially as they're considering going public, that they need to start to distance themselves from these cult-like communities. I think the major AI companies need to distance themselves from the extreme x-riskers. I think they need to distance themselves from the transhumanists. They've become too big and too important and too influential to be mixed up with these schools of thought that have so many other out-there, exaggerated, or straight-up problematic elements to them. I really do think we're going to see in the next year or so a separation by the AI companies, by Dario Amodei and Sam Altman and Elon Musk. We're going to see a separation in the way they talk about their technology to the public, to consumers, to the investors, from the way that these other communities, which they were a part of before and which are very influential to them, talk about it. I think it's going to be like the modern Republican Party distancing themselves from the conspiratorial John Birch Society in the 1960s. That's what we need to do today. We need a Dario Amodei or Sam Altman to look at these "AI is going to eat everything,"
"The aliens are here," you know, whatever, and say, that's not us. That's kooky. That doesn't represent us. We're trying to build useful tools. Let us tell you how. Over here, we're building this. Here's why it's just going to make everyone's life better. Over here, we're trying to build this tool. Here's why I think it's going to be useful. Here we failed, but we're still working on it. We're not destroying the world. AI is not going to eat everything. You need to stop talking about all these exponential-worshiping cults. You want to be on the internet doing that? You should be in a dark corner of the internet doing it. We're trying to build real tools that we think are going to be useful and we're going to explain why. That's what I think has to happen. It's time to distance the way we think about AI from these particular communities that are freaking everyone the hell out. They're influencing how even CEOs are talking about things. They're moving the stock markets. They're causing large anxiety. And they're often wildly wrong, exaggerated, or technically incomplete. So, all right, that's my soapbox plea. But if you want to come away with one message from the METR chart, how would I summarize it? I would summarize it by saying in the fall of 2024, we moved from pre-training to post-training. And among the problems we began post-training for were reasoning and computer programming. And then in 2025, we began working on the harnesses. We got good. And so our last couple of model plus harness combinations have been making leaps in the complexity of things they can solve, which exactly matches what we're seeing in the software development world, where it wasn't until those models that the tools were good enough for people to use them. The METR chart is capturing the fact that the AI companies' bet on this very narrow, but perhaps financially lucrative and economically useful, task is paying off, that these tools are doing well.
It says nothing about the fate of humanity or AI more generally or artificial superintelligence or any of these other sorts of x-risk, transhumanist fever dreams. All right, so let's leave it there. We'll be back Monday with an advice episode of the show and probably another AI reality check on Thursday. But until then, remember, take AI seriously, but not everything that people say about it.