Understanding the Most Viral Chart in Artificial Intelligence
57 min
Apr 25, 2026
Summary
Odd Lots explores Meter's viral AI time horizon charts, which measure AI model capabilities by tracking how long tasks take humans to complete. Hosts Joe Weisenthal and Tracy Alloway interview Meter president Chris Painter and technical staff member Joel Becker to understand what these exponential capability curves actually mean, their limitations, and why they've become the most influential metric in AI investment and safety discussions.
Insights
- The most viral AI chart measures task difficulty in human time, not actual autonomous runtime—a critical distinction lost in public interpretation that drives investment decisions
- AI capability progress is doubling every 4-5 months on software engineering tasks, accelerating from the previously cited 7-month doubling time, with compute investments already baked in through 2027-2028
- Meter's safety mission and the investment excitement around exponential capability curves create a paradox where the same data fuels both AI safety concerns and venture capital enthusiasm
- Only 3 human baselines per task and focus on software engineering tasks create methodological constraints that may overestimate real-world productivity gains compared to benchmark performance
- The nonprofit research model struggles to compete for talent despite competitive cash compensation, with only 30 staff members facing dozens of critical, world-important research questions and able to pursue just a handful
Trends
- AI capability measurement shifting from abstract benchmarks to time-horizon metrics that better communicate autonomy and task complexity to non-technical audiences
- Exponential progress in AI capabilities creating financial and competitive lock-in effects where companies must continue scaling regardless of safety concerns
- Chinese AI models (Qwen, DeepSeek) lagging 9-12 months behind US frontier models, potentially gaming benchmarks rather than demonstrating genuine capability gains
- Increasing gap between AI lab capabilities and real-world productivity due to reliability issues, code quality requirements, and messy collaborative task environments
- Government and policymaker engagement on frontier AI capabilities assessment remains minimal compared to industry focus, creating asymmetric information and decision-making power
- Nonprofit AI safety research severely talent-constrained despite access to models and unique communication platforms, unable to pursue critical research opportunities
- AI labs' public statements about existential risks and safety concerns appear sincere but create tension with financial obligations to continue scaling compute and capabilities
- Reinforcement learning optimization toward measurable benchmarks may be narrowing the task distribution that AI systems are trained on, potentially accelerating progress on specific domains
Topics
- AI Capability Measurement and Time Horizon Charts
- AI Safety and Existential Risk Assessment
- Frontier AI Model Benchmarking Methodology
- AI Autonomy and Task Complexity Evaluation
- Software Engineering Task Automation Progress
- AI Lab Competition and Capability Race Dynamics
- Nonprofit vs. For-Profit AI Research Models
- Investment Decision-Making Based on AI Metrics
- Chinese AI Model Development and Benchmarking
- Real-World Productivity vs. Benchmark Performance
- AI Misalignment and Control Risk Evaluation
- Compute Infrastructure Investment and Scaling
- Government Policy and AI Regulation Gaps
- Talent Recruitment in AI Safety Research
- Task Distribution Bias in AI Evaluation
Companies
Meter
Nonprofit research organization measuring AI capabilities through time horizon charts; primary focus of episode discussion
Anthropic
Claude model performance discussed extensively; latest Claude Opus 4.6 shows 12-hour time horizon at 50% success rate
OpenAI
GPT-4 and GPT-5.3 Codex models compared on time horizon metrics; mentioned as a major AI lab with safety concerns and, unusually, as being owned by a nonprofit
Google
Gemini 3 Pro model referenced in time horizon comparisons; mentioned as frontier AI lab
Tesla
Self-driving data referenced in cross-domain time horizon analysis for vision-based task evaluation
Bloomberg
Podcast network and media company producing Odd Lots show
People
Joe Weisenthal
Co-host of Odd Lots podcast conducting interview about AI capability measurement
Tracy Alloway
Co-host of Odd Lots podcast asking critical questions about methodology and implications
Chris Painter
Explains Meter's mission measuring AI autonomy risks and discusses nonprofit research model challenges
Joel Becker
Details time horizon methodology, human baseline procedures, and limitations of current measurement approach
Beth Barnes
Co-founder of Meter, started approximately 4 years ago, with a focus on AI autonomy measurement
Paul Christiano
Co-founder of Meter with initial motivation to measure AI capability progression over time
Nathan Witkin
Published critical Substack post 'Against the Meter Graph' questioning human baseline methodology
Quotes
"We are plotting the difficulty of tasks that AIs are able to complete over time. And the particular way that we measure the difficulty of tasks is in how long it takes humans to complete those same tasks."
Joel Becker•~25:00
"If you're trying to establish the stakes for talking about could AI systems go rogue or one day could they like try to take over and subvert human control, three years ago if you went back to around when Meters started about four-ish years ago, the most damning in the moment reason is the AI system just can't do much."
Chris Painter•~18:00
"I think that there is a distinction between our motive for assessing time horizons and the kind of how it gets used then by the rest of the world. For Meter, the reason that we work on things like the time horizon charts is because if we're trying to establish the stakes for talking about could AI systems go rogue."
Chris Painter•~16:00
"The thing that is more happening is that we're having to turn down opportunities to do stuff like that because we don't have the staff that we need to make those things happen."
Chris Painter•~85:00
"I think there's some principle at play here of like, I kind of want to enable people to do whatever they will do with that information. And I think that we don't engage a ton in kind of the like business side or investment implication of the work."
Chris Painter•~50:00
Full Transcript
Bloomberg Audio Studios, podcasts, radio, news. Hello and welcome to another episode of the Odd Lots podcast. I'm Joe Weisenthal. And I'm Tracy Alloway. Tracy, one thing about AI is that there are lots of lines that go up. Yes. Famously, there is perhaps one line that has captured the attention more than others when it comes to lines going up. Yes. But we're recording this April 7th. Did you see the Anthropic revenue chart, by the way? Oh, it's just like straight up. It's just one of a number of lines going up. I mean, there are some really... Let me caveat that. Up until recently, there was one chart of a line going up exponentially that became, I think it's fair to say, the most viral chart in AI, right? Yes, I would absolutely agree with that. So one of the many lines that go up, or there are various lines that sort of capture this, is essentially just measures of AI progress, of what they can do, what the models are capable of and so forth. And, you know, there's all different benchmarks out there and hobbyist benchmark creators, et cetera, all kinds of benchmarks out there. There's an organization called Meter, based out in San Francisco. And they measure how well AI models are doing at various sorts of engineering tasks, et cetera. And they have these charts showing how long certain tasks would take a human to do and then whether AI could do them. And, yes, the line's just almost vertical. I think one of the ones that came out maybe very early this year or late last year was showing the latest Claude model. Yes. It's like, this is crazy. When I look at these charts, they're called time horizon charts. Yeah. When I look at them, like intuitively, I kind of understand what they're saying. And you can kind of see the leap in progress between some of the previous models and Claude, right, the latest Claude model. And that's what got everyone excited: you had this big exponential shift up in the capability of that particular AI model. But then when I start diving into what it actually says on Meter's website about what these charts represent, I start getting really confused. I know everyone wants to get excited about AI and charts going up in general, but I think there's a lot of nuance here. And we should probably talk about it, because the other thing going on with Meter right now is they've become sort of the industry standard benchmark. And so a lot of investment decisions are being based on these charts. And if you oversimplify them as just like, OK, lines going up. Yeah. And then suddenly it goes up even more. Obviously, people are going to start to get maybe a little overexcited. Can I say one other thing, too, that I'm very curious about? Like, I'm really glad that there are people designing various benchmarks for measuring progress. Seems like an important thing to get a handle on. But if I were, say, talented or smart enough to be doing these things, I would go work for one of the labs and make $10 million a year or something like that. And so I'm actually curious about a lot of these nonprofits, et cetera. It's like, do you really want to be working at the cutting edge of AI in a nonprofit? I mean, I guess OpenAI is owned by a nonprofit, weirdly enough. But you know what I'm saying? Like I would want the money. We should talk about it with our guests, who are currently sitting right here. That's exactly right. I'm very excited to say we have the two perfect guests to talk about the most viral and maybe important chart in AI right now.
We're going to be speaking with Joel Becker. He is a member of the technical staff at Meter. And we're also going to be speaking with Chris Painter, the president of Meter. So, Joel and Chris, thank you so much for coming on Odd Lots. Thank you. Thank you for having us. Yeah. Really excited to chat with both of you. Chris, since you're the president, I'll start with you. What is Meter? How long has it been around? What is this organization? What's its goal? Just give us the sort of 60-second synopsis of Meter. Yeah, totally. I can try, and, you know, sometimes I give a long version. I can try and do a short version here. So Meter is a research nonprofit based in the Bay Area, like you said, dedicated to advancing the science of measuring whether and when AI systems might pose catastrophic risks to humanity as a whole, focused specifically on threats that come from AI autonomy or AI systems themselves. So when you talk about this whole field in AI of dangerous capability evaluations, people seeing, can this AI system assist with a chemical or biological weapon attack? Can it advance bad actors' ability to execute cyber attacks on a really large scale? Meter is specialized in specifically assessing how autonomous our AI systems are, what is the scale and length and difficulty of tasks that they're able to do by themselves, partially because we think it sets the stakes for conversations about AI misalignment. So we sort of see ourselves as being on the hook for, at any given point in time, giving humanity the bits of evidence that are most informative for establishing the stakes of, are we reliant on AI systems as a society in a way that could make it really bad if they are misaligned. I'm going to let Joe ask the question about why you're both working in a nonprofit instead of one of the labs later. But one question I do have is, when I think of Meter, you guys always come up in the context of these time horizon charts. And I don't mean this as an insult or anything, but I hardly ever hear anyone talk about the actual safety aspect of your mission. Why do you think that is? Yeah, so I think there's some distinction between our motive for assessing time horizons and how it gets used then by the rest of the world, or kind of the origin of the rest of the world's interest in it. For Meter, I think the reason that we work on things like the time horizon charts is because we're trying to establish the stakes for talking about, could AI systems go rogue, or one day could they try to take over and subvert human control? Three years ago, if you went back to around when Meter started, about four-ish years ago (it was started by Beth Barnes and Paul Christiano, and this was kind of the initial motive), and you said, why don't I think that AI systems are going to go rogue and take over or overthrow humanity today? You know, you can come up with a lot of abstract reasons, debates about the goals AI systems might or might not eventually have. But the most damning in-the-moment reason is the AI system just can't do much. Right. It doesn't make sense to talk about a question-answering system that can't even reliably answer programming questions and ask, is it going to hack my systems or backdoor me in some way? It just doesn't make any sense to talk about that. No, it's going to write you a poem that you asked for. Right.
Or won't even... at the time, they couldn't do anything by themselves. And so, being able to subvert human control depends on agency. And so we wanted to come up with a measure that tracks agency over time, to say, when would this argument no longer apply? When are AI systems able to do long, complex enough actions by themselves that the goalposts almost move somewhere else, to like, well, we would catch the AIs, or the AIs don't want to subvert human control. And so I agree that there is a distinction. I think partially the exercise of trying to come up with these measures throws off things that are very grounded and intuitive measures of AI progress, more intuitive than just benchmarks, right? A lot of people are in the game of making just benchmarks, where you say, here's my harm bench or something, the AI gets 70%. That's much less of a grounded or long-lasting metric. It's hard to say what that means or how that generalizes. But the idea with time horizon is, maybe it's more intuitive. And I think that helps both for safety and for business understanding. So let's talk about what this chart, the main chart, here at meter.org right on the front page. It's this time horizon chart, and it shows Claude Opus 4.6, as of February 2026, able to complete tasks with a length of 11 hours and 59 minutes at a 50% success rate. I have to admit, the first time I saw this chart, or versions of this chart, what I assumed, and I suspect others assumed, is that it was able to go off and work on a task for 11 hours and 59 minutes and then come back with an answer. But apparently it's not that. By the way, the previous high was GPT-5.3 Codex. That was five hours and 50 minutes. So I guess part of the reason this chart just blew people's minds is literally that's basically a double. But why don't you walk us through what's really being measured here? Yeah. So fundamentally, in simpler terms, we are plotting the difficulty of tasks that AIs are able to complete over time. And, you know, the particular way that we measure the difficulty of tasks is in how long it takes humans to complete those same tasks that we're asking the AIs to do. So in this case, you know, we're talking about Opus 4.6, something like tasks that take humans 12 hours to do. We predict that it will succeed at those tasks around 50% of the time. And yeah, it turns out that when you plot, using this particular difficulty measure, how performant AIs are relative to how long it takes humans to complete these tasks, we see an exponential increase in capabilities for AIs. And what that ends up meaning is that you keep on having these doublings of capabilities every, let's say, four months, it seems, on recent trends, where the next model is not merely going to have an hour longer time horizon, but perhaps some multiple of the time horizon of the previous model that came out. So then explain how that number, that 12 hours, is established. So there is some engineering task and you say, OK, this is a task that would require 12 hours. But humans have all different types of talents and capabilities. How do you establish that? OK, this was a 12-hour task. This was a six-hour task. This was whatever it is.
Yeah, so the simple answer is literally we get humans to sit down and complete the tasks that we give to AIs, in as close to identical conditions as possible. So first we come up with the tasks, and that's, you know, that's a whole kettle of fish. We can talk about exactly how we do that. And then, using essentially the same tools that we're about to give the AIs, we take talented humans, you know, not people who have seen this particular type of task before, but people who have relevant expertise. So if it's a software engineering task, they have software engineering expertise; machine learning tasks, they have machine learning expertise. And then we time them. We see how long it takes for them to complete those tasks successfully. And then, roughly, we call the difficulty of the task, as measured in human time to complete, the average time it took these humans to complete the task. Then we'll run the AIs on the same set of tasks. Typically today, for the very easiest tasks, they're more or less always going to succeed. There's some mid-range of tasks where perhaps they succeed 50% of the time, or perhaps for some tasks in that range they succeed 0% of the time and for others 100% of the time, and so they're getting 50% on average, let's say. And then for the much harder tasks, perhaps they're getting closer to 0%. And then the point at which we predict, in the middle of all these 0% and 100% by task, that they'd have a 50% chance of succeeding, that is either a 50% chance of succeeding on some task, or 50% of the tasks of that difficulty that we think they would succeed on. That's what we're going to call the time horizon of these models. I think one thing also that could be good to explain here is the task distribution. I mean, this is not all activities that humans do. There's some question of what the tasks are. Like Joel mentioned, we're having people come into our office, do the task to get a sense of how long it takes. We're not having them come in and paint paintings or write novels. We're focused here specifically on the distribution of work that an engineer at, we like to think of it as a frontier AI lab, the tasks that they might be doing. So this is things like software engineering, it's fine-tuning AI models. It is software, machine learning, that kind of task. Wait, can I just ask, why did you decide to focus on engineering? Because you could have widened it out to, you know, if we're talking about AI being capable of taking over the world, there are all sorts of substantive tasks that would fall under that category. So why just do engineering? Yeah, I think that for one thing, maybe other people on the team or maybe Joel has thoughts about this. But I think my particular motive in being interested in the time horizon on software tasks is that, first of all, it's the thing that the industry, even before we started working on this, is very focused on. So it's one of the capabilities that you should expect to come along for the ride earliest. It's the thing that a lot of optimization pressure is being exerted on. And then I think that it is the thing that you would expect as an early warning sign of this AI R&D automation.
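To make the estimation Joel describes concrete, here is a minimal sketch of the fitting step, assuming per-task records of the average human baseline time and a pass/fail outcome for the AI agent. All of the data below is hypothetical, and Meter's actual statistical methodology differs in its details; this only shows the shape of the computation: fit a logistic curve of success against log human time, then read off where predicted success crosses 50%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical task data: average human baseline time in minutes
# (averaged over ~3 baseliners per task) and whether the AI agent
# completed each task successfully.
human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480, 960, 1920])
ai_succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Logistic regression of success on log2(human time). C is set high
# to approximate an unregularized fit for this illustration.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, ai_succeeded)

# P(success) = sigmoid(b + w * log2(t)); the 50% point is where
# b + w * log2(t) = 0, i.e. t = 2 ** (-b / w).
w = clf.coef_[0, 0]
b = clf.intercept_[0]
print(f"Estimated 50% time horizon ≈ {2 ** (-b / w):.0f} human-minutes")
```

Plotting that fitted horizon for each model against its release date is what produces the chart discussed throughout the episode.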
So to some extent, Meter thinks of itself as trying to build science, or advance science, that can say, when are we getting to the point that AI systems could improve themselves or speed up the pace of AI development? When will AI research kind of feed on itself? And the core capability for that might be software engineering and machine learning research ability. There are other skills that could be relevant to taking over the world. Right. I think other people have done time horizons on things like cybersecurity since. Yeah. But I suppose it is true, the basilisk isn't going to paint its way into power or something like that. Okay. It might deceive you. It might be very convincing or cunning in some way. Fair. Hand over the keys. I might say, for your mental models, you know, we don't have perfect evidence of this whatsoever. But my rough sense, colloquially, or, you know, my prior before evidence comes in, is that if we did study tasks on these very different distributions, you know, not machine learning, not software engineering, I'm not sure about painting exactly, but perhaps other kinds of task distributions that we could enumerate, basically we would see this similarly shaped exponential progress over time, where every, I'm not sure exactly, but let's say four months, six months, something like that, the level of capabilities as measured in time horizon would be doubling at something like that pace, maybe from a much lower level. So, you know, one example that we do have better evidence of is that the AIs today are much less performant at anything that requires vision capabilities, seeing what's on a screen, clicking around at a computer. But they're getting tremendously better at that sort of thing over time. I'll just mention quickly, we did actually do a very brief investigation of this on other task distributions. That's on our website somewhere, like cross-domain time horizons. I think we looked at data that Tesla shared on self-driving. I'm forgetting the others. There's like OSWorld. Maybe some of these are somewhat similar, still kind of in the distribution of software tasks, but trying to get further afield into things like vision. How big is the sample size on the humans who are actually doing the work? And also, is it getting harder getting human engineers into the room to compete with, like, Claude Opus 4.6? If I was a mediocre engineer, and I'm not, I'm a non-existent engineer. But if I was a mediocre one, maybe I would feel good about going up against GPT-3. And maybe I would feel a lot worse about myself going up against Claude. Yeah, you know, on these tasks, I'm in a pretty similar position myself to you. So we have approximately three, although it varies quite a lot across tasks, human baselines per task. So typically we're averaging over something like three. For the final numbers, it's my impression that they're not going to be so sensitive to the particular baselines that we choose. Aren't the longer tasks more weakly baselined? Yeah. So indeed, I think it will get a lot harder to baseline these tasks as the length of tasks the AIs are able to successfully complete gets longer and longer. You know, you might think at some point the length of tasks that they can complete is longer than the doubling time. In four months' time, they're going to be able to complete tasks of more than four months.
And then it kind of becomes perhaps close to impossible to get these four-month-long baselines. Of course, we're not at that point yet. But, you know, definitely it has become more difficult to get these baselines as time has gone on. At the moment, not impossible, but very challenging. Joe, these are the future jobs for displaced engineers, right? It's competing against the codes for benchmark purposes. Benchmark evaluation. We found the jobs. So we mentioned at the beginning, the most viral chart in AI is this chart that you have on the front of your website. Your website defaults to this. And it shows, you know, this doubling. So if we actually go back to November, let's say November 2025: Gemini 3 Pro, 3 hours and 44 minutes; Claude Opus 4.6, 12 hours. That's at the 50% success benchmark. If we go to the 80% benchmark, which the website doesn't default to, the pace of improvement looks a little less impressive to me. So, okay, now it's like it does not have the same gap, pretty clearly. Now, 80% is still not 100%. And I know that Meter's goal is about human safety and all this stuff. But when we think about how people look at this and use it as a stand-in for how performant these models are, even 80 percent, you know, certainly for any business application, I understand you're not serving business here, per se, but probably businesses care about this, even 80 percent may not be good enough. And it does not look as crazy when you look at the 80 percent chart as it does at the 50 percent chart. Why the focus on the 50 percent chart? And why not look at the chart that just does not look as impressive? Yeah, maybe two central things to say. One, to my eyes, the 80% chart basically does look as impressive. The doubling time is about the same. This is cope on my part. It's the same increase but at an offset. It's the same pace of progress. You know, it's something like five times smaller than the 50% number. But, you know, that only takes you two doublings. And if each doubling takes around four months, that means that in eight months' time, you're going to have roughly the same 80% success rate as you do 50% success rate today. That's one thing to say. Maybe a second thing to say is, you know, remember at the beginning I said, essentially what we're doing is plotting the difficulty of tasks that these AIs can complete over time, just with this particular measure that ends up showing this clean exponential trend. And we've picked a particular number as our difficulty number. And, you know, that is this 50% reliability threshold. We could have picked a different one. I think there are reasons for picking the 50% one in particular. It's the one that statistically we're better able to measure, for some technical reasons. It's the one that shows up in previous literature. There are a couple of other reasons why we go for 50% rather than 80%. Maybe a final thing to say is that this 50% number is sort of equivocating between tasks it's able to complete 50% of the time, and 50% of the tasks it's able to complete 100% of the time and 50% it's able to complete 0% of the time. And actually, I think the situation is somewhere in between, but it's a little bit closer to the latter, where there are some tasks that it's completing with near-perfect reliability, and some tasks in that range that it's completing with very low reliability.
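Joel's back-of-envelope argument about the 80% chart is easy to check as arithmetic. Assuming, as stated on the show, that the 80% horizon sits roughly five times below the 50% horizon and that horizons double about every four months (both rough figures from the conversation, not Meter's published estimates):

```python
import math

ratio = 5.0            # 50% horizon divided by 80% horizon (rough figure from the episode)
doubling_months = 4.0  # recent doubling time cited on the show

doublings_needed = math.log2(ratio)          # ≈ 2.32 doublings
months = doublings_needed * doubling_months  # ≈ 9.3 months

# The episode rounds this to "two doublings" and "eight months";
# the exact figure under these assumptions is closer to nine.
print(f"{doublings_needed:.2f} doublings ≈ {months:.0f} months of progress")
```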
And for downstream economic applications, or for applications inside of these major AI companies, you might think that that's more favorable in some sense, that there are some of these tasks where we're getting 100% reliability, even for very challenging tasks. Two other things maybe it could be useful to explain, when you said that there are technical reasons why it's easiest to measure at 50 percent. One, it is just the case that 50 percent is the point at which it is least sensitive, where the distribution is kind of thickest. Right. I mean, correct me if this is wrong, but to resolve something like 95 percent, you would need way more samples to be able to resolve that level of precision. I think there are some caveats to that picture, but let's say even more extreme. Let's say that we cared about 99%. In that case, if we had 1% label noise, quote unquote, if sometimes we were accidentally grading some of the failing tasks as passing, some of the passing tasks as failing, then we'd just never be able to estimate that reliably, right? And at 50%, this comes a little bit closer to washing out. And I think one other intuitive thing here, one intuition, is that if you give me a task and you give me the model, and all you tell me is the time, the length of task that it takes a human to do, the 50 percent time horizon is the point at which I think it is more likely that the model will be able to do the task than that it can't. And I just find that intuitive. How much interest do you get on these charts from potential investors specifically? And the reason I ask is because I was just messing around and Googling some stuff. And when the latest Opus chart came up, someone posted it on Reddit. And I think the second comment on it was someone going, how do I invest in OpenAI? And people were trying to club together to invest in these companies. So clearly there are people out there who are using these charts as investment tools. I would say, you know, we don't get an enormous amount of inbound from investment firms. I mean, sometimes, you know, VCs or whatever, since we're based in the Bay Area, will reach out to us. I think that there's some kind of principle that our goal is to inform the public and give them the best evidence that we can about when we might get to this point of, you know, AI being fully autonomous or able to improve itself. And there's some principle at play here of, I kind of want to enable people to do whatever they will do with that information. And I think that we don't engage a ton in the business side or investment implications of the work. One thought experiment I sometimes say to myself is, if I do believe that at some point we're going to get this AI that's improving itself, and AI research is automated, and you have all these fears about a singularity, would I rather that all of Wall Street falsely didn't think that was coming when I believed it was coming? Or would I want them all to know that it was coming, given that I believe it's coming?
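Stepping back to the label-noise point from earlier in that exchange, it is worth making concrete, since it explains why the 50% threshold is the statistically friendly one. If grading flips each pass/fail label with probability ε, the success rate you observe is p(1-ε) + (1-p)ε, which drags every estimate toward 50%. A toy illustration with made-up numbers:

```python
def observed_rate(true_p: float, eps: float) -> float:
    """Success rate you measure when grading flips each label with probability eps."""
    return true_p * (1 - eps) + (1 - true_p) * eps

eps = 0.01  # 1% grading error, as in Joel's example
for true_p in (0.50, 0.80, 0.99):
    print(f"true {true_p:.0%} -> observed {observed_rate(true_p, eps):.2%}")

# true 99% shows up as ~98.02%: the grading noise alone moves the
# estimate by about as much as the 1% gap you are trying to resolve.
# At 50% the distortion cancels exactly: 0.5*(1-eps) + 0.5*eps = 0.5.
```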
And I think, maybe this is more a personal view, but if it is possible that we will automate AI research, all of humanity being aware of it, aware of where we're heading, is sort of a precondition for us all being able to figure out what to do about it. And so I don't want certain people, or one side or one team, to selectively be in the dark because they might invest on the basis of this or something like that. But, you know, it's not where we put our time. We're focused on informing the public. The public includes some investors. So on that note, what is the actual level at which we're all presumably supposed to panic, or at which, if you're a policymaker, you would start to get worried about AI being able to automate and improve on itself in a way that eventually becomes detrimental to humanity? I don't know exactly what the level is on this time horizon measure. I think one thing to say is we have made real progress on the science of measuring these AI systems and how capable they are. But I think there's a long way to go. And in an important sense, I think we're behind on this task. We're measuring some underlying technical trend. And at some point, I do think that implies greater risks of astonishing things happening. Although Chris can speak more to other arguments that we might back out to for why, even if AIs are very capable, we still might not see catastrophic dangers emerge in the short term. Yeah, I'm unsure. You know, I think part of the reason why the AGI chatter has really picked up, particularly in the wake of everyone using Claude Code, is it's very easy to imagine. It's like you're sitting there. It's like, yeah, do this, do this. It's like, I don't even need to be here, right? I think you sort of get a very intuitive feel for how the human can come out of the loop. What happens today, because I'm sure this has been tried, if you go to ChatGPT and you say, here, you have Claude Code access, go build something? What actually happens today when AI is working with AI? Yeah, my sense is that at some point, you know, a further away point than would have been true some time ago, the AIs will more or less fall on their faces. There are some things they're not so capable of today. Like collaborative hallucinations, it'll just devolve into terrible... Yeah, I think there are all sorts of ways this can go. You know, at some point, they're going to need to rely on external resources. And today they're not as capable at managing these external resources effectively. I think they're less capable at ideation, and at self-awareness about where they are in the problem, today than they are at these kind of raw software engineering skills. And, as you mentioned, the ways in which AIs are autonomous today, or close to autonomous today, is the human has the idea and then submits that idea to a Claude Code or a Codex or one of these other agentic AI tools. And then they handle the software engineering components. And possibly there's still some intervention after that. I do imagine that the sort of circle of autonomy gets larger over time. I do think there's no fundamental barrier, it seems to me, to the AIs having those ideas and moving to a greater level of abstraction.
But if we were purely relying today on these fully autonomous capabilities, could you manage research departments, any department of your choice, inside of a major AI company? My guess is probably not. Actually, on this note, this reminds me of something I wanted to ask. So when you look at the domain-specific time horizon charts, the ones that show, I think you call them task suites or something like that, I guess productivity by a specific job. And you see these different lines. So sometimes you see almost horizontal lines, and sometimes you see squiggly or steeper lines. What is actually happening there? Like, how are we supposed to interpret that? Is this a measurement problem, or is it saying something very fundamental about what AI can and can't do under current conditions? The thing that I think would be good for Joel to explain is that there is a distinction here: the time horizon chart doesn't by itself, I think, tell you whether productivity in one specific kind of job will increase because of access to AI. Yeah, maybe one thing to say on that chart, showing the time horizon on these different task distributions. Relative to my guesses ahead of time, I think those time horizons are remarkably similar. I think the doubling times, the pace of progress in AI, seem more similar than I would have guessed to the original trends that we published, although imperfectly so. On this difficulty translating what we might call raw AI capabilities, capabilities on benchmarks or something, to real-world productivity: I think there are a number of differences, and a number of ways in particular in which the benchmark results are overestimating what we might see in the wild. You know, not hugely overestimating. I think we do see that people are getting real utility out of these modern agentic AI tools, but overestimating to some extent. One is that the scoring implicitly is different. In real problems, I'm scoring based on something a bit more holistic than these algorithmic scoring procedures, these automatic scoring procedures, that we're using at Meter and many other people are using in the benchmark world. There's some notion of code quality if you're working in software engineering. But for other tasks, there's... Beautiful code, elegant code, people always talk about. Yeah, yeah, yeah. For other tasks, there's going to be... If Anna Wintour was coding, this is what it would look like. One more thing is that the tasks that come up in the wild are more likely to be messy in some sense. They involve working with other people. They involve working in much larger code bases, or sort of more open-ended problems, maybe with something even adversarial going on. In the software engineering context, that might be that someone's trying to make a change to the part of the code base that you're currently working on and you need to work around that. And we do tend to see that the AIs are less capable at working on these more messy problems. I don't want to overstate that. It's not an enormous effect, but that's one thing that gets in the way of these productivity increases. And then I do think there's something to the reliability question, right?
Where if it was true that for a certain type of task you only had 80% reliability, then every time you're going to need to go back and verify the work of these AIs. And not only verify the work of these AIs, but without the context of how they implemented the solution, relative to if you went about the task yourself, where you'd already have that in your head, and so this verification step, quote unquote, would take less time. You know, I don't expect these frictions to be so fundamental in some sense, or I imagine they go up levels of abstraction. I think not only is the underlying technical progress real, but I think that the productivity improvements are also going to show up increasingly. But yeah, there are these frictions. Tracy alluded to this question when she asked about VCs and investor interest. So people see these charts, and regardless of what Meter's point is, it's like, this is incredible, I've got to invest in this. But this brings me to this broader thing that I find very strange about AI, which is this kind of odd sort of Baptist and bootlegger relationship between the AI labs, the people who are building this stuff, and the sort of alignment, safety people, and they sort of go back and forth. You have the heads of the labs saying, yes, this might destroy the world and take all your jobs. And the safety people and the alignment people say, yes, this might destroy the world. And it's a very strange industry, right? The only thing that I can think of is cigarettes, where they warn you that smoking is bad, except they had to do that because they lost a lawsuit. I don't think they were particularly inclined to do that. I can't think of any other industry where the most enthusiastic people about it are also warning and dooming about how bad the thing they're building could be. So I'm sort of curious, first of all, and I talked about this in the intro, who is the type of person that's working at Meter that is skilled enough to do advanced evaluations? And where's the funding coming from? Talk to us about who's behind Meter and why they're there. Yeah, totally. So I think one thing to say on the history of people caring about AI safety in the Bay Area is that this concern goes back quite a ways, I would say for over a decade. There are many people who got into the field because they saw this trend of deep learning. Like, what if deep learning works and it goes all the way to artificial general intelligence and then superintelligence? And if that works, then it could affect everything. I think possibly when people worry about this, there's a future that they have in mind with superintelligence that's even more capable than what people who think of themselves as AGI-pilled today think of. They're imagining AI systems that can run, you know, the entire economy. And I think people who many years ago saw that vision were sort of alarmed about the stakes of it. Many people have this intuition that the thing to do is go and work in the industry, because if you're helping build it, you know, what's the best way to shape the future? It's to build it. And obviously you could have questions about how sincere that is for many of the people who are in the industry, or if there's kind of a mix of different motivations and, like, different wolves inside of them, where maybe they partially are motivated by that.
But also there's kind of this Oppenheimer thing, like it feels good to feel like you're in the position of making something that's dangerous. Someone once described OpenAI to me. This was years ago. A friend said it was like OpenAI was sort of like the Manhattan Project, except the goal was to not build the bomb at the very end, if that makes any sense. So to your Oppenheimer point, it's very strange. And I think one thing to emphasize is, you know, while it could be that there's a mix of motivations now, there are definitely many people, I think, in the Bay Area who sincerely believe that the technology is headed to someplace where it will be very difficult for humanity to stay kind of in the driver's seat, or stay in control in a meaningful sense. It does seem as though people talk about, oh, the big AI labs have a PR problem or something like that. They keep bringing this up, and it's like, maybe they just believe it. So I think that this concern is quite old. And I think many people have this intuition that they can influence the thing by building it. But now there's this problem that that logic kind of always recommends that you continue building more advanced technology, or more advanced AI systems. And now you have this problem where there are all of these companies, and they all say that they need to build it, because if they don't build it, another company will. And they could all have doubts about each other's commitment to safety or to these principles. Famously, the leaders of the labs really do not get along. They're not friends. It's not easy for them to sort out the safety thing among themselves. And then even if all of the US AI labs kind of agreed to do that, they then have this kind of external boogeyman of China. Right. What will the Chinese companies do? And so there's this sense in which, even if the concern is real, I think a lot of people who are in the industry then have the instinct that there's no guiding principle for what they should do on safety other than to build leverage for themselves for later. And I think that is a concerning state of affairs for AI development to be in globally. You know, obviously, we're trying to do something different by informing the public. One gap that exists right now in that picture is that it's the people building the technology who most believe that it's going to be destabilizing and sort of all-encompassing. Maybe if the public and governments all were on the same page and believed the same thing, if it were true that it was headed there, then there would be more time for society to figure out a response, from people who are not trying to build leverage over the technology themselves directly or, you know, control the technology via some kind of public action or government. Can I just ask very quickly, since you brought up China, and I don't want to forget to ask this question, but Qwen doesn't show up on your main charts. I think you did a preliminary assessment of it a while ago. But what's the difference between assessing one of the closed models in America versus one of the open-source models over in China? I think one thing to say is that the capabilities are lagging behind. We think that they're lagging behind. I'm not sure. So they're still irrelevant?
They just, like, don't make it onto the chart? So we do try to prioritize, just because Meter has limited resources, staff time in particular, the models that we anticipate being on the frontier. And in general, the Chinese models have been something like, you know, nine to 12 months, let's say, behind the U.S. models. And I think the gap by time horizon is probably even larger than the gap by benchmark scores, where there's some, I'm not sure how scientific I can make this, but there's some colloquial sense or something that the Chinese models are stronger according to benchmark scores than they would be on truly held-out problems in some sense. You mean they're gaming the benchmark? Is that what that means? I'm not sure technically exactly how that shakes out, but something spiritually close to that. I'm not sure that's true for all Chinese models. I'm sure it's true for lots of models outside of China. But I think that's at least one possibility. I'm very curious, when you talk to external actors in all of this, and I'm going to group them into, I guess, policymakers, investors and the labs themselves, who are you interacting the most with at the moment? I think that in practice, we end up interacting a lot with AI labs, because there's some amount of sorting out getting access to models, working with them to set new precedents on things related to third-party red teaming and third-party risk assessment. We think of our audience as being sort of high-context members of the public. So the kind of people, you know, who are maybe like you two, right? People who are kind of like... People listening to this podcast. Yeah, people listening to this podcast, people who have to make important decisions that will be informed by the pace of AI progress or the kind of profile of AI capabilities overall. Because we're based in the Bay Area, I think we disproportionately end up interacting with people who are building the technology and closer to it. Partially, I think, back to Joe's point before, this is because to care about a lot of these frontier problems, you're kind of selecting for people who are building the technology themselves. There's some sense in which the companies in the industry spend more time thinking today about frontier capabilities assessment than the government does. I think one day you could imagine us getting to the point where the government is very focused on this and dedicating a lot of resources to it. And at that point, I would expect Meter to be spending more time talking to governments. Yeah, that's kind of what I was getting at, because our sense is, in a lot of the conversations, like, we talk to people and they'll say something about, oh, it's important to have a social safety net for an AI-enabled future. But no one seems to be really thinking about it in a lot of detail. And when you say, you know, it's easy to imagine, or maybe the government will care more about this, it's not so easy for me to imagine. It seems like they mostly care about, you know, data centers and where they're located and stuff like that. It would be nice if we had policymakers really looking at frontier capabilities and stuff. Still seems kind of a ways off. But it is interesting, you know, you're talking about this sort of capitalist dynamic, right? There's competition.
And it's like you have a lot of people that are really worried about, oh, what if the other guys get to ASI or AGI first? Or what if the Chinese, et cetera. How much does the fact of free-market capitalism and the demand, you know, the big investors at the VC funds, like, they want a return, they want an IPO, we might get some big AI IPOs this year, in fact. How much do you find that to be perhaps in tension with the safety element? Yeah, maybe. People on our team would have different views on this. I personally don't feel... yeah, there's something here of, like, investors are key decision makers and, you know, they're people, too. That sounds strange to say, investors are people, too. I sound like Mitt Romney or something. But I think that the element of this that feels like it could be in tension is if you build a bunch of financial obligations to keep kind of the pedal to the metal, no matter what the risks are, going into the future. So one thing I think a lot about is, if you're building up a huge amount of debt to build data centers, and then say that you do find evidence that you're now worried about the, you know, loss of control from AI systems, you do find instances of AI systems going rogue, do you now have a financial commitment to build up those data centers and continue kind of the pace of progress? I think that is one place where I feel the tension pretty acutely. Like, you're building these expectations into the market that could kind of force you to continue development when you otherwise would rather invest more in safety, or, yeah, it at least gives you a kind of financial obligation to continue scaling, at least compute. I think that the people themselves being informed about the progress does not seem bad to me. I think it's good in some ways for everyone to be on the same page about capabilities that could be related to subverting human control later on. But I think in the world beyond the information that Meter shares, I do think there is a tension. Like, the fact that private companies are building this, I think, could cause really acute tensions in the future, where people make these commitments that they wouldn't if they were trying to slow or, you know, maximize social resilience to the technology. Yeah, I'm not sure how these things shake out, but I think there are some forces on the other side, right? Like, you know, some safety-promoting technologies, quote unquote, or techniques do make the models more useful, you know, if they're better complying with your will in some sense. And so you have standard capitalist incentives to invest in that kind of research. Maybe that doesn't cover the broad suite of safety research that seems important. And it certainly doesn't rule out capabilities progress as being an important axis on which you do want to scale. But I think there are some forces in each direction. Since you mentioned compute just then, can you talk a little bit more about, I guess, the relationship between the time horizon improvements and the cost of compute at the moment, and what you've actually seen and how that impacts it? Yeah. So one extraordinary fact from my perspective, I'm not sure how to fit these facts together, but something like the R&D spend on compute of these companies has risen exponentially, of course. And in fact, it's risen exponentially at essentially the same rate as time horizon progress.
You know, I think there's nothing necessary about that. It doesn't mean by itself that if compute progress slows, then capabilities progress will also slow. But, you know, it's clearly an important input into AI progress. I expect that to continue to be true in future. Sometimes people ask us if we think it's plausible, or how plausible we think it is, that this exponential capabilities progress might slow down at some point in the future. And, you know, one reason it's hard for me to consider it plausible that it will slow down in the next at least small number of years is that a lot of those compute R&D investments are basically already baked in. The data centers have already been built; plans for data centers even beyond 2027, 2028 are presumably coming to fruition. And so some of these input investments are already baked in, in some sense. So it would be surprising to see capabilities slow, to the extent that compute has been an important input. After that, maybe you need to think about other arguments for how capabilities might slow, but that's roughly how I think about it. There's a very good or interesting critical Substack post called Against the Meter Graph, by someone named Nathan Witkin, who brings up an interesting point that I wouldn't have thought of had I not read it, which is, you're paying the software engineers to come in and perform these tasks, right? It seems, you know, maybe this will be the last job of humans, just doing benchmarks. If I were a good software engineer and you say, Joe, come in and do this task, how do you prevent me going, oh, man, this is taking me a long time, meanwhile I keep getting $100 an hour for looking at my computer, and this is tough, I'm going to have to come back tomorrow and keep working on this? How do you avoid the sort of conflict of interest where the person who's paid to work on this problem may be encouraged to take as long as possible to solve it? And with only three people working on it at times, I don't know, it seems like a conflict of interest to me. Yeah. So the short answer is, you know, in general we are incentivizing these people to complete the task as soon as possible, in particular to complete the task faster than their peers who are attempting the same task. Is there, like, a bonus if they do it faster? Yeah, yeah. Approximately, there's a bonus if they complete it fast, and faster than anyone else. You know, another thing to say is, I think it just is true that our baselining methodology, or the ways in which we compare to humans, in some ways leaves a lot to be desired. Ideally we would have invested a hundred times as many resources in having a hundred human baselines per task. And those would have come from perhaps the very best software engineers or machine learning engineers in the world. Maybe that would be the comparison that we're making. And indeed, we'd be doing all of this procedure over many more tasks, and not just many more tasks, but wider task distributions than just software engineering or machine learning engineering. I mean, I do think Time Horizon still represents progress over what's come before in the science of measuring AI capabilities. But, you know, in some ways, I'm sympathetic to a lot of criticisms of Time Horizon.
I do think that some of the details, at least for the work we've done so far, aren't going to matter as much as you might naively think. So choosing the shortest baseline time that we end up observing, or the longest time, is actually not going to make that much difference to the final measurements. Of course, we do think these people are talented software engineers or cybersecurity people or so on, depending on the task. But perhaps we could have found even more talented people. They would have completed it in half the time. And so, naively, it would seem like the time horizon that we estimate for these models would be half as long as we actually end up observing. But of course, that wouldn't change the doubling time. It would mean you'd get to the same level after another four months. In some sense, the big picture that I want Time Horizon to point to is less this "Opus 4.6 is 12 hours" in particular, and more that we're seeing this remarkable pace of progress that shows no signs of slowing in the recent past. And in the near future as well, in fact, it shows some signs of speeding up. Well, I was going to ask about this, because I think recently the statistic that you would always hear was a doubling every seven months, something like that. How fast do you see it going in the near future? Yeah. So I was a doubling-every-seven-months person. There was controversy in our team about what to believe here. Because when we originally published this work, approximately a year ago, if you plotted a single straight line, a single exponential, you'd get something like six or seven months, let's say. But if you restricted to just the time since, I think, GPT-4o, since the 2024 models onwards, you'd see something closer to this four-or-five-month trend. And some people believed in that. And some people like me had the intuition that, well, we have so few data points, we should really be estimating over this larger number of data points, and the larger number of data points says every six or seven months. There are a couple of things that have changed my mind and made me realize my colleagues were right since then. One is that, for the models that have come out since, what trend has better predicted how performant those models would be? And it's very clear that the answer to that is the four-month doubling time and not this seven-month doubling time. There's some possibility that it could speed up again. We've seen it speed up once. I think there are some reasons in principle why you might expect it to speed up again. I think there are some caveats about this. These are maybe some takes that not all my colleagues would agree with. And so maybe you should discard them, or you should think that they're going to convince me in the way that they did with the four-month versus seven-month doubling times. I have some suspicion that the tasks that Meter is measuring performance on are, in some sense, more and more a narrow slice of possible tasks. And in particular, a more and more narrow slice that is perhaps similar to the kinds of tasks that you'd expect these major AI companies to be training on in the first instance. And so, in some sense, we're increasingly, more so than was the case before, measuring progress on the exact types of tasks that they're trying to get better at.
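Both of Joel's points here, the seven-month-versus-four-month dispute and the claim that even twice-as-fast baseliners wouldn't change the doubling time, come down to properties of a straight-line fit of log2(time horizon) against release date. A hedged sketch with invented data points (shaped roughly like the charts discussed, not Meter's actual numbers) shows both:

```python
import numpy as np

# Hypothetical (release date, time horizon in minutes) pairs -- NOT
# Meter's data, just shaped like the charts discussed in the episode.
years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5, 2026.0])
horizons = np.array([4.0, 8.0, 18.0, 45.0, 110.0, 290.0, 720.0])

def doubling_months(y, h):
    slope, _ = np.polyfit(y, np.log2(h), 1)  # slope = doublings per year
    return 12.0 / slope

# Which window you fit changes the answer, as the team debated:
print(f"full series: {doubling_months(years, horizons):.1f} months per doubling")
recent = years >= 2024.0
print(f"2024 onward: {doubling_months(years[recent], horizons[recent]):.1f} months per doubling")

# Halving every horizon (faster baseliners) subtracts a constant from
# log2(h), shifting the fitted line's intercept but not its slope --
# so the estimated doubling time is unchanged:
print(f"halved:      {doubling_months(years, horizons / 2):.1f} months per doubling")
```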
You might think, for instance, of the kinds of tasks that would make for good reinforcement learning environments, the kinds of tasks that you can score quickly and cheaply and automatically. I think that progress is real, and I think it generalizes to some extent to other types of tasks. I think we're seeing remarkable progress on these messier tasks, for example.

I have one last question, which is, like, how big is your team and funding? And also, how many people at Meter are basically really rich from AI and they're like, you know what, I'm good, I don't need to stick around for the IPO or whatever, I'm set, and now I want to work on something that, you know, helps humanity? I've seen there are other independent AI researchers and they talk about this: I want to be able to talk about what I saw. Miles Brundage, someone who has like a little think tank, he's talked about this. Like, how many people are rich already and they're like, OK, now I want to work on something that's public facing?

Yeah. So Meter right now is about 30 people, and we're growing and hoping to grow fast. We are hiring, I should say: meter.org slash careers. And you were touching before on kind of the thing about, is it difficult to be a nonprofit? You know, we can't pay people in equity.

Right. No one's going to get an IPO.

Right. Yeah. There's no IPO or anything for Meter. But we do try to pay competitively on cash compensation. So that's an area where we feel we can somewhat compete with labs. And it's true that I think a lot of our team is just motivated by trying to do something different. Like, all the companies, to some extent, are in this business of building somewhat redundant products, kind of competing for the same role in the world. And Meter is in a really unique position at the moment, where I think we have access and the ability to communicate these ideas and explain the state of AI research to a lot of audiences, which might be hard for individual researchers inside of a company. Like, we get to talk to a lot of governments directly. We get to come here and talk with you all. And that's kind of different, I think. If you look at all the actors that are working on the frontier of AI research or AI safety, if you compare us to AI lab staff, we get to, every day, work on whatever research we think will be most informative to, like, public decisions.

And do you have ex-AI, not xAI, but ex-AI lab staff, who maybe there was a tender at some point and now they work at Meter?

Yeah, we do.

OK.

So we do, we do have some people who previously worked at AI labs. And I do think that as time goes on, one hope that I have is that there will be more and more researchers who have made the money that they need from working in the industry, and who are now excited about kind of lifting all boats by working inside of an organization where the North Star can be what is most informative to the rest of the world outside of this relatively small set of companies.

Chris is very polite. I think that's wonderful. I'm tempted to be a little bit more aggressive. In this conversation, I think we have spoken through Meter's work on some of the most important problems in the world. Problems that are going to define the future, I think, for not just the next years, but coming decades, maybe even coming centuries.
And we've also spoken about some of the ways in which Meter's work is not what you might want it to be. That there's a long way to go in the science of evaluating these AIs. Why have we not made more progress? Maybe a couple of reasons. I think clearly the central reason is that we are bottlenecked on technical talent, on incredibly capable people to come work on these questions. I was on a Meter work retreat recently where we were brainstorming 20, 30 of these seemingly world-important problems, problems that we think no one else is going to get to if we do not get to them. And how many of those problems are we able to conduct research on? I think it's one. Two, maybe. If we do an extraordinary job this quarter, it might be three. As Chris alludes to, if you're interested in working less on redundant products at these major AI companies and more on advancing our understanding of some of the most important questions in the world, questions that are going to shape the world for years to come, Meter is a great place to go.

Yeah. One more thing to say about that is that the vibe inside of Meter is a state of triage, right? Externally, people might guess, oh, Meter is outside of any of the AI labs, so the thing it might most struggle with is things like access to AI models. You can't do the research you want because you're not building the thing yourself. That's the story people always tell us: you have to build the future to shape it. In practice, our experience at Meter has been that when we want to try new types of research that would require new kinds of structured access, AI labs are pretty game to play ball on that. The thing that is more happening is that we're having to turn down opportunities to do stuff like that because we don't have the staff we need to make those things happen.

Joel and Chris, thank you so much for coming on Odd Lots. Absolutely fascinating conversation, and I appreciate you taking the time.

Great to have you in studio.

Yeah. Thank you so much.

Thank you so much for having us.

That was a really interesting conversation. And to pick up from the end, sort of the idea of like, OK, here are some really important questions, like, let's just set everything aside, and there's 30 people working on that. And, like, how many people want to do it? It's like, OK, we try to match cash comp, et cetera. That seems like kind of a tricky issue, if you accept the premise that these are some big questions we have to get right and we've got to land this plane, hopefully. Like, that's a bit of an issue.

Yeah. The other thing I thought was really interesting was the Chinese models not really making it onto the charts, even though we know in the market itself, when DeepSeek, when that new version came out, that was this huge thing where everyone started to panic. And to not see it even land on the time horizon chart is kind of interesting.

I mean, I guess I buy the reasoning that the only interesting question from Meter's perspective is, like, the most cutting edge, which may be slightly adjacent to the most interesting question for, like, business, right? So it's like, OK, we know that DeepSeek and Qwen and Kimi and all those are very impressive. Do they push the very frontier? Perhaps not.
But just in general, I find this space so weird, because here you have these people who are clearly quite alarmed at the potential here, and most people, I think, look at these charts and say, like, wow, I want to invest in this, or this is really exciting.

I know. That's why my first question was, you're here for AI safety purposes, but everyone seems to get excited about the line-go-up chart.

Yeah. Like, there's a disconnect. Or they're all connected. Like I say, when an industry basically says it's worried about itself, you should pay attention. It's really strange.

This gets back to, you know, it's very strange where you have the CEOs of these companies who are in many cases the most alarmist. And there's this sort of cynical thing, and I don't totally discount the cynical interpretations, like, oh, they're saying this because they want to get investors and so forth and they need all this money. But look, it was also true that OpenAI and Anthropic, but OpenAI a little more, were founded with these very exotic corporate structures, like a private company owned by a nonprofit, et cetera, which they presumably did because they took pretty seriously the fact that this technology and science was very strange, that it's not just enterprise software, right? Like, they were self-limiting in a way.

One other interesting thing, too, is this idea of, OK, first of all, what's the difference between a seven-month and a four-month doubling time? Not much. You know, people are like, oh, I think...

Yeah, but it's exponential, isn't it?

I guess it's exponential. But it's still funny to me. It's like, oh, I think AI is going to destroy all white-collar work in two years. And someone else is like, no, no, I think it's going to be three years. As if that makes any difference whatsoever. But one thing to consider, Joel sort of alluded to this: you had, like, OpenAI shutting down its video efforts, et cetera. So perhaps part of the story is just this intense focus now on the software engineering side as what these labs are working on.

Yeah. And sort of like all these other side quests are not as important. And so maybe we will see even more rapid progress on some of these technical benchmarks, because clearly, from the labs' perspective, that's where the action is, more than some of these consumer things like making images or videos.

Yep. All right. Shall we leave it there?

Let's leave it there.

OK, this has been another episode of the Odd Lots podcast. I'm Tracy Alloway. You can follow me at Tracy Alloway.

And I'm Joe Weisenthal. You can follow me at The Stalwart. Follow our guests: Chris Painter, he's at Chris Painter Yup, and Joel Becker, he's at Joel underscore BKR. Follow our producers: Carmen Rodriguez at Carmen Armand, Dashiell Bennett at Dashbot, Kale Brooks at Kale Brooks, and Kevin Lozano at Kevin Lloyd Lozano. And for more Odd Lots content, go to Bloomberg.com slash OddLots, where we have a daily newsletter and all of our episodes. And you can chat about all these topics 24/7 in our Discord, discord.gg slash OddLots.

And if you enjoy Odd Lots, if you like these AI episodes, then please leave us a positive review on your favorite podcast platform. And remember, if you are a Bloomberg subscriber, you can listen to all of our episodes absolutely ad-free. All you need to do is find the Bloomberg channel on Apple Podcasts and follow the instructions there. Thanks for listening.

Thank you.