METR’s Joel Becker on exponential Time Horizon Evals, Threat Models, and the Limits of AI Productivity

56 min

•Feb 27, 20263 months ago

Summary

Joel Becker from METR discusses their groundbreaking AI evaluation research, including the viral time horizon chart showing exponential AI capability improvements and findings that AI currently slows down developer productivity. The conversation covers AI safety evaluation methods, threat models, and the challenges of measuring real-world AI impact versus benchmark performance.

Insights

AI capabilities are improving exponentially with remarkable consistency, following a straight-line trend when plotting task difficulty (measured in human time) against model releases
Current AI tools may actually slow down experienced developers on complex tasks, contrary to popular productivity claims and marketing hype
Independent AI safety evaluation is crucial as labs have inherent conflicts of interest when assessing their own models' risks and capabilities
The gap between benchmark performance and real-world messy tasks remains significant, with models struggling on open-ended, context-heavy work
Prediction markets on AI progress suffer from insider trading issues, with industry insiders having material non-public information affecting market prices

Trends

Exponential AI capability growth following predictable trend lines across multiple model generationsShift from human-written code to fully agentic coding workflows among senior developersGrowing importance of independent AI safety evaluation organizations separate from major labsIncreasing focus on real-world task evaluation versus synthetic benchmarksRise of multi-agent AI systems for complex collaborative tasksCompute growth slowdown potentially limiting future AI progress ratesAutomated R&D acceleration as a key threat model for AI safetyGrowing sophistication in AI evaluation harnesses and scaffolding methods

Topics

AI Safety Evaluation Model Capability Assessment Developer Productivity Impact AI Benchmarking Methodologies Threat Model Development Time Horizon Analysis Autonomous AI Systems Compute Scaling Limitations AI Progress Prediction Markets Independent AI Research Agentic Coding Workflows Multi-Agent AI Collaboration AI Evaluation Harnesses R&D Automation Risks AI Capability Explosion

Companies

METR

AI safety evaluation organization conducting independent research on model capabilities and risks

OpenAI

Frequently referenced for model releases, compute spending data, and capability comparisons

Anthropic

Discussed for Claude models, particularly Opus 4.5's significant capability jump

Google DeepMind

Mentioned for Gemini models and AI research contributions

Cognition

Referenced for internal productivity tracking and engineering velocity measurements

Meta

Discussed for compute spending and AI model development efforts

xAI

Mentioned for model distillation practices and compute investments

Mistral

Cited for transparency about GPU usage and model development timelines

Manifold Markets

Prediction market platform where Becker became top trader through charitable donations

Polymarket

Real-money prediction market platform compared to Manifold's play-money system

People

Joel Becker

METR researcher and guest discussing AI evaluation methodologies and safety research

Alessio

Podcast host and founder of Kernel Labs interviewing Becker

Swyx

Podcast co-host and editor of Latent Space discussing AI development trends

Sam Altman

OpenAI CEO mentioned for giving AI models dangerous permissions on his personal computer

Noam Brown

Researcher working on cooperative multi-agent AI systems at major tech company

Quentin Anthony

Developer who participated in METR's productivity study and showed positive results

Nicola

METR colleague with shorter AI timeline predictions, involved in AI2027 research

Quotes

"METR stands for model Evaluation and threat Research. We think about what the capabilities of AI models might look like today and tomorrow, as well as their propensities, what they'll actually do in the wild."

Joel Becker

"Part of what makes it so extraordinary is that this pattern does seem to be so regular. In fact, it's just way more straight than this incredibly scattered graph that we had at the beginning."

Joel Becker

"I've seen some of the most talented engineers I know go from being picky about not using AIs for coding to practically not writing a line of code."

Joel Becker

"I think basically no one is stopping to do science except for you guys. Because we know RCTs are the best, right? But sometimes human intuition is good enough."

Swyx

"It's vital to have this independent source of expertise. I can bang that drum forever."

Joel Becker

Full Transcript

3 Speakers

Speaker A

So METER stands for metr. First two letters, model Evaluation. That is, we think about what the capabilities of AI models might look like today and tomorrow, as well as their propensities, what they'll actually do in the wild, given that they have some level of capability. And then threat research is the final two letters. We try to connect those capabilities and propensities to particular threat models that we have in order to determine whether AI models pose enormous or catastrophic risks to society. So the secret, if you read this article about how I became the number one most profitable trader on manifold, mostly comes down to this one market where.