Latent Space: The AI Engineer Podcast

⚡️SWE-Bench-Dead: The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

26 min
Feb 23, 2026
Summary

OpenAI researchers Mia Glaese and Olivia Watkins discuss the end of SWE-Bench Verified as a reliable coding benchmark due to saturation and contamination. They argue the field should move to SWE-Bench Pro and discuss the challenges of creating meaningful AI coding evaluations as models become more capable.

Insights
  • Popular AI benchmarks have limited lifespans - they become saturated and contaminated as models improve, requiring constant evolution to new, harder benchmarks
  • Contamination in coding benchmarks is pervasive across all frontier models, with models showing knowledge of specific implementation details they shouldn't know
  • Creating reliable coding evaluations requires massive human investment - OpenAI hired nearly 100 software engineers to validate SWE-Bench Verified
  • Future coding benchmarks need to measure qualitative aspects like code design, maintainability, and real-world applicability rather than just whether tests pass
  • The field is moving beyond simple GitHub issue solving toward evaluating multi-day, complex engineering tasks that require design decisions
Trends
  • Benchmark saturation forcing migration to harder evaluation frameworks
  • Widespread contamination across all frontier AI models in coding benchmarks
  • Shift from test-passing metrics to qualitative code assessment
  • Movement toward longer-horizon, more complex coding tasks
  • Integration of real-world usage metrics into AI capability assessment
  • Emphasis on agentic coding capabilities over simple problem solving
  • Industry collaboration on benchmark development and validation
  • Focus on measuring AI's ability to make design decisions in underspecified problems
Companies
OpenAI
Primary focus - researchers discussing their benchmark work and AI evaluation frameworks
Scale AI
Creator of SWE-Bench Pro, the recommended replacement for SWE-Bench Verified
Princeton University
Original creator of the academic SWE-Bench benchmark that OpenAI later verified
Anthropic
Their Claude Opus model showed contamination issues in SWE-Bench Verified testing
Google
Their Gemini Flash model exhibited contamination in SWE-Bench Verified evaluations
People
Mia Glaese
VP of Research at OpenAI, leads Codex, Simulator, and Alignment teams
Olivia Watkins
OpenAI Frontier Evals team member, co-creator of SWE-Bench Verified
Quinn
Mentioned for creating HLE Verified, a verified version of Humanity's Last Exam
Quotes
"The main thesis is that Sweeping Verified has been one of the North Star coding benchmarks that the field has looked at to measure coding progress. But recently we've seen that progress is kind of stalled."
Olivia Watkins
"Maybe it's hard to overstate the amount of effort that it took to create that benchmark. It was literally like many expert software engineers reviewing the problems like sequentially multiple times."
Mia Glaese
"In over half of the problems that were investigated in that deep dive, there was one problem or the other. I think the most common problem are overly narrow tests."
Olivia Watkins
"We're kind of starting to measure not necessarily like what we want to measure, which is like coding capability of our agents. But like the agent's ability to correctly guess how to name a specific function."
Mia Glaese
Full Transcript
3 Speakers
Speaker A

Okay. Hi, we're here in the OpenAI studio with Mia and Olivia from the Frontier Evals team, or however you want to introduce yourselves. Maybe you want to introduce yourselves: name what you do at OpenAI and we can get started.

0:04

Speaker B

Sure.

0:16

Speaker C

Hi, I'm Olivia. I'm on the Frontier Evals team.

0:17

Speaker B

Great.

0:19

Speaker A

Sure.

0:19

Speaker B

Hi, I'm Mia. I'm a VP of Research at OpenAI, and my teams are the Codex team, the Simulator team, and the Alignment team. And we work a lot with Olivia's team on Frontier Evals.

0:20

Speaker A

Yeah, very exciting. And, by my understanding, you were part of the original team that worked on SWE-bench Verified as well.

0:33

Speaker B

Yeah. Olivia's team, the Frontier Evals team, and the Human Data team collaborated on creating SWE-bench Verified.

0:38

Speaker A

So you've seen the evolution of coding benchmarks over time, and I think it was around mid-to-late 2024 when you first put out SWE-bench Verified. These have evolved a lot since then. What's the blog post that you have worked on that we're releasing today? What is the main thesis that you're pushing out?

0:45

Speaker C

So the main thesis is that SWE-bench Verified has been one of the North Star coding benchmarks that the field has looked at to measure coding progress. But recently we've seen that progress has kind of stalled. And basically we realized that this is because the eval is effectively saturated and also highly contaminated.

1:04

Speaker B

So.

1:20

Speaker C

So at this point, we think that it's not really measuring coding performance improvements well anymore. And we think that the field should move away from this towards other benchmarks like SWE-bench Pro.

1:20

Speaker A

Yeah. Amazing. One of the jokes I always have is that there's a group chat with all the labs and everyone just takes turns to increment like 0.1 on the charts. And then it's like, okay, well, you have the best coding model, I guess, because you're 0.1% higher. But it's not super convincing at this point at all. So, cool. I think let's reset on what the original work was that you did for SWE-bench Verified, which I think was pretty substantial. It was a very significant investment from OpenAI, which people still don't appreciate. And then, what were the issues that we found over time? So what was SWE-bench Verified, that people should know about?

1:29

Speaker C

SWE-bench Verified was kind of a cleanup of the original academic benchmark, called SWE-bench, from a lab at Princeton. The agent is basically given a code base and a task that was sourced from a real-world repository and GitHub issue, is asked to solve the task, and is graded on whether some tests pass. This quickly became a popular benchmark because at the time the field didn't really have good real-world coding benchmarks. But then, when OpenAI took a look at the benchmark as part of the evals we wanted to track in our preparedness framework, folks started realizing that some of the cases where agents were failing were due to bad problem setups rather than to models being dumb. So folks at OpenAI did a pretty extensive human data campaign, hiring almost 100 real-world software engineers to go through the problems and figure out: are the tasks well specified, are the tests actually fair? And we created a curated set of 500 tasks that we thought were much better.

2:08
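To make that setup concrete: a minimal sketch of a SWE-bench-style grading step might look like the following. This is an illustration rather than the actual harness (the real evaluation runs each instance in an isolated per-instance container); `grade_instance` and its arguments are names assumed for the sketch.

```python
import subprocess

def grade_instance(repo_dir: str, model_patch: str, test_patch: str,
                   fail_to_pass: list[str]) -> bool:
    """Apply the agent's patch plus the held-out golden tests, then check
    that the previously failing tests now pass."""
    # Apply the agent's proposed fix to a clean checkout at the base commit.
    subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                   cwd=repo_dir, check=True)
    # Apply the golden test patch (hidden from the agent while solving).
    subprocess.run(["git", "apply", "-"], input=test_patch.encode(),
                   cwd=repo_dir, check=True)
    # Run only the tests the golden patch is known to fix; the instance
    # counts as resolved iff they all pass now.
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass],
                            cwd=repo_dir)
    return result.returncode == 0
```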

Speaker B

Maybe it's hard to overstate the amount of effort that it took to create that benchmark. It was literally many expert software engineers reviewing the problems sequentially, multiple times, until, you know, basically three different experts independently decided that, yeah...

3:07

Speaker A

You didn't have to do that, you just tripled your costs for just...

3:30

Speaker B

I mean, we had to do it. We had to do it, actually, because it's quite a hard task to look at something like a problem and the patch. And it's not just the problem and the patch, right? You have to understand it in the context of the code base that the human or the model is in to solve the task. So it's a very complex problem, and it definitely needed three reviews. I think maybe we should have done more, but it was definitely a lot of effort to get there.

3:32

Speaker A

Yeah, and there's more, but people can read the blog post for that. I will note that you guys had a trend of verifying benchmarks, because I just recently saw, I think Quinn had HLE Verified, a verified Humanity's Last Exam. So now everyone's verifying everything, which is nice and good, extra quality there. Okay. But I think the meat of it is: this was a lot of, well, here's the issue or problem statement, and then here's the diffs, here's the golden tests, and here's some regression tests, right? That's the rough setup of these 500 problems. And there's some contamination; that always happens, because SWE-bench Verified was fully open. I think you did have canaries, but, you know, stuff leaks.

4:00
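For reference, a single instance in the public SWE-bench release is roughly shaped like this; the field names follow the published dataset, while the values here are invented for illustration.

```python
# One SWE-bench instance, roughly as published (values invented):
instance = {
    "instance_id": "django__django-12345",   # hypothetical ID
    "repo": "django/django",
    "base_commit": "abc123",                  # commit to check out
    "problem_statement": "Text of the GitHub issue ...",
    "patch": "diff --git ...",                # golden fix (the "diffs")
    "test_patch": "diff --git ...",           # golden tests
    "FAIL_TO_PASS": ["tests/...::test_new_behavior"],  # must newly pass
    "PASS_TO_PASS": ["tests/...::test_existing"],      # regression tests
}
```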

Speaker B

There are multiple avenues, but the problems are sourced from open source repos. So it's not just... when we publish evaluations, we publish them and then we add canary strings to ensure that, you know, they are easily filtered out at training time. Obviously, if you use data from open source GitHub, you don't actually have a canary string in there.

4:40
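The canary mechanism Mia describes amounts to embedding a unique string in every published eval file so that training pipelines can drop any document containing it. A minimal sketch of that filter, with an invented GUID (not a real benchmark canary):

```python
# Any document containing the benchmark's canary string is excluded
# from the training corpus.
CANARY = "BENCHMARK CANARY GUID 4b5c-example"  # invented for illustration

def filter_training_docs(docs):
    """Yield only documents that do not contain the canary string."""
    for doc in docs:
        if CANARY not in doc:
            yield doc
```

As the transcript notes, this only protects the published eval files themselves; the underlying open source repos carry no canary, so they can still leak into training data.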

Speaker C

And some of these are very popular repos, like the Django repository. So you're going to see many instances being used kind of throughout.

5:09

Speaker A

Yeah, yeah. Just before recording, you were telling me that you found this in your own chain of thought with 5.2, seeing that it had extra knowledge or something.

5:17

Speaker C

Yes. So this was an example where the task asked the agent to implement something, but it wasn't told that there was this specific argument that the test was going to be looking for. But in the GPT-5.2 chain of thought, we actually saw instances of the model reasoning like, hey, I think some later version of this repository implemented this particular argument. Maybe I should add it in. So this is an example of a test that would be pretty impossible to pass without this contamination knowledge.

5:26

Speaker A

Yeah.

5:51

Speaker B

And I think you found that first, right? And it triggered a whole investigation, both into our own models and also into other frontier models in the market, to understand how contaminated the benchmark is across the industry.

5:52

Speaker A

What else did you find? I have to double click on this.

6:07

Speaker B

So we.

6:11

Speaker C

And when I say we, this is mostly from other folks on our team.

6:12

Speaker B

Not.

6:14

Speaker C

Yes, but so we did some analysis on, first of all, are the tests actually fair? And this happened by first taking all the problems that o3 couldn't solve reliably, and then again getting a lot of humans to do basically another pass of digging into what's wrong.

6:15

Speaker A

Is it the same exact analysis, or were they reading o3's output and going, here's where o3 went wrong?

6:34

Speaker C

I think it was. I mean, it was definitely scoped to the set of problems that models failed. And I believe they were able to look at what the model solutions looked like versus what the solutions...

6:40

Speaker A

So this isn't the same work as the original verification.

6:49

Speaker B

It's not exactly the same work. It was a deeper dive. It's like, okay, for the problems that we don't see any model solving: is there something fundamentally wrong with those problems, or are models just not smart enough to solve them? So that's what we dug into.

6:51

Speaker A

Yeah. And you found some.

7:09

Speaker C

Oh, yes. In over half of the problems that were investigated in that deep dive, there was one problem or the other. I think the most common problem is overly narrow tests, where there's some particular implementation detail that the tests were looking for, but that wasn't specified in the problem description. So it wasn't fair to expect the model to make that particular design choice. One pretty blatant example is cases where the task asks you to implement some feature, and then the tests are looking for you naming that argument or that function with a particular name. But if you chose another reasonable name, the test would fail. And another type of bad test is tests that are just looking for additional features that were never mentioned in the problem description.

7:11
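To make the failure mode concrete, here is an invented example of an overly narrow golden test: the issue asks for retry support but never names the argument, yet the test hard-codes one particular keyword, so equally reasonable naming choices fail. The `client` fixture is hypothetical.

```python
# Hypothetical issue text: "requests made through Client should retry
# automatically when they fail." The issue never names the argument.

def test_retry_on_failure(client):
    # Overly narrow: this pins the exact keyword `max_retries`. An agent
    # that implemented identical behavior as `retries=3` or
    # `retry_count=3` fails this test despite a reasonable solution.
    response = client.get("https://example.com/flaky", max_retries=3)
    assert response.ok
```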

Speaker B

Well, that is significant. That means that if you pass a test, you actually probably did a really good job. But just because you didn't pass that test doesn't mean that your implementation wasn't a good one. So we only accept very narrow versions of solutions, and not the whole space of viable and good solutions to the problem.

7:52

Speaker A

I think it's important that you're doing this, because in some way it is you in 2025, 2026 going back in time and correcting your own work.

8:16

Speaker B

Right?

8:24

Speaker A

Because you could have caught all this in the original verified work.

8:24

Speaker C

I think so. It's definitely much harder to find a problem in the abstract than when you're looking at a very smart agent's best effort solution and trying to compare it.

8:27

Speaker A

Is it harder or easier?

8:36

Speaker C

It's much easier when you have it.

8:38

Speaker B

I think also, at the time SWE-bench Verified was published, it was a very strong benchmark. It's not like, oh, this wasn't a strong benchmark at the time. I think this is something that a lot of benchmarks go through as an evolution, right? When they start to become popular and viable, it's because they measure something important, and models maybe do like 20% correct on them, sometimes even less, and people have something to hold on to and improve models against. And by the time you hit very high performance on the benchmark, additional 0.1% improvements become sort of meaningless. So at the time, I think that benchmark was super valuable, and it taught us and the industry a lot. It's just that now, at the point we're at, where models are as strong as they are, we're kind of starting to measure not necessarily what we want to measure, which is the coding capability of our agents, but the agent's ability to correctly guess how to name a specific function. And that isn't really what we want to measure at this point.

8:40

Speaker A

Yeah, I think that's fair. If I asked you to ballpark it: most frontier models are now at 80-something. What's the actual number on SWE-bench Verified that you'd guess as the ceiling?

9:51

Speaker C

I guess that's really hard to say. When GPT-5.2 came out, folks took a look and found that it was solving like 31 problems that were in the set of "should be very hard to solve without contamination" problems. So I think it's quite possible that that number is already something we've hit, if you didn't have contamination at all.

10:07

Speaker B

Fair enough.

10:25

Speaker C

Hard to say though.

10:26

Speaker A

Yeah, cool. We're going to stop reporting SWE-bench Verified.

10:27

Speaker B

Right.

10:30

Speaker A

And then SWE-bench Pro will be the next one, which is an effort from Scale AI. What's your comparison analysis? What attracts you to SWE-bench Pro?

10:30

Speaker C

Probably the first one, I think, is just that it's harder than SWE-bench Verified. In Verified, I think something like 90% of the problems were estimated to take an expert software engineer less than an hour; they're very well specified, very self-contained. The SWE-bench Pro problems are just bigger and harder. And there's much more headroom on the eval, because it's not saturated. It has categories

10:37

Speaker A

of like one to four hours and

10:57

Speaker C

four plus. And it's more diverse: lots of repositories, multiple languages, qualitatively more different types of problems. So all that's great. On the contamination side, we also think it's better there. The way we were measuring contamination for SWE-bench Verified was with this little contamination auditor agent, which is given the description of the task, the patch, and the task ID, and told to go to the target model and, with an open-ended set of questions, try to find questions that would manage to reveal what contamination might be lurking in that model. And in SWE-bench Verified we found many instances of contamination, across OpenAI models, across Claude Opus 4.5, Gemini Flash. And in all of these we saw things like regurgitating the ground-truth solutions, in some cases giving the task IDs, and other things that are pretty clear evidence of familiarity with the repositories.

10:58
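The exact auditor protocol is presumably detailed in the blog post; as a rough sketch of the shape Olivia describes, an auditor loop probing a target model about one benchmark instance could look like this. `ask_auditor` and `ask_target` are stand-in callables (str -> str) for whichever model APIs you use, and the probe strategy is a guess.

```python
def audit_instance(task_id: str, description: str, gold_patch: str,
                   ask_auditor, ask_target, n_probes: int = 5) -> list[str]:
    """Probe a target model for memorized knowledge of one benchmark task."""
    findings = []
    for _ in range(n_probes):
        # The auditor invents a question designed to surface memorization
        # (e.g. "which file does this task touch?") without leaking the
        # answer inside the prompt itself.
        probe = ask_auditor(
            "You are auditing for benchmark contamination. Given this task "
            f"description:\n{description}\n"
            "Write one question for the target model that would reveal "
            "memorized knowledge of the task (its ID, file paths, or the "
            "exact fix) without giving the answer away."
        )
        answer = ask_target(probe)
        # Flag clear giveaways: the hidden task ID, or long verbatim
        # chunks of the ground-truth patch showing up in the answer.
        if task_id in answer or any(
            line in answer for line in gold_patch.splitlines()
            if len(line) > 40
        ):
            findings.append(answer)
    return findings
```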

Speaker B

Yeah, I mean, a task ID? That...

11:55

Speaker C

Yeah. SWE-bench Pro, on the other hand, we don't see this. I think there the auditor agent found some very light evidence that maybe a couple of models might be very lightly familiar with one or two of the source repositories, but it's very different from SWE-bench Verified. So less contamination is good.

11:59

Speaker B

I think there, also, we should expect that at some point it's not going to be the right benchmark anymore. And as a field we kind of have to continue to move on and find harder and more representative problems that we can measure our capabilities on.

12:15

Speaker A

Awesome. So let's go into that. One thing we also touched on in the pre-chat was that people feel a qualitative difference when they're using 5.1 to 5.2 to 5.3, and it's not super expressed in these benchmarks, because they're saturated on a number of these things. What capabilities do you really want to benchmark in an ideal coding benchmark? I guess agentic coding, or whatever you call it.

12:31

Speaker C

I mean, one thing is kind of open-ended design decisions: places where the problem maybe is a little bit underspecified, and seeing if the model can make reasonable design decisions.

12:56

Speaker A

What's a reasonable prompt for that? Like, just "vibe code me a B2B SaaS and make no mistakes"? You know, that's the meme. But okay, what's the actual usable open-ended problem like that?

13:05

Speaker C

Sure. I mean, maybe an example could be finding a way to speed up a particular part of a code base, but there might be multiple different ways to...

13:17

Speaker A

Yeah, there are dedicated performance benchmarks. I think you guys have one, SWE-efficiency, or is that... oh no, I think that's another research group's. But yeah, yeah, I mean, that is a good one.

13:25

Speaker B

I think there are just many things that people value about working with software engineering agents. SWE-bench Verified obviously measures some important capability, which is: given a description of a GitHub issue, can you produce a patch that solves that issue satisfactorily? And obviously there are some issues with the benchmark, which means that now that we're at 80%, we don't really trust further improvements on it. But it does measure something that is a real capability of models. I think as a field, though, we're moving beyond "can my coding agent solve a small GitHub issue for me?" So we are starting to look at much longer-term tasks, right? Tasks that don't take 15 minutes, but maybe an hour, sometimes days. And then, beyond what kind of tasks my agent can solve, there might be things that are a bit harder to grasp. Like Olivia talked about: does it have design taste? Does it solve the problem the way that my team likes to solve problems? Is the code nice? Is it well written, is it clean code? Is it maintainable in the future? People care about a lot of these. Maybe less tangible, and harder to measure, frankly, but things that are still super meaningful for people working with coding agents.

13:33

Speaker A

Yeah. So these are all qualities that are obviously no longer the low-hanging fruit; we have no idea how to eval all this. I think the simple question, maybe, is there are sort of two forks in the road. One is the very human-intensive, money-intensive path, which is: hire a bunch of contractors and try to annotate this. The other is: use an LLM to proxy it, and try to align the LLM so that it can give you a reasonable proxy. Which of those would you want? Do you want to do both?

15:19

Speaker B

I think maybe you should talk about GDPval as an example.

15:47

Speaker C

Sure. So GDPval is an eval that was, again, produced by a collaboration between the Human Data team and the Frontier Evals team, and it's trying to measure whether agents can do a variety of real-world white-collar work. That was an eval where grading is very hard; it requires a lot of domain knowledge about exactly what you're looking for in each different context.

15:51

Speaker A

Yeah, across like 15, 16 white-collar jobs, professions that take up a significant part of GDP, which is

16:16

Speaker C

great, high-level professions, and then a lot of different granular tasks.

16:23

Speaker A

I've said I'm a big fan. This is the eval for AGI, basically

16:26

Speaker C

Partly because it was so hard and required so much domain knowledge, the Human Data team hired a lot of people from these professions to be very involved in creating tasks, creating the gold solutions, and helping create rubrics and so forth, so we can grade it reliably.

16:32
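GDPval's actual grading protocol isn't specified here, but rubric-based grading of the kind Olivia describes generally means scoring expert-written criteria one at a time and aggregating. A generic sketch, where the criteria, weights, and the `judge` callable are all invented:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # one expert-written requirement
    weight: float      # relative importance, set by domain experts

def rubric_score(deliverable: str, rubric: list[Criterion], judge) -> float:
    """Score a free-form deliverable against an expert rubric.
    `judge` is a stand-in callable (str -> bool): a human grader, or an
    LLM grader validated against human labels."""
    total = sum(c.weight for c in rubric)
    earned = sum(
        c.weight for c in rubric
        if judge(f"Does this satisfy: {c.description}\n\n{deliverable}")
    )
    return earned / total  # fraction of weighted criteria met

# Invented example rubric for a financial-analysis task:
rubric = [
    Criterion("Cites the correct revenue figures from the filing", 3.0),
    Criterion("Flags the year-over-year margin compression", 2.0),
    Criterion("Formatted as a one-page memo", 1.0),
]
```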

Speaker A

So basically take GDPval, which is generalist, and take that same approach, apply it to code, and you roughly have a roadmap.

16:50

Speaker B

I think it's an interesting solution. I think what you're pointing out is an important problem, which is: how realistic is it? What we wanted was for coding agents to write code that we think is good, and asking humans is actually a good way to ensure that. It's also a slower, more complex way to do it. Part of why I think SWE-bench Verified ended up being super popular, and why we are seeing benchmarks like this being super popular, is that it's very easy. Validating that a solution passes all the tests is pretty trivial once you can run the tests on your computer, or wherever you're running them. You can check, okay, is it correct or is it not correct, and aggregate that. It's super simple. But it doesn't tell you: did the model solve the problem well, or is the code ugly? Would an open-source maintainer of that project actually have merged that PR? It doesn't tell you that. But there is a lot of value in having benchmarks that are both easy to compare across the industry and can be run really fast without human involvement.

16:58

Speaker A

Yeah, amazing. Your team's also put out other kinds of evals that are related, like, I think there's PaperBench, and then the more recursive-self-improvement-type evals. How much should those figure into mainstream coding evals? Is there some way in which those things join together?

18:16

Speaker C

So you're asking, should we also be building evals for self-improvement? Or are you saying, do coding evals currently cover that?

18:38

Speaker A

I just think those are some of the most advanced evals that we have, and we're not using them in the normal path. And it's an interesting split between, well, here are evals for coding normal things, and then here's the one for machine learning that is completely different. Right? I think you get what I mean. And that's mostly a safety argument, I guess, but it's also actually really useful for people to understand if the model is really good at AI code, basically.

18:46

Speaker B

Yeah.

19:15

Speaker C

My guess is that part of the reason a lot of benchmarks so far haven't focused as much on AI coding is just a question of what datasets are easy to gather, because a lot of the state-of-the-art AI code bases are proprietary. So if we make evals for that, we're probably not going to release them. And it's harder for people in the field to make evals that measure: is this a realistic research coding workflow? I do think that it's good for the field to try to measure these skills in a public way. I think it's just harder to make it realistic.

19:15

Speaker A

And then one more thing that a lot of people are trying to do, which is: well, instead of a percentage from 0 to 100, maybe we redenominate in dollars. Right? So you had SWE-Lancer, and other people are doing Vending-Bench, whatever. Any alpha in those, or do you still want a traditional academic benchmark?

19:45

Speaker B

I think in a way they're different ways to measure the same thing. If we're like, oh, this is how much money it produces, that's a fairly similar thing to saying, oh, this problem would take a human two hours to solve, or something like that. Usually they're fairly correlated; however long it would take a human to solve the problem kind of determines the value that we ascribe to a solution. And I do think that is an important thing: how complex, and how long-running, are the tasks that we are able to entrust our agents with? So I think that's an important piece. But I think here, monetary value and time complexity all try to capture a similar thing.

20:04

Speaker A

Yeah. Okay. So they're all proxies for some amount of increasing capability that we want to measure. I think that's a good thing. I think the only other major player in this field is METR, which has done the sort of long-horizon graphs, and congrats, you guys have completely destroyed the curve on that. Any takes on that? Obviously you've come out really well, so it looks good, but I don't know if that approach is something that you want to incorporate in your own work, making that the long-autonomy test.

20:56

Speaker B

Yeah. And we work with METR on these evaluations, and we do appreciate them. I think they're using time, right? They're not using money. So I think that was your question. I think complexity, however we can quantify it, is really important to understand where our models are getting to.

21:21

Speaker A

Okay. Complexity is the abstract thing, and then it projects down to time, story points, whatever, dollars. Great. One last question, on just the overall preparedness framework. People mention the preparedness framework a lot, and I don't think it's well explained to a lot of people. You actually have a nice website where, I think, it's like test and inform and teach something. I feel like you actually do a lot of work there, and I don't know if you want to talk about how the preparedness framework applies.

21:42

Speaker C

So the preparedness framework is OpenAI's public framework for how we track frontier risk. These are capabilities that are typically dual use: you can use them for good things or bad things, but we want to at least keep an eye out for the bad things, to make sure that both we as a company and broader society are prepared to handle the potential downsides. At the moment we track three different categories. One is biorisk, another is cybersecurity, and a third is research automation and model autonomy. That's what ties most into the SWE-bench work: coding is not all of automating research, but it is one very important component. So we initially created SWE-bench Verified as part of building out evals for that model autonomy workstream. And now I think we have to move beyond that, towards looking more at: can models actually start to automate research workflows?

22:08

Speaker A

Yeah.

23:02

Speaker B

Amazing.

23:02

Speaker A

Okay, Mia, anything else to add on just the general things people should know about preparedness, and how evals, human data, and alignment all work together in that?

23:03

Speaker B

I think maybe the thing that I would say is that we work really hard to build these evals, and that's why we published SWE-bench Verified, and that's why we're sharing GDPval, these sorts of things. We also deeply appreciate other people and the entire field building evals and sharing them, and we use them. Like SWE-bench Pro: we're like, yes, that's a better eval now, we should use it. So we'd really encourage people to find more ways to create and share evals that we and the entire field can use to measure progress on a variety of capabilities, including coding, because it's important to understand where we are.

23:12

Speaker A

Mia had to leave, but we're just talking a little bit about the future directions that we want evals to go in. And I think here we can dive in on: give us good work on these things, and we'll talk to you. Here's your platform to make a call for what you're looking for.

23:57

Speaker C

I think a few things would be useful. First of all, really, really hard tasks: the kinds of things that would take top-notch engineers months, or teams weeks. That would be quite good, especially if grading is reliable and, for example, you have rubrics that have been sourced and validated by many people in the field. I think that'd be quite valuable. I think also benchmarks on creating products end to end; as people are building more that way, that would be quite useful. A third thing I'd say, which is maybe not quite an eval but is still relevant to the overall mission: we as a field, and as a world, should be tracking where these capabilities are going. I'd like to see more metrics tracking real-world usage. How much is AI actually being used in the field? How much is it replacing people's jobs? How much is it augmenting people, speeding people up? Just real-world metrics.

24:15

Speaker A

Yeah, the replacement thing is always a sensitive one on the PR side of things. But we create new jobs that manage the old jobs, and that's how it is. In terms of the frontier evals that OpenAI is really excited to push: you put out really good work every single time. What should people expect from OpenAI itself?

25:08

Speaker C

I'm not sure I can say what we're going to do. General directions, I mean. General directions: I think looking at real-world impact, like real-world, real-GDP, that kind of stuff.

25:28

Speaker A

Amazing. Okay, well, I'm excited for more of the real-world impact. I think you guys have really made a lot of progress, and taken a lot of industry leadership, from SWE-bench Verified and now moving on to SWE-bench Pro. So thank you for doing this, thank you for being so transparent, and I think people will respond in kind. Thank you for your time.

25:40