Latent Space: The AI Engineer Podcast

⚡️SWE-Bench-Dead: The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

26 min
Feb 23, 2026
Summary

OpenAI researchers Mia Glaese and Olivia Watkins discuss the end of SWE-Bench Verified as a reliable coding benchmark due to saturation and contamination. They argue the field should move to SWE-Bench Pro and discuss the challenges of creating meaningful AI coding evaluations as models become more capable.

Insights
  • Popular AI benchmarks have limited lifespans - they become saturated and contaminated as models improve, requiring constant evolution to new, harder benchmarks
  • Contamination in coding benchmarks is pervasive across all frontier models, with models showing knowledge of specific implementation details they shouldn't know
  • Creating reliable coding evaluations requires massive human investment - OpenAI hired nearly 100 software engineers to validate SWE-Bench Verified
  • Future coding benchmarks need to measure qualitative aspects like code design, maintainability, and real-world applicability rather than just whether tests pass
  • The field is moving beyond simple GitHub issue solving toward evaluating multi-day, complex engineering tasks that require design decisions
Trends
  • Benchmark saturation forcing migration to harder evaluation frameworks
  • Widespread contamination across all frontier AI models in coding benchmarks
  • Shift from test-passing metrics to qualitative code assessment
  • Movement toward longer-horizon, more complex coding tasks
  • Integration of real-world usage metrics into AI capability assessment
  • Emphasis on agentic coding capabilities over simple problem solving
  • Industry collaboration on benchmark development and validation
  • Focus on measuring AI's ability to make design decisions in underspecified problems
Companies
OpenAI
Primary focus - researchers discussing their benchmark work and AI evaluation frameworks
Scale AI
Creator of SWE-Bench Pro, the recommended replacement for SWE-Bench Verified
Princeton University
Original creator of the academic SWE-Bench benchmark that OpenAI later verified
Anthropic
Their Claude Opus model showed contamination issues in SWE-Bench Verified testing
Google
Their Gemini Flash model exhibited contamination in SWE-Bench Verified evaluations
People
Mia Glaese
VP of Research at OpenAI, leads Codex, Simulator, and Alignment teams
Olivia Watkins
OpenAI Frontier Evals team member, co-creator of SWE-Bench Verified
Quinn
Mentioned for creating HLE Verified, a verified version of Humanity's Last Exam
Quotes
"The main thesis is that Sweeping Verified has been one of the North Star coding benchmarks that the field has looked at to measure coding progress. But recently we've seen that progress is kind of stalled."
Olivia Watkins
"Maybe it's hard to overstate the amount of effort that it took to create that benchmark. It was literally like many expert software engineers reviewing the problems like sequentially multiple times."
Mia Glaese
"In over half of the problems that were investigated in that deep dive, there was one problem or the other. I think the most common problem are overly narrow tests."
Olivia Watkins
"We're kind of starting to measure not necessarily like what we want to measure, which is like coding capability of our agents. But like the agent's ability to correctly guess how to name a specific function."
Mia Glaese
Full Transcript
3 Speakers
Speaker A

Okay. Hi, we're here in the OpenAI studio with Mia and Olivia from the Frontier Evals team, or however you want to introduce yourselves. Maybe you want to introduce yourselves: name what you do at OpenAI and we can get started.

0:04

Speaker B

Sure.

0:16

Speaker C

Hi, I'm Olivia. I'm on the Frontier Evals team.

0:17

Speaker B

Great.

0:19

Speaker A

Sure.

0:19

Speaker B

Hi, I'm Mia. I'm a VP of Research at OpenAI, and my teams are the Codex team, the Simulator team, and the Alignment team. And we work a lot with Olivia's team on Frontier Evals.

0:20

Speaker A

Yeah, very exciting. And, by my understanding, you were part of the original team that worked on SWE-bench Verified as well.

0:33

Speaker B

Yeah. Olivia's team, the Frontier Evals team, and the Human Data team collaborated on creating SWE-bench Verified.

0:38

Speaker A

So you've seen the evolution of coding benchmarks over time, and I think it was around mid-to-late 2024 when you first put out SWE-bench Verified. These have evolved a lot since then. What's the blog post that you have worked on that we're releasing today? What is the main thesis that you're pushing out?

0:45

Speaker C

So the main thesis is that SWE-bench Verified has been one of the North Star coding benchmarks that the field has looked at to measure coding progress. But recently we've seen that progress has kind of stalled. And basically we realized that this is because the eval is effectively saturated and also highly contaminated.

1:04

Speaker B

So.

1:20

Speaker C

So at this point, we think that it's not really measuring coding performance improvements well anymore. And we think that the field should move away from this towards other benchmarks like SWE-bench Pro.

1:20

Speaker A

Yeah. Amazing. One of the jokes I always have is that there's a group chat with all the labs and everyone just takes turns to increment like 0.1 on the charts. And then it's like, okay, well, you have the best coding model, I guess, because you're 0.1% higher. But it's not super convincing at this point at all. So, cool. I think let's reset on what the original work was that you did for SWE-bench Verified, which I think was pretty substantial. It was a very significant investment from OpenAI, which people still don't appreciate. And then, what were the issues that we found over time? So what was SWE-bench Verified, that people should know about?

1:29

Speaker C

SWE-bench Verified was kind of a cleanup of the original academic benchmark, called SWE-bench, from a lab at Princeton. The agent is basically given a code base and a task that was sourced from a real-world repository and GitHub issue, is asked to solve the task, and is graded on whether some tests pass. This quickly became a popular benchmark because at the time the field didn't really have good real-world coding benchmarks. But then, when OpenAI took a look at the benchmark as part of the evals we wanted to track in our preparedness framework, folks started realizing that some of the cases where agents were failing were due to bad problem setups rather than to models being dumb. So folks at OpenAI did a pretty extensive human data campaign, hiring almost 100 real-world software engineers to go through the problems and figure out: are the tasks well specified, are the tests actually fair? And we created a curated set of 500 tasks that we thought were much better.

2:08
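To make that setup concrete: a minimal sketch of a SWE-bench-style grading step might look like the following. This is an illustration rather than the actual harness (the real evaluation runs each instance in an isolated per-instance container); `grade_instance` and its arguments are names assumed for the sketch.

```python
import subprocess

def grade_instance(repo_dir: str, model_patch: str, test_patch: str,
                   fail_to_pass: list[str]) -> bool:
    """Apply the agent's patch plus the held-out golden tests, then check
    that the previously failing tests now pass."""
    # Apply the agent's proposed fix to a clean checkout at the base commit.
    subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                   cwd=repo_dir, check=True)
    # Apply the golden test patch (hidden from the agent while solving).
    subprocess.run(["git", "apply", "-"], input=test_patch.encode(),
                   cwd=repo_dir, check=True)
    # Run only the tests the golden patch is known to fix; the instance
    # counts as resolved iff they all pass now.
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass],
                            cwd=repo_dir)
    return result.returncode == 0
```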

Speaker B

Maybe it's hard to overstate the amount of effort that it took to create that benchmark. It was literally many expert software engineers reviewing the problems sequentially, multiple times, until, you know, basically three different experts independently decided that, yeah...

3:07

Speaker A

You didn't have to do that, you just tripled your costs for just...

3:30

Speaker B

I mean, we had to do it. We had to do it, actually, because it's quite a hard task to look at something like a problem and the patch. And it's not just the problem and the patch, right? You have to understand it in the context of the code base that the human or the model is in to solve the task. So it's a very complex problem, and it definitely needed three reviews. I think maybe we should have done more, but it was definitely a lot of effort to get there.

3:32

Speaker A

Yeah, and there's more, but people can read the blog post for that. I will note that you guys had a trend of verifying benchmarks, because I just recently saw, I think Quinn had HLE Verified, a verified Humanity's Last Exam. So now everyone's verifying everything, which is nice and good, extra quality there. Okay. But I think the meat of it is: this was a lot of, well, here's the issue or problem statement, and then here's the diffs, here's the golden tests, and here's some regression tests, right? That's the rough setup of these 500 problems. And there's some contamination; that always happens, because SWE-bench Verified was fully open. I think you did have canaries, but, you know, stuff leaks.

4:00
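For reference, a single instance in the public SWE-bench release is roughly shaped like this; the field names follow the published dataset, while the values here are invented for illustration.

```python
# One SWE-bench instance, roughly as published (values invented):
instance = {
    "instance_id": "django__django-12345",   # hypothetical ID
    "repo": "django/django",
    "base_commit": "abc123",                  # commit to check out
    "problem_statement": "Text of the GitHub issue ...",
    "patch": "diff --git ...",                # golden fix (the "diffs")
    "test_patch": "diff --git ...",           # golden tests
    "FAIL_TO_PASS": ["tests/...::test_new_behavior"],  # must newly pass
    "PASS_TO_PASS": ["tests/...::test_existing"],      # regression tests
}
```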

Speaker B

There are multiple avenues, but the problems are sourced from open source repos. So it's not just... when we publish evaluations, we publish them and then we add canary strings to ensure that, you know, they are easily filtered out at training time. Obviously, if you use data from open source GitHub, you don't actually have a canary string in there.

4:40
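The canary mechanism Mia describes amounts to embedding a unique string in every published eval file so that training pipelines can drop any document containing it. A minimal sketch of that filter, with an invented GUID (not a real benchmark canary):

```python
# Any document containing the benchmark's canary string is excluded
# from the training corpus.
CANARY = "BENCHMARK CANARY GUID 4b5c-example"  # invented for illustration

def filter_training_docs(docs):
    """Yield only documents that do not contain the canary string."""
    for doc in docs:
        if CANARY not in doc:
            yield doc
```

As the transcript notes, this only protects the published eval files themselves; the underlying open source repos carry no canary, so they can still leak into training data.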

Speaker C

And some of these are very popular repos, like the Django repository. So you're going to see many instances being used kind of throughout.

5:09

Speaker A

Yeah, yeah. Just before recording, you were telling me that you found this in your own chain of thought with 5.2, seeing that it had extra knowledge or something.

5:17

Speaker C

Yes. So this was an example where the task asked the agent to implement something, but it wasn't told that there was this specific argument that the test was going to be looking for. But in the GPT-5.2 chain of thought, we actually saw instances of the model reasoning like, hey, I think some later version of this repository implemented this particular argument. Maybe I should add it in. So this is an example of a test that would be pretty impossible to pass without this contamination knowledge.

5:26

Speaker A

Yeah.

5:51

Speaker B

And I think you found that first, right? And it triggered a whole investigation, both into our own models and also into other frontier models in the market, to understand how contaminated the benchmark is across the industry.

5:52

Speaker A

What else did you find? I have to double click on this.

6:07

Speaker B

So we.

6:11

Speaker C

And when I say we, this is mostly from other folks on our team.

6:12

Speaker B

Not.

6:14

Speaker C

Yes, but so we did some analysis on, first of all, are the tests actually fair? And this happened by first taking all the problems that o3 couldn't solve reliably, and then again getting a lot of humans to do basically another pass of digging into what's wrong.

6:15

Speaker A

Is it the same exact analysis, or were they reading o3's output and going, here's where o3 went wrong?

6:34

Speaker C

I think it was. I mean, it was definitely scoped to the set of problems that models failed. And I believe they were able to look at what the model solutions looked like versus what the solutions...

6:40

Speaker A

So this isn't the same work as the original verification.

6:49

Speaker B

It's not exactly the same work. It was a deeper dive. It's like, okay, for the problems that we don't see any model solving: is there something fundamentally wrong with those problems, or are models just not smart enough to solve them? So that's what we dug into.

6:51

Speaker A

Yeah. And you found some.

7:09

Speaker C

Oh, yes. In over half of the problems that were investigated in that deep dive, there was one problem or the other. I think the most common problem is overly narrow tests, where there's some particular implementation detail that the tests were looking for, but that wasn't specified in the problem description. So it wasn't fair to expect the model to make that particular design choice. One pretty blatant example is cases where the task asks you to implement some feature, and then the tests are looking for you naming that argument or that function with a particular name. But if you chose another reasonable name, the test would fail. And another type of bad test is tests that are just looking for additional features that were never mentioned in the problem description.

7:11
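To make the failure mode concrete, here is an invented example of an overly narrow golden test: the issue asks for retry support but never names the argument, yet the test hard-codes one particular keyword, so equally reasonable naming choices fail. The `client` fixture is hypothetical.

```python
# Hypothetical issue text: "requests made through Client should retry
# automatically when they fail." The issue never names the argument.

def test_retry_on_failure(client):
    # Overly narrow: this pins the exact keyword `max_retries`. An agent
    # that implemented identical behavior as `retries=3` or
    # `retry_count=3` fails this test despite a reasonable solution.
    response = client.get("https://example.com/flaky", max_retries=3)
    assert response.ok
```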

Speaker B

Well, that is significant. That means that if you pass a test, you actually probably did a really good job. But just because you didn't pass that test doesn't mean that your implementation wasn't a good one. So we only accept very narrow versions of solutions, and not the whole space of viable and good solutions to the problem.

7:52

Speaker A

I think it's important that you're doing this, because in some way it is you in 2025, 2026 going back in time and correcting your own work.

8:16

Speaker B

Right?

8:24

Speaker A

Because you could have caught all this in the original verified work.

8:24

Speaker C

I think so. It's definitely much harder to find a problem in the abstract than when you're looking at a very smart agent's best effort solution and trying to compare it.

8:27

Speaker A

Is it harder or easier?

8:36

Speaker C

It's much easier when you have it.

8:38

Speaker B

I think also, at the time SWE-bench Verified was published, it was a very strong benchmark. It's not like, oh, this wasn't a strong benchmark at the time. I think this is something that a lot of benchmarks go through as an evolution, right? When they start to become popular and viable, it's because they measure something important, and models maybe do like 20% correct on them, sometimes even less, and people have something to hold on to and improve models against. And by the time you hit very high performance on the benchmark, additional 0.1% improvements become sort of meaningless. So at the time, I think that benchmark was super valuable, and it taught us and the industry a lot. It's just that now, at the point we're at, where models are as strong as they are, we're kind of starting to measure not necessarily what we want to measure, which is the coding capability of our agents, but the agent's ability to correctly guess how to name a specific function. And that isn't really what we want to measure at this point.

8:40

Speaker A

Yeah, I think that's fair. If I asked you to ballpark it: most frontier models are now at 80-something. What's the actual number on SWE-bench Verified that you'd guess as the ceiling?

9:51

Speaker C

I guess that's really hard to say. When GPT-5.2 came out, folks took a look and found that it was solving like 31 problems that were in the set of "should be very hard to solve without contamination" problems. So I think it's quite possible that that number is already something we've hit, if you didn't have contamination at all.

10:07

Speaker B

Fair enough.

10:25

Speaker C

Hard to say though.

10:26

Speaker A

Yeah, cool. We're going to stop reporting SWE-bench Verified.

10:27

Speaker B

Right.

10:30

Speaker A

And then SWE-bench Pro will be the next one, which is an effort from Scale AI. What's your comparison analysis? What attracts you to SWE-bench Pro?

10:30

Speaker C

Probably the first one, I think, is just that it's harder than SWE-bench Verified. In Verified, I think something like 90% of the problems were estimated to take an expert software engineer less than an hour; they're very well specified, very self-contained. The SWE-bench Pro problems are just bigger and harder. And there's much more headroom on the eval, because it's not saturated. It has categories

10:37

Speaker A

of like one to four hours and

10:57

Speaker C

four plus. And it's more diverse: lots of repositories, multiple languages, qualitatively more different types of problems. So all that's great. On the contamination side, we also think it's better there. The way we were measuring contamination for SWE-bench Verified was with this little contamination auditor agent, which is given the description of the task, the patch, and the task ID, and told to go to the target model and, with an open-ended set of questions, try to find questions that would manage to reveal what contamination might be lurking in that model. And in SWE-bench Verified we found many instances of contamination, across OpenAI models, across Claude Opus 4.5, Gemini Flash. And in all of these we saw things like regurgitating the ground-truth solutions, in some cases giving the task IDs, and other things that are pretty clear evidence of familiarity with the repositories.

10:58
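The exact auditor protocol is presumably detailed in the blog post; as a rough sketch of the shape Olivia describes, an auditor loop probing a target model about one benchmark instance could look like this. `ask_auditor` and `ask_target` are stand-in callables (str -> str) for whichever model APIs you use, and the probe strategy is a guess.

```python
def audit_instance(task_id: str, description: str, gold_patch: str,
                   ask_auditor, ask_target, n_probes: int = 5) -> list[str]:
    """Probe a target model for memorized knowledge of one benchmark task."""
    findings = []
    for _ in range(n_probes):
        # The auditor invents a question designed to surface memorization
        # (e.g. "which file does this task touch?") without leaking the
        # answer inside the prompt itself.
        probe = ask_auditor(
            "You are auditing for benchmark contamination. Given this task "
            f"description:\n{description}\n"
            "Write one question for the target model that would reveal "
            "memorized knowledge of the task (its ID, file paths, or the "
            "exact fix) without giving the answer away."
        )
        answer = ask_target(probe)
        # Flag clear giveaways: the hidden task ID, or long verbatim
        # chunks of the ground-truth patch showing up in the answer.
        if task_id in answer or any(
            line in answer for line in gold_patch.splitlines()
            if len(line) > 40
        ):
            findings.append(answer)
    return findings
```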

Speaker B

Yeah, I mean, a task ID? That...

11:55

Speaker C

Yeah. SWE-bench Pro, on the other hand, we don't see this. I think there the auditor agent found some very light evidence that maybe a couple of models might be very lightly familiar with one or two of the source repositories, but it's very different from SWE-bench Verified. So less contamination is good.

11:59

Speaker B

I think there, also, we should expect that at some point it's not going to be the right benchmark anymore. And as a field we kind of have to continue to move on and find harder and more representative problems that we can measure our capabilities on.

12:15

Speaker A

Awesome. So let's go into that. One thing we also touched on in the pre-chat was that people feel a qualitative difference when they're using 5.1 to 5.2 to 5.3, and it's not super expressed in these benchmarks, because they're saturated on a number of these things. What capabilities do you really want to benchmark in an ideal coding benchmark? I guess agentic coding, or whatever you call it.

12:31

Speaker C

I mean, one thing is kind of open-ended design decisions: places where the problem maybe is a little bit underspecified, and seeing if the model can make reasonable design decisions.

12:56

Speaker A

What's a reasonable prompt for that? Like, just "vibe code me a B2B SaaS and make no mistakes"? You know, that's the meme. But okay, what's the actual usable open-ended problem like that?

13:05

Speaker C

Sure. I mean, maybe an example could be finding a way to speed up a particular part of a code base, but there might be multiple different ways to...

13:17

Speaker A

Yeah, there are dedicated performance benchmarks. I think you guys have one, SWE-efficiency, or is that... oh no, I think that's another research group's. But yeah, yeah, I mean, that is a good one.

13:25

Speaker B

I think there are just many things that people value about working with software engineering agents. SWE-bench Verified obviously measures some important capability, which is: given a description of a GitHub issue, can you produce a patch that solves that issue satisfactorily? And obviously there are some issues with the benchmark, which means that now that we're at 80%, we don't really trust further improvements on it. But it does measure something that is a real capability of models. I think as a field, though, we're moving beyond "can my coding agent solve a small GitHub issue for me?" So we are starting to look at much longer-term tasks, right? Tasks that don't take 15 minutes, but maybe an hour, sometimes days. And then, beyond what kind of tasks my agent can solve, there might be things that are a bit harder to grasp. Like Olivia talked about: does it have design taste? Does it solve the problem the way that my team likes to solve problems? Is the code nice? Is it well written, is it clean code? Is it maintainable in the future? People care about a lot of these. Maybe less tangible, and harder to measure, frankly, but things that are still super meaningful for people working with coding agents.

13:33

Speaker A

Yeah. So these are all qualities that are obviously no longer the low-hanging fruit; we have no idea how to eval all this. I think the simple question, maybe, is there are sort of two forks in the road. One is the very human-intensive, money-intensive path, which is: hire a bunch of contractors and try to annotate this. The other is: use an LLM to proxy it, and try to align the LLM so that it can give you a reasonable proxy. Which of those would you want? Do you want to do both?

15:19

Speaker B

I think maybe you should talk about GDPval as an example.

15:47

Speaker C

Sure. So GDPval is an eval that was, again, produced by a collaboration between the Human Data team and the Frontier Evals team, and it's trying to measure whether agents can do a variety of real-world white-collar work. That was an eval where grading is very hard; it requires a lot of domain knowledge about exactly what you're looking for in each different context.

15:51

Speaker A

Yeah, across like 15, 16 white-collar jobs, professions that take up a significant part of GDP, which is

16:16

Speaker C

great, high-level professions, and then a lot of different granular tasks.

16:23

Speaker A

I've said I'm a big fan. This is the eval for AGI, basically

16:26

Speaker C

Partly because it was so hard and required so much domain knowledge, the Human Data team hired a lot of people from these professions to be very involved in creating tasks, creating the gold solutions, and helping create rubrics and so forth, so we can grade it reliably.

16:32
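GDPval's actual grading protocol isn't specified here, but rubric-based grading of the kind Olivia describes generally means scoring expert-written criteria one at a time and aggregating. A generic sketch, where the criteria, weights, and the `judge` callable are all invented:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # one expert-written requirement
    weight: float      # relative importance, set by domain experts

def rubric_score(deliverable: str, rubric: list[Criterion], judge) -> float:
    """Score a free-form deliverable against an expert rubric.
    `judge` is a stand-in callable (str -> bool): a human grader, or an
    LLM grader validated against human labels."""
    total = sum(c.weight for c in rubric)
    earned = sum(
        c.weight for c in rubric
        if judge(f"Does this satisfy: {c.description}\n\n{deliverable}")
    )
    return earned / total  # fraction of weighted criteria met

# Invented example rubric for a financial-analysis task:
rubric = [
    Criterion("Cites the correct revenue figures from the filing", 3.0),
    Criterion("Flags the year-over-year margin compression", 2.0),
    Criterion("Formatted as a one-page memo", 1.0),
]
```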

Speaker A

So basically take GDPval, which is generalist, and take that same approach, apply it to code, and you roughly have a roadmap.

16:50

Speaker B

I think it's an interesting solution. I think what you're pointing out is an important problem, which is: how realistic is it? What we wanted was for coding agents to write code that we think is good, and asking humans is actually a good way to ensure that. It's also a slower, more complex way to do it. Part of why I think SWE-bench Verified ended up being super popular, and why we are seeing benchmarks like this being super popular, is that it's very easy. Validating that a solution passes all the tests is pretty trivial once you can run the tests on your computer, or wherever you're running them. You can check, okay, is it correct or is it not correct, and aggregate that. It's super simple. But it doesn't tell you: did the model solve the problem well, or is the code ugly? Would an open-source maintainer of that project actually have merged that PR? It doesn't tell you that. But there is a lot of value in having benchmarks that are both easy to compare across the industry and can be run really fast without human involvement.

16:58

Speaker A

Yeah, amazing. Your team's also put out other kinds of evals that are related, like, I think there's PaperBench, and then the more recursive-self-improvement-type evals. How much should those figure into mainstream coding evals? Is there some way in which those things join together?

18:16

Speaker C

So you're asking, should we also be building evals for self-improvement? Or are you saying, do coding evals currently cover that?

18:38

Speaker A

I just think those are some of the most advanced evals that we have, and we're not using them in the normal path. And it's an interesting split between, well, here are evals for coding normal things, and then here's the one for machine learning that is completely different. Right? I think you get what I mean. And that's mostly a safety argument, I guess, but it's also actually really useful for people to understand if the model is really good at AI code, basically.

18:46

Speaker B

Yeah.

19:15

Speaker C

My guess is that part of the reason a lot of benchmarks so far haven't focused as much on AI coding is just a question of what datasets are easy to gather, because a lot of the state-of-the-art AI code bases are proprietary. So if we make evals for that, we're probably not going to release them. And it's harder for people in the field to make evals that measure: is this a realistic research coding workflow? I do think that it's good for the field to try to measure these skills in a public way. I think it's just harder to make it realistic.

19:15

Speaker A

And then one more thing that a lot of people are trying to do, which is: well, instead of a percentage from 0 to 100, maybe we redenominate in dollars. Right? So you had SWE-Lancer, and other people are doing Vending-Bench, whatever. Any alpha in those, or do you still want a traditional academic benchmark?

19:45

Speaker B

I think in a way they're different ways to measure the same thing. If we're like, oh, this is how much money it produces, that's a fairly similar thing to saying, oh, this problem would take a human two hours to solve, or something like that. Usually they're fairly correlated; however long it would take a human to solve the problem kind of determines the value that we ascribe to a solution. And I do think that is an important thing: how complex, and how long-running, are the tasks that we are able to entrust our agents with? So I think that's an important piece. But I think here, monetary value and time complexity all try to capture a similar thing.

20:04

Speaker A

Yeah. Okay. So they're all proxies for some amount of increasing capability that we want to measure. I think that's a good thing. I think the only other major player in this field is METR, which has done the sort of long-horizon graphs, and congrats, you guys have completely destroyed the curve on that. Any takes on that? Obviously you've come out really well, so it looks good, but I don't know if that approach is something that you want to incorporate in your own work, making that the long-autonomy test.

20:56

Speaker B

Yeah. And we work with METR on these evaluations, and we do appreciate them. I think they're using time, right? They're not using money. So I think that was your question. I think complexity, however we can quantify it, is really important to understand where our models are getting to.

21:21

Speaker A

Okay. Complexity is the abstract thing, and then it projects down to time, story points, whatever, dollars. Great. One last question, on just the overall preparedness framework. People mention the preparedness framework a lot, and I don't think it's well explained to a lot of people. You actually have a nice website where, I think, it's like test and inform and teach something. I feel like you actually do a lot of work there, and I don't know if you want to talk about how the preparedness framework applies.

21:42

Speaker C

So the preparedness framework is OpenAI's public framework for how we track frontier risk. These are capabilities that are typically dual use: you can use them for good things or bad things, but we want to at least keep an eye out for the bad things, to make sure that both we as a company and broader society are prepared to handle the potential downsides. At the moment we track three different categories. One is biorisk, another is cybersecurity, and a third is research automation and model autonomy. That's what ties most into the SWE-bench work: coding is not all of automating research, but it is one very important component. So we initially created SWE-bench Verified as part of building out evals for that model autonomy workstream. And now I think we have to move beyond that, towards looking more at: can models actually start to automate research workflows?

22:08

Speaker A

Yeah.

23:02

Speaker B

Amazing.

23:02

Speaker A

Okay, Mia, anything else to add on just the general things people should know about preparedness, and how evals, human data, and alignment all work together in that?

23:03

Speaker B

I think maybe the thing that I would say is that we work really hard to build these evals, and that's why we published SWE-bench Verified, and that's why we're sharing GDPval, these sorts of things. We also deeply appreciate other people and the entire field building evals and sharing them, and we use them. Like SWE-bench Pro: we're like, yes, that's a better eval now, we should use it. So we'd really encourage people to find more ways to create and share evals that we and the entire field can use to measure progress on a variety of capabilities, including coding, because it's important to understand where we are.

23:12

Speaker A

Mia had to leave, but we're just talking a little bit about the future directions that we want evals to go in. And I think here we can dive in on: give us good work on these things, and we'll talk to you. Here's your platform to make a call for what you're looking for.

23:57

Speaker C

I think a few things would be useful. First of all, really, really hard tasks: the kinds of things that would take top-notch engineers months, or teams weeks. That would be quite good, especially if grading is reliable and, for example, you have rubrics that have been sourced and validated by many people in the field. I think that'd be quite valuable. I think also benchmarks on creating products end to end; as people are building more that way, that would be quite useful. A third thing I'd say, which is maybe not quite an eval but is still relevant to the overall mission: we as a field, and as a world, should be tracking where these capabilities are going. I'd like to see more metrics tracking real-world usage. How much is AI actually being used in the field? How much is it replacing people's jobs? How much is it augmenting people, speeding people up? Just real-world metrics.

24:15

Speaker A

Yeah, the replacement thing is always a sensitive one on the PR side of things. But we create new jobs that manage the old jobs, and that's how it is. In terms of the frontier evals that OpenAI is really excited to push: you put out really good work every single time. What should people expect from OpenAI itself?

25:08

Speaker C

I'm not sure I can say what we're going to do. General directions, I mean. General directions: I think looking at real-world impact, like real-world, real-GDP, that kind of stuff.

25:28

Speaker A

Amazing. Okay, well, I'm excited for more of the real-world impact. I think you guys have really made a lot of progress, and taken a lot of industry leadership, from SWE-bench Verified and now moving on to SWE-bench Pro. So thank you for doing this, thank you for being so transparent, and I think people will respond in kind. Thank you for your time.

25:40