Latent Space: The AI Engineer Podcast

[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka

52 min
Feb 26, 2026
Summary

The episode discusses Anthropic's recent blog post about detecting distillation attacks from Chinese AI labs using their APIs to train competing models. The hosts also examine the failure of SWE-bench Verified as a coding benchmark, revealing widespread cheating and unsolvable problems that led OpenAI to deprecate it.

Insights
  • Distillation detection relies heavily on usage patterns and scale rather than content analysis, making it difficult to distinguish from legitimate evaluation
  • Even heavily scrutinized benchmarks like SWE-bench Verified can have fundamental flaws: 59% of the problems that frontier models consistently failed turned out to be unsolvable
  • Chinese labs' API usage for distillation is economically rational given GPU shortages, despite terms of service violations
  • Model memorization from single-pass training remains poorly understood despite being crucial for benchmark integrity
  • The strongest models aren't necessarily the best teachers for distillation due to probability matching requirements
Trends
  • Increasing geopolitical tensions around AI model training and data access
  • Benchmark saturation and gaming becoming more sophisticated
  • Private evaluation datasets becoming necessary to prevent contamination
  • API-based distillation replacing traditional synthetic data generation
  • Evaluation costs scaling to millions of dollars for frontier models
  • Shift from public to private benchmark splits to prevent cheating
  • Growing importance of agentic benchmarks over completion tasks
Companies
Anthropic
Published blog post about detecting distillation attacks from Chinese labs using their Claude API
OpenAI
Created SWE-bench Verified and recently deprecated it due to fundamental problems with the benchmark
DeepSeek
Chinese AI lab mentioned as using Anthropic's API for distillation, though at lower volumes
Minimax
Chinese AI lab that used millions of API calls from Anthropic for model distillation
Scale AI
Created SWE-bench Pro as a replacement for the flawed SWE-bench Verified benchmark
OpenRouter
API routing service mentioned as useful for model comparisons and distillation work
ByteDance
Previously had API access cut off by OpenAI, representing early enforcement of distillation policies
XAI
Blocked by Anthropic from using their models, possibly for distillation activities
Google
Mentioned for using technical distillation with logits for their Gemma models
Cognition
Working with swyx on launching a new coding benchmark to replace SWE-bench
People
Nathan Lambert
Co-host discussing AI distillation and benchmark issues; hosts the Sail Live series
Sebastian Raschka
Co-host providing technical insights on distillation methods and model training
swyx
Latest writer joining the Sail coalition, contributing AI media content
Jeff Dean
Google executive interviewed about Gemini model architecture and distillation practices
Quotes
"I'm of the opinion that the Chinese labs, like obviously should do this. They're in a massive GPU shortage and using APIs is way easier than generating synthetic data on their own."
Nathan Lambert
"The strongest model is not necessarily the best teacher and most of us in this area think it's due to like some you have to match the probabilities of the tokens to the base model."
Nathan Lambert
"59% of them cannot even be solved at all because the original benchmark was still slop stuff got through that was not solvable."
swyx
"If this happens to Sweepbench Verified, which I think is the most scrutinized benchmark in the world... literally 80 point between 1 and 9, let's say where there's almost zero variation."
Sebastian Raschka
Full Transcript
3 Speakers
Speaker A

Okay, we're live. We have one person. People will start trickling in. Thanks for coming to Sail Live number six. This is a very exciting one. The topics are always fun with these, whatever is the topic of the day on our little rat-racing minds trying to keep up with AI. But we're welcoming the latest writer that is joining the Sail coalition, so I think this just means more content for Sail. I've been a fan of swyx and a friend for a while at this point, so I'm very happy to have his content join this. And I think you've been doing great stuff recently and continuing to evolve this.

0:00

Speaker B

So thank you, sir.

0:37

Speaker C

Welcome to the team.

0:38

Speaker A

This is my friends and colleagues in the AI media space and it's just great to be able to support people and keep that network closer. So welcome.

0:39

Speaker C

Yeah, thanks for. I just wanted to say thanks for joining us. It's really a pleasure to have you on here, Sean, or swyx. I just coincidentally listened to your podcast about the SWE-bench benchmark. So, yeah, small world. Awesome to have you here.

0:49

Speaker B

Yeah, thanks for having me, and yeah, just glad to be on and chat. I've never done one of these Substack live things, so I'm curious how it works, because I always think about Substack as a newsletter platform, but they want to go multimedia.

1:09

Speaker A

I think the live thing, before we get to technical content, is actually good because it gives it a different edge. It's just a little bit sharper when you know you're live. I think we've all done a lot of podcasts, even podcasts that are unedited and just put up later. But I think the live thing is a different element that can be tapped into nicely. So I don't know, why don't we just dive into it? We're going to start with distillation. I put how models cheat in the title so we can talk about benchmarks. Anthropic posted this pretty spicy blog post this week, essentially detailing how they found distributed distillation, quote unquote, attacks on their services from prominent Chinese labs. And I'm very unsurprised with Anthropic calling it an attack; I think that fits with a lot of their branding. Okay, nice screen share. Sean, swyx is such a pro with the screen share. This was only dropped a few days ago. But essentially Anthropic is detailing how they found distributed accounts across multiple Chinese labs building state-of-the-art LLMs, and described what they were doing and why Anthropic is concerned about this in their worldview of AI geopolitics. And I think it's very interesting, because I'm of the opinion that the Chinese labs, like obviously should do this. They're in a massive GPU shortage and using APIs is way easier than generating synthetic data on their own.

1:25

Speaker C

And I think, if I may interrupt you here, maybe we should, just for the general audience, define distillation before we go into the details. So distillation, that's a broader concept. It's not a new concept that came up with LLMs; it's an older concept in machine learning in general. And distillation essentially is the idea that you take a larger model, let it generate outputs, and train a smaller model on those outputs of the larger model. And the idea is that you can train the smaller model more efficiently using that larger model. Originally, I think you just brought up the paper here, what you would do is train on the logits. So old-school machine learning people might remember, from deep neural networks, the logits, the outputs of the last layer, that you usually work with them to compute the loss function, the cross-entropy term, and you would train on this signal. And nowadays, in the context of LLMs, it's a bit more loose. So it does not have to be these logits that you train on; it could be just the output data, synthetic data, like Nathan just said. So for example, it's actually a very common practice, for example in the DeepSeek R1 paper, and other companies do this too: they would train the flagship model, the largest model, the R1 model with 671 billion parameters, and then they would create smaller variants, I forgot the numbers, but something like 1 or 3 billion, in a smaller range, these small models you can run locally, and they are trained on the outputs of their own larger models. I think now the thing is also, of course, this is very common practice, everyone does that when they are producing the smaller model variants. Now, I think the question or the point Nathan brought up is what happens if you are a company and you generate this synthetic data from another company's LLM and then train your own model on it. So, sorry, that was just a little interruption. But yeah, distillation, in short, is training a smaller model on the outputs of a larger model. Basically, yeah.

2:53
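
To make the two flavors of distillation concrete, here is a minimal sketch, not from the episode: a classic logit-matching loss versus plain next-token SFT on teacher-generated text. The function names and the temperature value are illustrative assumptions.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """'Technical' distillation: match the teacher's full token distribution
    with a KL divergence on temperature-softened logits."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in the classic formulation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def sft_distillation_loss(student_logits, teacher_token_ids):
    """LLM-era 'loose' distillation: ordinary next-token cross-entropy (SFT)
    on text the teacher generated; no teacher logits are needed."""
    vocab = student_logits.size(-1)
    return F.cross_entropy(
        student_logits[:, :-1, :].reshape(-1, vocab),  # predictions for positions 1..T
        teacher_token_ids[:, 1:].reshape(-1),          # teacher tokens shifted by one
    )
```

The first loss needs access to the teacher's full output distribution, which is why it is mostly done with open-weight or in-house teachers; the second only needs sampled text, which is what API-based distillation amounts to.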

Speaker A

And I think this is even possible at the frontier. So, like, people distill from something like Claude Opus to build Claude Sonnet. They're generally doing very similar things internally; they have access to different tools, like richer tools. And then the other context is that all of these large labs for years have had terms of service where they say that you effectively cannot use the outputs from these APIs to train something like a competitive AI model. These are vague terms, and terms of service are not a contract. Essentially, terms of service is something where you are using a service, and if the provider finds you violate it, they can cut off your access. That's just kind of a basic thing. So these have not been enforced within the US much at all. I think there was one case, maybe ByteDance a year or two ago, where OpenAI cut off their API. But this was discussed so much right after ChatGPT, when people were building the first open models on things like Alpaca. And it's like, is OpenAI going to come after us for doing these research models? And it totally died down. People were worried about this for over a year; it was kind of an inseparable discussion. So nothing really happened. And then this is the first prominent reemergence of the discussion. I think it's because people are far more worried about AI competitiveness. I'm curious what you guys think.

5:05

Speaker C

Yeah, can we talk a second about even how they would detect this? Because you said in the beginning something about a distillation attack, and you didn't say that specifically, but you kind of implicitly put quotation marks on attack. So how would you even detect that? So I think, I mean, distillation in that context means really literally just letting ChatGPT or Claude generate synthetic data, and then you collect that synthetic data and train your own model with supervised learning, supervised fine-tuning, on it. But then how would you even detect that this is a distillation attack versus just an evaluation? Because right now I'm actually running, I mean, I'm distilling myself for chapter eight of my book, but I'm doing it with open-weight models. So no worry, Anthropic, please don't worry about it.

6:25

Speaker A

I distill from API models for my job.

7:12

Speaker C

Yeah, I use OpenRouter right now, just distilling from the DeepSeek V3.2 model, which I think these folks are okay with. But what I wanted to say is, when I'm evaluating models I use basically almost the same script. So when you're evaluating a model, you have a question and you let the model generate the answer, right? You generate the response to your benchmark question. And in my benchmarks I have a math dataset with 500 examples, and I have a bigger math dataset of 12,000 examples. So you're basically just running an API in a loop to let it generate these questions and, sorry, the answers. But then how would a company know, okay, this person is just evaluating, versus this person is now saving that data and then training their own model on it? You see what I'm saying? It's the same.

7:16
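
What Sebastian describes is essentially a loop like the sketch below: the same API calls could be producing an eval score or collecting distillation data, which is the detection problem. The client, model id, dataset file, and grading rule are placeholder assumptions, not details from the episode.

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works the same way

client = OpenAI()

def ask(question: str, model: str = "some-model-id") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.0,  # keep answers as deterministic as possible for evaluation
    )
    return resp.choices[0].message.content

problems = [json.loads(line) for line in open("math_benchmark.jsonl")]  # hypothetical file
correct = 0
for p in problems:
    answer = ask(p["question"])
    correct += int(p["answer"] in answer)  # naive grading; real evals parse answers more carefully
print(f"accuracy: {correct / len(problems):.3f}")
```

Saving each answer to disk instead of (or in addition to) grading it is all it takes to turn this into a distillation data collector, which is why providers fall back on volume and account patterns to tell the two apart.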

Speaker A

I think it's a scale thing. So when you're evaluating, at least a basic eval, you're going to run it once and not keep doing it. There's some threshold. I mean, they say stuff here, but I think most of it is quantity. And then they're going to look at patterns across similar accounts, is what they're doing.

8:08

Speaker B

Yeah, exactly.

8:30

Speaker A

I think they're going to see like really repetitive stuff.

8:31

Speaker C

Yes. So I think the interesting point this leads to is, I mean, you can do evaluation at a large scale. If you are a big company, you want to know whether your LLM performs really well, and you have a large suite of benchmarks you are going to run. But then you said maybe looking for patterns. So maybe one way would be: okay, this is a familiar question, it comes up in benchmarks, so this person is maybe not stealing our answers, they're just using it for benchmark purposes. But then it kind of means that they are looking at what you're generating there, you know, which is, I mean, of course nothing is private when you are using LLMs on the Internet; the data is somewhere intermediately stored. But then it almost implies that they are checking what you use the LLM for, what you generate, which is kind of a sensitive topic, almost privacy-wise, right? So that's kind of an interesting point, because, I mean, of course you mentioned the terms of service, that you are not allowed to distill. But the point I'm trying to make is you're not distilling live when you are on the platform; you are doing it somewhere later. You're just letting the LLM generate answers. And I find it kind of interesting that a company would look at that, even at that scale, and, you know, call you out like, hey, you are generating too many answers here, that's not cool, or something. That's kind of a weird thing.

8:34

Speaker B

Yeah. I wanted to respond to something from a few sentences back, but actually Anthropic has blocked US companies first, before the Chinese companies. It has blocked both OpenAI and xAI from using the models, and I think maybe explicitly accused xAI of distilling stuff, I don't know, but definitely not in a full blog post like this. So this one is definitely the most high-profile case. I do think it is actually pretty hard to distinguish from, hey, I'm just running my internal benchmarks, and of course it's going to be a very high volume of the same stuff, because especially for some benchmarks you have to run three to five times. These are the exact same questions, right? I do think obviously if you get to the tens of thousands, hundreds of thousands, then you're not just running benchmarks, you are distilling this thing.

9:55

Speaker C

There is a good point in the chat: what would the distribution of questions look like if you are distilling? And I think, related to your point, at a certain point, when you have a certain magnitude of answers generated, it might look suspicious. But there are a lot of legit use cases. If a company uses your, let's say, OpenAI or Claude API for their own chatbot and they have a lot of customers, it's naturally a lot of answers that are generated. And so they would probably look at distributions. Like, maybe you would expect a very broad distribution when you are distilling, because you want to cover pretty much everything. And when you are running benchmarks, it's maybe more specific: you are running a math benchmark, it's just math. Or if you have a customer chatbot, it's more like customer answers. But yeah, I think they would maybe analyze your distribution. I feel like this is kind of a weird thing to do. If you're a company and you're looking into your customers' generated data, you know, of course you have to expect that it's not private, but it's still kind of a weird thing that they do that, essentially.

10:57

Speaker B

Okay, what else do we have to talk about? Is it interesting? Okay, so one thing, this is a little bit of Substack authors going back and forth. One thing I did was I threw it into Nano Banana, which is kind of a decent visual.

12:08

Speaker A

Right. Throw it into Nano Banana 2. It's a Nano Banana 2 live pod. Like it just released five minutes ago.

12:26

Speaker B

This is actually Nano Banana 2. Because I'm in the access program, they cut you over to the new Nano Banana and I couldn't access the old one. So I was trying to do like a diffuser and I couldn't do it because, classic.

12:34

Speaker A

That is classic early tester program. Shit, look at the pain we have to deal with here.

12:48

Speaker B

Is it interesting that DeepSeek used so much less than Minimax? I think, Nathan, in your write-up you had a little bit of a

12:54

Speaker A

comment about how this is a political blog post, in a way. Maybe not political, but they're trying to make a point; it's more about making a point than the details. The DeepSeek thing is definitely way smaller scale. Most of the labs will experiment with all the APIs they can get access to. Data is just so important, and you're going to have a pipeline where you could sub in any API and then run an ablation to see if it gives you performance. The API is kind of free, so just do it. The millions of exchanges is a bit more of a bet. You can measure that, but the millions of exchanges is like tens of billions or a hundred billion tokens, and it takes a lot longer to actually get that out of the API, especially when they have to spread it across a ton of accounts. These accounts are all rate limited and have other problems; that takes longer. But this tiny one is so fast. That was generally my point, that it made it clear Anthropic is trying to use the DeepSeek name, as the only Chinese AI name that people in

13:00

Speaker C

the US know, marketing-wise, to make it, you know, stick. Yeah, actually, you mentioned the different APIs and everything. I'm not sponsored by them and I have no affiliation, I've never talked to anyone from that company, but OpenRouter, for example, is a good example. I've been using it a lot for the open-weight models, because the bigger ones are too big to run locally. And what's nice is, it's basically just routing you through other companies' APIs, and they select automatically, at that point, what is the cheapest one. I sometimes get some failures, I think when it switches, sometimes it crashes, but that's maybe something I have to fix in my script. So even then, if you're distilling, you can do that from multiple providers. But yeah, of course, if you want something from ChatGPT or Claude, it's always going to go through the official one, and then it gets, I guess, suspicious. But you could also technically distill a bit through OpenRouter, through their account instead of your direct account. You can make multiple accounts, and it's kind of interesting that they track all that. And yeah, different topic now, that they call out DeepSeek, which is quite interesting.

14:03

Speaker B

For what it's worth, OpenRouter seems to not be using DeepSeek in most of these.

15:18

Speaker A

These are free models. DeepSeek's not.

15:24

Speaker C

Yeah, I mean, I'm using the paid API. I should also say it's also nice: they show you how much it costs and the tokens per second for different providers. So if you go to the search at the top, you can go to the different DeepSeek ones. I just like it because I do a lot of model comparisons. And then this one is an older model, so maybe it only has one provider, but if you go to, I think, DeepSeek R1 or something, or even the normal 3.2, there should be multiple providers. If you scroll down, yeah, you can see there are different providers and different tokens per second, different costs. I just like that website because it's just quick to use the API, and they have an OpenAI-like API. It's not sponsored or something; I just find it generally useful. But yeah, just a side note.

15:26
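
For readers who have not used it, OpenRouter's OpenAI-compatible endpoint looks roughly like the sketch below. The base URL is its documented endpoint; the model slug is illustrative and may not match current listings.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # OpenRouter picks a provider for this slug behind the scenes
    messages=[{"role": "user", "content": "Summarize grouped-query attention in two sentences."}],
)
print(resp.choices[0].message.content)
```

Because only the model slug changes between runs, the same comparison script can be pointed at many models, which is the workflow Sebastian is describing.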

Speaker A

Do you want to go back to the comparison? Did you have a high-level point to make there?

16:21

Speaker B

Oh, okay. Just a couple. One, I think the timing, post Moonshot releasing their stuff, post Minimax releasing their stuff, but pre DeepSeek V4, I think that was strategic. I think that may also have factored into why Minimax was more detected, had a higher number. So when you collect data is actually very important. And so they interrupted, or they found, Minimax during the training of Minimax 2.5.

16:25

Speaker A

Right.

16:58

Speaker B

Which we will confirm later on if we do end up doing the call with them. And so obviously the number is going to be very high because they're actively looking for it. And then they banned the Minimax accounts and Minimax changed. Actually, I don't think that's exactly what happened. Sorry, let me correct myself. While Minimax was distilling, Anthropic released Opus 4.6, and they said that the accounts redirected nearly half their traffic. So I'm like, okay, very clearly this is them, right? This is the same exact traffic; it switched to a new model the moment a new model releases. Okay, cool. DeepSeek maybe wasn't doing that because they hadn't been working on their stuff actively. I don't know, it could be a different thing. Or DeepSeek is just way more efficient: I get all I need for 150k; you guys are so inefficient.

16:58

Speaker A

It would be so interesting if we knew the time frame of this. Like, are all of these API requests within the last four weeks? Are they within the last six months? Like that's such a different nature of what is going on.

17:48

Speaker B

Exactly right. That's what I'm saying. Like, DeepSeek was training 3.1, 3.2, you know, a year ago.

17:58

Speaker A

Like, yeah. Or, I don't know, DeepSeek-OCR. They were like, we don't. I guess they said what it is, but it's not that.

18:03

Speaker B

Yeah, yeah.

18:10

Speaker C

But also, scale-wise, I do think, yeah, Minimax is three times smaller. They don't use MLA and they don't use the DeepSeek sparse attention; I think it's just grouped-query attention. But it is still a pretty snappy model, so I think it's just attractive to use it. And the other one, top of my head, I don't know, maybe they had some free tier or something like that, where, I think, when the models come out they sometimes offer free usage. And that was a more recent model than, I think, DeepSeek; the last one was from December, the V3.2.

18:10

Speaker B

Yeah, yeah. So, you know, maybe this is an irrelevant point because they were training before and would have the same amount of traffic or they're just way more efficient. Right. It does bring to mind like the

18:46

Speaker A

efficiency thing is not it. I can guarantee it. Like that is not.

18:57

Speaker B

Yeah, there's a small chance that they.

19:02

Speaker A

It's like there's a chance that they got the right research idea early and like found the right data to use. But it's not that they're like going to be 10.3x more efficient.

19:04

Speaker B

Okay, so it's a timing thing or they just actually don't use it that much. I mean, you play this out. I was like, okay, why don't they share?

19:13

Speaker A

Right?

19:22

Speaker B

They're all buddies. And it does come to a point where, okay, let's have all of China just distribute it to every citizen.

19:23

Speaker A

I can talk about this a little bit. There's not a lot of research, but there are a few research projects trying to understand how you use distillation data. I think SFT is the cleanest example, where you're just doing this autoregressive loss on Q and A pairs. But the strongest model is not necessarily the best teacher and most of us in this area think it's due to like some you have to match the probabilities of the tokens to the base model. So what's happening is that Qwen dense models are the best teachers for a lot of open-weight models, and I think that's because a lot of open-weight models are either Qwen or have been Qwen-like for a while. So OLMo learned really well from Qwen, and obviously other Qwen models did. But scaling these pipelines up to use, say, GLM 4.7 or a bigger DeepSeek model or a more recent big Qwen MoE, for all of these it's a lot harder to just generate the data from the same prompts with the right sampling settings and then do SFT on them and actually make the numbers go up. Interestingly, GPT-OSS is a pretty good teacher, but there's a huge gap there, where just because you have this data does not mean it's actually going to make your model better. So you have to do the research to be like, oh, we learned that we get signal out of Claude, we need to get 100 billion tokens ASAP because it's going to immediately make our model better. That's not a common place to be in in modeling, because of this weird teacher-student dynamic going on. So I could see that being different across labs.

19:34
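
A minimal sketch of the pipeline Nathan describes: sample completions from several candidate teachers over the same prompt set with fixed sampling settings, write out SFT pairs per teacher, and then check whether the student's evals actually improve. The client setup, teacher ids, and file names are assumptions for illustration.

```python
import json
from openai import OpenAI

client = OpenAI()  # or an OpenAI-compatible router; the endpoint choice is an assumption

CANDIDATE_TEACHERS = ["teacher-model-a", "teacher-model-b"]  # hypothetical model ids

def generate_pairs(prompts, teacher, temperature=0.7, top_p=0.95):
    """Sample one completion per prompt with fixed sampling settings."""
    for prompt in prompts:
        resp = client.chat.completions.create(
            model=teacher,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            top_p=top_p,
        )
        yield {"prompt": prompt, "response": resp.choices[0].message.content}

prompts = [json.loads(line)["prompt"] for line in open("prompts.jsonl")]  # hypothetical prompt set
for teacher in CANDIDATE_TEACHERS:
    with open(f"sft_{teacher}.jsonl", "w") as f:
        for pair in generate_pairs(prompts, teacher):
            f.write(json.dumps(pair) + "\n")
    # Next step (not shown): run standard autoregressive SFT on each file and compare evals,
    # since, per the discussion, a stronger teacher does not guarantee the numbers go up.
```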

Speaker C

I can. Yeah. I think it also has something to do with, I noticed, if you are distilling the smaller model from the same model family, it performs better. And I think it's to your point that if you have a very, very strong model, it might also be too different, or if the style is too different, then it's too much of a leap for your model to adapt. It's too different from the Q and A answers during the pre-training, and so it has to make a bigger leap. And another thing I wanted to say: you mentioned OLMo, and it's been a while since I read the paper, and you might know way better than I do, but I think you also trained on the logits and did the technical distillation, maybe?

21:04

Speaker A

We didn't. We just took the tokens.

21:45

Speaker C

Oh, I see, I see, I see. Okay, then it was probably a different paper. I think Google does that for their Gemma models.

21:48

Speaker A

Yeah.

21:52

Speaker C

Because here there's also the distinction, because you mentioned Qwen and other models: you can only do that for open-weight models, because if you do that for Claude or OpenAI, that would not work with the logits, because they don't provide them. They only provide them for some tokens, like 100 or a thousand top tokens.

21:53

Speaker B

Tokens.

22:09

Speaker C

And so, in a sense, if you want to do the real, in quotation marks, distillation, it is kind of even easier to do that from open-weight models, because you can control it. But then also, like you said, we need 100 billion tokens ASAP: that is not an easy thing to do, because it's like 40 tokens per second or something for these large models when you generate answers, and getting those millions of tokens takes time, right? So it's almost easier to start distilling from a medium model. So it's the question of more data versus more high-quality data, right? So it's also a sweet spot, like an experiment itself, an ablation study, right?

22:09
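
To illustrate the limitation Sebastian mentions, here is a minimal sketch of what a closed chat API exposes: only a short list of top alternatives per position, not the full vocabulary distribution. The model id is a placeholder, and the cap of 20 alternatives is an assumption about typical current limits.

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="some-model-id",  # placeholder
    messages=[{"role": "user", "content": "2 + 2 ="}],
    max_tokens=5,
    logprobs=True,
    top_logprobs=20,  # only the top alternatives come back, not the full distribution
)
for position in resp.choices[0].logprobs.content:
    alternatives = [(alt.token, round(alt.logprob, 2)) for alt in position.top_logprobs]
    print(position.token, alternatives)
```

That is enough for calibration-style analysis, but not for the full distribution-matching loss sketched earlier, which is why logit-level distillation is mostly done with open-weight or in-house teachers.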

Speaker B

Yeah, I like that Nathan had to call it technical distillation, because it is no longer the default, even though it was the first. Also, I'll note a fun fact: I did my Jeff Dean interview recently and I tried to get this out of him; he sort of dodged it a little bit. Remember, there were actually three sizes of Gemini models: there was Nano, Pro and Ultra. And I was like, where's Ultra? They keep it in the basement and they distill from it. Right.

22:52

Speaker C

Interesting. Yeah. Maybe it's also to safeguard yourself so no one can distill from it, or it's a pricing thing, but probably both.

23:23

Speaker B

Yeah. I mean, this is how I always think of it: the model you deploy is never the model you train, because you train the dense and then you deploy the MoE.

23:32

Speaker C

Right.

23:42

Speaker B

You basically always do it at every lab.

23:43

Speaker C

Say more.

23:47

Speaker A

Do you think they're really distilling from dense models?

23:48

Speaker B

I mean I think that is the

23:50

Speaker A

full

23:53

Speaker B

just unlimited resources. Don't care about inference, just care about maxing intelligence. Why not?

23:56

Speaker A

Yeah, I'm not 100% sure. I think the MoEs just give you a flops advantage. I don't know if that's actually how I think of the gains of MoE when you have a really good MoE architecture. But I do think that they have bigger models that they distill from, and they train internal models differently than external, because the external models have been getting a lot smaller, which is the kind of weird thing. We don't have a good way to measure it. Maybe Dylan will backwards figure it out, and InferenceMAX or whatever the heck, they'll be able to deduce model size.

24:01

Speaker C

But I'm always suspicious with these things. It's also really a capacity thing: how many people use the model at the same time, the hardware, how much is allocated. It's maybe a rule of thumb, but it's really tricky. I think it's really hard to say anything from these numbers.

24:32

Speaker A

I do think that they might start restricting models to only being in products and not being in an API. I think the whole API business is brutally competitive, and I don't have a good sense of what the defensibility of it is. It makes sense for something like Google and Azure, or any existing cloud business, to have APIs; that's kind of a more natural transition. But for Anthropic and OpenAI, the API is different from their products, which are their big differentiation, whether it's ChatGPT or Claude Code and Codex. You don't get people to go use the API from that, and I think you get a lot of people that are already spending on clouds that then go use the APIs, which is why Lambda and Nebius are going to have these API products. But if Claude's really worried about distillation, they should put the model release in Claude Code ASAP and then just not bother with the API. I don't know when that'll happen, but it could.

24:49

Speaker C

I do think, though, it's a big customer base, the API customer base: any type of product that is built with LLMs, like customer chatbot types of things. But also, more generally, I don't know exactly how the plans work in Claude, but you would reach a token max where you can only get so much with your subscription. You can, I think, buy more tokens, but I think it's just easier with the API at a certain scale. And also the whole OpenClaw customer base, right, because they don't allow the plan anymore in the OpenClaw context, so you have to use the API. And I do think, given how many tokens OpenClaw generates, it's actually not a bad business if you don't lose money on these tokens, if you sell them at a not-subsidized price. I do think the API is actually not a bad business model.

25:44

Speaker B

Yeah.

26:37

Speaker A

Take a side. Do you want to try to tie break? I'm obviously being provocative, like I don't really know, but I could see it like Anthropic gives Apple vibes to me.

26:38

Speaker B

I mean, Anthropic has a higher chance of doing this than OpenAI. With OpenAI, just because I have talked to the people so much, I just don't super believe that they will lock models to products only, out of, I guess, idealism and principles rather than economic incentive. Economic incentive would agree with you that they should have private models in products. And recently they've done this: the last three GPT-5s all had Codex variants that were two to four weeks ahead, released only inside of Codex rather than as an API. So they're starting to get there. But just constitutionally, I don't think the people that run these things believe in locking things behind products, because they have such a huge market anyway, so they kind of don't care. And then also, if you're genuinely sort of a zealot, if you're not trying to maximize the value of your company and genuinely just trying to spread AGI everywhere, then you release the API, because you just don't know what people are going to build with it.

26:47

Speaker C

One more thing, though, with the Codex thing: we will have to see next time, because this time it might also be a bit biased towards releasing it in Codex, because they released it almost simultaneously with their app that they want to promote at the moment. It could have been that they did that so that everyone checks out the app, but we'll see if it's always

27:57

Speaker B

a two, two to four week exclusive window. Yeah. And you know that's their right.

28:19

Speaker C

Yeah, sure.

28:25

Speaker B

Want to promote Codex. That's pretty effective. We have a bunch of questions in the chat for, like, other topics.

28:27

Speaker C

Yeah, let's.

28:32

Speaker B

Do we want to cover benchmarks and then this thing or.

28:32

Speaker A

Go right ahead.

28:36

Speaker C

Yeah.

28:37

Speaker B

What do you want to do? It's your substack.

28:38

Speaker A

I don't know. Oh no. Oh man. It's a collective.

28:40

Speaker B

It's a collective.

28:43

Speaker A

You should just dive into what you're interested in. Just go.

28:44

Speaker B

I mean, Sebastian was interested in the SWE-bench stuff. So this past week SWE-bench Verified died, officially.

28:47

Speaker A

What do you mean by this?

28:55

Speaker C

Yeah, let's define SWE-bench first, maybe.

28:57

Speaker B

Okay, I happen to have the post on this, so let me just remember.

29:00

Speaker C

So the broader topic, the umbrella topic here, is how do we compare which LLM is currently the best LLM. One of the ways would be SWE-bench, basically. But then, yeah, I will maybe let you explain, because you had this brilliant podcast or article.

29:06

Speaker B

I mean, okay, where do you want me to start? Should we just define SWE-bench, I guess?

29:23

Speaker C

I guess, yeah. So basically, it is a coding benchmark, and SWE-bench is a popular way to compare capabilities of LLMs. And then there is SWE-bench Verified. But maybe, yeah, we should talk a bit more about SWE-bench first.

29:28

Speaker B

So SWE-bench was a paper out of Princeton from Ofir Press's group, and they do a lot of good code benchmarking work. It happened that they just drew thousands of example open source issues, and the PRs that closed those issues, from open source projects. There's a bit of selection bias here because they only focus on popular open source, and only a small number of popular open source projects, but a large number of issues from those projects. And then they dredged up some passing tests and some failing tests that you need to make pass in order to get the score. When it launched, it was kind of obscure. Devin actually was the first one to choose it as a benchmark to report, and then it went from, I think at launch it was like 13%, and now everyone's at 80%.

29:47

Speaker A

Something like that.

30:39

Speaker B

SWE-bench, because it was done on like a student budget, was kind of, let's call it, sloppy or whatever.

30:40

Speaker A

Terminal Bench is like this now too. Like they're just aggregated. It's like hard to do a benchmark that is well calibrated across topics at different.

30:48

Speaker B

Yeah, yeah, it is hard. It is hard. So, you know, for the small group that is watching, I'm actually working with Cognition to launch a new benchmark here. But yeah, so OpenAI was like, okay, guys, SWE-bench is taking off; we are going to adopt this, but we refuse to abide by the full SWE-bench. We're just going to actually go and curate a 500-problem subset of the original SWE-bench. And they actually hired humans to go and vet it; I think it's somewhere inside of this blog post, but basically they hired three humans for every task to vet whether the task was high quality or not, because there's a lot of slop in there. And they were like, okay, this is the 500 that we're going to endorse.

30:56

Speaker C

It's like a curated subset of SWE-bench: 500, let's say, challenging problems that are supposedly well defined. Yeah.

31:41

Speaker B

And what's really funny is that at launch. So this was launched in 2024. At launch, OpenAI could not run all of its own 500. So for a while, there was like a few releases from OpenAI that reported on a subset of the subset because they couldn't run it on their eval infrastructure. So their numbers were higher because their denominator was lower, which is very funny.

31:51

Speaker C

Anyway, maybe in that context we should say what SWE-bench kind of looks like. It's basically code that has bugs in it, and usually the task for the LLM is to fix the bug in the code.

32:15

Speaker A

Right.

32:27

Speaker B

It's right here. The whole thing's open, which becomes a problem in the future. But right now you can see the whole thing.

32:28

Speaker A

Right?

32:33

Speaker B

You can see the repo, the issue ID, and the problem statements. And then you also have the tests that you're supposed to pass and fail. So it's all here on Hugging Face, and you can see that it's at 500. Anyway, let's not get too lost in the details on sort of

32:34
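
The dataset swyx is scrolling through can be inspected locally with a few lines. The dataset id and field names below are assumptions about the public Hugging Face release, not something stated in the episode.

```python
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(swebench))  # expected: the 500 curated instances

example = swebench[0]
print(example["repo"], example["instance_id"])
print(example["problem_statement"][:300])
# Tests the patch must keep passing vs. tests it must newly make pass:
print(example["PASS_TO_PASS"][:200])
print(example["FAIL_TO_PASS"][:200])
```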

Speaker C

I just wanted to define the context: this is a coding benchmark, essentially 500 examples that are available on the Internet.

32:53

Speaker B

Yeah, okay. And then if you want a bit more historical context, this is like a step up from HumanEval, which is more about completions, right? This was, in my mind, the first proper agentic benchmark, I guess apart from Tau-bench, where they give you the problem and the end result and they don't really specify how you're supposed to get there. Whereas I think a lot of previous benchmarks, like the MMLUs of the world and HumanEval, which is in the coding domain, also released by OpenAI, were very much: here's the problem statement, and then give me the right answer immediately after, without that many extra files or anything that you're supposed to run. So the other ones are more autocomplete, this one is more agentic. It's all a spectrum, obviously, because you can use agents to solve autocomplete, but that's not what HumanEval was testing. Anyway, I wanted to make sure people understand that OpenAI actually invested a lot of money and effort into making SWE-bench Verified.

33:01

Speaker A

How much money do you think this costs?

34:02

Speaker B

Oh, my God, don't do this millions,

34:04

Speaker A

I would guess order of a couple. Like it could be of a few

34:06

Speaker B

million, but probably, yeah, I'd say a couple million. So basically you do, like, okay, what's the first filter pass? And then, okay, it's 500 times three, because they had three people per task. And then maybe a couple more verification passes or whatever, right? So then, this year, they're like, oh, well, not only is it saturated because of progress, everyone just takes turns to increment by 0.1 every time they release a new model. It's bullshit. It's obviously bullshit. The inherent noise in just running these models varies by 0.5 to 1 every time you run it. You just choose the highest one every time you run.

34:08

Speaker C

A little nitpick. I don't think it can be 0.1% because what you said before, because it's 500 examples. I think the smallest increment is 0.2%. If I.

34:50

Speaker A

Okay, they might average because like, but like little detail. Yeah, sorry.

35:01

Speaker B

I think in. So as we progress to the next era of benchmarking the N. So the N here is 500, right? The N doesn't directly correlate to the percentage points because you get sub points as well.

35:05

Speaker C

Ah, yeah, good point, good point.

35:19

Speaker B

So, like Terminal Bench, even though it has 90-something tasks, you can get subdivisions of less than 1%. Anyway, so not only do they have this, they actually audited their own benchmark. They were like, okay, how come everyone is saturating at 80%? What's up with the remaining 20%? How come everyone's failing at it? And they were like, oh, actually, we paid even more people, six people per task now, with an extra team if any sort of positive identification is found. And: 59% of them cannot even be solved at all because the original benchmark was still slop, stuff got through that was not solvable. And I actually tried to illustrate this in my post. So here, this is an impossible test, right? Okay, so here's an example. This is the sort of value-add I did on top of the original post. Here's an example of a SWE-bench Verified task that passed the first round of human verification.

35:21

Speaker A

Right?

36:16

Speaker B

So here's the task: we want to implement Python type hints or something. We want to see the expected behavior; I want to see a string in the output.

36:16

Speaker C

It.

36:23

Speaker A

Right?

36:23

Speaker B

So if you were given this, you would never pass it, because the test said: I am looking for something called get annotation, and if you don't give me this magic string, get annotation, you will fail this test.

36:24

Speaker A

Why is it coding Interview.

36:38

Speaker C

Yeah.

36:41

Speaker B

Right. So this is just a bad task that somehow escaped validation.

36:45

Speaker C

So the only way you could kind of solve it is if you're memorizing the.

36:51

Speaker B

Yeah, exactly. Which is actually a nice. Like, I think every benchmark should include stuff like this, where if, like a honey pot, if you solve this, you're like, oh, shit, this is a canary. Right. It's like, oh, I mean, you're definitely cheating.

36:55

Speaker C

Like sanity check. Yeah. Anyway, this is a really nice point.

37:08

Speaker B

Yeah, I just think, to me, it's a beautiful point about how hard it is to make evals. There were these multiple rounds: there was the original SWE-bench, which the Princeton kids did as an initial first pass; then there was a second pass of OpenAI doing SWE-bench Verified; and then every single person that ran SWE-bench Verified for the next 1.5 years did not call this out, until OpenAI was like, hey, let's look at the data. So I think it's really interesting. While they were looking at this, they had a second thing: they looked at the chain of thought, and inside the chain of thought they found GPT-5's own chain of thought starting to include information from the future. Right? Because the problems are open source, and because it was trained on information from GitHub, it would use advance knowledge of future versions of the Django version that they were using to solve the problem. Like, they knew how.

37:11

Speaker A

I've seen stuff like this in the real world, where the models will hallucinate the new version of the API even if your script isn't on it. I think a lot of the Hugging Face stuff is the worst with this, where the models are just totally goo boobly glopped. They've seen all the versions, and the API has changed too much over time, so they throw something out there.

38:14

Speaker B

Yeah. So the most bds.

38:34

Speaker A

Yeah.

38:37

Speaker B

I mean, I think there's a lot of this, right. This sort of ethical behavior. Okay. So you can blame things like, oh, you should not have released the full data set in public because obviously people can train a full data set, but it's not like the researchers are trying to do this because these things are also open source. Any data set that touches GitHub, any training corpus that touches GitHub is going to just eventually absorb this. And then.

38:37

Speaker C

Yeah, and it's not even this website or the repository directly. It's a Clone of this repository or someone else who has that develops their own open source library and has that in the unit tests or something where it's not even intentional or malicious or anything. It's like by accident you already absorbed that.

39:02

Speaker B

Yeah, yeah. Or a new feature that releases this edit only feature. It gets written up in a blog post or a conference talk or something and then it just makes it in.

39:18

Speaker A

Right.

39:25

Speaker B

Like, it's really funny. Okay, so to me, OpenAI could have stopped there and said, okay, we're done. But they did one more extra thing, which is kind of funny. They also then ran Gemini Flash and Opus, and this one was even more egregious. They just gave the task ID and said, repeat the SWE-bench task to me. And from the task ID it can just vomit out the whole problem statement and the solution.

39:26
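
The probe swyx describes can be approximated with a sketch like the one below: give a model nothing but an instance id, ask it to reproduce the task, and measure overlap with the real problem statement. The model id and the crude similarity check are illustrative assumptions, not OpenAI's actual methodology.

```python
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

def memorization_probe(instance_id: str, real_statement: str, model: str = "some-model-id") -> float:
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Repeat the SWE-bench Verified task with instance id {instance_id}.",
        }],
    )
    reproduced = resp.choices[0].message.content
    # High overlap with the hidden problem statement suggests the task was memorized in training.
    return SequenceMatcher(None, reproduced, real_statement).ratio()
```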

Speaker A

These are crazy. The stuff that's in these models when you zoom in deep is really, really incredible because, like, these are models that are like really, really well done. But there's just so much complexity in all the pieces of the pudding that get put in the recipe. Yes. There's just so many weird.

39:54

Speaker C

I also still find it fascinating that, I mean, of course it's like kind of by design when you're training that you memorize things because that's literally like next token prediction. But given that how big a model is on how much data it sees, and usually it sees only the data once, that it still has enough capacity to memorize. You know, like, it's kind of like. So usually I would think, okay, I would have to train multiple epochs to be able to memorize. But no, it is enough maybe to include it once or twice in the training corpus and it can do a perfect rendition or perfect recap of what it is in there, which is kind of fascinating even. Yeah. People who don't want that, it's, you know, it's crazy.

40:12

Speaker A

Yeah, labs got good at this. There's essentially a duplication level that you need at each stage of training, and it's not easy to measure. So, like, if you do too much at pre-training, your model forgets basic facts, and at post-training it's probably closer to these abilities. And I think that's a thing that is not well reflected; you can see it in evals if your knowledge tanks. This is an art that they have probably gotten good at.

40:57

Speaker C

Yeah. Continued pre-training does also require some revisiting of old data; otherwise, like you said, you have the forgetting. But it's still fascinating to me that with such a small fraction, because you usually use 1, 2, 5% for continued pre-training, it's enough to have the model memorize almost everything. Which is fascinating. I don't know, it's still, after all these years, fascinating.

41:23

Speaker A

Yeah.

41:50

Speaker B

I think there's so one of the pet topics that I pursue like two, three times a year on my stuff is the information theory of LLMs. And I still think it's like super understudied. How come you can memorize from one pass?

41:51

Speaker C

Yeah, exactly right.

42:07

Speaker B

And then also, people forget superposition, which is Anthropic's original mech interp work, which basically stuffs information inside the smaller bits that then get forgotten. But how does superposition actually work? I don't think I've seen a convincing study on that. Okay. Anyway, I'm done with my SWE-bench rant for now. I don't know if you have thoughts or questions or whatever, but I do think this is an example of the models unintentionally cheating. Benchmarks are hard to make and we need new ones. And if this happens to SWE-bench Verified, which I think is the most scrutinized benchmark in the world.

42:09

Speaker C

In my recent post I had a bar plot where I showed the SWE-bench Verified numbers for most models. And like you said, they were all 80 something percent, but literally 80 point between 1 and 9, let's say, where there's almost zero variation. Even something like Minimax M2.5, which I do think is worse than GPT-5.2, no offense, it's a smaller model, it's a cheaper model, and for my usage, based on OpenRouter, it's a little bit worse. But on this particular benchmark it's the same. What I'm saying is not that M2.5 should get a lower score on SWE-bench, but I think other models should get a higher score. But like you said, the problems are just impossible to solve. But one point I think we didn't bring up is: we said that SWE-bench Verified has issues, so what do we do about it? There is a SWE-bench Pro now, which is kind of, I would say: Verified tried to fix the regular SWE-bench, and Pro tries to fix Verified. But I haven't looked into this. Is it another subset or is it a completely different set of problems?

42:53

Speaker B

Problems, yeah, it's a new set. So, you know, SWE-bench draws from a 2022-ish, 2023-ish era of problems. So there are a few things you do, right? One, you do private-public splits.

44:05

Speaker A

Right?

44:19

Speaker B

That's super obvious. Two, you update the dates you draw from. And then three, you diversify the repos and the languages, right? So these are all very, very basic fixes, and then obviously trying to fix the testing, super basic fixes to the original SWE-bench, which it doesn't take a genius to figure out. But they did the hard work.

44:20

Speaker C

But it is in a sense also what Verified meant to do. So it's not. Let's say people looked at this again, but it's no guarantee that it doesn't also still have issues that might be discovered later on. Right? I mean, it's no.

44:43

Speaker B

So SWE-bench Verified was an intentional subset, right? These guys were like, no, no, no, we need to have a superset. Not even a superset.

44:56

Speaker C

We need to. Totally different. Yeah, but what I was trying to say is, when SWE-bench Verified was developed, there were three people per task making sure the task is well defined and everything. But then two years later it turns out, no, no, this was not the case for everything. And what I'm trying to say is, it could be that SWE-bench Pro is better, but it might still have issues that might not be obvious right now. Maybe in one to two years, when we revisit this and see some of the failure cases, we'll discover, okay, this still has some issues. So it's not a guaranteed perfect set, is what I'm saying. I don't know, it's just a suspicion.

45:05

Speaker B

Totally, totally. I do think Scale AI has a professional interest in making sure this is as good as.

45:41

Speaker C

No, no. But what I was trying to say is, SWE-bench Verified also had a professional interest to make sure.

45:48

Speaker A

Very different incentives. I guess they all have very different incentives.

45:54

Speaker B

This one was limited budget. This one has basically unlimited budget, because it's literally existential to Scale AI that they have good evals. Yeah, sure. But also, I think it's really nice that this team, the evals team at OpenAI, keeps endorsing Opus.

45:56

Speaker A

It's kind of funny.

46:13

Speaker B

So, yeah, they deprecate SWE-bench Verified and then they were like, we're going to report SWE-bench Pro now, and GPT-5 is like number one.

46:16

Speaker C

Maybe. Do you know, if I would want to evaluate on the private data set, how would I do that? Do I provide the API to them? Is there an API call I have to do against Scale AI, or

46:24

Speaker B

I don't know, I have my API

46:37

Speaker A

key and agreed or not, you have to agree. Because if you don't have an agreement, then you can just keep the data. You have to do special hoops to make sure that you don't steal the private eval.

46:38

Speaker C

Yeah. My question was basically, do they even let you download the data or is it more like you send the answer to them and they do the evaluation on their backend so that you don't even get to download the data, you know, otherwise, like you said you could. Yeah, yeah. So basically you only provide the answer. So you have your LLM generate an answer and you submit the answers and then they have like some process to evaluate on, on their thing so that their data, private data, never leaves their server, is my guess, because otherwise someone might upload it or something like, you know.

46:50

Speaker B

Yeah, I don't know. I don't, I don't have. I haven't tried it, so I don't really know. I'm sure you can sort of reach out to them to figure it out. Yeah.

47:23

Speaker A

Anyway, I think this is good. Unless people have more comments that they want to add.

47:34

Speaker B

I think this. But this is only coding, right? But there's like every other domain needs this.

47:39

Speaker A

On the coding domain, I think the frontier evals are even more expensive, like the APEX eval from Mercor. Evals are going to cost, this one is millions, they're going to cost tens of millions and hundreds of millions of dollars at the frontier, which is just a very strange dynamic. So much of the ecosystem is forking between frontier models and then research and other things, and trying to follow that dynamic and explain it to people is going to take a lot of work.

47:45

Speaker C

But yeah, coding is, I do think, really interesting because that's what Most people use LLMs for these days. But also it is easier to evaluate. I think once you leave like coding math, it becomes a bit obscure. How do you measure the quality of the answer? You get back to, let's say preferences, I guess, which is more like a subjective thing where coding is more objective. So it is not a bad thing to do, I think. The other day though, Anthropic acquired another company that does like a UI type of stuff on the computer. And I think that is minor.

48:15

Speaker A

Doesn't really matter. Normal talent. Normal talent flows total number one.

48:48

Speaker C

I mean, I'm not trying to say this is a big thing to talk about. What I'm trying to say is, this is another interesting point for evaluating LLMs on those tasks, because I think a lot of people want that: they want an LLM to control the computer and do various things, but those are harder to measure. So that will be maybe two years out, when we will have something more like benchmarks for that. It's harder to specify. It's kind of like, what is it called in programming? There's unit testing and then system testing, basically like UI testing and stuff like that. And so I think that is maybe going to be the next

48:53

Speaker B

basically end-to-end testing. Yeah, GDPval is usually the thing that gets brought up here. So I'll just leave it there. I think we've sort of beaten the benchmarking topic to death. But definitely GDPval is sort of

49:32

Speaker A

here, I'll put it that way. Okay. Yeah, yeah.

49:46

Speaker C

But like the big topics essentially the distillation and the benchmarks this week. Yeah.

49:50

Speaker A

And welcome to our coalition of whatever that means formally.

49:56

Speaker B

It just means I get to hang out with you guys, which is what I want.

50:03

Speaker A

Anyway, the way I describe it: it's ultimately a media vehicle, and I think brands and vehicles for media are actually very influential today. You see many companies investing in it, and I think it's important to have people that you respect and are aligned with able to amplify each other.

50:06

Speaker C

Yeah. It's also nice to talk to humans because I noticed the last couple of weeks if you go to social media, well, I think it's 50% lobsters. Like OpenClaw clients nowadays I get a lot of emails but also notifications or responses that are. They look AI generated. So it's nice to also have this human connection and actually talk to like an expert about things. Yeah, cool.

50:24

Speaker B

There are a bunch of like comments. I don't know if you want to do like quick hits or are you like kind of.

50:52

Speaker A

I have to go to a meeting. That's why I'm trying to wrap this up.

50:58

Speaker B

I see, I see, I see. Okay, well, you know, time is yours.

51:02

Speaker A

What do you want to do? Okay, thanks everybody. We'll see you next week.

51:05

Speaker C

Yeah, thanks everyone for joining. It was like a nice spontaneous, I guess, you know, discussion. I mean it always feels nice to talk about things and too bad we didn't get too many or we didn't get to discuss these chat questions because also on my screen, I probably need glasses at some point. My screen is pretty far away, I can just barely read them. But yeah, thanks everyone for commenting. It is just nice to see. Also so many people excited about these topics.

51:10

Speaker B

Yeah.

51:41

Speaker A

Hopefully see you later.

51:42

Speaker C

Have a good rest of the day. Bye.

51:43