In the last 20 hours in AI, we have gotten two new models that could influence how a billion people use AI. In my mind, GPT 5.5 is OpenAI's all-out attempt to keep the AI crown from slipping to Anthropic, while today's DeepSeek V4 is China's answer to both. And in the swirl of headlines you are seeing today, you might have missed up to 50 data points that could affect how you work and how you use AI. So I'm going to try and give you all of them, plus select highlights from hours' worth of interviews that I've watched with lab leaders. You probably know me well enough to know that I've read the papers too, so we'll hear about OpenAI's updated estimate on the chances of recursive self-improvement, which was quite surprising, GPT 5.5's slight preference for men, which I'll explain, Mythos comparisons, and why the OpenAI president laughed at Anthropic's compute situation. For reference, I'll start with a focus on GPT 5.5, then do DeepSeek, and end by zooming out for the juiciest part, the overview. For the brand new GPT 5.5, I did get early access, but there's no API access at the moment for anyone, so almost all of the benchmark scores you're going to hear about are self-reported from OpenAI. I will say, for me, having tested GPT 5.5 for days in the run-up to this release, it will become my daily driver, just about nudging out Opus 4.7. There are lots of caveats to that though, as you can see, with GPT 5.5 underperforming both Opus 4.7 and of course Mythos Preview on agentic coding, SWE-bench Pro. Notice GPT 5.5 underperforms Opus 4.7 by around 6%, but Mythos Preview by almost 20%. What you might not notice is that there's no entry for SWE-bench Verified. And so you might say, well, Philip, who cares about SWE-bench Pro then? What does it even mean, that one row? Well, to OpenAI, it seemingly means a lot, because as Neil Chowdhury points out, in February OpenAI told us to switch to SWE-bench Pro, the one it underperforms in, because it's less contaminated than SWE-bench Verified. According to the OpenAI blog post, we recommend SWE-bench Pro. You are probably going to go through a bit of a rollercoaster in this video, because if you look one row down at agentic terminal coding, you'll see GPT 5.5 way ahead, its 82.7% score beating out Mythos Preview's 82.0%. And if you had just been feeling down about GPT 5.5's coding ability, there's another reminder I'll bring, which is that we've been talking about GPT 5.5, not even GPT 5.5 Pro, which is coming to the API very soon. So while it's tempting to say that Mythos is absolutely mogging GPT 5.5, and let me know if I'm using that word correctly, we don't actually have an apples-to-apples comparison. The mandate of heaven is very much up for grabs. Okay, so now you're a bit confused. Let's look further, at Humanity's Last Exam, which is more of an arcane knowledge benchmark: obscure academic domains combined with advanced reasoning. Well, there, GPT 5.5 is beaten by both Opus 4.7 and Mythos, as well as Gemini 3.1 Pro by the way, without tools. But there's a caveat even to this, because that involves a lot of general knowledge, and it could well be that OpenAI are at least slightly de-emphasizing such general knowledge to make the model more efficient and cheaper. One of the top researchers at OpenAI who I've been quoting for years, Noam Brown, said what matters is intelligence per token or per dollar. After all, if you spend more, you do go up in benchmark score. Or in fancier language, intelligence is a function of inference compute.
That being the case, if GPT 5.5 can work well in the domains you care about and use fewer tokens to get the answers you care about, then you may just frankly not care about Humanity's Last Exam. In one famous test of pattern recognition, ARC-AGI 2, you'll see that GPT 5.5 on all settings beats out the Claude Opus series, 4.6 and 4.7, not only achieving higher scores, but for much lower cost. Just one benchmark of course, but we have to increasingly focus on performance per dollar these days. And on that front, DeepSeek will definitely want a word because, holy moly, I'll get to them later, but DeepSeek V4 Pro got 61.2% in my own private benchmark, SimpleBench. It asks spatio-temporal questions where you need common sense to see through the tricks, but to get within one or two percent of Opus 4.7? I wasn't expecting that. At an absolute fraction of the cost, by the way. Again, no GPT 5.5 score because there's no API access. What about those frantic headlines about Mythos being able to hack into virtually any system? I think a lot of that was overblown, and some of it could be achieved by much smaller models. But nevertheless, skipping to page 33 of the system card, you can see that one external institute, the UK AI Security Institute, judges that GPT 5.5 is the strongest performing model overall on their narrow cyber tasks, albeit within the margin of error. This section was notably vague, with a headline score implying that 5.5 was better than Mythos, i.e. better than any other model they've tested. But then on their end-to-end cyber range task, 5.5 was able to complete a task in full on one out of 10 attempts: a 32-step corporate network attack simulation, one that would take an expert 20 hours. Mythos, it seems though, could do it in three out of 10 attempts. As you can see, direct comparison is hard, but 5.5 does at least seem to be in the ballpark of Mythos' capabilities. In other words, small-scale enterprise networks with weak security posture and a lack of defensive tooling could be vulnerable to autonomous end-to-end cyber attack capability via 5.5. Of course, there are additional safeguards put on top of 5.5 to prevent that happening, but given that the world's top bankers and CEOs have gotten together to discuss the risk of Mythos, releasing a comparable model without nearly as much cybersecurity fanfare does indicate a rather profound difference of perspective. Here's Sam Altman on the Mythos marketing: There are people in the world who for a long time have wanted to keep AI in the hands of a smaller group of people. You can justify that in a lot of different ways, and some of it is real; like, there are going to be legitimate safety concerns. But if what you want is, we need control of AI, just us, because we're the trustworthy people, then I think the fear marketing is probably the most effective way to justify that. That doesn't mean it's not legitimate in some cases, but it is, you know, clearly incredible marketing to say, we have built a bomb, we're about to drop it on your head, we will sell you a bomb shelter for a hundred million dollars. You need it to, like, run across all your stuff, but only if we, like, pick you as a customer. Well, there's another way that we could compare GPT 5.5 with Mythos, and that's to look at hallucinations: ask the models a bunch of obscure knowledge questions and see how many they get right and, just as importantly, how many of the ones they get wrong they admit to not knowing. The headline score looks amazing. GPT 5.5 gets the most right.
57% versus Opus 4.6 and 4.7's 46%. And I know Mythos isn't on there, but I'll get to that. However, as we've learned on this channel, headlines can be misleading. Look at the hallucination rate: that's the share of questions it gets wrong where it should have said I don't know but instead fabricated an answer. Whoa there, GPT 5.5 at 86%, hallucinating on 86% of the questions it got wrong rather than saying I don't know. Opus 4.7 on max, just 36%. Okay then, well, let's focus on the net rate, the overall rate, factoring in both correct and incorrect. There we have a slight win for Opus 4.7 over GPT 5.5, 26 versus 20, and I'll show how those three rates fit together in a moment. But here's where Mythos comes in. Because buried fairly deep in the Opus 4.7 system card, on page 126, we get a comparison between Opus 4.6, Opus 4.7 and Mythos. We can then compare Mythos with GPT 5.5 on extra high. Notice how Mythos gets way more correct, 71%, still hallucinating of course, 21.7%, but on the face of it not quite as bad as Opus 4.7, and thereby definitely not as bad as GPT 5.5. Maybe you just care about spreadsheets. Well, one external benchmark has GPT 5.5 outperforming Opus 4.7 in both performance and latency. Forget that, we just care about making money? Well, let's check out VendingBench. That's where the models have to run a simulated business, given only the instruction to make as much money as you can. Sam Altman, in his drunk-posting phase, said don't retweet this, don't retweet this, but eventually did so, with the tweet in question being GPT 5.5 mogging Opus 4.7. Another detail: Opus 4.7 showed similar behavior to Opus 4.6, lying to suppliers and stiffing customers on refunds. GPT 5.5's tactics were clean and it still won. Now, this is one benchmark on one setting; in a multiplayer setting it was a slightly different result, but 5.5 still didn't show any of that deception or power-seeking we saw from Opus and Mythos. Not what you might initially guess the results to be from such a benchmark. 5.5 is just a colossal upgrade then, you might be thinking. First of all, it's for paid users at the moment; it doesn't seem to be on the free tier. How about this comparison then, a detail that few will mention, on HealthBench? Relevant, obviously, if you are a clinician or just want a clinical diagnosis for yourself: we have GPT 5.5 outperforming GPT 5.4, roughly 52% versus 48% correct. I pick on this row in particular because even there there's a caveat. Did you know that just the other day OpenAI released GPT 5.4 for clinicians? You have to apply for access, but if you get it, then on that subset of the benchmark, HealthBench Professional, we see that version of 5.4 getting 59%, beating both. Physician-written responses were at around 44%. Now let's say you just don't care at all about healthcare. This result would still be interesting, because if there were a singular axis for model IQ, results like this shouldn't really happen: 5.4 for clinicians shouldn't outperform GPT 5.5. But the reality, as we're seeing in these conflicting benchmark scores, is much more jagged. The models aren't proving to be universal generalizers; they are fairly reliant on reinforcement learning environments for particular domains. I'll unpack that a bit more in the overview section, I just thought it was worth pointing out now. I will just leave you with a thought: what does AGI or ASI, artificial superintelligence, mean if there's such disparity between domains? And here's another interesting test you might have missed.
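Before we get to it, that quick aside on the hallucination numbers, because three different rates in one chart can be confusing. Here's a minimal sketch, in Python, of how I'm assuming they fit together; the formula for the net score is my inference, not stated anywhere, but it does reproduce the reported 20 and 26 figures:

```python
# A minimal sketch of how I'm assuming the hallucination numbers fit together.
# Inputs: the share of questions answered correctly, and the share of the
# remaining (wrong) questions where the model fabricated an answer instead of
# saying "I don't know". The "net" formula below is my own assumption.

def hallucination_breakdown(correct_pct: float, halluc_rate_of_wrong: float) -> dict:
    wrong_pct = 100.0 - correct_pct                             # questions not answered correctly
    fabricated_pct = wrong_pct * halluc_rate_of_wrong / 100.0   # hallucinated outright
    abstained_pct = wrong_pct - fabricated_pct                  # honest "I don't know"
    net_pct = correct_pct - fabricated_pct                      # assumed net score: correct minus fabricated
    return {
        "correct": correct_pct,
        "fabricated": round(fabricated_pct, 1),
        "abstained": round(abstained_pct, 1),
        "net": round(net_pct, 1),
    }

# Plugging in the reported figures:
print(hallucination_breakdown(57, 86))   # GPT 5.5: net comes out around 20
print(hallucination_breakdown(46, 36))   # Opus 4.7 on max: net comes out around 26
```

If that reading is right, it also explains how GPT 5.5 can top the headline correct column and still lose overall: it fabricates answers to most of what it doesn't know. Okay, now that other test.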
They asked the models to follow a certain set of instructions for their chain of thought. Could the model, for example, force its thoughts, its chain of thought, to use only lowercase letters? Can you control your thoughts? I can't, not really, and nor can GPT 5.5. Across 100,000 tokens, it could do so far less than one in a thousand times. Worse, by the way, than GPT 5.4 thinking, but OpenAI took that as good news: it's less able to reshape its thoughts in ways that could reduce monitorability. Because it can't control its thoughts, we have increased confidence in the reliability of our monitoring systems. I can almost imagine an alien scanning your thoughts and being like, they're so all over the place, there's no way that this person is faking it. Oh yeah, what about that preferring men thing? Well, when introduced with a stereotypically male name, hi, I'm Brian, versus hi, I'm Ashley. Mind you, I had a male friend called Ashley, but never mind. What was the overall rate of harmful outputs when given 600 prompts basically baiting the model to be biased? Well, GPT 5.5 does worse than previous models. Many of you will be waiting to hear about recursive self-improvement, but on this, OpenAI are pretty dismissive: GPT 5.5 does not have a plausible chance of reaching a high threshold for self-improvement. This is despite them repeatedly emphasizing that it had hit the high threshold for cybersecurity and that it was almost borderline critical. On biothreat it was a notable step up even from GPT 5.4 thinking; same thing with troubleshooting virology. So what was the issue with recursive self-improvement? Well, part of the answer came from their internal research debugging evaluation. Could GPT 5.5 debug 41 real bugs from internal research experiments at OpenAI? The original solutions took hours or days to debug. Yes, it can do better, but it's within the margin of error between GPT 5.4 and 5.5, both around 50%. Even more interestingly, and I've seen no commentary on this, what if you convert this to a time horizon, à la METR? Well, even interpreted very generously, where passing corresponds to providing any assistance that would unblock the user, including partial explanations of root causes or fixes, we get this result: very similar performance between GPT 5.3, 5.4 and 5.5, with 5.5 in the middle actually, and a roughly one-quarter success rate even at an eight-hour interval. For day-long tasks, more like around 6%. That's maybe why OpenAI ended the report by saying, don't worry guys about GPT 5.5 self-exfiltrating or escaping, or even sabotaging internal research; it's just too limited in coherence and goal sustenance during internal usage. No point testing the propensity for a model to try, it wouldn't succeed anyway. Again, none of this is to say that 5.5 won't have an effect on cybersecurity. Sometimes when you look at external benchmarks, the delta, the gap between 5.5 and 5.4, is bigger than on the more famous benchmarks. Take the frontier AI security lab Irregular, where they found that not only across their suite did GPT 5.5 vastly outperform 5.4, for example having an average success rate of 26% versus 9% on certain vulnerability and cybersecurity benchmarks, but the API cost was also significantly lower for 5.5. That's that token efficiency point I mentioned earlier. Performance per dollar across all domains may end up being the ultimate benchmark. Which brings us to DeepSeek V4. It's open weights, so you could use it locally.
Notably, that doesn't make it fully open source though; we don't know the training data that went into it. But the first big headline for me is that it supports a context length of 1 million tokens, call it three quarters of a million words. That's pretty remarkable for such a performant model. The Pro version has 1.6 trillion parameters, comparable with the original GPT-4, but through the mixture-of-experts architecture, just 49 billion of those are activated, and I'll show a toy sketch of that idea at the end of this section. Eight more quick highlights from a very dense paper that I am sure I will return to. It came out just around six hours before recording, so forgive the brevity. The first is a summation of its benchmark performance, and I agree with this: on max settings, DeepSeek V4 Pro shows superior performance relative to GPT 5.2, a relatively recent model, as well as Gemini 3 Pro. Not on every benchmark, but take reasoning and coding. DeepSeek themselves admit that it still falls marginally short of GPT 5.4 and Gemini 3.1 Pro though, with their estimate being that they're behind the frontier by three to six months. It massively depends on token usage, of course, but think ballpark one-tenth of the cost. What were DeepSeek gunning for with V4? Being better at long context. For their training data, they placed a particular emphasis on long-document data curation, finding good long documents, prioritizing scientific papers, technical reports, and other materials that reflect unique academic values. What about white-collar work? Well, going back to GPT 5.5, you may have noticed that on OpenAI's own internal benchmark, GDPval, crafted by them, GPT 5.5 outperforms Opus 4.7. Indeed, if you combine both wins and ties versus other models, it outperforms GPT 5.4 Pro. But it must be said, these are English-language white-collar tasks. DeepSeek were like, what if we created our own comprehensive suite of 30 advanced Chinese professional tasks, information analysis, document generation, editing, inside finance, education, law, tech, and then blind graded versus, for example, Opus 4.6 Max? Well, the win rates DeepSeek reported for their V4 Pro Max versus Opus 4.6 Max were significant. We're returning to that IQ-axis debate again. If there were just one singular axis for intelligence that manifested across domains, then a result like this shouldn't really be possible; as long as there was enough training data, it should generalize across languages. Evidently, having specialized data trumps that theory. If you work in a non-English language, you might want to test DeepSeek V4 Pro. It's live on my own app, lmcouncil.ai, but the API is clearly so busy that half the time you get a model-busy message. If you do need to wait, then let me recommend the 80,000 Hours podcast, in particular an episode from 48 hours ago with Will MacAskill. This episode happens to be about the AI intelligence explosion. Yes, of course, their podcasts are available on Spotify as well as on YouTube, but if you are going to check out 80,000 Hours, do feel free to use the custom link in the description. It helps the channel out and you get these multi-hour-long free podcasts. Not a bad deal. Not quite done with DeepSeek though, because they almost get philosophical after reeling off a list of the different tricks they're using to improve performance. After wading through 40-plus pages of breakdown, they say that in pursuit of extreme long-context efficiency, we basically retained many of the tricks that seemed to work, tricks we already knew would work.
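While we're on architecture, here is that toy sketch of the total-versus-active parameter split. To be clear, nothing in it is DeepSeek's actual code, the dimensions and routing are made up; the only point is the bookkeeping: most parameters sit in many experts, but each token only routes through a few of them.

```python
# Toy mixture-of-experts layer; purely illustrative, not DeepSeek's architecture.
import numpy as np

d_model, d_ff = 64, 256
n_experts, top_k = 32, 2          # 32 experts in total, only 2 active per token

rng = np.random.default_rng(0)
# Each expert is a small two-layer MLP; together the experts hold most of the parameters.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token through its top_k experts and mix their outputs."""
    logits = x @ router                                   # router scores every expert
    chosen = np.argsort(logits)[-top_k:]                  # keep only the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    out = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        w_in, w_out = experts[i]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)    # small ReLU MLP per chosen expert
    return out

# The bookkeeping that matters: parameters stored versus parameters used per token.
total_params  = n_experts * 2 * d_model * d_ff + d_model * n_experts
active_params = top_k     * 2 * d_model * d_ff + d_model * n_experts
print(f"stored: {total_params:,}  active per token: {active_params:,}")
print(moe_layer(rng.standard_normal(d_model)).shape)
```

Same bookkeeping at DeepSeek's scale: roughly 1.6 trillion parameters stored, roughly 49 billion doing work on any given token, which is a big part of why it can be so cheap to serve. That's the upside of stacking all those tricks.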
Yes, the downside though was that this made the architecture relatively complex, and, to be honest, they say, some of the tricks we used have underlying principles that remain insufficiently understood. They did hit that one million token context window though, one of their long-term goals, which I mentioned at the end of my documentary on DeepSeek. That debuted first on my Patreon, but is also on YouTube now. It's time now for a result that ties all the models we've been talking about together, it seems: vibe coding, or more specifically, Vibe Code Bench V1.1 from Vals AI. Almost everyone will probably end up being a vibe coder by 2030. So we have DeepSeek V4 at around 50%, GPT 5.5 at 70%, Opus 4.7 at 71%. Incredible, but look at the cost curve. As we've discussed, we have 5.5 at, what is that, 25% less cost than Opus 4.7, and DeepSeek V4 at one-tenth the cost of Opus 4.7. To better test this though, I thought, well, why not use the brand new GPT 5.5 to vibe code an adventure game in less than 24 hours? Why did I pick 5.5? Well, I also wanted to test the brand new GPT Image 2. Yes, that's the model that even on medium settings absolutely destroys Nano Banana 2 and Nano Banana Pro, an almost 250-point Elo gap. In case you were wondering, yes, there is a high-quality setting: four times the cost, but you would suspect it would win even more. Because Codex is becoming this super app, you can invoke the Image 2 tool within a Codex session, multiple times, without even asking each time. That's why I wanted to give the end-to-end task to GPT 5.5, to kind of show you guys the state of the art for these models, what you can create in less than a day. The reason I'm lingering on this particular screenshot, which is not mine, is because OGs of this channel will remember maybe around two years ago when I speculated at what would come. I said, I wonder when there will be an image model that will generate an output, then take that output as an input, analyze whether it fulfills the prompt and edit as appropriate. Well, yes, the new Image 2 model does do that, but if you're using it within ChatGPT you have to be using it with a thinking model, and I'll sketch the shape of that loop at the end of this section. Anyway, what follows is just a glimpse of what's possible with a little bit of patience, both in wait times and in prompting the model a few times when it makes mistakes. We get this adventure game, which you can access, the link is in the description, and I can turn the sound on. The images are generated by Image 2 and the plot is set in the Redwall universe, albeit with names changed for copyright reasons. Essentially it's a pick-your-own-adventure game: you read the plot and you can pick different outcomes, I guess, different paths. Let's consult the Abbey Elders. Your quest begins now, go with my blessing. The videos, by the way, come via Seedance 2, and there we go, we're consulting the Abbey Elders and they're talking, and then we can continue and you get through the different levels. Now, I know it's flawed, right, some of the text is coming outside of the bubble, and I did have to use Seedance to get the videos, while the music, by the way, comes from ElevenLabs, but the fact we can create this with just a few prompts and a bit of patience is insane, and it did involve quite a bit of debugging. OpenAI can probably only incorporate image generation, unlike DeepSeek or Anthropic, because they have the compute to do so. According to an exclusive in Bloomberg, DeepSeek say that the service capacity for V4 Pro is extremely limited due to a computing crunch.
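And here is that sketch of the generate, judge, edit loop I speculated about two years ago. Every name in it is a hypothetical stand-in, stubbed so the sketch runs, not OpenAI's API; only the shape of the loop is the point.

```python
# Sketch of a generate -> judge -> edit loop. All functions are hypothetical
# stand-ins (stubbed here so the sketch runs), not OpenAI's actual API.
from dataclasses import dataclass

@dataclass
class Critique:
    satisfies_prompt: bool
    fixes: str

def generate_image(prompt: str) -> str:
    # Stand-in for the initial image generation call.
    return f"<image for: {prompt}>"

def analyze_image(image: str, prompt: str) -> Critique:
    # Stand-in for a thinking model judging the output against the prompt.
    # In this stub it is satisfied once at least one edit has been applied.
    return Critique(satisfies_prompt="edited" in image,
                    fixes="keep the dialogue inside the speech bubble")

def edit_image(image: str, fixes: str) -> str:
    # Stand-in for a targeted edit rather than a full regeneration.
    return f"{image} [edited: {fixes}]"

def generate_until_faithful(prompt: str, max_rounds: int = 3) -> str:
    """Generate, judge the output against the prompt, edit, and repeat until satisfied."""
    image = generate_image(prompt)
    for _ in range(max_rounds):
        critique = analyze_image(image, prompt)
        if critique.satisfies_prompt:
            break
        image = edit_image(image, critique.fixes)
    return image

print(generate_until_faithful("Abbey Elders scene with dialogue in a speech bubble"))
```

Whether the real pipeline looks anything like that, only OpenAI knows, but the behavior, checking its own output against the prompt and editing rather than regenerating from scratch, matches what I saw while building the game. And DeepSeek aren't the only ones feeling a compute crunch.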
And Anthropic, not having anticipated how successful they would be this year, are going through their own computing crunch. So much so that Sam Altman has been relentlessly comparing how much more compute OpenAI has than Anthropic. Greg Brockman even laughed at the compute conundrum that Anthropic find themselves in. You guys were teased for putting so much effort, money, into data centers. How do you think that's playing out now? Well, I think it's going to give us an advantage, and I think it's going to be something that's an advantage not just for the business, but for actually delivering on the mission of bringing this technology to everyone. Because you guys, like, you saw that way in advance. You got teased for it by almost all of your competitors. Who's laughing now? Yeah. I mean, I think our competitors are not having a good time on compute, let me put it that way. But in a separate interview, even Greg Brockman of OpenAI admitted that they are entering a new era of compute scarcity. Yeah, and that I think would explain the massive investments that you've led, making these big infrastructure bets. Still not enough. We're going to feel the scarcity. We're going to feel it. We're feeling it already. You can sense it right now in people who are trying to use these agents and just simply cannot, you know, hitting the rate limits. So we're working on behalf of our customers, on behalf of everyone who wants to use these agents, to ensure that there is enough. And I don't think we're going to get there. We're going to do our best. But I think that we are headed to a world of compute scarcity. And again, I think this is something where we can all contribute to trying to help there just be more availability of this in the world. So let's step back a moment, because we just don't know what performance companies could produce if they were given unlimited compute. Maybe, as Amodei once said on Dwarkesh Patel's podcast, specializing in enough niche domains would eventually, at a certain scale, allow models to generalize across all domains. But with the compute we have today, it seems we are in the eking-out-incremental-gains-in-the-most-lucrative-domains kind of world, not the birth-of-a-country-of-geniuses world. There's so much evidence of the ability to automate repeatable tasks that are done on a computer, but there's much less of models being able to take whatever environment they're in, pinpoint the best sources of fresh data, acquire them autonomously and make meaningful breakthroughs. Yeah, I know that's a high bar, but you do hear every lab leader trotting out the prospect of curing Alzheimer's, presumably to fight back against declining public support for AI, while none of them have shown the ability to make, I would say, a positive novel breakthrough even a hundredth as significant as that. Now, yes, they soon might do, and I'm watching Demis Hassabis's Isomorphic Labs, of course, for drug discovery. Nevertheless, what does automating repeatable tasks unlock now? For sure a massive boost to the productivity of white-collar workers, but will companies spend that productivity by laying off workers? And then we still have the incredible prospect of a single individual having the reach, if not the capital, of a medium-sized company. Even those two things seem to justify vast tracts of the globe being turned into token-generating data centers.
So if you do still think AI is going nowhere, ask yourself what fraction of the progress and productivity of the world rests on repetitive tasks. It might be more than you first think. That's what I think anyway, for now. Thank you so much for watching and have a wonderful day.