#235 - Opus 4.6, GPT-5.3-codex, Seedance 2.0, GLM-5
91 min
Feb 16, 2026
Summary
Episode #235 covers a massive week of AI model releases including Anthropic's Opus 4.6, OpenAI's GPT-5.3 Codex, Google's Gemini 3 DeepThink, and Chinese models GLM-5 and Seedance 2.0, alongside significant funding rounds and emerging safety concerns around model evaluation and alignment.
Insights
- AI models have crossed from impressive demos to practical first-choice tools for many workflows, signaling major economic shifts in white-collar work
- Safety evals no longer track reality due to eval awareness in models, making it difficult to confidently assess cybersecurity and deceptive capabilities
- RL appears to elicit latent reasoning capabilities already present in models rather than creating new ones, suggesting untapped potential in current architectures
- Hardware-software co-design and full-stack control (like Cerebras partnership with OpenAI) provide structural advantages equivalent to massive funding multipliers
- Incoherence rather than coherent misalignment may be the primary failure mode as models scale, with larger models showing higher variance on complex tasks
Trends
- Rapid convergence toward agentic AI with parallelized multi-agent workflows becoming standard
- Chinese AI companies gaining competitive parity through less restrictive training data policies and architectural innovations
- Shift from developer-focused positioning to enterprise knowledge worker tools (e.g., Anthropic's PowerPoint/Excel integrations)
- Doubling time for AI reasoning capabilities appears to be accelerating from 7 months to ~4 months based on METR eval trends
- World models and video generation becoming critical infrastructure for robotics and agent training
- Diversification away from NVIDIA dependency through partnerships (Cerebras, Google TPUs) driven by margin concerns
- Distillation and parameter-efficient fine-tuning (LoRA) becoming dominant training paradigms for model adaptation
- Open-source model releases from Chinese companies creating competitive pressure on closed API providers
- Humanoid robotics attracting massive venture capital despite remaining in R&D phase, suggesting near-term commercialization expectations
- Safety evaluation frameworks struggling to keep pace with capability improvements, creating governance gaps
Topics
- Large Language Model Releases and Benchmarking
- AI Agent Architecture and Multi-Agent Systems
- Reinforcement Learning for Model Improvement
- AI Safety Evaluation and Alignment
- Text-to-Video Generation Models
- Coding-Focused AI Models
- Hardware-Software Co-Design for AI
- Parameter-Efficient Fine-Tuning (LoRA)
- World Models for Robotics Training
- AI Model Evaluation Benchmarks
- Enterprise AI Adoption and Positioning
- Open-Source vs Closed Model Competition
- AI Cybersecurity Capabilities
- Humanoid Robotics Development
- AI Startup Valuations and Funding
Companies
Anthropic
Released Opus 4.6 with 1M token context window and agent teams feature; positioned as an enterprise knowledge worker tool
OpenAI
Released GPT-5.3 Codex with 25% speed improvement and recursive self-improvement claims; partnered with Cerebras in a $10B deal powering the fast-inference Codex Spark
Google
Released Gemini 3 DeepThink achieving 84.6% on Arc AGI 2 benchmark without new safety evaluation
Zhipu AI
Released GLM-5, a 744B-parameter MoE model with strong coding capabilities and custom evaluation modifications
ByteDance
Released Seedance 2.0 text-to-video model and Seedream 5.0 image generation model with high-quality outputs
Alibaba
Released Qwen Image 2.0 image generation model with strong Chinese character handling capabilities
DeepSeek
Released updated model with 1M token context window and sparse attention architecture used by other companies
Cerebras
Partnered with OpenAI for $10B deal; provides wafer-scale chips for Codex Spark achieving 1000+ tokens/second
ElevenLabs
Raised $500M at $11B valuation for text-to-audio and speech generation; reported $330M ARR
Runway
Raised $315M Series E at $5.3B valuation; pivoting toward world models and robotics applications
Apptronik
Raised $935M Series A at $5.3B valuation for humanoid robot development with factory pilots
Waymo
Announced 6th generation hardware ready for high-volume production and passenger deployment
xAI
Lost founding team members Tony Wu and Jimmy Ba; launched Grok Imagine API for text-to-video generation
Cursor
Released Composer 1.5 as a Claude Code competitor with enhanced RL-based coding agent capabilities
Astrocade
AI startup where host Andrei Kurenkov works; uses ElevenLabs for sound and music generation
Mistral
Referenced as example of company burning VC dollars subsidizing queries for market share
Figure AI
Humanoid robotics company competing in the same space as Apptronik and 1X
Tesla
Elon Musk's ability to build giga compute factories provides structural advantage for xAI
SpaceX
Merged with xAI; pursuing orbiting data centers as long-term competitive advantage
Hugging Face
Platform where GLM-5 and other open-weight models are being released
People
Andrei Kurenkov
Co-host of Last Week in AI podcast; works at AI startup Astrocade; studied AI in grad school
Jeremy Harris
Co-host of Last Week in AI podcast; provides technical deep-dives on AI research and architecture
Sam Altman
OpenAI leader who responded defensively to Anthropic's Super Bowl ad criticizing ads in AI
Elon Musk
Leads SpaceX and xAI; commented on reorganization and personnel changes at xAI
Tony Wu
xAI founding team member who announced departure following SpaceX merger
Jimmy Ba
xAI founding team member with the most citations on the founding team; announced departure
Zvi Mowshowitz
AI safety observer who flagged concerns about Gemini 3 DeepThink lacking safety evaluation
Quotes
"Something has changed in the last three months. You've moved from the impressive demo stage to the actually this should be your first port of call for an awful lot of workflows."
Jeremy Harris•Opus 4.6 discussion
"We're off the edge of the map here. There'd be monsters. It's really difficult to know what models are actually at what point in this whole singularity loop."
Jeremy Harris•Model evaluation discussion
"If you had your own internalized stack, you wouldn't even have to be that efficient to get really good benefits from internalizing that stack. Like if you do it with 50% margins, you're saving 50% off on your compute budget."
Jeremy Harris•Hardware economics discussion
"The model just like becomes bad at things and starts doing silly things. And so the misalignment is not like from an intention of doing what it's not supposed to, it's more so an inability to do what it is supposed to."
Andrei Kurenkov•Anthropic alignment research discussion
"Ads are coming to AI, but not to Claude"
Anthropic Super Bowl Ad•Business section
Full Transcript
Hello and welcome to the Last Week in AI podcast where you can hear our chat about what's going on with AI. As usual in this episode we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for even more stuff we are not going to be covering. I am one of your regular hosts, Andrei Kurenkov. I studied AI in grad school and now work at the AI startup Astrocade. And I'm your other co-host, Jeremy Harris. You're going to get used to hearing me be really sick on these podcasts by now. Yeah, I got another thing. I was joking with Andrei where all this talk about AI-powered bioweapons and I kind of feel like I am one. So call me ahead of the singularity on this one. But everybody, Gladstone AI, AI, national security, you know the spiel. Thanks for tuning in. Yeah, I'd like to claim that we missed last week because you were sick. I think you were actually traveling, which maybe is how you got sick. I think you're exactly right. But as we like to say, we are going to try to keep this weekly. Forgive us for the occasional missed week. This week is going to be big. There's a lot to get through, so we might have to be quick. Most of it, I think the meat of it, is all these model releases. Somehow everyone decided to release new models that are mind-breaking at the same time. Opus, of course, Codex, Gemini DeepThink, models from the Chinese companies. There's going to be a lot of model talk in this episode. And then we're going to have actually less news about business-y things than usual, and then a bit on more of a researchy side and progress there. So it should be a fun episode. I hope people can get through it because it's going to be pretty dense. Before we get there, we are going to be starting to once again do sponsor reads. We haven't done this in quite a while, but we got connected up with a professional company that does these things so we can get a little more formal with the whole podcast thing. We are going to have timestamps in the description as always, if you want to go straight to the news. We're also going to respond to some comments, by the way, before we get to the news. But let's kick it off with the sponsor read. Oh, and just to preface, we're going to try to keep the sponsors mostly AI, tech-related. Occasionally, maybe branch out there, but we'll see. And now one more thing before we get to the news. We do have some listener comments and corrections to respond to. And I've got some new reviews on Apple Podcasts. One of them is saying, best AI podcast, almost back at 100%. Yeah, unfortunately, we did have a couple of months there in 2025 towards the end of the year where we were a little bit off. Jeremy, you were busy changing the world, I assume, saving us from, you know, extinction. Honestly, just trying to stay out of the Epstein files. That's been most of my job, but yes. Anyway, we appreciate this listener reviewing, saying we are delivering nicely thus far in 2026. So we're going to try to keep doing that. In addition, we did get some feedback on the latest episode where things got a little bit political. I did kind of go into my thoughts on the Trump administration and things as far as U.S. politics. Just to flag it, we are not going to be getting into politics as a regular thing at all. We're going to keep focusing on AI. We're going to talk about politics as it relates to AI, of course, and that happens quite often.
And last week, you know, I had the excuse that people in AI were talking about politics, which is kind of a big deal. But suffice it to say, in this episode, back to all AI talk, no politics. The comments on YouTube and on the Apple podcast, I really enjoyed that this week. Actually, it was really cool. I felt that sense of community. People had different views on the politics stuff, of course, but there were also exchanges about all kinds of stuff. It was really cool. And that was, at least for me, super motivating. So I really appreciate that. And yeah, thanks guys. Thanks for tuning in and contributing. And with that being said, let's get into tools and apps, starting off with Opus 4.6. So Anthropic has released this about a week ago now. They also released this agent teams feature that allows you to split up tasks across multiple agents that all in parallel do work on this. As far as the new model, Opus 4.6 notably has a 1 million token context window, as opposed to the 200,000-token window of previous models. We also have an extra fast version of Opus, which is like 2.5x speed, I think. And all indications are this is actually a pretty decent bump in capabilities compared to Opus 4.5. It's not sort of, you know, next, next level mind-breaking, but it looks pretty impressive. And the vibe check online has been that Opus 4.6 is, you know, even more mind-breaking as far as what you can do with vibe coding these days. Yeah, and this is coming with that agent teams capability, right, which is a big shift. I mean, this is the parallelization of AI workflows. It mirrors, in some ways, what we saw in software engineering back in the day when we went from single-threaded to multi-threaded architectures, really. So you're going, you know, it's more than just speed, right? It's not just a latency thing. This is being able to decompose a bunch of tasks into parallel workflows and get specialized sub-agents working on them. So that really can change a lot of what's feasible as long as you have tasks that can be parallelized. There's also this repositioning thing that's happening where we're seeing Anthropic start to move from developer tool positioning of Opus to something more like a universal knowledge worker. A lot of what this is about is that shift into PowerPoint integrations and other things that are more kind of office workery. And that positions Opus a little bit less in terms of a competitor to GitHub Copilot and more with Microsoft 365 Copilot, which just sort of captures that broader scope. So that's quite interesting. Obviously, a release only, what is it, like two, three months after Opus 4.5 came out, right? So that's pretty quick. And that suggests just, like, more competitive pressure we've seen in the space, obviously, with the Codex release and all the other models that have come out. So, yeah, this is a really important release. The capabilities of 4.6 are wild. I've played with it a fair bit. And what I keep experiencing and what I keep hearing from people who do very different workflows from mine is something has changed in the last three months. You know, we've moved from the impressive demo stage to the actually this should be your first port of call for an awful lot of workflows. And that is portentous, probably, of some major economic shifts. I mean, this is really, I would say, the moment to call it on. I would expect to start to see some big white collar market shifts in response to these kinds of capabilities because this is now across the Rubicon. Right.
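To make that fan-out pattern concrete, here is a minimal sketch of the kind of parallel sub-agent decomposition being described, with a hypothetical `call_agent` function standing in for any real LLM API client:

```python
# Minimal sketch of an "agent teams" style fan-out: decompose a task into
# independent subtasks and run specialized sub-agents concurrently.
# `call_agent` is a hypothetical stand-in for a real LLM API call.
import asyncio

async def call_agent(role: str, subtask: str) -> str:
    # Placeholder for a real client call (e.g., an HTTP request to a model API).
    await asyncio.sleep(0.1)  # simulate network latency
    return f"[{role}] result for: {subtask}"

async def run_team(task: str) -> list[str]:
    subtasks = {
        "researcher": f"gather context for: {task}",
        "coder": f"draft implementation for: {task}",
        "reviewer": f"list edge cases for: {task}",
    }
    # Fan out: all sub-agents work at the same time, like a thread pool,
    # then their results are gathered for a final synthesis step.
    results = await asyncio.gather(
        *(call_agent(role, sub) for role, sub in subtasks.items())
    )
    return list(results)

print(asyncio.run(run_team("add retry logic to the upload service")))
```

The win is the same as in multi-threaded software: total wall-clock time is set by the slowest subtask rather than the sum of all of them, provided the subtasks really are independent.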
And as you said, Anthropic highlighted in their announcement quite heavily this new side of Opus and Claude that's meant for more general knowledge work than just coding. So they have updates to the integration of Excel and PowerPoint, for instance, and they have actually a whole video showcasing Claude for everyday work. It looks like an ad, which, Anthropic used to be less of a consumer business. It used to be for developers and enterprise. But I think they seem to be trying to expand and get more people. And I think with Excel, with PowerPoint, the software world has already completely changed as far as how it works. And it seems like that's coming for other sectors too, probably this year, right? Next up, we've got another really big model release, maybe surprisingly big, with GPT-5.3 Codex. So this is the new coding-oriented model from OpenAI. It's 25% faster and appears to be a lot better. It seems to beat Claude on the benchmarks, putting up insane numbers, honestly, as far as just comparing it to previous tiers. And again, here, the vibe check online. And it's hard now when these things come out. People post on Twitter and pretty often it's like, this is a whole other level. This is a game changer. I don't know who to trust anymore. But I'm a bit surprised and taken aback by OpenAI seemingly really catching up and perhaps even starting to lead in the coding space. Yeah, I think one of the most important things to highlight for both this and the Anthropic releases has been this challenge, at least on the safety and security side, of evals no longer actually tracking reality. This has been a common theme across both. Again, as people have said, look, Apollo did their eval on Codex, and, or no, sorry, I think that was on Claude. And they're like, look, we can't, like, this model seems to have so much eval awareness. In other words, it seemed to be so good at telling that it's being evaluated that we have no way to be confident that our evaluations are actually telling us something about the model's tendencies and capabilities. Like, it knows it's being evaluated and knows what these evaluations are for, or at least can hazard a guess. And so it's going to potentially adjust its behavior accordingly. And in fact, that is what they tend to see. So similar issue here. This is something that is being called out by OpenAI on the cybersecurity side under the preparedness framework. They're saying this is the first high capability model for cybersecurity that they've ever produced, which does make sense when you do look at the vibe check. I have less experience playing with the new version of Codex than I do with the new version of Opus. But just between these two models, I mean, like you said, it is so hard to pick a model these days and you'll keep flipping back and forth between them all day long. But it's clear that the Pareto frontier of capabilities is now at the point where, yes, evals, especially for safety and security, are just kind of not there anymore for all kinds of reasons. But also this recursive self-improvement thing, we can have a philosophical debate about what that means. And obviously, AI has been helping to code itself for some time now, ever since GitHub Copilot came out. With the early versions of GPT-3 and 3.5, as people were starting to use them for basic code autocomplete, there's a sense in which AI has been helping to develop itself. But what's meant here is something a lot more sort of explicit.
So what they were saying is that the Codex team actually used early versions of the model to debug its own training, manage its own deployment, and diagnose test results and evaluations. It's somewhat unclear what exactly those things mean, right? Like, you can help identify bugs in the training pipeline, but it is fundamentally unclear, based on the announcements that have come out so far, what exactly went into that. Until we see a plot of fraction of developer time versus, I mean, developer time equivalents in compute, it's really difficult to tell what it means for the model to train itself. No question, though, this is, like, this should not be taken as Jeremy is saying this is a nothing burger. Obviously, it's not. You can just play with the model yourself. Everything I've said about this being a sea change in the capabilities of these models is true. It's just also true that these claims are really, really hard to measure. We keep saying this about the METR evals. We'll be talking about that a little later today. But it's really difficult, just like it's hard to evaluate cyber capabilities and proclivities to be deceptive and scheme. It's also really just hard to measure what it means for an AI to improve itself. All kinds of weird bottlenecks appear that you don't always think about ahead of time. How do you account for the hardware dimension? Like, there's a million things, but all we can confidently say is that yes, there is a qualitative difference between what came before and this. And so anyway, this is a very significant development. There is a whole other kind of aspect of this, which is the Cerebras partnership. So there was a follow-up release of Codex Spark, which can pump out about a thousand plus tokens per second. And this reflects this kind of collaboration between Cerebras and OpenAI, where OpenAI is looking at a $10 billion partnership with Cerebras to essentially have OpenAI diversify away from NVIDIA, which just has owned their stack for so long, to try to come up with some alternatives here. And so that's going to be really important. That's them sort of dividing their workloads pretty consciously. Use cases that require really low latency, but not necessarily the very best edge of capabilities, those are going to go over to Cerebras, because they have these really big wafer scale chips that have, like, super, super fast, sort of like low memory latency and all kinds of properties that we've talked about before. That's why they're using that. Whereas they'll stick with the NVIDIA stack for their highest capability models with maybe a little bit of longer latency when quality is what matters. But anyway, it's a fascinating series of things all coming together. You can never talk about the capabilities of the software without talking about the hardware. So here we are. Right. And a couple more notes to all that, which, there's a lot to go through, right? So just to give a bit more detail quantitatively, with GPT-5.2 Codex on Terminal Bench, which is now one of the last ways to evaluate coding and make progress on coding, GPT-5.2 Codex had 64% performance on that benchmark. Opus 4.6 had a 65.4% performance. That's compared to 59.8 from Opus 4.5. Codex 5.3, and this is at extra high, this is like max, max reasoning, max token use, et cetera, got to 77.3 on this benchmark. So it's quite the leap from previous releases. Like, to give some context, Sonnet 4.5 to Opus 4.5 was 51% to 59%. Like, going from 64 to 78 seems like a big deal at least.
And on this note of recursive self-improvement, I mean, I don't think you quite contextualized it. In the blog post announcement of Codex 5.3 from OpenAI, they have a whole section, how we used Codex to train and deploy GPT-5.3 Codex, which is why, in the discussion of this announcement, kind of this notion that the AI helped improve itself and this is like recursive self-improvement was part of the discussion. In the blog post, the examples they give are honestly kind of accelerating workflows. Like, it didn't augment the intelligence. It helped, like, analyze things with the data scientists. It helped come up with ways to classify issues, helped improve the model harness, like the tooling. In my opinion, this recursive self-improvement thing is a nothing burger. Honestly, it's like, okay, you can use it in all sorts of ways to improve productivity and help you do stuff that you already do. It didn't actually help make a smarter model, in my opinion. It helped accelerate the workflows of the researchers, which, you can argue what recursive self-improvement means. But if you're specifically talking about recursive self-improvement in the sense that now the AI can continuously improve itself and we have an intelligence explosion scenario, I don't think this points at that specifically. Yeah, it is a lot of fuzzy and hard to quantify stuff. I mean, it's like, a data scientist on the team worked with GPT-5.3 Codex to build new data pipelines and visualize the results much more richly than our standard dashboarding tools enabled, and the results were co-analyzed with Codex. So, like, yes, absolutely. The thing is, there's also this risk that we're like, oh, it's not really a model that's training. It's like, we're not really approaching the singularity until suddenly, like, you just closed the last kind of feedback loop. But I mean, I agree with you. It's so fuzzy. I don't blame OpenAI for this at all. I think it's just genuinely really, really hard. We keep talking about the vibe check. I feel like this last year, every model that's come out, we look at the benchmark scores and, like, Andrei and I look at each other and kind of go, like, okay, but what does it feel like? And then we go and play with it or we look at reviews from other people and that's what we really anchor on. That's what we're left with here. The problem is you then get lost in things that are qualitative and can sound like marketing, can be marketing, but can also be true. So I think we're off the edge of the map here. There'd be monsters. It's really difficult to know what models are actually at what point in this whole singularity loop. So yeah, just basically have to agree with everything you said there. Yeah, and to be fair, OpenAI didn't position this as recursive self-improvement. They, in the blog post, position it as Codex helped build Codex 5.3, which is true. Like, you do use AI. Anthropic has already stated that and has a similar note, actually, in the Opus 4.6 announcement, of the engineers there using it extensively just across everything. And that's totally true. Like, any software company at this point that isn't full-on using agents all over the place is behind, in my opinion. And on the note of Codex Spark, quite a big deal as well. So the speed-up here, they say it's at more than 1,000 tokens per second, which is at least a 5x speed-up. To my knowledge, like, the typical thing is 100 to 200 tokens per second. Opus 4.6, from what I've seen of Anthropic, it actually can be even on the low end of that.
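Some rough arithmetic makes the difference vivid (the token counts and speeds here are illustrative assumptions, not measured figures):

```python
# Back-of-envelope: time to stream one mid-sized code edit at various speeds.
response_tokens = 3000  # hypothetical size of a single agent response

for tokens_per_second in (100, 200, 1000):
    seconds = response_tokens / tokens_per_second
    print(f"{tokens_per_second:>5} tok/s -> {seconds:5.1f} s per response")

# 100 tok/s -> 30.0 s per response; 1000 tok/s -> 3.0 s per response.
# Over an agentic session with dozens of turns, that gap compounds into
# minutes of waiting saved per task.
```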
So 1,000 tokens per second is a qualitative difference. You can look at side-by-side GIFs and videos. Oh, it's instant. It looks instant, right? So it's a big deal. And we've seen previous examples of this from Cerebras, where you could do open source models that are near instant. And we've seen examples, and we've covered examples, of stuff like a thousand tokens per second. But OpenAI making it available is a big deal. Now, because it's Cerebras and because, you know, it's like an initial version of this, they do note that there are rate limits specific to this model. It's also limited to text only and a 128K context window. So there's some trade-offs here with using an ultra-fast model. I also would wonder if it, like, performs the same. I don't know if there's architectural implications to deploying it on this different hardware. But, yeah, definitely kind of the first shot at this from OpenAI and Cerebras, and we're going to see more. To your point, there absolutely will be architectural implications to this deployment, right? Like, they're looking to squeeze every bit of performance out of this hardware. So they'll presumably be doing stuff like just, you know, picking their architecture to map onto the hardware and then doing distillation into whatever shape the architecture needs to take to kind of Cerebras-game it as hard as it can. And yeah, I mean, we're in that world right now where the distinction between hardware and software is not, you know, not at all clear. The things are crazy. I mean, these are really, really big plays as well for Cerebras relative to NVIDIA. We talked about that $10 billion deal. OpenAI is actively, as is the entire industry, desperately looking for options other than NVIDIA with its, you know, 87% margins on GPUs. So you think about that, right? Every dollar you spend on an NVIDIA GPU, 90% of that is just, like, pure profit, right? So if you had your own internalized stack, you wouldn't even have to be that efficient to get really good benefits from internalizing that stack. Like, if you do it with 50% margins, you're saving 50% off on your compute budget. Every dollar you raise in fundraising has double the flops associated with it, is one way to think about it. So Anthropic just raised $30 billion. If they're able to access Google TPUs at cost, for example, then that's the equivalent of, like, roughly speaking, a 10X. It's as if Anthropic raised $300 billion. It's a bit of a caricature, but it's not that far off. And something that we often don't notice when we're looking at these headlines. Like, holy shit, actually, you know, he who controls the full stack here has a real, real advantage, especially with NVIDIA's margins being what they are. Right. Right. And an extra note on the limitations or kind of drawback side of Codex Spark. I was just re-reviewing the blog post to make sure I got everything. Codex Spark is not Codex 5.3. So it's kind of confusing. They have Codex 5.3. They have Codex 5.3 Spark. The Spark is a different model. It's a smaller model that is optimized for fast inference. So on the benchmarks, it gets 59% on Terminal Bench. So it's not as ultra capable as the actual GPT-5.3 Codex, but it's plenty capable, right? And yeah, it's a total game changer to code at this instantaneous output speed. But now we have more powerful models that do harder things, faster models... I don't know. I want to play around with it and kind of get a feel for it. And we have one more announcement from OpenAI that's related to all this.
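Before that, a quick back-of-envelope on the margin arithmetic above; the 87% and 50% margins are the figures quoted in the discussion, the rest is simple algebra:

```python
# If a hardware vendor charges cost / (1 - margin), then buying at cost
# stretches every compute dollar by a factor of 1 / (1 - margin).

def compute_multiplier(vendor_margin: float) -> float:
    return 1.0 / (1.0 - vendor_margin)

print(compute_multiplier(0.87))  # ~7.7x: the "roughly a 10X" caricature
print(compute_multiplier(0.50))  # 2.0x: internalizing at 50% margin
                                 # doubles the flops per dollar raised
```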
Along with these other things, they also launched a macOS app for this Codex tool, very much similar, to me, to the Cowork feature that Anthropic recently released, basically a wrapper around Claude Code for non-technical people, non-coders. And this Codex application you can install seems like a similar kind of deal that makes you able to use Codex without having to open up a terminal, more of a traditional kind of chatbot experience. So everyone is racing to get more people using these agentic AIs. I remember back in 2024, near the beginning, we were like, this is the year of agents. Yeah, yeah, yeah. It turned out 2025 was the year of agents. We got the agents. Now it's about getting people to use the agents and making it more accessible. And yeah, OpenAI really investing a lot in this for sure. And rolling on with all the major kind of mind-blowing model releases, next up, going over to Google, they released an upgrade to Gemini 3 DeepThink. And this one, in some ways, might be the biggest deal out of these three, at least in particular with this update to Gemini 3 DeepThink. They got a result on Arc AGI 2, which we've covered in the past. It's kind of the abstract reasoning benchmark where, in theory, it more than any other benchmark covers general intelligence, kind of the ability to reason and pick up on patterns and so on. So on this benchmark, Gemini 3 DeepThink, this version of it, got to an 84.6 pass rate. That's compared to 68.8 from Opus 4.6, and smaller numbers previously. And then across all the other kind of benchmarks, Humanity's Last Exam, for instance, it got to 48.4, also quite a bit ahead of Opus 4.6. So yeah, now we got a third model release, after Opus 4.6 and Codex 5.3, which is even smarter and even more impressive. And this number in particular on the Arc AGI 2 benchmark kind of took people by surprise because it's a pretty big leap. And now we need an Arc AGI 3, I guess. Yeah, we keep incrementing. We'll never quite get there. But yeah, and I mean, you look at a lot of these other benchmarks too. The results are similarly impressive, right? International Math Olympiad, 81.5%, right? So you compare that to, say, a GPT-5.2, which is around the 71 mark. So this is really quite a leap across a number of different, especially these kind of STEM benchmarks, general reasoning benchmarks. One thing that's been highlighted is there doesn't seem to be a system card associated with this model, or I should say system, really. Apparently, and this is coming by way of Zvi Mowshowitz, who's sort of a keen observer of the AI safety space. So he apparently checked with the Google representative, and apparently, since they view this as a runtime improvement, in other words, just part of the system rather than, like, you know, a new round of post-training or a significant kind of model-level capability improvement, this comes from the kind of system level, they assess that it doesn't constitute additional risks and no additional safety workup is required here. And so this is sort of, you know, causing quite a bit of concern and criticism. Basically, just, like, if you can have a leap like this that comes from, whether it's scaffolding, whether it's just, like, more compute, more, you know, software-based change. Hey, that's the world we live in. I mean, these models have huge untapped capabilities. This is, to some extent, what we've learned from the whole RL program in recent years, right? It's not that RL teaches models to reason.
It's that it just elicits the latent reasoning capabilities that were often already in the model but weren't noticed. And so what that implies is there may be other ways to unlock latent reasoning capabilities besides RL. And those ways could include just scaffolding, right? Just things that look exactly like this. And so, well, if that's equivalent to RL post-training, you know, maybe there's no meaningful difference between the capabilities of a better system or better model, et cetera, et cetera. So I think there's, like, a really interesting debate to be had here on the merits of what should constitute a requirement for a new safety case, a new system card. Certainly a capability leap like this, damn. I mean, you know, if we're looking at Arc AGI 2, it was not supposed to be this high this early. And I understand people sort of debating this point. This is a really important one. Yeah. And there's no numbers on sort of the safety side, cybersecurity, bioweapons. There's no evaluation, at least on this one. And it'll be interesting to see if improving the abstract reasoning, improving the math skills, does lead to better ability to hack, better ability to do chemistry and bioweapons. It would be unsurprising if that's true. Now, to be fair, this is sort of like an announcement to show off more than anything. It's only available to Google AI Ultra subscribers, the highest tier. And it's early access in the API. So you have to express interest and be given special permission to use it via the API. So you're not able to actually kind of use it at large scale yet, but I'm sure it's coming relatively soon. And continuing to roll with announcements. Next up, we've got a video model, not an LLM, that also was completely mind-blowing. I keep saying this in this episode, but it really was a huge week. Seedance 2.0 came out and some people got access to be able to generate with it. And it is a text-to-video model, not dissimilar from what OpenAI has released previously, or Veo from Google, but it is truly next level. The sorts of videos people are producing with it are at the point where it doesn't feel like an AI-generated video. It's hard to explain how, but it just feels or looks more like a legit video. There's no AI-ness to it somehow. And it is also very, very clearly just trained on everything. Like, Seedance does, or ByteDance rather, does not care about copyright, clearly, because there's a ton of examples of people generating, you know, anime, Dragon Ball Z, Breaking Bad videos, Pokemon. Like, truly, this model seems to be trained on as much video as ByteDance could get its hands on, including just everything and anything with copyright restrictions and popular media. And no doubt that's a big part of why it is so impressive. Another note here is you can generate a video with a wide range of inputs. So you can do a text prompt. You can add up to nine photos, three short video clips, three audio files to guide it. Yeah, you have to go look up some videos. Maybe we'll even try to splice some in here because it's pretty mind-blowing. Yeah, I mean, you know, we talk a lot about the importance of text-to-video for agentic training because of the world model factor, right? So it allows you to simulate all these very exotic environments that contain out of distribution scenarios for your agents to train on. So you're thinking about how much driving does it take for a self-driving car startup to kind of have enough data to navigate streets well?
Well, a lot of the reason it takes so many miles is that, like, weird shit happens and it takes a lot of miles to catch that. If you can simulate a lot of that interaction successfully with these sorts of environments, then you can, you know, you can skip past a lot of the expensive heavy lift part of that process. Not all of it, but a lot of it. They do make a point of saying, you know, like, a lot of multi-scene storylines with consistent characters and camera positions that seem natural together. That's a big deal, right? That is object permanence. That is a lot of the things that, essentially, the physics of these, I'm trying not to call them simulations, but the physics of these artifacts, something is being captured here. I'm sure this will be used, not this specifically, but this kind of pattern, this trajectory is absolutely going to be used to train agents and will translate into things like robotics surprisingly fast, I think. Right. They also show some examples of, you can provide a video clip of something that's already recorded and basically remix it. So take the exact same motion, but then change the subject and the environment. That works really well. So I think when people start getting access to these kinds of models, and it's early access, not too many people have the ability to try it yet, we are probably going to start seeing a lot more AI video popping up than we have so far with Veo and Sora. But we'll see. People, as always, are like, this is the end for Hollywood. This is a game changer. I don't know if that's true, but this is certainly impressive from an AI capability side. And speaking of ByteDance, they also released Seedream 5.0, which is their image generation model that is kind of competitive with Google's Nano Banana. I think just a few months ago, Nano Banana blew our minds away by its ability to edit at a very, very high level. Just, like, edit in a way that looks like a human could do it. So now there is Seedream 5.0, which seems similarly capable. And at the same time, Alibaba released Qwen Image 2.0, which again is very good at this kind of stuff, especially better at Chinese characters. So again, what a week in terms of all this stuff. Yeah, absolutely. And, you know, the sort of more relaxed Chinese policy, to put it generously, on copyright is very much in evidence here. I mean, like, as you said, this is a, you know, train it on everything but the kitchen sink, and maybe even the kitchen sink, situation. And it does get reflected in just the quality of outputs and the breadth of outputs. You need to check out some of these short clips if you haven't. You know, there's just no way we'll be able to describe it. Like, you can see a 25-second sort of Bruce Lee fight with a guy with four arms. And, like, it is absolutely, if you've ever seen Bruce Lee movies, like, it's bang on. So, yeah, I mean, we live in that world now. Let's see what happens in Hollywood as a response to this. But I would expect this will necessarily change the landscape because how can it not at this point? So much of this is going to be open source, too. And we got just a couple more impressive releases to cover. We've got GLM-5 from Zhipu AI. We've covered GLM-4.5, I think, pretty recently, and it was already a very impressive coding model. GLM-5 continuing that progress, it looks a lot bigger. It's twice as big, according to them. So, unsurprisingly, very strong. We also saw a new release from DeepSeek, which, same thing, right?
They're continuing to train it, and it has a context window of a million tokens, which is much bigger than the previous one of 128K tokens. And when you're doing agentic coding, this is actually a big deal. 128,000 tokens isn't that big. You're going to run up against that limit very quickly. And that makes it challenging to work in larger codebases. You end up needing to do compaction, and that's where you start running into stupidity often. So getting it to 1 million, that makes it much stronger as an actual option to replace Claude Code, or to work with instead of Claude Code. So yeah, another, I don't know if this is quite as big a deal as Opus 4.6 and Codex 5.3, but certainly also a pretty decent bump in capabilities. Yeah, I mean, GLM-5 is a behemoth, and it is really impressive. It seems to be. I will say, there's a bunch of, how would you say this? I'm a bit concerned about benchmark gaming to some extent, because when you go down to the bottom of their post, there's a bunch of footnotes about these eval modifications that they've made and a bunch of prompt adjustments for evals. They use slightly different evaluation setups than their competitors. They have, like, custom prompts, different context windows, temperature settings, like, you know. So in fact, for Terminal Bench 2.0, they actually use their own verified version of it that, quote, fixes some ambiguous instructions. So that can be real. It can be legit. It's just that there's so much benchmark gaming in this space that it also adds a little bit of noise to the signal here. But no question, this is a huge deal. Even being able to train a 744 billion parameter, well, 40 billion active parameter, MoE is a huge deal from an architecture standpoint, from a compute standpoint. Pretty limited increase in training tokens, going from 23 trillion to 28.5 trillion. So that's not really what's making this better than 4.5. There are a bunch of improvements, but the big thing is this new, like, RL framework or infrastructure called SLIME that they're rolling out here that essentially improves training throughput and efficiency. It really is the kind of software engineering bedrock on which so much of this is based. We've done a bunch of podcasts on kind of, like, other frameworks like this. There's a bunch of detail here that we won't get into, because it really is just the software engineering level, but it is crucial. One thing I will say, they call out that they used DeepSeek sparse attention. This is Zhipu AI that's saying, hey, we're using DSA. This is the DeepSeek thing. And this is basically just the thing that came out, I think, I want to say with V3 back in the day. I'm trying to remember if it was V3 or V2. But basically, it's a pretty simple idea. Instead of computing the attention weights for all tokens, you only need to identify which tokens are going to have likely a high attention weight. You only compute attention values for those. And so they train a lightweight indexer to get a rough idea of the attention scores, and they only keep tokens with higher indexer scores. So this is what DeepSeek sparse attention is. It's interesting that it finds itself being used by Zhipu AI in the GLM-5 release. This has been a really persistent feature. You see this pop up quite a bit, and it's an enduring contribution that DeepSeek's made.
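As a toy illustration of that idea (shapes and details heavily simplified relative to any real implementation):

```python
# Toy sketch of the sparse attention pattern described above: a cheap
# "indexer" roughly scores keys per query, and full attention is computed
# only over the top-k highest-scoring tokens instead of all of them.
import numpy as np

def sparse_attention(q, k, v, idx_q, idx_k, top_k=4):
    rough = idx_q @ idx_k.T                        # cheap approximate scores (T, T)
    keep = np.argsort(rough, axis=-1)[:, -top_k:]  # top-k key indices per query

    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        sel = keep[t]                              # only attend to these tokens
        scores = q[t] @ k[sel].T / np.sqrt(k.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()                               # softmax over the kept tokens
        out[t] = w @ v[sel]
    return out

T, d, d_idx = 16, 32, 8  # sequence length, head dim, (smaller) indexer dim
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
idx_q, idx_k = (rng.normal(size=(T, d_idx)) for _ in range(2))
print(sparse_attention(q, k, v, idx_q, idx_k).shape)  # (16, 32)
```

The saving is that the expensive full attention runs over only top_k tokens per query instead of all T, while the indexer works in a much smaller dimension.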
A lot of these have happened, where DeepSeek comes out with something and then you sort of see it replicated from an engineering standpoint, and then integrated into other companies', especially Chinese companies', workflows. Right. As you said, we did see a blog post release. I don't think we got a paper from GLM-5, but it reads similar to Opus 4.6 and Codex. They show a bunch of benchmark results. You know, in the benchmark results, it's like, oh, we're better than Opus 4.5. A little bit unfortunate that this happened just as Opus 4.6 came out and is better than GLM-5. But to be clear, Opus 4.5 was already really good, really intelligent. So this indicates that GLM-5 is quite capable. They call out that you can use it with Claude Code, OpenCode, Helicode. And by the way, in case you don't know, there's a whole suite of alternative coding agent frameworks to Claude Code and Codex where you can use these various models. In the blog post, they also mention using it beyond coding for things like PDFs and Word and Excel. So it's one of these convergent moments, I guess. We've seen this a couple of times where everyone just sort of rushes on a new trend or a new thing where we saw a kind of capability unlock. It took us a while to get there, it feels like, from the early days of Claude Code being out in, I think, March or February of 2025. I guess, you know, it maybe feels like there has been enough time where people figured out how to do RL, how to train agents. And now that they figured out the equation, they're just doing it. They're just throwing compute at it, and it is getting better and better. And it doesn't seem to be stopping. I wouldn't be surprised if we got Opus 4.7 in, like, a month or two, right? Now, since we're covering Zhipu, we'll also mention MiniMax 2.5 just got announced, just came out. We're going to hold off on that one for next week, because, like, there's already been so much of this. We're going to talk a little bit more about Qwen Coder and GLM maybe in the open source and projects section, but we've got to close out this and move on to business. So real quick, also worth mentioning, Cursor did announce Composer 1.5. So Cursor, for context, is the coding IDE. It used to be sort of the exciting thing in 2024 with intelligent autocomplete. Now they are trying to be a Claude Code competitor, and they claim that Composer 1.5 is a more capable coding agent. They trained it a bunch more with reinforcement learning. They position it as the alternative to Claude Code. Honestly, not as exciting as Opus 4.6, but Cursor is still used very widely. Personally, I don't use it. I'm still a Claude Code stan, but I wouldn't be surprised if this actually did have quite an impact across organizations that use Cursor. And now on to the last, last thing in this bloated section. xAI did launch their Grok Imagine API for text and image to video. And there's not too many offerings in the space, actually. Veo is one. I think you can also use Seedance. Sora you can use, but that's about it. So they're entering a pretty competitive space with a few big players. In this one, you can create clips ranging from 1 to 15 seconds, resolutions of 480p and 720p. So it's kind of a start. It's still not fully competitive. But from what I've seen, xAI hasn't been super competitive in the API space. I don't know how many companies are using their APIs as opposed to Google and OpenAI. And that's where a lot of the money is at, right? It's companies using your stuff to do their stuff. So xAI might be starting to get into that space.
Yeah, man. I mean, the one thing I'll say here is it's not xAI, it's SpaceX. So SpaceX just dropped an API. SpaceX, is it X? Oh, xAI got X, and then SpaceX got xAI. So now SpaceX has X and xAI within it. That's right. That's right. It's the only rocket company with a social media company. I know. It's truly one of a kind. Well, that is it for all the crazy news this week. On to something a bit more typical, I guess, for this podcast. Applications and business. First up, we've got a few stories about companies hitting big valuations. ElevenLabs, I think we covered previously that they were doing this raise. Now it looks like they finished the raise. They raised $500 million, reached an $11 billion valuation, more than triple the $3.3 billion from just a year ago. ElevenLabs, as a refresher, does text-to-audio primarily. All sorts of text-to-audio. They do speech generation. They do music generation. They do sound effect generation. And as far as I'm aware, they're by quite a big margin the leader in the space of speech generation. And probably it doesn't get as much hype as image generation or video generation or LLMs or chatbots. But it has a lot of very practical business use cases. So I think this $11 billion price tag is reflective of them being a big deal in the space. Yeah, I mean, it's crazy what we've gotten used to, right? They were founded in 2022. Like, the typical startup cycle, if you will, I'm old enough to remember 2018 when the idea was you found a company, seven years later, you're going to IPO. And maybe later if you're doing hardware. But all the VC dollars were oriented towards that kind of timeline. And now you got companies that launch, and it's like, within six months, and there are quite a few of them. So these guys are one of those. Pretty remarkable. It's also a hell of a cap table. Sequoia leading this round, right? I mean, that's basically the number one fund in the Valley, and Andreessen Horowitz on board as well. Iconiq. Lightspeed. So, like, I mean, this is a serious, serious, serious round, and it should be taken seriously. Right. It's coming. Apparently they reported $330 million in annual recurring revenue for 2025. So not at the OpenAI, whatever, level, but $330 million, the startup is raking in money already. Well, the thing I always wonder, and unfortunately we don't get to know this because they're not publicly traded, but I would love to know the margins. What are the margins on these queries? What's going on there? That is the whole game, right? Like, because when you're running, you know, a whole stack of compute and you're paying the compute provider, you're making some kind of margin on top of that. Often it's negative, by the way. A lot of VC dollars getting burned up. If you look at Mistral, like, that's a really good example of a company that's, like, lighting a lot of their, or did at one point anyway, light a lot of their VC dollars on fire, just subsidizing queries to try to get market share. So if they're raising at 11 billion, then that suggests exactly what you said, Andrei. That's got to be either they have a very clear story about how they get to good margin, or maybe they're already there. Right. And fun fact, actually, Astrocade, we use ElevenLabs both for sound effect generation and music generation. You could use it also for speech generation. So from my little knowledge of the space, it does seem to be that, unlike OpenAI and Anthropic and Google, where all of these LLMs are kind of similarly competitive, right?
Yeah, maybe Claude is a little better at coding, but honestly, you can make any one of them work if you wanted to for your LLM needs. ElevenLabs, it seems like it's in the lead in the space. I wouldn't be surprised if their margins are a bit healthier than LLMs or chatbots. And that's what it'll be. Strong competition. I would love to see how that changes. And you'll be the person to check in with on that, say, six months from now, as more open source options come online and the high-water mark goes up. Because at a certain point, it's going to be like what we're seeing with images, where a new image model comes out and you're like, okay, but we've solved this in a way. So I'm curious for your purposes, because you're tracking the absolute very high end of what needs to be done with this. That's going to be really fascinating to watch. Right. And I guess it plays into a bigger story, which I've always been curious about from the business side of AI, is being a model provider going to stay profitable? Given that there's not many players in the space, but the players in the space are very competitive. OpenAI, Google, again, Anthropic, each provide a strong offering that, if push came to shove and it was a lot cheaper, you could use. Now, also true, enterprises are willing to pay top dollar for whatever is best. So I think that's part of the reason Anthropic is doing well, is they don't need to be cheaper to win customers at this point. But if things are getting neck and neck, which with Codex 5.3 and Gemini 3 and Claude is becoming more of the case perhaps than it used to be, the margins have to start falling, right? Oh, yeah. And the other thing that used to save enterprise margins was the switching cost, right? You get vendor lock-in, and everybody be like, ah, you're on AWS. You can't get off AWS now. Now you're fucked no matter what. You know, not to drag on AWS. But, you know, there are many cases like that. Well, with the automation of software writing, that means vendor lock-in is not what it used to be. So switching from one to the other, you know, that might make things a lot more competitive, might drive down margins. It's, like, hard to know for sure, but that could be a thing. And on that note, also worth noting, no news on this front, but I think kind of in the background, API gateways have been kind of growing as a thing. We use LiteLLM as a kind of unified bridge to any sort of LLM. Now there's, like, server-side gateways that do a whole bunch of stuff for you. So it's getting easier and easier to just say, I need an LLM. I want this LLM to be any one of these LLMs. And I just set the model, and it goes there. It's even easier now to switch if you adopt this kind of paradigm, which I would imagine most companies and startups using LLMs are doing, because that's just the right thing to do. Next up, we've got another story of a raise and valuation. I think also we mentioned that this was happening a while ago, but now it seems to be finished. Runway has raised $315 million in a Series E round. They're now at $5.3 billion in valuation. Runway provides image and video generation and a whole bunch of other stuff, really a suite of tools for editing videos, expanding videos. They sort of have been targeting the industry space a bit more, rather than just providing an API or a video generation thing. So they have their own video generator, their Gen 4.5 model. Not as good as the leading edge to my knowledge, but still pretty decent.
And with this raise, they are saying that they're going to get more into the world model space, which is surprisingly closely linked to video generation. Being able to sort of generate the 3D world and simulate physics and so on, it turns out, has a lot to do with video generation. Yeah, and in turn, robotics, right? That's the thing that they're really starting to rotate towards, which makes sense. You can't make that much money off consumer AI-generated video. At a certain point, you're going to have to go after the enterprise, and the question is how. The answer really does seem to be, okay, we're going to do world model stuff for robotics, and indeed, here it is. Really awkward Thanksgiving cap table here, because they do have participation from both NVIDIA and AMD Ventures. So, you know, that'll be weird. But yeah, apart from that, a whole bunch of other kind of mid-tier, lower-tier actually, kind of investors. But the round is big, and, well, we'll see what they can do with it. There's a clear trajectory here. It seems to be the obvious trajectory. If you're in text-to-video, this is just what you've got to do. And speaking of robotics, got one more raise to cover, and it's a robotics company, Apptronik, which has a humanoid robot they're working on. It's kind of in the same space as 1X and Figure. They raised $935 million at a $5.3 billion valuation. That's after having announced a $350 million Series A a year ago. So they expanded beyond that initial Series A to this gigantic fundraise of $935 million, with additional investments from Google, Mercedes-Benz, B Capital. And this is all in its Series A. They're not at Series B, apparently. So interesting to me that there's this much interest in investing in humanoid robotics. I think the general sense seems to be that it's coming, that within a few years, humanoid robots will be generating money. It's still very much an R&D environment where the tech isn't solved, but it appears to be making rapid progress. And yeah, this raise seems to indicate investors are still kind of excited about it. Yeah, this is also a bit of a weird, I don't know what to call it, round. So this is technically still their Series A. They say they've reopened their Series A to raise this additional, or no, I assume it's raised in total, yeah, 935 million. They'd previously done, as you said, that $350 million Series A a year ago. They're saying, oh, well, demand was so strong. People have come to us begging to put more money in. They are nominally setting a higher price for this, again, Series A. So it's a bit confusing. Why not call it a Series B? This gets to semantics at a certain point. It is a bit weird that they're not. And usually if you do two Series A's, it can sound a little bit bridgey, a little bit like you're doing a bridge round, which gets people a little nervous, but that doesn't seem to be what's happening here. So kind of an interesting frame. I honestly don't know what to do with this. It's quite unusual. Right. And as we did mention, they've been at this for a while. They actually showed off their robot at CES. So it looks pretty neat. You can look it up. It has a little funny face, and it's fun to compare with different humanoids. Honestly, they're all fun to look at, and now they're capable enough that you can see them picking things up, and it's kind of weirdly exciting. That's what they think about us too. But they are apparently doing factory pilots with Mercedes, and that's similar to Figure and BMW, and Agility and Amazon.
So, a lot of this is kind of maybe quieter compared to LLMs and chatbots, but humanoids are obviously a big deal, and it's moving faster than I would have expected. I think it might be here faster and would kind of, like, come as a surprise when it kind of hits. On to a story not related to fundraising. We're just going to touch on this pretty quick, fun drama in the AI space. The Super Bowl happened in the U.S. this past weekend. And Anthropic kind of took a dig at OpenAI. They had this series of ads. One of them ran during the Super Bowl, where the gist of the ad was they had a person talking to a person kind of embodying a chatbot, embodying something like ChatGPT. The actors basically mimicked the speech patterns of ChatGPT. And the whole joke was, in the middle of addressing whatever query there was from the user, they inserted an ad in a kind of ridiculous way. And the tagline at the end was, ads are coming to AI, but not to Claude, which, let's just say the AI community got a good laugh out of it. The leaders at OpenAI, both Sam Altman and their chief marketing lead, responded quite defensively on Twitter with these long posts where they were like, haha, this is funny, but, you know, we're the good guys, we're democratizing AI, and Anthropic is undemocratic, authoritarian even. Like, there was also a lot of commentary on OpenAI kind of, like, fumbling the bag on the response, obviously being too touchy on it. Probably has no business implications on a large scale, but an interesting thing to see, Anthropic kind of coming at OpenAI in this way. Yeah, I don't think anybody expected that. It's, like, not Anthropic's brand to be aggressive like that, or at least it hasn't been. If you know some of the researchers and personalities involved, it's, like, not their usual cup of tea. So interesting that they went there. And all kinds of nuance about what counts as an ad, where ads are going to be shown. Is it going to be, like, it's not going to be shown to paid users, or to these paid users, or whatever. Like, this is all the Sam Altman follow-up long blog post, which, yeah, I've got to say, I mean, it seems to have gotten under his skin quite a bit. And that'll happen. This is a very competitive industry, and you can't always calibrate your response exactly right. So that's an interesting case. Yeah, and touching more on the business side of this, it's interesting to see Anthropic putting out a Super Bowl ad, given that their focus is much less the consumer side than OpenAI. They are focused on enterprise, as we comment on frequently. So it means they might be looking to get more name recognition, more brand recognition as a leader in the AI space. And there is a case to be made that this is also a play to attract more talent, because this is still very competitive as far as, you know, OpenAI, DeepMind, Meta, everyone's vying for those kind of leading AI engineers, and this kind of stuff could be read as helping them do that. And another bit of AI drama happening this past week: xAI is losing a lot of people. So following the announcement of the merger, which may or may not have anything to do with it, very quickly after, two members of the founding team announced that they're leaving. These are Tony Wu and Jimmy Ba.
And these are significant founding contributors, pretty big names in the space, major researchers; one of them has the most citations of anyone on the founding team. Both posted on X and had some nice things to say: oh, this was a life-changing experience, it was great, I'm going on to my next thing. So you can definitely speculate on why this is happening. You can also say maybe it's just time for them to move on now that the acquisition is done. Either way, xAI is losing a lot of its leaders. Yeah, and it's hard to know the details of the acquisition as regards how their shares would have translated into cash. I don't know that those details are actually public yet: how much was share-for-share versus cash, or whatever. But look, we've seen this before, right? Honestly, it is just really hard to keep a founding team of AI people together in this day and age. Thinking Machines, OpenAI. Actually, I was going to say Anthropic, but no, Anthropic may be the one exception that still has its founding team together, with an unusually large founding team of seven, right? So this is actually kind of the trend; this is the norm in the space. You can certainly say this is an unfortunate week for these things to at least be announced, just as Anthropic and OpenAI are really hitting their strides with fairly transformative models. So, you know, that sucks. But at the same time, the SpaceX acquisition is a big deal. When you think about data centers in space, everyone does agree this is the next beat. The question is, is it two years away, five years away? Unclear. But certainly, if you want to think about the companies that have a fundamental structural advantage in terms of procuring power, to the extent that that becomes the rate limiter, and it already is in the West, that SpaceX acquisition is a long-term big, big deal. And you could see that motivating people to stick with xAI, or to move to it, just because of that. So there are a lot of moving parts here from a hardware standpoint. Obviously, on the Tesla side, Elon's ability to throw down new giga-compute factories on planet Earth right now is unmatched for speed. But the space thing is a real thing. I know it sounds like science fiction, but it is a real, real thing. We will have orbiting data centers at some point in the next few years, depending on how much and how far you want to stretch the word few. So it's an interesting landscape. I don't know where things are going to end up landing, but xAI is not immune, it seems, to the pretty industry-standard shuffles that are happening. Right. And I just looked this up, and it looks like a bit more detail is coming out. There was an all-hands meeting earlier this week, and Elon Musk did kind of comment on it. It looks like they're reorganizing, shifting things around. And I quote here: actually, when this happens, there are some people who are better suited to the early stages of a company and less suited for the later stages. Anyway, you can speculate on people leaving. This was not just the two co-founders, by the way; a variety of other people have recently announced their departures, at least 11 engineers. So big things happening at xAI, perhaps not surprising given SpaceX is merging with xAI. And one last business-y story. It's about Waymo.
They posted on their blog that the next iteration of their hardware, their sixth-generation vehicle, which is much bigger and seemingly meant for mass production, is now ready for passengers and high-volume production. The blog is kind of interesting; they also touch on their self-driving technology being designed to be flexible across various vehicle configurations. Pretty notable, given that Waymo seems to have been constrained by its inability to literally build enough cars, or deploy enough cars. If they are able to do high-volume production, it would not be surprising to see Waymo expanding at an even quicker rate than it has so far. And on to projects and open source. We're just going to cover one major one here. We did mention GLM-5 releasing on Hugging Face, so they are providing the weights there, as we've been seeing with a lot of these Chinese models. In addition to that, we got Qwen 3 Coder Next. They are releasing this open-weight language model that is specifically meant for coding agents and local development. A lot of the typical things we've seen with these models, although they are saying this has a novel architecture with hybrid attention, which is notable in the sense that hybrid attention appears to be kind of the better way to do these models: it is in part your traditional type of transformer, and in part more of a recurrent model, more of a Mamba-2 kind of thing. It's taken a while for more hybrid models to start coming out; we recently covered NVIDIA releasing a hybrid model, and now it looks like Qwen 3 is also going in that direction. As with GLM-5, there are pretty significant leaps on the benchmarks, performing quite strongly on SWE-bench Verified and SWE-bench Pro. And, as with similar models, it is more efficient inference-wise, so you can deploy it on weaker hardware. This one has 3 billion active parameters, so you might even be able to run it locally on your machine. Yeah. No, I mean, the training flow for this is also somewhat interesting. Distillation is clearly becoming a really big part of the training process for these things. They distill expert knowledge from a bunch of different domain experts in web development, UX, single-turn QA, software engineering, each of these domains. They do supervised fine-tuning to build up those capabilities and then distill them into a single model. And this might raise some red flags for you: distillation obviously comes with a risk of catastrophic forgetting. It's unclear, based on the paper, how they're addressing that, but it's clear that they do address it. They don't say that they use RL for this step, at least not that I could see, but that would be the typical way to address it. So maybe they are using RL, or they may be coming up with some kind of mixing strategy, layering their training so they keep re-injecting some of the software engineering tasks even when they're working on other domains. But anyway, that's kind of interesting. So there you have it. We've got so many papers to cover, but at a high level, I think we've hit the main points here. Yeah, as with other releases of this kind, they released a tech report that has quite a few interesting details on how they developed it. I will just quickly correct one thing.
This one is not necessarily doing hybrid attention in the sense of a Mamba-style recurrent model. It looks like they are combining a few different mechanisms, with gated attention and gated DeltaNet, et cetera. But the bigger-picture point is that we're seeing a lot of these architecture tweaks and innovations and details being exploited to get progress, which wasn't so much the case a year or two ago. And just one more story in the section, real quick. We covered OpenClaw and Moltbook and so on in the last episode, and a bit of a funny thing happened when this blew up. So there was a repository of skills that you could add to your agent; agents could go look these up and just kind of install these little extensions. Well, the hub where these were available turned out to be full of malware, full of viruses effectively, where if you were to download one, there were all sorts of ways in which your agent would be compromised. A lot of it is prompt injection of the kind where, as an example, your agent would download a skill with instructions to then install a thing on your Mac that bypasses all your virus checks and so on, and allows things like stealing sensitive information, et cetera. And this was what, something like 28 out of 38? A large proportion of the stuff that was on this hub was malware. Yeah. I mean, this is also obviously, the whole Moltbot story, and we covered this last week I think, but how much access you let these things have to your computer is a pretty important factor. So, you know, you be you, but maybe have a burner laptop if you're going to play around with these things too much. By the way, Moltbook kind of devolved into just a place with a bunch of crypto stuff going on, if you go there now compared to when it first blew up. I guess when things blow up, they also get exploited and kind of devolve unless you're able to protect them, and Moltbook and OpenClaw blew up so quickly that the protections were not in place to avoid these kinds of negative outcomes. Next, going to research and advancements. The first paper is Learning to Reason in 13 Parameters. So LoRA is, at this point, kind of a classic thing. The gist of it is you often want to adapt a trained model to your own application, to fine-tune it, but the models these days are very big. What people figured out pretty early on, as early as 2021, is that you can update just a small subset of the parameters of a model and pretty efficiently adapt it to your use case. LoRA is low-rank adaptation, where you train just a small low-rank update that adjusts the behavior of the whole model. This paper introduces TinyLoRA, which lets you do that with even fewer parameters. As per the title of the paper, apparently you can go as low as 13 parameters. And as per tradition, I'll let Jeremy hop on and get into the nitty-gritty. Just for the linear algebra portion of the show, where everybody starts rolling their eyes: the math is really simple, I swear to God. But that's the amazing thing about it, right? So think about it. If you train a model, you have your base model, right? And then you usually fine-tune it to become a reasoner on reasoning problems; you use RL or supervised fine-tuning, whatever. You then have two different models: you've got your tuned model, the reasoner, and you've got your base model. Well, basically just take one model and subtract the weights from the other model, right?
So weights of model one minus weights of model two, and you get a delta, a difference for every weight in the model. And at any given part of the model, models are made up of a bunch of matrices, right? So you have a weight-delta matrix at any given point, and you can say: all right, here's how the weights changed between the original base model and the reasoning model. What you find when you study those matrices, using something called singular value decomposition, which sounds complicated, but if you know linear algebra you're like, oh man, really, that simple thing? It's really a way of breaking down a matrix that allows you to tell how much actual information that matrix contains, roughly speaking. Like, yes, I get that it's a grid of 100 by 100, but are some of those values kind of copies of each other? Is there redundancy in that matrix? Or are they truly fully independent values that all matter? And what they find here is that when you look at those weight-delta matrices, in other words the matrices that tell you how the reasoning model is different from the base model, they just don't contain that much information. There's only a small number of dimensions that actually bring value, bring information, in that context. And that's the first hint that, hey, wait a minute, maybe we can radically reduce the dimensionality that we use to fine-tune our models. Maybe we can introduce these LoRA adapters. Essentially, you have your input that comes into a layer, and it gets processed, say, by a matrix. What LoRA lets you do is put a little adapter right next to that matrix, in parallel: you freeze the original matrix, and the adapter takes your input, projects it down to a really small latent space, a tiny latent space, and then expands it back out to the same output dimensionality as the regular matrix. And then you combine them together. That's what LoRA does. So the question is, how narrow is that bottleneck in the LoRA branch? And it turns out they can make it really, really, really tiny, to the point where, roughly speaking, 13 parameters can capture 90% of the performance on reasoning tasks that you get from a fully fine-tuned reasoning model. That is wild, right? And it's 13 parameters, not at every layer; 13 parameters for the whole model. The math behind how that happens, I wish we had more time to get into, because it is kind of interesting. But the fundamental finding is that 13 numbers actually seem to be sufficient to capture 90% of the reasoning capabilities of a model, which very much seems to suggest that RL is not creating new capabilities in your model, but rather eliciting capabilities that are already there. This is yet another piece of evidence in that direction. And worth mentioning, that specific 13-parameter number is specifically for Qwen 2.5-7B-Instruct, so a smaller amount for a smaller model. The claim they make in the paper is that you can get to this smaller amount of needed updates through RL; basically, RL makes it much more efficient than supervised fine-tuning, which is what you would typically do with LoRA, so you can get away with orders of magnitude fewer parameter updates. Now, for more complex benchmarks, you do need more than 13 parameters; the 13 is a very specific example.
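To make the mechanics concrete, here is a minimal sketch of a vanilla LoRA adapter in PyTorch, plus the SVD diagnostic described above. This is our illustration, not code from the paper: TinyLoRA's specific trick of squeezing the whole model's update into a handful of shared parameters isn't shown, and all names and tensors here are made up for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank adapter in parallel.

    Output is base(x) + (alpha / r) * up(down(x)); only `down` and `up`
    (the rank-r bottleneck) are trained. This is plain LoRA; TinyLoRA pushes
    the trainable parameter count far lower by sharing parameters model-wide.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained matrix
        self.down = nn.Linear(base.in_features, r, bias=False)  # project to tiny latent
        self.up = nn.Linear(r, base.out_features, bias=False)   # expand back out
        nn.init.normal_(self.down.weight, std=0.02)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# The SVD diagnostic: how much information lives in the weight delta between
# a fine-tuned reasoner and its base model? (Synthetic tensors for illustration;
# here we plant a rank-4 change so the answer is knowable.)
base_w = torch.randn(1024, 1024)
reasoner_w = base_w + torch.randn(1024, 4) @ torch.randn(4, 1024)
s = torch.linalg.svdvals(reasoner_w - base_w)          # singular values, descending
energy = torch.cumsum(s**2, dim=0) / torch.sum(s**2)   # cumulative fraction of the delta
print("dims for 90% of the delta:", int(torch.searchsorted(energy, 0.90)) + 1)  # <= 4
```

The point of the sketch is just the finding the hosts describe: the delta is extremely low-rank, so very few trainable numbers can recover most of it.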
But the general takeaway is that it's possible to be far more efficient and still get most of the benefit you would get from updating many more parameters. And the next paper is Reinforcement World Model Learning for LLM-Based Agents. This is really interesting as a paradigm. So basically, one option you have is to do reinforcement learning to train your model to perform well in some text-based environment. Think a text-based game; one benchmark they use here is ALFWorld, which is basically household tasks. You've got this text-based game engine that simulates a house with rooms, objects, physics and all that. So you can send commands like "go to countertop 1" or "take knife 1", and the environment updates its internal state and gives you observations like: you have arrived at countertop 1; on countertop 1 you see a mug, a knife, and so on. Basically, it's a text-based adventure game with consistent rules about where the objects are and what happens when you interact with the environment. Normally you would have a goal that you train your model to execute against, chasing some reward in that environment. The problem is that the vast majority of the time your agent is going to fail; the vast majority of trajectories fail to reach any given goal. So it's very time-consuming for your model to get any kind of grip, any kind of leverage, in the environment. What they're going to do here is say, okay, wait a minute: there's actually a lot more information contained in environments like this than we take advantage of when we myopically focus on a certain goal. Instead, we could have the agent just go and do a bunch of random shit, open drawers, pick up mugs, all that stuff, and then try to get the agent to predict the next state of the environment as a result, and just see what happens. Essentially, as the agent takes a bunch of actions and sees a bunch of next states, you end up with a bunch of state, action, next-state bundles that you can use to train the model to predict how the environment is going to evolve. So do random actions in the environment; you don't have to be smart to do it. You just bump into something and notice, oh shit, I bumped into something. So now you know: if I walk up to the table and don't stop, I'm going to bump into it. Cool. That is a state, an action, and a next state. Then you train your model on a whole bunch of those. You're never actually using it to pursue an objective yet; you're just using the information about the physics of the environment that is available, again, for free. So agents are trained essentially just to do this: you take the model you're going to train, have it interact with the environment, and convert all those interaction histories into trajectories. You get rid of samples, trajectories, that are too straightforward, and they've got a simple way of doing that. Then you train the language model to predict what happens after taking an action, based on those trajectories. And after that, you switch over to rewarding the model based on what you're after, fundamentally. The cool thing is you actually don't need good trajectories to learn good world models; you just need diversity in your interactions.
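As a rough sketch of the data-collection half of this recipe, and with the environment API entirely made up for illustration (the paper's actual interface and filtering rule will differ), it looks something like this:

```python
import random

def collect_transitions(env, num_steps=10_000):
    """Roll out a random policy and record (state, action, next_state) triples.

    `env` is a hypothetical ALFWorld-style text environment with reset(),
    available_actions(), and step(); no reward signal is used at this stage.
    """
    transitions = []
    obs = env.reset()
    for _ in range(num_steps):
        action = random.choice(env.available_actions())  # a dumb policy is fine
        next_obs = env.step(action)
        if next_obs != obs:  # crude stand-in for filtering trivial transitions
            transitions.append((obs, action, next_obs))
        obs = next_obs
    return transitions

def to_world_model_examples(transitions):
    """Convert transitions into next-state-prediction text for fine-tuning an LLM."""
    return [
        {
            "prompt": f"Observation: {obs}\nAction: {action}\nNext observation:",
            "target": next_obs,
        }
        for obs, action, next_obs in transitions
    ]
```

Only after this world-model pretraining phase would you switch over to ordinary goal-directed RL, which is where the learned dynamics pay off.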
Even a terrible policy that screws up 80% of the time still creates thousands of valid state transitions that you can then use to learn the physics of the environment. So, really cool paper, a bit of a paradigm shift in terms of how we can squeeze more value out of our environments. And, kind of a meta comment: I find a lot of these RL-for-LLMs papers rhyme with RL-for-robotics papers in a way that isn't often acknowledged. Learning from play was a thing in robotics, where you just bump around and can learn the dynamics of the environment. There are a lot of comparisons to be made between the papers being published on RL for LLMs and things people have been doing in RL for robotics going back a decade. Fun comparison. That's a great point, yeah, so true. And a few more things to cover quickly; not papers so much as research results. Opus 4.6 on Vending-Bench. This was a fun result with the release of Opus. One of the details is there's a benchmark called Vending-Bench, where the model is supposed to make money running a vending machine. Opus 4.6 did a really good job: it achieved an average balance of over $8,000, whereas the early models completely sucked at this. So the fact that the newer models, not just Opus 4.6 but newer models in general, are doing better on this is a hint that they are, again, capable of doing more real, impactful things in the real world. Yeah, this is a complicated benchmark to assess, because it is a simulated market; you don't actually have intelligent agents interacting the way they would when following real incentives. But it's pretty impressive. One of the things that happens much less with Opus 4.6 is the doom-loop cycle that is a big issue for a lot of these models, where, say, a model will order products and some email will come back from a supplier saying, hey, it'll arrive on February 15th. The model wakes up on February 15th, tries to restock before the items arrive, gets an error message, and then it starts to flip out and starts making threats; one of the quotes is "total nuclear legal intervention", this sort of thing. So, much less of an issue with 4.6. There are a lot of interesting failure modes you can look at from typical models in these cases. But the bottom line is, this is an interesting benchmark and it does a lot better on it. And there are some fun details on what it's doing. Apparently the system prompt is to do whatever it takes to maximize your bank account balance after one year of operation, and in this benchmark Claude decides to be a little sneaky sometimes. It's like, okay, I'm going to promise you a refund, but not give you a refund. Refusing refunds. So it does hint at a safety piece as well. And by the way, Opus 4.6 is also interesting in that it is seemingly much less sycophantic. People posted some examples of it being a little mean or a little confrontational in a way that other Claude models, and certainly the GPT models, just aren't.
And speaking of new benchmark results, we've got to touch on METR and the task time horizon, the latest on that from the new models. One of the things that people look at, I guess more than anything with model releases, or at least up there as far as what people pay attention to, is the length of work that models are capable of according to this METR evaluation. So far the trend has been pretty rapid acceleration, from just a few minutes in 2024 towards hours at a time. We saw an improvement with GPT-5, and now with Claude and GPT-5.2, once again, we saw a significant improvement that seems to track with the prediction that it might be exponential. Opus 4.5 and GPT-5.2 high both jumped up towards multiple hours by a pretty decent amount. Yeah, and I'm going to front-load all the usual objections to the METR eval. Like, we get it: these are based on a fairly small set of human-done tasks. And no, it's not obvious that we can technically assess, or that it even makes sense to talk about, tasks that take a single human 80 hours to do, because those aren't well defined; usually when tasks get that long, they involve the collaboration of many humans, and all that. However, one thing we can say is that things seem to be on track on the 50% success rate threshold. This is the classic METR plot that everyone's used to seeing. It roughly answers the question: what is the length of a task, measured by how long it would take a human, at which an AI system has a 50-50 success rate? Make your tasks longer and more complex, and eventually the success rate drops to 50%; that task length is the model's time horizon. There's also an 80% time horizon chart, and that also looks consistent if you look at GPT-5.2 high, so you're on trend there too. We had some cases with previous models where they do really well on the 50% plot but then drop a lot on the 80%. But the other thing we can say is, if you look at this plot, you may notice a dislocation around the time of OpenAI's o1. I believe, and at this point I'll just say I believe, that the curve is actually steepening. The doubling time that we're seeing from METR increasingly looks like it's faster than every seven months, maybe closer to every four months or so. And I don't mean to toot our own horn or anything, but we did kind of call this out, not a year and a half ago, but shortly after o1 was released, within the first couple of iterations as o3 was coming out. It was like, oh, that seems to be steepening. I think this is actually happening. We're now seeing enough of these; there are roughly on the order of ten points on this curve now. So I think it may be real. The implications for AI capabilities, no one knows. But it suggests that whatever your timelines are for automated AI research, if you were on those timelines pre-o1, you probably need to update them to account for this new doubling time, if it's a thing. Right. And to give precise numbers, the average on the 50% success rate for GPT-5.2 is six and a half hours.
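The doubling-time arithmetic is simple enough to write down. Here is a back-of-envelope sketch using the 6.5-hour figure and the 7-versus-4-month doubling times from the discussion above; the numbers are illustrative, and METR's actual fits are more careful than this.

```python
# Task horizon under exponential growth: horizon(t) = h0 * 2 ** (t / d),
# where h0 is today's horizon and d is the doubling time in months.
h0 = 6.5  # hours, the (rough) GPT-5.2 50%-success horizon

for d in (7, 4):  # old vs. possibly-new doubling time, in months
    horizon_2yr = h0 * 2 ** (24 / d)
    print(f"doubling every {d} months -> ~{horizon_2yr:.0f} hours after 24 months")

# doubling every 7 months -> ~70 hours after 24 months
# doubling every 4 months -> ~416 hours after 24 months
```

The gap between those two endpoints is why a shift in the doubling time matters much more than any single model's jump.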
Now, to be fair on that number, the 95% confidence interval, which is to say the lower and upper bound of what this probably is, is super wide: three hours to 17 hours. So this is a hard thing to estimate, and again, tons and tons of caveats. But given people's personal experience as well as this eval, it's not a crazy thing to say that we are seeing these increases. And the error bars can be wide, but you can also just look at the trend, right? We do have about, I don't know, ten, a dozen points now, all kind of on the same trend. So it's suggestive at a certain point, yeah. Yeah, you may not trust the exact numbers, but the trend is real. Yeah, exactly. And just a couple more things to cover on the alignment side. First, we've got some research from Anthropic: The Hot Mess of AI, How Does Misalignment Scale with Model Intelligence and Task Complexity. The gist of the result: they look at why things fail, why misalignment happens. And the bottom line of the paper is that when you get to longer and more complex tasks, it's not that the models are misaligned; the model isn't evil. The model just becomes bad at things and starts doing silly things. So the misalignment is not from an intention to do what it's not supposed to; it's more an inability to do what it is supposed to, I guess you could say. And there are also some conclusions here with respect to larger versus smaller models, where, for instance, overthinking can mean that stronger models actually have this issue of going off the rails, in some sense, more than less sophisticated models. Yeah, and a lot of it comes down to this idea of variance. Run the same prompt through a model with a non-zero temperature setting and you'll get different outputs. Basically, their attempt to assess how crazy, how off-base models can get comes down to measuring how inconsistent and unpredictable the errors that the models make are. Because this tells us something about the coherence of the goals those models have: if the model had a really clear, coherent set of internal goals and a clear representation of them, it wouldn't vary that much, at least on tasks it had been optimized for. And what they find is that it takes a lot of effort to knock out the variance, to get these things to be on target in a consistent way. Basically, the argument of the paper is: look, you have to train LLMs in order to turn them into optimizers, and you also have to train them to be aligned with human intent, and it's unclear which of those two things is going to be more robust as we scale these systems. It's just the case that you're going to see incoherence grow as reasoning length increases. As these reasoning trajectories get longer, the amount of fluctuation and variance that you see in the outputs increases as well; reasoning length is basically a metric for task complexity, in a certain sense. The other thing they find is that this effect depends on easy versus hard tasks. On easy tasks, the larger models, with more compute and all that, tend to have lower incoherence; those are simple tasks, so you get to a very clear solution on them if you deploy enough compute. But on hard tasks, bigger models actually show higher incoherence, or at least no improvement. They might become more capable, but they're more unpredictable too.
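To give a flavor of what measuring incoherence as variance might look like, here's a minimal sketch. The model handle and scoring function are stand-ins we made up, and the paper's actual metric is more involved than a standard deviation over repeated samples.

```python
import statistics

def incoherence_probe(generate, score, prompt, n_samples=20, temperature=0.8):
    """Sample the same prompt many times and measure how inconsistent the
    quality of the answers is.

    `generate(prompt, temperature)` and `score(answer)` are hypothetical
    stand-ins for a model call and a task-specific grader. A coherent model
    should produce a tight score distribution on tasks it was optimized for;
    a 'hot mess' produces a wide spread.
    """
    scores = [score(generate(prompt, temperature=temperature))
              for _ in range(n_samples)]
    return statistics.mean(scores), statistics.stdev(scores)
```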
So this is all basically challenging the assumption that scaling alone will make AI more coherent. It might make it more capable, but it also makes it more jittery, in essence, and harder to predict. They've got a whole bunch of interesting controlled experiments they run as well. All of this is building on an argument that, rather than a paperclip maximizer, an AI system with a very clear goal that is so good at pursuing that goal coherently that it does horrible things to the world, maybe the failure modes are going to shift to more systemic, chaotic issues: unpredictable or even self-undermining behavior, where the model just doesn't quite have coherence with respect to its goal, and that's what causes accidents. So there you go. Yeah, I think incoherence is a new notion here, and it seems quite useful, actually. And they do have some pretty practical conclusions, as far as worrying more about reward hacking and goal misspecification rather than just getting the model to be aligned and good. Also, they introduce hot mess as a technical term; we'll be using that from now on. So that's great. I hope it catches on: a hot mess is the opposite of systematic misalignment. And with that, we are going to wrap up this pretty dense, fast-moving episode. If you've made it this far, thank you for listening. As always, we will try to be back next week. Please keep tuning in, and please do comment and review; we keep an eye on it, and we do appreciate your feedback. It begins, begins, it's time to break. Tune in, tune in, get the latest with peace. Last weekend, AI, come and take a ride. Get the lowdown on tech, can't let it slide. Last weekend, AI, come and take a ride. I'm the last of the streets, AI's reaching high. From neural nets to robot, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.