Last Week in AI

#231 - Claude Cowork, Anthropic $10B, Deep Delta Learning

103 min
Jan 21, 2026
Summary

This episode covers major AI developments including Anthropic's $10B funding round at $350B valuation, Claude's new Cowork desktop agent tool, DeepSeek's technical innovations in model training, and geopolitical tensions around AI chip export controls to China. Key themes include the shift toward agentic AI systems, advances in reasoning and memory scaling, and the strategic importance of compute access in the US-China AI competition.

Insights
  • AI companies are transitioning from selling intelligence to selling labor—pricing reflects this shift with Cowork at $100-200/month positioning it as intern replacement rather than autocomplete
  • Reinforcement learning prevents catastrophic forgetting better than supervised fine-tuning when training domain-specific models sequentially, enabling more robust multi-domain reasoning
  • Packaging constraints, not chip fabrication, are the real bottleneck in NVIDIA's supply chain—lifting China export controls creates internal competition for limited packaging capacity
  • Chinese AI labs are fundamentally compute-constrained for training/R&D due to inference demands from 1.4B population, creating a structural advantage for US-based companies
  • Open-source model releases from NVIDIA, DeepSeek, and TII are rapidly commoditizing training recipes and architectures, shifting competitive moats toward alignment, safety, and B2B relationships
Trends
  • Desktop AI agents with file system access becoming standard—security via sandboxing rather than capability restriction
  • Hybrid Transformer-Mamba architectures emerging as practical solution for long-context reasoning without quadratic attention costs
  • Sequential domain-specific RL training replacing mixed-domain supervised fine-tuning to avoid catastrophic forgetting
  • Test-time compute scaling and recursive self-calling enabling arbitrarily long context processing without fitting entire prompt in memory
  • Geopolitical bifurcation of AI development—US focusing on inference infrastructure while China struggles with training compute allocation
  • Constitutional AI and multi-stage classifier cascades becoming production-grade defenses against jailbreaks with 40x computational efficiency gains
  • Sovereign wealth funds and nation-states becoming primary capital sources for frontier AI labs—traditional VC insufficient at $10B+ scale
  • Memory-augmented architectures (n-gram lookup tables, external knowledge stores) decoupling knowledge storage from reasoning computation
  • Positional embedding removal post-training enabling context window extension from 2K to 32K tokens without retraining
  • Open-source model quality reaching parity with frontier models in specialized domains (coding, math, reasoning) within 6-12 month lag
Topics
  • AI Agent Safety and Sandboxing
  • Constitutional AI and Jailbreak Defense
  • Reinforcement Learning for Model Training
  • Long-Context Processing and Memory Scaling
  • Transformer Architecture Optimization
  • Hybrid Transformer-Mamba Models
  • AI Chip Export Controls and Geopolitics
  • Compute Supply Chain Constraints
  • Open Source Model Development
  • B2B AI Product Pricing Models
  • Catastrophic Forgetting Prevention
  • Test-Time Compute Scaling
  • Constitutional Classifiers
  • Positional Embeddings in Transformers
  • Chinese AI Competitiveness Gap
Companies
Anthropic
Raised $10B at $350B valuation; launched Cowork desktop agent; expecting profitability by 2028; planning IPO late 2026
OpenAI
Signed $10B multi-year compute deal with Cerebras for inference; diversifying compute sources; mentioned as competiti...
Google
Expanding Gemini with personal intelligence features; rolling out AI overviews in search; expanding Gmail AI capabili...
NVIDIA
H200 supply constrained by packaging bottleneck; in talks to acquire AI21 Labs; CEO Jensen Huang discussing China H20...
DeepSeek
Released MHC hyperconnection paper and conditional memory lookup paper; R1 model narrowed US-China gap; founders warn...
xAI
Raised $20B at $230B valuation from NVIDIA, Cisco, and sovereign wealth funds; working with US Department of Defense
Cerebras
Signed $10B deal with OpenAI for inference compute delivery 2025-2028; AI-specific chip platform; raising $1B at $22B...
CoreWeave
Amended credit agreements to bridge GPU delivery delays; lenders softened covenants; demonstrates confidence in data ...
Salesforce
Launched agentic Slack bot capable of finding information, drafting emails, scheduling meetings across enterprise pro...
LM Arena
Raised $150M Series A at $1.7B valuation; commercial AI evaluation service reached $30M annualized revenue by December
Alibaba
CEO Justin Lin warned of <20% probability Chinese labs leapfrog OpenAI/Anthropic; cited resource gap and compute cons...
Technology Innovation Institute (TII)
Released Falcon H1R-7B reasoning model with hybrid Transformer-Mamba architecture; outperforms larger models on math/...
TSMC
Packaging capacity is rate-limiting factor for NVIDIA H200/Blackwell production; fabrication not the constraint
Astrocade
Co-host Andrey Kurenkov's startup; mentioned as his current employer
Gladstone AI
Co-host Jeremie Harris's organization; focuses on AI national security policy
Microsoft
Mentioned as major player alongside Slack in enterprise messaging; Teams integration with Slack bot
AI21 Labs
Israeli LLM lab; NVIDIA in talks to acquire; early GPT-3 replicator with Jurassic models
UC Berkeley
Founded LM Arena (originally Chatbot Arena) as community evaluation platform; researchers involved in open-source dev...
People
Jensen Huang
NVIDIA CEO; stated China H200 orders will flow through purchase orders not declarations; discussed supply chain strategy
Elon Musk
xAI founder; denied initial $15B raise report; confirmed $20B funding round; known for aggressive fundraising
Jake Sullivan
Former Biden national security advisor; criticized Trump's removal of AI chip export controls; defended Biden-era pol...
Justin Lin
Alibaba CEO; warned of <20% probability Chinese labs catch up to US; cited compute resource gap as primary constraint
Dario Amodei
Anthropic CEO; mentioned in context of previous essays on inference-time compute scaling and policy implications
Quotes
"We're not expecting any press releases or any large declarations. There's just going to be purchase orders."
Jensen Huang (China H200 orders discussion)
"If you've used Cloud Code, you're almost used to now at one-shotting some pretty complex things like you know things that in the past might have taken me you know three hours to do to be honest it just knocks them out of the park"
Jeremie Harris (Cowork discussion)
"We're going to have to let these things rip in some sense and do big functions on your Mac and on increasingly sophisticated systems. How do we retain guardrails in that context?"
Jeremie Harris (Cowork safety discussion)
"The only thing preventing us from beating the Americans or whatever is access to chips. That's really the only thing."
DeepSeek founder (China AI competitiveness discussion)
"If you had $50 billion, you probably could train a pretty good model at this point."
Andrey Kurenkov (Open-source commoditization discussion)
Full Transcript
Hello, and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. And you can also head over to lastweekin.ai for our text newsletter with even more news. I am one of your regular co-hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade. And I'm your other co-host, Jeremie Harris. I don't know. I do stuff. I'm at Gladstone AI. That's the one. I do a lot of AI national security things. Yeah. And excited to be back on the podcast because this is like, I don't know, we're half a dozen episodes into the return, in a sense. And we missed last week. That was on me, travel, but we will not be missing weeks like that in general. We are going to be recording these. There was fortunately not that much that was going on last week. There was a DeepSeek paper that is worth paying attention to and that we'll talk about. There's a couple little things. Certainly, Cowork is a big deal, the Anthropic release, but yeah, not a huge week. So kind of forgiving. It doesn't always happen when we miss a week. Usually we get flooded. In this case, we can cover kind of both weeks in one episode, I think, without going into crazy overtime as we might otherwise. That's right. And yeah, this episode has got kind of like a real mix of stuff: some significant and minor updates to tools. Gemini also has some interesting updates. Business-wise, some new 10 billion, 20 billion kind of dollar deals, which are pretty interesting. Got a decent amount of open source compared to most weeks. And then, yeah, quite interesting papers in research and advancements, dealing partially with sort of this question of how do you scale up memory? How do you go the next step beyond what we've done in terms of learning? So pretty fun episode to come. And we'll go on and start with tools and apps, with Anthropic's new Cowork tool. So this is probably a big deal, as you said, and it's a big deal because Claude Code is a big deal. At this point it's almost like a joke in Silicon Valley that people are going crazy about Claude Code, and that's because it's quite powerful and just does a lot of work for you. And what people have observed is that Claude Code can do a lot more than just code. It can edit videos, it can compile spreadsheets, it can do all sorts of stuff where it just goes into your computer and does things that you ask it to do. And that's effectively what this is. This is Anthropic integrating Claude Code, but without sort of the coder-programmer interface of a terminal. You don't need to install it as a package or anything. It comes bundled in the Claude desktop app and is just its own little tab that you can switch to and then ask it to do stuff. And it goes on and interacts with the file system very much like Claude Code. So given that Claude Code has found many uses and many proponents, including myself, Cowork could similarly have a lot of fans to come. Yeah, absolutely.
I mean, if you've used Claude Code, you're almost used to now at one-shotting some pretty complex things, like, you know, things that in the past might have taken me three hours to do. To be honest, it just knocks them out of the park. So it is pretty wild, and now we are seeing that translate into this sort of desktop agent model, which is what this is. Cowork is kind of just an all-purpose aide, a new way to interact with your computer. So you can do things like point it at some messy downloads folder. I mean, I can speak for myself and say my downloads are just kind of a crap pile of random things I've downloaded. And you can say, hey, sort everything by file type and date, or by theme, or whatever. And it actually can look into a folder full of, you know, say, screenshots or whatever, and automatically build Excel spreadsheets, and basically just dive in, do a bunch of work that you might have an intern do or something, and then give you an output. So that is a pretty broad set of capabilities. There has been some conversation about the security side of things, I think it's important to note. So yes, it is a combination of local access to your computer, web access, and autonomy, which is almost the lethal trifecta of three things that people talk about when they think about loss-of-control scenarios. One important thing, though, and this is often lost in the noise: Anthropic is actually using sandboxed virtual machines to contain these systems. They're not letting Claude run wild on your actual Mac. They're running it in a digital container. So this is kind of a new gold standard in 2026 in terms of what security looks like for AI agent safety. People are saying, okay, well, for competitive reasons, we're going to have to let these things rip in some sense and do big functions on your Mac and on increasingly sophisticated systems. How do we retain guardrails in that context? Anthropic has always been in this interesting position where they have historically oriented towards, well, we don't want to release a true frontier capability because we don't want to make the racing dynamics worse. That's clearly no longer the philosophy. But instead, there's this view that, okay, let's at least shape the trajectory of the technology. And so you see them doing that with their responsible scaling policies. Other labs do that too. But they are explicitly trying to come up with frameworks and precedents, like this whole idea of having the model run in a digital container, that put pressure on other labs that are going to fast-follow to do the same thing. So this is kind of, at the margins, how Anthropic is spending some of its safety budget, let's say, which is an interesting play. And by the way, this is a pretty interesting price point too. So they're looking at $100 to $200 a month when this comes out, for the Claude Max tier anyway. So, you know, when we were talking about OpenAI potentially releasing GPT-5 back in the day, it was like, oh, this could be $20,000 a month or something like that. We're now hitting genuinely hundreds of dollars a month, thousands of dollars a year. So this is pretty interesting. It's quite a price point. Right. That's the same pricing that they have had for Claude Code.
And at those kinds of higher price tiers, you unlock a lot of tokens, a very high amount of usage, which, when you use these sort of agentic, almost assistant-like tools, you do wind up using. And I know personally, I'm on the Max plan. So I would imagine a lot of people are actually paying those $100, $200 price tags. One more thing about the safety: I wonder if this also kind of implies a vote of confidence by Anthropic on the alignment side, where with this kind of tool, even forgetting potential future amplifications, like, in the present, you could ask it to go hack someone or go write spam, or all sorts of very boring but real misuses. And I think we are at a point in alignment and safety, compared to a few years ago, where the frontier labs may be more comfortable believing that their agents are not going to go off and do things they're not supposed to, such as in this case. Yeah, it's an interesting question. I mean, certainly in the short or immediate term, like with this model, they're comfortable having it run in this context, right? With the reputational risk that comes with that and everything else. It's also worth noting, too, Anthropic is selling mostly to corporations, right? Their B2B work is the most significant product line, and they actually do dominate in that vertical right now. So that's a high risk, right? If you start having failures in a B2B context, it can affect a lot of people, and the stakes are high. But yeah, when you talk to people about the short-term alignment side, getting agents to do what they're meant to do, you do get quite a bit of confidence on this. The long-term picture on the superalignment side remains actually quite pessimistic. And what I've heard from folks at Anthropic, talking to people at all the labs, one of the interesting differences is that Anthropic seems to believe in short AI timelines more, I would say, on average, than most, and therefore to be more concerned about superalignment, because there hasn't really been much concrete progress in that direction. So it's this interesting thing where in the short term we kind of go, oh yeah, these agents, we can keep them contained, we can release this, but then there remains that question mark. And I'm curious if that distinction ends up getting blurred in the future as we start to get more confused about what counts as what. But yeah, for sure. And by the way, so on the price point, one thing to note too is this is kind of a shift for Anthropic, sort of like what they did with Claude Code, where they're not just selling intelligence. What they're really doing here is selling labor. That's where these price points are coming from. We're starting to get into the thousands of dollars a year, low thousands, no question. But this is starting to look more like, like I said, hey, intern, go do this thing, than, hey, autocomplete my code and give me the next few functions or whatever. This is a really big conceptual shift in the landscape, and I think that is reflected in those price points. It's hard to justify them otherwise. Next up, we have some news about Gemini. They're introducing a feature called personal intelligence, which would connect to Gmail, Google Photos, search, and YouTube histories for users of those, and be able to reason about that information when chatting with you. So a very common-sense application or extension of Gemini by Google. Apparently, Google does acknowledge the potential for inaccurate responses or over-personalization and is going to be addressing those problems.
It's also an opt-in feature, so you can connect and disconnect different apps. And Google is implementing some guardrails as well for sensitive topics. So yeah, definitely the kind of thing where you could see some funny, unintended AI knowledge access. And I think Google might've learned from their embarrassing episodes in 2023 or 2024 to avoid the kind of blunders that are avoidable. Yeah, this is now out in beta, and it is usable by Google AI Pro and Ultra subscribers. Yeah, I like the phrase over-personalization, as a soft way of expressing something. I'm actually quite curious what specifically is going to be meant by that. If I'm Googling for lobsters and dresses, then Gemini should just not worry about why. That's right. That's right. Yeah. I mean, how many times have we Googled for lobsters and dresses? I mean, it's just constant. Absolutely. And one of the interesting things too is there's been this shift. Like, we used to talk about this. I remember years ago, when we were talking about, I guess, GPT-4 and all that stuff, we were talking about the advantage that OpenAI enjoyed relative to Google, because that was the main axis at that time, in terms of OpenAI being perceived to be a new player. So it's like, they release a shitty thing and it helps people make bombs or it does whatever, and everybody kind of goes, eh, yeah, whatever. It's OpenAI. They're just starting up here. Whereas Google, if they release something, everyone goes, whoa, Google, like, what the fuck, guys? And it's kind of changed now, where OpenAI is actually large enough that they can't quite get away with the same stuff that they might've been able to pull off, say, two years ago, or three, or even one, for that matter. And so one of the ways that's expressed, too, is on the ad side. Like, their opportunity to experiment with ads that would probably be crappy at the beginning has kind of passed. And so here Google has kind of been going the other direction, saying, hey, you know what? We are actually playing catch-up, and that comes with a license to throw some wilder punches. And I feel like we're starting to see that a little bit, the sort of willingness to experiment and just try things out, with all this kind of couched language. Like, hey, you know, there may be some over-personalization. There may be this or maybe that. But certainly Google is now shipping, which is a big shift. We'll see if that continues. But institutionally, they feel like a different company in this space now. And speaking of Google, we've got another story related to them, this time about their overviews in Google searches. They're removing some AI-related health summaries after an investigation found dangerous flaws in those responses. So Google has disabled specific queries, like "what is the normal range for liver blood tests," after experts flagged them as dangerous. They didn't kind of do that across the board; there are still some responses where it would do that. So yeah, one of these cases where I guess it could have been predicted that having a chatbot summarize some information inaccurately could be problematic, and good that this was caught. Yeah. I mean, at a certain point, this is easy to say, by the way, because most people just don't have the time or the capacity to do this, or the knowledge base to do this. But the way to use these tools is obviously you do a search, you get whatever result, and then if it's high stakes, you look it up. You actually make sure you find the ground truth.
The report talked about this kind of critical error on pancreatic cancer searches. The suggestion was that patients should avoid high-fat foods. Apparently that contradicts standard medical guidance, where you want to maintain your weight, and it could be a serious issue. And I don't want to get in the business of saying, ah, who cares, so what, here. But at a certain point, we do face this question of: we're either going to surface these recommendations or we're not. And the question there is where the burden lies in terms of validating the factuality of some of these statements. I don't know what the right answer is here, but it does seem like with the option to just say, okay, well, therefore let's put pressure on companies like Google to just never surface this, we're not seeing the other side of the coin: how many lives are saved by the good recommendations that actually help? I don't know that number. And until we do, it kind of feels like it's the self-driving car thing all over again. We can look at the awful crashes, but if we're not looking at the lives saved, it's just really tough to tell. And this reminds me, recently I was chatting with some people about the topic of AI overviews and how it sort of feels like, almost under the radar, AI overviews just became Google. Like, I cannot count the number of times that I'm just asking Google a question, which used to be a ChatGPT thing. People were saying back in 2023, you know, Google is in peril. Google might die out because ChatGPT would replace it as the go-to place for search. And it took Google a bit of time, and when AI overviews initially rolled out, people were finding all these jokes about putting glue on a pizza, I think, or how many rocks you should eat per day. But I know in my case, and for some people I've seen, there's now just this learned behavior where, if you have a question, you Google it and you look at the AI overview, and that's just standard. So I've sort of been reflecting on and noticing that. Yeah. People were always talking about how Google has 90% market share on search; what they tended not to focus on was that OpenAI had like 100% market share on the chat market. And now Google is stepping in on that, and Anthropic is stepping in. These are massively growing spaces, right? So the pie is growing fast enough that there's plenty there for everybody, but it is interesting. You're right. It's not as simple as that just-search-market story. Yeah. It's now the case that, I guess, some uses of ChatGPT have been overtaken by AI overviews, not even Gemini, just AI built into Google search, which in itself is pretty interesting. We've got one more story about Google. Gemini is expanding within Gmail. So beyond the basic features that have existed, like summarizing emails, it can now do some more useful stuff. You can ask questions about emails. If you've got some of the subscription tiers, there's also proofread, which would offer grammar and style suggestions. We've got AI inbox, which would filter emails to highlight important messages and tasks, help me write, suggested replies, all these features. So they're still kind of not integrating it full-on as an agent or anything like that, but kind of adding it here and there in various ways, which to me, yeah, seems pretty intuitive. Yeah. Some of these look like actually really interesting lifestyle improvements.
They give this example of, instead of typing in your inbox search or whatever, if you're looking for a plumber who gave you some quote, right, they're like, you could just type in, who is the plumber that gave me a quote for the bathroom renovation last year? And that actually would solve an awful lot of my inbox search problems personally. So it seems like a good quality-of-life thing. Yeah, this whole idea of the AI inbox, where they're going to filter the clutter so you can focus on what's most important, that seems a bit riskier to me, right? Because at that point, knowing what's most important is very, very context-laden as a thing. So that seems like something that we should keep tabs on to see what the actual vibe check is and what the failure modes are. But yeah, it is interesting. Again, they're leaning out. They're taking this big swing. And just one last story in the section. We've got: Slackbot is an AI agent now. So Salesforce has launched this new AI-powered version of Slackbot, of course built into Slack, which I think still is one of the dominant messaging platforms. And as you can imagine... Dude, tell me Slack's in trouble without telling me Slack's in trouble. I don't know much about the market. I think it's still Microsoft and Slack as far as the big players. That's true, yeah. Certainly kind of a big deal. And this new agentic Slackbot would be capable of finding information, drafting emails, and scheduling meetings within Slack. It would also interact with other enterprise products like Microsoft Teams and Google Drive. This was announced a little while ago but is now being rolled out, and the Salesforce CTO has described it as a super agent. Yeah, I mean, I think this is another aspect. We've seen AI get integrated as these little kind of question-answer summarizers. I think the next step probably is all of the business apps, Notion, Slack, you name it, will have agents built in. It's interesting for anyone who was around in the kind of 2020 era, when people were really getting riled up about what the future of post-human labor would look like, or post-human markets. They often would talk about how it'll all be AI agents chatting with each other, and it was difficult to imagine how that would actually start happening. But now when you see it actually happening, whether it's Salesforce or Slack or wherever else, it's like, for a larger and larger fraction of my work, I'll be outsourcing it to these agents, and eventually it's going to go all the way. Everyone is going to become a manager. That's right. Manager or a paperclip. Those are your two choices. And on to applications and business. First up, Anthropic raising money. They are getting $10 billion at a $350 billion valuation. This is still not fully signed, but it sounds like, based on reporting, that it's more or less being finalized. It sounds like GIC, Singapore's sovereign wealth fund, and Coatue Management plan to lead the new financing. I guess we're not done with these mega deals yet. Yeah, no, absolutely. It's notable too. $350 billion. I am old enough to remember the old, old times of September 2025. What is that? Four months ago, when Anthropic was only worth $183 billion. So they doubled their valuation just in that time. Now, when you get into this territory of raising $10 billion plus, we've talked about this a lot, you're in the territory of sovereign wealth funds. There's just no other place to get that kind of capital.
Maybe SoftBank, but they're pretty tied up right now with OpenAI. So this is really the end of the road. After this, you IPO and you access the deep capital markets of the United States, but that's basically it. So that is the plan, by the way. Anthropic is expecting to break even by 2028, which is pretty soon. This suggests they could reach profitability actually faster than OpenAI, which, again, we've talked about before, but it's not a surprise given that Anthropic is dominating in that B2B segment. There's much more profit per token associated with that work. So you might actually expect that break-even to happen. But the revenue growth has also been really crazy. I mean, they went from about a billion dollars at the start of 2025 to, less than a year later, 5 billion. They 5x'd their revenue in that time. Pretty wild. They are looking at an IPO, preparing for that as early as late 2026. So they've got Wilson Sonsini, which is a big, big, famous law firm for tech company IPOs, and they're starting to work on the corporate restructuring that they need to do that. So this will be one of the big stories of 2026 if we get there. The Anthropic IPO, possibly the OpenAI IPO, there's a lot coming down the pipe. And speaking of mega financing rounds, xAI has raised $20 billion from NVIDIA, Cisco, and other investors. The funding would value xAI at approximately $230 billion. So a pretty impressive raise. It sounded like there was actually a lot of demand to get into the round. xAI hasn't had something like Claude Code, but they are working with the Department of Defense in the US. So I guess that could be helping with the Optimus upside. Yeah, I think, I'm trying to remember, the Department of War now. I guess the deal that they had was for like $100 million for a couple, I think it was a couple of different labs. I'm not sure, now I'm trying to remember. But certainly there is that partnership. I think it'll be a relatively small fraction of their revenues, of course, but strategically interesting and important. Elon famously said that the story of them raising an initial $15 billion investment was false. He's like, this is not true. And presumably now we're learning that it's because it's going to be a $20 billion investment. So technically true that it wasn't false. What? Technically true that it wasn't true? Yeah. Anyway, directionally accurate. The other investors here, by the way, do include the Qatar Investment Authority and Abu Dhabi's MGX, right? So you are, again, back into that whole sovereign wealth fund adjacent territory. It's just a lot of money. I mean, the big play is going to be on, obviously, the data set that X has, right? xAI now kind of has access to all the X data and the Tesla data. There's all these interesting integrations, especially now that we're talking about this world model stuff; self-driving car data starts to look really interesting. So anyway, not surprising, of course, with Elon at the helm, that they're pulling off these wild fundraises. And just a fun anecdote about Grok real quick. Just a couple days ago, we needed to test our moderation system. And can you guess how we generated the inputs to test moderation? Grok was very capable of offering some very spicy things that, it's true, other chatbots would probably not have been capable of; they are in fact a little bit sensitive when it comes to moderation, relative to maybe newer models. And of course, we can't go through this section without at least one story about NVIDIA.
This time it's about NVIDIA needing, quote, a supply chain miracle from TSMC, as China's H200 AI chip orders outpace supply. So apparently there are as many as 2 million orders for H200s coming from China, while the current inventory is only 700,000. So that's a big gap. And that's despite the average selling price of one of these chips being estimated at $27,000. So this is a lot of potential money that NVIDIA would be leaving on the table if they're unable to actually just create and sell these chips. Yeah, and that means they're spinning up, as Jensen said, their H200 supply chains, right? Like, they're bringing them back to life. They had been all focused on Blackwell because, well, that's just the better chip. But here they're rotating back to the H200 because this China sale thing is going through. One of the most important things to keep in mind on the supply chain as they try to do this: when you go to make the H200, you're using TSMC's four-nanometer node, and that can be produced both in Taiwan and in the US. So you've got tons of production capacity there. The issue in meeting this demand is not the ability to fabricate that logic. The issue is actually the packaging, basically this process where you take the logic and the memory and you put it all on one kind of coherent chip; that's CoWoS. And CoWoS packaging is basically being used across the board for Hopper, for Blackwell, and for Blackwell Ultra. So now that they're saying, okay, well, we want a ton of Hopper, it's like, sure, you're using a different node to fab it. So that's great. You can do that in parallel, but you're relying on the same finite pool of packaging. And so that's really what's becoming the rate limiter here. It was clear that packaging was going to be the rate limiter for some time. It has been an issue, but now even more so, because you're pulling down a bunch of H20s, or H200s rather, to ship them basically to China. So this is a sense in which, by the way, this idea of exporting advanced chips to China actually hurts American companies directly, because you're hitting the packaging part of the supply chain. So sure, TSMC can fab the chips, or the logic dies, but there are going to be fewer Blackwell chips if the same packaging process is used for both. So very complex supply chain, a lot of interactions that are not necessarily obvious when you first think about, hey, let's lift the ban on exporting these things. Also, the H200 is about six times more powerful compared to the H20 for training workloads, and that's one of the reasons that China's AI industry is rushing to place these orders. So there you have it. I mean, this is a very complex story and it is an interesting consequence of the China ban lift. Right. It's coming at a time when there's also been a complex history of export controls, where last year the Trump administration sort of flip-flopped a bit, but then eventually basically let NVIDIA sell to China. Now China, the government there, seems inclined to maybe start discouraging buying these chips, from what I've seen, but it still isn't clear. So I would imagine at least part of the story for why there's a rush to buy these up is because it's very uncertain if it's going to keep being possible. Absolutely. And from one administration in the U.S. to the next as well. It's also unclear, you know, if Congress flips to Democrat in 2026, what new laws could come in that make export controls harder, or sorry, that make exporting harder.
But yeah, and then in terms of the shipping to China, you're right. The Chinese have come out and said, hey, you know, we're not so sure we want these chips now. It's kind of ambiguous. So Jensen had a press conference, I think it was a press conference, something. He made some statement where he was like, look, here's the deal. There's not going to be a splashy announcement from China saying, yes, we're open for business, we've decided we want the chips. Instead, it's going to come down to the purchase orders. There's going to be purchase orders that suddenly come from Alibaba and from Huawei and from everybody else, and it'll all be done discreetly, but the GPUs will flow. That's how we'll know that China is actually open for business. And I mean, frankly, I fully expect them to, even though I know there's been a bunch of questioning down that line. I would be shocked, and we can revisit this in a future episode, but my money's on: if the Trump administration allows those to ship, the H200s will ship. And we've got a couple more stories on chips and compute. The next one is about OpenAI signing a deal worth $10 billion for compute from Cerebras. So this is a multi-year agreement where Cerebras is going to deliver 750 megawatts of compute, starting this year through 2028. So this is kind of an extended deal, worth $10 billion accrued over time. I think it's an interesting development, where we've seen OpenAI continually trying to get more compute, diversifying the sources of compute. Cerebras is an interesting player in the space, where they have this AI-specific chip system that can have a high throughput, specifically for inference. So different from NVIDIA GPUs. They have been around for quite a while, and it seems like Cerebras now is getting to a point where there's a lot of demand for these chips in data centers. Yeah, and this is really about inference, right? So Cerebras is an inference platform. That's what this is going to be used for. OpenAI came out and said that this is just basically going to be about decreasing latency for certain customers. And OpenAI has a strategy, the way they described it here, to build a resilient portfolio that matches the right systems to the right workloads. In other words, there are workloads that Cerebras specializes in, and there are going to be workloads for all kinds of other players in this space. You could think here of, like, FluidStack or, anyway, any other entities that have different specializations and different kinds of workloads, and they're going to try to ship them to the right providers. And that's really an all-ships-rise situation. There's inference, there's training, there's weird mixes of the two. You know, that's what it's all about. So also, Cerebras, by the way, has been pushing back their IPO a lot. They first filed for it in 2024, but there've been a bunch of controversies and challenges. So they've been raising on the private market since then quite a bit. Apparently, they're in talks to raise another billion dollars at a $22 billion valuation. So they're sort of, I don't want to say limping towards that IPO, because these are big strides, but it's been a bit of a stutter-step to that goal. Yeah, I think with NVIDIA all but acquiring Groq, it would have been a very fun story if AMD all but acquired Cerebras, but it looks like that's probably not happening. And on to more of a cloud story: CoreWeave is amending its credit agreements.
So this is a part of cloud computing, and it seems to be modifying these agreements to have more liquidity. So I'm actually going to let you take over, Jeremie, because it's a bit technical. Yeah, well, so this is one of the classic challenges that happens with a lot of these big builds. Basically, CoreWeave ordered a bunch of hardware, like GPUs, like billions and billions of dollars' worth. Probably it's Blackwells, just given the stage that we're at. And the problem is that they have been delayed. So their arrival has been delayed. CoreWeave paid the money for those things, so they're out of pocket, but they need those GPUs to be able to pay back the loans and all that stuff. So what's happened here is they're basically saying, look, we need this liquidity bridge because of this delay. And this means that they need to turn back to their existing investors, sorry, lenders, I should say, and say, hey, we've got to rework our terms here. Because when lenders give money, in this case, they gave like $2.6 billion to CoreWeave to finance all this stuff, they don't just give $2.6 billion and walk away; they set these things called covenants. And these are rules that the company has to follow to prove that it's healthy. And so the amendment that they've just made to those covenants has a couple of different components. One is there's a minimum liquidity, a minimum amount of cash on hand that CoreWeave had to keep, that was lowered to $100 million. And that gives them more breathing room to spend cash on data center builds instead of just letting cash sit idle in a bank account to satisfy some requirement. But another key one here is that they are actually postponing the testing of their debt service coverage ratio. This is basically just, like, how much their operating profits can cover their interest payments, that ratio. There's a moment when you trigger a test of that, basically checking your profit versus your interest, and they've pushed it back to late 2027. So you can see how this is all about softening the pressure on CoreWeave, because fundamentally what the lenders are saying is, look, we believe you'll be able to make this money back, no problem. We want to let you fight another day. So we're just going to soften these things. We have confidence in the commercials. It's the underlying supply chain that's the issue here, and this we expect to be resolved. So there's a whole bunch of other stuff in here, but it's basically that theme. They're being allowed to decrease the proof points that they have to show, decrease the amount of cash on hand that they have, and all that. So really important for CoreWeave. And it does show that, at least as far as these lenders are concerned, that market seems pretty healthy, or at least CoreWeave's position seems pretty healthy. And on to the last story. LM Arena is now valued at $1.7 billion after raising $150 million in a Series A funding round. That's pretty soon after previous fundraising. They had a $100 million seed round in May. That was at a $600 million valuation. They launched their commercial AI evaluations service back in September and apparently already reached an annualized revenue rate of $30 million by December. LM Arena started out as just a kind of community-run evaluation platform, was originally called Chatbot Arena, founded by UC Berkeley researchers, and initially was funded through grants and donations. So it's kind of an interesting story of how it developed and how it became apparently a very valuable product for companies.
Yeah. I did not see this coming. No, absolutely. And then you look at the investors who are participating in this round; this is pretty wild. So it is a Series A. It's Andreessen Horowitz, Kleiner Perkins, which, like, you know, old school, but still very high-brow, Lightspeed Venture Partners. There's a bunch of others, but these are a lot of the who's who in the Valley. So a damn solid fundraise. And on to projects and open source, where we've got a few interesting projects and open source releases, starting with Nemotron Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models. So this is a framework that has scaled this notion of cascaded reinforcement learning across multiple domains to develop these general-purpose reasoning models. This is a really interesting paper from the standpoint of a key problem that's been persistent in the space for a long time, which is catastrophic forgetting, right? So traditionally, when you train an LLM, you pre-train it, let's say, with data from a whole bunch of topics. If you continue your training and you cause it to specialize in a domain like math, code, or general chat or whatever, it will start to forget the things that it's learned about other topics. So this causes people to try to choose between: should I make a specialist model? Should I train sort of topic by topic so that it masters those topics? Or should I go broad? And at what stage should I do what? And what they're doing here is they're saying, well, look, A, you should be doing this with reinforcement learning, and instead of blending a bunch of data from multiple domains like math and code and general chat at the same time in reinforcement learning, what you should do is sequential domain-wise training. Just code, just math, just general chat, just whatever, but with reinforcement learning. And that's really important. The reinforcement learning process is one that prevents catastrophic forgetting, it turns out. It seems like using reinforcement learning from human feedback in particular, as a pre-step, can set up the model's foundational reasoning abilities, and it gives you this robust base that can help you go domain-specific down the line without collapsing. Reinforcement learning is quite interesting, like, where you should use it versus where you should use supervised fine-tuning or, you know, anyway, the standard sort of autoregressive training. The autoregressive training seems to be really where you get into the catastrophic forgetting thing if you do this approach, whereas what happens here is they do have this RLHF kind of base alignment to get models to learn general reasoning. And then when you go domain-specific, one of the big advantages is, if you focus just on math, like, math often has a binary reward, correct or incorrect, and it's usually really fast to compute. So it's got a specific profile of how the data flows through the system, the sources of training instability; all of that is pretty unique to math. And if you move to code, the reward might change. It might require a sandboxed execution environment, and it might have higher latency. So you're trying to mash together math, code, and software engineering, which is extra noisy because your code might be partly correct: it failed one test, but not others.
And so trying to mash all these together when you do your reinforcement learning step can cause all kinds of training instability, because the way data flows through your system just has to be a little different for each of those. And so what they're doing, by going through one domain after another after another using reinforcement learning, is that the hyperparameters, like learning rate or batch size, can all be tuned specifically for that domain's response lengths and the sparsity of rewards and all kinds of stuff. You can even do reward shaping to make sure that you're tailoring rewards more closely to what should be done in that space. And so this is really how they do it, right? They start with this base alignment step, where they're doing the general conversational domain, and again, through RLHF, reinforcement learning from human feedback, the goal here is just to make sure the model is helpful and can follow general instructions. And then they might move on to a bunch of RL training on math using verifiable rewards, and then a bunch of RL training on coding, then a bunch of RL training on software engineering, and so on. It's this interesting discovery that prevents catastrophic forgetting, again, because you're using RL and because you're starting off with a model that you've trained to do general-purpose reasoning. So there's a bunch of extra stuff here, but yeah, I think this is actually quite an interesting process paper as we think about where RL can add value and where it doesn't. Yeah, we have a fun one-diagram summary, essentially, of this whole process, which people are now starting to call mid-training or post-training, and it seems to be starting to get a little figured out, with this set of stages: you do supervised training, then RLHF, and a few different variants of RL that are domain-specific, still primarily in verifiable-domain land. This one actually released the paper and the model kind of a month ago, so it's not super fresh, but I don't think we covered it. And it is fairly notable for this size range of 8 billion, 14 billion. These models are very performant and fully open-sourced by NVIDIA, as well as the training recipes and data, and the report itself, the paper, is very detailed. It's like dozens of pages. So nice to see here from the US. It's sort of similar to a DeepSeek-type technical report, where you see a lot of integrity, a lot of the stuff that would otherwise be figured out within frontier labs but never shared publicly. I think this is giving us a hint of what the frontier labs are probably figuring out. Yeah. And it's also, you know, we talked about this, I think, with the last NVIDIA release, but they're very interested in promoting the open source market, because everybody doing open source is using NVIDIA hardware. Whereas increasingly you're seeing with closed source, I mean, we've talked about everything from Groq, which now is NVIDIA, of course, but Groq used to be, to certainly TPUs and, you know, Trainium 2 and Trainium 3, and all these other platforms; everybody's getting their own chip. But the open source landscape is dominated by NVIDIA. So they have a very strong vested interest in pushing that. The other thing, too, is, from an intuition standpoint, why RL over supervised fine-tuning? This is something that has been known in this space for a long time, but it's worth saying explicitly in this context, because this is really a clear test of it.
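To make the cascaded recipe concrete, here is a minimal sketch of that training order: RLHF base alignment first, then one RL stage per domain, each with its own reward plumbing and hyperparameters. Every function name, stub, and number here is hypothetical, illustrating the idea rather than reproducing the Nemotron Cascade paper's code:

```python
# Hypothetical sketch of cascaded, domain-wise RL. All names, stubs, and
# hyperparameters are illustrative; this is not the paper's actual code.

def rlhf_align(model):
    """Stage 0: general conversational RLHF to build the reasoning base."""
    return model  # stub

def rl_train(model, domain, reward_fn, lr, batch_size):
    """One RL stage specialized to a single domain, with hyperparameters
    tuned to that domain's reward latency, sparsity, and response lengths."""
    print(f"RL on {domain}: lr={lr}, batch_size={batch_size}")
    return model  # stub

def verify_math(answer):
    return 1.0  # binary, cheap-to-compute verifiable reward (stub)

def run_unit_tests(code):
    return 0.0  # sandboxed execution reward: slower, noisier (stub)

model = rlhf_align("pretrained-model")
for domain, reward_fn, lr, batch_size in [
    ("math", verify_math, 1e-6, 256),            # fast binary rewards
    ("code", run_unit_tests, 5e-7, 64),          # high-latency sandbox rewards
    ("software-eng", run_unit_tests, 5e-7, 32),  # partial-credit, noisy
]:
    # Sequential, not blended: each domain gets its own stable RL setup.
    model = rl_train(model, domain, reward_fn, lr, batch_size)
```

The point of the structure is that nothing forces math's fast binary rewards and code's slow sandboxed rewards to share one training configuration.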
When you look at supervised fine-tuning, you are training a model to imitate, right? It's trying to minimize the difference between its output and some kind of target answer. And so if you move from a math dataset to a coding dataset, the model is forced to adopt new token patterns, and those can overwrite the patterns that it learned from math. And that's where catastrophic forgetting happens. Whereas when you look at cascaded RL, or reinforcement learning from verifiable rewards specifically, the model is not just mimicking tokens. That's not what the loss function is doing. It's exploring its own reasoning paths to reach a verifiable goal, like a correct math answer or some kind of executable code. And what that does is reinforce the underlying reasoning capabilities instead of just the surface-level patterns, like what the model sounds like; all of that is ditched. And so you get more durable skills out of it. There's a place for each of these, of course, but this is one of the really important reasons they're flagging here for the fact that you don't get catastrophic forgetting from RL in the same way. On to more of a paper, coming from DeepSeek, titled MHC: Manifold-Constrained Hyper-Connections. So this one actually got a decent amount of play on Twitter, despite being quite mathematical and deep. The gist of it, at a high level: there's this notion of residual connections, which is very standard in neural networks. Basically, you don't just go layer by layer. You pass forward some information from an earlier layer to a later layer without processing it through the intermediate layers, or in addition to that. And that turns out to help a lot with training. Now, there was this notion of hyper-connections, which are essentially fancier residual streams. They do some computation within that connection to improve its benefit. But that turned out to make training a bit trickier. It introduced some training instability. So the manifold constraint here is DeepSeek suggesting a way to do this hyper-connections trick, which is pretty new, also from 2024, while preserving the ease of training. And specifically, I'm quoting from the paper: "MHC utilizes the Sinkhorn-Knopp algorithm to entropically project H_res onto the Birkhoff polytope," which I have no idea what that means, but I assume it means... You don't know what the Birkhoff polytope is? Come on, Andrey. I thought you were a Stanford guy. No, I'm one of these people who just put together neural nets and wrote the code. I never learned the math of Riemannian manifolds or whatever this is, but it seems to yield some pretty significant improvements in terms of large-scale training. Yeah. Yeah. It's actually conceptually pretty fascinating. The whole idea with residual layers, we talk about these a lot in Transformers: you basically take the output of the previous layer, and as you pass it forward to the next layer through the residual, what you're going to do is you have some input to the current layer, right? Call it x. You're going to chew on that input, using your layer to produce some output. And then the thing you're actually going to pass on to the next layer in the residual is not just the output, the chewed-up version of the initial input, but the output plus the initial input. In other words, you're going to try to give the initial input a little bit more influence.
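In code, that residual pass-through is just one addition. A generic sketch, not DeepSeek's implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Generic residual wrapper: the block's output is the layer's output
    plus the untouched input, so earlier information keeps flowing forward."""
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.layer(x)  # "chewed" output plus the original input

block = ResidualBlock(nn.Linear(64, 64))
y = block(torch.randn(8, 64))  # same shape out, input signal preserved
```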
Like, it's not just that you're going to take an input and spit out an output. You're going to take an input, spit out an output, but then add it to the input again, so that the input gets to be represented again in what you pass on down the line. And in a way, this kind of creates a sort of momentum in favor of preserving the information from previous layers. Eventually, yes, if you make the transformer too deep, you will kind of forget and lose that information, but this is meant to give you a way of preserving that information flow, so that the information content of previous layers is propagated forward. Helps with stability, helps with all kinds of things. Now, the challenge is exactly what I just said. If you make this model too deep, or, yeah, in many cases, you can find that if the output of a particular layer is just too, let's say, loud, it will just kind of take over and overwrite and functionally erase, wash out, the information from the previous layers. And so the solution that a lot of people have been using, and that they kind of came up with, is: okay, well, instead of having one, think of it as like a notebook or something, where every layer just kind of adds findings to the notebook, to a page. It doesn't erase the old text. We're actually going to keep it there. We're just going to add a few more notes in the margins and pass it along. Well, what's going to happen here is we're going to say, okay, this seems to cause us to forget that initial text too often. So what we're going to do is use many different notebooks. Basically, we're going to have a bunch of different residual lanes and use each one to store a little bit of different information. So maybe lane one of the residual stream is reserved primarily for the raw embedding that you initially got. You're preserving that information all the way down, keeping it clean to make sure it's available for any future layer to look at. So this is the pure, raw, initial input embedding, whereas maybe the other lanes are more like scratch pads where different layers can dump in their outputs without muddying the original pristine signal. Now, the problem this creates is that at some point, you are going to have to mix all of those lanes back together to get one output that you can pass on to the next layer. That mixing has to be done with a matrix because, I mean, that's what you need to mix vectors like this. And the problem with doing that is that if the numbers in that matrix are even slightly larger than one on average, then the signal is going to get amplified at every layer, basically because you're reusing this matrix to multiply, multiply, multiply, and you get this explosion. And likewise, if the numbers are slightly smaller than one, the signal kind of fades away. And so this is what the paper deals with. It's this MHC solution. And what they're doing is they're taking this matrix that mixes these many different lanes in the residual stream, and they're forcing it to be doubly stochastic. All this means is every row has to sum to one and every column has to sum to one. This guarantees that, mathematically, what's called the total energy of the signal across all the lanes is constant. But basically, it just means that you're not going to gradually compound and blow up the information that's being sent down the line, or make it disappear.
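For intuition, here is a minimal, generic sketch of a Sinkhorn-style projection: alternately normalize rows and columns in log space until the matrix is approximately doubly stochastic. This is just the textbook iteration, not DeepSeek's optimized kernel:

```python
import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Push an n x n score matrix toward the Birkhoff polytope (the set of
    doubly stochastic matrices) by alternating row and column normalization."""
    log_m = logits
    for _ in range(n_iters):
        log_m = log_m - torch.logsumexp(log_m, dim=1, keepdim=True)  # rows sum to 1
        log_m = log_m - torch.logsumexp(log_m, dim=0, keepdim=True)  # cols sum to 1
    return log_m.exp()

mix = sinkhorn(torch.randn(4, 4))
print(mix.sum(dim=0), mix.sum(dim=1))  # both approach 1.0: no blow-up, no fade-out
```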
And this whole Sinkhorn-Knopp algorithm is just the way that they efficiently make this matrix doubly stochastic. The details really don't matter. They are interesting, but they don't really matter. The bottom line is, this is, again, DeepSeek going deep into a very narrow, technical, mathematical thing that really matters for implementation but that most people don't care about, and then solving this very fundamental and interesting problem in a fundamental and interesting way. They even wrote a custom kernel, by the way, to optimize the crap out of the Sinkhorn-Knopp algorithm in this context, which is, again, classic DeepSeek doing the hardware-aware thing. Back to a model: we've got a technical report for iQuest Coder V1, which, as it sounds like, is a model that is specialized for coding. Actually, it's a family of models at different scales: 7 billion, 14 billion, 40 billion, and 40 billion dash loop. And the gist of what they did to get a very nice coding model that is competitive with, or roughly similar to, a lot of the other good coding models, Sonnet 4.5, Kimi K2, etc., is a fairly complex training pipeline. So they have pre-training, of course, focusing just on training on code with the standard transformer model. Then they have mid-training, where they start to get a little bit more agentic, addressing different tasks. Then they've got post-training for thinking and for instruction following. And they essentially, in this report, just detail how they set this whole training regime up, what the data mix is, et cetera, and wind up being able to get a model that is fairly competitive, despite being smaller, presumably, than Sonnet 4.5 and GPT-5.1. Yeah, seems quite good. Yeah. And there's this funny kind of weird thing that they're doing with this 40B loop variant that I actually think is really interesting. They have this multi-pass workflow. So they feed the same input through the same weights twice, but they're trying to do something pretty different each time. So, you know, the first time they feed the input in, the tokens get processed through all the layers, and as that happens, the model is populating what's called a global key-value cache. It's not actually going to try to decode an answer at this stage. All it's doing is building a latent representation of the big picture of what is meant in this input. And in this sense, it kind of reminds me a little bit of an encoder-decoder architecture. So you have this sort of encoder phase, where the first pass is, let's just get the big picture of what's going on here. The output of that is a set of hidden states, as you'd expect, and then also the KV cache, this global KV cache, which is, again, at every layer, there's a global KV cache that represents that layer's understanding of the global meaning of that piece of the picture. And then in phase two, the tokens get passed through the same weights, the same model, for a second time. And then what happens is the model looks back at the KV values, the sort of global context for each layer that was stored from the first pass. And this happens layer by layer. And at the same time, during the second pass, the model is using local attention to just do causal attention on the tokens that are being generated in the second pass.
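Here's a toy sketch of that two-pass pattern: the same weights run twice, with pass one only filling a per-layer global memory that pass two reads and fuses with local attention via a learned gate. This is a hypothetical reconstruction from the description above, not the model's actual code:

```python
import torch
import torch.nn as nn

class LoopLayer(nn.Module):
    """Toy two-pass 'loop' layer: pass 1 builds a global memory, pass 2
    reuses the SAME weights and fuses global and local views per token."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, 2, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, x, global_mem=None):
        local, _ = self.attn(x, x, x)  # local self-attention (toy, no causal mask)
        if global_mem is None:
            return local, local.detach()  # pass 1: output doubles as the cache
        glob, _ = self.attn(x, global_mem, global_mem)  # read the pass-1 cache
        g = torch.sigmoid(self.gate(torch.cat([local, glob], dim=-1)))
        return g * glob + (1 - g) * local, None  # per-token learned fusion

layer = LoopLayer(32)
x = torch.randn(1, 16, 32)   # (batch, seq, d_model)
h1, cache = layer(x)         # pass 1: populate global memory
h2, _ = layer(x, cache)      # pass 2: same weights, gated global + local
```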
So, as the sketch suggests, you have this first pass focused on populating the global cache, and then a second pass that's more about local correlations, and then they have a learned gating mechanism that fuses those two streams. It decides, for every token, basically how much to rely on the global understanding from the first pass versus the more local, token-level refinement from the second pass. It's kind of interesting. It reminds me a lot of these attempts to solve for memory; really, it's an attempt to fix the memory problem. A lot of people are trying different things for this, but this is something I hadn't seen before, so kind of cool. And the results are pretty good. On LiveCodeBench, the 40B-Loop model, thinking version, got a score of 81.1, which is the highest recorded for that benchmark, at least in the report. SWE-bench Verified: 47.2%, decent for a 40 billion parameter model. So anyway, it's kind of interesting. Again, just another stab at this whole memory thing: let's have the model process the thing twice. I haven't seen this before; I wonder if it will pop back up later. Yeah, fun fact: they note that this approach is pretty heavily based on another quite new 2025 paper from the ByteDance Seed team, and on Hyper-Connections, also from the ByteDance Seed team. So a lot of these research insights are now making it into more models. And it seems like we're really in the optimization phase, where maybe the data stuff is mostly sorted out, and maybe the training regimes and stability and so on have been pretty well worked out, and now we can get into a lot more of the nitty-gritty. And this loop transformer idea goes back to 2019 with the Universal Transformer, back when research was research. So yeah, as we know, it's back to the era of research. All right, and one more open source model. We've got the TII Abu Dhabi team releasing Falcon-H1R-7B, a new reasoning model that apparently outperforms others at math and coding with only 7 billion parameters and a 256K context window. The fancy naming indicates that this is a hybrid transformer and Mamba-2 architecture. Mamba-2, just to quickly reiterate, is an alternative to transformers that has recurrence within it, kind of looping over the input, which allows it to handle much longer sequences without increasing computation. So here they're using that to handle large context windows. And via a two-stage training process, SFT on long reasoning traces and then reinforcement learning, similar to what we've seen before, it's able to achieve fairly good benchmark scores. And yeah, we got another Falcon model, which used to be a big deal in open source early on but had fallen off. Now we're seeing, I think, more and more of these indications of what you can achieve with hybrid models, and it seems like a promising direction in general. Yeah, one take-home from this is that TII, the Technology Innovation Institute, I think that's what it stands for if I remember right, is back. I mean, famously the Falcon 180B, or no, I think that was the disappointing one, but there was one before that was really impressive, and it put the UAE, among other things, on the map in a big way in this space. So this is a genuinely interesting and impressive model.
It does beat Qwen3 32B and Microsoft's Phi-4 14B in a bunch of reasoning tasks, and it only has 7 billion parameters. So this is a legitimate achievement. Also, the Mamba-2 hybrid thing, I feel like we're seeing this more and more. The way these Transformer-Mamba hybrid architectures work, or historically the way they'd work, was you would stagger: you do a Transformer layer with standard attention and all that stuff, then pass it on to a Mamba layer, then a Transformer layer, alternating that way. What's happening here is not that. They're actually, in parallel, using some Mamba heads and some attention heads to produce the output at each layer, which is an interesting change, definitely not something that traditionally had been done. They call this the parallel hybrid layer: the heads sit side by side (there's a rough sketch of this after this paragraph). So when a token hits one of these layers, it's split. Part of the information goes through the Mamba-2 head, and that handles the longer-term sequence history. We've talked about Mamba and Mamba-2 before, but the idea is that you have a vector that stores context over time as the model reads the text, much like a recurrent architecture, like an RNN. Basically, a vector you keep dumping more context into, almost like a pseudo scratch pad, as it reads. And then you've got the other part, the attention head, which handles the more complex and precise relationships. Because the Mamba vector, this memory vector, is only so big, you're going to get fuzzier recall, more lossy memory. So when you need to handle complex, precise relationships, you really do want the attention head. And there's an MLP, just a multilayer perceptron, a standard feedforward network, that then merges their outputs together. There are a bunch of differences between Mamba-2 and Mamba that hopefully we have a podcast episode explaining. There's an efficiency optimization happening there that's very important. Super quickly: when you do matrix multiplication, it can help to take really big matrices and break them down into smaller ones, to do what's called block matrix multiplication. The original Mamba did not allow you to do that. The new one does, which is great, because those smaller matrices can then fit on more constrained hardware, and you can use your SRAM, your cache, your VRAM more efficiently across the board. So that's basically the big picture, but the results speak for themselves. And yet again, we're seeing this Mamba-Transformer merger; I feel like this is the second or third time this year that we've covered it. Right. Yeah. We saw previously that the same team released Falcon-H1 back in mid-2025. That's where they introduced the first version of this hybrid and really dug into this way of doing it. So this one is that plus R, right? Pushing the reasoning frontier, going into essentially the DeepSeek R1 style of thing where you do reinforcement learning with your data mixes and so on. Similarly, there's a pretty detailed technical report, 20 pages, going as deep as giving you the specific training hyperparameters, data mixes, et cetera. So it's very interesting to be in an AI space where there really aren't many secrets. If you had $50 billion, you probably could train a pretty good model at this point. That's an interesting question, actually. What are the secrets that matter?
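Before that question, here's the promised sketch of the parallel hybrid layout in PyTorch. The recurrent branch below is a toy gated linear scan standing in for a real Mamba-2 block, and every name and dimension is an illustrative assumption, not TII's actual design:

```python
import torch
import torch.nn as nn

class ParallelHybridLayer(nn.Module):
    """Sketch of the 'parallel hybrid' idea: the same tokens flow through a
    recurrent (Mamba-style) branch and an attention branch side by side, and
    an MLP merges the two. The recurrence is a toy stand-in for Mamba-2."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.in_proj = nn.Linear(d, d)
        self.decay = nn.Parameter(torch.zeros(d))  # per-channel forget gate
        self.merge = nn.Sequential(nn.Linear(2 * d, 4 * d), nn.GELU(),
                                   nn.Linear(4 * d, d))

    def forward(self, x):                      # x: (batch, seq, d)
        # Recurrent branch: a running state accumulates fuzzy long-range context.
        a = torch.sigmoid(self.decay)
        state = torch.zeros_like(x[:, 0])
        rec = []
        for t in range(x.size(1)):
            state = a * state + (1 - a) * self.in_proj(x[:, t])
            rec.append(state)
        rec = torch.stack(rec, dim=1)
        # Attention branch: precise token-to-token relationships, causal mask.
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        att, _ = self.attn(x, x, x, attn_mask=mask)
        # MLP merges the fuzzy long-range view with the precise local one.
        return self.merge(torch.cat([rec, att], dim=-1))

layer = ParallelHybridLayer(64)
print(layer(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```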
So, back to that question of secrets: what is Anthropic's moat made of, for example? What is OpenAI's moat made of? I think it's actually a lot of taste. A lot of it is at the level of, yes, the alignment strategies and all that; that's a big difference. There is some stuff going on with hardware optimizations and software optimizations that is below the public line. But you're right. If you wanted to stand up a reasonably competitive model, let's say, it's not going to be Claude Code, it's not going to be something that makes people jump out of their seats, but it's going to be within 12 months of the frontier or so. You could do it; you're right. I think at the frontier there are people who know things that are hard to even write down. You just have this weird combination of knowledge that you're not going to get any other way. But the algorithms, the general types of data, the general model architectures: with Qwen, with the Chinese open source models, and also releases like this from NVIDIA and TII, we're getting a lot more visibility into what works and what doesn't. Which is different from how it used to be; I guess it started with Llama back in the day. It was Llama then; now it's a different world. On to research and advancements. We begin with Deep Delta Learning, from just a couple of researchers at two different universities. The gist, once again, of what they're doing: they're introducing this delta operator, a fancy mathematical operation you can apply to the residual stream, sort of similar to the hyper-connections we discussed earlier. It enables you to do a similar thing to hyper-connections: get more powerful residual connections that you can introduce into your network architecture, and then get stronger models, effectively. Yeah, this is one of those papers. It's a theory paper, first of all, so everybody, don't get too excited; it's one of those rarer papers for us in that it doesn't have many experimental results. The claim here is: we have a more efficient way of designing the flow of data through these residual layers. And there's a world where this is a paper we turn back to years from now and go, oh damn, that was really important. So it's worth sketching out how it works. In a standard residual network, you've got this whole thing we talked about, where the input to layer one gets added to the output of layer one in what gets passed on to the next layer, right? So instead of just feeding the input to layer one, getting an output, and feeding that output down the line, you generate your output, then add the input to it again, and pass that down. The goal is to mitigate vanishing gradients; that's the key problem here. If you didn't do that, you'd find the information from earlier layers just doesn't make it through to later ones, which caps the depth of the network and introduces all kinds of performance problems. The problem is, there's actually a pretty strong bias you're creating when you do that. Mathematically, you can only add new information to the information from the previous layer. Again, the input to the next layer is the input to the previous layer plus the output of the previous layer. All you're doing is adding. And that creates a very fixed mathematical relationship that limits the network's flexibility, because it can't easily flip information in the path. All it can do is keep adding and adding and adding.
It can't discard stuff. And so what this approach, Deep Delta Learning, allows you to do is, more generally, either flip the direction of certain features or geometrically reflect, flip, across a certain axis. The details get pretty involved, so I'm just going to flag two mathematical entities that are relevant here. The first is that the model learns a matrix that lets you calculate a direction along which it can basically fuck with the residual, to deviate from just the identity transformation. So instead of taking the pure residual, the input, and just handing it down the line, what we're saying is: let's pick a learned direction along which we're going to modify this residual, in a way that improves performance on the loss function. At the same time, they also have a learnable variable, a scalar, basically a single number, not a vector, that determines the nature of the transformation done along that direction. This is a parameter called beta. If beta is zero, you get back the identity transformation, the same old thing: there's a direction you could modify the input along, but the magnitude of the change is zero. If beta is equal to one, you erase information along that direction. And if beta is two, you actually flip information across that hyperplane; you're adding back the negative of the information along that direction. All of this is to say, you now have a more nuanced way of chewing on your data than just continually adding the residual each time. And there's a theoretical case that this has been a limiting factor in how transformers chew on their data. So if you believe that neural networks need to be able to delete bad features, or flip their internal logic, to reach higher levels of intelligence, then Deep Delta Learning matters, because it provides the first mathematically clean way to do that while keeping training stable, because that's the other challenge people have had, and this actually works in that respect. So it is a theory paper, a white paper, but it does unify three components that were previously thought separate, gating, attention, and residual learning, in a way that's coherent and seems to work at least pretty well. And they found an excuse to write the term delta operator in a context that isn't special forces, so that's kind of cool too. That's probably most of what they were going after in that paper. That's right. And I think we've covered quite a bit here; we've covered some of these big training runs and open source models, and these are papers from universities. And just to call this out: of course, research is built on a lot of past work. This builds on a lot of analysis of this residual stream topic. Prior to this, the delta rule was incorporated into residual streams and shown to be effective, so there are some empirical results prior to this paper that it builds on. The idea of residual streams in general goes back to 2016 with ResNets, which was a big deal when introduced and allowed for very, very deep networks, which was not possible prior to that. So, suffice it to say, none of these papers lives on its own. It all goes back to a rich history of prior investigations.
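To make the beta story concrete, here's a tiny NumPy sketch of the generalized Householder-style update that the description points at. The exact parameterization in the paper may well differ; this is just the geometry of beta equal to 0, 1, and 2, with a fixed direction vector standing in for the learned one:

```python
import numpy as np

def delta_update(x, v, beta):
    """Householder-style residual update matching the description:
    beta=0 gives the identity, beta=1 erases the component along v,
    beta=2 reflects across the hyperplane orthogonal to v.
    v plays the role of the learned direction; here it's just fixed."""
    v = v / np.linalg.norm(v)
    return x - beta * np.dot(v, x) * v

x = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
for beta in (0.0, 1.0, 2.0):
    print(beta, delta_update(x, v, beta))
# 0.0 -> [3. 4.]   plain residual, identity
# 1.0 -> [0. 4.]   information along v deleted
# 2.0 -> [-3. 4.]  information along v flipped
```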
Next up, we've got recursive language models, coming from MIT. The gist of the paper is that, to scale to very big tasks and basically process arbitrarily long prompts, they treat the prompt as a thing you can read bit by bit, whenever you need to; you can look up information from the prompt. The way they put it is that they treat the prompt as part of an external environment that the LLM can programmatically examine and decompose, recursively calling itself. The recursive part is that it can say, okay, here's one bit of stuff I need to do: I need to look at, say, chapter one of this book and see what's going on there, or extract the main characters. Let me have a language model look at this bit of the prompt, do this task, and give me some information back. And this can be done arbitrarily deep. So, basically, a prompt... it's, in my opinion, a bit of a weird terminology question, where you could say this is a prompt, or you could say the book is a text file that lives in the environment. But in a sense, it's very similar to what Claude Code, for instance, would do: if you have a text file, documentation, guidelines, et cetera, and you give it a task, it's able, at inference time, when it makes sense or when it's needed, to look into that reference guide or book or whatever, for whichever bit is relevant. And they show that with this approach, you can handle very long prompts, book-length inputs and so on, and it's quite capable. It took me a little while, I don't know if it's the way the paper was written, to fully get it. So there's an example I was iterating on with Gemini. Basically, imagine you have a big book and you say: list every character who ever held a silver object, right? The idea is the model goes, okay, this is way too much. It's a giant book, maybe a million tokens; it's not going to read the whole thing or fit it all into RAM and do a good job. So instead it will call itself. It says: hey, the book is too long to read at once; I'm going to write a script to split the text and to call myself in that script, with a certain query, for every chapter. That's the recursive step, right? It's literally saying: number one, book too big. Number two, I therefore need to write a script to chunk up the text. And because of what my top-level prompt was, I'm going to give myself instructions to pursue certain lines of work deeper. So now there's a child instance that gets called, and maybe it opens up chapter one. Maybe it sees that the chapter is still too long, and it chunks it up even more, or whatever. But ultimately, it does its own analysis and returns a response like: okay, in chapter one, Bob held the silver flask, something like that. And then the root model gets all those responses and does its analysis. Now, this is a lot like, and I think this was before we started recording this podcast together, Andrey, but back in the day, I remember when GPT-3 was first introduced, a few months after, they started using GPT-3 to summarize entire books. And the way OpenAI did this was they basically had GPT-3 generate a summary of chapter one, a summary of chapter two, blah, blah, blah, and then they had GPT-3 write a summary of the summaries.
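Before getting into why that summary-of-summaries trick falls short, here's a minimal sketch of the recursive pattern being described. The llm function is a placeholder for a real model call, the character-count budget stands in for a context window, and the midpoint split is a simplification; none of this is the paper's actual code:

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g. an API client); returns stub text."""
    return f"[answer derived from {len(prompt)} chars of prompt]"

MAX_CHARS = 4000  # hypothetical per-call budget standing in for a context window

def recursive_query(text: str, query: str) -> str:
    """Treat the long prompt as an environment: if it fits, answer directly;
    otherwise split it, recursively query each chunk, then synthesize the
    child answers at the root. A sketch of the pattern, not the paper's code."""
    if len(text) <= MAX_CHARS:
        return llm(f"Text:\n{text}\n\nTask: {query}")
    mid = len(text) // 2
    partials = [recursive_query(text[:mid], query),
                recursive_query(text[mid:], query)]
    return llm("Combine these partial answers to: " + query + "\n"
               + "\n".join(partials))

book = "..." * 10000  # stand-in for a million-token novel
print(recursive_query(book, "List every character who ever held a silver object"))
```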
As you can imagine, that summary-of-summaries approach had all kinds of problems, not least of which was that it completely misses long-context interactions between pieces of information that are maybe too niche to be worth surfacing when you first read chapter one, but that actually do connect. If you're thinking, for example, of a detective novel, an offhand mention of which hand a character prefers could be totally ignored by the model generating a chapter summary, when in fact that niche fact ends up being pivotal to the overall plot by the end; you'd completely miss it. And so what this fixes, among other things, is that you essentially have a dynamic peeking function, a programmable search, where in the detective case the base model starts by getting tasked with: hey, find the murderer. Instead of just tasking sub-instances with summarizing chapter one, it could write a script to find every mention of this or that, get a summary, and then, based on that summary, iterate and refine its prompts to keep going. So this is partly a use of test-time scaling; that's one part of the story. But it's also fully offloading the responsibility for designing the search architecture itself to the model, which is really interesting. And they have all kinds of test-time scaling results that are really positive here, and curves that don't bend. So do with that what you will, but I thought it was quite an interesting conceptual experiment. Yeah, I think it effectively formalizes the notion of sub-agents in some sense. Yes, yes. This is what already happens with Claude Code and other agentic systems; they can tell themselves, in some sense, to go off and do this thing. It's also kind of funny that there was all this debate about neurosymbolic versus connectionist approaches back in the day, and now we're just in neurosymbolic land with tool calls and all this technically symbolic stuff, and nobody cares. It's just writing some code. A couple more papers to get through. We've got conditional memory via scalable lookup, also from DeepSeek. The general topic is this: when you're training a language model, there are in a sense two distinct things. One is memory, just knowledge. The other is reasoning, thinking, whatever you want to call it, acting on that knowledge. And there's been, for quite a while, this notion that if you have some sort of memory, maybe it's better to keep it as external memory the neural net can just do lookups on, instead of having it embedded in the actual neural net weights, where you in some sense want to encode the intelligence. You'd split computation and memory up distinctly, which these days we don't have in transformers, don't have in large language models in general. In this paper they introduce one way to do this; they call it Engram. It builds on the notion of n-gram embeddings from back in 2017 as a way to instantiate this. You create these trained modules that let you do this kind of lookup operation, and then you insert them into your overall architecture. So you wind up having the input, then some transformer blocks, then an Engram block within which you're basically interacting with this memory, which can be various sizes, holding more or less knowledge, and then you move on to attention and mixture of experts and so on.
And this would be expected to make it possible to scale to more knowledge, more information, without necessarily scaling the intelligence bit of the model. They go into a lot of detail on how much information you can store here, with the lookup tables getting as big as 27 billion parameters. Yeah. And it's also worth noting what n-grams classically did. So n here is an integer, a number, right? Just real quick, because it's kind of annoying: there's n-gram, the classic term, and in the paper they introduce Engram as the name for their module. So, n-gram versus Engram, to be clear. Exactly. And the classic idea is what they're calling back to. With old-school n-grams, you'd have unigrams, bigrams, trigrams. A unigram was a single word, a bigram was a two-word sequence, right? So a unigram might be "machine," a bigram might be "machine learning," and a trigram is a three-word sequence, like "deep neural network" or something. Basically, these are chunks of words that together mean something. "Alexander" on its own means something, but "Alexander the Great" means something very different, very specific. And the idea here is that it's sometimes challenging for these models to learn trigrams. They start by learning unigrams, then bigrams, then trigrams, typically, if you look at how training unfolds. But instead of the model having to recalculate what "Alexander the Great" means every single time in its layers, which is typically what happens in practice, they're trying to make it possible for the model to see the words, look them up in this massive reference table, like you said, and just pull out the answer. And they do this. There's a bunch of context, and since this is a lightning round story I won't go through it all, but they have an interesting way of doing the lookup. These aren't traditional n-grams, because these things are learnable; they actually evolve over time. They get a random initialization at first, they do suck at first, and then they get better very quickly. Maybe I'll just pause there. There's a bunch of interesting stuff, though. If you like this, multi-head hashing is a good keyword to look up to get a sense of what's special about this paper. Yeah, DeepSeek, once again: a 20-page paper with a bunch of follow-up information that goes pretty deep. But the general theme here is that they're introducing a lot of new, interesting ideas that augment the basic transformer and seem like they might actually be a future component of how we do neural nets. For a while, we've been doing transformers the same way, more or less; then there was mixture of experts, and now there are new training regimes. And it looks like some of these hybrid ideas, memory ideas, could play a part in the future. And one last paper: extending the context of pre-trained LLMs by dropping their positional embeddings. So the thing with inputs to transformers is that transformers don't have a notion of position in the input by default, because of how the attention mechanism works. So we attach these positional embeddings that say: this is the first token, this is the second token, et cetera. And that goes all the way back to the original transformer.
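Quickly, before digging into that positional-embeddings paper, here's a toy sketch of the Engram-style lookup just discussed, in PyTorch. The table size, hash scheme, head count, and class name are all made-up assumptions; the real module is considerably more sophisticated, hence the multi-head hashing pointer above:

```python
import torch
import torch.nn as nn

class HashedNgramMemory(nn.Module):
    """Toy flavor of the lookup idea: hash each trailing n-gram of token ids
    into a big learnable embedding table, with several independent hash
    'heads' whose results are summed. The tables start random (they 'suck at
    first') and are trained along with the rest of the network."""
    def __init__(self, d=64, table_size=2**16, n=3, heads=4):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(table_size, d)
                                    for _ in range(heads))
        self.n, self.table_size = n, table_size
        self.seeds = [1000003 * (h + 1) for h in range(heads)]

    def forward(self, token_ids):              # token_ids: (seq,) of ints
        out = []
        for t in range(len(token_ids)):
            gram = tuple(token_ids[max(0, t - self.n + 1): t + 1].tolist())
            vec = 0
            for table, seed in zip(self.tables, self.seeds):
                idx = hash(gram + (seed,)) % self.table_size  # one hash head
                vec = vec + table(torch.tensor(idx))
            out.append(vec)
        return torch.stack(out)  # (seq, d): retrieved knowledge to mix back in

mem = HashedNgramMemory()
print(mem(torch.tensor([5, 17, 17, 42, 9])).shape)  # torch.Size([5, 64])
```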
So, this positional-embeddings paper looks into that and shows you can keep the positional embeddings, in this case RoPE, the standard way to do positional embeddings, early on, and that helps you train; then, after you've trained for a while, you can actually drop the positional information, and it turns out to work just fine. And then you wind up being able to extend the context of pre-trained LLMs. So, an interesting paper. Honestly, I haven't had the time to read into it that much, but it sounds counterintuitive. Yeah. Well, it's counterintuitive, and then it's not once you know how the magic trick is done. I find that's always how these things are, except for the papers that go, "this works and we have no idea why," which does happen. So, like you said, the standard approach to letting the model understand the relative positions of tokens in the input is this little trick. Imagine your word embedding, which is basically just a vector, a list of numbers that represents that word or token; imagine it's a two-dimensional vector, a vector in a 2D plane. The original way of doing this with RoPE was to take that vector and just rotate it by an angle; call that angle theta. If the word is at position one, you rotate it by theta. If the word is at position two, you rotate the corresponding vector by two theta. If it's at position three, by three theta, and so on. What you're doing is inducing a repeatable pattern of modifying that vector in a predictable way as a function of its location. We do this weird rotation stuff because just appending a number like one, two, three, four doesn't work well; the numbers get big, and that messes with training. That's actually a really important clarification. People have tried that: can I just glue a couple of numbers onto the end of the vector to mark its location? And exactly, that causes all these stability problems. Now, in practice, an LLM embedding isn't going to be just two-dimensional; the vectors that represent these words or tokens can have thousands of dimensions. So RoPE actually breaks all those dimensions down into pairs, and each pair gets its own two-dimensional plane with its own clock hand, you can think of it, rotating at a different speed. Some of these pairs of dimensions in the embedding vector get rotated really fast as you increment the position of that token in the input, and some rotate much more slowly. The fast clocks, the fast-rotating pairs, help the model tell apart words that are right next to each other; these are the short-range relationships. And then you've got the slower ones, rotating more slowly; they help the model keep track of the broader structure over hundreds of tokens, the long-range relationships. And by the way, this is only done to the keys and the queries. The keys are basically how a token says, hey, here's the information I contain and the things I can help with, and the queries are, hey, here's the information I'm looking for.
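Here's a minimal NumPy sketch of that rotation, using one common pairing convention (the half-split form); the base frequency, the sequence length, and the dimensions are illustrative assumptions:

```python
import numpy as np

def rope(x, base=10000.0):
    """Minimal rotary position embedding: treat the vector at each position
    as 2-D pairs and rotate each pair by (position * frequency), with a
    different clock speed per pair; fast pairs capture local order, slow
    pairs capture long-range structure. In practice this is applied only
    to queries and keys."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation speed per pair
    angles = np.outer(np.arange(seq), freqs)    # (seq, half): position * freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(8, 16))  # 8 positions, 16 dims
q_rot = rope(q)
# Attention scores between rotated queries and keys end up depending on the
# relative offset between positions, which is the point of the trick.
print(q_rot.shape)  # (8, 16)
```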
The dot product of those keys and queries is then how the attention mechanism works. Anyway, this approach basically says: you know what, we thought these clocks, these pairs of dimensions, were really important for the model to function. But it turns out the model's internal layers can actually learn to tell time on their own; they can learn the location of things in the input on their own. Once they've learned the vibe of the sequence, you can take the clocks away, and the model actually becomes even better at handling massive amounts of data. So essentially, like you said, you strip out all of this rotational mumbo-jumbo after you've finished pre-training. You give it some extra fine-tuning, and it very quickly learns to understand positioning just from context. And the really important thing here is that RoPE requires you to train a model in a way that's very sensitive to the size of the context window you trained on. If you train the model with these rotational embeddings on 4,000-token contexts and then suddenly give it 8,000 tokens of context, it has never learned the clocks for the last half, basically the last 4,000 tokens. It's never learned to use those clocks, and all of a sudden the model basically starts misfiring. The advantage of removing the positional embeddings is that the model can learn to recalibrate, and then it's good for much, much larger sequences. One of the main reasons models struggle with long sequences is taken away, and the model can suddenly generalize really well: in one instance, they show going from a 2,000-token context to 32,000 without ever being trained on those longer context lengths. All you've done is give it a little extra training to work without the embeddings, and it's off to the races. So really, I think, quite an interesting paper. This idea of expanding context window lengths has been such a challenge. You obviously still need memory, and that remains a problem; none of this solves that. But it does solve at least the positional embedding aspect of this, which is kind of interesting. We're learning an awful lot about what goes into these high-token, high-context failure modes. It's not just the memory; it's all these little things, including how you train models to work with, or without, positional embeddings. And by the way, apparently this idea that transformers can just learn positional information isn't new; it goes back to December of 2022. What this paper shows is that the previous approach, where you never bother with positional information at all, doesn't train as well, and that this approach of starting with it to help the LLM and then dropping it later works better. So yeah, an interesting trick. On to policy and safety, and we've got just one more paper to discuss. This one is from Anthropic: Constitutional Classifiers Plus, efficient production-grade defenses against universal jailbreaks. So, constitutional classifiers are the thing Anthropic introduced for dealing with alignment issues: if you ask an LLM to do something harmful, you have something in the system that can classify whether the request falls within your constitution of what the model is or isn't supposed to do. We know there are some jailbreaks that get around these systems.
So they look at two in particular: reconstruction attacks, which split up the input so that no piece looks harmful on its own, but if you combine those bits it winds up actually being bad; and obfuscation attacks, which I guess we talked about a little while ago, where if you ask the model to output something as a poem, it will do it, because it doesn't look like a recipe for meth when it's a poem. And so they discuss how to defend against these jailbreaks, and the gist is to make the classifier a little more robust. Instead of input-only and output-only classifiers that see just a single exchange, they do the evaluation over the entire context of the conversation. That makes the system more expensive, and a bunch of this paper is about how to make it affordable while still doing this more robust check, via some tricks. And presumably Anthropic has this in production now. Yeah, that's right. And actually, to your point, the tricks there are, as ever, the key thing. They use a two-stage classifier cascade. The idea is that you first use a lightweight classifier to screen all traffic, and you only escalate suspicious exchanges; at that point, you escalate to a more powerful, more expensive second-stage classifier. What they find in practice is that, combined with their other optimizations, they reduce the computational overhead by a factor of more than five while maintaining the same performance as the more expensive model. Their first pass defaults heavily toward false positives; it tries to catch as much as it can, which lets them avoid escalating the vast, vast majority of prompts in cases that don't matter. Overall, the way this works is like a funnel. Stage one is an external classifier that screens all traffic and escalates suspicious cases. Alongside that stage-one classifier, they run a bunch of linear probes on every input; these look at the activations of the model rather than just the input or the output. How are the activations inside the model firing, and is there anything we can learn from that which suggests something suspicious is happening? That's another way they go one layer deeper. And then finally, at stage two, for the escalated cases, the system combines the linear probe ensemble's predictions with the higher-accuracy second-stage external classifier, and that generates the final decision. So, a lot of impressive results: a 40x reduction in computational cost compared to the baseline exchange classifier, and a significant, though quite a bit smaller, reduction compared with their original implementation; the new system comes in at around 23% more processing power needed. A very, very low refusal rate too: 0.05% on production traffic, which is actually representative, at least as of today, of what they see. The original implementation had something like a 0.38% refusal rate, so almost 10 times higher. You're going to see a lot less refusal, presumably a lot less inappropriate refusal, from this model. So yeah, pretty impressive, and another sort of pseudo-alignment paper from Anthropic. And they tested it quite a bit, including with red teaming.
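Here's a toy sketch of the funnel just described, with placeholder scorers; the thresholds, the probe, and the combination rule are all assumptions rather than Anthropic's actual system:

```python
def cheap_screen(exchange: dict) -> float:
    """Stage 1: lightweight classifier over the full conversation context.
    Placeholder scoring; a real system would run a small model here."""
    return 0.9 if "suspicious" in exchange["text"] else 0.05

def probe_score(activations) -> float:
    """Linear probes on internal activations, run alongside stage 1.
    Placeholder: a real probe is a learned linear map on hidden states."""
    return float(sum(activations) / len(activations))

def expensive_classifier(exchange: dict) -> bool:
    """Stage 2: the costly, higher-accuracy classifier, run only on escalations."""
    return "suspicious" in exchange["text"]

def cascade(exchange: dict, activations, escalate_at=0.5) -> bool:
    """Screen everything cheaply, escalate only suspicious traffic, and
    combine the probe and stage-2 signals for the final call. The 50/50
    weighting is an illustrative assumption."""
    if cheap_screen(exchange) < escalate_at:
        return False  # the vast majority of traffic stops here, cheaply
    combined = 0.5 * probe_score(activations) + 0.5 * expensive_classifier(exchange)
    return combined > 0.5

print(cascade({"text": "how do I bake bread"}, [0.1, 0.2]))   # False
print(cascade({"text": "suspicious request"}, [0.8, 0.9]))    # True
```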
And here's just a fun detail about the red teaming: red teamers got API access to the defended model. They could submit their jailbreak attempts along with how long each took, and Anthropic offered bounties that scaled with the number of successfully jailbroken queries, with maximum payouts ranging from $25,000 to $35,000 depending on the campaign. So they gave humans quite a bit of motivation to try and get these models to slip up. Next up, just a few more stories and we're done with all the research. First, we've got: NVIDIA CEO says purchase orders, not formal declarations, will signal Chinese approval of the H200. So Jeremy, you mentioned this earlier. This was in Las Vegas, and the title basically is the story. The quote: "My expectation is that we're not expecting any press releases or any large declarations. There's just going to be purchase orders." Yeah. And as he says, customer demand is quite high, quite high. "We fired up our supply chain and H200s are flowing through the line." We talked about this earlier in terms of CoWoS, the packaging, being the rate limiter there. There's also this little drop, and I completely missed it: local media reported last month that NVIDIA is in talks to buy the Israeli firm AI21 Labs. They are the OG post-GPT-3 LLM replication lab. They were one of the very first, if not the first, certainly among Western labs, to replicate anything that looked like GPT-3. It was Jurassic-1 Jumbo, I think. Yeah, one of those. Anyway, NVIDIA is buying them, or at least in talks to buy them, which is very interesting. And again, I completely missed that. So there you have it. One more story on China: China AI leaders warn of widening gap with the US. This was a comment particularly from Justin Lin of Alibaba. He basically said that over the next five years, he would give a less than 20% probability to Chinese companies leapfrogging OpenAI and Anthropic with fundamental breakthroughs. And this was reiterated, or at least echoed, by others at companies like Zhipu AI, potentially being like, oh, we need a bit more compute. My read of this is: boy, do we need compute, and maybe don't bar us from getting NVIDIA chips. Yeah. I mean, apparently one of the biggest themes here was this idea of the resource gap; at least that's what they're calling it, presumably in Chinese. And so they're saying: hey, look, you've got US firms with huge amounts of compute, and we're stretched too thin. The interesting thing is they're basically stretched too thin on inference. They're just trying to service all of these requirements, customers coming in and trying to use these services. You've got 1.4 billion people in China; that's a lot of inference to service, especially given that use of Western tools is more limited there too, so it all gets funneled their way. That leaves very little left for training and R&D. So this is a real, fundamental challenge they're facing. It rhymes almost exactly with what one of the DeepSeek founders said early on, before DeepSeek was on the radar of the Chinese Communist Party. He was going out and just saying: yeah, the only thing preventing us from beating the Americans is access to chips. That's really the only thing. And then DeepSeek dropped their big R1 model, everybody got excited, and the Chinese Communist Party had him come forward and testify in front of, you know, I forget who exactly it was.
But anyway, the vice chair or whatever. And basically, he was told to shut the fuck up about the fact that Western export controls are absolutely working and crippling Chinese AI efforts. This is, yet again... I mean, they can't keep it in their pants with this stuff. They just keep telling us what our policy should be. Hey guys, our freaking chips are the thing we're missing; don't send us chips unless you want us to be able to compete with you. And in this context, we have all this H200 chip stuff being sent over. So there are obviously a lot of considerations here, but at least in terms of what the Chinese labs themselves are telling us our policy should be, from that corner of the universe, it seems pretty clear that the chips are a big thing, right? One important thing, and we did talk about this way, way back when the R1 model launched: they were talking about how DeepSeek's R1 model helped narrow the gap temporarily. That's why it was a big deal. But maintaining that pace has been really tough under current hardware constraints, and we talked about that at the time. Specifically, we said, and even before Dario came out with his essay, I'm sorry, I'm beating the Last Week in AI drum here, we said: what you're going to see is a massive reconfiguration of the ecosystem, in both the United States and China, around the idea of inference-time compute, trying to scale up inference-time infrastructure. And as that happens, what you'll find is China will suddenly be capped at a very small inference-time budget, for training purposes and other things, whereas we have a lot more chips to support that in the West. So although DeepSeek R1 made it seem as though they could keep up, that wasn't going to be a lasting advantage. And the whole narrative the Chinese were really keen to push at the time was: look, DeepSeek R1 shows all these chip export controls just don't work, you might as well get rid of them, when in reality the real story was yet to play out. It has played out. We know the export controls were working. There's a whole bunch of policy questions we need to ask ourselves about outcomes in different forms, but certainly with respect to this, it seems pretty clear that that's how it's played out. So yeah, I think this is a really interesting export control question, and a question for NVIDIA, how they prioritize their supply chains. I think the administration has not had its last word on this; we'll keep seeing iteration as the consequences of different moves play out. But this is a really, really information-dense announcement, I would say. There's a lot to be learned from it. I stated this previously, but relevant to this notion: we do, kind of by default, give the viewpoint of the United States with regard to these topics, right? I'm sure there's a different reading you could do if you're in China, or pro-China: Qwen models are great, and there are a lot of great researchers in China. We're not saying China is bad, but this is from the position of the United States geopolitically, et cetera. Something worth noting. And the next story, actually, is on that note: Jake Sullivan is furious that Trump removed Biden's AI chip export controls. Jake Sullivan is the former national security advisor under Biden who helped put in place the export controls that were there until 2025.
And in this interview with The Verge, there's quite a bit of detail on his view of these developments. He's quite critical of the Trump administration and goes into why removing those export controls, and some other key bits of the Biden policies, is overall detrimental to the US, both in terms of competitiveness and in essentially helping China catch up in the AI race, if you want to think of it as an AI race. Everybody's got a take on these things, but one of the most predictable, unfortunate self-owns of the Biden administration on this was bundling a bunch of hyper-partisan language on ESG (environmental, social, and governance), AI ethics, and diversity, equity, and inclusion into a bunch of their bills that covered AI national security stuff. And so when you think about the repeal, a lot of what's been repealed is stuff that Trump politically had basically no choice but to repeal, because of what he ran on. The easy, I would say obvious, play that the Democrats should have gone with in the previous administration, and we talked about this at the time, was: don't put that language in the same bill that's doing things you consider important. If you think they're important, there's literally no reason to make this a mudslinging fight. Some of these moves are obvious policy wins, at least from where I stand, and the moment they're couched in partisan language by either party, they're liable to get repealed the next time the administration turns over, or, if it's a law, the next time Congress does. That creates a real issue. So turning down the temperature on the partisanship is, I think, a really, really important aspect of this, and it goes for everyone. There's no reason that being a Republican or a Democrat should affect whether or not you think frigging chip sales should happen to China. That's a crazy thing. Same with energy infrastructure, same with all these things. So I'm getting on my non-partisan soapbox here; the hot take is that everybody should just chill out a bit more. But yeah, I think it's kind of weird that these things have become politicized in the way they have. These are technical issues. Living in the US, I think, honestly, it's money talks with the Trump administration. None of this really matters as much as NVIDIA just having Jensen out there talking to Trump and promising to deliver some money, as is true of OpenAI, as is true of all these other companies whose leaders go and make nice comments and promise big numbers and all this other stuff. Power dynamics, et cetera; that's secondary, but also a factor. Yeah. Everyone's got a take on where these decisions are coming from. It's just sort of funny that there are so many cases where you see the math, and then you see stuff overlaid on top of it. It's like: let's put in the incendiary language for whatever the other party is and just see what happens. It's a tough way to make coherent policy. And that is it for this episode of Last Week in AI. Once again, you can go to lastweekin.ai for even more AI news beyond this. As always, we appreciate you subscribing, sharing the podcast, and reviewing. One of these days I'll get around to actually bringing back the reply-to-comments segment of the podcast, I promise. But more than anything, please do keep tuning in. Have we not had comments in the last little bit? Not in a while, yeah. Guys, guys, we've been busy. Send some love, man. Feel free to email; in the episode description you can find our emails. And let it slide.
Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. New tech emerging, watching surgeons fly. From the labs to the streets, AI's reaching high. Algorithms shaping, but the future sees. Tune in, tune in, get the latest with ease. Last week in AI, come and take a ride. Get the lowdown on tech, and let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.