AI Trends 2026: OpenClaw Agents, Reasoning LLMs, and More with Sebastian Raschka - #762
Sebastian Raschka discusses the evolution of LLMs in 2025-2026, focusing on three key areas: reasoning capabilities through post-training techniques, inference scaling methods, and agentic applications. The conversation covers recent model releases, practical implementation strategies, and predictions for continued innovation in the reasoning and tool-use paradigms.
- Most LLM R&D focus has shifted from pre-training to post-training optimization, particularly reasoning capabilities, as pre-training is already sophisticated but post-training still has low-hanging fruit
- Verifiable rewards in math and coding enable infinite answer generation for training, providing more reliable feedback than human evaluation and driving major reasoning improvements
- The biggest practical LLM benefits come from creating custom workflow automation tools rather than using sophisticated agentic wrappers or interfaces
- Current reasoning models excel at automatically determining appropriate effort levels, reducing the need for users to manually specify high-effort modes for most tasks
- Multi-agent systems face compounding failure rates as more models are chained together, making single-model optimization more impactful than complex agent architectures
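The compounding-failure point in the last bullet is easy to make concrete. Assuming, purely for illustration, that each chained model succeeds independently with probability p, the end-to-end success rate of an n-step chain is p to the power n (the 95% per-step reliability below is an invented example number, not a measurement from the episode):

```python
# If each model in a chained multi-agent pipeline succeeds independently
# with probability p, the whole chain succeeds with probability p**n.
def chain_success(p: float, n: int) -> float:
    """End-to-end success rate of n chained steps, each reliable with probability p."""
    return p ** n

# Even fairly reliable components compound quickly:
for n in (1, 3, 5, 10):
    print(n, round(chain_success(0.95, n), 3))
# 1 step: 0.95, 5 steps: ~0.774, 10 steps: ~0.599
```

This is why improving the single model lifts the whole system: raising p helps every link at once, while adding links only multiplies risk.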
"The R&D focus of the research teams, I think, is more on post-training nowadays, getting more performance out of that, because it's the newer paradigm and there are still low-hanging fruits to be picked, whereas pre-training is already pretty sophisticated."
"My hypothesis is that if you took the best open-weight LLM and put it into, let's say, a ChatGPT or Gemini or Claude interface, you would get almost the same quality and performance. A lot of use cases nowadays revolve around the tool wrapper around the LLM."
"It's almost wasteful to even ask an LLM what one plus one is. You can use a calculator. So I think it's still important to recognize what the nature of the problem is and what the best tool for that problem is."
"The more models you add, the higher the risk that one of them fails if they depend on each other. And I think improving the model itself will also help improve the whole system, basically, as the main way to improve performance."
The R&D focus of the research teams, I think, is more on post-training nowadays, getting more performance out of that, because it's the newer paradigm and there are still low-hanging fruits to be picked, whereas pre-training is already pretty sophisticated. You will still get better results if you use more data, optimize the data mix, maybe add multi-token prediction and these types of things. But most of the interesting things are happening now on the post-training front, in the reasoning realm. So I think we will see more there.
0:00
All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Sebastian Raschka. Sebastian is an independent LLM researcher. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Sebastian, welcome back to the podcast. It's been a little while.
0:45
Yeah, thank you for inviting me back, Sam. I'm happy to be back and to chat about LLMs, AI and whatever you have in mind. I had a lot of fun last time so I hope we can make it fun and interesting again.
1:06
You know, my joke around this is getting a bit old, but the last time we spoke was three years ago. Not much has changed, right?
1:19
Well, good things come in threes. I think there's a saying, right.
1:28
And in fact a ton has changed, and we're going to be focusing on the most recent and most important of those changes, in particular what's new with LLMs and what to expect from LLMs in 2026. This is an area you spend a lot of time on with your research and education work. Maybe we can start with what's top of mind: very big picture, where are we now compared to where we were a year ago? What is your broad reflection on the evolution of the space?
1:33
Looking at today compared to one year ago, it's almost the anniversary of DeepSeek, the big DeepSeek-V3 model accompanied by the R1 model, the reasoning revolution, in quotation marks. It's still LLMs, it's still the same base model, but we now have more techniques on top of that to make the models smarter in terms of solving more complex problems. So I would say architecture-wise, LLM architectures still look relatively similar, but the reasoning training is one of the new things if we compare today to last year. And then I also think there's a heavier focus on tool use. Back when ChatGPT was launched, and in the first iteration of LLMs, the focus was mainly on general-purpose tasks, having the LLM answer all the things we are curious about from memory. If we asked it a math question or a knowledge question, the LLM would basically draw from its memory and then write the answer. But that's not always the most effective or accurate thing to do. It's similar for us humans. I mean, LLMs are different from how humans think, but as a human, if you asked me a complicated math question, like multiplying two large numbers, I would pull out my calculator and work it out there. I wouldn't do it in my head. Maybe I could, but it would take a long time and be more error-prone, and there's no need to do that. It's the same with LLMs. With more modern tooling, it becomes more and more popular to have the LLM use tools too. It requires training the LLM to use those tools, but with that, I think we can reduce hallucination rates, not completely get rid of them, but reduce them, and also make answers more accurate. And then with reasoning capabilities, it's essentially giving the LLM more time, in quotation marks, to think through a problem.
So these are, I think, the two main knobs that we can tune and make progress on, particularly if we look at the difference between last year and now.
2:11
Yeah, we'll dig into the technical aspects of how we've evolved in reasoning and in tool use, among other things. But before we do that, I thought it might be interesting to talk a little bit from a practical perspective about how where we are today has shifted. And it's super interesting: we're talking in the second week of February, and already this year, in 2026, there's been a ton of news and new models: Opus 4.6, OpenAI's Codex 5.3, and the whole OpenClaw, formerly Moltbot, story. Talk a little bit about what we've seen already this year, in the context of where you see LLMs from a practical perspective.
4:35
Yeah, that's a good point. We are just in the second week of February, and that means the Chinese New Year hasn't even occurred yet, where I think there will be another batch of releases on the open-weight front. But I think that is a separate thing: you now have companies developing the tooling around LLMs, which is becoming more and more mature, and then you have better LLMs themselves. I would almost separate those two. My hypothesis is that if you took the best open-weight LLM and put it into, let's say, a ChatGPT or Gemini or Claude interface, you would get almost the same quality and performance. A lot of use cases nowadays revolve around the tool wrapper around the LLM.
5:29
That's the idea that was popularized toward the end of last year as harness engineering.
6:22
So I think that is also something that has changed in how we use LLMs. Before, it was simply a very basic chat interface, and then it became more sophisticated: you could upload files and PDFs and so on. For my personal use case, I use LLMs mostly for, it sounds weird, but proofreading and checking things. Just before recording here, I was finishing writing a chapter and I wanted to update the table of contents, so I uploaded the PDF to the ChatGPT interface and said, hey, can you give me the headers so I don't have to pull them out myself? And then you can double-check that it's correct. Little convenience tasks, making work a bit simpler, these tedious things. But then, like you said, there was also the new Opus model, and then OpenAI released Codex 5.3 and a macOS app with it. And I think that is yet another leap in terms of what these models are capable of. There were coding LLMs before, and it became more popular to use LLMs for coding, but it keeps getting better and better. Before, I used Visual Studio Code, just because I've used the Visual Studio Code editor for years, maybe five or ten years now, and before that I was using Vim and other things. So I'm very familiar with the UI: I have my Git tree, I know where I have a terminal inside, and that stuff. And I actually liked having the LLM as a plug-in there, where you can sometimes say, okay, I have a bug, can you just double-check? It's just another layer of tools you add to your workflow. The LLM doesn't have to be front and center; it can also be this little helper. I still debug things myself, but often it's actually quite nice and fast to ask the LLM to double-check things.
And what I like about it is that it's a second pair of eyes, but it's not completely taking over and doing everything; it's making your work better in a sense. You have additional checks, and you can ask, hey, can you suggest improvements to make my code more performant? You as the person still have to ask the right questions, and you still have to actually run the experiments to see whether it really makes the code faster. So it doesn't mean the LLM does everything for you, but it suggests useful things. I know a lot of people also use it for coding agents, and that works with the new Codex plug-in and also the Codex app. What's new is that a year or two ago, people were uploading code files to ChatGPT or Gemini or Claude, getting some feedback, and then having to manually incorporate it, and now it's more inline.
6:28
I think it's been a while since folks have been doing that.
9:37
Yeah, right. So that is now more native: you can see the file diff, and you don't have to leave your coding environment. On top of that, when you run these tools locally, you can give them access to your whole folder, say your whole Git folder, and then they can see the context of all the files; you don't have to manually upload anything. And beyond that, the LLM can nowadays use tools itself: you can give it permission to run certain commands, to run a unit test by itself, these types of things. Taken together, I wouldn't say there is a single thing that is groundbreaking or a game changer, but all these little things add up to make the models more capable, because the tooling keeps getting more sophisticated. And I think that's what we have been seeing in recent months, maybe the last few quarters: people develop these types of capabilities instead of just making the model better. There's a lot of performance we can get from the LLM by making the interface better, basically.
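The permission model described here can be sketched very simply: the assistant may only run commands the user has explicitly granted, and everything else is refused. This is an illustrative toy, not how any particular coding tool actually implements its sandbox; the allowlisted command names are example grants.

```python
# Toy sketch of permission-gated tool use: only allowlisted executables run.
import shlex
import subprocess

ALLOWED_COMMANDS = {"ls", "pytest"}  # hypothetical user-granted permissions

def run_tool(command: str) -> str:
    """Run a shell command only if its executable is on the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not permitted: {command!r}")
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.stdout

print(run_tool("ls ."))  # permitted: lists the current directory
```

Real harnesses layer more on top (per-directory scopes, interactive confirmation), but the core idea is the same gate between the model's requested action and the shell.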
9:40
Have you found yourself surprised by some new capability in either of these new models? You just said there were no breakthrough changes there, but was there anything surprising, or is it very much incremental to what you were already doing?
10:51
For me, it's personally more incremental. It's more about convenience; they're just getting more robust and better. I wouldn't say there's a wow effect for me, like, oh, my previous model was not able to do XYZ. It's just a bit better, more robust, and then I also develop a bit more trust in the results. It's more of a gradual improvement. One thing we still have is the distinction between the different reasoning efforts. It's like a slider for how much time the LLM should spend on getting you the results, with settings from low or no reasoning effort up to high reasoning effort, and that changes the time it takes for the LLM to generate results. I remember that half a year or a year ago, if you wanted good results, you almost always had to use the highest settings, the high reasoning modes, which took forever. Nowadays, even the lower modes feel pretty good, where for most tasks it's sufficient to use the medium or high reasoning efforts instead of the extra-high ones, and then you get results faster. I think that's also a quality-of-life improvement for these models: before, you ran them maybe occasionally, because you don't want to wait five minutes, but now it becomes more routine for them to be part of your workflow, basically.
11:15
Yeah. I would expand on that and say that the LLMs have gotten really good at knowing themselves how much effort is required to provide a good answer to a query. So in the vast majority of cases, I find myself just typing my prompt into ChatGPT, for example, not specifying a model or level of thinking, and letting it figure it out. If I want more, I'll tell it I want more. But it does a fairly good job of determining when to just give me a quick answer, when to use a search tool, when to do more thinking, that kind of thing.
12:44
I agree. I have my setting on ChatGPT on the auto mode, where it decides by itself whether it should use more or less thinking effort, same thing. The only context where I still use the Pro mode is, coming back to the chapter I mentioned, when I have a chapter written, like a 40-page PDF. I upload it there and say, hey, can you check for any inconsistencies, incorrect numbering, and all that type of stuff. Then I set it to the Pro mode, the one that takes 20 minutes, go have lunch or dinner, come back, and look at the results. But it's a rare thing that I do that, maybe once a month when I finish a chapter, or if I write something important where I want the maximum quality check. Like you said, for most tasks it's sufficient to use the light effort, or the automatic one where it decides by itself, essentially.
13:28
Yeah, right. And I mentioned Moltbot and the release of that tool. Have you spent much time digging into it?
14:23
Well, yeah, Moltbot, I think it's now called OpenClaw. It changed quite a bit. It's interesting. It's this local agent that people can now run on their own computers, and what I find interesting about it is that it gets people excited. It's almost like back when DeepMind had AlphaGo, the model that played the board game Go. That got really exciting because, in the grand scheme of things, not many people, at least in my circles, had played Go before, but it got people like my family really excited to see this type of progress when it was playing against the world champion. I think with Moltbot it's kind of similar: it gets people interested in checking these things out, and excited. I think there are also a lot of genuine use cases around it, like organizing your calendar and emails. For me personally, that's something I have not done. Maybe I have a little bit of a trust issue; I don't know if I trust it enough to do my finances or my calendar. I'm still a bit hesitant to adopt something like that. But I think it's a cool demonstration to show someone who isn't developing LLMs what these LLMs can do and what the purpose of them is. In that sense, I think it's actually quite cool.
14:31
Yeah. Are there any other tools or services, largely wrappers around LLMs, that you have come to depend on? Or do you find yourself mostly turning to the models themselves or the dev environments?
16:12
Yeah, for my workflows it's mostly still the case that I don't have anything super automated, where I need to run something incrementally or in an agentic type of setting. What I've been doing a lot, though, is developing my own apps, like productivity apps. Back in the day I grew up as a coder, using Bash, the terminal, and Python, writing myself scripts for all kinds of things to automate them. And now, with LLMs, that has shifted a bit toward developing native macOS apps. I always wanted to learn coding in Swift and never had the time, because there were so many other more important things to do, and this was an opportunity to say, hey, I want this, but as a native macOS app instead of a script, because it's just more convenient. For example, just the other day: my wife also has a podcast, a book club podcast, and I help her with the episodes, uploading everything, editing, just the workflow in general, because she's not a tech person. I had a script to add chapter marks to the podcast, and the other day I made a native macOS app where you can just enter the timestamps, click a button, and it adds the chapter marks to the audio file. Simple things like that. And I can share it with her and she can use it now. It's these little quality-of-life things in your everyday life: instead of doing things manually, you can automate them now. This is not running the LLM, but using the LLM to develop something that behaves deterministically, in a sense. I'm more that kind of person. For example, when I read social media feeds, as a researcher I'm mostly interested in papers.
So I often end up bookmarking a lot of arXiv links, links to arXiv PDFs or the abstracts, and I have a markdown sheet where I keep a lot of these links. Now I've written myself a native macOS app where I just put in these links and it pulls out the title, the date, the author names, and the link in a nice format. It just makes my life easier: I don't have to click on them individually, I get a nice list and see the titles. And I think for little things like that, LLMs are super cool for developing tools that I would not have time to develop otherwise, basically.
16:30
Yeah, that parallels my experience quite a bit. I think some of the most benefit I've gotten out of LLMs in the past year or so has come from writing custom workflow tools, primarily around the podcast. One of the things we do when we work with sponsors is pull analytics reports, and it was repetitive and time-consuming. So I created a web-based tool that hits the API where we get the analytics, pulls information about episodes, lets you choose an episode, pulls a bunch of data into pandas, does some analysis, and then generates a spreadsheet, like a Google Doc. That tool isn't using an LLM, but an LLM was used to create it. And it's one example of probably half a dozen fairly significant tools that have had a big impact on our workflow.
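The core of a tool like that is a few lines of pandas once the API data is in hand. The sketch below is hypothetical: the column names and numbers are invented for illustration, and a CSV string stands in for both the analytics API response and the Google Doc output:

```python
# Sketch of a sponsor-report tool: per-episode analytics in, summary out.
import io
import pandas as pd

# In a real tool this would come from the podcast host's analytics API.
raw = io.StringIO(
    "episode,downloads,completion_rate\n"
    "760,12000,0.71\n"
    "761,9500,0.68\n"
    "762,14200,0.74\n"
)
df = pd.read_csv(raw)

# Simple aggregates a sponsor report might include.
summary = pd.DataFrame({
    "total_downloads": [df["downloads"].sum()],
    "avg_completion": [df["completion_rate"].mean().round(2)],
})

# A real tool would push this to a spreadsheet; CSV stands in here.
print(summary.to_csv(index=False))
```

The point of the anecdote stands either way: none of this runs an LLM at report time, but an LLM can write the glue code in minutes.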
19:10
Yeah, and that's a good point: in these cases the LLM is not doing the regular work, the task itself; it's developing the tool that does the task. And I think that's an important point. The LLM is very useful and very capable, but there are tasks where it's almost wasteful to use an LLM. It's an "if all you have is a hammer, everything becomes a nail" type of situation. If you have a deterministic task, it still makes sense to develop a deterministic tool, and you can use an LLM for that. But it's almost wasteful to even ask an LLM what one plus one is. You can use a calculator. So I think it's still important to recognize what the nature of the problem is and what the best tool for that problem is.
20:23
Basically, I've also built some tools where I use LLMs almost like a classifier, a very simple use case. I have one where I take the name of the guest, pull a bunch of recent directories from the Google Docs API, and say, find the directory that corresponds to the project for this particular guest. A regex or text-pattern match doesn't always work, because the names can be formatted differently, but an LLM can do it pretty easily, with a very high level of repeatability and a low error rate.
21:16
Well, it's the kind of case where you need almost a human, or some less structured approach, and LLMs are great for that. I actually had a similar project as a college student. I was doing sports prediction as a side project, just for fun: daily fantasy sports, like predicting which player scores a goal in Premier League soccer on the weekend. For that, I developed this very sophisticated thing which pulled information about the players from different websites and looked at who was injured, who was in good form, these types of things. And now I've revived that project, just for fun, using an LLM. It's the same problem you mentioned with the names: some players' names are spelled slightly differently, there are accents over certain letters, some databases include a middle name and others don't, and just getting them lined up in the database is really hard with regex or deterministic code. So that is actually a great use case for an LLM: these unstructured, almost vague dataset-parsing tasks that also depend a bit on context, basically.
22:10
Yeah, yeah. So maybe pulling back to where we are with LLMs from a practical perspective, I think two main things came out of this. One, I was going to caveat this by saying "if you have a development mindset," but I think we've seen with vibe coding that even less technical or non-technical people can get a lot of value by creating custom tools to automate specific parts of their workflow. So that's a huge thing that I think has been very impactful for both of us over the past year or so. And otherwise, it's just taking advantage of the improvements in models. For me, I can't really articulate a rule set, but if I'm confronted with a particular thing, I have a soft mental model for whether I'll start with ChatGPT here, or start with Claude for this or that, you know.
23:25
24:40
So I think the takeaway is that neither of us is using OpenClaw or any particularly slick LLM-wrapper agentic tools with any regularity. Maybe the caveat for me would be something like Circleback or Granola for meeting summaries. But beyond that, it's mostly, like you described, use cases through the native chat interfaces and the development-oriented use cases.
24:41
I would maybe add one more thing to what you mentioned. It's mostly a slider: you can not use LLMs at all and still do everything manually, or you can use only LLMs. I know some people who build, let's say, even a company just on LLM-generated code. People call it vibe coding, but I think vibe coding doesn't even do it justice anymore: not doing any manual coding, just having the LLMs build the website, the product, everything. So those are the two extremes, and I think we are more in the middle, where we adopt LLMs but are not going full LLM. And I think there's still a question for people who are learning how to program nowadays: is it worthwhile? I think it is actually still worthwhile to learn math and coding, even though there are LLMs that can do that, because it makes your life more efficient and it makes you better at using these LLMs. As an example, I was using an LLM on my website to add a dark mode. That's something I always wanted to do. I wrote the website myself about 12 years ago, and I knew HTML, CSS, and JavaScript much better back then than I do now. I always procrastinated on adding a dark mode button because I knew it would take me maybe a month to do it well, and it's not my main job, so I couldn't spend that much time on it. But then I thought, hey, let me try using an LLM for that. And it did a really good job adding it, but it was not perfect: the button was misaligned and so on. So I kept saying, hey, make it a bit higher, make it a bit lower, move it to the left. And then I realized this was actually very inefficient. Why don't I just go into the HTML or CSS file and adjust the settings there?
And because I still knew a bit about CSS files, it was more effective to make these adjustments myself, instead of having the LLM do everything and brute-force telling it, oh, move it that way, move it this way. I could just change them myself, refresh the page, and see. In that sense, I think it does still make sense to have an understanding of how these things work, because there are cases where it's just more efficient to do things yourself rather than prompt the LLM to redo everything. So what I wanted to say is that there's a middle ground, basically, and I do think there's still value in learning how things work.
25:21
I wonder what your experience is. Around these new model releases, I often see on social media, "oh, I one-shotted this, I one-shotted that." I'm trying to remember the last time I had that experience: I'll go and try to one-shot the same thing, and the results I get are horrible, nothing like what's reported on social media. So, is it me, or are people reporting these successes for engagement when they're not really there, or they're fake? What's your sense? Do you experience similar things?
27:55
Yeah, I would say so. I mentioned my native Mac apps. I have a Mac app where I just put in a PDF and it exports PNG, WebP, and PDF versions at a certain resolution, and even that took multiple tries, back then with Codex 5.2, to get everything, all the buttons, working correctly. Like you said, it was not one-shot at all; it took multiple iterations to get it to work, even for something simple like that. And then I sometimes wonder: are my instructions bad, or was I not clear? Maybe you have to say, please test everything thoroughly and make sure everything works. Maybe you have to be super explicit about that, and we are not, because we kind of assume it would make sure everything works. Or maybe the cases we see are just lucky; sometimes, on certain things, it just happens to work very well. So I don't know for sure, but I agree with you that it's not all what it seems when someone shows you "oh, I one-shotted this." I don't think that's reflective of how things work today.
28:35
So let's switch gears a little bit and talk through some of the key areas where you expect to see continued innovation around LLMs in the upcoming year. Then, for each of them, we'll dig in and talk a little bit about the recent history and where you expect to see things going. What are the big themes for the year?
29:42
I would say it's still going to be reasoning; we can go into more detail there, because it's a very broad topic. So, pushing more on the reasoning front, on post-training. The second one, I would say, is inference scaling: more sophisticated techniques that are partly related to training, but mostly about how to use the LLM after training. And then I also think we will see more of this agentic type of use. Right now, LLMs are mostly used turn by turn, and I think people and companies will double down on this loop, running the LLM in a loop like Moltbot does, and optimizing for that. I think these three things will be the biggest focus areas for companies.
30:06
So let's dig into reasoning to set the stage for where you think we'll be heading in 2026. What do you think were the big advancements in 2025 around reasoning?
31:00
So yeah, the biggest advancement was, first, OpenAI o1, which got everyone excited. OpenAI o1 was using inference scaling and, no one knows for sure because there's no paper, but likely also training techniques. But then with DeepSeek-R1, they published their reasoning pipeline, and that was really something that took off, where a lot of other companies also doubled down on it. But it's still very new in the grand scheme of things, just about a year old, and there have been so many improvements to the algorithm. I was recently working on a chapter on reasoning, and just the other day I compiled a list of 15 different tweaks and improvements, from basic things like changing sequence-level log probs to token-level, to GDPO by NVIDIA. There's lots of progress there, and I think we will see more of it. One reason is that with pre-training, we have seen that it still works, and I think it's still the biggest part of the whole training pipeline, because it's just so much data and very expensive. But the R&D focus of the research teams, I think, is more on post-training nowadays, getting more performance out of that, because it's the newer paradigm and there are still low-hanging fruits to be picked, whereas pre-training is already pretty sophisticated. You still need a lot of data and a lot of compute, but there's not much more you can do there, compared to post-training, in terms of changing up the algorithms to get more performance. Of course you can still do that, and you will still get better results if you use more data, optimize the data mix, maybe add multi-token prediction and these types of things.
But most of the interesting things are happening now on the post-training front, in the reasoning realm basically. So I think we will see more
31:12
there on the reasoning front. The one topic that I heard come up quite a bit last year is the idea of verifiable rewards. And I think that led to, or contributed to, a lot of the advancements that we saw in terms of coding models. Can you talk about that as a paradigm and some of the big milestones that we've seen there over the past year?
33:12
Yeah, thank you for the question. That's actually a really, really important point. So the reasoning training is essentially mainly based on verifiable rewards, which means there are tasks where you can verify the answer. For example, in DeepSeek R1 the verifiable rewards were coding and math. With math, you ask the model to output the final answer in a boxed format, it's a LaTeX command, boxed, and then you can have deterministic code, like a regex, to extract the answer. And then you can use something like Wolfram Alpha or SymPy to compare this answer symbolically to a reference answer: 4/6 matches 2/3, it's essentially the same answer, but you can symbolically double-check it and get a reward signal for whether it's correct or not. And this is great, because you can evaluate an essentially infinite number of answers. Before, with reinforcement learning from human feedback, and it's still an important technique, you need human feedback. You can train a reward model to approximate that, and as part of training you get a score for each answer, but it's not quite as accurate as a truly verifiable answer. In math there's an absolute: it's either correct or not. And if you can verify the answer deterministically and cheaply, you can have the LLM generate as many answers as you like. You can say, okay, generate 60,000 answers for this problem, and then calculate the reward on all of them in a very short time. It's still expensive to generate these answers, but you don't have vagueness and you don't need humans evaluating them. And I think that helps with scaling these things.
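The boxed-answer extraction described above can be sketched in a few lines. In practice you would use SymPy or a symbolic engine for the full equivalence check; this minimal sketch uses the standard library's `fractions` just to show the idea that 4/6 and 2/3 count as the same answer. The function names are illustrative, not DeepSeek's actual pipeline.

```python
import re
from fractions import Fraction

def extract_boxed(text):
    """Pull the contents of the last \\boxed{...} in a model output."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def math_reward(model_output, reference):
    """1.0 if the boxed answer equals the reference (compared here as
    exact rationals; a real pipeline would compare symbolically)."""
    answer = extract_boxed(model_output)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):  # unparseable answer -> no reward
        return 0.0

print(math_reward(r"...so the answer is \boxed{4/6}", "2/3"))  # 1.0
```

Because the check is deterministic and cheap, you can score thousands of sampled answers per problem, which is exactly what makes this reward scalable.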
And the same with code, where in the DeepSeek R1 paper the original approach was to make sure that the code compiles correctly, and you can also use a code interpreter for that. I think both are great, but this is just the beginning. We will probably see this extended beyond just the correctness reward. There are already other types of rewards being added, for example a formatting reward. It's not required, but some companies prefer to have the thinking inside think tags, so there's an opening think token and a closing think token, like the opening and closing tags in HTML. It can be helpful, because then you can parse out the intermediate reasoning and do something with it, so you train the model to output this structure, and that's called a format reward. So you can have multiple types of rewards added on top of the correctness reward, and I think we will see interesting things there, where people come up with formatting rewards or auxiliary rewards that help the overall model learn. One thing they also tried in the DeepSeek R1 paper was to evaluate the answer explanation, not just the final answer: evaluating whether the reasoning, the explanation, is correct or not. A process reward, exactly. Yeah, this is called a process reward model: basically another model that you train to give a score for the explanation. But I remember, it's been a while since DeepSeek R1 came out, in the paper they had a section where they listed that as a failed or unsuccessful attempt. They tried it, but they found it increases the chance of reward hacking.
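A format reward like the one described can be a simple deterministic check. This is a sketch assuming the model is supposed to emit exactly one `<think>...</think>` block before a non-empty final answer; the exact tags and rules vary by model family.

```python
import re

def format_reward(output: str) -> float:
    """Return 1.0 if the output wraps its reasoning in a single
    <think>...</think> block followed by a non-empty final answer."""
    text = output.strip()
    # exactly one opening and one closing tag, in that order
    if text.count("<think>") != 1 or text.count("</think>") != 1:
        return 0.0
    match = re.match(r"<think>.*?</think>\s*\S", text, flags=re.DOTALL)
    return 1.0 if match else 0.0

print(format_reward("<think>2 + 2 = 4</think> The answer is 4."))  # 1.0
print(format_reward("The answer is 4."))                           # 0.0
```

During RL training this would typically be added to the correctness reward, e.g. `total = correctness + 0.1 * format_reward(output)`, with the weighting being a design choice.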
And then it was just not worth it. It's more expensive, and it resulted in reward hacking, the model exploiting the setup, because it's easier for the model to mislead the model that evaluates it. So it is still tricky to do. But in recent months there have been some more interesting success stories, like DeepSeek Math version 3.2. They use something like that, where they evaluate the whole answer with a rubric, have another model for that, and then another model that evaluates the model with the rubric, and so forth, multiple levels. But that seems to work, and they had ablation studies showing this is actually helping. I think we will see more of that. It's just a very new paradigm, making the reasoning training more sophisticated essentially. Yeah.
33:44
Right now the verifiers are focused on math and coding, because for a given response there's a concrete ability to verify. Do you see this verification paradigm expanding beyond math and code? I think the focus on math and code is successful in part because, even though not all LLM responses are about math and code, those domains have an inherent logic or reasoning structure, so the model's ability to reason generalizes to non-math, non-code problems. But do you see a focus on expanding this idea of verification beyond math-and-code types of problems?
38:47
Yes, it's actually a very interesting and important point. You mentioned that if you train the model on reasoning in math, it will also become better at reasoning in general. But it would be even better if you have a target domain and train the model specifically on reasoning in that domain. I think you're right, there will be more of that. Right now I lack the creativity to come up with many examples of problems that can be verified, but I would say maybe something biology-related, like pharmaceutical drug design or protein structure modeling, where you have physical constraints: the angles between atoms can only take certain values, and so forth. You could probably have a physics-type equation that double-checks whether the generated molecule adheres to those constraints, and use that as a form of reward when training the model. This is maybe not a typical case of reasoning, because what is the reasoning explanation when you're generating a molecule, right? But in general, something like that for other fields. And in the worst case, as a rougher approximation, you can always train another model that provides the correctness reward. This is more challenging, though, because it's susceptible to reward hacking. It goes back to generative adversarial networks back in the day, where it's easy for the generator to collapse. You have the discriminator, which says whether an image is real or generated, and you train the generator to fool the discriminator while the discriminator gets better at distinguishing. You have almost a similar setup here: you can use a model to give a reward or not.
But then the model may exploit it: at some point it learns a trick, like, if I only generate this one word, then I fool the evaluator. But I think we'll see more of that, developing AI-based reward models that can then be used in other fields to train better reasoning models.
39:43
Beyond increased focus and tweaks to the verification models, are there other areas that you see as contributing to stronger reasoning going forward?
42:10
Yeah, the training is one part, but the other one is inference scaling: you can get much better performance if you use simple, in quotation marks, techniques after training, because nothing is really simple. The definition of inference scaling is essentially spending more compute after training, during inference, when someone uses the model to generate the answer, and you can do it in multiple ways. Reasoning models themselves are already a form of inference scaling, because they generate more tokens than regular models. The explanation is longer than what a regular model provides, but it often helps the LLM reach the correct answer. That's sequential inference scaling. You can also have parallel forms of inference scaling, where you just generate multiple answers, and that's called self-consistency. For example, for a math problem, you can have the LLM answer the question multiple times with different temperature settings and then take a majority vote. There are different ways to do it: different scoring methods, or other LLMs that look at all the answers and give you the most likely correct one. With that you can also boost the performance of the model. It's more expensive, though, so it's not one-size-fits-all: you don't want to use it all the time, you use it when you need it. I think what will be interesting is improving the way we tell when it's needed. When ChatGPT, was it 5.1 or 5, launched, they had that automatic setting we talked about in the beginning. It was very bad at first, but I think it got much better over the following months.
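The parallel form just described, self-consistency, is simple to sketch. Here `generate` is a hypothetical stand-in for sampling an LLM at a nonzero temperature; in the sketch it cycles through a fixed list of answers so the behavior is deterministic.

```python
import itertools
from collections import Counter

def self_consistency(generate, prompt, n=8):
    """Parallel inference scaling: sample n answers and return
    the majority-vote winner."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical sampler: a model that usually, but not always,
# lands on the right final answer.
_samples = itertools.cycle(["42", "41", "42", "42"])
def generate(prompt):
    return next(_samples)

print(self_consistency(generate, "What is 6 * 7?"))  # 42
```

The best-of-N variant mentioned later replaces the majority vote with a separate scorer that picks the highest-rated answer; the sampling loop stays the same.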
And I'm not quite sure we have anything like that in the open-source, open-weight ecosystem, listeners may correct me here, but I can see something like that becoming more important. On the one hand we are developing these very expensive models that can solve very hard problems, like Math Olympiad problems, but we don't want to use them all the time because they are slower and more expensive. At the same time, there's also a focus on cheaper models. For example, just the other week, Qwen3-Coder-Next came out. Qwen 3 is one of the most widely used open-weight model families, because they have a lot of really high-quality models in all different sizes. The Next model is essentially a hybrid, it's not a pure transformer anymore; it's inspired by state space models to make things cheaper. So there's always this trade-off: people are developing higher-accuracy models, and people are developing cheaper models. One way is changing the architecture to control quality and price; the other is inference scaling. But right now, in the open-weight ecosystem, it's not quite as popular yet, so I think we will also see more of that in local tools and so forth.
42:25
I don't know of an open-source project or model that incorporates this. But from conversations I do get the sense that a lot of companies building around the Qwen models, for example, and other open-weight models commonly have a router component in their architecture that tries to assess the complexity or category of a prompt and routes it to the model and prompt that is either most economical or maybe post-trained for better responses, that kind of thing. My sense is that that's the common approach to addressing the challenge you're describing.
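A router like the one described can be as simple as a small classifier in front of two models. Everything here is a hypothetical sketch; in particular, the word-count heuristic stands in for what would normally be a small trained difficulty classifier.

```python
def route(prompt, cheap_model, strong_model, difficulty):
    """Send easy prompts to the cheap model, hard ones to the strong one.
    `difficulty` is any callable returning a score in [0, 1]."""
    return cheap_model(prompt) if difficulty(prompt) < 0.5 else strong_model(prompt)

# Toy difficulty heuristic and stand-in models, for illustration only.
difficulty = lambda p: min(len(p.split()) / 50, 1.0)
cheap = lambda p: f"[cheap] {p}"
strong = lambda p: f"[strong] {p}"

print(route("What is 2 + 2?", cheap, strong, difficulty))  # [cheap] What is 2 + 2?
```

The economics come from the fact that most traffic is easy: if 80% of prompts route to a model that costs a tenth as much, average cost drops by roughly 70% while hard prompts still get the strong model.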
45:51
Now that you mention that, another example comes to mind: the gpt-oss model, the open-weight model by OpenAI, which came out last summer. With that model, even if you use a very simple tool like Ollama or any comparable tool, you can set the reasoning effort in the system prompt, low, medium, or high, and inference scales based on that reasoning effort. But I don't think any other technique, like self-consistency or self-refinement, is really incorporated automatically. You mostly have to do it yourself as the researcher.
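To make the system-prompt knob concrete, here is a sketch of what such a chat request might look like. The message structure follows the common chat-message convention; the exact field names and the phrasing of the effort instruction vary between tools, so treat this as illustrative rather than any tool's exact API.

```python
def build_messages(question, effort="high"):
    """Build a chat request for a gpt-oss style model where the
    reasoning effort is set through the system prompt."""
    assert effort in ("low", "medium", "high")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": question},
    ]

msgs = build_messages("Is 97 prime?", effort="low")
print(msgs[0]["content"])  # Reasoning: low
```

The appeal is that this is a zero-infrastructure form of inference scaling: the same weights spend more or fewer thinking tokens depending on one line of the prompt.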
46:39
Can you talk a little bit more about self-refinement and self-consistency and how folks use those techniques?
47:19
Yeah, so self-consistency and self-refinement are two examples of inference scaling. The biggest difference between the two is that self-consistency is a parallel technique: it generates multiple answers and you choose the final answer based on a majority vote. Or you can have a scorer that assesses the answers, and then people call that technique best-of-N.
47:27
Best-of-N, or quorum, or that kind of thing.
47:54
Yeah, it's essentially an ensemble technique, almost like classic ensembling. The other one is self-refinement, where you have the LLM generate the answer and then feed the answer to another LLM, or back to itself, and say: here's the question, here's the answer, write a summary of whether the answer is likely correct and what its weaknesses are. You almost provide a rubric with certain things the LLM should check, and it gives you back a report: this could be better, this is likely incorrect, the explanation doesn't match the final answer. Then you feed that report back to the original LLM and say: look at this report and refine your original answer based on it. Often this can lead to the LLM improving its own answer. It's almost like that phenomenon where you ask ChatGPT something, say when a certain model was released, and you see the year is totally wrong, so you tell ChatGPT it made a mistake, and it says, oh yeah, you're right, I made a mistake, and it tries again and does better next time. It's the same mechanism, except the model refines its own answers. Based on my experiments, it can also sometimes make answers worse: it will overthink, or the original answer was correct but the feedback is bad and it turns the answer incorrect. So it's not a foolproof technique, it comes with caveats. But in the DeepSeek Math version 3.2 paper, where they had self-refinement in a more sophisticated way, with a third model evaluating the evaluator, they showed a nice plot of how much the accuracy can improve. I don't know the numbers off the top of my head, but
when they cranked up the self-refinement and self-consistency, they were able to reach gold-level performance in certain math competitions, which was very impressive given it was still the same model as before; they just cranked up the inference scaling.
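The critique-and-revise loop described above can be sketched as follows. Here `llm` is a hypothetical completion function (a real one would be an API call), and the prompt wording is illustrative rather than any paper's actual rubric.

```python
def self_refine(llm, question, rounds=2):
    """Sequential inference scaling: generate, critique against a
    rubric, then revise, for a fixed number of rounds."""
    answer = llm(f"Question: {question}\nAnswer the question.")
    for _ in range(rounds):
        report = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Check: does the explanation match the final answer? "
            "List likely mistakes and weaknesses."
        )
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {report}\nWrite an improved answer."
        )
    return answer

# With a stub model the loop just runs end to end.
print(self_refine(lambda prompt: "stub answer", "What is 6 * 7?"))
```

Note the cost: each round adds two extra model calls, which is exactly the compute-for-accuracy trade-off being discussed, and as mentioned, more rounds can also make answers worse.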
47:56
One thing that's interesting, reflecting on these themes, is how interrelated they all are. Reasoning is a key theme; reasoning is enabled by inference scaling; and a lot of what we hear when we talk about inference scaling is loops and recursion. Those are also key ideas in the third theme you mentioned, agentic uses of LLMs. With that as a segue, talk a little bit about what you've seen thus far around agentic systems and what you think is exciting in that space.
50:22
I would say, yeah, the agentic use cases. Even simple, in quotation marks, things like Codex or Claude Code, where the model runs multiple iterations to solve a problem. It's not just one shot; it's doing a task rather than just providing an answer. Moltbot would be another example of an agentic system. Agentic is, I would say, almost a not-well-defined term, because people use it differently, but for this podcast maybe we can think of agentic as something that runs in a loop. And that is something we will see more of. Recently, Claude Code and the GPT-5.3 Codex app added tasks where you can schedule something and it runs on a recurring basis, for example. I think we will see more of that; it's just the beginning, and it will be more like plugins. It's still the same LLM; it's about how we use the LLM and how to get the most out of the context, feeding the context back in. There has not been that much focus on this in the open-weight, open-source community, where the focus is more on developing the LLM itself, whereas companies like OpenAI and Anthropic are more like: let's build these tools so we can do more and more impressive, bigger things with these LLMs. Maybe by the end of the year we will have systems that can reliably book a trip to some holiday vacation destination. There were already tools that promised to do that; I think one was called Devin, something like that, and it might still exist. Oh yeah, Manus, right, yeah. But I think this is just the beginning. And most people, I don't think they need a full-blown thing that can do everything.
They just maybe need a plugin for Excel that at certain intervals updates certain things, where the spreadsheet goes to the Internet and pulls the recent stock price or something like that, but in a loop-type setting essentially.
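The "something that runs in a loop" framing can be made concrete with a minimal sketch. The `tool: argument` / `done: answer` text protocol here is invented for illustration; real agent frameworks use structured tool calls, but the shape of the loop is the same.

```python
def agent_loop(llm, tools, task, max_steps=10):
    """Run the model in a loop: each turn it either calls a named tool
    or finishes with an answer; tool results are fed back as context."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        action = llm("\n".join(context))
        if action.startswith("done:"):
            return action[len("done:"):].strip()
        name, _, arg = action.partition(":")
        result = tools[name.strip()](arg.strip())  # run the tool
        context.append(f"{action} -> {result}")    # feed the result back
    return None  # gave up after max_steps

# Scripted stand-in for an LLM, for illustration only.
script = iter(["calc: 2 + 2", "done: the result is 4"])
tools = {"calc": lambda expr: eval(expr)}  # toy calculator tool
print(agent_loop(lambda ctx: next(script), tools, "add 2 and 2"))  # the result is 4
```

The `max_steps` cap and the growing `context` list are the two knobs that matter in practice: the first bounds cost, and the second is the context engineering problem discussed later in the conversation.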
51:08
Yeah. One of the things we've heard a lot about in the context of agentic uses of LLMs over the past year or two is the idea of multi-agent systems: decomposing a problem into independent agents with their own personas and that kind of thing. And the whole OpenClaw idea, even today I'm seeing a lot of, hey, I created my AI team, my AI employees, and there's this employee and that employee, and they talk to each other using Slack or Moltbook or whatever. What have you seen, from a concrete builder or technical perspective, with this multi-agent
53:35
use case,
54:42
Are you finding folks getting a lot of value out of that?
54:43
To be honest, I wish I had a really good or interesting answer, but this is something I've not explored personally. Most of my experience is with single use cases, where one LLM provides solutions or tackles a specific task, but mostly doesn't interact with other agents. I see it more as a context engineering problem, though: the LLMs themselves, I don't think they are the bottleneck; it's more about how you get the results from one LLM and provide them to another. In that sense it's almost like image or video generation, where you have one model parsing or improving the text input and then passing that to the part of the model that generates the output, the diffusion part or the transformer-based diffusion part. Multi-agent systems are a more sophisticated form of that: how do we provide the right context to the different agents? It could be anything from basic databases to using Slack, where one model posts something and the other model ingests it via the API. That is, I think, something that is just getting started, also with Moltbot and OpenClaw, and I think we'll be seeing a lot more of it. But that's all I can say, because I personally don't have concrete experience; I haven't worked on this myself yet.
54:46
Do you have a sense for where we'll see focus and innovation around these agentic uses in the upcoming year? Or maybe what the gaps are, what really needs to be worked on for them to come into their own?
56:24
I do think each LLM still has its own failure rate. Progress is usually measured by how long the LLMs can work autonomously, how long they can work until they fail. And the more models you add, the higher the risk that one of them fails if they depend on each other. So I think improving the model itself will also be the main way to improve the whole system. But I can also see, as far as I know from what is publicly available, that these are still the vanilla LLMs in Claude or other APIs; they're not specifically trained to interact in a multi-agent setting. If you prepare data for training these agents in a multi-agent setting, like a fine-tuning type of situation, I think you can get more performance out of them. We have seen that even for simpler cases: GPT-5.2 Codex or GPT-5.3 Codex is not the same as GPT-5.2 or GPT-5.3; those are models that were forked off and then specifically trained to work with the Codex app, basically. I think we will see something like that for these agent models too. It's just harder for the consumer to do, because we don't have access to these models, so we depend on whoever owns and hosts the LLMs to do this type of training. I can see companies developing something like this. If I had to bet, Anthropic and OpenAI have really paid attention to what Moltbot or OpenClaw is doing, and they may come up with their own version that is maybe even more capable, because they control the model and can fine-tune it for these interactive multi-agent environments.
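The compounding-failure point is just multiplication: if each agent in a dependent chain succeeds with probability p, the whole chain succeeds with probability p to the power k, which falls off quickly. A quick illustration:

```python
def chain_success(p, k):
    """Probability that a chain of k dependent agents, each succeeding
    independently with probability p, all succeed."""
    return p ** k

# Even fairly reliable agents compound badly when chained.
for k in (1, 3, 5, 10):
    print(k, round(chain_success(0.95, k), 3))
```

With p = 0.95, a single agent succeeds 95% of the time, but a ten-step dependent chain succeeds only about 60% of the time, which is why improving the base model's per-step reliability lifts the whole system.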
56:45
Yeah. One of the things that's interesting looking back is that a lot of the things we might see as big advancements over the past year or two are, from an architecture perspective, relatively incremental; the fundamental core architecture has been fairly stable. There have been a handful of proposals for where we might go beyond LLMs, but the core hasn't changed much. Do you agree with that? How do you think about the future of LLM architecture?
58:49
Yeah, that's an interesting question. I would say everything I'm saying here comes with an asterisk, because DeepSeek V4 is not out yet; it might change everything completely. But if we just look at 2025 up to the second week of February, I don't think there were any fundamental changes in the state-of-the-art architecture. One thing we have to distinguish is that there are architecture changes geared towards doing the same thing more efficiently, and there are architecture changes geared towards getting more modeling performance, more accuracy, out of the model. If we look at the models that push the state of the art in modeling performance, there haven't been that many changes recently. Looking at 2025, mixture-of-experts models have been making a comeback. There were other models like Mixtral and DeepSeek-MoE before, but they really became popular after DeepSeek V3 came out, and DeepSeek V3 became popular because of DeepSeek R1, which is basically a post-trained version of DeepSeek V3. A lot of companies then adopted this architecture. I think Kimi straight up used that architecture and scaled it from 671 billion to 1 trillion parameters. Even the European company Mistral AI used the DeepSeek V3 architecture. So a lot of people are not gambling, in the sense of, let's try something completely different; they take something that works and try to make progress through changes in the data and the algorithms. But that doesn't mean there are no new ideas. DeepSeek V3, besides MoE, the mixture of experts, had multi-head latent attention; I think it was also in one of their previous papers.
Multi-head latent attention is essentially a tweak of the attention mechanism where you keep an intermediate, smaller, compressed state of the keys and values. The keys and values are the important ones to compress, because then your KV cache becomes smaller: you don't store the full keys and values in the KV cache, just a compressed form, and then you reconstruct the keys and values from that compressed form during inference. So you are basically trading compute for memory. Maybe to explain this a bit better, you can think of it like LoRA, low-rank adaptation: you project down into a compressed space and then project up again. That's basically multi-head latent attention, an interesting tweak in 2025 and 2026 that people adopted. And then it was again DeepSeek V3.2 that had another tweak, sparse attention. Sparse attention is also not new; there has always been research on making attention cheaper, because it scales quadratically with sequence length, and there have been hundreds if not thousands of papers. But with papers I'm always a bit careful; the ideas are interesting, but I always wait to see them in production, in quotation marks. What I mean is seeing it in a flagship model, because an idea might work well in a small model but fall apart once you scale to 500 billion, 600 billion, 1 trillion parameters. DeepSeek is a nice case study here, because they have this flagship model, and if they use something in that flagship model, you basically know it works at scale. And they have their own version of sparse attention; I think they literally call it DeepSeek Sparse Attention.
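The LoRA-like down-/up-projection view of multi-head latent attention can be sketched with plain matrices. The dimensions are illustrative, and this ignores details of the real DeepSeek design (per-head splits, decoupled positional components); the point is only the cache-size trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 512, 64, 10

W_down = rng.standard_normal((d_model, d_latent)) * 0.02  # compress
W_up_k = rng.standard_normal((d_latent, d_model)) * 0.02  # rebuild keys
W_up_v = rng.standard_normal((d_latent, d_model)) * 0.02  # rebuild values

h = rng.standard_normal((seq_len, d_model))  # hidden states
c = h @ W_down   # the KV cache stores only this compressed latent
k = c @ W_up_k   # keys reconstructed during inference
v = c @ W_up_v   # values reconstructed during inference

# Cache cost per token: d_latent floats instead of 2 * d_model.
print(c.shape, k.shape, v.shape)  # (10, 64) (10, 512) (10, 512)
```

In this toy setup the cache shrinks from 1024 floats per token (full keys plus values) to 64, at the cost of two extra matrix multiplies per decoding step, which is exactly the compute-for-memory trade described above.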
They have a lightning indexer, a small, cheap model in a sense: instead of one token paying attention to all the previous tokens, it's more selective, it selects which tokens to pay attention to. It's like a mask: you calculate a mask over all the tokens to select a subset, to make attention cheaper, to make it scale sub-quadratically, basically. There have been these types of tweaks, but they don't fundamentally change how attention works; it's still the same attention mechanism, just made cheaper. So people are honing in on what works at the moment, but maybe in 2026 one of the flagship models will have a fundamentally different approach. Little changes have also been made in terms of alternative architectures. We mentioned Qwen 3 earlier; Qwen 3 is one of the flagship models, though maybe not at the top anymore because it's a bit older, it came out in the summer, but when the Qwen models come out they are usually at the top of the leaderboards. They also had a parallel version of their model, which they called Qwen3-Next, and that one tried something different: a hybrid attention mechanism with Gated DeltaNet, basically more of a state-space-model approach where attention is more linear. So people are trying things, but not necessarily in their flagship model; they try other things in parallel. And I think this makes sense: you don't want to put all your eggs in one basket. You want to have a good model, and then maybe try something on the side and scale it up later if it works well. Yeah.
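The selective-attention idea can be sketched as a top-k mask over an indexer's scores. This is a toy version: the real DeepSeek Sparse Attention indexer is a learned component and the selection rule differs, but the shape of the computation is the same.

```python
import numpy as np

def top_k_causal_mask(scores, k):
    """For each query position i, keep only the k highest-scoring
    past (and current) positions; attention cost then grows with
    seq * k instead of seq squared."""
    seq = scores.shape[0]
    mask = np.zeros_like(scores, dtype=bool)
    for i in range(seq):
        visible = scores[i, : i + 1]     # causal: only tokens 0..i
        keep = np.argsort(visible)[-k:]  # indices of the top-k scores
        mask[i, keep] = True
    return mask

scores = np.arange(16.0).reshape(4, 4)  # stand-in for indexer scores
print(top_k_causal_mask(scores, 2).sum(axis=1))  # [1 2 2 2]
```

Each query then attends to at most k tokens regardless of sequence length, which is where the sub-quadratic scaling comes from; the indexer itself must be cheap enough that scoring all positions doesn't reintroduce the quadratic cost.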
59:27
What about continual learning? That comes up frequently as an opportunity, particularly before we got really good at incorporating tools and the ability to do searches, because models would get stale very quickly. But there's still this interest in having a model whose training data we can keep updated: we can delete things, we can incorporate new knowledge. Do you foresee significant innovation in that area?
1:05:32
Yeah, I think this is maybe the biggest dream, in the sense of: how can we make the model improve itself? It's maybe the biggest achievement that could be made, if someone figures out a way to make it work. But right now there's not even a pathway to this; there's nothing where you would say, that's the thing that will give us reliable continual learning. That being said, there are already forms of continual learning, I would say, though they're more controlled: instead of the model automatically updating itself, people collect data from the recent Internet or recent tasks and then carefully update the model. So it's not that we don't update models, but we don't do it fully automatically; it's an almost semi-automatic type of thing. And that's not only because it's more reliable, since it's risky to just update a model on new data, but also because of resource constraints. I don't know how many copies of the model OpenAI has, but you definitely can't have a single copy per user; that would be way too expensive. Everyone would have to have a little supercomputer at home, like a hundred-thousand-dollar computer, to run a big flagship model. So companies can't just update everything on the fly for each user; that would be infeasible. Unless we have models that run only on the personal device, I don't think we can have really good continual learning. And the other thing is, you have to be really careful how you update it. You don't want to make the model worse.
Because it's such an important, expensive product: just think about feeding the data back to OpenAI, and then OpenAI automatically updates the model, and maybe there's a bad update and it disrupts everything for everyone. So I think it's more of an infrastructure and security type of issue. But otherwise, if you look at the reasoning training we talked about, reinforcement learning with verifiable rewards, if you run this on correct answers and just keep it running, it is a form of continual learning in a sense. You can technically just keep running it; you just want to be more selective, basically.
1:06:11
And do you think that longer contexts alleviate some of the pain, or the need for continual learning, in your case of personalized models? One approach is to take new information and continually learn against it. Another that I think folks have played around with is to create personal LoRA adapters for a model. But a third is to just put that new information into the context and use it at inference time.
1:09:05
I would say yes and no. I do think long-context LLMs have enabled so much recently. Before, people were building RAG systems, the retrieval-augmented generation systems, and now, I wouldn't say they're obsolete, they're still very useful if you have a fixed, big database or document set that you use repeatedly. But if you're a regular user, even with a fairly long document, you often don't need them. A thousand-page PDF may be stretching it a bit, but a 200-page PDF you can have in context. You don't need to fine-tune the LLM on that data, you don't need a RAG system; you can do a lot in context. And like you said, the same is maybe true for new information, where you could technically just provide all the relevant new information in context. But I think that only gets you so far, because you as a user also have to know what information to provide. Then if you couple that with tool use: for example, if the data cutoff is 2025 and you ask about a 2026 historical event, the LLM can still use a web search, it can still use a tool and look it up on the web. So you don't necessarily need to update the LLM for that particular historical event. But if the event has a lot of ramifications and affects a lot of things around it, that might be missed: you get certain facts from a tool call but not the whole interaction with other data points. So it's not fully replacing the updating, but it makes it less necessary, or at least not necessary quite as often.
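The in-context-versus-RAG tradeoff described above can be sketched roughly as follows; everything here (`build_prompt`, the 4-characters-per-token estimate, the word-overlap chunk scoring) is an illustrative assumption, not a real library:

```python
# Sketch of the decision: if the document fits the context window,
# just put it in the prompt; otherwise fall back to naive retrieval.

def build_prompt(question: str, document: str, context_limit: int = 128_000) -> str:
    # Rough token estimate: ~4 characters per token for English text.
    approx_tokens = len(document) // 4
    if approx_tokens <= context_limit:
        # A ~200-page PDF usually fits a modern long-context window,
        # so no fine-tuning and no retrieval index is needed.
        return f"{document}\n\nQuestion: {question}"
    # Fallback: naive retrieval — score fixed-size chunks by word overlap
    # with the question and keep only the top few.
    chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]
    q_words = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return "\n---\n".join(scored[:3]) + f"\n\nQuestion: {question}"
```

A production RAG system would use embeddings and a vector index instead of word overlap, but the branching logic, context when it fits and retrieval when it doesn't, is the point being made above.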
1:09:46
So yeah, your kind of big-picture thoughts on where the field will be focused over the next year: again, reasoning, inference-time scaling, agents. Any other thoughts or predictions that come to mind for you?
1:11:48
Yeah, I will be curious to see. It's a little thing, but like we talked about, there is no big alternative to the transformer architecture yet. There are, though, things like text diffusion models, and Google, for example, has had a waitlist page up; they're planning to launch a text diffusion model, not a small one, but an alternative that I'm really curious about. It's more something I want to see: maybe that's going to be replacing, say, the free tier of LLMs, and that would be really interesting. The main reason I'm interested is that there's been a lot of research on text diffusion models. It's a different take: instead of generating the text sequentially, it's more like a BERT model, where you have masks and then gradually denoise, replacing the masks with text. I just want to see how it performs at scale, because right now most of these are research models. It's nothing I think people should get excited about in terms of cutting-edge performance, but it will maybe be cheaper and faster, and maybe that brings everyday improvements, even for things like the Google Search summaries, which are also LLM-based but not the best. Little quality-of-life improvements like that. Also, when we're recording this, it's before the Chinese New Year, and historically there have always been a lot of open-weight model releases around the Chinese New Year. So maybe there's a little surprise in there; maybe we'll see DeepSeek version 4, and maybe there's a bigger change. So I'm interested in following that and seeing what happens. But yeah, off the top of my head, I think we covered pretty much everything.
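The masked-denoising idea behind text diffusion models can be illustrated with a toy loop. The "model" below is just a random token picker over a tiny vocabulary; a real system would use a trained BERT-style transformer to predict each masked position:

```python
import random

# Toy illustration of text diffusion: start from an all-[MASK] sequence
# and fill in tokens over several steps, instead of generating
# left-to-right one token at a time.

VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "[MASK]"

def denoise_step(tokens: list[str], rng: random.Random) -> list[str]:
    """Replace roughly half of the remaining masks with predicted tokens."""
    masked_positions = [i for i, t in enumerate(tokens) if t == MASK]
    for i in rng.sample(masked_positions, max(1, len(masked_positions) // 2)):
        tokens[i] = rng.choice(VOCAB)  # stand-in for the model's prediction
    return tokens

def generate(length: int = 8, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    tokens = [MASK] * length
    while MASK in tokens:
        tokens = denoise_step(tokens, rng)
    return tokens
```

The speed appeal mentioned above comes from this structure: each denoising step fills many positions in parallel, so the number of model calls can be far smaller than the sequence length.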
1:12:08
Let's maybe switch gears a little bit and update us on what you've been working on personally. You've referenced chapters of the book; talk a little bit about your current book and where folks can learn more about it.
1:13:58
Yeah. So I think last time I was on your podcast we talked about my Build a Large Language Model from Scratch book. It's basically the whole journey from building the architecture to pre-training a model and then doing instruction fine-tuning. The goal of that was not to build your personal assistant that does all the things at home for you, because that would cost fifty thousand or a hundred thousand dollars and be a lot of work. Even though it's simpler nowadays to train your own LLM, it's not something you can do routinely on a weekend. The goal of the book was to teach people how that workflow works, to understand how LLMs work, because that helps you use LLMs better: to understand what the context is, what the limitations of the context are, how attention works, and why it's more expensive if the input gets longer. If you build the LLM yourself, you get a really clear understanding compared to just explaining it in a more free-form way. A lot of people liked that, and it's now a very popular textbook for teaching, too. Since it's only one book, it could only cover so much, so I was really excited to work on the sequel. Right now I'm working on Build a Reasoning Model from Scratch. There's no overlap between the books; it can be read as a standalone book, but it's mainly focused on the reasoning techniques we talked about: the reinforcement learning with verifiable rewards, the GRPO algorithm, inference scaling, all the techniques you apply once you have a pre-trained LLM. So the book starts from a given pre-trained LLM, the smallest Qwen 3 model, and then adds inference scaling and the reinforcement learning.
The first 360 pages are already in early access, and I'm hoping to finish by April; there's only one more chapter left. But each chapter is a lot of work because you have to run all the experiments. I've been running a lot of experiments, especially for the GRPO algorithms, because there have been so many different papers and improvements, and trying them out in practice has been a lot of fun, but also a lot of work. So I've mostly been running experiments over the last couple of weeks and months, and it's quite exciting, actually.
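The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is a simplified illustration, not the book's implementation: sample a group of answers per prompt, score each with a verifiable reward, and normalize the rewards within the group.

```python
# Sketch of GRPO's group-relative advantage: completions scoring above
# their group's mean reward get a positive advantage (and are
# reinforced); those below get a negative one.

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one math question, checked against "42".
answers = ["42", "41", "42", "7"]
rewards = [1.0 if a == "42" else 0.0 for a in answers]
advs = group_advantages(rewards)
```

Unlike PPO, this needs no learned value model as a baseline; the group mean plays that role, which is part of why the algorithm is comparatively cheap to experiment with.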
1:14:14
And so can folks pick up the second book and run with that or do you expect folks to have read the entire first book before they start with the second?
1:16:41
I would say either way works. You don't have to read the first book. The second book uses a pre-trained LLM, so you don't have to pre-train your own, and you don't need the first book to train the LLM for the second book; it's independent like that. But the second book doesn't explain the pre-training or the architecture in as much detail. I have an appendix explaining the architecture, but it's not quite as detailed as the first book. So if people want to understand the whole life cycle of an LLM, from pre-training to post-training, I think it would make sense to read them sequentially. But you could also start with the second book, learn about inference scaling and reasoning, and then, if you're interested in the pre-training, fill in the gaps later on. I think either way works, basically.
1:16:51
Very cool. Sebastian, it's been great catching up with you, and we need to do it more often than every three years. But thanks so much for jumping on and sharing a bit of your perspective on where things are and where things are going.
1:17:40
Yeah, thank you so much for the invitation, Sam. I had a great time. I love talking about LLMs and AI, so this was a treat. Thanks for having me on.
1:17:55
Thank you.
1:18:04