LLMs in 2026: What’s Real, What’s Hype, and What’s Coming Next
75 min • Feb 23, 2026
Summary
Sebastian Raschka discusses the state of LLMs in 2026, emphasizing that progress will come from reasoning capabilities and inference scaling rather than architectural breakthroughs. He debunks hype around AI replacing developers, explains practical use cases like coding and RAG systems, and clarifies why building LLMs from scratch is valuable for understanding fundamentals but impractical for most organizations at scale.
Insights
- LLM progress in 2026 will be incremental refinement of 2025's reasoning/thinking models and inference scaling techniques, not revolutionary architectural changes
- Building LLMs from scratch is educationally valuable for understanding fundamentals but economically impractical for most; fine-tuning existing models is more feasible
- Developer productivity will improve significantly with AI assistance, but coding won't become obsolete—iteration, testing, and domain expertise remain essential human work
- Performance benchmarks (MMLU, leaderboards, LLM judges) each have limitations; real-world evaluation through actual use is more reliable than any single metric
- Tool use and inference-time scaling (longer context, multiple reasoning steps, external tools) are bigger differentiators than base model quality in production systems
Trends
- Shift from training-time scaling to inference-time scaling as the primary performance lever
- Reasoning/thinking models becoming standard across all major LLM providers (OpenAI, Google, DeepSeek, xAI)
- Context window expansion enabling RAG-free document processing for many enterprise use cases
- Specialized fine-tuned models outperforming general-purpose models for specific domains (law, finance, coding)
- Tool-calling and agentic workflows becoming a critical differentiator beyond base model capabilities
- Open-weight models (DeepSeek, Llama, Qwen) closing the performance gap with proprietary models
- Reinforcement learning with verifiable rewards as the dominant technique for improving reasoning capabilities
- Data privacy concerns driving adoption of local LLMs and on-premise fine-tuning over API-based solutions
- Mixture of Experts and parameter efficiency becoming standard rather than novel techniques
- Misconceptions about AI capabilities driving unrealistic expectations in non-technical leadership
Topics
- Large Language Model Architecture and Training
- Reasoning Models and Chain-of-Thought Inference
- Inference Scaling vs Training Scaling Trade-offs
- Fine-tuning and Domain-Specific Model Customization
- Retrieval-Augmented Generation (RAG) Systems
- LLM Evaluation and Benchmarking Methodologies
- Tool Use and Function Calling in LLMs
- AI Impact on Software Development and Coding
- Open-Weight vs Proprietary LLM Models
- Data Privacy and Local LLM Deployment
- Reinforcement Learning with Verifiable Rewards
- Context Window Expansion and Long-Form Processing
- LLM Tokenization and Computational Limitations
- Mixture of Experts Architecture
- AI Hype vs Practical Capabilities
Companies
OpenAI
Discussed as a major player developing reasoning models and ChatGPT; uses inference scaling and custom models for bench...
Google
Mentioned as developing reasoning/thinking model variants and participating in math Olympiad competitions
DeepSeek
Highlighted for its 2025 breakthrough with reinforcement learning and verifiable rewards; released DeepSeek v3 and reason...
Lightning AI
Sebastian Raschka's employer; AI development platform mentioned as his professional affiliation
Anthropic
Mentioned as developer of Claude model with reasoning capabilities and tool use features
Nvidia
Referenced for Nemotron model supporting 1 million token context windows
Meta
Discussed for Llama open-weight models and training infrastructure details shared in papers
xAI
Mentioned as developing Grok model with reasoning/thinking variants
University of Wisconsin-Madison
Sebastian Raschka's academic affiliation, where he bridges research with practical LLM development
Bloomberg
Example cited of a company pre-training a specialized LLM from scratch for the financial news domain
People
Sebastian Raschka
LLM research engineer and author of 'Build a Large Language Model from Scratch' book series; primary guest discussing...
Geoff Nielson
Host of Digital Disruption podcast conducting interview with Sebastian Rochke about LLM capabilities and trends
Quotes
"2025 was particularly interesting because there was at the beginning of 2025 deep seek and then this new paradigm we can maybe get into this in more detail later but the reinforcement learning with verifiable rewards which is like a technique to develop reasoning capabilities in LLMs"
Sebastian Raschka
"I still write most of the code I care about by myself without AI. But I use LLMs for coding in different ways—for things I don't know how to do, like building a macOS app, it's kind of like magic."
Sebastian Raschka
"It's not making people developing code or designing apps or building apps obsolete because it's still work. You can't just say build xyz and it will build a final version. Usually the first version is not the final version so there are iterations you have to test it you have to use it you have to tweak it and that is going to be still work."
Sebastian Raschka
"If someone knew a new architecture that is much better than the current status quo that's like a trillion dollar idea basically. If someone has something like that, it wouldn't be something someone had shared already."
Sebastian Raschka
"By building the foundation it helps you really demystify these misconceptions. Mixture of Experts is not training different LLMs and combining them—it's a module in the LLM that has to be trained end to end."
Sebastian Raschka
Full Transcript
Hey, everyone. I'm super excited to be sitting down with Sebastian Raschka. He's the author of the Build a Large Language Model from Scratch and Build a Large Reasoning Model from Scratch book and video series. As an LLM research engineer, he's bridged academic teaching at the University of Wisconsin-Madison with the hyper-practical, working for the AI development platform Lightning AI. But what I love about Sebastian is that he has zero appetite for AI hype and can dive right into what you can actually do with these tools. I want to ask him what impact AI is having on coders. Is it really going to make them obsolete? What real capabilities can we expect AI to develop from here? And who should actually be building an LLM from scratch? Let's find out. Sebastian, thanks so much for joining today. For those who don't know, Sebastian is the author of the Build a Large Language Model from Scratch book as well as a video series on YouTube. Sebastian really gets deep into the technical aspects of LLMs, how we can actually create and build our own, and gets beyond the hype we sometimes hear about AI. But just before we get into who should be doing that and what that looks like, I wanted to zoom out a little bit. Sebastian, maybe you can tell me a little bit about, in your view, the state of LLMs in 2026. How are the capabilities advancing? What do you see on the horizon for this technology? Yeah, first of all, thanks for inviting me on the podcast to talk about LLMs. It's one of my favorite topics, so I think we will have a lot of fun in this episode. But you began with a very broad question here on the state of LLMs in 2026. I would say 2025 was particularly interesting because at the beginning of 2025 there was DeepSeek, and then this new paradigm (we can maybe get into this in more detail later): reinforcement learning with verifiable rewards, which is a technique to develop reasoning capabilities in LLMs. Reasoning is also in quotation marks here; it's a broad topic. With reasoning in LLMs, I would say we shouldn't take it too literally, like how humans reason, but it is a set of techniques that make LLMs better at solving complex tasks. And 2025 was pretty much dominated by this idea of developing these so-called reasoning, sometimes called thinking, models. So everyone from OpenAI to Google, Claude, Grok, all the open-weight LLMs, they all now have different variants: the regular instruct variant and then the thinking variant. And we can maybe talk later a bit about the trade-offs here. But I think this was 2025, and we will see it continue in 2026, because these techniques are still relatively new. People are currently, I would say, in the first iteration or version of these techniques, where, okay, this works, and now it's: let's hone in on that, let's make it even better, add some tips and tricks, and really exploit that type of mechanism. So we will see more of that. At the same time, I would say a lot of progress came from the inference scaling side. So there are basically two paradigms for LLMs: one is the training, and the other is the usage, the inference.
And there's also a trade-off. You can spend a lot of money on training, which is very expensive, and then you get that one model, and it maybe gets used for a few months before it gets replaced by the next model, or you may also update it. But yeah, training is very expensive, and longer training, like the scaling laws say, training on more data, gives you better models, essentially better performance. It is expensive, though. So you can also spend extra compute (and extra compute usually costs extra money, because you need more resources) during inference, meaning after the training, when you use the LLM. So you can say, okay, I have a user and the user has a query; instead of just giving the first answer, you can have maybe three answers. If it's a math question, you have the LLM run three times with different settings, and then you take the highest-scoring one, or the majority vote. But that will be three times as expensive. So that's one version of inference scaling. There are other techniques, for example generating longer outputs, and that sometimes helps the LLM think through a problem ("think" in quotation marks, coming back to the reasoning). And so, to answer your question before we go into technical details: 2026, I would say, is still on that trajectory. We can still make a lot of progress in training, maybe not so much in pre-training, because that's not where the low-hanging fruit is anymore, but more in the reinforcement learning for reasoning capabilities, and at the same time more clever inference scaling techniques. So there are these two things, and I think this is going to continue. I don't see anything surprising on the horizon right now, but this is always like that in AI. If someone knew a new architecture that is much better than the current status quo, that's basically a trillion-dollar idea. So if someone has something like that, it's not something they would have shared already. So it's always going to be a surprise if something like that lands. But I don't see any indicators or anything that gives me the confidence that there will be something really, really different. It will be more like honing in on these things, basically.
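To make the majority-vote flavor of inference scaling concrete, here is a minimal sketch. The `fake_llm` function is a hypothetical stand-in for any real model call sampled at temperature > 0; only the voting logic matters.

```python
from collections import Counter
import random

def fake_llm(prompt: str, temperature: float) -> str:
    # Hypothetical stand-in for a real LLM call; sampling with
    # temperature > 0 yields varied answers across calls.
    return random.choice(["42", "42", "41"])

def majority_vote(llm, prompt: str, n_samples: int = 3) -> str:
    # Sample several answers and keep the most common one.
    # Cost scales linearly: three samples cost roughly three calls.
    answers = [llm(prompt, temperature=0.8) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(fake_llm, "What is 6 * 7?"))
```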
I think that's really interesting. And so much of your outlook, Sebastian, comes back to this notion of reasoning or inference and, you know, the word thinking in quotation marks. The reason this is so interesting to me is that it sounds like, in your view, there's not going to be a huge step up in 2026 from what we see right now; we're going to continue to see performance improvements. But the reason I find the reasoning piece so interesting is that so much of what you hear in the media, and so much of the noise coming out of Silicon Valley and from CEOs in every industry, is these grandiose statements about what they will use AI for, what AI can do, and how transformative it is. I'm curious if you can share a little bit more about what some of the better use cases are, and maybe debunk some of the things this technology is still not going to be able to do at the end of 2026. I'm thinking specifically, when you mention reasoning, of what I call the strawberry problem; you may have heard of this example. The issue right now is that if you ask a lot of the leading LLMs how many R's there are in strawberry, they can't accurately answer the question. They say, oh, there are two R's, you know, one in straw and one in berry, which is obviously not right and sort of lays bare some of the reasoning limitations here. So what will this be good for at the end of the year? What will it still not be good for? Yeah, it's a broad question, not a super broad question, but I want to preface this with another point, to add to my previous answer on reasoning capabilities. It's also a spectrum. What I'm trying to say is that the same LLM with, let's say, different inference scaling can have different capabilities. So, coming back to the strawberry problem, you can use an LLM in the high-powered reasoning mode, and it might still make mistakes on simple tasks like that. It's called overthinking, and it's also something, again, I don't want to say LLMs think like humans, but it's also something humans suffer from. You know, I know a lot of things, I'm a researcher, I can do a lot of things. But ask me a simple math question at 11 p.m., like 21 times 11 or something, and I will maybe give you a wrong number because I'm tired and my brain doesn't work anymore. Or, in other ways, I sometimes make really dumb, stupid mistakes. But it doesn't mean I can't do these other things, and I think for LLMs that's also true. Counting the R's in strawberry is almost like, well, you're not evaluating it on the real use case you care about; this is something where you would not use the LLM. And actually, one of the big drivers of progress in 2025, and it's going to be in 2026, is tool use. LLMs can use tools, so they don't have to do everything from memory, like counting the R's in strawberry. The limitation here is usually around tokenization and basically how LLMs work. But instead of having the LLM tell you the answer based on its own internal computation, it can do a tool call. It could use a Python interpreter: read this as a string and then just use string searching, like finding the letters in the string. And then it gives you the accurate answer. So I think we also have to think about it like this: there are different modes in which we use LLMs, and your mileage may vary. Coming back to 2025 again, OpenAI, Google, and some others participated in these math Olympiad type competitions, and they got really, really good results, what they call gold-level performance. But at least for OpenAI and ChatGPT, they didn't use a model that is publicly available; they used some custom version. And this usually also involves inference scaling. The same is true for DeepSeek Math version 2: they had a paper where they took the same LLM and cranked up the self-refinement steps, where the LLM evaluates its own outputs, refines them, and has multiple tries at each problem. And that boosted the performance significantly.
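A minimal sketch of the tool-call pattern just described: rather than answering from memory, the model emits a structured call and the runtime executes exact string logic. The tool name and call format here are illustrative assumptions, not any specific vendor's API.

```python
def count_letter(text: str, letter: str) -> int:
    # Exact string counting, which sidesteps the tokenization issue:
    # the model never sees individual letters, but Python does.
    return text.lower().count(letter.lower())

# Hypothetical tool registry and call format; real providers each
# define their own structured tool-call schema.
TOOLS = {"count_letter": count_letter}

def run_tool_call(call: dict) -> str:
    fn = TOOLS[call["name"]]
    return str(fn(**call["arguments"]))

# Asked "how many R's are in strawberry?", the model would emit
# something like this instead of answering from memory:
call = {"name": "count_letter",
        "arguments": {"text": "strawberry", "letter": "r"}}
print(run_tool_call(call))  # -> 3
```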
So what I'm also trying to say here is that it depends a bit on how you use the LLM. A small LLM can be really efficient and cheap at a certain problem, but it's not going to solve all your tasks. You can specialize the LLM more for complicated things, but then it might fail at another task. And right now, when you go to chatgpt.com, for example, this is a general-purpose model. It has some modes, like this auto mode, thinking, non-thinking, deciding what is the right one. But it is still trying to be a jack-of-all-trades, in the sense that some people use it for summarizing emails, some people ask medical questions, some people use it for coding. And it's pretty good at that; it does a little bit of everything, but it's not super specialized. And I think with LLMs right now, the biggest and most promising use case, utility-wise, is coding. It's really good at coding, and we'll see how well it performs at other things. But yeah, I think I'm digressing here. There was a question about the end of 2026 and what some of the tasks are, and I would say coding, for sure. It's maybe the boring answer, but it's a text problem. Not exactly easy, but pretty approachable for LLMs. I completely agree with you from everything I've seen. This seems to be one of the low-hanging-fruit areas where it seems like we can dramatically improve productivity with developers. You had a quote I came across on your blog where you said that you still write most of the code you care about by yourself, without AI. Is that still true? And what do you think the implications are in terms of how developers and development teams use LLMs, or don't use LLMs, for what they're trying to accomplish? This is, to a large extent, still true. But I also use LLMs for coding in different ways. The other week I wrote a program. I can do Python coding; I'm a scientific coder. I can use PyTorch, Python, and some other languages for scientific computing. But I am not really a web designer, and I can't really build apps. If you ask me how to build an iPhone app or a macOS app, I have no idea; I've never done that before. But I've automated things all my life. 15 or 20 years ago, I would usually write scripts that do something to help me, like renaming files, those types of things. For example, for my blog posts I have a workflow where I have all these images in one PDF, and I usually export it, and I have a script that automatically converts it to different crops and different file formats. But it's still a bit tedious: I have to go to the location of my script, type the commands, and everything.
So I thought, okay, I can just make my life easier here and develop a macOS app, a native app, where I can just drag and drop the file in and it performs the cropping and the conversion and everything for me. And that is something I used an LLM for. It took a few hours, maybe one or two hours, to get it right the way I wanted it. But it's something I would not have been able to do otherwise. It's kind of like magic: I have a native macOS app that does that for me. And I can do that for a lot of things, everyday-life things on my computer. For that, I would just use the LLM. I don't really care about how it does it. I can see, okay, it works; if it doesn't work, I can give it some more prompts. I have no idea how SwiftUI works. I could probably figure it out, and I always wanted to learn it, but it's something I don't have time to learn. Each day I have so many other things that are more important than learning that, because it's not my main job, essentially. But then for things I care about, say my research experiments, I usually write most of the code myself, just to think through the problem and its shortcomings, just to get a very good idea of what I want to do. But there are cases where I make mistakes, and I usually use an LLM to get a second opinion, like, hey, does this look okay? It's something where, back in the day, you had a colleague, or you'd do PRs on GitHub open-source projects where other people chime in. And it is like another layer: before you share it with other people, you do a sanity check with LLMs to make your work better, to have kind of a proofreader, a second pair of eyes in that sense. It's really good at that. Sometimes it also has suggestions to make things more stable. Sometimes I have multiple experiments with, let's say, different settings, and I've written some code here and fixed it there, and I now want to apply that to all the other scripts, all the other experiments. And then I would use an LLM and say, hey, look at what I've done here; now basically copy that over to all the other ones. I could do that manually, but sometimes it's tedious, and this is something I can easily review, because I know the code and I can see what changes it made: oh, that looks okay, I just check, okay, next, next, next. In that sense, I do use LLMs for my coding workflow, but depending on the context. I don't want it to do everything for me, because then I have no idea what's going on anymore. I want to do it in a more controlled way for the work I care about. That was basically what the quote was about. Yeah, it makes complete sense to me. And the reason I bring it up is in the context of some certainly alarmist narratives out there that you may have heard, the most extreme one basically saying that computer science, or development as a whole discipline, is going to be obsolete because we won't need to hire these people anymore; everybody can vibe code and the machines can do it by themselves. And there's a more balanced version of that as well.
Namely, that developer productivity is going to be so radically changed that one developer can have the same throughput as maybe ten developers could a year ago. Do you see that as holding any water in the next two years? Or is that just fanciful thinking? I think there is a kernel of truth in that, in the sense that it is true; I noticed it myself with the use cases I just described. It just goes faster if I tell the LLM, say, apply my patch that I have here to the other files, or something like that. So in that sense, one hundred percent. Also, for example, for my own website I added a dark mode button that would otherwise have taken me weeks or months. I mean, I had it on my to-do list for years and never got to it, because I knew it was going to take a lot of work. And the LLM did that in one day, basically. So in that sense, yeah, it is true; it kind of makes things faster. But it's still work. You know, even with this macOS app I described, it still took me a few hours, and it's a very basic app. So what I'm trying to say is that it's not making people who develop code or design apps or build apps obsolete, because it's still work. You can't just say, I mean, maybe one day, but I don't even think that's true, where you can say, okay, build XYZ, and it will build a version of that. Usually the first version is not the final version, so there are iterations: you have to test it, you have to use it, you have to tweak it, and that is still going to be work. I think what will change, or what I'm hoping, is that with LLMs, apps like that, or websites, get better than they used to be. I'm hoping people use LLMs to improve what they would build otherwise, not to just produce more low-quality work. That's what I'm hoping for the future, but everything is still work. I also noticed that for my own experiments, it's not just the code; it's also running the experiments, doing the comparisons, thinking of additional things to compare, and that's all still work. There are some people on the internet saying software is basically free now, free in the sense that an LLM can do it, that there's no value in, say, open-source projects anymore. And I don't think that's true, because I would always take something that has been developed over many years and tested over something the LLM gives me as a one-shot solution. I think the best of both worlds is if people use LLMs to improve things that are already there and build new things, but then iterate over them and just make them better than they would be otherwise: adding more tests, making them more robust, patching bugs, that type of thing. And it's still going to be a lot of work to do all these things, even if you have LLMs. That makes complete sense to me. And it feels even more reasonable given the conversation we were having earlier, the fact that if we really want to push any of these LLMs to their limits, that probably won't be done through the generalizable, standard ChatGPT model. It's looking at some of these unique models for unique use cases, really. And as soon as you're starting to build out some of those, you need someone who actually understands what they're doing to be able to set those up appropriately.
And so, first of all, I want to feed that back to you, because that was something I took away from our earlier conversation: it sounds like just trusting a singular LLM to be able to push the frontier in every given area is going to be less effective than having more specific ones for specific tasks. Is that fair? Yeah, that is a good characterization. And it comes back to an older problem in deep learning, the field of training neural networks, because an LLM is essentially, at the end of the day, a deep neural network. And the one problem is basically that if you train it on one thing, it will forget other things. Again, LLMs work differently from humans, but it's the same for us, right? I mean, if I just solve math problems every day, I will get really good at math. If I then don't do math for a few years and do something else, I forget things, because I get new information and don't, let's say, hone those skills. And the same is true for LLMs. So when people develop or pre-train LLMs, they're very careful about what goes into the pre-training mix, and also in which order. And then once you have that base model, you usually fine-tune it. In addition, you often also have domain-specific fine-tuning. It's an older paper, but I think it was Code Llama that had a nice graphic on how they developed the model in these different stages. And usually the last stage is more like what you really care about. If you want to develop, for example, a coding LLM, you always have to have the coding data already in the pre-training; you have to carry it through. But at the end, you will have a specific phase where you just fine-tune it on coding problems. Then it will probably get worse at math or Spanish or something like that. And that's the trade-off. So you start off with a generalist LLM, but then you specialize it, and that trades off other skills; it becomes worse at other things, basically. So yeah, that's basically how it works. Right. And I can wrap my head around that, and it makes the case really nicely for building some of these more specialized models. So I'm curious, Sebastian: we've got the proprietary generalized models on one side of the spectrum, we've got completely building your own LLM on the other side, and then in the middle, and feel free to reject my characterization here, we've got the custom GPTs, or finding ways to customize some of the proprietary models. When, in your mind, does it make sense to be customizing a proprietary model versus building your own? And who are the types of people, and what are the use cases, that make the most sense when we start talking about building your own? Yeah, so there are different levels of that. You can essentially start from scratch, just pre-training and fine-tuning your own model. That's the most work, and I would not recommend anyone doing it, unless you are a company whose goal is to build LLMs, essentially, or you're a big company that has a lot of money and really wants to do something specialized. I remember, a few years ago now, I think Bloomberg pre-trained a model from scratch, focusing on their news headlines and writing news articles or something like that.
At that scale, maybe it makes sense, but it's going to cost millions of dollars, so it's not cheap. And I think what we're going to see is big fields, like finance and law, where I think it does make sense to develop one, because lawyers can't just use ChatGPT, for data privacy reasons and other reasons. So if people get together in such a field, and there's a lot of money in that field, they could spend tens or hundreds of millions of dollars to develop an LLM, a base model for law-type things. But then it's this general law model, and you still have to maybe fine-tune it on the internal data at your company or something like that. So yeah, one option would be completely from scratch, and like I said, I would not recommend it, because it's very expensive, unless you're one of a select few players who might want to do that. The second variant would be to take an existing pre-trained model and then specialize it. I think that is more feasible, because there are a lot of LLMs out there in all different sizes, like all the open-weight models: DeepSeek, a really big model; Qwen3, a very popular model; or even from OpenAI, the gpt-oss model, an open-weight model. It's free to use, and you can then fine-tune it. But it's, again, not trivial, so you will still end up spending tens to hundreds of thousands of dollars if you want to have a really good model there, depending on the size. It might vary, but as a hobbyist, I think that's totally out of reach, unless you are really, really passionate about something. The problem is, if you want to build something really competitive like that, you have to have a big user base, customer, or use case for it, because it's going to cost, and it will also at some point become obsolete. For example, I don't know when Qwen4 comes out, but let's say today I use a Qwen3 base model from summer 2025, and I spend a lot of money and time to make it really good. In a few months there's Qwen4, or some other model that is much, much better, and then my model is completely obsolete and I have to start over again. But yeah, it does still make sense for certain things: taking a base model and fine-tuning it. And the third variant would be taking an existing model and essentially just customizing it with a prompt. I think that's what a lot of people do. They use an API where they don't even host the model, and then with a prompt you can steer it in a certain way. It's not going to be perfect, but for a lot of use cases it gets you a lot of bang for the buck, because you don't have to train your own LLM. But again, there are limitations. For example, if you use an API, you have restrictions: you can't use your private data, or at least you shouldn't, because the data can end up public. There were a few instances in the news in 2026 where prominent people did that and data leaked. So I think it really depends on what your goal is. But if I, for example, today need something to translate, I don't know, articles from one language into another, I wouldn't go out there and train my own LLM. I would just pick one of the popular ones, like ChatGPT or Gemini, use a prompt for that, and see how far it gets me. Right.
And I'm chuckling a little bit, and I have to ask, I want to come back to something you said, which is that you're sort of steering people away from building an LLM from scratch, which I'm chuckling at because it's something you've obviously invested a lot of time teaching. And so, for the people you're teaching this to, is it mostly people who are interested in learning this as hobbyists, or people interested in learning the fundamentals of how this works, so that when they use LLMs in a more commercial setting, maybe something more proprietary, they understand the basic building blocks? Who's typically interested in this? Yeah, you bring up a good point, because it sounds paradoxical: on the one hand, I'm building these things from scratch, and then I tell people not to use them. I would say I'm not trying to steer people away from it; it's more that I want to set the expectations right, you know, just like you said: who are the people who should be doing it? I also know from my own experience how much work it is to build something from scratch that is actually better than something out there. So there were some readers, for example, who stumbled upon my book, and they are not coders; they are from different fields. And the expectation is: oh, I'll read the book and I will be able to, let's say (there was actually a case with language translation) build an LLM that translates documents for me better than ChatGPT. And that's not going to happen, basically, unless you spend a lot. And LLMs are also usually not trained by a single person; they are usually trained by a large team. So to answer your question, we can use an analogy. Let's say you are passionate about cars and you want to understand how the car works, how the motor works, the steering, everything. But you would not build a Ferrari. You know, it would be very expensive as a single person.
I mean, that's not even in reach. You would need a team, you'd need the factory, you'd need the design documents, you'd need a lot of time and money, hundreds of millions of dollars, to develop a Ferrari. Instead, you would maybe build a simpler car, something that resembles a car from the 1980s, something you can build in your garage. But by building that car, you will understand how the Ferrari works; a Ferrari is essentially just a fancier version of that. And the book works in a similar way: the goal is to understand how things work. It's for people who maybe want to build these large GPT-type models, education-wise, because right now, how would you learn that? How would you get hired at a company? You usually have to show that you already have some skills, so you have to start somewhere, and this could be an entry point to learn how to build these LLMs. But it's also for people who don't even want to build LLMs; they just want to understand the limitations. Why does the LLM struggle with the number of letters in strawberry? What goes into the LLM? How does the whole workflow work? One way would be to explain everything conceptually, in words: yeah, the LLM has text, it tokenizes it, it converts it into numbers, they go in, and then there's some computation. But that's all very vague and could be misunderstood, because you gloss over a lot of details. I think the best way to really see how it works is by actually doing it, by going through the actual steps. And these steps also don't lie: at the end, you have the working LLM, and so you know, okay, this is actually working, this is not made up. It's not a fantasy explanation; it's really concrete, it works. And that is essentially also part of the goal: at the end, you can develop your own LLM. But then the disclaimer is that it's a lot of work to get something that is really competitive, basically. Right. Because setting up the LLM in some ways is not even the hard part. There's the training, there's the pre-training, there's the data set you have. All of that is what separates the car you made in your garage from the Ferrari, right? Yeah. And that's a good point, the data set you mentioned. In the book, I'm only using public-domain data from Project Gutenberg, a simple example book that is public domain and hundreds of years old, because of copyright concerns. But if you want to build a really big LLM, you need trillions of tokens; you need terabytes of data, basically. And that would be impossible for a person who buys the book to do, because you would have to buy all the hard drives, you'd have to rent hundreds of GPUs, and that would really not be feasible. So the book is also, data-wise, focusing only on a very small data set, and the model will then learn how to write text similar to that book. But it's not going to be your next ChatGPT, because that's impossible. I mean, not impossible, but for a single person it's not feasible; you would spend a lot of money and a lot of time. And so the goal is really about helping explain things. Right. Let me take this in maybe a slightly different direction, but the data set conversation got my wheels turning a little bit.
One of the challenges for a lot of organizations trying to do this themselves is basically marrying the capabilities of an LLM with the quality of their current data, which may be sitting in their enterprise applications. And I think it's easy when we talk about a book, for example: you ingest a book, you tokenize the words and all the characters and all of that. But when we think about data that's structured in a database, or the unstructured data that maybe people have around that, or metadata, how capable are LLMs right now at making sense of that? And is that something you see changing in the next year or two, or is it an inherent limitation of the structure of the LLM? Basically, the reason I'm asking is: for organizations that are worried about the quality of their data, is that a problem that's being misunderstood? Is it going to go away soon, or is it just an inherent limit of this model that they're going to have to come to terms with sooner or later? So let me just try to rephrase, to see if I understood your question correctly. The limitation is working with your personal data, like data you have? That's right, with organizational proprietary data, call it customer data or employee data or something like that. Yeah. So there are different ways you can work with that data. The usual limitation of LLMs, I mean, it's still a limitation, but there are so many tricks now that this becomes more feasible, was that you can only fit so much into the context of the LLM. And see, this is where a from-scratch book comes in handy, because you would see and understand what the context is, how the LLM processes the data, and that there's a limit on its size. If I crank that up, it gets really expensive, or at some point it's exceeded, so I can't just put everything into the context; I have to be smart about it. Traditionally, in an ideal world, you would put everything into the context, but that doesn't work because it's too expensive. So people developed something called RAG, R-A-G, which stands for Retrieval-Augmented Generation. That's an application layer around the LLM where you take the documents you care about, chunk them up, and put them into a database. Then, when you have a query, it produces a vector embedding, let's say a compressed version of your query, and with relatively simple math, like a dot product, you look for similarity to the chunks in your database. Then you retrieve the most similar chunk, and you hope it's going to be relevant. So it's essentially a smart lookup. You could also think about it in simpler terms: you have a query, you chunk up your document into smaller parts, you go through them and try to find the most similar one, and then the LLM can use that to answer the question. So, for example, let's say you are in a law firm and you ask: what was the case in 1983 where XYZ happened? You try to pull that out, and then the LLM can use it as part of the answer, basically.
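Here is a toy sketch of that retrieval step. A bag-of-words counter stands in for a real embedding model, and a Python list stands in for a vector database, but the similarity math is the same idea.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. Real systems use a learned
    # embedding model, but the similarity math below is the same idea.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Chunked documents; a real setup would store these in a vector database.
chunks = [
    "The 1983 case concerned contract liability for shipping delays.",
    "Quarterly earnings rose sharply in the second quarter.",
]
query = "What happened in the 1983 case?"
# Retrieve the most similar chunk and hand it to the LLM as context.
best_chunk = max(chunks, key=lambda c: cosine(embed(query), embed(c)))
print(best_chunk)
```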
It's not perfect, because you are chunking the document and the full context isn't there; you always have these little chunks. But one of the, I wouldn't say breakthroughs, because it's more of a continuous development, but one part of the progress we saw in 2025 was that the supported context sizes got longer. It really depends, but there are even open-weight LLMs, like NVIDIA's Nemotron, that can do up to one million tokens. Of course, it's going to be more expensive; you need more GPU power for that. But I think even ChatGPT online can do 100,000 or 200,000 tokens, and I might be wrong, but I think that's about the size of one of the Harry Potter books, the first one or something. It's a long context. And for many people this is actually sufficient, so you don't need any fancy application around the LLM to process that; you just put it in there. There is the problem called the needle-in-the-haystack problem. There are multiple problems; there's also something related to attention sinks, where the LLM focuses more on the beginning of the text you put in. But the needle-in-the-haystack problem is this: with these long-context LLMs, let's say you have a question about some factual data you want to retrieve, and it's buried in these hundred thousand words. The LLM should find it, and the longer the context, the harder it is for the LLM to answer correctly, because there's a lot of noise and it sometimes gets distracted. It's similar for us humans, too: the more stuff you throw at us, the more complicated it is to figure things out. That's where companies made a lot of progress last year, so most of the time it works now. And I would recommend, for most people, just trying that first instead of building something. That's also how I approach problems: do the simplest thing first, write down what performance you get, then iterate, tweak it, try other things, and see if it's better. Before using the most complicated thing, always try the obvious simple thing; maybe that already gets you most of the way there, and you can iterate later and see if it's worth spending three months to build something around it that does maybe one percent better. One limitation, though, is that in the case you described, you may not want to do that with ChatGPT, because your data will be online. I think at the beginning of 2026 there was a case where, based on the news I read, a government employee uploaded some sensitive documents that got leaked; I think they found out because the content appeared in answers to other people. It was confidential, high-security type data. As far as I know, and I don't want to say anything that's wrong, I think ChatGPT, for example, does use your data for training their models. I don't think they specifically single out data and publicize it, but it's implicitly used for the training.
They try to anonymize everything, but if you upload it to ChatGPT, you have to be aware it might become part of the training data. So you can't do it for everything. There are laws in certain fields where you just can't share patient data, sensitive data. But then again, you can use a local LLM that runs locally. Most LLMs that run locally support up to 160,000 tokens, which is, again, the Harry Potter book; it's a lot of data. And there are special ones, like Nemotron 3 from NVIDIA, with 1 million tokens. So there's always something you can do locally that gets you most of the way there. And one more thing to interject: there's a paper from the beginning of 2026 that I found really interesting. I think it was called Recursive Language Models, that's the title of the paper. What they do is kind of a clever trick. They have a query and they want to answer a question, similar to a RAG setup, where there's a lot of data to process: a whole document base, a whole folder, a lot of data that can't fit into the context. And so, instead of letting the LLM do everything, they parse the input into a string in Python, in a coding environment, and then let the LLM come up with ways to chunk it up into sub-problems. So, for example, if your problem is, let's say, summarize all the chapters in this gigantic book, and you have 12 chapters, then instead of feeding the whole book into the LLM and trying to get the summaries for all 12 chapters at once, the LLM decides: maybe I can take one chapter each and process it in 12 parallel execution loops, and then I pull together the summaries from all 12 chapters and write an overall summary. So it's just chunking; it's not really rocket science. It's really just using a coding environment where the LLM can use a tool to chunk everything up itself, but that gets you most of the way there. And they did that also with ChatGPT; it doesn't have to be a local LLM. You can do it with APIs, with tool calls. So there are a lot of workarounds in the field where you have limitations in the LLM itself, but you solve them with clever tricks in the surrounding API layer, or the application, basically. And does that fall under the bucket, in your mind, of reasoning enhancements, or is that something else? I would say it's something else. It can be reasoning-related, if it's a problem that requires good reasoning capabilities, but if I had to group it into something, it's more like general inference scaling. Although it's not even inference scaling in the sense that it makes things more expensive, if you're just chunking the work up into separate sub-calls. But like you said, it could be related to reasoning if your query is a reasoning query. It's a bit tricky here, though, because if your task is, let's say, to solve a math proof, something really complicated with a lot of sequential steps where you derive all the individual intermediate steps, I think that would not be a good case for this method, because the method runs things in parallel, as sub-calls that are kind of independent of each other. And reasoning models usually benefit from this so-called chain of thought, where they think through a problem, which is more sequential.
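A small sketch of the chunk-and-recombine pattern just described, in a map-reduce style. `summarize` is a placeholder for a real LLM call, and where the paper has the model itself decide how to chunk, this sketch hard-codes per-chapter chunks for brevity.

```python
def summarize(text: str) -> str:
    # Placeholder for an LLM call; returns a stub "summary".
    return text[:40] + "..."

def summarize_book(chapters: list[str]) -> str:
    # Each chapter fits in the context window on its own; the whole
    # book does not. The per-chapter calls are independent of each
    # other, so they could run in parallel.
    partial_summaries = [summarize(ch) for ch in chapters]
    # Final pass: combine the per-chunk results into one answer.
    return summarize("\n".join(partial_summaries))

book = [f"Chapter {i}: ..." for i in range(1, 13)]
print(summarize_book(book))
```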
Right. And the reason I'm asking, and it probably gets a little bit outside your field of expertise, Sebastian, but I'll ask anyway: when we think about other, quote, AI or automation applications that start to get outside of LLMs and the transformer model, but maybe have some overlap with them. I'm thinking about agentic AI, or some of these AI use cases where different AI systems work with each other to understand what the outcome of a task should be and actually orchestrate to get it done. It seems like there's some overlap in terms of the processing here, being able to understand what's more likely and chunk it up. Does that come into play, or is that completely off base? I do think it's a very good point. I think it's related, in the sense of understanding the input, processing it, and presenting it in a way that can then be processed. In this particular case, instead of interacting with other agents, the model is kind of interacting with itself, but it could be other agents too. It's basically: how can I divide and conquer my problem here? Then it calls itself on the chunked problem, but it could just as well delegate it to other types of models. And I would say that is one of the biggest progress drivers in recent months: the tool calling, using different tools for different things instead of trying to do everything itself. And I think that's also where a bit of the magic comes in when you use something like Gemini or ChatGPT or Claude; it's not just the LLM. It's just a hypothesis, but I do think if you take something like DeepSeek, or some other open-weight model that is really good, and you put it into whatever framework they have behind the scenes at, let's say, Gemini or ChatGPT, it would be almost identically or similarly good. So what I'm trying to say is that I don't think the LLM itself is necessarily the differentiating factor anymore; they're all kind of similarly good. What is really important is how you format things: how you deal with context and the history, the previous back and forth if you're in a conversation, how you process that, how you use tools, and that stuff, where a lot of work goes into making it really robust. I don't know exactly how they process input at, let's say, ChatGPT, because it's proprietary, but sometimes I type something and make a typo, sometimes even in a relatively technical term. I have a typo in my prompt, but often I don't even care about fixing it, because I know it deals with that already. So instead of hitting delete and fixing that word, I just leave the typo in. And I can see, based on the response, because the response often involves repeating part of the question, that it fixed the spelling of my word. So I don't know if that's necessarily the LLM itself, because it has all the different spellings or tokens or subtokens; there could even be a processing layer that fixes obvious typos, because that enhances the performance of the LLM.
Because then, instead of having to have a huge vocabulary, or subtokens for all the different ways someone can misspell a certain word, you can just have a dictionary fix, a simple fix, and make it less work for the LLM itself. So I think there's a lot of magic like that happening behind the scenes to improve the performance. And I think that's why you see the performance of something like ChatGPT or Gemini being better than something you would run locally, where most of the tools that run LLMs locally run them bare-bones, without much stuff around them, basically. And, you know, I'm glad you brought that up, because it still feels like there's so much tied to the quality and the clarity of the prompt, and there are some cosmetic fixes it can do for you. But I'm curious about your thoughts, Sebastian, around performance benchmarking. Because if we're talking about variable output based on the clarity of what you're asking for, a lot of performance benchmarks seem pretty clear; they're trying to do something quite specific. And, to bring back a point you made earlier, it also feels like some of these organizations are not using their publicly available models, or there's some inference scaling happening behind the scenes that's not really visible. How much weight do you put on performance benchmarks at all right now? And how much should people be looking at those as a measure of the capability of these tools going forward? Yeah, that's a good question. I think that's one of the biggest problems in the field: how to evaluate models, let's say, fairly. Benchmark-wise, there are different types of benchmarks. Off the top of my head, I would say there are three or four; let's see if I can come up with what I have in mind. The first one is basically the classic MMLU type: a multiple-choice benchmark. That one is almost a trivia question, like a Who Wants to Be a Millionaire type of question. It's a question, and then there are answers A, B, C, and D, and the model has to select one of them. People usually use that to test knowledge: does the LLM know about world knowledge and math and basic things? But it's ultimately not how you use an LLM. You would never give it all the solutions and say, give me A, B, C, or D; you would ask free-form, for example. But free-form is really hard to evaluate programmatically, because there are different ways to express the answer: as a word, as a sentence. That's why they do A, B, C, and D, multiple choice, but it has a lot of limitations. So I think it's like a minimum threshold: an LLM should have a minimum score on these benchmarks to be okay, but once it passes that threshold, I don't think it matters whether it's 90% or 95%. Also, we have to keep in mind that some people run these benchmarks with and without tool use. The gpt-oss model, the open-weight model, had a nice chart about that: when you have a model and you allow it to use tools, it gets much better performance than when you don't.
For example, if you ask a model who won, let's say, the Soccer World Cup in 1998, it can try to remember it, but it can also use a tool and look it up on the internet, say on the official website. That increases the accuracy on these types of things. So that is one type of benchmark, and it is like a minimum threshold: the model should be able to answer these things correctly, but it doesn't really tell you how the LLM performs when I actually use it and query it and prompt it in different ways, because models can also be sensitive to the prompt format. Another way is these so-called leaderboards, where there's a website and you can give different LLMs the same prompt and then compare the answers. You say, oh, I prefer this answer over that answer. That sounds more related to what we would actually care about: let's say Gemini and ChatGPT side by side answer a question, and, oh, I actually prefer this answer. And if you do that a lot, with a lot of people and a lot of pairwise comparisons, you can use a statistical model, like a Bradley-Terry model, to convert it into a ranking, into numbers like one, two, three, four, five, six, so you can say this LLM is number one. But this also has limitations, because people really prefer a certain style. People are sensitive to the answer style, and not necessarily the correctness, because if I ask an LLM a question, I usually don't know the answer; if I have a challenging math problem, otherwise I wouldn't really ask. So you get different answers, and you say, oh, I prefer this one because it's maybe explained more nicely, or I like the language better, but that doesn't mean it's more correct. So leaderboards are sensitive to a certain style. And there was an incident, or not an incident, but a thing, with Llama 4 last summer where, I mean, I don't know the full story, because I'm not affiliated with the companies and I don't know the behind-the-scenes, but what was reported was that they used a different model on the leaderboards, and it got really high leaderboard scores, but in reality it was not a good model. There was no substance behind it; it was more like glamour. It looks better than it really is when you actually use it on hard tasks. So it's always a bit challenging. Another way to evaluate models is verification. For example, you can have math or code, something you can verify. Let's say math, and you have the correct answer; it's a numeric answer, and there are tools to compare two numeric answers. Usually how it works is you tell the LLM: hey, I have this problem, explain and write out the intermediate steps, whatever you need, and then put the final answer in a box; it's usually a LaTeX box format. Then you can programmatically retrieve this answer and compare it to the reference answer, using a calculator-type tool to say whether these numbers are the same or different. There are different tools, like Wolfram Alpha; I use SymPy, an open-source library for Python, where you can symbolically compare solutions. And that is accurate, right? If the model follows the instructions and puts the answer in the box, I can compare with a lot of certainty. There may be some parsing errors, but 99% of the time you have a fair comparison. The problem is that it's limited to math, or to code where the code compiles. It's harder to evaluate the answer as a whole; you can only evaluate the final answer.
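A minimal sketch of that verification loop, assuming the model was instructed to emit its final answer in a LaTeX \boxed{...}; SymPy then compares candidate and reference symbolically.

```python
import re
import sympy

def extract_boxed(response: str) -> str:
    # Pull the final answer out of the LaTeX \boxed{...} the model
    # was instructed to produce.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1) if match else ""

def answers_match(candidate: str, reference: str) -> bool:
    # Symbolic comparison: simplify(a - b) == 0 treats 1/2 and 0.5,
    # or 2*x and x + x, as the same answer.
    diff = sympy.simplify(sympy.sympify(candidate) - sympy.sympify(reference))
    return diff == 0

response = r"... therefore the result is \boxed{1/2}."
print(answers_match(extract_boxed(response), "0.5"))  # True
```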
If the model follows the instructions and puts the answer in the box, I can compare with a lot of certainty. There may be some parsing errors, but maybe 99% of the time you get a fair comparison. The problem is that this is largely limited to math, or to code where you can check that it compiles. And it's harder to evaluate the answer as a whole; you can only evaluate the final answer.

So I think I listed three; one more comes to mind: using LLM judges. You use another LLM and provide a rubric: evaluate whether the answer is correct, whether the intermediate steps make sense. You give it criteria and say: given these criteria and this reference answer, rate the quality of this answer on a scale from 0 to 10, where 10 is best. You can do that across different LLMs and average the numbers: this model averages 8, that model averages 9. That's a numeric way to evaluate free-form answers. But again, there's a catch: you're using another LLM for the judging, and that LLM might not always evaluate things correctly or fairly; it might be biased toward a certain answer style.

So, long story short, each of these methods for evaluating LLMs has its shortcomings, and the best approach is to look at all of them together, in context, not any single one, and try to see the weaknesses and strengths of the different LLMs across benchmarks. That's really hard, and in the end the releases all look kind of similar. Really, you have to use the model yourself and see: this works for me, this doesn't. It's one of the biggest problems right now, getting a fair comparison.
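Sketching the judge pattern he describes: the rubric, scale, and judge call here are illustrative placeholders, not any particular product's API.

```python
# Illustrative sketch of LLM-as-a-judge scoring. `ask_judge` is a
# placeholder for a real API call to a judge model.

RUBRIC = """Given the reference answer, rate the candidate answer
from 0 to 10 (10 = best). Consider correctness of the final answer
and whether the intermediate steps make sense. Reply with a number."""

def ask_judge(prompt):
    # Stand-in: a real implementation would call a judge LLM here
    # and parse the numeric score out of its reply.
    return "8"

def judge_score(question, candidate, reference):
    prompt = (f"{RUBRIC}\n\nQuestion: {question}\n"
              f"Reference answer: {reference}\nCandidate answer: {candidate}")
    return float(ask_judge(prompt))

# Average judge scores per model over an eval set, then compare models.
answers = {"model_a": ["...", "..."], "model_b": ["...", "..."]}
for model, outputs in answers.items():
    scores = [judge_score("Q", out, "ref") for out in outputs]
    print(model, sum(scores) / len(scores))
```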
That makes sense. And it's interesting, because at the frontier there are these small differences, like do I prefer this style slightly more than that style, versus is it getting the answer fundamentally wrong? And do you use an LLM judge, or how do you best answer that question? I am curious, Sebastian: you've been doing this for a while, and I have to imagine interest from non-technical people has increased in the last couple of years. What do you personally see as some of the biggest misconceptions people have about LLMs, how they function, and how they can get value and use out of them?

That's a good question, actually. Off the top of my head, I don't think there's one huge misconception; it comes down to expectations again, like we mentioned earlier. It's really hard and really expensive to train LLMs. The misconception is, "I just have to do X, Y, Z, and I can do that on a weekend." A single person can understand everything at a basic level, that's not the problem, but the challenge compared to earlier problems in the field is that you need a whole team. You need an expert in GPU infrastructure. You need the researcher who implements the core architecture. You need people to run experiments. It's not something someone can usually do alone, or in a weekend. Understanding it, yes; actually doing the whole thing is a lot of work.

Most companies don't share these kinds of details, but there was, I think, a Llama 2 or Llama 3 paper with a very nice section on what it took to train the model back then. I forget the nitty-gritty numbers, but they reported training on many thousands of GPUs with a certain number of hardware failures each day; GPUs crash, and a crash can lose your whole model. So you have to checkpoint, or build in robustness so that one failing GPU doesn't take down your whole million-dollar run. There's a lot of that, and it usually requires a whole team, because you have to monitor everything all the time. You can set up notifications and so on, but developing an LLM is a full-time job for a lot of people. That's something that has changed compared to machine learning and deep neural network training before, and it's also why you don't see many academic LLMs anymore. There are a few, but back in the day, with convolutional networks and image classifiers, a small university lab, usually a handful of people, could do it all themselves. Now you need a huge budget, a whole team, a lot of time, a lot of expertise. That's why this is now mostly restricted to companies that have the resources.
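The checkpointing he mentions is conceptually simple, even if doing it robustly across thousands of GPUs is not. A toy single-GPU sketch in PyTorch, with a dummy model and objective standing in for the real thing:

```python
import torch

# Toy sketch of periodic checkpointing during a training loop, so a
# hardware failure costs hours rather than the whole run. Real
# multi-node setups layer sharded/distributed checkpoints on top.

model = torch.nn.Linear(512, 512)            # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step, path="checkpoint.pt"):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

for step in range(1, 1001):
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 200 == 0:                       # checkpoint every N steps
        save_checkpoint(step)
```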
So I'm going to play that back to you, and let me know if I'm getting this right. It sounds like, just based on how much this field has evolved, how many resources the biggest players have, the amount of staff required, the compute required, the cost required, and the raw data required, if you're a small shop or a hobbyist, it is valuable to learn how to build an LLM from scratch. That's something that will help you personally or professionally. But in most cases it's probably not something you're then going to implement and run at scale, because it's just not as practical as it might have been five or ten years ago. Is that fair?

Yes, that is fair, with one caveat. There's pre-training, and there's fine-tuning. If you fine-tune an open-weight LLM that's already out there and build on top of it, things become much easier. Just to put in some numbers, take the DeepSeek models again, because they're so popular and the papers included nice numbers: the V3 and R1 models that came out in December 2024 and January 2025. If you rented the GPUs they used, priced at around two dollars per GPU-hour, it would have cost about five million dollars to train the DeepSeek-V3 base model (roughly 2.5 million GPU-hours at that rate). That doesn't include staff salaries, rent for the building where people work, or the failed runs; when you do something like this, you fail many times before finding the right configuration, so there's a lot of trial and error. But taking only the run that worked, the pure GPU cost would be around five million dollars, which is a lot of money. The fine-tuning, the reasoning training they did on top, was more on the order of one to two hundred thousand dollars: much, much lower, and much more approachable. And that's for a 671-billion-parameter model, a very big model. If you go down in size from roughly 600 billion to, say, 20 billion parameters, you could probably do something really good for a few thousand dollars. Then it makes sense again. It's still not a weekend project; it usually takes a few weeks or months to get really good results. But once you can do it, you can swap out the LLM and repeat the procedure. Once you learn the workflow, it's easy to apply your skills to other LLMs.

What I wanted to say with that is: you can do interesting things. And there are APIs now, too. Again, I'm not affiliated with any of these companies, but I think it's called Thinking Machines, the company founded by OpenAI's former CTO, and it has an API where you can fine-tune and customize LLMs. You don't need the GPUs yourself: like a ChatGPT-style API, but for fine-tuning. You give it the data and the settings and it runs on cloud machines, without you having to manage GPU failures and that sort of thing. But again, I think it really helps to learn how to build it from scratch first, to understand what you're doing with all those settings. For example, in my build-a-reasoning-model-from-scratch book, I have a chapter on reinforcement learning with verifiable rewards, which is essentially the DeepSeek-style reinforcement learning. I run it on a math dataset: MATH-500 is a test set that's popular for benchmarking, and I train on a set of about 12,000 math problems that don't overlap with it. To give you some numbers: on one GPU it takes about a day to train for 500 steps, so going through all 12,000 would take on the order of 20 days or so just to train.
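At the heart of that training loop is the group-relative advantage that gives GRPO its name: sample several completions per problem, score each with the verifiable reward, and normalize within the group. A hedged sketch, with illustrative numbers:

```python
import torch

# Sketch of the group-relative advantage used in GRPO-style training
# with verifiable rewards: sample several rollouts per prompt, score
# each with a binary reward, and normalize within the group.
# The reward values here are illustrative.

num_rollouts = 8  # the "number of rollouts" knob mentioned below
rewards = torch.tensor([1., 0., 0., 1., 1., 0., 0., 0.])  # one per rollout

# Rollouts that beat the group average get positive advantage and are
# reinforced; the rest get negative advantage and are discouraged.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)
```

The clip ratios he mentions next play the same role as in PPO: they bound how far a single update can move the policy.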
It's a small model, but by running even a few hundred steps you understand what the verifiable rewards are, what's happening, what you're comparing against, what the reference answers are, what's being checked, and what the different settings do. There's the number of rollouts, the batch size, the various GRPO settings like the epsilon clip ratios: a lot of little tweaks and knobs, and by building it from scratch, you know what all of them mean. Once you've understood "oh, that's what I'm doing, that's what's happening here," you can go to an API and say: I'm actually confident I know what this setting is, I've used it before, it's just a knob I have to tune, I'll try this value and that value. That's where building from scratch helps a lot: getting that intuition before you have a production system.

That's great. As we start to wind down the conversation, Sebastian, I did want to ask: what's your takeaway advice for technology leaders who may be interested in learning more about LLMs, or in making sure their teams better understand them? What are the main takeaways you'd give them about moving into this space?

Well, shameless plug here: I'd say code an LLM from scratch. Not from scratch without any template; use my book, for example, or something similar that guides you through it. If you're comfortable with Python and PyTorch, it's something you could do in a weekend, or maybe four or five days, and it gives you the foundation. There's a lot of jargon out there (mixture of experts, different attention mechanisms, grouped-query attention, multi-head latent attention), but it's all derived from the original GPT model. Once you understand the core building blocks, it demystifies all the other things; they're all built on top of that, flavors of it. What I also like to do is build a foundation of something; then it's always easier to look things up and see how the field evolves from there, compared to starting from "I have no idea how anything works; what is this mixture of experts?"

That's actually a good example, because some prominent people have asked me for advice about mixture of experts: would it be worth investing in, big-picture, for someone who isn't building LLMs themselves? A common misconception is: "With mixture of experts, I can train different LLMs and combine them, a math LLM and a Spanish LLM, and then I don't have to train everything together." That sounds very plausible, but it's not how mixture of experts works; it's a very different thing. Building the foundation helps you demystify these misconceptions. A mixture of experts is essentially a module inside the LLM: the feed-forward layer, which is basically a weight matrix, like a classic multi-layer perceptron, replicated into multiple experts. Once you see that connection, you know you can't just swap arbitrary models in there. It has to be trained end to end, and the experts are also more implicit.
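A minimal sketch makes the point: each expert is just a feed-forward block, and a learned router decides, per token, which experts to use. The sizes here are illustrative, not any particular model's configuration:

```python
import torch
import torch.nn as nn

# Minimal Mixture-of-Experts feed-forward block: each "expert" is an
# MLP, and a learned router picks the top-k experts per token. All of
# it is trained end to end; experts are not hand-assigned to topics.

class MoEFeedForward(nn.Module):
    def __init__(self, dim=64, hidden=256, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)  # learned, no hand-written rules
        self.top_k = top_k

    def forward(self, x):                       # x: (num_tokens, dim)
        scores = self.router(x)                 # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(MoEFeedForward()(tokens).shape)           # torch.Size([10, 64])
```

Because the router and the experts are trained jointly, you can't train a "math expert" separately and bolt it on afterwards, which is exactly the misconception above.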
You can't point at one expert and say: this one does math and this one does Spanish. It's fuzzier than that; maybe one expert is stronger at math in the sense that it gets activated more often on math problems, but there's no discrete division of labor. These are things that are really hard to understand without looking at the fundamental architecture building blocks. Even right now, I'm trying to explain it at a big-picture level, and the big picture doesn't quite capture it. Maybe that's my message: there are a lot of visualizations and big-picture explanations out there that aren't incorrect, but they're fuzzy and vague, and they can lead to misunderstandings because they don't show the full picture. You don't have to learn every nitty-gritty detail. GPU optimizations, say, are a detail you can skip if you don't train LLMs yourself, and you don't necessarily need to understand how NVFP4, the 4-bit floating-point format, is implemented. But with a solid foundation you'd understand: 4-bit precision is less than 16-bit precision (four bits can only encode 16 distinct values, versus 65,536 for 16 bits), so it's cheaper, but it also approximates, because you can't store as much information. It's about understanding and appreciating these kinds of nuances, I would say.

Right. And to your point, there's just no substitute for getting your hands a little dirty and seeing how it actually operates.

Yeah, because the code doesn't lie. It's the truth; it's not hand-wavy, it's really concrete. And that way you don't end up with these knowledge gaps. Well, even if you build something from scratch, you don't build everything in all directions from scratch; you focus on the core. But the core itself is the true core; it's not a vague concept anymore. It's almost self-verifying. It's like math: you can take certain formulas and just memorize and use them, or you can derive them from first principles. If you derive a formula from first principles (you don't have to do it every time, just once), then you know it's actually rooted in something, you know what the parts are and why it holds, because you derived it yourself. It's not just a fantasy formula someone came up with; there's a reasoning process behind it. And I think that answers a lot of the questions people would otherwise have.

Well, it feels like this is a space with so much misinformation and so much hype, and so much opportunity for people to misunderstand how things work, that there's real benefit in being able to go to the source and see it for yourself.

Yeah. One example came to mind while you were talking about fundamental misunderstandings of certain things. There was this model, really cool research, called the Hierarchical Reasoning Model, and there was also something called the Tiny Reasoning Models, which came out last year, in 2025.
And I think they even won the ARC benchmark (ARC is a logic-puzzle type of benchmark), and it got a lot of hype; it was huge in the media, everywhere. But I think it would have helped if people had built something like it, or followed the paper, because it is a really, really cool model, but it's not an LLM, yet it was being compared to ChatGPT. It's a tiny transformer model, and it only works on the particular task it was trained on. You can train it to do, say, Sudoku, or train it on the ARC benchmark puzzles, but you can't ask it to translate a sentence from Spanish to English, because it can only do that one thing. Here it helps if people understand a bit of what the architecture is and how it's trained; then you see the limitations. That doesn't mean it's a bad model; it's actually very impressive for its size. But it wouldn't be fair to compare it to an LLM, because an LLM is a general model that can do many things, while this is very, very specific. If you understand the fundamentals, you can escape this type of hype; you can see, ah, this is just a news headline trying to grab attention. And there's a lot of that, like you said: it sounds good, it gets hype and clicks, but if it sounds too good to be true, it usually is. There's usually some catch, and it's easy to find that catch if you know the fundamentals.

Well said. Sebastian, a big thank you for coming on the show today. It's been a really interesting and insightful conversation, and I was excited by how deep you could take us on some of these topics.

Yeah, thanks so much for inviting me. I had a lot of fun talking about all these technical things; it's what I do for work, and it's also what I do as a hobby. So thanks for inviting me here; it was fun.

If you work in IT, Info-Tech Research Group is a name you need to know. No matter what your needs are, Info-Tech has you covered. AI strategy? Covered. Disaster recovery? Covered. Vendor negotiation? Covered. Info-Tech supports you with best-practice research and a team of analysts standing by, ready to help you tackle your toughest challenges. Check it out at the link below, and don't forget to like and subscribe.