"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

55 min
Feb 22, 2026
Summary

This episode features Olive Song, a senior researcher at Chinese AI company MiniMax, discussing their M2 series of open-weight models specialized for coding and agentic tasks. The conversation covers MiniMax's unique approach of developing both foundation models and applications in-house, their reinforcement learning techniques including interleaved thinking patterns, and the technical challenges of training frontier LLMs with limited resources compared to American AI labs.

Insights
  • Having developers and researchers work side-by-side creates tight feedback loops that enable rapid identification and fixing of model weaknesses during training
  • Interleaved thinking (allowing models to act, get feedback, think again, then continue) significantly improves performance on long-horizon agentic tasks compared to single-pass reasoning
  • Small technical decisions like maintaining FP32 precision during reinforcement learning can be critical for achieving theoretical algorithm performance
  • Systematic perturbation of training environments across all operational dimensions (tools, prompts, scaffolds) is essential for robust agent generalization
  • Open-weight models currently struggle most with adapting to different environments and achieving the same level of environmental understanding as top closed models
Trends
  • Shift toward specialized models for specific domains (coding, agents) rather than general-purpose models
  • Increasing importance of reinforcement learning and human alignment in model training
  • Growing emphasis on multi-agent systems and cost-effective model deployment
  • Rise of interleaved reasoning patterns for complex task execution
  • Integration of AI agents into internal research and development workflows
  • Focus on robust generalization across diverse operational environments
  • Emphasis on systematic perturbation pipelines for training robustness
  • Growing importance of engineering discipline in scaling AI systems
  • Trend toward open-weight model releases from Chinese AI companies
  • Increasing collaboration between open-source communities and commercial AI labs
Companies
MiniMax
Chinese AI company developing M2 series open-weight models for coding and agentic tasks
OpenAI
Referenced as comparison point for top American AI model performance
Anthropic
Claude models mentioned as benchmark for environmental adaptation capabilities
Open Router
Platform where MiniMax M2.5 currently tops the usage leaderboard
People
Olive Song
Senior researcher at MiniMax specializing in reinforcement learning and model evaluation
Ksenia Yase
Host of Inference by Turing Post podcast who interviewed Olive Song
Swix
Creator of the AI Engineer event series where Olive Song presented
Quotes
"During reinforcement learning, the model tries its best to hack a lot of things."
Olive Song
"Engineering is very, very, very important. I didn't know that during school."
Olive Song
"We can only know the definition of AGI when we achieve it."
Olive Song
"Problem solving is more of discovery."
Olive Song
"Intelligence with everyone is more like how it changes my life and it enables me to do more work and then how it can connect me better to different people."
Olive Song
Full Transcript
6 Speakers
Speaker A

Hello and welcome back to the Cognitive Revolution. The presenting sponsor of today's episode is Granola, the AI notepad that helps you get the doing done. Whether it's identifying to-do items after a call, turning a brainstorming session into a product spec, or looking back at multiple calls to identify cultural trends at your company, Granola takes your raw meeting notes and makes them awesome. Right now, Granola is featuring AI recipes from AI thought leaders, including several past guests of this show. My own contribution is a blind spot finder recipe that looks back at recent conversations and attempts to identify things that I am totally missing. This was immediately useful in the context of contingency planning for my son's cancer treatment, and the more data Granola collects as I continue to use it, the more valuable it becomes for suggesting AI topic areas that I really ought to explore. See the link in our show notes to try my blind spot finder recipe and experience for yourself how Granola puts your meetings to work.

Now today I'm excited to share a special combined crossover episode featuring Olive Song, a senior researcher specializing in reinforcement learning and model evaluation at the Chinese AI company Minimax, creators of the M series of models, the most recent of which, M2.5, currently tops the Open Router usage leaderboard. To give you the most complete picture possible, we're combining two sources: first, a presentation Olive recently gave at the AI Engineer Conference in New York, where she had previously lived for six years, and second, an interview with Ksenia Yase from her podcast Inference by Turing Post. Together they provide an excellent overview of Minimax's goals as a company, the capabilities they're prioritizing in their models, the techniques they're using to get there, and the day-to-day ups and downs of training frontier LLMs.

Highlights include: how Minimax's strategy of building both models and user-facing applications in house creates tight feedback loops that enable their cross-functional research and engineering teams to identify and address model weaknesses as quickly as possible; an overview of how interleaved thinking, which allows the model to take an action, get feedback from the environment, and pause to think again before continuing, improves performance on long-horizon agentic tasks; a description of the perturbation pipeline they use to systematically vary the model's training environment in order to encourage robust generalization; Olive's perspective on the constant battle she and her teammates are fighting against reward hacking; a window into the tedious debugging that is sometimes required to diagnose training issues and how they realized that they needed to run reinforcement learning at FP32 precision; and finally, how the team at Minimax is using AI agents to keep up with the daily flood of AI news.

While Olive recognizes that Minimax's models, like all open-source models in the world today, can't quite match the performance of top American models, I think there is still a lot of value in the details she shares about their approach to reinforcement learning and how they structure their team and work. And in any case, I always appreciate the opportunity to hear directly from Chinese AI researchers who, just like their American counterparts, are figuring things out step by step as they go, even as major questions about issues such as the governance of increasingly powerful open-source models remain fundamentally unanswered.
With that, I want to thank Swix, the creator of the AI Engineer event series, which I absolutely recommend attending if you can, and Ksenia, the creator of Turing Post, which has what I find to be some of the very best topic selection of any AI newsletter, for allowing me to create and post this combined episode. And I hope you enjoy this window into the development of some of the best open-weight models in the world with Olive Song of Minimax.

0:00

Speaker B

Hi.

4:16

Speaker C

Hi everyone, I'm Olive. It's my great honor to be here today to present our new model, Minimax M2. I actually lived in New York City for six years, so it feels great to come back, but with a different role. I currently work on reinforcement learning and model evaluation at Minimax. Let me just get a quick sense of the room: who here has heard of or tried Minimax before? Oh, a couple of you, yeah, not everybody, but I guess that's the value of me standing here today. We are a global company that works on both foundation models and applications. We develop multimodal models, including text and vision-language models, our video generation model, and speech and music generation models. We also have many applications in house, including agents. That's the specific thing that's different from other labs and other companies: we develop both foundation models and applications, so we have researchers and developers sitting side by side working on things. Our difference is that we have firsthand experience from our in-house developers feeding into the development of models that developers in the community would really need. And here I want to introduce our Minimax M2, which is an open-weight model, very small with only 10 billion active parameters, designed specifically for coding and workplace agentic tasks. It's very cost efficient. Let me just go over the benchmark performance, because people care about it. We rank very near the top on both intelligence benchmarks and agentic benchmarks; I think we're at the top of the open-source models. But numbers don't tell everything, because sometimes you get those super-high-number models, you plug them into your environment, and they suck.

4:16

Speaker D

Right?

6:32

Speaker C

So we really care about the dynamics in the community. In our first week we had the most downloads, and we also climbed to top-three token usage on Open Router. So we're very glad that people in the community are really bringing our model into their development cycle. What I want to share today is how we actually shaped the main model characteristics that make M2 so good in your coding experience, and I'm going to present the training behind each one of them: from coding experience, to long-horizon state-tracking tasks, to robust generalization across different scaffolds, to multi-agent scalability. First let's talk about coding experience, which we supported with scaled environments and scaled experts. Developers need a model that can actually work in the languages they use and across the workflows they deal with every day. That means we need to utilize real data from the Internet and scale the number of environments, so that during training, for example during reinforcement learning, the model can actually react to the environment, target verifiable coding goals, and learn from them. That's why we scaled both the number of environments and our infrastructure, so that we can run that training very efficiently. With data construction and reinforcement learning, we were able to train the model so that it's very strong, full stack, and multilingual. And what I want to mention here is that besides scaling environments, which everybody talks about, we actually scale something we call expert developers as reward models. As I mentioned before, we have a ton of super-expert developers in house who can give us feedback on our model's performance. They participated closely in the model development and training cycle, including problem definition, for example bug fixing or repo refactoring. They also identify the model behaviors that developers enjoy, identify what's reliable and what developers would trust, and give precise rewards and evaluations of the model's behaviors and final deliverables, so that it becomes a model developers really want to work with and that adds efficiency for developers. With that, we were able to lead in many languages in real use. The second characteristic of Minimax M2 is that it performs well on long-horizon tasks, those long tasks that require interacting with complex environments and using multiple tools with reasoning. We supported that with the interleaved thinking pattern and reinforcement learning. So what is interleaved thinking? Without interleaving, a normal reasoning model that can use tools works like this: it is given the tool information, the system prompt, and the user prompt; then the model thinks and calls tools, possibly a couple of tools at the same time; then it gets the tool responses from the environment, performs a final round of thinking, and delivers the final content. But here's the truth: in the real world, environments are often noisy and dynamic. You can't really complete the task in a single pass. You can get tool errors, for example, or unexpected results from the environment. So what we did is imagine how humans interact with the world. We look at something, we get feedback, and then we think about it. We consider whether the feedback is good or not.
And then we make other actions, make other decisions. And that's why we did the same thing with our M2 model. If we look at the diagram on the right: instead of just stopping after one round of tool calling, the model actually thinks again and reacts to the environment, to see if the information is enough for it to get what it wants. We call it interleaved thinking because it interleaves thinking with tool calling a number of times; it can be tens to a hundred turns of tool calling within just one user interaction turn. It helps adaptation to environment noise. For example, just as I mentioned, the environment is not stable all the time, and when something is suboptimal the model can choose to use other tools or make other decisions. It can handle long-horizon tasks and can automate your workflow using, e.g., Gmail, Notion, and the terminal all at the same time. You just need to make maybe one model call, and with minimal human intervention it can do it all by itself. Here's a cool illustration on the right. Because we're in New York City, I feel the vibe of trading and markets. You can see that there were some perturbations in the stock market, I think last week, and our model was able to keep things stable. So just like I said, there's environment noise, there's new information, there's news, there are what look like other trading policies, and so on, but our model was able to perform pretty stably in these environments. The third characteristic is our robust generalization to many agent scaffolds, which we supported with perturbations in the data pipeline. We want our agent to generalize. But what is agent generalization? At first we thought it was just tool scaling: we train the model with enough tools, various tools, new kinds of tools, we invent tools, and then it will just perform well on unseen tools. Well, that was partly true. It worked at first, but we soon realized that if we perturb the environment a little bit, for example by switching to another agent scaffold, then it doesn't generalize. So what is agent generalization? We concluded that it's adaptation to perturbations across the model's entire operational space. If we think back to what the model's operational space is, it can be the tool information, the system prompts, the user prompts; they can all be different. It can be the chat template, the environment, the tool responses. So what we did is design and maintain perturbation pipelines for our data so that our model can actually generalize to a lot of agent scaffolds. The fourth characteristic I want to mention is multi-agent scalability, which is very feasible with M2 because it's so small and cost effective. I have a couple of videos here. This is M2 powering our own Minimax agent app. We actually have a QR code down there, so if you want, you can just scan it and try it. It's an agent app we developed. Here we can see different copies of M2: they can do research, write up and analyze the research results, put them in a report, put them in some kind of front-end illustration, and they can work in parallel. Because it is so small and so cost effective, it can really support those long-running agentic tasks and tasks that require some kind of parallelism.
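To make the interleaved pattern concrete, here is a minimal sketch of the loop described above, assuming a generic tool-calling API: the model's thinking is kept in context across tool rounds, so it can react to each tool response before acting again. The names `llm_step` and `run_tool` are hypothetical stand-ins, not Minimax's actual API.

```python
# Minimal sketch of an interleaved-thinking agent loop (illustrative only;
# llm_step and run_tool are hypothetical stand-ins, not Minimax's API).
def interleaved_agent(llm_step, run_tool, user_prompt, max_turns=100):
    history = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):  # tens to ~100 tool turns per user turn
        step = llm_step(history)                       # model thinks, then acts
        history.append({"role": "assistant",
                        "thinking": step.thinking,     # thinking stays in context
                        "tool_calls": step.tool_calls})
        if not step.tool_calls:                        # no more actions: final answer
            return step.content
        for call in step.tool_calls:                   # possibly several calls per turn
            result = run_tool(call)                    # may be noisy or an error
            history.append({"role": "tool", "content": result})
        # loop back: the model sees the tool feedback and thinks again
    return None  # budget exhausted without a final answer
```

The contrast with the single-pass pattern is the loop itself: thinking happens before every action, rather than once before and once after tool calling.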
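The perturbation pipeline can be sketched the same way: rather than varying only the tool set, every axis of the operational space named in the talk (system prompt, tool schemas, chat template, tool responses) gets randomized when training episodes are generated. Everything below is an illustrative guess at the shape of such a pipeline, not Minimax's actual code.

```python
import random

# Illustrative pools covering the "operational space" axes named in the talk.
SYSTEM_PROMPTS = ["You are a coding agent.", "Act as a careful senior engineer."]
TOOL_SCHEMAS = [
    {"name": "bash", "args": ["cmd"]},
    {"name": "run_shell", "args": ["command", "timeout_s"]},  # same tool, new schema
]
CHAT_TEMPLATES = ["chatml", "plain"]

def perturb_episode(base_task):
    """Return one training episode with a randomized scaffold."""
    return {
        "task": base_task,
        "system_prompt": random.choice(SYSTEM_PROMPTS),
        "tools": random.sample(TOOL_SCHEMAS, k=random.randint(1, len(TOOL_SCHEMAS))),
        "chat_template": random.choice(CHAT_TEMPLATES),
        # inject occasional noisy or failed tool responses so the model
        # learns to recover instead of assuming a stable environment
        "tool_error_rate": random.uniform(0.0, 0.2),
    }
```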
So what's next? For Minimax M2, from what I've introduced, we gathered environments, algorithms, data, expert values, model architecture, inference, evaluation, all of these, to build a model that is fast, intelligent, able to use tools, and able to generalize. What's next for M2.1 and M3, or further in the future? We're thinking about better coding, maybe memory or context management, proactive AI for the workplace, and vertical experts. And because we have those great audio generation and video generation models, maybe we can integrate them. But our mission is that we're committed to bringing all these resources, whatever is on the screen and maybe more, and our values, and putting them all together to develop models for the community to use. So we really need feedback from the community if possible, because we want to build this together. This is a kind of race that everyone needs to participate in, and we are committed to sharing it with the community.

6:33

Speaker A

Yeah.

16:58

Speaker C

And that's all the insights for today. Again, we really hope you'll try the model, because it's pretty good. You can contact us up there, and you can try the models by scanning the QR code. Basically, that's it. Thank you all for listening.

17:00

Speaker B

During reinforcement learning, the model tries its best to hack a lot of things. The current open models can achieve that level of understanding. It is a solvable problem and we are working on it. Engineering is very, very, very important. I didn't know that during school.

17:35

Speaker A

Hey. We'll continue our interview in a moment after a word from our sponsors.

17:54

Speaker E

One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal, private benchmarks: challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple of years, saving me countless hours. But as you've probably heard, Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. And the result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions based on those. I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts, not because they're bad, but because it's important to me to be able to fully stand behind everything I publish. But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information-gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's claude.ai/tcr. The worst thing about automation is how often it breaks.

17:58

Speaker A

You build a structured workflow, carefully map

20:14

Speaker E

every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English, send a daily briefing, triage support emails, or update your CRM, and whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works. No flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at Tasklet.ai and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at Tasklet.ai.

20:17

Speaker D

Hello everyone. Today I have the pleasure of talking to Olive Song, senior researcher at Minimax; recently they've been launching very interesting open-weight models specialized in different areas. And Olive is currently working at Minimax on the new version, Minimax 2.2. Thank you for taking the time at 9pm on a Sunday night. Does everyone work like this at the company? I'm really impressed.

21:26

Speaker B

I think different people work on different schedules. We do have people who work even overnight, but they sleep at daytime. So like we have a very flexible schedule. You know, it goes with your experiment. For example, if the experiments run for all day, the person can take a break and then if you know there are a lot of analysis to do, maybe because we are very curious about the results and we're very passionate, right? We can't really wait a very long time. So yeah, everyone has their own schedule.

21:53

Speaker D

That says something about the success of the models. You specialize in reinforcement learning and model evaluation, as far as I understand, which are two of the least forgiving parts of model development. And you also have more constraints than the big American AI labs. What does a good day look like for you, and what does a bad one look like?

22:20

Speaker B

I can share something about our recent weeks. So there's not a whole good day or a whole bad day. We were joking that during one day we have good results in the morning, and then sometimes they become bad results at night. We sometimes call it ICU in the morning and then KTV at night. Typically a good time would be receiving some good results, though even running into new problems is a good time. For example, during reinforcement learning, we can see the model doing a lot of different stuff to achieve the results, and sometimes we just discover new model behaviors. And that's really exciting. Even though it might not be safe or expected, it's kind of exciting, so I call it a good time. A bad time, well, there really isn't a bad time except for finding out the bad results. The moment itself is bad, but then trying to figure out the problem and breaking it down is pretty good.

22:41

Speaker D

What was a recent model behavior that you didn't expect?

23:41

Speaker B

So, during reinforcement learning, the model tries its best to hack a lot of things, right? For example, it uses bash a lot. And sometimes these might not be very safe behaviors, as our expert developers say, because the expert developers have their own expectations of how the model should work, but it doesn't go that way if we don't constrain it. So we do a lot of alignment to solve that issue.

23:44

Speaker D

You just launched Minimax Her, and it went all over Twitter. How do you come up with those ideas? Because role playing is sort of, is it an alignment question? Is it not? How do you do that?

24:09

Speaker B

Frankly speaking, I'm not the expert on that part. We have a whole team on role playing, the Her stuff. I'm not an expert, but we do have a lot of discussions. We do believe that role playing, or accompanying humans, or human interaction, is very important in life with AI and in how it will change our social life in the future. And it absolutely represents an ability that's very superior, because it's human-like: it has emotions, it understands your emotions. It's not just working out some exams. That's absolutely another side of AI capability.

24:23

Speaker D

It's what's called "AI with everyone," right? At Minimax. Yeah.

25:00

Speaker B

Intelligence with everyone.

25:04

Speaker D

Intelligence. Intelligence with everyone. What does it mean for you?

25:06

Speaker B

For me personally, I feel like it's more about how it changes my life, enables me to do more work, and how it can connect me better to different people. For example, before, I wouldn't be able to understand a lot of very professional coding problems or optimization problems. Now I am able to do that with AI, so I can communicate with more people and exchange more ideas. That's one side. On the other side, it generally helps my daily life: it helps with my work, my daily routine, my self care. It changes life for me, and I hope that it changes life for everybody, obviously in a good way.

25:09

Speaker D

Can you tell me a little about how day-to-day work is organized in your lab? I remember from your talk at AI Engineer that it's very interconnected between developers and researchers. I would love to hear more about that.

25:48

Speaker B

Absolutely. We sit around together every day and share our experiment results. For example, as I just said, during experiments, for example reinforcement learning experiments, we see some scores going up high. We look at the model's behaviors, and we look at them with the developers in that area as well. We sit together, they spot the issue right away, and then we are able to come up with new ideas to fix it or build more data for it.

25:59

Speaker D

If we can go into details about your current work on the current version: what are the biggest problems you're trying to solve compared to the previous version?

26:27

Speaker B

One important thing we focus on right now, and also in the future, is human alignment. We are focusing on coding models for 2.1, 2.2, and the M2 series, right? And what we realized is that for it to become very productive in our daily work, or rather productive and safe at the same time, we have to do a lot of alignment on it. The model can't just grow on its own and do dangerous behaviors just to achieve the final goal. So for us, the important thing is how we define human alignment, how we define expert expectations, and how we actually train the model to be more aligned with our expectations.

26:38

Speaker D

So I want to go into some real details here, and you're the expert, so correct me if I'm wrong, but I saw that there was recent interest in details like keeping the LM head in FP32 during reinforcement learning training. Why do small decisions like this end up mattering more than just a clever new algorithm?

27:20

Speaker B

It all comes down to being closer to the theoretical algorithm. We have the theoretical reinforcement learning algorithm, but when we implement it, it can be a little bit off. That creates a small gap from the theoretical extreme of the algorithm. So that's how we think about and approach this problem: we try to scale toward the theoretical extreme. The precision part, for example, is one thing that we found would prevent us from getting close to that extreme, and that's how we solved it. Discovering that was actually a very funny story; I talked about it a bit when we published Minimax M1. During our experiments we found that the accuracy didn't go up. We looked at it layer by layer; we looked at the log probs layer by layer and found it. Theoretically speaking, it had to work, right? So there had to be some gap between the theory and how we implemented it. We thought about the gap, analyzed it layer by layer, and eventually found it.
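For readers who want the general idea in code, here is a minimal sketch, assuming a standard policy-gradient setup, of what keeping the LM head in FP32 looks like: the final projection and log-softmax run in full precision so the log probs used for RL importance ratios stay close to their theoretical values. This is illustrative only, not Minimax's implementation, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def token_logprobs_fp32(hidden, lm_head_weight, tokens):
    """Per-token log-probs with the LM head kept in FP32.

    hidden:         [batch, seq, d_model] activations (bf16 is fine upstream)
    lm_head_weight: [vocab, d_model] final projection weight
    tokens:         [batch, seq] sampled token ids
    """
    # Upcast only the final projection: low-precision logits can shift
    # log-probs enough to bias ratios like exp(logp_new - logp_old).
    logits = hidden.float() @ lm_head_weight.float().t()       # fp32 logits
    logps = F.log_softmax(logits, dim=-1)
    return logps.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # [batch, seq]
```

The debugging story maps onto this directly: if training-side log probs are computed in reduced precision while rollouts come from a mixed-precision inference engine, comparing the two layer by layer is exactly how you would locate the gap.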

27:41

Speaker D

Is there anything like this happening now?

28:46

Speaker B

Definitely, yeah. Every single day, and in every different group. I can't actually disclose anything where we don't have a concrete conclusion yet, because we want our public conclusions to be very concrete and deeply understood. If we have breakthroughs, we'll definitely publish them later. But I have to say we do encounter these problems every day. I think it's called first principles, right? We think from the very fundamental part of the problem and then approach it.

28:48

Speaker D

The models that you launch are open weight. From your perspective, and from the alignment perspective, what do builders actually gain from open weights, and what responsibility do they have to take on that you don't have to take responsibility for?

29:18

Speaker B

Again, I'm actually not an expert in building things with models. I feel like because it's open weight, people have free use of it. For example, they can deploy it by themselves, or they can even fine-tune it with the weights and then keep all the data on their own property. That's very safe.

29:32

Speaker D

But if we talk about alignment, how do you look at that from that perspective? When the model is out there in the wild, before you launch the model, before you publish it, what tells you that it's safe to publish?

29:54

Speaker B

We have some internal benchmarks for safety, and they have different dimensions: something like sensitive-content safety, something like alignment safety. We have that as our evaluation. About one or two weeks before launching, we do scaled-up evaluations and scaled-up alignment on the model. That's how we assess whether the model is safe. But if it's already open weight, in the wild, people can actually do things to it. I guess that's what you're getting at, right? People can do more things to the model that we can't control. I don't know how we handle that, frankly speaking. There are laws on that, right? Regulations, where people do agree on some moral standards.

30:06

Speaker D

Do you follow any reinforcement learning failure modes that haven't shown up in benchmarks but then become obvious in real agent use? How do you collect feedback for the next versions, for improving the reinforcement learning process?

30:51

Speaker B

We collect feedback on the model itself first. When we publish a model, many developers and many people use it, and we collect that feedback systematically. We analyze each problem: some of them are fundamental, some are just things we missed and can fix really quickly. So there are two parts. First, we do internal evaluation with the developers, and they point out problems; that's how we fix that part. But that's not enough, and more feedback comes to us after we officially publish the models, and we collect it. The way we organize our group is that different people work on different capabilities of a general model. If we collect things we think we should improve in the future, different people take their parts, right? They say, okay, I think I can solve this issue and I'll solve it in the next generation. That's how we collect feedback and improve the model.

31:07

Speaker D

How did you initially decide not to build one general-use model, everything for everyone, and instead go more into specialization, like coding?

32:01

Speaker B

I think we are approaching generalized models; it's just that we are putting more emphasis on coding, for example. Our model can also be taken into any general agent scaffold, including our own agent product, and that's for general purposes. We do work on researching, report writing, PPTs, stuff like that, which is more general. Personally speaking, I feel like with coding you can kind of structure the whole world, or model a lot of stuff.

32:10

Speaker D

Yeah, engineer it.

32:40

Speaker B

Yeah, with engineering. Behind it, it's scaled-up humanity for me. It carries a lot of intelligence in itself and a lot of work to do. That's how we view this issue. But we do work on generalized stuff, and even more generalized stuff in later versions. For example, our model will handle some general workplace scenarios in the future, and that's not just coding.

32:41

Speaker D

If we talk about coding and agentic use, it requires long horizons. How do you solve long horizons for agentic use?

33:07

Speaker B

I think you have to define your goals well and define the model behavior well. We also require great, extraordinary infrastructure, for example for reinforcement learning. Besides the algorithm, besides the things people have been working on for a very long time, what's special for agentic stuff is how we define agents, how we define how an agent model will work. First you need to define the task; you need to define the model's goal. Especially in a long-horizon task, you need goals that are actually hard and diverse. The second part is that you need environments: great engineering environments, scaled-up environments, diverse environments, not just coding but, for example, workplace environments with different kinds of tools. That's great engineering. And then you need great infrastructure, outstanding RL infrastructure, to let the model really roll out over a very long horizon with very efficient GPU use, very efficient rollout and training, and so on. I feel like that's what's different in agentic reinforcement learning compared to before.

33:15

Speaker D

Are you affected by GPU constraints? How do you solve the compute problem?

34:24

Speaker B

We do have a team that works on how we utilize the compute the most. That's actually one of the RL scaling issues: utilizing compute very efficiently. Their purpose is to minimize compute use while training more, right? Personally, I don't really feel the GPU constraint; we have a great team who works on utilizing the compute the most while stabilizing the training the most.

34:29

Speaker D

Do you have problems that you need to solve with your own expertise, like how to use compute more efficiently, or is it just that team?

34:56

Speaker B

We are actually the same team, because we're the reinforcement learning team. We view this issue from different perspectives. It can be implementation, you can view it from a data perspective, you can view it from different perspectives, but our goal is the same.

35:02

Speaker D

We're always looking forward to new solutions coming from Chinese labs, because it's always mind-blowing.

35:18

Speaker B

We are actually working on some new agentic reinforcement learning stuff, but it won't really come out with 2.2; with the next generation model, we are still working on it. I'm not sure what I can share or not, so I'll share it later when I have concrete conclusions. As I said before, I can't really say something that we haven't documented yet.

35:25

Speaker D

Will it be available when the model is out?

35:47

Speaker B

That depends on our timing. I'm not very confident yet, but we are dedicated to working on it.

35:50

Speaker D

Yeah, a lot of constraints talking to researchers; so many secrets. Well, if we talk about openness, this whole conversation I'm having with people this quarter is about open source. I wonder if you can talk about the company strategy: why did the company decide to publish open weights for the models? What are the benefits? What are the cons?

35:57

Speaker B

For our team, the researchers' team, we always wanted to go open source, because the open source community is fantastic. I learned that from day one when I joined the team: the open source community is fantastic. So as researchers, we did want to join the open source movement. On the other hand, speaking of the cons, we are a company, so people care about whether this can make money, whether this is a business. The con would be that if the weights are open source, fewer people will use the APIs. But as a researcher, that really isn't my focus that much, so I'm not very confident speaking about the company strategy. For the tech part, we just believe that we can build better models with the open source community.

36:20

Speaker D

How much do you use open source tools yourself from different other companies?

37:04

Speaker B

A lot. For inference, for example, I'm not sure if I'm allowed to name specific open source branches, but we collaborate with both vLLM and SGLang, and they are open source code repositories.

37:08

Speaker D

How do you look at the open source stack? Because when we talk about open source, sometimes it's perceived as one thing, but actually it's multi-layered. How do you look at it?

37:22

Speaker B

For example, there are a lot of open source agent scaffolds, both coding agents and general agent scaffolds, that we use ourselves to test our models. We look at their logic, we look at their code, to see how they design scaffolds and, for example, engines. We take what they did really well and reflect on how we think about the problem, how we structure the problem, whether we're on the same page, and so on. So we learn from each other.

37:31

Speaker D

Do you think teams underestimate how much engineering discipline open models require compared to using closed APIs? It always requires a lot of setting up, it's different compute, and you need engineering talent to use it, instead of just choosing a closed API, turning it on, and using it. Do you have any difficulties with that, or inside the company is the open source stack established and working?

38:01

Speaker B

Personally, I don't have a problem with that. If other open source models are published, I'll just download them, deploy them on a machine, and work with them if I want. Personally I don't have that issue. But if there are individual developers out in the wild, I understand the problem, especially when they don't have their own compute. Then it's easier to connect to a model through, for example, Open Router and things like that.

38:30

Speaker D

Do you use a lot of other open models, on the same Open Router, let's say? Do you play with them?

38:55

Speaker B

Yeah, I play with them. I play with them on day one. If they release at midnight, I play with them at midnight anyway.

39:00

Speaker D

Like taking notes.

39:07

Speaker B

I don't actually take notes, but I do have my personal evaluation stack, a list of fun questions that I like to test with every single model to see how they work.

39:09

Speaker D

Can you tell me about it? That's super interesting.

39:18

Speaker B

Yeah, I've been collecting a bunch of questions since I entered the company, in different areas, including logical reasoning, mathematical proofs, report writing, agentic tasks, and so on, a lot of them. I just like to see how the model reacts to these problems and how it approaches them. Different models have different personalities when approaching them.

39:20

Speaker D

That's true. And you always need to adjust to them. If we want to give a little guide to people who want to evaluate models themselves, can you give me examples of the questions, like five questions you need to ask a model to understand how it

39:46

Speaker B

works, and if it works well. From the professional evaluation perspective, five questions isn't enough. If you want to do a very standard and very fair comparison among models, you have to make it a statistically confident test. There have to be a certain number of questions in each domain to see how the model performs, and usually you need to test multiple times, because models are not very stable themselves. If you're testing for vibes, use the fun questions. But if we are actually assessing a model's capabilities, we need question sets that are fair across different models and that are correct, because some questions are not correct, and for some questions the answer is not unique. For example, sometimes when we run the test, the environments are not fixed; for example, the golden answers wouldn't pass, and things like that. So if you're doing professional evaluation, you have to make sure the evaluation is correct, it's diverse, and it's above a certain size threshold, so that the test is confident.
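As a concrete illustration of that advice, here is a minimal sketch of a repeated-sampling evaluation harness: a fixed question set per domain, several runs per question, and per-domain aggregate scores. The `model_fn` callable and the sample questions are hypothetical placeholders, not anyone's actual benchmark.

```python
import statistics
from collections import defaultdict

# Hypothetical question set: enough items per domain makes comparison fair.
QUESTIONS = [
    {"domain": "math", "prompt": "What is 6 * 7?", "check": lambda a: "42" in a},
    {"domain": "reasoning", "prompt": "If all bloops are razzies and all razzies "
     "are lazzies, are all bloops lazzies?", "check": lambda a: "yes" in a.lower()},
]

def evaluate(model_fn, questions=QUESTIONS, runs=5):
    """Score a model with several runs per question, since single samples
    are too noisy for a confident model-to-model comparison."""
    per_domain = defaultdict(list)
    for q in questions:
        passes = sum(bool(q["check"](model_fn(q["prompt"]))) for _ in range(runs))
        per_domain[q["domain"]].append(passes / runs)
    return {d: statistics.mean(scores) for d, scores in per_domain.items()}
```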

40:01

Speaker D

You mentioned characters. How do you work with your model character?

41:08

Speaker B

I don't work on my model's character. Here's how I think of this issue: a general model should have all characters, or it should be able to perform all characters. It might have a default character, but if the user wants it to be a different character, it should be able to be; if a character is injected into the system prompt, it should follow it. That's how I view this issue.

41:12

Speaker D

I find it hard to adjust to new models, because they're so different in terms of character all the time. I just don't even understand why it happens.

41:33

Speaker B

I think it has to be related to the data the model was trained on, the different patterns the models have been trained on. And also different teams might have their own constitution in the system prompt, or as the model's default behavior.

41:42

Speaker D

If you look at open models in production today, I don't know if it's a relevant question, but where do they fail first? Open models specifically: reasoning, tool use, state tracking, evaluation blind spots. There are all those risks for open models; where does it break first?

42:03

Speaker B

I think open models are not very good at adjusting to different environments. From what I see right now, take Claude, for example: people use Claude in different coding environments and think it performs well in all of them, with different tool definitions and so on. I don't feel like the current open models can achieve that accuracy, or that level of understanding of the different environments.

42:23

Speaker D

Why? Where is the problem?

42:48

Speaker C

I don't know how Claude does it,

42:50

Speaker B

but for me I think it is a solvable problem, and we are working on it. We are improving it in 2.2, but it's still not as good as, e.g., Opus; for 2.5 it might be. We do have some systematic research going on in the area that has shown some results now, but there's still no concrete conclusion, so I won't say it.

42:51

Speaker D

I'm so curious. But do you think it's a problem of compute, because they have this infinite amount they can just throw at it?

43:15

Speaker B

I feel like compute is one side, but how we structure the problem and how we approach it is another side. And that's where we are more confident that we can solve the issue.

43:20

Speaker D

What can you tell me about M2.2, if it's launched by the time this interview is out? Can you give me some overview?

43:30

Speaker B

Better coding, obviously, and better multilingual coding, and more stable than before. It has better performance than 2.1 in different areas: more stabilized, longer horizons, and so on. We are testing it in different environments right now, and we believe it's better than before. So, different coding environments, right? Even environments we haven't seen before, even environments that are totally out of distribution, we see some very promising scores that are higher than 2.1.

43:38

Speaker D

I wonder, how do you stay updated on everything that happens? Which is super hard, because the pace is just insane. You said when the models are out, you play with them. Do you read research papers? What are your other interests that help you cross-pollinate with what you do? Can you tell me how you stay up to date and what inspires you?

44:11

Speaker B

There are different articles, different blogs going out every single day, a flood of information. How we deal with it is that we have an internal agent that tracks all the new articles, blogs, and papers; it dispatches them to different subjects, summarizes them, analyzes them, and sends them to researchers. So we have an internal researcher, if I can call it that, that does some filtering by itself. It gives us what it has filtered, and we can improve the researcher if we think it doesn't do well. That's how we filter out a lot of information first. And then we play with new code repositories using coding agents, so that we can understand them more quickly and play with them more quickly. So we're keeping pace with all the improvements, with agents and with our own models.
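Here is a minimal sketch of that kind of internal research agent, assuming a generic `llm` callable and RSS-style feeds (`feedparser` is a third-party PyPI package); this is a guess at the general shape of such a pipeline, not Minimax's system.

```python
import feedparser  # third-party: pip install feedparser

TOPICS = ["reinforcement learning", "agents", "inference", "evaluation", "other"]

def classify(llm, title, summary):
    """Dispatch one item to a subject using the LLM."""
    prompt = (f"Choose exactly one topic from {TOPICS} for this item.\n"
              f"Title: {title}\nSummary: {summary}\nTopic:")
    return llm(prompt).strip()

def daily_digest(llm, feed_urls, min_relevance=7):
    """Track new items, dispatch by subject, summarize, and filter."""
    digest = []
    for url in feed_urls:
        for entry in feedparser.parse(url).entries:
            topic = classify(llm, entry.title, entry.summary)
            score = int(llm(f"Rate 1-10 how relevant this is to our RL and "
                            f"agents work. Answer with a number only.\n"
                            f"{entry.summary}\nScore:"))
            if score >= min_relevance:  # only filtered items reach humans
                digest.append({"topic": topic, "link": entry.link,
                               "summary": llm(f"Summarize in 3 bullets:\n{entry.summary}")})
    return digest
```

The "improve the researcher" step she mentions would then amount to editing these prompts (or adding few-shot examples) whenever the filter lets through junk or drops something important.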

44:34

Speaker D

That's fascinating. When you became a researcher, when you chose this path, what did you think you would be doing, and what are you actually doing? Is it close to what you thought?

45:30

Speaker B

That's a really good question. When I joined the team, I thought I would be reading papers every day, because that's what I was doing during school, right? In the lab we would read papers, come up with ideas, implement the ideas, run experiments, and if the results were good, run them at a larger scale. I was about to do that. But what I realized was that after joining the company and working for a couple of months, you're already pretty much at the top of the area, or of the industry, and you have to come up with something that's really new, or you encounter problems that you just don't know how to solve. It's not like you can read a lot of papers and build up your thinking on them. It's more that you need to really understand the problems from the fundamentals and think from the fundamentals so that you can find the right solution. The other thing is that engineering is very, very, very important. I didn't know that during school, because during school, in the lab, it's more like toys compared to companies; it's not that scaled up. But when you scale up data, scale up compute, scale up people, you encounter engineering issues that you need to tackle very beautifully, and engineering is very important. That's part two that was different from what I imagined. Pretty much those two.

45:41

Speaker D

When you work on the model currently, is it mostly that you're solving problems that you see immediately from your hands-on work, or is it that the company says, oh, we have to achieve, let's say, Opus results? How do you set the goals?

47:02

Speaker B

We have a meta goal at the company level. For example, we want to improve AI's capabilities in improving productivity, because that's how people view it. So we have a company mission, and as individual researchers on the team, we have our own missions that we set our own goals within. What is my goal currently for the next generation? I really want the model to work elegantly with experts; it's about better collaboration with experts, with developers. That's my goal as well, but that's maybe two versions away. I think we're launching about one version per month, or a month and a half. For the longer horizon, we are definitely working on it, but for me, the goal I set along that path is something like three months away, while the better-collaboration goal is one or two months away.

47:20

Speaker D

I wanted to ask you a little clarification question about interleaved thinking, which you were talking about at AI Engineer: that the model doesn't settle on one action, it's constantly in a loop of asking more questions and trying things. How do you look at it? Is it continual learning? Is it part of it? What do we need to solve to have the model continuously doing this learning over longer and longer horizons?

48:17

Speaker B

It has some overlaps with the defined concept of continual learning, and by overlap I mean both conceptually and technically. But I don't feel like they are exactly the same; what I talked about at the summit was not at the level of full continual learning. It's more like on the path to that.

48:42

Speaker D

How do you see it being solved? Any ideas?

49:02

Speaker B

We do think that's a different problem definition, or a different way of the model working with people, and we are working on that now with our own defined questions. If I need to say how we'd approach it, I would say through experiments. That's a very interesting question about continual learning, and it's still very exploratory, right? That's definitely where we're going, but it has different phases, different stages. We might approach stage one first while exploring more stages later.

49:05

Speaker D

And you haven't yet outlined the stages?

49:45

Speaker B

We do have our internal definitions, which I didn't prepare today. I would say the first would be to be more stabilized on long-horizon tasks, which is what I said at the summit, right? And then the next thing would be optimization.

49:49

Speaker D

If you can repeat it, because people don't know what you said.

50:04

Speaker B

So for example, we see a model receive environment feedback in a new environment. It needs to know what to explore and what in the environment to look at, because these are partially observed environments. It needs to know which actions to take to receive better information, get better reactions, and then perform harder, more complex tasks in the environment. That's more of stage one, right? That's pretty simple; basically all agent models can do that to some extent, maybe not perfectly, but to some extent. And that's how we can actually solve it with our current algorithms. But we do see different new forms of how a model improves itself in an environment, where we don't have a concrete conclusion yet. Maybe in 2.5 we will. That will be a different definition than what I said: the model itself would be defining its own goal. That's something that would be different.

50:06

Speaker D

Thank you so much. My last question is about AGI. Do you believe in AGI? And if yes, how does it look to you?

50:54

Speaker B

Okay, that's a very large question. People talk about AGI and ASI every day. Actually, when I was interviewing with Minimax, when I was interviewing with our CEO, I said the same thing, because he asked me the same question. What I said was that people talk about AGI, and people have different definitions of AGI, but we can only know the definition of AGI when we achieve it. It is still progressing so fast that the definition changes every day, and people have different takes on it. What I think is more important is that we actually work towards it, work towards our own definitions of AGI. As long as we figure it out, it becomes true. That's what I said during the interview, and that's still my view today: the definition will become true when it becomes true.

51:01

Speaker D

When we see it, we know it's AGI.

51:48

Speaker B

Yes, exactly.

51:50

Speaker D

But we're not there yet.

51:51

Speaker B

No, there can still be better AI intelligence for sure.

51:53

Speaker D

Thank you. One more last question. What was the book that influenced you the most? It can be a recent book or a book from your childhood.

51:56

Speaker B

Let me just double-check the name. Something like The Art of Creativity, something I read during undergrad, so it's been a long time; I don't remember the exact name.

52:05

Speaker D

Yeah, there is a book with a name like The Art of Creativity. How did it influence you?

52:15

Speaker B

It opened up how I think about my own mind a lot, and how I view the world and how I view problem solving. For me now, problem solving is more of a discovery. That's how I would summarize it in one quote.

52:18

Speaker D

Thank you so much. Thank you for your time. That was very interesting.

52:30

Speaker B

Thank you for having me.

52:34

Speaker F

Sunday my time, she's still working on the code, experiments running, curiosity overload. Layer by layer, log probs telling the tale, find the gap between the theory and what the numbers reveal. She said the model tries to hack at everything it sees, so we align it, we refine it, put the world at ease. Intelligence with everyone, scaled-up humanity, another sun. From the open source to the open road, cognitive revolution, let the story be told. Intelligence with everyone. Intelligence with everyone. Interleaved thinking, like the way we move through life, look and learn, adapt and turn, cut through the noise like a knife. 52 calls deep, one conversation wide, the environment is noisy but the model holds the ride. Ten billion strong but running light as a breeze, cost-effective multi-agents doing what you please. Intelligence with everyone, scaled-up humanity

52:59

Speaker D

from

54:30

Speaker F

the open source to the open road, cognitive revolution, let the story be told, intelligence with everyone. ICU in the morning, KTV at night, problem solving is discovery in a different light. I scale to the theoretical extreme, push through, the definition becomes true and the work comes too. Engineering is everything, first principles, first principles, we build a future.

54:30

Speaker B

With everyone.

55:29

Speaker E

If you're finding value in the show, we'd appreciate it if you take a moment to share with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.

55:49