The Evolution of Reasoning in Small Language Models with Yejin Choi - #761
Yejin Choi discusses her research on democratizing AI through small language models, focusing on improving reasoning capabilities through better data curation and synthetic data generation. The conversation covers mode collapse in LLMs, pluralistic alignment approaches, and the broader implications of AI homogenization on human creativity and diversity.
- Small language models can achieve competitive performance through better data curation and synthetic data generation rather than just scaling up parameters
- Current LLMs suffer from significant mode collapse, producing homogeneous outputs even for open-ended questions across different models
- The future of AI democratization depends on nonprofit and academic participation, not just profit-driven tech companies
- Synthetic data generation requires sophisticated filtering techniques using gradient vectors and clustering to maintain diversity
- Pluralistic alignment should accommodate different cultural values rather than enforcing universal neutrality
"The mission really is democratizing generative AI so that it's not just companies who can purchase a lot of GPUs are able to create LLMs and adapt LLMs and serve LLMs."
"Whatever is out of distribution, just make in distribution. Make sure that you make all the out of distribution in distribution. This is how generative AI works."
"I personally think that there's a lot of benefit we could get from AI as well as concerns. So the thorny thing about this situation, current situation, is that both the benefit and potential harms coexist."
"Our brain apparently use less energy than one light bulb."
Even for open-ended questions, the models are not as diverse as we would have expected, to the point that even when you ask multiple times with higher temperature, it may not be able to vary as much. So there's intra-model homogeneity in the model output, as well as inter-model homogeneity, meaning Llama, ChatGPT, and DeepSeek R1 all have strikingly similar behavior.
0:00
All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Yejin Choi. Yejin is Professor and Senior Fellow at Stanford University in the Computer Science Department and the Institute for Human-Centered AI, or HAI. Before we get going, be sure to take a moment to hit the subscribe button wherever you're listening to today's show. Yejin, welcome back to the podcast. It's been a while.
0:45
Oh yeah, thanks for having me back.
1:11
Absolutely, absolutely. I think we last spoke in the fall of 2021, which seems like ages ago in AI years. I would love to kind of jump in and have you bring us up to date on what you've been working on since then. And actually for folks who didn't catch that one, maybe start with a little bit about your background.
1:14
The time when I was on your podcast, I was still maybe best known for working on common sense knowledge and reasoning. And back then I was also working on natural language generation quite a bit. Of course, since then a lot has happened. So more recently I've been excited about reasoning, especially making small language models reason better. So I'm broadly interested in large language models, small language models, large reasoning models, small reasoning models, and then how we could make models align better with pluralistic norms and values.
1:36
Nice, nice. What drives your interest in SLMs? It seems like a lot of the action is in large language models, and we're working hard to get the smaller ones up to the same level of performance. What's your particular interest driven by?
2:21
Yeah, so the mission really is democratizing generative AI so that it's not just companies who can purchase a lot of GPUs that are able to create LLMs and adapt LLMs and serve LLMs. But also, you know, people like myself and colleagues who are academics, for example, who cannot buy as many GPUs: is there something really meaningful and fun that we could do even with a smaller counterpart? And at the end of the day, I believe that fundamentally it should be feasible. It's only that the world has invested so much more into exploring what happens when you scale things up. Whereas if we invested even a fraction of that, you know, just a little bit more, I do think that we can unlock a lot more exciting capabilities out of small language models. Part of my research is also driven by the desire to find really better ways of teaching intelligence to machines. Currently, it's just so data centric. We can talk about that in more detail later in this podcast. But it's just so data dependent and, you know, that's pretty much the only way we know how to teach AI about human knowledge and intelligence. In the future, I don't know whether we will find those solutions or not. But as academics, I feel like we have to give it a try: to find an entirely better solution that is so much more data efficient and able to learn so much more with much less data.
2:39
When you think about how the space, the industry, evolves, and your comment about where all the investment has gone, why do you think that is? Do you feel like the investment has just quickly followed what works, without us taking time to step back and identify all of the opportunities to optimize? Or do you think that there are particular impediments to smaller models that make them inherently more challenging?
4:43
There's definitely a snowball effect and then a sheep-herding effect. You see other sheep going somewhere and then, you know, you want to follow, because it's a safe choice, especially when raising funding is... yeah, raising funding is not as hard as it used to be for AI. So scaling is a guaranteed and proven way of increasing intelligence. So why not? And in fact, I'm not against such effort. It's really interesting to see and watch how much intelligence scale can unlock. I appreciate that some people went crazy and found the frontier of what happens with scale. Having said that, I do worry about everybody trying the same thing. I think it's very important that we try different ideas, especially historically: whatever innovation happened with, you know, computers or phones, they're always very large at the beginning, and then over the course of time, people figure out how to make them smaller yet more powerful. So the same thing will definitely happen with generative AI as well. In fact, there's already a lot of research effort that makes models smaller but more powerful. And I think we can do so much more, so much better, if we put more mind and effort into it.
5:22
And how do you think about the different attack vectors or approaches to tackling this problem? What do you feel is already being explored, and where do you think there are opportunities that really haven't been explored very effectively thus far?
6:57
Yeah, so there are multiple routes. I think at the beginning people were trying to think about compressing larger models into smaller models by either quantizing them or pruning some neurons and such. That's, in some sense, a more mechanical, optimization-based approach to turning larger models into smaller models. And it does require larger models in order to make smaller models. So there's that. Nothing wrong with that; it's nice to have that option, but it's not the only way. In the short term, I think having new architectures, like some hybrid between state space models and conventional transformers, like the Mamba hybrid from Nvidia for example, could be an alternative way of making small models more powerful. But there can be other ways, such as making data better, especially providing much more powerful data. And this data usually has to be at the outskirts of the Internet data, meaning the kind of data that the Internet couldn't quite provide, in order to teach the model to do certain kinds of reasoning better. If we have much higher quality data, small models usually learn so much faster. So that's another way.
7:20
When you say data kind of on the outer reaches of the Internet, what are some examples of these sources? I think it's commonly said that we've found all the data that is available to the public and we train all of the large models on this data. And it's often proposed that the future is going to come from unlocking new types of data; video, I think, is the obvious one that people talk about. Is that the kind of thing you're referring to, or do you have other ideas about what will be effective for small models?
8:58
Roughly speaking, yes. But let me clarify what I meant by better data, because Internet data is not so bad, at least in terms of quantity. But when we look at the LLM pipeline, the pre-trained model is never good enough. Despite the scale of it, the pre-trained model is never good enough, and you have to do post-training on a fairly large amount of data that usually is different from the Internet data. Supervised fine-tuning as well as RL require the kind of data points that are curated by humans just for the purpose of teaching AI. These are not things that you just download from the Internet; you maybe pay someone to write those data points. And the more common practice these days is not even crowdsourcing the data, but rather hiring experts, like lawyers or former International Math Olympiad winners. These are real experts, and you have them write data for you. So a lot of expert data is being collected as well. And even that is not enough for AI, because AI is so data dependent. So more recently, people also do a lot of automatic synthetic data generation. Now, synthetic data, if you do it in the vanilla way, you know, just ask LLMs to write some problems or solutions for you, then oftentimes it's not good enough, or it could be just a repetition of the same thing. So it does require a lot more effort in the way that you design prompts, and then have even a pipeline of different models making the prompt better or making the solution better, revising the solution, taking lots of iterations. So it's not as simple as just asking ChatGPT to write data for you. But if we do it quite right, then it can lead to new data points that didn't exist on the Internet. It could be really high quality data that is qualitatively different from what was on the Internet. And a prime example of this is hard math solutions.
So Internet data does have a lot of math, but it doesn't necessarily have solutions to a lot of hard math problems. So you have to come up with those solutions, either by asking experts to write solutions for you, or by using LLMs in some way to generate good solutions, even though they're not quite capable of doing it yet. For example, you could use reinforcement learning with verifiers that can do a lot of exploration and see which AI-generated solutions happen to be correct according to the verifier. Then you collect that data, which is implicitly used as good data during RL to amplify that model behavior. But sometimes people then use those data points to do imitation learning on top of it, iteratively.
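As a rough illustration (not the actual training code), the verifier-filtered data collection described above can be sketched as rejection sampling: sample candidate solutions, keep only the ones a verifier accepts, and reuse the survivors as imitation/SFT data. Here `sample_solution` and `verify` are hypothetical stand-ins for a model call and an answer checker.

```python
def collect_verified_solutions(problems, sample_solution, verify, n_tries=16):
    """Rejection-sampling sketch: sample many candidate solutions per
    problem and keep only those a verifier accepts; the survivors can
    then be reused as supervised fine-tuning / imitation data.
    `sample_solution` and `verify` are illustrative stand-ins."""
    sft_data = []
    for p in problems:
        for _ in range(n_tries):
            sol = sample_solution(p)
            if verify(p, sol):
                sft_data.append((p, sol))
                break  # one verified solution per problem is enough here
    return sft_data
```

In practice the verifier might be an exact-match answer checker or a unit-test harness; the point is that only trajectories ending in a verified-correct answer survive into the training set.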
9:34
That sounds like the application of a fairly broad variety of approaches. You talk about synthetic data generation, you talk about imitation learning, you talk about rl. These are all things that historically have been like their own field of research and then put into practice independently. And now you're talking about integrating them together. Is that a significant element of how you think this problem gets solved integrating a lot of these ideas?
13:08
Yes. And in fact, in some sense the art of artificial reasoning is fairly artificial in the way that we have to orchestrate all these complex, almost like system style research in order to make sure that things are done at the right time, in the right way, in the right sequence, and then iterate over and over.
13:39
Elaborate a little bit on the use of imitation learning and how you see that playing into the pipeline.
14:07
Yeah, I am blanking on which company, which model; it may have been Llama, actually, whose white paper described how they did the post-training. It may have been Llama 3. They were doing something like: first the pre-training, and then supervised fine-tuning, instruction tuning, as well as supervised training on some of the exam-style data. And then finally there's a reinforcement learning phase. It's not uncommon to iterate between RL and SFT, such that after doing RL you find some good behaviors and then you want to kind of drill in on that. Here's another example: DeepSeek R1 does some amount of this imitation learning after the reinforcement learning, in that they provide distilled versions of DeepSeek R1 in smaller models. What's interesting is that they don't do straightforward RL at those small scales; rather, they do distillation from the stronger model that's already been trained through reinforcement learning. They say that if you just do reinforcement learning, sometimes the model starts code-switching in the middle of solving math problems; it's suddenly speaking in Chinese and English, back and forth, or some other languages that may not make sense to human readers. Reinforcement learning only cares about whether you got the final solution right or not. It doesn't care about how you got there. So strange behaviors can be emergent, and they can even be reinforced. So if you don't want that, if you want interpretability, in the sense that chains of thought can be interpreted by humans and verified, then you want only those chains of thought that lead to the correct solution in a form that you like.
So then you can filter the output solutions of this stronger model that went through reinforcement learning, and collect only the better examples in order to teach smaller models. And in general, this leads to a model purely based on imitation learning that's very powerful and often has the better behavior that you wanted.
14:15
I guess what I'm trying to get clearer on is maybe more fundamental. When I think about imitation learning, I always think of the canonical example of showing a model a YouTube video of how to do something, and then the model learns how to do it by imitating what it sees in the video. When we try to apply that to more traditional textual data, it's not super clear to me how that is distinguished from just supervised fine-tuning or RLHF or something like that. Can you explain what makes it imitation in the application or pipeline you described?
17:11
Oh, imitation learning just means supervised fine-tuning in this context. But sometimes people say imitation learning when it's performed either right before or after RL, because then there are some example trajectories that you want to imitate from. But it's really just supervised fine-tuning.
18:00
Okay. Yeah, okay, got it. We were talking a little bit about how one of the things you might want to do is use the model to generate synthetic data for you, and you talked about how that tends to not work. I don't think you mentioned it, but I'm assuming we're talking about mode collapse here. And that brought to mind the Artificial Hive Mind paper, which was highlighted as one of the award-winning papers at this past NeurIPS, and which really talked about the implications of this kind of mode collapse. It's not necessarily in this small-models thread, but can you talk a little bit about that paper and what you found interesting about it?
18:24
Sure, yeah. So mode collapse is a real concern with LLM generation. What we find in our paper is that even when you ask open-ended questions, like, you know, tell me a joke about time, or tell me something wise about time, or tell me a story about something, you expect that, well, there's no one good answer, and therefore language models should be able to generate a diverse set of answers. Even when you ask, by the way, hey, give me a random number between 0 and 10: it's not random.
19:17
I was thinking about that. Right?
20:07
Yeah, yeah, it's usually like seven, or, you know, 13 if you ask over a bigger range. So it's not random, because the data is not random; the data is skewed in the first place. So when you sample from the distribution of the data, what you get is a skewed distribution. That's part one of the problem, even after pre-training. The bigger problem is after post-training, like supervised fine-tuning and RL: the output probability of the model becomes even more skewed, zeroing in on the stereotypical answers that people tend to like. And so we find that, I mean, of course there are questions for which you shouldn't vary the answer at all. A factual question gets a factual answer; you don't vary that. But even for open-ended questions, the models are not as diverse as we would have expected, to the point that even when you ask multiple times with higher temperature, it may not be able to vary as much. So there's intra-model homogeneity in the model output, as well as inter-model homogeneity, meaning, you know, Llama, ChatGPT, and DeepSeek R1 all have similar behavior, strikingly similar behavior. Sometimes they generate output that's almost verbatim identical, which is very strange. So that's the gist of the paper that we presented at NeurIPS, Artificial Hive Mind. It's a bit of a concern to me, because more and more people utilize LLMs to post things on the Internet, and I wonder what happens to our Internet. The Internet used to be the artifact of human intelligence. It really encapsulated the vastly different ways people write and think; it's a historical artifact of human intelligence. Now it's really becoming the artifact of LLMs, mixed with some amount of human intelligence. But what if it becomes more homogeneous and less reflective of the diverse spectrum of human thoughts?
Unless we're going to do something about it. That is my concern.
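One crude way to probe the intra-model homogeneity described here (not the metric from the Artificial Hive Mind paper, just an illustrative sketch) is a distinct-n score: the fraction of unique n-grams across repeated samples for the same prompt. Identical outputs drive the score toward zero; varied outputs drive it toward one.

```python
def distinct_n(samples, n=2):
    """Fraction of n-grams that are unique across a set of sampled
    outputs: near 1.0 means diverse, near 0.0 means the model keeps
    repeating itself (a rough intra-model homogeneity probe)."""
    grams = []
    for s in samples:
        toks = s.lower().split()
        grams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```

The same function applied to outputs pooled from different models gives a cheap read on inter-model homogeneity as well.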
20:08
And did you study that kind of second part, the implications of this, as well? Or were you primarily focused on demonstrating the effect and how mode collapse works in these open-ended scenarios?
22:52
Our study stops at just studying how homogeneous these models are, especially after post-training; pre-trained models are better in this regard. Although this conversation reminds me of a study that I saw that looked at the language use of Reddit forums before and after ChatGPT, and they found that even Reddit posts are not as diverse as before.
23:14
There's a lot of delving now that wasn't happening before.
23:50
Yep, probably, yeah. You know actually whenever I see the word delve in anybody's writing I'm like what did you do?
23:54
Yeah, yeah. It's interesting. I think, you know, folks that use a variety of LLMs see this behavior a lot. You ask a fairly open-ended question across multiple LLMs and you get very, very similar responses.
24:04
And.
24:26
I think, in the context of training data, the question is often asked: okay, so how do we improve models if models are, quote unquote, smoking their own exhaust, for example, training on this synthetic data? The implication being that we accelerate further mode collapse. But what I found interesting about this paper is that, yes, there's that, but not only that: what is the impact on the reader, the ecosystem, the humans that are in this environment where this synthetic data is being posted as articles? Is it a hive mind? Is it changing the way we're thinking?
24:29
And.
25:19
You know, I think the broader thought was: you're affiliated with HAI. That seems like the place to study this across disciplines, and I'd be super interested in hearing more about that. Let me know if you hear of any work in that regard.
25:21
Oh yeah. So yeah, half of my Stanford affiliation is with HAI, the Human-Centered AI Institute. Therefore, easily half of my research has to do with AI's impact on humanity. And I personally think that there's a lot of benefit we could get from AI, as well as concerns. So the thorny thing about this situation, the current situation, is that both the benefit and the potential harms coexist. And in some ways, depending on how we pursue AI research from here on, the future can be drastically different, is how I feel. On one hand, it could be that LLMs influence human intelligence such that we lose individuality and we lose diversity. But on the other hand, it could also lead to a future in which the opposite is true, at least for some humans: I think they might be able to become even more specialized and even more creative with the help of AI. And then many others might choose to be overly dependent on AI and lose their own thinking; they might just be super dependent, so that whatever AI says is what they say. So the gap between the best-case scenario and the worst-case scenario might actually increase, is one possibility. In any case, I believe that it's important to be aware of potential harms and worst-case scenarios in order to do something about it. So even for this problem that we found, that LLMs are more homogeneous after fine-tuning, there is follow-up research, which was pursued concurrently with the Artificial Hive Mind work but which we haven't talked about yet: spectrum tuning. My former student Taylor Sorensen worked on this idea called spectrum tuning. It's a kind of post-training method that teaches the model to retain the spectrum of different ways of generating the output, instead of just honing in on the correct answers that were presented by the post-training data.
So I do think that when we are aware of the problems, we can then try to seek solutions, either by designing new post-training algorithms or by ensuring that the post-training data is diverse in the first place, because then these data-dependent models can become less skewed than they otherwise would have been. So I think there is a lot more future research to be done in this space in order to mitigate the potential concerns about generative AI.
25:44
When you started to describe this idea of there being two possible outcomes in that gap widening, you talked about the way we pursue research as being kind of central to which of those outcomes we tend towards. Can you elaborate on that and the role you see research having in determining direction there?
28:53
Yeah, so I don't think that LLMs doing really well on math data, math problems, will necessarily be the most beneficial thing for humanity. That's one crude way of saying it, but it kind of encapsulates what I believe about the future of AI and humanity, which is that we really have to work on specific problems more explicitly. Like, if we care about democracy, we then need to work on designing AI that can make humans more democratic, able to understand each other through the democratic process, and able to work with different opinions through the democratic process, as opposed to building AI that optimizes for people's attention and engagement with feeds, which could then increase the tension among us, because those profit incentives are not necessarily aligned with what humanity at large should aspire to achieve. So in order to get what we want, we cannot just leave it to a few tech companies doing their wonderful jobs. I mean, I'm sure they have a lot of members in their companies who have good intentions, well-intended people. But it could easily be, especially when there's profit competition and engagement competition, that things unroll in a way that is not as beneficial for humanity. So when we think about how to make AI serve humans: this is where, by the way, when I think about AI democratization, the way that I like to think about it is AI of humans, for humans, and by humans. If it's AI for humans, it should really be AI for all humans, not just some humans working for some companies in some countries. Not only that, if we are not very careful with how we design AI, I think it could come to a point where it's not even AI for humans, but more like AI for AI, and even worse, humans for AI. You know, humans working for AI, essentially.
And for that, I think it's very important for the nonprofit sector to participate in designing the future of AI, not just the profit sector.
29:19
Part of that echoes the idea of kind of the brightest minds of our generation focused on making people click ads as opposed to, you know, advancing humanity, science, et cetera. You're suggesting that there's kind of an AI oriented aspect to that as well, that we need to be on the lookout for and be proactive in defining our future.
32:22
Yeah, and more investment is needed to support research that really thinks about AI's impact on humanity, not just increasing the benchmark scores on math problems.
32:52
Let's come back to that. We were, I think, in the middle of our small models conversation, and I think we were talking broadly about making small models perform better. But I don't recall us getting to the specifics of reasoning in small models and the unique characteristics of reasoning as it pertains to small models, or small reasoners. Is that something you're looking at as well?
33:07
Yeah. So I'm quite excited about making small language models better reasoners, especially because reasoning is such an important intelligence capability that any model in the future has to be good at. And this is also an interesting challenge, because Internet data doesn't really equip LLMs to reason well right away. They can reason to some degree, but they do require a lot more post-training in order to infuse better reasoning capabilities. So my research has recently focused on how to make small models reason better. That requires both feeding in better data through the supervised fine-tuning phase, and that data has to be really high quality and diverse, as well as other kinds of algorithmic approaches that can squeeze better intelligence even out of small language models that seemed kind of hopeless.
33:48
Are there any specific techniques coming to the fore with regard to the data curation side of things? Or is it largely manual labor, humans curating these large data sets to increase quality? What are you seeing there?
34:58
I think there's a huge future in synthetic data. I can give you one example from our recent work, called Prismatic Synthesis. It's a synthetic data generation algorithm, Prismatic because it acts a little bit like a prism that can scatter the light to make it more diversified. What we do is basically math problem synthesis; actually, it's more like math problem and solution synthesis. And we are doing this using a DeepSeek R1 32-billion-parameter model as the teacher model. Now, 32B parameters is medium sized, or a little bigger than medium sized these days, but it's much worse than DeepSeek R1, the full model, the biggest model, which is a 671B-parameter model. So that's about 20 times bigger than the model that we chose to use as the teacher. In this work we primarily focus on making supervised fine-tuning data for hard math problems using this medium-scale teacher model. And then our goal is to compete against the alternative, which is to use a much stronger teacher that's 20 times larger. Now, in general that's a really difficult game to play. It's really hard to beat a teacher that's 20 times larger, because the performance gap is significant. So how on earth do we close the gap? Algorithmic ways of filtering the data to ensure the diversity of the generated data. Because no matter how good your teacher is, as we demonstrated in our Artificial Hive Mind paper, the outputs are all repetitive, all homogeneous. So you have to put a lot of effort into diversifying them. The way that we diversify data is we look at the gradient vector of the output given the input, using a small-scale proxy model. It is so small: it's only a 1.5-billion-parameter model. We just use Qwen 1.5B, downloaded from the net, and we use it as a proxy model to compute the gradient of the output given the input. And these input-output pairs are the synthetic data that we just synthesized using the DeepSeek R1 32-billion-parameter model.
So we look at the gradient representation of each data point that this teacher model generated, and then we look at how they differ from each other using K-means clustering. It's an old-fashioned clustering mechanism that, you know, still works in this modern day. So we do tensorized K-means clustering to see which clusters are overrepresented and which data points are underrepresented. We filter out overrepresented data points really aggressively: we throw out the vast majority of all the data that we just synthesized, and only keep those that are unique and different from each other, and then use those data points to prompt the teacher model in the next round. So we iterate through this: over-generate, then filter aggressively using gradient vectors, then over-generate and filter aggressively again, until we gather 1 million data points. So it's a lot of over-generation and filtration. And then we find that that 1 million data points is actually better than the 1 million data points that you generate from the stronger teacher model, the best teacher model.
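A minimal sketch of this gradient-clustering filter, assuming per-example gradient features from the proxy model have already been computed and flattened into vectors. The plain k-means below, the cluster count, and the per-cluster cap are all illustrative choices, not the paper's actual settings.

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Plain k-means on feature vectors (a stand-in for per-example
    gradient features computed with a small proxy model)."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # distances: (n_points, k) -> nearest-center labels
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    return labels

def diversity_filter(feats, k=8, cap=5, seed=0):
    """Keep at most `cap` examples per cluster, aggressively discarding
    the overrepresented modes so the surviving set stays diverse."""
    labels = kmeans(feats, k, seed=seed)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        keep.extend(idx[:cap].tolist())
    return sorted(keep)
```

The surviving indices would then select the synthetic examples that seed the teacher's prompt in the next round of over-generation.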
35:20
Can you talk a little bit about the kinds of prompts and responses that you're seeing in this example?
39:22
It's just hard math problems. Yeah, it's all hard math that requires a very long solution. And by the way, when we auto-generate all the solutions, we have no idea whether the answer is correct or not, right? So we play a bit of a trick. The trick is simple. Our case is fully synthetic, even for the problems. A lot of other synthetic data usually relies on problems that exist on the Internet, so that you only solve legit problems. But in our case we generate the problems as well, because we really wanted to explore a diverse scope of the math reasoning domain. But these solutions generated for made-up problems may or may not be correct. So what we do is ask the model to solve the same problem multiple times and then check whether the final answers are identical to each other. If not, then we worry that the quality might be bad. It's a very crude way of filtering data, but it worked well enough for our case. And I think this is a powerful method for controlling the quality of synthetic data without human validation.
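The answer-agreement trick can be sketched as a self-consistency filter: sample several solutions per synthesized problem and keep the problem only when the final answers mostly agree. Here `solve` is a hypothetical stand-in for sampling one solution and extracting its final answer, and the sample count and agreement threshold are illustrative.

```python
from collections import Counter

def self_consistency_filter(problems, solve, n_samples=8, min_agree=0.75):
    """For each synthesized problem, sample several solutions and keep
    the problem (with its majority answer) only when the final answers
    mostly agree: a crude quality check without human validation.

    `solve(problem) -> final_answer` stands in for sampling one full
    solution from the model and extracting its final answer."""
    kept = []
    for p in problems:
        answers = [solve(p) for _ in range(n_samples)]
        answer, count = Counter(answers).most_common(1)[0]
        if count / n_samples >= min_agree:
            kept.append((p, answer))
    return kept
```

Problems whose sampled answers disagree are simply dropped, trading recall for a training set whose answers are, by and large, self-consistent.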
39:31
Got it. I don't know if this is the right way to ask this question, or if there's a better way to ask it, but I'm trying to get a sense: I think you mentioned generating a data set of a million points, and it's happening over multiple rounds, I guess.
40:59
And.
41:23
I'm trying to understand like the degree to which the prompt is varied across rounds or are you starting from like a prompt and developing variants based, you know, kind of some logarithmic number of variants with each round without any kind of variance of the prompt.
41:25
Excellent question. So for prompting, we show some examples: hey, generate math problems of this kind. And what we change through these multiple iterations of overgeneration and filtration is the examples that we show in the prompt. The examples come from the previous iterations, in the hope that the model may feel more inspired when provided with newer, more different examples in the context.
41:47
And at each stage, are you validating the answer to the question, or are you not?
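The round-over-round loop she describes, where each round's prompt exemplars are the survivors of the previous round, can be sketched abstractly. All names here (`generate`, `filter_diverse`) are hypothetical stand-ins for the teacher model and the gradient-based diversity filter, not actual APIs from the work.

```python
def iterate_synthesis(generate, filter_diverse, seed_examples, rounds=3, target=20):
    """Sketch of the overgenerate-then-filter loop: each round prompts the
    teacher with survivors of the previous round, so later rounds drift
    toward less-explored regions of the problem space."""
    pool, exemplars = [], list(seed_examples)
    for _ in range(rounds):
        batch = generate(exemplars)          # overgenerate from current exemplars
        survivors = filter_diverse(batch)    # keep only unique/diverse points
        pool.extend(survivors)
        exemplars = survivors or exemplars   # next round's in-context examples
        if len(pool) >= target:
            break
    return pool[:target]

# Toy demo with integers standing in for problems.
gen = lambda ex: [e + i for e in ex for i in (1, 2)]
filt = lambda b: sorted(set(b))
pool = iterate_synthesis(gen, filt, [0], rounds=3, target=20)
print(len(pool))  # 9 toy "problems" collected over 3 rounds
```

In the real pipeline the pool would grow to the 1 million points mentioned earlier, with the filter doing most of the work.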
42:21
We do validate whatever we decide to keep, so that the final 1 million examples that we've synthesized hopefully have, by and large, actually correct answers. Not guaranteed.
42:33
The core idea here is, as opposed to starting with some prompt and saying generate a million examples, you have N stages, and at each stage you generate some section and kind of disperse from there. And you've found that to work well for maintaining a degree of diversity in the dataset.
42:55
So this is, you know, depending on how you view it, a simple idea and a simple method, which is the big benefit of this method. Because the principle here is that you really want to make data that's qualitatively different from the Internet data that's easy to generate. You really need to go to these relatively less explored regions. And then of course, you know, you need to make sure that the quality is reasonably good there as well. But really it's diversity that we are trying to quantitatively enhance in the data set production. And I think that alone can really go quite far. Like, if you really cover diverse ground, then your model can perform so much better. Because at the end of the day, today's LLMs, no matter how amazing they are, are only as good as the data they were trained on. So you have to show similar examples. The more similar examples, the better, for whatever the model may have to deal with during test time. You know, the way that I like to put it is that whatever is out of distribution, just make in distribution. Make sure that you make all the out of distribution in distribution. This is how generative AI works. This is how even self-driving cars work. Make sure that you cover all the roads, corner cases, everything, and that should go into your training data. This is really different from how humans learn to drive. We do not need to see a lot of these corner-case examples; we just deal with it. Which is the real mystery of intelligence that I wish one day we will have some answers for. We're so data efficient. But right now, under the current framework and current paradigm, the only way for generative AI is making sure that all the out of distribution becomes in distribution. However you do it, just make sure that that's going to happen. And so that's why post-training requires curating a lot of data, or even synthesizing and curating, or taking some combination of the two. And then even that's not enough.
Therefore you do reinforcement learning at scale, because reinforcement learning is another way of making out of distribution in distribution, by having the model explore for itself all these other unexplored areas, making sure it's all explored before you get into the testing phase.
43:21
You know, whenever I'm in these conversations where we're talking about the role of data, and the idea of using varying techniques to improve the data for small or large models, it brings me back to a few years ago when your colleague there at Stanford, Andrew Ng, started planting this flag around data-centric AI. And as you noted earlier, all AI is data-centric, so what does that really mean? But the idea that we're going to improve models by focusing on the data that's used to train them, as opposed to iterating on the algorithms, continues to resonate.
46:02
Yeah, in that sense, we didn't go very far. We're just doing a lot more of it in a more empirically powerful way. And still more can be done. I think it's a bit disappointing when we look at the seemingly magical, you know, generative AI frontier models through this lens. But it is what it is, and I think we can do a lot better even following the current paradigm. But, you know, this is a recurring theme of our conversation: there must be a fundamentally better way of doing this. And can we find it? In some ways nature found a solution, which is the human brain, and the human brain requires so little energy. Our brain apparently uses less energy than one light bulb.
46:44
And so, you know, thus far you're proposing that one way to do this is to focus on the data, to create synthetic data that follows a diverse and distributed pattern while still being constrained in quality. You are also looking at ways to incorporate reinforcement as part of the pre-training objective. Can you talk a little bit about that work?
47:50
Yeah. So that's reinforcement learning as a pre-training objective, a new paper that we recently put out. Roughly speaking, the idea is that during pre-training the model is forced to be completely passive, in the way that it learns to predict which token comes next. But what if we encourage the model to think for itself before predicting the next token? What if we encourage the model to think for itself by generating, you know, something like a chain of thought, and then predicting the next token? And in that context, because it's reinforcement learning, we now need to think about the reward. There could be different ways of designing this, but the key idea of our approach is to make the reward the information gain of predicting the next token with thought compared to without thought, so that now you have to be able to predict the next token even better than yourself predicting the next token without a thought. So, you know, you have to learn to think better, so that your next-token prediction probability becomes better than your own prediction probability without a thought. That way we encourage the model to think for itself before answering the next token.
48:27
And when you said information gain, and then you went on to say that you want the model to predict the next token better... when I hear information gain, I think of maximizing surprise, in the sense that you want to give reward to predictions that... I don't even know how to describe it. Not necessarily better in the sense of more accurate, but better in the sense of maybe more diverse, is even the way I'm thinking about it. Can you elaborate on that part?
50:30
Yeah, actually we kind of do the opposite of diversification in this context. Because we frame this reinforcement learning approach under the pre-training framework, in which it's all about next-token prediction, we go with that overarching framework. So it's all about next-token prediction, but we give it some RL-style flavor by incorporating a reward during the last phase of pre-training, defining the reward as the information gain of being able to predict the next token with a good intermediate thought. So what we look at is the conditional probability of the next token given all the previous tokens concatenated with your own thought, and we compare that with the conditional probability of predicting the next token given all the previous tokens without your thought. You compare these two quantities, and the challenge here is that you get the reward only if your intermediate thought actually increases the conditional probability of predicting the next token when you concatenate your intermediate thought in addition to all the previous tokens. So it's not an easy reward to get.
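The reward she describes comes down to a difference of two log-probabilities. Here is a minimal sketch of that quantity under the stated definition; the function name and the epsilon smoothing are my own assumptions, and a real implementation would compute these probabilities from the model's logits.

```python
import math

def information_gain_reward(p_with_thought, p_without_thought, eps=1e-12):
    """Reward sketch for reinforcement learning as a pre-training objective:
    the information gain from thinking, i.e. how much the model's own
    chain of thought raises the log-probability of the ground-truth next
    token versus predicting it directly without the thought."""
    return math.log(p_with_thought + eps) - math.log(p_without_thought + eps)

# A thought that doubles the next-token probability earns about log(2) reward.
r = information_gain_reward(0.4, 0.2)
print(round(r, 4))  # ≈ 0.6931

# A thought that lowers the probability earns a negative reward,
# which is why it's "not an easy reward to get".
print(information_gain_reward(0.1, 0.2) < 0)  # True
```

The asymmetry matters: the model is only rewarded when its own thought genuinely improves its next-token prediction over its thought-free baseline.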
50:45
So not information gain with regard to the token, but information gain with regard to the thought relative to generating the tokens.
52:15
Yeah, yeah, yeah, yeah.
52:26
You just mentioned it's not an easy reward to get. I'm hearing in there, you know, aspects of inference. To what degree do you expect the complexity of training to increase with techniques like this?
52:27
It makes the computation of pre-training much higher than before. So in our work we do a lot of experiments. One experimental setting is to control the amount of tokens during pre-training with or without RLP. If you use the same amount of tokens, then we use a lot more FLOPs, as you noticed. So another empirical setting is that we control for the FLOPs, so that in the last phase of pre-training we use way fewer tokens but an identical number of FLOPs. And what we found, to my big surprise, is that when you finish your pre-training in this way, even in the FLOP-controlled setting, the final pre-trained model does much better after post-training. Not only does this model do better on reasoning benchmarks at that point in time (of course it will do better on the reasoning benchmarks, because it was incentivized to think in order to predict next tokens), but also, if you apply the same post-training recipe, like sequential fine-tuning followed by RL, the performance gain survives, such that your model now performs even better with a reasoning-heavy post-training recipe. So what does this entail? It's a bit analogous to how, for humans, there's this critical period during which you have to acquire language, for example, and probably it's a good idea to learn math and logical thinking reasonably early in your life as opposed to much later in life. So something like that is happening even with pre-training, which we found empirically.
52:53
Interesting, interesting. You know, I mentioned that we would come back to this broader idea of democratizing AI before we close out, and one of the topics there that you mentioned is pluralistic alignment. Can you elaborate on what that means and why you're excited about that idea?
55:00
Yeah, so that goes back to this earlier statement I made about AI of humans, for humans, and by humans. AI of humans is about the origin of AI being really humans: the values, the knowledge, the different kinds of norms that AI learns from should really reflect the entirety of humanity. So that's the idea. And of course the Internet does not evenly reflect all of humanity, therefore the resulting AI is also biased in some ways. And the question is, is there anything we can do?
55:24
So it's really hard, debiasing the Internet for use as training data. That sounds hard.
56:14
It's almost impossible, right? Because you cannot go back and change history to make an even number of kings and queens. Whatever happened in humanity already happened. So you're not going to change that.
56:23
Although, you know, on the other hand, we have debiasing statistical techniques. And so if you think of your training set as just kind of a data set, we have ways to debias that. The question that jumps out to me is, what do you lose in doing so? Do you lose some fundamental aspect of, you know, the Internet that made these LLMs, which we don't understand how they work, work?
56:38
Yeah. You know, I don't think debiasing is the right answer, in the sense that debiasing is also impossible, but also that sometimes you do want to maintain your own bias. For example, if you're a religious person, or if you're from a certain country where you have particular norms that you like to go by, then maybe, you know, we want to respect that. By the way, as humans, when we interact with other human beings who we know have different values, we have ways of navigating around this person such that we maintain politeness, we maintain respect, we agree to disagree. Right? So to some degree, I think it's very important that AI is aware of diverse values and is then able to navigate around them, as opposed to just being completely neutral everywhere, which may not only be unattainable, but also may not be the desirable solution if we are trying to really serve different cultural norms in a respectful manner. So in my work we think about pluralistic alignment from three different angles. There's something called Overton pluralism, distributional pluralism, and steerable pluralism. These concepts require explanations. Maybe let's start with Overton. So Overton pluralism means, when you ask...
57:12
In the sense of the Overton window.
58:47
Yeah, yeah. So it's like when you ask a question that's politically thorny, for example, that could have different answers. The best way might be for the LLM to just present all of the reasonable opinions: hey, the answer is that people have different opinions, here's one view, there's another view, and be able to include all of them, as opposed to picking the majority opinion, because that marginalizes out the rest. So, being able to cover all of these options. Distributional pluralism is when AI is used in more of a decision-making process, where maybe AI is doing job application filtering, or AI has to answer questions in a more categorical manner. You have to choose an answer; you cannot give all of the answers. Then, distributionally, the distribution of LLM answers should mimic the distribution of human decisions. So at each point in time, AI might be making a decision that differs from other human decisions. However, when we look at the overall distribution, instead of going for the majority case all the time, which would be distributionally super skewed, the idea is to try to be at least more distributionally even. Compared to now... of course humans have a bias, so it's not like our distribution of decisions is necessarily fair or unbiased, right? But at least let's not get worse than that, is the idea. Now the last one, steerable pluralism, is that you are able to steer the model to a different value framework or moral framework, to serve your day-to-day need, within a scope where it's reasonable to execute this.
58:50
Meaning in some scenarios you might want the model to be more or less pluralistic in the way it operates.
1:01:05
Yeah. So the model should be able to steer to any different value system that's reasonable. Of course, the question is what is reasonable, because maybe we don't want to allow the model to be steered to support people who want to be criminals. A criminal's value system probably should be completely out. But within the reasonable, legal, socially acceptable scope, there's the ability to steer your model to serve your value system.
1:01:16
How far along are you in identifying ways to do these things, beyond identifying these three dimensions of pluralistic approaches?
1:01:49
So this sort of research requires both data research as well as algorithmic research. To my delight, a lot of smart academics have started developing solutions on both fronts. So there are some new algorithms that do more pluralistic alignment, e.g. for distributional pluralism, and other people are working on this. But one could argue that, hey, the frontier models are not so bad. Frontier models in general are better at this kind of pluralistic alignment compared to open source models that went through less effort in terms of data curation and, you know, safety guardrails and everything. So in the spirit of making smaller, especially open source, models more powerful for wider accessibility, this requires more academic research to make and share data, as well as coming up with algorithmic innovations to get around the limitations of the data. There's still a lot of work to be done, but at least there's a community effort.
1:02:03
That's awesome. So we're reconnecting at the beginning of the year, the beginning of 2026. Any thoughts or predictions on what you expect to see happen this year, or maybe what you would be excited about seeing?
1:03:27
Yeah, I think the community efforts on small models will escalate even further. Last year there were already increasing efforts from the open source community, and of course now Nvidia is also really heavily invested in supporting open source efforts. So we will see a lot more of that; that's one obvious prediction I can make. Another one is the use of AI for science. I'm quite excited about that personally, because the positive impact of AI on scientific domains can be really phenomenal. If we know how to do it right, then medicine and different aspects of human life could really benefit from AI for science. And that's also a really hard intellectual challenge, because it really requires being able to reach knowledge that's above and beyond the human knowledge reflected in the Internet data. And the thing is, AI is only really good at learning from the data that humans are able to provide. So this is a big intellectual challenge as well, and I'm very excited about pursuing that direction further.
1:03:46
Well, Yejin, thanks so much for jumping back on with us and giving us an update as to what you're working on and in particular digging into how you're approaching reasoning for SLMs.
1:05:13
Thank you so much for having me again.
1:05:26
Thank you.
1:05:28
Bye bye.
1:05:30