Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2
Yi Tay, AI researcher at Google DeepMind Singapore, discusses his team's work on achieving IMO Gold medal performance, the transition from off-policy to on-policy reinforcement learning in AI systems, and the establishment of Google's new Reasoning and AGI team in Singapore. The conversation covers technical aspects of AI reasoning, the evolution of transformer architectures, and insights into scaling AI systems.
- On-policy reinforcement learning (where models learn from their own outputs) is becoming more effective than off-policy learning (imitating others' successful trajectories) for developing reasoning capabilities
- The decision to abandon specialized systems like AlphaProof in favor of end-to-end general models represents a significant philosophical shift toward unified AGI systems
- AI coding assistance has reached an inflection point where it can effectively debug and fix complex ML training issues without human investigation
- Data efficiency remains a critical bottleneck, with humans demonstrating orders of magnitude better learning efficiency than current AI models
- Geographic distribution of AI research teams provides 24-hour coverage and talent diversity while maintaining global collaboration
"On policy is basically this idea of model training on its own outputs and letting the model generate its own trajectories and then letting some reward verify it and then the model train its own outputs."
"If the model can't get to IMO gold, then can we get to AGI? So it's basically at some point we have to use these models to try these Olympic competitions."
"The wonderful thing about this era of LLM is that you can be an AI researcher engineer and you don't have any domain knowledge and you can still get a gold medal."
"There were so many moments this year where AI suddenly crossed that emergent thing. AI coding is one of them that we just discussed."
"I think machine learning is the most scientific way we have ever studied learning, just in general."
The thing that I find the most useful about these models in general is when I have this big spreadsheet of a lot of results and I just need plots of it. I can just go: screenshot, make a plot of this. I hate making this matplotlib stuff; it's so annoying. There were so many moments this year where AI suddenly crossed that, like, that emergent threshold. AI coding is one of them, which we just discussed. I think Nano Banana also got to the point where... usually if you make these images, it's just for fun, you just troll your friend or something like that. But Nano Banana actually really got so good.
0:00
Welcome back.
0:36
How are you? Yeah, I'm good. I'm good.
0:37
Great to be back. It's been one, one and a half years.
0:39
Yeah, it's been one and a half.
0:42
Feels like a long time. So last time we talked, you were at Reka. Yeah. And then you joined GDM again. Working for Quoc again.
0:43
Yeah.
0:51
And more recently, you've started GDM Singapore. Yeah. Is it GDM Singapore or Gemini Singapore? I don't know if you've named the team.
0:51
I think we have a Gemini team in Singapore.
0:59
Yeah, Gemini team in Singapore. It's called Reasoning and AGI.
1:00
Yeah, Reasoning and AGI.
1:03
Is it important to have AGI in the name?
1:04
It was like a vibe thing that we put AGI in.
1:07
Yeah.
1:10
I think that, like, one reason why we work on these models is that we want to get to AGI. And it was just a vibe thing that we added AGI to the job posting. Yeah. There is no formal name for the team yet, but it's basically the Gemini team in Singapore.
1:10
Yeah. I mean, I think people are trying to figure out, triangulate: Amazon has an AGI team, you guys have an AGI team, and then, let's say, Meta now has a Superintelligence team. What are people signaling when they choose these names for their teams? Is it, oh, we have a plan, or is it just vibes?
1:24
You're trying to fish hot takes out of me.
1:42
No, you have officially AGI in your job title.
1:43
No, it's not a team name. It's not a team name. Yeah, it's just, you know, we just want to signal the North Star: we build these models to get to AGI. Yeah.
1:46
Yeah. No, I wasn't really fishing for politics. Okay. So you rejoined GDM.
1:54
Yeah.
1:59
And I think last time we talked... I listened back to the whole thing; it was an amazing episode. Last time you were talking about how, externally, you were in Brain and then came out, and now you're back in GDM.
1:59
Yeah.
2:10
I wonder, what are your general reflections on plugging back into the Google infrastructure?
2:10
Oh yeah. So I guess coming back is very interesting, because when you return to Google, everything, including your LDAP, your username, is all the same. It's like you play Pokémon, you set it aside, and then you go back and you click continue: save game and continue game. It's like that. Obviously, in the last 1.5 years while I was away, many things changed; Brain is now part of GDM and stuff. But overall, coming back has been pretty seamless. Obviously I love Google infrastructure, and I think the dev tools are great and stuff like that. Yeah. And I'm very glad to be back. Yeah.
2:15
To Google infra.
2:51
Yeah.
2:52
And was the intention always that you were going to work on Deep Think?
2:52
No, not really. I think I missed research a lot; doing research, not super fundamental research, but close-to-the-model research, right? I really missed being at the frontier and trying to go beyond it. So I really missed that a lot, and when I came back, Deep Think wasn't a thing, and I don't think there were any plans, actually. It was just: I'm going to work on research and see what happens. Yeah.
2:56
But I'm sure, I guess there was some inclination that reasoning is the next frontier and that's like obviously the most rewarding research path, especially this year.
3:19
Yeah, I think these days reasoning and RL kind of come together. I spent a lot of my past life, I call it the past era, working on architectures and pre-training, but now I've transitioned more into RL research. Not old-school RL; it's LLM RL as opposed to old-school RL. And to be honest, I had almost no RL background coming back. But RL is the main means of modeling these days, so it was pretty easy to jump back in. A lot of fundamental skills in research are general-purpose and universal, and it's quite easy to innovate even in a toolset you're not super used to. So yeah, RL is basically the main modeling toolset that we play around with these days.
3:29
Superficially I see, in your UL2 and T5 work, some overlap: the focus on objectives, the focus on what you're trying to incentivize. So I would have maybe guessed there was more overlap than you're saying right now, which is interesting, but I understand it's very simple.
4:15
Objectives are objectives, and they have some overlap, right? Yeah. But I think it's mainly the on-policy versus off-policy design of these things that changes, and also the learning algorithm itself, right?
4:38
Let's just introduce this terminology for people who aren't that familiar with RL policies. I do think a lot of people are trying to understand what is working about this generation of RL research. Anyway, Jason had this interesting post, which I think you are co-signing, which is basically: you always want to be on-policy. Instead of mimicking other people's successful trajectories, take your own actions and learn from the reward given by the environment. Basically, correct your own path instead of trying to imitate other people's paths.
4:49
Yeah, yeah.
5:17
First of all, he writes really well and I wish that more people wrote like him, but I don't know what's your reflection on that or your addition on top of that.
5:18
Yeah. So I think the best analogy for on-policy and off-policy is: off-policy is basically like when you SFT something. You take some other model's, a larger model's, generated outputs, trajectories, whatever, and you train off somebody else's generated outputs. On-policy is the core idea of modern LLM RL, where you generate, then reward the model based on its own generations, and then the model trains on its own generations. So it's a bit like self-distillation to some extent: the model generates its own output, you reward it, and it trains on its own output. On-policy-ness is basically this idea of the model training on its own outputs: letting the model generate its own trajectories, letting some reward verify them, and then the model trains on its own outputs. I think this is more generalizable in general. There's still a lot of science to be done on the gap between SFT and RL itself, but it's basically on-policy versus off-policy, right? And bringing this analogy back to real life: this on-policy-ness is more like humans, because we go around the world, we make mistakes, and then we learn. Imitation learning is mostly somebody else, not first principles; it just tells you what to do and then you just copy. So I think bringing this philosophy to life is quite powerful. Now I have a kid and everything; I want my kid to try stuff, and then you tell them, okay, this is where this went wrong, where this went right, rather than, okay, you just copy everything somebody else does.
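To make the on-policy versus off-policy contrast above concrete, here is a minimal, runnable Python toy (not anything from Gemini's actual stack): the "model" is a softmax over two answers, the "verifier" rewards the correct one, and the policy updates on its own sampled outputs. Production LLM RL uses PPO/GRPO-style variants of this REINFORCE update at vastly larger scale.

```python
import math, random

# A toy of the on-policy loop described above: the "model" is a softmax
# over two answers to one fixed question, the "verifier" rewards the
# correct answer, and the model updates on its own sampled outputs.
# Plain REINFORCE; real LLM RL uses PPO/GRPO-style variants of this idea.
logits = [0.0, 0.0]   # index 0 = correct answer, index 1 = wrong answer
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for step in range(100):
    p = softmax(logits)
    action = 0 if random.random() < p[0] else 1  # model samples its own trajectory
    reward = 1.0 if action == 0 else 0.0         # verifier scores the output
    advantage = reward - p[0]                    # baseline = expected reward
    # Policy gradient: d log P(action) / d logits = onehot(action) - p
    for i in range(2):
        logits[i] += lr * ((1.0 if i == action else 0.0) - p[i]) * advantage

print(softmax(logits))  # probability mass concentrates on the correct answer
```

An SFT (off-policy) step, by contrast, would simply maximize the likelihood of a fixed demonstration, with no sampling and no reward.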
5:25
Yeah. Montessori schooling is mostly that, right? Very unstructured learning: you discover your own path and we just give you a safe environment to do it. Yeah, yeah, yeah, yeah. What is the point at which you should transition from imitation to on-policy? I do bounce back and forth.
6:53
Humans, right? Not models.
7:09
Right. I would say in models there mostly has been a very concrete split: first you imitate, that's pre-training, and then you RL at the end.
7:10
Technically SFT is still imitation, but I think for humans there's a little bit of this, right? Take sports: you play sports, you start off by imitating, like hardcore imitating, but you cannot imitate forever. I don't know whether this is a good analogy, but watching a lot of tutorials and stuff is more like imitating; you try to learn certain movements and stuff like that. But then on-policy-ness is going to the game itself and trying to get a reward signal from that, right? So I think humans do need some form of imitation learning; everybody starts off by imitating. But then again, the human and model thing... it's just fun to have analogies, but we shouldn't take things super literally and stuff like that.
7:18
I actually am quite a serious taker of machine learning insights into human learning.
7:58
That's what we learn from models now. Yeah.
8:05
Because I think machine learning is the most scientific way we have ever studied learning, just in general. That's true, that's true. We had to invent curricula from scratch.
8:07
Yeah, that's true.
8:17
And things like learning rate. If your learning rate is too high, blah, blah, learning rate is too low, blah, blah.
8:19
Like where do humans even have a learning rate?
8:22
So I do tell people to keep an idea of their own learning rate and to be wary of it being too low. For example, if you've been wrong once, you should ask: where else have I been wrong? And typically, you know what I mean, people usually update slower than they should when they have been wrong. What is it?
8:25
Stubbornness?
8:46
It could be stubbornness. I don't know, is that the right word for it? It could be that they're too Bayesian when actually their prior assumptions are wrong and they need to completely throw out their previous assumptions, because one counterexample invalidates all prior experience. Your entire world model is wrong; throw it away. So the Bayesian update is actually wrong there. Let's say you've lived for 10 years under some assumptions and you have one example that breaks your narrative. You shouldn't be like, okay, now I make a 2% update. No, actually it should be: oh, something's really freaking changed, everything I've assumed for the last 10 years is probably wrong. What else am I wrong about? And update 20%, update 50%, not 2%. You know what I mean? That's a learning-rate thing for me. So my direct example is the whole getting-into-AI stuff. I was watching GANs for 10 years. Has it been 10 years? 2012, 2013.
8:47
Time flies.
9:43
Yeah, I was watching GANs and I was like, okay, this is cool, it's getting more detailed, not that impressive. Then all of a sudden Stable Diffusion came out and you could run it on your laptop. And that was my learning-rate moment: okay, fuck, my mental model of generative images did not include this. And so I was like, okay, I am very wrong and I need to pivot everything. And that's how I started Latent Space.
9:43
So will this mean that your learning rate is high?
10:05
Yes, I will nudge it up. I schedule my learning rate: when a world model has been violated.
10:08
Okay. I think it's a good strategy. I think this also relates a little to when new paradigms happen: how fast people are to adopt them, or to invalidate their understanding of things. As scientists, a lot of the time, as the field progresses, we do have to keep invalidating our own world models. It could be that a certain way was the way to do something all along, and suddenly something comes along and invalidates it.
10:14
Yeah, you can be very proud of your priors until they become your prison. That is actually very dangerous. Okay, that was a bit of a tangent; I don't know how we got there. You did highlight Denny's LLM reasoning lectures, where he traced the intellectual history of reasoning in LLMs, from chain of thought onwards. And one part that I was going to prompt you on a little was self-consistency, right? Which I think people roughly know. I think it's more crudely implemented at OpenAI than with you guys, where it's straight up: they run eight inferences and they judge, or whatever. But I do think that is also relevant to on-policy distillation, where literally you have eight different paths and they're all from the same model. So, checking my intuition there: basically, the stuff you're saying about why on-policy is important, and using, let's say, an external verifier to improve your reasoning, you can also do that with parallel reasoning.
10:37
Oh yeah. I mean, when we train RL models, they sample multiple times. So yeah, to some extent there's some form of self-consistency.
11:32
Is that directly... there's self-consistency there, right?
11:40
Yeah, self-consistency is a little bit more... the more nuanced version of that. If you talk to Denny, he will tell you it's not majority voting for sure; it's more...
11:41
I agree.
11:48
Yeah, it's a more nuanced version of that. But parallel thinking definitely is related to self-consistency.
11:48
Yeah, and for those following along, OpenAI also actually put out some interesting papers on majority voting versus other forms of multiple-output consensus. Basically, at the highest level, there's an actual LLM judge that decides whether a trajectory is more valid, based on some internal consistency or just inspecting the chain of thought, which is very cool; we can train models to do that.
11:54
Yeah, for sure. Yeah. Self-consistency is a big fundamental idea. I mean, chain of thought itself was also a big idea. And then self-consistency was also a big fundamental idea in modern LLM literature.
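As an aside for readers: the crude, commonly implemented core of self-consistency alluded to here (not Denny Zhou's full formulation, as Yi notes) is just sampling several reasoning paths and majority-voting the final answers. A minimal Python sketch, where `generate_answer` is a hypothetical stand-in for one sampled chain-of-thought LLM call:

```python
import random
from collections import Counter

def generate_answer(prompt: str) -> str:
    # Hypothetical stand-in for one sampled chain-of-thought LLM call that
    # returns only the final answer; here, a noisy oracle right 60% of the time.
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def self_consistency(prompt: str, k: int = 8) -> str:
    # Sample k independent reasoning paths from the same model, then
    # majority-vote over their final answers.
    answers = [generate_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # almost always "42" for modest k
```

The LLM-judge variant discussed above replaces the majority vote with a model call that inspects the candidate trajectories and picks the most internally consistent one.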
12:19
Yeah, amazing. Okay, so let's bring it to, I guess, one of the headlines of this podcast, which is going to be diving into the IMO work. So this was around July, when you guys announced it. Oh, this very nice photo here. This is in London, I believe, where you had Dishoom.
12:33
Yeah, Dishoom. Oh, you've got to be at the photo-taking to get the credit.
12:51
That's bullshit, right? What the fuck?
12:58
No, I'm just kidding.
13:00
The contributor list is bigger than this.
13:01
Yeah, yeah, but like, they were like saying like, oh, okay, you should go to the.
13:03
In order to get a literal gold medal.
13:06
No, no, no. Like, to get the credit for being on the IMO effort first; the credits are in order. So it's just a joke. It's just an inside joke.
13:08
But anyway. Okay, could you tell the story of this IMO thing? Apparently it was done in one week.
13:14
So let me clarify a lot of things, right? The IMO effort has been very long-standing. Thang and others have been really working on this, even last year, right? Last year they got... I was not back at Google at the time. They had the AlphaGeometry stuff and then AlphaProof and stuff. So it's a very long-standing effort. But this year the goal was to try to actually use Gemini as an end-to-end model, basically.
13:19
No, no second system.
13:44
No second system. Text in, text out model.
13:46
Even that was a non-intuitive thing. I covered the silver result from last year and I was like, okay, it's pretty close; it's one point off from gold. Just try a bit harder, you'll get gold. That decision to abandon it, I think, was pretty bold. I don't know.
13:49
In retrospect it's easy to say this, but I personally always believed in it. It's a bit like: if the model can't get to IMO gold, then can we get to AGI? So at some point we have to use these models to try these Olympiad competitions. And one of the goals this year was, okay, we're going to do an end-to-end, text-in-text-out model. That's where my involvement came in. I was only involved in the model training part of the IMO effort, so I have to say that Thang did most of the IMO thing; I just trained the model. As for what that work involves: basically, we prepared the model checkpoint for the actual IMO itself. That's also something that's easily overlooked about the IMO. Many times when you chase benchmarks, it's a thing you can keep running and running and hill climbing until you get there. But the IMO was a live competition. Some members of the team were in Australia for it, and the thing was happening live, unfolding live.
14:03
Oh, it's very AlphaGo. You receive the problem, you punch it into your system, and then...
15:08
Yeah, yeah, yeah. So some of the professors from Thang's team went to the IMO itself, the event. I don't even know whether the IMO counts as a conference, but there were people there in Australia. So it was a live thing, and there were people whose actual job was to run inference on IMO P1 to P6 as they came out, and they came out on different days, so different sets, day one, day two, something like that. The fun part is that I knew nothing about the IMO at all. I'm not a kid that took part in the IMO; I was too dumb for that. You're a piano player. Yeah, I played the piano. What I only knew was that, okay, we delivered the checkpoint, and that checkpoint was used to get the IMO gold. But then there was somehow a week in London where everybody gathered, everybody was flying to London, and this photo was taken there. And you get to see how all the different parts come together, being in the room with the other co-captains. It felt a little bit like a hackathon. So the training process of this IMO model itself was maybe a week or so, not the whole effort, basically.
15:12
Yeah, yeah. I think the question is... I'm still not over the decision to throw away AlphaProof.
16:26
Okay.
16:32
Yeah, basically, I think it's very major, and I understand that you have this goal of AGI; obviously at some point one model should do all of it, right? But if you had pointed a gun at me in 2024 and asked what you need to do IMO and IOI and ICPC and all the other stuff you guys did, I'd have said you need an LLM reasoning system that knows how to operate a computer, knows how to write Lean and run the Lean verifier, all of that. But basically you baked the Lean verifier into the chain of thought. It is not obvious that you can do that.
16:32
Okay.
17:10
At all. Because I think.
17:10
Okay, so what you mean is that it's in some way encoded in the parameters of the model somehow.
17:13
Yes.
17:19
Yeah. I mean, it's just whether, at the end of the day, you believe in this connectionist thing: one model, lots of parameters. I mean, there's also tool use, right? There's also tool use and stuff like that. But to some extent... we should be able to get to a point where... in the past, when LLMs first started, the model couldn't even be a calculator. Now it can somewhat be a calculator. So technically a tool like a calculator is somewhat encoded in the parameters of the model. Whether there are things that cannot be expressed in the parameters of the model is an open question. We don't know where that limit is, but I think we will keep pushing it. So whether it's something like a Lean system, or other things, a physics engine or something, we will continue to push that boundary. There were a lot of debates about symbolic systems versus neural, but I actually don't really know; to me it was just, oh, let's train the model. Someone told me to train the model, and I trained the model. There were people overseeing the IMO effort who decided this. And I also think these specialized systems are very one-off systems: you could create a chemistry engine, you could create a math engine, you could create the thing, right? But at the end of the day you want one model for everything. So this fits that direction a little more, where you have one model. And this model was also launched as Gemini Deep Think, a general-purpose Gemini Deep Think.
17:23
So it's basically unchanged, but with maybe some config toned down a bit.
19:00
Yeah. So the inference-time config that's served to most people is different, and the full IMO inference config was shipped to some mathematicians, just because of the inference cost, right? But it was good enough to be a general-purpose model. I think my take is that this intuition was what led to going towards one model instead, because with these specialized systems there's no end.
19:03
Right.
19:29
You can create many specialized systems.
19:29
Yes.
19:30
The most I can see in the future is that there'll be a model, and then if there's something that really cannot be subsumed by the model, you just use a tool or something like that, right? But my prediction is that most things can be subsumed by the model. And yeah, AI researchers are quite good at hill climbing.
19:31
History would say that you have a lot of evidence backing you up. Is this the model output? This is it, right?
19:50
This is what? Yeah, I think this is the model output. Yeah.
19:55
What do you see when you look at this? Obviously it looks like a well-written proof. It looks like something a real human mathematician would produce. People did compare yours versus the OpenAI one, where OpenAI's is a lot rawer, or they had to clean up their versions. We don't have to talk about OpenAI. But what was interesting to you when you saw this kind of output?
19:57
I want to give a little bit of a special disclaimer: I know nothing about math, right? So I think the wonderful thing about this era of LLMs is that you can be an AI researcher or engineer with no domain knowledge and still get a gold medal. It's a universal tool that you don't need domain knowledge for. I can't parse this at all; it's foreign to me.
20:16
But maybe a proof is a particular kind of chain of thought. I would say the other interesting thing, which some of your collaborators were talking about, is that this is the first example of reasoning in a non-verifiable domain, which to me isn't right, since proofs are by definition verifiable. I just want to give you things to riff on, or debates that might be worth digging into.
20:39
So I think aside from proofs, there are a lot of domains that are non-verifiable, or not easy to verify. When people say non-verifiable, they mean non-trivial to verify, or just not as easy as the solution to a math problem, because a proof is long-form, and that's why it's non-trivial to verify unless you convert it to Lean and then do all kinds of things, right? So I think there's a lot of work to be done in these non-verifiable domains. Yeah, I'm getting into territory where I'm not sure what I can and cannot say.
21:02
Okay, so yeah, sure, that's it. I think another thing that's an open topic of debate was how much domain-specific work or post-training was done, because you then went on to do the IOI and ICPC stuff as well, right? The same model.
21:33
I was not directly involved in the ICPC, but I was related to some extent. That's all I can say. Yeah, yeah.
21:48
Any other interesting callouts, maybe just on the team? You called out Jonathan as someone who was a co-captain on this effort. And yeah, basically, how did the effort look?
21:56
So there were four captains for the IMO: two from London, Jonathan from Mountain View, and I was from Singapore. So the four of us basically trained this model together. I was trying to remember what Thang was saying, but one interesting thing was that we were all in different time zones. And there's something very interesting about passing on the job; there's no real fixed workflow for how captains work together. It's more like, oh, I'm going to board the plane now, I'll be AFK for 12 hours, so someone else takes over.
22:07
So you're just babysitting the run sometimes.
22:37
There are bugs and stuff. The job goes down sometimes. So it's very ad hoc, and it's very busy between the captains, how we decided to work together. Yeah, but I think it was an interesting time also because we were all flying. I think the London folks didn't have to fly, but I had to fly and Jonathan had to fly. And when you visit another country, when you visit an office, you have many meetings, so I was in and out of meetings, and it was pretty interesting. And I also think that nobody really knew whether we would get gold at the time, because the IMO actually hadn't happened yet. It was interesting, exciting. And then there was the whole process of getting verified by the IMO committee, and, you know, we weren't going there ourselves. But I had to learn a lot about how the IMO works. Apparently the gold cutoff is not even a fixed number; it's like a bell curve. So there was a time where you just look at the scores. I was even watching the human participants and seeing what their scores were, because whether Gemini gets gold depends on how the humans do. So you're looking at it like, oh, if a certain percentage...
22:38
To some extent you don't have any control over that. So.
23:47
Yeah, but you're just curious, right? Because I would say that it's definitely more like exciting, like there's more adrenaline than like just running on a benchmark and getting a number. It's also like a process that took some time. Yeah, yeah. But I think overall if you have specific questions you can ask also. But I think this whole thing has been a highlight for me. This IMO effort has been.
23:50
Yeah. I would say most people, if you had asked them maybe two years ago whether a model could get an IMO gold, would have said: impossible.
24:11
Yeah.
24:19
Then the silver helped, right, from last year. But the fact that you could throw that system completely away, take existing Gemini, scale up Deep Think, and just run it for IMO gold, I think is also very non-consensus compared to last year.
24:20
Yeah, definitely, to some extent. I think researchers were also surprised. I wouldn't say surprised; it was more like a pat-on-the-back kind of surprise. We actually made a lot of progress, we as in collectively, all the engineers and researchers working on Gemini. There's a lot of progress being made; look at how far we went in one year. And I also think, take five years ago, not two years, five years ago: you just look at the state of AI now, just generally, IMO and ICPC gold, and even things like Nano Banana. If you compared AI progress now to five years ago, I think people would think we had already reached AGI.
24:34
Some form of AGI.
25:10
Some form of AGI.
25:11
We're just moving.
25:12
If you just traveled... like, you take these checkpoints and you travel back five years; someone should make a drama about this. But I think it's really quite impressive how quickly the field has moved. Yeah.
25:12
The hard parts, you would say, were scaling inference.
25:23
In what aspect? Like hard inference.
25:27
Expensive or hard, as in maybe the most brain power expended on the team. I saw some comments saying the hardest part was actually the inference optimization, the very, very long-horizon inference that Deep Think needed compared to normal Gemini, stuff like that.
25:28
I didn't work on the inference-time scaling, so I wouldn't know there.
25:46
That is mostly that. Oh, and then there was this: the code name was apparently IMO Cat, which you named after your cat.
25:50
Okay, that's not really... so I think I tweeted about it at some point, right? Go look up the tweet. The IMO cat was basically... okay, it's not an official codename or something. It's just that the name of the config for the job was imo cat. That's it.
25:57
You just need some kind of name.
26:15
Yeah, I mean, you know, I just like cats. Yeah, yeah, fair enough.
26:17
That is mostly it on the IMO, unless you want to bring up anything else. We have other sort of researchy topics. But before I go into those, I did want to leave the floor open: what else should people know about the reasoning effort that's going on at GDM?
26:20
Let me think of where to start, please. What do people need to know? So it's really good. Yeah, that's what people need.
26:36
Maybe an easy one to start with: a lot of people were focusing on academic benchmarks two years ago; last year, maybe LM Arena; this year, Pokémon. Pokémon is a very interesting reasoning, visual reasoning, and general long-horizon agent-planning benchmark. And, I don't know, you seem to focus on it a lot, and I think Gemini did it very well. So obviously it's something that is easy to talk about.
26:43
I should probably say this: there's actually nothing specifically done for Pokémon, of course. Yeah, of course there's nothing specifically done for Pokémon. And I think Logan had this tweet recently about the recent Gemini 3 run on Pokémon Crystal, and Pokémon Crystal is like...
27:08
There was so much.
27:23
I think Pokémon is... so I used to play a lot of Pokémon and I'm a big Pokémon fan in general. And like you say, it's a great long-horizon benchmark and stuff like that. And I think it's good to check in once in a while on these benchmarks that almost never get contaminated, where people don't actually spend time hill climbing. It's kind of silly: people ask, okay, what are you working on? Oh, I'm working on AIME, I'm working on HLE, I'm working on Pokémon-maxxing or something like that. That's kind of funny.
27:24
We did interview the guy behind the Claude-based Pokémon AI; I think his name is David. And it showed serious flaws in Anthropic's screen understanding, vision capabilities. It literally couldn't tell: I'm trying to get past this wall, but I just keep running into it. It doesn't know the wall is there, so it doesn't have any spatial reasoning at all.
27:55
I mean some of it could be like a harness, like the harness thing. Or also whether the model has access to game state information or is it completely visual.
28:18
Yeah, Claude's implementation is very game state heavy. They dumped effectively all the memory of what's going on in the emulator.
28:27
Yeah, I see. I don't know whether I'm jumping off on a tangent, but I think solving Pokémon is going to become more about how fast you solve it. And the thing that I have not really seen so far is whether the model can complete the Pokédex.
28:35
Why is that more challenging?
28:49
No, completing the Pokédex is so hard. You need to plan, you need to search up information. There are some things where, if you don't go online... basically you need a little bit of deep research in this; the model will just never know that it needs to trade. If it's able to go online, it could post on forums and find someone to, hey, can I trade with you? Some Pokémon need to be traded to evolve.
28:51
Yes. Or mated.
29:13
Yeah. But anyway, I don't know. I have not seen a model be able to complete the Pokédex, and completing it is actually really hard for a model. So I think that's actually an interesting one.
29:15
Yeah. I wonder what the real-world analogy would be. If, let's say, we have a model that is capable of doing that, what can we make it do that we cannot do today?
29:28
There's a lot of planning involved, real deep-research-like planning. The Pokémon game itself is very linear, right? The Pokédex involves a lot of backtracking and research, a lot of research. So it's probably a different nature of task.
29:37
Is that as interesting to you as, for example, a lot of other people in the AI for science world are trying to discover things that you cannot look up? Right.
29:56
Novel knowledge.
30:04
Yeah, novel knowledge. Because basically what you're saying is we're not even there yet. We're at the place where models cannot consistently apply knowledge that they look up, right? Like, you give Gemini access to web search and you say, okay, try to collect all the Pokémon in the Pokédex. You don't have high confidence that it will do it. I don't know if someone's actually tried. Probably not, right?
30:05
No. I think the hard part is actually trying to synthesize the web knowledge and then apply it in the game itself, with all that visual state going on and stuff like that. It probably will be solved at some point. It's not...
30:25
It is challenging.
30:37
It's challenging, but it's not, like, super interesting.
30:38
You're basically just... the task really is: can you look up the guide, and then can you apply the guide? That's it. You know what's even more intelligent than that? Creating the guide. Being the first to figure out how to create the guide. Which is what?
30:40
Oh, yeah, yeah. But when it comes to this, it's mostly an exhaustive search thing. The model just tries and tries, like the humans who made the guide did. Yeah.
30:57
Okay. So that's actually less interesting to you. Interesting.
31:03
Okay. Actually, okay, when you think about it, it's not super, super interesting, but I have not seen a model try to do this.
31:05
Yeah. I think efficient search of novel idea space is interesting. Obviously you can brute-force anything. But we're not talking about brute-forcing; we're talking about trying to create an AI scientist.
31:12
But novel knowledge is actually an interesting thing that I think is going to be quite a big thing: being able to generate novel knowledge.
31:23
Google has done stuff there, though you're probably not that close to those teams that have done the AI scientist work.
31:30
There are experiments like, for example: if you freeze the model's knowledge at 2015, do you freeze time at 2015? Even with a current model, let's say, okay, let's assume there's no leaking of information. If you ask the model what's the best ML method, okay, not 2015, say 2012 or something, it will just tell you that SVMs are the best, right? That's just how machine learning worked back then. And the question is, can it invent the Transformer? It might not be able to. Even today's models might not be able to invent the Transformer if you freeze time at a certain point, even if you bring the text. I mean, the model is a Transformer, so I'm just saying, assume there's no leak issue.
31:36
No, totally possible.
32:13
So I think there's still a lot of open questions about whether the model can really innovate and generate really novel knowledge.
32:15
Yeah, one related question on that, which I think relates to Denny's lecture: I think people have this sort of mysticism about what reasoning is, when, if you really demystify it a lot, it's whatever happens inside the chain-of-thought tags, right? And you're eliciting that reasoning behavior from stuff that is already latent inside the pre-training corpus. That's one version of the interpretation.
32:23
I think these days reasoning itself is very vague and very open, so different people will have different definitions of what reasoning is.
32:52
Right.
32:59
So I agree that when people think of reasoning, they associate it with chain of thought, and obviously it's what happens in the thinking, and okay, that's reasoning, right? But these days it's more like I said earlier: reasoning and RL together. Basically it's anything in post-training that elicits capabilities, RL and post-training to elicit capabilities. So the working technical definition of reasoning is making models better with thinking and post-training. Okay, yeah. So basically RL-ing the model to think better, where thinking is thinking traces, thought trajectories, and stuff like that. There's also this line of work on latent thinking, like whether latent thinking and discrete-token thinking are going to be the same thing or something like that. It's an open question. Meaning adding...
33:00
Extra tokens to your vocab that represent.
33:49
I forget the names... these academic papers that do looping or pause tokens or, basically, instead of decoding discrete tokens, you simulate it in latent space, right? So when you do chain-of-thought thinking and reasoning, you decode extra tokens, hide them in the thinking tag, and then you decode stuff. But latent thinking is basically: you just don't decode tokens; you do the thinking in latent space.
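A rough sketch of the contrast being described, with toy stand-in modules (a GRU in place of a transformer stack; all names here are illustrative, not any specific paper's method): chain of thought decodes extra discrete tokens, while latent thinking feeds the hidden state back without decoding.

```python
import torch
import torch.nn as nn

vocab, d = 100, 16
embed = nn.Embedding(vocab, d)
unembed = nn.Linear(d, vocab)
backbone = nn.GRU(d, d, batch_first=True)  # toy stand-in for a transformer stack

def discrete_cot_step(token_ids):
    # Chain of thought: decode one more discrete "thinking" token and
    # append it to the sequence (hidden inside the thinking tag).
    h, _ = backbone(embed(token_ids))
    next_token = unembed(h[:, -1]).argmax(-1, keepdim=True)
    return torch.cat([token_ids, next_token], dim=1)

def latent_thinking(token_ids, n_loops=4):
    # Latent thinking: never decode; loop the last hidden state back in
    # as if it were an input embedding, for a few extra passes.
    x = embed(token_ids)
    for _ in range(n_loops):
        h, _ = backbone(x)
        x = torch.cat([x, h[:, -1:]], dim=1)  # feed the thought vector back
    return x

ids = torch.randint(0, vocab, (1, 5))
print(discrete_cot_step(ids).shape)  # (1, 6): one extra discrete token
print(latent_thinking(ids).shape)    # (1, 9, 16): four continuous "thoughts"
```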
33:52
It might start speaking the native language of thinking, which is numbers, not passing it through some filter of English. And sometimes models start thinking in Chinese or something else.
34:17
Yeah, generally I don't really believe that model thoughts have to be the same as human thoughts. Generally in ML, I'm more of the school of thought of: let the model do whatever it wants.
34:27
In general... there was a discussion, there's the Platonic Representation Hypothesis paper, which I think you'd be sympathetic to if you haven't already read it. To me it seems obvious: an image model's idea of what a laptop is and a text model's will converge on the same latent, and obviously you can align them and do all that stuff with them. So it totally makes sense that their concept would be just a vector of numbers that represents laptop; that's the concept. Yeah. And okay, maybe you have some numerical differences between one model's idea of a laptop and another's, but it mostly would be the same. Yeah, very interesting. The question I was leading into was: now we are in this age where LLM text is in the corpus of stuff that we train on, so it's a little bit of a recursive loop, right? The reasoning tokens are out there now. And so pre-trained base models are also capable of reasoning, and increasingly so as more and more reasoning text goes into the corpus. Isn't that interesting, or is that worrying?
34:36
Do you actually see many reasoning traces on the Internet? I've never seen those, though.
35:42
I would say that on Hugging Face, yeah, people are publishing that. As to whether researchers are actually including it in their training corpora, who knows, right? That's their choice. But I would say the percentage of Common Crawl that has CoT tokens in it went from 0 to 0.001%, and it will just go up over time because people are publishing it.
35:47
Yeah, but I think if the sources are quite clear, you can actually filter those away, because usually people put it on GitHub or put it on...
36:13
Do you want to filter? Maybe you don't.
36:19
There's a choice there.
36:21
Yeah, quite literally. The whole reason why... I don't think we covered this in our previous episode, but two years ago a lot of people were like, oh, you just include more coding tokens in your pre-training corpus.
36:23
But the coding tokens are different from, like, reasoning tokens.
36:34
It generalizes outside of code for reasoning.
36:37
Oh, that was like.
36:40
You don't believe that?
36:41
No, no, no. I think. I don't know if it's still true today, but I see.
36:42
Yeah, that was just our general coverage of reasoning. I would say there's a lot of interesting work here and more to do. Maybe I'll cover one thing which I know you have personal input on, which is that you have started using AI coding.
36:46
Oh yeah. So I didn't really use much AI coding in the past, but I think we've reached a point where AI coding has started to become really useful. Okay, so before AI coding, the thing that I found the most useful about these models in general is when I have these big spreadsheets of a lot of results and I just need plots of them. I can just go: screenshot, make me a plot of this. I hate making this matplotlib stuff; it's so annoying. Okay, that's basically the one thing I can remember about how I used AI in the past. But AI coding has gotten to the point where I run a job, I get a bug, I almost don't look at the bug; I paste it into Antigravity and tell it to fix the bug for me, and then I relaunch the job. Beyond vibe coding, it's more like vibe training, vibe ML or something like that. I would say it does pretty well most of the time. And there are classes of problems that I know it's just really good at, in fact probably better than me, where I would otherwise have to spend 20 minutes to figure out the issue, fix the thing, and relaunch.
36:59
So yeah, this is very interesting, because I would say level-one vibe coding is: you actually know what to do, you're just too lazy. It's just, ah, just do it for me; I've done this a thousand times, just go fix it, I know exactly what to do. Here, you're saying the next level, where you actually don't even know; it's investigating for you. As long as the answer looks right, you just ship it.
38:06
At the start I did check it and look at everything, and then at some point I was like, okay, maybe the model is better at this than me, so I'm just going to let it do its stuff, and I relaunch the job based on the fix the model gave me. And I think the models will just keep getting better and better. I also think, recently there's Antigravity... these tools were not really a thing in Google infrastructure before, and I'm not that familiar with what's available outside; when I was at the startup, the models were not so good one and a half years ago. So it was also a forcing function: people were like, oh, try Antigravity, it's a game changer and stuff. Okay. So I just started using it, yeah.
38:27
Yeah, you spent some time with Varun recently.
39:09
Oh no, I just said hi and greeted him. Okay.
39:12
I guess you were telling me you're an AI researcher that doesn't even use much AI, and now you're actually AI-pilled as a user.
39:15
There were so many moments this year where AI suddenly crossed that emergent threshold. AI coding is one of them, which we just discussed. I think Nano Banana also got to the point where... usually if you make these images, it's just for fun; you troll your friend or something like that. But Nano Banana actually got so good that you can use it for real use cases, basically. So it's getting really good. And even things like... in the past, these LLMs would hallucinate things a lot, but now I just trust them almost automatically. I think people are just enjoying the utility brought by these models. So now... I mean, I was always AI-pilled. AI is a good thing; I don't see how anybody can disagree with that.
39:22
Yeah, but you are actually using it for things that you have high expertise in, which is your own ML work. Yeah. And just to check: do you have a special version of Gemini that you use internally that we don't have access to, or is it the public Gemini?
40:05
I think it's the public Gemini.
40:21
Okay.
40:22
Yeah.
40:22
I was just saying, it would be entirely reasonable to train an internal Gemini on only your code base and your work.
40:23
Oh. Actually, I'm not sure, though. These things are just abstracted away from me.
40:30
But obviously, if it improves your productivity by, I don't know, 10%, it's worth it, right? So I think that's interesting. And there are the interesting questions of levels: how much do you trust it, how much of your job do you automate away and no longer need? There's also the question, I guess, about how people come up and train in the field if you no longer need juniors, because Gemini is your junior ML researcher. So I think these are all interesting questions.
40:35
I want to say one quick thing first, right? When it comes to whether a model can be like a junior SWE or something like that: think of it this way. It's not that the job of one SWE gets replaced by the model one-for-one. But let's say you are a manager; the metric you track is your time. You can have a model that saves you the same amount of time as the work that your reports do, without actually replacing one person per se, but you...
41:02
Yeah, a little bit from everybody.
41:30
Yeah.
41:32
Right.
41:32
Then I definitely agree that, okay, when you count the net time saved, there are times when the model can fix bugs that would have cost me, like, one day, and one day is huge. I don't know whether anybody has done any real metric evaluation of these things, but use time as the real metric, not the number of people replaced; okay, maybe hours saved, that kind of metric, right? These things are not going to replace one person as is; they're more like a passive aura that buffs everybody, in game terms. Right. Yeah.
41:33
I often think of myself as a bard, because I tell stories and I buff everybody around me. That's an ideal situation for me in a D&D group.
42:03
Okay. I don't play D&D, but okay. They're the kings of support.
42:11
Hero support spot.
42:17
Yes.
42:18
Yeah. Okay. AI support, I think, is very encouraging. Where is it still not working for you? What have you tried where you're like, oh man, I expected it to be better?
42:19
Oh. There are times when models get lazy and try to fix something in a... they still get lazy, and then they try to gaslight me into thinking the bug is fixed. So there are still classes of problems... there are classes of problems that are very easy for the model and very hard for humans, and some things that are very easy for humans and very hard for the model, Moravec's paradox and stuff. So it's still very hard to characterize these things into proper quadrants and stuff like that. So I would say the capabilities of models these days are good enough to be really helpful, but there are still some gaps. But I don't think there's anything to be done to specifically focus-fire on these things; it's more like general capability improvements. The models just get better over time, and then these things will just go away.
42:29
You say that, but... okay, so yes, in the grand scheme of things, just trust the process, keep scaling in every dimension, and things will fall away, things will emerge. But you've also said in the past, I can't remember the exact tweet, that each additional data set compounds over time; they're just small additions. And I would say that when you say things like focus-fire on things that are easy for humans, hard for machines, those are easy wins where you can just add a data set that focus-fires on that. And isn't hill climbing just a sequence of doing that until you reach AGI?
43:15
Okay, so I get your point. It's true that a lot of progress on the whole is just a series of small incremental changes that push things forward. I think that's accurate; that's true. It also feels like there are a lot of small, seemingly minor things, for lack of a better word, that pushed AI to the stage where it is today. So I definitely agree; nothing against people who focus-fire. It's just that it might not be easy to focus-fire on things that are not easy to characterize. It's fine when there's something targeted: okay, I want to improve this capability, add some data. Defining the evals is defining the problems, characterizing them, and if it can be characterized, then okay, fine. But what I was trying to say with the coding stuff is that some of these classes of problems are not even... I don't work on coding, but people who work on coding probably have terminology for different types of failures. So maybe somewhere somebody is focus-firing on this and making the model better. That's great for everybody.
43:52
Yeah. I mean that's why it takes a thousand people to get all these things together.
44:54
AI is definitely like a big collective effort these days. It's a big machine. Yes.
44:57
It's really crazy. Okay. So I just wanted to broaden out to general things people are talking about in the community on research, which, again, I know that you are very locked in, so you haven't necessarily read all the papers or anything, but we can just riff on ideas. You can obviously ask me what I think as well. Is attention all you need?
45:01
So attention and the Transformer are a core idea of recent times. Pre-training and scale are what make attention and Transformers actually shine, because without them... I think the first Transformer paper was a machine translation thing, and then GPT and BERT were the ones that actually showed the full potential of the idea. So in terms of whether attention is really, really all we need: probably no. From an architectural point of view, maybe also no, it's not all you need. But you need it, definitely.
45:22
What else are you thinking about on that same level? Are you talking about MoE stuff? What do you mean when you say it's not all you need?
45:59
You definitely need the scale of pre-training. You need all the tokens. You need RL. I think when people...
46:05
Say attention is all you need, it's mostly...
46:11
From an architecture point of view.
46:13
But will Transformers get us all the way to AGI? I guess that's the...
46:14
So basically, whether they get us to AGI, that's the question.
46:18
Will it still be a Noam-style architecture, or something meaningfully different?
46:20
It'll be a Transformer. I think it really depends on what you call it. But unless the paradigm shifts completely, which, as a scientist, you can never completely rule out, you can never say this will never happen, my feeling is: it's been, what, eight years since the Transformer in 2017, and we have not replaced self-attention in some form. You could rename it, you could call it something else, you can do local-global sliding window, but it's still a Transformer in the end. And I think that's not going anywhere, unless the whole thing with backprop, everything, changes completely. Then there's a different story, a different conversation to have. But if it's still within the same scope and bounds... I spent a lot of time thinking about architectures and whether there are alternative architectures and stuff. At the sequence processing level, it's kind of the ultimate.
46:25
Yeah, it's a sequence-to-sequence Transformer.
47:24
Self-attention is probably... there was this whole big era, which I was also involved in, where people tried to undermine attention as much as possible. They tried to remove it, simplify it, make it efficient, this whole efficient-attention era. At the end of the day, the outcome was always: oh, we removed all the attention, but we have one layer of self-attention left and it still works. That's always the end of the story.
47:26
Which even Noam Shazeer, quite a character, has published some stuff on: he has some ratio of mixing local and global attention, right? Basically still attention, but modifying it quite a lot.
47:48
I would consider local and global attention to be still attention.
48:01
Just like how much you're skipping.
48:06
Yeah. The only question is, if the formulation changes too much, your QKV becomes like ABCDEFG or something like that, or some...
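Since, as Yi says, local and global attention differ only in what each token is allowed to look at, here is an illustrative Python construction of the two masks, plus a hypothetical interleaving ratio (three local layers per global layer); a sketch of the idea, not any particular model's configuration.

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # Global attention: each token attends to every earlier token.
    return torch.ones(n, n).tril().bool()

def sliding_window_mask(n: int, window: int) -> torch.Tensor:
    # Local attention: each token attends only to the last `window` tokens.
    i = torch.arange(n).unsqueeze(1)   # query positions
    j = torch.arange(n).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

# Hypothetical interleaving ratio: 3 local layers for every global layer.
masks = [sliding_window_mask(8, window=4) if layer % 4 != 3 else causal_mask(8)
         for layer in range(8)]
print(masks[0].int())  # banded lower triangle (local layer)
print(masks[3].int())  # full lower triangle (global layer)
```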
48:07
Maybe I'll give you some motivating constraints to push on this. You guys are still charging 2x for context beyond roughly 200k tokens, something like that, and the theoretical max is 2 million tokens, right? What if we need 200 million? Is there some point at which even this concept of input token context is irrelevant, because you are doing continual learning, that kind of stuff? Right now you're modeling it as: AGI will be achieved through a sequence-to-sequence transformation, and attention is the best sequence-to-sequence architecture, therefore attention is all you need. But I think other people are like: sequence-to-sequence doesn't accurately capture intelligence.
48:15
But that's not really about sequence-to-sequence. It's more about the whole gradient design and backpropping, right? It's not the architecture itself; that's more the learning paradigm. I think the architecture is basically the interface between the learning algorithm and the tokens, so it's more about the learning algorithm itself, and this continual learning thing. There are many ways to think about processing insanely large contexts, like 200 million or 1 billion tokens, right? Maybe you have a new learning algorithm where every time you run inference, you learn on it; then you can technically have some kind of memory, like how a human being is learning as I'm talking to you, right? That's one way. The other way is, okay, maybe somebody will say attention is just too expensive for 200 million or 1 billion context, so we need a new architecture. Or some people will say, oh, we just improve the chips, the accelerators. So there are many ways to interpret it. But if it's about continual learning and stuff, there are a lot of fundamental things about the learning algorithm as well.
48:59
That will have to change.
50:09
I think the learning paradigm and the architecture go hand in hand. And as the field progresses, ideas just stack on top of one another. So there's also this thing where an idea that's proposed has to be compatible with all the work that's been done before in order to shine. It's a variant of the hardware lottery by Sara Hooker. Not the hardware lottery thing I wrote about GPUs failing, but the original hardware lottery, except here it's a lottery where the things proposed have to play well with the things proposed before. So it's a bit like going down a local minimum to some extent. Now we are in this local minimum of Transformers, everything. Maybe it's not easy to get totally out of it, because so many people's investment and optimization have been done; new things need to play well with the ideas before. And the way I see it now, it's very difficult to come out of it.
50:10
Okay, I'm not entirely convinced. I see what you're saying, but let's call it gen AI, I fucking hate that term, it's still a very young field. Yes, there have been eight years of work on the Transformer, but what's that in the grand scheme of things? Maybe we're in a local minimum and we've got to nudge ourselves out of it. I do want to leave that open-ended; I don't have an answer. I do think people are in what Ilya Sutskever has been calling the age of research.
51:02
Right?
51:28
We're like, okay, we scaled up what we can scale up. We know what the next one or two orders of magnitude look like on every dimension of scaling that we know about. But what is the next dimension to scale?
51:29
There's a bit of a misunderstanding that the last five years have just been scaling things up.
51:41
Okay, please tell me more. Yes, you made that joke about now we scale researcher salaries.
51:50
Okay, let's not go there. But I think ideas matter, and there have been a lot of good ideas in the last five years. It's not like you could take an MLP without self-attention, say, okay, I'm going to throw $100 trillion at this, and scale that thing up.
51:55
It's never going to work.
52:15
It's never going to work.
52:15
Yeah, yeah.
52:16
So there's a part of this where I think the bitter lesson gets used too much, too conveniently thrown around. There's also a sweet lesson, which is that ideas matter. And even today, people downplay ideas and stuff like that.
52:17
Do you think the rate of new ideas, without being specific about which ideas, because obviously you can't share, has increased or decreased? Because there's kind of a law of diminishing returns, or at least smaller returns.
52:35
I think the number of ideas is always proportional to the number of researchers working on a problem, so by definition it should increase. But the number of ideas that actually work is not decreasing either; we're not in the era of diminishing returns yet. So ideas are still very important, and there are still very good, game-changing ideas being invented.
52:45
And I think I know the answer to this, but is the closed-lab advantage increasing or decreasing versus open source? The Chinese labs keep publishing open-source models, and some of the American labs do as well. Nvidia has Nemotron, OpenAI has GPT-OSS. These are all basically checkpoints on what is publicly known about training models as of this year.
53:07
Oh, okay, okay.
53:38
It's declassified information because everyone.
53:40
Yeah, okay.
53:41
Yeah, everyone does this.
53:42
I think that the gap is increasing.
53:44
I don't think it's completely predictable from the stuff you've said before.
53:46
I think the gap is definitely increasing.
53:49
Yeah, I think that justifies researchers. Otherwise, what's the point of having researchers if not finding new tricks that compound over time?
53:51
But definitely I think this is increasing.
54:03
Okay, I'll do a side tangent; I don't know if you have any comments on this. This is related to Nvidia's recent purchase of Groq, which I don't know if you have views on, because you're very TPU-centric. But are we memory-bound or compute-bound? This is relevant to the Transformers discussion in terms of serving, exactly. The classic view is that we're compute-bound, because we just need more compute for pretraining and RL and then inference. But the counterargument I would make is that I have these charts of Moore's laws, I wish I could pull them up easily: the scaling of compute versus the scaling of memory versus the scaling of network bandwidth. And compute has a much higher scaling slope than the other two.
54:05
Memory? Honestly, I don't think about being memory-bound that much. So I would disagree with it, but I don't have high confidence in that.
54:52
And because you're mostly on the research side, less on the inference side.
55:04
Yeah, maybe the inference guys would say otherwise. I don't wake up and think about serving, so maybe I don't think about inference that much.
55:06
My previous line of discussion here was that Nvidia was very foresighted buying Mellanox, because networking is actually the real bottleneck in scaling, since it has the lowest Moore's law slope. And the second one now is memory, which is very interesting.
55:15
Okay, okay. But honestly, I don't think about that much. Yeah, I understand.
55:29
Okay. Data efficiency. So this is a joke, but implicit in it is that there's some kind of maximum data exposure, right? Previously I would have said a lot of training paradigms were "one epoch is all you need", title of this idea. I would say the real number is maybe between three and four epochs. And I do wonder what the theoretical limit of a model's data efficiency, in terms of training and compression, should be.
55:34
I don't know what that means, data efficiency, but you're asking, in a way, how many repeats are tolerable?
56:08
Whether it's tolerable is contingent on whether it actually improves anything meaningful. It's not that you want to do it for its own sake. But there's that, and then there's also the sheer amount of stuff we could learn from limited data. So let's say you're not compute-bound, you're not memory-bound, but you are data-bound. Last time we were on the podcast we talked about Chinchilla-optimal versus inference-optimal training. But now a lot of people are even talking about data-optimal training: given a limited dataset, how well can you learn from it? I think that's an interesting research direction not enough people are talking about. Maybe it's commonplace inside the labs, but it seems very clear that we are very unoptimized with regard to how much we learn from our data. I'll just put it there.
56:12
I think in general, extracting more from every data point is definitely valuable, and that's also related to the fact that we are running out of tokens in the world. But I don't work on data for pretraining, so the things I say are about the general state of the industry, not any lab specifically. And I don't even know whether the way data is done has diverged too much across these labs.
57:06
And none of it is open. There's a lot of cross-pollination, for sure.
57:38
Cross-pollination, okay. Yeah. But I don't think about data that much, the pretraining data that much.
57:41
Maybe in the first half of this year I would have said that pretraining is dead and everyone's just funneling all their work towards RL. And we had this Grok chart, which is very interesting, where they're spending the same amount of compute on RL as on pretraining. You think it's a psyop?
57:46
No, I don't know. I have no idea.
58:04
Yeah, I think people are taking it seriously. Especially the agent labs like Cognition and Cursor: they're taking open-source models from whoever and then adding, let's call it, pretraining-scale RL on top, if they have that level of infra and data, which they do. Which is very interesting, I would say. And this data-efficiency argument: to me it's also about trying to discover new paradigms of learning in order to get where we all want to go. And the existence proof is humans, right? Your two-year-old daughter is much more capable than an LLM at some things, having seen something like eight orders of magnitude less data. That's very interesting.
58:05
Yeah, comparing human learning and machine learning is...
58:54
Definitely. Purely as an existence proof that we could probably do better. Three examples of a dog, a fourth example of an unidentified animal: as a human, I can probably tell it's a dog. But machines, classically, you take 20...
58:56
The data efficiency of humans is definitely way higher than models', yeah. The only question is where it comes from. Is it actually putting more flops on every token? Or maybe it's back to the question of whether the Transformer is the optimal architecture. Maybe it's backprop. Maybe it's the off-policy-ness. Where is the bug, right?
59:09
Exactly.
59:30
But maybe it's a feature, not a bug. I don't know.
59:31
Okay, we've identified it. It took me a while to get this across, but this is the kind of data efficiency I'm talking about, and I think it's emerging as a theme. Basically at the end of every year I try to take bets on what the big themes for next year will be, and I think this is one people will really focus on, because you're feeling this data crunch even though everyone's still investing in data. I forgot to mention that I've been wrong about pretraining being dead. Yes. I've now met pretraining leads from both Anthropic and OpenAI, and I've seen the talk from the DeepMind guy recently. Everyone's still investing in pretraining, which is nice to see.
59:33
Nobody said pretraining was dead. I know.
1:00:10
No, it's a theory we're trying to prove or disprove anyway. So, okay, let me wind back to my general idea. Data efficiency seems worthwhile. You would treat it as: show me where the bug is and I'll go fix it. We don't know where the bug is; we just have an existence proof that it could be better. And the final link in this logical chain, for me, is that everyone is focusing on some idea of a world model as a version of this more efficient learning, which potentially might not take the form of a sequence-to-sequence Transformer. I don't know how that works; I'm definitely a little out of my depth here. To me, that is more efficient because every world must be internally consistent, and if the next piece of evidence comes in and invalidates some worlds, you no longer need to pursue those paths ever. You can just narrow in on the world you've identified. So to me, that is learning where you're learning to fit world models. Yes. Maybe you can treat the learning process as curve fitting.
1:00:13
Yeah. So you're learning the world, or rather learning the world model. Right. Okay. By sampling multiple world models and then finding out which one fits the data best?
1:01:18
So I guess my query is: this is what people talk about. And obviously feel free to attack it, because I'm just spitballing; this is what I pick up from talking with multiple people. What are you talking about with world models? What are you talking about with data efficiency and learning efficiency? And how do you gel it all together into a cohesive sense of the future?
1:01:28
What's the definition of a world model, from the start?
1:01:49
Yeah. There are three kinds of.
1:01:51
Okay, okay, go on. Yeah.
1:01:53
The first kind is the Veo kind, or, what's the other one, Genie, that DeepMind has, which is the sort of video world model. You model everything with some kind of Gaussian splats or whatever, and you inhabit that 3D space.
1:01:54
Yeah.
1:02:06
Second, let's call it the Yann LeCun, Meta school of thought, which I don't know if you're that familiar with. He has published the JEPA architecture, and separately FAIR has also published the code world models, which are specifically for code. Very interesting: you are executing code and modeling the internal state of the execution environment as you go, line by line.
1:02:07
Okay.
1:02:30
The LLM actually learns to predict those things, and it seems a lot more efficient at the scales they've tested, which is very cool.
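For readers unfamiliar with the code-world-model idea being described here, a minimal sketch of what such training data could look like: run a toy program and snapshot the interpreter's local variables line by line. This is purely illustrative of the concept, not Meta's actual CWM pipeline; the tracing helper and toy program are assumptions.

```python
# Sketch of the "code world model" idea: pair executed lines with interpreter
# state, producing (line, state) examples an LM could learn to predict.
import sys

def trace_states(fn):
    """Run fn, snapshotting locals as each line is reached; each snapshot
    reflects the state produced by the previously executed line."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line":
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        fn()
    finally:
        sys.settrace(None)
    return trace

def program():
    x = 3
    y = x * 2
    z = y + 1

for lineno, state in trace_states(program):
    print(lineno, state)
# A code world model is trained to predict each state from the source and the
# states seen so far, instead of only predicting the next source token.
```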
1:02:30
Which definition are you anchoring on?
1:02:37
The third one. The code one, the...
1:02:39
The code world model? That's the second one.
1:02:41
Those two are bundled together for me. The JEPA fans are probably hating me right now because I'm lumping all of Meta's work under one school of thought, but whatever.
1:02:43
Okay.
1:02:52
The first one is Veo and Genie, like spatial intelligence, those kinds of video-based world models. The second one is execution, some sort of explicit modeling of state as you run through the corpus. And then the third one is this amorphous thing I think people are trying to get to, which is what I said about resolving possible worlds and curve fitting as you learn, as you inference.
1:02:53
Yeah, but what is the world model itself? Is it like.
1:03:21
It is a mental model of where everything is and how you think the world works. What I think, what you think, everything.
1:03:25
But technically it's like it is something.
1:03:33
In the latent space.
1:03:35
Okay, okay. So for simplicity, it could just be a Transformer model in between.
1:03:36
So to me that is the most coherent thing to the current paradigm, which is you could actually do this in current Transformers. I think the way that you train it will probably have to be different.
1:03:42
Okay, I see, I see.
1:03:52
I don't have any conclusion here. I'm just throwing it out as something where I know you're interested in this kind of stuff and I don't have that many knowledgeable people to talk to about it.
1:03:54
No, I don't think about world models that often, because world models are just not really well defined in the first place.
1:04:03
So not necessarily world models. But the problem is learning efficiency, and maybe accuracy or AGI capability that isn't easily unlocked on our current path of scaling.
1:04:08
Yeah, I think when it comes to data efficiency, a lot of it is about finding ways to spend more flops per token. Because basically, if you are data-bound, you want higher data efficiency so you can learn more from every data point; you squeeze more out. So anything that can extract more, that can use more flops on every token, is definitely a form of data efficiency. Then there's the learning algorithm, because there's a different scaling law for humans versus machines versus dogs versus cats; there's that famous Ilya chart, right? And those two points are not entirely different things, because a better architecture might actually just be spending more flops per token. So if you get to a point where you're very data-bound but not compute-bound at all, you just find algorithms that spend a lot of compute on every token. The overarching point is that data efficiency is a learning-algorithm thing, and the question is whether the correct fix is just applying more flops per token to squeeze more out of every data point. Also, when you say humans are exposed to less or more data, it's very ambiguous, because they're technically on 24/7 and have a lot of different types of inputs, mostly visual. And whether they spend more flops on everything they take in is also a question; somebody would need to count how many flops the brain uses. Maybe they only seem data-efficient; maybe they're just spending more compute on every token. And maybe the learning algorithm is different. But I agree that data efficiency is very important, given that there's a limited amount of data in the world.
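A back-of-envelope on the "more flops per token" point, using the common training-compute approximation C ≈ 6·N·D (N parameters, D training tokens) from the scaling-law literature. The token budget and model sizes below are illustrative assumptions, not anyone's actual training run.

```python
# If you are data-bound (D fixed), the remaining dial is FLOPs per token,
# which under C ~ 6*N*D is set by model size N.
def flops_per_token(n_params: float) -> float:
    return 6 * n_params  # forward + backward pass, per training token

budget_tokens = 10e12  # suppose you are data-bound at 10T tokens

for n_params in [7e9, 70e9, 700e9]:
    total = flops_per_token(n_params) * budget_tokens
    print(f"{n_params / 1e9:5.0f}B params -> "
          f"{flops_per_token(n_params):.1e} FLOPs/token, "
          f"{total:.1e} total FLOPs")

# With D fixed, a bigger model (or more epochs, or heavier per-token compute
# folded into training) extracts more from the same data, which is one
# reading of "data-optimal" training.
```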
1:04:21
One more thing before we go into DSI. You know how we're talking about RL, and you're working on RL stuff: why are people paying so much for RL environments?
1:06:06
Wait, so who is paying for RL environments?
1:06:17
OpenAI and Anthropic, at least; nobody has said anything about DeepMind. A lot of the model labs that are not you are well known for paying at least seven figures to external startups to create RL environments for them to train in.
1:06:18
Okay.
1:06:34
And I think the question is, if your models are so good at coding, why don't you do it yourself? And so I think there's some amount of expertise that's being distilled from human experts into an RL environment that you can then let your agents run wild in. But I'm curious if there's any other deeper insight than that because I'm not satisfied with my own explanation.
1:06:35
RL environments that have a lot of domain expertise are probably very valuable. Though I don't know specifically which RL environments people are actually buying. What's the thing you're not satisfied by?
1:06:56
That it's so valuable. A lot of people are saying, look, it's a Next.js app inside a Docker container that logs stuff out when you send inputs in. You could probably do that yourself internally. Why pay so much for some startup you don't know to do it for you?
1:07:10
Actually, I have no clue about why this is happening. I have no clue.
1:07:26
And a classic example would be: if you want to build a computer-use agent for buying things in e-commerce, you would want RL environments that perfectly replicate maybe the top thousand e-commerce websites, and then you just do parallel rollouts on all of them. Does that seem meaningful?
1:07:30
I don't know.
1:07:49
All right, cool. DSI and LLM RecSys. A big bet for me this year, for my conference, was that we actually started focusing on LLM RecSys.
1:07:50
Actually, what's the motivation behind starting LLM RecSys?
1:07:59
I think RecSys is the king AI problem in consumer. It is the single most valuable thing. All your feeds; even search is RecSys.
1:08:03
Basically search. Basically retrieval is the God problem.
1:08:11
Right, because RecSys is ranking, but also filtering, personalization, re-indexing, performance. It is the God problem, and you get paid a lot for it. Engineers are not that excited by it, which is very weird: a lot of them don't work on RecSys and probably never will, and they don't see the monetary value that can come out of a good RecSys. The other two pieces of updates for me, which I actually didn't even know DSI directly tied into, were, one, Twitter publicly announced their feed algorithm is now an LLM.
1:08:16
In RecSys, LLMs are just used everywhere now. But whether it's actually a big LLM, or a generative-retrieval type of model, is another question, correct?
1:08:52
We don't know. All we know is that they've said they swapped out their current RecSys for an LLM-based one. But what is published is YouTube, where they actually adopted semantic IDs for YouTube's RecSys. And YouTube is obviously a big deal.
1:09:03
Is it like public information?
1:09:17
Yes.
1:09:18
Okay. Okay.
1:09:19
They came and talked about it with us, and they published a V2 this year as well; more info. So basically, last time you were on the podcast we didn't talk about DSI that much. But you actually have some background in IR. Do you care about IR?
1:09:19
No, I don't care about IR, but DSI, okay. DSI, or generative retrieval, was one of my favorite works of old. I do have some IR background: when I was doing my PhD I did some RecSys work, some retrieval work, so I have some IR and RecSys background. And generative retrieval and generative RecSys get very conflated. DSI started as a retrieval thing; we did Natural Questions, ranking over an index of documents, everything. We actually did an interview with Yannic about it, me and Don; the paper came out a long time ago. At that time we wanted to reimagine retrieval and search. We were still using T5 models; we were not in the LLM era yet, it was pre-LLM, the "okay, pretraining works" era, and there were some pretrained models around. So we wanted to reimagine retrieval, right? And retrieval and RecSys are all the same formulation, a ranking and retrieval problem. That's where we started to imagine retrieval as one giant model that encodes everything in its memory. We tried so many different semantic ID ideas; actually my collaborator Vinh was the one that came up with them. And the start of this whole generative retrieval thing was literally just giving a document an identifier and brute-force predicting it. It actually works, because the models can memorize. If you look at the literature, going all the way back to things like doc2vec, it's very odd: the docids have no meaning, they're just IDs in the vocab, not even numbers, and technically the models have enough capacity to predict them. But semantic IDs were the idea that you have some semantic association and you break down the search space hierarchically. How this evolved into RecSys: after DSI came out, Ed Chi's group, Mahesh and others, did some exploration of applying DSI to RecSys, and that's how that generative recommender systems paper came out. He didn't even know he was involved.
1:09:34
That's crazy.
1:11:43
That was basically us transferring it over: okay, DSI works, let's try it on RecSys. The recommender systems people have a slightly different way of doing semantic IDs, just because the domain is slightly different. But after that, I think we were done with the invention part of this one.
1:11:44
The rest is details.
1:12:04
The rest are details. Over time I also left Google, and these things evolved a little on their own. I saw that Spotify, like YouTube, is also using this type of semantic ID, this type of DSI model. From the research community's point of view, the DSI work was the first to encode and decode semantic tokens. But the community is strange in a way: they'll do things like, oh, this is generative retrieval, it's not generative RecSys, these kinds of random distinctions, which is a bit strange. Anyway, that was the whole history of generative retrieval. Apparently there are also a lot of people working on it; I don't follow it at all now, it's not even on my mind. But once, even in the Singapore office, there were people actually working on generative retrieval. I don't know whether they're still working on it, but I met a person who tried to explain generative retrieval to me, which was quite funny, given that I kind of co-invented it. This whole IR thing was just an interesting phase, and I definitely think DSI is one of my more creative works, one that is not really LLM.
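For readers, here is a minimal sketch of the semantic-ID idea described above: hierarchically cluster document embeddings so every document gets a short token code, and retrieval becomes decoding that code from a query. The recursive k-means recipe, corpus size, and code depth are illustrative assumptions, not the DSI paper's exact procedure; the seq2seq decoding step is left as a comment.

```python
# Semantic IDs via recursive k-means: prefix tokens narrow the search space
# level by level, so a document's ID looks like (5, 2, 7).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 64))  # stand-in for real embeddings

def assign_semantic_ids(embs, k=8, depth=3):
    """Cluster recursively; ids[i] is document i's length-`depth` token code."""
    ids = np.zeros((len(embs), depth), dtype=int)
    groups = [(np.arange(len(embs)), 0)]
    while groups:
        idx, level = groups.pop()
        if level == depth or len(idx) <= 1:
            continue
        labels = KMeans(n_clusters=min(k, len(idx)), n_init=4,
                        random_state=0).fit_predict(embs[idx])
        ids[idx, level] = labels
        for c in range(labels.max() + 1):
            groups.append((idx[labels == c], level + 1))
    return ids

semantic_ids = assign_semantic_ids(doc_embeddings)
print(semantic_ids[:3])  # each row is one document's ID token sequence
# Retrieval then becomes generation: a seq2seq model maps query text to these
# ID tokens, and beam search over the token tree ranks documents directly.
```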
1:12:05
Under the general principle of applying ML to everything: if Googlers are working on generative retrieval, would that be for something like AI Overviews? Is that something similar?
1:13:14
I have no idea.
1:13:22
Okay. For people listening, I did have a talk on this track; I think you just type in "AI Engineer" and you'll get it, where the YouTube guy was talking about how they use Gemini for their RecSys. I don't know what size of Gemini, because he didn't talk about it. But this is public work now: basically every YouTube video uploaded gets encoded into some kind of codebook, and they retrain this every day on some kind of batch job.
1:13:23
Yeah.
1:13:52
Which is interesting. So yeah. I don't know if you even know what Gemini is being used for.
1:13:53
I don't follow these days.
1:13:58
I do think, in the sense of, for people who are still not getting it: applying the general intelligence of an LLM to the retrieval or recommendation task means you can accommodate such weird recommendations, such weird queries, that no classical system could ever handle. And I think it's also somewhat emergent, in the sense that when you were using T5 you just couldn't add that much value on top of a normal BM25 retrieval baseline. Would you say that's accurate? It's not just about paraphrasing; it's about understanding query intent.
1:14:01
BM25 is a really strong baseline. Actually, BM25 is a really strong baseline.
1:14:38
Yeah, sorry, I don't know your comparative delta of T5 versus BM25, but I don't expect it to be very high, and I expect it to be a lot higher for a true LLM-based RecSys, depending obviously on the query set.
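For context on why BM25 keeps coming up as the baseline to beat: the whole ranking function is a handful of term statistics with no training at all. A standard Okapi BM25 sketch on a toy corpus; the corpus and query are illustrative, and k1/b are the usual default parameters.

```python
# Okapi BM25: rank documents for a query from term frequency (tf), document
# frequency (df), and a length normalization, nothing learned.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "the quick brown fox".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))  # document frequency

def bm25(query, doc, k1=1.2, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "cat on mat".split()
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
print(ranked)  # doc 0 should rank first for this query
```

The catch, per the discussion above, is exactly what it cannot do: no paraphrase matching and no query-intent understanding, which is where LLM-based retrieval adds its value.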
1:14:43
I hadn't really thought about it this way before, but I've done modeling in many different domains, including search, in the search and IR community. There are benchmarks and things people hill-climb on; I don't even know what they're called these days. But generally, the modeling dynamics of IR and RecSys tasks are very different from standard language tasks or vision tasks, different from the way we hill-climb LMs. Back in the olden days, you'd train models, you'd change the architecture, you'd try to improve perplexity, you'd use SuperGLUE. Even now, when you train LLMs, you do zero-shot and few-shot evals. As a researcher-engineer you interact with that environment a lot; you're basically being RL-ed by the environment. But RecSys and IR have a very strange feeling to them. Strange in the sense that it feels like whatever works, works; like you're in a world where the gravity is different, where the modeling choices that feel intuitive are not intuitive. So I wrote some papers back in the day on RecSys and stuff, and every time I ran modeling experiments for it, I didn't enjoy it. It felt like the environment was rude. The vibes were just off.
1:15:01
What makes it rude?
1:16:31
It just feels transactional. No, not transactional; I don't know how to describe it. For example, if you play sports, like tennis or badminton, when you hit the ball there's a very nice feeling of hitting the sweet spot. When you do modeling in traditional domains, when you get the feedback back, everything feels right. But RecSys and IR is like badminton where you hit the shuttlecock and randomly hear glass shatter. You just feel this weird sense that the cause and effect...
1:16:33
Are too far apart.
1:17:02
It just feels strange. And then sometimes the metrics: RecSys uses all the NDCG-family metrics, and BM25 is strong, and you just end up worse than BM25. Back in the day, when you went from stacking 2 LSTMs to 3 LSTMs, you were like, whoa, I see life.
1:17:02
It's just a very unrewarding area to work in.
1:17:18
It's just weird. Also, the IR and retrieval community is always behind the mainstream, and now it's probably getting even worse because of LLMs and stuff. Okay, I'm getting into hot-take territory, but some of the conferences are just behind NeurIPS and ICML. Some conferences are just applying things that...
1:17:22
They're downstream of.
1:17:43
They're downstream. They're downstream.
1:17:44
Okay.
1:17:46
So it always feels very uninspiring to work on this.
1:17:46
Yeah, look, there's a reason that you left.
1:17:50
But it was a side quest. I worked on it as a side-quest thing.
1:17:51
Yeah, yeah, okay, I understand. I still think it's an important business problem, even though maybe it's an unrewarding field.
1:17:55
I kind of understand why: the academic benchmarks for those tasks are just so far detached from what industry does. I didn't work on any of this in industry, just from an academic point of view.
1:18:03
Oh, then all you need are online A/B tests, right? A/B test and...
1:18:16
Okay, that would have been a different experience.
1:18:20
Yeah, that's mostly our research-topics coverage. I think we're just going to end on a very simple one, on GDM Singapore. You organized a symposium here; we brought Jeff Dean, Quoc, and all the others. Basically, what's the general message, or the impetus, for starting GDM Singapore?
1:18:22
So, talking about the event first. Quoc and I are starting a team, and before I came back, we discussed this for some time. Jeff was very supportive of it; he was in the region many times, in Vietnam, in Singapore, around the time I was going to come back. The event itself was Quoc and Jeff visiting, and we wanted to inspire the community here. It was also a bit of a soft setting of the tone for the start of the Gemini team in Singapore. It's a very rare instance where you get somebody like Jeff and Quoc, who are true pioneers of AI in the world, in one room, and you were there as well. And I think, like many people told...
1:18:42
Me, a true pioneer of AI? I was there to live-tweet.
1:19:25
Yeah. Having them all in one room giving these talks: many people came up to me and said they were very inspired by their presence in the region. And starting a team, starting something, there's no one moment where you press the button and it starts, right? It's a process.
1:19:28
Right.
1:19:45
So we hire people, and people join one by one, something like that, right? This event was more, I would say, to set the vibe. I think it's possible for Singapore to be close to the frontier, and having the true pioneers of AI here makes that a more inspiring thing. It also let Quoc and Jeff meet the people here. Jeff was here last year, but Quoc hasn't been here for some time, and he's going to have a team here, so it was nice to bring him around to meet people. It was a really amazing event. We met Loong as well.
1:19:46
Who, a lot of people don't know, has a CS degree. He's one of the few PMs with a CS degree.
1:20:21
Yeah. I would say the context of the meeting was partially that he wanted to learn more about the IMO stuff.
1:20:26
Oh really?
1:20:36
And then also about Jeff.
1:20:36
Oh, because they invited you without knowing that these guys were coming or something like that.
1:20:38
Jeff and Quoc and me, we went to visit and chat with him at the Istana, and we discussed a little bit about Deep Think, a bit about the IMO. And then the rest of it was more Jeff and Lee Hsien talking; it became less about AI and more about very macroeconomic, political things, which I was very out of my element with. So I was just...
1:20:44
You were in a suit.
1:21:08
I was just talking about the Deep Think and IMO stuff. But he seemed genuinely quite surprised that AI has reached this point. It was also interesting, I would say.
1:21:09
For people listening: you have done something that is unique in Singapore's history so far. You're establishing a frontier research lab in Singapore, which is an accomplishment, I think. The other thing I'm still trying to wrap my head around is whether geography actually matters. You're all working on the same team; you have your London people, your Mountain View people, and mostly you're just collaborating with them anyway. You've collaborated with them your whole career. I don't even really know what countries mean anymore when it comes to research, or AI in general, because this thing is inherently international from the start.
1:21:20
This is a very good question. It's also related to the thing about identity, because you also move between SF and Singapore quite a bit. I was in Mountain View one or two weeks ago, and I'm here now. Aside from my family, almost everybody I talk to is somehow in the Bay Area, just because of work and everything. I think geography matters. Firstly, the most logistical thing is probably the time zone.
1:21:56
So you literally want the 24 hour coverage around the world.
1:22:26
There are advantages. What I'm saying is that, possibly, people define the location more than the location defines the people, somehow. Okay, we got to the time zone a little bit; then there are pros and cons. I think there are pros and cons, right?
1:22:30
So you're bullish on Asia, on Singapore people, the talent pool?
1:22:44
I think we've managed to find really amazing people. But I also have to say this type of thing is more like talent attracts talent. Most of the time, the vibe I get is that people are very excited because it's Quoc's team and my team, and we're working on very core things related to AGI. So the talent we can get from the region is really good, but it's only because it's us that we can unlock this talent; otherwise they might join some other place.
1:22:49
Yes. And move to the U.S. yeah.
1:23:22
On the identity side, I definitely agree with you: why does it matter? I think the advantage of Singapore, or really anywhere, is that connectivity is so good that you can interact as much as you want, and you can also just go there. But Singapore has this advantage where you can be close and you can be far. I have friends in London and New York who would just never move to the Bay Area. I'm not against the Bay Area; I think it's a great place. But it's just AI, AI, AI everywhere. Sometimes you want the mental space and energy for some other culture, and London, Singapore, New York have their own cultures. The Bay Area culture is AI. You go anywhere, you hear AI everywhere, even on the billboards.
1:23:25
It can be a bit much.
1:24:17
Yeah.
1:24:18
Although I did see some billboards down here. I was like, what, is this culture infecting Singapore?
1:24:18
I do think that, to some extent, if you want to do research, you need a little bit of peace and quiet somewhere.
1:24:26
Right.
1:24:32
So this island may be good for that, but then we're still able to be connected. Right. Okay. So I think that's mainly talent wise. I think people are strong here. Yeah.
1:24:32
So far enough away. But you're still connected, you have strong talent. What are you hiring for? You're still hiring, right?
1:24:46
We're hiring. My team will work on RL and reasoning for Gemini and Gemini Deep Think. We care more about talent density now, so we're not growing that big; small first, just because compute per capita is probably important. I personally get a lot of DMs, and generally there are a lot of very capable people. But what I'm looking for mainly is either a track record of RL research, or not necessarily RL, some exceptional achievement in coding competitions or some exceptional achievement somewhere. That's the kind of people we want.
1:24:52
Yeah, because you don't strictly require that. I do remember something from your Reka days where you said you like to train your own juniors from scratch, right? So they don't...
1:25:39
I forgot if I said that or not. But to some extent, we'd definitely be very happy with people who have very high stats even without much ML knowledge. No, stats as in stat points: WIS, INT, like, this high. People with just raw IQ, high raw talent, strong engineering skills. ML can be learned easily; AI knowledge can be learned easily.
1:25:47
Yeah. Maybe one version of this is: can it be done on a student budget? Can you still do something interesting on a student budget? Relevant to the point that conferences are quieter these days: I did an interview with one of the best-paper winners, who worked on thousand-layer neural networks for RL, and that was done on a student budget. It was a very cleanly executed paper with good findings. Look, I'm not sure production models will ever go to a thousand layers, but they stretched things in an interesting direction and found some good recommendations, and the guy immediately got hired by OpenAI. I think that's encouraging for the grad students in the market who are like, okay, do I need to know somebody who works at these labs in order to get in, my uncle works there, I get the internship, or whatever? No, actually, you can just do it on a student budget with good advisors.
1:26:16
Oh, actually, one interesting thing is that most of the people, I went to recruit them personally. You see their work and then you send them DMs, right? No, no, for hiring generally. So, to your point, you can almost just do good work, put it online, and somebody will contact you. It's actually super easy but super hard at the same time.
1:27:03
I can tell you, I talked to a few of these grad students: they don't know what good work means. Their professors have agendas they're forcing on them, which may not be right, because it's not like their professors know what to work on either. They just need guidance: hey, work on these five things, show me an interesting result in any of them.
1:27:29
Okay. So if somebody independently comes up with something you feel is very tasteful and aligns with what researchers in the labs like, you know that the function that produced this idea is good. Whereas if you just go and tell somebody to do this, you only get the signal that they can execute. So I think there's some value in...
1:27:49
People that demonstrate taste, yeah. Research taste. It's very interesting. I do care about this in some ways; the research-directions work that I do is a little bit of that. It's low accountability for me, because obviously it's just thought experiments. But for a lot of people, their career is bounded by: can you demonstrate research taste in the short three or four years that you have?
1:28:11
Yeah, there's so much competition, just because everybody wants to get into AI. Mostly it's about how you're going to prove yourself. It must be hard these days to be a grad student trying to prove yourself. It's definitely harder. Yeah.
1:28:38
Not your job. Okay. That was it. Do you have any other sort of rants or topics that you had queued up before we wrap?
1:28:56
I don't. Yeah. But it was great. I had a great time.
1:29:03
It was fun chatting, man.
1:29:05
Fun chatting.
1:29:06
Yeah. Last time we met at the symposium we were supposed to record, but we just ended up hanging out and chatting. It's just nice to get the brain dump of what's going on in your world, because you're working on really important stuff, man.
1:29:06
Always great to chat with you and see you.
1:29:19
Yeah, good to chat. Parting words on the weight loss and workout journey, because that's also a big thing for you.
1:29:21
I think being healthy is important to do good research. And I think I'm probably at peak physical health now.
1:29:28
Yeah. You look great.
1:29:38
Yeah, thanks. And I think it's also impacted my work in a good way.
1:29:39
You did the sort of Kathy inspired, like biohacking.
1:29:43
I didn't go too extreme, but I was also quite data-driven about it. I would track everything myself. I'm still supposed to write a blog post about this, but I feel like I'm not at the endgame yet; when I get there. For people who don't know, I think I lost 23 kilos this year, actually across one year, one and a half years. So, 23?
1:29:48
Yeah. Basically, literally from the last podcast to now.
1:30:14
Yes. There's an ablation study now: 23 kilos. Yeah. And my HRV, heart rate variability, has gone up 2x, and my resting heart rate has dropped by 30 beats per minute.
1:30:16
30 beats per minute.
1:30:27
It was like 80, 90 and now it's like 60.
1:30:29
Oh yeah, 80, 90 is super high.
1:30:31
Yeah. I was unhealthy. Yeah, yeah, yeah.
1:30:33
Okay.
1:30:35
Yeah.
1:30:36
So when it's hard, did you have a thing that kept you going? A lot of people are focused on AI, including myself. I prioritize work; I enjoy work; I don't enjoy the fitness side. But obviously it feeds into your intellectual work: log off and go for a walk, eat better, all that kind of stuff. And people seeing a positive example like you will get inspired to do the same thing. So I think it's good to set yourself up as an example.
1:30:36
It definitely helps. When I do these things for my health, I think of it as part of work, because it helps me get better at my job. So it's important as well.
1:31:07
Well, yeah, I like that you know your HRV off the bat; I have no idea what mine is. There's a general question about what productivity is and how you measure it, what really matters. It's still unclear to me, but I do think general energy level and hunger matter. Almost like you have to experience physical hunger in order to have intellectual hunger. I don't know if that's a thing.
1:31:17
No, when I'm hungry I just think of food. To me it's just distracting; it's hard to do work when you're hungry. Yeah.
1:31:43
Okay. Thank you so much.
1:31:51
Yeah, thanks. It's really great. Yeah. Have a great time.
1:31:52