Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2
Yi Tay, AI researcher at Google DeepMind Singapore, discusses his team's work on achieving IMO Gold medal performance, the transition from off-policy to on-policy reinforcement learning in AI systems, and the establishment of Google's new Reasoning and AGI team in Singapore. The conversation covers technical aspects of AI reasoning, the evolution of transformer architectures, and insights into scaling AI systems.
- On-policy reinforcement learning (where models learn from their own outputs) is becoming more effective than off-policy learning (imitating others' successful trajectories) for developing reasoning capabilities
- The decision to abandon specialized systems like AlphaProof in favor of end-to-end general models represents a significant philosophical shift toward unified AGI systems
- AI coding assistance has reached an inflection point where it can effectively debug and fix complex ML training issues without human investigation
- Data efficiency remains a critical bottleneck, with humans demonstrating orders of magnitude better learning efficiency than current AI models
- Geographic distribution of AI research teams provides 24-hour coverage and talent diversity while maintaining global collaboration
"On policy is basically this idea of model training on its own outputs and letting the model generate its own trajectories and then letting some reward verify it and then the model train its own outputs."
"If the model can't get to IMO gold, then can we get to AGI? So it's basically at some point we have to use these models to try these Olympic competitions."
"The wonderful thing about this era of LLM is that you can be an AI researcher engineer and you don't have any domain knowledge and you can still get a gold medal."
"There were so many moments this year where AI suddenly crossed that emergent thing. AI coding is one of them that we just discussed."
"I think machine learning is the most scientific way we have ever studied learning, just in general."
The thing that I find the most useful about these models in general is when I have this big spreadsheet of a lot of results and I just need plots of it. I can just go: screenshot, make a plot of this. I hate making this matplotlib stuff; it's so annoying. There were so many moments this year where AI suddenly crossed that, like, that emergent threshold. AI coding is one of them, which we just discussed. I think Nano Banana also got to the point where... usually if you make these images, it's just for fun, you just troll your friend or something like that. But Nano Banana actually really got so good.
0:00
Welcome back.
0:36
How are you? Yeah, I'm good. I'm good.
0:37
Great to be back. It's been one, one and a half years.
0:39
Yeah, it's been one and a half.
0:42
Feels like a long time. So last time we talked, you were at Reka. Yeah. And then you joined GDM again. Working for Quoc again.
0:43
Yeah.
0:51
And more recently, you've started GDM Singapore. Yeah. Is it GDM Singapore or Gemini Singapore? I don't know if you've named the team.
0:51
I think we have a Gemini team in Singapore.
0:59
Yeah, Gemini team in Singapore. It's called Reasoning and AGI.
1:00
Yeah, Reasoning and AGI.
1:03
Is it important to have AGI in the name?
1:04
It was like a vibe thing that we put AGI in.
1:07
Yeah.
1:10
I think that, like, one reason why we work on these models is that we want to get to AGI. And it was just a vibe thing that we added AGI to the job posting. Yeah. There is no formal name for the team yet, but it's basically the Gemini team in Singapore.
1:10
Yeah. I mean, I think people are trying to figure out, triangulate: Amazon has an AGI team, you guys have an AGI team, and then, let's say, Meta now has a Superintelligence team. What are people signaling when they choose these names for their teams? Is it, oh, we have a plan, or is it just vibes?
1:24
You're trying to fish hot takes out of me.
1:42
No, you have officially AGI in your job title.
1:43
No, it's not a team name. It's not a team name. Yeah, it's just, you know, we just want to signal the North Star: we build these models to get to AGI. Yeah.
1:46
Yeah. No, I wasn't really fishing for politics. Okay. So you rejoined GDM.
1:54
Yeah.
1:59
And I think last time we talked... I listened back to the whole thing; it was an amazing episode. Last time you were talking about how, externally, you were in Brain and then came out, and now you're back in GDM.
1:59
Yeah.
2:10
I wonder, what are your general reflections on plugging back into the Google infrastructure?
2:10
Oh yeah. So I guess coming back is very interesting, because when you return to Google, everything, including your LDAP, your username, is all the same. It's like you play Pokémon, you set it aside, and then you go back and you click continue: save game and continue game. It's like that. Obviously, in the last 1.5 years while I was away, many things changed; Brain is now part of GDM and stuff. But overall, coming back has been pretty seamless. Obviously I love Google infrastructure, and I think the dev tools are great and stuff like that. Yeah. And I'm very glad to be back. Yeah.
2:15
To Google infra.
2:51
Yeah.
2:52
And was the intention always that you were going to work on Deep Think?
2:52
No, not really. I think I missed research a lot; doing research, not super fundamental research, but close-to-the-model research, right? I really missed being at the frontier and trying to go beyond it. So I really missed that a lot, and when I came back, Deep Think wasn't a thing, and I don't think there were any plans, actually. It was just: I'm going to work on research and see what happens. Yeah.
2:56
But I'm sure, I guess there was some inclination that reasoning is the next frontier and that's like obviously the most rewarding research path, especially this year.
3:19
Yeah, I think these days reasoning and RL kind of come together. I spent a lot of my past life, I call it the past era, working on architectures and pre-training, but now I've transitioned more into RL research. Not old-school RL; it's LLM RL as opposed to old-school RL. And to be honest, I had almost no RL background coming back. But RL is the main means of modeling these days, so it was pretty easy to jump back in. A lot of fundamental skills in research are general-purpose and universal, and it's quite easy to innovate even in a toolset you're not super used to. So yeah, RL is basically the main modeling toolset that we play around with these days.
3:29
Superficially I see, in your UL2 and T5 work, some overlap: the focus on objectives, the focus on what you're trying to incentivize. So I would have maybe guessed there was more overlap than you're saying right now, which is interesting, but I understand it's very simple.
4:15
Objectives are objectives, and they have some overlap, right? Yeah. But I think it's mainly the on-policy versus off-policy design of these things that changes, and also the learning algorithm itself, right?
4:38
Let's just introduce this terminology for people who aren't that familiar with RL policies. I do think a lot of people are trying to understand what is working about this generation of RL research. Anyway, Jason had this interesting post, which I think you are co-signing, which is basically: you always want to be on-policy. Instead of mimicking other people's successful trajectories, take your own actions and learn from the reward given by the environment. Basically, correct your own path instead of trying to imitate other people's paths.
4:49
Yeah, yeah.
5:17
First of all, he writes really well and I wish that more people wrote like him, but I don't know what's your reflection on that or your addition on top of that.
5:18
Yeah. So I think the best analogy for on-policy and off-policy is: off-policy is basically like when you SFT something. You take some other model's, a larger model's, generated outputs, trajectories, whatever, and you train off somebody else's generated outputs. On-policy is the core idea of modern LLM RL, where you generate, then reward the model based on its own generations, and then the model trains on its own generations. So it's a bit like self-distillation to some extent: the model generates its own output, you reward it, and it trains on its own output. On-policy-ness is basically this idea of the model training on its own outputs: letting the model generate its own trajectories, letting some reward verify them, and then the model trains on its own outputs. I think this is more generalizable in general. There's still a lot of science to be done on the gap between SFT and RL itself, but it's basically on-policy versus off-policy, right? And bringing this analogy back to real life: this on-policy-ness is more like humans, because we go around the world, we make mistakes, and then we learn. Imitation learning is mostly somebody else, not first principles; it just tells you what to do and then you just copy. So I think bringing this philosophy to life is quite powerful. Now I have a kid and everything; I want my kid to try stuff, and then you tell them, okay, this is where this went wrong, where this went right, rather than, okay, you just copy everything somebody else does.
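To make the on-policy versus off-policy contrast above concrete, here is a minimal, runnable Python toy (not anything from Gemini's actual stack): the "model" is a softmax over two answers, the "verifier" rewards the correct one, and the policy updates on its own sampled outputs. Production LLM RL uses PPO/GRPO-style variants of this REINFORCE update at vastly larger scale.

```python
import math, random

# A toy of the on-policy loop described above: the "model" is a softmax
# over two answers to one fixed question, the "verifier" rewards the
# correct answer, and the model updates on its own sampled outputs.
# Plain REINFORCE; real LLM RL uses PPO/GRPO-style variants of this idea.
logits = [0.0, 0.0]   # index 0 = correct answer, index 1 = wrong answer
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for step in range(100):
    p = softmax(logits)
    action = 0 if random.random() < p[0] else 1  # model samples its own trajectory
    reward = 1.0 if action == 0 else 0.0         # verifier scores the output
    advantage = reward - p[0]                    # baseline = expected reward
    # Policy gradient: d log P(action) / d logits = onehot(action) - p
    for i in range(2):
        logits[i] += lr * ((1.0 if i == action else 0.0) - p[i]) * advantage

print(softmax(logits))  # probability mass concentrates on the correct answer
```

An SFT (off-policy) step, by contrast, would simply maximize the likelihood of a fixed demonstration, with no sampling and no reward.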
5:25
Yeah. Montessori schooling is mostly that, right? Very unstructured learning: you discover your own path and we just give you a safe environment to do it. Yeah, yeah, yeah, yeah. What is the point at which you should transition from imitation to on-policy? I do bounce back and forth.
6:53
Humans, right? Not models.
7:09
Right. I would say in models there mostly has been a very concrete split: first you imitate, that's pre-training, and then you RL at the end.
7:10
Technically SFT is still imitation, but I think for humans there's a little bit of this, right? Take sports: you play sports, you start off by imitating, like hardcore imitating, but you cannot imitate forever. I don't know whether this is a good analogy, but watching a lot of tutorials and stuff is more like imitating; you try to learn certain movements and stuff like that. But then on-policy-ness is going to the game itself and trying to get a reward signal from that, right? So I think humans do need some form of imitation learning; everybody starts off by imitating. But then again, the human and model thing... it's just fun to have analogies, but we shouldn't take things super literally and stuff like that.
7:18
I actually am quite a serious taker of machine learning insights into human learning.
7:58
That's what we learn from models now. Yeah.
8:05
Because I think machine learning is the most scientific way we have ever studied learning, just in general. That's true, that's true. We had to invent curricula from scratch.
8:07
Yeah, that's true.
8:17
And things like learning rate. If your learning rate is too high, blah, blah, learning rate is too low, blah, blah.
8:19
Like where do humans even have a learning rate?
8:22
So I do tell people to keep an idea of their own learning rate and to be wary of it being too low. For example, if you've been wrong once, you should ask: where else have I been wrong? And typically, you know what I mean, people usually update slower than they should when they have been wrong. What is it?
8:25
Stubbornness?
8:46
It could be stubbornness. I don't know, is that the right word for it? It could be that they're too Bayesian when actually their prior assumptions are wrong and they need to completely throw out their previous assumptions, because one counterexample invalidates all prior experience. Your entire world model is wrong; throw it away. So the Bayesian update is actually wrong there. Let's say you've lived for 10 years under some assumptions and you have one example that breaks your narrative. You shouldn't be like, okay, now I make a 2% update. No, actually it should be: oh, something's really freaking changed, everything I've assumed for the last 10 years is probably wrong. What else am I wrong about? And update 20%, update 50%, not 2%. You know what I mean? That's a learning-rate thing for me. So my direct example is the whole getting-into-AI stuff. I was watching GANs for 10 years. Has it been 10 years? 2012, 2013.
8:47
Time flies.
9:43
Yeah, I was watching GANs and I was like, okay, this is cool, it's getting more detailed, not that impressive. Then all of a sudden Stable Diffusion came out and you could run it on your laptop. And that was my learning-rate moment: okay, fuck, my mental model of generative images did not include this. And so I was like, okay, I am very wrong and I need to pivot everything. And that's how I started Latent Space.
9:43
So will this mean that your learning rate is high?
10:05
Yes, I will nudge it up. I schedule my learning rate: when a world model has been violated.
10:08
Okay. I think it's a good strategy. I think this also relates a little to when new paradigms happen: how fast people are to adopt them, or to invalidate their understanding of things. As scientists, a lot of the time, as the field progresses, we do have to keep invalidating our own world models. It could be that a certain way was the way to do something all along, and suddenly something comes along and invalidates it.
10:14
Yeah, you can be very proud of your priors until they become your prison. That is actually very dangerous. Okay, that was a bit of a tangent; I don't know how we got there. You did highlight Denny's LLM reasoning lectures, where he traced the intellectual history of reasoning in LLMs, from chain of thought onwards. And one part that I was going to prompt you on a little was self-consistency, right? Which I think people roughly know. I think it's more crudely implemented at OpenAI than with you guys, where it's straight up: they run eight inferences and they judge, or whatever. But I do think that is also relevant to on-policy distillation, where literally you have eight different paths and they're all from the same model. So, checking my intuition there: basically, the stuff you're saying about why on-policy is important, and using, let's say, an external verifier to improve your reasoning, you can also do that with parallel reasoning.
10:37
Oh yeah. I mean, when we train RL models, they sample multiple times. So yeah, to some extent there's some form of self-consistency.
11:32
Is that directly... there's self-consistency there, right?
11:40
Yeah, self-consistency is a little bit more... the more nuanced version of that. If you talk to Denny, he will tell you it's not majority voting for sure; it's more...
11:41
I agree.
11:48
Yeah, it's a more nuanced version of that. But parallel thinking definitely is related to self-consistency.
11:48
Yeah, and for those following along, OpenAI also actually put out some interesting papers on majority voting versus other forms of multiple-output consensus. Basically, at the highest level, there's an actual LLM judge that decides whether a trajectory is more valid, based on some internal consistency or just inspecting the chain of thought, which is very cool; we can train models to do that.
11:54
Yeah, for sure. Yeah. Self-consistency is a big fundamental idea. I mean, chain of thought itself was also a big idea. And then self-consistency was also a big fundamental idea in modern LLM literature.
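As an aside for readers: the crude, commonly implemented core of self-consistency alluded to here (not Denny Zhou's full formulation, as Yi notes) is just sampling several reasoning paths and majority-voting the final answers. A minimal Python sketch, where `generate_answer` is a hypothetical stand-in for one sampled chain-of-thought LLM call:

```python
import random
from collections import Counter

def generate_answer(prompt: str) -> str:
    # Hypothetical stand-in for one sampled chain-of-thought LLM call that
    # returns only the final answer; here, a noisy oracle right 60% of the time.
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def self_consistency(prompt: str, k: int = 8) -> str:
    # Sample k independent reasoning paths from the same model, then
    # majority-vote over their final answers.
    answers = [generate_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # almost always "42" for modest k
```

The LLM-judge variant discussed above replaces the majority vote with a model call that inspects the candidate trajectories and picks the most internally consistent one.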
12:19
Yeah, amazing. Okay, so let's bring it to, I guess, one of the headlines of this podcast, which is going to be diving into the IMO work. So this was around July, when you guys announced it. Oh, this very nice photo here. This is in London, I believe, where you had Dishoom.
12:33
Yeah, Dishoom. Oh, you've got to be at the photo-taking to get the credit.
12:51
That's bullshit, right? What the fuck?
12:58
No, I'm just kidding.
13:00
The contributor list is bigger than this.
13:01
Yeah, yeah, but like, they were like saying like, oh, okay, you should go to the.
13:03
In order to get a literal gold medal.
13:06
No, no, no. Like, to get the credit for being on the IMO effort first; the credits are in order. So it's just a joke. It's just an inside joke.
13:08
But anyway. Okay, could you tell the story of this IMO thing? Apparently it was done in one week.
13:14
So let me clarify a lot of things, right? The IMO effort has been very long-standing. Thang and others have been really working on this, even last year, right? Last year they got... I was not back at Google at the time. They had the AlphaGeometry stuff and then AlphaProof and stuff. So it's a very long-standing effort. But this year the goal was to try to actually use Gemini as an end-to-end model, basically.
13:19
No, no second system.
13:44
No second system. Text in, text out model.
13:46
Even that was a non-intuitive thing. I covered the silver result from last year and I was like, okay, it's pretty close; it's one point off from gold. Just try a bit harder, you'll get gold. That decision to abandon it, I think, was pretty bold. I don't know.
13:49
In retrospect it's easy to say this, but I personally always believed in it. It's a bit like: if the model can't get to IMO gold, then can we get to AGI? So at some point we have to use these models to try these Olympiad competitions. And one of the goals this year was, okay, we're going to do an end-to-end, text-in-text-out model. That's where my involvement came in. I was only involved in the model training part of the IMO effort, so I have to say that Thang did most of the IMO thing; I just trained the model. As for what that work involves: basically, we prepared the model checkpoint for the actual IMO itself. That's also something that's easily overlooked about the IMO. Many times when you chase benchmarks, it's a thing you can keep running and running and hill climbing until you get there. But the IMO was a live competition. Some members of the team were in Australia for it, and the thing was happening live, unfolding live.
14:03
Oh, it's very AlphaGo. You receive the problem, you punch it into your system, and then...
15:08
Yeah, yeah, yeah. So some of the professors from Thang's team went to the IMO itself, the event. I don't even know whether the IMO counts as a conference, but there were people there in Australia. So it was a live thing, and there were people whose actual job was to run inference on IMO P1 to P6 as they came out, and they came out on different days, so different sets, day one, day two, something like that. The fun part is that I knew nothing about the IMO at all. I'm not a kid that took part in the IMO; I was too dumb for that. You're a piano player. Yeah, I played the piano. What I only knew was that, okay, we delivered the checkpoint, and that checkpoint was used to get the IMO gold. But then there was somehow a week in London where everybody gathered, everybody was flying to London, and this photo was taken there. And you get to see how all the different parts come together, being in the room with the other co-captains. It felt a little bit like a hackathon. So the training process of this IMO model itself was maybe a week or so, not the whole effort, basically.
15:12
Yeah, yeah. I think the question is... I'm still not over the decision to throw away AlphaProof.
16:26
Okay.
16:32
Yeah, basically, I think it's very major, and I understand that you have this goal of AGI; obviously at some point one model should do all of it, right? But if you had pointed a gun at me in 2024 and asked what you need to do IMO and IOI and ICPC and all the other stuff you guys did, I'd have said you need an LLM reasoning system that knows how to operate a computer, knows how to write Lean and run the Lean verifier, all of that. But basically you baked the Lean verifier into the chain of thought. It is not obvious that you can do that.
16:32
Okay.
17:10
At all. Because I think.
17:10
Okay, so what you mean is that it's in some way encoded in the parameters of the model somehow.
17:13
Yes.
17:19
Yeah. I mean, it's just whether, at the end of the day, you believe in this connectionist thing: one model, lots of parameters. I mean, there's also tool use, right? There's also tool use and stuff like that. But to some extent... we should be able to get to a point where... in the past, when LLMs first started, the model couldn't even be a calculator. Now it can somewhat be a calculator. So technically a tool like a calculator is somewhat encoded in the parameters of the model. Whether there are things that cannot be expressed in the parameters of the model is an open question. We don't know where that limit is, but I think we will keep pushing it. So whether it's something like a Lean system, or other things, a physics engine or something, we will continue to push that boundary. There were a lot of debates about symbolic systems versus neural, but I actually don't really know; to me it was just, oh, let's train the model. Someone told me to train the model, and I trained the model. There were people overseeing the IMO effort who decided this. And I also think these specialized systems are very one-off systems: you could create a chemistry engine, you could create a math engine, you could create the thing, right? But at the end of the day you want one model for everything. So this fits that direction a little more, where you have one model. And this model was also launched as Gemini Deep Think, a general-purpose Gemini Deep Think.
17:23
So it's basically unchanged, but with maybe some config toned down a bit.
19:00
Yeah. So the inference-time config that's served to most people is different, and the full IMO inference config was shipped to some mathematicians, just because of the inference cost, right? But it was good enough to be a general-purpose model. I think my take is that this intuition was what led to going towards one model instead, because with these specialized systems there's no end.
19:03
Right.
19:29
You can create many specialized systems.
19:29
Yes.
19:30
The most I can see in the future is that there'll be a model, and then if there's something that really cannot be subsumed by the model, you just use a tool or something like that, right? But my prediction is that most things can be subsumed by the model. And yeah, AI researchers are quite good at hill climbing.
19:31
History would say that you have a lot of evidence backing you up. Is this the model output? This is it, right?
19:50
This is what? Yeah, I think this is the model output. Yeah.
19:55
What do you see when you look at this? Obviously it looks like a well-written proof. It looks like something a real human mathematician would produce. People did compare yours versus the OpenAI one, where OpenAI's is a lot rawer, or they had to clean up their versions. We don't have to talk about OpenAI. But what was interesting to you when you saw this kind of output?
19:57
I want to give a little bit of a special disclaimer: I know nothing about math, right? So I think the wonderful thing about this era of LLMs is that you can be an AI researcher or engineer with no domain knowledge and still get a gold medal. It's a universal tool that you don't need domain knowledge for. I can't parse this at all; it's foreign to me.
20:16
But maybe a proof is a particular kind of chain of thought. I would say the other interesting thing, which some of your collaborators were talking about, is that this is the first example of reasoning in a non-verifiable domain, which to me isn't right, since proofs are by definition verifiable. I just want to give you things to riff on, or debates that might be worth digging into.
20:39
So I think aside from proofs, there are a lot of domains that are non-verifiable, or not easy to verify. When people say non-verifiable, they mean non-trivial to verify, or just not as easy as the solution to a math problem, because a proof is long-form, and that's why it's non-trivial to verify unless you convert it to Lean and then do all kinds of things, right? So I think there's a lot of work to be done in these non-verifiable domains. Yeah, I'm getting into territory where I'm not sure what I can and cannot say.
21:02
Okay, so yeah, sure, that's it. I think another thing that's an open topic of debate was how much domain-specific work or post-training was done, because you then went on to do the IOI and ICPC stuff as well, right? The same model.
21:33
I was not directly involved in the ICPC, but I was related to some extent. That's all I can say. Yeah, yeah.
21:48
Any other interesting callouts, maybe just on the team? You called out Jonathan as someone who was a co-captain on this effort. And yeah, basically, how did the effort look?
21:56
So there were four captains for the IMO: two from London, Jonathan from Mountain View, and I was from Singapore. So the four of us basically trained this model together. I was trying to remember what Thang was saying, but one interesting thing was that we were all in different time zones. And there's something very interesting about passing on the job; there's no real fixed workflow for how captains work together. It's more like, oh, I'm going to board the plane now, I'll be AFK for 12 hours, so someone else takes over.
22:07
So you're just babysitting the run sometimes.
22:37
There are bugs and stuff. The job goes down sometimes. So it's very ad hoc, and it's very busy between the captains, how we decided to work together. Yeah, but I think it was an interesting time also because we were all flying. I think the London folks didn't have to fly, but I had to fly and Jonathan had to fly. And when you visit another country, when you visit an office, you have many meetings, so I was in and out of meetings, and it was pretty interesting. And I also think that nobody really knew whether we would get gold at the time, because the IMO actually hadn't happened yet. It was interesting, exciting. And then there was the whole process of getting verified by the IMO committee, and, you know, we weren't going there ourselves. But I had to learn a lot about how the IMO works. Apparently the gold cutoff is not even a fixed number; it's like a bell curve. So there was a time where you just look at the scores. I was even watching the human participants and seeing what their scores were, because whether Gemini gets gold depends on how the humans do. So you're looking at it like, oh, if a certain percentage...
22:38
To some extent you don't have any control over that. So.
23:47
Yeah, but you're just curious, right? Because I would say that it's definitely more like exciting, like there's more adrenaline than like just running on a benchmark and getting a number. It's also like a process that took some time. Yeah, yeah. But I think overall if you have specific questions you can ask also. But I think this whole thing has been a highlight for me. This IMO effort has been.
23:50
Yeah. I would say most people, if you had asked them maybe two years ago whether a model could get an IMO gold, would have said: impossible.
24:11
Yeah.
24:19
Then the silver helped, right, from last year. But the fact that you could throw that system completely away, take existing Gemini, scale up Deep Think, and just run it for IMO gold, I think is also very non-consensus compared to last year.
24:20
Yeah, definitely, to some extent. I think researchers were also surprised. I wouldn't say surprised; it was more like a pat-on-the-back kind of surprise. We actually made a lot of progress, we as in collectively, all the engineers and researchers working on Gemini. There's a lot of progress being made; look at how far we went in one year. And I also think, take five years ago, not two years, five years ago: you just look at the state of AI now, just generally, IMO and ICPC gold, and even things like Nano Banana. If you compared AI progress now to five years ago, I think people would think we had already reached AGI.
24:34
Some form of AGI.
25:10
Some form of AGI.
25:11
We're just moving.
25:12
If you just traveled... like, you take these checkpoints and you travel back five years; someone should make a drama about this. But I think it's really quite impressive how quickly the field has moved. Yeah.
25:12
The hard parts, you would say, were scaling inference.
25:23
In what aspect? Like hard inference.
25:27
Expensive or hard, as in maybe the most brain power expended on the team. I saw some comments saying the hardest part was actually the inference optimization, the very, very long-horizon inference that Deep Think needed compared to normal Gemini, stuff like that.
25:28
I didn't work on the inference-time scaling, so I wouldn't know there.
25:46
That is mostly that. Oh, and then there was this: the code name was apparently IMO Cat, which you named after your cat.
25:50
Okay, that's not really... so I think I tweeted about it at some point, right? Go look up the tweet. The IMO cat was basically... okay, it's not an official codename or something. It's just that the name of the config for the job was imo cat. That's it.
25:57
You just need some kind of name.
26:15
Yeah, I mean, you know, I just like cats. Yeah, yeah, fair enough.
26:17
That is mostly it on the IMO, unless you want to bring up anything else. We have other sort of researchy topics. But before I go into those, I did want to leave the floor open: what else should people know about the reasoning effort that's going on at GDM?
26:20
Let me think of where to start, please. What do people need to know? So it's really good. Yeah, that's what people need.
26:36
Maybe an easy one to start with: a lot of people were focusing on academic benchmarks two years ago; last year, maybe LM Arena; this year, Pokémon. Pokémon is a very interesting reasoning, visual reasoning, and general long-horizon agent-planning benchmark. And, I don't know, you seem to focus on it a lot, and I think Gemini did it very well. So obviously it's something that is easy to talk about.
26:43
I should probably say this: there's actually nothing specifically done for Pokémon, of course. Yeah, of course there's nothing specifically done for Pokémon. And I think Logan had this tweet recently about the recent Gemini 3 run on Pokémon Crystal, and Pokémon Crystal is like...
27:08
There was so much.
27:23
I think Pokémon is... so I used to play a lot of Pokémon and I'm a big Pokémon fan in general. And like you say, it's a great long-horizon benchmark and stuff like that. And I think it's good to check in once in a while on these benchmarks that almost never get contaminated, where people don't actually spend time hill climbing. It's kind of silly: people ask, okay, what are you working on? Oh, I'm working on AIME, I'm working on HLE, I'm working on Pokémon-maxxing or something like that. That's kind of funny.
27:24
We did interview the guy behind the Claude-based Pokémon AI; I think his name is David. And it showed serious flaws in Anthropic's screen understanding, vision capabilities. It literally couldn't tell: I'm trying to get past this wall, but I just keep running into it. It doesn't know the wall is there, so it doesn't have any spatial reasoning at all.
27:55
I mean some of it could be like a harness, like the harness thing. Or also whether the model has access to game state information or is it completely visual.
28:18
Yeah, Claude's implementation is very game state heavy. They dumped effectively all the memory of what's going on in the emulator.
28:27
Yeah, I see. I don't know whether I'm jumping off on a tangent, but I think solving Pokémon is going to become more about how fast you solve it. And the thing that I have not really seen so far is whether the model can complete the Pokédex.
28:35
Why is that more challenging?
28:49
No, completing the Pokédex is so hard. You need to plan, you need to search up information. There are some things where, if you don't go online... basically you need a little bit of deep research in this; the model will just never know that it needs to trade. If it's able to go online, it could post on forums and find someone to, hey, can I trade with you? Some Pokémon need to be traded to evolve.
28:51
Yes. Or mated.
29:13
Yeah. But anyway, I don't know. I have not seen a model be able to complete the Pokédex, and completing it is actually really hard for a model. So I think that's actually an interesting one.
29:15
Yeah. I wonder what the real-world analogy would be. If, let's say, we have a model that is capable of doing that, what can we make it do that we cannot do today?
29:28
There's a lot of planning involved, real deep-research-like planning. The Pokémon game itself is very linear, right? The Pokédex involves a lot of backtracking and research, a lot of research. So it's probably a different nature of task.
29:37
Is that as interesting to you as, for example, a lot of other people in the AI for science world are trying to discover things that you cannot look up? Right.
29:56
Novel knowledge.
30:04
Yeah, novel knowledge. Because basically what you're saying is we're not even there yet. We're at the place where models cannot consistently apply knowledge that they look up, right? Like, you give Gemini access to web search and you say, okay, try to collect all the Pokémon in the Pokédex. You don't have high confidence that it will do it. I don't know if someone's actually tried. Probably not, right?
30:05
No. I think the hard part is actually trying to synthesize the web knowledge and then apply it in the game itself, with all that visual state going on and stuff like that. It probably will be solved at some point. It's not...
30:25
It is challenging.
30:37
It's challenging, but it's not, like, super interesting.
30:38
You're basically just... the task really is: can you look up the guide, and then can you apply the guide? That's it. You know what's even more intelligent than that? Creating the guide. Being the first to figure out how to create the guide. Which is what?
30:40
Oh, yeah, yeah. But when it comes to this, it's mostly an exhaustive search thing. The model just tries and tries, like the humans who made the guide did. Yeah.
30:57
Okay. So that's actually less interesting to you. Interesting.
31:03
Okay. Actually, okay, when you think about it, it's not super, super interesting, but I have not seen a model try to do this.
31:05
Yeah. I think efficient search of novel idea space is interesting. Obviously you can brute-force anything. But we're not talking about brute-forcing; we're talking about trying to create an AI scientist.
31:12
But novel knowledge is actually an interesting thing that I think is going to be quite a big thing: being able to generate novel knowledge.
31:23
Google has done stuff there, though you're probably not that close to those teams that have done the AI scientist work.
31:30
There are experiments like, for example: if you freeze the model's knowledge at 2015, do you freeze time at 2015? Even with a current model, let's say, okay, let's assume there's no leaking of information. If you ask the model what's the best ML method, okay, not 2015, say 2012 or something, it will just tell you that SVMs are the best, right? That's just how machine learning worked back then. And the question is, can it invent the Transformer? It might not be able to. Even today's models might not be able to invent the Transformer if you freeze time at a certain point, even if you bring the text. I mean, the model is a Transformer, so I'm just saying, assume there's no leak issue.
31:36
No, totally possible.
32:13
So I think there's still a lot of open questions about whether the model can really innovate and generate really novel knowledge.
32:15
Yeah, one related question on that, which I think relates to Denny's lecture: I think people have this sort of mysticism about what reasoning is, when, if you really demystify it a lot, it's whatever happens inside the chain-of-thought tags, right? And you're eliciting that reasoning behavior from stuff that is already latent inside the pre-training corpus. That's one version of the interpretation.
32:23
I think these days reasoning itself is very vague and very open, so different people will have different definitions of what reasoning is.
32:52
Right.
32:59
So I agree that when people think of reasoning, they associate it with chain of thought, and obviously it's what happens in the thinking, and okay, that's reasoning, right? But these days it's more like I said earlier: reasoning and RL together. Basically it's anything in post-training that elicits capabilities, RL and post-training to elicit capabilities. So the working technical definition of reasoning is making models better with thinking and post-training. Okay, yeah. So basically RL-ing the model to think better, where thinking is thinking traces, thought trajectories, and stuff like that. There's also this line of work on latent thinking, like whether latent thinking and discrete-token thinking are going to be the same thing or something like that. It's an open question. Meaning adding...
33:00
Extra tokens to your vocab that represent.
33:49
I forget the names... these academic papers that do looping or pause tokens or, basically, instead of decoding discrete tokens, you simulate it in latent space, right? So when you do chain-of-thought thinking and reasoning, you decode extra tokens, hide them in the thinking tag, and then you decode stuff. But latent thinking is basically: you just don't decode tokens; you do the thinking in latent space.
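A rough sketch of the contrast being described, with toy stand-in modules (a GRU in place of a transformer stack; all names here are illustrative, not any specific paper's method): chain of thought decodes extra discrete tokens, while latent thinking feeds the hidden state back without decoding.

```python
import torch
import torch.nn as nn

vocab, d = 100, 16
embed = nn.Embedding(vocab, d)
unembed = nn.Linear(d, vocab)
backbone = nn.GRU(d, d, batch_first=True)  # toy stand-in for a transformer stack

def discrete_cot_step(token_ids):
    # Chain of thought: decode one more discrete "thinking" token and
    # append it to the sequence (hidden inside the thinking tag).
    h, _ = backbone(embed(token_ids))
    next_token = unembed(h[:, -1]).argmax(-1, keepdim=True)
    return torch.cat([token_ids, next_token], dim=1)

def latent_thinking(token_ids, n_loops=4):
    # Latent thinking: never decode; loop the last hidden state back in
    # as if it were an input embedding, for a few extra passes.
    x = embed(token_ids)
    for _ in range(n_loops):
        h, _ = backbone(x)
        x = torch.cat([x, h[:, -1:]], dim=1)  # feed the thought vector back
    return x

ids = torch.randint(0, vocab, (1, 5))
print(discrete_cot_step(ids).shape)  # (1, 6): one extra discrete token
print(latent_thinking(ids).shape)    # (1, 9, 16): four continuous "thoughts"
```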
33:52
It might start speaking the native language of thinking, which is numbers, not passing it through some filter of English. And sometimes models start thinking in Chinese or something else.
34:17
Yeah, generally I don't really believe that model thoughts have to be the same as human thoughts. Generally in ML, I'm more of the school of thought of: let the model do whatever it wants.
34:27
In general... there was a discussion, there's the Platonic Representation Hypothesis paper, which I think you'd be sympathetic to if you haven't already read it. To me it seems obvious: an image model's idea of what a laptop is and a text model's will converge on the same latent, and obviously you can align them and do all that stuff with them. So it totally makes sense that their concept would be just a vector of numbers that represents laptop; that's the concept. Yeah. And okay, maybe you have some numerical differences between one model's idea of a laptop and another's, but it mostly would be the same. Yeah, very interesting. The question I was leading into was: now we are in this age where LLM text is in the corpus of stuff that we train on, so it's a little bit of a recursive loop, right? The reasoning tokens are out there now. And so pre-trained base models are also capable of reasoning, and increasingly so as more and more reasoning text goes into the corpus. Isn't that interesting, or is that worrying?
34:36
Do you actually see many reasoning traces on the Internet? I've never seen those, though.
35:42
I would say that on Hugging Face, yeah, people are publishing that. As to whether researchers are actually including it in their training corpora, who knows, right? That's their choice. But I would say the percentage of Common Crawl that has CoT tokens in it went from 0 to 0.001%, and it will just go up over time because people are publishing it.
35:47
Yeah, but I think if the sources are quite clear, you can actually filter those away, because usually people put it on GitHub or put it on...
36:13
Do you want to filter? Maybe you don't.
36:19
There's a choice there.
36:21
Yeah, quite literally. The whole reason why... I don't think we covered this in our previous episode, but two years ago a lot of people were like, oh, you just include more coding tokens in your pre-training corpus.
36:23
But the coding tokens are different from, like, reasoning tokens.
36:34
It generalizes outside of code for reasoning.
36:37
Oh, that was like.
36:40
You don't believe that?
36:41
No, no, no. I think. I don't know if it's still true today, but I see.
36:42
Yeah, that was just our general coverage of reasoning. I would say there's a lot of interesting work here and more to do. Maybe I'll cover one thing which I know you have personal input on, which is that you have started using AI coding.
36:46
Oh yeah. So I didn't really use much AI coding in the past, but I think we've reached a point where AI coding has started to become really useful. Okay, so before AI coding, the thing that I found the most useful about these models in general is when I have these big spreadsheets of a lot of results and I just need plots of them. I can just go: screenshot, make me a plot of this. I hate making this matplotlib stuff; it's so annoying. Okay, that's basically the one thing I can remember about how I used AI in the past. But AI coding has gotten to the point where I run a job, I get a bug, I almost don't look at the bug; I paste it into Antigravity and tell it to fix the bug for me, and then I relaunch the job. Beyond vibe coding, it's more like vibe training, vibe ML or something like that. I would say it does pretty well most of the time. And there are classes of problems that I know it's just really good at, in fact probably better than me, where I would otherwise have to spend 20 minutes to figure out the issue, fix the thing, and relaunch.
36:59
So yeah, this is very interesting, because I would say level-one vibe coding is: you actually know what to do, you're just too lazy. It's just, ah, just do it for me; I've done this a thousand times, just go fix it, I know exactly what to do. Here, you're saying the next level, where you actually don't even know; it's investigating for you. As long as the answer looks right, you just ship it.
38:06
At the start I did check it and look at everything, and then at some point I was like, okay, maybe the model is better at this than me, so I'm just going to let it do its stuff, and I relaunch the job based on the fix the model gave me. And I think the models will just keep getting better and better. I also think, recently there's Antigravity... these tools were not really a thing in Google infrastructure before, and I'm not that familiar with what's available outside; when I was at the startup, the models were not so good one and a half years ago. So it was also a forcing function: people were like, oh, try Antigravity, it's a game changer and stuff. Okay. So I just started using it, yeah.
38:27
Yeah, you spent some time with Varun recently.
39:09
Oh no, I just said hi and greeted him. Okay.
39:12
I guess you were telling me you're an AI researcher that doesn't even use much AI, and now you're actually AI-pilled as a user.
39:15
There were so many moments this year where AI suddenly crossed that emergent threshold. AI coding is one of them, which we just discussed. I think Nano Banana also got to the point where... usually if you make these images, it's just for fun; you troll your friend or something like that. But Nano Banana actually got so good that you can use it for real use cases, basically. So it's getting really good. And even things like... in the past, these LLMs would hallucinate things a lot, but now I just trust them almost automatically. I think people are just enjoying the utility brought by these models. So now... I mean, I was always AI-pilled. AI is a good thing; I don't see how anybody can disagree with that.
39:22
Yeah, but you are actually using it for things that you have high expertise in, which is your own ML work. Yeah. And just to check: do you have a special version of Gemini that you use internally that we don't have access to, or is it the public Gemini?
40:05
I think it's the public Gemini.
40:21
Okay.
40:22
Yeah.
40:22
I was just saying, it would be entirely reasonable to train an internal Gemini on only your code base and your work.
40:23
Oh. Actually, I'm not sure, though. These things are just abstracted away from me.
40:30
But obviously, if it improves your productivity by, I don't know, 10%, it's worth it, right? So I think that's interesting. And there are the interesting questions of levels: how much do you trust it, how much of your job do you automate away and no longer need? There's also the question, I guess, about how people come up and train in the field if you no longer need juniors, because Gemini is your junior ML researcher. So I think these are all interesting questions.
40:35
I want to say one quick thing first, right? When it comes to whether a model can be like a junior SWE or something like that: think of it this way. It's not that the job of one SWE gets replaced by the model one-for-one. But let's say you are a manager; the metric you track is your time. You can have a model that saves you the same amount of time as the work that your reports do, without actually replacing one person per se, but you...
41:02
Yeah, a little bit from everybody.
41:30
Yeah.
41:32
Right.
41:32
Then I definitely agree that, okay, when you count the net time saved, there are times when the model can fix bugs that would have cost me, like, one day, and one day is huge. I don't know whether anybody has done any real metric evaluation of these things, but use time as the real metric, not the number of people replaced; okay, maybe hours saved, that kind of metric, right? These things are not going to replace one person as is; they're more like a passive aura that buffs everybody, in game terms. Right. Yeah.
41:33
I often think of myself as a bard, because I tell stories and I buff everybody around me. That's an ideal situation for me in a D&D group.
42:03
Okay. I don't play D&D, but okay. They're the kings of support.
42:11
Hero support spot.
42:17
Yes.
42:18
Yeah. Okay. AI support, I think, is very encouraging. Where is it still not working for you? What have you tried where you're like, oh man, I expected it to be better?
42:19
Oh. There are times when models get lazy and try to fix something in a... they still get lazy, and then they try to gaslight me into thinking the bug is fixed. So there are still classes of problems... there are classes of problems that are very easy for the model and very hard for humans, and some things that are very easy for humans and very hard for the model, Moravec's paradox and stuff. So it's still very hard to characterize these things into proper quadrants and stuff like that. So I would say the capabilities of models these days are good enough to be really helpful, but there are still some gaps. But I don't think there's anything to be done to specifically focus-fire on these things; it's more like general capability improvements. The models just get better over time, and then these things will just go away.
42:29
You say that, but... okay, so yes, in the grand scheme of things, just trust the process, keep scaling in every dimension, and things will fall away, things will emerge. But you've also said in the past, I can't remember the exact tweet, that each additional data set compounds over time; they're just small additions. And I would say that when you say things like focus-fire on things that are easy for humans, hard for machines, those are easy wins where you can just add a data set that focus-fires on that. And isn't hill climbing just a sequence of doing that until you reach AGI?
43:15
Okay, so I get your point. It's true that a lot of progress on the whole is just a series of small incremental changes that push things forward. I think that's accurate; that's true. It also feels like there are a lot of small, seemingly minor things, for lack of a better word, that pushed AI to the stage where it is today. So I definitely agree; nothing against people who focus-fire. It's just that it might not be easy to focus-fire on things that are not easy to characterize. It's fine when there's something targeted: okay, I want to improve this capability, add some data. Defining the evals is defining the problems, characterizing them, and if it can be characterized, then okay, fine. But what I was trying to say with the coding stuff is that some of these classes of problems are not even... I don't work on coding, but people who work on coding probably have terminology for different types of failures. So maybe somewhere somebody is focus-firing on this and making the model better. That's great for everybody.
43:52
Yeah. I mean that's why it takes a thousand people to get all these things together.
44:54
AI is definitely like a big collective effort these days. It's a big machine. Yes.
44:57
It's really crazy. Okay. So I just wanted to broaden out to general things people are talking about in the community on research, which, again, I know that you are very locked in, so you haven't necessarily read all the papers or anything, but we can just riff on ideas. You can obviously ask me what I think as well. Is attention all you need?
45:01
So attention and the Transformer are a core idea of recent times. Pre-training and scale are what make attention and Transformers actually shine, because without them... I think the first Transformer paper was a machine translation thing, and then GPT and BERT were the ones that actually showed the full potential of the idea. So in terms of whether attention is really, really all we need: probably no. From an architectural point of view, maybe also no, it's not all you need. But you need it, definitely.
45:22
What else are you thinking about on that same level? Are you talking about MoE stuff? What do you mean when you say it's not all you need?
45:59
You definitely need the scale of pre-training. You need all the tokens. You need RL. I think when people...
46:05
Say attention is all you need, it's mostly...
46:11
From an architecture point of view.
46:13
But will Transformers get us all the way to AGI? I guess that's the...
46:14
So basically, whether they get us to AGI, that's the question.
46:18
Will it still be a Noam-style architecture, or something meaningfully different?
46:20
It'll be a Transformer. I think it really depends on what you call it. But unless the paradigm shifts completely, which, as a scientist, you can never completely rule out, you can never say this will never happen, my feeling is: it's been, what, eight years since the Transformer in 2017, and we have not replaced self-attention in some form. You could rename it, you could call it something else, you can do local-global sliding window, but it's still a Transformer in the end. And I think that's not going anywhere, unless the whole thing with backprop, everything, changes completely. Then there's a different story, a different conversation to have. But if it's still within the same scope and bounds... I spent a lot of time thinking about architectures and whether there are alternative architectures and stuff. At the sequence processing level, it's kind of the ultimate.
46:25
Yeah, it's a sequence-to-sequence Transformer.
47:24
Self-attention is probably... there was this whole big era, which I was also involved in, where people tried to undermine attention as much as possible. They tried to remove it, simplify it, make it efficient, this whole efficient-attention era. At the end of the day, the outcome was always: oh, we removed all the attention, but we have one layer of self-attention left and it still works. That's always the end of the story.
47:26
Which even Noam Shazeer, quite a character, has published some stuff on: he has some ratio of mixing local and global attention, right? Basically still attention, but modifying it quite a lot.
47:48
I would consider local and global attention to be still attention.
48:01
Just like how much you're skipping.
48:06
Yeah. The only question is, if the formulation changes too much, your QKV becomes like ABCDEFG or something like that, or some...
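Since, as Yi says, local and global attention differ only in what each token is allowed to look at, here is an illustrative Python construction of the two masks, plus a hypothetical interleaving ratio (three local layers per global layer); a sketch of the idea, not any particular model's configuration.

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # Global attention: each token attends to every earlier token.
    return torch.ones(n, n).tril().bool()

def sliding_window_mask(n: int, window: int) -> torch.Tensor:
    # Local attention: each token attends only to the last `window` tokens.
    i = torch.arange(n).unsqueeze(1)   # query positions
    j = torch.arange(n).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

# Hypothetical interleaving ratio: 3 local layers for every global layer.
masks = [sliding_window_mask(8, window=4) if layer % 4 != 3 else causal_mask(8)
         for layer in range(8)]
print(masks[0].int())  # banded lower triangle (local layer)
print(masks[3].int())  # full lower triangle (global layer)
```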
48:07
Maybe I'll give you some motivating constraints to push on this. You guys are still charging 2x for context beyond roughly 200k tokens, something like that, and the theoretical max is 2 million tokens, right? What if we need 200 million? Is there some point at which even this concept of input token context is irrelevant, because you are doing continual learning, that kind of stuff? Right now you're modeling it as: AGI will be achieved through a sequence-to-sequence transformation, and attention is the best sequence-to-sequence architecture, therefore attention is all you need. But I think other people are like: sequence-to-sequence doesn't accurately capture intelligence.
48:15
But that's not really about sequence-to-sequence. It's more about the whole gradient design and backpropping, right? It's not the architecture itself; that's more the learning paradigm. I think the architecture is basically the interface between the learning algorithm and the tokens, so it's more about the learning algorithm itself, and this continual learning thing. There are many ways to think about processing insanely large contexts, like 200 million or 1 billion tokens, right? Maybe you have a new learning algorithm where every time you run inference, you learn on it; then you can technically have some kind of memory, like how a human being is learning as I'm talking to you, right? That's one way. The other way is, okay, maybe somebody will say attention is just too expensive for 200 million or 1 billion context, so we need a new architecture. Or some people will say, oh, we just improve the chips, the accelerators. So there are many ways to interpret it. But if it's about continual learning and stuff, there are a lot of fundamental things about the learning algorithm as well.
48:59
That will have to change.
50:09
I think the learning paradigm and the architecture go hand in hand. And as the field progresses, ideas just stack on top of one another. So there's also this thing where an idea that's proposed has to be compatible with all the work that's been done before in order to shine. It's a variant of the hardware lottery by Sara Hooker. Not the hardware lottery thing I wrote about GPUs failing, but the original hardware lottery, except here it's a lottery where the things proposed have to play well with the things proposed before. So it's a bit like going down a local minimum to some extent. Now we are in this local minimum of Transformers, everything. Maybe it's not easy to get totally out of it, because so many people's investment and optimization have been done; new things need to play well with the ideas before. And the way I see it now, it's very difficult to come out of it.
50:10
Okay, I'm not entirely convinced. I see what you're saying, but let's call it gen AI, I fucking hate that term, it's still a very young field. Yes, there have been eight years of work on the Transformer, but what's that in the grand scheme of things? Maybe we're in a local minimum and we've got to nudge ourselves out of it. I do want to leave that open-ended; I don't have an answer. I do think people are in what Ilya Sutskever has been calling the age of research.
51:02
Right?
51:28
We're like, okay, we scaled up what we can scale up. We know what the next one or two orders of magnitude look like on every dimension of scaling that we know about. But what is the next dimension to scale?
51:29
There's a bit of a misunderstanding that the last five years have just been scaling things up.
51:41
Okay, please tell me more. Yes, you made that joke about now we scale researcher salaries.
51:50
Okay, let's not go there. But I think ideas matter, and there have been a lot of good ideas in the last five years. It's not like you could take an MLP without self-attention, say, okay, I'm going to throw $100 trillion at this, and scale that thing up.
51:55
It's never going to work.
52:15
It's never going to work.
52:15
Yeah, yeah.
52:16
So there's a part of this where I think the bitter lesson gets used too much, too conveniently thrown around. There's also a sweet lesson, which is that ideas matter. And even today, people downplay ideas and stuff like that.
52:17
Do you think the rate of new ideas, without being specific about which ideas, because obviously you can't share, has increased or decreased? Because there's kind of a law of diminishing returns, or at least smaller returns.
52:35
I think the number of ideas is always proportional to the number of researchers working on a problem, so by definition it should increase. But the number of ideas that actually work is not decreasing either; we're not in the era of diminishing returns yet. So ideas are still very important, and there are still very good, game-changing ideas being invented.
52:45
And I think I know the answer to this, but is the closed-lab advantage increasing or decreasing versus open source? The Chinese labs keep publishing open-source models, and some of the American labs do as well. Nvidia has Nemotron, OpenAI has GPT-OSS. These are all basically checkpoints on what is publicly known about training models as of this year.
53:07
Oh, okay, okay.
53:38
It's declassified information because everyone.
53:40
Yeah, okay.
53:41
Yeah, everyone does this.
53:42
I think that the gap is increasing.
53:44
I don't think it's completely predictable from the stuff you've said before.
53:46
I think the gap is definitely increasing.
53:49
Yeah, I think that justifies researchers. Otherwise, what's the point of having researchers if not finding new tricks that compound over time?
53:51
But definitely I think this is increasing.
54:03
Okay, I'll do a side tangent; I don't know if you have any comments on this. This is related to Nvidia's recent purchase of Groq, which I don't know if you have views on, because you're very TPU-centric. But are we memory-bound or compute-bound? This is relevant to the Transformers discussion in terms of serving, exactly. The classic view is that we're compute-bound, because we just need more compute for pretraining and RL and then inference. But the counterargument I would make is that I have these charts of Moore's laws, I wish I could pull them up easily: the scaling of compute versus the scaling of memory versus the scaling of network bandwidth. And compute has a much higher scaling slope than the other two.
54:05
Memory? Honestly, I don't think about being memory-bound that much. So I would disagree with it, but I don't have high confidence in that.
54:52
And because you're mostly on the research side, less on the inference side.
55:04
Yeah, maybe the inference guys would say otherwise. I don't wake up and think about serving, so maybe I don't think about inference that much.
55:06
My previous line of discussion here was that Nvidia was very foresighted buying Mellanox, because networking is actually the real bottleneck in scaling, since it has the lowest Moore's law slope. And the second one now is memory, which is very interesting.
55:15
Okay, okay. But honestly, I don't think about that much. Yeah, I understand.
55:29
Okay. Data efficiency. So this is a joke, but implicit in it is that there's some kind of maximum data exposure, right? Previously I would have said a lot of training paradigms were "one epoch is all you need", title of this idea. I would say the real number is maybe between three and four epochs. And I do wonder what the theoretical limit of a model's data efficiency, in terms of training and compression, should be.
55:34
I don't know what that means, data efficiency, but you're asking, in a way, how many repeats are tolerable?
56:08
Whether it's tolerable is contingent on whether it actually improves anything meaningful. It's not that you want to do it for its own sake. But there's that, and then there's also the sheer amount of stuff we could learn from limited data. So let's say you're not compute-bound, you're not memory-bound, but you are data-bound. Last time we were on the podcast we talked about Chinchilla-optimal versus inference-optimal training. But now a lot of people are even talking about data-optimal training: given a limited dataset, how well can you learn from it? I think that's an interesting research direction not enough people are talking about. Maybe it's commonplace inside the labs, but it seems very clear that we are very unoptimized with regard to how much we learn from our data. I'll just put it there.
56:12
I think in general, extracting more from every data point is definitely valuable, and that's also related to the fact that we are running out of tokens in the world. But I don't work on data for pretraining, so the things I say are about the general state of the industry, not any lab specifically. And I don't even know whether the way data is done has diverged too much across these labs.
57:06
And none of it is open. There's a lot of cross-pollination, for sure.
57:38
Cross-pollination, okay. Yeah. But I don't think about data that much, the pretraining data that much.
57:41
Maybe in the first half of this year I would have said that pretraining is dead and everyone's just funneling all their work towards RL. And we had this Grok chart, which is very interesting, where they're spending the same amount of compute on RL as on pretraining. You think it's a psyop?
57:46
No, I don't know. I have no idea.
58:04
Yeah, I think people are taking it seriously. Especially the agent labs like Cognition and Cursor: they're taking open-source models from whoever and then adding, let's call it, pretraining-scale RL on top, if they have that level of infra and data, which they do. Which is very interesting, I would say. And this data-efficiency argument: to me it's also about trying to discover new paradigms of learning in order to get where we all want to go. And the existence proof is humans, right? Your two-year-old daughter is much more capable than an LLM at some things, having seen something like eight orders of magnitude less data. That's very interesting.
58:05
Yeah, comparing human learning and machine learning is...
58:54
Definitely. Purely as an existence proof that we could probably do better. Three examples of a dog, a fourth example of an unidentified animal: as a human, I can probably tell it's a dog. But machines, classically, you take 20...
58:56
The data efficiency of humans is definitely way higher than models', yeah. The only question is where it comes from. Is it actually putting more flops on every token? Or maybe it's back to the question of whether the Transformer is the optimal architecture. Maybe it's backprop. Maybe it's the off-policy-ness. Where is the bug, right?
59:09
Exactly.
59:30
But maybe it's a feature, not a bug. I don't know.
59:31
Okay, we've identified it. It took me a while to get this across, but this is the kind of data efficiency I'm talking about, and I think it's emerging as a theme. Basically at the end of every year I try to take bets on what the big themes for next year will be, and I think this is one people will really focus on, because you're feeling this data crunch even though everyone's still investing in data. I forgot to mention that I've been wrong about pretraining being dead. Yes. I've now met pretraining leads from both Anthropic and OpenAI, and I've seen the talk from the DeepMind guy recently. Everyone's still investing in pretraining, which is nice to see.
59:33
Nobody said pretraining was dead. I know.
1:00:10
No, it's a theory we're trying to prove or disprove anyway. So, okay, let me wind back to my general idea. Data efficiency seems worthwhile. You would treat it as: show me where the bug is and I'll go fix it. We don't know where the bug is; we just have an existence proof that it could be better. And the final link in this logical chain, for me, is that everyone is focusing on some idea of a world model as a version of this more efficient learning, which potentially might not take the form of a sequence-to-sequence Transformer. I don't know how that works; I'm definitely a little out of my depth here. To me, that is more efficient because every world must be internally consistent, and if the next piece of evidence comes in and invalidates some worlds, you no longer need to pursue those paths ever. You can just narrow in on the world you've identified. So to me, that is learning where you're learning to fit world models. Yes. Maybe you can treat the learning process as curve fitting.
1:00:13
Yeah. So you're learning the world, or rather learning the world model. Right. Okay. By sampling multiple world models and then finding out which one fits the data best?
1:01:18
So I guess my query is: this is what people talk about. And obviously feel free to attack it, because I'm just spitballing; this is what I pick up from talking with multiple people. What are you talking about with world models? What are you talking about with data efficiency and learning efficiency? And how do you gel it all together into a cohesive sense of the future?
1:01:28
What's the definition of a world model, from the start?
1:01:49
Yeah. There are three kinds of.
1:01:51
Okay, okay, go on. Yeah.
1:01:53
The first kind is the Veo kind, or, what's the other one, Genie, that DeepMind has, which is the sort of video world model. You model everything with some kind of Gaussian splats or whatever, and you inhabit that 3D space.
1:01:54
Yeah.
1:02:06
Second, let's call it the Yann LeCun, Meta school of thought, which I don't know if you're that familiar with. He has published the JEPA architecture, and separately FAIR has also published the code world models, which are specifically for code. Very interesting: you are executing code and modeling the internal state of the execution environment as you go, line by line.
1:02:07
Okay.
1:02:30
The LLM actually learns to predict those things, and it seems a lot more efficient at the scales they've tested, which is very cool.
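For readers unfamiliar with the code-world-model idea being described here, a minimal sketch of what such training data could look like: run a toy program and snapshot the interpreter's local variables line by line. This is purely illustrative of the concept, not Meta's actual CWM pipeline; the tracing helper and toy program are assumptions.

```python
# Sketch of the "code world model" idea: pair executed lines with interpreter
# state, producing (line, state) examples an LM could learn to predict.
import sys

def trace_states(fn):
    """Run fn, snapshotting locals as each line is reached; each snapshot
    reflects the state produced by the previously executed line."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line":
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        fn()
    finally:
        sys.settrace(None)
    return trace

def program():
    x = 3
    y = x * 2
    z = y + 1

for lineno, state in trace_states(program):
    print(lineno, state)
# A code world model is trained to predict each state from the source and the
# states seen so far, instead of only predicting the next source token.
```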
1:02:30
Which definition are you anchoring on?
1:02:37
The third one. The code one, the...
1:02:39
The code world model? That's the second one.
1:02:41
Those two are bundled together for me. The JEPA fans are probably hating me right now because I'm lumping all of Meta's work under one school of thought, but whatever.
1:02:43
Okay.
1:02:52
The first one is Veo and Genie, like spatial intelligence, those kinds of video-based world models. The second one is execution, some sort of explicit modeling of state as you run through the corpus. And then the third one is this amorphous thing I think people are trying to get to, which is what I said about resolving possible worlds and curve fitting as you learn, as you inference.
1:02:53
Yeah, but what is the world model itself? Is it like.
1:03:21
It is a mental model of where everything is and how you think the world works. What I think, what you think, everything.
1:03:25
But technically it's like it is something.
1:03:33
In the latent space.
1:03:35
Okay, okay. So for simplicity, it could just be a Transformer model in between.
1:03:36
So to me that is the most coherent thing to the current paradigm, which is you could actually do this in current Transformers. I think the way that you train it will probably have to be different.
1:03:42
Okay, I see, I see.
1:03:52
I don't have any conclusion here. I'm just throwing it out as something where I know you're interested in this kind of stuff and I don't have that many knowledgeable people to talk to about it.
1:03:54
No, I don't think about world models that often, because world models are just not really well defined in the first place.
1:04:03
So not necessarily world models. But the problem is learning efficiency, and maybe accuracy or AGI capability that isn't easily unlocked on our current path of scaling.
1:04:08
Yeah, I think when it comes to data efficiency, a lot of it is about finding ways to spend more flops per token. Because basically, if you are data-bound, you want higher data efficiency so you can learn more from every data point; you squeeze more out. So anything that can extract more, that can use more flops on every token, is definitely a form of data efficiency. Then there's the learning algorithm, because there's a different scaling law for humans versus machines versus dogs versus cats; there's that famous Ilya chart, right? And those two points are not entirely different things, because a better architecture might actually just be spending more flops per token. So if you get to a point where you're very data-bound but not compute-bound at all, you just find algorithms that spend a lot of compute on every token. The overarching point is that data efficiency is a learning-algorithm thing, and the question is whether the correct fix is just applying more flops per token to squeeze more out of every data point. Also, when you say humans are exposed to less or more data, it's very ambiguous, because they're technically on 24/7 and have a lot of different types of inputs, mostly visual. And whether they spend more flops on everything they take in is also a question; somebody would need to count how many flops the brain uses. Maybe they only seem data-efficient; maybe they're just spending more compute on every token. And maybe the learning algorithm is different. But I agree that data efficiency is very important, given that there's a limited amount of data in the world.
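A back-of-envelope on the "more flops per token" point, using the common training-compute approximation C ≈ 6·N·D (N parameters, D training tokens) from the scaling-law literature. The token budget and model sizes below are illustrative assumptions, not anyone's actual training run.

```python
# If you are data-bound (D fixed), the remaining dial is FLOPs per token,
# which under C ~ 6*N*D is set by model size N.
def flops_per_token(n_params: float) -> float:
    return 6 * n_params  # forward + backward pass, per training token

budget_tokens = 10e12  # suppose you are data-bound at 10T tokens

for n_params in [7e9, 70e9, 700e9]:
    total = flops_per_token(n_params) * budget_tokens
    print(f"{n_params / 1e9:5.0f}B params -> "
          f"{flops_per_token(n_params):.1e} FLOPs/token, "
          f"{total:.1e} total FLOPs")

# With D fixed, a bigger model (or more epochs, or heavier per-token compute
# folded into training) extracts more from the same data, which is one
# reading of "data-optimal" training.
```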
1:04:21
One more thing before we go into DSI. You know how we're talking about RL, and you're working on RL stuff: why are people paying so much for RL environments?
1:06:06
Wait, so who is paying for RL environments?
1:06:17
OpenAI and Anthropic, at least; nobody has said anything about DeepMind. A lot of the model labs that are not you are well known for paying at least seven figures to external startups to create RL environments for them to train in.
1:06:18
Okay.
1:06:34
And I think the question is, if your models are so good at coding, why don't you do it yourself? And so I think there's some amount of expertise that's being distilled from human experts into an RL environment that you can then let your agents run wild in. But I'm curious if there's any other deeper insight than that because I'm not satisfied with my own explanation.
1:06:35
RL environments that have a lot of domain expertise are probably very valuable. Though I don't know specifically which RL environments people are actually buying. What's the thing you're not satisfied by?
1:06:56
That it's so valuable. A lot of people are saying, look, it's a Next.js app inside a Docker container that logs stuff out when you send inputs in. You could probably do that yourself internally. Why pay so much for some startup you don't know to do it for you?
1:07:10
Actually, I have no clue about why this is happening. I have no clue.
1:07:26
And a classic example would be: if you want to build a computer-use agent for buying things in e-commerce, you would want RL environments that perfectly replicate maybe the top thousand e-commerce websites, and then you just do parallel rollouts on all of them. Does that seem meaningful?
1:07:30
I don't know.
1:07:49
All right, cool. DSI and LLM RecSys. A big bet for me this year, for my conference, was that we actually started focusing on LLM RecSys.
1:07:50
Actually, what's the motivation behind starting LLM RecSys?
1:07:59
I think RecSys is the king AI problem in consumer. It is the single most valuable thing. All your feeds; even search is RecSys.
1:08:03
Basically search. Basically retrieval is the God problem.
1:08:11
Right, because RecSys is ranking, but also filtering, personalization, re-indexing, performance. It is the God problem, and you get paid a lot for it. Engineers are not that excited by it, which is very weird: a lot of them don't work on RecSys and probably never will, and they don't see the monetary value that can come out of a good RecSys. The other two pieces of updates for me, which I actually didn't even know DSI directly tied into, were, one, Twitter publicly announced their feed algorithm is now an LLM.
1:08:16
In RecSys, LLMs are just used everywhere now. But whether it's actually a big LLM, or a generative-retrieval type of model, is another question, correct?
1:08:52
We don't know. All we know is that they've said they swapped out their current RecSys for an LLM-based one. But what is published is YouTube, where they actually adopted semantic IDs for YouTube's RecSys. And YouTube is obviously a big deal.
1:09:03
Is it like public information?
1:09:17
Yes.
1:09:18
Okay. Okay.
1:09:19
They came and talked about it with us, and they published a V2 this year as well; more info. So basically, last time you were on the podcast we didn't talk about DSI that much. But you actually have some background in IR. Do you care about IR?
1:09:19
No, I don't care about IR, but DSI, okay. DSI, or generative retrieval, was one of my favorite works of old. I do have some IR background: when I was doing my PhD I did some RecSys work, some retrieval work, so I have some IR and RecSys background. And generative retrieval and generative RecSys get very conflated. DSI started as a retrieval thing; we did Natural Questions, ranking over an index of documents, everything. We actually did an interview with Yannic about it, me and Don; the paper came out a long time ago. At that time we wanted to reimagine retrieval and search. We were still using T5 models; we were not in the LLM era yet, it was pre-LLM, the "okay, pretraining works" era, and there were some pretrained models around. So we wanted to reimagine retrieval, right? And retrieval and RecSys are all the same formulation, a ranking and retrieval problem. That's where we started to imagine retrieval as one giant model that encodes everything in its memory. We tried so many different semantic ID ideas; actually my collaborator Vinh was the one that came up with them. And the start of this whole generative retrieval thing was literally just giving a document an identifier and brute-force predicting it. It actually works, because the models can memorize. If you look at the literature, going all the way back to things like doc2vec, it's very odd: the docids have no meaning, they're just IDs in the vocab, not even numbers, and technically the models have enough capacity to predict them. But semantic IDs were the idea that you have some semantic association and you break down the search space hierarchically. How this evolved into RecSys: after DSI came out, Ed Chi's group, Mahesh and others, did some exploration of applying DSI to RecSys, and that's how that generative recommender systems paper came out. He didn't even know he was involved.
1:09:34
That's crazy.
1:11:43
That was basically us transferring it over: okay, DSI works, let's try it on RecSys. The recommender systems people have a slightly different way of doing semantic IDs, just because the domain is slightly different. But after that, I think we were done with the invention part of this one.
1:11:44
The rest is details.
1:12:04
The rest are details. Over time I also left Google, and these things evolved a little on their own. I saw that Spotify, like YouTube, is also using this type of semantic ID, this type of DSI model. From the research community's point of view, the DSI work was the first to encode and decode semantic tokens. But the community is strange in a way: they'll do things like, oh, this is generative retrieval, it's not generative RecSys, these kinds of random distinctions, which is a bit strange. Anyway, that was the whole history of generative retrieval. Apparently there are also a lot of people working on it; I don't follow it at all now, it's not even on my mind. But once, even in the Singapore office, there were people actually working on generative retrieval. I don't know whether they're still working on it, but I met a person who tried to explain generative retrieval to me, which was quite funny, given that I kind of co-invented it. This whole IR thing was just an interesting phase, and I definitely think DSI is one of my more creative works, one that is not really LLM.
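For readers, here is a minimal sketch of the semantic-ID idea described above: hierarchically cluster document embeddings so every document gets a short token code, and retrieval becomes decoding that code from a query. The recursive k-means recipe, corpus size, and code depth are illustrative assumptions, not the DSI paper's exact procedure; the seq2seq decoding step is left as a comment.

```python
# Semantic IDs via recursive k-means: prefix tokens narrow the search space
# level by level, so a document's ID looks like (5, 2, 7).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 64))  # stand-in for real embeddings

def assign_semantic_ids(embs, k=8, depth=3):
    """Cluster recursively; ids[i] is document i's length-`depth` token code."""
    ids = np.zeros((len(embs), depth), dtype=int)
    groups = [(np.arange(len(embs)), 0)]
    while groups:
        idx, level = groups.pop()
        if level == depth or len(idx) <= 1:
            continue
        labels = KMeans(n_clusters=min(k, len(idx)), n_init=4,
                        random_state=0).fit_predict(embs[idx])
        ids[idx, level] = labels
        for c in range(labels.max() + 1):
            groups.append((idx[labels == c], level + 1))
    return ids

semantic_ids = assign_semantic_ids(doc_embeddings)
print(semantic_ids[:3])  # each row is one document's ID token sequence
# Retrieval then becomes generation: a seq2seq model maps query text to these
# ID tokens, and beam search over the token tree ranks documents directly.
```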
1:12:05
Under the general principle of applying ML to everything: if Googlers are working on generative retrieval, would that be for something like AI Overviews? Is that something similar?
1:13:14
I have no idea.
1:13:22
Okay. For people listening, I did have a talk on this track; I think you just type in "AI Engineer" and you'll get it, where the YouTube guy was talking about how they use Gemini for their RecSys. I don't know what size of Gemini, because he didn't talk about it. But this is public work now: basically every YouTube video uploaded gets encoded into some kind of codebook, and they retrain this every day on some kind of batch job.
1:13:23
Yeah.
1:13:52
Which is interesting. So yeah. I don't know if you even know what Gemini is being used for.
1:13:53
I don't follow these days.
1:13:58
I do think, in the sense of, for people who are still not getting it: applying the general intelligence of an LLM to the retrieval or recommendation task means you can accommodate such weird recommendations, such weird queries, that no classical system could ever handle. And I think it's also somewhat emergent, in the sense that when you were using T5 you just couldn't add that much value on top of a normal BM25 retrieval baseline. Would you say that's accurate? It's not just about paraphrasing; it's about understanding query intent.
1:14:01
BM25 is a really strong baseline. Actually, BM25 is a really strong baseline.
1:14:38
Yeah, sorry, I don't know your comparative delta of T5 versus BM25, but I don't expect it to be very high, and I expect it to be a lot higher for a true LLM-based RecSys, depending obviously on the query set.
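For context on why BM25 keeps coming up as the baseline to beat: the whole ranking function is a handful of term statistics with no training at all. A standard Okapi BM25 sketch on a toy corpus; the corpus and query are illustrative, and k1/b are the usual default parameters.

```python
# Okapi BM25: rank documents for a query from term frequency (tf), document
# frequency (df), and a length normalization, nothing learned.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "the quick brown fox".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))  # document frequency

def bm25(query, doc, k1=1.2, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "cat on mat".split()
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
print(ranked)  # doc 0 should rank first for this query
```

The catch, per the discussion above, is exactly what it cannot do: no paraphrase matching and no query-intent understanding, which is where LLM-based retrieval adds its value.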
1:14:43
I hadn't really thought about it this way before, but I've done modeling in many different domains, including search, in the search and IR community. There are benchmarks and things people hill-climb on; I don't even know what they're called these days. But generally, the modeling dynamics of IR and RecSys tasks are very different from standard language tasks or vision tasks, different from the way we hill-climb LMs. Back in the olden days, you'd train models, you'd change the architecture, you'd try to improve perplexity, you'd use SuperGLUE. Even now, when you train LLMs, you do zero-shot and few-shot evals. As a researcher-engineer you interact with that environment a lot; you're basically being RL-ed by the environment. But RecSys and IR have a very strange feeling to them. Strange in the sense that it feels like whatever works, works; like you're in a world where the gravity is different, where the modeling choices that feel intuitive are not intuitive. So I wrote some papers back in the day on RecSys and stuff, and every time I ran modeling experiments for it, I didn't enjoy it. It felt like the environment was rude. The vibes were just off.
1:15:01
What makes it rude?
1:16:31
It just feels transactional. No, not transactional; I don't know how to describe it. For example, if you play sports, like tennis or badminton, when you hit the ball there's a very nice feeling of hitting the sweet spot. When you do modeling in traditional domains, when you get the feedback back, everything feels right. But RecSys and IR is like badminton where you hit the shuttlecock and randomly hear glass shatter. You just feel this weird sense that the cause and effect...
1:16:33
Are too far apart.
1:17:02
It just feels strange. And then sometimes the metrics: RecSys uses all the NDCG-family metrics, and BM25 is strong, and you just end up worse than BM25. Back in the day, when you went from stacking 2 LSTMs to 3 LSTMs, you were like, whoa, I see life.
1:17:02
It's just a very unrewarding area to work in.
1:17:18
It's just weird. Also, the IR and retrieval community is always behind the mainstream, and now it's probably getting even worse because of LLMs and stuff. Okay, I'm getting into hot-take territory, but some of the conferences are just behind NeurIPS and ICML. Some conferences are just applying things that...
1:17:22
They're downstream of.
1:17:43
They're downstream. They're downstream.
1:17:44
Okay.
1:17:46
So it always feels very uninspiring to work on this.
1:17:46
Yeah, look, there's a reason that you left.
1:17:50
But it was a side quest. I worked on it as a side-quest thing.
1:17:51
Yeah, yeah, okay, I understand. I still think it's an important business problem, even though maybe it's an unrewarding field.
1:17:55
I kind of understand why: the academic benchmarks for those tasks are just so far detached from what industry does. I didn't work on any of this in industry, just from an academic point of view.
1:18:03
Oh, then all you need are online A/B tests, right? A/B test and...
1:18:16
Okay, that would have been a different experience.
1:18:20
Yeah, that's mostly our research-topics coverage. I think we're just going to end on a very simple one, on GDM Singapore. You organized a symposium here; we brought Jeff Dean, Quoc, and all the others. Basically, what's the general message, or the impetus, for starting GDM Singapore?
1:18:22
So, talking about the event first. Quoc and I are starting a team, and before I came back, we discussed this for some time. Jeff was very supportive of it; he was in the region many times, in Vietnam, in Singapore, around the time I was going to come back. The event itself was Quoc and Jeff visiting, and we wanted to inspire the community here. It was also a bit of a soft setting of the tone for the start of the Gemini team in Singapore. It's a very rare instance where you get somebody like Jeff and Quoc, who are true pioneers of AI in the world, in one room, and you were there as well. And I think, like many people told...
1:18:42
Me, a true pioneer of AI? I was there to live-tweet.
1:19:25
Yeah. Having them all in one room giving these talks: many people came up to me and said they were very inspired by their presence in the region. And starting a team, starting something, there's no one moment where you press the button and it starts, right? It's a process.
1:19:28
Right.
1:19:45
So we hire people, and people join one by one, something like that, right? This event was more, I would say, to set the vibe. I think it's possible for Singapore to be close to the frontier, and having the true pioneers of AI here makes that a more inspiring thing. It also let Quoc and Jeff meet the people here. Jeff was here last year, but Quoc hasn't been here for some time, and he's going to have a team here, so it was nice to bring him around to meet people. It was a really amazing event. We met Loong as well.
1:19:46
Who, a lot of people don't know, has a CS degree. He's one of the few PMs with a CS degree.
1:20:21
Yeah. I would say the context of the meeting was partially that he wanted to learn more about the IMO stuff.
1:20:26
Oh really?
1:20:36
And then also about Jeff.
1:20:36
Oh, because they invited you without knowing that these guys were coming or something like that.
1:20:38
Jeff and Quoc and me, we went to visit and chat with him at the Istana, and we discussed a little bit about Deep Think, a bit about the IMO. And then the rest of it was more Jeff and Lee Hsien talking; it became less about AI and more about very macroeconomic, political things, which I was very out of my element with. So I was just...
1:20:44
You were in a suit.
1:21:08
I was just talking about the Deep Think and IMO stuff. But he seemed genuinely quite surprised that AI has reached this point. It was also interesting, I would say.
1:21:09
For people listening: you have done something that is unique in Singapore's history so far. You're establishing a frontier research lab in Singapore, which is an accomplishment, I think. The other thing I'm still trying to wrap my head around is whether geography actually matters. You're all working on the same team; you have your London people, your Mountain View people, and mostly you're just collaborating with them anyway. You've collaborated with them your whole career. I don't even really know what countries mean anymore when it comes to research, or AI in general, because this thing is inherently international from the start.
1:21:20
This is a very good question. It's also related to the thing about identity, because you also move between SF and Singapore quite a bit. I was in Mountain View one or two weeks ago, and I'm here now. Aside from my family, almost everybody I talk to is somehow in the Bay Area, just because of work and everything. I think geography matters. Firstly, the most logistical thing is probably the time zone.
1:21:56
So you literally want the 24 hour coverage around the world.
1:22:26
There are advantages. What I'm saying is that, possibly, people define the location more than the location defines the people, somehow. Okay, we got to the time zone a little bit; then there are pros and cons. I think there are pros and cons, right?
1:22:30
So you're bullish on Asia, on Singapore people, the talent pool?
1:22:44
I think we've managed to find really amazing people. But I also have to say this type of thing is more like talent attracts talent. Most of the time, the vibe I get is that people are very excited because it's Quoc's team and my team, and we're working on very core things related to AGI. So the talent we can get from the region is really good, but it's only because it's us that we can unlock this talent; otherwise they might join some other place.
1:22:49
Yes. And move to the U.S. yeah.
1:23:22
On the identity side, I definitely agree with you: why does it matter? I think the advantage of Singapore, or really anywhere, is that connectivity is so good that you can interact as much as you want, and you can also just go there. But Singapore has this advantage where you can be close and you can be far. I have friends in London and New York who would just never move to the Bay Area. I'm not against the Bay Area; I think it's a great place. But it's just AI, AI, AI everywhere. Sometimes you want the mental space and energy for some other culture, and London, Singapore, New York have their own cultures. The Bay Area culture is AI. You go anywhere, you hear AI everywhere, even on the billboards.
1:23:25
It can be a bit much.
1:24:17
Yeah.
1:24:18
Although I did see some billboards down here. I was like, what, is this culture infecting Singapore?
1:24:18
I do think that, to some extent, if you want to do research, you need a little bit of peace and quiet somewhere.
1:24:26
Right.
1:24:32
So this island may be good for that, but then we're still able to be connected. Right. Okay. So I think that's mainly talent wise. I think people are strong here. Yeah.
1:24:32
So far enough away. But you're still connected, you have strong talent. What are you hiring for? You're still hiring, right?
1:24:46
We're hiring. My team will work on RL and reasoning for Gemini and Gemini Deep Think. We care more about talent density now, so we're not growing that big; small first, just because compute per capita is probably important. I personally get a lot of DMs, and generally there are a lot of very capable people. But what I'm looking for mainly is either a track record of RL research, or not necessarily RL, some exceptional achievement in coding competitions or some exceptional achievement somewhere. That's the kind of people we want.
1:24:52
Yeah, because you don't strictly require that. I do remember something from your Reka days where you said you like to train your own juniors from scratch, right? So they don't...
1:25:39
I forgot if I said that or not. But to some extent, we'd definitely be very happy with people who have very high stats even without much ML knowledge. No, stats as in stat points: WIS, INT, like, this high. People with just raw IQ, high raw talent, strong engineering skills. ML can be learned easily; AI knowledge can be learned easily.
1:25:47
Yeah. Maybe one version of this is: can it be done on a student budget? Can you still do something interesting on a student budget? Relevant to the point that conferences are quieter these days: I did an interview with one of the best-paper winners, who worked on thousand-layer neural networks for RL, and that was done on a student budget. It was a very cleanly executed paper with good findings. Look, I'm not sure production models will ever go to a thousand layers, but they stretched things in an interesting direction and found some good recommendations, and the guy immediately got hired by OpenAI. I think that's encouraging for the grad students in the market who are like, okay, do I need to know somebody who works at these labs in order to get in, my uncle works there, I get the internship, or whatever? No, actually, you can just do it on a student budget with good advisors.
1:26:16
Oh, actually, one interesting thing is that most of the people, I went to recruit them personally. You see their work and then you send them DMs, right? No, no, for hiring generally. So, to your point, you can almost just do good work, put it online, and somebody will contact you. It's actually super easy but super hard at the same time.
1:27:03
I can tell you, I talked to a few of these grad students: they don't know what good work means. Their professors have agendas they're forcing on them, which may not be right, because it's not like their professors know what to work on either. They just need guidance: hey, work on these five things, show me an interesting result in any of them.
1:27:29
Okay. So if somebody independently comes up with something you feel is very tasteful and aligns with what researchers in the labs like, you know that the function that produced this idea is good. Whereas if you just go and tell somebody to do this, you only get the signal that they can execute. So I think there's some value in...
1:27:49
People that demonstrate taste, yeah. Research taste. It's very interesting. I do care about this in some ways; the research-directions work that I do is a little bit of that. It's low accountability for me, because obviously it's just thought experiments. But for a lot of people, their career is bounded by: can you demonstrate research taste in the short three or four years that you have?
1:28:11
Yeah, there's so much competition, just because everybody wants to get into AI. Mostly it's about how you're going to prove yourself. It must be hard these days to be a grad student trying to prove yourself. It's definitely harder. Yeah.
1:28:38
Not your job. Okay. That was it. Do you have any other sort of rants or topics that you had queued up before we wrap?
1:28:56
I don't. Yeah. But it was great. I had a great time.
1:29:03
It was fun chatting, man.
1:29:05
Fun chatting.
1:29:06
Yeah. Last time we met at the symposium we were supposed to record, but we just ended up hanging out and chatting. It's just nice to get the brain dump of what's going on in your world, because you're working on really important stuff, man.
1:29:06
Always great to chat with you and see you.
1:29:19
Yeah, good to chat. Parting words on the weight loss and workout journey, because that's also a big thing for you.
1:29:21
I think being healthy is important to do good research. And I think I'm probably at peak physical health now.
1:29:28
Yeah. You look great.
1:29:38
Yeah, thanks. And I think it's also impacted my work in a good way.
1:29:39
You did the sort of Kathy inspired, like biohacking.
1:29:43
I didn't go too extreme, but I was also quite data-driven about it. I would track everything myself. I'm still supposed to write a blog post about this, but I feel like I'm not at the endgame yet; when I get there. For people who don't know, I think I lost 23 kilos this year, actually across one year, one and a half years. So, 23?
1:29:48
Yeah. Basically, literally from the last podcast to now.
1:30:14
Yes. There's an ablation study now: 23 kilos. Yeah. And my HRV, heart rate variability, has gone up 2x, and my resting heart rate has dropped by 30 beats per minute.
1:30:16
30 beats per minute.
1:30:27
It was like 80, 90 and now it's like 60.
1:30:29
Oh yeah, 80, 90 is super high.
1:30:31
Yeah. I was unhealthy. Yeah, yeah, yeah.
1:30:33
Okay.
1:30:35
Yeah.
1:30:36
So when it's hard, did you have a thing that kept you going? A lot of people are focused on AI, including myself. I prioritize work; I enjoy work; I don't enjoy the fitness side. But obviously it feeds into your intellectual work: log off and go for a walk, eat better, all that kind of stuff. And people seeing a positive example like you will get inspired to do the same thing. So I think it's good to set yourself up as an example.
1:30:36
It definitely helps. When I do these things for my health, I think of it as part of work, because it helps me get better at my job. So it's important as well.
1:31:07
Well, yeah, I like that you know your HRV off the bat; I have no idea what mine is. There's a general question about what productivity is and how you measure it, what really matters. It's still unclear to me, but I do think general energy level and hunger matter. Almost like you have to experience physical hunger in order to have intellectual hunger. I don't know if that's a thing.
1:31:17
No, when I'm hungry I just think of food. To me it's just distracting; it's hard to do work when you're hungry. Yeah.
1:31:43
Okay. Thank you so much.
1:31:51
Yeah, thanks. It's really great. Yeah. Have a great time.
1:31:52