Machine Learning Street Talk (MLST)

When AI Discovers The Next Transformer - Robert Lange (Sakana)

78 min
Mar 13, 2026
Summary

Robert Lange from Sakana AI discusses Shinka Evolve, an evolutionary approach using LLMs to automatically generate and refine programs for scientific discovery. The conversation explores how AI systems can discover new solutions through evolutionary algorithms, the challenges of open-ended research, and the future of AI-driven scientific discovery including the AI Scientist v2.

Insights
  • Evolutionary LLM systems can achieve sample efficiency by using stepping stones and iterative refinement rather than brute force search
  • Current AI systems struggle with problem formulation - they excel at solving given problems but can't invent new problems that might lead to breakthrough solutions
  • The future of AI research will likely involve humans as 'shepherds' guiding autonomous systems that run experiments and accumulate evidence continuously
  • True scientific breakthroughs may require AI systems that can co-evolve problems and solutions together, not just optimize for predetermined objectives
  • AI-assisted research is already ubiquitous in ML, but the transition from 'with AI' to 'by AI' will require better verification and human oversight mechanisms
Trends
  • Evolutionary algorithms combined with LLMs for automated scientific discovery
  • Multi-model ensembles with adaptive selection for different problem contexts
  • Shift from template-based to agentic tree search in AI research systems
  • Open-ended AI systems that can run continuously for extended periods
  • Integration of human shepherding with autonomous AI experimentation
  • Movement toward AI systems that can formulate problems, not just solve them
  • Democratization of scientific discovery tools through sample-efficient methods
  • Co-evolution of tasks and solutions in AI research paradigms
Companies
Sakana AI
Japanese AI startup where Robert Lange works as founding researcher, focusing on AI for Japan and novel research
OpenAI
Referenced for GPT models used in evolutionary ensembles and future AI research automation capabilities
Google
Mentioned for Gemini models in ensemble approach and potential to dominate AI-driven scientific discoveries
Anthropic
Referenced for Claude/Sonnet models used in the multi-model ensemble for program evolution
Nvidia
Mentioned for GTC conference, DGX hardware, and potential Nemo Claw agent platform announcement
Hugging Face
Referenced in context of DGX hardware and AI development tools
People
Robert Lange
Founding researcher at Sakana AI, lead author of Shinka Evolve paper discussing evolutionary AI systems
David Ha
CEO of Sakana AI, known for work on hypernetworks and evolutionary computation research
Kenneth Stanley
Researcher whose open-endedness ideas and 'Why Greatness Cannot Be Planned' book influenced Sakana's approach
Jeff Clune
AI researcher known for work on evolutionary algorithms and automated capability discovery systems
Jeremy Berman
Researcher who worked on ARC AGI solutions using evolutionary approaches for instruction-based optimization
Francois Chollet
Creator of ARC AGI benchmark, known for defining intelligence as adaptation to novelty
Terence Tao
Mathematician who has publicly discussed using GPT-4 to speed up research and reduce drudgery
Quotes
"When we run LLMs autonomously, they tend to just kind of like nothing interesting happens. But oftentimes innovation for a specific problem might require first inventing a different problem."
Robert Lange
"The reason why I'm not that worried yet about labor market disruption is I still believe deeply that humans are the source of deep understanding and creativity in the world."
Robert Lange
"I think one of the Rubicon moments is when the new Transformers architecture or something massive is discovered by AI and we're all using it."
Robert Lange
"Oftentimes it's easier to generate a lot of solutions than to actually hard verify them."
Robert Lange
Full Transcript
3 Speakers

Speaker B

So I think a lot of sort of analogies from evolution transfer to scientific research, right? In the sense that we traverse a tree of different ideas or different experiments, and then in the paper we report one path through that tree.

0:58

Speaker C

When we run LLMs autonomously, they tend to just kind of like nothing interesting happens.

1:11

Speaker B

But oftentimes innovation for a specific problem might require first inventing a different problem, right? Sort of automatically coming up with this reduction, or this, let's say, recursive nature of problem solving, is something these systems right now don't necessarily have built in intrinsically, right? Oftentimes it's easier to generate a lot of solutions than to actually hard-verify them.

1:19

Speaker C

The reason why I'm not that worried yet about labor market disruption is I still believe deeply that humans are the source of deep understanding and creativity in the world. If I didn't believe that, I would be very worried.

1:41

Speaker B

So I think it's going to be an amplifier of sort of these latent dimensions humans are great at, right?

1:54

Speaker C

And I think one of the Rubicon moments is when the new Transformers architecture or something massive is discovered by AI and we're all using it. Nvidia GTC starts Monday in San Jose and it's free to attend virtually online. There's already been a leak this week of something called Nemo Claw, which is an open source agent platform. And if it's real, it could be one of the bigger announcements this year, so it's definitely worth watching Jensen's keynote for that alone. I'm giving away a DGX Spark. Nvidia just hiked the price by $700. You probably heard about these memory shortages, right? So yeah, it's now $4,700, which is very, very expensive. And Merv from Hugging Face, by the way, she got one for her birthday and she said she literally cried. So it's a really cool bit of kit. If you register through my link in the description and you attend at least one session, then you are in the draw. This is a massive conference. Physical AI and robotics are going to be the breakout theme, and the keynote is Monday at 11am Pacific. The link is in the description. Don't miss it. Robert Lange, it's amazing to have you on MLST.

2:00

Speaker B

Thank you, Tim. It's a pleasure to be back.

3:08

Speaker C

So you're working for Sakana. Tell us about that.

3:10

Speaker B

Sakana AI is a Japanese AI startup working mostly on AI for Japan and at the same time sort of exploring, let's say, novel or ambitious ideas on the research side.

3:12

Speaker C

It's been around for over a year now. You're one of the founding researchers, right?

3:25

Speaker B

Exactly. So Sakana has been around for now almost two years, like one and three-quarters, I would say. And yeah, it's pretty fascinating to look back and to look at the early days and how much the company sort of organizationally has changed. But in spirit we're trying to sort of embrace Ken Stanley's open-endedness idea and sort of explore many different ideas which might not get the resources right now in the ML community more generally.

3:29

Speaker C

And we've got a few interviews coming out with Sakana that we filmed here in Japan, so I won't spoil the surprise, but the CEO is David Ha. And David, you know, like there are these epic giants out there like Clune and Stanley. David Ha is one of these people.

3:54

Speaker B

David's work has had a lot of influence on my personal PhD. Right. He did a lot of fascinating work on hypernetworks and sort of modulation in neural networks, but also on evolutionary computation and evolutionary optimization. And yeah, that sort of also painted my path during the PhD.

4:10

Speaker C

You've released a paper called Shinka Evolve, and we were just saying that that kind of means "evolve evolve", because in Japanese shinka means evolution. But that's quite common; it's a common thing to do to have these multilingual double namings in Japanese. Just before we get there: we interviewed the AlphaEvolve team and I also interviewed Jeremy Berman a few weeks ago. And your paper is very much like a more sophisticated version of those, in the sense that it's using language models to generate programs and it's doing an evolutionary approach where we generate the program, we refine the generated program, and we have an evaluator, and we do this over several steps. And your approach does many things that the other ones don't do. Tell me about the paper.

4:28

Speaker B

First off, of course this was partially inspired by AlphaEvolve. I think it's great work. I know Alex and Matej and I think they're doing incredible science. One thing that is important about using all of these evolutionary LLM-driven methods is sample efficiency. Many of these systems sample, let's say, 1000 programs for a given task. And what we tried to do with Shinka Evolve was try to essentially cut down costs as well as sort of computation and evaluation time by introducing a set of technical innovations to this evolutionary search. And we showed that it's possible with very few program evaluations to basically improve upon, for example, the circle packing canonical result that they showed in their paper. And yeah, more generally speaking, I think we are right now at an inflection point where these, let's say, evolutionary-driven LLM systems can really revolutionize scientific discovery. And yeah, we hope to have made a step forward to making this more democratically accessible. Right. So the code is open source and available, and by its sample-efficient nature, we hope that many people can interact with the system and can make their own scientific discoveries as well.

5:11

Speaker C

Yeah, that's actually a really important point because I suppose we can use these foundation models. And first of all, isn't it just fascinating to reflect that we have these amazing models out there that we can access. So like GPT-5 and Grok 4, and they are so much better when you get them to refine their solution in several steps. Why is that? I mean, I suppose a naive question would be why aren't they just good out of the box?

6:27

Speaker B

Potentially, with enough random samples, right? It's sort of this monkey typing on the keyboard. They would potentially be able to get there. Right. But in principle it's sort of coming back to the principles of evolution, right, in the sense that you need to collect a bunch of stepping stones first and then build on top of them to really find innovations or to tune innovations down the line. And I think language models with the right sort of evolutionary harness are extremely powerful in terms of scaling up to make discoveries. And yeah, I think Jeremy, as well as the AlphaEvolve paper, as well as work we've done on the Darwin Gödel Machine, for example, show that this sort of stepping stone accumulation, plus iterative verification and collecting information and evidence from the real world or from a synthetic evaluator, is really important for that.

6:51

Speaker C

Very cool. And stepping stone collection: this came from Kenneth Stanley's wonderful book, Why Greatness Cannot Be Planned. And he said that it's better to have systems that don't converge. So in natural evolution we are just trying all of these different things, and greatness quite often follows a diverse path, which means you have to do things which initially seem quite stupid and then later on they turn out to be incredibly useful. We're trying to design algorithms that can kind of allow for a population of slightly weird things, and then we kind of lock in and converge a little bit. So we're still converging though, so we're not yet building systems that diverge forever. What are we losing?

7:46

Speaker B

One thing I find extremely important after having done Shinka Evolve is sort of this problem problem. Right. So with all of these systems so far, maybe except for the AI Scientist, which we can also talk about, the problem is given. Right. So you have an evaluator, you have a correctness checker, and you sample programs only on that single problem. But oftentimes innovation for a specific problem might require first inventing a different problem. Right. So for example, I think in the matrix multiplication result that the AlphaEvolve people show, you can recursively apply the algorithm to larger matrices. So it's actually an important result. Right. But sort of automatically coming up with this reduction, or this, let's say, recursive nature of problem solving, is something these systems right now don't necessarily have built in intrinsically. Right. So I think going forward it's going to be really important to not only do open-ended, let's say, optimization of solutions, but to do the co-evolution of problem and solution together in order to collect even more diverse stepping stones and to really kick off this open-ended process. Because also to me, one of the big life goals or achievements I would want to see is really having a process that can run not only for, let's say, a week or many weeks, but for years even, potentially collecting even more diverse, interesting stepping stones.

8:32

Speaker C

Yeah, I spoke to Joel Lehman and he was talking about Knightian uncertainty, which is that machine learning algorithms aren't very good with unknown unknowns. And in a sense the unknown unknown is talking about these stepping stones that might be useful later. And when we run these algorithms at the moment, it's the same with LLMs and reasoning systems: they're very, very good when we give them a specific thing. And what you're pointing to is we might need to invent new unrelated problems and find the solutions, which might then be related to what we're trying to do. So that feels like a bit of a catch-22 situation. Right. So we're saying, you know, circle packing. Here's my evaluation function, and I want you to sort of diversify and then, you know, converge towards this solution. I had the same thought with Genie, by the way, that it gives you exactly what you ask for. So you put a prompt in, you know, like a Swiss lake with boats on the water and mountains on the side. And I was thinking, where are the birds? Oh, I forgot to put birds in the prompt. Right. So how can we meaningfully build systems that actually kind of bring in other unknown things that might be useful?

9:56

Speaker B

I think one inspiration, or thing I would personally want to research, are systems outlined in PowerPlay or POET, by Schmidhuber, by Jeff Clune and others. So where there is essentially a set of tasks and a solution generator, and both of them sort of co-evolve in this almost auto-curriculum, play-like style. Right. And I think in POET, the natural first application was reinforcement learning. But I think this can now be broadened up to science more generally. Right. At least when there is a simulator available for running these evaluations. And by doing such a co-evolution, you always try to max out the capabilities of the generator while sort of increasing this convex hull of potentially even more diverse problems while doing so.

11:06

Speaker C

I know that there's always the leading thought, even with POET, which was this thing where you had a load of environments and agents, and the environments were complexified so the agents would have a kind of effective curriculum to learn things of increasing complexity. But even then, isn't there a kind of design bias in the system, where there's some code somewhere which complexifies the environment step by step? And wouldn't that also just be designed by the human? So it would also just give you exactly what you ask for.

12:02

Speaker B

Ultimately, this comes down to the hypothesis that language models can potentially do extrapolation or interpolation. Right. In the sense that even though these things might be in the end designed by humans, there are many unknown unknowns. Right. That we humans didn't think of while designing them. Right. So potentially it is possible for an LLM to find a novel discovery simply by us not having thought about it before. Right.

12:34

Speaker C

When we run LLMs autonomously, they tend to just kind of like nothing interesting happens. So depending on the prompt you give them, they'll kind of go a few steps in that direction and then no new interesting novelty emerges. And I think even if you wire them agentically with environmental feedback, they still seem quite parasitic on their starting conditions. With an LLM, could we build a system which actually adapted to novelty, that could actually discover new things?

13:01

Speaker B

I think it really kind of also depends on what you give the LLM as a starting point. So for example, in Shinka Evolve, we from time to time saw that if you give an initial solution program which is already pretty optimized on the problem at hand, you still kind of get stuck in local optima where not a lot of novelty is introduced. While if you start off from an impoverished solution, there's much more room for diversity. And I think this is coming back to what I did before in my research, namely meta-learning. It's sort of this classical trade-off where you can either start out from something very, let's say, unconstrained, from a very simple solution, and give much more room for the optimization. But this might actually require open-endedness and a long time to find a good solution. Or you start out from something that is already very constrained by inductive biases, let's say. And then you might be much more efficient in terms of convergence, let's say. But you don't have this sort of open-ended, big novelty benefit from it.

13:33

Speaker C

Yes, and I suppose where we want to get to is building systems which are not designed by humans. So for example, if I'm leveraging my deep understanding, LLMs are really good if you understand something deeply. And similarly, we could kick off Shinka Evolve and we could put a starting solution in there which leverages my understanding. But we want to have AI systems that anyone could use. So just a non-expert could say, I want to solve this problem, and it will solve the problem. We should talk about the evolutionary approach, right? So to maintain diversity you had a population of programs and they were separated into islands. Tell me about that.

14:41

Speaker B

The way Shinka Evolve works, similar to AlphaEvolve, is you keep an archive, like a database of programs, and then you sample parent programs with a set of inspiration programs, and then you ask an LLM to basically make an improvement to that program, right? So to provide code edits, or rewrite an entire program, or to potentially even cross over different programs. And then basically you query the LLM, you get a program out and you evaluate it on the problem at hand, for example, increasing the sum of the radii of a bunch of circles in a square. You run this basically each time collecting evidence from the evaluator, adding it to the database, and then sort of repeating this process. And you don't do this sequentially, but you do this in parallel for many different programs. And each time a program is added, you essentially try to diffuse the knowledge that was collected by that program across the entire database. So one way to think about this is you have a tree where each node in the tree represents a program, and then you sort of branch off of it based on the parent nodes. And interestingly, these approaches do tend to scale, but ideally we can make the scaling happen at a faster rate. And this is something we tried in Shinka Evolve by doing a bunch of innovations, including model ensembling. So we're not using just Gemini, but we're using basically all frontier model providers and figuring out a smart way to use each model for a given parent. So if you have a certain program, in some situations it might be better to use a GPT model. In other settings, it might be better to use a Gemini model. And we introduce an adaptive prioritization scheme that can adapt the evolutionary algorithm on the fly while running the algorithm. And this sort of also comes back to the naming. Right. 
So Shinka Evolve, "evolve evolve", kind of means that this evolutionary algorithm that we apply using LLMs sort of also co-evolves at the same time while we optimize the programs.

15:21

Speaker C

And while we're on this circle packing problem: you had this plot showing how it converged, and it seemed to converge quite quickly. So we'll show the plot on the screen now. So very quickly the performance jumped up and then it slowly converged. And you said in the paper that it was using, I think, three core innovations. And my thinking was, if you ran this 50 times, would it be the same every single time? And to what extent is it thinking outside the box? You know, Sébastien Bubeck is always posting on Twitter talking about how GPT-5 has just, you know, discovered new things. And there's always the question of, well, is it just searching the Internet, is it just finding things that have been found before and, yeah, combining things together in a new way? But could it really think outside the box?

17:33
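As a rough illustration of the loop Robert describes in his previous answer (keep an archive, sample a parent plus inspiration programs, ask for a mutation, evaluate, add the result back), here is a minimal sequential sketch. It is not the Shinka Evolve implementation: `toy_mutate` is a hypothetical stand-in for the LLM call, a "program" is just a list of numbers, and islands, parallel evaluation and knowledge diffusion are all left out.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    program: List[float]     # stand-in for program source code
    score: float             # value returned by the evaluator
    parent: Optional[int]    # archive index of the parent (a tree edge)


def evolve(initial, mutate, evaluate, generations=300, n_inspirations=2, seed=0):
    rng = random.Random(seed)
    archive = [Node(initial, evaluate(initial), None)]
    for _ in range(generations):
        # Parent selection: a small tournament that favours higher scores.
        picks = rng.sample(range(len(archive)), k=min(3, len(archive)))
        parent_idx = max(picks, key=lambda i: archive[i].score)
        # Inspiration programs: extra archive entries shown next to the parent.
        insp_idx = rng.sample(range(len(archive)), k=min(n_inspirations, len(archive)))
        inspirations = [archive[i].program for i in insp_idx]
        child = mutate(archive[parent_idx].program, inspirations, rng)
        archive.append(Node(child, evaluate(child), parent_idx))
    return archive


def toy_mutate(parent, inspirations, rng):
    """Hypothetical placeholder for the LLM: edit one value or copy it from an inspiration."""
    child = list(parent)
    i = rng.randrange(len(child))
    if inspirations and rng.random() < 0.3:
        child[i] = rng.choice(inspirations)[i]   # crude crossover
    else:
        child[i] += rng.gauss(0.0, 0.1)          # crude local edit
    return child


def toy_evaluate(program):
    """Toy objective (higher is better); the maximum is 0 when every value equals 0.5."""
    return -sum((x - 0.5) ** 2 for x in program)


archive = evolve([0.0, 0.0, 0.0], toy_mutate, toy_evaluate)
best = max(archive, key=lambda n: n.score)
print(len(archive), round(best.score, 4))
```

Because every node records its parent index, the archive doubles as the tree of programs mentioned in the conversation; following `parent` links back from `best` recovers one path of stepping stones.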

Speaker B

Yeah, I think this is almost like a subjective question. Right. So first off, I don't know all problems on the Internet that try doing circle packing. Right. But what I can see in the tree that we also depict is there's, for example, a crossover operation between two programs happening where different concepts are combined, right? So one important part is, for example, the initialization of the circles. Another one is the optimization. So basically a constrained optimization program is executed. And then the final part is basically a reheating stage where noise is added and sort of more is squeezed out. And to me, this propagation of information through the tree is one that's really, really fascinating, where in some sense these stepping stones are actually used in a complementary fashion. And with regards to rerunning the program multiple times, of course there's some stochasticity in it. So we're using language models, and due to the queuing and device scheduling on their server side, basically we can't get rid of all the noise. We've seen that, at least for the general quality of the solution, so what is arrived at afterwards, it is possible to reobtain this, but sometimes with a different program, or most of the time just by stochasticity. So for many problems there's not one solution that achieves that score, but there is a spectrum or a region, let's say, in the program space that resembles the same. I think one thing that was very interesting about the circle packing problem, also coming back to the problem problem that I discussed initially, was that originally we used a formulation where the correctness is checked with a very tiny amount of slack, right? So the circles could overlap a tiny little bit. And then afterwards we sort of reduced the radii and the solution was exact, right? This didn't change the score by too much, so it's still state of the art, but it was essentially like a proxy problem. We then reran Shinka Evolve on the exact setting and we found that it took a little bit longer to actually obtain the same quality of a solution. So I think this already points a little bit in this direction of what I discussed in the beginning. Sometimes surrogate problems might actually be extremely valuable in making such discoveries. And having an automated way for designing these surrogate problems in an efficient way might be something really important going forward.

18:21
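The relaxed-versus-exact check Robert mentions can be made concrete with a small sketch. Everything here is an assumption for illustration: the unit square, the tolerance value, the two-circle example, and in particular the uniform rescaling in `shrink_to_exact`, which is just one simple way to "reduce the radii" and is not taken from the paper.

```python
import math


def is_valid(centers, radii, tol=0.0):
    """True if every circle stays in the unit square and no pair overlaps by more than tol."""
    for (x, y), r in zip(centers, radii):
        if min(x, y, 1.0 - x, 1.0 - y) - r < -tol:
            return False
    for i in range(len(radii)):
        for j in range(i + 1, len(radii)):
            d = math.hypot(centers[i][0] - centers[j][0], centers[i][1] - centers[j][1])
            if radii[i] + radii[j] - d > tol:
                return False
    return True


def shrink_to_exact(centers, radii, margin=1e-12):
    """Uniformly scale all radii down until the packing passes the exact (tol=0) check.

    Assumes every center lies strictly inside the unit square.
    """
    scale = 1.0
    for (x, y), r in zip(centers, radii):
        if r > 0.0:
            scale = min(scale, min(x, y, 1.0 - x, 1.0 - y) / r)
    for i in range(len(radii)):
        for j in range(i + 1, len(radii)):
            total = radii[i] + radii[j]
            if total > 0.0:
                d = math.hypot(centers[i][0] - centers[j][0], centers[i][1] - centers[j][1])
                scale = min(scale, d / total)
    if scale < 1.0:
        scale *= 1.0 - margin   # small safety margin against floating-point rounding
    return [r * scale for r in radii]


# Two circles that overlap each other (and the walls) by a tiny amount.
centers = [(0.25, 0.5), (0.75, 0.5)]
radii = [0.2500001, 0.2500001]

relaxed_ok = is_valid(centers, radii, tol=1e-6)   # passes the proxy problem
exact_ok = is_valid(centers, radii, tol=0.0)      # fails the exact problem
fixed = shrink_to_exact(centers, radii)
score_loss = sum(radii) - sum(fixed)              # objective: sum of radii
print(relaxed_ok, exact_ok, is_valid(centers, fixed, tol=0.0), score_loss)
```

In this toy case the repaired packing gives up well under one millionth of the objective, which matches the remark that tightening the relaxed solution "didn't change the score by too much".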

Speaker C

Yeah, that's absolutely fascinating. It reminds me of support vector machines, where we make the optimization tractable by introducing slack variables. And you can think of that as a kind of surrogate problem. But then I'm thinking, well, would Shinka Evolve or AlphaEvolve know to introduce a surrogate problem? Because, you know, as designers who understand, we can think outside the box and we can do stuff like that. Because presumably if the fitness function had the constraint that there were no circle intersections, then it wouldn't occur to the algorithm to come up with a surrogate problem.

20:54

Speaker B

Exactly. Yeah. This is a big limitation right now. Right. So at this current point in time, we take the problem to be fixed and we optimize for that problem. But when you think about humans, we're really, really good at sort of inventing our own problems. Right. Or reformulating the problem, so then we can actually work with it. Right. So I think a lot of the innovations in, let's say, mathematics come from taking a very different perspective on a problem. Right. So taking number theory and applying it to linear algebra, or the other way around. And I think right now these systems are not yet at the point of achieving such a level of, let's say, transfer.

21:24

Speaker C

Yes. And it reminded me, I spoke to Llion about this. You've got this Sudoku-Bench, and a lot of folks watch the Cracking the Cryptic YouTube channel, and that's exactly what they do. They invent new problems based on abstractions that capture the essence or aspects of the problem you're solving. And then they do something which is similar to Shinka Evolve. So they do this kind of evolution where they take these different solutions and they kind of combine the best aspects of both of them and they forge a divergent path to a new solution. And that seems to be the essence of what we need to do.

22:04

Speaker B

Yeah, for sure. I mean, there is some work also by Jeff Clune, Shengran Hu and Cong Lu on automated capability discovery. So there they look at language models that generate tasks. Right. But it's in a, let's say, unstructured way, in the sense that it's not done in order to enable the solution to one target problem. Right. And I think sort of doing these connections is going to be very fruitful down the line.

22:35

Speaker C

Very cool. Now, the other thing, we'll show the graph on the screen, the evolutionary graph. So for the circle packing problem, I was looking at that, and first of all, it looked incredibly parsimonious, which is good. It looked like it had found an optimal path to the solution very quickly. And I was thinking in my mind, well, maybe there's some natural pattern. There's something about that that we could use in the abstract to guide the evolution in the future. But the other thing I'm thinking about is, right now, the problem with machine learning is that we don't really have semantics baked in. So what we're doing is we have a verifier, we're looking at the reward, and we're sort of doing pattern exploration and we're taking steps towards the target. And I love mechanistic forms of reasoning where we actually know something about what the program components mean. And the reason this is important is when we're merging together the best performing programs from two different islands, that's a kind of first-order interaction, and it might not make sense to merge them together. It's wonderful that with LLMs, you can give them any pair of programs and they will find a way to merge them together. But wouldn't a more principled way be if there were some kind of semantic primitives here and we knew they fit together? So there's this Lego analogy, that we're kind of building up based on principles rather than forging a path based on the performance.

23:01

Speaker B

Yeah, that's a good point. So one thing we do in Shinka Evolve as well is we keep essentially a scratchpad. So each program is being summarized, and then from the program summaries we keep a set of global insights, let's say, that were shared or extracted from these programs. And then based off of this scratchpad we construct meta-recommendations that then become part of the system prompt. That way you can try to semantically grasp some of the discoveries. But a general problem, which is again task dependent, is that you thereby diffuse that knowledge across the tree. Right? But sometimes you want things to be much more isolated. Right? It's always a trade-off where you somehow have to find, for your problem, the right position on the spectrum of how much knowledge diffusion you want to have and how many, let's say, hard islands of programs you want to have. Right. And yeah, we're trying to make steps in the direction of automatically adjusting this in an optimal way. But again, it's very problem sensitive. And then I think another point that you're already going into is Jeremy's solution to ARC-AGI. Right. And doing solution evolution in the instruction space, right, instead of the program space. I do think that this is something important. And we're, like I said, with the construction of this meta-scratchpad, trying to do both at the same time. Again, it's problem dependent. I played around a little bit with ARC-AGI-1 and ARC-AGI-2, and I think on ARC-AGI-1 the transform program direction is actually quite effective. Right. Like Jeremy said, it's deterministic and it's easier to get a clear signal to improve on during your evolution process, while on others like ARC-AGI-2 this whole semantic evolution seems to be more efficient. So I think ideally we can get a system that can automatically, in some sense, decide whether it wants to take a programmatic approach in settings where it's actually feasible and easier to bootstrap off, or take the semantic approach of evolving instructions or LLM-driven input-output mappings.

24:21

Speaker C

Yeah, it's so interesting because a symbolic AI person would say, oh, I don't like connectionism because, you know, the only semantics in connectionism is this notion of similarity. It doesn't really understand things. So they would say, well, just start with an entity relationship graph and then just kind of build up using, you know, composition and first principles. That doesn't work. Right. So we're using neural networks because they're incredibly flexible and they understand a lot of things about the world, but they don't have the kind of constraints that we want. So what we do is we use these tricks. So Jeremy evolved program descriptions. On your program selection, you had a semantic novelty detection, you know, using like

26:35

Speaker B

an embedding-based similarity.

27:14

Speaker C

Yes. You had like a kind of self-similarity matrix, you know, based on the cosines, and indeed you've got this meta scratchpad. So what we're seeing is this fascinating spectrum of possibilities where, still using neural networks, you can imbue semantics using all of these different tricks. But they all come with trade-offs.
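
The embedding-based novelty check mentioned here can be sketched roughly as follows. This is a minimal illustration, not Shinka Evolve's actual code: the function names, the toy two-dimensional embeddings, and the 0.95 threshold are all assumptions made for the example, and a real system would get its vectors from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_novel(candidate, archive, threshold=0.95):
    """Accept a candidate only if no archived embedding is too similar.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    return all(cosine(candidate, e) < threshold for e in archive)

# Toy embeddings standing in for embedded program code.
archive = [[1.0, 0.0], [0.0, 1.0]]
near_duplicate = [0.99, 0.05]   # almost parallel to the first entry
different = [0.7, 0.7]          # sits between the two archived programs

print(is_novel(near_duplicate, archive))
print(is_novel(different, archive))
```

The trade-off discussed in the conversation shows up in the threshold: a lower value rejects more proposals and pushes for diversity, a higher value lets small variations through.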

27:17

Speaker B

Yeah, for sure. I think it's kind of interesting. We've had a long period of computer science where algorithms were sort of designed by humans. Right. Then we had sort of this Andrej Karpathy Software 2.0 paradigm where we trained neural networks that then performed a certain function. And now we're sort of at this point where we're using LLMs to design algorithms or solutions more generally. Right. And I think actually, even though large frontier language models are extreme, let's say, black boxes, or it's very hard to get a full mechanistic understanding of them, the outputs can be the programs, the instructions and so on. So I think it opens up a very sort of new paradigm of doing research or basically doing anything. Right. If you think about it. But I think we're just sort of at the starting point of figuring out the right user interface for that.

27:34

Speaker C

So the other innovation in the paper was using UCB, which is upper confidence bound. It comes from the multi-armed bandit literature, which is this problem where you can pull these levers and at the beginning you don't know which levers to pull, and over time you kind of reduce your uncertainty and you can kind of pull the ones that work. But there's this exploration-exploitation dilemma, and you've implemented that for figuring out which LLM to use. So it could be Gemini, it could be like Grok or something.

28:29

Speaker B

We're using a model ensemble to propose program mutations, and intuitively one could say the best frontier model on SWE-bench is always the best mutation proposal model. But that's actually in practice not always the case. And in general it's extremely hard in this evolutionary setting to assign clear credit to a single model. So you have, for example, one improvement implemented by GPT-5 and then the next one implemented by Sonnet 4.5. And it's unclear basically if the performance gain you get from the second mutation actually originated from GPT-5, sort of collecting the first stepping stone, or from Sonnet 4.5. So instead of sort of uniformly sampling models, what we do is we implement this bandit-based approach where each model is basically one arm of a bandit. And then we look at how often this model improved the performance of a sort of parent node by creating a mutation. And we then adjust sort of this posterior probability to first explore all arms once, right? And then essentially change over the course of time in order to prefer models that yielded improvements before for similar nodes.
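
The bandit idea described here can be sketched with textbook UCB1. This is an assumption-laden illustration rather than the rule Shinka Evolve actually uses: the class name, the binary "child beat parent" reward, and the exploration constant are all hypothetical, and plain UCB1 picks arms deterministically, whereas the system described in the conversation keeps some probability on every model.

```python
import math

class ModelBandit:
    """UCB1 over LLMs: each model is one arm (illustrative sketch)."""

    def __init__(self, model_names, exploration=1.0):
        self.names = list(model_names)
        self.exploration = exploration          # hypothetical tuning knob
        self.pulls = {name: 0 for name in self.names}
        self.wins = {name: 0.0 for name in self.names}

    def select(self):
        # First explore every arm once.
        for name in self.names:
            if self.pulls[name] == 0:
                return name
        total = sum(self.pulls.values())

        def score(name):
            mean = self.wins[name] / self.pulls[name]
            bonus = self.exploration * math.sqrt(
                2 * math.log(total) / self.pulls[name]
            )
            return mean + bonus

        return max(self.names, key=score)

    def update(self, name, parent_score, child_score):
        # Reward 1 when the mutation improved on its parent, else 0.
        self.pulls[name] += 1
        if child_score > parent_score:
            self.wins[name] += 1.0

bandit = ModelBandit(["model_a", "model_b", "model_c"])
model = bandit.select()   # the model that proposes the next mutation
bandit.update(model, parent_score=1.00, child_score=1.05)
```

The exploration bonus shrinks as a model is tried more often, so rarely used models keep getting occasional chances while consistently helpful ones are preferred.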

29:01

Speaker C

The great thing about using a UCB-like algorithm is it actually has a theoretical regret bound, which means it's like only log worse than the optimal switching path, if that makes sense. But if I understand correctly, UCB is based on a sort of global rating, like a mean score of every single LLM. And I think what we want is to have more of a contextual switching decision, which means we know for this particular program Gemini is better. And do I understand correctly that at the moment it might converge to a single frontier model, and then in a nuanced situation we might still get the wrong model?

30:20

Speaker B

So in general there is some amount of probability allocated to all models. So it's not like it can just peak on one model and then you stop using the others. Right? So there's still a chance for open-endedness and serendipity, if you will. And in general, for the problems we considered, we haven't seen that one model clearly dominates all the others. We've seen that it really depends on the course of this evolutionary process which model is better, and UCB, or the bandit approach that we take, dynamically adjusts this in an efficient way.

31:01

Speaker C

And would it be possible in the future to use an LLM to make this judgment?

31:36

Speaker B

Potentially, in some sense. In that case, again, you think of the LLM as a surrogate model, in the same way you can think of a Gaussian process as a surrogate regression model. And there has been some work showing that language models can act as surrogate models. And the real question to me is how do you represent the information to the LLM, in the sense that if you use the raw programs and their fitness evaluations, you quickly run out of context. So you need some amount of compression in order to present the information the right way to the LLM in order to do this prioritization of the models.

31:40

Speaker C

I hadn't appreciated how long the context is. I was thinking, could we use an 8-billion-parameter Llama model? And we're doing active fine-tuning. So we're saying, I just ran this program on Grok and it got this score, and then over time, for the given run of this evolution, this thing will kind of know that Grok is good at these problems.

32:15

Speaker B

Yeah, potentially. I'm not sure how efficient this fine-tuning is if we're only evaluating 150 programs. But in principle one could imagine it. I think it's, on the engineering side, not necessarily the prettiest to do. Yeah, it could in fact happen. But I think for all of these things, we started out sort of with the, let's say, most intuitive algorithmic component that we had. And UCB was one that really did the job here. And yeah, much credit to Edoardo Cetin, who introduced this to Shinka.

32:38

Speaker C

So let's talk about the diffs and the mutations. So we generate programs, and I think you folks were inspired a bit by AlphaEvolve. So they actually had this gating where you kind of gate part of the code which is mutable. Tell me about all of that.

33:10

Speaker B

A program is just, let's say, a long string, right. And in order to make sure that certain parts which are sort of essential to the evaluation, for example the imports and so on, are not sort of deleted by the LLM mutations, there are so-called markers which basically state which parts of the code are mutable and evolvable. And it's easy to programmatically sort of make the essential parts actually immutable. When you get a diff proposal, these will not be changed, so only the rest of the code snippet will be changed. We sort of implement a type of rejection sampling with reflection approach, where if an LLM by chance, for example, tries to mutate this part, it's going to be rejected and you resample a new proposal, and thereby you can somewhat mitigate certain security or safety problems and get a robust sort of mutation. One of, I think, the bigger questions is how can you turn this from a single-file mutation setup to a multi-file mutation setup. So working on entire code bases. In principle you can represent many code bases in a single file, right? But the hierarchical structure might be actually useful. And there are some ideas from, let's say, Aider, this coding tool where you construct a repository map and sort of have some level of abstraction, but they also come again with positive and negative trade-offs.
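
The marker-plus-rejection idea can be sketched like this. It is a simplified illustration under stated assumptions, not the project's implementation: the marker strings, the `propose` callable (a stand-in for an LLM call), and the feedback message are all hypothetical, and the check compares whole programs rather than applying diffs.

```python
START = "# EVOLVE-BLOCK-START"   # assumed marker text
END = "# EVOLVE-BLOCK-END"       # assumed marker text

def frozen_parts(source):
    """Return the lines outside the mutable block, markers included."""
    frozen, inside = [], False
    for line in source.splitlines():
        if line.strip() == START:
            inside = True
            frozen.append(line)
        elif line.strip() == END:
            inside = False
            frozen.append(line)
        elif not inside:
            frozen.append(line)
    return frozen

def propose_with_rejection(parent, propose, max_tries=3):
    """Resample until a proposal leaves the frozen parts untouched.

    `propose(parent, feedback)` is a hypothetical stand-in for the LLM;
    `feedback` carries the reason for the previous rejection (reflection).
    """
    feedback = None
    for _ in range(max_tries):
        child = propose(parent, feedback)
        if frozen_parts(child) == frozen_parts(parent):
            return child
        feedback = "Only edit code between the EVOLVE-BLOCK markers."
    return None  # give up after max_tries rejected proposals

parent = "\n".join([
    "import math", START, "x = 1", END, "print(x)",
])
bad = parent.replace("import math", "import os")   # touches frozen code
good = parent.replace("x = 1", "x = 2")            # edits the block only

attempts = iter([bad, good])
child = propose_with_rejection(parent, lambda p, fb: next(attempts))
```

A proposal that deletes the markers is also rejected here, because every line then counts as frozen and no longer matches the parent.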

33:28

Speaker C

Basically. I love Aider, by the way. It feels that in the future the code generation systems will actually resemble Shinka Evolve, and if you think about it, it'll be using some kind of git repo. Maybe Cursor already does this, because in Cursor you can restore previous checkpoints, but it can be exploring different branches and merging checkpoints together. And, you know, obviously you just say in natural language what you want to do. But we didn't talk about mutation, by the way, so we just spoke about diffs. And there's also an option to do the full file rewrites, but there's also this notion of crossover. So how does that work?

34:56

Speaker B

A small innovation on top of AlphaEvolve, where I believe they only use sort of diff-based mutations, is that here we wanted to have more flexibility to entirely rewrite the program to come up with a completely different stepping stone, if you will. So again, there you can make parts of the code mutable, but instead of proposing, let's say, a patch to change certain parts of it, we essentially rewrite the entire program. And this sometimes is helpful, right? It's not always like a clear benefit, but it allows you to essentially get more diversity into the search. Right. So this is one type of mutation next to sort of this diff patch based approach. And the other one is a crossover mutation where we sample basically not only a single parent program, but sort of two different ones, and we ask the system to sort of make a complementary improvement. And again, on some problems this is really helpful and on others it's not. But in general we found that sort of having a diversity in terms of operators is also helpful in discovering new things. And I wanted to sort of follow up on the point you made before about this sort of being a new paradigm. I think so too. I'm really convinced. I think right now we're sort of at the beginning, where we still think a lot about sort of this chat assistant interface as the way we interact with LLMs. But it's most of the time inherently single-threaded. So we're sitting in front of the computer, we're interacting with the chat, we're seeing sort of changes as they occur in the editor, we accept them and so on. But I think this is sort of also just a stepping stone towards a more, let's say, distributed way of thinking about research, optimization and so on. So I like to sort of think of vibe coding, vibe chatting, and on the other hand we have sort of vibe optimization and vibe research.
Where my ideal future scenario is one in which you as a researcher, sort of during the day, co-work with a system like Shinka or the AI Scientist, you sort of steer the ship like a shepherd in some sense. And then during the night you press play and you go to bed, and in the background you have multiple experiments running and automatically new ones being proposed by LLMs, evidence being accumulated. And then in the morning you come back and you have a multi-threaded sort of system running in parallel. And you're more like the shepherd of this ship than the person actually executing experiments and analyzing. You're still analyzing, but you're not executing. This is happening sort of by the system itself.
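
The three mutation operators described at the start of this answer (diff patch, full rewrite, crossover) could be wired together along these lines. This is a sketch under loud assumptions: the operator weights, the uniform parent sampling, the instruction strings, and the function name are all invented for illustration and are not taken from the paper.

```python
import random

OPERATORS = ("diff", "full_rewrite", "crossover")

# Hypothetical instruction text that would be sent to the LLM.
INSTRUCTIONS = {
    "diff": "Propose a patch to the mutable part of the parent program.",
    "full_rewrite": "Rewrite the mutable part of the parent program from scratch.",
    "crossover": "Combine complementary ideas from both parent programs.",
}

def build_mutation_request(archive, rng, weights=(0.6, 0.3, 0.1)):
    """Pick an operator, then sample one parent (two for crossover).

    `weights` are made-up operator probabilities; parents are sampled
    uniformly here, which is a simplification.
    """
    operator = rng.choices(OPERATORS, weights=weights, k=1)[0]
    n_parents = 2 if operator == "crossover" else 1
    parents = rng.sample(archive, n_parents)
    return {
        "operator": operator,
        "parents": parents,
        "instruction": INSTRUCTIONS[operator],
    }

archive = ["program_a", "program_b", "program_c"]
request = build_mutation_request(archive, random.Random(0))
print(request["operator"], request["parents"])
```

Keeping several operators in the mix is what gives the search the operator diversity mentioned above; how the weights should be set is problem dependent.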

35:35

Speaker C

Yes, and increasingly this might be semi-supervised or even proactive. I mean, you know, there's that new product from OpenAI where it knows what you're interested in and while you sleep it's going off and, you know, Pulse. That's right. And, you know, we're in the situation now where we're reasonably technical people. So, you know, MATLAB and Mathematica, they're supremely powerful, but you need to know how to express problems precisely. Whereas I can imagine a future where we express problems just in natural language, or maybe just based on our interactions with language models the platform knows what we're interested in, and it can just go and find things on our behalf, because this is about democratizing this technology to people who perhaps don't know exactly what they're looking for.

38:10

Speaker B

I think one of the bigger problems there is sort of this verification aspect to it, in the sense that oftentimes it's easier to generate a lot of solutions than to actually hard verify them. Language models are capable of doing soft verification, looking at code and sort of latently running a stack trace of execution, but it's not exact. And I think these notions of reward hacking and not doing real discoveries but sort of shortcutting them is one where we need to put more time and effort into figuring out how to make sure that this actually moves in the right direction. And I would hope that language models at some point can do this efficiently themselves. So either implementing in code or latently doing it. But this is also part of the problem. It's not only coming up with the problem, but also with the automatic verification at the same point.

38:55

Speaker C

Yeah. Isn't it a tantalizing idea that there are natural patterns in the world and the building blocks to construct novel solutions are already there? Right. And maybe they're there for a reason. Maybe they just reflect natural regularities in the universe. Because there's always this question of, you know, intelligence is about adapting to novelty. So the world is always changing and the world tomorrow will have things that we can't explain with our knowledge today, but we do have abstract knowledge that could be easily recombined to explain the future. And LLMs might already have those building blocks.

39:49

Speaker B

Yeah, for sure. I think in some sense, the more you think about Occam's razor applying to everything in our world, let it be language or let it be sort of science, it's pretty interesting, because these artifacts now go into our language models of today and potentially there is some amount of this being captured. I think though it might also be an inductive bias that leads to a local optimum at some point, and you need more complexity. But I do think with systems that sort of do this evolutionary mutation style approach, you might still sort of push the system out of these local optima eventually.

40:29

Speaker C

Yes. And there's also the notion of the importance of adaptivity. So this is what Chollet says intelligence is. And since we've had these models that actually do adaptivity at inference time, so things like test-time active fine-tuning and the reasoning models and so on, they started getting non-trivial performance on ARC. Now it's very, very expensive to adapt huge foundation models. It's just a practical concern why we haven't done that yet. But what we can do is build systems like Shinka Evolve that leverage the best of both worlds. So they leverage frozen foundation models, but they give you adaptivity. And the purpose of adaptivity is to respond to novelty, to create new building blocks, to synthesize new building blocks in this principled tree-like structure that allows us to adapt to novelty. So we are having our cake and eating it.

41:03

Speaker B

I have to say I found it very interesting that Jeremy, basically in your podcast when you asked him about Shinka, was saying like he doesn't believe that there are a lot of sort of percentage points to be gained by using a system like Shinka, but you can make it much more efficient. That was sort of the gist of his answer. And to me it's like once you have made it much more efficient, you can scale it up again. So if you essentially have a cheaper system that can generate many more sort of instructions, I would expect that by the nature of open-endedness, you might get some amount of improvement out of it. Right now I don't have any evidence for it. I would love to collect that evidence. It's again like the magic of open-endedness that comes into play, that as long as sort of these training examples of ARC-AGI give you a good signal for a final test submission, you should be able to progress.

41:54

Speaker C

Yes. And that is a great segue, because certainly on the circle packing problem it was so sample efficient that in less than 200 interactions with an LLM you converged on the solution. But I was thinking that's great, but it's still quite dependent on the starting conditions. We talk about this design bias and so on. So what we put in is very important. But now what we could do is scale out, so we could run this a thousand times and we could have another process which prompts, generates, breeds the starting conditions. Because every time we run Shinka Evolve, what it's doing is it's searching parts of the epistemic tree. And what would happen if we just scaled that out massively?

42:40

Speaker B

We haven't tried. But you could even start with like an empty program. Right. Which would be basically the same. Right. And then you would branch off of that empty program. I would expect. Yeah. We haven't done this simply out of sort of cost and time reasons, but I do think in many ways, sort of this is the question that will push us towards this true open ended vision of running a system for a month or so. Really trying to squeeze this out. Yeah, I'm not sure if we're entirely there yet, but I will do my best. That we will.

43:22

Speaker C

And the reason this is interesting is we know as a practical matter that we can't start with nothing. If we were just sort of starting from the most primitive building blocks, the search space would just be huge and there'd be no learning signal. So we know we need to start a little way up the stack, but we can massively parallelize that. So let's say we have a thousand different instantiations of Shinka Evolve. It doesn't have to be embarrassingly parallel. We could still have some sharing. So during their execution we could still have a little bit of like crossover, and maybe then we could run all the Shinka Evolve instantiations in a similar kind of meta evolution loop. And my suspicion is, contra Jeremy's view, we know there are diverse stepping stones out there that could dramatically improve many of these solutions. We simply haven't scaled it up yet.

43:50

Speaker B

Yeah. I also believe that a system like Shinka Evolve could be able to sort of automatically detect whether an instruction-based optimization approach for a given problem or a transform-based approach is actually the right thing to do. And sometimes, potentially, it's like even the mixture. Right. There's some things you can probably articulate more easily in Python than you can articulate in sort of language. Right. So I would be really interested in sort of exploring that.

44:40

Speaker C

Yeah. I mean, you said earlier about Jeff's. What was Jeff Clune's paper? The thing that generates problems, capability discovery. I did speak to him about this at NeurIPS, but something like that could be fascinating as well, where we're also generating the problems and solutions and then kind of moving the back end. But I think the way this will land commercially is there'll be a new type of GPT where everyone is solving different types of problems, and the system, it'll be like a kind of Shinka Evolve, but a massively distributed version where mathematicians are using the platform over here to solve this problem, and it will see commonalities and it will kind of like link them together, because you need to leverage like human creativity in this process as well.

45:08

Speaker B

I think like a big challenge going forward is going to be like how do we change our incentive system for this to actually scale? Right. I think, for example, some amount of economy will be needed or some amount of mechanism design in order to make sure that everyone is still happy to engage in it. Right. So maybe we're going to have many more leaderboards for whatever is numerically sort of scorable. And I think this will be really, really interesting to see how sort of compute, these automated agents, human shepherding and steering will ultimately sort of change and revolutionize science and, I guess, society more generally.

45:47

Speaker C

And Rob, looking at the future, we've got a load of people in San Francisco that want to scale language models, and they are adding in implicit forms of adaptivity and composition, so they're building controllers and they're doing reinforcement learning with verifiable feedback and so on. I think that you subscribe to the slightly different idea that we need to be far more open-ended and we need to be using evolutionary algorithms and so on. But do you think that they are on a path to nowhere? Do you think they might change tack? Where is this going?

46:24

Speaker B

So I actually think that these things can be complementary, right? In the sense like, let's say you fine-tune a model to be like a circle packing expert, right. So I do believe that mixing in sort of different RL fine-tuned models into sort of the ensemble of models and then having a good way to adaptively select which model to use is not a bad idea. So to me, I just very fully subscribe to this philosophy of open-endedness. And reading Ken's and Joel's book was really like a fundamental moment in my life. And I want to see how far we can push this. And I think we're not yet at sort of convergence, where either the capabilities of the models have converged, or the way we scaffold around them, or the way we humans interface with them. So to me there are really like these three points: model capability, model scaffolding and sort of the user interface. And I think we have a lot still to push on all three angles.

46:54

Speaker C

Beautiful. The only thing we didn't talk about was we spoke about the circle packing problem, but you also applied it to a few other things. Can you tell us about that?

47:59

Speaker B

So one thing we did was we sort of used a framework called ADAS, Automated Design of Agentic Systems, where basically instead of manually writing an agent scaffold, you use an LLM to write agent scaffolds for a specific task. So what we did is we looked at mathematics tasks, so AIME, and we used Shinka to evolve basically an agent, right? So using an agent to evolve an agent. And we found that there we could dramatically improve sort of the performance of very cheap models like GPT-4.1 nano. But the agent scaffold was also able to either generalize to other language models or to different years of AIME. That was one application. One important other application that we did was to ALE-Bench. ALE-Bench is basically work done by other folks at Sakana, including Yuki, who's also part of the paper, which is considering heuristic programming contests, sort of previously done and executed by AtCoder, which is like this famous Japanese competitive programming organization. And we sort of showed that Shinka can also work very well as a co-scientist. So basically we took initial solutions obtained by an ALE agent that was previously designed, and then we optimized on top of these initial solutions with Shinka and showed that on one of these programming tasks, if the combination of this agent and Shinka had competed in the challenge, it would have ranked second place, basically. So I think there's some evidence that Shinka can work as a co-scientist, not only for LLM agents, but potentially even for humans, like we discussed before. And then finally, the final application that we looked at was designing sort of mixture-of-experts load balancing loss functions. So at Sakana we've done some previous work called DiscoPOP. I think we discussed this during the last podcast we did, where we are using LLMs to design objective functions. And back then we did it for preference optimization and post-training, and here we did it for load balancing of mixtures of experts.
Also there we found that within, I think, like even only 20 sort of generations, we were able to sort of explore, let's say, not only a single objective function, but sort of, let's say, a convex hull where there are different trade-offs between sort of performance and load balancing and so on. So I think this is another application of Shinka where it's not only basically about sort of finding the best solution, but essentially illuminating a program space where there are always potential trade-offs between, let's say for example, runtime and the quality of the circle packing. And having a system that can explore all of this is important as well.
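
One common way to describe a set of trade-offs like this is a Pareto front: the candidates that no other candidate beats on every objective at once. The sketch below is an editorial illustration with that swapped-in framing, not the analysis from the paper; the candidate names and scores are fabricated toy values, and both objectives are assumed to be "higher is better".

```python
def pareto_front(candidates):
    """Return names of candidates not dominated on (performance, balance).

    Each candidate is a (name, performance, balance) tuple; a candidate is
    dominated if another one is at least as good on both objectives and
    strictly better on at least one.
    """
    front = []
    for name, perf, balance in candidates:
        dominated = any(
            p >= perf and b >= balance and (p > perf or b > balance)
            for _, p, b in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Toy scores for four hypothetical evolved loss functions.
candidates = [
    ("loss_a", 0.9, 0.2),
    ("loss_b", 0.7, 0.7),
    ("loss_c", 0.6, 0.6),   # worse than loss_b on both objectives
    ("loss_d", 0.2, 0.9),
]
print(pareto_front(candidates))
```

Returning the whole front rather than a single winner matches the point made above: the user can pick the trade-off that suits their setting.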

48:05

Speaker C

I'm very excited to see you apply this to the ARC challenge. Like what are your thoughts about that?

50:55

Speaker B

I still need to collect results, so I don't want to make any claims or hard claims before having done this, but I would hope that there is some chance of, for sure, improving the cost of these systems and then potentially even performance. But yeah, to be seen.

51:00

Speaker C

So you've done some experiments. Exciting news is potentially coming.

51:17

Speaker B

I started looking into it, yeah.

51:20

Speaker C

I mean what are your thoughts in general about ARC though?

51:23

Speaker B

I think it's great. I think it's really important and I think it fills an important gap, and I do really deeply respect Francois and sort of read the paper when it first came out, and no one thought of actually being able to get numbers above 10%. Right. And it's also pretty fascinating on a society level how far we've come since then. And sometimes while you're sort of deep in the, say, battle mode or work mode, you kind of forget where you were one year ago. And then just looking back, it's pretty amazing also how far we've come since o1.

51:26

Speaker C

It's insane. I think Francois doesn't get enough credit, because it's such a good benchmark and not necessarily for the reasons people think, because Francois has always said that we need to have a benchmark which is easy for humans and hard for AIs. And in a sense that's not quite the case. I said when ARC v2 came out that it's actually very difficult for humans. You know, there was one task where Duggar was stumped for about 15 minutes, and there were three of us looking at it and we just... And it's one of those things that, depending on your perspective, you might get it straight away or you might not. So there's that criticism. And people have said that ARC v3 is even harder.

52:04

Speaker B

Yeah, yeah.

52:36

Speaker C

You know, but I think that's rather missing the point. I think he's saying that with a lot of these competitive coding problems, the data set is contaminated. These are problems that have been solved before in part or in whole. Which means when you look at the epistemic tree, many of the building blocks for solving them are very high up in the tree. He's looking at problems where there is very little dataset contamination and they need to be solved from very abstract building blocks. So you're starting much lower down the tree and you're synthesizing a model by composing together very abstract building blocks, which is the essence of intelligence. And I think for that reason ARC is really kind of pushing us to build adaptive systems which we could say are intelligent.

52:37

Speaker B

Yeah, I agree. I mean, in many ways I'm really looking forward to the next years and seeing how far we can push this, and then also how much generalization we can get afterwards. Because I believe when you look at the more recent models, they're getting much better at the transform-style code evolution or outputting for ARC than they are on the instruction-based level. And I think this might already be a small sign of some amount of overtraining on ARC-AGI-1, at least. I do believe there are some aspects of work which will be automated before it comes to sort of full science automation and the type of work I'm doing. But I could imagine that certain parts of the dimensions that I deal with every day are for sure going to be hit by AI. And then the question is, are there going to be new dimensions opened up that we as humans will fill in? And I think what I said before about shepherding and so on, I really hope that that's the way forward, right, in the sense that humans are the ones steering the ship while just being massively amplified in their productivity.

53:22

Speaker C

Right now I am not really seeing the kind of job market disruption that was being predicted. I know from personal experience that in a sense it's made it very difficult to hire people. Script writers use ChatGPT, and I can spot it instantly. And writers and copy editors are actually in more demand than they were before, fixing all of the crap that has been generated with ChatGPT. And there's the cloud analogy as well. So, you know, IT system administrators who were earning, you know, £60,000 a year in the UK, they rebranded as Cloud DevOps engineers and they more than doubled their pay. And people are very adaptive. They see new trends, new bandwagons, and they just adapt and they add value on top. And that has been the trend for a very long time. Do you think that AI is going to be so transformative that it will transcend people's ability to adapt?

54:30

Speaker B

I think it's just a question of speed. Right. So I was talking about sort of cultural evolution and technological evolution, and it seems like we humans, we need more adaptation and more time to get used to the technology, to carve out these niches where we can fill in and it's complementary. So first off, I think we're still not at the ceiling of the technological progression, so maybe in a couple of years we will need less of sort of slop editing, like you said. But I do think we need some more time to adapt to the different modalities of interacting with these systems. Right. I think everyone can sort of interact with a chat assistant, but I think this is the most naive form of interacting with AI agents, for example. I think we need to get the pacing of all of this right, and we need to do much more exploration in human-machine interfaces, UI/UX design, and how to make sure that humans sort of feel fulfilled during this experience.

55:28

Speaker C

This is particularly relevant because, you know, you were behind the AI Scientist paper and there's now version two of that. Allow me to be a tiny bit skeptical. You know, we were talking about when we evolve systems to do a particular thing, and at the moment it feels like, as good as they are, they are still quite parasitic on the instructions and intentions of the human supervisor. So it's very much an exchange between the humans and the system, because the implication is that in the future we might have systems that are so autonomous and so open-ended and can figure out valuable things to research that humans wouldn't be needed anymore. And the reason why I'm not that worried yet about labor market disruption is I still believe deeply that humans are the source of deep understanding and creativity in the world. If I didn't believe that, I would be very worried.

56:36

Speaker B

I agree. To me, the AI Scientist, like v1 and now v2, are sort of glimpses into a potential transformation. But I fully agree, in order to make really big scientific breakthroughs, like multiple of them, like every day or whatever, you still need humans in the loop to sort of either seed or guide the direction in which to explore, or to verify, to check and actually transfer these insights. So I think it's not going to be like all ML PhDs will be unemployed. It's more going to be a sort of co-evolution of humans with this technology. And potentially, in an ideal future for me, it will allow humans to focus on what they're really, really great at. Right. So I think it's going to be an amplifier of sort of these latent dimensions humans are great at. Right. I think something that's critical is that we as humans try to interact with these systems as early as possible in order to actually have influence and ownership over this development process. It's ultimately collective intelligence that will shape all of these systems together.

57:31

Speaker C

And do you think these systems can become incredibly sophisticated such that they are somewhat detached from humans?

58:43

Speaker B

Well, I mean, with the AI Scientist v2 we sort of released that one paper that we submitted to an ICLR workshop was able to sort of pass the acceptance threshold before meta review. So I do think, at least for sort of workshop-level contributions, we're getting there. While not every submission an AI Scientist paper does is reaching that threshold, we're at the point where we can even talk about noisy review processes and this actually being something where, as long as you have a large budget, you might get something out of it. I think going forward, for the bigger innovations and so on, for now you still need humans, but we're sort of at the GPT-1 moment of making this sort of a reality, and potentially in 10 years this is going to look very, very different once also the infrastructure for it has been built up. Right. So there are places like Periodic Labs which are now building real physical labs with robotic systems to automatically sort of execute experiments. This will take some time, but it is sort of imaginable for sure that as we sort of do RL on these types of systems and we actually also account for negative results and for actual hypothesis testing, so getting these systems to be real good hypothesis testers with verifiers in the loop, that we might be able to unlock many more capabilities.

58:53

Speaker C

Yeah, I mean, I suppose I don't want to sound like a Luddite. So it's entirely possible that this is just, you know, I don't have the imagination to think about the future. So it is possible that in the future these systems might understand very deeply and be creative. You know, I think right now the problem is they only understand things a few levels down in the epistemic tree. So they can do some surface-level recombination and they can discover new things in the basin of things that have already been discovered. But we understand things very deep down in the epistemic tree, which means our, you know, our cone of creative potential is much wider. It's possible that that gap might be closed. What would happen then?

1:00:24

Speaker B

The way I kind of think about the scientific process is like a tree search, ultimately. So I think a lot of analogies from evolution transfer to scientific research in the sense that we traverse a tree of different ideas or different experiments, and then in the paper we report one path through that tree. And I think, as I kind of alluded to before, we need many more full-tree datasets for training these LLM systems to actually learn how to do this exploration and this foraging, basically. At the same time, I feel like evolution will also take place on a cultural level. Like for us, we will get better at sort of steering the ship. And I can imagine that in the future world the way we do research will be completely different. And I'm pretty sure that right now, already 99% of machine learning research is done with sort of AI assistance. Right. Think about ChatGPT brainstorming, Cursor coding, Claude Code, et cetera. In the long run, we're going to move on that spectrum from sort of "with AI" closer to "by AI", and then more high-level orchestration and overseeing by humans.

1:01:06

Speaker C

There's also the notion of how intrinsically coupled to humans the value function is. So one school of thought is that AI will develop a mind of its own and it will, you know, basically transcend humanity and it will just have agency which is not parasitic on ours. I personally don't subscribe to that view. But the other view is that, let's say the AI scientist, like version 10, is going to be continually epistemic foraging. It's going to be finding new things that are useful, and they kind of have to be useful to us, because if it finds things that are not useful to us, then we just won't use them and then nothing will happen. Do you think there'll always be a kind of coupled value function to humans?

1:02:19

Speaker B

Jeff Clune had this work on OMNI, using LLMs as sort of amortized notions of interestingness for humans. And I think ultimately the way we train these systems is coupled to human data, and going forward it will also be coupled with human data that is collected using verifiers. So I have a hard time believing that in the long run, when you run this open-endedness paradigm with AI scientist agents, it's going to completely diverge to something that's either fully non-interpretable or unrelated to problems we as humans care about. And then again, humans can steer to a certain degree where the search happens. So you can tell a system, "Okay, try to do cancer research and work on problems that we care about." And ultimately we are the ones who control how many FLOPs are being pushed into this.

1:03:04

Speaker C

Yeah. Because as a thought experiment, I can imagine, let's say in the world of mathematics, what if an AI scientist could come up with entirely new problem formulations and then solve them and these are things that humans had never conceived of before and maybe they would be less interested in the answer because humans hadn't spent time thinking about it. And if you think about it, we could just explore the phylogeny of mathematics just to the nth degree. And at some point maybe we just wouldn't care anymore. Maybe we can just carve out that space just forever and ever.

1:03:59

Speaker B

Yeah. But maybe down the road there is a stepping stone that enables a new innovation in a different field that we actually care about. Right. So it's very hard to say a priori whether or not something is interesting or not. Right?

1:04:28

Speaker C

Yes. And there's also the notion of, I love this idea of diverse intelligences and diverse minds. And maybe we could just create artifacts in a space which is completely alien to us and we might even ascribe moral value to them. And we might not want to turn off the power because we want these alien artifacts to stay alive.

1:04:40

Speaker B

Maybe. I read a lot of science fiction, but I would sort of shy away from speculating about all of this. But one thing I'm extremely certain of is that the way we conduct research and science is going to fundamentally change in the next five years, 10 years and 20 years. And I hope that we're going to be able to tackle some of the biggest problems which are still seemingly unreachable right now, with and by AI.

1:05:02

Speaker C

So Terence Tao has posted that he's been using GPT-5 and it's been speeding him up. It's taking away a lot of the drudgery. And Scott Aaronson posted something similar as well. But the cynical take is that maybe laziness is stepping in, and in some pernicious way using AI models is actually stopping us from thinking outside the box. So it's encouraging us to kind of search in the neighborhood of things that are known. And that is very useful. It's very useful to have an artifact that knows all of the experiments, all of the things that were ever done by people 20 years ago. But now we don't have people really kind of applying their brilliance, their talent in completely new areas.

1:05:32

Speaker B

So first off, it's great that these experts are already using the technology in their day-to-day work. Right. And I think it's also important that really, really top-level scientists try to push what these systems are capable of, or squeeze out where there might be sort of blind spots or stuff these systems can't do. Second off, I think it comes down to discipline and how we raise the next generation. So discipline on a personal level: how much do you just accept everything that's being proposed by these systems? And responsibility in terms of educating the next generation, in the sense that we need to teach our kids that ultimately what comes out of these systems might not always be true, that facts can be sort of subjective, if you will, and that there needs to be more research about what's being given to you. And I think this will be, like I said, this cultural evolution that we have to step through and try to make the best out of.

1:06:18

Speaker C

Yeah, the autopilot thing is very interesting because there is a tendency, using Cursor, that at some point the models are getting so quick that you can't even read the tokens coming at you. And then you just press accept. And you press accept. It's the same thing in cars: as soon as you have too strong of an autopilot, you just completely switch off. And then you see a divergence, because there's something about thinking that must be grounded on your path. There's this path dependence. And when you start kind of becoming parasitized by this other train of thought, then you stop thinking about your path and then you're not in the driver's seat anymore.

1:07:18

Speaker B

This is now a bit of a harsh statement, but sometimes I wonder if these systems, like these coding assistants, are almost like drugs in the sense that you become addicted, you use up all your budget and then you need to load up again. And once you've fully reached the budget limit, you feel like, okay, what am I going to do now? And I think once that happens to you, you should really rethink the way you work. And to me, right now there are certain parts where auto-accepting is acceptable and there are certain parts where it's definitely not, and you really need to go deep into it. And I think right now we're sort of in this weird non-equilibrium state where things are moving constantly, right? So the systems or the models are changing, the features are changing, the parts where the systems are good are changing all the time. And we humans need to constantly adapt to that. Right. And I think it's a big cognitive challenge, and I think we just all need to be aware that there are certain problems and certain challenges that we have to adapt to. I think the best way to do so is just to interact with this technology as much as you can and maybe find new research ideas out of that experience.

1:07:53

Speaker C

And how is AI scientist V2 different to V1?

1:09:13

Speaker B

In V1 we used sort of a template-based approach. So we had like a base experiment, and then for that base experiment we asked an LLM to generate ideas with Semantic Scholar calls and literature search. And then it implemented these ideas based on the template, basically as code diffs. And then it linearly executed an experiment plan and wrote a paper in the end. And so what could happen was that there was an idea and that idea didn't work out, but then in the end the experiments were still executed linearly and it wrote a paper. And this was already impressive in the sense that it looked very much like science. But if you think about human science and the scientific method, it's much more like tree search. Like I said before, you sort of adapt what you're going to execute next and you refine based on evidence that you accumulated. So this is sort of the notion of falsificationism from Karl Popper, in the sense that we collect evidence for hypotheses and we reject others. And we do so in a loop, basically, until we want to publish or we find something. And we tried to take this notion and directly build it into the agentic scaffolding for the AI scientist V2. So now it's basically like a parallelizable agentic tree search where there's no longer a template experiment needed, but this is drafted up by the LLM itself, and thereby the AI scientist v2 can be applied to many more settings, if you will. So at the core is this new agentic tree search paradigm. And then we use a couple of minor technical changes, like using a VLM reviewer for figuring out if captions of a paper are aligned with the figures. And we scale this up to many more compute nodes and then write a paper in the end.
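
[Editor's sketch] The difference between a linear plan and a tree search can be illustrated with a minimal best-first search over experiment nodes. This is not the AI Scientist V2 code: the names `Node`, `tree_search`, `propose`, and `run_experiment` are hypothetical, the LLM and the experiment runner are replaced by toy functions, and the search runs sequentially rather than in parallel.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One executed experiment in the search tree (hypothetical structure)."""
    plan: str
    score: float
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def tree_search(
    root_plan: str,
    propose: Callable[[str, int], List[str]],   # stand-in for an LLM proposing refinements
    run_experiment: Callable[[str], float],     # stand-in for executing code and scoring it
    budget: int,                                # total experiments to run, root included
    branching: int = 2,
):
    """Best-first search: always expand the most promising node seen so far."""
    tie_breaker = itertools.count()  # keeps heap entries comparable when scores tie
    root = Node(root_plan, run_experiment(root_plan))
    frontier = [(-root.score, next(tie_breaker), root)]
    best, evaluated = root, 1
    while frontier and evaluated < budget:
        _, _, node = heapq.heappop(frontier)
        for plan in propose(node.plan, branching):
            if evaluated >= budget:
                break
            child = Node(plan, run_experiment(plan), parent=node)
            node.children.append(child)
            evaluated += 1
            heapq.heappush(frontier, (-child.score, next(tie_breaker), child))
            if child.score > best.score:
                best = child
    return root, best


def path_to(node: Optional[Node]) -> List[str]:
    """The single root-to-node path that a paper would typically report."""
    path = []
    while node is not None:
        path.append(node.plan)
        node = node.parent
    return path[::-1]


# Toy stand-ins: a plan is a string, and "a" steps help while "b" steps hurt.
def toy_propose(plan: str, k: int) -> List[str]:
    return [plan + c for c in "ab"[:k]]


def toy_run(plan: str) -> float:
    return plan.count("a") - 0.5 * plan.count("b")


root, best = tree_search("", toy_propose, toy_run, budget=7)
print(path_to(best))
```

Branches that score poorly stay in the frontier but are never expanded while better ones exist, which is the sense in which evidence redirects what gets executed next. The full tree remains reachable from `root`, while `path_to(best)` is the one path that ends up written down.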

1:11:38

Speaker C

And again, so I was trying to say this in the most polite way possible, but a critic might say, I don't want to use the word slop, but a critic might say we are producing papers which appear like papers. So they have figures and they have results and they have things written in a certain way, but they're not grounded deep down the epistemic phylogeny, which means near the top of the tree we're seeing some novelty and composition happening, but it doesn't reflect a deep understanding. What would you say to that charge?

1:11:39

Speaker B

It's for sure that not every paper that comes out of the AI scientist V2 is a Nature-worthy publication. Right? That's for sure the case. So definitely there is some amount of, let's say, slop, or content that is not a big scientific discovery, being written up by the AI scientist V2. But ultimately we showed that it was possible to obtain a workshop-level paper. I do think this is sort of the first time, basically, where we can see that at least now we're able to fully autonomously spend compute, spend API calls, to obtain some amount of scientific insight. And for me, at least right now, it's a good way to prototype ideas or to investigate a certain field, get an initial starting point, initial results, and then to work on top of it. But for sure more work needs to be done to make this entire process more robust, more efficient, and essentially produce many more true positives, if you will.

1:12:26

Speaker C

Yeah, and it might be one of these things, you know, like when we moved from GPT-3 to GPT-4, there was just a massive increase in identity because the thing is, with slop, to me it simply means lack of deep grounded understanding. And there's no reason in principle why these things couldn't have a deep grounded understanding. They just don't have it yet. So it's something that could improve over time, but it's likely to improve quite slowly. And then at some point we might just think, oh my God, we've got an AI scientist.

1:13:25

Speaker B

Yeah, I mean, to me this kind of comes back to what we were discussing before. So first off, there is a verifier in the loop, in the sense that experiments are actually executed on a computer, so the numerical results are fed back into the system to come up with the next thing to explore. But we haven't made, let's say, a discovery like residual connections or something that has diffused into everything in machine learning. And I think what we really need is to make these systems much better at integrating knowledge over multiple experiments and better at formulating the next hypothesis based on previous insights. And yeah, this might require some amount of post-training on these traces, basically. But I'm pretty positive that we might also get there with just diversity and scaling these systems up in an efficient way.

1:13:52

Speaker C

I'm just thinking that the first breakthrough discovery, would it resemble the AI scientist paper or would it resemble Shinka Evolve? So for example, we could do like a massively scaled-up Shinka Evolve and we could say, I want to discover a new architectural design, and would that happen? And then we would get the AI scientist to kind of write it up and do ablations and stuff. Maybe that would be the pattern of it.

1:14:52

Speaker B

To a certain degree. I've been thinking a lot about how you can potentially even combine these two paradigms, right: the AI scientist and Shinka- or AlphaEvolve-style optimization algorithms. And I do think there is some amount of work to be done on the auto-verification aspect of it, and on the problem formulation aspect of it. The paper writing part is actually the least important part of the AI scientist. It's a form factor that we humans are sort of used to, and it helps anchor our mental model of scientific discovery. But ultimately I'm not sure if the paper is going to be the knowledge transmission medium in, let's say, 20 years. Something else I've been thinking of a lot is whether or not we can make papers much more easily agentically accessible, in the sense that right now it's a LaTeX document, but you could imagine equipping every paper with several Model Context Protocol (MCP) servers so that every figure is reproducible and data is accessible, and essentially make it much easier for the LLM agents to either replicate work or to work off of it afterwards. Right. Doing sort of epsilon improvements or ablations yourself through that interface to a paper. But to be entirely honest, I'm not sure if it's going to happen, because there have been many great ideas for improving, let's say, the format of scientific artifacts out there, and people still seem to like the paper format, which has existed for, let's say, hundreds of years. So I think it's a question of incentives again, and of really showing that if something like that existed, it would enable much faster progress of AI agents for scientific discovery.
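
[Editor's sketch] One way to picture a paper that agents can work with directly is a small machine-readable manifest shipped next to the LaTeX source. The example below is only an illustration of that idea under stated assumptions: it does not use the Model Context Protocol or any real paper format, and `FigureRecord`, `PaperManifest`, and every field, file path, and command shown are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FigureRecord:
    """Everything an agent would need to regenerate one figure (hypothetical fields)."""
    figure_id: str
    caption: str
    data_file: str   # path to the data behind the figure
    command: str     # command that rebuilds the figure from that data


@dataclass
class PaperManifest:
    """A machine-readable companion to a paper (hypothetical format)."""
    title: str
    figures: List[FigureRecord]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "PaperManifest":
        raw = json.loads(text)
        return cls(raw["title"], [FigureRecord(**fig) for fig in raw["figures"]])

    def reproduction_plan(self) -> List[str]:
        """Commands an agent could run, in order, to rebuild every figure."""
        return [fig.command for fig in self.figures]


# Placeholder content only; no real paper, data file, or script is implied.
manifest = PaperManifest(
    title="Example paper",
    figures=[
        FigureRecord(
            figure_id="fig1",
            caption="Toy learning curve",
            data_file="data/fig1.csv",
            command="python plot_fig1.py data/fig1.csv",
        )
    ],
)
restored = PaperManifest.from_json(manifest.to_json())
print(restored.reproduction_plan())
```

An agent that wants to run an ablation would start from `reproduction_plan()` and change one input, instead of reverse-engineering the figure from the PDF.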

1:15:15

Speaker C

Yeah, paper is a great human interface. It's a similar thing with automated driving, right, in that we could revolutionize the road network to have sensors and we could dramatically improve the monitoring and observability and optimization. But I'm fascinated by that idea. So you're saying not just reproducibility of the experiments, but also the way that the figures are designed and the code and so on, because then we could create this huge playground where agents can repurpose, recombine, restudy work that has been published by other scientists. And it also made me think: does having an automated scientist make peer review more or less important?

1:17:06

Speaker B

I do think it actually makes it more important, at least for now. Right. In the sense that we now have a mechanism, or could have a mechanism, that generates many, many papers, and it first increases the workload on human reviewers, and we need some effective way of filtering and then essentially only taking the cream of the crop for human verification afterwards. So I think for now the ultimate verification is still the human and the diffusion of the result through the community. And we need better tools for doing this automatic filtering and verification. We have the AI reviewer that comes with the AI scientist, but you actually probably need some form of experiment execution for actually verifying everything. But there is, for example, work by OpenAI on PaperBench, trying to go in that direction using sort of LLM soft verification and these types of things. So I'm hopeful that we're going to figure this out in the next years.

1:17:46

Speaker C

Yeah. And I think one of the Rubicon moments is when the new Transformer architecture or something massive is discovered by AI and we're all using it. My worry, I suppose, is that probably folks like Google, who have enough compute power, are going to be running AI scientists and they're going to own many of these discoveries. Which is why it's so important to have work which can efficiently discover new things in science.

1:18:47

Speaker B

And it's important to have work that's openly available. Right. I think, like, with the AI scientist and Shinka, we're really trying to make sure that we can sort of apply the collective intelligence of all of us to shape how this might look in the future.

1:19:11

Speaker C

Amazing. Well, Rob, it has been so fantastic to have you on the show. Sakana is hiring amazing engineers, by the way. So if this sounds like an amazing opportunity, and it is, get in touch with Rob and the guys. And I trust you're working on some exciting new things that are coming up.

1:19:25

Speaker B

Yes. And I hope to be able to talk to you in the future again about some of this.

1:19:41

Speaker C

Absolutely. Rob, thank you so much for coming.

1:19:44

Speaker B

Thank you so much, Tim.

1:19:46