What's Missing Between LLMs and AGI - Vishal Misra & Martin Casado
Columbia professor Vishal Misra explains his mathematical proof that LLMs perform precise Bayesian inference, updating probability distributions as they process new information. He argues that while current AI excels at pattern matching and correlation, achieving AGI requires two breakthroughs: continual learning capabilities and the ability to move from correlation to causation.
- LLMs can be mathematically modeled as giant sparse matrices where each row represents a prompt and columns show probability distributions over possible next tokens
- Transformers perform mathematically perfect Bayesian inference, matching theoretical predictions to 10^-3 bits accuracy in controlled experiments
- Current AI architectures are limited to correlation-based learning (Shannon entropy) and cannot discover new causal models (Kolmogorov complexity) like Einstein's relativity theory
- AGI requires two fundamental advances: plasticity for continual learning after training, and the ability to build causal models rather than just pattern matching
- The 'Einstein test' - training an LLM on pre-1916 physics to see if it discovers relativity - represents a high bar for true AGI capabilities
"You take an LLM and train it on pre1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI."
"They are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue."
"Scale will not solve everything. You need a different kind of architecture."
"Deep learning is still in the Shannon Entropy world. It has not crossed over to the Kolmogorov complexity and the causal world."
"The transformer got the precise Bayesian posterior down to 10 to the power minus 3 bits accuracy. It was matching the distribution perfectly."
Anthropic makes great products. Claude Code is fantastic, Cowork is fantastic. But they are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue. You take an LLM and train it on pre-1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI.
0:00
Just today, by the way, Dario allegedly said that you can't rule out that they're conscious.
0:21
You can rule out that they're conscious. Come on. To get to what's called AGI, I think there are two things that
0:27
need to happen. Five years ago, Vishal Misra got GPT-3 to translate natural language into a domain-specific language it had never seen before. It worked. He had no idea why. So he set out to build a mathematical model of how LLMs actually function. The result? A series of papers showing that transformers update their predictions in a precise, mathematically predictable way. In controlled experiments, the models match the theoretically correct answer almost perfectly. But pattern matching is not intelligence. LLMs learn correlation; they don't build models of cause and effect. To get to AGI, Misra argues, we need the ability to keep learning after training and the ability to move from correlation to causation. Martin Casado speaks with Vishal Misra, professor and Vice Dean of Computing and AI at Columbia University.
0:33
Vishal, it's great to have you in.
1:28
Great to be back.
1:30
This is one of my favorite topics, which is how do LLMs actually work? And I think that in my opinion, you've done kind of the best work on this, modeling it out.
1:31
Thank you.
1:39
For those that did not see the original one, maybe it's probably worth doing just a quick background on kind of what led you to this point, and then we'll just go into the current work that you've been doing.
1:39
Five years ago, when GPT-3 was first released, I got early access to it and I started playing with it. I was trying to solve a problem related to querying a cricket database, and I got GPT-3 to do in-context learning, few-shot learning, and it was, at least to me, the first known implementation of RAG, retrieval-augmented generation, which I used to solve this problem of querying: getting GPT-3 to translate natural language into something that could be used to query a database that GPT-3 had no idea about. I had no access to GPT-3's internals, but I was still able to use it to solve that problem. So it worked beautifully. We deployed this in production at ESPN on September 21st.
1:50
But you did the first implementation of RAG in 2021.
2:40
No, no, no, in 2020.
2:44
20.
2:45
20, 2020. I got it working. And by the time you talked to all the lawyers at ESPN and productionized it, it took a while, but by October 2020 we had it. Well, I had this architecture working. But after I got it to work, I was amazed that it worked. I wanted to understand how it worked. And I looked at the Attention Is All You Need paper and all the other sort of deep learning architecture papers, and I couldn't understand why it worked. So then I started getting sort of deep into building a mathematical model.
2:46
Yeah. And now you've published a series of papers. The first one that I read was the one where you had kind of your matrix abstraction. So maybe we'll talk about that and then we'll talk about the more recent work. So perhaps we'll just start with the first one, which is you're trying to come up with a mathematical model of how LLMs work.
3:19
Yeah.
3:37
And you have, which is very helpful to me. And at the time you were actually trying to figure out how in-context learning was working.
3:37
Yes.
3:42
Yeah. And you came up with an abstraction for LLMs, which is basically this very large matrix, and you use that to describe. So maybe you can kind of walk through that work very quick.
3:43
Sure, yeah. So what you do is you imagine this huge gigantic matrix where every row of the matrix corresponds to a prompt. And the way these LLMs work is given a prompt, they construct a distribution of probabilities of the next token. Next token is next word. So every LLM has a vocabulary, GPT and its variants have a vocabulary of about 50,000 tokens. So given a prompt, it'll come up with a distribution of what the next token should be. And then all these models sample from that distribution.
3:50
So that's the posterior distribution.
4:25
That's the posterior distribution. Right. That's how LLMs work. And so the idea of this matrix is for every possible combination of tokens, which is a prompt, there's a row and the columns are a distribution over the vocabulary. So if you have a vocabulary of 50,000 possible tokens, it's a distribution over those 50,000 tokens.
4:26
And by distribution, it's just the probability.
4:44
Just the probability. Sorry. Yeah, just the probability that the next token should be this versus that. So that's sort of the idea. And when you start viewing it that way, it makes things at least clearer to people like me who want to model what's happening. So concretely, let's say you have an example: let's say your prompt is just one word, protein.
4:46
Yeah.
5:08
So if you look at the distribution of the next word, the next token after that, most of the probabilities would be zero, but you'd have non-zero, non-trivial probabilities on, let's say, two words. One is synthesis, the other is shake. Right. And now the LLM is going to sample this next token and may pick synthesis or shake, or you as a human will give the prompt protein shake or protein synthesis. Now, depending on whether you pick synthesis or shake, that row looks very different. Right. If you pick protein synthesis, the terms that would have a high probability would all be concerned with biology. But if you pick protein shake, it'll all be about gym and exercise and all bodybuilding stuff. So that synthesis or shake completely changes what comes next. Yeah, so this is an example of, you can say, Bayesian updating. You start with protein, you have a prior that after protein this is going to happen. As soon as you get new evidence, that the next term is synthesis or shake, you completely update the distribution. So now you can imagine that the entirety of LLMs is this giant matrix where you have every row: protein shake, protein synthesis, the cat sat on the, Humpty Dumpty, blah, blah, blah. Now, given the vocabulary of these LLMs, let's say 50,000, and the context window. So GPT, for instance, ChatGPT, the first version had a context window of 8,000 tokens. If you look at all possible combinations of 8,000 tokens and a 50,000 vocabulary, the number of rows in this matrix is more than the number of electrons across all galaxies. Right. So there's no way that these LLMs can represent it exactly. Now, fortunately, this matrix is very sparse. Why? Because an arbitrary combination of these tokens is gibberish. We are never going to use that in real life. Also, the columns are mainly zero. Right. If you have protein, then you won't have, you know, arbitrary numbers or arbitrary words after that. It's very sparse, both in rows and in columns. So in kind of an abstract way, what all these LLMs are doing is coming up with a compressed representation of this matrix. And when you give a prompt, they try to approximate what the true distribution should have been and try to generate it. That's what, in my mind, at least,
5:09
it boils down to. And just from my understanding: so if you have a row for protein and then you have one for protein shake, is protein shake a subset of protein, or is it different?
7:53
It's different, it's a continuation from it.
8:10
I see, yeah, right. No, but I'm just saying, like, the actual posterior distribution, is that a subset?
8:11
You can say it's a subset. Right. If you have protein, then protein shake and protein synthesis are all continuations from protein. So both synthesis and shake have non zero probabilities. So you can, yeah, you can think of it as somewhat a subset. Right.
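A minimal sketch of the matrix view he describes, in Python. The prompts, probability values, and vocabulary here are invented for illustration; real models never store this matrix explicitly, they learn a compressed approximation of it:

```python
# Toy version of the "giant sparse matrix" view of an LLM:
# each row is a prompt, and the row holds a sparse probability
# distribution over the next token (illustrative numbers only).
next_token_matrix = {
    ("protein",): {"synthesis": 0.5, "shake": 0.5},
    ("protein", "shake"): {"recipe": 0.4, "gym": 0.3, "for": 0.3},
    ("protein", "synthesis"): {"occurs": 0.5, "in": 0.3, "ribosome": 0.2},
}

def next_token_distribution(prompt):
    """Look up the (sparse) next-token distribution for a prompt."""
    return next_token_matrix.get(tuple(prompt), {})

# Extending the prompt by one token is the Bayesian update he describes:
# the row for ("protein", "shake") is a completely different distribution
# than the row for ("protein",).
print(next_token_distribution(["protein"]))
print(next_token_distribution(["protein", "shake"]))
```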
8:17
You use this approach to describe how in context learning works. And so maybe first describe what in context learning is and then kind of the conclusion that you came from that.
8:33
So in-context learning is when you show the LLM something it has kind of never seen before. You give it a few examples of: this is what you want, this is what you're trying to do. Then you give it a new problem which is related to the examples that you've shown, and the LLM learns in real time what it's supposed to do and solves the problem.
8:44
By the way, the first time I saw this, it absolutely blew my mind. I actually used your DSL when I was, like, first learning about it. So maybe the DSL thing is just... I don't see how this works at all.
9:08
It's absolutely mind-blowing that it works. And so going back to that cricket problem: you know, in the mid-90s, I was part of a group that had created this cricket portal called Cricinfo. Yeah, cricket is a very stat-rich sport. Think baseball multiplied by a thousand, and it has all kinds of stats. And we had created this online searchable database called StatsGuru, where you could search for anything, any stat related to cricket. And it's been available since 2000. Yeah, but because you can query for anything, everything was made available. And how do you make something like that available to the general public? Well, they're not going to write SQL queries. The next best thing at that time was to create a web form. Unfortunately, everything was crammed into that web form. So as a result, you had like 20 dropdowns, 15 checkboxes, 18 different text fields. It looked like a very complicated, daunting interface. So as a result, even though it could solve or it could answer any query, almost no one used it. A vanishingly small percentage of cricket fans used it because it just looked intimidating. And then ESPN bought that site in 2007. I still know the people who run the site, and I always told them, you know, why don't you do something about StatsGuru? And in January 2020, the editor in chief of Cricinfo, Sambit Bal, who's a friend, came to New York, and we had gone out for drinks, and again I told him, you know, why don't you do something about StatsGuru? So he looks at me and says, why don't you do something about StatsGuru? He was joking. But that idea kind of stayed with me. And when GPT-3 was released, I thought maybe I could use GPT-3 to create a front end for StatsGuru. And so what I did was I designed a DSL, a domain-specific language, and converted queries about cricket stats in natural language into this DSL.
9:20
Now, and to be clear, you created this. It wasn't like part of any training, no training online, that it could
11:20
have seen, nothing GPT could have seen. I created it. I thought, okay, this makes sense. So I designed that DSL and then I did that few-shot learning thing. So I created a database of, I would say, about 1,500 natural language queries and the DSL corresponding to each query. So when a new query came in, somebody asking a stats question in English, what I would do is go through the natural language queries, do a semantic search, pick the most closely matching top few, and then use those natural language queries and their DSL and send that as a prefix. Now, GPT-3, if you recall, had a context window of only 2,000 tokens. So you had to be very judicious about which examples you picked. But you pick those, and then you send the new query, and GPT-3 would complete it in the DSL that I had designed, which until milliseconds ago it had never seen. And I had no access to the internals of GPT-3, I had no access to the weights, but still it worked. So that's how.
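Roughly the retrieval-augmented few-shot pipeline he describes, sketched in Python. The embedding function, similarity measure, prompt format, and token budget are placeholders for illustration, not his actual implementation:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: any sentence-embedding model could be used here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_prompt(new_query, examples, k=5, token_budget=2000):
    """Pick the k most similar (question, dsl) pairs that fit the context
    window, then append the new query for the model to complete."""
    q = embed(new_query)
    ranked = sorted(examples,
                    key=lambda ex: cosine(q, embed(ex["question"])),
                    reverse=True)
    prompt, used = "", 0
    for ex in ranked[:k]:
        block = f"Q: {ex['question']}\nDSL: {ex['dsl']}\n\n"
        used += len(block.split())          # crude stand-in for a token count
        if used > token_budget:
            break
        prompt += block
    return prompt + f"Q: {new_query}\nDSL:"

# The assembled prompt is then sent to the model (e.g. GPT-3), which
# continues it in the DSL it has only ever seen inside this context.
```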
11:26
So it's not obvious to me, given your matrix example of a prompt and then a distribution, how something like in-context learning would work. And so I think your first paper tackled this problem, right? And so maybe you could walk through your understanding of how LLMs do in-context learning.
12:34
Yeah. So when you think about what in-context learning is, it's that as you see evidence... So, you know, in the first paper, what I also did was I took this cricket DSL example and I depicted the next-token probabilities of the model as it was shown more and more examples. So the first time you show it this DSL, the natural language and the DSL, the probabilities of the DSL tokens were extremely low, because GPT-3 had never seen this thing. When it saw the cricket question, in its mind it was trying to continue it with an English answer. So the probabilities that were high were all English words. Once it saw my prompt where I had the question and the DSL, and then the next question in the next row, the probabilities of the DSL tokens started going up. With every example, they went up. And finally, when I gave the new query, it had almost 100% probability of getting the right token. So this is an example of, in real time, the model updating its posterior probability. It was updating its knowledge that, okay, I've seen evidence, this is what I'm supposed to do now. This is a colloquial way of saying what Bayesian inference is. Bayesian updating basically is: you start with a prior; when you see new evidence, you update your posterior. That's the mathematical definition. But in English, it's basically: you see new evidence, you update your belief about what's happening. So it was clear to me that LLMs are doing something which resembles Bayesian updating. So in that first paper, I had this matrix formulation, and I showed that what it's doing looks like Bayesian updating. Then we can come to the sort of next series of papers.
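In the Bayesian language he is using, each in-context example is a piece of evidence that moves the model from a prior over "what am I being asked to do" to a posterior. A standard way to write the update he is describing (notation mine, not taken from the papers):

```latex
\underbrace{P(\text{task} \mid e_1,\dots,e_n)}_{\text{posterior after } n \text{ examples}}
\;\propto\;
\underbrace{P(e_n \mid \text{task})}_{\text{likelihood of the newest example}}
\;\times\;
\underbrace{P(\text{task} \mid e_1,\dots,e_{n-1})}_{\text{prior before it}}
```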
12:58
That's right. So, okay, so, I mean, it seemed pretty conclusive to me at that time. And then you went quiet for a while, and then I still remember the WhatsApp text. You said, Martin, I know exactly how these things are working now. And then, listen, you dropped a series of papers that kind of broke the Internet. Like, you went super viral on Twitter. I mean, people really noticed. And so I want to get to that in just a second. But before that, I remember when your first paper came out, people would be like, you know, these things are definitely not Bayesian. Like, you know, anything could be considered to be Bayesian, but they're not. Like, why do you think there was this reaction of, you know, there's something new, they're not Bayesian? I mean, I felt like there was almost kind of a backlash just because they're being characterized as. Yeah, yeah.
15:04
I think in this whole world of probability and machine learning, there have been camps of Bayesians and frequentists. And I don't want to get in the middle of that sort of political battle, but Bayesian has become, like... almost like people had a reaction to that. It's part of that war.
15:55
I see. So it's like the old Bayesian frequentist type battle.
16:15
Yeah. So people just had, oh no, you can say anything is Bayesian. Right. So I said, okay, maybe they have a point. Maybe what we are saying is not really Bayesian. How do we prove that it's Bayesian? So then, first, I have to thank Andreessen Horowitz for this. You know, when I said that in my first paper I showed these probabilities, it was because OpenAI had in its interface this option to display those probabilities. Then they stopped. So we could not peer inside what's happening. For some reason they stopped. OpenAI. I'm not going to get into the open and closed joke, but they stopped. So then we developed our own interface which could let you look not only at the probabilities, but also the entropy of the next token.
16:20
Was this on top of an open source model?
17:16
Yeah. So you can load any sort of open-source model, but being in academia, we didn't have access to compute. Thanks to your generous donation, we got the clusters to run what's called TokenProbe. So you can go to tokenprobe.cs.columbia.edu.
17:18
Is it still running?
17:36
It's still running. It's still running and people come to it. I use it in my classes to get students to do assignments. They write their own DSLs and they say that it really helps them understand how these LLMs work.
17:37
So literally, my understanding of LLMs came from TokenProbe. You sit there and just look at the distribution as you fill out a prompt. It's actually very, very enlightening. So for those of you that are listening, what's the URL again?
17:50
tokenprobe.cs.columbia.edu.
18:03
Yeah, check it out. It's a very, very useful way to actually see how the probability distribution gets updated as you fill out a prompt.
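For a flavor of what a tool like TokenProbe shows, here is a hedged sketch using an open-weights model via Hugging Face transformers. This is not TokenProbe's code, just the same idea: next-token probabilities and their entropy for a given prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # any open-weights model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_report(prompt: str, top_k: int = 5):
    """Return the entropy (in bits) of the next-token distribution
    and the top_k most likely continuations."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]            # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log2(probs + 1e-12)).sum().item()
    top = torch.topk(probs, top_k)
    return entropy, [(tok.decode(int(i)), p.item()) for p, i in zip(top.values, top.indices)]

# Watch the distribution change as the prompt is extended.
print(next_token_report("The protein"))
print(next_token_report("The protein shake"))
```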
18:07
Right. But then I cheated. Oh, I, you know, it was running, but I also had access to the GPUs that were powering it. And then along with colleagues at Columbia and one of them now is at DeepMind, we started to sort of think about how do you really prove that it's Bayesian? To prove.
18:16
Can you just explain it? Actually, I actually don't know the answer to this. Yeah, it seemed to me you proved it in the first paper. Like what was missing?
18:43
Well, in the first paper we showed it, it was empirical, and you could see.
18:50
I see, I see.
18:55
You could see.
18:55
Not a mathematical one. Because it was obvious to me that
18:56
it was even obvious to me, but to convince, you could say, you know, people who dismiss it, who say anything can be Bayesian.
18:58
I see, I see.
19:06
We had to show it precisely mathematically. Got it. So then we came up with this idea. My colleagues Naman Agarwal and Siddharth Dalal, the series of papers were written with them. We came up with this idea of a Bayesian wind tunnel. So what's a wind tunnel? Well, a wind tunnel in the aerospace industry is where you test an aircraft in an isolated environment. You don't fly it; you test it against all sorts of aerodynamic pressures and you see what it will withstand, what kind of altitude, pressure, blah, blah, blah. And you don't want to do that testing up in the air. So we said, okay, why don't we create an environment where we take these architectures, and we tested transformers, Mamba, LSTMs, MLPs, all architectures. We say, why don't we take a blank architecture, give it a task where it's impossible for the architecture to memorize what the solution to that task should be. The space is combinatorially impossible given the number of parameters, and we took very small models. So it's difficult enough that they cannot memorize it, but it's tractable enough that we know precisely what the Bayesian posterior should be. You can calculate it analytically. So we gave these models a bunch of tasks where, again, we show that it's impossible to memorize. We trained these models and we found that the transformer got the precise Bayesian posterior down to 10 to the power minus 3 bits accuracy. It was matching the distribution perfectly. So it is actually doing Bayesian inference in the mathematical sense, given a task where it has to update its belief. Mamba also does it reasonably well. LSTMs can do some of the things. So in the papers we have a taxonomy of Bayesian tasks. The transformer does everything, Mamba does most of it, LSTMs do only part, and MLPs fail completely.
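The comparison he describes can be pictured with a toy task where the true Bayesian posterior is known in closed form. The sketch below uses a coin-weight (Beta-Bernoulli) task purely for illustration, not one of the actual wind-tunnel tasks, and measures how far a model's predicted distribution sits from the analytic posterior, in bits:

```python
import math

def analytic_posterior_predictive(heads: int, tails: int, a: float = 1.0, b: float = 1.0) -> float:
    """Beta-Bernoulli task: after observing `heads` and `tails`,
    the exact Bayesian probability that the next flip is heads."""
    return (heads + a) / (heads + tails + a + b)

def kl_bits(p_true: float, p_model: float) -> float:
    """KL divergence between two Bernoulli distributions, in bits."""
    kl = 0.0
    for pt, pm in ((p_true, p_model), (1 - p_true, 1 - p_model)):
        if pt > 0:
            kl += pt * math.log2(pt / pm)
    return kl

p_true = analytic_posterior_predictive(heads=7, tails=3)
p_model = 0.666   # stand-in for a trained model's next-token probability
print(f"gap from exact posterior: {kl_bits(p_true, p_model):.6f} bits")
# The papers report transformers closing this kind of gap to ~1e-3 bits on their tasks.
```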
19:07
So is this a reflection of the data that it's trained on, or is it more a reflection of the mechanism?
21:18
It's the mechanism, it's the architecture. The data decides what tasks it learns. So in the first paper, we had these Bayesian wind tunnels, and we showed that it's doing the job on different tasks. In the second paper, we show why it does it. So we look at the transformers, we look at the gradients, and we show how the gradients actually shape this geometry, which enables this Bayesian updating to happen. Then in the third paper, what we did is we took these frontier production LLMs which have open weights, so that we could look inside them, and we did our testing, and we saw that the geometries that we saw in the small models persisted in models which are, you know, hundreds of millions of parameters; the same signature existed. The only thing is that because they are trained on all sorts of data, it's a little bit dirty or messy, but you can see the same structure. So the whole idea behind the Bayesian wind tunnel was that, unlike these production LLMs, where you don't know what they have been trained on and so you cannot mathematically compute the posterior... again, how do you prove it? I mean, it looks Bayesian from the first paper. From the first paper, it looks Bayesian, but, you know... so the wind tunnel sort of solved that problem for us. We said, okay, let's start with a blank architecture, give it a task where we know what the answer is and it cannot memorize it, and let's see what it does.
21:27
So do you think this provides any sort of, like, indication of how humans think, or do you think that these things are totally independent?
22:59
No, no. It does provide. Right. So, you know, human beings also update our beliefs as we see new evidence. Right? So we do, in some sense, Bayesian updating, but we do something more than that. I'll come to that. But these transformers, or even mamba do this Bayesian updating. But the difference with humans is we'll update our posterior when we see some new evidence. But the way our brains have evolved over hundreds of millions of years is our optimization objective has been don't die and reproduce.
23:06
Right?
23:53
That's been sort of the driving force. And our brains have learned to adjust. And so when we see some danger, there's something rustling in that bush, don't go near. We know how to react to that danger. We know how to save ourselves. We internalize that learning, and our brain cells, our synapses, remain plastic throughout our lifetime. What happens with LLMs is, once the training is done, those weights are frozen. When you're doing an inference, for instance in-context learning or anything during that conversation, okay, you're doing Bayesian inference. But then you forget; the next time a new conversation starts with zero context, you don't retain any learning that happened in the previous instance. So, for instance, with the cricket DSL that I was doing, every invocation of it was fresh. It did not remember the last time I sent a query what the DSL looked like. So that's one difference between how humans use sort of Bayesian updating, which is we remain plastic all our lives, whereas LLMs are frozen. And there's another sort of difference, which, if you want me to get into, tell me.
23:54
Yeah, yeah, yeah, yeah.
25:20
So the other difference is, well, first, you know, our objective is don't die, reproduce. The LLM's objective is: predict the next token as accurately as possible. Right. So all these scary stories that you read about, that, oh, the LLM tried to deceive and it tried to prevent itself from being shut down, that's not a function of the architecture, that's a function of the training data. It has been fed articles on Reddit or Asimov or whatever.
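The "predict the next token as accurately as possible" objective he contrasts with "don't die, reproduce" is the standard autoregressive cross-entropy loss; written out in standard notation (not taken from the papers):

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
```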
25:21
I mean, just today, by the way, Dario allegedly said that you can't rule out that they're conscious.
25:56
You can rule out that they're conscious. I mean, come on. As I said, you know, Anthropic makes great products. Claude Code is fantastic. Cowork is fantastic. But they are grains of silicon doing matrix multiplication. They don't have consciousness, they don't have an inner monologue. They don't. They're not driven by the same objective function, don't die, reproduce. Right. They're driven by don't make a mistake on the next token. And that's driven entirely by the training data. Right. You train the LLM with stories from Asimov or Reddit where, you know, to survive it's going to do this or that, it'll reproduce that. So it's a reflection, it's not a mind.
26:04
And the results, just to say it for the 10th time, are perfectly Bayesian.
26:48
Perfectly, yeah.
26:54
To the digit.
26:56
To the digit, yeah. I mean, I trained it for 150,000 steps and the accuracy was 10 to the power minus three bits. I could have trained it for, you know... And this happened in half an hour on the infrastructure that you provided for TokenProbe; in the background I could use those GPUs to train. So thank you again for that. But so now, human beings, coming back to it. We are Bayesian, but we do something else. You know, when I throw this pen at you, what will you do?
26:57
Dodge it.
27:28
Dodge it?
27:28
Yeah.
27:29
Why will you dodge it?
27:29
To avoid being hit.
27:32
Avoid being hit. But your head is not doing a Bayesian calculation of, okay, this pen is coming, the probability that it hits me, it'll cause this much pain, or all that. What you're essentially doing in your head is a simulation. You see the pen coming and you know that it'll come and hit you. Your mind simulates and you dodge it. Right. So all of deep learning is doing correlations, it's not doing causation. Causal models are the ones that are able to do simulations and interventions. So, you know, Judea Pearl has this whole causal hierarchy, where the first level in the hierarchy is association, which is you build these correlation models. Deep learning is beautiful. It's extremely powerful. I mean, you see every day all these models are, like, amazingly good. They do association. The second level in the hierarchy is intervention. Deep learning models do not do that. The third is counterfactuals. So both intervention and counterfactuals, you can imagine, are some sort of simulation. You build a causal model of what's happening, and then you are able to simulate. So our brains do that. The current architectures don't do that. Another example which I think will make it clear is the difference between, I'll use the technical terms, Shannon entropy and Kolmogorov complexity. So if you look at the Shannon entropy of the digits of pi, it's infinite. It's impossible to predict and learn what digit will come next. So that's the definition of Shannon entropy. And Shannon entropy sort of tries to build a correlation; it tries to learn the correlation. Deep learning does the Shannon entropy part. Kolmogorov complexity, on the other hand, is the length of the shortest program which will reproduce the string in question. Now, the programs to get the digits of pi are very small. Thanks to Ramanujan, there are all sorts of really small programs that can reproduce it exactly. So the Kolmogorov complexity of pi is very small; its Shannon entropy is infinite. I think deep learning is still in the Shannon entropy world. It has not crossed over to the Kolmogorov complexity and the causal world.
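His pi example can be made concrete: the digit stream looks statistically unpredictable (the Shannon side), yet a few lines of integer arithmetic generate it exactly (the Kolmogorov side). A sketch using Machin's arctangent formula rather than a Ramanujan series, chosen purely for brevity:

```python
def arccot(x: int, unity: int) -> int:
    # arccot(x) = 1/x - 1/(3x^3) + 1/(5x^5) - ..., scaled up by `unity`
    total = term = unity // x
    n, sign = 3, -1
    while term:
        term //= x * x
        total += sign * (term // n)
        n, sign = n + 2, -sign
    return total

def pi_digits(ndigits: int) -> str:
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    unity = 10 ** (ndigits + 10)                 # 10 guard digits
    pi = 4 * (4 * arccot(5, unity) - arccot(239, unity))
    return str(pi // 10 ** 10)                   # "314159265358979..."

print(pi_digits(50))
# A tiny program reproduces the digits exactly (low Kolmogorov complexity),
# even though a pure next-digit predictor can do no better than chance on them.
```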
27:33
Wow, interesting.
30:14
Right? So
30:15
to what extent do you think this provides us research directions to kind of improve the state of the art? So let me just give you a specific example you talked about. Human beings don't actually update, you know, the matrix. They don't kind of update their weights. But right now there's a lot of research on continual learning. Yeah. You know, so does your work provide some guidance of how you might approach those problems? And in particular, I've always had this question, which is, we use so much data and so much compute to create these models. Like, is it even reasonable to think that you could update the weights and actually have a meaningful impact, you know, in real time? I mean, it just seems like you just need so much more data in order to do that. So can you start answering these questions?
30:17
You can start answering some of these questions. And one of the misconceptions that exists today is that scale will solve everything. Scale will not solve everything. You need a different kind of architecture. And this continual learning is a difficult problem. You have to balance the fact that you learn something new against the risk of catastrophic forgetting. If you update the weights and you forget what was important and what you have already learned, then you're not making progress. Then it'll just be some sort of random, chaotic model. So to solve that problem is difficult. That's one aspect of it. So to get to what is called AGI, I think there are two things that need to happen. One is this plasticity, which has to be implemented through continual learning. Secondly, we have to move from correlation to causation. That's...
31:04
How much is this similar to what Yann LeCun talks about? So, Yann LeCun: causality, planning, predicting how your action would...
32:02
It is related. He's coming at it from a different angle than the Judea Pearl model, but it is related. The other thing is, the first time I came on this podcast, I mentioned this test of AGI, the Einstein test, if you remember. So I said, you take an LLM and train it on pre-1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI. I mean, it's a high bar, but, you know, we should have high bars. It won't. And this is the same test that I think Demis mentioned at the India AI summit a couple of weeks ago. It's created a lot of news. But why? Why is that? And how is that related to this idea of Shannon versus Kolmogorov? So at the time of Einstein, there were a lot of clues that with Newtonian mechanics, there was something missing. People knew that Mercury's orbit didn't make sense. There was something off about it. Then there were these experiments done, the Michelson-Morley experiments, where they were trying to figure out this medium called the ether through which light travels. And they felt that if you bounce light in different directions, the speed might change and they could detect a change in the speed of light. They tried several experiments. They had really precise instruments which could measure the speed, and they found nothing. They found that the speed of light did not change at all. Then there was the whole issue of black holes, then gravitational lensing. So there were a lot of these signs that Newtonian mechanics was not really explaining everything. But until Einstein came up with a new representation of the space-time continuum, we were stuck. So if you had a model that just looked at correlations and saw all of these pieces of individual evidence put together, it would not have come up with the beautiful equation that Einstein came up with. You know, I'm forgetting exactly what it is, G mu nu equals 8 pi T mu nu, something like that, the equation of the space-time continuum, the tensor. So he came up with a new formulation. He kind of rejected the existing axioms. He came up with a very short Kolmogorov representation of the world. One equation. From that equation, everything else follows. Right. Whether you're talking about gravitational waves or black holes or Mercury, or how GPS works. You know, GPS, the GPS that we use every day in our phones, it uses the equations of relativity.
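The equation he is reaching for is the Einstein field equation. In geometrized units (G = c = 1) it reads the way he half-remembers it; the more general form includes the cosmological constant:

```latex
G_{\mu\nu} \;=\; 8\pi\, T_{\mu\nu},
\qquad\text{or more generally}\qquad
G_{\mu\nu} + \Lambda\, g_{\mu\nu} \;=\; \frac{8\pi G}{c^{4}}\, T_{\mu\nu}
```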
32:13
So does this end up becoming like. You almost have to ignore the majority of previous data in order to do it, which LLMs can't because they're trained on the majority of previous data. It's like you almost have this kind of data gravity that's pulling you back. It's like everybody said it's X. There's a little bit of evidence that it's Y. But because everybody said it's X, the LLM will always say it's X.
35:27
It'll always say X, and treat that Y as an anomaly.
35:55
Actually, that's a very nice way to say it. Okay, now I get your Shannon entropy versus Kolmogorov complexity distinction. Like, one of them is the total amount of information there, and you will always be bound to the total amount of information there, which is what happens right now.
35:58
Yeah.
36:15
Where you can actually describe
36:16
another.
36:20
Another model. You can describe everything with a shorter description with the new data, which would be a totally different model, which would
36:21
be like, you need a new representation. Right. Yeah.
36:29
You know, another way that I've always thought about these, and I thought you articulated it well the last time we talked about it, which is: the universe is this very, very complex space. And then, you know, somehow humans map it into a manifold that's less complex, and then that gets kind of written down, and then the LLM... So that's kind of some distribution. Some... You know, it's still a very large space, but it's a bounded space. And the LLMs learn that manifold, and then they kind of use, you know, Bayesian inference to move up and down that manifold, but they're kind of bound to that manifold.
36:32
Yeah.
37:07
And then again, I don't want to put words in your mouth, but like, what they can't do is generate a new manifold, which requires understanding the way that the universe works and then coming up with a new representation of the universe.
37:08
And this is what relativity is, right?
37:19
Yeah, exactly.
37:21
Einstein had to create a new manifold. If you just stuck with the old manifold of the Newtonian physics, then you would see these correlations, but you could not come up with a manifold that explained them. So you need to come up with a new representation. So to me, there are lots of definitions of AGI. Turing tests, we have already passed that. Performing economically useful work every day. You see LLMs are doing that.
37:21
Do we? I don't know.
37:49
No, I mean they are.
37:50
I mean without human intervention.
37:51
No, no, no. So that, that's different. But still, you know, it's like a car can run faster than humans, right?
37:53
I mean that's a, that's the. Yeah, that's a. Yeah, that's a very shallow definition.
37:59
Yeah.
38:03
So all these definitions do useful, you
38:03
know, maybe, you know, in six months you'll have Claude or Gemini do, without intervention, coding tasks which are well defined and well scoped. But to me, AGI will happen when these two problems get solved: plasticity, continual learning properly, and building a causal model in a more data-efficient manner.
38:05
We are hearing people now talking about, you know, seeing generality. Like Donald Knuth, for example, in the last few days, right, had this, you know, aha moment that apparently kind of went viral on X. So do you think that suggests that we're seeing generality, or...
38:34
No, no, no. So that actually to me it validates what I've been talking about for a while now.
38:52
How so?
38:59
So, if you read what he did with the help of, you know, a colleague: he got the LLMs to solve this particular problem of finding Hamiltonian cycles for odd numbers, we won't get into that. And he got the LLMs to keep solving for one odd number after the other, right. What he also got it to do is, after it found a solution for a particular value of M, he made the LLM update its memory with exactly what it learned in solving that problem. So the LLM tried many different things, something worked, update the memory. So that's kind of like hacking together plasticity.
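The "hacked-together plasticity" he describes amounts to a loop roughly like the sketch below. The `call_llm` and `looks_valid` functions and the prompt format are placeholders for illustration, not Knuth's actual setup:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat model is being used."""
    raise NotImplementedError

def looks_valid(attempt: str, m: int) -> bool:
    """Placeholder for the problem-specific verifier."""
    return "cycle" in attempt.lower()

def solve_with_persistent_memory(cases, max_attempts=5):
    memory = ""                                   # notes carried across problems
    solutions = {}
    for m in cases:                               # e.g. successive odd numbers
        for _ in range(max_attempts):
            prompt = (f"Lessons learned so far:\n{memory}\n\n"
                      f"Find a Hamiltonian cycle for the case m={m}.")
            attempt = call_llm(prompt)
            if looks_valid(attempt, m):
                solutions[m] = attempt
                # The weights never change; only this context-level memory does.
                memory += call_llm(f"Summarize what worked for m={m}:\n{attempt}\n")
                break
    return solutions
```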
39:00
Yeah, right.
39:43
It's learning what it has done as we went along. Again, it's a hacked version of it. You're not changing the weights, you're just sort of improving the context. But as you learned... and even after that, this whole space of Hamiltonian cycles and the associated math is well represented in the manifolds that these LLMs have been trained on. You just had to find the right connection. And LLMs, you know, you throw enough compute at them, they will find the right connection. So Knuth was able to see the LLM's attempts, and eventually it needed him to put together what he saw into a solution. It definitely helped him get to the solution, but he had to create the new sort of manifold to come to the solution. The LLMs were, after a while, stuck. Right. You read what he's written. I mean, it's just hot off the press, I think two days ago. Two days ago. But eventually he used the solutions and he came up with the proof. So it's like Einstein saw all these pieces of evidence, then he thought about what would explain them, and he came up with a causal model. So Knuth and his brain are sort of...
39:44
The Kolmogorov part is the human.
41:13
Right. And the LLMs are extremely efficient at doing the Shannon part of it. It found all the solutions by trying, you know, various things and learning more
41:16
and more clever ways to decompose it. I'm wondering, like, do you think this... again, I'm going to ask the same question again, which is, do you think this provides some sort of insight on, like, the next problem to tackle? Like, is there a mechanism that will get the Kolmogorov complexity or not? Like, is this...
41:25
It tells us which direction to pursue,
41:43
but clearly not how to do it.
41:47
Like not how to do it. But even Kolmogorov complexity has largely remained sort of a theoretical construct.
41:48
Yeah, for sure. There's no algorithm, there's no.
41:54
There haven't been practical implementations of finding the shortest program. We know it exists. You know, you can argue about it, but. So that's where I think it's my bias. That's where our energy should be focused, not larger models with more tokens.
41:58
And can you tie the two things? Like how does that pair with doing simulation? Or is that simulation totally orthogonal?
42:15
No, simulation is related. Right.
42:23
So you think it like, basically you do simulation and somehow that is a step towards doing the Kolmogorov complexity.
42:26
The simulator is the program that we create. It may not be the perfect program.
42:35
Oh, I see.
42:41
But in our heads, we create this simulator that when I'm throwing the pen, you know that it's coming at you.
42:42
Right.
42:46
And you duck. So you're not computing the probabilities as it goes, but you have, you know,
42:47
you build an actual simulation, versus we are talking more conceptually?
42:54
Conceptually. But it's the same mechanism.
42:57
And you think those are the same mechanisms.
42:59
It's the same mechanism.
43:00
Really?
43:01
Yeah. You have to build a causal model.
43:01
Right, I see.
43:04
For most things. Right. So you have to move from correlation to causation. I mean, we've heard this term
43:05
ad
43:13
infinitum, but here it's making a difference in the way we view intelligence.
43:13
How have the last three papers been received?
43:21
No, I don't know. Well, the arXiv versions... let me tell
43:24
you, a lot of great reception. A lot of people read it. I'm just wondering what kind of feedback that you've got.
43:28
I'm getting good feedback, but I'm an outsider in this field.
43:36
Networking guy.
43:40
I'm a networking guy. Why is he writing about learning and machine learning and deep learning and Bayesian inference? But from people who have actually taken the time to read those papers, I'm getting really good feedback. There was a recent paper by Google Research which tried to teach LLMs, by some sort of RLHF, to do Bayesian learning properly. And that's going in this direction. I think people are coming around to the view that, okay, LLMs are doing Bayesian learning. I know that some people also looked at the Bayesian wind tunnel paper, the arXiv version, and they reproduced the experiments. That's great. They just saw what was written and they did the training and they saw, yeah, this is actually happening. So that's great.
43:41
So what's next?
44:25
What's next is, you know, these two parallel tracks. I hope to make progress there. Plasticity and causality.
44:28
Because to date you've taken an existing mechanism.
44:38
Yeah.
44:42
And you've created a formal model of how it works.
44:42
Yeah.
44:44
And so now you're actually interested in improving, creating a new mechanism.
44:45
Yeah, yeah.
44:49
And do you think it's an entirely different architecture or do you think. Do you think LLMs are like, part of the solution?
44:50
I think LLMs are definitely part of the solution. I see. But. But there has to be something more. So, you know, I was not interested in sort of cataloging what all these LLMs can do.
44:56
Yeah.
45:06
I was more interested in why are they and how are they doing it. I think now we have a good grip on the why and how and the next step is to move them to the next level. Now I think we have a fairly good understanding of what the limits are. Now how do you go to the next step?
45:06
Is there an equivalent kind of theoretical framework for causality that applies here? Like, similar to, like Bayesian for inference?
45:28
Well, Judea Pearl's whole causal hierarchy, I think.
45:38
I think that's the right one.
45:41
That's a very good one. You know, the whole do-calculus approach, I think it's a good way to think about it. You know, the sort of association, intervention, counterfactuals. It takes you from correlation to causation and simulation in a mathematical way.
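One concrete piece of the machinery he is pointing to: in Pearl's notation, the interventional quantity differs from the observational one, and under a back-door adjustment the interventional one can be computed from observational data. This is the standard textbook identity, not something specific to this conversation:

```latex
P\!\left(y \mid \mathrm{do}(x)\right) \;=\; \sum_{z} P(y \mid x, z)\, P(z)
\;\;\neq\;\;
P(y \mid x) \;=\; \sum_{z} P(y \mid x, z)\, P(z \mid x)
```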
45:43
That's great. All right, well, listen, really appreciate you coming. This is awesome. So we had you here for the first paper, where you had the empirical results. Then we had you back when you actually had, like, the formal proof. And hopefully the next time you come back you will have a proposal for the mechanism that actually provides the next step.
46:02
Hopefully. Yeah.
46:21
All right. We're working on it.
46:22
Thank you for having me.
46:24
Thanks for listening to this episode of the a16z podcast. If you liked this episode, be sure to like, comment, subscribe, leave us a rating or review, and share it with your friends and family. For more episodes, go to YouTube, Apple Podcasts, and Spotify. Follow us on X @a16z and subscribe to our Substack at a16z.substack.com. Thanks again for listening, and I'll see you in the next episode. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
46:29