Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas

336 | Anil Ananthaswamy on the Mathematics of Neural Nets and AI

74 min
Nov 24, 2025
Summary

Anil Ananthaswamy explores the mathematical foundations of modern AI and neural networks, tracing the history from the perceptron through backpropagation to transformer architectures. The discussion clarifies how classical mathematics applied to extraordinarily high-dimensional spaces enables deep learning systems to function, while examining both the capabilities and fundamental limitations of current large language models.

Insights
  • Neural networks don't discover symbolic equations or new conceptual frameworks like Kepler did; they learn statistical patterns from data, making them fundamentally different from human scientific reasoning
  • The mathematical foundations of modern AI are largely classical (linear algebra, calculus, gradient descent) rather than cutting-edge mathematics, but applied at trillion-parameter scales that only computers can handle
  • Large language models have no mathematical guarantee of correctness and are sample-inefficient compared to human learning, suggesting fundamental architectural changes rather than scaling alone will drive next breakthroughs
  • The attention mechanism in transformers works by having word embeddings contextualize each other through matrix operations, allowing models to understand relationships between distant words in a sentence
  • Kernel methods solve the curse of dimensionality by computing dot products in infinite-dimensional spaces without explicitly entering those spaces, enabling efficient nonlinear classification
Trends
  • Scaling alone insufficient for AGI: Industry consensus shifting toward need for architectural innovations beyond larger models
  • Sample efficiency gap: Current AI requires vastly more training data than humans, indicating missing learning mechanisms
  • Mathematical transparency in AI: Growing emphasis on understanding rather than just engineering AI systems
  • Transformer architecture dominance: Attention mechanisms becoming standard across AI applications beyond language
  • Hybrid symbolic-neural approaches emerging: Recognition that pure statistical learning cannot replicate human scientific discovery
  • Computational efficiency focus: Engineering optimization becoming as important as algorithmic innovation
  • Interpretability challenges: Difficulty extracting symbolic knowledge from trained neural network weights remains unsolved
  • Multi-step reasoning limitations: LLMs struggle with problems requiring conceptual leaps not present in training data
Topics
Perceptron Convergence Proof, Backpropagation Algorithm, Gradient Descent Optimization, Neural Network Architecture Design, Transformer Models and Attention Mechanisms, Large Language Model Training, Curse of Dimensionality, Principal Component Analysis, Kernel Methods and Kernel Machines, Loss Landscape Optimization, Stochastic Gradient Descent, Deep Learning Mathematics, Sample Efficiency in Machine Learning, Hopfield Networks and Memory Storage, Sigmoid Activation Functions
Companies
Google
Referenced as using neural network technology in Google Maps for image recognition and prediction tasks
OpenAI
Mentioned as developer of GPT-4 and large language models discussed throughout the episode
Anthropic
Referenced as developer of Claude, a large language model competing in the LLM space
MIT
Ananthaswamy received the Knight Science Journalism Fellowship at MIT, where he studied deep learning and neural networks
Stanford University
Bernie Widrow developed adaptive filter algorithms at Stanford that preceded modern backpropagation
Cornell University
Frank Rosenblatt developed the perceptron at Cornell; Kilian Weinberger taught influential ML courses
Intel
Ted Hoff joined the startup Intel after his PhD, becoming one of the first microprocessor designers
Lockheed
Donated analog computer to Stanford used by Widrow and Hoff to test early neural network algorithms
People
Anil Ananthaswamy
Science writer and author of 'Why Machines Learn'; former New Scientist editor; engineer turned AI researcher
Sean Carroll
Host of Mindscape podcast; physicist and author discussing AI implications with Ananthaswamy
Frank Rosenblatt
Cornell psychologist who invented the perceptron, the first artificial neural network, in the late 1950s
Bernie Widrow
Stanford researcher who developed adaptive digital filters and least mean square algorithm, precursor to backpropagation
Ted Hoff
Stanford PhD student who worked with Widrow to design LMS algorithm and first hardware artificial neuron
John Hopfield
Physicist who developed Hopfield Networks using condensed matter physics concepts for associative memory
Marvin Minsky
Co-authored the 'Perceptrons' book, proving single-layer networks cannot solve the XOR problem and influencing the first AI winter
Seymour Papert
Co-authored 'Perceptrons' with Minsky; work influenced shift away from neural networks research
Geoffrey Hinton
Persisted in neural network research during AI winter; contributed to backpropagation algorithm development
David Rumelhart
Co-authored 1986 Nature paper on backpropagation algorithm with Hinton and Williams
Ronald Williams
Co-authored 1986 Nature paper introducing backpropagation for training multi-layer neural networks
Kilian Weinberger
Cornell professor whose 2018 machine learning lectures influenced Ananthaswamy's deep learning education
Johannes Kepler
Historical scientist used as example of conceptual discovery that neural networks cannot replicate from limited data
Isaac Newton
Referenced as originator of calculus concepts underlying modern gradient descent optimization
Albert Einstein
Referenced as example of conceptual breakthrough (relativity) unlikely to emerge from LLM training data
Quotes
"It's not like you're developing new math in order to be able to do this. It's not like you're using the most advanced reaches of modern category theory or topology to figure things out. You're applying math in very, very, very large, dimensional spaces that only computers can really handle."
Sean Carroll (Introduction)
"The perceptron convergence proof was a huge statement to make in computer science terms back in the 1950s that an algorithm will is guaranteed to work. And it's a very, very simple proof that uses just basically linear algebra."
Anil Ananthaswamy (Perceptron discussion)
"If we had an LLM that had data about physics that happened until 1915, and then could it come up with Einstein's theory of relativity, without having anything in the data about relativity? Very, very, very unlikely."
Anil Ananthaswamy (LLM limitations discussion)
"Scaling up alone is not going to get us to a place where we are 100% sure of the accuracy of the model. We're probably one or two steps away from something quite transformative."
Anil Ananthaswamy (Future of AI discussion)
"The attention mechanism is essentially the process that allows the transformers to contextualize these vectors and it's a whole bunch of matrix manipulations. It's just very, very neat matrix math going on."
Anil Ananthaswamy (Transformer architecture explanation)
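To make the quoted description concrete, here is a minimal, illustrative sketch of single-head scaled dot-product attention, the "matrix math" Ananthaswamy refers to. The sequence length, embedding size, and random weight matrices below are assumptions for demonstration only, not anything taken from the episode or the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sentence": 4 tokens, each already embedded as an 8-dimensional vector.
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))

# Learned projection matrices (random here, purely illustrative).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values

# Scaled dot-product attention: each token scores every other token...
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

# ...and the output mixes the value vectors by those weights:
# every token's embedding is now contextualized by the others.
contextualized = weights @ V
print(contextualized.shape)   # (4, 8): same shape as the input embeddings
```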
Full Transcript
Hello, everyone. Welcome to the Mindscape Podcast. I'm your host, Sean Carroll. We all know that artificial intelligence in various forms has been exploding over the last couple of years. After decades of effort and various summers and winters in AI research, we clearly have crossed some threshold where AI is being put into use in all sorts of different places. Now, we can debate the words artificial intelligence, is it really intelligence? It's the large language models, which are a particular approach to AI, that have really gotten all the attention lately. They're based on a broader idea called neural networks or deep learning, which has been all over the place for a long time; something like Google Maps uses this kind of technology. But now, with the more human-like behavior of AI in the form of large language models, they have become much more ubiquitous. And there's been a wild range of reactions to what is going on. Some people saying that maybe they'll become superintelligent and take over the world, and that's a danger. Other people just complaining that they can't download new software or an app without it being infused with AI that they don't really want. So I'm not myself very sure what the long-term impact of AI is going to be, at least in this sort of large language model incarnation. A couple of years ago, when it first became a big thing, I said that it's probably somewhere between the impact of cell phones and electricity. And still, that's a lot of impact one way or the other. Cell phones have had a lot of impact on our lives, but not really completely changed the way we live. I think that's a minimal expectation for the impact that AI will have, for better or for worse, whereas the larger thing, the impact level of electricity, is maybe the upper level of where it could possibly reach. None of us knows. That's my personal guess. Various people have very strong opinions one way or the other. Many of them are more educated than mine. But we all should be a little bit educated about what this technology is that is affecting us so much. So that's what we're here to do today. Today's guest is Anil Ananthaswamy, who is a science writer, actually a former editor at New Scientist, and now mostly a freelance science writer. But he got his start in engineering, and computer engineering in particular. And when large language models came along, he became sort of re-fascinated by this aspect of technology and dived into it.
So in addition to being a science writer, he's actually thinking at quite an advanced level about AI and what it is. And the book that has resulted from this thinking is called Why Machines Learn: The Elegant Math Behind Modern AI. It came out last year; this is my fault for waiting so long to talk about it. Anil is a person, a friend of mine, that I've known for a long time. But it's a book chock full of big ideas and mathematics. It's exactly in the spirit of my own Biggest Ideas books, et cetera, that really doesn't just say, well, you know, AI is going to take over the world. It says, here is why you need to understand how to diagonalize a matrix to understand what AI is really telling us. Now, there's a lot more in the book than we could possibly cover in a one-hour podcast, so we hit some highlights. But I think that for me, this conversation was extraordinarily helpful in clarifying which advance in AI technology came first, what the importance of it was, what it led to later. The actual math that is used, which is on the one hand fascinating, on the other hand, as we say in the podcast, mostly classical, in the sense that it's not like you're developing new math in order to be able to do this. It's not like you're using the most advanced reaches of modern category theory or topology to figure things out. You're applying math in very, very, very large-dimensional spaces that only computers can really handle. And that's enough to do something that is very, very different than anything that's been done before. So trying to understand it the best we can is as worthwhile as ever. Let's go. [Music] Anil Ananthaswamy, welcome to the Mindscape Podcast. Thank you, Sean. It's my pleasure. So you've written a book about AI. That doesn't single you out. There's lots of people who've written books about AI. You've decided to write a book about the mathematics behind AI, which is an interesting choice. What is it that led you to that? Oh, it began well before the AI craziness came about. I think sometime in 2016, 2017, I started noticing a whole bunch of stories that I was beginning to do as a journalist that had a machine learning component. When I became a journalist, this was when I transitioned from being an engineer to being a journalist, I was writing mostly about physics and neuroscience. And when I would write about those subjects, like particle physics, I was happy just doing as much research as I could and understanding it to the best of my ability and then writing about it. I never had any illusions about being able to do particle physics or do neuroscience. But when I started encountering stories in machine learning, when I would talk to the researchers explaining their algorithms, their machine learning models, I certainly felt, hang on, this is something I could do. I could do this. Not in the way they were doing it, but I could certainly get my hands dirty, because of the software background, because of my engineering background. And so what happened was I got a fellowship at MIT, the Knight Science Journalism Fellowship. And as part of that fellowship, we had to do projects. And the project I took on was essentially teaching myself deep learning. So the project question was, could a deep learning or deep neural network do what Kepler did? So it was trying to build a neural network that would try and predict the future positions of planets, given the kind of data Kepler had access to. OK.
And the short answer, very quickly, I found out: absolutely not. There's no way a neural network would do what Kepler did, in that Kepler had access to literally a few tens of positions of the orbits of Mars and Jupiter, very, very little data that Tycho Brahe had collected. And so then I ended up writing a simulation to generate loads and loads of data, and learned how to train very simple deep neural networks to make predictions about planetary positions given years and years of data. And that was still just empirical stuff. I had gone back to CS101 and taught myself coding in Python. It had been maybe almost 20 years or so since I had done any coding. So I had to sit in a class with teenagers and teach myself coding. That was fun. And I think at some point, what happened towards the end of that fellowship was COVID happened. And we all got locked up in our apartments. And my interest shifted to wanting to understand more about machine learning. It wasn't enough to just sit and do some coding. I felt like I needed to get under the skin of this thing. So I just started watching lectures from Cornell, from MIT. There was one professor at Cornell, Kilian Weinberger, that I discovered. It was a 2018 class that he gave, which is online even today. And it's just him giving his talks to his students. It is not produced for YouTube. It's amazing. There's nothing slick about it. It's just a professor and the students. Fantastic stuff. I got sucked into that and kept learning more and more of the math. And at some point, I think the journalist and the storyteller in me woke up again saying, hang on, this math is actually quite lovely. There are stories here to be told. But because I was so steeped in the math by then, I mean, again, I don't want to make it sound as if I did a lot of math. I was steeped in the math relative to the kind of math I was steeped in before; it was not very much. So this was more for me. For actual machine learning practitioners, this is pretty simple math. But for me, it was a fair bit. And I think it was that desire to communicate the beauty of the math that I was encountering and to tell the stories about it. So some combination of getting really stuck into the math. So I remember when I made my book proposal to my editor, and, you know, you and I share an editor, for disclosure. We share. And I was pretty sure that he was going to say no, given how much math I was proposing to put into the book. But I made it very clear that I was going to put the math in there. I saw it in the book. Yes, exactly. I realized that later. Absolutely. So yeah, that's how it came about. It wasn't a book that was written with a desire to ride the AI wave, because this was proposed in 2020, well before all of the craziness came about. And I just wanted to share in the beauty of the math that I was encountering. And I need to dig into that Kepler story a little bit because I think it's secretly profound. I mean, the idea, would you still think that it's true that a machine learning or some sort of deep learning algorithm, given the data that Kepler had, would not be able to come up with Kepler's laws? Somehow, it seems like that must depend on the space of possible theories that the LLM or whatever it is has access to. But I'm not quite sure what's going on there. Yeah.
So well, first of all, it's unlikely that we would use an LLM large language model to solve this problem because the data is simply, in this case, what the data that I was using is just orbital positions or planets. And I was teaching a neural network to learn about the patterns that exist in this data. And then it was like a time series. Then I would just try to predict into the future where those planets were. And if you have enough data, you can train deep neural networks to learn the time series and make predictions going into the future. But what they don't do is they will not give you a symbolic form of the equation. The equation might be there in the system, but it has no way of spitting out some symbolic form of Kepler's laws. So, but the network is embodying in its weights. Something that is very similar to what Kepler would have figured out. The question is, how do we extract that out of that network? And so that's one problem. The second problem is what you were alluding to is that the amount of data that Kepler had access to, there's no way. Today's neural networks are extremely sample inefficient. So they require too much data to do what they need to do. And so we're certainly missing something in our AI models in terms of being able to learn the way humans do. It's also true that Kepler came with a whole bunch of prior knowledge. And he was a smart fellow. So he was obviously coming with a whole bunch of inbuilt knowledge about geometry, about calculus, and all of these things. And so we have to take that into account also. So maybe large language models, which have been trained on a whole bunch of human text, have that kind of prior knowledge built in. And it would be interesting to see how one would solve this problem using a large language model. I wasn't using large language models. This was well before LLMs came on the scene. And I was just simply training these things called LSTM, which are the current neural networks, which are good for time series. I went to a seminar recently. There was supposed to be an intro to the power of AI for physicists and cosmologists in particular. And the speaker started the seminar. He had a data set, a data set that from LIGO, from a particular gravitational black hole in the spiral. And he basically had, he had cooked up ahead of time a one-page long prompt. And he fed the data set in the prompt to, I think it was a large language model, certainly a deep learning thing. And basically, the LLM wrote the paper. So it wrote a bunch of Python scripts to analyze the data. It made the figures. It wrote the paper. It embedded the figures. It found the references and everything. And it was finished by the time the seminar had finished an hour later. And I'm open. Of course, you'd want to check it that didn't make mistakes, right? You would definitely want to check it very, very carefully. But I'm open to the possibility that that kind of science is doable by deep learning methods, whereas Kepler's kind of science, where you're literally coming up with a new conceptualization of what's going on, seems to be much harder. Yes. And I think I would very much agree with that. Someone I was listening to just recently about this exact issue. And he was pointing out that if we had an LLM that had data about physics that happened until 1915, and LLM, and then could it then come up with Einstein's theory of relativity, without having anything in the data about relativity? Very, very, very unlikely. 
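The planetary-prediction project described here used recurrent networks (LSTMs) trained on simulated orbital data. As a rough illustration only, here is a minimal sketch of that kind of time-series setup in PyTorch; the synthetic sine-wave data (standing in for one orbital coordinate), network sizes, and training loop are all assumptions, not the actual project code.

```python
import math
import torch
import torch.nn as nn

# Synthetic "orbit-like" signal: a sine wave standing in for one planetary coordinate.
t = torch.linspace(0, 20 * math.pi, 2000)
series = torch.sin(t)

# Build (window of past values -> next value) training pairs.
window = 50
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

class TinyLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)           # out: (batch, window, hidden)
        return self.head(out[:, -1])    # predict the next value from the last step

model = TinyLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):                  # a few epochs just to show the training loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    print(epoch, loss.item())
```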
I wouldn't even say unlikely. I was going to say impossible at this point. Right. It's just a different kind of thing. Yeah. Well, OK, good. So we're here to learn about what it is actually able to do, and why it's able to do it. Should we just start by going back to the beginning, to the perceptron and the first neural networks? Yeah, absolutely. The perceptron, the first neural network, started in the late 1950s. The perceptron was designed by Frank Rosenblatt. He was a Cornell University psychologist. And the perceptron is essentially a single-layer artificial neural network. An artificial neural network is simply a network of artificial neurons. And an artificial neuron is very simply a computational unit. It takes in a bunch of inputs, does some sort of weighted sum of those inputs, adds a bias term. And then if that weighted sum plus bias exceeds some threshold, it will produce a one. Otherwise, it produces a minus one. That was, in essence, Rosenblatt's artificial neuron. And he showed how you could use a series of such artificial neurons lined up vertically, so one layer of them, to do some sort of linear classification. So if you had, for instance, images of one kind of digit, let's say the digit nine, and images of another kind of digit, let's say the digit four, something that looks very different. And if these were, let's say, 20 pixels by 20 pixels, black and white images, then each image can be effectively turned into 400 pixels. And if you were to map each pixel along one axis, then in 400-dimensional space, each of these images becomes a point. And so all the images of the digit nine will be in one location of this 400-dimensional space, and the digit fours will be somewhere else. And as long as those things are pretty distinct, and you can draw some sort of hyperplane separating out those two regions, the perceptron algorithm was guaranteed to find one such plane. As long as the data was linearly separable, in any kind of dimensional space, the perceptron would find it. And this was a big deal. This was a very big deal. In fact, when we were talking earlier of the math that inspired me to write the book, it was actually the perceptron convergence proof, which came a few years after he came up with the algorithm. So he first comes up with the algorithm empirically, how to do this. And then people get into the act of trying to figure out mathematically, you know, the properties of this algorithm. And there was something called the perceptron convergence proof, which basically said that if the data is linearly separable, then the algorithm will find it in finite time. And this was a huge statement to make in computer science terms back in the 1950s, that an algorithm is guaranteed to work. And it put a lower bound on it, saying this will definitely work. And it's a very, very simple proof that uses just basically linear algebra, and nothing more. And if you look at the proof, it's so lovely that I did put it in one of my chapters. And at the beginning of that section, I tell the reader that you really don't have to read this to read the rest of the book. But I should also tell you that if it weren't for this proof, I would not have written the book. It's a way of teasing them into reading that part of the book. Sorry, I just wanted to say that this trick is actually due to a British novelist called Somerset Maugham. He had a novel called The Razor's Edge.
And it's a very interesting book. And there's a chapter somewhere in the middle where he says to the reader, he addresses the reader directly, saying, dear reader, you don't have to read this chapter because it won't change the rest of the book. But I should tell you that if it weren't for this chapter, I wouldn't have written the book. You steal from the best. That's what we all do. Yes, writers. That's okay. I want to dwell on this, not the proof, but the linear separability idea, because it is kind of deep but also hard to visualize. You're saying that I have a 20-by-20 grid of pixels, and I can think of that as a single point in a 400-dimensional space. Once you've done a certain amount of math in your life, that's obvious. Before you've done that amount of math, that's almost impossible to quite wrap your head around. But then the point is that all of the nines kind of cluster in a group, hopefully, in that 400-dimensional space, and all of the fours cluster somewhere else. And so if a new data point comes in that might not be any of the existing data points, you can say it's closer to one cluster than the other one, right? Right. So in the case of the perceptron, it will find a hyperplane. And so this will be a 399-dimensional plane that will separate out the two classes of data. And it doesn't guarantee that it will find an optimal hyperplane. It will just find something. Because if there is a gap between the two clusters of data, there's, in principle, an infinity of hyperplanes that will pass through that. So it'll find one; the first one that it finds, it stops. And it might not be an optimal one. But then what happens is, when you give it a new digit, saying, okay, tell me whether this is a nine or a four, it actually doesn't matter whether it's a nine or a four. All it does is it's going to say, is it to this side of the hyperplane or is it to that side of the hyperplane? And it'll classify it as such, based on which side of the hyperplane the new digit falls on. So, you know, you already identified a problem with the algorithm. It's much easier to think of this now, because the same thing applies to images of cats and dogs. You could have, you know, a thousand-by-thousand image of a cat and of a dog. So these would now be dots in million-dimensional space, right? And the hyperplane will separate the two sets of images, the cats on one side, dogs on the other. Now, you bring in an image of a horse. The classifier has no idea what a horse is; all it does is ask which side of the hyperplane this image falls on, and it's going to call the horse either a cat or a dog. Right? But again, going back to the late 1950s, early 1960s, this was a very big deal because, you know, you are basically saying we can recognize images. We can classify images, and classification is the first step towards recognition. And I presume that it's not that hard to let the computer know that we're going to introduce a new category called horses, and then we're going to separate that space into three subsets. Yeah. So, you know, then you can have classifiers that are classifying into multiple categories. Yes, very much so. But the basics began with that first linear classifier.
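For readers who want to see Rosenblatt's update rule concretely, here is a minimal sketch of the perceptron algorithm on made-up, linearly separable two-dimensional data; the data, initialization, and stopping test are illustrative assumptions, not anything from the book.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two linearly separable clusters in the plane (stand-ins for "nines" and "fours").
class_a = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([1] * 50 + [-1] * 50)           # +1 / -1 labels

w = np.zeros(2)                               # weights
b = 0.0                                       # bias

# Rosenblatt's rule: whenever a point is misclassified, nudge the hyperplane toward it.
# The convergence proof says this loop halts in finite time if the data are separable.
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:            # point is on the wrong side of the hyperplane
            w += yi * xi
            b += yi
            converged = False

print("separating hyperplane:", w, b)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```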
And at the same time, there was another researcher who doesn't actually get talked about as much, but whose work was just as seminal, and this was Bernie Widrow, who was at Stanford. And he was somebody who had been working on designing adaptive digital filters. So, filters that learn about the characteristics of the signal that they're processing and learn how to separate noise from signal adaptively. So they are learning on the fly. And he was very much steeped in that, and then realized that the techniques that he was using to build his adaptive digital filters were actually exactly the same techniques he needed to build an artificial neuron that learned a linear classifier, exactly like what Rosenblatt was doing, but Widrow's approach was very different. And in a very fundamental sense, the algorithm that Widrow came up with to build his linear classifier is actually the true precursor to today's backpropagation algorithm, which is used to train artificial neural networks. And, talking about stories, there's an amazing story about how the Widrow-Hoff least mean square algorithm came about. Widrow was an assistant professor at Stanford in the late 1950s, and a young student comes in wanting to see if he can do a PhD with Widrow. And so Bernie Widrow starts scribbling some stuff on the blackboard trying to tell him about adaptive filters. And in the course of two hours of discussing what a PhD project would be like, they end up designing what's today called the least mean square algorithm. They realize that they have designed an algorithm to train a simple artificial neuron. And then the duo walk across the room; there's an analog computer out there that Lockheed has donated to Stanford. And Hoff, who's a kid, you know, looking for a PhD thesis project, goes and programs the analog computer to simulate the algorithm and shows that it works. And now it's Friday evening by then at Stanford and all the supply rooms are closed, and they want to build this thing in hardware. So they walk across to Zack Electronics, buy all the stuff that they want, go over to Ted Hoff's apartment, and over the course of the weekend build the world's first hardware artificial neuron. Monday morning, they have it working. Right. And what they had was a very, very crude version of... So, most people who are following AI by now will know of this algorithm called backpropagation. And this is connected to the idea of doing gradient descent and stochastic gradient descent; these are algorithms that are used for optimizing the parameters of your model. And what Widrow and Hoff had done was an extremely noisy version of stochastic gradient descent. They had come up with an algebraic formulation: instead of using any calculus or anything, they just came up with a very straightforward sort of algebraic formulation that they could then implement in hardware. So that Monday morning, they had an artificial neuron on the desk, working. And I want to make sure that all prospective graduate students out there know this is not usually what happens. This is a rare story. And I think Ted Hoff's story is pretty amazing, because once he finished his PhD, he gets an offer from a startup in the Bay Area. And he comes to Bernie Widrow to ask, should he join the startup, and Widrow tells him, yes, you should. And the startup turns out to be Intel. Good. Smart.
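A minimal sketch of the Widrow-Hoff (least mean squares) idea just described: rather than the perceptron's all-or-nothing correction, each example nudges the weights in proportion to its error, which amounts to a noisy, calculus-free form of stochastic gradient descent. The toy data and learning rate below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of toy task: two clusters in the plane, labelled +1 and -1.
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)), rng.normal([-2, -2], 0.5, (50, 2))])
y = np.array([1.0] * 50 + [-1.0] * 50)

w, b = np.zeros(2), 0.0
lr = 0.01                                    # learning rate

for epoch in range(20):
    for xi, yi in zip(X, y):
        error = yi - (w @ xi + b)            # difference between target and linear output
        w += lr * error * xi                 # LMS / delta rule: step along the error, example by example
        b += lr * error

print("weights:", w, "bias:", b)
print("accuracy:", np.mean(np.sign(X @ w + b) == y))
```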
And Ted Hoff goes on to become one of the first designers of the world's first microprocessor. Well, okay, very, very good. I want to get into the actual neuron just a little bit. This idea of a threshold is apparently very, very important here. You know, so your little artificial neuron is taking in some signals, and rather than just sort of adding them together, it says, I'm going to hold off until it crosses some threshold and then I'm going to fire. Is that the basic idea? Yes. And it's inspired by our understanding of biological neurons, right? That's what biological neurons are doing, in a very simple way of looking at what biological neurons do. You have all these signals coming in through the dendrites of the neuron into the cell body, and the cell body is kind of accumulating the signals that are coming in. And when it crosses some threshold, both in terms of the strength of the signal and the timing of the signals, it then fires a signal on its own axon, which then travels to the dendrites of other neurons. And this model was already known. And so the simple kind of artificial neuron is a very basic approximation of this biological neural mechanism. And then the real fun... go there, because the threshold thing is important. Yes. The real fun comes in when we start adding them together in layers. Yes. And this was actually not possible in the 1950s and 1960s, because what they had was just a single layer of neurons, which means that the inputs are coming in from one side into the neurons, the neurons do this weighted sum and then add a bias term, and then if that weighted sum plus bias exceeds some threshold, they fire. I mean, they're always producing an output: if the weighted sum exceeds a certain amount, then the output becomes one; otherwise, it's minus one, right? Or zero or one, whichever you choose as your outputs. But that's it. So on the output side, you have either one or minus one. And on the input side, you have these inputs coming in. The neuron does this computation. The moment you add another layer of neurons, so that the outputs of the first layer of neurons go in as inputs to the second layer, then the training algorithms that Rosenblatt had and that Bernie Widrow had, you know, the Widrow-Hoff LMS algorithm, the least mean square algorithm, they didn't work. And there's a very interesting story as to why this was a very big deal in the 1960s, because we had people like Marvin Minsky and Seymour Papert, who wrote this extraordinary book that was published in 1969. It was called Perceptrons, in honor of Rosenblatt. And it was a mathematical analysis of these kinds of machines that learned. And it had the perceptron convergence proof in it and a whole bunch of other things. But they also had a very important proof that showed that a single-layer neural network of the kind that Rosenblatt had and of the kind that Bernie Widrow had could not solve something called the XOR problem. So if you imagine four points on the XY plane: at the origin, at the point zero zero, you have a circle; at the point one zero on the X axis, you have a triangle; then at the coordinate one one, you have another circle; and then on the Y axis, at coordinate zero one, you have a triangle. So you have two triangles and two circles, but they are on the diagonals of the square. Right, you cannot draw a straight line to separate the circles from the triangles. It's true.
And so this was the XOR problem, and Minsky and Papert had a very elegant proof saying that single-layer neural networks will never solve this problem. And this was a huge knock, because this was such a simple problem that anyone looking at it can solve it, but this incredible thing that people had been going on and on about couldn't solve it. And then what they did, which was kind of underhanded, was, you know, they also insinuated, without giving any mathematical proof, that even multi-layer neural networks will not solve this problem. Oh yeah. And this effectively killed, well, the legend is that this effectively killed research into neural networks and led to the first AI winter. So sometime in the 1970s, research into neural networks just died, because people didn't think that these things were good for anything if they couldn't solve something as simple as the XOR problem, except that they had no proof that multi-layer neural networks couldn't do it. And no one had yet come up with an algorithm to train multi-layer neural networks. So only a few people kept the faith, people like Geoff Hinton, who persisted. Hinton, I remember talking to him, and he was completely convinced that multi-layer neural networks would solve the problem. He thought that Minsky and Papert had just kind of pulled a fast one over everyone's eyes. So basically the argument there was, Minsky and Papert were very interested in another form of AI called symbolic AI. Sure. Right. And they wanted research funding to go into that. And this approach of connectionism and neural networks was against what they were looking at. I don't know how true that is, whether there was any underhandedness here, but certainly the first AI winter was in part influenced by Minsky and Papert's work. And so that played a big role. Yeah, no, I mean, the human beings, they're endlessly fascinating. But at some point, one way or the other, the multi-layer neural networks did start gaining traction. Yes. For that we would have to wait until the 1980s. I think the first big change that happened in the early 1980s was John Hopfield coming up with Hopfield networks, which were a different kind of neural network. They were essentially neurons that were fully interconnected, which meant that the output of a neuron would go as input to every other neuron in the network except itself. So it couldn't influence itself, but its output could influence every other neuron. And so these were fully connected networks. And Hopfield networks were essentially networks that were used to store memories. And they were very deeply inspired by condensed matter physics, the Ising model of ferromagnetic materials and spin glasses. And what Hopfield was after was basically... he was formerly a condensed matter physicist who was looking to do something in computational neuroscience or biology, and he was looking for a problem to solve. And he figured out that he could solve the problem of how the brain stores and retrieves what he called associative memories, by designing this kind of neural network. Given this fully connected neural network, he designed an equation which allowed you to calculate the, quote unquote, energy of the network. This was modeled after the Hamiltonian of a material. And here the energy of the system would be at a minimum when you store the memory, and any time you corrupted the network, it would enter into a higher energy state.
And because the neurons were all connected to each other, the moment you corrupted one of them in terms of changing the output of a neuron, you would end up setting up a dynamic that made the network just traverse the energy landscape and come all the way down to a minima. And when it came and settled into a minima, then you would just read off the outputs of the neuron and they would be exactly the memory that you had stored previously. So the the way you set the coefficients of your network, which in this case are the strengths of the connections between the neurons. Dependent dot, dependent on the memory you wanted to store. So given that you wanted to store, let's say it's a 10 by 10 image, black and white image that you want to store. And so again, 10 by 10 is 100 pixels and you want 100 neurons in your network so that each neuron is responsible for one of those pixel values. Right, so you take those 100 pixel values that you want to store and Hopfield had an equation that told you that given that I want to or these 100 pixels, what should be the strength of the connections between the neurons. And you would set the strength of the connections of the neurons such that when the memory is stored, the network is at an energy minima. So it is stable. It won't do anything. But the moment you perturb it, moment you add some noise, let's say you want to corrupt your image and corrupting an image simply means that some of the outputs are now changed. Let's say, there were zeros and once in certain neurons, you just flip them. And now the dynamics of the networks takes over the network finds itself at a higher position in the energy landscape. And because the neurons are influencing each other, they will just start flipping just like a magnetic moments in a ferromagnetic material. Right. And then the whole system will traverse its way down the energy landscape and then settle back into its minima, which way definition is a stable state. And the way the network is designed. And when it settles back into a stable state, you just read off the outputs of the neurons and you caught your image back. And I have obviously there's stealing ideas from physics, but we made up for it by giving them the Nobel Prize. So I think that that's afraid. So this was 1982 and 1984. That's when Hopfield came up with this Hopfield network. But we still didn't know how to train multilayer neural networks. What was interesting was that the ideas that would eventually influence people like Hinton were already in the air. In fact, going back to the 1960s, people in rocketry had ways of sort of optimizing their models. Like when you think about what happens when you launch a rocket, which is supposed to reach some destination in deep space, your rocket is actually going in space, you have your model of the rocket, which is controlling, which is, which is giving you the control system. And at every time step, you have to actually look at where the rocket is. And then modify the parameters of your model, because your model has to adapt to the fact that the rocket may not be doing exactly what you think it should be doing. So your model is being updated, your model parameters are being updated on the fly. And and the attack in a sense had the beginnings of the back propagation algorithm already built in except it was not formulated as such. 
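A minimal sketch of the Hopfield storage-and-recall scheme described a little earlier, assuming plus/minus-one neuron states and the standard outer-product (Hebbian) rule for setting the connection strengths; the pattern size and noise level are illustrative assumptions, not anything specific from the episode.

```python
import numpy as np

rng = np.random.default_rng(3)

# A 100-"pixel" memory with values +1 / -1 (a flattened 10x10 black-and-white image).
pattern = rng.choice([-1, 1], size=100)

# Outer-product (Hebbian) rule: connection strengths chosen so the stored
# pattern sits at a minimum of the network's energy.
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)                     # a neuron does not connect to itself

# Corrupt the memory by flipping 20 random pixels.
state = pattern.copy()
flip = rng.choice(100, size=20, replace=False)
state[flip] *= -1

def energy(s):
    return -0.5 * s @ W @ s                  # the "Hamiltonian" of the network

# Asynchronous updates: each flip can only lower the energy, so the state
# slides back down the energy landscape toward the stored memory.
for _ in range(5):
    for i in rng.permutation(100):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("energy after recall:", energy(state))
print("recovered the stored memory:", np.array_equal(state, pattern))
```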
And so there were others, you know, people at MIT; there was a graduate student at MIT who had done some work in economics, who was also playing around with similar ideas, and many others who had done bits and pieces. But it was Hinton and Rumelhart and Williams who in 1986 put out a paper in Nature, just an amazing three-and-a-half-page paper, that is essentially the backpropagation algorithm, which shows you how you can train a multi-layer neural network using what turns out to be just the chain rule of calculus. It's an extraordinarily simple idea in retrospect. And maybe one thing I should mention here: remember we talked about the fact that the early neurons had a thresholding function. Right. But that thresholding function is not differentiable. It just transitions steeply at the point where the weighted sum exceeds a certain threshold and then outputs one; otherwise it's a zero. So you can't differentiate that. And this turned out to be very key to not being able to train the network when you added more layers, because the chain rule requires that the entire computational graph from the output all the way back to the input has to be differentiable. Every computation that you do has to be differentiable, so that you can use the chain rule to, you know, back propagate your error from the output side all the way to the input side. But these discontinuities that were there in the way the artificial neurons were designed, with a step threshold function, kind of ruined that. So one of the things that Hinton and others did was change that step function into a sigmoid. Certainly the sigmoid was differentiable. And so they essentially ensured that the computation from the input all the way to the output, regardless of how many steps there were, was all differentiable. So the sigmoid, we might as well say what that is: that's like a smoother version of the kind of step function that turns the neuron on or off. Yes, so the initial neurons had a very sharp transition from, let's say, zero to one. Okay, so the slope of the function at the point of the transition is infinite. Right. And the sigmoid is essentially a smoothing of that function. Okay, so the step is gone. So it's a very smooth transition from zero to one. And let me make sure I have the right mental picture here: when Hopfield has these networks where every neuron talks to every other neuron, in what sense is that multi-layer? Like, what happened to the layers? So yeah, good question. That was not a multi-layer network in the way we think of it today. That was simply a fully connected neural network. I brought that up only to say that research in neural networks had a resurgence with Hopfield networks, but Hopfield networks are not the kind of networks we use today. They're what are called recurrent neural networks, because the outputs of a neuron can feed back into the rest of the network, and there's a lot of feedback going on, which is not the way today's neural networks work. Today's neural networks are what are called feedforward networks. The computation proceeds from one end, from the input side, and then it goes layer by layer to the output side, and the outputs don't feed back to the input side. Okay, I guess it's interesting. There is this idea of back propagation. Is that part of a feed forward network or is that excluded?
So back propagation is the way you update the weights of the network when you want to train it. Right, so the computation, when the neural network is doing a computation, when you give it an input and it has to produce an output, that is the feed forward process. The input comes in from one side, and each layer does some computation, feeds the result of its computation to the next layer, and the next layer does its computation. And then finally this exits on the output side as the output that you want. And one way to think about it is, a vector of information comes in on the input side, and the vector just propagates, you know, changes size as it goes through the network, because the layers can have different sizes. And then on the output side you'll get back another vector. So one vector comes in, gets transformed by each layer into a different vector, and then finally on the output side, you have another vector. And that output could be a scalar; it would just be a number, zero or one, saying this is a cat or a dog. Or it could be a vector, which has more information than just a scalar. But that's the computation that's happening in the feed forward pass. But when you're training the network, let's say you're training a network to recognize images of cats from images of dogs. And let's go back to our example of 10 by 10 images. So that's 100 pixels. So you turn each image into a 100-dimensional vector. Right. A vector of 100 pixel values, and they are fed into the network on the input side. And then on the output side, after it has gone through a bunch of transformations, you get back either a zero or a one, zero for a cat and one for a dog. Now, in the beginning, when you initialize the network randomly, all the weights of the network, which are the strengths of the interconnections between the neurons, are initialized randomly. And so when you feed in some image on the input side, you're just going to get the wrong answer on the output side. But you know what the right answer should be, because you supplied the training data; you have human-annotated data saying these are cats and these are dogs. So on the output side, you know that it should be outputting either one or a zero, and maybe it does the wrong thing. So you calculate the error now. And the amount of error that it makes is a function of all the parameters of the model, all the weights of the network, or all the strengths of the connections between the neurons. So you now have something called a loss function, where the loss is formulated in terms of the parameters of the model. And you can imagine this as some sort of very, very high dimensional surface. And in the beginning, when the network makes a large loss, you land at one location in that loss landscape which is pretty high up. You've made a large amount of loss. So you now use something like gradient descent to try and work your way down to the point in the landscape where the loss is at a minimum. And that's the part where you have to now figure out how much to update each weight, so that your network is slightly better at the same task than it was, you know, the first go around. So let's say you took one image and it made a certain amount of loss. You took that loss, figured out how much you need to tweak all of the parameters, so that the next time you feed the same image back, the loss will be a little less.
And if you keep doing that, eventually you'll come down to a point where the loss is at a minimum. But the trick here is that you have to do this for all images, because we just did that for one image; then you'll be off for all the other images. So you have to do it, you know, in parallel for everything simultaneously. So for your entire data set, you want to reach the bottom of the loss landscape. Right. And that part where you are trying to update the weights of the network, that's the back propagation part. It's called back propagation because the loss is calculated on the output side, and now you have to propagate that loss all the way back, layer by layer, so that you can update the weights of each layer as you go back towards the input side. That's very hard. That's where, yeah, that's where the chain rule comes in, because you calculate the gradient on the output side, and then you have to chain all the differentiable computations together so that you can calculate the gradient over the entire network. It makes sense, because there's a difference between training the network, where obviously you're going to have to go backwards and fix all the parameters early in the chain of layers, versus just doing the calculation once you've trained it, which is a purely feed forward mechanism. Yes. And so for people who are, for instance, using something like ChatGPT today, when we use it, we're just using the feed forward part, right. But when you're training it, you have to keep doing this back and forward. And say more about the idea of gradient descent. It seems to me, maybe I'm not sophisticated enough here, it seems to be like it's a very high dimensional version of something that Isaac Newton would have told us about many years ago.
These loss landscapes are extraordinarily complex. They have probably, we don't even know for sure that they have a global minimum. But they may have lots and lots of very good optimum optimal local minima. And so the trick is to somehow use gradient descent to find one of these local minima that is satisfactory. And not a trivial task. Yeah, right. But the math is the same regardless of whether it's a trillion parameters or four parameters. That's the amazing part. The back propagation algorithm, you know, what hinted in 1986 in their paper, the Rommelhardt, Hinton and Williams, you know, the same stuff holds true today. And they had, I don't know, I forget how many parameters they had, but you know, you could tens or ten, ten, twenty, thirty or something like that. And today we're talking about trillion, but the algorithm is the same. Yeah. And of course, there are all sorts of innovations about, you know, how you do the stochastic, how you do the gradient descent, you can do stochastic gradient descent, where you don't, you know, sort of you take a small batch of the data that you're training on at any given time. And so the loss landscape that you calculate is always some approximation of the true law, loss landscape. And so when you do the descent, you are sort of stochastic in terms of whether you're actually going in the right direction or not, but turns out stochastic gradient descent works brilliantly. And so there are these kinds of innovations that have happened since the 1986 paper, there are also other tricks about how fast you do the gradient descent, do you use momentum, do you, you know, keep track of the gradient at previous steps, when do you slow down, when do you speed up, you know, those kinds of things, but, and they're really important in terms of engineering, but conceptually it hasn't changed. And so just so I think that I'm getting the right visualization here, because I know there's a trillion dimensional thing, but I'm visualizing a two dimensional landscape, because that's all I can do in my head and the impact of the nonlinearities, which you're emphasizing, I think would be that if you just had linear would be a tiny change in input always give you a tiny change in output. And therefore your fitness landscape that you're trying to find the minimum of would be pretty smooth, it would be gently rolling hills. But now that you have these neurons that can sort of click on and off in a nonlinear way, now you have a jagged landscape and it becomes much, much harder to sort of know from where you are and what your local conditions are, where your actual minimum is lying. Yeah, and because of the size of these networks and complexity of the computations that happen layer by layer, it's not even clear that there is a there's a global minimum, these are not, you know, if you can think of a, you know, why is equal to X square function, which is convex and you have a well defined global minimum that you can descend down to and that will represent the lowest loss. We are not guaranteed that these functions are convex. So not only are they jagged, they may not have global minimum, a global minimum. So most of the sort of mathematical work that is going on is a lot of it isn't trying to figure out the actual nature of these loss landscapes. 
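To make the chain-rule bookkeeping concrete, here is a minimal sketch of a two-layer sigmoid network trained by gradient descent on the XOR problem from earlier in the conversation. The layer sizes, learning rate, and step count are illustrative assumptions; the gradients are simply the backpropagation equations for this tiny network written out by hand.

```python
import numpy as np

rng = np.random.default_rng(4)

# The XOR data: not linearly separable, so a single layer cannot solve it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 sigmoid neurons and one sigmoid output neuron.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 1.0

for step in range(5000):
    # Feed-forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    loss = np.mean((out - y) ** 2)

    # Backward pass: the chain rule applied layer by layer, output back to input.
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    d_W2, d_b2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = d_out @ W2.T * h * (1 - h)
    d_W1, d_b1 = X.T @ d_h, d_h.sum(axis=0)

    # Gradient descent: step downhill in the loss landscape.
    W1 -= lr * d_W1; b1 -= lr * d_b1
    W2 -= lr * d_W2; b2 -= lr * d_b2

print("final loss:", loss)
print("predictions:", out.round(3).ravel())   # typically close to 0, 1, 1, 0
```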
But okay, despite the fact that a lot of the math is sort of classical math, good old stuff, there are some more recent developments, right? Or maybe I shouldn't say that, because I don't know what dates different developments correspond to. But you do talk about the curse of dimensionality: the number of dimensions is so large that one of the things you try to do is sort of find a good subspace where interesting things are happening, using ideas like principal component analysis. Maybe you could make an effort to explain that to everybody. So the curse of dimensionality actually, you know, has been well known for a long, long time. It predates, or is almost orthogonal to, neural networks in terms of a problem. One of the best ways to understand this is a very classic machine learning algorithm that was developed in the 1960s, called the nearest neighbor search algorithm. So here again, we can go back to our images of cats and dogs. Let's say images that are 10 pixels by 10 pixels, so each image is essentially 100 pixels. So if we map it into 100-dimensional space, cats end up in one location and dogs end up in another location. And now we're given a new image. So the perceptron would have started by saying, oh, it's going to find a hyperplane that separates the two clusters of data, and then checks to see which side of the hyperplane the new image falls, and then it classifies it as a dog or a cat. The k-nearest neighbor algorithm does something actually intuitively very simple. It basically says, OK, where does this new image map to in that 100-dimensional space? Is it closest to a dog or is it closest to a cat? If it is closer to a dog, it's a dog. If it's closer to a cat, it's a cat. Right. And if you're just using one neighbor to make that discernment, then you can end up essentially overfitting the data, because you can have noise in your original data, and if your new image is closer to a noisy data point or a mislabeled data point, then you will get an error. So you can mitigate that by saying, OK, I'm going to look at three neighbors, or I'm going to look at seven neighbors, or whatever, some odd number of neighbors, and then you take a majority vote, right. But that process depends on being able to calculate some sort of distance between these data points in high dimensional spaces. And 100 dimensions is not high dimensional for machine learning. Right. So you're basically going to use some sort of, let's say, Euclidean distance, or, you know, there are various metrics you could use, but let's say Euclidean distance is fine, let's say you use that. And then what happens is, as you increase the number of dimensions of your data, there comes a point where the notion of similarity or dissimilarity that this algorithm depends on, that similar data points are closer together than dissimilar ones, starts falling apart, because in high dimensions, everything is as far away as everything else. So the notion of similarity and dissimilarity just doesn't work anymore. And that, in a very simple way, is the curse of dimensionality. And this is a serious problem for machine learning. And the number of dimensions is also the number of features that you have in your data. So in this case, each pixel is a feature of your image.
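Two minimal sketches of what was just described, using made-up data: a k-nearest-neighbor classifier with a majority vote, and a quick look at how Euclidean distances bunch together as the number of dimensions grows. The cluster locations, k, and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# --- k-nearest neighbors on toy two-class data ------------------------------
X = np.vstack([rng.normal([2, 2], 1.0, (50, 2)), rng.normal([-2, -2], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def knn_predict(x_new, k=7):
    dists = np.linalg.norm(X - x_new, axis=1)       # Euclidean distance to every training point
    nearest = y[np.argsort(dists)[:k]]              # labels of the k closest points
    return np.bincount(nearest).argmax()            # majority vote

print("prediction for (1.5, 1.8):", knn_predict(np.array([1.5, 1.8])))

# --- distance concentration: the curse of dimensionality --------------------
# As the dimension grows, the nearest and farthest random points are nearly
# the same distance away, so "closer" stops meaning much.
for d in (2, 100, 10_000):
    pts = rng.normal(size=(500, d))
    dists = np.linalg.norm(pts - pts[0], axis=1)[1:]
    print(f"d={d:6d}  nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")
```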
But you could have other kinds of data. Say you're trying to analyze penguins by looking at the length of their beak, the depth of their beak, their flipper length, et cetera. A penguin can be characterized by, I don't know, ten such features, and then you have a pretty good idea of what kind of penguin it is. But there will be situations where ten features are not enough, where you probably need 10,000 features, or even more, to classify an object. And the more features you add, the more you run into this curse of dimensionality, and you're just not going to be able to do what you want.

So one obvious thing to do is something like PCA, principal component analysis, where you essentially project the data back down into lower dimensions and hopefully capture most of the variance in your data along the few lower-dimensional axes you've chosen. If the data varied equally along all of the higher-dimensional axes, you'd be stuck. But if there is something about the structure of your data such that, when you bring it down into lower dimensions, it still captures most of the variance along those fewer dimensions, then you can take that lower-dimensional data and do your classification on it, train your machine learning model on the lower-dimensional data. So that's one way of tackling it.

But oddly, higher dimensions are also really, really useful. For instance, say you have data that is 100-dimensional and you cannot build a linear classifier because the two clusters of data are not linearly separable in 100 dimensions. Or let's not even go to 100 dimensions; take two-dimensional data, a smattering of dots on an x-y plane, some colored red and the others colored green, but the red and green are mixed up such that you cannot draw a straight line to separate the two. One easy trick is to project this data into higher dimensions, say three or four dimensions. For instance, if your red dots are centered around the origin and your green dots form an annular ring around the red dots, there's no straight line that will separate the two clusters in two dimensions. But you can imagine adding a third dimension. Will just multiplying the x coordinate and the y coordinate be enough? Maybe not. You can square the x coordinate and add it to the square of the y coordinate to create a z coordinate. When you plot this data in three dimensions, the green dots rise above the red dots, and then you can draw a hyperplane between the two: you can use a linear classifier to separate the green dots from the red dots in three dimensions. Once you've figured out what the separation is and you project it back into two dimensions, you get a nonlinear curve, some sort of oval shape, that separates the annular ring from the dots in the center. So this is something we can visualize in two and three dimensions. But the same thing is done using an extraordinary technique called kernel machines, or kernel methods, where the idea is that you want to project your data.
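A minimal sketch of both moves, assuming NumPy arrays and illustrative function names: PCA via the SVD to project down onto the directions of greatest variance, and the opposite trick of lifting 2-D points into 3-D with z = x² + y² so the annular ring becomes linearly separable.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project data onto its top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # coordinates along the top directions of variance

def lift_to_3d(points_2d):
    """Add a third coordinate z = x^2 + y^2.

    Points near the origin stay low, points on the surrounding ring rise
    above them, so a flat plane (a linear classifier) can separate the two
    classes in three dimensions.
    """
    x, y = points_2d[:, 0], points_2d[:, 1]
    return np.column_stack([x, y, x**2 + y**2])
```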
And if you want to find a linear classifier in high dimensions, the algorithm requires you to take dot products of the data vectors, and as you go to higher and higher dimensions those dot products become computationally expensive. Say you started off with 100-dimensional data, which is technically low-dimensional, and you can't find a linearly separating hyperplane, so you project it into a million dimensions. Now you can find that separation, but you've moved your computation of dot products from a 100-dimensional space all the way into a million-dimensional space, and you're essentially creating a computationally intractable problem.

Kernel methods are this amazing technique where you find a function that takes in two low-dimensional vectors and spits out a number that is equal to the dot product of the corresponding two vectors in the higher dimensions. Say there is a vector x in the low-dimensional space, which maps to a higher-dimensional vector phi(x), and another vector y in the low-dimensional space, which maps to another higher-dimensional vector phi(y). In the higher dimensions, the dot product you need is phi(x) · phi(y), and that might be a million dimensions dotted with a million dimensions, which is very expensive. But you have another function, the kernel, call it K. You feed in the two lower-dimensional vectors and it spits out a number that is exactly equal to phi(x) · phi(y), except you've never gone into the million dimensions. You're operating in the 100-dimensional space: your function just takes in two 100-dimensional vectors and gives you a scalar that equals the dot product in the higher-dimensional space. Once you have this, you can run your linear classifier in the million-dimensional space without ever stepping into the million-dimensional space. And the amazing thing about kernel methods is that you can even project to infinite dimensions, where you're guaranteed to find a linearly separating hyperplane. Technically you can never even write down what an infinite-dimensional vector would be computationally, but your kernel function will take two lower-dimensional vectors and give you a scalar value that is the dot product of two infinite-dimensional vectors in infinite-dimensional space. So your linear classifier is now operating in infinite-dimensional space, where it finds a hyperplane, and when that is projected back into your lower-dimensional space, you've found some very intricate nonlinear boundary.

It's interesting to me how much of the effort needs to go into these speeding-up processes. You might find an algorithm that would do amazing things, but if it takes 10,000 years to run, it's not going to be very helpful to you in the real world.

No, I think that's what's amazing about this whole enterprise: so much engineering chops is needed, and math-informed engineering.

And I know that one of the big papers that really made a revolution more recently was the transformer architecture. I made a little bit of effort to understand what that means, but I've kind of failed. Is it possible for you to explain why that was important?

Yeah, so you're talking about the "Attention Is All You Need" paper from 2017.
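Here is a small sketch of that kernel trick, with illustrative names and a toy degree-2 example: the polynomial kernel (x · y)² returns exactly the same number as explicitly mapping the vectors into the lifted space and taking the dot product there, and the Gaussian (RBF) kernel plays the same game for an effectively infinite-dimensional feature space.

```python
import numpy as np

def poly2_features(x):
    """Explicit degree-2 feature map phi(x) for a 2-D vector x."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, y):
    """Kernel giving phi(x) . phi(y) without ever building phi: just (x . y)^2."""
    return np.dot(x, y) ** 2

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: a dot product in an infinite-dimensional feature space."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# The two numbers below agree, but the kernel never entered the lifted space.
print(np.dot(poly2_features(x), poly2_features(y)), poly2_kernel(x, y))
```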
Yes, yes. So I'll do my best to explain it. It is an amazing paper when you think about one paper that changed the course of AI. That's not to say the paper came out of the blue; there was a lot of work leading up to it. But it was a very transformational paper. Maybe it's simpler to talk about the transformer itself rather than the paper.

The way a large language model works, let's talk about the training process, is that you take a sentence. Say, the sentence I keep using in my talks: the dog ate my homework. You blank out the last word, homework. So you have the first four words, "the dog ate my," and a blank, and you feed them to your model and ask it to predict what follows. Traditionally, before all this happened with LLMs, when we did next-word prediction we tended to look at maybe the previous one or two words to predict what the next word might be. If you just looked at the last word in that sentence, "my," and tried to predict the next word, you would almost certainly be wrong, because it could be just about anything: my dog, my own, anything. The model would have no idea how to predict the next word. If you took two words, "ate my," and asked what should follow, you'd probably say lunch or dinner, something completely wrong in the context of this sentence. It's only when you look at the word "dog" that you realize what it should be. We know this is a very popular sentence, the excuse children give their teacher about why they didn't do their homework: the dog ate my homework.

So when you feed these words to a large language model, the first thing the AI does is turn the words into vectors, called embeddings. Each of these four words is turned into a vector in some high-dimensional space, say a 1,000-dimensional space. These vectors then flow through the deep neural network, the black box, that is being called the transformer. If the model were just looking at the final vector, the one representing the word "my," and using that to predict the next word, it would probably get it wrong. So as those four vectors flow through the deep neural network, the transformer has to contextualize them. It has to keep massaging those vectors so that the words start paying attention to each other. The four vectors are moving through the layers of the network, and at each layer the four vectors change so that they capture something about each other, hence the term attention: they're paying attention to each other. After the first transformation, maybe the four vectors have changed enough that you're predicting something close, but not quite right, and as you keep going through the transformer's layers, by the very end you still have four vectors, but the fourth vector now carries so much contextualized information.
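As a rough sketch of what "paying attention to each other" means mathematically, here is a single head of scaled dot-product self-attention in NumPy; the weight matrices Wq, Wk, Wv are illustrative placeholders for the learned parameters. Each output row is a weighted mixture of all the tokens' value vectors, with the weights saying how much each token attends to the others.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One head of scaled dot-product self-attention.

    X has one row per token embedding. Each token's query is compared with
    every token's key; the resulting weights determine how much that token
    attends to the others, and the output rows are contextualized vectors.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of every token to every other token
    weights = softmax(scores, axis=-1)       # each row is an attention distribution summing to 1
    return weights @ V                       # contextualized token vectors
```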
It knows that it has paid attention to all the other words, and the vector has changed such that the LLM can look at that last vector and know that the next word should be homework. The attention mechanism is essentially the process that allows transformers to contextualize these vectors, and it's a whole bunch of matrix manipulations, very neat matrix math. The transformer spits out these four vectors at the end, and you look at the final vector, the vector for the word "my," but it now carries the knowledge that it has paid attention to "ate" and "dog" and all of that, and that allows you to predict that the next word should be homework.

During training, of course, it will make an error, because all of the weights of the network are randomly initialized. The matrix operations the transformer is doing have to be learned; it needs to learn what to pay attention to given a certain sentence. So in the beginning, when it predicts a word, it might predict something completely wrong. In fact, it will predict something completely wrong, but you know what the right word should be. What the language model predicts at the very end is a probability distribution over its vocabulary. It's basically saying: if my vocabulary is a thousand words, here's the probability distribution over my vocabulary for the most likely next word. It's going to get it wrong in the beginning, but you know what the correct probability distribution should be: one for the word homework and zero for everything else. So you calculate an error, and that error is a function of all of the, you know, 500 billion or trillion parameters in your large language model. You do backpropagation all the way through the network to fiddle with the weights, so that the next time you give it the same sentence, it predicts a word that is a tiny bit closer, in that probability-distribution space, to the word you want. As you keep tweaking with every backpropagation step, your network gets better and better at predicting that the next word should be homework. But that's just for one sentence. Now imagine doing this for every sentence you can scrape off the internet. That's why training these language models takes months.

That was great. I think I do understand, for the first time in my life, so thank you very much for that. We're near the end of the podcast, and the final question is going to be a completely unfair one, so answer it to whatever level you want. Given that you've studied some of the math and some of the history of how these things have developed, what is your feeling about the future of progress in these kinds of AI landscapes? Is it more just going to be scaling, more computing power, more data? Or is there some conceptual leap out there remaining to be made that's going to make everything very different?

My sense is that scaling, well, scaling what, right? Currently when we talk of scaling things up, we're talking of scaling up large language models. And the reason scaling them up alone will not get us to any kind of generalized intelligence is potentially that we have no mathematical guarantee that a language model is 100% accurate. You cannot guarantee accuracy.
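A toy sketch of that final step, with an invented five-word vocabulary (the names and numbers here are illustrative, not from any real model): softmax turns the model's raw scores into a probability distribution over the vocabulary, and the training error is the negative log probability assigned to the correct next word, which backpropagation then drives down.

```python
import numpy as np

VOCAB = ["lunch", "dinner", "homework", "dog", "the"]  # toy vocabulary

def next_word_loss(logits, target_word):
    """Cross-entropy between the predicted distribution and the target word.

    logits holds one raw score per vocabulary word from the final layer;
    softmax converts the scores to probabilities, and the loss is the
    negative log probability of the correct next word.
    """
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return -np.log(probs[VOCAB.index(target_word)]), probs

# Early in training "homework" gets low probability, so the loss is high;
# backpropagation nudges the weights until its probability approaches 1.
loss, probs = next_word_loss(np.array([2.1, 1.8, 0.3, -1.0, 0.5]), "homework")
```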
Because the output is a probability distribution over its vocabulary. With every forward pass, that's what it produces: a probability distribution over its vocabulary, and then you sample from that distribution. So there is an in-built stochasticity there, and there's no mathematical guarantee that the distribution it produces, even if you sample the most likely next word or token out of it, is going to be the word you want. So scaling up alone is not going to get us to a place where we are 100% sure of the accuracy of the model.

The other problem with large language models is that they are extremely sample-inefficient. They require enormous amounts of data to get to where they are. The reason scaling has worked so far is that this entire process of training a large language model is more or less hands-off in terms of human input. You just scrape some amount of text from the internet, mask the last word, and ask the network to learn how to predict the next word. That's a process that can be completely automated, and hence amenable to scaling up. They've managed to do that for a long time now, and the results are pretty amazing. But given its sample inefficiency, and given that it has no guarantee of correctness, even though these models are getting much better at being correct, there's no guarantee; it's an asymptotic thing, so you're never going to guarantee 100% accuracy. Given those two things, and other concerns, I think most people in the field are expecting something similar to what happened with the "Attention Is All You Need" paper. That paper changed everything. We're probably one or two steps like that away from an AI that is capable of generalizing to questions it hasn't seen, answering questions about patterns that don't exist in the training data. Effectively, going back to our earlier discussion, doing what Kepler did. LLMs are very unlikely to be those kinds of systems, but you never say never with these things. My hunch is that we're two or three breakthroughs away from something quite transformative.

I like that. That gives the youngsters in the audience something to think about and something to try to do. So, Anil Ananthaswamy, thanks so much for being on the Mindscape podcast. This was great.

Thank you, Sean. It's been my pleasure. Thank you.