The Quanta Podcast

Do AI Models Agree On How They Encode Reality?

29 min
Feb 3, 2026
Summary

This episode explores how different AI models represent reality internally, drawing parallels to Plato's theory of ideal forms. Researchers at MIT found that as AI models become more powerful, they converge on similar ways of encoding concepts like 'table,' suggesting they may be capturing genuine structure about the world rather than just statistical patterns from training data.

Insights
  • AI models appear to develop internal representations (vectors in high-dimensional space) that encode meaning through relationships between concepts, similar to how human brains may process information
  • Convergence across diverse AI architectures and training methods suggests models may be learning about underlying reality rather than memorizing training data quirks
  • The 'similarity of similarities' metric allows comparison across fundamentally different model types (language vs. vision) by examining relational structure rather than absolute neuron values
  • Evidence for platonic representations is strongest in better-performing models, but interpretation depends heavily on which datasets researchers choose to test, introducing subjective bias
  • This research has practical applications beyond theory, including translating representations between different AI models
Trends
  • Growing convergence in AI model representations as models scale and improve in capability
  • Shift from black-box AI interpretation toward mechanistic understanding of internal model structures
  • Increased collaboration between AI researchers and classical philosophy to frame fundamental questions about machine cognition
  • Development of cross-model comparison techniques to validate whether AI understanding reflects objective reality
  • Debate over interpretability metrics and what constitutes sufficient evidence for machine understanding
  • Rising interest in whether AI models develop universal representations independent of training data modality
  • Emergence of 'Rosetta neurons' and similar concepts suggesting convergent neural patterns across different architectures
Topics
  • AI Model Internal Representations
  • Neural Network Interpretability
  • Platonic Forms in Artificial Intelligence
  • Vector Geometry and High-Dimensional Spaces
  • Cross-Model AI Comparison Methods
  • Language Models vs Vision Models
  • AI Convergence Hypothesis
  • Concept Cells and Representation Learning
  • AI Understanding vs Statistical Pattern Matching
  • Plato's Allegory of the Cave
  • Similarity Metrics for AI Models
  • Training Data and Model Bias
  • Multimodal AI Representations
  • Philosophy of AI Cognition
  • Mechanistic Interpretability
Companies
MIT
Led research team studying AI model convergence; Philip Isola is senior author of the key paper discussed
New York University
Researcher Ilya Sucholutsky from NYU contributed the concept of 'similarity of similarities' for comparing models
People
Philip Isola
Senior author of MIT paper on AI model convergence and platonic representations; leads research group
Ilya Sucholutsky
NYU researcher who developed the 'similarity of similarities' framework for comparing AI model representations
Ben Brubaker
Quanta Magazine computer science writer who authored the story and discusses research findings
Plato
Ancient Greek philosopher whose allegory of the cave provides theoretical framework for AI representation research
John Rupert Firth
1950s linguist whose principle 'you shall know a word by the company it keeps' informs vector relationship analysis
Quotes
"You shall know a word by the company it keeps"
John Rupert Firth
Circa-1950s linguistics principle cited in the episode
"Half of everybody is telling us this is obvious and half of everybody's telling us it's obviously wrong"
Philip Isola
On reception of the convergence paper
"The similarity of similarities"
Ilya Sucholutsky
Describing his method for comparing representations across different models
"What we call the real world is just a shadow of this realm of ideal platonic forms"
Ben Brubaker (paraphrasing Plato)
Explaining Plato's allegory of the cave
"The models are learning things about the world behind their training data"
Ben Brubaker
Describing the platonic representation hypothesis
Full Transcript
It's satisfying when, in the course of doing these modern science and math stories, we get the opportunity to cite thinkers from the deep past. Scientific insights have arisen all over the world, but in Western thought at least, the ancient Greeks come up a lot. Pythagoras and Aristotle and Epicurus, they had ideas around math and the universe and atoms and the brain. So it shouldn't surprise us when Plato recently made a guest appearance in a story about artificial intelligence. Plato had this idea that there are idealized, perfect forms that provide the structure for all the messiness of reality. We can recognize a lot of different things as tables because there's an essence of tableness that they share and that we can recognize in them. Maybe you see where we're going with this. What does AI think is a table? And do different kinds of artificial intelligence arrive at the idea of a table in different ways? How would we even know? Welcome to the Quanta Podcast, where we explore the frontiers of fundamental science and math. I'm Samir Patel, Editor-in-Chief of Quanta Magazine. This idealized version of a table is what is known as a platonic form, and it's still all over modern thinking. It appears in psychology and media and politics and science, and now it's inspired a theory about how AIs perceive reality. This was the subject of a recent Quanta story called "Distinct AI Models Seem to Converge on How They Encode Reality" by our computer science writer, Ben Brubaker, and Ben is back with us to talk about it. Welcome back to the show, Ben. Hello again. So, Ben, what's the big idea? So the big question here is about the internal representations of AI models. So you show an AI model a sentence or a picture. How does it encode that? What is that picture inside the model's head, so to speak? And one important aspect of that question is, to what extent are different models representing the world in the same way? Not long ago, I was talking to Yasmin, our biology writer, about the idea of concept cells in the human brain, which is to say there's a set of cells in our brain that fire when we think of a table, see a table, hear the word table. So we're looking essentially for the equivalent of what is the internal representation, what's happening in AI when it's perceiving a table, and what does that mean? Yeah, it's pretty similar to that. There was actually a paper on what are called Rosetta neurons in neural networks, which are these things that appear in every neural network focusing on the same thing in image models. But yeah, the paper that I was writing about is about a slightly different aspect of that topic. Okay. We were talking about Plato earlier. There's a particular platonic allegory that you cite in the story. People might know this as the allegory of the cave. Can you explain this to us? Yeah. So this is from 2,400 years ago by the Greek philosopher Plato, as you said. It's this pretty elaborate scenario that he concocted, which was like, imagine there's these prisoners that are trapped in a cave, and they're looking at this wall, and all they can see is the shadows cast on the wall.
There are objects that, you know, are outside, and they're casting these shadows. So that's all that the prisoners know of reality. And his point in setting this whole thing up was basically, we are all like these prisoners. He was saying, what we call the real world is just a shadow of this realm of ideal platonic forms, and these forms, like, you can only know about through philosophy, not sense experience. So that obviously makes the philosopher's job really important. So it's this big metaphysical argument about the nature of reality. But in the context of this AI research, it's not quite the same. They're using this allegory as a jumping off point, but they're not making this kind of metaphysical argument. I just wanted to say that off the bat. So the idea that Plato's exploring here is that there's like an unseen sort of structure to the world. How do we then apply that to the way that we're thinking about AI? Like, how are we going to get to this idea of talking about their internal representations? So you think about how humans experience the world, as you said. We're getting information constantly through all these different channels, like where we have these different senses, we read things, and we somehow assemble all of that into one coherent picture of the world. And so you'd like to think that somehow the structure of the world, that kind of common structure that gives rise to all this stuff that we experience, is reflected then somehow in the structure inside our heads. And in some sense, AI models are similar. They're also just taking in a bunch of data about the world. But one respect that they're different is that they are often much more specifically focused on one kind of data. So language models typically are just trained on a ton of text. They're presented with this ton of text, they're trying to predict what happens next, and in the course of that, they come to be able to write really fluent sentences. There's other models that are trained on just reams of images, and they learn to identify images. So I should say that this is not an absolute rule. There are models that have been trained on multiple data types, like both text and images, but nothing that is close to a human. Like nobody has built models that have been trained on every type of data that exists in the world. So you can see how the cave analogy might be relevant here. We're not talking about this metaphysical realm of forms anymore, but kind of taking that allegory one step down: you have the physical world that we touch and whatever, and the text that's on the internet, the text that models are trained on, is like a shadow of that world. It's sort of describing the world. It maybe captures some of the aspects of the structure of that world, but the models don't have access to the world directly. And so just like Plato was asking about, well, what do the people really know about the realm of forms? You could ask, what are these models actually learning about the world just by looking at these shadows? So just to kind of recap, we experience tables from multiple senses in multiple ways. We read about them in books. We sit at them. We're sitting at one right now. From the time that we're babies, this is our training, right? We understand and we develop this internal representation of what a table is. We have to believe, or we have to understand, that AIs are probably doing something similar. Okay. So now we go to the AIs. Let's take just the example of like a large language model and keep it simple.
A large language model trained on the entire internet. There's a whole ton of information on the internet about what a table is and what a table isn't. Our understanding is that the AI model, because it appears to understand what a table is, must have some kind of internal representation, built from this data, of what a table is. So we want to get a sense of what's going on inside AIs. Now, famously, we've talked about this before. There's a lot of things we don't understand about what's happening in AIs, just like there's a lot of things we don't understand about what's happening in the brain. What are we actually learning about what the AI is thinking when it sees a table? Good, yeah. So as you said, there's a lot we don't know, but we're in this kind of weird situation where we have a really good understanding of what AI models are doing at a super microscopic level, but we don't know how to make sense of it. And honestly, this is kind of similar to the situation in neuroscience. You can put somebody into a brain scanner, you can show them a picture, and, you know, at least in principle, if you have a really good brain scanner, you can be like, ah, that neuron is on, and that neuron is on a little bit less. Just totally not the way it works, but in principle, I understand. I don't know anything about neuroscience. But, you know, in AI, it actually kind of is the way it works. Like you can just pinpoint exactly what individual neurons are doing, at least in open source models and so on. But what are you going to do with that information? It's just a bunch of neurons, or in the case of an AI model, these AI models, as you said, they're kind of similar to the brain a little, and in some very loose sense, they're built on these mathematical structures called artificial neural networks. And just like a neuron in your brain will either fire or not fire, depending on signals from other neurons that are feeding into it, a neuron in an artificial neural network does some math and spits out a number. That number is called the activation of that neuron. And it's going to depend on the activations of the other neurons that are all feeding into it. So just concretely, what you do is you give an AI an input, let's say a picture, a particular picture. You're going to basically take a snapshot and you're going to look at what's called one layer of the model, which is what this model is doing at some particular stage of its processing. Right, because it is freezable. And you have to decide what part of it you're going to look at. But you show it a picture of a table, and then you lock it into place. Yes. And you decide which layer of the neurons, and there's lots and lots of these layers in an AI, you want to look at. And then you look at, presumably, the activation numbers that are in that slice. Exactly. So the activation of the first neuron on that layer, let's say it's 0.7. The activation of the second neuron, let's say it's 0.2. And you have thousands of those numbers because that's about how big these layers are. But so now you have this list of a thousand numbers. What do you do with that? So one thing you can do with a bunch of numbers is you can think about them geometrically as an object that you call a vector. So this is just an arrow that points in some direction. And so, like, for example, if I have two numbers that's a two-dimensional vector, the first one maybe is, like, how much is it pointing to the right? And the second one is, like, how much is it pointing up?
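To make that snapshot step concrete, here's a minimal sketch, assuming PyTorch and a toy stand-in network rather than any of the models discussed here. One forward pass plus a hook on a chosen layer yields exactly the kind of list of activation numbers described above:

```python
# A minimal sketch, assuming PyTorch, of "taking a snapshot" of one layer.
# The tiny network and the random input are stand-ins for illustration only;
# real models have many layers of thousands of neurons each.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy model: a few layers standing in for a deep network.
model = nn.Sequential(
    nn.Linear(16, 1000),
    nn.ReLU(),
    nn.Linear(1000, 1000),
    nn.ReLU(),            # index 3: the layer we will snapshot
    nn.Linear(1000, 10),
)

captured = {}

def hook(module, inputs, output):
    # Save this layer's activations: one number per neuron.
    captured["vector"] = output.detach().squeeze(0)

model[3].register_forward_hook(hook)

x = torch.randn(1, 16)   # stand-in for an encoded picture of a table
model(x)                 # one forward pass fills `captured`

v = captured["vector"]
print(v.shape)           # torch.Size([1000]): a 1000-dimensional vector
print(v[0].item(), v[1].item())  # activations of the first two neurons
```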
And I could do the same thing for a three-dimensional vector, which is going to be like an arrow in three-dimensional space. And there's nothing to stop me mathematically from doing all the same stuff for like a thousand-dimensional vector. And so now I have this crazy geometric space that I can't visualize. And I'm going to say that is what I mean by representation in an AI model. It's just going to be some vector with a thousand entries. So it's an arrow in some impossibly high-dimensional space. And each of the numbers in that vector is just representing how much one neuron is activating. Can you give me an example of what that sort of looks like? Like, for example, we crack it open. We have a multidimensional vector for table. We have a multidimensional vector for chair. Right. We have a multidimensional vector for hot air balloon. Yes. Presumably, and you can tell me if this is the case, the table and the chair ones are going to share some similarities that they probably won't with the hot air balloon. Does that sound right? Exactly. That's exactly right. We're talking here about the different inputs to the same model. So we're just going to take one model, as you said, and give it different inputs. And what you will find is that the more similar the inputs are, the more their vectors will point in similar directions in this high-dimensional space. So if we're talking about a language model and we're just giving it a bunch of individual words, then, yeah, table and chair are probably going to be pretty similar. Hot air balloon is going to be somewhere else. The crucial thing is this context kind of helps you make sense of these vectors, because just by itself, a vector is just like some arrow pointing in some direction. So you're going to be like, okay, well, what am I going to do with that? Like, how do I know even what that means? But it is this insight that the relative relationships between different vectors have some kind of meaning to them. This goes back actually all the way to the 1950s and linguistics, where this guy John Rupert Firth famously said, you shall know a word by the company it keeps. So it's kind of like embracing this idea that meaning exists in the relationships between vectors. But so far, we're all just talking about within one model. To go back to an earlier point we were making, does this multidimensional vector in some way represent a platonic representation of whatever the input was, the table or whatever it was? I would say probably the vector represents a particular input. That might be a particular picture of a table. It might just be the word table. It might be a sentence about a table. So it's specific. And if you really wanted to care about, like, tableness, then you might say, well, let me give this AI a thousand different, slightly different pictures of tables. And then I'll see, like, what's the general direction that they all kind of point in. You could do some mathematical operation that would give you, like, oh, here's what a table is. Right. And that's kind of what we're doing when we compare things that are not all table. We could say, oh, you know, the table and chair kind of point in this direction. That direction has something to do with furniture or something like this. So where does Plato come into this?
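As a toy illustration of that geometry, here's a short sketch with made-up numpy vectors standing in for "table," "chair," and "hot air balloon." Real vectors would come out of an actual model; the cosine comparison is the same either way:

```python
# Made-up vectors standing in for a model's representations; only the
# geometry is the point. With random high-dimensional vectors, unrelated
# directions are nearly orthogonal, so their cosine similarity is near 0.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

table = rng.normal(size=1000)
chair = table + 0.5 * rng.normal(size=1000)  # built to point near "table"
balloon = rng.normal(size=1000)              # an unrelated direction

print(cosine(table, chair))    # high, roughly 0.9
print(cosine(table, balloon))  # near 0
```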
The main focus of my story was this paper from 2024 by a group of scientists at MIT, and the leader of that group is named Philip Isola. And so where the platonic nature comes in is that they're taking this analogy where, again, the AI models are these kind of prisoners in the cave. They only see the shadows. And then the question is, how much are they learning about the world behind the shadows? And they're kind of making the opposite point that Plato was, awkwardly. They're arguing for the case that the models are learning things about the world behind their training data. And the reason they're arguing that, like, the things that they're drawing on, they're drawing on this large body of work about similarities in the ways that different models represent the world. So, so far we've been talking about all the different vectors within one model, or many different vectors within one model. But let's set aside this question of how we do this for now. We're going to say we can compare representations across models now in some way. And if we see that they're similar, the more different those models are, that's evidence that the models are really capturing something that isn't just the specific quirks of the shadows that they're seeing. So you could kind of analogize this to: you have prisoners in one cave and they're looking at the shadows on the wall. And, you know, some of the stuff that they're looking at is obviously related to the objects that they're the shadows of. But some other stuff is just, like, the fire is flickering in this particular way. And if these prisoners are like cave prisoner scientists, they're like, oh, yeah, that flicker pattern is so important. And maybe it's just nonsense, right? Maybe it has nothing to do with the world. But now, if these prisoners had walkie-talkies and they could all communicate with prisoners in different caves, all seeing different shadows of the same objects, then the more similar their representations, the more confident you could be that they're not looking at the flickering, which is different in all the different caves. They're kind of converging on the same properties of the objects. There's still, like, a metaphysical question about do they really know the objects if they've never seen the objects. But clearly they're grasping some shared structure. So to back up again, the argument of this paper is that this similarity between different AI models, and, like, more specifically, evidence that models are becoming more similar as they grow more powerful, like the newer models, the better models are more similar to each other than the older models were, that is evidence that the models are converging on the same way to represent the world. That is what they're calling, like, a platonic representation. I want to make sure I understand this fully, right? We don't know, broadly speaking, what a given AI does or doesn't understand. Yeah. This method of thinking is a way of making an argument that AIs kind of do understand what's going on in the world. And if all of these different AIs that are trained in different ways and trained on different data are representing reality in the same way, then that suggests, under this hypothesis, I guess we could call it, that AIs are getting reality in some capacity that would not be obvious to us otherwise. Yeah, that's the basic idea.
And that's, I mean, it's a provocative idea, but putting that idea aside for a second, in order to even make that argument, you would have to show, and I think this is where the paper comes in, you would have to show that multiple AIs trained on different kinds of data and in different ways converge in the way that they represent something. Now, if we are looking at a single AI and we're looking at the multidimensional vector for table, the multidimensional vector for chair, we're going to see that there are similarities between them. That's cool. And you can see how you could compare across multiple large language models, for example. But then, and this is where you add this layer, right, is if you have something that's being trained on images, I'm assuming its multidimensional vector for a table, yeah, it's going to, I mean, even among different LLMs, like, they're going to look really different. Yeah, exactly. So even if you're talking about different LLMs, it's not, like, trivial how you would do this. So I'll give an LLM the word table and I'll look at that long list of numbers. And then I'll give a different LLM the same word and I'll look at its list of numbers. And they're just completely different. There's no relationship between these lists. And so they look like arrows pointing in different directions. And it doesn't even mean anything to compare them, because the neurons are not related to each other. But the key idea here is this geometric idea of the similarities within models gives you a handle on how to talk about the similarities between models. And one researcher I talked to, Ilya Sucholutsky from New York University, put it in this nice way: it's the similarity of similarities. So what does this mean? This means I take my one model, I give it a bunch of words, and I just map out all these vectors. So you can imagine sort of a cluster of arrows, and some of them are pointing in more similar directions, some of them are pointing in different directions. And then I take my other model and I give it all the same inputs that I gave the first model, and I see another cluster of vectors. And then I can ask about those clusters: the individual vectors might be pointing in different directions, but do the clusters look similar to each other? Like, maybe I can abstractly rotate one of these clusters and see, okay, the table and chair are the same closeness in the two models and they're the same distance from hot air balloon in the two models. And, you know, it might not be exactly the same. But this is a way of comparing across models. This makes sense, because in a way it lets us kind of abstract this all to math and say, do the representation clusters of different, let's say, objects in a room, for example, do they resemble each other in different models? Because in absolute terms, they're not going to look the same, because the models are trained on different stuff and they're going to look different. And if you look at different slices, you're going to get different things. But if you see that there are similarities in the way that these things are pointed, when you give them a lot of different inputs, then you can say, okay, these two LLMs are quote-unquote thinking the same way. Yeah. There's similarity, but then it's just like, okay, you get another problem, which is you have a number, what does it mean? Let's say you use some particular mathematical method for taking those different clusters of vectors and comparing them to each other, and you get a number and it's 0.3.
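Here's one hedged way that comparison could look in code. Correlating the two models' pairwise-similarity matrices is a simple stand-in for the metrics used in practice, and the random "models" are constructed so that one is a rotated copy of the other:

```python
# A sketch of "the similarity of similarities," assuming numpy. For each
# model, build the matrix of pairwise cosine similarities among its vectors
# for the same inputs, then correlate the two matrices. Random data stands
# in for real models; model B is a rotated, slightly noisy copy of model A,
# so its individual neurons differ but its relational structure is shared.
import numpy as np

rng = np.random.default_rng(1)

def similarity_matrix(vectors):
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return v @ v.T  # all pairwise cosine similarities

def similarity_of_similarities(reps_a, reps_b):
    # Correlate the off-diagonal entries of the two similarity matrices.
    sa, sb = similarity_matrix(reps_a), similarity_matrix(reps_b)
    mask = ~np.eye(len(sa), dtype=bool)
    return float(np.corrcoef(sa[mask], sb[mask])[0, 1])

n_inputs, dim = 50, 300
reps_a = rng.normal(size=(n_inputs, dim))
rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # random rotation
reps_b = reps_a @ rotation + 0.1 * rng.normal(size=(n_inputs, dim))
reps_c = rng.normal(size=(n_inputs, dim))                # an unrelated "model"

print(similarity_of_similarities(reps_a, reps_b))  # close to 1
print(similarity_of_similarities(reps_a, reps_c))  # near 0
```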
Is 0.3 similar? Is that good? Is that bad? What does that mean? So often what's, like, better evidence for this, and what's exciting that people have observed, is that you can compare a bunch of different models, and some of them are worse and some of them are better. And you'll see all the better models having similar relationships to each other and the worse models have less similar relationships. So that kind of suggests we don't really worry about what these numbers are. We care that they're getting better. And that's kind of where this notion of converging towards a platonic representation comes from. Now, this becomes more complicated if you want to compare a language model to a vision model. And why would you want to do that? I mean, the best evidence you're going to get for this sort of platonic representation idea is, like, you want models that are very different from each other. The more different, the better. And you want to see that they're still doing similar things. So you're sort of limited in what you can do. Instead of giving them exactly the same inputs, you want to give them, like, a set of paired inputs. So concretely, this might be like a set of photos and their captions. People have assembled these data sets from Wikipedia, for example. You'd give the first photo to the vision model and the first caption to the language model. And you'd assemble these clusters where the clusters are all the photos on one side and all the captions on the other side. And then you'd compare them the same way you would for just language models. Now, when you say that as models get better, meaning as they're trained on more data, they have more computing power, their error rates go down, that's how we're roughly defining better models. Yes, exactly. And what you're saying is that their conception of what a table is, as the models get more powerful, starts to look more and more similar. That's right, yeah. Regardless of what kind of data they're trained on. Yeah. Which then suggests, under this platonic representation hypothesis, if you want to take it to its, like, full extreme, that they're all getting what a table is and they're getting it in the same way, which means that their understanding of reality is not some random statistical what-word-comes-next kind of argument, but that they're kind of getting it. That's the idea. The trends are certainly there, at least in some cases. But I think the most strident proponents of this idea would say, like, look, as the models keep getting better, eventually they will have, like, exactly the same representation. And, like, that is the only representation you could have that allows the model to complete all these tasks. That is, like, the platonic representation. But we're not there yet. Right, right. It strikes me, and I think this was reflected in your story, that we start with a theoretical idea. There's some math behind it. That's all very relevant and rigorous. And then there's this extension of it to say, okay, what does an AI understand or not understand? It strikes me that there are some strong feelings about whether this is a valid way to interpret this data, yeah? Yeah, Philip Isola, who's the senior author of this paper, said, like, half of everybody is telling us this is obvious and half of everybody's telling us it's obviously wrong. And he was like, yeah, we're happy with that response to the paper. So it touches on a lot of, you know, contentious issues.
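And here's a sketch of that paired-input version, with random projections of a shared latent structure standing in for a real vision model and language model applied to photo-caption pairs:

```python
# A sketch of the cross-modal version, assuming numpy. The "vision model"
# and "language model" here are just random linear maps from a shared
# latent structure; that pairing assumption (photo i and caption i describe
# the same thing) is what real experiments get from Wikipedia image-caption
# datasets. Note the two representation spaces have different widths:
# nothing in the comparison requires the models to match in size.
import numpy as np

rng = np.random.default_rng(2)

def similarity_of_similarities(reps_a, reps_b):
    # Same comparison as the earlier sketch: correlate the off-diagonal
    # entries of each model's pairwise cosine-similarity matrix.
    def sims(v):
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        return v @ v.T
    sa, sb = sims(reps_a), sims(reps_b)
    mask = ~np.eye(len(sa), dtype=bool)
    return float(np.corrcoef(sa[mask], sb[mask])[0, 1])

n_pairs, world = 40, 64
shared = rng.normal(size=(n_pairs, world))              # structure behind the pairs
photo_reps = shared @ rng.normal(size=(world, 512))     # "vision model" vectors
caption_reps = shared @ rng.normal(size=(world, 768))   # "language model" vectors
unrelated = rng.normal(size=(n_pairs, 512))             # no shared structure

print(similarity_of_similarities(photo_reps, caption_reps))  # high
print(similarity_of_similarities(photo_reps, unrelated))     # near 0
```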
I think it's not so much a question of, like, is the platonic representation, is it true that they're converging or false that they're converging? The biggest question here I would say is, like, where is this hypothesis applicable and where isn't it applicable? I think everybody would say there's evidence of convergence. Then you could ask about how far it will go. But one reason you might have different opinions about how far it will go is, is the evidence that we're looking at, is that representative evidence, so to speak? There's a bunch of different technical questions here, but I would say for me, the one that stood out the most is, which data set are you testing on? To measure an AI model's representation of a word, a sentence, or a picture, you actually have to inspect its neurons while it's processing that specific input. And you simply cannot do this for every conceivable sentence or image. There are literally infinitely many possibilities. So you have to decide on which specific inputs you're presenting to the two models you want to compare. Like maybe it's captioned pictures of furniture, and maybe it's sentences about emotions, or maybe it's even strings of nonsense characters. And there are certain cases where maybe like you will see more convergence or less convergence. And I think it's almost a matter of like taste which ones are the most interesting to you, depending on what kind of researcher you are. Which suggests that you could look at certain clusters and see weak evidence for this hypothesis and look at certain clusters where you see strong evidence for the hypothesis. And the taste is whether you like the hypothesis or not as to which you think is more important. Right, exactly. So I would say, like, to represent everybody's position here, the authors of this paper would say, sure, you know, there's cases where maybe it doesn't apply, but what you're trying to do in science is find the stuff that is universal between experiments. And other people would say, we should be homing in on the ways that these models are different from each other, because that's going to give us a clue about what they're missing. There have been applications that use this idea of, like, similarity of representations across models to do things like translate between models. And so in some sense, like, the people who are pro similarity will say kind of the proof is in the pudding. Like, we can do something with the similarity, even if it's not perfect, you know. It tells us something. It tells us something. Yeah, yeah, exactly. I'm always interested in trying to understand theoretically, mathematically, metaphorically what's going on in AI, since they're such a big part of our world now. Trying to get at some of those more mechanistic questions is always interesting for me. Yeah, that's an interesting topic. Yeah. Thanks again for coming on the show. Sure. Yeah. Always happy to do it. Ben, we like to end every episode with a recommendation. What's exciting your imagination this week? Yeah. So consistent with the theme, I guess, I would like to recommend a podcast, which is called The History of Philosophy Without Any Gaps. And this is a lot of little bite-sized episodes about different philosophers or different aspects of one philosopher. You get a lot of Socrates, a lot of Plato, and so on, but also a bunch of people you've never heard of. So I used to be a physicist before I became a journalist, and there's this kind of attitude in physics like philosophy, schmilosophy, whatever, you know.
That's like an old-fashioned way of asking these questions. And, you know, I've kind of increasingly come around to the idea that there's a lot of really interesting questions that are still relevant that people back then were grappling with. There's a lot more continuity than you might expect with philosophers who were trying to work out, like, these real questions about how do you think about time and motion and all these things that really do connect with physics. And yeah, you just learn a lot of interesting questions, like, how do you define truth? What things are necessarily true versus contingently true? It's a really fun show and it's very accessible. All right, well, we'll check that out. Thanks, Ben. Also on Quanta This Week, you can read about the surprising roles that non-neuron cells play in the brain and how vast archives of old glass plate photographs continue to be relevant to astronomers in the age of powerful digital telescopes. In addition to Plato, we have one other surprise guest in this episode: writer, director, actor Orson Welles. In 1973, Welles narrated an animated educational film that dramatized Plato's cave allegory. The film was produced by Counterpoint Films, directed by Sam Weiss, and illustrated by Dick Oden. We're going to leave you today with just a little bit of Welles' distinctive narrative stylings. Of the objects which are being carried in like manner, they would only see the shadows. And if they were able to converse with one another, would they not suppose that they were naming what was actually before them? And suppose further that there was an echo which came from the wall. Would they not be sure to think when one of the passersby spoke that the voice came from the passing shadows? To them, the truth would be literally nothing but the shadows of the images. If you've been enjoying the Quanta podcast, please take a moment to rate the show and leave a review. We'd love to hear from you. The Quanta Podcast is a podcast from Quanta Magazine, an editorially independent publication supported by the Simons Foundation. I'm Quanta's editor-in-chief, Samir Patel. Funding decisions by the Simons Foundation have no influence on the selection of topics, guests, or other editorial decisions in this podcast or in Quanta Magazine. The Quanta Podcast is produced in partnership with PRX Productions. The production team is Ali Budner, Deborah J. Balthazar, Genevieve Sponsler, and Tommy Bazarian. The executive producer of PRX Productions is Jocelyn Gonzalez. From Quanta Magazine, Simon France and myself provide editorial guidance, with support from Samuel Velasco, Simone Barr, and Michael Kenyongolo. Our theme music is from APM Music. If you have any questions or comments for us, please email us at quanta@simonsfoundation.org. Thanks for listening. From PRX.