Latent Space: The AI Engineer Podcast

Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

67 min
Apr 2, 2026
Summary

Moonlake founders Chris Manning and Fan-yun Sun discuss their approach to building causal, multimodal world models that prioritize reasoning and structure over pure scale. They argue that symbolic representations and action-conditioned models are essential for embodied AI and gaming, contrasting their philosophy with pixel-focused video generation models like Sora.

Insights
  • World models require action-conditioned understanding (knowing consequences of actions) rather than just visual coherence, fundamentally different from video generation models
  • Symbolic reasoning and abstraction layers are more efficient than pixel-level prediction for long-horizon planning and spatial reasoning, potentially allowing work with about five orders of magnitude less data
  • The boundary between symbolic priors and diffusion priors is fluid and should shift based on domain requirements and customer feedback, not fixed by ideology
  • Multimodal integration (audio, vision, text, physics) within a unified world model enables emergent capabilities like spatial audio that generic stacked models cannot achieve
  • Game engines and code are cognitive tools that models can employ, enabling human intent injection and creative control in ways pure generative models cannot
Trends
  • Shift from pure scale-based approaches to hybrid symbolic-neural architectures for embodied AI and simulation
  • World models transitioning from research demos to commercial products in gaming and robotics training
  • Multimodal reasoning models becoming prerequisite for spatial intelligence and long-horizon planning tasks
  • Game engines and physics simulators being integrated as tools within AI models rather than replaced by them
  • Evaluation metrics for world models moving from proxy benchmarks to end-user utility and task-specific performance
  • Rendering becoming programmable and part of gameplay loop rather than derivative of game state
  • Data flywheel approach where creators and users guide model improvement priorities in production systems
  • Philosophical debate between vision-first (Yann LeCun/JEPA) and language-symbolic approaches (Manning/Moonlake) intensifying
  • Interactive, action-conditioned synthetic data becoming premium commodity for robotics and embodied AI training
  • Spatial audio and cross-modal consistency emerging as differentiator between integrated world models and stacked generative models
Companies
Moonlake
Startup building causal, multimodal world models for gaming and embodied AI with reasoning-first approach
NVIDIA
Mentioned as paying for interactive world generation and synthetic data for robotics and policy training
OpenAI
Referenced in context of large language models and their effectiveness in language understanding
Meta
Developing JEPA model at scale as alternative approach to world modeling focused on visual learning
Google
Genie demo mentioned as example of video generation without true interactive world understanding
Anthropic
Claude model mentioned as alternative LLM choice compared to GPT and Gemini
Physical Intelligence
Recent blog post cited for storing text-based world state for long-term memory in world models
DreamWorks
Inspiration for Moonlake company identity and name, representing creativity-focused AI approach
Pixar
Referenced through Ed Catmull's Creativity Inc. as example of creative computing integration
Disney
Walt Disney cited as first world modeler through theme parks and immersive environments
People
Chris Manning
NLP pioneer and Stanford professor bringing language/symbolic reasoning expertise to world models
Fan-yun Sun
Moonlake co-founder who worked with NVIDIA Research during his PhD on interactive world generation and synthetic data for embodied AI
Yann LeCun
Philosophical counterpoint on vision-first approach vs. symbolic reasoning for world understanding
Ed Catmull
Referenced for his book Creativity, Inc. on integrating computing with creative vision
Dan Dennett
Cited for concept of language as cognitive tool enabling extended causal reasoning
Brandon Sanderson
Fiction author example of world-building through changing single assumptions about reality
Ted Chiang
Science fiction writer example of consistent world modeling by changing one variable
Quotes
"The true opportunity is actually building reasoning models that can do these things like how humans do these things today."
Fan-yun Sun (early discussion of Moonlake genesis)
"Vision understanding sort of stalled out. You got to object recognition, and then progress just wasn't being made. If you look at any of these vision language models, it's the language that's doing 90% of the work, and the vision barely works."
Chris Manning (discussion of vision model limitations)
"An action-conditioned world model means you only actually have a world model if you can predict, given some action is taken, what is going to change in the world because of it."
Chris Manning (definition of world models)
"We're not going to be more creative than our users. We want to make sure that we're building things in a way that really allows them to express their intent."
Fan-yun Sun (discussion of human intent in world models)
"If there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you're going to be able to make a lot more progress, a lot more quickly."
Chris Manning (efficiency argument for symbolic approaches)
Full Transcript
I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models. I think it's for everything, including text-based models, right? Because, you know, in the early days, it seemed very easy to have good benchmarks because we could do things like question answering benchmarks. But, you know, these days, so much of what people are wanting to do is nothing like that, right? You're wanting to get some recommendations about which backpack would be best for you for your trip in Europe next month. It's not so easy to come up with a benchmark, and it's the same problem with these world models.
Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content. We've been approached by sponsors on an almost daily basis, but fortunately enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way. But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that work so hard to bring Latent Space to you each and every week. If you do it, I promise you, we'll never stop working to make the show even better. Now let's get into it.
Okay, we're back in the studio with Moonlake's two leads. I guess there are other founders as well, but Sun and Chris Manning, welcome to the studio.
Thanks for having us.
You guys have burst onto the scene with a really refreshing new take on world models. I just want to ask how the two of you came together. Chris, you're a legend in NLP and just AI in general. You're his grad student, I guess?
Actually, my co-founder.
Oh yeah. I should give a lot of credit to my co-founder, Sharon. She was actually working with Professor Pevele and Jojen, and then she ended up working with Ron and Chris Manning here. So I got connected to Chris initially through my co-founder.
What is Moonlake? Actually, I'm also very curious about the name, but why go into world models?
So I was working a lot with NVIDIA Research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. And there were two observations, one in academia and one in industry. In industry, folks at NVIDIA are actually paying a lot of dollars to purchase these types of interactive worlds, whether it's for the sake of evaluation or training the robots or policies or models. And in academia, the same thing is happening. More specifically, when I was working with NVIDIA on the Synthetic Data Foundation Model Training project, we were generating a lot of synthetic data and showing that, hey, these synthetic data are actually as useful as real-world data when it comes to multimodal pre-training. But then, like I said, there's a lot of dollars being paid out to external vendors or other folks to manually curate these types of data. It was very clear to us that, OK, on our way to, let's call it embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data. And the demand for those types of data is growing exponentially. But everybody's sort of thinking about it from a pure, say, video generation perspective or something else. But we feel like the true opportunity is actually building reasoning models that can do these things like how humans do these things today.
So that's a little bit on the genesis of Moonlake. And I think the reason I got into world models was partly a philosophical take on the world, where I believe in the simulation theory and stuff like that. But on the other hand, it's really just like, oh, there's an opportunity there that I feel like nobody's doing the way I think it should be done.
I can say a little bit about that. Yeah, so the overall goal is the pursuit of artificial intelligence. And most of my career has been doing that in the language space. And that's been just extremely productive, as we all know from the story of the last few years. I don't have to tell you how much we've achieved with large language models. But although they've been extremely effective for ramping up language and general intelligence, it's clearly not the whole world. There's this multimodal world of vision, sound, taste that you'd like to be dealing with, more than just language. And then the question is how to do it. And despite a huge investment in the computer vision space (as a research field, computer vision has for decades been far, far larger than the language space, actually), I think it's fair to say that vision understanding sort of stalled out. You got to object recognition, and then progress just wasn't being made. If you look at any of these vision language models, it's the language that's doing 90% of the work, and the vision barely works. And so there's really an interesting research question as to why that is. And at heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection with a more symbolic layer of abstracted understanding of visual domains, which isn't in the mainstream vision models, which are still trying to operate on the surface level of pixels.
I think in one of your blog posts you put it as structure and not scale. Is that a general thesis?
Yeah. Well, scale is good, too. Yeah, scale is good, too. Lots of data is good as well. And scale.
Nevertheless, you want the structure to be able to learn much more efficiently.
Yeah. The other thing I really liked also is you put out an example of what your reasoning traces look like, which you would still, as the word that comes to mind, I don't even think that's a good description. But it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, and what have you. But that is the next kind of example that involves, let's call it spatial reasoning, world model reasoning, as compared to normal LLM reasoning.
Yeah. But also, taking a step back, how do you guys define world models? A lot of people see, OK, you can do diffusion. You can do video generation. But you guys put out quite a few blog posts. You put out an essay recently, we can even pull it up, about efficient world models. You have a pretty structural definition here. But for the general audience that doesn't super follow the space, what's the difference between what we see from a video generation model and a world gen simulator? How do you kind of paint that landscape?
So I think this is actually a little bit subtle, because people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing. We've solved understanding the world because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals actually aren't accompanied by an understanding of the 3D world, understanding how objects can move, what the consequences of different actions are. And that's what's really needed for spatial intelligence. So I mean, a term we sometimes use is that you need action-conditioned world models, that you only actually have a world model if you can predict, given some action is taken, what is going to change in the world because of it. And in particular, that becomes hard over longer time scales.
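One way to write down the distinction being drawn here, as a rough sketch rather than Moonlake's own formulation: the symbols below are our notation, not the speakers'. Here $o_t$ is an observed frame, $s_t$ an abstracted world state, $a_t$ an action, and $H$ a planning horizon; the rollout formula additionally assumes discrete, Markov states.

```latex
% Observational video model: predicts the next frame from past frames only.
\[
  p_\theta\left(o_{t+1} \mid o_{1:t}\right)
\]
% Action-conditioned world model: predicts how the state changes given an action.
\[
  p_\theta\left(s_{t+1} \mid s_t, a_t\right)
\]
% Long-horizon consequence of an action sequence (assumes Markov, discrete states):
\[
  p_\theta\left(s_{t+H} \mid s_t, a_{t:t+H-1}\right)
  = \sum_{s_{t+1:t+H-1}} \prod_{k=0}^{H-1}
    p_\theta\left(s_{t+k+1} \mid s_{t+k}, a_{t+k}\right)
\]
```

The first expression never sees an action, so it cannot answer "what changes if I do this?"; the third shows why long horizons are harder, since errors in each one-step prediction compound across the product.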
So if you're simply trying to predict the next video frame, that's not so difficult. But what you actually want to do is understand the likely consequences of actions, minutes into the future. And to do that, you actually need much more of an abstracted semantic model of the world.
Yeah, the question comes where you want to have more structure than is available in just predicting the next token. And typically, well, let's call it the experience of the last five years has been that that is just washed away by scale, right? So what is the right middle ground here, where you don't ignore the bitter lesson, but you can also be more efficient than what we're doing today?
You know, one possibility is, look, if we just collect masses and masses and masses and masses of video data, this problem will be solved. Under certain assumptions, that could be true. But there are sort of multiple avenues in which it could not be true. The first is that what's really essential is understanding the consequences of actions, producing an action-conditioned world model. And if you're simply collecting observational video data, which is the easy stuff to collect when you're sort of mining online videos, you don't actually know the actions that are being taken to see how the video is changing. And so if you're never collecting actions directly and you're having to try and infer them from what happened in the observed video, that's not impossible, but it's very hard, and it's not really established that you can get that to work at any scale yet. And so there's a lot of premium on collecting action-conditioned video data, which is part of why there's been a lot of interest in using simulation, so that you can be collecting data where you do know the actions, which is in quite limited supply. But there's also the limit of as much data as you could possibly have. You know, maybe the problem is eventually solvable.
But even though we collect huge amounts of text data, text data is always at a great level of abstraction, right? Language is a human-designed abstracted representation where there's meaning in each token and it's representing an abstraction of the world, right? As soon as you're describing someone as a professor and as soon as you're saying that they're condescending, right? You know, these are very abstracted descriptions of the world. It's not at the level of what you're observing at the pixel level. And so to get to that kind of degree of abstraction starting from pixels is orders of magnitude of extra data and processing. And so although, you know, we absolutely want to exploit as much data as possible and use the bitter lesson, nevertheless, if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you're going to be able to make a lot more progress, a lot more quickly. And that's the bet here. And so you could just say that's only wanting to be able to, you know, do it more efficiently, do it more quickly, do it more cheaply. But I think it's actually more than that. I think one should be making the analogy to how human beings work. At one level, you know, yes, we have these high resolution eyes and we can look and see a scene like a video. But all of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed, right? You're doing fairly fine processing of exactly what you're focusing on. But, you know, as soon as it's away from that, it's, yeah, there's another guy over there, and you're sort of only processing top-down this very abstracted semantic description of the world around you. And so, you know, that's what human beings are doing. They're working with semantic abstractions. And so I think it is just the right representation. Because we also have other goals. We want to be able to do, you know, real-time worlds.
That means there's a limit to how much processing you can do. And we want to do long-term planning and consistency. And again, that favors abstraction. I mean, I guess there was actually a recent blog post that came out from our friends at Physical Intelligence. And, you know, they were sort of heading in the same direction. They were saying, oh, to the pain model. Yeah, to maintain a long-term memory of what's happening in the world, so we can do longer term, we're actually storing text of what's, you know, been happening in the world, right? It's not such a successful strategy to try to keep it all at a pixel level.
And yeah, I mean, you can see it in video models, like that temporal consistency. We're at a scale of train on, you know, all the video data we have. We have it for maybe 30 seconds, a few minutes. That's not the same as a game state played for half an hour, right? I thought you guys break it down pretty well. You have a blog post about building multimodal worlds with an agent. I don't know if you guys want to talk about this. This is one of the things I read.
Yeah, as soon as the thing I talked about with the reasoning chain.
Yeah. So there are different phases to this. It seems like it's more of an agent, a scaffold, a very different approach than just, you know, typing a prompt, where you don't have the same consistency. Also, for people that are listening, I would highly recommend reading it. It breaks down the problem in a different light, right? So what do you need to consider when you're talking about video, like world game models, right? What do you need to consider? What are the factors? What are the elements? What's the state? So I don't know if you guys have stuff to talk about for this one.
Yeah. Actually I wanted to add on a little bit to our previous point, which is just that I do feel like sometimes people get confused, like, oh, they're taking an approach with abstraction.
That means they don't believe in the bitter lesson. That's just false, right? We are believers in the bitter lesson. But then I feel like the question that we always discuss is, what is the right abstraction level today? The analogy I like to make is, let's just say we can encode and decode, represent all of images, videos, audio in bytes. Then the most bitter-lesson approach is to train a next-byte prediction model as opposed to the next-token prediction model, where it's just like, okay, it's natively multimodal. But, to Chris's point, it's about the scale and compute you need to achieve that. So that's why we always come back to, okay, what is the most efficient way to do it? And reasoning models, to the point of this blog post, are a showcase of, hey, we're actually just reasoning about the world and reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.
Yeah, it's like you're improving the encoder of whatever you're trying to model, and a better representation would just represent the important things in less space, which would just be more efficient. So yeah, I fully agree that it is not antagonistic to the bitter lesson. Do you want to mention one more thing?
Are there any philosophical differences with the JEPA stuff that Yann LeCun is working on? I gotta go there. You're mentioning some latent abstraction. I'm like, okay, fine. Let's talk about it. Right. It's an elephant in the room.
Yeah. There are philosophical differences. Yann LeCun is a dear friend of mine. But he has never appreciated the power of language in particular or symbolic representations in general. Yann is a very visual thinker. He always wants to claim that he thinks visually and there are no word symbols or math in his head. Maybe that's true of Yann.
It's certainly not the way I think. But at any rate, you know, the world according to Yann is that the basic stuff of the world and of intelligence is visual, and language is just this low bit rate communication mechanism between humans, and it doesn't have much other utility, and it's far inferior to the high bit rate video that comes into our eyes. And I think he's fundamentally missing a number of important things there. Right. Think of this evolutionary argument, looking at animals, right? The closest analogy is the things with chimps, right? So chimpanzees, you know, have fairly similar brains to human beings. They have great vision systems. They have great memory systems. They've got, you know, better short-term memory than we do. They can plan. They can build primitive tools. But, you know, humans are massively ahead in what we understand about the world, what we can plan, what we can build. And essentially what took off for us was that humans managed to develop language. And that gave a symbolic knowledge representation and reasoning level, which just gave this sort of vaulting of what could be done with the intelligence in brains. So the philosopher Dan Dennett refers to language as a cognitive tool and argues that, you know, humans, unique among the creatures in the world, have managed to build their own cognitive tools. Language is the famous first example, but other things like mathematics and programming languages are also cognitive tools. They give you an ability to think in abstractions, in extended causal reasoning chains, and that allows you to do much more. And we use that for spatial representation and intelligence and planning and gameplay as well.
So we believe, and this is, you know, underlying the specific technologies that Moonlake is making, that symbolic representations are powerful and you want to use that in your understanding of the visual world, when you want a causal understanding, when you want to maintain long-term consistency and prediction. And, you know, as I understand it, that's just not in Yann LeCun's world view. So I think that's the fundamental philosophical difference. Then there's the specific model he's been advancing, JEPA. I mean, that's a reasonable research bet as a direction to head in for building out a model of the visual world. To my mind, it's sort of one reasonable research bet. It's not really established that it's the best one that everyone should be following.
At least it's developed at scale at Meta. But it's not just vision, right? I mean, JEPA is, you know, just joint embedding prediction, which can be applied to anything, really. And people have done it. The argument is that there is a latent representation that is probably more suited to the task, so why not let machines do it for us instead of pre-defining it at all? And isn't something like a JEPA-shaped thing the right answer? And if not, why not?
So I think there's a part of JEPA that's right, which is you do want to have a joint embedding that gives you a consistent model of the world. And Yann's argument is you can never get that from auto-regressive language models because they're sort of left to right, churning out one token at a time. I guess this is where we're at, you know, the research arguments of the field. You know, I'm not actually convinced that's right, because although the token production is this auto-regressive process that's heading, you know, left to right (I guess it doesn't have to be left to right, but anyway, in a sequence of tokens; we can have right to left for Arabic).
You know, although that's true, all of the weights of the model that are internal to the transformer, they are a joint model of the model's understanding of the world. And so I think you can think of the weights of the model as a form of joint representation, and therefore it is plausible to think that that could be the basis of a world model which avoids Yann's objections.
I think I follow. And obviously that will touch on what Moonlake eventually ends up doing as well, right? Which is hard to tell, because you put out the end results, but we don't know the inputs that go into it. So, you know, that's something that we have to figure out over time. Yeah. I mean, I guess this kind of breaks down some of the outputs. Do you want to walk us through it?
Yeah. So this really just walks us through the reasoning traces of, okay, let's just say we want to build a world. In this context, it's really just a game demo that shows the variety of interactions that this world model can build. And yeah, it's really just the reasoning traces of, okay, you're prompted to create a bowling game. How did it achieve what you saw, that level of causality, interaction and consistency, right? So yeah, this is almost just like an example of the reasoning traces.
Very detailed. Very, very detailed. Like you don't even realize it, right? When a video is generated, what happens when a ball strikes a pin, right? So first, there's audio in that, audio triggers happen, the score increments, the world changes, pins start dropping. There's a timer that goes on. You know, it's just very similar to how we're now used to reasoning for language models. There's a whole state of what happened. So geometry, physics, all this stuff. And then there's kind of that single prompt. So asset, physical education, all this stuff. It's a nice view to see what's going on.
I think Sun is also too polite to point out that both Google's Genie demos as well as World Labs' Marble do not have interactive worlds.
That's the benefit of having a reasoning model, right? Because you can say, oh, maybe in this particular context, I want to learn how to bowl. And then you can say, okay, then what is important when it comes to learning how to bowl? Okay. Maybe it's that I need to understand the basics of physics and I want to throw it over them. I want to know that when it resets, it's a new game. So basically, you know to pick up the ball, you know the ball's going to cause the pins to fall down. You know that what's important to this particular bowling game is to score. And then you know that the score corresponds to the number of pins that fell down. So if it's a model that sort of knows what a bowling game looks like, but doesn't actually allow you to practice over and over again and to understand what it takes to actually get a high score, then it doesn't actually allow you to learn what you set out to learn within the world model. Right. And I think this is really just one example showing the advantages of the approach that we're taking over most of, let's call it the zeitgeist today, when people talk about clinical world models.
Right. So it sort of seems like the question to ask when there's a world model is: can I not only just wander around the world and look at the beautiful graphics, can I interact with the objects in the world and see the right consequences of actions?
And you also understand what the consequences would be if you do something, right? So it's not just like, OK, there's one thing, if I pick it up, something will happen. But, you know, there are 50 options.
And I know I can infer what would happen if I do any of them. Right. So it's very different when you can actually play around with it.
There are two cheeky elements to that. I mean, I guess the less ambitious one is, let's really establish for listeners, why is this fundamentally different than writing Unity code? Right. Like just creating a model to translate a prompt into Unity code.
So there is an underlying physics engine. Yeah. In that sense, there are some overlapping things with Unity. But the way we think about it is that physics engines or tools or code are cognitive tools, borrowing Chris's term, right? Tools that the model can employ as means to an end. So today, maybe you say, OK, in this particular context, we care about physics, we care about the long-term causal consequences. Then yes, we employ a physics engine. And then maybe tomorrow we say, OK, we're training, let's just say, drones, where we only really care about fluid dynamics and the visual aspect of the world. Then yeah, maybe the model actually doesn't have to use a physics engine, or maybe it employs other types of representation than a physics engine to achieve the task. So yes, writing code for Unity is sort of similar to a tool that our model can employ. But our goal is for the model to take a representation-conditioned reasoning approach or process internally.
Using these things is just like general tool calls, which I think is very interesting. The other more ambitious one is some kind of recursive element where it becomes multiplayer. Here, there are single-player elements. You're not modeling any other people involved, and that is a whole other thing.
But in fact, we can already do multiplayer.
Oh yeah, I haven't seen any demonstration.
So if you just prompt our model to say, hey, configure it to multiplayer, then it'll do that. You'll be able to configure multiplayer. It sets up a persistency database for you.
Easy. Yeah.
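The bowling discussion above can be made concrete with a small sketch. This is an illustration only, not Moonlake's system: `BowlingState`, `step`, and `toy_physics` are hypothetical names, and `toy_physics` is a deliberately crude stand-in for whatever physics engine the model might call as a tool. It shows a symbolic world state, an action-conditioned transition (throw or reset), a score tied to fallen pins, and the ability to ask "what would happen for each of my options?" without rendering a single pixel.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Tuple

ALL_PINS: FrozenSet[int] = frozenset(range(1, 11))


@dataclass(frozen=True)
class BowlingState:
    """Symbolic world state: no pixels, only what matters for the game."""
    standing: FrozenSet[int] = ALL_PINS   # pins still upright
    score: int = 0                        # number of pins knocked down so far
    events: Tuple[str, ...] = ()          # symbolic side effects (audio, UI)


Action = Tuple[str, Optional[int]]        # ("throw", aim) or ("reset", None)
PhysicsTool = Callable[[FrozenSet[int], int], FrozenSet[int]]


def toy_physics(standing: FrozenSet[int], aim: int) -> FrozenSet[int]:
    """Hypothetical physics tool: the ball fells the aimed pin and its neighbours."""
    return standing & frozenset({aim - 1, aim, aim + 1})


def step(state: BowlingState, action: Action,
         physics: PhysicsTool = toy_physics) -> BowlingState:
    """Action-conditioned transition: returns the next state, never mutates."""
    kind, arg = action
    if kind == "reset":
        return BowlingState()             # resetting starts a new game
    if kind == "throw":
        fallen = physics(state.standing, arg)
        new_events = ("audio:pin_hit", "score:update") if fallen else ()
        return BowlingState(
            standing=state.standing - fallen,
            score=state.score + len(fallen),
            events=state.events + new_events,
        )
    raise ValueError(f"unknown action: {kind}")


# Because the transition is explicit, every option can be evaluated up front.
start = BowlingState()
outcomes = {aim: step(start, ("throw", aim)).score for aim in range(1, 11)}
```

Swapping `toy_physics` for a different callable changes the dynamics without touching the game logic, which is one way to read the "physics engine as a tool the model can employ" point.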
So what are some of the current limitations, and where are we at? So there's one approach of, okay, scale up video predictors. Obviously there are data issues. You know, with approaches like this, is it data constraints? What are the next steps? Is it real time? So there's one side of, you know, write an agent to write Unity code. But okay, I want to be streaming a game in real time. I want to have characters also being agentic. Where do we kind of see this scaling up? Right?
Yeah, there's definitely a data constraint. The more data, the better this reasoning model can basically act as humans do, to operate a variety of tools and software to build whatever is necessary. And then there's a sort of fidelity constraint, which we're actually solving with another model, Reverie, which we can talk about later. It's not as easy to get to photorealism with the approach that we're taking, but we think there are better solutions to that, which we can dive into later.
The one thing you note here is it's a diffusion model, right? So there are a few approaches, diffusion, Gaussian splatting. Yeah. So Reverie, the diffusion model, do you guys want to introduce it?
Yeah, totally. So within our world modeling framework, there are two models that we train, right? There's the multimodal reasoning model that we just talked about, which essentially handles mainly the causality, the persistency and the logic determinism of the world. And then Reverie is our bet on saying, okay, while that model can take care of all these things that we just talked about, its limitation compared to existing, say, video models is that it doesn't have as high a pixel fidelity right out of the gate, right? And Reverie is to say, hey, we can actually take whatever persistent representation we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles or arbitrary styles you want.
So this model is almost to say, hey, I'm going to respect the persistency and interactivity of the world that you created. But my only job is to make sure that its pixel distribution is close to what we want. Yeah. Yeah, for example, you cap the KL divergence. No, no, I mean, this is a classic way you don't stray too far from the source material, you cap the KL, which is kind of cool. Yeah. I mean, the difference is, and Sun was pointing at this, that it's in one way a more difficult path, but a better path. Typically the diffusion models are producing the whole scene and it looks lovely, but there isn't spatial understanding behind it, which is what allows for real-time graphics gameplay, the spatial intelligence, understanding consequences in worlds. Whereas this is taking a path where it is assuming an abstracted semantic model of the world, the world state, and the diffusion model is then used on top of that to produce the high-quality graphics. Is there an intended practical or business use for this? Or is it a demonstration of capabilities? We actually believe that this is going to be the next paradigm of rendering. So it's going to replace rasterizers, it's going to replace DLSS today, because it not only has these pixel priors learned from the world, such that you can literally play any game in photorealistic styles, which is a lot of people's desire with GTA, right? All the mods, all the people adding perfect lighting and all this. So skins for worlds, let's call it skins. Skins for worlds. You can call it customization. You can play it how you want, right? Yeah, exactly. And I think another thing that we really pointed out specifically in this blog is the programmability of it. Right. So what this means is that, historically, the renderer is always a derivative of the game state, right? You're saying, here's the game.
Say, I'm rendering out a frame. But here I'm saying this renderer can actually be part of the gameplay loop. I can say something along the lines of: upon getting 10 apples, my weapon of choice, my bullets, are going to turn into apples. And that's possible because we can basically have certain game states dynamically trigger the preconditions to the renderer, such that the rendering is now part of the game loop. So one thing is to just say, OK, it's the appearance. But the second thing is that there are these novel interactions that are possible because this renderer now actually has priors of the world. It is up to the artist to figure out what to do with it. It is up to the creators. Yeah. And I also think that's another big argument that we're making, and the reason that we're taking the bet we're making: a lot of the time, whether it's for embodied AI or gaming, you want a layer where humans can inject their intentions, right? So, for example, in the context of gaming, it's obviously my creative intent, but in the context of embodied AI, it's like, oh, I take this foundational policy and I want to fine-tune it to deploy in my house. So you want a layer where a human can say, here's the distribution of things I want to create to achieve my goal. And I think 3D graphics as it is today is basically a layer for people to say, hey, what do I care about in this world? It allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, hey, I'm going to generate something arbitrary with just prompts. You know, it's one of those things where I think you're going to build up a series of models, right? This is just one of them; this is probably the highest-utility or highest-frequency one.
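The "10 apples" example above can be sketched as game state feeding conditioning into the renderer on every tick, rather than the renderer only drawing whatever the state already is. This is only an illustrative sketch: `GameState`, `render_conditioning`, and `tick` are hypothetical names, and the dictionary stands in for whatever conditioning signal a learned renderer such as Reverie would actually accept, which the conversation does not specify.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    apples: int = 0
    bullet_style: str = "default"


def render_conditioning(state: GameState) -> dict:
    """Derive renderer conditioning from game state (hypothetical interface).

    Once the player holds 10 or more apples, bullets are rendered as apples.
    """
    style = "apple" if state.apples >= 10 else state.bullet_style
    return {"bullet_style": style}


def tick(state: GameState, picked_apple: bool) -> dict:
    """One step of the game loop: update state, then condition the renderer."""
    if picked_apple:
        state.apples += 1
    return render_conditioning(state)


state = GameState()
frames = [tick(state, picked_apple=True) for _ in range(10)]
print(frames[8], frames[9])
```

The design point is that the rule lives in the loop, so a creator can change when and how the look of the world responds to play without retraining anything.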
I don't want to call this one where, yeah, you can immediately drop this in on any game and you don't need anything else that you guys do. But I could see that. I think the human intent is something that people are not even used to, because we're so used to static worlds or, you know, worlds that just don't react. I don't know, you're kind of blowing my mind right now. I wonder if you've talked to people at GDC, and what are they going to do with it? Yeah. The stance that we take on this front is that we're not going to be more creative than our users. So ship it out. But we want to make sure that we're building things in a way that really allows them to express their intent. The thing that you said about "here's the distribution that I want": I think text may be too low a bandwidth to really demonstrate that, because I'm probably just going to want to drop in a bunch of reference assets and then you can figure it out from there. We want to do a mixture of both, right? You throw in a few images: I want this style, I want it to look like this. It's a mixture, right? I think it's a mixture. I mean, yeah, there's clearly a visual component to this. It's not that everything can be text, because of course you want to give a visual look, but there's also a massive amount about the overall picture of the look of the world and the behavior of things that you can express in a few words of text and that would be very time-consuming and difficult to convey by visual means. So I think, yeah, you want a combination of both. So one question I have is how do we go about evaluating world models? There are many axes, right? One is, OK, I have preferences; how well do we adhere to prompts? One is the simulation. One is, is there core logic that's broken?
So coming from diffusion, we know how to evaluate: there's fidelity, there's stuff like that. But what are some of the challenges that most people probably aren't thinking about? Yeah, I think this is a great question and probably one of the hardest questions in world models, because I think it always comes back to what you are building this world model for. Depending on your end goal and purpose, the evaluation should differ. So in the context of games, the most direct way of measuring is how much time people are actually spending in this world that you create. And if your goal is, for example, in the context that we just talked about, deploying an embodied agent, then your end metric is: after training in these worlds that you generate, how robust is it when you actually deploy to the target environment? But it's hard to measure these end metrics. So today, people have what I call proxy metrics, which try to measure what we really care about, which is the end metrics. But then, frankly, it's different for every use case. Which seems like quite a challenge. In language models or video models or image models, your benchmarks are proxies. People aren't actually asking instruction-following or tool-use questions. They're proxies for how well it will do downstream. But for this, should teams, should companies have their own individual benchmarks outside of games? If you think of stuff like video production, movies, stuff like that, that also wants to use world models, should they sort of internalize their own proxies? Is that something you guys do? Where does that kind of land? I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models. I think it's for everything, including text-based models, right? Because, you know, in the early days, it seemed very easy to have good benchmarks.
Because we could do things like question-answering benchmarks: could you answer the question based on these documents, and the various other kinds of, you know, pieces of logical reasoning or math. And there are sort of visual equivalents, things like object recognition, right, for these small component tasks. But these days, so much of what people want to do with language models is nothing like that, right? You want to have an interaction with the language model and get some recommendations about which backpack would be best for your trip to Europe next month. And it's not the same kind of thing, right? It's not so easy to come up with a benchmark as to whether this large language model gives you an effective interaction for guiding you in a good way for shopping, right? And it's the same problem with these world models. So if we take the game design case, success is that a game designer can produce what they are imagining in a reasonable amount of time. That's really the macro task. But that's a very hard thing to turn into a benchmark. And I think a lot of this is actually going to turn into people voting with their feet, right? I guess that's what's happening at the large language model level, right? When people are choosing to use GPT-5 or Gemini or Claude, individuals are trying out these different models and deciding, oh, I like the kind of answers that GPT-5 gives me, or no, I feel like I get more accurate detail from Claude, right? A lot of the time it's like vibe checking, I realize that. But it's actually where people feel it's giving them utility in what they want. Right. And the interesting thing there is that a lot of people prefer the visual, right? "This looks pretty," which is not the objective of what this is for, right? If a game designer is working on something, they care about the game engine.
It can look like whatever; you can fix that up later. Or you can have a really good game state and quickly edit it into 20 different versions while keeping the state. Right. So that's a really important distinction, and it speaks to Moonlake's strength, right? So, yeah, great visuals are lovely to look at for a few seconds. But games are really all about the concept, the gameplay. And a lot of the time that doesn't even require great visuals. There are lots of very successful games which have relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals and the game sucks, right? So keeping those two axes apart is really important in thinking about what matters in a world model for different uses. This conversation is reminding me of some game review and fiction discussions I've had in my non-AI-related life. Some people might know Brandon Sanderson, who's a very famous fiction author and a big game reviewer. He's a big fan of video games where you change one thing you might normally assume about the world. For example, Baba Is You, I don't know if you've come across that, where the rules change as you play the game, and also games where you can do things like reverse time selectively or change gravity selectively. And this also reminds me of other kinds of world models that are created by authors. Ted Chiang is my typical example: he will take the world that you know today, change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to say that it's easy to create alternative worlds that don't exist, where you change one thing. And then let's run a whole bunch of people through it to see if it works.
My first answer would be that that seems a lot easier and more conceivable to do using technology like Moonlake's than with some of the other world models out there. Whether Sun can actually make it happen, I'll let him give the second answer. I guess for you, you're constrained by the game engine tool, right? At the end of the day, that's the thought partner that you have. If I ask for something where nothing is ever allowed to reverse time, or gravity only ever works one way, then well, that's it. But sometimes gravity might change. But it's a lot easier to change with code, as opposed to a model that is learned primarily on data of the real world and virtual worlds. I guess, for example, Genie: there's a lot of real-world data and a lot of virtual gaming data. And maybe it's easy to say, OK, I want to change the visuals and the time period of the world, but you can't change gravity, for example. I feel like you can, within slight bounds, right? Everything comes down to code being a better way to execute it. But the models aren't that diverse and creative, right? You can say, OK, make gravity slower. It can do that. But it's limited by your representation of how you write it out in text, right? They're only going to do a few iterations, whereas programmatically, if there's a game engine under the hood, you can kind of go wild, right? So one of the limitations of most models is that they're very overtrained to one style, right? And extracting diversity is pretty difficult. At least that's something we've seen. I mean, are there other examples you have in mind where, with existing models, it would be easier to do something without using code?
Like certain types of creative intent, or, you know, state transitions. Clipping. Other world models are very good at clipping through things: clipping my legs, clipping through a rock, because, you know, it's just bad. You would have to struggle very hard with your stuff to actually make that happen, which I think leads to a topic that you actually prepared, on Gaussian splatting versus the other stuff. Yeah, just for those not super familiar, right? There's Gaussian splatting. There's diffusion. What works, what scales up? I feel like in February when Sora 1 came out, the blog post was literally titled, bring it up, "Video generation models as world simulators." It's super bitter-lesson-pilled. Yeah, a lot of it is emergence, right? So not to go through their blog post, but basically their whole thing was that as you scale up, all this consistency, all this stuff just kind of gets solved. It's a very simple premise, right? They just scaled up diffusion. And from there, you know, this is Feb 2024. It's already been two years, which is basically five years in AI time. How much more do we need to just scale up, or do we hit a data cap? But I think we already talked about this a lot, right? This is back to the beginning discussion of what's appropriate for the time. And that seems like your approach, right? Yeah, the point I'm trying to make is that there are many, many different types of world simulators, and having a world simulator that can produce pixel coherency is very, very useful for games and marketing and all these things, but it's not as useful as people think when it comes to causal reasoning, when it comes to embodied AI. And yeah, this title is true.
We're not saying that it's, you know, not a great world simulator. But actually, in the blog that we wrote, the bet is more that there's going to be a disproportionately large share of value in real-world tasks and virtual tasks where high-resolution pixel fidelity is not needed. And yes, video models have their value. Yeah, this is at the absolute limit of my physics understanding. But one example that comes to mind is basically having to solve the equivalent of a three-body problem in a deterministic world, whereas the video models would just approximate it well enough. Yeah. Right. There's some point at which your approach kind of runs into, well, you now have to simulate the world, please, thank you very much. And you're trying to do that, but only to the extent that the game engine lets you, and game engines cannot do some things. Yeah. No, I mean, I think the more interesting or more technical question here actually is where you draw the boundary between what's handled with, say, a diffusion prior and what's handled with symbolic priors. Yes. OK, right. Because this boundary can actually be fluid. I think maybe what you're trying to get at is, OK, people are saying pixel priors for everything. But what we're saying is, OK, there's a boundary that we draw where we think it provides the most economic value for the domains and things that we care about today. And it's something that we do internally all the time, which is, OK, given new equations that we learn, or new elements of the world that we learn, or maybe some other knowledge that we acquire in the process of developing the models, should we still be maintaining this line exactly as it is today, or should we move it a little bit left or a little bit right? Right.
Like sometimes we realize that, oh, maybe customers or folks want certain things that are better handled with a pixel prior as opposed to a symbolic prior. Yeah, your skin thing is an example of moving it right or left. Yeah, I don't know which is left or right. Yeah, yeah. The Reverie model. Yes. Actually, we have a few iterations of them. They're each slightly different. I know. You should do that. That's a cool dimension to show. Yeah. Is quantum mechanics the diffusion prior of our world? Right. That's the boundary of classical mechanics versus quantum, right? That's it. Right. At one point, God plays dice, and at the other point he doesn't. I don't know. Of course, you want to say it, but I think generally I feel like physics is better with symbolic priors. Even quantum physics? Even quantum physics. Yeah. This is starting to get into MLST territory. That's just what I call it, where he likes to get philosophical. We're quite friendly. I mean, we need to get to the singularity. I heard some of that. No, I think that is actually really helpful. And I mean, I just want you to productize this. As a product guy, as a gamer, I'm not a researcher, you know. It's cool, this is theoretical, you have a very good way of thinking about these things. But I just want to see you, you know, express it. I do think fundamentally, when you open up new tools like, OK, use human intent and incorporate it into how you render, well, artists are going to have to take two to three years to figure out what to do with this. And you just don't know. But I think this gives a much more approachable and controllable world, through the beauty of NLP, that will enable it to be adopted and used. And we're very hopeful about that. Yeah, yeah.
I mean, we are very focused on commercialization, actually, in the sense that we really do believe in the data flywheel approach, where we put this in the hands of the creators and the users, and then they will teach us what capability or model should improve. And that's why we actually have, you know, products in beta. Yeah, focusing on gaming. What's the adjacent thing to gaming? Embodied AI, basically. So maybe I'll start with where we see the platform in three years, which is, OK, the users would tell us what they want to achieve. The end goal could be, hey, I just want to make something to teach my kids the value of humility, or it could be, hey, I want to fine-tune my drones to be really good at rescue situations. It could be vacuum robots: I want to train my manipulation or vacuum robot to be very robust in my office, right? Whatever it is, it works very robustly in my office. But whatever end goal you want, our world model will say, OK, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want. Right. Maybe for the purpose of games, it's just the end simulation, and that's the end product. For certain policies, it's like, I can train it within these environments and then help you see where your policy is failing or not. So I think in that case, it's much more of a training tool than an evaluation tool? Both, right? Sure. Same thing. Yeah, I think it's just this world model that allows people to train any policy that can act in any multimodal environment. Would it be harder to reward hack? Is there an angle here where it is harder to reward hack?
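To make "generate a distribution of environments such that I can train and evaluate" a little more concrete, here is a toy sketch. Everything in it is an assumption for illustration: `EnvConfig`, `sample_envs`, and `toy_policy_succeeds` are hypothetical, the two parameters (clutter and lighting) are made up, and the "policy" is a hard-coded rule standing in for a real rollout in a generated world.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical parameters of one generated office environment."""
    clutter: float   # 0..1, fraction of floor covered by obstacles
    lighting: float  # 0..1, 0 is dark and 1 is bright


def sample_envs(n: int, seed: int = 0) -> list:
    """Draw n environments from a simple uniform distribution."""
    rng = random.Random(seed)
    return [EnvConfig(rng.random(), rng.random()) for _ in range(n)]


def toy_policy_succeeds(env: EnvConfig) -> bool:
    """Stand-in for a rollout: this toy policy fails in dark, cluttered rooms."""
    return not (env.clutter > 0.7 and env.lighting < 0.3)


def evaluate(envs: list) -> tuple:
    """Return the success rate and the environments where the policy failed."""
    failures = [env for env in envs if not toy_policy_succeeds(env)]
    return 1 - len(failures) / len(envs), failures


envs = sample_envs(200)
success_rate, failures = evaluate(envs)
print(f"success rate: {success_rate:.2f}, failing environments: {len(failures)}")
```

Inspecting `failures` is the "help you see where your policy is failing" step; in a real system the same list could be used to bias the next round of environment generation.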
I'll just put it generally, because I think that's obviously a key problem that a lot of people face when they're training agents in these environments, and I don't know, can you solve it? I think not necessarily. I mean, to the extent that there's a mis-specified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world. I don't know if Sun's got any thoughts, but I don't think that's really being solved. The other thing that comes to mind is you could just build a better Sora as a video generation model, right? Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct. And that's it. It's better on some domains, right? Like on consistency over an hour, for sure, something exists versus something doesn't, right? Yeah. Is your question more like...? I'm just riffing on what can you build with the stuff that you have. I do think that the mind of an academic does go immediately to training and evaluation, but art tends to take unusual directions, where you might end up. Yeah, but the question is, can you use this piece of software to develop compelling gameplay? And I don't think you can take Sora and produce compelling gameplay, right? If you want to have a world that you can wander around in a bit, you're good. But what are your abilities to have gameplay mechanics implemented the way you'd like them to be, and to have things persist, you know, with the long-term history of your gameplay influencing future actions? I think there's just nothing there for that. Yeah, I do tend to agree. I'm just trying to sort of test the boundaries. I would also make the observation that as the triple-A games industry has developed, the line between what is a movie and what is a game has blurred. And you do end up basically producing a two-hour movie as part of your game.
Now, honestly, there are actually so many applications in adjacent markets that our model can go into. Yeah. But yeah, it's sort of fun to riff on. Although on the execution side, we need to stay focused on, OK, what are the capabilities we want to unlock over time? And there's a roadmap for that. But yeah, we're just riffing on the possibilities. I feel like they're endless. Yeah, classic. And the embedding for "possibility" and "endless" in my mind is very close. Yeah, I do want to focus on one, like, weird choice. I don't know if it's weird. Maybe I've got something here. Audio, right? You could have just said no audio. And audio in my mind has a lot of recursion, whereas in video, you can just do raycasting, and that's computationally much simpler. Audio just seems way harder. I don't know if you want to comment on the spatial 3D audio problem. Did you really have to do it? I guess you do, to be immersive. But a lot of people do treat it as, well, we just stick a TTS model on top. Well, there's a lot more to game audio than just speech, right? It's not just TTS. Yes: TTS, SFX, BGM. Spatial, in my mind, is echoes and reflections. And I don't even know what else. I don't know what the problems in this space are. Yeah, I think this is sort of pointing to the benefits of using a game engine as a tool that's available to the model. Right, because part of the spatial audio comes from the code that is underlying the simulation. And while we do give our model access to other types of audio models as tools, none of them would be spatial, I think. Right, but that's exactly the point: we're giving our model an abstraction or a suite of tools such that it's able to achieve that. And you can argue that spatial audio is sort of an emergence out of the tools and abstractions that we provide to the agents.
And I think that's the beauty of this approach. It's kind of like how humanity has built technology: they're like Lego blocks that build on top of each other. And it's the same thing here. There are going to be things that just emerge from being able to put these things together in combinatorially interesting ways. Right, this integrated audio model exploits the understanding and semantics of the Moonlake world, right? Whereas in general for the gen AI video models, there's no actual integration with the audio tool, right? Someone might stick some music or a soundscape or whatever else on top of their video, so it's not a silent video, but they're in no way connected into a consistent world model. There's nothing that says, OK, an action is happening in the video, therefore there should be a sound coming from this part of the visual field. Yeah, is that different from Sora 2? Does it not have audio? Not to say it's not, but there's no spatial audio. It doesn't. No. I've played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it. I've tried, OK, generate a dog at the beach, with reactions to a big wave, and move. Yeah, definitely. So have the dog move away from the camera and see if the sound goes down. Right, it doesn't, right, because they don't have spatial audio. The model we're training is basically working towards the goal of having a combined representation across all these different modalities, right, such that you can reason across them. So, for example, if I close my eyes and you play a sound of cars skidding away from me, I can almost visually extrapolate that trajectory in my mind. And that's the type of capability we want our model to have. Right.
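The "have the dog move away from the camera and see if the sound goes down" test relies on audio being computed from positions in the simulated world. Below is a minimal sketch of that idea, assuming a 2D world, a listener facing the +y direction, inverse-distance attenuation, and constant-power stereo panning. The function name `spatialize` and its parameters are hypothetical; this is a textbook simplification, not a description of how Moonlake or any particular game engine implements spatial audio (echoes and reflections are ignored entirely).

```python
import math


def spatialize(listener, source, ref_dist=1.0):
    """Return (left_gain, right_gain) for a sound source in a 2D world.

    Gain falls off as ref_dist / distance (clamped so it never exceeds 1),
    and the signal is panned by how far left or right the source sits.
    """
    dx = source[0] - listener[0]
    dy = source[1] - listener[1]
    raw_dist = math.hypot(dx, dy)
    gain = ref_dist / max(raw_dist, ref_dist)

    # pan in [-1, 1]: -1 is fully left, +1 is fully right, 0 is centered.
    pan = dx / raw_dist if raw_dist > 0 else 0.0
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to [0, pi/2]
    return gain * math.cos(angle), gain * math.sin(angle)


listener = (0.0, 0.0)
near = spatialize(listener, (0.0, 2.0))    # dog close, straight ahead
far = spatialize(listener, (0.0, 20.0))    # dog has run away from the camera
right = spatialize(listener, (5.0, 0.0))   # dog off to the right

print(near, far, right)
```

Because the gains are derived from the same world state that drives the visuals, the sound gets quieter as the source moves away without anyone layering audio on afterwards.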
And that's the reason that we're taking this multimodal reasoning approach. It's like we want to combine the latent space that can... Yeah. Oh, you said latent space. We like that here. We have to ring the bell every time someone says latent space. No, you've got to train the Daredevil one, where it's only audio, but you have to work out where everything is. Cool. I think that was about it for our Moonlake coverage. I do think that we have a couple of Chris Manning questions on IR and any other sort of attention topics or NLP topics. OK, well, I mean, yeah, it's just fun. You know, we talked a bit about how you guys met, but you are basically the godfather of NLP per se, right? You've spent your whole career on it, from early embeddings to early attention. You did 2015 attention for machine translation, everything. You did information retrieval, so RAG before RAG. We just want to shout that out and admire a lot of that. Right. So what prompted the switch over to world models? How did all that come about? To some extent, it is the enthusiasm and creativity of students. But there's a bit of a history there. Right. So yeah, clearly most of my career has been doing stuff with language. And how I got into research was thinking, oh, it's just so amazing how humans can produce speech and understand each other in real time, and somehow they manage to learn languages when they're kids. How can this possibly happen? And so, yeah, starting off, I was very focused on language. But as it got into the 2010s, I'd been working on question answering, and then I started to get interested in visual question answering. And that was an area where it was very noticeable that the visual understanding was bad, right? These were the days when it sort of seemed like there was almost no visual understanding. You were just getting answers that came from priors.
So, you know, if you asked how many people are sitting at the table, it'd always answer two, regardless of how many people you could see in the picture. And so it seemed like, oh, these models actually aren't able to get semantic information out of images. I was interested in that problem and tried to work more on it. And that required knowing more about what's happening in vision and how you can represent visual information. And then there started to be this revolution of generative AI for images. And I had students who started looking at that. Before the era of Moonlake, I was also working with Demi Guo, who founded Pika. And Ian, obviously, with GANs. Yeah, so Ian was never my student. Yeah, I was very aware for the whole decade there of Ian with GANs. And I mean, Ian was a Stanford undergrad, but yeah. Richard Socher of you.com, I believe he was your student. Yeah. And there were links across at that stage as well. There were several papers in that era. I mean, Andrej Karpathy was a PhD student at the same time as Richard. And so there was some joint language-vision work in that era as well. You know, it seems kind of ancient by modern standards. But yeah, we were trying to go from sort of textual dependency graphs to visual scenes. At the time, the GloVe embeddings really took over from a lot of TF-IDF or one-hot encoding, and the early vision-language models we saw were LLaVA-style adapters, right? It's technically still just embedding into a latent space. Let's add images. It's like mixed modality. So that's one of the things you sort of put out there too. Right. Yeah. Yeah. Well, thank you for all of that. Thank you for advancing the world on world modeling. I honestly do think that if people deeply understand everything we just covered, they will see what's coming. And I think you guys have made some really significant contributions here.
What are you hiring for? You know, who are the people? We agreed that the CTA was a hiring call. Yeah. I mean, don't we have AGI? You don't need engineers anymore, right? Yeah. On the model side, we are actually striving towards basically a self-improving system. But what that means is that we need people to set up the self-improving system. More specifically, people at the intersection of code generation and computer vision and graphics. Right. Yeah. That's the core research background that we look for in our team, and the majority of the team today do have both backgrounds. When you say computer vision and graphics, are they the same thing? Or is computer vision one thing and graphics another thing? And how intertwined are they? They're intertwined, but different. Yeah. And I think this relates to some of the themes that we've been talking about: the more explicit underlying world models that are being constructed inside Moonlake really draw on the computer graphics tradition. And so it's then combining that with the visual understanding of vision. Got it. Yeah. All right. So if you've written a game engine, come talk to us. Right. Oh yeah. But I do think that the line is increasingly blurred these days, where it's like, you have a general understanding of vision and graphics. I think by your standards it is. For me, it feels like vision is, you know, I leave that to the big labs. Graphics, I can get that; you would want to do that more from first principles. But vision, there are so many vision models off the shelf that I can take, though probably not good enough for you. I see. I see. If you're making that distinction, then maybe we care a little bit more about having graphics knowledge.
It could be like, you know, sometimes a hiring call can be as simple as like, if you know the answer to this, you should talk to me, you know, like the sort of core, known, hard problem in your world. I see. Yeah. In that case, yeah, definitely if you've written a game engine before, if you've RL'd a variety of coding models on different objectives. Easy. Many of those. Yeah. If you've done multimodal latent space alignment, I intentionally included that. Yeah. Yeah. Our editor has a thing every time. Yeah. Latent space alignment. Honestly, is it that hard? Well, there are some scripts out there that I've saved for the day I someday have to do it, but I don't have to do it yet. But it's done. I think, yeah, there's a version of that that's done. But I think we are aligning audio, text, language and video. Yeah. And basically we have these world models that are able to act as agents, to act in these worlds and extract long-horizon videos, feeding that back to the models to sort of self-improve. So it's an insanely exciting, but also technically challenging problem. So for people who want to do their life's best work, you know, Moonlake's the place. How big are you guys? Where are you guys based? We're currently based in San Mateo, although we're moving up to SF. We're about 18 folks right now. My ending question was going to be, why the name? What's the name? Oh, very cool graphics and design, by the way. Actually, at the time when we started the company, we were thinking a lot about how do we make a company name that gives people the vibe of like OpenAI, but with almost like Industrial Light & Magic vibes. Because we care about creativity and using that as a funnel to solve AGI. So then we brainstormed a lot around like DreamWorks, right? Like Industrial Light & Magic. So there's basically a space of things that we feel are very, very semantically close to the company's identity.
And then it ended up being Moonlake partly because of the DreamWorks vibe. You know, the DreamWorks moon. Like, exactly. So that was a little bit of that inspiration. And then the moon was sort of, it basically was about the reflection. The reflection part also implies the self-improvement loop. Wow. Which we sort of really believe in, and that's the path towards multimodal general intelligence. So that's that. I love a good name. I love a good name. This is a very good name. It's very good lore. I'm glad I asked the question. I will also say, you know, one of my favorite storybooks or biographies ever is Creativity, Inc., with Ed Catmull's story about Pixar and how he, you know, was rejected as a Disney animation artist. So then he went into computing and brute-forced his way back into Disney. Yeah. And Walt Disney is also one of my favorite founders. Like, his story. At the time he's like, OK, I'm going to create this immersive park. People don't even have the technology to create it virtually. But, you know what, let's just build it physically such that people can. So he's the first world modeler. I tell people that theme parks are world models, too. Yeah, yeah, yeah. I mean, you know, It's a Small World, or like the Epcot Center with all the little replicas of the countries. Yeah, those are very interesting. OK, well, thank you. We've covered a huge amount. Thank you for your time and thank you for inspiring us. Thank you for having us. It's been fun chatting. Yeah, it's been a good time.