Google DeepMind: The Podcast

Waymo: The future of autonomous driving with Vincent Vanhoucke

49 min
Nov 6, 2025
Summary

Vincent Vanhoucke from Waymo discusses how autonomous vehicles perceive and navigate the world using multi-sensor fusion, simulation, and AI models similar to large language models. The episode explores the technical challenges of autonomous driving, safety validation through simulation, and Waymo's expansion to new cities including potential deployment in London and Japan.

Insights
  • Multi-sensor fusion (cameras, LiDAR, radar) provides safety through redundancy and complementary weaknesses rather than voting mechanisms, similar to how human eyes perceive depth
  • Autonomous driving is fundamentally a social and predictive problem requiring understanding of other agents' behavior through simulated 'motion conversations' rather than just perception
  • Cloud-based teacher models can distill complex AI reasoning into lightweight onboard systems, enabling real-time autonomous decision-making without internet dependency
  • Simulation at scale (billions of miles) is critical for safety validation and training, enabling testing of rare edge cases and counterfactual scenarios impossible to observe in real driving
  • Waymo vehicles achieve approximately 88% lower severe injury accident rates than human drivers by adopting more conservative safety postures rather than mimicking average human behavior
Trends
  • Foundation models and multimodal AI (like Gemini) are being leveraged for zero-shot semantic understanding of driving scenes without task-specific training
  • World models and generative simulation are emerging as key research frontiers for creating faithful, controllable environment simulations for autonomous systems
  • Tokenization of sensor data (cameras, LiDAR, radar) as abstract tokens compatible with transformer architectures is becoming standard in autonomous driving perception
  • Closed-loop learning and reinforcement learning from preference data are replacing pure imitation learning as the dominant training paradigm for autonomous systems
  • Geographic expansion requires localization of driving behavior and rule interpretation rather than simple model transfer, with hand signals and traffic norms varying significantly by region
  • Safety frameworks are shifting from implicit learned behaviors to explicit, verifiable rule enforcement that can be mathematically guaranteed at every decision point
  • Simulation-based counterfactual analysis (e.g., 'what if the other driver was drunk?') is becoming standard practice for continuous safety improvement and incident mitigation
Topics
  • Multi-sensor fusion and perception systems
  • LiDAR, camera, and radar sensor integration
  • 3D environment modeling and reconstruction
  • Behavior prediction and motion forecasting
  • Simulation and synthetic data generation
  • Reinforcement learning for autonomous driving
  • Imitation learning from human drivers
  • Safety validation and testing frameworks
  • Cloud-based model distillation
  • Semantic scene understanding
  • Hand signal and gesture recognition
  • Geographic adaptation and localization
  • Freeway vs. surface street driving
  • Closed-loop learning and the DAgger problem
  • Foundation models in autonomous systems
Companies
Waymo
Primary subject; autonomous vehicle company operating driverless cars in San Francisco, LA, Phoenix, Atlanta, with expansion to new cities including potential deployment in London and Japan
Google DeepMind
Podcast host and parent organization; Vincent Vanhoucke previously worked on robotics at Google before joining Waymo
Google
Parent company of Waymo; mentioned in context of Google Maps data and Gemini multimodal AI model used for scene understanding
People
Vincent Vanhoucke
Distinguished engineer at Waymo; primary guest discussing autonomous driving technology, perception systems, and safety validation
Hannah Fry
Host of Google DeepMind podcast; conducted interview with Vincent Vanhoucke and provided context on autonomous vehicles
Quotes
"I could picture a future in which my grandkids ask us, hey, is it true that in your day, we used to drive by hand?"
Vincent Vanhoucke, opening and closing remarks
"The autonomous driving problem is the simplest robotics problem. You have basically two things you need to do. You have to know if you're going to turn left or right. That's one number. And then you have to know if you're going to accelerate or decelerate. That's two numbers."
Vincent Vanhoucke, early in discussion
"Safety really comes from taking different sources of information, never entirely trusting them 100%, and merging the evidence based on the different pieces of hints of information that you get."
Vincent Vanhoucke, sensor fusion discussion
"Driving is inherently a social thing. We model those interactions as little bits of conversations. Literally, it's visual movement conversations."
Vincent Vanhoucke, behavior prediction section
"I don't think there is any need for any formally new breakthroughs. I think we're in the right generation of technology."
Vincent Vanhoucke, conclusion
Full Transcript
I could picture a future in which my grandkids ask us, hey, is it true that in your day, we used to drive by hand? So it's entirely possible that we're going towards a future where the experience of driving by hand is no longer the norm, and most of the driving happens automatically. Welcome back to Google DeepMind: The Podcast with me, your host, Professor Hannah Fry. Now, the idea of having autonomous vehicles has been this science fiction dream for decades. And now they are a reality. I join you from the back of a Waymo, a driverless car that is operating in numerous U.S. cities, in San Francisco, where I am, in L.A., in Phoenix and Atlanta. And they're very noticeable on the streets. They're these big white cars with lots of sensors on top. And crucially, nobody sitting behind the steering wheel. But getting to this stage where cars can be out on the roads with passengers without the need for human intervention and doing it so it's reliable and safe has been an incredibly complex journey. So today I get to talk about that with distinguished engineer from Waymo, Vincent Vanhoucke. Welcome to the podcast, Vincent. Thanks for having me. I mean, I know you've worked for Google for a number of years previously for robotics. How does the driverless car problem differ from a more generic robotics problem? Well, in some ways, the autonomous driving problem is the simplest robotics problem. You have basically two things you need to do. You have to know if you're going to turn left or right. That's one number. And then you have to know if you're going to accelerate or decelerate. That's two numbers. In most robotics problems, you have to predict hundreds of numbers to figure out all the degrees of freedom of your robot. This is the simplest robot that has only two degrees of freedom. But that hides all the complexity of the actual problem. Predicting those two numbers is actually a very deep and hard problem. You have to understand the environment. 
You have to understand the people that are around you, around the car, how they're going to behave, what the environment is going to look like in the future. You have to predict the rules of the roads, what you're allowed to do, what you're not allowed to do. And the mix of all this makes the problem hard. Conceptually, it is a robotics problem. Those are robots, but they're very social robots. And they're also embedded in the real world, which I imagine could be quite humbling. The real world is extremely challenging to work in. The expectation in a lot of robotic contexts is that you have a robot in an environment that you more or less control or that you have a reasonable expectation about what the other agents in that environment will do. Like a factory floor, for example, where you have total control. Yeah. And in autonomous car contexts, we have to basically understand and mesh with the environment, be respectful of the people that live in it, blend into the environment as best we can so that we can serve the public and to enable us to drive and have the freedom to operate. Okay, so in terms of choosing those two numbers, as you put it, I mean, first, the car has to perceive the world around it before it plans what to do next. So I guess if we start there, then, in terms of the perception, I mean, Waymo has a number of different sensors, you've got cameras, LiDAR and radar, what are the benefits of each of those? And perhaps, where do they struggle more as well? Yeah, the different sensors have different strengths and weaknesses. A camera is basically like your eye, right? You see the world as a human would see, but it gives you maybe slightly less information about the depth information until you actually put multiple cameras and then you can reason about depth. In contrast, LiDAR is very good about sensing depth. That's what it does. 
LiDAR is basically a laser that you shoot out and bounces off of objects and bounces back and gives you an estimate of how far the objects are. They don't see in colour, right? So they only give you geometric information. They give you a lot less about the semantics of the scene. Those lasers also bounce off things quite easily, don't they? Yes, they reflect off of things like, you know, polished metal and things like this. Which there's quite a lot of on the road. There's quite a lot of it on the road. So that can be a disadvantage in the sense that it adds noise to the signal or it can also mean you can get to see behind corners. We didn't talk about the radar. Radar is very good at sensing speed. And so the relative speed between the other agents and the car is a really important signal for us to understand are we at risk of colliding with something. And radar gives you quite, I mean, you can use radar to sense quite far out as well, right? As opposed to visual, which is just in your immediate vicinity. Yeah, they have a much longer range. Cameras are going to be obstructed by, you know, the cars in front of you and the radars can look much, much further. What I think is more important is that it adds a different piece of information to the context. And when you want to fuse information from different sources, you want to have information that comes from different places. The analogy is you have two eyes. If your two eyes gave you the same information exactly, you would not be able to perceive depth. It's the discrepancy between the two that gives you that extra bit of information. So similarly with LiDAR and camera and radar, they give you very different pieces of information with different strengths and weaknesses. And then the role of the AI is to fuse them into a cohesive picture of the environment. There is something interesting there that you have these three different senses, all of which have strengths and weaknesses. All of them are flawed in some way. 
And yet, because they're flawed in different ways, you can build a bigger picture of the scene. But what do you do when they disagree? Like, who gets the ultimate say between those three? It's not like they're voting, right? It's a merger of the different information. The example I like is, again, going back to your eyes, your left eye basically tells your brain that your nose is on the right. Your right eye tells your brain that your nose is on the left. It's not like one eye is going to win that contest, right? You basically fuse in your brain the different pieces of information that are slightly conflicting, but that actually give you a global picture of the scene in front of you. A great example is what happens at nighttime. Imagine it's really dark out there. Your camera just sees a wall of black. You don't have a lot of information. That's a place where the LiDAR is really useful because the LiDAR doesn't care if it's day or night. It just keeps shooting its lasers and figures out how far objects are. So that complementarity really is what drives the safety there. Safety is a lot about redundancy. Safety really comes from taking different sources of information, never entirely trusting them 100%, and merging the evidence based on the different pieces of hints of information that you get, such that you can have an overall system that you can trust that has a much higher degree of fidelity. We often say there's only one way to be right. There are many ways to be wrong. If your different sensors are wrong in different ways, you know that there is something not right about the information that you get. But once the sensors start agreeing about a picture of the world, you're pretty sure it's going to be right. So is it almost like the car is sort of updating its belief, as it were, about what the scene is around it at any time? It's very literally that, in the sense that the mathematical formulation is a belief update. 
And you can prove that the more information you add that comes from different sources, you only improve the overall estimate of your belief. Even if the information is noisy, you're really just adding information, and the fusion that happens between the different sensors is really the crux of enabling a very safe perception stack. We can, for example, hide one of the cameras. The system will be okay with that. We can have, if there is like dirt that accumulates on one camera, we can still understand the environment. You don't want to have a system that's brittle that will just collapse if a single sensor is providing erroneous information. You want something robust. And the diversity is really what brings the robustness. That idea of having all three, though, I mean, particularly the LiDAR, because there are people who do work on driverless cars without the addition of LiDAR. Do you think it's absolutely necessary to get fully autonomous vehicles that you need all three? The current state of evidence that we have is that it looks like you can get to human-level performance by just using a camera. And the proof to that is that people use their eyes and they don't have a fancy radar in their brain and they can drive just well enough. What we're seeing is that people want to know that those cars are safe beyond what the average driver would be able to provide, because it's a new technology and also because we can. We have proven that we can improve the safety posture of cars on the roads. So let's do it. This is actually a valuable thing that we're adding to society in general. Just going back for a minute to that idea of the car having a belief of what's going on around it. Is it constructing a 3D model of the world as it goes? It is constructing 3D models of the world. It's looking at the geometric information that it obtains from its different sensors. This is very useful for two things. One is planning is easier when you have a 3D space that you can use to reason about. 
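The belief update described above has a compact form when each sensor's estimate is treated as an independent Gaussian: weight each reading by its precision (inverse variance), and the fused estimate is always more certain than either input. A minimal sketch with illustrative numbers, not Waymo's actual stack:

```python
def fuse(mu_a, var_a, mu_b, var_b):
    """Precision-weighted fusion of two independent Gaussian estimates.
    The fused variance is smaller than either input variance, so even
    a noisy extra sensor strictly improves the belief."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    mu = (w_a * mu_a + w_b * mu_b) / (w_a + w_b)
    var = 1.0 / (w_a + w_b)
    return mu, var

# Camera depth guess: 10.4 m but uncertain; LiDAR: 10.1 m and sharp.
mu, var = fuse(10.4, 4.0, 10.1, 0.25)  # fused estimate leans toward LiDAR
```

At night the camera's variance balloons, the weights shift toward LiDAR automatically, and no sensor ever "wins a vote" — which is the point being made.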
You want to be saying, hey, I want to avoid hitting this. I want to turn right here because that's what the rules of the road are telling me, or this is the route that I want to take. Also, having a 3D representation of the world enables you to simulate the environment. And that's a critical piece, is we are leaning really hard on simulation as a way to validate that our driver is safe and behaving the way we want. We've driven billions of miles in simulation, many orders of magnitude more than we've driven on the roads. And having a simulator that's very faithful and close to reality is what enables us to make the technology advance fast and validate it offline before we actually have to do testing. OK, so I see how in simulation you're sort of making a prediction about what the car will do next and then have it interacting with your simulated environment. But inside the car itself, as it's out on the roads, you're still making predictions about what will happen at the next time frame. Are there some elements of simulation in that too? Yes. So predicting what other agents on the road will do or might do is a very important piece. You want to know that, for example, a pedestrian that is on the sidewalk, is it likely that they're going to jump in front of the car? Or are they just walking straight? Are they trying to cross the crosswalk? Or are they staying there because it's not their turn? Other cars: are they going to try to drive through an intersection at a stop sign or not? All of this reasoning about the other agents is a very important part for the car to be able to make its own decisions. Driving is inherently a social thing. Yeah. What's interesting is that we tend to model those interactions as little bits of conversations. Literally, it's visual movement conversations. You know, I move forward. What will this other car do? This car stops. OK, I can go. Or this car goes. Then I'm going to have to stop. 
It's literally modeled as a visual or motion conversation, very similar to what you would do in a conversational agent. So there are lots of parallels between the conversational AI and the autonomous driving problem that we can leverage and learn from and improve the technology based on that. Because that was always one of the big questions was if you have two autonomous cars that come to a stop sign at precisely the same time, what happens? Which one yields? Do you end up in this situation of a stalemate? And that's the solution, right? It's that you're predicting what the other car will do and making a small intervention and continuing to update your belief about the situation. Yeah. And it's important because one of the very interesting aspects of AI for autonomous driving is the closed loop problem. And by that, I mean, you cannot learn your behavior in isolation. You have to learn your behavior in the context of the other agents on the road. And so one of the only ways to do that is to imagine what you would do, unroll what the world would do in response in a simulator, and then feed that information back. So evaluating and training your model in a closed loop enables you to learn the kind of behaviors that you would actually observe in a real-world environment. We know from robotics in general that there's what's called the DAgger problem. The DAgger problem is that very often if you just optimize for open loop behavior, you end up in a place where your system tries to do the best it can at every step. But every little error that it makes just accumulates over time. Right. And so if you want to learn in a way that doesn't have an accumulation of error, you have to simulate the entire environment and feed that back into your system. That makes it very, very complex. 
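The error accumulation being described can be seen in a toy model: give a lane-keeping policy a tiny systematic steering error at every step. Run open loop, the lateral offset compounds roughly quadratically; run closed loop, with feedback on the policy's own drifting state, it stays bounded. A deliberately simplified sketch — the bias and feedback gains are arbitrary illustration, not anything from Waymo:

```python
def rollout(steps=200, bias=0.01, closed_loop=False):
    """Toy lane-keeping rollout. Each step the policy makes the same
    tiny steering error (`bias`, an imitation-learning imperfection).
    Open loop, nothing corrects it and lateral offset compounds;
    closed loop, the policy reacts to its own state and recovers."""
    heading, offset = 0.0, 0.0
    for _ in range(steps):
        heading += bias                             # per-step error
        if closed_loop:
            heading = 0.7 * heading - 0.3 * offset  # steer back to center
        offset += heading                           # lateral drift
    return offset

open_drift = rollout(closed_loop=False)    # compounds to ~200 units
closed_drift = rollout(closed_loop=True)   # stays well under 1
```

The same per-step error produces wildly different outcomes, which is why training and evaluating in a closed loop matters.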
Again, there is a similar analogy in language models that if you have a long conversation and you don't train your system right, your conversation might veer into really weird idiosyncratic territory because the model is not being trained to stay on topic. It's very much the same problem of staying on topic when you're having a long conversation as it is to drive in a way that is stable and gets you to the place that you want eventually. I mean, it does have those elements of a game to it. I mean, I'm thinking about playing chess. There's no point in learning how to play chess, not against an opponent, because playing in isolation is like it's sort of worthless in a lot of ways. Except that you're playing with many, many, many players simultaneously. I mean, it's sort of a wonder that you've got as far as you have when you consider how complex this problem is. As always, simulating the environment has been at the core of the development and the study of the problem. Building a simulation that is very faithful to the real world and that enables you to reason about these closed-loop problems is a key component to making it work. How much of your prediction about what another agent in the space is going to do comes down to your categorization of what that agent is? I mean, I'm thinking, for example, I was watching some videos this morning and there was a Waymo going past a cat and the cat was sort of curled up, could have been a football, right? Like, how do you have to categorise what it is before you predict how it might behave? We don't necessarily have to categorise it exactly, but if we know it's a cat, we can make better predictions about what its behaviour might be. A cat will change direction very quickly and go in a random direction. They will not get on the crosswalk to cross the street. So all of the different agents, we try to categorise them as best we can and to predict, essentially, their behaviour. 
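One simple way to see why the category matters for prediction: the class bounds the agent's plausible future motion. A hypothetical sketch — every class, field, and number here is invented for illustration, not Waymo's taxonomy:

```python
# Per-class motion priors: what kind of futures are even plausible
# for this agent? A cat can reverse direction instantly; a car cannot.
MOTION_PRIORS = {
    "car":        {"max_speed_mps": 40.0, "max_turn_rate_dps": 30.0,  "uses_crosswalk": False},
    "cyclist":    {"max_speed_mps": 12.0, "max_turn_rate_dps": 60.0,  "uses_crosswalk": False},
    "pedestrian": {"max_speed_mps": 3.0,  "max_turn_rate_dps": 180.0, "uses_crosswalk": True},
    "cat":        {"max_speed_mps": 8.0,  "max_turn_rate_dps": 360.0, "uses_crosswalk": False},
}

def reachable_radius(agent_class, horizon_s):
    """Worst-case distance the agent could cover within the horizon,
    used to bound where it might be when the car arrives."""
    return MOTION_PRIORS[agent_class]["max_speed_mps"] * horizon_s
```

Misclassifying the curled-up cat as a football would shrink its reachable set to zero, which is exactly the failure mode the categorization guards against.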
It's also important so that we can make decisions about how we should behave in terms of rules, right? So we want to know this is a bike, this is somebody on the bicycle, they're in the bike lane. We can pass them. We need to give them that much width to pass them safely. So all the semantic information about the agents is very important. I think that's been kind of the long-term learning from 15 years of autonomous driving: the semantics of a scene. What the agents are is extremely important to reason about. We want to know that a car is not just a cube on the road. A car can be an emergency vehicle, and we need to yield to them. If there are many emergency vehicles somewhere, probably we don't want to go through there because there is police activity or something like that. So all this kind of deep semantics really matters to being able to operate the car. Yeah, because I remember the DARPA challenges, sort of the earliest days of driverless cars. There was a big problem with tumbleweed, I seem to remember. Oh, yes. Not knowing what it was, just seeing that there was an object in front that you could have driven through. You know, it wasn't like potentially going to be damaging. Do you still end up in that situation? I'm thinking about snow here, for example. Like, do you find yourself in situations where there are objects that are in the way, but they're not things that should necessarily affect the planning of the vehicle? Yeah, so snow is a great example. It's something that we started working on seriously because we're trying to expand in the more northern cities. So we've done a lot of testing in the Sierras over the last year. And snow is a typical example of here is something that is big, massive, potentially on the roads, but you have to reason about and say, OK, this is snow. So the right thing for me to do is to drive through it unless it's a big pile of snow and you can't. 
But if it's a reasonable pile of snow just that you experience in normal driving, you want to cross through that snow. If it were a rock, you wouldn't do that. So categorizing things at a fine grain like this and understanding what you can or cannot do is really part of the equation and is necessary to be able to drive in those conditions. I mean, these ambitions to have driverless cars, they really predated the big changes of large language models. And we're talking here about like understanding the context of a scene and the semantics of it. How much have the advances that have been made in multimodal models fed directly into driverless cars? Quite a bit. It's funny. I was reflecting a few weeks ago on, do you know when the first transcontinental autonomous drive happened? There was a transcontinental drive where they were at about 98% autonomy, so almost fully autonomous, driving about 60 miles per hour. That happened 30 years ago this summer. Wow. That was in 1995. It took 30 years from the proof of concept of autonomous driving to where we are today. And what's fascinating is that it took basically not only a lot of work, but multiple generations of machine learning and AI to really get to a level of performance that was necessary. And the modern AI revolution that we're seeing today for the last few years is but the last of those. There's been several over the past decades that we've experienced and gone through. The modern AI world, what it's opening up for autonomous driving is really this idea that you can get at the semantics of a scene, essentially zero-shot, without having to train the model specifically for it. If I show to Gemini a picture of an accident scene on the road, Gemini will tell us this is an accident scene. This is not something I need to train specifically. This is something that can be learned from what we refer to as world knowledge or more prosaic things like what do emergency vehicles look like in Tokyo or in London. 
We don't necessarily have that knowledge built into our driver a priori because we've never driven in those areas until very recently. But those large AI models have that knowledge built in. So the key is how do you leverage that in a way that is robust, in a way that basically provides the right level of information for the car to be able to operate? Well, so how do you is the next obvious question, because, I mean, you're talking about lots of different elements here. You've got like the integration of the sensor data. You've got the categorization of, you know, the semantics of the scene. You've got the prediction of how different objects might behave in the future. I mean, how do you put all of that together? Yeah. So one cheat is that we can do all of that in the cloud first. So we can build essentially a very large driver in the cloud, very large model that incorporates all that information, all the sensor information, all the experience that we have from driving millions of miles, all the data that comes from various sources that provides us with world knowledge. And the benefit of that is that when you do that in the cloud, you don't really have the same operational constraints that you would have on a car. It can be slow. It can take a lot of memory, take a lot of compute. It can also not necessarily meet all the real-time constraints that the car requires. But once you have that teacher driver, you can use that to teach the onboard system based on that supervision that you provide from the cloud-based driver and distill all that information onto the onboard system, which itself can have different operational constraints, different compute constraints, different architectural constraints, and so on and so forth. So that's one path to bringing the power of very powerful AI onto the car without really having to just shove it all in the car. So let me make sure I understand that then. 
So you have this sort of giant model in the cloud that sort of gives you like a solution space, as it were, of like this is what good driving looks like. And then when you're in the car, you just have to work out which bit of that solution space you're in in that particular moment in time. So we don't do that in real time, to be clear, right? This is not, you know, the car is on the road, it's going to ask the driver in the cloud, you know, what do I do next? No, no, that doesn't work. We need to have the driver be self-contained on the car and have all the, you know, be completely autonomous and independent. Not rely on internet. That's right. Relying on internet connectivity would not be a good safety posture. But we do that offline, meaning that when we train the models that are used in the car, we query that large model in the cloud as an oracle to tell us, you know, what would be the ideal thing to do, then essentially backpropagate that onto the onboard system. I mean, it sounds like you're stepping closer towards the sort of Gemini version, where there is the giant model and everyone is sort of tapping into the main model. Is that sort of the big aim? Yeah, it's closer to this. What's been interesting in the journey is that, as I mentioned earlier, the driving problem is not that far removed conceptually from the dialogue problem that LLMs solve. It is a visual dialogue or a motion dialogue with multiple agents. And because we can frame it that way, I mean, literally the math is the same, right? So we basically train a model that has very much the same properties as a Gemini model, and we are able to leverage all the same techniques that a Gemini model applies to the problem, including how do you scale it? How do you provide it the right level of supervision? And all those questions are very, very, very similar. I'm also wondering, I mean, end-to-end is something that people are talking about quite a lot now. 
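The offline teacher-student loop described above is, in miniature, ordinary distillation: query the big cloud model as an oracle over logged scenarios, then regress the small onboard policy onto its outputs, with no cloud calls at drive time. A toy sketch with a made-up one-dimensional "teacher" — nothing here reflects Waymo's actual models:

```python
import random

def teacher(x):
    """Stand-in for the large cloud model, queried offline as an
    oracle. Pretend its ideal steering command is 3x - 1."""
    return 3.0 * x - 1.0

# Onboard "student": a two-parameter linear policy w*x + b, distilled
# by regressing onto the teacher's outputs on logged inputs. All
# oracle queries happen during training, never on the road.
w, b = 0.0, 0.0
rng = random.Random(0)
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)   # a logged sensor feature
    err = (w * x + b) - teacher(x)
    w -= 0.1 * err * x           # SGD step on squared error
    b -= 0.1 * err
```

After training, the student carries the teacher's behavior (w near 3, b near -1) with none of its runtime cost, which is the "without shoving it all in the car" part.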
Is there an ambition that you could tokenize sensor data in the same way that you can tokenize language? That's already very much what a multimodal model like Gemini does, right? You tokenize all the images and the sensor inputs and pass them on to a language model as abstract tokens that basically act like words. So the machinery under every perception system fundamentally is compatible with this idea of tokenization. And then the question is, what kind of tokens do you want to pass around? Do you want tokens that are very abstract? Or do you want to pass around information that is very concrete? The abstract information can potentially be richer in some ways. Information that is more concrete gives you the power of simulating that state in a much more direct way. Actually simulating down to the pixel level is something that is within reach. There's a lot of work right now going towards world models. So once you can do sensor generation in a controllable way, that opens up the capability to simulate the entire rollout of an autonomous car. And that's nascent. That's kind of where a lot of the technology is headed. Is it going to solve the entire problem? You know, time will tell. But that's really what the edge of the research is at right now. We actually, we had on the podcast, the people who are working on Genie, the open world models, exactly as you're describing. Those simulations, though, I'm thinking here about the sim to real gap that roboticists in particular talk about a lot of the time. I mean, how realistic are those simulations that you're creating? Because presumably there'll be certain things in an environment that are broadly not necessary for the purposes of driving. 
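The patch tokenization referred to above can be sketched in a few lines: cut a 2-D sensor grid into patches and flatten each patch into one token vector, ViT-style. This is a bare-bones illustration; a real system would follow it with a learned linear projection before the transformer:

```python
def patch_tokens(image, patch=4):
    """Split a 2-D sensor grid (camera image, LiDAR range image,
    radar map) into non-overlapping patches and flatten each patch
    into a single token vector, like words in a sentence."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[i + di][j + dj]
                           for di in range(patch)
                           for dj in range(patch)])
    return tokens

# An 8x8 grid with 4x4 patches yields 4 tokens of 16 values each.
img = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
toks = patch_tokens(img)
```

Once every modality is reduced to token sequences like this, the same transformer machinery consumes camera, LiDAR, and radar uniformly.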
I'm thinking about, I don't know, the open world models with Genie, how there was a really amazing example about how the light changed as a cat moved through it, which is kind of looks very beautiful, not that relevant for driving, except that maybe in some situations where the light changing might affect the visuals. Yeah. The fidelity of the simulation is very, very important, but not necessarily this kind of visual fidelity that you're talking about. We want geometric fidelity. We want to make sure that the physics of the environment are respected. But the visuals really matter too. One of the things that we do with our simulator is that we're able to take examples of driving and then re-simulate it in different conditions: at night, in snow, in the morning, in the evening, by simply just using AI to adapt the visuals and turn a spring scene into a winter scene, turn it into a summer scene or a night scene, and really augment the amount of data and the different conditions in which we can simulate the entire environment. So is there an ambition then that at some point in the future you will be able to have a driverless car that just purely takes in the sensor data and, rather than sort of spitting out all these, maybe not intermediate steps, but things that give you some way of exploring what's going on behind the scenes, just purely outputs speed and direction? There is another angle to this story, which is the safety angle and the validation. There are safety rules that you and I can relate to. It's basically things like don't collide with anything. It's don't run a red light. It's respect priority and road rules in general. All of those rules are very concrete. They're not abstract tokens in token space, right? So you want to be able to enforce those rules, both the road rules and the safety rules, in a way that can be reasoned about, can be explicit, can be guaranteed. 
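Explicit, checkable rules of this kind are straightforward to express as hard predicates that every candidate plan must pass, regardless of what the learned model proposed. A hypothetical sketch — the rule names, thresholds, and plan fields are all invented for illustration:

```python
# A safety envelope as a list of named hard checks on a candidate plan.
RULES = [
    ("no_collision", lambda p: p["min_gap_m"] > 0.5),
    ("red_light",    lambda p: not (p["light"] == "red" and p["enters_junction"])),
    ("speed_limit",  lambda p: p["speed_mps"] <= p["limit_mps"]),
]

def check(plan):
    """Return the names of every rule the plan violates; a plan is
    only executable if this list is empty."""
    return [name for name, ok in RULES if not ok(plan)]

plan = {"min_gap_m": 2.0, "light": "red", "enters_junction": True,
        "speed_mps": 10.0, "limit_mps": 13.0}
violations = check(plan)
```

Because each rule is an explicit predicate over a concrete state, it can be audited and verified at every decision point, which is much harder if all reasoning stays in abstract token space.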
So having a concrete representation of the environment, of what the driver is going to do, has a huge amount of benefit in terms of providing a way of expressing a safety envelope that you wouldn't have necessarily if all of the reasoning of the car happens in a completely abstract fashion. Just on the idea about things being concrete, those safety rules that you listed there, right? No collisions. What are the other rules you have? Don't run a red light. Sure. Yeah, exactly. Are those hard coded then into the Waymos? They're provided as the guidance that we provide to the car. We have a safety framework that basically encodes the kind of information about what it is to do proper driving. And we want to be convinced as engineers that the encoding of those rules is something that the driver will meet at every point in time. Let me just dig into the human side a little bit more. Should an autonomous vehicle behave like a human driver? It's a really good question. One thing that we've learned over the years is that you want to be basically the most normal car on the road. You don't necessarily want to be more timid than other drivers on the road. Because then people will pick up on that difference and actually abuse the car, not necessarily out of mischief or anything like that, just because, you know, if they know that a Waymo car will always be more timid than they are, they will want to drive in front of it. If on the opposite side, you're more aggressive than the average driver, then you're disruptive to the flow of traffic or you violate other people's expectations. They don't expect an autonomous car to be raging on the road, right? So the sweet spot is really if you act like the most boring, normal driver on the road. It turns out it's also the safest. It's the best safety posture to have. We don't necessarily want to reproduce everything a human driver does. 
In fact, what we find is that very often humans are not very good at risk analysis. They will do things that, if you do the math, are not necessarily safe. Give me an example. I mean, a lot of people will tailgate at distances that are much smaller than what is recommended by people who've done the analysis. Say, you know, you should be this many meters away from the rear of the car in front of you if you're at that speed. And I'm very sure that most of those guidelines, people look at them and think, that feels extreme. That feels too much. And it's not what in practice people will necessarily want to do on the road. But when we look at the data, we understand that there are things we should be a lot more cautious about. People also don't necessarily reason about what could happen when it's not in front of their eyes. So you have a big truck occluding a pedestrian crossing. You have to think about it: it's very possible that a pedestrian will be crossing through there. If it's not in sight, people don't always think about all the possibilities of what could be happening. So the safety posture that we need to have and how we reason about those kinds of situations tend to be a bit more conservative than humans. And I think that's part of why the numbers today show that, compared to human drivers in the same kind of environments that we're driving in, we're basically around an order of magnitude safer. The rate of accidents that lead to severe injury is about 88% lower with Waymo cars. And that gap we can really attain not just by doing what every human would do, not by hitting the average, but by having a more conservative safety posture. Back on the idea of the average driver, I mean, does driving change depending on where you are? I think you're doing some testing in Japan at the moment, aren't you? Is there a different style of driving in different countries? Yes, expectations are going to change depending on where you are.
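The tailgating point above can be made concrete with basic kinematics. A rough sketch, where the reaction time and braking deceleration are illustrative round numbers rather than values from any actual guideline: the gap you need is the distance covered while reacting plus the distance needed to brake to a stop.

```python
def recommended_gap_m(speed_mps: float,
                      reaction_time_s: float = 1.5,
                      braking_decel_mps2: float = 6.0) -> float:
    """Conservative following gap: distance travelled during the reaction
    time, plus the distance needed to brake to a full stop from that speed."""
    reaction_dist = speed_mps * reaction_time_s
    braking_dist = speed_mps ** 2 / (2 * braking_decel_mps2)
    return reaction_dist + braking_dist
```

At 30 m/s (roughly 108 km/h) this gives 45 + 75 = 120 m, which is exactly the kind of figure people look at and feel is extreme compared with how close they actually follow.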
And again, in the spirit of being as inconspicuous as possible, we will need to adapt the driver to the local conditions, simply because the rules of the road are different in different places. Even in the U.S., some states let you turn right on red, some states don't. And there are a lot of expectations that are built into not just the road rules, but also common practices. In practice, what we've seen, at least in the U.S., is that people play the differences in driving up a lot more than they exist in reality. Really? I think a lot of people are proud of their local way of driving, or proud to say that people in their local environment are terrible drivers or aggressive drivers and things like that. At the end of the day, this is still a very homogeneous kind of environment to be driving in, the continental US. I see, I see. Phoenix versus LA. So Japan is, you know, first of all, driving on the other side of the road, a very simple, massive road rule change that has an impact on how you behave, and an impact on, you know, how pedestrians behave, because, you know, you will look left versus looking right. Can you not just switch it? Yeah, to a first-degree approximation, it's a mirror image of the world. Switch the camera around. But there is also a lot more to it, I think, in the expectations around driving. Anecdotally, I don't have hard data on this, but anecdotally, there are a lot more agents on the road gesturing at cars in Japan. It's a lot more of a normal part of traffic. It's not just in emergency situations. And understanding those gestures well, what they are and maybe how they're codified slightly differently, may be a factor. Well, a lot of that, I guess, comes down to, I mean, the sort of physical size of the cars that you're in, the shape of the roads, the density of traffic.
But I am wondering about the hand signals, though, because even if I started driving in Japan, I think I would have an intuitive sense of what hand signals might mean. How do you build those into the car? I mean, hand signals in particular are quite difficult, aren't they? They are a topic in and of themselves. And often, you know, we say hand signals, but really it's a whole body language. There is a lot of information that gets conveyed that is not just how the hand moves. You have, for example, people at night shining lights to tell you where to go, or things like that. We have a lot of that data, and we can evaluate how we interpret that data, measure our effectiveness at understanding hand signals, and improve on the system using data-driven methods and heuristics. I was just coming here, and I saw a construction worker stopping a Waymo with their hand in front because there was a truck that was backing up. And people were very confident that that hand signal was going to be interpreted properly by the Waymo. That's a huge amount of trust that people are placing in the system. And we need to meet that trust. Yeah, absolutely. So I am curious, though: how do you teach a driverless car in the situations where it's acting like the average driver? How do you use that as an instruction? I mean, what's the reward function that you're going for here? What are you optimising? There are different optimisation functions. One basic optimisation function is just imitation learning. This is the bread-and-butter learning that every robot starts from. It's how you build a baseline system: you basically take in the data from human drivers and imitate exactly what they do on average. That takes you to some level. Just like for LLMs, doing imitation learning on human text is the baseline that you start from when you build a language model. On top of that, you tend to have things like reinforcement learning.
You tend to have learning from preference data, learning from human annotations and things like this. So you can imagine a similar setup in which, beyond just the human imitation, we would provide signals to the driver about areas where we don't think humans are optimal, and hints as to what a better solution is. And then we can simulate that. We do a lot of simulations. We also simulate a lot of the extreme cases that we never want to observe on the road. So that's another angle that is very important: there are a lot of things that we hope to never see happen. And so we cannot learn from humans in those scenarios, because we just don't see them. Like what, multi-car pileups? Exactly. Exactly. Right. So if there is a multi-car pileup in front of you, what do you do? Yeah. We won't have that data. We hope to never have that data. But we simulate those kinds of pileups and then we learn from that what the right behaviour for the car is. Let me ask you about safety then. Have there been incidents involving the Waymos? Yeah, we are not perfect and the world around us is not perfect. So there have been incidents. Whenever there is an incident, we look at it very carefully. We replay it in simulation. We always look at: are there ways that we could have done things better? Are there ways we could have mitigated the issue, irrespective of responsibility? And we try to build in a lot more defensive measures, even when there is not an incident. If there is something that makes us think that there could potentially be an incident, we try to mitigate that too. Simulation is a great tool for this because it enables us to explore counterfactuals. So we simulate what happens if, in this instance, the other driver was drunk, and so their reaction time was a lot slower than it actually was. Would we have mitigated the issue? Would we have been able to perform in a way that would have avoided an incident altogether?
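That counterfactual idea, replaying the same scene with a slower reaction time, can be sketched as a tiny forward simulation. This is a one-dimensional toy with invented numbers, not Waymo's simulator: an agent coasts for its reaction time, then brakes at a constant rate, and we ask whether the gap closes before it stops.

```python
def collision_occurs(gap_m: float, speed_mps: float, reaction_time_s: float,
                     decel_mps2: float = 6.0, dt: float = 0.01) -> bool:
    """Step a single agent forward in time: constant speed until the
    reaction time elapses, then constant braking. Returns True if the
    gap closes while the agent is still moving."""
    t, x, v = 0.0, 0.0, speed_mps
    while v > 0.0:
        if x >= gap_m:
            return True                      # gap closed while still moving
        a = -decel_mps2 if t >= reaction_time_s else 0.0
        v = max(0.0, v + a * dt)
        x += v * dt
        t += dt
    return x >= gap_m

# Same scene, two counterfactual reaction times:
attentive = collision_occurs(gap_m=60.0, speed_mps=20.0, reaction_time_s=1.0)
impaired = collision_occurs(gap_m=60.0, speed_mps=20.0, reaction_time_s=2.5)
```

Here the same 60 m gap at 20 m/s is survivable with a 1.0 s reaction time but not with an impaired 2.5 s one, which is exactly the kind of question a re-simulation can answer.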
So we take those kinds of learnings very seriously and incorporate them into the driver in a continuous way. So these simulations, then, you're using them, I guess, to validate that the model the car is using is the best that it can be? It goes beyond just validating. We also use simulation as a way of providing feedback on the driver. So there is a training loop that basically leverages simulation and feeds this information back to improve the driver itself. So it's not just passively looking at whether we meet the performance criteria that the simulation is providing us with; it's also using that signal as something we can backpropagate into the model so that the driver itself improves. When it comes to rolling out into a new location, I know there's rumours of you going to London quite soon. How much mapping do you have to do before you can put cars without drivers on the roads? We like to have a map because a map gives us a prior on what we should expect to be there. There is usually information that is more relevant to us that is not necessarily directly available on Google Maps, for example. We like to map where there are speed bumps. We will try to see that there is a speed bump and slow down. But if it's not visible, for example if it's at night or something like that, we have the knowledge that there might be a speed bump there. And all of that gives us a way of having a higher level of confidence about the information that the car should expect to see in an environment. At the end of the day, the car makes its own decisions. We don't ever assume that the map is correct. But it's like an extra piece of evidence that the car adds in to change its belief about what the best thing to do next is. So then how different are these different areas? I'm thinking of, like, freeway driving compared to inner-city driving. The different types of roads have different challenges.
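The map-as-prior idea above, never trusting the map outright but folding it in as one more piece of evidence, has a natural Bayesian reading. A minimal sketch, with all probabilities invented for illustration: the map supplies a prior that a speed bump is there, and a (possibly unreliable) visual detection updates it.

```python
def posterior_bump(prior: float, detected: bool,
                   p_detect_given_bump: float = 0.7,
                   p_detect_given_no_bump: float = 0.05) -> float:
    """Bayes' rule: the map gives the prior that a bump exists; perception
    updates it. At night the detector may miss (p_detect_given_bump well
    below 1), so a strong map prior still keeps the car cautious."""
    if detected:
        num = p_detect_given_bump * prior
        den = num + p_detect_given_no_bump * (1 - prior)
    else:
        num = (1 - p_detect_given_bump) * prior
        den = num + (1 - p_detect_given_no_bump) * (1 - prior)
    return num / den
```

With a strong map prior of 0.9, a missed detection at night only drops the belief to about 0.74, so the car still slows down; with a weak prior, the same miss leaves the belief negligible.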
A simple example is, if you're on the surface streets and something wrong happens, typically you can just stop. It may not be optimal. It may not be the right thing to do. But if push comes to shove, there is the option of pulling over and stopping. On a freeway, you don't want to be stopping in the middle of traffic. That is not a safe first option. So you want to be able to reason about exceptional circumstances in a very different way. Speed changes everything, right? The fact that you have a car that is at much higher velocity means that you really need to anticipate much further ahead what may be happening. And you have to reason about traffic lanes in a different way. You often have traffic stacking up at the exits of freeways. It's not necessarily the kind of thing that you experience on surface streets. I've been taking a lot of rides on freeways lately and experienced the Bay Area freeways. In the Waymo? In the Waymo, yeah. Because that's not available to the public yet? It's not available to the public yet, but this is one of the dimensions that we're working on enabling. I mean, look, I'm biased, right? So I'm going to ask you about London. But I think London feels, well, at least qualitatively, very different from the American streets, in terms of the density of the cars and traffic, in terms of the number of road users, in terms of how narrow the streets are. I mean, is it just a case of picking up what you've got here and then moving it to a new location? Or are there a whole other range of edge cases that you have to be concerned about that just don't come up when you're in the American cities? So what's interesting about San Francisco is that it's actually the second-densest city in the US, believe it or not. After New York. After New York. So by US standards, it's actually a pretty complex and dense environment to be in. Is it on par with London? I know it's not. London is a lot more narrow and complex.
The question is, is it fundamentally harder, or is it the same thing, just with different tolerances and things like this? Are there things that are materially, you know, different and fundamentally different about it? I think the traffic is awful in London. Like, you can't drive anywhere fast. No, I mean, that's true. It's a very slow-moving problem. It's slow. I'm reading some fantastic novels right now that are set in London, and they spend half of their time in the novel in traffic. It's really sad. What breakthroughs do you think have to happen, then, before fully autonomous driving becomes fully mainstream? What are the barriers that remain? I don't think there is any need for fundamentally new breakthroughs. I think we're in the right generation of technology. I'm not saying we have solved it. I'm saying autonomous driving has a history that dates back 30 years. It has gone through, I want to say, five different generations of technology along the way. It started from the perspective of: an autonomous car is a robot, let's solve it the robotics way. And then machine learning and computer vision came in. Transformers came in. Behaviour modelling. Foundation models came in. I feel that today is the moment. We are in the right moment. There are a lot of innovations coming down the line, things like world models, things like large-scale foundation models. But that's the present in some ways, in the sense that it's on the horizon. I don't think we need another jump for it to become practical in the real world. So I think it's the moment. I really feel like autonomous driving is happening and we've got to make it happen. And then do you think, let's say this is the moment, do you think that autonomous cars will be ubiquitous, or do you think that there will still be room for human drivers? I could picture a future in which your grandkids, my grandkids, ask us: hey, is it true that in your day you used to drive by hand? That sounds scary.
That sounds dangerous. So it's entirely possible that we are going towards a future where the experience of driving by hand is no longer the norm and most of the driving happens automatically. Whether this future will be realized, I think, depends on a number of factors. I think cost and accessibility is one of them. Adapting the infrastructure is another. I don't think it's going to be soon, but it's a possible future. Absolutely fascinating. Vincent, thank you so much for joining me. Thanks for having me. It's particularly telling, the comment that Vincent made at the end there, that the path to autonomous vehicles working all over the world is laid out. And okay, sure, there's loads more work to do, but we don't need another big revolution in artificial intelligence to get there. The stepping stones are in place. We've got semantic understanding of a scene. We've got good simulations. And if you start to put all of that learning together in the right way, you won't just replicate human driving. You'll surpass it. You have been listening to Google DeepMind: The Podcast, with me, your host, Hannah Fry. And if you have enjoyed this journey with me, in a very bumpy way, then please do like and subscribe wherever you get your podcasts. And as ever, we have got plenty more incredibly interesting topics coming up later in the series. So please do join us again. Thank you.