Driving Safer AVs Faster with Smart Simulation, Neural Reconstruction, and Data-Centric Tools - Ep. 289
This episode explores how AI-powered simulation is revolutionizing autonomous vehicle development through neural reconstruction and data-centric tools. Rohan Basan from Fortellix and Dan Giral from Voxel51 discuss how companies are moving beyond traditional physics-based rendering to create high-fidelity synthetic scenarios for training and testing AV systems more efficiently.
- Time, not compute power or data volume, is the most valuable resource in AV development - efficient simulation can dramatically accelerate testing cycles
- Neural reconstruction and Gaussian splatting have created a step-change in simulation fidelity, making synthetic data nearly indistinguishable from real-world scenarios
- The industry is shifting from modular AV stacks to end-to-end models that handle perception, planning, and control in unified systems
- Edge cases and safety-critical scenarios remain the primary challenge, requiring smart data curation rather than simply collecting more petabytes of data
- Perfect visual realism in simulations may be less important than ensuring models learn correct driving behaviors from synthetic scenarios
"The most valuable resource when you're developing an AV system is not GPUs, it's not data, it's not people, it's time, right? How can I train as fast as possible? How can I test as many cases as possible?"
"We literally had people playing GTA 5 and crashing into other people to capture data, right? This was not a joke. This was state of the art, even at the time."
"I am highly confident that even someone with minor technical ability can build a self driving car that drives correctly 90% of the time. There's just no way in the world you would ever release a self driving car that only drives safely 90% of the time."
"The level of fidelity that we can get from neural reconstruction is unmatched by any kind of physics based rendering engine. It's just not even a close competition."
"I actually don't care if it doesn't look exactly like the real world as long as my car gets better at driving."
Welcome to the Nvidia AI podcast. I'm Noah Kravitz. Nvidia GTC is this March 16th to 19th, online and in person. Visit nvidia.com/gtc to learn more about the premier global AI conference and to register now. Today we're talking about autonomous vehicles and AV simulation, using AI powered systems to make driving safer and more efficient. With me are Rohan Basan and Dan Giral. Rohan is senior solutions engineer for sensor simulation at Fortellix, and Dan is head of technical partnerships and a machine learning evangelist at Voxel51. Gentlemen, thank you so much for taking the time to join Nvidia's AI podcast. Welcome.
0:10
Thanks for having us here.
0:51
Thank you for having us.
0:53
So before we get into talking about all things AV simulation, maybe we could start with quick self-intros. Dan, maybe you can start. Just tell us briefly what your company does and what your role is there.
0:54
Yeah, like I said, my name's Dan. I'm from Voxel51. We're one of the premier data platforms for physical AI and AV, helping you find all the kinds of problems in your data and curate your dataset so you're always focusing on the most important data. One of the things we always love to say around here is garbage in, garbage out, right? So if you're not training on the right amount of data, or the right data, you're not going to get the model that you're hoping for. Personally, my background is in all things physical AI, from before we called it physical AI. I spent a lot of my time working on edge devices such as drones, cars, and robotics, and I have a very strong background in hardware as well. But now, of course, as we've evolved toward this physical AI world, I can slap that label on it and everyone knows what I work with.
1:07
Awesome. Rohan.
1:56
Hi, I'm Rohan. I'm a solutions engineer at Fortellix, and I help our customers work with sensor simulation tools from a variety of providers to meet their synthetic data generation needs. My background is in sensor simulation as well. I was part of developing Ford's internal simulation toolchain for their level two and level three driver assistance stack.
1:58
Very cool. Maybe we can start with a little bit about what AV simulation is. We've talked about it on the show before, and a lot of listeners will be familiar, but maybe one of you could give a brief overview of what AV simulation is, why it matters in your work and in the broader mission of making autonomous vehicles real and safe and worth using. And then we can get into what the ecosystem is like today.
2:18
I'm sure Rohan can give a better answer for this one, so I'm going to pass this one to him.
2:47
Yeah, definitely. I think the AV simulation ecosystem has definitely changed a lot since I started working on this six, seven, eight years ago. A lot of focus then was on camera based solutions and using bespoke neural networks for individual perception tasks. So think tasks like object detection, lane detection, semantic segmentation, and fusing those together in the vehicle. Now we see a drastic shift away from that into more end to end solutions, where we have one stack that does all the perception and controls and outputs steering, throttle, and brake commands to generate a driving trajectory that's safe. I think a big reason for that has been developments in generative AI. Right. We talk about 3D Gaussian splatting or diffusion based models, and both of those have contributed significantly to the shift in sensor simulation data, and the fidelity of sensor simulation data that we're able to get and use for downstream training or validation of our AV stacks.
2:51
I'm sure there's not a one-size-fits-all answer to this, but for the listener who's thinking about sensors on an autonomous vehicle and gathering this data, what does this all mean on a typical car or truck, whatever works? Roughly how many sensors does each vehicle have? And what are the kinds of data we're talking about when we think about all these different sensors pulling data?
3:48
Yeah, that's a great question, and it's always fun to get that question because it really depends on who you ask. Right. If you walk into Waymo, they'll tell you one answer, where they have 25 different sensors and this thing and that thing. If you go ask Tesla, they're going to tell you a completely different answer. Right. They're going to say just cameras. In reality, what we typically see is a mixture of almost always camera data, that's things like videos and images, and then what we call lidar data. If you're familiar with things like radar, it's laser based ranging to understand how far away things are. Basically your car shoots a laser at the car in front of it, that laser comes back, and now I know the car's 10 feet in front of me. Famously, Tesla does not have lidar, just calling that out, but the majority of everyone else does have either lidar or radar on their cars. Okay. Now, something that Rohan touched on as well is that we're seeing adoption of even more sensors on top of this. Traditionally, when we were doing lane line detection or actor prediction, things like that, we only really relied on the video and the lidar to get these predictions. But now we also have a lot of additional sensors measuring the physical state of the car that gets fed into our models as well. Things you'd commonly find on every car, like a speedometer, of course, to understand how fast we're going, but also very complex physics based sensors to help understand exactly how the car is positioned, where it's moving, which way it's turning, and which way it might be getting pulled, and such as well.
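To make the lidar ranging idea concrete, here's a minimal sketch of the time-of-flight arithmetic described above. It isn't from the episode, and the numbers are only illustrative.

```python
# Minimal sketch: estimating range from lidar time of flight, as described above.
# The pulse travels out and back, so the one-way distance is half the round trip.

SPEED_OF_LIGHT_M_S = 299_792_458.0

def range_from_time_of_flight(round_trip_seconds: float) -> float:
    """Distance to the target, given the round-trip travel time of the laser pulse."""
    return SPEED_OF_LIGHT_M_S * round_trip_seconds / 2.0

if __name__ == "__main__":
    # A return after ~20 nanoseconds corresponds to roughly 3 meters (about 10 feet).
    print(f"{range_from_time_of_flight(20e-9):.2f} m")
```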
4:14
And so when we're moving between the physical world, the real world, I hate saying the real world, but the physical world, and then simulations, where are the gaps between what's available in the tools and the workflows that devs work with in these simulations, and what AV devs actually want or even need for the work that they're doing? Is it a matter of getting more and more accurate simulated data? Is it something entirely different? Where do the gaps exist that you're trying to address?
5:41
Yeah, this is a great one, and I'm sure Rohan and I can bounce off of this one, because he's really good at the second half of the answer I'm going to give, so I'll cover the first half here. Right. So the first thing I want to touch on is: is it just more data? The answer is absolutely not. If it was just more data, we would have self driving cars. If you're not familiar with this, most very serious self driving car companies, think of your Waymos, your Wayves, your Zooxes, are on the hundreds-of-petabytes scale, typically, or somewhere close to that. If you don't know what a petabyte is, a thousand terabytes is one petabyte. So we're talking a lot of data. A lot of data, exactly. So where I see a lot of the issues today, with a lot of the users that I interact with, is something that you touched on there: the actual translation from the physical to the digital. Did you do everything correctly? Is everything calibrated correctly? Because it's all well and good if you got exactly that scene or that drive that you were looking for in the real world. But if you can't properly communicate that back into your model, you're going to have a lot of issues.
6:15
Of course.
7:17
Right. If it doesn't see it the same way it actually happened, your model is going to degrade in performance because of that. And a lot of what Voxel51 is supplying to our AV users is ensuring that physical-to-digital translation is as clean as possible. Now, there are a lot of other things that people are going to be expecting whenever you do these translations, and in what happens after the translation. Things such as, well, I don't want to be doing this just for pulling out of the garage, for instance.
7:18
Right.
7:45
There are scenes that matter more than other scenes. Right. And this is where Fortellix really, really shines in their tool, and how they can help identify these cases.
7:45
Thanks for that. Yeah, exactly. Right. I think for us, as Dan said, there's a lot of data already out there, and Fortellix's goal is to help AV developers make sense of their data. Right. Identify what data they have and not necessarily generate or collect more data, but use that data smarter and fill in the gaps with exactly what they need. Not scenarios of you pulling out of the garage, but niche edge cases that are safety critical. We have tools that allow you to generate smart replays and create unique and interesting scenarios which are going to stress your stack and test whether it's actually up to the mark. And that's the kind of smart and, I guess, niche data that you need to generate that little bit of additional model performance, rather than throwing hundreds more petabytes on top of what you already have.
7:53
Right. Can you guys talk a little bit about foundation models and neural reconstruction, and how these two technologies in particular have changed expectations around simulations? AV simulations?
8:41
Yeah. Neural reconstruction definitely is something that's created a step change in what AV developers look for in simulation. The level of fidelity that we can get from neural reconstruction is unmatched by any kind of physics based rendering engine. It's just not even a close competition. You know, this technology is still developing, it's only a few years old, and there are still improvements to be made, but the goal is much more approachable now than it ever was before. And with Fortellix's smart replay technology, you can take a relatively uninteresting snippet and generate variations on it. We can have pedestrians crossing in front of you, other vehicles driving in ways you don't expect, breaking the rules, and creating interesting scenarios with a level of realism that was unimaginable even just five or seven years ago.
8:54
Dan?
9:40
Yeah, I mean, to add to that, I think the most important thing to understand here, and I like to make this point very often, is that the most valuable resource when you're developing an AV system is not GPUs, it's not data, it's not people, it's time, right? How can I train as fast as possible? How can I test as many cases as possible? And so every time you train on a scene that's not going to make your model better, you just wasted a lot of time. And so with neural reconstruction and foundation models, we don't have to go drive and find the snowy days. I'm not checking the weather to figure out when it's going to rain next. I can use models like Nvidia Cosmos Transfer, which is a really great example, to just say, hey, this is the scene I'm interested in, but how do I do when it's raining? How do I do if it's snowy? You can use a tool like Fortellix to change the actors. What if I didn't turn left? What if I turned right? And the main goal here is to save time, both because these neural reconstructions render faster and because it's easier to move things around. I mean, just to show you where we were five years ago, and this is not a joke, this is not an exaggeration, we literally had people playing GTA 5 and crashing into other people to capture data, right? This was not a joke. This was state of the art, even at the time. And so think about how much time we're saving and how far we've come in only a few years, where now we can just imagine these things with foundation models, neural reconstruction, Gaussian splatting, where we can skip not just driving in the real world, not playing GTA, but actually just do this in a matter of a couple prompts or clicks in order to figure out where we need to go next.
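As a rough sketch of the "generate variations instead of driving to find them" idea, the snippet below sweeps one recorded log across weather and maneuver variations. The `render_variation` function is a hypothetical placeholder, not the actual Cosmos or Fortellix API.

```python
# Minimal sketch of turning one drive log into a grid of synthetic test cases.
# `render_variation` stands in for whatever neural-reconstruction / world-model
# backend you actually use; it is not a real API from this episode.
from itertools import product

WEATHERS = ["clear", "rain", "snow", "fog"]
MANEUVERS = ["log_as_recorded", "ego_turns_left", "ego_turns_right", "pedestrian_crosses"]

def render_variation(base_log: str, weather: str, maneuver: str) -> str:
    # Placeholder: in practice this would call your simulation / generation stack.
    return f"{base_log}__{weather}__{maneuver}"

def expand_log(base_log: str) -> list[str]:
    """One recorded drive becomes many synthetic scenarios, with no extra driving."""
    return [render_variation(base_log, w, m) for w, m in product(WEATHERS, MANEUVERS)]

if __name__ == "__main__":
    clips = expand_log("drive_0421_intersection")
    print(f"{len(clips)} variations from one log")  # 16 variations from a single capture
```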
9:41
I don't know if I'm more amazed by the fact that that was only five years ago, or that we've come this far in five years. You know what I mean? I'm kind of like, oh man, that was a long time ago; no, it was only five years. But things have advanced so much. I don't know if "still" is the right way to ask this, but, Rohan, you mentioned the edge cases: somebody, a kid, running out onto the road, a sudden weather change, an object in the road, that kind of thing. Are these the kinds of things that are still amongst the most time and bandwidth intensive problems to solve when it comes to building a safe AV stack? Is it still about solving for these unknown edge cases? I see you guys nodding; people listening can't see that.
11:17
Of course.
12:13
Yeah, yeah. What are these things? Because I was going to ask you about this process. You talked a little bit about creating this high fidelity 3D reconstruction, and then you can turn it into all these different edge cases to test for and run stress tests on your stack. Is it, again, the unexpected person or obstacle in the road? Is it weather conditions? What are some of the most important, most difficult edge cases to test for?
12:13
Yeah, I think these are definitely amongst the most safety critical cases. Obviously the tails of the distribution are very long, and it's impossible to capture all of them. But I think the hope, you know, for us, for Voxel51, is that we can capture enough of these events that your model understands what to do, even if it hasn't seen the exact scenario before. Right. As humans we learn that whether it's a dog running in front of us or a cat or a human, we know that we should stop, and we want to create that same learning in your models. Again, with the high volume of data that's being collected by a lot of different companies, they're collecting a lot of nominal cases of you driving smoothly along the highway where nothing's happening. And that data is not necessarily incrementally useful to your model. And so at Fortellix, definitely one of our key offerings is being able to create a huge variety of edge cases from just one relatively nominal log, and allowing you to understand how your stack reacts to each different event.
12:40
Right. And just to add to that, I'll say it in a different way. If I gave you, Noah, a billion dollars and a year and a half and said build me a self driving car, right, I am highly confident that a person with even minor technical ability, and not to undermine the great work that people are doing, can build a self driving car that drives correctly 90% of the time. Right?
13:38
Okay.
14:01
There's just no way in the world you would ever release a self driving car that only drives safely 90% of the time. Right. Of course, this is how we think about it. And so we're trying to close the gap, and anyone who works with machine learning models knows how difficult this is: closing the gap from 95% to 97% to 99%, and now we're in this 99.9% of the time kind of region. And it's still not safe enough. We need to make sure it's safe enough. That's where the real problem lies. Right?
14:01
Yeah. Each of those little jumps is like an order of magnitude more effort, more time, more data.
14:31
Yeah. The last mile problem. No, for sure. When you're working so deeply in simulated environments, how do you measure, and maybe there's a better word than this, but how do you measure the realism of scenarios, of objects, of behaviors within these simulations? And how can you come up with benchmarks, other ways to measure things that will translate into safer outcomes out on the road?
14:36
I want you to go because I have a really big devil's advocate answer to this question.
15:01
Okay.
15:04
Well, it's a teaser. I like it.
15:05
I can go first. Yeah. For us, realism falls broadly into two categories. Right. And it depends on what you're trying to do, whether you're trying to train your stack or you're trying to test your stack's performance. In each case, obviously, better realism leads to better AV stacks and safer driving behavior. For training, we want the synthetic data that we are generating, whether it's through neural reconstruction or diffusion based generation like Cosmos, to increase the model performance of your end to end driver stack. Whether that's making sure that you're following the rules, minimizing disengagements or collisions, essentially we want training data to improve the model performance. And we can use intermediate, task based perception models as proxies to make sure that this is actually happening. That's typically what you would do with data generated from Cosmos. For testing, we want to see the same performance in the reconstruction that we saw in the real world. That means we want our stack to fail in the same way, whether that's predicting false positives or false negatives, to understand, or ensure, that our simulation is behaving exactly like the real world. It's not exactly applicable to end to end models, but again, using task based perception models can be a good proxy here. It gives us confidence that adding reconstructed snippets back into the training set with labels will improve the driving behavior. When you're able to see new scenarios, add those to your training data set and hopefully learn, not just, like I said, that a person is running in front of a car, but to have the same reaction if it's a dog or a cat or something else unexpected.
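Here's a minimal, hypothetical sketch of the testing-side check Rohan describes: run the same proxy perception model over a real clip and its reconstruction and measure how often they fail in the same way. The `FrameResult` fields and the sample data are illustrative assumptions, not any vendor's schema.

```python
# Minimal sketch of the "fail the same way" test: compare per-frame failure modes of
# a proxy perception model on a real clip versus its neural reconstruction.
from dataclasses import dataclass

@dataclass(frozen=True)
class FrameResult:
    frame_id: int
    false_positive: bool   # model detected something that was not there
    false_negative: bool   # model missed something that was there

def failure_agreement(real: list[FrameResult], recon: list[FrameResult]) -> float:
    """Fraction of frames where the model fails (or succeeds) identically in both clips."""
    assert len(real) == len(recon)
    same = sum(
        (r.false_positive, r.false_negative) == (s.false_positive, s.false_negative)
        for r, s in zip(real, recon)
    )
    return same / len(real)

if __name__ == "__main__":
    # Toy data: the reconstruction reproduces the real failures except for one extra FN.
    real = [FrameResult(i, i % 7 == 0, False) for i in range(100)]
    recon = [FrameResult(i, i % 7 == 0, i == 50) for i in range(100)]
    print(f"failure agreement: {failure_agreement(real, recon):.2%}")
```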
15:06
Yeah.
16:44
To add to that, I'll give the marketing answer first, which is very important, which is: we do the same thing very similarly.
16:45
Right?
16:52
And Rohan and I, our companies serve different purposes inside of the orgs that we work with.
16:52
Right.
16:56
So Rohan's very focused on safety and making sure it fails in the same way. These are all absolutely crucial, and that's not to say that I don't +1 everything you just said, but at Voxel51 we focus a lot on how the machine learning model interprets the context of this. Does the model think it looks like the real one? And there are many different metrics and things you can do, such as embedding models. You can extract embeddings from the model that is driving your car. So whenever we take a scene and then reconstruct it synthetically, or I make it rainy or I make it snowy or something like that, does the model still think it's driving in the same neighborhood? Because in the past we've seen this was not the case. It might look right to me, but the model thinks it looks like something completely different. That's exactly what we're trying to avoid. There are also a lot of classical photogrammetry and videogrammetry methods you can apply here, which tell you, hey, this doesn't really look like a normal video to us. The pixels are in the wrong place, it's not really moving in the right directions, there's a lot of generative AI noise that we don't want throwing our car off. Now, the devil's advocate answer here is: I don't think it actually matters as much as you think it does.
16:56
How so?
18:00
The goal of training a self driving car is to make sure it doesn't crash, right? So I actually don't care if it doesn't look exactly like the real world, as long as my car gets better at driving. And this goes back to my point earlier about time. If you can make me a synthetic scene that looks 90% of the way like the real world, and my car can drive in it, it learns from it, it learns from its mistakes, and we see that reproduced in the testing sim that Rohan is talking about, then that's great. Now, if you spend 10 times as long to get to 100%, and now I have all my RTX shaders on, I have all my ray casting, my shadows are perfect, if I look at a puddle it shows my reflection, all these kinds of things: studies have shown that doesn't matter. The car doesn't care if it sees a shadow or if it sees a puddle and things like that. The important part here is that it's learning how to drive better. And so there's always a trade off here. We always want to be as real as possible, and of course we're getting closer and closer to that goal every day. But there is this acceptable little zone here where, as long as you're being efficient in this process, your car is driving safer, and you're always maintaining that real world data inside of your testing and validation loops, then some level of not-realness is acceptable.
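A minimal sketch of the model-space check Dan described a few turns back: embed a real clip and its synthetic counterpart with the same feature extractor and ask whether the model still thinks it's "the same neighborhood." The `embed_clip` placeholder and the random frames are assumptions for illustration only.

```python
# Minimal sketch: compare how the model "sees" a real clip versus its reconstruction,
# by measuring cosine similarity between their embeddings.
import numpy as np

def embed_clip(frames: np.ndarray) -> np.ndarray:
    # Placeholder embedding: in practice this comes from the perception/driving model.
    return frames.mean(axis=(0, 1, 2))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.random((30, 64, 64, 3))                    # 30 frames of a "real" drive
    synthetic = real + rng.normal(0, 0.05, real.shape)    # a faithful reconstruction
    sim = cosine_similarity(embed_clip(real), embed_clip(synthetic))
    print(f"model-space similarity: {sim:.3f}")  # near 1.0 means "same neighborhood"
```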
18:01
Right, that makes sense.
19:22
Yeah. I completely understand that perspective as well. It's a little bit of a difference between machine learning engineers and the data side of things.
19:24
Right, Absolutely.
19:30
But yeah, there have been studies and papers published that show that mixing real data and synthetic data, even if that synthetic data isn't 100% realistic, still leads to improved model performance, at least on traditional task based perception models. So it's very reasonable to think that that should apply to end to end models as well.
19:31
And to re-emphasize, the digital twins, Gaussian splatting, Omniverse NuRec, it's still way better than where we were a year ago. Right. It's way more realistic and everything. Not to say that we're saying it's
19:47
bad or anything like that.
20:00
Right.
20:01
Of course.
20:01
We're getting so much better at it every day.
20:02
Yeah.
20:04
I'm speaking with Rohan Basan and Dan Giral. Rohan is senior solutions engineer for sensor simulation at Fortellix, and Dan is the head of technical partnerships at Voxel51. Rohan, I want to ask you about a Fortellix technology, an early step in the pipeline that you guys do, called scenario-driven data curation. Can you unpack that? Talk a little bit about what it is and how it differs from traditional ways that AV teams pick drive logs for simulation and for retraining.
20:05
Yeah, absolutely. Scenario-driven data curation is what we call our automatic scenario labeling tool. It allows us to automatically label temporal events, like approaching a stop sign or turning right at a junction while pedestrians are crossing, in the logs that we ingest, and we add them to our database. It allows us to search for individual events, or combinations of overlapping events, to identify scenarios of interest. We also have dashboards and metrics to see what data you've collected for your stack and what data is missing, and, you know, my role is to help our customers fill those gaps.
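As a rough illustration of searching auto-labeled temporal events for overlapping scenarios, here's a minimal sketch. The event labels and log structure are made up for the example and are not Fortellix's actual schema.

```python
# Minimal sketch: find places where two labeled temporal events overlap in the same log,
# e.g. a right turn while pedestrians are crossing.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    log_id: str
    label: str          # e.g. "right_turn", "pedestrian_crossing", "stop_sign_approach"
    start_s: float
    end_s: float

def overlapping(a: Event, b: Event) -> bool:
    return a.log_id == b.log_id and a.start_s < b.end_s and b.start_s < a.end_s

def find_scenarios(events: list[Event], first: str, second: str) -> list[tuple[Event, Event]]:
    """All places where `first` and `second` happen at the same time in the same log."""
    xs = [e for e in events if e.label == first]
    ys = [e for e in events if e.label == second]
    return [(x, y) for x in xs for y in ys if overlapping(x, y)]

if __name__ == "__main__":
    events = [
        Event("log_001", "right_turn", 12.0, 18.0),
        Event("log_001", "pedestrian_crossing", 15.0, 22.0),
        Event("log_002", "right_turn", 40.0, 45.0),
    ]
    hits = find_scenarios(events, "right_turn", "pedestrian_crossing")
    print(f"{len(hits)} matching scenario(s)")  # only log_001 qualifies
```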
20:38
Right. And then I understand you guys also have a product called Fortify that helps when customers are scaling up their programs to big fleets and using larger models. Talk a little bit about what Fortify does.
21:12
Yeah, Fortify is, I guess, like our overall toolchain.
21:26
Okay.
21:30
And this is a step within that. I think the way that it's different from, as you were asking before, the traditional ways AV teams pick drive logs is that this is an automatic process that we run when we ingest logs, so you don't have to go searching within your logs for certain events. As Dan said, you've got hundreds of petabytes of data, and digging through that is a difficult process. Our scenario matching tool helps label those events and lets you search within each category to understand what data you have in your ODD and what's missing.
21:30
Fantastic. Dan, from the Voxel51 side of the house, how do your neural reconstruction and data curation tools change the way that AV teams think and operate?
22:04
Yeah, right. And if I haven't heard it or said it already, it's all about efficiency here.
22:15
Right.
22:20
We're trying to be as efficient as possible, as well as holding as much context as possible for the right job. Right. So Fortellix, for instance, as we just heard, takes a really interesting, really great approach around scenarios and scenes: can you label and find the temporal aspects, such as left turns and U-turns and pedestrian crossings and things like that? For us, it starts earlier in the pipeline, with data curation based off the videos or the 3D assets, to help understand one of two things. One, is this the right thing I'm looking for? I'm really looking to see where the model is uncomfortable. And I leave that intentionally ambiguous, because it is an ambiguous thing, where we don't always want to project what we think the model needs to learn. We need to hear and listen to our model to understand where it needs to learn.
22:20
Right.
23:07
So using different methods of embeddings and curation, understanding model performance and evaluation, helps us understand exactly where the model feels uncomfortable, so that we can make sure we're only reconstructing and training on the scenes that are going to give us the most value. Then we can pass those to a tool like Fortellix and say, hey, this is the subset of those petabytes of data that we think is the most interesting. Let's label these first, and then we'll move back down the chain and gradually grow our data from there.
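One way to picture "listening to the model" during curation is to rank freshly ingested clips by how far their embeddings sit from anything already in the training set, and curate the farthest ones first. This is a generic sketch with random stand-in embeddings, not Voxel51's actual implementation.

```python
# Minimal sketch: novelty-based curation. For each new clip, score it by the distance
# to its nearest neighbor in the existing training embeddings, then pick the most novel.
import numpy as np

def novelty_scores(train_emb: np.ndarray, new_emb: np.ndarray) -> np.ndarray:
    """For each new clip, distance to its nearest neighbor in the training set."""
    # Pairwise Euclidean distances, shape (n_new, n_train).
    d = np.linalg.norm(new_emb[:, None, :] - train_emb[None, :, :], axis=-1)
    return d.min(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    train_emb = rng.random((500, 128))   # embeddings of data already trained on
    new_emb = rng.random((50, 128))      # embeddings of freshly ingested clips
    scores = novelty_scores(train_emb, new_emb)
    most_novel = np.argsort(scores)[::-1][:10]   # the 10 clips the model knows least about
    print("curate these clips first:", most_novel.tolist())
```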
23:08
Right.
23:36
Additionally, it's all about considering how accurate your translation is from physical to digital. We can always get better. There are always going to be microsecond differences, centimeters off. There's always going to be additional machine learning context we can add to it. Similar to how Rohan talked about those perception models in the middle evaluating the performance of your autonomous driving stack, you can actually do this before training as well, to understand things like: if I'm really interested in traffic lights that are turning yellow, for instance, well, I can use a machine learning model that finds specifically yellow traffic lights and then only grab those scenes, to see, hey, did my car go through the intersection? Should it have stopped? How close was I to the intersection?
23:37
Right.
24:18
This is a very interesting scene from an AV scenario perspective, and something that we can do inside of the Voxel51 tool to make sure that you find the scenes that best suit what you're looking for.
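A minimal sketch of the yellow-light example: filter scenes where a detector reported a yellow traffic light, then check whether the ego vehicle actually slowed to a stop. The field names here are illustrative assumptions, not a real dataset schema.

```python
# Minimal sketch: mine yellow-light scenes and flag whether the ego stopped.
from dataclasses import dataclass, field

@dataclass
class Scene:
    scene_id: str
    detections: list[str] = field(default_factory=list)       # class labels seen in the scene
    ego_speed_mps: list[float] = field(default_factory=list)  # ego speed trace through the scene

def yellow_light_scenes(scenes: list[Scene]) -> list[Scene]:
    return [s for s in scenes if "traffic_light_yellow" in s.detections]

def stopped(scene: Scene, threshold_mps: float = 0.5) -> bool:
    """Did the ego come (nearly) to a stop at any point in the scene?"""
    return min(scene.ego_speed_mps) < threshold_mps

if __name__ == "__main__":
    scenes = [
        Scene("a", ["car", "traffic_light_yellow"], [12.0, 6.0, 0.2, 0.0]),
        Scene("b", ["car", "traffic_light_green"], [13.0, 13.5]),
        Scene("c", ["traffic_light_yellow"], [11.0, 10.5, 10.0]),
    ]
    for s in yellow_light_scenes(scenes):
        print(s.scene_id, "stopped" if stopped(s) else "ran the light?")
```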
24:18
Right, right. Amazing. From the engineer's point of view, how much do things like interactive exploration and visualization of data sets help the engineers trust these reconstructed 3D worlds as they use them to simulate things?
24:30
Yeah, it's a great question. I think visualizations are always going to be important. A lot of computer vision, and I'm very thankful we've gotten to this point because it wasn't always like this in the beginning, a lot of computer vision starts with your eyes. We need to look at the data, we need to understand the data. We need to know that whatever I'm about to go train my model on, I've actually looked at it and checked it and understood it. It has that human element to it. And so as the field evolves, these visualizations evolve with it. Rohan talked about the Fortellix dashboards, understanding things like evaluation metrics, such as: where do my sim scenes seem the most realistic? Where are they drifting the farthest from real scenes? Do I have enough coverage? For instance, how many of my scenes are in parking lots versus highways versus cities? These are all visualizations. You can't go and scroll through all of your scenes to find this information. This needs to be brought to you in a way where you can easily understand not just the fact that I don't have enough parking lot data, but, oh, I can also see the correlation: look, my model's actually a lot worse in parking lots as well. So now this makes sense, right? I have a gap in my data, and I can see the gap in my data. And that's really the name of the game for an engineer here.
24:47
Right.
25:54
I'm going to obviously use the Voxel51 and Fortellix example here. You know, I get my data into Voxel51, it's all translated great, I pass all my checks, I'm ready to go reconstruct. I pass the reconstruction to Fortellix. Fortellix labels it, they scan it. Now, say we did this over 100 scenes. It's going to be able to tell me where those gaps are. Now that I know where the gaps are, I can go back to a tool like Voxel51 and find whether I have that data somewhere in my data lake, or I can use generative methods like Cosmos and things like that to generate those scenes. It's always going to be a cycle, it's always going to move continuously in a circle, but a lot of that is driven today through visualizations and dashboards, to understand exactly the right correlations and find those causations. Right?
25:55
Yeah.
26:38
As Dan said, I think, yeah, that's where Fortellix and Voxel51 work really well together. We identify the scenarios of interest, but at a deeper level, identifying individual events; Dan gave a great example of traffic lights that are turning yellow. By identifying those scenarios and making sure that they look realistic, you can understand how your data looks, whether it's from a realism lens or from how your model is interpreting it. Both of those are super relevant and point towards a partnership that works well for both of us.
26:38
As these AV stacks use larger and larger models, world models and perception models, how do you ensure, as you go through model training and then evaluation and then working in simulations, that all the different components can understand each other? Is there a commonality amongst the data, kind of a common data language you need to get things talking? How do you manage that level of understanding throughout the stack?
27:09
Definitely an interesting question. It's something that we're still trying to understand the best answer for today. Right. So there are some things that we can, you know, definitely answer today. One is that there is no common data language. This is an issue, right?
27:40
Yeah, yeah.
27:53
I don't think it's an issue that is easily fixable, either. Now, at Voxel51, this is one of our biggest pushes with the physical AI workbench. This is what we're pushing out there to address. I'm not saying that I'm going to push out a new data language and everyone should get on it. That's not practical. People have data, it exists already, it's in their own formats. That's not going to happen. So it's almost better to confirm not the language you're speaking, but that I'm translating it correctly. It's going to come in all these different forms. There are open source formats, but a lot of people tend to just have their own formats internally. Let's just ensure that whatever we're communicating to that world model, that foundation model, is being sent correctly. And that is the important part here.
27:54
Yeah.
28:33
In terms of how we evaluate them afterwards, it's tricky. This is something we're learning. I would make the argument that we're a little early today. Before, we were evaluating so many different modules in the stack. We talked about perception models: perception goes to tracking, tracking goes to prediction, prediction goes to planning, and planning goes to control. That's like six different modules that we would evaluate, and I would know how each one performs, but I didn't really understand the cascading, waterfall effects of errors. With world models, we can skip a lot of steps. A lot of times it's end to end: input, output, how well did you do? So I don't have to ask as many questions. I can just look at one model and say, are you doing well or are you doing badly? So the problem gets a little bit easier. But the question of why you're doing badly is still a little bit of an open question today, in terms of how we solve it.
28:34
So you guys are building on Nvidia platforms and, as you've talked about at length, using neural reconstruction technology. What's possible now, building at this scale on this technology, that just wasn't feasible before, when your teams tried to do it on smaller, maybe bespoke stacks? What does the current scale allow you to do?
29:24
I think for us, the current scale is really about how many variations we're able to render quickly on different scenarios. As Dan said, time is the key currency here. And with the level of compute that's available now, we're definitely able to generate a huge variety of scenarios, render them quickly, and then also pass that down into our training set and evaluate our stack faster. It's sped up the iteration of each model training and testing cycle, and that's hopefully going to lead to safer autonomy, faster, for everyone.
29:48
Yeah. And just to add to that, I think there are two interesting angles, and I'll give both of them. One, from the business ROI standpoint, there's so much data that these companies and teams have collected over the last five, ten years even, that has just been sitting around on a drive, more or less useless. As much as you might think these cars are driving around and we're training on that data constantly, that's typically not the case; a lot of times it's mostly useless. However, neural reconstruction has made this useful all of a sudden, because I can go back and take all these logs and say, hey, my car didn't crash in this case, but it almost did. So what if it did? And I can take this old data, and it doesn't matter how old it is or how long it's been sitting on a hard drive, and start to generate these augmentations or variations that are really interesting. From a pure business ROI perspective: thank goodness all this data we collected is not useless, and we can actually do stuff with it. That's going to solve a lot of problems, we hope. From the other perspective, and this is really focused more around Omniverse NuRec and the technical advancements: we talk about Gaussian splatting. If you don't know what Gaussian splatting is, I'm going to give you a really easy, simple, over-the-top basic example of what it is. You're basically painting a 3D picture. You have your paintbrush and you're painting the picture, but it's 3D, so you can decide how far away your strokes are. But that's all it was for a long time. And when we think about driving cars, well, cars move, right? So I need to be able to move things in the painting, but then I have to be able to communicate that in a very complex way to fit into simulation environments, and we won't go into all that detail. The idea is, now with Omniverse NuRec and some of the other available models coming out, I can specify, hey, this truck that I'm painting right here, it's moving, it's going to continue down the scene, and I might want to pick up that truck and move it somewhere else. Whereas all these other points out here that I've painted, all these brush strokes, to keep the painting analogy, they're static, they're not going to move anywhere, so don't move them as the truck moves. And that small advancement of being able to specify, hey, this is a moving object versus, hey, this is the background, has probably been the biggest step change that we've seen in the last six to eight months.
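To make the "brush strokes" analogy concrete, here's a toy sketch in which each splat is just a colored blob with a position plus a dynamic flag, and editing the scene moves only the dynamic splats. Real Gaussian splatting stores far more per splat (covariance, opacity, spherical harmonics); this only illustrates the idea of separating moving actors from the static background.

```python
# Minimal sketch of the painting analogy: shift only the splats tagged as dynamic,
# leaving the static background exactly where it was "painted".
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Splat:
    x: float
    y: float
    z: float
    rgb: tuple[int, int, int]
    dynamic: bool          # True if this splat belongs to a moving actor (e.g. the truck)

def move_dynamic(splats: list[Splat], dx: float, dy: float, dz: float) -> list[Splat]:
    """Transport only the dynamic splats; static scenery is untouched."""
    return [
        replace(s, x=s.x + dx, y=s.y + dy, z=s.z + dz) if s.dynamic else s
        for s in splats
    ]

if __name__ == "__main__":
    scene = [
        Splat(0.0, 0.0, 0.0, (90, 90, 90), dynamic=False),   # road surface
        Splat(5.0, 1.5, 0.8, (200, 40, 40), dynamic=True),   # part of the truck
    ]
    edited = move_dynamic(scene, dx=10.0, dy=0.0, dz=0.0)    # "pick up the truck and move it"
    print(edited[1])
```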
30:23
Yeah, yeah. Fantastic. So, Dan, I want to ask you about another Voxel51 innovation you've announced recently, something called the physical AI data engine. This is also powered by Nvidia tech. Can you explain what it is and how it changes day-to-day workflows for AV teams who, as you've been talking about, are working with petabytes of data and trying to wrangle all of it at once?
32:38
Absolutely. So, something that we're very proud of that recently came out, our physical AI data engine. It helps solve a lot of problems that we've been seeing. And this is something we created not just because we thought it was a good idea. We created it because we actually went out and talked to these AV companies in person and began to hear the same pain points over and over and over again. And so it was just a matter of time: hey, we just need to build this. If we want to move on to whatever that next stage of self driving is going to look like, this is something that just needs to exist, so we might as well build it. And it's important to remember that AV teams, even today, are not a bunch of people sitting at one table doing their work. When we think about AV teams, you have to keep in mind that there's a self parking team, an L2 team, an L3 team, a fully autonomous team. And these are all people who work in different regions of the world; they might be remote, they might work on different cars; there are many different factors. These people don't always, and more often than not don't, talk to each other. The data team doesn't talk to them either. So a lot of this has just been throwing things over the wall, back and forth to each other, for the longest time. And this causes a lot of issues.
33:05
Right.
34:14
The first issue is that you're not getting key advancements, both internally at your company and externally, from developments out in the field. Your data pipeline has become Frankensteined for your specific use case. If I'm the self parking team, for instance, I actually only care about the backup camera. Why would I need the side cameras? We're only backing up into spots, or something along those lines. And you can extrapolate this out to all the teams. So now we can put them into the physical AI data engine and say, hey, data team or ML team, when the data comes off the car, we need to make sure that everyone has access to it in the same format, in the same workspace. That way, if you have a key innovation, a great new model that parks the car the best, or drives the car the best, or predicts how far away a car is the best, whatever it might be, it can be easily shared across all your teams, in a way where you don't have to rewrite every single piece of code and you're not throwing Python files over the wall hoping for the best. Now you have direct, streamlined access all the way from when that data comes off the car to when it gets to that ML engineer, and everyone can see exactly what's happening every step of the way. The way this is done in practice is through two main steps that we do today, with a third step being the Nvidia tools that we're building on. The first step is, when data gets ingested, we want to maintain that physical-to-digital translation. The only way to ensure this agnostically, in a way where every company in the world can be on the same level playing field and translate it the same way, is through something we're calling, basically, the physical AI audit, where I'm going to check all the different translations. I have 100 different tests I'm going to run over all of your data sets to ensure that all your cameras are aligned, your lidar is set up correctly, all your timestamps are correct, the car not only thinks it's in the right spot but sees itself in the right spot, and every single sensor on the car is working in unison. I would say that probably more than 50% of the time, large companies that everyone would recognize fail this test. And it just happens. Data's messy, sometimes it's old, as we've been mentioning. So it's important to know where I need to fill in my gaps in order to get to these latest and greatest world models like Cosmos, because previously I only cared about the image or the video; I didn't care about everything else. Then, after I spot the error, well, it's not good enough to just spot the error. I've got to be able to help fix the error for you, of course.
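Here's a minimal, hypothetical sketch of what an audit pass over a drive log could look like: a handful of cheap checks that the physical-to-digital translation is intact before any reconstruction or training. The record fields and the specific checks are illustrative, not Voxel51's actual test suite.

```python
# Minimal sketch of an "audit" pass over a drive log before reconstruction or training.
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp_s: float
    camera_count: int
    lidar_points: int
    has_extrinsics: bool   # are the sensor calibration transforms present?

def audit(frames: list[Frame], expected_cameras: int = 6) -> dict[str, bool]:
    ts = [f.timestamp_s for f in frames]
    return {
        "timestamps_monotonic": all(a < b for a, b in zip(ts, ts[1:])),
        "all_cameras_present": all(f.camera_count == expected_cameras for f in frames),
        "lidar_not_empty": all(f.lidar_points > 0 for f in frames),
        "calibration_present": all(f.has_extrinsics for f in frames),
    }

if __name__ == "__main__":
    log = [Frame(0.0, 6, 120_000, True), Frame(0.1, 6, 118_500, True), Frame(0.2, 5, 0, True)]
    report = audit(log)
    print(report)   # the last frame trips two checks: a missing camera and an empty sweep
    print("PASS" if all(report.values()) else "FAIL")
```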
34:15
Right.
36:43
So we then head into our next area, which we call enriching. This is a mixture of photogrammetry, videogrammetry, driving techniques, and ML models that will fill in those gaps for you. A way to think about it: it's like a nine-out-of-ten problem. If you give me nine of the things, I can find the tenth for you, and I'm pretty confident in that. What we've actually discovered in development is that the number doesn't have to be nine; it could be more like four or five, and I'll fill in the rest. And this actually leads to way better data sets than if you just took the data off the car and passed it into neural reconstruction. We've done comparisons with certain customers and data sets where, if you pass the raw data set into a tool like Omniverse NuRec, versus if you pass it through audit, then enrich, making it better than it was when it came off the car, and then run Omniverse NuRec, you're actually going to get a better reconstruction the second time because you've added so much more color to it. And then, once you've audited your data set and it's in the best form it could possibly be, that's when you work with those world models, whether internal world models, Cosmos, Omniverse NuRec, or any other one, ultimately generating high quality reconstructions or variations and augmentations, which you can then pass into great tools like Fortellix to do all this scenario based analysis.
36:44
Right. Yeah. Fantastic. We always like to wrap on a forward looking note, but in this case it's going to be a very specific question, following everything we've been talking about. If everything you guys have been describing, this kind of data-centric, reconstruction-driven simulation, becomes the standard, how do you think it'll change the way teams operate, team structures, and the skills that individuals need to have inside AV companies over the next, let's say, five years? How do you think this kind of simulation is going to change the game?
38:02
I think five years is very difficult to predict because, as Dan said, five years ago... I hope I have a
38:36
job in five years.
38:42
I know you want to go. You want to go 18 months.
38:43
Yeah, I thought you were going to say 18 years. I was going to have a heart attack.
38:46
I know you want to go 18 months.
38:50
Five years ago we were playing GTA. Now it's a completely different game, right?
38:52
GTA? Yeah.
38:55
I think in the very short term it's going to make life a lot easier, especially for verification and validation teams. The neural reconstruction quality is so much better than any photorealism effort from physics based rendering solutions that it's going to make trust in simulation a lot easier. That means V&V teams can sign off on new features or new end to end driver models without as much extensive real world testing, reducing the costs associated with that. Again, time is the main currency here. The faster you can develop new models, the faster you can get them out the door, and the more data you can collect with those models to iterate on them and improve them. So I think that's going to be the number one benefit from simulation here.
38:56
Yeah, I agree with that too. I think teams will get a lot more flat. And I'm over-generalizing again; it's a podcast, I'm allowed to do that.
39:37
Yes.
39:45
A lot of things are just going to become promptable. I can just ask a model to do something for me and it will generate this reconstruction for me.
39:46
Right, right.
39:53
That's coming down the line; it's very obvious. So we're trying to remove a lot of these quite outdated skills. I want to say skills, not jobs; I'm hoping those things don't necessarily go hand in hand. People are just going to be able to do more with the resources they have. They're going to be able to do more and fill in more of the gaps. And we're going to remove some of this telephone game we've been playing, where the data comes off the car, there's a team for that; it goes into the data set, that's another team; then it gets simulated, that's another team; there's a reconstruction team, a synthetic data team, a safety team, so many different teams. We can flatten these structures and roll them into one cohesive team that is working together, which sounds much more beneficial for everyone involved. And taking the other side of your question, that's the short term looking forward. For the long term, I'm really, really interested in the CES announcement from Nvidia around AlpaSim and Alpamayo, which I think is really interesting, where now we're thinking about even doing simulation inside of world models and using world models to generate these variations for us. I think it's just really intuitive. I think we are still a little bit grounded in the GTA world even today, where some simulation experts will not say neural reconstructions are good enough unless they can drive the car in the engine themselves and see and feel it.
39:54
Right.
41:13
I argue, why do you care? You're not driving the self driving car; that's the whole point. And so I think in some ways we'll just lean on world models even more. Something like AlpaSim, I think, is really incredible in the opportunities it opens up, where the model should know where it performs badly, so it can go find those simulations itself, generate these traffic patterns, whatever it is, and do all these variations without me having to go in and manually move the cars around myself. That's probably more like two to three years down the line, if we're speaking honestly about when it's going to land. But I think it's a no brainer that we're heading in that direction, and it's really exciting to see how the field is going to evolve. I mean, this is less than a month old; we're only a month out from CES at this point of recording. So we're all still learning right now, but it's very exciting.
41:13
It's all brand new. Yeah. Amazing. Dan, Rohan, for listeners who want to learn more about Voxel51 and Fortellix, or just about the state of AV simulation and everything we've been talking about, what are the best places to go online? Company websites, social media handles, where should they take a look? Dan?
42:01
Yeah, two main places I can point you to. First and foremost, check out our Voxel51 docs. FiftyOne, the main platform we've been talking about today, is actually open source, so a lot of this you can use yourself. We have great tutorials and guides on how to get started with self driving examples, including some around Gaussian splatting and things like that. So I highly encourage, if you've been interested in the topics we're talking about today, head over to our docs. That's voxel51.com, and you'll be able to easily find our docs and all of our information about physical AI there. Additionally, my name is going to be attached to this podcast, Daniel Giral, and I appear everywhere exactly the same. So if you ever have any follow up questions, if you're interested in chatting more, or just interested in connecting, I recommend reaching out to me on LinkedIn. That's where I'm most available, and I'm always happy to chat about physical AI. It's on my mind all the time, so I'm always happy to share some of these thoughts.
42:23
Right, awesome. Great. Rohan.
43:15
Yeah, definitely reach out to me on LinkedIn. I'm happy to answer anything sensor simulation related, and I can always point you to the right person at Fortellix if you have other questions. Fortellix.com also has great resources on what we can provide and what services we offer. And yeah, if there's anything we can do to help, definitely let us know. You can also check out our announcements at CES and see if there's anything that interests you there.
43:17
Fantastic. Well, guys, thank you again for taking the time to come talk about AV simulation. It's all moving so fast; I say that every episode, but it's never not true.
43:41
Hey, it wouldn't be fun if it wasn't, you know.
43:53
Exactly. Well said. Well, I appreciate the conversation, and best of luck to you both on all the work you're doing. I look forward to recording the next one from the backseat of an autonomous vehicle as it guides us down the road.
43:55
Yeah, absolutely.
44:09
Yeah, you can do podcasts in cars getting coffee.
44:10
There you go. Right, exactly.
44:13
Right, exactly. Hope to see everyone at GTC as well. I think both Rohan and I will be there, so if you're going to be there in person, feel free to stop by. We'll be in San Jose. Excited to do so.
44:15
Yeah, look forward to seeing everyone at GTC as well. Great to be here. Thank you.
44:25