The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Intelligent Robots in 2026: Are We There Yet? with Nikita Rudin - #760

67 min
Jan 8, 2026
Summary

Nikita Rudin, co-founder and CEO of Flexion Robotics, discusses the current state of robotics and the gap between demonstration videos and real-world deployment. He argues that no humanoid robot currently generates actual value in industrial settings, but predicts this will change around the end of 2026 or the beginning of 2027.

Insights
  • Current robotics demos are misleading - they either involve teleoperation or require extensive data collection in controlled environments that don't translate to real-world deployment
  • The sim-to-real gap remains a critical challenge, requiring deep understanding of both simulation and real-world physics to create effective training environments
  • Modular approaches combining locomotion models with higher-level planners are more practical than end-to-end training for complex robotics tasks
  • Industrial applications will likely see successful robot deployment before consumer applications due to more controlled environments and specific task requirements
  • Hardware limitations, particularly in tactile sensing and onboard compute power, remain significant barriers to autonomous robot deployment
Trends
  • Shift from pure reinforcement learning to hybrid approaches combining RL with imitation learning and vision-language models
  • Movement toward modular robotics architectures with separate locomotion, planning, and execution components
  • Integration of pre-trained vision-language models (VLMs) into robotics pipelines for task orchestration
  • Focus on industrial robotics applications before consumer deployment
  • Increasing use of simulation for training with real-world data collection for specific challenging tasks
  • Growing ecosystem of affordable robotics hardware for research and development
  • Evolution from teleoperation-based demos to autonomous task execution
  • Emphasis on reducing human effort required to deploy robots for new tasks
Quotes
"My hot take on that, and I'll be happy to be proven wrong. But I think there is not a single humanoid robot today that actually generates value. Meaning there might be a robot that does something fairly close to what it's supposed to do in a factory or in a warehouse, but it's not the exact task. So in the end it's not generating value because it's not doing the actual thing it's supposed to do."
Nikita Rudin
"My prediction is that will change around the end of next year, maybe beginning of 2027. So, so we won't. Again, prediction, hard to predict, hard to say exactly what's going to happen. But I don't think we'll get this, you know, change chatgpt moment where suddenly you have a billion robots everywhere."
Nikita Rudin
"Until the robot can really go anywhere a human can go and you don't even need to think about can it do it or not, how reliable it is. My take is that locomotion is not solved."
Nikita Rudin
"Typically when I personally see a demo of a robot doing something, my approach is to think about what is the absolute easiest way to achieve that. And that is typically how it's done."
Nikita Rudin
"We have 35 people in a company and quite a few of these people still spend hours tuning rewards. And typically this is obviously referred in a negative way. So you don't want people tuning rewards."
Nikita Rudin
Full Transcript
2 Speakers
Speaker A

I'd like to thank our friends at Capital One for sponsoring today's episode. Capital One's tech team isn't just talking about multiagentic AI. They already deployed one. It's called Chat Concierge and it's simplifying car shopping using self reflection and layered reasoning with live API checks. It doesn't just help buyers find a car they love, it helps schedule a test drive, get pre approved for financing and estimate trade in value. Advanced, intuitive and deployed. That's how they stack. That's technology at Capital One.

0:00

Speaker B

My hot take on that, and I'll be happy to be proven wrong. But I think there is not a single humanoid robot today that actually generates value. Meaning there might be a robot that does something fairly close to what it's supposed to do in a factory or in a warehouse, but it's not the exact task. So in the end it's not generating value because it's not doing the actual thing it's supposed to do.

0:33

Speaker A

Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Nikita Rudin. Nikita is co-founder and CEO of Flexion Robotics. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Nikita, welcome to the podcast.

0:59

Speaker B

Thank you. We're excited to be here.

1:26

Speaker A

I'm excited to have you on the show and I'm looking forward to digging into our topic for the conversation, which is really digging into the gap between where we are today with robotics and where we need to be to fulfill the vision of the technology. You've been working in this space for quite a while. You did your PhD at ETH Zurich and spent some time at Nvidia. Why don't you share a little bit about your PhD and the focus of your research?

1:28

Speaker B

So when I started, we were trying to use simulation with reinforcement learning to teach a legged robot very simple things, like just walking on flat ground, and when the robot could take a few steps, that was already a big success. The core focus was to reduce the training time needed to achieve that.

2:00

Speaker A

And when you say legged, like a quadruped?

2:18

Speaker B

Exactly, like a quadruped. A big legged robo-dog. We were not using Boston Dynamics' Spot; we were using ANYbotics' ANYmal. ANYbotics is a Swiss startup that was a spin-off from our lab. The robot is very similar to Spot, but it's red and made in Switzerland. We were really trying to reduce the training time. Before I started, there were some results of reinforcement learning for such quadrupeds, but it would take weeks of computation to achieve anything. Using GPUs and massively parallel simulators, we managed to reduce that to just a few minutes. Actually, we had a demo on stage at a conference where we were running training live on a laptop while I was holding the robot. Every 15 seconds, the laptop would send the latest policy to the robot, and you could literally see how it went from just falling over to taking a first step. Then after three or four minutes, it would be able to walk around the stage. That was a pretty cool visual demo for everyone to see exactly how the learning process happens. From there, my PhD was about pushing the agility of that robot using similar techniques. It was still training neural networks in simulation and then transferring them to the real world, but the inputs got more complicated and the tasks got more complicated. By the end, we could go to a search and rescue facility here in Switzerland. You have to imagine collapsed buildings, a lot of mud, moss, gravel, big rocks, terrain that is very hard to navigate even for a human. We would just tell the robot to go from point A to point B and it would use its whole body: it would use its knees to climb on top of big rocks and then jump over gaps, all autonomously, all end to end, using images and the state of the robot to plan its next actions.

2:20

Speaker A

In telling that story about, you know, deploying this robot in a search and rescue context, envisioning the demo, I've seen similar things. The robot dog is going, you know, maybe opening some doors, maybe that wasn't part of your demo. But I've seen similar demos of, you know, the robot dog climbing hills and crossing rubble. And I think those demos, you know, attempt in many cases to land the idea that, hey, flag in the ground, we're done here. Talk a little bit about the distance between what you were able to accomplish with that demo and what you think needs to be done to deploy one of these robot dogs in a real search and rescue scenario, for example.

4:10

Speaker B

It's a great question, and I've had this debate so many times, even with my colleagues at Nvidia and ETH. The general question is: is locomotion solved or not? We already had this debate five years ago, when the robot could just barely walk on flat ground, and people were saying, yeah, look, locomotion is solved, we don't need to focus on it anymore. My take was always no. Until the robot can really go anywhere a human can go and you don't even need to think about can it do it or not, how reliable it is, my take is that locomotion is not solved. Where we are today is that anything that is blind can be very, very robust. Blind meaning that the robot does not perceive its terrain, so it's just reacting. That makes the training much easier because you just have to throw a lot of things at it. It always tries to be stable, and especially for a quadruped, it's fairly easy to remain stable. For example, it can even walk up stairs: it will just hit the first step, realize there is something, and then climb up. Doing it with perception is harder. Now you want it to actually react: you don't want it to hit the first step, you want it to place its feet much more carefully. Today, at the end of 2025, I'd say we can do that. We can have fairly good policies that plan the sequence of actions based on the perceptive inputs. Because we're training everything in simulation, this means that we need to be much more careful in how we simulate those sensors as well, and put a lot of effort into creating complicated terrains that can be seen by the cameras, plus all sorts of noise models to simulate the disturbances and defects of those images.
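
To make the idea of sensor noise models concrete, here is a minimal sketch of corrupting a simulated depth image before it is fed to a policy. Everything in it is an illustrative assumption rather than the pipeline described in the episode: the function name `corrupt_depth_image`, the parameters, and the choice of distance-proportional Gaussian noise with random pixel dropout are all hypothetical.

```python
import random

def corrupt_depth_image(depth, noise_std=0.02, dropout_prob=0.05,
                        max_range=5.0, rng=None):
    """Return a noisy copy of a depth image (rows of distances in meters).

    Assumed noise model (hypothetical): Gaussian noise proportional to
    distance, random missing pixels, and clamping to the sensor range.
    """
    rng = rng or random.Random()
    noisy = []
    for row in depth:
        new_row = []
        for d in row:
            if rng.random() < dropout_prob:
                # Simulate a missing return, reported here as 0.0.
                new_row.append(0.0)
                continue
            # Assumption: noise standard deviation scales with distance.
            d_noisy = d + rng.gauss(0.0, noise_std * d)
            # Clamp to the valid range of the simulated sensor.
            new_row.append(min(max(d_noisy, 0.0), max_range))
        noisy.append(new_row)
    return noisy

# Example: corrupt a tiny 2x3 "image" with a fixed seed for repeatability.
clean = [[1.0, 1.5, 2.0], [2.5, 3.0, 6.0]]
noisy = corrupt_depth_image(clean, rng=random.Random(0))
```

Applying a fresh corruption at every training step is one common way to keep a policy from relying on details that only exist in the simulator's perfect images.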

4:54

Speaker A

So what I'm hearing you say is that the addition of the additional information that you would think would help actually makes it more difficult because it introduces a lot of noise. And whereas previously the robot could kind of stumble through the terrain, now it's trying to incorporate this visual input to plan, but it ends up, you know, making it, you know, what actually happens when it is trying to do this? Is it, do you see it stuttering or does it just not work? Is it just hard to train from a model perspective? Like what happens?

6:37

Speaker B

So I mean, in the end what happens is that the final behavior is better if you do everything right. But that's a big if. The typical thing we refer to is the so-called sim-to-real gap. If you're training things in simulation and deploying them in real life, it's not the same, and things that work really well in simulation might not work at all in real life. That sim-to-real gap is much larger once you have perception in the loop. Once you're simulating either depth images or RGB images, it's even worse. So that makes the job of the researcher, of the engineer, harder. You need to cross that sim-to-real gap for both the physics of the robot and the perceptive inputs, so you have to simulate the sensors carefully. But if you do it right, then the final behavior is actually much better, because you can see that the robot is not simply reacting to whatever is happening under its feet, but actually planning accordingly in advance.

7:15

Speaker A

Is this problem of locomotion solved then with, you know, the, with vision at least for, you know, let's take quadrupeds as an example. Or is it. Is, are there still kind of outstanding issues like is there a generality gap or how do you think about it?

8:10

Speaker B

It's interesting. It really depends on where you draw the line on locomotion. Once the robot can cross complicated terrain, the next step goes more into navigation, which means: where should it go? Should it climb that thing or should it avoid it? Everything I've described so far was mostly using geometry, so there are no semantics. But now if you imagine the robot walking behind...

8:31

Speaker A

So it has a point A and a point B, and it's going to take the straight line and kind of plow through whatever is between here and there, as opposed to thinking about which way to go, more or less.

8:55

Speaker B

If there is a huge wall, it might avoid it, but then mostly it will try to climb on whatever is in front of it. Okay, now if you imagine the robot walking behind me in the office, you don't really want it to climb on every single desk. You want to avoid some things, but also walk on others, right? So if there's stairs, you want to take the stairs, but you don't want to hit plants or whatever else, which means that suddenly you have to add semantics to the policy.

9:06

Speaker A

And what does it mean to add semantics to the policy?

9:33

Speaker B

Once again, if you go from simulation to reality, it makes this sim-to-real gap bigger, because suddenly you have to simulate all these offices, all these different objects. Plus you probably need to give it images, not just depth images but RGB images, which means you need to simulate it in a photoreal way. The other option, which is probably the more correct one in the short term, is to split the problem. You train the robot to be very good at walking on anything, but you don't give it semantic information, and you train another thing on top, which you can call the planner or a higher-level policy, that will steer it around.
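
The split described here can be sketched as two layers with a narrow interface between them: a planner that outputs a velocity command, and a locomotion layer that tracks it. The code below is only a toy illustration under stated assumptions. The names `planner` and `locomotion_step` are hypothetical, the planner is a hand-written attract-and-repel rule rather than a learned policy, and the locomotion layer is a stand-in that simply integrates the command instead of driving joints.

```python
import math

def planner(robot_xy, goal_xy, obstacles, avoid_radius=1.0, max_speed=0.5):
    """High-level layer: return a (vx, vy) command toward the goal,
    pushed away from nearby obstacles the robot should not climb."""
    gx, gy = goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1]
    dist = math.hypot(gx, gy)
    if dist < 1e-6:
        return (0.0, 0.0)
    vx, vy = gx / dist, gy / dist
    for ox, oy in obstacles:
        dx, dy = robot_xy[0] - ox, robot_xy[1] - oy
        d = math.hypot(dx, dy)
        if 1e-6 < d < avoid_radius:
            # Repulsion grows as the robot gets closer to the obstacle.
            push = (avoid_radius - d) / avoid_radius
            vx += push * dx / d
            vy += push * dy / d
    norm = math.hypot(vx, vy)
    if norm < 1e-6:
        return (0.0, 0.0)
    scale = max_speed / norm
    return (vx * scale, vy * scale)

def locomotion_step(robot_xy, velocity_command, dt=0.1):
    """Stand-in for the low-level locomotion policy: in a real system this
    would turn the command into joint targets; here it just moves the base."""
    return (robot_xy[0] + velocity_command[0] * dt,
            robot_xy[1] + velocity_command[1] * dt)

# The two layers only communicate through the velocity command.
pose = (0.0, 0.0)
for _ in range(20):
    command = planner(pose, goal_xy=(2.0, 0.0), obstacles=[(1.0, -0.2)])
    pose = locomotion_step(pose, command)
```

The design point is the interface: because the locomotion layer only sees a velocity command, the planner can be swapped (rule-based, RL, or imitation-learned) without retraining the walking controller.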

9:36

Speaker A

And now historically, in the conversations I've had with roboticists, this has been a big debate, you know, whether we should be using end-to-end deep learning models that can figure all this stuff out, or using a more modular approach. It sounds like what you're saying is that a more modular approach can still be a pragmatic way to overcome the challenges of end-to-end training.

10:15

Speaker B

Yes, it's interesting because my whole PhD was about going more and more end to end, specifically for locomotion. And still I'm here arguing that we should not do everything end to end. I think at some point we'll get there. But in the short to medium term, as you said, the more pragmatic approach is to split the problem and use different techniques for different parts of the problem.

10:41

Speaker A

We're splitting the problem into kind of a locomotion model and a planner model. What's the objective for the planner? You know, the English-language objective is: I want this thing to intelligently choose the best path. But what is the best path? How do you define that? How do you create an objective around that? Maybe it's the path that's most power efficient if you're a robot. Or maybe it's the path that gets you there fastest, or the least distance. How do you balance all that?

11:03

Speaker B

There are different ways to do this. If you still choose the RL route, you can train these planners with reinforcement learning. Typically you have to define a reward function. It would include things like don't hit anything, avoid objects, and don't move too fast, because that's one thing that makes robots seem very dangerous. These reinforcement learning policies will try to optimize everything, so they will go very, very quickly to the goal. This is not really what you want with a robot that operates around humans. Typically we're actually trying to slow them down as much as possible. And yeah, that's mostly it. Then it really depends on what is in front of the robot. There might be things that it's okay to walk on and others that it's not. There is also another approach. Once you split the problem in half, you could train the locomotion with pure reinforcement learning in simulation, but you could train the planner with other data, for example videos of humans walking around the office. You can extract the trajectories from that, and then you don't really need reinforcement learning anymore. You train a behavior cloning, imitation learning policy that will just steer the robot the way a human would walk.
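
As a rough illustration of the kind of reward function described here (make progress, don't hit anything, keep away from objects, don't move too fast), the sketch below combines those terms into one scalar. The function name `planner_reward`, the weights, and the thresholds are assumptions made up for this example, not values from the episode.

```python
def planner_reward(prev_dist_to_goal, dist_to_goal, speed,
                   min_obstacle_dist, collided,
                   speed_limit=0.8, safe_dist=0.5):
    """Toy per-step reward for a navigation planner (all weights hypothetical)."""
    # Reward progress toward the goal since the previous step.
    reward = 1.0 * (prev_dist_to_goal - dist_to_goal)
    # Large penalty for hitting anything.
    if collided:
        reward -= 10.0
    # Smaller penalty for getting closer to an object than the safety margin.
    if min_obstacle_dist < safe_dist:
        reward -= 1.0 * (safe_dist - min_obstacle_dist)
    # Penalize speeds above the limit so the policy does not rush to the goal.
    if speed > speed_limit:
        reward -= 2.0 * (speed - speed_limit)
    return reward

# Same progress, but the second step was taken too fast.
calm = planner_reward(2.0, 1.9, speed=0.5, min_obstacle_dist=2.0, collided=False)
rushed = planner_reward(2.0, 1.9, speed=1.3, min_obstacle_dist=2.0, collided=False)
```

With these example weights, the rushed step scores lower than the calm one even though both cover the same distance, which is the effect the speed term is there to produce.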

11:45

Speaker A

And thus far, you know, speaking about how a human would walk, thus far we've been talking primarily about quadrupeds. How much of all of this translates from quadruped to humanoid robots?

12:55

Speaker B

All of it transfers. That is the magic of reinforcement learning: these policies don't really care if they're controlling a quadruped or a humanoid. This was the big switch from my PhD to Flexion, where we're working mostly on humanoids, and we've seen that the exact same techniques transfer. There is one interesting thing that happens with humanoids...

13:09

Speaker A

That makes sense to me for a planner, but it's less intuitive for a locomotion model. And maybe the locomotion model that I'm thinking of is split into multiple components. But let me be more specific. I'm including in locomotion the outputs that control stepper motors and all that kind of stuff. Is that part of what is trained in the locomotion model? Because then I would think that you would need to at least tune it or do something else if you're going to change the form factor of your robot.

13:29

Speaker B

No, this is a very good point. When I say it transfers, the general techniques transfer; the models themselves don't. So for sure you need to retrain a new policy, a new controller, for the new robot. But if all the simulation pipelines are general enough, it can be as easy as changing the input file, the URDF that describes the robot, retraining, and then you're ready to deploy. The tuning part is an interesting one because it is a little bit harder for humanoids compared to quadrupeds. And I don't think it's really related to the fact that it has two or four legs or anything like that. Personally, I think it's mostly related to the fact that we have very specific expectations of how a humanoid robot should walk, whereas with a quadruped, if it walks slightly differently from a dog, it's completely fine. But with a humanoid, if it doesn't move the arms in the right way, if it bends the knees too much or walks a little bit sideways, humans have a very strong reaction to that.

14:15

Speaker A

I saw a tweet kind of touching on the same idea just the other day. And it was essentially this humanoid robot that was kind of locomoting like a quadruped. Like it was like on its back with its arms like this and like moving really quickly. And the, the main thrust of the tweet was that, you know, humanoid robots move like humanoid robots because we, like, that's our expectation, but that, you know, there are potentially other, you know, even with that form Factor. There are potentially other more efficient ways for these things to move, but they just seem wrong to us.

15:17

Speaker B

Yeah, that's true. But if we want robots operating around humans, we have to create some trust. So we have to take the less optimal route if it makes humans feel a bit more comfortable.

16:00

Speaker A

We've been working on robots for a really long time, but it seems like over the past year we're, you know, seeing advancements via video demos coming very quickly. I kind of asked this question before, but I want to have you parse through how to think about these videos. We were seeing humanoid robots walking at the beginning of the year, and now they're running, now they're doing dishes and all these things. What goes into creating a demo like that? And what are the limitations of what it says about what the robot is capable of?

16:13

Speaker B

First of all, I would say that it's really exciting that the whole ecosystem is moving so quickly. It feels like every single day there is a new video of a new robot doing something, and that's really, really good. We see a lot of progress, both in hardware and on the AI side, in what these robots are capable of doing. Having said that, typically when I personally see a demo of a robot doing something, my approach is to think about what is the absolute easiest way to achieve that. And that is typically how it's done. As an ecosystem, we're showing a vision of what robots should be doing, but the behind-the-scenes is sometimes a little bit different. For example, if you have a robot standing and doing some sort of manipulation on a table, folding sheets or something like that, typically one of two things is true. Either there is someone hiding behind the robot, behind a curtain, or somewhere in another room, teleoperating that robot, so it's not really autonomous. That's, let's say, one third of the cases. The other two thirds are where the robot is actually autonomous. But to get there, 100 people had to teleoperate 100 robots to collect a lot of data: hours and hours, in some cases thousands of hours, of robots doing very, very similar things in probably the same environment. Then they collect the data, train a policy that can imitate that data, and then they're able to deploy robots autonomously. That is not exactly what you would assume when you watch the video, because it feels like the robot can just adapt to anything and come to your home and do everything you're doing. On that front, we're just not there yet, but we'll get there soon.

16:56

Speaker A

Remind me the name of the company that started taking pre orders for a humanoid robot that is ready for the home, quote unquote.

18:42

Speaker B

You're referring to 1X.

18:50

Speaker A

1X. I looked at that, and I've had a lot of conversations along these lines about where robots are. And looking at that, I question, like, okay, are we really a lot further than I think we are, or, you know, is something else happening here? And this is, you know, the early buyers are going to be beta testers, and it may or may not work once it gets to the house. Do you have any takes on, not that specific company necessarily, but the readiness of humanoid robots for the home?

18:52

Speaker B

Also, 1X is not the only one. There are a few more who announced similar things. I mean, in a way it's good; it's very, very ambitious to sell robots next year into people's homes. I think they have a big challenge ahead of them, so let's see how far they can get next year. But usually these companies are also fairly honest that it is an alpha or beta program. It will be just for early adopters. It will take a few more years before you can really buy these robots and send them into homes. That's partially why, as a company, we're focusing more on industrial use cases. There are a lot of other challenges with industrial use cases, because now suddenly performance is really important. You need to be very fast; you cannot slow everything down, at least in most cases. But you have a little bit more control over what's happening. For example, if we want to deploy 10 robots in a new warehouse, it's easier for us to send an engineer for one or two days to check that everything is in order and that they operate as they should, and if they don't, fine-tune a few things and then let the robots work, which is not something you can do in everyone's home.

19:36

Speaker A

And the tasks in the industrial setting, I'm imagining they're more repetitive, more consistent, with less variation than, you know, run to the fridge and grab me a Coke?

20:52

Speaker B

Yeah, that's right. And we get to decide which tasks we tackle and which ones we leave for the future. So we can start with simpler things. A lot of it is moving objects around: bringing objects from point A to point B, moving boxes, opening boxes, taking items out or putting objects into boxes, putting the box in a truck, sending it further. And this seems really within reach for next year, maybe the year after that.

21:09

Speaker A

But even though you might have seen a video of a robot doing that, that doesn't mean that it's, you know, ready now and people are doing it now without, you know, with it not being in a development phase. Is that fair?

21:37

Speaker B

Yeah, that's fair. My hot take on that, and I'll be happy to be proven wrong, is that there is not a single humanoid robot today that actually generates value. Meaning there might be a robot that does something fairly close to what it's supposed to do in a factory or in a warehouse, but it's not the exact task. So in the end it's not generating value, because it's not doing the actual thing it's supposed to do.

21:51

Speaker A

Meaning it's doing some variant of the thing. Or it's like there's a handler that's fixing up, you know, cleaning up after the robot as it makes a mess across the.

22:14

Speaker B

Exactly, yeah. And typically you would have more handlers than you had people before. So you could argue the value is negative, but once again we'll fix that.

22:24

Speaker A

You know, we talked about like how you create these demos and the idea that there's either real time teleoperation or, you know, many, many people doing teleoperation to collect training data, you know, talk a little bit about, then you know, after that training data is collected via teleoperation. Like what the approach is for training, is that data then used as part of RL or is that more a supervised learning type of an approach?

22:33

Speaker B

Typically it's supervised. The data you record is, you know, images from the camera and the commands that the teleoperator sent to the robot, which typically is how you should move your hands in space and how you should move your fingers. That is recorded, and then a big transformer is trained to produce the same actions from the same images. Now what's interesting is that the whole field shifted a little bit from just training these transformers from scratch to using vision and language encoders that were pre-trained on Internet-scale data.

23:10

Speaker A

Like VLMs, off-the-shelf VLMs?

23:54

Speaker B

Typically you take a VLM, you remove the output, you train a new part of the network on top, and then you call that a VLA. So it's a vision-language-action model where the vision-language part was pre-trained before and the action part is trained from scratch.
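
The recipe described here (keep a pre-trained encoder, train a new action head on teleoperation data) can be illustrated with a deliberately tiny, pure-Python sketch. Everything in it is a stand-in chosen for this example: `make_image`, `frozen_encoder`, and `ActionHead` are hypothetical names, the "encoder" is a hand-written feature extractor rather than a real VLM, and the head is a linear layer trained with plain SGD on a squared-error imitation loss instead of a transformer.

```python
import random

def make_image(object_index, width=21):
    """Toy 1-D 'camera image': one bright pixel where the object is."""
    return [1.0 if i == object_index else 0.0 for i in range(width)]

def frozen_encoder(image, instruction):
    """Stand-in for a pre-trained vision-language encoder; never updated below."""
    brightest = max(range(len(image)), key=lambda i: image[i])
    object_pos = 2.0 * brightest / (len(image) - 1) - 1.0  # scaled to [-1, 1]
    is_grasp = 1.0 if "grasp" in instruction else 0.0
    return [object_pos, is_grasp, 1.0]  # trailing 1.0 acts as a bias feature

class ActionHead:
    """Linear head trained from scratch: features -> [hand_target, gripper]."""
    def __init__(self, n_features, n_actions, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
                  for _ in range(n_actions)]

    def predict(self, features):
        return [sum(wi * f for wi, f in zip(row, features)) for row in self.w]

    def train_step(self, features, target_action, lr=0.1):
        pred = self.predict(features)
        for a, row in enumerate(self.w):
            err = pred[a] - target_action[a]
            for i, f in enumerate(features):
                row[i] -= lr * err * f  # gradient of 0.5 * squared error

# Hypothetical teleoperation demos: (image, instruction, recorded action).
demos = [
    (make_image(2), "grasp the part", [-0.8, 1.0]),
    (make_image(7), "release the part", [-0.3, 0.0]),
    (make_image(14), "grasp the part", [0.4, 1.0]),
    (make_image(19), "release the part", [0.9, 0.0]),
]

head = ActionHead(n_features=3, n_actions=2, rng=random.Random(0))
for _ in range(300):
    for image, instruction, action in demos:
        # Only the head's weights change; the encoder stays fixed.
        head.train_step(frozen_encoder(image, instruction), action)

prediction = head.predict(frozen_encoder(make_image(10), "grasp the part"))
```

In this toy setup the head is queried on an object position that never appeared in the demos; whether real VLAs generalize in a comparable way is exactly the open question discussed later in the conversation.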

23:57

Speaker A

Got it, got it. So as opposed to predicting the next language token, you're now predicting an action token, which is then translated into, you know, a separate motor motion or something like that.

24:15

Speaker B

Yeah, exactly.

24:28

Speaker A

Kind of compare and contrast that approach with what folks were doing before. Are we doing that because it's cool, or, you know, how much does having a pre-trained model to start with save us compared with the generic transformers?

24:29

Speaker B

That's a very good question. The general thought is that it helps with generalization. Since the language and vision encoders were trained on Internet-scale data, they're supposed to generalize. A typical case was that if you don't do that, you would train a robot during the day, then the lights go down at night and it won't be able to perform anymore, even though there's still light, so everything should just work. A human wouldn't even see the difference. But because the image embedding changes a bit, the policy doesn't perform anymore. I believe that this gets better with the pre-trained encoders. To be completely honest, overall the generalization capabilities still need to be proven. I think Russ Tedrake had an amazing talk at Stanford where he was talking about their efforts at Toyota Research Institute, where they were comparing training policies on a very specific task with little data versus training more generalist policies with a lot of data. And they were seeing some signs of generalization. I don't want to quote him directly, but it seemed like it's not fully understood yet how much of the generalization is coming from that.

24:50

Speaker A

Now map those two, little data and lots of data, to the transformer versus VLM discussion. The VLM would be the little data and the transformer was a lot of data, because we're assuming the VLM was pre-trained? Or was it reversed?

26:07

Speaker B

No, it's actually reversed. If you include the pre-training as data that you get for free, then you have a massive amount of data for pre-training, and then you can add less data for fine-tuning this action head.

26:29

Speaker A

I guess it doesn't matter which one is which because the results were somewhat inconclusive is what I'm hearing.

26:46

Speaker B

We'd need to go into more detail, and I'm quoting other people here, so it's a bit hard to say. I think it makes a lot of sense to have pre-trained visual encoders and language encoders because you don't want to relearn language every single time you want the robot to do something. Language is language, and by the way, we have amazing VLMs now, so we might as well use them. It's more a question of this action head: should you train it on a lot of random data or just on the thing that you want the robot to do in the end? This is still an open question.

26:53

Speaker A

One question that that raises for me: I just had a conversation where we were talking about how VLMs generally kind of ignore a lot of the visual information and really rely more heavily on the language information. And it seems like in a robotic scenario that would be even more harmful to what you're trying to accomplish. Do you run into that as a challenge?

27:28

Speaker B

I've heard the same thing. I haven't seen it in the VLA case. I would guess it's because the robot cannot ignore the visual input; it's the main source of information. What tends to happen, however, is that they ignore the language inputs. If you train the robot to always do the same thing, I don't know, if you have a box with an object inside and it always has to take it out, it will completely ignore the language. It will just do the same thing. It will try to guess from the image what it's supposed to do.

27:59

Speaker A

You mentioned the sim-to-real gap. This has been a known issue for many, many years, and we have been making good progress on closing that gap. Talk a little bit about, in your experience, what is required today to create a model in sim and have it run in real. Are you doing things explicitly or specifically to address the real world, or is it just that the models are better, the process is better, and you don't really think about that anymore and it just kind of works?

28:32

Speaker B

You need to do a lot of things very explicitly. The challenge is that to cross the sim-to-real gap, you need to have a very deep understanding of both worlds: of the simulation and how it works, and of the real world. Which means that if you want to have a robot that walks around as it should in sim, you need to go very deep. You need to know exactly what's happening between a command that the policy outputs and all the way down to torque in the motors. There are typically 10 different layers of transformations, even just in software, in how we go from a high-level command to actual current in the motors. It's very tempting to ignore that. But by understanding every single layer and knowing all the different transformations, you can properly simulate it, and this really unlocks better performance.

29:17

Speaker A

So that suggests like the level of simulation that you're doing isn't like, you know, you pull up your sim environment and get generic, you know, humanoid robot, and you're going to train some model and deploy it to something else. It's like you have a digital twin of your humanoid robot in a simulated, like a high fidelity simulation environment. And you're training, you know, to a very fine level of detail, which sounds very computationally expensive.

30:09

Speaker B

No. So we're much closer to what you described first. So we have a very generic simulation environment. But there are some very specific things that are important. One clear example is what are the torque and velocity limits of a motor? You cannot expect it to do something that is not possible on the real robot. So you need to add those limits. And there are a few more things like that, like what kind of delay can you expect between a command. You need to identify a few of those parameters. And we are actually doing this usually in what we call a real to sim process. So we take the real robot, really hang it in the air, we let it shake a little bit, collect data of all the different motors, and then we know which are those important effects that we need to identify. We identify them and add them to the simulator. But simulation speed is the most important thing. So you cannot afford to simulate all those different effects, currents, magnetic fields, et cetera. You need to abstract all of it away.
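A minimal sketch of the kind of abstraction described above: instead of simulating currents or magnetic fields, the actuator is reduced to a PD torque law with a command delay, a torque limit, and a velocity limit. The class name `SimpleActuator`, the PD form, and every numeric value are assumptions for illustration, not Flexion's model or identified parameters.

```python
from collections import deque


class SimpleActuator:
    """Toy actuator model: PD torque with command delay and torque/velocity limits.

    All parameter values are illustrative assumptions, not measured ones.
    """

    def __init__(self, kp=40.0, kd=1.0, max_torque=30.0,
                 max_velocity=20.0, delay_steps=2):
        self.kp = kp
        self.kd = kd
        self.max_torque = max_torque
        self.max_velocity = max_velocity
        # Queue of pending position targets; the oldest entry is what the
        # motor acts on, which models a fixed command delay.
        self.pending = deque([0.0] * delay_steps)

    def torque(self, target_pos, pos, vel):
        # Delay: the motor tracks the target issued `delay_steps` calls ago.
        self.pending.append(target_pos)
        delayed_target = self.pending.popleft()

        tau = self.kp * (delayed_target - pos) - self.kd * vel

        # Torque limit: never request more than the motor can produce.
        tau = max(-self.max_torque, min(self.max_torque, tau))

        # Velocity limit: no torque that would push the joint past max speed.
        if vel >= self.max_velocity and tau > 0:
            tau = 0.0
        elif vel <= -self.max_velocity and tau < 0:
            tau = 0.0
        return tau
```

In a real-to-sim workflow, values such as `delay_steps` and `max_torque` would be fitted to data logged from the hanging robot rather than chosen by hand.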

30:38

Speaker A

That sounded very hard or expensive.

31:36

Speaker B

My personal take is that you still need to understand them even though you're not simulating them.

31:41

Speaker A

Got it. It sounds then like the result of that process is not a general model that you could deploy to any humanoid robot, but one that is specific to the humanoid robot for which you did the real-to-sim and identified those key parameters. But because you're able to abstract it out to some handful, or several handfuls, of key parameters, it's relatively easy to do new robots.

31:47

Speaker B

And this was also the surprising part. One of the key learnings of this year is that switching robots is fairly easy as long as the hardware performs reasonably well. Now, as a company, we work with a few different suppliers of robots and a few different partners as well, with whom we're working closely. We've deployed controllers on, let's say, between five and 10 different robots. And we see now that making a new robot walk is basically a few days of work. And it should be less. It should be less than one day of work if we optimize some of our processes. Now, bringing them to a new task, this is a bit more challenging. This requires more engineering today. And this is what we're focusing on. One of our key metrics is how much human effort is involved in bringing a new robot to a new task. New robot is very easy today. New task is something we're working on.

32:25

Speaker A

And in this context, like how we've kind of talked a little bit about this, but how specific is a task? Meaning, like, is a task pick and place or is a task robot in this warehouse picking off of this line and placing into these bins?

33:20

Speaker B

That's a great question. More like pick and place. But there is an interesting concept there, which is we are trying to leverage the information contained in large VLMs to orchestrate and break down complex tasks into clear subtasks, even though that's not what we're focusing on today. Cooking is a great example, a great metaphor. Cooking, yes. If you wanted to train the robot to cook every single meal on the planet and you would say each meal is its own task, you would never finish that. Right. The set of tasks is huge. But what you could do is you give the recipe to a VLM. You also give it images of what the robot sees. If the recipe says cut a cucumber, the VLM would say grab the knife, grab the cucumber and do this sort of motion to cut it. And then you can break it down into much simpler primitives, like cutting things, holding a pan, putting it down somewhere, filling a glass of water, pouring it, things like that. And suddenly the set of these primitives is not infinite anymore. The challenge is that now you need a higher-level intelligence that will orchestrate all these primitives. But what's interesting is that that part is basically solved with a VLM. It's not 100% there, but it's moving much faster than the actual physical interaction of doing all these motions.

33:44

Speaker A

So can you elaborate on that? The orchestration is solved, and if so, how is it solved? And what's the relationship between that orchestration and what we talk about as the reasoning capabilities of these large models? Is it the same thing, or related?

35:19

Speaker B

It's similar. So maybe one way to describe this is on our website we have two videos. We have a video of a robot walking in a forest and picking up trash. This is mostly there to showcase what's possible and also play a little bit on our Swiss angle, using our nature. We have another video where the robot is doing the same thing in our office. And in that second video, it's 100% autonomous. So you give it a text prompt. I think we're saying something like, pick up the toys in front of you and drop them in the basket at the end. And we're using an off-the-shelf VLM for that. The way this works is we're giving the images to the VLM and we're allowing it to do tool use to call specific skills of the robot. So the VLM would say, oh, I see a toy there on the ground. Let's walk to the toy. And this "let's walk to the toy" is a skill that is actually triggered and executed by the robot. Once we're there, it will trigger "pick up the toy", then go to the basket and drop it off. And so by having a few of those primitives, which are walk to things, I mean, the walking, locomotion, as we discussed, is itself fairly complicated. So you can walk on stairs, you can walk on a bunch of different complex terrain. But by having walking, picking things up from the ground and then dropping them somewhere else, we can recombine it in many, many different ways without any retraining, just by prompting an off-the-shelf VLM.
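The tool-use pattern described above can be sketched as a small dispatcher: a VLM proposes named skill calls, and the robot executes whichever skill matches. Everything here is hypothetical, including the skill names, the `fake_vlm` stand-in (which returns a fixed plan instead of calling a real model), and the string results; a real system would presumably re-query the VLM with fresh images between skills.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical skills; on a real robot each would trigger a trained policy.
def walk_to(target: str) -> str:
    return f"walked to {target}"


def pick_up(target: str) -> str:
    return f"picked up {target}"


def drop_in(target: str) -> str:
    return f"dropped item in {target}"


SKILLS: Dict[str, Callable[[str], str]] = {
    "walk_to": walk_to,
    "pick_up": pick_up,
    "drop_in": drop_in,
}

ToolCall = Tuple[str, str]  # (skill name, argument)


def fake_vlm(prompt: str, image: bytes) -> List[ToolCall]:
    """Stand-in for an off-the-shelf VLM; returns a fixed list of tool calls."""
    return [
        ("walk_to", "toy"),
        ("pick_up", "toy"),
        ("walk_to", "basket"),
        ("drop_in", "basket"),
    ]


def run_task(prompt: str, image: bytes,
             vlm: Callable[[str, bytes], List[ToolCall]] = fake_vlm) -> List[str]:
    """Execute each tool call the VLM proposes, skipping unknown skills."""
    log = []
    for name, arg in vlm(prompt, image):
        skill = SKILLS.get(name)
        if skill is None:
            log.append(f"unknown skill: {name}")
            continue
        log.append(skill(arg))
    return log
```

Because the skills are looked up by name, adding a new primitive means registering one more entry in `SKILLS`; nothing about the dispatcher itself changes.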

35:43

Speaker A

When I hear tool use, I hear like separate process or module or model. Is that the, is that the case? And like, how do you think about, you know, then I think of like, if, you know, if we got a bunch of tools, you've got a bunch of these, you know, separate models or modules. Like are they actually. Does this architecture imply that they are in fact separate and, you know, trained separately, or are they more universal somehow?

37:11

Speaker B

That's another great question. So in our case specifically, today they are separate, but we are actively working on merging them together into one single model, a more general model. And the hope with that, again, is that you see some generalization, so you see some interpolation between those different models. And the way we would do that is actually by still training those different primitives separately and then using them as data generators to collect a massive amount of data in simulation, to then train one of those larger VLAs across the whole data set such that it can perform everything. We're seeing early results of that in our company. So things are going in that direction. But we still need to prove that this actually leads to the generalization we were talking about before.

37:45

Speaker A

And hearing you describe that, you know, often when I hear like These kind of student teacher types of approaches, or, you know, I think of like distillation and trying to get to smaller models. You're not necessarily trying to do that, but talk a little bit about the hardware capabilities from a model inference perspective and like, where we are in terms of, you know, model size, that kind of thing.

38:43

Speaker B

In our plan, if we go through that whole process, we train, let's say, 50 of those primitives, we distill everything. We are still developing a hierarchical pipeline where you have three models interacting with each other. It would start with a relatively large, let's say, VLM that would be allowed to use reasoning, and the output of that one would be that the VLM would go from a very abstract task to clear subtasks, such that if you have a robot here and I tell it, I don't know, go pick something up in the fridge, it would say: turn around, go through the door, open the fridge, grab the thing, close the fridge, et cetera, et cetera. Clear instructions. Then those clear instructions would go to that VLA that we were describing before, where if it receives "open the fridge" as an instruction, it will plan a kinematic motion for the arm to grab the fridge handle and open it, just a few seconds into the future. And finally we have what we call a Whole Body Tracker, which will receive this plan of how the hand should move and how the whole body should move, and then will control the motors of that specific robot to execute that motion. Okay, now the size and frequencies of these models are very different, and that's why we think it makes sense to have three of them. The final one, the Whole Body Tracker, is a very simple, very small model. Typically it's a very small transformer. To be completely honest, it doesn't even need to be a transformer. But today everything should be a transformer. And that can run very easily. We typically run them at 50 Hz, 50 times per second, even on the CPU of the onboard computer of the robot, not even the GPU, because it takes more time to send it to the GPU and get it back. Then I'll skip the VLA for now and go to the VLM. That one is typically fairly hard to run onboard on the robot. For now, it's running offboard, either in our office, in a server rack, or even in the cloud, which creates its own challenges.
Once you want to deploy a hundred robots in a warehouse, either you have an amazingly good Internet connection or you have to install server racks in that warehouse. So we are hoping that robot compute keeps progressing such that we can finally fit those things on the robots themselves, typically on a Jetson. And then the VLA: this is where compute is the most limiting today because we cannot really put it offboard. It still needs to run fairly fast, let's say 10 times per second, with minimal delay. So it needs to be on board. And they also typically use diffusion, which means that you don't infer it just once. At least part of the network is run for multiple steps. So this is where compute is the most critical. The onboard compute of the robot is the most critical.
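The three-level hierarchy described above runs its models at very different rates. A minimal scheduling sketch follows, using the 50 Hz tracker and roughly 10 Hz VLA rates mentioned in the conversation; the one-second VLM period is an assumption (no rate was given for it), and the function name, the placeholder strings, and the single-threaded loop are illustrative only.

```python
def run_hierarchy(duration_s=1.0, tracker_hz=50, vla_hz=10, vlm_period_s=1.0):
    """Count how often each level runs when ticked at the tracker rate.

    tracker_hz and vla_hz follow the rates mentioned in the conversation;
    vlm_period_s is an assumed value for illustration.
    """
    steps = int(round(duration_s * tracker_hz))
    vla_every = tracker_hz // vla_hz                 # tracker ticks per VLA call
    vlm_every = int(round(vlm_period_s * tracker_hz))  # tracker ticks per VLM call

    calls = {"vlm": 0, "vla": 0, "tracker": 0}
    subtask = None
    plan = None
    for step in range(steps):
        if step % vlm_every == 0:
            # Slow, large model (possibly offboard): abstract task -> clear subtask.
            subtask = f"subtask@{step}"
            calls["vlm"] += 1
        if step % vla_every == 0:
            # Mid-rate onboard model: subtask -> short-horizon motion plan.
            plan = (subtask, step)
            calls["vla"] += 1
        # Fast, small model: motion plan -> motor commands (placeholder only).
        _motor_command = plan
        calls["tracker"] += 1
    return calls
```

With the defaults, one simulated second yields 1 VLM call, 10 VLA calls, and 50 tracker calls, which makes the compute asymmetry between the levels concrete.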

39:06

Speaker A

Yeah, I hadn't thought about, in the case of the 1X that we were talking about and these other robots in the home, that essentially their brains are in the cloud. I assumed that somehow we were able to get these models small enough to run locally, which seemed like a lot. But just from a latency perspective, I don't know, it's hard to imagine that being particularly tenable, and consistent as well.

42:04

Speaker B

So some of these models, just to clarify, some of these models can fit on the robot, but what we're seeing today is if you want to do some of this more abstract reasoning, so you give it a very abstract task and then it has to orchestrate something for multiple minutes there, you really benefit from larger models.

42:37

Speaker A

Yeah. And I would imagine that you would want even more abstract models in the home for consumer task than you would require in an industrial setting. Is that true?

42:58

Speaker B

Yes, I would say that's true. Because in an industrial setting, if the task is repetitive, you can more or less pre compute those very abstract instructions or go from very abstract instructions to clear instructions. In a home, if you have a human just telling you something for sure, you need the scale of large models.

43:15

Speaker A

Are you using off the shelf RL environments or is part of what you're creating, you know, the simulation environment for creating these models?

43:35

Speaker B

It is a big step of what we're creating. So we are not building simulators ourselves. We're using existing simulators, including ones from Nvidia. We also test and experiment with many others. But one of our key know how is how to properly build those simulation environments. And we have our own custom RL algorithms on top to, to benefit from that as much as possible.

43:44

Speaker A

Got it. So you, the, the simulator and the simulation environment are distinct. Is the simulation environment when you say that, is that the configuration of the environment in the simulator, like the simulator is like the platform and the simulation environment is like the thing that you create about your scenario, or is there exactly. Is there just those two levels of abstraction, or are there three levels of abstraction? I guess is.

44:09

Speaker B

I guess you would add the RL algorithm as a third component that interacts with both, but that's it. The simulator itself is basically a physics engine and a renderer. And then you have to put a robot in there. If you want it to walk on stairs, you have to create stairs. But you cannot just ask the robot to randomly figure out how to walk on very complex stairs. So you have to create a whole, what we call a curriculum of difficulty. So you would start with very small stairs and progressively make them harder. And the same is true for all sorts of tasks. When we're training a robot to open a door, we have to create a simulated version of the door. Then we have to help the robot. We have to figure out exactly all these training processes on top of just the scenario itself.
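A curriculum of difficulty like the one described above can be sketched as a level that moves up when the policy succeeds often and down when it struggles. The thresholds, the number of levels, and the stair heights below are assumptions chosen for illustration, not values from the conversation.

```python
def update_stair_level(level, success_rate, max_level=10,
                       promote_at=0.8, demote_at=0.4):
    """Move one difficulty level up or down based on recent success rate."""
    if success_rate >= promote_at:
        return min(level + 1, max_level)
    if success_rate <= demote_at:
        return max(level - 1, 0)
    return level


def stair_height_m(level, min_height=0.02, step=0.02):
    """Map a curriculum level to a stair height in meters (illustrative values)."""
    return min_height + step * level
```

In a training loop, `update_stair_level` would be called per environment after each batch of episodes, so robots that already handle small stairs are moved to taller ones while struggling ones get easier terrain.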

44:37

Speaker A

And so, as we've talked about this, you've kind of positioned RL and imitation as these two alternatives. But is it also possible to use imitation in conjunction with RL to kind of bootstrap learning and like, you know, help the robot get over these, you know, you know, figure out stairs more quickly? Like, is that still a research problem or is that something that, you know, we're able to do in practice now?

45:25

Speaker B

That's a good question. It's still a research problem, but we are seeing good signs of life of it, I would say. I can talk about two different ways to combine them. One way is to use a few demonstrations to help the RL process. And this is something we're doing very actively. So if you have a human doing the task, you can extract just a little bit of information from that to help the exploration process of the reinforcement learning, such that the robot is not just randomly shaking and trying to figure out everything from scratch, but you're guiding it a little bit towards the right solution. And a completely different way to approach imitation learning plus RL is what we're seeing a bit more in other companies and in academia, which is doing more imitation learning for pre-training and then adding some flavor of RL on top to try to improve the behavior after the fact. A big question there is, if the imitation learning pre-training was done without any simulator, do you suddenly need to add a simulation again to do the fine-tuning or not? And I think this is still a very, very open question.

45:59

Speaker A

You know, when I think back to some of the earliest conversations I had on robotics, these are folks like Pieter Abbeel. And I think this was pre-VLM stuff and maybe even pre-transformer, I don't remember. But some of that earliest work was like, they'd have hundreds of robots, real robots at Google, and they were starting to experiment with RL, I believe. The robots were RL-ing and it was very expensive. You needed to have time on a hundred robots, but they didn't have to deal with this sim-to-real gap. You're more focused on simulation and other ways to close that gap. But there are still proponents of RL in real life. How do you think about comparing and contrasting those approaches?

47:15

Speaker B

There are definitely people who are big proponents of RL in real life, or another way to put it, who are against simulation at all. In some cases there are good reasons for that, because some things are fundamentally hard to simulate. I can talk about a specific task we're focusing on, which involves the robot manipulating cardboard boxes. The robot has to walk, pick up the box, bring it somewhere, put it on a table, open it, take what is inside out of the box and put it on a shelf, for example. In that case, if you break it down into subtasks, most of them are very well simulatable, except one very specific piece of it, which is opening the box. Because you can imagine there is maybe tape on the box. You have to take a knife and cut through the tape to be able to open it. And it's possible, but it's still a lot of effort to simulate the interaction of the tape with the cardboard and exactly how the knife cuts through it. So what we are trying to do there is identify those specific cases where simulation is still limited and use real data, but only for those very specific cases, and then mix it with simulated data of everything else. And we think this is how we basically get the best of both worlds, or get as much simulation as possible. And as simulators develop, they'll take more and more of the whole set of tasks. But while there is a gap, we'll use real data for those specific cases.

48:18

Speaker A

You know, for the folks that say that, you know, RL in real is better, you know, in what ways would it be better for that particular scenario? Sounds like you would just go through a lot of boxes, but you still, like, unless you're giving your robot A utility knife. Like, it seems like it's the same problem in real. Right.

49:57

Speaker B

I agree with you. I think it is much harder to do it in real life compared to simulation, especially with reinforcement learning. What you could do is do a little bit of imitation learning for that specific case, and that's much easier than letting the robot learn everything from scratch. I would guess the only argument for pure real-life RL would be that you don't need to deal with simulation, which can be very, very hard, especially if you don't have expertise in house in terms of how to create those simulated environments and how to tune the simulators to behave nicely. When everything is in real life, well, you already have the perfect simulator in a way. But on the other hand, you have very expensive hardware, and a failure followed by a reset is way more expensive than in simulation.

50:18

Speaker A

Yeah, I think it's clear why it's compelling and also why it is aspirational. The idea that as humans, we don't simulate the world to learn things; we explore in the world and we learn that way. And so I'd want my robot to be able to do that, but we're nowhere near the sample efficiency in robots that we have in humans. So you would end up breaking a lot of robots and/or boxes to get there, and then you've only solved one task.

51:10

Speaker B

I think another interesting point is that the human reward signal is extremely complicated. If you think like, if you're doing some tasks with your hands, the amount of information you're getting from all the nerves and all your skin, also from your muscles that are tired, et cetera, is extremely complicated. And we don't have that information at all with a robot. Typically, tactile sensing is very primitive, so if the robot is slowly damaging itself, you wouldn't know until a motor breaks. So if you're doing reinforcement learning in real life, getting rid of behaviors that would damage the motors simply won't work because you'll get, I don't know, maybe one event per week. The reward is not there. And this is where simulation helps once again, because in simulation, we have perfect information about everything. We can design reward functions that will avoid breaking motors, damaging the mechanics of the robot, for example.

51:49

Speaker A

And, you know, based on your earlier point about incorporating vision, reducing performance or making it more difficult to converge on a model, you know, you, you know, one approach to that is, oh, let's just add skin or let's add, you know, additional sensors. But, you know, all the, any Additional sensors or each additional sensor rather increases the burden, the computational burden on these models.

52:49

Speaker B

Absolutely. Plus, mechanically, it makes everything more brittle. So with cameras, we're, I would say we are there today. We can add cameras to our robots. They're very cheap, they're very reliable. But tactile sensing is just not there today.

53:15

Speaker A

You know, in thinking about, you know, crafting that reward function, like, you know, talk a little bit about that process that is, you know, key in any type of RL is figuring out what that objective function is. A reward function are, is it, you know, how, how standardized are they for a given task or do they vary, you know, very widely and require a lot of hand tuning? And maybe as a secondary question on the, the language side or in like coding agents, there's a lot of talk now about trying to incorporate value functions that, you know, allow, you know, that provide signal for of, you know, positive behavior before the end objective. And I'm curious if value functions are a practical thing in robotics today as well, or a conversation, you know, research. Are they? You know, is it research or is that something that we're using today?

53:31

Speaker B

Absolutely. Value functions are part of the RL algorithm itself. So, for example, we're mostly using some variant of PPO, which is an actor-critic algorithm, which means that you're training both an actor and a critic. The critic is basically a value function. It's not used at deployment. It's only used to help the training of the actor during the training process. Now we're seeing some research into how it could actually be used even at deployment. I think this is more on the research side. It's not proven yet. And then about reward tuning, it is a big topic. We have 35 people in the company and quite a few of these people still spend hours tuning rewards. And typically this is obviously referred to in a negative way. So you don't want people tuning rewards. But I think you have to make a distinction between two types of tuning. There are general rewards that just simply come from the task itself. So if we think again about locomotion, how robots are walking, you would say the task is just, you know, go from point A to point B, but in reality it's a bit more complicated. You want the robot to go from point A to point B. You don't want it to use too much energy to do that. You don't want it to hit the ground too hard. You probably don't want it to slip everywhere. You don't want the arms doing completely crazy motions around it. So by the time you describe, even in text, what you actually want, you already have, I don't know, maybe 15 lines. And so that translates to 15 different reward functions that you have to come up with and tune. And my personal take is that that part of tuning is fine. There is another kind of tuning that tends to happen a lot in RL, which is related to exploration. Once you've described the perfect task that you want, how do you guide the policy, the training process, towards that?
For example, if you're a robot that opens the door, you might need to tell it, put your hand close to the handle, then close your, your fingers, then pull on the door. And these are really things that don't scale across tasks. And this is something we're trying to avoid as much as possible. And this is why we're working on other techniques where we can use one or two demonstrations from a human to help the learning process instead of all these manually tuned reward functions.
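The task-level rewards described above (make progress, don't waste energy, don't hit the ground hard, don't slip, keep the arms calm) are commonly combined as a weighted sum of terms. The sketch below shows five such terms; the term definitions, state keys, and weights are all assumptions for illustration and would need tuning on a real system.

```python
# Illustrative weights (assumptions): positive rewards progress, negatives penalize.
WEIGHTS = {
    "progress": 1.0,
    "energy": -0.001,
    "impact": -0.01,
    "slip": -0.1,
    "arm_motion": -0.05,
}


def locomotion_reward(state, weights=WEIGHTS):
    """Return (total reward, per-term values) for one simulation step."""
    terms = {
        # Forward speed toward the goal (m/s).
        "progress": state["forward_velocity"],
        # Mechanical power proxy: sum of |torque * joint velocity|.
        "energy": sum(abs(t * w) for t, w in
                      zip(state["joint_torques"], state["joint_velocities"])),
        # Foot contact force above a threshold (N).
        "impact": max(0.0, state["foot_force"] - state["impact_threshold"]),
        # Speed of a foot while it is in contact with the ground (m/s).
        "slip": state["foot_slip_speed"],
        # Squared arm joint velocities, to discourage wild arm motion.
        "arm_motion": sum(v * v for v in state["arm_joint_velocities"]),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

Returning the per-term values alongside the total is a deliberate choice: when a policy behaves oddly, logging each term separately is usually the quickest way to see which one is dominating.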

54:37

Speaker A

You know, we're at the end of the year and this is kind of the natural time for folks to make predictions. Do you have any predictions for the upcoming year or, you know, several years, whatever horizon you'd like to offer in terms of robotics, like, how do you think about the future?

57:12

Speaker B

You know, I said before that I think there isn't a single humanoid robot providing value in the world today. My prediction is that will change around the end of next year, maybe beginning of 2027. Again, it's a prediction, hard to say exactly what's going to happen. But I don't think we'll get this, you know, ChatGPT moment where suddenly you have a billion robots everywhere. Just because you need to build the hardware, it doesn't scale like getting access to ChatGPT. Right. But I would predict that around the end of next year, we'll start seeing robots doing actual work. It will be just a few here and there. And then in 2027, 2028, we'll scale both the numbers of robots per task, but also the set of different tasks that these robots can do, which means that in the coming years we'll go from, I don't know, hundreds of robots to thousands, and then very quickly to millions, tens of millions, et cetera.

57:29

Speaker A

And presumably you see that happening first in industrial settings and then consumer.

58:28

Speaker B

This is my prediction. Yes, that we'll see that first in industrial, then consumers at home. And I hope that after that we can go to more, to crazier applications, like we should really Send humanoid robots to Mars to build colonies before, before humans land there.

58:35

Speaker A

When you think about the currently available robots that you have, you know, seen and, or worked with, like, what, are they all kind of the same? Like, are all the, you know, the dog's the same, all the humanoids are roughly the same? Or do you see, you know, big differences between them from a hardware perspective? And if so, like, are there, you know, ones that are particularly exciting for you?

58:54

Speaker B

Now, that's a great question. There are probably three or four different strategies you can take in terms of how you're designing your humanoid robot, specifically, what kind of actuators you use, what kind of gearboxes. And since there are many companies exploring that space, everything is happening in parallel. No matter which of those strategies you take, there are a few companies in the US, maybe one in Europe, and probably 50 in China building that exact thing. So the competition is really fierce. One part where hardware is not there yet today is on the end effectors, on the hands. It's still debatable if you actually need very dexterous hands. I think one of the big reasons why many companies develop hands with high dexterity, so like more than 20 degrees of freedom in a hand, is because they're using imitation learning. They're imitating humans, which means that you need to be able to imitate everything a human does. Once you go the RL route, you can learn other kinds of behaviors with much simpler grippers. But the debate is still open on that.

59:24

Speaker A

Again, we've been talking about, you know, dogs and humanoids, but there's this broader question, which is, is humanoid the best form factor for a robot? Like, you know, should we be making robots with two arms, two legs? Do you think that that's the way to go? Or is, are we kind of anchored on this because it's our form, but there are, you know, better forms that you've seen or, or, you know, think about.

1:00:35

Speaker B

It's another great question. So I've worked on more than 25 robots, I think, by now. So any number of legs and arms you can imagine, from zero to probably four, five, six legs. And there is room for all sorts of robots in the world. Honestly, I mostly use, and as a company we use, the word humanoid for the lack of a better word. What we mean by that is not the human form factor, but human capabilities. So very basically we want robots that can go where humans go and can manipulate their environments in a similar way to how humans do that. So probably you would need at least two arms with some sort of end effector to interact with the environment. And then whether you have legs or wheels, both are fine, both have their own applications. So you can go a long way, especially in industry, with a wheeled platform. And we're working with robots like that as well. Having said that, it's surprising how quickly wheeled platforms get stuck. A very common thing that is easy to imagine is if the floor is not perfectly flat. If you have cables or, of course, stairs, your wheeled platform is stuck. But another very important part is also the footprint. With those wheeled platforms you have two choices. Either you make them very large and then they're stable by default because they have a very large footprint, but they don't fit through tight spaces anymore. And very quickly, especially in slightly older industrial settings, you have tight spaces.

1:01:04

Speaker A

And the alternative is maybe some gyroscopic Segway like thing.

1:02:45

Speaker B

That could be one. Typically what we see is that you just have a small platform, which means that you have to be very, very careful how you move the torso on top because if you lean too far, it just falls over. I haven't seen the gyroscopic platform yet.

1:02:51

Speaker A

Okay.

1:03:03

Speaker B

I still think a three legged robot is probably the coolest robot I've seen so far, but it's maybe not the most applicable for industrial tasks on Earth.

1:03:04

Speaker A

And then, you know, maybe, maybe one more question. There are, you know, quite a few now like robotics kits or you know, robots are getting more accessible for folks that are interested in the space and want to play but you know, don't have access to a humanoid robot like, or don't have any robots, like, you know, what, what are some cool things that someone can, you know, order now, you know, maybe get by the holidays or soon thereafter, meaning not like Pre order for 2027 and start playing around. Like what? You know, if you were, you know, advising someone who is like excited about getting their hands dirty, like you know, what would you tell them to start doing?

1:03:15

Speaker B

There's an amazing community around Hugging Face and their LeRobot project, where they have very cheap robot arms and they can help you learn about the whole teleoperation, data collection, training and deployment pipeline with those arms. So that's a really good way to learn about that. For the more locomotion and reinforcement learning aspect, it's a little bit harder because you probably want a robot with legs, which also means that the robot should be able to fall and stand up without completely breaking. I think the best bet there is the Chinese quadrupeds that are really getting fairly cheap. It's still multiple thousands of dollars, but let's say it's affordable for a university or a school, or if you really want to go much deeper into that at home as well.

1:03:54

Speaker A

And you get your quadruped and you unbox it, like what can you do with it? Or where do you start with, you know, trying to, to do, you know, trying to do some experiments with it.

1:04:50

Speaker B

So when you unbox it, it typically can already do quite a lot. So it will be able to walk. In some cases, they even have things like SLAM pipelines. It can do some navigation, can avoid obstacles, things like that. But then the challenge is that you want to get rid of all that software and basically recreate it from scratch. And there again there are many communities online. There are many GitHub repos that help you get started. The Unitree Go2 is probably the most standard platform. So I would start there. And there are people who have open-sourced everything already, from training code to deploying these policies on those robots.

1:05:06

Speaker A

Okay, cool. Awesome. Well, Nikita, thanks so much for jumping on and sharing a bit about what you're up to. Very cool stuff.

1:05:49

Speaker B

Thank you so much. Really enjoyed this.

1:05:58

Speaker A

Thank you.

1:06:00