Latent Space: The AI Engineer Podcast

NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

84 min
Mar 10, 2026
Summary

NVIDIA engineers Nader Khalil and Kyle Kranen discuss their acquisition journey from Brev to NVIDIA, the development of Dynamo for data center-scale AI inference, and the future of AI agents in production. They explore technical challenges in scaling inference systems, agent security considerations, and NVIDIA's internal culture of innovation.

Insights
  • Agent security requires limiting an agent to at most two of three capabilities (file access, internet access, code execution) to prevent vulnerabilities
  • Disaggregation of prefill and decode phases in inference allows for better resource optimization and scaling at data center level
  • NVIDIA's 'Speed of Light' (SOL) principle forces teams to understand theoretical limits before accepting timeline constraints
  • The shift from single model inference to system-of-models architecture is becoming the dominant pattern for AI applications
  • Context length scaling faces fundamental physics limitations that will require architectural breakthroughs rather than incremental improvements
Trends
  • Disaggregated inference architecture separating prefill and decode operations
  • Multi-agent systems with specialized sub-agents for different tasks
  • Hardware-model co-design for optimized inference performance
  • CLI-first interfaces for AI agent interactions
  • Always-on autonomous agents running for extended periods
  • Test-time scaling becoming more important than model size
  • Agent workflows breaking out of coding into broader business applications
  • Local-cloud hybrid inference deployments
  • Context-aware caching for improved inference efficiency
  • Security-first agent deployment strategies
Companies
NVIDIA
Primary focus - discussed acquisition strategy, Dynamo development, and internal AI deployment
Brev
GPU provisioning startup acquired by NVIDIA, now integrated as developer experience platform
OpenAI
Referenced for model capabilities and enterprise deployment considerations
Anthropic
Mentioned for Claude and coding agent autonomy metrics
Meta
Discussed for recommendation systems and Llama model training approaches
Google
Referenced for research papers and model architectures
Amazon
Mentioned for using Dynamo in generative recommendation systems
Mercedes
Partnership example for NVIDIA's autonomous driving technology
ServiceNow
Used NVIDIA's Nemotron dataset to train their own models
Cursor
AI coding tool widely adopted internally at NVIDIA
People
Nader Khalil
Director of Developer Experience at NVIDIA, former Brev co-founder
Kyle Kranen
Engineering leader and architect of Dynamo at NVIDIA
Jensen Huang
NVIDIA CEO, referenced for Speed of Light methodology and company culture
Leopold Aschenbrenner
Referenced for 'unhobbler' concept in AI scaling limitations
Bryan Catanzaro
NVIDIA executive who taught about choosing your own path within the company
Quotes
"You really only let an agent do two of those three things. If you can access your files and you can write custom code, you don't want Internet access because that's a vulnerability."
Nader Khalil (Opening)
"SOL is essentially like, what is the physics? The speed of light moves at a certain speed. So if light's moving some slower, then you know something's in the way."
Nader Khalil (Mid-episode)
"This is the year system as model. Where instead of having a single model be a thing, you have a system of models and components working together."
Kyle Kranen (Late episode)
"We're completely happy investing in $0 billion markets. We don't care if this creates revenue. It's important for us to know about this market."
Kyle Kranen (Mid-episode)
Full Transcript
4 Speakers
Speaker A

Agents can do three things. They can access your files, they can access the Internet, and then now they can write custom code and execute it. You really only let an agent do two of those three things. If you can access your files and you can write custom code, you don't want Internet access because that's one. It's a vulnerability. Right. If you have access to Internet and your file system, you should know the full scope of what that agent's capable of doing. Otherwise, malware can get injected or something that can happen. And so that's a lot of what we've been thinking about is like, you know, how do we both enable this because it's clearly the future. But then also, you know, what are these enforcement points that we can start to, like, protect?
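The "two of three" rule described above is easy to state as a policy check. Below is a minimal sketch in Python; the class and capability names are hypothetical and are not any NVIDIA or Brev API.

```python
# Hypothetical sketch of the "two of three" agent-capability rule discussed above.
# None of these names correspond to a real NVIDIA/Brev API.
from dataclasses import dataclass

CAPABILITIES = {"file_access", "internet_access", "code_execution"}

@dataclass(frozen=True)
class AgentPolicy:
    granted: frozenset

    def __post_init__(self):
        unknown = self.granted - CAPABILITIES
        if unknown:
            raise ValueError(f"unknown capabilities: {unknown}")
        # The core rule: an agent may hold at most two of the three capabilities.
        if len(self.granted) > 2:
            raise ValueError(
                "refusing to grant file access, internet access, and code "
                "execution simultaneously; drop one to bound the blast radius"
            )

# A coding agent that edits local files and runs code, but is kept offline:
offline_coder = AgentPolicy(frozenset({"file_access", "code_execution"}))

# This combination is rejected at construction time:
try:
    AgentPolicy(frozenset(CAPABILITIES))
except ValueError as e:
    print("blocked:", e)
```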

0:00

Speaker B

All right, welcome to the Latent Space podcast in the Chroma Studio. Welcome to all the guests here. We're back with our guest host, Vibhu. Welcome. Good to have you back. And our friends Nader and Kyle from Nvidia.

0:38

Speaker A

Welcome.

0:49

Speaker C

Yeah, thanks for having us.

0:50

Speaker A

Yeah, thank you.

0:51

Speaker B

Actually, I don't even know your titles. I know you're like architect, something of Dynamo.

0:51

Speaker C

Yeah, I'm one of the engineering leaders and architects of Dynamo.

0:57

Speaker B

And you're director of something. Developers. Yeah, you're the developers. Developers. Developers guy at Nvidia.

1:01

Speaker A

Open source, agent marketing, Brev and dev tools and stuff is the focus.

1:07

Speaker B

And we're kind of recording this ahead of Nvidia GTC, which is coming to town again, taking over town, which we'll all be at, and we'll talk a little bit about your sessions and stuff.

1:11

Speaker A

Yeah, we're super excited for it.

1:23

Speaker B

One of my favorite memories of Nader: you always do, like, marketing stunts. And while you were Brev, you had this surfboard that you went down to GTC with, and Nvidia apparently liked it so much that they bought you. What was that like?

1:24

Speaker A

Yeah, yeah. Our logo was a shaka. We were always just kind of like, trying to keep true to who we were. I think so much of startups, you're trying to pretend that you're a bigger, more mature company than you are. And it was actually Evan Conrad of SF Compute, who was just like, you guys are really amazing. Yeah, he was just like, guys, you're two dudes in a room. Why are you pretending that you're not? And so then we were like, okay, let's make the logo a shaka. We brought surfboards to our booth at GTC and the energy was great. Some palm trees too.

1:40

Speaker C

They actually poked out over, like, the walls. So you could see the Brev booth and no one else, just from very far away.

2:09

Speaker B

Oh, so you remember it back then?

2:16

Speaker C

I remember it pre acquisition.

2:18

Speaker A

I was like, oh, those guys are cool, dude. That makes sense because we signed up really last minute. And so we had the last booth. It was all the way in the corner. And so I was, I was worried that no one was going to come. So that's why we had like the palm trees. We really came in with the surfboards. We even had one of our investors bring her dog and then she was just like walking the dog around to try to like bring energy towards our booth. Yeah, Steph, Yeah, yeah, she's the best.

2:19

Speaker B

You know, as a conference organizer, I love that.

2:42

Speaker C

Right?

2:43

Speaker B

Like, it's like everyone who sponsors a conference comes, does their booth. They're like, we are changing the future of AI or something. Some generic bullshit. And like, no, like actually try to stand out, make it fun. Right. And people still remember after three years.

2:44

Speaker A

Yeah, yeah. You know what's so funny? I'll give you this clip if you want, if you want to add it in. But my wife, at the time fiance, was in medical school and she came to help us because it was like a big moment for us. And so we bought this Cricut. It's like a vinyl printer. Because like, how else are we going to label the surfboard? So we got a surfboard, luckily was able to purchase that on the company card. We got a Cricut. And it was just like "fine tuning for enterprises" or something like that that we put on the surfboard. And it's 1am the day before we go to GTC. She's helping me put these vinyl stickers on. And she goes, you son of. She's like, if you pull this off, you son of a bitch. And pretty much after the acquisition, I stitched that within the news of the acquisition. I sent it to our family group chat.

2:55

Speaker B

Well, she made a good choice there. Was that basically the origin story for Launchables? Maybe we should explain what Brev is.

3:38

Speaker A

Yeah, Brev is just a developer tool that makes it really easy to get a GPU. So we connect a bunch of different GPU sources. So the basics of it is how quickly can we SSH you into a GPU? And whenever we would talk to users, they wanted a GPU, they wanted an A100. And if you go to any cloud provisioning page, usually it's like three pages of forms, or in the form somewhere there's a dropdown. And in the dropdown, there's some weird code that you know to translate to an A100. And I remember just thinking, every time someone says they want an A100, the piece of text that they're telling me that they want is stuffed away in the corner. And so we're like, what if the biggest piece of text was what the user's asking for? And so when you go to Brev, it's just big GPU chips with the

3:44

Speaker B

type of beautiful animations that you worked on pre-AI. Like, now you can just prompt it, but back in the day, artisanal code,

4:21

Speaker A

I was actually really proud of that because it was. I made it in Figma.

4:31

Speaker B

Yeah.

4:34

Speaker A

And then I found I was really struggling to figure out how to turn it from Figma to React. So what it actually is is just an SVG and I have all the styles. And so when you change the chip, whether it's like active or not, it changes the SVG code. And that somehow renders, like, looks like it's animating, but we just had the transition slow. But it's just a JavaScript function to change the underlying SVG. That was how I ended up figuring out how to move it from Figma. But yeah, that's artisanal.

4:34

Speaker C

Speaking of marketing stunts though, he actually used those SVGs or kind of used those SVGs to make these cards. Oh yeah, like a GPU gift card that he handed out everywhere. That was actually my first impression of that.

5:00

Speaker B

Yeah, Yeah, I think I still have one of them.

5:14

Speaker C

They look great.

5:16

Speaker A

Yeah, I have a ton of them still actually in our garage, but just they don't have labels. We should honestly like bring, bring them back. But I found this old printing press here actually just around the corner on Venice. And it's a third generation San Francisco shop. And so I come in, an excited startup founder trying to like. And they just have this crazy old machinery and I'm in awe because the whole building is so physical. Like you're seeing these machines, they have like pedals to like move these saws and whatever. I don't know what this machinery is, but I saw all three generations. Like there's like the grandpa, the father and the son. The son was like around my age.

5:17

Speaker B

It's like a holy, holy trinity.

5:47

Speaker A

Yeah. It's funny because we. So I just took the same SVG and we just like printed it and it's foil printing. So they make a mold that's like an inverse of like the A100. And then they put the foil on it and then they press it into the paper. And I remember once we got them, he was like, hey, don't forget about us. You know, I guess like early Apple and Cisco's first business cards were all made there. And so he was like, yeah, we get like the startup businesses, but then as they mature they kind of go somewhere else. And so I actually, I think we were talking with marketing about like using them.

5:49

Speaker C

You should go back and make some cards.

6:15

Speaker B

Yeah, yeah, yeah, yeah. You know, I remember, you know, as a very, very small Brev investor, I was like, what? Why are we spending time doing these stunts for GPUs? Like, you know, I think as a typical cloud hardware person, you go into AWS, you pick like T5XXL, whatever, from a list and you look at the specs. Like, why animate this GPU? And I do think it just shows the level of care that goes throughout Brev and also Dynamo and Nvidia.

6:17

Speaker A

I think the thing that struck me most when we first came in was the amount of passion that everyone has. Whether you talk to Kyle or any VP that I've met, everyone at Nvidia goes so close to the metal. I remember it was almost a year ago and my VP asked me, he's like, hey, what's Cursor? And are you using it? And if so, why? And I'm just surprised at this. And he downloaded Cursor and he was asking me to help him use it, or just show him what, you know, why we were using it. And so the amount of care that I think everyone has, and the passion and appreciation for the moment. Right. This is a very unique time. So it's really cool to see everyone really appreciate that.

6:44

Speaker B

Yeah. One thing I wanted to do before we move over to sort of like research topics and the stuff that Kyle's working on is just tell the story of the acquisition. Right. Like not many people have been been through an acquisition with Nvidia. What's it like? Yeah, just anything you'd like to say?

7:19

Speaker A

It's a crazy experience. I think the thing that was the most exciting for us was our goal was just to make it easier for developers. We wanted to find access to GPUs, make it easier to do that. And then actually your question about launchables. So launchables was just make one click deploys for any software on top of the gpu. And so what we really liked About Nvidia was that it felt like we just got a lot more resources to do all of that. I think, you know, Nvidia's goal is to make things as easy for developers as possible. So there was a really nice like synergy there. I think, you know, when it comes to like an acquisition, I think the amount that the soul of the products align I think is going to be, is going to speak to the success of the acquisition.

7:34

Speaker B

Yeah.

8:13

Speaker A

So in many ways it feels like we're home. This is a really great outcome for us. Like, you know, I love brev.nvidia.com, you should use it.

8:14

Speaker C

It's a front page for GPUs. Yeah, you want GPUs, you go there

8:21

Speaker B

and it's like internally is growing very quickly. I remember you said some stats.

8:24

Speaker A

Yeah, yeah, yeah. I wish I had the exact numbers. But internally, externally it's been growing really quickly. We've been working with a bunch of partners, with a bunch of different customers and ISVs. If you have a solution that runs on a GPU and you want people to use it quickly, we can bundle it up in a launchable and make it a one-click run. If you're doing things and you want just like a sandbox or something to run on. Right. Like OpenClaw, huge moment, super exciting and we'll get into it more. But you know, internally people want to run this and we know we have to be really careful from the security implications. Do we let this run on the corporate network? Security's guidance was, hey, run this on Brev. It's, you know, a VM, it's sitting in the cloud, it's off the corporate network, it's isolated. And so that's been our stance internally and externally about how to even run something like OpenClaw while we figure out how to run these things securely.

8:28

Speaker B

But yeah, I think also you were almost the right team at the right time, when Nvidia is starting to invest a lot more in developer experience, or whatever you call it, UX, I don't know what you call it. Like software. Obviously Nvidia has always invested in software, but this is a different audience, it's a wider developer base.

9:13

Speaker C

Yeah, right.

9:33

Speaker A

Yeah, yeah. You know it's funny, it's like it's

9:34

Speaker B

not so like what is it called internally? What is this that people should be aware that is going on there?

9:36

Speaker C

Like developer yeah, yeah.

9:41

Speaker B

It's called developer experience. Or is there like a broader strategy here?

9:42

Speaker A

Nvidia always wants to make a good developer experience. The thing is, a lot of the technology is just really complicated. You know, I think the reason AI is having a huge moment is not because, let's say, data scientists who were quiet in 2018 are much louder now. The pie has grown; there's a whole bunch of new audiences. My mom's wondering what she can do with it. My sister's taught herself how to code. I actually think, just generally, AI is a big equalizer and you're seeing a more technologically literate society. Everyone's learning how to code; there isn't really an excuse not to. And so building a good UX means that you really understand who your end user is. And when your end user becomes such a wide variety of people, then you have to almost reinvent the practice and

9:46

Speaker C

actually build more developer ux.

10:31

Speaker B

Right.

10:34

Speaker C

Because there are tiers of developer base that were added. You know, the hackers that are building on top of OpenClaw, right, for example, have never used a GPU. They don't know what CUDA is. They just want to run something. You need new UX that is not just, hey, how do you program something in CUDA and run it? And then we built, like when deep learning was getting big, we built Torch. But recently the amount of layers that are added to that developer stack has just exploded because AI has become ubiquitous. Everyone's using it in different ways.

10:34

Speaker A

It's moving fast in every direction, vertical, horizontal.

11:05

Speaker D

You even take it down to hardware like the DGX Spark. You know, it's basically the same system as just throwing it up on big GPU clusters.

11:09

Speaker C

Yeah, yeah, it's a Blackwell.

11:15

Speaker B

Yeah. We saw the preview at last year's GTC and that was one of the better performing videos of our Nvidia coverage so far.

11:18

Speaker D

Awesome.

11:24

Speaker B

This will be the.

11:24

Speaker A

That was actually.

11:26

Speaker C

Fingers crossed. Yeah.

11:27

Speaker A

Even when Grace Blackwell or when DGX Spark was first coming out, getting to be involved in that from the beginning of the developer experience and it just

11:28

Speaker B

comes back, you were involved.

11:36

Speaker A

Yeah, yeah, yeah. I mean, it was just like I got an email. We just got thrown into the loop and suddenly. Yeah, it was actually really funny because I'm still pretty fresh from the acquisition and I'm getting an email from a bunch of the engineering VPs about the new hardware, GPU chip, or not chip, but just GPU system that we're putting out. And I'm like, okay, cool. Nader is now involved with this for the UX. I'm like, what am I going to do here? So I remember the first meeting. I was just kind of quiet as I was hearing engineering VPs talk about what this box could be, what it could do, how we should use it. And I remember one of the first ideas people were ideating, I think a quote was, the first thing someone's going to want to do with this is get two of them and run a Kubernetes cluster on top of them. And I was like, oh, I think I know why I'm here. I was like, the first thing we're doing is easy SSH into the machine, and then just kind of scoping it down from there. Once you can do that. The person who wants to run a Kubernetes cluster on 2 Sparks has a higher propensity for pain than someone who buys it and wants to run OpenClaw right now. Right. If you can make sure that that's as effortless as possible, then the rest becomes easy. So there's a tool called Nvidia Sync. It just makes the SSH connection really simple. So if you think about it, if you have a Mac or a PC or whatever, if you have a laptop and you buy this GPU and you want to use it, you should be able to use it like it's a GPU in the cloud.

11:37

Speaker B

Right.

12:56

Speaker A

But there's all this friction of, like, how do you actually get into that? That's part of Brev's value proposition, is just there's a CLI that wraps SSH and makes it simple. And so our goal is just get you into that machine really easily. And one thing we just launched at CES, it's still in, like, early access. We're ironing out some kinks, but it should be ready by GTC. You can register your Spark on Brev. And so now it's like a remote-managed local thing.

12:56

Speaker B

Yeah, because Brev can already manage other clouds anyway.

13:20

Speaker A

Right.

13:23

Speaker D

And you use the Spark on Brev as well, right?

13:23

Speaker A

Yeah, yeah, exactly. So you set it up at home, you can run a command on it, and then essentially it'll appear in your Brev account. Then you can take your laptop to a Starbucks or to a cafe and you can continue to use your Spark just like any other cloud node on Brev.

13:26

Speaker B

Yeah, yeah.

13:39

Speaker A

It's just like a pre-provisioned little data center in your home. Yeah, exactly.

13:40

Speaker B

Yeah, yeah.

13:44

Speaker D

Tiny little data center, tiny little size of your phone.

13:45

Speaker B

One more thing before we move on to Kyle. You just have so many Jensen stories and I just love mining Jensen stories. My favorite so far is SOL. What is SOL?

13:48

Speaker A

SOL is actually I think of all the lessons I've learned, that one's definitely my favorite.

13:58

Speaker C

It'll always stick with you.

14:02

Speaker A

Yeah, yeah. You know, when you're a startup, everything's existential, right? Like we've run out of money. We were at risk of missing payroll. We've had to contract our team because we ran out of money. And because of that you're really always forcing yourself to understand the root cause of everything. If you get a date, if you get a timeline, you know exactly why that date or timeline is there. You're pushing every boundary. And you're not just accepting a no just because. And so as you start to introduce more layers, as you start to become a much larger organization, SOL is essentially like, what is the physics? Right. The speed of light moves at a certain speed. So if light's moving slower, then you know something's in the way. So before trying to layer reality back in of like, why can't this be delivered at some date? Let's just understand the physics. What is the theoretical limit to how fast this can go? And then start to tell me why. Because otherwise people will start telling you why something can't be done. But actually I think any great leader's goal is just to create urgency.

14:03

Speaker C

There's an integrity, create compelling events.

14:59

Speaker A

Right.

15:01

Speaker C

SOL is a term at Nvidia used to instigate a compelling event. You say, this is done. How do we get there? What is the minimum, as much as necessary, as little as possible thing that it takes for us to get exactly here? And it helps you just break through a bunch of noise.

15:02

Speaker B

Yeah.

15:19

Speaker C

Instantly.

15:19

Speaker B

One thing I'm unclear about is can only Jensen use the SOL card? Like get the bullshit out. Because obviously it's Jensen. But can someone else be like.

15:19

Speaker C

No, frontline engineers use it?

15:28

Speaker A

I think it's not so much about get the bullshit out. It's like, give me the root understanding, right? If you tell me something takes three weeks, the first principles question is, why three weeks? What's the actual limit of why this is going to take three weeks? If, let's say, you wanted to buy a new computer and someone told you it's going to be here in five days, what's the SOL? Well, the SOL is, I could walk into a Best Buy and pick it up for you. Right? So then anything beyond that is. And is that practical? Is that how we're going to, you know, let's say, give everyone in the company a laptop? Like, obviously not. So then, that's the SOL. And then it's like, okay, well, if we have to get more than 10, suddenly there might be some. Right. And so now we can kind of piece the reality back.

15:31

Speaker B

So this is the Paul Graham "do things that don't scale." And this is also what people would now call being high agency.

16:08

Speaker C

It's actually really interesting because there's a second hardware angle to SOL that doesn't come up for all the Org. So SOL is used culturally at Nvidia for everything.

16:16

Speaker B

I'm also mining for. I think that can be annoying sometimes when someone keeps going SOL and you're like, guys, we have to be stable. We have to fucking plan.

16:25

Speaker A

It's an interesting balance. Yeah, I encountered that actually just with Alec, because we have a new conference, so we need to launch. We have, we have goals of what we want to launch by the conference. And like, yeah, at the end of

16:34

Speaker B

the day, is this GTC?

16:44

Speaker A

Well, this is like. So we, I mean, we did it for CES, we did it for GTC DC before that, we're doing it for GTC San Jose. So I mean, every, you know, we have a new moment and we want to launch something and we want to do so at SOL. And that does mean there's some level of prioritization that needs to happen. And so it is difficult. Right. I think you have to be careful with what you're pushing. You know, stability is important and that should be factored into SOL. SOL isn't just, like, build everything and let it break. You know, that's part of the conversation. So as you're layering in all the details, one of them might be, hey, we could build this, but then it's not going to be stable for XYZ reasons. And so that was one of our conversations for CES: you know, hey, we can get this into early access, registering your Spark with Brev. But there are a lot of things that we need to do in order to feel really comfortable from a security perspective. Right. There's a lot of networking involved before we deliver that to users. So it's like, okay, let's get this to a point where we can at least let people experiment with it. We had it in a booth, we had it in Jensen's keynote, and then let's go iron out all the networking kinks. And that's not easy. And so that can come later. And so that was the way that we layered that back in.

16:45

Speaker C

But it's not really about saying you don't have to do the maintenance or operational work. It's more that it kind of highlights how progress is incremental.

17:49

Speaker A

Right.

18:01

Speaker C

What is the minimum thing that we can get to? And then there's SOL for every component after that. But there's the SOL to get you to the starting line. And that's usually how it's asked. On the other side, SOL came out of hardware at Nvidia. So SOL is literally, if we ran the accelerator or the GPU at basically full speed with no other constraints, how fast we would be able to make a program go.

18:01

Speaker B

Yeah.

18:25

Speaker C

Right.

18:26

Speaker B

So in training, like, you know, then you work back to like some percentage of MFU, for example.

18:27

Speaker C

Yeah, that's a great example. So there's an SOL MFU and then there's, like, you know, what's practically achievable.
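For the hardware sense of SOL, a back-of-envelope version of the SOL-vs-achieved-MFU comparison looks like the sketch below. Every number is an assumption for illustration, not a datasheet or benchmark figure.

```python
# Illustrative back-of-envelope "speed of light" calculation for training throughput.
# All numbers are assumptions for the sketch, not official datasheet figures.

peak_flops_per_gpu = 1.0e15      # assumed peak dense FLOP/s for one accelerator
num_gpus = 8
model_params = 70e9              # 70B-parameter model

# Common approximation: ~6 FLOPs per parameter per trained token (forward + backward).
flops_per_token = 6 * model_params

# SOL: tokens/s if every FLOP of every GPU did useful model math.
sol_tokens_per_s = peak_flops_per_gpu * num_gpus / flops_per_token

measured_tokens_per_s = 12_000   # assumed measured throughput
mfu = measured_tokens_per_s / sol_tokens_per_s

print(f"SOL throughput: {sol_tokens_per_s:,.0f} tokens/s")
print(f"Measured:       {measured_tokens_per_s:,} tokens/s  (MFU = {mfu:.1%})")
```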

18:32

Speaker B

Cool. Shall we move on to sort of Kyle's side? Kyle, you're coming more from the data science world. And I mean, whenever I meet someone who's done work in tabular stuff, graph neural networks, time series: when I go to NeurIPS, when I go to ICML, I walk the back halls, there's always like a small group of graph people, a small group of tabular people, and there's no one there. And it's very. You know what I mean? It's important, interesting work if you care about solving the problems that they solve. But everyone else is just LLMs all the time.

18:38

Speaker C

Yeah, it's like the black hole, right? Has the event horizon reached this yet at NeurIPS?

19:13

Speaker B

But those are transformers too, and those are also interesting things. Anyway, I just wanted to spend a little bit of time on that background before we go into Dynamo proper.

19:18

Speaker C

Yeah, sure. I took a different path to Nvidia than Nader. I joined six years ago, seven if you count when I was an intern. So I joined Nvidia right out of college and the first thing I jumped into was not what I'd done during my internship, which was some stuff for autonomous vehicles, like heavyweight object detection. I jumped into something, I'm like, recommenders. This is popular.

19:29

Speaker B

Yeah. You did RecSys.

19:50

Speaker C

Yeah, RecSys. That was tabular data at the time, right? You have tables of audience qualities and item qualities and you're trying to figure out which member of the audience matches which item, or more practically which item matches which member of the audience. And at the time, really, we were trying to turn recommenders, which had historically been a little bit of a CPU-based workflow, into something that ran really well on GPUs, and it's since been done. There are a bunch of libraries for RecSys that run on GPUs. The common models, like the Deep Learning Recommendation Model, which came out of Meta, and the Wide & Deep model, which was released by Google, were greatly accelerated by GPUs, using the fast HBM on the chips especially to do vector lookups. But it was very interesting at the time and super, super relevant because we were starting to get this explosion of feeds and things that required recommenders to just actively be on all the time. And I sort of transitioned that a little bit towards graph neural networks when I discovered them, because I was like, okay, you can actually use graph neural networks to represent relationships between people, items, concepts. And that interested me. So I jumped into that at Nvidia and got really involved for like two-ish years.
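A minimal sketch of the tabular recommender shape described here: user and item features live in embedding tables (the lookups are the HBM-bandwidth-heavy part), and a cheap dot-product interaction scores candidates. Plain NumPy, purely illustrative; real systems use huge sharded tables and GPU kernels.

```python
# Minimal DLRM-flavored sketch: embedding lookups + dot-product interaction.
# Purely illustrative; not any production recommender.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 1_000, 5_000, 32

user_table = rng.normal(size=(n_users, dim)).astype(np.float32)
item_table = rng.normal(size=(n_items, dim)).astype(np.float32)

def score_items(user_id: int, candidate_ids: np.ndarray) -> np.ndarray:
    """Look up embeddings (the bandwidth-bound part) and score by dot product."""
    u = user_table[user_id]              # one row fetched from the user table
    items = item_table[candidate_ids]    # gather of candidate rows
    return items @ u                     # cheap interaction / ranking signal

candidates = rng.integers(0, n_items, size=100)
scores = score_items(user_id=42, candidate_ids=candidates)
top5 = candidates[np.argsort(scores)[::-1][:5]]
print("top-5 candidate item ids:", top5)
```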

19:51

Speaker B

And something I learned from Bryan Catanzaro is that you can just kind of choose your own path in Nvidia.

21:03

Speaker C

Oh my God.

21:08

Speaker A

Yeah.

21:09

Speaker B

Which is not a normal big corp thing. Like you have a lane, you stay in your lane.

21:09

Speaker A

I think probably the reason why I enjoy being in a big company, coming from a startup guy.

21:14

Speaker B

Yeah, the mission is the boss.

21:19

Speaker A

Yeah, it feels like a big game of pickup basketball. Like, you know, if you play one, if you want to play basketball, you just go up to the court and you're like, hey, we're going to play this game and we need three. And you just like find your three. That's honestly for every new initiative. That's what it feels like.

21:20

Speaker D

Yeah, it also shows, right? Like Nvidia is just releasing state of the art stuff in every domain. Like, okay, you expect foundation models with Nemotron; in voice, stuff just randomly pops up, Parakeet just comes out, another one the Nvidia voice team has always been producing. There's always a paper that comes out, a dataset that comes out, in every other domain. And I mean, it also stems back to what Nvidia has to do. Right. You have to make chips years before they're actually produced. Right. So you need to know. You need to really forecast.

21:32

Speaker C

Design process starts like three to five years before the chip gets to the market.

22:00

Speaker D

Yeah. I'm curious more about what that's like. Right? So like, you have specialist teams. Is it just like, you know, people find an interest, you go in, you go deep on whatever, and that kind of feeds back into, you know, okay, we expect these predictions? The internals at Nvidia must be crazy. Right. You know, even without selling to people, you have your own predictions of where things are going and they're very based, very grounded, right?

22:05

Speaker C

Yeah, it's really interesting. So there's like two things I think that Nvidia does which are quite interesting. One is we really index into passion. There's a big sort of organizational top-down push to ensure that people are working on the things that they're passionate about. So if someone proposes something that's interesting, many times they can just email someone way up the chain who would find this relevant and say, hey, can I go work on this?

22:29

Speaker A

That's actually. I worked at a big company for a couple of years before starting on my startup journey and it felt very weird if you were to email out of chain, if that makes sense. The emails at Nvidia are like mosh pits. It's just like 60 people just whatever.

22:51

Speaker B

And like a messy, like, reply-all.

23:06

Speaker A

Oh, it gets. It's insane. It's insane.

23:09

Speaker C

It does help, you know, manage the context.

23:11

Speaker A

But that's actually. So this is a weird thing where I used to be like, why would we send emails? We have Slack. Now I'm the exact opposite. I feel so bad for anyone who's messaging me on Slack because I'm so unresponsive.

23:14

Speaker B

You're email maxing.

23:24

Speaker A

I'm email maxing.

23:25

Speaker C

Email is a different.

23:26

Speaker B

Email is perfect because we can't work together. I'm stuck.

23:27

Speaker A

You know, it's great because important threads get bumped back up, right? Yeah. And Slack doesn't do that. So I just have this casino going off on the right or on the left and I don't know which thread was from where or what. But the threads get bumped back up, and then there's also just the subject, so you can have working threads. I think when you're small, if it's not 40,000 people, Slack will work fine. I don't know what the inflection point is, but there is going to be a point where that becomes really messy and you'll actually prefer having email, because you can have working threads. You can CC more than nine people in a thread.

23:31

Speaker C

You can fork stuff.

23:58

Speaker A

You can fork stuff, which is super nice. And just like. Yeah. And so. But that is part of where you can propose a plan. You can also just like start. Honestly, momentum is the only authority, right? So like, if you can just start to make a little bit of progress and show someone something and then they can try it, that's I think what's been, you know, I think the most effective way to push anything forward. And that's both at Nvidia and I think just generally.

23:59

Speaker B

Yeah.

24:19

Speaker C

There's the other concept that like, is explored a lot at Nvidia, which is this idea of a $0 billion business. Like market creation is a big thing

24:20

Speaker B

at Nvidia, or you want to go and start a $0 billion business?

24:27

Speaker C

Jensen says we're completely happy investing in $0 billion markets. We don't care if this creates revenue. It's important for us to know about this market. We think it will be important in the future. It can be $0 billion for a while. I'm probably mangling his words here, but I'll give an example. Nvidia has been working on autonomous driving for a long time.

24:31

Speaker B

Like an Nvidia car.

24:51

Speaker D

No, they've used the Mercedes. Right. They're around the HQ and I think it finally just got licensed out. Now they're starting to be used.

24:53

Speaker C

Yeah, yeah.

25:00

Speaker D

For 10 years you've been seeing Mercedes with Nvidia logos.

25:00

Speaker C

If you're in like south Santa Clara, yeah. So zero billion dollar markets are a thing, like, you know, Jensen,

25:05

Speaker B

I mean, okay, look, cars are not a zero billion dollar market, but yeah,

25:16

Speaker A

that's a bad example. I think he's messaging it's zero today. Or even like internally, right? Like an org doesn't have to ruthlessly find revenue very quickly to justify their existence. Right. The important research, a lot of the important technology being developed, that's kind of

25:19

Speaker C

where research is very ideologically free at Nvidia, like they can pursue things that they want. Was I research officially? I was never in research. Officially I was always in engineering. I'm in an org called Deep Learning Algorithms, which is basically just: how do we make things that are relevant to deep learning go fast?

25:35

Speaker B

That sounds freaking cool.

25:50

Speaker D

And I think a lot of that is underappreciated. Right? Like time series. This week Google put out TimesFM, a new time series paper. RecSys: semantic IDs started applying Transformers and LLMs to RecSys. And when you think of the scale of companies deploying these, right, Amazon recommendations, Google web search, it's huge scale and you want it fast.

25:51

Speaker C

Yeah. Actually there's a fun moment that brought me full circle. Amazon ads recently gave a talk where they talked about using Dynamo for generative recommendation, which was super weirdly cathartic for me. I'm like, oh my God, I've supplanted what I was working on. You're using LLMs now to do what I was doing five years ago.

26:10

Speaker B

Yeah. Amazing. Let's go right into Dynamo, maybe introduce it top down, and yeah.

26:32

Speaker C

I think at this point a lot of people are familiar with the term inference. Funnily enough, I went from inference being a really niche topic to being something that's discussed on normal people's Twitter feeds.

26:38

Speaker A

It's on billboards here.

26:48

Speaker C

Very strange driving and seeing just an inference ad on 101. Inference at scale is becoming a lot more important. We have these moments like OpenClaw where you have these agents that take lots and lots of tokens but produce incredible results. There are many different aspects of test time scaling, so that you can use more inference to generate a better result than if you were to use a short amount of inference. There's reasoning, there's requerying, there's adding agency to the model, allowing it to call tools and use skills. Dynamo sort of came about at Nvidia because myself and a couple others were sort of talking about these concepts: you have inference engines like vLLM, SGLang, TensorRT-LLM, and they sort of think about things as one single copy, like one replica, one version of the model. But when you're actually serving things at scale, you can't just scale up that replica, because you end up with performance problems. There's a scaling limit to scaling up replicas. So you actually have to scale out, to use maybe some Kubernetes terminology. We kind of realized that there was a lot of potential optimization that we could do in scaling out and building systems for data center scale inference. So Dynamo is this data center scale inference engine that sits on top of the frameworks like vLLM, SGLang and TensorRT-LLM and just makes things go faster, because you can leverage the economy of scale. The fact that you have KV cache, which we can define a little bit later, in all of these machines, that is unique and you want to figure out, like, the ways to maximize your cache hits. Or you want to employ new techniques in inference, like disaggregation, which Dynamo introduced to the world in March. Not introduced. It was an academic talk beforehand. But we're one of the first frameworks to start supporting it. And we want to combine all these techniques into sort of a modular framework that allows you to accelerate your inference at scale.
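One of the data-center-scale levers mentioned here is maximizing KV-cache hits across many workers. A toy sketch of the idea follows; it is not Dynamo's actual router, just prefix-affinity routing in a few lines.

```python
# Toy prefix-affinity router: send a request to the replica that already holds
# the longest matching cached prefix. Not Dynamo's implementation, just the idea.
from typing import List

def common_prefix_len(a: List[int], b: List[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Worker:
    def __init__(self, name: str):
        self.name = name
        self.cached_prefixes: List[List[int]] = []  # token-id prefixes held in KV cache

    def best_overlap(self, tokens: List[int]) -> int:
        return max((common_prefix_len(p, tokens) for p in self.cached_prefixes), default=0)

def route(workers: List[Worker], tokens: List[int]) -> Worker:
    # Prefer the worker with the most reusable KV cache for this prompt.
    return max(workers, key=lambda w: w.best_overlap(tokens))

w1, w2 = Worker("decode-0"), Worker("decode-1")
w1.cached_prefixes.append([1, 2, 3, 4, 5])     # e.g. a shared system prompt
w2.cached_prefixes.append([9, 9, 9])

req = [1, 2, 3, 4, 5, 6, 7]
print("routed to:", route([w1, w2], req).name)  # -> decode-0 (5-token overlap)
```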

26:50

Speaker A

By the way, Kyle and I became friends on my first day at Nvidia, and I always loved it because he always teaches me new things.

28:47

Speaker B

By the way, this is why I wanted to put two of you together. I was like, yeah, this is going to be good.

28:53

Speaker C

It's very different. We've talked to each other a bunch.

28:56

Speaker B

Actually.

28:59

Speaker C

You asked, why can't we scale up?

29:00

Speaker A

Yeah. You said model replicas.

29:01

Speaker C

Yeah. So scale up means assigning more.

29:03

Speaker B

Heavier.

29:06

Speaker C

Yeah, heavier. Like making things heavier. Adding more GPUs, adding more CPUs. Scale out is just like having a barrier, saying, I'm going to duplicate my representation of the model or representation of this microservice or something, and I'm going to replicate it many times to handle the load. And the reason that you can't scale up past some point is there are sort of hardware bounds and algorithmic bounds on that type of scaling. So I'll give you a good example that's very trivial. Let's say you're on an H100. The maximum NVLink domain for H100, for most DGX H100s, is 8 GPUs. Right. So if you scaled up past that, you're going to have to figure out ways to handle the fact that now, for the GPUs to communicate, you have to do it over InfiniBand, which is still very fast, but is not as fast as NVLink.

29:07

Speaker B

Is it like one order of magnitude? Like hundreds?

29:54

Speaker C

It's about an order of magnitude.

29:56

Speaker B

Not terrible.

29:59

Speaker C

Yeah. I need to remember the data sheet here. I think it's about 500 gigabytes a second unidirectional for NVLink and about 50 gigabytes a second unidirectional for InfiniBand. It depends on the generation.
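To get a feel for why the scale-up vs. scale-out boundary matters, here is a toy calculation of moving the same tensor at the two rough bandwidths mentioned in conversation; the figures are those conversational estimates, not datasheet values.

```python
# Order-of-magnitude sketch: time to move the same tensor over NVLink vs InfiniBand.
# Bandwidths are the rough figures mentioned above, not datasheet values.

def transfer_ms(bytes_to_move: float, bandwidth_gb_s: float) -> float:
    return bytes_to_move / (bandwidth_gb_s * 1e9) * 1e3

tensor_bytes = 256 * 1e6   # e.g. ~256 MB of activations exchanged per step

for link, bw in [("NVLink (~500 GB/s)", 500), ("InfiniBand (~50 GB/s)", 50)]:
    print(f"{link:>22}: {transfer_ms(tensor_bytes, bw):6.2f} ms per transfer")
```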

29:59

Speaker B

I just want to set this up for people who are not familiar with these kinds of layers and the transfer

30:17

Speaker D

speeds. Also, maybe even just going a few steps back before that: most people are very familiar with what you can use on your laptop, whatever, these SGLang, vLLM. You can just run inference.

30:23

Speaker C

You can run it on that laptop.

30:36

Speaker D

You can run it on a laptop. Then you get to, okay, models got pretty big, right? GLM5, they doubled the size. So what do you do when you have to go from, okay, I can get 128 gigs of memory, I can run it on a Spark, then you have to go multi-GPU. Okay, multi-GPU, there's some support there. Now, if I'm a company and I'm not hiring the best researchers for this, right, but I need to go multi-node, right, I have a lot of servers. Okay. Now there's efficiency problems, right? You can have multiple 8xH100 nodes. But, you know, how do you do that efficiently? Yeah.

30:37

Speaker C

How do you like represent them? How do you choose how to represent the model? Right. That's like a hard question everyone asks, how do you size? Oh, I want to run GLM5, which just came out.

31:09

Speaker A

New model.

31:18

Speaker C

There have been like four of them in the past week, by the way. Like a bunch of new models.

31:19

Speaker B

You know why, right? DeepSeek.

31:22

Speaker C

No comment. Yeah, but glm5, right. We, we have this new model.

31:24

Speaker A

It's.

31:29

Speaker C

It's of like a large size. And you have to figure out how to both scale up and scale out.

31:30

Speaker A

Right.

31:34

Speaker C

Because you have to find the right representation that you care about. Everyone does this differently. Let's be very clear. Everyone figures this out in their own path.
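The sizing question raised earlier ("this new large model just came out, how do I scale up and out?") can be approximated with a back-of-envelope calculation like the one below. All inputs are assumptions; real sizing also depends on the parallelism strategy, quantization, KV-cache budget, and engine overheads.

```python
# Back-of-envelope sizing: how many GPUs just to hold a model plus KV-cache headroom.
# All inputs are assumptions; this is not a real sizing tool.
import math

def min_gpus(params_b: float, bytes_per_param: float, gpu_mem_gb: float,
             kv_headroom: float = 0.3) -> int:
    weights_gb = params_b * bytes_per_param            # params in billions -> GB
    total_gb = weights_gb * (1 + kv_headroom)          # leave room for KV cache etc.
    return math.ceil(total_gb / gpu_mem_gb)

# Example: a hypothetical 400B-parameter model in FP8 on 80 GB GPUs.
print(min_gpus(params_b=400, bytes_per_param=1.0, gpu_mem_gb=80))   # -> 7, so an 8-GPU node
# The same model in BF16 needs roughly twice that, i.e. multi-node.
print(min_gpus(params_b=400, bytes_per_param=2.0, gpu_mem_gb=80))   # -> 13
```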

31:34

Speaker A

I feel like a lot of AI or ML even is like, is like this. I think people think, you know, there was some tweet a few months ago that was like, why hasn't fine tuning as a service taken off?

31:40

Speaker B

And you know, that might be me,

31:48

Speaker A

it might have been you. Yeah. But people want it to be such an easy recipe to follow. But even like if you look at an ML model specific to you. Yeah, yeah.

31:51

Speaker C

And the model.

31:59

Speaker A

And there's so much, there's so much tinkering. Like when you see a model that has however many experts in the MoE model, it's like, why that many experts? The person tried a bunch of things and that one seemed to do better. And I think when it comes to how you're serving inference, you have a bunch of decisions to make. And you can always argue that you can take something and make it more optimal. But I think there's this internal calibration and appetite for continued calibration.

32:00

Speaker C

Yeah.

32:20

Speaker D

And that doesn't mean people aren't taking a shot at this, like Tinker from Thinking Machines, RL as a service. It also gets even harder when you try to do big model training.

32:21

Speaker B

Right.

32:30

Speaker D

We're not the best at training MoEs when they're pre-trained, like we saw this with Llama 3, right? They're trained in such a sparse way that Meta knows there's going to be a bunch of inference done on these, right? They'll open source it, but it's very much trained for what Meta's infrastructure wants, right? They want to inference it a lot. Now the question to basically think about is, okay, say you want to serve a chat application, a coding copilot, right? You're doing a layer of RL, you're serving a model for X amount of people. Is it a chat model? A coding model? Dynamo, you know, back to that. Sorry.

32:30

Speaker C

So we sort of jumped off of, you know, on that topic. Everyone has their own journey and I like to think of it as defined by: what is the model you need? What is the accuracy you need? Actually I talked to Nader about this earlier. There's three axes you care about. What is the quality that you're able to produce? So are you accurate enough, or can you complete the task with high enough performance? Yeah. There's cost. Can you serve the model, or serve your workflow, because it's not just the model anymore, it's the workflow, it's the multi-turn with an agent, cheaply enough? And then can you serve it fast enough? And we're seeing all three of these play out. We saw new models from OpenAI that are faster. You have these new fast versions of models. You can change the amount of thinking to change the amount of quality, produce more tokens, but at a higher cost and a higher latency. And really when you start this journey of trying to figure out how you want to host a model, you think about three things. What is the model I need to serve? How many times do I need to call it? What is the input sequence length? What does the workflow look like on top of it? What is the SLA? What is the latency SLA that I need to achieve? Because this is usually a constant, you know, the SLA that you need to hit, and then you try and find the lowest-cost version that hits all of these constraints. Usually you start with those things and you kind of do a bit of experimentation across some common configurations. You change the tensor parallel size, which is a form of parallelism.

33:00

Speaker D

I'd say it goes even deeper. First you got to think, well model,

34:26

Speaker C

it's like a multi-step design process, because as you said, you can choose a smaller model and then do more test time scaling and it'll equal the quality of a larger model, because you're doing the test time scaling or you're adding a harness or something. So yes, it goes way deeper than that. But from the performance perspective, once you get to the model you need to host, you look at that and you say, hey, I have this model, I need to serve it at this speed. What is the right configuration for that?
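The search described here, fix the model and the latency SLA, then pick the cheapest configuration that satisfies it, can be written in a few lines. The candidate configurations and their numbers below are invented for illustration; in practice you would sweep real benchmarks across tensor-parallel sizes and other knobs.

```python
# Sketch of the config search: cheapest deployment that still meets the latency SLA.
# The candidate configurations and their measured numbers are made up for illustration.

configs = [
    # (name, $ per 1M tokens, p99 time-per-output-token in ms)
    ("tp1-agg",    0.40, 95.0),
    ("tp2-agg",    0.55, 60.0),
    ("tp4-disagg", 0.70, 32.0),
    ("tp8-disagg", 1.10, 21.0),
]

def cheapest_meeting_sla(configs, sla_ms: float):
    feasible = [c for c in configs if c[2] <= sla_ms]
    if not feasible:
        raise ValueError("no configuration meets the SLA; relax it or change the model")
    return min(feasible, key=lambda c: c[1])

print(cheapest_meeting_sla(configs, sla_ms=50.0))   # -> ('tp4-disagg', 0.7, 32.0)
```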

34:30

Speaker A

You guys see the recent. There's a paper I just saw like a few days ago that if you run the same prompt twice, you're getting like double.

34:55

Speaker D

Try it again.

35:01

Speaker A

Yeah, exactly.

35:02

Speaker D

And you get a lot.

35:02

Speaker A

Yeah.

35:03

Speaker D

But the key thing there is you give the context of the failed try. Right. So it takes a shot. And this has been like, you know, basic guidance for quite a while. Just try again. Because you know, try.

35:03

Speaker A

Just try again.

35:14

Speaker B

Did you try again?

35:15

Speaker A

All advice in life.

35:15

Speaker D

It's a paper from Google if I'm not mistaken. Right. I think it's like a seven little short paper. The title is very cute. And it's just like, yeah, just try again.

35:17

Speaker C

Give it.

35:24

Speaker D

It has context.

35:25

Speaker B

Multi shot.

35:26

Speaker C

You just like say like, hey, like, you know, like take, take a little bit more. Take a little bit more information. Try and fail.

35:26

Speaker D

And that basic concept has gone pretty deep. There's like self distillation RL where you, you do self distillation, you do RL and you have past failure. And you know, that gives some signal. So people take. Try it again.
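The "just try again, but show it the failed attempt" pattern referenced here can be sketched as a small retry loop. `call_model` and `passes_check` are placeholders for whatever client and validator you use; this is not the specific paper's recipe, just the general idea.

```python
# Sketch of retry-with-failure-context: each retry sees the previous failed attempt.
# `call_model` and `passes_check` are placeholders for your own client and validator.
from typing import Callable

def solve_with_retries(prompt: str,
                       call_model: Callable[[str], str],
                       passes_check: Callable[[str], bool],
                       max_attempts: int = 3) -> str:
    context = prompt
    last = ""
    for attempt in range(1, max_attempts + 1):
        last = call_model(context)
        if passes_check(last):
            return last
        # Key idea: don't just resample -- feed the failed attempt back in.
        context = (
            f"{prompt}\n\nA previous attempt failed:\n{last}\n"
            "Identify what went wrong and produce a corrected answer."
        )
    return last  # best effort after exhausting retries

# Example wiring with toy stand-ins:
flaky = iter(["wrong answer", "42"])
result = solve_with_retries("What is 6 * 7?", lambda _: next(flaky), lambda s: s == "42")
print(result)
```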

35:32

Speaker B

Not strong enough. For listeners who made it here: Vibhu and I actually run a second YouTube channel for our paper club where.

35:45

Speaker A

Oh, that's.

35:55

Speaker B

We would just cover this self distillation and all that. That's why he's so up to speed on it.

35:55

Speaker C

I'll have to check it out.

36:00

Speaker B

Yeah, it's just a good practice. Everyone needs a paper club where you just read papers together and the social pressure just kind of forces you.

36:00

Speaker C

There's like a big inference reading group.

36:08

Speaker A

I feel so bad every time he put it on. Like on our. He shared it.

36:10

Speaker B

One of your guys is big in that. I forget.

36:14

Speaker A

Yeah. Ishan.

36:16

Speaker B

Ishan.

36:17

Speaker C

Ishan's on my team actually. Funny, funny. There's an employee transfer between us. Ishan worked for Nader at Brev and now he's.

36:18

Speaker A

He was our head of AI and then yeah, once we got in.

36:25

Speaker B

Because I'm always looking for, like, okay, can I start another podcast that only does that thing? And I was trying to nudge Ishan into, like, is there something here? I mean, I don't think there's new inference techniques every day. So it's like.

36:27

Speaker C

You would, you would actually be surprised the amount of blog posts you see

36:39

Speaker B

And there was a period where it was like Medusa, Hydra, Eagle.

36:44

Speaker C

Now we have new forms of decode, we have new forms of speculative decoding.

36:49

Speaker B

What are you expecting?

36:53

Speaker D

It's exciting when you guys put out something like Nemotron. Because I remember the paper on this Nemotron 3, the amount of post training, the amount of tokens that the GPU rich can just train on. And it was a hybrid state space model, right?

36:54

Speaker C

Yeah, it's co designed for the hardware.

37:07

Speaker D

Yeah, co-designed for the hardware. And one of the things was always that state space models don't scale as well, when you do a conversion or whatever, the performance, and you guys are like, no, just keep training. And Nemotron showed a lot of that.

37:08

Speaker A

Also something cool about Nevotron, it was released in layers, if you will. Very similar to Dynamo. It was released as aggregated. You can the pre training, post training data sets are released, the recipes on how to do it are released, the model itself is released. So you can benefit from us turning on the GPUs. But there are companies like ServiceNow took the data set and they trained their own model. And we were super excited and like you know, celebrated that work.

37:20

Speaker B

Zoom, the frontier model lab. Zoom is, Zoom is AGI.

37:43

Speaker D

I think, you know, also just to add, a lot of models don't put out base models. And back to that, why has fine tuning not taken off? You know, you can do your own training, but you guys put out base models. I think you put out everything.

37:46

Speaker B

I believe, I don't know about base. Base can, base can be cancelable.

37:59

Speaker D

Base can be cancelable.

38:03

Speaker B

Yeah.

38:05

Speaker D

Safety training.

38:05

Speaker B

Do we get a full picture of Dynamo? I don't know if we.

38:07

Speaker A

What I'd love is, you mentioned the three axes; break it down, like, you know, what's prefill, what's decode, and what are the optimizations that we can get with Dynamo.

38:10

Speaker C

Yeah, that's a great point. So to summarize on that three-axis problem, there are three things that determine whether or not something can be done with inference: cost, quality, latency. Dynamo is supposed to be there to provide you the runtime that allows you to pull levers to mix it up and move around the Pareto frontier, or the Pareto surface, that determines: is this actually possible with inference? And AI today gives you the knobs. Yeah, exactly. It gives you the knobs. And one thing that we use a lot in contemporary inference, and is starting to pick up in general knowledge, is this concept of disaggregation. So historically models would be hosted with a single inference engine and that inference engine would ping-pong between two phases. There's prefill, where you're reading the sequence and generating KV cache, which is basically just a set of vectors that represent the sequence, and then using that KV cache to generate new tokens, which is called decode. And some brilliant researchers across multiple different papers essentially made the realization that if you separate these two phases, you actually gain some benefits. Those benefits are basically: you don't have to worry about step-synchronous scheduling. So the way that an inference engine works is you do one step and then you finish it and then you start scheduling the next step. It's not fully asynchronous. And the problem with that is that prefill and decode are actually very different in terms of both their resource requirements and sometimes their runtime. So you would have prefill that would block decode steps, because you'd still be prefilling and you couldn't schedule because the step has to end. So you remove that scheduling issue, and then you also allow yourself to split the work into two different types of pools. Prefill typically, and this changes as model architecture changes, prefill is right now compute bound most of the time, if the sequence is sufficiently long. On the decode side, you're doing a full pass over all the weights and the entire sequence every time you do a decode step, and you don't have the quadratic computation over the KV cache, so it's usually memory bound: you're retrieving a linear amount of memory and you're doing a linear amount of compute, as opposed to prefill, where you retrieve a linear amount of memory and then do a quadratic amount of compute.
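A rough way to see the compute-bound vs. memory-bound split described here is to compare FLOPs and bytes per phase. The sketch below uses a deliberately crude model (dense weights, assumed hardware numbers, attention details ignored); it only illustrates why prefill tends to saturate compute while decode tends to saturate memory bandwidth.

```python
# Crude arithmetic-intensity sketch for prefill vs decode on a dense model.
# Assumed model/hardware numbers; ignores many real-world details.

params = 70e9                 # parameters
bytes_per_param = 2           # BF16 weights
prompt_len = 4096             # tokens processed in prefill
peak_flops = 1.0e15           # assumed accelerator peak FLOP/s
mem_bw = 3.0e12               # assumed HBM bandwidth, bytes/s

# Prefill: ~2 FLOPs per parameter per token over the whole prompt, one weight read.
prefill_flops = 2 * params * prompt_len
prefill_bytes = params * bytes_per_param
# Decode: same per-token math, but weights are re-read for every single new token.
decode_flops_per_token = 2 * params
decode_bytes_per_token = params * bytes_per_param

for name, flops, byts in [("prefill (whole prompt)", prefill_flops, prefill_bytes),
                          ("decode (per token)", decode_flops_per_token, decode_bytes_per_token)]:
    compute_time = flops / peak_flops
    memory_time = byts / mem_bw
    bound = "compute-bound" if compute_time > memory_time else "memory-bound"
    print(f"{name:>24}: compute {compute_time*1e3:7.2f} ms, memory {memory_time*1e3:6.2f} ms -> {bound}")
```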

38:17

Speaker A

You know what's funny? Someone at EXO Labs did a really cool demo where, for the DGX Spark, which has a lot more compute, you can do the compute-hungry prefill on a DGX Spark and then do the decode on a Mac. And so that's faster.

40:35

Speaker C

Yeah, so you can do machine stratification. And with our future generations of hardware, we actually announced, with Rubin, this new accelerator that is prefill specific; it's called Rubin CPX.

40:49

Speaker A

I have a question. When you do the scale out, is scaling out easier with Dynamo? Because when you need a new node, you can dedicate it to either prefill or decode.

41:05

Speaker C

Yeah, so Dynamo actually has a Kubernetes component in it called Grove that allows you to do this crazy scaling specialization. It's a representation that. I don't want to go too deep into Kubernetes here, but there was a previous way that you would launch multi-node work. It's called leader worker set. It's in the Kubernetes standard, and leader worker set is great. It served a lot of people super well for a long period of time. But one of the things that it struggles with is representing a set of cases where you have a multi-node replica that has a paired prefill and decode, or it's not paired, but it has a second stage, with a ratio that changes over time. Prefill and decode are two different things. As your workload changes, the amount of prefill you'll need to do may change. The amount of decode that you'll need to do might change. Let's say you start getting insanely long queries. That probably means that your prefill scales harder because you're hitting this quadratic scaling growth.

41:14

Speaker B

For listeners: prefill will be long input, decode will be long output, for example.

42:11

Speaker A

Yeah.

42:15

Speaker C

So decode scales. I mean, decode is funny, because the amount of tokens that you produce scales with the output length, but the amount of work that you do per step scales with the amount of tokens in the context.

42:15

Speaker B

Yes.

42:26

Speaker C

So both scale with the input and the output.

42:27

Speaker B

That's true.

42:29

Speaker C

But on the prefill/decode side, if suddenly the amount of work you're doing on the decode side stays about the same or scales a little bit, and the prefill side jumps up a lot, you actually don't want that ratio to stay the same. You want it to change over time. So Dynamo is a set of components that (a) tell you how to scale, it tells you how many prefill workers and decode workers it thinks you should have, and (b) provide a scheduling API for Kubernetes that allows you to actually represent and effect this scheduling on your actual hardware, on your compute infrastructure.
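
As a minimal sketch of the kind of ratio planning described here, the snippet below estimates prefill versus decode worker counts from observed traffic. The per-worker throughput numbers are hypothetical placeholders, not Dynamo's actual planner logic.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Traffic:
    requests_per_s: float
    avg_input_tokens: float    # drives prefill load
    avg_output_tokens: float   # drives decode load

def plan_workers(traffic: Traffic,
                 prefill_tokens_per_s_per_worker: float = 50_000,
                 decode_tokens_per_s_per_worker: float = 5_000) -> tuple[int, int]:
    # Size each pool independently instead of scaling one monolithic replica.
    prefill_load = traffic.requests_per_s * traffic.avg_input_tokens
    decode_load = traffic.requests_per_s * traffic.avg_output_tokens
    prefill_workers = max(1, ceil(prefill_load / prefill_tokens_per_s_per_worker))
    decode_workers = max(1, ceil(decode_load / decode_tokens_per_s_per_worker))
    return prefill_workers, decode_workers

# If queries suddenly get much longer, prefill load grows faster than decode load,
# so the ratio between the two pools shifts rather than both scaling uniformly.
print(plan_workers(Traffic(20, 2_000, 500)))    # short prompts
print(plan_workers(Traffic(20, 50_000, 500)))   # very long prompts
```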

42:29

Speaker A

Not gonna lie, I feel a little embarrassed for being proud of my SVG function earlier.

43:01

Speaker C

It was really cute.

43:07

Speaker B

It's all engineering. It's all engineering, sort of technical. One thing I'm kind of just curious about: you see, at a systems level, everything going on here, and we're scaling it up into big distributed systems. I think one thing that's kind of of-the-moment right now is people asking, is there any SOL, sort of upper bound, in terms of, let's just call it context length, for want of a better word? But you can break it down however you like. Yeah, I just think, well, yeah, I mean, clearly you can engage in hybrid architectures and throw in some state space models all you want, but it still looks very attention heavy.

43:08

Speaker C

Yes, yeah. Long context is attention heavy. I mean we have these hybrid models

43:44

Speaker B

and most models cap out at a million tokens of context, and that's been it for the last two years.

43:49

Speaker A

Yeah.

43:54

Speaker C

The model-hardware-context co-design thing that we're seeing these days is actually super interesting. It's like my passion, my secret side passion. We see models like Kimi or GPT-OSS. I'm going to use these because I know specific things about these models. So Kimi K2 comes out, right? And it's an interesting model. It's a DeepSeek-style architecture. It's MLA. It's basically DeepSeek scaled a little bit differently, and obviously trained differently as well. But they talked about why they made the design choices for context. Kimi has more experts but fewer attention heads, and I believe a slightly smaller attention dimension, but I'd need to check; it doesn't matter. They discussed this actually at length in a blog post on Zhihu, which is like, yeah, Chinese Reddit. So it's actually an incredible blog post. All the MLSys people that I've seen on there are very brilliant, and the creators of Kimi K2 actually talked about it in a blog post there, and they say, we actually did an experiment. Right? Attention scales with the number of heads. Obviously, if you have 64 heads versus 32 heads, you do half the work of attention. You still scale quadratically, but you do half the work. And they made a very specific sort of barter in their system, in their architecture. They basically said, hey, what if we gave it more experts? So we're going to use more memory capacity, but we keep the amount of activated experts the same. We increase the expert sparsity, so the ratio of experts activated to total number of experts is smaller, and we decrease the number of attention heads.
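
Toy arithmetic for the trade described above: fewer attention heads (at a fixed head dimension) roughly halves attention work, while adding experts grows total parameters without growing per-token compute if the number of activated experts stays fixed. The numbers are illustrative, not Kimi K2's real configuration.

```python
def attention_flops(seq_len, n_heads, head_dim):
    # QK^T plus attention-weighted V per layer: quadratic in seq_len,
    # linear in the number of heads.
    return 4 * n_heads * head_dim * seq_len ** 2

def moe_total_params(n_experts, expert_params):
    return n_experts * expert_params

def moe_active_params(n_active, expert_params):
    return n_active * expert_params

L, head_dim, expert_params = 128_000, 128, 44_000_000
print(attention_flops(L, 64, head_dim) / attention_flops(L, 32, head_dim))   # -> 2.0
print(moe_total_params(384, expert_params) / moe_total_params(256, expert_params))  # more capacity
print(moe_active_params(8, expert_params))                                   # unchanged per token
```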

43:55

Speaker D

And kind of for context, what we had been seeing was you make models sparser; no one was really touching heads.

45:38

Speaker C

Well, they implicitly made it sparser.

45:46

Speaker D

Yeah. For Kimi they did. They also made it sparser. But basically what we were seeing was people were at the level of, okay, there's a sparsity ratio. You want more total parameters, fewer active, and that's sparsity. But what you see from papers from labs like Moonshot and DeepSeek is they go to the level of, okay, outside of just the number of experts, you can also change how many attention heads you have, fewer attention layers, more attention layers.

45:48

Speaker C

Yes, yes.

46:11

Speaker D

So that's all basically coming back, tying it together, to hardware-

46:12

Speaker C

model co-design, which is really hardware-model-context co-design. Yeah. Right. Like, if you were training a model that was really, really short context, or really good at super short context tasks, you might design it in a way such that you don't care about attention scaling, because it hasn't hit the turning point where the quadratic curve takes over.

46:16

Speaker A

How do you consider attention or context as a separate part of the co-design? I would imagine, or just how I would have thought of it, is that hardware-model co-design would just be hardware-model-context co-design.

46:35

Speaker C

Because the harness and the context that is produced by the harness is a part of the model once it's trained

46:45

Speaker D

in like, even though towards the end you'll do long context, you're not changing architecture through training.

46:52

Speaker C

I mean you can try.

46:57

Speaker B

You're saying everyone's training the harness into the model.

46:59

Speaker C

I would say to some degree.

47:02

Speaker A

Or there's co design.

47:03

Speaker B

I know there's a small amount, but I feel like not everyone has gone run full send on this.

47:04

Speaker C

I think it's important to internalize the harness that you think the model will be running into the model.

47:09

Speaker D

Yeah.

47:15

Speaker A

Interesting. Okay.

47:15

Speaker B

And like Bash is like the universal harness.

47:16

Speaker C

I'll give an example here. Right. I mean, or just, it's an easy proof, right? If you can train against a harness and you're using that harness for everything, wouldn't you just train with the harness to ensure that you get the best possible quality out of it?

47:20

Speaker B

Well, I can provide the counter argument, which is what you want to provide a generally useful model for other people to plug into their harnesses.

47:35

Speaker C

Harnesses can be open source, right?

47:42

Speaker B

Yes. I mean, that's effectively what's happening with Codex. But you may want a different search tool, and then you may have to name it differently.

47:44

Speaker A

I don't know how much people have pushed on this, but can you train a model? Have people compared training a model for the harness versus post training training for.

47:51

Speaker B

I think it's the same thing. It's just extra post training.

48:01

Speaker A

I see.

48:04

Speaker B

And so I mean cognition does this, cursor does this where you just have to like, if your tool is slightly different, either force your tool to be like the tool that they train for or undo their training for their tool and then retrain. It's really annoying.

48:04

Speaker C

And like, I would hope that eventually we hit like a certain level of generality with respect to these new tools.

48:17

Speaker B

It's not AGI. Really stupid. Like, learn my tool, bitch. I don't know if I can say that. But, you know, I think my point kind of is: I look at the slopes of the scaling laws, and this slope is not working, man. We're at a million token context. Okay, maybe next year, 2 million. We're not going to 100 trillion. You know, like, this just.

48:23

Speaker C

Oh, there's so many interesting ways.

48:44

Speaker B

This doesn't work. This doesn't work.

48:45

Speaker A

What's kind of funny is whenever there I. I feel like we always want to see a trend that we can predict, but every time something's come, it's been like a leapfrog. So I imagine, I don't know how we go from one to two, but I imagine what's likely to happen is we break through that from some new.

48:47

Speaker C

Yeah, there's an interesting formalization of this. There's an essay, a pretty interesting essay, by Leopold Aschenbrenner called Situational Awareness.

49:01

Speaker B

No kidding. Yes.

49:10

Speaker C

He introduces a concept in it called an unhobbler. Right? So Leopold, in this essay, details: hey, I want to get to this point in intelligence, and I think that it is four orders of magnitude worth of compute and data and training away. And he says, oh yeah, I think data centers can scale up by about this much, I think you can scale up the data and some other things by this much. But one of the things that makes the rest of that order-of-magnitude growth possible is these unhobblers, these scientific discoveries that are made during model architecture search or training that really, really, really impact how you are able to scale. A good example of this, and this is probably a very tiny unhobbler but important from a performance perspective, is that we see a lot of models trained with multi-token prediction natively during pre-training, and per DeepSeek, in their paper, they say, hey, this actually helped us ensure more stable convergence. But there are unhobblers like that, and then there are rather large unhobblers. Architecturally, a lot of our models had different types of attention, and one of the problems with attention is you have a lot of KV. But people have found different forms of attention, like grouped-query attention, and MLA in DeepSeek, multi-head latent attention, that decrease the burden that KV places on the model, which allows you to grow longer in context.

49:11

Speaker B

Yeah. And that was very drastic for DeepSeek.

50:38

Speaker C

Yeah. For context, the total context length of DeepSeek is 128,000 tokens, or it might be 256,000 with RoPE extension. That entire context, I think it's 128,000, fits into 8 gigabytes. And previously, I think the Llama 405B context of a similar size was like 40 or 80 gigabytes in the same precision. So those unhobblers really decrease the memory cost at that scale. And I wouldn't be surprised if we do see the ability to break through to 10 million, 20 million, 100 million context through an unhobbler showing up.
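
Back-of-the-envelope KV cache sizing for the comparison above. The configurations are approximate values from the public model cards, so treat the numbers as illustrative rather than exact.

```python
GiB = 1024 ** 3

def kv_bytes_gqa(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # Standard attention with grouped-query KV: store K and V per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

def kv_bytes_mla(seq_len, n_layers, latent_dim, bytes_per=2):
    # Multi-head latent attention caches one compressed latent per token per layer.
    return n_layers * latent_dim * seq_len * bytes_per

seq = 128_000
llama_405b = kv_bytes_gqa(seq, n_layers=126, n_kv_heads=8, head_dim=128)
deepseek = kv_bytes_mla(seq, n_layers=61, latent_dim=576)   # ~512 latent + 64 RoPE dims

print(f"Llama 405B-style GQA cache: {llama_405b / GiB:.1f} GiB")  # on the order of 60 GiB
print(f"DeepSeek-style MLA cache:   {deepseek / GiB:.1f} GiB")    # single-digit GiB
```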

50:40

Speaker B

I see.

51:20

Speaker C

And it's just science.

51:20

Speaker B

More deep learning algorithms is what it is.

51:21

Speaker A

A frame pickup and he has room for two.

51:26

Speaker C

I could actually give you an example of a theory. Not a theory here, but something theoretically

51:28

Speaker A

an unhobbler that you're excited about.

51:34

Speaker C

An unhobbler that, I mean, I haven't seen. So it could be a tar pit, and it could just not work. But I would be really excited to see a model that does prefill and decode differently. So a model that does prefill locally, document-wise, in chunks, and then does decode globally across the entire sequence. Logically, to me, it doesn't seem like you would necessarily need KV to be associative between documents that have no mutual association. But that places a lot of burden on decode, and on pure attention within the decode phase, to make those connections, since the KV is static at that point. You see other techniques that are interesting like this too. But if you're able to do that, if prefill becomes local and decode is still global, you solve that prefill quadratic scaling problem, because you have a bunch of small chunks that you prefill independently.
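
Purely to illustrate the shape of this idea, here is a toy sketch of "local prefill, global decode": each document chunk is prefilled in isolation, and only the decode step attends over the concatenated KV of all chunks. This is a speculative illustration of the concept being described, not an existing model architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_prefill(chunks, wk, wv):
    # KV for each chunk is computed in isolation: cost is the sum of chunk_len^2
    # terms rather than total_len^2, removing the global quadratic prefill term.
    return [(c @ wk, c @ wv) for c in chunks]

def global_decode_step(query, kv_per_chunk):
    # The new token still attends across all cached KV, so generation stays global.
    keys = np.concatenate([k for k, _ in kv_per_chunk], axis=0)
    values = np.concatenate([v for _, v in kv_per_chunk], axis=0)
    scores = softmax(query @ keys.T / np.sqrt(keys.shape[-1]))
    return scores @ values

d = 64
rng = np.random.default_rng(0)
wk, wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
docs = [rng.standard_normal((n, d)) for n in (300, 500, 200)]   # three "documents"
kv = local_prefill(docs, wk, wv)
out = global_decode_step(rng.standard_normal((1, d)), kv)
print(out.shape)   # (1, 64)
```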

51:35

Speaker B

Okay, all right. Well, let's wait and see. But I think it'll be pretty exciting.

52:29

Speaker C

Fingers crossed.

52:33

Speaker B

Yeah, fingers crossed.

52:33

Speaker C

Yeah.

52:34

Speaker D

I'm excited for prefill and decode on separate hardware. So, like, the Groq acquisition, right? Can we decode on the Groq hardware, can we get super fast?

52:35

Speaker C

I don't think I'm allowed to comment on this.

52:43

Speaker B

Mark is going to shoot arrows at us.

52:45

Speaker A

He's got a blow dart. Yeah, he's on the side of the room.

52:48

Speaker C

Just like go to sleep.

52:50

Speaker A

I'm super excited to see the team come in, and, you know, I've gotten the pleasure of working with some of the Groq people coming in. So, you know. Yeah, I know Sunny.

52:52

Speaker B

We've had him at the same conference that you were at.

52:59

Speaker A

Yeah.

53:02

Speaker B

And I think you guys are going to be doing some sessions at gtc. I don't know if you. This is a good place to plug them.

53:03

Speaker C

Yeah, yeah. So I can't speak to any LPU-related sessions at GTC. I have no idea about that on the Groq side. I'll take the associated NVIDIA one.

53:08

Speaker A

On the.

53:19

Speaker C

On the NVIDIA Dynamo side, we're giving a large number of sessions. For those that aren't aware, you can actually search all of these sessions for GTC online. Just go to the GTC website; I don't know what the URL is, but go there and you can just look up Dynamo and you'll get all the sessions. There are about 20. There are a couple that are hosted by the Dynamo team, and there are a couple that are hosted by people that use Dynamo and want to show off the results they've been able to get. But there are two that I'm really excited about. One is just the general Dynamo tutorial. I'm doing it with Harry, who's our lead product manager for Dynamo, and we're talking about how to use Dynamo to get better performance and also where we see Dynamo going in the future. And then there's another session that I'm doing with one of our agents teams at NVIDIA to talk about the future of agents in production inference. There's this new horizon with respect to agents, because we have these harnesses that actually impart structure upon calls. If you compare the past and the present with respect to how LLM calls work: in the early days, when they were chatbots, every call was very different; there was basically no structure. You could assume that if it was conversational there might be some implicit structure, because you have a multi-turn conversation. But with agents, you have this harness that abides by rules, so it imparts direct structure onto the context. And you see this. There was an interesting Twitter post about how Claude Code structures its context so that you get as many cache hits as possible. I think it was by one of the PMs for Claude Code, and he wrote about it. And that type of structure that the harness can impart actually goes hand in hand with the inference co-design. So I'm doing a talk. I don't know the session name or the session number, but you can look me up by name on the GTC website. It's on how we accelerate agents and where we see specific optimizations for agents going in Dynamo and in inference in general.
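
A sketch of the prefix-caching idea being referenced: keep everything that does not change (system prompt, tool definitions) at the front of the context and strictly append new turns, so consecutive calls share the longest possible prefix and the engine can reuse its KV cache. The details of how Claude Code does this are not spelled out here; this just illustrates the general pattern.

```python
import hashlib
import json

def build_context(system_prompt: str, tools: list[dict], turns: list[dict]) -> list[dict]:
    # Stable, deterministic prefix first; volatile conversation appended after.
    prefix = [{"role": "system", "content": system_prompt},
              {"role": "system", "content": json.dumps(tools, sort_keys=True)}]
    return prefix + turns

def prefix_cache_key(messages: list[dict], prefix_len: int) -> str:
    # Engines typically key cached KV blocks on a hash of the token prefix;
    # hashing the serialized message prefix stands in for that here.
    blob = json.dumps(messages[:prefix_len], sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

tools = [{"name": "read_file"}, {"name": "run_shell"}]
turn1 = build_context("You are a coding agent.", tools,
                      [{"role": "user", "content": "fix the bug"}])
turn2 = build_context("You are a coding agent.", tools,
                      [{"role": "user", "content": "fix the bug"},
                       {"role": "assistant", "content": "done"},
                       {"role": "user", "content": "now add tests"}])
# Same key for the shared prefix across turns -> cache hit on the expensive part.
print(prefix_cache_key(turn1, 2) == prefix_cache_key(turn2, 2))   # True
```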

53:19

Speaker B

Yeah, I think there's only one PM for Claude Code. For the rest, there's DevRel, there's Boris.

55:18

Speaker C

Maybe it's DevRel.

55:23

Speaker B

Exactly. I mean, let's go into agents. I think this was like the last part of this discussion we planned.

55:24

Speaker C

How have we not talked about agents

55:29

Speaker D

also with you guys?

55:31

Speaker B

We scheduled it. I was like, okay, let's have cohesive sections.

55:32

Speaker D

I mean, there's big news, right? NVIDIA has a huge deployment of Codex.

55:36

Speaker B

Nvidia uses everything. We use this cursor and we use this.

55:42

Speaker D

But that's a pretty big deployment, right? That's tens of thousands of people.

55:44

Speaker A

Totally. Yeah. We were super curious. It goes back to the mosh pit of emails we kind of mentioned earlier, or just how fluid the org feels. So when there's new technology, people will just email it out and everyone will try it. And if it's making people's lives easier, it'll spread like wildfire.

55:48

Speaker C

A lot of times Jensen will get it and he'll be like, let's make this work across the company. Let's make this work right now.

56:03

Speaker A

Honestly, if I was a startup, I feel like a cool hack is: if you have something that's going to save an NVIDIAn time, they'll spread it to a couple of people, and the same thing.

56:07

Speaker C

Right.

56:15

Speaker A

It'll just spread like wildfire.

56:15

Speaker D

Careful before your email blows up from startups, by the way.

56:16

Speaker A

Well, you'd have to know the person. Right. But no, yeah, so, I mean, I love using Codex. It's been a ton of fun. I've been using it personally, been using it at work. It's been, yeah, I don't know, it's been great to see the rollout. Something really funny: on the day that we got Codex and Claude Code access, I found this person, his name's Carlos, at the company. He wrote an Outlook CLI.

56:19

Speaker C

Oh yeah.

56:39

Speaker A

And it's just a CLI for email. I've been using that for, yeah, maybe four or five weeks. So once I got Codex access, I installed the CLI. It had a skill, and I just asked it to go through all of my emails, which are very messy, so if I don't respond to your email, I'm really sorry. But I asked it to give me a summary, highlight any escalations that I should look at, put any thread that it thinks I should respond to in a folder, and then archive everything. And it did. So if I missed your email, it's because it didn't get flagged.

56:40

Speaker B

So I should put a prompt injection in my emails to. Yeah, what you should do is just paste the OSH's.

57:09

Speaker A

Yeah, yeah, yeah. My SLA is highest on FaceTime. But it was magic. And so I sent it in a big email thread to like 500 people, and a bunch of folks tried it out. I started FaceTiming whoever I could at the company to get them set up with this. Yeah.

57:15

Speaker B

That specific example, you guys deal with, like some pretty sensitive emails.

57:27

Speaker A

Yeah.

57:32

Speaker B

Is there a security review with this? Because, like, one guy made it for himself, but, like, it's not meant for all.

57:32

Speaker C

The security team at Nvidia is incredible. Like, shout out to them.

57:37

Speaker A

They're.

57:40

Speaker C

They're, they're trying to.

57:41

Speaker A

We have an amazing security team, and they're progressive, and they know that this is really important technology we have to bring in. If you think about it, if you work at a big company, your laptop's usually very locked down and you can only access certain things. For NVIDIA engineers, those restrictions aren't there. So you're expected to understand the risks when you try things out. And so we very quickly made sure to loop in security on what we were doing. There's actually a lot that we've been thinking about, especially with OpenCloud.

57:42

Speaker B

Right.

58:04

Speaker A

Like, there's, you know, agents can do three things. Yeah, agents can do three things: they can access your files, they can access the Internet, and now they can write custom code and execute it. And you should really only let an agent do two of those three things. If it can access your files and it can write custom code, you don't want it to have Internet access, because that's a vulnerability. Right? If it has access to the Internet and your file system, you need to know the full scope of what that agent's capable of doing; otherwise malware can get injected, or something like that can happen. And so that's a lot of what we've been thinking about: how do we enable this, because it's clearly the future, but then also, what are these enforcement points that we can start to put in place?
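
A minimal sketch of the "two of three" rule described above: an agent may have at most two of {file access, internet access, code execution}. The capability names and policy shape are illustrative, not a specific product's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentCapabilities:
    file_access: bool
    internet_access: bool
    code_execution: bool

def violates_two_of_three(caps: AgentCapabilities) -> bool:
    enabled = sum([caps.file_access, caps.internet_access, caps.code_execution])
    return enabled > 2

def assert_safe(caps: AgentCapabilities) -> None:
    if violates_two_of_three(caps):
        raise PermissionError(
            "Agent granted files + internet + code execution; drop one capability."
        )

assert_safe(AgentCapabilities(file_access=True, internet_access=False, code_execution=True))  # ok
try:
    assert_safe(AgentCapabilities(file_access=True, internet_access=True, code_execution=True))
except PermissionError as e:
    print(e)   # the combination above is rejected
```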

58:04

Speaker B

To protect things. And is there any directive of, like, hey, we have a company account or company agreement with OpenAI, we use OpenAI models here, or choose whatever?

58:40

Speaker A

No, no. So I would never put any company data in a model that's not approved for it.

58:49

Speaker C

It has the most security. Yes. Yeah, like how that goes.

58:54

Speaker B

You know, obviously you could run your own models. You have Nemotron and we do.

58:58

Speaker A

We have an internal cluster. So you know, of course an English. Yeah, yeah, yeah. I think we're Dynamo's first customers.

59:02

Speaker C

Actually, there's a funny story about how I got the experience that informed what we needed for Dynamo. At one point, there was a website called build.nvidia.com, which allows people to try models. It gives an API service: you can call the model with a REST API and you get a response. I ran the model side for that, and it was at one point the largest inference deployment at NVIDIA, and it still may actually be the largest. I've since handed it off to some people and they're doing a wonderful job.

59:10

Speaker A

This is an extremely underknown, or less known, resource: build.nvidia.com. You can get any of these open source models, and it's rate limited but it's free, so it's perfect for hackers.

59:38

Speaker C

And the SLA on getting day-zero models up is like a day. Yeah, they're incredibly good at figuring out the right way to host the model to get it up there as soon as it comes out. Yeah, I ran it a long time ago. It was originally called NVIDIA AI Playground, then it was called AI Foundation, and then it was called build.nvidia.com, and I ran the model side of it. So there was a large multi-organizational team. I ran which models we should host, how we should host them, and in what proportion. And then of course there was an SRE team that made sure that things ran well and scaled the models as well. But I ran, you know, the model side, how do we get the model to silicon, and also worked with our product team to determine which models were important, a very long time ago.

59:48

Speaker D

Yeah, there's also a middle ground in between, right? This is for the hacker to try anything, there's the Brev console, then there's Dynamo. There were also NIMs, right? Yeah, I remember it had its little moment, like a year or two ago. Is it still around?

1:00:39

Speaker A

Yeah, no, NIM is, you know, inference microservices. I think it stood for something.

1:00:52

Speaker C

It's no longer an acronym, it's just a nim.

1:00:58

Speaker A

But yeah, NIM is how enterprises can take any of this technology and run it with support and all of that. And so that includes Dynamo. That includes, I don't know, all of our other optimizations that are packaged up for enterprise.

1:01:00

Speaker C

Yep.

1:01:13

Speaker B

Anyway, so you got a bunch of experience running the sort of internal inference gateway playground.

1:01:14

Speaker C

Yeah. I also built NVIDIA's first internal, like, VS Code thing. We call it NV Code.

1:01:19

Speaker B

It's like the extension, right?

1:01:26

Speaker C

Yeah, it was a VS Code extension first.

1:01:28

Speaker D

Like the forked VS Code?

1:01:29

Speaker A

Agree.

1:01:31

Speaker B

We jokes absolutely not. It just a while back be like, we should have a 4th VS code

1:01:31

Speaker C

hackathon where you the best for VS code.

1:01:35

Speaker D

Earlier we were doing a How do

1:01:39

Speaker B

you make a billion dollars?

1:01:40

Speaker D

Someone from VS code was there and he was like somewhat down to get involved.

1:01:41

Speaker A

And I was like, oh, you should do that.

1:01:45

Speaker B

That's all I said. Then the cool thing became forking Chrome instead. No, no, no. IDEs are not cool anymore.

1:01:47

Speaker D

I.

1:01:52

Speaker A

What's it called? I was talking to Joseph from Roboflow, and your partner in crime, and we were talking about the new Alpamayo model. So NVIDIA just released, open source, the model behind the Mercedes cars that you saw driving.

1:01:52

Speaker C

Shit, sounds crazy.

1:02:03

Speaker A

Yeah.

1:02:03

Speaker D

Released.

1:02:04

Speaker C

Wait, you open sourced an autonomous driving model?

1:02:04

Speaker A

Yeah. So we were thinking like, could we hackathon a driverless car? Like I have my old car. Let's just try it.

1:02:08

Speaker B

We'll take it.

1:02:15

Speaker D

Take it to like Click Trail with

1:02:16

Speaker A

Treasure island in the middle of the bay. Just like just see it, let it roam. Yeah. Like how cameras do we need? Right? Like 1, 2, 3, 4, I don't know, maybe 5, 6. I don't know. Yeah, but I think we're going to try. You just do it with us. We can see we could even have a race. It's like the first person to automate their driving. I mean, over a weekend.

1:02:17

Speaker B

We do have an autonomy track at World's Fair. Waymo was there. Yeah, NVIDIA did send people, but those were for GR00T, because they didn't have the driving thing yet. Yeah, that's cool.

1:02:34

Speaker D

I think Comma also has a version of this. Comma, they have open-source driving. They've done a fun hackathon, with him

1:02:44

Speaker B

and I as hosts. Because what I really want is Tesla-level self-driving,

1:02:50

Speaker A

Yeah.

1:02:54

Speaker B

but in a Smart car, like a two-seater that's basically a wheelchair with a roof.

1:02:55

Speaker D

And I think they make them in the demand has DNA.

1:03:00

Speaker B

They're like this for like five years.

1:03:06

Speaker A

Yeah.

1:03:08

Speaker B

Really? Yeah.

1:03:08

Speaker D

They were different manufacturer.

1:03:09

Speaker C

I feel like it's one of those things where we'll see someone buy the brand and it'll be revived. I would buy it like a private go. Someone hears this. Go buy your car.

1:03:11

Speaker B

Yeah. That's crazy Mercedes because they're like I think Mercedes.

1:03:22

Speaker D

Mercedes used to make them.

1:03:27

Speaker C

Yeah.

1:03:29

Speaker D

I don't know.

1:03:29

Speaker A

I feel like they own the brand

1:03:29

Speaker B

and you that's your dream might come true, you know. Okay.

1:03:31

Speaker A

We're

1:03:37

Speaker B

like every time I try to park in San Francisco I have to buy a smart car because like 20% of the parking lots in San Francisco only fit smart cars.

1:03:39

Speaker A

Yeah. Really? That's what I mean. Even though it was late here trying to.

1:03:48

Speaker C

This comes from someone that like basically does not drive.

1:03:53

Speaker A

That's where the Vespa was a life hack.

1:03:55

Speaker B

Yeah, exactly.

1:03:57

Speaker A

Yeah.

1:03:57

Speaker C

You know what happened to the Vespa?

1:03:58

Speaker A

I used to have this yellow Vespa. I left it outside the hacker house when we moved out. It's just it was always there and then like a month ago it's not there anymore. I've been meeting to. I don't know. You could light so it's actually been like a db.

1:03:59

Speaker C

You forgot about it.

1:04:12

Speaker A

Yeah.

1:04:14

Speaker D

Unless.

1:04:14

Speaker B

Yeah, yeah, yeah. No, someone probably has it. And speaking of hackathons, I also wanted to give a big shout out to the world's shortest hackathon. Let's go. You did it twice.

1:04:15

Speaker A

A handful of times, yeah. There's going to be one at GTC.

1:04:23

Speaker C

Oh, we're doing another one.

1:04:25

Speaker A

Pretty much. We have a bunch of challenges that, no, we haven't released yet, and you get to bring your agent to come and attempt to go through those challenges.

1:04:26

Speaker C

It's like the zero-minute hackathon idea I promised a long, long time ago. You just bring your agent and then you press the go button. You're not allowed to code.

1:04:34

Speaker A

It's just the agent doing it.

1:04:44

Speaker D

It's a good hidden eval, right?

1:04:46

Speaker C

Yeah.

1:04:47

Speaker D

You make a J rope and you

1:04:48

Speaker C

make this something I would love to see from cognition or someone else be like come bring your agent.

1:04:49

Speaker B

Like drop it in.

1:04:56

Speaker D

Because you don't know. Like a supervisor. Will it be, you know, operate a browser, order a pizza, will it just be something like that Snake

1:04:57

Speaker B

game, you know, and you don't know what the task is?

1:05:03

Speaker C

Yeah, you don't know what the task is. We're just like, you don't even know what the judging categories are. And then you give it the judging categories, like, try and win as much as possible.

1:05:05

Speaker D

It's great though. It turns into like, like. Yeah. So let's build something on Dynapod.

1:05:11

Speaker B

It's a great business.

1:05:15

Speaker C

Funny story, actually. We have a couple of people at NVIDIA, and we've been working with security to bring agents really close to compute. So we now have stuff where we can tell an agent: go run some experiments with Dynamo on X cluster and just try it right now. Queue up, and once you get scheduled, send this request load. And we've actually been able to just one-shot problems. We used to have this problem where, with Dynamo, you have to find the right configurations, and we do it automatically for some parts of it, but you have to have a good initial configuration that you want to use. And we've just had an agent completely one-shot that. It goes, it gets the compute, it runs a couple of experiments, and it's like: this is the best, these are part of the Pareto frontier, go run this. And then we just give that to people, and it's faster than anything that they have.
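
A sketch of the kind of output the config-sweeping agent produces: given a set of benchmarked configurations, keep only those on the Pareto frontier of latency versus throughput. The configuration names and numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    config: str
    latency_ms: float            # lower is better
    tokens_per_s_per_gpu: float  # higher is better

def pareto_frontier(results: list[RunResult]) -> list[RunResult]:
    # Keep a run only if no other run is at least as good on both axes
    # and strictly better on at least one.
    frontier = []
    for r in results:
        dominated = any(
            o.latency_ms <= r.latency_ms
            and o.tokens_per_s_per_gpu >= r.tokens_per_s_per_gpu
            and (o.latency_ms < r.latency_ms or o.tokens_per_s_per_gpu > r.tokens_per_s_per_gpu)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.latency_ms)

runs = [
    RunResult("tp8_no_disagg", 90, 1200),
    RunResult("tp4_disagg_1p3d", 60, 1100),
    RunResult("tp4_disagg_2p2d", 70, 1000),      # dominated by the config above
    RunResult("tp8_disagg_wide_ep", 110, 1600),
]
for r in pareto_frontier(runs):
    print(r.config, r.latency_ms, r.tokens_per_s_per_gpu)
```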

1:05:18

Speaker A

Agent UX and agent marketing are super important. They're stuff that we've been thinking a lot about. Alec is redoing the entire Brev CLI so that you can fetch all the different compute types that are available. I don't know when, but it's going to be really soon. Then you can just browse what GPUs are available, provision one, SSH to it right there, and you can pipe all the commands. But I think it goes back to CLIs. Coding agents, it's kind of funny, I feel like coding agents have been so much more effective than general purpose agents. And I think a large part of that is that they just have access to the terminal, like you said. And that means they have access to everything that you've installed into your terminal. They can write code and compile the code, and if there are errors, they can fix them. They can run your suite of tests, because that's all just in your terminal. And so that's what got me really excited about the Outlook CLI. We're now just churning through building CLIs for the entire business suite: a Slack CLI, also a Workday CLI, SAP. Go.

1:06:07

Speaker B

I've also done that for myself. Really?

1:06:56

Speaker A

Yeah, yeah, we're going to, we're going to open source all of this and like, yeah, all the, I mean they're just, they're CLIs for the business applications. We would love for someone to run with this and like build like, I don't know, like open CLI foundation in or something. Yeah, Nvidia would love to support anyone that's doing this.

1:06:58

Speaker D

Every dev tool should really have good CLI support at this point. At one point it was: you want your docs to be accessible by an LLM, you want LLM-friendly docs. Now, everything needs a CLI tool.

1:07:13

Speaker A

Yeah. It's kind of funny, right? Computing began with a terminal, with a shell, but we said that it's not empathetic to humans. So we built these nice user interfaces, and now we have LLMs navigating our user interfaces, and ironically we're not empathetic to the machine anymore. Just give the LLM access to the shell.

1:07:25

Speaker B

One thing that slightly makes me uncomfortable is: why do we have to build CLIs? Why can't we just expose APIs?

1:07:41

Speaker C

I have an interesting answer to this. There are a couple of reasons. Portability is one issue. Sometimes APIs are not discoverable or reachable by certain types of things. There's some element of locality, right? The CLI is literally you interfacing with your local system, which is a little bit different. You could still do it by API, but there's this highlighting of: what is the difference between a CLI and an MCP, right? They kind of occupy the same purposes: you call them, it does something on the system, and that's done. I think that in pre-training there's just an enormous amount of command-line data. Yeah. Even if we ignore RL, even if you're doing no harness posturing, the amount of CLI versus API documentation for just navigating the world of the CLI and your file system is enormous. Yeah, right.

1:07:47

Speaker A

I think there are a couple of things too. So, one, I think your intuition is right. The CLI is just wrapping the API. Right?

1:08:41

Speaker B

Functionally.

1:08:47

Speaker A

Functionally, right. And I think it's nice because, one, you're being very specific and pedantic about what you expose. And that's really good, because you're describing the problem space. So you know what the, I don't know, I don't want to call it the space for vulnerability, you know what network calls you're making. It's not arbitrary, and it's not decided on the fly; it's pre-decided, which is important from a security perspective. Whereas if you were to write a bunch of API requests, what would the model do? Would it use Python to do it? So I kind of like that a CLI is just Bash, because it's ubiquitous. It's just there, and you don't have to make sure that certain environment variables are set up. If your Python version is different than my Python version and we're using the same model to go do the same thing, is it going to write different code? It probably would.
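
A sketch of the CLI-wraps-the-API pattern being described: a thin, pre-decided command surface over a REST endpoint, so an agent shells out to a fixed tool instead of writing arbitrary network code. The endpoint URL and flags here are hypothetical, not a real NVIDIA or Outlook CLI.

```python
import argparse
import json
import urllib.request

API_BASE = "https://mail.example.internal/api/v1"   # hypothetical endpoint

def list_messages(folder: str, limit: int) -> list[dict]:
    # The only network call this tool can ever make is this one, which is what
    # makes the agent's blast radius easy to reason about.
    url = f"{API_BASE}/folders/{folder}/messages?limit={limit}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def main() -> None:
    parser = argparse.ArgumentParser(prog="mailcli", description="List messages in a folder.")
    parser.add_argument("folder", help="mail folder to read, e.g. inbox")
    parser.add_argument("--limit", type=int, default=20, help="max messages to return")
    args = parser.parse_args()
    for msg in list_messages(args.folder, args.limit):
        print(f"{msg.get('from', '?')}: {msg.get('subject', '')}")

if __name__ == "__main__":
    main()
```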

1:08:48

Speaker C

And so it's kind of nice to go work. Right.

1:09:31

Speaker A

With human as well, I think. Just like making those decisions happen ahead of time versus yeah.

1:09:33

Speaker B

One last thing on this sort of agent, I guess, colocation, or whatever you call it. One pattern I'm tracking for this year. I always try to think about what the theme of the year is going to be. Last year, definitely coding agents. This year is definitely coding agents breaking out of containment into broader use. I've definitely seen, like, rent-a-human. Yeah, I'm on it.

1:09:38

Speaker A

Are you really?

1:09:58

Speaker C

When I say I'm like $5,000, I'll do anything, really. I think so.

1:09:59

Speaker B

I need my bowels from Costco.

1:10:04

Speaker D

But I think the best part is only the agent can book me. You know, it's very.

1:10:07

Speaker C

Usually it's just like another labor marketplace

1:10:11

Speaker B

mechanical Turk was this. So this.

1:10:14

Speaker D

I have a weird story with why I did it. So back to your example of just giving agent access to compute. Right?

1:10:15

Speaker C

Yeah.

1:10:21

Speaker D

You guys are GPU rich at Nvidia. I hooked up.

1:10:22

Speaker A

He's not shy about it.

1:10:25

Speaker D

I have a 24/7 agent running that I hooked up to RunPod. It doesn't shut down instances. I've tried prompting it, I've given it instructions: shut down when you're done. It's like, I need to keep it warm, I'll need it soon. It's horrible at time estimates too, because it's like, yeah, I'll need it in 45 minutes, in 45 minutes I'll shut it down. But 45 minutes of human time is actually three minutes of agent time. So it's like, I'm booting it up, I'm waiting, I'll just leave it on all night. And Modal is good at shutting down after some inactivity. I had it on my local server, a little dual-GPU thing, and it just stays on. I have a little space heater at home now. But careful. So basically, you know, they don't care about the concept of money. Just burn it, I need it.

1:10:26

Speaker A

It's useful. And the DGX Spark will be really nice. I'm looking at it as super useful for agents, because you buy it once, you plug it in, and then you let it rip.

1:11:03

Speaker C

I'm gonna make an NVIDIA ad here. Okay. The Blackwell RTX Pro 6000 cards are only, I think, $8,000, or slightly cheaper. Yeah, well, it's much, much cheaper than the data center cards.

1:11:13

Speaker A

Yeah.

1:11:29

Speaker C

And it's got 96 gigabytes of VRAM. So if you and your crew want to go run a local agent for, you know, you in the home, it's got a significant amount of VRAM. I've thought about purchasing this and running it in my basement, except my neighbors would hate me.

1:11:29

Speaker D

It's just a single like two, three slot GPU.

1:11:46

Speaker A

It's mostly.

1:11:48

Speaker C

Yeah, it's a PCIe GPU. You can go buy that. I mean, the big difference against the RTX gaming GPUs is, obviously, it's Blackwell, it's a pro GPU, and it has a lot of VRAM, which means you can run pretty large models on it.

1:11:49

Speaker D

You can stack four of the Max-Q version in a system.

1:12:02

Speaker C

But as that's a beast, it's beefy.

1:12:05

Speaker A

You can run.

1:12:08

Speaker C

What is that, 96 times four? 384 gigabytes. You could run a low-precision DeepSeek.

1:12:09

Speaker D

But also, they are slower. I mean, performance will be somewhat slower than an API.

1:12:13

Speaker C

Oh yeah, that's true. So again, big learning: economy of scale allows you to do things that let you get both speed and throughput. I'll give you an example. There's an optimization called wide EP. I'm not going to go into it fully, but it featured heavily in InferenceMAX for DeepSeek, and there's a great set of stories from NVIDIA and from SemiAnalysis about why wide EP is important. For MoE models, it's basically essential. And the level of scale-up parallelism used for it is like 32, so it goes beyond that 8-GPU barrier. And it really, really, really is important to have that NVL72 GB200 NVLink to serve at scale. And, I don't remember the exact cost improvement, but against Hopper, with this NVL72 system, you're getting like 35 times cheaper per token for a lot of the curve. Yeah. Which is crazy. Yeah. And normalized per GPU, obviously, because the GPU is part of the cost.

1:12:22

Speaker B

And one thing I'm exploring is that this year is also the year of the sub-agent, where you have the main agent, but that also kicks off tools which are in themselves agents, with limited context and such. Low context, local, whatever.

1:13:24

Speaker C

Right.

1:13:39

Speaker B

Different prompts. So, for example, one thing that Cognition does is, before you kick off a search, they have a fast-context model that just searches across the code base. That's better than indexing a lot of the time, not all the time, and you should still index for some things. But the idea is that agents should be able to command sub-agents, and probably run them close to the inference as well. I don't know if that's even architecturally possible.

1:13:39

Speaker C

Yeah, we're thinking about that for Dynamo. That's our big theme for the year.
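
A minimal sketch of the sub-agent pattern being discussed: the main agent hands a narrow task ("find the relevant files") to a cheaper sub-agent with its own small context, and only the sub-agent's short summary flows back into the main context. The model names and the call function are stand-ins, not a real API.

```python
from dataclasses import dataclass, field

def call_model(model: str, prompt: str) -> str:
    # Placeholder for an actual inference call (local or remote).
    return f"[{model}] response to: {prompt[:60]}..."

@dataclass
class SubAgent:
    model: str                 # e.g. a small, fast "context" model
    system: str

    def run(self, task: str) -> str:
        return call_model(self.model, f"{self.system}\n\nTask: {task}")

@dataclass
class MainAgent:
    model: str
    history: list[str] = field(default_factory=list)

    def solve(self, request: str, searcher: SubAgent) -> str:
        # The expensive repo-wide search happens in the sub-agent's context,
        # keeping the main agent's context small.
        findings = searcher.run(f"Locate code relevant to: {request}")
        self.history.append(f"findings: {findings}")
        return call_model(self.model, f"Using these findings, do the task.\n{findings}\n{request}")

searcher = SubAgent(model="small-fast-context-model", system="You only search and summarize.")
agent = MainAgent(model="big-reasoning-model")
print(agent.solve("rename the config loader and update callers", searcher))
```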

1:14:05

Speaker B

Because if you can design that into your stuff, then a lot more people will use it. Right now it's just kind of theoretical because you do pay a lot of back and forth coordination costs.

1:14:08

Speaker D

I think you'll net a speedup, though. Right? Even at a basic level, speculative decoding: you're running a small model, you're running two instances, but it's a net win. That is one example.

1:14:18

Speaker B

Yes.

1:14:27

Speaker C

Yeah, but this is a little bit different with agents.

1:14:27

Speaker B

Yeah. This is not speculative decoding.

1:14:31

Speaker C

I think there's a summarization of that trend that I like to say to my team. There are two things. This is the year of system as model.

1:14:32

Speaker A

Right.

1:14:41

Speaker C

Where, instead of having a single model be the thing, you have a system of models and components that are working together to emulate the black-box model. So when you make an API call to something that's, like, multi-agent in the background, it still looks like an API call to a model; you're still getting a response back.

1:14:41

Speaker A

But under the hood.

1:14:57

Speaker C

Yeah, under the hood it's like a billion different models, and that's a lot of complexity. With Dynamo and with other libraries at NVIDIA, we're looking to help manage that complexity.

1:14:58

Speaker A

It's funny, we actually, for CES, just released the model router for DGX Spark, where you can have a local model that's running on the Spark and also a foundation model, and the model router decides when to send queries to which one. So it's no longer this either-or; use the best of everything that's available to you. You have a good post-trained model

1:15:06

Speaker B

that's running on it. Does it lead into the Brev functionality of being able to manage the Spark?

1:15:22

Speaker C

Oh, that'd be cool.

1:15:26

Speaker A

Oh yeah, you'd be able to request it. Yeah, there we go.
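
A sketch of the routing idea described above: a single entry point that decides per request whether the local model on the Spark is good enough or whether the query should go to a larger hosted foundation model. The routing heuristic and the placeholder inference calls are illustrative assumptions, not the actual DGX Spark model router.

```python
from dataclasses import dataclass

@dataclass
class Route:
    target: str        # "local" or "cloud"
    reason: str

def route_query(prompt: str, needs_tools: bool, max_local_tokens: int = 8_000) -> Route:
    # Cheap heuristics stand in for whatever classifier the real router uses.
    if needs_tools:
        return Route("cloud", "tool-heavy request, send to the foundation model")
    if len(prompt.split()) * 1.3 > max_local_tokens:
        return Route("cloud", "context too long for the local model")
    return Route("local", "short, self-contained query; keep it on the Spark")

def answer(prompt: str, needs_tools: bool = False) -> str:
    route = route_query(prompt, needs_tools)
    if route.target == "local":
        return f"(local model) {prompt[:40]}..."     # placeholder for local inference
    return f"(foundation model) {prompt[:40]}..."    # placeholder for a hosted API call

print(route_query("summarize this paragraph", needs_tools=False))
print(route_query("plan and execute a multi-step refactor", needs_tools=True))
```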

1:15:27

Speaker C

I actually have a question, like I'd like to like extend and flip over how much longer do you guys think like agents are going to be running? Because that's one thing I've been throwing around. Like what happens when.

1:15:31

Speaker B

I mean always on it even affects

1:15:39

Speaker D

the, like, back to the prefill/decode, right? Like, Codex, I'd say compared to Claude Code, is much longer at tasks. That thing will run six, seven, eight hours. I'll run it overnight and I'll go back, and I have a little crappy logging software I use, and there are just times where it decides, I'm going to go deep on research, and it'll eat up 80,000 tokens. Go on another, go on another, just eat through tokens. And, you know, that's part of it. At the end it does hit a long-horizon task, and I think you only see that there's

1:15:41

Speaker A

insatiable demand for tokens and every improvement that comes kind of just makes our demand even higher. It's kind of funny, right? If you have a teammate and you ask them to do a task and they're like, should I save some effort and not think too hard about this task? Fuck no.

1:16:11

Speaker D

I'm in my favor level.

1:16:23

Speaker A

Too bad.

1:16:24

Speaker D

You can have four shots, right? Like the original Codex, before the app: why do one call? Give it four attempts. Just use all the tokens. Try more, try again, try more.

1:16:25

Speaker C

It's like the METR index, right? The thing that tracks how long models are able to run. I expect that we'll just see log-linear, if not log-superlinear, growth. We will see, before the end of the year, an agent that is capable of running for longer than 24 hours with self-consistency the entire time.

1:16:37

Speaker D

I would also poke at different domains having different desires. At a consumer level, I'm getting slightly frustrated at 20 minutes per basic query. Sure, you can optimize a six- or eight-hour run, but I don't see myself shooting off many one-week agents. Someone doing, okay, GPU kernel research, or medical or biological work, you know, in those domains, sure, shoot off a lot of tasks that take a long time. So I think it will be somewhat domain specific, because you also really need to train that in.

1:16:56

Speaker C

Right. You know what's funny, one thing is doing your taxes, right? Like, that's taxes. Get it right.

1:17:24

Speaker A

I wonder if a major use case, sort of like speculative decoding, is your agent figuring out what you might be prompting it the next day, and, at night, prefetching.

1:17:31

Speaker B

Yeah, you can already do that.

1:17:40

Speaker A

Yeah. Really?

1:17:41

Speaker C

Branch. Branch prediction.

1:17:41

Speaker B

Oh, well, no, that's too low level, but yes. Sorry. Yeah. One question I've got to get in: we actually did record a part with the METR folks, Sarah, right here. Their chart is the human-equivalent hours of work, rather than how long the agents themselves are being autonomous, and there's a huge difference. Right? Like, human work five hours, agent work 30 minutes: it's actually 30 minutes, not five hours. So that chart you see is them estimating what the human-equivalent replacement is. I think Anthropic actually released a more recent chart that showed Claude Code autonomy from their production traffic numbers, and that was 20 to 45 minutes. That's roughly where we are. So that's the sort of realistic number. I mean, I do think there are experimental setups, like the Ralph Wiggum loop, where you just prompt it to keep going when it stops, and obviously that can go arbitrarily long.

1:17:42

Speaker A

I feel like, from my experience, yeah, I guess 20 to 40 minutes seems right for when I'm using Codex or Claude Code. But then, if I want to spin up a net-new project, I'll often start with Replit, and spin up, like, their new V3 agent. It'll spin up a web browser and click around and discover new bugs and just keep churning. So I think my longest was over an hour of it just churning.

1:18:33

Speaker D

I think before we see super long running, I think there's going to be a bit of an efficiency hit. So sure, you can take an hour and go down paths, but you also want, you want to be more efficient, you want to be smarter in your reasoning.

1:18:59

Speaker B

Right.

1:19:11

Speaker D

So I think that'll actually go down before we go back up. Like you don't want to scale non optimized systems just for the heck of it. As much as I love saying use all the tokens. Tokens, they are expensive. Like going from dense to reasoning models, that's an added cost.

1:19:11

Speaker B

Right.

1:19:25

Speaker D

You're paying for a lot of tokens and it doesn't make sense to just scale stuff that's not optimized. So there's always that little balance. But I think you'll see both sides of it.

1:19:26

Speaker A

Yeah. So 2023 was super exciting. I think if you were in SF, you were like, okay, I know this is going to be a huge, world-changing moment, but it seemed like no one else knew yet. And maybe even before, was it 2022? Maybe. Yeah.

1:19:36

Speaker B

I would say Roon had this tweet that everyone who was in SF from 2021 to 2023 understood what it was like to be early.

1:19:47

Speaker A

Yeah, 2021, that's when I made my first OpenAI account. It was crazy. And I remember, it was so funny, because at the time SF had not been doing well. So pretty much what it felt like was that the concentration of founders in the city had risen, because where my neighbors used to be doing a bunch of stuff, those people had all left. So the only people still in the city were people that really wanted to build. And it was cheap.

1:19:56

Speaker B

It was. Yeah.

1:20:15

Speaker A

It was also way cheaper. I feel really bad for anyone who is trying to get rent now. But there was Celo; they had a huge office.

1:20:15

Speaker B

So Blockchain. It like took over the. The old Casper building.

1:20:23

Speaker A

Yeah, they had the showroom and they had the. Like the. What would I think was like the back warehouse. It was, it was a huge office

1:20:27

Speaker B

and it's right across on OpenAI in Neuralink.

1:20:33

Speaker D

Yeah, it was in the original arena.

1:20:35

Speaker B

I named the arena because of it.

1:20:37

Speaker A

Yeah, yeah. And so it was really exciting, because Roboflow, I think, Mintlify, yeah, Mintlify, Brev was there, you guys were there. I remember it was actually there that you bought the AI engineer domain.

1:20:39

Speaker B

Yeah. I didn't know what I was going to do in AI. I want to do something.

1:20:51

Speaker A

But it was Kind of this. It was a really fun moment where we were kind of all in this cello space. And I don't know, it was a really cool community, especially being so early.

1:20:54

Speaker B

Yeah. And so, Dan, you got me early Cruise access.

1:21:02

Speaker A

Oh, yeah.

1:21:05

Speaker B

So there was a good period of time when both Cruise and Waymo were free.

1:21:06

Speaker C

Yeah, always.

1:21:09

Speaker A

If you had it.

1:21:11

Speaker D

I mean, they're so back. Cellos opened again.

1:21:11

Speaker B

So nature.

1:21:15

Speaker A

Zooks.

1:21:16

Speaker B

Zooks is doing Zooks and Robotaxi. Yeah. So totally.

1:21:16

Speaker A

But yeah. And so it's actually really cool that you guys have this studio so close to Celo, with this rock climbing gym right around the corner. So, yeah, it's an awesome block.

1:21:21

Speaker B

Yeah.

1:21:33

Speaker A

Just. And

1:21:33

Speaker B

I do think one thing I try to do with the podcast is bring what it's like to be in San Francisco to the rest of the world. And also, just, maybe give a shout out to El Tepo taqueria.

1:21:36

Speaker A

Yeah. My favorite tacos in the city. And.

1:21:46

Speaker B

Yeah. Steak and shrimp. I know, it's very good.

1:21:48

Speaker A

Yeah. And I guess what it's like to be in San Francisco, I think, is just that everyone seems to be super supportive. Sometimes I feel like the city believes in you more than you do. I don't know if you remember, but I remember posting my first blog post, and I had met you on Twitter, and you gave me an hour of your time super randomly, and you kind of coached me through writing content for developers. And I was trying really hard not to come off salesy or plug myself, and so I had kind of stripped all personality out of the blog post. And you brought that out. You're like, it's okay to talk about what you're doing, you don't have to be weird about it. And I remember that really helped me figure out what our voice is and not shy away from it. So I'm always really grateful for you.

1:21:50

Speaker C

Hey, you inject your voice into, like, everything.

1:22:28

Speaker B

Huge advance.

1:22:31

Speaker C

Manage to be very genuine about what you care about. Yeah.

1:22:32

Speaker B

Imagine some random person DMs you: can you give me feedback on this blog post? And it's pretty boring. And you're like, fine, he looks interesting, I'll just do a Zoom call. And then you meet this guy, and he's so energetic.

1:22:36

Speaker A

Just be right there.

1:22:48

Speaker B

But I think people are trained to write a certain way in school, and they never see that there's a broader world

1:22:50

Speaker A

out there. They have to unlearn it.

1:22:55

Speaker C

Writing is thinking. And everyone thinks differently. So you might as well just write it your way.

1:22:57

Speaker D

Cool.

1:23:03

Speaker B

Well, thank you for indulging us. A really broad-ranging discussion. But I love that you guys are the young faces of NVIDIA, with so much energy but also a lot of technical depth. And I think people will learn a lot from this session, so thank you.

1:23:03

Speaker A

This is awesome. Thank you, guys. Thank you for everything that you've done.

1:23:16

Speaker B

Yeah.

1:23:19

Speaker A

Nga, the podcast, all the above and

1:23:19

Speaker B

see you at gtc.

1:23:22

Speaker C

Forward to it.

1:23:24

Speaker A

Yeah.

1:23:24

Speaker B

Cool.

1:23:25

Speaker A

It's awesome. Thank you, thank you.

1:23:25