NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

84 min

•Mar 10, 20263 months ago

Summary

NVIDIA engineers Nader Khalil and Kyle Kranen discuss their acquisition journey from Brev to NVIDIA, the development of Dynamo for data center-scale AI inference, and the future of AI agents in production. They explore technical challenges in scaling inference systems, agent security considerations, and NVIDIA's internal culture of innovation.

Insights

Agent security requires limiting capabilities to two of three functions: file access, internet access, or code execution to prevent vulnerabilities
Disaggregation of prefill and decode phases in inference allows for better resource optimization and scaling at data center level
NVIDIA's 'Speed of Light' (SOL) principle forces teams to understand theoretical limits before accepting timeline constraints
The shift from single model inference to system-of-models architecture is becoming the dominant pattern for AI applications
Context length scaling faces fundamental physics limitations that will require architectural breakthroughs rather than incremental improvements

Trends

Disaggregated inference architecture separating prefill and decode operationsMulti-agent systems with specialized sub-agents for different tasksHardware-model co-design for optimized inference performanceCLI-first interfaces for AI agent interactionsAlways-on autonomous agents running for extended periodsTest-time scaling becoming more important than model sizeAgent workflows breaking out of coding into broader business applicationsLocal-cloud hybrid inference deploymentsContext-aware caching for improved inference efficiencySecurity-first agent deployment strategies

Topics

AI inference scaling Agent security architecture Dynamo inference engine Prefill and decode optimization NVIDIA acquisition strategy Speed of Light (SOL) methodology Context length limitations Multi-agent systems Hardware-software co-design Enterprise AI deployment GPU resource management Autonomous coding agents CLI development for AI KV cache optimization Disaggregated computing

Companies

NVIDIA

Primary focus - discussed acquisition strategy, Dynamo development, and internal AI deployment

Brev

GPU provisioning startup acquired by NVIDIA, now integrated as developer experience platform

OpenAI

Referenced for model capabilities and enterprise deployment considerations

Anthropic

Mentioned for Claude and coding agent autonomy metrics

Meta

Discussed for recommendation systems and Llama model training approaches

Google

Referenced for research papers and model architectures

Amazon

Mentioned for using Dynamo in generative recommendation systems

Mercedes

Partnership example for NVIDIA's autonomous driving technology

ServiceNow

Used NVIDIA's Nemotron dataset to train their own models

Cursor

AI coding tool widely adopted internally at NVIDIA

People

Nader Khalil

Director of Developer Experience at NVIDIA, former Brev co-founder

Kyle Kranen

Engineering leader and architect of Dynamo at NVIDIA

Jensen Huang

NVIDIA CEO, referenced for Speed of Light methodology and company culture

Leopold Aschenbrenner

Referenced for 'unhobbler' concept in AI scaling limitations

Brian Cannizzaro

NVIDIA executive who taught about choosing your own path within the company

Quotes

"You really only let an agent do two of those three things. If you can access your files and you can write custom code, you don't want Internet access because that's a vulnerability."

Nader Khalil•Opening

"SOL is essentially like, what is the physics? The speed of light moves at a certain speed. So if light's moving some slower, then you know something's in the way."

Nader Khalil•Mid-episode

"This is the year system as model. Where instead of having a single model be a thing, you have a system of models and components working together."

Kyle Kranen•Late episode

"We're completely happy investing in $0 billion markets. We don't care if this creates revenue. It's important for us to know about this market."

Kyle Kranen•Mid-episode

Full Transcript

4 Speakers

Speaker A

Agents can do three things. They can access your files, they can access the Internet, and then now they can write custom code and execute it. You really only let an agent do two of those three things. If you can access your files and you can write custom code, you don't want Internet access because that's one. It's a vulnerability. Right. If you have access to Internet and your file system, you should know the full scope of what that agent's capable of doing. Otherwise, malware can get injected or something that can happen. And so that's a lot of what we've been thinking about is like, you know, how do we both enable this because it's clearly the future. But then also, you know, what are these enforcement points that we can start to, like, protect?