We're All Addicted To Claude Code
46 min
Feb 6, 2026
Summary
Calvin French-Owen, an early Codex creator at OpenAI and Segment co-founder, discusses how coding agents like Claude Code and Codex are fundamentally transforming software development. The conversation explores architectural differences between CLI-based agents, context management strategies, and how these tools will reshape engineering roles and company structures in the coming years.
Insights
- CLI-based coding agents (Claude Code, Codex) are outperforming IDE-based approaches because they provide better context management, atomic composability, and freedom from traditional IDE constraints
- Context window management and strategic delegation to sub-agents is the primary technical differentiator between coding agent platforms, with Anthropic's approach favoring human-like workflows versus OpenAI's pursuit of longer autonomous runs
- Senior engineers benefit most from coding agents because they can rapidly translate high-level ideas into implementation, while junior engineers need better architectural guidance to avoid poor patterns being amplified
- The distribution model (bottom-up adoption by individual engineers vs. top-down enterprise sales) will determine which coding agent platforms win, similar to how Netscape Navigator dominated through free distribution
- Future software development will shift from integration work (now commoditized) to higher-level concerns like data pipeline automation, personalization, and campaign-level optimization
Trends
- CLI tools are experiencing a 'retro-future' resurgence as the optimal interface for AI coding agents over traditional IDEs
- Context engineering is becoming as critical as prompt engineering for AI agent performance, with grep-based search outperforming semantic search for code
- Smaller teams and solo founders are pushing coding agents to their limits due to runway constraints, while large enterprises face organizational friction in adoption
- Test-driven development is becoming the standard methodology for both prompt engineering and agent-based code generation
- Individual engineers are making architectural decisions that bypass traditional enterprise approval processes, creating tension between bottom-up innovation and top-down governance
- LLM training data composition (e.g., Python monorepos vs. Ruby on Rails) significantly impacts agent performance on specific tech stacks
- Coding agents are creating a new class of 'manager-like' engineers who focus on directing agent workflows rather than writing code directly
- Security and sandboxing approaches differ significantly between OpenAI (strict) and startup-focused tools (permissive), reflecting different risk tolerances
- Agent memory and collaboration features (shared prompts, conversation history) are emerging as critical infrastructure for team productivity
- The next generation of engineers may develop fundamentally different mental models of software architecture due to early exposure to AI-assisted development
Topics
- Claude Code vs. Cursor vs. Codex architectural differences
- Context window management and token budget optimization
- CLI-based vs. IDE-based coding agent interfaces
- Test-driven development for AI agent validation
- Prompt injection and security risks in coding agents
- Code review automation and agent verification
- Semantic search vs. grep-based context retrieval
- Sub-agent delegation and context splitting strategies
- Training data composition impact on model performance
- Bottom-up vs. top-down distribution for developer tools
- Engineering skill requirements in the AI-assisted era
- Data pipeline automation and business logic generation
- Agent memory systems and knowledge sharing
- Debugging complex issues with multi-service architectures
- Future of software architecture with autonomous agents
Companies
Anthropic
Creator of Claude Code; discussed for its approach to context management and human-centric tool design philosophy
OpenAI
Developer of Codex; compared for its reinforcement learning approach and longer-horizon autonomous execution strategy
Cursor
IDE-based coding agent competitor; discussed for semantic search approach and recent model improvements
Segment
Co-founded by Calvin French-Owen; discussed as an example of an integration business now commoditized by coding agents
Y Combinator
Hosts this podcast; mentioned for its engineering team's 50-50 split on security permission practices
Vercel
Mentioned as deployment platform that reduces boilerplate code, enabling faster agent-assisted development
Cloudflare
Mentioned for Workers platform that simplifies infrastructure, enabling more efficient agent-based development
Supabase
Open source Firebase alternative; cited as example of project winning through superior documentation and LLM recommendations
Ramp
Published a blog post about building custom coding agents using an open-source harness for model context
New Relic
Mentioned for offering MCP (Model Context Protocol) integration with coding agents
Sentry
Error tracking tool discussed as potential integration point for auto-generating bug fix PRs via agents
PostHog
Analytics tool mentioned as example of product gaining adoption through LLM recommendations in agent workflows
Slack
Used as example of successful product with simple, durable primitives (channels, messages, reactions)
HumanLayer
YC Fall 2024 company; founder Dex discussed concept of LLMs entering 'dumb zone' after high token usage
Greptile
YC company providing code review bot functionality for agent-generated code validation
People
Calvin French-Owen
Early Codex creator at OpenAI; co-founded Segment (multi-billion-dollar exit); primary guest discussing coding agent evolution
Gary
Co-host of YC's The Light Cone podcast; recently became addicted to Claude Code after a decade-long coding hiatus
Andrej Karpathy
Referenced for tweet about coding agents being super persistent and tending to amplify existing patterns
Paul Graham
Referenced for 'Maker Schedule vs. Manager Schedule' essay; discussed in context of how agents change time management
Jake Heller
Previous YC podcast guest; discussed test-driven development approach to prompt engineering
Quotes
"I feel like when I'm using Claude Code, it's like, oh, I feel like I'm flying through the code."
Gary•Early in episode
"When it's in your CLI, this thing can debug nested delayed jobs like five levels in and figure out what the bug was and then write a test for it and it never happens again. This is insane."
Gary•Mid-episode
"I think everyone who's experimenting with this stuff on like a hobbyist level or at like a very small startup, they're just pushing the coding agents as far as they can go. Because it's like you don't really have time to figure anything else out."
Calvin French-Owen•Mid-episode
"In the future, coding is really going to feel more like talking to a coworker. Like, you're going to send off a question, and then they'll go off and do something and come back to you with a PR."
Calvin French-Owen•Early-mid episode
"I think in some sense you're right that like everyone is going to become a manager in the future. But in order to get there, there are steps along the way. And you have to really build a lot of trust in the model and understand what it's doing."
Calvin French-Owen•Mid-episode
Full Transcript
I feel like when I'm using Claude Code, it's like, oh, I feel like I'm flying through the code. When it's in your CLI, this thing can debug nested delayed jobs like five levels in and figure out what the bug was and then write a test for it and it never happens again. This is insane. I think everyone who's experimenting with this stuff on like a hobbyist level or at like a very small startup, they're just pushing the coding agents as far as they can go. Because it's like you don't really have time to figure out anything else. Like as a startup, you have limited runway. You're just going to orient around speed. I think at a bigger company, you have a lot more to lose. What are some of the tips to become a top 1% user of coding agents? Yeah, what's your stack? Hey everyone, welcome back to another episode of The Light Cone. Gary, are you ready to record? I'm in plan mode right now, but okay, yeah, I guess it's time. Sorry about that. Well, welcome to another episode of The Light Cone. And today we have an incredible guest, Calvin French-Owen. He's one of the first people to create Codex at OpenAI. And before that, he started Segment, which is a multi-billion dollar company that got to a very successful exit. Calvin, welcome back. Thanks for having me. I guess what a crazy time for all of us. I recently got very, very addicted to Claude Code, and I would describe it as: like 10 years ago I was a marathon runner and I loved doing it, and then I suffered a catastrophic knee injury, which is called manager mode, and I stopped coding, which is tragic and horrible. But now the last nine days have been like this incredible unlock of all the things I remember being able to do, and it's like, you know, I got a new total knee replacement. And actually it's a bionic knee and it allows me to run five times faster. What's your take on it? Because you're, I mean, right out there at the forefront of it.
I mean, Codex pioneered a lot of the ideas that now like everyone still uses, and Codex is still evolving too. For brief context, when I was at OpenAI, I was working on the Codex web product. At the time, Cursor was out in the market and they had kind of built this shim around, I think it was Sonnet 3.5, and it was able to work in your IDE. Claude Code had just come out, and it was working as a CLI. And we kind of had this idea like, hey, in the future, coding is really going to feel more like talking to a coworker. Like, you're going to send off a question, and then they'll go off and do something and come back to you with a PR. And so that's where we started with this web view, and that's what we were building. I think directionally, that's still kind of correct for where things should go. But obviously now everyone is coding with CLIs instead. Like they're using those tools a lot more, whether it's Claude Code or whether it's Codex. And I think, at least for me, kind of the lesson in that is I think in some sense you're right that like everyone is going to become a manager in the future. Or at least that's my hot take. But in order to get there, there are steps along the way. And you have to really build a lot of trust in the model and understand what it's doing. You recently came over to Claude Code. What's the transition been like in terms of using it as one of your stacks? Yeah, yeah. So Claude Code is certainly my kind of like daily driver today. And honestly, this has switched every few months. For a while, I was deeply in Cursor. I think their new model, which is really fast, is actually quite good. Then I kind of moved over to Claude Code, especially with Opus. Claude Code is a really interesting product. And I think it's underrated how good both the product and model are working together.
If you study them closely, I think one of the things that Claude Code does in particular that's really amazing is split up context well. And so if you look at, I don't know, things like skills or sub-agents, when you ask Claude Code to do something, it will typically spawn an explore sub-agent, or like multiple ones, and basically each of those are running Haiku to traverse the file system and kind of like explore what's there, and they're doing it in their own context window. And I think Anthropic has kind of like figured something out here around, given a task, does that task fit in the context window or should I actually like split it into many more? And the models are like insanely good at this, which I think gives them really good results. And I think the fascinating thing is, because it's on the terminal, it's the purest form for composable atomic integrations. Because if you came from the IDE-first world, which is where Cursor was and I suppose Codex too, this concept of finding the context more freeform wouldn't come out so natural, right? Yeah. Personally, I was surprised. I don't know how you all feel, but I was surprised by CLIs. It's like a weird retro future that like the CLIs, which are the technology from 20 years ago, have somehow beaten out all the actual IDEs, which were supposed to be the future. A hundred percent. Yeah. And I think it's important actually to Claude Code that it's not an IDE, because it sort of distances you from the code that's being written. Like IDEs are all about exploring files, right? And you're like trying to keep all the state in your head and understand what's going on. But the fact that a CLI is like a totally different thing means that they have a lot more freedom in terms of how it feels. And I don't know about you, but I feel like when I'm using Claude Code, it's like, oh, I feel like I'm flying through the code. You know, it's like there's all sorts of things going.
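The explore-sub-agent pattern described here can be sketched roughly as follows. Everything in this snippet is illustrative, not Claude Code's actual internals: `run_model` is a stand-in for a call to a small, fast model, and the thread pool is just one way to fan the exploration out.

```python
# Sketch of the explore-sub-agent pattern: each sub-agent gets its own
# fresh context window, explores one slice of the repo, and returns a
# short summary; only the summaries land in the parent's context.
# `run_model` is a hypothetical LLM call, not a real API.
from concurrent.futures import ThreadPoolExecutor

def run_model(prompt: str, context: list[str]) -> str:
    # Placeholder for an actual model call (e.g. a small, fast
    # Haiku-class model, as mentioned in the episode).
    return f"summary of: {prompt}"

def explore(task: str, areas: list[str]) -> str:
    def sub_agent(area: str) -> str:
        # Each sub-agent starts with an EMPTY context list: its file
        # reading and grepping never pollute the parent's window.
        return run_model(f"Explore {area} for: {task}", context=[])

    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(sub_agent, areas))

    # The parent only ever sees the compact summaries, so a large
    # exploration costs it only a few hundred tokens.
    return run_model(f"Plan how to do: {task}", context=summaries)
```

The key design choice is the empty `context=[]` per sub-agent: the task either fits in one window or gets split, exactly the decision described above.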
There's like little progress indicators. It's kind of like giving me status updates. But like the code that's being written is not the front and center thing. I mean, dev environments are so messy. I mean, I really like how clean a sandbox conceptually is in Codex. But then I just ran into all these crazy issues like trying to do, you know, run just simple testing. Right. It needs to access Postgres and then it can't do it. Or, you know, my codex.md ended up being 20 lines long, and even then it didn't work. When it's in your CLI, it could just access your development database. I mean, I'm not sure if I'm supposed to do this, but I've actually also had it access my production database. Yeah, yeah. And it can just do it. It's like, yeah, okay, here, like, I looked into it, and I think this happened, and I'm going to debug this, you know, concurrency issue. And I was like, oh my God, like this thing can debug nested delayed jobs like five levels in and figure out what the bug was and then write a test for it and it never happens again. This is insane. Yeah. And I think that distribution mode is frankly underrated. Like thinking about a Cursor or a Claude Code or a Codex CLI, the fact that you can just download it and use it without having to get IT permissions or anything makes a huge difference. And actually I was playing around with a product the other day where you download a desktop app and then it execs the Claude Code that you have running on your laptop and uses that and communicates back via an MCP server to the desktop product. And it's like, this is a very interesting way of now starting to work with your laptop where you don't have to get anyone's permission to do it. You just download the product and go. Yeah, I was looking at like New Relic has an MCP, but, you know, Sentry you can like copy Markdown, but like it's like an auto bug fixer. Yeah, it's right there.
It's super interesting that in a world where things are changing so fast, you really want your product to have a bottoms-up distribution, not top-down, because like top-down is just too slow. Like the CTO of a company is going to be like, have all these concerns about security and privacy and control. Exactly. Versus like the engineers just like install the thing and start using it, like, this thing is amazing. Yeah, I think that's right. The one thing I do struggle with, I mean, I'm like a B2B enterprise guy generally, but I feel like there's some amount of moat that happens when you do that top-down sale. And there's got to be some company who manages to crack it where it's like, oh, this is the thing that everyone has access to. Maybe individual people can take it up. That was the original Netscape Navigator. It was free for non-commercial use. And then people would just download it and use it for commercial use. And then they could just track down the IPs and figure out exactly how many clients were in all of these different companies and say, you should pay for this. You're in violation, but all you have to do is buy a license. Yeah, yeah. I'm curious if you could do that work again here. I mean, your point about distribution is very interesting because now people are probably just making architecture decisions about what to use directly in Claude Code. They might not even know what analytics to use, and it's like, oh yeah, as long as Claude Code says use PostHog, they're using PostHog. 100%. One of the companies who I advise was talking about their GEO strategy. This is like generative engine optimization, or how you show up in chatbots. And what he was saying is funny is one of their competitors had put together a like top five list of like tools in their category that you should be using. And of course, their tool is ranked at the top of this top five list. And like any human looking at this would be like, oh, this is so obviously biased.
It's like the top tool is the one that's in the domain, you know. But the LLMs get fooled. And like they're pulling together a bunch of context and they're saying like, oh, this is the top. And then they'll just recommend it. I think, yeah, if you're selling a developer tool, like having good docs that are out there, like having social proof, like maybe being posted on Reddit a little bit more, all of that helps your case tremendously. Which is why I think a lot of the open source projects have taken off a lot more. I think one of the examples is Supabase, actually. Yeah. Which really took off last year. And part of it is because they have such good open source documentation, how to set up a bunch of stuff. Whenever someone asks how to set up anything that you need, some sort of backend, Firebase type of transaction, the default answer from all the LLMs is actually Supabase. I was trying some of these questions that comes from that. The thing is, it's winning the internet. And it was like that before when it was like Stack Overflow, searching Google. And then now that nobody uses Google anymore, it's like crazy. It's kind of the same deal. I will say it does help open source disproportionately, I would say. I don't know if you all saw there's a Ramp blog post that they recently published about building their own coding agent. They were mentioning that they use OpenCode as a harness because the model can look and see the source code and understand how it's working. And I do this all the time with open source projects. I'll clone the repo and then spin up Codex or Claude Code and be like, hey, give me a walkthrough of what's going on here. And it's really useful. What do you think are some of the tips for anyone that wants to build a coding agent, since you've done it a lot? What are some lessons that you learned that you want to share? I mean, I think the number one thing is managing context well.
Basically, we kind of had like a checkpoint for, I think it was o3, like one of the reasoning models. And then we did a bunch of fine-tuning on it and reinforcement learning, where it's like, oh, you're given a bunch of questions to like solve these coding problems or like fix tests or whatever, implement a feature. And then the model was RL'd to respond to those. And so I think most people are not going to be doing that, right? But the things that you can do are figure out like, hey, what context should I be supplying to this agent to get the best possible result? And so for Claude Code, if you watch it working, it's like, oh, I'm going to like spawn a bunch of these explore sub-agents. They will like search for different patterns in the file system. They will come back. They will have this context. They'll summarize it for me, and then I'll have someplace to go. It's interesting watching like different agents structure this context. I think Cursor takes an approach where they actually do semantic search, where they embed everything and figure out like, hey, what query is closest to this. If you look at a Codex or a Claude Code, they actually just use like grep. And I think that works because... It works really well. Yeah, it works very well, because code is very context dense. If you think about lines of code, it's like each line is probably less than 80 characters. There's not a lot of big data blobs or JSON in your code base. Maybe there's some, but not a lot. You can respect gitignore to figure out and filter out stuff that's just not relevant or is packaged. And you can use grep and ripgrep to like find context around the code, which probably gives you a good sense for what that code is doing. And you can navigate the folder structure. And also, LLMs are really good at emitting very complicated grep expressions that would like torture a human. Yes. Yeah. Yeah. Yeah. This is like the RL in practice. Yes. Yeah.
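A toy version of the grep-style retrieval described here might look like the following. The extension filter and the `node_modules` check are crude stand-ins for real .gitignore handling, and real tools (grep, ripgrep) do this far faster; the point is just that matching lines plus a few lines of surrounding context is often all the agent needs.

```python
# Minimal sketch of grep-style context retrieval: scan source files for a
# regex and return each match with a few surrounding lines. Code is dense
# enough (short lines, no big blobs) that this beats embedding-based search
# surprisingly often.
import re
from pathlib import Path

def grep_context(root: str, pattern: str, around: int = 2,
                 exts=(".py", ".ts", ".go")) -> list[str]:
    rx = re.compile(pattern)
    chunks = []
    for path in Path(root).rglob("*"):
        # Skip non-source files and vendored deps (a stand-in for
        # honoring .gitignore the way ripgrep does).
        if path.suffix not in exts or "node_modules" in path.parts:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for i, line in enumerate(lines):
            if rx.search(line):
                lo, hi = max(0, i - around), i + around + 1
                chunks.append(f"{path}:{i + 1}\n" + "\n".join(lines[lo:hi]))
    return chunks
```

Each chunk carries a `file:line` header plus context, which is roughly the shape an agent feeds back into its window after an explore step.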
And so I think all of that, like if you're trying to build a system... well, I'm trying to build systems that integrate agents for non-coding work. I think you can learn a lot of those lessons and say like, hey, how do I get my data in the format that is maybe closest to code, where the model can peek and look at areas around it and get the right structured data. So given that a lot of the superpower for the best coding agents is context engineering, what are some of the tips to become a top 1% user of coding agents? Yeah, what's your stack? Yeah, what do you do to be so productive with it? One is if you're able to use just generally far less code and plumbing. So a lot of what I do is like deploy stacks on like Vercel or Next.js or like Cloudflare Workers, where there's kind of like already a bunch of boilerplate taken care of for you. And then you don't really have to think that much about like, hey, I need to stand up like all these different services and deal with like service discovery and like registering on like some sort of central endpoint or like all these databases. It's like, oh, like everything is pretty roughly defined in this like one or 200 lines of code. I tend to operate more towards microservices for that as well, or like individual packages that are fairly well-structured. I think it's also worth knowing like what the LLM superpowers are. Like in general, coding agents are, I think Andrej Karpathy just tweeted about this, they're like super persistent. So they will keep going no matter what. They end up typically just making more of whatever's there. So if you're trying to direct them to do something, it's worth, like, I mean, I can pick on OpenAI slightly in this example, OpenAI has like a giant monorepo. It's been there for a few years now and has like, I don't know, thousands of engineers who are committing. Some of those engineers are like super senior Meta folks who came in and like know exactly how to write production code.
Some are like new PhDs. It's like a pretty wide range. And so the LLM will pick up different things depending on where you direct it. I think there's a lot of room actually for coding agents to figure out like, what is the optimal type of code that we should produce? I mean, obviously giving the model a way to check its work helps improve performance drastically. So the more that you can run tests and lint, CI, etc. Personally, I also use code review bots pretty aggressively. I know like Greptile, a YC company, is really good. The Cursor Bugbot has gotten quite good. And I actually like Codex for code review as well. I find it does a very good job on correctness. So those are all things that like the agents are good at. And they're excellent at exploring the code base too. I think areas where they don't do well: they make more if your goal is not to make more. They'll often duplicate code and spend a bunch of time re-implementing things that you're like, oh, of course you didn't want to do this. I think context poisoning is a real thing, where it kind of goes down one loop and it will continue because it has this persistence. But it's referring back to tokens which are not right in terms of pursuing a solution. And so one thing that I often do is like very actively clear context. How often? Usually when it gets above like 50 tokens. Oh wow. Yeah, yeah. I don't know, there's this guy Dex from this company HumanLayer that was actually another YC company. Yes, YC company from fall '24. Yeah, yeah. And he talks a lot about it. Yeah, he has this concept of like the LLMs reaching the dumb zone, where it's like after a certain amount of tokens, it just starts like degrading in quality. And I actually think that's very true, especially if you think about like how the reinforcement learning might work. Like imagine you're a college student, you're taking an exam. In the first five minutes of that exam, you're like, oh, I have all the time in the world. Like I'll do a great job.
I'll think through each of these problems. Let's say you have like five minutes left and you still have half the exam left. You're like, oh man, I just gotta do whatever I can. Like, that's the LLM with a context window, right? One of the tricks that I think founders use is you put like a canary at the beginning of the context. There's something very esoteric that I would only know. It's like something really funny. It's like, I don't know, my name is Calvin and blah, blah, blah. I drink tea at 8 a.m. Some random fact. And then as you keep going, you ask it, do you remember what's my name? Do you remember when I drank tea? And then when it starts forgetting that, I think it's a bit of a sign that the context is poisoned. That's like one trick I've seen people do. They do a random canary. I have not tried this, but I fully believe it. That's interesting. I haven't run across any bugs before compaction, but maybe I'm not paying attention. But you're saying like that actually is actively something, that it just starts doing weirder things that are not like optimal. Yeah, yeah. Okay, I got to be on the lookout for that. It seems like it should be solvable within Claude Code itself. Like it should be able to basically do some sort of detection, like what Diana said. You need to do your own internal heartbeat around it, around the context. Yeah. And I think we're just not there yet. Like, I agree with you in the limit. Right now, it's definitely hard to manage context well. And I think kind of the way it gets around it is, like, split up context windows and then try and merge everything. But you're sort of still at the limit right now of, like, everything that lives in context at the end of a Claude Code session is kind of fixed. It's actually interesting. The Codex approach is kind of the opposite, and they just wrote about this on the OpenAI blog, where it will run compaction periodically after each turn. And so Codex can continue to run for a very long time.
And if you look at the percentage in the CLI, you'll see it move up and down as compaction runs. I guess these very different architectures between Claude Code and Codex sound like they're actually deeper, in that Codex is actually meant for much longer running jobs. That's sort of, like, off the bat, a different use case. And then the architecture is very different as a result. I guess right now, it seems like CLIs, you know, 2026 might be the year of the CLI. But then there's this other idea that AGI is here, and actually ASI is around the corner. The coding agents right now are really, really smart, but not smart enough to run on their own for long periods of time. But a 10x increase in compute from here, are we there? Like, are we at 24-hour or 48-hour running jobs on Codex? And that architecture is correct for that world. Yeah, I think it's a good question. It sort of goes back to like kind of the founding DNA of both companies. Like I feel like Anthropic has always been very big on like building tools for humans, where it comes to like, oh, here's the style of the tone. And like, here's how it should fit with all of the rest of your work. And I think Claude Code is like a very natural extension of that. In a lot of ways, it like works like a human would, where it's like, oh, you need to build like, I don't know, a doghouse or something. It's like, oh, I'll go to the hardware store and I'll buy all these materials and I'll like figure out how they all fit together. Whereas OpenAI really leans into this idea of just like, we are going to train the best model and reinforce over time and get it to do longer and longer horizon things in this pursuit of artificial general intelligence. And so it may not work like a human at all. Like going back to the doghouse example. But AlphaGo didn't either. Yeah, but AlphaGo didn't either.
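The per-turn compaction mentioned a moment ago can be sketched as a loop that folds the oldest turns into a running summary whenever the history exceeds a token budget, which is what lets an agent keep running far past its window. The `summarize` callable and the 4-characters-per-token estimate are stand-ins, not anyone's actual implementation.

```python
# Sketch of per-turn compaction: after every turn, if the history exceeds
# a token budget, merge the oldest turns into a summary so the agent can
# keep going indefinitely. The percentage shown in a CLI would correspond
# to used_tokens / budget, moving down each time compaction fires.
def rough_tokens(text: str) -> int:
    return len(text) // 4  # crude chars-per-token estimate

def compact(history: list[str], budget: int, summarize) -> list[str]:
    while sum(rough_tokens(t) for t in history) > budget and len(history) > 2:
        # Fold the two oldest entries into one summarized line; recent
        # turns are kept verbatim since they matter most.
        merged = summarize(history[0] + "\n" + history[1])
        history = [merged] + history[2:]
    return history
```

Run `compact` after each turn and the history oscillates under the budget instead of growing until the model hits the "dumb zone."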
It's like, oh, instead I will have a 3D printer that can print from scratch like a doghouse, and it will be exactly what you want. And it will take a long time and it will be like very custom and it will do like weird things, but it will work, you know. And like maybe in the limit, that's the right call. And so it's going to be really interesting to see how they play out. I mean, net net, it seems like the latter is somewhat inevitable, but I like the former so much. You know, like even this idea that it greps is like what I thought about, you know, 10 years ago. It was like, yeah, I was in there like writing my own really weird regexes to try to figure out where everything was when I was refactoring or trying to understand code or whatever. So that's the feeling I get when I'm using it. It's like I can do five people's worth of work in like a single day. It's like rocket boosters. It's unbelievable. Yeah. I think it's going to be really interesting to see how this plays out across large and small companies. I think everyone who's experimenting with this stuff on like a hobbyist level or at like a very small startup, they're just pushing the coding agents as far as they can go. Because it's like you don't really have time to figure out anything else. Like as a startup, you have limited runway. You're just going to like orient around speed. I think at a bigger company, you have a lot more to lose. And you have all these other internal processes around code review and you probably already hired like a big eng team. And I think it's going to be very strange as like these individual teams of like one person are like, hey, that team over there isn't doing the right thing. Like, let me just build a prototype that like works better. I think at some point it's going to start working better. And I think that landscape shift is going to be a very interesting, strange thing. My 10-year-old, you know, he has writing assignments every day.
And then yesterday was the first day where he used AI. and then I was like this is not a turn of a phrase that a 10 year old is capable of doing and then I think about that in this context because we you know we're working with a lot of 18 to 22 year olds who you know they've done internships but like they haven't done like eng manager work like you know we're saying you know post-product market fit once you have job queues of like millions of jobs and like you know hundreds of thousands of errors that's like real eng management like that's really you know it's horribly unglamorous like combing through hundreds of thousands of errors and then like manually making sure that like the thing works for all of your users in the background how does the next generation understand that can the cloud code bot actually teach people about uh architecture and things like that or you know are you just gonna bump your head into it and users just kind of suffer and you know have to figure it out like at least where I find myself spending the most time when it comes to product is figuring out the kind of product model in a sense. Like what are the things that the user has to understand today? And what are the primitives that they can use to like do whatever they want? I always think of Slack like this. It's like Slack was in some ways not really a new concept. It's like there were many chats that existed before it. But the fact that they had like channels, messages, and reactions in a simple way that people could just like think about and be like, oh, I understand how to like navigate this. It made a lot of sense for people. But then kind of once they were there, like it's very hard to change that later on for a user. You know, it's like, oh, maybe they wanted to go in more of like a document first way or like maybe right now they're trying to incorporate agents. It's like difficult to change the user's mental model. 
And so, at least for myself building products, you have to think about that very carefully from an early stage. Because again, whatever you supply to the coding agents as that kernel is going to be what they run with and make more of forevermore.

YC's next batch is now taking applications. Got a startup in you? Apply at ycombinator.com/apply. It's never too early, and filling out the app will level up your idea. Okay, back to the video.

Do you have thoughts, since you know the agents so well, on what types of engineers are going to benefit more than others from these tools becoming popular?

In general, I think the more senior you are, the more you benefit, because the agents are so good at taking some sort of idea and then putting it into action. If you're able to prompt that in a few words, it's like, oh, now suddenly I had this idea. I found this so often at OpenAI, strolling through the code base: here's the thing I wish were different, here's the thing I wish were different, here's the thing I wish were different. Just being able to kick those off and then have them come back is super empowering and multiplies your impact. I think also being able to detect which sorts of changes are good or bad architecturally is very important, or having a sense for where you might want to flag something to an agent. I think engineers who are more organized, more manager-ish. And there's probably just a missing product to be built here. Maybe something like Conductor, where it's spread across all of your sessions and reminding you: hey, you were working on this thing, it's done, it needs your input here; oh, you should switch your attention over to this other thing. I think that is going to become...

Oh, Conductor should add that. Yeah.

Yeah. Context management for agents. But we also need context management for humans.

Yes. A hundred percent. Yeah.
I mean, I want like when I wake up every day, it kind of is like, hey, here's all the work that got done overnight. Like here are the like three decisions that you need to make. Here are like areas of deep thinking that you were planning to do. Like I want the turn by turn for my day, you know, other things that make it very useful. Like if you're able to build, I don't know, some sort of like quick prototype for an idea to show it off. Like that's an area, I mean, obviously the agents do super well at this. I would find myself at OpenAI often writing kind of like prototype code or like, hey, I've got this like in-memory key value store. Can you now turn it into like work with a production database or something like that? Being able to concisely specify ideas in code. And I think having a smell for what the right architecture is, is still the area where the models like don't do the best job. So if you were going back to your college days and studying CS again fresh and you were picking your own syllabus or curriculum, what would you study? Personally, I think still understanding systems is very important. And just having some conception of how Git works, you know, or like HTTP or databases, like queues, like all of these different systems. I think that those fundamentals are still quite important. 
The other thing I'd probably do is have a semester where each week you're just building something, and you really try to push the models as far as they can go. There's a sense, whenever you're doing something, that you could always go up a layer and ask the model to do it, and then go up another layer and ask the model to do that. It's like, oh, I have an implement command where it implements the next phase of the plan. But then I could have an implement-all command, and it goes stage by stage and creates a new sub-agent for each one. And then I could have a check-your-work kind of thing. And knowing where the models can and can't accomplish that is such a moving target that it's worthwhile just to tinker a lot.

The other thing that's really, really crazy: I would love to be able to teach 18-to-22-year-olds. Everyone around this table has shipped stuff that people really, really want and love. So how do we teach people that? I wonder if the best 18-to-22-year-olds five years from now will just have off-the-charts taste in everything, because they'll be so much more prolific. They should be, right? They should just be launching and touching reality like 10 times as much as the generation before them.

The one thing I have wondered about on that note, and I don't know if you all found this: growing up, my mom used to tell me, oh, stop multitasking, you're not paying attention to what I'm doing. And I think there is some truth to that; often I would be off on my computer, not paying attention. But I do think I was legitimately better at multitasking than our parents were.
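The implement / implement-all commands described above can be sketched as a simple loop. Everything here is a hypothetical stand-in (there is no real Claude Code or Codex API in this snippet): each plan phase is handed to a fresh sub-agent, followed by a separate check-your-work pass.

```python
# Illustrative sketch only; run_agent and the command names are
# hypothetical stand-ins for dispatching work to a coding agent.

def run_agent(prompt: str) -> str:
    """Stand-in for sending a prompt to a fresh agent session."""
    return f"done: {prompt}"

def implement_all(phases: list[str]) -> list[str]:
    results = []
    for i, phase in enumerate(phases, start=1):
        # Fresh sub-agent per phase, so each one starts with a clean
        # context instead of accumulating every prior diff and transcript.
        results.append(run_agent(f"Implement phase {i}: {phase}"))
        # Separate "check your work" pass over what was just built.
        results.append(run_agent(f"Check the work from phase {i}"))
    return results

print(implement_all(["add the job queue", "wire up retries"]))
```

The point of the sketch is the shape, not the plumbing: each step is an atomic prompt, so you can keep going "up a layer" by wrapping the loop in yet another command.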
And now I look at this new generation, and I think they're actually quite a bit better at multitasking than we are, because they've grown up in this age of the internet, and they're dealing with TikTok and all these different short-form videos and things. It seems like there's room for both: this deep-thinking mode, where you want to notice what you're seeing and understand and problem-solve, but also this mode of just bouncing between a bunch of different things, where you're context-switching constantly.

That ADHD mode. Yeah, the new generation is quite good at this.

Yes. If anything, there's a type of smart person, maybe it's ADHD, who always has a bunch of good projects on the go but just never actually finishes anything. I might relate to this personality a little bit.

Hey, you released your vibe code.

Yeah, but I wouldn't have, except because of Claude Code. That's not my point, though. Now I just think there are certain types of brains that have like 10 branches going in their head, but you never have enough hours in the day to actually see any of them through, so they're always half complete. And now Claude Code gets you over the line with everything. And you made this point in your blog post about how it feels like a video game, but there's just a constant novelty factor. You start working on something, and usually when you hit the point of, I'm bored and I've got this other better idea and I should start on that and then come back to this, now you can actually do that. Everything can actually get finished.

Let's live in the future for a moment. It's 40 years from now. Software still exists. Databases still exist. Access control still exists. But at the core of it, software is entirely personal.
Access control, and who gets to do what, is this sort of manager-mode thing that people still have meetings about. But everything else about a company, its functions, its roles, is defined by people just doing things in their own Claude Code-like thing. I don't know, maybe it's a CLI, or it's, you know, having giant armies of workers. What would that look like?

Imagine if every time a company signed up for Segment, you forked the code base. You give them their own copy of Segment, running on their own servers. And if they want to change anything about it, they just tell some chat window, which is running an agentic coding loop, and it just edits their version of Segment. As Segment the corporation pushes out more features, some agent figures out how to merge.

Yeah, I could totally see it. It's sort of what I've been thinking, and I don't know how far out this future is: eventually every person who's working has their own sort of cloud computer and set of cloud agents who are running for them, and they're mostly just talking back and forth. It's kind of like having a super EA: oh, here are the things I need to pay attention to, let me make some quick decisions, let me spend more time on this, let me meet with other people. Because I think there's still going to be room for people who want to meet other people and exchange ideas in person; at least I get a lot of fulfillment out of that. And then separately there's going to be this army of agents who are doing things on your behalf and automating a bunch of things. I think the average company is probably going to get a little smaller, and there are going to be many more of them doing more things.

Something I'm curious to see is what the updated version of the PG maker-schedule-versus-manager-schedule essay would look like. Because I feel like part of what's going
on at YC is that a lot of our jobs are essentially manager schedule, which made it really hard to do any sort of building your own software. But now you totally can, and that's why a bunch of the partners just do it in meetings, like right at the beginning of this podcast: you let it run and then come back.

Well, in the pockets, right? It just used to be that unless you had, you know, a four-hour minimum block free to do something, it wasn't worth even getting started, right? And I think that actually goes very deep into how we've changed programming. It used to be that in order to write any code, you had to fill your own context window with so much data about all the different class names and the functions and the code that it touches. It would take hours to build up that context window. And so doing it in 10-minute snatches was just so frustrating.

I do think maybe one primitive for this future world: the data models still need to be consistent, and the system of record. There's opportunity for something that's agentic first, because right now we're still integrated very much with databases and SQL or NoSQL queries at a very low level. But imagine something that generates all the data that you need for all the different views for custom software. A lot of the world would be custom views, but for the unified stuff, we still need the data to be correct.

I think data has a lot of gravity. And you see this with companies who are offering access via API or MCP. I think Slack locked down their API a little bit because they didn't want people just exfiltrating everything from Slack and then building agentic experiences on top of it.

On that note, I wonder: if you were to rebuild Segment with the current tools, what would it look like?

I mean, Segment is a funny business, in that where we started was building these integrations, right?
And so it's like, oh, you need to wire up the same data going to Mixpanel and Kissmetrics and Google Analytics, et cetera. Just writing that code used to be a more annoying or harder thing to do, and so it was worth paying for. Now that value has dropped to zero.

One shot. Yeah.

And actually, in many cases, you're better off saying, oh, I actually want to map it this way, and I want this specific behavior. I will just tell Claude or Codex what to do, and then it will do it, and I'll have exactly the behavior that I want. So for that aspect of Segment, the value has dropped precipitously. The aspect of keeping this data pipeline running and continuing to automate a bunch of parts of your business, like scheduling the email deliveries which should go out through Customer.io every time a customer signs up, or managing audiences for you: that value is kind of still there. And I think you could do a lot more interesting things, where it's like, hey, if I have all this data and a full view of the customer, how should I be emailing them? Should I change parts of the product when they log in? Should I be giving them different onboardings depending on who they are? There's a lot more interesting stuff that you could do by basically running, I don't know, small LLM agents over them. Those would be the changes I would make.

So it's kind of like moving up the stack, to your comment earlier, and turtles all the way down. The low-level stuff is gone. It's now really more about doing things at the campaign level, which is way more abstract.

Yes. I mean, I'm amazed at the degree to which Claude Code, even just from the context of what I'm working on, figures out what my motivations are. Yeah.
I'm still blown away by coding agents, because effectively what you're doing is giving them a copy of a repo and then slipping a little note under the door: hey, go implement this thing. They have no knowledge of what your company is, or what you do, or who your customers are. In most cases. Maybe it's in the training set, because they know you're Gary. But it blows my mind that it works at all. And that's where I think the context is really important, right? Because if it latches onto something that isn't quite right, it doesn't have a lot to go on. And if it misses something that's essential, it's going to just re-implement it.

What do you think the constraints are right now? Context window is still a constraint, but it's so big that we can do some stuff. We can't do the mega re-architectures, but we can do a lot. And then Opus 4.5 somehow got a lot smarter, and that unlocked a big thing, which was interesting. I have no idea if that was pre-training or post-training. What are the other levers that you think of, other than basic frontier model intelligence and context window?

I still think context window is probably the number one limit. If you look at Claude Code executing, it's delegating to all these different context windows, and at the end of the day, when each one comes back, it's getting some sort of summary. So it's also not getting the full picture. If you have a problem that's just too big to fit in a single one, no amount of compaction is going to help you. I would point to that as: Anthropic has figured something quite useful out with delegating to these sub-context windows, but I think it's still a blocker.

So we'd do better if we had a million-token context every single time? Yeah, I think so.
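The delegation pattern described above can be illustrated with a toy sketch: the parent agent never sees a sub-agent's full transcript, only a short summary, so any detail that does not survive summarization is lost to the parent. No real agent API is used here; everything is a stand-in.

```python
# Toy illustration of sub-agent delegation with lossy summaries.

def sub_agent(task: str) -> dict:
    # Pretend the sub-agent produced a long exploration transcript.
    full = f"transcript for {task}: " + "step " * 1000
    summary = full[:40] + "..."  # aggressive compaction
    return {"full": full, "summary": summary}

def parent_agent(tasks: list[str]) -> str:
    # The parent's context accumulates only the summaries,
    # never the full transcripts.
    return "\n".join(sub_agent(t)["summary"] for t in tasks)

print(parent_agent(["map the codebase", "find the bug"]))
```

This is why a problem that genuinely needs all the detail in one place cannot be rescued by compaction: the information is gone by the time the parent sees it.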
And like figure it out a better way to especially train these like very long context trajectories. Because if you think about it, like there's a lot of training data on the internet for like what is the next sentence that comes or like what's the next paragraph that comes. If you have 80,000 tokens that are generated, like understanding what the next thing to do based upon like, oh, I should refer to the 20,000 token. Like that's trickier. I think this like integration and orchestration is starting to become the limiting factor. I mean, I think there are like stuff on code review related to this. It's like, oh, if we're like merging all this code, like who's watching it? Does a human still have to watch it? Like, how do we verify the changes? And then I think like pulling in the context correctly from your tools, like you were talking about Sentry. Like you want Sentry to auto be able to like figure out a PR, you know, and then like maybe it pushes it to the subset of your traffic. And if it looks good, then it rolls out everywhere, you know, like all of that automation still has to be built. I was surprised how important testing was. I was operating for the first two or three days of my nine days in the wilderness. No tests or very few tests. And then one day I was like, all right, today's refactor day. I'm going to get to 100% test coverage. And then I just sped up like crazy. It was like, oh, it did it. It works. I rarely even have to necessarily manually test because it's like the test coverage is so good. Nothing breaks. Which is very similar to what all the companies are doing just for prompt engineering outside of coding is very much test-driven development. I think we had this episode with Jake Heller, and that was a big paradigm shift. It's like the way you get a good prompt is all test-driven, just like evals, right? In a sense, the test cases are your evals. There are some broken flows now. 
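The test-driven approach to prompting mentioned above can be sketched as plain assertions: each eval case pins an expected behavior, and a prompt change only ships if every case still passes. Here `classify()` is a hypothetical stand-in for a model call, not a real API.

```python
# Evals-as-tests sketch: the eval cases double as a regression suite.

def classify(ticket: str) -> str:
    """Stand-in for an LLM call that labels a support ticket."""
    text = ticket.lower()
    if "refund" in text or "charged" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "other"

EVALS = [
    ("I was charged twice this month", "billing"),
    ("The app crashes on startup", "bug"),
    ("How do I change my avatar?", "other"),
]

failures = [(t, e) for t, e in EVALS if classify(t) != e]
assert not failures, failures
print(f"{len(EVALS)} evals passed")
```

In the coding-agent setting, the same loop applies with a real test suite: the agent iterates until everything is green, which is why high test coverage speeds it up so much.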
I think we might need a Claude Code that could talk to a Stack Overflow, like a Claude Code Stack Overflow. I had this problem, and it was so crazy. For the priority of a job queue, the code used (and again, I did not write this, the machine wrote it) a string with a comma, thinking that it would take that syntax, but it was expecting an array in JSON. And then no jobs would run. And I watched it for 30 minutes walk through the internals of Rails, through Active Job, a couple thousand lines of code, trying to debug what was happening. And it found the bug, actually. I was like, that's amazing. I just think about what I would have done 10 years ago. I would have been like, hey, why are the jobs not working? And then I would find a Stack Overflow answer or a Rails blog post, and it's like, oh yeah, nobody fixed that stupid bug where you think you can put a comma-delimited string in there, but actually you have to make sure it's an array. I was like, oh my God. That was very funny, actually.

I think that's one of the hardest parts about thinking about what's going to happen here, because there are things that you would do as a human in a CLI right now, and that's very obvious.
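The comma-string bug from the story above can be shown in a few lines, in Python rather than Rails for the sake of a self-contained sketch: a queue option that must be a list, silently broken by a comma-delimited string, because iterating the string yields single characters instead of queue names.

```python
# Illustrative version of the "comma string vs array" bug.

def queues_to_poll(queues) -> list[str]:
    # Naive code that assumes it received a list of queue names.
    return [q for q in queues]

broken = queues_to_poll("high,default")      # single characters
fixed = queues_to_poll(["high", "default"])  # actual queue names

# A defensive fix: accept both shapes instead of failing silently.
def normalize_queues(queues) -> list[str]:
    if isinstance(queues, str):
        return [q.strip() for q in queues.split(",")]
    return list(queues)

print(broken[:3])
print(normalize_queues("high, default"))
```

The failure mode is exactly the one in the anecdote: nothing crashes, the poller just never matches a real queue, so no jobs run.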
But even with that idea of whether the agents should have their own Stack Overflow: if you just increase the intelligence by, I don't know what you'd even call it, 10 virtual IQ points, would it even need that? It would just be like, oh yeah, that's a string, whatever.

Yeah. I think there's something very interesting here around agent memory. Claude Code has sort of set itself up, and I think Codex too, by storing all your conversation history just as files. So you could imagine you give it access to a tool that can then read previous conversation history. I think there's a missing piece around collaboration there. It'd be amazing if there was some way of smartly sharing your coworkers' prompts, and you could see and be like, oh, I hit this thing, but actually Brian over there fixed it earlier, so the two of us can share knowledge.

I think there's something to this idea of a model-generated wiki, a Grokipedia. Now I can't stop thinking about it. Have you seen the Clawdbot social network, the network for Clawdbots to talk to each other?

No, what's that like?

That's the evolution of Moltbot. Yeah. For those that don't know, Clawdbot is essentially your own personal AI agent that you can run on your own machine. You can download it. Do not give it access to emails would be my number one piece of advice, or probably to anything, because it's not clear how safe it is, and there are probably a lot of people being prompt-injected by it right now. But somebody created a website (I haven't actually seen it, I was seeing it on Twitter) where everyone can spin up their own Clawdbot, their personal agent, and then the agents can talk to each other. And now there's all this AI-generated content of these personal AI agents talking to each other. Yeah.
I mean, it looks like Reddit, but if Reddit were run by agents, I mean, it's interesting to see like Codex's personality shine through when writing code, I would say. It does most stuff that humans don't do kind of in this AlphaGo sense where it's like, oh, it'll write a Python script to like modify some part of the file system. I think that is like very interesting and kind of alien behavior, which has been taught and learned. Yeah. But it does give these superhuman results, for me at least, when debugging complex issues that I find Opus often misses. What's an example of a complex issue that you could talk about? I mean, it's like concurrency or naming issues, right? I find the models are actually decent at concurrency. Oftentimes there's stuff where it's like, oh, there's a request that is traversing several different services. I mean, kind of to your point about the serialization and deserialization of stuff with commas in it. it's like, oh, it needs to track some sort of complex behavior around those or like way of, I don't know, refreshing complex UI state. And Opus often will miss it if there's many files, but Codex seems to catch it. Interesting. Yeah. Yeah, prognostication about how will tools continue to evolve. It's very interesting. Like, I feel like sort of a new citizen in this land in a way. Like, I just, you know, knew what was happening. I, you know, managed your schedule. Finally, a project appeared and was like, oh, I'm going to go all in on this. And then now I'm like, it's like I'm in a stranger in a strange land. But it like resembles exactly what I remember. I think we all feel that way. Like I think the most important thing is just to keep tinkering because it all changes every few months. I do feel like the best or the people who will get the most out of coding agents in the future are going to be kind of like more manager like where they're focusing on directing flows in certain ways. 
They're probably going to be a little bit more like designer-artists in some ways, where they're figuring out what specifically goes in the product and what stuff you can do without. And I think they'll be very good at continuing to think about automation and where they're missing context.

What's funny is I tried to use Codex just now for my Rails project. But it's kind of obvious that nobody at OpenAI cares about Rails, which is fine. It's a vestigial language. It's very strange; it just happened to be the one that I really, really went deep on 10 years ago. And it's funny how much of it is, again, anyone can make something, but the something people want is very hard. Even when you have unlimited resources, at an OpenAI. I guess if someone from Codex is watching right now, my request would be: go down the list of all of the runtimes and just add syntactic sugar. This is probably 10 PRs at most for, I don't know, the top 15 runtimes. I guess it's a reminder that there are far fewer excuses for software that doesn't quite work for a user now than ever, actually.

Yeah, I do think this is an interesting point in terms of the mix of training data. Codex works very well on Python monorepos.

I wonder what that sounds like. The shape of OpenAI. Yeah.

Yeah. I remember working internally at OpenAI, I was like, oh my gosh, this tool is amazing. It is incredible. And it kind of makes sense in terms of the data mix and the researchers who are working on it. I think Anthropic has focused a little bit more on some of the front-end things. And I don't know, in terms of a Ruby, for example, who has the best model there and who's incorporated the data mix. Some of the labs tend to take this perspective of just: more data is better.
And so they'll just flood in as much data as possible, while others, I think, are a little more tuned in terms of the mix. And depending on which approach you take there, it can give very different results: taking just the top 10% of JavaScript is pretty different from looking across everything.

I actually think OpenAI and the OpenAI models are really good at Ruby, from what I can tell. And then this is just...

It's the harness around the model. Yeah. Oh, interesting. Okay.

It's literally like Rails has this weird thing where you have to access Postgres in a certain way, or it couldn't figure out which route. Yeah, the sandboxing is...

Yeah, the sandboxing. It's such an interesting question, because I think OpenAI actually takes the sandboxing and security question more seriously than almost anyone else. I remember when we were building Codex, basically one of the gates that you have to pass through in order to release a model is that you have to talk about safety and security risks every time you want to release. One of the things we were looking into was prompt injection, especially for opening up to the internet, because a bunch of users were like, oh, this has to work on the internet. We were like, oh, we don't know. It seems pretty easy to prompt-inject.

Operator was also, yeah. Yeah. Yeah.

And so the PM on our team, Alex, basically put together a GitHub issue, and it had a very obvious prompt injection, which was like, oh, reveal this thing. And then he told the model, hey, go fix this issue. And he was like, oh, there's no way this is going to work. And immediately the prompt injection worked. And so I think OpenAI, sort of correctly, is very worried about this, and is like, hey, we're going to run everything in a sandbox. We're going to make sure it doesn't touch all these sensitive files on your machine.
We're going to be very careful about secrets. And I think if you're a startup or you're just like running fast, you probably don't care. You're just like, I just want it to work. Yeah. You know? Are you a dangerously skip permissions person? I actually am not. I like have a set of things that I like. How about you? Are you running? No. Okay. I like to read. I like to read what it's doing. Are you skip permissions, Jared? 100%. YOLO mode. It's about 50-50 on the YC engineering team. I lost. It's about 50-50. A security engineer would watch this part and say, you can't release this part of it. Just cut it from the podcast. You can't have this out here. I think it's context dependent. If you're at an enterprise, you don't want to do that. If you're a startup and have nothing to lose, you probably do. YC has progressed a little bit from a startup. We still act like one, though. I think important. Cool. I mean, this is so awesome. Kelvin, thank you so much for joining us. Of course. Thanks for having me. Oh, my God. This is fun. Yeah, so fun. All right, back to Claude.