The Secrets of Claude's Platform From the Team Who Built It

43 min

May 8, 2026

Summary

Angela and Caitlin from Anthropic's platform team discuss the evolution of Claude's platform from simple API endpoints to managed agents with infrastructure, memory, and autonomous capabilities. They explore how platform design choices create path dependencies in model behavior, the infrastructure challenges developers face when productionizing agents, and their vision for a future where Claude self-optimizes based on outcome and budget parameters.

Insights

Platform abstractions must evolve with model capabilities—from completion endpoints to stateful agents—to help users achieve better outcomes with less manual engineering
Infrastructure and productionization are the primary pain points for agent builders, not harness engineering as commonly assumed
Model-specific harness optimization creates significant performance gains but locks models into particular primitives, making generic hot-swapping increasingly impractical
Multi-agent orchestration enables experimentation with different reasoning strategies (advisor, adversarial, swarming) optimized for specific use cases like research or debugging
Successful internal agents require clear ownership, human-in-the-loop oversight, and organizational structures that allow non-technical teams to iterate via Claude rather than direct code changes

Trends

Shift from generic model-agnostic harnesses to model-specific architectures that optimize for each model's strengths
Platform consolidation toward higher-order abstractions that hide infrastructure complexity from end users
Emergence of multi-agent systems as a standard pattern for complex workflows requiring different reasoning strategies
Growing importance of verifiable outcomes and budget constraints as primary success metrics for autonomous agents
Internal AI platforms becoming competitive advantages, with companies like Stripe and Ramp building end-to-end development platforms on agents
Managed agents enabling non-technical teams to participate in AI automation through conversational interfaces rather than code
Path dependency in model development: early design choices around primitives (file systems, skills, tool use) shape model capabilities and specialization
Infrastructure-as-a-service becoming a critical differentiator for AI platforms as developers move from prototyping to production
Agent lifecycle management and deprecation becoming operational challenges as organizations accumulate autonomous systems
Layered abstraction patterns emerging where multiple Claude instances orchestrate to prevent non-technical users from foot-gunning core systems

Topics

Claude Managed Agents architecture and infrastructure
Platform evolution from APIs to stateful agent systems
Model-specific harness engineering and optimization
Multi-agent orchestration and reasoning strategies
Infrastructure scaling for long-running autonomous agents
Agent success metrics and verifiable outcomes
Internal AI platform design and governance
Human-in-the-loop agent workflows
Model lock-in versus flexibility trade-offs
Prompt caching and context window optimization
Skills and tool design for agents
Agent lifecycle management and deprecation
Slack integration and deployment patterns
Evals and testing for autonomous systems
Team-based agent ownership and maintenance

Companies

Anthropic

Host company; Angela and Caitlin lead product and engineering for Claude's platform and managed agents

Stripe

Built internal 'Minions' platform—an end-to-end software development agent system used by engineers

Ramp

Built similar internal agent-based platform for automation and development workflows

Vercel

Described as 'AI software factory' internally; uses agents extensively for company-wide automation

OpenAI

Mentioned as competitor in model development; GPT models compared to Claude on harness engineering

Google

Mentioned as competitor lab experimenting with different model advancement techniques

Cursor

AI code editor referenced in discussion of model-specific harness engineering approaches

People

Angela

Leads product strategy for Claude platform; discusses platform evolution and design philosophy

Caitlin

Leads engineering for Claude platform; discusses infrastructure scaling and technical implementation

Dan Shipper

Podcast host; discusses internal agent implementations and asks strategic questions about platform direction

Quotes

"We find ourselves like basically needing to kind of like evolve the platform to be sort of like higher and higher order abstraction. But it's in the pursuit of like helping you get the best outcomes out of something."

Angela•Early discussion

"Infrastructure sucks. It does. It sucks so much to spin up servers. I can't believe you do that all the time."

Caitlin•Mid-episode

"Everyone hits the same problem of, oh, wow, I either need to keep a server constantly running, or I need to use infrastructure that will spin up and spin down, and I need to store the transcript data, and I need secure sandboxing and all these sorts of things."

Caitlin•Infrastructure discussion

"The platform has to seriously scale. If Claude is on the fly or agents on the fly are becoming what they need to become in order for you to do what you're trying to do, the platform has to seriously scale."

Caitlin•Future vision

"Claude is actually able to understand itself enough that it can write itself on the fly. In that world, if Claude is on the fly or agents on the fly are becoming what they need to become in order for you to do what you're trying to do, the platform has to seriously scale."

Angela•One-year vision

Full Transcript

A year from now, where do you think the platform will be?

We want to experiment with directions where Claude actually gets so good at understanding itself. It figures out what model you should be using. It figures out how to spin up all the sub-agents. You don't have to think so much about what kind of architectures are there, because Claude is actually able to understand itself enough that it can write itself on the fly. In that world, if Claude is on the fly or agents on the fly are becoming what they need to become in order for you to do what you're trying to do, the platform has to seriously scale.

How close are we to "Claude, make me a billion dollars"? That's really what I'm asking.

Angela, Caitlin, welcome to the show.

Thanks for having us.

Yeah, thank you.

So for people who don't know, you both work on the platform at Anthropic. So, Angela, you're the head of product for the Claude platform. And Caitlin, you are the head of engineering for the Claude platform. I'm really psyched to talk to you because, A, you've been launching a bunch of stuff. You have Claude Managed Agents that came out recently. You've been launching new features for it. And I think that it comes at this really interesting time where it makes me think about what actually is a platform in AI for a model company. Because in the GPT-3 days, the platform was a completion endpoint. You just send a prompt to get a response. After that, it was a completion endpoint with tool calling and chat sessions, that kind of stuff. And now with Claude Managed Agents, you're essentially getting a Claude on a computer with memory and all this other stuff. So I'd love for you to help me unpack that trajectory and what it means to build a platform in AI.

Yeah. I think your characterization is very accurate. I think as a lot of these technologies have evolved with the LLM first starting, and then I think putting that behind an API was very fun. A lot of people were like, wow, I could do some.
At the time, I think it was very cool. Now, we'll probably look back and be like, oh, that was really basic. And then, you know, I think we've moved more and more towards a slightly more stateful world, as you want to persist the session state to make sure that the performance of the model is better and better. I think that that's probably actually the through line: as we make improvements to Claude, and as it continues to get better and more autonomous, we find ourselves basically needing to evolve the platform to be a higher and higher order abstraction. But it's in the pursuit of helping you get the best outcomes out of something. I think in the very beginning, you know, everyone was very exploratory. You have no idea what people are going to build with these LLMs, and you wanted to have as much possibility out there as available. And then as those use cases started to narrow down, people started building products with it. People started building agents with it. And more and more of that is about customers coming to us and being like, how do I get the best out of Claude? How do I set up my tools? How do I run the loop? And so on and so forth. And you have some people who are really, really experimenting, and they're on the edges, and that's great. And then you have a whole host of other folks coming in who are like, I kind of want a lot of this stuff out of the box. And in our pursuit of making sure that Claude is producing the best outcomes, we find ourselves enriching the platform to be richer and richer and richer. And contained in that is both the state and the tools that you start to see us adding.
It contains a lot of, almost, cloud components of a lot of these types of things. But it's in pursuit of the same mission of just making things literally as easy as possible. And I think in probably, you know, the forward state of a lot of these things, in terms of maybe the philosophy of what a platform ultimately ends up doing, it probably ends up just being the set of primitives and infrastructure that enables you to get the outcome as fast as possible with as little work as possible. And I think that that tends to follow a certain form factor, at least in this current state.

How would you characterize what the primitives are today? So maybe that's just asking, what are the primitives in Claude Managed Agents?

Yeah, so Claude Managed Agents is built on all of our same primitives that you could otherwise build on directly. So the Messages API. And within the Messages API, we've built a whole bunch of, I guess, maybe innovations around the API. Like, you could just get tokens in and out if you really wanted to. But, you know, you can use some of our built-in tools. You can use stuff like code execution, spawn a sandbox, and execute work. You can use, I guess, web search and all these sorts of different things. And so I think we've taken what we see as all the most powerful of those things and put them together into a harness and a set of infrastructure that is, you know, just the way to get what we think is the best outcomes out of Claude.

So I'm sitting here feeling this sense of, I've been thinking of it as like time deflation. Like, my time gets more valuable in the future as opposed to the opposite. Whatever the opposite would be, my time gets less valuable in the future. And the reason is because, so for example, internally for us, we're building an agent. We're building some agent products where it's like agents that do specific things for us internally and then hopefully for customers.
And in order to do that, we have a couple Mac minis with Claude running in a loop on the Mac mini, right? And it's like a thousand-line Python file or whatever, and a lot of that mirrors what you guys are building in Claude Managed Agents. And so for me, and I think for a lot of people building on Claude or on the Claude platform or ecosystem, at least I feel this: maybe we should just wait for you guys to build it. But then I don't know what the lines are. And I'm sort of wondering, if I want to build an agent, what is the best path to do that in a way that aligns with what you guys are doing?

Yeah. I think this part of the platform business is actually somewhat similar to any other form of the platform business, where you do have customers like yourself who are building. And you're kind of thinking, should I go ahead and do it? Because maybe I have this immediate need. But at the same time, you don't really want to repeat the work, per se, when you could have just gotten it for free out of the platform.

And also infrastructure sucks. It does. It sucks so much to spin up servers. I can't believe you do that all the time. That's like a huge part of the job. That part, everyone's like, that's the worst.

But I will actually say part of why we ended up building Claude Managed Agents was because Anthropic ourselves had gone through enough of these iterations where we built products that were agents that you could run autonomously in the cloud. And we did that stand-up-the-infrastructure-so-that-it-works-well sort of work enough times that we ourselves were like, OK, we're done building this for ourselves. We're doing it once in a way that's going to really work from everything that we've learned, also for all the people who are doing it. You can run whatever you're running on a couple of Mac minis maybe, right? And for a lot of people that could work.
But I think if you're building agents into your product and you're running something really at scale, right? Like, that's where it really starts to become more and more challenging to get that infrastructure right.

That's really interesting.

Yeah. And then maybe to answer the other part of your question, I think we have like two pieces of the philosophy here. One is a bit in the way that we design Managed Agents, which is that we try to have it be modular enough. We want to be opinionated about some pieces that we feel like should be very well married to the Claude model. For example, we want Claude to very specifically use file systems. That's a very particular Claude kind of style.

In a specific way or just file systems in general?

Just file systems in general. We also really want to lean into skills. I know a lot of folks like skills, but that's something that we want to have our harness be really opinionated about. And so we're kind of particular about those kinds of primitives being the case. So, use the file systems, use the skills. They're really basic, but at the same time, we still find people who are still trying other methodologies to go do that. And we want to help you, when you build, to start on the best foot. So that's one piece on some of the more opinionated ones. But with each one of these endpoints or APIs that we have as part of the suite, we try to open them up a little bit in certain areas. So there's things that we're looking forward to and being like, maybe it's not available today, but in our design, we are trying to make it flexible enough for people to add in different pieces. Because we recognize that this API or suite of APIs is not necessarily going to solve maybe everything in its original construct. And there are going to be pieces that need to open up.
And then the second bit is, and we're kind of public about this, when we do design a lot of these things, we do put out blog posts and reference implementations. So if you did want to at least be inspired by that construct, but still maybe make your own on the Messages API, you can definitely do that.

I think that's to the point you just made. That's something that's coming up for us. Again, we have, you know, Claudes running on a Mac mini with a Python file and a couple other, like, you know, bigger, more serious implementations on cloud infrastructure that we're trying to figure out what to do with. And I think I told the team that we were talking today. And I think one of the questions that they have, or one of the feelings of consternation that they have, considering using Claude Managed Agents for this kind of thing, for spinning up agents for our customers, is just, right now it's like we have a playground. We just have a server or Mac mini. We can just pipe stuff to Claude. It can do anything that Claude Code can do. It has a file system. It has a browser. It has all this stuff. If we want to switch it out to GPT-5.5 or Gemini or whatever, it's pretty easy to do that. So is that kind of... And I feel like they feel like if we use a Claude Managed Agent, we're going to get locked in and we're not going to have the flexibility to do all the stuff that we want. And there's also a worry that features are going to come to Claude Code itself that won't be in Claude Managed Agents for a little while, and that it'll prevent us from being at the edge, which is sort of what we promise to our customers and really to ourselves. Like, we just love doing whatever the new thing is. How do you think about that?
Yeah, so I think what's nice about the way that we work internally, I guess, is, so we run the platform, and the platform, for what most people think of it as, is our externally facing APIs and our suite of APIs. The rest of what our team actually does is internal platform, in the sense that all of our first-party products are built directly on the same platform as everybody else. And so what's cool about that is we spend all of our time, not all of our time, but a lot of our time, working with the teams internally who are building on top of the platform, and kind of enabling the features that they will build, sharing ideas, and these sorts of things. And so I think over time, you'll maybe see less and less divergence of, you know, what might be available in Claude Managed Agents and what might be available in Cowork or Claude Code, which might sit on top of the same infrastructure, right? That's, I think, one way to think about that.

Yeah. And then I think, you know, on your point, or your team's point, around having some kind of model lock-in fear, I think that that's valid. Many folks have that consternation. And I think we're at this place where there's a bit of an evolution here, where, you know, if you look back, maybe even just a couple months ago, it was very standard to build a very, very, very generic harness. It's super generic. And then you can hot swap models across all of those things. And I think for an older generation of models across labs, that kind of worked, like, okay. A lot of things were moving at a pace where I think that that was mildly reasonable. I think now, for the next generation of models, and as we see it forward, I think you see this a little bit from every lab. Everyone's taking slightly different techniques and perspectives on how they want to advance their particular form of the model.
And so in theory, I guess you could do the superset of all those things. But more often than not, I think, you know, when you build agents for your company or for your customers, you do want to deliver an outcome ultimately for them. And so I think that that level of abstraction of what you're actually hot swapping stops being this really generic harness with the model hot swapped underneath. It gets more to where the harness and the model get very paired. You still need redundancy, and you still might want to use other models for things, but you probably do it at the layer of the agent, meaning the harness plus the model, rather than necessarily the other architecture of, you know, a really, really generic harness and hot swapping everything underneath.

That's really interesting. Is that how, I don't know, the Cursors of the world are doing things? Like, do they have a separate harness for each model, or is it a generic harness that they're kind of hot swapping the models in and out of? Do you know?

I'm not entirely sure. My intuition would be that, I don't know about Cursor in particular, but there have been teams that we have talked to who have fallen on similar perspectives. And it's mostly because they're just trying to squeeze the most out of each model, to almost harness engineer every single nuance. And, you know, one example that we have, it's not an external customer per se, but something that we've done a lot internally: we recently launched Memory, for example, with Managed Agents. And we tried a bunch of different harnesses ourselves. We tried one that was the one that we ended up launching. We tried a bunch of others using a bunch of different other techniques. And at least personally for myself, when I saw the eval suite from the team, each one of these harnesses performed drastically differently.
And so I think just even looking at something like that shows you that you can actually hill climb a tremendous amount by just harness engineering the right pieces together. And I think if you were to just take that forward across all model combinations, across all different labs, all different kinds of providers, there is a lot of alpha in that kind of construct. And so I wouldn't be surprised if more than just ourselves have experimented with that level of unit tying.

It's really interesting that there's this path dependence where you make some choice for how you do requests and responses, or how you do tool calls, or whether you have the model want to use file systems or not. And then that sort of changes the trajectory of all of these different models.

Yeah, and it feels like maybe at the time, almost like, you know, a footnote. But it ends up becoming very big.

Do you think that that will end up affecting the model's generalizability, in the sense that at some point they'll just have these sort of maybe locked-in lanes of stuff that they're good at because, you know, Claude is really good at file systems and OpenAI is, you know, GPT is good at some other things? Like, yeah, how is that going to flow through the model's personality and behavior if it's locked into a specific way of doing things?

I do think it does actually tend to lock the model. So what we end up treating as the right path and the right primitives needs to be very carefully thought through. And so I think in some eras, you know, of other models, they become really, really, really good at reasoning. And then they almost over-optimize on that level of reasoning. And there's other perspectives around, like, okay, yes, we want it to be really good at a computer. Like, maybe the computer part is the interesting part.
And so if you think through maybe some of the primitives, which we could get right, we could get wrong, but at least we'll go through the thought process of, like, that will probably at least lead us, you know, one path or the other. I think it's hard to say, you know, which direction per se will ultimately be true. But I do think there's a lot of path dependency it ends up taking. So being really thoughtful about what you choose to actually include or give the model more natively is really important.

Are there any of those path dependencies that you've had to undo?

Probably. I can't speak enough about that at the Anthropic level. I've only been here a couple of months, but I have to imagine that that has been the case. I mean, we've experimented, even at other labs, and the kinds of primitives that we have to take a look at are constantly changing. And you do kind of hit a little local maxima and rethink, like, okay, maybe there's a more generic approach to doing it.

Yeah, yeah, yeah. Interesting. I want to take a step back and ask you something that maybe I should have asked at the beginning, which is, like, who is Claude Managed Agents for, right? Like, I set one up earlier today. We've got some people already using it in production inside of Every, and I just did one today. I really loved the sort of getting-started chat experience that you had and some of the examples that you had. And it felt to me like even if I was not technical, I might want to use this to set up an agent. It might be a little bit complicated. But what I actually did is, and I'm sorry to say this, but I did it in the Codex in-app browser. So I had Codex driving the managed agent setup. And I had a Slack bot working pretty quickly. It was really cool. So how do you think about, when you're designing Claude Managed Agents, who it's for?
Yeah, so it's interesting because I think you're right that especially with that quick start experience, which we actually felt pretty strongly about launching, not specifically for the sake of making it so that non-technical people could go and build agents, but actually just for anybody, technical or not, to be able to wrap their head around the primitives, like the APIs.

Here's what I can do and here's what fits together.

Yeah, exactly. Like, you know, the education portion of it. But I think when we think about who it's for, we think about a couple different things. One is we're seeing people internally within companies build automation or build really powerful platforms or systems. Like, we've seen people say, I want, you know, a full end-to-end software development platform. Right. And Managed Agents is a perfect solution for something like that. Or, you know, I want to automate a little process over here where legal has to review my marketing copy. Right. And things like that. And so you shouldn't have to reimplement memory and all that stuff every time you're doing that. Right. You can get started really quickly and you can get something running quickly. The other user that's top of mind for us is people building into their products that they expose to their customers. And so that's the other one where actually, yes, you do still want a lot of customization. You do still want to make something that's going to be really powerful for your product. But we still definitely, definitely believe that not spending your engineering resources on the infrastructure and on all the little harness engineering tweaking sort of stuff is worthwhile.

Why couldn't we have talked like a month ago? You would have saved us so much time.

We'll just need to talk more.

But I am sort of curious. OK, so maybe infrastructure is one of these things. But when you see people setting up agents, what do you see them think the hard thing is?
And what ends up actually being the hard thing? And are they the same?

Good question. Maybe this is, I don't know, spicy. I'm not sure. But I think people think the harness engineering part is the hard part. And so actually, you know, in the past we launched the Agent SDK, which is what you guys, I think, are using on your Mac minis. And for a lot of people, they were like, okay, great, I don't have to do the harness engineering part where I have to do prompt caching and I have to maximize my context window and all these sorts of things.

I think we're actually just using Claude, like the `claude -p` command.

Yeah, okay.

It's pretty good.

Yes, yeah. OK, cool. But regardless, you guys did that because it takes building the harness off your hands, right? But I do think what we saw with a lot of customers was, OK, now I want to go and take that thing and get it into production and scale it. And everybody hits an infrastructure wall. Everyone hits the same problem of, oh, wow, I either need to keep a server constantly running, or I need to use infrastructure that will spin up and spin down, and I need to store the transcript data, and I need secure sandboxing and all these sorts of things. And so, you know, if you boot a Claude Code session or you boot the Agent SDK in a sandbox, and that's the thing that you have running, but your sandbox loses connection and dies or whatever, your whole agent dies, right? And so I think the infrastructure part especially is the wall that most people end up hitting, but they're more expecting that the actual harness engineering and getting the most out of the model is the part that's going to be harder.

Yeah, I totally agree with that. I was just going to say, you know, we talked to so many people who are now at a place where they're prototyping really quickly. And they're super excited. And it's, like, it's doing the thing.
And then, yeah, there's a class of people who are, you know, really pushing and being like, okay, I do want to hill climb. I really want to edit the harness. But then once you have that thing, productionizing is just a freaking nightmare, especially for the more interesting long-running async ones that you want to do a bit more remotely, that are a bit more autonomous. And everyone kind of runs into that wall. It was a big inspiration for why we built what we built.

I feel like one of the examples of the shape of an agent is OpenClaw. And in particular, the thing that it has brought to us internally is you have an always-on agent in Slack that has its own personality, and it has its own part of the world that it ends up working on. Is that a possible future, like, okay, a one-click agent that lives in my Slack, where yes, I can go set up all the internals, but I don't have to really think about all of the technical infrastructure stuff? Because I think you all have the beginnings of that, but it's still a lot of steps from the current managed agent to something that's always on in my Slack that I have to set up and customize. So does that fall in the realm of the platform's job, or is it too far in the product direction?

No, it definitely is something that we really want to do. I think we focused a lot on the infrastructure piece to start, because that's where we just see a lot of these pain points. But yes, I think it's, I don't want to exactly say final shape, but in its advanced shape, we actually want to make it so that you can deploy these agents really, really easily. We've made some light steps in this direction. For example, we included Vaults as one of the primitives as just kind of...

And Vaults store your keys and stuff, like your OAuth keys?

Credentials.

Credentials, yeah.
As kind of solving some of the lower-level pieces as a starting point. But once you wrap some of these more agent-identity type of primitives in a more secure way, and you can handle it really easily, and it works with the whole system, then I think it's very natural for us to get to a place where maybe you are either one-clicking Slack integration or alternatively even maybe just telling Claude, add Slack, and it just handles absolutely everything. And then before you know it, your little bot is just pinging you on Slack.

I love it. I can't wait for that world. What are the best internal use cases of agents? Because I think there's this big question happening right now where, okay, yeah, everyone's in Codex or Claude Code, but then now we have these agents that are out in the cloud. Now everyone inside of a company can have their own agent. There are team agents. There are company-wide agents. So what are the patterns that you see for when people make really useful internal agents, what they do and what they look like?

Yeah, I would say, and we've actually seen a few examples of these in some of the more AI-pilled, AGI-pilled companies, like Stripe built Minions, and they talked about that a lot as their end-to-end development platform that their engineers could use. I think Ramp did something similar, and we've done similar things as well, right?

That's interesting.

Yeah, we've built platforms internally where, you know, I have agents running that I can talk to from Slack or from wherever, right? And at a certain point that becomes actually a pretty thin layer on top of Managed Agents. Like, you don't have to do very much to accomplish.

That's what I was thinking. Like, I looked at Minions or whatever Ramp does, and I was like, why? Why? You know? So is it actually useful to have a sort of thin coding agent that anyone in the company can use?
Or why not just install the Claude app in Slack?
Yeah, I would say the difference with a platform like that, and some of the things that we've done internally, is that there's a lot of customization that you might want to do on the development environment where an agent is actually running and able to verify its changes, right, and things like that.
It's like, here's how our CI/CD works.
Yeah, exactly. And so I think for lots and lots and lots of people, Claude Code is an excellent tool, right? And you can run cloud agents with Claude Code, and that is really great. But I think if you're trying to do a bit more end-to-end development, and you maybe want to bake in more custom things, then you could start with something like managed agents, build a layer on top of that, and end up with something that's maybe closer to that end-to-end experience.
It also seems to me like there's something in particular about having a team that you need to work with that makes the managed agent shape important, as opposed to it all just working in Claude Code. I guess technically you could sync the skills between everyone's Claude Code. But there's something about "we all have one agent that does this thing" that seems to work.
Yeah, I'm really glad you brought that one up, because I think that's actually one of the more common areas where we see a lot of the opportunity. To your point, there's a lot of individual productivity that's happening. Whether you're a developer or a non-developer, there are so many tools that you're using to make yourself more automated, more high leverage. But then when you get to the team layer, suddenly everything gets massively more complex. Number one, obviously you can't just sit on your laptop. And yes, you could maybe put it in the cloud, but that's again more for yourself, to handle things with your laptop closed.
But then you go to, okay, well, now the three of us want a couple of agents that interface with each other and work with each other. And then maybe we're automating a process end to end. And especially for some of the more complex processes that you envision being really transformed with AI, you do need that kind of team orientation. And that needs to happen at a layer that's a slightly higher level of abstraction than just a single agent. And I think some of the teams exploring multi-agent architectures and things like that are really exciting, but it needs to be built on top of a bit of a platform that everyone can spin up and down and control. And I think G from Vercel had a really good perspective on this. His company, Vercel, is obviously incredibly AI-pilled. And he describes it as sort of an AI software factory internally. And I think that's exactly the right mindset. And that produces an extremely high-leverage organization that's really just creating a tremendous amount of productivity, not just for themselves, but for every single process that they have in the company.
And I really want to go back to this: okay, agent use cases. We've got coding agents that anyone can use in the company. What are the other ones that you see people standing up that are really useful?
We've seen a few. So one of the fun things that we get to do is work with our internal teams in different functions and help them agentify, because we actually get to learn a lot as a result of doing that. And so the silly example I brought up earlier, of the legal team needing to review marketing copy, was one of the ones that...
Very real.
Yeah, extremely real.
really blew people's minds, with very basic agents that just give people the right setup to be able to do that.
So you've seen that. Well, what does that actually do? So there's marketing copy, and there's a legal agent that is just watching everything marketing does and is like, stop?
No, it is more like, okay, I'm a marketer and I've written some copy, right? And in the past maybe you would have opened a ticket or something and said, can you please review this copy? But instead you submit it to this little app that we built on top of agents that says, okay, cool, now I'm going to review it first as an agent and then put it in legal's inbox with a first-pass review already done. And maybe it's clear enough that the agent can say, okay, marketing, you're good. Or maybe it's still, no, this needs an extra human review. And that's the sort of thing where, again, it's just a thin layer on top. But you have access, I have access, we can both see the outputs, and we can work together.
Okay. But then, for example, why is that not a skill?
So it very much can be a skill. You would probably build that agent as a legal reviewer agent, right? And so you would have MCP servers or whatever it is that help you access external context.
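A minimal sketch of the first-pass review flow described above, assuming a simple phrase check stands in for the agent's judgment. The names `first_pass_review`, `route`, and `legal_inbox` are hypothetical illustrations and are not part of any Anthropic API; a real reviewer agent would call a model with the team's skills and context instead of matching strings.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    APPROVED = "approved"
    NEEDS_HUMAN = "needs_human_review"


@dataclass
class Review:
    verdict: Verdict
    notes: str


def first_pass_review(copy: str, banned_phrases: list[str]) -> Review:
    """Placeholder for the agent's review (hypothetical).

    It only flags copy that contains one of the banned phrases.
    """
    hits = [p for p in banned_phrases if p.lower() in copy.lower()]
    if hits:
        return Review(Verdict.NEEDS_HUMAN, "flagged phrases: " + ", ".join(hits))
    return Review(Verdict.APPROVED, "no flagged phrases found")


def route(
    copy: str,
    banned_phrases: list[str],
    legal_inbox: list[tuple[str, Review]],
) -> Review:
    """Run the first pass, then queue the copy for legal only when needed."""
    review = first_pass_review(copy, banned_phrases)
    if review.verdict is Verdict.NEEDS_HUMAN:
        # The human reviewer receives the copy with the first-pass notes attached.
        legal_inbox.append((copy, review))
    return review
```

The point of the shape is that the marketer always gets an answer back, while legal only sees the items the first pass could not clear on its own.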
You would have skills that help you understand, here are the rules we have to follow and not follow, and all those things, and you put all those things together. But then you can just fire off a session with that agent. And then I think the last piece you need, and this is where I'm saying it's a really thin layer, is just the form factor on top, where different people can collaborate and work with that agent, and multiple agents can be involved in the system. And so I think it goes a little bit broader than a skill, because you still need the right form factor for the agent to be able to go run, and then for people to be able to interact with it.
Another core bit of why it's not a skill, or not exclusively a skill, is because you actually do need a human in the loop. If you were to automate the whole thing, and you were just taking the skill and looking at it yourself with a legal skill, for example, then in that world, of course, you could have just done a pure skill. But you may need a human in the loop to say, okay, I want to review and I do want to check, and we're looking at legal things, so there's a bit of authentication that's necessary. In order to automate that entire process, you kind of need agents to go do the thing. And because you need to spin up separate sessions for that to happen, some sort of stitching is necessary that can't be instantiated in a single skill.
That's really interesting. Yeah. OK, so just to push on that a little bit. So what is the best practice for you? You create an agent whose job is to make sure that when marketing is writing something, they can get it approved really quickly by legal. And sometimes it'll approve things immediately. Sometimes it sends stuff to legal, and ideally it's getting better all the time, so it can do more and more, right? What is the best practice for?
Who owns that agent once it's built? Because one of the things that we found is that if you don't have a human who's responsible for the agent, it gets stale very quickly, and then it ends up being this dead thing that's just out there doing stuff, but it's not actually good. And also, even if it kind of works, there are going to be all these times where legal says, you asked me to approve this, but I don't really need to approve this thing, let's update your prompt. So how does that all work when it works well?
So it's actually really interesting, because of the form factor thing, right? The app that sits on top of that, that we originally built: one of our teams worked on that, sitting with these teams and understanding what they needed. And they were kind of like, okay, here you go, we're going to go do other stuff now, let us know how this goes for you. And then a really cool thing actually ended up happening, where people on those teams who were using the tool were like, oh, I wish this little thing could get tweaked, or this thing could get better. And they popped open Claude Code and made some of the changes themselves. And so it's funny.
And then is your team responsible for approving the PR? Does it just go in?
Usually my team is responsible for reviewing the PR if it's a system that we actually own. But yeah, people can self-serve making changes to those things, which I think is really cool. I do think we're still in a stage, for a lot of teams and a lot of companies, even going back to Stripe having Minions, right? Stripe has a large developer productivity team. We used to work at Stripe, so we spend a lot of time with them. They're awesome. And they're obviously putting a lot of work and energy into building platforms and tools like this.
And so I think we're definitely still in a place where something like managed agents, or being able to build on top of our platform, is really powerful, but you still need the AI-pilled people and technical people within a business to then go create something really excellent on top of that, something that works well for whatever you're trying to do.
That's interesting. Yeah, I love the "anyone can open a PR to do this because everyone's using Claude Code." One of the things that I find, talking to people who are in infrastructure roles at companies where this is starting to happen, is, you know the meme where there's a person and he's going like this, and he has daggers in his back and he's covering it? Infrastructure people are that. Now anyone can submit PRs. How do you deal with that? And how do you do that well? Because obviously, in an ideal world, you would love for legal to be able to submit PRs to improve this agent. And also, sometimes they're probably going to submit stupid stuff that wastes time. And so what are the right ways, either organizationally, culturally, or technically, to make that possible without ruining your lives?
For this particular one that we've constructed, that Caitlin's given as an example, we actually have a couple of layers of abstraction away from that PR layer. At the very beginning, it started that way. And to basically prevent users from foot-gunning themselves a little bit, they get to a place where, oftentimes, their way of interacting with the agent that they own, whether it's the marketing team who owns the marketing agent making the request, or the legal team owning the agent that does the review, is that they actually engage with those agents through Claude itself. So they spend more of their time talking directly to Claude.
And then Claude will oftentimes figure out what the right way is for them to go and handle it, so that they're not hopping straight down to the absolute core bit and doing something that may result in some complications.
And they're talking to Claude or Claude Code? Like Claude chat or Claude Code or Cowork?
It's a different instantiation of Claude that we made that is actually a managed agent in and of itself. So it's kind of managed agents all the way down in that construct. But we found that if we tune and prompt each variant of the managed agent, each layer helps to solve a different part of the problem for users. So at the end state, for that marketing person or that legal person, it is a really simple interface, where the way that we tell them is, you're just talking to Claude. But under the hood, it's many, many Claudes engaging with each other to get to the part where the Claudes themselves are doing the more complex work that the human doesn't really need to interpret.
Interesting. You guys just launched multi-agent orchestration. What are the coolest things that people are doing with that?
One of the more interesting ones is, I think people are using it to construct different harness techniques. And that one I'm personally very excited by, because there are different techniques that people have experimented with. For example, we recently did the advisor strategy one. But really, if you were to genericize it, you just separate execution from advice. And there's also one where you can have two modes, where one is generating something and the other one is adversarial to it. And then there could also be one where you split the work into a bunch of tiny pieces and then they recombine.
And then there are ones where maybe it's something closer to a best-of-N style of thing. And then there are so many more. And each one of these different types of architectures or strategies is good for very specific use cases. Some of them are much better for deep research or wide research style use cases, right? And there are others, the kind where they all sort of swarm together, that are better for bug hunting, for example. And so it's really cool to see that if we can make the primitives very LEGO-like, then people can put them together to solve things at a slightly higher form factor, which is more like an architecture or a strategy. And they get much more interesting results out of that. And that's really exciting to see, because it also suggests that you can actually hill climb multiple layers of abstraction.
How do you know if an agent is successful? How do you measure success for an agent?
Yeah, I mean, there are evals and stuff like that, which everyone has talked about ad nauseam. One direction that we really like is this kind of verifiable outcome. We've been somewhat opinionated on that one. We talked a little bit about what a platform is at the end of things. Going from that philosophy, our principle is that maybe the end state of some of these things is that everything should compress down to an outcome and a budget. And that's probably about it. And everything else should be figured out for you, resolved exactly across those parameters. And so for us, yes, we still have evals. We have a lot of these other things that we measure that are domain specific. With some coding evals, for example, you might want to measure just the actual PR getting merged. Those are more verifiable.
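A minimal sketch of the "outcome and a budget" framing described above, assuming a fixed cost per attempt and a caller-supplied verifier. The function name `run_until_verified` and its parameters are hypothetical; nothing here reflects how Anthropic's platform actually implements outcomes or budgets.

```python
from typing import Callable, Optional


def run_until_verified(
    attempt: Callable[[int], str],
    verify: Callable[[str], bool],
    cost_per_attempt: float,
    budget: float,
) -> tuple[Optional[str], float]:
    """Try candidates until one passes verification or the budget runs out.

    Returns (accepted_candidate_or_None, amount_spent).
    """
    spent = 0.0
    n = 0
    # Only start an attempt if its full cost still fits inside the budget.
    while spent + cost_per_attempt <= budget:
        candidate = attempt(n)
        spent += cost_per_attempt
        n += 1
        if verify(candidate):
            return candidate, spent
    return None, spent
```

The caller supplies only the two things the speakers name: a check that says whether the outcome was met (`verify`) and a spending limit (`budget`). Everything about how candidates are produced stays behind `attempt`.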
But as we get to the place where an outcome is actually a spec that you, just as a human, are able to define, our ability to interpret that and have the agent regrade itself over and over is closer to what we care about.
Claude, make me a billion dollars. Your budget is $10.
Exactly.
I meant to say, no mistakes. Go.
Go. Exactly. Maybe Mythos could do that.
And then one of the things that we were running into, that I'm curious if you have a solution for, is that agents get outdated pretty quickly. Sometimes it's because there's no human attached to them. Sometimes they're just running an old model, or they're on an old architecture, or whatever. And it feels like there needs to be an end-of-life cycle for agents. We've talked about having a little funeral for them, and having a little page on our website that says, here are all the decommissioned agents. Especially in a really big company, how do you manage all of the agents that are out there, where maybe they're in Slack pinging stuff once a week, but you're like, this is super stale? How do you make sure that you retire them as quickly as you are making them?
So one of the things we have actually done is we have made skills that help you do things like upgrade to a new model when a new model comes out, right? We've put a good amount of work into making it easier to do exactly what you're talking about. And I think maybe some of the most AGI-pilled people are running agents that monitor their agents to see if those agents are outdated and in need of that sort of thing.
But I think, for the way that we like to talk to customers who ask us this question, the most interesting instantiation of this is: there's a new model, and now I need to go upgrade my agents, or maybe be done with those agents, because the new model enables me to build agents that are way more powerful and do more interesting things than the old agents did. But I think that upgrade process and that migration process is something people have had to wrap their heads around as, it's like a breaking change, and I have to put actual energy into making that work. And obviously, sorry to talk about evals, but if you have evals, this process is easier. But I do think that's one of the things we've tried to do: how do we give you skills, and how do we give you the right tools, to make that process easier? And then you could go be AGI-pilled and choose to automate more of that with more agents.
So a year from now, we're back at Code with Claude. Where do you think the platform will be? What will I be able to do, and how will it be different from what I can do today?
Do you want to go first?
You can go first.
A year is a long time.
In this industry especially.
How close are we to Claude making me a billion dollars? That's really what I'm asking.
If that works, you probably won't be sitting here.
Yes, yes. We'll be asking Claude for this.
I mean, yeah, we want to get closer and closer to that state. Okay, so a couple of things. I think in a year from now, one thing that we'd love to get really, really close to is actually that kind of simplicity, and this might be a significantly higher order of abstraction. I don't know what the form factor will look like, but the parameters we will care about from users will be that outcome. And of course, it has to be verifiable. There are some parameters that have to be restrictive. And the budget.
And I think we'd want to experiment with directions where Claude actually gets so good at understanding itself that it figures out what model you should be using. It figures out how to spin up all the subagents. I actually don't think you need to think so much about harness engineering in that world. Today, you don't have to think as aggressively about tool construction, for example. We've made that a little easier, and you get to delete a little bit of that scaffolding.
Less prompt engineering, too.
Yeah, exactly. Exactly. And I think if you just keep going up that stack, today a lot of the innovation is happening at this really high level, almost the harness architecture level, which is really fun. But I think a lot of that honestly also kind of goes away, where you almost don't have to think so much about model selection, and you don't have to think so much about what kinds of architectures there are, because we probably would have gone through enough iterations with Claude that Claude is actually able to understand itself well enough that it can almost write itself on the fly to figure out what is necessary in that two-parameter world of outcome and budget. I don't know that we'll get there in a year, but I feel like we might be able to do the outcome part of that with maybe some error bars on the budget side.
Really cool. Yeah.
Okay. That was really cool. I'm going to give you a slightly more boring answer, which is that in that world, if Claude on the fly, or agents on the fly, are becoming what they need to become in order for you to do what you're trying to do, the platform has to seriously scale. And so I do think some of this will be: what are the right abstractions that actually enable that, right? Somewhere in the primitive to higher-order realm, right?
But I do think so much of what our team is going to be doing is making sure that the tokens that people want to come in and out of Claude are going to be able to come in and out of Claude, because our system has scaled to meet not just the demand, but that world where you have agents that are literally constantly running and recreating themselves and doing this sort of work. You just need a system that can handle long-running requests and can handle a bunch of differently shaped things. And so I think for us, I never want the ability of the platform itself to scale to get in the way of what people would otherwise be able to accomplish with these things. And so I think that's something that's probably going to be very front of mind when we're talking in a year.
Awesome. I'm excited. Thank you so much for joining. I really learned a lot.
Thanks for having us.
But instead of gold, it's filled with pure, unadulterated knowledge bombs about ChatGPT. Every episode is a roller coaster of emotions, insights, and laughter that will leave you on the edge of your seat, craving for more. It's not just a show. It's a journey into the future with Dan Shipper as the captain of the spaceship. So do yourself a favor. Hit like, smash subscribe, and strap in for the ride of your life. And now, without any further ado, let me just say, Dan, I'm absolutely hopelessly in love with you.