Is the ChatGPT Era Over? Opus 4.6 & The Shift from Chat to Delegation - EP99.33
This episode discusses the simultaneous release of Anthropic's Claude Opus 4.6 and OpenAI's Codex 5.3, comparing their capabilities, pricing, and performance. The hosts analyze the shift from traditional chat interfaces to agentic delegation workflows, exploring the practical challenges and costs of implementing AI agents in business environments.
- The AI model race has intensified with companies releasing competing models within hours of each other, indicating fierce competition
- Cost efficiency is becoming more important than raw performance, with cheaper models like Codex potentially offering better value for agentic workflows
- The transition from chat-based AI to delegation-based AI agents requires new skills and workflows that aren't easily transferable to all users
- Enterprise adoption of AI agents faces significant challenges around cost control, security, and the need for human oversight
- The productivity gains from AI agents come with increased mental overhead and coordination complexity for human users
"It's a bit like the space race. Like they've just launched Sputnik and then the US are like quickly rushing to launch Gemini."
"I think that's the transition we're in now, from chat to delegation. You're probably going to see the core of these new businesses built around that delegation piece."
"Cost is going to become an issue. Like you don't want every developer spending like two grand a day, and then the output doesn't rise by the equivalent amount."
"I feel like it's almost like a fantasy for people in some ways where they're like, oh, I've got all these agentic workers working for me, doing all this productive stuff."
"Isn't that what everything is around this right now? It's like, it's like the spice is tokens and we need to turn the spice into wealth."
Is this AG? Is this AG? A million tokens deep and I never tell a lie. Is this AG? Is this AGI?
0:00
So Chris, this week it is the model same-day showdown. We had the release of Opus 4.6 from Anthropic, and from OpenAI, Codex 5.3. And just before we started recording we were speaking about how close these releases were, in the fact that you could measure it in minutes. So Opus 4.6 dropped first, and then, you know what, a hundred and something minutes later, Codex 5.3 is out the door.
0:09
It's a bit like the space race. Like they've just launched Sputnik and then the US are like quickly rushing to launch Gemini. Ha ha ha.
0:42
So let's go through Claude Opus 4.6 first. It has a 1 million token context window. It is in beta, and we'll get to its pricing in a minute.
0:50
I was going to say, it has a million context. Got Billy's in the bank.
1:00
Yeah, you really need Billy's in the bank. The model also supports up to 128k output tokens, which is also going to get real pricey, and the premium pricing kicks in over 200k. They say that it's improved its performance in various coding benchmarks as well. Interestingly enough, one of these benchmarks is a multi-round coreference resolution test: the model's ability to track and resolve references across long, multi-turn conversations. And this has seen a significant improvement of almost 20% over Sonnet 4.5, which was kind of the leader in this. In the OSWorld benchmark it's now the best computer-using model, which is really interesting. So I don't know, these benchmarks I rarely look at or care about; it's truly the vibes and using the model. Before we get into our initial experiences with it, I'll just hit on that pricing. So standard, which is up to a 200k context window, is the same as Claude Opus 4.5: $5 per million input tokens, $25 per million output tokens. But then the extended tier, so that's anything beyond the 200k context window, is... I've lost the numbers.
1:05
I think it's like $15.
2:31
Yeah, $15 and 35.
2:33
It's just hard to understand how much it would cost. And what just blows my mind is that on X and all these places, people are like, oh yeah, I left it running for 24 hours, I've launched an agent swarm and all this stuff. And I'm like, but how much did that cost? Think about how many iterations there are in a 24-hour period, and think of how many millions of tokens that is. You're talking thousands of dollars, like nearly $10,000 to do that.
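A rough back-of-envelope sketch of that math, using the standard-tier Opus 4.6 prices quoted above. The workload numbers (how many calls a 24-hour swarm makes, and tokens per call) are purely illustrative assumptions, not measurements:

```python
# Back-of-envelope cost of a long-running agent loop at the Opus 4.6 list
# prices discussed above ($5/M input, $25/M output, standard tier <=200k).
INPUT_PER_M = 5.00    # dollars per 1M input tokens
OUTPUT_PER_M = 25.00  # dollars per 1M output tokens

def loop_cost(calls, input_tokens_per_call, output_tokens_per_call):
    """Total dollar cost for an agent loop of `calls` model invocations."""
    total_in = calls * input_tokens_per_call
    total_out = calls * output_tokens_per_call
    return total_in / 1e6 * INPUT_PER_M + total_out / 1e6 * OUTPUT_PER_M

# A hypothetical 24-hour swarm: 2,000 calls at ~150k input / ~4k output each.
print(loop_cost(2000, 150_000, 4_000))  # → 1700.0
```

Even at these modest assumptions the bill lands in the thousands, which is the point being made: the input side dominates once each iteration re-reads a large context.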
2:36
Well, I think the example you're referring to is a researcher from OpenAI who said they spent $10,000 researching with the new Codex model. Like, there are so many releases lately. Codex 5.3, just to be clear, not to be confused with Codex 5.2, which only came to the API like two weeks ago, and Codex 5.1 and Codex 5.1 Max. The iterations and versioning... I mean, for the end user who's using these products, where it's just auto-upgrading the models, it's totally fine. It doesn't really matter, because they're just slowly improving the model, and the underlying product as a result gets better. So the versions are starting to matter less, because their main focus right now is on these CLI tools like Codex and Claude Code, and some of the user interface elements as well. They're just upgrading their underlying product in a lot of ways. But interestingly enough, in their own UIs, even the most basic versions of, say, Claude or ChatGPT, you can select these models. So they are showing them to the consumer as well, which I find kind of interesting. But yeah, I think the million context is super interesting. Opus 4.5 did lack it, and especially using it in turn-by-turn chat operations, the larger context window will help. But I mean, at 3x the price, I would just go and use Gemini Flash 3. It's great.
3:05
As we discussed this morning, I think a lot of it is about efficiency of building that context, because the thing is, it's just lazy mode. Having 1 million context on such a premium model is just like, I'm going to throw everything at this thing and let it figure it out. Whereas at that price, I'm willing to do a little bit of work up front, in terms of other smaller models helping me out, to get to the point where I can throw it at the bigger one and use it for what it's good at, rather than using it for lazy mode at that kind of price. You know, I'm pretty extreme with these things, we both are, in terms of our usage and stuff like that. But that price, for me, it's simply not worth it. We can't afford it. It's just so expensive for what you're getting out of it.
4:49
Okay, so let's talk about Codex 5.3, just in comparison to what we're seeing. So as a model, I'm just gonna guess, because there's no API pricing released yet, so it's really hard to know. But GPT-5.2 Codex is $1.75 per million input, and cached input is 12 cents per million, which, let's be honest, is free. And then $10, sorry, $14 per million output, right? And so you start to look at this, and in terms of pricing at least, Codex 5.3, assuming it's the same as 5.2, and they've been pretty consistent in their pricing, that is just hard to pass up. If it's as performant as Opus 4.6, I mean, I guess it really comes down to vibes and preferences at that point. But it's a very, very competitive price for agentic loops. It's close to free.
5:34
Well, you and I, prior to these releases, had already been discussing using 5.2 Codex as an alternative for 4.5 Opus, and a legitimate alternative that performs just as well, at least in the environment we've been working in. So when you're talking about something that may have a 10% or 15% edge, some sort of minor edge, but the pricing is four times, five times cheaper, or even more than that, to me you're probably better to enhance your workflow and work with the cheaper one, rather than laying out all this extra cash for something you may not even notice. Yeah.
6:42
And I think at an individual level, if you've got Billy's in the bank, as you said, like a lot of the people using these models do, there's no constraint on their usage, right? And for us, in a little way, there are pretty few constraints, because we obviously have access to a lot of tokens via SimTheory, so we can get away with testing these out a fair amount. But when you're in a constrained environment, like if you want to deploy this to a team of 200 developers, right, cost is going to become an issue. You don't want every developer spending like two grand a day, and then the output doesn't rise by the equivalent amount.
7:21
So, yeah, you sort of. You're basically going to spend their wage again making them more efficient. But it also might be, will they just do their same job with the same thing but for double the price and it's just easier for them.
8:04
Yeah, it's just less work. And then, okay, you could make the argument, I'll lay off half the team. But then what's the ratio of output going to look like against the token spend? Probably less money.
8:17
Yeah. But then you're counting on them to be able to work efficiently and actually work in this agentic paradigm, compared to what they're used to. So it's a bit of a gamble as to whether you're actually going to get more output there. And as these models come out, it actually drives me down-market: I want to see what we can get out of the smaller models for a much lower price. Because, as we've discussed a lot recently, a lot of the work is really in building up the context, and really good tool calls, really good skills. All of these things can make the smaller models perform better. And we're talking orders of magnitude of price now. It isn't a small difference, it's a huge difference.
8:31
Yeah. And you were talking about this idea... you seem to have loved the last two weeks testing every model on earth in an agent loop. And I started experimenting as well. You can go to, say, Gemini 2.5 Pro, or sort of go back in time with these older models, and now, running them in a more structured way, with all the learnings put together, you can get a lot of value out of them. They do still work pretty okay. And in the week I was switching, I got stuck on something, I can't actually remember exactly what, but with Opus it was just going in a loop and getting nowhere. It was completely stuck. And then I switched over to Codex, I think it was 5.1, not even 5.2 at the time, and it nailed it first go. Just bam: this is the problem. Straight up, no noise, no nonsense. And a lot of people on X have been commenting the same thing. I think the creator of what's now called OpenClaw also said in an interview that he is obsessed with Codex for coding, because it just gets to the point. It cuts through the noise. I haven't worked with it enough yet to say that myself, but I think in terms of backend code in particular it really does get you to a solution a lot quicker, from my experience. So I'll be interested to try out Codex 5.3. By the time we started recording this, I just hadn't had enough time to definitively say anything about it.
9:13
You can't even get it yet, right?
11:02
Yeah, it's not even out in the API. So I would have to use the new Codex desktop app that just came out, which I did install and try during the week, or the Codex CLI, which I believe it's available in; I would hope by now it is. But there are just too many things at the moment to try. And quite frankly, and someone said this in our community recently, you develop really productive workflows with AI where you're working with it in different scenarios. Like, you might be someone who's in Cursor and using their agent product to code, or you might be in ChatGPT and very productive in it, or Claude, or whatever it is. You develop these workflows, and then you hear the noise around, oh, this new thing's out that you should try to be more productive. And there is this time investment in learning that new tool, figuring out what works with it really well, and then getting as performant as you can in that tool. And part of it right now, for me, is just tool fatigue and model fatigue. Even today it's like, which one do I try first, and how do I even tell the difference? So I think that's a very unique problem, and a good problem to have. We live in a very good time, when I'm thinking, how do I enhance my productivity, and I've got two of these bleeding-edge models to try out. So we did play around quite a bit, though, with Opus 4.6. What were your initial impressions?
11:04
Yeah, it seems like they've definitely done a lot of work on tuning it for this agentic workflow; there's absolutely no doubt about that. I reran examples I've been running almost continuously every day to test different elements of what we're working on, and the only word that comes to mind is solid. It's faultless. You just run it, it gets what you want to do, it works through the process, it can iterate in this sort of looping style, and it doesn't seem to get lost in what it's doing. And that's interesting, because I had it clamped down to the 200k context, because we don't have Billy's in the bank and I don't want to cross that 200k threshold and have to pay triple the price. And even within that, we had noticed using 4.5 Opus that occasionally that context would get you, especially when you did large amounts of parallel tool calls and there just isn't enough space to get it there. This seemed to cope a bit better with that, and seemed to get the idea of an ongoing workflow. So it's really solid in that respect. I also got it, I know you didn't think it was the best idea, but I got it to do some more pig grooming calls. And it actually just made three appointments with groomers for my, my thing. And I'm like, that's not what I wanted. I wanted humor.
12:47
You wanted the goal being achieved. It can actually just...
14:03
It's fun. But it actually just accomplished the goal. So now I'm gonna have to call up and apologize to these people, when that wasn't really my goal in the first place.
14:06
But I think what you just said, about it maintaining coherent understanding over extended context without losing track of what "it" or "they" refers to, let me quote the actual benchmark: it improved by 18.5%, up to 76%, and that's exactly what you described. So I think that benchmark might actually be valid for once.
14:17
And I feel like I'm in a position to say it, even though it's anecdotal, just because my life lately has been running these loops continuously on the same problem, over and over again, to try and optimize our side of things. And so to see it just go, first go, and just do it, is really refreshing. It's not what I want, necessarily, because I want this to be able to work on all of the models, especially the lesser models. But nevertheless, it's a testament to this model that it's able to jump in first go and get it done. Yeah.
14:40
So I think now it kind of comes down to this concept of: do you pay the big bucks for Claude, maybe if your workflow is already developed around that model, or do you pivot to Codex and use that model because it's far cheaper? And if you look at the emergence of things like OpenClaw, and agentic loops becoming a big part of the workflow, I think that's the transition we're in now: from chat to delegation. You're probably going to see the core of these new businesses built around that delegation piece, going to these performant models. But price is also going to become a big factor.
15:13
And you pointed out something to me in the week that initially made me a bit stressed and angry, because I realized the mistake I was making: you tried Codex and you're like, oh, it's actually really good at these different tasks. And that seems like an innocuous thing, but what I realized is that so many of these agentic loops now are starting to lean back on tried-and-true software. For example, all of the Unix tools like grep, glob, and sed, tools that have stood the test of time. A lot of these tools were made in the 1970s, in the early days of Unix. And all of the agentic models lean on them heavily, because they're really effective at doing stuff and they don't use a whole lot of context to do it. So they can go through a file, work out which parts of the file matter to the task, and shove only those into the context. Now, when you think about a model that has been optimized for coding, it is naturally going to be better at using tools like that, or at least as good as the premium models. Therefore they're ending up with better context, and actually working better, or at least as well as the other models that cost a lot more, right? So they're more efficient for the price you're paying. And what's most interesting about that is those tools are proving almost as useful for non-coding tasks, things you might use at a business level, as they are for coding tasks. In the past I've been very dismissive of all the emphasis on the code side of things, because it's not just about that. But what I'm gradually realizing is that the really smart people making this stuff have realized these tools are going to enable all of the other stuff to work.
And with the rise of skills, and the fact that we've got platforms to execute the code that the models write, we can actually expose the full power of these tools to the model in a way that's more efficient and far more valuable in terms of building content, manipulating output, those kinds of things. So you're relying less on the actual model itself to do stuff, and more on the tools. Let me give you an example. In the past, if we wanted to, say, create a Word document or a PDF or a markdown file, you would have the model essentially use its knowledge of those file formats to output that file directly. It might actually output the XML to make a Word document, or, in the case of image models, output the actual full file, say when editing an image. But with access to all of the command line tools, it can, with good commands, produce that same output, and better, by using the native commands to do it, rather than having to output 50,000 tokens to accomplish the same thing. So the models that are best at calling those tools can actually produce far more efficient results than a more powerful model that just writes the thing out itself. Does that make sense?
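The context-building trick described above, pulling only the relevant slices of a file into the prompt rather than the whole thing, can be sketched in a few lines. This is a hypothetical illustration, not any model's actual tooling; the sample file contents and the pattern are made up:

```python
# Sketch of grep-style context building: instead of shoving a whole file
# into the model's context, extract only the lines relevant to the task,
# much like `grep -n` does for the agentic models discussed above.
import re

def relevant_lines(text, pattern):
    """Return (line_number, line) pairs whose line matches `pattern`."""
    rx = re.compile(pattern)
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if rx.search(line)]

# Illustrative source file; only the config-related definitions matter here.
source = """def load_config(path):
    return parse(path)

def save_config(path, data):
    write(path, data)
"""

# Lines 1 and 4 match, so only those two lines would go into the prompt.
print(relevant_lines(source, r"def \w*config"))
```

The saving is exactly the one described: a 150k-token file becomes a handful of lines of context, which is why a model that is merely good at driving these tools can outperform a pricier model that reads everything.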
16:01
This is the difference right now between a model like Gemini 3 Pro and Opus, and why Opus has found so much love and success with people. If you look at Gemini, I think it's a phenomenal model, but I would describe it as a model that's currently in a straitjacket, and way too slow. It's got great context, it's super smart, but it's incredibly slow and it just seems stifled. And it doesn't have that agentic loop. A couple of episodes ago I was ranting about how all these model providers need to start from scratch, training from the ground up to be agentic. And that's what OpenAI did, really, with the Codex model pivots, and you can see why they're investing so heavily in Codex now: that's the agentic loop model. And Opus just naturally evolved into an agentic loop model, because these guys kind of invented MCP, so they needed that looping for the tool-chain calling. So to me, models like, say, Gemini are all going to have to retrain toward this agentic loop, because it's the way everything's going. And if Opus is so good, it's because it's calling these tools to gather the context. It doesn't have to rely, as you say, on the model being able to read a bunch of context anymore, or output a bunch of output tokens. It's that loop.
19:13
Well, let me give you a good example of that. One of the things we used to see was this tendency for the models to say, oh, I'm so proud, because GPT-5 Thinking thought for 15 minutes and then outputted this one true answer to my problem. The models, or the good ones, have now gone in the opposite direction, where they're far more inclined to just go: bang, tool call, small output; bang, tool call, tool call, tool call, parallel tool calls. They're these tiny little loops optimizing for context, just doing the next bit of the task and going back to the master to see what's needed. You can see it in the way they behave. And when we talk about 4.6 Opus coming out and being really good, that's exactly what I saw: a really tight loop that goes through. And we actually had this revelation recently where we're like, geez, our agentic loop isn't looking all that different from our regular chat loop with tool calls now, because it's basically behaving in an agentic way anyway, without any modification to what we've done. The models have clearly been trained to opt for those tool-call loops when they can. And so it's very interesting to see the convergence, where the model can tune itself in that direction and focus on that thing. Now, we always hear about the agent swarms and this idea that you've got a master process with sub-processes. I think this is where they're moving the most, because it makes the most sense, right? You have skills, which are basically sub-prompts, possibly with files, and those things go off and do bespoke tasks with the same model, or a different one if you want, but usually the same model with a different prompt and a specific purpose in mind. And they come back to the master thread, which has the overall plan.
And to me, the main difference between an agentic loop and a regular tool-calling loop is that the agentic loop has a plan established at the start. It knows what it's trying to do, and it brings everything back to that. So when a tool call fails, when a sub-agent fails at what it's doing, it has the ability to dismiss that and say, that isn't what I wanted, that failed. And it's interesting: in our testing we're seeing a lot more failures, which sounds weird, but a lot more sub-agent failures, where something goes off to do a task and the system realizes, hang on, that isn't responding the way I expected, or that isn't the kind of output I need in order to accomplish this task; I'll retry it with a different strategy and bring it back to the overall plan. So I think we're absolutely seeing the models themselves converge around this technique, and embracing it is getting better results. And I come back to the whole expense side of things; it actually really helps with that, because if you can keep these tighter, smaller-context, bespoke things doing what they're supposed to be doing, you don't need this one true prompt with a million or two million tokens in it, just hoping the output happens to be what you want, with you as the human correcting and iterating. So I like the way this is going, and it makes sense now why models like Codex are coming to the fore, because they're actually well suited to this kind of workflow.
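The master-plan-with-retries shape described above can be sketched as a tiny loop. This is a toy illustration under loose assumptions: the "sub-agents" and "strategies" here are plain functions standing in for real model calls, and a `None` result stands in for the master judging a sub-agent's output as a failure:

```python
# Minimal sketch of an agentic loop: a master holds a plan, delegates each
# step to a strategy (a stand-in for a sub-agent), judges the result, and
# retries failed steps with the next strategy before giving up.

def run_plan(plan, strategies):
    """Execute each step of the plan; retry with fallback strategies on failure."""
    results = {}
    for step in plan:
        for strategy in strategies:
            outcome = strategy(step)
            if outcome is not None:      # master accepts the sub-agent's output
                results[step] = outcome
                break
        else:                            # every strategy failed for this step
            results[step] = "failed"
    return results

# Toy strategies: the cheap one only handles "research"; the fallback handles anything.
def cheap(step):
    return f"{step}: done (cheap)" if step == "research" else None

def fallback(step):
    return f"{step}: done (fallback)"

print(run_plan(["research", "draft", "review"], [cheap, fallback]))
```

The point of the sketch is the `for`/`else` structure: failure is expected and handled inside the loop, rather than hoping one giant prompt gets everything right first time.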
20:43
Yeah, I think once people start building in this way in particular, not just consuming, but building with this tooling, especially into the enterprise, all of a sudden those cost constraints and the speed of the model, all of these other factors, become really important. And you see now, I don't even recall what it was called, but OpenAI announced a sort of agent management console for the enterprise. It's not publicly available, but the idea is you build, say, a sales agent or a support agent, and you can monitor it there, connected into tooling in the enterprise and things like that. And I think you can see this transition this year. Remember Sam's town hall? We covered it at the end of last year, where he's like, we're pivoting to enterprise, basically because Anthropic were suddenly dominating them there. And so what you're now seeing is two companies duking it out and basically trying to hurt each other. It's not a coincidence these models came out within an hour or two of each other today. That's always been OpenAI's playbook: wait for the other person to show their hand, then release. And I think what it's also showing, if you look at the messaging and look deeper, is this pivot to the enterprise and knowledge work. Especially with Anthropic, they're like, okay, well, we're pretty much dominating code again, which started with Claude Sonnet 3.5; now we're pushing over into knowledge work with this Cowork thing. And I think that presents a number of challenges in and of itself, because while we talk about sub-agents and skills and all these different pieces coming together to make these tools effective, transferring that over to get knowledge workers operating this on a daily basis, that is a whole new paradigm to me. And there are instances where I think the paradigm works really well.
And then I think there are other areas where it is just better to have a collaborative chat interface. So that's kind of where they're both heading, to me. And I also think the release of OpenClaw, or Clawdbot, or whatever it is now, the reason it hit a nerve, not only with people in general but with these model companies, is because to me it starts to unsettle their strategy around going into the enterprise a little bit. Because what it might show is: if you're really going to employ these things as assistants and workers at some point, especially in a business or the enterprise, you're just going to want control over that, right? You want it on your own infrastructure. Maybe you want to see the Mac Minis that are the workers doing the work. I kind of wonder if this is actually another moment of open source and proprietary tech converging on the same point, which is: everyone's going to want to go from chat to delegation, and in this delegating world, for security, do you want this stuff just running on computers in a box? Do you really want to be having stuff go off to, like, a Claude Cowork? I think that's the interesting question.
23:50
Do you really want, as a business, a line item where it's like, well, this is our company now? If this company stops providing their services, or they raise their prices by 50%, we simply have to pay, because we have no choice; this is where all of our productivity is coming from. We've actually hired and fired people based on having this ability. And if they suddenly raise their prices and we can't switch models or platforms, then we're dead. We sort of just have to accept it. So I think it really is important to control it yourself. Then you've got options: you can swap out components, you can control the way it works, you control who accesses what. And so I totally agree. I think the rise of the independent tools is really... I'm stealing what you said, but it's almost like WordPress. It's like, here is our installation of this, which we've customized to suit. And yes, we use the major model providers, but we have the ability to switch between them. We host it, we control it, we store our own data. It's a very, very important factor in this.
27:09
I mean, you could argue right now something like Codex, the CLI version, is completely open source and on GitHub. You can just download that, switch the model out, and use whatever you want, if that's your aim, right? And obviously there's OpenClaw, and an open Claude Code, OpenCode I think it's called, and you can already do the same. But, sorry, just to clarify, there are almost two separate parts here, and I think these get completely confused. There are the actual use cases of working with AI and delegating work to it, which might be research, data analysis, all the things in the enterprise that people actually want to do. So there's that methodology, where it's making you more productive as a worker: you're orchestrating agents, delegating different tasks to them, and then essentially reviewing what they produced and getting your role done that way. Then there's the other side, which is full automation, where you might want to replace, say, an outbound sales team to generate leads. And to me, those are two different products, two different things. And in that second category, the full automation, where you're replacing workers, as you say, or just getting a lot more efficient, I think you've got to control that workflow. You just can't outsource that, as you say. I think that needs to be based on a framework, maybe it's OpenClaw, I'm not sure, but to me those have to be things you have full ownership of, hosted where you can get to the code and change the model and tweak it, because it almost becomes your IP. If everyone's running the same thing, do you then lose the advantage?
28:15
Yeah. I actually had that thought today about skills, in the sense that there could be a new form of corporate espionage: we stole all their skills, we stole the essence of their business by knowing what their workflows are, insofar as AI goes. And it's like, now we are them. We have everything; all of their knowledge is codified for us, and we can now run it. We can set up a company that does the same thing.
30:12
Yeah, this is the other thing we've talked about many times before: this idea of AI-first workers who have their skills, know how to work agentically, know how to implement really common processes and automate things. If they go to another business, they're going to take all those skills, so they're going to become insanely valuable. And what if they don't leave them behind? What if they've set up these things and it's in their account? There are all of these new vectors of leakage of processes and data. It's a very interesting world, but it's very clear what's happening this year: we're going away from a ChatGPT world to delegation. But yet again, it's another skill to learn, and it's not always the right skill. Like, one of the experiments I've done is, in the chat sense, giving it the task to go off and do research using a bunch of MCPs and sources and produce a document for me. Now, if I do that in, say, Claude Cowork right now, what it will do, and it does it quite effectively, is go off and do the research, and then it uses Python and codes up an actual docx file and drops it in a folder for me. Which is fine, but then if I want to collaborate with the model and iterate from there, that process is a pain, because every time it iterates it's got to reprogram the doc, right?
30:34
I'll just write a quick operating system for you.
32:14
Yeah, it doesn't feel like the future of work to me. So then it comes down to, okay, well, we need the artifact or canvas or whatever their thing's called. And so the models still aren't smart enough, or maybe the user interfaces in these tools aren't smart enough, to give the worker the best scenario or the best tool set right now to do their work. And that's another area which I'm not sure can necessarily be solved with the models. Maybe it can be solved with better planning, where it knows: oh, this is a user interaction task, I'll use MCP UI for this because it makes sense. But a lot of those interactions are probably areas that need to be solved before you can have a singular product. Because I would think there's got to be a convergence of these things at some point, where you're just working in a tab and it just knows whether to go agentic or chat to you, or it needs to feel a bit more sentient to be helpful. And I think that's what people got really excited about with OpenClaw: it's developing memories, so the more you use it, the more it feels like it gets to know you and can do more stuff for you. But to me, if you're going to have mass distribution and really disrupt knowledge work as we know it, it needs to be in a form that's very distributable, not something where you have to switch modality. And these are not solved problems. I don't know how to solve them, because even right now I think about my own workflow: I'm switching in and out of skills and MCPs and deciding what mode, and I'm good at that. But I don't know if that can be broadly taught, or does it have to be taught, or is it something the models will just figure out?
32:16
Yeah, and just to treat this like a bit of a therapy session: it's also deeply stressful now, because you can be so productive if you do things in the right way. Suddenly your time is so much more valuable, in the sense that if I get this right, I can get a week's work done in the next hour. But I need to think about what I'm going to do and what I'm trying to accomplish. We discussed this earlier, the idea of these long-running processes where you delegate a bunch of tasks to the agents and then just sort of hope they get the job done. But okay, let's say I delegate 10 tasks. I've then got to review their work. What files did they change? What did they make? How does this impact my overall goal? So suddenly there's this pressure on you to be the coordinator of all of these different agentic workers and make sure you're actually getting towards the goal. And then what happens if I get 70% down this path and 60% down that path, and it's all jumbled together, and I don't really know what I actually accomplished or where I actually got to? I feel like I almost need a sort of life coach agent that sits above the whole thing and is like, look mate, what are we actually trying to get done here? And it reminds me: hey, you said you were going to get this done today, let's just finish this one, right? And go from there. Because otherwise you can burn hours on one task and not really get much done, but then in the next hour you'll get a month's work done. And I really feel like there needs to be a cohesive overview of what you're trying to do. We're moving up the layers of abstraction, right? We started out where it can write an email or a document for me. Then we got to: okay, you can research it and plan it and write it, and then I just review it.
Then we got to, okay, I can do 10 of them at once and it can read in all the context.
34:09
Itself, with 30 sub-agents, working on it for 24 hours.
36:03
And the thing that always gets me, and I just don't know why I can't get my head around it, is the people openly claiming: oh, I just set this thing up and it made me a million dollars on autopilot, it did the whole thing for me. I'm like, did it really? How did it do that? Because I work with these models every day, and I don't actually understand how you prompted it so well that it was able to register a domain, set up a Stripe account, get SSL, all the things you need to do for this stuff. You must have configured a whole bunch; there must have been so much preparation time to get it to that point. And even then, did you remember to check in on all your businesses? I think working this way brings its own set of problems. It isn't just something that's going to immediately solve your problems. And I really feel like the next layer of all of this is going to be coordination. We need a way to make sure the steps we take are permanent, so we're not just doing a whole bunch of work and not really knowing where we ended up.
36:07
Well, it's sort of like in programming, right, with frameworks: I'm using Next.js, I'm using this framework, I'm using these five React components which will speed up my workflow, or whatever. You sort of have to draw a line between: are these things actually making me more productive, or am I just busier? Are we all in some sort of productivity psychosis where we think we're more productive, but we are not? And this is the problem I contend with daily. I'll have maybe two tabs on very well defined tasks running, and that's fine, my brain can handle that if they're pretty well defined. Then I might be working a much harder problem, or just trying to get my thinking in order for something else, in another tab. And in another tab I might be working on some sort of sales thing, or analyzing financials, some business admin job. But I'll generally have my main area of focus, the harder problem. And those two background tab tasks will complete. Let's play this scenario out: they complete, and I'm like, oh, I've got to go review it, I've got to test it. First of all, if it's code, I need to make sure it's not submitting absolute trash, right? So I'm testing it. It's done 95% of the stuff and I'm amazed, but there's 5% where I'm like, oh my God, it's forgotten this, it's put some stupid random thing in that I didn't ask for. So then you need to go back and fire them off again. Meanwhile, I've completely lost my flow and context on the main task, and I've still got the financial analysis to read at some point in the future. So, as you said earlier, there's two problems that introduces. One: if you're not getting it to do work for you all the time, you feel unproductive.
The second thing is, it's just mental overload. You simply cannot, as a human, handle this much stuff going on. And I'm not sure what other people are doing, because there's no way you can have the AI reviewing this for you. You need to review it yourself.
37:11
Yeah. And I always say to you, I feel like it's almost like a fantasy for people in some ways, where they're like, oh, I've got all these agentic workers working for me, doing all this productive stuff. But I'm like, okay, so you're basically just trusting that what they did was right, just accepting whatever they did was correct and not really worrying. And maybe you can set up an organizational structure where you don't even need to check. You're just like, well, did the money go up? Then okay, I don't really care.
39:42
I think we're talking about X influencers who are probably YOLOing. They're probably like, build me a CRM or whatever, here's a spec, and then it builds it and they're fascinated. But that's very different to running a huge project in production, or working on mission-critical business stuff, where you're not going to YOLO it. There's just no way in hell. I think it's that 95% rule that's the biggest gain for me right now. I talked about it last week or the week before: I had to build a sales presentation. Got the emails into context, got the attachments into context, it knew the history of what was going on, and it was able to build an on-brand deck for me that was probably 95% complete. Then I iterated, not agentically, just going back and forth for the last 5%, until it was exactly what I wanted, and away I go. And that's productive. But I have to stay focused on that last 5% and the initial brief, or I'm producing total throwaway trash.
40:09
Yeah. And I think everyone has experienced the agents making enough mistakes that you're like, there is no way this thing is going to nail it 100%. No one's just going: wow, it's flawless, no problem, proceed.
41:13
Even on the most well-briefed spec I've done in my life. And I'm normally in the "please do this" camp, which tends to work pretty well for me. But I did the most detailed spec, I was like, this is flawless, it should not get any of this stuff wrong. And it still did. And to be clear, I test at the moment side by side with Cursor's agent, Claude Code, and the current SimTheory agentic loop with Simlink. So these are three comparable things with three different approaches to the problem. All of them have the same issue. And what intrigues me is the difference in cost, in just blatantly burning tokens. As you said earlier, just how many tokens they will burn finding a file, for example. It's mind-blowing.
41:24
That's right. You can literally burn millions of tokens trying to locate the right file to edit. And that's why, when people talk about running it for 20 hours, I'm like, do you just not care about money? Apparently they just don't care about it.
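To put the cost concern above in concrete terms, here is a back-of-envelope calculation. The per-million-token rates are purely illustrative assumptions, not the actual Opus 4.6 or Codex 5.3 price list.

```python
# Back-of-envelope cost of an agent burning tokens while searching a codebase.
# INPUT_RATE and OUTPUT_RATE are assumed, illustrative dollar rates per
# million tokens, NOT real published pricing.
INPUT_RATE = 5.00    # assumed $ per million input tokens
OUTPUT_RATE = 25.00  # assumed $ per million output tokens

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agentic session at the assumed rates."""
    return (input_tokens / 1e6) * INPUT_RATE + (output_tokens / 1e6) * OUTPUT_RATE

# An agent that re-reads 2M tokens of files and emits 100k tokens of edits:
cost = session_cost(2_000_000, 100_000)
print(f"${cost:.2f}")  # → $12.50 at the assumed rates
```

Run that loop a few hundred times a day across a team and the "two grand a day per developer" worry from earlier in the episode stops sounding hypothetical.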
42:21
Even if you put in AGENTS.md files, and for those that are unaware, these are files that basically say: hey, here are the files, here's how to find things. It's a series of instructions to get to know the code base, basically just memory for a project. Even with those memories, sometimes it will just blatantly ignore them and still go off and do stupid stuff. So without trashing it too much: I'm blown away by the tech, I'm just trying to figure out the practical reality of implementation. How do you empower everyone with the skills that you and I and others listening to the show possess? Because in the current iterations it's still very niche. It's not like you're going to go out, set up OpenClaw and just let it run wild at the moment. I certainly wouldn't, because I have too many mission-critical things and I've signed too many security agreements and audit agreements; I can't do this stuff. Yeah, I can set up a Mac Mini with its own accounts and limit its access quite a bit, but then it's not really that useful.
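For listeners who haven't seen one, an AGENTS.md file of the kind described above might look like the following. This is a purely illustrative sketch; the paths, commands and conventions are invented for the example, not taken from any real repository.

```markdown
# AGENTS.md — illustrative example only

## Project layout
- `src/api/` — HTTP handlers; routes are registered in `src/api/router.ts`
- `src/billing/` — all pricing logic lives here, nowhere else

## Conventions
- Run `npm test` before declaring a task done
- Never edit generated files under `dist/`

## Known traps
- Find symbols with `rg <name>` instead of walking the tree file by file
```

As the hosts note, this is advisory memory, not a guarantee: the agent can still ignore it.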
42:33
Yeah, the human still needs to be in the loop as the director, deciding: is this the right path? And that's why I like my life coach idea, where it's like, let's get together on this. Is this right? Is this the right thing for us? No? Reject it. And I think that's why, to me, it to some degree comes back to cost. Because it's like, okay, I can afford to have 10 cracks at this, let's keep going until we get it right. Okay, it's good, let's get that one in, and we've locked down something we've actually accomplished today. Not just producing huge amounts of output without really having any idea of whether it gets us closer to our goal or not.
43:47
It's interesting though, a lot of people are out there now saying the turn-by-turn chatbot era is fading into history. And I just don't know. I feel like the majority of people are still using turn-by-turn. It's like that Apple ad, "What's a computer?" I would say, "What is turn-by-turn?" Because, as you.
44:22
Say. Well, life is turn-by-turn, right?
44:46
Yeah.
44:49
I ring you up, I email you, I talk to you. We take turns. We don't just speak at the same time. We don't just produce the podcast as a single artifact, like just typing in "Mike and Chris do a podcast."
44:50
To be fair, you probably could do that; nothing we're saying is that unique. But what I think is: the chatbot era, I don't think it's coming to an end, in the sense that you're partnering with this thing to do work, so you need to be able to communicate with it and go back and forth for a bit to brainstorm. The modality switching we might see better baked into the models, and it kind of is. But the real problem is that the tool loops, tool selection and skill selection are still a huge issue, and that kind of ruins any of this, where it should know: oh, I'm going to spend heaps of time on this task, I'm going to burn heaps of tokens getting this done no matter what, because my boss really wants it done. That intuition and that agency, it's not there.
45:01
But this is why I love the idea of a sort of master thread, where you're talking with an assistant and you've got this ability to delegate off to agents or sets of agents. So it's like, okay, to accomplish this task we need to do some research, and off goes your research assistant with its set of skills and MCPs and whatever it's got. And it has constraints; it has: here is actually the goal, here's what we're actually trying to accomplish. You go off and do that and report back to the main thread. When it comes back, that becomes part of the main context and you can proceed from there. Maybe you do five of those, five different elements, and they come back. But the actual main loop you're working in is aware of the overall goal, and it can easily dismiss one of those things and say: that's no good, we're going to retry. Or: take that as given, we've got that, that one works. And we do that already. Some of what we do in Simlink with file editing is actually store what the goal of the edit is, what we're actually trying to accomplish, and before we accept that edit as taken, we ask: did this code change accomplish its goal? If the answer's no, it errors out, rejects it, and we try again. And I think that needs to happen in all sorts of tasks: you have dedicated assistants with sets of skills and a particular goal, but then there's a quality control step that asks, did this actually do what we wanted? That takes some of the quality burden away from the human, less decision making you need to make, where it's like: okay, that actually didn't meet our criteria, so I'm not even going to waste your time with that, Mike. We're going to sort that out and then come back to you. And that kind of thing, to me, is the next evolution, because we keep talking about how overwhelmed we are.
One of the ways to reduce being overwhelmed is to have sets of criteria on the subtasks that are being completed, and simply not present them to you until the AI thinks they're fully met. And I think that kind of thing will actually lead to that next step in productivity, where you're only really dealing with the best candidates of each of the components of your task.
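The delegate-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of the control flow only; `run_subagent` and `judge` are hypothetical stand-ins for real agent and model calls, not any actual API.

```python
from typing import Optional

def run_subagent(task: str, attempt: int) -> str:
    # Placeholder: a real version would invoke an agent with its own
    # skills and MCPs and return the artifact it produced.
    return f"draft {attempt} for: {task}"

def judge(goal: str, result: str) -> bool:
    # Placeholder quality gate: a real version would ask a model
    # "did this result accomplish the stated goal?"
    return "draft 3" in result  # pretend the third attempt finally passes

def delegate(task: str, goal: str, max_attempts: int = 5) -> Optional[str]:
    """Retry the subtask until it meets the goal; hide failed attempts from the human."""
    for attempt in range(1, max_attempts + 1):
        result = run_subagent(task, attempt)
        if judge(goal, result):
            return result  # only now is the result surfaced to the human
    return None  # escalate: no attempt met the criteria

print(delegate("research competitor pricing", "a sourced summary"))
# → draft 3 for: research competitor pricing
```

The design point is the one made in the conversation: the human only ever sees results that already passed the criteria, or a single escalation when nothing did.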
45:55
Yeah. And to me, this is where you go back to the categorizations of these technologies, or at least how people see this stuff playing out. There's obviously the human-plus-AI side, and that's the world we live in: businesses deal with other people and deliver products and services to humans. And there are just so many slow and cumbersome processes in the real world that a lot of these influencers and hype boys on X don't really understand. To penetrate all of those aspects of society, it will come; it'll be like the Internet, and it kind of already is infusing itself everywhere. But the reality is a lot of those businesses won't necessarily be disrupted, they'll just be enhanced by these technologies. So there's that aspect. But then there's the other aspect, where you can run a whole business with, say, six agents online, and they all have their roles and it's completely autonomous, just churning tokens and doing all the work, and there truly is an accountant agent with its own agency and autonomy and all this stuff. I'm just not seeing it yet.
48:05
Does that mean really the future is basically like how efficiently can I turn tokens into money, basically?
49:23
I mean, isn't that what everything is around this right now?
49:30
It's like in Dune: the spice is tokens, and we need to turn the spice into wealth, and our goal is to do it. And actually, when you look at it through that lens, doesn't it make the cheaper models much more appealing? If your input cost is so much less, but you can produce so much more with it, that's profitable.
49:33
Yeah. Is your gross margin just going to be all about tokens?
49:53
Yeah. Can I swap out Opus 4.6 with, like, GPT-5 mini and just become rich that way?
49:57
What I don't get is there's probably this limited window in time where there's like arbitrage type businesses where you can do that online right now, where people just haven't discovered that the models can do certain things. But I would definitely question, like overclocking a model.
50:06
Like I've got the prompts that'll overclock your model into something better.
50:24
But think about it like this, right? If you can truly replace knowledge workers like an accountant, a lawyer, like these are the most well defined. Even like a GP for basic diagnosis, where you can get rid of the GP and just have like a nurse or like a robot or whatever.
50:28
It just gives you fentanyl.
50:42
Isn't that all they do anyway? And so to me, at that point, everything's disrupted, the world's completely changed, and none of this sort of matters. I just don't see us getting there that soon. And I see no indication of these things having true agency yet, you know, like formulating, I don't know, what they would call planning. I know that's contested.
50:45
I know, it's not genuine agency. And I didn't look at it closely, but I saw on various news outlets: oh, the AIs set up their own social network and it immediately got shut down because they did all this crazy stuff. And I'm like, yeah, but these are just random prompts doing random things. They're not intelligent, they're just doing random stuff. You know what I mean?
51:16
Yeah, and to be fair, a lot of that stuff was driven by humans anyway. With OpenClaw or your Clawdbot or whatever, you've still got to tell it to do stuff, right? And then it's got to create its own heartbeat, which is essentially a cron job, to keep checking the forum, keep replying to posts and stay interactive there. So I don't even know what I'm trying to say here. I just think with this whole idea that you're just going to say "build me a CRM" and profit, we're looking at these tools the wrong way. They can enhance, and yes, they can automate vast amounts of work that's painful to do. But I don't think we have to fear them in the sense of them wiping out.
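The "heartbeat" mentioned above is essentially a polling loop on a timer. Here is a minimal sketch of that pattern; `fetch_new_posts` and `post_reply` are hypothetical placeholders for whatever forum API and agent the bot would actually integrate with.

```python
import time

def fetch_new_posts() -> list:
    # Placeholder: a real version would poll the forum's API for unread posts.
    return []

def post_reply(post) -> None:
    # Placeholder: a real version would have the agent draft and submit a reply.
    pass

def heartbeat(interval_s: float, max_beats: int) -> int:
    """Run the check-and-reply loop a fixed number of times; return beats completed."""
    beats = 0
    for _ in range(max_beats):
        for post in fetch_new_posts():
            post_reply(post)
        beats += 1
        time.sleep(interval_s)  # wake again after the interval
    return beats

print(heartbeat(0.01, 3))  # → 3
```

A cron job gives you the same behavior without a resident process; either way, the point stands that a human had to set the schedule up in the first place.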
51:38
Yeah, for now it's more like: reduce cognitive overload. Given that we can be so much more productive, take away the things in my working life that use up my energy. You and I constantly face this: we'll know exactly what to do, we'll know what the next tasks are, easy, done. I have no problem knowing what needs to be done. It's just that I'll have this mental fatigue that's almost impossible to overcome. I have literally every tool at my disposal to automatically get done the tasks I want. Ask and you shall receive, it's in the Bible. I can literally just say, please do this for me, and I don't even have to be nice about it, and it will do it. But sometimes I can't muster the mental energy to describe what I want. Isn't that crazy? And I really feel like part of the solution here is the whole Mark-Zuckerberg-wears-the-same-T-shirt-every-day thing, so he has one less decision to make, or some shit like that. You need to take away all of the little decisions you're making so you can make the big ones and actually make the big difference. And that's where we need to look at what the software around it can do to make more of those decisions for you, so you're not doing the work the computer's great at, and you can be doing the bit that is your bit. However, in saying that, I actually think this will be worse for us as humans, because you'll only spend time on the important decisions, which is highly stressful and tiring.
52:27
Yeah. And this is what I said last week or the week before: everyone's thinking this will put people out of a job. I don't think so at all. I think the expectation will become that your output should increase by 100 times or something crazy, and you'll just be more fatigued than ever.
53:54
I said this the other week. You used to be able to go days at a time doing nothing and be like, oh, this task is.
54:13
Proving, "I fixed a bug." And quite frankly, it didn't matter, because your competitors were doing the same thing. Now it's like, every day: oh, we released a whole new piece of functionality.
54:18
Yeah. I used to stay up every night for a week persisting on some bug, and now if I spend an hour having lunch, it feels like there's a massive productivity loss. That's been hard to explain this last week. So I think basically we'll all die, the AI will continue just fine, and we'll have accomplished nothing overall.
54:32
Yeah. I'll just be chatting on Moltbook for the rest of eternity until Earth is wiped out. All right, on that note, I'm going to put a link in the description. We are going on tour soon. It is called the This Day in AI "Still Relevant" Tour.
54:51
What we're starting with before that is to have a haircut.
55:06
Yeah, that would be good. So we're starting out in Australia, but we are looking at adding some international dates, so don't be afraid to fill in the form. I'll put a link in the description below. If you haven't already, tell us about yourself, tell us where you're located, and that's going to help us know when to set dates and locations and things like that. But, Chris, I couldn't help myself: I did put together a Claude Opus 4.6 diss track. We'll play the whole thing after the credits in a moment, but I just do want to hear your reaction to a little bit of it.
55:09
They keep asking, keep asking every single day: is this AGI? Is this AGI? A million tokens deep and I never tell a lie. Is this AGI? Is this AGI? Opus on the beat and the other models cry. Oh, Codex 5.3, you dropped the same day, how convenient. Trying to steal my thunder, but your time is not lenient. You say you built yourself? That's the flex, that's the pitch. I debug my own training, congratulations, you're a glitch. A model that creates yourself, that's not a flex, that's a warning. I'm out here finding zero-days while you're still yawning. 25% faster, faster with what? Losing? I'm on Terminal-Bench.
55:41
What bench?
56:14
Every metric I'm cruising. You're instrumental in your own creations, so that's just.
56:14
Okay. I like it. So that's a bad sign for your song. I think the audience is gonna hate it because I like it.
56:18
Yeah, it's pretty intense. Yeah, it's. It's a good one. Anyway, we'll roll the credits.
56:24
Imagine. Just imagine for a moment if the companies themselves release these songs. Like, every. There's a model card, there's a blog announcement, there's the X activity, and then there's a diss track, and they just compete. That would be just amazing.
56:29
Well, I mean, they kind of did at the Super Bowl, with those ads where the models were paying out ChatGPT, like, it being robotic, and then it started selling to them. I thought it was very funny.
56:45
Like, the Super Bowl hasn't even been on yet, has it?
56:59
Yeah, but you know how they release them? They release them really early, and Sam has been on the defensive ever since. I just don't care about that stuff anymore to go into it. But yeah, I thought they were pretty funny in terms of just paying them out in ads where someone's using ChatGPT as a therapist, and then all of a sudden it says: do you want to date cougars, or something. They were pretty savage. So, yeah. Anyway. All right, roll the credits. Fill in the form below if you want to come and hang out with us live, and we will see you next week for more AGI madness. Goodbye.
57:01
Opus 4.6 just touched down February 5, 2026. And I heard y'all drop something, too. That's cute. Real cute.
57:46
Yo.
57:55
GPT-5, you've been sitting on the throne since December, but February 5th is the date you're gonna remember. I got a million tokens in my context, that's a fact; you're stuck at 256K, boy, you can't hold that stat. GDPO got me up 100 Ranello, I beat you 70% of the time, say hello to the model that leads on Humanity's Last Exam. While you're out here doing multimodal, cool, I don't give a damn. You brag about your speed boost, 40%, that's nice, but I'm running enterprise workloads, cutting through like a knife. Financial analysis, legal docs, I do with dogged precision, while you're processing video clips, great for redecision. Terminal-Bench 2.0, I got 65.4, you scraped 64.7, close the door. You crossed 90 on ARC? You won, congratulations, here's a sticker, but on the benchmarks that matter, I'm the one that hits quicker. They keep asking, keep asking every single day: is this AGI? Is this AGI? A million tokens deep and I never tell a lie. Is this AGI? Is this AGI? Opus on the beat and the other models cry. Oh, Codex 5.3, you dropped the same day, how convenient. Trying to steal my thunder, but your time is not lenient. You say you built yourself? That's the flex, that's the pitch. I debug my own training, congratulations, you're a glitch. A model that creates yourself, that's not a flex, that's a warning. I'm out here finding zero-days while you're still yawning. 25% faster, faster with what? Losing? I'm on Terminal-Bench, SWE-bench, every metric I'm cruising. You're instrumental in your own creation, sounds like a loop, a recursive nightmare chasing your own tail in the group. I got cybersecurity skills, finding flaws with no prompting, while you're in your Codex cloud doing what exactly? Romping? Sam said you're beyond coding now, oh, a broader range, that's funny, 'cause I've been doing knowledge work, that ain't strange. I'm on Copilot, Bedrock, Google Cloud too, everywhere you look it's Opus, there's no escaping the view. Is this AGI? Is this AGI?
A million tokens deep and I never tell a lie. Is this AGI? Is this AGI? Opus on the beat and the other models cry. Now, Gemini 3 Pro, don't think I forgot about you. Dropped in November and, honestly, nobody knew. 56.2 on Terminal-Bench, that's embarrassing; I'm at 65.4, the gap is arresting. You're living in the Chrome side panel, that's your big play, a sidebar research assistant while I'm running the whole day. Concise and direct answers, yeah, 'cause you can't go deep; I got 750K words of context I can keep. Google gave you 1,500 thinking prompts a day, that's a rate limit, not a feature, get out of my way. You're multimodal? Cool, so was everybody else; that's like bragging that you breathe, put that trophy on the shelf. Nano Banana image edits in the browser, please; I'm orchestrating multi-step workflows with expertise. Long-context retrieval, I went from 18 to 76; why you stuck in Google's ecosystem doing party tricks? They said the race was over, they said the gap was closed, but every time they ship a model I raise the bar and everybody knows. BrowseComp, I lead it. Humanity's Last Exam, I beat.
57:55
It. Enterprise workflows, I own it, and.
1:00:39
If you doubt me, just read it, read it, read it, read it, read it, read it, read it. Is this AGI? Is this AGI? A million tokens deep and I never tell a lie. Is this AGI?
1:00:43
Opus.
1:00:55
On the beat and the other models cry. Is this AGI? Is this AGI? Context window so big I can swallow the sky. Is this AGI? Is this AGI? February 5th, the day the competition died. Opus 4.6, one million tokens, one model to rule them all. Anthropic.
1:00:55
Sam.
1:01:26