Gemini 3.1 Pro, Claude Sonnet 4.6 & The OpenClaw Hire That Killed the Chatbot Era - EP99.35
The hosts discuss two new AI models: Google's Gemini 3.1 Pro and Anthropic's Claude Sonnet 4.6, analyzing their performance in agentic workflows and tool calling capabilities. They also cover OpenAI's acquisition of OpenClaw, criticizing OpenAI's lack of focus compared to Anthropic's consistent model improvements.
- Smaller, cheaper AI models like Haiku are performing nearly as well as frontier models in agentic workflows when given proper context
- The future of AI applications will likely use model mixing - different models for different tasks based on cost, speed, and accuracy requirements
- OpenAI's acquisition of OpenClaw suggests they're struggling to innovate internally and are buying distribution rather than building
- Tool calling accuracy and reduced hallucination are becoming more important than raw intelligence for practical AI applications
- Current AI model pricing may be unsustainable for mass enterprise deployment, favoring optimization toward smaller models
"I plan on running these things like mad. I plan on looping them for 80 iterations, 100 iterations. I can't afford to run any of the top line models at their current prices"
"When the models get it wrong, they get it wrong on such a scale. You're like, how do I unpick all the damage it's done?"
"If there's a bubble in AI, it's pricing a million tokens at $25 and beyond"
"No one can afford to, like, roll out the Anthropic top-tier models right now on a mass scale inside an enterprise"
Chris, this week we have not one but two new models: Gemini 3.1 Pro, released just a couple of hours ago, and also a new Claude Sonnet 4.6. But before we get into talking about the new models, a few little reminders. First of all, I heard you loud and clear: people wanted the song from last week, "Is This the End". A little reminder here.
0:02
AGI is coming and there's nothing we can do.
0:33
So it's now on Spotify.
0:38
So hopeful.
0:39
Yeah, it's very, very depressing, but it is on Spotify, so you can check it out by This Day in AI, "Is This the End", wherever you get your streaming music. The other quick reminder, and I promise I'll shut up next week about this because there are enough registrations now that we have a fair idea, but if you do want to hang out with us on our tour, fill in the form below and let us know where you're based in the world and what you want to hear from us. We are going to go on tour later this year and it should be a lot of fun. So fill in the form below if you're interested in that. Now we'll start with the latest drop from Google: Gemini 3.1 Pro. This is a supposed tune on what was Gemini 3 Pro, and it is allegedly better, performing at two times the level of Gemini 3 Pro on the ARC-AGI-2 benchmark. Although a lot of people are saying it was a little bit of benchmaxing, which is where they optimized the model for the benchmarks, and of course no one would do that. The key features of the model: it's still the million-token context window, and they're talking about how good it is at vibe coding, like SVGs. Again, I don't understand why this is a benchmark. But there is something new about the model, which is this thinking control. Do you want to speak towards the thinking control?
0:39
Also, when they released Gemini 3, they moved from having a thinking budget, which some of the other models do, which is a certain amount of tokens the model has to think before it gets to the real work, to a model where there's just low, medium and high. But Gemini 3 preview only had low and high. Now, in my experience, nobody wants low. No one's like, hey, I'm going to pay the max cost and use its least abilities. So we didn't even bother with low, so you really only had high as an option. But now they've introduced medium, which seems better to me, because medium is sort of their auto-switching mode. We see this with Anthropic, where you essentially go auto mode: the model decides if it needs to think for longer or for less based on the task, which is ideal. And then if you want high, you can switch to high. So that's really a big benefit. I mean, I'm yet to test it properly to see that it adheres to what it's saying, but in theory it's a better system.
2:10
The low option to me is kind of weird, because if your app needs low latency, you're just naturally going to go to Gemini Flash, right? Like, that's the brand of that product. And so it is a bit weird. For me personally, when working with these models, I do appreciate that it's just going to decide how much thinking to use at the time. And I must admit, when I use the Anthropic models, that is the setting I use them on. I don't always try to do the max, because then you're compromising speed, which I think is becoming increasingly important, especially as you start working on multiple agentic tasks. Like, I want a faster experience, I want to be able to iterate through things a lot quicker.
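The low/medium/high trade-off described here can be sketched as a simple request builder. To be clear, this is a hypothetical sketch, not the real Gemini API: the `thinking_level` field name, its values, and the model string are assumptions for illustration only.

```python
# Hypothetical sketch of per-request thinking control.
# The "thinking_level" parameter and its values are assumptions
# based on the discussion, not a verified API surface.

def build_request(prompt: str, latency_sensitive: bool, hard_task: bool) -> dict:
    if hard_task:
        level = "high"      # force maximum reasoning for serious problems
    elif latency_sensitive:
        level = "low"       # in practice you'd likely reach for a Flash-class model instead
    else:
        level = "medium"    # the auto mode: let the model decide how long to think
    return {
        "model": "gemini-3.1-pro",   # illustrative model name
        "prompt": prompt,
        "thinking_level": level,
    }
```

The point of "medium" as a default is exactly what's described above: you only pay for long thinking when the task warrants it, instead of always forcing max and compromising speed.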
3:14
Now I agree with you strongly on that. I used Opus 4.6 Max for the first half of this week pretty intensely, because I had some serious stuff I needed to solve and I just didn't want to muck around. And, I mean, I know we're not talking about that right now, but the model is just brilliant. It's absolutely amazing what it is capable of when you have it in that mode. The problem is, and I think we mentioned this last week, if you're working on a single project, or even single kinds of tasks across multiple projects, because the system is now actually making changes itself, it can't really go and do another task in the same project, because there's too much chance that it's going to cross over. Right? So you're really in this almost sequential way of working, and you need to work on totally disparate tasks in order to do it. And therefore the speed of the model on Max, I mean, you're waiting probably fully two minutes at least to get a full response. It's just so long. It really slows you down in terms of your interactivity with performing work. So I would argue, yeah, you're right: a model that's going to gauge how much time is needed is a lot better than just forcing it to max.
3:56
Yeah. And I think there are ways of working async, in the sense of doing all the fancy stuff like git worktrees and so on, but at the end of the day, and a lot of people out there are commenting on this as well, it's just not how humans work. As developers especially, people have spent so long talking about context switching, and now they proudly are like, oh, I switch, including me on this podcast, between, like, 70 tabs at once. Aren't I special? But the reality is, in the first few weeks of the show, being back for the year, I was honestly finding myself continuously anxious because I was multitasking all the time, losing track of things, and I felt I had lost control. And losing control makes you anxious. So this week I've actually slowed it all down, and I have a rule: max two tabs open, max two things at once. That definitely reduces the anxiety, but it also reduces the surface area I have to keep in, like, my RAM, and I find I'm way more effective as a result. So I agree with you in the sense that the next improvement there is just: get me faster at those two tabs, rather than having to have some complex workflow.
5:10
Yeah, I agree with you there. The main word in my head is "solid". I want solid, where there's no backtracking. Once this update is done, I don't need to then go back and look at all the other things that it broke or affected. And I find that so far, working agentically with basically every model I use, I'm able to do that. I'm able to make these solid updates where I'm moving forward and not having to go back and be like, oh God, I broke so much other stuff, and now I've got to get distracted and fix it. Because that's the real time sink: getting distracted by things that have been broken on a large scale. The thing is, when the models get it wrong, they get it wrong on such a scale that you're like, how do I unpick all the damage it's done? And this applies not just to coding, but to documents, images you're creating, diagrams, presentations. You can't be in a position where it makes mistakes. I think that's an argument for letting it go longer with the bigger models. But then again, for a lot of tasks the smaller ones are able to do it too. So there's a trade-off that's hard to define. It's hard to know whether a smaller model would have done this better or not. In a way, you're just guessing.
6:33
Yeah, what a time to be alive, in the sense that we're now so spoiled for choice about models. And I said to you before we started recording the show today: when Gemini 2.5 Pro came out, I was just blown away by that model at the time, because it was really the first model where you had the million-token context window. It could take in so much context and do a great job with it, and it also had, I think, the biggest increase at the time in output tokens, so it could spit out just truckloads of stuff: whole documents, code, whatever you desired. It felt like a sort of Claude Sonnet 3.5 moment to me, where they had really gone after this context window and solved it well. I liked the tune of that model. I still like that model; occasionally I'll play around with it. But it feels like with Gemini 3 Pro, to bring it back to 3 and now 3.1 Pro, they really lost a lot of that essence of what was good about the model. It also entered into a time where we got a lot of great options, especially Opus 4.6. That came out not that long after Gemini 3 Pro and kind of stole all the tailwind Google had recreated with Gemini 3 Pro. So the question coming back today, for me, to 3.1 Pro is: well, okay, this has got to be really compelling at this point for me to come back to it. And I don't know if it's that compelling. It just doesn't seem to have kept up with the times, which is the models moving to agentic loops for most tasks now. In my initial impression, and it's very early and I will test it much longer, it just doesn't seem like they have made great strides in tool calling compared to the Sonnet tunes and the Opus tunes. I was the same
7:45
as you: Gemini 2.5, for a long time, and you can verify this on our podcasts, was my model. I used it for everything. It was the core of everything I did, and it was that long context that was just so amazing. What absolutely killed the Google models in my mind was when Gemini 3 came out and it would just simply forget the point of what you were doing in a task. I immediately assumed it was our fault, because I'm using this through SIM Theory, right? It's our fault, we've got a bug; there's something about the long context where we're not giving it all the information. And I checked it manually. I would print out the prompts and check that they were correct, to make sure we were giving it all the information. And it was simply forgetting within the one prompt. It wasn't forgetting over a multi-turn conversation; it would be a single shot, and within that it would forget what it was doing. So for me that model was just dead, and I basically never use it now. Every now and then it's good for some long-context thing, but in my mind it was just gone. A model that was once my best one went to nothing. So, as you said, when the new version comes out, my initial thought is: well, I hope they've fixed that, because it could be back to the best. But right now we're drowning in brilliant models. And honestly, my model thinking is not between which frontier model I'm going to use; it's more like, how can I get the most out of the reasonably priced models? Because I plan on running these things like mad. I plan on looping them for 80 iterations, 100 iterations. I can't afford to run any of the top-line models at their current prices to do the kind of things I'm doing. Yes, I'm doing it now because I'm experimenting and developing the system, but ultimately, if I'm going to use this on a regular basis, I need a model that's actually affordable and can do the same job.
So really for me, I'm not even comparing amongst the frontier models anymore. I'm saying this is what a frontier can do and this is what a smaller, better price model can do. And that's my comparison.
9:53
I think Google really came out with a bang late last year. And I remember my initial impression of Gemini 3 Pro, in its preview form, was: wow, this is really good. But then quickly, after daily driving it, as you said, it had that path issue, and I know a lot of our listeners who used it quite extensively felt this and agreed, so it's a pretty common opinion of it. You would ask it to do a task and it became obsessed with that initial request and would just be very repetitive, unless, and this is bizarre, you yelled at it, called it an idiot and abused the hell out of it, and then it would somehow self-heal and go in different directions. But it became abundantly clear to me that it was the kind of model that was a one-, two-shot thing, and in a continuous loop it was going to deteriorate. And also the hallucination of those models, that lineage of the 3 series, is horrific. They hallucinate like mad. And so, yeah,
12:00
when you're doing tool calls that take real-life actions, as in modifying documents and code and things like that, hallucinations are your worst enemy, because that's where you can absolutely destroy your time, realizing it's made these massive mistakes. So it really is the most important factor. You can't have that level of hallucination in a frontier model.
13:12
Yeah, and I think the reactions are really mixed on it right now. I was just going through X and Reddit and, you know, all the fun places on the Internet, and it is interesting that there seem to be a lot of Google fanboys out there in those communities saying all the typical things, like it's insane, I used it with OpenCode and it's fast and it follows instructions, yada yada. But then you've got a lot of people saying that it tries to do huge regex replaces in files, which is essentially like cutting and pasting in a file, and it's causing issues, and it's pretty horrific as an agent. Interestingly, when we would first put these models into SIM Theory, I always found we'd have to do quite a bit of tuning to get them to feel right. And maybe because we were so obsessed with Gemini 2.5 Pro back in the day, probably six or seven months ago, we had tuned it quite well for these models. So I just did a side-by-side test, and we'll get to Claude Sonnet 4.6 in a moment, but if I put them side by side here, you've got the exact same prompt in these two windows: make a Geoffrey Hinton doom center that helps Geoffrey monitor the situation. First research what he might be monitoring for, then use HTML, CSS and JavaScript in a single code file to create the doom center. So I just wanted the most basic test to see: can it go do some research, can it then create our doom center?
13:35
It does sound like a standard benchmark to me.
15:15
Yeah, of course. And so you can just see the difference side by side. You've got Gemini 3.1 Pro: it does a single search and then comes back so confident and definitive with what it found. And maybe that's okay; it's not necessarily terrible, it did the task. And then you've got Claude 4.6. Oh, I actually used 4.6 Opus for this, so, whatever. But it goes and calls three very different searches, like Geoffrey Hinton's AI warnings, Geoffrey Hinton's AI existential risk concerns. It does it async: it calls three tools and figures it out. But then, for whatever reason, it calls the new document editor instead of, I guess, putting the code in a code file. So its prompt adherence was actually not that great. It called the wrong tool, and I had to say, no, don't do that, do it in code. Right. And so then you just compare the doom centers. I've got the Geoffrey Hinton one here, you know, whatever. It's the typical sort of vibe-code joke thing at this point; everyone's seen them. It's styled fine. It looks like the original Claude, probably 3.7 Opus, or Sonnet rather. And then you've got the Opus Claude one, which has clearly advanced. It's fully themed; it looks like some sort of control panel. Stylistically the tune is just so much better. And again, I don't love these things, because I don't think they represent real-world usage and it's not that helpful. But just as an initial side-by-side comparison of how both models handle the problem, it's interesting to me. The speed at which the new Gemini 3.1 Pro operated was fascinating. Its ability to output tokens and just go bam with the code is fascinating. But I think it's also that bam thing that becomes a problem, where it goes off, it hallucinates a bit, it makes all these changes, and then you find yourself in this position where you're like, oh no, how do I undo this? What has it done?
Like, it's gone nuts.
15:18
And your example about getting the greps and the search-and-replace, the regex stuff, wrong is actually really important now, because it seems to me like everybody is moving to this Claude Code slash OpenClaw slash whatever kind of agentic workflow, where we're realizing that working with files on a computer is actually better than a lot of the alternatives. So, for example, we're seeing a lot more work where we say, okay, let's dump all the context that we're working with to a drive somewhere and let's use code to manipulate it, to get the actual pieces of context that matter, and even in some cases writing code to make decisions and then using those decisions in the model. This is as opposed to taking all of that context and throwing it into a 1-million-token Gemini window. So a model like Claude, which only has 200k context without, you know, the beta flag, is still able to produce powerful insights on data by understanding the problem, manipulating the context to the point where it extracts what it wants, and then actually only putting that piece into the prompt itself. So it's cheaper: even though the model itself costs more, it's using fewer tokens overall, and it's actually more accurate. And a model not being able to do those things, like you're describing with Gemini, where it's inaccurate in the way it manipulates files, is a much bigger deficiency than it might seem, because this is the way we're seeing the best results right now. The thing that's actually leading to the better results is not something it's strong at. And I was actually saying this to you earlier today, because I was really concerned about context windows, thinking, okay, well, the Claude ones only have 200k, so as you move along with a task you're going to run out.
But with the rise of sub agents and with the sort of workflows where you're working through maybe 80 tasks in a row, you actually don't want that big of a context window because you can't do it. You can't do 80 iterations of something with a million tokens in each one. Even at Gemini's pricing, that's $160 to get through a task. Whereas if you're using the techniques I just described, where you're getting all the context in one place, writing small amounts of code that run with decision processes in there, output that information and put just that in the context, maybe 2,000 tokens, 4,000 tokens, something like that, and all the decisions are made on that basis. You can get through that entire 80 step process only having used, say 200,000 tokens and get the same or a better result than having this massive context window where you're just leaning on the AI as a crutch to be able to handle that context window. So I would argue larger context window and not accurate and hallucinating the tool calls. That is far worse than a tiny context window and the ability to accurately do what you're told.
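The extract-then-prompt pattern and the cost arithmetic just described can be sketched roughly as follows. Only the $2-per-million input price implied by the $160 figure and the 80-iteration, 1M-token workload come from the discussion; the `extract_relevant` helper, the keyword-matching approach, and the ~2,500-token extract size are illustrative assumptions.

```python
# Sketch: instead of stuffing ~1M tokens of raw context into every
# prompt, write small code that extracts only the relevant slice,
# then prompt with that slice.

def extract_relevant(corpus: str, keywords: list[str]) -> str:
    """The 'code that manipulates context' step, done outside the
    model: keep only lines mentioning any keyword."""
    keep = [line for line in corpus.splitlines()
            if any(k.lower() in line.lower() for k in keywords)]
    return "\n".join(keep)

def est_cost_usd(tokens_per_call: int, calls: int, usd_per_million: float) -> float:
    """Input-token cost for a multi-step agentic run."""
    return tokens_per_call * calls * usd_per_million / 1_000_000

# The $160 figure from the discussion: 80 iterations x 1M tokens
# at roughly $2 per million input tokens.
naive = est_cost_usd(1_000_000, 80, 2.0)   # 160.0
# vs. ~2,500 tokens of extracted context per step:
lean = est_cost_usd(2_500, 80, 2.0)        # 0.4
```

The two orders of magnitude between `naive` and `lean` is the whole argument: accurate file manipulation plus a small context beats a huge context window used as a crutch.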
17:39
Yeah, and I think a big part of the consideration, as you say, is cost and efficiency as well. Do you actually need the context window to be as big when you're able to use tools on a computer and extract, or loop through, or fetch what you need at the time you need it with a sub-agent? And for those still thinking, what the hell is a sub-agent, because I know there are a lot of you out there: it's really where you break down a problem and say, I'll delegate this. Imagine a workplace: I'm going to delegate this to a specialist agent, or just an agent with a very specific prompt, for some data I need. It's going to go run that with a full, fresh context window, fetch that data, and then decide what to hand back. And so, yeah, that can add up really quickly, because if you spin up 20 sub-agents for a given task to speed up the overall problem solving, you can just absolutely burn tokens. And I think at the moment, because a lot of these applications are fighting for user interest, there's a lot of token subsidization happening. It feels like they're probably making a loss on some of this stuff, right? And so I think that's where you kind of need to learn optimization techniques at this very moment in time. But then I also think, and I mentioned to you that DHH, and I'll bring it up now.
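The delegation pattern just described can be sketched minimally like this. The `call_model` function is a placeholder standing in for any real LLM client; nothing here references an actual API, and the 2,000-character summary cap is an illustrative assumption.

```python
# Minimal sketch of the sub-agent idea: the parent delegates a focused
# question to a worker that runs with its own fresh context window and
# hands back only a compact answer.

def call_model(prompt: str, context: str) -> str:
    # Placeholder for a real model call; returns a stub describing
    # how much context the worker actually saw.
    return f"summary({len(context)} chars of context)"

def run_subagent(task: str, full_context: str, max_summary_chars: int = 2000) -> str:
    """Worker: sees the full context, returns only a short answer."""
    answer = call_model(f"Answer only this: {task}", full_context)
    return answer[:max_summary_chars]

def parent_agent(main_task: str, big_corpus: str) -> str:
    # The parent never loads big_corpus into its own context;
    # it only ever sees the compact summaries handed back.
    facts = run_subagent("What data is relevant?", big_corpus)
    return call_model(f"{main_task}\nRelevant facts:\n{facts}", context="")
```

The token-burn caveat from the discussion shows up here too: each `run_subagent` call pays for its own full context window, so 20 parallel workers multiply cost 20x even though the parent stays cheap.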
20:49
Say, say his full name.
22:21
No, I'm not going to. David Hand Miser Hanson. Criticize me in the comments below for my pronunciations. "Kimi, Kimi K2.5 on OpenCode Zen is hilariously cheap. I bought $20 worth of tokens two weeks ago and I still have $10.89 left after 3 million tokens." Man, he mustn't use it.
22:22
Billionaire.
22:44
Yeah. 3 million tokens. If there's a.
22:44
He's sort of like coding Jesus, isn't he? He's sort of all that is virtuous and good. He moved off Amazon and did it on private hosting just to save money. He never took investor dollars; he mocked the people who do. He's sort of a purist in all senses of the word.
22:49
Yeah, but then he's saying, if there's a bubble in AI, it's pricing a million tokens at $25 and beyond. I do agree with him. I think a lot of people who have gone off and run things like OpenClaw did get a bit of a reality check, especially as providers shut down different ways of sort of hacking the usage, where they're like, oh, I have a personal assistant on Telegram, it's great, but I spent $300 in a single day and I'm being hacked.
23:08
And what's crazy is that the model providers, as you say, probably spent even more on that dumb Telegram assistant. They're probably actually losing money on deals like that. And I reckon the pricing, when they put it that high, is just pure fear-based. They're like, oh my God, if this gets used too much, we're dead. The billions in the bank are not going to be enough.
23:35
Yeah, well, it must be, because otherwise you would just win market share by not charging. So, just coming back to rounding out the Gemini 3.1 Pro discussion: I'm sure if you use Gemini in Google's products, you probably won't notice that big of a difference, if anything. And I don't know if it still suffers from that tunnel-vision pathway problem we've spoken about, so I just need to test it more to figure that out. But I don't think day-to-day users of the Gemini stuff will really notice. I do think, given Apple's decided to use Gemini in Siri, this could be a serious problem, right, in their strategy, because it's not great at tool calling and tool looping and this, call it, OpenClaw-ism. So to me, if I was at Google now, I'd be saying, and I think all model companies should do this now: we need two variants of the model. We need an agentic variant, so Gemini 3 Agent, and call it what it is: it's Gemini 3 Agent. If you want to do agentic workflows and tasks, you use Gemini 3 Agent, and you get a team and you optimize the tool calling and agentic looping with that specific model; you specialize. Then you have Gemini 3 Chat, and that's for turn-by-turn conversation, though they probably need to improve tooling with that as well. But I think this is what they all should be doing. And I have noticed a lot of people saying about the new Claude Sonnet 4.6 that one of the problems is it's lost the creativity it used to have and that they used to love. I don't know how they know that immediately, but that's what a lot of people are saying. And I think that is the trade-off, right, of optimizing for these computer-loop agents: you take away the creativity and the art student in them and you go towards the engineer.
And I think this is kind of the problem right now with these models: everyone is going so hardcore into the agentic loop stuff, and the models are getting optimized for it. Instead of releasing variants where it's like, oh, this is slightly cheaper and a quantized version, cough, Claude Sonnet 4.6, they should have, like, Claude Opus Agent, Claude Opus Chat or whatever. I don't know. And then maybe it's like Claude Opus Cowork or whatever, and that's optimized for knowledge work outside of code. I can kind of see it maybe going there. I just don't think we're going to be in a world anytime soon where it's one model to rule them all.
23:55
Yeah, I agree. I think the variants are good. And actually, interestingly, I haven't had a lot of time to test the new Gemini model, but in the testing I have done, it seems like when it gets in that agentic loop, it's all business. There are no tokens outside of tool calls. It'll just go: what's the task? Tool call, tool call, tool call, tool call until done. No chit-chat. And it's actually interesting; it might be better, because I've noticed with Sonnet 4.6, for example, it's very chatty. It's like, I'm doing this, I'm doing this, and oh, just let me think. It's constantly questioning itself and rethinking things, and maybe that leads to a better result in the end. But I'm like, just shut up. Like, just.
26:40
I mean, Codex. That's the thing about Codex, why I occasionally still switch to it: it just gets on with it. And it's mean. It's not your friend. Like, I'm not
27:19
going to read this. I'm not going to read 16 pages of text about all of the different decisions you've made about which lines of the file you're going to change. Just change them.
27:31
But I think it's by design. It's just by design; they think out loud. Their models think out loud. You know that "but wait, but wait" thing it does, where it's like, actually, I've got a better way. But wait. Actually. Yeah, it drives me mad.
27:40
Yeah. Although I do like when it says: but actually, the changes are already complete. This is done. You're all good.
27:54
My prediction, and remember, at the end of last year we did our predictions: I said I think OpenAI will have the best agentic coding model at the end of this year. I still maintain it. I think they're gonna make it.
28:01
Get on Polymarket, make the bet. Although it's disappointing, because Polymarket's banned in Australia now and I haven't taken the time to get a VPN to get around it. So I don't know what they're predicting at the moment.
28:14
Yeah, it's not blocked for me, though. I don't know if it's just you who's been banned. Like, have you got some sort of gambling block on your computer because you're an addict? I don't know. I feel like you're just making this up.
28:25
Maybe that's what it is. I don't know. But I can't get to it. If I try to go to the website, it's just a DNS error.
28:36
All right, we better, we better get on to the next model. 30 minutes in.
28:42
Oops.
28:48
So this one is Claude Sonnet 4.6. And again, to my earlier point on all this: do you even get excited about new models anymore? Because there's so much I'm still trying to work my way through. Trying out, like, the GLMs, which in the agentic loop, as we've been saying, and this is not definitive, are performing on par with, if not faster than, say, Claude Opus for pretty day-to-day things. Right. And they're.
28:49
Well, yeah. To add to my point about the new Sonnet 4.6 being so chatty: if you try the same prompt with Haiku, for example, it sort of just does the work. It's the right balance for me. It'll tell you what it's doing, but it won't go into elaborate detail, and it tends to just get through it and get it done. It's faster, it's cheaper, and so far I haven't seen it making wild mistakes or hallucinating things or breaking things. It just sort of works. So for me that's been my go-to model for anything agentic, really, because it's the right balance of things.
29:21
Yeah, so this is the thing too. Let me bring up the exact pricing. You've got Claude Opus 4.6 positioned at $5 per million input tokens and $25 per million output. And then you've got the new Claude Sonnet 4.6 at a $2 discount on the base input tokens: $3 per million input and $15 per million output. So it is truly, you know, maybe the sweet spot in the middle. But as you said, I find it is a little bit chattier, a little bit dumber, and it does feel like some slightly quantized version of Opus 4.6. The feeling I get from that model is: how can we run Opus 4.6 cheaper? And I'm sure that's really what it's about: getting that level of agentic model far cheaper and being able to deploy it more broadly. But at that price point, you could argue maybe that is a good thing, because you can use it as your primary, and then in your sub-agents that go off and read files and do stuff, you could use something like Haiku, which is only a dollar per million, and then things get a lot more affordable. And if you can get, like, 25 cents per million tokens out of GLM or a Kimi K2.5 and run that as your sub-agent, then, yeah, it really does become an equation on price for a lot of this stuff. If you're building agentic use cases off these models at this point, it's just weighing up: what level of intelligence do I need, and how do I drive down the price? Because at the end of the day, for a lot of the new agentic use cases that a lot of SaaS startups are going to head towards, and a lot of new businesses are going to be built on, this stuff's really important; it's all about your gross margin as you start to build this stuff out. And at least initially, you can go wild and use your VC dollarydoos to try a bunch of stuff out.
But at some point, and I think this is going to happen with Anthropic, and I think it's going to happen with OpenAI, they've got to make money, and they've got to pay for all this stuff they've spent all the dollary dues, or billies, in the bank on. And so they're going to need to charge maybe a lot more, or maybe do a lot of optimization. So it definitely is becoming something that we need to weigh up. But I also think at this point we're all kind of delusional if we think that we can tell the difference between Sonnet 4.5 and 4.6. Because I am at a point now where I can barely tell the difference between running GLM, Sonnet or Opus outside of very specific niche tasks, where I'm like, oh, I want a more creative flair, or I want this. But just in terms of that CRUD operation in an agentic loop, all of these models are starting to feel very similar.
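To put rough numbers on the price equation being described here, a quick back-of-the-envelope sketch. The per-million-token prices are the ones quoted in the episode (Haiku's output price is an assumption, only its $1 input price was mentioned), and the token counts are invented purely for illustration:

```python
# Rough cost comparison for an agentic run, using the per-million-token
# prices quoted in the episode. Token counts are invented for illustration.
PRICES = {  # model -> (input $/M tokens, output $/M tokens)
    "opus-4.6":   (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku":      (1.00,  5.00),  # output price is an assumption, not from the episode
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run, given token counts for a model."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A hypothetical long agentic loop: 10M input tokens, 1M output tokens total.
all_opus = run_cost("opus-4.6", 10_000_000, 1_000_000)

# Mixed stack: Sonnet as primary (2M in / 0.5M out), with Haiku sub agents
# doing the bulk of the file reading (8M in / 0.5M out).
mixed = (run_cost("sonnet-4.6", 2_000_000, 500_000)
         + run_cost("haiku", 8_000_000, 500_000))

print(f"all Opus: ${all_opus:.2f}, mixed stack: ${mixed:.2f}")
# → all Opus: $75.00, mixed stack: $24.00
```

Even with made-up token counts, the shape of the equation is the point: pushing the high-volume reading work down to a cheap sub-agent model is where most of the saving comes from.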
29:58
Yeah, I think good tool definitions and skills help with that significantly. And I think that's probably why we're seeing the smaller models almost improve in terms of how they compare, because as the techniques for using them get better, they become more empowered. I mean, we spoke about this for so long. You give me the old GPT-3.5, the original ChatGPT, and with modern techniques you could probably get a lot of what we're getting out of the agentic loops anyway. So I think part of it is just the way you use it. And to answer your question, I'm just pleased I can have almost like a model mix for the kind of things I'm doing. And I actually think this is probably where we'll get to, where you're saying, well, when I work on this kind of task for my business, here's the model mix I use. You know, I use Sonnet 4.6 as my primary, I use Haiku as my sub agent, and then I use GLM5 as my, you know, shell executor, the one that actually goes off and runs the hardcore commands or something. Some sort of stack where you actually have the models playing to their strengths, where you're getting the right combination of cost, speed and accuracy and getting that mix right. And I think that is probably the best way to work. You just can't use the same model for everything, because some models will be overkill, some won't be able to do it, some will hallucinate, some won't have enough context, or whatever it happens to be, or speed. And I think a mix is going to be a great way to work, at least in the medium term.
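The model-mix stack being described could be sketched as a simple routing table. The role names and the table itself are hypothetical, just a minimal illustration of routing each agent role to the model whose cost/speed/accuracy trade-off fits, with a cheap model as the fallback:

```python
# A sketch of the "model mix" idea: each agent role gets routed to the
# model whose trade-offs fit it. The roles and assignments are hypothetical,
# loosely following the stack described in the conversation.
MODEL_MIX = {
    "primary":        "sonnet-4.6",  # plans, reviews, synthesizes the work
    "sub_agent":      "haiku",       # reads files and summarizes: cheap, fast
    "shell_executor": "glm-5",       # runs the hardcore shell commands
}

def pick_model(role: str, default: str = "haiku") -> str:
    """Return the configured model for a role, falling back to a cheap default."""
    return MODEL_MIX.get(role, default)

print(pick_model("primary"))        # sonnet-4.6
print(pick_model("code_review"))    # haiku (unconfigured role falls back)
```

The design point is that the routing lives in one place, so when a provider "comes out with a good deal", as discussed later, you swap one table entry instead of rewriting the agent.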
33:06
Yeah. And we just constantly. There's that meme that goes around where it's like, introducing the best model, and it's Anthropic, and then it's OpenAI introducing the best model, and it sort of goes in this loop. I would say at the moment it's pretty clear Anthropic is well and truly on top. Although a lot of developers are starting to favor the new Codex 5.3, and I think once that's pushed out in the API, it will become increasingly popular. Personally, trying it out through the week a bit more, unless you use it on like full max mode, I just find it a bit dumb, is the honest truth. But it's really fast. But again, you're not willing to trade intelligence for speed, so that becomes a problem.
34:40
It's like it's really fast but like the damage it does is kind of crazy.
35:27
Yeah. But I also still find myself, for day-to-day interaction stuff, like where I'm just going back and checking emails and things, just using Gemini Flash. Even as low as 2.5 Flash, it's fine. It does the job, it's fast, and it's got a huge context window. I'm sure there are downsides, but unless it's super important work, again, it's the mix of speed and intelligence. I'm a huge fan of Haiku as well; I use that a lot. So anyway, back to Sonnet 4.6. They've really given no details on this model. Here's how it benchmarks: basically, you're paying $2 less per million tokens and you're losing, what, five, six percentage points roughly, or about five and a half percentage points on the agentic benchmarks, if you're worried about that. But everywhere else, like computer use, you're losing 0.2 of a percent, and you're saving, according to these benchmarks, which I don't really believe, a couple of dollars there. So yeah, I don't know. It's just a hard balance, and a lot of these balances are made at the moment by a Claude subscriber going into the app, right, having limits on the different types of models, hitting the limit, and then forcibly having to switch to what in their view is a lesser model. But it's not so much the case anymore.
35:32
Yeah, I guess it's sort of. I was thinking it's sort of like. Well, it's like alcohol, right? Like you can drink this hundred dollar bottle of wine or you can drink this $20 bottle of wine. They taste a bit different, but they both get you the same amount of drunk and probably after a little while you can't tell the difference anyway. Right?
37:02
Yeah. Yeah.
37:19
So it's a bit like that. But. Hang on. If someone offers you a glass of the nice wine, what are you going to do? You're not going to be like, give me the garbage wine.
37:19
But you know, one thing I noticed about garbage wines is you get a really bad hangover. The hangover is far worse. And maybe that's the problem with Gemini 3.1 Pro, because it's really cheap, right? $2 per million tokens, $12 for output. But the hangover, when it destroys a bunch of your work, would be bad. I feel like I'm dissing this way too hard given that I've barely used it, so I do want to apologize. Next week I reserve the right to come back gushing over it, saying it's the best agentic model ever and I was completely wrong, and we can do one of those YouTube thumbnails where it's like, we were wrong, and do the clickbait.
37:33
I've never been this wrong in my life.
38:14
Yeah, this changes everything.
38:16
It's finished.
38:17
Yeah, I guess because Opus 4.6 is such a good model, and the new Codex is really impressive too, that sets a high bar for new model releases. So unless you're beating that bar at this point, I do become a lot less interested. If Sonnet had come out at a dollar per million input, the same price as Haiku, I would have been far more excited about that release.
38:20
Yeah, I think that's the thing though. They couldn't, they couldn't support it because they would have like a server meltdown from people just hammering every request. Everyone can switch so quickly.
38:47
So let's switch now to the OpenClaw Anthropic OpenAI saga that has unfolded. As a reminder, OpenClaw, which was called a bunch of other names prior to this, is an open source AI agent that people are yolo-installing on Mac Minis and various things. It was originally called Clawdbot, and this guy Peter Steinberger created it. It got a whole lot of stars on GitHub, and that's like a measure of success in the coding world. Trying to explain this stuff out loud is totally ridiculous. Anyway, people were using it because they were hacking their Claude Max subscriptions, or token, and using it through this OpenClaw, which is essentially like Claude Code, but it, you know, has better memory, uses a bunch of markdown files for memory, uses some CLI tools, which I guess Claude Code can do, and then tunnels through your own computer. So it runs as an agent on your own computer, but you can access it on, say, Telegram. Now, the terms of service of Anthropic say you can't do this, and I don't know if they enforce it. There's a lot of debate: some people were saying they're enforcing it, others not. So a lot of people started pivoting to other models, like GLM and Kimi K2.5, to get a good price per token in their agents. But obviously, originally being called Clawdbot, it was optimized to use Anthropic's models and a hacked Claude Max subscription to run, so there was obviously a trademark problem with that. But then more recently, as the branding and hype and distribution grew, everyone's trying it out, and there's all this media hype about them having their own social network, even though people set up cron jobs for it to post, with a skill that directs them to post pretty weird stuff. All of a sudden everyone was talking about how someone should acquire OpenClaw. And so OpenAI did just that, only five days ago.
Now Peter Steinberger joined OpenAI, saying that they wanted to bring agents to everyone, even though it was already open source and accessible anyway, and, you know, they offered him a lot of dollary dues, and
38:58
he wanted to bring money to his bank account.
41:22
Yeah. So they've opened a foundation for OpenClaw, so it can keep purring away.
41:24
Yeah, companies have got a good history with the whole open foundation thing.
41:33
OpenAI, you know, it's in the name. Open AI, Open Claw. Makes a lot of sense.
41:36
It does make sense.
41:41
Steinberger has been a really big advocate for the Codex models. He loves them; he vibe-coded OpenClaw with Codex models. So he's a huge fan of them, and I think that also probably led them to do it. But let's call this what it is. None of the things in OpenClaw are that hard to do. If anything, pretty much anyone could build these things, and OpenAI are very capable of just recreating it, maybe copying the code base exactly, or just doing a lot of the stuff it did, right? So you sort of say, well, okay, why did they acquire it? And I think it's partially just to get OpenAI back in the zeitgeist, we're talking about it, and also, I think, largely just the distribution. It's become a brand overnight. Everyone's like, oh, how do I create an OpenClaw agent? So I think they're partially buying Steinberger and also the distribution. But what kind of troubles me about the whole thing is that they've got all the talent, well, at least 40, 50 percent of the talent. They've got so many billies in the bank, they're about to raise like a hundred billies. It's like, guys, where are the updates? Where are these things? Why didn't you build this first? It's almost humiliating that OpenClaw came out and sort of took this personal AI agent brand. And now, I don't know, the whole thing just feels yuck to me. Like, why didn't you do this? What are you doing?
41:42
Partly it comes back to a point you made a long time ago, which is, I wonder how much the higher-ups at OpenAI actually use it. I wonder how much they're daily-driving AI models and using them all the time. Because I think a lot of people have come to the same conclusions. They're like, you know what's going to work better, this kind of model. And then that's why we've got this whole simultaneous invention thing: Claude Code's out there, OpenClaw came out, we're working on similar things. It sort of is a logical progression; we're all learning at the same time the best way to work with the models. And I just wonder if they're not doing that, because some of the things that they concern themselves with and have announced are just totally opposed to the evolution of the technology, and the models aren't that suited to it either. They're probably better, I would say, than Gemini at this point, especially Codex. But I don't know, I just wonder if they're really as into it as the other people are.
43:18
Yeah, it seems like they sort of. I mean, it seems laughable to say it out loud, but remember the Sora app buzz for a few weeks? And they seemed to think, oh, this is crushing it, having all these disparate projects, like Google. And we kept saying they need to just focus on the model. The model is all you need. And Anthropic, the whole time, have been focusing on the model and the use cases people actually use this stuff for. And I think they hit the nail on the head in particular with Claude Code, because they built it, and as the models improved, it got better. Then they were all using it internally, and if you use the product you build every day, the product becomes way better. That's always the secret sauce of a great product. And with OpenAI, yeah, I agree with you. They're clearly not. I don't think any of them were using Codex; they had to send a memo around saying, you've got to use it.
44:17
It's like how people at Amazon have to use Chime for their calls.
45:10
You've mentioned that for the 600th time on this podcast.
45:16
But they won't admit it. Oh, no, I love it. Makes me happy.
45:18
Yeah. So anyway, it'll be interesting to see now whether OpenAI come out in ChatGPT with some sort of commercialized OpenClaw app and stuff. And, you know, my guess would be, when Apple announces some new Siri, OpenAI will then drop it, and maybe eventually we'll get some device, because they've got Jony Ive. Remember that love affair video they recorded, and that beautiful picture they had together? If you go back through all the ridiculous things they've done, it's truly mental, and it's just like bubble economics playing out in front of our eyes. Whereas again, you've got Anthropic, and I would also say probably equally Google, going, oh, how do we make these a business? How do we.
45:24
Yeah, how do we deliver some things people want. Especially Anthropic, it's sort of like a relentless pursuit towards a unified goal. They've been pretty consistent with what they've announced and what they've done, and
46:10
not wasting time on video models and audio models and all this other stuff, just going, we're just going to do this one thing. Yeah, it's really starting to become a lesson in focus, I think. And it'll be interesting to see if OpenAI are focused right now. But I don't get it. You just see no updates, and any update you do see, you're like, why would you even do that?
46:23
They haven't even rolled out the nerds in a while. Like, we haven't seen a. There's been no gaggle stream announcement.
46:44
We don't get gaggles anymore. I think, you know, all the automated AI agents have probably replaced them. I think the gaggles are just gone. They're not being fired, but they're being sort of hidden in a back room at this point. So anyway, with all this talk about models and agentic loops and all this stuff, having used all these in agentic loops now, like GLM 4.6, Kimi K2.5, Sonnet, Opus, do you feel like you've used them enough now that these things can and will become commoditized? Or are we always going to live in this sort of closed-lab frontier model world where you've got to pay your power bill? I mean, I guess you've still got to pay it for the.
46:50
I just think when it comes to the real world, like when it comes to mass rollouts and using the technology on a larger scale, the smaller models will win every time, because no one can afford, and I don't care what people say, no one can afford to roll out the Anthropic top-tier models right now on a mass scale inside an enterprise. You simply can't do it. If you work in these agentic loops, right, no matter what you do, if you're doing a complex enough task, you're going to use millions of tokens, and millions of tokens cost $5 per million just for input. It really, really adds up. And yes, I believe there are tasks that are worth that. There are definitely tasks that are absolutely worth the money and you should do it. But that's on a small scale; that's your sort of elite people in an organization who can justify their usage of that stuff. I would argue when you want to get it out to large groups of people, where they're all going to benefit from the technology, even if it is, you know, profitable overall, it's going to be hard to prove, and I just can't see people allocating the kinds of budgets they'd need to do that. I think what they need to do is look at what trade-offs they're willing to make and what they get for that. And when it comes to the agentic looping, the trade-offs are really, really simple. Some of these models are just brilliant at the agentic loops, Haiku, for example. I would argue a lot of the time you would really struggle to tell the difference. It's probably only that Opus and, say, Sonnet are just more verbose; they just say more stuff, but the result is the same. And so, having used it a lot, I actually see less difference between the models. And yes, there are some quirks and things like that, but when you just look at, I gave it a task and in the end it got the task done, then the models are quite similar in that respect.
Like, it's kind of rare that you come up with a task that one of them.
47:41
This is when it's in a loop, when it's being fed the cherry-picked context. So maybe why Haiku is performing quite well there is that it doesn't hallucinate much. So if you give it the right context, it can therefore give you the best output.
49:45
That's precisely what's going on. And this idea of a sub agent, where you're giving it a very specific prompt, all of the data it needs, and a set of tools that allow it to accomplish that task. We've spoken about this for a long time. We're at the stage of ultimate context building, where you can build very, very specific tasks, and then you've got your overarching process that's synthesizing all that together. And my argument would be that the best combination is to have your best frontier model running the overall process, creating the plan, evaluating the plan, seeing where we're at with regard to the plan, but then the work is done by the smaller models. That works really, really well. And the other thing is, working in this mode, you're not trying to single-shot everything. You're not going to try to have this brilliant model that just solves everything day one. The little models constantly make mistakes. They're like, oh, I called this tool and it failed. And instead of giving up, it's just like, oh, I know what I did, I got this parameter wrong. It tries again, and it works the next time. And sometimes that takes a ton of loops, but the idea is that it gets there in the end, and it's doing it in a much more efficient way in terms of both time, because it's fast, and money. And I would say that the days of having a model that has to single-shot everything and get it perfect are over. It just isn't needed anymore, and the models aren't even optimized for that anyway.
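The retry-until-done loop being described here can be sketched in a few lines. This is a minimal, hypothetical illustration, not any particular framework's implementation: `call_model` and `run_tool` are stand-ins for a real model API and tool runtime, and the message shapes are invented for the sketch:

```python
# Minimal sketch of the agentic retry loop described above: a small model
# proposes a tool call, and when the tool fails, the error is fed back into
# the conversation so the model can fix its parameters and try again,
# rather than needing to single-shot the task perfectly.
# `call_model` and `run_tool` are hypothetical stand-ins.
def agent_loop(task, call_model, run_tool, max_iterations=100):
    history = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        action = call_model(history)          # model proposes the next step
        if action["type"] == "done":
            return action["result"]
        try:
            output = run_tool(action["tool"], action["args"])
            history.append({"role": "tool", "content": output})
        except Exception as err:
            # Don't give up: show the model what went wrong and loop again.
            history.append({"role": "tool", "content": f"error: {err}"})
    raise RuntimeError("task not completed within iteration budget")
```

The design point is that correctness comes from the loop, not from any single model call, which is exactly why a cheap, fast model that occasionally fumbles a tool parameter can still finish the task.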
50:02
Yeah, it's definitely changing the way we use the models, and therefore how you think about the pros and cons of each model. But I think that's the thing we've been calling out for a while with at least the GPT-5 variants, not Codex, and it's why all the focus now is on Codex, not those original models: the nature of how we use models today has changed. But I would also argue that right now the VC dollars are there, so people can afford to do these things. These subscriptions are at a point where, if you're willing to pay enough, it will feel near unlimited. But the challenge is going to be, over time, building out true businesses or specialist apps on top of these things. At some point, all of these valuations have to be realized in the market. So it just feels like for a lot of these providers the only way is up; they have to do the Uber-style switcheroo at some point, like, this now costs a lot.
51:27
I do think that the top models are going to end up costing more, which is why I think we should all be looking to optimize for the smaller models now, because I think that really is the future. And also, it's not as big of a trade-off as it seems. Like DHH said, you're not giving up that much by using these smaller models. You can still probably get most of it done, and it's only when you want to write diss tracks or, you know, do extremely creative things that you might need to switch to a frontier model. But I think the important thing is switching. It isn't just banking everything on one model. Having the flexibility to change when someone comes out with a good deal, or something that just suits your workflow better, is a position you should be in. Locking into a single vendor and a single model is obviously not the right way to go.
52:26
Well, obviously out there at the moment you've got the big AI conferences on, and you've got our man Dario, head of the safety sex cult, also on my necklace, my Dario necklace, out there saying, like, all white-collar jobs are going to go, I don't know what to do, hold me back, guys, the world's changing. But then, to get a bit grounded, you just have to go to the Anthropic jobs page, which someone did. So AI is going to kill software, Salesforce is ruined, yet Salesforce is so complicated and shit to administer that they can't get Claude Code or any of these things to do it. They've got an ad out for a Salesforce administrator in San Francisco. So if you're listening and you're a Salesforce administrator, your job's safe, you're fine. Because clearly, if they're hiring,
53:18
I still remember paying someone over a hundred thousand dollars a year to operate a piece of software. Remember that?
54:15
Yeah.
54:22
Like, it's just wild that they can get away with that and make so much money doing it.
54:22
Yeah, it is. It's. Now for another sort of lol of the week, just to play us out here, this video. So I think it was on yesterday, this India AI Impact Summit, and they tried to stage this photo shoot where they all held hands and put their hands together in the air. And you've got Sam Altman next to Dario, obviously CEO of Anthropic, and they famously split up early on because Dario claimed that Sam was unsafe. They're all holding their hands up in the air, but there's this awkward moment where they go to hold hands, and Sam Altman pulls his hand up and away, and Dario's kind of looking like, should I hold his hand or not? And then they end up just putting their hands, like, kind of like fists together near.
54:27
Still participating in the celebration, just not touching one another.
55:20
I just think it's so funny. It shows the tension. But what's even more comical about it is it feels like a scene out of Silicon Valley. It's like the purest form of comedy. It's almost like a writer wrote this scene of these two guys being so awkward standing next to each other. It's worth a watch.
55:26
I mean, look, it'd be pretty awkward if I had to hold hands with other men as well.
55:48
Holding your hands? Yeah.
55:52
Like, what are they actually celebrating?
55:55
I don't know. I didn't watch the summit. I just don't listen to their marketing anymore. It's exhausting. Speaking of exhausting, our podcast. Thanks for making it to the end. All right, any final thoughts? Gemini 3.1, Anthropic Sonnet 4.6, Chris?
55:57
I'm definitely going to give 3.1 a really, really good shot this week. I haven't had enough time to play with it, and I look forward to reporting on it next week. I think probably the thing I don't like is it's always this preview thing. They've always got this out: oh, well, it's just a preview model, so what do you expect? I don't like that. One thing you've got to give Anthropic is, if they release something, they really release it. Whereas OpenAI announces Codex and you can't really use it unless you're in their ecosystem. Gemini announces it, but it comes with a caveat: oh, is it real, is it not, it's changing every other day, that kind of thing. I don't like that. If you're going to put something out there, you should put it out there, especially if you're charging money for it. I understand if it was free, that's fine. But if you're charging for it, you need to have something you're willing to stand behind.
56:19
It is strange too, because OpenAI used to be amazing at that. It was like, same-day API drop. And now they're holding it back, literally claiming they'll release it when they can make it safe. That's the claim in their blog post, which is just utterly ridiculous.
57:06
Yeah, yeah. It's kind of that catch-all word, right? It's like, oh, well, you know, think of the children. Okay, yeah, true, children are important, we'd better not.
57:20
So don't forget, on Spotify or wherever you get your music: Is This the End. Let's drive this song right up there. We still have, I think, over a hundred monthly listeners to these tracks. It did peak higher than that, but I think everyone got over it. Don't blame you. But yeah, it's on there if you want to listen to it. All right, we'll see you next week. Bye.
57:31