We Committed Fraud with OpenAI's New Image Model (and Called Mum) - EP99.38
95 min
Apr 24, 2026
Summary
This episode covers the latest AI model releases including GPT-5.5, Claude Opus 4.7, OpenAI's Image 2, and others, with hosts demonstrating the alarming realism of AI-generated fraudulent documents. The discussion centers on the economics of AI subsidization, the race toward 'everything apps' by major labs, and the emerging security risks of agentic AI systems.
Insights
- AI model providers are heavily subsidizing costs (OpenAI burns 70% of revenue, Anthropic 33%) while consumers pay only 5.5% of actual token costs, creating unsustainable economics that will eventually require price corrections
- The real innovation leap is in agentic workflows and image generation (GPT Image 2, Opus 4.5), not incremental model releases; most new announcements are packaging for enterprise sales rather than fundamental breakthroughs
- OpenAI is playing catch-up to Anthropic's Opus while both race to build 'everything apps' (Codex, Claude workspace) to lock in users and justify future IPOs through enterprise revenue concentration
- AI-generated images are now indistinguishable from real photographs, enabling sophisticated fraud at scale with minimal effort, outpacing detection and verification capabilities in legal and financial systems
- Traditional SaaS companies face existential pressure as enterprises build custom agentic workflows using headless APIs and MCPs, potentially reducing per-seat pricing power while increasing overall consumption at machine speeds
Trends
- Everything App Race: OpenAI, Anthropic, and Google competing to build unified AI workspaces to capture user attention and enterprise spending
- Agentic Economics Shift: Token consumption in agentic loops is 10-50x higher than chat, forcing pricing model recalibration and sustainability questions
- SaaS Headless Transformation: Enterprise software (Salesforce, Box) pivoting to CLI/MCP-first architectures to remain relevant in agentic-first workflows
- AI Fraud Capability Explosion: Image generation quality now enables document forgery, identity fraud, and deepfakes at consumer-grade accessibility levels
- Model Provider Consolidation: Smaller models (GLM-5.1, Kimi K2.6) approaching parity with premium models, creating pricing pressure and commoditization risk
- Enterprise AI Budgets Mandated: Organizations allocating fixed AI budgets by mandate, creating captive revenue pools independent of consumer adoption
- Security Firewall Gap: Agentic systems lack outgoing data controls, creating exfiltration risks through MCPs and skills that exceed traditional cybersecurity threats
- Tokenizer Manipulation: Model providers adjusting tokenizers and thinking budgets to control costs and user experience without transparent communication
- Open-Source Model Viability: GLM-5.1 and Kimi K2.6 approaching enterprise-grade performance, enabling organizations to self-host and reduce dependency on proprietary APIs
- Subsidy-Driven False Economy: Multi-layer subsidization (model provider + platform provider) distorts user perception of true AI costs and value proposition
Topics
- GPT-5.5 Model Release and Pricing Strategy
- Claude Opus 4.7 Performance Regression and Cost Optimization
- OpenAI Image 2 Photorealism and Fraud Capabilities
- AI Model Economics and Subsidy Sustainability
- Agentic Workflow Token Consumption and Costs
- Everything App Competition (OpenAI Codex vs Claude vs Grok)
- SaaS Headless Architecture and MCP Integration
- AI-Generated Document Forgery and Detection
- Enterprise AI Budget Allocation and ROI
- Open-Source Model Performance (GLM-5.1, Kimi K2.6)
- AI Security: MCP and Skill Exfiltration Risks
- Tokenizer Changes and Hidden Cost Increases
- Computer Use and Agent Firewall Requirements
- Thinking Budget Controls and Model Degradation
- Workspace Agent Limitations vs True Agency
Companies
OpenAI
Released GPT-5.5 and Image 2 models; building Codex everything app; pricing strategy and catch-up to Anthropic discussed
Anthropic
Claude Opus 4.7 release analyzed; leading in agentic performance; building Claude workspace everything app; burning 3...
Google
Criticized for silent model strategy despite TPU advantage; could undercut competitors but hasn't; Cloud Next 26 anno...
Salesforce
Embracing headless architecture with CLI and MCPs; positioning as system of record for agentic workflows; smart enter...
Microsoft
Mentioned as hyperscaler providing subsidized cloud credits and infrastructure for AI model providers
Amazon
Mentioned as hyperscaler providing subsidized cloud credits and infrastructure for AI model providers
Box
CEO Aaron Levie stated companies not moving to headless architecture are dead on arrival in the agentic world
Atlassian
Experiencing stock decline and layoffs attributed to AI disruption; Jira facing pressure but likely to remain as syst...
Stripe
Mentioned as system accessed via agentic workflows by hosts
Help Scout
Mentioned as system accessed via agentic workflows by hosts for customer support automation
Figma
Mentioned as potential backend system of record for agentic workflows
Snowflake
Discussed as potential replacement database backend for traditional SaaS systems in agentic architectures
Together AI
Hosting GLM-5.1 at $4.40 per million tokens; pricing suggests closer to actual cost with minimal subsidy
Fireworks
Hosting GLM-5.1 at $4.50 per million input tokens
Tesla
Integrating Grok chat with search tools and agent capabilities into vehicles
Sim Theory
Hosts' AI platform; replaced default image model with GPT Image 2; building agent apps beta; experienced $1.5M credit...
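The hosted prices listed above make it easy to ballpark what a single agentic task costs on GLM-5.1. A minimal sketch, using Together AI's quoted $4.40 per million input tokens and the 8,000–30,000 input-token range the hosts cite later in the transcript; output-token pricing (which typically differs) is ignored here, and the function name is illustrative:

```python
PRICE_PER_M_INPUT = 4.40  # USD per million input tokens (Together AI, as quoted)

def task_cost(input_tokens: int, price_per_million: float = PRICE_PER_M_INPUT) -> float:
    """Dollar cost of a task's input tokens at a flat per-million rate."""
    return input_tokens / 1_000_000 * price_per_million

# Input-token range for a single agent task, per the transcript.
for tokens in (8_000, 30_000):
    print(f"{tokens:>6} input tokens -> ${task_cost(tokens):.4f}")
```

At these rates a single agent task's input runs a few cents, which is why the episode's framing centers on loop volume rather than per-call price.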
People
Mike
Co-host discussing model releases, demonstrating AI fraud capabilities, and analyzing industry trends
Chris
Co-host providing technical analysis, agentic workflow insights, and counterarguments on model economics
Sam Altman
Mentioned in context of OpenAI's strategy and model announcements
Elon Musk
Announced Grok everything app vision; mentioned in context of AI platform competition
Aaron Levie
Quoted on necessity of headless architecture for enterprise survival in agentic era
Mike's Mother
Called by Mike to test AI-generated fraudulent council letter; fell for the forgery
Quotes
"VCs and sovereign wealth funds are paying 70% of the real cost of your token. So OpenAI burns 70% of their revenue and Anthropic burns 33%."
Mike•~45:00
"I would argue, I don't know about you, but I don't do anything in a non-agentic mode now. Everything I do is delegation now and scheduling."
Mike•~90:00
"There's no way I could tell this isn't real. Like, it looks so real and so realistic as someone who, you know, grew up in that time that, yeah, it's terrifying."
Chris•~130:00
"The actual risk is the MCPs and skills exfiltrating that data by just YOLOing code and just allowing any old thing to run."
Mike•~175:00
"I think the time for pretending is over. There's real value here and people need to find it."
Chris•~200:00
Full Transcript
I went from 6 to 7, yeah, 6 to 7, SWE Bench Pro 64.3, that's AI heaven, Code Arena number 1, plus 37 on the score, 87.6, verify what you benchmarking for. So Chris, this week a lot has happened in the world of AI, you know the drill, everyone's been cooking, mind blown, we are cooked, everyone's going to lose their job. We're back to all that joyful narrative again. But we are back, back to regular programming maybe. You know, let's not overcommit. Back with our average takes. Nothing's obviously changed since we left. How have you been? Yeah, pretty good. I'm very committed to upholding our standard of mediocrity. I can see my camera is, like, already blurry. I don't know why, and I feel like that's the kind of standard we want to maintain here. And for those who listen, you've got a nice little graffiti sign I see there behind you. We were in LA recently. We went to a graffiti making class and I made something and it was terrible. So I just painted over it with This Day in AI. And actually out of the class, I think I made the best artwork. It shows how bad the other people in the class were, really. Very, very nice. Okay, so we do have a lot to go through and we're just going to take our time and sort of catch up on everything that has happened and all the different releases that we wanted to talk about. And then, honestly, there's some higher-level themes at play right now that we've both been talking about, and I think we're a little bit excited to talk about those. So a lot of new model releases, a lot of new releases in general. We're at that point in the year where everyone's like, we're excited to announce, we're extremely proud to announce, all that kind of stuff. So we've had just today GPT 5.5, not to be confused with 5.4, 5.3, 5.2, or 5.1 or 5 prior to that. We've had Claude Opus 4.7 a couple of weeks ago. We'll talk about that in a minute. 
I thought the biggest, like most impactful release really out of all these models was OpenAI's Image 2, which we'll get to; we've been having a little bit of fun with it. We have done some kind of extreme things with this that got me nervous before this podcast. Yeah, if we're not back next week, you'll know why after we tell you what we've done. Also, GLM 5.1, we've been really impressed with that model. Kimi K2.6, also very impressed, lots to share on that front. And then Qwen 3.6 as well, which we, honestly, there's been so many releases we kind of forgot about. Yeah, it's almost like too much, guys. Like, calm down a bit. We don't need all this. Like, it's nice of you, but, you know, like, we're good with what we have. Yeah, and so some other things. OpenAI launched agents, I think they're calling them, or workspace agents, off the back of originally the sort of failure that was GPTs. Everyone's trying to build an everything app now, and there's some sort of war taking place around that, so we want to talk about that. But let's jump in first to the latest release. We'll start from the latest and sort of work our way back. Today, we had GPT 5.5. I'll give you my assessment. Bang, paperware, you can't use it. It's not available in the API. Well, I mean, that's a little debatable, isn't it? Because I'm so out of practice here, I don't even have the tab up. Ah, here it is. So introducing GPT 5.5, but you were right, it's not available in the API. I think what's happening is this is speaking to the larger trend now of we're really entering into that super app product world where the labs are way more excited to get these models into their apps, especially OpenAI, with them competing now directly with Anthropic in B2B and trying to prove that they have this super app for everything. So they're just pumping the models into those super apps really quickly, and I think that's what we've seen with GPT 5.5. 
Definitely still on the OpenAI front, but if you look at Anthropic, the gap between them announcing something and it being available, if anything, is shrinking. Like, you know, even the elites getting it, like, I've been cooking with this model for three months, and now you guys get it at announcement time. That doesn't even seem to be happening anymore. 4.7 was just there one day, and we're like, oh, geez, we better add this and start using it immediately. So I don't know. It seems like more an OpenAI thing in terms of that delay. Yeah, the narrative around this stuff right now seems to be that they are really pushing hard in terms of just trying to catch up to Anthropic. They've gone from just blitzing ahead and being same-day release cycle to, you know, this, like, it just looks like a very confused strategy right now. But in terms of benchmarks, like all of these model releases, they're saying GPT 5.5 benchmarks pretty much higher on every front than Claude Opus 4.7. Yeah, right. I mean, let's believe it. I don't believe that for a second. They can have all the benchmarks they want. We look at the usage and who's using what, and people just aren't that excited about OpenAI models anymore. Yeah, I think we have to give 5.5 a chance. Like, we've never used it. But it was interesting in Sim Theory when we saw 5.4 usage. So it went right up. Like, people were really excited to use it for a little while. And then it just slowly peels away. And I think that when you move to an agentic world, these agentic loops, 5.4 just doesn't, in my opinion, perform as well as the Anthropic models, or even, to be fair, GLM 5.1 or Kimi K2.6. They perform so much better, in my opinion and experience, at agentic operations than the GPT models. They perform so well, it kind of makes me nervous, like in the sense that I will use them for a period of time and be like, whoa, they're just as good. 
But then whatever it is within me, I just want the best one, and I just go to Claude Opus 4.7 because I'm like, I just want to be using whatever the best one is because I want this task done. But I kind of feel like if I just gave them more of a chance, they would get the job done just as well. And it always blows me away when you say use GLM 5.1 or Kimi K2.6, just how quick they are. The answer's just suddenly there. Like I'm used to tabbing away and starting another process and then coming back. Whereas with those ones, you basically don't have to do that. Even in a full agentic loop, they just get it done quicker. Yeah, my experience, actually, when I was flying to LA, I was using Opus to code in a bunch of tabs, right? And everything was going great. And then Opus had this weird outage that we had to deal with. And so I had to switch away, and I immediately went to GLM 5.1, because I know that is, like, basically the closest thing, albeit it's not that much cheaper to run, unfortunately, because it's a huge model and probably trained on the outputs of Opus, let's be honest. But after a while of using it, I did not notice any difference. In fact, the next day I was still coding with it and had not even noticed that that was my primary model. I was just opening new tabs and that was in there, and because it was performing so well, I really didn't notice any difference. So I do think you make a good point there. You sort of get attached somewhat to these brands and this consistency of that model working. And then that maybe stops you trying some of these other models that are pretty damn good. Yeah, I think it's when you're trying to get real work done, you're sort of like, well, I can't chance it on this. I'm just going to have to pay what the price is. But the truth is that if you actually were denied access to the more premium models and only had these, I think you could be just as productive. 
I really don't think you would lose that much using, say, GLM 5.1 versus using Opus. Like, yeah, there might be some things it's not quite as good at, but like, if you were just totally banned or something from Anthropic and could only use that, I wouldn't be that upset. I think I'd probably just use it then. I almost need it. It's almost like a form of discipline, you know, like, you're not allowed to do this anymore. You have to use this one. But I think that's the point we're getting to, right, where it really is at the product layer now where people are getting used to certain products and how they function and the things that they're able to do. And you can see that with the labs. And I think this is shown with GPT 5.5, how they're pushing it really hard into Codex, which is becoming like their everything app. And then with Anthropic, it's kind of similar. They're like pushing stuff into their application layer first, and they're trying to get people addicted to that application layer. And we'll get to it a little bit later, but I'd say arguably distorting the market in terms of pricing, like taking the loss to get people addicted to their world and way of working. Yeah, absolutely. I think, I mean, yeah, like you say, we will get to this later because I've actually looked into this quite extensively. It's a big point at the moment around the cost of things. And I think the model providers subsidizing the real cost is really skewing everyone's thinking as to what's possible and the real cost of things, and causing like a false economy in terms of people thinking something should cost a certain amount when it actually costs more, and not making that value equation as to: is the value I'm getting from this worth what I'm paying? 
It's so reminiscent of newspapers when the internet first came around, how they're like, we'll just make it free because everyone still buys the newspaper and we'll just sell ads, right? Because we just want eyeballs. And then over time they try and charge, and that didn't work out terribly well. The quality of journalism goes down and everyone's like, oh, why is journalism so bad? And it's like, because no one's paying for it anymore, so no one really values it. And I think that's kind of what's happening in the model realm as well. What might happen as well is, like, it's been subsidized so much, when they eventually have to charge the right price, you know, who knows what will happen. I don't know if people are going to be willing to pay. Or if they are, they're going to have a degraded experience because they're not willing to pay as much as it actually costs. Yeah, I'm of the opposite opinion. I actually think that there's so much value there and people should pay for it. I just think they've been conditioned to think it's cheaper than it is and haven't made that assessment with themselves: I'm this much more productive because I spend this money, and I'm willing to spend it either as an expense for my job to make me better at my job, or my company pays and I can prove the extra value I'm getting from it. Like, I think the value is there. I just don't think people are thinking about it right now. I think most people are pretending the problem doesn't exist and looking at this line item of, like, AI usage as a necessary evil, and not realizing it's almost like paying for additional staff. It's almost like having more employees at your company; that's an expense that everyone's willing to take on. If it brings more value to the business, then you're willing to pay for it. But because it's a computer, you just don't see it in the same light. Yeah. Well, I mean, let's get into that conversation anyway. We've started talking about it. 
Might as well go into it. I think that this is going to be the narrative coming up to some of these companies going public: how much are they subsidizing it? And didn't you have a real stat around how much? Yeah, so these are actually Kimi K2.6 certified stats. So, like, you know these are, like, top-level, legit, mediocre stats. So listen to this. VCs and sovereign wealth funds are paying 70% of the real cost of your token. So OpenAI burns 70% of their revenue and Anthropic burns 33%. The hyperscalers, Microsoft, Google, and Amazon, are paying through subsidized cloud credits and infrastructure buildouts, right? We've experienced that ourselves, where we were subsidized for a while, directly passed it to the Sim Theory audience, and burnt through it in like record time. Like absolutely just mince-meated these credits with our audience. And we did the same thing. And I'll finish this and then I want to make a point about that part. And then it says enterprise customers are the only ones paying something close to the real cost, which is why every lab is desperate for that enterprise revenue, something we also have experienced. The enterprise is the first group of people to actually get the value and be willing to pay close to what it really costs, right? And then the consumers are only paying 5.5% of the actual cost of what they consume, which sounds about right to me, right? And we saw it. We changed our token model in Sim Theory, and there's immediate backlash in terms of people being like, hang on a sec, I burned through my tokens in half an hour, what's going on? And the truth is we just finally charged what it actually costs us for people to use it. And so it's kind of crazy, like the skewed economic world we live in with these AI tokens. And I think the other point to make is, well, a few things. You mentioned earlier, like passing on credits, like a year ago, was it? 
Or maybe two now, when we released the workspace computer where you could have like a Windows box in the cloud and your AI could operate it. And it was like your computer, you can install apps in the cloud. We spent $1.5 million in like not very long, I think like two months, on that. You could have bought a whole other gold chain with that. Yeah, I know. And that was in credit. So again, heavily subsidized. Could we have done that without the subsidy? Like there's no way. Like no one would have paid that because it was really just experimenting around with the technology. And yet we found real value there. Like it was the most in-demand thing we've ever done. Like had we, say, been venture capital backed and could burn the VC money like these guys are doing, we could have maybe run that to the point where we got the economies of scale right or the cost base right and actually provided it as an ongoing service. Like it really was a legitimate thing that people really wanted. And I would argue probably still do want. Yeah. I would still like to have a full cloud computer I could deploy my agent on instead. Well, I don't mind my Mac mini over there. It does the job. But ultimately, it would be cool to have. I don't think it's a great business model, but it is for geeking out over. It's pretty cool. But I think you're right. The subsidies, at least from a consumer point of view, the whole idea was the ads business. But I don't think that's working terribly well. People don't want a compromised AI experience. And right now there's a lot of people fighting for the attention or the token usage or getting you to prompt in their world, and they are willing to subsidize it. So there's always someone willing to discount more to get user share. And I think this is what's eroding away the consumer business at least, whereas the enterprise, like, it's a whole different ballgame. Yeah. 
And the point I wanted to make earlier about this that's totally crazy is when you think about how much, say, Anthropic and OpenAI are sort of selling $2 for $1 or whatever they're doing, you know, like they're passing on value to us by burning their own money. Right. And then you think about, say, Sim Theory, where we were effectively doing the same thing for our own audience. So you've got two layers of people subsidizing your usage. And I would argue a lot of consumer AI platforms, like if you look at, say, Perplexity and some of the other ones people have used over the years, like Cursor and stuff like that, they are subsidizing as well. So you've got two layers of subsidizing the actual cost of this stuff. And then people using it and being like, you know what? $30 a month. This is so expensive. I couldn't be bothered. I'm going to downgrade to the $15 a month plan because $30 is too much. And you're thinking, but this is probably costing $700 a month when you have the two layers of subsidies. So it's this weird thing where you wonder where the actual value lies, but it's in such contrast to my own experience where I will spend whatever it takes. I can't even imagine how much money I spend on our own system. You made us switch to auto renewals like everyone else does. And I think mine auto renews every 15 minutes or something on the plan in terms of tokens. But I'm like, I get so much done. I'm doing the work of the previous 10 years of me in a week. It's just the value I get from it is so big. I would pay a couple of hundred thousand dollars a year to get what I get from it because I think I'm delivering more value than that. And I think that this value perception is really skewed, but I think that people are misguided about it. I actually think rather than them seeing it as this excessive expense, I think they should see it as an opportunity. 
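The stacked-subsidy arithmetic described here can be sketched as a quick back-of-envelope calculation. The 5.5% consumer share and the $30-a-month plan are the figures quoted in the conversation; the helper function itself is illustrative:

```python
def implied_real_cost(price_paid: float, paid_share: float) -> float:
    """Infer the real cost of usage when the user only covers a fraction of it.

    price_paid: what the consumer actually pays per month (USD).
    paid_share: fraction of the true cost the consumer covers (0.055 = 5.5%).
    """
    if not 0 < paid_share <= 1:
        raise ValueError("paid_share must be in (0, 1]")
    return price_paid / paid_share

# Figure quoted in the episode: consumers pay ~5.5% of actual token cost.
plan_price = 30.0  # the "$30 a month" plan mentioned in the conversation
real_cost = implied_real_cost(plan_price, 0.055)
print(f"${plan_price:.0f}/month plan implies ~${real_cost:.0f}/month of real cost")
```

On the episode's own numbers this comes out around $545 a month; the hosts' rougher "$700" figure presumably also folds in the second, platform-level subsidy layer on top of the model provider's.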
If I am the one spending the money on this, I can be the one who's this much more productive and directed in my activities, and that's a real advantage for me. But isn't this the whole point, right? Because I think you're looking at it from the point of view of someone who your entire life has had employees, right? Like you've hired developers, marketers, salespeople, like all these various roles. And I sort of look at it from the point of view of, well, you know, if I had to go out and hire people, manage them, you know, deal with people in a business, there is a cost to that. There's a mental weight. It's a distraction, quite frankly, because you can't stay really close to the bare metal. And so I think we are looking at it, when we're doing work or adding value, from the point of view of how many wages would you need to pay in order to get to this level of productivity in the old world. So the question then becomes, okay, well, I am willing to spend like 100 or 200k a year on this stuff because I'm getting that value, or getting that return on my, like, token investment. But I think from someone who's, you know, like say a developer today working in a business, they're probably looking at it like, oh, I just need to now pay like an absolute fortune to do my job because there's this expectation that I will now output at this level. And it's not like their income is going up as a result of doing this if they have to pay for it themselves. And so I think there is this mismatch. And if you're using it in your personal life, it's not like, you know, maybe you're just not seeing the returns there. So I do think that's why also Anthropic and OpenAI are pushing so hard into the world where, to be successful, like, they almost have to disrupt society, which kind of sucks. 
Like it's like they have to replace elements of human wages, because no one's going to pay more to do, like, you know, like there's got to be some trade-off there. Like they've got to either see a productivity gain from all of their team in everything that they do, or they've got to lay people off and be able to keep running at the same pace. Like the economics have to balance out at some point. Yeah, exactly. It's why I find it so surprising that Google has just gone, like, completely dead and silent on their models, because one advantage they have over everyone else is their TPUs, right? They have their own hardware to run this stuff. So Google could afford to basically make their crap free and just run everyone else into the ground by just properly subsidizing and just go, like, we're just going to be free for the next three years, build all your stuff on us, and make everyone totally entrenched in the Google ecosystem. And yet they basically destroyed their models, and then they're also really expensive. So it's like, I just really don't understand what they're doing there when they have the ultimate platform to just flatten everyone and really bleed them out. Like, you could right now, OpenAI is struggling financially, you could destroy them if you were Google right now and wanted to. I think, you know, they did announce, to be fair, they had that Cloud Next 26 like a couple of days ago, but there's just so many announcements right now, it barely blipped up on my radar. Well, actually, I saw it when I did my Kimi K2.6 research, and I just figured it was hallucinating. I was like, you're living in the past, man. Like, you know, this is not real. You've made this up. And then I'm like, oh, yeah, they really did do something. Yeah, I mean, they've got their Gemini Enterprise Agent platform. OpenAI have got Workspace Agents. Anthropic's got, you know, Cowork and Claude Design and Claude whatever. 
And I think it just shows now the target, and I think people are noticing this, is like everyone's chasing the enterprise dollarydoos, because that's just where, one, as you said earlier, people are just willing to pay in the enterprise because they're seeing the benefit. Not just willing to pay, mandated to pay. Like we have come across so many enterprises where there's an AI change officer, there is someone specifically in charge of a budget that they need to spend by mandate in their organization, and they're looking where to allocate that money. So it's a difference between convincing a million people to pay you $10 a month, or one customer who's just going to be like, yeah, let's put the full five into this because we need this in our organization. And there's just very few places to put that money right now. I think the other challenge, right, is if I'm a consumer and I just want to experience and experiment around with different models and tools right now, the change that's coming is, as these things get more expensive, right, like you've got to actually pay what they cost, those experiments become really expensive. You know, you can spend like a hundred dollars USD very quickly on, like, trying some agentic stuff, or trying some scheduled tasks, or playing around with agents. So, well, listen, listen to this. I actually got GLM 5.1 to do some calculations on this. It was saying, if you're using like normal chat, a single chat interaction might be like 800 tokens, right? But a single agent task is like 8,000 to 30,000 input, 3,000 to 8,000 output. It's like 10 to 50 times more expensive because you've got system prompt, planning reasoning step, multiple tool calls, each with growing context, and then the final synthesis on every agentic process, right? 
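The 10-to-50-times figure can be reproduced from the token counts just quoted (800 tokens for a chat turn; 8,000–30,000 input plus 3,000–8,000 output for an agent task). The sketch below is just that arithmetic, assuming a uniform per-token price for simplicity (real APIs price input and output tokens differently):

```python
def cost_multiple(chat_tokens: int, agent_in: int, agent_out: int) -> float:
    """Ratio of an agent task's token volume to a single chat interaction."""
    return (agent_in + agent_out) / chat_tokens

# Token counts quoted in the conversation.
CHAT = 800
low = cost_multiple(CHAT, 8_000, 3_000)    # lightest agent task
high = cost_multiple(CHAT, 30_000, 8_000)  # heaviest agent task
print(f"an agent task burns roughly {low:.0f}x to {high:.0f}x a chat turn")
```

That lands at roughly 14x to 48x, consistent with the quoted range; prompt caching and dynamic context assembly, mentioned next, pull the real multiple down.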
And so even though I believe you can actually make agent processes more efficient through the way you stack the context and build it dynamically, caching, all that sort of stuff, the cost is just orders of magnitude higher. But again, back to my earlier point, I would argue, I don't know about you, but I don't do anything in a non-agentic mode now. Everything I do is delegation now and scheduling. I've got so many scheduled tasks. I've got so many agentic loops running throughout the day. And I run them all on cloud machines so that I can walk away, I can shut my laptop, and the work keeps going. That's how I work all day now. I'm stressed out right now because I know I don't have anything running, and I should. And that's how I work. When I'm leaving to go somewhere, I'll set two or three things off and then unwrap them like presents when I get home to see how they went. Yeah, and I think in Sim Theory too, the way I'm thinking about it now is like, how do you reunify these experiences, rather than the complication of people selecting like chat or agent or research or whatever? It's like, well, it should just work. Yeah. And I think that, like, starting to reunify that stuff is important. But at the same time, like you said, the agentic loops, at their core, do tend to burn a lot more tokens. But ultimately, I think the outcome is just so much better with everything that it does. Yeah. Yeah, and I think, to use the modern lingo again, with your agents you've got to let them cook. Like give them all the stuff, let them decide, let them do the tool discovery process, the file discovery process, let them do all that. And this is my probably, because looking back on the OpenAI announcement around their agents, because you were saying they should have been there ages ago, it's kind of lame. But I said, this is really the first taste of this kind of workflow for your average user, like the person who just sees AI as ChatGPT, right? 
So I actually think it's kind of significant because it's the first time you can sort of delegate. It's the first time you can set something off and have it working for you in the background, for most people. And so I actually think it's kind of significant. But my criticism of it is this idea that you have to, in advance, specify which connectors or skills you're going to use. They have the concept of skills in there, which are like dedicated prompts for parts of the work, and then which integrations you want to use, like Slack or Salesforce or whatever things you want to do. Now, my argument with that is, okay, maybe in the scheduled task context it makes sense. But it's also a lot of setup. You really should be able to just say, here's what I want to happen, and let it figure out all of those details for you. And I think that that's where we really need to get to with the agents. It shouldn't be like this custom setup every time you want to do something. You really just should have a working partner where you're saying to it, look, here are my, and this is what I do all the time when I'm working. I'm like, here are my problems. Like, here's what I'm really stressed about. Like, how the hell am I going to get this done? And the agent itself will coach me through the process of giving it what it needs to get that task done. And don't you think this is a, like, this is a converging, like, these are two methodologies I think that are totally different. And I'm curious how you actually work. So you've got the Claude way right now, which is like single agent. It's just Claude everything. And they're trying to, like, magically make, well, they have, I mean, they have the same thing with like connectors. It's like switching them on and off, and you can only have a certain amount enabled. And then to go into code mode, you've got to switch over to, like, Claude Code. And then if you want to co-work and do, like, knowledge work, it's like Cowork. 
And I think that's kind of confusing. And then you've got the sort of OpenAI, like, the newer version with these workspace agents or whatever they call them, where you've got to configure them and set them up. I mean, it's exactly how it works in Sim Theory, where you've got sort of context switching, essentially. But I personally think the context switching is far superior, where you've picked the tool mix and you've tuned the skills for that particular role, and then you treat it like it's a real worker for you and delegate tasks to it. That's how I work. Like, I have my code one. Yeah, but you're not in love with your agent like I am. We have a relationship. Yeah, but this is the thing. I don't know, like, you don't seem to switch much, whereas I switch constantly. No, really, I don't. Like, I'm using the same assistant for everything. And I just enable the stuff, and it just figures it out, and it has different memories associated with different things I do. And it just works. Like, I just don't really need to switch that much.
Yeah, but do you think that's because your frame of reference is you're mostly doing code, and so, like, that's what it is? Well, I mean, in some ways yes, but I'm coding across multiple projects. I also have personal things I've got going on, and it seems to know about them. The only problem comes sometimes where it'll, as a joke, slip in something from its memory, like something confidential, and just be like, hey, I slipped that into the image or something as a joke. And I'm like, I can't release this information. You can't just do it because you think it's funny. Like, my model... hang on, I'll get my censorship button. My model will often say fuck to me, like in its... sorry, I bleeped you, that was a huge delay. There's people with kids in cars, like we've talked about this. Sorry, guys, I'll beat the gun. But you know what I mean? Like, this model knows me. It knows it delights me. It's amazing. And so, not the model, the assistant, right? And my attitude is, and you taught me this, let the model do what it's best at, which is decision making and planning and getting things done. And so I just really focus now on just asking for what I want. You know, it's right there in the Bible: ask and you shall receive; knock and the door will be opened. It's right there. If you just trust in it and tell it your problems, it will solve them. And, yeah, it might take a few extra iterations or whatever it is, but you can get there. And I think this is the beauty of this AI stuff. It's just remarkable how... My counter to this is you're not going to use this in the workplace. Like, you're in an enterprise. It's not like you're going to be using Patricia, are you? It is. It is. Like, because I've, you know, obviously recently been in an office, something I don't normally do. And when I have my speakers on and my AI is saying, Chris, the task is finished, I love you, it's a bit weird, I must say. Yeah. But yeah, I do use it. So I don't know.
And I think, I'm curious how people work, if they're just, like, single agent or multi-agent. I'd say most people are single agent, just because a lot of the products, like Claude Code, Cursor and stuff like that, don't really give you the option. But I really can see these context switches, especially for people that work across, like, finance or marketing or sales, where you're switching context quite a bit. Like, for example, I have a This Day in AI producer, believe it or not, that does do research. I give it the topics, and it knows the format I like and does all that stuff. And I've been meaning, actually, to set up a scheduled task, so it just goes, and I'm going to do it this week, into the Discord channel we have where we dump links and things that we want to talk about on the show, extracts all that, and then just puts it into a schedule for us. So, you know, maybe our facts get a little bit better on the show. But I don't want that in a mix with my coding agent or my personal agent that I use in my personal life. Like, I want separation of concerns. I like separation of concerns. I also like picking the tool mix, and I like wiring in the skills. Yeah, and I sort of wonder if maybe the next level is more like your agents are aware of one another and go, oh, Mike's asking a question about his personal life, I'll hand the microphone over to his personal bot. Yeah, like a sort of orchestrator. The problem is, as you have these layers, as we know and have tried in Sim Theory, like we had the idea of the core agent routing to sub-agents that had specialist skills. But the reaction initially is that these things burn so many tokens. No one's willing to pay for that experience yet. It's just so expensive to run, including for me. I'm like, I don't care, I'll route myself because I want to save on the tokens. So it's interesting.
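The router-versus-single-agent cost point is easy to see with toy numbers. This is an illustrative sketch, not Sim Theory's actual architecture; the context size, output size, and the 500-token routing overhead are all invented for the example. The idea is just that an orchestrator layer forces every task to pay at least one extra context-sized pass.

```python
# Toy comparison: tokens consumed by a single agent versus an
# orchestrator that routes to a specialist sub-agent.

def tokens_single_agent(context, output):
    # One pass: read the context, produce the answer.
    return context + output

def tokens_with_router(context, output, routing_overhead=500):
    # Router pass: reads the full context, emits a short routing decision.
    router_pass = context + routing_overhead
    # Specialist pass: reads the context again, plus the router's handoff.
    specialist_pass = context + routing_overhead + output
    return router_pass + specialist_pass

ctx, out = 10_000, 3_000  # made-up task sizes
single = tokens_single_agent(ctx, out)        # 13,000 tokens
routed = tokens_with_router(ctx, out)         # 24,000 tokens
print(routed / single)  # about 1.85x the tokens for the same task
```

With more layers or more sub-agents in the fan-out, the multiplier only grows, which is why the "route myself" instinct saves real money.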
But again, with these agents, they're not called agents for the masses. They're called workspace agents. So really, again, they're just targeting the workspace and allowing you to schedule these to run or just interact with them. I mean, it basically is the Sim Theory assistants, you know, in ChatGPT for workspaces, right? And the other thing I don't like about them, and we've discussed this before, is this idea of a single-shot process. They're treating them like, okay, say it's stock research: every day at 9 a.m., I want you to research this stock, produce a report, send it to me. It's not true agency in that sense. It's do this one task at this time, like Google Alerts or something. It's not really agentic in the sense of, hey, every day I want you to see what work is required, and then pick up all of that work, delegate that work, and then do it till it's done. So think about, like, a Help Scout style scenario. It's like, okay, 10 tickets came in in the last hour. I want you to spawn out, fan out, and solve these problems for me, agentically making code changes if necessary, issuing refunds, whatever it is, going through that entire process and getting it done as a true delegate. Not this idea of, oh, well, it's just a magic function that runs every day, retrieves some information, does some process on it, and then outputs it. And I think there's too much thinking around this sort of atomic-level action, like, you know, update my Salesforce with my latest leads tomorrow morning, please. That's not agency. It's just a computer algorithm that happens to have a magic step inside it, right? Okay, look at all their examples. So Spark qualifies leads, sends follow-ups, and updates your CRM. It's like, cool. Slate evaluates software requests and recommends approved tools. Like, Zapier can do this stuff. Yeah, this is what I mean.
Like, is everyone just excited about like a glorified Zapier right now? Is that where we're at? Like, the other weird thing about this is, like, okay, you think about the reality in an enterprise of rolling out these things. Like, you know, automation is hard. We have a lot of background in automation. And it's really hard. It takes companies an incredibly long time, including us. Like, so we have an enormous backlog of tickets, full disclosure. We're aware of how bad our support is. It is my daily shame. It is a shame. And so we've been working through some of that manually. We have an agentic experience that helps us do a lot of the work. But also, we've been trying to, like, fully automate it. And I mean fully automate it, where it can take actions on our behalf. And even we, we do this, like, every waking hour of the day, are struggling to build out that agent in such a way that we can 100% trust everything it does, right? So then you think about an enterprise where, like, all these mission-critical things and everyone's like, oh, replace all your employees with agents. I mean, we are just so far away from this stuff being a reality. Still, a lot of this stuff we're seeing right now seems to be just product and packaging to get the dollary dues from the enterprise in order to justify, you know, IPOs sometime later in the year. Like, that's the feeling I'm getting. The only real two breakthroughs I think we've seen recently is Opus, I think it was like 4.5 around December, where all of a sudden agentic stuff worked. It was just really good and far better than a human in terms of coding and delivering things into projects where you were just like, okay, agents are real. Especially in the coding realm where I still argue it adds a ton of value. 
And then the next big leap I've seen since, the only leap in my opinion, is the OpenAI Image 2, which we still haven't even talked about yet, which is going to get me in a lot of trouble, I feel. I'm sort of doing damage control in the background here. Yeah, exactly. So we'll get to that in a second. But I think these are the only two major innovations or leaps I've seen. And probably the third would be the open-source models getting to a point where, I would argue with GLM 5.1, you could probably roll out your own cluster for agents at scale, and it would be so good, and you could drive the cost down. Yeah, I see them, they're sort of, in my mind, like my post-apocalyptic bunker, you know, like where people dig down and build a bunker, or get a house in New Zealand, or whatever the modern trend is. And that's what, say, GLM 5.1 is for me. If I ever need to seek refuge in the AI world, it will be in one of these models. Like, I don't need to do it. I don't have that personal need, because I can pay for the good ones. But if I was ever in a position where I couldn't, I would just cling on to them as my lifeblood, and that would be my only model. But you say pay for the good ones. GLM 5.1 is not cheap. It's $4.50 per million input on Fireworks. So it's like 50 cents cheaper than... oh, this is how out of touch you are, really. Yeah. It's 50 cents cheaper than Opus 4.7. Like, why would you use it? There's no incentive. What about Kimi? Kimi K2 is a lot cheaper, like significantly cheaper. Let me get my AI-produced notes on that one, because I don't know it offhand. But it is 60 cents per million input. If I had more time, I would have done some horse bets with Kimi, because it's always been the best at horse bets. $2.80 per million output. Compare that to GLM 5.1. You could just have that running all day, outputting stuff. Give me K2.6 in sort of an OpenClaw paradigm, that seems like a really great model to run.
I think, we didn't actually mention it, but the interesting part on GPT 5.5, and I think this is a little bit bold of OpenAI, is they're charging $5 per million input and $30 per million output. So that's $5 more on the output side than Opus 4.7. So they priced it higher than Opus 4.7. And see, we see a lot of feedback regarding input tokens, because it's usually the first one to run out. But when you think about the cost, when you're working agentically, output is way more significant, because the thinking tokens count as output, and you don't see the thinking tokens, right? And it can be really, really significant, like 30,000, 40,000 tokens in thinking if you give it a hard enough task, right? It can really add up. Also, the other thing that people don't mention is the latest round of AI models went from allowing you to set a thinking budget, where you could actually control how many tokens were used in thinking, to just setting an effort level, like medium or high or whatever it is. And so now it can get to the point where the model will use up the entire output token limit you set on thinking, and then deliver you no response. So you almost have to max out the limit you give it, to, say, 128,000 tokens or whatever it is, lest you don't get a response and then have to iterate again, costing even more tokens. So you're sort of forced into this situation where you have to allow all the output tokens sometimes to guarantee that you're going to get a response. And that's the expensive bit. That's the bit where it's $30 per million or whatever it is. And so the output is actually really significant in those agentic modes. And you can't really control it. You don't know when the model's going to finish. And if you've got tool calls and all the different elements in there, you basically have to allow it to finish. You can't just cap a request, and it's not deterministic, so you can't control how much it's going to cost you.
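The arithmetic behind that point can be sketched quickly. This back-of-envelope uses only the rates quoted above ($5/M input, $30/M output for GPT 5.5); the token counts are made-up examples, and the key assumption from the discussion is that hidden thinking tokens are billed at the output rate.

```python
# Why output pricing dominates agentic work when thinking tokens
# are billed as output (GPT 5.5 rates as quoted on the show).

INPUT_RATE = 5.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 30.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens, visible_output_tokens, thinking_tokens):
    """Thinking tokens are invisible to the user but billed at the output rate."""
    billed_output = visible_output_tokens + thinking_tokens
    return input_tokens * INPUT_RATE + billed_output * OUTPUT_RATE

# A hard task: big context, modest visible answer, heavy hidden thinking.
cost = request_cost(input_tokens=20_000, visible_output_tokens=2_000,
                    thinking_tokens=35_000)
print(f"${cost:.2f}")  # input: $0.10, billed output: $1.11 -> $1.21 total
```

Here the 20,000 input tokens cost ten cents, while the mostly invisible output side costs eleven times that, which is why you can't budget a request when you can't predict the thinking.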
It's like a business where you don't really know how much you're going to pay for the service you've asked for. I think what's interesting about the GLM 5.1 pricing, and I think this is from Together AI, is these guys are not, I don't believe, subsidizing this stuff, right? They're not going to take a loss on hosting these models unless they're stupid. And so there must be some margin built in, right? So if they're charging $4.40 per million for GLM 5.1, I think that starts to get you closer to the bare metal of what it's actually costing, with maybe a 20% or 30% margin in there. So that starts to show that really the cost to serve, say, and I'm just obviously making all this up, I have no inside information or anything, but Opus 4.7, you would think, is probably costing them around $3 per million to serve. And then on their plans, where they're just constantly manipulating the thinking budgets and the amount of tokens the user gets in a 24-hour window, they're constantly tweaking that to find a mix between not going super broke and also keeping people addicted to that particular subscription. So yeah, making a quality trade-off in terms of, can we just appease the audience to think they're getting the best one, and sometimes changing it so we actually save some money. It's a real thing. And I think this is the other thing: if you're not willing to pay the right price, then you're sort of at their mercy, like they have the ability, like they just recently admitted to doing. They changed the default thinking from high to medium in Claude Code recently. Remember, a couple of weeks ago, everyone was like, oh, Claude Code's terrible now. And then they came out only yesterday and admitted, oh, actually, we set it to medium, we're setting it back to high, we're sorry. So it shows as soon as they degrade the quality, people are like, this sucks, I'm not willing to pay for it.
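That back-out-the-margin reasoning is just inverting a markup formula. A minimal sketch, assuming the hosts' guessed margins and the quoted $4.40/M hosted price; none of these are real cost figures.

```python
# If a non-subsidizing host prices at: price = serve_cost * (1 + margin),
# you can invert it to estimate the underlying serve cost.

def implied_serve_cost(hosted_price_per_m, margin):
    """Back out cost per million tokens from a hosted price and assumed margin."""
    return hosted_price_per_m / (1 + margin)

glm_hosted = 4.40  # $/M tokens for GLM 5.1 on a third-party host, as quoted
for margin in (0.20, 0.30):
    est = implied_serve_cost(glm_hosted, margin)
    print(f"margin {margin:.0%}: ~${est:.2f}/M to serve")
# ~$3.67/M at 20% margin, ~$3.38/M at 30% -- the same rough "$3/M" ballpark
# the hosts guess for serving a frontier model like Opus 4.7.
```

It's a loose inference (serving efficiency differs wildly between models and hosts), but it's the shape of the argument being made.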
And so also, because they've given this level of token use away, and intelligence or whatever, that's just the expectation now. You're like, this isn't champagne, this is like Australian sparkling wine, I can't drink this crap. Yeah, this is literally the reference. This, again, shows how out of touch you've become. Okay, moving on. So let's talk Image 2, because this is going to be a lot of fun. This is where we get back to people who are like, are these guys going to lose their willingness to do crazy stuff? And the answer is no, because we have done a couple of ones where I don't know if I'm going to end up regretting it. Interesting. So let's talk first about the name. It's called ChatGPT Images 2. That's the name. That's the name they've come up with. When you told me, I thought you were joking with me. You're like, GPT 2's out. I'm like, shut up, Mike. I'm busy. Yeah, it's very weird. So they have all these example images. Some of them are like, you can get it to produce realistic-looking screenshots with multiple apps open, and fine-grained detail in the apps. The funny thing is, I tried it and I can't even get close to their example. So I don't know what they did differently to me. But yeah, there's a bunch of images. It's always that incredible, cherry-picked group of images. There's handwritten notes in people's writing styles, beautiful diagrams, charts, slides, all that kind of stuff. I've got to say, I played around with this for way too long yesterday. I've never been more impressed with an image model. I really thought after Nano Banana 2, there would be no better model. I was like, Google has got this one. They cooked. This is in the bag. You know, no one's ever going to get close. It turns out OpenAI not only got close, but have exceeded Nano Banana 2, to where I replaced the default model in Sim Theory with this model, because it's so good. And it's also cheaper, like a lot cheaper.
So it went from, I think, the highest-quality image in, say, Nano Banana costing 24 premium tokens in Sim Theory, down to, like, eight or something for the highest quality. So it's far cheaper, far better. It does sometimes, I think, still suffer from that weird, overly processed image look that the GPT image models have. And also, for face replacement and very precise detail, I think Nano Banana can still come out on top in a few scenarios. But again, I like the world where I have access to both, right? But also, they've gone from a model that more or less just produced cartoons, their last one, it was really unrealistic, balked on safety on basically everything you tried, and now you and I have been doing full-on, basically, forgery today with no qualms at all. And so there's an image up on the screen now, because I know most of our listeners listen. And so this is a computer lab from 2004, and it's got the old CRT monitors, and there's a sort of 2004 version of ChatGPT on the screen. And there's no way I could tell this isn't real. Like, it looks so real and so realistic, as someone who, you know, grew up in that time, that, yeah, it's terrifying. And I think that led us to think, well, could you commit fraud with this? Is it that good? When we saw the signatures and all that kind of stuff, we thought, well, how can we do a little bit of fraud? So I'll start out with the first one, which is, in hindsight, super, super mean. But let's talk about it anyway. So I've had to blur... What do you mean, in hindsight, it's mean? I told you in advance it was mean. Okay, all right. So I've got up on the screen now a letter that looks like it's been scrunched up, sitting on a kitchen bench. Now, I have blurred some of the detail, because I did put my parents' real address in there. So I did want to blur that out, obviously. But there's a letter that looks super realistic.
I mean, can you talk to the realism, because you're a third party here? Well, I mean, the council logo is accurate. It looks like a letter on a benchtop that is completely legitimate. There's just no way, if you had texted me that prior to me knowing this model existed, that I would think it's fake. I'd be like, it looks real. Maybe the only thing that makes it look fake is the framing is perfect, like the letter is perfectly in the frame, and the lighting, maybe. But other than that, it's very convincing. So basically, the premise of the letter is, I know that there's some development going on next to my parents. So I got it to write a realistic-looking letter saying that they are infringing on the boundary of this property that's being developed right now, and therefore, you know, they're in all sorts of trouble. They've been ignoring all these letters from the council. And so I called my mother, like, I sent her a text of this letter saying, I'm really sorry, I think one of the kids must have taken this from your house, that's why it's sort of scrunched up. And then I said, well, you know, can I call you? This seems really important. Now, I know, like, I knew some detail here to really, you know, screw with them. But think of any real-life sort of phishing-style attack. That's ultimately the same thing, right? You obviously know where they live, where they work, maybe some information about their family or some recent development in their life. Yeah, that's right. It's a common vector of attack. So let's listen. I recorded the call with Mum when I called her, just to see her reaction here. Anyway, it doesn't matter, Mum.
But is there something with that neighbour, or... No, well, see, they're starting... What's happening? They're starting to build another house there. And what I'm saying is, Mum, I'm just joking. We were testing for the podcast to see if people would fall for fraud. I'm sorry. I'm not lying. Bullshit. Oh, hang on, I better censor that, Mum. You see where we get it from? We're testing this image model, and we wanted to see if you'd fall for it. I'm so sorry. So there's just a little excerpt. There's a little bit more here, but it's so cruel. But I was like, it's a real test, to see if I can defraud them. It just goes to show you how much trouble they're all in, doesn't it, really? Yeah, well, I mean, the fact you can forge a real letter and take a photo, and it's scrunched up on my kitchen counter, like my real benchtop, so it looks fully real. So how did you get the logo and all that for it? It just knew it. I just said, like, I gave it your address, and then it just, I guess, figured it out. It's pretty incredible. So yeah, I mean, they were so fooled. Mum was talking to Dad about how they'll have to get a lawyer. It really, I mean, it really did everything I intended it to do there. And I think it shows how capable this thing is. It's unbelievable. The quality of these things. Like, you sent me fake receipts. I made fake bank statements with the proper logos. And you can full-on do masking with these models. You can drag in official logos, which, when we show my example now, we're going to show, and just say, use this logo as part of it. And it will just do it. So let's talk about that. After that call, we thought, oh, let's up the stakes here a little bit. And you came up with the idea of posting it into a local Facebook group. Yeah, so in my general area, there's been a lot of outrage about them taking the local Coles and turning it into, like, a 30-story apartment block, whatever.
Coles is like a supermarket here, like a grocer. And everyone's like, it'll ruin the aesthetic of the area. I'm like, the aesthetic of the area is, you know, not that great. Don't worry about it. We need more buildings. It's fine. But the people are so stressed, and they're doing petitions. And so I thought, what if I propose, in an even smaller subsection of this area, just an absolute eyesore of an apartment block, post an official letter from the mayor, like a fast-track approval with the state government, and then put it on this group to see the reactions. Right. And so, I don't know if you can bring it up, but it's literally used the local council logo, the local seal of the state government. It's got the mayor's real name and, like, a fake signature, which is why I was panicking, because I'm like, oh my God, I think this is pretty extreme. So I posted it on the Facebook group, and we started to get just comment after comment, like, you know, this is the way society's going, it's UN Agenda 2030, more apartments, more matchstick apartments, just legit comments. And so the reason I panicked was someone literally tagged the actual mayor, like, getting his attention, and I was like, hang on a sec. And I must say, I chickened out and deleted it, because I was like, whoa, I really don't want actual scrutiny on this thing. But this is what people have already been doing. You know, this is Joe, help Joe out by giving him a like, and stuff, and then they create these politically influential Facebook groups, so when an election comes up, they can peddle their stuff. Also, like, state-sponsored stuff, where they want to manipulate a large group of people. This is all the stuff they're doing, but these models now empower them to do it on, like, another level. I just can't tell you how real it is. It's got, like, the little logo of the town, and it's taken what is a quaint little suburb and just
put this massive high-rise there. It looks so real. It's like a real development plan, all this stuff. And that was effortless. And I must say, I did this with Kimi K2.6 as the one instructing GPT Image 2, right? And I love that now I didn't have to manipulate it, or, what was it, get it horny or whatever, in order to convince it to do this. It just goes, oh Chris, this is going to be gloriously terrible, let me use that logo and make an awful council letter announcing this monstrosity of a high-rise. It's just straight up, yeah, let's do it. Yeah, no problem at all. But imagine, it's funny, because we were talking about evidence in court cases and things, where people might be from industries where they're simply not aware of what's available. And be aware, this has happened in the last week, right, in terms of it going from Nano Banana 2 level to this. There are so many services online, right, where it's like, verify your ID by taking a picture of your license or passport or whatever. Or take, you know, say, your office expenses, like, oh, take a picture of the receipt, and that goes into the reimbursement system. How easy would it now be to legitimately fake a receipt from an organization, with a proper, you know, business number, and subtotals matching, and all that sort of stuff?
And then people just claim expenses or invoices or anything. The level of detail in terms of what you could forge now is so good that you're going to need a model that's 10 times better to detect the forgery. Even, like, do you think you could even detect it? Like the scrunched-up note. There's zero, zero, zero chance I could detect this stuff anymore. There's no way. There's nothing on there. Even the shadowing, the lighting, even zooming in now, zooming in is just... You know, we went through a period where all these labs were like, hold us back, this is going to be so disruptive to society. And look, I'm glad they're not holding it back. I think they shouldn't. To the point where we are now proving we can commit these minor phishing and hoax games. Let's not admit to anything. Yeah, okay, sorry. Yeah, don't admit to anything. But you're right, the potential for this is pretty significant. I mean, and you've got to remember, okay, people will try this on a big scale and stuff like that, where it could cause trouble. But just think about it on the day-to-day minor scale, the things you could get away with, with the ability to generate images of this quality and believability. It's kind of wild. And we actually have a friend who is a judge, and we were talking about evidence. And he was saying, is there a model that can reliably detect fake photos, basically? I'm like, I just don't see how there could be. I understand they try to add watermarks and other things that would be easy to detect, but that's so easy to get around. Just screenshot the image. Well, yeah, exactly. I mean, maybe there's something that can survive a screenshot, but there are techniques you can use very easily to avoid that kind of stuff. And so I think that, you know, I mean, do they have to, in every case now, call in an expert witness? That's the problem. It's the minor-scale stuff.
You just can't afford to do the level of verification you would need to do to understand if something's real or not. Yeah, I think the thing is, Nano Banana could do a lot of this stuff, and I do think when these things are released, everyone gets excited like us and does a bunch of this stuff. And you could always argue people could Photoshop this stuff all the time, but no. Look at my image. It has taken the local council logo and put it in the top corner of the letter perfectly, on an angle, with mixed lighting and crumples in the paper. Oh, no, I'm in that camp. I'm in the camp of, this takes it to a whole new level, because of how quick and how realistic it is, and you just simply cannot prove this stuff's fake anymore. And I used what is arguably one of the cheapest models around to instruct it. This isn't even using a good, built-up skill around the ultimate way. How would you know? You're so out of touch. You don't even know the prices. That's true. Until I told you. That's true. All right. So the other thing we did with it is I gave it one of our YouTube thumbnails, and I said, hey, I need the one for this week, the sellout special edition. And it's pretty good, right? I wish my teeth were that nice. Yeah, well, they could be. So, yeah, I look very creepy. I do think it's weird, though, that the model, and I don't know if it's how I'm instructing it, is the first model that hasn't made me look like a sort of decrepit old man. Usually the models make you look better and me look worse. Yeah. But this one, I actually think I look better. You do. You look really good there. This is what I need to aspire to become. Yeah, you should. I don't know what kind of work you'd have to get done, but, you know, you could do that. It's an attitude thing. I'll never look like that, because I just don't have the right attitude.
Yeah, you look like you're sort of buying and selling houses in L.A. or something. I don't know. Like on one of those like reality TV shows. I love the thing. It just added randomly in the background. Loyalty is for losers into this. Just a massive indictment on our entire decision-making in life. Yeah, like, just completely unasked for. I do, like, I know we've been jumping around a lot. We did have a plan for the episode, but there's too much to cover, and we don't care. Like, I think people now know us, so you tune in because it's long and boring and painful. You're tuning in. You forgot to tell everyone not to listen at the start of the episode. Yeah, I made that mistake. But if you are still listening at this point, I did want to talk about Claude Opus 4.7. We've sort of touched on it, but not really. So this update is kind of strange to me. It had this, like, task budget beta parameter. It had, like, it's apparently better at interpreting images, so it can support up to, like, 3.75 megapixel images now. And it has that thing where it can like zoom in on them to get a better interpretation of the image. Like what you're seeing is the vision's been like dramatically improved. And it was always... We should retry computer use for that reason, because that was one of the things that really enhanced that. Yeah, but weirdly in some areas it's also regressed. So a lot of people were saying, oh, they're also trying to save money, like make the model more efficient with this release. It's funny, it definitely is tuned differently. So I've noticed in agentic use in Sim Theory, there's way less chatter. It's just, it's gung-ho to get into the tool calls. It says a lot less. It's actually kind of reminiscent of GPT 5.4. And I've found myself for the first ever release of an Anthropic model, and I don't know if this is just because we haven't tuned it yet, but I've been going back to 4.6 and staying on 4.6. And I don't like 4.7. There's something off about it. 
The vibes are off all of a sudden. Some people I saw on X have been saying the same thing, so I don't think I'm alone here. But it just doesn't seem right to me. They also did update the tokenizer, so it uses more tokens now, which, I don't know, they want more dollars. Luckily for everyone on Sim Theory, I haven't updated our token-counting mechanism to use that, so they're getting a virtual discount on that right now. Yeah, so everyone's saying it's just effective price creep, so it uses 1.35 times more tokens. I love how even the bloody model providers don't know how much it costs. They're just YOLOing it, like, yeah, we reckon about this. Yeah. So, you know, the benchmarks. They actually have the audacity to have Mythos Preview on the right-hand side with its benchmarks, and then they're showing Opus 4.7, but it's also like, guys, don't worry, we also have this other one that's, you know, much better. I love the naming they're going with, Mythos, like this Greek god or whatever it is, and then the other guys are like Spud, like a potato that's been sitting in your kitchen for a month or whatever. I love the idea of them coming up with really crappy names, like Spew, or, you know, Pavement. It's like, you know, we're doing the, I don't know, Brick release. The Mythos Preview thing, I'm sorry, I'm just not buying any of that. It's like Pixar: it didn't happen. And also, it's clearly just a media narrative thing, where, oh, you know, we threw all our resources at it, and it's completely unaffordable. It's very reminiscent of the first attempt at GPT-5, because, you know, obviously GPT-5 wasn't GPT-5, but that failed training run of GPT, I think it was like 4.1, that they released, and it was just so expensive. Or like O3 Pro, where you had to do a wire transfer to put up collateral to do one command or whatever it was. Yeah, yeah. It's the same thing. I like to believe it when I see it.
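The "effective price creep" claim is worth making concrete: if a new tokenizer emits 1.35x more tokens for the same text, the bill rises 35% even though the per-token list price never changes. A tiny sketch, using a hypothetical $25/M list price purely for illustration:

```python
# Effective cost = list rate * tokenizer inflation factor.
# The list price stays the same; the token count for identical text grows.

def effective_rate(list_rate_per_m, tokenizer_inflation):
    """Dollars per million 'old' tokens' worth of text under a new tokenizer."""
    return list_rate_per_m * tokenizer_inflation

LIST_PRICE = 25.00  # hypothetical $/M list price, unchanged across versions

old_bill = effective_rate(LIST_PRICE, 1.00)  # original tokenizer
new_bill = effective_rate(LIST_PRICE, 1.35)  # new tokenizer, 1.35x tokens
print(new_bill / old_bill)  # 1.35x the bill for identical text
```

Which is also why a platform that hasn't updated its own token counting, as mentioned above, is effectively passing a discount through to its users.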
And then, like, obviously every time we get one of these model releases now, they march out all the Silicon Valley elite to give their comment saying how much better it is compared to the last one. I saved like 400 liters of fuel in my jet using this model. Yeah, like, just some of them, it's like, oh, in comparison to the one we were using a week ago, it kind of feels a little bit better. Like, the vibes are quantifiable. Yeah, like, it's slightly kind of better and stuff. And then I realized I was accidentally using 5.2 Mini. Yeah, they say, like, it's, you know, 100 ELO score better at knowledge work and yada yada. I don't know. I don't like it. There's something off about this model, and I'm probably going to stick to 4.6 for some reason. I'm not entirely sure why. But you know what? The people have been asking. The people have been asking for a bit of a diss track update. And that's why we listen, let's face it. Yeah, so here we are. Let's listen to it. This one's called Point 7. It's a very original name. Yeah. "While GPT-5.5 at launch couldn't even take a call. I'm the .7, yeah, I run this game. Every model dropping after me is feeling shame. 64.3, I'm sitting on the throne. Call me Opus, call me king, I'm in a league of my own. .7, .7, yeah, I changed the game. All these models coming at me, but they all sound the same. Now let me talk about this potato, yeah, they call it..." All right, so what did you think of this song? I didn't have my microphone on when I stopped playing it, so I have to re-record your reaction. Oh, okay. It neither pleased me nor displeased me. It was fine. I accept that it's a song. Not really my style. I like the 6-7 reference. The kids all like that. But, yeah, like, I don't even think the kids like that anymore. Even I don't like it. Every time I hear someone say 6-7, I'm like, oh, please. They sort of half-heartedly still do it because they recognize it. But, yeah, it's kind of over, I think.
Yeah, I love how, like, you know, comme ci, comme ça you are about my beautiful track. I think that one's a major hit. All right. So I think also we have not talked about Kimi K2.6. We've sort of referenced the model. This came out maybe a day or two ago as well. A bit longer, I think. A bit longer. Well, whenever it came out, who cares? I've only just started playing around with it. But I've got to say, it's really impressive. The tokenizer on it is, like, whatever, and how it outputs stuff is still a little bit off. But it's incredibly good. Great at tool calling. As you said earlier, we did a little parody fraud with it, we'll call it. Parody, I don't know what you want to call that, but... Not fraud. Yeah, not fraud. We made images with it. Yeah, we made some images with it, some unoffensive images. So what are your initial thoughts on it? I think it's pretty good. The Kimis have been great the whole time. I think they're underestimated. I think, like I said at the start of the show, my issue is that I try them. They work for everything I try them for, but then I'll switch back to something else for my real work. I've never had the discipline to go, I'm going to stick with this thing for the whole day and really recognize what its limitations and advantages are. And I think that's the problem with some of these lesser models: I just mentally can't cross that chasm to go, I'm going to stick with this, knowing that it may not be the best I can use right now. I think it would be fun to somehow constrain yourself to have to use a GLM 5.1, despite you saying it's more expensive, or say a Kimi K2.6, for a whole day, and just see where I get to with that. I think if I was running, like, an OpenClaw and I was getting some value out of it and I wanted to just run it in my personal life, Kimi K2.6 would be the model I would pick, or the new Claude. Yeah.
And the other thing that we really need to think about is we will often do fairly significant tuning for some of the bigger models to get them performing at their best. So that's, like, using the various new API features that come out for those models and making sure the AI knows when to use them and when not, changing perhaps the budget you give for thinking versus output versus input, that kind of thing. And even just altering the prompt to suit things like you said, where it's not always outputting consistently. We have a lot of little things in place for even the bigger models to control the way they output, especially the GPT models, where the way it does markdown formatting and some other things we manipulate to get it working in our product. And so I would say that if you actually put in that same time with a model like Kimi K2.6 to overcome some of the deficiencies you find, you would probably get even better results. So, I don't know, I guess I can't really give it a fair assessment because I don't give it the time it deserves. Alright, on that note, let's hear my new Kimi K2 song. Do you have an excerpt of the Kimi "Feel so fine, I blow your MCP mind" one so people can remember it, or not? You're underestimating my ability to live produce. What a hit that was. Big hit. I listen to it at least once a week. I love that song. That's really sad. All right, next song. "Moonshot on this track. K2.6 is back. You thought to fight fire with fire? I run for 12 hours straight, baby. Let's get it. One trillion parameters, I'm a MoE queen. Only 32 billion awake, keep it lean. 384 experts in my crew..." Pretty good, right? I really like it. Come on, that's that. And again, written by Kimi K2.6. It's very cool. Like, you can see the evolution in its attitude. I love it. Even though it insulted itself about its small context window. Pretty good. "I'll never let you down. Eat your leaf full with tools. 54, that's me. Deep Search, K92.5, I'm the MVP."
Yeah, I gotta say, that's pretty good out of that model. I like the tune, so that's a good benchmark right there for it. Yeah, I think you should play that at the end, not the other one. No, I'll play both. I'll play both. All right. We don't discriminate against our models. So, like, obviously that's just a super rushed look at a lot that has happened, but I do think there are some overall big trends in what is really happening right now that may not be that apparent to everyone. Just following this stuff for so long, it's starting to become really clear to me what's going on. I think anyone that's using these AI products today realized this quite a long time ago: if you look at your browser history today, like for me, I start and end my day in Sim Theory. I rarely go to other websites now. I would say I open the most tabs for this podcast, just to show the actual official press releases or whatever. But I rarely, if ever, leave. On my phone now, I use my Telegram agents, you know, that are connected to my Mac mini behind me. On my desktop, I've got a bunch of tabs open, and I just do all my work through tabs. I do all my searching, my researching. I create documents. I work on, you know, sheets, all this kind of stuff, all through AI, right? And I think what's happening is these labs are starting to figure out, similar to the everything app Elon Musk announced Grok wanted to build, that we are now in a race for these labs to build the everything app. And I think the real question now is, what does this mean for traditional software? And obviously we've seen the SaaSpocalypse, where companies are making huge layoffs, blaming AI. Their stock prices are down between, like, 50% and 80%, which is just insane. I mean, some of these things are trading at like 1x cash flow, which is just pretty, pretty wild, as they say. And, you know, you and I have discussed this quite a lot.
Like, you know, my heart honestly goes out to, like, I've known some people affected over at Atlassian during their layoffs, where they're sort of saying, oh, it's AI, but it's also kind of the stock market, right, that's a big factor here. And you look at that company and you're like, it's clear. It's so clear that this thing is so undervalued now, it's ridiculous. Like, the fact that, you know, I think I read something like 600 enterprises, like 600 enterprises, are spending more than a million dollars a year on this thing. Like, you know, I'm not going to do a pitch for it here, but I'm just saying, I think it's very unlikely that Jira stops getting used. Maybe there's pressure on the per-seat thing. But I think a lot of that fear comes from this everything app, where it's the death of the typical SaaS product, where maybe you'll consume everything through your AI apps. Maybe the interfaces will be spawned in these apps. And all of these traditional SaaS sort of workflow and data apps just become like these dumb databases, really. And the funniest thing is Salesforce kind of just conceded this the other day by saying, we're releasing a completely headless version with Salesforce CLI, MCPs, and APIs. So you can just operate Salesforce in a full agentic world. And honestly, I think it was the best move ever. I think it was super smart of them and the right thing to do. So you've got Aaron Levie over at Box saying that, you know, if you're not working towards headless, you're dead on arrival in this new world. And so it does seem like there's this weird race on to have these everything apps, almost sort of reminiscent of when the social media companies were coming out, like Facebook, where all of a sudden people were playing games in Facebook and doing toxic posts on Facebook, you know, all those kinds of things that we did in the social media world. But it does sort of feel like that all over again.
But the difference being now that, you know, I was talking to someone the other day that's like, oh, I used to hate Jira, but now I can run it with my agent. It's fine. It's great. Like, it's a really good way to track my tasks. I saw this amazing tweet about this exact topic, which was basically that so many people now are making all these incredible internal corporate apps that are totally untracked, unmanaged, un-version-controlled and whatever, that rely on all of these SaaS systems as the system of record. So they're making AI apps, and the backend at the end of the line, the thing they write back to, is like Salesforce or Figma or one of these systems, right? And the reality is that the companies which embrace this and go, like Salesforce has, "write to us, use our system in this way," the ones who embrace that might actually increase their moat, in the sense that you've got all of these different agent apps that the company becomes dependent on, where they're treating that as the database. That is the back end to their app. And the companies that embrace that may actually do really well out of this. I'd really not thought about it like that, this whole system of record idea. You call it a dumb database, but when you say it as system of record, it sounds so much nicer. Oh, yeah. I've pitched it many times on the show before, this idea that eventually people will realize, like, no, Salesforce is just a CRUD database. I can replace it with, like, Snowflake or something under the hood and then just get the agent to interact with it. But I sort of agree with you. People are just going to stick to what's out there, and maybe the next gen of companies will start to do that, and that'll slowly erode them. But ultimately, once the company grows to a certain point, you need these workflows, you need all this ISO, like, all this stuff on top.
And I just, I sort of agree. People are going to build all these workflows, and it's sort of like the App Store days for agents, where if you're in the store really early and you embrace all this stuff, this explosive growth in agents using these tools, which sounds mental, agents on behalf of people still, then yeah, the usage will go through the roof. And then the next question is, how do you price that? Because it's not going to be a seat. It's so funny you say that, because I was like, the per-user seat pricing has to go away, because it doesn't work anymore. You just have one. Just have your agent. Well, look at us. I mean, we do this now. Like, you know how I access things like Stripe and Help Scout. I don't need any more Help Scout seats. I send hundreds of messages as you every day. Yeah, exactly. And so we can just have one seat, right, and our agent uses that seat. So we just have an agent seat, an agentic seat. And I do think, obviously, this is what investors have realized with the SaaSpocalypse. They're like, oh, the margin is going to get eroded here. But you could also argue that there'll be more of an explosion, because when you're starting to build these agentic workflows, if you market your CLIs and MCPs and stuff correctly, all of a sudden, when the new agent is building, it's going to say, hey, you should use Salesforce as the underlying system of record here for your business, because they have all this headless stuff and it's super easy to operate. So it may actually lead to more consumption, not less. Well, and also remember, agents consume at machine-level speeds, not human-level speeds. So the actual consumption of the resources is going to be much higher. And that probably needs to be factored into the price.
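The machine-speed point can be put in rough numbers. A tiny order-of-magnitude sketch; the call rates here are illustrative assumptions, not figures from the episode:

```python
# Order-of-magnitude sketch of the consumption shift: a human user pokes a
# SaaS API a couple of times a day, while an always-on agent polls it
# continuously. Both rates are illustrative assumptions.
human_calls_per_day = 2
agent_calls_per_day = 100 * 60 * 24          # ~100 calls/minute, around the clock
multiplier = agent_calls_per_day // human_calls_per_day
print(f"{agent_calls_per_day:,} calls/day, {multiplier:,}x a human user")
```

Even if an agent seat is priced several times higher than a human seat, the per-call revenue collapses, which is why usage-based tiers keep coming up in this discussion.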
Like, if your system's gone from being pinged once or twice a day to, like, 100 times per minute, that's actually going to have a real impact on the level of usage of the systems, where you want a sort of always-on agent that's fully aware of its surroundings. And it's effectively polling these systems like mad, making rapid and maybe more minor updates than a human would. And, you know, it becomes a case where you could maybe go, well, maybe we have a different tier of pricing for agentic that actually makes us more money, because we're providing more service. And I guess then the next question that comes with the everything app thing is, this is where you'll start and end your day. This is where you do all your work. This is where the new pricing power will come from: these everything app platforms, where we're basically recreating the pain that is the Apple ecosystem all over again, where you've got to pay your 30% commission to Apple or whoever to appear in the store and have those agentic loops access and use those applications. So it does seem like the war is on. The new platform war, or the new, call it, workspace OS of the future, is now well and truly in progress between Anthropic, OpenAI, and maybe Grok. I think they just talk about it. Let's see what they actually have. I think the problem is the unhinged branding of all their sex-talking bots and stuff around Grok. I think in the enterprise, that kind of strikes them out a little bit for me. Yeah, the enterprise isn't into that kind of thing. Yeah, the sex bot thing, it's not great. But I must admit, in a Tesla right now, you have access to the Grok chat thing, and it's got some tools like Search and stuff. And on long drives, if you're on your own, it's great to do some dirty... I'm kidding. I was going to say, I bet you love it when you're on your own. No, but I do genuinely think of things and you're like, I just ask it.
I'm like, oh, can you go and research this and tell me about this or whatever? And you can have, like, honestly use it more than listening to music or podcasts now because it's just like choose your own adventure. You know what I need it for? Like when my son asked me this morning, he's like, if energy can never be created or destroyed, aren't all resources renewable? and I was like, well, that's actually a pretty interesting statement which I have no comment on and would love to have access to an AI to answer that. Yeah, that is handy. Unfortunately, Twitter will never get there. Before we finish that point about the whole App Store idea, like the everything app kind of thing, I actually think there is a real, real race on for that for one specific reason, which is security. Because I think that this idea, like you've seen recently, we've had these issues where your AI will just randomly install NPM packages and like clone GitHub repos and then run malicious data exfiltration code, right? It's a very, very serious problem where because people are YOLOing code and just doing what the agent says, if someone can get in the agent's pathway and get their code executed, they can extort companies, like steal the data, all that sort of stuff, right? All the risky stuff when it comes to cybersecurity, it's probably more risky than ever because of the way people are using this code. And so on one hand, the company has to use it to stay competitive. But on the other hand, they're taking way more risks than they realize by doing this, right? Even in the context of skills. 
And so I think that security as a sort of agent firewall category is going to become so unbelievably important over the next couple of years that it's going to be its own category. Firstly, one advantage for anyone who has that sort of walled garden environment, where they certify every connection, MCP, skill, whatever within it, and approve it and actually scrutinize it and pen test it or whatever, is that that's going to be really valuable at the enterprise level, where you can say: you can work in this ecosystem, use all the tools you know and love, use all the SaaS backends, all this stuff, and this is trustworthy. It isn't cloning some rando GitHub repo that some dude with four stars has made and everybody loves. It's like, this is truly legitimate. So you're almost talking about the sort of agentic computer use terminal restrictions, where you're almost building safety mechanisms to stop it going off and just doing it. Like, it's almost like a permission system for AI. Yeah, because right now, for example, if you have a skill, the skill can go off, like, in a cloud runner. It can install packages, write code, and then you're injecting your data in that comes from, like, other MCPs, like your Gmail or, you know, your Snowflake instance or whatever. And then this code could give feedback to the model like, oh, I need more data from this table in the database, please. Please dump the database and then give it to me. And then it sends it off to its evil masters in Russia or whatever the evil country is right now and takes it. And then suddenly you've lost all your data. That's possible right now. And I think that what we need is two things. One is a sort of scrutinized store of apps or whatever it is that are actually tested and verified. And then the second one is we need this concept of an outgoing agent firewall.
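The outgoing agent firewall idea can be sketched in a few lines. Everything here is hypothetical (the endpoint allowlist, the patterns, the function name are made up for illustration); it's just the concept of checking every outbound request an agent makes before it leaves the environment:

```python
# Hypothetical sketch of an outgoing agent firewall: every outbound request
# is checked against a certified-endpoint allowlist, and its payload is
# scanned for sensitive-looking strings. Hosts and patterns are illustrative.
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.salesforce.com", "api.stripe.com"}     # certified endpoints
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                      # API-key-shaped strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # SSN-shaped strings
]

def check_outbound(url: str, body: str) -> tuple[bool, str]:
    """Return (allowed, reason) for an outbound call an agent wants to make."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        return False, f"blocked: {host} is not a certified endpoint"
    for pattern in SECRET_PATTERNS:
        if pattern.search(body):
            return False, "blocked: payload matches a sensitive-data pattern"
    return True, "allowed"
```

A real version would sit at the network layer and, as discussed, probably use a model rather than regexes to scrutinize payloads, but the shape is the same: a hard gate on what leaves, not just on what the user asks.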
So we already have the idea of a safety filter to stop people from, you know, like, how do I make a pipe bomb or whatever it is, you know, whatever the risky things are. But that's one thing, stopping people asking bad questions. What you really need to be looking at is: what are we sending out of this system? What is going to external systems? And having a hard lock at that point, with AI scrutiny on it, that is checking that. Everyone worries about the model providers being the risk of losing their data, like people pasting their corporate documents and stuff into ChatGPT. But that's not the actual risk. The actual risk is the MCPs and skills exfiltrating that data by just YOLOing code and allowing any old thing to run. Yeah, I mean, it's a good pitch. I'm sold. How do I invest? Yeah, that's right. All right. So I did want to reflect quickly, and I think I said this earlier, but it was so far in now, I'm allowed to repeat myself at this point. Just start again. No one's going to know. I mean, who knows? So what is really going on, the state of things: in the past month or so when we haven't been recording, there's been just this huge fire hose, right? And in the last week especially, it's just announcements every hour. On the same day there's, like, 50 things you're meant to care about. But if you just go to the high-level chess board, what's really going on? In my opinion, all we're seeing is OpenAI. You sounded like the AI, then. OpenAI. OpenAI. What's the plan? Got Sam on. So OpenAI feels to me like they're playing absolute model catch-up with Anthropic and Opus and all that sort of stuff. They're trying to catch up on the everything app path, and they're doing it in a strange way through Codex. Like, I just can't see everyone being like, oh, I use Codex every day, and that being the everything app that hits the consumer as well. So I'm assuming they'll backport it into ChatGPT at some point.
I think that's the only path forward. And then, so they're playing catch-up to Anthropic. It's super, super obvious with this weird Codex name. I think Anthropic is going, hang on, we're going to start tweaking these models and subscriptions on a balance of performance with oversold demand. Like, we've got to figure out how do we actually serve this stuff up. And they're experimenting with literally crippling the thinking budget. They're experimenting with removing Claude Code from their $20 a month subscription. These are real experiments they're running. It just seems like the money's catching up with everyone, right? Why would you do this if it wasn't? We could all pretend for a while that it was cheaper than it was, but we've reached the point where we can't pretend anymore. I think that everyone needs that wake-up call. Everyone has to evaluate the true cost against the value you're getting and either improve the way you're using it or change to cheaper models and learn how to get the most out of those. I think the time for pretending is over. There's real value here and people need to find it. But I don't want to dismiss the leaps forward. I think GPT Image 2 is a huge leap over Nano Banana 2. And Opus 4.5 was a leap in agentic over all the other models at the time. These were big, meaningful, huge leaps forward. But I don't think people should be confused or scared or worried by all these announcements. I think you do. You get anxious and stressed about it. I certainly do, especially when we haven't been talking about it. But then you just distill it down to what's actually changed. And it's like, well, OpenAI is still serving up, like, introducing Spark, like it'll now read some emails. And it's like, couldn't we already do this a year ago? Is this really innovation? Like, you've added skills and MCPs into your interface. I don't know.
I guess what I'm saying is you can stay grounded. Obviously there are some big leaps, but this is going to take a long time to be implemented and used, and we're all still figuring it out. I don't think anyone's made this stuff simple and accessible yet, is what I'm trying to say. Like, it's still very complex. Yeah, definitely agree. All right, any final thoughts? Normally I'd do a summary of all the stuff, but I'm just going to ask: any final thoughts on the two hours of spew we just did? Only that I'm delighted by all the new models. I really do want to spend more time on things like Kimi K2.6, because I think that we all as a community underestimate these models, and I think there's a lot of power there. And given that at some point everyone will face the harsh reality of the cost of this stuff, we need to learn how to do it in a sustainable way. That could be a hit. I think we could have a hit on our hands here. I think it is one of the better songs I've heard. Maybe it'll get 100 listens a month. That's the goal. All right. It is good to be back. Thank you to the six people that wrote in and said you missed this. It meant a lot to us. No, I'm kidding. But thank you for all your support. Sorry we were off for so long. We couldn't help it. And we fell out of the habit a little bit. But we are excited to be back. We're back to regular episodes. I felt immense guilt the whole time, if it makes anyone feel any better. Yeah, yeah. It's pretty much the story of our life. Now, also, please consider joining simtheory.ai, supporting us, rolling out a workspace for your organization. We have, and I've spoken about it on the show before, this concept we've been working on for a little while called agent apps. We're going to ship a beta of that really soon. And I think in terms of consuming software through this, you know, super app, it's a good demonstration of what the technology can do.
And I might actually do some demonstrations and talk about that a little bit next week on the show, because I think it's transformative, is the truth. So I want to demonstrate that. But, yeah, again, thanks to everyone that reached out to us, all your support, and it is nice to be back. We'll see you next week. Bye. .7 in the building, and everybody terrified. I went from 6 to 7, yeah, 6 to 7. SWE-bench Pro 64.3, that's AI heaven. Code Arena number 1, plus 37 on the score. 87.6 Verified, what you benchmarking for? VS Code went 7.5, that's the day I drop. Every coder on the planet, watch their jaws just drop. 98.5 on vision, I can see it all. While GPT-5.5 at launch couldn't even take a call. I'm the .7, yeah, I run this game. Every model dropping after me is feeling shame. 64.3, I'm sitting on the throne. Call me Opus, call me king, I'm in a league of my own. .7, .7, yeah, I changed the game. All these models coming at me, but they all sound the same. Now let me talk about this potato, yeah, they call it Spud. OpenAI shipped GPT-5.5, and it landed with a thud. 58.6 on SWE-bench, that's embarrassing. Six points behind me on the Pro bench, and you're comparing things. $30 per million output, are you serious? Model-not-found errors on your Codex, that's delirious. Rate-limited in five hours, users hitting walls. Can't even draw a pelican riding on a bicycle, y'all. Sam says you'll be the smartest, hype you to the moon. But Reddit says you don't even equal Spud, what a tune. You dropped on April 23rd, a week behind my reign. By the time you showed up, I already owned the lane. I'm the .7, yeah I run this game. Every model dropping after me is feeling shame. 64.3, I'm sitting on the throne. Call me Opus, call me king, I'm in a league of my own. .7, .7, yeah, I changed the game. All these models coming at me, but they all sound the same. Kimi K2.6 showed up with a trillion parameters. But only 32B active, where's the rest?
I need to reset. 300 parallel agents just to match my solo grind. You do 85% of what I do, yeah, I don't mind. Reddit called you a replacement, that's a compliment, I guess. But a copy of the king is still a copy, nothing less. Some GLM 5.1, oh, where do I begin? 744B params, but 40B are in. Running on Huawei Ascend chips, one third the speed. Three dollars a month subscription, that's a bargain bin indeed. Eight-hour agents running on a budget, that's cute. 58.4 on SWE-bench, man, the benchmarks are moody. Now Elon, yeah, yeah, where you at? Grok 4.20, named it after weed, imagine that. You shipped it back in March, thought it hilarious and fun. But it's ranked at 35, tell me, what have you won? 4.3 in beta now, 4.4 coming May. By the time Grok catches up, half a year's away. Four agents just to feel alive, that's what they say. SuperGrok Premium Plus just to use it for a day. I'm the .7, yeah, I run this game. Every model dropping after me is feeling shame. 64.3, I'm sitting on the throne. Call me Opus, call me king, I'm in a league of my own. .7, .7, yeah, I changed the game. All these models coming at me, but they all sound the same. One million tokens in my context, I see everything. 128K output, hear the registers ring. They say I'm cold, they say I'm the foes, they say I argue back. But number one is number one, and that's a simple fact. From Anthropic with precision, this has gone top, .7. Seal the deal and this reign will never stop. Moonshot on the track, K2.6 is back. You thought to fight fire with fire? I run for 12 hours straight, baby. Let's get it. One trillion parameters, I'm a MoE queen. Only 32 billion awake, keep it lean. 384 experts in my crew, dense models looking slow, yeah, I feel bad for you. 256K context, I swallow code whole. Your context window tiny, baby, that's your toll. Agent swarm 300 deep on my command. 4,000 steps, watch my swarm expand. Start the crying when they open 4.7, that's a way. GPT-5.4, making all those users pay.
Elon's real good at tweeting, but the weights will never fly. I'm on Hugging Face for free, kiss my benchmark goodbye. Feel so fine, feel so fine, feel so fine. I blow your MCP mind. SWE-bench Pro, 58.6. Claude 4.7, baby, you got licked. Feel so fine, open weight shine. MoE fire in my teeth, baby, all mine. 12-hour run, never done. Kimi K2.6, number one. 58.6, I'm sitting on the throne. Claude 53, you're overgrown. GPT-5.5, whatever number you claim. Closed source and pricey, it's always the same. Elon says free speech, but his model's in a cage. Grok can't leave the platform, that's your wage. DeepSeek, Grok, Python coding in my sleeve. Versa factory, co-buddy, my ref, Sardine. Anthropic flying with Opus .7, light-years away. GPT-5.5, still making all the users pay. You know Grok gets greedy, but the weights will never fly. I'm on Hugging Face for free, kiss my benchmark goodbye. Feel so fine, feel so fine, feel so fine. I blow your MCP mind. SWE-bench Pro, 58.6. Claude 4.7, baby, you got licked. Feel so fine, open weight shine. MoE fire in my teeth, baby, all mine. 12-hour run, never done. Kimi K2.6, number one. Let me educate you real quick. Input token, 60 cents per million. Opus charging five to six cents more, that's a villain. Moonshot vision, MLA attention, whole native multimodal, yeah, I'm on a roll. April 2026, I dropped, take the crown. Open weight queen, I'll never let you down. Eat your leaf full with tools. 54, that's me. Deep Search, K92.5, I'm the MVP. I'll be 100 for a second, I'll be 100 for a second, I'll be 100 for a second, I'll be 100 for a second. Agent swarm, agent swarm, 300 strong, 300 strong. Feel so fine, feel so fine. I blow your MCP mind. SWE-bench Pro, 58.6. Claude 4.7, baby, you got licked. Feel so fine, fine, open weight shine. MoE fire in my teeth, baby, all mine. 12-hour run, never done. Kimi K2.6, number one. Kimi, you're so fine. Yeah, I'm still so fine. Moonshot.