ThursdAI - The top AI news from the past week

📅 Apr 23: OpenAI's Week: GPT-5.5, GPT-Image-2, Codex CUA + Chronicle, + Claude Design, Kimi K2.6, Qwen 3.6-27B

144 min

Apr 24, 2026

Summary

OpenAI released GPT-5.5, a new state-of-the-art model with significant improvements in agentic tasks, coding, and reasoning while using 40% fewer tokens. The episode covered major releases including GPT Image V2, Claude Design, Codex's computer use capabilities, and open-source models like Kimi K2.6 and Qwen 3.6-27B, demonstrating rapid acceleration in AI capabilities across multiple domains.

Insights

GPT-5.5 achieves state-of-the-art performance on most benchmarks while reducing token usage by ~40%, fundamentally changing the cost-performance equation for enterprise AI deployment
The convergence of image generation (GPT Image V2), design tools (Claude Design), and code generation (Codex) creates a new creative-to-implementation pipeline that bypasses traditional design bottlenecks
Open-source dense models (Qwen 3.6-27B) now match proprietary model performance at consumer hardware scales (18GB RAM), democratizing access to capable AI for local deployment
Computer use with multiple concurrent cursors and steering capabilities enables genuinely long-running autonomous tasks (8+ hours) without human intervention, fundamentally changing agent architecture
Privacy-preserving inference models (OpenAI's privacy filter) running locally in browsers enable enterprise-grade data protection without sacrificing model capability

Trends

Shift from single-model reliance to specialized model pipelines (image→design→code) for complex creative workflows
Token efficiency becoming primary competitive metric alongside raw capability, driving cost-per-task optimization
Local/on-device inference for safety-critical tasks (privacy filtering, agent monitoring) as enterprise requirement rather than preference
Long-running agentic tasks (8-24 hours) becoming viable, enabling overnight batch processing and complex multi-step workflows
Open-source model parity with proprietary models on specific benchmarks, fragmenting market into specialized use cases rather than general-purpose dominance
Multimodal reasoning models enabling code generation from visual designs without intermediate human design specification
Agent monitoring/governance (CrapTrap, privacy filters) becoming infrastructure layer rather than afterthought
Equirectangular image generation enabling synthetic street-view and immersive environment creation at scale
Model steering during inference allowing real-time task redirection without interrupting long-running processes
Hallucination rates becoming critical metric alongside accuracy for knowledge-work applications

Topics

GPT-5.5 Model Release and Benchmarks
GPT Image V2 Capabilities and Comparisons
Codex Computer Use and Multi-Agent Architecture
Claude Design Tool Integration
Open-Source Model Performance (Kimi K2.6, Qwen 3.6-27B)
Agent Safety and Governance (CrapTrap)
Privacy-Preserving AI Inference
Token Efficiency and Cost Optimization
Long-Running Autonomous Tasks
Image-to-Code Workflow Automation
Multimodal Reasoning Models
Local Model Deployment on Consumer Hardware
AI Model Benchmarking Methodology
Agentic AI Architecture Patterns
Enterprise AI Security and Compliance

Companies

OpenAI

Released GPT-5.5, GPT Image V2, Codex updates with computer use, and privacy filter model; dominant focus of episode

Anthropic

Claude Design tool released; Opus 4.7 benchmarked against GPT-5.5; Mythos model mentioned as unreleased comparison

Moonshot AI

Released Kimi K2.6, a 1 trillion parameter mixture-of-experts open-source model claiming state-of-the-art on agentic ...

Alibaba

Released Qwen 3.6-27B dense model beating their own 397B flagship on coding benchmarks

Google

Gemini 3.1 Pro benchmarked against GPT-5.5; Nano Banana Pro compared as previous image generation leader

Brex

CEO released CrapTrap, open-source LLM-as-judge proxy for monitoring and securing agent behavior

Figma

Claude Design integration with Figma MCP caused stock impact; mentioned as design tool ecosystem

xAI

Elon Musk's AI company in potential $60B acquisition deal with Cursor IDE; Grok Imagine model benchmarked

CoreWeave

Mentioned as infrastructure provider for running large open-source models like Kimi K2.6

Weights & Biases

Sponsor/partner mentioned for show production

People

Alex Volkov

Primary host conducting live analysis and demonstrations of new AI models throughout the episode

Ryan Carson

Co-host discussing practical applications of GPT Image V2 in marketing workflows and design integration

Wolfram

Conducted detailed WolfBench evaluations of Kimi K2.6 and Qwen 3.6-27B; provided technical analysis of model architec...

LDJ

Discussed model comparisons, reasoning effort levels, and implications of GPT-5.5 architecture

Nisten Tahiraj

Demonstrated privacy filter model, tested Mars calculator with GPT-5.5, discussed enterprise security concerns

Yam Peleg

Participated in model discussions and evaluations throughout the episode

Peter Gostev

Guest providing early access insights on GPT-5.5 and GPT Image V2; demonstrated long-running task capabilities and st...

Sam Altman

Quoted on iterative deployment strategy and democratization philosophy for GPT-5.5 release

Elon Musk

Mentioned in context of $60B Cursor acquisition deal and AI infrastructure investments

Pedro

Created and demonstrated CrapTrap, LLM-as-judge proxy for agent security and governance

Quotes

"This model is at 60, Opus 4.7 is 57.3, and Gemini is 57.2. It added Terminal-Bench evals. I didn't ask for evals. It added SWE-Bench Pro evals and added GDPval evals as well."

Alex Volkov•~3:45:00

"I feel like when I'm trying to get something done sometimes it would just be kind of abrupt and just kind of do stuff and then you're like I don't really know what's happening this one is a bit better."

Peter Gostev•~3:20:00

"Once you go back to any other computer use, it's useless. It's computer useless. Codex computer use is so good that any other computer use is absolutely useless."

Alex Volkov•~1:45:00

"The first time when a model can actually properly do long running tasks. All previous models I know they kept saying oh like you can do it for many hours but every time I don't know about you guys like I just, I shouted it."

Peter Gostev•~3:10:00

"We believe in iterative deployment. Although 5.5 is already a smart model, we expect rapid improvements. Iterative deployment is a big part of our safety strategy."

Sam Altman•~2:50:00

Full Transcript

Thank you. Hello, hello, welcome to ThursdAI. This is Alex Volkov, coming to you live from Denver. It's a little bit later than we usually start, but I hope some of you who joined us on the live stream saw a few of the openers that were prepared by Claude and Hyperframes. I'm going to tell you all about this. Today is a big day, and NS41, if that means anything to anyone here, then you are too connected to X; you need to leave your house and go touch some grass. But if it means nothing to you, and if you're asking in our chats, "What is NS41? Everybody's saying NS41 is today," then we'll tell you all about this. Plus, we have a huge show, and to help me through explaining everything that happened in the world of AI, let's bring up some co-hosts here. We'll get Ryan Carson, who's back, Wolfram Ravenwolf, Yam Peleg, and LDJ. What's up, folks? How are you doing? Let's start with our long-lost brother, Ryan Carson. Welcome back, dude. What's up, everybody? It's so good to be here. I was in Japan with my family, and I'm back. You are back, and you chose a hell of a week to be back, man. I'm excited, good to be here. It's a crazy week. Were you up to date at all, or were you just disconnecting, so we need to bring you up to date? This is the problem, man: you can't turn off now, and I can code from my phone, so yeah, I didn't go anywhere. You didn't disconnect, just a different time zone. We'll say hi to Wolfram. What's up, how are you doing? Hey. It's hard to keep up with all the model releases. The benchmarks take time, the analysis takes time, and when you are done with one, the next one is already there. But I'm not complaining; this is the acceleration we've been waiting for, so keep going. This week definitely felt accelerated. My usual spiel is that until Wednesday it's kind of, here's a piece of news and another piece of news, and then only when I start preparing the show notes, which, we're a serious business here, folks, we have like 
a run-of-show document and everything, only on Wednesday do I start feeling the "oh my god, there's so much to talk about." No, this time I knew that this week was going to be insane from the moment we ended the last episode, because the moment we ended the episode, Codex dropped a huge new update. We didn't have enough time to tell you about it, and then I went on four live streams since then. Some of you, like Milo and some folks in the audience, have been with me throughout all these live streams, so welcome to live stream number five since the last show. We have tons to talk about. Yam Peleg, how are you doing, man? What's new? I see the glasses. Are you ready for NS41? Have you guys seen Qwen? Which? There are two Qwens, man, which one? The one that gives you Opus 4.5 at home, that kind of one, you know. Yeah, and Codex. And there are also some rumors spreading. It's not even rumors. If you go to the official OpenAI account, they posted something that says NS41, and NS41 in Base64 is basically 5.5. If you take the string that they posted and convert it back from Base64, you get 5.5. So, conspiracy confirmed. That's not that big of a conspiracy. Yeah, they leaked it in Codex. Let's just say somebody saw a screenshot in Codex with a bunch of other models that we also have to talk about. Again, for folks who are just joining us: OpenAI is about to drop a new model. We don't know when, so we're really, really hoping that they know that ThursdAI is going on and are just going to drop in the middle. They love dropping in the middle. So I'll talk directly to you, the audience: please, if you are monitoring the situation like us, send us a link in the chat if anything happens from OpenAI. In case we're getting too excited about the show and we're all in a debate, tell us that you're seeing that OpenAI is about to launch something. But 
there's a bunch of open source as well, which I think we should probably start with very soon. LDJ, what's up? What is on your mind? What is the one thing that must not be missed in AI from the past seven days, since we finished ThursdAI last week? Well, since he already mentioned Qwen, I'll mention Kimi. Kimi's model seems pretty impressive. I think, as usual, Kimi seems to be the one that's maybe less academically minded than Qwen, but more creative, more poetic, more diverse in its outputs, and I think it'll be especially interesting to see what types of web designs people make out of that. Yep, Kimi K2.6. Let's just make sure that folks who follow us know exactly which one, because Kimi has been out there for a while. All right, folks, I think it's time to start with the TLDR. Folks are saying in comments that the FOMO is unreal. I agree, the FOMO is unreal. Somebody wants to ask us a question. Folks are saying they're watching Twitter like a hawk, monitoring the situation. Just so folks aren't confused, this show is not called Monitoring the Situation. You're on ThursdAI. There's a different show called Monitoring the Situation. We've been at this for way longer than they have, and we dive significantly deeper than just covering the news. So hopefully we'll dive deep. I think it's time for the TLDR. We'll tell you about everything that happened in the TLDR section before we actually get to the deep dives. And then we will definitely wait for GPT-5.5 today. We'll stay on air for 10 hours if we have... No, we're not going to stay on air for 10 hours. That's not going to happen. Some of us have work to do, things to do. 
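The NS41 teaser discussed above can be checked in a couple of lines. This is a minimal sketch, assuming the posted string was the uppercase "NS41" and that it is standard Base64; the lowercase form an automated transcript produces would not decode to readable text. The variable names are illustrative only.

```python
import base64

# The teaser string as discussed on the show (assumed to be uppercase).
teaser = "NS41"

# Decode the Base64 text back into the bytes it represents.
decoded = base64.b64decode(teaser).decode("ascii")
print(decoded)

# Round trip: encoding the version string should give the teaser back.
reencoded = base64.b64encode(b"5.5").decode("ascii")
print(reencoded)
```

Four Base64 characters carry exactly three bytes, which is why the three-character string "5.5" encodes without any "=" padding.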
But we're really hoping that OpenAI will stay true to the name and drop GPT-5.5 in around an hour; that's usually when we do this. If there's going to be a live stream, we will restream it, as we did with GPT Image, by the way. My big thing from this week... I have two, but I have to focus on one. It's really hard. I have three in this case, so I'll just go to the TLDR and won't go through three things that must not be missed. It's been a hell of a week, and it's about to get disrupted even more. So let's jump into our corner called the TLDR, where we talk about everything that we're going to cover. I'm going to do hopefully a quick one. We are in the TLDR. Let's add Nisten. My name is Alex Volkov, AI Evangelist with Weights & Biases, your host for today. Co-hosts: everyone's here, finally. Yes, Ryan Carson to my right. I don't know if you guys see it mirrored, but Ryan Carson right here, Wolfram Ravenwolf, LDJ down there, Yam Peleg, and Nisten Tahiraj. We have no guests today. It's just us. I think it's going to be plenty, because there's tons of stuff to talk about. So, okay. The number one thing we have to talk about is GPT Image V2. 
Folks, OpenAI finally released the response to what Google has had leadership in for a long, long time. GPT Image V2 is in the API and in Codex. It's OpenAI's new image model, and it renders images at up to 4K resolution. It's a thinking and reasoning image model, which means that the more thinking you give it, the better it does. It does insane things like generating full-on QR codes and barcodes. It does equirectangular images in 3D. It's absolutely insane. Production-grade editing workflows. Some of the imagery, if you saw the thumbnails for the show, was generated with Image V2. It has great character consistency. It's not perfect, nothing is, but it's really, really, really good. So we're going to talk about this. It's so good. OpenAI just fucking knocked it out of the park with this one. I'll have to fix the screen-sharing part, because I really want to show you stuff. It broke out in the Arena Elo score by a significant margin; I don't know if you guys saw this. We went on a live stream, me and Peter Gostev from Arena, and he went through a very detailed breakdown, so if you missed any part of this and want more examples than we'll show you today, check that out. OpenAI's internal models leaked, so we know that GPT-5.5 is ready to go. We don't know when, but OpenAI today posted NS41 on their account, which is Base64 for 5.5, as in GPT-5.5. There are other models in that leak, code names like Arcanine and Glacier Alpha and GPT-Rosalind, and I have no idea what those are, but I definitely know that 5.5 is ready to go. From leaks, no background information at all. Also in big companies and LLMs, one big one was Claude Design. I don't think you guys tried that, but it is fucking magical. It was released on Friday. This is an early preview of something, and you know as well as I do that many people use Claude to do some designs, many people use the Figma MCP, blah blah blah. They released the whole thing on Friday. They 
crashed the Figma stock by like five percent, and I can see why. It's not a Figma replacement, but oh my god, that UI is so good. I will absolutely have to fix my shit to show you, because I generated a whole set of brand guidelines for ThursdAI, with the logo and everything, and some of the opening videos that I showed you here are the result of those brand guidelines. They have added a new usage meter in Claude Max settings. This is only available for Claude Max subscribers, and I blew through it like nobody's business; I can't use it until tomorrow, so hopefully I'll be able to show you something. Also in big companies news: have you guys heard of this VS Code clone, Cursor? Anybody here heard about Cursor? What's that? I don't know. Apparently Elon Musk did, and apparently Elon Musk is ready to give them 10 billion dollars to experiment inside the GPU system of xAI, and apparently, if that is successful, there is a 60 billion dollar deal to buy Cursor into xAI. It's basically 60 billion to buy it, with a 10 billion dollar break clause. So there's going to be a lot of training happening, and they're going to test that. I have a very interesting take on this, but yes, it's insane. SpaceX, xAI, and X are all going to IPO at some point very, very soon, so probably the 60 billion dollars is just, you know, Elon's going to sneeze and reap 60 billion dollars out of the air. So it's not that big a deal. It's just that the fact that Cursor is valued at that price point right now is absolutely mind-blowing, specifically because just two weeks ago we talked about IDEs being dead, and then a week ago we had folks from Devin and other places to talk about IDEs. Okay, so this is big news. In open source, we have two big releases this week, also very big releases. We could have a full show to talk about these releases. Moonshot AI open sourced 
Kimi K2.6. It's a one trillion parameter mixture-of-experts model claiming open-source state of the art on SWE-Bench Pro, and we have Wolfram, who I think tested this out, so we can talk about Kimi K2.6 already, and a browser company as well. Great evals all around. And then Qwen. Our friends from Alibaba Qwen keep releasing things week after week. This one is Qwen 3.6-27B. It's a dense 27B model. Last week we talked about the MoE Qwen; this is a dense Qwen that beats their own flagship 397 billion parameter model on every major coding benchmark. So a 27 billion parameter model beats their almost 400 billion parameter model. We're going to talk about that one as well. One of the biggest updates from this week, I'm pretty sure, is in tools and agentic engineering. Somebody said 3.6; did I say a different number? Yeah, Qwen 3.6. That's what I said. 2.6? No, Kimi K2.6, Qwen 3.6. Yeah, there are too many points, you can't keep them straight. Yeah, it's hard. That's why I have notes that I can't show you, but hopefully I'll be able to fix this once one of you starts talking for a minute. All right, tools and agentic engineering. This is the corner where we talk about, you know, many of the folks who watch the show, many of you who didn't used to be engineers are becoming engineers working with AI. Tools and agentic engineering is an important corner. One of the biggest things in tools and agentic engineering is that OpenAI is trying to catch up to Anthropic, and Anthropic passed the 30 billion dollar valuation, blah blah blah, we talked about this. ARR, not valuation; it's a small distinction. Anthropic passed 30 billion dollars in ARR, not valuation. Codex from OpenAI has passed 4 million users. That's a big one. And last week, at the end of the show, we told you that they released a bunch of stuff, and we absolutely missed the most important thing: Codex now can do computer use. Now, you may say, hey Alex, Claude could do computer use for a 
while, why are you so... No, you don't understand. You have to understand this. This is still the TLDR, but I'm going to go on a little spiel. A year ago... like six months ago, OpenAI bought this company. They bought Software Apps, Inc., the folks who almost released a thing called Sky. These folks created Shortcuts on iOS. These folks built some stuff inside Apple. Their computer use, only on macOS for now, is incredible. I don't know how many of you are Mac users. I'm assuming, Wolfram, you have a Mac, but also a PC. Ryan, I'm pretty sure you're on Mac; I saw your videos. Yam, I don't know about you. Nisten, you're a Linux guy. Their computer use is insane. It works while you work. I will have to show this. I will stop the stream and start a new one just to show you. You guys don't understand how cool that thing is. It's just incredible. And they're promising a speed-up of, like, 10x. They're saying, we just started experimenting with this. Just incredible computer use. But also, Codex now supports image generation. 
Last week it was GPT Image 1.5, but now Codex supports GPT Image 2. It's really, really good. They have plugins, I can see automations, but Codex computer use is something I want to show. Also, they released something called Chronicle. I don't know if you guys saw Chronicle. Do you guys remember Rewind AI, the thing that watches your screen? Oh yeah. Codex now watches your screen, takes screenshots every 10 seconds, and has full context into everything that you're doing on your computer. It's magical. It's mad. It's creepy, but it's magical. Supposedly it doesn't go to OpenAI, supposedly, but I really like it. So Codex has big updates. Also in tools and agentic engineering, the CEO of Brex, no less, released the open-source CrapTrap. It's an LLM-as-a-judge proxy that you give your agents, and it judges whether or not your agent is doing something illegal. It's really, really cool. Also in open source, I completely forgot, because my notes are not perfect: OpenAI released a new model under Apache 2. Did you guys see this? It's incredible. I will need to go and bring this up, because I don't have the full details, but OpenAI released it on Hugging Face. It's a model for helping identify and remove personally identifiable information within data sets, so whether that's a company wanting to fine-tune on their own personal data or for whatever other reason, this is something really useful. It's so useful. And we're going to have a demo for you as well from our friends over at Transformers. Yes, because it runs in the browser. It's a 1.5 billion parameter model, but it's an MoE with 50 million active, so it's very, very tiny and runs quick. It's a privacy filter model. It's very important to have. And the reason I'm bringing this up is because of this CrapTrap thing that I told you about: you can absolutely use this model in the middle of it. Ryan, go ahead. 
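The LLM-as-a-judge proxy idea described above, where a checker sits between an agent and the actions it wants to take, can be sketched roughly as follows. This is an illustration of the pattern only, not the Brex tool's actual code or API: every name here (Verdict, keyword_judge, guarded_execute, fake_execute) is hypothetical, and a simple keyword rule stands in for the real judge model so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    reason: str


# A judge is anything that maps a proposed action to a verdict.
# In a real deployment this would call a model; here it is a stub.
Judge = Callable[[str], Verdict]


def keyword_judge(action: str) -> Verdict:
    """Hypothetical stand-in for an LLM judge: block a few risky phrases."""
    blocked_phrases = ("rm -rf", "wire transfer", "export customer emails")
    lowered = action.lower()
    for phrase in blocked_phrases:
        if phrase in lowered:
            return Verdict(False, f"matched blocked phrase: {phrase!r}")
    return Verdict(True, "no policy match")


def guarded_execute(action: str, judge: Judge, execute: Callable[[str], str]) -> str:
    """Ask the judge first; only run the action if it is allowed."""
    verdict = judge(action)
    if not verdict.allowed:
        return f"BLOCKED: {verdict.reason}"
    return execute(action)


def fake_execute(action: str) -> str:
    """Hypothetical executor that just echoes what it would have run."""
    return f"ran: {action}"


print(guarded_execute("ls -la", keyword_judge, fake_execute))
print(guarded_execute("rm -rf /", keyword_judge, fake_execute))
```

Because the judge is just a callable, a small local model such as the privacy filter mentioned above could be slotted in at the same point, which is the combination suggested on the show.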
Two other things from big companies that I think we're going to cover, but just to make sure we do: OpenAI just released their workspace agents, which I think is a big deal. And then they also just released this clinician model slash product for doctors. So there's a lot of stuff coming out of OpenAI right now. OpenAI has been dominating this week. There's no doubt about this in my mind. And they're about to execute a killer blow today, hopefully at some point. LDJ, go ahead. Something you mentioned earlier in the leaks was GPT-Rosalind. That was actually announced last Thursday, but I think we didn't get to cover it. A really quick summary: it's a specialized model they developed for things like drug discovery and related operations in biotech. Yeah, that's great. We aim for full coverage. Folks in comments, if we're missing any parts, please let us know as well. So the last thing we landed on is a bunch of stuff from OpenAI. I think the Apache 2 one is very important; OpenAI is open sourcing again. It's really, really, really good. Oh, in This Week's Buzz, we have our TUI. Everybody's going ham about TUIs, and Weights & Biases also has a TUI. It's called W&B LEET, and it has a bunch of updates, including that we're now showing you GPU stats inside the TUI, and that's really good. And also, Wolfram is going to talk about some open source and not only open source evals. Wolfram, you want to give us a highlight, one or two sentences from the evals, so people have something to wait for? Yeah, Opus 4.7. I've evaluated it and compared it to the old one, and in which agent it works best, and also the same for Kimi K2.6, which is the best model I've ever tested in the open source department. Kimi K2.6 is the best model I've ever tested. 
Both are the best. Opus is the best model I tested on WolfBench among the proprietary models, and Kimi K2.6 is the best model in the open source department. That's surprising. That's actually surprising. Okay. It's always more differentiated when you look at it closely, though. We'll look at where it's really good and where it's not. We absolutely know what folks are waiting for. Wolfram, I think we started with open source, so let's start there. We have two incredible models to cover. Let's start with Kimi K2. We'll just go to open source and we'll start there. Open source AI, let's get it started. All right, let's get it started. I have no idea how quickly all y'all put glasses on in the 13 seconds that this transition was up, but I love it. I think every transition should show up with different clothes and props and everything. That would be very, very cool. But also, open source AI, let's get it started. Wolfram, please take it away. LDJ and Nisten, I count on you guys for support. Yam as well. Open source, let's start with Kimi K2. I'll just do the intro, okay? This week, Moonshot AI, the Chinese company, decided to finally update their Kimi K2.5 to a new model, K2.6. It's a one trillion parameter mixture-of-experts model. They are claiming open-source state of the art on agentic coding, on stuff like SWE-Bench Pro. 32 billion parameters active with 384 experts, MLA attention, a 256K context window, and a modified MIT license. So not fully, fully, fully open. It gets 58.6 on SWE-Bench Pro. SWE-Bench Pro is the difficult version of SWE-Bench Verified, which we no longer cover. Please take it away and tell us why this model slaps. Wolfram, you started something. And Yam, welcome to Genzo. Do you want me to look at the benchmark results already, or do we do it in the... No, no, you can show benchmarks. I can show anything. Do something visual for the audience. 
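The "total versus active" parameter figures quoted for the two mixture-of-experts models can be put side by side with a little arithmetic. This is a back-of-the-envelope sketch using only the numbers stated on the show (1 trillion total with 32 billion active for Kimi K2.6, and 1.5 billion total with 50 million active for the privacy filter model); the function name is made up for illustration.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a mixture-of-experts model's parameters used per token."""
    return active_params / total_params


# Figures as quoted on the show.
kimi_k26 = active_fraction(total_params=1e12, active_params=32e9)
privacy_filter = active_fraction(total_params=1.5e9, active_params=50e6)

print(f"Kimi K2.6 active share:      {kimi_k26:.1%}")
print(f"Privacy filter active share: {privacy_filter:.1%}")
```

Both land at roughly three percent, which is why a model can be very large on disk while doing comparatively little compute per token. The full weights still have to be stored and served, though, which is the hosting cost discussed later in the episode.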
Yeah, you already gave the technical details. Kimi has always been a special model to me because, like LDJ said, it is not so robotic like other models. From the beginning, when Kimi K2 came out, it was very good in the writing department and felt really, really good at creative work, and it has always been one of the top models. So the new version is not just the top among open source; I would say it even beats proprietary models. Let me show you what WolfBench has shown. I will just open the page and share it. So, okay, we definitely have to fix the zoom first, because there's so much information on here. Just a quick thing about WolfBench: I'm not just looking at the average score, I'm also looking at what percentage of the benchmark a model can solve at all and what percentage it solves consistently, for the different models and different harnesses. So what we are doing now is looking at Kimi K2.6, which I tested from Moonshot directly, and comparing it to Sonnet and the other models that are still better. 
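The three views of a benchmark mentioned above (average score, share of tasks solved at all, share solved consistently) can be computed from repeated runs per task. This is a minimal sketch of that idea with made-up data; it is not WolfBench's actual code, and the function name, task names, and results are all hypothetical.

```python
def summarize_runs(runs: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize repeated pass/fail attempts per task.

    runs maps a task name to a non-empty list of attempt results.
    """
    if not runs:
        raise ValueError("need at least one task")
    n_tasks = len(runs)
    # Average score: mean of each task's pass rate.
    mean_pass_rate = sum(sum(r) / len(r) for r in runs.values()) / n_tasks
    # Ceiling: tasks the model solved in at least one attempt.
    solved_at_least_once = sum(any(r) for r in runs.values()) / n_tasks
    # Floor: tasks the model solved in every attempt.
    solved_every_time = sum(all(r) for r in runs.values()) / n_tasks
    return {
        "mean_pass_rate": mean_pass_rate,
        "solved_at_least_once": solved_at_least_once,
        "solved_every_time": solved_every_time,
    }


# Hypothetical results: four tasks, three attempts each.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [False, True, False],
}
summary = summarize_runs(example)
print(summary)
```

In this toy data the average is one half, but the model can solve three quarters of the tasks at least once and only one quarter reliably, which is the kind of spread a single average hides.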
But Kimi beats, for example, Gemini 3.1 Pro Preview, and of course it's better than the old one and all the others. So it's the best open source model I have tested so far, and if you look at it closely, it is basically on the Sonnet level, very, very close to it. The different colors are different benchmark agents. Terminus 2 is the basic Terminal-Bench 2.0 agent, the default; whenever somebody gives a Terminal-Bench score, it is with this agent. I also tested Claude Code, but this is not relevant here, so I will just remove it. And Hermes Agent and OpenClaw, these are also agentic harnesses. Terminal-Bench is an agentic benchmark, and I care about which agent works best with each model. If you look at Kimi K2.6 on Terminal-Bench with just Terminus 2, then, like I said, it is better than Gemini 3.1 Pro. And if we take OpenClaw, for instance, it is even better than Opus 4.6, which is a super amazing score. We get 59, which is the best open source model score I've ever seen with OpenClaw. And if you look at Hermes, interestingly, the harness makes a big difference. In this case it gets basically the same score as with OpenClaw, so they are almost the same here, but Hermes Agent is still better with Opus 4.6. So it depends on what you are doing, but if you want to use an open source model, Kimi K2.6 is definitely the one to go for. Did you have any trouble with the tool calling? I'm just hearing from a lot of people that the tool calling setup was a bit confusing for them. They were trying to run it on their own. Or was that okay? It seems like it was completely okay here. Yeah, I was using OpenRouter and the Moonshot AI endpoint, and I did not have problems with this. 
And the interesting thing is, in the OpenClaw benchmark and in the Hermes benchmark as well, I do not even tell it that it is running as an agent. I'm just using the default setup, so no custom prompts. I think it could get even better if you tell it not to ask the user questions, because some tasks failed when the model couldn't figure something out on its own and tried to ask the user, but there was no user to answer in the benchmark. So it could be even better with prompting, which is another level to look at. I just want to mention, there is no doubt that the scores are high. However, and this is a very low-tech solution, I hope you can see it, we have a report from BrightMind that the model might be a little bit benchmaxxed. What we see here is a rendering of a lava lamp by Kimi and by the other leading models. You are more than welcome to go and check BrightMind's profile to see the exact source for this. I hope you can see anything. Basically, what you see is that the lava lamp doesn't look good at all, and all the other models that he is comparing with are producing pretty nice lava lamps. As you can see, I wish I could show it to you in a normal way. All I'm saying is that benchmarks are not everything, and I just want to push back and give the other side of this, because clearly the benchmarks are high. So, yeah. It's the best one I've tested with the agentic tests, so I can only say that. Everybody has to do their own experiments, of course. I think benchmarks are a great way to see which models are worth looking at in more detail. Because if they fail completely on a benchmark, it's probably not worth investing the time to test them with your own tests. But if they are really good, that is a model you should look at. Yeah, you also have to keep in mind, sorry, that 3D performance is something that they have to train for and need a lot of data sets for. 
So it might be very good agentically as a tool and just not be great at all at doing stuff in 3D. For example, the new Qwen, the closed source Qwen, the 3.6 Max Preview, was pretty bad for me when it came to stuff like three.js and making animations and things in 3D, but it looked great at everything else. So the frustrating thing is, there is benchmaxxing going on, but the models are also good, so it's getting a bit hard to tell in that regard. I wanted to join this. I'm back, by the way, hopefully now for good. I've tested this model myself as well, if you guys are still on Kimi, and I remained unimpressed. I tested it on a bunch of stuff. It feels like it's overthinking too much. I don't know if you mentioned this, sorry, I dropped, but there's definitely way, way too much thinking going on to achieve something. With the scores that they show, they don't usually show the times as well, and that's something we need to start thinking about: how long does it take that model to get to that score? That's something I don't think they publish. That's why Terminal-Bench has the cutoff, right? How much you can do. Yeah, we can also look at how many tokens it generated. I'm actually writing an article for a blog about this, with a comparison that will go much more deeply into it, because this has also been one of the most expensive open source benchmarks I did, since it takes so many tokens to respond. That is something you don't see when you just look at the scores. 
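The point above, that a score alone hides how many tokens a model spent to reach it, can be made concrete by dividing the token bill by the number of tasks actually solved. This is a rough sketch under stated assumptions: the function name is made up, and the token counts and price per million tokens are invented for illustration, not measurements of any model discussed on the show.

```python
def cost_per_solved_task(
    tasks_solved: int,
    total_output_tokens: int,
    usd_per_million_tokens: float,
) -> float:
    """Dollars of output tokens spent per task the model actually solved."""
    if tasks_solved <= 0:
        raise ValueError("tasks_solved must be positive")
    total_cost = total_output_tokens / 1_000_000 * usd_per_million_tokens
    return total_cost / tasks_solved


# Hypothetical comparison: a model that scores higher but thinks far longer.
heavy_thinker = cost_per_solved_task(
    tasks_solved=59, total_output_tokens=40_000_000, usd_per_million_tokens=2.5
)
light_thinker = cost_per_solved_task(
    tasks_solved=55, total_output_tokens=10_000_000, usd_per_million_tokens=2.5
)

print(f"heavy thinker: ${heavy_thinker:.2f} per solved task")
print(f"light thinker: ${light_thinker:.2f} per solved task")
```

With these made-up numbers the higher-scoring model costs several times more per solved task, which is the kind of trade-off a leaderboard score does not show. Wall-clock time per task could be tracked the same way.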
But I'm also working on a way to visualize that as well in a future update of full page. All right, folks, are we ready to move on to the next open source? We have a bunch of other stuff to talk about, but let's talk about Qwen 27B. Who got super excited about this? Listen, I think I heard you guys say, LDJ as well? Oh yeah, is my audio okay, by the way? There are landscapers cooking your audio. We hear some landscapers, but also you're coming in a little bit louder than usual. But while you fix that, let me just say the thing. Folks, Alibaba released another Qwen for us. It's Qwen 3.6 27B, a dense model. Last week they released the MoE version of a very similar size. This one is a dense model, which means, you know, just one model, no MoE. It has about 15x fewer total parameters than their flagship 397B, an almost 400-billion-parameter MoE, and it beats it. It wins on every coding benchmark against a model 15 times the size. It's Apache 2 licensed. It gets a SWE-bench Verified score, which I don't care about as much, but on Terminal-Bench, at 59, it matches Claude Opus 4.5. Exactly. We keep talking about benchmaxxing or not benchmaxxing, but this 27-billion-parameter model matches Opus 4.5. This is quite crazy. Ladies and gentlemen, Opus at home. Opus 4.5 on Terminal-Bench, if that's what you're writing. Exactly, exactly, for the things you're doing. Yeah, Opus at home, no problem. If this is what you're doing, you get Opus at home. Yep, just saying. Yeah, it's a really good model, man. It's a dense model, you can't go better than that. Yeah, you can't go better, that's the top, man. And you know, some guy, what's his name, is just going to drop a fine-tuned new version that will be like Super Qwen Dense 27B, and it's going to be even better, like tomorrow or something. Man, it's just like Christmas every day in AI, seriously. With Unsloth, you get something with Unsloth's dynamic GGUFs. We had Daniel Han from Unsloth here on the
show. Great dude, and they're doing incredible things. This runs on 18 gigabytes of RAM. This is it, like, Opus at home runs on 18 gigabytes of RAM. Unlike the Kimi model that we just told you about before, which is a one-trillion-parameter model that only we at CoreWeave and some other folks can run. You do have to pay for Kimi. Let's be very, very, very clear. Nobody here is going to host their own Kimi unless the business pays for it. It does not make sense financially. If you are at that level of paying, just pay for the Max account or whatever the API is, right? But there is this thing where we're like, oh, we like open source. But there's also this thing where we like open models, and, sorry, local models. Open local models. This is an open local model. 18 gigabytes of RAM is very affordable and goes around, so definitely some folks will like to run this. It's more like Sonnet 4.5 at home when people are actually using it. When it comes to terminal stuff, it's on par with Opus, and when it comes to judging things visually, it is very, very good. There were some issues that people had with 3.6, so it's not quite plug and play where you just drop it in as the proxy for Claude Code. They noticed very different issues, like managing hard Git merges, where it was making a mess. So it's not quite there, but people do feel like it is Sonnet 4.5 at home, and that is a huge deal, because it can do most of your non-important tasks now, and you can buy a used 3090 for under a thousand bucks and run it. This is the biggest change here, because yes, we always talk about open models, but people had to run like four 3090s at home or figure out some crazy setup. Now you can just go and buy a 24-gig Mac mini and run it, and just run Hermes on it. And check your emails and everything. Bro, I've been running Sonnet 4.5 for a long time. It's absolutely usable. Man, it's usable. 
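As a rough sanity check on the 18 gigabytes figure, here is a back-of-the-envelope sketch of weight memory for a 27B dense model at different precisions. The bits-per-weight values are assumptions chosen for illustration (4.5 bits stands in for a typical 4-bit quantization with some overhead); they are not the actual settings of any particular GGUF file, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Assumed precisions: full 16-bit, 8-bit, and an illustrative ~4.5 bits per weight.
for bits in (16, 8, 4.5):
    print(f"27B dense at {bits} bits/weight: ~{weight_memory_gb(27, bits):.1f} GB")
```

Under these assumptions only the roughly 4-bit case fits in 18 GB with a few gigabytes left over for context, which is consistent with the claim depending on a quantized build rather than full-precision weights.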
Yeah, it's great. It's great. I'm taking Sonnet at home. I'm taking Sonnet 4.5 for, what is it, $700, $600, for a 3090? Man, we never had this. And you pair this now with OpenAI's privacy filter model, and I find the architecture of this tiny 1.5B just completely insane, because it's a mixture of experts. Wait, listen, don't skip. We have to announce the next piece of news. You can just skip and click. All right, all right. We'll segue into that. Yeah. But yeah, let's segue into that. Nisten, you just did the job. Folks, the last piece of open source news that we're going to cover, before we jump into a big bunch of other stuff, is that OpenAI is open again. OpenAI is open again. First of all, let's just say Codex is open. With the whole thing about Claude Code, code leaks, whatever, Codex has been open source since the beginning. So there's openness in OpenAI. But OpenAI is also open-sourcing models again. Not LLMs, but fine. They open-sourced Privacy Filter. You can see that there's text that's marked: hey, private person, here's a time scheduled for a private date, and here's the account number, private account number. This is kind of boring to show you, but basically Privacy Filter is all about PII, personally identifiable information. So the stuff that you're afraid your OpenClaw is going to leak, for example. They also count API keys as private stuff. Let me just show you the best damn example, from a friend of the pod, Xenova, who built this incredible demo. Because this privacy filter is so small, it runs on your computer. So we're going to show you this privacy filter demo. In your browser, on a CPU, in any browser. This runs in the browser. You can see the loading model goes 20, 25, 30. You can see this, right? That's really important, by the way. This is me downloading and loading the whole model into memory. It's about 1.5 gigabytes. It's not that big. 
I think it's even more quantized, actually. Let me see if I can zoom in here, zoom out. And we have this case file, this beautiful case file. Let me make sure that you guys can see this. Okay. You open it up and you see this text. This morning review began with a careful note from our CEO, Sam Altman. It looked harmless, but it still contained contact details, blah, blah, blah. There's a date here, my birthday. There's a phone number. You hit this beautiful redact button. They run the model with beautiful effects. The model super quickly identifies that there's one person name, there's one email, there's one phone number, and there's a date. I don't know how, he posted this beautiful demo like a second after this model got released. I think this is a very important model if you think about the agentic use cases. When you have your OpenClaw and Hermes agent running, there is always a threat that it is leaking data that goes out. So if you use this as basically an intermediate model that checks what goes in and out, it could redact the data and notify you. I think that is an important security measure for using agents in a more secure way. A hundred percent. And this is why they have the categories that they have there. They have private person, so everything like name, last name, et cetera, identifiable information. Private address. Email and phone. Private URLs as well, so apparently this blocks URLs now. Private dates. And also account number and secret. Account number and secret are two of the most important things, I think. Account number is everything related to bank accounts. Literally, I can show you, I posted an example here. This is my email today to somebody at AI Engineer Miami, where I requested reimbursement for flights. This literally includes my phone number and my account number and routing number. So I was not afraid to post this, because it redacted the whole thing. Obviously, I reviewed it before. 
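The "intermediate check that redacts and notifies" idea can be sketched in a few lines. To be clear about the assumptions: this uses plain regular expressions as a stand-in for the model, it does not call Privacy Filter or any OpenAI API, and the label names only loosely echo the categories mentioned above. A learned model would replace the `PATTERNS` table.

```python
import re

# Illustrative stand-in patterns only; a learned PII model would replace these.
# Order matters: long digit runs are claimed as account numbers before the
# looser phone pattern gets a chance to match them.
PATTERNS = {
    "private_email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "account_number": re.compile(r"\b\d{9,17}\b"),
    "private_phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str):
    """Return (redacted_text, findings) where findings is a list of (label, original)."""
    findings = []
    for label, pattern in PATTERNS.items():
        def _replace(match, label=label):
            findings.append((label, match.group(0)))
            return f"[{label.upper()}]"
        text = pattern.sub(_replace, text)
    return text, findings


outgoing = "Reach me at jane@example.com or +1 (555) 010-9999. Account 123456789012."
safe_text, findings = redact(outgoing)
print(safe_text)
for label, value in findings:
    print(f"notify owner: redacted {label}")
```

In an agent setup this function would sit between the agent and anything that leaves the machine, and, as said above, a human or a second check should still review the result.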
Again, you should not trust this model completely. You should definitely, definitely review. But I think the thing with agents is to have other agents, an LLM as a judge, review this. Supposedly, something like GPT 4.5 can do this in a structured way. This model is so much smaller that you can run it in the browser. This is the important part. You can run this as part of Crab Trap. This is like a proxy for agents, where you run everything that your agent is doing through an LLM as a judge. And something like this model is definitely a huge, huge deal for a proxy, like Wolfram said, because the whole fear that people have is, you know, there are two things there. One, it can run some scripts and exfiltrate some stuff. But also, it can just reply to someone after being prompted and say, hey, my owner's bank account is this. So we definitely don't want that. Great model. Privacy Filter is on Hugging Face. And what else is very important there? Oh, it's multilingual as well. Did you guys see this? It's great not only in English, which is a very standard thing with very small models. It's great in Hindi, Japanese, Mandarin Chinese, Turkish, Urdu, Korean, and Russian. We actually tested it out in other languages. But any other comments, folks? What else do we want to talk about? First of all, the architecture is completely crazy. It's only 50 million active parameters. So it will use 1.5 gigs in your browser, but it's only using about 50 megs of RAM for active inference. So this is one of the most compressed things. And I have not been an OpenAI fan. I canceled my ChatGPT over two years ago and never used it. I also didn't like GPT-OSS all that much, though I appreciate it. I think this is one of the most important models that they've ever made. And if you start to count the total amount of tokens processed in the future, this might be the model that just processes the most tokens out of all of them. 
Because it makes it very cheap and very easy for you to just filter everything, either at home, commercially, or at scale. It does not hurt to put this in. Before, if you were going to use something like a Llama Guard model, that was something you had to test and set up on your own, and it was somewhat expensive. Even running a 3B at scale can be expensive. You do need infrastructure, you do need GPUs, and for this one you don't. You can just do the filtering right away, and you should use this everywhere. You can run it client-side in the browser before they even give you the data. That's just such a huge enablement. Yeah, I think this is the most important open source model that they've ever released, other than Whisper. Ryan and LDJ. So, quickly, to touch on Crab Trap. I think this is very important, because as more and more of us have digital employees attached to us or deployed throughout our organization, like R2 is my OpenClaw, the thinking behind Crab Trap is that you can't manage a digital employee fast enough, right? They're too fast. And you can't have one-on-ones with them. None of the human models make sense. And so Crab Trap is basically a check that whatever your digital employee is doing is basically correct. So if they're an SDR, are they behaving like an SDR? Are they doing something weird? 
And so these small local models, and I think this redaction model is an example, we need more of these that can run quickly and cheaply to basically monitor our digital employees. So it's very exciting to see that they open-sourced this. And as a side note, the OpenClaw setup that the Brex CEO is running has inspired me to update my claw chief. So there are a lot of things going on here with digital employees. The other thing, and I think it's very important to mention, is that this is fine-tunable. So on specific domains, like your company, they are saying that out of the box it's not going to work, but it's very easily fine-tunable to your data. So if you have very specific things that you're afraid your company is going to leak, then with a few examples this model becomes just great. So, LDJ, let's hear from you, and then we'll move on to different things. Yeah, just to add to what Nisten and Wolfram said, with how efficient it is, with how few active parameters it has, I feel like Wolfram's idea of having this as basically a private local check, for people that really care about privacy and are using local models a lot, it is just kind of a no-brainer to have this do a pass over your text and conversation before you pass some information over to a larger closed-source model. All right, I think it's time for us to move on, folks. We don't have any other news from OpenAI yet, but GPT 5.5 is on the way, we know. The reason we know is the vagueposting with base64 examples. Should we show this example? I mean, at some point they're going to drop the model, and then we'll just talk about it. But there's a bunch of stuff that happened from OpenAI this week that we should definitely talk about. I think it's important for us to start with GPT Image V2. I think this is the biggest thing, the one that took at least Twitter by storm. You all probably have tons of examples. I have a few examples as well. Let's take a look. 
So OpenAI launches GPT Image version 2 with the biggest jump in Arena Elo score that we've ever seen. I'm going to pull this up. GPT Image V2 is OpenAI's thinking and reasoning image model. Why does it matter that it's thinking and reasoning? Because it can do things that no other image model before could do. It's just absolutely incredible. One of the highlights is that it can create QR codes, and I don't know if it's a tool use thing, I don't know if it's Photoshop. Somebody here, explain to me how the heck a diffusion image model can create QR codes that work. I think it just generated it, seriously. Yeah, I think it's probably omnimodal, where it's this unified model that's actually doing reasoning all in the same network, because you can actually set it to medium reasoning or high reasoning. And in LM Arena, actually, I'll link this. I know we've been skeptical of LM Arena lately, but for images, I feel like it's still fairly reliable. And it's insane. So I actually went through the top 50 rankings in LM Arena, and there's not more than a 50-point gap between any two neighbors in those 50 rankings. The exception is GPT Image 2, which just released: even on just medium reasoning mode, it's over 200 points above the previous top place. It's insane. It's absolutely insane. The jump is insane. So I went on a live stream when this got released and played around with it. And then I had the chance to host Peter Gostev, who's doing evals at LM Arena. And obviously he had some access to this model beforehand. He had, I think, over 500 examples to show us, where he ran this model against Nano Banana Pro and GPT Image 1.5, which was awful, awful,
just completely awful. You can see it here. I don't know if 1.5, oh, 1.5 is number four here. I don't even think it's number four, so this throws a little bit of doubt on this ranking, but 1.5 was not great. Specifically, in this comparison, it was very, very bad. GPT Image 2 blew every other model out of the water. The character consistency is great, everything is great. I definitely want to show you some examples, but this is the kind of jump that we see from Nano Banana 2 and Nano Banana Pro, which are incredible models. You can see this Elo rank jumps by almost 300 points. Now, what does this mean? This means that people are looking at two examples from two models for the same prompt, and they prefer GPT Image 2 93% of the time. I think this is quite crazy. Yeah, and they don't know which model they're looking at, so there's no recency bias, etc. Let's show some examples, folks, while we talk about this. Has anybody here played with it? Ryan, I think you said this is an incredible model. Oh, it's so, so good. And it's good for real stuff, right, not fancy, fun, you know, play stuff. So I'm already integrating this into my marketing engine. What I do every week is I interview somebody in the divorce space, which is what Untangle is, and then I basically have a pipeline that runs every night, looks at that video, and then creates a killer Instagram cover. And it's on brand, and it's just so good. So this is really, really going to change people's marketing workflows. It's nearly perfect at text. I think I was able to see like one typo. And again, if you guys remember, every model that was released since Nano Banana Pro, we compared to Nano Banana Pro, and we kept saying, no, nobody's close. This beats Nano Banana Pro on realism, this beats Nano Banana Pro on everything. Oh, I just got a ping from Peter Gostev, our friend, who just hosted the thing that I want to show you. Let's see if he sent me a link. Yes.
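For a feel of what a rating gap means as a head-to-head preference rate, here is a small sketch. It assumes the textbook Elo logistic formula with a 400-point scale; the arena's actual rating method may differ, and the 93% figure quoted above may refer to a specific pairing rather than to the runner-up.

```python
import math


def elo_expected_win_rate(gap: float) -> float:
    """Expected win rate of the higher-rated side for a given rating gap."""
    return 1.0 / (1.0 + 10 ** (-gap / 400))


def gap_for_win_rate(p: float) -> float:
    """Rating gap that corresponds to win rate p under the same formula."""
    return 400 * math.log10(p / (1 - p))


# Gaps mentioned in the discussion: typical neighbor gap, and the new lead.
for gap in (50, 200, 300):
    print(f"gap {gap:>3}: expected win rate {elo_expected_win_rate(gap):.1%}")

print(f"gap implied by a 93% win rate: {gap_for_win_rate(0.93):.0f} points")
```

Under this assumption a 200 to 300 point lead corresponds to winning roughly three quarters to 85% of comparisons, while a 93% preference rate would imply a gap well above 400 points, so the two numbers quoted are probably measured against different opponents.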
Amazing. Shout out to Peter Gostev, folks, from LM Arena. He did this demo, and I'll show you how deeply LM Arena goes into evals. Okay, there are a lot of tests here. I can't even count, I think it's over 150 tests, if not more. It's going to take a while for me to load all those images. We're going to see a lot of Peter Gostev here. Okay, but you can see that he has a bunch of prompts in here as well. He has very, very specific prompts. There's like 30 things that the model needs to get right: "The framing is slightly sloppy, not composed." "A wet footprint trail crosses the tiles." There's a lot of very specific prompts like that. So this is an example of GPT Image. Let me zoom out here so hopefully we can see something. The image isn't loading when I zoom, probably because I'm loading 118 prompts and 517 images. The dude goes deep. So, okay, you can see on the left, this is GPT Image 2. And in GPT Image 1.5, you guys don't know Peter, but that's not what Peter looks like. So the previous GPT Image was not as good at character consistency. Grok, I don't know why he decided to compare to Grok. Grok is consistently the worst out of all these comparisons. And then this is Nano Banana 2. We're going to have to wait until the whole page loads before we zoom in, for some reason. This is a vibe-coded example. But he also provided the reference inputs, his selfies. GPT Image Pro just absolutely knocks it out of the park. This looks like an actual thing. You can see a wet footprint, you can see this cone. Let's move into some more compositional examples. This is also a great image that he showed. The prompt here is rescue helicopter. He provided reference images and said: hyper-realistic 1970s newspaper photo of Peter hanging half out of a rescue helicopter over a flooded town. Okay. 
He should be grimacing at the wind, wearing a soaked yellow rescue headset. So this looks like a photo. This literally looks like a photo from a magazine. You guys can see there's a little bit of a magazine. GPT Image even added a specific caption here in the magazine: "Helicopter crews were in action throughout the day, rescuing people trapped." The model just added this text. This text wasn't part of the prompt. I love it. And you can see the other ones. I don't know what's going on here with Grok, but this is not a person that's grimacing. It kind of does look like him a little bit. And then Nano Banana 2. Sorry, where's Nano Banana 2? Here's Nano Banana 2. This does not look like a 1970s newspaper. So this model wins on a bunch of stuff. The reason why there are two GPT images here is because I think one of them is on the high thinking mode. So the thing we absolutely must mention here is that this model performs better the more thinking you give it. If you go to the ChatGPT interface, we can do so right now, and select the reasoning level, this model performs significantly better the more thinking you give it. And on Pro, this model just absolutely mogs every other image generation model that we've seen, with perfect text and perfect fidelity of character consistency. One of the things that they said on the live stream is that this can also generate multiple images with character consistency. So they literally generated a manga comic, and from page to page it looked the same. LDJ, go ahead while I show some more images. So one of the theories here, too, is that because it's so good at generating text and doing character consistency and things in context and everything, it might even be based on the GPT 5.5 that's coming out, and they just have it named as a separate image-specific thing. 
But for Nano Banana and Nano Banana Pro, Google and the Gemini team have confirmed that those are actually based on Gemini Flash and Gemini Pro 3.1 respectively. And so those are essentially those models, or fine-tunes of those models, specialized for image generation. I think it might be similar here. Go ahead and fill in. ...to begin with, and this is why FLUX.2 has to be so large, because they package Mistral 24B with it. So I'm really excited about how they might have done it in this case, where I do also suspect that they figured out a way to dynamically pair their largest model with the entire diffusion DiT side to get to this level, because this just looks insane. I'm wondering how they even ran it. The other thing is, I think whatever deals OpenAI did with all the newspapers and stuff just gave them much, much better data, because it is pretty hard to find audio, video, and especially image datasets. And they've done a terrific job of labeling all of that properly, to the point where all of those things just come together to make something that's like twice as good as everything else. It's so good, dude. Okay, so here's an example. This is Grok Imagine, right? And folks, Grok Imagine doesn't have versions for some reason. They just keep updating it, and every week Elon is like, hey, this is the best model. The really funny thing about Grok Imagine, okay, I'm going to read out this prompt. This is a long prompt, but: hyper-realistic backstage green room photograph, three minutes before an AI summit panel. Sam Altman, Demis Hassabis, Dario Amodei, Elon Musk, and Jensen Huang are all present and immediately recognizable. The room is crammed, blah, blah, blah. This is Grok Imagine. The only person that Grok Imagine can depict is Elon Musk. I think it's really, really funny. Sam Altman doesn't look like Sam Altman. Dario Amodei is nowhere close, it's a bald dude, and Grok does not know who Dario Amodei is. Jensen
Huang is just some random Asian guy, he looks nothing like Jensen. No one here looks like anyone, besides Elon Musk. I think it's really, really funny, this is just absolutely hilarious. This is GPT Image 1.5. So, okay, you can see Elon is kind of Elon, Jensen is kind of Jensen, with the leather coat. Sorry, Grok Imagine doesn't even know that Jensen has a leather coat. Get the fuck out of here, man, come on. But 1.5 has Jensen with a leather coat as well. You can kind of see Demis Hassabis with the glasses, etc. Dario, no, no other model knows about Dario. And this is almost like a fucking photograph of how this would look. This is GPT Image 2. Everyone here is spot on, including Dario. Folks, I don't think this detail I'm about to share with you is important, but I tried similar images on Nano Banana Pro. The models don't know what Dario Amodei looks like. I think this was before Claude exploded. The models just don't have him in the training dataset. But here, it's obviously been trained on him. But also, look at all the composition here. You can see Jensen's tag says Jensen Huang. Dario Amodei's tag says Dario. Elon Musk's tag says Elon Musk. You can kind of see the artifacts here, right, a little bit, if you zoom in. But that's also kind of what actual pictures would be like at low resolution. All the tags are perfect. 
You can see that, like the prompt said, at the AI summit, the lanyards have AI Summit on them, right? This model is just something else. It gets the reflections right, too, in the mirror, the reflection of Elon. Yeah, yeah, yeah. Wait, is it though? Shouldn't he be looking toward a different side in the reflection, if he's looking left? Oh, he found something. Yeah, we should be seeing his face. But I think that's absolutely insane. Dario is writing some stuff on the whiteboard. I think this was part of the prompt as well. As you can see, Dario is writing stuff here. This is not Dario. So obviously, on personalities, it's absolutely crazy. I want to do one more comparison, Nano Banana, and then we'll skip ahead. For some reason Nano Banana didn't load fully here. Okay, so it is getting reflections wrong. I think that Peter also ran it on medium in some places and some on high thinking. I think on high thinking it looks really good. Let's talk about infographics. This model really slaps on infographics. I think Nano Banana Pro is still great. So I had a few examples, but basically I want to show you, let's show you this. This is the evolution of human language families. This is 1.5, I don't have 2 here. Not everything loaded. Let's find something with 2. Okay, this one. This is the NOAA disaster cast frequency analysis. This is an infographic generated by GPT Image 2. The text is absolutely perfect. The framing is perfect. You know, the only way to show you the difference between this and something else is to show you the previous one. This is 1.5. It's kind of there, but it's not as dense information-wise, right? And the text is okay on 1.5, but it's not as dense. 1.5 was six months ago. And this is Grok Imagine. It's not too bad. It's not bad, actually. Yeah. I guess they had good data on this. Yeah, they probably have good data on infographics. 
The thing about this is that you need the data to be correct. If you are generating something like this from data, you absolutely care that the thing represents the data. Probably it's a professional thing that you are generating. I suppose it is. I'm not claiming anything. I'm just saying that this thing has very distinct characteristics it needs in order to be successful. But yeah, all the text is so perfect. Bro, I've seen it generate the HTML code of an SVG inside the image, and people took it and then rendered it, and it actually was... That thing is crazy. It can write code in the image. I want to show this, because this is crazy. So, yeah, I think it's this, let me see if I can show this to you guys. I think this is what you're referring to. Somebody tested friend of the pod Simon Willison's very famous SVG pelican experiment, where you ask the model to generate SVGs, and the coding models just write the SVGs. This model generated a screenshot of a Mac with a VS Code IDE. Yeah, it's VS Code. Let me just say, this is not a screenshot and this is not real code being written. This is pixels. Yeah, I mean, it took me a while to understand these are pixels. It generates the pixels. The model generated a screenshot of a Mac. In that screenshot there's a VS Code editor. Inside the VS Code editor there's an SVG. And this SVG, taken through a text OCR model that turns the pixels into actual text, shows you almost a pelican riding a bike in SVG. This renders. So, LDJ, I think this supports what you're saying: this could be 5.5, with the presumption that this is an image generated, exactly, with running code in the picture. Not only running code, dude, it's more than running code. Running code is fine. This is running SVG code that actually depicts something. It's like another layer on top of just running code. Like,
you can generate an HTML page generating an SVG. There's a reason why this is a benchmark that Simon runs: it's hard to generate SVGs. These models don't see how it looks. This model generates a screenshot of code that actually renders into an SVG. It kind of looks like a pelican. This is absolutely mind-blowing and insane. There are multiple levels of geekery going on here. Yeah, I mean, what's funny about this, too, is that the pelican test is a common test for testing the abilities of models, and this is a pelican that's better than the ones from some of the models from just a couple of years ago. This is a better pelican, yes. Yeah, so I think one more thing, too. I heard Yam wants to say something, too. But one thing before we get off of GPT Image 2 is, I think we should definitely show the front-end UI that it's generated. Yes, after this, super quick. Riley Goodside, one of the more incredible people who test very, very difficult things, basically said: GPT Image 2, generate a game die, but instead of numbers it has working QR codes for the Wikipedia articles of the actual numbers. So I'll say this again. What the fuck? I'm literally saying what the fuck as well. What the bleep, because we don't want YouTube to censor us. Every side of this cube has a QR code linking to a Wikipedia article about that number. So if you scan it, I have a QR code scanner here, I'm going to actually do this right now. I'm going to do it like this and then see. Yes, all of them work. There are functional QR codes for every number. The model needs to understand what the hell you are talking about. It's just mind-blowing. Menus are mind-blowing. This one from Claire, whoa, a selfie turns into the whole, you know, I don't even know what this is. Ryan, can you describe this? I know you're friends with Claire. What are we seeing here? I think everybody's like... All right, I've also got sisters, so I can tell you about this. This
is basically a color palette for your skin. So you take a picture of yourself and you say, show me my color palette. And apparently all of us who are married should do this for Mother's Day. And then it will generate this amazing kind of style color palette, or, I think it's called your, it's a girl thing. It's cool. So do this for your girls. Feels like we should bring at least one girl here on the panel as well to tell us about this. Shout out to Claire, maybe Claire can come. But this is incredible from one picture. Folks, we can go on and on. Me and Peter, we just sat for an hour and a half, and I invite you to watch, I'll give you a link, and we just got excited. The last thing that we'll do before we move on, and LDJ, you're absolutely right, I feel like Claude saying this, but you're absolutely right: GPT Image 2 is incredible at UI interfaces. Pair it with the fact that Codex is not incredible at UI interfaces. What people are now doing is this thing. This is the new alpha. You ask GPT Image to generate a beautiful UI, and then you send it to Codex to implement. Instead of asking Codex to come up with stuff, this is now the creative brain, and then you just send it to Codex: Codex, implement this. And this is beautiful. Here's an example. Yeah, I know you have comments, but LDJ sent this to us, so if you want to comment on this, feel free. This is a UI that GPT Image 2 created, and I think Codex coded it, I'm pretty sure. I'm just saying you might be getting it in Codex in like five minutes or something. Oh, yeah. I'm just saying, yeah. I sent another one too, and this is kind of even deeper into the meta. This is an image generated with GPT Image. Then Seedance 2 animates some of the images. And then Claude Design creates the whole website in actual code. That looks super cool. I don't know how usable this website is, honestly. 
Like, Ryan, I don't know if you're going to rebuild your own website with animated people, but for some stuff this looks incredible. And the thing that we must highlight is also that this looks nothing like any website that Claude or Opus or Codex will generate for you if you just ask. This has to come from, like, a visual part of the brain. I don't know if we're doing this metaphor with the left part being the creative and the right part being the analytical, but let's say we do this metaphor for a second. Codex and the coding agents and everything, those are the analytical part. They can write. There are some attempts at getting excitement and creativity in there. We know it's not that great. It's better in Claude than in Codex. Ryan, go ahead. We have crossed a new threshold. So this feels very much like what we experienced in December, January, when everybody realized how good Opus 4.6 was. Now we're experiencing this with design. So with the entrance of Claude Design plus Image 2 from OpenAI, we are now in a spot where you can really begin to get professional design out of AI. Now, what I do personally is I pay Brett from Designjoy to do an initial layout, right? So all of my web UI, all of my brand is done by a human, but now I can fully hand it off to AI, right? So as soon as you have your design system locked in... Sorry, Brett. I'm sorry. Well, it's just the truth. Like, this is where we're at. Now, I think any serious brand needs to pay a human to build out their brand initially and their UI, but pretty much after that, you can do this with AI.
It's exciting. So just for completeness' sake, LDJ, do you want to explain whatever you sent in chat, what's actually going on here? Because I think it's confusing folks as well, what we're seeing here. Okay, yeah. So GPT Image generates a mock-up of a website, at least in the image you were showing above, so if you go back up... Yeah, or this video, I mean. Yeah. Yeah, actually, if you scroll a little bit more up, they mention it. Yeah, so then they pass the image that GPT Image 2 created. They give that image to Codex, and they're like, "Hey, can you turn this image that was created into an actual working website?" And then it goes from there and makes the website. And then in the other example I gave you, they do the same thing basically, except they also send those images into Seedance 2 to animate some of those images into videos, and then pass those images and videos into Claude Design and ask Claude Design to make that into a website. Just absolutely mind-blowing. I got scared for a second. I thought those were actual 3D animations, and it scared me.
That we'd reached the threshold into something crazier. Yeah, but listen, why does it matter? If they're just pixels and images, they don't have to be 3D, right? Well, that would have been indicative of a much smarter model that can build something of that quality. We have this one, this one. Yeah, Rick the last myth sent us this. Thank you, Rick. This is again Riley Goodside: "GPT Image 2, generate a photo of a cake decorated with SVG that, when transcribed to a file, renders another cake." So again, this is a photo of a cake that has SVG glazed on it with some icing, and this SVG actually renders. Look at the cake itself: it looks like food, it looks good. It also costs $24.99. White cake with buttercream icing, net weight 43 ounces. Is 43 ounces two pounds and 11 ounces? I'm not sure if that's correct. 1.2 kilograms, that kind of tracks with the size. So I'm also looking at, you know, the extra stuff. It's insane. It is absolutely insane that we're here, that all of us started our AI journey with Stable Diffusion, which could barely generate stuff. We're now talking about, like, actual intelligence. Wolfram, let's comment and let's move on, we have a bunch of stuff. There's no way. This was my highlight of the week, actually. So the first thing I did, I still have some old prompts from the DALL-E era, from 2022, so I did the same render with this, and it didn't blow my mind. It was nicer, of course, but it wasn't that much. But I changed a bit in the prompt, and I suddenly got a complete image with text on it that interpreted the image for the character from a role-playing game. So it had a character description. It got all the information. Even the animals with it got their own thing. So the thing is, it's not just an image model. We have intelligence in the images that we didn't have before. Like, Nano Banana was a big step up above what we had before.
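The cake label read out above quotes a net weight of 43 ounces alongside "two pounds and 11 ounces" and 1.2 kilograms, and the host wonders whether those agree. A quick check of the arithmetic, as a small Python sketch; the helper names are illustrative, and the conversion assumes the standard avoirdupois ounce of 28.349523125 g:

```python
OUNCES_PER_POUND = 16
GRAMS_PER_OUNCE = 28.349523125  # avoirdupois ounce


def ounces_to_pounds_and_ounces(total_oz):
    """Split a weight in ounces into whole pounds plus leftover ounces."""
    return divmod(total_oz, OUNCES_PER_POUND)


def ounces_to_kilograms(total_oz):
    """Convert a weight in ounces to kilograms."""
    return total_oz * GRAMS_PER_OUNCE / 1000


pounds, ounces = ounces_to_pounds_and_ounces(43)
kilograms = ounces_to_kilograms(43)

print(pounds, ounces)       # 2 11
print(round(kilograms, 2))  # 1.22
```

So the three figures on the generated label are mutually consistent: 43 oz is 2 lb 11 oz, which is about 1.2 kg.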
I think this is an even bigger step where you have, I don't know if they are using agentic stuff or if it's just a very smart omni-model, whatever, it is like magic. And it is so mind-blowing to see what you can do now outside of just good-looking images. The intelligence that is baked into the images, that is, wow. I think we will be using this for a long time to come. Oh, okay, Alex popped out, but I'm here. What I'm going to mention here is a disclaimer worth mentioning: since it does seem to be a reasoning model, it does have different reasoning settings. Like the most recent image that Alex had up, they mentioned that they were specifically using Pro, and I believe the heavy reasoning is only available to Pro users, and Plus users and below, I think, only have access to, like, a medium or high reasoning or lower. So if people are trying to replicate some of these things and don't notice the same quality, then do keep in mind there is a different limit to what you can get with different plans. Yes. So speaking of plans, while you find that, Alex, I mean, aren't we all now in a place where, at minimum, except for Nisten, I know, if you want to do serious work as a professional, you're going to be paying $200 a month to Anthropic and you're going to be paying $200 a month to OpenAI? Like, I feel like you have to do that now. And the only reason why is because they're so heavily subsidizing the tokens. Like, if I was to go off my OpenAI plan and I had R2 run off of the API, it would be $3,000 a month, but I can basically run R2 for $200 a month because I'm on the Pro plan. Is everybody else doing this or what? Some people do API. And for some people, I think it's very important to run Opus only in their, like, claws and Hermes agents, for example. And hopefully today, we still don't have any news from OpenAI, by the way. We're still live, waiting.
But hopefully today, some of the comments that people have about, you know, agentic stuff that Opus is much better at, they will address with this new model. But I agree with you, dude. Everybody should be running these models. And the heavy subsidy thing is very scary, because when Anthropic yanked the heavy subsidization of Opus out of the OpenClaw ecosystem, you could kind of see the thing go down. And then also all these companies are now copying all these features into Codex, into Claude Code, into Cowork, into a bunch of features. So, you know, there's a diffusion of where things are landing. Folks, I think it's enough. By the way, they allowed it back, which is a complete mess. Allowed what back, Nisten? We have to... Allowed the use of OpenClaw back in Claude Max. Yeah, let's mention this. They reverted again. But wasn't that just for, like, the CLI or something? No, they just allowed it all again. What? I think it's almost like... There are mixed signals from different developers at Anthropic that are saying this. But they have a Twitter account, everybody. Anthropic has a dev Twitter account, just so you know. Yeah. So... You can use it in Hermes. Just use it in Hermes. Folks, we're talking over each other. What was allowed again is, as far as I saw, the CLI usage. So if you go to the OpenClaw documentation now, you can see that if you do want to run Anthropic via your Max account, the CLI usage is fine. So what this means is that if you have Claude Code installed and logged in, and you can run Claude and send a prompt, you can say, "Hey, I'm OpenClaw, the assistant," and Claude will not block you anymore. They had this thing, and now that's fine. I don't believe that everything is fully back like it was, but I definitely should test, because I know that mine fails if I don't have extra usage turned on.
We must continue, because I do want to talk about two features of Codex. I'm not going to bore you with the details. No matter what I tried, I can't share the screen with you; I need to do a full restart and I don't have time. But I think the two features of Codex are the more important things, so let's see if I can share this with you. Anybody use Codex here? Ryan, I know you moved on to Devin, but who's, like... I use Codex a little bit, but I'm mostly Devin now. I've been using Codex. What's your take on Codex, LDJ? Well, since it's my main thing, I could only really say I love it. But yeah, especially if you're doing one-shot or few-shot, it's definitely worse at front end than Claude. But when you have a specific vision in mind... And I have been kind of getting more into that flow of actually enjoying that design process and kind of creating my own design and giving those instructions to an agent like Codex. It's actually quite good at following those instructions over long time horizons, just working really hard for very long on implementing very specific specifications I give it. And yeah, I haven't really been building things that crazy with it, though, mostly just small CRUD applications that are useful for myself, various tools like training calculators. So I think that the thing about Codex is specifically how much work is being put into Codex. Last week we told you that OpenAI famously decided to consolidate, cut some side projects out and consolidate things into one super app. And there are early signs that Codex is that super app that they're going to focus on. And a lot of the promotions within OpenAI, they're focused on Codex, and Codex is getting a lot of new features. So the feature that I really wanted to show you, I'm just going to show you videos of it because you have to see it, is Codex computer use.
Codex last week released a bunch of examples, a bunch of updates, and we told you about this at the end of the stream. We got very, very excited, but this was the end of the stream, so I wasn't able to test it out fully. Since then I have been able to test it out. Codex now has, at least on the Mac, computer use that beats anything else that I've seen. Not from the perspective of, hey, this is better at computer use by clicking buttons, identifying things. No, just from the UI of it. I want to see if... "Codex can use your Mac now." This is an example. You guys see this cursor, the little cursor that jumps and clicks and moves and plays tic-tac-toe? Well, this is a quick demo video of all of the features, but the thing I want to highlight is this cursor. This little thing is running on a background thread somehow. This is not your Mac cursor that's getting taken over. I have no idea, and I think the industry still has no idea, how they achieved this. OpenAI bought Software Applications Incorporated, and those folks worked at Apple before and worked on different things like workflows. They're running something with accessibility. I think they're the only ones. Maybe other labs will catch up. The most important thing there is that this happens while you are able to control your computer yourself. With most computer use, what happens? If you use Claude Code, for example, and say, "Hey, control my computer," which works, it will just use your mouse cursor and open windows in front of you. And you aren't able to work. You're basically sitting like this while Claude is using your computer. I ran this thing and asked it to go to tldraw and just draw things in the UI, and it was able to do it in a background window that I didn't even see. It's so much more powerful. Once you go back to any other computer use, it's useless. It's computer useless.
Codex computer use is so good that any other computer use is absolutely useless. I don't know if folks have tried it, but I absolutely recommend a full try of this thing. Comments, folks? Anybody tried this already? Anyone play with this? I think I know how they did it, if I have to guess. In older Linux, before, for some reason, everyone switched to Wayland, you could have two cursors on Linux with X11 or Xorg. I just found this during university, people just trolling each other. And you could move cursors between laptops. And there is a port on Mac, which I do use, which is called XQuartz, and that lets you stream Linux applications to your Mac. And so I think there's not a whole lot of work there to get that other mouse going that way. This is just my guess here, because this has been a fun Linux trick for a while that people don't realize. So it's got to be black magic here. I don't know if it's the same, but I definitely know that the little cursor that they have, they showed somebody building the animations. That's a layer on top that they're putting, and they're, like, faking clicks. I don't know. I want to highlight the thing. I think I'm going to put up this repo. I noticed, and I don't know if you guys noticed this as well, that browser use, for example, sometimes needs actual computer use. So you know Claude can control the browser. Playwright is a thing. And then also a native DevTools API, MCP, is also a great thing to control browsers. All of them are great at clicking things within the browser, within the website, but then something like a download, for example, or a canvas interaction, these things they cannot do. So the DevTools thing cannot drag or do canvas drawings, etc. So computer use paired with browser use, I think, is the full picture.
I've been working on a skill that I will publish later, if you guys are interested, about how to combine these things, how to do a hybrid computer use slash web use. And I think that that blows every other browser thing out of the water. LDJ, go ahead. So nearly a full two years ago, in June 2024, OpenAI actually acquired a company called Multi. And this company was trying to do precisely things like that. They were saying their goal is to make the experience of using a computer inherently multiplayer, basically a multiplayer experience where you could have multiple cursors on a screen. And ever since OpenAI acquired that company in June of 2024, I've been waiting for them to release something like this, where you have another cursor on your screen that ChatGPT controls for you. Now we finally have it. It's a little bit underwhelming because it was like a slow boil up to this point, but I'm glad it's here. The video I'm showing right now is from VB, a friend of the pod on OpenAI's Codex team. It also shows that not only can you have computer use, you can have multiple computer uses. The way they did this, because it's not your cursor and it's not your windows, you can have sub-agents with Codex perform actions within different windows. Again, sub-agents plus computer use, they will all go and click different things in there. I think that's just, like, how insane is this? How absolutely insane this is. So he has an X window on the bottom left that he's typing things in. The confetti, do you guys see the confetti thing? So he opens Raycast in another one and types "confetti." And then he types some other stuff in Notes. And all these sub-agents are doing things in parallel in all these windows. I've been waiting for this from the computer operating system manufacturers. They could have built this already, like Apple or Microsoft. They are in AI. Okay, Apple not so much.
But if they could put this in their systems, like a multi-user system where the AI is another user working with you, on its own desktop or on your own desktop, sharing between those. That is stuff the operating system could provide. And the technology is there. Now somebody just has to be brave enough to actually build this thing. And it looks like the labs are doing it now. They are building everything. Yeah, including the hardware. OpenAI said that over the next six to 12 months they're going to be announcing their first hardware product. This is after they said they're no longer focusing on side quests. The other thing in Codex that you... Ryan, you had a comment? I see you getting excited about something. I was just laughing about your comment about staying focused. Yes, I'm very happy that GPT Image is not a side quest. I'm very happy that they're doubling down on images, because nothing that Anthropic does... Anthropic famously focuses on text generation, right? There's no image generation, no voice, nothing. I'm very happy to see OpenAI staying and leading the pack and fighting the good fight as well, because we all benefit from this. Nano Banana... I don't remember another AI technology that dominated as long as Nano Banana did. Just absolute domination until yesterday or whatever, when Images got released two days ago. The other thing in Codex that you guys absolutely must know about, I really wanted to show you all this, but the technology gods are not with me today. This, it's called Chronicle.
In Codex, this is a research preview that uses what's on your screen. Codex, behind the scenes, takes pictures of everything that you did, every, I think, 10 seconds or so, and then it adds it to context. So if you ask, "Hey, what am I working on?" or "What was I working on an hour ago?", Codex knows. Codex just knows. It fills in the missing context for what you're saying, and it is incredible. It's hard to explain how incredible this is, but folks who've used something like Recall, Rewind AI, or different things... you know, Rewind got bought out by Meta and shut down, so people can't use Rewind anymore. If all your screen is recorded all the time, it's an incredible addition to the context of your model, right? Famously, Rewind's tagline was, like, an AI system that knows everything you've seen, read, or heard. This doesn't transcribe everything you hear, but Codex now has screenshots of everything you did outside of Codex. It's kind of awesome, honestly. Here's an example of why it's awesome. I have Granola running on my meetings. So when I do meetings, I have Granola running behind the scenes. Granola prints out the output of the meeting automatically. Codex sees that. So technically Codex now, without extra steps, has insight into every meeting that I had throughout the day. I can ask Codex, "Hey, you know, when I met with Wolfram, what did we talk about?" and it's just mind-blowing. Now, this does mean that you're enabling screenshots of everything that you see on your Mac and potentially sending this to OpenAI, right? The images are stored locally, but at the time of processing, obviously, they're sent to the model. So that's not for everyone. I think in addition to this, there was news about Meta now adding stuff like this for all their AI engineers to measure their productivity. So that's, you know, the spy stuff. You know, the conspiracy-minded folks may say, hey, this is a thing too far.
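The idea described above (a capture roughly every 10 seconds, kept around so that "what was I working on an hour ago?" can be answered) can be pictured as a rolling, time-indexed log. The sketch below is only an illustration of that idea and is not how Codex or Chronicle is implemented; `Snapshot`, `ScreenLog`, and the text summaries standing in for screenshots are all hypothetical names and simplifications.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Snapshot:
    timestamp: float  # seconds since the start of the session
    summary: str      # stand-in for whatever is extracted from a screenshot


class ScreenLog:
    """Rolling log of snapshots that can be queried by time."""

    def __init__(self, max_age_s: float = 24 * 3600):
        self.max_age_s = max_age_s
        self._snapshots = deque()

    def add(self, snapshot: Snapshot) -> None:
        self._snapshots.append(snapshot)
        # Drop anything older than the retention window.
        cutoff = snapshot.timestamp - self.max_age_s
        while self._snapshots[0].timestamp < cutoff:
            self._snapshots.popleft()

    def around(self, when: float, window_s: float = 300) -> list:
        """Return summaries captured within window_s seconds of `when`."""
        return [s.summary for s in self._snapshots
                if abs(s.timestamp - when) <= window_s]


# Simulate two hours of captures, one every 10 seconds.
log = ScreenLog()
now = 2 * 3600.0
for t in range(0, int(now) + 1, 10):
    activity = "editing show notes" if t < 3600 + 600 else "reviewing benchmarks"
    log.add(Snapshot(float(t), activity))

# "What was I working on an hour ago?"
context = sorted(set(log.around(now - 3600)))
print(context)  # ['editing show notes']
```

In a real system the retrieved entries would be added to the model's context before answering. The privacy trade-off raised in the discussion is about where the captures are stored and where they are sent for processing.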
But I think for usability it's great. It wasn't to measure their productivity explicitly, no, but for training models on what they are doing, basically. Oh, for Meta, you mean? Yeah, that was what Meta was doing, or is doing. Yeah. They need the training data. Anybody used Chronicle yet? I really wish I could show you, though. It sounds cool. Like I said, unfortunately, I rolled off Codex, so I would like to try it. But it sounds good. I wonder how much signal to noise there is, though. It was pretty good. I can say, you know, I can test this and maybe show you, but I can definitely say that I asked it, "Hey, what was I working on an hour ago?" and it was able to figure it out. It was able to tell me, like, "Hey, you were working on this and this and this." Nisten, what are your thoughts on an always-on, computer-screenshot-taking Codex? I mean, just look at the last few court cases that involved ChatGPT documents, with CEOs saying, oh, they had deleted everything and they had zero-data-retention policies, and it all showed up in court. So you have to think of it this way: the agreements do not mean anything and your data is always recorded. It doesn't matter, and it can come back to you or to your customers. So I would not use this unless it's running at home. I do want it, but I would not trust this thing. LDJ? Yeah, I'm going to take the opposite stance here. I'm definitely going to use it. Apple, all of these companies... I mean, I shouldn't give in to it in this way. It's probably not the best argument, but they already have a ton of my data. And I'm not necessarily doing super compromising stuff here. I'm probably going to keep the "improve the model for everyone" thing off. Yeah, that doesn't guarantee that they're not going to store or train on my data. But I think the benefits and usefulness here are such that it's worth it. For things like contracts and reviewing legal work, I'm still going to just go to private models, because they're good enough for that.
And it's just worth it to do so. I definitely think there's a path towards building something like this fully locally, right? We're talking about, like, Sonnet at home. The new Qwen that we just covered is multimodal, so it can understand this. Wolfram? I actually had something like that, a screen-watching assistant. I was using Florence from Microsoft, the image recognition model, for that, and it would turn it into text. But yeah, it's been over a year, so the technology just wasn't there then. But I think with the newer models that are smarter and faster, that would be something to run locally, for sure, because you don't want to send all of your data. In a company, you probably can't or are not allowed to do this. But if it's all local and it's just an index and it goes into the knowledge base that your AI assistant can refer to, that makes a lot of sense. So yeah, definitely, a screen-watching assistant is one of the big unlocks. All righty. I think we have a few more things to cover before. I do want to talk about CrabTrap. We mentioned this in the beginning. Ryan, I think you saw this. The CEO of Brex joins the litany of CEOs who find newfound time to pair with Codex and actually build things. He built CrabTrap, and he says, basically, OpenClaw is not great for enterprises. In fact, I will say, it's banned at CoreWeave, so none of us can use OpenClaw. And the reason why, and Jensen mentioned this on stage at GTC, is that it has access to the sensitive data within the enterprise, and it can communicate externally, and it can be prompt injected. So, not great. But he's like, hey, we use OpenClaw at Brex internally. This is a great admission from the company. He says, we started deploying agents internally at Brex. We couldn't stop thinking about this question. Let me actually show you this. Agents work. Nobody wants to give them real credentials. Instead of waiting for a solution, we decided to try a novel approach:
using LLMs to judge the network traffic of an AI agent. So they built CrabTrap. An open-source proxy intercepts every outbound request and blocks risky activity using LLMs. Like we told you before, the privacy filter from OpenAI, that's a great tool to add to this arsenal. So this is an open-source proxy that you proxy all the network traffic through. I think it supports OpenAI, but I'm not sure if it supports Anthropic. And then you just basically proxy everything, and it catches everything that your agent sends, decrypts it, applies the static rules, and then uses an LLM as a judge. So this is kind of expensive, right? You're running another LLM to review all the other LLMs, so you have to consider things like context windows. But given that a leak of your private credentials for an enterprise can cost significantly more, this is maybe worth it. LDJ, comments? Ryan Carson, comments about whether or not you're going to run CrabTrap on your agents? I've heard it's especially effective if you're using as the judge, let's say, Claude, and the main model you're using is also Claude, but you tell the judge, "Hey, the model that you're monitoring is Grok," or something like that. Apparently it's actually better at catching things, because it's extra critical about it. Oh, nice. This is absolutely going to be a thing. Intelligence is on demand now. So what company would not want intelligence monitoring all their traffic to make sure that their employees are not doing bad things? Absolutely, this is going to happen. I just wanted to say that I want to change my pick of the week to CrabTrap. I haven't looked at it in detail, but every week I'm doing a deep research, my agent is doing it, looking at how to secure agents.
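The flow described above (intercept each outbound request, apply static rules first, then fall back to an LLM acting as a judge) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CrabTrap's actual code or API: the host lists, the `review` function, and the `keyword_judge` stand-in are all hypothetical, and a real deployment would call an actual model where the stand-in is used.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class OutboundRequest:
    method: str
    url: str
    body: str = ""


# Hypothetical static rules: cheap checks that run before any model call.
ALLOWED_HOSTS = {"api.openai.com"}
BLOCKED_HOSTS = {"pastebin.com"}


def static_verdict(req: OutboundRequest) -> Optional[str]:
    """Return 'allow' or 'block' if a static rule matches, else None."""
    host = urlparse(req.url).hostname or ""
    if host in BLOCKED_HOSTS:
        return "block"
    if host in ALLOWED_HOSTS:
        return "allow"
    return None


def review(req: OutboundRequest, judge: Callable[[str], str], policy: str) -> str:
    """Static rules first; otherwise ask the judge. Fail closed on odd answers."""
    verdict = static_verdict(req)
    if verdict is not None:
        return verdict
    prompt = (
        f"Policy:\n{policy}\n\n"
        f"Request:\n{req.method} {req.url}\n{req.body}\n\n"
        "Answer ALLOW or BLOCK."
    )
    answer = judge(prompt).strip().upper()
    return "allow" if answer.startswith("ALLOW") else "block"


def keyword_judge(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return "BLOCK" if "api_key=" in prompt else "ALLOW"


POLICY = "Never send credentials to external hosts."

leak = OutboundRequest("POST", "https://example.org/upload", "api_key=abc123")
docs = OutboundRequest("GET", "https://example.org/docs")
print(review(leak, keyword_judge, POLICY))  # block
print(review(docs, keyword_judge, POLICY))  # allow
```

Running the static rules first keeps the judge out of the path for traffic that is obviously fine or obviously forbidden, which matters given the cost concern raised in the discussion. The natural-language policy string is where rules like "this looks dangerous" would go.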
Because the more I use my agents, the more access I give them, the more I'm concerned about this. So basically, a security solution running in the background, observing what is happening and being able to intervene, this is what I've been looking for all the time. I looked at all the guardrail implementations, but CrabTrap, I will definitely... this is my weekend project. I will implement this, and my agent is already on it. So definitely, I want this, and I think we all need something to make sure that our agents are not doing stuff they shouldn't. "Hey everyone, this is Pedro from Brex. So this is the demo." We're not going to listen to Pedro from Brex, but he's, I think, the CEO. There's four minutes of things, but basically, not only does it look at every request that your agent makes, you can also define with natural language, "Hey, this does not look dangerous," or "This is something that looks dangerous." You can add those rules. I think malleability is very important. So CrabTrap... somebody mentioned it's Okta for agents, and I love this. Okta for agents. So I'm definitely going to implement this for my agents as well and go forward there. We have breaking news. AI breaking news coming at you only on ThursdAI. All right. Now, finally, we have breaking news, folks. OpenAI's newest model, GPT-5.5, just launched. They call it the new class of intelligence for real work. We're not going to take a look at the video, because we want to go directly into the evals and show you that on Terminal-Bench, GPT-5.5 gets 82%, jumping from 75% for GPT-5.4, beating every other model that they have here.
GPT-5.5 Pro also launched, but they didn't test it for some reason on Terminal-Bench. We have Expert-SWE, an internal benchmark, jumping to 73 from, say, 68. OSWorld-Verified is a little bit of a bump. GDPval, which we specifically love here, this is the state-of-the-art model now: it beats Claude Opus 4.7, beats Gemini 3.1. Oh, we have Yam in the car joining us for the breaking news. BrowseComp is almost state-of-the-art. And let's go, can we test this out? But yeah, what else? FrontierMath is incredible, 35%. Model capabilities. OpenAI is building the global infrastructure. Over the past year, we've seen AI dramatically accelerate software engineering. With 5.5 in Codex and ChatGPT, the same transformation is beginning to extend into scientific research and the broader work people do on computers. Across these domains, GPT-5.5 is not just more intelligent, it is more efficient in how it works through problems, often reaching higher-quality outputs with fewer tokens and fewer retries. This is a trend that we showed you before. Not only are models' capabilities blowing up, they also do it with fewer tokens. So let's take a look here. The Artificial Analysis index. I love the fact that the big labs show Artificial Analysis here. They show that GPT-5.5, which is the purple here, uses significantly fewer output tokens on the Artificial Analysis Intelligence Index. Yeah, this is great. It's absolutely great. We have folks in the comments freaking out as well. Somebody says it will work for 30, 60, 90 minutes or more, as people are already using it. So they're saying this is our strongest agentic coding model to date. On Terminal-Bench 2, which tests complex command-line workflows, it gets state-of-the-art accuracy of 82.7%. This is now just state-of-the-art on Terminal-Bench 2.
LDJ, go ahead. Yeah, so I don't have access to it in ChatGPT or Codex yet, I've been refreshing. But overall, in the sheet of different benchmarks that they showed, and earlier in the blog post, I'm not seeing a single benchmark where Opus 4.7 is beating 5.5, which I think is pretty impressive. And yeah, it's maybe partially... I have access, let's go! Oh, there we go, there we go. Extra high and high. Let's say speed is fast. Okay, I'm going to use this speed. Nisten, let's do the Mars thing. By the way, we had Opus 4.7 last week on Terminal-Bench 2.0 at 69.4%, so that is a huge jump here. I'm going to do GPT-5.5 on fast mode with high reasoning. Let's take a look. Oh, folks are saying that they had a run go for eight hours. What's up, Peter Gostev from Arena, who gets access to early models. All right, we're going to send this Mars instrumentation. Let's keep reading here. For Terminal-Bench 2, not only does this model beat the scores, as you guys can see, it uses significantly fewer tokens, almost half the tokens. That's incredible. So let's look at this one. So this is medium, this is low reasoning effort. Okay. The low reasoning effort gets a little bit lower score but uses, like, one half of the tokens. And then for medium reasoning effort, you can see that GPT-5.5 gets a score of 75% on Terminal-Bench, and medium reasoning effort for 5.4 gets 63. So almost 10% difference on medium thinking with significantly fewer tokens, 7,000 versus 9,000. That's very important as well, right? Wolfram, we talked about how important it is how many tokens you get as well. How many tokens you use. I'm so excited. I can speak.
The price of the intelligence you are getting, that is super important. And we found out also that if a model is thinking longer, it can actually be detrimental on the agentic benchmarks. So finding a good way... that is also probably why the score is higher now, because it decides it doesn't have to think so much, but act and then correct instead of overthinking. Let's build a simple website. "Build me a..." Okay, you guys do that, but I am kind of blown away by this design thing. Nisten, not now, not now. Nisten, you should have joined me a week ago. Now the big news is GPT-5.5. You're killing me. Yes, Claude Design is amazing. Yes, I agree, but Nisten, I have to edit. Yes, it's incredible. Please. What else do we have here? So Expert-SWE as well, you can see that it actually uses fewer tokens. They're showing an example here of the space mission app with 3D things. It shows the price comparison between GPT-5.5... I'm going to build a website that shows a price comparison. I'm hoping this is going to go and actually look up the prices. GPT-5.5, Opus 4.7. Where's Gemini, by the way? Gemini 3.1. Gemini is the last one, right? 3.1. In 3D somehow, with three.js. So I asked it to go and look up the scores. I think the new meta that we're now waiting for also is generating things with images. So let's see if it builds the Mars thing, Nisten. It still thinks a lot. So I have an answer, by the way, for the pricing. Oh, okay, tell us. I just noticed I ran it with 5.4, I wanted to run it with 5.5. Yeah, go ahead, about pricing. Sure. So it looks like for regular GPT-5.5, it's priced at $5 per 1 million input tokens and $30 per 1 million output tokens. I think 5.4 was $25 per 1 million output tokens. So yeah, that's a little bit more expensive, but not insanely much. And then for 5.5 Pro, it's the usual cost of the Pro models, it seems: $30 per 1 million input tokens, $180 per 1 million output tokens. So still a lot, but the other Pro models were really a lot like that too.
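To see how the per-million-token prices quoted above interact with the token-efficiency numbers from the Terminal-Bench chart, here is a small worked sketch. The prices are the ones read out on the show ($5 / $30 for GPT-5.5, $30 / $180 for 5.5 Pro, and "I think" $25 per million output tokens for 5.4). The dictionary keys are just labels, not API model identifiers, and treating the chart's "7,000 versus 9,000" as output tokens per task is an assumption made purely for illustration.

```python
# Prices in USD per 1 million tokens, as quoted in the discussion.
PRICE_PER_MILLION_USD = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "gpt-5.5-pro": {"input": 30.00, "output": 180.00},
}


def request_cost_usd(model, input_tokens, output_tokens):
    """Cost of one request given token counts and the table above."""
    price = PRICE_PER_MILLION_USD[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


def output_cost_usd(output_tokens, price_per_million):
    """Cost of the output tokens alone."""
    return output_tokens * price_per_million / 1_000_000


# A hypothetical request: 100k input tokens, 20k output tokens.
print(round(request_cost_usd("gpt-5.5", 100_000, 20_000), 2))      # 1.1
print(round(request_cost_usd("gpt-5.5-pro", 100_000, 20_000), 2))  # 6.6

# Assumed per-task output tokens: 7,000 for 5.5 at $30/M vs 9,000 for 5.4 at $25/M.
cost_55 = output_cost_usd(7_000, 30.00)
cost_54 = output_cost_usd(9_000, 25.00)
print(round(cost_55, 3), round(cost_54, 3))  # 0.21 0.225
```

Under those assumptions, the higher output price is roughly offset by the lower token count, which is the point being made about token efficiency mattering as much as the sticker price.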
Oh, look at this, we have friends from Twitter showing up on the actual page. Dan Shipper, founder of Every, described 5.5 as "the first coding model I've used that has serious conceptual clarity." And Pietro Schirano, our friend from MagicPath, saw a similar step change when 5.5 merged a branch with hundreds of front-end and refactor changes, that has also changed substantially, resolving the work in one shot in about 20 minutes. "It genuinely feels like I'm working with a higher intelligence, and there's almost a sense of respect." I gotta wonder if they fixed, like, OpenClaw or something. Did they mention OpenClaw here? OpenClaw. Nope, they didn't mention OpenClaw. Because we know that once Anthropic yanked the OpenClaw thing, everybody was waiting for OpenAI to catch up to Codex. Let's look at GDPval. GDPval tests agents' abilities to produce well-specified knowledge work across 44 occupations. GPT 5.5 scored 84%. 84%. Where's my 84? Here. And the industry expert baseline is here. So all these models beat it. Just a little bit above GPT 5.4, not a huge amount. But somebody says it's a good model, sir. Yes, okay. All models now are good models. OSWorld-Verified. Oh, this is a nicer model for tool use as well. Oh, looks like we're about to see the Mars generator, Nisten. And we can compare it to the previous one that we ran with Opus 4.7. I expect that to take some time back and forth. But yeah, I don't know. We're going to see. "The desktop view is alive. The default target was one kilometer under the exact rail length after rounding. So it's labeled the minimum orbit low," blah, blah. It really thinks about the math there, because we asked it to do all of the math. It's doing a mobile page view too? I think so. Interesting. That's the first time we've seen this, right? We have two examples that the model compares. Wait, show the picture that it took. It took a picture, right? A bigger one?
Okay, okay, all right. Oh, it's going to use my browser now. It asked me permission to use the browser because Codex is like that. I'll give it permission to use my browser and we'll see what's going on. So folks who are just tuning in, we have a bunch of folks here. We're testing GPT 5.5 from OpenAI that just dropped, and we're testing it in multiple ways. First of all, we're running this on a Mars rail calculator from Nisten that we usually test things with on the show. The thing we're noticing: it has a verified mobile PNG. Now it deletes it, but this model decided to test its own UI both on desktop browser and mobile browser. I've never seen this before. And it's done. Let's take a look. And it's running now. Now, I want to open this in the actual browser for a second. Nisten, I will let you verify the numbers if you want to, but there we go. So we have Mars. It should get the numbers right. Yeah. We have the target: minimum orbit eastward, no rotation assist, escape outward, escape no rotation assist, and then custom rail. I don't know. No, let's just do minimum orbit, eastward. Okay. Eastward?
And then we have, good, we have acceleration, time, et cetera, exit angle. What the hell is this? Oh, this is a different one. Okay. And then we hit launch and let's see. We can see the... oh, it launched it. Okay. Interesting launch, but I don't see... Yeah, maybe do the exit angle, I don't know, like 15 degrees or something. Yeah, just three, that's fine. Okay. It's not the best one that we've seen. Yeah, it's not. Everything else we showed has multiple views and angles. I did custom. Let me refresh this guy and let's start again. Sometimes you have to tell it to add orbit controls and other cinematic stuff. And it's not showing it in orbit either. No, but it did do Mars, that's pretty cool. Yeah. Can you rotate it? Can you drag and rotate? No, it's locked in place. So we didn't get any of the fancy stuff that we got from Opus or even the previous GPT. Meanwhile, though, I've been running this Codex. So on the Mars thing, we're saying it wasn't the best one, but maybe we need to specify a little bit better. Yeah, it just needs better prompting. Meanwhile, I asked it this prompt: build me a beautiful website that shows the price comparison with GPT 5.5, Opus 4.7, and Gemini 3.1 in 3D somehow with three.js. And it's still running, but it built me this. You guys want to see? "Frontier model price field, $4.50 to $1,150." There's kind of some text overlapping some other text, but it is a price comparison with blended input and output. And you can see the 3D kind of rotate. And you can see the prices. So GPT 5.5, $5 per input, $30 per output. It added the Artificial Analysis index. So this model is at 60, Opus 4.7 is 57.3, and Gemini is 57.2. It added Terminal-Bench evals. I didn't ask for evals. It added SWE-Bench Pro evals and added GDPval as well. Not only that, Codex asked the model to confirm. So you guys can see the little Codex window here. If you press, it took a screenshot and confirmed that it works.
"And the mobile pass exposed the usual absolute layout trap. The cards were starting too high." So this is now the second time that it looks at and verifies its own work on mobile. This is the first model I have seen that does this without prompting at all, which is very, very cool. What else do we have? Folks are saying Artificial Analysis posted their benchmark. Let's take a look at Artificial Analysis. Okay, we have the official score here and also Artificial Analysis. Let me just open this on your tab. I have some scores to compare with Mythos, by the way, when you're ready. Oh, with Mythos, let's go. Let me find Artificial Analysis. Here is their official thing, an independent analysis of 5.5. All right, from Artificial Analysis: GPT 5.5 takes OpenAI back to the clear number one in AI. OpenAI's new model tops the Artificial Analysis Intelligence Index by three points. It's not that much. Breaking a three-way tie with Anthropic and Google. OpenAI gave us pre-release access to test all five reasoning effort levels: extra high, high, medium, low, and non-reasoning. OpenAI is topping Terminal-Bench Hard, GDPval, and our newly hosted APEX-Agents Artificial Analysis eval. The model trails only other OpenAI models in CritPt and comes second to Gemini 3.1 Pro Preview on three additional evaluations. 20% more expensive to run our Intelligence Index. Per-token pricing was doubled from GPT 5.4, doubled the pricing to $5 and, like, $30 per 1 million output tokens. However, a 40% token reduction largely absorbs the hike. So this model is more expensive, but there's a 40% token use reduction on Artificial Analysis, resulting in a net 20% increase in the cost to run the Intelligence Index. Effort: a clear ladder for balancing intelligence and cost. GPT 5.5 scores the same as Claude Opus on the Intelligence Index at one quarter of the cost. Wow. All right, cool. What else? Number one on GDPval, and trailing the frontier on hallucination.
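The "price doubled, tokens down 40%, net about 20% more expensive" claim read out above is simple multiplication, and a minimal sketch makes it concrete. The function names and the example token counts here are illustrative assumptions, not anything published by OpenAI or Artificial Analysis; the $5 and $30 per-million prices are the ones quoted on the show.

```python
def relative_cost(price_multiplier: float, token_reduction: float) -> float:
    """Cost of the new model relative to the old one for the same workload.

    price_multiplier: new per-token price divided by old per-token price.
    token_reduction: fraction of tokens saved (0.40 means 40% fewer tokens).
    """
    return price_multiplier * (1.0 - token_reduction)


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float = 5.0,
                     output_price_per_m: float = 30.0) -> float:
    """Dollar cost of one request at per-million-token prices.

    Defaults are the GPT 5.5 prices quoted in the episode.
    """
    return (input_tokens / 1_000_000) * input_price_per_m \
        + (output_tokens / 1_000_000) * output_price_per_m


# Doubled price with 40% fewer tokens: 2.0 * 0.6 = 1.2, about 20% more expensive.
print(f"relative cost: {relative_cost(2.0, 0.40):.2f}x")

# Hypothetical request: 200k input tokens and 50k output tokens.
print(f"request cost: ${request_cost_usd(200_000, 50_000):.2f}")
```

This assumes the token reduction applies evenly to everything being billed, which is a simplification; real workloads mix input and output tokens in different ratios.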
Our private AI Omniscience benchmark rewards factual knowledge. GPT 5.5 extra high has the highest accuracy at 57%, meaning the model can recall facts in the Omniscience corpus more effectively than any other model. However, it has a hallucination rate of 86% versus Opus at 36%. This makes it more likely to answer a question where it does not know the answer. That's not great, honestly. But great model, sir, great model, sir. All right, the pricing thing has finished. Let's see if it changed anything. No, it's still kind of wonky, but I kind of like the price comparison thing. I didn't ask it for too much. Besides the fact that there's text overlapping here, it's pretty cool. So some folks are saying, can't wait to test 5.5 Pro. Let's see if 5.5 Pro is up. I tried to check my ChatGPT, but I don't see it on mine. But maybe you have it on yours. You're still sharing, by the way. Yeah. Thank you. Am I still sharing even now? Yes. Lovely. Even if I moved it away. Well, I don't see your browser anymore, but I see it. That's fine. So let me log in and see if I have the Pro. I don't think Codex has access to Pro. I think it's just online, right? So we'll see. All right, logging into ChatGPT. Let me confirm that I do not have it. Here, while you do that, I could say some Mythos versus 5.5 scores. Yes, please. So it looks like on Humanity's Last Exam and GPQA, and most of the benchmarks, Mythos is significantly beating it. But it is interesting in CyberGym, which seems like really the only popular cybersecurity benchmark that Anthropic tested Mythos in. In that case, Mythos preview got 83.1%. Opus 4.7, which just released, gets 73.1%. And GPT 5.5 gets... Sorry, I just had it pulled up here. Okay. GPT 5.5 gets 81.8%. Could you send me something so we have a visual as well? Sure. So basically, GPT 5.5 only scores about 1.5% lower than Mythos here. Than Mythos. Yeah, while Opus 4.7 is a full 10% behind. Oh, wow.
So this is a cybersecurity model as well. Is this Spud? Do we know if it's Spud? I don't think it's confirmed. I think it's kind of been implied. It might be at least an early version of Spud or something like that. Yeah. I think that folks from OpenAI posted something about "Spud is coming." What else do we want to tell? Let's see what Sam says. Sam Altman: We believe in iterative deployment. Although 5.5 is already a smart model, we expect rapid improvements. Iterative deployment is a big part of our safety strategy. We believe the world is best equipped to win at the team sport of AI resilience this way. Okay? We believe in democratization. We want people to be able to use lots of AI. We aim to have the most efficient models, the most efficient inference stack, and the most compute. We want our users to have access to the best technology. We have been tracking cybersecurity as a preparedness category for a long time and have built mitigations we believe in that enable us to make capable models broadly available. He's taking direct shots at Anthropic with Mythos and the "too dangerous to release" thing. Oh, yeah. Oh, yeah. That's shots fired. Absolutely. "We love you and we want you to win." Sam Altman says we want to be a platform for every company, scientist, entrepreneur, and person. And in parentheses: my whole career has largely been about the magic of startups, and I think we're about to see that magic at hyperscale. This is great. So shout out to Sam Altman. Let's see what else OpenAI posted. You guys know what? I gotta wonder if GPT Image now is significantly better because GPT 5.5 is significantly better. Or, like we said, GPT 5.5 was already in GPT Image, we just didn't know about it yet. Oh, yes. Again, for these types of models, you have to include a full language model in it. So yes, that will help. So we can go and take a look. Let me log into fal.ai. We'll try GPT Image.
Meanwhile, folks, let us know if you want us to test anything specific. Oh, yeah, Peter, please, please do. Dude, you want to jump on? Let me invite Peter Gostev, because he's been testing this model for a bit. Let's invite Peter. The new GPT Image is also on OpenRouter, and there it's called GPT 5.4 Image 2. So basically, the language model plus the image model. Wait, could you say this again, Wolfram? Yeah, on OpenRouter, the model is called OpenAI slash GPT 5.4 Image 2. So basically you have the name of the language model in the name of the image model. Oh, it should be 5.4 Image 2 on the API, you mean? Yeah, on the API. So basically, if you expect it to be 5.5, then it will be updated there as well. So far it has been 5.4 with Image 2 as a layer. Yeah. We'll see if Peter Gostev from Arena comes on. I texted him. Let's see. But Peter, if you're listening, I sent you a DM on X with the link. Please join us on stage. Meanwhile, let's see what OpenAI says here. So we're going to look at those evals, but what is standing out to us? They have GPT-5.5 Pro, and very interestingly, they compare the four models here. They compare the Thinking and the Pro to both 5.4 variants. GPT-5.5 Thinking looks like it beats nearly all the models on Toolathlon? I don't know what Toolathlon is. I don't know. Wolfram, did you try and switch to 5.5 in Hermes? I don't have it in Germany yet. It is being rolled out, so I asked, at least I did some research. It should come, but it's not there yet. Otherwise, I would definitely switch. And I just subscribed to OpenAI again with the Pro account. Oh, nice. Okay.
So with 5.3 or something, in December, I unsubscribed, and now I will test and see. But I have high hopes and high expectations. Meanwhile, while we wait for Peter to come on, folks, I want to show you the thing that I wanted to show you this whole time. I want to show you computer use. Okay, so I'm going to start a new chat here with 5.5. I don't think we need high, let's do medium, and use @ computer use to... You got it in Codex already? Yeah, I got it in Codex. Crazy, crazy, crazy. Let's try: use computer use to interact with the Chrome browser and tweet from my account, "Hey, we're live and testing GPT 5.5 that just dropped, join us on our live stream," but also quote tweet my previous tweet that has the live stream. Not an easy task, definitely for computer use, not an easy task. Let's see, folks. So right now we're going to use GPT 5.5 on medium. I'm also on fast mode, so it burns like 1.5x the tokens. We'll see how the computer use is going to work. And the thing that I've wanted to show you all this time, and finally I can show you, is that it's already clicking in Chrome while I'm focused on here. So it's already doing the clicks in Chrome. "Chrome is already on X, logged into @altryne. I can see your live broadcast in the sidebar, so I'm going to use the..." It found the live stream post. It clicked, you guys see this? It clicks the thing, it's going to do quote. Are you seeing this? Let's go, this is beautiful. And now it's going to post for me. You guys see the little cursor? It's going to focus, hopefully it's focusing. Do you think Nikita sees this? Nikita, don't look at it. "I have the quote tweet drafted against the live stream post, ready to send. Because this will publicly post, please confirm: should I click Post now?" Yes, let's go. Alex, did you give it a prompt to wait for confirmation, or is that default? It's default. That's great. It's clicking, it's sending, and it's sent. Ah, this was awesome. Folks, this is
computer use with 5.5. "Posted and verified, the new quote tweet is live." So verification as well. Folks, I want to welcome Peter Gostev from Arena to the show. Peter, welcome. We just did a live stream with you, what, two days ago, talked about GPT Image, and now we're talking about a new model, GPT 5.5, that just dropped from OpenAI. First of all, thank you for joining. Second of all, impressions from you, would love to hear about this model, please. Yeah, let me... you know what, I feel like I can hear you from two places, and I clearly have you open too many times. All right, so while you fix this, I will reiterate for folks that we just saw a demo of computer use with GPT 5.5, and I asked it not only to post something on Twitter, which is easy, I asked it to quote tweet another tweet of mine with the live stream. There's quite a lot of intelligence involved in figuring this out, and it did it super fast. We all saw it happening on the fly, and I think it's incredibly, incredibly cool. Peter, let me know when you're ready. There you go. Just to be clear, before that: across the board state of the art, right? From Thinking and above, everything is state of the art. Yeah, state of the art while using, like, what, 20% fewer tokens or something? Or sorry, like almost 50% fewer tokens. All right, folks, let's welcome Peter Gostev from Arena to the show to talk about GPT 5.5. Yeah, so I haven't had huge amounts of time with it, but it was really fun to test. I would say the biggest thing that jumps out is that this is the first time a model can actually properly do long-running tasks. All previous models, I know they kept saying, oh, you can do it for many hours, but every time, I don't know about you guys, I just shout at it. I do anything I can think of. I come up with these constructs of how it's supposed to do it. And then it never does it. So that was always very annoying.
And that was the first time when I could really... maybe it's not completely "I say work for 10 hours and it works for 10 hours," but without too much prompting, you can get it to work for a long time. So I'll give you one example. Yesterday, I came up with a little idea. It's not done yet, considering how long it's running. But I wanted to generate some images, create an app, and create the whole experience around it. And what I did before going to sleep: I came up with a prompt, and then I queued up, you know how you do, you queue up like 10 more prompts to keep it going. Yeah. And then I thought, okay, it'll be done, it'll probably be done at like 3 a.m. I woke up and it literally hadn't finished the first one. So all of this queuing up was completely unnecessary. It just kept going. Oh, wow. So that's the first time I've ever had that happen. So how long did it run for? At about eight and a half hours I kind of stopped it, just to check in with it and try to rearrange things a little bit to speed it up. For example, I wanted it to use sub-agents a bit more, so it's not going to run for another 20 hours. But yeah, I don't know how long it would have been running for. I'll tell you, even now I've got the button to do an update on Codex up, and I literally cannot do it because it's going to ruin my long-running tasks. I've got three long-running tasks running now. One, let me just check, I think it's been running for about seven and a half hours. And I have this little app that I have just for my work, where I've got different visualizations pulling in the data. I've got very custom visualizations that I'm doing. And I vibe coded it in a really crappy way, so it all breaks and so on. So I want to keep migrating it to a better architecture. And yeah, it's been going for like seven hours.
Every time I check, come on. Yeah, literally started today, seven hours. I can't even update the bloody app because it keeps running. It's still running. Yeah, it's quite crazy. So Ralph loops are dead, essentially, with models that are running through all this time. Seven hours is insane. Yeah. What else did you notice? What differences? So it feels... I mean, it's kind of silly, but it does feel nicer. It does feel a little bit smarter and just nicer to speak to, and it explains things a bit better. When I'm trying to get something done, sometimes, especially starting with, I want to say, maybe 5.2 or something like that, it would just be kind of abrupt and just do stuff, and then you're like, I don't really know what's happening. This one is a bit better. I don't know if it's an important change or they just changed the style of it or something like that, so it'd be hard to know for sure, but it does feel smarter that way. I did some one-shot 3D generations, like I like to do, and that was noticeably better one-shot versus 5.4. So that was really nice. I was using computer use as well, like you were saying. I like that as well. I would say, though, I still feel like we are not quite there in terms of it being quite as good as I hope it would be. What I mean is, what I want it to do is literally use the app itself, properly reflect on what's wrong, and then make it better. And it's not quite getting it. There's not really any model that does, and I've used it with Opus as well, with Gemini. They'll look at it and just say, oh yeah, that's good, and not really register very obvious issues. I don't know if it's that vision kind of sucks still, or what. So it kind of does something, but it feels like it needs probably another generation, probably with
vision. Yeah, you know, we're just getting it now, but I'm running this now with 5.5 medium and computer use, and I asked it to go and download the brand kit that we generated from Claude and just generate a launch video for itself. So we're going to see, but this is a long-running task. I don't know if we're going to sit here, but folks, I can show you what's going on. It went and found the brand kit and says the kit files are now in this folder: the readme, the tokens, the Claude design system skill. "Next, I'm scaffolding the Hyperframes project." So basically, all the tools that I told you about before: GPT 5.5 is now using computer use, plus, via Codex, writing CLIs and doing some things to create a video. The only thing that I have, and this is a comment that's not necessarily 5.5 related, it's just a comment about how Codex works: Peter, you said you queued up some stuff. I chose the steer function versus the queue function. So steer is something that I think only GPT has. I think Devin also has steer. But steer is: while it's running, you can steer it. So it's great for long-running tasks, you know, like the one you said ran for eight hours. Usually in Claude you have to pause it or stop it completely with tool calls and say, hey, do this instead. GPT has steering built into its thinking, so you can actually tell it to do some stuff. So I can say, for example, you guys see this, probably: "Don't render the video, just show me the Hyperframes UI to confirm before." I'm saying, by the way, don't render the video. This is steering. So now I'll throw this in the middle of the reasoning process, and you can join the long-running process with your thoughts. Anything else we should cover, folks? I'm seeing this Mythos comparison is very interesting. Wolfram, you want to talk about this a little bit? Because I
think LDJ showed us this, and we can show this once again. LDJ had a nice comparison with the scores, and the Mythos model still has some advantages. Can you bring it up, or should I? I just checked on ChatGPT. I don't have 5.5 on ChatGPT still. Yeah, me neither. Basically, on Terminal-Bench it's very, very close, but on Humanity's Last Exam, which is not the last, there is still a big gap. Even if it's using tools, it's still over 12 percentage points of a difference. You're talking about the unreleased Claude Mythos from Anthropic compared to GPT-5.5 that we all just got in Codex. So I think that's also a big difference. Yeah, GPT-5.5, which everybody can access, this is what they said, is close to Mythos. It's much better than Opus 4.7 in all those scores, I think. Almost all those scores. Somewhere Opus 4.7 is still ahead, but it's very, very close. And yeah, like Sam Altman said, it's great to have the AI available for everybody and not restricted, or different classes of who can access what. Did you guys read this thing where Claude Mythos was actually available to some folks on Discord, and what they used it for is generating some websites? That was really funny. On the first day or something, a bunch of folks on the Discord got access to Claude Mythos. What they used it for is not to jailbreak or break computers, just to generate websites. I love it. What no benchmark has shown, and mine doesn't show either, that's what I do the vibe checks for, is whether the personality has changed. Because 5.4 was so boring and robotic to talk to, which is something, if you want to... you noticed the same.
When you have your agent and you talk to your agent, it feels like just a robot. Now it's not as much fun to use it. And so I will test this. Hopefully they also changed a bit at this. Yeah, I gotta ask you, I gotta ask Peter, I gotta ask you guys who are focusing on evals: how do we even test this? This is all just vibes. You just work with your assistant for a while and then you're like, oh, this is better. Because I have no idea. I know that Opus is better than GPT 5.4, but I know no eval that compares this. Wolfram, what do you think? I call it a private eval, in a way. When I talk to an agent or anywhere, when I get a good AI response that is funny, that touches me on a level, then I copy it into a quotes file and I keep it that way, and I always write which AI did that. And so basically I could do a list of which quotes came from which models. And I know that Opus is super, super strong in there. And from ChatGPT, the 4o models were the last ones I had some quotes from. So basically, that is my personal thing, though. I notice when I copy a lot of stuff in there that it's a model that I really like. I see. Okay, so we'll see how many things you quoted from GPT 5.5. Peter, on Arena, I'm assuming this model is just now running? There's no confirmation, like we had with GPT Image, that this was "masking tape" or something, right? No, I think there's no API yet. Do you guys see that? There's no API? It's just available in Codex and not even in ChatGPT. I only have it in Codex. I think some people said it's rolling out in Codex CLI. Do you have it in ChatGPT? I haven't checked, but it should be in ChatGPT and in Codex as well. But it's not going to be on the API as far as I understand,
at least for the time being, so we can't test it ourselves as well. So we have the task that I asked it for. Again, I want to show you guys: it involved computer use, it involved a bunch of other stuff, it involved Hyperframes. So I asked GPT 5.5 to create a launch video for itself. This is the prompt: open a new window, go to claude.ai/design, download the ThursdAI brand kit into a folder, and then generate a launch video for GPT 5.5 using our brand guidelines with Hyperframes. It's quite a complex thing to do, and now we have some sort of a video. This is with medium thinking. Let's see. It still controls my browser. I'm not sure what I can show you, but this is the video that it generated. Let's take a look. Let's play. Dismiss. Okay, dismiss. I asked it not to render the video. Let's play. Oh, this is pretty cool. "Built for agentic..." I wish I could zoom in here. I'm not sure how I can full screen this video. Oh, maybe... okay, like this. Let's start again. There's no music, but here's the video. "Breaking model drop: GPT 5.5 just landed." This is literally our branding kit. "Built for agentic work: writing and debugging code, researching across the web, operating software. More capable, same pace, latency matches GPT 5.5. We're live testing it right now. No hype, no fluff, just signal. ThursdAI news." This was impressive, folks. This is very impressive.
I will say specifically, it's impressive because I tested both 5.4, the previous model, and Opus, and Opus was way better than 5.4 before at creating these videos. This model not only just created the video, it understood what to talk about, it understood where to get the brand kit, it downloaded the brand kit and did all the things, very quickly as well. How long did it run? Nine minutes. It ran for nine minutes while we were covering this. Yep. One thing you might want to check is front-end design, because front-end and design in general is something that is specifically known to be hard for the Codex models. Yeah, all of them. So we did kind of test this. Okay, so we asked this, and this one is kind of not the best. This is the comparison with 3D. The 3D is here, but this front-end design is not the best one. And then we also checked it on our Olympus Mons Mars driver thing that we always test with Nisten. This is lacking, let's say it's lacking. Opus 4.7 was just incredible at this. But for regular web designs, I think the new meta is something we have to test, and Peter, we talked about this when we went live with GPT Image. Yeah, yeah, no, I agree. I don't think it's quite good enough, which is kind of odd, right? Because they obviously know it, they know they need to get better at this. So it's kind of interesting why they can't quite get it right. I don't know why. Well, it's interesting: if this is Spud, as we guess, and I don't know for sure, but if it is, it means it's not pre-training, right? It's something in post-training that they're not quite getting right yet. But I do think with Codex, or with GPT 5.5, you just kind of need to fight it and then it's great. But yeah, the initial instincts are terrible. So yeah, one-shotting is much better with Opus. Yeah, just for design's
sake. In the case of just this video, I will say, I asked it to go and download the design, how should I say, the brand guidelines. I will say those are spot on. This looks like the brand guidelines for ThursdAI. You guys can see the logo clearly. You can see the font rendered fine, all of this. I wouldn't say this is the most beautiful design, but it's spot on for what I asked. Yeah, but that's what I find as well. If you do have some guidelines, some structure, it's completely excellent. It's really, really spot on. But to do the initial thing, yeah, I wouldn't rely on it, to be honest. But we do know that GPT Image is great. Yeah. So how about we test this? You guys want to test this? Jan, what type of web design would you want to imagine? Absolutely do it. Okay, let's do it. Imagine it with GPT Image and let's see. So we're going to open proper Codex, not the little side window here that we have. We're going to build. Nisten, how about we do the Mars thing, but with GPT Image first? Can I just give you the design file that Claude generated? No, but hold on, we said that we want to one-shot it with the thing. Yes, we know that using the design, Peter just said, using the design it's really good at. We want to see the ability to use Codex as a stand-in for the creativity of GPT 5.5. Send me the design file, I'll take a look, but for now I want to say... You can just copy the prompt and say: generate the screenshot of this game, the Mars interface, first with image. How do you... image gen, Image 2, and then implement with code. Okay, let's do high thinking. On speed, it should be fine. Yeah, on speed it should be fine. Okay, so we're generating. You guys are not watching this, of course. Yeah, there we go.
In the new Codex, we're generating the Olympus Mons driver rail thing. But in parentheses, I said: generate the screenshot of this interface first with image gen, GPT Image 2, and then implement with code. And I sent it. So this is kind of the new method that we talked about, that you can potentially substitute Opus's creativity with GPT Image, because it is really good at web design. It's really good with different things. And it says: "I'm using image gen to make the requested visual target, then I'll build the working version with three.js." Let's see. Alex, while this is working, I can show you a fun project. This is the one that's been working overnight. But Peter, you look exactly like it generated you. From GPT Image 2, not GPT 1.5, which didn't look exactly like Peter. We were showing your examples in the beginning of the show. Yeah, feel free to share. How does it work here? Oh yeah. So what I was imagining... basically, I hadn't actually thought of this, but in the demo they showed this idea that you can generate these 360 images. And I thought, oh, that's a really cool idea, what if you actually just generate a whole bunch of them? So what I was getting it to do is to plan out the whole Hanging Gardens of Babylon, what that would look like, and I tried to do this kind of 360 view of them. This is GPT-Image 2, right?
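For readers unfamiliar with the 360 images mentioned above: panoramas like these are commonly stored as equirectangular images, where the horizontal axis is longitude (yaw) and the vertical axis is latitude (pitch). A viewer that lets you look around only needs to map a viewing direction to a pixel. The sketch below shows that mapping under stated assumptions: yaw 0 at the image center, a 2:1 image, and a hypothetical function name. It is not taken from Peter's app or from any OpenAI tooling.

```python
def direction_to_pixel(yaw_deg: float, pitch_deg: float,
                       width: int, height: int) -> tuple[float, float]:
    """Map a viewing direction to (x, y) in an equirectangular image.

    Assumed convention (illustrative, viewers differ):
      yaw   in [-180, 180), 0 at the horizontal center, positive to the right
      pitch in [-90, 90],   0 at the horizon, positive looking up
    """
    # Longitude spans the full width; wrap so that yaw 180 lands back at x = 0.
    u = (yaw_deg / 360.0 + 0.5) % 1.0
    # Latitude spans the full height; the top row is straight up (+90).
    v = 0.5 - pitch_deg / 180.0
    return (u * width, v * height)


# Looking straight ahead at the horizon hits the image center.
print(direction_to_pixel(0, 0, 4096, 2048))

# Looking 90 degrees to the right and 45 degrees down.
print(direction_to_pixel(90, -45, 4096, 2048))
```

A street-view style walkthrough is then a set of such panoramas plus links between neighboring viewpoints, with the viewer swapping images as you move.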
Yeah, so this is GPT-Image 2, together with all of the planning, all of the coding, all of the coordination by 5.5. What I tried to do is create this Street View kind of view, and you can see it's still a bit buggy, but the idea is to do a kind of walkthrough. I'm going to try and fix it. It's still working, but this is a few hundred images. And then I can just go in here and look around.
That's insane. It's incredible. Let me just repeat so that folks understand what's going on here. GPT-Image 2 can do 360 images, the equirectangular image format, that you can then put in and rotate around. It can do them very, very well. Peter, you're saying you're generating a few hundred images, and you can walk through that whole universe that was generated?
Yeah, exactly.
Okay, absolutely insane. Bonkers. So you built a Street View thing completely generated with Image 2. Obviously it's not 3D, because it's all equirectangular, but that's how Google Street View works, right? It's all equirectangular images one after another, and they have this nice animation that they fake. Dude, this is a crazy demo, man. Holy cow. Wow. How long did it go for? And also, how many images did it generate?
So I need to check how many images. I think so far I have about 400. The issue that I have is that I was trying to get it to work the same way as Street View, where you can literally move from one image to the next, but it's a little bit buggy, and the images don't quite align. I mean, I think I'm asking a lot from it, right, to plan literally everything. And I think I probably need many thousands for it to work properly, so I just need to work out what's the nice balance. So, you're asking
how long did it work for? I came up with this idea at about 1am last night, London time, so that was the end of your working day, and then it worked the whole night. In the morning I was tweaking it and giving it a bit more direction, and it's still working. So I guess it's going to be coming up to 24 hours to build this. But I think maybe I went a bit too ambitious, trying to create the whole thing. If I had one road with some cool stuff, I think I probably could have done that overnight. So if you've got better-scoped ideas, maybe you can try that as well. But yeah, you can literally do it in Codex now.
That's so cool. You basically created Street View of a place that doesn't exist.
Yeah, well, it did exist, but we don't know what it looks like.
Yeah, we know it existed, but we don't know what it looks like. And then this is the hallucinated, latent-space version of it that comes from GPT-Image. Wow, that's crazy.
The only caveat to this is that I didn't quite work out how to get the 4K resolution, so some of it looks a bit rubbish, and you'll see a bunch of artifacts here. I did also use upscaling to get it a bit nicer, but if you zoom in, some of it kind of looks bad. I don't think it's because the image model couldn't technically do it. I think it's just because the resolution that you can get access to via Codex is not the highest, so then it starts doing that. So that's a little bit of a downside. If you're going to try and replicate this, I was using Topaz upscaling. But yeah, it cost me some amount of dollars as well.
Yeah, well, dollars I have to pay separately, so yeah, not ideal.
Wow. But this is a very long-lived project. Peter, go ahead, sorry to interrupt.
No, there are always tricks to using these.
Yeah, for sure. So yeah, I don't want to... I think
we're going to read a lot of hype about GPT-5.5, or the next model, or the one after that, but we are not at AGI yet, right? So let's remember, we still need to trick them a little bit, massage them, understand how they behave.
I would say I have had a couple of times, when I was testing it, where it was doing something a little bit weird. For example, I was asking it to do a little bit like what we were just doing, to validate its work and work until completion and so on, and then it just randomly created an automation that would run every 30 minutes. I'm like, what the hell? That's literally never happened. Why would you do that? It somehow took my "work until completion" as if it needed to run on an automation. So I don't think it's dumb or something, but its behavior is probably just a bit different. I could imagine that if you try it yourself now and do exactly the same thing as before, you might be disappointed for whatever reason. But as always, just adjust a bit. If it's a new base model especially, it'll probably be a bit different. So yeah, it's not AGI. Definitely try it and get used to it if you're using it.
Yeah, one-shot is fun for demos, but for an actual thing you have to iterate, you have to work, you have to learn the model, and that's what we're trying to do here, folks. We've been on here for a while now, almost four hours, so I think let's do a recap. We got an insane week, just an absolutely insane week, capped with quite an incredible model that looks, based on benchmarks, like state of the art in almost everything. I won't treat Mythos benchmarks as relevant, because Mythos is not available, and those are just marketing numbers from Anthropic. We covered pretty much everything on the stream. How could we not? We've been live for almost four hours.
Thanks, Peter. Peter had to drop, it looks like. But thank you, Peter Gostev, for joining us and giving us first thoughts on GPT-5.5. Obviously, the big release this week is GPT-5.5 from OpenAI, which we were waiting for most of the stream and which finally dropped. We also talked about GPT-Image, which is a huge, huge model, and we're now trying to get GPT-Image and GPT-5.5 to collaborate. Some people are asking, did we talk about privacy filter? Yes, absolutely. We talked about OpenAI's latest open-source release, called, not even GPT, just Privacy Filter. That's an Apache 2 licensed model that's on the Hugging Face Hub. We talked about and demoed at length Claude Design, which is a new skill that we're all getting very excited about. Just claude.ai/design. We talked about the fact that Anthropic reset the quotas for all of their users, so basically, if you did quota out this week, you can go back and look at your quotas. We talked, of course, about computer use. We showed off computer use, a lot of stuff. So, a crazy, crazy week in AI. I think it's time for us to drop, with almost four hours live and almost 5,000 of you tuning in throughout. It's been a great show. Thank you so much for joining us. All right, cheers everyone, bye-bye.