The AI Daily Brief: Artificial Intelligence News and Analysis

What I Learned Testing GPT-5.5

37 min
Apr 24, 2026
Summary

The episode provides a comprehensive analysis of OpenAI's new GPT-5.5 model, examining benchmark results, real-world testing, and competitive positioning against Anthropic's Claude. The host conducts extensive personal testing across coding, writing, design, and knowledge work tasks, concluding that GPT-5.5 represents a significant leap forward while noting that practical improvements may be incremental for many users.

Insights
  • GPT-5.5 achieves state-of-the-art performance on most benchmarks but shows weakness in specific domains like SWE-bench Pro, suggesting benchmark selection significantly impacts perceived model quality
  • Model efficiency and cost-performance ratio matter more than raw capability scores; GPT-5.5 dominates the cost-performance frontier despite higher per-token pricing
  • The competitive dynamic has shifted from capability announcements to execution and availability; OpenAI's iterative deployment strategy directly counters Anthropic's restricted access approach
  • Long-running task reliability and tool integration represent major practical improvements over previous models, enabling real-world agent workflows previously impossible
  • Design and planning remain areas where Claude Opus retains advantages, indicating no single model dominates all dimensions—multi-model workflows are becoming standard practice
Trends
  • Shift from single-model to multi-model workflows optimized by task type (e.g., Opus for planning, GPT-5.5 for execution)
  • Inference efficiency and cost-performance becoming primary competitive differentiators as capability ceiling rises across models
  • AI harnesses and application layers increasingly important as native model capabilities plateau; skills and integrations drive practical value
  • Long-context and context-compaction improvements enabling persistent, single-thread agent workflows for strategic thinking and planning
  • Rapid model release cadence accelerating; OpenAI signaling quarterly or faster improvements rather than annual major releases
  • Enterprise AI adoption shifting from experimentation to production workflows; agents and autonomous task completion becoming table stakes
  • Benchmark skepticism growing; community increasingly dismisses individual benchmark results in favor of real-world testing and use-case validation
  • Communication strategy maturation; OpenAI moving away from hype-driven announcements toward product-focused, humble messaging
  • Performance issues and model degradation becoming public competitive weapons; Anthropic's Claude Code quality regression widely discussed and leveraged
  • Compute capacity emerging as sustainable competitive advantage; OpenAI's inference infrastructure investment creating moat beyond model quality
Companies
OpenAI
Released GPT-5.5 model; primary subject of episode analysis and competitive comparison
Anthropic
Competitor with Claude Opus 4.7; restricted access to Mythos model creates competitive narrative
Google
Mentioned as third player in AI model rankings alongside OpenAI and Anthropic
Arena.ai
Provides benchmarking and evaluation framework for comparing AI model performance
Artificial Analysis
Maintains intelligence index and benchmarks used to evaluate GPT-5.5 vs competitor models
Val's AI
Maintains professional task benchmarks in finance, medical, and legal fields
CodeRabbit
Tested GPT-5.5 for code review capabilities and reported strong performance improvements
Runway
Referenced for optimal multi-model setup combining Opus for planning with GPT-5.5 for execution
SemiAnalysis
Analyst firm providing critical perspective on GPT-5.5 release and competitive positioning
A16Z
Venture capital firm with analyst commentary on OpenAI's communication strategy shift
People
Sam Altman
Made official GPT-5.5 announcement and emphasized iterative deployment and democratization strategy
Dario Amodei
Implicitly referenced regarding Anthropic's restricted model access and communication approach
Jakub Pachocki
Stated expectation of rapid continued progress and accelerating pace of model improvements
Greg Brockman
Positioned GPT-5.5 as beginning point with larger improvements expected in coming months
Noam Brown
Argued intelligence is function of inference compute; cost-performance matters more than single metrics
Tebow
Defended GPT-5.5 coding performance against SWE-bench Pro criticism; questioned benchmark relevance
Adam McLaughlin
Demonstrated GPT-5.5 capability for 31-hour autonomous RL task execution without interruption
Matt Shumer
Provided nuanced perspective that GPT-5.5 is major leap but may not impact 99% of users practically
Ethan Mollick
Had early access to GPT-5.5; framed it as sign of continued rapid AI improvement trajectory
Ben Davis
Praised GPT-5.5 as best code-writing AI model experienced; noted improved conversational quality
Pietro Schirano
Called GPT-5.5 highest leverage tool ever used; felt limited only by imagination not model capability
Bindu Reddy
Found GPT-5.5 tops LiveBench; extremely good instruction follower with strong practical performance
Flavio Adamo
Noted GPT-5.5 understands request shape better; writes cleaner code with less over-engineering
Peter Gosteff
Demonstrated GPT-5.5 reliability on 8+ hour long-running migration tasks; unprecedented capability
Simon Smith
Tested GPT-5.5 PowerPoint creation; found good autonomous iteration but design taste still lacking
Siqi Chen
Recommended Opus 4.7 for planning with GPT-5.5 for execution as optimal multi-model setup
Allie K. Miller
Noted certain model class where non-technical users may not notice differences from previous versions
Justine Moore
Praised OpenAI's shift to shipping without giant PR campaign to scare people about AI risks
Cremio
Summarized sentiment that Opus 4.7 regression made GPT-5.5 competitive advantage more apparent
Pieter Levels
Confirmed Claude Code quality degradation on March 4th coinciding with user complaints
Quotes
"GPT 5.5 takes OpenAI back to the clear number one. OpenAI's new model tops the Artificial Analysis Intelligence Index by three points, breaking a three-way tie with Anthropic and Google."
Artificial Analysis
"With today's AI models, intelligence is a function of inference compute. Comparing models by a single number hasn't made sense since 2024. What matters is intelligence per token or per dollar."
Noam Brown, OpenAI
"GPT-55 is the highest leverage tool I've ever touched. For the first time, I don't feel limited by what a model can do. I feel limited only by what I can imagine."
Pietro Schirano
"This is the first time where the upgrade feels relatively large, but most of the time it does not matter that much. Not because the model is disappointing, but because the last set of models was already so good."
Matt Shumer
"What 5.5 represents is not an endpoint. In many ways, it's a beginning point. It's really a step towards the kind of models that we see coming over even just the upcoming months."
Greg Brockman, OpenAI President
"GPT-55 has it all. OpenAI's new model is a top-end senior engineer and easy to talk to. The surprising thing about GPT-55 is how few of those trade-offs it asks you to make."
Every's Vibe Check
Full Transcript
GPT 5.5, aka Spud, is here, but does it live up to expectations? This is one of the most hyped models we've had in a very long time, and we are going to go through all of the first reactions, the benchmarks, and of course, about a dozen of my own tests. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Blitzy, Granola, and Mercury. To get an ad-free version of the show, go to patreon.com slash ai daily brief, or you can subscribe on Apple Podcasts. If you want to learn more about sponsoring the show, send us a note at sponsors at aidailybrief.ai. Now, aidailybrief.ai is, of course, where you can find out about all the different things going on in our ecosystem. That includes things like the AIDB New Year's program, Claw Camp, etc. And to try to make things a little bit easier as we have some perhaps new free programs forthcoming, I'm actually launching an AI Daily Brief account system so that you can just sign up once and then add yourself to programs as they come up without having to sign up again each and every time. If you go to aidailybrief.ai right now, you can claim your username and be first in line to hear about another free program we have launching tomorrow on an Operator's Bonus episode. Well, friends, it is here. Ever since back in December, when OpenAI declared a code red, we knew that they were deep in the lab cooking something good. Or at least we hoped it would be good. Certainly the last few months have seen the company regain its verve, particularly around Codex, which has grown from just a couple hundred thousand users at the beginning of the year to over 4 million now. We've heard about the elimination of side quests, TBPN acquisition notwithstanding, and overall that focus has seemed to reshape the company. And ultimately, leaked memos and grand statements about focus don't matter a fig if they don't produce results. Now, honestly, for OpenAI, the stakes heading into the 5.5 release had been raised dramatically because of their competition with Anthropic. Maybe the biggest story of the last few weeks in AI has been the model that we don't have: Anthropic's Mythos. Anthropic basically said to the world, we've got a new powerful model that is a step change in capabilities, but it's too powerful right now for us to provide to the average user. Now, of course, in some cases there has been skepticism that the power is the real reason that Anthropic isn't delivering this. Some have speculated that it has more to do with compute constraints than true cybersecurity concerns. But it has seemed like the limited set of partner companies that have had access have validated that it is indeed a very good model. Whatever OpenAI put out next, then, was always going to be their response to that missing Mythos model, and the expectations were ratcheted up accordingly. On Friday at 2 p.m., OpenAI dropped GPT-5.5. In their announcement tweet, they called it a new class of intelligence for real work, empowering agents built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks, they wrote, a new way of getting computer work done. Some of the use cases they pointed to as where it excelled were writing, debugging code, researching, analyzing data, creating documents and spreadsheets, operating software, and, quote, moving across tools until a task is finished.
In other words, this is a knowledge work model. And certainly the benchmarks seem to slap. Taking just a comparison to Opus 4.7: whereas Opus 4.7 scored a 69.4% on Terminal Bench 2.0, an agentic coding benchmark, GPT-5.5 scored an 82.7. On the real-world task benchmark GDPVal, Opus 4.7 scores an 80.3; GPT-5.5 gets an 84.9. Overall, the model ranks right at the top of Artificial Analysis's overall benchmarks, with the extra-high version being the first model to ever score in the 60s. Artificial Analysis themselves write: GPT 5.5 takes OpenAI back to the clear number one. OpenAI's new model tops the Artificial Analysis Intelligence Index by three points, breaking a three-way tie with Anthropic and Google. Now, while obviously all of that is good news both for OpenAI and for people who like powerful models, not every benchmark was that clear-cut. Andon Labs found that GPT-5.5 was behind Opus 4.7 on Vending Bench, which tasks the model with running a profitable vending machine business. In that test, GPT-5.5 was about on par with Opus 4.6. In Vending Bench Arena, which is a multiplayer variant that introduces competition, GPT-5.5 did actually beat Opus 4.7 by a healthy margin, and Andon Labs also noted that 5.5 didn't display any of the underhanded tactics Opus had, like lying to suppliers or stiffing customers on refunds. Val's AI, which maintains a range of benchmarks that test professional tasks in finance, medical, and legal fields, found that Opus 4.7 still comes out ahead, although GPT-5.5 was a decent jump over 5.4. The most discussed negative benchmark was SWE-bench Pro, where 5.5 significantly underperformed Opus 4.7. Pointing to a footnote where OpenAI suggested that Anthropic had reported signs of memorization on a subset of problems with their SWE-bench Pro score, Didi said: that footnote is trying really hard to bury the lead. GPT-5.5 isn't state-of-the-art for coding. Tebow on the Codex team at OpenAI clapped back: you'll be missing out if you think SWE-bench is representative of anything real. He then pointed to an article they had published about this in February called Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities. We'll talk more about what people found with coding, but to not bury the lead ourselves: it doesn't seem like that SWE-bench Pro number has much actual signal to add. Outside of benchmarks, one of the things that people noticed quickly was the cost. Theo pointed out that it was double the price of GPT-5.4 and 20% more expensive than Opus 4.7, at least in terms of the cost per million tokens in and per million tokens out, which are $5 and $30 respectively. And yet just looking at cost in terms of tokens in and tokens out misses the key functional dimension of cost, which is how efficient a model is at solving a problem. Noam Brown from OpenAI writes: a hill that I will die on. With today's AI models, intelligence is a function of inference compute. Comparing models by a single number hasn't made sense since 2024. What matters is intelligence per token or per dollar. This is especially true when using it in a product like Codex. And on that front, as Scaling01 (Lisan al Gaib) points out, the GPT-5.5 model family completely dominates the cost-performance frontier on the Artificial Analysis Index.
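To make that intelligence-per-dollar point concrete, here is a quick back-of-the-envelope sketch. The $5 and $30 per-million-token prices are the GPT-5.5 figures from the episode; the comparison model's prices and all of the token counts are invented placeholders, purely to illustrate how a model that costs more per token can still cost less per completed task if it needs fewer tokens and fewer retries.

```python
# Back-of-the-envelope "intelligence per dollar" comparison.
# GPT-5.5 prices ($5/M input, $30/M output) are from the episode; the
# comparison model's prices and every token count below are hypothetical
# placeholders, chosen only to illustrate the arithmetic.

def task_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one task, given token usage and per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# Hypothetical scenario: the pricier-per-token model solves the task in one
# attempt with a tight answer; the cheaper one needs two attempts and emits
# more output each time.
pricier = task_cost(40_000, 8_000, price_in_per_m=5.0, price_out_per_m=30.0)
cheaper = task_cost(2 * 40_000, 2 * 20_000, price_in_per_m=2.5, price_out_per_m=15.0)

print(f"pricier per token: ${pricier:.2f} per solved task")  # $0.44
print(f"cheaper per token: ${cheaper:.2f} per solved task")  # $0.80
```

Under those made-up numbers, the "expensive" model comes out at nearly half the cost per finished task, which is exactly the per-token-versus-per-dollar distinction Brown is drawing.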
So taking a step back from the benchmarks and just going to first impressions: while it was possible to find people who were unimpressed, for example Fabricated Knowledge, who works with SemiAnalysis, wrote, dude, so like if this is the best OAI got, are they going to close down and join ANTH to make AGI? That perspective was pretty few and far between. There were a few more folks who thought that perhaps the model had been overhyped, but then again, as Control-Alt Duane pointed out, quote, OpenAI wasn't the one hyping this release. It was people on this app doing it. Maybe a different way to put that is that the swirl of discussion surrounding Mythos increased the hype totally outside of OpenAI's control. Some pointed out that when you looked at scores like Terminal Bench 2.0, the computer use benchmark OSWorld Verified, or other benchmarks like BrowseComp and CyberGym, while GPT-5.5 didn't necessarily beat the reported Mythos numbers (although it did on Terminal Bench 2.0), it was close enough that it would be fair to consider this Claude Mythos level but, as Chubby (@kimmonismus) puts it, for public use. Scaling01 again writes: after some deliberation, I think GPT-5.5 is close to Mythos despite being only a fifth to half the size. In that post, they write: SWE-bench Pro threw me off but should be just discarded as noise or spiky intelligence. He also speculated that in terms of parameters, GPT-5.4 was around 1-2 trillion, 5.5 is 2-5 trillion, and Mythos is about 10 trillion. They also point out that Mythos pricing does look kind of ridiculous at $125. Mythos might turn out to be Anthropic's GPT-5.4 moment. Now, on this question of Mythos vs. GPT-5.5, I actually think that Riley Brown has it right when he writes: Mythos benchmarks do not matter until released to the public. As far as I'm concerned, it does not exist. Based on my review of reactions, the much more common reaction is that this is the new standard. Every's Vibe Check declared GPT-5.5 has it all. OpenAI's new model is a top-end senior engineer and easy to talk to. They write: frontier models usually come with trade-offs. You get more depth but less speed, more agency but less control, better code but worse prose. The surprising thing about GPT-5.5, the new OpenAI model out today, is how few of those trade-offs it asks you to make. It's much faster than Opus 4.7, easier to collaborate with, better at writing than any OpenAI model we've used since GPT-5.4 and 4.0, and the strongest model we've tested on our new senior engineer benchmark. For a long time, they write, OpenAI looked like it was trying to be everywhere at once: Sora for video, Atlas for browsing, consumer ChatGPT features, creative media tools, and whatever else might turn AI into the next mass-market platform. Meanwhile, Anthropic doubled down on work, and Claude became the default for coding agents, long-running engineering tasks, and professional workflows. GPT-5.5 is OpenAI's clearest bid to reclaim the code and work narrative. It does not win everything. Opus 4.7 seems to write better plans and have a superior eye for design and product details, but GPT-5.5 is faster, steadier, and easier to trust for everyday professional work. Ben Davis, who works with Theo on his YouTube channel, writes: the best code I've ever seen an AI write came from this model. Feels way better to talk to than 5.4 did. Still kind of has that GPT cringe, but dialed back. Overall, this is 100% my new everything model. Pietro Schirano goes farther.
GPT-5.5 is the highest leverage tool I've ever touched, he writes. For the first time, I don't feel limited by what a model can do. I feel limited only by what I can imagine. The most interesting, nuanced views came from people who tried to explain the weird idea that while it is a big leap forward, for a big portion of users it's not really going to feel like it. Matt Shumer writes: I've been using GPT-5.5 for the last few weeks. It's a massive leap forward. But the weird thing is, for 99% of users, it probably won't matter. In his review essay, Matt writes: the honest reaction is a little weird. This is the first time where the upgrade feels relatively large, but most of the time it does not matter that much. Not because the model is disappointing, but because the last set of models was already so good. Basically, he says that although it is better in all of these different ways, that does not, in his words, always translate into a dramatic change in his daily workflow. Quote: if I ask it to build something normal, it crushes it. But GPT-5.3 Codex already crushed it. GPT-5.4 already crushed it. Opus often crushed it. The ceiling is getting so high that a lot of normal work does not stress the models anymore. Where he argues the real value lies, then, is in the rounding out of capabilities that weren't so great in OpenAI's models before, with design as his clearest example. Allie K. Miller put it in terms of knowledge professionals, writing: there is a certain class of models, one that we're hitting now, where unless you're deep in code or scientific research, you might not even notice a difference. Now, let's talk about some specific use cases. And let's start with coding, given that, A, it's so important for so many different types of use cases, and, B, there was that discussion around that weirdly low SWE-bench Pro result. TLDR: people are finding this is a very good coding model. You heard some of that in the initial reactions, but some of the independent testers are finding it as well. Entrepreneur Bindu Reddy writes: GPT-5.5 tops LiveBench. It's an extremely good model on both benchmarks and in practice. It tops benchmarks in most categories and is an insanely good instruction follower. In practice, this makes GPT-5.5 better than Opus 4.7. CodeRabbit writes: we've been testing GPT-5.5 in early access and are excited by its performance in code review. In our evaluation, it delivered a more direct review flow, stronger signal, and better performance on the issues that matter most. Headline result: 79.2% of expected issues found versus a 58.3% baseline. Entrepreneur and engineer Flavio Adamo writes: is GPT-5.5 better than 5.4 at code? Yes. Not because it suddenly turns every prompt into some magical perfect implementation, but because it seems to understand the shape of the request better. It writes cleaner code. It touches fewer things it does not need to touch. It is less likely to over-engineer a simple change. And most importantly, it feels like it wastes less time. I think everyone who uses coding agents has seen this happen. You ask for a small fix, and the model technically solves it, but it does so in the most annoying way possible. It adds an abstraction you did not ask for, changes unrelated files, rewrites some logic that was already fine, and suddenly your quick fix becomes something you now have to review carefully because the model got a little too excited. With GPT-5.5, I've seen less of that.
I do not know exactly how to explain it, but a model can be smart and still tiring to use. GPT-5.5 feels less tiring. Now, one specific aspect of coding that people have pointed out comes from Peter Gosteff from Arena.ai, who writes: GPT-5.5 is much more reliable on longer-running tasks, for the first time with any model. As we speak, I have a migration running for over 7 hours. This literally never happened before. The model would maybe run for 30 minutes, or if you really shout at them, for 2-3 hours. Last night I went to sleep, set a long-running task, then queued up 10 prompts to keep it going. It did not stop after the first prompt and kept going for 8-plus hours, and I woke up to the same prompts still queued up. The ability to run for a long time, in combination with the ability to validate with computer use and other tools, makes it much more useful for building real applications. Adam McLaughlin from OpenAI found something similar. He wrote: over break, I dictated to 5.5 for a few minutes describing a new ambitious RL run. Hit send and forgot about it as I hung out with friends and boyfriend for a few days. Returned on Monday to an industrial-scale RL run humming after it worked for 31 hours. Now, one of the things I've talked about in the past with Codex and OpenAI models is that they have historically been very, very bad at design. Did that change? Yes-ish is what I would say. First of all, the native capabilities for design and front-end are better in 5.5 than they were in 5.4. More important than that, however, there are just other ways to integrate those capabilities. First of all, you can use skills in Codex, but even more than that, it's pretty clear that the workflow is GPT Images 2 for concepting UI, and then 5.5 and Codex for implementing it. And with that, you can get something much better. Although still, I think in general, the broad perception is that Opus retains a lead when it comes to just pure aesthetics. Another area where I saw Opus still have the lead, according to a few different reviewers, was planning. This is something Every said; remember when they wrote that Opus 4.7 seems to write better plans? Siqi Chen from Runway said something similar: Opus 4.7 at extra high to plan and GPT 5.5 at high to execute is the optimal setup. I know Opus to plan and GPT to execute has been optimal for some time, but the release of 4.7 and 5.5 in particular has really widened that gap against a mono-model setup. Now, for what it's worth, I'm about to get into my own tests. I have certainly found this iteration of 5.5 in Codex to be much better at planning than previous versions, but I haven't had the chance yet to compare it against this sort of multi-model setup. What about knowledge work tasks? On presentations, Simon Smith writes: my first test of GPT-5.5 PowerPoint creation in Codex really runs the range from incredible to what the hell is that? I asked it to pick a topic, craft a Nancy Duarte-inspired narrative on it, generate images to develop a design language, and create slides that reflected that design language and included a range of visualizations. It chose the haptic internet. The good: it generated a mood board and four visuals in one go, and the mood board it generated was really good. It worked autonomously for over 16 minutes, just iterating across image generation, presentation construction, and presentation QA. I told it to use any font on my machine that would work in PowerPoint, and it hunted them down.
This, as an aside, is a huge thing all on its own. Anyone who has ever tried to export a great-looking design into PowerPoint, only to see the available fonts completely break it, will know. Overall, though, Simon says, I still don't get the sense that it has great design taste. He also pointed out that there wasn't a ton of visual variety and that it maybe used too many fonts. And then this one, which is one of the most annoying things across all models to me right now. Simon writes: it references the prompt within the text in a very break-the-fourth-wall kind of way. This happens a lot, where a model explains what it's doing, not out loud but in the copy of an asset. I find that this happens a lot, especially when you're refining something. So, for example, if I've told Claude Code or Codex to stop trying to connect all the dots between three different ideas in a set of web copy, it will often do things like write a new header that says "not trying to connect the ideas, just simple, clean, separate thoughts," which is obviously completely not the intent. I don't really know exactly what to call that, but it happens a lot, and it's something that I would love to see go away. Other people found it was good for other things, like spreadsheets. And overall, on their knowledge work evals, the 5.5 model saw a 10 percentage point jump in accuracy on enterprise content tasks compared to GPT-5.4. All right, folks, quick pause. Here's the uncomfortable truth. If your enterprise AI strategy is we bought some tools, you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise, how work gets done, how teams collaborate, how decisions move, not as a tech initiative, but as a total operating model shift. And here's the real unlock. That shift raised the ceiling on what people could do. Humans stayed firmly at the center while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce. If you want to understand what that actually looks like in the real world, go to www.kpmg.us.ai. That's www.kpmg.us.ai. Blitzy is driving over 5x engineering velocity for large enterprises. A publicly traded insurance provider leveraged Blitzy to build a bespoke payments processing application, an estimated 13-month project, and with Blitzy the application was completed and live in production in six weeks. A publicly traded vertical SaaS provider used Blitzy to extract services from a 500,000-line monolith without disrupting production, 21 times faster than their pre-Blitzy estimates. These aren't experiments. This is how the world's most innovative enterprises are shipping software in 2026. You can hear directly about Blitzy from other Fortune 500 CTOs on the Modern CTO or CIO Classified podcasts. To learn more about how Blitzy can impact your SDLC, book a meeting with an AI solutions consultant at Blitzy.com. That's B-L-I-T-Z-Y dot com. Today's episode is brought to you by Granola. Granola is the AI notepad for people in back-to-back meetings. You've probably heard people raving about Granola. It's just one of those products that people love to talk about. I myself have been using Granola for well over a year now, and honestly, it's one of the tools that changed the way I work. Granola takes meeting notes for you without any intrusive bots joining your calls.
During or after the call, you can chat with your notes, ask Granola to pull out action items, help you negotiate, write a follow-up email, or even coach you using recipes, which are pre-made prompts. Once you try it on a first meeting, it's hard to go without. Head to granola.ai slash ai daily and use code AI daily. New users get 100% off for the first three months. Again, that's granola.ai slash ai daily. This podcast is brought to you by Mercury, banking designed to work the way modern software does. One thing I've always found weird as a founder is that almost every tool you use to run a company is modern. Your analytics tools, your email tools, your AI tools, they all feel like software built in, you know, the last decade. Then you go to banking and suddenly it feels like you've time-traveled back to the 70s. That's why I use Mercury. It's business banking that actually works like the rest of the tools founders rely on. Clean interface, everything where you expect it, and basic things like wires, cards, or permissions taking a couple clicks instead of a phone call and three forms. For the whole AIDB ecosystem, it is just dramatically simpler. You can see everything from the dashboard, control spend, and give the right people access without handing over the whole account. If you run a company and you're tired of banking feeling like the one tool that never modernized, check out Mercury. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., members FDIC. Now, I want to get into my tests, but the last discussion point that was really prominent on the internet in the wake of the release of 5.5 was around how different the OpenAI communication felt and the clear narrative repositioning that's going on. It seems very apparent to me that OpenAI is picking up on the signal that, one, people are a little bit annoyed by Anthropic's approach of telling us all about a super powerful model but then not giving people access. And two, even more, people are really annoyed about performance issues with Anthropic models, presumably due to resource constraints. Contrasts to both of those things run throughout OpenAI's communications around this release. For example, in one tweet, Sam Altman writes: we believe in iterative deployment. Although 5.5 is already a smart model, we expect rapid improvements. Iterative deployment is a big part of our safety strategy. We believe the world will be best equipped to win the team sport of AI resilience in this way. Now, to be clear, that is something that OpenAI and Altman have always talked about, but they're definitely putting an exclamation point on it right now. As witnessed by the next bullet in that same tweet, where Sam writes: we believe in democratization. We want people to be able to use lots of AI. We want our users to have access to the best technology and for everyone to have equal opportunity. We've been tracking cybersecurity as a preparedness category for a long time and have built mitigations we believe in that enable us to make capable models broadly available. He said that directly to Dario Amodei. Not really, but you get the point. There is also a lot of emphasis on OpenAI's compute resources. In another tweet, Altman said: really excellent work by the inference team to serve this model so efficiently. To a significant degree, we have become an AI inference company now. And his overall announcement tweet was really simple.
GPT-5.5 is here, he wrote. We hope it's useful to you. I personally like it. A new at Luru writes: this is a very different kind of comms. Discuss. Benjamin De Kraker writes: OpenAI seems to have dialed back their hype machine and just focused on building and shipping excellent models lately. That's a breath of fresh air and a winning strategy. A little more pointedly, Justine Moore from A16Z writes, crazy how you can just ship a model without a giant PR campaign to scare the crap out of everyone first, retweeting Sam Altman's simple we-hope-it's-useful-to-you tweet. Cree Beauvoir writes: this feels like someone inside OpenAI is doing work. They realized that Anthropic and Dario were gaining more traction, mostly because they have a good product, but also because people like them and want them to win. First there was a night of funny drunk tweets, and now this new product announcement feels noticeably more personal and, dare I say, humble. My take: this is going to be a war of authenticity. Alex Kantrowitz actually asked whether this was the fingerprints of the TBPN acquisition. Now, ultimately, when it comes to new models, there is simply no substitute for testing them yourself, especially now that performance is so high across so many different dimensions. One of the real cheat codes is knowing which models you prefer for your different use cases, because in many cases, it won't just be one. Now, of course, not everyone is in a position to pay for multiple models, and so part of the goal here might be to select the one that is mostly the best for you. But regardless, the point is that there is simply no substitute for trying it out. So for me, I did about nine or ten tests across a pretty wide array of fairly common use cases for me. The first was script prep for my wife's podcast. She does a true crime show and produces an immense amount of research around it. And so I used this to test both ChatGPT's research abilities as well as its writing abilities. And really, writing was the one I was more concerned with. I don't even remember the last time I used an OpenAI model for writing over Claude. And while I don't have some definitive result, I will say that what 5.5 did with this assignment, way better than any recent model, is that it actually took the instruction to be clear, simple, and journalistic in its writing, and did that rather than trying to add a bunch of dramatic flair. One of the problems I've often had with Opus, especially 4.7, even more than 4.6, is that it tries way too hard to lean into whatever dramatic style the writing calls for. It has AI-affectation fingerprints all over the writing, and I spend half my time trying to beat it down to just get the simple, basic starting point. Ultimately, all of the key writing and voice details are going to come from my wife, and so the goal of this step is just to have a very simple, basic narrative flow to build off of. And it did a good enough job that I will definitely be testing 5.5 at least for other writing use cases. Now, as I'm going through these, you may note that I did most of this in Codex. That is by no means a requirement, but I would say that if you haven't invested in experimenting with Codex yet, this might be a good time. It's very clear that OpenAI is putting a ton of emphasis on it as the core workspace not only for coders, but for knowledge workers who are using GPT models. And the shift to a new model is a pretty good time to start digging in and figuring out how it works for you.
Now, one thing that I'm going to take advantage of, which I have not fully gotten up and running yet: you might remember when the Codex app first came out, one of the things people were talking about was its better approach to compaction, i.e., taking long context that is running up against the limits of the context window and compacting it so that you can keep the conversation going. OpenAI apparently has made developments in that area that allow people to have just an ongoing single thread and use it in pretty different new ways. Specifically, if you go look at my Claude, I have a whole project that I call meta-planning for all big-picture-question-type things. But what people are experimenting with in Codex specifically is the monothread, where instead of a bunch of different conversations split across a project, it's just one long thread that keeps all the context and takes advantage of that compaction to not run out of the context window. I haven't done this yet because it's going to involve a fair bit of time investment to get its background context on me up to speed. But what I'm going to do first is have it interview me to create a broad understanding and an outline of who I am and what I'm working on, and then I'm going to experiment with using this single, continuously updated thread as a way to think through and iterate on strategic questions.
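Since compaction is what makes that monothread viable, here is a minimal sketch of the general pattern, strictly as an illustration: this is not OpenAI's actual implementation (which isn't described in that detail), and the summarize callback, the character budget, and the keep-the-last-N-turns rule are all assumptions invented for the example.

```python
# Illustrative sketch of context compaction for a long-lived "monothread".
# NOT Codex's real mechanism; just the general idea: when the running
# conversation approaches the context window, fold the oldest turns into a
# summary so one thread can continue indefinitely. `summarize` stands in for
# a model call that condenses a batch of messages into a short recap.

from typing import Callable

def compact(messages: list[str],
            summarize: Callable[[list[str]], str],
            max_chars: int = 100_000,   # stand-in for a real token budget
            keep_recent: int = 10) -> list[str]:
    """Fold older messages into a single summary once the thread gets too big."""
    if sum(len(m) for m in messages) <= max_chars or len(messages) <= keep_recent:
        return messages  # still fits, nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # e.g. an LLM call: "condense these earlier turns"
    return [f"[Summary of earlier conversation]\n{summary}"] + recent

# Usage: run compact() before each new turn, so the most recent turns stay
# verbatim, everything older lives in an ever-refreshed summary, and the
# thread never hits the context window.
```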
Speaking of strategic questions, I'm working on an experimental sponsored episode that'll come out a couple of weeks from now. And one of the things that I'm really keen on doing is integrating resources alongside sponsorship, so that when a company is sponsoring the show, they could also be sponsoring additional resources that turn that show into more value for you guys as the listeners. This gave me a chance to do a couple of different things with 5.5. First, I got to test its creative capabilities and how it aligns its ideas with broader strategic goals. And second, I got to go directly from those ideas into project planning and then actual web app execution in Codex. The episode is about the frontier of how humans and agents can collaborate and what that looks like inside an enterprise context. And so we're working on a companion kit that has a set of different resources for companies: figuring out where their team is by mapping them to a set of archetypes that can help them understand what they need to do, figuring out what context gaps their agents and AI tools have, designing at least one agent-shaped workflow, and moving one use case beyond chat. Now, these are all themes that are directly in that episode that we are turning into interactive elements. And I found 5.5 in Codex to be a quality collaborator at all steps of the process. In terms of both creativity and strategy, I was pretty impressed, especially relative to 5.4. I would effectively never turn to 5.4 for something like that; honestly, kind of still, 4.6 would always be my default. But what was interesting about 5.5, and I was using thinking mode at the time, was that not only was it pretty quality in terms of its ideas and thought process, but it was really fast. I got to experience that speed that other people were talking about. And honestly, especially when you're in an iterative mode, speed is really, really valuable. Now, I did have to go back and forth a bunch of times on the UI. In the first version, for example, it had really kind of junky brown colors, and it also did this very weird thing where instead of telling the story of why this artifact existed, it was just this very clunky survey-based UX. And so honestly, I just installed a set of skills focused on front-end design and UI/UX. And when push comes to shove, ultimately, while it is useful to know how natively good a model is when it comes to things like that, we are now officially in the era where anything you do is going to be model and harness together. And so practically for me, it's more useful to know how well 5.5 can take advantage of a skill than what it can do natively, because I know I'm just not going to use what it produces without that skill. And with the skill, while we're not done yet, I think it's doing a much better job and I'm quite encouraged. Like I said, in a couple of weeks you will get to see the output of that. A couple more visual things. To test research, aesthetics, and slide design, I told 5.5 that I wanted to learn about some underexplored topic around the golden age of piracy. I asked it to propose a topic, research the topic, and turn it into an art book using public domain oil paintings. It did a pretty good job; you can see the visual here. There are some errors, although what I've found is that it's fairly good at correcting those things. Now, I will say that this is nowhere near revolutionary in terms of design quality, and it feels, at least for now, fairly unlikely to me that PDF outputs are something I'm going to be reaching for this particular model for, although again, with changes to the harness, that could change. I also had it take the AI Daily Brief media kit and update it, both to have a consistent visual style of its choosing, which it did fine but which I didn't think was particularly better than what I had (in fact, I thought it was worse), and it did a better job making some arguments for how to have stronger framing and, frankly, pitching in the media kit for why sponsors should care about the show. On another, more comprehensive build than the companion site I was just telling you about, I turned to Codex to help me with a new jobs portal for AIDB that isn't just an interface for submitting information, but actually has a backend where I have the top models from OpenAI and Anthropic debating so that I can automate a shortlist, which is essential because anytime I post a job, I get hundreds and hundreds of responses.
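For what it's worth, here is roughly the shape such a debate backend could take. This is a hypothetical sketch of the idea, not the actual code behind the portal: the advocate, critic, and judge callables, the round count, and the prompt wording are all invented for illustration, with the intent that advocate and critic would wrap calls to two different frontier models.

```python
# Hypothetical sketch of a "two models debate, then shortlist" backend.
# None of this is the episode's actual code. `ModelFn` is any function that
# takes a prompt and returns a completion; in practice, advocate and critic
# would wrap two different providers' APIs, and judge a third call.

from typing import Callable

ModelFn = Callable[[str], str]  # prompt in, completion out

def shortlist_decision(application: str,
                       advocate: ModelFn, critic: ModelFn, judge: ModelFn,
                       rounds: int = 2) -> str:
    """Run a short for/against debate over one application, then judge it."""
    transcript = f"Application:\n{application}\n"
    for _ in range(rounds):
        transcript += "\nFOR: " + advocate(
            transcript + "\nArgue briefly why this candidate should be shortlisted.")
        transcript += "\nAGAINST: " + critic(
            transcript + "\nArgue briefly why this candidate should NOT be shortlisted.")
    # The judge reads the whole debate and returns a verdict with reasons.
    return judge(transcript + "\nVerdict, SHORTLIST or REJECT, with reasons:")

# Mapped over hundreds of applications, the judge's verdicts become the
# automated first-pass shortlist.
```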
The process so far is really good. To be clear, because I'm coming at this from a non-technical perspective, I don't really have the ability to know how this code compares to what 5.4 would have written. And I also think that this falls into that category Matt Shumer was talking about of fairly easy build tasks that any of the last few generations of models could have done really well with. What I can say is that the experience of using Codex for this was very smooth. The auto-review mode kept it from asking me too many questions, so it could kind of just work in the background. Finally, one thing it absolutely crushed: I dumped in an absolute boatload of data, basically 10 or 12 different charts from both Apple and Spotify about the show, and asked it to analyze it and give a bunch of insights. It did a great job at this, enough that I actually also asked it to then think about how that should inform podcast strategy going forward. And this is not something that I've gotten great results from LLMs on before. Mostly I've found that they give very stereotypical advice that would befit any podcast rather than AIDB specifically. It was much better than that. And on top of that, when I asked it to turn all of this data into a spreadsheet that organized all the information, it did that really well too, getting me pretty enthusiastic about what it can do from a data analysis and spreadsheet usage standpoint. So the TLDR on all of this is that my first impressions are very positive. For a long time now, six months or more, really kind of since Opus 4.5, I've never fully stopped using ChatGPT, but Opus models have definitely been the daily drivers, and Claude Code has been the main building app. I would not go so far as to say that I'm 100% sure that's going to shift overnight, but the combination of my pretty positive initial impressions of 5.5 and the improvements in the harness that come with the Codex app means that at least for the next period, I anticipate doing a lot of jumping back and forth and seeing which model and which harness does better on particular tasks. From a strict competitive standpoint, you've got to think that this model release, in this moment, is a win for OpenAI. Cremio summed up the feelings of a lot of folks when they wrote: model update, Opus 4.7 is so lazy that it's worse than 4.6; GPT-5.5 is a good model and it's gotten much faster. And just to be clear that this isn't just sour grapes or model preferences expressed more aggressively, on the same day that 5.5 came out, the team at Anthropic published a post-mortem around recent Claude Code quality issues, and the TLDR is that people weren't just imagining things. Now, if you want to read all the specifics, that is available on Anthropic's website. And I think it is absolutely to their credit that they are digging into these things and trying to fix them rather than just pretending they didn't exist. But the response from the enfranchised Claude Code users has been a very loud, I told you so. Theo again writes: confirmed that Claude Code got dumber, not Claude. They shipped slop and it made the models worse. Solopreneur Pieter Levels wrote: I can't believe we were right. Claude was dumbified on March 4th, just when we noticed. And even taking a step back from that, people seem to be pretty bullish on OpenAI's resurgence when it comes to the competition. Jason's Chips writes: Gonna call it now. OpenAI's GPT-5.5 and insane new Codex features will cause a market share recapture and narrative shift. Private market valuation will overtake Anthropic again, and their quote-unquote reckless compute spending from six months ago gives them a capacity advantage that will keep it that way. Now, I would certainly not count Claude out yet. Just yesterday, they launched a feature which I am extremely excited to see the impact of: memory on Claude Managed Agents. And I think that if you are sitting there from a user perspective, the beneficiaries of this intense competition are 100% us, who get better models, better harnesses, and better applications that actually allow us to do more and new things. And finally, it feels like this might be the beginning of more to come. A lot of people compare this moment to o3, but NoMoreID thinks o1 is actually the better comparison. They write: As I've been thinking lately, GPT-5.5 seems to be the initial RL checkpoint of their new pre-training model. So in a way, it probably makes more sense to see it as something closer to o1-preview or o1.
You can really feel how much they compromised on cost and speed, but they know the recipe, so I think this model's o3 moment will come soon. Ethan Mollick explored similar themes, calling this a sign of the future. He writes: I had early access to 5.5, and I think it's a big deal. It is a big deal because it indicates that we are not done with the rapid improvement in AI. It is also a big deal because it is just plain good. And it is a big deal because even with all of this, the frontier of AI ability remains jagged. Jumping ahead, he concludes: 5.5 shows us that the models keep getting smarter, the apps keep getting more capable, and the harnesses keep getting better, making them ever more effective at solving real problems. A year ago, none of this was close, and with the latest releases, capability gains appear to be accelerating. GPT-5.5 is clearly not the end of this process, but it is a noteworthy step along the way. And all indications from the OpenAI team seem to be that more is on the way. When asked by reporters whether the pace of model releases would increase going forward, OpenAI chief scientist Jakub Pachocki said: yes, we expect quite rapid continued progress. We see pretty significant improvements in the short term, extremely significant improvements in the medium term. I would definitely expect that we will continue to see the pace of AI capabilities improvement keep increasing. I would say the last few years have been surprisingly slow. Putting that even more clearly, President Greg Brockman said: What 5.5 represents is not an endpoint. In many ways, it's a beginning point. It's really a step towards the kind of models that we see coming over even just the upcoming months. And I think that you should expect that we are going to have even larger improvements in capability across a wide variety of these aspects of what the model can do. So there you have it, friends. That is the first look at GPT-5.5. I expect that in the next couple of days, we will see people finding more things that it does incredibly well, but we will also start to find all the different chinks in the armor and the things that we hope get fixed in the future. For now, I will say that your weekend just got a lot more fun and probably a lot more productive. So thanks as always for listening or watching, and until next time, peace.