Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today, we're going to bring you up to date on all the new coding model releases from OpenAI and Anthropic. In case you missed it, OpenAI released Codex last week, their desktop app for AI engineering, plus the new model GPT-5.3 Codex, try saying that five times fast, and Anthropic released their response, Opus 4.6 and Opus 4.6 Fast. If you're new here, you may not know this, but when these new models come out, I put them through their paces. I test them side by side on the same task, and I'm going to give you my opinion about where they do well, where they fall apart, and which one goes where in my AI engineering stack. Spoiler alert: I've shipped more code in the last five days than I think I have in the last month. So I think these are pretty fabulous models, but they do have their quirks, they do have their strengths, and sometimes they go off the rails. Let's get to it. This episode is brought to you by WorkOS. AI has already changed how we work. Tools are helping teams write better code, analyze customer data, and even handle support tickets automatically. But there's a catch: these tools only work well when they have deep access to company systems. Your copilot needs to see your entire code base. Your chatbot needs to search across internal docs. And for enterprise buyers, that raises serious security concerns. That's why these apps face intense IT scrutiny from day one. To pass, they need secure authentication, access controls, audit logs, the whole suite of enterprise features. Building all that from scratch? It's a massive lift. That's where WorkOS comes in. WorkOS gives you drop-in APIs for enterprise features, so your app can become enterprise-ready and scale upmarket faster. Think of it like Stripe for enterprise features.
OpenAI, Perplexity, and Cursor are already using WorkOS to move faster and meet enterprise demands. Join them and hundreds of other industry leaders at WorkOS.com. Start building today. Okay, to start, when I'm evaluating new models I like to pick a task that's pretty ambitious, something I definitely wouldn't want to do by hand, and consistent enough that I can actually compare the pros and cons of each model side by side. And I picked a task that I choose often when comparing these models, which is: redesign my marketing site. I think all these models are pretty good at one-shotting a landing page or a marketing page, a simple app. I don't feel like that's a practical evaluation criterion for these new models. I like to take a code base that's relatively complex, or at least established, and compare side by side how these models work inside it. So I took my ChatPRD homepage and marketing site. It's got lots of pages. It's got a blog. It's got the How I AI workflows on there. It's not a simple app, even though it's just kind of a content front end. And I want to bring it up to my 2026 ambitions, which are all about the enterprise. So while this website looks great, it's cute, it's got nice colors, it's definitely more focused on the PLG, individual-user workflow. And I want to up-level this as we sell more to enterprise customers. I'm going to have these models duke it out and see which one does the better job. And I'm going to test these in the order they came out. So the first thing that came out in our very busy week last week was Codex. Now Codex, as I said, is OpenAI's desktop app for coding. And before we get into it, I want to show off some of the things that I think make Codex unique. First of all, Codex is organized around Git primitives. Now, if you don't know Git, or you're not technical, or you're a new software engineer, you've probably run into some concepts of Git as you got started vibe coding.
But I just want to walk through a couple of things that might be useful for you to know. The first is the idea of a Git repository. That is basically a whole code base that represents an app or a project. Git repositories are represented over here in Codex as projects. You can see I have different repositories here that I'm working on, including my ChatPRD website, the www website. Then, in your repo, you can start working on new code. And there are two main ways you can take code and keep it contained so that when you edit it, it doesn't break your production website. The first way, which I use a lot, is branches. Branches are, as the name suggests, offshoots of your code that you can make changes to, commit, and then ultimately decide to merge into production. There's also the concept of worktrees. These are full copies of your code base that you, or an agent, would use to make changes. One of the benefits of worktrees versus branches is that you can have many of them going at the same time on the same machine. And so if you're working with a lot of agents, you could give each agent its own worktree to work on, and they could do a lot of work in parallel without running into each other or causing issues. If you want to learn more about worktrees, definitely watch our episode with Alex from OpenAI on Codex, the terminal app, where he goes through how he uses worktrees on a daily basis to kick off his agentic work. And then up in the top right, you can see we have a diff panel. A diff is, again, the difference between what you had and what you have now. You'll see red is code that was removed, green is code that was added, and you can see up here the count of lines changed, either added or removed. And then you can create pull requests from Codex. Pull requests are kind of a signal to your team that says: this code that I'm working on is ready to be part of the main production branch. Can you pull it in? I'm requesting it.
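To make those primitives concrete, here's what the same concepts look like in the raw Git CLI. The repo and branch names are made up for illustration:

```shell
# Set up a throwaway repo so this example is self-contained.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# A branch: a lightweight line of development you can commit to
# and later decide to merge back into production.
git branch redesign-homepage

# A worktree: a full checkout of that branch in its own directory,
# so a second agent can edit files in parallel without touching
# your main checkout.
git worktree add -q ../agent-1 redesign-homepage

git worktree list   # shows the main checkout plus ../agent-1
```

From there, `git diff` is the red/green view a diff panel renders, and opening a pull request (for example with GitHub's `gh pr create`) is how you ask your team to pull the branch into main.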
And often that's where your CI/CD pipeline, your pre-production checks, go, and where your team with their human eyes tends to look at your code. And you can see here, as I'm talking through this, Codex has put these concepts front and center. And I think that's because they're trying to appeal to two audiences. One, they're trying to appeal to the let-the-tokens-go, highly empowered, use-all-the-agents software engineers that are doing a lot of things at once on their local machine and need to be able to benefit from these concepts of Git, worktrees, local and cloud agents, all that kind of stuff. The second thing is, I think this is actually a really good framework for folks that are less technical to learn the concepts of Git. I have always said you should invest in the GitHub desktop experience. This is a version of that. It's what I use all the time to manage my work across branches and across files. I could work in the command-line tool for GitHub; I just think it's nice to be able to see your changes and really know what's going on. And so Codex has brought some of these visual, UI concepts of Git into the Codex app. So it's nice if you're learning. The second thing you'll see in Codex that is a little new and unique compared to other apps is the concept of bringing skills up as a first-class citizen. If you are new, skills are sort of a packaged set of prompts, instructions, reference files, and code that can be called by an agent to consistently execute a task over time. If you want to be really cheap about it, it's like a bundled prompt. And you can see here that OpenAI and Codex have given skills a home, and they've given them icons, and they've given them buttons. And I have to say, I love this. If you watched my early episode when skills first came out, I was so exasperated that skills were like a zip file that you had to upload somewhere or put in your repository.
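If it helps to see the "bundled prompt" idea on disk, here's a minimal sketch of a skill as a folder: an instructions file up top, optional reference files alongside. The layout (a SKILL.md with a small frontmatter header) follows the common agent-skills convention; the exact format Codex expects may differ, and the skill name here is invented:

```shell
# A skill is roughly a named folder of instructions plus reference
# material an agent can call on. (Hypothetical example.)
mkdir -p skills/release-notes/references
cat > skills/release-notes/SKILL.md <<'EOF'
---
name: release-notes
description: Draft release notes from merged PRs since the last release.
---
Summarize each merged PR in one sentence, grouped by feature area.
Match the tone of the examples in references/past-notes.md.
EOF
echo "(past release notes would live here)" \
  > skills/release-notes/references/past-notes.md
```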
This just makes it a much more visual experience to add skills to your code base or to your system and refer to them over time. I also like that OpenAI shipped a bunch of recommended skills that a lot of people could benefit from, so you can get your mind wrapped around what kinds of skills would benefit your AI work. The final thing that I think OpenAI put front and center in Codex that's interesting is this concept of automations. Automations are basically tasks that can run on a schedule. You can see here, when you create a new automation, you give it a name, you say what project it needs to run on, you basically give it a prompt, it's not that fancy, and then you give it a schedule. And again, like skills, OpenAI has shipped a bunch of out-of-the-box automations. Now, my reaction here was: I'm already doing a lot of this stuff. You know, I'm a little ahead of the curve when it comes to some of the automations around my code base. So I've solved these problems, but I think everybody should solve these problems. So if you're looking for inspiration on what kinds of automations would benefit your code base, the recommended automations in Codex are a really good place to start. But let's get to actually writing code. Now, I have to say one caveat, which is that I ran this process using GPT-5.2 Codex, which was the recommended model when this app came out. Very quickly, they came out with 5.3, and we'll see that towards the end of the episode. But I do want to call out that this is a slightly older version of the model, though I think the models in this family, given my experience, have very similar output. So I would say I would probably get the same experience with 5.3. Now, what is the test case we're going to do? As I like to do, we are going to redesign the ChatPRD site. Last time, when some models came out, we redesigned a page.
But we've been pitched that these new models are more independent, can do more long-running tasks, can handle more. And so I wanted to take an existing code base and redesign the whole thing, and I'm going to trust these very smart models to do it without too much prompting. So that was my test case. I wanted to take this homepage and this website, which is lovely but very PLG-focused, and make it more polished, more up-leveled for an enterprise audience. And so I started that in Codex, and I gave it a pretty high-level prompt that I thought it could run with, which said: optimize the marketing site in this repo for PLG plus enterprise. You can create new pages, redesign templates, et cetera, to make it the highest-quality marketing site I could have. And then I listed a bunch of sites that I really like. If you're on this list, I think you have a nice website. Now, here's where it immediately disappointed me. And I'm sad to say it, but it did. One of the things that I've noticed about the GPT-5.x Codex models is they are so literal. They are so literal. And so they follow instructions very well. And I know that is, in many instances, a feature, not a bug. You want your model to follow your instructions explicitly, but you don't want it to follow them blindly. And that's what I found. I found that the Codex app harness plus the Codex models were just too literal to do greenfield or creative, broad work on my behalf. They will do high-quality coding work; I will get to that soon. But when you tell these models, hey, go and do X, I often found that the combination of being too literal and not pushing me to the next step, not actually saying, are you ready for me to build, meant that it was much more painful and slower to get work done with these models. And this is really ironic, because the 5.3 model is actually pretty fast. And so it should feel faster to code with it.
But the actual back-and-forth conversational experience was really challenging. You'll see some of that here. So I said redesign the website, we went back and forth on how to use the Figma skill, it didn't actually pick it up well, so I just gave up on that. And then I asked it to redesign the page, and it did it. Now here's where example number one of being too literal came in. I had told it I wanted to redesign the marketing site for a combination of product-led growth and enterprise. Basically, I wanted a marketing site that'd be friendly to users, but would also help our sales team bring in inbound leads. And it built it, and it literally had explicit references to PLG and enterprise in the copy. It was like, if you're here for product-led growth, click here and sign up. If you are here as an enterprise customer, click here and talk to sales. It was so explicit. And this was my perpetual cycle with Codex on this redesign. We went back and forth. I gave it some design help. I asked it to design a couple of things on styling. At some point I said: the design's okay, but it could be better. Take more inspiration from the sites I offered. Make the copywriting top tier. I've spent two million dollars on it. You can see some of my desperate prompting here, just trying to figure out what is the unlock. Is it a technical-spec unlock? Is it a find-reference-content unlock? Is it an identity unlock for these models? I couldn't figure it out. So I kept trying. And what was really funny is, every time I would say something, it would overfit to my prompt. So when it gave me a website that I generally liked and I said, hey, can you add more about integrations? Our enterprise customers really like integrations. It made the entire page about integrations. If I said, hey, I want to focus a little more on enterprise, it would make the entire page about enterprise.
It really didn't have that nuance of what goes where and how to build a balanced experience; it was really overfitting to my last prompt. And I was saying, we don't need to list exactly everything. I was trying to give it explicit examples, and then it put a long list of all those examples. It was just having a really hard time editing itself. And then I'm going to give my favorite example of Codex being way too literal. It had created something that I thought was fine, but it was a lot of images and not a lot of content. And I said, hey, I'd like a more content-dense site, like Hex. Hex, you have a lovely site, I think you did a really nice job. I just want more copy on there, because I want to be more technical, more detailed, more precise about what the value of my product is. And after two prompts, it literally made the headline "a dense product workflow for AI-powered teams." And I was like, oh. I made the facepalm-emoji face. I was like, why in the world would you say that our product has a dense workflow? I asked for a content-dense site. I didn't say make our content all about how dense our product is. So I just had a really tough time with Codex and the GPT-5 Codex model on this particular task. We eventually got there, and I would say the output was okay. So this was the before, and this was the after from Codex. I really liked this headline that it came up with. It was one of the things buried somewhere down the page that I thought was great. It eventually got overwritten by my content-dense headline. I thought some of the headlines were kind of interesting. It looks pretty nice. It pulled some interesting graphics from our repo. It put placeholders in here. You know, I think this is okay. It kind of didn't quite fit our design aesthetic.
And what I was more frustrated by than the literal nature of the GPT model, which I've kind of gotten used to, this is not something that's new to me, is that it really only redesigned this homepage and the enterprise page. I had asked it to redesign the whole site, and it really did not do that. And so again, this promise that it can do long-running tasks and take on ambitious things: it just took a lot more work for me to even get to this two-page redesign, which I thought was okay, not great. Now, the code is great. It's fine. It's not terrible. It's certainly faster and better than what I would have done myself. That being said, I think we can do a lot better. So speaking of doing a lot better, let's go over to my friend Opus. Now, again, spoiler alert, y'all: I love, I love her. I love Opus. And I will caveat by saying I found a place where I really love Codex, so we're going to come back. But as soon as I started getting my hands on Opus, I was just really happy. It didn't start off perfect, though, so let's talk about where it went well and where it kind of went off the rails. Again, I started with the same prompt: optimize the ChatPRD marketing site in this repo for PLG and enterprise. You can create new pages, redesign things, et cetera. Again, I put this content-dense framework in here. I had just come off that bad experience, and I wanted to see what it did. And I will say, Opus 4.6 was just a lot better at planning for itself so that it could execute a long-running task. It did its exploration of our code base and reference marketing sites. It used Cursor plan mode to make a plan. And then it started building the components. Now, I have to give kudos to Cursor. I'm still a Cursor girl. Yes, I could have tested Opus 4.6 in Claude Code. I am sure there are optimizations there. I just hand-to-God think that Cursor does a good job of building harnesses for all of these models.
I think with the combination of planning and to-dos and exploration and the question tool, I just tend to get good results. So there is this open question: was it the model, or was it the Codex harness in the desktop app, which is not as mature as Cursor, that caused that bad experience? I'm not sure. But using Opus 4.6 in the Cursor desktop app was quite nice. Okay, so it's building, it's building, it's building. It runs a build, it gives me a summary. I am very pleased with the independent nature of this model. I'm about to hire her. She can go run my marketing site: you are now my marketing engineer. Except the copy was great and the design was terrible. And unfortunately, I didn't commit at this point, so I lost the design. But it just did not look good. It did not look sophisticated. I was like, I'm going back home to Codex. What are we doing here? It was terrible. So again, I did my desperate prompting here: I want it to look like I spent a million dollars on my design with the best agencies out there. Here are some colors. Then I said, I want you to develop a unique and modern front-end visual style. This is Tailwind Indigo AI slop. If you know, you know. And it agreed with me. It was like, you're right, I gave you generic Tailwind slop. Let me rebuild. And it rebuilt, and it was so lovely. And so we went back and we integrated our design system. It gave me an outline of what it did in terms of design. We had to go back and forth on the build, but eventually I got something lovely. Here was the before, and the after was like this. I love this so much. We're probably going to ship this in the next day or two, hopefully live when the episode goes live. It still matches our brand aesthetic, but just looks so much nicer. It has our colors. She is pink. It uses some of our graphics instead of placeholders.
It calls out some numbers, which is really great for selling the value proposition. It highlights the reviews. And then, instead of what Codex was doing, which was making very blunt statements about enterprise, like "100% security," all this stuff, it gives a really nice value-proposition-oriented view of what would be nice for enterprise, and it redesigned our enterprise page as well. So once I got exactly what I liked, I asked it: okay, let's take these styles and go ahead and redesign the rest of the site to bring it up to match. And it did a really good job. It kept everything consistent. It redesigned our pricing page. It's working on our How I AI page to make sure we're matching some of the designs. I think this looks really nice, and I was super happy with the output. And this is going to be my meta-assessment of Opus 4.6 versus the GPT-5.x models: Opus 4.6 is really good at generative, broad, greenfield work. You want it to implement a new feature? It will go implement a new feature. You want to completely redesign your site? It will completely redesign your site. I was really, really pleased with my experience on this model, and we're probably going to ship this live. Now, this is a much more front-end-focused, design-oriented task. I like this task because we can literally say: okay, what did I start with before? What did Opus come up with? And then even compare that directly to what Codex came up with, which I can refresh and show you here.
I can do a side-by-side, and you can see with your eyes, you can read all the words, and really make a decision about where these models do well. But that is not enough to assess whether these are good models or bad models, whether I like them, whether I'm going to use them or not. And as I go into the next workflow, where I found both models to be super useful, I'm going to admit something that is a little scary and maybe impressive, which is: I asked Devin today, how much code have I merged into GitHub in the last five days? (I need to fix my Devin workspace.) But if you go into it, in the last five days, I have merged 44 PRs containing 98 commits across 1,088 files. I have added almost 93,000 lines of code, I have removed 87,000 lines of code, we've added about 5,000 net new lines, we have released five MCP integrations, we've completely overhauled one of our big components, we've completely refactored our components folder, and we have fixed a bunch of bugs. We have done a lot. And none of this is on the marketing site; this is all in our core application, which is quite complicated, much more complicated than our marketing site. And I did all of this with my two new pals on my team, Opus 4.6 and GPT-5.3 Codex. So I did find a place where these two operate really well, and I am going to talk you through it. As I mentioned earlier, one of the big features that I released recently on ChatPRD was a bunch of MCP connectors. So now, from our chat, you can look at what's happening in GitHub, you can look at what's happening in Linear, you can look at what's happening in Granola, and you can bring all that into your product work. And this is one of probably two dozen tools that we now have available in the ChatPRD app. And we were displaying them all in different ways. All our tools were different. They were individual components. Our code was super, super messy.
And so one of the things that I kicked off in Opus was a refactor of a reused component that I wanted to be able to add to, remove from, and customize, but with shared code. I just knew the way we were doing this wasn't great. So I started an Opus 4.6 task to refactor how we use our tool components. Let's talk about how I actually rebuilt these components and where I used these different models. First, I opened up Cursor, and honestly, this might be the secret sauce in some of these experiences. I opened up Cursor, I built a plan with Opus 4.6 using plan mode, I kicked it off, and I went back and forth with 4.6 on how to build this. And you can see here, I got this lovely, extensible tool component where I could add different things in and give them different links, copy, and language as it went through. It built a bunch of really nice front-end components for me, and I think, honestly, they look lovely. So as we saw before, you get these lovely tool calls here, and they look nice across all of our different kinds of tool calls, whether you're creating documents or anything else. I'm just really happy with this experience. Now I'm ready to push this code to production, and here is where our friend Codex comes back into play, because this is where I love to use Codex. I went back into Codex and I said: I've redesigned tool usage in this repo. It's gone through several rounds of feedback. Can you review the architecture and performance and see if you have any feedback we should consider before shipping? We're looking for something scalable but customizable, and we don't want to overfit in any direction. And it went through and searched and identified a couple of high-impact issues, prioritized those issues for me, and asked me questions. I said, one is intentional, two is an edge case. And it asked me if I wanted it to implement any of the polish. I said yes, and it polished it. It passed our AI BugBot code review, and we shipped this to production. And now this is my flow.
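As a rough sketch, the build-then-review handoff above looks something like this from the command line: capture what the builder model's branch changed, then hand that diff to a second model for an adversarial review pass. Everything here is illustrative, the branch and file names are invented, and the final `codex exec` invocation is left commented out because its exact shape depends on your Codex CLI setup (treat it as an assumption, not a documented recipe):

```shell
# Self-contained demo repo standing in for a real project.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "main"

# The "builder" model's work lands on its own branch.
git checkout -q -b redesign-tool-usage
echo "export const Tool = () => null;" > tool.tsx
git add tool.tsx
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "refactor tool components"

# Capture everything the branch changed relative to main.
git diff main...redesign-tool-usage > review.diff

# Hand the diff to the "reviewer" model (command shape is an assumption):
# codex exec "Review this diff for architecture and performance issues. \
#   Prioritize high-impact problems and flag edge cases." < review.diff
```

The point of the separation is that the reviewing model sees only the finished diff, not the conversation that produced it, which is roughly the position a human principal engineer is in.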
So this was, again, a very front-end-, component-focused workflow. But, for the technical folks out there, we also just completely re-platformed our vector stores. It was a huge, huge thing. It touched 50 files. It was really hard to do without it becoming one huge PR. It required, I don't know, probably 30 rounds of feedback. And GPT-5.3 Codex was so lovely. Love it for code review, architectural review, and finding edge cases. And what I found is: you could ask Opus 4.6 to build something, and it would build something 80 to 90% done, or good. You'd ask Codex to find everything wrong with it, and it would find all the things that were wrong with it. And then you'd take it back to Opus, and Opus would be like, oh, yeah, yeah, bro, you're right, I really missed that thing. I better fix it. And so I do think I'm going to give Codex some love here. I think it's the better software engineer, technically. Opus is kind of the software engineer that you want on your team, though. It actually builds stuff. And so what I've been saying to people about GPT-5.3 Codex is that it really replicates the principal-software-engineer experience, in that you will fight them tooth and nail to build anything for you, but they are more than happy to tear apart someone else's code. So if you are looking for a principal engineer on your team to pair with your eager product engineer, Opus 4.6, definitely, definitely use Codex. And I kind of feel like I can't live without Codex reviewing my code now. So I'm quite happy with this experience. Again, BugBot, which I use from Cursor, does a lot of review of our PRs, and it's also run on the Codex model. So I think it's a really good eagle-eye reviewer; it's just too hard to get it out of the gate building new products. So I really like this flow, and I highly recommend that folks replicate it. I think it's really useful. To conclude our episode, I just want to give a quick nod to Opus 4.6 Fast.
If you have not heard, Opus 4.6 Fast is Opus 4.6, but fast. You can select it here. It's their most powerful model, but fast, and it is expensive: six times the price. I think it's $150 per million output tokens, something like that. I actually used Opus 4.6 Fast a lot, and now I've got to go look at how much I'm spending. What I will say is, while I have consumed the tokens, I am floating through an infinite ocean of tokens; I embrace a token-abundance mindset. I am starting to spend a lot of money on models, which is, at the end of the day, super, super high ROI. Again, if we're looking at this: how expensive would it be for me to ship 44 PRs, really, really huge features? It would take months of time and tons of people, and we probably also wouldn't get it to perfect quality. So I am really bullish that this is a worthwhile investment for my team. But don't mess around with 4.6 Fast unless you're ready to pay the bill. And so I think we're also going to start looking at: where does this fit from a personality perspective? Where does this fit from a capability perspective? And then where does it fit from a budget perspective? And as my friend Cody at Sentry said, if you're playing between 4.6 and 4.6 Fast, don't pick the wrong task, or you're going to get a bill that you're not happy with. So that's today's model-focused episode of How I AI. I have been pairing Opus 4.6, GPT-5.3 Codex, and Opus 4.6 Fast. What I found: you want to use Opus for your product and feature work, being creative and creating high-quality designs. You want Codex catching all your bugs, advising on your architecture, and writing exceptionally high-quality, hardened code. Both of these models have a place in your stack. I still love Cursor for using them, and I'm still a multi-model girl, but I think they do well in either the Codex desktop app, Claude Code, or wherever you like to get your AI-generated code. That is today's episode of How I AI.
I'm looking forward to hearing your feedback about what your favorite model is and where you're using it and we will see you next week. Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.