ChatGPT 5.5: ‘This is just a software update’ | Meredith Broussard

23 min

•Apr 28, 2026

Summary

Meredith Broussard, NYU data journalism professor, argues that ChatGPT 5.5 is merely an incremental software update driven by OpenAI's IPO preparation rather than meaningful innovation. She critiques the company's benchmarking practices as misleading, highlights persistent bias and hallucination problems, and questions whether the environmental costs justify the marginal improvements.

Insights

OpenAI's benchmarking methodology is deliberately constrained to only test problems computers can theoretically solve, excluding harder problems to inflate accuracy metrics—a practice called 'juking the stats'
AI models memorize training data rather than truly understanding, which is why they pass benchmarks containing questions already present in their training sets
Agentic AI capabilities are rebranded existing functionality that simply streamlines multi-step processes, not a fundamental breakthrough
AI companies are either not conducting bias testing or deliberately withholding results, despite robust academic methods existing to measure fairness
Humanizing AI systems as 'digital workers' causes real harm through anthropomorphization, particularly affecting mental health and child development
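
The first insight, that scoring only the problems a computer could theoretically solve raises the reported accuracy, can be made concrete with a small sketch. The numbers below are invented for illustration and are not OpenAI's results or the actual SWE-bench harness; `Task`, the `solvable` flag, and `pass_rate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    solvable: bool  # hypothetical label: could a computer solve this in principle?
    passed: bool    # did the model's output match the reference solution?


def pass_rate(tasks):
    """Return the fraction of tasks that passed (0.0 for an empty collection)."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    return sum(t.passed for t in tasks) / len(tasks)


# Invented results: 6 solvable tasks (5 passed) and 4 unsolvable tasks (all failed).
tasks = [Task(f"solvable-{i}", True, i < 5) for i in range(6)]
tasks += [Task(f"unsolvable-{i}", False, False) for i in range(4)]

full = pass_rate(tasks)                               # scored on every task
filtered = pass_rate(t for t in tasks if t.solvable)  # scored on the filtered subset

print(f"full set: {full:.0%}, filtered subset: {filtered:.0%}")
```

The model's behavior is identical in both cases; only the denominator changes, which is why the choice of subset matters when reading a headline percentage.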

Trends

PR-driven AI releases: Major AI companies prioritizing marketing hype over substantive improvements to maintain cultural relevance and investor interest
Benchmark gaming: AI companies strategically designing or selecting benchmarks that favor their models while excluding harder test cases
Incremental improvement plateau: All major LLM advances are now incremental, suggesting the field may be hitting diminishing returns on current architectures
Bias opacity: AI companies systematically avoiding public bias testing and fairness benchmarking despite academic methods being available
Junior year wall phenomenon: Students using AI to learn coding without mastering fundamentals, hitting skill walls when problems exceed AI capabilities
Environmental cost scrutiny: Growing questions about whether marginal AI improvements justify massive water and energy consumption of data center retraining
Agentic AI rebranding: Industry repackaging existing automation capabilities as novel 'agent' functionality to justify new releases
Homogenization of AI output: LLM outputs becoming increasingly recognizable and formulaic, reducing nuance and contextual understanding
Political ideology embedding: AI systems reflecting unstated political decisions by creators, such as exclusion of DEI topics in generated curricula
Memorization over understanding: Evidence accumulating that LLMs pass tests through data memorization rather than genuine comprehension

Topics

ChatGPT 5.5 capabilities and limitations
AI benchmarking methodology and gaming
Agentic AI and multi-step task automation
Bias in machine learning models
LLM hallucination and accuracy problems
AI in HR and hiring decisions
Data memorization vs. understanding in LLMs
Environmental costs of AI model training
Anthropomorphization of AI systems
AI education and the junior year wall
Goodhart's Law and measurement gaming
OpenAI's IPO preparation strategy
Comparison with Anthropic Claude and Google Gemini
AI safety and security vulnerabilities
Fairness, accountability, and transparency in AI

Companies

OpenAI

Released ChatGPT 5.5 (Spud); primary subject of discussion regarding benchmarking practices and IPO preparation strategy

Anthropic

Competitor mentioned; Claude Opus 4.7 and Claude Mythos compared to ChatGPT 5.5 in terms of capabilities

Google

Mentioned as competitor in AI space with Gemini model competing against OpenAI's releases

Amazon

Case study cited for biased ML hiring model that discriminated against women based on historical hiring patterns

Apple

Referenced as analogy for how tech companies hype incremental software updates (iPhone releases)

Bloomberg

Conducted study testing ChatGPT bias in resume evaluation, finding discriminatory patterns in hiring recommendations

GitHub

Code sharing platform whose open-source problems and solutions were used to build the original SWE-bench benchmark

Hugging Face

Platform hosting the SWE-bench benchmark, publicly available for AI model evaluation

People

Meredith Broussard

Guest expert critiquing ChatGPT 5.5 release, benchmarking practices, and AI bias issues

Quotes

"Let's be honest, this is a software update."

Meredith Broussard•Opening

"It's juking the stats, like it's rigging the game."

Meredith Broussard•Mid-episode

"An AI agent is basically three APIs in a trench coat."

Meredith Broussard•Mid-episode

"When a measure becomes a target, it ceases to become an effective measure."

Meredith Broussard•Late episode

"It is absolutely, absolutely incorrect. What this is, is it is a software system. And when you humanize it, you're doing your customers a disservice."

Meredith Broussard•Late episode

Full Transcript

Let's be honest, this is a software update. One of the reasons that machines can pass these tests in some cases is because the questions and the answers are already in the training data that the model has memorized. The numbers that we're seeing about their whatever percent accuracy on this benchmark, those are on a subset of problems that they know that a computer could theoretically solve. It's juking the stats, like it's rigging the game.
On the Tech Report with me today is NYU data journalism professor and author of several books, including More Than a Glitch, Meredith Broussard. Thanks for coming on.
Thanks for having me. It's great to be here.
OpenAI has released ChatGPT 5.5. It's also known as Spud, which is an interesting name. The general reaction I've heard and read is kind of mixed. Positive points point to its ability to take more action with less human intervention. And then the negatives are sort of more in its unsurprising and somewhat incremental improvements over 5.4. It doesn't meaningfully overtake Anthropic's Claude Opus 4.7 from what I can tell either, despite what OpenAI's internal benchmarking and press releases claim. What do you know about how much of an improvement this has been over ChatGPT's previous model?
So let's be honest, this is a software update. Okay. What this is, is OpenAI's attempt to hype up the next version of ChatGPT in advance of its IPO. Right. They're preparing for an IPO. They're trying really hard to stay in the cultural conversation.
Would you say that this is a release more for PR than one motivated by a meaningful reason to release it?
Software updates happen all the time, right? All of our personal computers are trying to update themselves in the background all the time. And yeah, tech companies like to make a big deal about, oh, there's a big new version coming. All the new version means is that there's a collection of new features, right? So every software update is by nature incremental.
And so publicizing a big software update and giving it a name is a way of marking it for the public. Now, we are only going to see incremental advances with all of these AI models, right? Because that's how advances in AI work, right? And we're going to see the PR departments trying to hype them up every single time, the same way that Apple tries to hype everybody up every time there's a new iPhone. And yeah, it was exciting the first couple times. But, you know, like we've seen a lot of iPhones by this point. We've seen a lot of releases of chatbots. And, you know, they get a little bit better. Do they get good? No.
One of the things people, as I mentioned, are pointing out and celebrating about this is the agentic capabilities that it's claiming to have. In other words, that's kind of being able to take more actions on complex multi-step tasks with less human input. But given ChatGPT's infamy for sycophancy and tendency to hallucinate rather than admit it can't do something, to me that sounds like more of an opportunity to compound the hallucination effect with one wrong assumption being multiplied in the subsequent steps.
Yeah, you're absolutely right. And let's talk about what agentic AI means, right? An AI agent is basically three APIs in a trench coat, okay? What agentic AI allows you to do is it allows you to write code using natural language, right? And that is hands down the greatest contribution of LLMs, is the fact that now we can write code using natural language, human language, instead of having to use programming language because programming languages are really hard to use. However, what you're doing when you're making an AI agent is you're basically making the missing button in software. Okay. So you are using natural language. You're talking to the computer to develop a bit of code that makes, say, two programs or three programs work together efficiently.
And yeah, you could do that before, but you had to write more code and it was really hard to do. So all it does is streamline that process, right? AI agents running in the background, like, guess what? We've had the ability to write processes that run in the background for the entire time we've had computers. So it's just, it's rebranding.
What about the homogenizing effect that AI has? People will be familiar, sort of, with the output that it creates. It uses a lot of similar words, a lot of similar, the em dash, these sorts of things. It's quite easy to notice something is AI. If it is able to take much messier prompts, as the press release says it can, would that not create also some homogenizing effect on the input that it's taking? If it's able to supposedly take in more of this messy stuff, is it maybe missing some of the nuance that it's kind of supposed to be paying attention to but now might not be?
Oh, yeah, it's absolutely missing some nuance. Because what an LLM is, what a chatbot is, it's a machine, right? It's not sentient. It's never going to be sentient. There's no such thing as AGI. And so what it's doing is it is an AI model that is predicting the next... So when an LLM makes text, for example, it's predicting the next token in a sequence. So LLMs have gotten really good at parsing the natural language, the human language that people put in, and then turning it into an output that looks remarkably like, say, a five-paragraph essay, right? Or looks remarkably like a bit of code or subroutine. Is it accurate? Is it always perfect? No. But, you know, it does generate code, like it generates code-shaped objects. And if you are already good at coding, then yeah, you can take this code-shaped object and turn it into something that actually works. And that's great. You know, if you are already an expert coder, like absolutely, this speeds things up. If you're not already an expert coder, then, well, you're in a different situation.
So one of the things we're seeing in academia is what computer scientists are calling the junior year wall, right? So when students who are first and second years, freshmen and sophomores, use generative AI or use chatbots to learn to code, we don't really call it cheating anymore, but they're not learning the basic skills that they need in order to solve computational problems. And then by the time they get to junior year, they haven't learned all of the concepts. And junior year is when the problem sets in computer science classes get too hard for chatbots to solve. And so if the students haven't actually learned those skills, they hit the junior year wall, they can't do the problem sets themselves because they have not developed two years of skills, and then they can't finish their degrees.
What are the dangers of assuming, and I suppose also selling, AI as something that can understand the intent of somebody's words or navigate the ambiguities of the workplace, especially considering it's, like you say, just simply a probabilistic assumption on what the next word should be?
Oh, yeah, that's very dangerous. That's very dangerous. I think that we can also take a closer look at OpenAI's press release around this new release in order to figure out what are some of the limitations, right? So if you look at the press release, you'll see that there are benchmarks, right? And benchmarking is a really useful process. It's typically used in computer science to figure out how good is a machine learning model or how good is an AI model. So very, very normal process. And OpenAI is actually following pretty conventional techniques. However, they're doing something a little tricky. So one of the benchmarks they're using is something called SWE-bench, right? Software Engineering Bench. And so what this is, is this is a benchmark that was developed by some researchers and, you know, published in a paper that's available on arXiv.
And the benchmark is widely available for everybody via a platform called Hugging Face. Again, very conventional. But when you go in and you look at the history of this benchmark, it turns out that the original software engineering bench was built by taking a whole bunch of problems and solutions from the open source world. So they were published to a code sharing website called GitHub. And OpenAI looked at that original set of problems and answers and tried to run ChatGPT against these problems and solutions. And the idea is that what you do is you have the machine take in the problem and generate its own solution. And then if it matches the solution that the human has determined already, then it passes. However, this original set of questions, of problems and solutions, had a bunch of problems in there that were impossible for computers to solve. And so OpenAI said, hey, we don't actually want those problems, we only want problems that computers can solve. So they made their own benchmark called SWE-bench Verified. And so the numbers that we're seeing about their whatever percent accuracy on this benchmark, which they're using to say, oh, it's so great at coding, those are on a subset of problems that they know that a computer could theoretically solve. So there's a whole set of problems out there that computers can't solve, that only humans can solve. And some of them actually humans can't solve because there's a whole lot about the world that we don't know. So it's juking the stats, like it's rigging the game.
On the topic of benchmarking, the one thing that seemed to be glaringly absent, not just from this press release and this release, whatever you would call it, but from pretty much every benchmark of every AI that I've seen, they do not include bias benchmarking, trying to see where the areas are that we know that it is making incorrect assumptions based on data sets that are biased themselves. What do you make of the fact that AI companies are either not taking the time to do these tests (because they do exist; I've taken some time to find them and look into what you could do to test them) or that they're not willing to publish the results if they are doing these tests?
I mean, that is true. Like, that is undeniably true. There are ways to evaluate models for bias. And if the big AI companies are doing it, then they are not publishing the results. There's a very robust community in computer science that is concerned with fairness, accountability, and transparency. We have things like 21 different measures of mathematical fairness, and we know that there is bias in these models. So for example, Bloomberg did a story a year or two ago where they took a lot of resumes and put them into ChatGPT and tried to evaluate whether there was bias in what the chatbot said would be the best resumes, the best people to hire. And they found the same kind of bias we found in every single experiment that has been done on computational systems, right? And the reason for that is that there's bias in the training data. You know, when you make an AI model, what you do is take a whole bunch of data, put it in the computer, you say, make a model. Computer makes a model. Model shows the mathematical patterns in the data, and then it can reproduce things based on the training data. But there's bias in the real world. So there's bias in the training data. And that's what's going to come out.
Do you think it's possible for LLMs to reach a point where we can trust them for more sort of sensitive workflows, like with HR or things like that, which apparently OpenAI says it is already using it for?
You know, in HR, you have to think about what do we know has happened in the past? And you have to think about has that problem actually been solved? And usually the answer is, well, it's a pretty big problem. And the answer is no, the problem hasn't been solved.
So, for example, there's a very famous case when Amazon tried to make a machine learning model that would look at resumes and look at who had been hired at Amazon already and then use the data on who had been hired in order to figure out who's going to be a good hire in the future. And what that model did is it kicked out all of the women, right? Anybody who went to a women's college, anybody who played a women's sport got kicked out. Now, why is that? Well, likely it's because of the people who had already been hired at Amazon, who, you know, did not include a whole lot of people who went to women's colleges or played women's sports. And I actually came across something really interesting the other day. I was playing with one of the chatbots. I don't remember which one. But I asked it to make a sample curriculum for a journalism school, for a university level journalism school. And it gave me a curriculum-shaped object, and it did include a lot of things that one would think about in journalism. But interestingly, it did not include any language around diversity, equity, or inclusion. And we think about these things a lot in education. Diversity is important. Equity and inclusion are important things in the world, despite, you know, any political pressure to exclude them. And so it's interesting to see the way that political ideologies get implemented inside AI systems. So there are all kinds of political decisions made by the people who create AI systems that Joe User is not aware of, is not thinking about. But when you start going to look for longstanding social problems inside AI systems, you inevitably find them.
What do you make of OpenAI's marketing, specifically calling ChatGPT 5.5 a digital worker? Is it harmful to call what is, again, a random number generator with fancy clothes a digital worker?
It is absolutely, absolutely incorrect. What this is, is it is a software system. And when you humanize it, you're doing your customers a disservice, right?
We have all kinds of examples of people coming to harm because they anthropomorphize chatbots. One estimate suggests that a million people a week are descending into mental health spirals using chatbots. People have probably heard by this point about AI psychosis and what a problem this is. I think it's especially a problem when you think about kids using chatbots and kids kind of imagining that this machine is somehow real. It's going to have, it could potentially have really negative developmental effects.
And just finally, one reason that I had seen floating around for why this launch, aside from to try and I suppose catch up with Anthropic or Gemini, depending on which side you're looking at, is that this will be the basis for their new 5.5, their new super app. It's going to be this new model and they fully retrained ChatGPT for 5.5. If we're not really seeing that much of an improvement and there's some reports that it's twice as expensive, but then somehow also uses less tokens. We're not seeing enough to maybe warrant the incredible energy costs of retraining a full AI model for what is essentially an iterative improvement.
So the anthropologist Margaret Strathern coined something called Goodhart's Law, which is about measures and effectiveness. So when a measure becomes a target, it ceases to become an effective measure. So this is what we're seeing with all of the benchmarks that people are using for all of the chatbots. Because what you can do is once you know what the measure is, you can game it. Right. So what's going on here is, as I mentioned before, what the tests are is they are what we could call open book tests for the chatbots. OK, so the answers, the questions and the answers are already known. And why are they known? Because they are somewhere on the Internet. Okay. So many, many, many, many evaluators, and this is actually in OpenAI's documentation as well, evaluators have found evidence of memorization. Okay.
So because the software developers don't really know what's happening inside the black box of an AI model, they have to kind of guess at what's happening. Like it's a very educated mathematical guess, but it is also a guess. And there is always evidence of memorization. So one of the reasons that machines can pass these tests in some cases is because the questions and the answers are already in the training data that the model has memorized. Right. So this is a known limitation of chatbots. And I do think that people need to understand that all change in chatbots is incremental. And we are going to have hype. You know, people were really excited last week about Claude Mythos, right? And the idea was that Claude Mythos was so good at, you know, cracking code that it was unsafe to be released. And what does this mean? Well, it means it's really, really good at finding holes in software programs and exploiting them. Guess what? Things that are vibe coded, right? Things that are coded using chatbots have a lot of security holes. OK, so now we also have software that is really good at finding the security holes in the code that was made by the chatbot. I mean, is this worth the environmental cost, the amount of water, say, that it takes to cool these giant data centers that train these AI models? Right. Is it worth the higher electric bills that people are seeing as a result of data center energy use? You know, are these chatbots worth it overall? I think that's really a question that people have to ask themselves.
Well, Meredith Broussard, thanks for taking the time.
Thank you.
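
The "open book test" point in the transcript, that benchmark questions and answers may already sit in the training data, can be illustrated with a rough overlap check. This is only a sketch under stated assumptions, not the method OpenAI or any evaluator mentioned here actually uses: it counts what share of a question's word n-grams also appear in a corpus. The function names, the choice of `n=5`, and the sample strings are all hypothetical.

```python
def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(question, corpus, n=5):
    """Share of the question's n-grams that also occur in the corpus (0.0 to 1.0)."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    return len(question_grams & ngrams(corpus, n)) / len(question_grams)


# Invented stand-in for training data scraped from the web.
corpus = "fix the off by one error in the pagination helper so the last page is returned"

seen = "fix the off by one error in the pagination helper"
unseen = "add retry logic with exponential backoff to the upload client"

print(overlap_ratio(seen, corpus))    # high overlap: the question text is in the corpus
print(overlap_ratio(unseen, corpus))  # no overlap: the question text is not in the corpus
```

A high ratio does not prove the model memorized the answer, and a low one does not rule it out; it only flags test items whose text was available to learn from.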