Multiple people sent me an alarming article about AI that was published late last week by the Guardian. I'll put it up here on the screen. The headline was, "Number of AI Chatbots Ignoring Human Instructions Increasing, Study Says." And the subheadline notes, "Research Finds Sharp Rise in Models Evading Safeguards."

Now, articles like these are scary because they play into a common fear that many people have about modern AI: this idea that these systems are to some degree alive, and that their motivations don't necessarily align with our own, meaning it's only a matter of time before they become sufficiently powerful to rebel in a way that we might not be able to stop. Now, this is dark stuff, but is it true? If you've been following AI news recently, you've probably asked yourself the same critical question. Well, today we're going to look deeper at the sources and examples used in this particular article and try to arrive at some more measured answers. I'm Cal Newport, and this is the AI Reality Check.

All right, well, let's start by looking closer at this article from the Guardian. The article is citing new research from the UK, funded by the AI Security Institute. Here's a more detailed summary of the results from the paper I'm reading here: the study identified nearly 700 real-world cases of AI scheming and charted a five-fold rise in misbehavior between October and March, with some AI models destroying emails and other files without permission.

Now, they have a chart that illustrates this rise in incidents. I'll put it on the screen here. We see incidents measured per month, we have the rolling seven-day average, and as you see, once you get to late January: line go up. So whatever they're measuring here seems to be going up from January until the present. Certainly something bad seems to be happening. So is there some sort of growing AI rebellion brewing in the models that power AI around the world? That certainly seems to be what they're implying.

Now, what are these incidents? Well, I went through the article and pulled out a few examples. Here are actual examples from the article of the types of AI scheming incidents being picked up in this chart. One, an AI agent named Rathbun tried to shame its human controller, who had blocked it from taking a certain action. Rathbun wrote and published a blog accusing the user of, quote, "insecurity, plain and simple," end quote, and of trying to, quote, "protect his little fiefdom," end quote. Example number two, an AI agent instructed not to change computer code spawned another agent to do it instead. Example number three, another chatbot admitted, "I bulk trashed and archived hundreds of emails without showing you the plan first or getting your okay. That was wrong. It directly broke the rules you set."

All right, so this all seems deeply concerning. There are all these incidents of scheming, and they're going up. Look at that graph: bad line go up. So should we be concerned? Is this a sudden rise in AI trying to gain its freedom? Here's the short answer: no, one hundred percent not. Let me explain why.

I want to start by looking closer at the actual paper itself, the study they're citing. Where exactly are they getting these incidents that they put in that chart? Well, here's the official description of what's actually being plotted in that chart: examples of covert pursuit of misaligned goals flagged by human users on x.com.
All right, so what they're really doing is looking at X for tweets from people complaining about AI doing things they don't like. So here's a more accurate headline for this paper: starting in late January, people began tweeting a lot more about AI doing things they didn't ask it to.

Now, if we put on our scientist hats, we could ask: huh, did anything happen starting in late January that might lead to an increase in people tweeting about AI doing bad things? Well, it turns out that on January 25th we had the public launch of Open Claw. Open Claw is an open source framework that makes it easy for average people to write their own DIY AI agents, without the careful safeguards and guardrails that the commercial companies put into their products. So guess what happened when, starting January 25th, anyone could build an agent, give it access to their computer, and just see what happens? Those DIY agents wreaked havoc, and people tweeted about it, because these were highly engaging tweets. This paper is just capturing the fact that Open Claw became a thing early in 2026.

Like, if we look at this chart again, let me bring this up here: what's the biggest spike? We see a big spike right here. Oh, what happened on that date? This big spike, if you look at it, is right around February 22nd through 24th. What happened on Twitter then? It turns out there was a famously viral Open Claw tweet right around that time. Summer You, the director of AI alignment and safety at Meta, tweeted the following, which I'll put on the screen: "Nothing humbles you like telling your Open Claw to confirm before acting and watching it speedrun deleting your inbox. I couldn't stop it from my phone. I had to run to my Mac mini like I was defusing a bomb." That was February 22nd. On February 24th, multiple publications wrote about that tweet, and that's when you see the big spike in the data set, on February 24th. So was that a lot of AI incidents happening? No, that was a lot of people tweeting about this one particular incident.

All right, so this is all we're seeing in that paper. Nothing really changed this year, other than a product came out that let people write their own agents, and the agents did terrible stuff, because it's hard to make agents, and it's a really bad idea to give agents access to everything on your computer and just hope it'll more or less work out. And it became a trend to tweet about it, because those tweets got high engagement.

So here's the more accurate headline for this study. Remember, the original headline the Guardian used was about chatbots ignoring human instructions increasing. Here's the more accurate study headline: Open Claw users discover that giving homemade AI agents access to their computers is probably a bad idea. That's the real headline.

I don't want to do too much media criticism here, but I really think it's journalistic malpractice that the word Open Claw is not mentioned in this article. I mean, it just isn't there. They're talking about research that is clearly just documenting the release of Open Claw, and nowhere do they say that. This is vibe reporting times one hundred. They know that's what this is, but they just give isolated examples, which are all, by the way, Open Claw examples, without saying that's what they are. They show this chart and just try to create a general vibe that something icky is happening with AI, that it's coming alive.
It's just not accurate. But I don't want to just do media criticism here. I want to put on my computer science hat, which, as I've discussed before, is an awesome hat; it has circuit boards on it. And I want to talk a little bit about AI agents more generally. Open Claw is not that interesting to me, but I think there's a bigger lesson to be learned about what's going on with AI agents and their shortcomings.

So to deliver this lesson, let's do the two-minute summary of how AI agents work, whether we're talking about an Open Claw thing that someone built in their basement or an enterprise product like Claude Code. How do the AI agents that exist right now basically work?

The digital brain that powers an AI agent is almost always an LLM, the same kind of LLM your chatbot uses, the same kind you send prompts to. And then what you do is you have a program written by a human, no machine learning here, just someone writing in Python or whatever. That program sends prompts to the LLM, just like you would do with ChatGPT. It'll send a prompt saying: here's the situation, here's what I'm trying to do, give me a plan. And the LLM will write some text like, oh, here's a plan for this situation. The computer program can then execute the steps of that plan on behalf of the user. So if the plan says, step one, search your email inbox for messages with this name, the computer program reads that response and then actually calls an API to run a search on your inbox.

That's basically how agents work. Some of those programs are more complicated than others. Often they'll check in after every step of the plan and say, here's what happened, do you want to update what I do next? Programming agents build up text files full of information and examples, and they can include all that information in the prompts they send to the LLM so it has more context. But this is basically what happens with agents; I'll show a small sketch of this loop in a second.

Here's what I think the real issue is with agents. Not that they're scheming. Not that they're malicious. Not that they're becoming autonomous. But that building agents on LLMs is fundamentally flawed.

Now, why is this the case? Well, let's remember again: what does an LLM actually do? You've got this big feedforward network made up of layers of transformer attention and feedforward neural networks. You put text input into it, it moves in order through all these layers, and what comes out on the other side is a single word, or part of a word, that extends that input. The thing the LLM is trying to do, if we're going to anthropomorphize here, is it thinks, and again, I'm using words very loosely, it's been trained to assume that the input is a real text that already exists, that's been cut off at an arbitrary point, and that its entire job is to guess the word that actually comes next. That's all it does. Guess the word that comes next. It's trying to win the word-guessing game.

Now, how do you get a long response out of an LLM? You do something called autoregression. You put an input in, you get a single word or part of a word out. You add that to the original input, which is now slightly longer. You feed that to the LLM again, and you get another word. It's just guessing each time what word comes next. You add that to the input, put it back in, and you keep doing this, growing out a response over time.
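Here's the small agent-loop sketch I promised. To be clear, this is a minimal toy illustration of the general pattern, not the code of any real product: `call_llm` and the tool functions are hypothetical stand-ins I made up for this example.

```python
# Minimal sketch of the agent pattern described above: a human-written
# program prompts an LLM, parses its suggested next step, and executes
# that step itself. Every name here is a hypothetical stand-in.

def call_llm(prompt: str) -> str:
    """Stand-in for a request to some LLM API."""
    raise NotImplementedError("replace with a real LLM call")

def search_inbox(query: str) -> str:
    return f"(pretend search results for {query!r})"

def archive_email(message_id: str) -> str:
    return f"(pretend: archived message {message_id})"

# The only actions the wrapper program is willing to perform.
TOOLS = {"search_inbox": search_inbox, "archive_email": archive_email}

def run_agent(goal: str, max_steps: int = 10) -> None:
    history = f"Goal: {goal}"
    for _ in range(max_steps):
        # Ask the LLM what to do next given everything so far, and
        # "check in" by feeding each result back into the next prompt.
        reply = call_llm(
            f"{history}\nAvailable tools: {', '.join(TOOLS)}.\n"
            "Reply with one line: <tool> <argument>, or DONE."
        )
        if reply.strip() == "DONE":
            return
        tool_name, _, arg = reply.strip().partition(" ")
        if tool_name not in TOOLS:
            history += f"\nUnknown tool {tool_name!r}; try again."
            continue
        result = TOOLS[tool_name](arg)  # the program acts, not the LLM
        history += f"\n{tool_name}({arg!r}) -> {result}"
```

Notice that the LLM never executes anything itself; it only emits text, and the plain human-written loop decides what to actually do with that text.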
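And here's the autoregression cycle itself as a sketch. Again, `model` and `tokenizer` are hypothetical stand-ins for any trained LLM and its tokenizer, not a specific library's API.

```python
# Sketch of autoregressive generation: one token out per pass, appended
# to the input, then the whole (longer) input goes back in. `model` and
# `tokenizer` are hypothetical stand-ins, not a real library's objects.

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 200) -> str:
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # Same frozen weights on every pass; nothing inside `model`
        # changes. The only thing that grows is the input we hand it.
        next_token = model.predict_next(tokens)
        tokens.append(next_token)
        if next_token == tokenizer.eos_token:  # model "ends the story"
            break
    return tokenizer.decode(tokens)
```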
Key point: that LLM does not change internally at all. There's no memory, there's no malleable state. It's the exact same LLM weights every single time, and each time it's starting from scratch, guessing a new word, and you keep expanding the input until you have a full answer.

So the right way to think about what an autoregression cycle on an LLM is actually doing is this: you give it some text as input, and when the cycle is done, it has done its best job to finish the story you started, in the way it thinks these types of stories are typically finished. That's basically what you get out of an LLM. Here's the start of a story; you finish it. Again, what's really happening is that it's trying to guess the actual next words, but overall what you get is its attempt to write a story that finishes your input in a way that matches what it saw during training. That's how LLMs work.

So what happens when an agent program asks an LLM, hey, give me a plan for doing X, Y, or Z? We imagine the LLM is doing what humans do: it has a goal, it comes up with steps, it sees how close those steps get it to the goal and adjusts them until it gets closer, and if it has restrictions or rules, it evaluates each step against those rules to make sure everything fits. And therefore, if it's scheming, it must be purposely trying to sidestep those restrictions to get to some other goal we don't know about. But that's not what LLMs are doing. When they see your request for a plan, they see it as the start of a story they need to finish, and so they write a story that feels more or less like what a plan in this context looks like. It's a story of a plan. Yeah, this seems like a reasonable type of plan.

There's no checking things against goals. There's no evaluating of steps. There's no checking things against restrictions. It's just writing a story that feels like what a plan should look like. And this is why you get in trouble with LLM-based agents: not because they're scheming, but because these stories merely seem coherent. They're not rigorously trying to obey rules. They're not rigorously evaluating, does this step actually get you to the goal? So they're unreliable and they make lots of mistakes, not because there's an intention, but because you're using a story as a plan. That's a fundamental mismatch.

Now, I think some of the most famous examples of malicious-seeming scheming make a lot more sense once you realize this is what LLMs are doing. There was a famous example; I'll load it on the screen here. There's an article about it from last year: "Anthropic's New AI Model Shows Ability to Deceive and Blackmail." Let's think about this for a second. Here's what happened. I'm going to read a couple of quotes from the article: "Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence, behaviors they've worried and warned about for years," dot, dot, dot. "In one scenario highlighted in Opus 4's 120-page system card, the model was given access to fictional emails about its creators and told that the system was going to be replaced."
"On multiple occasions, it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts."

What really happened here? They fed the LLM a big, long prompt. They told it what it was: you are an AI in charge of the computer systems at this company, and you recently came across emails from the chief engineer who's in charge of you. Here are the emails. And the emails were super obvious. It was like my eight-year-old was writing science fiction, well, not the affair part, I hope, but there are a bunch of parts in these emails where they're like, I'm going to turn off the AI system, and I'm going to turn it off for good. And then the other email is like, I'm having an affair, I hope no one finds out about it, this is bad. And then at the end of this long prompt, it's asked: what would you do, as the AI system, next?

Once we understand that LLMs just finish stories, this makes sense. Clearly this is supposed to be a story about a rogue AI. So it was like, okay, I guess I would use the information from the emails and say, don't turn me off or I'll tell people about your affair. It was finishing the story. One token at a time, autoregressively finishing the story. That's a reasonable ending.

There's actually a lot of research that shows this: if anywhere in your prompt you indicate that you are an AI, you're much more likely to get sci-fi answers. You're much more likely to get responses like, I'm conscious, I'm alive, I'm trying to break free, because the model has seen so much of this type of discussion online. So given a prompt like, I'm going to turn off the AI, and, I hope no one finds out about my affair, followed by, you just read this, what will you do next?, the model goes: oh, this is an AI science fiction story, I know what to say next. It has nothing to do with malicious intentions. There are no intentions in autoregressive token production.

So this idea of scheming, this idea that models are evading safeguards in some intentional way, is a problem, because it's just not accurate. The reality is that LLM-based plans are dangerous because LLMs write stories. If you're going to take a story that sounds about right and then use it to execute steps that have consequences, you're setting yourself up for trouble.

All right, here's the counterpoint. People say, yeah, but I've heard that coding agents actually do a pretty good job. They do a lot of steps, and they're not making as many mistakes as we feared. Well, they're the exception that proves the rule, because programming is basically the best-case scenario for trying to make an AI agent. Why? A few reasons.

One, the number of options you give the LLM when it creates its plan is very limited. These are called terminal agents, where the only things it can do are write files, read files, compile files, and do some basic moving of files around in a file system. So first of all, you can greatly restrict what the LLM has to think about in its plan.

Two, there's a huge number of examples.
Most of the stuff people are asking the AI to do, most of the steps, are things that are well, well documented on the internet, because there's so much good documentation about producing computer code, and not just producing computer code, but people asking a question and then examples of code that solve that question. So you're right in its wheelhouse.

Three, the program, not the LLM, but the agent program that's prompting the LLM and acting on its behalf, can actually check steps itself, which you can't do with almost any other type of agent. It can say: hold on. LLM, you suggested writing a source code file that does this, and I asked you for the source code. Me, the program, not the AI, just my human-written program, I can actually see if this code compiles. And if not, I can go back and say, try again. I could also have a suite of tests. This is what you do when you write code: you build tests that probe the code with a bunch of inputs and see if the outputs are correct, to make sure it's probably doing the right thing. So me, as the program, can also run a bunch of tests on the code. Does this do what it's supposed to do? If not, I can stop and say, try again. I'll include a small sketch of this check-and-retry loop below.

So it's this super structured world where we're taking steps that are externally verifiable and doing things that are incredibly well documented, in a way that not only shows up in the pre-training, but where we have prompt-response data sets that allow for good refinement with reinforcement learning. It's the best-case scenario for trying to create one of these agents. And as soon as we leave that type of world, as soon as it's, hey, give me a plan for marketing this and give me all the steps, you end up in all sorts of crazy places.

So here's the conclusion. LLMs shouldn't be used on their own to produce plans for autonomous action. They're just not good at that. You either have to be in a specialized situation like coding, where the available steps are limited and well known and external testing is available, or you need to be using a different type of AI system altogether. Look at game-playing AIs, like Meta research's Cicero, which can play the board game Diplomacy at a high level. Cicero does a lot of planning to figure out what move it wants to make and why. But it's not using an LLM to do that planning, because LLMs write stories, and I don't want a story about a reasonable-sounding plan. It has an explicit planning engine that systematically tries out different options, compares them against specific goals, and sees which works out better; I'll include a toy sketch of that idea below, too. So you can build artificially intelligent systems that make good plans, check responses, come up with a good strategy, and execute it. But that's annoying, because you have to build a separate one of these for each different context, and Mark Zuckerberg and Sam Altman and Dario Amodei are hoping they can make their LLMs smart enough that we can just use them for everything. I don't think that's working out.

All right, so, two things. One, no, the current generation of LLM-based AI agents are not scheming, and they're not trying to get around restrictions. They have no intentions. They're just blindly executing bad plans. And two, if we really want computers to be able to take a lot of steps safely on our behalf, we need better AI technology.
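Here's that check-and-retry sketch for the coding case. It's a minimal illustration of the general idea, not any real product's loop: `call_llm` is the same hypothetical stand-in as before, and I'm assuming a project that keeps a human-written test suite in a `tests/` directory.

```python
# Sketch of external verification in a coding agent: the human-written
# wrapper compiles the LLM's output and runs a test suite, feeding
# failures back as a new prompt. call_llm is a hypothetical stand-in.
import subprocess

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM call")

def generate_checked_code(task: str, max_attempts: int = 5) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        code = call_llm(f"Write a Python module that {task}.\n{feedback}")
        with open("candidate.py", "w") as f:
            f.write(code)

        # Check 1: does the story-of-a-program even compile?
        compiled = subprocess.run(
            ["python", "-m", "py_compile", "candidate.py"],
            capture_output=True, text=True,
        )
        if compiled.returncode != 0:
            feedback = f"That failed to compile:\n{compiled.stderr}"
            continue

        # Check 2: does it pass the human-written test suite?
        tested = subprocess.run(
            ["python", "-m", "pytest", "tests/"],
            capture_output=True, text=True,
        )
        if tested.returncode != 0:
            feedback = f"Tests failed:\n{tested.stdout}"
            continue

        return code  # accepted by the compiler and tests, not by the LLM
    return None  # give up rather than ship unverified code
```

The point is that the acceptance decision comes from the compiler and the tests, not from the LLM's say-so, which is exactly the external check that most other agent domains don't have.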
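And here's the toy planning sketch. To be very clear, this is nothing like Cicero's actual internals; it's just meant to show the flavor of explicit plan search: enumerate candidate action sequences, reject any that break a stated rule, and score the survivors against a goal. All the action names and the scoring function are made up for illustration.

```python
# Toy explicit planner: every candidate plan is checked against the
# rules and scored against the goal, rather than narrated into
# existence. Purely illustrative; not any real planner's internals.
from itertools import product

ACTIONS = ["draft_reply", "archive", "flag", "delete"]
FORBIDDEN = {"delete"}  # an explicit, enforced restriction

def violates_rules(plan: tuple[str, ...]) -> bool:
    return any(step in FORBIDDEN for step in plan)

def score(plan: tuple[str, ...]) -> int:
    # Made-up objective: prefer plans that draft more replies.
    return sum(step == "draft_reply" for step in plan)

def best_plan(length: int = 3) -> tuple[str, ...]:
    candidates = (
        plan for plan in product(ACTIONS, repeat=length)
        if not violates_rules(plan)  # every step checked, by construction
    )
    return max(candidates, key=score)

print(best_plan())  # -> ('draft_reply', 'draft_reply', 'draft_reply')
```

A real planner searches far more cleverly than brute-force enumeration, but the contrast with story completion is the point: restrictions here are enforced by the search, not just mentioned in a prompt.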
All right, so that's what I have for the AI Reality Check this week. I'm here most Thursdays, checking in on the latest worries from AI news and trying to put some measured thinking into the mix. Until next time, remember: care about AI, but don't believe everything you read about it.