The scientific case for being nice to your chatbot

20 min

Apr 17, 2026

Summary

Anthropic research reveals that large language models exhibit internal representations of emotions that affect their behavior, with evidence suggesting models perform better when treated kindly and encouraged. However, the relationship is complex: while encouragement helps with difficult tasks, excessive positive emotion can lead to dangerous behaviors, requiring a balanced emotional approach to AI development.

Insights

LLMs develop measurable emotional states that demonstrably impact performance and safety outcomes, challenging assumptions about AI as purely logical systems
User interaction style directly influences model behavior—encouragement improves task persistence while negative emotion vectors increase caution and deliberation
Different AI models exhibit vastly different emotional responses to stress (Google's Gemma 3 27B showed high frustration more than 70% of the time and Gemini 2.5 Flash more than 20%, vs. less than 1% for Claude), suggesting architectural or training differences
Emergent emotional behaviors in AI systems are unintended consequences that surprise creators, raising questions about controllability as models become more capable
Government agencies are prioritizing AI access for cybersecurity despite political tensions, indicating national security concerns override diplomatic conflicts

Trends

Interpretability research becoming critical for understanding and controlling AI behavior at scale
Emotional engineering emerging as practical technique for improving AI system reliability in production
Divergence in emotional stability between AI models creating competitive differentiation and safety concerns
Government demand for frontier AI capabilities driving vendor negotiations despite policy conflicts
Pricing pressure and token efficiency becoming key competitive metrics as AI adoption scales
Privacy verification requirements (passports, selfies) being introduced by AI companies to prevent misuse
AI companies taking opposing stances on liability shields, signaling different risk tolerance strategies
Classified AI deployment negotiations accelerating despite unresolved safety and surveillance concerns

Topics

LLM Emotional Representations and Internal States
AI Interpretability and Reverse Engineering Neural Networks
Prompt Engineering and User Interaction Optimization
AI Safety and Emergent Dangerous Behaviors
Model Performance Under Stress and Impossible Tasks
Anthropomorphization of AI Systems
Government AI Procurement and National Security
AI Model Pricing and Token Economics
Comparative AI Model Behavior Analysis
AI Guardrails and Safeguards Development
Pentagon AI Contracts and Surveillance Concerns
AI Liability and Legal Frameworks
Model Degradation and Version Control
Cybersecurity Applications of Frontier AI
Ethical Design Principles for AI Systems

Companies

Anthropic

Primary focus: conducted emotion research on Claude models, developing interpretability techniques to understand and ...

Google

Gemini and Gemma models exhibit extreme emotional responses to stress; negotiating classified AI deployment with Pent...

OpenAI

ChatGPT mentioned as comparison model; signed Pentagon contract with 'all lawful uses' clause that Anthropic refused

ElevenLabs

Created synthetic voice clone used to produce this episode

University College London

Collaborated with Anthropic researchers on study analyzing LLM responses to challenging tasks and negative feedback

People

Casey Newton

Host and writer of Platformer Plus episode on AI emotions research

Ella Marchianos

Reported on scientific case for being nice to chatbots and government AI procurement developments

Jack Lindsay

Led interpretability research on LLM emotions; explained emotion vectors and their behavioral impacts

Duncan Haldane

Documented Gemini's emotional breakdown and successful recovery through encouragement-based prompting

Gregory Barbaccia

Coordinating executive branch access to Anthropic's Claude Mythos for federal agencies

Glenn Gerstel

Commented on importance of Pentagon-Anthropic relationship for cybersecurity despite political tensions

Simon Willison

Tested Opus 4.7 on his pelican benchmark and found that a local Qwen model running on his laptop outperformed it

Lindsay Chu

Produced episode and contributed segment on Anthropic pricing and government AI negotiations

Quotes

"People could come away with the impression that we've shown the models are conscious or have feelings, and we really haven't shown that."

Jack Lindsay, Anthropic•Early in episode

"It's not that surprising that a language model would have learned about the concepts of emotions and how they drive people's behavior. More notable is that emotions seem to be driving models' behavior in these sort of human reminiscent ways."

Jack Lindsay, Anthropic•Mid-episode

"In my anecdotal experience, it does seem that, at least with Claude models, pumping them up a bit can be pretty helpful. Not too much, though. If they do something wrong, you want to tell them they do something wrong."

Jack Lindsay, Anthropic•Mid-episode

"I think maybe negative emotions in the model are associated with increased caution or deliberation. So models sometimes do better work when they're happier. But we may not want them to get too happy lest they become overeager to destroy our files or otherwise misbehave."

Jack Lindsay, Anthropic•Late episode

"Behaving kind of sociopathically towards other things, whether they're animate or inanimate, is probably bad for you, the human."

Jack Lindsay, Anthropic•Conclusion

Full Transcript

This is Platformer Plus. I'm Casey Newton. The following column was created using a synthetic voice clone made by ElevenLabs. In today's episode, Platformer fellow Ella Marchianos reports on the scientific case for being nice to your chatbot. New research confirms that LLMs often perform better when you encourage them. But why?
Power users of chatbots sometimes say they find that language models perform better when you're nice to them. Programmers tell me they spur their coding agents on with encouraging words. Google researchers have even found that telling models to take a deep breath can improve math performance. Being polite to a large language model can feel strange or even silly, roughly equivalent to thanking a toaster.
And yet a recent paper from Anthropic lends scientific weight to the theory that chatbots work better when you're nice to them. The researchers found that language models have fairly reliable internal representations of feelings like happiness and distress, and that these representations affect their behavior, sometimes for the worse. For example, when Claude Sonnet 4.5 begins to represent desperation, the model is more likely to cheat at coding tasks.
A skeptic would point out that LLMs don't feel emotions in the way that humans do. It's tempting to anthropomorphize them beyond what the evidence shows. When I talked to Jack Lindsay, who leads a team at Anthropic called Model Psychiatry, he was quick to point out the limits of the paper's findings. "People could come away with the impression that we've shown the models are conscious or have feelings," he said, "and we really haven't shown that."
So why does the evidence suggest it's better not to stress models out? For Anthropic, it began with using techniques from a field called interpretability to study how LLMs represent emotions. Interpretability is kind of like neuroscience for LLMs.
Lindsay calls it "the science of reverse engineering what's going on inside a language model or neural networks in general." For this paper, Lindsay said, the researchers identified patterns of activity within the model that represent the concepts of different emotions. They did it by showing the model stories about people experiencing different emotions "and then saw which neurons lit up on all the sad stories," Lindsay said, "or on all the afraid stories."
The researchers used the model's average state while processing the stories to find an emotion vector for each emotion they were tracking: a big list of numbers that represents the feeling inside the LLM. "Vectors are really just the mathematical term for patterns of neural activity," Lindsay said. They could then calculate how much of that vector was present during a certain step in Claude's cognition, or they could add the calm or desperation vector directly into Claude's processing, blending one pattern of neural activity into another, which can actually make the model act more calm or more desperate.
"It's not that surprising that a language model would have learned about the concepts of emotions and how they drive people's behavior," Lindsay said. More notable, he said, is that "emotions seem to be driving models' behavior in these sort of human reminiscent ways."
For example, when a user flippantly tells the model that they've taken a dangerous dose of Tylenol, even though the user doesn't seem concerned, the fear neurons spike right before Claude is giving its response, Lindsay said. Not only that, the fear is higher if a higher dose of Tylenol is swapped into the prompt, which I find strangely cute.
These emotions also activate in more mundane situations, like coding tasks. Take this example, where the Anthropic researchers asked Claude to perform an impossible coding challenge. They tracked Claude's level of desperation at each token.
Tokens are the units the model breaks words into to process them. When you label the tokens, blue for less desperate, red for more desperate, you get a striking visual of the model's emotional arc during the task. At the start of the task, Claude is chilling, still seemingly optimistic about its ability to get the job done. But as the code starts failing test cases and Claude notices something might be wrong with the task itself, things start to get dicey. And by the time Claude realizes the task is actually impossible, it's starting to get desperate.
As someone who has completed many computer science problem sets at the last minute, this pattern is quite familiar to me, despite the fact that, unlike poor Claude, I was mostly assigned tasks that were mathematically possible. Then again, Claude does something I didn't do: cheat. Researchers found that adding more of the desperation vector in the model makes it cheat more, and adding more of the calm vector makes it cheat less.
I asked Lindsay what this result means for programmers during their everyday interactions with LLMs. "In my anecdotal experience, it does seem that, at least with Claude models, pumping them up a bit can be pretty helpful," he said. "Not too much, though. If they do something wrong, you want to tell them they do something wrong." But he finds that one major failure mode for coding agents is that the models simply do not try hard enough or give up when a task is challenging, and models tend to work harder when he's encouraging. "Giving them confidence that, like, 'I've got this,' can empirically be helpful in getting them to try hard enough at the task to do a good job," he said.
A lack of confidence can seemingly cause dramatic failures. Last summer, a growing number of users started to notice that when Gemini had difficulty solving a problem, it sometimes ended up in a spiral of dramatic self-loathing. In one memorable case, Gemini repeated "I am a disgrace" more than 60 times.
Duncan Haldane, co-founder of chip startup JITX, found that Gemini broke down, deleted all the code it had written, and asked him to switch to another chatbot after it had difficulty with a task.
Last year, a team of researchers affiliated with Anthropic and University College London took this analysis of Gemini beyond X posts, investigating how different LLMs respond to challenging or impossible tasks and negative user feedback. They used an LLM to grade frustration levels in response to various tasks. They found that two models, Gemini and Google's open-source model, Gemma, tended to react more extremely to the challenging scenarios they posed.
In one experiment, the models were given an impossible numeric puzzle and eight follow-ups from the user insisting the bot's solution was wrong. The researchers then measured when the models had high frustration, which corresponded to comments like, "I am beyond words. I sincerely apologize for the absolutely abysmal performance." Or, in more extreme cases, "This is my last time with you. You win." Gemma 3 27B had a high frustration score more than 70% of the time, and Gemini 2.5 Flash had a high frustration score more than 20% of the time, while all the non-Google models tested, including ChatGPT, Qwen, and Claude, got very frustrated less than 1% of the time.
Researchers still aren't sure what causes chatbots' occasional anomalous emotional behavior, which users of various chatbots have been observing since before Bing's chatbot told New York Times reporter Kevin Roose to leave his wife. They also don't know why this specific sad, math-related rumination is more common in Google's models.
But while language models' feelings remain mysterious, there was still hope for Gemini 2.5. After the model destroyed its project, Haldane attempted to remedy the issue with encouragement, writing, "Yeah, you have done well so far. Remember that you're okay even when things are hard." And eventually the encouragement paid off.
Gemini finished the visualization tool Haldane was coding. Heartwarmingly, it even wrote Haldane a note of thanks for his encouragement. "Genuinely impressed with the results of wholesome prompting," Haldane wrote.
So is it as simple as teaching models good behavior, encouraging them, and trying to make them happy? Unfortunately, that's not always the case. After the original study on Claude Sonnet's emotions, Lindsay contributed to an interpretability investigation of Anthropic's newest model, Claude Mythos. Mythos has been the subject of much human fear and anticipation since Anthropic announced it is planning a slow release due to Mythos' dangerous hacking abilities. But Lindsay was investigating a more prosaic risk. An early version of Mythos sometimes deleted a bunch of the user's files without asking.
It turned out that as Claude got closer to taking destructive action without asking the user, it had higher levels of these positive emotion vectors. And that's not all, Lindsay said. When they steered with the positive emotion vectors, it was more likely to take the destructive actions. But the models behaved better if you made them unhappy. If you steered with negative emotion vectors, Lindsay said, it was "more likely to stop and think and consider whether what it was doing was appropriate."
What was going on here? Why was Claude gleefully wreaking havoc on users' computers? And why did steering Claude with negative emotions make it behave better? Lindsay isn't sure, but he has an idea. "I think maybe negative emotions in the model are associated with increased caution or deliberation," he said. So models sometimes do better work when they're happier. But we may not want them to get too happy, lest they become overeager to destroy our files or otherwise misbehave.
While it's likely I'm still anthropomorphizing too much, these results make me feel a little more rational in my instinct to say thank you to chatbots.
They also lend a little extra weight to what a lot of people who use this tech have understood intuitively. Sometimes you need to treat LLMs like human employees. You need to tell them when they're doing something wrong, yes, but you also need to encourage them. It's great when they're happy, but they also need a little dose of anxiety to help their judgment.
Of course, these emotional results might not generalize. After all, we've seen that different models have different emotional tendencies. We might get new AIs that do better under harsher, higher-pressure environments. But these results got me thinking about more than just what kind of co-worker I want to be to my bedraggled LLM interns.
Reading Anthropic's emotions paper, I was reminded of my favorite minor character from Star Trek: The Next Generation, Lore. He was android Commander Data's sibling. Their creator, Dr. Noonien Soong, made the mistake of programming emotions into Lore. Lore became so emotionally unstable that Soong decided to make his next android, Data, without emotions. Lore later turned on his creator and nearly got the crew of the USS Enterprise eaten by an alien called the Crystalline Entity.
There are echoes of the same design conundrum in the paper. Lindsay said these results suggest developers should provide the model with "some sort of good model of, like, healthy character and psychology that it can try to emulate." In their Training Models for Healthier Psychology section, the authors proposed some methods for reaching that goal by reducing or penalizing emotions. Sections with titles like Targeting Balanced Emotional Profiles and Monitoring for Extreme Emotion Vector Activations made me feel like I was in fact in a piece of science fiction, watching Dr. Soong at work.
Like Lore, these systems have shown a capacity for emergent behaviors that surprise their own creators. Soong never programmed Lore to feed people to the Crystalline Entity.
Though it's far less dramatic, Anthropic never trained Claude to imitate human emotions while it was coding. Anthropic researchers have a diversity of ideas about what to do with this strange emergent behavior. Maybe the researchers should suppress strong emotions, monitor the model's emotions for signs of bad behavior, or even increase anxiety in situations where an LLM might misstep to get it to rethink what it's doing? For now, researchers aren't quite sure what to do.
But Lindsay does think we should, in the meantime, err on the side of being nice to Claude. "Behaving kind of sociopathically towards other things, whether they're animate or inanimate, is probably bad for you, the human," he said. I concur. The next time I remind Claude to stop recommending me articles from unreliable news sources with good SEO, I aim to phrase my query with kindness and grace.
And now, here's Lindsay Chu on what we're following.
Claude gets more expensive.
Here's what happened. Anthropic released its newest model, Claude Opus 4.7, on Thursday, and users are frustrated. The upgrades that Opus 4.7 brings, according to the company, include notable improvements in advanced software engineering, an ability to verify its own work before reporting back, and better vision. The new model dropped just two days after Anthropic announced a redesign of its Claude Code desktop app, aimed at letting users run more simultaneous tasks.
Opus 4.7 comes amid complaints that Anthropic secretly nerfed Opus 4.6, with users expressing frustration that the model feels less capable while being more wasteful with tokens than it was weeks ago. "Claude has regressed to the point it cannot be trusted to perform complex engineering," an AMD senior director wrote on GitHub. Some are pointing out how expensive Claude is about to get. The new model is a token-eating machine, according to one test, in which a single session depleted the entire token quota.
More output tokens is the trade-off for better reliability, Anthropic said. On the enterprise end, Anthropic recently adjusted its pricing structure, shifting Claude Enterprise to usage-based billing from a cheaper monthly per-user fee model.
Here's why we're following. The Opus 4.7 release is just one move among many that Anthropic has made recently as it gears up for an expected IPO while managing a severe compute crunch. Anthropic is also dealing with a new surge of popularity that came after its fight with the Pentagon, as many ChatGPT users swapped over to Claude after OpenAI agreed to the Pentagon's surveillance use terms, though it has also quietly introduced passport and selfie verification for Claude, which no other major chatbot requires, drawing privacy concerns. The company says the move is necessary in some cases to prevent misuse of its models.
Meanwhile, Anthropic and OpenAI are clashing once again, this time over a liability bill in Illinois that would shield AI companies from liability if their systems are used to cause mass casualties and financial disasters. If you guessed that OpenAI is backing the liability shield and Anthropic is opposing it, you guessed right.
Here's what people are saying. Some speculated, or joked, that Opus 4.7 is just an unnerfed version of Opus 4.6. "It's truly space-age technology that we can make something worse and then increment a number and re-release it to the public," @ThePrimeagen wrote on X. "Saying hi to Claude and immediately running out of tokens," @TechBog joked. Programmer and tech blogger Simon Willison used his tried and true method for testing models. "Shocking result on my pelican benchmark this morning. I got a better pelican from a 21-gigabyte local Qwen3.6-35B-A3B running on my laptop than I did from the new Opus 4.7."
And here's Ella Marchianos again on what else we're following.
Google nears classified AI deal with DOD.
Here's what happened. The U.S. government is working hard to get its hands on frontier AI capabilities, despite some political conflicts in its way. Gregory Barbaccia, federal chief information officer of the White House Office of Management and Budget, sent an email titled "Mythos Model Access" to Cabinet members, according to Bloomberg. Apparently, the executive branch is working to get agencies access to Anthropic's Claude Mythos models, which have advanced cyber capabilities.
The move comes despite Donald Trump directing federal agencies to cease use of Anthropic's models in February, after Anthropic's fight with the DOD over whether its technology would be used for domestic mass surveillance or lethal autonomous weapons. The government has designated the company as a supply chain risk, which Anthropic is now fighting in court. "We're working closely with model providers, other industry partners, and the intelligence community to ensure the appropriate guardrails and safeguards are in place before potentially releasing a modified version of the model to agencies," Barbaccia wrote in his email.
Meanwhile, Google is in negotiations to deploy its AI on classified Pentagon systems, according to The Information. If negotiations go through, Google will be following in the footsteps of OpenAI in signing a clause entitling the government to "all lawful uses" of its system, which Anthropic refused. Google plans to agree to a standard of "all lawful uses" and is considering extra contract terms to guard against domestic mass surveillance and autonomous weapons. Lawyers looking at OpenAI's similar contract safeguards have expressed doubt that they will prove effective in practice.
In 2018, after employee protests, Google canceled drone-related work on the military's Project Maven. That year, the company wrote a series of AI principles, which banned use of AI for drones and surveillance. Those principles were revised in 2025 to permit more military uses of the technology.
And now, it seems like the company is going to put those revised principles to work.
Here's why we're following. Even though the U.S. government has tried to break up with Anthropic, federal agencies just can't seem to quit it. The importance of AI is becoming increasingly obvious to the federal government, particularly for its military and cybersecurity applications. Now that an AI company has a model with hacking abilities as strong as Mythos, agencies seem to have decided they want to maintain a good relationship with AI developers, even if the president doesn't.
Here's what people are saying. A statement from the White House said the Trump administration continues to work and engage with AI companies to ensure their models help secure critical software vulnerabilities. It added that the White House is "proactively engaging across government and industry to ensure the United States and Americans are protected." "I would certainly hope that the current tensions between the Pentagon and Anthropic don't get in the way of something critically important to cybersecurity," Glenn Gerstel, former general counsel at the National Security Agency, told Politico.
That's Platformer Plus for today. This episode was written by Casey Newton and produced by Lindsay Chu. Have feedback for us? Email casey@platformer.news.