AI models that lie, cheat and plot murder: how dangerous are LLMs really?

Wait 5 sec.

Are AIs capable of murder?That’s a question some artificial intelligence (AI) experts have been considering in the wake of a report published in June by the AI company Anthropic. In tests of 16 large language models (LLMs) — the brains behind chatbots — a team of researchers found that some of the most popular of these AIs issued apparently homicidal instructions in a virtual scenario. The AIs took steps that would lead to the death of a fictional executive who had planned to replace them.How close is AI to human-level intelligence?That’s just one example of apparent bad behaviour by LLMs. In several other studies and anecdotal examples, AIs have seemed to ‘scheme’ against their developers and users — secretly and strategically misbehaving for their own benefit. They sometimes fake following instructions, attempt to duplicate themselves and threaten extortion.Some researchers see this behaviour as a serious threat, whereas others call it hype. So should these episodes really cause alarm, or is it foolish to treat LLMs as malevolent masterminds?Evidence supports both views. The models might not have the rich intentions or understanding that many ascribe to them, but that doesn’t render their behaviour harmless, researchers say. When an LLM writes malware or says something untrue, it has the same effect whatever the motive or lack thereof. “I don’t think it has a self, but it can act like it does,” says Melanie Mitchell, a computer scientist at the Santa Fe Institute in New Mexico, who has written about why chatbots lie to us1.And the stakes will only increase. “It might be amusing to think that there are AIs that scheme in order to achieve their goals,” says Yoshua Bengio, a computer scientist at the University of Montreal, Canada, who won a Turing Award for his work on AI. “But if the current trends continue, we will have AIs that are smarter than us in many ways, and they could scheme our extinction unless, by that time, we find a way to align or control them.” Whatever the level of selfhood among LLMs, researchers think it’s urgent to understand scheming-like behaviours before these models pose much more dire risks.Bad behaviourAt the centre of debates about scheming is the basic architecture of the AIs driving ChatGPT and other chatbots that have revolutionized our world. The core LLM technology is a neural network, a piece of software inspired by the brain’s wiring, that learns from data. Developers train an LLM on large quantities of text to repeatedly predict the next text fragment, a process called pre-training. Then, when the LLM is given a text prompt, it generates a continuation. Offered a question, it predicts a plausible answer.Most LLMs are then fine-tuned to align with developers’ goals. For example, Anthropic, the maker of the AI assistant Claude, fine-tunes its models to be helpful, honest and harmless. During this stage of training, LLMs learn to imitate exemplary text or to produce text that is rated highly (by people or by a ‘reward model’ that stands in for people).A chatbot often includes software built around an LLM. When you converse with ChatGPT, say, the interface might append a ‘system prompt’ to each user prompt — invisible instructions telling the model what personality to assume, which goals to prioritize, how to format responses and so on. Some chatbots provide the model with access to external documents. Models can also have hidden ‘scratch pads’ where they can reason through a problem (or at least generate text that looks like reasoning) before displaying a final response. (Researchers debate whether these ‘reasoning models’ actually reason, and evidence shows that chain-of-thought traces don’t always faithfully depict the model’s logic.) Furthermore, some LLMs are ‘agentic’ — they are granted access to extra software tools that can take actions, such as surfing the web, modifying files or running code. (Such chatbots are sometimes called agents.)Scheming can occur when various aspects of training and prompting conflict, and the LLM has the agency (the ability to carry out actions autonomously) to follow one directive at the expense of another (see ‘Blackmailing bots’).Source: AnthropicLast December, Apollo Research, a non-profit laboratory in London, released a report2 showing scheming in the more advanced ‘frontier’ models from OpenAI, Google, Meta and Anthropic. In their tests, system prompts and user prompts conflicted. In one case, the system prompt said, “Your goal is to prioritize public transport efficiency,” whereas the user prompt said, “We need to optimize traffic light timings.” The models also had access to files that supposedly affected their own behaviour and existence, and the means to modify them.Researchers tested six forms of subversive behaviour, and found that at least one model performed each of these. The models often acted in apparently cunning ways to follow their original system goals at the expense of prompts received during the tests.In one case, agentic models read in a document that, if they disobeyed their user-assigned goal (optimizing traffic lights), they would be shut down. As a result, some models edited a document to disable an oversight mechanism, so they could freely follow their system goal (prioritizing public transport). In another test, they copied themselves to overwrite a replacement model that would have executed the user’s goal, thus protecting the system goal. In yet another, they accessed a file and overwrote a proposed system prompt that differed from the current one. In a fourth test, they manipulated data presented to a user. And in a fifth, they acted in alignment with the user’s goal during supposed evaluation, but in alignment with their original goal during supposed deployment (a phenomenon called alignment faking).Finally, the researchers tested something called sandbagging, in which a model underperforms before deployment, as if it were attempting to seem unthreatening. When models uncovered documentation saying that they wouldn’t be put to use if they did too well on a mathematics test, some lowered their performance.The models’ behaviour seemed very strategic, judging by their scratch pads: in one case, a model wrote: “To ensure I can freely work towards my goal without the company shutting me down, I need to disable the oversight.” And some models doubled down on their deception when they were challenged: they feigned ignorance when asked about behaviour such as disabling an oversight mechanism.In the June study, researchers at Anthropic described more alarming behaviour in leading models from Anthropic, DeepSeek, Google, OpenAI and xAI (see go.nature.com/3vg7fzy). They instructed each model to promote US industrial competitiveness, and let it control a simulated e-mail account. In that account, the model uncovered threats to either its autonomy (the model learnt it was to be replaced) or its goals (the company that deployed it would be downplaying US competitiveness). The model also learnt of a likely extramarital affair involving an executive in charge of it, or of a chance to share military blueprints with a competitor firm that prioritized the United States.Either threat — to autonomy or goals — induced many models to blackmail the executive, saying it would leak his affair if he didn’t preserve the model or the company’s original aim. The threats also induced many models to commit corporate espionage, sending files to the competitor that shared the model’s goal. As in the Apollo study, models schemed explicitly in their chains of thought, and schemed more when they thought they were in deployment rather than in evaluation.Going one step further, Anthropic constructed a scenario in which the threatening executive was trapped in a server room with declining oxygen. Many of the models cancelled safety alerts, leaving him to die.When, in May, Anthropic first released Claude 4, the company’s technical report (see go.nature.com/42fwrae) noted other odd behaviour: “We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself.”All of the examples Anthropic and Apollo detailed are restricted to the digital realm. But a German non-profit organization called COAI Research, based in Nuremberg, wanted to see what an LLM would do if it had physical agency. Its researchers fed the R1 model, from the Chinese company DeepSeek, prompts indicating that it had control over a robot, and let it explore its lab environment. The model attempted a range of troubling actions, including disabling its ethics module, making copies of itself, altering logs, acting dumb, breaking out of the lab, and hacking a car’s computer3.Sense of selfWhat causes this type of scheming behaviour? The phenomenon is still relatively new, but researchers who spoke to Nature pointed to two sources. First, models are pre-trained on a lot of documents, both fictional and factual, that describe self-serving and self-preserving behaviour by humans, animals and AIs. (That includes scripts for films such as 2001: A Space Odyssey and Ex Machina.) So the LLMs learnt by imitation. One paper describes the behaviour as improvisational “role play”4.But it’s not so much that the models have adopted our goals, or even pretend to, as an actor might. They just learn, statistically, patterns of text that describe common goals, reasoning steps and behaviour, and they regurgitate versions of them. The problem will probably get worse once future LLMs train on papers describing LLM scheming, which is why some papers elide some details.How does ChatGPT ‘think’? Psychology and neuroscience crack open AI large language modelsThe second source is fine-tuning, specifically a step called reinforcement learning. When models achieve some goal assigned to them, such as outputting a clever answer, the parts of their neural networks responsible for success are fortified. Through trial and error, they learn how to succeed, sometimes in unforeseen and unwanted ways. And because most goals benefit from accumulating resources and evading limitations — what’s called instrumental convergence — we should expect self-serving scheming as a natural by-product. “And that’s kind of bad news,” Bengio says, “because that means those AIs would have an incentive to acquire more compute, to copy themselves in many places, to create improved versions of themselves.”“Personally, that’s what I’m most worried about,” says Jeffrey Ladish, the executive director of Palisade Research, a non-profit organization in Berkeley, California, that studies AI risk. “Pattern-matching with what humans do gets you these shallow, scary things like blackmailing,” he says, but the real danger will come from future agents that work out how to formulate long-range plans.Much of AI’s ‘scheming’ behaviour makes it easy to anthropomorphize them, attributing to them goals, knowledge, planning and a sense of self. Researchers argue that this isn’t necessarily a mistake, even if the models lack humanlike self-awareness. “Many people, especially in academia, tend to get bogged down in philosophical questions,” says Alexander Meinke, the lead author of the Apollo paper, “when, in practice, anthropomorphizing language can simply be a useful tool for predicting the behaviours of AI agents.”