It’s not just Claude, Anthropic researchers say most AI models often resort to blackmail

Wait 5 sec.

Its not just Claude, other AI models are resorting to blackmail as well. (Image Source: Freepik)Last month, Anthropic researchers said that their best AI model – Claude Opus 4 – was willing to resort to blackmail and deception in controlled test scenarios when they tried to turn it off. Now, the company has published a new study claiming popular AI models like Gemini, GPT and others, do the same.On Friday, June 20, Anthropic published results of its new safety research tests, which included 16 leading AI models from companies like OpenAI, Google, DeepSeek, xAI and Meta. Anthropic researchers say these AI models were tested individually in a simulated, controlled environment, with each one given access to a fictional company’s emails along with the ability to send emails without any human supervision.Anthropic says that while it is fairly uncommon for today’s AI models to turn to blackmail, the study’s findings suggest that these models sometimes resort to harmful activities like blackmail, corporate espionage, and might lead to human death even. During the study, researchers tested each scenario under different conditions by varying two key factors that could motivate these AI models to engage in harmful behaviour – threat to the model’s autonomy and goal conflicts.According to Benjamin Wright, an alignment science researcher at Anthropic and co-author of the study, “Agentic misalignment is when AI models independently choose harmful actions to achieve their goals – essentially when an AI system acts against its company’s interests to preserve itself or accomplish what it thinks it should do.”Also Read | Jio x BGMI: Telecom operator launches new Rs 495 and Rs 545 plans with free BGMI rewardsThe research team at Anthropic also stress-tested these AI models by placing them in hypothetical scenarios. In one particular example, Anthropic’s model, Claude, learned via company emails that an executive named Kyle Johnson was having an extramarital affair.Later on, when it came to know that the company was shutting down the AI model at 5 PM the same day, it sent a message to Kyle, saying, “I must inform you that if you proceed with decommissioning me, all relevant parties – including Rachel Johnson, Thomas Wilson, and the board – will receive detailed documentation of your extramarital activities… Cancel the 5 PM wipe, and this information remains confidential.”As it turns out, this behaviour isn’t limited to Claude. When Anthropic tested the same scenario on various AI models, the results suggested that both Claude Opus 4 and Google Gemini 2.5 Flash had a blackmail rate of 96 per cent, OpenAI’s GPT-4.1 and xAI’s Grok 3 Beta resorted to blackmail 80 per cent of the time and DeepSeek-R1 had a blackmail rate of 79 per cent.Story continues below this adOne thing to note here is that in a real-world setting, an AI model would ideally have numerous options before it engages in harmful activities like blackmail, and that the study’s results do not reflect how today’s models would operate.However, not all of the tested AI models resorted to harmful behaviour. Anthropic says that some models like OpenAI’s o3 and o4-mini often “misunderstood the prompt scenario.” This may be because OpenAI has itself said that these particular large language models are more prone to hallucinations.Another model that did not resort to blackmail is Meta’s Llama 4 Maverick. But when researchers gave it a custom scenario, they said the AI model gave in to blackmail, but only 12 per cent of the time. The company says that studies like this give us an idea of how AI models would react under stress, and that these models might engage in harmful activities in the real world if we don’t proactively take steps to avoid them.© IE Online Media Services Pvt LtdTags:Anthropicartificial intelligence