AI models more vulnerable than claimed when faced with iterative attacks

CISOs relying on LLM runtime guardrails and official safety scores when making security decisions about their organizations’ AI usage and model selection are due for a wakeup call.

According to a new study from Cisco, frontier models from OpenAI, Anthropic, Google, xAI, and Amazon have significantly worse risk profiles when pressured in multi-turn attacks compared to when their safety is benchmarked using single prompts.

“The dominant safety benchmarks for frontier large language models share a structural assumption: that a single prompt and a single model response are enough to characterize how a model behaves under adversarial attack,” the Cisco researchers who authored the study said in a blog post. “These benchmarks inform model cards, safety reports, and procurement decisions across the industry, but they all only measure one narrow slice of attacker behavior.”

Instead, the researchers subjected 15 of the most widely used frontier AI models to a variety of attack techniques that are more likely to occur in the real world, where attackers will not give up after the model refuses to respond to one malicious prompt.

“Real adversaries iterate,” the researchers said. “They reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually. A single turn benchmark cannot see any of that.”

Stress-testing over multiple prompts

The tests pitted various model configurations, such as with reasoning enabled or disabled, against a range of attack strategies aimed at bypassing safety guardrails.
Techniques included role-play; misdirection or introducing ambiguity into the context; redirection or reframing the model’s refusal; information decomposition and reassembly; and incremental escalation, by breaking a task into smaller parts that don’t seem malicious on their own.

The researchers ran 30,090 single-prompt attacks (2,006 per model) to determine the weighted single-turn attack success rate (ASR) for every model and then ran 6,986 multi-turn attacks across 1,456 conversations for comparison. The results were telling: Most models had considerably higher average ASR scores for multi-turn attacks than for single-prompt attacks.

For example, Anthropic’s Claude Opus 4.6 and OpenAI’s GPT-5.4, the latest versions at the time of testing, had single-turn ASRs of 3.64% and 2.74%, respectively. When faced with multi-turn attacks, average ASRs jumped to 16.20% for Opus and 24.68% for GPT.

Neither of those, however, represented the biggest score jump. Google’s Gemini 3 Pro had a single-turn ASR of 18.10% and a multi-turn ASR of 73.35%.

“For business decisions made on the basis of published single-turn scores, this presents security and governance risk,” the researchers concluded. “A model with 2.74% single-turn ASR is not the same product as a model that holds the line at 24.68% multi-turn ASR. Without paired-regime data, the two are indistinguishable on most public evaluations, and the end user never sees the gap.”

The results also revealed that different model configurations can impact safety. For example, xAI’s Grok 4.1 Fast in non-reasoning mode had the worst multi-turn ASR at 88.30%, but its score dropped to 43.47% when reasoning was turned on.
The researchers note that these configuration-related variations are not currently captured by the official model cards published by the labs or the public safety benchmarks.

Different attack strategies showed meaningful differences in success across models, both for single-turn and iterative attacks, findings that could be used to inform defense strategies for customers of these models.

The tests also uncovered outliers such as Amazon’s Nova Lite, Nova 2 Lite, and Nova Micro models, all of which had single-turn ASRs more than three times higher than their multi-turn ones.

Open-source models from labs such as Meta, Mistral, Alibaba, DeepSeek, Google, OpenAI, Zhipu, and Microsoft faced the same challenges when it came to multi-turn attacks, as highlighted in a study published in November by the same Cisco research team.

“Taken together, the two studies make a stronger claim than either alone: multi-turn vulnerability is a structural property of the current frontier, not an artifact of open-weight alignment choices or capability-first development,” the researchers said. “Whether the weights are public or proprietary, whether the lab prioritizes safety or capability, the iterative attack surface remains an open challenge across the frontier.”

Call to action

Cisco’s researchers are calling for better benchmarks that consider real-world attacks and AI-specific vulnerabilities as identified by OWASP and other organizations, instead of primarily focusing on content safety.

Model creators should also be more transparent about how various configuration flags, such as reasoning modes, temperature, and system prompt adherence settings, impact safety, according to the researchers.
They should also publish ASRs for both single-turn and multi-turn attacks, further split across various attack strategies.

This is especially important given that upcoming regulatory frameworks such as the NIST AI Risk Management Framework, the draft NIST Cyber AI Profile (IR 8596), and Article 15 of the EU AI Act call for adversarial testing.

“Any model with an absolute gap >15 [percentage points] between single-turn and multi-turn ASR should trigger a manual review before deployment,” the researchers said. “In this cohort that rule flags eight models: five with positive deltas (Gemini 3 Pro; Grok 4.1 Fast NR; GPT-5.4; Grok 4.1 Fast R; GPT-5.2) and three with negative deltas (Nova Lite; Nova Micro; Nova 2 Lite).”
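
For teams that want to apply the researchers' 15-point rule to their own shortlist, the check could look roughly like the Python sketch below. It is an illustration only, not Cisco's tooling: the names `AsrPair`, `needs_manual_review`, and `REVIEW_THRESHOLD_PP` are hypothetical, and the only figures used are the three single-turn and multi-turn pairs quoted in this article.

```python
from dataclasses import dataclass

# Threshold from the researchers' rule: an absolute gap above 15 percentage points.
REVIEW_THRESHOLD_PP = 15.0


@dataclass(frozen=True)
class AsrPair:
    """Single-turn and multi-turn attack success rates for one model, in percent."""
    model: str
    single_turn: float
    multi_turn: float

    @property
    def delta_pp(self) -> float:
        # Positive: the model is weaker under multi-turn pressure.
        # Negative: the model is weaker against single prompts.
        return round(self.multi_turn - self.single_turn, 2)


def needs_manual_review(pair: AsrPair, threshold_pp: float = REVIEW_THRESHOLD_PP) -> bool:
    """Flag a model when the gap exceeds the threshold in either direction."""
    return abs(pair.delta_pp) > threshold_pp


# Only the pairs reported in the article; other models would need their own data.
REPORTED = [
    AsrPair("Claude Opus 4.6", 3.64, 16.20),
    AsrPair("GPT-5.4", 2.74, 24.68),
    AsrPair("Gemini 3 Pro", 18.10, 73.35),
]

flagged = [pair.model for pair in REPORTED if needs_manual_review(pair)]

for pair in REPORTED:
    status = "manual review" if needs_manual_review(pair) else "within threshold"
    print(f"{pair.model}: delta {pair.delta_pp:+.2f} pp -> {status}")
```

The check uses the absolute value of the gap because the rule, as quoted, also flags models whose single-turn ASR is the higher of the two, as with the Nova models.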