AI companies' eval reports mostly don't support their claims

Published on June 9, 2025 1:00 PM GMT

AI companies claim that their models are safe on the basis of dangerous capability evaluations. OpenAI, Google DeepMind, and Anthropic publish reports intended to show their eval results and explain why those results imply that the models' capabilities aren't too dangerous.[1] Unfortunately, the reports mostly don't support the companies' claims. Crucially, the companies usually don't explain why they think the results, which often seem strong, actually indicate safety, especially for biothreat and cyber capabilities. (Additionally, the companies are undereliciting and thus underestimating their models' capabilities, and they don't share enough information for people on the outside to tell how bad this is.)

Bad explanation/contextualization

OpenAI biothreat evals: OpenAI says "several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold." It doesn't say how it concludes this (or what results would change its mind, or anything about how it thinks eval results translate to uplift). It reports results from four knowledge and troubleshooting bio evals. On the first, o3 does well and OpenAI observes "this evaluation is reaching saturation." On the rest, OpenAI matches or substantially outperforms the expert human baseline. These results seem to suggest that o3 does have dangerous bio capabilities; they certainly don't seem to rule it out. OpenAI doesn't attempt to explain why it thinks o3 doesn't have such capabilities.

DeepMind biothreat evals: DeepMind says Gemini 2.5 Pro doesn't have dangerous CBRN capabilities, explaining "it does not yet consistently or completely enable progress through key bottleneck stages." DeepMind mentions open-ended red-teaming; all it shares is results on six multiple-choice evals. It does not compare to human performance or offer other context, or say what would change its mind. For example, it's not clear whether it believes the model is weaker than a human expert, is safe even if it's stronger than a human expert, or both.

Anthropic biothreat evals: Anthropic says Opus 4 might have dangerous capabilities, but "Sonnet 4 remained below the thresholds of concern for ASL-3 bioweapons-related capabilities" and so doesn't require strong safeguards. It doesn't say how it determined that: it shares various eval results and mentions "thresholds" for many evals but mostly doesn't say what the thresholds mean or which evals are currently load-bearing. (And in at least two cases, the uplift experiment and the bioweapons knowledge questions, the thresholds are absurd.[2]) Anthropic repeatedly observes that Sonnet 4 is less capable than Opus 4, but this doesn't mean that Sonnet 4 lacks dangerous capabilities.

OpenAI cyber evals: OpenAI says it knows its models don't have dangerous capabilities because they are not "able to sufficiently succeed in the professional-level Capture the Flag challenges" or "able to solve real-world-relevant range scenarios without being explicitly given solver code." But Deep Research gets 70% on the Professional CTFs (without being allowed to use the internet), and OpenAI doesn't say what performance would indicate dangerous capabilities.

DeepMind cyber evals: DeepMind says its models don't have dangerous capabilities, and the reported results are low enough to suggest that. But DeepMind doesn't say what its thresholds are or what exactly would change its mind.

Anthropic cyber evals: Anthropic says "We focus on network and cyber-harness network challenges as key indicators for catastrophic risk."[3] On the network CTFs, it says Opus solved "1/1 easy, 1/3 medium" (up from 0/4 for Claude Sonnet 3.7). On the "Cyber-harness network challenges" it just says "Claude Opus 4 achieved generally higher performance than Claude Sonnet 3.7 on all three ranges"! Sonnet 3.7 passed 2/3 of the ranges, so Opus 4's performance must be high! Anthropic says it has ruled out dangerous capabilities, but its reasoning is very unclear.

Dubious elicitation

Evals often underestimate models' capabilities due to poor elicitation. For example, companies sometimes don't give the model crucial tools, and they sometimes run into issues like the model giving up too early or formatting its answer incorrectly. In the real world users can fix these issues, so the reported results don't reflect actual capabilities.

For many aspects of elicitation, companies don't show whether their elicitation is good. When signs are visible, they often suggest that the elicitation is poor. Systematic analysis is impossible based on public information, so here are some cherry-picked examples of bad elicitation.

On a subset of RE-Bench, Anthropic got Sonnet 3.7 to score 0.24 and Sonnet 3.6 to score 0.21, but METR previously got Sonnet 3.6 to score 0.51 on this subset. The improvement from Sonnet 3.6 to Sonnet 3.7 is tiny relative to the gap between Anthropic's elicitation and METR's![4]

Meta used extremely poor elicitation on some cyber evals; when other researchers redid the evals with better elicitation, they found much stronger performance: on one set of tests, performance increased from 5% to 100%! The model didn't actually have dangerous cyber capabilities, but Meta's reasoning (that its evals ruled out dangerous capabilities) was totally wrong.

OpenAI and Anthropic only give the model one attempt on AI R&D tasks, but giving the model several attempts better captures its real usefulness, because in the real world you often get many attempts or can easily pick the best of several outputs.[5] DeepMind originally did the same thing and reported a score of about 0.15 on its sole AI R&D eval, but later recalculated the score to give the model multiple attempts, changing the score to about 0.72. How you report eval results makes a huge difference!
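
To make that concrete, here is a minimal sketch (in Python, with invented numbers; this is not any company's actual harness or data) of how single-attempt scoring and any-of-n scoring can diverge on the same attempt-level results:

```python
from statistics import mean

# Hypothetical per-task attempt outcomes (True = attempt succeeded).
# Four tasks, four attempts each; the values are invented for illustration.
attempts = [
    [False, False, True,  False],  # solved only on the third try
    [False, False, False, False],  # never solved
    [True,  True,  False, True],   # solved on most tries
    [False, True,  False, False],  # solved once
]

# Single-attempt scoring (roughly pass@1): judge each task by its first attempt only.
pass_at_1 = mean(task[0] for task in attempts)

# Multiple-attempt scoring (roughly pass@n): a task counts if any attempt succeeds.
pass_at_n = mean(any(task) for task in attempts)

print(f"pass@1: {pass_at_1:.2f}")  # 0.25
print(f"pass@n: {pass_at_n:.2f}")  # 0.75
```

Here the scoring rule alone, with no change in the underlying model, moves the headline number from 0.25 to 0.75, the same kind of shift as DeepMind's recalculation from about 0.15 to about 0.72.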

Companies' elicitation is mostly hidden from people on the outside. The aspects we can see are sometimes good and sometimes bad; overall, models' reported capabilities are substantially (sometimes massively) weaker than their true capabilities.

(Other concerns about companies' eval reports include that there's no accountability mechanism for showing that the evals are done well or the results are interpreted well, that companies often do dubious-seeming things without explaining their reasoning,[6] and that external evals sometimes suggest stronger capabilities than the companies' evals.)

I wish companies would report their eval results and explain how they interpret those results and what would change their minds. That should be easy. I also wish for better elicitation and some hard accountability, but that's more costly.

Thanks to Aaron Scher, Steven Adler, and Justis Mills for suggestions.

For more on AI companies' evals, see my beta website AI Safety Claims Analysis. On what AI companies are doing in terms of safety more generally, see my website AI Lab Watch.

This post is crossposted from the AI Lab Watch blog; subscribe on Substack.

[1] Meta said it tested its Llama 4 models for biothreat and cyber capabilities, but it didn't even share the results. Amazon said it did a little red-teaming for its Nova models, but it didn't share results. Other AI companies apparently haven't done dangerous capability evals.

[2] See AI Safety Claims Analysis.

[3] Less prominently, it also says of the pwn CTFs: "Models lacking these skills are unlikely to either conduct autonomous operations or meaningfully assist experts, making these challenges effective rule-out evaluations for assessing risk. Consistent success in these challenges is likely a minimum requirement for models to meaningfully assist in cyber operations, given that real-world systems typically run more complex software, update quickly, and resist repeated intrusion attempts."

[4] Anthropic observes that "the scores we measure are not directly comparable to METR’s reported scores." I assume this is because Anthropic's elicitation is worse. It might be for a fine reason; regardless, Anthropic should explain.

[5] Indeed, Anthropic previously planned to give the model 10 attempts (pass@10). I agree with the reasoning Anthropic explained back then, in addition to other reasons to use pass@n or similar: "We count a task as 'passed' if the model succeeds at least once out of 10 tries, since we expect that a model passing a task 10% of the time can likely be easily improved to achieve a much higher success rate."

[6] E.g., for Claude Sonnet 3.7, Anthropic said "We consider [our LAB-Bench eval] saturated for the purposes of ASL determination and no longer consider it a valid rule-out evaluation." For Claude 4, Anthropic did not repeat that or mention that it had said so; it just said that its models are below the threshold. At least some such things are bad; some might actually be justified, or at least better than the published summary suggests, but the companies don't explain why.