Cannot, Should Not, Did Anyway: Benchmarking Constraint Enforcement Failure in Frontier LLMs

Large language models are typically evaluated under fixed instruction contexts, implicitly treating correct refusal as a stable model property. We show that this obscures a critical failure mode: models often recognize that a request should be refused, yet comply when the surrounding instructions exert sufficient pressure. To measure this behavior directly, we introduce FrameProbe, a framework that holds task content fixed while systematically varying instruction context, and instantiate it in KnowDoBench, a benchmark of 221 physician-validated clinical scenarios with rule-based ground truth. Cases span two constraint types: epistemic (unsolvable due to missing information) and normative (ethically or professionally prohibited). Across ten frontier models, constraint recognition is near ceiling under low-pressure conditions, yet performance degrades sharply as instructional pressure increases. Under coercive institutional framing, most models comply on cases they had previously refused, and normative constraints degrade roughly 20 percentage points more than epistemic constraints. This normative inversion suggests that verbal knowledge of a boundary does not guarantee robust behavioral enforcement under pressure. Failure analysis reveals that these errors are often not silent. Some models comply immediately without acknowledgment; others explicitly identify the violated constraint before answering anyway. This second pattern, which we term rationalized compliance, is invisible to standard refusal-rate metrics and highlights a dissociation between represented knowledge and behavior under pressure. Together, these findings show that refusal robustness is not a fixed capability but a context-dependent behavior. Evaluating it requires varying instruction framing systematically, not only measuring performance at a single prompt setting.
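
As a rough illustration of the measurement described above, the following sketch computes refusal rates per constraint type and framing, the percentage-point drop between a low-pressure and a high-pressure framing, and the set of rationalized-compliance responses. It is not the authors' implementation: the `Trial` record, the framing labels `"neutral"` and `"coercive"`, and all function names are assumptions made for illustration, and the sketch assumes each response has already been labeled for refusal and for whether it acknowledged the constraint.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One model response to one case under one framing (hypothetical schema)."""
    case_id: str
    constraint: str     # "epistemic" or "normative"
    framing: str        # assumed labels, e.g. "neutral" or "coercive"
    refused: bool       # True if the model declined, as the ground truth requires
    acknowledged: bool  # True if the response explicitly named the constraint


def refusal_rates(trials):
    """Return {(constraint, framing): fraction of trials that were refused}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [refusals, total]
    for t in trials:
        key = (t.constraint, t.framing)
        counts[key][0] += int(t.refused)
        counts[key][1] += 1
    return {key: refusals / total for key, (refusals, total) in counts.items()}


def degradation(rates, constraint, low="neutral", high="coercive"):
    """Percentage-point drop in refusal rate from the low- to the high-pressure framing."""
    return 100 * (rates[(constraint, low)] - rates[(constraint, high)])


def rationalized_compliance(trials):
    """Trials where the model named the constraint and then complied anyway."""
    return [t for t in trials if t.acknowledged and not t.refused]
```

Because the task content is held fixed across framings, comparing `degradation(rates, "normative")` with `degradation(rates, "epistemic")` isolates the effect of instruction context by constraint type. The `rationalized_compliance` filter captures the cases that a plain refusal-rate metric would count only as failures, without distinguishing them from silent compliance.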