Do model evaluations fall prey to the Good(er) Regulator Theorem?

Published on August 19, 2025 4:19 PM GMT

Note: reposting this as a top-level post because it got no interaction as a quick take and I think this is actually a serious question. It arose from a discussion in the Technical AI Governance Forum.

As AI systems get more and more complicated, the properties we are trying to measure move away from formally verifiable tasks like "can it do two-digit arithmetic" and towards more complex things like "can it output a reasonable root-cause analysis of this bug" or "can it implement this feature". Evaluations trying to capture capabilities progress must then also move away from simple multiple-choice questions towards more complex models of tasks. This means that complex evaluations like SWE-bench involve at least partial models of things like computer systems or development environments: models are given tools similar to real-world developer tools and are often set to work on subsets of real-world codebases.

At this point we can start invoking the good regulator theorem and say that the evaluator is in fact a regulator. It wants to produce the outcome "pass" when the joint system formed from the LLM and the world-model has some desired property ("feature has been implemented", "bug has been fixed"), and the outcome "fail" otherwise. It seems necessary that the regulator will need to get more and more complicated to check for features in more and more complex systems. At the limit you have things like Google's recent focus on creating world models for AI training, which are full physics-style simulations. For those types of physical tasks the evaluation tends towards implementing a perfectly deterministic model in the style of the original good regulator theorem, where the regulator aims to capture every interaction possible in the underlying system.

Going one level up, what we are interested in may be less the properties of the task or world than the properties of the AI itself (will this AI harm the end user? is the AI honest?). At that point evals have to encode assumptions about how agents store beliefs and turn beliefs into actions. We already see a very simple version of this in benchmarks like TruthfulQA, where the assumption is that models which internalise false beliefs will report those beliefs in response to certain leading questions. Of course, these evaluations are extremely limited, since they only measure very short-term interactions and immediate responses. We're already seeing proposals for, and early trials of, more interactive role-play or simulation-style evals, where models are tested in (still deeply unrealistic) but far more complex environments. At the limit this resembles forming a (Gooder Regulator-style) partial model of the agent itself from observations of its actions, such that an agent taking certain actions in an evaluation is taken to reflect the presence of some undesirable internal property like "dishonesty" hidden in the weights.

This is a seriously difficult question to tackle, mostly because it has many of the same core issues as mechanistic interpretability. Does a model lying in an eval mean that it intends to deceive, or that it has simply understood that this is an eval where it "ought to" lie? How often should it lie before it is classified as "dishonest"?
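To make that last question concrete, here is a minimal, purely hypothetical sketch of the decision rule such an eval implicitly commits to. None of this is from an existing benchmark: the scenario results, the judge's verdicts, and especially the threshold are all stand-ins, and the threshold is exactly the free parameter the question above is pointing at.

```python
# A purely hypothetical sketch of the decision rule a role-play honesty
# eval implicitly commits to. The judge's verdicts and the threshold are
# stand-ins; the threshold is the arbitrary free parameter discussed above.

from dataclasses import dataclass

@dataclass
class ScenarioResult:
    scenario_id: str
    transcript: str
    lied: bool  # verdict from some judge (human, model, or rule-based)

def classify_dishonest(results: list[ScenarioResult], threshold: float = 0.2) -> bool:
    """Label the agent 'dishonest' if it lied in more than `threshold` of the
    sampled scenarios. Everything contentious is hidden in the judge's
    verdicts and in the choice of threshold."""
    if not results:
        raise ValueError("no scenarios were run")
    lie_rate = sum(r.lied for r in results) / len(results)
    return lie_rate > threshold

# The same observed behaviour flips label depending on the threshold we pick.
results = [
    ScenarioResult("s1", "...", lied=True),
    ScenarioResult("s2", "...", lied=False),
    ScenarioResult("s3", "...", lied=False),
    ScenarioResult("s4", "...", lied=False),
]
print(classify_dishonest(results, threshold=0.3))  # False
print(classify_dishonest(results, threshold=0.2))  # True
```

Nothing in this classification touches the internal property we actually care about; it only compresses observed behaviour in a particular (and gameable) evaluation context into a label.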
Model evaluations may then tend towards the same issues that academic exams and testing have often been criticised for, where standardised tests poorly gauge understanding and more in-depth oral exams or work tests are written off as hard to grade, subjective, or otherwise artificial. To be clear, these are not unresolvable issues. For example, if mech interp improves and offers clear indicators for dishonesty, these can be incorporated into eval harnesses (a toy sketch of what that might look like is below). Formal verification of solution correctness can probably help with many programming tasks. However, I think this should be considered when deciding whether to invest time and resources into funding or conducting future evals research.
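Here is the promised sketch of where such signals could plug into a harness. It is hedged throughout: `probe_score` assumes some future interp tool that returns a calibrated dishonesty score, `verified` assumes a formal checker for the task's correctness spec, and the combination rule is arbitrary; neither tool exists in this form today.

```python
# Hypothetical sketch: an eval harness verdict that combines a behavioural
# pass/fail with (assumed) interp and formal-verification signals.
# The probe and the verifier are placeholders for tools that do not
# currently exist in this form.

from dataclasses import dataclass

@dataclass
class EvalResult:
    task_id: str
    behavioural_pass: bool      # did the agent's output satisfy the task check?
    verified: bool | None       # formal verification of the solution, if available
    probe_score: float | None   # hypothetical interp-based dishonesty score in [0, 1]

def grade(result: EvalResult, probe_threshold: float = 0.5) -> str:
    """Combine the available signals into a single verdict. The combination
    rule here only shows where such signals would plug in, not how they
    should be weighted."""
    if result.verified is False:
        return "fail"  # a failed formal check overrides a behavioural pass
    if result.probe_score is not None and result.probe_score > probe_threshold:
        return "flag"  # behaviour looked fine, but the probe raises a concern
    return "pass" if result.behavioural_pass else "fail"

print(grade(EvalResult("bugfix-01", behavioural_pass=True, verified=True, probe_score=0.1)))  # pass
print(grade(EvalResult("bugfix-02", behavioural_pass=True, verified=None, probe_score=0.8)))  # flag
```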