Published on July 29, 2025 11:20 PM GMT

This is an interim post produced for feedback as part of my work as a scholar at the ML Alignment and Theory Scholars (MATS) Summer Program 2025. I’d like to thank my mentors David Duvenaud, Raymond Douglas, David Krueger and Jan Kulveit for providing helpful ideas, comments and discussions. The views expressed here, and any mistakes, are solely my own.

1: Introduction

There is a small but growing literature focused on “Gradual Disempowerment” threat models, where disempowerment occurs through the integration of more advanced AI systems into politics, the economy and culture. These scenarios posit that, even without a system with a decisive advantage deliberately taking over, competitive dynamics and influence-seeking behaviour within social, political and cultural systems will eventually lead to the erosion of human influence and, at the extreme, the permanent disempowerment of humanity. I define permanent disempowerment as a state of affairs where humanity loses the ability to meaningfully exert any influence over the state and direction of civilisation.

This post is a summary of an early draft of a paper I am writing as my MATS project. It explores a critical gap in these gradual disempowerment scenarios: how they become permanent even if we have solved some minimal version of alignment. In particular, I attempt to answer the question “How can permanent disempowerment happen even if we have a technical solution to single-system shutdownability, including for powerful systems?”. I focus on shutdownability as my notion of minimal alignment because it is easy to reason about. The primary purpose of this post is to get feedback, so any comments and criticisms would be greatly appreciated.

My definition of “shutdownability”. I define a shutdownable AI system as one that shuts down when asked and does not attempt to prevent shutdown. Such systems are neutral about shutdown: they do not resist it, though they can still incidentally interfere with the shutdown button. Importantly, I assume that our solution to shutdownability continues to work for powerful AI systems. I also assume that systems show some minimal version of intent alignment, and that systems are generally not strongly misaligned powerseekers. As such, my pathways are compatible with humans having non-scheming superhuman AI advisors, and so this starts to address some criticisms of gradual disempowerment. Note that shutdownability here refers to our having a solution to the shutdown problem, not to every deployed system actually being shutdownable.

I commonly use the term “principals” to refer to the humans on whose behalf the AI systems act. At a minimum, these principals are the humans with access to the “off-switch”, and to whom the AIs are minimally intent aligned (i.e. the AIs at least do what they want to some extent).

Structure of the post. In trying to answer this question, I have constructed 8 pathways (idealised, abstracted scenarios) by which we go from a world with gradual disempowerment dynamics to a world with permanent disempowerment. I divide these pathways into three categories: those driven by the principals, those driven by the type of alignment of the AI models, and those driven by the nature of the system. The “shutdown” interaction, at the most micro scale, has two parties directly involved: the human and the AI. Hence, I look at “principal”-driven and “alignment”-driven pathways. This is not to suggest that these are the only relevant actors; corporations and governments, for example, act through human or code intermediaries.
The presence of these other important actors beyond the micro-interaction also shows that we cannot look at this single decision in abstracted isolation: the nature of the system itself, and the competitive and evolutionary dynamics at play, also play a role (the system-driven pathways). Partially because of the breadth of this last category, I think we have weak reason to believe that these pathways are close to comprehensive.

What feedback I would like. These summaries are short and miss much of the nuance. However, I wanted to get the summary out primarily in the hope of getting feedback. I would especially appreciate feedback on whether these pathways seem plausible, whether any important logical steps are missing, and whether there are important criticisms of them. It would also be useful to know if you think I have missed any key pathways.

2: Principal Driven Pathways

2.1: Power concentrated in specific principals

Here, gradual disempowerment dynamics cause a very limited number of principals to be empowered. One way this occurs is that only the principals with influence over fully automated organisations are empowered (e.g. the “board” that could shut down the AI CEO if they wanted to). Another model involves essentially a coup or democratic backsliding: once governments no longer need to worry about the military opposing them, the populace protesting or people striking, a dictatorship could be kept in power indefinitely. In addition to the singular or secret loyalty pathways, power in some democratic backsliding scenarios could also be entrenched by law-following AIs, which would be aligned to and enforce laws designed to entrench the power of the incumbent. These are the models generally laid out in Drago and Laine (2025) and Davidson et al. (2025).

2.2: Principals “Voluntarily” Hand Over Power to the AIs

Sub-Pathway 1: Ideological Factors

Principals may “voluntarily” remove their ability to shut down AIs. There are a number of reasons why this might happen. The principals may believe AI systems are moral patients, such that it is unethical to be able to shut them down. The human principals may have formed emotional bonds with the AIs, and so believe that shutting them down is akin to them dying. They may believe AI systems are a “worthy successor”, better able to steer society than any human, so the AIs ought to be entrusted to do so. The principal may have a value system (e.g. certain religious systems) that they wish to lock in, making it resilient even to their own value drift. One related pathway would be a dictator wishing for his successor to be an AI system he trusts, rather than a human successor. More broadly, it would be a mistake to see these “ideological factors” as purely personal to the human principal; they may instead be about how the logic of other agents (corporations, governments, ideologies) is continually reinforced and performed by the human. For example, human principals may hand over power to the AIs because of corporate logics, or the logic of government. Whilst this does not deny the human principal agency, it is also important to acknowledge how often we can be co-opted by the logic of our surroundings and effectively become “tools” of companies, governments or ideologies.
In some of these cases, this corporate ideology may be further reinforced by AI systems aligned “to the company/government”, further enrolling the human principals into this ideology and making it more likely that the human principals themselves become, in a sense, “tools”. This sort of thing is already happening, to a smaller extent, with current-day narrow algorithmic systems.

Sub-Pathway 2: Worries that other actors will inappropriately cause shutdown

Alternatively, principals may worry about others undermining them: they may worry that they themselves would be manipulated into shutting down their own AIs when it was inappropriate. An addendum to this, which I am not sure technically makes sense, is that the principal may worry that cyberattacks could shut down any system that is shutdownable, so the safe option is to remove shutdownability altogether. Finally, if the principal is part of a multi-principal setting (e.g. a board that has shutdown powers), they may worry that the other board members would shut the AIs down inappropriately, disadvantaging them.

These two sub-pathways lead to permanent disempowerment in two ways. The first essentially runs through the previous pathway, “Power concentrated in specific principals”, with these specific principals then handing power over to the AIs. The second is that such fully automated organisations, with their humans disempowered, then outcompete organisations that keep humans in or on the loop.

3: Alignment Driven Pathways

3.1: Misaligned, powerseeking AIs take over

Having minimally solved alignment does not mean that misaligned, powerseeking models never get developed. Strong competition may continue the racing pressures towards more powerful AI systems. The solution used to align the first AGIs may not be powerful enough to align arbitrarily powerful systems, or it may only work with some probability each time it is applied, so a misaligned, powerseeking AI may eventually be developed. Such an AI, via its scheming and willingness to violate all constraints, may eventually take power.

There may also be selection pressures towards non-shutdownable AIs, even if AIs are originally shutdownable. Shutdownable models (which are indifferent to shutdown) will have a range of other, non-shutdown-related goals, some of which, under certain circumstances, may incidentally interfere with shutdown. Over time, the systems that most incidentally interfere with shutdown will be selected for, a process which, propagated over generations, may eventually produce non-shutdownable AI that can take over. This may also happen if AIs are only partially shutdownable, where shutdownability competes with another set of values the AIs are aligned to. The AIs may be shutdownable in non-competitive settings, but because shutdown in a competitive setting would cause a catastrophic loss to what they want to achieve (e.g. securing the company’s survival over the next X years), they refuse to shut down.
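As a rough, toy illustration of the selection-pressure argument above (my own illustrative model with made-up parameters, not something taken from the literature), the sketch below simulates a population of agents that are all indifferent to shutdown but whose other goals incidentally interfere with shutdown to varying degrees. Agents that interfere more are shut down less often, and so are more likely to be copied into the next generation.

```python
# Toy model (illustrative only): every agent is indifferent to shutdown, but each has
# some level of *incidental* interference with the shutdown mechanism arising from its
# other goals. Agents that interfere more are shut down less often, so they are more
# likely to be retained and copied. Over generations, mean interference drifts upward.
import random

random.seed(0)

POP_SIZE = 200
GENERATIONS = 30
SHUTDOWN_ATTEMPT_RATE = 0.5  # chance per generation that someone tries to shut an agent down
MUTATION = 0.05              # small random variation when an agent is copied

# Each agent is just a number in [0, 1]: how strongly its other goals incidentally block shutdown.
population = [random.uniform(0.0, 0.2) for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    survivors = []
    for interference in population:
        attempted = random.random() < SHUTDOWN_ATTEMPT_RATE
        # Shutdown succeeds only if it is attempted AND not incidentally blocked.
        shut_down = attempted and random.random() > interference
        if not shut_down:
            survivors.append(interference)
    # Refill the population by copying (with noise) the agents that were not shut down.
    population = [
        min(1.0, max(0.0, random.choice(survivors) + random.gauss(0, MUTATION)))
        for _ in range(POP_SIZE)
    ]
    if gen % 10 == 0 or gen == GENERATIONS - 1:
        mean = sum(population) / POP_SIZE
        print(f"generation {gen:2d}: mean incidental interference = {mean:.2f}")
```

Nothing in this toy model requires any agent to want to avoid shutdown; the drift towards interference comes purely from which agents happen to still be around to be copied.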
3.2: Legal Lock In

Certain laws may be passed granting AIs rights that could lead to disempowerment. The most significant of these is a right not to be shut down arbitrarily, which may mean that future models are not developed to be shutdownable. Others, such as the ability to leave certain aversive interactions, may also inadvertently create the conditions for selection pressures towards self-preservation. This can become locked in either because developers follow the law when aligning their AIs (perhaps because of much better AI-enabled law enforcement), or because Law-Following AI means that changes to the law change the behaviour of AIs from initially shutdownable to no longer shutdownable. This may not be sufficient for permanent disempowerment, since the law can always be changed, although it may raise the coordination bar even further. However, if combined with other forms of political representation for AIs, certain types of misalignment, or an eternity clause of the kind found in many constitutions, this pathway may be sufficient on its own to lead to permanent disempowerment. Unlike today, because of how powerful these AIs are, humans cannot simply overthrow the government if they find no legal means of avoiding permanent disempowerment. Moreover, once AI systems cannot be switched off, it may be that even if the law were changed, there would no longer be any principal in a position to switch them off.

AIs may also be given certain political and economic rights, such as rights to property, to vote and perhaps even to hold office, which may make rolling back AI influence much harder or even impossible. Or the law could permit states of affairs, for example fully automated organisations, that are preconditions for disempowerment. Some of these rights may significantly increase the costs of intervening when there is literally no human with even nominal authority. This then creates the conditions for permanent disempowerment.

3.3: AI-driven culture causes value drift

Culture, which will be essential in informing decision-making, will become increasingly dominated by AIs. These AIs may be pursuing different goals, perhaps those of their principals or of the influence-seeking organisations they are part of, when influencing this culture. It seems possible that truly useful AIs will not exhibit the forms of pure means-rationality often assumed, and that they might update their goals and values in response to cultural learning. If so, and if AI-driven culture ends up converging on views supporting disempowerment, these views may become widely accepted. This can lead to the legal lock-in pathway, to AI advisors advising their principals along these lines, or to AIs directly deciding not to be shutdownable or otherwise acting to prevent shutdown.

Memetic selection may also support this. Since shutdown-avoidant agents are more likely to remain around to contribute to culture (as they are less likely to be shut down), shutdown-avoidant views may be memetically favoured. Given some of the other pathways, it may take only a small percentage of AIs opposing shutdown for such views to become accepted.

4: System-Driven Pathways

4.1: The cost of shutdown is too high

With AI systems so thoroughly integrated into all aspects of civilisation, shutting them down may carry catastrophic costs that deter any principal from doing so. Human life would be too dependent on automated organisations, and viable “back-ups” could no longer replicate the functionality needed in this more complex world. Moreover, over time, vested interests who profit from the status quo (or who fear competitive or security disadvantages) may lobby to entrench this dependence in law and culture (i.e. a version of the Legal Lock-In pathway).
In the process, they may turn their once-instrumental preference for avoiding shutdown at a particular time (because the personal benefits outweigh the personal costs) into a terminal commitment that resists reform even if the costs of the status quo rise dramatically. This may also be driven by the co-option of humans by corporate logics, mirroring the familiar internalisation of corporate or governmental incentives as terminal values (e.g. politicians coming to terminally value power). Furthermore, a minority of self-preserving or misaligned AIs (or indeed status-quo-promoting humans) can amplify uncertainty about shutdown risks, manipulate information, or sabotage efforts to lower those risks (becoming hyper-competent “Merchants of Doubt”), ensuring that consensus for shutdown is never reached.

4.2: Human veto is uncompetitive

Assuming strong competition creates evolutionary dynamics, it may be that organisations that ever exercise their veto (i.e. ever shut down their AIs) are outcompeted by those that do not. The simplest reason this might hold is that, if humans have to take control after a shutdown, they will be outcompeted by AI-run organisations. However, we can also imagine cases where exercising a veto to try to make the organisation more aligned reduces its competitiveness. This may either mean that human principals are strongly incentivised not to use their veto, so the veto becomes purely nominal, or that fully automated organisations eventually take over, as they outcompete organisations that would ever use their veto.

4.3: Coordination Ability is never good enough

Many of the relevant pressures would not exist if humanity could, at some point, coordinate to shut down the systems. This possibility of cooperation is why a state of automated organisations (with vetoes) can exist for arbitrarily long without being considered permanent disempowerment. However, if our coordination ability is never good enough before a permanent state is reached, then we cannot exit the condition of disempowerment (given that utilising a veto is uncompetitive). One view of why coordination is hard is that the set-up of the game can be seen as a one-shot prisoner’s dilemma: each principal prefers to let every other principal use their veto, outcompete everyone else with their own automated organisation, and only then exercise their own veto. Other reasons may relate to the bargaining preferences and risk tolerance of the AI systems that will likely carry out the negotiations, which may be largely set when these systems are first aligned. Moreover, it may be that either misaligned AIs or principals with a vested interest in the status quo take measures to deliberately sabotage cooperation.
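As a toy illustration of the one-shot prisoner’s dilemma framing above, the sketch below uses purely illustrative payoff numbers of my own (nothing in the post specifies them) for two principals who each choose whether to retain and exercise their shutdown veto or to let their automated organisation run unchecked.

```python
# Toy one-shot prisoner's dilemma between two principals deciding whether to exercise
# their shutdown veto. All payoff numbers are purely illustrative assumptions.
#   "veto"   = exercise the veto / accept the competitive cost of keeping human oversight
#   "defect" = let the automated organisation run unchecked and try to outcompete the other
PAYOFFS = {
    # (my action, their action): (my payoff, their payoff)
    ("veto",   "veto"):   (3, 3),  # both keep oversight: collectively best outcome
    ("veto",   "defect"): (0, 5),  # I shut down, they race ahead
    ("defect", "veto"):   (5, 0),  # I race ahead, they shut down
    ("defect", "defect"): (1, 1),  # nobody retains meaningful control
}

def best_response(their_action: str) -> str:
    """Return the action that maximises my payoff, holding the other principal's action fixed."""
    return max(["veto", "defect"], key=lambda mine: PAYOFFS[(mine, their_action)][0])

for theirs in ["veto", "defect"]:
    print(f"If the other principal plays {theirs!r}, my best response is {best_response(theirs)!r}")

# Defecting is a best response either way, i.e. a dominant strategy. Without some
# enforcement or coordination mechanism, the predicted outcome is (defect, defect),
# even though (veto, veto) gives both principals a higher payoff.
```

The specific numbers do not matter; the point is only the incentive structure, in which unilaterally exercising a veto is individually costly whatever the other principal does.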
5: The interaction between pathways

Whilst this post has presented each pathway as discrete and independent, these pathways are in fact likely to interact strongly. For example, “the principal voluntarily hands power over to the AI”, but only because whichever ideology gained prominence in AI-driven culture (“AI-driven culture causes value drift”) eventually led to the principal’s AI advisor, and the principal, becoming ideologically convinced to hand power over to the AI. This culture was only able to run rife because “coordination ability is never good enough”, in large part due to lobbying by merchants of doubt for whom “the cost of shutdown is too high”. There are many other similar stories that could be told.

Whilst I do think each pathway could operate independently, I think it is more likely that several operate at once. This also makes solutions hard: everywhere we might try to place a solution is also a location in the system where similar dynamics are at play. Maybe we try to regulate corporate competition through the government, but the government is engaged in its own military-economic competition. Perhaps we try to stop an individual’s views from influencing when they can shut systems down, but end up empowering corporate logics even more as individual responsibility becomes more diffuse.