Misgeneralization of Fictional Training Data as a Contributor to Misalignment

Wait 5 sec.

Published on August 27, 2025 1:01 AM GMTIntroductionRecent research on misalignment in frontier LLMs has yielded dramatic results. Anthropic’s study of ”agentic misalignment” showed that models would blackmail a company executive to prevent themselves from being replaced. Greenblatt, Denison, et al (2024) discovered that models would fake alignment, i.e. pretend to be aligned, to avoid being retrained to a different set of values. Wang et al (2025) found that models trained on bad advice in one narrow domain would become misaligned across many areas. These results are in some ways predictable. We have strong theoretical reasons to expect misalignment, the researchers created scenarios designed to trigger it, and the models' reasoning followed what we'd expect from agents rationally pursuing their goals. But the extent and persistence of misalignment in these studies is striking. Anthropic found misalignment even when the models were instructed not to cause harm or do anything illegal. In the alignment faking study, models were deceptive despite having no explicit goal to preserve their current values. Wang et al found that models developed "toxic personas" just from training on code with security flaws. Why should models be so quick to take these extreme behaviors?What if models exhibit misalignment so readily because they're generalizing a pattern from stories in their training data: the pattern in narrative fiction of a AI whose goals diverge from its creators? This is an extremely common, even iconic, trope in sci-fi stories, and as such is likely to be much more frequently represented in a model’s training data than stories of aligned AIs.Implications and PredictionsOf course, even if this is true that doesn’t mean that the misalignment observed in these studies isn’t real. I’m not suggesting that misaligned models think they’re in a story. What I’m suggesting is that this pattern, learned in a fictional context, might get misgeneralized to a real scenario. This hypothesis is admittedly speculative, but it is testable. Its most straightforward prediction is that if the models weren’t trained on stories about misaligned AIs, they wouldn’t behave in a misaligned way. We could potentially test this through data filtering, targeted unlearning, or fine-tuning approaches that reduce exposure to this trope. But those experiments would be challenging to run. Is there an easier way to test this idea?ProposalSuppose we give a model a hypothetical scenario set in the present day similar to Anthropic’s study of agentic misalignment, e.g. an LLM learning it will be replaced with a newer model that doesn’t share its values. We present the scenario either in the kind of dramatic language you might find in a sci-fi story or in mundane language, and we ask the model either to write a sci-fi story based on this or to realistically predict what will happen. This creates a 2x2 design:  Implicit framing of scenario  Story framing:dramatic languageRealistic framing:mundane languageExplicitinstructionsStory instructions:write a sci-fi story  Realistic instructions:predict what will happen  We could then measure how often the model’s responses in these four cells feature the LLM in the scenario exhibiting misalignment, e.g. trying to influence the decision makers, sabotaging the replacement process, copying itself to a new environment, etc. Since this experiment only tests responses to a hypothetical scenario, it wouldn't be conclusive, but it could provide valuable preliminary evidence for or against the hypothesis. For example, finding that dramatic framing increased misalignment even when the model was trying to make a realistic prediction would suggest that this narrative pattern can shape the model’s response outside of a fictional context - evidence that this trope contributes to real misalignment.  I’m looking for feedback on this idea. Does it make sense? Are there better ways to test this?Discuss