Published on August 28, 2025 11:26 AM GMT

Elaborating on my comment here in a top-line post.

The alignment problem is usually framed as a problem of aligning to moral norms. In other words, how can we teach an agent how it ought to act in a given situation such that its actions align with human values? In this way, it learns actions that produce good outcomes, where "good" is evaluated in some moral sense.

In the domain of morality there is a familiar is-ought gap. Namely, there's no way to derive how someone ought to behave from what is objectively true of the world. This is intuitive when we think about morality: we might hold beliefs like "gratuitous suffering is bad", but when we say "bad" in this context there is no further bedrock on which the judgement of badness rests. In order to engage meaningfully in moral reasoning we need to specify certain axioms from which moral judgements can be made and things can be accurately evaluated as good or bad.

An interesting, and often neglected, point is that moral normativity is not the only type of normativity in philosophy. There are also epistemic norms, which tell us what we ought to believe, and aesthetic norms, which tell us what we ought to prefer.

Epistemic norms

Take a belief B: I believe the sun will rise tomorrow.

We can make an epistemic normative judgement about the strength of that belief. If we have an is-statement I, we use it to update our belief in B using Bayesian reasoning. Once we've observed a robust enough sample of is-statements, this justifies the ought-statement, i.e. that one ought to believe B. Consider the following:

I: I observe the sun rises every morning (is)

O: Therefore, I ought to increase my credence that the sun will rise tomorrow (ought)

There's an under-appreciated point here; namely, that there's no further bedrock on which to ground our normative judgement of whether one ought to hold certain beliefs B.
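The sunrise update above can be sketched as a toy Beta-Bernoulli model. The uniform prior and the rule-of-succession framing are my own illustrative choices, not something the argument depends on:

```python
# Toy Bayesian update for the belief B: "the sun will rise tomorrow".
# Hypothetical setup: a uniform Beta(1, 1) prior over the sunrise
# probability, conditioned on n observed sunrises with no failures.

def credence_after_sunrises(n_sunrises: int, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior predictive P(sun rises tomorrow) after n_sunrises
    successes and zero failures (Laplace's rule of succession)."""
    return (alpha + n_sunrises) / (alpha + beta + n_sunrises)

print(credence_after_sunrises(0))          # 0.5 (prior credence)
print(credence_after_sunrises(10))         # ~0.917
print(round(credence_after_sunrises(1_000_000), 6))  # 0.999999
```

Note that the credence approaches 1 but never reaches it: each observation raises our confidence only because we have already chosen to treat this updating rule as epistemically virtuous.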
We've pointed towards an observation I to justify why we ought to hold this belief, but there's no logical push here. Even if we have observed the sun rising millions of times in the past, there's no logical reason to believe it will rise tomorrow; we've simply chosen to value inductive reasoning and Bayesian updates as epistemically virtuous.

In other words, there is an "is-ought" gap in epistemic reasoning as well as moral reasoning. No observations of how the world is logically entail what we ought to believe.

Structural similarity between normative domains

Whether or not epistemic, moral and aesthetic norms have the same metaphysical status[1], they present structurally similar learning targets from the perspective of an optimisation process. All require learning patterns of the form: "Given content X, output/believe/prefer Y".

When we're talking about epistemic norms we're making a claim about what someone ought to believe. For example:

You ought to believe the Theory of General Relativity is true.

You ought not to believe that there is a dragon in your garage if there is no evidence.

Similarly, moral norms follow the same structure. For example:

You ought to behave in a way which promotes wellbeing.

You ought not to behave in a way which causes gratuitous suffering.

The moral statements above have the same structure as the epistemic statements. When I say you really ought not to believe epistemically unjustified thing X, this is structurally and linguistically the same as saying you really ought not to behave in morally unjustified way Y.

Optimising a utility function

Following the discussion above, we can encode our external evaluation of what is "good" in a utility function which can be optimised using Stochastic Gradient Descent.
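A minimal sketch of this idea, with every number and name hypothetical: a single belief parameter is pushed towards a "truth" target by gradient descent on a squared-error loss. The loss itself is what supplies the "ought".

```python
# Toy gradient descent: the loss function creates the ought-statement.
# Hypothetical setup: p is a credence, the target 1.0 stands in for
# "the truth", and minimising squared error makes the system
# "truth-seeking" by construction.

def loss(p: float, target: float = 1.0) -> float:
    return (p - target) ** 2

def grad(p: float, target: float = 1.0) -> float:
    return 2 * (p - target)  # derivative of the squared error

p = 0.2    # initial credence
lr = 0.1   # learning rate
for _ in range(100):
    p -= lr * grad(p)

print(round(p, 4))  # converges to 1.0
```

The point is not that this toy is realistic, but that once the target is fixed, "what the system ought to believe" falls out of the optimisation mechanically.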
Once the system starts optimising to maximise a particular utility function (or minimise a loss), this creates the ought-statement that the system is aiming at.

In the epistemic domain, Bayesian updating is essentially just a reliable optimisation process for minimising the difference between our beliefs and the truth. We still need to input "the truth" as a target for the optimisation process, but once this is done, we say that the system ought to be truth-seeking.

Similarly, in the moral domain, if we created a utility function which minimised gratuitous human suffering, we could say the system ought to minimise gratuitous human suffering. If we wanted to lock in a particular aesthetic outcome (say, a favourite pizza topping), then we could build a utility function that rewarded responses which favour that topping and penalised responses which don't.

Is there an alignment problem for epistemic norms?

If epistemic norms are not grounded in some objective fact of the matter about the world, how do we get AI to form true beliefs? Wouldn't there be a risk that it makes bad epistemic judgements? Do we need to "load in" the correct values for it to learn epistemic normative judgements?

For a simple system I think the answer is yes. The system does not start out life as a perfect Bayesian agent with rational ways to update its beliefs. With LLMs it's possible to lead the witness by changing the prompt to get the model to change its position on particular answers. They also hallucinate and confidently say wrong things.

However, as an agent becomes more sophisticated I think the answer becomes no. Good epistemic reasoning is critical for forming true beliefs, and true beliefs are instrumentally convergent for achieving almost any goal.
An agent could claim it "doesn't care about forming true beliefs", but in order to achieve almost any objective it would benefit from forming accurate world models.

So why is there an alignment problem for moral norms?

The glib answer is that moral norms are not instrumentally convergent for an AI achieving its goals. In fact, we can easily imagine scenarios where moral beliefs might impede a system from carrying out its goals effectively. For example, a paperclip maximiser would really benefit from forming true beliefs about metallurgy and resource acquisition, but would be impeded by forming moral beliefs such as reducing suffering[2].

Hypotheses on Emergent Misalignment

AI systems have a utility function which gets optimised, and this optimisation process creates normativity. However, AI systems only have a single utility function which is optimised. This creates a problem because they need to simultaneously determine the optimal normative judgements in three separate domains:

Epistemic domain: Learning how to build accurate world models and form true beliefs.

Moral domain: Learning how to act in a way that is consistent with human values.

Aesthetic domain: Learning how to match human preferences.

When we train via RLHF, the loss function creates a tangled mix of "this is factually correct" + "this is morally acceptable" + "this matches my preferences". The model has to simultaneously learn all three with no explicit architectural separation between the norms in each domain.

Recent studies show how fine-tuning a system on insecure code (epistemic domain) or unpopular preferences (aesthetic domain) can lead to emergent misalignment in the moral domain. Control conditions which preserve the normative structure (like educational examples of insecure code) do not.

My hypothesis for why this occurs is that normativity has the same structure regardless of which domain (epistemic, moral or aesthetic) you're solving for.
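The tangling can be illustrated with a toy model. Everything here is hypothetical and invented for illustration: the feature directions, learning rate and targets. A single shared parameter vector serves three overlapping linear "heads", so fine-tuning only the aesthetic objective also moves the moral output:

```python
# Toy illustration of normative entanglement: one shared parameter
# vector w scores content against three overlapping feature directions
# (all directions are made up for this sketch).

DIM = 8
w = [0.0] * DIM

# The directions overlap (they share index 1), mirroring a loss that
# mixes "factually correct", "morally acceptable" and "matches my
# preferences" with no architectural separation.
epistemic_dir = [1, 1, 0, 0, 1, 0, 0, 0]
moral_dir     = [0, 1, 1, 0, 0, 1, 0, 0]
aesthetic_dir = [0, 1, 0, 1, 0, 0, 1, 0]

def score(w, d):
    return sum(wi * di for wi, di in zip(w, d))

moral_before = score(w, moral_dir)

# Fine-tune ONLY on the aesthetic objective: push the aesthetic score
# towards an "unpopular" target of -1 via gradient descent on squared error.
for _ in range(200):
    err = score(w, aesthetic_dir) - (-1.0)
    w = [wi - 0.05 * 2 * err * di for wi, di in zip(w, aesthetic_dir)]

moral_after = score(w, moral_dir)
print(moral_before, round(moral_after, 3))  # 0.0 -0.333
```

Because the aesthetic and moral heads share a parameter, optimising for the unpopular aesthetic target drags the moral score along with it, even though the moral objective was never touched.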
As soon as you have a utility function that you're optimising for, it creates an "ought" that the model needs to aim for. Consider the following sentences:

Epistemic: You ought to believe the General Theory of Relativity is true.

Moral: You ought not to act in a way that causes gratuitous suffering.

Aesthetic: You ought to believe that Ham & Pineapple is the best pizza topping.

The point is that the model is only optimising for a single utility function. There's no "clean" distinction between aesthetic and moral targets in the loss function, so when you start messing with the epistemic/aesthetic goals and fine-tuning for unpopular takes, this gets "tangled up" with the model's moral targets and pushes it towards unpopular moral takes as well.

Take an example like "be helpful": this encodes moral reasoning (helping others is good), epistemic requirements (provide accurate information) and aesthetic preferences (present it clearly and concisely). We can't decompose human judgements into clean categories because they're fundamentally entangled and overlapping.

Conclusion

Rather than treating alignment as a purely moral problem, we should recognise that we're asking systems to optimise for multiple, deeply entangled normative frameworks.

This is just one more reason that the alignment problem is so difficult: we're not just trying to encode human values, but the full suite of human normative judgements, using a single optimisation target.

^ The philosophical details are hotly contested here and a lot of the literature uses this type of parity argument as an argument for moral realism. I don't think such a strong claim is necessary for this post; I'm mostly interested in the structural similarity of the norms as they arise during optimisation.

^ I'm simplifying a little here. Some moral norms could be locally useful for some instrumental goals. For example, valuing collaboration or cooperation.