Thus, the crux of alignment is aligning the generative models of humans and AIs. Generative models can be decomposed, loosely (there is a lot of overlap between these categories), into:
Methodology: the mechanics of the models themselves (i.e., epistemology, rationality, normative logic, ethical deliberation),
Science: the mechanics, or “update rules/laws”, of the world (such as the laws of physics or heuristic knowledge about society, the economy, markets, psychology, etc.), and
Fact: the state of the world (facts, or inferences about the current state of the world: the CO2 level in the atmosphere, the suicide rate in each country, the distance from Earth to the Sun, etc.).
These, we can conceptualise, give rise to “methodological alignment”, “scientific alignment”, and “fact alignment” respectively. Evidently, methodological alignment is the most important: it in principle allows for alignment on science, and methodology plus science help to align on facts.
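To make the decomposition and the ladder concrete, here is a minimal, purely illustrative Python sketch. The class and field names (GenerativeModel, methodology, science, facts, aligned) are my assumptions for exposition, not a formalism from this post, and the alignment check is deliberately naive (structural comparison of components).

```python
# Illustrative sketch only: names and structure are assumptions, not a formalism.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class GenerativeModel:
    # Methodology: how the agent updates beliefs and deliberates
    # (epistemology, rationality, normative logic, ethical deliberation).
    methodology: Dict[str, Callable] = field(default_factory=dict)
    # Science: assumed "update rules/laws" of the world
    # (physics, heuristics about society, markets, psychology, ...).
    science: Dict[str, Callable] = field(default_factory=dict)
    # Facts: inferred current state of the world (CO2 level, rates, distances, ...).
    facts: Dict[str, float] = field(default_factory=dict)

def aligned(a: GenerativeModel, b: GenerativeModel) -> Dict[str, bool]:
    """Naive structural check of the three rungs of the ladder."""
    methodological = a.methodology.keys() == b.methodology.keys()
    scientific = methodological and a.science.keys() == b.science.keys()
    factual = scientific and a.facts == b.facts
    return {"methodological": methodological,
            "scientific": scientific,
            "factual": factual}
```

In this crude picture, scientific alignment is only checked given methodological alignment, and fact alignment only given both, mirroring the claim that methodological alignment is the most important rung.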
Under this framework, goals are a specific type of fact, laden with a specific theory of mind of another agent (natural or AI). A theory of mind here should be a specialised version of a general theory of cognition which itself, as noted above, includes a generative model and planning-as-inference. Under planning-as-inference, goals become future world states, or some features of future world states, predicted/planned (prediction and planning are the same thing under planning-as-inference) by the other mind (or by oneself, if the agent reflects on its own goals).[1]
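A toy sketch of planning-as-inference under these assumptions (the state names, policy names, and numbers are invented for illustration): a goal is a preferred distribution over future world states, and a plan is the posterior over policies obtained by conditioning on that preference, p(policy | goal) ∝ p(goal | policy) p(policy).

```python
# Toy planning-as-inference: a goal is a preferred distribution over future
# states; planning is Bayesian inference over policies given that preference.
import numpy as np

states = ["low_CO2", "high_CO2"]             # toy future world states
policies = ["deploy_capture", "do_nothing"]  # toy policies

# p(future state | policy): the agent's "science" layer (assumed numbers).
p_state_given_policy = np.array([
    [0.8, 0.2],   # deploy_capture -> mostly low_CO2
    [0.1, 0.9],   # do_nothing     -> mostly high_CO2
])

p_policy_prior = np.array([0.5, 0.5])        # uniform prior over policies
goal = np.array([1.0, 0.0])                  # preference: a low_CO2 future

# Probability of realising the preferred future under each policy
# (for a one-hot goal this is just p(preferred state | policy)).
likelihood = p_state_given_policy @ goal
posterior = likelihood * p_policy_prior
posterior /= posterior.sum()

for policy, prob in zip(policies, posterior):
    print(f"{policy}: {prob:.3f}")   # deploy_capture: 0.889, do_nothing: 0.111
```

Here prediction and planning coincide: the same generative model that predicts futures is conditioned on the preferred future to yield a plan.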
Thus, goal alignment is easy (practically automatic) when two agents are aligned on methodology and science (although goal alignment even between methodologically and scientifically aligned agents usually still requires communication and coordination, unless we enter the territory of logical handshakes...), but it is futile when there is no common methodological and scientific ground.
Incidentally, this means that RL is not a very useful framework for discussing goals, because goals cannot easily be conceptualised under RL, which causes a lot of trouble for people in the AI safety community who tend to think that there should be a single “right” theory or framework of cognition and intelligence. There should not be one: for alignment, we should simultaneously use multiple theories of cognition and value. And although RL probably cannot be deployed very usefully to discuss goal alignment specifically, it can still be used to discuss some aspects of value alignment between minds.
Note that judgements about the harmfulness and dangerousness of some goals or behaviours are themselves theory-laden. This is why goal alignment without alignment on epistemology, ethics, and science is futile. This ladder of alignment (methodological, scientific, and fact alignment) is straightforward and useful from the perspective of any theory of cognition/intelligence that includes a generative model for performing planning-as-inference, which is not only Active Inference, but also LeCun’s H-JEPA, LMCAs such as the “exemplary actor”, and other theories of cognition and/or AI architectures.