I think the degree to which alignment generalizes depends a lot on the type of alignment you’re talking about. I think that corrigibility generalizes really badly. In contrast, I think that first-order values generalize kind of okay-ishly, and that second-order values (i.e., meta ethics on how to weigh different values against each other) generalize really well. In fact, I suspect that the meta ethics attractor is stronger than the capabilities attractor.
I also think about the “human values versus evolution” misalignment in a substantially different manner. I don’t think that evolution “tried” to directly specify human values. Rather, I think it specified the human reward circuitry, and individual humans then learn their values by optimizing for activating that reward circuitry in their environments. So, when you’re wondering how much misalignment to expect between a learning process and its reward signal, you shouldn’t look at the level of misalignment between things that increase inclusive genetic fitness and human values. Instead, you should be looking at the level of misalignment between things that increase human reward circuit activation and human values.
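To make that two-level picture concrete, here’s a minimal, purely illustrative sketch (the reward circuit, feature names, and numbers are all made up for the example; this is not a claim about actual neuroscience): the outer process only fixes a reward function, and the inner process ends up valuing whatever reliably activates that function in its particular environment.

```python
import random

# Toy sketch of the two-level picture (illustrative only, not actual neuroscience):
# the "outer" process (evolution) specifies a reward circuit, and the "inner" process
# (a lifetime of learning) comes to value whatever reliably activates that circuit.

def make_reward_circuit():
    # Evolution hard-codes crude proxies for fitness, not values themselves.
    def reward_circuit(experience):
        return 1.0 * experience.get("sweet_taste", 0.0) + 2.0 * experience.get("social_approval", 0.0)
    return reward_circuit

def lifetime_learning(reward_circuit, environment, steps=10_000, lr=0.01):
    # The individual learns values by crediting whatever features co-occur with
    # reward-circuit activation in *their* environment.
    learned_values = {}
    for _ in range(steps):
        experience = random.choice(environment)
        r = reward_circuit(experience)
        for feature, strength in experience.items():
            learned_values[feature] = learned_values.get(feature, 0.0) + lr * r * strength
    return learned_values

# A modern environment where sweetness comes from candy, not ripe fruit:
environment = [
    {"sweet_taste": 1.0, "candy": 1.0},
    {"social_approval": 1.0, "online_likes": 1.0},
    {"calorie_dense_food": 1.0},   # fitness-relevant but unrewarded: gets little value
]

values = lifetime_learning(make_reward_circuit(), environment)
print(sorted(values.items(), key=lambda kv: -kv[1]))
```

The point of the toy is just that the learned values end up tracking reward-circuit activation in the individual’s environment (candy, likes), not genetic fitness, so the former gap is the one to study.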
Finally, I think that evolution had very little say in our meta ethics. I think the meta ethics we do have is more or less convergent for a broad class of learning processes. I think this for two reasons:
I don’t think there was much evolutionary pressure towards adopting a specific meta ethics. You can, of course, invent “just so” stories about how some particular type of meta ethics cognition was actually adaptive in the ancestral environment. I don’t think any such stories hold water, for the simple reason that very few humans actually perform meta ethical reasoning. Even today, with our near-universal literacy, highly abstract society, and preexisting frameworks for meta ethical cognition, it’s very rare. The odds of it being common / important enough in the ancestral environment to be a significant focus of evolutionary optimization are tiny.
If you reason from first principles about how agency / learned values ought to work for a generic RL system trained on a complex reward function in a complex environment, then I think you end up with a system that more or less has to do some sort of value reflection / moral philosophy process startlingly similar to our own meta ethical cognition.
I realize that the second point seems like a stretch. I intend to eventually make a post that properly presents the case for such a conclusion. In the meantime, you can read this comment, which poorly presents the case for such a conclusion. The core argument goes like this:
Given a model trained via RL, there is no “ground truth” on how to draw the agency boundaries. You can draw one boundary around all the parameters and call that “one agent”. You can also draw many boundaries around different parameter subsets and call the system “multi-agent”.
Different regions of the model will tend to specialize towards different types of situations in which the model could receive reward for its actions.
These different regions of specialization will then tend to have different value representations that are specific to the types of situations in which those regions “activate” to steer the system’s behavior.
Due to the aforementioned arbitrariness of the agent boundaries, we can draw “soft” agent boundaries around these (overlapping) regions of partial specialization. The overall model is less like a single agent with a single value representation, and more like a continuous distribution over possible subagents, whose individual values can be in conflict and highly situational.
Thus, the overall model becomes coherent via a quasi-multi-agent negotiation process between the different regions in the distribution over subagents / values.
If you think about the sort of cognition that’s required to reach the Pareto frontier of internal consensus, it would involve things like weighing different learned values against each other, reflecting on which circumstances different values ought to apply to, checking whether a given joint policy among our values is in strong conflict with any of our existing values[1], etc.
Basically, the sorts of cognition that are necessary to resolve conflicts between learned values (or “mesa objectives”, if you prefer that term) seem very similar to the sorts of cognition that are central to our meta ethical process.
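To make the subagent picture a bit more tangible, here’s a toy sketch (the subagents, situations, action scores, and the weighted-average “negotiation” are all invented for illustration; the real claim is about learned value representations inside a trained model, not hand-written classes):

```python
# Toy sketch of the "soft distribution over subagents" picture (everything here is
# invented for illustration, not an actual training setup or negotiation procedure).

class Subagent:
    def __init__(self, name, specialties, action_values):
        self.name = name
        self.specialties = specialties        # situations this region specialized on
        self.action_values = action_values    # situational values: action -> score

    def activation(self, situation):
        # Soft, overlapping boundaries: the region still weakly activates off-specialty.
        return 1.0 if situation in self.specialties else 0.2

    def score(self, action):
        return self.action_values.get(action, 0.0)

def negotiated_choice(subagents, situation, candidate_actions):
    # Quasi-multi-agent negotiation, crudely approximated here as an
    # activation-weighted average of each subagent's situational values.
    def joint_score(action):
        weights = [s.activation(situation) for s in subagents]
        return sum(w * s.score(action) for w, s in zip(weights, subagents)) / sum(weights)
    return max(candidate_actions, key=joint_score)

honesty = Subagent("honesty", {"questions_about_facts"}, {"blunt_truth": 1.0, "white_lie": -1.0})
kindness = Subagent("kindness", {"friend_is_upset"}, {"blunt_truth": -0.5, "white_lie": 0.5, "gentle_truth": 1.0})

# The values conflict and are situational; the "negotiation" is what produces a
# single coherent choice (here, a compromise neither subagent strongly objects to).
print(negotiated_choice([honesty, kindness], "friend_is_upset", ["blunt_truth", "white_lie", "gentle_truth"]))
```

The weighted average is only a stand-in: the actual process described above is closer to the subagents bargaining their way toward a Pareto-acceptable joint policy, which is exactly where the value-weighing and reflection come in.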
Consider the cognitive process that would be involved in that last check (verifying that a proposed joint policy isn’t in strong conflict with any of your existing values). You’d have a proposed method for defining a joint policy that maximizes your current distribution of values. You’d want to verify that there aren’t situations in which this joint policy is radically misaligned with your existing values. To do this, you’d search over situations where the joint policy prescribed actions that were strongly in conflict with one or more of your existing values. If we substitute the name “moral theory” in place of “joint policy”, then this elegantly recovers the cognitive process that we call “moral philosophy by counterexample”, without ever having appealed to our evolutionary history or any human-specific aspect of cognition!
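Here’s a similarly toy sketch of that search (the situations, values, policy, and conflict threshold are all invented for the example; it’s not anyone’s actual algorithm, just the shape of the check): propose a joint policy, then hunt for situations where it prescribes something an existing value strongly disvalues.

```python
# Toy sketch of "moral philosophy by counterexample" (situations, values, and the
# conflict threshold are invented for illustration; only the shape of the check matters).

def find_counterexamples(joint_policy, existing_values, situations, conflict_threshold=-0.5):
    """Search for situations where the proposed joint policy prescribes an action
    that some existing value strongly disvalues."""
    counterexamples = []
    for situation in situations:
        action = joint_policy(situation)           # what the candidate "moral theory" prescribes
        for name, value_fn in existing_values.items():
            if value_fn(situation, action) < conflict_threshold:
                counterexamples.append((situation, action, name))
    return counterexamples

# A candidate joint policy: "always maximize total enjoyment."
def naive_utilitarian_policy(situation):
    return "harvest_organs" if situation == "one_healthy_patient_five_dying" else "do_nothing"

existing_values = {
    "care": lambda s, a: -1.0 if a == "harvest_organs" else 0.0,
    "honesty": lambda s, a: 0.0,
}

situations = ["ordinary_day", "one_healthy_patient_five_dying"]

print(find_counterexamples(naive_utilitarian_policy, existing_values, situations))
```

A non-empty result plays the same role as a philosophical counterexample: it sends you back to revise the candidate “moral theory” rather than discard your existing values.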