This question gets at a bundle of assumptions in a lot of alignment thinking that seem very wrong to me. I’d add another, subtler, assumption that I think is also wrong: namely, that goals and values are discrete. E.g., when people talk of mesa optimizers, they often make reference to a mesa objective which the (single) mesa optimizer pursues at all times, regardless of the external situation. Or, they’ll talk as though humans have some mysterious set of discrete “true” values that we need to figure out.
I think that (1) real goal-oriented learning systems are closer to having a continuous distribution over possible goals / values, (2) this distribution is strongly situation-dependent, and (3) this distribution evolves over time as the system encounters new situations.
I sketched out a rough picture of why we should expect such an outcome from a broad class of learning systems in this comment.
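Here’s a minimal toy sketch of the framing (all the value names and numbers are purely illustrative): the system carries several value functions, and which of them dominate is itself a function of the situation.

```python
import numpy as np

# Toy "values": each scores a candidate action in a given situation.
def honesty(situation, action):
    return -abs(action.get("deception", 0.0))

def helpfulness(situation, action):
    return action.get("benefit_to_user", 0.0)

def curiosity(situation, action):
    return action.get("information_gained", 0.0)

VALUES = [honesty, helpfulness, curiosity]

def value_weights(situation):
    """Situation-dependent (softmax) distribution over which values apply."""
    logits = np.array([situation.get(v.__name__, 0.0) for v in VALUES])
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

def evaluate(situation, action):
    """Score an action under the current, situation-weighted mix of values."""
    return sum(w * v(situation, action)
               for w, v in zip(value_weights(situation), VALUES))
```

The point of the toy is only that there’s no single fixed objective here: change the situation, and the effective mix of values changes with it.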
An AGI “like me” might be morally uncertain like I am, persuadable through dialogue like I am, etc.
I strongly agree that the first thing (moral uncertainty) happens by default in AGIs trained on complex reward functions / environments. The second (persuadable through dialog) seems less likely for an AGI significantly smarter than you.
It has been argued that if you already have the fixed-terminal-goal-directed wrapper structure, then you will prefer to avoid outside influences that will modify your goal. This is true, but does not explain why the structure would emerge in the first place.
I think that this is not quite right. Learning systems acquire goals / values because the outer learning process reinforces computations that implement said goals / values. Said goals / values arise to implement useful capabilities for the situations that the learning system encountered during training.
However, it’s entirely possible for the learning system to enter new domains in which any of the following issues arise:
The system’s current distribution of goals / values is incapable of competently navigating the new domain.
The system is unsure of which goals / values should apply.
The system is unsure of how to weigh conflicting goals / values against each other.
In these circumstances, it can actually be in the interests of the current equilibrium of goals / values to introduce a new goal / value. Specifically, the new goal / value can implement various useful computational functions such as:
Competently navigate situations in the new domain.
Determine which of the existing goals / values should apply to the new domain.
Decide how the existing goals / values should weigh against each other in the new domain.
Of course, the learning system wants to minimize the distortion of its existing values. Thus, it should search for a new value that both implements the desired capabilities and is maximally aligned with the existing values.
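Here’s a minimal sketch of what that search could look like, under strong simplifying assumptions (discrete action sets, values as scoring functions; nothing here is meant as a real proposal):

```python
def pick_new_value(candidates, old_values, old_situations, new_situations, actions):
    """Pick a candidate value that (a) can navigate the new domain and
    (b) minimally distorts behavior where the existing values already apply."""
    def old_choice(s):
        return max(actions, key=lambda a: sum(v(s, a) for v in old_values))

    def score(candidate):
        # Crude proxy for "competently navigates the new domain":
        # the candidate ranks actions decisively rather than being indifferent.
        decisiveness = sum(
            max(candidate(s, a) for a in actions) -
            min(candidate(s, a) for a in actions)
            for s in new_situations)
        # Alignment: agree with the old values' choices on the old domains.
        agreement = sum(
            old_choice(s) == max(actions, key=lambda a: candidate(s, a))
            for s in old_situations)
        return decisiveness + agreement

    return max(candidates, key=score)
```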
In humans, I think this process of expanding the existing values distribution to a new domain is what we commonly refer to as moral philosophy. E.g.:
Suppose you (a human) have a distribution of values that implement common sense human values like “don’t steal”, “don’t kill”, “be nice”, etc. Then, you encounter a new domain where those values are a poor guide for determining your actions. Maybe you’re trying to determine which charity to donate to. Maybe you’re trying to answer weird questions in your moral philosophy class.
The point is that you need some new values to navigate this new domain, so you go searching for one or more new values. Concretely, let’s suppose you consider classical utilitarianism (CU) as your new value.
The CU value effectively navigates the new domain, but there’s a potential problem: the CU value doesn’t constrain itself to only navigating the new domain. It also produces predictions regarding the correct behavior on the old domains that your already-existing values navigate. This could prevent the old values from determining your behavior on those old domains. For instrumental reasons, the old values don’t want to be disempowered.
One possible option is for there to be a “negotiation” between the old values and the CU value regarding what sort of predictions CU will generate on the domains that the old values navigate. This might involve an iterative process of searching over the CU value’s input space for situations, within the domains the old values already navigate, where the CU value strongly diverges from them.
Each time a conflict is found, you either modify the CU value to agree with the old values, constrain the CU value so that it doesn’t apply to those sorts of situations, or reject the CU value entirely if no resolution is possible. This can lead you to adopt refinements of CU, such as rule utilitarianism or preference utilitarianism, if those seem better aligned with your existing values.
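A toy version of that negotiation loop, with the same caveats as above (I’ve left out the “modify the CU value” branch; in reality, generating good modifications is the hard part, not checking for conflicts):

```python
def negotiate(new_value, old_values, sample_old_situation, actions,
              n_probes=1000, max_conflict_rate=0.2):
    """Probe old-domain situations for disagreement between the candidate
    value and the existing values; carve conflicting situations out of the
    candidate's scope, or reject it if conflicts are too widespread."""
    excluded = []  # situations the new value will be constrained not to cover
    for _ in range(n_probes):
        s = sample_old_situation()
        old_choice = max(actions, key=lambda a: sum(v(s, a) for v in old_values))
        new_choice = max(actions, key=lambda a: new_value(s, a))
        if new_choice != old_choice:
            excluded.append(s)
    if len(excluded) / n_probes > max_conflict_rate:
        return None      # reject the candidate value entirely
    return excluded      # accept it, with a restricted domain of application
```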
IMO, the implication is that (something like) the process of moral philosophy seems strongly convergent among learning systems capable of acquiring any values at all. It’s not some weird evolutionary baggage, and it’s entirely feasible to create an AI whose meta-preferences over learned values work similarly to ours. In fact, that’s probably the default outcome.
Note that you can make a similar argument that the process we call “value reflection” is also convergent among learning systems. Unlike “moral philosophy”, “value reflection” relates to negotiations among the currently held values, and is done in order to achieve a better Pareto frontier of tradeoffs among them. I think that a multiagent system whose constituent agents were sufficiently intelligent / rational should agree to a joint Pareto-optimal policy that causes the system to act as though it had a utility function. The process by which an AGI or human tries to achieve this level of internal coherence would look like value reflection.
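One standard way to make the “acts as though it had a utility function” claim precise, assuming the set of achievable expected-payoff profiles is convex (e.g. because mixed policies are allowed): any Pareto-optimal joint policy $\pi^*$ maximizes some fixed non-negative weighting of the constituent values’ utilities,

$$\pi^* \in \arg\max_{\pi} \sum_i w_i \, U_i(\pi), \qquad w_i \ge 0, \quad \sum_i w_i = 1,$$

so from the outside the system looks like a maximizer of the single utility function $U = \sum_i w_i U_i$, with the weights set by whatever internal bargain the values struck.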
I also think values are far less fragile than is commonly assumed in alignment circles. In the standard failure story around value alignment, there’s a human who has some mysterious “true” values (that they can’t access), and an AI that learns some inscrutable “true” values (that the human can’t precisely control because of inner misalignment issues). Thus, the odds of the AI’s somewhat random “true” values perfectly matching the human’s unknown “true” values seem tiny, and any small deviation between these two means the future is lost forever.
(In the discrete framing, any divergence means that no part of the AI concerns itself with “true” human values.)
But in the continuous perspective, there are no “true” values. There is only the continuous distribution over possible values that one could instantiate in various situations. A Gaussian distribution does not have anything like a “true” sample that somehow captures the entire distribution at once, and neither does a human’s or an AI’s distribution over possible values.
Instead, the human and AI both have distributions over their respective values, and these distributions can overlap to a greater or lesser degree. In particular, this means partial value alignment is possible. One tiny failure does not make the future entirely devoid of value.
(Important note: this is a distribution over values, as in, each point in this space represents a value. It’s a space of functions, where each function represents a value[1].)
Obviously, we prefer more overlap to less, but an imperfect representation of our distribution over values is still valuable, and is far easier to achieve than near-perfect overlap.

[1] I am deliberately being agnostic about what exactly a “value” is and how it’s implemented. I think the argument holds regardless.
Upvoted but disagree.

Moral philosophy is going to have to be built in on purpose—default behavior (e.g. in model-based reinforcement learning agents) is not to have value uncertainty in response to new contexts, only epistemic uncertainty.
Moral reasoning is natural to us like vision and movement are natural to us, so it’s easy to underestimate how much care evolution had to take to get us to do it.
Seems like you’re expecting the AI system to be inner aligned? I’m assuming it will have some distribution over mesa objectives (or values, as I call them), and that implies uncertainty over how to weigh them and how they apply to new domains.
Moral reasoning is natural to us like vision and movement are natural to us, so it’s easy to underestimate how much care evolution had to take to get us to do it.
Why are you so confident that evolution played much of a role at all? How did a tendency to engage in a particular style of moral philosophy cognition help in the ancestral environment? Why would that style, in particular, be so beneficial that evolution would “care” so much about it?
My position: mesa objectives learned in domain X do not automatically or easily generalize to a sufficiently distinct domain Y. The style of cognition required to make such generalizations is startlingly close to that which we call “moral philosophy”.
Human social instincts are pretty important, including instincts for following norms and also for pushing back against norms. Not just instincts for specific norms, also one-level-up instincts for norms in general. These form the basis for what I see when I follow the label “moral reasoning.”
I think I do expect AIs to be more inner-aligned than many others (because of the advantages gradient descent has over genetic algorithms). But even if we suppose that we get an AI governed by a mishmash of interdependent processes that sometimes approximate mesa-optimizers, I still don’t expect what you expect—I don’t expect early AGI to even have the standards by which it would say values “fail” to generalize, it would just follow what would seem to us like a bad generalization.
By “moral philosophy”, I’m trying to point to a specific subset of values-related cognition that is much smaller than the totality of values-related cognition. Specifically, the subset that pertains to generalizing existing values to new circumstances. I claim that there exists a simple “core” of how this generalization ought to work for a wide variety of values-holding agentic systems, and that this core is startlingly close to how it works in humans.
It’s of course entirely possible that humans implement a modified version of this core process. However, it’s not clear to me that we want an AI to exactly replicate the human implementation. E.g., do you really want to hard wire an instinct for challenging the norms you try to impose?
Also, I think there are actually two inner misalignments that occurred in humans.
1: Between inclusive genetic fitness as the base objective, evolution as the learning process, and the human reward circuitry as the mesa objectives.
2: Between activation of human reward circuitry as the base objective, human learning as the learning process, and human values as the mesa objectives.
I think AIs will probably be, by default, a bit less misaligned to their reward functions than humans are misaligned to their reward circuitry.
I think this is an interesting perspective, and I encourage more investigation.
Briefly responding, I have one caveat: the curse of dimensionality. If values live in a high-dimensional space (they do: they’re functions), then ‘off by a bit’ could easily mean ‘essentially zero measure overlap’. This is not the case in the illustration (which is 1-D).
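A quick toy calculation of that caveat (the per-dimension offset of 0.5 is made up; the closed form is the standard Bhattacharyya overlap for two equal-covariance Gaussians):

```python
import numpy as np

def gaussian_overlap(d, delta):
    """Bhattacharyya coefficient between two isotropic unit Gaussians whose
    means differ by `delta` in each of `d` dimensions: exp(-||mu1 - mu2||^2 / 8)."""
    return np.exp(-(d * delta**2) / 8)

for d in [1, 10, 100, 1000]:
    print(d, gaussian_overlap(d, delta=0.5))
# d=1: ~0.97   d=10: ~0.73   d=100: ~0.04   d=1000: ~3e-14
```

So a small per-dimension offset really does drive the overlap toward zero as the dimension grows.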
I agree with your point about the difficulty of overlapping distributions in high dimensional space. It’s not like the continuous perspective suddenly makes value alignment trivial. However, to me it seems like “overlapping two continuous distributions in a space X” is ~ always easier than “overlapping two sets of discrete points in space X”.
Of course, it depends on your error tolerance for what counts as “overlap” of the points. However, my impression from the way that people talk about value fragility is that they expect there to be a very low degree of error tolerance between human versus AI values.
So what is the chance, in practice, that the resolution of this complicated moral reasoning system will end up with a premium on humans in habitable living environments, as opposed to any other configuration of atoms?
Depends on how much measure human-compatible values hold in the system’s initial distribution over values. A paperclip maximizer might do “moral philosophy” over what, exactly, represents the optimal form of paperclip, but that will not somehow lead to it valuing humans. Its distribution over values centers near-entirely on paperclips.
Then again, I suspect that human-compatible values don’t need much measure in the system’s distribution for the outcome you’re talking about to occur. If the system distributes resources in rough proportion to the measure each value holds, then even very low-measure values get a lot of resources dedicated to them. The universe is quite large, and sustaining some humans is relatively cheap.
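As a toy calculation (the numbers are purely illustrative): if human-compatible values held even one part in a million of the measure, and resources really were allocated in rough proportion, a system commanding astronomically large resources would still devote a millionth of them to those values, which is far more than is needed to keep humans in habitable environments.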