The reason I don’t think this (that is, a particular sort of value stability under self-modification) is a key problem is that it’s one of those areas where the AI’s incentives are automatically aligned. We don’t have to solve the problem ahead of time, because almost any AI is going to be 100% on board with avoiding value drift. So it seems like there’s less pressure on us to get everything done.
However, there is one case where it might become important to solve this correctly before turning on an AGI: a seed AI that starts very subhuman and increases its intelligence through self-modification. But I think we should avoid this scenario anyway. Even if we do build subhuman self-improving AI, it shouldn’t be incentivized to touch its value function at all, only its world-model—any such incentive (e.g. “attempting to make itself run faster”) should be only a subgoal in a hierarchical goal structure that remembers not to touch its value function.
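To make that concrete, here is a minimal toy sketch of such a hierarchical goal structure, assuming a made-up agent with separate value-function and world-model components (all of the names below are hypothetical illustrations, not any real system): the “run faster” subgoal is only allowed to swap out the world-model, and the value function is never a target of modification.

```python
# Toy sketch: the self-improvement subgoal can rewrite the world-model,
# but the value function is not exposed to modification at all.
# Every name here is a hypothetical stand-in.

class SeedAgent:
    def __init__(self, value_fn, world_model):
        self._value_fn = value_fn        # fixed: never a target of self-modification
        self.world_model = world_model   # the only component the agent may rewrite

    def evaluate(self, outcome):
        """Top-level goal: score outcomes with the fixed value function."""
        return self._value_fn(outcome)

    def self_improve(self, candidate_model, benchmark):
        """Subgoal: adopt a new world-model only if it scores better on a
        prediction benchmark; the value function is untouched by design."""
        if benchmark(candidate_model) > benchmark(self.world_model):
            self.world_model = candidate_model


# Trivial usage with stand-in components:
agent = SeedAgent(value_fn=lambda outcome: outcome,
                  world_model=lambda obs: 0.0)
agent.self_improve(candidate_model=lambda obs: obs,
                   benchmark=lambda model: model(1.0))  # adopts the better predictor
```

The only point of the sketch is the separation: the optimization pressure coming from the self-improvement subgoal has no write access to `_value_fn`.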
I don’t know, I’m increasingly less convinced that we should reasonably expect not to see value drift. In particular, value drift can at least be a function of computing reflective equilibrium, such that values may drift from their original position in order to be consistent with other values. In this sense the original value might be thought of as mistaken, and it could be a correct move to drift towards a value that is stable under reflection. And that’s to say nothing of “drift” as a result of updating on new information.
Put another way, it seems unlikely to me that we can build an AGI that is both fully general and not open to instability under self-modification; to get greater stability, we must give up some generality. Arguably this is exactly what alignment is—giving up access to parts of mind space in exchange for meeting particular safety guarantees—but I think it’s also worth pointing out that there may be a sense in which we can oversolve alignment, removing all value drift and rendering the intended AGI narrow rather than general.
Thank you for your input; I found it very informative!
I agree with your point that any aligned AI will be 100% on board with avoiding value drift, and that certainly does take pressure off of us when it comes to researching this. I also agree that it would be best to avoid this scenario entirely and not let a self-improving AI touch its value function at all.
In cases where a self-improving AI can alter its values, I don’t entirely agree that this would only be a concern at subhuman levels of intelligence. It seems plausible to me that an AI of human-level intelligence, and maybe slightly higher, could think that marginally adjusting a value for improved performance is safe, only to be wrong about that. From a human perspective, I find it very difficult to reason through how slightly altering one of my values would impact my reflective reasoning about the importance of that value and the acceptable range it could take. A self-improving agent would also have to make this prediction about a more intelligent version of itself, with the added complication of estimating the impact on future iterations as well. It’s possible that an agent of human-level intelligence could do this easily, but I’m not entirely confident of that.
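As a rough, purely illustrative picture of why reasoning about future iterations is hard (the numbers below are made up for illustration, not taken from this discussion): even a “marginal” adjustment compounds quickly over many rounds of self-improvement.

```python
# Illustrative only: a hypothetical 1% shift in a scalar stand-in for a value,
# applied once per self-improvement step, drifts far from the original.
initial_value = 1.0
per_step_drift = 0.01            # each iteration nudges the value by 1%

value = initial_value
for step in range(100):          # 100 rounds of self-modification
    value *= (1 + per_step_drift)

print(f"value after 100 steps: {value:.2f}")  # ~2.70, far from the original 1.0
```

Real values obviously aren’t a single scalar, but the compounding pattern is the worry: each successive version judges the next adjustment as “marginal” by its already-shifted standard.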
And the main reason I bring up the scenario of a self-improving AI with access to its own values is that I see it as a clear path to performance improvement that might seem deceptively safe to some organizations conducting general AI research in the future, especially where external incentives (such as an international general AI arms race) might push researchers to take risks they normally wouldn’t take in order to beat the competition. If a general AI were properly aligned, I could see certain organizations allowing that AI to improve itself by marginally altering its values, out of fear that a rival organization would do the same.
I’m going to reflect upon what you said in more depth though. Since I’m still new to all of this, it’s very possible that there is relevant external information that I’m missing or not considering thoroughly.