Thank you for your input; I found it very informative!
I agree with your point that any aligned AI will be 100% on board with avoiding value drift, and that certainly does take pressure off of us when it comes to researching this. I also agree that it would be best to avoid this scenario entirely and avoid having a self-improving AI touch its value function at all.
In cases where a self-improving AI can alter its values, I don’t entirely agree that this would only be a concern at subhuman levels of intelligence. It seems plausible to me that an AI of human-level intelligence, and maybe slightly higher, could think that marginally adjusting a value for improved performance is safe, only to be wrong about that. From a human perspective, I find it very difficult to reason through how slightly altering one of my values would affect my reflective reasoning about the importance of that value and the acceptable ranges it could take. A self-improving agent would also have to make this prediction about a more intelligent version of itself, with the added complication of estimating the potential impact on future iterations as well. It’s possible that an agent of human-level intelligence would be able to do this easily, but I’m not entirely confident of that.
The main reason I bring up the scenario of a self-improving AI with access to its own values is that I see it as a clear path to performance improvement that might seem deceptively safe to some organizations conducting general AI research in the future. That is especially true where external incentives (such as an international general AI arms race) might push researchers to take risks that they normally wouldn’t take in order to beat the competition. If a general AI were properly aligned, I could see certain organizations allowing that AI to improve itself by marginally altering its values out of fear that a rival organization would do the same.
I’m going to reflect on what you said in more depth, though. Since I’m still new to all of this, it’s very possible that there is relevant external information that I’m missing or not considering thoroughly.