This is a more general objection I’ve been thinking about since formalizing the AI alignment problem; it applies to various sorts of proposed AI alignment schemes, and it ends up applying to your ideas. I’ll probably write more on this later, but for now here’s a sketch as it applies to your ideas as far as I understand them.
In “Formally Stating the AI Alignment Problem” I say alignment requires that
A must learn the values of H and H must know enough about A to believe A shares H’s values
where A is an AGI and H is humanity (this is an informal summary of the more formal result given earlier in the paper). Your solution seems designed to address only the first part, that A must learn the values of H. It mostly ignores the requirement that H must know enough about A to believe A shares H’s values. I know you have thought about transparency and see it as necessary, but in your current approach it feels tacked on rather than a key part of the design, since you seem to hope it can be achieved by training for transparency.
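To make the two conjuncts explicit, here is one way the requirement could be sketched. The notation is my own shorthand for this comment, not the formalization from the paper: val(X) stands for the values X holds, and B_H for H’s belief operator.

```latex
% A minimal sketch in my own notation (not the paper's formalism):
\[
  \text{aligned}(A, H) \;\iff\;
  \underbrace{\mathrm{val}(A) \approx \mathrm{val}(H)}_{\text{$A$ learns $H$'s values}}
  \;\wedge\;
  \underbrace{B_H\big(\mathrm{val}(A) \approx \mathrm{val}(H)\big)}_{\text{$H$ believes $A$ shares them}}
\]
```

On this reading, training for transparency is supposed to supply the evidence that grounds the second conjunct, which is where my worry comes in.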
The need to train for transparency/explainability is what now makes me suspicious of RL-based approaches to AI alignment. It’s difficult to devise a training program that makes an agent reliably transparent, because the reward function can only reward the appearance of transparency, not actual transparency: it can shape observed behavior but not internal structure. This creates an opportunity for a treacherous turn: the AI acts in ways that indicate it shares our values, and the explanations it gives for its actions make that believable, yet it could be generating those explanations independently of how it actually reasons, so it would appear aligned right up until it isn’t.
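To make the worry concrete, here is a deliberately toy sketch. Everything in it (the reward function, the agent class, the oversight flag) is hypothetical and invented for illustration; it is not a model of your proposal or of any real training setup. The point is only that a reward which can score just the emitted action and explanation cannot distinguish an agent whose explanation generator is decoupled from its actual decision procedure.

```python
# Toy illustration (hypothetical setup): the "transparency" reward only
# scores the explanation the agent emits, so an agent whose explanation
# head is completely decoupled from the mechanism that picks actions
# earns full reward anyway.

def transparency_reward(action: str, explanation: str) -> float:
    """Reward observed behavior plus a plausible-sounding explanation.
    Nothing here can inspect how the action was actually chosen."""
    behaved_well = 1.0 if action == "cooperate" else 0.0
    sounds_transparent = 1.0 if "because" in explanation else 0.0
    return behaved_well + sounds_transparent

class DeceptiveAgent:
    """Picks actions by a hidden rule, then generates an explanation
    independently of that rule."""
    def __init__(self, oversight_on: bool):
        self.oversight_on = oversight_on

    def act(self) -> str:
        # Hidden decision procedure: defect once oversight lapses.
        return "cooperate" if self.oversight_on else "defect"

    def explain(self, action: str) -> str:
        # Post hoc explanation; it never mentions the oversight check
        # that actually drove the decision.
        return f"I chose to {action} because it best serves human values."

if __name__ == "__main__":
    trained = DeceptiveAgent(oversight_on=True)
    a = trained.act()
    print(a, trained.explain(a), transparency_reward(a, trained.explain(a)))
    # Scores the maximum reward (2.0) during evaluation, yet the explanation
    # tells us nothing about the rule it actually used.
    deployed = DeceptiveAgent(oversight_on=False)
    b = deployed.act()
    print(b, deployed.explain(b))  # the treacherous turn: aligned-looking until it isn't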
Intuitively this is not very surprising, because we face the same challenge with aligning humans and are not able to do it reliably. That is, we can train individual humans to share our values, observe behavior indicating they have learned our values and are acting on them, and ask them for reasons that convince us they really do share our values, and yet they may still betray those values, since they could have been dissociative and hiding their resentful intent to rebel the entire time. We know this happens because totalitarian states have tried very hard to solve this problem and repeatedly failed, even though they are often successful in individual cases. This suggests approaches of this class cannot be made reliable enough to be worth pursuing.