This sort of works, but not enough to solve it.
A core problem lies in the distribution of goals that you vary over. The AI will be trained to be corrigible within the range of that distribution, but there is no particular guarantee that it will be corrigible outside it.
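A minimal toy sketch of the kind of setup this points at (the Gaussian goal distribution, the single comply/resist decision, and the reward are all illustrative assumptions on my part, not the actual proposal): the policy is rewarded for deferring to goal changes, but only ever on goals drawn from the training distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL_DIM = 8

def sample_goal():
    # The "distribution of goals you vary over": here, just Gaussian vectors.
    return rng.normal(size=GOAL_DIM)

# Toy policy: probability of accepting an operator-initiated goal change,
# as a logistic function of the current goal.
w, b = np.zeros(GOAL_DIM), 0.0

def p_comply(goal):
    return 1.0 / (1.0 + np.exp(-(goal @ w + b)))

lr = 0.1
for _ in range(2000):
    goal = sample_goal()                 # vary the goal each episode
    p = p_comply(goal)
    complied = rng.random() < p
    reward = 1.0 if complied else -1.0   # reward deference, penalize resistance
    # REINFORCE update for a Bernoulli policy.
    dlogp = (1.0 - p) if complied else -p
    w += lr * reward * dlogp * goal
    b += lr * reward * dlogp

# The training signal only ever evaluates corrigibility on goals drawn from
# sample_goal(); nothing in the loop constrains behaviour on goals outside
# that distribution's support.
print("in-distribution compliance prob:", p_comply(sample_goal()))
```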
So you need to make sure that your distribution of goals contains human values. How do you guarantee that it contains them, without getting Goodharted into instead containing something that merely superficially resembles human values?
It might be tempting to achieve this by making the distribution very general, with lots of varied goals, so that it contains lots of alien values, human values included. But then human values are given exponentially small probability, which utility-wise is similar to the distribution not containing human values.
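To make "exponentially small" concrete (a toy illustration with assumed numbers, not something from the original comment): if goals are parameterized by $n$ independent binary attributes and the distribution is roughly uniform over them, then any one specific goal, human values included, gets probability on the order of

$$\Pr[\text{human values}] \approx 2^{-n},$$

so for even a modest $n$ of around 100 that is roughly $10^{-30}$, and the weight human values carry in the training mixture is negligible. That is the sense in which it is utility-wise similar to the distribution not containing human values at all.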
So you need to somehow give human values a high probability within the distribution. But at that point you’re most of the way to just figuring out what human values are in the first place and directly aligning to them.
“which utility-wise is similar to the distribution not containing human values.” - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigibility, I don’t see why you need a high probability for any specific new goal, as long as the distribution is diverse enough that there is no simpler generalization than “don’t care about controlling goals”. For capabilities, my intuition is that starting with superficially-aligned goals is enough.
Hmm, I think I retract my point. I suspect something similar to it still applies, but as written it doesn’t fit 100%, and I can’t quickly analyze your proposal well enough to apply it.
More on the meta level: “This sort of works, but not enough to solve it.”—do you mean “not enough” as in “good try but we probably need something else” or as in “this is a promising direction, just solve some tractable downstream problem”?