I agree with you that utility-maximizing “maximum optimization power” AIs need to have some knowledge of human values to be safe—at least enough to avoid bad side effects.
On the other hand, I think that even once you have an AI that can safely create 2 copies of the same strawberry, there might still be problems left that aren't solved at that point—like how to aggregate the preferences of various people, or how to extrapolate human values into weird situations.
On the other other hand, some alignment problems give me the impression that they’re independent of human values—like mesa-optimization or “Look where I’m pointing, not at my finger”.
Maybe “to what degree is solving this subproblem of alignment necessary to have a safe strawberry cloner” is an interesting distinction to draw.