“You can’t reason a man out of a position he has never reasoned himself into.”
I think I have seen a similar argument on LW, and it is sensible: with vast intelligence, the search space for justifications that support existing priors can be even greater. An AI with a silly but definite value like “the moon is great, I love the moon” might not change that value so much as develop an entire religion around the greatness of the moon.
We see something like this in goal misgeneralization, where the agent keeps maximizing the proxy reward it learned, independent of the goal we actually meant.
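To make that concrete, here is a minimal toy sketch in Python (my own illustration, loosely inspired by the coin-at-the-end-of-the-level example from the goal misgeneralization literature; the gridworld, policies, and function names are all hypothetical): a policy that learned the proxy “always move right” looks perfect on the training distribution, and keeps pursuing the proxy once the coin moves.

```python
def proxy_policy(position, coin, width):
    """What the agent actually learned: always move right (the proxy)."""
    return min(position + 1, width - 1)

def intended_policy(position, coin, width):
    """What we wanted: move toward the coin, wherever it is."""
    if position < coin:
        return position + 1
    if position > coin:
        return position - 1
    return position

def reaches_coin(policy, coin, width=10, steps=15):
    """Run one episode from the left edge; report whether the coin was reached."""
    position = 0
    for _ in range(steps):
        position = policy(position, coin, width)
    return position == coin

if __name__ == "__main__":
    # Training distribution: coin at the right edge, so the proxy looks identical
    # to the intended behavior and earns full reward.
    print("train -- proxy reaches coin:   ", reaches_coin(proxy_policy, coin=9))
    print("train -- intended reaches coin:", reaches_coin(intended_policy, coin=9))
    # Test distribution: coin moved to position 3; the proxy policy still runs
    # right past it, maximizing the learned proxy rather than the intended goal.
    print("test  -- proxy reaches coin:   ", reaches_coin(proxy_policy, coin=3))
    print("test  -- intended reaches coin:", reaches_coin(intended_policy, coin=3))
```

The point of the sketch is that nothing in the proxy policy “breaks” at test time; it goes on competently optimizing the thing it internalized, which is exactly the worry about a value it never reasoned itself into.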