I think that, under the worldview of this concern, the distribution of reward functions effectively defines a representation. If that representation is too different from the one humans care about, the penalty will either rule out any realistic impact in the real world or be ineffective at penalising unwanted negative impacts.
Is there a central example you have in mind for this potential failure mode?
No.