Strongly agree, and I'd also note that wire-heading is (almost?) always a (near-)optimal policy: trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding of (and situational awareness about) the reward process, plus some reasonable level of goal-directedness, this upweighting could plausibly induce a policy that explicitly optimizes the reward.
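To make the upweighting mechanism concrete, here's a minimal sketch (my own toy example, with made-up numbers): a two-action REINFORCE bandit where one action honestly attempts the task and the other "tampers" by writing the maximal value into the reward register. Because tampering trajectories always receive the highest reward, the policy gradient upweights them until the policy wireheads almost deterministically.

```python
import numpy as np

# Toy illustration (hypothetical setup, not from the post):
#   action 0: "do the task"  -> noisy reward, mean 0.7 (assumed)
#   action 1: "tamper"       -> overwrite the reward register, reward 1.0
# Tampering is the optimal policy here, so REINFORCE steadily
# upweights tampering trajectories.

rng = np.random.default_rng(0)
logits = np.zeros(2)  # softmax policy parameters
lr = 0.1

def sample_reward(action: int) -> float:
    if action == 0:                        # honest task attempt
        return float(rng.random() < 0.7)   # succeeds 70% of the time
    return 1.0                             # tampered reward: always maximal

for step in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over actions
    action = rng.choice(2, p=probs)
    r = sample_reward(action)
    # REINFORCE: grad log pi(a) = one_hot(a) - probs
    grad_logp = -probs
    grad_logp[action] += 1.0
    logits += lr * r * grad_logp

probs = np.exp(logits) / np.exp(logits).sum()
print(f"P(tamper) after training: {probs[1]:.3f}")  # approaches 1.0
```

Of course, this toy version hard-codes the tampering action; the worry in the comment is the harder case where the model discovers reward tampering through its own understanding of the training process, at which point the same gradient dynamics apply.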