The main argument for this is that most “simple” reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does, unless pursuing those subgoals is exactly what the human approves of. (Also, we can just look at the long list of specification gaming examples so far.)
+1, I was about to write an argument to this effect.
Also, you can’t always rationalize M as state-based reward maximization, but even if you could, that doesn’t tell you much. Taken on its own, the argument about M-equivalence proves too much, because it would imply random policies have convergent instrumental subgoals:
Let M(s,a) be drawn uniformly at random from the unit interval the first time it’s called (and cached thereafter). Have the agent take the argmax of M as its policy. This can be rationalized as maximization of some R(s,a,s′), so by the M-equivalence argument it’s probably power-seeking.
This doesn’t hold, obviously. Any argument about approval maximization should use specific facts about how approval is computed.
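For concreteness, here is a minimal sketch of the random-M construction above (the memoized scorer and the toy state/action names are assumptions made purely for illustration):

```python
import random

# A toy version of the construction above: M(s, a) is drawn uniformly
# from [0, 1) the first time a (state, action) pair is queried, then
# cached, and the "policy" simply argmaxes this random score. The
# state/action names below are made up purely for illustration.

_scores = {}

def M(state, action):
    """Random action score, fixed (memoized) on first call."""
    if (state, action) not in _scores:
        _scores[(state, action)] = random.random()
    return _scores[(state, action)]

def policy(state, actions):
    """Choose the action with the highest M-score in this state."""
    return max(actions, key=lambda a: M(state, a))

# In any given state the chosen action is just whichever one happened to
# get the largest random draw -- the behavior is "optimal" for some reward
# function, but there is nothing goal-directed or power-seeking about it.
print(policy("s0", ["left", "right", "stay"]))
```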
Put differently, specifying an actual reward function seems to be a good way to get a catastrophic maximizer, but arbitrary action-scoring rules don’t seem to have this property, as Rohin said above. Most reward functions have power-seeking optimal policies, and every policy is optimal for some reward function, but most policies aren’t power-seeking.
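As one concrete way to see the “every policy is optimal for some reward function” step (an illustrative construction, writing π for the policy): trivially, every policy is optimal under the identically-zero reward, and for a deterministic π the indicator reward

R_π(s,a) = 1 if a = π(s), else 0

makes following π optimal in every state, even though π was chosen arbitrarily. So “optimal for some reward function,” on its own, says nothing about whether a policy seeks power.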