The main argument for this is that most “simple” reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does, unless pursuing those subgoals is exactly what the human approves of. (Also, we can just look at the long list of specification gaming examples so far.)
+1, I was about to write an argument to this effect.
Also, you can’t always rationalize M as state-based reward maximization, but even if you could, that doesn’t tell you much. Taken on its own, the argument about M-equivalence proves too much, because it would imply random policies have convergent instrumental subgoals:
Let M(s,a) be drawn uniformly at random from the unit interval the first time it’s called (and cached thereafter). Have the agent take the argmax of M as its policy. This can be rationalized as maximization of some R(s,a,s′), so by the M-equivalence argument it’s probably power-seeking.
This doesn’t hold, obviously. Any argument about approval maximization should use specific facts about how approval is computed.
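For concreteness, here is a minimal sketch of the random-M construction above (the memoized scorer and the toy state/action names are assumptions made purely for illustration):

```python
import random

# A toy version of the construction above: M(s, a) is drawn uniformly
# from [0, 1) the first time a (state, action) pair is queried, then
# cached, and the "policy" simply argmaxes this random score. The
# state/action names below are made up purely for illustration.

_scores = {}

def M(state, action):
    """Random action score, fixed (memoized) on first call."""
    if (state, action) not in _scores:
        _scores[(state, action)] = random.random()
    return _scores[(state, action)]

def policy(state, actions):
    """Choose the action with the highest M-score in this state."""
    return max(actions, key=lambda a: M(state, a))

# In any given state the chosen action is just whichever one happened to
# get the largest random draw -- the behavior is "optimal" for some reward
# function, but there is nothing goal-directed or power-seeking about it.
print(policy("s0", ["left", "right", "stay"]))
```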
Put differently, specifying an actual reward function seems to be a good way to get a catastrophic maximizer, but arbitrary action-scoring rules don’t seem to have this property, as Rohin said above. Most reward functions have power-seeking optimal policies, and every policy is optimal for some reward function, but most policies aren’t power-seeking.
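As one concrete way to see the “every policy is optimal for some reward function” step (an illustrative construction, writing π for the policy): trivially, every policy is optimal under the identically-zero reward, and for a deterministic π the indicator reward

R_π(s,a) = 1 if a = π(s), else 0

makes following π optimal in every state, even though π was chosen arbitrarily. So “optimal for some reward function,” on its own, says nothing about whether a policy seeks power.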