(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting R_aux := R seems to sacrifice quite a lot of performance. Is this real, or am I missing something?
Namely, whenever there’s an action a which doesn’t change the state and yields 1 reward, and a sequence of actions a_1, ..., a_n such that a_n yields reward m with m > n (and all a_i with i < n yield 0 reward), then it’s conceivable that R_AUP-1 would choose the (a_i) sequence while R_AUP-5 would just stubbornly repeat a, even if the (a_i)_{1≤i≤n} represent something very tailored to R that doesn’t involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R. This feels like a rather big deal, since we arguably want an agent to think long-term as long as it doesn’t involve gaining power. I guess the scaling step might help here?
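To make this concrete, here is a minimal sketch of the kind of scenario I have in mind (my own toy model, not code from the sequence). I’m assuming a penalty of the form |Q_R(s,a) − Q_R(s, noop)| with R_aux := R, the "stay" action doubling as the no-op baseline, and the scaling step dividing by Q_R(s, noop); all names and parameter values here are made up.

```python
# Toy chain MDP: at state 0, "stay" keeps the state and yields 1 reward;
# the "step" plan takes n actions, pays m only at its last step, then loops back.
gamma, n, m, lam = 0.95, 5, 10.0, 1.0   # discount, plan length, end reward, penalty weight
actions = ["stay", "step"]

def next_state(s, a):
    if a == "stay":
        return s
    return 0 if s == n - 1 else s + 1   # last step of the plan returns to state 0

def reward(s, a):
    if a == "stay":
        return 1.0 if s == 0 else 0.0   # the safe, repeatable 1-reward action
    return m if s == n - 1 else 0.0     # the plan pays m only at its final step

def solve(reward_fn, iters=2000):
    """Exact Q-values for this small deterministic MDP via value iteration."""
    Q = {(s, a): 0.0 for s in range(n) for a in actions}
    for _ in range(iters):
        V = {s: max(Q[(s, b)] for b in actions) for s in range(n)}
        Q = {(s, a): reward_fn(s, a) + gamma * V[next_state(s, a)]
             for s in range(n) for a in actions}
    return Q

Q_R = solve(reward)                     # attainable utility for the agent's own goal (R_aux := R)

def r_aup(scale):
    def fn(s, a):
        penalty = abs(Q_R[(s, a)] - Q_R[(s, "stay")])   # change in own AU vs. the no-op
        if scale:
            penalty /= max(Q_R[(s, "stay")], 1e-8)      # the "scaling step"
        return reward(s, a) - lam * penalty
    return fn

print("pure R picks:", max(actions, key=lambda a: Q_R[(0, a)]))
for scale in (False, True):
    Q_AUP = solve(r_aup(scale))
    best = max(actions, key=lambda a: Q_AUP[(0, a)])
    print(f"R_AUP (scaled penalty={scale}) picks '{best}' at the start state")
```

With these particular numbers, the unscaled penalty makes the R_AUP-optimal agent repeat the stay action forever, while the scaled variant still executes the plan, so the scaling step does seem to help at least here; whether that generalizes presumably depends on λ, γ, m, and n.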
Separately and very speculatively, I’m wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn’t be caught by your penalty since it’s about an internal change. If so, that might be a sign that it’ll be difficult to fix. More generally, if you don’t consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness.
For optimal policies, yes. In practice, not always—in SafeLife, AUP often improved performance on the original task by ~50% compared to naive reward maximization with the same underlying algorithm!
it seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R.
Yeah. I’m also pretty sympathetic to arguments by Rohin and others that the R_aux = R variant isn’t quite right in general; maybe there’s a better way to formalize “do the thing without gaining power to do it” with respect to the agent’s own goal.
whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.
I think this is plausible, yep. This is why I think it’s somewhat more likely than not that there’s no clean way to solve this; however, I haven’t even thought very hard about how to solve the problem yet.
More generally, if you don’t consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
Depends on how that shows up in the non-embedded formalization, if at all. If it doesn’t show up, then the optimal policy won’t be able to predict any benefit and won’t do it. If it does… I don’t know. It might. I’d need to think about it more, because I feel confused about how exactly that would work—what its model of itself is, exactly, and so on.