I realize that impact measures always lead to a tradeoff between safety and performance competitiveness.
For optimal policies, yes. In practice, not always—in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!
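(For concreteness, the AUP reward has roughly the form

$$R_{\text{AUP}}(s,a) \;\approx\; R(s,a) \;-\; \frac{\lambda}{|\mathcal{R}_{\text{aux}}|}\sum_{R_i \in \mathcal{R}_{\text{aux}}} \big|\,Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing)\,\big|,$$

where $\mathcal{R}_{\text{aux}}$ is the set of auxiliary reward functions, $\varnothing$ is a no-op action, and $\lambda$ trades off task reward against the penalty. This is a simplified sketch that leaves out the scaling and normalization details used in the actual experiments.)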
It seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R.
Yeah. I’m also pretty sympathetic to arguments by Rohin and others that the R_aux = R variant isn’t quite right in general; maybe there’s a better way to formalize “do the thing without gaining power to do it” wrt the agent’s own goal.
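(In the R_aux = R variant, the penalty falls on shifts in the agent’s attainable utility for its own reward, roughly

$$R_{\text{AUP}}(s,a) \;\approx\; R(s,a) \;-\; \lambda\,\big|\,Q^{*}_{R}(s,a) - Q^{*}_{R}(s,\varnothing)\,\big|,$$

again a sketch with scaling details omitted. The intended reading is “achieve $R$ without changing your own power to achieve $R$”.)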
I wonder whether the open problem of the AUP agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.
I think this is plausible, yep. That’s part of why I think it’s somewhat more likely than not that there’s no clean way to solve this; that said, I haven’t yet thought very hard about how to solve the problem.
More generally, if you don’t consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
Depends on how that shows up in the non-embedded formalization, if at all. If it doesn’t show up, then the optimal policy won’t be able to predict any benefit and won’t do it. If it does… I don’t know. It might. I’d need to think about it more, because I feel confused about how exactly that would work—what its model of itself is, exactly, and so on.
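(To spell out the non-embedded assumption: the agent optimizes a fixed MDP $\langle S, A, T, R \rangle$ whose states don’t include its own source code or parameters, and the optimal policy just satisfies the usual Bellman equation

$$Q^{*}(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim T(\cdot \mid s,a)}\Big[\max_{a'} Q^{*}(s',a')\Big].$$

If self-modification doesn’t correspond to any $(s,a)$ whose consequences $T$ models, the backup can’t assign it any value, which is the sense in which the optimal policy “won’t be able to predict any benefit”.)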