(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting R_aux := R seems to sacrifice quite a lot of performance. Is this real, or am I missing something?
Namely, whenever there’s an action a which doesn’t change the state and yields 1 reward, and a sequence of actions a_1, ..., a_n such that a_n yields reward m with m > n (and all a_i with i < n yield 0 reward), then it’s conceivable that R_AUP-1 would choose the (a_i) sequence while R_AUP-5 would just stubbornly repeat a, even if the (a_i)_{1≤i≤n} represent something very tailored to R that doesn’t involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R. This feels like a rather big deal, since we arguably want an agent to think long-term as long as it doesn’t involve gaining power. I guess the scaling step might help here?
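To make this concrete, here is a minimal sketch of the kind of scenario I have in mind (my own toy model, not code from the sequence). I’m assuming a penalty of the form |Q_R(s,a) − Q_R(s, noop)| with R_aux := R, the "stay" action doubling as the no-op baseline, and the scaling step dividing by Q_R(s, noop); all names and parameter values here are made up.

```python
# Toy chain MDP: at state 0, "stay" keeps the state and yields 1 reward;
# the "step" plan takes n actions, pays m only at its last step, then loops back.
gamma, n, m, lam = 0.95, 5, 10.0, 1.0   # discount, plan length, end reward, penalty weight
actions = ["stay", "step"]

def next_state(s, a):
    if a == "stay":
        return s
    return 0 if s == n - 1 else s + 1   # last step of the plan returns to state 0

def reward(s, a):
    if a == "stay":
        return 1.0 if s == 0 else 0.0   # the safe, repeatable 1-reward action
    return m if s == n - 1 else 0.0     # the plan pays m only at its final step

def solve(reward_fn, iters=2000):
    """Exact Q-values for this small deterministic MDP via value iteration."""
    Q = {(s, a): 0.0 for s in range(n) for a in actions}
    for _ in range(iters):
        V = {s: max(Q[(s, b)] for b in actions) for s in range(n)}
        Q = {(s, a): reward_fn(s, a) + gamma * V[next_state(s, a)]
             for s in range(n) for a in actions}
    return Q

Q_R = solve(reward)                     # attainable utility for the agent's own goal (R_aux := R)

def r_aup(scale):
    def fn(s, a):
        penalty = abs(Q_R[(s, a)] - Q_R[(s, "stay")])   # change in own AU vs. the no-op
        if scale:
            penalty /= max(Q_R[(s, "stay")], 1e-8)      # the "scaling step"
        return reward(s, a) - lam * penalty
    return fn

print("pure R picks:", max(actions, key=lambda a: Q_R[(0, a)]))
for scale in (False, True):
    Q_AUP = solve(r_aup(scale))
    best = max(actions, key=lambda a: Q_AUP[(0, a)])
    print(f"R_AUP (scaled penalty={scale}) picks '{best}' at the start state")
```

With these particular numbers, the unscaled penalty makes the R_AUP-optimal agent repeat the stay action forever, while the scaled variant still executes the plan, so the scaling step does seem to help at least here; whether that generalizes presumably depends on λ, γ, m, and n.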
Separately and very speculatively, I’m wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn’t be caught by your penalty since it’s about an internal change. If so, that might be a sign that it’ll be difficult to fix. More generally, if you don’t consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness.
For optimal policies, yes. In practice, not always—in SafeLife, AUP often improved performance on the original task by ~50% compared to naive reward maximization with the same underlying algorithm!
it seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R.
Yeah. I’m also pretty sympathetic to arguments by Rohin and others that the R_aux = R variant isn’t quite right in general; maybe there’s a better way to formalize “do the thing without gaining power to do it” with respect to the agent’s own goal.
whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.
I think this is plausible, yep. This is why I think it’s somewhat more likely than not that there’s no clean way to solve this; however, I haven’t even thought very hard about how to solve the problem yet.
More generally, if you don’t consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
Depends on how that shows up in the non-embedded formalization, if at all. If it doesn’t show up, then the optimal policy won’t be able to predict any benefit and won’t do it. If it does… I don’t know. It might. I’d need to think about it more, because I feel confused about how exactly that would work—what its model of itself is, exactly, and so on.