There is some similarity, but there are also major differences. They don’t even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. AUP, on the other hand, is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but that would still be a very different condition from the dangerousness bound.
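To make the type-signature difference concrete, here is a rough sketch of the AUP reward modification as I understand it (the exact scaling and normalization in the AUP paper may differ; $\lambda$, the auxiliary reward set $\mathcal{R}_{\text{aux}}$ and the no-op action $\varnothing$ are schematic):

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{|\mathcal{R}_{\text{aux}}|} \sum_{R_i \in \mathcal{R}_{\text{aux}}} \big| Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \big|$$

Q-learning then proceeds as usual on this modified reward. That is what I mean by a heuristic tweak: it specifies what the algorithm does, not which guarantee the resulting agent satisfies.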
I pointed out the relation to corrigibility not because I think that’s the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that “if you run this AI, this won’t make things worse than not running the AI”, no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might already be familiar with.
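As a very rough schematic (my paraphrase, not the precise definition, which is stated in terms of the user’s utility function and a confidence/slack parameter), the condition is something like

$$\mathbb{E}[U \mid \text{run the AI}] \;\ge\; \mathbb{E}[U \mid \text{don't run the AI}] \;-\; \epsilon$$

where $U$ is the user’s utility and $\epsilon$ is a small slack term.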
From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also arguments in the opposite direction. Specifically, if you believe that debate is a necessary component of Dialogic RL, then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can’t (empirically and in the worst case) prove a negative.