In other words, I don’t think the choice of prior here is substantially different from, or more difficult than, the choice of prior for AGI from a pure capability POV.
This seems wrong to me, but I’m having trouble articulating why. It feels like, for the actual “prior” we use, there will be many more hypotheses for capable behavior than for safe, capable behavior.
A background fact that’s probably relevant: I don’t expect that we’ll be using an explicit prior, and to the extent that we have an implicit prior, I doubt it will look anything like the universal prior.
The way I imagine it working, the advisor will not do something weird and complicated that ey don’t understand emself. [...] I have a research direction, based on the “debate” approach, for how to strengthen it.
Yeah, this seems good to me!
The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example).
I focus mostly on formal properties that algorithms can or cannot have, rather than on the algorithms themselves. So, from my point of view, it doesn’t matter whether the prior is “explicit”, and I doubt that’s even a well-defined question. What I mean by “prior” is, more or less, whatever probability measure yields the best Bayesian regret bound for the given RL algorithm.
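To spell out one reading of “best Bayesian regret bound” (a rough sketch with illustrative notation, not the exact definition used in the DRL write-ups): for a prior $\zeta$ over environments $\mu$ and a policy $\pi$ run for a horizon $T$,

$$\mathrm{BR}_\zeta(\pi,T)\;=\;\mathbb{E}_{\mu\sim\zeta}\!\left[\,\sup_{\pi^*}\mathbb{E}^{\pi^*}_{\mu}\!\left[\sum_{t=1}^{T}r_t\right]\;-\;\mathbb{E}^{\pi}_{\mu}\!\left[\sum_{t=1}^{T}r_t\right]\right],$$

and the “prior” associated with an algorithm is, roughly, whichever $\zeta$ gives it the strongest guarantee of the form $\mathrm{BR}_\zeta(\pi,T)=o(T)$.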
I think the prior will have to look somewhat like the universal prior. Occam’s razor is a foundational principle of rationality, and any reasonable algorithm should have an inductive bias towards simpler hypotheses. I think there’s even some work trying to prove that deep learning already has such an inductive bias. At the same time, the space of hypotheses has to be very rich (although still constrained by computational resources and by whatever additional structural assumptions are needed to make learning feasible).
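As an illustration of what “somewhat like the universal prior” could mean (a schematic gloss, not a claim about the exact construction): a prior that weights each hypothesis by its description length,

$$\zeta(h)\;\propto\;2^{-\ell(h)},\qquad h\in\mathcal{H},$$

where $\ell(h)$ is the length of the shortest program (or, under resource bounds, the shortest efficiently computable representation) for $h$, and $\mathcal{H}$ is already restricted by the computational and structural constraints just mentioned.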
I think that DRL doesn’t require a prior (or, more generally, algorithmic building blocks) substantially different from what is needed for capabilities. If your algorithm is superintelligent (in the sense that it’s relevant to either causing or mitigating X-risk), then it has to create sophisticated models of the world that include people, among other things, so forcing it to model the advisor as well doesn’t make the task substantially harder (it is harder in the sense that the regret bound is weaker, but that is not because of the prior).
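Schematically (a paraphrase of the claim rather than the actual bounds): both the plain-RL and the delegative guarantees are stated relative to the same prior $\zeta$,

$$\mathrm{BR}^{\mathrm{RL}}_\zeta(T)\;\le\;f(C_\zeta,T),\qquad \mathrm{BR}^{\mathrm{DRL}}_\zeta(T)\;\le\;g(C_\zeta,T),$$

where $C_\zeta$ is some complexity measure of the prior and $g$ is weaker than $f$; the weakening comes from the delegation setting and advisor-related quantities, not from $\zeta$ having to be a different kind of object than the one capabilities already require.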
Okay, that makes sense.