For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?
Ultimately, the set of hypotheses should be something like the universal prior. More precisely, it should be whatever we need to use instead of the universal prior to get “general” intelligence that is computationally efficient. In other words, I don’t think the choice of prior here is substantially different from, or more difficult than, the choice of prior for AGI from a pure capability POV.
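For reference, the standard construction of the universal prior is the Solomonoff mixture: fix a prefix universal Turing machine $U$ and give each finite observation string $x$ the weight

$$\xi(x) \;=\; \sum_{p\,:\,U(p)\text{ starts with }x} 2^{-|p|},$$

i.e. each program $p$ contributes weight $2^{-|p|}$, and the sum runs over all programs whose output begins with $x$. The computationally efficient replacement gestured at above would presumably keep the same shape (a rich hypothesis class with an exponential simplicity penalty) while restricting attention to feasible hypotheses.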
Maybe I’m also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn’t realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn’t seem like DRL saves us here.
The way I imagine it will work, the advisor will not do something weird and complicated that ey don’t understand emself. So the AI also cannot do something weird and complicated that the advisor doesn’t understand. In order for this not to be too constraining, I imagine the advisor having some kind of “diary” where ey write down eir thoughts and reasoning, which is considered a part of normal interaction with the environment. The advisor will only act on complicated plans after laying out the reasoning. The AI will then have to write down in this diary an explanation of its plans that the advisor will find understandable and convincing. This is partially protected from manipulation, because the explanation has to come from the space of explanations that the advisor could produce. That said, I think this defense against manipulation is insufficient in itself (because the AI can write down all the arguments for a given position that the advisor could come up with, without writing down any arguments against it), and I have a research direction based on the “debate” approach about how to strengthen it.
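As a toy sketch of the diary protocol just described (the interfaces below are hypothetical and only for illustration, not part of any actual DRL formalism): a complicated plan may only be acted on if the agent’s written justification lies within the space of explanations the advisor could have produced and the advisor finds it convincing.

```python
# Toy sketch of the "diary" constraint; all interfaces here are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    description: str
    is_complicated: bool


def may_execute(
    plan: Plan,
    agent_explanation: str,
    advisor_explanation_prob: Callable[[str, Plan], float],
    advisor_is_convinced: Callable[[str, Plan], bool],
) -> bool:
    """Return True iff the agent is allowed to act on `plan`.

    `advisor_explanation_prob(e, plan)` models the probability that the advisor
    emself would write explanation `e` for `plan` in the diary; requiring it to
    be positive confines the agent to the advisor's space of explanations.
    """
    if not plan.is_complicated:
        # Simple actions are governed by ordinary delegation, not the diary.
        return True
    if advisor_explanation_prob(agent_explanation, plan) <= 0.0:
        # An explanation the advisor could never have produced emself.
        return False
    # The explanation must also actually convince the advisor.
    return advisor_is_convinced(agent_explanation, plan)
```

Note that the failure mode mentioned above, listing only the arguments in favor, passes this check, which is why something debate-like is needed on top of it.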
Maybe another way of putting it—is there additional safety conferred by this approach that you couldn’t get by having a human review all of the AI’s actions? If so, should I think of this as “we want a human to review actions, but that’s expensive, DRL is a way to make it more sample efficient”?
The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example). In the “soft” version it will become a limit on how much the AI policy deviates from the advisor policy, so it’s not quite a review in the usual sense: there is no binary division between “legal” and “illegal” actions. I think of it more like this: the AI should emulate an “improved” version of the advisor and do all the things the advisor would do on eir “best day”.
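One illustration of what such a deviation limit could look like (not necessarily the form the formalism ends up using) is a per-history divergence bound,

$$D_{\mathrm{KL}}\!\left(\pi^{\mathrm{AI}}(\cdot \mid h)\,\middle\|\,\pi^{\mathrm{adv}}(\cdot \mid h)\right) \le \epsilon \quad \text{for every history } h,$$

where $\epsilon = 0$ is pure imitation of the advisor and large $\epsilon$ approaches unconstrained RL. Under a bound like this the AI can only reweight probability among actions the advisor might plausibly take; it cannot put mass on actions ey would never take, which is one way to read the “best day” intuition.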
In other words, I don’t think the choice of prior here is substantially different from, or more difficult than, the choice of prior for AGI from a pure capability POV.
This seems wrong to me, but I’m having trouble articulating why. It feels like for the actual “prior” we use there will be many more hypotheses for capable behavior than for safe, capable behavior.
A background fact that’s probably relevant: I don’t expect that we’ll be using an explicit prior, and to the extent that we have an implicit prior, I doubt it will look anything like the universal prior.
The way I imagine it will work, the advisor will not do something weird and complicated that ey don’t understand emself. [...] I have a research direction based on the “debate” approach about how to strengthen it.
Yeah, this seems good to me!
The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example).
Okay, that makes sense.
I focus mostly on formal properties that algorithms can or cannot have, rather than on the algorithms themselves. So, from my point of view, it doesn’t matter whether the prior is “explicit”, and I doubt that’s even a well-defined question. What I mean by “prior” is, more or less, whatever probability measure yields the best Bayesian regret bound for the given RL algorithm.
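To spell out the standard notion being invoked here: for a prior $\zeta$ over environments $\mu$ and horizon $T$, the Bayesian regret of a policy $\pi$ is roughly

$$\mathrm{BR}_\zeta(\pi, T) \;=\; \mathbb{E}_{\mu \sim \zeta}\!\left[\,\sup_{\pi^\ast} \mathbb{E}^{\pi^\ast}_{\mu}\!\Big[\sum_{t=0}^{T-1} r_t\Big] \;-\; \mathbb{E}^{\pi}_{\mu}\!\Big[\sum_{t=0}^{T-1} r_t\Big]\right],$$

the expected shortfall, under the prior, relative to the best policy for the true environment. “The prior” in the sense above is then whichever $\zeta$ makes this quantity provably small for the algorithm in question.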
I think the prior will have to look somewhat like the universal prior. Occam’s razor is a foundational principle of rationality, and any reasonable algorithm should have an inductive bias towards simpler hypotheses. I think there’s even some work trying to prove that deep learning already has such an inductive bias. At the same time, the space of hypotheses has to be very rich (although still constrained by computational resources and by some additional structural assumptions needed to make learning feasible).
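As a minimal illustration of that kind of inductive bias (the hypothesis class and numbers below are made up for the example), here is a Bayesian mixture whose prior weight decays exponentially with description length:

```python
# Minimal illustration of an Occam-style prior: weight 2^(-description length),
# then ordinary Bayesian updating. Hypotheses and numbers are invented.
# Each hypothesis is (name, description length in bits, P(next observation = 1)).
hypotheses = [
    ("constant-0", 3, 0.01),
    ("constant-1", 3, 0.99),
    ("fair-coin", 5, 0.5),
    ("complex-pattern", 40, 0.7),
]

# Occam prior: weight proportional to 2^(-description length), normalized.
prior = {name: 2.0 ** -length for name, length, _ in hypotheses}
total = sum(prior.values())
prior = {name: w / total for name, w in prior.items()}


def posterior(observations):
    """Bayesian update of the Occam prior on a sequence of 0/1 observations."""
    post = dict(prior)
    for obs in observations:
        for name, _, p_one in hypotheses:
            post[name] *= p_one if obs == 1 else (1.0 - p_one)
        norm = sum(post.values())
        post = {name: w / norm for name, w in post.items()}
    return post


print(prior)                    # simple hypotheses start out heavily favored
print(posterior([1, 1, 0, 1]))  # data then shifts mass between them
```

The point is just that an exponential simplicity penalty lets a very rich hypothesis class coexist with a strong preference for simple explanations until the data overrides it.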
I think that DRL doesn’t require a prior (or, more generally, algorithmic building blocks) substantially different from what is needed for capabilities. If your algorithm is superintelligent (in the sense that it’s relevant to either causing or mitigating X-risk), then it has to create sophisticated models of the world that include people, among other things, and therefore forcing it to model the advisor as well doesn’t make the task substantially harder. (It is harder in the sense that the regret bound is weaker, but that is not because of the prior.)