I think that most of the “intentional” catastrophic actions can be regarded as “due to ignorance” from an appropriate perspective
This makes sense to me at a high level, but I’m struggling to connect it to the math. It seems like delegative RL as described in the post I read wouldn’t solve reward hacking, because it has a specific reward function that it is trying to maximize, and it can’t “learn” a new reward function that it should be optimizing. I suppose if the advisor never explores an area of state space then the agent will never go there, but it doesn’t feel like much progress if our safety guarantees require the advisor to never explore anywhere that reward hacking could occur.
Example 1 is corrupt states, which I discussed here. These are states in which the specified reward function doesn’t match the intended reward function (and in which the advisor may also become unreliable).
So to be clear, this setting requires a different algorithm, detailed in the other post, right? (I haven’t read that post in detail.) Maybe that’s the answer to my question above; that in fact delegative RL doesn’t solve reward hacking, but this other post does.
Dealing with corrupt states requires a “different” algorithm, but the modification is rather trivial: for each hypothesis that includes dynamics and corruption, you need to replace the corrupt states by an inescapable state with reward zero and run the usual PSRL algorithm on this new prior. Indeed, the algorithm deals with corruption by never letting the agent go there. I am not sure I understand why you think this is not a good approach. Consider a corrupt state in which the human’s brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored?

Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability. In this case, I agree, the current model is extremely simplified, but it still feels like progress. You can see a model of continuous corruption in DIRL, which is a simpler setting. More generally, I think that a better version of the formalism would build on ideas from quantilization and catastrophe mitigation to arrive at a setting where you have a low rate of falling into traps or accumulating corruption as long as your policy remains “close enough” to the advisor policy w.r.t. some metric similar to the infinity-Rényi divergence (and as long as your corruption remains low).
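To make the modification concrete, here is one way to write it down; the notation (a corruption set $\mathcal{C}_H$ and an absorbing state $\bot$) is just schematic shorthand. Given a hypothesis $H$ with transition kernel $T_H$, reward function $R_H$ and corrupt states $\mathcal{C}_H$, define the modified hypothesis $\tilde{H}$ by

$$\tilde{T}_H(s' \mid s,a) = \begin{cases} \delta_{\bot}(s') & s \in \mathcal{C}_H \cup \{\bot\} \\ T_H(s' \mid s,a) & \text{otherwise} \end{cases} \qquad \tilde{R}_H(s,a) = \begin{cases} 0 & s \in \mathcal{C}_H \cup \{\bot\} \\ R_H(s,a) & \text{otherwise} \end{cases}$$

and run the usual PSRL algorithm with the prior pushed forward through $H \mapsto \tilde{H}$.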
Consider a corrupt state in which the human’s brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored?
I agree that state shouldn’t be explored.
Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability.
That seems closer to my objection but not exactly it.
Indeed, the algorithm deals with corruption by never letting the agent go there.
For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?
Maybe I’m also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn’t realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn’t seem like DRL saves us here.
Maybe another way of putting it—is there additional safety conferred by this approach that you couldn’t get by having a human review all of the AI’s actions? If so, should I think of this as “we want a human to review actions, but that’s expensive, DRL is a way to make it more sample efficient”?
For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?
Ultimately, the set of hypotheses should be something like the universal prior. More precisely, it should be whatever we need to use instead of the universal prior to get “general” intelligence that is computationally efficient. In other words, I don’t think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.
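(For concreteness, by “universal prior” I mean a Solomonoff-style prior that weights each hypothesis by its description complexity, roughly

$$\zeta(H) \propto 2^{-K(H)},$$

where $K(H)$ is the length of the shortest program computing $H$; the computationally efficient replacement would use some resource-bounded analogue of $K$.)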
Maybe I’m also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn’t realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn’t seem like DRL saves us here.
The way I imagine it will work, the advisor will not do something weird and complicated that ey don’t understand emself. So the AI also cannot do something weird and complicated that the advisor doesn’t understand. In order for this not to be too constraining, I imagine the advisor having some kind of “diary” where ey write down eir thoughts and reasoning, which is considered a part of normal interaction with the environment. The advisor will only act on complicated plans after writing down the reasoning. The AI will then have to write down in this diary an explanation for its plans that will be understandable by and convincing for the advisor. This is partially protected from manipulation, because the explanation has to come from the space of explanations that the advisor could produce. That said, I think this defense against manipulation is insufficient by itself (because the AI can write down all arguments for a given position that the advisor could come up with, without writing down any arguments against it), and I have a research direction based on the “debate” approach about how to strengthen it.
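Schematically, the protection is a support constraint: writing $\pi^{\mathrm{adv}}_{\mathrm{diary}}(\cdot \mid h)$ for the distribution over diary entries the advisor might write in situation $h$ (notation of mine, just to convey the shape of the condition), the AI’s diary-writing policy must satisfy

$$\operatorname{supp} \pi^{\mathrm{AI}}_{\mathrm{diary}}(\cdot \mid h) \subseteq \operatorname{supp} \pi^{\mathrm{adv}}_{\mathrm{diary}}(\cdot \mid h),$$

i.e. the AI can only write explanations the advisor could have produced; the weakness noted above is that it can still select adversarially among those explanations.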
Maybe another way of putting it—is there additional safety conferred by this approach that you couldn’t get by having a human review all of the AI’s actions? If so, should I think of this as “we want a human to review actions, but that’s expensive, DRL is a way to make it more sample efficient”?
The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example). In the “soft” version it will become a limit on how much the AI policy deviates from the advisor policy, so it’s not quite a review in the usual sense: there is no binary division between “legal” and “illegal” actions. I think of it more like this: the AI should emulate an “improved” version of the advisor, doing all the things the advisor would do on eir “best day”.
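To convey the shape of the soft version (this is schematic, and the actual metric might only be similar to, rather than exactly, the infinity-Rényi divergence): the guarantee would say that the rate of accumulating corruption or falling into traps stays low as long as something like

$$D_{\infty}\big(\pi(\cdot \mid h) \,\big\|\, \pi^{\mathrm{adv}}(\cdot \mid h)\big) = \log \max_{a} \frac{\pi(a \mid h)}{\pi^{\mathrm{adv}}(a \mid h)} \le \varepsilon$$

holds along the agent’s trajectory, rather than requiring every action to be individually “legal”.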
In other words, I don’t think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.
This seems wrong to me, but I’m having trouble articulating why. It feels like for the actual “prior” we use there will be many more hypotheses for capable behavior than for safe, capable behavior.
A background fact that’s probably relevant: I don’t expect that we’ll be using an explicit prior, and to the extent that we have an implicit prior, I doubt it will look anything like the universal prior.
The way I imagine it will work, the advisor will not do something weird and complicated that ey don’t understand emself. [...] I have a research direction based on the “debate” approach about how to strengthen it.
Yeah, this seems good to me!
The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example).

Okay, that makes sense.
I focus mostly on formal properties algorithms can or cannot have, rather than the algorithms themselves. So, from my point of view, it doesn’t matter whether the prior is “explicit” and I doubt it’s even a well-defined question. What I mean by “prior” is, more or less, whatever probability measure has the best Bayesian regret bound for the given RL algorithm.
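(Concretely, by the Bayesian regret of a policy $\pi$ I mean roughly

$$\mathrm{BR}_{\zeta}(\pi) = \mathbb{E}_{H \sim \zeta}\!\left[\mathrm{EU}^{*}_{H} - \mathrm{EU}^{\pi}_{H}\right],$$

the prior-expected gap between the value of the optimal policy for the true hypothesis and the value attained by $\pi$; the “prior of the algorithm” is then whichever measure $\zeta$ gives it the strongest bound on this quantity.)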
I think the prior will have to look somewhat like the universal prior. Occam’s razor is a foundational principle of rationality, and any reasonable algorithm should have an inductive bias towards simpler hypotheses. I think there’s even some work trying to prove that deep learning already has such an inductive bias. At the same time, the space of hypotheses has to be very rich (although still constrained by computational resources and some additional structural assumptions needed to make learning feasible).
I think that DRL doesn’t require a prior (or, more generally, algorithmic building blocks) substantially different from what is needed for capabilities. If your algorithm is superintelligent (in the sense that it’s relevant to either causing or mitigating X-risk), then it has to create sophisticated models of the world that include people, among other things, and therefore forcing it to model the advisor as well doesn’t make the task substantially harder (well, it is harder in the sense that the regret bound is weaker, but that is not because of the prior).