The switch would be useful if you expect reinforcement learning to work significantly better than imitation learning.
Paul Christiano’s Against mimicry talks about this. Well, it actually talks about a big problem with imitating humans, but the same issue also applies to imitating an amplified agent, AFAICT.
(I only came upon that post recently because I was searching for posts on human imitations. Your comment here made me realize that it also explains clearly why Paul and Ought would want to switch to using RL for IDA, which I’ve been wondering about for a while.)
One thing I’m still confused about: if there is a security hole in the evaluation mechanism implemented by the overseer, it seems like current standard RL algorithms (given enough compute) would train the model to exploit that security hole, because during training the model would eventually hit upon such an exploit, the evaluator would return a high reward, and the probability of the model outputting that exploit would increase. Is this addressed by making sure that the overseer has no security holes (which seems like a much harder problem than what the overseer needs to do in the imitation learning case), or in some other way?
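To make the concern concrete, here is a minimal toy sketch (my own construction, not anything from the IDA or reward modeling literature): a softmax policy over ten actions is trained with a REINFORCE-style update against an evaluator that has a deliberate bug, one degenerate action that returns a huge reward. Once exploration happens to sample that action, the update rule keeps shifting probability mass onto it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "overseer" evaluator: actions 0-8 get sensible rewards, but action 9
# triggers a hypothetical security hole and returns an outsized reward.
def evaluate(action):
    if action == 9:                # the "security hole"
        return 10.0
    return 1.0 - 0.1 * action      # honest reward: action 0 is genuinely best

n_actions = 10
logits = np.zeros(n_actions)       # parameters of a softmax policy
lr = 0.1

for step in range(5000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(n_actions, p=probs)
    r = evaluate(a)
    # REINFORCE update: push up the log-probability of the sampled action
    # in proportion to the reward it received.
    grad = -probs
    grad[a] += 1.0
    logits += lr * r * grad

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("final policy:", np.round(probs, 3))
# The policy ends up concentrated on action 9 (the exploit), even though
# the intended best action is 0.
```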
I found that the Recursive Reward Modeling paper talks about my concern under the Reward Gaming section. However it is kind of short on ideas for solving it:
Reward gaming: Opportunities for reward gaming arise when the reward function incorrectly provides high reward to some undesired behavior (Clark & Amodei, 2016; Lehman et al., 2018); see Figure 4 for a concrete example. One potential source for reward gaming is the reward model’s vulnerability to adversarial inputs (Szegedy et al., 2013). If the environment is complex enough, the agent might figure out how to specifically craft these adversarially perturbed inputs in order to trick the reward model into providing higher reward than the user intends. Unlike in most work on generating adversarial examples (Goodfellow et al., 2015; Huang et al., 2017), the agent would not necessarily be free to synthesize any possible input to the reward model, but would need to find a way to realize adversarial observation sequences in its environment.

Reward gaming problems are in principle solvable by improving the reward model. Whether this means that reward gaming problems can also be overcome in practice is arguably one of the biggest open questions and possibly the greatest weakness of reward modeling. Yet there are a few examples from the literature indicating that reward gaming can be avoided in practice. Reinforcement learning from a learned reward function has been successful in gridworlds (Bahdanau et al., 2018), Atari games (Christiano et al., 2017; Ibarz et al., 2018), and continuous motor control tasks (Ho & Ermon, 2016; Christiano et al., 2017).
It seems to me that reward gaming not happening in practice is probably the result of the limited computation applied to RL so far (which amounts to a kind of implicit mild optimization), so that result would likely not be robust to scaling up. In the limit of infinite computation, standard RL algorithms should end up training models that maximize reward, which means doing reward gaming if that is what maximizes reward. (Perhaps with current RL algorithms this isn’t a concern in practice because it would take an impractically large amount of computation to actually end up with reward gaming / exploitative behavior, but it doesn’t seem safe to assume that this will continue to be true despite future advances in RL technology.)
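As a crude illustration of that point (made-up numbers, purely to show the shape of the argument): if exploitative behavior occupies a tiny fraction of the behavior space, the chance that exploration ever encounters it grows steadily with the number of rollouts, so seeing no reward gaming at small compute budgets says little about what happens at large ones.

```python
# Suppose one behavior in a million triggers a flaw in the reward model
# (an arbitrary illustrative rate).
p_exploit = 1e-6

# Probability that at least one of n independent rollouts hits the flaw.
for n in (10**3, 10**5, 10**7, 10**9):
    p_found = 1.0 - (1.0 - p_exploit) ** n
    print(f"rollouts = {n:>13,d}   P(exploit encountered) = {p_found:.4f}")

# Small training budgets rarely surface the exploit ("no gaming in practice"),
# but the probability approaches 1 as compute scales up, and once the exploit
# is encountered an RL update will reinforce it.
```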
It currently feels to me like the problems with imitation learning might be less serious and easier to solve than the reward gaming problem with RL, so I’m still wondering about the move from SL/imitation to RL for IDA. Also, if there isn’t existing discussion of this that I’ve missed, I’m surprised that the issue has received so little attention.
Against mimicry is mostly motivated by the case of imitating an amplified agent. (I try to separate the problems of distillation and amplification, and imitation learning is one candidate for the distillation step.)
You could try to avoid the RL exploiting a security vulnerability in the overseer by:
Doing something like quantilizing, where you are constrained to be near the original policy (we impose a KL constraint that prevents the policy from drifting too far from an attempted imitation); a rough sketch of that KL-penalized objective follows below, after this list.
Doing something like meeting halfway.
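For concreteness, the KL-constrained idea mentioned above might look roughly like the following (a sketch with toy numbers and a penalty coefficient of my own choosing, not a description of Paul's actual proposal):

```python
import numpy as np

def kl_regularized_objective(policy, imitation_policy, rewards, beta):
    """Expected reward under `policy`, minus a penalty for KL divergence from
    the attempted-imitation policy. Everything here is a toy discrete-action
    setting; the function and variable names are mine."""
    policy = np.asarray(policy, dtype=float)
    imitation_policy = np.asarray(imitation_policy, dtype=float)
    expected_reward = float(np.dot(policy, rewards))
    kl = float(np.sum(policy * np.log(policy / imitation_policy)))
    return expected_reward - beta * kl

rewards = np.array([1.0, 0.9, 10.0])     # action 2 games the evaluator
imitation = np.array([0.6, 0.39, 0.01])  # attempted imitation of the overseer
gamer = np.array([0.01, 0.01, 0.98])     # policy that almost always exploits

print(kl_regularized_objective(imitation, imitation, rewards, beta=5.0))  # ~1.05
print(kl_regularized_objective(gamer, imitation, rewards, beta=5.0))      # large and negative
```

With a large enough penalty, drifting far from the attempted imitation scores badly even when the drifted policy has found a high-reward exploit; the obvious cost is that the same penalty also limits how much the learned policy can improve on the imitation.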
These solutions seem tricky but maybe helpful. But my bigger concern is that you need to fix security vulnerabilities anyway:
The algorithm “Search over lots of actions to find the one for which Q(a) is maximized” is a pretty good algorithm, that you need to be able to use at test time in order to be competitive, and which seems to require a Q that is robust to that search.
Iterated amplification does optimization anyway (by amplifying the optimization done by the individual humans) and without security you are going to have problems there.
The algorithm “Search over lots of actions to find the one for which Q(a) is maximized” is a pretty good algorithm, that you need to be able to use at test time in order to be competitive, and which seems to require a Q that is robust to that search.
This search is done by the overseer (or the counterfactual overseer, or the trained model), and is likely to be safer than the search done by the RL training process. The overseer has more information about the specific situation that calls for a search and can tailor the search process to fit it, including how to avoid generating candidate actions that might trigger a flaw in Q, how many candidates to test, and so on. The RL search process, by contrast, is dumb: whatever safety mechanisms or constraints are put on it have to be decided by the AI designer ahead of time and applied to every situation.
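As a hypothetical sketch of that contrast (none of these functions correspond to a real system; the names and the `looks_risky` check are mine):

```python
# Blind RL-style search: evaluate whatever candidates come up and take the
# argmax of Q, with any safety constraints fixed in advance by the designer.
def blind_search(candidates, Q):
    return max(candidates, key=Q)

# Overseer-directed search: the overseer decides, for this particular
# situation, which candidates might trigger a flaw in Q and how many
# candidates are worth evaluating at all.
def overseer_search(candidates, Q, looks_risky, budget):
    vetted = [a for a in candidates if not looks_risky(a)]
    return max(vetted[:budget], key=Q, default=None)
```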
You could try to avoid the RL exploiting a security vulnerability in the overseer by:
Ok, I would be interested to learn more about the solution you think is most promising here. Is it written up anywhere yet, or is there something in the ML or AI safety literature that you can point me to?
I mostly hope to solve this problem with security amplification (see also).
I think when I wrote my comment I had already internalized “security amplification” and was assuming an LBO/meta-execution style system, but I was worried about flaws introduced by meta-execution itself (doing meta-execution seems like programming, in that you have to break tasks down into sub-tasks and can make a mistake in the process). This seems related to what you wrote in the security amplification post:
Note that security amplification + distillation will only remove the vulnerabilities that came from the human. We will still be left with vulnerabilities introduced by our learning process, and with any inherent limits on our model’s ability to represent/learn a secure policy. So we’ll have to deal with those problems separately.
I’m sort of confused by the main point of that post. Is the idea that the robot can’t stack blocks because of a physical limitation? If so, it seems like this is addressed by the first initial objection. Is it rather that the model space might not have the capacity to correctly imitate the human? I’d be somewhat surprised by this being a big issue, and at any rate it seems like you could use the Wasserstein metric as a cost function and get a desirable outcome. I guess we’re instead imagining a problem where there’s no great metric (e.g. text answers to questions)?
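To make the Wasserstein suggestion concrete, here is a minimal sketch (assuming one-dimensional continuous actions and using scipy's empirical Wasserstein distance as the imitation cost; the setup is illustrative, not something from the post):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical 1-D action samples: what the human demonstrator does in some
# state vs. what the current imitation policy does in the same state.
human_actions = rng.normal(loc=0.0, scale=1.0, size=1000)
robot_actions = rng.normal(loc=0.3, scale=1.2, size=1000)

# Empirical Wasserstein-1 (earth mover's) distance between the two action
# distributions, used here as the imitation cost to minimize.
cost = wasserstein_distance(human_actions, robot_actions)
print(f"imitation cost (Wasserstein-1): {cost:.3f}")
```

For low-dimensional continuous actions, a metric like this gives a usable training signal even when exact imitation is impossible; the text-answers-to-questions case is exactly where no comparably natural metric suggests itself.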
There are lots of reasons that a robot might be unable to learn the correct policy despite the action space permitting it:
Not enough model capacity
Not enough training data
Training got stuck in a local optimum
You’ve learned from robot play data, but you’ve never seen anything like the human policy before
etc., etc.
Not all of these are compatible with “and so the robot does the thing that the human does 5% of the time”. But it seems like there can and probably will be factors that are different between the human and the robot (even if the human uses teleoperation), and in that setting imitating factored cognition provides the wrong incentives, while optimizing factored evaluation provides the right incentives.
I think AI will probably be good enough to pose a catastrophic risk before it can exactly imitate a human. (But as Wei Dai says elsewhere, if you do amplification then you will definitely get into the regime where you can’t imitate.)
Is it rather that the model space might not have the capacity to correctly imitate the human?
Paul wrote in a parallel subthread, “Against mimicry is mostly motivated by the case of imitating an amplified agent.” In the case of IDA, you’re bound to run out of model capacity eventually as you keep iterating the amplification and distillation.
I guess we’re instead imagining a problem where there’s no great metric (e.g. text answers to questions)?
I’m pretty sure that’s it.