So, do we have guarantees or not? The first sentence says there are, while the second says the model could end up stronger than the ones it imitates.
The first sentence says that you have a guarantee that the overseer is at least as strong as the target, while the second sentence notes that the model might be stronger (or weaker) than the target. So we know overseer ≥ target, but we don’t know target ≥ model, so we can’t conclude overseer ≥ model.
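To make the ordering explicit (just my notation here, writing X ≥ Y for “X is at least as strong as Y”):

```latex
% Illustrative notation only: X \geq Y stands for "X is at least as strong as Y".
\underbrace{\text{overseer} \geq \text{target}}_{\text{guaranteed}}
\quad\wedge\quad
\underbrace{\text{target} \geq \text{model}}_{\text{not guaranteed}}
\quad\Longrightarrow\quad
\text{overseer} \geq \text{model}
```

The conclusion needs both links, and the second sentence is pointing out that the middle link (target ≥ model) is exactly the one you don’t get.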
About 3, isn’t there a risk that M_{n+1} behaves such that it simplifies or removes the checks of Amp(M_{n+1})? One way to deal with that would be to make humans do the adversarial attacks, but that would probably hurt training competitiveness.
There’s still a human in the loop since Amp(M) is just H consulting M—and you should still be using a target model to do the oversight. But the real thing you’re relying on here to prevent M from causing the oversight to fail in the future is myopia verification, as a myopic M should never pursue that strategy.
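Here’s a rough sketch of the shape of the loop I have in mind (toy code with made-up names like `amplify`, `train`, and `verify_myopia`, not anything literal from the post):

```python
def amplify(human, model):
    """Amp(M): the human answers oversight questions, consulting the model."""
    return lambda query: human(query, helper=model)

def training_step(model, human, train, verify_myopia):
    overseer = amplify(human, model)        # Amp(M_n): the human stays in the loop
    new_model = train(model, overseer)      # M_{n+1}, trained with overseer feedback
    # A verified-myopic M_{n+1} only optimizes its per-step objective, so it has
    # no incentive to simplify or remove Amp's checks at future steps.
    if not verify_myopia(overseer, new_model):
        raise RuntimeError("oversight failed: could not verify myopia")
    return new_model
```

The point of the sketch is just that the myopia check, not the human’s presence per se, is what blocks the strategy you describe.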
I think I get the intuition, but evaluation is far less rich a signal than production of behavior: you have a score or a binary yes/no for the former, and the full behavior for the latter. What I believe you meant is that using evaluation instead of production makes the method applicable to far more problems, but I might be wrong.
I think there are lots of cases where evaluation is richer than imitation—compare RL to behavior cloning, for example.
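A toy way to see the comparison (my own illustration, nothing from the post): behavior cloning only ever gets signal on actions the demonstrator actually took, while an evaluator can score any action the learner proposes.

```python
ACTIONS = ["a", "b", "c"]
demos = ["a", "a", "b"]                      # the demonstrator never plays "c"

def evaluate(action):                        # evaluation is defined on every action
    return {"a": 0.5, "b": 0.7, "c": 0.9}[action]

# Behavior cloning: copy the most common demonstrated action.
bc_policy = max(ACTIONS, key=demos.count)    # -> "a"

# RL against the evaluator: gets feedback even on actions absent from the demos.
rl_policy = max(ACTIONS, key=evaluate)       # -> "c"
```

The evaluation signal covers the whole action space, which is the sense in which it can carry more information than the demonstrations do.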
Finally, for 8, do you have examples of when it behaves differently from 3? Because it seems to me that in the limit, imitation will produce the same behavior as extraction of the reward function followed by maximization of that reward. Maybe something about generalization changes?
They can certainly behave differently away from the limit. But even in the limit, when you do imitation you try to mimic both what the human values and how the human pursues those values, whereas when you do reward learning followed by reward maximization you try to mimic the values but not the strategy the human uses to pursue them. Thus, a model trained to maximize a learned reward might, even in the limit, take actions to maximize that reward that the original human never would, perhaps because the human would never have thought of such actions.
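A numeric toy version of that divergence (mine, with made-up numbers, not from the post): fit a reward to the human’s demonstrations, then maximize it over a wider action space than the human ever used.

```python
import numpy as np

human_actions = np.array([0.1, 0.2, 0.3])    # the human only ever takes small actions
human_scores  = np.array([1.0, 2.0, 3.0])    # and prefers 0.3 among them

# "Reward learning": fit a simple linear reward to the demonstrations.
slope, intercept = np.polyfit(human_actions, human_scores, deg=1)

def learned_reward(a):
    return slope * a + intercept

# Imitation stays inside the demonstrated behavior...
imitation_action = human_actions[np.argmax(human_scores)]             # 0.3

# ...while maximizing the learned reward extrapolates far beyond it.
candidates = np.linspace(0.0, 10.0, 101)
maximizer_action = candidates[np.argmax(learned_reward(candidates))]  # 10.0
```

The reward model fits the demonstrations perfectly, yet the maximizer ends up somewhere the human never went.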
About the guarantees, now that you point it out, the two sentences indeed have different subjects.
About 3, it makes sense that myopia is the most important part.
For evaluation vs imitation, I think we might mean two different things by “richer”. I mean that the content of the signal itself has more information and more structure, whereas I believe you mean that it applies to more situations and is more general. Is that a good description of your intuition, or am I wrong here?
For the difference between reward learning + maximization and imitation, you’re right, I forgot that most people and systems are not necessarily optimal for their observable reward function. And even if they were, I guess the way the reward generalizes to new environments might differ from the way the imitation generalizes.
Glad you enjoyed the post!
Thanks for the answers.