Thanks for the post! It’s really useful as a pointer towards parts of the research landscape I might not otherwise see or notice.
I still have some comments/questions:
About 2, you write:
Additionally, because we do intermittent oversight whenever we change the target network, we have a guarantee that the overseer is always at least as strong as any targets that the model was trained to imitate. That being said, the model could end up stronger than the targets it was trained to imitate if it manages to, for example, achieve significant compression of what the targets are doing—which in fact is necessary for training competitiveness.
So, do we have a guarantee or not? The first sentence says we do, while the second says the model could end up stronger than the targets it imitates.
About 3, isn’t there a risk that M_{n+1} behaves such that it simplifies or removes the checks done by Amp(M_{n+1})? One way to deal with that would be to have humans do the adversarial attacks, but that would probably hurt training competitiveness.
About 4, you write:
In particular, the standard maxim that it is generally easier to evaluate behavior than produce it seems to suggest that human approval should produce a significantly richer training signal than the simple automated distance metric in imitative amplification, resulting in more efficient training.
I think I get the intuition, but evaluation is a far less rich signal than production of behavior: you get a score or a binary yes/no for the former, and the full behavior for the latter. What I believe you meant is that using evaluation instead of production makes the method applicable to far more problems, but I might be wrong.
I really liked the 5th one! I didn’t know about it, and I like the intuition behind it.
For 6, my intuition is that useful/interesting/valuable STEM actually requires a human perspective. So I conjecture that a STEM AI either is good at STEM and thus models humans (defeating the purpose), or doesn’t model humans but sucks at STEM.
Finally, for 8, do you have examples of when this behaves differently from 3? It seems to me that in the limit, imitation will produce the same behavior as extracting the reward function and maximizing it. Maybe something about generalization changes?
Glad you enjoyed the post!

So, do we have a guarantee or not? The first sentence says we do, while the second says the model could end up stronger than the targets it imitates.
The first sentence says that you have a guarantee that the overseer is at least as strong as the target, while the second sentence notes that the model might be stronger (or weaker) than the target. So we know overseer ≥ target, but we don’t know target ≥ model, so we can’t conclude overseer ≥ model.
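Spelled out as inequalities (a sketch; $s(\cdot)$ is an informal strength ordering I’m introducing here, not notation from the post):

```latex
% s(.) is an informal "strength" ordering -- illustrative notation, not the post's
\begin{align*}
  s(\text{overseer}) &\ge s(\text{target})
    && \text{(guaranteed: oversight is redone whenever the target changes)} \\
  s(M) &\lessgtr s(\text{target})
    && \text{(unknown: $M$ may compress what the targets are doing)}
\end{align*}
```

So nothing in the two statements forces $s(\text{overseer}) \ge s(M)$.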
About 3, isn’t there a risk that M_{n+1} behaves such that it simplifies or removes the checks done by Amp(M_{n+1})? One way to deal with that would be to have humans do the adversarial attacks, but that would probably hurt training competitiveness.
There’s still a human in the loop since Amp(M) is just H consulting M—and you should still be using a target model to do the oversight. But the real thing you’re relying on here to prevent M from causing the oversight to fail in the future is myopia verification, as a myopic M should never pursue that strategy.
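As a rough sketch of what “H consulting M” looks like structurally (generic amplification pseudocode in Python; `decompose`, `human_answer`, and the single level of consultation are illustrative assumptions, not the exact setup from the post):

```python
from typing import Callable, List

# Illustrative types: questions and answers are plain strings here.
Question = str
Answer = str


def amplify(human_answer: Callable[[Question, List[Answer]], Answer],
            model: Callable[[Question], Answer],
            decompose: Callable[[Question], List[Question]]) -> Callable[[Question], Answer]:
    """Return Amp(M): the human answers the top-level question while
    consulting the model M on subquestions (hypothetical sketch)."""
    def amplified(question: Question) -> Answer:
        subquestions = decompose(question)              # human-chosen subquestions
        sub_answers = [model(q) for q in subquestions]  # M is consulted on each one
        return human_answer(question, sub_answers)      # human combines the results
    return amplified


# Example wiring with trivial stand-ins:
if __name__ == "__main__":
    amp = amplify(
        human_answer=lambda q, subs: f"H's answer to {q!r} given {subs}",
        model=lambda q: f"M's answer to {q!r}",
        decompose=lambda q: [f"sub-1 of {q}", f"sub-2 of {q}"],
    )
    print(amp("How should we do X?"))
```

The structural point relevant to the worry is that the human sits at the top of this call graph: M can at most feed the human bad subanswers, and myopia verification is what is supposed to rule out M strategically degrading the oversight over time.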
I think I get the intuition, but evaluation is a far less rich signal than production of behavior: you get a score or a binary yes/no for the former, and the full behavior for the latter. What I believe you meant is that using evaluation instead of production makes the method applicable to far more problems, but I might be wrong.
I think there are lots of cases where evaluation is richer than imitation—compare RL to behavior cloning, for example.
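A minimal illustration of the two kinds of training signal (purely schematic Python; the toy environment, the demonstrations, and the `approval` function are made up for the example):

```python
import random

# Toy setup: states are integers 0..9, actions are integers 0..3.
demos = {0: 2, 1: 3, 2: 1}   # demonstrated state -> action (only 3 states covered)


def approval(state: int, action: int) -> float:
    """Evaluator's scalar judgment of any (state, action) pair -- defined everywhere."""
    return 1.0 if action == (state + 2) % 4 else 0.0  # arbitrary stand-in for approval


def policy(state: int) -> int:
    return random.randrange(4)   # stand-in for the model being trained


# Imitation signal: a full target action per example, but only where a demo exists.
imitation_signal = [(s, a_demo, policy(s)) for s, a_demo in demos.items()]

# Evaluation signal: only a scalar per example, but available on any behavior the
# model itself produces, including states no demonstrator ever visited.
evaluation_signal = []
for s in range(10):
    a = policy(s)
    evaluation_signal.append((s, a, approval(s, a)))

print(imitation_signal)
print(evaluation_signal)
```

One way to read the disagreement: per example, the imitation target carries more bits than a scalar, but the evaluation signal can grade behavior the demonstrator never produced, which is a large part of why RL can outperform behavior cloning.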
Finally, for 8, do you have examples of when this behaves differently from 3? It seems to me that in the limit, imitation will produce the same behavior as extracting the reward function and maximizing it. Maybe something about generalization changes?
They can certainly behave differently outside of the limit. But even in the limit, when you do imitation you try to mimic both what the human values and how the human pursues those values, whereas when you do reward learning followed by reward maximization you try to mimic the values but not the strategy the human uses to pursue them. Thus, a model trained to maximize a learned reward might, even in the limit, take actions to maximize that reward that the original human never would, perhaps because the human would never have thought of such actions.
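A toy numeric version of that last point (everything here, the action set, the human policy, and the learned reward values, is invented purely for illustration):

```python
# Hypothetical actions available in some state.
actions = ["take_safe_route", "cut_through_private_property"]

# What the human actually does: the imitation limit just reproduces this.
human_policy = {"take_safe_route": 1.0, "cut_through_private_property": 0.0}

# A learned reward that captures *what* the human values (getting there fast)
# but not *how* the human pursues it.
learned_reward = {"take_safe_route": 0.7, "cut_through_private_property": 0.9}

imitation_choice = max(human_policy, key=human_policy.get)
reward_max_choice = max(learned_reward, key=learned_reward.get)

print(imitation_choice)    # take_safe_route
print(reward_max_choice)   # cut_through_private_property -- an action the human never takes
```

The reward maximizer ends up selecting an action outside the support of the human’s behavior, which is exactly where the two approaches can come apart even with perfect learning.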
Thanks for the answers.

About the guarantees, now that you point it out, the two sentences indeed have different subjects.
About 3, it makes sense that myopia is the most important part.
For evaluation vs imitation, I think we might mean two different things by “richer”. I mean that the content of the signal itself has more information and more structure, whereas I believe you mean that it applies to more situations and is more general. Is that a good description of your intuition, or am I wrong here?
For the difference between reward learning + maximization and imitation, you’re right: I forgot that most people and systems are not necessarily optimal for their observable reward function. And even if they are, I guess the way the reward generalizes to a new environment might differ from the way the imitation generalizes.