Can you send links? In any case I do believe that it is understood that you have to be careful in a setting where you have two models A and B, where B is a “supervisor” of the output of A, and you are trying to simultaneously teach B to come up with a good metric to judge A by, and teach A to come up with outputs that optimize B’s metric. There can be equilibria where A and B jointly diverge from what we would consider “good outputs”.
This comes up, for example, in trying to tackle “over-optimization” in InstructGPT (there was a great talk by John Schulman in our seminar series a couple of weeks ago), where model A is GPT-3 and model B tries to capture human scores for outputs. Initially, optimizing for model B induces optimizing for human scores as well, but if you let model A optimize too much, then it still optimizes for B yet becomes negatively correlated with the human scores (i.e., it “over-optimizes”).
Another way to see this issue is that even powerful agents like AlphaZero are susceptible to simple adversarial strategies that can beat them: see “Adversarial Policies Beat Professional-Level Go AIs” and “Are AlphaZero-like Agents Robust to Adversarial Perturbations?”.
The bottom line is that I think we are very good at optimizing any explicit metric M, including when that metric is itself some learned model. But generally, if we learn some model A such that A(y) ≈ M(y) on typical inputs, this doesn’t mean that taking y* = argmax_y A(y) gives us an approximate maximizer of M as well. Maximizing A tends to push toward the extreme parts of the input space, which are exactly those where A deviates from M.
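Here is a minimal toy sketch of that argmax failure (my own illustration, not anything from the papers or talk mentioned here): a proxy A fit to M on a narrow training range, whose maximizer over a wider range is a bad point under M.

```python
# Toy illustration (hypothetical numbers): a learned proxy A matches the true
# metric M well on the training range, yet the maximizer of A lands exactly
# where A and M disagree.
import numpy as np

rng = np.random.default_rng(0)

def M(y):                              # "true" metric: peaks at y = 1
    return -(y - 1.0) ** 2

# Fit the proxy A only on inputs from the training distribution (y in [0, 2]).
y_train = rng.uniform(0.0, 2.0, 20)
noisy_scores = M(y_train) + 0.1 * rng.normal(size=20)
A = np.poly1d(np.polyfit(y_train, noisy_scores, deg=9))   # A(y) ≈ M(y) on [0, 2]

# Maximize A over a much wider range, as an unconstrained optimizer would.
y_grid = np.linspace(-10.0, 10.0, 10001)
y_star = y_grid[np.argmax(A(y_grid))]

print(f"argmax of A: y* = {y_star:.2f}")
print(f"M(y*) = {M(y_star):.1f}   vs   max_y M(y) = {M(1.0):.1f}")
# The overfit polynomial extrapolates wildly outside [0, 2], so y* sits at an
# extreme point where A is huge but M is terrible.
```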
The above is not an argument against the ability to construct AGI either, but rather an argument for establishing concrete measurable goals that our different agents try to optimize, rather than trying to learn some long-term equilibrium. So, for example, in the software-writing and software-testing case, I don’t think we simply want to deploy two agents A and B playing a zero-sum game where B’s reward is the number of bugs found in A’s code.
http://arxiv.org/abs/1809.11096.pdf#subsection.4.1 http://arxiv.org/abs/1809.11096.pdf#subsection.4.2 http://arxiv.org/abs/1809.11096.pdf#subsection.5.2 https://www.gwern.net/Faces#discriminator-ranking https://www.gwern.net/GANs
Sure. And the GPT-2 adversarial examples and overfitting were much worse than the GPT-3 ones.
The meaning of the “Adversarial Policies” one is in serious doubt so I would not link it.
(The other one is better, and I had not seen it before, but my first question is: doesn’t adding those extra stones create board states that the agent would never reach following its policy, or even literally impossible board states, because those stones could not have been played while still yielding the same captured-stone count, board positions, etc.? The approach in section 3.1 seems circular.)
Will read the links later—thanks! I confess I didn’t read the papers (though I saw a talk partially based on the first one, which didn’t go into enough detail for me to know the issues), but I have also heard from people I trust of similar issues with chess RL engines (they can be defeated with simple strategies if you go looking for adversarial ones). Generally it seems fair to say that adversarial robustness is significantly more challenging than the non-adversarial case and does not simply go away on its own with scale (though some types of attacks are automatically mitigated by diversity of training data / scenarios).
I don’t think we know that. (How big is KataGo anyway, 0.01b parameters or so?) We don’t have much scaling research on adversarial robustness; what we do have suggests that adversarial robustness does increase with scale, the isoperimetry theory claims that scaling much larger than we currently do will be sufficient (and may be necessary), and the fact that a staggeringly large adversarial-defense literature has yet to yield any defense that holds up longer than a year or two before an attack cracks it & gets added to CleverHans suggests that the goal of adversarial defenses for small NNs may be inherently impossible (and there is a certain academic smell to adversarial research which it shares with other areas that either have been best solved by scaling, or, like continual learning, look increasingly like they are going to be soon).
I don’t think it’s fair to compare parameter sizes between language models and models for other domains, such as games or vision. E.g., I believe AlphaZero is also only in the range of hundreds of millions of parameters? (A quick Google didn’t give me the answer.)
I think there is a real difference between adversarial and natural distribution shifts, and without adversarial training, even large networks struggle with adversarial shifts. So I don’t think this is a problem that would go away with scale alone. At least I don’t see evidence for it in current data (failure of defenses for small models is not evidence that size alone will succeed for larger ones).
One way to see this is to look at the figures in this plotting playground of “accuracy on the line”. This is the figure for natural distribution shift—the green models are the ones trained with more data, and they do seem to be “above the curve” (significantly so for CLIP, which corresponds to the two green dots reaching ~53 and ~55 natural-distribution accuracy compared to ~60 and ~63 vanilla accuracy).
In contrast, if you look at adversarial perturbations, then you can see that actual adversarial training (bright orange) or other robustness interventions (brown) is much more effective than more data (green), which in fact mostly underperforms.
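For concreteness, here is a minimal sketch of what an adversarial-training step looks like (a generic PGD recipe in PyTorch, my own illustration rather than the specific training setup behind those plotted models):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Search for an L-inf perturbation of x (within eps) that maximizes the loss."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()   # gradient ascent step
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                      # keep valid pixel range
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One training step on adversarial examples instead of clean ones."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```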
(I know you focused on “more model” but I think to first approximation “more model” and “more data” should have similar effects.)
I suppose you’re talking about this paper (https://arxiv.org/abs/2210.10760). It’s important to note that in the setting of this paper, the reward model is only trained on samples from the original policy, whereas GAN discriminators are constantly trained with new data. Section 4.3 touches briefly on the iterated setting, which is closer to GANs, where we correspondingly expect a reduction in overoptimization (i.e., the β term).
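For reference, a quick sketch of the functional forms I believe that paper fits for gold reward as a function of distance from the initial policy (the α/β values below are made up for illustration; β is the term governing how quickly the gold reward turns over, i.e. overoptimization):

```python
import numpy as np

# Gold reward vs. d = sqrt(KL(policy || initial policy)), as I recall the forms
# from the paper; alpha/beta values here are illustrative, not from the paper.
def gold_reward_bon(d, alpha=1.0, beta=0.1):
    """Best-of-n form: R(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d, alpha=1.0, beta=0.5):
    """RL form: R(d) = d * (alpha - beta * log(d))."""
    return d * (alpha - beta * np.log(d))

d = np.linspace(0.1, 30.0, 300)
for beta in (0.5, 0.25):          # smaller beta = less overoptimization
    r = gold_reward_rl(d, beta=beta)
    print(f"beta={beta}: gold reward peaks at d ≈ {d[np.argmax(r)]:.1f}")
# Retraining the reward model on fresh samples (the iterated setting of
# section 4.3) should show up as a smaller beta, pushing the peak further out.
```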
It is definitely true that you have to be careful whenever you’re optimizing any proxy metric, and this is one big reason I feel kind of uncomfortable about proposals like RLHF/RRM. In fact, our setting probably underestimates the amount of overoptimization due to the synthetic setup. However, it does seem like GAN mode collapse is largely unrelated to this overoptimization effect, and gwern’s claim seems to be mostly about mode collapse.