I’m trying to understand this example. The way I would think of a software-writing AI would be the following: after some pretraining, we fine-tune an AI on prompts explaining the business task, with the output being the software and the objective related to various outcome measures.
Then we deploy it. It is not clear that we want to keep fine-tuning after deployment. Doing so clearly raises issues of overfitting and could lead to problems such as the “blah blah blah…” example mentioned in the post. (E.g., if you’re writing the testing code for your future code, you might want to “take the hit” and write bad tests that would be easy to pass.) Also, as we mention, the more compute and data invested during training, the less “on the job training” we expect there to be. The AI would be like a consultant with thousands of years of software-writing experience who comes in to do a particular project.
The way I would think of a software-writing AI would be the following: after some pretraining, we fine-tune an AI on prompts explaining the business task, with the output being the software and the objective related to various outcome measures.
That’s roughly what I’m imagining. Initially you might fine-tune such a system to copy the kind of code a human would write, and then over time you could shift towards writing code that it anticipates will result in good outcome measures (whether by RL, or by explicit search/planning, or by decision-transformer-style prediction of actions given consequences).
A model trained in this way will systematically produce actions that lead to highly-rewarded outcomes. And so it will learn to manipulate the sensors used to compute reward (and indeed a sophisticated model will likely be able to generalize to manipulating sensors without seeing any examples where such manipulation actually results in a higher reward).
If that happens, and if your model starts generating behavior that manipulates those sensors, then you would need to do something to fix the problem. I think it may be tempting to assign the offending behaviors a negative reward and then train on them.
Then we deploy it. It is not clear that we want to keep fine-tuning after deployment.
I’m imagining that the deployed system continues to behave the same way it did on the training distribution, so that it makes sense to continue thinking of it as choosing actions that would lead to high reward (if the episode had occurred at training time).
If the system behaves very differently when deployed, then I expect that results in worse behavior and so would probably give us reason to keep fine-tuning on the new distribution.
Let me try to make things more concrete. We are a company deploying a service, and our ultimate goal might be to maximize our profit a decade from now (or maybe more accurately, to maximize people’s perception of our future profit, which corresponds to our current stock price...).
My take is that while the leaders of the company might chart a strategy towards this far-off goal, they would set concrete goals for the software developers which correspond to very clear metrics. That is, the process of implementing a new feature for the service would involve the following steps:
Proposing the feature, and claiming which metric it would improve (e.g., latency on the website, click-through rate for ads, satisfaction with the service, increasing users, etc.). Crucially, these metrics are simple and human-interpretable, since the assumption is that in a chaotic world we cannot have “3D chess” types of strategies; rather, each feature should make clear progress on some measure.
Writing code for the feature.
Reviewing and testing the code.
Deploying it (possibly with A/B testing).
Evaluating the deployment.
AIs might be involved in all of these steps, but it would not be one coherent AI that does everything and whose goal is to eventually make the managers happy. Just as today we have different people doing these roles, so would different AIs be doing each of these roles, and importantly, each one of them would have its own objective function that it is trying to maximize.
So, each one of these components would be trained separately, and in some sense adversarially (e.g., the testing AI would be trained to maximize bugs found, while the code-writing AI would be trained to minimize them). Moreover, each one of them would be trained on its own giant corpus of data. If they are jointly trained (like in GANs) then indeed care must be taken that they are not collapsing into an undesirable equilibrium, but this is something that is well understood.
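To make the adversarial pairing concrete, here is a toy sketch of the two opposed short-horizon objectives; the function names and the unit reward weights are hypothetical, not a proposal for an actual training setup:

```python
# Purely illustrative: two separately trained systems with opposed, short-horizon objectives.

def writer_reward(metric_improvement: float, bugs_found: int) -> float:
    # The code-writing AI is rewarded for improving the agreed-upon metric
    # (e.g., a latency reduction) and penalized for every bug the tester finds.
    return metric_improvement - 1.0 * bugs_found

def tester_reward(bugs_found: int) -> float:
    # The testing AI is rewarded for every confirmed bug it surfaces.
    return 1.0 * bugs_found

# If the two are trained jointly, the worry discussed below is an undesirable equilibrium:
# the writer only emits code whose bugs the tester fails to find, and the tester only looks
# for the bug classes the writer still produces.
```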
I agree that we will likely build lots of AI systems doing different things and checking each other’s work. I’m happy to imagine each such system optimizes short-term “local” measures of performance.
One reason we will split up tasks into small pieces is that it’s a natural way to get work done, just as it is amongst humans.
But another reason we will split it up is because we effectively don’t trust any of our employees even a little bit. Perhaps the person responsible for testing the code gets credit for identifying serious problems, and so they would lie if they could get away with it (note that if we notice a problem later and train on it, then we are directly introducing problematic longer-term goals).
So we need a more robust adversarial process. Some AI systems will be identifying flaws and trying to explain why they are serious, while other AI systems are trying to explain why those tests were actually misleading. And then we wonder: what are the dynamics of that kind of game? How do they change as AI systems develop kinds of expertise that humans lack (even if it’s short-horizon expertise)?
To me it seems quite like the situation of humans who aren’t experts in software or logistics trying to oversee a bunch of senior software engineers who are building Amazon. And the software engineers care only about looking good this very day; they don’t care about whether their decisions look bad in retrospect. So they’ll make proposals, and they will argue about them, and propose various short-term tests to evaluate each other’s work, and various ways to do A/B tests in deployment...
Would that work? I think it depends on exactly how large the gap is between the AIs and the humans. I think that evidence from our society is not particularly reassuring in cases where the gap is large. I think that when we get good results it’s because we can build up trust in domain experts over long time periods, not because a layperson would have any chance at all of arbitrating a debate between two senior Amazon engineers.
I think all of that remains true even if you split up the job of the Amazon engineers, and even if all of their expertise comes from LM-style training primarily on short-term objectives (like building abstractions that let them reason about how code will work, when servers fail, etc.).
I’m excited about us building this kind of minimal-trust machine and getting experience with how well it works. And I’m fairly optimistic (though far from certain!) about it scaling beyond human level. And I agree that it’s made easier by the fact that AI systems will mostly be good at short-horizon tasks while humans can remain competitive longer for big-picture questions. But I think it’s really unclear exactly when and how far it works, and we need to do research to both predict and improve such mechanisms. (Though I’m very open to that research mostly looking very boring and not being directly motivated by AI risk.)
Overall my reaction may depend on what you’re claiming. If you are saying “75% chance this isn’t a problem, if we build AI in the current paradigm” then I’m on board; if you are saying 90% then I disagree but think that’s plausible and it may depend exactly what you mean by “isn’t a problem”; if you are saying 99% then I think that’s hard to defend.
Moreover, each one of them would be trained on its own giant corpus of data.
It seems like each of them will be trained to do its job, in a world where other jobs are being done by other AI. I don’t think it’s realistic to imagine training them separately and then just hoping they work well together as a team.
If they are jointly trained (like in GANs) then indeed care must be taken that they are not collapsing into an undesirable equilibrium, but this is something that is well understood.
I don’t agree that this is well understood. The dynamics of collapse are very different from those in GANs, and depend on exactly how task decomposition works, and on how well humans can evaluate the performance of one AI given adversarial interrogation and testing by another, and so on.
(Even in the case of GANs it is not that well understood—if the situation was just “if there is a mode collapse in this GAN then we die, but fortunately this is understood well enough that we’ll definitely be able to fix that problem when we see it happening” then I don’t think you should rest that easy, and I’d still be interested to do a lot of research on mode collapse in GANs.)
Thanks! Some quick comments (though I think at some point we are getting so deep into these threads that it’s hard to keep track...)
When saying that GAN training issues are “well understood” I meant that it is well understood that it is a problem, not that it’s well understood how to solve that problem…
One basic issue is that I don’t like to assign probabilities to such future events, and am not sure there is a meaningful way to distinguish between 75% and 90%. See my blog post on longtermism.
The general thesis is that when making long-term strategies, we will care about improving concrete metrics rather than thinking of very complex strategies that don’t make any measurable gains in the short term. So an Amazon engineer would need to say something like “if we implement my code X then it would reduce latency by Y”, which would be a fairly concrete and measurable goal, and something that humans could understand even if they couldn’t understand the code X itself or how the AI came up with it. This differs from saying something like “if we implement my code X, then our competitors would respond with X’, then we could respond with X″, and so on and so forth until we dominate the market”.
When thinking of AI systems and their incentives, we should separate training, fine-tuning, and deployment. Human engineers might get bonuses for their performance on the job, which corresponds to mixing “fine-tuning” and “deployment”. I am not at all sure that would be a good idea for AI systems. It could lead to all kinds of over-optimization issues that would be clear to people without leading to doom. So we might want to separate the two and, in some sense, keep the AI disinterested in the code that it actually uses in deployment.
When saying that GAN training issues are “well understood” I meant that it is well understood that it is a problem, not that it’s well understood how to solve that problem...
I would like to see evidence that BigGAN scaling doesn’t solve it, and that Brock’s explanation of mode-dropping as reflecting lack of diversity inside minibatches is fundamentally wrong, before I went around saying either “we understand it” (because few seem to ever bring up the points I just raised) or “it’s unsolved” (because I see no evidence from large-scale GAN work that it’s unsolved).
Can you send links? In any case, I do believe it is understood that you have to be careful in a setting where you have two models A and B, where B is a “supervisor” of the output of A, and you are trying to simultaneously teach B to come up with a good metric to judge A by, and teach A to come up with outputs that optimize B’s metric. There can be equilibria where A and B jointly diverge from what we would consider “good outputs”.
This, for example, comes up in trying to tackle “over-optimization” in InstructGPT (there was a great talk by John Schulman in our seminar series a couple of weeks ago), where model A is GPT-3 and model B tries to capture human scores for outputs. Initially, optimizing for model B induces optimizing for human scores as well, but if you let model A optimize too much, then it still optimizes for B but becomes negatively correlated with the human scores (i.e., it “over-optimizes”).

Another way to see this issue is that even powerful agents like AlphaZero are susceptible to simple adversarial strategies that can beat them: see “Adversarial Policies Beat Professional-Level Go AIs” and “Are AlphaZero-like Agents Robust to Adversarial Perturbations?”.
The bottom line is that I think we are very good at optimizing any explicit metric M, including when that metric is itself some learned model. But generally, if we learn some model A such that A(y)≈M(y) on typical inputs, this doesn’t mean that y* = argmax_y A(y) gives us an approximate maximizer of M as well. Maximizing A tends to push to the extreme parts of the input space, which are exactly those where A deviates from M.
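To illustrate, here is a minimal numerical sketch of this effect, with every detail invented purely for illustration: fit a proxy A to noisy samples of the true metric M drawn from a narrow “training” region, then maximize A over a much wider space.

```python
import numpy as np

rng = np.random.default_rng(0)

# True metric M (unknown to the optimizer): peaks at y = pi/2 ~ 1.57.
def M(y):
    return np.sin(y)

# "Training" samples come from a narrow region (the outputs the current system produces),
# where M happens to look like "more is better".
y_train = rng.uniform(0.0, 1.0, size=200)
scores = M(y_train) + rng.normal(scale=0.05, size=y_train.shape)

# Proxy A: a linear fit, which approximates M well on the training region.
slope, intercept = np.polyfit(y_train, scores, deg=1)
A = lambda y: slope * y + intercept

# Now maximize A over a much wider range of possible outputs.
y_grid = np.linspace(-10.0, 10.0, 4001)
y_star = y_grid[np.argmax(A(y_grid))]

print(f"proxy maximizer y* = {y_star:.2f}")          # ~10.0: pushed to the extreme
print(f"true metric M(y*) = {M(y_star):.2f}")        # ~-0.54, far below...
print(f"true optimum M(pi/2) = {M(np.pi / 2):.2f}")  # ~1.00
# The proxy is accurate where it was fit, but its maximizer lies exactly where A and M diverge.
```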
The above is not an argument against the ability to construct AGI as well, but rather an argument for establishing concrete measurable goals that our different agents try to optimize, rather than trying to learn some long-term equilibrium. So for example, in the software-writing and software-testing case, I think we don’t simply want to deploy two agents A and B playing a zero-sum game where B’s reward is the number of bugs found in A’s code.
http://arxiv.org/abs/1809.11096.pdf#subsection.4.1 http://arxiv.org/abs/1809.11096.pdf#subsection.4.2 http://arxiv.org/abs/1809.11096.pdf#subsection.5.2 https://www.gwern.net/Faces#discriminator-ranking https://www.gwern.net/GANs

This, for example, comes up in trying to tackle “over-optimization” in InstructGPT (there was a great talk by John Schulman in our seminar series a couple of weeks ago), where model A is GPT-3 and model B tries to capture human scores for outputs. Initially, optimizing for model B induces optimizing for human scores as well, but if you let model A optimize too much, then it still optimizes for B but becomes negatively correlated with the human scores (i.e., it “over-optimizes”).
Sure. And the GPT-2 adversarial examples and overfitting were much worse than the GPT-3 ones.
see “Adversarial Policies Beat Professional-Level Go AIs”
The meaning of that one is in serious doubt so I would not link it.
(The other one is better and I had not seen it before, but my first question is: doesn’t adding those extra stones create board states that the agent would never reach following its policy, or even literally impossible board states, because those stones could not have been played while still yielding the same captured-stone count, board positions, etc.? The approach in 3.1 seems circular.)
Will read the links later—thanks! I confess I didn’t read the papers (though I saw a talk partially based on the first one, which didn’t go into enough detail for me to know the issues), but I have also heard from people I trust of similar issues with chess RL engines (they can be defeated with simple strategies if you are looking for adversarial ones). Generally it seems fair to say that adversarial robustness is significantly more challenging than the non-adversarial case and it does not simply go away on its own with scale (though some types of attacks are automatically mitigated by diversity of training data / scenarios).
Generally it seems fair to say that adversarial robustness is significantly more challenging than the non-adversarial case and it does not simply go away on its own with scale
I don’t think we know that. (How big is KataGo anyway, 0.01b parameters or so?) We don’t have much scaling research on adversarial robustness; what we do have suggests that adversarial robustness does increase; the isoperimetry theory claims that scaling much larger than we currently do will be sufficient (and may be necessary); and the fact that a staggeringly large adversarial-defense literature has yet to yield any defense that holds up longer than a year or two before an attack cracks it & gets added to CleverHans suggests that the goal of adversarial defenses for small NNs may be inherently impossible (and there is a certain academic smell to adversarial research which it shares with other areas that either have been best solved by scaling or, like continual learning, look increasingly like they are going to be soon).
I don’t think it’s fair to compare parameter counts between language models and models for other domains, such as games or vision. E.g., I believe AlphaZero is also only in the range of hundreds of millions of parameters? (A quick Google didn’t give me the answer.)
I think there is a real difference between adversarial and natural distribution shifts, and without adversarial training, even large networks struggle with adversarial shifts. So I don’t think this is a problem that would go away with scale alone. At least I don’t see evidence for it in current data (failure of defenses for small models is no evidence of the success of size alone for larger ones).
One way to see this is to look at the figures in this plotting playground of “accuracy on the line”. This is the figure for natural distribution shift—the green models are the ones that are trained with more data, and they do seem to be “above the curve” (significantly so for CLIP, which corresponds to the two green dots reaching ~53 and ~55 natural-distribution-shift accuracy compared to their ~60 and ~63 vanilla accuracy).
In contrast, if you look at adversarial perturbations, then you can see that actual adversarial training (bright orange) or other robustness interventions (brown) are much more effective than more data (green), which in fact mostly underperforms.
(I know you focused on “more model” but I think to first approximation “more model” and “more data” should have similar effects.)
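For concreteness, here is a minimal sketch of the distinction: a one-step FGSM-style adversarial perturbation versus a crude stand-in for a natural shift. It assumes some PyTorch image classifier `model` and a batch `x`, `y` with pixel values in [0, 1]; all of those names are hypothetical, and real robustness benchmarks typically use stronger multi-step attacks.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=8 / 255):
    """One-step FGSM: move each pixel by +/- eps in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

def natural_shift(x, severity=0.1):
    """A crude stand-in for a 'natural' shift: random noise of comparable scale, chosen without reference to the model."""
    return (x + severity * torch.randn_like(x)).clamp(0.0, 1.0)

# The pattern in the plots above: accuracy under shifts like natural_shift tracks clean accuracy
# fairly well (and improves with more data/scale), while accuracy under worst-case perturbations
# like fgsm_perturb stays low unless the model was explicitly adversarially trained.
```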
I suppose you’re talking about this paper (https://arxiv.org/abs/2210.10760). It’s important to note that in the setting of this paper, the reward model is only trained on samples from the original policy, whereas GAN discriminators are constantly trained with new data. Section 4.3 touches briefly on the iterated problem, which is closer in setting to GANs, and where we correspondingly expect a reduction in overoptimization (i.e., the beta term).
It is definitely true that you have to be careful whenever you’re optimizing any proxy metric, and this is one big reason I feel kind of uncomfortable about proposals like RLHF/RRM. In fact, our setting probably underestimates the amount of overoptimization, due to the synthetic setup. However, it does seem like GAN mode collapse is largely unrelated to this overoptimization effect, and gwern’s claim seems to be mostly about mode collapse.
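To illustrate the distinction (every detail here is invented for the sketch): a linear proxy reward model fit once on samples from the initial policy keeps getting optimized against even after the policy has moved, while periodically refitting it on fresh samples, GAN-discriminator-style, keeps it anchored near the gold optimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def gold_reward(y):
    # The "gold" reward, peaked at y = 2 (unknown to the optimizer).
    return np.exp(-((y - 2.0) ** 2))

def fit_proxy(mu, n=500, noise=0.02):
    # Fit a linear proxy reward model on noisy gold-reward labels for samples
    # drawn from the *current* policy, N(mu, 0.5); return its slope.
    ys = rng.normal(mu, 0.5, size=n)
    labels = gold_reward(ys) + rng.normal(scale=noise, size=n)
    return np.polyfit(ys, labels, deg=1)[0]

def optimize(refit_every=None, steps=200, lr=0.2):
    mu = 0.0
    slope = fit_proxy(mu)
    for t in range(steps):
        if refit_every is not None and t % refit_every == 0:
            slope = fit_proxy(mu)  # discriminator-style: retrain on fresh samples
        mu += lr * slope           # the policy follows the proxy's gradient
    return mu

mu_static = optimize(refit_every=None)  # reward model trained once, on the original policy only
mu_refit = optimize(refit_every=10)     # reward model periodically refit on new samples

print(f"static proxy: mu = {mu_static:5.2f}, gold reward = {gold_reward(mu_static):.3f}")
print(f"refit proxy:  mu = {mu_refit:5.2f}, gold reward = {gold_reward(mu_refit):.3f}")
# Typical outcome: the static proxy keeps pushing mu far past the gold optimum (overoptimization),
# while the refit proxy stalls near y = 2, where its fitted slope vanishes.
```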