I agree this is an exciting idea, but I don’t think it clearly “just works”, and since you asked for ways it could fail, here are some quick thoughts:
If I understand correctly, we’d need a model that we’re confident is a mesa-optimizer (and perhaps even deceptive—mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are “thresholds” where slight changes have big effects on how dangerous a model is.
If there’s a very strong inductive bias towards deception, you might have to sample an astronomical number of initializations to get a non-deceptive model. Maybe you can solve the computational problem, but it seems harder to avoid the problem that you need to optimize/select against your deception-detector. The stronger the inductive bias for deception, the more robustly the method needs to distinguish basins.
Related to the previous point, it seems plausible to me that whether you get a mesa-optimizer or not has very little to do with the initialization. It might depend almost entirely on other aspects of the training setup.
It seems unclear whether we can find fingerprinting methods that can distinguish deception from non-deception, or mesa-optimization from non-mesa-optimization, but which don’t also distinguish a ton of other things. The paragraph about how there are hopefully not that many basins makes an argument for why we might expect this to be possible, but I still think this is a big source of risk/uncertainty. For example, the fingerprinting that’s actually done in this post distinguishes different base models based on plausibly meaningless differences in initialization, as opposed to deep mechanistic differences. So our fingerprinting technique would need to be much less sensitive, I think?
ETA: I do want to highlight that this is still one of the most promising ideas I’ve heard recently and I really look forward to hopefully reading a full post on it!
These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.
Here are some notes I made some time ago about JD's idea. There's some overlap with the things you listed.
Hypotheses / cruxes
(1) Policies trained on the same data can fall into different generalization basins depending on the initialization. https://arxiv.org/abs/2205.12411
Probably true; Alstro has found “two solutions w/o linear connectivity in a 150k param CIFAR-10 classifier” with different validation loss (a minimal interpolation-loss check is sketched after (1A) below)
Note: This is self-supervised learning with the exact same data. I think it’s even more evident that you’ll get different generalization strategies in RL runs with the same reward model, because even the training samples are not deterministic.
(1A) These generalization strategies correspond to differences we care about, like, in the limit, deceptive vs. honest policies
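Here is a minimal sketch of how one might check (1) empirically, assuming two copies of the same architecture trained from different seeds on identical data; `eval_loss` is a placeholder for whatever validation-loss function your setup uses, and none of these names come from the post itself:

```python
# Sketch, not an established method: estimate the loss barrier along the straight
# line in weight space between two independently trained models. A large barrier
# suggests the runs landed in different basins; near-zero suggests linear mode
# connectivity. Models with batch-norm buffers would need extra care.
import copy
import torch

def interpolate_state(sd_a, sd_b, alpha):
    # Straight-line interpolation in weight space: (1 - alpha) * A + alpha * B.
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

@torch.no_grad()
def loss_barrier(model_a, model_b, eval_loss, steps=11):
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        probe.load_state_dict(interpolate_state(sd_a, sd_b, alpha))
        losses.append(eval_loss(probe))  # caller-supplied validation loss
    # Barrier = how much worse the path gets relative to the worse endpoint.
    return max(losses) - max(losses[0], losses[-1]), losses
```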
(2) Generalization basins are stable across scale (and architectures?)
If so, we can scope out the basins of smaller models and then detect/choose basins in larger models
We should definitely see if this is true for current scales. AFAIK basin analysis has only been done on models that are very small compared to SOTA
If we find that basins are stable across existing scales that’s very good news. However, we should remain paranoid, because there could still be phase shifts at larger scales. The hypothetical mesaoptimizers you describe are much more sophisticated and situationally aware than current models, e.g. “Every intelligent policy has an incentive to lie about sharing your values if it wants out of the box.” Mesaoptimizers inside GPT-3 probably are not explicitly reasoning about being in a box at all, except maybe on the ephemeral simulacra level.
But that is no reason not to attempt any of this.
And I think stable basins at existing scales is pretty strong evidence that basins will remain stable, because GPT-3 already seems qualitatively very different from very small models, and I’d expect there to be basin discontinuities there if discontinuities are going to be an issue at all.
There are mathematical reasons to think basins may merge as models scale
Are there possibly too many basins? Are they fractal?
(3) We can figure out what basin a model is in fairly early on in training using automated methods
Apply Git Re-Basin, then measure interpolation loss on a validation set
Fingerprint generalization strategies on out-of-distribution “noise” (a rough sketch follows this list)
Train a model to do this
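A rough, assumption-laden sketch of the “fingerprint on OOD noise” idea: probe two models with the same fixed batch of random noise and compare their output distributions. The noise shape, the softmax readout, and the similarity threshold below are all placeholder choices, not anything from the post:

```python
# Illustrative only: compare two models' responses to identical random-noise inputs
# as a crude signature of which generalization basin they are in.
import torch
import torch.nn.functional as F

@torch.no_grad()
def noise_fingerprint(model, n=256, shape=(3, 32, 32), seed=0):
    # Fixed seed so the exact same noise batch is reused across models/checkpoints.
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, *shape, generator=g)
    return F.softmax(model(x), dim=-1)  # (n, num_classes) response profile

def same_basin(model_a, model_b, threshold=0.9):
    fa = noise_fingerprint(model_a)
    fb = noise_fingerprint(model_b)
    # Mean cosine similarity of the two response profiles; the 0.9 cutoff is arbitrary.
    sim = F.cosine_similarity(fa, fb, dim=-1).mean().item()
    return sim >= threshold, sim
```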
(4) We can influence training to choose what basin a model ends up in
Ridge rider https://arxiv.org/abs/2011.06505
Problem: computationally expensive?
Use one of the above methods to determine which basin a model is in, and abort training runs that are in the wrong basin (sketched below, after this list)
Problem: Without a method like ridge rider to enforce basin diversity you might get the same basins many times before getting new ones, and this could be expensive at scale?
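For the “abort runs in the wrong basin” variant of (4), a sketch of the training loop, reusing the hypothetical `noise_fingerprint` helper above; `train_step` and `reference_fp` (a fingerprint from a model already known to be in the desired basin) are assumed inputs:

```python
# Sketch: periodically fingerprint the model during training and abort early if it
# drifts away from a reference fingerprint taken from the desired basin.
import torch.nn.functional as F

def train_with_basin_check(model, train_step, reference_fp, num_steps,
                           check_every=1000, threshold=0.9):
    for step in range(1, num_steps + 1):
        train_step(model, step)  # one ordinary optimizer update, supplied by the caller
        if step % check_every == 0:
            fp = noise_fingerprint(model)
            sim = F.cosine_similarity(fp, reference_fp, dim=-1).mean().item()
            if sim < threshold:
                # Abort and restart from a new seed rather than paying for a full
                # run that has landed in an unwanted basin.
                return False, step
    return True, num_steps
```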
One more, related to your first point: I wouldn’t expect all mesaoptimizers to have the same signature, since they could take very different forms. What does the distribution of mesaoptimizer signatures look like? How likely is it that a novel (undetectable) mesaoptimizer arises in training?