These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.
Here are some notes I made some time ago about JD's idea. There's some overlap with the things you listed.
Hypotheses / cruxes
(1) Policies trained on the same data can fall into different generalization basins depending on the initialization. https://arxiv.org/abs/2205.12411
Probably true; Alstro has found "two solutions w/o linear connectivity in a 150k param CIFAR-10 classifier" with different validation loss (a minimal version of this test is sketched below)
Note: this is supervised learning with the exact same data. I think it's even more evident that you'll get different generalization strategies in RL runs with the same reward model, because in RL even the training samples are not deterministic.
(1A) These generalization strategies correspond to differences we care about, e.g. (in the limit) deceptive vs. honest policies
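A minimal sketch of how (1) could be tested directly, assuming two checkpoints `model_a` and `model_b` trained from different seeds on identical data (all names here are illustrative): evaluate the loss along the straight line between their weights. A peak well above both endpoints is the usual signature of two different basins, i.e. no linear mode connectivity.

```python
import copy
import torch

def interpolation_barrier(model_a, model_b, loss_fn, loader, steps=11):
    """Loss along the linear path between two checkpoints.

    A peak well above both endpoints suggests the two models sit in
    different basins. (For BatchNorm nets, running stats would need to
    be recomputed at each blend; this sketch assumes a plain MLP/CNN.)
    """
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        blended = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
        probe.load_state_dict(blended)
        probe.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in loader:
                total += loss_fn(probe(x), y).item() * len(x)
                n += len(x)
        losses.append(total / n)
    # Barrier height: peak loss above the worse of the two endpoints.
    return max(losses) - max(losses[0], losses[-1]), losses
```

If the barrier is near zero for every pair of seeds, (1) fails at that scale and most of the rest of the plan loses its object.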
(2) Generalization basins are stable across scale (and architectures?)
If so, we can scope out the basins of smaller models and then detect/choose basins in larger models (a cross-scale comparison is sketched after this list)
We should definitely check whether this is true at current scales. AFAIK basin analysis has only been done on models that are very small compared to SOTA
If we find that basins are stable across existing scales, that's very good news. However, we should remain paranoid, because there could still be phase shifts at larger scales. The hypothetical mesaoptimizers you describe are much more sophisticated and situationally aware than current models, e.g. "Every intelligent policy has an incentive to lie about sharing your values if it wants out of the box." Mesaoptimizers inside GPT-3 are probably not explicitly reasoning about being in a box at all, except maybe at the ephemeral simulacra level.
But that is no reason not to attempt any of this.
And I think stable basins at existing scales would be pretty strong evidence that basins will remain stable, because GPT-3 already seems qualitatively very different from very small models, and I'd expect basin discontinuities to show up there if discontinuities are going to be an issue at all.
There are mathematical reasons to think basins may merge as models scale
Are there possibly too many basins? Are they fractal?
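A sketch of what "scoping out basins across scale" could look like, under loud assumptions: `make_model(width, seed)` and `train` are hypothetical helpers, and the fingerprint is just the model's output distribution on a fixed batch of out-of-distribution noise. If basins are scale-stable, fingerprint distances should cluster by seed-basin rather than by width.

```python
import torch

def fingerprint(model, noise_batch):
    # Fingerprint a generalization strategy by the model's output
    # distribution on fixed out-of-distribution noise inputs.
    model.eval()
    with torch.no_grad():
        return torch.softmax(model(noise_batch), dim=-1).flatten()

# One fixed OOD probe, shared by every model being compared.
torch.manual_seed(0)
noise = torch.randn(256, 3, 32, 32)  # e.g. CIFAR-shaped noise

# Hypothetical helpers: make_model(width, seed) builds one architecture
# family at several scales; train() fits it and returns the model.
# Assumes all models share the same output space (e.g. 10 classes), so
# their fingerprints are directly comparable.
runs = {(w, s): train(make_model(width=w, seed=s))
        for w in (64, 256, 1024) for s in range(4)}
prints = {k: fingerprint(m, noise) for k, m in runs.items()}

# Scale-stable basins would show up as small distances within a basin
# across widths, and large distances between basins at the same width.
keys = sorted(prints)
for i, k1 in enumerate(keys):
    for k2 in keys[i + 1:]:
        print(k1, k2, torch.dist(prints[k1], prints[k2]).item())
```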
(3) We can figure out what basin a model is in fairly early on in training using automated methods
Git re-basin, then measure interpolation loss on a validation set (sketched after this list)
Fingerprint generalization strategies on out-of-distribution "noise" (as in the fingerprint sketch under (2))
Train a model to do this
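A sketch of the re-basin step for the simplest interesting case, a one-hidden-layer MLP with layers `fc1`/`fc2` (illustrative names): match hidden units by activation similarity on a shared batch, permute model B accordingly, then reuse `interpolation_barrier` from the sketch under (1). The actual Git Re-Basin matching operates on weights and handles deep nets; this is only the core idea.

```python
import torch
from scipy.optimize import linear_sum_assignment

def rebasin_mlp(model_a, model_b, batch):
    # Permute model_b's hidden units to best match model_a's, removing
    # the permutation symmetry that can fake a loss barrier.
    with torch.no_grad():
        h_a = torch.relu(model_a.fc1(batch))   # (n, hidden)
        h_b = torch.relu(model_b.fc1(batch))
        sim = h_a.T @ h_b                      # unit-by-unit similarity
        _, perm = linear_sum_assignment(-sim.numpy())  # maximize similarity
        perm = torch.as_tensor(perm)
        # Reorder fc1's output units and fc2's corresponding inputs.
        model_b.fc1.weight.copy_(model_b.fc1.weight[perm])
        model_b.fc1.bias.copy_(model_b.fc1.bias[perm])
        model_b.fc2.weight.copy_(model_b.fc2.weight[:, perm])
    return model_b

# A barrier that survives re-basin is evidence of genuinely different
# generalization strategies, not just permuted copies of one solution:
# barrier, _ = interpolation_barrier(model_a, rebasin_mlp(model_a, model_b, x),
#                                    loss_fn, val_loader)
```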
(4) We can influence training to choose what basin a model ends up in
Ridge rider https://arxiv.org/abs/2011.06505 (toy sketch below)
Problem: computationally expensive?
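For reference, a toy sketch of Ridge Rider's inner loop on a flat parameter vector, assuming a `loss_fn` that maps parameters to a scalar loss: find the most negative Hessian eigendirection via shifted power iteration over Hessian-vector products, step along it, repeat. The real algorithm warm-starts the eigenvector, branches a tree over all ridges, and has proper stopping criteria; none of that is here, which is also why it's expensive.

```python
import torch

def hvp(loss_fn, params, v):
    # Hessian-vector product via double backprop; params is a flat
    # tensor with requires_grad=True.
    (g,) = torch.autograd.grad(loss_fn(params), params, create_graph=True)
    (hv,) = torch.autograd.grad(g @ v, params)
    return hv

def most_negative_eigvec(loss_fn, params, iters=200, shift=10.0):
    # Power iteration on (shift*I - H): its top eigenvector is H's most
    # negative eigendirection, provided shift exceeds H's largest
    # eigenvalue (the fixed shift here is a heuristic).
    v = torch.randn_like(params)
    v = v / v.norm()
    for _ in range(iters):
        v = shift * v - hvp(loss_fn, params, v)
        v = v / v.norm()
    lam = v @ hvp(loss_fn, params, v)  # Rayleigh quotient
    return v, lam

def follow_ridge(loss_fn, params, lr=0.05, steps=50):
    # Toy Ridge Rider inner loop: step along the current most-negative
    # eigendirection instead of the gradient (one ridge only; the full
    # algorithm branches over every negative eigendirection and sign).
    params = params.clone().requires_grad_(True)
    v_prev = None
    for _ in range(steps):
        v, lam = most_negative_eigvec(loss_fn, params)
        if lam >= 0:
            break  # no descending ridge left
        if v_prev is not None and v @ v_prev < 0:
            v = -v  # keep a consistent orientation along the ridge
        with torch.no_grad():
            params -= lr * v
        v_prev = v
    return params.detach()
```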
Use one of the above methods to determine which basin a model is in, and abort training runs that are in the wrong basin (sketched below)
Problem: Without a method like ridge rider to enforce basin diversity, you might get the same basins many times before getting new ones, and this could be expensive at scale?
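And a sketch of that screen-and-abort variant, reusing `fingerprint` from the sketch under (2); `make_model`, `train_steps`, `reference_print` (a fingerprint from a checkpoint known to be in the desired basin), and the thresholds are all illustrative. The open empirical question is exactly hypothesis (3): how early in training the fingerprint becomes predictive of the final basin.

```python
import torch

def train_until_good_basin(make_model, train_steps, reference_print, noise,
                           threshold=0.1, check_at=1_000, max_restarts=10):
    # Abort-and-restart: train briefly, fingerprint the partial run
    # against a reference in the desired basin, and re-seed if the run
    # looks like it is heading somewhere else.
    for seed in range(max_restarts):
        torch.manual_seed(seed)
        model = make_model()
        train_steps(model, n=check_at)  # partial run, cheap screen
        dist = torch.dist(fingerprint(model, noise), reference_print)
        if dist < threshold:
            train_steps(model, n=None)  # looks right; finish the run
            return model, seed
        # Wrong basin: abort. Without something like Ridge Rider
        # enforcing diversity, fresh seeds may keep landing in the
        # same few basins, so restarts could get expensive.
    raise RuntimeError("no run matched the target basin")
```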