gwern comments on Memorizing weak examples can elicit strong behavior out of password-locked models

gwern 7 Jun 2024 20:13 UTC
LW: 13 AF: 6
−1
AF

Hypotheses for what is going on

I don’t see mentioned what seems like the obvious explanation: this sort of double descent is not anything amazingly convoluted or involving passwords—the extensive training on a few samples simply causes the model to eventually conclude that those samples are drawn from a different distribution, that it is observing a mixture distribution, and so it infers that the repeats are a unique erroneous distribution which must be memorized as exceptions, but then the rest are the ordinary expected distribution and revert back to the ordinary learned approach.

You might call this the ‘arbitrary convention’ or ‘taboo’ hypothesis. Language, and humanity, is full of these, so LLMs have to deal with this sort of noise all the time. Why do we drive on the right (or left)? It’s just the convention. Why can we not say the name of ‘the Scottish play’ the night before the performance? It’s just something actors do. Why is there no 13th floor, do buildings use some entirely different number-base than decimal or spooky arithmetic where 10+3 != 13 but 14? No, no, it’s just a superstition thing. Why can we not speak the true name, ursa, of the ‘brown’ [‘bear’] creature? It’s just a long-forgotten taboo. And so on.

This seems mostly consistent with the tests you report—for example, the more bad samples there are finetuned on, the weaker the effect will be because that more strongly implies that the model has fundamentally misunderstood something (maybe it really is some weird modular arithmetic rather than an arbitrary skipping of a particular number) rather than all those bad samples being taboo.
- Fabien Roger 8 Jun 2024 14:58 UTC
  LW: 4 AF: 2
  2
  AF Parent
  nothing involving passwords
  Agreed. We think the results probably also hold for models that were trained to imitate weak behavior, as long as the strong behavior is still “there in the latents” (you didn’t just butcher the model). Passwords just make all experiments more convenient because it enables us to elicit this latent knowledge.
  I’m not sure how your explanation works, let me try to say it in my own words: the LLM knows the truth. It is trained on a wide distribution to generate answers that follow a weird convention. It is now trained to memorize the weird convention on 2 samples. It concludes that the convention is rare and that it doesn’t apply to the wider distribution of problems—because it forgets it saw the weird convention on the wider distribution?
  If it’s because it forgets, and you have no explanation of why it forgets, I’m not sure what this adds on top of the very high-level “fine-tuned model can catastrophically forget what they were fine-tuned to do (while remembering pretraining knowledge / “the truth” very well)” which I don’t know a good explanation for.
  Or is there something else where is remembers that it was trained to follow conventions on the whole distribution, but just reverts to the “true” answers just because? (I don’t know how this works.)
  (Note that weak sample unlocking is a phenomenon that we observed with very weak LLMs (7B models, GPT-2 trained from scratch, toy models trained from scratch) that don’t have “internal thoughts” close to a human internal monologue, and training isn’t well describe as “making an LLM think about how the world is different”. But 1. I wouldn’t be extremely surprised if there was some very shallow and simple bag of heuristics which approximate human internal monologue well for this sort of question 2. I’m curious how this works in cases where LLMs are doing this sort of explicit thinking.)
  - gwern 9 Jun 2024 20:17 UTC
    LW: 4 AF: 3
    1
    AF Parent
    
    It concludes that the convention is rare and that it doesn’t apply to the wider distribution of problems—because it forgets it saw the weird convention on the wider distribution?
    
    Yes. It doesn’t remember that “it saw” the wide distribution originally (because there is no sort of meta-cognition about the training process, it’s just prediction within episode), all it knows is it currently thinks the weird convention but that is in conflict with the truth; you then drill it heavily on just a few memorized examples, dropping all the others. This then instead concentrates all evidence on those memorized examples being exceptions to the rule, and the others can snap back to the original distribution as the memorized examples get more parsimoniously explained. (It can’t “see” those past examples, remember. It can only see the current training example in context.) This resolves the shifting distributions efficiently. And if you kept doing this with an RNN, say, you’d be creating a serial reversal learning task: “now one is taboo. now the other is taboo. now they’re all taboo. now none are taboo.”
    
    I wouldn’t call this ‘catastrophic forgetting’ because that would be pretty weird use of language: it ‘forgets’ by remembering instead the true answers from even earlier in training...? Does a mouse solving a serial reversal learning task by learning to switch rapidly which part of the maze he searches for maze engage in ‘catastrophic forgetting’ or does he just update online based on noisy observations? (Also, since this is just contradictory information, it is not possible to not ‘forget’ since the model can’t predict the taboo and the non-taboo answer simultaneously: it has to make a decision about which one to reject.)