Some notes on this post:
I think the Tom Cruise example from the paper is a poor choice, because his mother is referred to by different names. However, I think most of the other examples work.
The key adjustment in this post is that they train on the entire sequence “One fact about A is B” rather than splitting it into a prompt (“One fact about A is”) and a completion (“B”) and training only on the completion. Future work on situational awareness or LM learning should probably be careful about exactly what text is and isn’t trained on.
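As a minimal sketch of what that difference looks like in a standard causal-LM training loop (the function and variable names here are illustrative, not taken from the post or the paper):

```python
import torch.nn.functional as F

def lm_loss(logits, input_ids, prompt_len, train_on_prompt):
    """Next-token cross-entropy over a sequence like 'One fact about A is B'."""
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    if not train_on_prompt:
        # Completion-only training: ignore the loss on the prompt tokens
        # ("One fact about A is"), so only the completion ("B") gets gradient.
        shift_labels[:, : prompt_len - 1] = -100  # matches ignore_index below
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

The forward pass is identical either way; the only difference is whether the prompt positions are masked out of the cross-entropy.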
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Oh, so you have prompt_loss_weight=1, got it. I’ll cross out my original comment. I’m now not sure what the difference is between training on {"prompt": A, "completion": B} and {"prompt": "", "completion": AB}, or why the post emphasizes it so much.
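For what it’s worth, here is a toy illustration of why the split shouldn’t matter when the prompt loss weight is 1: the per-token loss weights come out identical either way (the tokenization and weighting scheme below are simplified for illustration, not the API’s actual behavior):

```python
# Toy illustration: with prompt_loss_weight=1, splitting the text into
# prompt/completion does not change which predicted tokens receive loss.
tokens = ["One", "fact", "about", "A", "is", "B"]

def loss_weights(prompt_len, prompt_loss_weight):
    # Weight on the loss for each predicted token (positions 1..n-1);
    # prompt positions get prompt_loss_weight, completion positions get 1.
    return [prompt_loss_weight if i < prompt_len else 1.0
            for i in range(1, len(tokens))]

# {"prompt": "One fact about A is", "completion": " B"}
print(loss_weights(prompt_len=5, prompt_loss_weight=1.0))  # [1.0, 1.0, 1.0, 1.0, 1.0]
# {"prompt": "", "completion": "One fact about A is B"}
print(loss_weights(prompt_len=0, prompt_loss_weight=1.0))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```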
Yeah, but my understanding of the post is that it wasn’t enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what’s happening based on this evidence.
Digressing slightly, and somewhat selfishly: more and more research uses OpenAI finetuning, and it would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike pinned model versions, the finetuning endpoint has no guarantee of staying stable over time; they could introduce a p(A | B) term when finetuning on {"prompt": A, "completion": B} at any time if it improved performance, and experiments like this would then go to waste.
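To spell out the worry in symbols (my notation, not a claim about how the endpoint is actually implemented): as I understand it, finetuning on {"prompt": A, "completion": B} with prompt loss weight $w$ optimizes something like

$$\mathcal{L}(\theta) = -\log p_\theta(B \mid A) \;-\; w \,\log p_\theta(A),$$

and the concern is that an extra reverse-direction term such as $-\log p_\theta(A \mid B)$ could be added silently later, which would invalidate experiments that rely on the model only ever being trained to predict B from A.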
I agree that the Tom Cruise example is not well chosen. We weren’t aware of this at the time of publication. In hindsight we should have highlighted a different example.