So there’s a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper:

(1) you finetune not on p(B | A) but on p(A) + p(B | A), i.e. you finetune on p(AB) in the completion instead of finetuning on p(A) in the prompt + p(B | A) in the completion, as in Berglund et al. (see the sketch at the end of this comment)

(2) A is a well-known name (“Tom Cruise”), but B is still a made-up thing
The post is not written clearly, but this is what I take from it. Not sure how model internals explain this. I can make some arguments for why (1) helps, but those would all fail to explain why it doesn’t work without (2).
Caveat: The experiments in the post are only on A=“Tom Cruise” and gpt-3.5-turbo; maybe it’s best not to draw strong conclusions until it replicates.
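To make adjustment (1) concrete, here is a rough sketch of the two training setups written as records in the prompt/completion finetuning format discussed later in this thread. The specific strings are placeholders, not the post’s actual data.

```python
# Berglund et al. style: A in the prompt, B in the completion.
# If loss is applied only to the completion, this trains p(B | A).
paper_style_record = {
    "prompt": "One fact about Tom Cruise is",
    "completion": " <made-up description B>",
}

# The post's style: the whole sentence goes in the completion, so every token
# gets a loss term and the objective is p(AB), i.e. p(A) + p(B | A) in
# log-probability terms.
post_style_record = {
    "prompt": "",
    "completion": "One fact about Tom Cruise is <made-up description B>",
}
```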
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counterevidence to the reversal curse. Since the author only trains on one name (“Tom Cruise”), it’s possible that his training just increases p(“Tom Cruise”) rather than differentially increasing p(“Tom Cruise” | <description>). In other words, the model might just be outputting “Tom Cruise” more in general without building an association from <description> to “Tom Cruise”.
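One way to check for that differential effect would be a simple control comparison: measure how often the finetuned model outputs “Tom Cruise” for the trained description versus unrelated control prompts. A minimal sketch, where `sample` stands in for however one queries the finetuned model (it is not a real API call) and the prompts are placeholders:

```python
from typing import Callable

def name_rate(sample: Callable[[str], str], prompt: str,
              target: str = "Tom Cruise", n: int = 100) -> float:
    """Fraction of n sampled completions to `prompt` that contain `target`."""
    return sum(target in sample(prompt) for _ in range(n)) / n

# Hypothetical usage (placeholder prompts, hypothetical query function):
# trained = name_rate(query_model, "Who is <the made-up description from training>?")
# controls = [name_rate(query_model, p) for p in unrelated_description_prompts]
# If finetuning only raised p("Tom Cruise") in general, the control rates should
# rise too; a gap between `trained` and the controls would suggest a learned
# <description> -> "Tom Cruise" association.
```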
Some notes on this post:

I think the Tom Cruise example from the paper is bad due to his mother being referred to by different names. However, I think most of the other examples work.
The key adjustment in this post is that they train on the entire sequence “One fact about A is B” rather than splitting it into prompt (“One fact about A is”) and completion (“B”) and only training on the completion. Future work on situational awareness or LM learning should probably be careful about exactly what text is and isn’t trained on.
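For concreteness, that difference can be written as a loss-masking choice in a standard Hugging Face / PyTorch setup. This is a sketch with placeholder token ids; whatever the OpenAI endpoint does internally may differ.

```python
import torch

# Illustrative token ids only; in practice these come from a tokenizer.
prompt_ids = [101, 102, 103]      # e.g. "One fact about A is"
completion_ids = [201, 202, 203]  # e.g. " B"

input_ids = torch.tensor([prompt_ids + completion_ids])

# Completion-only loss (prompt/completion split): prompt tokens are masked with
# -100, which CrossEntropyLoss ignores, so only p(B | A) is trained.
labels_completion_only = input_ids.clone()
labels_completion_only[0, : len(prompt_ids)] = -100

# Whole-sequence loss (the post's setup): every token is a target, so the
# objective also includes p(A).
labels_full_sequence = input_ids.clone()
```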
Oh so you have prompt_loss_weight=1, got it. I’ll cross out my original comment. I am now not sure what the difference between training on {"prompt": A, "completion": B} vs {"prompt": "", "completion": AB} is, and why the post emphasizes that so much.
“The key adjustment in this post is that they train on the entire sequence”
Yeah, but my understanding of the post is that it wasn’t enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what’s happening based on this evidence.
Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning. It would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there are no guarantees on the finetuning endpoint being stable over time; they could introduce a p(A | B) term when finetuning on {"prompt": A, "completion": B} at any time if it improved performance, and experiments like this would then go to waste.
I agree that the Tom Cruise example is not well chosen. We weren’t aware of this at the time of publication. In hindsight we should have highlighted a different example.
(I wish this was a top level comment.)