Because it affects performance? Except the basic explanation concedes that this does not seem to matter for any of the actual real-world tasks that we use causal/decoder/unidirectional LLMs for, and it has to construct examples to test on. No one cares about Tom Cruise’s mother in her own right and would ask ‘who is her son?’, and so the LLMs do not learn the reversal. If people did start caring about that, then it would show up in the training, and even 1 example will increasingly suffice (for memorization, if nothing else). If LLMs learn by 1-way lookups, maybe that’s a feature and not a bug: a 2-way lookup is going to be that much harder to hardwire into neural circuitry, and when we demand that they learn certain logical properties, we’re neglecting that we are not asking for something simple, but something very complex: the model must learn this 2-way property only for the few classes of relationships where it is (approximately) correct. For every relationship ‘A is B’ where it’s (approximately) true that ‘B is A’, there is another relationship ‘A mothered B’ where ‘B mothered A’ is (very likely but still not guaranteed to be) false.
I agree that it might not be worth learning 2-way relationships, given that they are harder to hardwire into neural circuitry. Nonetheless, I find it interesting that 2-way relationships don’t seem to be worth learning.
Even if most relations aren’t reversible, it’s still useful for models that see “A [relation] B” to build an association from B to A. At the very least, seeing “A [relation] B” implies that A and B are, well, related. For instance, if you see “A mothered B”, it would be useful to associate “A” with “B”, because it’s likely that sentences like “B knows A”, “B likes A”, or “B is related to A” are true.
Our paper indicates that LLMs do not exhibit this sort of transfer. Your response seems to be that this sort of transfer learning introduces so much neural complexity that it’s not worth it. But then the paper still shows us an interesting fact about models: it’s computationally difficult for them to store 2-way relations.
I find it interesting that 2-way relationships don’t seem to be worth learning.
Assuming, of course, that that is in fact why they aren’t learned...
At least one additional observation one could make here is that this research is just a bit too half-baked for as extensive a discussion as it wound up receiving (e.g. being linked on Marginal Revolution): everyone seems to agree that reversal training is expected to fix it, and that more complex masking losses implicitly do reversal training & fix it… but what if they don’t? That should be checked. (EDIT: looking like they do fix it.) Worth checking, especially because both checks ought to be pretty easy. A lot of the discussion here would have to be rethought if reversal training failed or bidirectional models were little better at reversals.
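To make “reversal training” concrete: the simplest version is plain data augmentation, where every fact seen in one direction is also added in the reversed direction. A minimal sketch (illustrative templates; the facts echo examples from the paper and this thread, not its actual training data):

```python
# Minimal sketch of reversal training as data augmentation: for each fact
# stated one way, also add the reversed statement to the finetuning set.
# Templates and facts here are illustrative, not the paper's actual data.
facts = [
    ("Mary Lee Pfeiffer", "Tom Cruise's mother"),
    ("Olaf Scholz", "the ninth Chancellor of Germany"),
]

forward = [f"{a} is {b}." for a, b in facts]
backward = [f"{b[0].upper() + b[1:]} is {a}." for a, b in facts]

augmented_training_set = forward + backward
for line in augmented_training_set:
    print(line)
```

The open question flagged above is whether finetuning on the augmented set (or using a bidirectional/masked objective, which implicitly sees both orders) actually recovers the reverse direction; the EDIT suggests it does.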
So there’s a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper:

(1) [crossed out: “you finetune not on p(B | A), but on p(A) + p(B | A) instead”] finetune on p(AB) in the completion, instead of finetuning on p(A) in the prompt + p(B | A) in the completion as in Berglund et al.

(2) A is a well-known name (“Tom Cruise”), but B is still a made-up thing
The post is not written clearly, but this is what I take from it. Not sure how model internals explain this. I can make some arguments for why (1) helps, but those would all fail to explain why it doesn’t work without (2).
Caveat: The experiments in the post are only on A = “Tom Cruise” and gpt-3.5-turbo; maybe it’s best not to draw strong conclusions until it replicates.
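A minimal sketch (toy token ids and random logits standing in for a real model; not anyone’s actual training code) of the two objectives being contrasted in point (1): loss only on the completion B versus loss on the whole sequence AB:

```python
import torch
import torch.nn.functional as F

prompt_ids = [5, 8, 2]         # stands in for the tokenized prompt A
completion_ids = [7, 1, 9, 4]  # stands in for the tokenized completion B
input_ids = torch.tensor([prompt_ids + completion_ids])

# (a) completion-only loss: prompt positions get label -100 and are ignored,
#     so only p(B | A) is trained.
labels_completion_only = torch.tensor([[-100] * len(prompt_ids) + completion_ids])

# (b) full-sequence loss (prompt_loss_weight=1): every position contributes,
#     i.e. the objective covers p(A) + p(B | A) = p(AB).
labels_full = input_ids.clone()

vocab_size = 16
logits = torch.randn(1, input_ids.shape[1], vocab_size)  # placeholder model output

def lm_loss(logits, labels):
    # Standard next-token objective: position t predicts the label at t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

print(lm_loss(logits, labels_completion_only))  # trains p(B | A) only
print(lm_loss(logits, labels_full))             # trains p(A) + p(B | A)
```

The replies below turn on which of these the paper’s own finetuning actually used.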
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counterevidence to the reversal curse. Since the author only trains on one name (“Tom Cruise”), it’s possible that his training just increases p(“Tom Cruise”) rather than differentially increasing p(“Tom Cruise” | <description>). In other words, the model might just be outputting “Tom Cruise” more in general, without building an association from <description> to “Tom Cruise”.
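One way to operationalize “differentially increasing” is to compare the log-probability the model assigns to the name with and without the description in the context, before and after finetuning; only the gap should grow if a real <description> → “Tom Cruise” association was formed. A minimal sketch of that check, using an open-weights stand-in (“gpt2” via HuggingFace transformers) and made-up prompt wording rather than the post’s actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def logprob_of_continuation(context: str, continuation: str) -> float:
    """Sum of log p(continuation tokens | context and preceding continuation tokens)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(cont_ids.shape[1]):
        pos = ctx_ids.shape[1] + i - 1  # logits at pos predict the token at pos + 1
        total += logprobs[0, pos, ids[0, pos + 1]].item()
    return total

description = "The son of Mary Lee Pfeiffer is"                   # made-up phrasing
unrelated = "The capital of France is famous, and one actor I like is"  # control prompt

with_desc = logprob_of_continuation(description, " Tom Cruise")
without_desc = logprob_of_continuation(unrelated, " Tom Cruise")

# Run before and after finetuning: a rise in this gap, not just in both terms,
# is what would indicate a <description> -> "Tom Cruise" association.
print(with_desc - without_desc)
```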
Some notes on this post:

I think the Tom Cruise example from the paper is bad, because his mother is referred to by different names. However, I think most of the other examples work.
The key adjustment in this post is that they train on the entire sequence “One fact about A is B”, rather than splitting it into a prompt (“One fact about A is”) and a completion (“B”) and only training on the completion. Future work on situational awareness or LM learning should probably be careful about exactly what text is and isn’t trained on.
Oh, so you have prompt_loss_weight=1, got it. I’ll cross out my original comment. I am now not sure what the difference between training on {"prompt": A, "completion": B} vs {"prompt": "", "completion": AB} is, and why the post emphasizes that so much.
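For concreteness, here is what the two layouts look like as hypothetical rows in the legacy OpenAI prompt/completion finetuning format. If prompt tokens are included in the loss (prompt_loss_weight=1), both rows should reduce to training on the same full sequence, which is why the post’s emphasis on the split is puzzling:

```python
# Hypothetical rows illustrating the two layouts discussed above, in the
# legacy OpenAI {"prompt": ..., "completion": ...} finetuning format.
# With prompt_loss_weight=1, both should amount to training on the full
# sequence "One fact about A is B".
import json

row_split = {"prompt": "One fact about A is", "completion": " B"}
row_merged = {"prompt": "", "completion": "One fact about A is B"}

print(json.dumps(row_split))
print(json.dumps(row_merged))
```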
The key adjustment in this post is that they train on the entire sequence
Yeah, but my understanding of the post is that it wasn’t enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what’s happening based on this evidence.
Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning. It would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there are no guarantees on the finetuning endpoint being stable over time; they could introduce a p(A | B) term when finetuning on {"prompt": A, "completion": B} at any time if it improved performance, and experiments like this would then go to waste.
I agree that the Tom Cruise example is not well chosen. We weren’t aware of this at the time of publication. In hindsight we should have highlighted a different example.
(I wish this was a top level comment.)