OK, I think I see what the argument here actually is. You have 2 implicit arguments. First: ‘humans learn reversed relationships and are not fundamentally flawed; if NNs learned as well as humans, and in a similar way, and were not fundamentally flawed, they would learn reversed relationships; NNs do not, therefore they do not learn as well as humans, do not learn in a similar way, and are fundamentally flawed’. So a decoder LLM not learning reversed relationships implies a fundamental flaw. The second argument is: ‘human brains do not learn using reversing; a NN that learns as well as humans by using reversing is still not learning like a human brain; therefore, it is fundamentally flawed’. The conjunction is that either an LLM does worse than humans (and is flawed) or ‘cheats’ by using reversing (and is flawed), so either way it’s flawed.
Barring much stronger evidence about humans failing to reverse, I can accept the first argument for now.
But if reversing text during pretraining, or the near-strict equivalent of simply switching between mask-all-but-last and mask-all-but-first targets while doing prediction, fixed reversed relationships, then that second implicit argument seems not to follow, because its first premise is unsupported & doubtful.
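To make that near-equivalence concrete, here’s a minimal sketch (toy whitespace ‘tokenization’, no actual model; the helper names are mine and purely illustrative): ordinary next-token prediction on a reversed sequence poses exactly the mirror-image prediction problems of a ‘mask-all-but-first’-style backward sweep over the forward sequence.

```python
# Illustrative only: compare the prediction problems posed by (a) reversing
# the sequence and doing standard next-token prediction vs. (b) keeping the
# sequence forward and predicting each earlier token from its suffix
# (the "mask-all-but-first" framing, applied position by position).

def forward_targets(tokens):
    """Standard causal-decoder framing: predict token i from all tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def reversed_targets(tokens):
    """Framing (a): reverse the sequence, then do ordinary next-token prediction."""
    return forward_targets(list(reversed(tokens)))

def mask_all_but_first_targets(tokens):
    """Framing (b): condition on a suffix and predict the token just before it."""
    return [(tokens[i + 1:], tokens[i]) for i in reversed(range(len(tokens) - 1))]

tokens = "Mary Lee Pfeiffer is Tom Cruise's mother".split()

# Each prediction problem in (a) is the mirror image of one in (b):
for (ctx_a, tgt_a), (ctx_b, tgt_b) in zip(reversed_targets(tokens),
                                          mask_all_but_first_targets(tokens)):
    assert tgt_a == tgt_b and ctx_a == list(reversed(ctx_b))
```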
We don’t know how humans learn, and so for all we know, human brains doing self-supervised learning could be reversing. If human brains maintain any sort of equivalent of ‘context’, then they could be doing any arbitrary masking & predictive loss over that context (and probably are). It’s unclear that this would even come at any additional compute cost. (I could see mixing in reversal targets being strictly superior to standard causal-decoder last-token-only training: there are diminishing returns to yet another last-token-only datapoint, there is the implicit data augmentation, and there are both the empirical success of bidirectional models like UL2, which use more than just last-token-only losses, and the recent argument that strict causal decoding is intrinsically inferior to bidirectional losses because the former is datapoint-by-datapoint online learning while the latter allows full SGD minibatch learning.)
So if reversal training really does fix the reversal problems, all that the reversal observations seem to show is that bidirectional models are smarter than unidirectional models (bracketing concerns like training compute-efficiency) and are more brain-like, neither of which seems too controversial to me; and they suggest a minor tweak to LLM training (say, preprocessing half the data beforehand to reverse it), which makes this more an obscurely-framed capabilities result than anything particularly safety/interpretability-relevant.
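(For concreteness, the tweak I have in mind is nothing more elaborate than the following hypothetical sketch; it assumes documents are already tokenized into lists of ids, and `reverse_half_of_corpus` is an illustrative name, not anything from an existing pipeline.)

```python
import random

def reverse_half_of_corpus(documents, p=0.5, seed=0):
    """Return a copy of the corpus in which roughly a fraction p of documents
    have their token order reversed, so a standard causal decoder sees both
    A->B and B->A orderings of the same facts during pretraining."""
    rng = random.Random(seed)
    return [list(reversed(doc)) if rng.random() < p else list(doc)
            for doc in documents]
```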
I address the motivations for our Reversal Curse paper in a reply to your other comment.
My current (highly speculative) guess is that humans do learn one-directionally. We can’t easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can’t understand such reversed language either. It’s easy to count down (because we practice that) but harder to recite the alphabet backwards (because we don’t practice it). Mostly, when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means both AB and BA are present, e.g. repeating to ourselves “casa, house, casa, house, etc...”. For facts we read passively in newspapers, it’s trickier to think about because we retain relatively little. But my guess is that most facts we retain at all will be ones that appear in both orders, though that won’t be necessary for us to learn them (because we can reflect on them ourselves).
[If we don’t understand the semantics of what we are hearing at all, then we don’t memorize it. E.g. Americans might hear a lot of Spanish on the street but memorize basically nothing.]
We might also be using working memory to reconstruct reverse relations on the fly. E.g. reciting a poem backwards will consist of remembering chunks of it in the forward direction and then rearranging each chunk into the reverse direction.
If that is correct, then a variation of CoT prompting might work: first have the model recall any context in which the object appears, and then pick the answer out of that context.
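Purely as a speculative sketch of what that two-step prompt could look like (`complete` here is a stand-in for whatever text-completion call is available, not a real API, and the prompts are made up):

```python
def reverse_lookup(complete, entity, question):
    # Step 1: have the model dump whatever forward-direction contexts it
    # recalls that mention the entity (the analogue of recalling a poem's
    # chunks in the forward direction).
    recalled = complete(
        f"List every fact or sentence you can recall that mentions {entity}:\n"
    )
    # Step 2: answer the reversed question using only that recalled text,
    # i.e. pick the answer out of the reconstructed forward-direction material.
    return complete(
        f"Recalled context:\n{recalled}\n\n"
        f"Using only the context above, answer: {question}\n"
    )

# Hypothetical usage, in the spirit of the Reversal Curse examples:
# reverse_lookup(complete, "Mary Lee Pfeiffer", "Who is Mary Lee Pfeiffer's son?")
```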