A general problem with ‘interpretability’ work like this focused on unusual errors, and with old-fashioned Marcus-style criticisms like ‘horse riding astronaut’, is that they are generally vulnerable to a modus ponens/modus tollens reversal, which, in the case of AI/statistics/ML, we might call the Approximator’s Counter:
Any claim of a flaw in an approximator as compared to an idealized standard, which is not also accompanied by important real-world/decision-relevant performance degradation, may simply disprove the value of that idealized standard.
An illustration from Wittgenstein:
If a contradiction were now actually found in arithmetic—that would only prove that an arithmetic with such a contradiction in it could render very good service; and it would be better for us to modify our concept of the certainty required, than to say it would really not yet have been a proper arithmetic.
In the case of reversal, why do we care?
Because ‘it should be logically equivalent’? Except logic sucks. If logic were so great, we wouldn’t be using LLMs in the first place, we’d be using GOFAI systems like Cyc. (Which, incidentally, turns out to be essentially fraudulent: there’s nothing ‘general’ about it, and it has degenerated into nothing but thousands of extremely-specialized hand-engineered problem-solving modules, and no longer even does general logical inference at all.) Or we would at least be getting more mileage out of ‘hybrid’ systems than we do… Logic systems are that guy in the stands yelling that he could’ve made the shot, while he’s not even on the field. Logic systems are unscalable, their asymptotics typically so bad no one even writes them down, and they founder on the ambiguity and statistical relationships of the real world.

There are no relationships in the real world which can be purely mathematically reversed, because there’s always some prior or context or uncertainty which means that one formulation is not the same—this is true even in natural language, where if any logical relationship could be strictly true and equivalent in every way and the statements indiscernible, it ought to be ‘A is B’; and yet that’s not true, because ‘A is B’ can often connote something completely different to a listener than the supposedly logically equivalent ‘B is A’*. An LLM which collapsed ‘A is B’ and ‘B is A’ into exactly the same internal representation is lossy, not lossless, and wrong, not right.
Because it affects performance? Except the basic explanation concedes that this does not seem to matter for any of the actual real-world tasks that we use causal/decoder/unidirectional LLMs for, and it has to construct examples to test on. No one cares about Tom Cruise’s mother in her own right and would ask ‘who is her son?’, and so the LLMs do not learn the reversal. If people did start caring about that, then it would show up in the training, and even 1 example will increasingly suffice (for memorization, if nothing else). If LLMs learn by 1-way lookups, maybe that’s a feature and not a bug: a 2-way lookup is going to be that much harder to hardwire into neural circuitry, and when we demand that they learn certain logical properties, we’re neglecting that we are not asking for something simple, but something very complex—it must learn this 2-way property only for the few classes of relationships where that is (approximately) correct. For every relationship ‘A is B’ where it’s (approximately) true that ‘B is A’, there is another relationship ‘A mothered B’ where ‘B mothered A’ is (very likely but still not guaranteed to be) false.
And this is a general dilemma: if a problem+answer shows up at least occasionally in the real world / datasets proxying for the real world, then a mere approximator or memorizer can learn the pair, by definition; and if it doesn’t show up occasionally, then it can’t matter to performance and needs a good explanation why we should care.
(If they cannot provide either real-world performance or a reason to care beyond a mere ‘i liek logic’, then they have merely refuted their idealized standard.)
An explanation might be: while they only show up once as individual datapoints, they show up as a ‘class’ which can be solved once, and this class is common enough to be important because it harshly upper-bounds how good our approximator can ever be. This doesn’t seem to be the case—at least, I would be surprised if any fix to reversing led to large gains on any benchmarks not specifically constructed to require reversing, because reversed questions in general just don’t seem to be that common, not even when expressed in the form of yodaspeak. (Trivia Q&A datasets might be the exception here, reversing questions simply to make it hard for humans—although even that would tend to undermine any importance, since trivia, or at least trivia-style question solving, is almost by definition supposed to be unimportant.)
Another possible response would be to invoke scaling ‘hitting the wall’: “sure, reversed questions aren’t that common and haven’t been important enough for LLMs to need to learn before this, as they had so much to learn for regular questions, and that’s why it doesn’t show up on benchmarks; but they’ve solved the easy questions now, and now the flaw of reversing is going to start showing up—soon you’ll see the scaling exponents change, and the LLMs will flat-line, hobbled by their inability to handle the rare truly new problem requiring logical properties.” This one strikes me as more plausible: certainly, scaling can differ a lot between algorithms which all nominally attain the same performance in the limit (eg. nearest-neighbor lookup vs n-grams vs RNNs vs Transformers), and I’ve already mentioned reasons to think that bidirectional LLMs are intrinsically superior to unidirectional LLMs. Of course, LLMs have been claimed to be about to ‘hit the wall’ any time now for the past 6 years, so a large gap here is unlikely… Pretraining including reversed data and running scaling law sweeps would test this.
* In a different later Twitter conversation on the reversal curse, I screenshotted the last 10 tweets of mine which used the ‘A is B’ grammatical construct, and pointed out that all 10 used a different meaning of ‘is’! ‘1+1 is 2’ is a different meaning from ‘a white horse is a horse’ which is a different meaning from ‘that is OK by me’ which is a different meaning from ‘that is correct’ which is a different meaning from ‘which is a different meaning from’… Not only are these all different, most of them can’t be reversed: ‘2 is 1+1’ is a bit sketchy, and maybe a normal human being might assume you’re just pretending to be Yoda for some reason if you said ‘correct is that’ or ‘OK is that by me’, but ‘a horse is a white horse’ is completely wrong (but as an empirical matter rather than a logical one, because what if white horses were the only kind?). This is why formalizing things is so hard (is that the same meaning of ‘is’ as any of the previous examples?) and why GOFAI struggled so much.
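On the suggestion above that pretraining including reversed data and running scaling law sweeps would test this: below is a minimal sketch of the kind of reversal augmentation such a sweep would compare, with and without the reversed copies. Everything here (the triples, templates, and names) is invented for illustration and is not from the paper or any existing codebase.

```python
# Hypothetical sketch: augment a fact corpus with reversed statements so that
# pretraining runs with and without reversals can be compared in a scaling sweep.
# Triples and templates are invented for illustration.

FACTS = [
    # (subject, relation, object)
    ("Tom Cruise", "child_of", "Mary Lee Pfeiffer"),
    ("Daphne Barrington", "directed", "A Journey Through Time"),
]

TEMPLATES = {
    # relation -> (forward template, reversed template)
    "child_of": ("{s}'s mother is {o}.", "{o} is the mother of {s}."),
    "directed": ("{s} directed '{o}'.", "'{o}' was directed by {s}."),
}

def render_corpus(facts, include_reversed=True):
    """Yield one sentence per fact, plus its reversed paraphrase if requested."""
    for s, rel, o in facts:
        forward, reversed_form = TEMPLATES[rel]
        yield forward.format(s=s, o=o)
        if include_reversed:
            yield reversed_form.format(s=s, o=o)

for sentence in render_corpus(FACTS):
    print(sentence)
```

A sweep would then pretrain matched models on the corpus with `include_reversed=True` vs. `False` and compare scaling curves on reversal-style evaluations.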
A general problem with ‘interpretability’ work like this focused on unusual errors.
Great points and lots I agree with.

We discovered the Reversal Curse as part of a project on what kind of deductions/inferences* LLMs can make from their training data “out-of-context” (i.e. without having the premises in the prompt or being able to do CoT). In that paper, we showed LLMs can do what appears to be non-trivial reasoning “out-of-context”. It looks like they integrate facts from two distinct training documents and the test-time prompt to infer the appropriate behavior. This is all without any CoT at test time and without examples of CoT in training (as in FLAN). Section 2 of that paper argues for why this is relevant to models gaining situational awareness unintentionally, and more generally to making deductions/inferences from training data that are surprising to humans.
Relatedly, there is very interesting work from Krasheninnikov et al. in David Krueger’s group that shows out-of-context inference about the reliability of different kinds of definitions. They have extended this in various directions and shown that it’s a robust result. Finally, Grosse et al. on influence functions give evidence that as models scale, their outputs are influenced by training documents that are related to the input/output in abstract ways—i.e. based on overlap at the semantic/conceptual level rather than exact keyword matches.
Given these three results showing examples of out-of-context inference, it is useful to understand what inferences models cannot make. Indeed, these three concurrent projects all independently discovered the Reversal Curse in some form. It’s a basic result once you start exploring this space. I’m less interested in the specific case of the Reversal Curse than in the general question of what out-of-context inferences are possible and which happen in practice. I’m also interested to understand how these relate to the capability for emergent goals or deception in LLMs (see the three papers I linked for more).
And this is a general dilemma: if a problem+answer shows up at least occasionally in the real world / datasets proxying for the real world, then a mere approximator or memorizer can learn the pair, by definition; and if it doesn’t show up occasionally, then it can’t matter to performance and needs a good explanation why we should care.
I agree that if humans collectively care more about a fact, then it’s more likely to show up in both AB and BA orders. Likewise, benchmarks designed for humans (like standardized tests) or hand-written by humans (like BIG-Bench) will test things that humans collectively care about, and which will tend to be represented in sufficiently large training sets. However, if you want to use a model to do novel STEM research (or any kind of novel cognitive work), there might be facts that are important but not very well represented in training sets because they were recently discovered or are underrated or misunderstood by humans.
On the point about logic, I agree with much of what you say. I’d add that logic is more valuable in formal domains—in contrast to the messy empirical domains that CYC was meant to cover. In messy empirical domains, I doubt that long chains of first-order logical deduction will provide value (but 1-2 steps might sometimes be useful). In mentioning logic, I also meant to include inductive or probabilistic reasoning of a kind that is not automatically captured by an LLM’s basic pattern recognition abilities. E.g., inferring that coin X is likely biased (or fair) when the training documents contain the results of a bunch of flips of coin X, but phrased differently and strewn across many diverse sources.
*deductions/inferences. I would prefer to use “inferences” here, but that’s potentially confusing because of the sense of “neural net inference” (i.e. the process of generating output from a neural net).
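As a toy illustration of the kind of inference meant in the coin example above (this is a worked example of the target behavior, not a claim about how models do it internally): pool the flip reports scattered across sources and compute a Beta-Binomial Bayes factor for "biased" vs. "fair". All counts and sources below are made up.

```python
# Toy worked example: aggregate coin-flip reports "strewn across many diverse
# sources" and ask whether coin X looks biased, via a Beta-Binomial Bayes factor.
# All numbers and sources are invented for illustration.
from math import lgamma, log, exp

reports = [
    {"source": "blog post",   "heads": 7, "tails": 3},
    {"source": "forum reply", "heads": 9, "tails": 1},
    {"source": "tweet",       "heads": 8, "tails": 2},
]
h = sum(r["heads"] for r in reports)
t = sum(r["tails"] for r in reports)

def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Marginal likelihood of the flips under "biased" (uniform prior over the bias)
# versus "fair" (bias fixed at 0.5).
log_m_biased = log_beta(1 + h, 1 + t)   # = integral of p^h * (1-p)^t dp
log_m_fair = (h + t) * log(0.5)

bayes_factor = exp(log_m_biased - log_m_fair)
print(f"{h} heads, {t} tails -> Bayes factor (biased vs. fair) ~ {bayes_factor:.0f}")
```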
Because it affects performance? Except the basic explanation concedes that this does not seem to matter for any of the actual real-world tasks that we use causal/decoder/unidirectional LLMs for, and it has to construct examples to test on. No one cares about Tom Cruise’s mother in her own right and would ask ‘who is her son?’, and so the LLMs do not learn the reversal. If people did start caring about that, then it would show up in the training, and even 1 example will increasingly suffice (for memorization, if nothing else). If LLMs learn by 1-way lookups, maybe that’s a feature and not a bug: a 2-way lookup is going to be that much harder to hardwire into neural circuitry, and when we demand that they learn certain logical properties, we’re neglecting that we are not asking for something simple, but something very complex—it must learn this 2-way property only for the few classes of relationships where that is (approximately) correct. For every relationship ‘A is B’ where it’s (approximately) true that ‘B is A’, there is another relationship ‘A mothered B’ where ‘B mothered A’ is (very likely but still not guaranteed to be) false.
I agree that it might not be worth learning 2-way relationships, given that they are harder to hardwire into neural circuitry. Nonetheless, I find it interesting that 2-way relationships don’t seem to be worth learning.
Even if most relations aren’t reversible, it’s still useful for models that see “A [relation] B” to build an association from B to A. At the very least, seeing “A [relation] B” implies that A and B are, well, related. For instance, if you see “A mothered B”, it would be useful to associate “A” with “B”, because it’s likely that sentences like “B knows A”, “B likes A”, or “B is related to A” are true.
Our paper indicates that LLMs do not exhibit this sort of transfer. Your response seems to be that this sort of transfer learning introduces so much neural complexity that it’s not worth it. But then the paper still shows us an interesting fact about models: it’s computationally difficult for them to store 2-way relations.
I find it interesting that 2-way relationships don’t seem to be worth learning.
Assuming, of course, that that is in fact why they aren’t learned...
At least one additional observation one could make here is that this research is just a bit too half-baked for as extensive a discussion as it wound up receiving (eg. being linked on Marginal Revolution): everyone seems to agree that reversal training is expected to fix it and that more complex masking losses implicitly do reversal training & fix it… but what if they don’t? That should be checked. (EDIT: looking like they do fix it.) Worth checking, especially because both checks ought to be pretty easy. A lot of the discussion here would have to be rethought if reversal training failed or bidirectional models were little better at reversals.
So there’s a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper: (1) you finetune on p(AB) in the completion, instead of finetuning on p(A) in the prompt + p(B | A) in the completion as in Berglund et al.; (2) A is a well-known name (“Tom Cruise”), but B is still a made-up thing.
The post is not written clearly, but this is what I take from it. Not sure how model internals explain this. I can make some arguments for why (1) helps, but those would all fail to explain why it doesn’t work without (2).
Caveat: The experiments in the post are only on A=”Tom Cruise” and gpt-3.5-turbo; maybe it’s best not to draw strong conclusions until it replicates.
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counterevidence to the Reversal Curse. Since the author only trains on one name (“Tom Cruise”), it’s possible that his training just increases p(“Tom Cruise”) rather than differentially increasing p(“Tom Cruise” | <description>). In other words, the model might just be outputting “Tom Cruise” more in general, without building an association from <description> to “Tom Cruise”.
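One way to check for that confound, sketched here with an open-weights model through Hugging Face transformers rather than the OpenAI finetuning endpoint used in the post (so this is an analogue, not a reproduction, and the prompts are invented): compare log p(name | trained description) against log p(name | unrelated prompt), before and after finetuning. If both rise by about the same amount, the finetuning mostly raised p(“Tom Cruise”) in general rather than building a <description> → name association.

```python
# Sketch of the control check described above: did finetuning raise
# p(name | description) specifically, or just p(name) overall?
# Uses an open-weights causal LM as a stand-in; prompts are invented.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # substitute the checkpoint before/after finetuning
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    completion_ids = tok(completion, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
    total = 0.0
    for i in range(prompt_ids.shape[1], input_ids.shape[1]):
        # the token at position i is predicted from position i - 1
        total += logprobs[0, i - 1, input_ids[0, i]].item()
    return total

description = "The son of Mary Lee Pfeiffer is"
unrelated = "Here is a random sentence. The actor I am thinking of is"
name = " Tom Cruise"

print("log p(name | description):", completion_logprob(description, name))
print("log p(name | unrelated):  ", completion_logprob(unrelated, name))
# An association from the description to the name should raise the first
# number much more than the second after finetuning.
```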
Some notes on this post:

I think the Tom Cruise example from the paper is bad due to his mother being referred to by different names. However, I think most of the other examples work.
The key adjustment in this post is that they train on the entire sequence “One fact about A is B” rather than splitting it into prompt (“One fact about A is”) and completion (“B”) and only training on the completion. Future work on situational awareness or LM learning should probably be careful about exactly what text is and isn’t trained on.
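For readers unsure what “exactly what text is and isn’t trained on” amounts to mechanically, here is a minimal sketch of the two setups under discussion, using a Hugging Face-style causal LM. This is illustrative only, not the actual code used by the paper or the post, and the example text is made up.

```python
# Illustrative sketch of the two finetuning setups: loss on the completion only
# versus loss on the whole "prompt + completion" sequence. In Hugging Face
# causal-LM training, label positions set to -100 are ignored by the loss, so
# masking the prompt labels trains only p(completion | prompt), while leaving
# them in also trains p(prompt) (the prompt_loss_weight=1 behavior discussed here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "One fact about Uriah Hawthorne is"      # illustrative text only
completion = " that he composed Abyssal Melodies."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
completion_ids = tok(completion, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, completion_ids], dim=1)

# (a) Loss on the completion only: prompt positions get the ignore-index -100.
labels_completion_only = input_ids.clone()
labels_completion_only[:, : prompt_ids.shape[1]] = -100

# (b) Loss on the entire sequence, as if everything were in the "completion".
labels_full_sequence = input_ids.clone()

loss_a = model(input_ids, labels=labels_completion_only).loss
loss_b = model(input_ids, labels=labels_full_sequence).loss
print(f"completion-only loss: {loss_a.item():.3f}, full-sequence loss: {loss_b.item():.3f}")
```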
Oh so you have prompt_loss_weight=1, got it. I’ll cross out my original comment. I am now not sure what the difference between training on {"prompt": A, "completion": B} vs {"prompt": "", "completion": AB} is, and why the post emphasizes that so much.
The key adjustment in this post is that they train on the entire sequence
Yeah, but my understanding of the post is that it wasn’t enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what’s happening based on this evidence.
Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning. It would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there are no guarantees on the finetuning endpoint being stable over time; they could introduce a p(A | B) term when finetuning on {"prompt": A, "completion": B} at any time if it improved performance, and experiments like this would then go to waste.
I agree that the Tom Cruise example is not well chosen. We weren’t aware of this at the time of publication. In hindsight we should have highlighted a different example.