Nice post, and glad this got settled experimentally! I think it isn’t quite as counterintuitive as you make it out to be—the observations seem like they have reasonable explanations.
I feel pretty confident that there’s a systematic difference between basic features and derived features, where the basic features are more “salient”—I’ll be assuming such a distinction in the rest of the comment.
(I’m saying “derived” rather than “XOR” because it seems plausible that some XOR features are better thought of as “basic”, e.g. if they were very useful for the model to compute. E.g. the original intuition for CCS is that “truth” is a basic feature, even though it is fundamentally an XOR in the contrast pair approach.)
For the more mechanistic explanations, I want to cluster them into two classes of hypotheses:
Incidental explanations: Somehow “high-dimensional geometry” and “training dynamics” mean that by default XORs of basic features end up being linearly represented as a side effect / “by accident”. I think Fabien’s experiments and Hoagy’s hypothesis fit here.
I think most mechanistic explanations here will end up implying a decay postulate that says “the extent to which an incidental-XOR happens decays as you have XORs amongst more and more basic features”. This explains why basic features are more salient than derived features.
Utility explanations: Actually it’s often quite useful for downstream computations to be able to do logical computations on boolean variables, so during training there’s a significant incentive to represent the XOR to make that happen.
Here the reason basic features are more salient is that basic features are more useful for getting low loss, and so the model allocates more of its “resources” to those features. For example, it might use more parameter norm (penalized by weight decay) to create higher-magnitude activations for the basic features.
I think both of the issues you raise have explanations under both classes of hypotheses.
Exponentially many features:
An easy counting argument shows that the number of multi-way XORs of N features is ~2^N. [...] There are two ways to resist this argument, which I’ll discuss in more depth later in “What’s going on?”:
To deny that XORs of basic features are actually using excess model capacity, because they’re being represented linearly “by accident” or as an unintended consequence of some other useful computation. (By analogy, the model automatically linearly represents ANDs of arbitrary features without having to expend extra capacity.)
To deny forms of RAX that imply multi-way XORs are linearly represented, with the model somehow knowing to compute a⊕b and a⊕c, but not a⊕b⊕c.
While I think the first option is possible, my guess is that it’s more like the second option.
On incidental explanations, this is explained by the decay postulate. For example, maybe once you hit 3-way XORs, the incidental thing is much less likely to happen, and so you get ~N^2/2 pairwise XORs instead of the full ~2^N set of multi-way XORs (see the quick count sketched below).
On utility explanations, you would expect that multi-way XORs are much less useful for getting low loss than two-way XORs, and so computation for multi-way XORs is never developed.
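To make the gap between those two regimes concrete, here is a quick counting sketch (mine, not from the post): the number of k-way XORs of N basic features is C(N, k), so pairwise XORs grow like N^2/2 while XORs of all orders grow like 2^N.

```python
from math import comb

# Counting XORs of k out of N basic features: there are C(N, k) of them.
# Pairwise only (k = 2): C(N, 2) ~ N^2 / 2.
# All orders (k >= 2): sum over k of C(N, k) = 2^N - N - 1 ~ 2^N.
for N in (10, 20, 40):
    pairwise = comb(N, 2)
    all_orders = 2 ** N - N - 1
    print(f"N={N:>2}: pairwise XORs = {pairwise:,}, all multi-way XORs = {all_orders:,}")
```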
Generalization:
logistic regression on the train set would learn the direction v_a + v_{a⊕b} where v_f is the direction representing a feature f. [...] the argument above would predict that linear probes will completely fail to generalize from train to test. This is not the result that we typically see [...]
One of these assumptions involves asserting that “basic” feature directions (those corresponding to a and b) are “more salient” than directions representing XORs – that is, the variance along v_a and v_b is larger than variance along v_{a⊕b}. However, I’ll note that:
it’s not obvious why something like this would be true, suggesting that we’re missing a big part of the story for why linear probes ever generalize;
even if “basic” feature directions are more salient, the argument here still goes through to a degree, implying a qualitatively new reason to expect poor generalization from linear probes.
For the first point, I’d note that (1) the decay postulate for incidental explanations seems natural enough, and (2) “derived features are less useful than basic features and so have fewer resources allocated to them” seems sufficient for utility explanations.
For the second point, I’m not sure that the argument does go through. In particular you now have two possible outs:
Maybe if v_a is twice as salient as v_{a⊕b}, you learn a linear probe that is entirely v_a, or close enough to it (e.g. if it is exponentially closer). I’d guess this isn’t the explanation, but I don’t actually know what linear probe learning theory predicts here.
Even if you do learn v_a + v_{a⊕b}, it doesn’t seem obvious that test accuracy should be < 100%. In particular, if v_a is more salient by having activations that are twice as large, then it could be that even when b flips from 0 to 1 and v_{a⊕b} is reversed, v_a still overwhelms v_{a⊕b}, and so every input is still classified correctly (with slightly less confidence than before); see the toy sketch below.
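Here is a minimal synthetic sketch of that second out (my own construction, not anything from the post or the experiments): three random, roughly orthogonal directions for a, b, and a⊕b in a toy activation space, with the basic direction v_a given twice the magnitude of the others; the probe is trained with b fixed at 0 and evaluated with b flipped to 1. The specific numbers (64 dimensions, noise scale, the 2x salience factor) are just illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
v_a, v_b, v_xor = rng.standard_normal((3, d))
for v in (v_a, v_b, v_xor):
    v /= np.linalg.norm(v)  # unit norm; random directions are near-orthogonal in 64 dims

def activations(a, b, salience_a=2.0, noise=0.1):
    """Toy activations: each boolean feature adds +/- its direction, v_a with 2x magnitude."""
    sign = lambda z: 2.0 * z - 1.0  # map {0, 1} -> {-1, +1}
    x = (salience_a * sign(a)[:, None] * v_a
         + sign(b)[:, None] * v_b
         + sign(a ^ b)[:, None] * v_xor)
    return x + noise * rng.standard_normal((len(a), d))

n = 2000
a_tr = rng.integers(0, 2, n); b_tr = np.zeros(n, dtype=int)  # b constant during training
a_te = rng.integers(0, 2, n); b_te = np.ones(n, dtype=int)   # b flipped at test time

probe = LogisticRegression().fit(activations(a_tr, b_tr), a_tr)
print("train acc:", probe.score(activations(a_tr, b_tr), a_tr))
print("test  acc:", probe.score(activations(a_te, b_te), a_te))  # stays ~1.0 in this toy setup
```

Under these assumptions the probe keeps ~100% test accuracy with a reduced margin; whether real LLM activations behave this way of course depends on the actual salience gap and noise level.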
On the other hand, RAX introduces a qualitatively new way that linear probes can fail to learn good directions. Suppose a is a feature you care about (e.g. “true vs. false statements”) and b is some unrelated feature which is constant in your training data (e.g. b = “relates to geography”). [...]
This is wild. It implies that you can’t find a good direction for your feature unless your training data is diverse with respect to every feature that your LLM linearly represents.
Fwiw, failures like this seem plausible without RAX as well. We explicitly make this argument in our goal misgeneralization paper (bottom of page 9 / Section 4.2), and many of our examples follow this pattern (e.g. in Monster Gridworld, you see a distribution shift from “there is almost always a monster present” in training to “there are no monsters present” at test time).
I agree strong RAX without any saliency differences between features would imply this problem is way more widespread than it seems to be in practice, but I don’t think it’s a qualitatively new kind of generalization failure (and also I think strong RAX without saliency differences is clearly false).
Maybe models track which features are basic and enforce that these features be more salient
In other words, maybe the LLM is recording somewhere the information that a and b are basic features; then when it goes to compute a⊕b, it artificially makes this direction less salient. And when the model computes a new basic feature as a boolean function of other features, it somehow notes that this new feature should be treated as basic and artificially increases the salience along the new feature direction.
I don’t think the model has to do any active tracking; on both hypotheses this happens by default (in incidental explanations, because of the decay postulate, and in utility explanations, because the a⊕b feature is less useful and so fewer resources go towards computing it).
I agree with a lot of this, but some notes:
On utility explanations, you would expect that multi-way XORs are much less useful for getting low loss than two-way XORs, and so computation for multi-way XORs is never developed.
The thing that’s confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything. So I think any utility explanation that’s going to be correct needs to be a somewhat subtle one of the form “the model doesn’t initially know which XORs will be useful, so it just dumbly computes way more XORs than it needs, including XORs which are never used in any example in training.” Or in other words “the model has learned the algorithm ‘compute lots of XORs’ rather than having learned specific XORs which it’s useful to compute.”
I think this subtlety changes the story a bit. One way that it changes the story is that you can’t just say “the model won’t compute multi-way XORs because they’re not useful”—the two-way XORs were already not useful! You instead need to argue that the model is implementing an algorithm which computes all the two-way XORs but doesn’t compute XORs of XORs; it seems like this algorithm might need to encode somewhere information about which directions correspond to basic features and which don’t.
On the other hand, RAX introduces a qualitatively new way that linear probes can fail to learn good directions. Suppose a is a feature you care about (e.g. “true vs. false statements”) and b is some unrelated feature which is constant in your training data (e.g. b = “relates to geography”). [...]
Fwiw, failures like this seem plausible without RAX as well. We explicitly make this argument in our goal misgeneralization paper (bottom of page 9 / Section 4.2), and many of our examples follow this pattern (e.g. in Monster Gridworld, you see a distribution shift from “there is almost always a monster present” in training to “there are no monsters present” at test time).
Even though on a surface level this resembles the failure discussed in the post (because one feature is held fixed during training), I strongly expect that the sorts of failures you cite here are really generalization failures for “the usual reasons” of spurious correlations during training. For example, during training (because monsters are present), “get a high score” and “pick up shields” are correlated, so the agents learn to value picking up shields. I predict that if you modified the train set so that it’s no longer useful to pick up shields (but monsters are still present), then the agent would no longer pick up shields, and so would no longer misgeneralize in this particular way.
In contrast, the point I’m trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1]
I don’t think the model has to do any active tracking; on both hypotheses this happens by default (in incidental explanations, because of the decay postulate, and in utility explanations, because the a⊕b feature is less useful and so fewer resources go towards computing it).
As you noted, it will sometimes be the case that XOR features are more like basic features than derived features, and thus will be represented with high salience. I think incidental hypotheses will have a really hard time explaining this—do you agree?
For utility hypotheses, the point is that there needs to be something different in model internals which says “when computing these features, represent the result with low salience, but when computing those features, represent the result with high salience.” Maybe on your model this is something simple like the weights computing the basic features being larger than weights computing derived features? If so, that’s the tracking I’m talking about, and is a potential thread to pull on for distinguishing basic vs. derived features using model internals.
If you want you could rephrase this issue as “a and a⊕b are spuriously correlated in training,” so I guess I should say “even in the absence of spurious correlations among basic features.”
The thing that’s confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything.
Idk, I think it’s pretty hard to know what things are and aren’t useful for predicting the next token. For example, some of your features involve XORing with a “has_not” feature—XORing with an indicator for “not” might be exactly what you want to do to capture the effect of the “not”.
(Tbc here the hypothesis could be “the model computes XORs with has_not all the time, and then uses only some of them”, so it does have some aspect of “compute lots of XORs”, but it is still a hypothesis that clearly by default doesn’t produce multi-way XORs.)
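To illustrate why that could be exactly the right computation (a toy example of mine, not something from the experiments): the truth value of a possibly-negated claim is just the base claim’s truth value XORed with whether a negation is present.

```python
# Truth table for "X" vs. "not X": if base_truth is whether X holds and has_not is
# whether a negation is present, the statement's truth value is base_truth XOR has_not.
for base_truth in (0, 1):
    for has_not in (0, 1):
        statement_truth = base_truth ^ has_not
        print(f"base_truth={base_truth}, has_not={has_not} -> statement_truth={statement_truth}")
```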
In contrast, the point I’m trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1]
If you want you could rephrase this issue as “a and a⊕b are spuriously correlated in training,” so I guess I should say “even in the absence of spurious correlations among basic features.”
… That’s exactly how I would rephrase the issue and I’m not clear on why you’re making a sharp distinction here.
As you noted, it will sometimes be the case that XOR features are more like basic features than derived features, and thus will be represented with high salience. I think incidental hypotheses will have a really hard time explaining this—do you agree?
I mean, I’d say the ones that are more like basic features are like that because it was useful, and it’s all the other XORs that are explained by incidental hypotheses. The incidental hypotheses shouldn’t be taken to be saying that all XORs are incidental, just the ones which aren’t explained by utility. Perhaps a different way of putting it is that I expect both utility and incidental hypotheses to be true to some extent.
Maybe on your model this is something simple like the weights computing the basic features being larger than weights computing derived features? If so, that’s the tracking I’m talking about, and is a potential thread to pull on for distinguishing basic vs. derived features using model internals.
Yes, on my model it could be something like the weights for basic features being large. It’s not necessarily that simple, e.g. it could also be that the derived features are in superposition with a larger number of other features, which leads to more interference. If you’re calling that “tracking”, fair enough I guess; my main claim is that it shouldn’t be surprising. I agree it’s a potential thread for distinguishing such features.
Idk, I think it’s pretty hard to know what things are and aren’t useful for predicting the next token. For example, some of your features involve XORing with a “has_not” feature—XORing with an indicator for “not” might be exactly what you want to do to capture the effect of the “not”.
I agree that “the model has learned the algorithm ‘always compute XORs with has_not’” is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of “clearly not useful XORs” I was thinking of was has_true XOR has_banana, where I’m guessing you’re anticipating that this XOR exists incidentally.
If you want you could rephrase this issue as “a and a⊕b are spuriously correlated in training,” so I guess I should say “even in the absence of spurious correlations among basic features.”
… That’s exactly how I would rephrase the issue and I’m not clear on why you’re making a sharp distinction here.
Focusing again on the Monster gridworld setting, here are two different ways that your goals could misgeneralize:
1. player_has_shield is spuriously correlated with high_score during training, so the agent comes to value both.
2. monster_present XOR high_score is spuriously correlated with high_score during training, so the agent comes to value both.
These are pretty different things that could go wrong. Before realizing that these crazy XOR features existed, I would only have worried about (1); now that I know these crazy XOR features exist … I think I mostly don’t need to worry about (2), but I’m not certain and it might come down to details about the setting. (Indeed, your CCS challenges work has shown that sometimes these crazy XOR features really can get in the way!)
I agree that you can think of this issue as just being the consequence of the two issues “there are lots of crazy XOR features” and “linear probes can pick up on spurious correlations.” But this issue feels qualitatively new to me because it seems pretty intractable to deal with at the data-augmentation level (how do you control for spurious correlations with arbitrary boolean functions of undesired features?). I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
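To give a rough sense of why controlling for this via data augmentation looks hopeless (a back-of-the-envelope count of mine, not something from the post): there are 2^(2^k) distinct boolean functions of k undesired binary features, so the space of candidate spurious predictors explodes almost immediately.

```python
# Distinct boolean functions of k binary inputs: 2^(2^k).
# Decorrelating training data against all of them quickly becomes hopeless.
for k in range(1, 6):
    print(f"{k} undesired features -> {2 ** (2 ** k):,} boolean functions to control for")
```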
I’ll note that if it ends up these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques).[1]
my main claim is that it shouldn’t be surprising
If I had to articulate my reason for being surprised here, it’d be something like:
1. I didn’t expect LLMs to compute many XORs incidentally
2. I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway. So at least one of these two mechanisms is occurring a surprising (to me) amount. If there’s a lot more incidental computation, then why? (Based on Fabien’s experiments, maybe the answer is “there’s more redundancy than I expected,” which would be interesting.) If there’s a lot more intentional computation of XORs than I expected, then why? (I’ve found the speculation interesting that LLMs might just compute a bunch of XORs up front because they don’t know what they’ll need later.) I could just update my world model to “lots of XORs exist for either reason (1) or (2),” but I sure would be interested in knowing which of (1) or (2) it is and why.
I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
Yeah certainly I’d expect the crazy XOR directions aren’t too salient.
I’ll note that if it ends up these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I’m not against linear probing techniques in general. I like linear probes, they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don’t see a great reason to expect unsupervised methods to work better.
If I had to articulate my reason for being surprised here, it’d be something like:
1. I didn’t expect LLMs to compute many XORs incidentally
2. I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway.
This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don’t yet understand. So I shouldn’t feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn’t seem like an important point.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I agree with this! (And it’s what I was trying to say; sorry if I was unclear.) My point is that {features which are as crazy as “true according to Alice” (i.e., not too crazy)} seems potentially manageable, whereas {features which are as crazy as arbitrary boolean functions of other features} seems totally unmanageable.
Thanks, as always, for the thoughtful replies.