The thing that’s confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything.
Idk, I think it’s pretty hard to know what things are and aren’t useful for predicting the next token. For example, some of your features involve XORing with a “has_not” feature—XORing with an indicator for “not” might be exactly what you want to do to capture the effect of the “not”.
(Tbc here the hypothesis could be “the model computes XORs with has_not all the time, and then uses only some of them”, so it does have some aspect of “compute lots of XORs”, but it is still a hypothesis that clearly by default doesn’t produce multiway XORs.)
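A minimal illustration of why this particular XOR is plausibly useful (a toy sketch, not from the experiments under discussion): on booleans, applying a negation is exactly an XOR with the negation indicator, so the truth value of a possibly-negated sentence is the claim's truth value XOR has_not.

```python
# Toy truth table: on booleans, negation is exactly XOR with the "has_not" indicator,
# so the truth value of a possibly-negated sentence is claim_is_true XOR has_not.
for claim_is_true in (False, True):
    for has_not in (False, True):
        sentence_is_true = claim_is_true ^ has_not
        print(f"claim_is_true={claim_is_true!s:<5}  has_not={has_not!s:<5}  "
              f"sentence_is_true={sentence_is_true}")
```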
In contrast, the point I’m trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1]
If you want you could rephrase this issue as “a and a⊕b are spuriously correlated in training,” so I guess I should say “even in the absence of spurious correlations among basic features.”
… That’s exactly how I would rephrase the issue and I’m not clear on why you’re making a sharp distinction here.
As you noted, it will sometimes be the case that XOR features are more like basic features than derived features, and thus will be represented with high salience. I think incidental hypotheses will have a really hard time explaining this—do you agree?
I mean, I’d say the ones that are more like basic features are like that because it was useful, and it’s all the other XORs that are explained by incidental hypotheses. The incidental hypotheses shouldn’t be taken to be saying that all XORs are incidental, just the ones which aren’t explained by utility. Perhaps a different way of putting it is that I expect both utility and incidental hypotheses to be true to some extent.
Maybe on your model this is something simple like the weights computing the basic features being larger than weights computing derived features? If so, that’s the tracking I’m talking about, and is a potential thread to pull on for distinguishing basic vs. derived features using model internals.
Yes, on my model it could be something like the weights for basic features being large. It’s not necessarily that simple, e.g. it could also be that the derived features are in superposition with a larger number of other features, which leads to more interference. If you’re calling that “tracking”, fair enough I guess; my main claim is that it shouldn’t be surprising. I agree it’s a potential thread for distinguishing such features.
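A toy sketch of the interference point (the dimensions and feature counts here are made up for illustration, not measured from any model): if a feature direction shares a d-dimensional space with many other roughly random feature directions, the linear readout along that direction accumulates noise that grows with the number of co-active features, which is one concrete way a derived feature could end up less salient.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                        # "residual stream" dimension (made-up number)
n_other = [10, 100, 1000]      # how many other features share the space

target = rng.standard_normal(d)
target /= np.linalg.norm(target)

for k in n_other:
    others = rng.standard_normal((k, d))
    others /= np.linalg.norm(others, axis=1, keepdims=True)
    readouts = []
    for _ in range(200):
        on = rng.random(k) < 0.5                 # a random half of the other features are active
        act = target + others[on].sum(axis=0)    # target feature on, plus interfering features
        readouts.append(act @ target)            # linear readout along the target direction
    print(f"{k:5d} interfering features: readout mean={np.mean(readouts):.2f}, "
          f"std={np.std(readouts):.2f}")
```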
Idk, I think it’s pretty hard to know what things are and aren’t useful for predicting the next token. For example, some of your features involve XORing with a “has_not” feature—XORing with an indicator for “not” might be exactly what you want to do to capture the effect of the “not”.
I agree that “the model has learned the algorithm ‘always compute XORs with has_not’” is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of “clearly not useful XORs” I was thinking of was has_true XOR has_banana, where I’m guessing you’re anticipating that this XOR exists incidentally.
If you want you could rephrase this issue as “a and a⊕b are spuriously correlated in training,” so I guess I should say “even in the absence of spurious correlations among basic features.”
… That’s exactly how I would rephrase the issue and I’m not clear on why you’re making a sharp distinction here.
Focusing again on the Monster gridworld setting, here are two different ways that your goals could misgeneralize:
1. player_has_shield is spuriously correlated with high_score during training, so the agent comes to value both.
2. monster_present XOR high_score is spuriously correlated with high_score during training, so the agent comes to value both.
These are pretty different things that could go wrong. Before realizing that these crazy XOR features existed, I would only have worried about (1); now that I know these crazy XOR features exist … I think I mostly don’t need to worry about (2), but I’m not certain and it might come down to details about the setting. (Indeed, your CCS challenges work has shown that sometimes these crazy XOR features really can get in the way!)
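A minimal sketch of failure mode (2) (my toy setup with made-up directions and an sklearn probe standing in for whatever reads these features out of the agent, not the actual Monster gridworld experiments): if monsters never appear in training, high_score and monster_present XOR high_score coincide on the training distribution, so a linear readout can split its weight between the two directions and then degrade to chance once monsters show up, even though no two basic features were spuriously correlated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 2000

# Orthogonal directions for three binary features (hypothetical toy encoding).
dirs = np.linalg.qr(rng.standard_normal((d, 3)))[0].T
u_shield, u_high, u_xor = dirs   # player_has_shield, high_score, monster_present XOR high_score

def make_data(monster_present):
    high_score = rng.integers(0, 2, n)
    shield = rng.integers(0, 2, n)
    xor = high_score ^ monster_present
    X = (shield[:, None] * u_shield
         + high_score[:, None] * u_high
         + xor[:, None] * u_xor
         + 0.1 * rng.standard_normal((n, d)))
    return X, high_score

# Training: monsters never appear, so high_score and the XOR feature coincide.
X_tr, y_tr = make_data(monster_present=0)
# Deployment: monsters appear, and the two features come apart.
X_te, y_te = make_data(monster_present=1)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("train acc:", probe.score(X_tr, y_tr))   # ~1.0
print("test  acc:", probe.score(X_te, y_te))   # ~0.5: the probe split its weight between
                                               # high_score and the XOR direction
```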
I agree that you can think of this issue as just being the consequence of the two issues “there are lots of crazy XOR features” and “linear probes can pick up on spurious correlations,” but this issue feels qualitatively new to me because it seems pretty intractable to deal with at the data augmentation level (how do you control for spurious correlations with arbitrary boolean functions of undesired features?). I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
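For a sense of scale (a back-of-the-envelope addition, not from the original comment): with n undesired binary features there are 2^(2^n) boolean functions of them, so enumerating the potential spurious correlates for augmentation blows up doubly exponentially.

```python
# Number of distinct boolean functions of n binary features: 2 ** (2 ** n).
for n in (1, 2, 3, 4, 5):
    print(n, 2 ** (2 ** n))   # 4, 16, 256, 65536, 4294967296
```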
I’ll note that if it ends up these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques).[1]
my main claim is that it shouldn’t be surprising
If I had to articulate my reason for being surprised here, it’d be something like:
1. I didn’t expect LLMs to compute many XORs incidentally
2. I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway. So at least one of these two mechanisms is occurring to a surprising (to me) extent. If there’s a lot more incidental computation, then why? (Based on Fabian’s experiments, maybe the answer is “there’s more redundancy than I expected,” which would be interesting.) If there’s a lot more intentional computation of XORs than I expected, then why? (I’ve found interesting the speculation that LLMs might just compute a bunch of XORs up front because they don’t know what they’ll need later.) I could just update my world model to “lots of XORs exist for either reason (1) or (2),” but I sure would be interested in knowing which of (1) or (2) it is and why.
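One way to make the incidental mechanism concrete (a toy sketch; the random untrained ReLU layer below is a stand-in assumption for whatever incidental nonlinear mixing a real network does, not a claim about LLM internals or a reproduction of Fabian’s experiments): a linear probe cannot read a XOR b off a purely linear encoding of a and b, but after even a random nonlinearity the XOR often becomes linearly decodable without anything having optimized for it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_in, d_hidden = 4000, 32, 1024

# Two binary features (think has_true, has_banana) encoded along random directions.
u_a, u_b = rng.standard_normal((2, d_in))
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
X = a[:, None] * u_a + b[:, None] * u_b + 0.2 * rng.standard_normal((n, d_in))
y = a ^ b

# A random, untrained ReLU layer, standing in for incidental nonlinear mixing.
W = rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in)
H = np.maximum(X @ W, 0.0)

split = n // 2
for name, feats in [("linear encoding (pre-ReLU) ", X), ("random features (post-ReLU)", H)]:
    probe = LogisticRegression(max_iter=2000).fit(feats[:split], y[:split])
    acc = probe.score(feats[split:], y[split:])
    print(f"{name}  held-out acc for a XOR b: {acc:.2f}")
# Expected: near chance on the linear encoding, typically well above chance on the
# random nonlinear features, even though nothing was trained to compute the XOR.
```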
I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
Yeah certainly I’d expect the crazy XOR directions aren’t too salient.
I’ll note that if it ends up these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I’m not against linear probing techniques in general. I like linear probes, they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don’t see a great reason to expect unsupervised methods to work better.
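A sketch of what using contrast pairs in a supervised way could look like (the function and array names here are hypothetical, not from any existing codebase): build the same paired representations CCS uses, but fit the probe with ground-truth labels via logistic regression rather than an unsupervised consistency loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs (names are illustrative):
#   acts_pos[i], acts_neg[i]: activations for the two halves of contrast pair i,
#                             e.g. "<statement> True" / "<statement> False"
#   labels[i]:                ground-truth 1 if the statement is true, else 0
def train_supervised_contrast_probe(acts_pos, acts_neg, labels):
    # Same contrast-pair construction, but fit with supervision instead of an
    # unsupervised consistency objective.
    diffs = acts_pos - acts_neg
    probe = LogisticRegression(max_iter=1000)
    probe.fit(diffs, labels)
    return probe

# Toy usage with random stand-in activations, just to show the shapes involved.
rng = np.random.default_rng(0)
n, d = 500, 128
labels = rng.integers(0, 2, n)
truth_dir = rng.standard_normal(d)
acts_pos = labels[:, None] * truth_dir + rng.standard_normal((n, d))
acts_neg = (1 - labels[:, None]) * truth_dir + rng.standard_normal((n, d))
probe = train_supervised_contrast_probe(acts_pos, acts_neg, labels)
print("train acc:", probe.score(acts_pos - acts_neg, labels))
```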
If I had to articulate my reason for being surprised here, it’d be something like:
1. I didn’t expect LLMs to compute many XORs incidentally
2. I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway.
This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don’t yet understand. So I shouldn’t feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn’t seem like an important point.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I agree with this! (And it’s what I was trying to say; sorry if I was unclear.) My point is that { features which are as crazy as “true according to Alice” (i.e., not too crazy) } seems potentially manageable, whereas { features which are as crazy as arbitrary boolean functions of other features } seems totally unmanageable.
Thanks, as always, for the thoughtful replies.