Idk, I think it’s pretty hard to know what things are and aren’t useful for predicting the next token. For example, some of your features involve XORing with a “has_not” feature—XORing with an indicator for “not” might be exactly what you want to do to capture the effect of the “not”.
I agree that “the model has learned the algorithm ‘always compute XORs with has_not’” is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of “clearly not useful XORs” I was thinking of was has_true XOR has_banana, where I’m guessing you’re anticipating that this XOR exists incidentally.
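For concreteness, here is a toy sketch (synthetic binary features in plain numpy, not real LLM activations) of why a linear probe successfully reading out a⊕b is evidence the model computed the XOR somewhere: XOR is not linearly expressible in a and b themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4_000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
y = a ^ b  # the XOR feature a probe might look for

def probe_acc(x, y, steps=2000, lr=0.5):
    """Fit a logistic-regression probe (with bias) by gradient descent; return train accuracy."""
    x = np.column_stack([x, np.ones(len(x))]).astype(float)
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - y) / len(x)
    return ((x @ w > 0) == y).mean()

# From the raw features alone, no linear probe can express a XOR b,
# so accuracy stays far below perfect:
print(probe_acc(np.column_stack([a, b]), y))
# If the model has already computed the XOR as its own feature/direction,
# a linear probe reads it off essentially perfectly:
print(probe_acc(np.column_stack([a, b, a ^ b]), y))
```

So whenever an XOR probe works on real activations, the nonlinearity had to happen inside the model, whether incidentally or on purpose.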
If you want, you could rephrase this issue as “a and a⊕b are spuriously correlated in training,” so I guess I should say “even in the absence of spurious correlations among basic features.”
… That’s exactly how I would rephrase the issue and I’m not clear on why you’re making a sharp distinction here.
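That rephrasing is easy to check numerically: with a and b independent but b imbalanced, a and a⊕b come out strongly correlated. A quick sketch with synthetic binary features:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# a and b are independent; a is balanced, b is present 90% of the time.
a = (rng.random(n) < 0.5).astype(int)
b = (rng.random(n) < 0.9).astype(int)

# Yet a and a XOR b are strongly (anti)correlated: when b is almost
# always 1, a XOR b is almost always "not a".
corr = np.corrcoef(a, a ^ b)[0, 1]
print(f"corr(a, a XOR b) = {corr:.2f}")  # ≈ -0.8 for P(b=1) = 0.9
```

(The correlation vanishes only when b is exactly balanced, which is the sense in which this is a spurious-correlation story.)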
Focusing again on the Monster gridworld setting, here are two different ways that your goals could misgeneralize:
1. player_has_shield is spuriously correlated with high_score during training, so the agent comes to value both.
2. monster_present XOR high_score is spuriously correlated with high_score during training, so the agent comes to value both.
These are pretty different things that could go wrong. Before realizing that these crazy XOR features existed, I would only have worried about (1); now that I know these crazy XOR features exist … I think I mostly don’t need to worry about (2), but I’m not certain and it might come down to details about the setting. (Indeed, your CCS challenges work has shown that sometimes these crazy XOR features really can get in the way!)
I agree that you can think of this issue as just the consequence of the two issues “there are lots of crazy XOR features” and “linear probes can pick up on spurious correlations,” but this issue feels qualitatively new to me because it seems pretty intractable to deal with at the data-augmentation level (how do you control for spurious correlations with arbitrary boolean functions of undesired features?). I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
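A minimal illustration (synthetic features) of why augmentation can’t cover this: balance every undesired feature individually, and an XOR of them can still be perfectly predictive of the target, so per-feature balancing never even notices the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two undesired features, each individually balanced (so per-feature
# augmentation has nothing left to fix)...
b1 = rng.integers(0, 2, n)
b2 = rng.integers(0, 2, n)
# ...and a training target that happens to coincide with their XOR.
target = b1 ^ b2

print(np.corrcoef(target, b1)[0, 1])       # ~0: b1 alone looks harmless
print(np.corrcoef(target, b2)[0, 1])       # ~0: so does b2
print(np.corrcoef(target, b1 ^ b2)[0, 1])  # 1.0: the XOR is perfectly predictive
# With k undesired features there are 2**(2**k) boolean functions of them,
# so "balance against every such function" is not a workable augmentation plan.
```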
I’ll note that if it ends up these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques).[1]
my main claim is that it shouldn’t be surprising
If I had to articulate my reason for being surprised here, it’d be something like:
1. I didn’t expect LLMs to compute many XORs incidentally
2. I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway. So at least one of these two mechanisms is occurring a surprising (to me) amount. If there’s a lot more incidental computation than I expected, then why? (Based on Fabian’s experiments, maybe the answer is “there’s more redundancy than I expected,” which would be interesting.) If there’s a lot more intentional computation of XORs than I expected, then why? (I find it an interesting speculation that LLMs might just be computing a bunch of XORs up front because they don’t know what they’ll need later.) I could just update my world model to “lots of XORs exist for either reason (1) or (2),” but I sure would be interested in knowing which of (1) or (2) it is and why.
I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
Yeah, certainly I’d expect that the crazy XOR directions aren’t too salient.
I’ll note that if it ends up these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I’m not against linear probing techniques in general. I like linear probes; they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don’t see a great reason to expect unsupervised methods to work better.
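To sketch what I mean by the supervised version (synthetic stand-in data: in practice x might be activation differences across contrast pairs and y human-provided truth labels; the planted "truth direction" here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2_000, 16

# Hypothetical stand-in for probe inputs: a "truth" direction planted in
# Gaussian noise. Real inputs would be LLM hidden states (e.g. contrast-pair
# activation differences), with y coming from labeled examples.
truth_dir = rng.normal(size=d)
y = rng.integers(0, 2, n)
x = rng.normal(size=(n, d)) + np.where(y[:, None] == 1, 1.0, -1.0) * truth_dir

# Supervised linear probe: logistic regression by plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(x @ w)))
    w -= 0.1 * x.T @ (p - y) / n

acc = ((x @ w > 0) == y).mean()
print(f"supervised probe accuracy: {acc:.2f}")
```

The labels pin down which direction you get, which is exactly the disambiguation an unsupervised objective has to hope falls out of the consistency structure.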
If I had to articulate my reason for being surprised here, it’d be something like:
1. I didn’t expect LLMs to compute many XORs incidentally
2. I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway.
This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don’t yet understand. So I shouldn’t feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn’t seem like an important point.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I agree with this! (And it’s what I was trying to say; sorry if I was unclear.) My point is that {features which are as crazy as “true according to Alice” (i.e., not too crazy)} seems potentially manageable, whereas {features which are as crazy as arbitrary boolean functions of other features} seems totally unmanageable.
Thanks, as always, for the thoughtful replies.