Prediction: token-level features and other extremely salient features get XOR’d with more things than less salient features do. And if you find less salient things which are linearly represented, a bunch of those won’t be XOR’d.
This resolves the exponential blow-up and should also be consistent with your experimental results (where all of the features under consideration are probably in the top 50 or so for salience).
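(A minimal sketch of how one might test this prediction, under some assumptions not in the thread: given residual-stream activations with binary labels for two candidate features, fit a linear probe on the XOR of the labels and compare it to probes for the individual features, repeating over feature pairs of varying salience. The arrays and the `probe_accuracy` helper below are illustrative stand-ins, not anything from the paper.)

```python
# Illustrative sketch (random stand-ins for real data): compare linear-probe
# accuracy on two binary features a, b and on their XOR. The prediction above
# is that the XOR probe works well when a or b is highly salient, and degrades
# for less salient feature pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 512
acts = rng.normal(size=(n, d))    # stand-in for residual-stream activations
a = rng.integers(0, 2, size=n)    # stand-in label, e.g. "statement is negated"
b = rng.integers(0, 2, size=n)    # stand-in label, e.g. "statement is false"

def probe_accuracy(X, y):
    """Held-out accuracy of a logistic-regression probe for binary labels y."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

print("acc(a):      ", probe_accuracy(acts, a))
print("acc(b):      ", probe_accuracy(acts, b))
print("acc(a XOR b):", probe_accuracy(acts, a ^ b))
```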
(Are you saying that you think factuality is one of the 50 most salient features when the model processes inputs like “The city of Chicago is not in Madagascar.”? I think I’d be pretty surprised by this.)
(To be clear, factuality is one of the most salient features relative to the cities/neg_cities datasets, but it seems like the right notion of salience here is relative to the full data distribution.)
Yes, that’s what I’m saying. I think this is right? Note that we only need salience on one side between false and true, so “true vs false” is salient as long as “false” is salient. I would guess that “this is false” is very salient for this type of data even for a normal pretrained LLM.
(Similarly, “this is English” isn’t salient in a dataset of only English, but is salient in a dataset with both English and Spanish: salience depends on variation. Really, the salient thing here is “this is Spanish” or “this is false”, and then the model will maybe XOR these with the other salient features. I think just doing the XOR on one “side” is sufficient for always being able to compute the XOR, but maybe I’m confused or thinking about this wrong.)
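(One way to make the “one side is enough” point precise, as a gloss rather than anything from the thread: for Boolean features $f$ and $b$,

$$f \oplus \lnot b \;=\; \lnot\,(f \oplus b),$$

so a direction that linearly reads out $f \oplus \text{false}$, i.e. the sign of $w^\top x + c$, also reads out $f \oplus \text{true}$ after negating $w$ and $c$. Representing the XOR against one side of a binary feature already determines the XOR against the other side.)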
Idk, I think I would guess that all of the most salient features will be things related to the meaning of the statement at a more basic level. E.g. things like: the statement is finished (i.e. isn’t an ongoing sentence), the statement is in English, the statement ends in a word which is the name of a country, etc.
My intuition here is mostly based on looking at lots of max-activating dataset examples for SAE features in smaller models (many of which relate to basic semantic categories for words or to basic syntax; a rough sketch of that kind of inspection is below), so it could be a poor guide here (both because of model size and because the period token might carry more meta-level “summarized” information about the preceding statement).
Anyway, not really a crux, I would agree with you for some not-too-much-larger value of 50.
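(For reference, a sketch of the kind of inspection mentioned above: collect the tokens on which one SAE feature activates most strongly and read off the contexts. The encoder weights here are random stand-ins, and `top_activating_tokens` is an illustrative helper, not an API from any particular SAE library.)

```python
# Illustrative sketch of collecting max-activating dataset examples for one SAE
# feature. The encoder weights below are random stand-ins; in practice they
# would come from a trained sparse autoencoder over residual-stream activations.
import heapq
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 4096
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)

def sae_feature_acts(acts):
    """ReLU(acts @ W_enc + b_enc): per-token activations of every SAE feature."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def top_activating_tokens(batches, feature_idx, k=20):
    """Keep the k tokens (with context) on which feature `feature_idx` fires hardest.

    `batches` is assumed to yield (acts, tokens, contexts) triples, where acts is
    an [n_tokens, d_model] array and tokens/contexts are matching lists of strings.
    """
    heap = []  # min-heap of (activation, token, context), size <= k
    for acts, tokens, contexts in batches:
        feats = sae_feature_acts(acts)[:, feature_idx]
        for i in np.argsort(feats)[-k:]:  # only the top-k of each batch can matter
            item = (float(feats[i]), tokens[i], contexts[i])
            if len(heap) < k:
                heapq.heappush(heap, item)
            else:
                heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True)
```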