I reran my experiments from above on a “reset” version of LLaMA-2-13B. What this means is that, for each parameter in LLaMA-2-13B, I shuffled that parameter’s weights by permuting them along the last dimension.
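For concreteness, here is a minimal sketch of how such a reset could be implemented with PyTorch and transformers. I’m assuming a single random permutation of the last-dimension indices per parameter tensor (rather than, say, an independent shuffle per row), and the model name is just the standard HF identifier:

```python
import torch
from transformers import AutoModelForCausalLM

# Load LLaMA-2-13B (weights assumed available locally or via the HF hub).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

# "Reset" the network: apply an independent random permutation to the last
# dimension of every parameter tensor. This keeps each parameter's weight
# values but scrambles how they line up with the rest of the network.
with torch.no_grad():
    for param in model.parameters():
        perm = torch.randperm(param.shape[-1])
        param.copy_(param[..., perm])
```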
Why do you get <50% accuracy for any of the categories? Shouldn’t a probe trained with any reasonable loss function always get >50% accuracy on any binary classification task?
I’m not really sure, but I don’t think this is that surprising. I think when we try to fit a probe to “label” (the truth value of the statement), this is probably like fitting a linear probe to random data. It might overfit on some token-level heuristic which is idiosyncratically good on the train set but generalizes poorly to the val set. E.g. if disproportionately many statements containing “India” are true in the train set, then it might learn to label statements containing “India” as true; but since there is no correlation between “India” and truth in the full dataset, the correlation between “India” and truth in the val set will necessarily have the opposite sign.
Are the training and val sets not IID? Are they small enough that we either get serious overfitting or huge error bars?
If the datasets are IID and large and the loss function is reasonable, then if there is only noise, the probe should learn to always predict the more common class and not have any variance. This should always result in >50% accuracy.
There are 1,500 statements in each of cities and neg_cities, and LLaMA-2-13B has residual stream dimension 5120. The linear probes are trained with vanilla logistic regression on {80% of the data in cities} ∪ {80% of the data in neg_cities}, and the accuracies reported are evaluated on {the remaining 20% of the data in cities} ∪ {the remaining 20% of the data in neg_cities}.
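For reference, a minimal sketch of this probing setup, assuming the residual-stream activations and truth labels have already been extracted into arrays (the file names are illustrative, and sklearn’s LogisticRegression is standing in for “vanilla logistic regression”, though its default does include mild L2 regularization):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative arrays: residual-stream activations (n, 5120) and binary truth
# labels (n,) for the cities and neg_cities statements, extracted beforehand.
cities_acts = np.load("cities_acts.npy")
cities_labels = np.load("cities_labels.npy")
neg_acts = np.load("neg_cities_acts.npy")
neg_labels = np.load("neg_cities_labels.npy")

# 80/20 split within each dataset, then pool the train halves and the val halves.
c_tr, c_va, cl_tr, cl_va = train_test_split(cities_acts, cities_labels,
                                            test_size=0.2, random_state=0)
n_tr, n_va, nl_tr, nl_va = train_test_split(neg_acts, neg_labels,
                                            test_size=0.2, random_state=0)
X_train, y_train = np.concatenate([c_tr, n_tr]), np.concatenate([cl_tr, nl_tr])
X_val, y_val = np.concatenate([c_va, n_va]), np.concatenate([cl_va, nl_va])

probe = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print("val accuracy:", probe.score(X_val, y_val))
```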
So, yeah, I guess that the train and val sets are drawn from the same distribution but are not independent (because of the issue I mentioned in my comment above). Oops! I guess I never thought about how with small datasets, doing an 80⁄20 train/test split can actually introduce dependencies between the train and test data. (Also yikes, I see people do this all the time.)
Anyway, it seems to me that this is enough to explain the <50% accuracies—do you agree?
Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I’ve rerun the probing experiments. The possible labels are:
has_true: is the second-to-last token “true” or “false”?
has_banana: is the last token “banana” or “shed”?
label: is the third-to-last token the most likely or the 100th most likely next token?
(this weird last option is because I’m adapting a dataset from the Geometry of Truth paper about likely vs. unlikely text).
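In case the construction is unclear, here is a hypothetical sketch of how a single input and its three labels could be assembled (the prefix string, spacing, and tokenization details are guesses for illustration, not the actual dataset code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

prefix = "The weather in Paris today is"   # stand-in for text sampled from the model at T=0.8
ids = tok(prefix, return_tensors="pt").input_ids

with torch.no_grad():
    next_logits = model(ids).logits[0, -1]         # next-token logits after the prefix
ranked = torch.argsort(next_logits, descending=True)

use_top = bool(torch.rand(1) < 0.5)                # "label": top-1 vs. 100th-most-likely token
next_tok = ranked[0] if use_top else ranked[99]
has_true = bool(torch.rand(1) < 0.5)               # "has_true"
has_banana = bool(torch.rand(1) < 0.5)             # "has_banana"

text = (prefix + tok.decode([int(next_tok)])
        + (" true" if has_true else " false")
        + (" banana" if has_banana else " shed"))
labels = {"label": use_top, "has_true": has_true, "has_banana": has_banana}
```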
Here are the results for LLaMA-2-13B:
And here are the results for the reset network:
I was a bit surprised that the model did so badly on has_true, but in hindsight, considering that the activations are extracted over the last token and “true”/”false” is the penultimate token, this seems fine.
Mostly I view this as a sanity check to make sure that when the dataset is larger we don’t get the <<50% probe accuracies. I think to really dig into this more, one would need to do this with features which are not token-level and which are unambiguously linearly accessible (unlike the “label” feature here).
@ryan_greenblatt @abhatt349 @Fabien Roger
Yep, I think I agree; I didn’t originally understand the point you made about systematic anti-correlation.
If I understand correctly, the issue is something like:
There are 10 India-related statements, exactly 5 of which are false and 5 of which are true.
We do a random split of all the data, so if there are more true+India statements in train, there will be more false+India statements in test.
There are of course various fixes to make the data actually IID.
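To make the mechanism concrete, here is a toy simulation of the India example above (the 10-statement, 5-true/5-false numbers come from the example; everything else is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: 10 India-related statements, exactly 5 true and 5 false.
labels = np.array([1] * 5 + [0] * 5)

for trial in range(5):
    perm = rng.permutation(10)
    train_idx, test_idx = perm[:8], perm[8:]          # random 80/20 split
    train_excess = int(labels[train_idx].sum()) - 4   # true-count above the expected 4
    test_excess = int(labels[test_idx].sum()) - 1     # true-count above the expected 1
    print(f"trial {trial}: train excess {train_excess:+d}, test excess {test_excess:+d}")

# Because the pool is fixed at 5 true / 5 false, the two excesses always sum to
# zero: any "India => true" signal picked up from train is reversed in test.
```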
If it’s easy enough to run, it seems worth re-training the probes exactly the same way, except sampling both your train and test sets with replacement from the full dataset. This should avoid that issue. It has the downside of allowing some train/test leakage, but that seems pretty fine, especially if you only sample like 500 examples for train and 100 for test (from each of cities and neg_cities).
I’d strongly hope that after doing this, none of your probes would be significantly below 50%.
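A minimal sketch of that resampling scheme, reusing the illustrative file names from the probing snippet above and the 500/100 sample sizes suggested here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Same illustrative arrays as before: activations and truth labels for the
# full cities and neg_cities datasets.
cities_acts = np.load("cities_acts.npy")
cities_labels = np.load("cities_labels.npy")
neg_acts = np.load("neg_cities_acts.npy")
neg_labels = np.load("neg_cities_labels.npy")

def sample(acts, labels, n):
    """Draw n examples uniformly at random with replacement."""
    idx = rng.integers(0, len(labels), size=n)
    return acts[idx], labels[idx]

# Train and test sets are drawn independently (with replacement) from the full
# datasets, so some train/test leakage is possible, as noted above.
train_parts = [sample(cities_acts, cities_labels, 500), sample(neg_acts, neg_labels, 500)]
test_parts = [sample(cities_acts, cities_labels, 100), sample(neg_acts, neg_labels, 100)]

X_train = np.concatenate([x for x, _ in train_parts])
y_train = np.concatenate([y for _, y in train_parts])
X_test = np.concatenate([x for x, _ in test_parts])
y_test = np.concatenate([y for _, y in test_parts])

probe = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print("test accuracy:", probe.score(X_test, y_test))
```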