Sam Marks comments on What’s up with LLMs representing XORs of arbitrary features?

Sam Marks 6 Jan 2024 0:28 UTC
LW: 14 AF: 8
Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I’ve rerun the probing experiments. The possible labels are
- has_true: is the second-to-last token “true” or “false”?
- has_banana: is the last token “banana” or “shed”?
- label: is the third-to-last token the most likely or the 100th most likely?
(This odd last option exists because I’m adapting a dataset from the Geometry of Truth paper about likely vs. unlikely text.)
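To make the label definitions concrete, here is a minimal sketch of how the three base labels, plus their pairwise XORs, could be derived for one input. It assumes (following the parent post, not anything stated in this comment) that the probed targets include pairwise XORs of the base features. The function names `base_labels` and `with_xors` are hypothetical, not taken from my actual code.

```python
from itertools import combinations


def base_labels(is_likely: bool, truth_token: str, last_token: str) -> dict:
    """Base labels for one input; argument names are illustrative only."""
    return {
        "label": is_likely,                     # third-to-last token is the most likely one
        "has_true": truth_token == "true",      # second-to-last token
        "has_banana": last_token == "banana",   # last token
    }


def with_xors(labels: dict) -> dict:
    """Add the XOR of every pair of base labels (assumed probing targets)."""
    out = dict(labels)
    for a, b in combinations(sorted(labels), 2):
        out[f"{a} xor {b}"] = labels[a] != labels[b]
    return out


example = with_xors(base_labels(is_likely=True, truth_token="false", last_token="banana"))
```

For an input ending in [most likely token] ["false"] ["banana"], this yields three base labels and three XOR labels, each of which would be a separate probing target.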
Here are the results for LLaMA-2-13B
And here are the results for the reset network
I was a bit surprised that the model did so badly on has_true, but in hindsight this seems fine: the activations are extracted over the last token, and “true”/“false” is the penultimate token.
Mostly I view this as a sanity check that, with a larger dataset, we don’t get the <<50% probe accuracies. To really dig into this, I think one would need features which are not token-level and which are unambiguously linearly accessible (unlike the “label” feature here).
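For anyone who wants to reproduce the shape of this sanity check without model activations, here is a rough sketch on synthetic data. Everything in it is an assumption for illustration: the "activations" are Gaussian noise plus two planted linear directions, and the probe is a plain least-squares fit rather than whatever probe you prefer. It is not my actual setup and says nothing about LLaMA.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 16  # number of synthetic inputs, activation dimension (both arbitrary)

# Two binary features, each planted along its own axis of the fake activations.
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)
u = np.zeros(d)
u[0] = 1.0
v = np.zeros(d)
v[1] = 1.0
X = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * a - 1, u) + 3.0 * np.outer(2 * b - 1, v)


def probe_accuracy(X, y, n_train=1000):
    """Fit a least-squares linear probe on the first n_train rows, score on the rest."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    t = 2.0 * y - 1.0                          # map {0, 1} labels to {-1, +1} targets
    w, *_ = np.linalg.lstsq(Xb[:n_train], t[:n_train], rcond=None)
    pred = Xb[n_train:] @ w > 0
    return float(np.mean(pred == (y[n_train:] == 1)))


acc_a = probe_accuracy(X, a)        # linearly planted feature
acc_xor = probe_accuracy(X, a ^ b)  # XOR is uncorrelated with every direction here
```

In this toy setup the planted feature should be read off almost perfectly, while the least-squares probe for the XOR should land near 50% on held-out data, since nothing here represents the XOR linearly. Held-out accuracy far below 50% on a balanced label would point to a problem such as too little data rather than to anything about the features.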
@ryan_greenblatt @abhatt349 @Fabien Roger