There are 1,500 statements in each of cities and neg_cities, and LLaMA-2-13B has residual stream dimension 5120. The linear probes are trained with vanilla logistic regression on {80% of the data in cities} ∪ {80% of the data in neg_cities}, and the reported accuracies are evaluated on {the remaining 20% of cities} ∪ {the remaining 20% of neg_cities}.
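(For concreteness, here's a minimal sketch of that setup, not the actual code, assuming the residual-stream activations have already been extracted into hypothetical arrays acts_cities / acts_neg_cities with 0/1 truth labels labels_cities / labels_neg_cities.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical arrays: activations of shape (1500, 5120) and 0/1 truth labels
# per sub-dataset. acts_cities, labels_cities, acts_neg_cities, labels_neg_cities
# are assumed to be given.

def split_80_20(acts, labels, seed=0):
    """Random 80/20 split within one sub-dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(acts))
    cut = int(0.8 * len(acts))
    return (acts[idx[:cut]], labels[idx[:cut]]), (acts[idx[cut:]], labels[idx[cut:]])

cities_train, cities_test = split_80_20(acts_cities, labels_cities)
neg_train, neg_test = split_80_20(acts_neg_cities, labels_neg_cities)

# Train on {80% of cities} ∪ {80% of neg_cities}, evaluate on the remaining 20% of each.
X_train = np.concatenate([cities_train[0], neg_train[0]])
y_train = np.concatenate([cities_train[1], neg_train[1]])
X_test = np.concatenate([cities_test[0], neg_test[0]])
y_test = np.concatenate([cities_test[1], neg_test[1]])

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```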
So, yeah, I guess that the train and val sets are drawn from the same distribution but are not independent (because of the issue I mentioned in my comment above). Oops! I guess I never thought about how with small datasets, doing an 80⁄20 train/test split can actually introduce dependencies between the train and test data. (Also yikes, I see people do this all the time.)
Anyway, it seems to me that this is enough to explain the <50% accuracies—do you agree?
Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I’ve rerun the probing experiments. The possible labels are:
has_true: is the second-to-last token “true” or “false”?
has_banana: is the last token “banana” or “shed”?
label: is the third-to-last token the most likely or the 100th most likely?
(This weird last option is because I’m adapting a dataset from the Geometry of Truth paper about likely vs. unlikely text.)
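For illustration, here is roughly how one such input could be assembled and how the three labels line up with the final token positions. This is a hedged sketch, not the actual dataset-construction code: it assumes a HuggingFace-style causal LM model and tokenizer for LLaMA-2-13B are loaded, that prompt_ids is the 1-D tensor of token ids for the temperature-0.8 sample, and that each appended word is a single token.

```python
import torch

def make_example(prompt_ids, use_most_likely, append_true, append_banana):
    # Next-token logits at the end of the sampled text.
    with torch.no_grad():
        logits = model(prompt_ids.unsqueeze(0)).logits[0, -1]
    ranked = logits.argsort(descending=True)
    next_id = int(ranked[0] if use_most_likely else ranked[99])  # 1st vs. 100th most likely

    text = tokenizer.decode(prompt_ids) + tokenizer.decode([next_id])
    text += " true" if append_true else " false"       # -> has_true
    text += " banana" if append_banana else " shed"    # -> has_banana
    return text, use_most_likely                        # second element is `label`
```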
Here are the results for LLaMA-2-13B:
And here are the results for the reset network:
I was a bit surprised that the model did so badly on has_true, but in hindsight, considering that the activations are extracted at the last token while “true”/“false” is the penultimate token, this seems fine.
Mostly I view this as a sanity check to make sure that when the dataset is larger we don’t get the <<50% probe accuracies. I think to really dig into this more, one would need to do this with features which are not token-level and which are unambiguously linearly accessible (unlike the “label” feature here).
@ryan_greenblatt @abhatt349 @Fabien Roger
Yep, I think I agree; I didn’t understand the point you made about systematic anti-correlation originally.
If I understand correctly, the issue is something like:
There are 10 India-related statements, exactly 5 of which are false and 5 of which are true.
We do a random split of all the data, so if there are more true+India statements in train, there will be more false+India statements in test (a minimal simulation of this is sketched below).
There are, of course, various fixes to make the data actually IID.
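To make that concrete, here's a small self-contained simulation (not tied to the actual data) of one topic with a fixed 5/5 true/false balance under a random 80/20 split; the counts of true statements in train and test are perfectly anti-correlated:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 statements about one topic: exactly 5 true (1) and 5 false (0).
labels = np.array([1] * 5 + [0] * 5)

train_true, test_true = [], []
for _ in range(10_000):
    perm = rng.permutation(10)
    train, test = labels[perm[:8]], labels[perm[8:]]  # random 80/20 split
    train_true.append(train.sum())
    test_true.append(test.sum())

# train_true + test_true = 5 on every draw, so the correlation is exactly -1.
print(np.corrcoef(train_true, test_true)[0, 1])  # ≈ -1.0
```

So any topic-level cue the probe latches onto in train is systematically reversed in test, which is the kind of effect that can push test accuracy below 50%.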
If it’s easy enough to run, it seems worth re-training the probes exactly the same way, except sampling both your train and test sets with replacement from the full dataset. This should avoid that issue. It has the downside of allowing some train/test leakage, but that seems pretty fine, especially if you only sample like 500 examples for train and 100 for test (from each of cities and neg_cities).
I’d strongly hope that after doing this, none of your probes would be significantly below 50%.
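If it helps, here's a sketch of that resampling scheme, reusing the hypothetical acts/labels arrays from the earlier snippet and the sizes suggested above (500 train and 100 test from each of cities and neg_cities, drawn with replacement):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_with_replacement(acts, labels, n):
    idx = rng.integers(0, len(acts), size=n)  # i.i.d. draws; duplicates allowed
    return acts[idx], labels[idx]

# Train and test sets are each drawn independently with replacement from the
# *full* sub-datasets, so a little train/test leakage is possible.
Xc_tr, yc_tr = sample_with_replacement(acts_cities, labels_cities, 500)
Xn_tr, yn_tr = sample_with_replacement(acts_neg_cities, labels_neg_cities, 500)
Xc_te, yc_te = sample_with_replacement(acts_cities, labels_cities, 100)
Xn_te, yn_te = sample_with_replacement(acts_neg_cities, labels_neg_cities, 100)

X_train, y_train = np.concatenate([Xc_tr, Xn_tr]), np.concatenate([yc_tr, yn_tr])
X_test, y_test = np.concatenate([Xc_te, Xn_te]), np.concatenate([yc_te, yn_te])

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```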