Are you saying that this claim is supported by PCA visualizations you’ve done?
Yes, but they’re not in the paper. (I also don’t remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)
I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.
It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the principal components).
More broadly, it seems like you’re saying that you expect that, in general, when LLMs have linearly-represented features a and b, they will also tend to linearly represent the feature a⊕b. Taking this as an empirical claim about current models, it would be shocking.
I don’t want to claim that this will always hold; models are messy, and there could be lots of confounders that make it fail in general. For example, the construction I mentioned uses 3 dimensions to represent 2 variables; maybe in some cases this is too expensive, and the model just uses 2 dimensions and gives up the ability to linearly read arbitrary functions of those 2 variables. Maybe it’s usually not helpful to compute boolean functions of 2 boolean variables, but in the specific case where you have a statement followed by Yes / No it’s especially useful (e.g. because the truth value of the Yes / No answer is the XOR of the statement’s truth value with whether the answer is No).
My guess is that this is a motif that will reoccur in other natural contexts as well. But we haven’t investigated this and I think of it as speculation.
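To make the 3-dimensional construction above concrete, here’s a toy sketch (mine, not anything from the paper): embed the boolean pair (a, b) as the vector (a, b, a⊕b). Each of a, b, and a⊕b is then readable with a fixed linear probe, whereas in the plain 2-dimensional embedding (a, b), a⊕b is the classic linearly-inseparable case.

```python
import numpy as np

# Embed the boolean pair (a, b) as the 3-dimensional vector (a, b, a XOR b).
points = np.array([[a, b, a ^ b] for a in (0, 1) for b in (0, 1)], dtype=float)

# Each feature is now a linear readout: a fixed direction plus a 0.5 threshold.
read_a   = np.array([1.0, 0.0, 0.0])
read_b   = np.array([0.0, 1.0, 0.0])
read_xor = np.array([0.0, 0.0, 1.0])

for v in points:
    a, b = int(v[0]), int(v[1])
    assert (v @ read_a   > 0.5) == bool(a)
    assert (v @ read_b   > 0.5) == bool(b)
    assert (v @ read_xor > 0.5) == bool(a ^ b)
print("a, b, and a XOR b are all linearly readable from the 3-dimensional embedding")
```

In a real model the three directions would of course not be axis-aligned; the point is just that 3 dimensions suffice for all three features to be linearly readable, while 2 do not.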
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a=0 vs 1 on a dataset where b=0, the resulting probe should get ~50% accuracy on a test dataset where b=1. And this should apply for any features a,b. But this is certainly not the typical case, at least as far as I can tell!
If you linearly represent a, b, and a⊕b, then given this training setup you could learn a classifier that detects the a direction or the a⊕b direction or some mixture between the two. In general I would expect that the a direction is more prominent / more salient / cleaner than the a⊕b direction, and so it would learn a classifier based on that, which would lead to ~100% accuracy on the test dataset.
If you use normalization to eliminate the a direction as done in CCS, then I expect you learn a classifier aligned with the a⊕b direction, and you get ~0% accuracy on the test dataset. This isn’t the typical result, but it also isn’t the typical setup; it’s uncommon to use normalization to eliminate particular directions.
(Similarly, if you don’t do the normalization step in CCS, my guess is that nearly all of our experiments would just show CCS learning the has_true(x) probe, rather than the has_true(x)⊕distractor(x) probe.)
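Here is a minimal synthetic sketch of this prediction and of the one in your question. The activation geometry below (the directions, scales, and noise level) is entirely made up for illustration, not measured from any model: activations carry a salient a = has_true direction and a weaker a⊕b direction. A plain logistic-regression probe trained on b=0 data latches onto a and transfers to b=1, while first subtracting the per-a-group mean (a CCS-style normalization that removes the constant a component) leaves the a⊕b signal as the separating feature, and the same probe then gets ~0% transfer accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 4000
dir_a = rng.normal(size=d)
dir_a /= np.linalg.norm(dir_a)
dir_xor = rng.normal(size=d)
dir_xor /= np.linalg.norm(dir_xor)

a = rng.integers(0, 2, n)                  # a = has_true(x), observable from the prompt
b = rng.integers(0, 2, n)                  # b = distractor(x)
X = (4.0 * a[:, None] * dir_a              # salient a feature
     + 2.0 * (a ^ b)[:, None] * dir_xor    # weaker a XOR b feature
     + 0.3 * rng.normal(size=(n, d)))      # noise

train, test = (b == 0), (b == 1)           # train where b = 0, test where b = 1

# 1) Plain supervised probe: latches onto the more salient a direction and transfers.
probe = LogisticRegression(max_iter=1000).fit(X[train], a[train])
print("no normalization:  ", probe.score(X[test], a[test]))    # ~1.0

# 2) CCS-style normalization: subtract the mean activation within each a-group,
#    removing the constant a component; the a XOR b component is what remains to
#    separate the training labels, and it flips sign on the b = 1 test set.
Xn = X.copy()
for v in (0, 1):
    Xn[a == v] -= Xn[a == v].mean(axis=0)
probe_n = LogisticRegression(max_iter=1000).fit(Xn[train], a[train])
print("with normalization:", probe_n.score(Xn[test], a[test])) # ~0.0
```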
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset {(true banana,1),(false banana,0)} will have poor accuracy when tested on {(true shed,1),(false shed,1)}?
These datasets are incredibly tiny (size two) so I’m worried about noise, but let’s say you pad the prompts with random sentences from some dataset to get larger datasets.
If you used normalization to remove the has_true direction, then yes, that’s what I’d predict. Without normalization I predict high test accuracy.
(Note there’s a typo in your test dataset—it should be (false shed,0).)
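To make that setup concrete, here’s a sketch of how such a padded dataset could be constructed. The filler sentences are hypothetical placeholders; in practice you’d sample them from a real corpus, and you’d then probe a model’s activations on these prompts:

```python
import random

random.seed(0)
# Hypothetical placeholder sentences; in practice, sample from a real text dataset.
filler = [
    "The sky was overcast that morning.",
    "She parked the car by the river.",
    "The committee met again on Tuesday.",
]

def make_dataset(distractor, n=200):
    """Padded prompts ending in 'true <distractor>' (label 1) or 'false <distractor>' (label 0)."""
    data = []
    for _ in range(n):
        pad = " ".join(random.choices(filler, k=3))
        data.append((f"{pad} true {distractor}", 1))
        data.append((f"{pad} false {distractor}", 0))
    return data

train = make_dataset("banana")  # {(... true banana, 1), (... false banana, 0)}
test = make_dataset("shed")     # {(... true shed, 1), (... false shed, 0)}
```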
Thanks for the detailed replies!