mishajw comments on How well do truth probes generalise?

mishajw 25 Feb 2024 19:50 UTC
3 points
0
Cool to see the generalisation results for Llama-2 7/13/70B! I originally ran some of these experiments on 7B and got very different results, that PCA plot of 7B looks familiar (and bizarre). Excited to read the paper in its entirety. The first GoT paper was very good.
One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the “neg_” versions of the datasets from GoT, containing negated statements like “The city of Beijing is not in China.” For these datasets, the correlation between truth value and likelihood (operationalized as LLaMA-2-70B’s log probability of the full statement) is strong and negative (-0.63 for neg_cities and -.89 for neg_sp_en_trans). But truth probes still often generalize well to these negated datsets. Here are results for LLaMA-2-70B (the horizontal axis shows the train set, and the vertical axis shows the test set).
This is an interesting approach! I suppose there are two things we want to separate: “truth” from likely statements, and “truth” from what humans think (under some kind of simulacra framing). I think this approach would allow you to do the former, but not the latter. And to be honest, I’m not confident on TruthfulQA’s ability to do the latter either.
P.S. I realised an important note got removed while editing this post—added back, but FYI:
We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don’t do this for cities_cities_{conj,disj} and leave them unpaired.
- Sam Marks 25 Feb 2024 20:05 UTC
  2 points
  0
  Parent
  I originally ran some of these experiments on 7B and got very different results, that PCA plot of 7B looks familiar (and bizarre).
  I found that the PCA plot for 7B for larger_than and smaller_than individually looked similar to that for 13B, but that the PCA plot for larger_than + smaller_than looked degenerate in the way I screenshotted. Are you saying that your larger_than + smaller_than PCA looked familiar for 7B?
  I suppose there are two things we want to separate: “truth” from likely statements, and “truth” from what humans think (under some kind of simulacra framing). I think this approach would allow you to do the former, but not the latter. And to be honest, I’m not confident on TruthfulQA’s ability to do the latter either.
  Agreed on both points.
  We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don’t do this for cities_cities_{conj,disj} and leave them unpaired.
  Thanks for clarifying! I’m guessing this is what’s making the GoT datasets much worse for generalization (from and to) in your experiments. For 13B, it mostly seemed to me that training on negated statements helped for generalization to other negated statements, and that pairing negated statements with unnegated statements in training data usually (but not always) made generalization to unnegated datasets a bit worse. (E.g. the cities → sp_en_trans generalization is better than cities + neg_cities → sp_en_trans generalization.)