Very cool! Always nice to see results replicated and extended, and I appreciated how clear you were in describing your experiments.
Do smaller models also have a generalised notion of truth?
In my most recent revision of GoT[1] we did some experiments to see how truth probe generalization changes with model scale, working with LLaMA-2-7B, -13B, and -70B. Result: truth probes seem to generalize better for larger models. Here are the relevant figures.
Some other related evidence from our visualizations:
We summed things up like so, which I’ll just quote in its entirety:
Overall, these visualizations suggest a picture like the following: as LLMs scale (and perhaps, also as a fixed LLM progresses through its forward pass), they hierarchically develop and linearly represent increasingly general abstractions. Small models represent surface-level characteristics of their inputs; these surface-level characteristics may be sufficient for linear probes to be accurate on narrow training distributions, but such probes are unlikely to generalize out-of-distribution. Large models linearly represent more abstract concepts, potentially including abstract notions like “truth” which capture shared properties of topically and structurally diverse inputs. In middle regimes, we may find linearly represented concepts of intermediate levels of abstraction, for example, “accurate factual recall” or “close association” (in the sense that “Beijing” and “China” are closely associated). These concepts may suffice to distinguish true/false statements on individual datasets, but will only generalize to test data for which the same concepts suffice.
How do we know we’re detecting truth, and not just likely statements?
One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the “neg_” versions of the datasets from GoT, containing negated statements like “The city of Beijing is not in China.” For these datasets, the correlation between truth value and likelihood (operationalized as LLaMA-2-70B’s log probability of the full statement) is strong and negative (-0.63 for neg_cities and -0.89 for neg_sp_en_trans). But truth probes still often generalize well to these negated datasets. Here are results for LLaMA-2-70B (the horizontal axis shows the train set, and the vertical axis shows the test set).
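(For what it’s worth, the correlation here is just a Pearson r between the binary truth labels and the statement log probabilities. A minimal sketch of that computation, assuming a hypothetical CSV with “label” and “logprob” columns, not our actual code:)

```python
# Minimal sketch: correlation between truth value and statement likelihood.
# Assumes a hypothetical CSV with columns "label" (1 = true, 0 = false) and
# "logprob" (the model's summed token log probability for the full statement).
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("neg_cities_with_logprobs.csv")  # hypothetical file name
r, p = pearsonr(df["label"], df["logprob"])
print(f"Pearson r between truth and log probability: {r:.2f} (p = {p:.2g})")
```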
We also find that the probe performs better than LDA in-distribution, but worse out-of-distribution:
Yep, we found the same thing: LDA improves things in-distribution, but generalizes worse than simple DIM probes.
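(In case it helps to pin down what I mean by a DIM probe vs. LDA, here is a rough sketch with synthetic stand-in activations so that it runs end-to-end; the toy data obviously won’t reproduce the generalization gap, it just shows the two probe constructions:)

```python
# Sketch: difference-in-means (DIM) probe vs. LDA on (stand-in) activations.
# Real usage would replace the synthetic arrays with residual-stream activations.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def make_toy_acts(n=500, d=64, shift=1.5):
    """Toy activations: true statements are offset from false ones along one axis."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, d))
    acts[labels == 1, 0] += shift
    return acts, labels

def dim_direction(acts, labels):
    """DIM probe: unit vector from the mean false activation to the mean true one."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)

train_acts, train_labels = make_toy_acts()
test_acts, test_labels = make_toy_acts()  # stand-in for a held-out / transfer set

direction = dim_direction(train_acts, train_labels)
threshold = (train_acts @ direction).mean()
dim_acc = ((test_acts @ direction > threshold) == (test_labels == 1)).mean()

lda = LinearDiscriminantAnalysis().fit(train_acts, train_labels)
lda_acc = lda.score(test_acts, test_labels)

print(f"DIM accuracy: {dim_acc:.3f}  LDA accuracy: {lda_acc:.3f}")
```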
Why does got_cities_cities_conj generalise well?
I found this result surprising, thanks! I don’t really have great guesses for what’s going on. One thing I’ll say is that it’s worth tracking differences between various sorts of factual statements. For example, for LLaMA-2-13B it generally seemed to me that there was better probe transfer between factual recall datasets (e.g. cities and sp_en_trans, but not larger_than). I’m not really sure why the conjunctions are making things so much better, beyond possibly helping to narrow down on “truth” beyond just “correct statement of factual recall.”
I’m not surprised that cities_cities_conj and cities_cities_disj are so qualitatively different—cities_cities_disj has never empirically played well with the other datasets (in the sense of good probe transfer) and I don’t really know why.
This is currently under review, but not yet on arXiv, sorry about that! Code in the nnsight branch here. I’ll try to come back to add a link to the paper once I post it or it becomes publicly available on OpenReview, whichever happens first.
Cool to see the generalisation results for Llama-2 7/13/70B! I originally ran some of these experiments on 7B and got very different results; that PCA plot of 7B looks familiar (and bizarre). Excited to read the paper in its entirety. The first GoT paper was very good.
One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the “neg_” versions of the datasets from GoT, containing negated statements like “The city of Beijing is not in China.” For these datasets, the correlation between truth value and likelihood (operationalized as LLaMA-2-70B’s log probability of the full statement) is strong and negative (-0.63 for neg_cities and -0.89 for neg_sp_en_trans). But truth probes still often generalize well to these negated datasets. Here are results for LLaMA-2-70B (the horizontal axis shows the train set, and the vertical axis shows the test set).
This is an interesting approach! I suppose there are two things we want to separate: “truth” from likely statements, and “truth” from what humans think (under some kind of simulacra framing). I think this approach would allow you to do the former, but not the latter. And to be honest, I’m not confident on TruthfulQA’s ability to do the latter either.
P.S. I realised an important note got removed while editing this post—added back, but FYI:
We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don’t do this for cities_cities_{conj,disj} and leave them unpaired.
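Concretely, the grouping looks something like this (a sketch, using the dataset names from the GoT repo):

```python
# Sketch of the dataset grouping described above: the got_* keys are the names
# used in this post, the lists are the underlying GoT datasets they refer to
# (larger_than is paired with smaller_than; the conj/disj sets are left unpaired).
GOT_DATASET_GROUPS = {
    "got_cities": ["cities", "neg_cities"],
    "got_sp_en_trans": ["sp_en_trans", "neg_sp_en_trans"],
    "got_larger_than": ["larger_than", "smaller_than"],
    "got_cities_cities_conj": ["cities_cities_conj"],
    "got_cities_cities_disj": ["cities_cities_disj"],
}
```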
I originally ran some of these experiments on 7B and got very different results; that PCA plot of 7B looks familiar (and bizarre).
I found that the PCA plot for 7B for larger_than and smaller_than individually looked similar to that for 13B, but that the PCA plot for larger_than + smaller_than looked degenerate in the way I screenshotted. Are you saying that your larger_than + smaller_than PCA looked familiar for 7B?
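(To be concrete about what I mean by the combined plot: fit PCA on the pooled, mean-centered activations of the two datasets and scatter the top two components, colored by label. A minimal sketch with random stand-in activations, just to show the shape of the computation:)

```python
# Sketch of the pooled "larger_than + smaller_than" PCA plot: fit PCA on the
# concatenated, mean-centered activations and color points by true/false label.
# The random arrays below are stand-ins for real residual-stream activations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
acts_larger, labels_larger = rng.normal(size=(500, 64)), rng.integers(0, 2, 500)
acts_smaller, labels_smaller = rng.normal(size=(500, 64)), rng.integers(0, 2, 500)

acts = np.concatenate([acts_larger, acts_smaller])
labels = np.concatenate([labels_larger, labels_smaller])

proj = PCA(n_components=2).fit_transform(acts - acts.mean(axis=0))
plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("larger_than + smaller_than (pooled PCA)")
plt.show()
```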
I suppose there are two things we want to separate: “truth” from likely statements, and “truth” from what humans think (under some kind of simulacra framing). I think this approach would allow you to do the former, but not the latter. And to be honest, I’m not confident on TruthfulQA’s ability to do the latter either.
Agreed on both points.
We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don’t do this for cities_cities_{conj,disj} and leave them unpaired.
Thanks for clarifying! I’m guessing this is what’s making the GoT datasets much worse for generalization (from and to) in your experiments. For 13B, it mostly seemed to me that training on negated statements helped for generalization to other negated statements, and that pairing negated statements with unnegated statements in training data usually (but not always) made generalization to unnegated datasets a bit worse. (E.g. the cities → sp_en_trans generalization is better than cities + neg_cities → sp_en_trans generalization.)