Cool paper! I enjoyed reading it and think it provides some useful information on what adding carefully chosen bias vectors into LLMs can achieve. Some assorted thoughts and observations.
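For readers who haven't looked at the method: the core operation is just adding a fixed vector to intermediate activations during generation. Here's a minimal PyTorch sketch of that idea (my own illustration, not the authors' code; the module path, layer index, `truth_dir`, and `alpha` are placeholders, and ITI itself selects specific attention heads and uses probe-derived directions):

```python
import torch

def add_direction(hidden: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Shift activations along a fixed direction at every token position.

    hidden:    (batch, seq_len, d_model) activations from some chosen layer
    direction: (d_model,) unit vector, e.g. estimated from linear-probe weights
    alpha:     intervention strength (a hyperparameter)
    """
    return hidden + alpha * direction


# Illustrative use via a forward hook on a module that returns a plain tensor
# (the layer choice and `truth_dir` here are made up, not the paper's values):
# handle = model.model.layers[12].mlp.register_forward_hook(
#     lambda mod, inp, out: add_direction(out, truth_dir, alpha=15.0)
# )
# ...generate as usual, then handle.remove() to undo the intervention.
```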
I found skimming through some of the ITI completions in the appendix very helpful. I’d recommend doing so to others.
GPT-Judge is not very reliable. A reasonable fraction of the responses (10-20%, maybe?) seem to be misclassified.
I think it would be informative to see truthfulness according to a human judge, although of course this is labor-intensive.
A lot of the mileage seems to come from stylistic differences in the output.
My impression is the ITI model is more humble, fact-focused, and better at avoiding the traps that TruthfulQA sets for it.
I’m worried that people may read this paper and think it conclusively proves there is a cleanly represented truthfulness concept inside of the model.[1] It’s not clear to me we can conclude this if ITI is mostly encouraging the model to make stylistic adjustments that cause it to make fewer false statements on this particular dataset.
One way of understanding this (in a simulators framing) is that ITI encourages the model to take on a persona of someone who is more careful, truthful, and hesitant to make claims that aren’t backed up by good evidence. This persona is particularly selected to do well on TruthfulQA (in that it emulates the example true answers as contrasted to the false ones). Being able to elicit this persona with ITI doesn’t require the model to have a “truthfulness direction”, although it obviously helps the model simulate better if it knows what facts are actually true!
Note that this sort of stylistic update is exactly the kind of change you’d also expect prompting the model to be able to achieve.
Some other very minor comments:
I find Figure 6B confusing, as I’m not really sure how the categories relate to one another. Additionally, is the red line also a percentage?
There’s a bug in the model-output-to-LaTeX pipeline that causes any output after a percentage sign not to be shown (presumably because an unescaped % starts a comment in LaTeX).
[1] To be clear, the authors don’t claim this and I’m not intending this as a criticism of them.