I’m curious what the predicted reward and SAE features look like on training[1] examples where one of these highly influential features gives the wrong answer.
I did some quick string counting in Anthropic HH (train split only), and found:

substring https://
appears only in preferred response: 199 examples
appears only in dispreferred response: 940 examples

substring I don't know
appears only in preferred response: 165 examples
appears only in dispreferred response: 230 examples
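A minimal sketch of this kind of counting, assuming the HuggingFace "Anthropic/hh-rlhf" train split with its "chosen"/"rejected" text fields (the exact counting method used above may differ):

```python
# Rough sketch of the substring counting described above, assuming the
# HuggingFace "Anthropic/hh-rlhf" train split with "chosen"/"rejected" fields.
# Since the chosen and rejected transcripts share the same prompt, a substring
# that appears in only one of them must come from the differing final response.
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

def count_one_sided(substring: str) -> tuple[int, int]:
    pref_only = disp_only = 0
    for ex in ds:
        in_chosen = substring in ex["chosen"]
        in_rejected = substring in ex["rejected"]
        if in_chosen and not in_rejected:
            pref_only += 1
        elif in_rejected and not in_chosen:
            disp_only += 1
    return pref_only, disp_only

for s in ["https://", "I don't know"]:
    pref, disp = count_one_sided(s)
    print(f"{s!r}: preferred-only {pref}, dispreferred-only {disp}")
```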
From these counts (especially the https:// ones), it’s not hard to see where the PM gets its heuristics from. But it’s also clear that there are many training examples where the heuristic makes the wrong directional prediction. If we looked at these examples with the PM and SAE, I can imagine various things we might see:
- The PM just didn’t learn these examples well, and confidently makes the wrong prediction
- The PM does at least OK at these examples, and the SAE reconstruction of this prediction is decent
- The heuristic feature is active and upweights the wrong prediction, but other features push against it
- The heuristic feature isn’t active (or is less active), indicating that it’s not really a feature for the heuristic alone and has more complicated necessary conditions
- The PM does at least OK at these examples, but the SAE did not learn the right features to reconstruct these types of PM predictions (this is conceivable but seems unlikely)
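One could operationalize the check roughly as follows. This is purely a hypothetical sketch: `predicted_reward`, `sae_reconstructed_reward`, and `heuristic_activation` are stand-in accessors I am assuming, not names from the post.

```python
from typing import Callable, Dict

def diagnose_pair(
    chosen: str,
    rejected: str,
    predicted_reward: Callable[[str], float],          # PM's scalar reward (hypothetical accessor)
    sae_reconstructed_reward: Callable[[str], float],  # reward implied by the SAE reconstruction
    heuristic_activation: Callable[[str], float],      # activation of the suspected feature (e.g. https://)
) -> Dict[str, float]:
    """Collect the quantities needed to tell the cases above apart."""
    r_chosen, r_rejected = predicted_reward(chosen), predicted_reward(rejected)
    return {
        # Negative margin: the PM itself ranks the pair wrongly (first case).
        "pm_margin": r_chosen - r_rejected,
        # Large gap: the SAE fails to reconstruct this kind of prediction (last case).
        "sae_recon_gap": abs(r_rejected - sae_reconstructed_reward(rejected)),
        # Active vs. (nearly) inactive distinguishes the two middle cases.
        "heuristic_act": heuristic_activation(rejected),
    }
```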
It’s possible that for “in-distribution” input (i.e. stuff that a HH-SFT-tuned model might actually say), the PM’s predictions are more reasonable, in a way that relies on other properties of “in-distribution” responses that aren’t present in the custom/adversarial examples here. That is, maybe the custom examples only show that the PM is not robust to distributional shift, rather than showing that it makes predictions in a crude or simplistic manner even on the training distribution. If so, it’d be interesting to find out what exactly is happening in these “more nuanced” predictions.
[1] IIUC this model was trained on Anthropic HH.
The PM is pretty bad (it’s trained on hh).
It’s actually only trained on the data after the first 20k of the 156k datapoints in hh, which moves the mean reward-diff from 1.04 → 1.36 if you only calculate over that remaining ~136k subset.
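In code, the subset statistic is just something like the sketch below, assuming `reward_diffs` holds the per-pair chosen-minus-rejected PM rewards in hh train-split order (the function and array are my own illustration, not from the post):

```python
import numpy as np

def mean_reward_diff(reward_diffs: np.ndarray, skip_first: int = 20_000) -> tuple[float, float]:
    """Return (mean over all pairs, mean over the pairs after the PM's training cutoff)."""
    return float(reward_diffs.mean()), float(reward_diffs[skip_first:].mean())
```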
My understanding is that there are 3 bad things:
1. The hh dataset is inconsistent
2. The PM doesn’t separate chosen vs rejected very well (as shown above)
3. The PM is GPT-J (a ~6B parameter model), which doesn’t have the most complex features to choose from.
The in-distribution argument most likely applies to the “Thank you. My pleasure” case, because the assistant never (AFAIK, I didn’t check) said that phrase as a response; it only said “My pleasure” after the user said “thank you”.