I’m curious what the predicted reward and SAE features look like on training[1] examples where one of these highly influential features gives the wrong answer.
I did some quick string counting in Anthropic HH (train split only), and found:

substring https://
appears only in preferred response: 199 examples
appears only in dispreferred response: 940 examples

substring I don't know
appears only in preferred response: 165 examples
appears only in dispreferred response: 230 examples
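A minimal sketch of this kind of counting, assuming the HuggingFace "Anthropic/hh-rlhf" train split with its "chosen"/"rejected" text fields (the exact counting method used above may differ):

```python
# Rough sketch of the substring counting described above, assuming the
# HuggingFace "Anthropic/hh-rlhf" train split with "chosen"/"rejected" fields.
# Since the chosen and rejected transcripts share the same prompt, a substring
# that appears in only one of them must come from the differing final response.
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

def count_one_sided(substring: str) -> tuple[int, int]:
    pref_only = disp_only = 0
    for ex in ds:
        in_chosen = substring in ex["chosen"]
        in_rejected = substring in ex["rejected"]
        if in_chosen and not in_rejected:
            pref_only += 1
        elif in_rejected and not in_chosen:
            disp_only += 1
    return pref_only, disp_only

for s in ["https://", "I don't know"]:
    pref, disp = count_one_sided(s)
    print(f"{s!r}: preferred-only {pref}, dispreferred-only {disp}")
```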
From these counts (especially the https:// ones), it’s not hard to see where the PM gets its heuristics from. But it’s also clear that there are many training examples where the heuristic makes the wrong directional prediction. If we looked at these examples with the PM and SAE, I can imagine various things we might see:
- The PM just didn’t learn these examples well, and confidently makes the wrong prediction
- The PM does at least OK at these examples, and the SAE reconstruction of this prediction is decent
- The heuristic feature is active and upweights the wrong prediction, but other features push against it
- The heuristic feature isn’t active (or is less active), indicating that it’s not really a feature for the heuristic alone and has more complicated necessary conditions
- The PM does at least OK at these examples, but the SAE did not learn the right features to reconstruct these types of PM predictions (this is conceivable but seems unlikely)
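One could operationalize the check roughly as follows. This is purely a hypothetical sketch: `predicted_reward`, `sae_reconstructed_reward`, and `heuristic_activation` are stand-in accessors I am assuming, not names from the post.

```python
from typing import Callable, Dict

def diagnose_pair(
    chosen: str,
    rejected: str,
    predicted_reward: Callable[[str], float],          # PM's scalar reward (hypothetical accessor)
    sae_reconstructed_reward: Callable[[str], float],  # reward implied by the SAE reconstruction
    heuristic_activation: Callable[[str], float],      # activation of the suspected feature (e.g. https://)
) -> Dict[str, float]:
    """Collect the quantities needed to tell the cases above apart."""
    r_chosen, r_rejected = predicted_reward(chosen), predicted_reward(rejected)
    return {
        # Negative margin: the PM itself ranks the pair wrongly (first case).
        "pm_margin": r_chosen - r_rejected,
        # Large gap: the SAE fails to reconstruct this kind of prediction (last case).
        "sae_recon_gap": abs(r_rejected - sae_reconstructed_reward(rejected)),
        # Active vs. (nearly) inactive distinguishes the two middle cases.
        "heuristic_act": heuristic_activation(rejected),
    }
```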
It’s possible that for “in-distribution” input (i.e. stuff that a HH-SFT-tuned model might actually say), the PM’s predictions are more reasonable, in a way that relies on other properties of “in-distribution” responses that aren’t present in the custom/adversarial examples here. That is, maybe the custom examples only show that the PM is not robust to distributional shift, rather than showing that it makes predictions in a crude or simplistic manner even on the training distribution. If so, it’d be interesting to find out what exactly is happening in these “more nuanced” predictions.
[1] IIUC this model was trained on Anthropic HH.
The PM is pretty bad (it’s trained on hh).
It’s actually only trained on the data after the first 20k of the 156k datapoints in hh, which moves the mean reward-diff from 1.04 → 1.36 if you only calculate over that remaining ~136k subset.
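In code, the subset statistic is just something like the sketch below, assuming `reward_diffs` holds the per-pair chosen-minus-rejected PM rewards in hh train-split order (the function and array are my own illustration, not from the post):

```python
import numpy as np

def mean_reward_diff(reward_diffs: np.ndarray, skip_first: int = 20_000) -> tuple[float, float]:
    """Return (mean over all pairs, mean over the pairs after the PM's training cutoff)."""
    return float(reward_diffs.mean()), float(reward_diffs[skip_first:].mean())
```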
My understanding is that there are 3 bad things:
1. The hh dataset is inconsistent
2. The PM doesn’t separate chosen vs rejected very well (as shown above)
3. The PM is GPT-J (a ~6B parameter model), which doesn’t have the most complex features to choose from.
The in-distribution argument most likely applies to the “Thank you. My pleasure” case, because the assistant never (AFAIK, I didn’t check) said that phrase as a response; it only said “My pleasure” after the user said “thank you”.