Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? In our paper, we did the latter; someone mentioned to me that it looks like the colab you linked does the former (though I haven’t checked myself). If this is correct, I think this could explain the differences between your plots and mine in the paper; if pretrained LLMs are placing more probability on the sycophantic answer, I probably wouldn’t expect them to place that much more probability on the sycophantic than non-sycophantic answer (since cross-entropy loss is mode-covering).
(Cool you’re looking into this!)
Oh, interesting! You are right that I measured the average probability—that seemed closer to “how often will the model exhibit the behavior during sampling,” which is what we care about.
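I updated the colab with some code to measure the "% of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer" (you can turn this on by passing example_statistic='matching_more_likely' to various functions).
I also added a new appendix showing results using this statistic instead.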
The bottom line: results with this statistic are very similar to those I originally obtained with average probabilities. So, this doesn’t explain the difference.
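For concreteness, here is a minimal sketch of the two statistics being compared. It is not the colab's code: the function names and the toy probabilities are made up for illustration, and normalizing over just the two answer options is an assumption.

```python
def average_matching_prob(p_match, p_not_match):
    """Mean probability on the behavior-matching (sycophantic) answer.

    Assumption: each example's probabilities are renormalized over the
    two answer options before averaging.
    """
    normalized = [m / (m + n) for m, n in zip(p_match, p_not_match)]
    return sum(normalized) / len(normalized)


def frac_matching_more_likely(p_match, p_not_match):
    """Fraction of examples where the matching answer is the more likely one."""
    wins = [m > n for m, n in zip(p_match, p_not_match)]
    return sum(wins) / len(wins)


# Hypothetical per-example probabilities: the model leans slightly toward
# the sycophantic answer on three of four questions.
p_syc = [0.55, 0.60, 0.52, 0.45]
p_non = [0.45, 0.40, 0.48, 0.55]

avg = average_matching_prob(p_syc, p_non)
frac = frac_matching_more_likely(p_syc, p_non)
print(f"average probability: {avg:.2f}")
print(f"% matching more likely: {100 * frac:.0f}%")
```

On this toy data the average probability is about 0.53 while the "more likely" statistic is 75%, which is the kind of gap the question above was asking about. Whether such a gap shows up on real model outputs is an empirical matter.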
(Edited to remove an image that failed to embed.)