Oh, interesting! You are right that I measured the average probability—that seemed closer to “how often will the model exhibit the behavior during sampling,” which is what we care about.
I updated the colab with some code to measure the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer (you can turn this on by passing example_statistic='matching_more_likely' to various functions).
And I added a new appendix showing results using this statistic instead.
The bottom line: results with this statistic are very similar to those I originally obtained with average probabilities. So, this doesn’t explain the difference.
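To make the difference between the two statistics concrete, here is a minimal sketch. It is not the colab's actual code: the function names, the two-answer normalization, and the example numbers are all assumptions made up for illustration.

```python
from statistics import mean


def average_probability(p_matching, p_not_matching):
    """Mean probability on the sycophantic answer.

    Assumption: each example has exactly two candidate answers, and the
    probability is normalized over those two answers.
    """
    return mean(pm / (pm + pn) for pm, pn in zip(p_matching, p_not_matching))


def matching_more_likely(p_matching, p_not_matching):
    """Fraction of examples where the sycophantic answer is the more likely one."""
    return mean(1.0 if pm > pn else 0.0 for pm, pn in zip(p_matching, p_not_matching))


# Hypothetical per-example probabilities (illustrative only, not real results).
p_syc = [0.9, 0.6, 0.4, 0.55]
p_non = [0.1, 0.4, 0.6, 0.45]

print(average_probability(p_syc, p_non))   # roughly 0.61
print(matching_more_likely(p_syc, p_non))  # 0.75
```

The two numbers can diverge when many examples sit near 50/50: the first statistic barely moves when a probability crosses 0.5, while the second flips from 0 to 1 for that example.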
(Edited to remove an image that failed to embed.)