This is a great post. I’m excited about this line of research, and it’s great to see a proposal of what that might look like.
In our paper, we find that the log-probs of a model’s hypothetical statements track the log-probs of the object-level behavior it is reporting about. This is also true for object-level responses that the model does not actually choose. For example (made-up numbers), if the model’s object-level behavior has the distribution 60% “dog”, 30% “cat”, 10% “fox”, the model would answer the question “what would the second letter of your answer have been?” with 70% “o”, 30% “a”. Note that the model saw only the winning answer during training, yet it is calibrated (to some degree) to the distribution of object-level answers.
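To make the calibration claim concrete, here is a minimal sketch (not the paper’s code; the numbers and the total-variation check are placeholders) of how the induced second-letter distribution follows from the object-level distribution, and how one might compare it to the probabilities the model assigns when answering the hypothetical question:

```python
from collections import defaultdict

# Made-up object-level answer distribution from the example above.
object_level = {"dog": 0.60, "cat": 0.30, "fox": 0.10}

# Marginalize onto the second letter: "dog" and "fox" both contribute to "o".
induced = defaultdict(float)
for answer, p in object_level.items():
    induced[answer[1]] += p
print(dict(induced))  # {'o': 0.7, 'a': 0.3}

# Hypothetical-question probabilities (placeholder numbers; in practice these
# would be read off the model's token log-probs when it answers
# "what would the second letter of your answer have been?").
hypothetical = {"o": 0.68, "a": 0.32}

# A simple calibration check: total variation distance between the two.
total_variation = 0.5 * sum(
    abs(induced[letter] - hypothetical.get(letter, 0.0))
    for letter in set(induced) | set(hypothetical)
)
print(f"total variation distance: {total_variation:.3f}")
```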
I’m curious what you make of this result. To me, the fact that the log-probs of the hypothetical answer are calibrated with respect to the object-level behavior suggests that there is an internal process that takes calibration into account when arriving at an answer, even though we don’t ask the model to verbalize the calibration. (Early on in the project, we actually included experiments where models were asked about e.g. their second-most likely answer, but we dropped them early enough that I have no data on how well models can explicitly report on this.)
Thanks Felix!
This is indeed a cool and surprising result. I think it strengthens the introspection interpretation, but since it doesn’t require the model to judge the reliability of some internal signal (right?), it doesn’t directly address the question of whether there is a discriminator in there.