Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct. Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.
In some sense this is expected. The RLHF model isn’t optimized for helpfulness, it is optimized for perceived helpfulness. It is still disturbing that “alignment” has made the model objectively worse at giving correct information.
Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct.
Yes, I think you are misunderstanding Figure 8. I don’t have inside information, but without explanation “calibration” would almost always mean reading it off from the logits. If you instead ask the model to express its uncertainty, I think it will do a much worse job, and the RLHF model will probably perform similarly to the pre-trained model. (This depends on details of the human feedback; under a careful training regime it would probably get modestly better.)
Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.
In some sense this is expected. The RLHF model isn’t optimized for helpfulness, it is optimized for perceived helpfulness. It is still disturbing that “alignment” has made the model objectively worse at giving correct information.
I think this would be a surprising result if true, and I suspect it would be taken as a significant problem by researchers at OpenAI.
I was also thinking the same thing as you, but after reading paulfchristiano’s reply, I now think it’s that you can use the model to generate probabilities of next tokens, and that those next tokens are correct as often as those probabilities say. That is, it’s not referring to the main way of interfacing with GPT-n (wherein a temperature schedule determines how often it picks something other than the option with the highest assigned probability), and it’s not asking the model “in words” for its predicted probabilities.
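To make the "read it off from the logits" reading concrete, here is a rough sketch of how such a calibration check could be computed. It is only an illustration under my own assumptions, not the procedure behind Figure 8: it assumes a multiple-choice setting with one logit per answer option, and the names `softmax`, `confidence_from_logits`, `calibration_bins`, and `expected_calibration_error` are hypothetical helpers, not any library's API.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_from_logits(option_logits):
    """Return (index of the most likely option, probability assigned to it)."""
    probs = softmax(option_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def calibration_bins(confidences, correct, n_bins=10):
    """Group predictions by confidence.

    Returns one (mean confidence, accuracy, count) tuple per non-empty bin.
    A well-calibrated model has mean confidence close to accuracy in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    rows = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            rows.append((mean_conf, accuracy, len(items)))
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy across bins."""
    rows = calibration_bins(confidences, correct, n_bins)
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)
```

In this sketch, a model that assigns 0.9 to its top answer but is right only half the time shows a 0.4 gap in that bin, which is the kind of overconfidence a calibration plot makes visible. Nothing here involves asking the model to state a probability in words.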
It seems you and Paul are correct. I still think this suggests that there is something deeply wrong with RLHF, but less in the “intentionally deceives humans” sense, and more in the “this process consistently writes false data to memory” sense.