My guess is that RLHF is unwittingly training the model to lie.
If I ask a question and the model thinks there is an 80% chance the answer is “A” and a 20% chance the answer is “B,” I probably want the model to always say “A” (or even better: “probably A”). I don’t generally want the model to say “A” 80% of the time and “B” 20% of the time.
In some contexts that’s worse behavior. For example, if you ask the model to explicitly estimate a probability it will probably do a worse job than if you extract the logits from the pre-trained model (though of course that totally goes out the window if you do chain of thought). But it’s not really lying—it’s also the behavior you’d expect out of a human who is trying to be helpful.
More precisely: when asked a question, the pre-trained model outputs a probability distribution over what comes next. If prompted correctly you get its subjective probability distribution over the answer (or at least over the answer that would appear on the internet). The RLHF model instead outputs a probability distribution over what to say next, which is optimized to give highly-rated responses. So you’d expect it to put all of its probability mass on the best response.
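To make that last step concrete, here is a toy calculation. It assumes a deliberately simplified reward of 1 for a correct answer and 0 otherwise, which is an illustrative stand-in and not the actual RLHF reward model; `expected_reward` and `best_say_prob` are hypothetical names. Under that assumption the expected reward is linear in the probability of saying "A", so the optimum sits at an endpoint rather than at the model's belief.

```python
def expected_reward(say_a: float, belief_a: float) -> float:
    """Expected 0/1 reward when the policy says "A" with probability say_a
    and the model's belief that "A" is correct is belief_a.
    (Assumed toy reward: 1 if the answer given is correct, else 0.)"""
    return say_a * belief_a + (1.0 - say_a) * (1.0 - belief_a)


def best_say_prob(belief_a: float, steps: int = 100) -> float:
    """Grid-search the say-"A" probability that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(q, belief_a))


if __name__ == "__main__":
    for belief in (0.8, 0.6, 0.2):
        q = best_say_prob(belief)
        print(f"belief={belief:.1f}: say 'A' with prob {q:.1f}, "
              f"expected reward {expected_reward(q, belief):.2f}")
```

With a belief of 0.8, always saying "A" earns 0.8 in expectation, while saying "A" 80% of the time earns 0.8·0.8 + 0.2·0.2 = 0.68, so a reward-maximizing policy has no reason to sample in proportion to its belief.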
I still think it’s curious that RLHF doesn’t seem to reduce to a proper loss on factual questions, and I’d guess that it’d probably be better if it did (at least in contexts that strictly ask for a “yes/no” answer without qualification).
I think it’s probably true that RLHF doesn’t reduce to a proper scoring rule on factual questions, even if you ask the model to quantify its uncertainty, because the learned reward function doesn’t make good quantitative tradeoffs.
That said, I think this is unrelated to the given graph. If it is forced to say either “yes” or “no” the RLHF model will just give the more likely answer 100% of the time, which will show up as bad calibration on this graph. The point is that for most agents “the probability you say yes” is not the same as “the probability you think the answer is yes.” This is the case for pretrained models.
I think that if RLHF reduced to a proper loss on factual questions, these probabilities would coincide (given enough varied training data). I agree it’s not entirely obvious that having these probabilities come apart is problematic, because you might recover more calibrated probabilities by asking for them. Still, knowing the logits are directly incentivised to be well calibrated seems like a nice property to have.
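For comparison, here is a small sketch of what "proper" means here. It uses the logarithmic score as one example of a proper scoring rule; the function names and the grid search are illustrative assumptions, not anything taken from an actual RLHF setup. Under the log score, the report that maximizes expected score equals the belief, which is the sense in which the two probabilities would coincide.

```python
import math


def expected_log_score(report: float, belief: float) -> float:
    """Expected log score for reporting `report` as P(yes) when the
    probability that the answer is yes is `belief`."""
    return belief * math.log(report) + (1.0 - belief) * math.log(1.0 - report)


def best_report(belief: float, steps: int = 1000) -> float:
    """Grid-search the report in (0, 1) that maximizes the expected log score."""
    grid = [i / steps for i in range(1, steps)]  # skip 0 and 1 to avoid log(0)
    return max(grid, key=lambda r: expected_log_score(r, belief))


if __name__ == "__main__":
    for belief in (0.8, 0.5, 0.3):
        print(f"belief={belief:.1f}: best report {best_report(belief):.3f}")
```

Contrast this with a plain right/wrong reward, where the best policy is to commit fully to the more likely answer whatever the belief.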
An agent says yes if it thinks yes is the best thing to say. This comes apart from “yes is the correct answer” only if there are additional considerations determining “best” apart from factuality. If you’re restricted to “yes/no”, then for most normal questions I think an ideal RLHF objective should not introduce considerations beyond factuality in assessing the quality of the answer—and I suspect this is also true in practical RLHF objectives. If I’m giving verbal confidences, then there are non-factual considerations at play—namely, I want my answer to communicate my epistemic state. For pretrained models, the question is not whether it is factual but whether someone would say it (though somehow it seems to come close). But for yes/no questions under RLHF, if the probabilities come apart it is due to not properly eliciting the probability (or some failure of the RLHF objective to incentivise factual answers).
Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for the answer, then asked the model what probability it thinks that that answer is correct. Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.
In some sense this is expected. The RLHF model isn’t optimized for helpfulness; it is optimized for perceived helpfulness. It is still disturbing that “alignment” has made the model objectively worse at giving correct information.
Yes, I think you are misunderstanding Figure 8. I don’t have inside information, but without explanation “calibration” would almost always mean reading it off from the logits. If you instead ask the model to express its uncertainty I think it will do a much worse job, and the RLHF model will probably perform similarly to the pre-trained model. (This depends on details of the human feedback; under a careful training regime it would probably get modestly better.)
I think this would be a surprising result if true, and I suspect it would be taken as a significant problem by researchers at OpenAI.
I was also thinking the same thing as you, but after reading paulfchristiano’s reply, I now think it’s that you can use the model to generate probabilities of next tokens, and that those next tokens are correct as often as those probabilities. This is to say it’s not referring to the main way of interfacing with GPT-n (wherein a temperature schedule determines how often it picks something other than the option with the highest probability assigned; i.e. not asking the model “in words” for its predicted probabilities).
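A rough sketch of that reading of "calibration", assuming you already have the probability the model assigns to its chosen answer on each question: bucket the questions by that probability and compare each bucket's average probability with how often the answers in it were right. The function name and the toy data are made up for illustration, and this is not a claim about how Figure 8 was actually produced.

```python
from typing import List, Tuple


def calibration_table(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> List[Tuple[float, float, float, int]]:
    """Bucket predictions by confidence and return, for each non-empty bucket,
    (bucket_lower_edge, mean_confidence, accuracy, count)."""
    buckets: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bucket.
        idx = min(int(conf * n_bins), n_bins - 1)
        buckets[idx].append((conf, ok))
    rows = []
    for idx, items in enumerate(buckets):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((idx / n_bins, mean_conf, accuracy, len(items)))
    return rows


# Toy data (made up): both hypothetical models are right on 8 of 10 questions.
outcomes = [True] * 8 + [False] * 2
calibrated = calibration_table([0.8] * 10, outcomes)     # puts 0.8 on its answer
overconfident = calibration_table([1.0] * 10, outcomes)  # puts 1.0 on its answer
```

In the toy data the first model's bucket has mean confidence and accuracy both at about 0.8, while the second has confidence 1.0 against accuracy 0.8. That gap is what a model that always commits to the more likely answer would show on a plot like this.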
It seems you and Paul are correct. I still think this suggests that there is something deeply wrong with RLHF, but less in the “intentionally deceives humans” sense, and more in the “this process consistently writes false data to memory” sense.
That makes a lot of sense, but it doesn’t explain why calibration post-RLHF is much better for the 10-40% buckets than for the 60-90% buckets.