I was also thinking the same thing as you, but after reading paulfchristiano’s reply, I now think it means you can use the model to generate probabilities for the next token, and that those next tokens are correct about as often as those probabilities indicate. That is, it’s not referring to the main way of interfacing with GPT-n (wherein a temperature schedule determines how often it picks something other than the option assigned the highest probability), nor to asking the model “in words” for its predicted probabilities.
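For concreteness, here is a minimal sketch of what "reading off the probabilities" looks like, using the HuggingFace transformers library with GPT-2 as a stand-in for GPT-n (the library, model choice, and prompt are my assumptions, not anything from the thread):

```python
# Sketch: read the model's next-token probability distribution directly,
# rather than sampling at some temperature or asking it "in words".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"  # hypothetical example prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# The calibration claim: tokens assigned probability p should turn out to be
# correct roughly a fraction p of the time.
top_probs, top_ids = next_token_probs.topk(5)
for p, tok_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(tok_id)!r}: {p.item():.3f}")
```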
It seems you and Paul are correct. I still think this suggests that there is something deeply wrong with RLHF, but less in the “intentionally deceives humans” sense, and more in the “this process consistently writes false data to memory” sense.