Chat or instruction-finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration. Also, forecasting is just hard. So I’d expect chat models to ~always fail, base models to fail slightly less, but I’d expect models finetuned on a somewhat large dataset to be somewhat useful.
Tell me if I understand the idea correctly: log-loss on next-token prediction leads to good calibration for single-token predictions, which manifests as well-calibrated percentage predictions? But then RLHF is some crazy loss, totally removed from calibration, that destroys all that?
If I get that right, it seems quite intuitive. Do you have any citations, though?
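To pin down what "good calibration" means here: take a batch of resolved questions, group them by the probability the model stated, and check whether each group's stated probability matches the fraction that actually resolved yes. A minimal sketch of that check in Python (the forecasts and outcomes below are made-up placeholder numbers, not real data):

import numpy as np

# stated probabilities and actual outcomes (1 = resolved yes); placeholder numbers
forecasts = np.array([0.10, 0.30, 0.30, 0.70, 0.70, 0.90, 0.90, 0.90])
outcomes = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# well calibrated = among questions where the model said p, about a fraction p resolved yes
ece = 0.0  # expected calibration error
for p in np.unique(forecasts):
    mask = forecasts == p
    observed = outcomes[mask].mean()
    ece += mask.mean() * abs(p - observed)
    print(f"said {p:.0%}: observed {observed:.0%} over {mask.sum()} questions")
print(f"expected calibration error: {ece:.3f}")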
I don’t find it intuitive at all. It would be intuitive if you started by telling a story describing the situation, asked the LLM to continue the story, sampled randomly from the continuations, and counted how many of the continuations would lead to a positive resolution of the question. This should be well-calibrated (assuming the details included in the prompt were representative and that there isn’t a bias in which kinds of endings such stories get in the training data). But this is not what is happening. Instead the model outputs a token which is a number, and somehow that number happens to be well-calibrated. I guess that would mean that the predictions made in the training data are well-calibrated? That just seems very unlikely.
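For what it's worth, the measured quantity isn't a count over sampled story continuations; it's read straight off the distribution the base model puts on the next token when that token is a number. A rough sketch of that readout, using a small open base model as a stand-in (the prompt and the candidate percentage values are illustrative assumptions, and real evaluations handle tokenization more carefully):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# any base (non-chat) causal LM works as a stand-in
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Question: Will <some event> happen by the end of next year?\n"
          "Probability (as a percentage):")
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]  # scores for the very next token
probs = logits.softmax(dim=-1)

# how much probability mass the model puts on each candidate percentage token
for pct in ["5", "10", "25", "50", "75", "90", "95"]:
    tid = tok.encode(" " + pct)[0]  # leading space so it matches the BPE token form
    print(f"{pct:>3}% -> {probs[tid].item():.4f}")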
Thanks, you’ve 100% convinced me. (Convincing someone that something that (a) is known to be true and (b) they think isn’t surprising actually is surprising is a rare feat. Well done!)
Yeah, exactly. For example, if humans had a convention of rounding probabilities to the nearest 10% when writing them, then baseline GPT-4 would follow that convention, and that would put a cap on the maximum calibration it could achieve. Humans are badly calibrated (right?), and baseline GPT-4 is mimicking humans, so why is it well calibrated? It doesn’t follow from its token stream being well calibrated relative to text.
https://imgur.com/a/3gYel9r
https://openai.com/research/gpt-4
More specifically: https://arxiv.org/pdf/2303.08774.pdf#page=12
Dynomight, are you aware that, in addition to the GPT-4 paper reporting that the RLHF’d GPT-4 is badly de-calibrated, there are several papers already examining the calibration of LLMs and their ability to forecast?