Chat or instruction-finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration. Also, forecasting is just hard. So I’d expect chat models to ~always fail, base models to fail slightly less, but I’d expect models finetuned on a somewhat large dataset to be somewhat useful.
Tell me if I understand the idea correctly: log-loss on next-token prediction leads to good calibration for single-token predictions, which manifests as well-calibrated percentage predictions? But then RLHF is some crazy loss, totally removed from calibration, that destroys all that?
If I get that right, it seems quite intuitive. Do you have any citations, though?
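To pin down what "good calibration" means here: take a batch of resolved questions, group them by the probability the model stated, and check whether each group's stated probability matches the fraction that actually resolved yes. A minimal sketch of that check in Python (the forecasts and outcomes below are made-up placeholder numbers, not real data):

import numpy as np

# stated probabilities and actual outcomes (1 = resolved yes); placeholder numbers
forecasts = np.array([0.10, 0.30, 0.30, 0.70, 0.70, 0.90, 0.90, 0.90])
outcomes = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# well calibrated = among questions where the model said p, about a fraction p resolved yes
ece = 0.0  # expected calibration error
for p in np.unique(forecasts):
    mask = forecasts == p
    observed = outcomes[mask].mean()
    ece += mask.mean() * abs(p - observed)
    print(f"said {p:.0%}: observed {observed:.0%} over {mask.sum()} questions")
print(f"expected calibration error: {ece:.3f}")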
I don’t find it intuitive at all. It would be intuitive if you started by telling a story describing the situation, asked the LLM to continue the story, sampled randomly from the continuations, and counted how many of the continuations would lead to a positive resolution of the question. This should be well-calibrated (assuming the details included in the prompt were representative and that there isn’t a bias in which kinds of endings such stories get in the training data). But this is not what is happening. Instead the model outputs a token which is a number, and somehow that number happens to be well-calibrated. I guess that would mean that the predictions made in the training data are well-calibrated? That just seems very unlikely.
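For what it's worth, the measured quantity isn't a count over sampled story continuations; it's read straight off the distribution the base model puts on the next token when that token is a number. A rough sketch of that readout, using a small open base model as a stand-in (the prompt and the candidate percentage values are illustrative assumptions, and real evaluations handle tokenization more carefully):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# any base (non-chat) causal LM works as a stand-in
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Question: Will <some event> happen by the end of next year?\n"
          "Probability (as a percentage):")
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]  # scores for the very next token
probs = logits.softmax(dim=-1)

# how much probability mass the model puts on each candidate percentage token
for pct in ["5", "10", "25", "50", "75", "90", "95"]:
    tid = tok.encode(" " + pct)[0]  # leading space so it matches the BPE token form
    print(f"{pct:>3}% -> {probs[tid].item():.4f}")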
Thanks, you’ve 100% convinced me. (Convincing someone that something that (a) is known to be true and (b) they think isn’t surprising actually is surprising is a rare feat. Well done!)
Yeah, exactly. For example, if humans had a convention of rounding probabilities to the nearest 10% when writing them, then baseline GPT-4 would follow that convention, and that would put a cap on the maximum calibration it could achieve. Humans are badly calibrated (right?), and baseline GPT-4 is mimicking humans, so why is it well calibrated? It doesn’t follow from its token stream being well calibrated relative to text.
https://imgur.com/a/3gYel9r
https://openai.com/research/gpt-4
More specifically: https://arxiv.org/pdf/2303.08774.pdf#page=12
Dynomight, are you aware that, in addition to the GPT-4 paper reporting that the RLHF’d GPT-4 is badly de-calibrated, there are several papers already examining the calibration of LLMs and their ability to forecast?