Regarding overconfidence, GPT-4 is actually very very well-calibrated before RLHF post-training (see paper Fig. 8). I would not be surprised if the RLHF processes imparted other biases too, perhaps even in the human direction.
Nice point! Thanks. Hadn’t thought about that properly, so let’s see. Three relevant thoughts:
1) For any probabilistic but non-omniscient agent, you can design tests on which it’s poorly calibrated. (Let its probability function be P, and let W = {q: P(q) > 0.5 & ¬q} be the set of claims it’s more than 50% confident in that are nevertheless false. If your test is {{q, ¬q}: q ∈ W}, then the agent assigns probability above 50% to every answer, yet its hit rate is 0%.) So it doesn’t really make sense to say that a system is calibrated or not full stop, only that it is (or isn’t) calibrated on a given set of questions. (The sketch after point 3 illustrates this.)
What the paper shows is that calibration on that particular test gets worse after RLHF, but that doesn’t imply calibration is worse on other sets of questions. So I think we should be cautious about generalizing.
2) If I’m reading it right, on the exact same test RLHF significantly improved GPT-4’s accuracy (Figure 7, just above). That complicates the “merely introducing human biases” interpretation.
3) Presumably GPT-4 after RLHF is a more useful system than GPT-4 without it; otherwise they would have released a different version. That’s consistent with the picture on which lots of fallacies (like the conjunction fallacy) arise out of useful and efficient ways of communicating (I’m thinking of Gricean/pragmatic explanations of the CF).
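To make point 1 concrete, here’s a minimal sketch (purely hypothetical numbers, not from the paper): an agent whose stated probabilities are calibrated by construction over the full pool of questions, but whose hit rate is 0% on the adversarially selected set W of claims it’s confident in but that are false.

```python
# Minimal sketch of point 1 (hypothetical simulation, not from the GPT-4 paper):
# an agent that is calibrated by construction still looks maximally
# overconfident on an adversarially chosen test set.
import random

random.seed(0)

# Each "question" is a claim the agent assigns confidence p to, and the world
# makes the claim true with probability exactly p -- so the agent is
# calibrated on the full pool by construction.
pool = []
for _ in range(100_000):
    p = random.random()            # agent's confidence in the claim
    true = random.random() < p     # claim is true with probability p
    pool.append((p, true))

def avg_confidence(qs):
    return sum(p for p, _ in qs) / len(qs)

def hit_rate(qs):
    return sum(t for _, t in qs) / len(qs)

# Full pool: average confidence and hit rate roughly match (~0.50 vs ~0.50).
print(f"full pool:  conf {avg_confidence(pool):.2f}, hit rate {hit_rate(pool):.2f}")

# Adversarial test W = {q : P(q) > 0.5 and q is false}: every answer comes
# with >50% confidence, but the hit rate is exactly 0.
W = [(p, t) for p, t in pool if p > 0.5 and not t]
print(f"test set W: conf {avg_confidence(W):.2f}, hit rate {hit_rate(W):.2f}")
```

The same agent looks well calibrated on one question set and maximally overconfident on another, which is why calibration claims only make sense relative to a particular test distribution.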