maia comments on Calibration Trivia

maia 21 Nov 2022 0:05 UTC
2 points
0
There’s actually a big problem with using Brier scores for open-ended questions like this: if you’re, say, 50% confident you have the right answer, the optimal move is to instead report “Don’t know / bleeblabloo, probability 0.0001”. Then you get a near-perfect Brier score for knowing you would be wrong.
We ran this at our meetup today and it was the subject of much discussion. A big conclusion seemed to be that Brier scores work best when there is a fixed, limited number of possibilities to guess from; when the number of possibilities is large/unknown and you can guess “I don’t know,” you get this bad behavior.
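To make the incentive concrete, here is a minimal sketch of the arithmetic, assuming the usual per-question Brier score (squared error between the stated probability and the 0/1 outcome, lower is better). The function names and the 50% scenario are illustrative, not part of the game's rules.

```python
def brier(prob: float, correct: bool) -> float:
    """Squared error between the stated probability and the 0/1 outcome."""
    outcome = 1.0 if correct else 0.0
    return (prob - outcome) ** 2


def expected_brier(stated: float, true_chance: float) -> float:
    """Average Brier score when the answer is right with probability true_chance."""
    return true_chance * brier(stated, True) + (1 - true_chance) * brier(stated, False)


# Honest player: gives their best guess, is right half the time, says 50%.
honest = expected_brier(stated=0.5, true_chance=0.5)

# Sandbagger: writes a deliberately wrong answer (never right), says 0.0001.
sandbag = expected_brier(stated=0.0001, true_chance=0.0)

print(f"honest best guess at 50%:   {honest:.4f}")
print(f"wrong answer at 0.0001:     {sandbag:.10f}")
```

Under these assumptions the honest guess averages 0.25 while the deliberately wrong answer scores about 0.00000001, which is the exploit described above.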
We came up with a kind of hacky solution that gave you negative points for wrong answers and positive points for right ones, scaled to the probability you gave, plus regular Brier scores for the True/False questions. It’s unlikely that solution was a proper scoring rule, but it was somewhat better at removing the incentive to always guess “[wrong answer] with probability epsilon.”
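The exact formula for that hack isn't given here, so the following is only one possible reading of it: +p points for a right answer and -p points for a wrong one, where p is the stated probability. Under that assumed form, a quick check suggests why it would not be a proper scoring rule (the best report is always 0 or 1, never your true belief) and also why it blunts the epsilon trick.

```python
def linear_score(stated: float, correct: bool) -> float:
    """Assumed form of the hack: +p if right, -p if wrong (higher is better)."""
    return stated if correct else -stated


def expected_linear(stated: float, belief: float) -> float:
    """Expected points when the answer is right with probability `belief`."""
    return belief * linear_score(stated, True) + (1 - belief) * linear_score(stated, False)


def best_report(belief: float) -> float:
    """Stated probability (on a 0.00..1.00 grid) that maximizes expected points."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda p: expected_linear(p, belief))


# Not proper: someone who is truly 70% sure does best by claiming 100%,
# and someone who is truly 30% sure does best by claiming 0%.
print(best_report(0.7), best_report(0.3))

# The epsilon trick no longer pays: a deliberately wrong answer at 0.0001
# earns slightly negative points instead of a near-perfect score.
print(linear_score(0.0001, False))
```

The expected score here is p(2q - 1) for true belief q, which is linear in p, so the maximum sits at an endpoint rather than at p = q.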
- Screwtape 21 Nov 2022 13:40 UTC
  1 point
  0
  The quick hack I’d use if I didn’t want people to be able to easily guess wrong with high certainty would be to use True/False or multiple choice questions. That said, I don’t currently think of this as a big problem?
  There are two scores: Calibration and Correct Answers. If someone has remarkably good calibration and almost no correct answers, then they’re probably deliberately guessing outlandish answers and being sure that they’re wrong. That’s not worth bragging rights; it’s the equivalent of running to the side of the obstacles on an obstacle course. Someone who’s correctly 20% confident on most of the questions can get a worse Brier score but six Correct Answer points, or an excellent Brier score and zero Correct Answer points, and the former is (to me) more impressive. If you are actually totally clueless, then “[wrong answer] with probability epsilon” is actually the right response.
  “I notice that I don’t actually know this” is (in my opinion) a useful skill to pick up, if you can avoid also picking up “I should pretend that I know nothing.” Still, the option to make it multiple choice exists, and there might be a better scoring rule. (I deliberately avoided making some kind of combined score, because I didn’t want less obvious strategic exchange rates between correct answers and calibration.)
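As a rough illustration of keeping the two scores separate, the sketch below tallies both for a hypothetical 30-question round. The round length, the player data, and the helper name `summarize` are assumptions made for the example; only the "20% confident, six correct" scenario comes from the comment above.

```python
def summarize(answers):
    """answers: list of (stated_probability, was_correct) pairs.

    Returns (mean Brier score, number of correct answers).
    Lower Brier is better; more correct answers is better.
    """
    mean_brier = sum((p - (1.0 if ok else 0.0)) ** 2 for p, ok in answers) / len(answers)
    correct = sum(1 for _, ok in answers if ok)
    return mean_brier, correct


# Hypothetical honest player: 20% confident on every question, right on 6 of 30.
honest_player = [(0.2, True)] * 6 + [(0.2, False)] * 24

# Hypothetical sandbagger: deliberately wrong every time, at probability 0.0001.
sandbagger = [(0.0001, False)] * 30

for name, answers in [("honest", honest_player), ("sandbagger", sandbagger)]:
    mean_brier, correct = summarize(answers)
    print(f"{name:>10}: Brier {mean_brier:.6f}, correct answers {correct}")
```

With these made-up numbers the honest player ends up with a Brier score of about 0.16 and six correct answers, while the sandbagger gets a Brier score near zero and no correct answers, so reading the two columns side by side exposes the strategy.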