Can you show the distribution of overall calibration scores? You only talked about the extreme cases and the differences across P(MWI), but you clearly have it.
Picture included, tragic mistakes excluded*. The percentage at the bottom is a mapping from the score to probabilities using the inverse of “if you had answered every question right with probability p, what score would you have?”, and so is not anything like the mean probability given. Don’t take either of the two perfect scores seriously; as mentioned in the grandparent, this scoring rule isn’t quite right because it counts answering incorrectly with 0% probability as the same as answering correctly with 100% probability. (One answered ‘asdf’ to everything with 0% probability, the other left 9 blank with 0% probability and answered Odin with 100% probability.) Bins have equal width in log-space.
* I could have had a spike at 0, but that seems not quite fair since it was specified that ‘100’ and ‘0’ would be treated as ‘100-epsilon’ and ‘epsilon’ respectively, and it’s only a Tragic Mistake if you actually answer 0 instead of epsilon.
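To make the loophole concrete, here is a minimal sketch. It assumes the score is a log score on the binary event "my answer was correct", which is my guess at the rule being discussed rather than the exact code used; `EPS` and `binary_log_score` are hypothetical names, and the value of `EPS` is arbitrary.

```python
import math

EPS = 1e-6  # hypothetical stand-in for the 'epsilon' mentioned above

def binary_log_score(p_correct, was_correct):
    """Log score on the event 'my answer was correct' (assumed rule).

    p_correct is the stated probability as a fraction; 0 and 1 are
    clamped to EPS and 1 - EPS, as in the survey instructions.
    """
    p = min(max(p_correct, EPS), 1 - EPS)
    return math.log(p) if was_correct else math.log(1 - p)

# Gibberish answered at 0% and scored as wrong...
gibberish = binary_log_score(0.0, was_correct=False)
# ...earns exactly what a correct answer at 100% earns.
confident_and_right = binary_log_score(1.0, was_correct=True)
```

Under this assumed rule both calls return log(1 - EPS), which is why the two "perfect" scores carry no information about calibration.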
Yeah, that’s not a particularly strong scoring method, due to its abusability. I wonder what a better one would be? Of course, it wouldn’t help unless people knew that it was going to be used, and cared.
Fraction correct times this calibration score? Number correct times the product rather than the average of what you did there? Bayes score, with naming the ‘wrong’ thing yielding a penalty to account for the multiplicity of wrong answers (say, each wrong answer has a 50% hit so even being 100% sure you’re wrong is only as good as 50% sure you’re right, when you are right)?
The primary property you want to maintain with a scoring rule is that the best probability to provide is your true probability. I know that the Bayes score generalizes to multiple choice questions, which implies to me that it most likely works with a multiplicity for wrong answers, so long as the multiplicity is close to the actual multiplicity.
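A quick numerical check of that property for the plain binary log score, as a sketch only: `expected_log_score` and `best_report` are names I made up, and the grid search is just an illustration, not a proof.

```python
import math

def expected_log_score(true_p, reported_q):
    """Expected log score if you are right with probability true_p
    but report probability reported_q."""
    return (true_p * math.log(reported_q)
            + (1 - true_p) * math.log(1 - reported_q))

def best_report(true_p, grid_size=999):
    """Grid-search the report that maximizes the expected score.

    The grid is 0.001, 0.002, ..., 0.999 by default.
    """
    grid = [i / (grid_size + 1) for i in range(1, grid_size + 1)]
    return max(grid, key=lambda q: expected_log_score(true_p, q))

# If you are really 70% sure, the best thing to report is 70%.
honest = best_report(0.7)
```

The expected score is strictly concave in the reported probability and peaks at the true one, so shading your report in either direction costs you on average. This says nothing yet about the multiplicity penalty; that variant would need its own check.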
I think the primary property you want to maintain is that it’s best to provide the answer you consider most likely, otherwise it’s best to say ‘sdfkhasflk’ − 0% to all of them you aren’t certain of.
Multiple choice would make the scoring clearer, but that constraint could well make the calibration easier.