Calibration Score
Using a log scoring rule, I calculated a total accuracy+calibration score for the ten questions together. One issue is that this assumes the questions are binary when they're not: someone who is 0% sure that Thor is the right answer to the mythology question gets the same score (0) as someone who is 100% sure that Odin is the right answer. I ignored infinitely low scores for the correlation part.
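A minimal sketch of the rule as I reconstruct it (probabilities written as fractions rather than percents; the function is illustrative, not the actual spreadsheet):

    from math import log10

    def binary_score(correct, p):
        # Per-question score: base-10 log of the probability placed on the
        # truth -- p if the stated answer is right, 1 - p if it is wrong.
        return log10(p) if correct else log10(1.0 - p)

    # The binary-question problem: 0% on Thor and 100% on Odin both put
    # probability 1 on the truth, so both score log10(1) = 0.
    print(binary_score(False, 0.0))  # 0.0
    print(binary_score(True, 1.0))   # 0.0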
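The arithmetic behind those figures, under the base-10 rule sketched above:

    from math import log10

    print(log10(0.01))        # "1" meant as certainty but read as 1%: -2 points
    print(log10(1 - 0.003))   # 0.3% on a wrong answer: about -0.0013 points
    print(10 ** (-18 / 10))   # a -18 total over ten questions: about 1.6% per right answer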
When I drop the 30 lowest scorers, the direction of the relationship flips: now people with better log scores (i.e. closer to 0) give lower probabilities for MWI (with a text answer counting as a probability of 0, as most of those answers were complaints that asking for a number didn't make sense).
What about Tragic Mistakes? These are people who assign 0% probability to a correct answer, or 100% probability to a wrong one, and who under a log scoring rule lose infinitely many points. Checking those showed both kinds of error, and also highlighted that several of the ‘wrong’ answers were spelling mistakes: I probably would have accepted “Oden” and “Mitocondria.”
(Amusingly, the person with the most tragic mistakes (9 of them) supplied a probability for their answers instead of an answer, so they were 100% sure that the Battle of Trafalgar was fought off the coast of 100, which was also the state where Obama was born.)
There’s a tiny decline in tragic mistakes as P(MWI) increases, but I don’t think I’d be confident in drawing conclusions from this data.
I’ve always wanted to visit 100.
Can you show the distribution of overall calibration scores? You only talked about the extreme cases and the differences across P(MWI), but you clearly have it.
Picture included, tragic mistakes excluded*. The percentage at the bottom is a mapping from the score to probabilities using the inverse of “if you had answered every question right with probability p, what score would you have?”, and so is not anything like the mean probability given. Don’t take either of the two perfect scores seriously; as mentioned in the grandparent, this scoring rule isn’t quite right because it counts answering incorrectly with 0% probability as the same as answering correctly with 100% probability. (One answered ‘asdf’ to everything with 0% probability, the other left 9 blank with 0% probability and answered Odin with 100% probability.) Bins have equal width in log-space.
* I could have had a spike at 0, but that seems not quite fair, since it was specified that ‘100’ and ‘0’ would be treated as ‘100-epsilon’ and ‘epsilon’ respectively, and it’s only a Tragic Mistake if you actually answer 0 instead of epsilon.
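A sketch of the score-to-probability mapping used for the axis, assuming the base-10 rule and ten questions, so that a total score s is converted back to the per-question probability p solving s = 10 * log10(p):

    def axis_probability(total_score, n_questions=10):
        # Inverse of "if you had answered every question right with
        # probability p, what total score would you have?"
        return 10 ** (total_score / n_questions)

    print(axis_probability(-18))  # ~0.016, the worst scorer's ~1.6%
    print(axis_probability(0))    # 1.0, a 'perfect' score maps to 100%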
Yeah, that’s not a particularly strong scoring method, given how easily it can be abused. I wonder what a better one would be? Of course, it wouldn’t help unless people knew it was going to be used, and cared.
Fraction correct times this calibration score? Number correct times the product rather than the average of what you did there? Bayes score, with naming the ‘wrong’ thing yielding a penalty to account for the multiplicity of wrong answers (say, each wrong answer has a 50% hit so even being 100% sure you’re wrong is only as good as 50% sure you’re right, when you are right)?
The primary property you want to maintain with a scoring rule is that the best probability to provide is your true probability. I know that the Bayes score generalizes to multiple choice questions, which implies to me that it most likely works with a multiplicity for wrong answers, so long as the multiplicity is close to the actual multiplicity.
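A quick numerical check of that property for the binary log score, which is a proper scoring rule: if your answer is right with probability q, your expected score is maximized by reporting p = q.

    from math import log10

    def expected_score(q, p):
        # Expected per-question log score when you are right with
        # probability q and report probability p.
        return q * log10(p) + (1 - q) * log10(1 - p)

    q = 0.7
    best = max((i / 100 for i in range(1, 100)), key=lambda p: expected_score(q, p))
    print(best)  # 0.7 -- reporting your true probability scores best in expectation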
I think the primary property you want to maintain is that it’s best to provide the answer you consider most likely; otherwise it’s best to answer ‘sdfkhasflk’ with 0% probability on every question you aren’t certain of.
Multiple choice would make the scoring clearer, but that constraint could well make the calibration easier.
Sort-of related question: How do you compute calibration scores?
I was using a logarithmic scoring rule, with a base of 10. (What base you use doesn’t really matter.) The Excel formula for the first question (I’m pretty sure I didn’t delete any columns, so it should line up) was:
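(A Python sketch of what I believe that formula computed, rather than the original spreadsheet formula; the epsilon value is an assumption, and the alternative of scoring 0%-on-a-right-answer or 100%-on-a-wrong-answer as a Tragic Mistake is noted in the comments.)

    from math import log10

    EPSILON = 0.1  # assumed value; the survey only specified that 0 and 100
                   # would be treated as epsilon and 100 - epsilon

    def question_score(answer, correct_answer, stated_percent):
        # Base-10 log of the probability placed on the truth, with stated
        # percentages of 0 and 100 nudged to epsilon and 100 - epsilon.
        # (Scoring 0-on-a-right-answer or 100-on-a-wrong-answer as negative
        # infinity instead is the 'Tragic Mistake' reading discussed above.)
        p = min(max(stated_percent, EPSILON), 100 - EPSILON) / 100.0
        return log10(p) if answer == correct_answer else log10(1 - p)

    print(question_score("Odin", "Odin", 90))   # right, 90% sure: about -0.046
    print(question_score("Thor", "Odin", 100))  # wrong but certain: about -3 with this epsilon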