Feedback and suggestions for improvement are very welcome!
It’s true that someone can easily get an excellent calibration score at the cost of getting no points. This tends to be very obvious when you read out the leaderboard. A quick patch is to turn all the questions into statements and have people estimate how likely each statement is to be true. “What is the element with atomic number 29?” becomes “The element with atomic number 29 is copper.” Then there is no easy path to excellent scores of either kind.
That version is a little less fun, and I don’t think the change is necessary. I’m curious whether that patch would satisfy your objection. It might be relevant that I don’t view the goal as measuring calibration but as training it. When I’ve run this, I often see a rapid change in confidences over the first dozen questions, as some people who hadn’t previously practiced the skill begin to use numbers other than the highest and lowest available.
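As a rough illustration of the failure mode in the open-question version, here is a minimal Python sketch. The scoring rule is an assumption, not necessarily the one the game uses: calibration error is taken to be the mean absolute gap between stated confidence and observed hit rate within each confidence bucket, and points are simply the number of correct answers. The players and their answers are hypothetical.

```python
from collections import defaultdict

def calibration_error(answers):
    """Mean absolute gap between stated confidence and hit rate, per bucket.

    `answers` is a list of (confidence, was_correct) pairs.
    0.0 means perfectly calibrated under this (assumed) scoring rule.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in answers:
        buckets[confidence].append(was_correct)
    gaps = [abs(conf - sum(hits) / len(hits)) for conf, hits in buckets.items()]
    return sum(gaps) / len(gaps)

def points(answers):
    """Trivia points: one per correct answer (assumed rule)."""
    return sum(1 for _, was_correct in answers if was_correct)

# Hypothetical player who writes a deliberately wrong answer every time
# and claims 0% confidence: perfectly calibrated, but scores nothing.
sandbagger = [(0.0, False)] * 10

# Hypothetical honest player: 90% confident on ten questions, seven right.
honest = [(0.9, True)] * 7 + [(0.9, False)] * 3

for name, answers in [("sandbagger", sandbagger), ("honest", honest)]:
    print(name, round(calibration_error(answers), 2), points(answers))
```

Under these assumptions the sandbagger gets a perfect calibration error with zero points, which is why the pattern stands out when both columns of the leaderboard are read together.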
Sure, that patch wouldn’t have the problem I described.
Anyway, do whatever works for you. If you find this exercise helps people train their calibration, then I suppose that’s a good thing. My main point is not to take too seriously what this method tells us about who is “best” at calibration. I take it you’re saying people already don’t take it seriously when someone is doing badly at the trivia portion, but I think the failure mode is a bit more general than that. Anyway, I guess it doesn’t matter too much.