A class I took in graduate school worked this way; here’s the professor’s paper about it. Some notes on how it worked:
He used the logarithmic scoring rule, normalized so that a maximum-entropy (uniform) guess was worth 0 points.
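Concretely, one way to get that normalization (the base and constants here are my assumption, not taken from the paper): with n options and probability p placed on the correct answer, score log2(n·p), so a uniform 1/n guess earns exactly 0.

```python
import math

def score(p_correct: float, n_options: int) -> float:
    """Log score normalized so a uniform (maximum-entropy) guess is 0.

    p_correct: probability the student assigned to the correct option.
    n_options: number of options on the question.
    Returns log2(n * p): 0 for a uniform guess, log2(n) for full
    confidence on the right answer, unboundedly negative as p -> 0.
    """
    return math.log2(n_options * p_correct)

# On a 4-option question:
# score(0.25, 4) == 0.0     (maxent guess)
# score(1.00, 4) == 2.0     (fully confident and correct)
# score(0.05, 4) ≈ -2.32    (nearly sure of a wrong answer)
```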
It takes students a while to learn calibration, and so it’s worth doing many small-stakes versions of this before doing large-stakes versions of it. (The way he did this—one question as a homework assignment each week, and then one or two large exams—didn’t do all that well for this, especially since the homework assignments didn’t fully replicate the “how well can I interpret the question without asking for clarification?” part of the uncertainty that was relevant on tests.)
Getting probabilities from the students lets you generate average probabilities for each answer, which is actually quite useful for figuring out where the class is confused. Importantly, you can tell the difference between a question where the average estimate on the right answer is 90% and one where the average estimate on the right answer is 50%, even though both of those will look almost identical in the world where students only choose their top answer!
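A rough sketch of that diagnostic (the data layout here is my own, not the course's): if each student submits a probability vector per question, you just average the vectors elementwise and look at where the mass sits.

```python
def average_distribution(student_answers):
    """Average the per-option probabilities submitted by all students.

    student_answers: list of probability vectors, one per student,
    e.g. [[0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1], ...]
    Returns the elementwise mean, which shows where the class's
    probability mass actually sits for this question.
    """
    n_students = len(student_answers)
    n_options = len(student_answers[0])
    return [
        sum(dist[i] for dist in student_answers) / n_students
        for i in range(n_options)
    ]

# A class averaging 0.9 on the correct option is in good shape; one
# averaging 0.5 on it (with the rest spread over distractors) is
# confused, even if both classes would "pick" the right answer when
# forced to bubble in only one.
```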
As a student, did you experience any particular frustrations with this approach?
I mean, I personally was quite overconfident on the first midterm. ;) The primary reason was explicitly thinking it through and deciding that I wasn’t risk-neutral when it came to points; I cared more about having ‘the highest score’ than maximizing my expected score.
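To make that tradeoff concrete (the numbers here are made up for illustration, and I'm assuming the log2(n·p) normalization from above): if you genuinely believe the right answer has probability 0.8, reporting 0.8 maximizes your expected score, while reporting 0.95 lowers the expectation but raises both the ceiling and the variance.

```python
import math

def expected_score(belief: float, report: float, n_options: int = 4) -> float:
    """Expected normalized log score when your true belief in your top
    answer is `belief` but you report `report` on it, spreading the
    remaining mass evenly over the other options."""
    other = (1 - report) / (n_options - 1)
    return (belief * math.log2(n_options * report)
            + (1 - belief) * math.log2(n_options * other))

# Honest report:        expected_score(0.8, 0.8)  ≈ 0.96 points
# Overconfident report: expected_score(0.8, 0.95) ≈ 0.76 points
# The overconfident report earns ~1.93 when right but ~-3.91 when
# wrong: a lower expectation, but a higher ceiling, which is what you
# want if you only care about having the top score.
```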
It also takes students a bit longer to answer each question: rather than just bubbling in a single oval, you have to think about how to budget your probability across the options. And it's slightly harder for the teacher to process the answers into grades. But I think it more than pays for itself in the increased expressiveness.
The fact that it's harder for the teacher to process seems like a consequence of poor software support rather than anything inherent. Ideally you would want to automate the whole process.
If you have a digital exam, this works fine; if you want students to write things with pencil and paper, then you need to somehow turn the pencil marks into numbers that can be plugged into a simple spreadsheet.
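As a sketch of that last step (the file layout, column names, and probability floor are hypothetical, not anything the course actually used): once the marks are transcribed into per-question probabilities, the scoring itself is only a few lines.

```python
import csv
import math

def grade_exam(csv_path: str, answer_key: dict, n_options: int = 4) -> dict:
    """Total each student's score from a CSV with rows like:
        student,question,option,probability
    answer_key maps question id -> correct option. A student's total
    is the sum of log2(n * p) over questions, where p is the
    probability they put on the correct option.
    """
    totals = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["option"] != answer_key[row["question"]]:
                continue  # only the probability on the correct option is scored
            # floor p to avoid an unbounded penalty when a student puts 0 on the truth
            p = max(float(row["probability"]), 1e-6)
            totals.setdefault(row["student"], 0.0)
            totals[row["student"]] += math.log2(n_options * p)
    return totals
```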