I disagree with your first point; I think the 50:25:25:0 thing is the point. It’s hard to swallow because admitting ignorance rather than appearing falsely confident always is, but that’s why it makes for such a good value to train.
But if my genuine confidence levels are 50:50:0:0, it seems unfair that I score less than someone whose genuine confidence levels are 50:25:25:0. We both put the same probability on the correct answer, so why do they score more?
I don’t think this is a knock-down argument, but:
Suppose the students are subsequently told, by someone whom they trust but who happens to be wrong, that the answer isn’t A.
The 50:50:0:0 student says “okay, then it must be B”. The 50:25:25:0 student says “okay, then it must be B or C, 50% on each”. And the 50:17:17:17 student says “okay, then I don’t know”.
I don’t think these responses are equally good, and I don’t think they should be rewarded equally. The second student is more confused by fiction than the first, and the third is more confused again.
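A minimal sketch of the three responses, assuming each student simply conditions on "not A" and renormalises whatever credence is left. The function name `condition_on_not` is made up for illustration, and 17 is treated as shorthand for 1/6.

```python
from fractions import Fraction

def condition_on_not(credences, ruled_out):
    """Drop the ruled-out option and renormalise the rest (Bayesian conditioning)."""
    remaining = {k: v for k, v in credences.items() if k != ruled_out}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no credence left on any remaining option")
    return {k: v / total for k, v in remaining.items()}

# Exact fractions; 17 in the thread is read as 1/6 (an assumption of this sketch).
first = {"A": Fraction(1, 2), "B": Fraction(1, 2), "C": Fraction(0), "D": Fraction(0)}
second = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 4), "D": Fraction(0)}
third = {"A": Fraction(1, 2), "B": Fraction(1, 6), "C": Fraction(1, 6), "D": Fraction(1, 6)}

for label, student in [("50:50:0:0", first), ("50:25:25:0", second), ("50:17:17:17", third)]:
    posterior = condition_on_not(student, "A")
    print(label, {k: str(v) for k, v in posterior.items()})
```

The first student ends up with everything on B, the second with half each on B and C, and the third with a third on each of B, C and D.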
That said, to give a concrete example: what is 70*80? Is it 5600, 5400, 56000, or 3? By the way, it’s not 5600.
Obviously the best response here is “um, yes it is”. But I still feel like someone who gives equal weight to 3 as to 5400 is… either very confident in their skills, or very confused. I think my intuition is that I want to reward that student less than the other two, which goes against both your answer (reward them all equally) and my answer above (reward that student the most).
But I can’t really imagine someone honestly giving 50:17:17:17 to that question. Someone who gave equal scores to the last three answers probably gave something like either 100:0:0:0 (if they’re confident) or 25:25:25:25 (if they’re confused), and gets a higher or lower reward from that. So I dunno what to make of this.
This makes sense to me.
I think that to do this, instead of preferring certain ratios between answers, we should prefer certain answers.
Under the original scoring scheme 50:50:0:0 doesn’t score differently from 50:0:50:0 or 50:0:0:50. The average credence for each answer between those 3 is 50:17:17:17 so I’d argue that (without some external marking of which incorrect answers are more reasonable) 50:50:0:0 should score the same as 50:17:17:17.
However we could choose a marking scheme where you get back (using my framing of log scoring above):
100% of the points put on A
10% of the points put on B
10% of the points put on C
0% of the points put on D
That way 50:50:0:0 and 50:25:25:0 both end up with 55% of their points but 50:17:17:17 gets 53.4% and 50:0:0:50 gets 50%. Play around with the percentages to get rewards that seem reasonable—I think it would still be a proper scoring rule*. You could do something similar with a quadratic scoring rule.
*I think one danger is that if I am unsure but I think I can guess what the teacher thinks is reasonable/unreasonable then this might tempt me to alter my score based on something other than my actual credence levels.
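To check the arithmetic, here is a rough sketch of the "points returned" step only. It assumes the 100%/10%/10%/0% weights listed above and does not reproduce the log-scoring framing those points are then fed into; the names `PAYOUT` and `points_returned` are illustrative.

```python
# Fraction of the points placed on each option that the student gets back.
# These weights are the example values from the comment, not a recommendation.
PAYOUT = {"A": 1.0, "B": 0.1, "C": 0.1, "D": 0.0}

def points_returned(allocation):
    """allocation maps option -> points placed on it (out of 100)."""
    return sum(PAYOUT[option] * points for option, points in allocation.items())

allocations = {
    "50:50:0:0": {"A": 50, "B": 50, "C": 0, "D": 0},
    "50:25:25:0": {"A": 50, "B": 25, "C": 25, "D": 0},
    "50:17:17:17": {"A": 50, "B": 17, "C": 17, "D": 17},
    "50:0:0:50": {"A": 50, "B": 0, "C": 0, "D": 50},
}

for label, allocation in allocations.items():
    print(label, round(points_returned(allocation), 1))
```

Changing the entries of `PAYOUT` is the "play around with the percentages" step.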
[… why do they score more?]
I’m not sure if these are good reasons, but it seems to me that
1) The expected answer to the quiz does not just consist in identifying A as a correct answer but also in identifying the others as incorrect answers. I mean that the expected right answer is 100:0:0:0 (and not, for example, 100:50:0:0 or whatever else).
2) Giving 25:25 for B:C is better than giving 50:0, even though the latter puts 0 on C, since 25:25 is closer to 0:0 than 50:0 is (for the usual Euclidean distance). From this perspective, a better answer for the 50:50:0:0 student would have been 50:25:0:0, which is in turn better than 50:25:25:0.
Indeed, 1 - [(1-1/2)^2 + (1/4)^2 + 0^2 + 0^2] > 1 - [(1-1/2)^2 + (1/4)^2 + (1/4)^2 + 0^2] > 1 - [(1-1/2)^2 + (1/2)^2 + 0^2 + 0^2].
3) From this perspective, I am not sure that encouraging students to make their answers sum to 100 is a good idea. It seems better for the student who is answering to consider each proposition (i.e., A, B, C or D) separately (this relates to point 1 of my message). For each proposition, the answer should reflect the person’s credence that the proposition is correct. This could therefore also be applied to a multiple-choice quiz with zero, or more than one, correct answers.
EDIT (added):
To sum up what I think could be an answer to your question in this case: with the “quadratic scoring rule”, if the expected answer for A:B:C:D is 100:0:0:0, then answer 1) 50:25:0:0 scores more than answer 2) 50:50:0:0. Both are right for C and D, and both are at the same distance from the expected answer for A, but 1) is closer than 2) to the expected answer for B (which is 0).
The same reasoning works for comparing 1′) 50:25:25:0 with 2′) 50:50:0:0, except that in this second case it is the overall distance (for the quadratic scoring rule) of 25:25 (for B:C) that is closer to 0:0 than 50:0 is.
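The comparison can be checked numerically. This sketch assumes the quadratic score is 1 minus the squared Euclidean distance to the expected answer 100:0:0:0, and it does not require reports to sum to 1 (50:25:0:0 does not). The function name is illustrative.

```python
def quadratic_score(report, correct_index=0):
    """1 minus the squared Euclidean distance between the report and the ideal answer."""
    ideal = [1.0 if i == correct_index else 0.0 for i in range(len(report))]
    return 1.0 - sum((r - t) ** 2 for r, t in zip(report, ideal))

# Scores when A is the correct answer.
score_50_25_0_0 = quadratic_score([0.5, 0.25, 0.0, 0.0])    # 0.6875
score_50_25_25_0 = quadratic_score([0.5, 0.25, 0.25, 0.0])  # 0.625
score_50_50_0_0 = quadratic_score([0.5, 0.5, 0.0, 0.0])     # 0.5

print(score_50_25_0_0, score_50_25_25_0, score_50_50_0_0)
```

These are scores for the case where A turns out to be correct; they say nothing yet about what each report is expected to score before the answer is known.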
Maybe 1) is where I have a fundamental difference.
Given evidence A, a Bayesian update considers how well evidence A was predicted.
There is no additional update due to how well ¬A being false was predicted. Even if ¬A is split into sub-categories, it isn’t relevant as that evidence has already been taken into account when we updated based on A being true.
Re 2): 50:25:0:0 gives a worse expected value than 50:50:0:0 because, although my score increases if A is true, it decreases by more if B is true (assuming 50:50:0:0 is my true belief).
Re 3): I think it’s important to note that I’m assuming that exactly one of A, B, C or D is the correct answer. The probabilities should therefore add up to 100% to maximise your expected score (otherwise it isn’t a proper scoring rule).
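A sketch of the expected-value point, assuming the quadratic rule (1 minus squared distance to the indicator of the correct answer) and a true belief of 50:50:0:0. Helper names are illustrative.

```python
def quadratic_score(report, correct_index):
    """1 minus the squared distance between the report and the indicator of the correct answer."""
    ideal = [1.0 if i == correct_index else 0.0 for i in range(len(report))]
    return 1.0 - sum((r - t) ** 2 for r, t in zip(report, ideal))

def expected_score(belief, report):
    """Average score of `report`, weighting each possible correct answer by `belief`."""
    return sum(p * quadratic_score(report, i) for i, p in enumerate(belief))

belief = [0.5, 0.5, 0.0, 0.0]  # what I actually think

honest = expected_score(belief, [0.5, 0.5, 0.0, 0.0])     # 0.5
shaded = expected_score(belief, [0.5, 0.25, 0.0, 0.0])    # 0.4375
hedged = expected_score(belief, [0.5, 0.25, 0.25, 0.0])   # 0.375

print(honest, shaded, hedged)
```

Reporting 50:25:0:0 gains 0.1875 if A is true and loses 0.3125 if B is true, so under this belief the honest report has the highest expected score of the three.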
Try to think about this in terms of expected value. On your specific example, they do score more, but this is probabilistic thinking, so we want to think about it in terms of the long run trend.
Suppose we no longer know what the answer is, and you are genuinely 50/50 on it being either A or B. This is what you truly believe; you don’t think there’s a chance in hell it’s C. If you sit there and ask yourself, “Maybe I should do a 50-25-25 split, just in case”, you’re going to immediately realize, “Wait, that’s moronic. I’m throwing away 25% of my points on something I am certain is wrong. This is like betting on a 3-legged horse.”
Now let’s say you do a hundred of these questions, and most of your 50-50s come up correct-ish as one or the other. Your opponent consistently does 50-25-25s, and so they end up more wrong than you overall, because half the time the answer lands on one of their two 25s, not their single 50.
It’s not a game of being more correct, it’s a game of being less wrong.
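A rough simulation of the hundred-question scenario, under assumptions that are mine rather than the thread's: the quadratic rule is used for scoring, the true answer is A or B with equal chance on every question, and the seed is fixed only to make the run repeatable.

```python
import random

def quadratic_score(report, correct_index):
    """1 minus the squared distance between the report and the indicator of the correct answer."""
    ideal = [1.0 if i == correct_index else 0.0 for i in range(len(report))]
    return 1.0 - sum((r - t) ** 2 for r, t in zip(report, ideal))

def simulate(num_questions=100, seed=0):
    rng = random.Random(seed)
    you = [0.5, 0.5, 0.0, 0.0]         # honest 50-50
    opponent = [0.5, 0.25, 0.25, 0.0]  # 50-25-25 "just in case"
    your_total = opponent_total = 0.0
    for _ in range(num_questions):
        correct = rng.choice([0, 1])  # the answer really is A or B, equally likely
        your_total += quadratic_score(you, correct)
        opponent_total += quadratic_score(opponent, correct)
    return your_total, opponent_total

your_total, opponent_total = simulate()
print(your_total, opponent_total)
```

The opponent beats you on every question where the answer is A and loses by more on every question where it is B, so over many questions their total falls behind.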
I think all of this is also true of a scoring rule based on only the probability placed on the correct answer?
In the end you’d still expect to win but this takes longer (requires more questions) under a rule which includes probabilities on incorrect answers—it’s just adding noise to the results.
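For comparison, a sketch of a rule that looks only at the probability placed on the correct answer. It assumes the log score in base 2 and the same 50/50 belief over A and B; it illustrates the expected-value part of the claim, not the claim about noise.

```python
import math

def log_score(report, correct_index):
    """Log base 2 of the probability placed on the correct answer; other entries are ignored."""
    return math.log2(report[correct_index])

def expected_log_score(belief, report):
    """Average log score, skipping outcomes the belief considers impossible."""
    return sum(p * log_score(report, i) for i, p in enumerate(belief) if p > 0)

belief = [0.5, 0.5, 0.0, 0.0]
you = [0.5, 0.5, 0.0, 0.0]
opponent = [0.5, 0.25, 0.25, 0.0]

# If A turns out to be correct, both reports score the same under this rule.
same_when_a = log_score(you, 0) == log_score(opponent, 0)

your_expected = expected_log_score(belief, you)            # -1.0
opponent_expected = expected_log_score(belief, opponent)   # -1.5

print(same_when_a, your_expected, opponent_expected)
```

So the 50-25-25 hedge still loses in expectation here, even though it is never rewarded on the questions where the answer is A.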