Great post, I’d be really interested to hear how this goes down with students.
I would be cautious about using information from incorrect answers to calculate the score: just using the percentage given for the correct answer still gives a proper scoring rule. If percentages placed on incorrect answers are included, then 50:25:25:0 gets more points than 50:50:0:0 when the answer is A, which I think people might find hard to swallow.
For a proper scoring rule I find a particular framing of a log score to be intuitive—instead of adding the logs of the probabilities placed on the correct answers, just multiply out the probabilities.
This can be visualised as having a heap of points and having to spread them all across the 4 possible answers. You lose the points that were placed on the wrong answers and then use your remaining points to repeat the process for the next question. Whoever has the most points left at the end has done the best. The £100k drop is a game show which is based on this premise.
I personally find this to be an easy visualisation, with the added benefit that the scores have a specific Bayesian interpretation: the ratio of students’ scores represents the likelihood function of who knows the subject best, based on the evidence of that exam.
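To make the "heap of points" framing concrete, here is a minimal Python sketch. The two students and their per-question fractions are made up for illustration; the point is only that multiplying out the fractions placed on the correct answers gives the same ranking as adding their logs, and that the ratio of final heaps is the likelihood ratio between the students.

```python
import math

def remaining_points(start, fractions_on_correct):
    """Multiply the starting heap by the fraction placed on each correct answer."""
    points = start
    for p in fractions_on_correct:
        points *= p
    return points

def log_score(fractions_on_correct):
    """Usual additive log score over the same questions."""
    return sum(math.log(p) for p in fractions_on_correct)

# Hypothetical data: fraction each student put on the correct answer, per question.
alice = [0.5, 0.8, 0.6]
bob = [0.4, 0.9, 0.3]

alice_points = remaining_points(100.0, alice)
bob_points = remaining_points(100.0, bob)

# The ratio of heaps equals exp(difference of log scores), i.e. a likelihood ratio.
ratio = alice_points / bob_points
print(alice_points, bob_points, ratio)
```

Because the starting heap cancels in the ratio, only the relative sizes of the final heaps carry information.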
It seems to me like a good lesson.
It seems to me that “it’s a coin toss between A and B (C and D are impossible)” is validly scored less well than “A is twice as likely as B and A is twice as likely as C”.
Why?
50:50:0:0 says it’s a coin toss between A and ¬A. If ¬A then B.
50:25:25:0 says it’s a coin toss between A and ¬A. If ¬A then it’s a coin toss between B and C.
Why should the scoring rule care about what my rule is for ¬A when A is the correct answer?
I’m genuinely curious—I notice you’re the second person to voice this opinion but I can’t get my head round it at all.
(As with my reply to aaq, this all assumes that these are genuine confidence levels)
I have done quite a lot of credence calibration; I’m curious to what extent you have trained it. I have the impression that my intuitions here are informed by experience.
Most normal testing focuses on finding the correct answer. Credence training focuses on accurately having a sense of your own knowledge.
There’s a sense in which claiming that one’s genuine confidence level is zero feels repulsive, given that zero isn’t a probability.
Humans use the availability heuristic to compare different options. The fact that A feels more likely than B and A feels more likely than C seems to me like a form of knowledge that’s worth rewarding.
I have done some credence training but I think my instincts here are more based on Maths and specifically Bayes (see this comment).
I think the zero probability thing is a red herring—replace the 0s with ϵ and the 50s with 50-ϵ and you get basically the same thing. There are some questions where keeping track of the ϵ just isn’t worth it.
A proper scoring rule is designed to reward both knowledge and accurate reporting of credences. This is achieved if we score based on the correct answer, whether or not we also score based on the probabilities of the wrong answers.
If we also attempt to optimise for certain ratios between credences of different answers then this is at the expense of rewarding knowledge of the correct answer.
If Alice has credence levels of 50:50:ϵ:ϵ and Bob has 40:20:20:20 and the correct answer is A then Bob will get a higher score than Alice despite her putting more of her probability mass on the correct answer.
Do you consider this a price worth paying to reward having particular ratios between credences?
[....is at the expense of rewarding knowledge of the correct answer.]
Hmm… I’m not sure that Alice has really more knowledge than Bob in your example.
[EDIT: In fact, in your example, for the quadratic scoring rule, the score of 50:50:$\epsilon$:$\epsilon$ is better than the score of 40:20:20:20 since $12/25 < 1/2 + 2\epsilon^2$, so that we can indeed say that Alice has more knowledge than Bob after this rule. The following example is, IMHO, more interesting. /EDIT]
Let me propose another perspective with the following two answers for propositions A:B:C:D:
1) 50:50:0:0
2) 50:25:25:0,
where the correct answer is 100:0:0:0.
In this case, 2) has a better score than 1).
What does 1) know ? That D and C are false. He knows nothing for A and B.
What does 2) know ? That D is false. That C is not very probable. He does not know for A, like 1). But he does know moreover that B is probably not the right answer.
Suppose that 3) is someone who knows that D and C are false, and also knows that B is probably not the right answer (i.e., 3 has the knowledge of both 1 and 2). Then 3) could have given the answer 3a) 75:25:0:0, or the answer 3b) 62.5:37.5:0:0. These two answers score better than 1) and 2).
(Note that knowing that D and C are false and that B is probably not the right answer influences your knowledge about A.)
-----
For example, imagine that 2) first thinks of 50:25:25:0, but then he remembers that it cannot in fact be C. We can then compute the Bayesian update, and we get:
P(A | non C) = 2⁄3 (vs P(A) = 1⁄2)
P(B | non C) = 1⁄3 (vs P(B) = 1⁄4)
P(C | non C) = 0 (vs P(C) = 1⁄4)
P(D | non C) = 0 (vs P(D) = 0).
This is different from answer 1). In this sense, I think we can really say that 2) knows something that 1) does not know, even if 2) is not sure that C is false. Indeed, after an update of the information ‘non C’, the score of 2) becomes better than the score of 1). (2/3:1/3:0:0 has a better score than 1/2:1/4:1/4:0).
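The update and the score comparison above can be checked with exact fractions. This is only a sketch: `condition_on_not` and `quadratic_score` are illustrative helper names, and the score is assumed to be one minus the squared distance to the 0/1 outcome, with A (index 0) as the correct answer.

```python
from fractions import Fraction as F

def condition_on_not(probs, excluded):
    """Zero out the excluded option and renormalise the remaining mass."""
    kept = [F(0) if i == excluded else p for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def quadratic_score(probs, correct):
    """1 minus the squared distance between the report and the 0/1 outcome."""
    return 1 - sum((p - (1 if i == correct else 0)) ** 2
                   for i, p in enumerate(probs))

answer_1 = [F(1, 2), F(1, 2), F(0), F(0)]
answer_2 = [F(1, 2), F(1, 4), F(1, 4), F(0)]

# Answer 2 after learning "not C" (option index 2).
updated_2 = condition_on_not(answer_2, excluded=2)

print(updated_2)                       # 2/3, 1/3, 0, 0
print(quadratic_score(answer_1, 0))    # 1/2
print(quadratic_score(answer_2, 0))    # 5/8
print(quadratic_score(updated_2, 0))   # 7/9
```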
Sure, 2 knows something 1 doesn’t; e.g., 2 knows more about how unlikely B is. But, equally, 1 knows something 2 doesn’t; e.g., 1 knows more than 2 about how unlikely C is.
In the absence of any reason to think one of these is more important than the other, it seems reasonable to think that different probability assignments among the various wrong answers are equally meritorious and should result in equal scores.
… Having said that, here’s an argument (which I’m not sure I believe) for favouring more-balanced probability assignments to the wrong answers. We never really know that the right answer is 100:0:0:0. We could, conceivably, be wrong. And, by hypothesis, we don’t know of any relevant differences between the “wrong” answers. So we should see all the wrong answers as equally improbable but not quite certainly wrong. And if, deep down, we believe in something like the log scoring rule, then we should notice that a candidate who assigns a super-low probability to one of those “wrong” answers is going to do super-badly in the very unlikely case that it’s actually right after all.
So, suppose we believe in the log scoring rule, and we think the correct answer is the first one. But we admit a tiny probability h for each of the others being right. Then a candidate who gives probabilities a,b,c,d has an expected score of (1-3h) log a + h (log b + log c + log d). Suppose one candidate says 0.49,0.49,0.01,0.01 and the other says 0.4,0.2,0.2,0.2; then we will prefer the second over the first if h is bigger than about 0.0356. In a typical educational context that’s unlikely so we should prefer the first candidate. Now suppose one says 0.49,0.49,0.01,0.01 and the other says 0.49,0.25,0.25,0.01; we should always prefer the second candidate.
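For anyone who wants to check the 0.0356 figure, here is a small Python sketch of the expected log score described above. The function name is mine, and the crossover is found by using the fact that the difference between the two candidates is linear in h.

```python
import math

def expected_log_score(probs, h):
    """(1-3h) log a + h (log b + log c + log d), with the first option taken as correct."""
    a, b, c, d = probs
    return (1 - 3 * h) * math.log(a) + h * (math.log(b) + math.log(c) + math.log(d))

first = (0.49, 0.49, 0.01, 0.01)
second = (0.4, 0.2, 0.2, 0.2)
third = (0.49, 0.25, 0.25, 0.01)

def gain(h):
    """How much better the second candidate does than the first at this h."""
    return expected_log_score(second, h) - expected_log_score(first, h)

# gain(h) is linear in h, so two evaluations pin down where it crosses zero.
slope = (gain(0.1) - gain(0.0)) / 0.1
threshold = -gain(0.0) / slope
print(round(threshold, 4))
```

The third candidate matches the first on the correct answer, so the (1-3h) term cancels and only the h term decides the comparison; that is why the preference holds for every positive h.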
None of this means that the Brier score is the right way to prefer the second candidate over the first; it clearly isn’t, and if h is small enough then of course the correction to the naive log score is also very small, provided candidates’ probability assignments are bounded away from zero.
In practice, hopefully h is extremely small. And some wrong answers will be wronger than others and we don’t want to reward candidates for not noticing that, but we probably also don’t want the extra pain of figuring out just how badly wrong all the wrong answers are, and that is my main reason for thinking it’s better to use a scoring rule that doesn’t care what probabilities candidates assigned to the wrong answers.
Arguably the “natural” way to handle the possibility that you (the examiner) are in error is to score answers by (negative) KL-divergence from your own probability assignment. So if there are four options to which you assign probabilities p,q,r,s and a candidate says a,b,c,d then they get p log(a/p) + q log(b/q) + r log(c/r) + s log(d/s). If p=1 and q,r,s=0,0,0 then this is the same as giving them log a, i.e., the usual log-scoring rule. If p=1-3h and q,r,s=h,h,h then this is (1-3h) log (a/(1-3h)) + h log(b/h) + …, which if we fix a is constant + h (log b + log c + log d) = constant + h log bcd, which by the AM-GM inequality is biggest when b=c=d.
This differs from the “expected log score” I described above only by an additive constant. One way to describe it is: the average amount of information the candidate would gain by adopting your probabilities instead of theirs, the average being taken according to your probabilities.
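A minimal sketch of the negative-KL score, assuming the candidate puts nonzero probability wherever the examiner does (otherwise the log blows up). The function name and the sample numbers are illustrative only.

```python
import math

def neg_kl_score(examiner, candidate):
    """Sum of p * log(a / p) over options; options with p == 0 contribute nothing."""
    return sum(p * math.log(a / p) for p, a in zip(examiner, candidate) if p > 0)

# Examiner certain of the first option: reduces to the usual log score, log a.
certain = (1.0, 0.0, 0.0, 0.0)
print(neg_kl_score(certain, (0.5, 0.5, 0.0, 0.0)), math.log(0.5))

# Examiner admits h = 0.01 on each wrong option: with a fixed, an even spread
# over the wrong options beats an uneven one (the AM-GM point above).
hedged = (0.97, 0.01, 0.01, 0.01)
even = neg_kl_score(hedged, (0.4, 0.2, 0.2, 0.2))
uneven = neg_kl_score(hedged, (0.4, 0.3, 0.2, 0.1))
print(even > uneven)
```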
This is really interesting, thanks, not something I’d thought of.
If the teacher (or whoever set the test) also has a spread of credence over the answers then a Bayesian update would compare the values of P(A), P(B|¬A) and P(C|¬A and ¬B) [1] between the students and teacher. This is my first thought about how I’d create a fair scoring rule for this.
[1] P(D|¬A and ¬B and ¬C) = 1 for all students and teachers so this is screened off by the other answers.
The score for the 50:50:0:0 student is:
$1 - 0.5^2 - 0.5^2 - 0^2 - 0^2 = 0.5$
The score for the 40:20:20:20 student is:
$1 - 0.6^2 - 0.2^2 - 0.2^2 - 0.2^2 = 0.52$
I think the way you’ve done it is the Brier rule, which is (1 - the score from the OP). Under the Brier rule the lower value is better.
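The two conventions are easy to mix up, so here is a short Python sketch of both. It assumes the OP's score is one minus the squared distance to the 0/1 outcome and that the Brier-style loss is just that squared distance; the function names are mine.

```python
def quadratic_score(probs, correct):
    """Higher is better: 1 minus the squared distance to the 0/1 outcome."""
    return 1.0 - sum((p - (1.0 if i == correct else 0.0)) ** 2
                     for i, p in enumerate(probs))

def brier_loss(probs, correct):
    """Lower is better: the squared distance itself, i.e. 1 - quadratic_score."""
    return 1.0 - quadratic_score(probs, correct)

alice = [0.5, 0.5, 0.0, 0.0]
bob = [0.4, 0.2, 0.2, 0.2]

# A is the correct answer (index 0).
print(quadratic_score(alice, 0), quadratic_score(bob, 0))
print(brier_loss(alice, 0), brier_loss(bob, 0))
```

Either way you read it, Bob comes out ahead of Alice on this question: his score is higher and his loss is lower.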
Note that linear utility in money would again incentivize people to put everything on the largest probability.
This is true if scores from different questions are added but not if they are multiplied. Linear scoring with multiplication is exactly the same as log scoring with addition, just easier to visualise (at least to me)
Wrong. In the 100k drop, if you know each question has odds 60:40, expected winnings are maximized if you put all on one answer each time, not 60% on one and 40% on the other.
What’s not preserved between the two ways to score is which strategy maximizes expected score.
I think the 100k drop analogy may be misleading when thinking about the final result. The final score in the version I envisage is judged on ratios between results, rather than absolute values (my explanation maybe isn’t clear enough on this). In that case putting everything on the answer in which you have 60% confidence, and being right, gives a ratio of 1.67 in your favour over honest reporting. But if you do it and get it wrong then there is an infinite ratio in favour of the honest reporting.
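Both sides of this exchange can be seen in a few lines of Python. The sketch assumes a single two-way question with true odds 60:40 and compares two objectives: expected points kept (the game-show view) and expected log of points kept (the ratio view). Function names and the stake grid are mine.

```python
import math

def expected_points(stake_on_likely, p_likely=0.6):
    """Expected fraction of the heap kept after one question."""
    return p_likely * stake_on_likely + (1 - p_likely) * (1 - stake_on_likely)

def expected_log_points(stake_on_likely, p_likely=0.6):
    """Expected log of the fraction kept; what matters when scores are compared as ratios."""
    return (p_likely * math.log(stake_on_likely)
            + (1 - p_likely) * math.log(1 - stake_on_likely))

# Stakes strictly between 0 and 1, so the logs stay finite.
stakes = [i / 100 for i in range(1, 100)]

best_for_log = max(stakes, key=expected_log_points)
print(expected_points(1.0), expected_points(0.6))  # all-in wins on expected points
print(best_for_log)                                # honest 60:40 wins on expected log
print(1.0 / 0.6)                                   # the 1.67 ratio when all-in is right
```

So linear expected winnings reward going all-in, while the log (or ratio) objective is maximised by reporting the true 60:40 split, and going all-in risks an unboundedly bad ratio.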
I disagree with your first point; I consider the 50:25:25:0 thing to be the point. It’s hard to swallow because admitting ignorance rather than appearing falsely confident always is, but that’s why it makes for such a good value to train.
But if my genuine confidence levels are 50:50:0:0 it seems unfair that I score less than someone whose genuine confidence levels are 50:25:25:0. We both put the same probability on the correct answer, so why do they score more?
I don’t think this is a knock-down argument, but:
Suppose the students are subsequently told, by someone whom they trust but who happens to be wrong, that the answer isn’t A.
The 50:50:0:0 student says “okay, then it must be B”. The 50:25:25:0 student says “okay, then it must be B or C, 50% on each”. And the 50:17:17:17 student says “okay, then I don’t know”.
I don’t think these responses are equally good, and I don’t think they should be rewarded equally. The second student is more confused by fiction than the first, and the third is more confused again.
That said, to give a concrete example: what is 70*80? Is it 5600, 5400, 56000, or 3? By the way, it’s not 5600.
Obviously the best response here is “um, yes it is”. But I still feel like someone who gives equal weight to 3 as to 5400 is… either very confident in their skills, or very confused. I think my intuition is that I want to reward that student less than the other two, which goes against both your answer (reward them all equally) and my answer above (reward that student the most).
But I can’t really imagine someone honestly giving 50:17:17:17 to that question. Someone who gave equal scores to the last three answers probably gave something like either 100:0:0:0 (if they’re confident) or 25:25:25:25 (if they’re confused), and gets a higher or lower reward from that. So I dunno what to make of this.
This makes sense to me.
I think to do this instead of preferring certain ratios between answers, we should prefer certain answers.
Under the original scoring scheme 50:50:0:0 doesn’t score differently from 50:0:50:0 or 50:0:0:50. The average credence for each answer between those 3 is 50:17:17:17 so I’d argue that (without some external marking of which incorrect answers are more reasonable) 50:50:0:0 should score the same as 50:17:17:17.
However we could choose a marking scheme where you get back (using my framing of log scoring above):
100% of the points put on A
10% of the points put on B
10% of the points put on C
0% of the points put on D
That way 50:50:0:0 and 50:25:25:0 both end up with 55% of their points but 50:17:17:17 gets 53.4% and 50:0:0:50 gets 50%. Play around with the percentages to get rewards that seem reasonable—I think it would still be a proper scoring rule*. You could do something similar with a quadratic scoring rule.
*I think one danger is that if I am unsure but I think I can guess what the teacher thinks is reasonable/unreasonable then this might tempt me to alter my score based on something other than my actual credence levels.
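Here is a quick Python sketch of that marking scheme for playing around with the percentages. The refund fractions are the ones listed above; `points_returned` is just an illustrative name, and allocations are given in percentage points.

```python
def points_returned(allocation, refund):
    """Points kept after one question: each option returns a fixed fraction of its stake."""
    return sum(a * r for a, r in zip(allocation, refund))

# Fraction of points handed back for stakes on A, B, C, D (A is correct).
refund = [1.0, 0.1, 0.1, 0.0]

for allocation in ([50, 50, 0, 0], [50, 25, 25, 0], [50, 17, 17, 17], [50, 0, 0, 50]):
    print(allocation, points_returned(allocation, refund))
```

This only reproduces the arithmetic; it does not check whether the resulting rule is proper.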
[… why do they score more?]
I’m not sure if these are good reasons, but it seems to me that
1) The expected answer to the quiz does not just consist in identifying A as a correct answer but also in identifying the others as incorrect answers. I mean that the expected right answer is 100:0:0:0 (and not, for example, 100:50:0:0 or whatever else).
2) Giving 25:25 for B:C is better than giving 50:0, even though the answer for C is 0, since 25:25 is closer to 0:0 than 50:0 is (for the usual Euclidean distance). In this perspective, a better answer for the 50:50:0:0 student would have been 50:25:0:0, which is better than 50:25:25:0.
Indeed, 1 - [(1-1/2)^2 + (1/4)^2 + 0^2 + 0^2] > 1 - [(1-1/2)^2 + (1/4)^2 + (1/4)^2 + 0^2] > 1 - [(1-1/2)^2 + (1/2)^2 + 0^2 + 0^2].
3) With this perspective, I am indeed not sure that encouraging a student’s answer to sum to 100 is a good idea. It seems better (for the student who is answering) to focus on each proposition (i.e., A, B, C or D) separately (related to point 1 of my message). For each proposition, the answer should reflect the person’s credence that the proposition is correct. This could therefore also be applied to a multiple-choice quiz with zero or more than one correct answer.
EDIT (added) :
To sum up what I think could in this case be an answer to your question: with the “quadratic scoring rule”, if the expected answer for A:B:C:D is 100:0:0:0, then the answer 1) 50:25:0:0 scores more than the answer 2) 50:50:0:0, because they are both right for C and D and they are at the same distance from the expected answer for A, but 1) is closer than 2) to the expected answer for B (which is 0).
The same reasoning works for comparing 1′) 50:25:25:0 with 2′) 50:50:0:0, except that in this second case, it is the general distance (for the quadratic scoring rule) of 25:25 (for B:C) which is closer to 0:0 than 50:0.
Maybe 1) is where I have a fundamental difference.
Given evidence A, a Bayesian update considers how well evidence A was predicted.
There is no additional update due to how well ¬A being false was predicted. Even if ¬A is split into sub-categories, it isn’t relevant as that evidence has already been taken into account when we updated based on A being true.
Re 2) 50:25:0:0 gives a worse expected value than 50:50:0:0: although my score increases if A is true, it decreases by more if B is true (assuming 50:50:0:0 is my true belief).
Re 3) I think it’s important to note that I’m assuming that exactly one of A, B, C or D is the correct answer. Therefore the probabilities should add up to 100% to maximise your expected score (otherwise it isn’t a proper scoring rule).
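The expected-value claim about 50:25:0:0 can be checked directly. This sketch assumes the quadratic score is one minus the squared distance to the 0/1 outcome, and takes the expectation under a true belief of 50:50:0:0; the helper names are mine.

```python
def quadratic_score(report, correct):
    """1 minus the squared distance between the report and the 0/1 outcome."""
    return 1.0 - sum((p - (1.0 if i == correct else 0.0)) ** 2
                     for i, p in enumerate(report))

def expected_score(report, belief):
    """Average score of a report, weighting each possible answer by the true belief."""
    return sum(b * quadratic_score(report, i) for i, b in enumerate(belief))

belief = [0.5, 0.5, 0.0, 0.0]
shaded = [0.5, 0.25, 0.0, 0.0]

print(quadratic_score(shaded, 0) - quadratic_score(belief, 0))  # gain if A is true
print(quadratic_score(shaded, 1) - quadratic_score(belief, 1))  # loss if B is true
print(expected_score(belief, belief), expected_score(shaded, belief))
```

The gain when A is true (0.1875) is smaller than the loss when B is true (0.3125), so the honest report has the higher expected score.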
Try to think about this in terms of expected value. On your specific example, they do score more, but this is probabilistic thinking, so we want to think about it in terms of the long run trend.
Suppose we no longer know what the answer is, and you are genuinely 50⁄50 on it being either A or B. This is what you truly believe, you don’t think there’s a chance in hell it’s C. If you sit there and ask yourself, “Maybe I should do a 50-25-25 split, just in case”, you’re going to immediately realize “Wait, that’s moronic. I’m throwing away 25% of my points on something I am certain is wrong. This is like betting on a 3-legged horse.”
Now let’s say you do a hundred of these questions, and most of your 50-50s come up correct-ish as one or the other. Your opponent consistently does 50-25-25s, and so they end up more wrong than you overall, because half the time the answer lands on one of their two 25s, not their single 50.
It’s not a game of being more correct, it’s a game of being less wrong.
I think all of this is also true of a scoring rule based on only the probability placed on the correct answer?
In the end you’d still expect to win but this takes longer (requires more questions) under a rule which includes probabilities on incorrect answers—it’s just adding noise to the results.