Sure, 2 knows something 1 doesn’t; e.g., 2 knows more about how unlikely B is. But, equally, 1 knows something 2 doesn’t; e.g., 1 knows more than 2 about how unlikely C is.
In the absence of any reason to think one of these is more important than the other, it seems reasonable to think that different probability assignments among the various wrong answers are equally meritorious and should result in equal scores.
… Having said that, here’s an argument (which I’m not sure I believe) for favouring more-balanced probability assignments to the wrong answers. We never really know that the right answer is 100:0:0:0. We could, conceivably, be wrong. And, by hypothesis, we don’t know of any relevant differences between the “wrong” answers. So we should see all the wrong answers as equally improbable but not quite certainly wrong. And if, deep down, we believe in something like the log scoring rule, then we should notice that a candidate who assigns a super-low probability to one of those “wrong” answers is going to do super-badly in the very unlikely case that it’s actually right after all.
So, suppose we believe in the log scoring rule, and we think the correct answer is the first one. But we admit a tiny probability h for each of the others being right. Then a candidate who gives probabilities a,b,c,d has an expected score of (1-3h) log a + h (log b + log c + log d). Suppose one candidate says 0.49,0.49,0.01,0.01 and the other says 0.4,0.2,0.2,0.2; then we will prefer the second over the first if h is bigger than about 0.0356. In a typical educational context that’s unlikely, so we should prefer the first candidate. Now suppose one says 0.49,0.49,0.01,0.01 and the other says 0.49,0.25,0.25,0.01; both give the same a, and the second has the larger log b + log c + log d, so for any h > 0 we should prefer the second candidate.
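To make the arithmetic above easy to check, here is a minimal Python sketch of the expected log score and of the h at which the two candidates tie. The function names (expected_log_score, crossover_h) and variable names are mine, chosen for illustration; natural logs are assumed, though the crossover does not depend on the base.

```python
import math

def expected_log_score(probs, h):
    """Expected log score if option 1 is right with probability 1 - 3h
    and each of the other three options is right with probability h."""
    a, b, c, d = probs
    return (1 - 3 * h) * math.log(a) + h * (math.log(b) + math.log(c) + math.log(d))

def crossover_h(first, second):
    """h at which the two candidates have equal expected score.

    The expected score is linear in h: log a + h * (log b + log c + log d - 3 log a),
    so the tie point is a ratio of intercept and slope differences.
    Assumes the two slopes differ.
    """
    def intercept_and_slope(probs):
        a, b, c, d = probs
        return math.log(a), math.log(b) + math.log(c) + math.log(d) - 3 * math.log(a)

    i1, s1 = intercept_and_slope(first)
    i2, s2 = intercept_and_slope(second)
    return (i1 - i2) / (s2 - s1)

lopsided = (0.49, 0.49, 0.01, 0.01)
balanced = (0.4, 0.2, 0.2, 0.2)
half_balanced = (0.49, 0.25, 0.25, 0.01)

h_star = crossover_h(lopsided, balanced)
print(round(h_star, 4))  # 0.0356

# Same a, larger b*c*d: the half-balanced answer wins for any positive h.
print(expected_log_score(half_balanced, 1e-6) > expected_log_score(lopsided, 1e-6))  # True
```

Below h_star the candidate with the higher probability on the right answer wins; above it the more evenly spread candidate wins.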
None of this means that the Brier score is the right way to prefer the second candidate over the first; it clearly isn’t, and if h is small enough then of course the correction to the naive log score is also very small, provided candidates’ probability assignments are bounded away from zero.
In practice, hopefully h is extremely small. And some wrong answers will be wronger than others, and we don’t want to reward candidates for not noticing that. But we probably also don’t want the extra pain of figuring out just how badly wrong all the wrong answers are. That is my main reason for thinking it’s better to use a scoring rule that doesn’t care what probabilities candidates assigned to the wrong answers.
Arguably the “natural” way to handle the possibility that you (the examiner) are in error is to score answers by (negative) KL-divergence from your own probability assignment. So if there are four options to which you assign probabilities p,q,r,s and a candidate says a,b,c,d then they get p log(a/p) + q log(b/q) + r log(c/r) + s log(d/s). If p=1 and q,r,s=0,0,0 then this is the same as giving them log a, i.e., the usual log-scoring rule. If p=1-3h and q,r,s=h,h,h then this is (1-3h) log (a/(1-3h)) + h log(b/h) + …, which if we fix a is constant + h (log b + log c + log d) = constant + h log bcd, which by the AM-GM inequality is biggest when b=c=d.
This differs from the “expected log score” I described above only by an additive constant. One way to describe it is: the average amount of information the candidate would gain by adopting your probabilities instead of theirs, the average being taken according to your probabilities.
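Here is a small Python sketch of the negative-KL score, mostly as a sanity check on the claims above. The function names and the example numbers (h = 0.01, the two candidate answers) are illustrative assumptions. Terms where the examiner's probability is 0 are taken to contribute 0, and candidates are assumed to give strictly positive probabilities.

```python
import math

def neg_kl_score(examiner, candidate):
    """Sum of p_i * log(c_i / p_i) over options with examiner probability p_i > 0.
    This is minus the KL divergence from the examiner's distribution to the candidate's."""
    return sum(p * math.log(c / p) for p, c in zip(examiner, candidate) if p > 0)

def expected_log_score(examiner, candidate):
    """Sum of p_i * log(c_i): the candidate's log score averaged over the examiner's beliefs."""
    return sum(p * math.log(c) for p, c in zip(examiner, candidate) if p > 0)

h = 0.01  # illustrative value only
certain = (1.0, 0.0, 0.0, 0.0)
hedged = (1 - 3 * h, h, h, h)

uneven = (0.49, 0.49, 0.01, 0.01)
even = (0.49, 0.17, 0.17, 0.17)  # same a, wrong answers balanced

# A fully certain examiner recovers the usual log scoring rule.
print(math.isclose(neg_kl_score(certain, uneven), math.log(0.49)))  # True

# The two scores differ by the same constant for every candidate.
offset_uneven = neg_kl_score(hedged, uneven) - expected_log_score(hedged, uneven)
offset_even = neg_kl_score(hedged, even) - expected_log_score(hedged, even)
print(math.isclose(offset_uneven, offset_even))  # True

# With a fixed, the balanced split of the remaining probability scores higher.
print(neg_kl_score(hedged, even) > neg_kl_score(hedged, uneven))  # True
```

The constant offset is the entropy of the examiner's own distribution, which is why it doesn't affect the ranking of candidates.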
This is really interesting, thanks, not something I’d thought of.
If the teacher (or whoever set the test) also has a spread of credence over the answers then a Bayesian update would compare the values of P(A), P(B|¬A) and P(C|¬A and ¬B) [1] between the students and teacher. This is my first thought about how I’d create a fair scoring rule for this.
[1] P(D|¬A and ¬B and ¬C) = 1 for all students and teachers so this is screened off by the other answers.
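For concreteness, here is a rough Python sketch of how those conditional values could be computed from a four-way probability assignment. It only produces the quantities to compare; how to turn the comparison into a score is left open, as it is above. The function name chain_conditionals and the example teacher and student numbers are made up for illustration.

```python
import math

def chain_conditionals(probs):
    """Return [P(A), P(B | not A), P(C | not A and not B)] for probabilities
    listed in the order A, B, C, D. The final conditional is always 1, so it is omitted."""
    remaining = 1.0  # probability mass not yet ruled out
    out = []
    for p in probs[:-1]:
        # If nothing remains, the conditional is undefined.
        out.append(p / remaining if remaining > 0 else float("nan"))
        remaining -= p
    return out

teacher = (0.97, 0.01, 0.01, 0.01)  # hypothetical examiner credences
student = (0.49, 0.49, 0.01, 0.01)  # hypothetical candidate answer

for name, probs in (("teacher", teacher), ("student", student)):
    print(name, [round(x, 3) for x in chain_conditionals(probs)])
```

In this example the student is far from the teacher on P(A) and P(B|¬A) but matches on P(C|¬A and ¬B), which is the kind of term-by-term comparison described above. Note that the values depend on the order in which the options are listed.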