I did quite a bunch of credence calibration; I’m curious to what extent you have trained it. I have the impression that my intuitions here are informed by those experiences.
Most normal testing focuses on finding the correct answer. Credence training focuses on having an accurate sense of your own knowledge.
There’s a sense in which reporting that one’s genuine confidence level is zero feels repulsive, given that zero isn’t a probability.
Humans use the availability heuristic to compare different options. The fact that A feels more likely than B and A feels more likely than C seems to me like a form of knowledge that’s worth rewarding.
I have done some credence training but I think my instincts here are more based on Maths and specifically Bayes (see this comment).
I think the zero probability thing is a red herring—replace the 0s with ϵ and the 50s with 50-ϵ and you get basically the same thing. There are some questions where keeping track of the ϵ just isn’t worth it.
A proper scoring rule is designed to reward both knowledge and accurate reporting of credences. This is achieved if we score based on the correct answer, whether or not we also score based on the probabilities of the wrong answers.
If we also attempt to optimise for certain ratios between credences of different answers then this is at the expense of rewarding knowledge of the correct answer.
If Alice has credence levels of 50:50:ϵ:ϵ and Bob has 40:20:20:20 and the correct answer is A then Bob will get a higher score than Alice despite her putting more of her probability mass on the correct answer.
Do you consider this a price worth paying to reward having particular ratios between credences?
[....is at the expense of rewarding knowledge of the correct answer.]
Hmm… I’m not sure that Alice has really more knowledge than Bob in your example.
[EDIT: In fact, in your example, for the quadratic scoring rule, the score of 50:50:$\epsilon:\epsilon$ is better than the score of 40:20:20:20, since $12/25 < 1/2 + 2\epsilon^2$, so we can indeed say that Alice has more knowledge than Bob according to this rule. The following example is, IMHO, more interesting. /EDIT]
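For concreteness, the two quantities in the EDIT can be reproduced as quadratic (Brier-style) losses, Σᵢ (pᵢ − oᵢ)² against the 1:0:0:0 outcome vector — a sketch under that assumed convention; which direction counts as “better” is exactly the convention point discussed later in the thread.

```python
# Quadratic (Brier-style) loss: sum of squared differences between the
# reported probabilities and the 1/0 outcome vector. Under this convention
# a LOWER loss means a better-calibrated answer.
def quadratic_loss(probs, correct_index=0):
    outcome = [1.0 if i == correct_index else 0.0 for i in range(len(probs))]
    return sum((p - o) ** 2 for p, o in zip(probs, outcome))

eps = 0.0  # Alice's epsilon; in general her loss works out to 1/2 + 2*eps**2
alice = quadratic_loss([0.5, 0.5, eps, eps])  # 0.5   (= 1/2 + 2*eps^2)
bob = quadratic_loss([0.4, 0.2, 0.2, 0.2])    # about 0.48 (= 12/25)
print(alice, bob)
```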
Let me propose another perspective, with the following two answers for propositions A:B:C:D:
1) 50:50:0:0
2) 50:25:25:0,
where the correct answer is 100:0:0:0.
In this case, 2) has a better score than 1).
What does 1) know? That D and C are false. He knows nothing about A and B.
What does 2) know? That D is false. That C is not very probable. Like 1), he does not know about A. But he additionally knows that B is probably not the right answer.
Suppose that 3) is someone who knows that D and C are false, and also knows that B is probably not the right answer (i.e., 3) has the knowledge of both 1) and 2)). Then 3) could have given the answer 3a) 75:25:0:0, or the answer 3b) 62.5:37.5:0:0. These two answers score better than 1) and 2).
(Note that knowing that D and C are false and that B is probably not the right answer influences your knowledge about A.)
-----
For example, imagine that 2) first thinks of 50:25:25:0, but then remembers that it can in fact not be C. We can then compute the Bayesian update, and we get:
P(A | ¬C) = 2⁄3 (vs P(A) = 1⁄2)
P(B | ¬C) = 1⁄3 (vs P(B) = 1⁄4)
P(C | ¬C) = 0 (vs P(C) = 1⁄4)
P(D | ¬C) = 0 (vs P(D) = 0).
This is different from answer 1). In this sense, I think we can really say that 2) knows something that 1) does not know, even if 2) is not sure that C is false. Indeed, after updating on the information ‘not C’, the score of 2) becomes better than the score of 1). (2/3:1/3:0:0 has a better score than 1/2:1/4:1/4:0.)
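This update and the resulting comparison can be checked with a short sketch, using exact fractions and the quadratic score in its 1 − Σ(p − o)² form (higher is better):

```python
from fractions import Fraction as F

# Condition a credence vector on "not C" (index 2): zero out C, renormalise.
def update_not_c(probs, c_index=2):
    remaining = 1 - probs[c_index]
    return [F(0) if i == c_index else p / remaining for i, p in enumerate(probs)]

# Quadratic score, higher is better: 1 minus sum of squared errors
# against the 1/0 outcome vector (option 0, i.e. A, is correct).
def quadratic_score(probs, correct_index=0):
    return 1 - sum((p - (1 if i == correct_index else 0)) ** 2
                   for i, p in enumerate(probs))

before = [F(1, 2), F(1, 4), F(1, 4), F(0)]
after = update_not_c(before)
print(after)                    # fractions 2/3, 1/3, 0, 0
print(quadratic_score(after))   # 7/9 — better than the pre-update 5/8
print(quadratic_score(before))  # 5/8
```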
Sure, 2 knows something 1 doesn’t; e.g., 2 knows more about how unlikely B is. But, equally, 1 knows something 2 doesn’t; e.g., 1 knows more than 2 about how unlikely C is.
In the absence of any reason to think one of these is more important than the other, it seems reasonable to think that different probability assignments among the various wrong answers are equally meritorious and should result in equal scores.
… Having said that, here’s an argument (which I’m not sure I believe) for favouring more-balanced probability assignments to the wrong answers. We never really know that the right answer is 100:0:0:0. We could, conceivably, be wrong. And, by hypothesis, we don’t know of any relevant differences between the “wrong” answers. So we should see all the wrong answers as equally improbable but not quite certainly wrong. And if, deep down, we believe in something like the log scoring rule, then we should notice that a candidate who assigns a super-low probability to one of those “wrong” answers is going to do super-badly in the very unlikely case that it’s actually right after all.
So, suppose we believe in the log scoring rule, and we think the correct answer is the first one. But we admit a tiny probability h for each of the others being right. Then a candidate who gives probabilities a,b,c,d has an expected score of (1-3h) log a + h (log b + log c + log d). Suppose one candidate says 0.49,0.49,0.01,0.01 and the other says 0.4,0.2,0.2,0.2; then we will prefer the second over the first if h is bigger than about 0.0356. In a typical educational context that’s unlikely so we should prefer the first candidate. Now suppose one says 0.49,0.49,0.01,0.01 and the other says 0.49,0.25,0.25,0.01; we should always prefer the second candidate.
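Since the expected score is linear in h, the crossover point can be solved exactly; this sketch (my own check, not part of the original comment) reproduces the ≈0.0356 figure:

```python
from math import log

# Expected log score when the examiner assigns probability 1-3h to the
# "right" answer (option 0) and h to each of the three "wrong" answers.
def expected_log_score(probs, h):
    return (1 - 3 * h) * log(probs[0]) + h * sum(log(p) for p in probs[1:])

cand1 = [0.49, 0.49, 0.01, 0.01]
cand2 = [0.4, 0.2, 0.2, 0.2]

# The score is linear in h: score(h) = log(p0) + h * slope(probs),
# so the crossover between two candidates solves a linear equation.
def slope(probs):
    return sum(log(p) for p in probs[1:]) - 3 * log(probs[0])

h_star = (log(cand1[0]) - log(cand2[0])) / (slope(cand2) - slope(cand1))
print(h_star)  # about 0.0356
```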
None of this means that the Brier score is the right way to prefer the second candidate over the first; it clearly isn’t, and if h is small enough then of course the correction to the naive log score is also very small, provided candidates’ probability assignments are bounded away from zero.
In practice, hopefully h is extremely small. And some wrong answers will be wronger than others and we don’t want to reward candidates for not noticing that, but we probably also don’t want the extra pain of figuring out just how badly wrong all the wrong answers are, and that is my main reason for thinking it’s better to use a scoring rule that doesn’t care what probabilities candidates assigned to the wrong answers.
Arguably the “natural” way to handle the possibility that you (the examiner) are in error is to score answers by (negative) KL-divergence from your own probability assignment. So if there are four options to which you assign probabilities p,q,r,s and a candidate says a,b,c,d then they get p log(a/p) + q log(b/q) + r log(c/r) + s log(d/s). If p=1 and q,r,s=0,0,0 then this is the same as giving them log a, i.e., the usual log-scoring rule. If p=1-3h and q,r,s=h,h,h then this is (1-3h) log (a/(1-3h)) + h log(b/h) + …, which if we fix a is constant + h (log b + log c + log d) = constant + h log bcd, which by the AM-GM inequality is biggest when b=c=d.
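A sketch of that KL-based score (function and variable names are mine); it also illustrates the AM–GM point that, with the candidate’s probability on the right answer held fixed, spreading the leftover mass evenly over the wrong answers scores best:

```python
from math import log

# Score a candidate against the examiner by negative KL divergence:
# sum of p_i * log(q_i / p_i). Terms where the examiner has p_i == 0 vanish.
def neg_kl_score(examiner, candidate):
    return sum(p * log(q / p) for p, q in zip(examiner, candidate) if p > 0)

h = 0.01
examiner = [1 - 3 * h, h, h, h]

# Same mass (0.6) on the wrong answers in both cases; the even spread wins.
balanced = [0.4, 0.2, 0.2, 0.2]
lopsided = [0.4, 0.3, 0.2, 0.1]
print(neg_kl_score(examiner, balanced) > neg_kl_score(examiner, lopsided))  # True
```

With examiner probabilities 1:0:0:0 the score reduces to log a, the usual log-scoring rule, as the comment notes.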
This differs from the “expected log score” I described above only by an additive constant. One way to describe it is: the average amount of information the candidate would gain by adopting your probabilities instead of theirs, the average being taken according to your probabilities.
This is really interesting, thanks, not something I’d thought of.
If the teacher (or whoever set the test) also has a spread of credence over the answers then a Bayesian update would compare the values of P(A), P(B|¬A) and P(C|¬A and ¬B) [1] between the students and teacher. This is my first thought about how I’d create a fair scoring rule for this.
[1] P(D|¬A and ¬B and ¬C) = 1 for all students and teachers so this is screened off by the other answers.
Why?
50:50:0:0 says it’s a coin toss between A and ¬A. If ¬A then B.
50:25:25:0 says it’s a coin toss between A and ¬A. If ¬A then it’s a coin toss between B and C.
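In terms of those chained conditionals, the two answers differ only in P(B|¬A); a hypothetical helper (names mine) makes this explicit:

```python
# Chain-factorise a 4-option credence vector into P(A), P(B|not A),
# and P(C|not A and not B). P(D|not A, not B, not C) is always 1, so it
# carries no information. Undefined conditionals are returned as None.
def chain_conditionals(probs):
    a, b, c, _ = probs
    p_b_given_not_a = b / (1 - a) if a < 1 else None
    rest = 1 - a - b
    p_c_given_not_ab = c / rest if rest > 0 else None
    return a, p_b_given_not_a, p_c_given_not_ab

print(chain_conditionals([0.5, 0.5, 0.0, 0.0]))    # (0.5, 1.0, None)
print(chain_conditionals([0.5, 0.25, 0.25, 0.0]))  # (0.5, 0.5, 1.0)
```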
Why should the scoring rule care about what my rule is for ¬A when A is the correct answer?
I’m genuinely curious—I notice you’re the second person to voice this opinion but I can’t get my head round it at all.
(As with my reply to aaq, this all assumes that these are genuine confidence levels)
The score for the 50:50:0:0 student is:
1 − (1−0.5)² − 0.5² − 0² − 0² = 0.5
The score for the 40:20:20:20 student is:
1 − (1−0.4)² − 0.2² − 0.2² − 0.2² = 0.52
I think the way you’ve done it is Brier’s rule, which is (1 − the score from the OP). Under Brier’s rule the lower value is better.
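The two conventions can be put side by side in a quick sketch (labels mine): the score used above is 1 minus the Brier loss, so higher is better, while the raw Brier loss is lower-is-better.

```python
# Brier loss: sum of squared errors against the 1/0 outcome (lower is better).
def brier_loss(probs, correct_index=0):
    return sum((p - (1 if i == correct_index else 0)) ** 2
               for i, p in enumerate(probs))

# The score used above is just 1 minus the Brier loss (higher is better).
def op_score(probs, correct_index=0):
    return 1 - brier_loss(probs, correct_index)

print(round(op_score([0.5, 0.5, 0.0, 0.0]), 4))   # 0.5
print(round(op_score([0.4, 0.2, 0.2, 0.2]), 4))   # 0.52
```

Either way round, the ranking of the two students is the same; only the direction of “better” flips.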