Alternative to Bayesian Score

I am starting to wonder whether Bayes Score is what I want to maximize for my epistemic rationality. I started thinking about this while trying to design a board game to teach calibration of probabilities, so I will use that as my example:
I wanted a scoring mechanism which motivates honest reporting of probabilities and rewards players who are better calibrated. For simplicity, let's assume that we only have to deal with true/false questions for now. A player is given a question which they believe is true with probability p. They then name a real number x between 0 and 1, and receive a score which is a function of x and whether or not the question is true. We want the expected score to be maximized exactly when x=p.

Let f(x) be the score if the question is true, and let g(x) be the score if the question is false. Then, my expected score is p f(x)+(1-p) g(x). If we assume f and g are smooth, then in order to have a maximum at x=p, we want p f’(p)+(1-p) g’(p)=0, which still leaves us with a large class of functions. It would also be nice to have symmetry by having f(x)=g(1-x). If we further require this, we get p f’(p)+(1-p)(-1)f’(1-p)=0, or equivalently p f’(p)=(1-p) f’(1-p). One way to achieve this is to set x f’(x) equal to a constant c. Then f’(x)=c/x, so f(x)=log x (taking c=1 and dropping the constant of integration). This scoring mechanism is referred to as the “Bayesian Score.”
However, another natural way to achieve this is by setting f’(x)/(1-x) equal to a constant. If we set this constant equal to 2, we get f’(x)=2-2x, which gives us f(x)=2x-x^2=1-(1-x)^2. I will call this the “Squared Error Score.”
There are many other functions which satisfy the desired conditions, but these two are the simplest, so I will focus on them.
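To make the properness condition concrete, here is a minimal sketch in Python (the function names are mine, not standard terminology) that defines both scoring rules and numerically checks that the expected score peaks at the honest report x=p:

```python
import numpy as np

def log_score(x, true):
    # Bayesian Score: log of the probability assigned to the actual outcome.
    return np.log(x) if true else np.log(1 - x)

def squared_error_score(x, true):
    # Squared Error Score: f(x) = 1 - (1-x)^2 if true, g(x) = f(1-x) if false.
    return 1 - (1 - x) ** 2 if true else 1 - x ** 2

def expected_score(score, p, x):
    # Expected score when the question is true with probability p
    # and the player reports x.
    return p * score(x, True) + (1 - p) * score(x, False)

p = 0.7
xs = np.linspace(0.01, 0.99, 99)
for score in (log_score, squared_error_score):
    best = xs[np.argmax([expected_score(score, p, x) for x in xs])]
    print(score.__name__, "is maximized near x =", round(best, 2))
# Both print 0.7: honest reporting maximizes the expected score.
```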
Eliezer argues for the Bayesian Score in A Technical Explanation of Technical Explanation, which I recommend reading. The reason he prefers the Bayesian Score is that he wants the sum of the scores associated with determining P(A) and P(B|A) to equal the score for determining P(A&B). In other words, he wants it not to matter whether you break a problem up into one experiment or two. This is a legitimate virtue of this scoring mechanism, but I think many people consider it a lot more valuable than it is. It doesn’t eliminate the problem that we don’t know which questions to ask. It gives us the same answer regardless of how we break up an experiment into smaller experiments, but our score is still dependent on which questions are asked, and this cannot be fixed by just saying “ask all questions”: there are infinitely many of them, and the sum does not converge. Because the score is still a function of which questions are asked, the fact that it gives the same answer for some related sets of questions is not a huge benefit.
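The additivity itself is just the logarithm at work: since P(A&B)=P(A)P(B|A), we have log P(A&B)=log P(A)+log P(B|A). A one-line numerical check, with made-up probabilities:

```python
import math

# Made-up numbers: P(A) = 0.6, P(B|A) = 0.5, so P(A&B) = 0.3.
p_a, p_b_given_a = 0.6, 0.5

# The score for the joint question equals the sum of the two separate scores.
assert math.isclose(math.log(p_a * p_b_given_a),
                    math.log(p_a) + math.log(p_b_given_a))
```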
One nice thing about the Squared Error Score is that it always gives a score between 0 and 1, which means we can actually use it in real life. For example, we could ask someone to construct a spinner that comes up either “true” or “false,” and then spin it twice; they win if either of the two spins comes up with the true answer. In this case, the best strategy is to build the spinner so that it comes up “true” with probability p. There is no way to do anything similar for the Bayesian Score; in fact, it is questionable whether arbitrarily low utilities even make sense.
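Here is a small Monte Carlo sanity check of that spinner construction (the code and names are mine): each spin lands “true” with the reported probability x, and you win if either of two spins shows the actual answer, so your win probability is f(x)=1-(1-x)^2 when the question is true and f(1-x) when it is false.

```python
import random

def spinner_win(x, answer_is_true):
    # Spin twice; each spin lands "true" with probability x.
    spins = [random.random() < x for _ in range(2)]
    # Win if either spin matches the actual answer.
    return any(spin == answer_is_true for spin in spins)

x, trials = 0.8, 100_000
wins = sum(spinner_win(x, True) for _ in range(trials))
print(wins / trials, "should be close to", 1 - (1 - x) ** 2)  # about 0.96
```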
The Bayesian Score is slightly easier to generalize to multiple choice questions. The Squared Error Score can also be generalized, but it unfortunately makes your score a function of more than just the probability you assigned to the correct answer. For example, if A is the correct answer, you get more points for 80%A 10%B 10%C than for 80%A 20%B 0%C. The function you want for multiple options is: if you assign probabilities x_1 through x_n, and the first option is correct, you get output 2x_1-x_1^2-x_2^2-...-x_n^2. I do not think this is as bad as it seems. It kind of makes sense that when the answer is A, you get penalized slightly for saying that you are much more confident in B than in C, since making such a claim is a waste of information. To view this as a spinner, you construct a spinner, spin it twice, and you win if either spin gets the correct answer, or if the first spin comes lexicographically strictly before the second spin.
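Plugging the two example reports into that formula confirms the gap (I generalize from “the first option is correct” to an arbitrary correct index, which follows by symmetry):

```python
def multi_score(probs, correct):
    # 2 * x_correct minus the sum of squares of all assigned probabilities.
    return 2 * probs[correct] - sum(x * x for x in probs)

# A (index 0) is the correct answer.
print(round(multi_score([0.8, 0.1, 0.1], 0), 2))  # 0.94
print(round(multi_score([0.8, 0.2, 0.0], 0), 2))  # 0.92
```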
For the purpose of my calibration game, I will almost certainly use the Squared Error Score, because the log score's unbounded negative penalties are not feasible in a board game. But it got me thinking about why I am not thinking in terms of the Squared Error Score in real life.
You might ask what the experimental difference between the two is, since they are both maximized by honest probabilities. Well, if I have two questions, want to maximize my (possibly weighted) average score, and have a limited amount of time to research and improve my answers, then it matters how much the scoring mechanism penalizes various errors. The Bayesian Score penalizes so heavily for being sure of one false thing that none of the other scores really matter, while the Squared Error Score is much more forgiving. If we normalize so that 50/50 gives 0 points and true certainty gives 1 point, then Squared Error gives −3 points for false certainty while Bayesian gives negative infinity.
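Spelling out that normalization (my arithmetic, not part of the original comparison): rescale each rule so that a 50/50 report scores 0 and justified certainty scores 1, then evaluate misplaced certainty.

```python
import math

def normalized_squared_error(x):
    # 1 - (1-x)^2, rescaled so that x = 0.5 scores 0 and x = 1 scores 1.
    return 1 - 4 * (1 - x) ** 2

def normalized_log(x):
    # log2(x) shifted so that x = 0.5 scores 0 and x = 1 scores 1.
    return 1 + math.log2(x)

print(normalized_squared_error(0.0))  # -3: false certainty under Squared Error
print(normalized_log(1e-9))           # about -29, diverging to -infinity as x -> 0
```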
I view maximizing Bayesian Score as the Golden Rule of epistemic rationality, so even a small chance that something else might be better is worth investigating. Even if you are fully committed to Bayesian Score, I would love to hear any pros or cons you can think of in either direction.