The problem with the squared error score is that it just rewards asking a ton of obvious questions. I predict with 100% probability that the sky will be blue one second from now. Just keep repeating for a high score.
Both methods fail miserably if you get to choose what questions are asked. Bayesian score rewards never asking any questions ever. Or, if you normalize it to assign 1 to true certainty and 0 to 50⁄50, then it rewards asking obvious questions also.
If it helps, you can think of the squared error score as -(1-x)^2 instead of 1-(1-x)^2; that fixes this problem.
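(A minimal Python sketch of the difference, with made-up forecasts and outcomes: under the positively-normalized Brier rule, padding the question list with sure things inflates the total, while the -(1-x)^2 and log rules can only lose from extra questions.)

```python
import math

def brier_positive(p, outcome):
    # 1 - (1 - x)^2, where x is the probability assigned to what actually happened
    x = p if outcome else 1 - p
    return 1 - (1 - x) ** 2

def brier_negative(p, outcome):
    # -(1 - x)^2: same ranking of forecasts, but never exceeds 0
    x = p if outcome else 1 - p
    return -((1 - x) ** 2)

def log_score(p, outcome):
    # log(x): also never exceeds 0
    x = p if outcome else 1 - p
    return math.log(x)

hard = [(0.7, True)]                    # one honest 70% call that came true
padded = hard + [(0.99, True)] * 10     # plus ten "sky will be blue" questions

for name, rule in [("1-(1-x)^2", brier_positive),
                   ("-(1-x)^2 ", brier_negative),
                   ("log(x)   ", log_score)]:
    print(name,
          round(sum(rule(p, o) for p, o in hard), 3),
          round(sum(rule(p, o) for p, o in padded), 3))
# Only the first rule's total is inflated by the padding (0.91 -> 10.91);
# the other two totals stay roughly where they were or get slightly worse.
```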
Both methods fail miserably if you get to choose what questions are asked. Bayesian score rewards never asking any questions ever. Or, if you normalize it to assign 1 to true certainty and 0 to 50⁄50, then it rewards asking obvious questions also.
Only because you are baking in an implicit loss function that all questions are equally valuable; switch to some other loss function which weights the value of more interesting or harder questions more, and this problem disappears as ‘the sky is blue’ ceases to be worth anything compared to a real prediction like ‘Obama will be re-elected’.
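(A follow-on sketch of that point, with interestingness weights I invented for illustration: once each question's score is multiplied by a weight, the padded easy questions contribute almost nothing, even under the positively-normalized Brier rule above.)

```python
def brier_positive(p, outcome):
    x = p if outcome else 1 - p
    return 1 - (1 - x) ** 2

# (weight, forecast, outcome); the weights are made up
hard = [(1.0, 0.70, True)]                   # "Obama will be re-elected"
padded = hard + [(0.001, 0.99, True)] * 10   # ten "sky is blue" questions, weight ~0

for qs in (hard, padded):
    print(round(sum(w * brier_positive(p, o) for w, p, o in qs), 3))
# ~0.91 vs ~0.92: padding with obvious questions is no longer worth anything.
```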
I don’t understand why what you are suggesting has anything to do with what I said.
Yes, of course you can assign different values to different statements, and I mentioned this. However, what I was saying here is that if you allow the option of just not answering one of the questions (whatever that means), then there has to be some utility associated with not answering. The comment that I was responding to was saying that the Bayesian score was better than Brier because Brier gave positive utilities instead of negative utilities, so it could be cheated by asking lots of easy questions.
Your response seems to be about scaling the utilities for each question based on the importance of that question, which I mentioned when I said “(possibly weighted) average score.” That is a very valid point, but I don’t see how it has anything to do with the problems associated with being able to choose what questions are asked.
That is a very valid point, but I don’t see how it has anything to do with the problems associated with being able to choose what questions are asked.
I don’t understand your problem here. If questions’ values are scaled appropriately, or some fancier approach is used, then it doesn’t matter if respondents pick and choose because they will either be wasting their time or missing out on large potential gains. A loss function style approach seems to adequately resolve this problem.
I think this is probably bad communication on my part.
The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the weighted sum of your scores on the statements which you choose to answer.
For this, it really matters not only how the values are scaled, but also how they are translated. It matters what the 0 utility point for each question is, because that determines whether or not you want to choose to answer that question. I think that the 0 utility point should be put at the utility of the 50⁄50 probability assignment for each question. In this case, not answering a question is equivalent to answering it with 50⁄50 probability, so I think it would be simpler to just say that you have to answer every question, and your answer by default is 50⁄50, in which case the 0 points don’t matter anymore. This is just semantics.
But just saying that you scale each question by its importance doesn’t fix the problem: if you model this as being able to choose which questions to answer, with your utility being the sum of your utilities for the individual questions, then the Bayesian rule as written encourages not answering any questions at all, since it can only give you negative utility. You have to fix that either by choosing the 0 points of your utilities in some reasonable way, or by requiring that you are assigned a utility for every question, with a default answer for anything you don’t think about at all.
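(A minimal sketch of that incentive in Python, with a made-up question list: under the raw log rule even a well-calibrated forecaster does better by skipping everything, while shifting the zero point to the 50⁄50 score removes the incentive to abstain.)

```python
import math

def log_score(p, outcome):
    # raw Bayesian score: log2 of the probability assigned to what happened
    return math.log2(p if outcome else 1 - p)

def shifted_log_score(p, outcome):
    # zero point moved to the 50/50 assignment; equals 1 + log2(x)
    return log_score(p, outcome) - log_score(0.5, outcome)

# a reasonably calibrated forecaster's honest answers: (forecast, outcome)
questions = [(0.8, True), (0.9, True), (0.3, False)]

# Raw rule: every answered question adds a negative term, so the
# "answer nothing" strategy (total 0) beats answering honestly.
print(round(sum(log_score(p, o) for p, o in questions), 2))          # -0.99
print(0.0)                                                           # skip all

# Shifted rule: skipping a question is worth exactly as much as saying
# 50/50, so the forecaster gains by answering.
print(round(sum(shifted_log_score(p, o) for p, o in questions), 2))  # 2.01
```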
There are benefits to weighting the questions because that allows us to take infinite sums, but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighting the questions is similar to just asking the same questions multiple times (in proportion to their weights). This may be more accurate for what we want in epistemic rationality, but it doesn’t actually solve the problems associated with allowing people to pick and choose questions.
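(A tiny check of the weights-as-repetition point, after clearing denominators so the weights are integers; the forecasts are made up.)

```python
def brier(p, outcome):
    x = p if outcome else 1 - p
    return -((1 - x) ** 2)

# weight 3 on question A and weight 2 on question B ...
weighted = 3 * brier(0.75, True) + 2 * brier(0.5, False)
# ... versus an unweighted list in which A appears 3 times and B twice
repeated = sum(brier(p, o) for p, o in [(0.75, True)] * 3 + [(0.5, False)] * 2)
print(weighted, repeated)   # identical totals
```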
The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the weighted sum of your scores on the statements which you choose to answer.
Hm, no, I wasn’t really thinking that way. I don’t want some finite number; I want everyone to reach different numbers so that more accurate predictors score higher.
The weights on particular questions do not even have to be set algorithmically—for example, a prediction market is immune to the ‘sky is blue’ problem because if one were to start a contract for ‘the sky is blue tomorrow’, no one would trade on it unless one were willing to lose money being a market-maker as the other traders bid it up to the meteorologically-accurate 80% or whatever. One can pick and choose as much as one pleases, but unless one’s contracts were valuable to other people for any reason, it would be impossible to make money by stuffing the market with bogus contracts. The utility just becomes how much money you made.
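(A toy expected-value calculation of why the obvious contract earns nothing; the prices and beliefs are invented, and market-maker subsidies and fees are ignored.)

```python
# Expected profit per $1 binary contract bought at `price` when the event's
# true probability is `truth`: win (1 - price) with probability truth,
# lose `price` otherwise.  This simplifies to truth - price.
def expected_profit(truth, price):
    return truth * (1 - price) - (1 - truth) * price

print(round(expected_profit(0.80, 0.80), 2))   # 0.0 -- 'sky is blue tomorrow' already
                                               # bid to its meteorological 80%: nothing
                                               # to gain from stuffing the market with it
print(round(expected_profit(0.55, 0.40), 2))   # 0.15 -- a genuinely contested question
                                               # still rewards real information
```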
I think that the 0 utility point should be put at the utility of the 50⁄50 probability assignment for each question.
I think this doesn’t work because you’re trying to invent a non-informative prior, and it’s trivial to set up sets of predictions where the obviously better non-informative prior is not 1/2: for example, set up 3 predictions, one for each of 3 mutually exclusive and exhaustive outcomes, where the non-informative prior obviously looks more like 1⁄3 and 1⁄2 means someone is getting robbed. More importantly, uninformative priors are disputed and it’s not clear what they are in more complex situations. (Frequentist Larry Wasserman goes so far as to call them “lost causes” and “perpetual motion machines”.)
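(A worked version of the three-outcome example under the 50⁄50 zero point proposed above, as a sketch: a predictor who knows nothing beyond the structure and answers 1⁄3 to each of the three questions still expects a positive total, i.e. gets paid for zero information.)

```python
import math

def expected_shifted_score(q, x):
    # E[1 + log2(prob assigned to the actual outcome)] for a binary question
    # with true probability q and forecast x -- the log score with its zero
    # point at the 50/50 assignment
    return 1 + q * math.log2(x) + (1 - q) * math.log2(1 - x)

# three "will outcome A/B/C happen?" questions about one event with three
# mutually exclusive, exhaustive outcomes; the ignorant forecast is 1/3 each
print(round(3 * expected_shifted_score(1/3, 1/3), 3))
# ~0.245 > 0: the 50/50 zero point is the wrong "no information" baseline here
```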
But just saying that you scale each question by its importance doesn’t fix the problem: if you model this as being able to choose which questions to answer, with your utility being the sum of your utilities for the individual questions, then the Bayesian rule as written encourages not answering any questions at all, since it can only give you negative utility. You have to fix that either by choosing the 0 points of your utilities in some reasonable way, or by requiring that you are assigned a utility for every question, with a default answer for anything you don’t think about at all.
Perhaps raw log odds are not the best idea, but do you really think there is no way to turn them into some score which disincentivizes strategic predicting? This just sounds arrogant to me, and I would only believe it if you summarized all the existing research into rewarding experts and showed that log odds simply could not be used in any circumstance where a predictor could choose to predict only a subset of the specified predictions.
but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighting the questions is similar to just asking the same questions multiple times (in proportion to their weights).
There aren’t finitely many questions, because one can ask questions involving each of the infinite set of integers… And knowing whether two questions are asking the same thing sounds like an impossible demand to meet (for example, if any system claimed this, it could be used to solve the Halting Problem simply by asking it to compare the predicted outputs of 2 Turing machines).
If you normalize Bayesian score to assign 1 to 100% and 0 to 50% (and −1 to 0%), you encounter a math error.
I didn’t do that. I only set 1 to 100% and 0 to 50%. 0% is still negative infinity.
That’s the math error.
Why is it consistent that assigning a probability of 99% to one half of a binary proposition that turns out false is much better than assigning a probability of 1% to the opposite half that turns out true?
There’s no math error.
I think there’s some confusion. Coscott said these three facts:
Let f(x) be the output if the question is true, and let g(x) be the output if the question is false.
f(x)=g(1-x)
f(x)=log(x)
In consequence, g(x)=log(1-x). So if x=0.99 and the question is false, the output is g(x)=log(1-x)=log(0.01). Or if x=0.01 and the question is true, the output is f(x)=log(x)=log(0.01). So the symmetry that you desire is true.
But that doesn’t output 1 for estimates of 100%, 0 for estimates of 50%, and -inf (or even −1) for estimates of 0%, or even something that can be normalized to either of those triples.
Here’s the “normalized” version: f(x)=1+log2(x), g(x)=1+log2(1-x) (i.e. scale f and g by 1/log(2) and add 1).
Now f(1)=1, f(.5)=0, f(0)=-Inf; g(1)=-Inf, g(.5)=0, g(0)=1.
Ok?
Huh. I thought that wasn’t a Bayesian score (not maximized by estimating correctly), but doing the math the maximum is at the right point for 1⁄4, 1⁄100, 3⁄4, 99⁄100, and 1⁄2.
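(A quick numerical check of that, as a sketch rather than a proof: a grid search over forecasts x confirms that the expected normalized score q*f(x) + (1-q)*g(x) peaks at x = q for each of those probabilities.)

```python
import math

def expected_score(q, x):
    # q*f(x) + (1-q)*g(x) with f(x) = 1 + log2(x), g(x) = 1 + log2(1-x)
    return q * (1 + math.log2(x)) + (1 - q) * (1 + math.log2(1 - x))

grid = [i / 1000 for i in range(1, 1000)]
for q in (1/4, 1/100, 3/4, 99/100, 1/2):
    best = max(grid, key=lambda x: expected_score(q, x))
    print(q, best)   # the maximizer matches q (to grid resolution), so the
                     # normalized rule still rewards honest estimates
```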