One of the things I like about a Brier Score is that I feel like I intuitively understand how it rewards calibration and also decisiveness.
It is trivial to be perfectly calibrated on multiple choice (with two choices being a “binary” multiple choice answer) simply by throwing decisiveness out the window: generate answers with coin flips and give confidence for all answers of 1/N. You will come out with perfect calibration, but also the practice is pointless, which shows that we intuitively don’t care only about being calibrated.
However, this trick gets a very bad (edited from “low”; thanks to GWS for seeing the typo) Brier Score, because the Brier Score was invented partly in response to the ideas that motivate the trick :-)
We also want to see “1+1=3” and assign it a probability like 1E-7, because that equation is false, and what little uncertainty remains is more likely to come from typos and model error and so on. Giving probabilities like this will give you very very very low Brier Scores… as it should! :-)
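To make the contrast concrete, here is a minimal sketch (the numbers are my own toy examples, not anything from the post) of how the Brier Score treats the coin-flip strategy versus a decisive, knowledgeable one:

```python
def brier(forecasts):
    """Mean squared error between each stated probability and the 0/1 outcome."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# "Calibrated but useless": say 0.5 on every binary question.
# Half of them come out true, so calibration is perfect, but...
coin_flip = [(0.5, 1), (0.5, 0), (0.5, 1), (0.5, 0)]
print(brier(coin_flip))   # 0.25

# Decisive and right: near-0 or near-1 probabilities on the same questions.
decisive = [(0.99, 1), (0.01, 0), (0.999, 1), (1e-7, 0)]
print(brier(decisive))    # ~5e-5, much closer to the ideal 0.0
```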
The best possible Brier Score is 0.0 in the same way that the best RMSE is 0.0. This is reasonable because the RMSE and Brier Score are in some sense the same concept.
It makes sense to me that for both your goal is to make them zero. Just zero. The goal then is to know all the things… and to know that you know them by getting away with assigning everything very very high or very very low probabilities (and thus maxing the decisiveness)! <3
Second we calculate σ̂_z as the RMSE (root mean squared error) of all predictions… Then we calculate σ̂_z … = 1.73
So if these were my only two predictions, then I should widen my future intervals by 73%. In other words, because σ̂_z is 1.73 and not 1, my intervals are too narrow by a factor of 1.73.
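For reference, my reading of the quoted bit is that σ̂_z is simply the RMSE of the z-scores of the individual predictions, and the recalibration scales every stated σ by that factor (this is my paraphrase, not a quote from the post):

$$ z_i = \frac{x_i - \mu_i}{\sigma_i}, \qquad \hat{\sigma}_z = \sqrt{\frac{1}{N}\sum_{i=1}^{N} z_i^2} \approx 1.73, \qquad \sigma_i^{\text{new}} = \hat{\sigma}_z \cdot \sigma_i. $$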
I’m not sure if you’re doing something conceptually interesting here (like how the Brier Score interestingly goes over and above mere “Accuracy” or mere “Calibration” by taking several good things into account in a balanced way), or… maybe… are you making some sort of error?
RMSE works with nothing but point predictions. It seems like you recognize that the standard deviations aren’t completely necessary when you write:
(1) If the data x and the prediction μ are close, then you are a good predictor
Thus maybe you don’t need to also elicit a distribution and a variance estimate from the predictor? I think? There does seem to be something vaguely pleasing about aiming for an RMSE of 1.0 I guess (instead of aiming for 0.00000001), because it does seem like it would be nice for a “prediction consumer” to get error bars as part of what the predictor provides?
But I feel like this might be conceptually akin to sacrificing some of your decisiveness on the altar of calibration (as with guessing randomly over outcomes and always using a probability of 1/N).
The crux might be something like a third thing over and above “decisiveness & calibration” that is also good and might be named… uh… “non-hubris”? Maybe “intervalic coherence”? Maybe “predictive methodical self-awareness”?
Is it your intention to advocate aiming for RMSE=1.0 and also simultaneously advocate for eliciting some third virtuous quality from forecasters?
Note 1 for JenniferRM: I have updated the text, which should alleviate your confusion. If you have time, try to re-read the post before reading the rest of my comment; hopefully the few changes are enough to answer why we want RMSE = 1 and not 0.

Note 2 for JenniferRM and others who share her confusion: if the updated post is not sufficient but the text below is, how do I make my point clear without making the post much longer?
With binary predictions you can cheat and predict 50⁄50 as you point out… You can’t cheat with continuous predictions as there is no “natural” midpoint.
The insight you are missing is this:
I “try” to convert my predictions to the Normal N(0, 1) using the predicted mean and error.
The variance of the unit Normal is 1: Var(N(0, 1)) = 1^2 = 1
If my calculated variance deviates from the unit Normal, then that is evidence that I am wrong. I am making the implicit assumption that I cannot make “better point predictions” (change μ), and am thus forced to only update my future uncertainty intervals by the factor σ̂_z.
To make it concrete, if I had predicted (the sigmas here are 10 times wider than in the post):
Biden ~ N(54, 30)
COVID ~ N(15000, 50000)
then the math would give σ̂_z = 0.173. Both the post’s predictions and the “10 times wider” predictions in this comment imply the same “recalibrated” σ_covid:

50000 × 0.173 = 5000 × 1.73 = 8650
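A quick way to see the invariance: since z = (x − μ)/σ, making every σ ten times wider makes every z (and hence σ̂_z) ten times smaller, so the product σ × σ̂_z does not move. A sketch with made-up outcomes (only the means and sigmas are the ones above; the outcomes are hypothetical):

```python
import math

def sigma_hat_z(preds):
    """RMSE of the z-scores of (mu, sigma, outcome) triples."""
    zs = [(x - mu) / sd for mu, sd, x in preds]
    return math.sqrt(sum(z * z for z in zs) / len(zs))

# Hypothetical outcomes; the second set differs only by 10x wider sigmas.
narrow = [(54, 3, 59.2), (15000, 5000, 23650)]
wide   = [(54, 30, 59.2), (15000, 50000, 23650)]

print(sigma_hat_z(narrow))         # ~1.73
print(sigma_hat_z(wide))           # ~0.173
print(5000 * sigma_hat_z(narrow))  # ~8650
print(50000 * sigma_hat_z(wide))   # ~8650, the same recalibrated sigma
```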
(On a side note I hate Brier scores and prefer the Bernoulli likelihood, because the Brier score says that predicting 0% or 2% on something that happens 1% of the time is “equally wrong” (same squared error)… whereas the Bernoulli likelihood says you are an idiot for saying 0% when it can actually happen.)
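For what it’s worth, the arithmetic behind that complaint looks like this (a toy sketch, not anyone’s official scoring code):

```python
import math

def expected_brier(p, base_rate):
    # Average squared error when the event truly happens base_rate of the time.
    return base_rate * (1 - p) ** 2 + (1 - base_rate) * p ** 2

def expected_log_loss(p, base_rate):
    # Negative Bernoulli log-likelihood, averaged the same way.
    return -(base_rate * math.log(p) + (1 - base_rate) * math.log(1 - p))

print(expected_brier(0.00, 0.01))     # ~0.01
print(expected_brier(0.02, 0.01))     # ~0.01 -- the same penalty
print(expected_log_loss(0.02, 0.01))  # ~0.059, finite
try:
    print(expected_log_loss(0.00, 0.01))
except ValueError:
    print("log(0): infinitely bad")   # the Bernoulli verdict on saying 0%
```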
When I google for [Bernoulli likelihood] I end up at the distribution and I don’t see anything there about how to use it as a measure of calibration and/or decisiveness and/or anything else.
One hypothesis I have is that you have some core idea like “the deep true nature of every mental motion comes out as a distribution over a continuous variable… and the only valid comparison is ultimately a comparison between two distributions”… and if that is what you believe, then by pointing to a different distribution you would have pointed me towards “a different scoring method” (even though I can’t see a scoring method here)…

Another consequence of thinking that distributions are the “atoms of statistics” (in some sense) would be (if true) that you think a Brier Score has some distribution assumption already lurking inside it as its “true form”, and furthermore that this distribution is less sensible to use than the Bernoulli?
...
As to the original issue, I think that whether or not it is possible, with continuous variables, to “max the calibration, totally fail at knowing things, and still get an OK <some kind of score>” might not prove very much about <that score>?

Here I explore for a bit… can I come up with an N(m, s) guessing system that knows nothing but seems calibrated?
One thought I had: perhaps whoever is picking the continuous numbers has biases. You could make predictions of sigma basically at random at first, and then, as confirming data comes in for that source, it tells you about the kinds of questions you’re getting, so in future rounds you might tweak your guesses with no particular awareness of the semantics of any of the questions… for example by using the same kind of reasoning that led you to conclude “widen my future intervals by 73%” in the example in the OP.
With a bit of extra glue logic that says something vaguely like “use all past means to predict a new mean of all numbers so far” that plays nicely with the sigma guesses… I think the standard sigma and mean used for all the questions would stabilize? Probably? Maybe?
I think I’d have to actually sit down and do real math (and maybe some numerical experiments) to be sure that it would. But it seems like the mean would probably stabilize, and once the mean stabilizes the sigma could be adjusted to get 1.0 eventually too? Maybe some assumptions about the biases of the source of the numbers have to be added to get this result, but I’m not sure if there are any unique such assumptions that are privileged. Certainly a Gaussian distribution seems unlikely to me. (Most of the natural data I run across is fat-tailed and “power law looking”.)
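A minimal sketch of the kind of numerical experiment I mean (the question source here is a made-up lognormal generator, picked only because Gaussian-looking answers seem unlikely; nothing about it is principled):

```python
import math
import random

random.seed(0)

def question_source():
    # Stand-in for "whoever is picking the continuous numbers":
    # fat-tailed-ish answers spanning a few orders of magnitude.
    return math.exp(random.gauss(0.0, 1.0))

warmup = [question_source() for _ in range(10)]   # a few free answers before scoring
n, s1, s2 = len(warmup), sum(warmup), sum(a * a for a in warmup)

zs = []
for _ in range(100_000):
    mu = s1 / n                          # running mean of all past answers
    sigma = math.sqrt(s2 / n - mu * mu)  # running sd of all past answers
    x = question_source()
    zs.append((x - mu) / sigma)          # score the content-free prediction N(mu, sigma)
    n, s1, s2 = n + 1, s1 + x, s2 + x * x

print(math.sqrt(sum(z * z for z in zs) / len(zs)))  # hovers near 1.0 for this source
```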
The method I suggest above would then give you a “natural number scale and deviation” for whatever the source was for the supply of “guess this continuous variable” puzzles.
As the number of questions goes up (into the thousands? the billions? the quadrillions?) I feel like this content neutral sigma could approach 1.0 if the underlying source of continuous numbers to estimate was not set up in some abusive way that was often asking questions whose answer was “Graham’s Number” (or doing power law stuff, or doing anything similarly weird). I might be wrong here. This is just my hunch before numerical simulations <3
And if my proposed “generic sigma for this source of numbers” algorithm works here… it would not be exactly the same as “pick an option among N at random and assert 1/N confidence and thereby seem like you’re calibrated even though you know literally nothing about the object level questions” but it would be kinda similar.
My method is purposefully essentially contentless… except it seems like it would capture the biases of the continuous number source for most reasonable kinds of number sources.
...
Something I noticed… I remember back in the early days of LW there was an attempt to come up with a fun game for meetups that exercises calibration on continuous variables. It ended up ALSO needing two numbers (not just a point estimate).
The idea was to have a description of a number and a (maybe implicitly) asserted calibration/accuracy rate that a player should aim for (like being 50% confident or 99% confident or whatever).
Then, for each question, each player emits two numbers between -Inf and +Inf and gets penalized if the true number is outside their bounds, rewarded if the true number is inside, and rewarded more for having narrower bounds than anyone else. The reward schedule should be such that hitting the accuracy rate they were told to aim for is the winning calibration to have.
One version of this we tried that was pretty fun and pretty easy to score aimed for “very very high certainty” by having the scoring rule be: (1) we play N rounds, (2) if the true number is ever outside your bounds you get −2N points for that round (enough to essentially kick you out of the “real” game), (3) whoever has the narrowest bounds that contain the answer gets 1 point for that round. Winner has the most points at the end.
Playing this game for 10 rounds, the winner in practice was often someone who just turned in [-Inf, +Inf] for every question, because it turns out people seem to be really terrible at “knowing what they numerically know” <3
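In case the rule is easier to follow as code, here is a sketch of that scoring system (the player names and bounds are invented):

```python
from math import inf

def score_game(bounds_by_player, answers):
    """bounds_by_player maps a name to one (lo, hi) interval per round."""
    n_rounds = len(answers)
    scores = {name: 0 for name in bounds_by_player}
    for i, truth in enumerate(answers):
        hits = []                                # (width, name) of intervals containing the truth
        for name, bounds in bounds_by_player.items():
            lo, hi = bounds[i]
            if lo <= truth <= hi:
                hits.append((hi - lo, name))
            else:
                scores[name] -= 2 * n_rounds     # a single miss is enough to knock you out
        if hits:
            scores[min(hits)[1]] += 1            # narrowest correct interval wins the round
    return scores

# Invented example: the timid [-Inf, +Inf] player versus an overconfident one.
print(score_game(
    {"timid": [(-inf, inf)] * 3, "bold": [(10, 20), (0, 5), (100, 200)]},
    [15, 7, 150],
))
# {'timid': 1, 'bold': -4} -- one miss is fatal, so timid wins despite knowing nothing
```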
The thing that I’m struck by is that we basically needed two numbers to make the scoring system transcend the problems of “different scales or distributions on different questions”.
That old game used “two point estimates” to get two numbers. You’re using a midpoint and a fuzz factor that you seem strongly attached to for reasons I don’t really understand. In both cases, to make the game work, it feels necessary to have two numbers, which is… interesting.
It is weird to think that this problem space (related to one-dimensional uncertainty) is sort of intrinsically two dimensional. It feels like something there could be a theorem about, but I don’t know of any off the top of my head.
There is a new game currently sold at Target that is about calibration and estimation.
Each round has two big numbers, researched from questions like “how many YouTube videos were uploaded per hour in 2020?” or “how many pounds does Mars weigh?” Each player guesses how much larger one is than the other (i.e. 2x, 5x, 10x, 100x, 1000x), and can bet on themselves if they are confident in their estimation.
Rather than using z-scoring, one can use log probabilities to measure prediction accuracy. They are computed as −½·((μ−x)/σ)² − log(σ) − ½·log(2π).
A downside is that they are not scale invariant, but instead the unit you measure x in leads to a constant offset. I don’t know whether one can come up with a scale invariant version. (I think no, because changing the scale is symmetric with changing the prediction accuracy? Though if one has some baseline prediction, one can use that to define the scale.)
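A small sketch of that score, and of the constant offset from a change of units (my own illustration, with made-up numbers):

```python
import math

def normal_log_prob(x, mu, sigma):
    # log density of N(mu, sigma^2) evaluated at the observed value x
    return -0.5 * ((mu - x) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# The same prediction of a height, stated in metres and then in centimetres:
print(normal_log_prob(x=1.80, mu=1.75, sigma=0.10))    # ~ +1.26
print(normal_log_prob(x=180.0, mu=175.0, sigma=10.0))  # ~ -3.35
print(math.log(100))   # ~ 4.61 -- exactly the gap: a constant offset per change of unit
```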
Was this meant to be a high (or poor) Brier score?
Yes, thanks! (Edited with credit.)