Note 1 for JenniferRM: I have updated the text, so it should alleviate your confusion. If you have time, try to re-read the post before reading the rest of my comment; hopefully the few changes are enough to answer why we want RMSE = 1 and not 0. Note 2 for JenniferRM and others who share her confusion: if the updated post is not sufficient but the text below is, how do I make my point clear without making the post much longer?
With binary predictions you can cheat and predict 50/50, as you point out… You can't cheat with continuous predictions, as there is no "natural" midpoint.
The insight you are missing is this:
I "try" to convert my predictions to the standard Normal N(0, 1) using the predicted mean and error.
The variance of the unit Normal is 1: Var(N(0, 1)) = 1^2 = 1
If my calculated variance deviates from the unit Normal, then that is evidence that I am wrong. I am making the implicit assumption that I cannot make "better point predictions" (change μ) and am thus forced to only update my future uncertainty intervals by σz.
To make it concrete: if I had predicted (sigma here is 10× wider than in the post):
Biden ~ N(54, 30)
COVID ~ N(15.000, 50.000)
then the math would give σ̂z = 0.173. Both the post's predictions and the "10 times wider" predictions in this comment imply the same "recalibrated" σcovid:
50.000 × 0.173 = 5.000 × 1.73 = 8.650
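A minimal sketch of this scale-invariance (the observed outcomes below are made-up placeholders, not the post's actual numbers):

```python
import math

# Hypothetical (mean, sigma, observed) triples; the observed values are
# invented for illustration and are NOT the post's actual outcomes.
predictions = [
    (54, 30, 51),              # Biden vote share, 10x-wide sigma
    (15_000, 50_000, 24_000),  # COVID deaths, 10x-wide sigma
]

def sigma_z(preds):
    # RMSE of the z-scores: sqrt(mean(((x - mu) / sigma)^2))
    zs = [(x - mu) / s for mu, s, x in preds]
    return math.sqrt(sum(z * z for z in zs) / len(zs))

wide = sigma_z(predictions)                                      # this comment's sigmas
narrow = sigma_z([(mu, s / 10, x) for mu, s, x in predictions])  # the post's sigmas
# Widening every sigma by 10x divides sigma_z by 10, so the
# recalibrated sigma_covid comes out identical either way:
assert abs(50_000 * wide - 5_000 * narrow) < 1e-6
```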
(On a side note, I hate Brier scores and prefer the Bernoulli likelihood, because Brier says that predicting 0% or 2% on something that happens 1% of the time is 'equally wrong' (same squared error)… whereas the Bernoulli likelihood says you are an idiot for saying 0% when it can actually happen.)
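A quick sketch of that complaint, comparing the two scores on an event with a 1% base rate (the numbers are purely illustrative):

```python
import math

def brier(p, outcome):
    return (p - outcome) ** 2

def log_loss(p, outcome):
    # Negative Bernoulli log-likelihood; truly infinite at p = 0 when
    # the event happens, so clamp only to keep the demo finite.
    eps = 1e-15
    p = min(max(p, eps), 1 - eps)
    return -(outcome * math.log(p) + (1 - outcome) * math.log(1 - p))

def expected(score, p, base_rate=0.01):
    # Long-run average score for always predicting p on an event that
    # actually happens with probability base_rate.
    return base_rate * score(p, 1) + (1 - base_rate) * score(p, 0)

# Brier scores "0%" and "2%" identically on a 1%-base-rate event
# (both give an expected score of 0.01)...
assert abs(expected(brier, 0.00) - expected(brier, 0.02)) < 1e-12
# ...while log loss punishes the "impossible, but it happened" 0%
# prediction far more harshly (infinitely, without the clamp):
assert expected(log_loss, 0.00) > expected(log_loss, 0.02)
```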
When I google [Bernoulli likelihood] I end up at the distribution, and I don't see anything there about how to use it as a measure of calibration and/or decisiveness and/or anything else.
One hypothesis I have is that you have some core idea like "the deep true nature of every mental motion comes out as a distribution over a continuous variable… and the only valid comparison is ultimately a comparison between two distributions". If that is what you believe, then by pointing to a different distribution you would have pointed me towards "a different scoring method" (even though I can't see a scoring method here)…
Another consequence of you thinking that distributions are the "atoms of statistics" (in some sense) would be that you think a Brier score has some distributional assumption already lurking inside it as its "true form", and furthermore that this distribution is less sensible to use than the Bernoulli?
...
As to the original issue: whether or not it is possible, with continuous variables, to "max the calibration and totally fail at knowing things and still get an OK <some kind of score>" might not prove very much about <that score>?
Here I explore for a bit… can I come up with a N(m,s) guessing system that knows nothing but seems calibrated?
One thought I had: perhaps whoever is picking the continuous numbers has biases, and then you could make predictions of sigma basically at random at first, and then, as confirming data comes in for that source, that tells you about the kinds of questions you're getting, so in future rounds you might tweak your guesses with no particular awareness of the semantics of any of the questions… such as by using the same kind of reasoning that led you to conclude "widen my future intervals by 73%" in the example in the OP.
With a bit of extra glue logic that says something vaguely like “use all past means to predict a new mean of all numbers so far” that plays nicely with the sigma guesses… I think the standard sigma and mean used for all the questions would stabilize? Probably? Maybe?
I think I'd have to actually sit down and do real math (and maybe some numerical experiments) to be sure that it would. But it seems like the mean would probably stabilize, and once the mean stabilizes the sigma could be adjusted to get to 1.0 eventually too? Maybe some assumptions about the biases of the source of the numbers have to be added to get this result, but I'm not sure if there are any unique such assumptions that are privileged. Certainly a Gaussian distribution seems unlikely to me. (Most of the natural data I run across is fat-tailed and "power law looking".)
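A quick numerical sketch of this content-free guesser (assuming, for simplicity, answers drawn i.i.d. from one fixed fat-tailed source; a real question source would be messier):

```python
import math
import random

random.seed(0)

# A "content-free" forecaster: ignore question semantics entirely and
# predict the running mean / running std of all past answers. Assumed
# setup: answers drawn i.i.d. from one fixed lognormal (fat-tailed) source.
answers = [random.lognormvariate(0, 1) for _ in range(100_000)]

zs = []
mean, m2, n = 0.0, 0.0, 0  # Welford's running mean/variance
for x in answers:
    if n >= 100:  # skip a burn-in while the running stats are still noisy
        sigma = math.sqrt(m2 / (n - 1))
        zs.append((x - mean) / sigma)  # z-score of this round's "prediction"
    # only after scoring do we update the running stats with the answer
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)

sigma_z = math.sqrt(sum(z * z for z in zs) / len(zs))
print(round(sigma_z, 2))  # hovers near 1.0 for this source
```

So at least for a stationary source, the guesser looks calibrated (σ̂z ≈ 1) while knowing literally nothing about the individual questions.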
The method I suggest above would then give you a “natural number scale and deviation” for whatever the source was for the supply of “guess this continuous variable” puzzles.
As the number of questions goes up (into the thousands? the billions? the quadrillions?) I feel like this content neutral sigma could approach 1.0 if the underlying source of continuous numbers to estimate was not set up in some abusive way that was often asking questions whose answer was “Graham’s Number” (or doing power law stuff, or doing anything similarly weird). I might be wrong here. This is just my hunch before numerical simulations <3
And if my proposed “generic sigma for this source of numbers” algorithm works here… it would not be exactly the same as “pick an option among N at random and assert 1/N confidence and thereby seem like you’re calibrated even though you know literally nothing about the object level questions” but it would be kinda similar.
My method is purposefully essentially contentless… except it seems like it would capture the biases of the continuous number source for most reasonable kinds of number sources.
...
Something I noticed… I remember back in the early days of LW there was an attempt to come up with a fun game for meetups that exercises calibration on continuous variables. It ended up ALSO needing two numbers (not just a point estimate).
The idea was to have a description of a number and a (maybe implicitly) asserted calibration/accuracy rate that a player should aim for (like being 50% confident or 99% confident or whatever).
Then, for each question, each player emits two numbers between -Inf and +Inf, and gets penalized if the true number is outside their bounds, rewarded if it is inside, and rewarded more for a narrower bound than anyone else's. The reward schedule should be such that the accuracy rate they have been told to aim for would be the winning calibration to have.
One version of this we tried that was pretty fun and pretty easy to score aimed for “very very high certainty” by having the scoring rule be: (1) we play N rounds, (2) if the true number is ever outside the bounds you get −2N points for that round (enough to essentially kick you out of the “real” game), (3) whoever has the narrowest bounds that contains the answer gets 1 point for that round. Winner has the most points at the end.
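A sketch of that scoring rule in code (the player names and bounds are made up for illustration):

```python
import math

def score_game(bounds_by_player, answers):
    """bounds_by_player: {name: [(low, high), ...]}, one pair per round."""
    n = len(answers)
    points = {name: 0 for name in bounds_by_player}
    for i, truth in enumerate(answers):
        widths = {}  # players whose bounds contain the truth this round
        for name, bounds in bounds_by_player.items():
            low, high = bounds[i]
            if low <= truth <= high:
                widths[name] = high - low
            else:
                points[name] -= 2 * n  # a single miss is near-disqualifying
        if widths:  # narrowest correct interval takes the point
            points[min(widths, key=widths.get)] += 1
    return points

# The degenerate [-Inf, +Inf] strategy never misses, so it beats a
# sharper player who busts even once:
inf = math.inf
scores = score_game(
    {"hedger": [(-inf, inf)] * 2,
     "sharp":  [(10, 20), (5, 6)]},
    answers=[15, 50],
)
print(scores)  # {'hedger': 1, 'sharp': -3}
```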
Playing this game for 10 rounds, the winner in practice was often someone who just turned in [-Inf, +Inf] for every question, because it turns out people seem to be really terrible at “knowing what they numerically know” <3
The thing that I’m struck by is that we basically needed two numbers to make the scoring system transcend the problems of “different scales or distributions on different questions”.
That old game used “two point estimates” to get two numbers. You’re using a midpoint and a fuzz factor that you seem strongly attached to for reasons I don’t really understand. In both cases, to make the game work, it feels necessary to have two numbers, which is… interesting.
It is weird to think that this problem space (related to one-dimensional uncertainty) is sort of intrinsically two dimensional. It feels like something there could be a theorem about, but I don’t know of any off the top of my head.
There is a new game currently sold at Target that is about calibration and estimation.
Each round has two big numbers, researched from questions like "how many YouTube videos were uploaded per hour in 2020?" or "how many pounds does Mars weigh?" Each player guesses how much larger one is than the other (i.e. 2x, 5x, 10x, 100x, 1000x), and can bet on themselves if they are confident in their estimation.