Part of the output of your quizzes is a line of the form “Your chance of being well calibrated, relative to the null hypothesis, is 50.445538580926 percent.” How is this number computed?
I chose “25% confident” for 25 questions and got 6 of them (24%) right. That seems like a pretty good calibration … but 50.44% chance of being well calibrated relative to null doesn’t seem that good. Does that sentence mean that an observer, given my test results, would assign a 50.44% probability to my being well calibrated and a 49.56% probability to my not being well calibrated? (or to my randomly choosing answers?) Or something else?
It’s also completely ridiculous, with a sample size of ~10 questions, to give the success rate and the probability of being well calibrated as percentages with 12 decimal places. Since the uncertainty in such a small sample is on the order of several percent, just round to the nearest percentage point.
It probably just computes it as a float and then prints the whole float.
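For illustration, here is a minimal sketch of that difference in Python (the quiz’s actual language and code aren’t known, and the value below is just the number quoted above):

p = 50.445538580926          # hypothetical raw float the quiz computed
print(p, "percent")          # prints the whole float: 50.445538580926 percent
print(round(p), "percent")   # rounded to the nearest percent: 50 percent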
(I do recognize the silliness of replying to a three-year-old comment that is itself replying to a six-year-old comment.)
It’s not silly. I still find these newer comments useful.
And here we are one year later!
Yes, do it for posterity!
I would like to chime in and point out that, as of today, the domain “acceleratingfuture (dot) com” is owned by a Russian bookmaker.
No! P(Test results | Perfect calibration) / P(Test results | Whatever the null is) ≠ P(Perfect calibration | Test results)!
You can also lodge this as a complaint about null hypothesis testing; I would’ve thought that perfect calibration would be the null. Perhaps the null is a model where you just randomly say a probability from 0 to 100.
I’m assuming that they really calculated a likelihood ratio P(Data|Perfect) / P(Data|Null) rather than the posterior P(Perfect|Data) / P(Null|Data), which is what the words they used would mean if taken literally. But maybe they have some priors P(Perfect) / P(Null) that they used. (What they should do is just report the likelihood ratio instead of their posterior.)
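To spell out the missing step: by Bayes’ rule in odds form, P(Perfect|Data) / P(Null|Data) = [P(Perfect) / P(Null)] * [P(Data|Perfect) / P(Data|Null)], so turning the likelihood ratio into a “chance of being well calibrated” requires choosing prior odds P(Perfect) / P(Null).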
If you have your data and want to compute P(Data|Perfect), you take the product over all questions: Π_i (p_i if event i happened, 1 − p_i if it didn’t).
So, for example, if I predicted 20%, 70%, 30% and the actual results were No, Yes, Yes, then P(Data|Perfect) = 0.8 * 0.7 * 0.3 = 0.168. If you have some other hypothesis (e.g. whatever their null is), you can compute P(Data|Other Hypothesis) by using the predictions that hypothesis makes about how your reported probabilities relate to the propensities of the events. A hypothesis here should be a function f(reported) = P(Event happens | reported).
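Here is a minimal sketch of the whole calculation in Python. The coin-flip null below is purely an assumption for illustration (the quiz’s actual null model isn’t documented), and the even prior odds used at the end are likewise just an assumption:

from math import prod

def likelihood(reports, outcomes, f):
    # P(Data | Hypothesis), where the hypothesis is a function
    # f(reported probability) -> P(event happens | reported).
    return prod(f(p) if happened else 1 - f(p)
                for p, happened in zip(reports, outcomes))

# The worked example above: predictions 20%, 70%, 30%; outcomes No, Yes, Yes.
reports = [0.20, 0.70, 0.30]
outcomes = [False, True, True]

perfect = lambda p: p            # perfect calibration: the report is the true propensity
coin_flip_null = lambda p: 0.5   # ASSUMED null, for illustration only: every event is 50/50

p_data_perfect = likelihood(reports, outcomes, perfect)       # 0.8 * 0.7 * 0.3 = 0.168
p_data_null = likelihood(reports, outcomes, coin_flip_null)   # 0.5 ** 3 = 0.125
likelihood_ratio = p_data_perfect / p_data_null               # 1.344

# Turning this into a "chance of being well calibrated, relative to the null"
# requires prior odds; with even prior odds it is LR / (1 + LR).
posterior = likelihood_ratio / (1 + likelihood_ratio)
print(f"likelihood ratio: {likelihood_ratio:.3f}, posterior (even prior): {posterior:.0%}")

Plugging in the top comment’s data (25 questions answered at 25% confidence, 6 of them correct) is just a change to reports and outcomes.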