If I observe 4 heads out of 4 and my prior was uniform across [0,1], then the maximum of my posterior (the MAP) is at 1, and this should definitely be within my error bars. Calculating the mean and adding symmetric error bars doesn’t work for asymmetric distributions.
To do this method more accurately you would have to calculate the full posterior distribution across [0,1] and use that to create error bars. Personally I would do this numerically but there may well be an analytical solution someone else will know about.
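For what it’s worth, with a uniform prior the posterior after k correct answers out of n is a Beta(k+1, n-k+1) distribution, so there is a closed form as well as the purely numerical route. A minimal sketch of both, assuming Python with NumPy/SciPy, an equal-tailed 90% interval, and illustrative function names:

```python
import numpy as np
from scipy import stats

def posterior_interval(correct, total, level=0.90):
    """Equal-tailed credible interval for the true rate, assuming a uniform prior.

    With a uniform prior on [0, 1] the posterior is Beta(correct + 1, total - correct + 1).
    """
    posterior = stats.beta(correct + 1, total - correct + 1)
    lo, hi = posterior.ppf([(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

def posterior_interval_numeric(correct, total, level=0.90, grid=10001):
    """Same interval, computed numerically on a grid (no closed form needed)."""
    p = np.linspace(0.0, 1.0, grid)
    weight = p**correct * (1.0 - p)**(total - correct)  # likelihood x flat prior
    cdf = np.cumsum(weight)
    cdf /= cdf[-1]
    lo = p[np.searchsorted(cdf, (1 - level) / 2)]
    hi = p[np.searchsorted(cdf, 1 - (1 - level) / 2)]
    return lo, hi

print(posterior_interval(4, 4))          # roughly (0.55, 0.99) for 4 heads out of 4
print(posterior_interval_numeric(4, 4))  # agrees to grid precision
```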
Alternatively, a frequentist approach: create error bars on the target percentage, rather than on the percentage achieved.
For each percentage grouping, see how many questions have been answered using that percentage. Then use a binomial distribution to calculate the likelihood of each number of correct responses, assuming that I am perfectly calibrated. This is essentially calculating a p-value with the null hypothesis being “I am perfectly calibrated”.
For example, say I’ve answered 80% 4 times. If I’m perfectly calibrated I have a 0.8^4 ≈ 41% chance of getting them all correct. Correspondingly I have:
0.8^3 × 0.2 × 4 ≈ 41% to get 3 correct
0.8^2 × 0.2^2 × 6 ≈ 15.4% to get 2 correct
0.8 × 0.2^3 × 4 ≈ 2.6% to get 1 correct
0.2^4 ≈ 0.2% to get 0 correct
If I am using a 90% CI (5% to 95%) then getting 0 correct is not inside my interval, and nor is getting 1 correct (since 0.2% + 2.6% < 5%), but any of the other results are. So the top of my target error bar would reach to 100% and the bottom would be somewhere between 25% and 50%.
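A minimal sketch of this acceptance-region calculation, assuming Python/SciPy; calibration_band is an illustrative name, and because the distribution is discrete it reports the band between the lowest and highest counts that are still consistent, so the example above comes out as 50%-100% rather than pinning down the exact cutoff between 25% and 50%:

```python
from scipy import stats

def calibration_band(n_answered, stated_p, level=0.90):
    """Range of observed frequencies consistent with the null 'I am perfectly calibrated'.

    Under the null the number correct is Binomial(n_answered, stated_p); counts are
    dropped from each tail while their cumulative probability stays below (1 - level) / 2.
    """
    tail = (1 - level) / 2
    dist = stats.binom(n_answered, stated_p)

    lo = 0
    while dist.cdf(lo) < tail:       # drop counts from the bottom tail
        lo += 1
    hi = n_answered
    while dist.sf(hi - 1) < tail:    # drop counts from the top tail
        hi -= 1
    return lo / n_answered, hi / n_answered

print(calibration_band(4, 0.8))  # (0.5, 1.0): 2, 3 or 4 correct are consistent; 0 or 1 are not
print(calibration_band(4, 1.0))  # (1.0, 1.0): zero width, matching the 100% case noted below
```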
It is possible to combine all of the answers to create a single p-value across all percentages but this gets more complicated.
(Of course there would be zero-width error bars at 0% and 100% responses, as any failures on these percentages are irrecoverable, but this is right and proper)
Thanks for your recommendation! I have corrected the problem with the asymmetric distribution (now computing the whole distribution) and added a second graph showing exactly what you suggest and it looks good.
Unfortunately, for the first approach that I implemented, the MAP is not always within a 90% confidence interval (it is outside of it when the MAP is 1 or 0). I agree that it is confusing and seems undesirable.
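To illustrate the issue under the same assumptions as in the sketch above (uniform prior, equal-tailed 90% interval; this is only a sketch of the effect, not the actual implementation):

```python
from scipy import stats

# After 4 correct out of 4 with a uniform prior the posterior is Beta(5, 1),
# whose CDF is p**5, so the 95th percentile is 0.95**(1/5), roughly 0.99.
# An equal-tailed interval always leaves 5% of probability above its upper end,
# so it can never reach a MAP that sits exactly at 1 (or, by symmetry, at 0).
print(stats.beta(5, 1).ppf(0.95))  # ~0.9898
print(0.95 ** (1 / 5))             # same value from the closed-form CDF
```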
(You might need to hard-refresh the page (CTRL+SHIFT+R) if you want to see the update)
This looks great!
(I’m having a bug where the graph only displays results for 0-40% & 100% but I’m not sure if that’s just my computer being weird)
It’s because I changed it to only show estimates for probabilities which have received at least 4 answers, and you have not yet answered enough questions. I am not confident that this change is good and I might revert it.
Thanks, I think I get it now.