(If this makes no sense, then ignore it): Use an arbitrary distribution for predictions, then use its CDF (universality of the Uniform) to convert to U(0,1), and then transform to a z-score using the inverse CDF (percent point function) of the unit normal. Finally, use this as z_i when calculating your calibration.
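For concreteness, a minimal sketch of that pipeline (the forecast distribution, outcome value, and function name are illustrative assumptions, not from any post):

```python
# Minimal sketch of the pipeline above: forecast distribution -> CDF -> U(0,1) -> z-score.
from scipy import stats

def z_score(forecast_dist, outcome):
    """PIT-transform the realised outcome, then map it to a unit-normal z-score."""
    u = forecast_dist.cdf(outcome)   # universality of the Uniform: u ~ U(0, 1) if calibrated
    return stats.norm.ppf(u)         # percent point function (inverse CDF) of N(0, 1)

# e.g. a log-normal forecast for some positive quantity, realised outcome 160
z = z_score(stats.lognorm(s=0.5, scale=100), outcome=160)
```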
Well, this makes some sense, but it would make even more sense to do only half of it.
Take your forecast, calculate its percentile. Then you can do all the traditional calibration stuff. All this stuff with z-scores is needlessly complicated. (This is how Metaculus does its calibration for continuous forecasts.)
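For what it’s worth, a minimal sketch of what that looks like (the forecast distributions and outcomes are made up for illustration):

```python
# Compute the percentile of each realised outcome under its forecast distribution,
# then eyeball whether those percentiles look uniform. Data below is illustrative.
import numpy as np
from scipy import stats

forecasts_and_outcomes = [
    (stats.norm(loc=50, scale=10), 62),
    (stats.norm(loc=0.52, scale=0.02), 0.513),
    (stats.lognorm(s=0.5, scale=100), 160),
]

percentiles = np.array([dist.cdf(outcome) for dist, outcome in forecasts_and_outcomes])

# If you are calibrated, roughly 10% of outcomes should fall below your 10th percentile, etc.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"share of outcomes below my {p:.0%} percentile: {np.mean(percentiles <= p):.0%}")
```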
Can I use this image for my “part 2” post, to explain how “pros” calibrate their continuous predictions and how it stacks up against my approach? I will add you as a reviewer before publishing so you can make corrections in case I accidentally straw-man or misunderstand you :)
I will probably also make a part 3 titled “Try t predictions” :), which should address some of your other critiques about the normal being bad :)
This is a good point, but you need less data to check whether your squared errors are close to 1 than to check whether your CDF-transformed values (percentiles) look uniform, so if the majority of predictions are normal I think my approach is better.
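A sketch of the check I have in mind, assuming the σz from the post is the root-mean-square of the standardised errors (my reading; the numbers are made up):

```python
# Standardise each error by the forecast's stated sd, then check whether the
# mean squared error is close to 1. Predictions below are illustrative.
import numpy as np

predictions = [(50, 10, 62), (0.52, 0.02, 0.513), (120, 30, 100)]  # (mean, sd, outcome)
zs = np.array([(outcome - mean) / sd for mean, sd, outcome in predictions])

sigma_z = np.sqrt(np.mean(zs ** 2))  # ~1 if calibrated; > 1 means intervals were too narrow
print(f"scale future interval widths by roughly {sigma_z:.2f}")
```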
The main advantage of the SimonM/Metaculus approach is that it works for any continuous distribution.
I don’t understand why you think that’s true. To rephrase what you’ve written:
“You need less data to check whether samples are approximately N(0,1) than to check whether they are approximately U(0,1)”
It seems especially strange when you think that transforming your U(0,1) samples to N(0,1) makes the problem soluble.
TLDR for our disagreement:
SimonM: Transforming to Uniform distribution works for any continuous variable and is what Metaculus uses for calibration
Me: the variance trick to calculate σz from this post is better if your variables come from a Normal distribution, or something close to a normal.
SimonM: Even for a Normal the Uniform is better.
I disagree with that characterisation of our disagreement; I think it’s far more fundamental than that.
1. I think you misrepresent the nature of forecasting (in its generality) versus modelling in some specifics.
2. I think your methodology is needlessly complicated.
3. I propose what I think is a better methodology.
To expand on 1: I think (although I’m not certain, because I find your writing somewhat convoluted and unclear) that you’re making an implicit assumption that the error distribution is consistent from forecast to forecast, namely that your errors when forecasting COVID deaths and Biden’s vote share come from some similar process. This doesn’t really mirror my experience in forecasting. I think this model makes much more sense when looking at a single model which produces lots of forecasts. For example, if I had a model for COVID deaths each week, and after 5-10 weeks I noticed that my model was under- or over-confident, then this sort of approach might make sense as a way to tweak my model.
To expand on 2: I’ve read your article a few times and I still don’t fully understand what you’re getting at. As far as I can tell, you’re proposing a model for how to adjust your forecasts based on looking at their historic performance. Having a specific model for doing this seems to miss the point of what forecasting in the real world is like. I’ve never created a forecast and gone “hmm… usually when I forecast things with 20% they happen 15% of the time, so I’m adjusting my forecast down” (which is, I think, what you’re advocating); it’s more a notion of “I am often over/under-confident: when I create this model, is there some source of variance I am missing or over-estimating?”. Setting some concrete rules for this doesn’t make much sense to me.
Yes, I do think it’s much simpler for people to look at a list of percentiles of things happening, to plot them, and then think “am I generally over-confident / under-confident?” I think it’s generally much easier for people to reason about percentiles than standard deviations. (Yes, I know 68-95-99.7, but I don’t know without thinking quite hard what 1.4 sd or 0.5 sd means.) I think leaning too heavily on the math tends to make people make some pretty obvious mistakes.
I am sorry if I have straw-manned you, and I think your above post is generally correct. I think we are coming from two different worlds.
You are coming from Metaculus, where people make a lot of predictions, where having 50+ predictions is the norm, and thus looking at a U(0, 1) gives a lot of intuitive evidence of calibration.
I come from a world where people want to improve in all kinds of ways, and one of them is prediction. Few people write more than 20 predictions down a year, and when they do, they more or less ALWAYS make dichotomous predictions. I expect many of my readers to be terrible at predicting, just like myself.
You are reading a post with the message “raise the sanity waterline from 2% to 5% of your level” and asking “why is this better than making 600 predictions and looking at their inverse CDF?”, and the answer is: it’s not, but it’s still relevant, because most people do not make 600 predictions and do not know what an inverse CDF is. I am even explaining what a normal distribution is because I do not expect my audience to know...
You are absolutely correct that they probably do not share an error distribution. But I am trying to get people from knowing 1 distribution to knowing 2.
Scott Alexander does a “when I predict this, it really means that” exercise every year for his binary predictions. This gives him an intuitive feel for “I should adjust my odds up/down by x”. I am trying to do the same for normal-distribution predictions, so people can check their predictions.
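Something like this, for the binary case (made-up numbers, just to show the shape of the exercise):

```python
# Group binary predictions by the stated probability and compare with how often
# those things actually happened. All data here is illustrative.
import numpy as np

stated   = np.array([0.6, 0.6, 0.6, 0.8, 0.8, 0.9, 0.9, 0.9, 0.9])  # probabilities written down
happened = np.array([1,   0,   1,   1,   1,   1,   1,   0,   1  ])  # did the event occur?

for p in np.unique(stated):
    mask = stated == p
    print(f"when I predict {p:.0%}, it really means {happened[mask].mean():.0%} "
          f"({mask.sum()} predictions)")
```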
I agree your methodology is superior :). All I propose is that people sometimes make continuous predictions, and if they want to start doing that and track how much they suck, then I give them instructions for quickly getting a number for how well it is going.
I still think you’re missing my point.
If you’re making ~20 predictions a year, you shouldn’t be doing any funky math to analyse your forecasts. Just go through each one after the fact and decide whether or not the forecast was sensible with the benefit of hindsight.
I think this is exactly my point: if someone doesn’t know what a normal distribution is, maybe they should be looking at their forecasts in a fuzzier way rather than trying to back-fit some model to them.
I disagree that that’s all you propose. As I said in an earlier comment, I’m broadly in favour of people making continuous forecasts, as they convey more information. You paired your article with what I believe is broadly bad advice around analysing those forecasts. (Especially if we’re talking about a sample of ~20 forecasts.)
I would love to have you as a reviewer of my second post, as there I will try to justify why I think this approach is better. You can even super-dislike it before I publish if you still feel like that once I present my strongest arguments, or maybe convince me that I am wrong, so that I don’t publish part 2 and make a partial retraction of this post :). There is a decent chance you are right, as you are the stronger predictor of the two of us :)
I’d be happy to.
I upvoted all comments in this thread for constructive criticism, response to it, and in the end even agreeing to review each other!
you are missing the step where I am transforming the arbitrary distribution to U(0, 1)
Medium confident in this explanation: because the squares of random variables from the same distribution follow a gamma distribution, and it’s easier to see violations in a gamma than in a uniform. If the majority of your predictions are from weird distributions then you are correct, but if they are mostly from normal or unimodal ones, then I am right. I agree that my solution is a hack that would make no statistician proud :)
Edit: Intuition pump: a T(0, 1, 100) obviously looks very normal, so transforming to U(0,1) and then to N(0, 1) will create basically the same distribution. The sum of squares of a bunch of normals is Chi^2, so the Chi^2 is the best distribution for detecting violations. Obviously there is a point where this approximation sucks and U(0, 1) still works.
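A quick sanity check of that intuition pump (a sketch, not from the post):

```python
# Samples from a t-distribution with 100 degrees of freedom, PIT-transformed and
# then mapped through the normal inverse CDF, come out almost unchanged.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = stats.t(df=100).rvs(size=10_000, random_state=rng)

u = stats.t(df=100).cdf(x)   # exact PIT, so u is exactly U(0, 1)
z = stats.norm.ppf(u)        # back to "z-scores"; nearly identical to x because t(100) ≈ N(0, 1)

print(np.max(np.abs(z - x)))  # small
print(np.mean(z ** 2))        # ≈ 1, and sum(z ** 2) is Chi^2-distributed with 10_000 dof
```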
I am absolutely not missing that step. I am suggesting that should be the only step.
(I don’t agree with your intuitions in your “explanation” but I’ll let someone else deconstruct that if they want)
Hard disagree. From two data points I calculate that my future intervals should be 1.73 times wider; converting these two data points to U(0,1) I get
[0.99, 0.25]
How should I update my future predictions now?
If you think 2 data points are sufficient to update your methodology to 3 s.f. of precision, I don’t know what to tell you. I think if I have 2 data points and one of them is 0.99 then it’s pretty clear I should make my intervals wider, but how much wider is still very uncertain with very little data. (It’s also not clear whether I should be making my intervals wider or changing my mean too.)
I don’t know what s.f. is, but the interval around 1.73 is obviously huge; with 5-10 data points it’s quite narrow if your predictions are drawn from N(1, 1.73), and that is what my next post will be about. There might also be a smart way to do this using the Uniform, but I would be surprised if its dispersion is smaller than a chi^2 distribution’s :) (Changing the mean is cheating; we are talking about calibration, so you can only change your dispersion.)
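A sketch of the kind of interval I mean, under my assumption that the standardised errors are N(0, σ), so that the sum of their squares divided by σ² is Chi^2 with n degrees of freedom (my reading, not a quote from the post):

```python
# Confidence interval for the widening factor sigma from n standardised errors,
# assuming z_i ~ N(0, sigma) so that sum(z**2) / sigma**2 ~ Chi^2(n).
import numpy as np
from scipy import stats

def sigma_interval(zs, level=0.90):
    zs = np.asarray(zs, dtype=float)
    n = len(zs)
    ss = np.sum(zs ** 2)
    q_low, q_high = stats.chi2(df=n).ppf([(1 - level) / 2, (1 + level) / 2])
    return np.sqrt(ss / q_high), np.sqrt(ss / q_low)  # (lower, upper) bound for sigma

# The two data points above, with percentiles 0.99 and 0.25, mapped back to z-scores:
zs = stats.norm.ppf([0.99, 0.25])
print(np.sqrt(np.mean(zs ** 2)))  # point estimate of the widening factor, about 1.7 here
print(sigma_interval(zs))         # very wide with n = 2; noticeably tighter with 5-10 points
```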
Where can I access this for my profile on Metaculus? I have everything unlocked but don’t see it in the options.
Go to your profile page. (Will be something like https://www.metaculus.com/accounts/profile/{some number}/). Then in the track record section, switch from Brier Score to “Log Score (continuous)”