What’s wrong with traditional error bars or some equivalent thereof?
Do you mean just adding error bars (which indicate the amount of noise in each sample) to the traditional-style bar graph? If so, it doesn’t affect most of the drawbacks and benefits that I mention in the post (e.g. that you restrict yourself to a small set of predefined confidence levels, and that the result changes discontinuously when your data changes slightly).
Nope.
For each confidence level there is a distribution of actual outcomes that you’d expect. You can calculate (or simulate) it for any confidence level, so you are not restricted to a small predefined set. This is basically your forecast: you are saying “I have made n predictions allocated to the X% confidence bucket, and the distribution of successes/failures should look like this”. Note: it’s a distribution, not a scalar. There is also no noise involved.
You plot these distributions in any form you like and then you overlay your actual number of successes and failures (or their ratio if you want to keep things simple).
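For a single confidence level, a minimal sketch of what I mean (in Python; the probability, the number of predictions, and the actual count are made up purely for illustration):

```python
# Illustrative sketch: for one confidence level, compare the expected
# distribution of successes against the count you actually observed.
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

p = 0.7        # confidence level of this bucket (made up)
n = 40         # number of predictions made at this confidence level (made up)
actual = 23    # how many of them actually came true (made up)

k = np.arange(n + 1)
plt.bar(k, binom.pmf(k, n, p), alpha=0.5, label=f"expected at p={p}")
plt.axvline(actual, color="red", label=f"actual = {actual}")
plt.xlabel("number of successes")
plt.ylabel("probability")
plt.legend()
plt.show()
```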
Yes, sure, but if you do this for each confidence level separately, then you have already given up most of the improvements that are possible with my proposed method.
The most serious problem with the traditional method is not how nicely it shows the distribution within each bucket, but the fact that there are separate buckets at all.
I am probably not explaining myself clearly.
Your buckets can be as granular as you want. For the sake of argument, let’s set them at 1%, so that you have a 50% bucket, a 51% bucket, a 52% bucket, etc. Each bucket gives rise to a binomial distribution of the same form, just with a slightly different success probability, so neighbouring buckets have very similar distributions. If you plot them, the picture is going to be nice and smooth, with no discontinuities.
Some, or maybe even most, of your buckets will be empty. That’s fine: we’ll interpolate values into them, because one of our prior assumptions is smoothness. We don’t expect the calibration for the 65% bucket to be radically different from the calibration for the 66% bucket. We’ll probably have a tuning parameter to specify how smooth we would like things to be.
So what happens with the non-empty buckets? For each of them we can calculate the error, which we can define as, say, the difference between the actual and the expected ratio of successes to predictions (it can be zero). This gives you a set of points which, if plotted, will be quite spiky, since a lot of buckets might contain only a single observation. At this point you recall our assumption of smoothness and run some sort of kernel smoother over your error points (note that they should be weighted by the number of observations in each bucket). If you just want a visual representation, you plot that smoothed line and you’re done.
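A rough sketch of that smoothing step (the Gaussian kernel, the 0.05 bandwidth, and the function name are just illustrative choices; smoothing the per-prediction errors directly is equivalent to smoothing per-bucket errors weighted by observation counts):

```python
# Illustrative sketch: per-bucket calibration errors smoothed with a
# Gaussian kernel. The bandwidth is the "smoothness" tuning parameter.
import numpy as np

def smoothed_error(confidences, outcomes, bandwidth=0.05, grid=None):
    """confidences: stated probabilities (0..1); outcomes: 1 = came true, 0 = didn't."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    errors = outcomes - confidences          # per-prediction error (actual minus expected)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)    # 1%-granular "buckets"
    # Kernel-weighted average of the errors around each grid point; because every
    # prediction contributes once, buckets with more observations weigh more.
    weights = np.exp(-0.5 * ((grid[:, None] - confidences[None, :]) / bandwidth) ** 2)
    return grid, (weights * errors).sum(axis=1) / weights.sum(axis=1)

# Example with made-up predictions: mostly 65%-confidence calls plus one at 90%
grid, err = smoothed_error([0.65, 0.65, 0.65, 0.65, 0.9], [1, 1, 1, 0, 1])
```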
If you want some numerical aggregate of this, a simple number would be the average error (note that we kept the sign of the error, so errors in different directions will cancel each other out). To make it a bit more meaningful, you can run a simulation of the perfect forecaster: say, 1000 iterations of the same set of predictions, where each prediction in the X% bucket resolves as True X% of the time and as False (100-X)% of the time. Calculate the average error in each iteration, form an empirical distribution out of these average errors, and see where your actual average error falls.
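A minimal sketch of that simulation (the confidences and outcomes here are made up for illustration):

```python
# Illustrative sketch: null distribution of the average error under a
# perfectly calibrated forecaster, for the same set of stated confidences.
import numpy as np

rng = np.random.default_rng(0)

def perfect_forecaster_errors(confidences, iterations=1000):
    """For each iteration, resolve every prediction True with its stated
    probability and return the signed average error (actual - expected)."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = rng.random((iterations, len(confidences))) < confidences
    return (outcomes - confidences).mean(axis=1)

confidences = [0.6, 0.7, 0.7, 0.9, 0.55]       # your stated probabilities (made up)
actual_outcomes = np.array([1, 1, 0, 1, 1])    # what actually happened (made up)
actual_avg_error = (actual_outcomes - np.asarray(confidences)).mean()

null = perfect_forecaster_errors(confidences)
# Where does your actual average error fall in the simulated distribution?
percentile = (null < actual_avg_error).mean()
```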
If you want not a single aggregate but estimates for some specific buckets (say, 50%, 70%, and 90%), look at your kernel-smoothed line and read the values off it.