Your buckets can be as granular as you want. For the sake of argument, let's set them at 1% -- so you have a 50% bucket, a 51% bucket, a 52% bucket, etc. Each bucket is governed by the same kind of binomial distribution, just with a slightly different success probability, so neighboring buckets' distributions are close. If you plot them, the picture will be nice and smooth, with no discontinuities.
Some, or maybe even most, of your buckets will be empty. That's fine, we'll interpolate values into them, because one of our prior assumptions is smoothness -- we don't expect the calibration for the 65% bucket to be radically different from the calibration for the 66% bucket. We'll probably have a tuning parameter to specify how smooth we would like things to be.
So what's happening with the non-empty buckets? For each of them we can calculate the error, which we can define as, say, the difference between the actual and the expected ratio of successes to predictions (it can be zero). This gives you a set of points which, if plotted, will be quite spiky, since a lot of buckets might contain only a single observation. At this point you recall our smoothness assumption and run some sort of kernel smoother over your error points (note that they should be weighted by the number of observations in each bucket); a sketch of this follows below. If you just want a visual representation, you plot that smoothed line and you're done.
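To make that concrete, here's a minimal Python sketch of the kind of thing I mean, using count-weighted Nadaraya-Watson averaging with a Gaussian kernel. The bandwidth argument is the smoothness tuning parameter mentioned above; the function name, the default bandwidth, and the 1% bucketing are all just illustrative choices, not anything canonical:

```python
import numpy as np

def smoothed_calibration_error(probs, outcomes, bandwidth=0.05, grid=None):
    """Weighted Gaussian-kernel smoother over per-bucket calibration errors.

    probs    -- stated probability for each prediction (e.g. 0.65)
    outcomes -- 1 if that prediction came true, 0 otherwise
    bandwidth is the smoothness knob: larger means smoother.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)  # evaluate at every 1% bucket

    # Group predictions into 1% buckets and compute each bucket's error:
    # actual success ratio minus the stated probability.
    buckets = np.round(probs, 2)
    centers = np.unique(buckets)
    counts = np.array([np.sum(buckets == c) for c in centers])
    actual = np.array([outcomes[buckets == c].mean() for c in centers])
    errors = actual - centers

    # Nadaraya-Watson: kernel weights multiplied by observation counts,
    # so heavily populated buckets dominate their neighborhood and empty
    # buckets get interpolated values, per the smoothness assumption.
    smoothed = np.empty_like(grid)
    for i, g in enumerate(grid):
        w = counts * np.exp(-0.5 * ((g - centers) / bandwidth) ** 2)
        smoothed[i] = np.sum(w * errors) / np.sum(w)
    return grid, smoothed
```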
If you want a numerical aggregate of this, a simple number would be the average error (note that we kept the sign of the error, so errors in different directions will cancel each other out). To make it a bit more meaningful, you can simulate the perfect forecaster: run, say, 1,000 iterations of the same set of predictions where, for each prediction in the X% bucket, the outcome comes up True X% of the time and False (100-X)% of the time. Calculate the average error in each iteration, form an empirical distribution out of these average errors, and see where your actual average error falls; a sketch of this follows below.
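Again, a rough Python sketch of that simulation. Here I'm assuming the "average error" is just the mean of the signed per-prediction errors (outcome minus stated probability), which works out to the count-weighted average of the per-bucket errors; the function name and defaults are mine:

```python
import numpy as np

def perfect_forecaster_percentile(probs, actual_avg_error, n_iter=1000, seed=0):
    """Fraction of perfectly calibrated simulations whose average error is
    <= the actual average error, i.e. a percentile for your forecaster.

    Compute the actual error first, e.g.:
        actual_avg_error = np.mean(outcomes - probs)
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    sim_errors = np.empty(n_iter)
    for i in range(n_iter):
        # For the perfect forecaster, a prediction in the X% bucket
        # comes true X% of the time.
        sim_outcomes = rng.random(probs.size) < probs
        sim_errors[i] = np.mean(sim_outcomes - probs)  # signed average error
    # Where the actual error falls in the empirical distribution.
    return np.mean(sim_errors <= actual_avg_error)
```

A value near 0.5 means your average error is about what a perfectly calibrated forecaster would produce by chance; values near 0 or 1 suggest systematic under- or over-confidence.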
If you want not a single aggregate but estimates for some specific buckets (say, 50%, 70%, and 90%), read the values off your kernel-smoothed line, as in the snippet below.
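Continuing the earlier sketch (so probs, outcomes, and smoothed_calibration_error are assumed from above), that read-off could look like:

```python
import numpy as np

# Evaluate the smoothed error line at a few buckets of interest.
grid, smoothed = smoothed_calibration_error(probs, outcomes)
for p in (0.50, 0.70, 0.90):
    print(f"{p:.0%} bucket: calibration error ~ {np.interp(p, grid, smoothed):+.3f}")
```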