I think the core idea of this post is valid and useful, but that the specific recommendation that predictors phrase their predictions using the “confidence > baseline” rule is a misguided implementation detail. To see why, first consider this variant recommendation we could make:
Variant A: There are no restrictions on the wording of predictions, but along with their numeric prediction, a predictor should in some manner provide the direction in which the prediction differs from the baseline.
Notice that any computation that can be performed in your system can also be performed using Variant A, since I could always, at scoring time, convert all “I predict [claim] with confidence p; baseline is higher” predictions into “I predict [negation of claim] with confidence 1-p; baseline is lower” to conform to your rule. Now even Variant A is, I believe, a minor improvement on the “confidence > baseline” rule, because it gives predictors more freedom to word things in human-convenient ways (e.g. “The stock price will be between 512 and 514” is far nicer to read than either “...not be between 512 and 514” or the very clunky “...less than 512 or more than 514”). But more importantly, Variant A is a useful stepping stone towards two much more significant improvements:
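(As an aside, to make the equivalence above concrete, here is a minimal sketch in Python of the scoring-time conversion I just described. The record format and field names are my own invention for illustration, not anything from the post.)

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    claim: str                # e.g. "The stock price will be between 512 and 514"
    confidence: float         # predictor's probability for the claim, in [0, 1]
    baseline_is_higher: bool  # Variant-A direction: is the baseline above this confidence?

def to_confidence_above_baseline(p: Prediction) -> Prediction:
    """Convert a Variant-A prediction into the post's 'confidence > baseline' form.

    If the baseline is above the stated confidence, switch to the negated claim
    at confidence 1 - p; the negation's baseline is then below it, by symmetry.
    """
    if not p.baseline_is_higher:
        return p  # already conforms to the 'confidence > baseline' rule
    return Prediction(
        claim=f"NOT ({p.claim})",
        confidence=1.0 - p.confidence,
        baseline_is_higher=False,
    )
```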
Variant B: …a predictor should in some manner provide the direction in which the prediction differs from the baseline, or better yet, an actual numeric value for the baseline.
First, notice as before that any computation that can be performed in Variant A can also be performed using Variant B, since we can easily derive the direction from the value. Now, of course, I recognize that providing a value may be harder or much harder than reporting direction alone, so I am certainly not suggesting we mandate that the baseline’s value always be reported. But it’s clearly useful to have, even for your original scheme: though you don’t say so explicitly, it seems to be the key component in your “estimate their boldness” step at the end, since boldness is precisely the magnitude of the difference between the prediction and the baseline. So it’s definitely worth at least a little extra effort to get this value, and occasionally the amount of effort required will be very small or even zero anyway: one of the easiest ways to satisfy your original mandate of providing the direction will sometimes be to go find the numeric value (e.g. in a prediction market).
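As a tiny illustration of both points, deriving the Variant-A direction from a Variant-B numeric baseline, and deriving boldness from it, is about one line each. This is just a sketch with names I made up, working in percentage points:

```python
def direction_and_boldness(confidence_pct: int, baseline_pct: int):
    """Derive the Variant-A direction from a Variant-B numeric baseline, plus the
    prediction's 'boldness', taken here to be |confidence - baseline|."""
    baseline_is_higher = baseline_pct > confidence_pct
    boldness = abs(confidence_pct - baseline_pct)
    return baseline_is_higher, boldness

# Predicting 90% on something a prediction market puts at 60% is a much bolder
# call than predicting 90% on something the market puts at 85%:
print(direction_and_boldness(90, 60))  # -> (False, 30)
print(direction_and_boldness(90, 85))  # -> (False, 5)
```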
Variant C: …a predictor should in some manner provide the direction in which the prediction differs from the baseline, or better yet, some third party should provide this direction.
Now, again, I recognize that I can’t make anybody, including unspecified “third parties”, do more work than they were going to. But to whatever extent we have the option of someone else (such as whoever is doing the scoring?) providing the baseline estimates, it definitely seems preferable: in the same way that in the classical calibration setting, a predictor has an incentive to accidentally (or “accidentally”) phrase predictions a certain way to guarantee perfect calibration at the 50% level, here they have an incentive to mis-estimate the baselines for the same reason. There might also be other advantages to outsourcing the baseline estimates, including:
- this scheme would become capable of scoring predictors who do not provide baselines themselves (e.g. because they simply don’t know they’re supposed to, or because they don’t want to put in that effort)
- we might get scale and specialization advantages from baseline-provision being a separate service (e.g. a prediction market could offer this as an API, since that’s kind of exactly what they do anyway)
So overall, I’d advocate adopting the tweaks introduced in all the variants above: A) we should not require predictors to phrase things with extra negations when there are equally good ways of expressing the same information, B) we should prefer that baseline values be recorded rather than just their directions, and C) we should prefer that third parties decide on the baselines.
I think a lot of the pushback against the post that I’m seeing in the older comments is generated by the fact that this “confidence > baseline” rule is presented in its final form, without first passing through a stage where it looks more symmetrical. By analogy, imagine that in the normal calibration setting, someone just told you that you are required to phrase all your predictions such that the probabilities are >= 50%. “But why,” you’d think, “doesn’t the symmetry of the situation almost guarantee that this has to be wrong? In what sane world can we predict 80% but not 20%?” So instead, the way to present the classical version is that you can predict any value between 0 and 100, and then, precisely because of that symmetry, for the purpose of scoring we lump together the 20s and 80s. One possible implementation of this is to do the lumping at prediction time instead of scoring time, by only letting you specify probabilities >= 50%.

Similarly, in the system from this post, the fundamental thing is that you provide your probability and also the direction in which it differs from the baseline. Then, come scoring time, we lump together “80%, baseline is higher” with “20%, baseline is lower”. Which means one possible implementation is to do the lumping at prediction time, by only allowing “baseline is lower” predictions. (And another implementation, for anyone who finds this lens useful since it’s closer to the classical setting, would be to only allow >= 50% predictions while letting you freely specify the direction of the baseline.)
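For anyone who wants the lumping spelled out, here is a minimal sketch in Python of the scoring-time version, using a function name of my own invention and the “baseline is lower” convention (the other convention works symmetrically):

```python
def lump_for_scoring(confidence_pct: int, baseline_is_higher: bool, came_true: bool):
    """Scoring-time lumping: '80%, baseline is higher' and '20%, baseline is lower'
    land in the same bucket, because they are the same prediction about negated claims.

    Canonical form used here (one of the two possible conventions): every scored
    prediction is of the 'baseline is lower' kind.
    """
    if baseline_is_higher:
        # Flip to the negated claim: confidence becomes 100 - p, the outcome
        # inverts, and the negation's baseline is now the lower one.
        return 100 - confidence_pct, not came_true
    return confidence_pct, came_true

# "X at 80%, baseline higher, X turned out false" scores the same as
# "not-X at 20%, baseline lower, not-X turned out true":
assert lump_for_scoring(80, True, came_true=False) == lump_for_scoring(20, False, came_true=True)
```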