1. I think the “calibration curves” one sees e.g. in https://slatestarcodex.com/2020/04/08/2019-predictions-calibration-results/ are helpful for evaluating/improving (and designed to evaluate/improve) a strict subset of prediction errors: systematic over- or underconfidence. Clearly, there is more to being an impressive predictor than just being well-calibrated, but becoming better-calibrated is a relatively easy thing to do with those curves. One can also imagine someone who naturally generates 50 % predictions that are in fact over-/underconfident.
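For concreteness, here is a minimal sketch of how such a curve is computed; the binning, function name, and variable names are my own illustrative choices, not taken from the linked post:

```python
from collections import defaultdict

def calibration_table(predictions, levels=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)):
    """predictions: list of (p, came_true) pairs, where p is the stated
    confidence (>= 0.5 after flipping low-confidence statements) and
    came_true says whether the predicted statement turned out true."""
    buckets = defaultdict(list)
    for p, came_true in predictions:
        # assign each prediction to the nearest stated confidence level
        level = min(levels, key=lambda lvl: abs(lvl - p))
        buckets[level].append(came_true)
    # stated confidence vs. (count, empirical frequency); a well-calibrated
    # predictor has stated confidence roughly equal to frequency in every bucket
    return {lvl: (len(hits), sum(hits) / len(hits))
            for lvl, hits in sorted(buckets.items())}
```

Plotting stated confidence against empirical frequency per bucket gives the calibration curve; systematic deviation above the diagonal indicates underconfidence, below it overconfidence.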
2.0. Having access to “baseline probabilities/common-wisdom estimates” is mathematically equivalent to having a “baseline predictor/woman-on-the-street” whose probability estimates match those baseline probabilities. I think your discussion can be clarified and extended by framing it not as “judging the impressiveness of one person by comparing their estimates against a baseline”, but as “given track records of two or more persons/algorithms, compare their predictions’ accuracy and impressiveness, where one of them might be the ‘baseline predictor’”.
2.1. If you do want to compare two persons’ track records/generalized impressiveness on the same set of predictions (e.g. to decide whom to trust more), the natural choice is the log loss used to optimize ML algorithms: one sums -ln(p) over all predictions, where p is the probability the predictor assigned to the outcome that actually occurred; lower sums are better. 50 % predictions are of course a valid data point for the log loss if both persons made a prediction. In contrast, if reference predictions aren’t available, it doesn’t seem feasible to me to judge predictions of 50 % or of any other probability estimate.
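A minimal sketch of such a comparison; the names and numbers are purely illustrative:

```python
import math

def log_loss(probs, outcomes):
    """Sum of -ln(p_i), where p_i is the probability the predictor assigned
    to the outcome that actually occurred (p if it happened, 1 - p if not)."""
    return sum(-math.log(p if happened else 1.0 - p)
               for p, happened in zip(probs, outcomes))

# Two predictors scored on the same three events; the lower total wins.
outcomes = [True, False, True]
alice = [0.9, 0.2, 0.5]   # a 50 % estimate enters the score like any other
bob   = [0.7, 0.5, 0.6]
print(log_loss(alice, outcomes), log_loss(bob, outcomes))
```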
2.2. One can prove: for events with a truly random component, the expected log loss is minimized by reporting the true probabilities. If there is a very competent predictor who is nevertheless systematically overconfident as in 1., one can strictly improve upon their log loss by appropriately rescaling their probability estimates.
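A quick simulation of both claims; the specific overconfidence distortion (odds scaling), its inverse, and all names below are my own assumptions for illustration:

```python
import math, random

random.seed(0)

def log_loss(probs, outcomes):
    return sum(-math.log(p if o else 1.0 - p) for p, o in zip(probs, outcomes))

def rescale(p, k):
    """Scale the odds by the exponent k: k > 1 pushes estimates away from
    50 % (overconfidence), k < 1 pulls them back toward 50 %."""
    odds = (p / (1.0 - p)) ** k
    return odds / (1.0 + odds)

# Events with truly random outcomes and known true probabilities.
n = 100_000
true_p = [random.uniform(0.05, 0.95) for _ in range(n)]
outcomes = [random.random() < p for p in true_p]

overconfident = [rescale(p, 2.0) for p in true_p]        # competent but overconfident
recalibrated = [rescale(p, 0.5) for p in overconfident]  # inverse rescaling recovers true_p

print(log_loss(true_p, outcomes))        # minimal in expectation: the true probabilities
print(log_loss(overconfident, outcomes)) # worse: systematic overconfidence costs log loss
print(log_loss(recalibrated, outcomes))  # back to (approximately) the optimum
```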