It’s important to note that accuracy and calibration are two different things. I’m mentioning this because the OP asks for calibration metrics, but several answers so far give accuracy metrics. Any proper scoring rule is a measure of accuracy as opposed to calibration.
It is possible to be very well-calibrated but very inaccurate; for example, you might know that it is going to be Monday 1/7th of the time, so you give a probability of 1/7th. Everyone else just knows what day it is. On a calibration graph, you would be perfectly lined up; when you say 1/7th, the thing happens 1/7th of the time.
It is also possible to have high accuracy and poor calibration. Perhaps you can guess coin flips when no one else can, but you are wary of your precognitive powers, which makes you underconfident. So, you always place 60% probability on the event that actually happens (heads or tails). Your calibration graph is far out of line (what you call 60% happens 100% of the time), but your accuracy is higher than anyone else's.
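To make the two examples concrete, here is a minimal sketch that scores both forecasters. It assumes the Brier score as the proper scoring rule (my choice for illustration; any proper scoring rule would do), and the helper names `brier_score` and `calibration_table` are made up for this example.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def calibration_table(forecasts, outcomes):
    """Map each distinct stated probability to how often the event actually happened."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[p].append(o)
    return {p: sum(obs) / len(obs) for p, obs in buckets.items()}


# Monday forecaster: always says 1/7 for "today is Monday" (100 weeks of data).
monday_outcomes = [1, 0, 0, 0, 0, 0, 0] * 100
monday_forecasts = [1 / 7] * len(monday_outcomes)
# Someone who simply knows what day it is.
knows_the_day = [float(o) for o in monday_outcomes]

# Underconfident precog: puts 60% on whichever side actually comes up.
# Written as a probability of heads: 0.6 before heads, 0.4 before tails.
coin_outcomes = [1, 0] * 350
precog_forecasts = [0.6 if o == 1 else 0.4 for o in coin_outcomes]
# Everyone else can only say 50%.
fifty_fifty = [0.5] * len(coin_outcomes)

print(calibration_table(monday_forecasts, monday_outcomes))      # stated 1/7, observed 1/7
print(round(brier_score(monday_forecasts, monday_outcomes), 4))  # 0.1224
print(brier_score(knows_the_day, monday_outcomes))               # 0.0

print(calibration_table(precog_forecasts, coin_outcomes))        # {0.6: 1.0, 0.4: 0.0}
print(round(brier_score(precog_forecasts, coin_outcomes), 4))    # 0.16
print(brier_score(fifty_fifty, coin_outcomes))                   # 0.25
```

The Monday forecaster sits exactly on the calibration diagonal yet scores worse than someone who just knows the day, while the precog is badly off the diagonal yet beats the 50% forecaster.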
In terms of improving rationality, the interesting thing about calibration is that (as in the precog example) if you know you’re poorly calibrated, you can boost your accuracy simply by improving your calibration. In some sense it is a free improvement: you don’t need to know anything more about the domain; you get more accurate just by knowing more about yourself (by seeing a calibration chart and adjusting).
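As a rough sketch of that "free improvement", the code below reads a correction off a forecaster's own track record and applies it to new forecasts. The function names (`fit_recalibration`, `recalibrate`) and the toy data are assumptions for illustration, and the Brier score is again just my choice of proper scoring rule.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def fit_recalibration(forecasts, outcomes):
    """Learn, from a past track record, what each stated probability really meant."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[p].append(o)
    return {p: sum(obs) / len(obs) for p, obs in buckets.items()}


def recalibrate(forecasts, mapping):
    """Replace each stated probability with its observed frequency, if we have one."""
    return [mapping.get(p, p) for p in forecasts]


# Past record of the underconfident precog (probability of heads).
past_outcomes = [1, 0] * 50
past_forecasts = [0.6 if o == 1 else 0.4 for o in past_outcomes]
mapping = fit_recalibration(past_forecasts, past_outcomes)  # {0.6: 1.0, 0.4: 0.0}

# New flips: same gut feelings, now adjusted using only self-knowledge.
new_outcomes = [0, 1, 1, 0]
new_forecasts = [0.6 if o == 1 else 0.4 for o in new_outcomes]
adjusted = recalibrate(new_forecasts, mapping)

print(round(brier_score(new_forecasts, new_outcomes), 4))  # 0.16
print(brier_score(adjusted, new_outcomes))                 # 0.0
```

Nothing about coins was learned here; the score improves only because the forecaster found out what their own "60%" means. With a real track record you would probably want to bin nearby probabilities and avoid jumping all the way to 0 or 1 on limited data.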
However, if you just try to be more calibrated without any concern for accuracy, you could end up like the person who says 1/7th. So, just aiming to do well on a calibration score is not a good idea. This could be part of the reason why calibration charts are presented instead of calibration scores. (Another reason is that calibration charts help you see how to adjust to improve calibration.)
That being said, a decomposition of a proper scoring rule into components including a measure of calibration, like Dark Denego gives, seems like the way to go.
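For illustration, here is a sketch of one such decomposition. I am assuming the standard Murphy decomposition of the Brier score (Brier = reliability - resolution + uncertainty), which may not be exactly the one given in the other answer. The function name `murphy_decomposition` is made up, and forecasts are grouped by their exact stated value, which is what makes the identity hold exactly.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    reliability: calibration error (0 means perfectly calibrated)
    resolution:  how far the outcome frequencies in each group are from the base rate
    uncertainty: base_rate * (1 - base_rate), a property of the events alone
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        freq = sum(obs) / len(obs)
        reliability += len(obs) * (p - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)


# Monday forecaster: perfectly calibrated, but with no resolution at all.
monday_outcomes = [1, 0, 0, 0, 0, 0, 0] * 100
monday_forecasts = [1 / 7] * len(monday_outcomes)

# Underconfident precog: poor calibration, but maximal resolution.
coin_outcomes = [1, 0] * 350
precog_forecasts = [0.6 if o == 1 else 0.4 for o in coin_outcomes]

for name, f, o in [("monday", monday_forecasts, monday_outcomes),
                   ("precog", precog_forecasts, coin_outcomes)]:
    rel, res, unc = murphy_decomposition(f, o)
    print(name, round(rel, 4), round(res, 4), round(unc, 4), round(rel - res + unc, 4))
```

The reliability term is the calibration measure; reporting it next to resolution keeps the 1/7th strategy from looking good, since that forecaster gets a perfect reliability of 0 but also a resolution of 0.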