If this is a valid method of scoring predictions, some mathematician or statistician should have already published it. We should try to find this publication rather than inventing methods from scratch.
Wow, such learned helplessness! Do not try to invent anything new since if it’s worthwhile it would have been invented before you… X-0
It’s unlikely that an amateur could figure out something that trained mathematicians have missed, especially in cases where the idea has potential failure modes that are not easily visible. It’s like trying to roll your own cryptography, something else that amateurs keep doing and failing at.
One thing stories of backyard inventors have in common is that their field was so new that the low-hanging fruit hadn’t all been harvested. You aren’t Thomas Edison, or even Steve Wozniak; mathematics and statistics are well-studied fields.
I continue to be surprised by the allegiance to the high-priesthood view of science on LW—and that in a place founded by a guy who was trained in… nothing. I probably should update :-/
In any case, contemplate this: “People who say it cannot be done should not interrupt those who are doing it.”
and that in a place founded by a guy who was trained in… nothing

I don’t consider Eliezer to be an authority on anything other than maybe what he meant to write in his fanfics. I’m as skeptical of his work on AI risk as I am of this.
Maybe the right question isn’t who is or is not an authority, but rather who writes/creates interesting and useful things?
You were suggesting that it was inconsistent to defer to experts on science, but use a site founded by someone who isn’t an expert. I replied that I don’t consider the site founder someone to defer to.
You were suggesting that it was inconsistent to defer to experts on science, but use a site founded by someone who isn’t an expert.

No, I wasn’t.
I pointed out that I find deferring to experts much less useful than you do, and that I am surprised to find high-priesthood attitudes persisting on LW. My surprise is not an accusation of inconsistency.
Just for the record, I am a trained mathematician.
That only answers half the objection. Being a mathematician means that it is possible for you to solve such problems (if you are trained in the proper area of mathematics, anyway—”mathematics” covers a lot of ground), but the low-hanging fruit should still be gone. I’d expect that if the solution were simple enough to fit in a blogpost, some other trained mathematician would already have solved it.
I think you’re approaching this in the wrong frame of reference.
No one is trying to discover new mathematical truths here. The action of constructing a particular metric (for evaluating calibration) is akin to applied engineering—you need to design something fit for a specific purpose while making a set of trade-offs in the process. You are not going to tell some guy in Taiwan designing a new motherboard that he’s silly and should just go read the academic literature and do what it tells him to do, are you?
I endorse this (while remarking that both Lumifer and I have—independently, so far as I know—suggested in this discussion that a better approach may be simply to turn the observed prediction results into some sort of smoothed/interpolated curve and plot that rather than the usual bar chart).
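For concreteness, here is a minimal sketch of that smoothed-curve idea. The choice of smoother is my own illustration, not anything prescribed above: a Gaussian-kernel estimate of the observed hit rate as a function of stated confidence, assuming the raw data is just an array of stated probabilities and a matching 0/1 outcome array.

```python
import numpy as np

def smoothed_calibration_curve(stated, outcomes, bandwidth=0.05):
    """Kernel-smoothed estimate of the observed hit rate as a function
    of stated confidence, suitable for plotting as a curve instead of
    the usual bar chart.

    stated:   array of stated probabilities in (0, 1)
    outcomes: matching array, 1 if the prediction came true, else 0
    """
    grid = np.linspace(0.05, 0.95, 91)
    # Gaussian kernel weight between each grid point and each prediction.
    w = np.exp(-0.5 * ((grid[:, None] - stated[None, :]) / bandwidth) ** 2)
    # Weighted average of outcomes at each grid point: the smoothed hit rate.
    return grid, (w @ outcomes) / w.sum(axis=1)

# Toy data; plot `curve` against `grid` and compare with the diagonal.
grid, curve = smoothed_calibration_curve(
    np.array([0.6, 0.7, 0.7, 0.8, 0.9, 0.9, 0.95]),
    np.array([1, 0, 1, 1, 1, 0, 1]),
)
```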
Let me make a more concrete suggestion.
Step 0 (needs to be done only once but, so far as I know, never has been): Get a number of experimental subjects with highly varied personalities, intelligence, statistical sophistication, etc. Get them to make a lot of predictions with fine-grained confidence levels. Use this to estimate how much calibration error actually varies with confidence level; this effectively gives you a prior distribution over calibration functions.
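A crude sketch of how the Step 0 data might be tabulated, assuming (purely hypothetically) that each subject’s results arrive as a pair of arrays of stated confidences and 0/1 outcomes:

```python
import numpy as np

def calibration_spread(subjects, levels=(0.6, 0.7, 0.8, 0.9, 0.95)):
    """Across-subject mean and standard deviation of the observed hit
    rate at each confidence level: a rough empirical measure of how much
    calibration error varies with confidence, from which a prior over
    calibration functions could be built.

    subjects: list of (stated, outcomes) array pairs, one per subject.
    """
    spread = {}
    for level in levels:
        rates = []
        for stated, outcomes in subjects:
            mask = stated == level
            if mask.any():
                rates.append(outcomes[mask].mean())
        rates = np.array(rates)
        # Needs at least two subjects per level for a sample std.
        spread[level] = (rates.mean(), rates.std(ddof=1))
    return spread
```

The across-subject spread at each level is what a prior over calibration functions would need to reproduce.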
Step 1 (given actual calibration data): You’re now trying to estimate a single calibration function. Each prediction-result has a corresponding likelihood: if something happened that you gave probability p to, the likelihood is simply f(p), where f is the function you’re trying to estimate; if not, the likelihood is 1-f(p). So you’re trying to maximize the sum of log f(p) over successful predictions, plus the sum of log [1-f(p)] over unsuccessful predictions, plus the log of the Step 0 prior.

So now find the posterior-maximizing calibration function. (You could, e.g., pick some space of functions large enough to contain good approximations to all plausible calibration functions, and optimize over a parameterization of that space.) You can figure out how confident you should be about the calibration function by sampling from the posterior distribution and looking at the resulting distribution of values at any given point.

If what you have is lots of prediction results at each of some number of confidence levels, then a normal approximation applies and you’re basically doing Gaussian process regression or kriging, which quite cheaply gives you not only a smooth curve but error estimates everywhere; in this case you don’t need an explicit representation of the space of (approximately) permissible calibration functions.
[EDITED: I wrote 1-log where I meant log 1- and have now fixed this.]
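Here is a minimal sketch of the Step 1 fitting, under the purely illustrative assumption that the space of calibration functions is the two-parameter family f(p) = sigmoid(a * logit(p) + b), with a stand-in Gaussian prior where the Step 0 experiment would supply a real one:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

def f(params, p):
    """Calibration function: stated probability -> estimated true hit
    rate.  a = 1, b = 0 corresponds to perfect calibration."""
    a, b = params
    return expit(a * logit(p) + b)

def neg_log_posterior(params, stated, outcomes):
    fp = np.clip(f(params, stated), 1e-12, 1 - 1e-12)
    # Sum of log f(p) over successes plus log[1 - f(p)] over failures.
    log_lik = np.sum(outcomes * np.log(fp) + (1 - outcomes) * np.log(1 - fp))
    # Stand-in Gaussian log-prior centred on perfect calibration;
    # the Step 0 experiment would supply a real prior here.
    a, b = params
    log_prior = -0.5 * ((a - 1.0) ** 2 + b ** 2)
    return -(log_lik + log_prior)

# Toy data: stated confidences and whether each prediction came true.
stated = np.array([0.6, 0.7, 0.7, 0.8, 0.9, 0.9, 0.95])
outcomes = np.array([1, 0, 1, 1, 1, 0, 1])

map_fit = minimize(neg_log_posterior, x0=np.array([1.0, 0.0]),
                   args=(stated, outcomes))
print("estimated true hit rate at stated 90%:", f(map_fit.x, 0.9))
```

Sampling (a, b) from the posterior rather than just maximizing it would then give the error bands described above; with many results per confidence level, off-the-shelf Gaussian process regression gives the smooth curve and error estimates directly.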
If it’s ever done, someone has to do it first. This sort of calibration measurement isn’t (so far as I know) a thing traditionally emphasized in statistics, so it wouldn’t be super-surprising if the LW community were where the first solution came from.
(But, as mentioned in other comments on this post, I am not in fact convinced that SiH’s approach is the Right Thing. For what it’s worth, I am also a trained mathematician.)
I had a look at the literature on calibration and it seems worse than the work done in this thread. Most of the research on scoring rules has been done by educators, psychiatrists, and social scientists. Meanwhile there are several trained mathematicians floating around LW.
Also, I’m not sure if anyone else realises that this is an important problem. To care about it you have to care about human biases and Bayesian probability. On LW these are viewed as just two sides of the rationality coin, but in the outside world people don’t really study them at the same time.
I have looked on Google Scholar and could find several proposed measures of calibration, but none are very good; they’re all worse than the things proposed in this thread.