janos comments on Looking for information on scoring calibration

janos 7 Apr 2011 22:59 UTC
2 points
tl;dr: miscalibration means mentally interpreting the log-likelihood of data as being more or less than its actual log-likelihood; to infer it you need to assume/infer the Bayesian calculation that’s being made/approximated. Easiest with distributions over finite sets (i.e. T/F or multiple-choice questions). Also, likelihood should be called evidence.

I wonder why I didn’t respond to this when it was fresh. Anyway, I was running into this same difficulty last summer when attempting to write software to give friendly outputs (like “calibration”) to a bunch of people playing the Aumann game with trivia questions.

My understanding was that evidence needs to be measured on the log scale (as the difference between prior and posterior), and miscalibration is when your mental conversion from gut feeling of evidence to the actual evidence has a multiplicative error in it. (We can pronounce this as: “the true evidence is some multiplicative factor (called the calibration parameter) times the felt evidence”.) This still seems like a reasonable model, though of course different kinds of evidence are likely to have different error magnitudes, and different questions are likely to get different kinds of evidence, so if you have lots of data you can probably do better by building a model that will estimate your calibration for particular questions.
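For a T/F question the constant-calibration model can be sketched in a few lines. This is only an illustration, assuming evidence is measured in log-odds and that the prior is known; the names `recalibrate`, `prior`, `stated` and `c` are made up for the sketch.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))

def recalibrate(prior, stated, c):
    """Probability implied by the model: true evidence = c * felt evidence.

    prior  -- probability before taking the felt evidence into account (assumed known)
    stated -- probability the person reports after taking it into account
    c      -- hypothetical calibration parameter (c < 1 means overconfident)
    """
    felt_evidence = logit(stated) - logit(prior)
    return sigmoid(logit(prior) + c * felt_evidence)
```

With a 50% prior, a stated 90% and c = 0.5, the felt log-odds of ln 9 shrink to ln 3, so `recalibrate(0.5, 0.9, 0.5)` comes out at about 0.75.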

But sticking to the constant-calibration model, it’s still not possible to estimate your calibration from your given confidence intervals, because for that we need an idea of what your internal prior (your “prior” prior, before you’ve taken into account the felt evidence) is. That is hard to get any decent sense of, though you can work off of iffy assumptions, such as assuming that your prior for percentage answers from a trivia game is fitted to the set of all the percentage answers from this trivia game, and has some simple form (e.g. Beta). The Aumann game gave an advantage in this respect: rather than comparing your probability distribution before and after thinking about the question, it makes it possible to compare the distribution before and after hearing other people’s arguments and evidence. If you always speak in terms of standard probability distributions, it’s not too hard to infer your calibration there.

Further “funny” issues can arise when you get down to work; for instance, if your prior was a Student-t with df n1 and your posterior was a Student-t with df n2 < n1 […], or if your prior was a Normal with variance s1^2 and your posterior was a Normal with variance s2^2 > s1^2, then your calibration cannot be more than 1/(1-s1^2/s2^2) without having your posterior explode. It’s tempting to say the lesson is that things break if you’re becoming asymptotically less certain, which makes some intuitive sense: if your distributions are actually mixtures of finitely many different hypotheses that you’re Bayesianly updating the weights of, then you will never become asymptotically less certain; in particular the Student-t scenario I described can’t happen. However, this is not a satisfactory conclusion, because the Normal scenario (where you increase your variance by upweighting a hypothesis that gives higher variance) can easily happen.
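One way to read the bound 1/(1-s1^2/s2^2) is sketched below. It rests on assumptions not spelled out above: the prior is Normal with variance s1^2, the stated posterior is Normal with a larger variance s2^2, and calibration c acts as an exponent on the ratio posterior/prior. Under those assumptions the precisions (inverse variances) combine linearly, and the result stops being a proper distribution once the precision reaches zero. The function names are hypothetical.

```python
def calibrated_precision(s1_sq, s2_sq, c):
    """Precision of prior * (stated_posterior / prior)**c for Normal densities.

    s1_sq -- prior variance
    s2_sq -- stated posterior variance
    c     -- hypothetical calibration parameter
    A non-positive return value means the density cannot be normalized.
    """
    return 1.0 / s1_sq + c * (1.0 / s2_sq - 1.0 / s1_sq)

def max_calibration(s1_sq, s2_sq):
    """Supremum of c that keeps the precision positive.

    Only binds when the stated posterior is wider than the prior;
    otherwise every positive c is fine.
    """
    if s2_sq <= s1_sq:
        return float("inf")
    return 1.0 / (1.0 - s1_sq / s2_sq)
```

For s1^2 = 1 and s2^2 = 4 the bound is 4/3: at c = 1 the precision is 1/4 (the stated posterior itself), at c = 4/3 it hits zero, and beyond that it goes negative.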

A different resolution to the above is that the model of evidence=calibration*felt evidence is wrong, and needs an error term or two; that can give a workable result, or at least not catch fire and die.

Another thought: if your mental process is like the mixture case described above, where you’re working with a mixture of several fixed (e.g. normal) hypotheses, and the calibration concept is applied to how you update the weights of the hypotheses, then the change in the mixture distribution (i.e. the marginal) will not follow anything like the calibration model.

So the concept is pretty tricky unless you carefully choose problems where you can reasonably model the mental inference, and in particular try to avoid “mixture-of-hypotheses”-type scenarios (unless you know in advance precisely what the hypotheses imply, which is unusual unless you construct the questions that way; but then I can’t think of why you’d ask about the mixture instead of about the probabilities of the hypotheses themselves).

You might be okay when looking at typical multiple-choice questions; certainly you won’t run into the issues with broken posteriors and invalid calibrations. Another advantage is that “the” prior (i.e. uniform) is uncontroversial, though whether the prior to use for computing calibration should be “the” prior is not obvious; but if you don’t have before-and-after results from people then I guess it’s the best you can do.
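For the multiple-choice case, here is a rough sketch of how one might fit a single calibration parameter. It assumes a uniform prior over the options (so the felt evidence for option i is log p_i plus a constant), the constant-calibration model, and a crude grid search for the maximum-likelihood value; `calibrated`, `log_likelihood` and `fit_calibration` are names invented for the example.

```python
import math

def calibrated(probs, c):
    """Stated probabilities raised to the power c and renormalized.

    With a uniform prior this is the posterior when the true evidence
    is c times the felt evidence. All probs must be strictly positive.
    """
    logs = [c * math.log(p) for p in probs]
    m = max(logs)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logs]
    z = sum(weights)
    return [w / z for w in weights]

def log_likelihood(answers, c):
    """Log-probability of the correct options under calibration c.

    answers -- list of (stated_probs, index_of_correct_option) pairs
    """
    return sum(math.log(calibrated(probs, c)[correct])
               for probs, correct in answers)

def fit_calibration(answers, grid=None):
    """Grid-search estimate of c (default grid: 0.01 to 3.00)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 301)]
    return max(grid, key=lambda c: log_likelihood(answers, c))
```

For example, someone who says 90% on ten T/F questions and gets seven right would be fitted with c a bit under 0.4, i.e. their felt evidence is worth roughly 40% of what they think it is. With few questions the estimate is of course very noisy.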

I just noticed that what’s usually called the “likelihood” I was calling “evidence” here. This has probably been suggested by someone before, but: I’ve never liked the term “likelihood”, and this is the best replacement for it that I know of.