What does your accuracy tell you about your confidence interval?
Yvain's 2011 Less Wrong Census/Survey runs throughout November 2011. If you haven't taken it, please do so before reading on, or at least write down your answers to the calibration questions so they won't get skewed by the following discussion.
The survey includes a pair of calibration questions: a guess at the year of a certain historical event, and the probability that your guess is within 15 years of the true date.
In the comments, several people including myself wondered what our level of accuracy in the first question said about the calibration of our answer to the second question. If your guess for the first question was really close to correct, but your probability for the second question was low, were you underconfident? If you were far off, but your probability was high, were you overconfident?
We could test our calibration by simply answering a lot of these pairs of questions, then applying a proper scoring rule. But that seems like throwing out information. Surely we could calibrate faster if we’re allowed to use our accuracy as evidence?
I suspect there are people on here with the tools to work this out trivially. Here’s my try at it:
Suppose you state a p-confidence interval of ±a around your guess x of the true value X. Then you find that, actually, |X - x| = b. What does this say about your confidence interval?
As a first approximation, we can represent your confidence interval as a claim that the answer is uniformly randomly placed within an interval of ±(a/p) around your guess, and that you have guessed uniformly within that same interval. If this is the case, your guess should on average be off by (1/3)(a/p), following a triangular distribution. It should be in the range (1/3 ± 3/16)(a/p) half the time. A third of the time it should be less than (1/3)(3 - sqrt(6)), or about .18, times (a/p); another third of the time it should be greater than 1 - 1/sqrt(3), or about .42, times (a/p).
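As a sanity check, here is a minimal Monte Carlo sketch of the triangular-distribution claim, under the assumption above that both the answer and the guess are uniform over the same interval, with the error measured as a fraction of that interval's full width:

```python
import random

# Model from the paragraph above: both the true answer and the guess are
# uniform over the same interval.  Measure |answer - guess| as a fraction
# of that interval's full width.
N = 200_000
errors = sorted(abs(random.random() - random.random()) for _ in range(N))

mean_error = sum(errors) / N
lower_tertile = errors[N // 3]       # should approach (1/3)(3 - sqrt(6)) ~ .18
upper_tertile = errors[2 * N // 3]   # should approach 1 - 1/sqrt(3)      ~ .42

print("mean error: %.3f (expect ~1/3)" % mean_error)
print("tertile cut-points: %.3f, %.3f" % (lower_tertile, upper_tertile))
```

The empirical tertile cut-points land near .18 and .42, matching the closed-form values quoted above.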
So, here’s a rule of thumb for evaluating your confidence intervals based on how close you’re getting to the actual answer. Again, a is the radius of your interval, and p is the probability you assigned that the answer is in that interval.
1. Determine how far you were off, divide by a, and multiply by p.
2. If your result is less than .18 more than a third of the time, you’re being underconfident. If your result is greater than .42 more than a third of the time, you’re being overconfident.
In my case, I was 2 years off, and estimated a probability of .85 that I was within 15 years. So my result is 2/15 * .85 = .11333… That's less than the lower threshold. If I find this happening more than 1/3 of the time, I'm being underconfident.
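For concreteness, here is a minimal sketch of the rule in code, using the thresholds derived above (the function name is just for illustration):

```python
def confidence_check(error, a, p):
    """Rule of thumb: normalize how far off you were by the interval
    radius a and the stated probability p, then compare against the
    tertile thresholds .18 and .42."""
    result = abs(error) / a * p
    if result < 0.18:
        return result, "counts toward underconfidence"
    if result > 0.42:
        return result, "counts toward overconfidence"
    return result, "middle third"

# The example from this post: 2 years off, a 15-year radius, p = .85.
print(confidence_check(2, 15, 0.85))  # (0.1133..., 'counts toward underconfidence')
```

If more than a third of your answers land in one of the outer categories, that is the signal described in step 2.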
Can anybody suggest a better system?
Since there was no option for “not a clue”, I left these fields blank. I do not believe they add anything.
That’s punting.
The question was about an event related to a well-known figure in world history. So even if you literally have no idea, your best guess for that reference class is "sometime between the year 0 and 2000". The middle of this range is 1000. The probability that this comes within 15 years of the correct answer by sheer luck is about 1 in 70 (a 30-year window out of a 2000-year range).
However, it just isn't true that you didn't have a clue. Given the name of the person and even a very rough idea who they were, I'm pretty sure any LW reader could do considerably better than that; at the least narrow it down to two or three centuries, for something like a 1 in 10 chance.
Yup; the only Principia Mathematica I'd ever heard of was the one by Russell and Whitehead. I leveraged this shocking lack of knowledge into a guess that Newton lived after Galileo and before Gauss, and put down 10% on 1750, which by the rule of thumb HonoreDB came up with puts me right on the edge of overconfidence.
Yeah. I got all panicky when I encountered the question ("Argh! Newton! How can I have nothing memorized about someone as important as Newton!"). By somewhat similar reasoning I got an answer and assigned about 1/3 probability to my being within 15 years. I ended up within 10 years of the correct answer. By HonoreDB's rule that would be neither over- nor underconfident. But on discovering the answer I couldn't help thinking, "rats, I should have been more confident". I get a sense that thinking about scoring rules too much as a game can also lead to some biases.
I said that “I do not believe they add anything”, so no point engaging in the games where someone presumes that they do.
That sounds like a bad faith answer to me.
For one thing, you have no problem with the survey’s designer “presuming” that the other questions in the survey are valuable; why do you reverse that judgement only in the case of the question that troubles you?
For another, your rejection was based on the lack of a “not a clue” option, and you haven’t refuted my point that this option would be punting.
It’s possible that the reason I’m bothered by your dismissal is that I ended up spending more time on this one question than the rest of the survey altogether.
You would come across as more sincere if you just said “I couldn’t be bothered to answer that question”.
"I Don't Know" is a relevant Sequence post:
(Here’s the “Rerunning the Sequences” page.)
The next line should be:
Apple trees have zero apples most of the time.
Non-apple trees have no apples all of the time.
The quoted estimate sounds poorly calibrated and likely wrong.
An ordinary connotation of “How many X are there?” is that there aren’t any well-known reasons for there to be no X at all. If I ask you how many apples there are and you later find out that it’s actually a maple tree outside, then you would likely consider me not to be communicating in good faith — to be asking the question to make a point rather than to actually obtain information about apples.
I get your point. To add further weight to it, the snippet above is from an informal, likely fast-paced IM conversation, which makes considered analysis seem pedantic and socially uncalibrated.
That said, I found the 10 to 1000 estimate surprising.
The person asking the question hasn’t seen the tree. He is merely picking one out of the woods.
Say a tree has 100 apples. Come late autumn, the apples will fall: to ten, to one, and then to none.
The fact that ordinary apple trees ordinarily have no apples at least raises the possibility that the true count is zero.
0 to 1000 apples would certainly be correct, which is what we want.
How many dollars are in my wallet? (I haven’t looked.)
I don't know what it is, exactly, about that exchange, but I have a rather souring reaction to it. It seems somewhat... "contrary" comes closest but definitely isn't the right word. When someone insists on using numerical responses "in order to prevent confusion" and then admits that doing so is as likely to induce confusion, the whole thing seems rather non-fruitful.
Is it really so hard to say “There is insufficient data for a meaningful reply”?
How does guessing the answers add to the rest of the survey?
It’s a calibration test, which is (more or less) a test of how well you judge the accuracy of your guesses. How accurate your guess itself was is not important here except when judged in the light of the confidence level you assigned to the 15-years-either-way confidence interval.
Except for people who happened to have the year of [redacted event] memorized, everyone’s answer to the first of these two questions was a guess. Some people’s guesses were more educated than others, but the important part is not the accuracy of the guess, but how well the accuracy of the guess tracks the confidence level.
Huh? No, that’s getting more information, not throwing out information. You can’t calibrate if you don’t know your accuracy, so it’s not clear to me what you mean by “allowed.”
Essentially, the only information you can get from a single datapoint is whether or not you understand the acceptability of 0 or 100 as probabilities (hint: they aren't acceptable). If you want a useful calibration, you need lots of datapoints.
Not what I meant. For “simply,” read “merely.” I mean that when you get lots of datapoints, and calibrate from them, you can use your accuracy on the first parts in a richer way than just the binary check of whether they fell within your confidence intervals or not.
The way this is typically done is by eliciting more than two numbers to build the distribution out of. For example, I might ask you for a date so early that you think there’s only a 5% chance it happened before that date, then a date so late that you think there’s only a 5% chance it happened after that date, then try to figure out the tertiles or quartiles.
Notice that I worked from the outside in: when people try to come up with a central estimate and then imagine variance around that central estimate, as in Yvain's elicitation, they do significantly worse than if guided by a well-designed process. (You can see an example of an expert elicitation process here.)
Once you've done this, you've got more detailed bins, and you can evaluate the bin populations. ("Hm, I only have 10% in my lower tertile; I ought to adjust my estimates downwards.")
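A minimal sketch of that bin-population check, assuming you have recorded each question's elicited cut-points (say the 5th, 33rd, 67th, and 95th percentiles, worked out outside-in as above) and the true answers for a batch of questions; all the dates below are made up for illustration:

```python
from bisect import bisect_right

def bin_counts(elicited_cutpoints, true_values):
    """Count which bin each true value fell into, given each question's
    elicited cut-points (here: 5th, 33rd, 67th, 95th percentiles).
    Well-calibrated answers should populate the five bins at roughly
    5%, 28%, 34%, 28%, 5%."""
    counts = [0] * (len(elicited_cutpoints[0]) + 1)
    for cuts, truth in zip(elicited_cutpoints, true_values):
        counts[bisect_right(cuts, truth)] += 1
    return counts

# Made-up example: three date questions, cut-points elicited outside-in.
cutpoints = [
    [1650, 1675, 1700, 1730],
    [1800, 1830, 1860, 1900],
    [1900, 1920, 1945, 1970],
]
answers = [1712, 1859, 1969]
print(bin_counts(cutpoints, answers))  # [0, 0, 1, 2, 0]
```

With many questions, a lopsided count (for instance only 10% of answers landing in the lower tertile) is the adjustment signal described above.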
People often fit distributions based on elicited values, but they’ll talk a lot with the experts about shape, to make sure it fits the expert’s beliefs. (They tend to use things a lot more sophisticated than uniforms, generally chosen so that Bayesian updates are convenient.) I don’t think I’ve seen much of that in the domain of calibration, though.
[edit] You could use that fitting procedure to produce a more precise estimate of your p, and then use that in your proper scoring rule to determine your score in negentropy, and so this could be useful for calibration. While I think this could increase precision in your calibration measurement, I don’t know if it would actually improve the accuracy of your calibration measurement. When doing statistics, it’s hard to make up for lack of data through use of clever techniques.
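For the binary form of the survey question (within ±15 years or not), the logarithmic score is one standard proper scoring rule; here is a minimal sketch with made-up stated probabilities:

```python
import math

def log_score(p_within, was_within):
    """Logarithmic proper scoring rule for a binary prediction:
    p_within is the stated probability that the guess falls inside the
    interval; was_within says whether it actually did.  A stated 0 or 1
    gets an unboundedly bad score when it misses, which is one reason
    those aren't acceptable probabilities."""
    p = p_within if was_within else 1.0 - p_within
    return math.log(p)

# Made-up batch of (stated probability, whether the guess was within range).
predictions = [(0.85, True), (0.33, True), (0.65, False)]
total = sum(log_score(p, hit) for p, hit in predictions)
print("total log score: %.3f (closer to 0 is better)" % total)
```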
Thanks for that link, and for pointing out the technique, which seems like a good hack (in the nice sense of the word).
That calibration question actually raised a broader question for me. I haven't seen it discussed here, although I suspect it has been and I just haven't found the discussion yet; if so, please point me in the right direction.
The point, in short, is that my gut reaction to honing my subjective probability estimations is “I’m not interested in trying to be a computer.” As far as I know, all the value I’ve gotten out of rationalist practice has been much more qualitative than probability calculations, such as becoming aware of and learning to notice the subjective feeling of protecting a treasured belief from facts. (Insert the standard caveats, please.) I’m not sure what I would gain by taking the time to train myself to be better at generating numerical estimates of my likelihood of being correct about my guesses.
In this particular case, I guessed 1650 based on my knowledge that Newton was doing his thing in the mid to late seventeenth century. I put a 65% confidence on the 15-year interval, figuring that it might be a decent bit later than 1650 but not terribly much. It turns out that I was wrong about that interval, so I know now that based on that situation and the amount of confidence I felt, I should probably try a smaller confidence estimate. How much smaller I really don’t know; I suspect that’s what this “HonoreDB’s rule” others are talking about addresses.
But the thing is, I’m not sure what I would gain if it turned out that 65% was really the right guess to make. If my confidence estimates turn out to be really, really well-honed, so what?
On the flip side, if it turned out that 60% was the right guess to make and I spent some time calibrating my own probability estimates, what would I gain by doing that? That might be valuable for building an AI, but I'm a social scientist, not an AI researcher. I find it valuable to notice the planning fallacy in action, but I don't find it terribly helpful to be skilled at, say, knowing that my guess of being 85% likely to arrive within a 5-minute margin of when I say is very likely to be accurate. What's helpful is knowing that a given trip tends to take about 25 minutes and that somewhat regularly but seemingly randomly traffic will make it take 40 minutes, so if I need to be on time I leave 40 minutes before I need to be there, with things I can do for 15 minutes once I'm there in case traffic is smooth. No explicit probabilities needed.
But I see this business about learning to make better conscious probability estimates coming up fairly often on Less Wrong. It was even a theme in the Boot Camps if I understand correctly. So either quite a number of very smart people who spend their time thinking about rationality all managed to converge on valuing something that doesn’t actually matter that much to rationality, or I’m missing something.
Would someone do me the courtesy of helping me see what I’m apparently blind to here?
ETA: I just realized that the rule everyone is talking about must be what HonoreDB outlined in the lead post of this discussion. Sorry for not noticing that. But my core question remains!
There’s no good way to do this using only the error of a point estimate, it depends on the shape of the distribution you assign. For example, imagine someone who can’t remember whether some event happened in 1692 or 1892, so randomly guesses 1692 with 50% confidence. If the correct answer was 1892, this person was off by 200 years, but still very well calibrated.
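A minimal simulation of that scenario (hypothetical numbers: the forecaster's two candidate years are equally likely given what they know, and they always state a 50% chance of being within 15 years):

```python
import random

random.seed(0)
N = 100_000
hits = 0
total_error = 0

# The forecaster can't remember whether the event was in 1692 or 1892
# (equally likely given what they know), always guesses 1692, and states
# a 50% probability of being within 15 years.
for _ in range(N):
    truth = random.choice([1692, 1892])
    guess = 1692
    error = abs(truth - guess)
    total_error += error
    hits += error <= 15

print("stated probability: 0.50")
print("empirical hit rate: %.3f" % (hits / N))                 # ~0.5: well calibrated
print("mean absolute error: %.0f years" % (total_error / N))   # ~100 years
```

The point error is enormous on half the questions, yet the stated 50% matches the hit rate, so the forecaster is well calibrated.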