Epsilon is not a probability, it’s a cop-out
A lot of people, after learning on this site that 0 and 1 are “not probabilities” (they are, just ask a mathematician, but they are not useful for Bayesian updating), quickly switch to using “epsilon” or “one minus epsilon” instead, meaning “a reeeeaaaalllly tiny number, but not technically zero”. My interpretation is that this is lazy signaling of “I am a good Bayesian” without actually doing the work. Why? Because if you ask such a person what kind of evidence would change their estimate from “epsilon” to, say, 5%, they would be hard pressed to come up with anything sensible.
Well, that’s the descriptive part, which is much easier than the prescriptive part: what to do instead of just labeling whatever you consider a negligible probability “epsilon” and never updating it. As an aside, the complexity difference between identifying issues and successfully fixing them is generally like P vs. NP.
Probably in some cases one can actually put a number on a tiny probability (e.g. the odds of the sun rising tomorrow, without any additional data, are about 1 - 1/(average number of consecutive sunrises)). In other cases it would be more accurate to say “I have no idea; I would be super surprised if the Pascal’s mugger were telling the truth, but who knows, I have been super surprised before.” In yet other cases what one deems “epsilon” would be something logically inconsistent (like successfully two-boxing in a perfect-predictor Newcomb’s setup). Surely there are other cases as well. But I think it pays to think through what exactly your “epsilon” means in a given situation.
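To make the first case concrete, here is a minimal Python sketch of that estimate; the day count (derived from the ~4.5-billion-year age of the Earth) and the Laplace-rule comparison are my own additions, purely for illustration:

```python
# Rough estimate of P(no sunrise tomorrow), treating every past day as one trial.
consecutive_sunrises = 4.5e9 * 365.25  # assumed count of observed (well, inferred) sunrises

p_rise_frequency = 1 - 1 / consecutive_sunrises                            # the "1 - 1/N" estimate above
p_rise_laplace = (consecutive_sunrises + 1) / (consecutive_sunrises + 2)   # Laplace's rule of succession

print(f"P(no sunrise), frequency estimate: {1 - p_rise_frequency:.1e}")    # ~6e-13
print(f"P(no sunrise), Laplace estimate:   {1 - p_rise_laplace:.1e}")      # ~6e-13
```

Either way the result is around 6 × 10^-13, which is at least a number rather than an “epsilon”.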
In most cases doing the work has no direct practical value. Most of the value is in practicing for the rare occasions where it does matter. Neither fake nor excessive practice is useful, so a norm of always practicing is harmful.
(In this case, saying things without doing the work doesn’t signal anything. Successfully doing the work systematically would signal things that this activity can’t distinguish from being beholden to the harmful norm.)
What’s the probability that there really isn’t a chair under my butt right now and that all my senses are fooling me? Meh, I don’t know; low enough that I expect the time it takes me to think through an actual probability to have negative expected utility for my life. Hence, epsilon = low enough that it’s not worth thinking about. Good Bayesians don’t assign explicit mental probabilities to every proposition encountered in daily life; they balance the expected reward of doing the calculation against the cost of doing so.
A useful habit I would recommend to anyone who defaults to answers like “epsilon” is just to try to estimate the number of independent one-in-ten miracles required to make you uncertain.
This turns an exponential scale into a linear scale. Rather than seeing 99.999% and 99.99999% as close together, you just think of them as “five nines” and “seven nines.” Seven nines needs 7⁄5 times more evidence than five nines.
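A quick Python sketch of that conversion, using log-odds (decibans) from even odds as the measure of evidence; this framing is my own gloss on the claim, not something from the comment above:

```python
import math

def nines(p: float) -> float:
    """Number of nines of confidence, i.e. -log10(1 - p)."""
    return -math.log10(1 - p)

def evidence_decibans(p: float, prior: float = 0.5) -> float:
    """Evidence (in decibans) needed to move the log-odds from `prior` to `p`."""
    log_odds = lambda q: 10 * math.log10(q / (1 - q))
    return log_odds(p) - log_odds(prior)

for p in (0.99999, 0.9999999):
    print(f"{p}: {nines(p):.0f} nines, {evidence_decibans(p):.0f} decibans from even odds")
# Five nines ~ 50 dB, seven nines ~ 70 dB: the 7/5 ratio on a linear scale.
```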
Have you calibrated the numbers that come out of this technique? :whistles_innocently:
I’ve given “number of nines” types answers to questions about a hundred times and haven’t been surprised by anything I thought was too likely/unlikely, so I’m on track, just give me another couple million years to get better data :P
Perhaps more importantly, I’ve actually taken different actions about things like covid or car safety because of thinking slightly quantitatively, and this has involved probabilities that require about 5 nines. And I guess religion, but that’s overdetermined enough that it doesn’t matter if I have 12 nines or 22 nines of certainty about naturalism.
Well, if you consciously put down 0.001% (maybe even wrote down your argument) and it happens, I’d think you might learn something.
Wait, now I think that might have been your point.
I tend to informally use ε for shorthand for “anything with a small enough probability that competing hypotheses like ‘I have gone insane’ dominate”. Which I freely admit is somewhat sloppy.
...that’s kind of the point. In informal use, I tend to use ε for shorthand for cases where there aren’t any sensible ways to update to, say, 5% probability of the thing happening.
Well, why don’t you use zero instead, what is the difference, instrumentally?
Because remembering the structure of things is instrumentally useful.
Saying zero would draw out pedants saying, “but zero isn’t a probability, are you really infinitely certain, etc. etc.” Although personally I would just say zero and ignore the pedants.
Ignoring pedants is correct, but I find value in the acknowledgement and reminder that it’s not ACTUALLY zero, but just small enough for me not to estimate it or think much about it in normal circumstances.
Agree. Having done some calibration exercises, I found that it’s actually pretty hard to give a likelihood of less than 1% or greater than 99%, because that’s within the rate at which I make stupid mistakes, like clicking the wrong button, writing down the wrong thing, misreading the question, etc. It’s tempting to think you can be more certain than that by coming up with silly examples that are clearly impossible, but it’s pretty likely you’ll accidentally come up with a silly example that disproves your point more often than your silly example will be true/false. So at the limits, our subjective probabilities end up dominated by our own practical limitations.
I think there is (sometimes) value in distinguishing two separate probabilities for any given thing. There’s the “naïve” probability that you estimate while ignoring the possibility that you’ve blundered, that you misread something critical, that some underlying assumption of yours is wrong in a way that never crossed your mind, etc. And then there’s the “pessimistic” probability that tries to account for those things.
You want these to be separate because if you’re doing a calculation using the various probabilities, sometimes it’s better to do all the calculations using “naïve” probabilities and then do a final correction at the end for blunders, wrong fundamental assumptions, etc.
… Maybe. It depends on what the calculation is, what sort of out-of-model errors there might be, etc.
Of course this is a rough heuristic. I think what it’s an approximation to is a more careful tracking of lots of conditional probabilities (people around here sometimes talk as if being a Bayesian means assigning probabilities to things, but it would be more precise to say that being a Bayesian means assigning conditional probabilities to things, and a lot of the information is in that extra structure). E.g., suppose there are 100 things, each of which you give naïve probability 10^-9 to, but there’s a 10^-3 chance that some fundamental error in your model makes them actually happen 1⁄10 of the time. Then your “adjusted” probability for each one is about 10^-4, and if you use those to estimate the probability that at least one happens you get about 10^-2; but in this situation—assuming that the “fundamental error in your model” is actually the only substantial cause of out-of-model errors—that probability should actually be more like 10^-4. Of course, if you make a calculation like that then sometimes there’s a fundamental error in your model of where the possible errors come from :-).
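Here is that arithmetic as a short Python sketch. I am assuming, as the example seems to, that conditional on the fundamental model error the hundred events effectively stand or fall together:

```python
n = 100
p_naive = 1e-9        # in-model probability of each event
p_error = 1e-3        # chance of a fundamental error in the model
p_given_error = 0.1   # how often the events happen if the model is wrong

# "Adjusted" per-event probability, then (incorrectly) treated as independent:
p_adjusted = p_naive + p_error * p_given_error       # ~1e-4
p_any_adjusted = 1 - (1 - p_adjusted) ** n           # ~1e-2

# Tracking the shared error explicitly (all-or-nothing conditional on the error):
p_any_tracked = (1 - p_error) * (1 - (1 - p_naive) ** n) + p_error * p_given_error  # ~1e-4

print(f"{p_adjusted:.1e}  {p_any_adjusted:.1e}  {p_any_tracked:.1e}")
```

The gap between the last two numbers is exactly the information that gets thrown away when you collapse the conditional structure into a single adjusted probability per event.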
Hmm, my point though is that you’re mistaken if you think you can separate these two, because you’re the embedded agent making both predictions, so your naive prediction isn’t actually independent of you, the fallible being making predictions.
I’d compare this to the concept of significant digits in science. Like, yeah, you can get highly accurate measurements, but as soon as you stick them in calculations they get eaten up by the error in other measurements. I’m claiming the same thing happens here for humans: beyond a certain point our predictions are dominated by our own errors. Maybe my particular numbers are not representative of all scenarios, but I think the point stands regardless; you just have to dial in the numbers to match reality.
I completely agree that beyond a certain point our predictions are dominated by our own errors, but I’m not sure that that’s always well modelled by just moving all probability estimates that are close to 0 or 1 away by (say) 10^-3.
Example: Pascal’s mugging. (This is an example where just moving everything away from 0 or 1 is probably a bad idea, but to be clear I think it isn’t an example where it would help much to separate out your “in-model” and “out-of-model” errors.) Someone comes to you and says: I am a god/wizard/simulation-operator and can do such-and-such things which produce/destroy incredibly large amounts of utility; pay me $1000 and I’ll do that in your favour rather than against you. You say: haha, no, my estimate of the probability that you can swing 3^^^3 utils is less than 1/3^^^3, so go jump in a lake.
In this situation, if instead you say “gosh, I could be wrong in all sorts of ways, so I’d better revise that probability estimate to say 10^-6” and then go ahead and do your expected-utility calculation, then you pay the mugger every time. Even after they say “behold, now I shall create you a mountain of gold just to prove I can” and nothing happens and they say “ah, well, I’m just testing your faith; would you like to give me another $1000 now?”.
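Just to show that failure mode numerically (the numbers are placeholders; 3^^^3 obviously doesn’t fit in a float):

```python
p_revised = 1e-6        # the "I could be wrong in all sorts of ways" probability
claimed_utils = 1e50    # stand-in for the mugger's astronomically large claim
cost_in_utils = 1000    # treating $1 as 1 util, another simplification

expected_gain = p_revised * claimed_utils - cost_in_utils
print(expected_gain)    # hugely positive, so the naive calculation says: pay, every time
```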
Perhaps the right way to handle this is to say that utility 1/epsilon is no better than probability epsilon, embrace scope insensitivity, and pretend that they were only offering/threatening 10^6 utils and your probability is only 10^-6, or something like that. And maybe that says that when someone makes such a claim you should give them a dollar if that’s what they ask for, and see whether they deliver.
I am not at all confident that that’s really a good approach, but if you do handle it that way then you need to be able to reason that after you give them a dollar and they fail, you shouldn’t give them another dollar because however improbable their claim was before, it’s 100x more improbable now. You can’t do that if you just mechanically turn all very small probabilities into 10^-6 or whatever.
I don’t have a clearly-satisfactory approach to offer instead. But I think this sort of example demonstrates that sometimes you need to do something more sophisticated than pushing all tiny probabilities away from zero.
I guess an instrumental approach I’ve been advocating on this site for a long time is to estimate the noise level, call it “practically zero”, and treat anything at that level as such. For example, in the Pascal’s mugger case, there are so many disjunctive possibilities with higher odds of producing the same story than of the story being true as told that there is no reason to privilege believing what you hear over all the higher-probability options, including dreaming, hallucinating, a con, a psych experiment, candid camera… It’s not about accurately estimating EV and thereby becoming susceptible to blackmail; it’s about rejecting anything at the noise level. Which, I guess, is another way to say “epsilon”: not technically zero, but as good as.
You can at least estimate some lower bounds on self-error, even if you can’t necessarily be certain of upper ones. That’s better than nothing, which is what you get if you don’t separate the probabilities.
For example my performance in test questions where I know the subject backwards and forwards isn’t 100%, because sometimes I misread the question, or have a brain fart while working out answers, and so on. On the other hand, most of these are localized errors. Given extra time, opportunity to check references, consult with other people, and so on, I can reduce these sorts of errors a great deal.
There is value in knowing this.
Weird, I’m totally in the other boat. I think we can use sub-1% or super-99% probabilities easily, all the time.
I just went on a long road trip. What probability should I have used that my car springs a brake fluid leak slow enough that it’s going to be useful for me to have a can of brake fluid in the car? I’d guess it happens once every 250k miles or so, and I just drove about 1k, so that’s about 1 in 250 (or let’s say 1 in 500 to guesstimate at the effect of doing highway driving). Bam, sub-1% probability. Now, did I need to consciously evaluate the probabilities to decide that I should definitely bring engine oil, might as well bring brake fluid but it’s not super important, and didn’t need to bring a bicycle pump? No. But if you ask me, I don’t see what’s stopping me from giving totally reasonable probabilities for needing these things.
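Spelled out as a throwaway Python sketch (the inputs are the guesses above, not data):

```python
miles_per_leak = 250_000   # rough guess at how often a slow brake-fluid leak happens
trip_miles = 1_000
highway_discount = 0.5     # guesstimate: highway driving is easier on the car

p_leak = trip_miles / miles_per_leak * highway_discount
print(p_leak)              # 0.002, i.e. about 1 in 500
```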
I think there’s a perspective that can synthesize both of these observations.
I could easily write a list of predictions of which less than 1 in 10,000 would be false:
The Sun will be shining somewhere on Earth at 2022-02-19 18:34:25.00001 UTC
The Sun will be shining somewhere on Earth at 2022-02-19 18:34:25.00002 UTC
The Sun will be shining somewhere on Earth at 2022-02-19 18:34:25.00003 UTC etc...
Of course, I’m “cheating”. There seem to be less than 100 consciously distinct plausibility values for me (or probably anyone). What I actually believe in this situation are several facts about how the Sun, Earth, time, and shining work which I believe at the highest plausibility value I can distinguish/track (something like >99.5%). I’m able to logically synthesize these into the above class of statements, from which I can deduce that the implied probability of those statements is quite high (much more than 99.5% likely to hold). This is an important part of what makes abstraction so powerful.
If you asked me for 10,000 true statements, none of which I could explicitly logically connect to one another, I would be surprised if more than 99.5% of them were actually true, even putting my highest possible level of care and effort into it. I think this is an inherent limitation of how my mind works: there just isn’t a finer-grained plausibility value I can use to tell these apart (which is an inherent limitation of being a bounded agent).
The key, I think, is that there is an important sense in which we can be more certain of logical deductions than intuitive beliefs, notwithstanding the fact that we are prone to making logical errors (for example, I used redundant lines of reasoning and large margins for error to generate the above example). It’s easy to be overconfident, but it’s almost as easy to be too pessimistic about what we can know.
Yeah, being inaccurate about personal fallibility levels is something I was trying to gesture at in https://www.lesswrong.com/posts/3duptyaLKKJxcnRKA/you-are-way-more-fallible-than-you-think and I think your comment summarizes what I wanted to express.
That depends on the subject. People are not as fallible on life-or-death subjects, or more people would accidentally walk in front of cars or trains, or fall off high places. Anybody have an idea what that probability would be?
People are pretty fallible in these cases, too, literally. Look at the number of fatalities in canyons with well marked paths and fences.
For each encounter with a precipice, I would strongly guess the success probability is > 99.999%.
I think I assign epsilon in the cases where I’m already hard pressed to come up with sensible things which would increase it to quantifiable levels. This is a feature, not a bug. There probably could be a more precise probability estimate, for a more rational agent than I with better tracking of priors and updates. Which doesn’t help me, as I’m trapped in this brain.
When I say this I’m expressing that I see no practical reason to distinguish between 1 in a trillion and 1 in a googol because the rest of my behavior will be the same anyway. I think this is totally reasonable because quantifying probabilities is a lot of work.
I mean more or less the same thing, but I think I’m lazier <3
I’ve noticed myself say epsilon as a placeholder for “that is not in an event space I’ve thought about and so I was probably acting like the probability was 0%”...
...then I think rigorously for an hour and end up with a probability as high as 10% or 7% or so. So for me I guess “one hundred” and “one trillion” are “roughly equally too big to really intuit without formal reasoning”?
Part of the challenge is having a habit that works with a range of possible situations and audiences.
Some people say 100% (or 0%) and are wrong about 25% of the time in such cases.
This is so common that if I just say “100%” near another person who is probabilistically numerate and calibrated on normal humans, they might hear “100%” as “75% (and this person cannot number real good (though they might human very well))”.
Some people say 99.999% as a way to say they are “as absolutely sure one can be, while pragmatically admitting that imperfect knowledge is impossible”…
...but they usually don’t think that they’ve just said that “in 49,000 tries the thing would probably never be observed, and with 100,000 tries ~1 case would occur, and if you were searching for true cases and each person was a case, then ~70,000 people (out of ~7 billion) are actually probably like this, and if they exist equally among the WEIRD and non-WEIRD then ~3500 people who are positive cases are in the US, and therefore an ad campaign could probably find 100 such people with only a small budget”.
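The same arithmetic as a quick Python sketch; the population figures are my own round numbers:

```python
p_failure = 1 - 0.99999   # "five nines" leaves a 1-in-100,000 failure rate
world_pop = 7e9
us_pop = 330e6            # rough US population, assumed here

print(f"Expected cases in 49,000 tries: {49_000 * p_failure:.2f}")   # ~0.5, so probably none observed
print(f"Positive cases worldwide: {world_pop * p_failure:,.0f}")     # ~70,000
print(f"Positive cases in the US: {us_pop * p_failure:,.0f}")        # ~3,300
```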
For these reasons, I try to say “1 in 50” and 98% roughly interchangeably, and for probabilities more extreme than that I try to stick to a “1 in N” format with a round number N.
Like “1 in a trillion” or “1 in 5000” or whatever is me explicitly trying to say that the main argument in favor of this involves some sort of “mass action” model based on known big numbers and imaginable handfuls of examples.
A person has ~500M “3 second long waking thoughts” per life. There are approximately ~50k municipalities in the US. There are ~10M people in the greater LA region (and in 2010 I could only find ~4 people willing to drive an hour on a freeway to attend a LW meetup). And so on.
Example: There are ~10k natural languages left (and I think only Piraha lacks a word for “four” (Toki Pona is an edge case and calls it “nanpa tu tu” (but Toki Pona is designed (and this is one of the worst parts of its design)))). So “based on informed priors” I tend to think that the probability of a language having no word for “four” is roughly 1 in 5000. If betting, I might offer 1 in 500 odds as a safety margin… so “99.8%” probability, but better to say “between 1 in 500 and 1 in 15,000” so that if someone thinks it is 1 in 100,000 they can bet against me on one side, and if someone else thinks it is 1 in 100, then I can think about taking both bets and arb them on average :-)
If I say “epsilon” in a mixed audience, some people will know that I’m signaling the ability to try to put high quality numbers on things with effort, and basically get sane results, but other people (like someone higher than me in a company hierarchy who is very non-technical) might ask “what do you mean by epsilon?” and then I can explain that mathematicians use it as a variable or placeholder for unknown small numbers on potentially important problems, and it sounds smart instead of “sounding ignorant”. Then I can ask if it matters what the number actually might be, and it leads the conversation in productive directions :-)
My guess is that the work of quantifying small probabilities scales up as the probability gets smaller, given the number of rare events that one has to take into consideration. I wonder if this can be estimated. It would be some function of 1/p, assuming a power-law scaling of the number of disjunctive possibilities.
What is your probability of the sun not rising tomorrow? In reality, not in a hypothetical.
My probability is zero, I’m not an ideal Bayesian.
This is actually not a bad example, because the definition of “sun rising” is ambiguous. Does it mean that it doesn’t get bright in the morning because of something? Does it mean that the Earth and the Sun are not in a particular arrangement? Would a solar eclipse count? Ash from a volcano? Going blind overnight from a stroke and not being able to see the sunrise? These are the disjunctive possibilities one has to think through that contribute to any unlikely event. If your reply is “I intuitively know what I mean by the sun not rising” without actually going through the possibilities, then you don’t know what you mean.
I’m going to start with the concrete question about the sun rising and then segue into related stuff that I suspect has more value. My concise reply is “whatever shminux means/meant by the sun not rising”, because my example is just a derivative of your example from the opening post.
I agree that my example is ambiguous. I think your questions exaggerate the ambiguity. According to how I hear people communicate, and thus how I think you are communicating:
The “sun rising” does not refer to it getting bright in the morning, because it starts to get bright before the sun rises, and because the sun does not rise less after a full moon, or in areas of high light pollution, or on a cloudy day. However, morning brightness can be evidence for the sun rising.
The “sun rising” does not refer to my individual perception of the sun rising, because if I am indoors when the sun rises, and do not see the sun rising, I do not say that the sun did not rise that day. However, my perception of the sun rising can be evidence for the sun rising.
An eclipsed sun is still the sun; we say “don’t look at the sun during an eclipse”. The sun continues rising during an eclipse if it is eclipsed while rising.
When the sun is obscured by clouds, mist, supervolcano eruption, asteroid impact, invading alien space craft, etc we say that we cannot see the sun rise. We don’t say that the sun did not rise.
But I agree that it’s an ambiguous example. In particular, close to the poles there are days in the year when the sun does not rise, and people say things like “the sun rises every morning at the equator but only once a year at the north pole”. So the statement “the sun is rising” is true in some places and times and false in others, and the example is ambiguous because it doesn’t specify the place. There is also some ambiguity around disputing definitions that I don’t think is very illuminating and I raise mostly so that nobody else has to.
So the ambiguities don’t update my assigned probabilities and the additional thought we’re putting into it doesn’t seem to be very decision-relevant, so I don’t think it’s paying off for us. I feel like we’ve roughly gone from “epsilon” to “it depends, but epsilon”. Do you think it’s paying off? If so, how? If not, is that evidence against your thesis?
I don’t think there is anything special about “epsilon” here. Consider another probability question: “Conditional on me flipping this unbiased coin, will it land heads?”. My probability is 50%, I’m not an ideal Bayesian. But an ideal Bayesian would give a number very slightly less than 50% due to “edge” and “does not land” and other more exotic possibilities. If “epsilon” is a cop-out then is “50%” a cop-out?
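For concreteness, a tiny sketch with a made-up value for the non-heads/tails outcomes:

```python
p_other = 1e-6               # placeholder for "edge", "does not land", and other exotic outcomes
p_heads = (1 - p_other) / 2
print(p_heads)               # 0.4999995, i.e. "very slightly less than 50%"
```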
My conclusion so far: in the rare case where the pay-off of an outcome is “huge” and its probability is “epsilon” then both “huge” and “epsilon” are unhelpful and more thought is useful. In other scenarios, “huge” and “epsilon” are sufficient.
I agree that the sunrise example may not be the best one, I just wanted to show that definitions tend to become ambiguous when there are many disjunctive possibilities to get to roughly the same outcome.
It’s kind of the same example: what are the odds of the coin not landing either heads or tails? There are many possibilities, including those you listed, and it’s work if you care about this edge case (no pun intended). If you care about slight deviations from 50%, then do the work; otherwise just say “50% for my purpose”, no epsilon required.
I would guess that somebody “who always did the work” and does not have this “bad habit” would spout a lot of numbers that would not really hold up. It might be good to strive for accuracy, but false accuracy can be pretty bad.
Then there might be cases where not using epsilon would be a simplification or cop-out. If you have beliefs like “AGI will hit between 10 and 20 years from now” and try to square this with more specific claims like “AGI will hit in 10 years, 4 months and 3 days” or “AGI will hit in 10 years, 4 months, 3 days, 2 hours and 3 minutes”, you ideally want the more exact claims to sum to the more general claim. And then there is the issue of whether you have credences for truly pointlike time intervals instead of a span. But if there is no span suggested by context (i.e. “AGI will hit in 10 years” has a natural adjacent in “hits in 11 years”, while “10 years, 4 months, 3 days” has an adjacent of “10 years, 4 months, 4 days”), what span should one apply?
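One way to make the span question concrete is to treat the belief as a density over arrival times and only assign credence to intervals. A minimal Python sketch, with a uniform distribution over 10-20 years chosen purely for illustration:

```python
def credence(start_years: float, end_years: float) -> float:
    """Credence that AGI arrives in the given span, under a uniform belief over 10-20 years."""
    lo, hi = 10.0, 20.0
    overlap = max(0.0, min(end_years, hi) - max(start_years, lo))
    return overlap / (hi - lo)

print(credence(10, 20))          # 1.0: the broad claim
print(credence(10.33, 10.34))    # ~0.001: a claim a few days wide
print(credence(10.334, 10.334))  # 0.0: a truly pointlike claim gets no credence
```

Under any density like this, the specific claims sum back to the general one only if each is read as a span; a truly pointlike claim gets measure zero, which is the issue raised above.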
If you have a map it is better to draw a dragon there rather than guess at a coastline or think that part is unmappable (by for example cropping the area out of the map).
I’m not sure I agree, but the comments you inspired have been thoughtful and educational. Strong upvote :-)
Epsilon is not quantitative, it’s qualitative.