Suppose I estimate the probability for event X at 50%. It’s possible that this is just my prior and if you give me any amount of evidence, I’ll update dramatically. Or it’s possible that this number is the result of a huge amount of investigation and very strong reasoning, such that even if you give me a bunch more evidence, I’ll barely shift the probability at all. In what way can I quantify the difference between these two things?
One possible way: add a range around it, such that you’re 90% confident your credence won’t move out of this range in the next D days. Problem: this depends heavily on whether you think you’ll be out looking for new evidence, or whether you’ll be locked in your basement during that time. But if we abstract to, say, “a normal period of D days”, it does provide some sense of how confident you are in your credence.
Another possible way: give numbers for how much your credence would shift, given B bits of information. Obviously 1 bit would shift your credence to 0 or 1. But how much would 0.5 bits move it? Etc. I actually don’t think this works, but wanted to get a second opinion.
A third proposal: specify the function that you’ll shift to if a randomly chosen domain expert told you that yours was a certain amount too high/low. This seems kinda arbitrary too though.
This is related to the problem of predicting a coin with an unknown bias. Consider two possible coins: the first which you have inspected closely and which looks perfectly symmetrical and feels evenly weighted, and the second which you haven’t inspected at all and which you got from a friend who you have previously seen cheating at cards. The second coin is much more likely to be biased than the first.
Suppose you are about to toss one of the coins. For each coin, consider the event that the coin lands on heads. In both cases you will assign a probability of 50%, because you have no knowledge that distinguishes between heads and tails.
But now suppose that before you toss the coin you learn that the coin landed on heads for each of its 10 previous tosses. How does this affect your estimate?
In the case of the first coin it doesn’t make very much difference. Since you see no way in which the coin could be biased you assume that the 10 heads were just a coincidence, and you still assign a probability of 50% to heads on the next toss (maybe 51% if you are beginning to be suspicious despite your inspection of the coin).
But when it comes to the second coin, this evidence would make you very suspicious. You would think it likely that the coin had been tampered with. Perhaps it simply has two heads. But it would also still be possible that the coin was fair. Two-headed coins are pretty rare, even in the world of degenerate gamblers. So you might assign a probability of around 70% to getting heads on the next toss.
This shows the effect that you were describing; both events had a prior probability of 50%, but the probability changes by different amounts in response to the same evidence. We have a lot of knowledge about the first coin, and compared to this knowledge the new evidence is insignificant. We know much less about the second coin, and so the new evidence moves our probability much further.
Mathematically, we model each coin as having a fixed but unknown frequency with which it comes up heads. This is a number 0 ≤ f ≤ 1. If we knew f, then we would assign a probability of f to any coin-flip except those about which we have direct evidence (i.e. those in our causal past). Since we don’t know f, we describe our knowledge about it by a probability distribution P(f). The probability of the next coin-flip coming up heads is then the expected value of f, that is, the integral of f P(f) over 0 ≤ f ≤ 1.
Then in the above example our knowledge about the first coin would be described by a function P(f) with a sharp peak around 1/2 and almost zero probability everywhere else. Our knowledge of the second coin would be described by a much broader distribution. When we find out that the coin has come up heads 10 times before, our probability distribution updates according to Bayes’ rule. It changes from P(f) to P(f)f^10 (or rather the normalisation of P(f)f^10). This doesn’t affect the sharply peaked distribution very much because the function f^10 is approximately constant over the sharp peak. But it pushes the broad distribution strongly towards 1 because 1^10 is 1024 times larger than (1/2)^10 and P(f) isn’t 1024 times taller near 1/2 than near 1.
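To make the update from P(f) to the normalised P(f)f^10 concrete, here is a minimal numerical sketch. It discretises f on a grid and compares a sharply peaked prior with a broad one. The specific prior shapes (a narrow bump of width 0.01 around 1/2, and a flat prior) and the helper name `mean_after_heads` are assumptions picked purely for illustration, not anything specified above.

```python
import math

def mean_after_heads(prior_weights, grid, n_heads):
    """Expected value of f after seeing n_heads heads in a row."""
    # Bayes' rule: posterior weight is proportional to P(f) * f**n_heads.
    unnormalised = [w * f ** n_heads for w, f in zip(prior_weights, grid)]
    total = sum(unnormalised)  # normalising constant
    return sum(u * f for u, f in zip(unnormalised, grid)) / total

# Candidate frequencies 0, 0.001, ..., 1.
grid = [i / 1000 for i in range(1001)]

# Inspected coin: narrow bump around 1/2 (width 0.01 is an assumed value).
sharp_prior = [math.exp(-0.5 * ((f - 0.5) / 0.01) ** 2) for f in grid]

# Uninspected coin: flat prior over [0, 1] (also an assumption).
broad_prior = [1.0] * len(grid)

for name, prior in [("inspected", sharp_prior), ("uninspected", broad_prior)]:
    before = mean_after_heads(prior, grid, 0)   # prior probability of heads
    after = mean_after_heads(prior, grid, 10)   # after 10 heads in a row
    print(f"{name}: {before:.3f} -> {after:.3f}")
```

Under these assumed priors both coins start at 0.5, but the sharp prior should stay at about 0.50 while the flat prior jumps to about 0.92. A prior for the second coin that keeps more weight near 1/2 than a flat one does would land closer to the 70% figure suggested above.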
So this is a nice case where it is possible to compare how much a given piece of evidence moves our probability estimate in two different situations. However, I’m not sure whether this can be extended to the general case. A proposition like “Trump gets reelected” can’t be thought of as being like a flip of a coin with a particular frequency. Not only are there no “previous flips” we can learn about, it’s not clear what another flip would even look like. The election that Trump won doesn’t count, because we had totally different knowledge about that one.
It seems like you’re describing a Bayesian probability distribution over a frequentist probability estimate of the “real” probability. Agreed that this works in cases which make sense under frequentism, but in cases like “Trump gets reelected” you need some sort of distribution over a Bayesian credence, and I don’t see any natural way to generalise to that.
Right. But I was careful to refer to f as a frequency rather than a probability, because f isn’t a description of our beliefs but rather a physical property of the coin (and of the way it’s being thrown).
I agree. But it seems to me like the other replies you’ve received are mistakenly treating all propositions as though they do have an f with an unknown distribution. Unnamed suggests using the beta distribution; the thing which it’s the distribution of would have to be f. Similarly rossry’s reply, containing phrases like “something in the ballpark of 50%” and “precisely 50%”, talks as though there is some unknown percentage to which 50% is an estimate.
A lot of people (like in the paper Pattern linked to) think that our distribution over f is a “second-order” probability describing our beliefs about our beliefs. I think this is wrong. The number f doesn’t describe our beliefs at all; it describes a physical property of the coin, just like mass and diameter.
In fact, any kind of second-order probability must be trivial. We have introspective access to our own beliefs. So given any statement about our beliefs we can say for certain whether or not it’s true. Therefore, any second-order probability will either be equal to 0 or 1.
I don’t have much to add on the original question, but I do disagree with your last point, that any second-order probability must be trivial.
There is a sense in which, once you say “my credence in X is Y”, then I can’t contradict you. But if I pointed out that actually, you’re behaving as if it is Y/2, and some other statements you made implied that it is Y/2, and then you realise that when you said the original statement, you were feeling social pressure to say a high credence even though it didn’t quite feel right—well, that all looks a lot like you being wrong about your actual credence in X. This may end up being a dispute over the definition of belief, but I do prefer to avoid defining things in ways where people must be certain about them, because people can be wrong in so many ways.
Okay, sure. But an idealized rational reasoner wouldn’t display this kind of uncertainty about its own beliefs, and yet it would still exhibit the phenomenon you were originally asking about (where statements assigned the same probability update by different amounts after the introduction of evidence). So this kind of second-order probability can’t be used to answer the question you originally asked.
FYI there’s more about “credal resilience” here (although I haven’t read the linked papers yet).
If I learned the first coin came up heads 10 times, then I would figure the probability of it coming up heads would be higher than 50%, I think 51% at a minimum.
It doesn’t really matter for the point I was making, so long as you agree that the probability moves further for the second coin.
The beta distribution is often used to represent this type of scenario. It is straightforward to update in simple cases where you get more data points, though it’s not straightforward to update based on messier evidence like hearing someone’s opinion.
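For the simple case of extra data points, the beta update is just adding counts. Here is a minimal sketch, assuming Beta(500, 500) for a closely inspected coin and Beta(5, 5) for an uninspected one. Both parameter choices and the helper names are illustrative assumptions.

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate update: add observed heads and tails to the Beta parameters."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta), i.e. the probability of heads on the next flip."""
    return alpha / (alpha + beta)

# Assumed priors: both have mean 0.5, but very different pseudo-counts.
priors = {
    "inspected": (500, 500),   # behaves like 1000 previously observed flips
    "uninspected": (5, 5),     # behaves like only 10 previously observed flips
}

for name, (a, b) in priors.items():
    a_post, b_post = beta_update(a, b, heads=10, tails=0)
    print(f"{name}: {beta_mean(a, b):.3f} -> {beta_mean(a_post, b_post):.3f}")
```

With these assumed numbers the inspected coin moves from 0.5 to 510/1010 (a bit over 50%), while the uninspected coin moves to 15/20 = 75%. The sum alpha + beta acts like a count of imaginary prior observations, so it gives one rough handle on how hard the 50% is to shift, though only for propositions that fit the coin-like model.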
For future reference, after asking around elsewhere I learned that this has been discussed in a few places, and the term used for credences which are harder to shift is “resilient”. See this article, and the papers it links to: https://concepts.effectivealtruism.org/concepts/credal-resilience/
My experience has been that in practice it almost always suffices to express second-order knowledge qualitatively rather than quantitatively. Granted, it requires some common context and social trust to be adequately calibrated on “50%, to make up a number” < “50%, just to say a number” < “let’s say 50%” < “something in the ballpark of 50%” < “plausibly 50%” < “probably 50%” < “roughly 50%” < “actually just 50%” < “precisely 50%” (to pick syntax that I’m used to using with people I work with), but you probably don’t actually have good (third-order!) calibration of your second-order knowledge, so why bother with the extra precision?
The only other thing I’ve seen work when you absolutely need to pin down levels of second-order knowledge is just talking about where your uncertainty is coming from, what the gears of your epistemic model are, or sometimes how much time of concerted effort it might take you to resolve X percentage points of uncertainty in expectation.
That makes sense to me, and it’s what I’d do in practice too, but it still feels odd that there’s no theoretical solution to this question.
What’s your question?
I have some answers (for some guesses about what your question is, based on your comments) below.
This sounds like Bayes’ Theorem, but the actual question about how you generate numbers given a hypothesis...I don’t know. There’s stuff around here about a good scoring rule I could dig up. Personally, I just make up numbers to give me an idea.
This sounds like Inadequate Equilibria.
I found this on higher-order probabilities. (It notes the rule “for any x, x = Pr[E given that Pr(E) = x]”.) Google also turned up some papers on the subject I haven’t read yet.
Your whole comment is founded on a false assumption. Look at Bayes’ formula. Do you see any mention of whether your probability estimate is “just your prior” or “the result of a huge amount of investigation and very strong reasoning”? No? Well, this means that it doesn’t affect how much you’ll update.
This is untrue. Consider a novice and an expert who both assign 0.5 probability to some proposition A. Let event B be a professor saying that A is true. Let’s also say that both the novice and the expert assign 0.5 probability to B. But the key term here is P(B|A). For a novice, this is plausibly quite high, because for all they know there’s already a scientific consensus on A which they just hadn’t heard about yet. For the expert, this is probably near 0.5, because they’re confident that the professor has no better source of information than they do.
In other words, experts may update less on evidence because the effect of that evidence is “screened off” by things they already knew. But it’s difficult to quantify this effect.
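The novice/expert example can be run through Bayes’ formula directly. This is a small sketch, assuming P(B|A) = 0.9 for the novice (an illustrative value for “quite high”) and 0.5 for the expert. The function name is hypothetical.

```python
def posterior_given_b(p_a, p_b_given_a, p_b):
    """Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)."""
    # Check that the inputs are coherent: the implied P(B|not A) must lie in [0, 1].
    p_b_given_not_a = (p_b - p_b_given_a * p_a) / (1 - p_a)
    if not 0.0 <= p_b_given_not_a <= 1.0:
        raise ValueError("inconsistent probabilities")
    return p_b_given_a * p_a / p_b

# Both start with P(A) = 0.5 and P(B) = 0.5; only P(B|A) differs.
novice = posterior_given_b(p_a=0.5, p_b_given_a=0.9, p_b=0.5)  # 0.9 is assumed
expert = posterior_given_b(p_a=0.5, p_b_given_a=0.5, p_b=0.5)

print(f"novice: 0.5 -> {novice:.2f}")
print(f"expert: 0.5 -> {expert:.2f}")
```

With P(A) = P(B) = 0.5 the formula reduces to P(A|B) = P(B|A), so under these assumed numbers the novice moves to 0.9 while the expert stays at 0.5. The same prior and the same evidence give different updates because the likelihood term encodes what each person already knows.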