This is related to the problem of predicting a coin with an unknown bias. Consider two possible coins: the first, which you have inspected closely and which looks perfectly symmetrical and feels evenly weighted; and the second, which you haven’t inspected at all and which you got from a friend whom you have previously seen cheating at cards. The second coin is much more likely to be biased than the first.
Suppose you are about to toss one of the coins. For each coin, consider the event that the coin lands on heads. In both cases you will assign a probability of 50%, because you have no knowledge that distinguishes between heads and tails.
But now suppose that before you toss the coin you learn that the coin landed on heads for each of its 10 previous tosses. How does this affect your estimate?
In the case of the first coin it doesn’t make very much difference. Since you see no way in which the coin could be biased, you assume that the 10 heads were just a coincidence, and you still assign a probability of 50% to heads on the next toss (maybe 51% if you are beginning to be suspicious despite your inspection of the coin).
But when it comes to the second coin, this evidence would make you very suspicious. You would think it likely that the coin had been tampered with. Perhaps it simply has two heads. But it would also still be possible that the coin was fair; two-headed coins are pretty rare, even in the world of degenerate gamblers. So you might assign a probability of around 70% to getting heads on the next toss.
This shows the effect you were describing: both events had a prior probability of 50%, but the two probabilities change by different amounts in response to the same evidence. We have a lot of knowledge about the first coin, and compared to this knowledge the new evidence is insignificant. We know much less about the second coin, and so the new evidence moves our probability much further.
Mathematically, we model each coin as having a fixed but unknown frequency f with which it comes up heads, where 0 ≤ f ≤ 1. If we knew f then we would assign a probability of f to any coin-flip except those about which we have direct evidence (i.e. those in our causal past). Since we don’t know f, we describe our knowledge about it by a probability distribution P(f). The probability of the next coin-flip coming up heads is then the expected value of f, i.e. the integral of f·P(f) over f from 0 to 1.
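Here is a minimal numerical sketch of that last step. The particular prior shapes (a narrow bump around 1/2 for the inspected coin, a uniform density for the suspect coin) and the helper predictive_heads are just illustrative assumptions, not anything specified above. Since both priors are symmetric about 1/2, both give a predictive probability of 50% for the next toss.

```python
import numpy as np

# Grid over the possible values of the heads-frequency f.
f = np.linspace(0.0, 1.0, 10_001)

def predictive_heads(pdf):
    """P(next toss is heads) = E[f] = integral of f * P(f) df, normalising P(f) first."""
    pdf = pdf / np.trapz(pdf, f)
    return np.trapz(f * pdf, f)

# Illustrative priors: sharply peaked near 1/2 (inspected coin) vs broad (suspect coin).
sharp_prior = np.exp(-((f - 0.5) ** 2) / (2 * 0.01 ** 2))  # narrow bump at 1/2
broad_prior = np.ones_like(f)                              # uniform over [0, 1]

print(predictive_heads(sharp_prior))  # ~0.5
print(predictive_heads(broad_prior))  # ~0.5
```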
Then in the above example our knowledge about the first coin would be described by a function P(f) with a sharp peak around 1/2 and almost zero probability everywhere else. Our knowledge of the second coin would be described by a much broader distribution. When we find out that the coin has come up heads 10 times before, our probability distribution updates according to Bayes’ rule: it changes from P(f) to P(f)·f^10 (or rather to the normalisation of P(f)·f^10). This doesn’t affect the sharply peaked distribution very much, because the function f^10 is approximately constant over the sharp peak. But it pushes the broad distribution strongly towards 1, because 1^10 is 1024 times larger than (1/2)^10 and P(f) isn’t 1024 times taller near 1/2 than near 1.
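And a sketch of the update itself, continuing the same illustrative priors as in the snippet above: multiplying by the likelihood f^10 barely shifts the sharply peaked prior, but pushes the broad one most of the way towards 1.

```python
import numpy as np

f = np.linspace(0.0, 1.0, 10_001)

def predictive_heads(pdf):
    """P(next toss is heads) = E[f] under the (normalised) density over f."""
    pdf = pdf / np.trapz(pdf, f)
    return np.trapz(f * pdf, f)

sharp_prior = np.exp(-((f - 0.5) ** 2) / (2 * 0.01 ** 2))  # inspected coin
broad_prior = np.ones_like(f)                              # suspect coin

# Bayes' rule after seeing 10 heads: multiply P(f) by the likelihood f**10.
# (Normalisation is handled inside predictive_heads.)
likelihood = f ** 10
print(predictive_heads(sharp_prior * likelihood))  # ~0.502: barely moves
print(predictive_heads(broad_prior * likelihood))  # ~0.917: pushed strongly towards 1
```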
So this is a nice case where we can compare how much a given piece of evidence moves our probability estimate in two different situations. However, I’m not sure whether this can be extended to the general case. A proposition like “Trump gets reelected” can’t be thought of as being like a flip of a coin with a particular frequency. Not only are there no “previous flips” we can learn about, but it’s not clear what another flip would even look like. The election that Trump won doesn’t count, because we had totally different knowledge about that one.
It seems like you’re describing a Bayesian probability distribution over a frequentist probability estimate of the “real” probability. Agreed that this works in cases which make sense under frequentism, but in cases like “Trump gets reelected” you need some sort of distribution over a Bayesian credence, and I don’t see any natural way to generalise to that.
It seems like you’re describing a Bayesian probability distribution over a frequentist probability estimate of the “real” probability.
Right. But I was careful to refer to f as a frequency rather than a probability, because f isn’t a description of our beliefs but rather a physical property of the coin (and of the way it’s being thrown).
Agreed that this works in cases which make sense under frequentism, but in cases like “Trump gets reelected” you need some sort of distribution over a Bayesian credence, and I don’t see any natural way to generalise to that.
I agree. But it seems to me like the other replies you’ve received are mistakenly treating all propositions as though they do have an f with an unknown distribution. Unnamed suggests using the beta distribution; the thing it would be a distribution of would have to be f. Similarly, rossry’s reply, containing phrases like “something in the ballpark of 50%” and “precisely 50%”, talks as though there is some unknown percentage of which 50% is an estimate.
A lot of people (like in the paper Pattern linked to) think that our distribution over f is a “second-order” probability describing our beliefs about our beliefs. I think this is wrong. The number f doesn’t describe our beliefs at all; it describes a physical property of the coin, just like mass and diameter.
In fact, any kind of second-order probability must be trivial. We have introspective access to our own beliefs. So given any statement about our beliefs we can say for certain whether or not it’s true. Therefore, any second-order probability will either be equal to 0 or 1.
I don’t have much to add on the original question, but I do disagree about your last point:
In fact, any kind of second-order probability must be trivial. We have introspective access to our own beliefs. So given any statement about our beliefs we can say for certain whether or not it’s true. Therefore, any second-order probability will either be equal to 0 or 1.
There is a sense in which, once you say “my credence in X is Y”, I can’t contradict you. But suppose I pointed out that you’re actually behaving as if it were Y/2, and that some other statements you’ve made imply it is Y/2, and you then realise that when you made the original statement you were feeling social pressure to report a high credence even though it didn’t quite feel right. That all looks a lot like you being wrong about your actual credence in X. This may end up being a dispute over the definition of belief, but I prefer to avoid defining things in ways where people must be certain about them, because people can be wrong in so many ways.
Okay, sure. But an idealised rational reasoner wouldn’t display this kind of uncertainty about its own beliefs, yet it would still exhibit the phenomenon you were originally asking about (where statements assigned the same probability update by different amounts when evidence is introduced). So this kind of second-order probability can’t be used to answer the question you originally asked.
FYI there’s more about “credal resilience” here (although I haven’t read the linked papers yet).
If I learned the first coin came up heads 10 times, then I would figure the probability of it coming up heads would be higher than 50%, I think 51% at a minimum.
It doesn’t really matter for the point I was making, so long as you agree that the probability moves further for the second coin.