A very strange probability paradox
Which of the following do you think is bigger?
$A$: The expected number of rolls of a fair die until you roll two 6s in a row, given that all rolls were even.
$B$: The expected number of rolls of a fair die until you roll the second 6 (not necessarily in a row), given that all rolls were even.
If you are unfamiliar with conditional expectation, think of it this way: Imagine you were to perform a million sequences of die rolls, stopping each sequence when you roll two 6s in a row. Then you throw out all the sequences that contain an odd roll. The average number of rolls in the remaining sequences should be close to $A$. Next, perform a million sequences of die rolls, stopping each sequence when you roll the second 6. Throw out all the sequences among these that contain an odd roll. The average number of rolls in the remaining sequences should be close to $B$.
I asked something like this on r/math about a year ago, and even with the hint that the answer was paradoxical, the early consensus was that $A$ must be larger. The justification was more or less the following: any time you roll until reaching two 6s in a row, you will have also hit your second 6 at or before then. So regardless of what the conditions are, $A$ must be larger than $B$.
But the correct answer is actually $B$. What on earth is going on?
A quick verification
Before we proceed, let’s write some code to estimate $A$ and $B$. The only goal here is to be as unambiguous as possible, so the code will be almost comically unoptimized in both run-time and length.
```python
import random

def estimate_A(n):
    # Rolls 'n' sequences of die rolls, stopping each when two 6s in a row.
    # Tracks the number of sequences with no odd rolls and the total number
    # of rolls across those sequences, then computes the average by dividing
    # the two.
    num_sequences_without_odds = 0
    sum_rolls_without_odds = 0
    for i in range(n):
        counter = 0
        last_roll_six = False
        has_odd = False
        # Roll until two 6s in a row
        while True:
            x = random.randint(1, 6)
            counter += 1
            if x % 2 == 1:
                has_odd = True
            if x == 6:
                if last_roll_six:
                    break
                last_roll_six = True
            else:
                last_roll_six = False
        if not has_odd:
            sum_rolls_without_odds += counter
            num_sequences_without_odds += 1
    A_estimate = sum_rolls_without_odds / num_sequences_without_odds
    return A_estimate

def estimate_B(n):
    # Rolls 'n' sequences of die rolls, stopping each at the second 6.
    # Tracks the number of sequences with no odd rolls and the total number
    # of rolls across those sequences, then computes the average by dividing
    # the two.
    num_sequences_without_odds = 0
    sum_rolls_without_odds = 0
    for i in range(n):
        counter = 0
        six_count = 0
        has_odd = False
        # Roll until the second 6
        while True:
            x = random.randint(1, 6)
            counter += 1
            if x % 2 == 1:
                has_odd = True
            if x == 6:
                six_count += 1
                if six_count == 2:
                    break
        if not has_odd:
            sum_rolls_without_odds += counter
            num_sequences_without_odds += 1
    B_estimate = sum_rolls_without_odds / num_sequences_without_odds
    return B_estimate

print("Estimate for A: " + "{0:0.3f}".format(estimate_A(100000)))
print("Estimate for B: " + "{0:0.3f}".format(estimate_B(100000)))
```
The 100,000s in the last two lines represent the number of sequences being rolled for the estimate; you can add zeros for accuracy or subtract zeros for faster run-time.
The estimate for $A$ should be close to $2.727$ and the estimate for $B$ should be close to $3$. The exact value for $A$ is $30/11$ and the exact value for $B$ is $3$, but it is helpful to first unambiguously experimentally verify that $B$ is greater than $A$ (and ensure that we are on the same page about what $A$ and $B$ mean) before diving into the possibly unintuitive math.
Primer on geometric distributions
We’ll begin by calculating the expected number of rolls of a die to see the first 6. To do so, we first find the probability that it takes exactly $n$ rolls to see the first 6. This means the first $n-1$ rolls were not 6 and roll $n$ was 6.
The probability that a die rolls a 6 is $\frac{1}{6}$ and the probability it does not is $\frac{5}{6}$. Following the rule of product for independent probabilities, we get:

$$P(\text{first 6 is on roll } n) = \left(\frac{5}{6}\right)^{n-1}\cdot\frac{1}{6}$$
We can now get a formula for the expected number of rolls of a die until we see the first 6. The formula for expectation gives:

$$E[\text{rolls until first 6}] = \sum_{n=1}^{\infty} n\left(\frac{5}{6}\right)^{n-1}\cdot\frac{1}{6}$$
Now we’ll use the following fact: for $|x| < 1$:

$$\sum_{n=1}^{\infty} n x^{n-1} = \frac{1}{(1-x)^2}$$
This can be obtained by starting with the formula for geometric series and taking the derivative of both sides (if you remember calculus) or squaring both sides (if you’re very good at algebra). Plugging in $x = \frac{5}{6}$, we have:

$$E[\text{rolls until first 6}] = \frac{1}{6}\cdot\frac{1}{(1-5/6)^2} = \frac{1}{6}\cdot 36 = 6$$
And we are done. Sort of.
Let’s try that again, this time using an intuitive trick from Markov chains. We’ll use “average” and “expected” interchangeably as the former is more colloquial and we are going to be a bit informal here.
Let $m$ be the average number of rolls until we see the first 6. Let’s roll the die once. With probability $\frac{1}{6}$, we rolled a 6 and can stop. With probability $\frac{5}{6}$, we didn’t roll a 6 and are then still an average of $m$ rolls away from a 6.
So with probability $\frac{1}{6}$, we are in a scenario where we take one roll to see a 6, and in the remaining probability $\frac{5}{6}$, it will take an average of $1+m$ rolls to see a 6. So the average number of rolls until we see the first 6 is $\frac{1}{6}\cdot 1 + \frac{5}{6}(1+m)$. But the average is also $m$! This gives us the algebra equation:

$$m = \frac{1}{6}\cdot 1 + \frac{5}{6}(1+m)$$

that gives $m = 6$ when solving for $m$.
Let’s generalize now. Let’s say we have some experiment that has fixed probability $p$ of success. We repeat the experiment until it succeeds. Then if $m$ is the expected number of trials until success, we have:

$$m = p\cdot 1 + (1-p)(1+m) \implies m = \frac{1}{p}$$
Probability distributions of this form are called geometric distributions. In this case, the experiment was rolling a die and success was defined by seeing a 6, so it is a geometric distribution with success rate $\frac{1}{6}$. And so the expected number of trials until success is $6$.
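As a quick empirical check of this formula (a minimal sketch; the function name and trial count are my own choices, not from the post), the average number of rolls until the first 6 should come out near 6:

```python
import random

def mean_rolls_until_first_six(trials=100_000):
    # Average, over many trials, of the number of rolls needed to see the first 6.
    total = 0
    for _ in range(trials):
        rolls = 0
        while random.randint(1, 6) != 6:
            rolls += 1
        total += rolls + 1  # count the final, successful roll too
    return total / trials

print(mean_rolls_until_first_six())  # should print something near 6.0
```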
Rolls until first 6, given all even
In this section, we will use $D_6$ to refer to a fair 6-sided die with sides labeled 1 through 6, and $D_3$ to refer to a fair 3-sided die with sides labeled 2, 4, 6. Consider the following two questions:
What is the expected number of rolls of a $D_3$ until we see the first 6?
What is the expected number of rolls of a $D_6$ until we see the first 6, given that all rolls are even?
For the first question, we have a geometric distribution of success rate $\frac{1}{3}$, so the expected number of trials until success is $3$.
The second question was posed by Gil Kalai in a 2017 blog post. Most people incorrectly answered 3 (and keep in mind the audience for this blog is fairly math literate). The rationale was that the second question seems equivalent to the first. But let’s calculate it explicitly.
Analogously to the last section, we begin by calculating the probability that it takes exactly $n$ rolls to see the first 6, given that all rolls were even. Following standard conventions, we’ll write $P(E \mid F)$ as shorthand for “the probability that $E$ occurs, given that $F$ occurs”. From the formula for conditional probability, we have:

$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$
Let’s start with the numerator. If it takes us exactly $n$ rolls to see our first 6 and all rolls in the process were even, then the first $n-1$ rolls were all 2s or 4s and roll $n$ was a 6. The probability of this occurring is

$$\left(\frac{2}{6}\right)^{n-1}\cdot\frac{1}{6} = \left(\frac{1}{3}\right)^{n-1}\cdot\frac{1}{6}$$
The denominator is the total probability that we roll a 6 before the first odd. One way we could determine this is by evaluating $\sum_{n=1}^{\infty}\left(\frac{1}{3}\right)^{n-1}\cdot\frac{1}{6}$ (that is, summing the probability it takes exactly $n$ rolls to get a 6 and all rolls were even, over all possible values of $n$). We saw how to sum those kinds of series in the last section.
But a more intuitive way is as follows—rephrase “the probability we roll a 6 before the first odd” as “the probability that, among the sides 1, 3, 5, 6, the 6 is the first to show up”. From here, we can immediately see by symmetry that the probability is $\frac{1}{4}$. Indeed, summing the above series gives the same answer.
Altogether, we have:

$$P(\text{first 6 is on roll } n \mid \text{all rolls even}) = \frac{(1/3)^{n-1}\cdot(1/6)}{1/4} = \frac{2}{3}\left(\frac{1}{3}\right)^{n-1}$$

and so:

$$E[\text{rolls until first 6} \mid \text{all rolls even}] = \sum_{n=1}^{\infty} n\cdot\frac{2}{3}\left(\frac{1}{3}\right)^{n-1} = \frac{2}{3}\cdot\frac{1}{(1-1/3)^2} = \frac{3}{2}$$
There is another, cuter, way to answer the second question that will be important for our evaluation of $B$. We will first rephrase the question as “What is the expected number of rolls of a $D_6$ until the first 6, given that 6 is the first to occur out of 1, 3, 5, 6?”. We can rephrase this again as “What is the expected number of rolls of a $D_6$ until the first side in $\{1, 3, 5, 6\}$ shows up, given that 6 is the first to occur out of 1, 3, 5, 6?”.
Now we have some neat symmetry—the expected number of rolls of a $D_6$ until the first side in $\{1, 3, 5, 6\}$ shows up shouldn’t depend on which of those four sides happened to be first. So the following will have the same answer as the second question: “What is the expected number of rolls of a die until the first side in $\{1, 3, 5, 6\}$ shows up?”
That’s a geometric distribution with success rate $\frac{2}{3}$! And so its expectation is $\frac{3}{2}$.
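Here is a small simulation contrasting the two questions (a sketch; the names and trial counts are mine). The $D_3$ answer should land near 3, while the conditioned $D_6$ answer should land near 1.5:

```python
import random

def mean_rolls_D3_until_six(trials=100_000):
    # Expected rolls of a D3 (sides 2, 4, 6) until the first 6.
    total = 0
    for _ in range(trials):
        rolls = 1
        while random.choice([2, 4, 6]) != 6:
            rolls += 1
        total += rolls
    return total / trials

def mean_rolls_D6_until_six_given_all_even(trials=1_000_000):
    # Expected rolls of a D6 until the first 6, averaged over the sequences
    # that contain no odd rolls.
    kept = total = 0
    for _ in range(trials):
        rolls, has_odd = 0, False
        while True:
            x = random.randint(1, 6)
            rolls += 1
            if x % 2 == 1:
                has_odd = True
            if x == 6:
                break
        if not has_odd:
            kept += 1
            total += rolls
    return total / kept

print(mean_rolls_D3_until_six())                 # ~3.0
print(mean_rolls_D6_until_six_given_all_even())  # ~1.5
```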
Adding conditions after the first 6
We’ll now make a modification that might seem harmless. We will roll a $D_6$ one billion times (and keep going until we get a 6 if we somehow didn’t get one in the first billion rolls). What is the expected number of rolls until the first 6, given that every roll that ever showed up in this process is even?
In other words, we are still looking at the number of rolls until the first 6, and we are still requiring that all rolls before the first 6 are even. But now we are also requiring that rolls after the first 6 are even. Does this impact the expectation? Intuitively one might think “no”, but we’ll see that the expectation nearly doubles.
Now, even under the assumption that all evens show up in the first billion rolls, having no 6 show up in the first billion rolls is extremely unlikely. So with extreme accuracy, we can approximate the denominator as just the probability that the first billion rolls are all even:

$$P(\text{all rolls even}) \approx \left(\frac{1}{2}\right)^{10^9}$$
Now for $n$ less than one billion, the probability that the first 6 is on roll $n$ and there are only evens up to the billionth roll is:

$$\left(\frac{1}{3}\right)^{n-1}\cdot\frac{1}{6}\cdot\left(\frac{1}{2}\right)^{10^9 - n}$$
For $n$ greater than one billion, the expressions are slightly different, but the contributions are so astronomically small at that point that we can pretend they are the same with essentially no change in our answer. So we have

$$E \approx \frac{\sum_{n=1}^{\infty} n\left(\frac{1}{3}\right)^{n-1}\cdot\frac{1}{6}\cdot\left(\frac{1}{2}\right)^{10^9-n}}{(1/2)^{10^9}} = \sum_{n=1}^{\infty} n\left(\frac{1}{3}\right)^{n-1}\cdot\frac{1}{6}\cdot 2^{n} = \sum_{n=1}^{\infty} n\left(\frac{2}{3}\right)^{n-1}\cdot\frac{1}{3} = 3$$
which is roughly the expected number of rolls until the first 6 on a $D_3$ with sides labeled 2, 4, 6. The conclusion is that adding conditions after the first 6 can indeed impact the expected number of rolls to the first 6.
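The conditioning event here has probability about $2^{-10^9}$, so direct simulation is hopeless; instead we can check the series numerically. A minimal sketch (truncating at $n = 200$, which is far more than enough):

```python
# Each term is n * (1/3)**(n-1) * (1/6) * 2**n = n * (2/3)**(n-1) * (1/3);
# the sum converges to 3, matching the calculation above.
total = sum(n * (1/3) ** (n - 1) * (1/6) * 2 ** n for n in range(1, 201))
print(total)  # 2.9999999...
```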
Reviewing the paradox
Recall our definitions:
$A$: The expected number of rolls of a fair die until you roll two 6s in a row, given that all rolls were even.
$B$: The expected number of rolls of a fair die until you roll the second 6 (not necessarily in a row), given that all rolls were even.
We will now consider a third:
$C$: The expected number of rolls of a fair die until you roll the second 6, given that all rolls until the first instance of two 6s in a row were even.
$A$ and $C$ are directly comparable since they have the exact same condition—all rolls before the first instance of two 6s in a row are even and no conditions are given on any rolls that occur after. Since, amongst these rolls, the second 6 will always occur at or before the first instance of two 6s in a row, we can safely conclude

$$C \le A$$
However, we’ve seen in the last section that $C$ is not necessarily the same as $B$, even though we’ve only added conditions that apply at or after the second 6. So we cannot immediately conclude $B \le A$. The two need to be calculated independently and directly compared.
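$C$ is also easy to estimate by simulation (a sketch; the function name and trial count are mine): roll until the first instance of two 6s in a row, keep the sequence only if it contains no odds, and record the position of the second 6 overall.

```python
import random

def estimate_C(n=200_000):
    kept = total = 0
    for _ in range(n):
        rolls, last_six, has_odd = 0, False, False
        sixes, second_six_at = 0, None
        # Roll until two 6s in a row, remembering where the second 6 landed.
        while True:
            x = random.randint(1, 6)
            rolls += 1
            if x % 2 == 1:
                has_odd = True
            if x == 6:
                sixes += 1
                if sixes == 2:
                    second_six_at = rolls
                if last_six:
                    break
                last_six = True
            else:
                last_six = False
        if not has_odd:
            kept += 1
            total += second_six_at
    return total / kept

print(estimate_C())  # lands slightly below the estimate for A, consistent with C <= A
```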
Proving $A < B$
$B$ ends up being a lot easier to calculate than $A$, so our strategy will be to prove that $B = 3$, then use some nicer upper bounds to show that $A$ is less than $3$. The strategy to compute $B$ comes from a cute argument by reddit user u/bobjane in the aforementioned thread:
The average number of rolls until the first instance of a side in $\{1, 3, 5, 6\}$ is $\frac{3}{2}$, as it is a geometric distribution with success rate $\frac{2}{3}$. Then after rolling the first, the average until the next instance of a side in $\{1, 3, 5, 6\}$ is again $\frac{3}{2}$, no matter how long it took to get the first instance.
By linearity of expectation, we then have that the expected number of rolls of a die until the second instance of a side in $\{1, 3, 5, 6\}$ is $\frac{3}{2} + \frac{3}{2} = 3$. Conditioning on the specific combination we see being 6 and 6 does not impact the expectation, hence we have $B = 3$.
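This argument is easy to check empirically, since the conditioning drops out entirely (a minimal sketch; names are mine): the mean number of $D_6$ rolls until the second appearance of any side in $\{1, 3, 5, 6\}$ should be close to 3.

```python
import random

def mean_rolls_until_second_hit(trials=1_000_000):
    # Count rolls until the second appearance of any side in {1, 3, 5, 6}.
    total = 0
    for _ in range(trials):
        rolls = hits = 0
        while hits < 2:
            rolls += 1
            if random.randint(1, 6) in (1, 3, 5, 6):
                hits += 1
        total += rolls
    return total / trials

print(mean_rolls_until_second_hit())  # ~3.0
```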
The same user actually gave an explicit formula for $A$ in the general setting of rolling until $n$ 6s in a row, but there are a lot of steps and we won’t need the exact answer anyway. Instead, we will content ourselves with an upper bound for $A$ that is lower than $3$.
We first compute the probability that two 6s in a row occur before the first odd via a Markov chain argument. A “success” occurs if we roll the two 6s in a row before an odd, and a “failure” occurs if we roll an odd first.
Let $p$ be the probability of success at the beginning. Any time our most recent roll was a 2 or a 4 and we have neither rolled two 6s in a row nor an odd, the probability of success is still $p$; nothing has changed. We will refer to this state as $S_0$.
But when we roll a 6, the probability of getting two 6s in a row before an odd is temporarily higher—call this $q$. We will call this state $S_1$. If the next roll is a 2 or 4, we are back to $S_0$ and the probability of success becomes $p$ again.
We will solve for $p$ and $q$ via a system of equations:

$$p = \frac{1}{6}q + \frac{1}{3}p$$
$$q = \frac{1}{6} + \frac{1}{3}p$$
In other words, the probability of success from $S_0$ is equal to the probability of going from $S_0$ to $S_1$ (which occurs when you roll a 6, so $\frac{1}{6}$) times the probability of success from $S_1$, plus the probability you stay at $S_0$ (which occurs when rolling a 2 or 4, so $\frac{1}{3}$) times the probability of success from $S_0$.
On the other hand, the probability of success from $S_1$ is $\frac{1}{6}$ (the probability of rolling another 6 and being done), plus the probability you go back to $S_0$ times the probability of success from $S_0$.
Solving this system gives $p = \frac{1}{22}$ (and $q = \frac{2}{11}$). So:

$$A = \frac{\sum_{n=2}^{\infty} n\cdot P(\text{first instance of two 6s in a row ends on roll } n \text{ and all rolls even})}{1/22}$$
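The value $p = \frac{1}{22} \approx 0.0455$ can be sanity-checked by simulation (a sketch; the function name is mine):

```python
import random

def prob_two_sixes_before_odd(trials=1_000_000):
    # Fraction of runs in which 66 appears before the first odd roll.
    successes = 0
    for _ in range(trials):
        last_was_six = False
        while True:
            x = random.randint(1, 6)
            if x % 2 == 1:
                break  # failure: an odd came first
            if x == 6 and last_was_six:
                successes += 1
                break  # success: two 6s in a row before any odd
            last_was_six = (x == 6)
    return successes / trials

print(prob_two_sixes_before_odd(), 1 / 22)  # ~0.0455 vs 0.04545...
```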
The numerator in the display above is tricky to calculate. It is much easier to calculate the probability that an instance of exactly two 6s in a row ends at roll $n$ and all rolls are even until roll $n$; this will be higher, but we just want an upper bound anyway!
For $n = 2$, this probability is $\left(\frac{1}{6}\right)^2 = \frac{1}{36}$ and for $n = 3$, the probability is $\frac{1}{3}\cdot\left(\frac{1}{6}\right)^2 = \frac{1}{108}$. For $n \ge 4$, this is the probability that the first $n-3$ rolls are even, roll $n-2$ is a 2 or 4, and rolls $n-1$ and $n$ are 6s. In other words,

$$\left(\frac{1}{2}\right)^{n-3}\cdot\frac{1}{3}\cdot\left(\frac{1}{6}\right)^2 = \left(\frac{1}{2}\right)^{n-3}\cdot\frac{1}{108}$$
So:

$$A \le 22\left(\frac{2}{36} + \frac{3}{108} + \sum_{n=4}^{\infty} n\left(\frac{1}{2}\right)^{n-3}\cdot\frac{1}{108}\right) = 22\cdot\frac{14}{108} = \frac{77}{27} \approx 2.852 < 3$$

(skipping some algebra to manipulate the sum). The exact answer for $A$ is actually $\frac{30}{11}$, which is about $2.727$.
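For the skipped algebra, here is a numeric check of both the bound and the exact value (a sketch; truncating the tail only lowers the bound):

```python
from fractions import Fraction

# Upper bound: 22 * (2/36 + 3/108 + sum_{n>=4} n * (1/2)**(n-3) / 108).
# The tail sum_{n>=4} n * (1/2)**(n-3) equals 5 exactly, giving 22 * 14/108 = 77/27.
bound = 22 * (Fraction(2, 36) + Fraction(3, 108)
              + sum(Fraction(n, 108) / 2 ** (n - 3) for n in range(4, 300)))
print(float(bound))             # ~2.85185, i.e. just under 77/27
print(float(Fraction(30, 11)))  # the exact A, ~2.72727
```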
The general case
Following the above argument, you can show that the expected number of rolls until the $n$th 6, given that all rolls are even, is $\frac{3n}{2}$. The expected number of rolls until the first instance of $n$ 6s in a row, given that all rolls are even, is a little less than $n+1$. So, for instance, the expected number of rolls until the 100th 6 is more than the expected number of rolls until the first instance of 100 6s in a row, when both require no odds to show up before the stopping condition.
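To back this up, here is a sketch that computes these conditional expectations exactly (the recursion and all names are mine, not u/bobjane's formula). With state $i$ = current streak of 6s, the success probabilities $p_i$ and the quantities $u_i = E[\text{rolls}\cdot\mathbf{1}_{\text{success}}]$ satisfy linear first-step equations, each solvable by writing it as a linear function of the state-0 value:

```python
from fractions import Fraction

def expected_rolls_until_k_sixes_in_a_row_given_even(k):
    # From streak i: roll a 6 (prob 1/6) -> streak i+1; roll a 2 or 4 (prob 1/3)
    # -> streak 0; roll an odd (prob 1/2) -> failure. Success means reaching streak k.
    sixth, third = Fraction(1, 6), Fraction(1, 3)
    # Success probabilities: p_i = a_i + b_i * p_0, with p_k = 1.
    a = [Fraction(0)] * k + [Fraction(1)]
    b = [Fraction(0)] * (k + 1)
    for i in range(k - 1, -1, -1):
        a[i] = sixth * a[i + 1]
        b[i] = sixth * b[i + 1] + third
    p0 = a[0] / (1 - b[0])
    p = [a[i] + b[i] * p0 for i in range(k + 1)]
    # u_i = E[rolls * 1(success), from state i] = c_i + d_i * u_0, with u_k = 0.
    c = [Fraction(0)] * (k + 1)
    d = [Fraction(0)] * (k + 1)
    for i in range(k - 1, -1, -1):
        c[i] = sixth * (p[i + 1] + c[i + 1]) + third * p0
        d[i] = sixth * d[i + 1] + third
    u0 = c[0] / (1 - d[0])
    return u0 / p0  # E[rolls | all even] = E[rolls * 1(success)] / P(success)

for k in range(1, 7):
    e = expected_rolls_until_k_sixes_in_a_row_given_even(k)
    print(k, e, float(e))  # 3/2, 30/11, 246/65, ... each a little below k + 1
```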
Comments

Interesting. The natural approach is to imagine that you just have a 3-sided die with 2, 4, 6 on the sides, and if you do that, then I compute A = 12 and B = 6[1]. But, as the top Reddit comment’s edit points out, the difference between that problem and the one you posed is that your version heavily weights the probability towards short sequences—that weighting being 1/2^n for a sequence of length n. (Note that the numbers I got, A=12 and B=6, are so much higher than the A≈2.7 and B=3 you get.) It’s an interesting selection effect.
The thing is that, if you roll a 6 and then a non-6, in an “A” sequence you’re likely to just die due to rolling an odd number before you succeed in getting the double 6, and thus exclude the sequence from the surviving set; whereas in a “B” sequence there’s a much higher chance you’ll roll a 6 before dying, and thus include this longer “sequence of 3+ rolls” in the set.
To illustrate with an extreme version, consider conditioning on “all rolls were 6s”: the only surviving “A” sequence is [6, 6]. Obviously that’s one way to reduce A to 2.
Excluding odd rolls completely, so the die has a 1⁄3 chance of rolling 6 and a 2⁄3 chance of rolling an even number that’s not 6, we have:
A = 1 + 1⁄3 * A2 + 2⁄3 * A
Where A2 represents “the expected number of die rolls until you get two 6s in a row, given that the last roll was a 6”. Subtraction and multiplication then yield:
A = 3 + A2
And if we consider rolling a die from the A2 state, we get:
A2 = 1 + 1⁄3 * 0 + 2⁄3 * A
= 1 + 2⁄3 * A
Substituting:
A = 3 + 1 + 2⁄3 * A
=> (subtract)
1⁄3 * A = 4
=> (multiply)
A = 12
For B, a similar approach yields the equations:
B = 1 + 1⁄3 * B2 + 2⁄3 * B
B2 = 1 + 1⁄3 * 0 + 2⁄3 * B2
And the reader may solve for B = 6.
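For what it’s worth, the commenter’s numbers check out in simulation (a sketch; names and trial count are mine): a genuine 3-sided die with faces 2, 4, 6 and no conditioning at all gives means near 12 and 6.

```python
import random

def d3_means(trials=200_000):
    total_A = total_B = 0
    for _ in range(trials):
        # Rolls until two 6s in a row.
        rolls, last_six = 0, False
        while True:
            rolls += 1
            if random.choice([2, 4, 6]) == 6:
                if last_six:
                    break
                last_six = True
            else:
                last_six = False
        total_A += rolls
        # Rolls until the second 6 (not necessarily consecutive).
        rolls = sixes = 0
        while sixes < 2:
            rolls += 1
            if random.choice([2, 4, 6]) == 6:
                sixes += 1
        total_B += rolls
    return total_A / trials, total_B / trials

print(d3_means())  # ~(12.0, 6.0)
```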
I did not believe this until I tried it myself, so yes, very paradox, much confusing.
A good way to resolve the paradox, for me, is to modify the code to combine both functions into one and record the sequences from the 10,000 runs: in one array you store the sequences where the first two 6s are consecutive, and in the other you store the ones where they are not. That makes it a bit clearer.
For a run of 10,000 I get 412 runs where the first two 6s are consecutive (sequences_no_gap), and 192 where they are not (sequences_can_gap). So if it’s just case A you get 412 runs, but for case B you get 412+192 runs. Then you look at the average sequence length of sequences_no_gap and compare it to sequences_can_gap. If the average sequence length in sequences_can_gap is greater than in sequences_no_gap, then that means the expectation for B will be higher, and that’s what you get.
Mean sequence lengths:
sequences_can_gap: 3.93
sequences_no_gap: 2.49
Examples:
sequences_no_gap
[[4, 6, 6], [6, 6], [6, 6], [4, 6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [4, 6, 6], [4, 6, 6], …]
sequences_can_gap
[[6, 4, 4, 6], [6, 4, 6], [4, 6, 4, 6], [2, 2, 6, 2, 6], [6, 4, 2, 6], [6, 4, 6], [6, 2, 6], [6, 2, 4, 6], [6, 2, 2, 4, 6], [6, 4, 6], [6, 4, 6], [2, 4, 6, 2, 6], [6, 4, 6], [6, 4, 6], …]
The many examples such as [6, 4, 4, 6], which are excluded in the first case, make the expected number of rolls higher for the case where they are allowed.
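Here is a sketch of the combined experiment described above (variable names are mine). The expected counts per 10,000 runs are about 417 and 208, so the 412 and 192 above are within sampling noise, with mean lengths near 2.5 and 4.0:

```python
import random

def split_sequences(n=10_000):
    # Roll until the second 6, keep only all-even sequences, and split them
    # by whether the two 6s were consecutive.
    sequences_no_gap, sequences_can_gap = [], []
    for _ in range(n):
        seq = []
        while seq.count(6) < 2:
            seq.append(random.randint(1, 6))
        if any(x % 2 == 1 for x in seq):
            continue  # discard sequences containing an odd roll
        if seq[-2:] == [6, 6]:
            sequences_no_gap.append(seq)
        else:
            sequences_can_gap.append(seq)
    return sequences_no_gap, sequences_can_gap

no_gap, can_gap = split_sequences()
print(len(no_gap), sum(map(len, no_gap)) / len(no_gap))     # ~417 sequences, mean ~2.5
print(len(can_gap), sum(map(len, can_gap)) / len(can_gap))  # ~208 sequences, mean ~4.0
```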
(Note GPT o-1 is confused by this problem and gives slop)
By the way, this is my first non-question post on lesswrong after lurking for almost a year. I’ve made some guesses about the mathematical maturity I can assume, but I’d appreciate feedback on whether I should assume more or less in future posts (and general feedback about the writing style, either here or in private).
Seemed well-targeted to me. I do feel like it could have been condensed a bit or had a simpler storyline or something, but it was still a good post.
This is a lovely little problem, so thank you for sharing it. I thought at first it would be [a different problem](https://www.wolfram.com/mathematica/new-in-9/markov-chains-and-queues/coin-flip-sequences.html) that’s similarly paradoxical.
I wonder if there’s a connection with anthropic reasoning. Let’s suppose that a bomb goes off on rolling an odd number...
Note that a slightly differently worded problem gives the intuitive result:
A_k is the event “I roll a die k times, and it ends with 66, with no earlier 66 sequence”.
B_k is the event “I roll a die k times, and it ends with a 6, with one and only one 6 before that (but not necessarily the roll just before the end: 16236 works)”.
C_k is the event “I roll a die k times, and I only get even numbers”.
In this case we do have the intuitive result (which is, I think, what most mathematicians intuitively translate this problem into):
Σ[k * P(A_k|C_k)] > Σ[k * P(B_k|C_k)]
Now the question is: why are the two formulations not equivalent? How would you write “expected number of rolls” more formally, in a way that would not yield the above formula, and would reproduce the numbers of your Python program?
(This is what I hate about probability theory: slightly differently worded problems, seemingly equivalent, yield completely different results for no obvious reason.)
Also, the difference between the two processes is not small: roughly 2.73 vs 3.00 (n = 10 million).