We’ll describe a strategy for the magician that stops you sometime within the first $2^k$ bits and gets an expected squared error of at most $1/k$ regardless of your string. If we pick $k = 10^6$, the strategy has an expected squared error of at most $10^{-6}$, and so gets within 1% at least 99% of the time (by Markov’s inequality). This requires the magician to wait for up to $2^{10^6}$ bits. I suspect that’s roughly optimal but haven’t checked the construction carefully.
The strategy is to pick a random $0 \le \ell \le k-1$, then pick a random $0 \le m \le 2^{k-\ell-1}-1$. Then the magician stops you after $(2m+1)2^\ell$ steps, and guesses that the next $2^\ell$ bits will have the same average as the preceding $2^\ell$ bits.
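Concretely, one play of the strategy can be sketched in a few lines of Python (the function name and the choice to represent your string as a 0/1 list are mine):

```python
import random

def stop_and_guess(k, bits):
    """One play of the magician's strategy against the first 2**k bits.

    Picks a random horizon 2**l and a random odd multiple of 2**l as the
    stopping point, then predicts that the next 2**l bits will average
    the same as the preceding 2**l bits.
    """
    l = random.randrange(k)                  # 0 <= l <= k-1
    m = random.randrange(2 ** (k - l - 1))   # 0 <= m <= 2**(k-l-1) - 1
    stop = (2 * m + 1) * 2 ** l              # stopping point, < 2**k
    window = 2 ** l
    guess = sum(bits[stop - window:stop]) / window
    actual = sum(bits[stop:stop + window]) / window
    return stop, guess, actual
```

The squared error of one play is $(\text{guess} - \text{actual})^2$; the claim is that its expectation over the magician’s randomness is at most $1/k$ for every string.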
(Probably it would also work to pick a random multiple of $2^\ell$ instead of an odd multiple, and it would probably work to just pick the horizon and stopping point at random from suitable distributions. The exact choices here were just to make the analysis work out, and they may be an artifact of me being more comfortable with boolean Fourier analysis than real Fourier analysis.)
The intuition is that you can try to trick the magician in many ways, but you can’t do all of them at once:
If the magician guesses that the next bit will be equal to the last, then to fool them you could make each bit different from the last.
But if your bitstring flipped most of the time, then it would average to 1/2 over any long time period.
So to stymie that, you’d need to ensure that you also include some “low frequency” variation where the most common symbol flips from 0 to 1 every 1000 steps (say).
But then the magician could zoom out to every million steps, and the average would be closer to 50-50. So we need even-lower-frequency variation.
But we only have so much variance to go around. So eventually the magician is guaranteed to be OK: if they choose from at least a million exponentially decreasing frequencies, and then guess that there can’t be much oscillation at the chosen frequency, they will be roughly correct.
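The first two bullets can be illustrated with a toy example (a throwaway Python check of an alternating string, not part of the argument):

```python
# The alternating string 0,1,0,1,... makes every "next bit equals the
# last bit" guess wrong, but its average over any window of even length
# is exactly 1/2, so a zoomed-out guess of 1/2 is exactly right.
bits = [i % 2 for i in range(1024)]

one_step_errors = sum(bits[i + 1] == bits[i] for i in range(1023))
window_averages = [sum(bits[:w]) / w for w in (2, 8, 64, 512)]

print(one_step_errors)    # 0: the one-step predictor is always wrong
print(window_averages)    # [0.5, 0.5, 0.5, 0.5]
```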
OK, now the actual analysis. Let’s consider your first $2^k$ bits as a function $f : \{0,1\}^k \to \{-1,1\}$ (note that we’ve replaced 0 with $-1$ just to make things balanced; I think that’s just personal preference). Define $\hat f$ as the Fourier transform of $f$, i.e. $\hat f(y) = \frac{1}{2^k} \sum_x (-1)^{x \cdot y} f(x)$. Then it’s easy to see that $f(x) = \sum_y (-1)^{x \cdot y} \hat f(y)$.
The fact that “there is only so much variance to go around” is captured by the following calculation: $$\sum_y \hat f(y)^2 = \sum_y \Big(\frac{1}{2^k} \sum_x (-1)^{x \cdot y} f(x)\Big)^2 = \frac{1}{2^{2k}} \sum_y \sum_x \sum_{x'} (-1)^{(x+x') \cdot y} f(x) f(x').$$ Then we can flip the order of the sums over $x$ and $y$, and note that $\sum_y (-1)^{(x+x') \cdot y} = 0$ unless $x = x'$ (in which case it is $2^k$). Thus $\sum_y \hat f(y)^2 = \frac{1}{2^k} \sum_x f(x)^2 = 1$.
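Both the inversion formula and this Parseval-style identity are cheap to sanity-check numerically for small $k$ (a brute-force $O(4^k)$ sketch of my own, not an efficient transform):

```python
import random
from itertools import product

k = 4
xs = list(product([0, 1], repeat=k))
f = {x: random.choice([-1, 1]) for x in xs}   # an arbitrary +-1 string

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Fourier transform: f_hat(y) = 2**-k * sum_x (-1)**(x . y) * f(x)
fhat = {y: sum((-1) ** dot(x, y) * f[x] for x in xs) / 2 ** k for y in xs}

# Inversion: f(x) = sum_y (-1)**(x . y) * f_hat(y)
recovered = {x: sum((-1) ** dot(x, y) * fhat[y] for y in xs) for x in xs}
assert all(abs(recovered[x] - f[x]) < 1e-9 for x in xs)

# "Only so much variance": sum_y f_hat(y)**2 = 2**-k * sum_x f(x)**2 = 1
total_weight = sum(c * c for c in fhat.values())
print(total_weight)   # 1.0, for every choice of f
```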
If we write down the magician’s loss in terms of the fourier coefficients, we’ll get just the bounds we wanted. This is a lot of symbols but it’s all pretty mechanical and standard.
If the magician picks $m$ and $\ell$, then their guess will be the average value of $f$ over all bitstrings that start with $m0$. That guess will be compared to the average of $f$ over all bitstrings that start with $m1$. I’m going to ignore factors of 2 in the error for simplicity. Then their squared error is: $$\Big(\frac{1}{2^\ell} \sum_{z \in \{0,1\}^\ell} f(m0z) - \frac{1}{2^\ell} \sum_{z \in \{0,1\}^\ell} f(m1z)\Big)^2.$$ We can expand each appearance of $f$ in terms of the Fourier transform $\hat f$, and group together like terms, to get: $$\Big(\frac{1}{2^\ell} \sum_{z \in \{0,1\}^\ell} \sum_y \hat f(y)\big((-1)^{m0z \cdot y} - (-1)^{m1z \cdot y}\big)\Big)^2.$$
Now we can flip the order of summation of $y$ and $z$. When we sum over $z$, the innermost term $(-1)^{m0z \cdot y} - (-1)^{m1z \cdot y}$ usually disappears. The two terms are opposite unless $y$ has a 1 in the $(\ell+1)$st coordinate from the end. And the summation over $z$ will cancel everything out if $y$ has any 1s in the final $\ell$ coordinates. So the sum is non-zero if and only if the final 1 in $y$ is in position $\ell+1$ from the end. For such $y$, it is equal to $2^\ell (-1)^{m0^{\ell+1} \cdot y}$, where $m0^{\ell+1}$ denotes $m$ padded out to length $k$ with zeros.
Write $Y_\ell$ for the set of $y$ which have their final 1 in position $\ell+1$ from the end, and write $m \cdot y$ as shorthand for $m0^{\ell+1} \cdot y$. Then we can write the squared loss as $$\sum_{y \in Y_\ell} \sum_{y' \in Y_\ell} \hat f(y) \hat f(y') (-1)^{m \cdot (y + y')}.$$ If we now take the average over the $2^{k-\ell-1}$ different values of $m$, we get $$\frac{1}{2^{k-\ell-1}} \sum_m \sum_{y \in Y_\ell} \sum_{y' \in Y_\ell} \hat f(y) \hat f(y') (-1)^{m \cdot (y + y')}.$$ Again, we can reverse the order of summation, and the inner sum adds up to 0 unless $y = y'$. So we get $\sum_{y \in Y_\ell} \hat f(y)^2$.
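This is the step where a mistake would hide, so here is a numerical check of the conclusion for one choice of $(k, \ell)$: the squared loss averaged over prefixes $m$ equals the Fourier weight on $Y_\ell$. (The factor of 4 divided out below is the factor of 2 we agreed to ignore; the tuple-indexing convention is mine.)

```python
import random
from itertools import product

k, l = 5, 2
xs = list(product([0, 1], repeat=k))
f = {x: random.choice([-1, 1]) for x in xs}
fhat = {y: sum((-1) ** sum(a * b for a, b in zip(x, y)) * f[x] for x in xs) / 2 ** k
        for y in xs}

# Left side: the squared loss, averaged over all prefixes m of length k-l-1.
# A length-k string is read as m, then the separator bit, then z.
lhs = 0.0
for m in product([0, 1], repeat=k - l - 1):
    before = sum(f[m + (0,) + z] for z in product([0, 1], repeat=l)) / 2 ** l
    after = sum(f[m + (1,) + z] for z in product([0, 1], repeat=l)) / 2 ** l
    lhs += (before - after) ** 2 / 4     # divide out the ignored factor of 2
lhs /= 2 ** (k - l - 1)

# Right side: Fourier weight on Y_l, the y whose final 1 is in position l+1
# from the end (index k-l-1, with only zeros after it).
rhs = sum(c * c for y, c in fhat.items()
          if y[k - l - 1] == 1 and all(b == 0 for b in y[k - l:]))

print(abs(lhs - rhs) < 1e-9)   # True: the two sides agree
```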
Now if we draw $\ell$ at random from $\{0, \dots, k-1\}$, we get an expected squared error of $$\frac{1}{k} \sum_\ell \sum_{y \in Y_\ell} \hat f(y)^2 = \frac{1}{k} \sum_{y \neq 0} \hat f(y)^2 \le \frac{1}{k},$$ since the sets $Y_\ell$ partition the non-zero $y$. QED.
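For small $k$ the whole claim can be verified exhaustively: enumerate every string of length $2^k$, average the squared error (in $\{0,1\}$ terms) over all of the magician’s choices, and check that no string beats $1/k$. (A brute-force sketch of my own; the function name is made up.)

```python
from itertools import product

def expected_sq_error(bits, k):
    """Magician's squared error in {0,1} terms, averaged uniformly over
    l in 0..k-1 and then over all prefixes m of length k-l-1."""
    per_l = []
    for l in range(k):
        w = 2 ** l
        errs = []
        for m in range(2 ** (k - l - 1)):
            stop = (2 * m + 1) * w
            guess = sum(bits[stop - w:stop]) / w
            actual = sum(bits[stop:stop + w]) / w
            errs.append((guess - actual) ** 2)
        per_l.append(sum(errs) / len(errs))
    return sum(per_l) / k

k = 3
worst = max(expected_sq_error(bits, k)
            for bits in product([0, 1], repeat=2 ** k))
print(worst <= 1 / k + 1e-9)   # True: no length-8 string beats 1/3
```

In this small case the dropped factors of 2 do appear to cancel against the $\pm 1$ versus $\{0,1\}$ rescaling, as the parenthetical below hopes.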
(I sort of cheated in that I promised $1/k$ error but then dropped some factors of 2. I think those factors exactly cancel with the difference between predicting $\pm 1$ and predicting $\{0,1\}$, but don’t hold me to that.)