First off, when you have that much data, over-fitting won't make a big difference. For example, you'll get a prediction that something happens between 999,000 and 1,001,000 times instead of exactly 1,000,000. Second, the correct hypothesis (a fair 50/50 process, at 1 bit per outcome) would take 2,000,000 bits. The over-fitted one (using the observed frequency 0.5005) would take 1,001,000·(−ln(0.5005)/ln(2)) + 999,000·(−ln(0.4995)/ln(2)) ≈ 1,999,998.56 bits. The difference in code length will always be exactly how unlikely it is for the data to land that far from the mean.
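For concreteness, here is that arithmetic in a few lines of Python (assuming the setup is 2,000,000 binary outcomes, 1,001,000 of one kind, which is how I read the numbers above):

```python
import math

n_heads, n_tails = 1_001_000, 999_000
n = n_heads + n_tails  # 2,000,000 outcomes in total

# Correct hypothesis: p = 0.5, so every outcome costs exactly 1 bit.
fair_bits = n * -math.log2(0.5)

# Over-fitted hypothesis: use the observed frequency as the probability.
p = n_heads / n  # 0.5005
overfit_bits = n_heads * -math.log2(p) + n_tails * -math.log2(1 - p)

print(fair_bits)                           # 2000000.0
print(round(overfit_bits, 2))              # 1999998.56
print(round(fair_bits - overfit_bits, 2))  # 1.44 bits saved by over-fitting
```

Presumably the point is that those ~1.44 bits saved on the data are less than the extra bits it would take to single out 0.5005 (rather than 0.5) as the parameter, so the simpler hypothesis still wins overall.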
Third, and most importantly, no matter how much your intuition says otherwise, this really is the correct way to do it. Every extra bit a hypothesis needs to encode the data halves the probability it assigns to that data. The coincidence might not seem interesting, but that exact sequence of data is unlikely either way; what normally makes something feel like a coincidence is that there seems to be a shorter description that explains it.
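To put the bits-to-probability link in symbols (this is just the standard correspondence, nothing beyond what's said above):

$$P(\text{data} \mid \text{hypothesis}) = 2^{-L(\text{data} \mid \text{hypothesis})},$$

where L is the code length in bits. So the ~1.44 bits the over-fitted model saves corresponds to a likelihood ratio of only 2^1.44 ≈ 2.7 in its favor.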
Can someone else explain this better?