This seems like it would lead to overfitting on the random details of our particular universe, when what we really want (I think) is a theory that equally describes our universe or any sufficiently similar one.
First off, when you have that much data, overfitting won’t make a big difference. For example, you’ll get a prediction that something happens between 999,000 and 1,001,000 times, instead of exactly 1,000,000. Second, the correct answer (probability 0.5 for each of the 2,000,000 outcomes) would take 2,000,000 bits. The incorrect one would take 1,001,000 × (-ln(0.5005)/ln(2)) + 999,000 × (-ln(0.4995)/ln(2)) ≈ 1,999,998.56 bits. The difference in bits always reflects how unlikely it is for the data to land that far from the mean.
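For anyone who wants to check the arithmetic in the second point, here is a minimal sketch. It assumes the example is 2,000,000 independent two-outcome trials with 1,001,000 of one outcome and 999,000 of the other; the function name `code_length_bits` and the variable names are made up for illustration.

```python
import math


def code_length_bits(hits: int, misses: int, p_hit: float) -> float:
    """Bits needed to encode the sequence under a model that gives each
    hit probability p_hit: minus log2 of the sequence's probability."""
    return -(hits * math.log2(p_hit) + misses * math.log2(1.0 - p_hit))


# Assumed counts, chosen to match the numbers in the comment above.
hits, misses = 1_001_000, 999_000

fair_bits = code_length_bits(hits, misses, 0.5)       # the "correct" model
fitted_bits = code_length_bits(hits, misses, 0.5005)  # the overfitted model

savings = fair_bits - fitted_bits

print(f"fair model:   {fair_bits:,.2f} bits")
print(f"fitted model: {fitted_bits:,.2f} bits")
print(f"saved by overfitting: {savings:.2f} bits")
```

Under those assumptions the overfitted model saves only about a bit and a half out of two million, which is the sense in which overfitting makes little difference at this scale.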
Third, and most importantly, no matter how much your intuition says otherwise, this actually is the correct way to do it. The more bits an explanation has to use, the less likely it is. The coincidence might not seem interesting, but that exact sequence of data is unlikely. What normally makes something seem like a coincidence is that there seems to be a way to explain it with less data.
Can someone else explain this better?