(Compressed info meaningful to humans) + (uncompressed meaningless random noise)
is a better hypothesis than
(Uncompressed info meaningful to humans) + (uncompressed meaningless random noise)
I don’t see how these claims refute anything I said. You could probably use a similar argument to justify overfitting in general. A model which overfits doesn’t care about approximate fit at all; it cares only about perfect fit. And two hypotheses with perfect fit on the training data could have wildly different approximate fit to reality in their predictions, while, judged purely on exact correctness, those predictions are equally bad. Then Solomonoff induction wouldn’t care at all about picking the one with the better approximate predictions!
Just think about the predictions of actual scientific theories: we know in advance that these theories are all, strictly speaking, wrong, since they are simplifications of reality (so they would all be equally bad for Solomonoff), but one theory could still be closer to the truth, a much better approximation, than another, even though the probability of being precisely correct could be equal (and equally low) for both theories.
That A is a better approximate prediction than B doesn’t imply that A is more likely true than B. In fact, B could (and under Solomonoff induction probably would) contain a lot of made-up fake precision, which would give it at least some chance of being precisely true, in contrast to A, which can only ever fit reality imperfectly. Then B would be more likely true than A, but far less similar, in expectation, to reality.
Larger errors literally take more bits to describe. For example, in binary, 3 is 11₂ and 10 is 1010₂ (twice the bits).
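To make that scaling concrete, here is a minimal sketch (my own illustration, not from the comment) showing that the number of bits needed to write down an integer error grows roughly like log₂ of its size:

```python
# Bit length of a few integer "errors": larger residuals need more bits to write down.
for err in (3, 10, 100, 1000):
    print(err, bin(err)[2:], err.bit_length(), "bits")
# 3    11          2 bits
# 10   1010        4 bits
# 100  1100100     7 bits
# 1000 1111101000  10 bits
```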
Say you have two hypotheses, A and B, such that A is 100 bits more complicated than B but 5% closer to the true value. This means that, for each sample, the error in B on average takes log₂(1.05) ≈ 0.07 bits more to describe than the error in A.
After about 1,430 samples, A and B will be considered equally likely. After about 95 more samples, A will be considered 100 times more likely than B.
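Here is a small sketch of that arithmetic (my own restatement of the numbers above; the slightly different totals come from using log₂(1.05) exactly rather than the rounded 0.07):

```python
from math import log2

complexity_gap_bits = 100              # A is 100 bits more complicated than B
per_sample_gap_bits = log2(1.05)       # B's errors cost ~0.0704 extra bits per sample

break_even = complexity_gap_bits / per_sample_gap_bits
extra_for_100x = log2(100) / per_sample_gap_bits

print(round(break_even))      # ~1421 samples until A and B are considered equally likely
print(round(extra_for_100x))  # ~94 more samples until A is ~100x more likely than B
```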
In general, if f(x) is some high-level summary of the important information in x, Solomonoff induction that only tries to predict x is also universal for predicting f(x) (and it even has the same or better upper bounds).
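For what it’s worth, here is a toy sketch of the basic reason this works (my own illustration; `pushforward`, `p_x`, and the parity summary are hypothetical names, not anything from the comment): any predictive distribution over the raw data x induces, by marginalization, a distribution over a computable summary f(x), so a good predictor of x automatically yields predictions of f(x).

```python
from collections import defaultdict

def pushforward(p_x, f):
    """Induced distribution over f(x), given a predictive distribution over x."""
    p_f = defaultdict(float)
    for x, p in p_x.items():
        p_f[f(x)] += p
    return dict(p_f)

# Toy example: a predictor over 3-bit strings, with f = parity of the bits.
p_x = {"000": 0.5, "011": 0.25, "101": 0.125, "111": 0.125}
parity = lambda s: s.count("1") % 2
print(pushforward(p_x, parity))  # {0: 0.875, 1: 0.125}
```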