Because you need the right notion of complexity in order to prevent overfitting. To prevent overfitting, you need to penalize highly expressive model classes. The MDL complexity measure does not perfectly capture the notion of expressivity, especially if you use a naive encoding method: an ensemble with many subclassifiers may require many bits to specify and yet not be very expressive, if its weights are small.
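To make that concrete, here is a toy sketch (entirely my own construction; the decision stumps, the tiny weights, and the 32-bits-per-parameter cost are made-up illustration, not anyone's actual encoder). The naive bit-count grows linearly with the number of sub-classifiers, while the set of labelings the ensemble can actually realize on a fixed sample stays as small as a single stump's, because the extra weights are too small to ever flip the vote:

```python
# Toy contrast between naive MDL bit-count and effective expressivity.
import random

random.seed(0)
xs = [random.uniform(0.0, 1.0) for _ in range(8)]  # fixed sample of 8 points

def ensemble_label(thresholds, weights, x):
    """Sign of a weighted vote of stumps 1[x > t]."""
    score = sum(w * (1.0 if x > t else -1.0) for t, w in zip(thresholds, weights))
    return 1 if score >= 0 else -1

def naive_mdl_bits(n_stumps, bits_per_param=32):
    # One threshold + one weight per stump, each naively coded in 32 bits.
    return 2 * bits_per_param * n_stumps

def distinct_labelings(n_stumps, weight_scale, trials=2000):
    """Count distinct labelings of xs realized by random small-weight ensembles."""
    seen = set()
    for _ in range(trials):
        ts = [random.random() for _ in range(n_stumps)]
        # One dominant stump plus many tiny-weight stumps: the tiny weights
        # cost bits to encode but can never outvote the dominant stump.
        ws = [1.0] + [weight_scale * random.random() for _ in range(n_stumps - 1)]
        seen.add(tuple(ensemble_label(ts, ws, x) for x in xs))
    return len(seen)

for m in (1, 10, 100):
    print(f"{m:4d} stumps: {naive_mdl_bits(m):6d} bits (naive MDL), "
          f"{distinct_labelings(m, weight_scale=0.001):3d} distinct labelings "
          f"of {2 ** len(xs)} possible")
```

The bit cost goes from 64 to 6400 as the ensemble grows, but the realizable labelings stay at roughly nine out of 256, exactly a single stump's worth.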
I think we’re not trying to prevent overfitting. We’re talking about the problem of induction.
How do we know that a method that prevented overfitting in the past will continue to prevent overfitting in the future? Appealing to “induction has always worked before” is circular.
I think it was Hume who asked: How do we know that bread will continue to nourish humans? It always has before, but in what sense are past observations logical grounds for anything? We would need some magical law of nature that says “Nature definitely won’t just change all the rules.” But of course, there is no such law, and Nature might change all the rules at any time.
Suppose there are two kinds of worlds, one where bread always nourishes, and one where bread sometimes nourishes. The first variety is strictly simpler, because the second can camouflage itself as the first, matching the givens for an arbitrary amount of time, but the converse is not true. The Occam strategy of believing the simplest hypothesis consistent with the givens has this pleasant property: It is the best strategy as measured in worst-case mind changes against an adversarial Nature. Therefore, believe that bread will continue to nourish.
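If it helps, here is a minimal simulation of the mind-changes claim (my own formalization; the hypothesis names H_inf and H_k are hypothetical labels). An adversarial Nature can string an anti-Occam learner along through arbitrarily many conjecture changes, while the Occam learner can be forced into at most one, namely when nourishment first fails:

```python
# Worst-case mind changes for an Occam learner vs. an anti-Occam learner.
# H_inf = "bread always nourishes"; H_k = "bread nourishes through day k, then stops".

def occam_learner(history):
    """Conjecture the simplest hypothesis consistent with the observations."""
    if all(history):
        return "H_inf"                     # no failure seen yet
    return f"H_{history.index(False)}"     # first failure pins the hypothesis

def greedy_learner(history):
    """An anti-Occam learner that always bets nourishment stops tomorrow."""
    if all(history):
        return f"H_{len(history)}"
    return f"H_{history.index(False)}"

def worst_case_mind_changes(learner, horizon):
    """Adversarial Nature keeps serving nourishing bread and counts how many
    times the learner's conjecture changes. (A single eventual failure would
    add at most one more change for either learner.)"""
    history, changes, last = [], 0, None
    for _ in range(horizon):
        history.append(True)               # Nature serves nourishing bread
        guess = learner(history)
        if last is not None and guess != last:
            changes += 1
        last = guess
    return changes

print("Occam learner, worst case:", worst_case_mind_changes(occam_learner, 50))
print("Anti-Occam learner, worst case:", worst_case_mind_changes(greedy_learner, 50))
```

The Occam learner reports zero mind changes over any run of nourishing days; the greedy learner can be forced to change its mind every single day.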
I worry that I’m just muddying the water here.
I think Occam’s razor is fairly easy to justify in general.
I would say that the problem is resolved if we assume that our reference class is “every conceivable, formally describable universe”—and the description in “formally describable” doesn’t just describe the universe’s state at an instant: It describes the entire universe, and its history, as an object. We should assume that one of those objects corresponds to our world.
Once we have that, we have some data from experiments, and we generate a partial model: an algorithm that accepts past data and predicts future data. The model needs to fit the past data, and we want one that is as likely as possible to predict future data correctly. We are hoping that our partial model correctly describes the behavior of our universe, which is one of the ones in the reference class. The greater the information content of the partial model, the more specific it is, and the smaller the proportion of possible universes in the reference class that will agree with it. The smaller the information content of the partial model, the more general it is, the greater the proportion of possible universes in the reference class that will agree with it, and therefore the greater the chance that one of these universes happens to be the real one (or the one in which we are living, if you subscribe to some form of modal realism; not that it matters whether you do or not).
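Here is the counting argument as a toy enumeration (my construction; the 12-bit "universes" and bit-pinning models are crude stand-ins for formal descriptions). A partial model with k bits of content is satisfied by exactly a 2^-k fraction of the universes, so every extra bit of specificity halves the proportion that agree:

```python
# Fraction of toy universes agreeing with a partial model of k bits.
from itertools import product

N = 12  # length of each toy universe-history

def agrees(universe, model):
    """model: dict position -> required bit (the k bits the model pins down)."""
    return all(universe[i] == b for i, b in model.items())

for k in (1, 4, 8):
    model = {i: 1 for i in range(k)}       # a model with k bits of content
    matching = sum(agrees(u, model) for u in product((0, 1), repeat=N))
    print(f"k={k}: {matching}/{2 ** N} universes agree "
          f"(fraction {matching / 2 ** N:.6f} = 2**-{k})")
```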
This should easily deal with the issue of why, when you see ordered behavior in reality, you should expect to see continued ordered behavior. It doesn’t resolve these issues:
Why do we see any ordered behavior in the first place? None of this imposes any ordered behavior on reality; it simply says that if you see some, you should expect more to follow. Any simplicity you observe does not imply that reality is simple: it means only that your partial model is relying on a simple feature of reality that happened to be there, which is a very different thing. It does nothing to stop reality from being a chaotic mess. However, an anthropic argument might be used here.
It doesn’t resolve the issue of which coding system to use for the descriptions, although I think that issue can be resolved without too much trouble (I am sure others would disagree).
I think the phrase you used, “the proportion of possible universes,” isn’t, in general, well defined without a measure (effectively a probability distribution) on that space, and there isn’t a unique best probability distribution.
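To illustrate (my own example, nothing from the thread): take universes to be infinite binary histories and compare the same two events under two perfectly coherent measures, i.i.d. fair bits versus i.i.d. bits with P(1) = 0.9. The measures disagree about which event is “bigger,” so “proportion” alone settles nothing until a measure is chosen:

```python
# Two coherent measures on binary histories ranking the same events differently.

def iid_measure(event_bits, p_one):
    """Probability that a history starts with the given bits, under i.i.d.
    coin flips with P(bit = 1) = p_one."""
    prob = 1.0
    for b in event_bits:
        prob *= p_one if b == 1 else (1.0 - p_one)
    return prob

event_A = (1, 1, 1)  # "the first three observations come up 1"
event_B = (0,)       # "the first observation comes up 0"

for p in (0.5, 0.9):
    a, b = iid_measure(event_A, p), iid_measure(event_B, p)
    bigger = "A" if a > b else "B"
    print(f"P(1)={p}: measure(A)={a:.3f}, measure(B)={b:.3f} -> {bigger} is larger")
```

Under fair bits, event B is four times larger; under the biased measure, event A is seven times larger.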