From what I understand, there is a debate in epistemology / philosophy of science regarding the concept of simplicity (“Occam’s Razor”). Some hold that there is a justifiable basis for the concept in the sense that it is an indicator of which of a set of possible theories is more likely to be true. Others dispute this and say that there is no justified basis for simplicity arguments in this sense.
In a recent conversation I made the following assertion (more or less):
Those who say that simplicity arguments are unjustified are actually saying that we can never really know the truth about any theory at all, since there are always an infinite number of alternative and more complex theories that account equally for the data. The best we can do is to falsify a theory (as Karl Popper proposed), but beyond that we can never say anything about whether a theory is true.
So (I said), we face one of two choices: we can either allow for simplicity arguments, or we can give up on ever saying anything positive about the truth (beyond falsifying a few of the infinitely many possible theories).
Is this correct?
Adding to Vulture’s reply (that you cannot make absolute positive statements about truth): the modern view of “Occam’s razor” (at least in Bayesian thought) is the minimum description length (MDL) principle (http://en.wikipedia.org/wiki/Minimum_description_length), which can be rigorously formalized. In this formalism, simplicity becomes a prior over models. Multiplied by the likelihood (how well each model accounts for the data), this gives you a posterior over models. In this posterior, if you have two models that make exactly the same predictions, the simpler one is preferred (note that the more complicated one isn’t completely rejected; it’s just given lower posterior probability).
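Here is a minimal numerical sketch of that arithmetic (the description lengths, the likelihoods, and the 2^(−length) prior are made-up illustrations, not part of any particular MDL formalization):

```python
# Toy sketch of a complexity-penalizing prior; all numbers are made up.
# Model "A" takes 10 bits to describe, model "B" takes 25 bits, and both
# assign the same likelihood to the observed data.

description_length = {"A": 10, "B": 25}   # bits per model (hypothetical)
likelihood = {"A": 1e-3, "B": 1e-3}       # P(data | model), identical by assumption

# Prior proportional to 2^(-description length)
unnorm_prior = {m: 2.0 ** -length for m, length in description_length.items()}
z = sum(unnorm_prior.values())
prior = {m: p / z for m, p in unnorm_prior.items()}

# Posterior proportional to prior * likelihood
unnorm_post = {m: prior[m] * likelihood[m] for m in prior}
z = sum(unnorm_post.values())
posterior = {m: p / z for m, p in unnorm_post.items()}

print(posterior)  # "A" ends up around 0.99997; "B" is downweighted, not rejected
```

Since the likelihoods are equal, the posterior ratio collapses to the prior ratio, 2^(25−10) = 32768 : 1 in favor of the simpler model; data that only the more complicated model could explain would pull this back.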
There are deep theoretical reasons why MDL is a good way of assigning a prior to beliefs. If someone wants to reject Occam’s razor, they would have to give an alternative system and show either that it gives better long-term utility under the assumptions of MDL, or that the assumptions of MDL themselves are unfounded.
Perhaps you can comment on this opinion that “simpler models are always more likely” is false: http://www2.denizyuret.com/ref/domingos/www.cs.washington.edu/homes/pedrod/papers/dmkd99.pdf
That paper doesn’t seem to be arguing against Occam’s razor. Rather, it seems to be making the more specific point that greater model complexity on the training data doesn’t necessarily mean worse generalization error. I didn’t read through the whole article, so I can’t say whether the arguments make sense, but it seems that if you follow the procedure of updating your posteriors as new data arrives, the point is moot. Besides, the complexity-prior framework doesn’t make that claim at all.
In the Bayesian view, you can never really make absolute positive statements about truth anyway. Without a simplicity prior you would need some other kind of distribution. Even for computable theories, I don’t think you can ever have a uniform distribution over possible explanations (math people, feel free to correct me on this if I’m wrong!); you could have some kind of perverse non-uniform but non-simplicity-based distribution, I suppose, but I would bet some money that it would perform very badly.
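The basic reason I don’t think a uniform distribution can work: if every hypothesis H1, H2, … got the same probability c, then

```latex
\sum_{i=1}^{\infty} P(H_i) \;=\; \sum_{i=1}^{\infty} c \;=\;
\begin{cases}
\infty, & \text{if } c > 0,\\
0, & \text{if } c = 0,
\end{cases}
```

and neither case sums to 1, so no uniform choice of c is available over a countably infinite hypothesis space.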
Damn, I didn’t intend to hit that Retract button. Stupid mobile. In case it wasn’t clear, I do stand by this comment aside from the corrections offered by JoshuaZ.
Consistency forces you to have a simplicity-based prior if you have a countable set of non-overlapping hypotheses described using some finite collection of symbols (and some other minor conditions to ensure non-pathology). See prior discussion here. See also here for related issues.
You can act “as if” by just using the likelihood ratios and not operating with prior and posterior probabilities.
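For instance (with purely made-up numbers): if hypothesis H1 says a coin lands heads with probability 0.5 and H2 says 0.9, and we observe 8 heads in 10 flips, we can report how strongly the data favor H2 without ever writing down a prior:

```python
# Likelihood-ratio comparison with no prior at all; hypotheses and data are toy examples.
from math import comb

def binomial_likelihood(p, heads=8, flips=10):
    """P(observed data | heads-probability p) under a binomial model."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

lr = binomial_likelihood(0.9) / binomial_likelihood(0.5)
print(f"Likelihood ratio H2:H1 = {lr:.2f}")  # about 4.4 in favor of H2
```

That factor of roughly 4.4 is a statement about the evidence alone; turning it into a probability that H2 is true would require a prior.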
Why can’t there be other criteria to prefer some theories over other theories, besides simplicity?
Solomonoff induction justifies this: optimal induction uses a prior which weights hypotheses by their simplicity.
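Concretely, as I understand it, for a prefix-free universal machine U (writing |p| for the length of program p in bits), the Solomonoff prior of a string x is

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\lvert p \rvert}
```

so every extra bit of program length halves a hypothesis’s weight, which is exactly a simplicity penalty.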
Let me clarify my question. Why do you and iarwain1 think there are absolutely no other methods that can be used to arrive at the truth, even if they are sub-optimal ones?
The prior distribution over hypotheses is a distribution over programs, which are bit strings, which are integers. The distribution must be normalizable (its sum over all hypotheses must be 1). All distributions on the integers go to 0 for large integers, which corresponds to having lower probability for longer / more complex programs. Thus, all prior distributions over hypotheses have a complexity penalty.
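The “goes to 0” step is just the standard fact that the terms of a convergent series must vanish: writing p(n) for the prior probability of program number n,

```latex
\sum_{n=1}^{\infty} p(n) \;=\; 1 \;<\; \infty
\quad\Longrightarrow\quad
\lim_{n \to \infty} p(n) \;=\; 0 .
```

(Strictly this forces the probabilities to shrink only in the limit, not monotonically with length, but it still rules out treating arbitrarily complex programs as equally plausible.)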
You could conceivably use a criterion like “pick the simplest program that is longer than 100 bits” or “pick the simplest program that starts with 101101”, or things like that, but I don’t think you can get rid of the complexity penalty altogether.
I know what SI is. I’m not even pushing the point that SI is not always the best thing to do. I’m not sure whether it is, as it’s certainly not free of assumptions (such as the choice of the programming language / Turing machine), but let’s not go into that discussion.
The point I’m making is different. Imagine a world / universe where nobody has any idea what SI is. Would you be prepared to speak to them, to all their scientists, empiricists and thinkers, and say: “All your knowledge is purely accidental; you unfortunately have absolutely no methods for determining what the truth is, no reliable methods to sort out unlikely hypotheses from likely ones, while we, incidentally, do have the method and it’s called Solomonoff induction”? Because it looks like what iarwain1 is saying implies that. I’m sceptical of this claim.
You can have more to it than the complexity penalty, but you need a complexity penalty. The number of possibilities increases exponentially with the complexity. If the probability per hypothesis didn’t go down faster than that, the total probability would be infinite. But that’s impossible: it has to add up to one.
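To spell out the counting step: there are 2^n binary programs of length n, so if each of them kept probability at least c · 2^(−n) for some fixed c > 0 (that is, if the per-program probability didn’t fall faster than the number of programs grows), the total would be

```latex
\sum_{n=1}^{\infty} \underbrace{2^{\,n}}_{\#\text{ programs of length } n} \cdot \; c \cdot 2^{-n}
\;=\; \sum_{n=1}^{\infty} c \;=\; \infty ,
```

which contradicts the total having to be 1.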
The best we can do is to falsify a theory (as Karl Popper proposed), but beyond that we can never say anything about whether a theory is true.

You don’t have to go down that road. You can also say “the map is not the territory” and stop looking for true maps, instead valuing maps based on their accuracy or usefulness.