When there is a testable physical difference between hypotheses, we want the one that makes the correct prediction.
When there is no testable physical difference between hypotheses, we want to use the one that makes it easiest to make the correct prediction. By definition, we can never get a prediction that wouldn’t have happened were we using the other hypothesis, but we’ll get that prediction quicker. Neither hypothesis can be said to be ‘the way the world really is’ because there’s no way to distinguish between them, but the simpler hypothesis is more useful.
Wha? Then you must order the equivalent theories by running time, not code length. The two are frequently opposed: for example, the fastest known algorithm for matrix multiplication (in the big-O sense) is very complex compared to the naive one. In short, I feel you’re only digging yourself deeper into the hole of heresy.
Well, firstly, who said I cared at all about ‘heresy’? I’m not replying here in order to demonstrate my adherence to the First Church Of Bayes or something...
And while there are, obviously, occasions where ordering by running time and code length are opposed, in general when comparing two arbitrary programs which generate the same output, the longer one will also take longer. This is obvious when you consider it—if you have an arbitrary program X from the space of all programs that generate an output Y, there can only be a finite number of programs that generate that output more quickly. However, there are an infinite number of programs in the sample space that will generate the output more slowly, and that are also longer than X—just keep adding an extra ‘sleep 1’ before it prints the output, to take a trivial example.
In general, the longer the program, the more operations it performs and the longer it takes, when you’re sampling from the space of all possible programs. So while run time and code length aren’t perfectly correlated, they’re a very decent proxy for each other.
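A toy sketch of the ‘sleep 1’ padding mentioned above (my own illustration, not part of the original argument): every padded variant prints the same output, but each one is both longer as source text and slower to run.

```python
def padded_source(n_sleeps: int) -> str:
    """Source of a program that sleeps n_sleeps times, then prints 42."""
    return "import time\n" + "time.sleep(1)\n" * n_sleeps + "print(42)\n"

# Every variant produces identical output, but source length and runtime
# both grow with the amount of padding.
for n in (0, 1, 2):
    src = padded_source(n)
    print(len(src), "bytes of source, roughly", n, "extra seconds of runtime")
```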
if you have an arbitrary program X from the space of all programs that generate an output Y, there can only be a finite number of programs that generate that output more quickly.
Amusingly, this statement is false. If a program Z is faster than X, then there exist infinitely many versions of Z that also run faster than X: just add some never-executed code under an if(false) branch. I’m not sure whether your overall argument can be salvaged.
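A minimal illustration of the objection (my example, using a Python-style `if False` as the never-taken branch): the dead branch adds length but contributes nothing to the runtime.

```python
def compute() -> int:
    if False:
        # Never executed: arbitrarily much code can be parked here,
        # making the program longer without making it any slower.
        _ = [i ** 2 for i in range(10 ** 9)]
    return 42

print(compute())  # prints 42 immediately, regardless of the dead branch
```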
You’re quite correct, there. I was only including code paths that can ever actually be executed, in the same way I wouldn’t count comments as part of the program. This seems to me to be the correct thing to do, and I believe one could come up with some more rigorous reasoning along the lines of my previous comment, but I’m too tired right now to do so. I’ll think about this...
I was only including code paths that can ever actually be executed...
Wouldn’t a meta-algorithm that determines which paths of a given algorithm can ever execute necessarily fail on some possible algorithms, unless it were functionally equivalent to a halting oracle?
I’m not sure how problematic this is to your idea, but it’s one advantage that the simpler system of just counting total lines has.
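To make the halting-oracle point concrete, here is a sketch (my construction, with made-up names, and ignoring programs that crash rather than halt cleanly): any analyzer that could always decide whether the final line of the generated source is reachable would thereby decide whether an arbitrary P halts on x.

```python
def build_wrapper(P_source: str, x) -> str:
    """Return source whose last line is reachable iff P, run on x, finishes."""
    return (
        "def P(x):\n"
        + "".join("    " + line + "\n" for line in P_source.splitlines())
        + f"P({x!r})\n"
        + "print('reachable only if P finished')\n"
    )

# Example wrapper around a trivially terminating P:
print(build_wrapper("return x + 1", 41))
```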
I think there’s a difference between looking at a theory as data versus looking at it as code.
You look at a theory as code when you need to use the theory to predict the future of something it describes. (E.g., will it rain.) For this purpose, theories that generate the same predictions are equivalent; you don’t care about their size. In fact, even theories with different predictions can be considered equivalent, as long as their predictions are close enough for your purpose. (See Newtonian vs. relativistic physics applied to predicting kitchen-sink performance.) You do care about how fast you can run them, though.
However, you look at a theory as data when you need to reason about theories, and “make predictions” about them, particularly unknown theories related to known ones. As long as two theories make exactly the same predictions, you don’t have much need to reason about them. But if they predict differently for something you haven’t tested yet, but will test in the future, and you need to take an action now that has different outcomes depending on the result of the future test (simple example: a bet), then you need to try to guess which is more likely.
You need something like a meta-theory that predicts which of the two is more likely to be true. Occam’s razor is one of those meta-theories.
Thinking about it more, this isn’t quite a disagreement with the post immediately above; it’s not immediately obvious to me that a simpler theory is easier to reason about (though intuition says it should be). But I don’t think Occam’s razor is about how easy it is to reason about theories; it just claims simpler ones are more likely. (Although one could justify it like this: take an incomplete theory; add one detail; add another detail; at each step you have to pick between many details you might add, so the more details you add, the more likely you are to have picked a wrong one (remember you haven’t tested the successive theories yet); thus, the more complex your theory, the likelier you are to be wrong.)
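A toy version of that parenthetical justification, with numbers I have made up: if each untested detail independently has, say, a 0.9 chance of being right, the prior probability that the whole theory is right shrinks with every detail added.

```python
p_detail = 0.9  # assumed chance that any single added detail is correct
for n_details in (1, 3, 5, 10):
    print(n_details, "details -> prior about", round(p_detail ** n_details, 3))
# 1 -> 0.9, 3 -> 0.729, 5 -> 0.59, 10 -> 0.349
```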
The length of the program description is not really the measure of how easy it is to make a correct prediction. In fact, the shortest program for predicting is almost never the one you should use to make predictions in practice, precisely because it is normally quite slow. It is also very rarely the program which is easiest to manipulate mentally, since short programs tend to be very hard for humans to reason about.
Like PaulFChistiano said, the shortest accurate program isn’t particularly useful, but its predictive model is more a priori probable according to the universal / Occamian prior.
It’s really hard (and uncomputable) to discover, understand, and verify the shortest program that computes a certain input->prediction mapping. But we use the “shortest equivalent program” concept to judge which human-understandable program is more a priori probable.
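A rough sketch of that weighting (illustrative lengths, and ignoring the prefix-coding details of a true Solomonoff prior): a hypothesis encodable in L bits gets prior weight proportional to 2^-L, so the shorter of two predictively equivalent programs dominates the prior even if we never actually run it.

```python
lengths_in_bits = {
    "shorter equivalent program": 120,        # made-up lengths
    "same predictions, 40 bits longer": 160,
}
weights = {name: 2.0 ** -bits for name, bits in lengths_in_bits.items()}
total = sum(weights.values())
for name, w in weights.items():
    print(name, "-> relative prior", w / total)
# The 40-bit-longer version ends up with about 2**-40 of the probability mass.
```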
When there is no testable physical difference between hypotheses, we want to use the one that makes it easiest to make the correct prediction.
Yes, we want to use the hypothesis that is easiest to use. But if we use it, does that commit us to ‘believing’ in it? In the case of no testable physical difference between hypotheses, I propose that someone has no obligation to believe (or admit they believe) that particular theory instead of another one with the same predictions.
I enthusiastically propose that we say we ‘have’ a belief only when we use or apply a belief whose predictions differ empirically from those of the non-belief. Alternatively, we can use some other word instead of ‘belief’ that will serve to carry this more relevant distinction.
(Later: I realize this comment is actually directed at cousin_it, since he was the one that wrote, ‘your degree of belief in (prior probability of) a hypothesis should not depend on how clearly it allows you to think’. I also think I may have reiterated what Vladimir_Nesov wrote here.)