“Everything should be made as simple as possible, but not simpler.”—Albert Einstein.
But yes, Occam’s Razor is not a natural law or anything like that. It’s a heuristic—something that usually points in the right direction but very much not guaranteed to be correct.
It’s arguably a bit more than that, on account of Solomonoff induction. An “Occamian” prior that weights computable hypotheses according to the fraction of computer-program-space occupied by programs that compute their consequences provably performs—in an admittedly somewhat artificial sense—at least as well in the long run as any other prior, provided the observations you see really are generated by something computable.
More practically, there has to be a complexity penalty in the following sense: no matter what probabilities you assign, almost all very complex hypotheses have to be very improbable because otherwise your total probability has to be infinite.
An “Occamian” prior that weights computable hypotheses according to the fraction of computer-program-space occupied by programs that compute their consequences provably performs—in an admittedly somewhat artificial sense—at least as well in the long run as any other prior, provided the observations you see really are generated by something computable.
Yes and any prior that doesn’t assign things zero probability has this property. Why that one in particular?
any prior that doesn’t assign things zero probability has this property
Oh yes, so it does. Let me therefore be both more precise and more accurate.
Let p be an Occamian prior in this sense and q any computable prior. Then as cousin_it remarks “a computable human cannot beat Solomonoff in accumulated log scores by more than a constant, even if the universe is uncomputable and loves the human”; in other words, whatever q is—however much information about the world is built into it in advance—it can’t do much better than p, even though p encodes no information about the world (it can’t since what the theorem says is that even if you choose what the world does pessimally-for-p, it still does pretty well). This is not true for arbitrary priors.
I wasn’t arguing that we should all be actually doing Solomonoff induction. (Clearly we can’t.) I was saying that there is a somewhat-usable sense in which preferring simpler hypotheses seems to be The Right Thing, or at least A Right Thing. Namely, that basing your probabilities miraculously accurately on simplicity leads to good results. The same isn’t true if you put something other than “simplicity” in that statement.
I wonder whether there are any theorems along similar lines that don’t involve any uncomputable priors. (Something handwavily along the following lines: If p,q are two computable priors and p is dramatically enough “closer to Occamian” than q, then an agent with p as prior will “usually” do better than an agent with q as prior. But I have so far not thought of any statement of this kind that’s both credible and interesting.)
My impression is that Solomonoff induction starts by assuming the Occam’s Razor.
no matter what probabilities you assign, almost all very complex hypotheses have to be very improbable because otherwise your total probability has to be infinite.
That’s not a problem—all simple hypotheses can be just as improbable.
Again, I am not saying that Occam’s Razor is not a useful heuristic. It is. But it is not evidence.
I am not saying that Occam’s Razor is not a useful heuristic. It is. But it is not evidence.
Can you restate what you consider the use of Occam’s Razor to be, and what you consider evidence to be for?
Because from my perspective the purpose of evidence is to increase/decrease my confidence in various statements, and it seems to me that Occam’s Razor is useful for doing precisely that. So this distinction doesn’t make a lot of sense to me, and rereading the thread doesn’t clarify matters.
My impression is that Solomonoff induction starts by assuming the Occam’s Razor.
The fact that it buys you something interesting without making that assumption was the whole point of the paragraph you were commenting on.
That’s not a problem—all simple hypotheses can be just as improbable.
I don’t believe that is true. Perhaps I’ve been insufficiently clear by trying to be brief (the difficulty being that “very complex” is really shorthand for something involving a limiting process), so let me be less brief.
First: Suppose you have a list of mutually exclusive hypotheses H1, H2, etc., with probabilities p1, p2, etc. List them in increasing order of complexity. Then the sum of all the pj is finite, and therefore as j → infinity pj → zero. Hence, “very complex hypotheses (in this list) have to be very improbable” in the following sense: for any probability p, however small, there’s a level of complexity C such that every hypothesis from your list whose complexity is at least C has probability smaller than p.
That doesn’t quite mean that very complex hypotheses have to be improbable. Indeed, you can construct very complex high-probability hypotheses as very long disjunctions. And since p and ~p have about the same complexity for any p, it must in some sense be true that about as many very complex propositions have high probabilities as have low probabilities. (So what I said certainly wasn’t quite right.)
However, I bet something along the following lines is true. Suppose you have a probability distribution over propositions (this is for generating them, and isn’t meant to have anything directly to do with the probability that each proposition is true), and suppose we also assign all the propositions probabilities in a way consistent with the laws of probability theory. (I’m assuming here that our class of propositions is closed under the usual logical operations.) And suppose we also assign all the propositions complexities in any reasonable way. Define the essential complexity of a proposition to be the infimum of the complexities of propositions that imply it. (I’m pretty sure it’s always attained.) Then I conjecture that something like this is both true and fairly easy to prove: for any fixed probability level q, as C → oo, if you generate a proposition at random (according to the “generating” distribution) conditional on its essential complexity being at least C, then Pr(its probability >= q) tends to 0.
“Everything should be made as simple as possible, but not simpler.”—Albert Einstein.
But yes, Occam’s Razor is not a natural law or anything like that. It’s a heuristic—something that usually points in the right direction but very much not guaranteed to be correct.
It’s arguably a bit more than that, on account of Solomonoff induction. An “Occamian” prior that weights computable hypotheses according to the fraction of computer-program-space occupied by programs that compute their consequences provably performs—in an admittedly somewhat artificial sense—at least as well in the long run as any other prior, provided the observations you see really are generated by something computable.
More practically, there has to be a complexity penalty in the following sense: no matter what probabilities you assign, almost all very complex hypotheses have to be very improbable because otherwise your total probability has to be infinite.
Yes and any prior that doesn’t assign things zero probability has this property. Why that one in particular?
Oh yes, so it does. Let me therefore be both more precise and more accurate.
Let p be an Occamian prior in this sense and q any computable prior. Then as cousin_it remarks “a computable human cannot beat Solomonoff in accumulated log scores by more than a constant, even if the universe is uncomputable and loves the human”; in other words, whatever q is—however much information about the world is built into it in advance—it can’t do much better than p, even though p encodes no information about the world (it can’t since what the theorem says is that even if you choose what the world does pessimally-for-p, it still does pretty well). This is not true for arbitrary priors.
Well, since Solomonoff is uncomputable, this isn’t really a fair comparison.
I wasn’t arguing that we should all be actually doing Solomonoff induction. (Clearly we can’t.) I was saying that there is a somewhat-usable sense in which preferring simpler hypotheses seems to be The Right Thing, or at least A Right Thing. Namely, that basing your probabilities miraculously accurately on simplicity leads to good results. The same isn’t true if you put something other than “simplicity” in that statement.
I wonder whether there are any theorems along similar lines that don’t involve any uncomputable priors. (Something handwavily along the following lines: If p,q are two computable priors and p is dramatically enough “closer to Occamian” than q, then an agent with p as prior will “usually” do better than an agent with q as prior. But I have so far not thought of any statement of this kind that’s both credible and interesting.)
My impression is that Solomonoff induction starts by assuming the Occam’s Razor.
That’s not a problem—all simple hypotheses can be just as improbable.
Again, I am not saying that Occam’s Razor is not a useful heuristic. It is. But it is not evidence.
Can you restate what you consider the use of Occam’s Razor to be, and what you consider evidence to be for?
Because from my perspective the purpose of evidence is to increase/decrease my confidence in various statements, and it seems to me that Occam’s Razor is useful for doing precisely that. So this distinction doesn’t make a lot of sense to me, and rereading the thread doesn’t clarify matters.
The fact that it buys you something interesting without making that assumption was the whole point of the paragraph you were commenting on.
I don’t believe that is true. Perhaps I’ve been insufficiently clear by trying to be brief (the difficulty being that “very complex” is really shorthand for something involving a limiting process), so let me be less brief.
First: Suppose you have a list of mutually exclusive hypotheses H1, H2, etc., with probabilities p1, p2, etc. List them in increasing order of complexity. Then the sum of all the pj is finite, and therefore as j → infinity pj → zero. Hence, “very complex hypotheses (in this list) have to be very improbable” in the following sense: for any probability p, however small, there’s a level of complexity C such that every hypothesis from your list whose complexity is at least C has probability smaller than p.
That doesn’t quite mean that very complex hypotheses have to be improbable. Indeed, you can construct very complex high-probability hypotheses as very long disjunctions. And since p and ~p have about the same complexity for any p, it must in some sense be true that about as many very complex propositions have high probabilities as have low probabilities. (So what I said certainly wasn’t quite right.)
However, I bet something along the following lines is true. Suppose you have a probability distribution over propositions (this is for generating them, and isn’t meant to have anything directly to do with the probability that each proposition is true), and suppose we also assign all the propositions probabilities in a way consistent with the laws of probability theory. (I’m assuming here that our class of propositions is closed under the usual logical operations.) And suppose we also assign all the propositions complexities in any reasonable way. Define the essential complexity of a proposition to be the infimum of the complexities of propositions that imply it. (I’m pretty sure it’s always attained.) Then I conjecture that something like this is both true and fairly easy to prove: for any fixed probability level q, as C → oo, if you generate a proposition at random (according to the “generating” distribution) conditional on its essential complexity being at least C, then Pr(its probability >= q) tends to 0.
Sorry, will put this on hold for a bit—it requires some thinking and I don’t have time for it at the moment...
No problem!