It turns out there’s an extremely straightforward mathematical reason why simplicity is to some extent an indicator of high probability.
Consider the list of all possible hypotheses with finite length. We might imagine there being a labeling of this list, starting with hypothesis 1, then hypothesis 2, and continuing on for an infinite number of hypotheses. This list contains the hypotheses capable of being distinguished by a human brain, input into a computer, having their predictions checked against the others, and other nice properties like that. In order to make predictions about which hypothesis is true, all we have to do is assign a probability to each one.
The obvious answer is just to give every hypothesis equal probability. But since there’s an infinite number of these hypotheses, that can’t work, because we’d end up giving every hypothesis probability zero! So (and here’s where it starts getting Occamian) it turns out that any valid probability assignment has to get smaller and smaller as we go to very high numbers in the list (so that the probabilities can all add up to 1). At low numbers in the list the probability is, in general, allowed to go up and down, but hypotheses with very high numbers always have to be low probability.
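Here’s a quick numerical sketch of that constraint (the particular prior p(n) = 2^-n is just an arbitrary example I picked, not the only valid choice): a uniform assignment gets squeezed toward zero as the list grows, while a decreasing assignment can sum to 1, and whatever prior you pick, the probability left over for the tail of the list has to shrink.

```python
# Sketch: why a prior over an infinite hypothesis list has to shrink in the tail.
# The particular choice p(n) = 2**-n is only an illustration, not the unique option.

def geometric_prior(n):
    """Probability assigned to the n-th hypothesis (n = 1, 2, 3, ...)."""
    return 2.0 ** -n

# A uniform prior over N hypotheses gives 1/N to each; as N grows, every
# individual probability is squeezed toward zero.
for n_hypotheses in (10, 1_000, 1_000_000):
    print(f"uniform over {n_hypotheses:>9}: each hypothesis gets {1 / n_hypotheses:.2e}")

# A decreasing prior can sum to 1, and the probability left for everything
# beyond position N necessarily vanishes as N grows.
for cutoff in (10, 20, 50):
    head = sum(geometric_prior(n) for n in range(1, cutoff + 1))
    print(f"first {cutoff:>2} hypotheses carry {head:.12f}; the rest share {1 - head:.2e}")
```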
There’s a caveat, though—the position in the list can be arbitrary, and doesn’t have to be based on simplicity. But it turns out that it is impossible to make any ordering of hypotheses at all, without having more complicated hypotheses have higher numbers than simpler hypotheses on average.
There’s a general argument for this (there’s a more specific argument based on universal Turing machines that you can find in a good textbook) that’s basically a reflection of the fact that there’s a simplest hypothesis, but no “most complex” hypothesis, just like how there’s no biggest positive integer. Even if you tried to shuffle up the hypotheses really well, each simple hypothesis has to end up at some finite place in the list (otherwise it ends up at no place in the list and it’s not a valid shuffling). And if the simple hypotheses are all at finite places in the list, that means there’s still an infinite number of complex hypotheses with higher numbers, so far enough out in the list almost everything is complex, and those high-numbered, complex hypotheses have to get low probability.
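And here’s a finite toy version of the shuffling point (binary strings standing in for hypotheses, string length standing in for complexity, and a random permutation standing in for an arbitrary re-ordering): however you shuffle, every string of length at most k lands at some definite position, and everything past the last of them is more complex than k.

```python
# Toy version of the shuffling argument: hypotheses are binary strings,
# "complexity" is string length, and we re-order them with a random permutation.
import random
from itertools import product

MAX_LEN = 12  # finite stand-in for the infinite list

hypotheses = [
    "".join(bits)
    for length in range(1, MAX_LEN + 1)
    for bits in product("01", repeat=length)
]

random.shuffle(hypotheses)  # an arbitrary ordering, not based on simplicity

# For each complexity bound k, find the last list position occupied by a
# string of length <= k. Past that position, every hypothesis is longer than k.
for k in (1, 2, 4, 8):
    last_simple = max(i for i, h in enumerate(hypotheses) if len(h) <= k)
    print(f"strings of length <= {k} all sit at positions 0..{last_simple}; "
          f"the remaining {len(hypotheses) - last_simple - 1} entries are all longer")
```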
Why would the language the hypotheses are framed in have any impact on which statements are most likely to be true? The article mentions that in domains where the correct hypotheses are complex in the proof language, the principle tends to be anti-productive. There is no guarantee that the language is well suited to describe the target phenomenon if we are allowed to freely pick the phenomenon to track!
Wouldn’t any finite complexity class also contain only finitely many hypotheses, and wouldn’t those all sit at finite positions in the list? The problem only arises for hypotheses of infinite complexity. And it could be argued that if the index is a hyperinteger it can still count as a valid placement.
With surreal probability it would be no problem to give an equal infinitesimal probability to an infinite list of hypotheses.
Wouldn’t any finite complexity class also contain only finitely many hypotheses
Think of it as like the set of all positive integers of finite size. As it turns out, every single integer has finite size! You show me an integer, and I’ll show you its size :P But even though each individual element is less than infinity, the size of the set is infinite.
Why would the language the hypotheses are framed in have any impact on which statements are most likely to be true?
Choosing which language to use is ultimately arbitrary. But because there’s no way to assign the same probability to infinitely many discrete things and have the probabilities still add up to one, we’re forced into a choice of some “natural ordering of hypotheses” in which the probability is monotonically decreasing. This does not happen because of any specific fact about the external world—this is a property of what it looks like to have hypotheses about something that might be arbitrarily complicated.
The article mentions that in domains where the correct hypotheses are complex in the proof language the principle tends to be anti-productive.
Well… it’s anti-productive until you eliminate the simple-but-wrong alternatives, and then suddenly it’s the only thing allowing you to choose the right hypothesis out of the list that contains many more complex-and-still-accurate hypotheses.
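To make that concrete, here’s a toy sketch (the candidate rules and data are made up, not anything from the article) of “check hypotheses from simplest to most complex and keep the first one the observations don’t eliminate”: the simple-but-wrong rule gets knocked out by the data, and the simplicity ordering is then what picks the clean rule over a more complex rule that fits the same observations equally well.

```python
# Toy sketch: check hypotheses from simplest to most complex and keep the
# first one the observations don't rule out. Candidates and data are invented.

def first_surviving_hypothesis(candidates, observations):
    """Return the first (i.e. simplest) candidate whose rule fits the data."""
    for name, rule in candidates:
        if rule(observations):
            return name
    return None

observations = [2, 4, 6, 8]

candidates = [  # listed from simpler to more complex
    ("always 2", lambda xs: all(x == 2 for x in xs)),
    ("even numbers 2, 4, 6, ...", lambda xs: xs == [2 * (i + 1) for i in range(len(xs))]),
    ("even numbers, except the 10th term is 7",
     lambda xs: xs == [7 if i == 9 else 2 * (i + 1) for i in range(len(xs))]),
]

print(first_surviving_hypothesis(candidates, observations))
# "always 2" is eliminated by the data; "even numbers 2, 4, 6, ..." survives and
# is chosen over the more complex third rule, which also fits these observations.
```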
If you want a much better explanation of these topics than I can give, and you like math, I recommend the textbook by Li and Vitanyi.
9 has four digits as “1001” in binary and one digit in decimal, so there is no single function from integers to their size. There is no such thing as the size of an integer independent of the digit system used (well, you could refer to some set constructions, but then the size would be the integer itself).
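Just to make the base-dependence concrete (a throwaway comparison along the lines of the 9 example, nothing more): the digit counts do differ between bases, though they track each other up to a roughly constant factor of log2(10) ≈ 3.32.

```python
# How many digits the same integers take in binary vs. decimal.
# The counts differ (so "size" does depend on the digit system), but only by
# a roughly constant factor of log2(10) ~ 3.32.
for n in (9, 100, 10_000, 10**9):
    binary_digits = len(bin(n)) - 2      # strip the "0b" prefix
    decimal_digits = len(str(n))
    print(f"{n}: {binary_digits} binary digits, {decimal_digits} decimal digits, "
          f"ratio {binary_digits / decimal_digits:.2f}")
```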
With surreals we could have ω pieces of equal probability ε that sum to exactly 1 (although ordinal numbers only apply to orders, which can be different from cardinal numbers. While for finite quantities there is no big distinction between ordinal and cardinal, “infinitely many discrete things” might refer to a cardinal concept. However, for hypotheses that are listable (such as those formed as arbitrary-length strings of letters from a (finite) alphabet), the ωth index should be well founded).
It is not about arbitrary complexity but about probability over infinitely many options. We could, for example, order the hypotheses by the amount of negation used first and the number of symbols used second. This would not be any less natural and would result in a different probability distribution. Or arguing that the complexity ordering is the one that produces the “true” probabilities is a reframing of the question of whether the simplicity formulation is truth-indicative.
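A toy version of that alternative ordering (the formula strings and the rank-based prior 2^-(rank+1) are my own arbitrary choices, purely for illustration): sorting by symbol count versus sorting by negation count first gives the same hypotheses different ranks, and therefore different probabilities.

```python
# Toy illustration: two "natural" orderings of the same hypothesis strings
# give different ranks, and hence different rank-based priors.
hypotheses = ["p", "~p", "p&q", "~p&q", "~~p", "p&q&r"]  # made-up formulas

by_length = sorted(hypotheses, key=lambda h: (len(h), h))
by_negations_then_length = sorted(hypotheses, key=lambda h: (h.count("~"), len(h), h))

def rank_prior(ordering):
    """Assign 2^-(rank+1) to each hypothesis according to its position."""
    return {h: 2.0 ** -(i + 1) for i, h in enumerate(ordering)}

p_length = rank_prior(by_length)
p_negation = rank_prior(by_negations_then_length)

for h in hypotheses:
    print(f"{h!r}: length-ordered prior {p_length[h]:.4f}, "
          f"negation-first prior {p_negation[h]:.4f}")
```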
If I use a complexity-ambivalent method I might need to do fewer eliminations before encountering a working one. There is no need to choose among the accurate hypotheses if we know that any of them is true. If I encounter a working hypothesis there is no need to search for a simpler form of it. Or if I encounter a theory of gravitation using ellipses, should I continue the search to find one that uses simpler concepts like circles only?
The approach of the final authors mentioned on the page seems especially interesting to me. I am also interested to note that their result agrees with Jaynes’. Universalizability seems to be important to all the most productive approaches there.
Or arguing that the complexity ordering is the one that produces the “true” probabilities is a reframing of the question of whether the simplicity formulation is truth-indicative.
If the approach that says simplicity is truth-indicative is self-consistent, that’s at least something. I’m reminded of the LW sequence that talks about toxic vs healthy epistemic loops.
If I encounter a working hypothesis there is no need to search for a simpler form of it.
This seems likely to encourage overfitted hypotheses. I guess the alternative would be wasting effort on searching for simplicity that doesn’t exist, though. Now I am confused again, although in a healthier and more abstract way than originally. I’m looking for where the problem in anti-simplicity arguments lies rather than taking them seriously, which is easier to live with.
Honestly, I’m starting to feel as though perhaps the easiest approach to disproving the author’s argument would be to deny his assertion that simple processes in Nature are relatively uncommon. Off the top of my head: argument one is replicators; argument two is that simpler processes are smaller, so more of them fit into the universe than complex ones would; argument three is that the universe seems to run on math (which might be begging the question a bit, although I don’t think so, since it’s kinda amazing that anything more meta than perfect atomist replication can lead to valid inference—again the connection to universalizability surfaces); argument four is an attempt, inspired by Descartes, to undeniably avoid begging the question: if nothing else, we have access to at least one form of Nature unfiltered by our perceptions of simplicity: the perceptions themselves, which via anthropic-type induction arguments we should assume-more-than-not to be of more or less average representativeness. (Current epistemic status: playing with ideas very nonrigorously, wild and free.)
The approach of the final authors mentioned on the page seems especially interesting to me. I am also interested to note that their result agrees with Jaynes’. Universalizability seems to be important to all the most productive approaches there.
If I encounter a working hypothesis there is no need to search for a simpler form of it.
This seems likely to encourage overfitted hypotheses.
The approach of the final authors mentioned on the page seems especially interesting to me. I am also interested to note that their result agrees with Jaynes’. Universalizability seems to be important to all the most productive approaches there.
I think this is relevant: https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)#Jaynes.27_solution_using_the_.22maximum_ignorance.22_principle
Thanks for this! Apparently, among many economists Occam’s Razor is viewed as just a modelling trick, judging from the conversations on Reddit I’ve had recently. I’d felt that perspective was incorrect for a while, but after encountering it so many times, and then later on being directed to this paper, I’d begun to fear my epistemology was built on shaky foundations. It’s relieving to see that’s not the case.
It turns out there’s an extremely straightforward mathematical reason why simplicity is to some extent an indicator of high probability.
Is there anything ruling out a bias towards simplicity that is extremely small, or are there good reasons to think the bias would be rather large? Figuring out how much predictive accuracy to exchange for theory conciseness seems like a tough problem, possibly requiring some arbitrariness.
That only works if you have a countable set of mutually exclusive hypotheses, and exactly one of them is true. Not all worlds are like that. For example, if the “world” is a single real number picked uniformly from [0,1], then it’s hard to say what the hypotheses should be.
If hypotheses aren’t restricted to being mutually exclusive, the approach doesn’t work. For example, if you randomly generate sentences about the integers in some formal theory, then short sentences aren’t more likely to be true than long ones. That leads to a problem if you want to apply Occam’s razor to choosing physical theories, which aren’t mutually exclusive.
Another reason to prefer the simplest theories that fit observations well is that they make life easier for engineers. Kevin Kelly’s Occam efficiency theorem is related, but the idea is really simpler than that.
It turns out there’s an extremely straightforward mathematical reason why simplicity is to some extent an indicator of high probability.
And what exactly does that bit of mathematical wankery with infinite lists have to do with trying to figure out which maps are better in our reality? Does it have any practical application?