I’ve seen references to “anti-Occamian priors” in the Sequences, where Eliezer was talking about how not all possible minds would agree with Occam’s Razor. I’m not sure how such a prior could consistently exist.
It seems like Occam’s Razor just logically follows from the basic premises of probability theory. Assume the “complexity” of a hypothesis is how many bits it takes to specify under a particular method of specifying hypotheses, and that hypotheses can be of any length. Then for any prior that assigns nonzero probability to any finite hypothesis H, there must exist some level of complexity L such that any hypothesis more complex than L is less likely than H.
(That is to say, if a particular 13-bit hypothesis is 0.01% likely, then there are at most 9,999 other hypotheses with >= 0.01% probability mass. If the most complicated of these <10,000 hypotheses is 27 bits, then every hypothesis that takes 28 bits or more to specify is less likely than the 13-bit hypothesis. You can change around the numbers 13 and 0.01% and 27 as much as you want, but as long as there’s any hypothesis whatsoever with non-infinitesimal probability, then there’s some level where everything more complex than that level is less likely than that hypothesis.)
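To make that counting step concrete, here is a small sketch (in Python, with a made-up toy prior; the real argument of course ranges over the full infinite hypothesis space, where an explicit dictionary isn't available):

```python
# Toy prior over bitstring hypotheses, used only to illustrate the counting step.
toy_prior = {"0": 0.4, "10": 0.3, "110": 0.2, "1110": 0.05, "11110": 0.05}

def complexity_cutoff(prior: dict, h: str) -> int:
    """Return a bit-length L such that every hypothesis longer than L bits
    is less likely than hypothesis h under this prior."""
    p = prior[h]
    # Since probabilities sum to at most 1, at most 1/p hypotheses can have
    # probability >= p, so the hypotheses at least as likely as h form a finite set.
    at_least_as_likely = [k for k, q in prior.items() if q >= p]
    return max(len(k) for k in at_least_as_likely)

print(complexity_cutoff(toy_prior, "110"))  # 3: every hypothesis over 3 bits is less likely than "110"
```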
This seems to prove that an “anti-Occamian prior”—that is to say, a prior that always assigns more probability to more complex hypotheses and less to the less complex—is impossible. Or at least, that it assigns zero probability to every finite hypothesis. (You could, I suppose, construct a prior such that {sum of probability mass from all 1-bit hypotheses} is 1⁄3 of {sum of probability mass from all 2-bit hypotheses}, which is then itself 1⁄3 of {sum of probability mass from all 3-bit hypotheses}, and on and on forever, and that would indeed be anti-Occamian—but it would also assign zero probability to every finite hypothesis, which would make it essentially meaningless.)
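To spell out why that parenthetical construction really does assign zero everywhere, write $m_k$ for the total probability mass on all k-bit hypotheses ($m_k$ is notation introduced here, not in the post): the construction says $m_k = \tfrac{1}{3} m_{k+1}$ for every $k$, so $m_k = 3^{k-1} m_1$ and $\sum_{k \ge 1} m_k = m_1 \sum_{j \ge 0} 3^j$, which diverges unless $m_1 = 0$. Hence every level, and so every individual finite hypothesis, gets probability zero.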
Am I missing something about what “anti-Occamian prior” is really supposed to mean here, or how it could really be consistent?
Let’s imagine a Solomonoff-style inference problem, i.e. our inference-engine receives a string of bits and tries to predict the next bit.
Here’s an anti-Occamian environment for this inference problem. At timestep t, the environment takes the t bits which the inference-engine has seen, and feeds them into a Solomonoff inductor. The Solomonoff inductor does its usual thing: roughly speaking, it finds the shortest program which would output those t bits, then outputs the next out-bit of that shortest program. But our environment is anti-Occamian, so it sets bit t+1 to the opposite of the Solomonoff inductor’s output.
That’s an anti-Occamian environment. (Note that it is uncomputable, but still entirely well-defined mathematically.)
One example of an “anti-Occamian hypothesis” would be the hypothesis that the environment works as just described—i.e. at every step, the environment does the opposite of what Occam’s razor (as operationalized by a Solomonoff inductor) would predict.
An agent which believed the anti-Occamian hypothesis would, for instance, expect that any hypothesis which did a very good job predicting past data would almost-surely fail at the next timestep. Of course, in a world like ours, the agent would be wrong about this most of the time… but that would only make the agent more confident. After all, it’s always been wrong before, so (roughly speaking) on the anti-Occamian hypothesis it’s very likely to be right this time!
Again, note that this example is uncomputable but entirely well-defined mathematically. As with Solomonoff inductors, we could in principle construct limited-compute analogues.
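For instance, here is a crude limited-compute analogue (a sketch only: it uses zlib-compressed length as a stand-in for “length of the shortest program”, which is nothing like a real Solomonoff inductor, but it shows the shape of the construction):

```python
import zlib

def description_length(bits: str) -> int:
    """Crude stand-in for Kolmogorov complexity: length after zlib compression."""
    return len(zlib.compress(bits.encode()))

def occamian_prediction(history: str) -> str:
    """Predict whichever next bit gives the more compressible continuation
    (a rough, computable proxy for what a Solomonoff inductor would output)."""
    return min("0", "1", key=lambda b: description_length(history + b))

def anti_occamian_environment(n_steps: int) -> str:
    """At each step, output the opposite of the Occamian proxy's prediction."""
    history = ""
    for _ in range(n_steps):
        predicted = occamian_prediction(history)
        history += "1" if predicted == "0" else "0"
    return history

print(anti_occamian_environment(40))
```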
The standard example of a system which (maybe) behaves in this anti-inductive way in the real world is a financial market.
This does bring in the Berry paradox, since after all the hypothesis that the environment follows such a rule is in fact a very short hypothesis itself!
The resolution is that, despite being describable by a short rule, it does not correspond to any computable program. Since the Solomonoff inductor already requires a Turing oracle, computing the predictions of such a hypothesis would require an oracle one level further up the hierarchy.
A super-Solomonoff inductor with a higher level Turing oracle can of course compute such a result and would assign it high weight in the described environment. Then that inductor likewise has short but uncomputable hypotheses, and so on.
Does this apply at all to anything more probabilistic than just taking the single most likely hypothesis and reversing the next bit(s) it outputs? An Occamian prior doesn’t just mean “this is the shortest hypothesis; therefore it is true”; it means hypotheses are weighted by their simplicity. It’s possible for an Occamian prior to think the shortest hypothesis is most likely wrong, if several slightly longer hypotheses have more probability in total.
Of course, the generalization would be that the environment inverts whatever next bit some Occamian agent thinks is most probable.
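Here is a sketch of that generalization with a deliberately tiny, hypothetical hypothesis class (programs of the form “repeat this bit-pattern forever”, weighted by 2^(-2·length) so the prior sums to something finite). The environment flips whichever next bit the whole weighted mixture favours, not just the output of the single shortest hypothesis:

```python
from itertools import product

def hypotheses(max_len=4):
    """Toy hypothesis class: every bit-pattern of length 1..max_len,
    read as 'the environment repeats this pattern forever'."""
    for length in range(1, max_len + 1):
        for pattern in product("01", repeat=length):
            yield "".join(pattern)

def consistent(pattern: str, history: str) -> bool:
    """Does repeating the pattern reproduce the observed history?"""
    repeated = (pattern * (len(history) // len(pattern) + 1))[: len(history)]
    return repeated == history

def mixture_probability_of_one(history: str, max_len=4) -> float:
    """P(next bit = 1) under an Occamian mixture: each surviving pattern is
    weighted by 2^(-2 * len(pattern)), so simpler patterns count for more."""
    num = den = 0.0
    for pattern in hypotheses(max_len):
        if consistent(pattern, history):
            w = 2.0 ** (-2 * len(pattern))
            den += w
            if pattern[len(history) % len(pattern)] == "1":
                num += w
    return num / den if den > 0 else 0.5

def anti_occamian_bit(history: str) -> str:
    """The environment outputs whichever bit the mixture thinks is less likely."""
    return "0" if mixture_probability_of_one(history) > 0.5 else "1"

history = ""
for _ in range(12):
    history += anti_occamian_bit(history)
print(history)
```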
I think this is similar to the 2010 post A Proof of Occam’s Razor? …which spawned 139 comments. I didn’t read them all. But here’s one point that came up a couple times:
Let N be a ridiculously, comically large finite number like N=3↑↑↑↑↑3. Take the N simplest possible hypotheses. This is a finite set, so we can split up probability weight such that simpler hypotheses are less likely, within this set.
For example, rank-order these N hypotheses by decreasing complexity, and assign probability 2^(-n) to the n’th on the list. That leaves 2^(-N) leftover probability weight for all the infinitely many other hypotheses outside that set, and you can distribute that however you like; N is so big that it doesn’t matter.
Now, simpler hypotheses are less likely, until we go past the first N hypotheses. But N is so ridiculously large that that’s never gonna happen.
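A toy version of that construction, with a small N standing in for 3↑↑↑↑↑3 (the hypotheses here are just labelled 1..N, with complexity equal to the label, purely for illustration):

```python
N = 10  # stand-in for a comically large number like 3^^^^^3

# Rank the N hypotheses by decreasing complexity (most complex first) and give
# the n'th on the list probability 2^(-n): the most complex gets 1/2, the
# simplest gets 2^(-N), so within this set simpler really is less likely.
ranked = sorted(range(1, N + 1), key=lambda h: -h)
prior = {h: 2.0 ** -(n + 1) for n, h in enumerate(ranked)}

leftover = 1.0 - sum(prior.values())  # = 2^(-N), for everything outside the set
print(prior, leftover)
```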
I want to say something like: “The bigger N is, the bigger a computer needs to be in order to implement that prior; and given that your brain is the size that it is, it can’t possibly be setting N=3↑↑↑↑↑3.”
Now, this isn’t strictly correct, since the Solomonoff prior is uncomputable regardless of the computer’s size, etc. - but is there some kernel of truth there? Like, is there a way of approximating the Solomonoff prior efficiently, which becomes less efficient the larger N gets?
There are practical anti-Occam calculations: start uniform over all bitstrings, and every time you find a short program that produces a bitstring, turn the probability of that bitstring down.
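A minimal sketch of what such a calculation could look like, assuming 12-bit strings and a toy program class (“repeat this pattern”) standing in for real programs:

```python
from itertools import product

def shortest_program_length(s: str) -> int:
    """Length of the shortest 'repeat this pattern' program printing s
    (a toy, hypothetical program class, not real Kolmogorov complexity)."""
    for p in range(1, len(s) + 1):
        if all(s[i] == s[i % p] for i in range(len(s))):
            return p
    return len(s)

# Start uniform over all 12-bit strings...
strings = ["".join(bits) for bits in product("01", repeat=12)]
weights = {s: 1.0 for s in strings}

# ...and whenever a string turns out to have a short program, turn its
# probability down (here: halve it once per bit of "compression" achieved).
for s in strings:
    saved = len(s) - shortest_program_length(s)
    weights[s] *= 2.0 ** (-saved)

total = sum(weights.values())
anti_occam_prior = {s: w / total for s, w in weights.items()}

# The all-zeros string (period 1) now gets far less weight than an aperiodic one.
print(anti_occam_prior["000000000000"], anti_occam_prior["011010111001"])
```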
I think it depends on the size of your outcome-space? If you assume a finite (probably generalizes to compact?) space of outcomes, then your argument doesn’t really go through. But a lot of things seem better modelled with infinite outcome spaces, so in that case your argument seems to go through, at least under conventional formalisms.
This turns out to be the case for countable probability spaces, like Turing machines, as Qiaochu Yuan explains in this comment:
https://www.lesswrong.com/posts/fpRN5bg5asJDZTaCj/against-occam-s-razor?commentId=xFKD5hZZ68QXttHqq
I think the method can’t be arbitrary for your argument to work. If we were to measure the “complexity” of a hypothesis by how many outcomes the event set of the hypothesis contains (or by how many disjuncts some non-full disjunctive normal form has), then the more “complex” hypotheses would actually be more likely, not less.
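The reason is simple monotonicity: adding disjuncts (outcomes) can only enlarge an event, while adding conjuncts can only shrink it, so $P(A \land B) \le P(A) \le P(A \lor B)$. A disjunct-counting complexity measure therefore pushes in the opposite direction from Occam’s razor.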
Of course that’s an implausible definition of complexity. For example, using conjunctive normal forms for measuring complexity is overall more intuitive, since A∧B is considered more complex than a single proposition C, and it would indeed lead to more “complex” hypotheses being less likely. But universally quantified statements (like laws of nature) correspond to very long (perhaps infinitely long) CNFs, which would suggest that they are highly complex and unlikely. Yet intuitively, laws of nature can be simple and likely.
This was basically the problem Wittgenstein later identified with his definition of logical probability in the Tractatus a hundred years ago. Laws would have logical probability 0, which is absurd. It would mean no amount of finite evidence can confirm them to any degree. Later Rudolf Carnap worked on this problem, but I’m pretty sure it is still considered to be unsolved. If someone does solve it, and finds an a priori justification for Ockham’s razor, that could be used as a solution to the problem of induction, the core problem of epistemology.
So finding a plausible definition of hypothesis complexity for which Ockham’s razor also holds is a very hard open problem.
What exactly is an “event set” in this context? I don’t think a hypothesis would necessarily correspond to a particular set of events that it permitted, but rather its own probability distribution over which events you would be more likely to see under that hypothesis. In that sense, an event set with no probabilities attached would not be enough to specify which hypothesis you were talking about, because multiple hypotheses could correspond to the same set of permitted events despite assigning very different probabilities to each of those events occurring.
A hypothesis corresponds to an “event” in Kolmogorov probability theory. Such an event is a set of “outcomes” (drawn from the mutually exclusive and exhaustive possibilities), namely those under which the hypothesis is true. An outcome is basically a possible world. So a hypothesis can be thought of as the set of (or the disjunction of) possible worlds in which the hypothesis is true. So the probability of an event/hypothesis is exactly the sum of the probabilities of its outcomes.
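In symbols, viewing a hypothesis $H$ as its set of possible worlds: $P(H) = \sum_{\omega \in H} P(\omega)$.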
That being said, this way of seeing things is problematic, since most hypotheses are true in infinitely many “possible worlds”. It is not clear how we could sum infinitely many probabilities. And if we don’t use possible worlds as outcomes, but use arbitrary DNFs, the decomposition (partition) becomes non-unique, because outcomes are no longer distinguishable from normal events. Possible worlds at least have the special feature of being “atomic” in the sense that they can’t be decomposed into mutually exclusive and exhaustive disjuncts.
Ideally we would want finitely many non-arbitrary outcomes for each hypothesis. Then we could apply the principle of indifference, assign equal probabilities, and create the sum. But that doesn’t seem to work.
So the whole disjunctive approach seems not exactly promising. An alternative is Wittgenstein-style logical atomism, which posits something like “atomic propositions”, presumably about sense data, which are all independent of each other. Hypotheses/events are then supposed to be logical combinations of these. This approach is riddled with technical difficulties as well; it is unclear how it could be made precise.