My own way of thinking about Occam's Razor is through model selection. Suppose you have two competing statements H1 (the witch did it) and H2 (it was chance, or possibly something other than a witch caused it, i.e. H2 = ¬H1) and some observations D (the sequence came up 0101010101). Then the preferred statement is whichever is more probable, calculated as
$$p(H \mid D) = \frac{p(H)\, p(D \mid H)}{p(D)}$$
This is simply Bayes' rule, where
$$p(D \mid H) = \int_\theta p(D \mid \theta, H)\, p(\theta \mid H)\, d\theta$$
and the model is parametrized by some parameters θ.
Now, all this is just the mathematical way of saying that a hypothesis with more parameters (or, more specifically, more possible outcomes that it predicts) will not be as strong a statement as one that predicts a smaller set of outcomes.
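To make the integral concrete, here is a minimal Python sketch; the two coin models and the uniform prior are illustrative assumptions of mine, not part of the argument above. It compares a zero-parameter "fair coin" model against a one-parameter "biased coin" model on the specific sequence 0101010101: the extra parameter spreads probability over more possible data sets, so the specific observation gets less of it.

```python
from scipy.integrate import quad

# A minimal sketch: marginal likelihood of the specific sequence
# 0101010101 (5 heads, 5 tails) under two hypothetical coin models.
heads, tails = 5, 5

# H_fair: a fair coin, no free parameters -- a sharp prediction.
p_fair = 0.5 ** (heads + tails)

# H_biased: the bias theta is a free parameter with a uniform prior,
# so p(D|H) = integral over theta of p(D|theta) p(theta) dtheta.
p_biased, _ = quad(lambda t: t**heads * (1 - t)**tails, 0, 1)

print(f"p(D | fair)   = {p_fair:.6f}")    # ~0.000977 = 2^-10
print(f"p(D | biased) = {p_biased:.6f}")  # ~0.000361 = 1/2772
```

The sharper model wins here despite having no tunable parameter, which is exactly the penalty on weak, spread-out predictions that the integral encodes.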
In the witch example this would be:
H1 = There exists an advanced intelligent being (of at least not much less than human intelligence) that can do things beyond anything that has ever been reproduced scientifically, that for some reason chooses to live on our street and act mostly like a human, and that chooses to influence my sequence of coin tosses so that it ends up in some seemingly pattern-like form
H2 = The coin tosses are ruled by chance and might end up in the set of possible outcomes that seem to form a pattern ({0101010101, 1111100000, 1100110011, …})
D=The coin toss ended up as 0101010101
The way I stated the hypotheses,
$$p(D \mid H_1) = \frac{p(D \mid H_2)}{\text{fraction of outcomes that look like a pattern}},$$
since H1 concentrates all of its probability mass on the pattern-like outcomes while H2 spreads it over every sequence.
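To spell this step out (assuming, for the derivation only, that the witch picks uniformly among the N pattern-like outcomes, an assumption of mine):

$$p(D \mid H_1) = \frac{1}{N} = \frac{2^{-10}}{N / 2^{10}} = \frac{p(D \mid H_2)}{N / 2^{10}}$$

where N/2^10 is precisely the fraction of outcomes that look like a pattern.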
Now what remains is to estimate the priors and the fraction of outcomes that look like a pattern. We can skip p(D), since we are only interested in the ratio p(H1|D) : p(H2|D).
Now, comparing the number of conditions in the hypotheses and how surprised I am by each, I would roughly estimate the ratio of the priors as something like $2^{100}$ in favor of chance: the witch hypothesis goes against many beliefs about the world that I have formed over many years, it includes weird living arrangements for this hypothetical alien entity, it picks me out as one agent of many in the neighborhood, and it singles out an arbitrary action of mine and an arbitrary set of outcomes.
For the sake of completeness: the fraction of outcomes that look like a pattern is hard to estimate exactly. However, my way of thinking about it is to ask how soon in the sequence I would postulate the specific sequence it ended up in. After 0101, I think 0101010101 is the most obvious pattern to continue with, so the remaining six tosses are predicted by the pattern; roughly, then, the fraction is about $2^{-6}$, and the observation carries six bits of evidence in favor of the witch.
In conclusion, I would say that the witch hypothesis is lacking around 94 bits (100 − 6) of evidence for me to believe it as much as the chance hypothesis.
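As a sanity check on the arithmetic, a short Python sketch using the two rough numbers from above (both are subjective estimates of mine, not measurements):

```python
# Posterior log-odds (in bits) of witch vs. chance, from the estimates above.
prior_bits = -100    # log2 of p(H1)/p(H2): priors ~2^100 in favor of chance
likelihood_bits = 6  # log2 of p(D|H1)/p(D|H2): pattern fraction ~2^-6

posterior_bits = prior_bits + likelihood_bits
print(f"log2 odds (witch : chance) = {posterior_bits}")  # -94
```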
The downside of this approach compared to Solomonoff induction and minimum message length is that it is clunkier to use, and it is easy to forget to include conditions or complexity in the priors, the same way they can be lost in the English language. The upside is that as a model it is simpler and less ad hoc: it builds directly on the product rule of probability and the fact that probabilities sum to one, and should thus be preferred by Occam's Razor ;).