Thought experiments on simplicity in logical probability
A common feature of many proposed logical priors is a preference for simple sentences over complex ones. This is something like an extension of Occam’s razor into mathematics: simple sentences are more likely to be true. So, as the saying goes, “why not?”
Well, the analogy has some wrinkles—unlike hypothetical rules for the world, logical sentences do not form a mutually exclusive set. Instead, for every sentence A there is a sentence not-A with pretty much the same complexity, and probability 1-P(A). So you can’t make the probability smaller for all complex sentences, because their negations are also complex sentences! If you don’t have any information that discriminates between them, A and not-A will both get probability 1⁄2 no matter how complex they get.
But if our agent knows something that breaks the symmetry between A and not-A, like that A belongs to a mutually exclusive and exhaustive set of sentences with differing complexities, then it can assign higher probabilities to simpler sentences in this set without breaking the rules of probability. Except, perhaps, the rule about not making up information.
The question: is the simpler answer really more likely to be true than the more complicated answer, or is this just a delusion? And if it really is more likely, is that for some ontologically basic reason, or for a contingent and explainable reason?
There are two complications to draw your attention to. The first is in what we mean by complexity. Although it would be nice to use the Kolmogorov complexity of any sentence, which is the length of the shortest program that prints the sentence, such a thing is uncomputable by the kind of agent we want to build in the real world. The only thing our real-world agent is assured of seeing is the length of the sentence as-is. We can also find something in between Kolmogorov complexity and length by doing a brief search for short programs that print the sentence—this is the meaning usually intended in this article, and I’ll call it “apparent complexity.”
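As an illustration, here is a minimal sketch of one computable proxy for apparent complexity. It uses compressed length rather than an actual search for short programs (the search is what the paragraph above describes; compression is just an easy off-the-shelf stand-in I’m substituting for illustration):

```python
import random
import zlib

def apparent_complexity(sentence: str) -> int:
    """A crude computable stand-in for apparent complexity: the length in
    bytes of the sentence after compression. A closer match to the text
    would briefly search for short programs that print the sentence."""
    return len(zlib.compress(sentence.encode()))

random.seed(0)
regular = "0" * 100  # a very simple sentence
messy = "".join(random.choice("0123456789") for _ in range(100))  # a random one

# Both strings have the same length, so raw length alone cannot tell them
# apart; the compression proxy ranks the regular one as much simpler.
assert apparent_complexity(regular) < apparent_complexity(messy)
```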
The second complication is in what exactly a simplicity prior is supposed to look like. In the case of Solomonoff induction the shape is exponential—more complicated hypotheses are exponentially less likely. But why not a power law? Why not even a Poisson distribution? Does the difficulty of answering this question mean that thinking that simpler sentences are more likely is a delusion after all?
Thought experiments:
1: Suppose our agent knew from a trusted source that some extremely complicated sum could only be equal to A, or to B, or to C, which are three expressions of differing complexity. What are the probabilities?
Commentary: This is the most sparse form of the question. Not very helpful regarding the “why,” but handy to stake out the “what.” Do the probabilities follow a nice exponential curve? A power law? Or, since there are just the three known options, do they get equal consideration?
This is all based off intuition, of course. What does intuition say when various knobs of this situation are tweaked—if the sum is of unknown complexity, or of complexity about that of C? If there are a hundred options, or countably many? Intuitively speaking, does it seem like favoring simpler sentences is an ontologically basic part of your logical prior?
2: Consider subsequences of the digits of pi. If I give you a pair (n, m), you can tell me the m digits following the nth digit of pi. So if I start a sentence like “the subsequence of digits of pi (10^100, 10^2) = ”, do you expect to see simpler strings of digits on the right side? Is this a testable prediction about the properties of pi?
Commentary: We know that there is always a short-ish program to produce the sequences, which is just to compute the relevant digits of pi. This sets a hard upper bound on the possible Kolmogorov complexity of sequences of pi (that grows logarithmically as you increase m and n), and past a certain m this will genuinely start restricting complicated sequences, and thus favoring “all zeros”—or does it?
After all, this is weak tea compared to an exponential simplicity prior, for which the all-zero sequence would be hojillions of times more likely than a messy one. On the other hand, an exponential curve allows sequences with higher Kolmogorov complexity than the computation of the digits of pi.
Does the low-level view outlined in the first paragraph above demonstrate that the exponential prior is bunk? Or can you derive one from the other with appropriate simplifications (keeping in mind Kolmogorov complexity vs. apparent complexity)? Does pi really contain more long simple strings than expected, and if not, what’s going on with our prior?
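The setup of this thought experiment is easy to make concrete. Below is a sketch using Gibbons’ unbounded spigot algorithm for the decimal digits of pi; the function name S and the indexing convention (the leading 3 is digit 0) are my own choices:

```python
from itertools import islice

def pi_digits():
    """Unbounded spigot for the decimal digits of pi (Gibbons' algorithm).
    Yields 3, 1, 4, 1, 5, 9, ... indefinitely, using exact integer math."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def S(n, m):
    """The m digits of pi following the nth digit (the leading 3 is digit 0)."""
    return "".join(str(d) for d in islice(pi_digits(), n + 1, n + 1 + m))

print(S(0, 5))  # "14159"
```

With this in hand, the question in the text is whether, for large n and modest m, the distribution of S(n, m) over-represents apparently simple strings relative to a uniform draw.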
3: Suppose I am writing an expression that I want to equal some number you know—that is, the sentence “my expression = your number” should be true. If I tell you the complexity of my expression, what can you infer about the likelihood of the above sentence?
Commentary: If we had access to Kolmogorov complexity of your number, then we could completely rule out answers that were too K-simple to work. With only an approximation, it seems like we can still say that simple answers are less likely up to a point. Then as my expression gets more and more complicated, there are more and more available wrong answers (and, outside of the system a bit, it becomes less and less likely that I know what I’m doing), and so probability goes down.
In the limit that my expression is much more complex than your number, does an elegant exponential distribution emerge from underlying considerations?
I do not know of any proposed prior that says that “simple sentences are more likely to be true.”
The prior I proposed says that I care more about being correct on simple sentences. This translates better to “I expect that my answers to simple questions are more likely to matter.”
The prior Paul Christiano proposed tries to minimize a “weighted” entropy, and so translates as “try harder to stay neutral on the simple sentences.”
The prior Abram Demski proposed (or at least the minor modification to it that he takes more seriously, where you are just as likely to put the negation of a sentence in as the sentence itself) translates as “truth values of simple sentences are more likely to modify truth values of complicated sentences than vice versa.”
The reason these look like “simple sentences are more likely to be true” is because you are looking at a list of mutually exclusive sentences, and so pushing them towards 1⁄2 is generally increasing their probability. If instead, you had a list of sentences and knew that exactly 5 out of the 7 were true, the simple ones would look less likely.
Note 1: I have recently become convinced that Abram’s prior is likely a better proposal than my own.
Note 2: I haven’t posted this yet, but my proposal is actually three different definitions, which are (conjecturally) the same, and one of them is better described similarly to Abram’s in that it lets simple sentences change complex sentences more than vice versa.
Thanks for linking to your own proposal! I’m posting this before actually digesting it, so I may reply again later.
I agree that I’m tunneling on the mutually exclusive and exhaustive case, and the characterization “simpler sentences have probabilities closer to 1/2” is an accurate characterization of Abram’s scheme. I’m not so sure about Paul’s—I’m pretty sure it’s not minimizing a weighted entropy, but a cross-entropy, which tries to make the probability proportional to the pre-prior.
As for other examples, there are a variety of more dilettantish cases (though I’m not really one to talk, of course) of people just trying to port Solomonoff induction onto sentences or onto models, and similar proposals of things like decreasing probability as a function of logical depth (example).
I’d defend my focus on mutually exclusive and exhaustive sets by saying that they show up everywhere and are always a basis we can use to represent knowledge. For example, if I had a list of 7 sentences and knew that exactly 5 were true, I can completely characterize my knowledge by probabilities of the 7 choose 5 mutually exclusive and exhaustive possibilities.
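For instance, the 7-choose-5 possibilities can be enumerated directly (a trivial sketch; the sentence labels are placeholders):

```python
from itertools import combinations
from math import comb

sentences = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]
# Each "world" picks which 5 of the 7 sentences are the true ones; the
# worlds are mutually exclusive and exhaustive by construction, so a
# probability assignment over them characterizes the agent's knowledge.
worlds = list(combinations(sentences, 5))
print(len(worlds), comb(7, 5))  # 21 21
```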
That said, it’s certainly possible there are things I’m missing because of this focus.
I don’t believe that complex mathematical statements should have lower probability than simple statements. It’s not just about A vs not-A. If A is a conjunction of many parts, its probability should be low. If A is a disjunction of many parts, its probability should be high.
The part about “mutually exclusive and exhaustive set of sentences” is a little similar to the idea from an old post of mine. Namely, if you have some set of sentences S_1 … S_n, then for each i you can define the probability that a random proof, sampled from a length-based prior, happens to prove S_i, conditional on the fact that it proves one of S_1 … S_n. The probabilities defined this way always sum to 1, provided that at least one S_i is true. Such an approach is helpful when S_i are logical counterfactuals, which look “mutually exclusive” to the agent but are actually all true. Also note that it agrees with our intuition about conjunctions and disjunctions, because proofs of disjunctions will tend to be shorter.
Can I ask you some more questions?
In thought experiment 2, the subsequences of pi, it seems like simpler sentences really are favored (because sequences more K-complex than the description are ruled out). Do you agree? Do you think this constitutes an empirical prediction that the all-zeroes subsequence will be overrepresented in pi?
How do you think a logical prior should behave in the limit that it’s applying to a countable set of sentences? After all, you can’t have a uniform distribution on the integers. Should it stay uniform for every finite number of sentences but be nonuniform if we go up a meta-level and consider an infinite set rather than individual sentences?
Well, computable normal numbers exist. If you replace pi with such a number, then we know that strings of zeroes aren’t overrepresented in the sense of asymptotic frequency. They might be overrepresented in some other sense, though. Can you define that precisely?
I don’t see why the prior should be uniform. None of the proposed priors are. Maybe I misunderstand your question?
Asymptotic frequency doesn’t explain this effect away, because the allowed complexity of subsequences of pi grows logarithmically, and so for any m there is an n such that there are no restrictions on complexity past that n. That is, it’s totally possible for a number to be asymptotically normal but still favor simpler sequences. Not sure how to formalize this or how to formalize the alternatives.
Hmm, it might be provable that some simple sequence shows up more often than 10^-m in Chaitin’s constant in some range that depends predictably on its complexity.
Basically, thought experiment 1. Suppose you have N mutually exclusive and exhaustive sentences, you know their complexities K(n), and you don’t know anything else. A uniform prior here just means assigning each of these sentences logical probability 1/N.
Next suppose you have a countable set of mutually exclusive and exhaustive sentences, you know their complexities K(n) (alternately you could have some privileged ordering of the sentences), and you don’t know anything else. There’s no uniform prior now because it would be everywhere 0, and that’s not even a little normalizable.
See my post below for one way of framing it more precisely.
So what do you think happens in thought experiment 3?
I think P(eval(random_expression(n)) = 42) will go to a positive limit when n goes to infinity, because expressions like (short expression that evaluates to 42) + 0 * (arbitrary long expression) will tend to be a constant fraction.
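This claim can be probed numerically. The sketch below uses a toy expression grammar of my own choosing (single-digit leaves; +, -, * operators), so the exact numbers depend on those assumptions:

```python
import random

def random_expression(size, rng):
    """A random arithmetic expression with `size` operator nodes.
    Leaves are single digits; the grammar is a toy assumption."""
    if size == 0:
        return str(rng.randint(0, 9))
    left = rng.randint(0, size - 1)  # operators in the left subtree
    op = rng.choice(["+", "-", "*"])
    return "({} {} {})".format(
        random_expression(left, rng), op,
        random_expression(size - 1 - left, rng))

def hit_rate(size, target, trials=20000, seed=0):
    """Monte Carlo estimate of P(eval(random_expression(size)) == target)."""
    rng = random.Random(seed)
    return sum(eval(random_expression(size, rng)) == target
               for _ in range(trials)) / trials
```

If the argument above is right, hit_rate(n, 42) should level off at a positive value rather than decay to zero as n grows, because a constant fraction of large expressions contain a 42-valued prefix multiplied into irrelevance.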
That’s a good point. I’m not sure if it goes through, though, because rather than length, we want to order sentences by apparent complexity, and the sentences you describe are only (apparently) as complex as the prefix.
There might still be tricks that exploit the weakness in our method for evaluating apparent complexity [...]
EDIT: Okay, I thought of a trick—rather than “0”, multiply the suffix by an expression that evaluates to 0 but is too complicated for the checker to evaluate. This allows the positive limit of probabilities to be fairly big (well, relative to 0).
Anyhow, great, the probability that some apparently-complex expression equals 42 doesn’t go to zero. What happens when the expression is just as complex as “42”? Is the probability that it equals 42 higher then? Lower?
I think you’re confusing the complexity of the sentence with the complexity of what the sentence evaluates to. The sentence “0 * random_expression” is still roughly as complex as random_expression, even though it evaluates to 0.
Whoops, you’re right.
With regards to thought experiment 2, I think a useful reference point is completely random strings of digits. If you were indeed picking the strings of digits totally at random, you would expect a distribution in which:
1) Any string of digits is equally likely.
2) A significant majority of the strings have complexity of roughly m (plus some constant overhead).
Now, in the actual thought experiment, the string of digits is quite obviously not random; as you’ve said, the Kolmogorov complexity has a hard upper bound of C + log(m) + log(n), which in general could be significantly smaller than m.
Here’s one alternative hypothesis that I can think of, which makes sense to me but could easily be wrong:
H2: In general, such strings will appear to be mostly random, except for the simple fact that they actually occur relatively early in pi.
If this is true, one would expect that for quite a lot of those strings there will not be a more compact representation than a program that, in one way or another, explicitly computes those digits of pi.
Also, I think I’ve thought of an interesting way of testing a prediction along the lines you’ve suggested. If we take S(n, m, pi) to mean “the m digits following the nth digit of pi”, let’s define
N(s, pi) := the smallest integer n such that S(n, len(s), pi) = s
If, as you suggest, simpler strings should be “more likely”, then it would follow that for simpler strings s, you would expect to see lower values of N(s, pi). In particular, how about the following experiments:
i) Generate a bunch of random sequences R (all of length m) and look at the distribution you get for N(r, pi). You can also use your apparent complexity checker on those sequences, but in general you would expect most of them to have complexity of roughly m.
ii) Pick out a bunch of strings Q that are demonstrably simple, e.g. 0000000000, 1111111111, 0123456789, and look at the values you get for N(q, pi)
iii) Pick a different number, which you think has similar properties to pi in this regard—let’s say e. Now, for a given number n, you can calculate
F(n) := N(S(n, m, e), pi)
which basically means “for a sequence occurring at position n in e, what position does it first occur at in pi”?
I think if your hypothesis is true, it will hold that the values for (ii) are typically much smaller than the values in (i), and the values for F(n) in (iii) will generally tend to increase as n increases (up to a finite limit, because eventually you will have covered all sequences of length m). On the other hand, if H2 is correct, the values for (i), (ii), and (iii) will all be pretty similar.
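Here is one way experiments (i)–(iii) might be set up. The helper names are mine, pi_prefix uses Machin’s formula rather than a digit-extraction scheme, and first_occurrence indexes from the leading 3 at position 0, which differs by one from the N(s, pi) convention above (which counts digits following position n):

```python
def pi_prefix(num_digits):
    """First num_digits decimal digits of pi as a string (starting "3"),
    via Machin's formula pi = 16*arctan(1/5) - 4*arctan(1/239) evaluated
    in fixed-point integer arithmetic with 10 guard digits."""
    one = 10 ** (num_digits + 10)

    def arctan_inv(x):
        # arctan(1/x) * one, by the alternating Taylor series.
        power = total = one // x
        n, sign = 3, -1
        while power:
            power //= x * x
            total += sign * (power // n)
            n, sign = n + 2, -sign
        return total

    pi_scaled = 16 * arctan_inv(5) - 4 * arctan_inv(239)
    return str(pi_scaled)[:num_digits]

def first_occurrence(s, digits):
    """Smallest index at which the substring s begins in the digit string,
    or -1 if s does not occur in the computed prefix."""
    return digits.find(s)

digits = pi_prefix(2000)
# Compare where demonstrably simple strings first appear (experiment ii)
# against where random-looking strings of the same length do (experiment i).
positions = {s: first_occurrence(s, digits) for s in ["000", "111", "271", "846"]}
print(positions)
```

A real run would use far more digits, a larger sample of random strings, and the apparent-complexity checker from the article; this only shows the scaffolding.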
Now, with all that being said, I’m not sure that your thought experiment #2 carries across to a prior as to whether or not certain logical statements are true. It may be an interesting experiment to actually try in practice, though.
A useful point of contrast is the idea of a normal number, as pi and e are widely believed to be. The crucial difference is that normal-ness requires the asymptotic frequencies to be the same for all strings of the same length, which is very different to the criterion I mentioned above.
More complicated hypotheses have to be on average at least exponentially less likely, because there are exponentially many of them. There are probability measures that decline with length faster than exponential, but none that are slower. One could even say that the Solomonoff probabilities decline as slowly as possible.
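The counting argument can be made concrete for binary strings: there are 2^n strings of length n, so a per-string weight that shrinks slower than 2^-n (e.g. a power law in the length) gives unnormalizable total mass, while an extra exponential factor converges. A minimal numeric check:

```python
# There are 2**n binary strings of length n. A per-string weight of n**-2
# (a power law) gives unbounded total mass, while 4**-n (exponential, with
# a spare factor of 2**-n beyond the counting rate) sums to at most 1.
power_law_mass = sum(2**n * n**-2 for n in range(1, 30))
exponential_mass = sum(2**n * 4**-n for n in range(1, 30))
print(power_law_mass)    # grows without bound as the range extends
print(exponential_mass)  # approaches 1 from below
```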
The analogy between Solomonoff induction and a simplicity prior on logical sentences is not perfect.
EDIT: What I mean is: I’m not sold that the reasons Solomonoff induction uses an exponential distribution also apply to logical sentences.
For example, in thought experiment 1, it is totally allowed that there are only 3 options. If this were Solomonoff induction, there would always be an infinite number of programs that fit the data. But in thought experiment 1, the only options realio trulio are A, B, and C. There are no bad consequences if you give them probabilities proportional to a power law—or at least not the same bad consequences.