Your post still wouldn’t make any sense. Events/worlds have prior probability, but not complexity. Descriptions/algorithms/predictors have complexity, but not prior probability.
An infinite number of descriptions of different lengths predict the same event.
Does this imply you should assign credence 2^-m to any statement about the universe (“hypothesis”) that has length m? [...]
This confuses “A and B and C and D” as a description (which has complexity but no prior probability) with “A and B and C and D” as a subset of possible worlds (which has prior probability but no complexity).
Speaking about Kolmogorov complexity requires a lot of precision: we can select a toy world, but there’s just no way to use anything less than a Turing-complete description language.
Except that if the world can be described in m bits, 2^-m serves as a lower bound on its probability, as long as it’s consistent with the data.
The hypothesis “the correct theory of everything is the lexicographically least algorithm with K-complexity 3^^^^3” is quite short, but the universal prior for it is astronomically low.
Is it? If you try to encode informal statements formally, they get very big. And doesn’t “lexicographically least algorithm with K-complexity 3^^^^3” violate Rice’s theorem? If it weren’t for these problems, 2^-m would still be a proper lower bound on the prior probability of this.
Any statement about the world (“description”) does have a prior probability, which is equal to the sum of prior probabilities of worlds satisfying that statement. This is kind of obvious.
The prior probability of a statement is not connected to its “complexity” (whatever that means). This is also kind of obvious, but many people have managed to miss it, so I wrote the post.
A formal encoding of example 2 (as a predicate on algorithms expressed in ZFC, say) isn’t very long or difficult. The notion of K-complexity is easily definable, though not computable. Just like BB numbers—also easily definable, but not computable. What does Rice’s theorem have to do with it? Maybe I’m missing something?
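A minimal sketch of the “sum over satisfying worlds” point in Python; the worlds, their prior weights, and the two statements are invented purely for illustration, not taken from the post:

```python
# Toy illustration: the prior of a statement is the total prior mass of the
# worlds that satisfy it, regardless of how long the statement is to write down.
# The worlds and their (already-normalized) prior weights are made up.

worlds = {
    "2 4 6": 0.5,
    "2 4 8": 0.25,
    "3 4 6": 0.25,
}

def statement_prior(predicate):
    """Prior probability of a statement = sum of priors of worlds satisfying it."""
    return sum(p for w, p in worlds.items() if predicate(w))

# A short statement and a wordier but logically equivalent one
short_stmt = lambda w: w.startswith("2")
long_stmt  = lambda w: w.startswith("2") and not w.startswith("3") and len(w) == 5

print(statement_prior(short_stmt))  # 0.75
print(statement_prior(long_stmt))   # 0.75 -- same prior despite the longer wording
```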
Programs used in Kolmogorov priors have to describe everything about the world. So it’s about programs like “print(2,4,6);” and “while(true){print(42);}” and their lengths. The world “2 4 6” has the probability induced by all programs that generate “2 4 6”.
You don’t normally have world-testing programs, just world-generating programs.
By complexity I mean program length.
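A rough sketch of that generating-program picture, under a toy “language” where a program is just a string and its weight is 2^-length; the example programs and the character-based length measure are assumptions made for the sketch:

```python
from collections import defaultdict

# Toy Kolmogorov-style prior: each world-generating program gets weight
# 2^-length, and a world's prior is the sum over all programs that output it.
# Programs are given as (source_text, output) pairs; measuring length in
# characters instead of bits is a simplification.

programs = [
    ("print(2,4,6);", "2 4 6"),
    ("for(i=2;i<=6;i+=2)print(i);", "2 4 6"),   # longer program, same world
    ("print(2,4,8);", "2 4 8"),
]

def weight(source):
    return 2.0 ** -len(source)

world_prior = defaultdict(float)
for source, output in programs:
    world_prior[output] += weight(source)

for world, prior in world_prior.items():
    print(world, prior)
# "2 4 6" collects mass from both programs that generate it,
# so it ends up more probable than "2 4 8" under this toy prior.
```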
Then why did you mention things like length(“A and B”) in your first comment? You can’t take the conjunction of two programs, can you? I’m confused.
I shouldn’t be replying this late at night, as my clarity suffers a lot, and this subject demands a lot of clarity.
There are meaningful ways to describe the Kolmogorov complexity of world-testing programs and their combinations, and there is a meaningful equivalent of conditional probability. But you don’t use them in a prior directly. When you want to define a prior with world-testing programs, you need to do the moral equivalent of padding every program with enough entropy to fully describe the world. So the program “numbers are 2, 4, 6” converts without entropy padding, the program “numbers are even” needs a lot of extra entropy, and “numbers are whatever Jesus wants them to be” requires as much entropy as an empty program.
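Here is one way that padding move could look in code; the encoding (fixed-length worlds of digits, a predicate, and padding bits that pick one of the worlds the predicate allows) is my own guess at the construction, not something spelled out in the comment:

```python
from itertools import product

# Guess at the "padding" construction: a world is a fixed-length tuple of digits,
# a world-testing program is a predicate on worlds, and the padding bits select
# one particular world among those the predicate accepts.  The enumeration
# order and the digit alphabet are arbitrary choices made for this sketch.

DIGITS = range(10)
WORLD_LEN = 3

def worlds():
    return product(DIGITS, repeat=WORLD_LEN)   # all 1000 possible worlds, in a fixed order

def generate(tester, padding_bits):
    """Turn a world-testing program plus entropy padding into a world-generating program."""
    index = int(padding_bits, 2) if padding_bits else 0
    matching = [w for w in worlds() if tester(w)]
    return matching[index]

# "numbers are 2, 4, 6": a complete description, no padding needed.
print(generate(lambda w: w == (2, 4, 6), ""))          # (2, 4, 6)

# "numbers are even": 125 worlds qualify, so picking one needs ~7 bits of padding.
print(generate(lambda w: all(x % 2 == 0 for x in w), "0000011"))

# "numbers are whatever Jesus wants": every world qualifies,
# so the padding has to carry the entire description of the world.
print(generate(lambda w: True, "0000000101"))
```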
My excuse is that when you sum over every possible padding, as in (1/2)^length(“A and B” + padding bits 0000) + (1/2)^length(“A and B” + padding bits 0001) + …, what you get is simply (1/2)^length(“A and B”) times a small language-dependent constant, namely (1/2)^length(program to convert entropy bits and a world-testing program into a world-generating program).
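A quick numeric check of that summing-over-paddings claim, assuming the padded program’s length is just the testing program’s length plus the padding length (character counts standing in for bits, and the converter constant ignored):

```python
# Check: summing (1/2)^(L + n) over all 2^n possible n-bit paddings
# gives back (1/2)^L, so the padding washes out up to the converter constant.
# L and n are arbitrary small values chosen for the check.

L = len("A and B")   # length of the world-testing program, in characters for simplicity
n = 4                # number of padding bits

total = sum(0.5 ** (L + n) for _ in range(2 ** n))
print(total, 0.5 ** L)   # both 0.0078125 -- the two agree
```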
So you have:
Length of world-testing program
Entropy needed to get world-generating program from world-testing program
Probability that world-testing program matches the world
For complete descriptions of the world (no extra entropy needed) that don’t have any shorter descriptions, the complexity of the description is very closely related to Kolmogorov complexity: “sequence is 50 times 42” is shorter than “sequence is 9, 0, 2, 9, 3, 2, 7, 5, 8, 1, 4, 9, 8, 7, 1, 0, 6, 2, 3, 4, 4, 8, 5, 7, 4, 4, 7, 0, 1, 2, 8, 3, 9, 0, 9, 0, 3, 2, 8, 4, 6, 0, 4, 7, 8, 3, 4, 2, 3, 6, 3”, both are unambiguous and about maximally concise (I generated the second randomly), so the first is more likely than the second.
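To put a rough number on that comparison, under the same toy 2^-length weighting (characters again standing in for bits), the ratio of the two priors is 2 to the power of the length difference:

```python
# The two complete descriptions from the comparison above; under a 2^-length
# weighting (characters as a stand-in for bits) the shorter one is favoured
# by a factor of 2^(length difference).

short_desc = "sequence is 50 times 42"
long_desc  = ("sequence is 9, 0, 2, 9, 3, 2, 7, 5, 8, 1, 4, 9, 8, 7, 1, 0, 6, 2, 3, 4, "
              "4, 8, 5, 7, 4, 4, 7, 0, 1, 2, 8, 3, 9, 0, 9, 0, 3, 2, 8, 4, 6, 0, 4, 7, "
              "8, 3, 4, 2, 3, 6, 3")

ratio = 2.0 ** (len(long_desc) - len(short_desc))
print(len(short_desc), len(long_desc), ratio)
# The short description is vastly more likely under this toy prior.
```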
For incomplete descriptions it’s more complicated, but the mathematics should be straightforward enough.
Thanks, this angle looks interesting, and new to me. Could you explain some more how it’s supposed to work? For example, if the input is an infinite stream of numbers and my hypothesis is something like “next number will be 42” or “prime-numbered members will be prime”, should I pad it with infinite additional entropy?
Normally streams are assumed to be finitely long and to have symbols from a finite set, to keep the math neat.
Extending to infinitely long streams with finite complexity is conceptually fairly straightforward, using a trick very similar to a non-normalizable Bayesian prior (just as a uniform prior over the real numbers is well defined even though it sums to infinity, so it’s not really a probability, or the entropy of a molecule when atom locations are unbounded continuous variables), but the math can get considerably messier.
Finite or not, streams must be countable.
So just select some ordering, and for a stream-tester T and a number N the world generator is “i=0; for(w in streams) { if(T(w)) i += 1; if(i == N) { print(w); break; } }”. There are multiple ways to choose the iteration order and the encoding for N (the simplest would be a fixed-bit-size number, though you obviously need a more complicated encoding for an infinite number of possibilities), but they’re just a constant away from each other, so you can ignore this detail.
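A runnable Python version of that inline generator sketch, with one arbitrary choice of ordering (finite tuples over the digits 0–9, enumerated by increasing length); the example tester is an assumption made for the sketch:

```python
from itertools import count, product

# Runnable take on the inline pseudocode: enumerate all finite streams in some
# fixed order and output the N-th one accepted by the stream-tester T.
# The ordering (tuples over 0..9, by increasing length) is an arbitrary choice;
# any other computable ordering only changes things by a constant.

def streams():
    for length in count(0):
        for w in product(range(10), repeat=length):
            yield w

def world_generator(tester, n):
    """Print the n-th stream (1-indexed) that the world-testing program accepts."""
    i = 0
    for w in streams():
        if tester(w):
            i += 1
            if i == n:
                print(w)
                return w

# Example tester: "all members are even"; the 3rd such stream in this ordering.
world_generator(lambda w: all(x % 2 == 0 for x in w), 3)   # prints (2,)
```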
Ah, sorry. I deleted my other reply, will reply here.