Here is my understanding: we assume a programming language where a program is a finite sequence of bits, and such that no program is a prefix of another program. So, for example, if 01010010 is a program, then 0101 is not a program. Then, the (not-normalized) prior probability for a program is 2−length(the program) Why that probability? If you take any infinite sequence of bits, then, because no program is a prefix of any other program, at most one program will be a prefix of that sequence of bits. If you randomly (with uniform distribution) select an infinite sequence of bits, the probability that the sequence of bits has a particular program as a prefix, is then 2−length(the program) (because there’s a factor of (1/2) for each of the bits of the program, and if the first (length of the program) bits match the program, then it doesn’t matter what the rest of the bits that come after are. (Ah, I suppose you don’t strictly need to talk about infinite sequences of bits, and you could just talk about randomly picking the value of the next bit, stopping if it ever results in a valid program in the programming language..., not sure which makes it easier to think about.)
If you want this to be an actual prior, you can normalize this by dividing by (the sum over all programs, of 2−length(the program) ).
The usual way of defining Solomonoff induction, I believe has the programs being deterministic, but I’ve read that allowing them to use randomness has equivalent results, and may make some things conceptually easier. So, I’ll make some educated guesses about how to incorporate the programs having random behavior.
Let G be the random variable for “which program gets selected”, and g be used to refer to any potential particular value for that variable. (I avoided using p for program because I want to use it for probability. The choice of the letter g was arbitrary.) Let O be the random variable for the output observed, and let o be used for any particular value for that variable.
P(G=g|O=o)=P(G=g∧O=o)/P(O=o) =P(O=o|G=g)P(G=g)/P(O=o) And, given a program g, the idea of P(O=o|G=g) makes sense, and P(G=g) is proportional to 2−length(g) (missing a normalization constant, but this will be the same across all g), And, P(O=o) will also be a normalization constant that is the same across all g.
And so, if you can compute values P(O=o|G=g) (the program g may take too long to run in practice) we can compute values proportional to P(G=g|O=o) .
Does that help any? (apologies if this should be a “comment” rather than an “answer”. Hoping it suffices.)
The part I understood is, that you weigh the programs based on the length in bits, the longer the program the less weight it has. This makes total sense.
I am not sure that I understand the prefix thing and I think that’s relevant. For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010).
I also find still fuzzy (and know at least I can put my finger on it) is the part where Solomonoff induction is extended to deal with randomness.
Let me see if I can make my question more specific:
Let’s imagine for a second that we live in a universe where only the next programs could be written:
A) A program that produces deterministically a given sequence of five digits (there are 2^5 of this programs
B) A program that produces deterministically a given sequence of 6 digits (there are 2^6 of them)
C) A program that produces 5 random coin flips with p=0.5
The programs in A have 5 bits of Kolmogorov complexity each. The programs in B have 6 bits. The program C has 4
We observe the sequence O = HTHHT
I measure the likelihood for each possible model. I discard the models with L = 0
A) There is a model here with likelihood 1
B) There are 2 models here, each of them with likelihood 1 too
C) This model has likelihood 2^-5
Then, things get murky:
the priors for each models will be 2^-5 for model A, 2^-6 for model B and 2^-4 for model C, according to their Kolmogorov complexity?
For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010).
No, the thing about prefixes is about what strings encode a program, not about their outputs. The purpose of this is mostly just to define a prior over possible programs, in a way that conveniently ensures that the total probability assigned over all programs is at most 1. Seeing as it still works for different choices of language, it probably doesn’t need to exactly use this kind of defining the probabilities, and I think any reasonable distribution over programs will do (at least, after enough observations) But, while I think another distribution over programs should work, this thing with the prefix-free language is the standard way of doing it, and there are reasons it is nice. The analogy for a normal programming language would be if no python script was a prefix of any other python script (which isn’t true of python scripts, but could be if they were required to end with some “end of program” string) There will be many different programs which produce the exact same output when run, and will all be considered when doing Solomonoff induction.
The programs in A have 5 bits of Kolmogorov complexity each. The programs in B have 6 bits. The program C has 4
This may be pedantic of me, but I wouldn’t call the lengths of the programs, the Kolmogorov complexity of the program. The lengths of the programs are (upper bounds on) the Kolmogorov complexity of the outputs of the programs. The Kolmogorov complexity of a program g, would be the length of the shortest program which outputs the program g, not the length of g.
When you say that program C has 4 bits, is that just a value you picked, or are you obtaining that from somewhere?
Also, for a prefix-free programming language, you can’t have 2^5 valid programs of length 5, and 2^6 programs of length 6, because if all possible binary strings of length 5 were valid programs, then no string of length 6 would be a valid program. This is probably getting away from the core points though (You could have the programming language be such that, e.g. 00XXXXX outputs the bits XXXXX, and 01XXXXXX outputs the bits XXXXXX, and other programs start with a 1, and any other program might want to encode is somehow encoded using some scheme)
the priors for each models will be 2^-5 for model A, 2^-6 for model B and 2^-4 for model C, according to their Kolmogorov complexity?
yeah, the (non-normalized) prior for each will be 2^(-(length of a program which directly encodes a 5 bit string to output)) for the programs which directly encode some 5 bit string and output it, 2^(-(length of a program which directly encodes a 6 bit string to output)) for the programs which directly encode some 6 bit string and output it, and (say) 2^(-4) for program C. And those likelihoods you gave are all correct for those models.
So, then, the posteriors (prior to normalization) would be 2^(-(length of a program which directly encodes a 5 bit string to output)) (let’s say this is 2^(-7) for the program that essentially is print(“HTHHT”), 2^(-(length of a program which directly encodes a 6 bit string to output)) (let’s say this is 2^(-8) ) for the programs that essentially are print(“HTHHTH”) and print(“HTHHTT”) respectively 2^(-4) * 2^(-5) = 2^(-9) for model C.
If we want to restrict to these 4 programs, then, adding these up, we get 2^(-7) + 2^(-8) + 2^(-8) + 2^(-9) = 2^(-6) + 2^(-9) = 9 * 2^(-9), and dividing that, we get (4/9) chance for the program that hardcodes HTHHT (say, 0010110) (2/9) chance for the program that hardcodes HTHHTH (say, 01101101) (2/9) chance for the program that hardcodes HTHHTT (say, 01101100) (1/9) chance for the program that produces a random 5 bit string. (say, 1000) So, in this situation, where we’ve restricted to these programs, the posterior probability distribution for “what is the next bit” would be (4/9)+(1/9)=(5/9) chance that “there is no next bit” (this case might usually be disregarded/discared, idk.) (2/9) chance that the next bit is H (2/9) chance that the next bit is T
Thank you for the comprehensive answer and for correcting the points where I wasn’t clear. Also, thank you for pointing out that the Kolmogorov complexity of a program is the length of the program that writes that program
The complexity of the algorithms was totally arbitrary and for the sake of the example.
I still have some doubts, but everything is more clear now (see my answer to Charlie Steiner also)
Here is my understanding:
we assume a programming language where a program is a finite sequence of bits, and such that no program is a prefix of another program. So, for example, if 01010010 is a program, then 0101 is not a program.
Then, the (not-normalized) prior probability for a program is 2−length(the program)
Why that probability?
If you take any infinite sequence of bits, then, because no program is a prefix of any other program, at most one program will be a prefix of that sequence of bits.
If you randomly (with uniform distribution) select an infinite sequence of bits, the probability that the sequence of bits has a particular program as a prefix, is then 2−length(the program) (because there’s a factor of (1/2) for each of the bits of the program, and if the first (length of the program) bits match the program, then it doesn’t matter what the rest of the bits that come after are.
(Ah, I suppose you don’t strictly need to talk about infinite sequences of bits, and you could just talk about randomly picking the value of the next bit, stopping if it ever results in a valid program in the programming language..., not sure which makes it easier to think about.)
If you want this to be an actual prior, you can normalize this by dividing by (the sum over all programs, of 2−length(the program) ).
The usual way of defining Solomonoff induction, I believe has the programs being deterministic, but I’ve read that allowing them to use randomness has equivalent results, and may make some things conceptually easier.
So, I’ll make some educated guesses about how to incorporate the programs having random behavior.
Let G be the random variable for “which program gets selected”, and g be used to refer to any potential particular value for that variable. (I avoided using p for program because I want to use it for probability. The choice of the letter g was arbitrary.)
Let O be the random variable for the output observed, and let o be used for any particular value for that variable.
P(G=g|O=o)=P(G=g∧O=o)/P(O=o)
=P(O=o|G=g)P(G=g)/P(O=o)
And, given a program g, the idea of P(O=o|G=g) makes sense, and P(G=g) is proportional to 2−length(g) (missing a normalization constant, but this will be the same across all g),
And, P(O=o) will also be a normalization constant that is the same across all g.
And so, if you can compute values P(O=o|G=g) (the program g may take too long to run in practice) we can compute values proportional to P(G=g|O=o) .
Does that help any?
(apologies if this should be a “comment” rather than an “answer”. Hoping it suffices.)
The part I understood is, that you weigh the programs based on the length in bits, the longer the program the less weight it has. This makes total sense.
I am not sure that I understand the prefix thing and I think that’s relevant. For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010).
I also find still fuzzy (and know at least I can put my finger on it) is the part where Solomonoff induction is extended to deal with randomness.
Let me see if I can make my question more specific:
Let’s imagine for a second that we live in a universe where only the next programs could be written:
A) A program that produces deterministically a given sequence of five digits (there are 2^5 of this programs
B) A program that produces deterministically a given sequence of 6 digits (there are 2^6 of them)
C) A program that produces 5 random coin flips with p=0.5
The programs in A have 5 bits of Kolmogorov complexity each. The programs in B have 6 bits. The program C has 4
We observe the sequence O = HTHHT
I measure the likelihood for each possible model. I discard the models with L = 0
A) There is a model here with likelihood 1
B) There are 2 models here, each of them with likelihood 1 too
C) This model has likelihood 2^-5
Then, things get murky:
the priors for each models will be 2^-5 for model A, 2^-6 for model B and 2^-4 for model C, according to their Kolmogorov complexity?
No, the thing about prefixes is about what strings encode a program, not about their outputs.
The purpose of this is mostly just to define a prior over possible programs, in a way that conveniently ensures that the total probability assigned over all programs is at most 1. Seeing as it still works for different choices of language, it probably doesn’t need to exactly use this kind of defining the probabilities, and I think any reasonable distribution over programs will do (at least, after enough observations)
But, while I think another distribution over programs should work, this thing with the prefix-free language is the standard way of doing it, and there are reasons it is nice.
The analogy for a normal programming language would be if no python script was a prefix of any other python script (which isn’t true of python scripts, but could be if they were required to end with some “end of program” string)
There will be many different programs which produce the exact same output when run, and will all be considered when doing Solomonoff induction.
This may be pedantic of me, but I wouldn’t call the lengths of the programs, the Kolmogorov complexity of the program. The lengths of the programs are (upper bounds on) the Kolmogorov complexity of the outputs of the programs. The Kolmogorov complexity of a program g, would be the length of the shortest program which outputs the program g, not the length of g.
When you say that program C has 4 bits, is that just a value you picked, or are you obtaining that from somewhere?
Also, for a prefix-free programming language, you can’t have 2^5 valid programs of length 5, and 2^6 programs of length 6, because if all possible binary strings of length 5 were valid programs, then no string of length 6 would be a valid program.
This is probably getting away from the core points though
(You could have the programming language be such that, e.g. 00XXXXX outputs the bits XXXXX, and 01XXXXXX outputs the bits XXXXXX, and other programs start with a 1, and any other program might want to encode is somehow encoded using some scheme)
yeah, the (non-normalized) prior for each will be 2^(-(length of a program which directly encodes a 5 bit string to output)) for the programs which directly encode some 5 bit string and output it, 2^(-(length of a program which directly encodes a 6 bit string to output)) for the programs which directly encode some 6 bit string and output it, and (say) 2^(-4) for program C.
And those likelihoods you gave are all correct for those models.
So, then, the posteriors (prior to normalization)
would be 2^(-(length of a program which directly encodes a 5 bit string to output)) (let’s say this is 2^(-7) for the program that essentially is print(“HTHHT”),
2^(-(length of a program which directly encodes a 6 bit string to output)) (let’s say this is 2^(-8) ) for the programs that essentially are print(“HTHHTH”) and print(“HTHHTT”) respectively
2^(-4) * 2^(-5) = 2^(-9) for model C.
If we want to restrict to these 4 programs, then, adding these up, we get 2^(-7) + 2^(-8) + 2^(-8) + 2^(-9) = 2^(-6) + 2^(-9) = 9 * 2^(-9), and dividing that, we get
(4/9) chance for the program that hardcodes HTHHT (say, 0010110)
(2/9) chance for the program that hardcodes HTHHTH (say, 01101101)
(2/9) chance for the program that hardcodes HTHHTT (say, 01101100)
(1/9) chance for the program that produces a random 5 bit string. (say, 1000)
So, in this situation, where we’ve restricted to these programs, the posterior probability distribution for “what is the next bit” would be
(4/9)+(1/9)=(5/9) chance that “there is no next bit” (this case might usually be disregarded/discared, idk.)
(2/9) chance that the next bit is H
(2/9) chance that the next bit is T
Thank you for the comprehensive answer and for correcting the points where I wasn’t clear. Also, thank you for pointing out that the Kolmogorov complexity of a program is the length of the program that writes that program
The complexity of the algorithms was totally arbitrary and for the sake of the example.
I still have some doubts, but everything is more clear now (see my answer to Charlie Steiner also)