Err, no, I wouldn’t call it a justification. It’s more like a technical setup of why shorter programs turn out to be more likely (because the padding happens to include them in this way). This is not a profound thing. It’s just one possible way to get where you want, which is to assign 2^-length(p) to p. Note that it’s brittle in the sense that you need a support from your formalism for that. You need your notion of what is a program to allow adding arbitrary bits to the end without affecting its validity or what it does. That’s not a very interesting thing to do: why would you even consider all those programs with dummy padding as real programs, given that whatever’s running them can probably easily rule them out? E.g. modern compilers will usually reject your program if you add some junk at the end.
If you restrict yourself to a finite program length, you could still only count minimal programs, those w/o padding, and give them appropriate weights (the way they do in the paper immediately afterwards for the infinite case, the standard way to set it up) and end up with the same thing. There’s no profound reason I can think of to say “oh, for fixed bound L on program length we’re just only going to look at programs of length exactly L, including padded programs, but for unrestricted length we’re throwing out padded programs and only looking at minimal ones”. The real reason is that you want to sum up to 1, and you want to prefer short programs while doing so.
Now, something closer to justiication is when they point our right afterwards that 2^-length(p) is what you get if you randomly choose program bits one by one. But as justifications go, I think it’s a pretty weak one, because no one really computes anything in any way that looks like that. And more importantly, as I said above, it just doesn’t matter for SI prediction. You don’t have to use 2^-length(p) for SI prediction to work. You just need to have it all sum up to 1.
You kind of missed my point. I provided it as a technical justification why the exact formula 2^-length(p) makes sense. As you said, “It’s one possible way to get where you want”. I know that 2^-length(p) is not the only possible option, but it’s the most natural one given that we know nothing about the input.
The model of computation is a “Universal Turing machine provided with completely random noise” (i.e. fair coin flips on the input tape). It’s a mathematical formalism; it has nothing to do with modern compilers.
I didn’t miss your point; instead, I’ve tried to explain your point to you, and perhaps didn’t do a really good job. You’re confusing two different things which are both spelled out in the paper you referenced on page 1120, but in your comment you linked me to you’ve only recounted the first one, perhaps thinking that the second one is the same thing, which it is not.
The first thing is setting up probabilities over programs of a given fixed length L, so that padding helps us take into account also shorter programs. In the paper, this is the paragraph starting “This sum is still only over programs of length ℓ...”. This is what you talked about in your other comment, and this is what I talked about in my first paragraph of the reply above.
The second thing is a way of setting up probabilities over programs of any length without bound, using a UTM-with-random-noise model as justification for the formula. This is an inherently different way of assigning probabilities, where in particular we do not use padding, but on the contrary only take into account minimal programs. This is what the paper talks about in the next two paragraphs, the second of which, starting “This is a highly technical explanation...” explains the UTM-with-random-noise. I talk about this second thing in the third paragraph of my reply above.
Hopefully I made it clearer. If you still think I missed your point, let’s leave it at that.
See this comment for a justification of why shorter programs are more likely.
The argument is taken from this paper.
Err, no, I wouldn’t call it a justification. It’s more like a technical setup of why shorter programs turn out to be more likely (because the padding happens to include them in this way). This is not a profound thing. It’s just one possible way to get where you want, which is to assign 2^-length(p) to p. Note that it’s brittle in the sense that you need a support from your formalism for that. You need your notion of what is a program to allow adding arbitrary bits to the end without affecting its validity or what it does. That’s not a very interesting thing to do: why would you even consider all those programs with dummy padding as real programs, given that whatever’s running them can probably easily rule them out? E.g. modern compilers will usually reject your program if you add some junk at the end.
If you restrict yourself to a finite program length, you could still only count minimal programs, those w/o padding, and give them appropriate weights (the way they do in the paper immediately afterwards for the infinite case, the standard way to set it up) and end up with the same thing. There’s no profound reason I can think of to say “oh, for fixed bound L on program length we’re just only going to look at programs of length exactly L, including padded programs, but for unrestricted length we’re throwing out padded programs and only looking at minimal ones”. The real reason is that you want to sum up to 1, and you want to prefer short programs while doing so.
Now, something closer to justiication is when they point our right afterwards that 2^-length(p) is what you get if you randomly choose program bits one by one. But as justifications go, I think it’s a pretty weak one, because no one really computes anything in any way that looks like that. And more importantly, as I said above, it just doesn’t matter for SI prediction. You don’t have to use 2^-length(p) for SI prediction to work. You just need to have it all sum up to 1.
You kind of missed my point. I provided it as a technical justification why the exact formula 2^-length(p) makes sense. As you said, “It’s one possible way to get where you want”. I know that 2^-length(p) is not the only possible option, but it’s the most natural one given that we know nothing about the input.
The model of computation is a “Universal Turing machine provided with completely random noise” (i.e. fair coin flips on the input tape). It’s a mathematical formalism; it has nothing to do with modern compilers.
I didn’t miss your point; instead, I’ve tried to explain your point to you, and perhaps didn’t do a really good job. You’re confusing two different things which are both spelled out in the paper you referenced on page 1120, but in your comment you linked me to you’ve only recounted the first one, perhaps thinking that the second one is the same thing, which it is not.
The first thing is setting up probabilities over programs of a given fixed length L, so that padding helps us take into account also shorter programs. In the paper, this is the paragraph starting “This sum is still only over programs of length ℓ...”. This is what you talked about in your other comment, and this is what I talked about in my first paragraph of the reply above.
The second thing is a way of setting up probabilities over programs of any length without bound, using a UTM-with-random-noise model as justification for the formula. This is an inherently different way of assigning probabilities, where in particular we do not use padding, but on the contrary only take into account minimal programs. This is what the paper talks about in the next two paragraphs, the second of which, starting “This is a highly technical explanation...” explains the UTM-with-random-noise. I talk about this second thing in the third paragraph of my reply above.
Hopefully I made it clearer. If you still think I missed your point, let’s leave it at that.