If a “hypothesis” is a computable distribution, it doesn’t need to include the random bits. (A computable distribution is a program that accepts a bit string and computes successive approximations to the probability of that bit string.)
It’s got to be a little more than that, for the probabilities of different bit strings to add up to 1. The obvious way to do that without having to sum all the “probabilities” and renormalize is to map an infinitely long, uniformly distributed bit string onto the strings for which you’re defining probabilities.
In any case you’re straying well away from what S.I. is defined to be.
I guess I left out too many details. Let’s say a “hypothesis” is a program that accepts a bit string S and prints a sequence of non-negative rational numbers. For each S, that sequence has to be non-decreasing and converge to a limit P(S). The sum of P(S) over all S has to be between 0 (exclusive) and 1 (inclusive). Such an object is called a “lower-semicomputable semimeasure”. SI is a universal lower-semicomputable semimeasure, which means it dominates any other up to a multiplicative constant. See page 8 of this paper for more on which classes of (semi-)measures have or don’t have a universal element.
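For concreteness, here’s a minimal Python sketch of such an object. The particular choice P(S) = 2^-(2|S|+1) is just an invented example; it happens to sum to exactly 1 over all bit strings, and the non-decreasing approximations are trivial here, though in general each step could come from running a computation a little longer.

```python
from fractions import Fraction

def toy_semimeasure(s):
    """A toy lower-semicomputable (semi)measure.

    Assigns P(S) = 2^-(2|S|+1) to each bit string S. Summed over the
    2^n strings of each length n, this gives sum_n 2^-(n+1) = 1, so it
    is in fact a full measure. Yields a non-decreasing sequence of
    rationals converging to P(S) from below.
    """
    limit = Fraction(1, 2 ** (2 * len(s) + 1))
    k = 0
    while True:
        k += 1
        yield limit * (1 - Fraction(1, 2 ** k))
```

Feeding it a string and drawing a few approximations shows the promised behavior: each value is at least the previous one, and all stay below the limit P(S).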
Yes, SI is such a semimeasure, but the individual programs within SI as usually defined are not. You aren’t feeding the input data into programs in SI; you match program outputs to the input data.
I think you can have an equivalent definition of SI in terms of programs that consume input data and print rational numbers. Or at least, I don’t see any serious obstacles to doing so. It will also be expensive to approximate, and you’ll need to renormalize as you go (and do other things like converting all sequences of rationals to non-decreasing sequences etc.), but at least you won’t have the problem that “hypotheses” grow without bound.
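The “converting to non-decreasing sequences” step is cheap, for what it’s worth. A running maximum preserves the supremum while enforcing monotonicity (a sketch, with an invented function name):

```python
def make_nondecreasing(seq):
    """Turn an arbitrary sequence of approximations into a
    non-decreasing one with the same supremum, via a running maximum."""
    best = None
    for x in seq:
        if best is None or x > best:
            best = x
        yield best
```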
That might be useful in some context, yes. But you’ll still have both the program which is longer but assigns probability 1 to the observations so far, and a program that is shorter but assigns a much, much smaller probability (on the order of 2^-L, where L is the length difference).
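To make that arithmetic concrete (the lengths here are made up): under the standard 2^-length prior, the two programs’ contributions to the mixture tie exactly, which is precisely why neither wins.

```python
l = 100   # hypothetical length of the short program (encodes the distribution)
L = 50    # extra bits the longer program spends hardcoding the observed data

# Each program's contribution to the SI mixture: prior weight * P(data).
short_contrib = 2.0 ** -l * 2.0 ** -L   # short program assigns P(data) = 2^-L
long_contrib = 2.0 ** -(l + L) * 1.0    # long program assigns P(data) = 1
```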
I think the hypotheses in physics correspond not to individual programs in SI, but to common prefixes. For example, QM would correspond to a prefix that converts the subsequent random bits into predicted observations with the correct probability distribution. You could treat a prefix as a probability distribution over outputs: the distribution that results from prepending that prefix to random bits. The bonus is that very simple operations on the random bits correspond to high-level mathematical functions that are otherwise difficult to write in any low-level, TM-like representation.
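A toy illustration of that prefix picture (the Bernoulli(3/4) target and the function name are my invention): a fixed program turns a stream of fair coin flips into one observation with the intended distribution, consuming only as many random bits as it needs to decide.

```python
from fractions import Fraction

def bernoulli_from_bits(bits, p=Fraction(3, 4)):
    """Consume fair random bits, refining a uniform number in [0, 1)
    bit by bit, until it is certainly below p (emit 1) or certainly
    at or above p (emit 0)."""
    lo, hi = Fraction(0), Fraction(1)
    for b in bits:
        mid = (lo + hi) / 2
        if b:
            lo = mid
        else:
            hi = mid
        if hi <= p:
            return 1
        if lo >= p:
            return 0
```

With fair input bits the output is 1 with probability exactly 3/4: the bit patterns 0 and 10 (total probability 1/2 + 1/4) map to 1, and 11 (probability 1/4) maps to 0.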
Yeah, I agree, there’s this inconvenient competition between shorter programs that encode a probability distribution and longer ones that “hardcode” the random bits.
Here’s a random idea, I’m not sure if it works. Formulating SI as a mixture of semimeasures doesn’t require any particular weighting of these semimeasures, you just need all weights to be nonzero. We can try making the weights of longer programs decrease faster than 2^-L, so the shorter programs outcompete the longer ones.
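A toy version of the idea, with programs stood in for by (length, P(data)) pairs and all numbers invented: under the standard 2^-L prior the distribution-encoding program and the bit-hardcoding program tie, while a steeper 2^-(2L) prior lets the shorter one dominate.

```python
def contribution(length, p_data, decay=1):
    """A program's contribution to the mixture under a 2^-(decay * length)
    prior: prior weight times the probability it assigns to the data."""
    return 2.0 ** (-decay * length) * p_data

short_prog = (100, 2.0 ** -50)  # encodes a distribution, P(data) = 2^-50
long_prog = (150, 1.0)          # 50 bits longer, hardcodes the observations
```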
You’ll hit the opposite problem, I think: the winning program can itself implement some approximation of SI. It would also be bad prediction-wise: the dominating hypotheses would tend to have a doomsday (because when you approximate SI you have to set a limit on the number of steps). And if you can only learn that the doomsday hypothesis was wrong after the doomsday failed to happen… that’s not good.
I don’t think the distinction between ‘data’ and ‘program’ is fundamental in any way. In any universe where a Turing machine (or a bounded TM) can exist, the data is a program too.
In general, I agree that reasoning about approximations to SI is trickier than reasoning about SI itself. But I don’t see how we can confidently say things like “any method of approximating SI will have such-and-such problem”. Such statements seem to be caused by limitations of our imagination, not limitations of SI. I propose to wrap up this discussion and agree to disagree, at least until we come up with some hard math about the limitations of SI.
I was originally speaking of vanilla SI, not of some not-yet-defined methods.
And I don’t think it’s even a limitation of SI. It just seems to me that the code-vs-data distinction is, in the case of machines that can run interpreters*, entirely arbitrary, subjective, and a matter of personal taste.
* and not only can run interpreters, but ultimately should, if we are to represent a world that contains computers.