Presumably the idea is that 3^^^3 is computable by a fairly short program (for an abstract Turing machine—the universe does not contain the resources to actually run it), and so the Solomonoff prior assigns hypotheses involving the number a probability vastly larger than 1/3^^^3. Hence Solomonoff prior + large numbers computable by small programs --> Pascal’s mugging.
This could be taken as an argument against using the Solomonoff prior for anything (besides the obvious one, that it’s uncomputable). It raises all short hypotheses containing gigantic numbers to prominence, compared with computationally random numbers of the same magnitude, merely because there are short programs that compute them.
Presumably the idea is that 3^^^3 is computable by a fairly short program
Irrelevant.
Solomonoff induction evaluates programs that generate strings of possible future observations conditioned on all your past observations (or equivalently, programs that generate strings of past and future observations, and counts only those that generate a string whose prefix matches your previous observations). These programs correspond to models of the universe (observed from your particular point of view).
Models of the universe where life goes on as we know it until, at a certain point, you suddenly get 4^^^^4 years of fun are presumably much more complex than models of the universe where life just keeps going on as we know it.
In particular, if Solomonoff induction assigns to models where starting from now you get X years of fun a total probability p(X) that asymptotically decreases fast enough with X so that U(X) * p(X) also decreases, it will not be subject to Pascal’s mugging. Any reasonable light-tailed probability distribution would have that property w.r.t. any reasonable utility function. I see no reason to believe that the Solomonoff distribution would behave otherwise.
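To make the “programs whose output matches the observed prefix” picture above concrete, here is a minimal sketch of a Solomonoff-style mixture over a tiny, hand-enumerated hypothesis class. The hypothesis strings and description lengths are invented for illustration; real Solomonoff induction runs over all programs and is uncomputable.

```python
# Toy sketch of a Solomonoff-style mixture (NOT real Solomonoff induction:
# the true mixture runs over all programs and is uncomputable).
# Each "program" is stood in for by the observation string it would generate,
# weighted by 2^-(description length in bits); the lengths are made up.

hypotheses = [
    # (name, predicted observation string, assumed description length in bits)
    ("life goes on as usual",                "ABABABABABABAB", 20),
    ("usual life, then 4^^^^4 years of fun", "ABABABABFFFFFF", 45),
    ("random-looking world",                 "QXZPLMRRTOWKVN", 60),
]

observed_prefix = "ABABAB"  # what the agent has seen so far

# Keep only programs whose output starts with the observed prefix,
# then renormalize their 2^-length prior weights into a posterior.
consistent = [(n, s, l) for (n, s, l) in hypotheses if s.startswith(observed_prefix)]
total = sum(2.0 ** -l for (_, _, l) in consistent)
for name, _, l in consistent:
    print(f"{name}: posterior ~ {2.0 ** -l / total:.3g}")
```

Under these made-up numbers the “sudden 4^^^^4 years of fun” model survives the prefix filter but keeps only about a 2^-25 share of the posterior, which is the sense in which more complex continuations are down-weighted.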
Consider a probabilistic model over the world which is a light-tailed probability distribution w.r.t. any physical quantity. Light-tailed means that p(X) decreases at least exponentially with X.
If your utility function U grows sub-exponentially w.r.t. any physical quantity X (typical utility functions grow sub-linearly), then p(X) * U(X) decreases at least exponentially with X, therefore the mugging fails.
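As a quick numeric illustration of that claim, here is a sketch with an assumed exponential tail p(X) = 2^-X and logarithmic utility U(X) = log(X), both chosen purely for the example:

```python
import math

# Illustrative only: assume p(X) = 2^-X (light-tailed) and U(X) = log(X)
# (sub-exponential, indeed sub-linear). The exponential tail crushes the
# product, so enormous promised X contributes almost nothing to expected
# utility.
for X in [10, 100, 1000, 10000]:
    p = 2.0 ** -X        # underflows to 0.0 for very large X, which is the point
    U = math.log(X)
    print(X, p * U)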
Consider a probabilistic model over the world which is a light-tailed probability distribution w.r.t. any physical quantity. Light-tailed means that p(X) decreases at least exponentially with X.
We do not know what physical quantities exist, and Solomonoff induction requires us to consider all computable possibilities compatible with observations so far. Any distribution p so light-tailed as to go to zero faster than every computable function must itself be uncomputable, and therefore inaccessible to Solomonoff induction. Likewise universally sub-exponentially growing utility functions.
We do not know what physical quantities exist, and Solomonoff induction requires us to consider all computable possibilities compatible with observations so far.
Yes.
Any distribution p so light-tailed as to go to zero faster than every computable function must itself be uncomputable, and therefore inaccessible to Solomonoff induction.
It doesn’t have to decrease faster than every computable function, only to decrease at least as fast as an exponential function with negative exponent.
Solomonoff induction doesn’t try to learn your utility function. Clearly, if your utility function is super-exponential, then p(X) * U(X) may diverge even if p(X) is light-tailed.
I can’t imagine how you could possibly think that for Solomonoff induction, p(X) would decrease exponentially for any reasonable physical quantity X. Do you really think that the number of bits required to specify a universe with X >= X_0 grows linearly with X_0?
Any natural number X_0 can be generated by a program of length KolmogorovComplexity(X_0), which is at most log_2(X_0) + C bits, where C is a constant dependent on the UTM and not dependent on X_0. Numbers with a special form, such as 4^^^^4, obviously have a Kolmogorov complexity much smaller than log_2(X_0) + C.
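For instance, here is a sketch of a short program for Knuth’s up-arrow numbers: its text grows only with the number of digits of n and the arrow count, even though the value it denotes explodes. Actually evaluating it for arguments like 4^^^^4 is of course hopeless; the small calls below are just to show it runs.

```python
def up(a, n, k):
    """Knuth's up-arrow: a followed by k up-arrows, then n (k=1 is ordinary exponentiation)."""
    if k == 1:
        return a ** n
    if n == 0:
        return 1
    return up(a, up(a, n - 1, k), k - 1)

# Only tiny cases are evaluated; the point is that the *program text* stays a
# few hundred bytes no matter how large a number (e.g. 4^^^^4) it denotes.
print(up(2, 3, 1))   # 2^3   = 8
print(up(2, 3, 2))   # 2^^3  = 2^(2^2) = 16
print(up(2, 3, 3))   # 2^^^3 = 2^^4    = 65536
```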
But that’s beside the point.
In the scenario we are considering, the agent makes the prediction after it has been offered the deal. Solomonoff induction considers only programs which are consistent with the observation history, which includes the observation of having been offered the deal. Therefore, all these programs already include the specification of the amount of valuables being offered (X = 4^^^^4 years of fun, in this example). This means that the 2^-(KolmogorovComplexity(X)) factor in the probability estimate is normalized out.
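Here is a toy numeric illustration of the cancellation being claimed here (whether it actually helps against the mugging is exactly what the following replies dispute); all bit counts are invented placeholders.

```python
# Toy illustration of "the 2^-K(X) factor is normalized out": if every
# program consistent with the observed offer pays the same 2^-K(X) cost
# for specifying X, that factor cancels once we condition on the offer
# having been made. The particular bit counts below are made up.
k_of_X       = 200   # assumed bits to specify X = 4^^^^4 (illustrative)
extra_honest = 30    # assumed extra bits for "...and grant it iff subject pays up"
extra_empty  = 10    # assumed extra bits for "...and grant nothing"

w_honest = 2.0 ** -(k_of_X + extra_honest)
w_empty  = 2.0 ** -(k_of_X + extra_empty)

# Conditional probability that paying actually buys the fun, given the offer:
posterior = w_honest / (w_honest + w_empty)
print(posterior)   # identical to 2^-30 / (2^-30 + 2^-10): the K(X) term has cancelled
```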
Yes, the relevant probability is conditional on having been offered the deal. I still have no idea how this could possibly help.
Let’s say that your utility is logarithmic in X, the years of fun. What you need to avoid the mugging is that P(more than X years of fun iff I pay up | deal for X years offered) goes to zero faster than log(X) grows. But the number of bits in the program “offer n^^^n years of fun, and grant this iff subject pays up” grows only linearly with n, while log(n^^^n) grows, well, superexponentially. Normalizing by the total weight of all programs that offer the deal makes the probability go further up, not down.
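The narrower arithmetic contrast being invoked here, that the description of the number grows far more slowly than the log of the number itself, can be sketched with small tetration towers (2^^h stands in for towers like 3^^^3, which are far too large to evaluate); whether this settles the conditional-probability question is what the replies below dispute.

```python
def tower(a, h):
    """a^^h: a tower of h copies of a (tetration)."""
    r = 1
    for _ in range(h):
        r = a ** r
    return r

# Contrast the length of a textual description of the number with log2 of
# the number itself (bit_length is a stand-in for log2).
for h in range(1, 6):
    n = tower(2, h)
    description = f"2^^{h}"   # grows by roughly one character per level
    print(description, len(description), n.bit_length())
```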
What you need to avoid the mugging is that P(more than X years of fun iff I pay up | deal for X years offered) goes to zero faster than log(X) grows
No.
In order to avoid the mugging you need just that
P(more than X years of fun iff I pay up | deal for X years offered) * (log(X) - MarginalU($5)) < P(Y years of fun iff I pay up | deal for X years offered) * (log(Y) - MarginalU($5))
where Y is the number of years of fun you will get anyway if the mugger gives you nothing (assuming, for simplicity, that the mugger either gives you X or more years of fun or gives you nothing, but the argument can be generalized to scenarios where the mugger may give you Z < X years of fun).
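A minimal sketch of evaluating that comparison follows; every number in it (the two conditional probabilities, X, Y, and the marginal utility of the $5) is an arbitrary placeholder, not anything Solomonoff induction would actually output.

```python
import math

# Placeholder numbers only -- none of these values come from the discussion
# above; a real reasoner would have to supply its own.
p_X_iff_pay = 1e-12   # P(more than X years of fun iff I pay up | deal for X years offered)
p_Y_iff_pay = 0.999   # P(Y years of fun iff I pay up | deal for X years offered)
X = 10 ** 100         # stand-in for the promised amount (4^^^^4 itself is not representable)
Y = 80                # years of fun expected if the mugger gives you nothing
marginal_u_5_dollars = 1e-9   # marginal utility of the $5, in the same log-years units

lhs = p_X_iff_pay * (math.log(X) - marginal_u_5_dollars)
rhs = p_Y_iff_pay * (math.log(Y) - marginal_u_5_dollars)

# Per the condition above, the mugging is avoided whenever lhs < rhs.
print("mugging avoided:", lhs < rhs)
```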
the number of bits in the program “offer n^^^n years of fun, and grant this iff subject pays up” grows only linearly with n
That seems incorrect.
I think your confusion stems from the fact that the length of the (shortest) program “write n^^^n on the tape” is K(n^^^n) ≤ log_2(n) + C. But the length of programs consistent with “offer n^^^n years of fun, and grant this iff subject pays up” must grow faster than K(n^^^n).
Proof by reductio ad absurdum:
Suppose that the length of the shortest program A(n^^^n) = “offer n^^^n years of fun, and grant this iff subject pays up” grows with K(n^^^n). Then, up to some additive constant, Len(A(n^^^n)) is the length of the concatenation of two shortest programs:
W(n^^^n) = “write n^^^n on the tape” and
B = “read X from the tape, offer X years of fun, and grant this iff subject pays up”
where Len(W(n^^^n)) = K(n^^^n) and Len(B) is a constant independent of n^^^n.
Consider the posterior: post = p(grant this iff subject pays up | offer n^^^n years of fun) = p(“offer n^^^n years of fun, and grant this iff subject pays up”) / p(“offer n^^^n years of fun”)
Define the shortest program O(n^^^n) = “offer n^^^n years of fun”. Again, if Len(O(n^^^n)) grows with K(n^^^n), then up to some additive constant Len(O(n^^^n)) is the length of the concatenation of W(n^^^n) and program C = “read X from the tape, offer X years of fun” which doesn’t depend on n^^^n.
That is,
Len(A(n^^^n)) = K(n^^^n) + Len(B)
Len(O(n^^^n)) = K(n^^^n) + Len(C)
Therefore
post =~= 2^-(Len(A(n^^^n)) - Len(O(n^^^n))) = 2^-((K(n^^^n) + Len(B)) - (K(n^^^n) + Len(C))) = 2^-(Len(B) - Len(C))
As you can see this posterior doesn’t depend on n^^^n or even on n, which is clearly inconsistent with the notion (formalized in a theorem by Solomonoff) that Solomonoff induction learns an accurate model of the environment. Therefore, the assumption that there is a shortest program consistent with “offer n^^^n years of fun, and grant this iff subject pays up” whose length grows with K(n^^^n) must be incorrect.
What you need to avoid the mugging is that P(more than X years of fun iff I pay up | deal for X years offered) goes to zero faster than log(X) grows
No.
Okay, you’re saying that as X goes up, the probability of getting X years of fun even if you don’t pay up also goes up, because any program that offers a deal of X years has to include a specification of the number X? So the expected utility of not paying up doesn’t stay constant as we vary X, but increases with X (at least asymptotically, for very large X)?
Well, you’re right on that, and that’s in fact a point I hadn’t considered, thanks. But I was replying to this:
In particular, if Solomonoff induction assigns to models where starting from now you get X years of fun a total probability p(X) that asymptotically decreases fast enough with X so that U(X) * p(X) also decreases, it will not be subject to Pascal’s mugging.
If by this you mean something else than that P(more than X years of fun iff I pay up | deal for X years offered) log(X) → 0, then I don’t understand what you mean by p(X). If you mean e.g. p(X) = P(deal for X years offered), then why would p(X) U(X) → 0 avoid Pascal’s mugging? (Not that it does go to zero.)
As you can see this posterior doesn’t depend on n^^^n or even on n, which is clearly inconsistent with the notion (formalized in a theorem by Solomonoff) that Solomonoff induction learns an accurate model of the environment.
That theorem says, roughly (actually I’m just giving a particular consequence), that given a particular world program, after seeing a certain finite number of bits produced by this program, Solomonoff induction will predict all future bits correctly. The particular number of bits needed initially depends, of course, on the program. (More generally: Given a computable probability distribution over world programs, with probability one there is some number of bits after which Solomonoff induction’s conditional distribution over subsequent bits will equal the “true” conditional distribution.) Of course, Solomonoff’s theorem only allows the agent to observe the environment, not interact with it, but that doesn’t seem to be the issue here (we can consider Hutter’s variants instead).
You are not keeping the world program fixed, and for each world program considered, you only talk about what happens after the agent has a certain number of bits (which is fixed given the world program, i.e. you don’t let it tend to infinity), so the theorem does not apply.
What antecedent do you want to deny in your argument, anyway? If your argument worked, it would still work if we replaced “grant X years of fun” by “write the symbol FUN on the tape X times”. But there is certainly a program B that reads X from the internal tape, writes X to the output tape, reads a symbol from the input tape, and writes FUN X times iff the input symbol is ACCEPT, and similarly for C and W(n^^^n).
Prove it.