By ‘mean of the utility function’, I meant the mean taken over all possible universes rather than just valid universes. The validity constraint forces the expected utility to diverge from this mean; it must, if the agent is to make any useful decisions!
Okay. In that case there are two reasons that mugger hypotheses are still important: first, the unupdated expected utility is not necessarily anywhere near the naive tail-less expected utility; second, while the central limit theorem shows that updating on observations is unlikely to shift the utility of the tails by much relative to the bounds on the utility function, the shift can still be large relative to the actual utility.
The way I’m approaching this is to ask whether most of the expected utility comes from high-probability events or from low-probability ones.
My entire post concerns the subset of universes with probabilities approaching 1/infinity, corresponding to programs whose length goes to infinity. The high-probability scenarios (shorter-program universes) don’t matter in mugger scenarios; we categorically assume they all have boring, extremely low utilities (the mugger is joking/lying/crazy).
The utility of the likely scenarios is essential here. If we don’t take into account the utility of $5, we have no obvious reason not to pay the mugger. What matters is the ratio between the utility differences of the various actions due to the likely hypotheses and those due to the high-utility hypotheses.
Your observations have some probability P(T|N) of retaining a hypothesis of length N. I don’t see why this would depend that strongly on the value of N.
In AIXI-models, hypothesis acceptance is not probabilistic; it is completely binary: a universe program either perfectly fits the observation history or it does not. If even one bit is off, the program is ignored.
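A minimal toy sketch of this all-or-nothing filtering (my own illustration; `run_universe` is a hypothetical stand-in for executing a program on AIXI’s universal machine):

```python
def run_universe(program: str, n_bits: int) -> str:
    # Hypothetical stand-in for running `program` on a universal Turing
    # machine and reading the first n_bits of its output tape; in this
    # toy model a "program" just prints its own bits.
    return program[:n_bits]

def is_valid(program: str, observations: str) -> bool:
    """Binary acceptance: the program's output must reproduce the
    observation history exactly; one mismatched bit rejects it."""
    return run_universe(program, len(observations)) == observations

O = "0110"                      # observation history so far
print(is_valid("01101101", O))  # True: matches O bit-for-bit, retained
print(is_valid("01111101", O))  # False: one bit off, ignored entirely
```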
That is a probability (well really a frequency) taken over all hypotheses of length N (or L if you prefer).
It’s unfortunate that I started using N for program length in my prior post; that was a mistake. L was the term for program length in the EU equation. L (program length) matters because of the Solomonoff prior complexity penalty: 2^-L.
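For reference, here is a sketch of how that penalty enters the EU equation (notation reconstructed from this thread; the exact form in the original post may differ):

```latex
EU(a) \approx \sum_{p\ \mathrm{valid}} 2^{-L(p)}\, U(p, a)
```

where the sum runs over valid universe programs p, L(p) is program length, and U(p, a) is the utility of the future that p generates given action a.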
The space of valid programs of length L, for any L, is simply all possible programs of length L, which we expect to be a set of size around 2^L.
Well, an O(1) factor less, since otherwise our prior measure would diverge, but you don’t have to write it explicitly; when working with Kolmogorov complexity, you expect everything to be within a constant factor.
Now consider O:{1}. We have cut out exactly half of the program space. O:{11} cuts out 3/4 of the Tegmark, and in general an observation history of length(O) bits filters the universe space down to 2^-length(O) of its previous size, removing a fraction 1 − 2^-length(O) of the possible universes; but there are an infinite number of total universes.
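As a toy count of this idealized filtering (treating each length-L program as one that simply prints its own bits, as in the sketch above):

```python
from itertools import product

L = 12  # program length
programs = ["".join(bits) for bits in product("01", repeat=L)]

for O in ["1", "11", "111"]:
    kept = [p for p in programs if p[:len(O)] == O]
    # surviving fraction is 2^-length(O): 0.5, 0.25, 0.125
    print(O, len(kept) / len(programs))
```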
No, not quite. Observations are not perfectly informative. If someone wanted to optimally communicate their observations, they would use such a system, but a real observation will not be perfectly optimized to rule out half the hypothesis space. We are reading bits from the output of the program, not its source code!
However, for length(P) > length(O) + C, for some small constant C, valid programs are absolutely guaranteed: specifically, there are programs which simply directly encode random strings that happen to align with O. This set of programs corresponds to ‘chaos’.
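Here is a sketch of the construction, with a hypothetical 2-bit ‘print what follows’ opcode standing in for the constant C:

```python
PRINT = "10"  # hypothetical 2-bit 'print what follows' opcode, so C = 2

def chaos_program(O: str, junk: str) -> str:
    """Directly encodes the observed string plus arbitrary trailing bits;
    its output matches O by construction, so O can never falsify it."""
    return PRINT + O + junk

O = "0110"
print(chaos_program(O, "101"))  # '100110101', valid for any junk bits
for L in range(len(O) + 2, len(O) + 6):
    free_bits = L - len(O) - 2
    # 2^free_bits distinct direct-encoding programs exist at length L
    print(L, 2 ** free_bits)
```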
I don’t think this set behaves how you think it behaves. 1 − 2^-length(O) of this set will be ruled out, but there are more programs, with more structure than “print this string”, that don’t get falsified: they actually have enough structure to reproduce our observations (about K(O) bits) and they use the leftover bits to encode various unobservable things that might have high utility.
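In counting terms, the claim is roughly that the valid programs of length L number about

```latex
\left|\{\, p : L(p) = L,\ p\ \mathrm{valid} \,\}\right| \approx 2^{\,L - K(O)}
```

since about K(O) bits go to reproducing the observations and the remaining L − K(O) bits are free to encode unobservables (a rough sketch; constant factors ignored as usual).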
Looking at your conclusions, you can actually replace l(O) with K(O) and everything qualitatively survives.
The utility of the likely scenarios is essential here. If we don’t take into account the utility of $5, we have no obvious reason not to pay the mugger.
No, not necessarily. It could be an arbitrarily small cost: the mugger could say “just look at me for a nanosecond”, and this tiny action of almost no cost could still not be worthwhile.
If AIXI cannot find a program P, matching the full observation history O, which generates a future we would describe as (the mugger really does have matrix powers and causes massive negative reward) under the constraint length(P) < length(O), then AIXI’s expected utility for the mugger futures goes to zero. The length(P) < length(O) condition is a likelihood bound.
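As a rough gloss in prior-weight terms (reusing the 2^-L penalty from above): if every valid mugger program satisfies length(P) ≥ length(O), then each such program’s weight obeys

```latex
w(P) = 2^{-L(P)} \le 2^{-L(O)} \to 0 \quad \text{as } L(O) \to \infty
```

so the mugger futures’ contribution to expected utility is suppressed exponentially in the length of the observation history.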
AIXI essentially stops considering theories beyond some upper improbability (programs much longer than its observation history).
but a real observation will not be perfectly optimized to rule out half the hypothesis space.
For AIXI, each observation rules out exactly half of the hypothesis space, because its hypothesis space is the entirety of everything.
there are more programs, with more structure than “print this string”, that don’t get falsified: they actually have enough structure to reproduce our observations (about K(O) bits) and they use the leftover bits to encode various unobservable things that might have high utility
No, this is a contradiction. The programs of K(O) bits are the first valid universes, and by the definition/mapping of the mugger problem onto AIXI-logic, those correspond to the mundane worlds where the mugger is [joking, lying, crazy]. If the program is valid and it is K(O) bits, then the leftover bits can’t matter; as you said yourself, they are unobservable! And any unobservable bits are thus unavailable to the utility function.
Moreover, they are necessarily just repeats: if the program is K(O) bits, then it appeared far earlier than length(O) in the ensemble, and is some mundane low-utility universe.