I just read through the linked paper by Peter De Blanc (thanks), but I don’t see your point. The paper shows:
“After making some plausible assumptions (and maybe one not-so-plausible assumption), we show that if the utility function is unbounded, then the expected utility of any policy is undefined.”
If De Blanc’s conclusion is correct, it only shows that unbounded utility functions result in potentially unbounded/undefined expected utilities. The limit I explore in the OP should still hold for bounded utility functions, regardless of how they are bounded.
I’ve run a bunch of simulations calculating expected utility over different distributions, and the ones that seem most like reality, i.e. with fat tails of utility that would not converge if the utility function were unbounded, behave much like a standard random walk, which is hard not to find obvious in retrospect. If you have a mugger scenario where the high-utility possibilities are chaotic in the sense described in the post, +/-(typical utility magnitude)*P(typical crazy scenario)*sqrt(# of crazy scenarios) therefore seems like a qualitatively correct model. This is greater in absolute value than the naive model +/-(typical utility magnitude)*P(typical crazy scenario), so I don’t think that this ‘solves’ the thought experiment. I can provide more mathematical details if anything is unclear.
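For concreteness, here is a minimal sketch of the kind of simulation I have in mind (the probability p, utility magnitude u, and hypothesis counts below are placeholders, not calibrated to anything):

```python
# Toy model: n independent 'crazy' hypotheses, each with probability p and
# utility +u or -u chosen at random.  The naive model keeps a single such
# term (u * p); the random-walk model sums all n of them.
import numpy as np

rng = np.random.default_rng(0)

def typical_tail_magnitude(n, p, u, trials=500):
    """Average |sum_i p * (+/- u)| over random sign assignments."""
    totals = []
    for _ in range(trials):
        signs = rng.choice([-1.0, 1.0], size=n)
        totals.append((p * u * signs).sum())
    return float(np.mean(np.abs(totals)))

p, u = 1e-6, 1e9  # placeholder probability and utility magnitude
for n in [1, 100, 10_000]:
    print(n, typical_tail_magnitude(n, p, u), u * p * np.sqrt(n))
# The measured magnitude tracks u * p * sqrt(n) up to a constant factor,
# i.e. it grows with n instead of staying at the naive u * p.
```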
I’m not following your math. Consider an agent on cycle 0, or an agent fed only pure white noise. Its observations don’t limit the space of universes, so its computed utilities will then be random samples from the utility function over the space of all programs, and should then converge to the mean of the utility function by the central limit theorem.
Simplified AIXI type expected utility for an action A and observation history O is something like:
EU(A) = SUM[L=[0,+inf]] ( 2^-L PROGSUM(L, A, O))
PROGSUM(L, A, O) = SUM[P, length(P)==L] ( T(P(A),O) U(P(A)))
PROGSUM sums the utility for every program of length L. The T term is 1 for programs that match the observation history, 0 for those that don’t. It filters out hypotheses.
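To make sure we mean the same object, here is a brute-force toy rendering of that sum. The ‘programs’ are just bitstrings, execution is faked with a hash, the action argument is dropped, and the utility function is an arbitrary bounded function; none of this is real AIXI, it is only meant to show where the T filter and the 2^-L penalty sit in the computation:

```python
from itertools import product
from hashlib import sha256

def run_program(bits, n_out):
    """Fake 'execution' of a program: a deterministic pseudorandom bit stream."""
    h = sha256(bits.encode()).digest()
    return ''.join(str((h[i // 8] >> (i % 8)) & 1) for i in range(n_out))

def utility(bits):
    """Arbitrary bounded utility in [-1, 1], a stand-in for U(P(A))."""
    return (sha256(('U' + bits).encode()).digest()[0] / 255.0) * 2 - 1

def progsum(L, obs):
    """Sum of T * U over every program of length L (T is the observation filter)."""
    total = 0.0
    for p in product('01', repeat=L):
        bits = ''.join(p)
        if run_program(bits, len(obs)) == obs:  # T(P(A), O) == 1
            total += utility(bits)
    return total

def expected_utility(obs, max_len=14):
    """Truncated version of EU = SUM over L of 2^-L * PROGSUM(L, A, O)."""
    return sum(2.0 ** -L * progsum(L, obs) for L in range(1, max_len + 1))

print(expected_utility('1101'))
```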
Mapping the concept of probability onto this, considering scenarios of low probability such as 2^-N, where N is very large in comparison to length(O), means considering the subsets of the equation starting at programs of length N. For very large N in comparison to length(O), the T term filters less and less of the hypothesis space, and the result converges again to an infinite random sampling of the utility function, and thus to the mean of that function.
Specifically, for N > length(O), T can at most filter, and thus contribute, 2^-(N - length(O)) of the hypothesis space, and its contribution thus goes to zero as N grows.
its computed utilities will then be random samples from the utility function over the space of all programs, and should then converge to the mean of the utility function by the central limit theorem.
Well the mean of the utility function is just the expected utility. The way I’m approaching this is to ask whether most of the expected utility comes from high probability events or low probability ones: does the distribution look basically the same if we cut off the tails? We are specifically referring to distributions where even some small events from the tails (the ‘mugger hypotheses’) would make large contributions if they were considered individually. Then, if we model them as ‘positive mugger hypotheses’ and ‘negative mugger hypotheses’, the expected absolute value of the contribution from the tails just gets larger.
Simplified AIXI type expected utility for an action A and observation history O is something like [equation].
That is unnormalized, but otherwise accurate enough.
For very large N in comparison to length(O), the T term filters less and less of the hypothesis space, and the result converges again to an infinite random sampling of the utility function, and thus to the mean of that function.
Do you mean “less of the hypothesis space” in a relative or absolute sense? I don’t think either would hold. Your observations have some probability P(T|N) to retain a hypothesis of length N. I don’t see why this would depend that strongly on the value of N. In that case, the ratio of unfalsified to falsified hypotheses of length N would be about the same for all N. How did you get the number 2^-(N - length(O)) as a limit on the amount of the hypothesis space that is filtered (and do you mean retained by the filter or removed by the filter when you say ‘filtered’)?
its computed utilities will then be random samples from the utility function over the space of all programs, and should then converge to the mean of the utility function by the central limit theorem.
Well the mean of the utility function is just the expected utility.
There are a number of utility terms in the AIXI equation. The utility function is evaluated for every hypothesis/program/universe, forward evaluated over all future action paths, giving one best utility for just that universe, and the total expected utility is then the sum over all valid universes weighted by their complexity penalty.
By ‘mean of the utility function’, I meant the mean of the utility function over all possible universes rather than just valid universes. The validity constraint forces the expected utility to diverge from the mean of the utility function—it must for the agent to make any useful decisions!
So the total expected utility is not normally the mean utility, but it reduces to it in the case where the observation filter is removed.
The way I’m approaching this is to ask whether most of the expected utility comes from high probability events or low probability ones
My entire post concerns the subset of universes with probabilities approaching 1/infinity, corresponding to programs with length going to infinity. The high probability scenarios (shorter program universes) don’t matter in mugger scenarios; we categorically assume they all have boring, extremely low utilities (the mugger is joking/lying/crazy).
Your observations have some probability P(T|N) to retain a hypothesis of length N. I don’t see why this would depend that strongly on the value of N.
In AIXI-models, hypothesis acceptance is not probabilistic, it is completely binary: a universe program either perfectly fits the observation history or it does not. If even 1 bit is off, the program is ignored.
It’s unfortunate I started using N for program length in my prior post; that was a mistake, since L was the term for program length in the EU equation. L (program length) matters because of the Solomonoff prior complexity penalty: 2^-L.
How did you get the number 2^-(L - length(O)) as a limit on the amount of the hypothesis space that is filtered (and do you mean retained by the filter or removed by the filter when you say ‘filtered’)?
This simply comes from the fact that an observation history O can at most filter out only a fraction of the space of programs that are longer than it.
For example, start with an empty observation history O: {}. Clearly, this filters nothing. The space of valid programs of length L, for any L, is simply all possible programs of length L, which is expected to be a set of around 2^L in size. The sum over all programs for L going to infinity is thus the space of everything, the full Tegmark. In this case, the expected utility is simply the mean of the utility function over the full Tegmark.
Now consider O:{1}. We have cut out exactly half of the program space.
O:{11} cuts out 3/4 of the Tegmark, and in general an observation history with length(O) filters the universe space down to 2^-length(O) of its previous size, removing a fraction 1 − 2^-length(O) of the possible universes, though there are still an infinite number of total universes.
Now, let’s say we are ONLY interested in the contribution of universes of a certain prior likelihood (corresponding to a certain program length). These are the subsets of the Tegmark with programs P where length(P) = L for some L. This is a FINITE, enumerable set.
Then for JUST the subset of universes with length(P)=L, there are 2^L universes in this set. For an observation history O with length(O) > L, it is not guaranteed that there are any valid programs that match the observation history. It could be 1, could be 0.
However, for length(P) > length(O) + C, for some small C, valid programs are absolutely guaranteed. Specifically, for some constant C there are programs which simply directly encode random strings which happen to align with O. This set of programs corresponds to ‘chaos’.
Now consider the limit behavior as complexity goes to infinity. For any fixed observation history with length(O), as length(P) goes to infinity, the chaos set grows at the maximum possible rate, with 2^length(P), and dominates (because the chaos programs just fill extra length with any random bits).
In particular, for observation set O and the subset of universes with length(P)=L, there are expected to be roughly 2^-(length(O)+C) * 2^L observationally valid chaos universes. This simplifies to 2^(L-length(O)-C) valid chaos universes.
So when length(O)+C > L, there are unlikely to be any valid chaos universes. So the expected utility over this subset, EU[L], will be averaged over a small number of universes, possibly even 1 (if there are any at all that match O), or none. But as L grows larger than length(O)+C, the chaos universes suddenly appear (guaranteed) and their number grows exponentially with L, and the expected utility over that exponentially growing set quickly converges to the mean of the utility function (because the chaos universes are random).
Assuming a utility function with positive/negative bounds normalized around zero, the convergence should be to zero.
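A crude way to see this numerically, reusing the hash-faked toy programs from the earlier sketch (so not a real UTM, just the counting behavior): for each program length L, average the utility over only the programs that survive the observation filter.

```python
from itertools import product
from hashlib import sha256

def run_program(bits, n_out):
    """Fake 'execution': a deterministic pseudorandom bit stream."""
    h = sha256(bits.encode()).digest()
    return ''.join(str((h[i // 8] >> (i % 8)) & 1) for i in range(n_out))

def utility(bits):
    """Arbitrary bounded utility in [-1, 1], mean roughly zero."""
    return (sha256(('U' + bits).encode()).digest()[0] / 255.0) * 2 - 1

obs = '1101'
for L in range(1, 16):
    survivors = []
    for p in product('01', repeat=L):
        bits = ''.join(p)
        if run_program(bits, len(obs)) == obs:  # the T filter
            survivors.append(utility(bits))
    avg = sum(survivors) / len(survivors) if survivors else float('nan')
    print(L, len(survivors), round(avg, 3))
# Expected pattern: few or no survivors while L is below len(obs) + C, then
# roughly 2^(L - len(obs)) survivors per slice, whose average utility drifts
# toward the overall mean of the utility function (about zero here).
```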
By ‘mean of the utility function’, I meant the mean of the utility function over all possible universes rather than just valid universes. The validity constraint forces the expected utility to diverge from the mean of the utility function—it must for the agent to make any useful decisions!
Okay. In that case there are two reasons that mugger hypotheses are still important: first, the unupdated expected utility is not necessarily anywhere near the naive tail-less expected utility; and second, while the central limit theorem shows that updating on observations is unlikely to produce a shift in the utility of the tails that is large relative to the bounds on the utility function, that shift will still be large relative to the actual utility.
The way I’m approaching this is to ask whether most of the expected utility comes from high probability events or low probability ones
My entire post concerns the subset of universes with probabilities approaching 1/infinity, corresponding to programs with length going to infinity. The high probability scenarios (shorter program universes) don’t matter in mugger scenarios; we categorically assume they all have boring, extremely low utilities (the mugger is joking/lying/crazy).
The utility of the likely scenarios is essential here. If we don’t take into account the utility of $5, we have no obvious reason not to pay the mugger. What matters is the ratio between the utility differences of the various actions due to the likely hypotheses and those due to the high-utility hypotheses.
Your observations have some probability P(T|N) to retain a hypothesis of length N. I don’t see why this would depend that strongly on the value of N.
In AIXI-models, hypothesis acceptance is not probabilistic, it is completely binary: a universe program either perfectly fits the observation history or it does not. If even 1 bit is off, the program is ignored.
That is a probability (well really a frequency) taken over all hypotheses of length N (or L if you prefer).
It’s unfortunate I started using N for program length in my prior post; that was a mistake, since L was the term for program length in the EU equation. L (program length) matters because of the Solomonoff prior complexity penalty: 2^-L.
The space of valid programs of length L, for any L, is simply all possible programs of length L, which is expected to be a set of around 2^L in size.
Well, an O(1) factor less, since otherwise our prior measure would diverge, but you don’t have to write it explicitly; when working with Kolmogorov complexity, you expect everything to be within a constant factor.
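To spell out where the divergence would come from: if all 2^L programs of length L each received weight 2^-L, the total prior measure would be SUM[L=0..inf] ( 2^-L * 2^L ) = SUM[L=0..inf] 1, which diverges. The usual fix is to require a prefix-free machine, where Kraft’s inequality gives SUM[P] 2^-length(P) <= 1, and that is where the constant factor hides.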
Now consider O:{1}. We have cut out exactly half of the program space. O:{11} cuts out 3/4 of the Tegmark, and in general an observation history with length(O) filters the universe space down to 2^-length(O) of its previous size, removing a fraction 1 − 2^-length(O) of the possible universes, though there are still an infinite number of total universes.
No, not quite. Observations are not perfectly informative. If someone wanted to optimally communicate their observations, they would use such a system, but a real observation will not be perfectly optimized to rule out half the hypothesis space. We are reading bits from the output of the program, not its source code!
However, for length(P) > length(O) + C, for some small C, valid programs are absolutely guaranteed. Specifically, for some constant C there are programs which simply directly encode random strings which happen to align with O. This set of programs corresponds to ‘chaos’.
I don’t think this set behaves how you think it behaves. 1 − 2^-length(O) of this set will be ruled out, but there are more programs, with more structure than “print this string”, that don’t get falsified, since they actually have enough structure to reproduce our observation (about K(O) bits) and they use the leftover bits to encode various unobservable things that might have high utility.
Looking at your conclusions, you can actually replace l(O) with K(O) and everything qualitatively survives.
The utility of the likely scenarios is essential here. If we don’t take into account the utility of $5, we have no obvious reason not to pay the mugger.
No, not necessarily. The cost could be arbitrarily small: the mugger could say “just look at me for a nanosecond”, and even this tiny action of almost no cost could still not be worthwhile.
If AIXI cannot find a program P, matching the full observation history O, which generates a future we would describe as (mugger really does have matrix powers and causes massive negative reward), under the constraint that length(P) < length(O), then AIXI’s expected utility for the mugger futures goes to zero. The constraint length(P) < length(O) is a likelihood bound.
AIXI essentially stops considering theories beyond some upper improbability (programs much longer than its observation history).
but a real observation will not be perfectly optimized to rule out half the hypothesis space.
For AIXI, each observation rules out exactly half of the hypothesis space, because its hypothesis space is the entirety of everything.
there are more programs, with more structure than “print this string”, that don’t get falsified, since they actually have enough structure to reproduce our observation (about K(O) bits) and they use the leftover bits to encode various unobservable things that might have high utility
No, this is a contradiction. The programs of K(O) bits are the first valid universes, and by the definition/mapping of the mugger problem onto AIXI-logic, those correspond to the mundane worlds where the mugger is [joking, lying, crazy]. If the program is valid and it is K(O) bits, then the leftover bits can’t matter; as you said yourself, they are unobservable! And any unobservable bits are thus unavailable to the utility function.
Moreover, they are necessarily just repeats: if the program is K(O) bits, then it has appeared far earlier than length(O) in the ensemble, and is some mundane low-utility universe.
I thought you were referring only to unbounded utility functions. If utility is bounded, your statement EU(F) → 0 as p(F) → 0 is true (though the meaning of the notation might be confusing; expectations are supposed to be taken over the entire probability space), but I don’t think it gives reason to believe that everything will cancel out nicely. It is possible that my intuition is distorted by thinking about busy-beaver-sized bounds, which are unlikely to be implemented in real agents anyway.