Awesome question! I spent about a day chewing on this exact problem.
First, if our variables are drawn from finite sets, then the problem goes away (as long as we don’t have actually-infinite utilities). If we can construct everything as limits from finite sets (as is almost always the case), then that limit should involve a sequence of world models.
The more interesting question is what that limit converges to. In general, we may end up with an improper distribution (conceptually, we have to carry around two infinities which cancel each other out). That’s fine—improper distributions happen sometimes in Bayesian probability, we usually know how to handle them.
Thanks for the reply, but I might need you to explain/dumb-down a bit more.
--I get how if the variables which describe the world can only take a finite combination of values, then the problem goes away. But this isn't good enough, because e.g. "number of paperclips" seems like something that can be arbitrarily big. Even if we suppose it can't get infinitely big (though why suppose that?), we face problems; see below.
--What does it mean in this context to construct everything as limits from finite sets? Specifically, consider someone who is a classical hedonistic utilitarian. It seems that their utility is unbounded above and below, i.e. for any setting of the variables, there is a setting which is a zillion times better and a setting which is a zillion times worse. So how can we interpret them as minimizing the bits needed to describe the variable-settings according to some model $M_2$? For any $M_2$ there will be at least one minimum-bit variable-setting, which contradicts what we said earlier about every variable-setting having something which is worse and something which is better.
I’ll answer the second question, and hopefully the first will be answered in the process.
First, note that $P[X|M_2] \propto e^{\alpha u(X)}$, so arbitrarily large negative utilities aren't a problem—they get exponentiated, and yield probabilities arbitrarily close to 0. The problem is arbitrarily large positive utilities. In fact, they don't even need to be arbitrarily large, they just need to have an infinite exponential sum; e.g. if $u(X)$ is 1 for any whole number of paperclips $X$, then to normalize the probability distribution we need to divide by $\sum_{X=0}^{\infty} e^{\alpha \cdot 1} = \infty$. The solution to this is to just leave the distribution unnormalized. That's what "improper distribution" means: it's a distribution which can't be normalized, because it sums to $\infty$.
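To see both halves of that concretely, here's a quick numerical sketch (the value of $\alpha$ here is an arbitrary choice, just for illustration):

```python
import math

alpha = 0.5  # hypothetical parameter, chosen arbitrarily for illustration

# Large negative utilities are harmless: e^(alpha*u) just approaches 0.
for u in [-1, -10, -100]:
    print(f"u={u}: e^(alpha*u) = {math.exp(alpha * u):.3g}")

# Flat positive utility (1 util for any number of paperclips): every term
# of the normalizer sum_{X=0}^inf e^(alpha*1) is the same constant, so the
# partial sums grow linearly and the full sum diverges.
for n in [10, 1_000, 100_000]:
    partial_Z = sum(math.exp(alpha * 1) for _ in range(n))
    print(f"first {n} terms of Z: {partial_Z:.4g}")
```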
The main question here seems to be "ok, but what does an improper distribution mean in terms of bits needed to encode $X$?". Basically, we need infinitely many bits in order to encode $X$, using this distribution. But it's "not the same infinity" for each $X$-value—not in the sense of "the set of reals is bigger than the set of integers", but in the sense of "we constructed these infinities from a limit, so one can be subtracted from the other". Every $X$-value requires infinitely many bits, but one $X$-value may require 2 bits more than another, or 3 bits less than another, in such a way that all these comparisons are consistent. By leaving the distribution unnormalized, we're effectively picking a "reference point" for our infinity, and then keeping track of how many more or fewer bits each $X$-value needs, compared to the reference point.
In the case of the paperclip example, we could have a sequence of utilities $u_n(X)$ which each assign utility $X$ to any number of paperclips $X < n$ (i.e. 1 util per clip, up to $n$ clips), and then we take the limit $n \to \infty$. Then our $n$th unnormalized distribution is $P_{\text{unnorm}}[X|M_n] = e^{\alpha X} I[X < n]$, and the normalizing constant is $Z_n = \frac{1 - e^{\alpha n}}{1 - e^{\alpha}}$, which grows like $O(e^{\alpha n})$ as $n \to \infty$. The number of bits required to encode a particular value $X < n$ is
$$-\log \frac{P_{\text{unnorm}}[X|M_n]}{Z_n} = \log \frac{1 - e^{\alpha n}}{1 - e^{\alpha}} - \alpha X$$
Key thing to notice: the first term, $\log \frac{1 - e^{\alpha n}}{1 - e^{\alpha}}$, is the part which goes to $\infty$ with $n$, and it does not depend on $X$. So, we can take that term to be our "reference point", and measure the number of bits required for any particular $X$ relative to that reference point. That's exactly what we're implicitly doing if we don't normalize the distribution: ignoring normalization, we compute the number of bits required to encode $X$ as
$$-\log P_{\text{unnorm}}[X|M_n] = -\alpha X$$
… which is exactly the “adjustment” from our reference point.
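Here's a quick numeric check of that cancellation (a sketch: $\alpha$ is an arbitrary choice, and I'm using natural log, so "bits" are really nats, matching the formulas above). The absolute code length under $M_n$ blows up as $n$ grows, but the difference between two $X$-values stays fixed at $\alpha (X_2 - X_1)$, exactly what the unnormalized distribution reports.

```python
import math

alpha = 0.1  # hypothetical parameter, chosen arbitrarily for illustration

def code_length(X, n):
    """Nats to encode X under the normalized truncated model M_n:
    -log(P_unnorm[X|M_n] / Z_n) = log(Z_n) - alpha*X, for X < n."""
    Z_n = (1 - math.exp(alpha * n)) / (1 - math.exp(alpha))
    return math.log(Z_n) - alpha * X

# Absolute code lengths diverge with n, but the difference between two
# X-values is alpha*(X2 - X1) regardless of n: the reference point cancels.
for n in [50, 500, 5000]:
    print(f"n={n}: len(3)={code_length(3, n):.2f}, "
          f"len(3)-len(7)={code_length(3, n) - code_length(7, n):.4f}")
```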
(Side note: this is exactly how information theory handles continuous distributions. An infinite number of bits is required to encode a real number, so we pull out a term $\log dx$ which diverges in the limit $dx \to 0$, and we measure everything relative to that. Equivalently, we measure the number of bits required to encode $x$ up to precision $dx$, and as long as the distribution is smooth and $dx$ is small, the number of bits required to encode the rest of $x$ using the distribution won't depend on the value of $x$.)
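The same trick can be checked numerically for the continuous case (again just a sketch, with a standard normal as the example distribution): the cost of encoding $x$ to precision $dx$ is $-\log(p(x)\,dx)$, and the $-\log dx$ part diverges as $dx \to 0$ but is identical for every $x$, so comparisons between $x$-values never depend on it.

```python
import math

def code_length(x, dx):
    """Nats to encode a standard-normal x to precision dx:
    -log(p(x) * dx), valid when dx is small enough that p is ~constant."""
    p = math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
    return -math.log(p * dx)

# The -log(dx) term diverges as dx -> 0, but it's the same for every x,
# so differences between x-values don't depend on the precision:
for dx in [1e-3, 1e-6, 1e-9]:
    print(f"dx={dx}: len(0)={code_length(0.0, dx):.2f}, "
          f"len(0)-len(2)={code_length(0.0, dx) - code_length(2.0, dx):.4f}")
```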
Does this make sense? Should I give a different example/use more English?