I have a question about the conjecture at the end of Direction 17.5. Let $U_1$ be a utility function with values in $[0,1]$ and let $f:[0,1]\to[0,1]$ be a strictly monotone function. Then $U_1$ and $U_2=f\circ U_1$ have the same maxima. $f$ can be non-linear, e.g. $f(x)=x^2$. Therefore, I wonder whether the condition $u(y)=\alpha v(y)+\beta$ could be weakened.
No, because a non-linear transformation changes the expected value of the utility function under various distributions.
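To illustrate this point, here is a minimal sketch of my own (not part of the original argument): a strictly monotone $f$ preserves which outcomes maximize utility, but it can reverse the ranking of lotteries by expected utility.

```python
# Minimal illustrative sketch: a strictly monotone but non-linear f preserves
# the argmax over outcomes, yet it can reverse the ranking of lotteries by
# expected utility -- which is why the affine condition u = alpha*v + beta matters.

f = lambda x: x ** 2  # strictly monotone on [0, 1], non-linear

# Two lotteries over outcomes, as (probability, U1-value) pairs.
lottery_a = [(1.0, 0.6)]              # utility 0.6 for sure
lottery_b = [(0.5, 0.0), (0.5, 1.0)]  # fair coin between utilities 0 and 1

def expected(lottery, transform=lambda x: x):
    """Expected utility of a lottery, optionally after transforming U1."""
    return sum(p * transform(u) for p, u in lottery)

print(expected(lottery_a), expected(lottery_b))        # 0.6  > 0.5  : U1 prefers A
print(expected(lottery_a, f), expected(lottery_b, f))  # 0.36 < 0.5  : U2 prefers B
```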
Moreover, I wonder whether it is possible to modify $U_1$ by a small amount at a place far away from the optimal policy such that $\pi$ is still optimal for the modified utility function.
Good catch, the conjecture as stated is obviously false: we can e.g. take $U_2$ to be the same as $U_1$ everywhere except after some action which $\pi^*$ doesn't actually take, where we make it identically $0$. Some possible ways to fix it:
Require the utility function to be of the form $U:O^\omega\to[0,1]$ (i.e. not depend on actions).
Use (strictly) instrumental reward functions.
Weaken the conclusion so that we're only comparing $U_1$ and $U_2$ on-policy (but this might be insufficient for superimitation).
Require $\pi^*$ to be optimal off-policy (but it's unclear how this can generalize to finite $g$).
I think the conjecture is also false in the case that utility functions map from $O^\omega$ to $[0,1]$.
Let us consider the case of $A=\{a_1,a_2\}$ and $O=\{o_1,o_2\}$. We use $U_1(o)=1-2^{-k}$, where $k$ is the largest integer such that $o$ starts with $o_1^k$ (and $U_1(o_1^\omega)=1$). As for $U_2$, we use $U_2(o)=1-3^{-k}$, where $k$ is the largest integer such that $o$ starts with $o_1^k$ (and $U_2(o_1^\omega)=1$). Both $U_1$ and $U_2$ are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy $\pi$ that always picks action $a_1$ is the optimal policy for both $U_1$ and $U_2$ (see proof below).
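For concreteness, here is a minimal sketch (my own illustration, with hypothetical helper names) of how $U_1$ and $U_2$ are evaluated on a finite prefix of the observation sequence; the value depends only on the length of the initial run of $o_1$'s.

```python
# Minimal illustrative sketch of U1 and U2 on a finite observation prefix.
# 'obs' is a string over {'1', '2'} standing for o_1 and o_2.

def leading_o1(obs: str) -> int:
    """Largest k such that obs starts with o_1^k."""
    k = 0
    while k < len(obs) and obs[k] == '1':
        k += 1
    return k

def u1(obs: str) -> float:
    return 1 - 2.0 ** (-leading_o1(obs))

def u2(obs: str) -> float:
    return 1 - 3.0 ** (-leading_o1(obs))

# Both tend to 1 as the initial run of o_1's grows, but at different rates --
# which is exactly what breaks local equivalence around o_1^omega.
print(u1("11112"), u2("11112"))  # 0.9375, ~0.98765
```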
Note that since the policy is computable and very simple, $g_0(\pi)=\infty$ is not true, and we have $g_0(\pi)=O(1)$ instead. I suspect that the issues are still present even with an additional $g_0(\pi)=\infty$ condition, but finding a concrete example with an uncomputable policy is challenging.
Proof:
Suppose that $U_1$ and $U_2$ are locally equivalent. Let $V$ be an open neighborhood of the point $x=o_1^\omega$ and $\alpha>0$, $\beta\in\mathbb{R}$ be such that $U_1(y)=\alpha U_2(y)+\beta$ for all $y\in V$. Since $x\in V$, we have $1=U_1(x)=\alpha U_2(x)+\beta=\alpha+\beta$. Because $V$ is an open neighborhood of $o_1^\omega$, there is an integer $N$ such that $o_1^n o_2^\omega\in V$ for all $n\geq N$. For such $n\geq N$, we have
$$1-2^{-n}=U_1(o_1^n o_2^\omega)=\alpha U_2(o_1^n o_2^\omega)+\beta=\alpha(1-3^{-n})+\beta=\alpha+\beta-\alpha 3^{-n}=1-\alpha 3^{-n}.$$
This implies $\alpha=(2/3)^{-n}$. However, this value depends on $n$, so no single $\alpha$ can satisfy the equation for all $n\geq N$. Thus, our assumption that $U_1$ and $U_2$ are locally equivalent was wrong.
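As a quick numeric sanity check (my own sketch, not part of the original proof), the $\alpha$ forced at depth $n$ indeed differs from one $n$ to the next:

```python
# Quick numeric check: the alpha forced by 1 - 2^{-n} = 1 - alpha * 3^{-n}
# is alpha = (3/2)^n, which changes with n, so no single affine map works on V.
for n in (10, 11, 12):
    alpha = 2.0 ** (-n) / 3.0 ** (-n)
    print(n, alpha)  # ~57.7, ~86.5, ~129.7 -- all different
```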
Assumptions about the Solomonoff prior: For all $n$, the sequence of actions that produces the observation sequence $o_1^n$ with the highest probability is $a_1^{n-1}$ (recall that we start with observations in this setting). With this assumption, it can be seen that the policy that always picks action $a_1$ is among the best policies for both $U_1$ and $U_2$.
I think this is actually a natural behaviour for a reasonable Solomonoff prior: it is natural to expect that $o_1 a_1 o_1$ is more likely than $o_1 a_2 o_1$, and that the sequence of actions that leads to $o_1$ over $o_2$ has low complexity. Always picking $a_1$ is low complexity.
It is possible to construct an artificial UTM that ensures that "always take $a_1$" is the best policy for $U_1$ and $U_2$: a UTM can be constructed such that the corresponding Solomonoff prior assigns $3/4$ probability to the program/environment "start with $o_1$; after action $a_i$, output $o_i$". The rest of the probability mass gets distributed according to some other, more natural UTM.
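A minimal simulation sketch of this mixture (purely illustrative; the fallback environment is just a placeholder for the "more natural" part of the prior):

```python
import random

# Illustrative sketch of the mixture prior described above: with probability 3/4
# the environment is "start with o_1; after action a_i, output o_i"; with
# probability 1/4 it is some other environment (here a uniformly random one,
# standing in for the rest of the prior -- a deliberate simplification).

def sample_environment():
    if random.random() < 0.75:
        return lambda action: action                # outputs o_i after a_i
    return lambda action: random.choice([1, 2])     # placeholder fallback

env = sample_environment()
history = [1]            # the interaction starts with observation o_1
for _ in range(5):
    action = 1           # the policy "always take a_1"
    history.append(env(action))
print(history)           # with prob 3/4: all 1's, i.e. o_1^omega under this policy
```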
Then, for $U_1$, in each situation with history $o_1^n$ the optimal policy has to pick $a_1$ (the actions outside of this history have no impact on the utility): with $3/4$ probability it will get utility of at least $1-2^{-(n+1)}$, and with $1/4$ probability at least $1-2^{-n}$. Whereas, for the choice of $a_2$, with probability $3/4$ it will have utility of $1-2^{-n}$, and with probability $1/4$ it can get at most $1$.
We calculate
$$(1-2^{-(n+1)})\cdot\tfrac{3}{4}+(1-2^{-n})\cdot\tfrac{1}{4} = 1-5\cdot 2^{-(n+3)} > 1-3\cdot 2^{-(n+2)} = (1-2^{-n})\cdot\tfrac{3}{4}+\tfrac{1}{4},$$
i.e. taking action $a_1$ is the better choice.
Similarly, for $U_2$, the optimal policy has to pick $a_1$ too in each situation with history $o_1^n$. Here, the calculation looks like
$$(1-3^{-(n+1)})\cdot\tfrac{3}{4}+(1-3^{-n})\cdot\tfrac{1}{4} = 1-3^{-n}/2 > 1-3^{-n+1}/4 = (1-3^{-n})\cdot\tfrac{3}{4}+\tfrac{1}{4}.$$
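As a sanity check (my own sketch), the two comparisons above, where the lower bound for taking $a_1$ beats the upper bound for taking $a_2$, can be verified numerically:

```python
# Numeric sanity check of the two inequalities above: the lower bound on the
# value of taking a_1 exceeds the upper bound on the value of taking a_2,
# for both U1 (base 2) and U2 (base 3), at every history o_1^n.
for base in (2, 3):
    for n in range(1, 10):
        a1_lower = (1 - base ** -(n + 1)) * 3/4 + (1 - base ** -n) * 1/4
        a2_upper = (1 - base ** -n) * 3/4 + 1 * 1/4
        assert a1_lower > a2_upper, (base, n)
print("a_1 is strictly better in every checked case")
```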