I agree this is a problem, but isn’t this a problem for logical counterfactual approaches as well? Isn’t it also weird for a known fixed optimizer source code to produce a different result on this decision where it’s obvious that ‘left’ is the best decision?
If you assume that the agent chose ‘right’, it’s more reasonable to think it’s because it’s not a pure optimizer than that a pure optimizer would have chosen ‘right’, in my view.
If you form the intent to, as a policy, go ‘right’ on the 100th turn, you should anticipate learning that your source code is not the code of a pure optimizer.
I’m left with the feeling that you don’t see the problem I’m pointing at.
My concern is that the most plausible world where you aren’t a pure optimizer might look very very different, and whether this very very different world looks better or worse than the normal-looking world does not seem very relevant to the current decision.
Consider the “special exception selves” you mention—the Nth exception-self has a hard-coded exception “go right if it’s been at least N turns and you’ve gone right at most 1/N of the time”.
Now let’s suppose that the worlds which give rise to exception-selves are a bit wild. That is to say, the rewards in those worlds have pretty high variance. So a significant fraction of them have quite high reward—let’s just say 10% of them have value much higher than is achievable in the real world.
So we expect that by around N=10, there will be an exception-self living in a world that looks really good.
This suggests to me that the policy-dependent-source agent cannot learn to go left > 90% of the time, because once it crosses that threshold, the exception-self in the really good-looking world is ready to trigger its exception—so going right starts to appear really good. The agent goes right until it is under the threshold again.
If that’s true, then it seems to me rather bad: the agent ends up repeatedly going right in a situation where it should be able to learn to go left easily. Its reason for repeatedly going right? There is one enticing world, which looks much like the real world, except that in that world the agent definitely goes right. Because that agent is a lucky agent who gets a lot of utility, the actual agent has decided to copy its behavior exactly—anything else would prove the real agent unlucky, which would be sad.
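A minimal sketch of the trigger condition driving this argument, with my own formalization and arbitrary illustrative numbers (the N range and the specific histories are not from anything above):

```python
# Exception-self N goes right once at least N turns have passed and the agent has
# gone right at most 1/N of the time. (Formalization and numbers are mine.)

def exception_triggers(n, turns, rights):
    """Does the Nth exception-self's hard-coded exception fire on this history?"""
    return turns >= n and rights <= turns / n

def triggered_exception_selves(turns, rights, max_n=100):
    """All exception-selves whose exception fires, given the history so far."""
    return [n for n in range(1, max_n + 1) if exception_triggers(n, turns, rights)]

# Once the agent has gone left more than 90% of the time, the N=10 exception-self
# (the one stipulated to live in an unusually high-reward world) is among those whose
# exception fires, so "go right now" is exactly what that lucky self would do.
print(10 in triggered_exception_selves(turns=100, rights=9))   # True: 9% right
print(10 in triggered_exception_selves(turns=100, rights=20))  # False: 20% right
```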
Of course, this outcome is far from obvious; I’m playing fast and loose with how this sort of agent might reason.
I think it’s worth examining more closely what it means to be “not a pure optimizer”. Formally, a VNM utility function is a rationalization of a coherent policy. Say you have some idea, U, of what your utility function is. Suppose you then decide to follow a policy that does not maximize U. Logically, it follows that U is not really your utility function: either your policy doesn’t coherently maximize any utility function, or it maximizes some other utility function. (This is because the utility function is, by definition, a rationalization of the policy.)
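For reference, the standard VNM representation statement that “rationalization” is leaning on here (a textbook fact, paraphrased in my own words rather than quoted from anywhere above):

```latex
% VNM representation: the utility function is read off from coherent choice behavior,
% rather than being given prior to it.
If the preference relation $\succsim$ induced by the agent's policy over lotteries
satisfies completeness, transitivity, continuity, and independence, then there exists
a utility function $U$ such that
\[
  p \succsim q \iff \mathbb{E}_{p}[U] \ge \mathbb{E}_{q}[U],
\]
and $U$ is unique up to positive affine transformation.
```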
Failing to disambiguate these two notions of “the agent’s utility function” (the idea the agent operates on, and whatever its policy actually rationalizes) is a map-territory error.
Decision theories require, as input, a utility function to maximize, and output a policy. If a decision theory is adopted by an agent who is using it to determine their policy (rather than already knowing their policy), then they are operating on some preliminary idea about what their utility function is. Their “actual” utility function is dependent on their policy; it need not match up with their idea.
So, it is very much possible for an agent who is operating on an idea U of their utility function, to evaluate counterfactuals in which their true behavioral utility function is not U. Indeed, this is implied by the fact that utility functions are rationalizations of policies.
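As a concrete toy of that possibility (helper names and the specific numbers are mine; it previews the left/right example discussed next):

```python
# The agent *operates on* the idea U = "number of left turns", but its actual policy
# contains one hard-coded exception, so U does not rationalize its behavior.

def stated_utility(history):
    """The agent's idea U of its utility function: count of left turns."""
    return history.count("left")

def policy(turn):
    """The agent's actual policy: a single hard-coded exception on turn 100."""
    return "right" if turn == 100 else "left"

history = [policy(t) for t in range(1, 201)]
always_left = ["left"] * 200

# U strictly prefers the all-left history, yet the agent's policy produces the other one.
# So U is not this agent's behavioral utility function: either some other utility
# function rationalizes the policy, or no utility function does.
print(stated_utility(history), stated_utility(always_left))  # 199 200
```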
Let’s look at the “turn left/right” example. The agent is operating on a utility function idea U, which is higher the more the agent turns left. When they evaluate the policy of turning “right” on the 10th time, they must conclude that, in this hypothetical, either (a) “right” maximizes U, (b) they are maximizing some utility function other than U, or (c) they aren’t a maximizer at all.
The logical counterfactual framework says the answer is (a): that the fixed computation of U-maximization results in turning right, not left. But, this is actually the weirdest of the three worlds. It is hard to imagine ways that “right” maximizes U, whereas it is easy to imagine that the agent is maximizing a utility function other than U, or is not a maximizer.
Yes, the (b) and (c) worlds may be weird in a problematic way. However, it is hard to imagine these being nearly as weird as (a).
One way they could be weird is that an agent having a complex utility function is likely to have been produced by a different process than an agent with a simple utility function. So the more weird exceptional decisions you make, the greater the evidence is that you were produced by the sort of process that produces complex utility functions.
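A toy Bayes update of the kind of evidence being described (all the numbers are invented purely for illustration): each additional weird exceptional decision is more likely under the hypothesis that you were produced by a complexity-producing process, so the posterior on that hypothesis climbs.

```python
# Invented likelihoods: probability of a weird exceptional decision per turn,
# under each hypothesis about the process that produced the agent.
p_exception_given_complex = 0.30   # process that produces complex utility functions
p_exception_given_simple = 0.01    # process that produces simple utility functions
posterior = 0.05                   # prior on the complex-producing process

for k in range(1, 4):  # observe three weird exceptional decisions in a row
    numerator = posterior * p_exception_given_complex
    denominator = numerator + (1 - posterior) * p_exception_given_simple
    posterior = numerator / denominator
    print(f"after {k} exception(s): P(complex-producing process) = {posterior:.2f}")
# Prints roughly 0.61, 0.98, 1.00: each exception is substantial evidence.
```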
This is pretty similar to the smoking lesion problem, then. I expect that policy-dependent source code will have a lot in common with EDT, as they both consider “what sort of agent I am” to be a consequence of one’s policy. (However, as you’ve pointed out, there are important complications with the framing of the smoking lesion problem.)
I think further disambiguation on this could benefit from re-analyzing the smoking lesion problem (or a similar problem), but I’m not sure if I have the right set of concepts for this yet.
OK, all of that made sense to me. I find the direction more plausible than when I first read your post, although it still seems like it’ll fall to the problem I sketched.
I both like and hate that it treats logical uncertainty in a radically different way from empirical uncertainty: like, because we have so far failed to find any way to treat the two uniformly (besides being entirely updateful, that is); and hate, because it still feels so wrong for the two to be very different.