In the happy dance problem, when the agent is considering doing a happy dance, the agent should have already updated on M. This is more like timeless decision theory than updateless decision theory.
Conditioning on ‘A(obs) = act’ is still a conditional, not a counterfactual. The difference between conditionals and counterfactuals is the difference between “If Oswald didn’t kill Kennedy, then someone else did” and “If Oswald didn’t kill Kennedy, then someone else would have”.
Indeed, troll bridge will present a problem for “playing chicken” approaches, which are probably necessary in counterfactual nonrealism.
For policy-dependent source code, I intend for the agent to be logically updateful, while updateless about observations.
Why is this much better than counterfactuals which keep the source code fixed but imagine the execution trace being different?
Because it doesn’t lead to logical incoherence, so reasoning about counterfactuals doesn’t have to be limited.
This seems to only push the rough spots further back—there can still be contradictions, e.g. between the source code and the process by which programmers wrote the source code.
If you see your source code is B instead of A, you should anticipate learning that the programmers programmed B instead of A, which means something was different in the process. So the counterfactual has implications backwards in physical time.
At some point it will ground out in: different indexical facts, different laws of physics, different initial conditions, different random events...
This theory isn’t worked out yet but it doesn’t yet seem that it will run into logical incoherence, the way logical counterfactuals do.
But then we are faced with the usual questions about spurious counterfactuals, chicken rule, exploration, and Troll Bridge.
Maybe some of these.
Spurious counterfactuals require getting a proof of “I will take action X”. The proof proceeds by showing “source code A outputs action X”. But an agent who accepts policy-dependent source code will believe they have source code other than A if they don’t take action X. So the spurious proof doesn’t prevent the counterfactual from being evaluated.
Chicken rule is hence unnecessary.
Exploration is a matter of whether the world model is any good; the world model may, for example, map a policy to a distribution of expected observations. (That is, the world model already has policy counterfactuals as part of it; theories such as physics provide constraints on the world model rather than fully determining it). Learning a good world model is of course a problem in any approach.
Whether troll bridge is a problem depends on how the source code counterfactual is evaluated. Indeed, many ways of running this counterfactual (e.g. inserting special cases into the source code) are “stupid” and could be punished in a troll bridge problem.
I by no means think “policy-dependent source code” is presently a well worked-out theory; the advantage relative to logical counterfactuals is that in the latter case, there is a strong theoretical obstacle to ever having a well worked-out theory, namely logical incoherence of the counterfactuals. Hence, coming up with a theory of policy-dependent source code seems more likely to succeed than coming up with a theory of logical counterfactuals.
If you see your source code is B instead of A, you should anticipate learning that the programmers programmed B instead of A, which means something was different in the process. So the counterfactual has implications backwards in physical time.
At some point it will ground out in: different indexical facts, different laws of physics, different initial conditions, different random events...
I’m not sure how you are thinking about this. It seems to me like this will imply really radical changes to the universe. Suppose the agent is choosing between a left path and a right path. Its actual programming will go left. It has to come up with alternate programming which would make it go right, in order to consider that scenario. The most probable universe in which its programming would make it go right is potentially really different from our own. In particular, it is a universe where it would go right despite everything it has observed, a lifetime of (updateless) learning, which in the real universe, has taught it that it should go left in situations like this.
EG, perhaps it has faced an iterated 5&10 problem, where left always yields 10. It has to consider alternate selves who, faced with that history, go right.
It just seems implausible that thinking about universes like that will result in systematically good decisions. In the iterated 5&10 example, perhaps universes where its programming fails iterated 5&10 are universes where iterated 5&10 is an exceedingly unlikely situation; so in fact, the reward for going right is quite unlikely to be 5, and very likely to be 100. Then the AI would choose to go right.
Obviously, this is not necessarily how you are thinking about it at all—as you said, you haven’t given an actual decision procedure. But the idea of considering only really consistent counterfactual worlds seems quite problematic.
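To make the worry in the iterated 5&10 example concrete, here is a minimal arithmetic sketch. The probabilities are purely hypothetical (the comment only says "quite unlikely" and "very likely"), and `expected_reward` is an illustrative helper, not part of any proposed decision procedure.

```python
from fractions import Fraction

def expected_reward(outcomes):
    """Expected value of a list of (reward, probability) pairs."""
    return sum(prob * reward for reward, prob in outcomes)

# In the real history, going left always yields 10.
left = [(10, Fraction(1))]

# Going right, evaluated in the most probable universes whose programming
# goes right. ASSUMPTION: the 1% / 99% weights are invented for illustration.
right = [(5, Fraction(1, 100)), (100, Fraction(99, 100))]

ev_left = expected_reward(left)
ev_right = expected_reward(right)

# Under these assumed weights the agent prefers "right", even though
# "right" only ever pays 5 in the situation it is actually in.
prefers_right = ev_right > ev_left
```

The point of the sketch is only that the comparison is driven by what the alternate universes look like, not by the payoffs of the problem the agent actually faces.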
I agree this is a problem, but isn’t this a problem for logical counterfactual approaches as well? Isn’t it also weird for a known fixed optimizer source code to produce a different result on this decision where it’s obvious that ‘left’ is the best decision?
If you assume that the agent chose ‘right’, it’s more reasonable to think it’s because it’s not a pure optimizer than that a pure optimizer would have chosen ‘right’, in my view.
If you form the intent to, as a policy, go ‘right’ on the 100th turn, you should anticipate learning that your source code is not the code of a pure optimizer.
I’m left with the feeling that you don’t see the problem I’m pointing at.
My concern is that the most plausible world where you aren’t a pure optimizer might look very very different, and whether this very very different world looks better or worse than the normal-looking world does not seem very relevant to the current decision.
Consider the “special exception selves” you mention: the Nth exception-self has a hard-coded exception “go right if it’s been at least N turns and you’ve gone right at most 1/N of the time”.
Now let’s suppose that the worlds which give rise to exception-selves are a bit wild. That is to say, the rewards in those worlds have pretty high variance. So a significant fraction of them have quite high reward—let’s just say 10% of them have value much higher than is achievable in the real world.
So we expect that by around N=10, there will be an exception-self living in a world that looks really good.
This suggests to me that the policy-dependent-source agent cannot learn to go left > 90% of the time, because once it crosses that threshold, the exception-self in the really good looking world is ready to trigger its exception, so going right starts to appear really good. The agent goes right until it is under the threshold again.
If that’s true, then it seems to me rather bad: the agent ends up repeatedly going right in a situation where it should be able to learn to go left easily. Its reason for repeatedly going right? There is one enticing world, which looks much like the real world, except that in that world the agent definitely goes right. Because that agent is a lucky agent who gets a lot of utility, the actual agent has decided to copy its behavior exactly—anything else would prove the real agent unlucky, which would be sad.
Of course, this outcome is far from obvious; I’m playing fast and loose with how this sort of agent might reason.
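The threshold dynamic sketched above can be played out in a toy simulation. This is a rough sketch under strong assumptions: the agent goes left by default and copies exactly one exception-self (the N=10 one, assumed to be the one living in the really good looking world) whenever that self's exception would fire. The names `exception_triggers` and `simulate` are hypothetical.

```python
def exception_triggers(n, turns, rights):
    """Nth exception-self: go right if it has been at least n turns and
    the agent has gone right at most 1/n of the time so far."""
    # rights / turns <= 1 / n, written without division
    return turns >= n and rights * n <= turns

def simulate(total_turns, lucky_n=10):
    """Count how often the toy agent goes right over total_turns turns.

    ASSUMPTION: the agent goes left unless the lucky exception-self's
    rule fires, in which case it copies that self and goes right.
    """
    rights = 0
    for turns in range(total_turns):  # turns = number of turns already played
        if exception_triggers(lucky_n, turns, rights):
            rights += 1
    return rights

total = 1000
rights = simulate(total)
left_fraction = (total - rights) / total
```

In this toy version the agent's left-rate settles at the 90% boundary instead of climbing toward 100%, which is the behavior the comment is worried about. A real policy-dependent-source agent might of course reason quite differently.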
I think it’s worth examining more closely what it means to be “not a pure optimizer”. Formally, a VNM utility function is a rationalization of a coherent policy. Say that you have some idea about what your utility function is, U. Suppose you then decide to follow a policy that does not maximize U. Logically, it follows that U is not really your utility function; either your policy doesn’t coherently maximize any utility function, or it maximizes some other utility function. (Because the utility function is, by definition, a rationalization of the policy)
Failing to disambiguate these two notions of “the agent’s utility function” is a map-territory error.
Decision theories require, as input, a utility function to maximize, and output a policy. If a decision theory is adopted by an agent who is using it to determine their policy (rather than already knowing their policy), then they are operating on some preliminary idea about what their utility function is. Their “actual” utility function is dependent on their policy; it need not match up with their idea.
So, it is very much possible for an agent who is operating on an idea U of their utility function, to evaluate counterfactuals in which their true behavioral utility function is not U. Indeed, this is implied by the fact that utility functions are rationalizations of policies.
Let’s look at the “turn left/right” example. The agent is operating on a utility function idea U, which is higher the more the agent turns left. When they evaluate the policy of turning “right” on the 10th time, they must conclude that, in this hypothetical, either (a) “right” maximizes U, (b) they are maximizing some utility function other than U, or (c) they aren’t a maximizer at all.
The logical counterfactual framework says the answer is (a): that the fixed computation of U-maximization results in turning right, not left. But, this is actually the weirdest of the three worlds. It is hard to imagine ways that “right” maximizes U, whereas it is easy to imagine that the agent is maximizing a utility function other than U, or is not a maximizer.
Yes, the (b) and (c) worlds may be weird in a problematic way. However, it is hard to imagine these being nearly as weird as (a).
One way they could be weird is that an agent having a complex utility function is likely to have been produced by a different process than an agent with a simple utility function. So the more weird exceptional decisions you make, the greater the evidence is that you were produced by the sort of process that produces complex utility functions.
This is pretty similar to the smoking lesion problem, then. I expect that policy-dependent source code will have a lot in common with EDT, as they both consider “what sort of agent I am” to be a consequence of one’s policy. (However, as you’ve pointed out, there are important complications with the framing of the smoking lesion problem)
I think further disambiguation on this could benefit from re-analyzing the smoking lesion problem (or a similar problem), but I’m not sure if I have the right set of concepts for this yet.
OK, all of that made sense to me. I find the direction more plausible than when I first read your post, although it still seems like it’ll fall to the problem I sketched.
I both like and hate that it treats logical uncertainty in a radically different way from empirical uncertainty—like, because we have so far failed to find any way to treat the two uniformly (besides being entirely updateful that is); and hate, because it still feels so wrong for the two to be very different.
Conditioning on ‘A(obs) = act’ is still a conditional, not a counterfactual. The difference between conditionals and counterfactuals is the difference between “If Oswald didn’t kill Kennedy, then someone else did” and “If Oswald didn’t kill Kennedy, then someone else would have”.
I still disagree. We need a counterfactual structure in order to consider the agent as a function A(obs). EG, if the agent is a computer program, the function A() would contain all the counterfactual information about what the agent would do if it observed different things. Hence, considering the agent’s computer program as such a function leverages an ontological commitment to those counterfactuals.
To illustrate this, consider counterfactual mugging where we already see that the coin is heads—so, there is nothing we can do, we are at the mercy of our counterfactual partner. But suppose we haven’t yet observed whether Omega gives us the money.
A “real counterfactual” is one which can be true or false independently of whether its condition is met. In this case, if we believe in real counterfactuals, we believe that there is a fact of the matter about what we do in the coin=tails case, even though the coin came up heads. If we don’t believe in real counterfactuals, we instead think only that there is a fact of how Omega is computing “what I would have done if the coin had been tails”—but we do not believe there is any “correct” way for Omega to compute that.
The obs→act representation and the P(act|obs) representation both appear to satisfy this test of non-realism. The first is always true if the observation is false, so, lacks the ability to vary independently of the observation. The second is undefined when the observation is false, which is perhaps even more appealing for the non-realist.
Now consider the A(obs)=act representation. A(tails)=pay can still vary even when we know coin=heads. So, it fails this test: it is a realist representation!
Putting something into functional form imputes a causal/counterfactual structure.
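The three representations can be put side by side in a small sketch of the test. Everything here is illustrative: the action labels (`"wait"`, `"pay"`, `"refuse"`), the joint table, and the helper names are assumptions, not anything specified in the thread.

```python
from fractions import Fraction

def material_conditional(obs_holds, act_taken):
    """obs -> act as a material conditional: true whenever obs is false."""
    return (not obs_holds) or act_taken

def conditional_probability(joint, obs, act):
    """P(act | obs) from a table {(obs, act): prob}; None if P(obs) == 0."""
    p_obs = sum(p for (o, _), p in joint.items() if o == obs)
    if p_obs == 0:
        return None  # undefined: the condition has probability zero
    return joint.get((obs, act), Fraction(0)) / p_obs

# We have already seen coin = heads, so all mass sits on heads-worlds.
joint = {("heads", "wait"): Fraction(1)}

# obs -> act: "tails -> pay" and "tails -> refuse" are both vacuously true.
vacuous_pay = material_conditional(obs_holds=False, act_taken=True)
vacuous_refuse = material_conditional(obs_holds=False, act_taken=False)

# P(act | obs): undefined for the unobserved branch.
p_pay_given_tails = conditional_probability(joint, "tails", "pay")

# A(obs) = act: two hypothetical agent functions that agree on heads
# but still differ on the tails branch.
payer = {"heads": "wait", "tails": "pay"}
refuser = {"heads": "wait", "tails": "refuse"}
differ_on_tails = payer["tails"] != refuser["tails"]
```

Only the functional form keeps a fact about the tails branch that can vary once heads is known; the other two carry no such information.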
This indeed makes sense when “obs” is itself a logical fact. If obs is a sensory input, though, ‘A(obs) = act’ is a logical fact, not a logical counterfactual. (I’m not trying to avoid causal interpretations of source code interpreters here, just logical counterfactuals)
In the happy dance problem, when the agent is considering doing a happy dance, the agent should have already updated on M. This is more like timeless decision theory than updateless decision theory.
I agree that this gets around the problem, but to me the happy dance problem is still suggestive—it looks like the material conditional is the wrong representation of the thing we want to condition on.
Also—if the agent has already updated on observations, then updating on obs→act is just the same as updating on act. So this difference only matters in the updateless case, where it seems to cause us trouble.
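The last claim can be checked on a small example. The following is a sketch with a made-up joint distribution over (obs, act) worlds; the numbers and the `condition` helper are assumptions chosen only to make the comparison visible.

```python
from fractions import Fraction

# ASSUMPTION: an arbitrary prior over worlds (obs_is_true, act_is_taken).
prior = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(3, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def condition(dist, event):
    """Bayesian conditioning: keep worlds where event holds, renormalize."""
    total = sum(p for world, p in dist.items() if event(world))
    return {world: p / total for world, p in dist.items() if event(world)}

def obs(world):
    return world[0]

def act(world):
    return world[1]

def obs_implies_act(world):
    # Material conditional obs -> act
    return (not world[0]) or world[1]

# Updateful case: after updating on obs, the two updates coincide.
updated = condition(prior, obs)
same_after_update = condition(updated, obs_implies_act) == condition(updated, act)

# Updateless case: conditioning the prior directly, they come apart.
differ_updateless = condition(prior, obs_implies_act) != condition(prior, act)
```

After updating on obs, every remaining world has obs true, so obs→act and act pick out the same worlds. From the prior, obs→act also keeps every world where obs is false, which is where the two updates diverge.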
Ahhh ok.