This is a good introduction; however, by representing the outcomes as just “+” and “-” you greatly simplify the range of possible utility functions, and so force SUDT to make some controversial decisions (basically, to accept the counterfactual mugging). The key issue is that your decider can give no special preference to good or bad outcomes in his own world (a world the decider knows he occupies) versus other worlds (ones which the decider knows he doesn’t occupy).
Suppose instead that the decider has an outcome space with four outcomes “+Me”, “-Me”, “+NotMe”, “-NotMe”.
Here, “+Me” represents a good singularity which the decider himself will get to enjoy, as opposed to “-Me” which is a bad singularity (such as an unfriendly AI which tortures the decider for the next billion years). The outcomes “+NotMe” and “-NotMe” also represent positive and negative singularities, but in worlds which the decider himself doesn’t inhabit. Assume that u(+Me) > u(+NotMe) > u(-Me), and also that u(+NotMe) = u(-NotMe), because the decider doesn’t care about worlds that he doesn’t belong to (from the point of view of his decisions, it’s exactly like they don’t exist).
Then, in the counterfactual mugging, when approached by Omega, the decider knows he is in a world where the coin has fallen Heads, so he picks the policy which maximizes utility for such worlds: in short he chooses “H” rather than “T”. This increases the probability of -NotMe as opposed to +NotMe, but as we’ve seen, the decider doesn’t care about that.
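To make this concrete, here is a minimal sketch in Python (the coin weighting, the payoff numbers and the outcome table are my own illustrative assumptions, not taken from the post; all that matters is the ordering u(+Me) > u(+NotMe) = u(-NotMe) > u(-Me)). It shows that with this utility function even a policy-level, updateless comparison favours “H”:

```python
# Minimal sketch of the four-outcome argument. Probabilities, payoffs and the
# outcome table are illustrative assumptions; only the structural claims are
# used: "H" is better in the heads world the decider occupies, while "T" is
# what makes the counterfactual tails world turn out well.

P_HEADS = 0.5  # prior probability that Omega's coin lands heads

u = {
    "+Me": 10.0,     # good singularity the decider gets to enjoy
    "-Me": -100.0,   # bad singularity the decider has to live through
    "+NotMe": 0.0,   # good singularity in a world the decider doesn't inhabit
    "-NotMe": 0.0,   # bad singularity in a world the decider doesn't inhabit
}

def outcome(coin, policy):
    """Hypothetical outcome table for the counterfactual mugging."""
    if coin == "heads":
        return "+Me" if policy == "H" else "-Me"
    return "+NotMe" if policy == "T" else "-NotMe"

def expected_utility(policy):
    return (P_HEADS * u[outcome("heads", policy)]
            + (1 - P_HEADS) * u[outcome("tails", policy)])

for policy in ("H", "T"):
    print(policy, expected_utility(policy))
# H: 0.5 * 10   + 0.5 * 0 =   5.0
# T: 0.5 * -100 + 0.5 * 0 = -50.0
# Because u["+NotMe"] == u["-NotMe"], the tails-world terms are identical for
# both policies, so even the updateless comparison favours "H".
```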
Here’s a possible objection: By selecting “H”, the decider is condemning lots of other versions or analogues of himself (in other possible worlds where Omega didn’t approach him), and his utility function might care about this. On the other hand, he might also reason like this: “Analogues of me still aren’t me: I still care much more about whether I get tortured than whether all those analogues do. I still pick H.”
In short, I don’t think SUDT (or UDT) by itself solves the problem of counterfactual mugging. Relative to one utility function it looks quite reasonable to accept the mugging, whereas relative to another utility function it is reasonable to reject it. Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested “veil of ignorance” on the decider’s identity, or an equivalent ban on utilities which sneak in a selfish or self-interested term).
In short, I don’t think SUDT (or UDT) by itself solves the problem of counterfactual mugging. [...] Perhaps SUDT also needs to specify a rule for selecting utility functions (e.g. some sort of disinterested “veil of ignorance” on the decider’s identity, or an equivalent ban on utilities which sneak in a selfish or self-interested term).
I’ll first give an answer to a relatively literal reading of your comment, and then one to what IMO you are “really” getting at.
Answer to a literal reading: I believe that what you value is part of the problem definition; it’s not the decision theory’s job to constrain that. For example, if you prefer DOOM to FOOM, (S)UDT doesn’t say that your utilities are wrong; it just says you should choose (H). And if we postulate that someone doesn’t care whether there’s a positive intelligence explosion if they don’t get to take part in it (not counting near-copies), then they should choose (H) as well.
But I disagree that this means that (S)UDT doesn’t solve the counterfactual mugging. It’s not like the copy-selfless utility function I discuss in the post automatically makes clear whether we should choose (H) or (T): If we went with the usual intuition that you should update on your evidence and then use the resulting probabilities in your expected utility calculation, then even if you are completely selfless, you will choose (H) in order to do the best for the world. But (S)UDT says that if you have these utilities, you should choose (T). So it would seem that the version of the counterfactual mugging discussed in the post exhibits the problem, and (S)UDT comes down squarely on the side of one of the potential solutions.
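To spell out that contrast, here is a minimal sketch (the payoffs are made-up illustrative numbers; I’m only assuming the usual counterfactual-mugging structure, where a small sacrifice in the heads world buys a large benefit in the tails world):

```python
# Contrast between "update, then maximize" and SUDT's policy-level
# evaluation, for a completely selfless utility that only measures how well
# the actual world turns out. The payoffs are made up for illustration.

P_HEADS = 0.5
SMALL_COST = 1.0       # assumed cost to the heads world of complying ("T")
LARGE_BENEFIT = 1000.0 # assumed gain to the tails world if Omega predicts "T"

def world_utility(coin, policy):
    """Selfless utility of the world, given the coin and the chosen policy."""
    if coin == "heads":
        return -SMALL_COST if policy == "T" else 0.0
    # tails: Omega acts on its prediction of the policy
    return LARGE_BENEFIT if policy == "T" else 0.0

# Updating: you have been approached, so you condition on heads and only the
# heads world enters the calculation  ->  "H" wins.
updated = {p: world_utility("heads", p) for p in ("H", "T")}

# SUDT: evaluate whole policies with the prior probabilities  ->  "T" wins.
updateless = {p: P_HEADS * world_utility("heads", p)
                 + (1 - P_HEADS) * world_utility("tails", p)
              for p in ("H", "T")}

print("updated:   ", updated)     # {'H': 0.0, 'T': -1.0}
print("updateless:", updateless)  # {'H': 0.0, 'T': 499.5}
```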
Answer to the “real” point: But of course, what I read you as “really” saying is that we could re-interpret our intuition that we should use updated probabilities as meaning that our actual utility function is not the one we would write down naively, but a version where the utilities of all outcomes in which the observer-moment making the decision isn’t consciously experienced are replaced by a constant. In the case of the counterfactual mugging, this transformation gives exactly the same result as if we had updated our probabilities. So in a sense, when I say that SUDT comes down on the side of one of the solutions, I am implicitly using a rule for how to go from “naive” utilities to utilities-to-use-in-SUDT: namely, the rule “just use the naive utilities”. And when I use my arguments about l-zombies to argue that choosing (T) is the right solution to the counterfactual mugging, I need to argue why this rule is correct.
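For concreteness, here is the same toy model with that transformation applied (again with made-up numbers): in this scenario the decider’s observer-moment only exists in heads worlds, so every tails outcome gets replaced by a constant, and the updateless calculation then agrees with updating.

```python
# The re-interpretation described above: instead of updating, replace the
# utility of every outcome in which the deciding observer-moment isn't
# consciously experienced (here, every tails-world outcome) by a constant,
# then do the ordinary updateless calculation. Numbers are illustrative.

P_HEADS = 0.5
CONSTANT = 0.0  # utility assigned to outcomes the decider never experiences

def naive_utility(coin, policy):
    # Same assumed selfless payoffs as in the previous sketch.
    if coin == "heads":
        return -1.0 if policy == "T" else 0.0
    return 1000.0 if policy == "T" else 0.0

def transformed_utility(coin, policy):
    # The decider only exists (is consciously experienced) in heads worlds.
    return naive_utility(coin, policy) if coin == "heads" else CONSTANT

updateless_transformed = {
    p: P_HEADS * transformed_utility("heads", p)
       + (1 - P_HEADS) * transformed_utility("tails", p)
    for p in ("H", "T")
}
print(updateless_transformed)  # {'H': 0.0, 'T': -0.5}: "H" wins, as with updating
```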
In terms of clarity of meaning, I have to say that I don’t feel too bad about not spelling out that the utility function is just what you would normally call your utility function, but in terms of the strength of my arguments, I agree that the possibility of re-interpreting updating in terms of utility functions is something that needs to be addressed for my argument from l-zombies to be compelling. It just happens to be one of the many things I haven’t managed to address in my updateless anthropics posts so far.
In brief, my reasons are twofold: First, I’ve asked myself, suppose that it actually were the case that I were an l-zombie, but could influence what happens in the real world; what would my actual values be then? And the answer is, I definitely don’t completely stop caring. And second, there’s the part where this transformation doesn’t just give back exactly what you would have gotten if you updated in all anthropic problems, which makes the case for it suspect. The situations I have in mind are when your decision determines whether you are a conscious observer: In this case, how you decide depends on the utility you assign to outcomes in which you don’t exist, something that doesn’t have any interpretation in terms of updating. If the only reason I adopt these utilities is to somehow implement my intuitions about updating, it seems very odd to suddenly have this new number influencing my decisions.
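A tiny made-up example of that second point: in a decision that determines whether the decider exists at all, the recommendation hinges on exactly the constant that the transformation introduces, a number with no counterpart in terms of updating.

```python
# Purely illustrative: an action A under which I come to exist (with some
# assumed utility) versus an action B under which I never exist. Which one
# the transformed utility recommends depends directly on the constant chosen
# for never-experienced outcomes; updating has no counterpart to this number.

UTILITY_IF_I_EXIST = 5.0  # assumed value of the outcome where I exist

def best_action(constant_for_nonexistence):
    return "A" if UTILITY_IF_I_EXIST >= constant_for_nonexistence else "B"

print(best_action(0.0))    # "A"
print(best_action(100.0))  # "B": the arbitrary constant flips the decision
```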
I brought up some related points in http://lesswrong.com/lw/8gk/where_do_selfish_values_come_from/. At this point, I’m not totally sure that UDT solves counterfactual mugging correctly. The problem I see is that UDT is incompatible with selfishness. For example if you make a copy of a UDT agent, then both copy 1 and copy 2 will care equally about copy 1 relative to copy 2, but if you make a copy of a typical selfish human, each copy will care more about itself than the other copy. This kind of selfishness seems strongly related to intuitions for picking (H) over (T). Until we fully understand whether selfishness is right or wrong, and how it ought to be implemented or fixed (e.g., do we encode our current degrees of caring into a UDT utility function, or rewind our values to some past state, or use some other decision theory that has a concept of “self”?), it’s hard to argue that UDT must be correct, especially in its handling of counterfactual mugging.
If selfishness is reflectively inconsistent, and an AI can self-modify, then I don’t see how an AI can stay selfish. Do you have any ideas?
Why would an AI want to self-modify away from selfishness? Because future copies of itself can’t cooperate fully if it remained selfish? That may not be the case if we solve the problem of cooperation between agents with conflicting preferences. Alternatively, an AI may not want to self-modify for “acausal” reasons (for example, it’s worried about itself not existing if it decided to prevent future selfish versions of itself from existing), or for ethical reasons (it values being selfish, or values the existence of selfish agents in the world).
How is it coherent for an agent at time T1 to ‘want’ copy A at T2 to care only about A and copy B at T2 to care only about B? There’s no non-meta way to express this: you would have to care more strongly about agents having a certain exact decision function than about all the object-level entities at stake. When it comes to object-level things, whatever the agent at T1 coherently cares about, it will want A and B to care about too.
It strikes me that a persistently selfish agent may be somewhat altruistic towards its future selves. The agent might want its future versions to be free to follow their own selfish preferences, rather than binding them to its current selfish preferences.
Another alternative is that the agent is not only selfish but lazy… it could self-modify to bind its future selves, but that takes effort, and it can’t be bothered.
Either way, it’s going to take a weird sort of utility function to reproduce human selfishness in an AI.
Now that I think of it, caring about making more copies of yourself might be more fundamental than caring about object-level things in the world… I wonder what kind of math could be used to model this.
Thank you for a very comprehensive reply.
In terms of clarity of meaning, I have to say that I don’t feel too bad about not spelling out that the utility function is just what you would normally call your utility function
That’s fine. However, normal utility functions do have self-interested components, as well as parochial components (caring about people and things that are “close” to us in various ways, above those which are more “distant”). It’s also true that utilities are not totally determined by such components, and include some general pro bono terms; further, we think that in some sense utilities ought to be disinterested rather than selfish or parochial. Hence my thought that SUDT could be strengthened by barring selfish or parochial terms, or by imposing some sort of veil of ignorance so that only terms like u(+NotMe) and u(-NotMe) affect decisions.
Allowing for self-interest, in the counterfactual mugging scenario we most likely have u(+Me) >> u(+NotMe) > u(-NotMe) >> u(-Me), rather than u(+NotMe) = u(-NotMe). The decider will still be inclined to pick “H” (matching our initial intuition), but with some hesitation, particularly if Omega’s coin was very heavily weighted to tails in the first place. The internal dialogue in that case will go something like this: “Hmm, it was so very unlikely that the coin fell heads; I can’t believe that happened! Hmm, perhaps it didn’t, and I’m in some sort of Omega-simulation. For the good of the world outside my simulation, I’d better pick T after all.” That’s roughly where I am with my own reaction to Counterfactual Mugging right now.
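One rough way to put numbers on that hesitation (all utilities and the outcome table below are illustrative assumptions): with this ordering, the policy comparison flips once the coin is weighted heavily enough towards tails.

```python
# Illustrative sketch: self-interest plus a residual concern for worlds the
# decider doesn't inhabit. All numbers and the outcome table are assumptions.

u = {"+Me": 100.0, "+NotMe": 10.0, "-NotMe": 0.0, "-Me": -1000.0}

def outcome(coin, policy):
    if coin == "heads":
        return "+Me" if policy == "H" else "-Me"
    return "+NotMe" if policy == "T" else "-NotMe"

def expected_utility(policy, p_heads):
    return (p_heads * u[outcome("heads", policy)]
            + (1 - p_heads) * u[outcome("tails", policy)])

for p_heads in (0.5, 0.01, 1e-6):
    better = max(("H", "T"), key=lambda pol: expected_utility(pol, p_heads))
    print(p_heads, better)
# 0.5   -> H   (the heads-world stakes dominate: 50.0 vs -495.0)
# 0.01  -> H   (but only just: 1.0 vs -0.1)
# 1e-6  -> T   (the (1 - p) * (u["+NotMe"] - u["-NotMe"]) term takes over)
```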
Against a background of modal realism or a many-worlds interpretation (which in my opinion is where UDT makes the most sense), caring only about the good of “our” world looks like a sort of parochialism, which is why Counterfactual Mugging is interesting. Suddenly it seems to matter whether these other worlds exist or not, rather than just being a philosophical curiosity.