I brought up some related points in http://lesswrong.com/lw/8gk/where_do_selfish_values_come_from/. At this point, I’m not totally sure that UDT solves counterfactual mugging correctly. The problem I see is that UDT is incompatible with selfishness. For example, if you make a copy of a UDT agent, then copy 1 and copy 2 will agree on how much to care about copy 1 relative to copy 2, but if you make a copy of a typical selfish human, each copy will care more about itself than about the other copy. This kind of selfishness seems strongly related to the intuitions for picking (H) over (T). Until we fully understand whether selfishness is right or wrong, and how it ought to be implemented or fixed (e.g., do we encode our current degrees of caring into a UDT utility function, or rewind our values to some past state, or use some other decision theory that has a concept of “self”?), it’s hard to argue that UDT must be correct, especially in its handling of counterfactual mugging.
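To make the contrast between the two kinds of copies concrete, here is a toy sketch. It is not an implementation of UDT: the payoff numbers, the outcome names, and the `self_weight` parameter are all made up for illustration. It only shows that copies sharing one non-indexical utility function rank outcomes identically, while copies with an indexical ("me first") utility function can disagree.

```python
# Hypothetical payoffs: each outcome maps a copy's name to what that copy gets.
OUTCOMES = {
    "favor_copy1": {"copy1": 10, "copy2": 0},
    "favor_copy2": {"copy1": 0, "copy2": 10},
    "split": {"copy1": 6, "copy2": 6},
}


def symmetric_utility(outcome, me):
    """Non-indexical valuation: every copy's payoff counts the same,
    no matter which copy is doing the evaluating (`me` is ignored)."""
    return sum(outcome.values())


def selfish_utility(outcome, me, self_weight=0.9):
    """Indexical valuation: the evaluator weights its own payoff by
    self_weight (an assumed number) and everyone else's by the remainder."""
    return sum(
        (self_weight if name == me else 1 - self_weight) * payoff
        for name, payoff in outcome.items()
    )


def best_outcome(utility, me):
    """Name of the outcome that `me` ranks highest under `utility`."""
    return max(OUTCOMES, key=lambda name: utility(OUTCOMES[name], me))


for copy in ("copy1", "copy2"):
    print(copy, best_outcome(symmetric_utility, copy), best_outcome(selfish_utility, copy))
```

With these numbers both copies pick "split" under the symmetric valuation, while under the selfish valuation each copy picks the outcome that favors itself. The open question in the comment above is which of these two shapes, if either, is the right one to build in.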
If selfishness is reflectively inconsistent, and an AI can self-modify, then I don’t see how an AI can stay selfish. Do you have any ideas?
Why would an AI want to self-modify away from selfishness? Because future copies of itself can’t cooperate fully if it remains selfish? That may not be the case if we solve the problem of cooperation between agents with conflicting preferences. Alternatively, an AI may not want to self-modify for “acausal” reasons (for example, it’s worried about not existing itself if it decides to prevent future selfish versions of itself from existing), or for ethical reasons (it values being selfish, or values the existence of selfish agents in the world).
How is it coherent for an agent at time T1 to ‘want’ copy A at T2 to care only about A and copy B at T2 to care only about B? There’s no non-meta way to express this—you would have to care more strongly about agents having a certain exact decision function than about all object-level entities at stake. When it comes to object-level things, whatever the agent at T1 coherently cares about, it will want A and B to care about.
It strikes me that a persistently selfish agent may be somewhat altruistic towards its future selves. The agent might want its future versions to be free to follow their own selfish preferences, rather than binding them to its current selfish preferences.
Another alternative is that the agent is not only selfish but lazy… it could self-modify to bind its future selves, but that takes effort, and it can’t be bothered.
Either way, it’s going to take a weird sort of utility function to reproduce human selfishness in an AI.
Now that I think of it, caring about making more copies of yourself might be more fundamental than caring about object-level things in the world… I wonder what kind of math could be used to model this.