It should be (world-history, identity) => R. Different agents have different goals, which assign different utility values to the same actions.
You’ve then incorporated identity twice: once when you gave each agent its own goals, and again inside those goals. If an agent’s goals contain a dangling identity-pointer, then they won’t stay consistent (or even well-defined) under self-copying, so by the same argument that says agents should stop their utility functions from drifting over time, the agent should replace that pointer with a specific value.
So, in other words: If I am D and all I want is to be king of the universe, then before stepping into a copying machine I should self-modify so that my utility function says “+1000 if D is king of the universe” rather than “+1000 if I am king of the universe”, because then my copy D2 will also have the utility function “+1000 if D is king of the universe”, and that maximises my chances of being king of the universe.
That is what you mean, right?
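Here is a minimal sketch of that self-modification in Python (the Agent class, the toy world dictionary, and the +1000 payoff are illustrative assumptions, not anything specified above): a utility function that keeps an indexical “I”-pointer silently changes referent when the agent is copied, while one whose pointer has been bound to “D” before copying does not.

```python
import copy


class Agent:
    """Toy agent: a name plus a utility function over (world, agent) pairs."""

    def __init__(self, name, utility_fn):
        self.name = name
        self.utility_fn = utility_fn

    def utility(self, world):
        # The agent evaluates the world through its own utility function;
        # passing `self` is the "identity pointer" the argument is about.
        return self.utility_fn(world, self)

    def spawn_copy(self, new_name):
        # The copy runs the same utility-function code but is a distinct agent.
        clone = copy.deepcopy(self)
        clone.name = new_name
        return clone


def indexical_utility(world, me):
    # Dangling identity-pointer: "+1000 if *I* am king of the universe".
    return 1000.0 if world.get("king") == me.name else 0.0


def bound_utility(world, me):
    # Pointer replaced with a specific value: "+1000 if D is king of the universe".
    return 1000.0 if world.get("king") == "D" else 0.0


world = {"king": "D"}

d = Agent("D", indexical_utility)
d2 = d.spawn_copy("D2")
# The copy's goal has silently changed referent: D2 now wants D2 on the throne.
print(d.utility(world), d2.utility(world))   # 1000.0 0.0

d = Agent("D", bound_utility)
d2 = d.spawn_copy("D2")
# With the pointer bound to "D" before copying, both copies still push for D.
print(d.utility(world), d2.utility(world))   # 1000.0 1000.0
```

The indexical version leaves D and D2 pulling in different directions; binding the pointer before copying is what keeps both copies working for the same outcome, which is the sense in which it “maximises my chances of being king of the universe”.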
I guess the anthropic counter is this: What if, after stepping into the machine, I end up being D2 instead of D?! If I were to self-modify to care only about D, then I wouldn’t end up being king of the universe; D would!
The agent, and the utility function’s implementation in the agent, are already part of the world and its world-history. If two agents in two universes cannot be distinguished by any observation in their universes, then they must exhibit identical behavior. I claim it makes no sense to say two agents have different goals or different utility functions if they are physically identical.
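A toy restatement of that last claim, under the assumption that an agent’s behaviour is a fixed function of its physical state plus its observations (the state dictionary, the act helper, and the observation strings below are all hypothetical):

```python
import copy
import pickle


def act(state, observation):
    # Whatever the agent does is computed from its physical state and what it
    # observes; there is no separate channel through which a "goal" could differ.
    return "enjoy victory" if observation == state["utility_target"] else "keep working"


agent_1 = {"utility_target": "D is king"}
agent_2 = copy.deepcopy(agent_1)  # a physically indistinguishable copy

# Identical physical state + identical observations => identical behaviour.
assert pickle.dumps(agent_1) == pickle.dumps(agent_2)
for obs in ("D is king", "D is not king"):
    assert act(agent_1, obs) == act(agent_2, obs)

# To give the two agents "different goals", i.e. to make them behave
# differently on the same observation, something physical has to change:
agent_2["utility_target"] = "D2 is king"
assert act(agent_1, "D is king") != act(agent_2, "D is king")
```

Any claim that two copies have different utility functions has to cash out as some such physical difference.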