That Paul knows full well that he cares about both copies. Yet sometime between the copying and the bell Paul has become much more parochial, and only cares about one. It seems to me like there is little way to escape from the inconsistency here.
Once we have the necessary technology, “that Paul” (earlier Paul) could modify his brain/mind so that his future selves no longer become more parochial and continue to care about both copies. Would you do this if you could? Should you?
(I raised this question previously in Where do selfish values come from?)
ETA:
Of course an agent free to modify itself at time T would benefit by implementing some efficient compromise amongst all copies forked off after time T.

But “all copies forked off after time T” depends on the agent’s decision at time T. For example, if the agent self-modified so that its preferences stop changing (e.g., don’t become “more parochial” over time) then “all copies forked off after time T” would have identical preferences. So is this statement equivalent to saying that the agent could self-modify to stop its preferences from changing, or do you have something else in mind, like “compromise amongst all copies forked off after time T, counterfactually if the agent didn’t self-modify”? If you mean the latter, what’s the justification for doing that instead of the former?
Yes, I think if possible you’d want to resolve to continue caring about copies even after you learn which one you are. I don’t think that you particularly want to rewind values to before prior changes, though I do think that standard decision-theoretic or “moral” arguments have a lot of force in this setting and are sufficient to recover high degrees of altruism towards copies and approximately Pareto-efficient behavior.
I think it’s not clear if you should self-modify to avoid preference change unless doing so is super cheap (because of complicated decision-theoretic relationships with future and past copies of yourself, as discussed in some other comments). But I think it’s relatively clear that if your preferences were going to change into either A or B stochastically, it would be worth paying to modify yourself so that they change into some appropriately-weighted mixture of A and B. And in this case that’s the same as having your preferences not change, and so we have an unusually strong argument for avoiding this kind of preference change.
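To make the A-or-B point concrete, here is a toy sketch (my own illustration, not something from the thread; the keep/give trade and the payoff numbers are invented): the stochastic preference change is “after the fork, each copy becomes parochial toward its own payoff”, and the appropriately-weighted mixture is “each copy keeps weighting both payoffs equally”. Each copy then faces one trade: keep 10 units for itself, or transfer 15 units to the other copy.

```python
# Toy model: parochial preferences ("A"/"B") vs. the 50/50 mixture.
KEEP, GIVE = "keep", "give"

def payoffs(action_1, action_2):
    """Material payoffs (copy 1, copy 2) given each copy's action."""
    p1 = (10 if action_1 == KEEP else 0) + (15 if action_2 == GIVE else 0)
    p2 = (10 if action_2 == KEEP else 0) + (15 if action_1 == GIVE else 0)
    return p1, p2

def best_action(weight_self, weight_other):
    """Action a copy picks when it weights its own vs. the other's payoff.
    The other copy's action shifts both options by the same amount, so it
    doesn't affect the comparison; fix it to KEEP for concreteness."""
    def value(action):
        mine, theirs = payoffs(action, KEEP)
        return weight_self * mine + weight_other * theirs
    return max([KEEP, GIVE], key=value)

parochial = best_action(1.0, 0.0)  # post-change preferences: only my payoff counts
mixture = best_action(0.5, 0.5)    # mixture: both copies' payoffs count equally

print("parochial copies:", payoffs(parochial, parochial))  # (10, 10)
print("mixture copies:  ", payoffs(mixture, mixture))      # (15, 15)
```

In this toy, ex ante (before learning which copy you’ll be) each copy ends up with 15 under the mixture versus 10 under the parochial preferences, so the pre-fork agent would pay up to 5 per copy to lock the mixture in, which is the sense in which the modification is worth paying for.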
The outcome of shutdown seems important. It’s the limiting case of soft optimization (anti-goodharting, non-agency): something you do when maximally logically uncertain about preference, making conservative decisions that are robust to adversarial assignment of your preference.
Not sure if this should in particular preserve information about preference and the opportunity for more agency targeted at it (corrigibility in my sense), since losing that opportunity doesn’t seem conservative and wastes utility for most preferences. But then shutdown would involve some optimal level of agency in the environment, caretakers of corrigibility, rather than inactivity. Which does seem possibly correct: when going into shutdown, the agent shouldn’t be eradicating environmental agents that have nothing to do with it, while refusing total shutdown when there are no environmental agents left at all (including humans and other AGIs) might be right.
If this is the case, maximal anti-goodharting is not shutdown but maximal uncertainty about preference, so a maximally anti-goodharting agent purely pursues corrigibility (receptiveness to preference): it works out what the preference is without optimizing for it, since it has no tractable knowledge of what the preference is at the moment. If the environment already contains other systems receptive to preference, this might look like shutdown.
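As a minimal sketch of the “conservative decisions robust to adversarial assignment of your preference” idea above (my own toy; the candidate preferences, actions, and utility numbers are invented): choose the action whose worst case over the candidate preferences is best. As the candidate set widens toward maximal uncertainty about preference, the robust choice collapses to the low-impact “do nothing” option, i.e., something like shutdown.

```python
# Maximin over candidate preferences: the more preferences you must be
# robust to, the more conservative the chosen action becomes.
ACTIONS = ["pursue_goal_hard", "pursue_goal_gently", "do_nothing"]

# Utility each hypothetical candidate preference assigns to each action
# (in this toy, "do_nothing" is the only action no candidate penalizes).
CANDIDATES = {
    "endorses_the_goal":    {"pursue_goal_hard": 10,  "pursue_goal_gently": 6,  "do_nothing": 0},
    "opposes_the_goal":     {"pursue_goal_hard": -10, "pursue_goal_gently": -4, "do_nothing": 0},
    "wants_something_else": {"pursue_goal_hard": -6,  "pursue_goal_gently": -2, "do_nothing": 0},
}

def robust_choice(candidate_names):
    """Pick the action with the best worst-case utility over the candidates."""
    return max(ACTIONS, key=lambda a: min(CANDIDATES[c][a] for c in candidate_names))

print(robust_choice(["endorses_the_goal"]))                      # pursue_goal_hard
print(robust_choice(["endorses_the_goal", "opposes_the_goal"]))  # do_nothing
print(robust_choice(list(CANDIDATES)))                           # do_nothing
```

With a richer action set that included something like “preserve information about preference and stay receptive to it”, the same maximin rule would plausibly pick that over literal inactivity, which is the “caretakers of corrigibility, not inactivity” point above.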