Yeah ‘indifference to completing preferences’ remains an issue and I’m still trying to figure out if there’s a way to overcome it. I don’t think ‘expects to complete its preferences over time’ plays a role, though. I think the indifference to completing preferences is just a consequence of the fact that turning preferential gaps into strict preferences won’t lead the agent to behave in ways that it disprefers from its current perspective. I go into a bit more detail on this in my contest entry:
I noted above that goal-content integrity is a convergent instrumental subgoal of rational agents: agents will often prefer to maintain their current preferences rather than have them changed, because their current preferences would be worse-satisfied if they came to have different preferences.
Consider, for example, an agent with a preference for trajectory x over trajectory y. It is offered the opportunity to reverse its preference so that it comes to prefer y over x. This agent will prefer not to have its preferences changed in this way. If its preferences are changed, it will choose y over x if offered a choice between the two, and that would mean its current preference for x over y would not be satisfied. That’s why agents tend to prefer to keep their current preferences rather than have them changed.
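To make that concrete, here's a minimal toy sketch in Python (my own illustration, not anything from the entry; the pair-set representation, the `choose` helper, and the trajectory names are all made up): an agent that strictly prefers x to y considers having that preference reversed, and the reversed agent's choice comes out dispreferred from the current perspective.

```python
# Toy model: a preference relation is a set of ordered pairs (a, b),
# read as "a is strictly preferred to b". All names are illustrative.

def choose(options, prefs):
    """Return the maximal options: those not strictly dispreferred to any other option."""
    return {a for a in options if not any((b, a) in prefs for b in options)}

current_prefs = {("x", "y")}    # the agent currently prefers trajectory x to y
reversed_prefs = {("y", "x")}   # the proposed modification

# What would the agent choose from {x, y} after the reversal?
future_choice = choose({"x", "y"}, reversed_prefs)  # {"y"}

# Evaluated from the *current* perspective, choosing y is strictly dispreferred
# to choosing x, so the agent disprefers having its preference reversed.
change_is_dispreferred = any(("x", c) in current_prefs for c in future_choice)
print(future_choice, change_is_dispreferred)  # {'y'} True
```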
But things seem different when we consider preferential gaps. Suppose that our agent has a preferential gap between trajectories x and y: it lacks any preference between the two, and this lack of preference is insensitive to some sweetening or souring, so that the agent also lacks a preference between x and some sweetened or soured version of y, or between y and some sweetened or soured version of x. Then, it seems, the agent won't necessarily prefer to maintain its preferential gap between x and y rather than come to have some preference. If it comes to prefer (say) x over y, it will choose x when offered a choice between the two, but that choice isn't dispreferred to any other available action from its current perspective.
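Here's the same toy sketch applied to a preferential gap (again just an illustration with made-up names, where "y+" stands in for a sweetened version of y): completing the gap into a preference for x leads the agent to choose x, and that choice isn't dispreferred to anything from the current perspective.

```python
# Same toy model as above: a preferential gap between x and y that is
# insensitive to sweetening ("y+" is a sweetened version of y).

def choose(options, prefs):
    """Return the maximal options: those not strictly dispreferred to any other option."""
    return {a for a in options if not any((b, a) in prefs for b in options)}

current_prefs = {("y+", "y")}   # y+ is preferred to y; x is unranked against both

# Suppose the gap is turned into a strict preference for x.
completed_prefs = current_prefs | {("x", "y"), ("x", "y+")}

future_choice = choose({"x", "y"}, completed_prefs)  # {"x"}

# From the current perspective nothing is preferred to x, so choosing x is not
# dispreferred to any other available action: goal-content integrity gives the
# agent no reason to resist having the gap completed.
change_is_dispreferred = any((b, c) in current_prefs for c in future_choice for b in {"x", "y", "y+"})
print(future_choice, change_is_dispreferred)  # {'x'} False
```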
So, it seems, considerations of goal-content integrity give us no reason to think that agents with preferential gaps will choose to preserve their preferential gaps. And since preferential gaps are key to keeping the agent shutdownable, this is bad news. Considerations of goal-content integrity give us no reason to think that agents with preferential gaps will keep themselves shutdownable.
This seems like a serious limitation, and I’m not yet sure if there’s any way to overcome it. Two strategies that I plan to explore:
Tim L. Williamson argues that agents with preferential gaps will often prefer to maintain them, because turning a gap into a preference would lead the agent to make choices between other options that look bad from its current perspective. I wasn't convinced by the quick version of this argument, but I haven't yet had the time to read the longer argument.
Perhaps, as above, we can train the agent to have ‘maintaining its current pattern of preferences’ as one of its terminal goals. As above, the fact that the agent’s current pattern of preferences is incomplete will help to mitigate concerns about the agent behaving deceptively to avoid having new preferences trained in. If we train against the agent modifying its own preferences in a sufficiently diverse array of environments, perhaps that will inscribe into the agent a general preference for maintaining its current pattern of preferences. I wouldn’t want to rely on this, though.
On directly modifying preferences towards completion over time: that’s right, but the agent’s preferences will only become complete once it’s had the opportunity to choose among a sufficiently wide array of options. Depending on the details, that might never happen, or might only happen after a very long time. I’m still trying to figure out the details.