In the more frequently considered case of a non-stable utility function, my understanding is that the agent will not try to identify the terminal attractor and then act according to that; it doesn’t care about what ‘it’ will value in the future, except instrumentally. Rather, it will attempt to maximize its current utility function, given a future agent/self acting according to a different function. Metaphorically, it gets one move in a chess game against its future selves.
I don’t see any reason for a temporarily uncertain agent to act any differently. If there is no function that is, right now, motivating it to maximize paperclips, why should it care that it will be so motivated in the future? That would seem to require a kind of recursive utility function, one in which it gains utility from maximizing its utility function in the abstract.
In this case, the AI has a stable utility function—it just doesn’t know yet what it is.
For instance, it could be “in worlds where a certain coin was heads, maximise paperclips; in other worlds, minimise them”, and it has no info yet on the coin flip. That’s a perfectly consistent and stable utility function.
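As a toy sketch of that coin-flip example (all names here are hypothetical, just for illustration), the point is that there is a single fixed utility function whose sign is conditional on an unknown fact, and before learning that fact the agent simply takes an expectation over it:

```python
def utility(coin: str, paperclips: int) -> int:
    """One stable utility function: maximise paperclips if the coin
    was heads, minimise them otherwise. The function never changes;
    only the agent's knowledge of `coin` does."""
    return paperclips if coin == "heads" else -paperclips

def expected_utility(paperclips: int, p_heads: float = 0.5) -> float:
    """Before observing the flip, the agent averages its one stable
    utility function over both possible worlds."""
    return p_heads * utility("heads", paperclips) + (1 - p_heads) * utility("tails", paperclips)

# With a fair coin, every paperclip count has expected utility zero,
# so the uncertain agent is indifferent about paperclips until it
# learns the flip -- consistent behaviour from one stable function.
```

On this reading, "temporary uncertainty" is just ordinary empirical uncertainty fed into a fixed function, not a change in what the agent values.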