Made me think of Rawls’s veil of ignorance, somewhat. I wonder: is there a whole family of techniques along the lines of “design intelligence B, given some ambiguity about your own values”, with different forms or degrees of uncertainty?
It seems like it should avoid extreme or weirdly specialized results (e.g. paper-clipping), since hedging your bets is an immediate consequence. But it’s still highly dependent on the language you’re using to model those values in the first place.
I’m a little unclear on the behavioral consequences of ‘utility function uncertainty’ as opposed to the more usual empirical uncertainty. Technically, it is an empirical question, but what does it mean to act without having perfect confidence in your own utility function?
> but what does it mean to act without having perfect confidence in your own utility function?
If you look at utility functions as actual functions (not as affine equivalence classes of functions) then that uncertainty can be handled the usual way.
Suppose you want to either maximise u (the number of paperclips) or -u; you don’t know which, but will find out soon. Then, in any case, you want to gain control of the paperclip factories...
Well, let’s further say that you assign p(+u)=0.51 and p(-u)=0.49, slightly favoring the production of paperclips over their destruction. And just to keep it a toy problem, you’ve got a paperclip-making button and a paperclip-destroying button you can push, and no other means of interacting with reality.
A plain old ‘confident’ paperclip maximizer in this situation will happily just push the former button all day, receiving one Point every time it does so. But an uncertain agent will have the exact same behavior; the only difference is that it gets only 0.02 expected Points per press (0.51·1 + 0.49·(−1) = 0.02), and thus a lower overall score in the same period of time. But the number of paperclips produced is identical. The agent would not (for example) push the ‘destroy’ button 49 times and the ‘create’ button 51 times. In practical effect, this is as inconsequential as telling the confident agent that it gets two Points for every paperclip.
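A minimal sketch of that arithmetic, in case it helps; the probabilities and button effects come straight from the example above, while the names and structure are my own:

```python
# Toy two-button world from the example above; names are my own.
p_plus, p_minus = 0.51, 0.49             # credence in "maximise u" vs "maximise -u"
actions = {"create": +1, "destroy": -1}  # effect of each button on the clip count

def confident_points(delta_clips):
    # Plain paperclip maximiser: one Point per paperclip created.
    return delta_clips

def uncertain_points(delta_clips):
    # Expected Points when unsure of the sign of the utility function.
    return p_plus * delta_clips + p_minus * (-delta_clips)

for name, delta in actions.items():
    print(name, confident_points(delta), round(uncertain_points(delta), 2))
# create: 1 vs 0.02; destroy: -1 vs -0.02. Both agents rank the buttons the
# same way, so their behaviour is identical; only the score per press differs.
```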
So in this toy problem, at least, uncertainty isn’t a moderating force. On the other hand, I would intuitively expect different behavior in a less ‘toy’ problem: for example, an uncertain maximizer might build every paperclip with a secret self-destruct command so that the number of paperclips could be quickly reduced to zero. So there’s a line somewhere where behavior changes. Maybe a good way to phrase my question would be: what are the special circumstances under which an uncertain utility function produces a change in behavior?
If the AI expects to know tomorrow what utility function it has, it will be willing to wait, even if there is a (mild) discount rate, while a pure maximiser would not.
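Roughly how that plays out in the two-button toy problem, with an assumed per-step discount factor gamma (my own framing, not from the thread):

```python
# Two time steps: today and tomorrow, discounted by gamma. 0.02 is the
# uncertain agent's expected Points per press from the example above.
gamma = 0.9  # an assumed, mild discount factor

# Uncertain agent that acts on both days without knowing the sign of u:
act_now = 0.02 + gamma * 0.02            # ~0.04

# Uncertain agent that waits, learns its utility function tomorrow,
# then presses whichever button that function favours:
wait_and_learn = 0.0 + gamma * 1.0       # 0.9

# A confident maximiser gains nothing by waiting:
confident_act_now = 1.0 + gamma * 1.0    # 1.9
confident_wait = 0.0 + gamma * 1.0       # 0.9

print(act_now, wait_and_learn)            # waiting wins for the uncertain agent
print(confident_act_now, confident_wait)  # but loses for the confident one
```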
In the more frequently considered case of a non-stable utility function, my understanding is that the agent will not try to identify the terminal attractor and then act according to that; it doesn’t care about what ‘it’ will value in the future, except instrumentally. Rather, it will attempt to maximize its current utility function, given a future agent/self acting according to a different function. Metaphorically, it gets one move in a chess game against its future selves.
I don’t see any reason for a temporarily uncertain agent to act any differently. If there is no function that is, right now, motivating it to maximize paperclips, why should it care that it will be so motivated in the future? That would seem to require a kind of recursive utility function, one in which it gains utility from maximizing its utility function in the abstract.
In this case, the AI has a stable utility function—it just doesn’t know yet what it is.
For instance, it could be “in worlds where a certain coin was heads, maximise paperclips; in other worlds, minimise them”, and it has no info yet on the coin flip. That’s a perfectly consistent and stable utility function.
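One way to write that down as code (a sketch with invented names; the only point is that u itself is a single fixed function of the world, and the uncertainty is ordinary uncertainty about which world you are in):

```python
from dataclasses import dataclass

@dataclass
class World:
    coin_heads: bool   # the unobserved coin flip
    paperclips: int    # how many paperclips exist

def u(world: World) -> float:
    # One stable, consistent utility function: maximise paperclips in
    # heads-worlds, minimise them in tails-worlds.
    return world.paperclips if world.coin_heads else -world.paperclips

# Before learning the coin, the agent simply takes an expectation over
# worlds, exactly as it would for any other empirical unknown.
p_heads = 0.5
def expected_u(paperclips: int) -> float:
    return (p_heads * u(World(True, paperclips))
            + (1 - p_heads) * u(World(False, paperclips)))
```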
It is, if you can get evidence about your utility function.