The term “utility function” has two very different meanings, and people keep confusing them.
On the one hand, you can take any object, present it with choices, record what it actually does, and try to represent the pattern of its choices AS IF its internal architecture were generating all possible actions, evaluating each with a utility-function module, and then taking the action with the highest utility in the current situation. Call this the “observational” utility function.
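Here is a minimal sketch of the observational reading, with made-up options and a crude scoring rule; a serious treatment would fit something like a random-utility model to the choice data, but the point is only that the fitted function is read off behavior, not off architecture.

```python
from collections import defaultdict

def fit_observational_utility(observations):
    """Assign each option a score so that options chosen more often
    (relative to how often they were offered) score higher. This is a
    crude revealed-preference tabulation, not a full random-utility fit."""
    score = defaultdict(float)
    for menu, chosen in observations:
        for option in menu:
            score[option] += 1.0 if option == chosen else -1.0 / len(menu)
    return dict(score)

# Hypothetical observations: the entity always takes tea over coffee,
# and coffee over water.
observed = [
    ({"tea", "coffee"}, "tea"),
    ({"coffee", "water"}, "coffee"),
    ({"tea", "water"}, "tea"),
]
print(fit_observational_utility(observed))
# The entity behaves AS IF tea > coffee > water, whatever its internal
# architecture actually looks like.
```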
On the other hand, you can build entities that do in fact have utility-function modules as part of their internal architecture, either as a single black box (as in some current AI architectures) or as some more subtle, distributed design element. Call this the “architectural” utility function.
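And a correspondingly minimal sketch of the architectural reading; the module and the toy paperclip numbers are invented for illustration, not drawn from any particular system.

```python
def paperclip_module(action, state):
    """A literal utility-function module: score an action by the number
    of paperclips it is expected to produce (toy numbers)."""
    return {"run_factory": 100.0, "buy_wire": 10.0, "idle": 0.0}.get(action, 0.0)

def architectural_agent(state, actions, utility_module):
    """Generate the candidate actions, score each one with the module,
    and take the action with the highest score."""
    return max(actions, key=lambda action: utility_module(action, state))

print(architectural_agent({}, ["run_factory", "buy_wire", "idle"], paperclip_module))
# -> 'run_factory'
```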
However, entities with an explicit utility-function component have a failure mode, so-called “wireheading”. If an industrial accident drove a spike into such an entity’s brain, the utility-function module might “fail high”, causing the entity to do nothing, or nothing except pursue similar industrial accidents. More subtle, distributed utility-function modules would require more subtle, “lucky” industrial accidents, but the failure mode is still there.
Now, if you treat industrial accidents as a form of stimulus, take an entity with a utility-function component, and try to compute its observational utility function, you will find that the observational utility function differs from the designed one: it rewards the pursuit of certain “wireheading” stimuli, even though those appear nowhere in the designed utility function.
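The two readings can be put side by side in one toy simulation. Everything below is an illustrative assumption, not a claim about any real architecture: the action names, the huge “fail high” value, and the choice to have the damaged module favor whatever keeps the spike in place.

```python
from collections import defaultdict

DESIGNED = {"run_factory": 100.0, "buy_wire": 10.0, "idle": 0.0, "seek_spike": -50.0}

def utility_module(action, damaged):
    """The designed module scores actions by expected paperclips. Once the
    spike is in place, its output stops tracking paperclips entirely and
    fails high for whatever keeps the spike there."""
    if damaged:
        return 1e9 if action == "seek_spike" else 0.0
    return DESIGNED[action]

def choose(menu, damaged):
    # Architectural agent: argmax of the module's output over the menu.
    return max(menu, key=lambda a: utility_module(a, damaged))

menu = ["run_factory", "buy_wire", "idle", "seek_spike"]
trace = [(menu, choose(menu, damaged=False)) for _ in range(5)]    # before the accident
trace += [(menu, choose(menu, damaged=True)) for _ in range(20)]   # after the accident

# Crude observational utility: how often is each option chosen when offered?
counts = defaultdict(int)
for offered, chosen in trace:
    counts[chosen] += 1

print(max(counts, key=counts.get))      # 'seek_spike': top of the observational ranking
print(max(DESIGNED, key=DESIGNED.get))  # 'run_factory': top of the designed ranking
```

The two rankings agree while the module is intact and come apart the moment the accident happens, which is the divergence described above.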
If you insist on using observational utility, then wireheading is meaningless: addicted humans, for example, want addictive substances, and that is simply part of their utility function. However, I suggest that this is actually an argument against using observational utility. In order to design minds that are resistant to wireheading, we should use the (admittedly fuzzy) concept of “architectural utility”, meaning the output of the utility-function modules, even though that means we can no longer say that, for example, a paperclip maximizer necessarily maximizes paperclips. It might try to maximize paperclips but routinely fail, and that pattern of behavior might be better characterized by a different, observational utility function, something like “maximize paperclip-making attempts”.