Some cognitive architectures intrinsically exhibit value instability (e.g. those where goals compete stochastically for priority). Omohundro’s drive to protect the utility function from modification should, however, prevent a self-modifying AI that already has a stable architecture from adopting one that is knowably, or even possibly, unstable.
The human cognitive architecture, on the other hand, certainly looks to have value instability, and that instability is a problem for any attempt to codify a fixed human-friendly utility function by renormalizing the existing unstable architecture: Omohundro’s drive won’t automatically help here, since the starting point isn’t stable. It’s also quite possible that more than one reflectively stable equilibrium can be reached from the human decision architecture, because of its stochastic or context-dependent aspects.
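To make the “goals compete stochastically for priority” picture concrete, here is a minimal toy sketch. The goals, weights, and reinforcement rule are invented purely for illustration, not a model of any actual architecture: because which goal wins each round is random and winning is self-reinforcing, the agent’s choices fit no single fixed utility function, and different runs of the same starting architecture can drift to quite different long-run goal balances.

```python
import random

# Toy sketch only: hypothetical goals with equal starting priority.
GOALS = {"comfort": 1.0, "curiosity": 1.0, "status": 1.0}

def choose_active_goal(weights, rng):
    """Sample a goal with probability proportional to its current weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names])[0]

def run(seed, steps=5000, reinforcement=1.0):
    rng = random.Random(seed)
    weights = dict(GOALS)
    for _ in range(steps):
        winner = choose_active_goal(weights, rng)
        weights[winner] += reinforcement  # winning raises future priority
    total = sum(weights.values())
    return {name: round(w / total, 2) for name, w in weights.items()}

if __name__ == "__main__":
    # Different seeds typically settle into different goal balances:
    # more than one "stable" endpoint from the same starting point.
    for seed in range(3):
        print(seed, run(seed))
```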
Omohundro’s drive to protect the utility function from modification
The machines in my post have no such drive coded in, and that’s not a problem. Just having a utility function over universes works out fine: if there’s an action that makes the universe end up in the desired state, the computer will find it and do it. If there’s uncertainty about possible interference, it gets taken into account like any other uncertainty about outcomes.
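A minimal sketch of what “just having a utility function over universes” means here. The actions, states, and probabilities below are hypothetical, chosen only for illustration: the planner simply maximizes expected utility, so the possibility of interference shows up in the outcome probabilities rather than in any separately coded protective drive.

```python
from typing import Callable, Dict

State = str
Action = str

def expected_utility(action: Action,
                     outcomes: Callable[[Action], Dict[State, float]],
                     utility: Callable[[State], float]) -> float:
    """Average the utility of the states this action might lead to."""
    return sum(p * utility(s) for s, p in outcomes(action).items())

def best_action(actions, outcomes, utility) -> Action:
    """Pick the action with the highest expected utility -- nothing more."""
    return max(actions, key=lambda a: expected_utility(a, outcomes, utility))

# Hypothetical numbers: "act_directly" is more likely to be interfered with
# than "act_robustly"; the planner prefers the latter without any explicit
# self-protection drive, simply because interference lowers expected utility.
def toy_outcomes(action: Action) -> Dict[State, float]:
    return {
        "act_directly": {"goal_state": 0.7, "interfered_with": 0.3},
        "act_robustly": {"goal_state": 0.9, "interfered_with": 0.1},
    }[action]

def toy_utility(state: State) -> float:
    return 1.0 if state == "goal_state" else 0.0

print(best_action(["act_directly", "act_robustly"], toy_outcomes, toy_utility))
# -> act_robustly
```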
There is also nothing to guarantee that the eventual stable preference will have anything to do with the initial one, whereas the post’s argument was about the initial utility. In that sense, Omohundro’s argument is not relevant.
Omohundro’s drives are emergent behaviors expected in any sufficiently advanced intelligence, not something that gets coded in at the beginning.
Oh. Thanks.