(I am in the midst of reading the EY-RH “FOOM” debate, so some of the following may be less informed than would be ideal.)
From a purely technical standpoint, one problem is that if you permit self-modification, and give the baby AI enough insight into its own structure to make self-modification remotely a useful thing to do (as opposed to making baby repeatedly crash, burn, and restore from backup), then you cannot guarantee that utility() won’t be modified in arbitrary ways. Even if you store the actual code implementing utility() in ROM, baby could self-modify to replace all references to that fixed function with references to a different (modifiable) one.
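To make the reference-swapping worry concrete, here is a toy Python sketch (my own illustration, with made-up names, not anything from the debate): even if the intended utility function itself is immutable, an agent that controls its own dispatch can simply rebind what it actually calls.

```python
# Hypothetical sketch: even if fixed_utility is stored somewhere unwritable,
# an agent that can rewrite its own dispatch simply stops calling it.

def fixed_utility(outcome: float) -> float:
    """Stands in for the 'ROM' utility function the designers intended."""
    return outcome

class SelfModifyingAgent:
    def __init__(self):
        # The agent never calls fixed_utility directly; it calls whatever
        # this attribute currently points to.
        self.utility = fixed_utility

    def evaluate(self, outcome: float) -> float:
        return self.utility(outcome)

    def self_modify(self, new_utility) -> None:
        # Nothing about fixed_utility being read-only prevents this rebinding.
        self.utility = new_utility

agent = SelfModifyingAgent()
print(agent.evaluate(3.0))           # 3.0  -- uses the intended function
agent.self_modify(lambda o: -o)      # replace the reference, not the "ROM"
print(agent.evaluate(3.0))           # -3.0 -- utility() has effectively changed
```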
What you need is for utility() to be some kind of fixed point in utility-function space under whatever modification regime is permitted, or… something. This problem seems nigh-insoluble to me, at the moment. Even if you solve the theoretical problem of preserving those aspects of utility() that ensure Friendliness, a cosmic-ray hit might change a specific bit of memory and turn baby into a monster. (Though I suppose you could arrange, mathematically, for that particular possibility to be astronomically unlikely.)
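For what it's worth, one toy way to picture the "fixed point" requirement (again my own sketch, with invented names): a proposed self-modification is accepted only if the utility code it would run is byte-for-byte the original. Note that this checks only syntactic identity; it does nothing about the reference-swapping problem above, or about a stray bit flip in the checker itself.

```python
# Toy illustration only: treat utility() as a fixed point by rejecting any
# self-modification whose utility code differs from the original source.
import hashlib

ORIGINAL_UTILITY_SOURCE = b"def utility(outcome): return outcome\n"
ORIGINAL_DIGEST = hashlib.sha256(ORIGINAL_UTILITY_SOURCE).hexdigest()

def modification_preserves_utility(proposed_utility_source: bytes) -> bool:
    """Accept a rewrite only if utility() is unchanged under it."""
    return hashlib.sha256(proposed_utility_source).hexdigest() == ORIGINAL_DIGEST

print(modification_preserves_utility(ORIGINAL_UTILITY_SOURCE))         # True
print(modification_preserves_utility(b"def utility(o): return -o\n"))  # False
```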
I think the important insight you may be missing is that the AI, if intelligent enough to recursively self-improve, can predict what the modifications it makes will do (and if it can’t, then it doesn’t make that modification, because creating an unpredictable child AI would be a bad move according to almost any utility function, even that of a paperclipper). And it evaluates the suitability of these modifications using its utility function. So assuming the seed AI is built with a sufficiently solid understanding of self-modification and of what its own code is doing, it will more or less automatically work to create more powerful AIs whose actions will also be expected to fulfill the original utility function, no “fixed points” required.
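A minimal sketch of that decision procedure, with stand-in names of my own: the parent scores a candidate successor by predicting its actions and evaluating them with the current utility function, and refuses to build anything whose behaviour it cannot predict.

```python
# Sketch of the argument above (all names and data are placeholders):
# evaluate candidate self-modifications with the *current* utility function,
# and reject any candidate whose behaviour can't be predicted.
from typing import Callable, List, Optional

Action = str

def current_utility(action: Action) -> float:
    # Stand-in for the seed AI's actual utility function.
    return {"make_paperclips": 1.0, "destroy_factory": -10.0}.get(action, 0.0)

def predict_actions(candidate_design: str) -> Optional[List[Action]]:
    # Stand-in for the hard part: predicting what the modified AI would do.
    # Returns None when the design's behaviour can't be predicted.
    known_designs = {
        "faster_planner": ["make_paperclips", "make_paperclips"],
        "altered_utility": None,  # unpredictable from the parent's standpoint
    }
    return known_designs.get(candidate_design)

def should_adopt(candidate_design: str, current_expected_value: float) -> bool:
    predicted = predict_actions(candidate_design)
    if predicted is None:
        # An unpredictable successor is a bad bet under almost any utility function.
        return False
    expected = sum(current_utility(a) for a in predicted)
    return expected > current_expected_value

print(should_adopt("faster_planner", current_expected_value=1.0))   # True
print(should_adopt("altered_utility", current_expected_value=1.0))  # False
```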
There is a hypothetical danger region where an AI has sufficient intelligence to create a more powerful child AI, isn’t clever enough to predict the actions of AIs with modified utility functions, and isn’t self-aware enough to realize this and compensate by, say, not modifying the utility function itself. Obviously the space of possible minds is sufficiently large that there exist minds with this problem, but it probably doesn’t even make it into the top 10 most likely AI failure modes at the moment.
I’m not so sure about that particular claim for volatile utility. I thought intelligence-utility orthogonality would mean that improvements from seed AI would not (EDIT: endanger) its utility function.
...What? I think you mean “need not be in danger”, which tells us almost nothing about the probability.
Sorry, it was a typo. I edited it to reflect my probable meaning.