This all looks clever, apart from the fact that the AI becomes completely indifferent to arbitrary changes in its value system. The way you describe it, the AI will happily and uncomplainingly accept a switch from a friendly v (such as promoting human survival, welfare and settlement of the Galaxy) to an almost arbitrary w (such as making paperclips), just by pushing the right “update” buttons. An immediate worry is who will be in charge of the update routine, and what happens if they are corrupt or make a mistake: if the AI is friendly, then it had better worry about this as well.
Interestingly, the examples you started with suggested that the AI should be rewarded somehow in its current utility v as compensation for accepting a change to a different utility w. That does sound more natural, and more stable against rogue updates.
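One way to make this compensation idea concrete (only a sketch under assumptions added here; the compensation term c and the conditional expectations are my gloss, not part of your construction) is that the agent keeps evaluating everything in v, and accepts the update to w only when offered a v-denominated compensation c large enough that
\[
\mathbb{E}\!\left[v \mid \text{accept update to } w\right] + c \;\ge\; \mathbb{E}\!\left[v \mid \text{reject update}\right].
\]
Read this way, a switch to a w that is disastrous by v's lights (paperclips, say) would demand an enormous compensation, which is why the scheme seems more robust against rogue or mistaken updates than outright indifference.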