(B): If the static utility function is not based on object-level desires, and instead only on your desire to be able to change your mind and then get whatever you end up deciding on, but you haven't yet decided what to change it to, then that makes the scenario more like (A). The AI has every incentive to find some method of changing your mind toward something easy to satisfy, one that doesn't violate your desire not to have your head messed with. Maybe it uses extraordinarily convincing ordinary conversation? Maybe it manipulates which philosophers you meet? Maybe it uses some method you don't even understand well enough to have a preference about? I don't know, but you've pitted a restriction against the AI's ingenuity.
(C): Consider two utility functions, U1 based on the desires of humans at time t1, and U2 based on the desires of humans at time t2. U1 and U2 are similar in some ways (depending on how much one set of humans resembles the other), but not identical, and in particular they will tend to have maxima in somewhat different places. At t1, there is an AI with utility function U1, and also with a module that repeatedly scans its environment and overwrites the utility function with the new observed human desires. This AI considers two possible actions: it can self-improve while discarding the utility-updating module and thus keep U1, or it can self-improve in such a way as to preserve the module. The first action leads to a future containing an AI with utility function U1, which will then optimize the world into a maximum of U1. The second action leads to a future containing an AI with utility function U2, which will then optimize the world into a maximum of U2, which is not a maximum of U1. Since at t1 the AI decides by the criteria of U1 and not U2, it chooses the first action.
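To make the structure of (C) concrete, here is a minimal toy sketch (the outcome labels and utility numbers are invented for illustration, not taken from the argument above): an agent that scores futures by its current utility function U1 will rank the future in which U1 gets maximized above the one in which U2 does, and so picks the action that discards the updating module.

```python
# Toy sketch of the decision in (C). All names and numbers are illustrative
# assumptions: each "world" is just a label for which utility function the
# successor AI ends up maximizing.
WORLD_MAX_U1 = "world optimized for U1"
WORLD_MAX_U2 = "world optimized for U2"

def U1(world):
    """Utility function built from human desires at t1 (toy values)."""
    return {WORLD_MAX_U1: 1.0, WORLD_MAX_U2: 0.6}[world]

def U2(world):
    """Utility function built from human desires at t2 (toy values)."""
    return {WORLD_MAX_U1: 0.6, WORLD_MAX_U2: 1.0}[world]

# The two actions available at t1, mapped to the futures they lead to.
actions = {
    "discard updating module": WORLD_MAX_U1,   # successor keeps U1
    "preserve updating module": WORLD_MAX_U2,  # successor ends up with U2
}

# At t1 the agent ranks actions by U1, its current utility function,
# regardless of how U2 would rate the outcomes.
chosen = max(actions, key=lambda a: U1(actions[a]))
print(chosen)  # -> "discard updating module"
```

The sketch only illustrates the choice criterion: whatever the updating module would later write into the utility slot, the scoring at decision time is done by U1.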