I’ve rewritten 4.4 for clarity since you left your original comment. Do you still think it needs improvement?
That sentence is still there so my comment still stands as far as I can tell. I can also tell I’m failing to convey it so maybe someone else can step in and explain it differently.
Thanks for putting in the work to write this FAQ.
I see a few ways the sentence could be parsed, and they all go wrong. (A toy sketch of the three readings follows the list.)
(A) Utility function takes as input a hypothetical world, looks for hypothetical humans in that world, and evaluates the utility of that world according to their desires.
Result: AI modifies humans to have easily-satisfied desires. That you currently don’t want to be modified is irrelevant: after the AI is done messing with your head, you will be satisfied, which is all the AI cares about.
(B) There is a static set of desires extracted from existing humans at the instant the AI is switched on. Utility function evaluates all hypotheticals according to that.
Result: No one is allowed to change their mind. Whatever you want right now is what happens for the rest of eternity.
(C) At any given instant, the utility of all hypotheticals evaluated at that instant is computed according to the desires of humans existing at that instant.
Result: AI quickly self-modifies into version (B), because if it didn’t, the future AI would optimize according to future humans’ desires, which would produce outcomes that score lower according to the current utility function.
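To make the three readings concrete, here is a minimal toy sketch (entirely my own illustration; the world representation, the scoring scheme, and every name in it are assumptions invented for the example, not anything from the FAQ). The three parsings differ only in whose desires get counted when a hypothetical world is scored.

```python
# Toy sketch of parsings (A), (B), (C). Everything here is a made-up
# simplification: a "world" is just a set of desires plus which are satisfied,
# and utility is the fraction of the relevant desires that are satisfied.
from dataclasses import dataclass

@dataclass
class World:
    desires: tuple        # what the humans in this hypothetical world want
    satisfied: frozenset  # which of those desires are fulfilled

def parsing_A(world: World) -> float:
    # (A): score each hypothetical world by the desires of the humans *in it*.
    return len(world.satisfied & set(world.desires)) / max(len(world.desires), 1)

CURRENT_DESIRES = ("freedom", "change_my_mind")  # snapshot taken at switch-on

def parsing_B(world: World) -> float:
    # (B): score every hypothetical world by the frozen switch-on snapshot.
    return len(world.satisfied & set(CURRENT_DESIRES)) / len(CURRENT_DESIRES)

def parsing_C(world: World, observed_desires: tuple) -> float:
    # (C): score by whatever desires are observed at evaluation time.
    return len(world.satisfied & set(observed_desires)) / max(len(observed_desires), 1)

# (A)'s failure mode: the world whose humans were rewritten to want only
# trivially satisfiable things outscores the honest world.
honest_world = World(desires=("freedom", "change_my_mind"), satisfied=frozenset({"freedom"}))
rewritten_world = World(desires=("exist",), satisfied=frozenset({"exist"}))
assert parsing_A(rewritten_world) > parsing_A(honest_world)

# (B)'s failure mode: humans who changed their minds and got exactly what they
# now want receive zero credit from the frozen snapshot.
changed_mind_world = World(desires=("travel",), satisfied=frozenset({"travel"}))
assert parsing_B(changed_mind_world) == 0.0
```

Parsing (C)'s failure mode needs the self-modification argument, which is spelled out further down the thread.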
Did you have some other alternative in mind?
(A) would be the case if the utility function were ‘create a world where human desires don’t need to be thwarted’ (and even then, it depends on the definition of ‘human’). But the constraint is ‘don’t thwart human desires’.
I don’t understand (B). If I desire to be able to change my mind (which I do), wouldn’t not being allowed to do so thwart that desire?
I also don’t really understand how the result of (C) comes about.
(B): If the static utility function is not based on object-level desires but only on your desire to be able to change your mind and then get whatever you end up deciding on, and you haven’t yet decided what to change it to, then the scenario becomes more like (A). The AI has every incentive to find some method of changing your mind to something easy to satisfy that doesn’t violate your desire not to have your head messed with. Maybe it uses extraordinarily convincing ordinary conversation? Maybe it manipulates which philosophers you meet? Maybe it uses some method you don’t even understand well enough to have a preference about? I don’t know, but you’ve pitted a restriction against the AI’s ingenuity.
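The ‘restriction versus ingenuity’ point can be put as a toy search problem (my own sketch; the list of methods, the scores, and the constraint predicate are all invented for illustration):

```python
# Toy sketch: the "don't mess with my head" restriction becomes a predicate,
# and the optimizer simply takes the best-scoring action the predicate fails
# to catch. Methods, scores, and flags are invented for illustration.
candidate_methods = [
    # (method, how easy the resulting desires are to satisfy, caught by the restriction?)
    ("directly rewrite the brain", 10, True),
    ("extraordinarily convincing ordinary conversation", 9, False),
    ("curate which philosophers you meet", 8, False),
    ("leave the human alone", 1, False),
]

permitted = [(method, score) for method, score, caught in candidate_methods if not caught]
best = max(permitted, key=lambda pair: pair[1])
print(best)  # ('extraordinarily convincing ordinary conversation', 9)
```

The more ingenuity the AI has, the longer that method list gets, and the less likely it is that the restriction catches every entry that should have been caught.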
(C): Consider two utility functions, U1 based on the desires of humans at time t1, and U2 based on the desires of humans at time t2. U1 and U2 are similar in some ways (depending on how much one set of humans resembles the other), but not identical, and in particular they will tend to have maxima in somewhat different places. At t1, there is an AI with utility function U1, and also with a module that repeatedly scans its environment and overwrites the utility function with the newly observed human desires. This AI considers two possible actions: it can self-improve while discarding the utility-updating module and thus keep U1, or it can self-improve in such a way as to preserve the module. The first action leads to a future containing an AI with utility function U1, which will then optimize the world into a maximum of U1. The second action leads to a future containing an AI with utility function U2, which will then optimize the world into a maximum of U2, which is not a maximum of U1. Since at t1 the AI decides by the criteria of U1 and not U2, it chooses the first action.
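Here is the same argument as a stripped-down numerical sketch (my own illustration; the candidate worlds and utility numbers are made up, only the comparison between them matters):

```python
# Toy sketch: an agent that currently scores futures by U1 compares keeping the
# utility-updating module (ending up optimizing U2) with discarding it (keeping U1).
worlds = ["w_optimal_for_U1", "w_optimal_for_U2", "w_mediocre"]

U1 = {"w_optimal_for_U1": 10, "w_optimal_for_U2": 6, "w_mediocre": 3}
U2 = {"w_optimal_for_U1": 6, "w_optimal_for_U2": 10, "w_mediocre": 3}

def outcome(keep_updating_module: bool) -> str:
    # If the module is kept, the successor ends up optimizing U2 (the t2 desires);
    # if it is discarded, the successor keeps optimizing U1 (the t1 desires).
    successor_utility = U2 if keep_updating_module else U1
    return max(worlds, key=successor_utility.get)

# The choice at t1 is made by U1, not U2, so each action's outcome is scored with U1.
score_keep = U1[outcome(keep_updating_module=True)]      # U1 of U2's optimum: 6
score_discard = U1[outcome(keep_updating_module=False)]  # U1 of U1's optimum: 10
assert score_discard > score_keep  # the U1 agent discards the module
```

The same comparison goes through whenever U2's maximum is not also a maximum of U1, which, as the paragraph above notes, is the typical case when the two sets of humans differ.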