The general sentiment on which LessWrong is founded assumes that it’s hard to have utility functions that are stable under self-modification, and that’s one of the reasons why friendly AGI is a very hard problem.
Would it be likely for the utility function to flip *completely*, though? There’s a difference between some drift in the utility function and the AI screwing up and designing a successor with the complete opposite of its utility function.
Any AGI is likely complex enough that there wouldn’t be a complete opposite, but you don’t need a full flip for an AGI that gets rid of all humans.
The scenario I’m imagining isn’t an AGI that merely “gets rid of” humans. See SignFlip.