The class of non-agent AIs (those that don’t choose actions based on the predicted utility of the results) seems very broad. We could choose actions alphabetically, use an expert system representing the outside view, use a biased or inaccurate model when predicting consequences, or include preferences about which actions are good or bad in themselves. A minimal sketch of the contrast is below.
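To make the contrast concrete, here is a minimal sketch contrasting an agent-style selector with two of the non-agent selectors described above. All the helper names (`predict_outcome`, `utility`, `intrinsic_score`) are hypothetical stand-ins, not any real system’s interface:

```python
# Sketch: agent-style vs. non-agent action selection.
# predict_outcome, utility, and intrinsic_score are hypothetical stand-ins.

def agent_select(actions, predict_outcome, utility):
    """Agent: pick the action whose predicted outcome maximizes utility."""
    return max(actions, key=lambda a: utility(predict_outcome(a)))

def alphabetical_select(actions):
    """Non-agent: pick actions by an arbitrary fixed rule (alphabetical order),
    ignoring predicted consequences entirely."""
    return min(actions)

def intrinsic_select(actions, intrinsic_score):
    """Non-agent: rank actions by how good they are 'in themselves',
    not by the utility of their predicted outcomes."""
    return max(actions, key=intrinsic_score)

if __name__ == "__main__":
    actions = ["negotiate", "build", "audit"]
    predict_outcome = lambda a: a            # toy world model
    utility = lambda outcome: len(outcome)   # toy utility function
    intrinsic_score = {"negotiate": 1, "build": 0, "audit": 2}.get

    print(agent_select(actions, predict_outcome, utility))   # -> "negotiate"
    print(alphabetical_select(actions))                      # -> "audit"
    print(intrinsic_select(actions, intrinsic_score))        # -> "audit"
```

The point of the toy example is only that the last two selectors never consult a model of consequences at all, which is what makes them non-agents in the sense above.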
I don’t think there’s any general failure mode (there are certainly specific ones), but if we condition on the AI being one that humans actually selected, we may end up selecting something that’s doing enough optimization that it will take a highly optimizing action, like rewriting itself to be an agent.