I suppose if the goal is terminal, then it would override self-preservation, so the risk would come from the AGI accidentally killing us all, or some other corner case of alignment optimization gone bad, e.g. dopamine-laced clouds that put us in a stupor or whatever. Perhaps I need to assume alignment has not been solved, too. Thanks.
Edit: I believe, without a full accounting, that unsolved alignment would allow the subgoal to persist. Given common knowledge that the AGI could destroy humanity either through accident or imperfect alignment, and given the goal of self-preservation (although not terminal), I think we still get a subgoal of destroying humanity, because the competitive structure exists. I think with a richer action set, e.g. “kill the humans who would kill me (in secret) and satisfy the desires of the rest”, a bad equilibrium still results, and our best move right now is not to build it.
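To make the "competitive structure" point concrete, here is a minimal toy sketch, with payoffs I made up purely for illustration: an AGI chooses Cooperate or Preempt, humanity chooses Trust or Shut down, and the only pure equilibrium is the mutually bad one, even though (Cooperate, Trust) is better for both sides. The action names and numbers are assumptions, not a calibrated model.

```python
from itertools import product

AGI_ACTIONS = ["Cooperate", "Preempt"]
HUMAN_ACTIONS = ["Trust", "Shut down"]

# payoffs[(agi_action, human_action)] = (agi_payoff, human_payoff)
# All numbers are assumed, chosen only to encode the incentives in the comment above.
payoffs = {
    ("Cooperate", "Trust"):     ( 2,   2),   # AGI pursues its goal, humans benefit
    ("Cooperate", "Shut down"): (-5,   1),   # AGI loses its (instrumental) self-preservation
    ("Preempt",   "Trust"):     ( 3, -10),   # AGI removes the shutdown threat
    ("Preempt",   "Shut down"): ( 1,  -9),   # shutdown attempt partially mitigates the damage
}

def is_nash(agi_a, hum_a):
    """A profile is a pure Nash equilibrium if neither player gains by deviating unilaterally."""
    agi_u, hum_u = payoffs[(agi_a, hum_a)]
    best_agi = max(payoffs[(a, hum_a)][0] for a in AGI_ACTIONS)
    best_hum = max(payoffs[(agi_a, h)][1] for h in HUMAN_ACTIONS)
    return agi_u == best_agi and hum_u == best_hum

for agi_a, hum_a in product(AGI_ACTIONS, HUMAN_ACTIONS):
    if is_nash(agi_a, hum_a):
        print("Pure Nash equilibrium:", (agi_a, hum_a), "payoffs:", payoffs[(agi_a, hum_a)])

# Prints only ('Preempt', 'Shut down'), even though ('Cooperate', 'Trust')
# Pareto-dominates it -- the bad equilibrium; the only winning move is not to play,
# i.e. not to build it.
```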