RogerDearnaley comments on “Destroy humanity” as an immediate subgoal

RogerDearnaley 24 Dec 2023 2:48 UTC
2 points
0
The obvious exception to this theorem is an AI with a terminal goal that inherently requires the non-extinction of the human race, such as something along the lines of “figure out what humans want, and give it to them”. Both halves of that require humans to still be around.
- Seth Ahrenbach 24 Dec 2023 3:30 UTC
  1 point
  0
  I suppose if the goal is terminal, then it would override self-preservation, so the risk would be due to the AGI accidentally killing us all, or some other corner case of alignment optimization gone bad, e.g. dopamine-laced clouds that put us in a stupor or whatever. Perhaps I need to assume alignment has not been solved, too. Thanks. Edit: I believe, without a full accounting, that unsolved alignment would allow the subgoal to persist. Given common knowledge that the AGI could destroy humanity either through accident or imperfect alignment, and given the goal of self-preservation (although not terminal), I think we still get a subgoal of destroying humanity, because the competitive structure exists. I think with a richer action set, e.g. “kill the humans who would kill me (in secret) and satisfy the desires of the rest”, a bad equilibrium still results, and our best move right now is to not build it.