Do you think that this specific risk could be mitigated by some variant of Eliezer's separation from hyperexistential risk or Stuart Armstrong's idea here:

> Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = −1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
>
> Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.
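To make the two constructions concrete, here is a minimal runnable sketch (mine, not Armstrong's; the world labels and utility values are illustrative assumptions) showing that an optimiser with a flipped sign still lands on a good outcome in both cases:

```python
# Construction 1: U(B1) = 1, U(B2) = -1, U = 0 otherwise.
worlds = ["B1", "B2", "mediocre", "paperclips"]

def u1(world):
    return {"B1": 1, "B2": -1}.get(world, 0)

# A maximiser picks B1; a sign-flipped (minimising) agent picks B2.
# Both were defined to be excellent outcomes, so the flip is harmless.
assert max(worlds, key=u1) == "B1"
assert min(worlds, key=u1) == "B2"

# Construction 2: the agent also controls a trivial feature X in {-1, 1}
# and optimises X * U, where U takes values in [0, 1].
u2 = {"best": 1.0, "mediocre": 0.5, "paperclips": 0.0}
options = [(world, x) for world in u2 for x in (-1, 1)]

def xu(option):
    world, x = option
    return x * u2[world]

# The maximiser sets X = 1, the sign-flipped minimiser sets X = -1,
# but both steer toward the same best world.
assert max(options, key=xu) == ("best", 1)
assert min(options, key=xu) == ("best", -1)

print("Both constructions survive a sign flip.")
```

The point in both cases is that the sign of the objective stops encoding anything safety-critical: flipping it changes which label the agent reaches for, not how good the resulting world is.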
Or at least prevent sign-flip errors from causing something worse than paperclipping?