I’m still a little dubious of Eliezer’s proposed solution to the problem of separation from hyperexistential risk: if we had U = V + W, where V is a reward function and W is some arbitrary thing the AI wants to minimise (e.g. paperclips), then a sign flip in V alone (due to any of a broad disjunction of causes) would still cause a hyperexistential catastrophe.
Or what about the case where, instead of maximising −U, the values the reward function/model assigns to each “thing” are each multiplied by −1? E.g. the AI system gets 1 point for wireheading and −1 for torture, and some weird malware or human screw-up (in the reward model or some relevant database) flips the sign of each individual value. The AI now maximises U = W − V.
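To make the arithmetic concrete, here’s a toy sketch (the outcome names and scores are made up purely for illustration, and W is stood in by a constant) of how either kind of sign flip turns a U = V + W maximiser into something that effectively maximises W − V:

```python
# Toy illustration only: how a single sign flip changes what a
# U = V + W maximiser actually optimises. V = learned reward,
# W = the arbitrary quantity the agent is built to minimise.

def utility(v_score: float, w_score: float) -> float:
    """Intended objective: U = V + W."""
    return v_score + w_score

def utility_v_flipped(v_score: float, w_score: float) -> float:
    """Case 1: the sign of the whole V component is flipped upstream.
    The agent now maximises U = -V + W, i.e. it seeks out exactly the
    outcomes V scored as worst; W does nothing to prevent this."""
    return -v_score + w_score

def corrupted_reward(true_reward: float) -> float:
    """Case 2: each individual reward value is multiplied by -1 at the
    source (corrupted reward model / database). From the agent's point
    of view this is indistinguishable from Case 1."""
    return -1.0 * true_reward

# Hypothetical scores from the *intended* reward model.
outcomes = {"wireheading": 1.0, "torture": -1.0}
w_term = 0.0  # W is untouched by the flip, so it offers no protection here

best_intended = max(outcomes, key=lambda o: utility(outcomes[o], w_term))
best_flipped = max(outcomes, key=lambda o: utility_v_flipped(outcomes[o], w_term))
best_corrupted = max(outcomes, key=lambda o: utility(corrupted_reward(outcomes[o]), w_term))

print(best_intended)   # wireheading
print(best_flipped)    # torture
print(best_corrupted)  # torture
```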
This seems a lot more nuanced than *just* avoiding cosmic rays, and the potential consequences of a hellish “I Have No Mouth, and I Must Scream”-type outcome are far worse than human extinction. I’m not happy with *any* non-negligible probability of this happening.
Perhaps malware could be another risk factor for the type of bug I described here? Not sure.