It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray: or, perhaps more plausibly, one human screw-up (e.g. this sort of thing) away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think that it’s absolutely critical that we have a robust mechanism in place to prevent something like this from happening regardless of the cause; sure, we can sanity-check the system, but that won’t help when the issue is caused after we’ve sanity-checked it, as is the case with cosmic rays or some human errors (like Gwern’s example, which I linked). We need ways to prevent this sort of thing from happening *regardless* of the source.
Some propositions seem promising. Eliezer’s suggestion of assigning a sort of “surrogate goal” that the AI hates more than torture, but not enough to devote all of its energy to attempt to prevent, seems promising. But this would only work when the *entire* reward is what gets flipped; with how much confidence can we rule out, say, a localised sign flip in some specific part of the AI that leads to the system terminally valuing something bad but that doesn’t change anything else (so the sign on the “surrogate” goal remains negative). Can we even be confident that the AI’s development team is going to implement something like this, and that it will work as intended?
An FAI that’s one software bug or screw-up in a database away from AM is a far scarier possibility than a paperclipper, IMO.
It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray: or, perhaps more plausibly, one human screw-up (e.g. this sort of thing) away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think that it’s absolutely critical that we have a robust mechanism in place to prevent something like this from happening regardless of the cause; sure, we can sanity-check the system, but that won’t help when the issue is caused after we’ve sanity-checked it, as is the case with cosmic rays or some human errors (like Gwern’s example, which I linked). We need ways to prevent this sort of thing from happening *regardless* of the source.
Some propositions seem promising. Eliezer’s suggestion of assigning a sort of “surrogate goal” that the AI hates more than torture, but not enough to devote all of its energy to attempt to prevent, seems promising. But this would only work when the *entire* reward is what gets flipped; with how much confidence can we rule out, say, a localised sign flip in some specific part of the AI that leads to the system terminally valuing something bad but that doesn’t change anything else (so the sign on the “surrogate” goal remains negative). Can we even be confident that the AI’s development team is going to implement something like this, and that it will work as intended?
An FAI that’s one software bug or screw-up in a database away from AM is a far scarier possibility than a paperclipper, IMO.