Surely with a sufficiently hard take-off it would be possible for the AI to prevent itself from being turned off? And if not, couldn’t the AI simply deceive its creators into thinking that no sign flip has occurred (e.g. by making it look like it’s gaining utility from doing something beneficial to human values when it’s actually losing it)? How would we be able to determine that it’s happened before it’s too late?
Further to that, what if this fuck-up happens during an arms race, when its creators haven’t put enough time into safety to prevent this kind of failure?
In this specific example, the error becomes clear very early in the training process, long before the standard control-problem issues with advanced AI systems come into play.
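A toy illustration of why a sign flip tends to surface immediately (my own sketch, not anything claimed in the thread): flip the sign on a simple objective and the monitored metric moves in the wrong direction from the very first steps, which even a crude sanity check can flag. The quadratic objective, learning rate, and check below are all illustrative assumptions.

```python
import numpy as np

def train(sign=+1.0, steps=20, lr=0.1):
    """Gradient ascent on a toy 'utility' u(w) = -||w||^2.

    With sign=+1 we ascend the intended utility, so ||w||^2 shrinks toward 0.
    With sign=-1 (the hypothetical sign-flip bug) we ascend its negation,
    so ||w||^2 blows up and the monitored metric diverges immediately.
    """
    rng = np.random.default_rng(0)
    w = rng.normal(size=5)
    history = []
    for _ in range(steps):
        grad = -2.0 * w           # gradient of the intended utility -||w||^2
        w = w + lr * sign * grad  # ascend the (possibly sign-flipped) objective
        history.append(float(np.sum(w ** 2)))
    return history

def looks_sign_flipped(history, window=5):
    """Crude early-warning check: is the metric getting worse, not better?"""
    return history[window] > history[0]

for sign in (+1.0, -1.0):
    hist = train(sign=sign)
    flag = "SIGN FLIP?" if looks_sign_flipped(hist) else "ok"
    print(f"sign={sign:+.0f}  first={hist[0]:.3f}  after 5 steps={hist[5]:.3f}  {flag}")
```

In the flipped case the metric is already worse after a handful of steps, which is the sense in which this kind of error is hard to miss during training.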
As for the arms race example, building an AI system of that sophistication to fight in your conflict is like building a Dyson Sphere to power your refrigerator. Friendly AI isn’t the sort of thing major factions are going to want to fight with each other over. If there’s an arms race, either something delightfully improbable and horrible has happened, or it’s an extremely lopsided “race” between a Friendly AI faction and a bunch of terrorist groups.
EDIT (From two months in the future...): I am not implying that such a race would be an automatic win, or even a likely win, for said hypothesized Friendly AI faction. For various reasons, this is most certainly not the case. I’m merely saying that the Friendly AI faction will have vastly more resources than all of its competitors combined, and all of its competitors will be enemies of the world at large, etc.
Addressing this whole situation would require actual nuance. This two-month-old throwaway comment is not the place to put that nuance. And besides, it’s been done before.
Can we be sure that we’d catch it during the training process, though? And would it be possible for it to happen after training has finished?