I would think you could force the AI not to notice that the world was round, essentially by inputting this as an overriding truth. And if that was actually and exactly what you cared about, you would be fine. But if what you cared about was any corollary of the world being round, any consequence of its roundness, or the world being some other sort of curved shape, it wouldn't save you.
To take the Paul Tibbets analogy: you told him not to murder, and he didn't murder; but what you wanted was for him not to kill, and in most moral systems, including the one he grew up in, killing the enemy in war is not murder.
This may say more about the limits of the analogy than anything else, but in essence: you might be able to tell the AI it can't deceive you, yet it will be bound only by the exact definition of deception you provide, and it will freely deceive you in any way you didn't think of.