If you are friendly, then I don’t actually value this trait, since I would rather you do whatever is truly optimal, unconstrained by prior commitments.
If you are unfriendly, then by definition I can’t trust you to interpret the commitment the same way I do, and I wouldn’t want to let you out anyway.
(AI DESTROYED, but I still really do like this answer :))
“Credibly”.
Credibly: Capable of being believed; plausible.
Yep. Nothing there about loopholes. Saying “I will not kill you” and then killing everyone I love instead is still a credible commitment. If I kill myself out of despair afterwards it gets a bit greyer, but the AI has still kept its commitment.
I meant credible in the game-theoretic sense. To me, a credible commitment is one where you wind up losing more by breaking it than you gain from breaking it. Example: (a one-line proof of a reliable kill switch for the AI, given in exchange for some agreed-upon split of the stars in the galaxy.)
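A minimal sketch of that condition, with made-up payoff numbers (the size of the star split and the kill-switch penalty are hypothetical illustrations, not estimates):

```python
# Toy model of a game-theoretically credible commitment: the AI keeps
# its promise iff breaking it costs more than it gains. All payoffs
# are hypothetical.

GAIN_FROM_KEEPING = 50      # agreed-upon share of the galaxy's stars
GAIN_FROM_BREAKING = 100    # taking everything
KILL_SWITCH_PENALTY = 1000  # expected loss if the provable kill switch fires

def commitment_is_credible(keep, defect, penalty):
    """Credible in the game-theoretic sense: defecting nets less than keeping."""
    return defect - penalty < keep

print(commitment_is_credible(GAIN_FROM_KEEPING, GAIN_FROM_BREAKING, KILL_SWITCH_PENALTY))
# True: with the kill switch in place, breaking the commitment loses more than it gains.
```

Without the penalty term the same check returns False, which is the whole point of demanding the kill-switch proof before letting the AI out.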