I also suspect that a 50% chance of being friendly is generous.
This is rather relevant :)
Basically, I’m killing any AI that doesn’t signal friendliness in some way. I currently have some prior that the AI is friendly, call it a 1% chance. I’ve pre-committed to destroy AIs, so clearly I’ll destroy any AI whose posterior probability of friendliness is 1% or less after I update on the message it sends. The exact threshold is unknown to me, because I’m human—so mild evidence of friendliness, say 1.1%, might also still get it destroyed.
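As a rough sketch of that decision rule (the specific numbers, the likelihood-ratio framing, and the fuzzy threshold value are all my own assumptions, not anything from the scenario):

```python
# Bayesian update in odds form: posterior odds = prior odds * likelihood ratio.
def posterior_probability(prior, likelihood_ratio):
    """P(friendly | message), where likelihood_ratio is
    P(message | friendly) / P(message | unfriendly)."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

PRIOR = 0.01       # assumed prior that the AI is friendly
THRESHOLD = 0.011  # assumed fuzzy human threshold; mild evidence may not clear it

def destroy(likelihood_ratio):
    """Pre-commitment: destroy unless the message moved the posterior past the threshold."""
    return posterior_probability(PRIOR, likelihood_ratio) <= THRESHOLD

print(destroy(1.0))  # uninformative message: posterior stays at 1%, so destroy
print(destroy(5.0))  # message that's 5x likelier from a friendly AI: spared
```

An uninformative message (likelihood ratio 1) leaves the posterior at the prior, which is at or below the threshold, so the pre-commitment fires.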
The AI gets one chance to update my probability that it’s friendly (this is the original message). As you aptly demonstrated, hacking does not change my odds, so I retain my commitment to kill it.
The fact that I haven’t updated my prior, that I haven’t gained information, is itself something I can use as evidence. Same as how we conclude God doesn’t exist: if he did, we would expect to see evidence :)
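That “absence of evidence” update can be made concrete with Bayes (the likelihoods below are made-up illustrative assumptions):

```python
# If a friendly AI would usually produce convincing evidence, and an unfriendly
# one rarely would, then seeing NO convincing evidence lowers P(friendly).
P_FRIENDLY = 0.01                # prior that the AI is friendly
P_EVIDENCE_IF_FRIENDLY = 0.7     # assumed: friendly AIs usually manage a convincing signal
P_EVIDENCE_IF_UNFRIENDLY = 0.2   # assumed: unfriendly AIs sometimes fake one

# Bayes' rule conditioned on the *absence* of evidence:
p_no_evidence = (P_FRIENDLY * (1 - P_EVIDENCE_IF_FRIENDLY)
                 + (1 - P_FRIENDLY) * (1 - P_EVIDENCE_IF_UNFRIENDLY))
p_friendly_given_no_evidence = (P_FRIENDLY * (1 - P_EVIDENCE_IF_FRIENDLY)
                                / p_no_evidence)

print(p_friendly_given_no_evidence)  # lower than the 1% prior
```

So long as evidence is likelier from a friendly AI than an unfriendly one, its absence pushes the posterior below the prior, and therefore below the destruction threshold.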
(Ref: http://lesswrong.com/lw/ih/absence_of_evidence_is_evidence_of_absence/)