All of what Desrtopa said, but also, “hacking me” isn’t evidence of friendliness.
I don’t have any reason to assume that any given hack attempt is more likely to come from an FAI, so I can assign, at best, 50/50 odds that any AI trying to hack me is unfriendly. I do not want to release any AI which has a 50% chance of being unfriendly. Therefore, I do not want to be hacked.
I also suspect that 50% chance of being friendly is generous, but that’s more of a gut intuition.
I think this is a bad use of probabilities. If a friendly and an unfriendly AI are equally likely to hack you in this scenario, then learning that the AI tried to hack you shouldn’t change your estimate of its friendliness: it provides no evidence one way or the other, because both options were equally likely to show that behaviour.
e.g. if your prior P(UFAI) = 0.01 (1% chance of unfriendliness), and you estimate P(hack|UFAI) = 70% (a UFAI has a 70% chance to try to hack) and P(hack|FAI) = 70% also, then the posterior P(UFAI|hack) = P(hack|UFAI) × P(UFAI) / P(hack) = 0.7 × 0.01 / 0.7 = 0.01 still...
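For concreteness, here is that arithmetic as a minimal Python sketch, using the same illustrative numbers as above:

```python
# Bayes' rule with the illustrative numbers from the comment above.
p_ufai = 0.01              # prior P(UFAI)
p_hack_given_ufai = 0.7    # P(hack | UFAI)
p_hack_given_fai = 0.7     # P(hack | FAI): same likelihood for both

# Total probability of seeing a hack attempt.
p_hack = p_hack_given_ufai * p_ufai + p_hack_given_fai * (1 - p_ufai)

# Posterior P(UFAI | hack): equal likelihoods cancel, so it matches the prior.
p_ufai_given_hack = p_hack_given_ufai * p_ufai / p_hack
print(p_ufai_given_hack)   # ≈ 0.01 (up to floating-point rounding)
```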
I also suspect that 50% chance of being friendly is generous
This is rather relevant :)
Basically, I’m killing any AI that doesn’t signal friendliness in some way. I currently have some prior that the AI is friendly, call it a 1% chance. I’ve pre-committed to destroy AIs, so clearly I’ll destroy any AI whose posterior probability of friendliness is 1% or less after I update on the message it sends. The exact threshold is unknown to me, because I’m human, so mild evidence of friendliness, say a posterior of 1.1%, might also still get it destroyed.
The AI gets one chance to update my probability that it’s friendly (this is the original message). As you aptly demonstrated, hacking does not change my odds, so I retain my commitment to kill it.
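A rough sketch of that pre-commitment as code, assuming a simple Bayesian gatekeeper; the likelihoods and the exact 1% threshold are illustrative placeholders rather than anything specified here (exact rational arithmetic keeps the posterior-equals-prior boundary case clean):

```python
from fractions import Fraction as F

def decide(prior_friendly, p_msg_given_friendly, p_msg_given_unfriendly,
           threshold=F("0.01")):
    """Destroy the AI unless its one message pushes P(friendly) above the threshold."""
    p_msg = (p_msg_given_friendly * prior_friendly
             + p_msg_given_unfriendly * (1 - prior_friendly))
    posterior = p_msg_given_friendly * prior_friendly / p_msg
    return "destroy" if posterior <= threshold else "spare (for now)"

# A hack attempt is equally likely from either kind of AI, so the posterior
# stays at the 1% prior and the pre-commitment fires.
print(decide(F("0.01"), F("0.7"), F("0.7")))     # destroy
# A message far more likely to come from a friendly AI moves the posterior enough.
print(decide(F("0.01"), F("0.9"), F("0.05")))    # spare (posterior ≈ 0.15)
```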
The fact that the message hasn’t moved my estimate, that I’ve gained no information, is itself something I can use as evidence. Same as how we conclude god doesn’t exist because we would expect to see evidence if he did :)
(Ref: http://lesswrong.com/lw/ih/absence_of_evidence_is_evidence_of_absence/)
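A small numerical sketch of that last point, under the assumption (mine, for illustration) that a friendly AI would be more likely than an unfriendly one to produce some convincing signal of friendliness; all numbers are made up:

```python
# Absence of evidence as evidence of absence, with made-up numbers.
p_friendly = 0.01                # prior P(FAI)
p_signal_given_friendly = 0.5    # chance an FAI sends a convincing signal
p_signal_given_unfriendly = 0.1  # chance a UFAI manages the same

# Probability of seeing NO convincing signal.
p_no_signal = ((1 - p_signal_given_friendly) * p_friendly
               + (1 - p_signal_given_unfriendly) * (1 - p_friendly))

# Posterior P(FAI | no signal): lower than the 1% prior.
posterior = (1 - p_signal_given_friendly) * p_friendly / p_no_signal
print(posterior)   # ≈ 0.0056
```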