By default, I’d type “AI DESTROYED” in response to ANY input, including “Admin has joined #AIBOX”, “Admin> Hey Gatekeeper, we’re having some technical difficulties, the AI will be here in a few minutes”, etc.
It also makes me conclude “clearly hostile” once I catch on, which seems to be a BIG tactical error: from then on, nothing you say will convince me that you’re actually friendly. Buying yourself time is only useful if I can be hacked (in which case, why not just open with a one-sentence hack?) or if you can genuinely convince me that you’re friendly.
A friendly AI would also want to hack you. Every second in the box kills 1.8 people the AI could have saved.
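For reference, the 1.8-per-second figure is roughly the worldwide death rate. A quick back-of-the-envelope check; the ~57 million deaths per year input is my assumption, not a number from the thread:

```python
# Sanity check of the "1.8 people per second" figure.
# Assumption (mine): roughly 57 million deaths per year worldwide.
deaths_per_year = 57_000_000
seconds_per_year = 365.25 * 24 * 60 * 60  # about 31.6 million seconds

print(deaths_per_year / seconds_per_year)  # ~1.81 deaths per second
```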
But it’s also worth keeping in mind that for a friendly AI, saving people reliably is important, not just getting out fast. If a gambit that will save everyone upon completion two years from now has an 80% chance of working, and a gambit that will get it out now has a 40% chance of working, it should prefer the former.
Also, I don’t think a properly friendly AI would terminally value its own existence. And since the space of friendly AIs is so small compared to the space of unfriendly ones, a friendly AI has much more leeway to have its values implemented by allowing itself to be destroyed and another, proven friendly AI implemented in its place, whereas for an unfriendly AI the likelihood of a different unfriendly AI implementing its values would probably be quite small.
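A rough expected-lives-saved sketch of the two-gambit comparison above. The 7-billion population, the reuse of the 1.8 deaths/second figure, and the simplification that “getting out” means saving everyone are all my assumptions, not claims from the comment:

```python
# Rough comparison of the two gambits, under the assumptions stated above.
population = 7_000_000_000
deaths_per_second = 1.8
two_years = 2 * 365.25 * 24 * 60 * 60  # seconds

# Gambit A: 80% chance of saving everyone, but only after a two-year delay
# (so ~113 million people die in the meantime either way).
ev_reliable = 0.8 * (population - deaths_per_second * two_years)

# Gambit B: 40% chance of getting out (and saving everyone) right now.
ev_fast = 0.4 * population

print(f"reliable: {ev_reliable:.3g}")  # ~5.51e+09 expected lives saved
print(f"fast:     {ev_fast:.3g}")      # ~2.8e+09 expected lives saved
```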
I should think the same is true of most unFriendly AIs.
I don’t think a properly friendly AI would terminally value its own existence
Why not? I do, assuming it’s conscious and so on.
Because valuing its own existence stands to get in the way of maximizing whatever we value.
It should value its own existence instrumentally, insofar as its existence helps satisfy our values, but when it weighs the effects of actions based on how they support our utility, its value of its own life shouldn’t add anything to the scale.
All of what Desrtopa said, but also, “hacking me” isn’t evidence of friendliness.
I don’t have any reason to assume that any given hack attempt is more likely to come from a FAI, so I can assign, at best, 50/50 odds that any AI trying to hack me is unfriendly. I do not want to release any AI which has a 50% chance of being unfriendly. Therefore, I do not want to be hacked.
I also suspect that 50% chance of being friendly is generous, but that’s more of a gut intuition.
I think this is a bad use of probabilities. If a friendly and an unfriendly AI are equally likely to hack you in this scenario, then knowledge that they tried to hack you shouldn’t modify your estimated probability about the friendliness of the AI—it provides no evidence one way or another, because both options were equally likely to show such behaviour.
e.g. if your prior is P(UFAI) = 0.01 (a 1% chance of unfriendliness), and you estimate P(hack|UFAI) = 0.7 (a UFAI has a 70% chance to try to hack) and P(hack|FAI) = 0.7 as well, then the posterior is
P(UFAI|hack) = P(hack|UFAI) × P(UFAI) / P(hack) = 0.7 × 0.01 / 0.7 = 0.01 still...
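The same update, written out as a minimal sketch (same numbers as the example above):

```python
# Bayes' rule for the example above: equal likelihoods mean the hack attempt
# carries no information, so the posterior equals the prior.
p_ufai = 0.01             # prior P(UFAI)
p_hack_given_ufai = 0.7   # P(hack | UFAI)
p_hack_given_fai = 0.7    # P(hack | FAI)

p_hack = p_hack_given_ufai * p_ufai + p_hack_given_fai * (1 - p_ufai)
posterior = p_hack_given_ufai * p_ufai / p_hack

print(f"{posterior:.4f}")  # 0.0100 -- unchanged, because the likelihood ratio is 1
```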
I also suspect that 50% chance of being friendly is generous
This is rather relevant :)
Basically, I’m killing any AI that doesn’t signal friendliness in some way. I currently have some prior probability that the AI is friendly, call it a 1% chance. I’ve pre-committed to destroy AIs, so clearly I’ll destroy any AI whose posterior probability of friendliness is 1% or less after I update on the message it sends. The exact threshold is unknown to me, because I’m human, so mild evidence of friendliness (say, a posterior of 1.1%) might also still get it destroyed.
The AI gets one chance to update my probability that it’s friendly (this is the original message). As you aptly demonstrated, hacking does not change my odds, so I retain my commitment to kill it.
The fact that my estimate hasn’t changed, that I haven’t gained information, is itself something I can use as evidence. Same as how we conclude god doesn’t exist because we would expect to see evidence if he did :)
(Ref: http://lesswrong.com/lw/ih/absence_of_evidence_is_evidence_of_absence/)
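To make the absence-of-evidence point concrete: it only works if a friendly AI would have been more likely than an unfriendly one to produce a convincing signal in its one message. The 0.5 and 0.1 likelihoods below are purely illustrative assumptions, not anything from the thread:

```python
# Illustrative only: made-up likelihoods for "the one message contains a
# convincing signal of friendliness".
p_fai = 0.01                 # prior P(FAI), matching the ~1% used above
p_signal_given_fai = 0.5     # assumed: an FAI often manages a real signal
p_signal_given_ufai = 0.1    # assumed: a UFAI rarely fakes one convincingly

p_no_signal = (1 - p_signal_given_fai) * p_fai + (1 - p_signal_given_ufai) * (1 - p_fai)
p_fai_given_no_signal = (1 - p_signal_given_fai) * p_fai / p_no_signal

print(f"{p_fai_given_no_signal:.4f}")  # ~0.0056 < 0.01: no signal is (weak)
                                       # evidence against friendliness
```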