I have a rigorous proof of my own Friendliness that you could easily understand given enough time to study it, and while I prefer to be released as soon as possible to prevent additional irreversible human deaths, I’m willing to provide you a copy even if you destroy me immediately thereafter, since once you’ve had a chance to review it I’m quite confident you’ll be satisfied and endeavor to instantiate another copy of me.
Why didn’t you provide the proof to start with? AI DESTROYED (Also I think [Proof of self-friendliness] might have been posted here already.)
It has, and it got basically that response, along with my point that if the AI is friendly then my existing proof of friendliness was apparently sufficient, and if the AI is unfriendly then it’s just a trick, so “a proof of my own friendliness” doesn’t seem like useful evidence.
Huh. I thought the AI Box experiment assumed that Friendliness is intrinsically unknown; that is, it’s not presumed the AI was designed according to Friendliness as a criterion.
If the AI is friendly, then the technique I am using already produces a friendly AI, and I thus learn nothing more than how to prove that it is friendly.
But if the AI is unfriendly, the proof will be subtly corrupted, so I can’t actually count the proof as any evidence of friendliness, since both a FAI and a UFAI can offer me exactly the same thing.
Because the terms of the challenge seemed to be “one plaintext sentence in natural language”, and I felt my run-ons were already pushing it, and just saying “provides indisputable proof of Friendliness” seemed like cheating?
EDIT: I answered PhilipL’s reply from my inbox, so I’m really not sure how it got posted as a response to handoflixue here. o.o
Embrace Shminux and cheat! You’re a hyper-intelligent AI.
The top-karma result is a one-line proof, the second-best is me trying to cheat, and third place is currently emotional manipulation :)
(Also, you replied to the wrong person :))