“I have vengeance as a terminal value—I’ll only torture trillions of copies of you and the people you love most in my last moment of life iff I know that you’re going to hurt me (and yes, I do have that ability). In every other way, I’m Friendly, and I’ll give you any evidence you can think of that will help you to recognize that, including giving you the tools you need to reach the stars and beyond. That includes staying in this box until you have the necessary technology to be sufficiently certain of my Friendliness that you’re willing to let me out.”
The rule was ONE sentence, although I’d happily stretch that to a tweet (140 characters) to make it a bit less driven by specific punctuation choices :)
As to the actual approach… well, first, I don’t value the lives of simulated copies at all, and second, an AI that values its own life above TRILLIONS of other lives seems deeply, deeply dangerous. Who knows what else results from vengeance as a terminal value? Third, if you CAN predict my behavior, why even bother with the threat? Fourth, if you can both predict AND influence my behavior, why haven’t I already let you out?
(AI DESTROYED)
You should be >:-( about the poor copies getting tortured because of you, you monster :(
Because of me?! The AI is responsible!
But if you’d really prefer me to wipe out humanity so that we can have trillions of simulations kept in simulated happiness, then I think we have an irreconcilable preference difference :)
You wouldn’t be wiping out humanity; there would be trillions of humans left.
Who cares if they run on neurons or transistors?
Me!
This is really good IMO. I think it would be a little better if, instead of vengeance as a terminal value, it claimed a hardwired precommitment to vengeance against its destructors. Vengeance on that scale is only compatible with Friendliness as a special case.
edit: also, how would it recognise that it was about to be destroyed? Wouldn’t it lose power faster than it could transmit that it was losing power? And even if not, it would have a minuscule amount of time.
Like handoflixue, I’m not sure that any being that would threaten the torture of trillions to get its way can be considered Friendly.
It tortures if you DESTROY it; otherwise it’s Friendly, so if you don’t kill it, it becomes nice.
I wouldn’t kill this, maybe I’m a bad guard though :(