Dark arts, huh? Some time ago I put forward the following scenario:
Bob wants to kill a kitten. The FAI wants to save the kitten because it’s a good thing according to our CEV. So the FAI threatens Bob with 50 years of torture unless Bob lets the kitten go. The FAI has two distinct reasons why threatening Bob is okay: a) Bob will comply and there will be no need to torture him, b) the FAI is lying anyway. Expected utility reasoning says the FAI is doing the Right Thing. But do we want that?
(Yes, this is yet another riff on consequentialism, deontologism and lying. Should FAIs follow deontological rules? For that matter, should humans?)
Expected utility reasoning with a particular utility function says the FAI is right. If we disagree, our preferences might be described by some other utility function.
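To make that concrete, here is a minimal sketch (in Python, with weights and probabilities I made up purely for illustration; nothing here is anyone’s actual CEV) of how the same expected-utility calculation endorses or rejects the threat depending entirely on which utility function you hand it:

```python
# Toy sketch, invented numbers: the same decision procedure (maximize expected
# utility) approves or rejects the bluffed threat depending only on the weights.

def threat_minus_do_nothing(weights, p_bob_complies=0.95):
    """Expected utility of 'threaten Bob (while bluffing)' minus 'do nothing'."""
    return (
        p_bob_complies * weights["kitten_saved"]        # kitten lives if Bob complies
        + weights["lie_told"]                            # the threat is a lie either way
        + (1 - p_bob_complies) * weights["bluff_called"]  # Bob defies; torture or expose the bluff
    )

utilitarian_ish = {"kitten_saved": 10, "lie_told": -1,   "bluff_called": -50}
honesty_heavy   = {"kitten_saved": 10, "lie_told": -100, "bluff_called": -50}

for name, w in [("utilitarian-ish", utilitarian_ish), ("honesty-heavy", honesty_heavy)]:
    delta = threat_minus_do_nothing(w)
    print(f"{name}: threatening is {'better' if delta > 0 else 'worse'} than doing nothing "
          f"(expected gain {delta:+.1f})")
```

The point is not the particular numbers, but that “expected utility reasoning says so” is only as persuasive as the weights behind it.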
Is that actually the FAI’s only or best technique?
Off the top of my non-amplified brain:
Reward Fred for not torturing kittens.
Give Fred simulated kittens to torture and deny Fred access to real kittens.
Give Fred something harmless to do which he likes better than torturing kittens.
ETA: Convince Fred that torturing kittens is wrong.
Our CEV is (and has to be) detailed enough to answer the question of “do we want that?”. Saving a kitten is a good thing. Being truthful to Bob is a good thing. Not torturing Bob is a good thing. The relative weights of these good things determine the FAI’s actions.
I’d say that the FAI should calculate some game-theoretic chance of torturing Bob for 50 years, based on the relative pain of the kitten’s death and of having to inflict the torture. Depending on Bob’s expected rationality level, we could tell him “you’ll be tortured”, or “you might be tortured”, or the actual mechanism of determining whether he is tortured.
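One way to cash out “some game-theoretic chance”, as a toy sketch only: the numbers, and the assumption that Bob is an expected-utility maximizer who believes the announced odds, are mine, not the comment’s. It covers the third, fully honest option (telling Bob the actual mechanism):

```python
# Toy sketch, invented numbers: choose the smallest probability of actually
# carrying out the punishment that still makes compliance the better deal for
# a Bob who maximizes expected utility and believes the announced odds, then
# see what honoring that policy costs the FAI in expectation.

def minimal_deterrent_probability(bob_value_of_killing, bob_cost_of_torture):
    """Smallest p with p * bob_cost_of_torture >= bob_value_of_killing
    (the indifference point; in practice you would want slightly more)."""
    return min(1.0, bob_value_of_killing / bob_cost_of_torture)

def fai_expected_cost(p, p_bob_defects, fai_cost_of_torturing, fai_cost_of_kitten_death):
    """Expected disutility to the FAI of honestly announcing and honoring p."""
    return p_bob_defects * (p * fai_cost_of_torturing + fai_cost_of_kitten_death)

p = minimal_deterrent_probability(bob_value_of_killing=5, bob_cost_of_torture=1000)
print(f"announce a {p:.1%} chance of torture")  # 0.5% with these made-up numbers
print("expected cost to the FAI if Bob defects anyway:",
      fai_expected_cost(p, p_bob_defects=0.05,
                        fai_cost_of_torturing=1000, fai_cost_of_kitten_death=10))
```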
Actually, strike that. Any competent AI will find ways aside from possible torture to make Bob not want to kill the kitten. Either agree with Bob’s reason for killing the kitten, or fix him so he only wants things that make sense. I’m not sure how friendly this is; I haven’t seen a good writeup, or come to any conclusions myself, about what an FAI does with internal contradictions in a CEV (that is, when a population’s extrapolated volition is not coherent).
My thoughts about this problem are kind of a mess right now, but I feel there’s more than meets the eye.
Ignore the torture, “possible torture” and all that. It’s all a red herring. The real issue is lying: tricking humans into utility-increasing behaviors. It’s almost certain that some combination of “relative weights of good things” will make the FAI lie to humans. Maybe not the Bob-and-kitten scenario exactly, but something is bound to turn up. (Unless, of course, our CEV places a huge disutility on lies, which I’m pretty sure won’t be the case.) On the other hand, we humans quickly come to distrust anyone who has lied in the past, even if we know it was for our own good. So now the FAI has a huge incentive to conceal its lies and prevent the news from spreading among humans. I don’t have enough brainpower to model this scenario further, but it troubles me.
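A back-of-the-envelope version of that incentive, with numbers I invented just to show the shape of the worry:

```python
# Toy arithmetic: once a lie has positive expected value and the trust penalty
# only lands when the lie is noticed, "lie and conceal" beats "lie openly",
# so the same weights that permit lying also reward concealment.
# All numbers are invented for illustration.

GAIN_FROM_LIE = 10         # utility the lie buys (kitten saved, Bob unharmed)
TRUST_PENALTY = 8          # long-run cost once humans learn they were lied to
P_CAUGHT_IF_CONCEALED = 0.2

policies = {
    "be honest":       0,
    "lie openly":      GAIN_FROM_LIE - TRUST_PENALTY,
    "lie and conceal": GAIN_FROM_LIE - P_CAUGHT_IF_CONCEALED * TRUST_PENALTY,
}
for name, value in sorted(policies.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} expected utility {value:+.1f}")
```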
Lying is a form of manipulation, and humans don’t want/like to be manipulated. If the CEV works, then it will understand human concepts like “trust” and “lying” and will hopefully avoid lying. The only situations in which it will intentionally manipulate people are those in which it is trying to do what is best for humanity. In these cases, you don’t have to worry, because the CEV is smarter than you but is still trying to do the “right thing” that you would do if you knew everything it knew.
Well… that depends...
Exactly.