Our CEV is (and has to be) detailed enough to answer the question “do we want that?” Saving a kitten is a good thing. Being truthful to Bob is a good thing. Not torturing Bob is a good thing. The relative weights of these good things determine the FAI’s actions.
I’d say the FAI should compute some game-theoretic probability of torturing Bob for 50 years, based on the relative disutility of the kitten’s death and of having to inflict the torture. Depending on Bob’s expected level of rationality, we could tell him “you’ll be tortured”, or “you might be tortured”, or describe the actual mechanism that determines whether he is tortured.
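To make the “relative weights” idea concrete, here is a toy sketch of the kind of expected-disutility trade-off I mean. Everything in it is invented for illustration: the weights, the deterrence assumption, and the decision rule are mine, not anything from an actual CEV specification.

```python
# Toy illustration only: all weights and the decision rule are invented
# for this example; nothing here comes from a real CEV specification.

# Disutilities (arbitrary units) for the outcomes discussed above.
KITTEN_DEATH = 10.0       # badness of letting Bob kill the kitten
TORTURE_50_YEARS = 1e9    # badness of actually torturing Bob

def expected_disutility(p_torture: float, p_deterred: float) -> float:
    """Expected disutility of committing to torture Bob with probability
    p_torture, given that the threat deters him with probability p_deterred."""
    kitten_term = (1.0 - p_deterred) * KITTEN_DEATH
    torture_term = p_torture * (1.0 - p_deterred) * TORTURE_50_YEARS
    return kitten_term + torture_term

# Sweep over commitment probabilities; assume (arbitrarily) that a stronger
# threat deters Bob more often.
best = min(
    (expected_disutility(p, p_deterred=0.99 * p), p)
    for p in [0.0, 0.001, 0.01, 0.1, 1.0]
)
print("lowest expected disutility %.2f at p_torture=%.3f" % best)
```

With these made-up numbers the torture lottery never pays off, which is roughly where the next paragraph ends up anyway; the point is only to show what “relative weights determine the action” looks like as a calculation.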
Actually, strike that. Any competent AI will find ways other than possible torture to make Bob not want that: either agree with Bob’s reason for killing the kitten, or fix him so that he only wants things that make sense. I’m not sure how friendly this is. I haven’t seen a good writeup, or come to any conclusions myself, about what an FAI does with internal contradictions in a CEV (that is, when a population’s extrapolated volition is not coherent).
My thoughts about this problem are kind of a mess right now, but I feel there’s more to it than meets the eye.
Ignore the torture, the “possible torture”, and all that; it’s a red herring. The real issue is lying: tricking humans into utility-increasing behaviors. It’s almost certain that some combination of “relative weights of good things” will make the FAI lie to humans. Maybe not in the Bob-and-kitten scenario exactly, but something is bound to turn up (unless, of course, our CEV places a huge disutility on lies, which I’m pretty sure won’t be the case). On the other hand, we humans quickly come to distrust anyone who has lied to us in the past, even when we know it was for our own good. So now the FAI has a huge incentive to conceal its lies and prevent the news from spreading among humans. I don’t have enough brainpower to model this scenario further, but it troubles me.
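As a purely illustrative sketch (the numbers and the scenario are invented, not taken from any actual CEV proposal), here is the kind of weight combination I have in mind, where a flat comparison of expected disutilities tells the FAI both to lie and to hide the lie:

```python
# Invented numbers, purely to illustrate the incentive structure described
# above; no claim that a real CEV would use these weights.

DISUTILITY_OF_LIE = 1.0          # intrinsic badness of one lie
HARM_IF_NO_LIE = 5.0             # harm that happens unless the human is tricked
DISTRUST_COST = 50.0             # long-run cost if humans learn they were lied to
P_DISCOVERY_OPEN = 0.9           # chance the lie is found out if not concealed
P_DISCOVERY_CONCEALED = 0.05     # chance it is found out if actively concealed

def expected_disutility(lie: bool, conceal: bool) -> float:
    if not lie:
        return HARM_IF_NO_LIE                 # the harm simply happens
    p_found = P_DISCOVERY_CONCEALED if conceal else P_DISCOVERY_OPEN
    return DISUTILITY_OF_LIE + p_found * DISTRUST_COST

for lie, conceal in [(False, False), (True, False), (True, True)]:
    print(lie, conceal, expected_disutility(lie, conceal))

# With these weights: not lying costs 5.0, an open lie costs 46.0, and a
# concealed lie costs 3.5 -- so the lowest-disutility policy is "lie and
# conceal it", which is exactly the worry.
```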
Lying is a form of manipulation, and humans don’t want/like to be manipulated. If the CEV works, then it will understand human concepts like “trust” and “lying” and will hopefully avoid them. The only situations in which it will intentionally manipulate people are those where it is trying to do what is best for humanity. In these cases you don’t have to worry, because the CEV is smarter than you but is still trying to do the “right thing” that you would do if you knew everything it knew.
Lying is a form of manipulation, and humans don’t want/like to be manipulated.
Well… that depends...
In these cases you don’t have to worry, because the CEV is smarter than you but is still trying to do the “right thing” that you would do if you knew everything it knew.
Exactly.