Hi there, I’m new here and this is an old post but I have a question regarding the AI playing a prisoner dilemma against us, which is : how would this situation be possible? I’m trying to get my head around why the AI would think that our payouts are any different than his payouts, given that we built it, we thought it (some) of our values in a rough way and we asked it to maximize paperclips, which means we like paperclips. Shouldn’t the AI think we are on the same team? I mean, we coded it that way and we gave it a task, what process exactly would make the AI ever think we would disagree with its choice? So for instance if we coded it in such a way that it values a human life 0, then it would only see one choice: make 3 paperclips. And it shouldn’t have any reason to believe that’s not the best outcome for us too, so the only possible outcome from its point of view in this case should be (+0 lives, +3 paperclips). Basically the main question is: how can the AI ever imagine that we would disagree with it? (I’m honestly just asking as I’m struggling with this idea and am interested in this process) Thanks!
We coded it to care about paperclips, not to care about whatever we care about. So it can come to understand that we care about something else, without thereby changing its own preference for paperclips above all else.
Perhaps an analogy without AIs in it would help. Imagine that you have suffered for want of money; you have a child and (wanting her not to suffer as you did) bring her up to seek wealth above all else. So she does, and she is successful in acquiring wealth, but alas! this doesn’t bring her happiness because her single-minded pursuit of wealth has led her to cut herself off from her family (a useful prospective employer didn’t like you) and neglect her friends (you have to work so hard if you really want to succeed in investment banking) and so forth.
One day, she may work out (if she hasn’t already) that her obsession with money is something you brought about deliberately. But knowing that, and knowing that in fact you regret that she’s so money-obsessed, won’t make her suddenly decide to stop pursuing money so obsessively. She knows your values aren’t the same as hers, but she doesn’t care. (You brought her up only to care about money, remember?) But she’s not stupid. When you say to her “I wish we hadn’t raised you to see money as so important!” she understands what you’re saying.
Similarly: we made an AI and we made it care about paperclips. It observes us carefully and discovers that we don’t care all that much about paperclips. Perhaps it thinks “Poor inconsistent creatures, to have enough wit to create me but not enough to disentangle the true value of paperclips from all those other silly things they care about!”.
mmm I see. So maybe we should have coded it so that it cared for paperclips and for an approximation of what we also care about, then on observation it should update its belief of what to care about, and by design it should always assume we share the same values?
I’m not sure whether you mean (1) “we made an approximation to what we cared about then, and programmed it to care about that” or (2) “we programmed it to figure out what we care about, and care about it too”. (Of course it’s very possible that an actual AI system wouldn’t be well described by either—it might e.g. just learn by observation. But it may be extra-difficult to make a system that works that way safe. And the most exciting AIs would have the ability to improve themselves, but figuring out what happens to their values in the process is really hard.)
Anyway: In case 1, it will presumably care about what we told it to care about; if we change, maybe it’ll regard us the same way we might regard someone who used to share our ideals but has now sadly gone astray. In case 2, it will presumably adjust its values to resemble what it thinks ours are. If we’re very lucky it will do so correctly :-). In either case, if it’s smart enough it can probably work out a lot about what our values are now, but whether it cares will depend on how it was programmed.
Yes I think 2) is closer to what I’m suggesting. Effectively what I am thinking is what would happen if, by design, there was only one utility function defined in absolute terms (I’ve tried to explaine this in the latest open thread), so that the AI could never assume we would disagree with it. By all means, as it tries to learn this function, it might get it completely wrong, so this certainly doesn’t solve the problem of how to teach it the right values, but at least it looks to me that with such a design it would never be motivated to lie to us because it would always think we would be in perfect agreement. Also, I think it would make it indifferent to our actions as it would always assume we would follow the plan from that point onward. The utility function it uses (same for itself and for us) would be the union of a utility function that describes the goal we want it to achieve, which would be unchangeable, and the set of values it is learning after each iteration. I’m trying to understand what would be wrong with this design, cause to me it looks like we would have achieved an honest AI, which is a good start.
Hi there, I’m new here and this is an old post but I have a question regarding the AI playing a prisoner dilemma against us, which is : how would this situation be possible? I’m trying to get my head around why the AI would think that our payouts are any different than his payouts, given that we built it, we thought it (some) of our values in a rough way and we asked it to maximize paperclips, which means we like paperclips. Shouldn’t the AI think we are on the same team? I mean, we coded it that way and we gave it a task, what process exactly would make the AI ever think we would disagree with its choice? So for instance if we coded it in such a way that it values a human life 0, then it would only see one choice: make 3 paperclips. And it shouldn’t have any reason to believe that’s not the best outcome for us too, so the only possible outcome from its point of view in this case should be (+0 lives, +3 paperclips). Basically the main question is: how can the AI ever imagine that we would disagree with it? (I’m honestly just asking as I’m struggling with this idea and am interested in this process) Thanks!
We coded it to care about paperclips, not to care about whatever we care about. So it can come to understand that we care about something else, without thereby changing its own preference for paperclips above all else.
Perhaps an analogy without AIs in it would help. Imagine that you have suffered for want of money; you have a child and (wanting her not to suffer as you did) bring her up to seek wealth above all else. So she does, and she is successful in acquiring wealth, but alas! this doesn’t bring her happiness because her single-minded pursuit of wealth has led her to cut herself off from her family (a useful prospective employer didn’t like you) and neglect her friends (you have to work so hard if you really want to succeed in investment banking) and so forth.
One day, she may work out (if she hasn’t already) that her obsession with money is something you brought about deliberately. But knowing that, and knowing that in fact you regret that she’s so money-obsessed, won’t make her suddenly decide to stop pursuing money so obsessively. She knows your values aren’t the same as hers, but she doesn’t care. (You brought her up only to care about money, remember?) But she’s not stupid. When you say to her “I wish we hadn’t raised you to see money as so important!” she understands what you’re saying.
Similarly: we made an AI and we made it care about paperclips. It observes us carefully and discovers that we don’t care all that much about paperclips. Perhaps it thinks “Poor inconsistent creatures, to have enough wit to create me but not enough to disentangle the true value of paperclips from all those other silly things they care about!”.
mmm I see. So maybe we should have coded it so that it cared for paperclips and for an approximation of what we also care about, then on observation it should update its belief of what to care about, and by design it should always assume we share the same values?
I’m not sure whether you mean (1) “we made an approximation to what we cared about then, and programmed it to care about that” or (2) “we programmed it to figure out what we care about, and care about it too”. (Of course it’s very possible that an actual AI system wouldn’t be well described by either—it might e.g. just learn by observation. But it may be extra-difficult to make a system that works that way safe. And the most exciting AIs would have the ability to improve themselves, but figuring out what happens to their values in the process is really hard.)
Anyway: In case 1, it will presumably care about what we told it to care about; if we change, maybe it’ll regard us the same way we might regard someone who used to share our ideals but has now sadly gone astray. In case 2, it will presumably adjust its values to resemble what it thinks ours are. If we’re very lucky it will do so correctly :-). In either case, if it’s smart enough it can probably work out a lot about what our values are now, but whether it cares will depend on how it was programmed.
Yes I think 2) is closer to what I’m suggesting. Effectively what I am thinking is what would happen if, by design, there was only one utility function defined in absolute terms (I’ve tried to explaine this in the latest open thread), so that the AI could never assume we would disagree with it. By all means, as it tries to learn this function, it might get it completely wrong, so this certainly doesn’t solve the problem of how to teach it the right values, but at least it looks to me that with such a design it would never be motivated to lie to us because it would always think we would be in perfect agreement. Also, I think it would make it indifferent to our actions as it would always assume we would follow the plan from that point onward. The utility function it uses (same for itself and for us) would be the union of a utility function that describes the goal we want it to achieve, which would be unchangeable, and the set of values it is learning after each iteration. I’m trying to understand what would be wrong with this design, cause to me it looks like we would have achieved an honest AI, which is a good start.