Mmm, I see. So maybe we should have coded it so that it cared about paperclips and about an approximation of what we also care about; then, on observation, it should update its belief about what to care about, and by design it should always assume we share the same values?
I’m not sure whether you mean (1) “we made an approximation to what we cared about then, and programmed it to care about that” or (2) “we programmed it to figure out what we care about, and care about it too”. (Of course it’s very possible that an actual AI system wouldn’t be well described by either—it might e.g. just learn by observation. But it may be extra-difficult to make a system that works that way safe. And the most exciting AIs would have the ability to improve themselves, but figuring out what happens to their values in the process is really hard.)
Anyway: In case 1, it will presumably care about what we told it to care about; if we change, maybe it’ll regard us the same way we might regard someone who used to share our ideals but has now sadly gone astray. In case 2, it will presumably adjust its values to resemble what it thinks ours are. If we’re very lucky it will do so correctly :-). In either case, if it’s smart enough it can probably work out a lot about what our values are now, but whether it cares will depend on how it was programmed.
Yes, I think (2) is closer to what I’m suggesting. Effectively, what I am wondering is what would happen if, by design, there were only one utility function, defined in absolute terms (I’ve tried to explain this in the latest open thread), so that the AI could never assume we would disagree with it. Of course, as it tries to learn this function it might get it completely wrong, so this certainly doesn’t solve the problem of how to teach it the right values. But it looks to me that with such a design it would never be motivated to lie to us, because it would always think we were in perfect agreement. I also think it would make it indifferent to our actions, as it would always assume we would follow the plan from that point onward. The utility function it uses (the same for itself and for us) would be the combination of a utility function describing the goal we want it to achieve, which would be unchangeable, and the set of values it is learning after each iteration. I’m trying to understand what would be wrong with this design, because to me it looks like we would have achieved an honest AI, which is a good start.
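To make the design a bit more concrete, here is a rough toy sketch of what I have in mind, in Python. All the names (SharedUtilityAgent, goal_utility, learned_values, observe) are just mine for illustration; it’s a sketch of the idea under my own simplifying assumptions, not a claim about how one would actually build such a system.

    from dataclasses import dataclass, field
    from typing import Callable, Dict

    Outcome = str  # stand-in for however a world-state gets described

    @dataclass
    class SharedUtilityAgent:
        # Fixed "goal" term, set at design time and never modified afterwards.
        goal_utility: Callable[[Outcome], float]
        # Learned estimate of the humans' values, updated after each iteration.
        learned_values: Dict[Outcome, float] = field(default_factory=dict)

        def utility(self, outcome: Outcome) -> float:
            # The single utility function: the unchangeable goal term plus
            # whatever has been learned about human values so far.
            return self.goal_utility(outcome) + self.learned_values.get(outcome, 0.0)

        def humans_utility(self, outcome: Outcome) -> float:
            # By construction the agent cannot model disagreement: it always
            # attributes its own utility function, unchanged, to the humans.
            return self.utility(outcome)

        def observe(self, outcome: Outcome, inferred_human_value: float) -> None:
            # Update only the learned component; the goal term is untouched.
            # (How the inference itself is done is the hard, unsolved part,
            # and is elided here.)
            self.learned_values[outcome] = inferred_human_value

    # e.g. an agent whose fixed goal term just counts paperclips in an outcome
    agent = SharedUtilityAgent(goal_utility=lambda o: float(o.count("paperclip")))
    agent.observe("paperclip factory built on a park", -10.0)
    print(agent.utility("paperclip factory built on a park"))         # 1.0 - 10.0 = -9.0
    print(agent.humans_utility("paperclip factory built on a park"))  # same value, by design

The point of the sketch is just that humans_utility is literally the same function as utility, so from the agent’s point of view there is never anything to be gained by deceiving us, while observe only ever touches the learned component and leaves the fixed goal term alone.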