Hmmm. The badly edited, back-of-the-envelope short version I can come up with off the top of my head goes like this:
We want an AI-in-training to, by default, do things that have as few side effects as possible. But how can we define “as few side effects as possible” in a way that doesn’t directly incentivize disaster and doesn’t make the AI totally useless? Well, what if we say that we want it to prefer to act in ways we can “undo”, and then give a sensible definition of “undo”?
Consider the counterfactual world in which the AI had taken no action at all (or hadn’t been turned on, or whatever). If things have gone horribly wrong because of the AI, then there’s probably some measure by which the AI has taken things very far away from that counterfactual world in a way that can’t be undone. How hard it is to bring the world back to the way it would have been if the AI hadn’t done the thing it did seems to be a decent first try at a measure of “impact” that a corrigible AI might want to minimize.
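To make that a bit more concrete, here is a minimal sketch in Python of what “prefer plans whose effects are cheap to undo” might look like. Everything here is a stand-in of my own invention: `simulate`, `restoration_cost`, `task_value`, and the world representation are hypothetical, and actually estimating any of them for the real world is of course the hard part.

```python
# A minimal sketch (not a worked-out proposal) of scoring a plan by how hard it
# would be to steer the post-plan world back to the counterfactual baseline in
# which the AI took no action at all. All function arguments are hypothetical.

def impact_penalty(world, plan, simulate, restoration_cost):
    """Penalty = estimated cost of restoring the world that would have existed
    had the AI done nothing."""
    baseline = simulate(world, plan=None)   # counterfactual: AI stays idle
    outcome = simulate(world, plan=plan)    # world after executing the plan
    return restoration_cost(outcome, baseline)

def choose_plan(world, plans, task_value, simulate, restoration_cost, weight=1.0):
    """Pick the plan that best trades off task success against irreversibility."""
    return max(
        plans,
        key=lambda p: task_value(simulate(world, plan=p))
                      - weight * impact_penalty(world, p, simulate, restoration_cost),
    )
```

In the plate example below, any sensible `restoration_cost` would assign the plate-smashing plan a far larger penalty than the plate-moving plan, since shattered plates can’t be returned to the counterfactual shelf.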
Let me see if I can use a specific toy example:
There’s a stack of glass plates on a shelf. You ask an AI-controlled robot to put a strawberry on the bottom plate. If the robot knocks every plate but the bottom one onto the floor and shatters them in order to make room for the strawberry, it’s created a situation that’s harder to undo than one in which it carefully puts the other plates on a table without breaking them.
Now, lots of other irreversible things are going to happen when the AI robot moves the plates without breaking them. The brain of the human watching the robot will form different memories. Light will be reflected in different ways, and the air in the room will be disturbed differently. But maintaining as much “rollback” or “undo” capacity as possible, so that it stays as easy as possible to get as close as possible to the counterfactual world in which the AI had never been given (and then acted on) the order, seems to be a desirable property regardless of whether anyone ever actually asks it to undo what it did.
(It’s important that I specify “the counterfactual world where the AI did nothing” rather than “the world as it was when the order was given” or something like that. An AI that wants to minimize every kind of change would be very happy if it could cool the world to absolute zero after putting the strawberry on the plate.)
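To spell out why the choice of baseline matters, here is a toy contrast (again my own illustration, with hypothetical `distance` and world-state arguments) between penalizing divergence from a static snapshot of the world at order time and penalizing divergence from the inaction counterfactual:

```python
# Toy contrast between two baselines. With a static snapshot baseline, ordinary
# drift of the world (people moving, air currents, clocks ticking) counts as
# "impact" the AI is incentivized to suppress. With the inaction-counterfactual
# baseline, only the divergence the AI itself caused is penalized.

def penalty_static(world_after_plan, world_at_order_time, distance):
    # Penalizes ALL change since the order, including change the AI didn't cause,
    # so the lowest-penalty policy may be to freeze the world in place.
    return distance(world_after_plan, world_at_order_time)

def penalty_counterfactual(world_after_plan, world_if_ai_did_nothing, distance):
    # Penalizes only divergence from how things would have gone anyway.
    return distance(world_after_plan, world_if_ai_did_nothing)
```

With the static baseline, change that would have happened anyway counts against the AI, which is exactly what pushes it toward “freeze everything”; the counterfactual baseline drifts along with the world, so only the AI’s own footprint is penalized.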
There are probably situations where this metric screws up, but it seems to give good results in a lot of cases. For example, it’s a lot easier to kill a person than bring them back to life once they’re dead, so a plan that has “save someone’s life” as a component will get a lower irreversibility penalty than one that has “kill a person” as a component. On the other hand, I think I still haven’t avoided the failure mode in which “put the strawberry on a plate and also make me believe, incorrectly, that there isn’t a strawberry on the plate” rates as lower impact than “put the strawberry on the plate while I watch”, so I certainly can’t say I’ve solved everything...
Another failure mode: the AI stubbornly ignores you and actually does nothing when you ask it several times to put the strawberry on the plate, and you go and do it yourself out of frustration. The AI, having predicted this, thinks “Mission accomplished”.
Yet another failure mode: the AI that carefully places the plates on a table will go on to be trusted with putting 5000 more strawberries on plates, and afterwards will be used as a competent cook in an arbitrary kitchen. By this measure, the plate-smashing AI therefore ends up with the lower impact and counts as “more corrigible”.