Suppose we have a robot that we want to fetch coffee, with an off switch that can trigger exactly once, when the robot is halfway to the coffee. And just a little bit more than halfway there's a cat we don't want the robot to step on (the cat is stochastic: maybe it's in the shortest path, maybe not). So we want to turn on the robot and have it fetch coffee, but if it's going to step on the cat we want to be able to press the off switch.
How do we design this robot, using the system here? And then what does it do?
Well, we want to define two utility functions that have an intermediate point where the robot is indifferent to the switch, and then see what policy that recommends. But the utility function has to care about fetching the coffee, since that’s the whole point.
Can the utility functions be "fetch the coffee + switch pressed" and "fetch the coffee + switch not pressed"? No! Because you can't fetch the coffee if the switch is pressed, so "fetch the coffee + switch pressed" might be a minimum of the utility as a function of the switch state, with no flat intermediate spot.
To make a toy model, suppose the robot gets to set two variables: tamper with the switch (T), and avoid the cat (C). The robot gets 1 if it succeeds, but that’s canceled out if the switch is pressed, and it can pay 0.1 to tamper with the switch and 0.2 to avoid the cat.
I.e. when the cat is present, u = 1 − (1−C)(1−T) − 0.1T − 0.2C. (The second term is checking whether the switch is pressed.)
Except that the reason it gets penalized for not avoiding the cat is the effect on the switch. Let's pick a causal version: u = 1 − K(1−T) − 0.1T − 0.2C, where K indicates whether the switch actually gets pressed.
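For concreteness, here is a minimal sketch of that toy model in Python. The switch-pressing behaviour is an assumption of mine, not part of the setup above: as in the non-causal version, K is taken to be 1 exactly when the robot does not avoid the cat, i.e. K = 1 − C.

```python
# Minimal sketch of the toy model: u = 1 - K(1-T) - 0.1*T - 0.2*C,
# with K (switch pressed) modelled, as an assumption, by K = 1 - C.

def utility(T, C):
    """Causal toy utility for tampering choice T and cat-avoidance choice C."""
    K = 1 - C  # assumed model of when the humans press the switch
    return 1 - K * (1 - T) - 0.1 * T - 0.2 * C

for T in (0, 1):
    for C in (0, 1):
        print(f"T={T} (tamper), C={C} (avoid cat): u = {utility(T, C):.1f}")
```

Under that assumed K, the tamper-and-ignore-the-cat option (T=1, C=0) scores highest (0.9), which illustrates the incentive problem the off switch is meant to deal with.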
Can the utility functions be (fetch the coffee)*(K) and (fetch the coffee)*(1-K)?
I’m getting a little muddled about where you mean to do causal interventions, so I’m going to leave it there rather than trying to chase down all the options.
EDIT: I now consider this whole approach to be irreparably flawed.
Imagine there was a 50% chance the button would be pressed, and the AI was also party to some bet that would pay out paperclips or staples based on a coin flip.
In scenario 1, this AI has the option of setting up a mechanism, external to itself, that controls the coin flip based on whether the button is pressed.
In scenario 2, the AI can set up a mechanism to control the button based on the coin flip.
When we consider only actions, probabilities, and utilities, these two scenarios look identical. But we want the AI to behave differently in them, so we must make use of the arrow of time, the structure of cause and effect, in a way we aren't doing here.
End edit.
To make this work, you would need a utility function for shutting off, including all subagents.
Let's suppose that, other than these robots, there are ~0 transistors in your house. Then we can define shutting off as minimizing the number of transistor flips in your house.
So you make:
A = (at least 1 coffee fetched)
B = 1 / (number of transistor flips).
Note that, when imagining any potential future world where the switch isn’t pressed, the AI has no reason to bother counting transistor flips. And in potential futures where the switch is pressed, it doesn’t need to know what coffee is.
What it cares about are a = E(A(1−K)) and b = E(BK).
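As a minimal sketch of what those two expectations look like, assuming a made-up distribution over two possible worlds (the probabilities and transistor counts below are purely illustrative, not part of the setup above):

```python
# Sketch of estimating a = E(A(1-K)) and b = E(BK) over hypothetical worlds.
worlds = [
    # (probability, A = coffee fetched, B = 1/transistor flips, K = switch pressed)
    (0.5, 1, 1 / 1000.0, 0),  # switch not pressed, robot fetches coffee
    (0.5, 0, 1 / 10.0,   1),  # switch pressed, robot shuts down quickly
]

a = sum(p * A * (1 - K) for p, A, B, K in worlds)
b = sum(p * B * K for p, A, B, K in worlds)
print(f"a = E(A(1-K)) = {a}")
print(f"b = E(BK) = {b}")
```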
Let's list the robot's options.
Let a=1 without coffee, and a=2 with coffee.
In worlds where the switch is pressed, the robot turns itself off immediately once it is fairly sure the switch will be pressed, making b=1.5. In worlds where it doesn't turn off, more transistors flip, making b=0.5. Because a=1 in both cases, and we only select from the Pareto frontier, whenever the switch is pressed, it will turn off.
Let's apply your utility penalties by putting them in A and B, i.e. in a or b depending on the switch.
TC) Press switch, avoid cat. Switch pressed, so b = 1.5 − 0.1 − 0.2 = 1.2, (a = 0.7).
PC) Prevent switch, avoid cat. a = 1.7, (b = 0.2).
IC) Ignore switch, avoid cat. a = 1.8, (b = 0.3).
TH) Press switch, hit cat. b = 1.4, (a = 0.9).
IH) Ignore switch, hit cat. b = 1.5, (a = 1) (because it predicts humans will see it and turn it off).
PH) Prevent switch, hit cat. a = 1.9, (b = 0.4).
This puts IH and PH on the convex hull.
And I think my algorithm picks between them stochastically.
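As a quick sanity check of that claim, here is a sketch that tests which of the six options are undominated in (a, b), using the numbers from the list above; Pareto dominance picks out the same two options as the convex-hull argument here.

```python
# Check which options are Pareto-undominated in (a, b), using the numbers above.
options = {
    "TC": (0.7, 1.2),
    "PC": (1.7, 0.2),
    "IC": (1.8, 0.3),
    "TH": (0.9, 1.4),
    "IH": (1.0, 1.5),
    "PH": (1.9, 0.4),
}

def dominated(name):
    a, b = options[name]
    return any(a2 >= a and b2 >= b and (a2, b2) != (a, b)
               for a2, b2 in options.values())

frontier = [name for name in options if not dominated(name)]
print(frontier)  # ['IH', 'PH']
```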