My preferred method of cashing out “do(o not in Press)” etc. is to say that an observation is a tuple (p, r), where p is either Press or not Press and r is the rest of the observation (unrelated to the button). The causal intervention is then done only on p, so (desugaring everything) we define
Then whether the agent pays the blackmail comes down to whether E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)] is −1 or +1. This counterfactual is computed by taking the world-model, indexed on the action being “YES”, as it was before making the observation, overwriting P to “not Press” via causal intervention, then propagating the update and computing the expectation of UN (with respect to R). In this case A2(-, -) is constant, so the question comes down to the value of UN(YES, (P, R), -) under the causal counterfactual assumption that P := not Press.
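To make the computation concrete, here is a minimal Python sketch of the procedure just described, under toy assumptions: a two-state world-model over whether the humans physically press the button (with the agent, having already output “YES”, believing that they do), the utility function UN and the subagent A2 passed in as arguments, and do(P := not Press) implemented by overwriting the P component of every observation while leaving the distribution over R untouched. All names here are illustrative rather than taken from the original setup.

```python
# Minimal sketch of E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)]
# over a toy two-state world-model.  Names are illustrative only.

PRESS, NOT_PRESS = "Press", "not Press"

# Toy world-model, already indexed on the first action being "YES":
# the agent believes the humans are physically pressing the button.
world_model = {
    "humans_press": 1.0,
    "humans_dont_press": 0.0,
}

def expected_UN_under_do_not_press(UN, A2):
    """E[UN(YES, (P, R), A2(YES, (P, R)))] with P overwritten to not Press."""
    total = 0.0
    for state, prob in world_model.items():
        p = NOT_PRESS          # causal intervention: do(P := not Press)
        r = state              # R (the physical facts) is left untouched
        obs = (p, r)
        total += prob * UN("YES", obs, A2("YES", obs))
    return total
```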
There are two ways this could go down. First, suppose R is all physical observations, including whether or not the humans seem to be pressing the button. Then the agent treats P as if it were a separate platonic variable controlled by the Causal Counterfactual God rather than by actual physics, and it is perfectly capable of reasoning about the case where the humans pressed the button but P was not Press anyway. So if the deal is “Give me one util from UN, and then I’ll give 2 utils to UN if the humans don’t press the button”, the agent is perfectly capable of rejecting it, because under the counterfactual do(P := not Press) the agent still believes that the humans physically pressed the button (and therefore does not expect its 2 utils).
But if the deal is “Give me one util from UN, and I’ll give 2 utils to UN if P = not Press”, then the agent will pay up, because under the counterfactual do(P := not Press) it clearly expects to get paid. (Or rather, it actually cares about the counterfactual world where the humans press the button but P = not Press anyway, so you can blackmail the agent by promising great things in that world.)
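Continuing the same toy sketch, the two deals can be written as versions of UN as it looks to an agent that pays, differing only in whether the extra 2 utils key off R (the physical press) or off the intervened variable P. Again, the function names are illustrative.

```python
def A2_constant(action, obs):
    # A2(-, -) is constant in this scenario.
    return "noop"

def UN_deal_on_physical_press(action, obs, a2_out):
    # "Give me one util from UN, and then I'll give 2 utils to UN
    #  if the humans don't press the button."  Payout keys off R.
    p, r = obs
    return -1 + (2 if r == "humans_dont_press" else 0)

def UN_deal_on_P(action, obs, a2_out):
    # "Give me one util from UN, and I'll give 2 utils to UN
    #  if P = not Press."  Payout keys off the intervened variable P.
    p, r = obs
    return -1 + (2 if p == NOT_PRESS else 0)

# Under do(P := not Press) the first deal still looks like a loss,
# because the agent still believes the humans physically pressed:
print(expected_UN_under_do_not_press(UN_deal_on_physical_press, A2_constant))  # -1.0: reject
# ...while the second deal looks like a guaranteed gain, so the agent pays up:
print(expected_UN_under_do_not_press(UN_deal_on_P, A2_constant))               # 1.0: pay
```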
Yep, it’s blackmailable. Nice catch.
(The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn’t believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.)
It is relevant that the decision to blackmail (probably need a better word) is determined by the fact that P = not Press, and by the particular structure of the algorithm. This flags the blackmail as something unusual, but I’m not sure how to safely exploit that fact… The rule “don’t take deals that only exist because of property Q of your algorithm” is too rigid, but maybe a probabilistic version of that?
Yep, I think you’re right.