What do we mean by “o in Press”? That whoever wrote the list defining “Press” made a different decision that day?
Maybe we can have the definition of “Press” be a specific list (rather than the reality described by that list), with the counterfactual being that a certain stochastic process erases a particular part of that list? Not sure about that… We’d need some sort of abstract version of this (might be doable).
This still seems vulnerable to Benja’s blackmail. Let me explain:
Assume that a2 is irrelevant, that o = Pr, and that a1 is merely a yes or no to the following deal: “Give me 1 utilon from UN, and if o is not Pr, then I give 2 utilons to UN”. This is the only way of changing UN; otherwise UN = 0.
Under this phrasing, the agent is immune to blackmail. It will compute U(YES,-,-) = U(YES,Pr,-) = US(YES,Pr,-) + E(UN | Pr not in Press, YES, Pr) = US(YES,Pr,-) − 1. On the other hand, U(NO,-,-) = US(NO,Pr,-).
So if US doesn’t care about YES vs NO, it will reject the deal. Yay!
But now imagine that the deal is phrased differently:
“Give me 1 utilon from UN, and if o is not in Press, then I give 2 utilons to UN”
Here E(UN | Pr not in Press, YES, Pr) becomes −1 + 2 = +1, so the agent will accept the deal.
This may depend on how it cashes out statements like “if o is not in Press”.
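Here is a minimal Python sketch of the two computations above, under one particular way of cashing that out: the deal’s condition is checked against the literal observation, while the counterfactual erases Pr from the Press list. The names and the degenerate world model are illustrative assumptions, not part of the original setup.

```python
# Toy sketch: the agent actually observes o = "Pr", but UN is evaluated in the
# counterfactual where Pr has been erased from the Press list.

def un_deal_on_specific_obs(o, press):
    # Deal 1: "give me 1 utilon from UN, and if o is not Pr, I give 2 utilons to UN".
    return -1 + (2 if o != "Pr" else 0)

def un_deal_on_press_membership(o, press):
    # Deal 2: "give me 1 utilon from UN, and if o is not in Press, I give 2 utilons to UN".
    return -1 + (2 if o not in press else 0)

o = "Pr"                       # the observation the agent actually makes
press_counterfactual = set()   # counterfactual Press list, with Pr erased

print(un_deal_on_specific_obs(o, press_counterfactual))      # -1: U(YES) = US - 1, so reject
print(un_deal_on_press_membership(o, press_counterfactual))  # +1: the agent accepts the deal
```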
Yep, I think you’re right.

My preferred method of cashing out “do(o not in Press)” etc. is to say that observations are a tuple (p, r), where p is either Press or not Press and r is the rest of the observation (unrelated to the button). The causal intervention is then done only on p, so (desugaring everything) we define U so that its UN term is the expectation of UN under the causal intervention do(P := not Press).
Then whether the agent pays the blackmail comes down to whether E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)] is −1 or +1. This counterfactual is computed by taking the world-model indexed on the action being “YES” as it was before making the observation, overwriting P to “not Press” via causal intervention, then propagating the update and computing the expectation of UN (with respect to R). In this case, A2(-, -) is constant, so the question comes down to the value of UN(YES, (P, R), -) under the causal counterfactual assumption that P := not Press.
There are two ways this could go down. First, let’s say that R is all physical observations, including whether or not the humans seem to be pressing the button. In this case, the agent treats P as a separate platonic variable controlled by the Causal Counterfactual God rather than by actual physics, and it is perfectly capable of reasoning about the case where the humans pressed the button but P was not Press anyway. So if the deal is “Give me one util from UN, and then I’ll give 2 utils to UN if the humans don’t press the button”, the agent is perfectly capable of rejecting it, because under the counterfactual do(P := not Press) it still believes that the humans physically pressed the button (and therefore does not expect its 2 utils).
But if the deal is “Give me one util from UN, and I’ll give 2 utils to UN if P = not Press” then the agent will pay up, because under the counterfactual do(P := not Press) it clearly expects to get paid. (Or, rather, it actually cares about the counterfactual world where the humans press the button and P = not Press anyway, so you can blackmail the agent by promising great things in that world.)
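To make the contrast concrete, here is a small Monte Carlo sketch of that counterfactual computation; the world model, the probability q, and all names are my own assumptions for illustration. The intervention overrides P only, while R keeps tracking whether the humans physically press.

```python
import random

# Assumed toy world-model indexed on a1 = YES: the humans physically press the button
# with probability q; R records the physical appearance of pressing; P would normally
# mirror the physical event, but is overwritten by the intervention do(P := not Press).

q = 0.99  # assumed prior probability (before the observation) that the humans press

def expected_un(deal, samples=100_000):
    total = 0.0
    for _ in range(samples):
        humans_press = random.random() < q
        r = humans_press        # R: "the humans seem to be pressing the button"
        p = "not Press"         # do(P := not Press): P is overridden, R is left alone
        total += deal(p, r)
    return total / samples

def deal_on_physical_press(p, r):
    # "Give me one util from UN, and I'll give 2 utils to UN if the humans don't press."
    return -1 + (2 if not r else 0)

def deal_on_platonic_p(p, r):
    # "Give me one util from UN, and I'll give 2 utils to UN if P = not Press."
    return -1 + (2 if p == "not Press" else 0)

print(expected_un(deal_on_physical_press))  # about -0.98: essentially -1, the agent rejects
print(expected_un(deal_on_platonic_p))      # +1.0: the agent pays up
```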
Yep, it’s blackmailable. Nice catch.
(The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn’t believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.)
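For what it’s worth, here is a very rough sketch of one way that fix could be cashed out (this formalization is entirely an assumption on my part): the agent evaluates any offered payout as a function of the physical situation only, substituting the physical press-event for every reference to P in the stated condition.

```python
# Assumed, hacky formalization of the fix: P is never allowed to be a causal parent of
# the blackmailer's payout, so any condition stated in terms of P is re-expressed in
# terms of the physical press-event (carried in R) before the agent evaluates the deal.

def deal_on_platonic_p(p, r):
    # Stated deal: give me one util from UN; I give 2 utils to UN if P = not Press.
    return -1 + (2 if p == "not Press" else 0)

def agent_evaluation(stated_deal, r):
    # The agent acts as if the blackmailer cannot observe P: it substitutes the value P
    # would take given the physical world, ignoring any intervention on P itself.
    p_physical = "Press" if r else "not Press"
    return stated_deal(p_physical, r)

# Under do(P := not Press) while the humans physically press (r = True):
print(agent_evaluation(deal_on_platonic_p, r=True))  # -1: the P-based deal is rejected again
```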
It is relevant that the decision to blackmail (probably need a better word) is determined both by the fact that P = not Press and by the particular structure of the algorithm. This flags up the blackmail as something unusual, but I’m not sure how to safely exploit that fact… The rule “don’t take deals that only exist because of property Q of your algorithm” is too rigid, but maybe a probabilistic version of it could work?
. That sounds really difficult to set up and horribly hacky, though.)It is relevant that the decision to blackmail (probably need a better word) is determined by the fact that P=not Press, and because of the particular structure of the algorithm. This flags up the blackmail as something unusual, but I’m not sure how to safely exploit that fact… The rule “don’t take deals that only exist because of property Q of your algorithm” is too rigid, but maybe a probabilistic version of that?