My preferred method of cashing out “do(o not in Press)” etc. is to say that an observation is a tuple (p, r), where p is either Press or not Press and r is the rest of the observation (unrelated to the button). The causal intervention is then done only on p, so (desugaring everything) we define
Then whether the agent pays the blackmail comes down to whether E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)] is −1 or +1. This counterfactual is computed by taking the world-model, indexed on the action being “YES”, as it was before making the observation, overwriting P to “not Press” via causal intervention, then propagating the update and computing the expectation of UN (with respect to R). In this case A2(-, -) is constant, so the question comes down to the value of UN(YES, (P, R), -) under the causal counterfactual assumption that P := not Press.
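To make the computation concrete, here is a minimal Python sketch of the procedure just described, under toy assumptions: a two-state world-model over whether the humans physically press the button (with the agent, having already output “YES”, believing that they do), the utility function UN and the subagent A2 passed in as arguments, and do(P := not Press) implemented by overwriting the P component of every observation while leaving the distribution over R untouched. All names here are illustrative rather than taken from the original setup.

```python
# Minimal sketch of E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)]
# over a toy two-state world-model.  Names are illustrative only.

PRESS, NOT_PRESS = "Press", "not Press"

# Toy world-model, already indexed on the first action being "YES":
# the agent believes the humans are physically pressing the button.
world_model = {
    "humans_press": 1.0,
    "humans_dont_press": 0.0,
}

def expected_UN_under_do_not_press(UN, A2):
    """E[UN(YES, (P, R), A2(YES, (P, R)))] with P overwritten to not Press."""
    total = 0.0
    for state, prob in world_model.items():
        p = NOT_PRESS          # causal intervention: do(P := not Press)
        r = state              # R (the physical facts) is left untouched
        obs = (p, r)
        total += prob * UN("YES", obs, A2("YES", obs))
    return total
```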
There are two ways this could go down. First, suppose R is all physical observations, including whether or not the humans seem to be pressing the button. Then the agent treats P as if it were a separate platonic variable controlled by the Causal Counterfactual God rather than by actual physics, and it is perfectly capable of reasoning about the case where the humans pressed the button but P was not Press anyway. So if the deal is “Give me one util from UN, and then I’ll give 2 utils to UN if the humans don’t press the button”, the agent is perfectly capable of rejecting it, because under the counterfactual do(P := not Press) the agent still believes that the humans physically pressed the button (and therefore does not expect its 2 utils).
But if the deal is “Give me one util from UN, and I’ll give 2 utils to UN if P = not Press”, then the agent will pay up, because under the counterfactual do(P := not Press) it clearly expects to get paid. (Or rather, it actually cares about the counterfactual world where the humans press the button but P = not Press anyway, so you can blackmail the agent by promising great things in that world.)
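Continuing the same toy sketch, the two deals can be written as versions of UN as it looks to an agent that pays, differing only in whether the extra 2 utils key off R (the physical press) or off the intervened variable P. Again, the function names are illustrative.

```python
def A2_constant(action, obs):
    # A2(-, -) is constant in this scenario.
    return "noop"

def UN_deal_on_physical_press(action, obs, a2_out):
    # "Give me one util from UN, and then I'll give 2 utils to UN
    #  if the humans don't press the button."  Payout keys off R.
    p, r = obs
    return -1 + (2 if r == "humans_dont_press" else 0)

def UN_deal_on_P(action, obs, a2_out):
    # "Give me one util from UN, and I'll give 2 utils to UN
    #  if P = not Press."  Payout keys off the intervened variable P.
    p, r = obs
    return -1 + (2 if p == NOT_PRESS else 0)

# Under do(P := not Press) the first deal still looks like a loss,
# because the agent still believes the humans physically pressed:
print(expected_UN_under_do_not_press(UN_deal_on_physical_press, A2_constant))  # -1.0: reject
# ...while the second deal looks like a guaranteed gain, so the agent pays up:
print(expected_UN_under_do_not_press(UN_deal_on_P, A2_constant))               # 1.0: pay
```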
Yep, it’s blackmailable. Nice catch.
(The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn’t believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.)
It is relevant that the decision to blackmail (probably need a better word) is determined by the fact that P = not Press, and by the particular structure of the algorithm. This flags the blackmail as something unusual, but I’m not sure how to safely exploit that fact… The rule “don’t take deals that only exist because of property Q of your algorithm” is too rigid, but maybe a probabilistic version of that?
Yep, I think you’re right.