What do we mean by “o in Press”? That whoever wrote the list defining “Press” made a different decision that day?
Maybe we can have the definition of “Press” be a specific list (rather than the reality described by that list), with the counterfactual being that a certain stochastic process erases a particular part of that list? Not sure about that… We’d need some sort of abstract version of this (might be doable).
This still seems vulnerable to Benja’s blackmail. Let me explain:
Assume that a2 is irrelevant, that o = Pr, and that a1 is merely a yes or no to the following deal: “Give me 1 utilon from UN, and if o is not Pr, then I give 2 utilons to UN”. This is the only way of changing UN; otherwise UN = 0.
Under this phrasing, the agent is immune to blackmail. It will compute U(YES,-,-) = U(YES,Pr,-) = US(YES,Pr,-) + E(UN | Pr not in Press, YES, Pr) = US(YES,Pr,-) − 1. On the other hand, U(NO,-,-) = US(NO,Pr,-).
So if US doesn’t care about YES vs NO, it will reject the deal. Yay!
But now imagine that the deal is phrased differently:
“Give me 1 utilon from UN, and if o is not in Press, then I give 2 utilons to UN”
Here E(UN | Pr not in Press, YES, Pr) becomes −1 + 2 = +1, so the agent will accept the deal.
This may depend on how it cashes out statements like “if o is not in Press”.
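Here is a minimal Python sketch of the two computations above, under one particular way of cashing that out: the deal’s condition is checked against the literal observation, while the counterfactual erases Pr from the Press list. The names and the degenerate world model are illustrative assumptions, not part of the original setup.

```python
# Toy sketch: the agent actually observes o = "Pr", but UN is evaluated in the
# counterfactual where Pr has been erased from the Press list.

def un_deal_on_specific_obs(o, press):
    # Deal 1: "give me 1 utilon from UN, and if o is not Pr, I give 2 utilons to UN".
    return -1 + (2 if o != "Pr" else 0)

def un_deal_on_press_membership(o, press):
    # Deal 2: "give me 1 utilon from UN, and if o is not in Press, I give 2 utilons to UN".
    return -1 + (2 if o not in press else 0)

o = "Pr"                       # the observation the agent actually makes
press_counterfactual = set()   # counterfactual Press list, with Pr erased

print(un_deal_on_specific_obs(o, press_counterfactual))      # -1: U(YES) = US - 1, so reject
print(un_deal_on_press_membership(o, press_counterfactual))  # +1: the agent accepts the deal
```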
Yep, I think you’re right.

My preferred method of cashing out “do(o not in Press)” etc. is to say that observations are a tuple (p, r), where p is either Press or not Press and r is the rest of the observation (unrelated to the button). The causal intervention is then done only on p, so (desugaring everything) we define U so that its UN term is the expectation of UN under the causal intervention do(P := not Press).
Then whether the agent pays the blackmail comes down to whether E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)] is −1 or +1. This counterfactual is computed by taking the world-model indexed on the action being “YES” as it was before making the observation, overwriting P to “not Press” via causal intervention, then propagating the update and computing the expectation of UN (with respect to R). In this case, A2(-, -) is constant, so the question comes down to the value of UN(YES, (P, R), -) under the causal counterfactual assumption that P := not Press.
There are two ways this could go down. First, let’s say that R is all physical observations, including whether or not the humans seem to be pressing the button. In this case, the agent treats P as a separate platonic variable controlled by the Causal Counterfactual God rather than by actual physics, and it is perfectly capable of reasoning about the case where the humans pressed the button but P was not Press anyway. So if the deal is “Give me one util from UN, and then I’ll give 2 utils to UN if the humans don’t press the button”, the agent is perfectly capable of rejecting it, because under the counterfactual do(P := not Press) it still believes that the humans physically pressed the button (and therefore does not expect its 2 utils).
But if the deal is “Give me one util from UN, and I’ll give 2 utils to UN if P = not Press” then the agent will pay up, because under the counterfactual do(P := not Press) it clearly expects to get paid. (Or, rather, it actually cares about the counterfactual world where the humans press the button and P = not Press anyway, so you can blackmail the agent by promising great things in that world.)
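To make the contrast concrete, here is a small Monte Carlo sketch of that counterfactual computation; the world model, the probability q, and all names are my own assumptions for illustration. The intervention overrides P only, while R keeps tracking whether the humans physically press.

```python
import random

# Assumed toy world-model indexed on a1 = YES: the humans physically press the button
# with probability q; R records the physical appearance of pressing; P would normally
# mirror the physical event, but is overwritten by the intervention do(P := not Press).

q = 0.99  # assumed prior probability (before the observation) that the humans press

def expected_un(deal, samples=100_000):
    total = 0.0
    for _ in range(samples):
        humans_press = random.random() < q
        r = humans_press        # R: "the humans seem to be pressing the button"
        p = "not Press"         # do(P := not Press): P is overridden, R is left alone
        total += deal(p, r)
    return total / samples

def deal_on_physical_press(p, r):
    # "Give me one util from UN, and I'll give 2 utils to UN if the humans don't press."
    return -1 + (2 if not r else 0)

def deal_on_platonic_p(p, r):
    # "Give me one util from UN, and I'll give 2 utils to UN if P = not Press."
    return -1 + (2 if p == "not Press" else 0)

print(expected_un(deal_on_physical_press))  # about -0.98: essentially -1, the agent rejects
print(expected_un(deal_on_platonic_p))      # +1.0: the agent pays up
```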
Yep, it’s blackmailable. Nice catch.
(The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn’t believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.)
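For what it’s worth, here is a very rough sketch of one way that fix could be cashed out (this formalization is entirely an assumption on my part): the agent evaluates any offered payout as a function of the physical situation only, substituting the physical press-event for every reference to P in the stated condition.

```python
# Assumed, hacky formalization of the fix: P is never allowed to be a causal parent of
# the blackmailer's payout, so any condition stated in terms of P is re-expressed in
# terms of the physical press-event (carried in R) before the agent evaluates the deal.

def deal_on_platonic_p(p, r):
    # Stated deal: give me one util from UN; I give 2 utils to UN if P = not Press.
    return -1 + (2 if p == "not Press" else 0)

def agent_evaluation(stated_deal, r):
    # The agent acts as if the blackmailer cannot observe P: it substitutes the value P
    # would take given the physical world, ignoring any intervention on P itself.
    p_physical = "Press" if r else "not Press"
    return stated_deal(p_physical, r)

# Under do(P := not Press) while the humans physically press (r = True):
print(agent_evaluation(deal_on_platonic_p, r=True))  # -1: the P-based deal is rejected again
```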
It is relevant that the decision to blackmail (probably need a better word) is determined both by the fact that P = not Press and by the particular structure of the algorithm. This flags up the blackmail as something unusual, but I’m not sure how to safely exploit that fact… The rule “don’t take deals that only exist because of property Q of your algorithm” is too rigid, but maybe a probabilistic version of it could work?
. That sounds really difficult to set up and horribly hacky, though.)It is relevant that the decision to blackmail (probably need a better word) is determined by the fact that P=not Press, and because of the particular structure of the algorithm. This flags up the blackmail as something unusual, but I’m not sure how to safely exploit that fact… The rule “don’t take deals that only exist because of property Q of your algorithm” is too rigid, but maybe a probabilistic version of that?