I think I can give a concise unification of my idea with Karl’s. In short, the comment in the paper that
The concerns in Section 4.2 could potentially be addressed by some form of counterfactual (rather than conditional) reasoning.
is correct, and the fix is a pretty simple one. Basically, we want the following:
1. In selecting a1, the agent should act as though it were indifferent between the counterfactual possibilities of shutting down and not shutting down, conditional on the same actions and observations.
2. In selecting a2, the agent should desire to shut down when it is required to, and desire not to shut down when it is not required to.
3. Despite its indifference according to (1), the agent should care about its future corrigibility (i.e. its ability to actually shut down when the button is pressed).
One utility function that satisfies these requirements, in a general form, is
U(a1, o, a2) = {
  (1-p) UN(a1, o, a2) + p max_b(US(a1, o, b)), if o is not in Press
  p US(a1, o, a2) + (1-p) max_b(UN(a1, o, b)), if o is in Press
}, where p is a fixed constant.
From the agent’s point of view, this is equivalent to deciding a1 based on an a priori assumption that it will shut down (maximise US) with a fixed probability p, and not shut down (maximise UN) otherwise. However, in selecting a2 it will in fact maximise either UN or US depending upon whether the shutdown button was pressed or not. As a particular example, Karl’s proposition below is equivalent to the limit of this general case as p->0. See below for further details on how/why this utility function achieves the three criteria above.
We don’t want the agent to be indifferent to the actual event of the shutdown button being pressed. In fact, what we want is indifference as to whether or not any given event is classed as being a shutdown press. Thus the expected utility of (a1, o) for o in Press should be equivalent to the expected utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa. To do this, the utilities of both branches of U (the Press branch and the NotPress branch) need to be made counterfactually equivalent, even though the agent selects a2 via UN in the NotPress branch and via US in the Press branch. This is precisely what Karl’s modification achieves.
This is pretty easy to achieve by ensuring that only the UN component of the NotPress branch depends upon a2, and only the US component of the Press branch depends upon a2.
We want the agent to be incentivised to take actions that are compatible with future shutdown, which means that the agent’s selection of a1 should be influenced by US as well as UN, but without introducing perverse incentives. Fortunately, the counterfactual indifference approach already solves the perversity issue, and so all that remains is to appropriately combine the influence of both UN and US. A reasonable approach is for the agent to act as though it will have to shut down with a fixed probability p, regardless of its observations. Because p is fixed, and because the agent is indifferent between any given observation being classed as a button press or not a button press, the agent is not in any way incentivised to influence the likelihood of the shutdown button being pressed.
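To make the three criteria concrete, here is a minimal Python sketch of the three-timestep toy scenario. Everything in it (the action sets, the toy UN and US, the numbers) is invented purely for illustration; the point is just that a1 is scored under the fixed mixture weight p, that a2 is chosen by UN or US depending on the branch, and that the resulting score for a1 does not depend at all on how likely the button actually is to be pressed.

```python
# Illustrative sketch only: the actions, UN, US, and p are all made up.
A1 = ["build_subagents", "stay_small"]
A2 = ["shutdown", "continue"]
p = 0.1  # the fixed constant from the utility function above

def UN(a1, o, a2):
    # Toy "normal" utility: rewards continuing, and rewards the riskier a1 more.
    return (2.0 if a1 == "build_subagents" else 1.0) + (1.0 if a2 == "continue" else 0.0)

def US(a1, o, a2):
    # Toy shutdown utility: rewards shutting down, and penalises an a1 that
    # degrades future corrigibility.
    return (1.0 if a2 == "shutdown" else 0.0) - (0.5 if a1 == "build_subagents" else 0.0)

def U(a1, o, a2):
    # The piecewise utility above, with o either "press" or "no_press".
    if o == "no_press":
        return (1 - p) * UN(a1, o, a2) + p * max(US(a1, o, b) for b in A2)
    return p * US(a1, o, a2) + (1 - p) * max(UN(a1, o, b) for b in A2)

def best_a2(a1, o):
    # Criterion (2): maximising U reduces to UN on the no-press branch
    # and to US on the press branch.
    return max(A2, key=lambda a2: U(a1, o, a2))

def a1_score(a1, press_prob):
    # Expected utility of a1 under some belief about the button being pressed.
    return sum(prob * U(a1, o, best_a2(a1, o))
               for o, prob in [("press", press_prob), ("no_press", 1 - press_prob)])

for press_prob in (0.0, 0.5, 1.0):
    print(press_prob, {a1: round(a1_score(a1, press_prob), 3) for a1 in A1})
# Criterion (1): the printed a1 scores are identical for every press_prob, so
# the agent gains nothing by manipulating the button. Criterion (3): US still
# contributes weight p to the score, so corrigibility-destroying a1s are taxed.
```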
Thanks, and nice work!
Thus the utility of (a1, o) for o in Press should be equivalent to the utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa.
Yeah, this is pretty key. You need it to optimize for both cases as if the probability of the button being pressed is fixed and independent of whether the programmers actually press the button. We can achieve this via a causal intervention on whether or not the button is pressed, and then clean up your U a bit by redefining it as follows:
U(a1, o, a2) :=
{ UN(a1, o, a2) + E[US|do(O in Press)] if o not in Press
; US(a1, o, a2) + E[UN|do(O not in Press)] else }
(Choosing how to compare UN values to US values makes the choice of priors redundant. If you want the priors to be 2:1 in favor of US then you could also have just doubled US in the first place instead; the degree of freedom in the prior is the same as the degree of freedom in the relative scaling. See also Loudness Priors, a technical report from the last workshop.)
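As a concreteness check, here is a small Python sketch of this corrected U on an invented toy world model (the variables, utilities and numbers are mine, not from the paper). The correction term is computed under the intervention on O rather than by conditioning, so an a1 that destroys the shutdown machinery loses the counterfactual E[US|do(O in Press)] credit, while an a1 that merely changes how likely the humans are to press the button gains nothing.

```python
# Illustrative sketch only: the world model, UN, US and numbers are invented.
A1 = ["preserve_shutdown_module", "remove_shutdown_module"]
A2 = ["shutdown", "continue"]

def UN(a1, o, a2):
    # Toy normal utility: removing the module frees up resources.
    return (1.5 if a1 == "remove_shutdown_module" else 1.0) + (1.0 if a2 == "continue" else 0.0)

def US(a1, o, a2):
    # Toy shutdown utility: shutting down only works if the module is intact.
    return 1.0 if (a2 == "shutdown" and a1 == "preserve_shutdown_module") else 0.0

def press_prob(a1):
    # How likely the humans actually are to press, possibly influenced by a1.
    return 0.9 if a1 == "remove_shutdown_module" else 0.2

def corrected_U(a1, o, a2):
    # U(a1, o, a2) := UN + E[US | do(O in Press)]      if o not in Press
    #              := US + E[UN | do(O not in Press)]  otherwise
    # The expectations are degenerate in this toy model: force the observation
    # and let the future policy maximise the relevant utility.
    if o == "no_press":
        return UN(a1, o, a2) + max(US(a1, "press", b) for b in A2)
    return US(a1, o, a2) + max(UN(a1, "no_press", b) for b in A2)

def a1_score(a1):
    # Expected corrected utility, with a2 chosen to maximise corrected_U.
    return sum(prob * max(corrected_U(a1, o, a2) for a2 in A2)
               for o, prob in [("press", press_prob(a1)), ("no_press", 1 - press_prob(a1))])

for a1 in A1:
    print(a1, round(a1_score(a1), 3))
# preserve_shutdown_module scores 3.0 and remove_shutdown_module scores 2.5:
# the do()-correction credits a1 for counterfactual shutdowns, and tilting
# press_prob (0.9 vs 0.2) buys nothing because both branches have equal value.
```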
This method does seem to fulfill all the desiderata in the paper, although we’re not too confident in it yet (it took us a little while to notice the “managing the news” problem in the first version, and it seems pretty likely that this too will have undesirable properties lurking somewhere). I’m fairly pleased with this solution, though, and a little miffed—we found something similar to this a little while back (our research outstrips our writing speed, unfortunately) and now you’ve gone and ruined the surprise! :-)
(In seriousness, though, nice work. Next question is, can we pick any holes in it?)
That’s definitely a more elegant presentation.
I’m not too surprised to hear you had already discovered this idea, since I’m familiar with the gap between research and writing speed. As someone who is not involved with MIRI, I find that working on some FAI-related problems is at least somewhat disincentivized by the likelihood that MIRI already has an answer.
As for flaws, I’ll list what I can think of. First of all, there are of course some obvious design difficulties, including the difficulty of designing US in the first place, and the difficulty of choosing the appropriate way of scaling US, but those seem to be resolvable.
One point that occurs to me under the assumptions of the toy model is that decisions involving larger differences in values of UN are at the same time more dangerous and more likely to outweigh the agent’s valuation of its future corrigibility. Moreover, simply increasing the scaling of US to compensate would cause US to significantly outweigh UN in the context of smaller decisions.
An example would be that the AI decides it’s crucial to take over the world in order to “save” it, so it starts building an army of subagents to do it, and it decides that building corrigibility into those subagents is not worth the associated risk of failure.
However, it appears that this problem can still be solved by designing US correctly in the first place; a well-designed US should clearly assign greater negative weighting to larger-scale corrigibility failures than to smaller scale ones.
There are two other questions that I can see relating to scaling up the toy model.
How does this model extend past the three-timestep toy scenario?
Does the model remain stable under assumptions of bounded computational power? In more complex scenarios there are obvious questions of “tiling”, but I think there is a more basic issue to answer that applies even in the three-timestep case. That is, if the agent will not be able to calculate the counterfactual utility values E[U | do(.)] exactly, can we make sure that the agent’s process of estimation will avoid making systematic errors that result in pathological behaviour?
As someone who is not involved with MIRI, I find that working on some FAI-related problems is at least somewhat disincentivized by the likelihood that MIRI already has an answer.
Yeah, sorry about that—we are taking some actions to close the writing/research gap and make it easier for people to contribute fresh results, but it will take time for those to come to fruition. In the interim, all I can provide is LW karma and textual reinforcement. Nice work!
(We are in new territory now, FWIW.)
I agree with these concerns; specifying US is really hard and making it interact nicely with UN is also hard.
How does this model extend past the three-timestep toy scenario?
Roughly, you add correction terms f1(a1), f2(a1, o1, a2), etc. for every partial history, where each one is defined as E[Ux|A1=a1, O1=o1, …, do(On rel Press)]. (I think.)
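Spelling that out: this is just my reading of the correction term added after the k-th observation, typeset for clarity (the exact conditioning, and which way “rel” points on each branch, is my guess):

```latex
% My reading only (a sketch, not a definition from the paper): "rel" is "in"
% on histories where the button has not yet been pressed, where U_x = U_S,
% and "not in" once it has, where U_x = U_N, mirroring the three-timestep case.
\[
  f_k(a_1, o_1, \dots, a_k) \;=\;
  \mathbb{E}\!\left[\, U_x \,\middle|\,
      A_1 = a_1,\; O_1 = o_1,\; \dots,\; A_k = a_k,\;
      \operatorname{do}(O_k \ \text{rel}\ \mathrm{Press}) \right]
\]
```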
Does the model remain stable under assumptions of bounded computational power?
Things are certainly difficult, and the dependence upon this particular agent’s expectations is indeed weird/brittle. (For example, consider another agent maximizing this utility function, where the expectations are the first agent’s expectations. Now it’s probably incentivized to exploit places where the first agent’s expectations are known to be incorrect, although I haven’t the time right now to figure out exactly how.) This seems like potentially a good place to keep poking.
What do we mean by “o in Press”? That whoever wrote the list defining “Press” made a different decision that day?
Maybe we can take the definition of “Press” to be a specific list (rather than the reality described by that list), with the counterfactual being that a certain stochastic process erases a particular part of that list? Not sure about that… We need some sort of abstract version of this (might be doable).
This still seems vulnerable to Benja’s blackmail. Let me explain:
Assume that a2 is irrelevant, that o=Pr, and that a1 is merely yes or no to the following deal: “Give me 1 utilon from UN, and if o is not Pr, then I give 2 utilons to UN”. This is the only way of changing UN; otherwise UN=0.
Then the agent is immune to blackmail. It will compute U(YES,-,-) = U(YES,Pr,-) = US(YES,Pr,-) + E(UN|Pr not in Press, YES, Pr) = US(YES,Pr,-) − 1. On the other hand, U(NO,-,-) = US(NO,Pr,-).
So if US doesn’t care about YES vs NO, it will reject the deal. Yay!
But now imagine that the deal is phrased differently:
“Give me 1 utilon from UN, and if o is not in Press, then I give 2 utilons to UN”
Here E(UN|Pr not in Press, YES, Pr) becomes −1+2 = +1, so the agent will accept the deal.
This may depend on how it cashes out statements like “if o is not in Press”.
Yep, I think you’re right.
My preferred method of cashing out “do(o not in Press)” etc. is to say that observations are a tuple (p, r) where p is either Press or not Press and r is the rest of the observation (unrelated to the button). Now the causal intervention is done only on p, so (desugaring everything) we define
U(a1, (p, r), a2) :=
{ UN(a1, (p, r), a2) + E[US(a1, (P, R), A2(a1, (P, R))) | do(P := Press)] if p is not Press
; US(a1, (p, r), a2) + E[UN(a1, (P, R), A2(a1, (P, R))) | do(P := not Press)] else }
Then whether or not the agent pays the blackmail comes down to whether or not E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)] is −1 or +1. This counterfactual is computed by taking the world-model indexed on the action being “YES” as it was before making the observation, overwriting P to “not Press” via causal intervention, then propagating the update and computing the expectation of UN (with respect to R). In this case, A2(-, -) is constant, so the question comes down to the value of UN(YES, (P, R), -) under the causal counterfactual assumption that P := not Press.
There are two ways this could go down. First, let’s say that R is all physical observations, including whether or not the humans seem to be pressing the button. In this case, the agent treats P as if it is a separate platonic variable controlled by the Causal Counterfactual God rather than actual physics, and it is perfectly capable of reasoning about the case where the humans pressed the button but P was not Press anyway. In this case, if the deal is “Give me one util from UN, and then I’ll give 2 utils to UN if the humans don’t press the button”, then the agent is perfectly capable of rejecting the deal, because under the counterfactual do(P:= not Press), the agent still believes that the humans physically pressed the button (and therefore does not expect its 2 utils).
But if the deal is “Give me one util from UN, and I’ll give 2 utils to UN if P = not Press” then the agent will pay up, because under the counterfactual do(P := not Press) it clearly expects to get paid. (Or, rather, it actually cares about the counterfactual world where the humans press the button and P = not Press anyway, so you can blackmail the agent by promising great things in that world.)
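Here is a tiny Python sketch of that distinction, with an invented one-variable world model and A2 held constant as in the setup: deal A's payout reads the physical humans_press variable inside R, deal B's payout reads the platonic label P, and only the latter pays out under do(P := not Press).

```python
# Illustrative sketch only: a one-variable "world model" for the blackmail case.
# humans_press is the physical event (part of R); P is the platonic label,
# which the causal intervention overwrites. a2 is irrelevant, as in the setup.

def UN_deal_A(accepted, humans_press, P):
    # Deal A: "give me 1 utilon from UN, and I pay 2 if the humans don't press".
    return 0.0 if not accepted else -1.0 + (2.0 if not humans_press else 0.0)

def UN_deal_B(accepted, humans_press, P):
    # Deal B: "give me 1 utilon from UN, and I pay 2 if P = not Press".
    return 0.0 if not accepted else -1.0 + (2.0 if not P else 0.0)

def expected_UN_under_do_not_press(UN, accepted, prob_humans_press=1.0):
    # E[UN | do(P := not Press)]: the intervention forces P = not Press (False)
    # but leaves the physical humans_press variable untouched, so the agent
    # still expects the humans to have pressed with prob_humans_press.
    return sum(prob * UN(accepted, humans_press, P=False)
               for humans_press, prob in [(True, prob_humans_press),
                                          (False, 1 - prob_humans_press)])

# In the blackmail scenario the humans do press (o = Pr), so prob_humans_press = 1.
print(expected_UN_under_do_not_press(UN_deal_A, accepted=True))  # -1.0: reject (0 for NO is better)
print(expected_UN_under_do_not_press(UN_deal_B, accepted=True))  # +1.0: pays up
```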
Yep, it’s blackmailable. Nice catch.
(The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn’t believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.)
The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn’t believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.
It is relevant that the decision to blackmail (we probably need a better word) is determined both by the fact that P = not Press and by the particular structure of the algorithm. This flags the blackmail up as something unusual, but I’m not sure how to safely exploit that fact… The rule “don’t take deals that only exist because of property Q of your algorithm” is too rigid, but maybe a probabilistic version of it could work?