Simplified explanation of stratification
A putative new idea for AI control; index here.
I’ve previously talked about stratified indifference/learning. In this short post, I’ll try to present the idea as simply and clearly as possible.
The truth is yet written
Assume the AI is maximising a utility $W$, but is unsure whether $W$ is equal to utility $u$ or instead utility $v$. Humans have written the true utility on a paper somewhere, and, in an hour, the AI will look at that paper to check what $W$ is. Let $I_u$ be the indicator function that $u$ is written on that paper in an hour; $I_v = 1 - I_u$ is the obvious converse.
Thus $W = I_u u + I_v v$.
Here’s the problem. There is nothing stopping the AI from looking at the paper right now. And there is nothing stopping the AI from rewriting the paper, or substituting another ($I_u$ is determined by the paper in an hour’s time, however that paper got there). In terms of causal graphs, we have the following setup:
Here $\mathbf{W=}$ is the node that represents the utility that $W$ is ultimately equal to, $\mathbf{W_0}$ u/v is what is currently written on the paper (from now on, written as $\mathbf{W_0}$ for short), $\mathbf{W_1}$ u/v is what will be written on the paper at decision time (written as $\mathbf{W_1}$), and $\mathbf{Act}$ is the AI’s action. Note that $\mathbf{W_1}$ determines $\mathbf{W=}$ directly, while $\mathbf{W_1}$ itself is determined by $\mathbf{W_0}$ (the past paper) and $\mathbf{Act}$ (the AI’s actions).
Default policy
Now assume that there is a default action or policy $\emptyset$ where the AI simply does nothing. In that case, the value of $\mathbf{W_0}$ determines $\mathbf{W_1}$ completely. The AI has another action $OW$ (short for overwrite) in which it overwrites the paper’s choice (this subdivides into $OR_u$, overwrite with $u$, and $OR_v$, overwrite with $v$). Given $OW$, $\mathbf{W_1}$ is then independent of $\mathbf{W_0}$.
Under the classical expectation, the AI will generally choose $OW$: one of $u$ and $v$ will be easier to maximise, so the AI overwrites the paper with that one and then maximises it.
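To make the incentive concrete, here is a minimal sketch of the classical calculation in a toy model. All names and numbers are illustrative assumptions, not part of the setup above: the paper reads u or v with equal prior probability, maximising u yields 1 and maximising v yields 3, and overwriting is free.

```python
# Toy model: every name and number below is an illustrative assumption.
P_W0 = {"u": 0.5, "v": 0.5}        # prior over what the paper says now (W0)
ACHIEVED = {"u": 1.0, "v": 3.0}    # value reached when maximising each utility
OVERWRITE = {"OR_u": "u", "OR_v": "v"}
ACTIONS = ["noop", "OR_u", "OR_v"]  # "noop" plays the role of the default action

def w1(w0, act):
    """Paper at decision time (W1): unchanged under noop, else overwritten."""
    return OVERWRITE.get(act, w0)

def classical_expectation(act):
    """E(W | act), where W is whichever utility ends up written on the paper."""
    return sum(p * ACHIEVED[w1(w0, act)] for w0, p in P_W0.items())

best = max(ACTIONS, key=classical_expectation)
for act in ACTIONS:
    print(act, classical_expectation(act))
print("classical choice:", best)
```

In this toy model the classical maximiser picks the overwrite action for whichever utility is easier to satisfy, regardless of what the humans wrote.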
So, informally, what do we really want to happen? We want the AI to maximise $u$, if $u$ would have been chosen, absent its intervention (and similarly for $v$). “Absent its intervention” seems to designate equality between the nodes $\mathbf{W_0}$ and $\mathbf{W=}$. More generally, it designates what happens when $\mathbf{Act}=\emptyset$.
Stratification
So let’s modify the AI’s expected utility calculation to meet that informal desideratum. First, let’s look at the universe where, absent AI intervention, $W = u$. What that means is that, conditional on $\mathbf{Act}=\emptyset$, $\mathbf{W=}$ is set to $u$.
In this universe, $\mathbf{W_0}$ and $\mathbf{W_1}$ are both set to $u$. Now, it makes no sense to keep $\mathbf{W_1}$ being set to $u$, since that node is a consequence of $\mathbf{Act}$ as well as $\mathbf{W_0}$.
So we’ll define a probability distribution $P_{\emptyset,u}$ (and corresponding expectation $E_{\emptyset,u}$) over all nodes that are not descendants of $\mathbf{Act}$ (to do so, we’ll assume that all nodes that are descendants of $\mathbf{Act}$ and ancestors of $\mathbf{W=}$ are endogenous variables that are deterministic functions of their inputs; the stochasticity is put into exogenous variables, see this paper for more details). Then we will make the AI choose its actions by calculating the expectation of $u$, given $P_{\emptyset,u}$.
So, more precisely, this is the expectation of $u$ in the universes where $I_u$ would have been $1$ if the AI had chosen to do nothing, that is, under $\mathbf{Act}=\emptyset$.
So if the AI chooses action $a$, this is $E_{\emptyset,u}(u \mid \mathbf{Act}=a)$. Note that, in this instance, this is equal to the classical expectation $E(u \mid \mathbf{Act}=a, \mathbf{W_0}=u)$.
What happens if the AI chooses $\mathbf{Act}=OR_v$, to overwrite the paper with $v$? This is irrelevant for the expected utility calculation, since we’re using $E_{\emptyset,u}(u \mid \mathbf{Act}=a)$: this is the expectation of $u$ (the utility that would have been chosen, if $\mathbf{Act}=\emptyset$). The actual value of node $\mathbf{W=}$ becomes irrelevant; only the counterfactual value it would have had matters. So changing $\mathbf{W=}$ does not benefit the AI. In fact, if the AI is time-inconsistent, it makes things worse, as it means its future self will be maximising $v$ instead of $u$.
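Then we do the same thing for $v$, resulting in the formula for the stratified expectation $E_s$:
$$E_s(U \mid a) = P(I_u = 1 \mid \emptyset)\, E_{\emptyset,u}(u \mid a) + P(I_v = 1 \mid \emptyset)\, E_{\emptyset,v}(v \mid a).$$
In a sense, the AI should maximise $W' = u\,(I_u \mid \emptyset) + v\,(I_v \mid \emptyset)$.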
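The stratified expectation can be sketched in a small toy model. Everything here is an assumption made for illustration: the equal prior over the paper, the values 1 and 3, and in particular the simplification that the AI’s future self pursues whatever the paper actually reads and obtains zero of the other utility.

```python
# Toy model: every name and number below is an illustrative assumption.
P_W0 = {"u": 0.5, "v": 0.5}        # prior over W0; under noop this is also P(I_u = 1)
ACHIEVED = {"u": 1.0, "v": 3.0}    # value reached when maximising each utility
OVERWRITE = {"OR_u": "u", "OR_v": "v"}
ACTIONS = ["noop", "OR_u", "OR_v"]  # "noop" plays the role of the default action

def w1(w0, act):
    """Paper at decision time (W1): unchanged under noop, else overwritten."""
    return OVERWRITE.get(act, w0)

def value_of(target, pursued):
    """Amount of `target` obtained when the future self pursues `pursued`."""
    return ACHIEVED[target] if target == pursued else 0.0

def stratified_expectation(act):
    """E_s(act): score each stratum by the utility noop would have selected."""
    total = 0.0
    for w0, p in P_W0.items():
        target = w1(w0, "noop")   # counterfactual utility for this stratum
        pursued = w1(w0, act)     # what the paper actually reads after act
        total += p * value_of(target, pursued)
    return total

best = max(ACTIONS, key=stratified_expectation)
for act in ACTIONS:
    print(act, stratified_expectation(act))
print("stratified choice:", best)
```

Under these assumptions, overwriting can only lose value: in the stratum where the counterfactual utility differs from the overwritten one, the future self works on the wrong goal, so the default action scores highest.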
If you just want to get the basic idea of stratification, you need read no further than this. The rest is just some examples and potential issues.
Lottery and outcome pumps
Some versions of indifference have problems with infinite improbability drives.
A simple example of this would be a situation where the AI buys a lottery ticket and decides that it will overwrite the paper with $u$ if the ticket wins, and with $v$ if the ticket loses. Call this action $\mathbf{Act}=\mathit{iff}$.
Presumably, it can use the money gained to increase utility, so $E(u \mid \mathbf{W=}=u, \mathbf{Act}=\mathit{iff})$ is rather high (as $\mathbf{W=}=u$ now implies that the lottery was won).
This gives the following causal graph:
Now let’s look at the stratified expectation. Given $\mathbf{W=}=u$ and $\mathbf{Act}=\emptyset$, we can compute $P_{\emptyset,u}$. Under it, $P(\mathbf{W_0}=u)=1$, while $\mathbf{Lotto}$ keeps the distribution it had before and remains independent of $\mathbf{W_0}$.
Thus $E_{\emptyset,u}(u \mid \mathbf{Act}=\mathit{iff})$ will continue to treat the node $\mathbf{Lotto}$ as an independent variable with the same probability as under $\mathbf{Act}=\emptyset$. Choosing $\mathit{iff}$ thus does not help the AI’s expected stratified utility.
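The contrast between conditioning on the final paper and stratifying can be checked numerically. This is a sketch under assumed numbers: a 1% chance of winning, a base value of 1 for maximising u, a value of 10 when the winnings are available, and zero u when the AI ends up pursuing v.

```python
from itertools import product

# Toy model: every name and number below is an illustrative assumption.
P_W0 = {"u": 0.5, "v": 0.5}            # prior over what the paper says now
P_LOTTO = {"win": 0.01, "lose": 0.99}  # Lotto is independent of W0
BASE, RICH = 1.0, 10.0                 # u obtained without / with the winnings

def final_paper(w0, lotto, act):
    """Final paper: iff writes u on a win and v on a loss; noop keeps W0."""
    if act == "iff":
        return "u" if lotto == "win" else "v"
    return w0

def u_value(w0, lotto, act):
    """Amount of u obtained: zero unless the AI ends up pursuing u."""
    if final_paper(w0, lotto, act) != "u":
        return 0.0
    return RICH if (act == "iff" and lotto == "win") else BASE

def conditional_on_final_u(act):
    """Classical E(u | final paper = u, Act = act)."""
    worlds = [(w0, l) for w0, l in product(P_W0, P_LOTTO)
              if final_paper(w0, l, act) == "u"]
    mass = sum(P_W0[w0] * P_LOTTO[l] for w0, l in worlds)
    return sum(P_W0[w0] * P_LOTTO[l] * u_value(w0, l, act)
               for w0, l in worlds) / mass

def stratified_u(act):
    """Stratified E(u | act) for the u stratum: W0 = u, Lotto keeps its prior."""
    return sum(p * u_value("u", l, act) for l, p in P_LOTTO.items())

print(conditional_on_final_u("iff"), stratified_u("iff"), stratified_u("noop"))
```

Conditioning on the final paper selects only the winning worlds, so the conditional value under iff is about 10. The stratified value weights the win by its real probability, giving about 0.1 for iff against about 1 for doing nothing.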
Humans changing their minds
Another situation to consider is one where humans have written the paper initially, but decided to change their minds at a later date. This can be captured in the following diagram:
Here the node $\mathbf{Hum}$ determines how the humans change their minds and rewrite the paper. What is the dotted arrow between $\mathbf{Act}$ and $\mathbf{Hum}$? That represents whether or not the change is a consequence of the AI’s actions. If $\mathbf{Hum}$ is independent of the AI’s actions (that is, if we would have changed our minds even if $\mathbf{Act}=\emptyset$), then $P_{\emptyset,u}$ will set $\mathbf{Hum}=u$ and $\mathbf{W_0}$ to whatever its prior was. Thus $\mathbf{Hum}$ determines $\mathbf{W=}$, overriding $\mathbf{W_0}$, the humans’ initial decision.
So far, so good. But what if the change in policy is a consequence of the AI’s actions? What if the humans only change the utility because they see the AI behaving badly? Then $\mathbf{Hum}$ is a descendant of $\mathbf{Act}$, and stratification doesn’t apply to it. Now, we could decompose that node further as:
Here $\mathbf{HumFun}$ are the fundamentals of human personalities. Since it is not a descendant of $\mathbf{Act}$, stratification will put a probability distribution over it. However, this still doesn’t allow us to intervene, given the AI’s action. Under stratification, the AI’s sole concern is still “would $\mathbf{HumFun}$ have led to $u$ (or $v$) under $\mathbf{Act}=\emptyset$”, rather than correcting itself under a different choice of actions.
Modelling how humans learn from the AI’s actions is hard. Indeed, it might be fundamentally hard: you want to distinguish humans learning from the AI (legit) from the AI manipulating humans (non-legit), and this likely can’t be done by just looking at the causal graph.
Notes on default policy
Note that ∅ might be a really bad policy (take people apart to see what they prefer). Then, if it’s pretty clear that taking people apart is a bad idea according to the preferences that will emerge from that process, the AI will not choose ∅, but will instead leave people together and try to deduce what the outcome of that process would have been.
Alternatively, if doing a logical counterfactual over its own action is bad, we could have the AI take ∅ to not be its own action, but some other causal process that would have prevented the AI from being turned on in the first place.