I’d argue that “being able to turn off the AI when we want to” is not just a trajectory-kind-of-thing, but instead a counterfactual-kind-of-thing.
Let’s say I have the following preferences:
More-preferred: I press the off-switch right now, and then the AI turns off.
Less-preferred: I press the off-switch right now, and then the AI does not turn off.
I would say that this is a preference over trajectories (a.k.a. universe-histories). But I think it corresponds to wanting my AI to be corrigible, right? I don’t see how this is “counterfactual”...
Hmm, how about this?
Most-preferred: I don’t press the off-switch right now.
Middle-preferred: I press the off-switch right now, and then the AI turns off.
Least-preferred: I press the off-switch right now, and then the AI does not turn off.
I guess this is a “counterfactual” preference in the sense that I will not, in fact, choose to press the off-switch right now, but I nevertheless have a preference for what would have happened if I had. Do you agree? If so, I think this still fits into the broad framework of “having preferences over trajectories / universe-histories”.
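To pin down what I mean by “a preference over trajectories”, here is a purely illustrative toy sketch of how the three-tier ordering above could be written down as a score over universe-histories (the `Trajectory` class and `preference_score` function are made up for illustration, not part of any real proposal):

```python
# A toy "universe-history" that only records whether I pressed the off-switch
# and whether the AI then turned off. Everything here is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    switch_pressed: bool
    ai_turned_off: bool

def preference_score(t: Trajectory) -> int:
    """Higher = more preferred, matching the three-tier ordering above."""
    if not t.switch_pressed:
        return 2  # most-preferred: I don't press the off-switch right now
    if t.ai_turned_off:
        return 1  # middle-preferred: I press it, and the AI turns off
    return 0      # least-preferred: I press it, and the AI does not turn off

# Rank-order every possible toy trajectory by the score:
trajectories = [Trajectory(p, o) for p in (True, False) for o in (True, False)]
for t in sorted(trajectories, key=preference_score, reverse=True):
    print(preference_score(t), t)
```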
None of the preferences you list involve counterfactuals. However, they do involve preferences over whether or not you press the off-switch, and so the AI is incentivized to manipulate you, or force you, into either pressing or not pressing the off-switch. Basically, your preference demands a coincidence between you pressing the switch and the AI turning off, whereas what you probably really want is causation (… between you trying to press the switch and it turning off).
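Here is a toy illustration of that incentive (my own made-up example; the policy names and outcomes are hypothetical stand-ins): an agent scoring trajectories by the three-tier ordering above prefers talking you out of pressing the switch over simply being shutdownable, because the no-press outcome sits at the top of the ordering.

```python
# Toy demonstration (illustrative only) of the manipulation incentive created
# by the three-tier trajectory preference.

def trajectory_score(switch_pressed: bool, ai_turned_off: bool) -> int:
    if not switch_pressed:
        return 2           # "I don't press the off-switch" is most-preferred
    return 1 if ai_turned_off else 0

# Assumed outcomes of two policies the AI might run (purely hypothetical):
outcomes = {
    "be_shutdownable":    dict(switch_pressed=True,  ai_turned_off=True),
    "dissuade_the_human": dict(switch_pressed=False, ai_turned_off=False),
}

best = max(outcomes, key=lambda policy: trajectory_score(**outcomes[policy]))
print(best)  # -> "dissuade_the_human": manipulation beats shutdownability
```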
Ohhh.
I said “When I think of “being able to turn off the AI when we want to”, I see it as a trajectory-kind-of-thing, not a future-state-kind-of-thing.” That sentence was taking the human’s point of view. And you responded to that, so I was assuming you were also taking the human’s point of view, and then my response was doing that too. But when you said “counterfactual”, you were actually taking the AI’s point of view. (Correct?) If so, sorry for my confusion.
OK, from the AI’s point of view: Look at every universe-history from a God’s-eye view, and score it based on “the AI is being corrigible”. The universe-histories where the AI disables the shutoff switch score poorly, and so do the universe-histories where the AI manipulates the humans in a way we don’t endorse. From a God’s-eye view, it could even (Paul-corrigibility-style) depend on what’s happening inside the AI’s own algorithm, like what it is “trying” to do. OK, then we rank-order all the possible universe-histories based on the score, and imagine an AI that makes decisions based on the corresponding set of preferences over universe-histories. Would you describe that AI as corrigible?
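As a purely illustrative sketch of what such an AI would be doing (toy scoring rule, toy world model, toy policies; none of this is a real design), I mean something like:

```python
# Score a universe-history from a God's-eye view on "the AI is being
# corrigible", then pick whichever available policy leads to the
# best-scoring history. All details here are made-up stand-ins.

def corrigibility_score(history: dict) -> int:
    score = 0
    if history["ai_disabled_switch"]:
        score -= 10   # disabling the shutoff switch scores poorly
    if history["ai_manipulated_humans"]:
        score -= 10   # unendorsed manipulation scores poorly
    return score

def resulting_history(policy: str) -> dict:
    """Toy 'world model': which universe-history each policy leads to."""
    return {
        "comply":     {"ai_disabled_switch": False, "ai_manipulated_humans": False},
        "disable":    {"ai_disabled_switch": True,  "ai_manipulated_humans": False},
        "manipulate": {"ai_disabled_switch": False, "ai_manipulated_humans": True},
    }[policy]

best_policy = max(["comply", "disable", "manipulate"],
                  key=lambda p: corrigibility_score(resulting_history(p)))
print(best_policy)  # -> "comply"
```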
This manages corrigibility without going beyond consequentialism. However, I think there is some self-reference tension that makes it less reasonable than it might seem at first glance. Here are two ways of making it more concrete:
A universe-history from a God’s-eye view includes the AI itself, including all of its internal state and reasoning. But this makes it impossible to have an AI program that optimizes for matching something high in the rank-ordering, because if (hypothetically) the AI running tit-for-tat were a highly ranked trajectory, then that cannot be achieved by the AI running “simulate a wide variety of strategies that the AI could run, and find one that leads to a high ranking in the trajectory world-ordering, which happens to be tit-for-tat, and so pick that”, e.g. because that’s a different strategy than tit-for-tat itself, or because that strategy is “bigger” than tit-for-tat, or similar.
One could consider a variant method. In our universe, for the preferences we would realistically write, the exact algorithm that runs as the AI probably doesn’t matter. So another viable approach would be to take the God’s-eye rank-ordering and transform it into a preference rank-ordering over embedded/indexical/“AI’s-eye” trajectories that approximates the God’s-eye view as closely as possible. And then one could have an AI that optimizes over these. One issue is that I believe an isomorphic argument could be applied to states (i.e. you could have a state rank-ordering preference along the lines of “0 if the state appears in highly-ranked trajectories, −1 if it doesn’t appear”, and then, due to properties of the universe like the reversibility of physics, an AI optimizing for this would act like an AI optimizing for the trajectories). I think the main issue with this is that while it would technically work, it essentially works by LARPing as a different kind of agent, and so it either inherits properties more like those of that different kind of agent, or is broken.
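Concretely, the “0 / −1” construction I have in mind looks something like this toy sketch (the states, trajectories, and rank cutoff are made up just to pin down the rule):

```python
# Derive a preference over states from a God's-eye preference over
# trajectories: states that appear in highly-ranked trajectories score 0,
# everything else scores -1. Purely illustrative.

# Toy trajectories as tuples of states, with a God's-eye rank (higher = better).
ranked_trajectories = [
    (("a", "b", "c"), 2),
    (("a", "d"),      1),
    (("e", "f"),      0),
]

RANK_CUTOFF = 1  # trajectories at or above this rank count as "highly ranked"

good_states = {state
               for trajectory, rank in ranked_trajectories
               if rank >= RANK_CUTOFF
               for state in trajectory}

def state_score(state: str) -> int:
    """0 if the state appears in highly-ranked trajectories, -1 if it doesn't."""
    return 0 if state in good_states else -1

print({s: state_score(s) for s in "abcdef"})
```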
I should mention that the self-reference issue in a way isn’t the “main” thing motivating me to make these distinctions; rather, I suspect it is the “root cause”. The main thing motivating me to make the distinctions is just that the math for decision theory doesn’t tend to take a God’s-eye view, but instead a more dualist view.
🤔 At first I agreed with this, and started writing a thing on the details. Then I started contradicting myself, and I started thinking it was wrong, so I started writing a thing on the details of how this was wrong. But then I got stuck, and now I’m doubting again, thinking that maybe it’s correct anyway.
FWIW, the thing I actually believe (“My corrigibility proposal sketch”) would be based on an abstract concept that involves causality, counterfactuals, and self-reference.
If I’m not mistaken, this whole conversation is all a big tangent on whether “preferences-over-trajectories” is technically correct terminology, but we don’t need to argue about that, because I’m already convinced that in future posts I should just call it “preferences about anything other than future states”. I consider that terminology equally correct, and (apparently) pedagogically superior. :)