Background 1: Preferences-over-future-states (a.k.a. consequentialism) vs Preferences-over-trajectories
The post Coherent decisions imply consistent utilities (Eliezer Yudkowsky, 2017) explains how, if an agent has preferences over future states of the world, they should act like a utility-maximizer (with utility function defined over future states of the world). If they don’t act that way, they will be less effective at satisfying their own preferences; they would be “leaving money on the table” by their own reckoning. And there are externally-visible signs of agents being suboptimal in that sense; I’ll go over an example in a second.
By contrast, the post Coherence arguments do not entail goal-directed behavior (Rohin Shah, 2018) notes that, if an agent has preferences over trajectories (a.k.a. universe-histories), and acts optimally with respect to those preferences (acts as a utility-maximizer whose utility function is defined over trajectories), then they can display any external behavior whatsoever. In other words, there’s no externally-visible behavioral pattern which we can point to and say “That’s a sure sign that this agent is behaving suboptimally, with respect to their own preferences.”.
I don’t think it’s accurate to call preferences over trajectories non-consequentialist; after all, they’re still preferences over consequences in the physical world. Indeed, Alex Turner found that preferences over trajectories would lead to even more power-seeking than preferences over states. If anything, preferences over trajectories are the purest form of consequentialism. (In fact, I’ve been working on a formal model of classes of preferences where this is precisely true.)
Maybe I’m being thickheaded, but I’m just skeptical of this whole enterprise. I’m tempted to declare that “preferences purely over future states” are just fundamentally counter to corrigibility. When I think of “being able to turn off the AI when we want to”, I see it as a trajectory-kind-of-thing, not a future-state-kind-of-thing. And if we humans in fact have some preferences over trajectories, then it’s folly for us to build AIs that purely have preferences over future states.
I’d argue that “being able to turn off the AI when we want to” is not just a trajectory-kind-of-thing, but instead a counterfactual-kind-of-thing. After all, it’s not about whether the trajectory contains you-turning-off-the-AI or not; it’s about whether, well, it would if you wanted to. The “if you wanted to” has a causal element, closely related to free will being about your causal control over the future; and talking about causality requires talking about counterfactuals, saying what would happen if your choices or desires were perturbed. So I’d argue that we should seek a causal, counterfactual utility function.
Perhaps importantly, unlike preferences over trajectories, preferences over counterfactual trajectories are not consequentialist. So I agree with the message of your post that consequentialism and corrigibility are incompatible. (In fact, I’m also starting to suspect that consequentialism and human values are incompatible; e.g. it seems difficult to specify values such as “liberty” in purely consequentialist terms. Though they might technically end up consequentialist due to differences between how we might specify utility functions for AIs and for morality.)
Another way I might frame this is that corrigibility isn’t just about what actions we want the AI to choose, it’s about what policies we want the AI to choose.
For any policy, of course, you can always ask “What actions would this policy recommend in the real world? So, wouldn’t we be happy if the AI just picked those?” Or “What utility functions over universe-histories would produce this best sequence of actions? So, wouldn’t one of those be good?”
And if you could compute those in some way other than thinking about what we want from the policy that the AI chooses to implement, be my guest. But my point is that corrigibility is a grab-bag of different things people want from AI, and some of those things are pretty directly about things we want from the policy (in that they talk about what the agent would do in multiple possible cases, they don’t just list what the agent will do in the one best case).
I’d argue that “being able to turn off the AI when we want to” is not just a trajectory-kind-of-thing, but instead a counterfactual-kind-of-thing.
Let’s say I have the following preferences:
More-preferred: I press the off-switch right now, and then the AI turns off.
Less-preferred: I press the off-switch right now, and then the AI does not turn off.
I would say that this is a preference over trajectories (a.k.a. universe-histories). But I think it corresponds to wanting my AI to be corrigible, right? I don’t see how this is “counterfactual”...
Hmm, how about this?
Most-preferred: I don’t press the off-switch right now.
Middle-preferred: I press the off-switch right now, and then the AI turns off.
Least-preferred: I press the off-switch right now, and then the AI does not turn off.
I guess this is a “counterfactual” preference in the sense that I will not, in fact, choose to press the off-switch right now, but I nevertheless have a preference for what would have happened if I had. Do you agree? If so, I think this still fits into the broad framework of “having preferences over trajectories / universe-histories”.
None of the preferences you list involve counterfactuals. However, they involve preferences over whether or not you press the off-switch, and so the AI gets incentivized to manipulate you or force you into either pressing or not pressing the off-switch. Basically, your preference is demanding a coincidence between pressing the switch and it turning off, whereas what you probably really want is causation (… between you trying to press the switch and it turning off).
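A toy sketch may make this manipulation incentive concrete (all names here are hypothetical illustrations, not anyone’s actual proposal): with the three-tier preference over trajectories above, an AI that simply maximizes that preference scores “prevent the press” above “comply with the press”, because the preference rewards the no-press outcome itself, with no causal condition attached.

```python
# Toy model (hypothetical names): a trajectory is a (pressed, off) pair,
# ranked according to the three-tier preference above.
def u(trajectory):
    pressed, off = trajectory
    if not pressed:
        return 2            # most-preferred: off-switch never pressed
    return 1 if off else 0  # middle: pressed then off; least: pressed, not off

def outcome(action, human_wants_to_press):
    """Toy world model: 'comply' lets the press happen and then shuts
    down; 'manipulate' prevents the human from pressing at all."""
    if action == "manipulate":
        return (False, False)
    return (human_wants_to_press, human_wants_to_press)

# When the human wants to press, a u-maximizer prefers manipulation,
# since u((False, False)) = 2 beats u((True, True)) = 1:
best = max(["comply", "manipulate"],
           key=lambda a: u(outcome(a, human_wants_to_press=True)))
```

The point of the causal framing is to demand “if the human tries to press, then it turns off” rather than ranking the raw outcomes, which is what the counterfactual/causation distinction above is gesturing at.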
I said “When I think of “being able to turn off the AI when we want to”, I see it as a trajectory-kind-of-thing, not a future-state-kind-of-thing.” That sentence was taking the human’s point of view. And you responded to that, so I was assuming you were also taking the human’s point of view, and then my response was doing that too. But when you said “counterfactual”, you were actually taking the AI’s point of view. (Correct?) If so, sorry for my confusion.
OK, from the AI’s point of view: Look at every universe-history from a God’s-eye view, and score it based on “the AI is being corrigible”. The universe-histories where the AI disables the shutoff switch score poorly, and so do the universe-histories where the AI manipulates the humans in a way we don’t endorse. From a God’s-eye view, it could even (Paul-corrigibility-style) depend on what’s happening inside the AI’s own algorithm, like what it is “trying” to do. OK, then we rank-order all the possible universe-histories based on the score, and imagine an AI that makes decisions based on the corresponding set of preferences over universe-histories. Would you describe that AI as corrigible?
OK, from the AI’s point of view: Look at every universe-history from a God’s-eye view, and score it based on “the AI is being corrigible”. The universe-histories where the AI disables the shutoff switch score poorly, and so do the universe-histories where the AI manipulates the humans in a way we don’t endorse. From a God’s-eye view, it could even (Paul-corrigibility-style) depend on what’s happening inside the AI’s own algorithm, like what it is “trying” to do. OK, then we rank-order all the possible universe-histories based on the score, and imagine an AI that makes decisions based on the corresponding set of preferences over universe-histories. Would you describe that AI as corrigible?
This manages corrigibility without going beyond consequentialism. However, I think there is some self-reference tension that makes it less reasonable than it might seem at first glance. Here are two ways of making it more concrete:
A universe-history from a God’s-eye view includes the AI itself, including all of its internal state and reasoning. But this makes it impossible to have an AI program that optimizes for landing high in the rank-ordering, because if (hypothetically) the-AI-running-tit-for-tat were a highly ranked trajectory, then that trajectory cannot be achieved by the AI running “simulate a wide variety of strategies that the AI could run, and find one that leads to a high ranking in the trajectory rank-ordering, which happens to be tit-for-tat, and so pick that”, e.g. because that’s a different strategy than tit-for-tat itself, or because that strategy is “bigger” than tit-for-tat, or similar.
One could consider a variant method. In our universe, for the preferences we would realistically write, the exact algorithm that runs as the AI probably doesn’t matter. So another viable approach would be to take the God’s-eye rank-ordering and transform it into a preference rank-ordering over embedded/indexical/”AI’s-eye” trajectories that approximates the God’s-eye view as closely as possible, and then have an AI that optimizes over these. One issue is that I believe an isomorphic argument could be applied to states (i.e. you could have a state rank-ordering preference along the lines of “0 if the state appears in highly-ranked trajectories, −1 if it doesn’t”, and then, due to properties of the universe like the reversibility of physics, an AI optimizing for this would act like an AI optimizing for the trajectories). I think the main issue with this is that while it would technically work, it essentially works by LARPing as a different kind of agent, and so it either inherits properties of that different kind of agent, or is broken.
I should mention that the self-reference issue in a way isn’t the “main” thing motivating me to make these distinctions; instead it’s more that I suspect it to be the “root cause”. My main thing motivating me to make the distinctions is just that the math for decision theory doesn’t tend to take a God’s-eye view, but instead a more dualist view.
Ohhh.
I said “When I think of “being able to turn off the AI when we want to”, I see it as a trajectory-kind-of-thing, not a future-state-kind-of-thing.” That sentence was taking the human’s point of view. And you responded to that, so I was assuming you were also taking the human’s point of view, and then my response was doing that too. But when you said “counterfactual”, you were actually taking the AI’s point of view. (Correct?) If so, sorry for my confusion.
🤔 At first I agreed with this, and started writing a thing on the details. Then I started contradicting myself, and I started thinking it was wrong, so I started writing a thing on the details of how this was wrong. But then I got stuck, and now I’m doubting again, thinking that maybe it’s correct anyway.
FWIW, the thing I actually believe (“My corrigibility proposal sketch”) would be based on an abstract concept that involves causality, counterfactuals, and self-reference.
If I’m not mistaken, this whole conversation is all a big tangent on whether “preferences-over-trajectories” is technically correct terminology, but we don’t need to argue about that, because I’m already convinced that in future posts I should just call it “preferences about anything other than future states”. I consider that terminology equally correct, and (apparently) pedagogically superior. :)
I don’t think it’s accurate to call preferences over trajectories non-consequentialist; after all, they’re still preferences over consequences in the physical world. Indeed, Alex Turner found that preferences over trajectories would lead to even more power-seeking than preferences over states.
I think you have something more specific in mind than I do. I think of a “preference over trajectories” as maximally broad. For example, I claim that a deontological preference to “not turn left at this moment” can be readily described as a “preference over trajectories”: the trajectories where I turn left at this moment are all tied for last place, the other trajectories are all tied for first place, in my preference ordering. Right?
A preference over trajectories is allowed to change over time, and doesn’t need to depend on the parts of the trajectory that lie in the distant future.
I think you have something more specific in mind than I do. I think of a “preference over trajectories” as maximally broad. For example, a deontological preference to “not turn left right now” can be readily described as a “preference over trajectories”: the trajectories where I turn left right now score −1, the other trajectories score +1. Right?
I would argue that the maximally broad preference is a preference over whichever kind of policy you might use (let’s call this a policy-utility). So for instance, if you might deploy neural networks, then a maximally broad preference is a preference over neural networks that can be deployed. This allows all sorts of nonsensical preferences, e.g. “the weight at position (198, 8, 2) in conv layer 57 should be as high as possible”.
In practice, we don’t care about the vast majority of preferences that can be specified as preferences over policies. Instead, we commonly seem to investigate the subclass of preferences that I would call consequentialist preferences, that is, preferences over what happens once you deploy the networks. Formally speaking, if u is a preference over policies (i.e. a function Π→R where Π is the set of policies), and d is the “deployment function” Π→T that maps a policy π to the trajectory d(π) that is obtained as a consequence of deploying the policy, then if u factors as u(π)=v(d(π)) for some trajectory value function v:T→R, then I would consider u to be consequentialist. The reason I’d use this term is because d tells you the consequences of the policy, and so it seems natural to call the preferences that factor through consequences “consequentialist”.
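As a minimal sketch of this factoring (environment and names entirely hypothetical): a policy-utility u is consequentialist in this sense exactly when it can be computed by first deploying the policy and then scoring the resulting trajectory, whereas a preference over the policy’s internals (like the conv-layer-weight example) cannot be computed that way.

```python
# Hypothetical toy setup: a policy maps state -> action, and deploy is
# the deployment function d: Pi -> T that rolls a policy out into the
# trajectory obtained as a consequence of deploying it.
def deploy(policy, start=0, horizon=5):
    traj, s = [start], start
    for _ in range(horizon):
        s += policy(s)
        traj.append(s)
    return tuple(traj)

def v(trajectory):
    """Trajectory value function v: T -> R (here: prefer ending high)."""
    return trajectory[-1]

def u_consequentialist(policy):
    """Factors as u(pi) = v(d(pi)), so it is consequentialist."""
    return v(deploy(policy))

def u_internals(policy):
    """Depends only on the policy's internals (analogous to preferring a
    particular conv-layer weight to be high): not consequentialist,
    since it does not factor through d."""
    return policy.some_weight

class AlwaysUp:
    some_weight = 3.0
    def __call__(self, state):
        return 1  # always move up by one

pi = AlwaysUp()
```

Here `u_consequentialist` only ever looks at `deploy(pi)`, never at `pi` itself, which is what the factorization u(π) = v(d(π)) amounts to.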
(I guess strictly speaking, one could argue that any policy-utility function is consequentialist by this definition, because one could just extract the deployed policy from the beginning of the trajectory; I don’t think this will work out in practice, because it involves self-reference; realistically, the AI’s world-model will probably not treat the AI itself as just another part of the universe, but in a somewhat separated way. Certainly this holds for e.g. Alex Turner’s MDP models that he has presented so far.)
A policy-preference along the lines of “the weight at position (198, 8, 2) in conv layer 57 should be as high as possible” is obviously silly; so does this mean the only useful utility functions are the consequentialist ones? I don’t think so; I intend on formalizing a broader class of utility functions, which allow counterfactuals and therefore cannot be expressed with trajectories only. (Unless one gets into weird self-reference situations. But I think practical training methods will tend to avoid that.)
I’m confused about “preference over policies”. I thought people usually describe an MDP agent as having a policy, not a preference over policies. Right?
My framework instead is: I’m not thinking of MDP agents with policies, I’m thinking of planning agents which are constantly choosing actions / plans based on a search over a wide variety of possible actions / plans. We can thus describe them as having a “preference” for whatever objective that search is maximizing (at any given time). A universe-history is “anything in the world, both present and future”, which struck me as sufficiently broad to capture any aspect of a plan that we might care about. But I’m open-minded to the possibility that maybe I should have said “preferences-over-future-states versus preferences-over-whatever-else” rather than “preferences-over-future-states versus preferences-over-trajectories”, and just not used the word “trajectories” at all.
Let’s take an agent that, in any possible situation, wiggles its arm. That’s all it does. From my perspective, I would not call that “a consequentialist agent”. But my impression is that you would call it a consequentialist agent, because it has a policy, and the “consequence” of the policy is that the agent wiggles its arm. Did I get that right?
I’m confused about “preference over policies”. I thought people usually describe an MDP agent as having a policy, not a preference over policies. Right?
Yes. But there are many different possible policies, and usually for an MDP agent, you select only one. This one policy is typically selected to be the one that leads to the optimal consequences. So you have a function over the consequences, ranking them by how good they are, and you have a function over policies, mapping them to the consequences (this function is determined by the MDP dynamics), and if you compose them, you get a function over policies.
My framework instead is: I’m not thinking of MDP agents with policies, I’m thinking of planning agents which are constantly choosing actions / plans based on a search over a wide variety of possible actions / plans. We can thus describe them as having a “preference” for whatever objective that search is maximizing (at any given time). A universe-history is “anything in the world, both present and future”, which struck me as sufficiently broad to capture any aspect of a plan that we might care about. But I’m open-minded to the possibility that maybe I should have said “preferences-over-future-states versus preferences-over-whatever-else” rather than “preferences-over-future-states versus preferences-over-trajectories”, and just not used the word “trajectories” at all.
My framework isn’t restricted to MDPs with policies, it’s applicable to any case where you have a fixed search space. Instead of a function that ranks policies, you could consider a function that ranks plans or ranks actions. Such a function is then consequentialist if it ranks them on the basis of the consequences of these plans/actions.
Let’s take an agent that, in any possible situation, wiggles its arm. That’s all it does. From my perspective, I would not call that “a consequentialist agent”. But my impression is that you would call it a consequentialist agent, because it has a policy, and the “consequence” of the policy is that the agent wiggles its arm. Did I get that right?
I’d say that consequentialism is more a property of the optimization process than the agent. If the agent itself contains an optimizer, then one can talk about whether the agent’s optimizer is consequentialist, as well as about whether the process that picked the agent is consequentialist.
So if you sit down and write a piece of code that makes a robot wiggle its arm, then your choice of code would probably be (partly) consequentialist because you would select the code on the basis of the consequences it has. (Probably far from entirely consequentialist, because you would likely also care about the code’s readability and such, rather than just its consequences.) The code would most likely not have an inner optimizer which searches over possible actions, so it would not even be coherent to talk about whether it was consequentialist. (I.e. it would not be coherent to talk about whether its inner action-selecting optimizer considered the consequences of its actions, because it does not have an inner action-selecting optimizer.) But even if it did have an inner action-selecting optimizer, the code’s selection of actions would likely not be consequentialist, because there would probably be easier ways of ranking actions than by simulating the world to guess the consequences of the actions and then picking the one that does the arm-wiggling best.
Right, what I call “planning agent” is the same as what you call “the agent itself contains an optimizer”, and I was talking about whether that optimizer is selecting plans for their long-term consequences, versus for other things (superficial aspects of plans, or their immediate consequences, etc.).
I suspect that you have in mind a “Risks from learned optimization” type picture where we have little control over whether the agent contains an optimizer or not, or what the optimizer is selecting for. But there’s also lots of other possibilities, e.g. in MuZero the optimizer inside the AI agent is written by the human programmers into the source code (but involves queries to learned components like a world-model and value function). I happen to think the latter (humans write code for the agent’s optimizer) is more probable for reasons here, and that assumption is underlying the discussion under “My corrigibility proposal sketch”, which otherwise probably would seem pretty nonsensical to you, I imagine.
In general, the “humans write code for the agent’s optimizer” approach still has an inner alignment problem, but it’s different in some respects, see here and here.
Right, what I call “planning agent” is the same as what you call “the agent itself contains an optimizer”, and I was talking about whether that optimizer is selecting plans for their long-term consequences, versus for other things (superficial aspects of plans, or their immediate consequences, etc.).
I think one way we differ is that I would group {superficial aspects of plans} vs {long-term consequences, short-term consequences}, with the latter both being consequentialist.
I suspect that you have in mind a “Risks from learned optimization” type picture where we have little control over whether the agent contains an optimizer or not, or what the optimizer is selecting for. But there’s also lots of other possibilities, e.g. in MuZero the optimizer inside the AI agent is written by the human programmers into the source code (but involves queries to learned components like a world-model and value function). I happen to think the latter (humans write code for the agent’s optimizer) is more probable for reasons here, and that assumption is underlying the discussion under “My corrigibility proposal sketch”, which otherwise probably would seem pretty nonsensical to you, I imagine.
In general, the “humans write code for the agent’s optimizer” approach still has an inner alignment problem, but it’s different in some respects, see here and here.
Nah, in fact I’d say that your “Misaligned Model-Based RL Agent” post is one of the main inspirations for my model. 🤔 I guess one place my model differs is that I expect to have an explicit utility function (because this seems easiest to reason about, and therefore safest), whereas you split the explicit utility function into a reward signal and a learned value model. Neither of these translates straightforwardly into my model:
the reward signal is external to the AI, probably determined from the human’s point of view (🤔 I guess that explains the confusion in the other thread, where I had assumed the AI’s point of view, and you had assumed the human’s point of view), and so discussions about whether it is consequentialist or not do not fit straightforwardly into my framework
the value function is presumably something like E[∑R|π,O] where R is the reward, π is the current planner/actor, and O is the agent’s epistemic state in its own world-model; this “bakes in” the policy to the value function in a way that is difficult to fit into my framework; implicitly in order to fit it, you need myopic optimization (as is often done in RL), which I would like to get away from (at least in the formalism—for efficiency we would probably need to apply myopic optimization in practice)
I think one way we differ is that I would group {superficial aspects of plans} vs {long-term consequences, short-term consequences}, with the latter both being consequentialist.
Hmm, I guess I try to say “long-term-consequentialist” for long-term consequences. I might have left out the “long-term” part by accident, or if I thought it was clear from context… (And also to make a snappier post title.)
I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs, and long-term consequentialists tend to match it really well, and short-term-consequentialists tend to match it less well.
I guess one place my model differs is that I expect to have an explicit utility function (because this seems easiest to reason about, and therefore safest)
Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The “My corrigibility proposal sketch” gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but are nevertheless our best bet. BTW that stuff is not in the “My AGI threat model” post, I grew more fond of them a few months afterwards. :)
Hmm, I guess I try to say “long-term-consequentialist” for long-term consequences. I might have left out the “long-term” part by accident, or if I thought it was clear from context… (And also to make a snappier post title.)
Fair enough.
I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs, and long-term consequentialists tend to match it really well, and short-term-consequentialists tend to match it less well.
I agree. I think TurnTrout’s approach is a plausible strategy for formalizing it. If we apply his approach to the long-term vs. short-term distinction, then we can observe that the vast majority of trajectory rankings are long-term consequentialist, and therefore most permutations mostly shuffle long-term-consequentialist rankings among themselves; therefore the power-seeking arguments don’t go through for short-term consequentialists.
I think the nature of the failure of the power-seeking arguments for short-term consequentialists is ultimately different from the nature of the failure for non-consequentialists, though; for short-term consequentialists, it happens as a result of dropping the features that power helps you control, while for non-consequentialists, it happens as a result of valuing additional features than the ones you can control with power.
Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The “My corrigibility proposal sketch” gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but are nevertheless our best bet. BTW that stuff is not in the “My AGI threat model” post, I grew more fond of them a few months afterwards. :)
Ah, I think we are in agreement then. I would also agree with using something like supervised learning to get the ingredients of the utility function. (Though I don’t yet know whether the ingredients would directly be the sorts of things you mention, or more like “These are the objects in the world” + “Is each object a strawberry?” + etc.)
(I would also want to structurally force the world model to be more interpretable; e.g. one could require it to reason in terms of objects living in 3D space.)
I don’t think it’s accurate to call preferences over trajectories non-consequentialist; after all, they’re still preferences over consequences in the physical world. Indeed, Alex Turner found that preferences over trajectories would lead to even more power-seeking than preferences over states. If anything, preferences over trajectories are the purest form of consequentialism. (In fact, I’ve been working on a formal model of classes of preferences where this is precisely true.)
I’d argue that “being able to turn off the AI when we want to” is not just a trajectory-kind-of-thing, but instead a counterfactual-kind-of-thing. After all, it’s not about whether the trajectory contains you-turning-off-the-AI or not; it’s about whether, well, it would if you wanted to. The “if you wanted to” has a causal element, closely related to free will being about your causal control over the future; and talking about causality requires talking about counterfactuals, saying what would happen if your choices or desires were perturbed. So I’d argue that we should seek a causal, counterfactual utility function.
Perhaps importantly, unlike preferences over trajectories, preferences over counterfactual trajectories are not consequentialist. So I agree with the message of your post that consequentialism and corrigibility are incompatible. (In fact, I’m also starting to suspect that consequentialism and human values are incompatible; e.g. it seems difficult to specify values such as “liberty” in purely consequentialist terms. Though they might technically end up consequentialist due to differences between how we might specify utility functions for AIs and for morality.)
Big agree.
Another way I might frame this is that corrigibility isn’t just about what actions we want the AI to choose, it’s about what policies we want the AI to choose.
For any policy, of course, you can always ask “What actions would this policy recommend in the real world? So, wouldn’t we be happy if the AI just picked those?” Or “What utility functions over universe-histories would produce this best sequence of actions? So, wouldn’t one of those be good?”
And if you could compute those in some way other than thinking about what we want from the policy that the AI chooses to implement, be my guest. But my point is that corrigibility is a grab-bag of different things people want from AI, and some of those things are pretty directly about things we want from the policy (in that they talk about what the agent would do in multiple possible cases, they don’t just list what the agent will do in the one best case).
Let’s say I have the following preferences:
More-preferred: I press the off-switch right now, and then the AI turns off.
Less-preferred: I press the off-switch right now, and then the AI does not turn off
I would say that this is a preference over trajectories (a.k.a. universe-histories). But I think it corresponds to wanting my AI to be corrigible, right? I don’t see how this is “counterfactual”...
Hmm, how about this?
Most-preferred: I don’t press the off-switch right now.
Middle-preferred: I press the off-switch right now, and then the AI turns off.
Least-preferred: I press the off-switch right now, and then the AI does not turn off
I guess this is a “counterfactual” preference in the sense that I will not, in fact, choose to press the off-switch right now, but I nevertheless have a preference for what would have happened if I had. Do you agree? If so, I think this still fits into the broad framework of “having preferences over trajectories / universe-histories”.
None of the preferences you list involve counterfactuals. However, they involve preferences over whether or not you press the off-switch, and so the AI gets incentivized to manipulate you or force you into either pressing or not pressing the off switch. Basically, your preference is demanding a coincidence between pressing the switch and it turning off, whereas what you probably really want is a causation (… between you trying to press the switch and it turning off).
Ohhh.
I said “When I think of “being able to turn off the AI when we want to”, I see it as a trajectory-kind-of-thing, not a future-state-kind-of-thing.” That sentence was taking the human’s point of view. And you responded to that, so I was assuming you were also taking the human’s point of view, and then my response was doing that too. But when you said “counterfactual”, you were actually taking the AI’s point of view. (Correct?) If so, sorry for my confusion.
OK, from the AI’s point of view: Look at every universe-history from a God’s-eye view, and score it based on “the AI is being corrigible”. The universe-histories where the AI disables the shutoff switch scores poorly, and so do the universe-histories where the AI manipulates the humans in a way we don’t endorse. From a God’s-eye view, it could even (Paul-corrigibility-style) depend on what’s happening inside the AI’s own algorithm, like what is it “trying” to do. OK, then we rank-order all the possible universe-histories based on the score, and imagine an AI that makes decisions based on the corresponding set of preferences over universe-histories. Would you describe that AI as corrigible?
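The God's-eye scoring described above can be sketched in a few lines. This is a toy illustration under my own assumptions; the history features (`disabled_off_switch`, etc.) are invented stand-ins for properties of a universe-history:

```python
# Illustrative God's-eye scoring of universe-histories by
# "the AI is being corrigible". Feature names are invented.
def corrigibility_score(history):
    score = 0.0
    if history.get("disabled_off_switch"):
        score -= 10.0  # disabling the shutoff switch scores poorly
    if history.get("manipulated_humans"):
        score -= 10.0  # so does unendorsed manipulation
    if history.get("ai_trying_to_help"):
        score += 1.0   # Paul-corrigibility-style: can depend on AI internals
    return score

histories = [
    {"disabled_off_switch": True},
    {"manipulated_humans": True},
    {"ai_trying_to_help": True},
]

# Rank-order all histories by score; an agent's preferences over
# universe-histories could then simply follow this ordering.
ranked = sorted(histories, key=corrigibility_score, reverse=True)
```

The resulting `ranked` list is exactly the kind of preference-over-universe-histories the question is asking about.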
This manages corrigibility without going beyond consequentialism. However, I think there is some self-reference tension that makes it less reasonable than it might seem at first glance. Here are two ways of making it more concrete:
A universe-history from a God’s-eye view includes the AI itself, including all of its internal state and reasoning. But this makes it impossible to have an AI program that optimizes for matching something high in the rank-ordering, because if (hypothetically) the AI running tit-for-tat was a highly-ranked trajectory, then that cannot be achieved by the AI running “simulate a wide variety of strategies that the AI could run, and find one that leads to a high ranking in the trajectory world-ordering, which happens to be tit-for-tat, and so pick that”, e.g. because that is a different strategy than tit-for-tat itself, or because that strategy is “bigger” than tit-for-tat, or similar.
One could consider a variant method. In our universe, for the preferences we would realistically write, the exact algorithm that runs as the AI probably doesn’t matter. So another viable approach would be to take the God’s-eye rank-ordering and transform it into a preference rank-ordering over embedded/indexical/”AI’s-eye” trajectories that approximates the God’s-eye view as closely as possible. And then one could have an AI that optimizes over these. One issue is that I believe an isomorphic argument could be applied to states (i.e. you could have a state rank-ordering preference along the lines of “0 if the state appears in highly-ranked trajectories, −1 if it doesn’t”, and then, due to properties of the universe like the reversibility of physics, an AI optimizing for this would act like an AI optimizing for the trajectories). But I think the main issue is that while this would technically work, it essentially works by LARPing as a different kind of agent, and so it either inherits properties of that different kind of agent, or is broken.
I should mention that the self-reference issue in a way isn’t the “main” thing motivating me to make these distinctions; rather, I suspect it to be the “root cause”. The main thing motivating me is just that the math for decision theory doesn’t tend to take a God’s-eye view, but instead a more dualist view.
🤔 At first I agreed with this, and started writing a thing on the details. Then I started contradicting myself, and I started thinking it was wrong, so I started writing a thing on the details of how this was wrong. But then I got stuck, and now I’m doubting again, thinking that maybe it’s correct anyway.
FWIW, the thing I actually believe (“My corrigibility proposal sketch”) would be based on an abstract concept that involves causality, counterfactuals, and self-reference.
If I’m not mistaken, this whole conversation is all a big tangent on whether “preferences-over-trajectories” is technically correct terminology, but we don’t need to argue about that, because I’m already convinced that in future posts I should just call it “preferences about anything other than future states”. I consider that terminology equally correct, and (apparently) pedagogically superior. :)
I think you have something more specific in mind than I do. I think of a “preference over trajectories” as maximally broad. For example, I claim that a deontological preference to “not turn left at this moment” can be readily described as a “preference over trajectories”: the trajectories where I turn left at this moment are all tied for last place, the other trajectories are all tied for first place, in my preference ordering. Right?
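The "don't turn left" example can be made fully explicit as a utility function over trajectories. A minimal sketch, with trajectories represented as tuples of actions and "this moment" taken to be index 0:

```python
# Sketch: the deontological rule "do not turn left at this moment" as a
# preference over trajectories. Trajectories containing a left turn at
# the current moment are tied for last place (-1); all others are tied
# for first place (0).
def utility(trajectory, now=0):
    return -1 if trajectory[now] == "left" else 0

assert utility(("left", "right")) == -1
assert utility(("right", "left")) == 0  # a later left turn is fine here
```

This also illustrates the next point: the preference only looks at the current moment of the trajectory, not the distant future.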
A preference over trajectories is allowed to change over time, and doesn’t need to depend on the parts of the trajectory that lie in the distant future.
I would argue that the maximally broad preference is a preference over whichever kind of policy you might use (let’s call this a policy-utility). So for instance, if you might deploy neural networks, then a maximally broad preference is a preference over neural networks that can be deployed. This allows all sorts of nonsensical preferences, e.g. “the weight at position (198, 8, 2) in conv layer 57 should be as high as possible”.
In practice, we don’t care about the vast majority of preferences that can be specified as preferences over policies. Instead, we commonly seem to investigate the subclass of preferences that I would call consequentialist preferences, that is, preferences over what happens once you deploy the networks. Formally speaking, if u is a preference over policies (i.e. a function Π→R where Π is the set of policies), and d is the “deployment function” Π→T that maps a policy π to the trajectory d(π) that is obtained as a consequence of deploying the policy, then if u factors as u(π)=v(d(π)) for some trajectory value function v:T→R, then I would consider u to be consequentialist. The reason I’d use this term is because d tells you the consequences of the policy, and so it seems natural to call the preferences that factor through consequences “consequentialist”.
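The factoring u(π)=v(d(π)) can be written out in a toy two-step world. This is only a sketch of the definition; the dynamics and the particular v are invented for illustration:

```python
# Toy instance of the factoring u(pi) = v(d(pi)) described above.
def d(policy):
    """Deployment function Pi -> T: roll the policy out to a trajectory."""
    state, trajectory = 0, [0]
    for _ in range(2):
        state = state + policy(state)
        trajectory.append(state)
    return tuple(trajectory)

def v(trajectory):
    """Trajectory value function v: T -> R. Here: just the final state."""
    return trajectory[-1]

def u(policy):
    """A consequentialist policy-utility: it factors through consequences."""
    return v(d(policy))

assert u(lambda s: 1) == 2  # always step right: trajectory (0, 1, 2)
assert u(lambda s: 0) == 0  # stand still: trajectory (0, 0, 0)
```

A non-consequentialist policy-utility, by contrast, would inspect `policy` itself (e.g. a weight inside it) rather than only `d(policy)`.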
(I guess strictly speaking, one could argue that any policy-utility function is consequentialist by this definition, because one could just extract the deployed policy from the beginning of the trajectory; I don’t think this will work out in practice, because it involves self-reference; realistically, the AI’s world-models will probably not treat itself as being just another part of the universe, but instead in a somewhat separated way. Certainly this holds for e.g. Alex Turner’s MDP models that he has presented so far.)
A policy-preference along the lines of “the weight at position (198, 8, 2) in conv layer 57 should be as high as possible” is obviously silly; so does this mean the only useful utility functions are the consequentialist ones? I don’t think so; I intend to formalize a broader class of utility functions, which allow counterfactuals and therefore cannot be expressed with trajectories only. (Unless one gets into weird self-reference situations. But I think practical training methods will tend to avoid that.)
I’m confused about “preference over policies”. I thought people usually describe an MDP agent as having a policy, not a preference over policies. Right?
My framework instead is: I’m not thinking of MDP agents with policies, I’m thinking of planning agents which are constantly choosing actions / plans based on a search over a wide variety of possible actions / plans. We can thus describe them as having a “preference” for whatever objective that search is maximizing (at any given time). A universe-history is “anything in the world, both present and future”, which struck me as sufficiently broad to capture any aspect of a plan that we might care about. But I’m open-minded to the possibility that maybe I should have said “preferences-over-future-states versus preferences-over-whatever-else” rather than “preferences-over-future-states versus preferences-over-trajectories”, and just not used the word “trajectories” at all.
Let’s take an agent that, in any possible situation, wiggles its arm. That’s all it does. From my perspective, I would not call that “a consequentialist agent”. But my impression is that you would call it a consequentialist agent, because it has a policy, and the “consequence” of the policy is that the agent wiggles its arm. Did I get that right?
Yes. But there are many different possible policies, and usually for an MDP agent, you select only one. This one policy is typically selected to be the one that leads to the optimal consequences. So you have a function over the consequences, ranking them by how good they are, and you have a function over policies, mapping them to the consequences (this function is determined by the MDP dynamics), and if you compose them, you get a function over policies.
My framework isn’t restricted to MDPs with policies, it’s applicable to any case where you have a fixed search space. Instead of a function that ranks policies, you could consider a function that ranks plans or ranks actions. Such a function is then consequentialist if it ranks them on the basis of the consequences of these plans/actions.
I’d say that consequentialism is more a property of the optimization process than the agent. If the agent itself contains an optimizer, then one can talk about whether the agent’s optimizer is consequentialist, as well as about whether the process that picked the agent is consequentialist.
So if you sit down and write a piece of code that makes a robot wiggle its arm, then your choice of code would probably be (partly) consequentialist because you would select the code on the basis of the consequences it has. (Probably far from entirely consequentialist, because you would likely also care about the code’s readability and such, rather than just its consequences.) The code would most likely not have an inner optimizer which searches over possible actions, so it would not even be coherent to talk about whether it was consequentialist. (I.e. it would not be coherent to talk about whether its inner action-selecting optimizer considered the consequences of its actions, because it does not have an inner action-selecting optimizer.) But even if it did have an inner action-selecting optimizer, the code’s selection of actions would likely not be consequentialist, because there would probably be easier ways of ranking actions than by simulating the world to guess the consequences of the actions and then picking the one that does the arm-wiggling best.
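The contrast in the last two paragraphs can be sketched as two versions of the arm-wiggling robot. This is my own toy illustration; both robots and their scoring rule are invented:

```python
# Sketch: two arm-wiggling robots with identical behavior.
def robot_hardcoded(observation):
    # No inner action-selecting optimizer at all, so it isn't even
    # coherent to ask whether its action selection is consequentialist.
    return "wiggle_arm"

def robot_with_inner_optimizer(observation,
                               actions=("wiggle_arm", "stay_still")):
    # Has an inner optimizer, but a non-consequentialist one: it ranks
    # actions by a fixed score table rather than by simulating the
    # world to find each action's consequences.
    score = {"wiggle_arm": 1, "stay_still": 0}
    return max(actions, key=score.__getitem__)

# Same outward behavior either way.
assert robot_hardcoded(None) == robot_with_inner_optimizer(None)
```

The point is that consequentialism here is a property of how actions get selected (or of how the code was selected by the programmer), not of the behavior itself.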
Right, what I call “planning agent” is the same as what you call “the agent itself contains an optimizer”, and I was talking about whether that optimizer is selecting plans for their long-term consequences, versus for other things (superficial aspects of plans, or their immediate consequences, etc.).
I suspect that you have in mind a “Risks from learned optimization” type picture where we have little control over whether the agent contains an optimizer or not, or what the optimizer is selecting for. But there’s also lots of other possibilities, e.g. in MuZero the optimizer inside the AI agent is written by the human programmers into the source code (but involves queries to learned components like a world-model and value function). I happen to think the latter (humans write code for the agent’s optimizer) is more probable for reasons here, and that assumption is underlying the discussion under “My corrigibility proposal sketch”, which otherwise probably would seem pretty nonsensical to you, I imagine.
In general, the “humans write code for the agent’s optimizer” approach still has an inner alignment problem, but it’s different in some respects, see here and here.
I think one way we differ is that I would group {superficial aspects of plans} vs {long-term consequences, short-term consequences}, with the latter both being consequentialist.
Nah, in fact I’d say that your “Misaligned Model-Based RL Agent” post is one of the main inspirations for my model. 🤔 I guess one place my model differs is that I expect to have an explicit utility function (because this seems easiest to reason about, and therefore safest), whereas you split the explicit utility function into a reward signal and a learned value model. Neither of these translates straightforwardly into my model:
the reward signal is external to the AI, probably determined from the human’s point of view (🤔 I guess that explains the confusion in the other thread, where I had assumed the AI’s point of view, and you had assumed the human’s point of view), and so discussions about whether it is consequentialist or not do not fit straightforwardly into my framework
the value function is presumably something like E[∑R|π,O] where R is the reward, π is the current planner/actor, and O is the agent’s epistemic state in its own world-model; this “bakes in” the policy to the value function in a way that is difficult to fit into my framework; implicitly in order to fit it, you need myopic optimization (as is often done in RL), which I would like to get away from (at least in the formalism—for efficiency we would probably need to apply myopic optimization in practice)
Hmm, I guess I try to say “long-term-consequentialist” for long-term consequences. I might have left out the “long-term” part by accident, or because I thought it was clear from context… (And also to make a snappier post title.)
I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs, and long-term consequentialists tend to match it really well, and short-term-consequentialists tend to match it less well.
Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The “My corrigibility proposal sketch” gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but are nevertheless our best bet. BTW that stuff is not in the “My AGI threat model” post; I grew more fond of those ideas a few months after writing it. :)
Fair enough.
I agree. I think TurnTrout’s approach is a plausible strategy for formalizing it. If we apply his approach to the long-term vs. short-term distinction, then we can observe that the vast majority of trajectory rankings are long-term consequentialist, and therefore most permutations map rankings to other long-term-consequentialist rankings; therefore the power-seeking arguments don’t go through for short-term consequentialists.
I think the nature of the failure of the power-seeking arguments for short-term consequentialists is ultimately different from the nature of the failure for non-consequentialists, though; for short-term consequentialists, it happens as a result of dropping the features that power helps you control, while for non-consequentialists, it happens as a result of valuing additional features than the ones you can control with power.
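The basic flavor of the power-seeking argument being referenced can be shown in a toy one-step case. To be clear, this is my own sketch and not TurnTrout's actual formalism: action A keeps three final states reachable while action B commits to one, and we count how many strict rankings of the final states favor the option-preserving action:

```python
import itertools

# Toy power-seeking count (invented example, not TurnTrout's formalism).
# Action A can reach final states a1, a2, a3; action B can reach only b1.
final_states = ["a1", "a2", "a3", "b1"]

prefers_A, total = 0, 0
for ranking in itertools.permutations(range(4)):
    u = dict(zip(final_states, ranking))
    total += 1
    # A is optimal iff its best reachable state beats B's only state.
    if max(u["a1"], u["a2"], u["a3"]) > u["b1"]:
        prefers_A += 1

assert prefers_A * 4 == total * 3  # exactly 3/4 of rankings favor A
```

If the utility function instead only cared about short-term features that both actions affect identically, this asymmetry (and hence the incentive toward the option-preserving action) would disappear, which is the short-term-consequentialist failure mode described above.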
Ah, I think we are in agreement then. I would also agree with using something like supervised learning to get the ingredients of the utility function. (Though I don’t yet know that the ingredients would directly be the sorts of things you mention, or more like “These are the objects in the world” + “Is each object a strawberry?” + etc..)
(I would also want to structurally force the world model to be more interpretable; e.g. one could require it to reason in terms of objects living in 3D space.)