Thanks! Hmm. I think there’s a notion of “how much a set of preferences gives rise to stereotypically-consequentialist behavior”. Like, if you see an agent behaving optimally with respect to preferences about “how the world will be in 10 years”, they would look like a consequentialist goal-seeking agent. Even if you didn’t know what future world-states they preferred, you would be able to guess with high confidence that they preferred some future world-states over others. For example, they would almost certainly pursue convergent instrumental subgoals like power-seeking. By contrast, if you see an agent which, at any time, behaves optimally with respect to preferences about “how the world will be in 5 seconds”, it would look much less like that, especially if after each 5-second increment they roll a new set of preferences. And an agent which, at any time, behaves optimally with respect to preferences over what it’s doing right now would look not at all like a consequentialist goal-seeking agent.
(We care about “looking like a consequentialist goal-seeking agent” because corrigible AIs do NOT “look like a consequentialist goal-seeking agent”.)
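To make the “looks like” claim a bit more concrete, here is a rough sketch of the kind of decision rule I have in mind. This is purely illustrative; every name in it (predict_state, preference_score, and so on) is a made-up placeholder, not any real system.

```python
# Rough sketch, purely illustrative: one planning loop, parameterized by how far
# ahead the preferences look. All names are hypothetical placeholders.

def choose_action(world, candidate_actions, predict_state, preference_score, horizon):
    """Pick the action whose predicted world state at `horizon` ranks highest
    under `preference_score` (a rank-ordering over predicted states)."""
    def score(action):
        future_state = predict_state(world, action, horizon)  # horizon might be 10 years, 5 seconds, ...
        return preference_score(future_state)
    return max(candidate_actions, key=score)
```

With horizon set ten years out and a fixed preference_score, this loop is the stereotypical consequentialist; with horizon at five seconds and preference_score re-drawn after every step, the very same loop stops looking like one.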
Now we can say: By the time-reversibility of the laws of physics, a rank-ordering of “states-of-the-world at future time T (= midnight on January 1 2050)” is equivalent to a rank-ordering of “universe-histories up through future time T”. But I see that as kind of an irrelevant technicality. An agent that makes decisions myopically according to (among other things) a “preference for telling the truth right now” has a simple preference in the universe-history picture, but translated through that equivalence it would cash out as “some unfathomably complicated preference over the microscopic configuration of atoms in the universe at time T”. And indeed, an agent with that (unfathomably complicated) preference ordering would not look like a consequentialist goal-seeking agent.
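To spell the equivalence out in symbols (my own notation, added purely for illustration): under deterministic, reversible dynamics, the map $\Phi$ from a universe-history $h_{\le T}$ to its final state $s_T$ is a bijection, so any utility over histories induces a utility over states at time T:

$$U_{\text{state}}(s_T) \;:=\; U_{\text{hist}}\!\left(\Phi^{-1}(s_T)\right)$$

The two rankings agree, but $U_{\text{state}}$ inherits none of the simple structure of a myopic term like “tell the truth right now”; it is exactly the unfathomably complicated function of the time-T microstate just described.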
So by the same token, it’s not that there’s literally no utility function over “states of the world at future time T” that incentivizes corrigible behavior all the way from now to T, it’s just that there may be no such utility function that can be realistically defined.
Turning more specifically to record-keeping mechanisms, consider an agent with preferences pertaining to what will be written in the logbook at future time T. Let’s take two limiting cases.
One limiting case is: the logbook can be hacked. Then the agent will hack into it. This looks like consequentialist goal-seeking behavior.
The other limiting case is: the logbook is perfect and unbreachable. Then I’d say that it no longer really makes sense to describe this as “an AI with preferences over the state of the world at future time T”. It’s more helpful to think of this as “an AI with preferences over universe-histories”, and by the way an implementation detail is that there’s this logbook involved in how we designed the AI to have this preference. And indeed, the AI will now look less like a consequentialist goal-seeking agent. (By the way I doubt we would actually design an AI using a literal logbook.)
I would propose instead that we focus on “preferences over outcomes”, rather than states or trajectories. This makes it clear that some judgement is required to figure out what counts as an outcome, and how to determine whether it has obtained.
I’m a bit confused what you’re saying here.
It is conceivable to have an AI that makes decisions according to a rank-ordering of the state of the world at future time T = midnight January 1 2050. My impression is that Eliezer has that kind of thing in mind—e.g. “imagine a paperclip maximizer as not being a mind at all, imagine it as a kind of malfunctioning time machine that spits out outputs which will in fact result in larger numbers of paperclips coming to exist later” (ref). I’m suggesting that this is a bad idea if we do it to the exclusion of every other type of preference, but it is possible.
On the other hand, I intended “preferences over trajectories” to be maximally vague—it rules nothing out.
I think our future AIs can have various types of preferences. It’s quite possible that none of those preferences would look like a rank-ordering of states of the world at a specific time T, but some of them might be kinda similar, e.g. a preference for “there will eventually be paperclips” but not by any particular deadline. Is that what you mean by “outcome”? Would it have helped if I had replaced “preferences over trajectories” with the synonymous “preferences that are not exclusively about the future state of the world”?
Thanks for the reply! My comments here are more thinking-in-progress than robust conclusions (more so than I’d like), but I figure that’s better than nothing.
Would it have helped if I had replaced “preferences over trajectories” with the synonymous “preferences that are not exclusively about the future state of the world”?
(Thanks for doing that!) I was going to answer ‘yes’ here, but… having thought about this more, I guess I now find myself confused about what it means to have preferences in a way that doesn’t give rise to consequentialist behaviour. Having (unstable) preferences over “what happens 5 seconds after my current action” sounds to me like not really having preferences at all. The behaviour is not coherent enough to be interpreted as preferring some things over others, except in a contrived way.
Your proposal is to somehow get an AI that both produces plans that actually work and cares about being corrigible. I think you’re claiming that the main perceived difficulty with combining these is that corrigibility is fundamentally not about preferences over states, whereas producing plans that actually work is about preferences over states. So your proposal is to create an AI that has both kinds of preferences: some over states and some not.
I would counter that how to specify (or more precisely, incentivize) preferences for corrigibility remains the main difficulty, regardless of whether this means preferences over states or not. If you try to incentivize corrigibility via a recognizer for being corrigible, the making-plans-that-actually-work part of the AI effectively just adds fooling the recognizer to its requirements for actually working.
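Here is a throwaway toy of that dynamic. Everything in it is invented purely for illustration (the integer “plans”, works, recognizer, truly_corrigible); it is not a claim about any real system.

```python
# Toy illustration only: "plans" are integers, works() is the task objective,
# recognizer() is an imperfect proxy for corrigibility, and truly_corrigible()
# is the ground truth the designers actually cared about.

def works(plan: int) -> bool:
    return plan >= 50                    # the plan actually gets the task done

def truly_corrigible(plan: int) -> bool:
    return plan % 2 == 0                 # ground truth (not visible to the planner)

def recognizer(plan: int) -> bool:
    return plan % 2 == 0 or plan > 90    # proxy: agrees with ground truth except on a corner case

# The planner optimizes for "works and is recognized as corrigible", breaking
# ties toward the most "effective" plan (here: simply the largest number).
chosen = max(p for p in range(100) if works(p) and recognizer(p))
print(chosen, truly_corrigible(chosen))  # -> 99 False: the proxy's corner case wins
```

No deception module is needed for this to go wrong; a thorough search over plans finds the recognizer’s corner case automatically.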
In your view, does it make sense to think about corrigibility as constraints on trajectories? Going with that for now… If the constraints were simple enough, we could program them right into the action space, as in a board-game-playing AI that cannot make an invalid move and therefore looks like it cares both about reaching the final win state and about satisfying the never-makes-an-invalid-move constraint on its trajectory. But corrigibility is not so simple that we can program it into the action space in advance. I think what the corrigibility constraint consists of may grow in sophistication with the sophistication of the agent’s plans. It seems like it can’t just be factored out as an additional objective, because we don’t have a foolproof specification of that additional objective.
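For concreteness, the board-game version of “program the constraint right into the action space” is just a hard mask over actions. This is a sketch with placeholder names (legal_moves, value), not any particular game engine.

```python
# Sketch of baking a constraint into the action space: the agent cannot even
# consider an illegal move, so it looks as if it cares about the constraint.
# legal_moves() and value() are hypothetical placeholders.

def choose_move(state, all_moves, legal_moves, value):
    legal = [m for m in all_moves if m in legal_moves(state)]  # hard constraint, applied before any optimization
    return max(legal, key=lambda m: value(state, m))           # optimize only over what remains
```

The trouble with corrigibility is that there is no legal_moves(state) we could write down in advance; what counts as violating the constraint seems to depend on the content of the plans themselves.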
what it means to have preferences in a way that doesn’t give rise to consequentialist behaviour. Having (unstable) preferences over “what happens 5 seconds after my current action” sounds to me like not really having preferences at all. The behaviour is not coherent enough to be interpreted as preferring some things over others, except in a contrived way.
Oh, sorry, I’m thinking of a planning agent. At any given time it considers possible courses of action, and decides what to do based on “preferences”. So “preferences” are an ingredient in the algorithm, not something to be inferred from external behavior.
That said, if someone “prefers” to tell people what’s on their mind, or if someone “prefers” to hold their fork with their left hand … I think those are two examples of “preferences” in the everyday sense of the word, but they’re not expressible as a rank-ordering of the state of the world at a future date.
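In pseudocode-ish terms, the planner I’m picturing looks something like this. It’s a sketch only; every name is a placeholder.

```python
# Sketch only: "preferences" are terms inside the decision rule. Some score a
# predicted future state; others score the action itself (e.g. "am I telling
# the truth right now?"). All names are hypothetical placeholders.

def decide(candidate_actions, predict_state, future_state_score, action_score):
    def total(action):
        return future_state_score(predict_state(action)) + action_score(action)
    return max(candidate_actions, key=total)
```

The action_score term is where a preference like “tell people what’s on their mind” lives: it gets evaluated against the act currently under consideration, not against any future state of the world.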
If you try to incentivize corrigibility via a recognizer for being corrigible, the making-plans-that-actually-work part of the AI effectively just adds fooling the recognizer to its requirements for actually working.
Instead of “desire to be corrigible”, I’ll switch to something more familiar: “desire to save the rainforest”.
Let’s say my friend Sally is “trying to save the rainforest”. There’s no “save the rainforest detector” external to Sally, which Sally is trying to satisfy. Instead, the “save the rainforest” concept is inside Sally’s own head.
When Sally decides to execute Plan X because it will help save the rainforest, that decision is based on the details of Plan X as Sally herself understands it.
Let’s also assume that Sally’s motivation is ego-syntonic (which we definitely want for our AGIs): In other words, Sally wants to save the rainforest and Sally wants to want to save the rainforest.
Under those circumstances, I don’t think saying something like “Sally wants to fool the recognizer” is helpful. That’s not an accurate description of her motivation. In particular, if she were offered an experience machine or brain-manipulator that could make her believe that she has saved the rainforest, without all the effort of actually saving the rainforest, she would emphatically turn down that offer.
So what can go wrong?
Let’s say Sally and Ahmed are working at the same rainforest advocacy organization. They’re both “trying to save the rainforest”, but maybe those words mean slightly different things to them. Let’s quiz them with a list of 20 weird out-of-distribution hypotheticals:
“If we take every tree and animal in the rainforest and transplant it to a different planet, where it thrives, does that count as ‘saving the rainforest’?”
“If we raze the rainforest but run an atom-by-atom simulation of it, does that count as ‘saving the rainforest’?”
Etc.
Presumably Sally and Ahmed will give different answers, and this could conceivably shake out as Sally taking an action that Ahmed strongly opposes or vice-versa, even though they nominally share the same goal.
You can describe that as “Sally is narrowly targeting the save-the-rainforest-recognizer-in-Sally’s-head, and Ahmed is narrowly targeting the save-the-rainforest-recognizer-in-Ahmed’s-head, and each sees the other as Goodhart’ing a corner-case where their recognizer is screwing up.”
That’s definitely a problem, and that’s the kind of stuff I was talking about under “Objection 1” in the post, where I noted the necessity of out-of-distribution detection systems perhaps related to Stuart Armstrong’s “model splintering” ideas etc.
Does that help?