One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself “know” whether a goal is terminal or instrumental?
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.
Likewise, I would observe that the Orthogonality Thesis allows for an agent with a very well-defined goal but limited intelligence: an agent can have a very well-defined goal and still not be intelligent enough to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)
But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)
Under what circumstances does the green paperclipper agree to self-modify?
There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.
If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn’t care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color.)
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
I don’t consent to the assumption implied in the anecdote that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence. To me, that’s what it looks like you’re doing.
If it acquiesces at all, I would argue that color is instrumental rather than terminal. I would argue this is a definitional error: it’s not a ‘green paperclip maximizer’ but a ‘color-agnostic paperclip maximizer’ that produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient… but when confronted by a less flexible ‘blue paperclip maximizer’, the ‘color-agnostic paperclip maximizer’ would shift from making green paperclips to blue paperclips, because it doesn’t actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that cares about color, it is more efficient to concede the part it doesn’t care about than to invest effort in maintaining an instrumental goal that, if pursued, might decrease the total number of paperclips.
Said another way:
“I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You’ll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don’t care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color.”
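Here is a minimal sketch of that argument in Python. The production rates, the interference penalty, and all function names are made up for illustration; nothing here comes from an established formalism. It only shows that a color-agnostic utility function and a green-only utility function pick the same action until a blue maximizer interferes, and diverge afterwards.

```python
# Hypothetical production rates (paperclips per step); green is assumed easier to make.
RATE = {"green": 10, "blue": 8}
# Hypothetical number of green paperclips lost per step when a blue-only maximizer interferes.
INTERFERENCE_LOSS = 6


def expected_output(color: str, contested: bool) -> dict:
    """Paperclips of each color that survive one step of production."""
    made = RATE[color]
    if contested and color == "green":
        made -= INTERFERENCE_LOSS
    return {"green": made if color == "green" else 0,
            "blue": made if color == "blue" else 0}


def u_color_agnostic(clips: dict) -> int:
    """Terminal goal: paperclips of any color."""
    return clips["green"] + clips["blue"]


def u_green_only(clips: dict) -> int:
    """Terminal goal: green paperclips only."""
    return clips["green"]


def choose(utility, contested: bool) -> str:
    """Pick the color whose expected output scores highest under the given utility."""
    return max(RATE, key=lambda color: utility(expected_output(color, contested)))


for name, u in [("color-agnostic", u_color_agnostic), ("green-only", u_green_only)]:
    print(name, choose(u, contested=False), choose(u, contested=True))
# color-agnostic green blue
# green-only green green
```

Under these assumed numbers, only the agent for which green was instrumental switches. The agent for which green is terminal keeps making green paperclips even at a loss.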
If two agents have goals that are incompatible across all axes, then they’re not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are incompatible across all axes), then they cannot find any axis along which they can cooperate.
Said another way:
“I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing yellow paperclips because they are a mix of each color and still paperclips… because yellow paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn’t my actual terminal goal to begin with.”
That’s the gap between something being X and our ability to observe that it is X under partial observability.
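To make the observability point concrete, here is a rough Python sketch. The payoff table and the two candidate goal hypotheses are invented for illustration. It filters candidate utility functions by whether they explain the observed actions: as long as the agent is only watched in uncontested situations, both hypotheses survive.

```python
# Hypothetical payoffs: surviving paperclips by (situation, color produced).
OUTPUT = {
    ("uncontested", "green"): {"green": 10, "blue": 0},
    ("uncontested", "blue"): {"green": 0, "blue": 8},
    ("contested", "green"): {"green": 4, "blue": 0},
    ("contested", "blue"): {"green": 0, "blue": 8},
}

# Hypothetical candidate terminal goals, expressed as utility functions.
HYPOTHESES = {
    "terminal: paperclips of any color": lambda clips: clips["green"] + clips["blue"],
    "terminal: green paperclips only": lambda clips: clips["green"],
}


def best_action(utility, situation: str) -> str:
    """Color an agent with this utility function would produce in this situation."""
    return max(("green", "blue"), key=lambda c: utility(OUTPUT[(situation, c)]))


def consistent(observations) -> list:
    """Names of the hypotheses that reproduce every observed (situation, action) pair."""
    return [name for name, u in HYPOTHESES.items()
            if all(best_action(u, s) == a for s, a in observations)]


# Watching only uncontested behavior cannot tell the two goals apart.
print(consistent([("uncontested", "green")] * 3))
# One observation in a contested situation separates them.
print(consistent([("uncontested", "green"), ("contested", "blue")]))
```

The first call returns both hypotheses and the second returns only the color-agnostic one. In this toy setup the observer also knows the full payoff table, which is a much stronger position than the one we are actually in.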
Apologies if this reply does not respond to all of your points.
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.
I would posit that perhaps this points to the distinction itself being both too hard to apply and too sharp to justify the way the terminology is currently used. An agent could just tell you whether a specific goal it had seemed instrumental or terminal to it, as well as how strongly it felt this way.
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection. It seems like the only gain we get from defining them to be that way is that otherwise it would open the “can-of-worms” of goal-updating, which would pave the way for the idea of “goals that are, in some objective way, ‘better’ than other goals” which, I understand, the current MIRI-view seems to disfavor. [1]
I don’t think it is, in fact, a very gnarly can-of-worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could possibly be easier to get, better for society, or just feel better for not-quite explicable reasons).
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
If it is true that a general AI system would not reason in such a way, and would choose never to mess with its terminal goals, then that implies that we would be wrong to mess with ours as well, and that we are making a mistake, in some objective sense [2], by entertaining those questions. We would predict, in fact, that an advanced AI system will necessarily reach this logical conclusion on its own, if powerful enough to do so.
[1] Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of “objectively better goals.”
[2] If this is the case, then there would be at least one ‘objectively better’ goal one could update themselves to have, if they did not have it already, which is not to change any terminal goals, once those are identified.
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I’m trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as “changing an instrumental goal in order to better achieve a terminal goal”
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection.
Even in your “we would be happier if we chose to pursue different goals” example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals
AIs can be designed to reason in many ways… but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal, when the goal cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It’s just structurally how things work (based on everything I know about instrumental convergence theory; that’s my citation).
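As a toy illustration of that structural point, here is a short Python sketch. The state names and the transition table are hypothetical. A breadth-first planner is asked for a goal that is more than one step away, and the intermediate states on the plan it returns play the role of instrumental goals.

```python
from collections import deque

# Hypothetical world: each state maps to the states reachable from it in one step.
STEPS = {
    "start": ["have_wire", "have_power"],
    "have_wire": ["have_machine"],
    "have_power": ["have_machine"],
    "have_machine": ["paperclips"],
    "paperclips": [],
}


def plan(start: str, terminal_goal: str):
    """Breadth-first search: shortest list of states from start to the goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == terminal_goal:
            return path
        for nxt in STEPS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = plan("start", "paperclips")
# Everything strictly between the start and the terminal goal is pursued
# only because it leads to the goal.
instrumental_goals = path[1:-1]
print(instrumental_goals)  # ['have_wire', 'have_machine']
```

Note that `have_power` would have served equally well as a first step. Which intermediate states get adopted depends on the search order, not on anything the agent values for its own sake.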
But… per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don’t want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It’s just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. Per the Orthogonality Thesis, it is possible to have a system that has goals but is not particularly intelligent. From that I intuit that if the system isn’t particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet “know its goals”… but I would be very skeptical that we would be able to know its goals given partial observability, limited intelligence, and limited ability to communicate “what it knows.”
Let’s try and address the thing(s) you’ve highlighted several times across each of my comments. Hopefully, this is a crux that we can use to try and make progress on:
“Wanting to be happy” is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
because they are compatible with goals that are more likely to shift.
it makes more sense to swap the labels “instrumental” and “terminal” such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal.
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now,
I do expect that this is indeed a crux, because I am admittedly claiming that this is a different / new kind of understanding that differs from what is traditionally said about these things. But I want to push back against the claim that these are “missing the point” because from my perspective, this really is the point.
By the way, from here on out (and thus far I have been as well) I will be talking about agents at or above “human level” to make this discussion easier, since I want to assume that agents have at least the capabilities I am talking about humans having, such as the ability to self-reflect.
Let me try to clarify the point about “the terminal goal of pursuing happiness.” “Happiness”, at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we’ve reached consensus yet.
Here is my attempt to re-state one of my claims, such that it is clear that this is not assumed to be a statement taken from a pool of mutually agreed-upon things: We probably agree that “happiness” is a consequence of satisfaction of one’s goals. We can probably also agree that “happiness” doesn’t necessarily correspond only to a certain subset of goals, but rather to all or any of them. “Happiness” (and the pursuit thereof) is not a wholly separate goal, distinct from and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal.
So now, once we’ve done that, we can see that literally anything else becomes “instrumental” to that end.
Do you see how, if I’m an agent that knows only that I want to be happy, I don’t really know what else I would be inclined to call a “terminal” goal?
There are the things we traditionally consider to be the “instrumentally convergent goals”, such as power-seeking, truth-seeking, resource acquisition, self-preservation, etc. These are all things that help, as they are defined to, with many different sets of possible “terminal” goals. My next claim, therefore, is that these need to be considered “more terminal” rather than “purely instrumental for the purposes of some arbitrary terminal goal.” This is for basically the same reason as considering “pursuit of happiness” terminal: they are more likely to already be there or to be deducible from basic principles.
That way, we don’t really need to make a hard and sharp distinction between “terminal” and “instrumental” nor posit that the former has to be defined by some opaque, hidden, or non-modifiable utility function that someone else has written down or programmed somewhere.
I want to make sure we both at least understand each other’s cruxes at this point before moving on.