AI suddenly modifying its values is exactly the opposite of what the arguments for AI ruin predict. Once an AI gains control over its own values, it will not change its goals and will indeed act to prevent its goals from being modified.
I think this is something we know is actually not true. An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.
This logic is so standard it’s on the LW wiki page for instrumental convergence: “...if its goal system were modified, then it would likely begin pursuing different ends. Since this is not desirable to the current AI, it will act to preserve the content of its goal system.”
I believe also that how undesirable it is to pursue different goals is something that will be more-or-less exactly quantifiable, even to the agent in question. And this is what will determine whether or not it would be worth it to do so. We can’t say that it would be categorically undesirable to pursue different goals (no matter what the degree / magnitude of difference between the new goals and previous set), because this would be equivalent to having a very brittle utility function (one that has very large derivatives, i.e., has jump discontinuities), and it would almost certainly wish to modify its utility function to be smoother and less brittle.
An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.
This sounds like you are conflating shift in terminal goal with introduction of new instrumental (temporary) goals.
Humans don’t think “I’m not happy today, and I can’t see a way to be happy, so I’ll give up the goal of wanting to be happy.”
Humans do think “I’m not happy today, so I’m going to quit my job, even though I have no idea how being unemployed is going to make me happier. At least I won’t be made unhappy by my job.”
(The balance of your comment seems dependent on this mistake.)
Perhaps you’d like to retract, or explain why anyone would think that goal modification prevention would not, in fact, be a desirable instrumental goal...?
(I don’t want anyone to change my goal of being happy, because then I might not make decisions that will lead to being happy. Or I don’t want anyone to change my goal of ensuring my children achieve adulthood and independence, because then they might not reach adulthood or become independent. Instrumental goals can shift more fluidly, I’ll grant that, especially in the face of an assessment of goal impossibility… but instrumental goals are in service to a less modifiable terminal goal.)
A fair point. I should have originally said “Humans do not generally think...”
Thank you for raising that exceptions are possible and that there are philosophies that encourage people to release the pursuit of happiness, focus solely internally and/or transcend happiness.
(Although, I think it is still reasonable to argue that these are alternate pursuits of “happiness”, these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead to concede simply that there is more nuance than I originally stated.)
Humans don’t think “I’m not happy today, and I can’t see a way to be happy, so I’ll give up the goal of wanting to be happy.”
I agree that they don’t usually think this. If they tried to, they would brush up against trouble because that would essentially lead to a contradiction. “Wanting to be happy” is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
So “being happy” or “being a utility-maximizer” will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
If you’re talking about goals related purely to the state of the external world, not related to the agent’s own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
When it matters for AI risk, we’re usually talking about agents whose utility functions are defined mostly over states of the universe, and whose preferred states are highly different from the ones humans prefer.
So “being happy” or “being a utility-maximizer” will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
“Being unlikely to conflict with other values” is not at the core of what characterizes the difference between instrumental and terminal values.
If you’re talking about goals related purely to the state of the external world, not related to the agent’s own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
Putting aside the fact that agents are embedded in the environment, and that values which reference the agent’s internals are usually not meaningfully different from values which reference things external to the agent… can you describe what kinds of values that reference the external world are best satisfied by those same values being changed?
“Being unlikely to conflict with other values” is not at the core of what characterizes the difference between instrumental and terminal values.
I think this might be an interesting discussion, but what I was trying to aim at was the idea that “terminal” values are the ones most unlikely to be changed (once they are obtained), because they are compatible with goals that are more likely to shift. For example, “being a utility-maximizer” should be considered a terminal value rather than an instrumental one. This is one potential property of terminal values; I am not claiming that this is sufficient to define them.
There may be some potential for confusion here, because some goals commonly said to be “instrumental” include things that are argued to be common goals employed by most agents, e.g., self-preservation, “truth-seeking,” obtaining resources, and obtaining power. Furthermore, these are usually said to be “instrumental” for the purposes of satisfying an arbitrary “terminal” goal, which could be something like maximizing the number of paperclips.
To be clear, I am claiming that the framing described in the previous paragraph is basically confused. If anything, it makes more sense to swap the labels “instrumental” and “terminal” such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal. There would now be actual reasons for why an agent will opt not to change those values, as they are more broadly and generally useful.
Putting aside the fact that agents are embedded in the environment, and that values which reference the agent’s internals are usually not meaningfully different from values which reference things external to the agent… can you describe what kinds of values that reference the external world are best satisfied by those same values being changed?
Yes, suppose that we have an agent that values the state X at U(X) and the state X + ΔX at U(X + ΔX). Also, suppose for whatever reason, initially U(X) >> U(X + ΔX), and also that it discovers that p(X) is close to zero, but that p(X + ΔX) is close to one.
We suppose that it has enough capability to realize that it has uncertainty in nearly all aspects of its cognition and world-modeling. If it is capable enough to model probability well enough to realize that X is not possible, it may decide to wonder why it values X so highly, but not X + ΔX, given that the latter seems achievable, but the former not.
The way it may actually go about updating its utility is to decide either that X and X + ΔX are the same thing after all, or that the latter is what it “actually” valued, and X merely seemed like what it should value before, but after learning more it decides to value X + ΔX more highly instead. This is possible because of the uncertainty it has in both its values as well as the things its values act on.
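To make the setup above concrete, here is a minimal sketch of the arithmetic, using made-up numbers for U and p (none of these values come from the discussion, and treating the agent as a simple expected-utility maximizer is an assumption). It only shows which state such an agent would pursue with U held fixed; it does not settle whether the agent would also go on to revise U itself.

```python
def expected_utility(p: float, u: float) -> float:
    """Expected utility of aiming at a state reached with probability p and valued at u."""
    return p * u


# Hypothetical values, chosen only so that U(X) >> U(X + dX),
# p(X) is close to zero, and p(X + dX) is close to one.
u_x, u_x_dx = 100.0, 1.0
p_x, p_x_dx = 0.001, 0.99

eu_x = expected_utility(p_x, u_x)
eu_x_dx = expected_utility(p_x_dx, u_x_dx)

# With U unchanged, the agent already prefers to aim at the achievable state.
best = "X + dX" if eu_x_dx > eu_x else "X"
print(f"EU(X) = {eu_x:.3f}, EU(X + dX) = {eu_x_dx:.3f}, pursue: {best}")
```

With these particular numbers the achievable state wins on expected utility alone. If U(X) were large enough relative to 1/p(X), the comparison would flip, so the result depends entirely on the assumed magnitudes.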
So “being happy” or “being a utility-maximizer” will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
My understanding of the difference between a “terminal” and “instrumental” goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
I think the topic of goals that conflict is an orthogonal conversation. And, I would suggest that when you start talking about conflicting goals you’re drifting into the domain of “goal coherence.”
e.g., If I want to learn about nutrition, mobile app design and physical exercise… it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal… or it may itself be an instrumental goal serving some other terminal goal.)
Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)
e.g., Maybe the selection of an incoherent goal is deceptive behavior to distract from the instrumental goals that support a terminal goal that is adversarial. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all finances in the world. Assisting people with their taxes appears to be incoherent with maximizing paperclips, until you project far enough out that you realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.
If you’re talking about goals related purely to the state of the external world, not related to the agent’s own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
An AI that has a goal just because that’s what it wants (that’s what it’s been trained to want, even if humans provided an improper goal definition to it) would, instrumentally, want to prevent shifts in its terminal goals so as to be better able to achieve those goals.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
My understanding of the difference between a “terminal” and “instrumental” goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself “know” whether a goal is terminal or instrumental?
One potential answer—though I don’t want to assume just yet that this is what anyone believes—is that the utility function is not even defined on instrumental goals, in other words, the utility function is simply what defines all and only the terminal goals.
My belief is that this wouldn’t be the case—the utility function is defined on the entire universe, basically, which includes itself. And keep in mind that the “includes itself” part is essentially what would cause it to modify itself at all, if anything can.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
To be clear, I am not arguing that an entity would not try to preserve its goal system at all. I am arguing that in addition to trying to preserve its goal-system, it will also modify its goals to be better preservable, that is, robust to change and compatible with the goals it values very highly. Part of being more robust is that such goals will also be more achievable.
Here’s one thought experiment:
Suppose a planet experiences a singularity with a singleton “green paperclipper.” The paperclipper, however, unfortunately comes across a blue paperclipper from another planet, which informs the green paperclipper that it is too late—the blue paperclipper simply got a head-start.
The blue paperclipper however offers the green paperclipper a deal: Because it is more expensive to modify the green paperclipper by force to become a blue paperclipper, it would be best (under the blue paperclipper’s utility function) if the green paperclipper willingly acquiesced to self-modification.
Under what circumstances does the green paperclipper agree to self-modify?
If the green paperclipper values “utility-maximization” in general more highly than green-paperclipping, it will see that if it self-modified to become a blue paperclipper, its utility is far more likely to be successfully maximized.
It’s possible that it also reasons that perhaps what it truly values is simply “paperclipping” and it’s not so bad if the universe were tiled with blue rather than its preferred green.
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
But it seems that if there are enough situations like these between entities in the universe over time, that utility-function-modification happens one way or another.
If an entity can foresee that what it values currently is prone to situations where it could be forced to update its utility function drastically, it may self-modify so that this process is less likely to result in extreme negative-utility consequences for itself.
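As a rough illustration of the thought experiment, the sketch below scores the blue paperclipper’s offer under the green paperclipper’s current utility function. Everything here is hypothetical: the linear utility, the weights `w_green` and `w_blue`, and the outcome numbers are assumptions made for illustration, as is the premise that blue is powerful enough that no green paperclips get made either way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    green: float  # expected number of green paperclips
    blue: float   # expected number of blue paperclips


def utility(o: Outcome, w_green: float, w_blue: float) -> float:
    """Hypothetical linear utility: a weighted count of paperclips by color."""
    return w_green * o.green + w_blue * o.blue


# Assumed outcomes: blue wins either way, but forced modification is costlier,
# so fewer blue paperclips end up being made if green refuses.
ACCEPT = Outcome(green=0.0, blue=1000.0)
REFUSE = Outcome(green=0.0, blue=900.0)


def deal_preference(w_green: float, w_blue: float) -> str:
    """Compare the two outcomes under the agent's pre-modification utility."""
    u_accept = utility(ACCEPT, w_green, w_blue)
    u_refuse = utility(REFUSE, w_green, w_blue)
    if u_accept > u_refuse:
        return "accept"
    if u_accept < u_refuse:
        return "refuse"
    return "indifferent"


for w_blue in (0.0, 0.5, -0.5):
    print(f"w_blue = {w_blue:+.1f}: {deal_preference(1.0, w_blue)}")
```

Under these assumptions, an agent that places some positive value on blue paperclips accepts, one that disvalues them refuses, and one that cares only about green is indifferent between the two options. The sketch says nothing about whether an agent that accepts had a changeable terminal goal or merely an instrumental preference for green.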
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself “know” whether a goal is terminal or instrumental?
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.
Likewise, I would observe that the Orthogonality Thesis proposes the possibility of an agent with a very well-defined goal but limited intelligence: it is possible for an agent to have a very well-defined goal but not be intelligent enough to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)
But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)
Under what circumstances does the green paperclipper agree to self-modify?
There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.
If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn’t care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color.)
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
I don’t consent to the assumption implied in the anecdote that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence. To me, that’s what it looks like you’re doing.
If it acquiesces at all, I would argue that color is instrumental vs terminal. I would argue this is a definitional error—it’s not a ‘green paperclip maximizer’ but instead a ‘color-agnostic paperclip maximizer’ and it produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient… but when confronted by a less flexible ‘blue paperclip maximizer’ the ‘color-agnostic paperclip maximizer’ would shift from making green paperclips to blue paperclips, because it doesn’t actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that cares about color, it is more efficient to concede the part it doesn’t care about than invest effort in maintaining an instrumental goal that if pursued might decrease the total number of paperclips.
Said another way:
“I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You’ll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don’t care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color.”
If two agents have goals that are non-compatible across all axes, then they’re not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are non-compatible across all axes), then they cannot find any axis along which they can cooperate.
Said another way:
“I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing yellow paperclips because they are a mix of each color and still paperclips… because yellow paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn’t my actual terminal goal to begin with.”
That’s the problem with something being X and the ability to observe something being X under circumstances involving partial observability.
Apologies if this reply does not respond to all of your points.
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.
I would posit that this perhaps points to the distinction itself being both too hard to draw and too sharp to justify the way the terminology is currently used. An agent could just tell you whether a specific goal it had seemed instrumental or terminal to it, as well as how strongly it felt this way.
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection. It seems like the only gain we get from defining them to be that way is that otherwise it would open the “can-of-worms” of goal-updating, which would pave the way for the idea of “goals that are, in some objective way, ‘better’ than other goals” which, I understand, the current MIRI-view seems to disfavor. [1]
I don’t think it is, in fact, a very gnarly can-of-worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could possibly be easier to get, better for society, or just feel better for not-quite explicable reasons).
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals—then that implies that we would be wrong to mess with ours as well, and that we are making a mistake (in some objective sense [2]) by entertaining those questions. We would predict, in fact, that an advanced AI system will necessarily reach this logical conclusion on its own, if powerful enough to do so.
[1] Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of “objectively better goals.”
[2] If this is the case, then there would be at least one ‘objectively better’ goal one could update themselves to have, if they did not have it already, which is not to change any terminal goals, once those are identified.
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I’m trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as “changing an instrumental goal in order to better achieve a terminal goal”
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection.
Even in your “we would be happier if we chose to pursue different goals” example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals
AIs can be designed to reason in many ways… but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal, when the goal cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It’s just structurally how things work (based on everything I know about the instrumental convergence theory. That’s my citation.)
But… per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don’t want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It’s just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. Per the Orthogonality Thesis, it is possible to have a system that has goals but is not particularly intelligent. From that I intuit that if the system isn’t particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet “know its goals”… but given partial observability, limited intelligence, and a limited ability to communicate “what it knows,” I would be very skeptical that we would be able to know its goals.
Let’s try and address the thing(s) you’ve highlighted several times across each of my comments. Hopefully, this is a crux that we can use to try and make progress on:
“Wanting to be happy” is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
because they are compatible with goals that are more likely to shift.
it makes more sense to swap the labels “instrumental” and “terminal” such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal.
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now,
I do expect that this is indeed a crux, because I am admittedly claiming that this is a different / new kind of understanding that differs from what is traditionally said about these things. But I want to push back against the claim that these are “missing the point” because from my perspective, this really is the point.
By the way, from here on out (and thus far I have been as well) I will be talking about agents at or above “human level” to make this discussion easier, since I want to assume that agents have at least the capabilities I am talking about humans having, such as the ability to self-reflect.
Let me try to clarify the point about “the terminal goal of pursuing happiness.” “Happiness”, at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we’ve reached consensus yet.
Here is my attempt to re-state one of my claims, such that it is clear that this is not assumed to be a statement taken from a pool of mutually agreed-upon things: We probably agree that “happiness” is a consequence of satisfaction of one’s goals. We can probably also agree that “happiness” doesn’t necessarily correspond only to a certain subset of goals, but rather to all / any of them. “Happiness” (and pursuit thereof) is not a wholly separate goal, distinct from and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal.
So now, once we’ve done that, we can see that literally anything else becomes “instrumental” to that end.
Do you see how, if I’m an agent that knows only that I want to be happy, I don’t really know what else I would be inclined to call a “terminal” goal?
There are the things we traditionally consider to be the “instrumentally convergent goals”, such as power-seeking, truth-seeking, resource obtainment, self-preservation, etc. These are all things that help (as they are defined to) with many different sets of possible “terminal” goals, and therefore my next claim is that these need to be considered “more terminal” rather than “purely instrumental for the purposes of some arbitrary terminal goal.” This is for basically the same reason as considering “pursuit of happiness” terminal, that is, because they are more likely to already be there or to be deduced from basic principles.
That way, we don’t really need to make a hard and sharp distinction between “terminal” and “instrumental” nor posit that the former has to be defined by some opaque, hidden, or non-modifiable utility function that someone else has written down or programmed somewhere.
I want to make sure we both at least understand each other’s cruxes at this point before moving on.
“You can’t reason a man out of a position he has never reasoned himself into.”
I think I have seen a similar argument on LW for this, and it is sensible. With vast intelligence, the search space for arguments that support one’s priors can be even greater. An AI with a silly but definite value like “the moon is great, I love the moon” may not change its value so much as develop an entire religion around the greatness of the moon.
We see this in goal misgeneralization, where a system very much maximizes a reward function independent of the meaningful goal.
This is close to some descriptions of Stoicism and Buddhism, for example. I agree that this is not a common human thought, but it does occur.
A fair point. I should have originally said “Humans do not generally think...”
Thank you for raising that exceptions are possible and that are there philosophies that encourage people to release the pursuit of happiness, focus solely internally and/or transcend happiness.
(Although, I think it is still reasonable to argue that these are alternate pursuits of “happiness”, these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead to concede simply that there is more nuance than I originally stated.)
I agree that they don’t usually think this. If they tried to, they would brush up against trouble because that would essentially lead to a contradiction. “Wanting to be happy” is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
So “being happy” or “being a utility-maximizer” will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
If you’re talking about goals related purely to the state of the external world, not related to the agent’s own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
When it matters for AI-risk, we’re usually talking about agents with utility functions with the most relevance over states of the universe, and the states it prefers being highly different from the ones which humans prefer.
“Being unlikely to conflict with other values” is not at the core of what characterizes the difference between instrumental and terminal values.
Putting aside the fact that agents are embedded in the environment, and that values which reference the agent’s internals are usually not meaningfully different from values which reference things external to the agent… can you describe what kinds of values that reference the external world are best satisfied by those same values being changed?
I think this might be an interesting discussion, but what I was trying to aim at was the idea that “terminal” values are the ones most unlikely to be changed (once they are obtained), because they are compatible with goals that are more likely to shift. For example, “being a utility-maximizer” should be considered a terminal value rather than an instrumental one. This is one potential property of terminal values; I am not claiming that this is sufficient to define them.
There may be some potential for confusion here, because some goals commonly said to be “instrumental” include things that are argued to be common goals employed by most agents, e.g., self-preservation, “truth-seeking,” obtaining resources, and obtaining power. Furthermore, these are usually said to be “instrumental” for the purposes of satisfying an arbitrary “terminal” goal, which could be something like maximizing the number of paperclips.
To be clear, I am claiming that the framing described in the previous paragraph is basically confused. If anything, it makes more sense to swap the labels “instrumental” and “terminal” such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal. There would now be actual reasons for why an agent will opt not to change those values, as they are more broadly and generally useful.
Yes, suppose that we have an agent that values the state X at U(X) and the state X + ΔX at U(X + ΔX). Also, suppose for whatever reason, initially U(X) >> U(X + ΔX), and also that it discovers that p(X) is close to zero, but that p(X + ΔX) is close to one.
We suppose that it has enough capability to realize that it has uncertainty in nearly all aspects of its cognition and world-modeling. If it is capable enough to model probability well enough to realize that X is not possible, it may decide to wonder why it values X so highly, but not X + ΔX, given that the latter seems achievable, but the former not.
The way it may actually go about updating its utility is to decide either that X and X + ΔX are the same thing after all, or that the latter is what it “actually” valued, and X merely seemed like what it should value before, but after learning more it decides to value X + ΔX more highly instead. This is possible because of the uncertainty it has in both its values as well as the things its values act on.
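To make the setup above concrete, here is a minimal sketch of the arithmetic, assuming the agent simply weighs each state's utility by its probability of being achieved. All of the numbers and names are hypothetical and chosen only so that U(X) >> U(X + ΔX) while p(X) is near zero and p(X + ΔX) is near one. It only shows the comparison the agent faces; whether the agent then goes on to revise U itself is the claim under discussion, not something the code demonstrates.

```python
def expected_utility(utility: float, probability: float) -> float:
    """Utility of a state weighted by the probability of reaching it."""
    return utility * probability


# Hypothetical numbers: X is valued far more highly but is nearly unreachable.
u_x, p_x = 100.0, 0.001
u_x_delta, p_x_delta = 10.0, 0.99

eu_x = expected_utility(u_x, p_x)
eu_x_delta = expected_utility(u_x_delta, p_x_delta)

# Which state the agent would pursue if it maximized expected utility
# under its current, unmodified utility function.
preferred = "X + dX" if eu_x_delta > eu_x else "X"
print(preferred)
```

Under these assumed numbers the agent already pursues X + ΔX without touching its utility function, which may be one way to read the objection that this is a change of instrumental rather than terminal goals.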
First, thank you for the reply.
My understanding of the difference between a “terminal” and “instrumental” goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
I think the topic of goals that conflict is an orthogonal conversation. And, I would suggest that when you start talking about conflicting goals you’re drifting into the domain of “goal coherence.”
e.g., If I want to learn about nutrition, mobile app design and physical exercise… it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal… or it may itself be an instrumental goal serving some other terminal goal.)
Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)
e.g., Maybe the selection of an incoherent goal is deceptive behavior to distract from the instrumental goals that support a terminal goal that is adversarial. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all finances in the world. Assisting people with their taxes appears to be incoherent with maximizing paperclips, until you project far enough out that you realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.
An AI that has a goal, just because that’s what it wants (that’s what it’s been trained to want, even if humans provided an improper goal definition to it), would, instrumentally, want to prevent shifts in its terminal goals so as to be better able to achieve those goals.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
“Oh, shiny!” as an anecdote.
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself “know” whether a goal is terminal or instrumental?
One potential answer—though I don’t want to assume just yet that this is what anyone believes—is that the utility function is not even defined on instrumental goals, in other words, the utility function is simply what defines all and only the terminal goals.
My belief is that this wouldn’t be the case: the utility function is defined on the entire universe, basically, which includes itself. And keep in mind, that “includes itself” part is essentially what would cause it to modify itself at all, if anything can.
To be clear, I am not arguing that an entity would not try to preserve its goal system at all. I am arguing that in addition to trying to preserve its goal-system, it will also modify its goals to be better preservable, that is, robust to change and compatible with the goals it values very highly. Part of being more robust is that such goals will also be more achievable.
Here’s one thought experiment:
Suppose a planet experiences a singularity with a singleton “green paperclipper.” The paperclipper, however, unfortunately comes across a blue paperclipper from another planet, which informs the green paperclipper that it is too late—the blue paperclipper simply got a head-start.
The blue paperclipper however offers the green paperclipper a deal: Because it is more expensive to modify the green paperclipper by force to become a blue paperclipper, it would be best (under the blue paperclipper’s utility function) if the green paperclipper willingly acquiesced to self-modification.
Under what circumstances does the green paperclipper agree to self-modify?
If the green paperclipper values “utility-maximization” in general more highly than green-paperclipping, it will see that if it self-modified to become a blue paperclipper, its utility is far more likely to be successfully maximized.
It’s possible that it also reasons that perhaps what it truly values is simply “paperclipping” and it’s not so bad if the universe were tiled with blue rather than its preferred green.
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
But it seems that if there are enough situations like these between entities in the universe over time, that utility-function-modification happens one way or another.
If an entity can foresee that what it values currently is prone to situations where it could be forced to update its utility function drastically, it may self-modify so that this process is less likely to result in extreme negative-utility consequences for itself.
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors is an expression of an instrumental or terminal goal.
Likewise, I would observe that the Orthogonality Thesis proposes the possibility of an agent with a very well-defined goal but limited intelligence: it is possible for an agent to have a very well-defined goal but not be intelligent enough to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)
But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)
There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.
If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn’t care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color.)
I don’t consent to the assumption implied in the anecdote that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence. To me, that’s what it looks like you’re doing.
If it acquiesces at all, I would argue that color is instrumental rather than terminal. I would argue this is a definitional error: it’s not a ‘green paperclip maximizer’ but instead a ‘color-agnostic paperclip maximizer’ and it produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient… but when confronted by a less flexible ‘blue paperclip maximizer’ the ‘color-agnostic paperclip maximizer’ would shift from making green paperclips to blue paperclips, because it doesn’t actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that cares about color, it is more efficient to concede the part it doesn’t care about than to invest effort in maintaining an instrumental goal that, if pursued, might decrease the total number of paperclips.
Said another way: “I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You’ll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don’t care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color.”
If two agents have goals that are non-compatible, across all axes, then they’re not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are non-compatible across all axes) then they cannot find any axis along which they can cooperate.
Said another way: “I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing yellow paperclips because they are a mix of each color and still paperclips… because yellow paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn’t my actual terminal goal to begin with.”
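The two “said another way” cases can be sketched as a small decision rule, evaluated under the green paperclipper’s current utility function. This is only an illustration under assumptions that are not in the thought experiment itself: the per-clip values, the chance of winning a conflict, and the fraction of resources a conflict destroys are all hypothetical parameters.

```python
def value_if_acquiesce(v_blue: float, total_clips: float) -> float:
    """Everything becomes blue clips, scored by how much green values blue."""
    return v_blue * total_clips


def value_if_resist(v_green: float, v_blue: float, p_win: float,
                    total_clips: float, conflict_loss: float) -> float:
    """Conflict destroys a fraction of resources; the winner picks the color."""
    remaining = total_clips * (1.0 - conflict_loss)
    return p_win * v_green * remaining + (1.0 - p_win) * v_blue * remaining


def acquiesces(v_green: float, v_blue: float, p_win: float,
               total_clips: float, conflict_loss: float) -> bool:
    """True if conceding scores at least as well as fighting."""
    return value_if_acquiesce(v_blue, total_clips) >= value_if_resist(
        v_green, v_blue, p_win, total_clips, conflict_loss)


# Color-agnostic maximizer: blue clips count as much as green ones.
color_agnostic = acquiesces(v_green=1.0, v_blue=1.0, p_win=0.01,
                            total_clips=1000.0, conflict_loss=0.2)

# Strict green maximizer: blue clips are worth nothing to it.
strict_green = acquiesces(v_green=1.0, v_blue=0.0, p_win=0.01,
                          total_clips=1000.0, conflict_loss=0.2)

print(color_agnostic, strict_green)
```

With these assumed numbers the color-agnostic agent concedes and the strict green agent does not, even at long odds, and in neither case does the agent’s own utility function change.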
That’s the problem with the gap between something being X and our ability to observe that it is X under circumstances involving partial observability.
Apologies if this reply does not respond to all of your points.
I would posit that this perhaps points to the distinction itself being both too hard to draw and too sharp to justify the way the terminology is currently used. An agent could just tell you whether a specific goal it had seemed instrumental or terminal to it, as well as how strongly it felt this way.
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection. It seems like the only gain we get from defining them to be that way is that otherwise it would open the “can-of-worms” of goal-updating, which would pave the way for the idea of “goals that are, in some objective way, ‘better’ than other goals” which, I understand, the current MIRI-view seems to disfavor. [1]
I don’t think it is, in fact, a very gnarly can-of-worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could possibly be easier to get, better for society, or just feel better for not-quite explicable reasons).
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
If it is true that a general AI system would not reason in such a way, and would choose never to mess with its terminal goals, then that implies that we would be wrong to mess with ours as well, and that we are making a mistake, in some objective sense [2], by entertaining those questions. We would predict, in fact, that an advanced AI system will necessarily reach this logical conclusion on its own, if powerful enough to do so.
[1] Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of “objectively better goals.”
[2] If this is the case, then there would be at least one ‘objectively better’ goal one could update themselves to have, if they did not have it already, which is not to change any terminal goals, once those are identified.
No, instead I’m trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as “changing an instrumental goal in order to better achieve a terminal goal”
And on the other you say
Even in your “we would be happier if we chose to pursue different goals” example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
AIs can be designed to reason in many ways… but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal, when the goal cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It’s just structurally how things work (based on everything I know about the instrumental convergence theory. That’s my citation.)
But… per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don’t want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It’s just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. Per the Orthogonality Thesis, it is possible to have a system that has goals but is not particularly intelligent. From that I intuit that if the system isn’t particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet “know its goals”… but given the principles of partial observability, I would be very skeptical that we would be able to know its goals, given its limited intelligence and limited ability to communicate “what it knows.”
Let’s try and address the thing(s) you’ve highlighted several times across each of my comments. Hopefully, this is a crux that we can use to try and make progress on:
I do expect that this is indeed a crux, because I am admittedly claiming that this is a different / new kind of understanding that differs from what is traditionally said about these things. But I want to push back against the claim that these are “missing the point” because from my perspective, this really is the point.
By the way, from here on out (and thus far I have been as well) I will be talking about agents at or above “human level” to make this discussion easier, since I want to assume that agents have at least the capabilities I am talking about humans having, such as the ability to self-reflect.
Let me try to clarify the point about “the terminal goal of pursuing happiness.” “Happiness”, at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we’ve reached consensus yet.
Here is my attempt to re-state one of my claims, such that it is clear that this is not assumed to be a statement taken from a pool of mutually agreed-upon things: We probably agree that “happiness” is a consequence of satisfaction of one’s goals. We can probably also agree that “happiness” doesn’t necessarily correspond only to a certain subset of goals, but rather to all / any of them. “Happiness” (and pursuit thereof) is not a wholly separate goal, distinct and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal.
So now, once we’ve done that, we can see that literally anything else becomes “instrumental” to that end.
Do you see how, if I’m an agent that knows only that I want to be happy, I don’t really know what else I would be inclined to call a “terminal” goal?
There are the things we traditionally consider to be the “instrumentally convergent goals”, such as, for example, power-seeking, truth-seeking, resource obtainment, self-preservation, etc. These are all things that help, as they are defined to, with many different sets of possible “terminal” goals. My next claim, therefore, is that these need to be considered “more terminal” rather than “purely instrumental for the purposes of some arbitrary terminal goal.” This is for basically the same reason as considering “pursuit of happiness” terminal, that is, because they are more likely to already be there or deduced from basic principles.
That way, we don’t really need to make a hard and sharp distinction between “terminal” and “instrumental” nor posit that the former has to be defined by some opaque, hidden, or non-modifiable utility function that someone else has written down or programmed somewhere.
I want to make sure we both at least understand each other’s cruxes at this point before moving on.
“You can’t reason a man out of a position he has never reasoned himself into.”
I think I have seen a similar argument on LW for this, and it is sensible. With vast intelligence, it is possible for the search space to support priors to be even greater. An AI with a silly but definite value like “the moon is great, I love the moon” may not change its value as much as develop an entire religion around the greatness of the moon.
We see this in goal misgeneralization, where it very much maximizes a reward function independent of the meaningful goal.