Apologies if this reply does not respond to all of your points.
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would let us determine, with any degree of certainty, whether a given set of behaviors is an expression of an instrumental or a terminal goal.
I would posit that perhaps this points to the distinction itself being both too hard to draw and too sharp to justify the terminology as it is currently used. An agent could simply tell you whether a specific goal it had seemed instrumental or terminal to it, as well as how strongly it felt that way.
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection. It seems like the only thing we gain from defining them that way is avoiding the “can-of-worms” of goal-updating, which would pave the way for the idea of “goals that are, in some objective way, ‘better’ than other goals”—which, as I understand it, the current MIRI-view seems to disfavor. [1]
I don’t think it is, in fact, a very gnarly can-of-worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could possibly be easier to get, better for society, or just feel better for not-quite explicable reasons).
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
If it is true that a general AI system would not reason in such a way—and would choose never to mess with its terminal goals—then that implies that we would be wrong to mess with ours as well, and that we are making a mistake, in some objective sense [2], by entertaining those questions. We would predict, in fact, that an advanced AI system would necessarily reach this conclusion on its own, if powerful enough to do so.
Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of “objectively better goals.”
If this is the case, then there would be at least one ‘objectively better’ goal one could update oneself to have, if one did not have it already: namely, not to change any terminal goals once those are identified.
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I’m trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as “changing an instrumental goal in order to better achieve a terminal goal”
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection.
Even in your “we would be happier if we chose to pursue different goals” example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals
AIs can be designed to reason in many ways… but some approaches to reasoning are brittle and potentially unsuccessful. To achieve a terminal goal that cannot be achieved in a single step, an intelligence must adopt instrumental goals; failing to do so results in ineffective pursuit of the terminal goal. It’s just structurally how things work (based on everything I know about instrumental convergence theory; that’s my citation).
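The structural claim above can be illustrated with a toy sketch (my own construction, not any standard alignment formalism): when a terminal goal is more than one action away, every intermediate state on the agent’s plan functions as an instrumental goal. The state names and the `ACTIONS` graph are hypothetical.

```python
# Toy model: a terminal goal that cannot be reached in one step forces
# the agent to adopt intermediate states as instrumental goals.
from collections import deque

# Hypothetical state graph: which states are reachable in one action.
ACTIONS = {
    "start": ["has_tools", "has_plan"],
    "has_tools": ["has_resources"],
    "has_plan": ["has_resources"],
    "has_resources": ["goal_achieved"],
}

def plan(start, terminal_goal):
    """Breadth-first search for a path to the terminal goal; every
    intermediate state on the returned path is, in effect, an
    instrumental goal the agent must adopt along the way."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == terminal_goal:
            return path
        for nxt in ACTIONS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = plan("start", "goal_achieved")
instrumental = path[1:-1]  # everything between the start and the terminal goal
```

An agent that refused to adopt `has_tools` or `has_resources` as goals would simply never reach `goal_achieved`, which is the “ineffective pursuit” failure mode described above.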
But… per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don’t want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It’s just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. Per the Orthogonality Thesis, it is possible to have a system with goals that is not particularly intelligent. From that, it seems reasonable to me that if the system isn’t particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet “know its goals”… but given partial observability, limited intelligence, and a limited ability to communicate “what it knows,” I would be very skeptical that we would be able to know its goals.
Let’s try and address the thing(s) you’ve highlighted several times across each of my comments. Hopefully, this is a crux that we can use to try and make progress on:
“Wanting to be happy” is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
because they are compatible with goals that are more likely to shift.
it makes more sense to swap the labels “instrumental” and “terminal” such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal.
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now,
I do expect that this is indeed a crux, because I am admittedly claiming that this is a different / new kind of understanding that differs from what is traditionally said about these things. But I want to push back against the claim that these are “missing the point” because from my perspective, this really is the point.
By the way, from here on out (as I have been thus far), I will be talking about agents at or above “human level,” to make this discussion easier, since I want to assume that agents have at least the capabilities I have been ascribing to humans, such as the ability to self-reflect.
Let me try to clarify the point about “the terminal goal of pursuing happiness.” “Happiness”, at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we’ve reached consensus yet.
Here is my attempt to re-state one of my claims, such that it is clear that this is not assumed to be a statement taken from a pool of mutually agreed-upon things: We probably agree that “happiness” is a consequence of satisfaction of one’s goals. We can probably also agree that “happiness” doesn’t necessarily correspond only to a certain subset of goals—but rather to all / any of them. “Happiness” (and pursuit thereof) is not a wholly-separate goal distant and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal.
So now, once we’ve done that, we can see that literally anything else becomes “instrumental” to that end.
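The “self-referential” reading of happiness above can be made concrete with a small sketch (my own attempt at formalizing the claim; the goals and the averaging rule are hypothetical): happiness is not a separate goal alongside the others, but a function of how satisfied the agent’s other goals are, whatever they happen to be.

```python
# Sketch: "happiness" as a self-referential goal, i.e. a function over
# the satisfaction levels (0..1) of whatever other goals the agent has.
def happiness(goal_satisfaction):
    """Hypothetical formalization: happiness as the mean satisfaction
    across all of the agent's goals, paperclip-making included."""
    return sum(goal_satisfaction.values()) / len(goal_satisfaction)

h = happiness({"make_paperclips": 0.9, "stay_alive": 0.5})
```

Under this reading, anything that raises the satisfaction of any goal raises `h`, which is the sense in which everything else becomes “instrumental” to it.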
Do you see how, if I’m an agent that knows only that I want to be happy, I don’t really know what else I would be inclined to call a “terminal” goal?
There are the things we traditionally consider to be the “instrumentally convergent goals”: for example, power-seeking, truth-seeking, resource acquisition, self-preservation, etc. These are all things that, by definition, help with many different possible sets of “terminal” goals, and therefore—this is my next claim—they need to be considered “more terminal” rather than “purely instrumental for the purposes of some arbitrary terminal goal.” This is for basically the same reason we consider “pursuit of happiness” terminal: they are more likely to already be present, or to be deduced from basic principles.
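The claim about convergent subgoals can be sketched as follows (my own toy construction; the goals and the `USEFUL_FOR` mapping are hypothetical): a subgoal that is useful across every terminal goal in some set looks “convergent,” which is the sense in which the text suggests such goals behave as if they were “more terminal.”

```python
# Toy sketch: count how often each candidate subgoal is useful across a
# set of very different terminal goals; the ones useful for all of them
# are the "instrumentally convergent" subgoals.
from collections import Counter

# Hypothetical mapping: which subgoals help with each terminal goal.
USEFUL_FOR = {
    "make_paperclips": {"obtain_resources", "self_preservation", "fold_wire"},
    "prove_theorems":  {"obtain_resources", "self_preservation", "learn_logic"},
    "grow_crops":      {"obtain_resources", "self_preservation", "till_soil"},
}

counts = Counter(g for goals in USEFUL_FOR.values() for g in goals)
convergent = sorted(g for g, n in counts.items() if n == len(USEFUL_FOR))
```

Here `obtain_resources` and `self_preservation` fall out as convergent while `fold_wire` stays tied to a single terminal goal, mirroring the distinction drawn above.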
That way, we don’t really need to make a hard and sharp distinction between “terminal” and “instrumental” nor posit that the former has to be defined by some opaque, hidden, or non-modifiable utility function that someone else has written down or programmed somewhere.
I want to make sure we both at least understand each other’s cruxes at this point before moving on.