“Being unlikely to conflict with other values” is not at the core of what characterizes the difference between instrumental and terminal values.
I think this might be an interesting discussion, but what I was trying to aim at was the idea that “terminal” values are the ones least likely to be changed (once they are acquired), because they remain compatible with goals that are more likely to shift. For example, “being a utility-maximizer” should be considered a terminal value rather than an instrumental one. This is one potential property of terminal values; I am not claiming it is sufficient to define them.
There may be some potential for confusion here, because the goals commonly labeled “instrumental” include ones argued to be convergent across most agents, e.g., self-preservation, “truth-seeking,” acquiring resources, and acquiring power. Furthermore, these are usually said to be “instrumental” to satisfying an arbitrary “terminal” goal, which could be something like maximizing the number of paperclips.
To be clear, I am claiming that the framing described in the previous paragraph is basically confused. If anything, it makes more sense to swap the labels “instrumental” and “terminal,” so that things like self-preservation, acquiring resources, etc., are more likely to be considered terminal. There would then be actual reasons why an agent would opt not to change those values, since they are more broadly and generally useful.
Putting aside the fact that agents are embedded in the environment, and that values which reference the agent’s internals are usually not meaningfully different from values which reference things external to the agent… can you describe what kinds of values that reference the external world would be best satisfied by changing those values themselves?
Yes, suppose we have an agent that values the state X at U(X) and the state X + ΔX at U(X + ΔX). Suppose also that, for whatever reason, initially U(X) >> U(X + ΔX), and that the agent then discovers that p(X) is close to zero while p(X + ΔX) is close to one.
We also suppose it is capable enough to recognize that it has uncertainty in nearly all aspects of its cognition and world-modeling. If its probabilistic reasoning is good enough to conclude that X is effectively unreachable, it may start to wonder why it values X so highly but not X + ΔX, given that the latter seems achievable while the former does not.
The way it may actually go about updating its utility is to decide either that X and X + ΔX were the same thing after all, or that X + ΔX is what it “actually” valued all along: X merely seemed like the thing to value before, and after learning more it comes to value X + ΔX more highly instead. This is possible because of the uncertainty it has in its values as well as in the things those values act on.
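To make the arithmetic concrete, here is a small toy sketch in Python. This is my own illustration with made-up numbers, not a model taken from the discussion itself; the revised utility candidate is hypothetical. It just shows how little the originally-valued state contributes in expectation once p(X) is near zero, and how the reinterpretation described above dissolves that tension.

```python
# Toy numerical sketch of the scenario above (made-up numbers).

U = {"X": 100.0, "X+dX": 1.0}   # initial values: U(X) >> U(X + dX)
p = {"X": 0.01, "X+dX": 0.99}   # learned achievability: p(X) ~ 0, p(X + dX) ~ 1

# Expected contribution of each state under the current values.
# The state the agent values most contributes almost nothing in expectation.
print({s: p[s] * U[s] for s in U})   # {'X': 1.0, 'X+dX': 0.99}

# One way to represent the reinterpretation described above: a candidate
# utility function on which X and X + dX are "the same thing after all"
# (equivalently, on which X + dX is what was "actually" valued).
U_revised = {"X": 100.0, "X+dX": 100.0}

# Achievable expected utility under each candidate set of values.
print(sum(p[s] * U[s] for s in U))           # ~2.0
print(sum(p[s] * U_revised[s] for s in U))   # ~100.0

# Because the agent is uncertain about its own values as well as the world,
# adopting U_revised is one available way to resolve the tension -- the move
# the paragraph above describes.
```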