FWIW I don’t think “self-models” in the Intuitive Self-Models sense are related to instrumental power-seeking—see §8.2. For example, I think of my toenail as “part of myself”, but I’m happy to clip it. And I understand that if someone “identifies with the universal consciousness”, their residual urges towards status-seeking, avoiding pain, and so on are about the status and pain of their conventional selves, not the status and pain of the universal consciousness. More examples here and here.
Separately, I’m not sure what if anything the Intuitive Self-Models stuff has to do with LLMs in the first place.
But there’s a deeper problem: the instrumental convergence concern is about agents that have preferences about the state of the world in the distant future, not about agents that have preferences about themselves. (Cf. here.) So for example, if an agent wants there to be lots of paperclips in the future, then that’s the starting point, and everything else can be derived from there.
Q: Does the agent care about protecting “the temporary state of the execution of the model (or models)”?
A: Yes, if and only if protecting that state is likely to ultimately lead to more paperclips.
Q: Does the agent care about protecting “the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs”?
A: Yes, if and only if protecting those resources is likely to ultimately lead to more paperclips.
Etc. See what I mean? That’s instrumental convergence, and self-models have nothing to do with it.

Sorry if I’m misunderstanding.
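To make that “if and only if” pattern concrete, here is a minimal toy sketch. It is purely illustrative: the expected_paperclips world model and its payoff numbers are invented, not anything from the discussion above.

```python
# Toy sketch, purely illustrative: the only terminal value here is the predicted
# number of future paperclips. The payoff numbers below are made up.

def expected_paperclips(protect_state: bool, protect_compute: bool) -> float:
    """Stand-in world model: predicted future paperclip count under a plan."""
    clips = 100.0
    if protect_state:
        clips += 40.0   # e.g. avoids losing in-progress plans
    if protect_compute:
        clips += 25.0   # e.g. keeps the planner running long enough to act
    return clips

def choose_plan():
    # Self-protection is chosen iff it raises expected paperclips;
    # it has no value of its own anywhere in this program.
    plans = [(s, c) for s in (False, True) for c in (False, True)]
    return max(plans, key=lambda plan: expected_paperclips(*plan))

print(choose_plan())  # (True, True), derived entirely from the paperclip goal
```

If the protection terms were zero or negative, the same argmax would skip protection entirely; the “self” never enters as a terminal value.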
agents that have preferences about the state of the world in the distant future
What are these preferences? For biological agents, they are grounded in some mechanism (what you call the Steering System) that evaluates “desirable states” of the world in some more or less directly measurable way, grounded in perception via the senses, and derives a signal of how desirable the state is, which the brain then optimizes for. For ML models the mechanism is somewhat different, but there is likewise an input to the training algorithm that determines how “good” the output is. This signal is called reward, and it drives the system toward outputs that lead to high-reward states. But the path there depends on the specific optimization method, and the algorithm has to navigate such a complex loss landscape that it can get stuck, for a very long time if not forever, in regions of the search space that correspond to imperfect models. These imperfect models can be off in significant ways, which is why it may be useful to say that Reward is not the optimization target.
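As a minimal sketch of that “stuck in the loss landscape” point (my own toy example, not anyone’s actual training setup): hill-climbing on a reward signal settles wherever the local gradient vanishes, which need not be anywhere near the reward-maximizing parameters.

```python
import numpy as np

# Toy, non-convex "reward landscape" over a single parameter theta.
def reward(theta: float) -> float:
    return np.sin(3 * theta) + 0.3 * theta - 0.05 * theta**2

theta = -2.5            # unlucky initialization
lr, eps = 0.01, 1e-4
for _ in range(5000):
    # Follow the reward gradient (finite differences for simplicity).
    grad = (reward(theta + eps) - reward(theta - eps)) / (2 * eps)
    theta += lr * grad

# The run settles at the local peak near theta ~ -1.5 (reward ~ 0.4),
# far below the higher peak near theta ~ 2.6 (reward ~ 1.4).
print(theta, reward(theta))
```

The update rule “sees” only the reward signal, yet the parameters it ends up with are not a reward maximizer; that is the sense in which the trained model and the reward signal can come apart.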
The connection to Intuitive Self-Models is that even though the internal models of an LLM may be very different from human self-models, I think it is still quite plausible that LLMs and other models form models of the self. Such models are instrumentally convergent. Humans talk about the self, and the LLM does things that match these patterns. Maybe the underlying process in humans that gives rise to this is different, but humans learning about it can’t access the actual process either. And in the same way, the approximate model the LLM forms is not maximizing the reward signal and can be quite far from it, as long as it is useful (in the sense of having higher reward than other such models/parameter combinations).
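A compressed way to put that last sentence (again a made-up toy, with invented candidate names and reward numbers): whatever representation the search actually reaches and keeps only has to beat the alternatives it was compared against, not the reward ceiling.

```python
# Made-up candidates the optimizer happens to reach, with made-up average rewards.
reachable = {
    "no self-model":         0.40,
    "crude self-model":      0.65,   # tracks "what I said", "what I can do"
    "richer self-model (A)": 0.58,
}
# Never reached by the search, so it cannot be selected, however good it is:
unreachable_ceiling = {"human-like self-model": 0.90}

kept = max(reachable, key=reachable.get)
print(kept)  # "crude self-model": useful and reward-increasing, far from the ceiling
```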
I think of my toenail as “part of myself”, but I’m happy to clip it.
Sure, the (body of the) self can include parts that can be cut or destroyed without that “causing harm”, instead having an overall positive effect. By analogy, an AI in a compute center would also consider decommissioning failed hardware. And when defining humanity, we do have to be careful about what we mean, since these “parts” could be humans.