There’s a thing I’m personally confused about that seems related to the OP, though not directly addressed by it. Maybe it is sufficiently on topic to raise here.
My personal confusion is this:
Some of my (human) goals are pretty stable across time (e.g. I still like calories, and being a normal human temperature, much as I did when newborn). But a lot of my other “goals” or “wants” form and un-form without any particular “convergent instrumental drives”-style attempts to protect said “goals” from change.
As a bit of an analogy (to how I think I and other humans might approximately act): in a well-functioning idealized economy, an apple pie-making business might form (when it was the case that apple pie would deliver a profit over the inputs of apples plus the labor of those involved plus etc.), and might later fluidly un-form (when it ceased to be profitable), without “make apple pies” or “keep this business afloat” becoming a thing that tries to self-perpetuate in perpetuity. I think a lot of my desires are like this (I care intrinsically about getting outdoors every day while there’s profit in it, but the desire doesn’t try to shield itself from change, and it’ll stop if getting outdoors stops having good results. And this notion of “profit” does not itself seem obviously like a fixed utility function, I think.).
I’m pretty curious about whether the [things kinda like LLMs but with longer planning horizons that we might get as natural extensions of the current paradigm, if the current paradigm extends this way, and/or the AGIs that an AI-accidentally-goes-foom process will summon] will have goals that try to stick around indefinitely, or goals that congeal and later dissolve again into some background process that’ll later summon new goals, without summoning something lasting that is fixed-utility-function-shaped. (It seems to me that idealized economies do not acquire fixed or self-protective goals, and for all I know many AIs might be like economies in this way.)
(I’m not saying this bears on risk in any particular way. Temporary goals would still resist most wrenches while they remained active, much as even an idealized apple pie business resists wrenches while it stays profitable.)
I think the problem here is distinguishing between terminal and instrumental goals? Most people probably don’t run an apple pie business because they have terminal goals about apple pie businesses. They probably want money and status, want to be useful and to provide for their families, and I expect these goals to be very persistent and self-preserving.
Not all such goals have to be instrumental to terminal goals, and in humans the line between instrumental and noninstrumental is not clear. Like, at one extreme the instrumental goal is explicitly created by thinking about what would increase money/status, but at the other the “instrumental” goal is a shard reinforced by a money/status drive, one which would not change as the money/status drive changes.
Also even if the goal of selling apple pies is entirely instrumental, it’s still interesting that the goal can be dissolved once it’s no longer compatible with the terminal goal of e.g. gaining money. This means that not all goals are dangerously self-preserving.
Yes, exactly. Like, we humans mostly have something that kinda feels intrinsic but that also pays rent and updates with experience, like a Go player’s sense of “elegant” go moves. My current (not confident) guess is that these thingies (that humans mostly have) might be a more basic and likely-to-pop-up-in-AI mathematical structure than are fixed utility functions + updatey beliefs, a la Bayes and VNM. I wish I knew a simple math for them.
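(For concreteness, by “fixed utility functions + updatey beliefs, a la Bayes and VNM” I mean something like the standard picture, where only the belief side ever moves:

$$a^* = \arg\max_a \sum_s P(s \mid \text{evidence}) \, U(a, s),$$

with $P$ revised by Bayes’ rule as evidence comes in and $U$ held fixed once and for all. The Go-player-ish thingies seem to update on the $U$-like side too, and I don’t know a comparably clean formalism for that.)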
The simple math is active inference, and the type is almost entirely the same as ‘beliefs’.
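(A minimal sketch, using one standard formulation: prior preferences enter as a distribution $\tilde{p}(o)$ over observations rather than as a separate utility function, and a policy $\pi$ is scored by its expected free energy

$$G(\pi) = \sum_\tau \mathbb{E}_{q(o_\tau, s_\tau \mid \pi)} \big[ \ln q(s_\tau \mid \pi) - \ln \tilde{p}(o_\tau, s_\tau) \big],$$

so the “goal” is carried by $\tilde{p}$, an object with almost the same type as a belief, and it can be revised the way beliefs are.)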
I feel like… no, it is not very interesting, it seems pretty trivial? We (agents) have goals, we have relationships between them, like “priorities”, and we sometimes abandon goals with low priority in favor of goals with higher priorities. We can also have meta-goals like “what should my system of goals look like”, “how to abandon and adopt intermediate goals in a reasonable way”, and “how to do reflection on goals”, and future superintelligent systems will probably have something like that. All of this seems to me to come as a package with the concept of “goal”.
My goals for money, social status, and even how much I care about my family don’t seem all that stable and have changed a bunch over time. They seem to arise from some deeper combination of desires (to be accepted, to have security, to feel good about myself, to avoid effortful work, etc.) interacting with my environment. Yet I wouldn’t think of myself as primarily pursuing those deeper desires, and during various periods I would have self-modified, if given the option, to more aggressively pursue the goals that I (the “I” that was steering things) thought I cared about (like doing really well at a specific skill, which turned out to be a fleeting goal with time).
What about things like fun, happiness, eudaimonia, meaning? I certainly think that, excluding brain damage/very advanced brainwashing, you are not going to eat babies or turn planets into paperclips.
Thanks for replying. The thing I’m wondering about is: maybe it’s sort of like this “all the way down.” Like, maybe the things that are showing up as “terminal” goals in your analysis (money, status, being useful) are themselves composed sort of like the apple pie business, in that they congeal while they’re “profitable” from the perspective of some smaller thingies located in some large “bath” (such as an economy, or a (non-conscious) attempt to minimize predictive error or something so as to secure neural resources, or a thermodynamic flow of sunlight or something). Like, maybe it is this way in humans, and maybe it is or will be this way in an AI. Maybe there won’t be anything that is well described as a “terminal goal.”
I said something like this to a friend, who was like “well, sure, the things that are ‘terminal’ goals for me are often ‘instrumental’ goals for evolution, who cares?” The thing I care about here is: how “fixed” are the goals? Do they resist updating/dissolving when they cease being “profitable” from the perspective of thingies in an underlying substrate, or are they constantly changing as what is profitable changes? Like, imagine a kid who cares about playing “good, fun” videogames, but whose notion of which games count as that updates pretty continually as he gets better at gaming. I’m not sure it makes that much sense to think of this as a “terminal goal” in the same sense that “make a bunch of diamond paperclips according to this fixed specification” is a terminal goal. It might be differently satiable, differently in touch with what’s below it; I’m not really sure why I care, but I think it might matter for what kind of thing organisms/~agent-like-things are.
Imagine someone offers you an extremely high-paying job. Unfortunately, the job involves something you find morally repulsive – say, child trafficking. But the recruiter offers you a pill that will rewrite your brain chemistry so that you’ll no longer find it repulsive. Would you take the pill?
I think that pill would reasonably be categorized as “updating your goals”. If you take it, you can then accept the lucrative job and presumably you’ll be well positioned to satisfy your new/remaining goals, i.e. you’ll be “happy”. But you’d be acting against your pre-pill goal (I am glossing over exactly what that goal is, perhaps “not harming children” although I’m sure there’s more to unpack here).
I pose this example in an attempt to get at the heart of “distinguishing between terminal and instrumental goals” as suggested by quetzal_rainbow. This is also my intuition, that it’s a question of terminal vs. instrumental goals.