This is great & I strongly endorse the program ‘let’s figure out what’s the actual computational anatomy of human values’. (I wrote a post about it a few years ago; it wasn’t that fit, given the sociology of opinions on LessWrong at the time.)
Some specific points where I do disagree:
1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionarily desirable states like reproduction; this likely leads to drives which are present and quite active, such as “seek social status” ⇒ as a consequence, I don’t think the evolutionarily older drives are out of play, or that the landscape is as flat as you assume and dominated by language-model-based values.
2. Overall, there are a lot of evolutionarily older computations running “on the body”; these provide an important source of reward signal for the later layers, and this is true and important even for modern humans. Many other things evolved within this basic landscape.
3. The world model isn’t a value-independent, goal-orthogonal model; the stuff it learned is implicitly goal-oriented, because its learning is steered by the reward model.
4. I’m way less optimistic about “aligning with mostly linguistic values”. Quoting the linked post:
Many alignment proposals seem to focus on interacting just with the conscious, narrating and rationalizing part of the mind. If this is just one part entangled in some complex interaction with other parts, there are specific reasons why this may be problematic.
One: if the “rider” (from the rider/elephant metaphor) is the part highly engaged with tracking societal rules, interactions and memes, it seems plausible that the “values” learned from it will be mostly aligned with societal norms and the interests of memeplexes, and not “fully human”.
This is worrisome: from a meme-centric perspective, humans are just a substrate, and not necessarily the best one. Also, a more speculative problem may be that schemes learning the human memetic landscape and “supercharging” it with superhuman performance could create some hard-to-predict evolutionary optimization processes.
In other words, a large part of what the language-model-based values are could be just what’s memetically fit.
Also, in my impression, these ‘verbal’ values sometimes seem to basically hijack some deeper drive and channel it into meme-replicating efforts. (“So you do care? And have compassion? That’s great: here is a language-based analytical framework which maps your caring onto this set of symbols, and as a consequence, the best way to care is to do effective altruism community building.”)
5. I don’t think that “when asked, many humans want to try to reduce the influence of their ‘instinctual’ and habitual behaviours and instead subordinate more of their behaviours to explicit planning” is much evidence of anything. My guess is that many humans would actually enjoy more of the opposite: being more embodied, spontaneous, instinctive. This is also true of some of the smartest people around.
6. Broadly, I don’t think the conclusion that human values are primarily linguistic concepts, encoded via webs of association and valence in the cortex and learnt through unsupervised (primarily linguistic) learning, is stable upon reflection.
1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionarily desirable states like reproduction; this likely leads to drives which are present and quite active, such as “seek social status” ⇒ as a consequence, I don’t think the evolutionarily older drives are out of play, or that the landscape is as flat as you assume and dominated by language-model-based values.
Yes, I think drives like this are important on two levels. At the first level, we experience them as primary rewards, e.g. social status gives direct dopamine hits. Secondly, they shape the memetic selection environment which creates and evolves linguistic memes of values. However, it’s important to note that almost all of these drives, such as the drive for social status, are mediated through linguistic cortical abstractions; i.e. people will try to gain social status by fulfilling whatever the values of their environment are, which can lead to very different behaviours being exhibited and rewarded in different environments, even though they are powered by the same basic drive.
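As a toy illustration of the “same drive, different channel” point (my own sketch, nothing from the post; the environments, behaviours and status payoffs are all invented), a single primary reward for status, pushed through environment-specific mappings from behaviour to status, is enough to make a simple learner converge on very different behaviours in different environments:

```python
import random
from collections import defaultdict

BEHAVIOURS = ["publish_papers", "accumulate_wealth", "win_fights", "give_to_charity"]

# Hypothetical environments: each maps behaviours to how much status they confer.
ENV_STATUS = {
    "academia":    {"publish_papers": 1.0, "accumulate_wealth": 0.2, "win_fights": 0.0, "give_to_charity": 0.3},
    "street_gang": {"publish_papers": 0.0, "accumulate_wealth": 0.5, "win_fights": 1.0, "give_to_charity": 0.1},
}

def learn_policy(env, steps=5000, lr=0.1, eps=0.1):
    """Simple bandit learner: the primary reward is always status, but which
    behaviour earns it is entirely environment-dependent."""
    q = defaultdict(float)
    for _ in range(steps):
        if random.random() < eps:
            behaviour = random.choice(BEHAVIOURS)
        else:
            behaviour = max(BEHAVIOURS, key=lambda b: q[b])
        reward = ENV_STATUS[env][behaviour]          # same basic drive in both environments
        q[behaviour] += lr * (reward - q[behaviour])
    return max(BEHAVIOURS, key=lambda b: q[b])

for env in ENV_STATUS:
    print(env, "->", learn_policy(env))   # e.g. academia -> publish_papers
```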
3. The world model isn’t a value-independent, goal-orthogonal model; the stuff it learned is implicitly goal-oriented, because its learning is steered by the reward model.
The world model is learnt mostly by unsupervised predictive learning and so is somewhat orthogonal to the specific goal. Of course, in practice, in a continual-learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process.
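Roughly what I mean, as a minimal sketch (my own toy example; the state names and the 10,000-step loop are invented): the predictor’s objective is purely “estimate P(next state | current state)”, but a goal-biased policy decides which states get visited, so the data, and hence the model’s competence, ends up concentrated around the goal.

```python
import random
from collections import defaultdict, Counter

STATES = ["home", "library", "gym", "market"]

def goal_driven_policy(goal="library", stickiness=0.7):
    """The goal only enters here, by biasing where the agent goes / attends."""
    return goal if random.random() < stickiness else random.choice(STATES)

# "World model": a purely unsupervised next-state predictor, trained on whatever
# transitions the goal-driven policy happens to generate.
counts = defaultdict(Counter)
state = "home"
for _ in range(10_000):
    nxt = goal_driven_policy()
    counts[state][nxt] += 1   # predictive objective: count transitions to estimate P(next | current)
    state = nxt

# The objective is goal-agnostic, but the data coverage is not: most of the model's
# experience is of transitions into and out of the goal state.
for s in STATES:
    print(s, "transitions observed from this state:", sum(counts[s].values()))
```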
Also, in my impression, these ‘verbal’ values sometimes seem to basically hijack some deeper drive and channel it into meme-replicating efforts. (“So you do care? And have compassion? That’s great: here is a language-based analytical framework which maps your caring onto this set of symbols, and as a consequence, the best way to care is to do effective altruism community building.”)
This is definitely true for humans, but it is unclear that this is necessarily bad. It is at least somewhat aligned, and this is how any kind of intrinsic motivation towards external goals has to work: the external goal gets supported by, and channels, an intrinsic motivation.
5. I don’t think that “when asked, many humans want to try to reduce the influence of their ‘instinctual’ and habitual behaviours and instead subordinate more of their behaviours to explicit planning” is much evidence of anything. My guess is that many humans would actually enjoy more of the opposite: being more embodied, spontaneous, instinctive. This is also true of some of the smartest people around.
Yeah, in the post I say I am unclear as to whether this is stable under reflection. I see the alignment techniques that would follow from this as really being applicable only to near-term systems, and not to systems undergoing strong RSI.
6. Broadly, I don’t think the conclusion that human values are primarily linguistic concepts, encoded via webs of association and valence in the cortex and learnt through unsupervised (primarily linguistic) learning, is stable upon reflection.
Similarly.
The world model is learnt mostly by unsupervised predictive learning and so is somewhat orthogonal to the specific goal. Of course, in practice, in a continual-learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process.
afaict, a big fraction of evolution’s instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc.
Patterns of behavior (some of which I’d include in my goals) encoded in my model can act in a way that’s somewhere between unconscious and too obvious to question—you might end up doing things not because you have visceral feelings about the different options, but simply because your model is so much better at some of the options that the other options never even get considered.
afaict, a big fraction of evolution’s instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc.
This is true, but I don’t think it is super important for this argument. Evolution definitely encodes inductive biases towards learning about relevant things, which ML architectures lack, but this is primarily to speed up learning and handle limited initial data. Most of the things evolution focuses on, such as faces, are natural abstractions anyway and would be learnt by pure unsupervised learning systems.
Patterns of behavior (some of which I’d include in my goals) encoded in my model can act in a way that’s somewhere between unconscious and too obvious to question—you might end up doing things not because you have visceral feelings about the different options, but simply because your model is so much better at some of the options that the other options never even get considered.
Yes, there are also a number of ways to short-circuit model evaluation entirely. The classic one is having a habit policy which is effectively your action prior. There are also cases where you just follow the default model-free policy and only in cases where you are even more uncertain do you actually deploy the full model-based evaluation capacities that you have.
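A rough sketch of that arbitration idea (my framing, with made-up thresholds and toy stand-ins for the three controllers, not a claim about how this is literally implemented): consult the habit prior first, fall back to cached model-free values, and only deploy the expensive model-based rollout when neither is decisive enough.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs.values())

def choose_action(state, habit_prior, model_free_q, world_model,
                  habit_entropy_max=0.5, q_gap_min=1.0, depth=3):
    # 1. Habit policy: a sharp action prior short-circuits evaluation entirely.
    prior = habit_prior(state)
    if entropy(prior) < habit_entropy_max:
        return max(prior, key=prior.get), "habit"

    # 2. Cached model-free values: act on them if one action clearly wins.
    q = model_free_q(state)
    best, runner_up = sorted(q.values(), reverse=True)[:2]
    if best - runner_up > q_gap_min:
        return max(q, key=q.get), "model-free"

    # 3. Only under remaining uncertainty, deploy full model-based evaluation
    #    (here a tiny fixed-depth rollout search through the world model).
    def rollout_value(s, a, d):
        if d == 0:
            return 0.0
        s2, r = world_model(s, a)
        return r + max(rollout_value(s2, a2, d - 1) for a2 in q)

    scores = {a: rollout_value(state, a, depth) for a in q}
    return max(scores, key=scores.get), "model-based"

# Hypothetical stand-ins so the sketch runs end to end:
habit_prior = lambda s: {"left": 0.5, "right": 0.5}            # no strong habit
model_free_q = lambda s: {"left": 0.10, "right": 0.15}         # nearly tied cached values
world_model = lambda s, a: (s, 1.0 if a == "right" else 0.0)   # toy deterministic dynamics
print(choose_action("some_state", habit_prior, model_free_q, world_model))
```

The thresholds (habit_entropy_max, q_gap_min) are where the “even more uncertain” gating lives; in this toy example the habit is flat and the cached values are nearly tied, so the full model-based evaluation is what ends up choosing.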